Walk-through - Virtualizing NetCDFs from Amazon’s Open Data Program
Let’s start with the first example from the VirtualiZarr homepage.
This example uses the NASA Earth Exchange Global Daily Downscaled Projections (NEX-GDDP-CMIP6) from the Registry of Open Data on AWS. The virtualization process will be much faster if run proximal to the data in AWS’s us-west-2 region.
Creating the virtual dataset looks quite similar to how we normally open data with xarray, but there are a few notable differences, which this example walks through.
First, import the necessary functions and classes:

import icechunk
import obstore
import xarray as xr
from virtualizarr import open_virtual_dataset, open_virtual_mfdataset
from virtualizarr.parsers import HDFParser
from virtualizarr.registry import ObjectStoreRegistry
Zarr can emit a lot of warnings about Numcodecs codecs not yet being included in the Zarr version 3 specification, so let’s suppress those.
import warnings

warnings.filterwarnings(
    "ignore",
    message="Numcodecs codecs are not in the Zarr version 3 specification*",
    category=UserWarning,
)
We can use Obstore’s obstore.store.from_url convenience method to create an ObjectStore that can fetch data from the specified URLs.
= "s3://nex-gddp-cmip6"
bucket = "NEX-GDDP-CMIP6/ACCESS-CM2/ssp126/r1i1p1f1/tasmax/tasmax_day_ACCESS-CM2_ssp126_r1i1p1f1_gn_2015_v2.0.nc"
path = obstore.store.from_url(bucket, region="us-west-2", skip_signature=True) store
We also need to create an ObjectStoreRegistry that maps the URL structure to the ObjectStore.
registry = ObjectStoreRegistry({bucket: store})
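Since the registry is just a mapping from URL prefixes to stores, data spread across several buckets can share one registry by registering one store per bucket. A minimal sketch, using a hypothetical second bucket name:

other_bucket = "s3://my-other-bucket"  # hypothetical second bucket
other_store = obstore.store.from_url(other_bucket, region="us-west-2", skip_signature=True)

# One registry can route URLs for both buckets to the appropriate store
multi_registry = ObjectStoreRegistry({bucket: store, other_bucket: other_store})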
Now, let’s create a parser instance and build a virtual dataset by passing the URL, parser, and registry to virtualizarr.open_virtual_dataset.
parser = HDFParser()
vds = open_virtual_dataset(
    url=f"{bucket}/{path}",
    parser=parser,
    registry=registry,
    loadable_variables=[],
)
print(vds)
<xarray.Dataset> Size: 1GB
Dimensions: (time: 365, lat: 600, lon: 1440)
Coordinates:
time (time) float64 3kB ManifestArray<shape=(365,), dtype=float64, ch...
lat (lat) float64 5kB ManifestArray<shape=(600,), dtype=float64, chu...
lon (lon) float64 12kB ManifestArray<shape=(1440,), dtype=float64, c...
Data variables:
tasmax (time, lat, lon) float32 1GB ManifestArray<shape=(365, 600, 1440...
Attributes: (12/22)
cmip6_source_id: ACCESS-CM2
cmip6_institution_id: CSIRO-ARCCSS
cmip6_license: CC-BY-SA 4.0
activity: NEX-GDDP-CMIP6
Conventions: CF-1.7
frequency: day
... ...
doi: https://doi.org/10.7917/OFSG3345
external_variables: areacella
contact: Dr. Bridget Thrasher: bridget@climateanalyticsgrou...
creation_date: Sat Nov 16 13:31:18 PST 2024
disclaimer: These data are considered provisional and subject ...
tracking_id: d4b2123b-abf9-4c3c-a780-58df6ce4e67f
Since we specified loadable_variables=[], no data has been loaded or copied in this process. We have merely created an in-memory lookup table that points to the location of chunks in the original netCDF, to be used when data is needed later on. The default behavior (loadable_variables=None) will load data associated with coordinates but not data variables. The reported size is that of the original dataset; you can see the size of the virtual dataset itself using the vz accessor:
print(f"Original dataset size: {vds.nbytes} bytes")
print(f"Virtual dataset size: {vds.vz.nbytes} bytes")
Original dataset size: 1261459240 bytes
Virtual dataset size: 11776 bytes
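For comparison, here is what the default behavior looks like. Omitting loadable_variables (equivalent to loadable_variables=None) loads the coordinate variables into memory while leaving tasmax as a virtual ManifestArray:

vds_default = open_virtual_dataset(
    url=f"{bucket}/{path}",
    parser=parser,
    registry=registry,
)  # loadable_variables defaults to None: coordinates are loaded, data variables stay virtual
print(vds_default)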
VirtualiZarr’s other top-level function is virtualizarr.open_virtual_mfdataset, which can open and virtualize multiple data sources into a single virtual dataset, similar to how xarray.open_mfdataset opens multiple data files as a single dataset.
urls = [
    f"s3://nex-gddp-cmip6/NEX-GDDP-CMIP6/ACCESS-CM2/ssp126/r1i1p1f1/tasmax/tasmax_day_ACCESS-CM2_ssp126_r1i1p1f1_gn_{year}_v2.0.nc"
    for year in range(2015, 2017)
]
vds = open_virtual_mfdataset(urls, parser=parser, registry=registry)
print(vds)
<xarray.Dataset> Size: 3GB
Dimensions: (time: 731, lat: 600, lon: 1440)
Coordinates:
* time (time) datetime64[ns] 6kB 2015-01-01T12:00:00 ... 2016-12-31T12:...
* lat (lat) float64 5kB -59.88 -59.62 -59.38 -59.12 ... 89.38 89.62 89.88
* lon (lon) float64 12kB 0.125 0.375 0.625 0.875 ... 359.4 359.6 359.9
Data variables:
tasmax (time, lat, lon) float32 3GB ManifestArray<shape=(731, 600, 1440...
Attributes: (12/22)
cmip6_source_id: ACCESS-CM2
cmip6_institution_id: CSIRO-ARCCSS
cmip6_license: CC-BY-SA 4.0
activity: NEX-GDDP-CMIP6
Conventions: CF-1.7
frequency: day
... ...
doi: https://doi.org/10.7917/OFSG3345
external_variables: areacella
contact: Dr. Bridget Thrasher: bridget@climateanalyticsgrou...
creation_date: Sat Nov 16 13:31:18 PST 2024
disclaimer: These data are considered provisional and subject ...
tracking_id: d4b2123b-abf9-4c3c-a780-58df6ce4e67f
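Conceptually, this is similar to opening each file with open_virtual_dataset and concatenating the results along time with xarray. A rough sketch, assuming xr.concat accepts virtual datasets with these combine options:

virtual_datasets = [
    open_virtual_dataset(url=url, parser=parser, registry=registry) for url in urls
]
# compat="override" and coords="minimal" skip equality checks that would require reading chunk data
combined_vds = xr.concat(
    virtual_datasets,
    dim="time",
    coords="minimal",
    compat="override",
    combine_attrs="override",
)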
The magic of VirtualiZarr is that you can persist the virtual dataset to disk in a chunk reference format such as Icechunk, meaning that the work of constructing the single coherent dataset only needs to happen once. For subsequent data access, you can use xarray.open_zarr to open that Icechunk store, which on object storage is far faster than using xarray.open_mfdataset to open the original non-cloud-optimized files.
Let’s persist the virtual dataset using Icechunk. Here we store the dataset in an in-memory store, but in most cases you’ll store the virtual dataset in the cloud.
icechunk_store = icechunk.in_memory_storage()
repo = icechunk.Repository.create(icechunk_store)
session = repo.writable_session("main")

vds.vz.to_icechunk(session.store)
session.commit("Create virtual store")
'NMNW069WM3CYHXFPW4F0'
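The in-memory store above is purely for demonstration. For real use you would back the repository with durable storage, for example local disk or S3. A hedged sketch, assuming Icechunk’s local_filesystem_storage and s3_storage helpers and a hypothetical bucket and prefix (parameter names may differ slightly between Icechunk versions):

# Hypothetical durable alternatives to the in-memory store used above
local_storage = icechunk.local_filesystem_storage("/tmp/nex-gddp-cmip6-virtual")
cloud_storage = icechunk.s3_storage(
    bucket="my-bucket",              # hypothetical bucket you can write to
    prefix="nex-gddp-cmip6-virtual",
    region="us-west-2",
    from_env=True,                   # assumes AWS credentials are available in the environment
)

# Either storage object can be passed to icechunk.Repository.create() in place of
# the in-memory storage.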
Lastly, let’s show what it looks like to re-use this serialized virtual store.
session = repo.readonly_session("main")
xr.open_zarr(session.store, zarr_format=3, consolidated=False)
<xarray.Dataset> Size: 3GB Dimensions: (lat: 600, lon: 1440, time: 731) Coordinates: * lat (lat) float64 5kB -59.88 -59.62 -59.38 -59.12 ... 89.38 89.62 89.88 * lon (lon) float64 12kB 0.125 0.375 0.625 0.875 ... 359.4 359.6 359.9 * time (time) datetime64[ns] 6kB 2015-01-01T12:00:00 ... 2016-12-31T12:... Data variables: tasmax (time, lat, lon) float32 3GB ... Attributes: (12/22) cmip6_source_id: ACCESS-CM2 cmip6_institution_id: CSIRO-ARCCSS cmip6_license: CC-BY-SA 4.0 activity: NEX-GDDP-CMIP6 Conventions: CF-1.7 frequency: day ... ... doi: https://doi.org/10.7917/OFSG3345 external_variables: areacella contact: Dr. Bridget Thrasher: bridget@climateanalyticsgrou... creation_date: Sat Nov 16 13:31:18 PST 2024 disclaimer: These data are considered provisional and subject ... tracking_id: d4b2123b-abf9-4c3c-a780-58df6ce4e67f