Query NISAR GUNW via Icechunk¶
Open the virtual Icechunk store created in the previous notebook and extract a spatial subset. Only the chunks overlapping the subset are fetched from NASA’s S3 storage.
import warnings
warnings.filterwarnings("ignore", message="Numcodecs codecs are not in the Zarr")
warnings.filterwarnings(
"ignore", category=UserWarning, message=".*does not have a Zarr V3 specification.*"
)
warnings.filterwarnings("ignore", category=FutureWarning, module="earthaccess")
Open the Icechunk store¶
import time
import earthaccess
import icechunk
import xarray as xr
import zarr
# Allow many concurrent chunk fetches so virtual chunk reads overlap
zarr.config.set({"async.concurrency": 128})
earthaccess.login()
# Search for the same granule to get credentials and bucket
results = earthaccess.search_data(
short_name="NISAR_L2_GUNW_BETA_V1",
point=(174.1192, -39.3379),
temporal=("2026-01-01", "2026-01-04"),
)
s3_url = [
link for link in results[0].data_links(access="direct") if link.endswith(".h5")
][0]
bucket = "/".join(s3_url.split("/")[:3])
storage = icechunk.local_filesystem_storage("nisar-icechunk")
config = icechunk.RepositoryConfig.default()
config.set_virtual_chunk_container(
icechunk.VirtualChunkContainer(
bucket + "/",
icechunk.s3_store(region="us-west-2"),
),
)
s3_creds = earthaccess.get_s3_credentials(results=results)
virtual_credentials = icechunk.containers_credentials(
{
bucket + "/": icechunk.s3_credentials(
access_key_id=s3_creds["accessKeyId"],
secret_access_key=s3_creds["secretAccessKey"],
session_token=s3_creds["sessionToken"],
)
}
)
repo = icechunk.Repository.open(
storage=storage,
config=config,
authorize_virtual_chunk_access=virtual_credentials,
)
session = repo.readonly_session("main")
t0 = time.perf_counter()
ds = xr.open_zarr(session.store, consolidated=False, zarr_format=3)
print(f"Opened in {time.perf_counter() - t0:.2f}s")
ds
Extract a spatial subset¶
Instead of downloading the full file, we select a 500x500 pixel region from the center of the scene. Icechunk fetches only the chunks that overlap this window.
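As a rough mental model, the set of chunks a pixel window touches can be computed from the chunk shape. The 512x512 chunk shape and window offsets below are assumptions for illustration; check the variable's actual encoding (e.g. `ds.unwrappedPhase.encoding`) for the real layout.

```python
# Sketch: which chunk indices does a pixel window touch?
# A chunk size of 512 is a hypothetical value, not the granule's real layout.
def chunks_touched(start, stop, chunk):
    """Chunk indices overlapping the half-open pixel range [start, stop)."""
    return range(start // chunk, (stop - 1) // chunk + 1)

chunk = 512
y0, y1 = 4000, 4500  # hypothetical 500-row window
x0, x1 = 6000, 6500  # hypothetical 500-column window

# Only this many chunks need to be fetched from S3 for the window
n_chunks = len(chunks_touched(y0, y1, chunk)) * len(chunks_touched(x0, x1, chunk))
print(n_chunks)  # → 4
```

Everything outside those chunks stays untouched on S3, which is why the subset load below is fast.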
import numpy as np
ny, nx = ds.sizes["yCoordinates"], ds.sizes["xCoordinates"]
# 500x500 pixel window from the center of the scene
cy, cx = ny // 2, nx // 2
subset_slice = dict(
yCoordinates=slice(cy - 250, cy + 250), xCoordinates=slice(cx - 250, cx + 250)
)
t0 = time.perf_counter()
subset = ds.isel(**subset_slice).load()
elapsed = time.perf_counter() - t0
print(f"Loaded 500x500 subset in {elapsed:.2f}s (from {ny}x{nx} full scene)")
subset
Visualize the subset¶
import pygmt
fig = pygmt.Figure()
pygmt.makecpt(
cmap="SCM/romaO", series=[0, 2 * np.pi, np.pi / 2], cyclic=True, continuous=True
)
fig.grdimage(grid=subset.unwrappedPhase, cmap=True, frame=True)
fig.colorbar()
fig.show()
Visualize coherence magnitude¶
fig = pygmt.Figure()
fig.grdimage(grid=subset.coherenceMagnitude, cmap="gray", frame=True)
fig.colorbar()
fig.show()
How does this compare?¶
The h5netcdf baseline notebook loads the same 500x500 subset from the same granule. With h5netcdf, the HDF5 chunking layout means reading a spatial subset may pull far more data than needed. With Icechunk, only the chunks overlapping the subset are fetched from S3.
Run the h5netcdf baseline notebook and compare the “Loaded 500x500 subset in ...” times.
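As a back-of-the-envelope illustration of why the chunk layout matters, here is a sketch of the read amplification for a 500x500 float32 window. The scene size and both chunk shapes are hypothetical, chosen only to show the effect, not taken from the actual granule:

```python
# Sketch: bytes read for a 500x500 float32 window under two chunk layouts.
# All sizes below are hypothetical, for illustration only.
bytes_per_px = 4
ny, nx = 10_000, 12_000  # hypothetical full-scene dimensions

# Case 1: HDF5 chunks spanning full rows, e.g. shape (128, nx).
# A misaligned 500-row window can touch up to 5 such chunks,
# and each chunk covers every column in the scene.
rows_chunk = 128
row_chunks_read = (500 + rows_chunk - 1) // rows_chunk + 1
hdf5_bytes = row_chunks_read * rows_chunk * nx * bytes_per_px

# Case 2: square 512x512 chunks; a misaligned window touches at most 2x2 chunks.
ic_bytes = 4 * 512 * 512 * bytes_per_px

print(f"row-spanning chunks: {hdf5_bytes / 1e6:.1f} MB")
print(f"square chunks:       {ic_bytes / 1e6:.1f} MB")
```

With these assumed shapes, the row-spanning layout reads roughly 30 MB to satisfy a ~1 MB request, while square chunks keep the overhead to a few megabytes.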