Exploring the WMA STAC Catalog

The USGS Water Mission Area (WMA) STAC Catalog provides a standardized interface for discovering and accessing geospatial datasets managed by the WMA. It organizes water-related data products — including gridded climate datasets, hydrologic model outputs, and remote sensing products — as browsable, machine-readable collections.

This notebook demonstrates two approaches for finding and opening datasets from the catalog:

Option 1 (UI-Based Discovery): Browse the catalog in the pygeoapi or Radiant Earth STAC Browser web interface, then copy the dataset's access metadata into Python and open it with xarray.
Option 2 (Programmatic Exploration): Use the PySTAC library and a set of helper functions to navigate the catalog hierarchy in code, retrieve dataset metadata, and open the data with xarray.

Both approaches end with opening a dataset as an xarray.Dataset, ready for analysis.

Catalog URL: https://api.water.usgs.gov/gdp/pygeoapi/stac/stac-collection/

Prerequisites

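Opening the s3:// URLs used below also requires s3fs (the fsspec backend for S3), and the chunked arrays shown in the outputs require dask. The following Python packages are imported directly by the code in this notebook: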

pystac — for connecting to and navigating the STAC catalog
xarray — for opening and working with datasets
zarr — backend storage format for the catalog's datasets
packaging — for zarr version detection and compatibility handling
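
Before running the import cell, it can help to confirm that everything is installed. The sketch below uses only the standard library. The helper name `missing_packages` is hypothetical, and `s3fs` and `dask` are included as assumptions: fsspec needs s3fs to resolve `s3://` URLs, and `chunks={}` produces dask-backed arrays.

```python
from importlib.util import find_spec

# Packages imported by this notebook, plus two assumed runtime
# dependencies (s3fs for s3:// URLs, dask for chunked arrays).
REQUIRED = ["pystac", "xarray", "zarr", "packaging", "s3fs", "dask"]


def missing_packages(names: list[str]) -> list[str]:
    """Return the names that cannot be imported in this environment."""
    return [name for name in names if find_spec(name) is None]


missing = missing_packages(REQUIRED)
if missing:
    print("Install before continuing:", ", ".join(missing))
else:
    print("All required packages are available.")
```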

In [1]:

import pystac
import xarray as xr
import zarr
from packaging.version import Version
from typing import Any

Option 1: Manual UI Discovery

The WMA STAC Catalog can be browsed directly in a web browser. Follow these steps to find a dataset and its access metadata:

1. Browse the catalog using one of these interfaces:
   - pygeoapi STAC interface: the native catalog UI, with value-added metadata fields surfaced to improve data access
   - Radiant Earth STAC Browser
2. Navigate to a collection. The top level of the catalog contains a mix of:
   - Datasets: data access endpoints you can open directly with xarray
   - Sub-collections: groupings of data that contain child collections; drill into these to find the datasets within
3. Find a dataset's assets. Once you reach a dataset collection, look at its assets. Each asset represents a data access endpoint (typically a zarr store on S3-compatible storage).
4. Copy the access metadata. From an asset, note these three fields:
   - href: the URL to the zarr store
   - xarray:storage_options: connection parameters (endpoint, authentication)
   - xarray:open_kwargs: keyword arguments for xarray.open_dataset (engine, chunks, etc.)

Note on JSON boolean capitalization: The STAC catalog returns JSON, which uses lowercase true and false. When copying these values into Python, you must capitalize them to True and False. For example, "anon": true in JSON becomes "anon": True in Python.
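
As an alternative to editing the booleans by hand, the copied text can be parsed with the standard-library `json` module, which converts `true` and `false` automatically. This is a minimal sketch; the variable names are illustrative, and the JSON text reuses the storage options shown in this notebook.

```python
import json

# Text pasted verbatim from the STAC Browser (JSON syntax, lowercase true).
copied_json = """
{
    "anon": true,
    "client_kwargs": {"endpoint_url": "https://usgs.osn.mghpcc.org/"}
}
"""

# json.loads returns ordinary Python objects: dict, str, bool, and so on.
storage_options_from_json = json.loads(copied_json)
print(storage_options_from_json["anon"])  # True
```

The resulting dict can be passed wherever a hand-written `storage_options` dict would be used.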

In [2]:

# === Option 1: Manual UI Discovery ===
# Replace these values with what you find in the STAC Browser UI.
# Note: JSON uses lowercase true/false — capitalize to True/False for Python.

zarr_url = "s3://mdmf/gdp/gridMET.zarr/"

open_kwargs = {
    "chunks": {},
    "consolidated": True,  # JSON: "consolidated": true → Python: True
    "engine": "zarr",
}

storage_options = {
    "anon": True,  # JSON: "anon": true → Python: True
    "client_kwargs": {
        "endpoint_url": "https://usgs.osn.mghpcc.org/"
    },
}

In [3]:

# Zarr v3 defaults to format 3, but WMA catalog data is stored in zarr format 2.
# This helper detects the installed zarr version and passes zarr_format=2 when needed.


def open_zarr_dataset(
    url: str,
    storage_options: dict,
    open_kwargs: dict,
) -> xr.Dataset:
    """Open a zarr dataset with automatic zarr v2/v3 compatibility handling.

    Detects the installed zarr package version and passes `zarr_format=2`
    when zarr >= 3.0.0, since zarr v3 defaults to format 3 but the WMA
    catalog stores data in zarr format 2.

    Args:
        url: S3 or HTTP URL to the zarr store.
        storage_options: Dict of fsspec storage options (e.g., anon, endpoint_url).
        open_kwargs: Dict of keyword arguments for xarray.open_dataset
                     (e.g., engine, chunks, consolidated).

    Returns:
        An xarray.Dataset opened from the zarr store.
    """
    kwargs = {"storage_options": storage_options, **open_kwargs}
    if Version(zarr.__version__) >= Version("3.0.0"):
        kwargs["zarr_format"] = 2
    return xr.open_dataset(url, **kwargs)


ds = open_zarr_dataset(zarr_url, storage_options, open_kwargs)
ds

Out[3]:

<xarray.Dataset> Size: 823GB
Dimensions:                                    (lat: 585, lon: 1386, time: 15861)
Coordinates:
  * lat                                        (lat) float64 5kB 49.4 ... 25.07
  * lon                                        (lon) float64 11kB -124.8 ... ...
  * time                                       (time) datetime64[ns] 127kB 19...
Data variables:
    crs                                        int64 8B ...
    max_air_temperature                        (time, lat, lon) float64 103GB dask.array<chunksize=(2190, 150, 150), meta=np.ndarray>
    max_relative_humidity                      (time, lat, lon) float64 103GB dask.array<chunksize=(2190, 150, 150), meta=np.ndarray>
    min_air_temperature                        (time, lat, lon) float64 103GB dask.array<chunksize=(2190, 150, 150), meta=np.ndarray>
    min_relative_humidity                      (time, lat, lon) float64 103GB dask.array<chunksize=(2190, 150, 150), meta=np.ndarray>
    precipitation_amount                       (time, lat, lon) float64 103GB dask.array<chunksize=(2190, 150, 150), meta=np.ndarray>
    specific_humidity                          (time, lat, lon) float64 103GB dask.array<chunksize=(2190, 150, 150), meta=np.ndarray>
    surface_downwelling_shortwave_flux_in_air  (time, lat, lon) float64 103GB dask.array<chunksize=(2190, 150, 150), meta=np.ndarray>
    wind_speed                                 (time, lat, lon) float64 103GB dask.array<chunksize=(2190, 150, 150), meta=np.ndarray>
Attributes: (12/42)
    Conventions:                CF-1.0
    Metadata_Conventions:       Unidata Dataset Discovery v1.0
    acknowledgement:            Whenever you publish research based on data f...
    author:                     John Abatzoglou - University of Idaho, jabatz...
    cdm_data_type:              Grid
    contributors:               Dr. John Abatzoglou
    ...                         ...
    publisher_name:             Center for Integrated Data Analytics
    publisher_url:              https://www.cida.usgs.gov/
    summary:                    This archive contains daily surface meteorolo...
    time_coverage_resolution:   P1D
    time_coverage_start:        1979-01-01T00:00
    title:                      Daily Meteorological data for continental US
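
To see what the helper hands to `xarray.open_dataset`, the sketch below rebuilds the keyword arguments as a pure function without opening anything. `build_open_kwargs` is a hypothetical name, and the version check is a simplification: it looks only at the major version number parsed from the string instead of using `packaging.version.Version`, so a pre-release such as 3.0.0rc1 counts as version 3 here.

```python
def build_open_kwargs(
    zarr_version: str,
    storage_options: dict,
    open_kwargs: dict,
) -> dict:
    """Merge catalog metadata into keyword arguments for xarray.open_dataset."""
    kwargs = {"storage_options": storage_options, **open_kwargs}
    # Simplified check: look only at the major version number.
    if int(zarr_version.split(".")[0]) >= 3:
        kwargs["zarr_format"] = 2
    return kwargs


opts = {"anon": True}
meta = {"chunks": {}, "consolidated": True, "engine": "zarr"}

print(build_open_kwargs("2.18.3", opts, meta))  # no zarr_format key
print(build_open_kwargs("3.0.8", opts, meta))   # includes zarr_format=2
```

Because `open_kwargs` is unpacked last, a `storage_options` key inside it would override the explicit argument.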

Option 2: Programmatic Exploration

This approach uses the PySTAC library and a set of helper functions to navigate the catalog hierarchy directly in Python. Rather than browsing a web UI, you interact with the catalog programmatically — listing collections, drilling into datasets, and retrieving access metadata — all in code.

This is especially useful for reproducible workflows and scripted data access, where you want your notebook to document exactly how a dataset was discovered and opened.

The cells below define helper functions followed by a step-by-step walkthrough demonstrating how to connect to the catalog, browse collections, and open a dataset with xarray.

In [4]:

def connect_catalog(catalog_url: str) -> pystac.Catalog:
    """Connect to a STAC catalog and return the catalog object.

    Args:
        catalog_url: URL to the STAC catalog's root JSON document.

    Returns:
        A pystac.Catalog instance connected to the remote catalog.
    """
    return pystac.Catalog.from_file(catalog_url)


def list_collections(catalog: pystac.Catalog) -> list[str]:
    """List all child collections in the catalog.

    Prints each collection's ID and its cite-as link (if available).

    Args:
        catalog: A pystac.Catalog instance.

    Returns:
        A list of collection ID strings.
    """
    collection_ids = []
    for child in catalog.get_children():
        collection_ids.append(child.id)
        cite_link = None
        for link in child.links:
            if link.rel == "cite-as":
                cite_link = link.target
                break
        cite_str = cite_link if cite_link else "Not available"
        print(f"  {child.id} \u2014 cite-as: {cite_str}")
    return collection_ids


def select_collection(catalog: pystac.Catalog, collection_id: str):
    """Navigate into a specified collection.

    If the collection contains assets, prints them (ID, title, description).
    If it contains child collections, prints those instead.

    Args:
        catalog: A pystac.Catalog instance.
        collection_id: The ID of the collection to select.

    Returns:
        The selected pystac collection object.

    Raises:
        ValueError: If collection_id is not found among the catalog's children.
    """
    available_ids = []
    target = None
    for child in catalog.get_children():
        available_ids.append(child.id)
        if child.id == collection_id:
            target = child

    if target is None:
        raise ValueError(
            f"Collection '{collection_id}' not found. "
            f"Available collections: {available_ids}"
        )

    if target.assets:
        print(f"Collection '{collection_id}' assets:")
        for asset_key, asset in target.assets.items():
            title = asset.title or "No title"
            description = asset.description or "No description"
            print(f"  {asset_key}: {title} \u2014 {description}")
    else:
        print(f"Collection '{collection_id}' contains sub-collections:")
        for sub_child in target.get_children():
            print(f"  {sub_child.id}")

    return target


def get_asset_info(collection, asset_id: str) -> dict[str, Any]:
    """Retrieve xarray-compatible metadata from a collection's asset.

    Args:
        collection: A pystac collection object (returned by select_collection).
        asset_id: The ID (key) of the asset to retrieve metadata from.

    Returns:
        A dict with keys: url, storage_options, open_kwargs.

    Raises:
        KeyError: If asset_id doesn't exist or xarray metadata fields are missing.
    """
    assets = collection.assets
    if asset_id not in assets:
        raise KeyError(
            f"Asset '{asset_id}' not found. Available: {list(assets.keys())}"
        )

    asset = assets[asset_id]
    extra = asset.extra_fields

    if "xarray:storage_options" not in extra:
        raise KeyError(
            f"Asset '{asset_id}' is missing 'xarray:storage_options' field."
        )
    if "xarray:open_kwargs" not in extra:
        raise KeyError(
            f"Asset '{asset_id}' is missing 'xarray:open_kwargs' field."
        )

    return {
        "url": asset.href,
        "storage_options": extra["xarray:storage_options"],
        "open_kwargs": extra["xarray:open_kwargs"],
    }


def open_zarr_dataset(
    url: str,
    storage_options: dict,
    open_kwargs: dict,
) -> xr.Dataset:
    """Open a zarr dataset with automatic zarr v2/v3 compatibility handling.

    Detects the installed zarr package version and passes `zarr_format=2`
    when zarr >= 3.0.0, since zarr v3 defaults to format 3 but the WMA
    catalog stores data in zarr format 2.

    Args:
        url: S3 or HTTP URL to the zarr store.
        storage_options: Dict of fsspec storage options (e.g., anon, endpoint_url).
        open_kwargs: Dict of keyword arguments for xarray.open_dataset
                     (e.g., engine, chunks, consolidated).

    Returns:
        An xarray.Dataset opened from the zarr store.
    """
    kwargs = {"storage_options": storage_options, **open_kwargs}
    if Version(zarr.__version__) >= Version("3.0.0"):
        kwargs["zarr_format"] = 2
    return xr.open_dataset(url, **kwargs)
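
`select_collection` inspects one level of the hierarchy per call. For collections that nest sub-collections, a recursive walk could list every dataset path in one pass. The sketch below is duck-typed: it assumes only that each node has an `id`, an optional `assets` mapping, and a `get_children()` method, which is how the helpers above already use pystac objects. `find_datasets` and the `Node` stand-in are hypothetical and exist only so the example runs offline.

```python
from dataclasses import dataclass, field
from typing import Iterator


@dataclass
class Node:
    """Offline stand-in for a pystac catalog or collection (hypothetical)."""
    id: str
    assets: dict = field(default_factory=dict)
    children: list = field(default_factory=list)

    def get_children(self):
        return iter(self.children)


def find_datasets(node, path: tuple = ()) -> Iterator[tuple]:
    """Yield the ID path of every descendant that carries assets."""
    for child in node.get_children():
        child_path = path + (child.id,)
        if getattr(child, "assets", None):
            yield child_path
        else:
            yield from find_datasets(child, child_path)


# Made-up hierarchy: one dataset at the top level, one nested group.
root = Node("root", children=[
    Node("dataset_a", assets={"zarr-s3-osn": object()}),
    Node("group_b", children=[
        Node("dataset_b1", assets={"zarr-s3-osn": object()}),
    ]),
])

print(list(find_datasets(root)))
# [('dataset_a',), ('group_b', 'dataset_b1')]
```

Against a live catalog, each `get_children()` call may fetch remote JSON, so a full walk can take a while.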

Step-by-Step Walkthrough

The following cells demonstrate the complete workflow for programmatic catalog exploration:

1. Connect to the STAC catalog
2. List available collections
3. Select a collection to explore
4. Retrieve asset metadata for a dataset
5. Open the dataset with xarray

Note: The examples below use the gridMET collection and zarr-s3-osn asset. Replace these IDs with your own choices based on the output of list_collections() and select_collection(). Each function takes its inputs explicitly, so you can copy individual cells into other scripts without needing the full notebook.

In [5]:

# Connect to the WMA STAC Catalog
catalog_url = "https://api.water.usgs.gov/gdp/pygeoapi/stac/stac-collection/"
catalog = connect_catalog(catalog_url)
list_collections(catalog)

  AIEM_permafrost — cite-as: Not available

  CA-BCM-2014 — cite-as: https://doi.org/10.21429/dye8-h568

  FLET — cite-as: Not available

  GMO — cite-as: https://doi.org/10.21429/v7ys-6n72

  GMO_New — cite-as: https://doi.org/10.21429/c6s4-ve70

  LOCA2 — cite-as: Not available

  PRISM_v2 — cite-as: https://prism.oregonstate.edu/

  PuertoRico — cite-as: Not available

  RedRiver — cite-as: Not available

  SPEI — cite-as: Not available

  TTU_2019 — cite-as: Not available

  TopoWx2017 — cite-as: Not available

  WUS_HSP — cite-as: Not available

  alaska_et_2020 — cite-as: Not available

  bcca — cite-as: Not available

  bcsd_mon_vic — cite-as: http://gdo-dcp.ucllnl.org/downscaled_cmip_projections/

  bcsd_obs — cite-as: http://gdo-dcp.ucllnl.org/downscaled_cmip_projections/

  cmip5_bcsd — cite-as: https://doi.org/10.21429/sbxv-1n90

  conus404-biasadjusted — cite-as: Not available

  conus404-pgw — cite-as: Not available

  conus404 — cite-as: Not available

  cooper — cite-as: Not available

  cprep — cite-as: Not available

  dcp_compressed — cite-as: https://doi.org/https%3A//doi.org/10.21429/j9f1-b218

  gridMET — cite-as: https://www.climatologylab.org/gridmet.html

  hawaii_2018 — cite-as: Not available

  iclus — cite-as: Not available

  loca — cite-as: Not available

  maca-vic — cite-as: Not available

  macav2 — cite-as: Not available

  maurer — cite-as: https://doi.org/10.21429/m7y0-xy02

  mows — cite-as: Not available

  nc-casc-snow — cite-as: Not available

  nlcd — cite-as: Not available

  notaro_2018 — cite-as: Not available

  openet — cite-as: Not available

  pacis — cite-as: Not available

  puerto_rico — cite-as: Not available

  red_river_2018 — cite-as: https://doi.org/10.21429/em59-hn43

  serap — cite-as: Not available

  slr2d — cite-as: https://doi.org/10.21429/66gt-dm26

  snodas — cite-as: https://doi.org/10.7265/N5TB14TC

  ssebopeta — cite-as: Not available

  stageiv_combined — cite-as: https://pubs.usgs.gov/fs/2013/3035/

  wicci — cite-as: https://doi.org/10.21429/dtp5-z505

Out[5]:

['AIEM_permafrost',
 'CA-BCM-2014',
 'FLET',
 'GMO',
 'GMO_New',
 'LOCA2',
 'PRISM_v2',
 'PuertoRico',
 'RedRiver',
 'SPEI',
 'TTU_2019',
 'TopoWx2017',
 'WUS_HSP',
 'alaska_et_2020',
 'bcca',
 'bcsd_mon_vic',
 'bcsd_obs',
 'cmip5_bcsd',
 'conus404-biasadjusted',
 'conus404-pgw',
 'conus404',
 'cooper',
 'cprep',
 'dcp_compressed',
 'gridMET',
 'hawaii_2018',
 'iclus',
 'loca',
 'maca-vic',
 'macav2',
 'maurer',
 'mows',
 'nc-casc-snow',
 'nlcd',
 'notaro_2018',
 'openet',
 'pacis',
 'puerto_rico',
 'red_river_2018',
 'serap',
 'slr2d',
 'snodas',
 'ssebopeta',
 'stageiv_combined',
 'wicci']
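
With this many collections, a case-insensitive substring filter over the returned IDs can be quicker than scanning the list by eye. A minimal sketch, where `find_collections` is a hypothetical helper and the sample IDs are copied from the output above:

```python
def find_collections(collection_ids: list[str], text: str) -> list[str]:
    """Return the IDs that contain `text`, ignoring case."""
    needle = text.lower()
    return [cid for cid in collection_ids if needle in cid.lower()]


# A few IDs copied from the list_collections() output.
ids = ["conus404-biasadjusted", "conus404-pgw", "conus404", "gridMET", "PRISM_v2"]

print(find_collections(ids, "conus404"))
# ['conus404-biasadjusted', 'conus404-pgw', 'conus404']
print(find_collections(ids, "gridmet"))
# ['gridMET']
```

In the notebook, the full list returned by `list_collections(catalog)` can be passed in place of `ids`.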

In [6]:

# Select a collection — replace "gridMET" with your choice from the list above
collection = select_collection(catalog, "gridMET")

Collection 'gridMET' assets:
  zarr-s3-osn: Free access to zarr via S3 API — Free, public access to zarr data store via the S3 API. This data is stored on an Open Storage Network Pod.
  nc-s3-osn: Free access to archival legacy files via S3 API — Free, public access (via the S3 API) to archival legacy files from WMA THREDDS server that were used to create this zarr store. This data is stored on an Open Storage Network Pod.
  default: No title — No description
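
`get_asset_info` raises `KeyError` when either xarray field is missing, so it can be useful to check first which asset keys carry both. The sketch below assumes only that each asset exposes an `extra_fields` dict, as the helper above already relies on. `xarray_ready_assets` and the `FakeAsset` stand-in are hypothetical, and the sample data is made up for an offline demo; it does not describe the real gridMET assets.

```python
from dataclasses import dataclass, field

REQUIRED_FIELDS = ("xarray:storage_options", "xarray:open_kwargs")


@dataclass
class FakeAsset:
    """Offline stand-in for a pystac asset (hypothetical)."""
    href: str
    extra_fields: dict = field(default_factory=dict)


def xarray_ready_assets(assets: dict) -> list[str]:
    """Return keys of assets that define both xarray metadata fields."""
    return [
        key
        for key, asset in assets.items()
        if all(name in asset.extra_fields for name in REQUIRED_FIELDS)
    ]


# Illustrative data only; the real catalog's fields may differ.
sample_assets = {
    "with-metadata": FakeAsset(
        "s3://example-bucket/data.zarr/",
        {
            "xarray:storage_options": {"anon": True},
            "xarray:open_kwargs": {"engine": "zarr"},
        },
    ),
    "without-metadata": FakeAsset("https://example.com/landing-page"),
}

print(xarray_ready_assets(sample_assets))  # ['with-metadata']
```

With a real collection, the call would be `xarray_ready_assets(collection.assets)`.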

In [7]:

# Get asset metadata — replace "zarr-s3-osn" with your choice from the list above
asset_info = get_asset_info(collection, "zarr-s3-osn")
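
The dict returned by `get_asset_info` uses the keys `url`, `storage_options`, and `open_kwargs`, which match the parameter names of `open_zarr_dataset`. That is what allows the dict to be passed with `**` unpacking. The sketch below shows the mechanism with a stand-in function so that nothing is downloaded; `describe_open_call` and `example_info` are hypothetical names, and the values are taken from Option 1.

```python
def describe_open_call(url: str, storage_options: dict, open_kwargs: dict) -> str:
    """Stand-in with the same parameter names as open_zarr_dataset."""
    return (
        f"open {url} with engine={open_kwargs['engine']}, "
        f"anon={storage_options['anon']}"
    )


example_info = {
    "url": "s3://mdmf/gdp/gridMET.zarr/",
    "storage_options": {"anon": True},
    "open_kwargs": {"engine": "zarr", "chunks": {}, "consolidated": True},
}

# Same as describe_open_call(url=..., storage_options=..., open_kwargs=...)
print(describe_open_call(**example_info))
# open s3://mdmf/gdp/gridMET.zarr/ with engine=zarr, anon=True
```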

In [8]:

# Open the dataset with xarray
ds = open_zarr_dataset(**asset_info)
ds

Out[8]:

<xarray.Dataset> Size: 823GB
Dimensions:                                    (lat: 585, lon: 1386, time: 15861)
Coordinates:
  * lat                                        (lat) float64 5kB 49.4 ... 25.07
  * lon                                        (lon) float64 11kB -124.8 ... ...
  * time                                       (time) datetime64[ns] 127kB 19...
Data variables:
    crs                                        int64 8B ...
    max_air_temperature                        (time, lat, lon) float64 103GB dask.array<chunksize=(2190, 150, 150), meta=np.ndarray>
    max_relative_humidity                      (time, lat, lon) float64 103GB dask.array<chunksize=(2190, 150, 150), meta=np.ndarray>
    min_air_temperature                        (time, lat, lon) float64 103GB dask.array<chunksize=(2190, 150, 150), meta=np.ndarray>
    min_relative_humidity                      (time, lat, lon) float64 103GB dask.array<chunksize=(2190, 150, 150), meta=np.ndarray>
    precipitation_amount                       (time, lat, lon) float64 103GB dask.array<chunksize=(2190, 150, 150), meta=np.ndarray>
    specific_humidity                          (time, lat, lon) float64 103GB dask.array<chunksize=(2190, 150, 150), meta=np.ndarray>
    surface_downwelling_shortwave_flux_in_air  (time, lat, lon) float64 103GB dask.array<chunksize=(2190, 150, 150), meta=np.ndarray>
    wind_speed                                 (time, lat, lon) float64 103GB dask.array<chunksize=(2190, 150, 150), meta=np.ndarray>
Attributes: (12/42)
    Conventions:                CF-1.0
    Metadata_Conventions:       Unidata Dataset Discovery v1.0
    acknowledgement:            Whenever you publish research based on data f...
    author:                     John Abatzoglou - University of Idaho, jabatz...
    cdm_data_type:              Grid
    contributors:               Dr. John Abatzoglou
    ...                         ...
    publisher_name:             Center for Integrated Data Analytics
    publisher_url:              https://www.cida.usgs.gov/
    summary:                    This archive contains daily surface meteorolo...
    time_coverage_resolution:   P1D
    time_coverage_start:        1979-01-01T00:00
    title:                      Daily Meteorological data for continental US
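
The sizes in the repr follow from the dimension lengths and the 8-byte float64 dtype, which is why the data is opened lazily instead of being read into memory. The short calculation below reproduces them; every number is read from the output above.

```python
# Dimension lengths and chunk shape read from the dataset repr.
n_time, n_lat, n_lon = 15861, 585, 1386
chunk_shape = (2190, 150, 150)
bytes_per_value = 8  # float64
n_variables = 8      # gridded data variables (crs is a scalar)

variable_bytes = n_time * n_lat * n_lon * bytes_per_value
chunk_bytes = chunk_shape[0] * chunk_shape[1] * chunk_shape[2] * bytes_per_value

print(f"one variable: {variable_bytes / 1e9:.0f} GB")                 # 103 GB
print(f"all variables: {n_variables * variable_bytes / 1e9:.0f} GB")  # 823 GB
print(f"one chunk: {chunk_bytes / 1e6:.0f} MB")                       # 394 MB
```

These are uncompressed in-memory sizes. Selecting a subset with `ds.sel(...)` before calling `.load()` or `.compute()` keeps the transfer small. Because `lat` runs from north to south in this dataset, a label-based latitude slice would be written as `slice(north, south)`.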

Reporting Issues

If you encounter problems with the catalog or this notebook, please reach out:

Open an issue: code.usgs.gov
Email: mdmf@usgs.gov

Contributing to the Catalog

The WMA STAC Catalog welcomes contributions of geospatial datasets relevant to water science. We are especially interested in gridded data products stored in cloud-optimized formats such as Zarr and Cloud Optimized GeoTIFF (COG), though other formats may be considered.

If you have a dataset that could benefit the water science community, please open a Dataset Addition Inquiry and we will work with you to evaluate it for inclusion in the catalog.
