Use Cases
Water Mission Area STAC Catalog
Keywords: STAC; catalog; data; spatiotemporal; machine-readable
Domain: Domain Agnostic
Language: Python
Description:
This use case describes how to discover spatiotemporal data assets in the WMA STAC Catalog and read them into a Python workflow.
Linked Catalog Datasets:
- IsDerivedFrom CONUS404: Four-kilometer long-term regional hydroclimate reanalysis over the conterminous United States (ver. 3.0, June 2026)
- IsDerivedFrom CONUS404 PGW: Four-kilometer long-term regional hydroclimate reanalysis perturbed with pseudo-global warming (PGW) conditions over the conterminous United States
- IsDerivedFrom CONUS404 BA: CONUS404 climate forcing variable subset for hydrologic models, 1979-2022: downscaled to 1 km and bias-adjusted for precipitation and temperature
- IsDerivedFrom SNODAS: Snow Data Assimilation System (SNODAS) Data Products at NSIDC, Version 1
- IsDerivedFrom Annual National Land Cover Database (NLCD) Collection 1.1
- IsDerivedFrom SSEBop (MODIS)
- IsDerivedFrom gridMET
- IsDerivedFrom PRISM: 4km Monthly Parameter-elevation Regressions on Independent Slopes Model Monthly Climate Data for the Continental United States.
- IsDerivedFrom United States Stage IV Quantitative Precipitation Archive
Additional Information:
Notebook: Exploring the WMA STAC Catalog
The USGS Water Mission Area (WMA) STAC Catalog provides a standardized interface for discovering and accessing geospatial datasets managed by the WMA. It organizes water-related data products — including gridded climate datasets, hydrologic model outputs, and remote sensing products — as browsable, machine-readable collections.
This notebook demonstrates two approaches for finding and opening datasets from the catalog:
- Option 1 — UI-Based Discovery: Browse the catalog using the pygeoapi or Radiant Earth STAC Browser web interfaces, then copy the dataset access metadata into Python to open with xarray.
- Option 2 — Programmatic Exploration: Use the PySTAC library and a helper class to navigate the catalog hierarchy in code, retrieve dataset metadata, and open data with xarray.
Both approaches end with opening a dataset as an xarray.Dataset, ready for analysis.
Catalog URL: https://api.water.usgs.gov/gdp/pygeoapi/stac/stac-collection/
Prerequisites
The following Python packages are required to run the code in this notebook:
- pystac — for connecting to and navigating the STAC catalog
- xarray — for opening and working with datasets
- zarr — backend storage format for the catalog's datasets
- packaging — for zarr version detection and compatibility handling
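Before running the notebook, it can help to confirm that these packages are importable. The sketch below is one way to do that with the standard library. The `missing_packages` helper is a hypothetical name, not part of the notebook. The s3fs and dask entries are an assumption added here: fsspec generally needs s3fs to resolve s3:// URLs, and xarray needs dask to return lazily chunked arrays, but verify this against your own environment.

```python
from importlib.util import find_spec


def missing_packages(names: list[str]) -> list[str]:
    """Return the subset of top-level module names that cannot be imported."""
    return [name for name in names if find_spec(name) is None]


# Packages listed in the prerequisites above.
required = ["pystac", "xarray", "zarr", "packaging"]
# Assumption: likely needed for s3:// access and lazy (dask-backed) chunks.
optional = ["s3fs", "dask"]

print("Missing required:", missing_packages(required))
print("Missing optional:", missing_packages(optional))
```

An empty list for the required packages suggests the import cell that follows should run without an ImportError.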
import pystac
import xarray as xr
import zarr
from packaging.version import Version
from typing import Any
Option 1: Manual UI Discovery
The WMA STAC Catalog can be browsed directly in a web browser. Follow these steps to find a dataset and its access metadata:
Browse the catalog using one of these interfaces:
- pygeoapi STAC interface — the native catalog UI with value-added metadata fields surfaced to improve your data access experience
- Radiant Earth STAC Browser
Navigate to a collection. The top level of the catalog contains a mix of:
- Datasets — data access endpoints you can open directly with xarray
- Sub-collections — groupings of data that contain child collections; drill into these to find the datasets within
Find a dataset's assets. Once you reach a dataset collection, look at its assets. Each asset represents a data access endpoint (typically a zarr store on S3-compatible storage).
Copy the access metadata. From an asset, note these three fields:
- href: the URL to the zarr store
- xarray:storage_options: the connection parameters (endpoint, authentication)
- xarray:open_kwargs: the keyword arguments for xarray.open_dataset (engine, chunks, etc.)
Note on JSON boolean capitalization: The STAC catalog returns JSON, which uses lowercase true and false. When copying these values into Python, you must capitalize them to True and False. For example, "anon": true in JSON becomes "anon": True in Python.
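As an alternative to capitalizing booleans by hand, the copied JSON can be pasted into a string and parsed with Python's json module, which converts true and false automatically. The asset snippet below is a hypothetical example shaped like the three fields described above, filled in with the gridMET values used in the next cell. The `asset_json` and `asset` names are illustrative only.

```python
import json

# Hypothetical asset JSON, pasted as-is from the browser (lowercase true/false).
asset_json = """
{
  "href": "s3://mdmf/gdp/gridMET.zarr/",
  "xarray:storage_options": {
    "anon": true,
    "client_kwargs": {"endpoint_url": "https://usgs.osn.mghpcc.org/"}
  },
  "xarray:open_kwargs": {"chunks": {}, "consolidated": true, "engine": "zarr"}
}
"""

asset = json.loads(asset_json)  # JSON true/false become Python True/False

zarr_url = asset["href"]
storage_options = asset["xarray:storage_options"]
open_kwargs = asset["xarray:open_kwargs"]

print(storage_options["anon"], open_kwargs["consolidated"])  # prints: True True
```

Parsing the snippet this way avoids transcription mistakes when an asset has many nested options.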
# === Option 1: Manual UI Discovery ===
# Replace these values with what you find in the STAC Browser UI.
# Note: JSON uses lowercase true/false; capitalize to True/False for Python.
zarr_url = "s3://mdmf/gdp/gridMET.zarr/"
open_kwargs = {
    "chunks": {},
    "consolidated": True,  # JSON: "consolidated": true → Python: True
    "engine": "zarr",
}
storage_options = {
    "anon": True,  # JSON: "anon": true → Python: True
    "client_kwargs": {
        "endpoint_url": "https://usgs.osn.mghpcc.org/"
    },
}
# Zarr v3 defaults to format 3, but WMA catalog data is stored in zarr format 2.
# This helper detects the installed zarr version and passes zarr_format=2 when needed.
def open_zarr_dataset(
    url: str,
    storage_options: dict,
    open_kwargs: dict,
) -> xr.Dataset:
    """Open a zarr dataset with automatic zarr v2/v3 compatibility handling.

    Detects the installed zarr package version and passes `zarr_format=2`
    when zarr >= 3.0.0, since zarr v3 defaults to format 3 but the WMA
    catalog stores data in zarr format 2.

    Args:
        url: S3 or HTTP URL to the zarr store.
        storage_options: Dict of fsspec storage options (e.g., anon, endpoint_url).
        open_kwargs: Dict of keyword arguments for xarray.open_dataset
            (e.g., engine, chunks, consolidated).

    Returns:
        An xarray.Dataset opened from the zarr store.
    """
    kwargs = {"storage_options": storage_options, **open_kwargs}
    if Version(zarr.__version__) >= Version("3.0.0"):
        kwargs["zarr_format"] = 2
    return xr.open_dataset(url, **kwargs)

ds = open_zarr_dataset(zarr_url, storage_options, open_kwargs)
ds
<xarray.Dataset> Size: 823GB
Dimensions: (lat: 585, lon: 1386, time: 15861)
Coordinates:
* lat (lat) float64 5kB 49.4 ... 25.07
* lon (lon) float64 11kB -124.8 ... ...
* time (time) datetime64[ns] 127kB 19...
Data variables:
crs int64 8B ...
max_air_temperature (time, lat, lon) float64 103GB dask.array<chunksize=(2190, 150, 150), meta=np.ndarray>
max_relative_humidity (time, lat, lon) float64 103GB dask.array<chunksize=(2190, 150, 150), meta=np.ndarray>
min_air_temperature (time, lat, lon) float64 103GB dask.array<chunksize=(2190, 150, 150), meta=np.ndarray>
min_relative_humidity (time, lat, lon) float64 103GB dask.array<chunksize=(2190, 150, 150), meta=np.ndarray>
precipitation_amount (time, lat, lon) float64 103GB dask.array<chunksize=(2190, 150, 150), meta=np.ndarray>
specific_humidity (time, lat, lon) float64 103GB dask.array<chunksize=(2190, 150, 150), meta=np.ndarray>
surface_downwelling_shortwave_flux_in_air (time, lat, lon) float64 103GB dask.array<chunksize=(2190, 150, 150), meta=np.ndarray>
wind_speed (time, lat, lon) float64 103GB dask.array<chunksize=(2190, 150, 150), meta=np.ndarray>
Attributes: (12/42)
Conventions: CF-1.0
Metadata_Conventions: Unidata Dataset Discovery v1.0
acknowledgement: Whenever you publish research based on data f...
author: John Abatzoglou - University of Idaho, jabatz...
cdm_data_type: Grid
contributors: Dr. John Abatzoglou
... ...
publisher_name: Center for Integrated Data Analytics
publisher_url: https://www.cida.usgs.gov/
summary: This archive contains daily surface meteorolo...
time_coverage_resolution: P1D
time_coverage_start: 1979-01-01T00:00
title: Daily Meteorological data for continental US
Option 2: Programmatic Exploration
This approach uses the PySTAC library and a set of helper functions to navigate the catalog hierarchy directly in Python. Rather than browsing a web UI, you interact with the catalog programmatically — listing collections, drilling into datasets, and retrieving access metadata — all in code.
This is especially useful for reproducible workflows and scripted data access, where you want your notebook to document exactly how a dataset was discovered and opened.
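For intuition about what the catalog hierarchy looks like underneath: a STAC catalog or collection is a JSON document whose links array points to related documents, and entries with rel set to "child" identify sub-catalogs or collections. The sketch below extracts child links from an already-parsed document using only plain dicts. The `child_links` helper and the sample document (including its example.com URLs) are hypothetical; PySTAC performs this resolution for you through get_children().

```python
from typing import Any


def child_links(stac_doc: dict[str, Any]) -> list[str]:
    """Return the href of every link whose rel is "child"."""
    return [
        link["href"]
        for link in stac_doc.get("links", [])
        if link.get("rel") == "child" and "href" in link
    ]


# Hypothetical, heavily trimmed catalog document for illustration only.
sample_catalog = {
    "type": "Catalog",
    "id": "example-root",
    "links": [
        {"rel": "self", "href": "https://example.com/stac/"},
        {"rel": "child", "href": "https://example.com/stac/gridMET"},
        {"rel": "child", "href": "https://example.com/stac/conus404"},
    ],
}

for href in child_links(sample_catalog):
    print(href)
```

Following each child href and repeating the same extraction is, in essence, how a client walks from the root catalog down to individual dataset collections.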
The cells below define helper functions followed by a step-by-step walkthrough demonstrating how to connect to the catalog, browse collections, and open a dataset with xarray.
def connect_catalog(catalog_url: str) -> pystac.Catalog:
    """Connect to a STAC catalog and return the catalog object.

    Args:
        catalog_url: URL to the STAC catalog's root JSON document.

    Returns:
        A pystac.Catalog instance connected to the remote catalog.
    """
    return pystac.Catalog.from_file(catalog_url)

def list_collections(catalog: pystac.Catalog) -> list[str]:
    """List all child collections in the catalog.

    Prints each collection's ID and its cite-as link (if available).

    Args:
        catalog: A pystac.Catalog instance.

    Returns:
        A list of collection ID strings.
    """
    collection_ids = []
    for child in catalog.get_children():
        collection_ids.append(child.id)
        # Use the first cite-as link, if the collection has one.
        cite_link = None
        for link in child.links:
            if link.rel == "cite-as":
                cite_link = link.target
                break
        cite_str = cite_link if cite_link else "Not available"
        print(f" {child.id} \u2014 cite-as: {cite_str}")
    return collection_ids

def select_collection(catalog: pystac.Catalog, collection_id: str):
    """Navigate into a specified collection.

    If the collection contains assets, prints them (ID, title, description).
    If it contains child collections, prints those instead.

    Args:
        catalog: A pystac.Catalog instance.
        collection_id: The ID of the collection to select.

    Returns:
        The selected pystac collection object.

    Raises:
        ValueError: If collection_id is not found among the catalog's children.
    """
    available_ids = []
    target = None
    for child in catalog.get_children():
        available_ids.append(child.id)
        if child.id == collection_id:
            target = child
    if target is None:
        raise ValueError(
            f"Collection '{collection_id}' not found. "
            f"Available collections: {available_ids}"
        )
    if target.assets:
        print(f"Collection '{collection_id}' assets:")
        for asset_key, asset in target.assets.items():
            title = asset.title or "No title"
            description = asset.description or "No description"
            print(f" {asset_key}: {title} \u2014 {description}")
    else:
        print(f"Collection '{collection_id}' contains sub-collections:")
        for sub_child in target.get_children():
            print(f" {sub_child.id}")
    return target

def get_asset_info(collection, asset_id: str) -> dict[str, Any]:
    """Retrieve xarray-compatible metadata from a collection's asset.

    Args:
        collection: A pystac collection object (returned by select_collection).
        asset_id: The ID (key) of the asset to retrieve metadata from.

    Returns:
        A dict with keys: url, storage_options, open_kwargs.

    Raises:
        KeyError: If asset_id doesn't exist or xarray metadata fields are missing.
    """
    assets = collection.assets
    if asset_id not in assets:
        raise KeyError(
            f"Asset '{asset_id}' not found. Available: {list(assets.keys())}"
        )
    asset = assets[asset_id]
    extra = asset.extra_fields
    if "xarray:storage_options" not in extra:
        raise KeyError(
            f"Asset '{asset_id}' is missing 'xarray:storage_options' field."
        )
    if "xarray:open_kwargs" not in extra:
        raise KeyError(
            f"Asset '{asset_id}' is missing 'xarray:open_kwargs' field."
        )
    # Keys match the parameter names of open_zarr_dataset so the dict can be unpacked.
    return {
        "url": asset.href,
        "storage_options": extra["xarray:storage_options"],
        "open_kwargs": extra["xarray:open_kwargs"],
    }

def open_zarr_dataset(
    url: str,
    storage_options: dict,
    open_kwargs: dict,
) -> xr.Dataset:
    """Open a zarr dataset with automatic zarr v2/v3 compatibility handling.

    Detects the installed zarr package version and passes `zarr_format=2`
    when zarr >= 3.0.0, since zarr v3 defaults to format 3 but the WMA
    catalog stores data in zarr format 2.

    Args:
        url: S3 or HTTP URL to the zarr store.
        storage_options: Dict of fsspec storage options (e.g., anon, endpoint_url).
        open_kwargs: Dict of keyword arguments for xarray.open_dataset
            (e.g., engine, chunks, consolidated).

    Returns:
        An xarray.Dataset opened from the zarr store.
    """
    kwargs = {"storage_options": storage_options, **open_kwargs}
    if Version(zarr.__version__) >= Version("3.0.0"):
        kwargs["zarr_format"] = 2
    return xr.open_dataset(url, **kwargs)

Step-by-Step Walkthrough
The following cells demonstrate the complete workflow for programmatic catalog exploration:
- Connect to the STAC catalog
- List available collections
- Select a collection to explore
- Retrieve asset metadata for a dataset
- Open the dataset with xarray
Note: The examples below use the gridMET collection and zarr-s3-osn asset. Replace these IDs with your own choices based on the output of list_collections() and select_collection(). Each function takes its inputs explicitly, so you can copy individual cells into other scripts without needing the full notebook.
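When choosing an asset ID, keep in mind that not every asset necessarily carries the xarray metadata that get_asset_info() requires. The sketch below filters a mapping of asset key to asset fields down to the keys that include both xarray entries. It works on plain dicts so it can run offline; the `openable_assets` helper and the sample data are hypothetical and are not part of the notebook.

```python
from typing import Any

REQUIRED_FIELDS = ("xarray:storage_options", "xarray:open_kwargs")


def openable_assets(assets: dict[str, dict[str, Any]]) -> list[str]:
    """Return asset keys whose fields include both xarray metadata entries."""
    return [
        key
        for key, fields in assets.items()
        if all(name in fields for name in REQUIRED_FIELDS)
    ]


# Hypothetical asset fields, loosely modeled on a dataset collection.
sample_assets = {
    "zarr-s3-osn": {
        "xarray:storage_options": {"anon": True},
        "xarray:open_kwargs": {"engine": "zarr", "chunks": {}},
    },
    "default": {},  # no xarray metadata, so it is filtered out
}

print(openable_assets(sample_assets))  # prints: ['zarr-s3-osn']
```

With a pystac collection, the same check could be applied by passing each asset's extra_fields, for example `openable_assets({k: a.extra_fields for k, a in collection.assets.items()})`.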
# Connect to the WMA STAC Catalog
catalog_url = "https://api.water.usgs.gov/gdp/pygeoapi/stac/stac-collection/"
catalog = connect_catalog(catalog_url)
list_collections(catalog)
AIEM_permafrost — cite-as: Not available
CA-BCM-2014 — cite-as: https://doi.org/10.21429/dye8-h568
FLET — cite-as: Not available
GMO — cite-as: https://doi.org/10.21429/v7ys-6n72
GMO_New — cite-as: https://doi.org/10.21429/c6s4-ve70
LOCA2 — cite-as: Not available
PRISM_v2 — cite-as: https://prism.oregonstate.edu/
PuertoRico — cite-as: Not available
RedRiver — cite-as: Not available
SPEI — cite-as: Not available
TTU_2019 — cite-as: Not available
TopoWx2017 — cite-as: Not available
WUS_HSP — cite-as: Not available
alaska_et_2020 — cite-as: Not available
bcca — cite-as: Not available
bcsd_mon_vic — cite-as: http://gdo-dcp.ucllnl.org/downscaled_cmip_projections/
bcsd_obs — cite-as: http://gdo-dcp.ucllnl.org/downscaled_cmip_projections/
cmip5_bcsd — cite-as: https://doi.org/10.21429/sbxv-1n90
conus404-biasadjusted — cite-as: Not available
conus404-pgw — cite-as: Not available
conus404 — cite-as: Not available
cooper — cite-as: Not available
cprep — cite-as: Not available
dcp_compressed — cite-as: https://doi.org/https%3A//doi.org/10.21429/j9f1-b218
gridMET — cite-as: https://www.climatologylab.org/gridmet.html
hawaii_2018 — cite-as: Not available
iclus — cite-as: Not available
loca — cite-as: Not available
maca-vic — cite-as: Not available
macav2 — cite-as: Not available
maurer — cite-as: https://doi.org/10.21429/m7y0-xy02
mows — cite-as: Not available
nc-casc-snow — cite-as: Not available
nlcd — cite-as: Not available
notaro_2018 — cite-as: Not available
openet — cite-as: Not available
pacis — cite-as: Not available
puerto_rico — cite-as: Not available
red_river_2018 — cite-as: https://doi.org/10.21429/em59-hn43
serap — cite-as: Not available
slr2d — cite-as: https://doi.org/10.21429/66gt-dm26
snodas — cite-as: https://doi.org/10.7265/N5TB14TC
ssebopeta — cite-as: Not available
stageiv_combined — cite-as: https://pubs.usgs.gov/fs/2013/3035/
wicci — cite-as: https://doi.org/10.21429/dtp5-z505
['AIEM_permafrost', 'CA-BCM-2014', 'FLET', 'GMO', 'GMO_New', 'LOCA2', 'PRISM_v2', 'PuertoRico', 'RedRiver', 'SPEI', 'TTU_2019', 'TopoWx2017', 'WUS_HSP', 'alaska_et_2020', 'bcca', 'bcsd_mon_vic', 'bcsd_obs', 'cmip5_bcsd', 'conus404-biasadjusted', 'conus404-pgw', 'conus404', 'cooper', 'cprep', 'dcp_compressed', 'gridMET', 'hawaii_2018', 'iclus', 'loca', 'maca-vic', 'macav2', 'maurer', 'mows', 'nc-casc-snow', 'nlcd', 'notaro_2018', 'openet', 'pacis', 'puerto_rico', 'red_river_2018', 'serap', 'slr2d', 'snodas', 'ssebopeta', 'stageiv_combined', 'wicci']
# Select a collection — replace "gridMET" with your choice from the list above
collection = select_collection(catalog, "gridMET")
Collection 'gridMET' assets:
 zarr-s3-osn: Free access to zarr via S3 API — Free, public access to zarr data store via the S3 API. This data is stored on an Open Storage Network Pod.
 nc-s3-osn: Free access to archival legacy files via S3 API — Free, public access (via the S3 API) to archival legacy files from WMA THREDDS server that were used to create this zarr store. This data is stored on an Open Storage Network Pod.
 default: No title — No description
# Get asset metadata — replace "zarr-s3-osn" with your choice from the list above
asset_info = get_asset_info(collection, "zarr-s3-osn")
# Open the dataset with xarray
ds = open_zarr_dataset(**asset_info)
ds
<xarray.Dataset> Size: 823GB
Dimensions: (lat: 585, lon: 1386, time: 15861)
Coordinates:
* lat (lat) float64 5kB 49.4 ... 25.07
* lon (lon) float64 11kB -124.8 ... ...
* time (time) datetime64[ns] 127kB 19...
Data variables:
crs int64 8B ...
max_air_temperature (time, lat, lon) float64 103GB dask.array<chunksize=(2190, 150, 150), meta=np.ndarray>
max_relative_humidity (time, lat, lon) float64 103GB dask.array<chunksize=(2190, 150, 150), meta=np.ndarray>
min_air_temperature (time, lat, lon) float64 103GB dask.array<chunksize=(2190, 150, 150), meta=np.ndarray>
min_relative_humidity (time, lat, lon) float64 103GB dask.array<chunksize=(2190, 150, 150), meta=np.ndarray>
precipitation_amount (time, lat, lon) float64 103GB dask.array<chunksize=(2190, 150, 150), meta=np.ndarray>
specific_humidity (time, lat, lon) float64 103GB dask.array<chunksize=(2190, 150, 150), meta=np.ndarray>
surface_downwelling_shortwave_flux_in_air (time, lat, lon) float64 103GB dask.array<chunksize=(2190, 150, 150), meta=np.ndarray>
wind_speed (time, lat, lon) float64 103GB dask.array<chunksize=(2190, 150, 150), meta=np.ndarray>
Attributes: (12/42)
Conventions: CF-1.0
Metadata_Conventions: Unidata Dataset Discovery v1.0
acknowledgement: Whenever you publish research based on data f...
author: John Abatzoglou - University of Idaho, jabatz...
cdm_data_type: Grid
contributors: Dr. John Abatzoglou
... ...
publisher_name: Center for Integrated Data Analytics
publisher_url: https://www.cida.usgs.gov/
summary: This archive contains daily surface meteorolo...
time_coverage_resolution: P1D
time_coverage_start: 1979-01-01T00:00
title: Daily Meteorological data for continental US
Reporting Issues
If you encounter problems with the catalog or this notebook, please reach out:
- Open an issue: code.usgs.gov
- Email: mdmf@usgs.gov
Contributing to the Catalog
The WMA STAC Catalog welcomes contributions of geospatial datasets relevant to water science. We are especially interested in gridded data products stored in cloud-optimized formats such as Zarr and Cloud Optimized GeoTIFF (COG), though other formats may be considered.
If you have a dataset that could benefit the water science community, please open a Dataset Addition Inquiry and we will work with you to evaluate it for inclusion in the catalog.
