Water Mission Area STAC Catalog
Keywords: STAC; catalog; data; spatiotemporal; machine-readable
Domain: Domain Agnostic
Language: Python
Description:
This use case describes how to discover spatiotemporal data assets in the WMA STAC Catalog and read them into a Python workflow.
Linked Catalog Datasets:
- 4km Monthly Parameter-elevation Regressions on Independent Slopes Model Monthly Climate Data for the Continental United States.
- SSEBop (MODIS)
- United States Stage IV Quantitative Precipitation Archive
- gridMET
Water Mission Area STAC Catalog
The USGS Water Resources Mission Area (WMA) has created a STAC Catalog to help users find and access data related to our water modeling projects. STAC Catalogs provide a mechanism to expose spatiotemporal datasets using a machine-readable format, allowing users to discover and access datasets in a standardized way. The WMA Catalog hosts a variety of datasets related to hydro-terrestrial modeling. The scope of the WMA STAC Catalog datasets may include:
- common hydrologic model inputs such as climate or other forcing datasets
- hydrologic model outputs
- observational datasets related to hydrology and/or water budgets
We have exposed our STAC Catalog through a pygeoapi endpoint that is compliant with the OGC API suite of standards. This endpoint provides both API access and a user interface for browsing our data assets.
This STAC Catalog is part of a modernized replacement for the legacy Geo Data Portal.
Finding the data you need and reading it into your workflow
You have two ways to discover the data assets in the WMA STAC Catalog:
- Browsing our datasets through the pygeoapi user interface.
- Reading the pygeoapi API endpoint into a workflow using a library designed for reading STAC catalogs, such as PySTAC.
Both methods are described and demonstrated below.
# first import the packages we will need
import pystac
import xarray as xr
import zarr
from packaging.version import Version
Option 1: Find and access a dataset using the pygeoapi or Radiant Earth user interface
Step 1: Explore the catalog through either our pygeoapi user interface or the Radiant Earth STAC Browser and find the dataset you want to use.
- This catalog contains both datasets and collections of datasets in the top level of the catalog. Therefore, some entries you click will open directly to a dataset explore page (with a map viewer and metadata about the dataset), while others will open another catalog listing. As you drill down through the collections, you will eventually land on a dataset explore page.
- You can learn more about each dataset at its source publication point, which is linked in the cite-as property displayed on each dataset page (if available).
- Click through the catalog until you find a dataset that you want to use. For this example, we will use the United States Stage IV Quantitative Precipitation Archive (stageiv_combined) dataset.
Step 2: Identify which asset you want to use to access the dataset you've chosen.
- Review the Title and Description of each asset listed in the Assets section. Each asset is a different copy of the dataset that may be stored in a different format or location.
- Typically, we recommend using the asset with the title "Free access to zarr via S3 API".
- Copy the URL, Open Keywords, and Storage Options for the asset you have chosen, and store them as Python variables, as demonstrated below:
# URL
zarr_url = 's3://mdmf/gdp/stageiv_combined.zarr/'
# Open Keywords
# note that you will need to capitalize the True or False in "consolidated" - these
# keys are stored as lowercase booleans in the STAC catalog JSON, but should be
# capitalized for Python
open_keywords = {
"chunks": {},
"consolidated": True,
"engine": "zarr"
}
# Storage Options
# note that you will need to capitalize the True or False in "anon" - these keys
# are stored as lowercase booleans in the STAC catalog JSON, but should be
# capitalized for Python
storage_options = {
"anon": True,
"client_kwargs": {
"endpoint_url": "https://usgs.osn.mghpcc.org/"
}
}
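A note on those lowercase booleans: if you copy the asset metadata as raw JSON instead, Python's json module converts JSON true/false to Python True/False for you, so no manual editing is needed. A small sketch using JSON strings that mirror the values above:

```python
import json

# Raw JSON as it appears in the STAC asset metadata (lowercase booleans)
raw_open_keywords = '{"chunks": {}, "consolidated": true, "engine": "zarr"}'
raw_storage_options = (
    '{"anon": true, "client_kwargs": {"endpoint_url": "https://usgs.osn.mghpcc.org/"}}'
)

# json.loads converts JSON true/false to Python True/False automatically
open_keywords = json.loads(raw_open_keywords)
storage_options = json.loads(raw_storage_options)

print(open_keywords["consolidated"])  # True
print(storage_options["anon"])        # True
```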
Open the dataset using xarray
Now we will open and view the dataset using xarray.
Please note that there are two major zarr format specifications: 2 and 3. If you are using the Python package zarr>=3.0.0, you must specify the format of the zarr store you are trying to open in the xarray.open_dataset function. If you are using zarr<3.0.0, you do not need to specify the format, as it will default to version 2.
if Version(zarr.__version__) < Version("3.0.0"):
    ds = xr.open_dataset(
        zarr_url,
        storage_options=storage_options,
        **open_keywords
    )
else:
    ds = xr.open_dataset(
        zarr_url,
        storage_options=storage_options,
        **open_keywords,
        zarr_format=2
    )
ds
<xarray.Dataset> Size: 819GB
Dimensions:  (time: 207360, y: 881, x: 1121)
Coordinates:
    latitude   (y, x) float64 8MB dask.array<chunksize=(53, 68), meta=np.ndarray>
    longitude  (y, x) float64 8MB dask.array<chunksize=(53, 68), meta=np.ndarray>
  * time       (time) datetime64[ns] 2MB ...
Dimensions without coordinates: y, x
Data variables:
    Total_precipitation_surface_1_Hour_Accumulation  (time, y, x) float32 819GB dask.array<chunksize=(720, 53, 68), meta=np.ndarray>
    crs                                              int64 8B ...
Attributes:
    Conventions:    CF-1.7
    institution:    US National Weather Service - NCEP
    last_timestep:  2025-08-13T08
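Once the dataset is open, xarray selections on the dask-backed arrays stay lazy until you explicitly compute or plot them. A minimal sketch of that pattern using a tiny synthetic stand-in for the hourly precipitation dataset (the values below are hypothetical, not the real data):

```python
import numpy as np
import pandas as pd
import xarray as xr

# Tiny synthetic stand-in for the hourly precipitation dataset (hypothetical values)
times = pd.date_range("2020-01-01", periods=24, freq="h")
demo_ds = xr.Dataset(
    {
        "Total_precipitation_surface_1_Hour_Accumulation": (
            ("time", "y", "x"),
            np.random.default_rng(0).random((24, 5, 5)).astype("float32"),
        )
    },
    coords={"time": times},
)

# Select one day and accumulate over time; on the real dask-backed dataset this
# only builds a task graph until you call .compute() or plot the result
daily_total = demo_ds.sel(time="2020-01-01").sum("time")
print(daily_total["Total_precipitation_surface_1_Hour_Accumulation"].shape)  # (5, 5)
```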
Option 2: Find and access a dataset using the PySTAC Python library
Before we begin, we will define a helper function that can be used to drill down through the STAC Catalog and extract key metadata.
def get_children(catalog, collection_id=None):
    """
    Retrieve a specified collection from a STAC catalog/collection and print key
    metadata for exploring/accessing the datasets contained within it.

    If no collection ID is provided, the collections in the top level of the
    catalog will be printed. If a collection ID is provided, the collection with
    that ID is retrieved from the input catalog/collection. If the collection ID
    points to a dataset, the assets available for the dataset are printed. If it
    points to another collection, the child collections are listed.

    Args:
        catalog (pystac.Catalog | pystac.Collection): The STAC catalog/collection object.
        collection_id (str): The ID of the collection or dataset to retrieve from catalog.

    Returns:
        collection (pystac.Catalog | pystac.Collection): The collection object
            corresponding to the provided ID, or the top-level catalog if no ID
            is provided.
    """
    dataset = False
    if collection_id:
        collection = catalog.get_child(collection_id)
        if collection.assets:
            dataset = True
            print(f"{collection_id} is a dataset. Please review the assets below and select one to open.")
        else:
            print(f"{collection_id} is a collection. Please review the child items and select one to open in the next cell.")
    else:
        collection = catalog
    if dataset:
        # List the assets
        for asset_id, asset in collection.assets.items():
            print(f"Asset ID: {asset_id}")
            print(f"  Title: {asset.title}")
            print(f"  Description: {asset.description}")
    else:
        collections = list(collection.get_collections())
        print(f"Number of collections: {len(collections)}")
        print("Collections IDs:")
        for child_collection in collections:
            cite_as = "Not available"
            for link in child_collection.links:
                if link.rel == "cite-as":
                    cite_as = link.target
            print(f"- {child_collection.id}, Source: {cite_as}")
    return collection
Step 1: Explore the catalog using the PySTAC library and find the dataset you want to use.
- The following code will read the WMA STAC Catalog and print the items in the top level of the catalog.
- This catalog contains both datasets and collections of datasets in the top level of the catalog. Therefore, some collections may point to a dataset endpoint, while others will point to another collection of several datasets. As you drill down through the collections, you will eventually land on a dataset.
- You can learn more about each dataset at its source publication point, which the helper function defined above prints out (if available).
- Identify a dataset that you want to use. For this example, we will use the gridMET dataset.
Note: We are working on enabling STAC search on our pygeoapi endpoint, which will allow you to search for datasets using keywords and other metadata, making this exploration process easier.
# url for the WMA STAC Catalog
catalog_url = "https://api.water.usgs.gov/gdp/pygeoapi/stac/stac-collection/"
# use pystac to read the catalog
catalog = pystac.Catalog.from_file(catalog_url)
# list the collections in the catalog
catalog = get_children(catalog)
Number of collections: 38
Collections IDs:
- AIEM_permafrost, Source: Not available
- CA-BCM-2014, Source: https://doi.org/10.21429/dye8-h568
- FLET, Source: Not available
- GMO, Source: https://doi.org/10.21429/v7ys-6n72
- GMO_New, Source: https://doi.org/10.21429/c6s4-ve70
- LOCA2, Source: Not available
- PRISM_v2, Source: https://prism.oregonstate.edu/
- PuertoRico, Source: Not available
- RedRiver, Source: Not available
- TTU_2019, Source: Not available
- TopoWx2017, Source: Not available
- WUS_HSP, Source: Not available
- alaska_et_2020, Source: Not available
- bcca, Source: Not available
- bcsd_mon_vic, Source: http://gdo-dcp.ucllnl.org/downscaled_cmip_projections/
- bcsd_obs, Source: http://gdo-dcp.ucllnl.org/downscaled_cmip_projections/
- cmip5_bcsd, Source: https://doi.org/10.21429/sbxv-1n90
- conus404-daily, Source: Not available
- cooper, Source: Not available
- cprep, Source: Not available
- dcp_compressed, Source: Not available
- gridMET, Source: https://doi.org/10.21429/qynh-qy46
- hawaii_2018, Source: Not available
- iclus, Source: Not available
- loca, Source: Not available
- maca-vic, Source: Not available
- macav2, Source: Not available
- maurer, Source: https://doi.org/10.21429/m7y0-xy02
- mows, Source: Not available
- notaro_2018, Source: Not available
- pacis, Source: Not available
- puerto_rico, Source: Not available
- red_river_2018, Source: Not available
- serap, Source: Not available
- slr2d, Source: https://doi.org/10.21429/66gt-dm26
- ssebopeta, Source: Not available
- stageiv_combined, Source: https://pubs.usgs.gov/fs/2013/3035/
- wicci, Source: https://doi.org/10.21429/dtp5-z505
# select a collection from the catalog, replace the collection ID with the one you want to use:
collection = get_children(catalog, collection_id="gridMET")
gridMET is a dataset. Please review the assets below and select one to open.
Asset ID: zarr-s3-osn
  Title: Free access to zarr via S3 API
  Description: Free, public access to zarr data store via the S3 API. This data is stored on an Open Storage Network Pod.
Asset ID: nc-s3-osn
  Title: Free access to archival legacy files via S3 API
  Description: Free, public access (via the S3 API) to archival legacy files from WMA THREDDS server that were used to create this zarr store. This data is stored on an Open Storage Network Pod.
Asset ID: default
  Title: None
  Description: None
# if the above was a collection, uncomment the line below and enter the collection ID
# you want to use from the parent collection you selected:
#collection = get_children(collection, collection_id="alaska_et_2020_ccsm4_historical_simulation")
Step 2: Identify which asset you want to use to access the dataset you've chosen.
- Review the Title and Description of each asset printed above. Each asset is a different copy of the dataset that may be stored in a different format or location.
- Typically, we recommend using the zarr-s3-osn asset unless you have a reason to use a different asset.
- Copy the Asset ID for the asset you have chosen, and paste it in the code below to read the asset metadata.
# replace with the asset ID you want to use:
selected_asset_id = "zarr-s3-osn"
# read the asset metadata
asset = collection.assets[selected_asset_id]
Open the dataset using xarray
Now we will open and view the dataset using xarray.
Please note that there are two major zarr format specifications: 2 and 3. If you are using the Python package zarr>=3.0.0, you must specify the format of the zarr store you are trying to open in the xarray.open_dataset function. If you are using zarr<3.0.0, you do not need to specify the format, as it will default to version 2.
if Version(zarr.__version__) < Version("3.0.0"):
    ds = xr.open_dataset(
        asset.href,
        storage_options=asset.extra_fields['xarray:storage_options'],
        **asset.extra_fields['xarray:open_kwargs']
    )
else:
    ds = xr.open_dataset(
        asset.href,
        storage_options=asset.extra_fields['xarray:storage_options'],
        **asset.extra_fields['xarray:open_kwargs'],
        zarr_format=2
    )
ds
<xarray.Dataset> Size: 823GB
Dimensions:  (lat: 585, lon: 1386, time: 15861)
Coordinates:
  * lat      (lat) float64 5kB 49.4 ... 25.07
  * lon      (lon) float64 11kB -124.8 ... ...
  * time     (time) datetime64[ns] 127kB 19...
Data variables:
    crs                                        int64 8B ...
    max_air_temperature                        (time, lat, lon) float64 103GB dask.array<chunksize=(2190, 150, 150), meta=np.ndarray>
    max_relative_humidity                      (time, lat, lon) float64 103GB dask.array<chunksize=(2190, 150, 150), meta=np.ndarray>
    min_air_temperature                        (time, lat, lon) float64 103GB dask.array<chunksize=(2190, 150, 150), meta=np.ndarray>
    min_relative_humidity                      (time, lat, lon) float64 103GB dask.array<chunksize=(2190, 150, 150), meta=np.ndarray>
    precipitation_amount                       (time, lat, lon) float64 103GB dask.array<chunksize=(2190, 150, 150), meta=np.ndarray>
    specific_humidity                          (time, lat, lon) float64 103GB dask.array<chunksize=(2190, 150, 150), meta=np.ndarray>
    surface_downwelling_shortwave_flux_in_air  (time, lat, lon) float64 103GB dask.array<chunksize=(2190, 150, 150), meta=np.ndarray>
    wind_speed                                 (time, lat, lon) float64 103GB dask.array<chunksize=(2190, 150, 150), meta=np.ndarray>
Attributes: (12/42)
    Conventions:               CF-1.0
    Metadata_Conventions:      Unidata Dataset Discovery v1.0
    acknowledgement:           Whenever you publish research based on data f...
    author:                    John Abatzoglou - University of Idaho, jabatz...
    cdm_data_type:             Grid
    contributors:              Dr. John Abatzoglou
    ...
    publisher_name:            Center for Integrated Data Analytics
    publisher_url:             https://www.cida.usgs.gov/
    summary:                   This archive contains daily surface meteorolo...
    time_coverage_resolution:  P1D
    time_coverage_start:       1979-01-01T00:00
    title:                     Daily Meteorological data for continental US
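Because this dataset has labeled lat/lon/time coordinates, you can pull out a spatial and temporal subset by label rather than by index. A small sketch with gridMET-like coordinates (the values below are hypothetical stand-ins, not the real data):

```python
import numpy as np
import pandas as pd
import xarray as xr

# Tiny stand-in with gridMET-like coordinates (hypothetical values)
demo_ds = xr.Dataset(
    {
        "precipitation_amount": (
            ("time", "lat", "lon"),
            np.random.default_rng(1).random((10, 4, 6)),
        )
    },
    coords={
        "time": pd.date_range("1979-01-01", periods=10, freq="D"),
        "lat": np.linspace(49.4, 25.07, 4),
        "lon": np.linspace(-124.8, -67.0, 6),
    },
)

# Label-based selection works the same way on the full dask-backed dataset
subset = demo_ds["precipitation_amount"].sel(
    time=slice("1979-01-02", "1979-01-05"),
    lat=slice(49.0, 30.0),   # lat is descending, so slice from high to low
    lon=slice(-120.0, -100.0),
)
print(dict(subset.sizes))  # {'time': 4, 'lat': 2, 'lon': 2}
```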
You can use whichever of the two methods described above to find and access datasets from our STAC Catalog. Now that you have the dataset open in xarray, you can build on this workflow to visualize, analyze, or otherwise process the data. You don't need to download your own copy of the dataset - you can perform your analysis directly on the dataset from the source we provide. If your workflow exceeds your computer's memory, you can use the dask library to parallelize your analysis and take advantage of HPC or scalable cloud computing resources.
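As a minimal sketch of dask's lazy, chunked evaluation model (not tied to any of the datasets above):

```python
import dask.array as da

# The array is split into 100x100 chunks; the sum builds a task graph
# instead of computing eagerly, so data larger than memory can be reduced
# chunk by chunk
arr = da.ones((1000, 1000), chunks=(100, 100))
total = arr.sum()        # no computation has happened yet
print(total.compute())   # 1000000.0
```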
Reporting Issues with the Catalog
If you find any issues with our STAC catalog or the datasets it contains, please reach out to us in one of the following ways:
- If you have a code.usgs.gov account, you can open an issue.
- E-mail us at mdmf@usgs.gov.
We will do our best to address any issues you find in a timely manner.
Contributing to the Catalog
We are very interested in expanding the datasets available in our STAC Catalog to include more datasets that are relevant to the Water Mission Area. If you have a dataset that you think should be included, we would love to hear from you!
We are particularly interested in datasets that are very large or difficult to access. We are also interested in datasets needed by multiple project teams in the Water Mission Area so that we can reduce duplicated copies of datasets and standardize the way we access these data.
Please open an issue if you think you have a dataset that should be included in the catalog and tell us a bit about the dataset, what project(s) are using it, and why you would like to add it to our STAC Catalog.