Data Extraction Guide¶

This guide covers extracting remote sensing and meteorological data for SWIM-RS using Google Earth Engine and GridMET/ERA5-Land.

Overview¶

SWIM-RS requires several data inputs that are typically extracted from remote sources:

Data Type	Source	Function
NDVI	Landsat 8/9, Sentinel-2	`clustered_sample_ndvi`, `sparse_sample_ndvi`
ET Fraction (ETf)	OpenET (SSEBop, PT-JPL, SIMS, geeSEBAL)	`clustered_sample_etf`, `sparse_sample_etf`
Meteorology	GridMET (CONUS), ERA5-Land (global)	`swim extract` CLI or THREDDS direct
Snow (SWE)	SNODAS (CONUS), ERA5 (global)	`swim extract` CLI
Properties	SSURGO/HWSD soils, CDL/NLCD land cover, LANID/IrrMapper irrigation	`swimrs.data_extraction.ee.ee_props`

Some extraction pathways use optional OpenET Python implementations (and refetgee). If you plan to run those exporters, install the openet extra:

pip install "swimrs[openet]"

CLI Extraction¶

The simplest approach uses the swim extract CLI command:

# CONUS workflow (GridMET + SNODAS + Landsat)
swim extract my_project.toml

# Add Sentinel-2 NDVI
swim extract my_project.toml --add-sentinel

# International workflow (ERA5-Land)
swim extract my_project.toml --international

Outputs are exported to Google Drive (default) or a Cloud Storage bucket (if earth_engine.bucket is set in the TOML).

Python API: Clustered vs Sparse Sampling¶

For programmatic control, use the extraction functions directly.

When to Use Each¶

Function	Use Case
`clustered_sample_*`	Fields are geographically clustered (e.g., a single watershed or county). More efficient: nearby fields are grouped into fewer EE tasks.
`sparse_sample_*`	Fields are widely dispersed (e.g., flux towers across CONUS). Creates one task per field or small group.
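
One way to make this choice programmatically is to look at the spatial extent of the field centroids. The sketch below is a hypothetical heuristic, not SWIM-RS behavior: the function name suggest_sampler and the 2-degree threshold are assumptions chosen purely for illustration.

```python
def suggest_sampler(centroids, max_extent_deg=2.0):
    """Suggest 'clustered' or 'sparse' from a list of (lon, lat) centroids.

    Hypothetical heuristic: if the bounding box of all centroids is no larger
    than max_extent_deg in both directions, treat the fields as clustered.
    """
    if not centroids:
        raise ValueError("centroids must not be empty")
    lons = [lon for lon, _ in centroids]
    lats = [lat for _, lat in centroids]
    lon_extent = max(lons) - min(lons)
    lat_extent = max(lats) - min(lats)
    if lon_extent <= max_extent_deg and lat_extent <= max_extent_deg:
        return "clustered"
    return "sparse"


# Fields within a county-sized area
print(suggest_sampler([(-112.1, 46.2), (-112.0, 46.3), (-111.9, 46.25)]))
# Sites spread across CONUS
print(suggest_sampler([(-121.5, 38.1), (-96.5, 41.2), (-72.2, 42.5)]))
```

A fixed degree threshold is crude (a degree of longitude shrinks with latitude), so treat the result as a starting point rather than a rule.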

Function Signatures¶

Both clustered functions (clustered_sample_etf and clustered_sample_ndvi) accept:

shapefile: Local shapefile path, EE FeatureCollection asset ID, or ee.FeatureCollection object
feature_id: Column name for unique field identifiers
state_col: Column with US state codes (enables IrrMapper west / LANID east mask selection)
mask_type: 'irr' (irrigated), 'inv_irr' (non-irrigated), or 'none'
start_yr / end_yr: Year range for extraction
select: List of feature IDs to limit extraction (optional)
dest: 'drive' or 'bucket'
drive_folder: Google Drive folder name (if dest='drive')

ETf-Specific Parameters¶

model: ETf model — 'ssebop', 'ptjpl', 'sims', 'geesebal', or 'disalexi'
usgs_nhm: Use USGS NHM SSEBop (higher resolution) instead of OpenET SSEBop

NDVI-Specific Parameters¶

satellite: 'landsat' or 'sentinel'
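
Because each call submits Earth Engine tasks, it can help to check argument values before submission. The following is a minimal sketch, assuming the allowed values listed above are exhaustive. check_export_args is a hypothetical helper, not a SWIM-RS function.

```python
# Allowed values copied from the parameter lists in this guide
ALLOWED = {
    "mask_type": {"irr", "inv_irr", "none"},
    "dest": {"drive", "bucket"},
    "model": {"ssebop", "ptjpl", "sims", "geesebal", "disalexi"},
    "satellite": {"landsat", "sentinel"},
}


def check_export_args(**kwargs):
    """Raise ValueError for values outside the documented sets; return kwargs."""
    for name, allowed in ALLOWED.items():
        if name in kwargs and kwargs[name] not in allowed:
            raise ValueError(
                f"{name}={kwargs[name]!r} is not one of {sorted(allowed)}"
            )
    start, end = kwargs.get("start_yr"), kwargs.get("end_yr")
    if start is not None and end is not None and start > end:
        raise ValueError(f"start_yr ({start}) is after end_yr ({end})")
    return kwargs


etf_args = check_export_args(
    mask_type="irr", start_yr=2004, end_yr=2023, dest="drive", model="ssebop"
)
print(etf_args)
```

The validated dictionary could then be unpacked into an extraction call, e.g. clustered_sample_etf(shapefile_path, **etf_args).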

Example: Extract ETf and NDVI for Selected Fields¶

from swimrs.data_extraction.ee.etf_export import clustered_sample_etf
from swimrs.data_extraction.ee.ndvi_export import clustered_sample_ndvi

select_fields = ['043_000130', '043_000128', '043_000161']
shapefile_path = 'examples/1_Boulder/data/gis/mt_sid_boulder.shp'

# Extract SSEBop ETf with irrigation mask
clustered_sample_etf(
    shapefile_path,
    mask_type='irr',
    start_yr=2004,
    end_yr=2023,
    feature_id='FID_1',
    state_col='STATE',
    select=select_fields,
    dest='drive',
    drive_folder='swim',
    drive_categorize=True,
    model='ssebop',
    usgs_nhm=True,
)

# Extract Landsat NDVI with same parameters
clustered_sample_ndvi(
    shapefile_path,
    mask_type='irr',
    start_yr=2004,
    end_yr=2023,
    feature_id='FID_1',
    state_col='STATE',
    select=select_fields,
    satellite='landsat',
    dest='drive',
    drive_folder='swim',
    drive_categorize=True,
)

Example: Extract Multiple ETf Models for Ensemble¶

from swimrs.data_extraction.ee.etf_export import clustered_sample_etf

for model in ['ssebop', 'ptjpl', 'sims']:
    clustered_sample_etf(
        'data/gis/flux_fields.shp',
        mask_type='irr',
        start_yr=2000,
        end_yr=2023,
        feature_id='site_id',
        state_col='state',
        dest='drive',
        drive_folder=f'swim_etf_{model}',
        model=model,
    )

Monitoring EE Tasks¶

After submitting exports, monitor progress at: https://code.earthengine.google.com/tasks

Tasks are named with the field ID and date range for easy identification.
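
Task states can also be checked from Python. The sketch below assumes the earthengine-api package is installed and authenticated. ee.data.getTaskList() belongs to that package, while summarize_tasks and fetch_task_list are hypothetical helpers. The task records are assumed to carry state and description keys, and the sample descriptions are made up, not the actual SWIM-RS naming scheme.

```python
from collections import Counter


def summarize_tasks(tasks, prefix=""):
    """Count task states for tasks whose description starts with prefix."""
    states = Counter()
    for task in tasks:
        if task.get("description", "").startswith(prefix):
            states[task.get("state", "UNKNOWN")] += 1
    return dict(states)


def fetch_task_list():
    """Fetch recent task records from Earth Engine (needs authentication)."""
    import ee  # imported lazily so summarize_tasks works without EE installed

    ee.Initialize()  # may need a project argument depending on your setup
    return ee.data.getTaskList()


# Offline demonstration with made-up task records
sample = [
    {"description": "etf_043_000130", "state": "COMPLETED"},
    {"description": "etf_043_000128", "state": "RUNNING"},
    {"description": "ndvi_043_000130", "state": "COMPLETED"},
]
print(summarize_tasks(sample, prefix="etf_"))
```

With credentials in place, summarize_tasks(fetch_task_list()) would give a quick count of running, completed, and failed exports.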

Downloading Results¶

From Google Drive¶

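Use rclone or the Drive web interface to download CSVs:

# Sync from Drive to local
# (requires an rclone remote, here named gdrive, configured for Google Drive)
# Note: sync deletes local files that are absent from the remote; use rclone copy to keep them
rclone sync gdrive:swim data/remote_sensing/landsat/extracts/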

From Cloud Storage Bucket¶

gsutil -m rsync -r gs://your-bucket/swim/ data/remote_sensing/

Meteorology Extraction¶

GridMET data is fetched directly from THREDDS (no EE required):

from swimrs.data_extraction.gridmet.gridmet import get_gridmet

get_gridmet(
    shapefile='data/gis/fields.shp',
    output_dir='data/met_timeseries/gridmet',
    start_date='1987-01-01',
    end_date='2023-12-31',
    feature_id='site_id',
)

For ERA5-Land (international sites), run swim extract with the --international flag or use the ERA5 extraction utilities.

Tips¶

Rate limits: EE has quotas. For large extractions (>1000 fields), batch over multiple days or use select to limit fields per run.
Mask selection: The state_col parameter automatically selects IrrMapper (western US) or LANID (eastern US) for irrigation masks.
Drive vs Bucket: Drive exports are simpler but have size limits. Use bucket exports for large projects.
Sentinel-2: Available only from 2017 onward. For longer time series, use Landsat alone or fuse both in the container.
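
The batching advice in the first tip can be sketched as follows. batched_ids is a hypothetical helper, the batch size of 200 is an arbitrary assumption, and the field IDs are placeholders. The commented call reuses the clustered_sample_etf arguments from the examples in this guide.

```python
def batched_ids(field_ids, batch_size=200):
    """Yield consecutive lists of at most batch_size field IDs."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(field_ids), batch_size):
        yield field_ids[start:start + batch_size]


all_ids = [f"field_{i:04d}" for i in range(1050)]  # placeholder IDs
batches = list(batched_ids(all_ids, batch_size=200))
print(len(batches), len(batches[-1]))

# Each batch would then be passed through `select`, for example:
# for batch in batches:
#     clustered_sample_etf(shapefile_path, select=batch, ...)
```

Submitting one batch per run keeps the number of queued EE tasks bounded and makes it easier to resume after a quota error.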