2 Background

Currently, AquaMatch remote sensing products contain data from the historical Landsat Collection 2 record from 1984 until the end of 2024, including the following missions:

Landsat 4 Thematic Mapper (TM)
Landsat 5 TM
Landsat 7 Enhanced Thematic Mapper Plus (ETM+ or ETM)
Landsat 8 Operational Land Imager/Thermal Infrared Sensor (OLI/TIRS)
Landsat 9 Operational Land Imager 2/Thermal Infrared Sensor 2 (OLI-2/TIRS-2 or OLI/TIRS)

WARNING: the Landsat Surface Reflectance and Surface Temperature products should not be used without applying a handoff coefficient to harmonize data between missions. See Section 8 for details on this process.

There will be additional satellite data incorporated into AquaMatch in the future.

For in situ parameter matching, AquaMatch_siteSR currently includes sites from the WQP, NWIS, and any sites present in the AquaMatch_harmonize pipeline not available via WQP at the time of site collation. At the time of collation, the AquaMatch_harmonize pipeline contained data for Secchi disc depth (SDD) (De La Torre et al. 2025), chlorophyll a (chla) (Brousil et al. 2024), dissolved organic carbon (DOC) (Brousil, Meyer, Willi, Steele, and Ross 2025), and total suspended solids (TSS) (Brousil, Meyer, Willi, Steele, Gardner, et al. 2025).

2.1 lakeSR Code Architecture

lakeSR code is broken down into groups of targets that perform specific tasks, listed below with a brief summary about what task(s) each group completes. This particular workflow incorporates both R and Python programming languages to complete these tasks.

2.1.1 lakeSR {targets} groups

a_Calculate_Centers:

For all lakes, we aggregate remotely-sensed data via a central location with a buffer instead of the entire lake area to reduce processing time and the downstream impacts of lakes whose surface crosses World Reference System 2 (WRS-2) path-row boundaries. This {targets} list calculates the Pole of Inaccessibility (POI) (e.g. Stefansson 1920) for all non-intermittent lakes, ponds, and reservoirs greater than 1 hectare in surface area and intermittent lakes, ponds, and reservoirs greater than 4 hectares. It uses the National Hydrography Dataset Plus Version 2 (NHDPlusV2) polygons accessed with the {nhdplusTools} package (Blodgett and Johnson 2023), NHD Best Resolution files for non-CONUS waterbodies, and the poi() function in the {polylabelr} package (Larsson 2024). We limit the inclusion of small intermittent waterbodies under the assumption that these features are not consistently visible via remote sensing. See Section 3 for additional background and detailed methodology.
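The size thresholds above can be illustrated with a minimal Python sketch. The function name, record layout, and identifiers are hypothetical and are not taken from the lakeSR code. Only the two thresholds (greater than 1 hectare for non-intermittent features, greater than 4 hectares for intermittent features) come from the text.

```python
def keep_waterbody(area_ha: float, intermittent: bool) -> bool:
    """Return True if a waterbody passes the size thresholds described in the text."""
    # Thresholds from the text: > 1 ha (non-intermittent), > 4 ha (intermittent).
    min_area_ha = 4.0 if intermittent else 1.0
    return area_ha > min_area_ha


# Hypothetical records: (identifier, surface area in hectares, intermittent flag)
waterbodies = [
    ("wb_a", 0.8, False),
    ("wb_b", 2.5, False),
    ("wb_c", 2.5, True),
    ("wb_d", 6.0, True),
]

kept = [wb_id for wb_id, area, inter in waterbodies if keep_waterbody(area, inter)]
print(kept)  # ['wb_b', 'wb_d']
```

Note that a 2.5 hectare feature is kept when it is non-intermittent but dropped when it is intermittent, which is the asymmetry the paragraph describes.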

This group is either run completely or pulled from existing files based on the lakeSR general configuration file using the boolean calculate_centers setting. If set to FALSE, a version date must be provided in the centers_version setting. Additional guidance is provided below in Section 2.1.3, in the README, and in the general configuration file of the lakeSR repository (config.yml).
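The "run or reuse" rule can be sketched as a small check. This is an illustration in Python, not the pipeline's R implementation. The function name and the returned strings are hypothetical. The setting names calculate_centers and centers_version come from the text, and the version date shown is only an example value.

```python
def resolve_centers_step(config: dict) -> str:
    """Decide whether to compute centers or load a previously versioned file."""
    if config.get("calculate_centers", False):
        return "run"
    # When calculate_centers is FALSE, a version date is required.
    version = config.get("centers_version")
    if not version:
        raise ValueError(
            "centers_version must be provided when calculate_centers is FALSE"
        )
    return f"load:{version}"


print(resolve_centers_step({"calculate_centers": True}))
# run
print(resolve_centers_step({"calculate_centers": False,
                            "centers_version": "2025-02-12"}))
# load:2025-02-12
```

The same pattern would apply to any other pairing of a boolean setting with a required version date.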

b_pull_Landsat_SRST_poi:

This {targets} group uses the Google Earth Engine configuration file b_pull_Landsat_SRST_poi/config_files/config_poi.yml and the POI points created in the a_Calculate_Centers group to pull Landsat Collection 2 Surface Reflectance (SR) and Surface Temperature (ST) using the Google Earth Engine (GEE) Application Programming Interface (API). In this group, we use the most conservative LS4-7 pixel filters, as we are applying these settings across such a large continuum of time and space. This group ends with a branched target that sends tasks to GEE by mapping over WRS-2 path rows that intersect with the points created in the a_Calculate_Centers group. This task-submission process is managed by the {targets} workflow, which checks the number of running or queued tasks and only sends additional tasks when the total number of tasks is below a certain threshold.
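The throttled submission described above can be sketched as follows. This is a simplified illustration under stated assumptions: FakeTaskQueue is a stand-in and does not use the GEE API, and the function name, path-row strings, and threshold value are hypothetical. In a real workflow the wait step would sleep and then poll the task list again.

```python
from collections import deque


def submit_with_throttle(path_rows, count_active, submit, max_active, wait):
    """Submit one task per path row without exceeding max_active running/queued tasks."""
    pending = deque(path_rows)
    while pending:
        if count_active() < max_active:
            submit(pending.popleft())
        else:
            wait()  # in practice: sleep, then re-check the task list


class FakeTaskQueue:
    """Stand-in for a remote task service (hypothetical; not the GEE API)."""

    def __init__(self):
        self.active = []
        self.submitted = []
        self.peak = 0

    def count_active(self):
        return len(self.active)

    def submit(self, path_row):
        self.active.append(path_row)
        self.submitted.append(path_row)
        self.peak = max(self.peak, len(self.active))

    def finish_oldest(self):
        # Pretend the oldest task completed while we were waiting.
        self.active.pop(0)


queue = FakeTaskQueue()
path_rows = [f"{p:03d}{r:03d}" for p in (14, 15) for r in (29, 30, 31)]
submit_with_throttle(path_rows, queue.count_active, queue.submit,
                     max_active=2, wait=queue.finish_oldest)
print(queue.submitted == path_rows, queue.peak)  # True 2
```

Every path row is submitted exactly once and in order, while the number of active tasks never exceeds the threshold.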

Note: this group of targets takes a very long time, running 2 minutes to 1 hour per path-row branch in b_eeRun_poi. There are just under 800 path rows executed in this target. Anecdotally, processing time is often determined by the number of queued tasks globally, so weekends and nights are often periods of quicker processing than weekdays during business hours. As written for data publication, run time is 7-10 days.

See Section 4 for details on software used in this workflow, Section 5 for additional background on Landsat data, and Section 6 for detailed methodology of the Landsat pull.

This group is either run completely or pulled from existing files based on the lakeSR general configuration file using the boolean run_GEE setting. If set to FALSE, a version date must be provided in the collated_version setting. Additional guidance is provided below in Section 2.1.3, in the README, and in the GEE configuration file of the lakeSR repository (“b_pull_Landsat_SRST_poi/config_files/config_poi.yml”).

c_collate_Landsat_data:

This {targets} list collates the data from the Google Earth Engine run orchestrated in the {targets} group b_pull_Landsat_SRST_poi and creates publicly-available files for downstream use, storing a dataframe of Drive identifiers in a .csv in the c_collate_Landsat_data/out/ folder.

This group is either run completely or pulled from existing files based on the lakeSR general configuration file using the boolean run_GEE setting. If set to FALSE, a version date must be provided in the collated_version setting. Additional guidance is provided below in Section 2.1.3, in the README, and in the general configuration file of the lakeSR repository (“config.yml”).

d_qa_filter_sort:

This {targets} list applies some rudimentary post-hoc QA to the Landsat stacks and saves them as sorted files locally. If update_and_share is set to TRUE, the workflow will apply QA procedures to the collated data from the {targets} group c_collate_Landsat_data, send dated, publicly available files to Google Drive, and save Drive file information in the d_qa_filter_sort/out/ folder. If set to FALSE, no QA will be completed and no files will be sent to Drive; instead, the pipeline will pull existing collated and QA’d datasets based on the general configuration setting qa_version.

See Section 7 for detailed methodology of this portion of the workflow.

e_calculate_handoffs:

This {targets} group creates “matched” data for two different “intermission handoff” methods that standardize the SR values relative to Landsat 7 and to Landsat 8. Handoffs are visualized and are saved as tables for use downstream in this group. Corrections are calculated for all neighboring missions, even if not explicitly used downstream (e.g. Landsat 4/5 and Landsat 8/9).

See Section 8 for detailed methodology of this portion of the workflow.

y_siteSR_targets:

This {targets} group pulls information from the siteSR workflow to use in the bookdown. If the configuration setting update_bookdown is set to FALSE, this list will be empty.

z_render_bookdown:

This {targets} group tracks chapters of the bookdown for changes and renders the bookdown. If the configuration setting update_bookdown is set to FALSE, this list will be empty.

2.1.2 lakeSR Output Files

This workflow creates a number of files:

locations file (lakeSR_poi_with_flags_2025-02-12.csv) containing all sites within the lakeSR dataset, defined by the National Hydrography Dataset. This file also contains columns that link the location (Latitude/Longitude) to NHD waterbody features, and flags meant to aid in data interpretation. See Section 3.1 for more information about this file. This file has the same content as the {targets} object a_poi_with_flags, which can be loaded into the RStudio environment using tar_load(a_poi_with_flags).
remote sensing data file(s) for sites (e.g. lakeSR_Landsat8_DSWE1_2025-06-04.feather, lakeSR_HUC2_18_Landsat7_DSWE1_2025-06-04.csv) which contain the lakeSR identifier (lakeSR_id, can be matched with the locations file), numerical summaries of remote sensing data for each site and satellite image, as well as columns containing summarized information about pixels that have been masked in the Google Earth Engine QA and filtering process. The feather files and the csv files (zipped per mission) contain the same data; the csv files are an alternate data format in which the data have been grouped by the two-digit Hydrologic Unit Code (“HUC2”). See Sections 5, 6, and 7 for additional details.
remote sensing scene-level metadata file(s) (e.g. lakeSR_collated_metadata_LS89_export_2022-06-04.csv, lakeSR_collated_metadata_LS457_export_2025-06-04.csv) which contain a subset of columns of the scene-level metadata for each Landsat image within the lakeSR remote sensing data file. This can be joined to the remote sensing data using the sat_id column. While we reduce the number of columns from the upstream data during the remote sensing data file collation, we neither process the metadata further nor suggest specific uses; instead, we encourage users to use this metadata file for further data interpretation.
intermission handoff file(s) (e.g. lakeSR_collated_handoffs_GEEv2025-02-12_QAv2025-06-04.csv) which contain coefficients for aligning data across multiple Landsat missions for timeseries analysis. See Section 8 for details.
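The joins between these files can be sketched with Python's standard library. This is a minimal illustration using tiny in-memory tables rather than the published files. The key columns lakeSR_id and sat_id come from the text. All other column names and all values (med_Red, CLOUD_COVER, the coordinates, and the scene identifier) are hypothetical placeholders.

```python
import csv
import io

# Tiny in-memory stand-ins for the three files (values are illustrative only).
sr_csv = io.StringIO(
    "lakeSR_id,sat_id,med_Red\n"
    "1,scene_A,0.031\n"
    "2,scene_A,0.045\n"
)
meta_csv = io.StringIO(
    "sat_id,CLOUD_COVER\n"
    "scene_A,12.5\n"
)
loc_csv = io.StringIO(
    "lakeSR_id,Latitude,Longitude\n"
    "1,40.1,-105.2\n"
    "2,40.3,-105.6\n"
)


def index_by(rows, key):
    """Build a lookup table keyed on one column."""
    return {row[key]: row for row in rows}


locations = index_by(csv.DictReader(loc_csv), "lakeSR_id")
metadata = index_by(csv.DictReader(meta_csv), "sat_id")

joined = []
for row in csv.DictReader(sr_csv):
    merged = dict(row)
    merged.update(metadata.get(row["sat_id"], {}))       # scene-level metadata
    merged.update(locations.get(row["lakeSR_id"], {}))   # site location
    joined.append(merged)

print(joined[0]["CLOUD_COVER"], joined[1]["Latitude"])  # 12.5 40.3
```

In R, the same result would typically be obtained with two left joins on the sat_id and lakeSR_id columns.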

2.1.3 Configuration Files

config.yml (“General Configuration”)

lakeSR relies on this general configuration file to run specific profiles that determine what operations are being run. The file contains two configuration profiles: “default” and “admin_update”.

“default” uses publicly-posted stable versions of datasets from AquaMatch_harmonize_WQP and is intended for those who wish to modify choices made after Landsat stacks are acquired (e.g. baseline quality assurance filters {targets} group -d- and calculation of handoff coefficients group -e-). By default, the bookdown is not rendered in this option, as changes to the pipeline may require non-automated changes to the text of the bookdown.
“admin_update” is intended for use by ROSSyndicate members when updating lakeSR datasets and by default creates publicly-stable versions of the lakeSR dataset in Google Drive, and the Drive file identifiers are stored in the AquaMatch_lakeSR repository for external users. This configuration could also be used by others who wish to re-run the pipeline and make changes to the site location calculations ({targets} group -a-) or GEE implementation ({targets} group -b-) and collation ({targets} group -c-). By default, the bookdown is rendered in this option, but keep in mind that not all implemented changes are automatically reflected in the bookdown.

Advanced users are welcome to adapt the pipeline to incorporate other masks or filters or to incorporate other quality assurance filters or handoff methods.
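As a rough illustration of how two profiles can share one file, the sketch below resolves a named profile by layering its settings over "default". This assumes that a non-default profile only needs to list the settings it overrides, which is how the R {config} package treats profiles, although the text does not state which mechanism lakeSR uses. The setting values shown are hypothetical and are not copied from the repository's config.yml.

```python
def resolve_profile(profiles: dict, name: str = "default") -> dict:
    """Return the settings for a profile, falling back to 'default' values."""
    settings = dict(profiles["default"])
    if name != "default":
        settings.update(profiles[name])  # overrides win over defaults
    return settings


# Hypothetical values for illustration only.
profiles = {
    "default": {
        "calculate_centers": False,
        "centers_version": "2025-02-12",
        "run_GEE": False,
        "update_and_share": False,
    },
    "admin_update": {
        "calculate_centers": True,
        "run_GEE": True,
        "update_and_share": True,
    },
}

active = resolve_profile(profiles, "admin_update")
print(active["run_GEE"], active["centers_version"])  # True 2025-02-12
```

Under this assumption, a setting that is absent from "admin_update" (here centers_version) keeps its "default" value.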

b_pull_Landsat_SRST_poi/config_files/config_poi.yml (“Google Earth Engine Configuration”)

This configuration file defines the parameters of the GEE pull and can be customized. To create a different configuration of the GEE pull, fill out the yaml file at the file path b_pull_Landsat_SRST_poi/config_files/config_poi.yml. If you change the name of this file, you will need to update the file name at line 32 of the _targets.R file.

2.1.4 Repository Folder Structure

The lakeSR folder structure is as follows:

|-- AquaMatch_lakeSR
    |-- README.md
    |-- run_targets.Rmd
    |-- config.yml
    |-- _targets.R
    |-- a_Calculate_Centers.R
    |-- a_Calculate_Centers
        |-- src
            |-- calculate_bestres_centers.R
            |-- calculate_centers_HUC4.R
    |-- b_pull_Landsat_SRST_poi.R
    |-- b_pull_Landsat_SRST_poi
        |-- config_files
            |-- config_poi.yml
        |-- in
            |-- WRS2_descending.shp
        |-- py
            |-- check_for_failed_tasks.py
            |-- poi_wait_for_completion.py
            |-- run_GEE_per_pathrow.py
        |-- src
            |-- check_if_fully_within_pr.R
            |-- format_yml.R
            |-- get_WRS_pathrow_poi.R
            |-- reformat_locations.R
            |-- run_GEE_per_pathrow.R
    |-- c_collate_Landsat_data.R
    |-- c_collate_Landsat_data
        |-- src
            |-- collate_csvs_from_drive.R
            |-- download_csvs_from_drive.R
    |-- d_qa_filter_sort.R
    |-- d_qa_filter_sort
        |-- src
            |-- prep_LS_metadata_for_export.R
            |-- qa_and_document_LS.R
            |-- sort_qa_Landast_data.R
    |-- e_calculate_handoffs.R
    |-- e_calculate_handoffs
        |-- src
            |-- calc_quantiles.R
            |-- calculate_gardner_handoff.R
            |-- calculate_roy_handoff.R
            |-- get_matches.R
            |-- get_quantile_values.R
    |-- y_siteSR_targets.R
    |-- z_render_bookdown.R
    |-- python
        |-- pySetup.R
    |-- src
        |-- export_single_file.R
        |-- export_single_target.R
        |-- retrieve_data.R
        |-- retrieve_target.R
    |-- bookdown
        |-- index.Rmd
        |-- 01-Background.Rmd
        |-- 02-Data_Acquisition_Locations.Rmd
        |-- 03-Acquisition_Software_Settings.Rmd
        |-- 04-Landsat_C2_SRST.Rmd
        |-- 05-lakeSR_LS_C2_SRST.Rmd
        |-- 06-post_hoc_qa.Rmd
        |-- 07-intermission_handoffs.Rmd
        |-- z-Refs.Rmd
        |-- refs.bib
        |-- _bookdown.yml
        |-- _output.yml

2.2 siteSR Code Architecture

siteSR code retrieves WQP sites using dataRetrieval::whatWQPsites(), NWIS sites using dataRetrieval::whatNWISsites(), and any remaining WQP sites at which there are historical in situ data from the AquaMatch harmonization pipeline. As with lakeSR, these data should not be used across multiple satellite missions without the application of inter-mission handoff coefficients for interoperability between sensors (see Section 8).

2.2.1 {targets} Groups

_targets.R:

This initial group of targets checks the configuration settings in config.yml.

a_compile_sites:

This {targets} group collates the sites from the WQP, NWIS, and the AquaMatch harmonization pipeline, creating a list of locations for which to acquire remote sensing data. All locations are associated with an eight-digit Hydrologic Unit Code (HUC8), regardless of whether one is listed in the WQP metadata for the site. Where a HUC8 could be assigned, we then associate points with waterbodies and flowlines of the NHDPlusV2 for sites in CONUS, or of the NHD Best Resolution datasets for non-CONUS HUCs. This is a very computationally-intensive group of functions and can take multiple days to run if compile_locations in the config file is set to TRUE. We do not recommend running this without at least 64 GB of memory available on your computer.

See section 3.2 for details on this process.
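Because Hydrologic Unit Codes are hierarchical, the two-digit HUC2 used to group some output files is the first two digits of a site's HUC8. The Python sketch below shows that relationship. The function name and the example codes are illustrative assumptions, not part of the siteSR code. The zero-padding step handles the case where a HUC8 was read as a number and lost its leading zero.

```python
def huc2_from_huc8(huc8) -> str:
    """Return the two-digit HUC2 prefix of an eight-digit HUC8 code."""
    code = str(huc8).zfill(8)  # restore a leading zero lost by numeric parsing
    if len(code) != 8 or not code.isdigit():
        raise ValueError(f"not a valid HUC8: {huc8!r}")
    return code[:2]


# Illustrative codes only; the second was read as an integer and lost its zero.
print(huc2_from_huc8("10190007"))  # 10
print(huc2_from_huc8(1080205))     # 01
```

Keeping HUC codes as character strings throughout avoids the leading-zero problem altogether.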

b_determine_RS_visibility:

The resulting list of in situ locations is used to assess remote-sensing visibility in this {targets} group. Sites are assessed for visibility using the European Commission’s Joint Research Centre (JRC) Global Surface Water product (Pekel et al. 2016), which is based on the historical Landsat record. This {targets} group takes a number of hours to run if the configuration setting run_pekel is set to TRUE. If it is set to FALSE, run time will depend on your internet connection (to access the previously-created files) and the number of cores available to run the workflow.

See section 3.2.1 for details on this process.

c_siteSR_stack:

This group of {targets} acquires Landsat Collection 2 SR & ST stacks for sites determined to be visible in the previous {targets} group. Data are collated in this step. This {targets} group takes about one week to run, if the configuration run_GEE is set to TRUE. If the configuration is set to FALSE, run time will be dependent on your internet connection (to access the previously-created files) and the number of cores available to run the workflow.

See section 6 for details on this process.

d_siteSR_qa:

Collated Landsat data are filtered for quality based on broadly applicable thresholds and exported from the pipeline for data archiving. The resulting data are archived in Google Drive and the Drive file identifiers are stored in this repository for external access.

See section 7.2 for details on this process.

2.2.2 siteSR Output Files

This workflow creates a number of files:

locations file (siteSR_collated_WQP_NWIS_sites_with_NHD_info_2025-06-04.csv) containing all sites from the WQP and NWIS at the time of running the pipeline, as well as any additional sites from the AquaMatch pipeline (for Secchi disc depth, chlorophyll a, dissolved organic carbon, and total suspended solids) that were not present in the WQP site pull. This file also contains columns attributing a HUC8, NHD features (waterbodies and/or flowlines), and flags meant to aid in data interpretation. See Section 3.2 for more information about this file. This file has the same content as the {targets} object a_sites_with_NHD_info, which can be loaded into the RStudio environment using tar_load(a_sites_with_NHD_info).
remote sensing data file(s) for sites (e.g. siteSR_Landsat4_DSWE1_2025-06-06.feather, siteSR_HUC2_22_Landsat8_DSWE1_2025-06-06.csv) which contain the siteSR identifier (can be matched with the locations file), numerical summaries of remote sensing data for each site and satellite image, as well as columns containing summarized information about pixels that have been masked in the Google Earth Engine QA and filtering process. See sections 5, 6, and 7 for additional details.
remote sensing scene-level metadata file(s) (e.g. siteSR_collated_metadata_LS89_2025-06-06.csv, siteSR_collated_metadata_LS457_2025-06-06.csv) which contain a subset of columns of the scene-level metadata for each Landsat image within the siteSR remote sensing data file. This can be joined to the remote sensing data using the sat_id column. While we reduce the number of columns from the upstream data during the remote sensing data file collation, we neither process the metadata further nor suggest specific uses; instead, we encourage users to use this metadata file for further data interpretation.
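The remote sensing data file names listed above encode the product, an optional HUC2 group, the Landsat mission, the DSWE setting, and a version date. The Python sketch below parses those parts. The regular expression is inferred only from the example file names in this section, so other naming variants may exist that it does not cover, and the function name is hypothetical.

```python
import re

# Pattern inferred from the example file names in this section (an assumption).
FILENAME_PATTERN = re.compile(
    r"^(?P<product>lakeSR|siteSR)_"
    r"(?:HUC2_(?P<huc2>\d{2})_)?"        # optional HUC2 group
    r"Landsat(?P<mission>\d)_"
    r"(?P<dswe>DSWE[0-9A-Za-z]+)_"
    r"(?P<version>\d{4}-\d{2}-\d{2})"
    r"\.(?P<ext>csv|feather)$"
)


def parse_data_filename(filename: str) -> dict:
    """Split a remote sensing data file name into its labelled parts."""
    match = FILENAME_PATTERN.match(filename)
    if match is None:
        raise ValueError(f"unrecognized file name: {filename}")
    return match.groupdict()


info = parse_data_filename("siteSR_HUC2_22_Landsat8_DSWE1_2025-06-06.csv")
print(info["huc2"], info["mission"], info["version"])  # 22 8 2025-06-06

info = parse_data_filename("siteSR_Landsat4_DSWE1_2025-06-06.feather")
print(info["huc2"], info["ext"])  # None feather
```

Parsing the version date out of the file name makes it straightforward to confirm that data, metadata, and locations files come from compatible pipeline runs.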

2.2.3 Configuration Files

config.yml (“General Configuration”):

siteSR_WQP relies on this configuration file to run specific profiles that determine what operations are being run. The file provides two configuration profiles: “default” and “admin_update”.

“default” runs the pipeline using archived versions of datasets made with specific {targets} groups within this workflow. In this setting, {targets} groups -a- (collating locations from WQP, NWIS, and AquaMatch harmonization pipeline), -b- (determining remote sensing visibility), and -c- (pulling the Landsat record from GEE) are not run, but rather archived files are pulled in to reduce run time.
“admin_update” is intended for use by ROSSyndicate members when updating siteSR datasets and by default creates publicly-stable versions of the siteSR dataset in Google Drive, and the Drive file identifiers are stored in the AquaMatch_siteSR repository for external users. This configuration could also be used by others who wish to re-run the pipeline and make changes to the site location collation ({targets} group -a-), remote sensing visibility thresholds ({targets} group -b-), or GEE implementation ({targets} group -c-).

All repositories stored on the AquaMatch GitHub will contain files that link to versions of the data that the AquaMatch team has harmonized so that a local run is not necessitated unless changes are made in upstream steps of the workflow.

gee_config.yml (“Google Earth Engine Configuration”):

This configuration file defines the parameters of the GEE pull and can be customized. To create a different configuration of the GEE pull, fill out the yaml file at the file path gee_config.yml. If you change the name of this file, you will need to update the file name at line 9 of the b_determine_RS_visibility.R script. No GEE configuration is required if run_pekel and run_GEE are both set to FALSE in the general configuration file.
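The rule in the last sentence can be written as a small pre-flight check. This is a hypothetical Python sketch, not part of the siteSR code. The setting names run_pekel and run_GEE and the file name gee_config.yml come from the text, while the function names and error message are illustrative.

```python
from pathlib import Path


def needs_gee_config(settings: dict) -> bool:
    """A GEE configuration is only needed if either GEE-backed step will run."""
    return bool(settings.get("run_pekel")) or bool(settings.get("run_GEE"))


def check_gee_config(settings: dict, config_path: str = "gee_config.yml") -> None:
    """Raise early if a GEE step is requested but the configuration file is missing."""
    if needs_gee_config(settings) and not Path(config_path).is_file():
        raise FileNotFoundError(
            f"{config_path} is required when run_pekel or run_GEE is TRUE"
        )


print(needs_gee_config({"run_pekel": False, "run_GEE": False}))  # False
print(needs_gee_config({"run_pekel": True, "run_GEE": False}))   # True
```

Running a check of this kind before the pipeline starts avoids discovering a missing configuration file partway through a long run.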

2.2.4 Repository Folder Structure

The siteSR folder structure is as follows:

|-- AquaMatch_siteSR
    |-- README.md
    |-- run_targets.Rmd
    |-- _targets.R
    |-- config.yml
    |-- gee_config.yml
    |-- a_compile_sites.R
    |-- a_compile_sites
        |-- src
            |-- add_HUC8_to_sites.R
            |-- add_NHD_flowline_to_sites.R
            |-- add_NHD_waterbody_to_sites.R
            |-- get_site_info.R
            |-- harmonize_crs.R
    |-- b_determine_RS_visibility.R
    |-- b_determine_RS_visibility
        |-- py
            |-- run_pekel_per_pathrow.py
            |-- wait_for_completion.py
        |-- in
            |-- WRS2_descending.shp
        |-- src
            |-- check_for_containment.R
            |-- download_csvs_from_drive.R
            |-- format_yml.R
            |-- get_WRS_pathrows.R
            |-- grab_locs.R
            |-- run_pekel_per_pathrow.R
    |-- c_siteSR_stack.R
    |-- c_siteSR_stack
        |-- py
            |-- check_for_failed_tasks.py
            |-- run_siteSR_per_pathrow.py
            |-- siteSR_wait_for_completion.py
        |-- src
            |-- check_if_fully_within_pr.R
            |-- collate_csvs_from_drive.R
            |-- download_csvs_from_drive.R
            |-- run_stieSR_per_pathrow.R
    |-- d_qa_stack.R
    |-- d_qa_stack
        |-- src
            |-- prep_Landsat_for_export.R
            |-- qa_and_document_LS.R
    |-- src
        |-- export_single_file.R
        |-- export_single_target.R
        |-- get_file_ids.R
        |-- retrieve_data.R
        |-- retrieve_target.R
    |-- python
        |-- pySetup.R