1 AquaMatch data harmonization process

This bookdown documents the harmonization process for data downloaded from the Water Quality Portal (WQP) used to build the AquaMatch dataset, a data product of AquaSat v2. AquaSat v2 is a forthcoming update to AquaSat (Ross et al., 2019). The overarching purpose of AquaSat V2 is to emphasize the individual parts of the AquaSat pipeline that make-up the matchups between satellite and in-situ measurements. The WQP is a data warehouse for water-related data measured or observed within the United States and US Territories managed by the Environmental Protection Agency (EPA), United States Geological Survey (USGS), and the National Water Quality Monitoring Council (NWQMC). AquaMatch collates water quality measurements and observations with optical and thermal satellite data from lakeSR and riverSR.

All data in this workflow are processed as parameter groups, a high-level grouping used in AquaMatch for specific water quality measures found in the WQP. This document first describes selections made while downloading WQP data using the {dataRetrieval} package (De Cicco et al., 2023). It then walks through how the chlorophyll a, dissolved organic carbon (DOC), Secchi disc depth, and total suspended solids (TSS) parameter groups are each prepared (“pre-harmonization”), and then filtered and harmonized.

Note: for the purposes of this bookdown, many terms surrounded by backticks are column names from the dataset being referenced. Others are code snippets or data values. For those that are column names, a full list of definitions for columns present in the original WQP dataset can be found in Table 7 here.

Data are contributed to the WQP by a wide range of data providers (organizationIdentifier and organizationFormalName) with varying sampling and analysis methods and multiple characteristicNames. characteristicNames are defined by the WQP as:

The object, property, or substance which is evaluated or enumerated by either a direct field measurement, a direct field observation, or by laboratory analysis of material collected in the field.

Parameter groups are comprised of data from multiple characteristicNames in the WQP. For example, chlorophyll-related characteristicNamesinclude “Chlorophyll a”, “Chlorophyll a, corrected for pheophytin”, “Chlorophyll b”, and “Chlorophyll c” among several others. Only a subset of these characteristicNames are included in our “chlorophyll a” parameter group. As mentioned previously, measurements and observations within the WQP for a given parameter group are often sampled and analyzed using a variety of methods. Some of these methods for sampling and analysis are directly interoperable and others are not. The processing steps described within this bookdown walk through how we harmonize the data for each parameter group while flagging or removing data that may not have enough information associated with it. The product of this is a database of WQP entries with interoperability and reliability in mind.

The result of this harmonization procedure is two datasets per parameter group. The first is a harmonized dataset that includes all of the original WQP columns in addition to those introduced in this workflow. The second is the output of all harmonization steps, including aggregation to the mean value where simultaneous records occur. We define simultaneous records as those measured on the same day by the same organizationIndentifier at the same location with the same harmonized depth criteria. We believe that most users will only use this aggregated dataset, but we provide the unaggregated version for those who would like to make different decisions at any point in the tiering, flagging, or aggregation steps.