3 Locations of Data Acquisition

Technical Note

This process is completed in the targets group a_calculate_centers. The primary output file is the lakeSR site file (lakeSR_poi_with_flags_2025-02-12.csv).

Background

As noted in the Introduction (Section 1), for the purposes of AquaMatch, surface reflectance and surface temperature data are acquired at specific, centrally-located points within waterbodies (typically in deep locations of lakes, lakeSR) and at locations where there are in situ data (siteSR). For the purposes of lakeSR, we consider “deep” locations to be relative to a waterbody and defined by the NHD waterbody feature, where a deep location is far from the shoreline indicated by the NHD (see Section 3.1). The data acquired at locations where there are in situ data are meant to create location-specific algorithms using the AquaMatch database, which can then be applied to the data collected over the centrally-located point across all waterbodies in the lakeSR database. lakeSR does not acquire nor summarize data over the entire waterbody’s surface, as it is computationally impractical for most large lakes, especially those that cross multiple satellite path-rows or tiles. For the lakeSR product, we summarize the Landsat data within a 120 meter radius of a defined location. For siteSR, we use a 200 meter buffer.

AquaMatch products summarize data within a buffered area of unique locations primarily to reduce computation time versus aggregating the entire lake area and to reduce issues with lakes that lie within more than one WRS-2 path-row which would require additional data handling steps at both the remote sensing and post-hoc filter steps to provide consistent data summaries. This may mean that each waterbody has differing proportions of the area represented in the data summaries. Users may wish to consider total surface area of waterbodies (provided in the lakeSR site file in the column areasqkm) when analyzing the remote sensing data.

3.1 Waterbodies Included in lakeSR

For lakeSR, we use the NHDPlusV2 dataset for lakes within the conterminous US using the {nhdplusTools} R package (Blodgett and Johnson 2023) and the NHD Best Resolution (e.g. US Geological Survey 2023b) data through The National Map for lakes outside of the extent of the conterminous US. Two versions are used because NHDPlusV2 waterbodies are not available outside of CONUS (oCONUS). The NHD Best Resolution data are of slightly higher resolution relative to NHDPlusV2 and are complete and available for all states and territories oCONUS. All waterbody polygons were downloaded and processed on 2025-02-12 by HUC4 using the most updated version available at the time of download.

Drawing of the Continental United States with HUC2 boundaries indicated with thick red lines and HUC4 boundaries indcated with thinner red lines. Insets of HUC2 and HUC4 shown for Alaska, Hawaii, Caribbean Region, South Pacific Region.

Figure 3.1: NHD HUC4 map for the United States and territories, courtesy of the USGS.

For every HUC4 in the United States and territories (Figure 3.1), all waterbodies are limited to those with NHD Waterbody Subtypes belonging to the following groups: 390 (lake/pond) and 436 (reservoir) that are at least 0.01 km2 (1 hectare) in area according to the area value provided in the NHD file. If the feature type (FType) of the waterbody is assigned to an intermittent category (39001, 39005, 39006, 43614) the threshold for inclusion was increased to 0.04 km2 (4 hectares) to reduce processing time when extracting data from GEE assuming that intermittent waterbodies smaller than 4 hectares would not normally be “visible” in remote sensing images (see Section 3.2.1). This filtering resulted in 729,941 waterbodies included in our dataset, including 313,248 oCONUS waterbodies. This is a 1,185 percent increase over the 56,792 lakes included in the original AquaSat product.

For each waterbody, the pole of inaccessibility (see Section 3.1.2) and distance-to-shore radius was calculated using the polylabelr::poi() function.

3.1.1 Updates in AquaMatch

Some changes in lake polygon and center point have been made in modernizing and scaling from LimnoSat-US. The lake center aspect called “deepest point” of LimnoSat-US was built upon HydroLakes (Messager et al. 2016), a global database of lakes greater than 10 hectares accounting for 1.4 million waterbodies and a total surface area of 2.67 million km² worldwide. While this dataset of lakes represents ~55% of the worldwide surface area of lakes greater than 1 hectare, it is only a sliver of the estimated 27 million waterbodies in the world (Verpoorter et al. 2014). AquaMatch uses the USGS’s National Hydrography products which map the surface waters of the United States, and allows for reduction in minimum size of waterbody and an increase in coverage of freshwater systems across the United States and permanently inhabited territories.

3.1.2 Pole of Inaccessibility

The USGS National Hydrography products contain smaller and higher resolution polygons than the HydroLakes shapes, which makes it computationally impossible to use the Chebyshev Center (“deepest point,” Yang 2020) calculation used in LimnoSat-US due to the number of vertices in each polygon. To replace this important step in the update, we employ the concept of pole of inaccessibility (POI, Stefansson 1920).

The concept of POI is used to define the geographic center of a circle with the largest circumference within any complex polygon. The foundational principle is used widely to describe the arctic pole of inaccessibility, that is the point in the northern arctic circle that is the furthest from land, but has also been used to describe the geographic center of landmasses (Garcia-Castellanos and Lombardo 2007). For lakeSR, we use POI to determine the point in a waterbody that is furthest from the shoreline using the polylabelr::poi() function (Larsson 2024), which calculates a point in space and the radius of the circle used to define the POI.

The poi() function can sometimes attribute a POI to a location other than the point furthest away from a shoreline. This appears to occur in features with a very large number of indices (due to sheer area or geomorphological complexity); however the points calculated should be an acceptable proxy for deep lake conditions operating under the assumption that lake depth increases as distance from shore increases. No action was taken to correct or enumerate instances where this occurred within our dataset given the assumption that the point is still far from shore (though not always the “farthest”) and there is no obvious programmatic way to identify these points. Additionally, we did not compare the LimnoSat-US locations with AquaMatch_lakeSR locations (even for pseudonymous lakes) since the spatial extent of the polygons used do not share geometries and comparing locations between these products was outside of the scope of lakeSR at this time.

3.1.3 Technical Implementation of lakeSR

The processing begins by acquiring the polygons of all US states and territories using the {tigris} package (Walker 2025). These polygons are used to acquire a list of HUC4s that intersect with each municipal boundary using the nhdplusTools::get_huc() function, which are then reduced to distinct HUC4’s and transformed into a vector of HUC4s. HUC4s are then split into CONUS (HUC4 < 1900) and oCONUS (HUC4 ≥ 1900) groups. To efficiently calculate the POI across hundreds of HUC4s, we use the {targets} dynamic branching feature to iterate over each HUC4. For each HUC4, the NHDPlusV2 (CONUS) or NHD Best Resolution (oCONUS) waterbodies are acquired and filtered for lake/ponds and reservoirs of at least 1 hectare in area or 4 hectares for intermittent lake/ponds or reservoirs.

In order to accurately calculate distance-to-shore when using the polylabelr::poi() function as described in Section 3.1, each waterbody was converted to the proper Universal Transverse Mercator (“UTM”) projection calculated from the mean longitudinal value of the polygon vertices prior to applying the poi() function. By using the point-local UTM projection, we decrease distortion expected from any single coordinate reference system used to represent all of the locations from which we have lakes. The latitude and longitude values of the POI were transformed to decimal degrees in World Geodetic System 1984 (WGS84, EPSG:4326) from UTM easting and northing coordinates for use later in the workflow. Within the data presented in optical remote sensing products from AquaMatch, we do not currently flag reflectance data for possible bottom reflectance signals (that is, we do not signify whether the optical reflectance values are wholly attributable to the water column or whether there may be signals of bottom reflectance due to clear water/shallow bathymetry at the given location). Users should consider that, especially at sites where the distance to shore is small or where the waterbody is particularly shallow, that there may be bottom reflectance introduced into the remote sensing summaries. We encourage users to consider this in their analysis and interpretation of the data.

Note: AquaMatch assumes static water elevation and inundation area based on the associated NHD features. Given this, users should consider whether elevational changes in surface water height (and therefore inundation area) may impact their analyses or interpretation of the information provided in the location data.

To increase computational efficiency, we allow for multicore processing and the use of targets::crew_controller_local() function within this workflow. This reduces processing time substantially as processing thousands of polygons is quite time consuming. If you are running this workflow on your own computer, the length of time that it takes to calculate POIs will be dependent on the number of cores you allow for processing (we used 11 cores during the development and operation of this workflow).

3.1.4 Flagging shoreline proximity

Within the context of lakeSR, we do not identify whether there are multiple waterbodies/visible water sources within the site buffer used for lakeSR or siteSR extraction. We do offer two flags to help users assess possibilities of shoreline contamination through a simple calculation. This calculation is based on the distance to shore calculated in the POI step, which does not account for differing shoreline locations due to water surface level change. A flagged site for shoreline proximity also does not imply that the data are contaminated, rather is intended as a diagnostic tool to identify sites that may be more likely to have some sort of shoreline contamination. Because the lakeSR pipeline uses a dynamic water filter for each point and each Landsat scene (see Section 6.3.2), shoreline and mixed pixels should be masked from the lakeSR data.

To determine the possibility of shoreline contamination, we add the site buffer used for GEE extraction (defined in the lakeSR GEE configuration file) to the pixel size of the sensor used for reflectance detection (see Section 5.2.2 for details). For optical bands (Red, Green, Blue, Near Infrared (NIR), Short Wave Infrared (SWIR)) we use the site buffer plus 30 meters. For thermal bands, we create 3 flag columns for each sensor’s (Landsat 4/5: TM, Landsat 7: ETM+, Landsat 8/9: TIRS) native thermal resolution (120, 60, 100 meters). These flags are binary and are included in the lakeSR site file as the flag columns flag_optical_shoreline, flag_thermal_TM_shoreline, flag_thermal_ETM_shoreline, flag_thermal_TIRS_shoreline:

0: unlikely shoreline contamination (distance to shore is greater than the sum of the site buffer and pixel size)

1: possible shoreline contamination (distance to shore is less than or equal to sum of the site buffer and pixel size)

These flags are defined based on static shoreline location defined by the NHD, so caution should be taken in interpretation of these flags if the waterbody in question experiences water level fluctuations.

3.1.5 Case Study: Wisconsin Waterbodies

The state of Wisconsin contains more than 15,000 freshwater lakes. LimnoSat-US contained just 2,694 waterbodies within Wisconsin, whereas AquaMatch via the lakeSR product includes 10,574 (Figure 3.2) - accounting for a majority of the state’s freshwater water bodies.

This is a multi-panel figure, on the top row, two maps of the state of Wisconsin with dots representing all the lakes included in the lakeSR dataset. The top left represents the water bodies included in LimnoSat, the top right is those included in AquaSat v2, which has a higher density of dots. Below these two maps are histograms of the number of lakes with an x axis for Longitude (bottom left) or Latitude (bottom right), showing the difference in count between the LimnoSat and AquaSat v2 data products.

Figure 3.2: Deepest point of Wisconsin lakes included in LimnoSat-US defined by the Hydrolakes data product (left) and those included in AquaMatch via lakeSR defined by the NHDPlusV2 data product (right). Deepest point for each lake is indicated with a black dot, spatial histogram of density by longitude and latitude presented below the maps.

3.2 siteSR Acquisition Locations

Technical Note

This process is orchestrated in the targets group a_compile_sites in the siteSR workflow. The primary output file of this process is the siteSR site file (siteSR_collated_WQP_NWIS_sites_with_NHD_info_2025-06-04.csv).

Background

The acquisition locations within siteSR are defined using the {dataRetrieval} package (DeCicco et al. 2025) functions whatWQPsites() and whatNWISsites() which are stored in the siteSR site file. All sites were filtered for those that represent surface water according to the populated MonitoringLocationTypeName for WQP sites and site_tp_cd (“site type code”) for NWIS sites. These sites were supplemented by those for which we have harmonized parameter data from the AquaMatch parameter harmonization pipeline but that did not appear in the whatWQPsites() query. Absence of a limited number of sites in the whatWQPsites() query is expected and is due to the dynamic nature of the WQP, since some of the AquaMatch parameter harmonization was queried at a different time than the query for siteSR. The siteSR pipeline collates these sites, then reduces that collation to only distinct sites defined by organization identifier, monitoring location identifier, WGS84 latitude and WGS84 longitude. Across the WQP, NWIS, and the harmonization pipeline, there are currently 1,207,214 sites in the United States and territories for which we attempt to create “stacks” of remote sensing data. A “stack” refers to all the remote sensing data available during a time period of interest at a single site.

Every site in the WQP has a MonitoringLocationTypeName associated with it. Within siteSR there are 1,015,748 sites from the WQP.

This figure is a histogram representing the number of NHD features associated with siteSR site locations acquired from the Water Quality Portal. The highest bars are for River/Stream (490,637 sites), Stream (152,812), Lake (129,908), and Estuary (102,222).

Figure 3.3: Water Quality Portal data location types present in siteSR. Unique WQP monitoring location count above bar, divided by MonitoringLocationTypeName.

Similarly, every site in the NWIS database has a site type code (site_tp_cd) associated with it. Within siteSR there are 191,466 sites from NWIS.

National Water Information System data location types present in siteSR. Unique NWIS monitoring location count above bar, divided by site type. ST = Stream, ST-CA = Canal, ST-DCH = Ditch, LK = Lake/Reservoir, ES = Estuary.

Figure 3.4: National Water Information System data location types present in siteSR. Unique NWIS monitoring location count above bar, divided by site type. ST = Stream, ST-CA = Canal, ST-DCH = Ditch, LK = Lake/Reservoir, ES = Estuary.

3.2.1 Harmonizing siteSR site types

In order to harmonize the site types across WQP and NWIS data providers, we attempt to generalize site types into five groups: Lake/Reservoir/Pond, Stream, Canal/Ditch, Estuary, Other. These designations are stored in the column harmonized_site_type in the siteSR site file.

AquaMatch harmonized_site_type NWIS site_tp_cd WQP MonitoringLocationTypeName
Lake/Reservoir LK contains “lake”, “reservoir”, or “impoundment”
Pond no associated code contains “pond”
Stream ST contains “stream”
Canal/Ditch ST-CA, ST-DCH contains “canal” or “ditch”
Estuary ES Estuary
Other all remaining codes all remaining codes
This figure shows a histogram of all sites in siteSR using the simplified `harmonized_site_type` category. The highest category is Stream (816,145 sites), followed by Lake/Reservoir (224,551), then Estuary (106,447).

Figure 3.5: AquaMatch harmonized_site_type categories provided in siteSR. Unique AquaMatch site count above bar, divided by harmonized_site_type.

3.2.2 Assignment of NHD flowline and waterbodies

HUC8 assignment

For each siteSR location, we attempt to assign HUC8’s using the get_huc() function in the {nhdplusTools} (Blodgett and Johnson 2023) package in R. While there is a column (HUCEightDigitCode) provided with WQP sites query that attributes a HUC8 to sites, some of them are missing data (n = 10577) and some of them differ from those obtained using the reported location latitude and longitude and the get_huc() function (n = 25628). Additionally, NWIS sites are not associated with HUC8’s when queried and this process assigns a HUC8 as a value add to downstream users. These HUC assignments are stored in the assigned_HUC column in the siteSR site file. After this process, only 1641 sites remained without a HUC8 assignment using the nhdplusTools method. Most of these sites were coastal sites outside of the extent of NHD boundaries (n = 1357) or were assigned a HUC8 in the WQP data and were estuaries, but were unable to be assigned in this process (n = 284). All sites remain in siteSR, but those without an assigned HUC8 using this process (flag_HUC of 3 or 4) will have no NHD feature attribution information.

In siteSR, we provide a flag_HUC8 column in the siteSR site file with the following code definitions:

0: HUC8 reported in WQP site information, matches nhdplusTools assignment

1: HUC8 successfully assigned using nhdplusTools, no HUC8 reported in HUCEightDigitCode

2: HUC8 mismatch between Water Quality Portal HUCEightDigitCode assignment and nhdplusTools assignment

3: HUCEightDigitCode was assigned to a site type of estuary, but not able to be assigned using the nhdplusTools method

4: HUC8 unable to be assigned for site location using the nhdplusTools method and was not provided by WQP/NWIS site information

NHD Waterbody Feature Assignment

Using the assigned HUC8 information to iterate, we then can use {nhdplusTools} to assign waterbodies and flowlines to sites within the CONUS and data from The National Map best resolution files (e.g. US Geological Survey 2023b) for sites outside of CONUS by HUC4. For all sites with a harmonized_site_type of Lake/Reservoir, Pond, Estuary or Other, we assign NHD ids to each point for the NHD waterbody feature it is contained by or the NHD feature it is closest to. The NHD waterbody layer has been filtered to only contain Lake, Ponds, Reservoirs, and Estuaries (NHD feature types 390, 436, 493) when performing this task. We flag (flag_wb “waterbody flag” in the siteSR site file) the results of this assignment as follows:

0: point inside NHD waterbody feature polygon

1: point within GEE buffer distance (default 200 meters for siteSR) of the NHD waterbody feature

2: point between GEE buffer distance and 500 meters of the NHD waterbody feature polygon

3: point unable to be assigned to waterbody feature (it is > 500 meters away from the closest NHD waterbody feature polygon)

4: point does not have HUC8 assignment or harmonized_site_type is not Lake/Reservoir, Pond, Estuary, or Other. No NHD waterbody feature is assigned.

For all points with a flag value of 0, 1, and 2, we provide the NHD identifier (wb_nhd_id in the siteSR site file) for the polygon the monitoring location is associated with (comid for NHDPlusV2 and permanent identifier for NHD Best Resolution), for all points with a flag value of 1, 2, and 3, we provide the distance (in meters) to the nearest NHD waterbody (dist_to_wb in siteSR site file) of feature type 390, 436, or 493. For waterbodies with a flag of 0, we provide distance to shore (dist_to_shore in siteSR site file).

We follow lakeSR’s implementation of flags for shoreline proximity for these sites. For sites with a distance to shore value, we also provide a diagnostic column that indicates likelihood of shoreline contamination within the scope of Landsat extraction. To determine the possibility of shoreline contamination, we add the site buffer used for GEE extraction (200 m for siteSR) to the pixel size of the sensor used for reflectance detection. For optical sensor bands (Red, Green, Blue, NIR, SWIR) we use the site buffer plus 30 meters (flag_optical_shoreline). For thermal sensor bands, we create 3 flag columns for each sensor’s native thermal resolution (Landsat 4/5: TM 120m flag_thermal_TM_shoreline, Landsat 7: ETM+ 60 m flag_thermal_ETM_shoreline, Landsat 8/9: TIRS 100m flag_thermal_TIRS_shoreline). These flags are binary values in the siteSR site file for sites with a flag_wb value of 0 (indicating that the site is inside of the NHD waterbody feature):

0: unlikely shoreline contamination (distance to shore is greater than the sum of the site buffer and pixel size)

1: possible shoreline contamination (distance to shore is less than or equal to sum of the site buffer and pixel size)

NHD Flowline Feature Assignment

Flowlines are assigned to all siteSR sites. NHD flowlines are filtered to contain streams/rivers, canals, and artificial paths (that is flowlines within NHD waterbody features) (NHD feature types 460, 336, and 558). We flag (flag_fl in the siteSR site file) the results of this assignment as follows:

0: point <= 100 meters from nearest NHD flowline feature

1: point between 100 meters and GEE buffer distance (default 200 meters for siteSR) to nearest NHD flowline feature

2: point between GEE buffer distance and 500 meters from nearest NHD flowline feature

3: point unable to be assigned to a NHD flowline for a stream/canal site because distance to nearest NHD flowline feature > 500 meters

4: point is a lake/reservoir/pond/other/estuary (defined by harmonized_site_type) and is > 500 meters away from nearest NHD flowline feature

5: point does not have HUC8 assignment. No flowline assigned.

For all points with a flag 0-3 we provide the NHD identifier (fl_nhd_id in the siteSR site file) for the flowline the monitoring location is associated with (comid for NHDPlusV2 and permanent identifier for NHD Best Resolution), for all points with a flag value of 0-4, we provide the distance (in meters) to the nearest NHD flowline (dist_to_fl in siteSR site file) of feature type 460, 558, 336.

Multiple Intersecting NHD Features

For all sites with an assigned_HUC8, we calculate how many waterbodies or flowlines are within the buffer distance set in the siteSR GEE configuration file (stored in number_int_wb for number of waterbodies and number_int_fl for flowlines in the siteSR site file). Any site that has more than one waterbody and/or flow line within the buffer distance may be including reflectance data from another surface water source. Within the context of siteSR, we do not investigate what, if any, impacts multiple waterbodies, flowlines, or combinations thereof have on the validity of the reflectance data or thermal data summary.

We provide these identifiers and additional metadata about the sites to help users make informed decisions about remote sensing data quality (specifically, related to edge/shoreline contamination) and NHD common unique identifier (COMID) or permanent identifier attribution to individual sites and any available flags to assess uncertainty to those assignments or the buffered area relative to the site.

Note that the siteSR site file noted above collates NHD Best Resolution permanent identifiers and NHDPlusV2 comid values into a column for waterbody features wb_nhd_id and one for flowline features fl_nhd_id. The upstream source of the NHD data is noted in wb_nhd_source and fl_nhd_source columns to allow users to know whether the value in the wb_nhd_id or fl_nhd_id columns is a NHD Best Resolution permanent identifier (NHDBestRes) or a NHDPlusV2 COMID (NHDPlusV2).

3.2.3 Assessing for remote sensing visibility

For siteSR, we will acquire Landsat stacks for all sites deemed “visible” by remote sensing irrelevant of site type. Note that this is different from lakeSR, where we only acquire data for lakes, reservoirs and ponds according to the NHD. This was a functional choice, as estuary extents are limited in the NHD (and when present the technique to determine the POI may not be appropriate).

We use the European Commission’s JRC Global Surface Water Mapping Layers v1.4 (Pekel et al. 2016) to determine whether or not sites are visible by remote sensing (RS visible) and should therefore be sent through the GEE data acquisition process detailed in Section 6. The JRC surface water dataset is a compilation of 38 years of the Landsat record and enumerates the proportion of time a single 30x30 meter pixel has the presence of water. As long as a siteSR site location has at least one pixel designated as being occupied by water at least 80% of the time during that 38 year period within a 200 meter radius, we include the site in the list for which we acquire Landsat stacks.

After assessing for remote sensing visibility, 395,307 sites were deemed to be visible by remote sensing (67.3% loss of sites).

The figure is a histogram of siteSR sites after being filtered for remote sensing visibility. The highest category after filtering is Lake/Reservoir (165,745 sites), Stream (118,156), and Estuary (100,323).

Figure 3.6: siteSR data location types present in the dataset after filtering for those that are deemed visible by remote sensing. Count of sites provided above bar, arranged by harmonized_site_type.

3.2.4 siteSR updates in AquaMatch

In our updated workflow, we acquire and provide the historical Landsat stacks for all sites available from the WQP and NWIS, regardless of whether they are associated with a specific in situ observation present in our harmonization pipeline. In the previous iteration, we only provided a matchup dataset for in situ data. Assigning NHD flowlines and waterbodies to all siteSR sites is also a feature introduced in AquaMatch.