7 Post-Hoc Quality Assurance

Up until this point in the workflow, no data filtering has occurred other than the masking procedures that eliminate obviously contaminated pixel-level data due to sensor malfunction, clouds, haze, and glint as described in Section 6.3.1. We also implement further QA filters for the remote sensing stacks to assure high-quality data. These post-hoc filters are described in 7.2.

7.1 Raw lakeSR & siteSR stack collation

The data exported from the GEE tasks described in Section 6 were collated into large compressed .feather files per DSWE type (DSWE1 and DSWE1a), per Landsat mission, and per path prefix. The path prefix subset is designed to reduce data corruption due to file size in the upload/download process to Google Drive, where the collated files are stored for posterity and to add functionality to this workflow. The Google Drive identifiers can be found at the folder path c_collate_Landsat_data/out/ and can be accessed without special permissions by downstream users. Many of these files are quite large (on the order of many gigabytes (“GBs”)) and are very difficult to handle outside of a programming or database environment. We provide these files for advanced users who wish to make changes to the QA or intermission handoff procedures. At the time of running the pipeline, we assigned a max vector size (‘R_MAX_VSIZE’) of 40 GB using the .Renviron document in this repository and it was run on a machine with 64 GB of memory. If your system has less than 40 GB of memory, this portion of the workflow may not successfully run.

7.2 Post-hoc filters

We acknowledge that even with the best masking procedures, our workflow may still result in erroneous or misleading SR values. To address this we implement post-hoc filters that are applied to the stack to reduce uncertainty in the remote sensing data as much as possible.

image quality (from scene-level metadata files) must be >= 8. In Landsat 4, 5, and 7 this value is stored in the metadata column IMAGE_QUALITY, in Landsat 8 and 9, this value originates from the column IMAGE_QUALITY_OLI. These columns indicate the overall quality of the image on a scale of 0 (worst) to 9 (best).
total count of pixels (pCount_dswe1 or pCount_dswe1a) contributing to summary values must be ≥ 8. This is a slight change from AquaSat v1 which required 10 pixels. Due to the more rigorous masking procedures (than AquaSat v1) we believe allows for a slightly more conservative QA filter via the reduction of required pixels for inclusion in the lakeSR and siteSR datasets.
either NIR surface reflectance or SWIR1 and SWIR2 surface reflectance must be < 0.1, this is to remove any extracted samples where sun glint has likely affected the data that was not masked in the red, green, or blue bands. In aquatic environments, NIR, SWIR1, and SWIR2 surface reflectance values should be very low and these bands can be used for detecting sun glint affected areas (e.g. Mondejar and Tongco 2019; Vanhellemont 2019). Since NIR bands can be elevated in high-sediment waters (e.g., Doxaran, Froidefond, and Castaing 2002) and SWIR can be elevated in high chlorophyll environments (e.g, Hu 2009), and we did not wish to bias our dataset and remove data of this nature, we embraced this conditional approach.

While there are numerous other diagnostic tests we could apply to filter the RS data, these are the QA filters that we deem to be the most universal for reliable RS data across the US and territories. Figures in the sections below show how many rows of data were dropped through this QA process. For the purposes of this documentation, we show summaries for DSWE1 data only, but running the pipeline locally results in graphical summaries for both DSWE1 and DSWE1a data.

We also added use flags in the remote sensing summary file indicating when the thermal data were outside of specific minimum and maximum thresholds:

if the median temperature is below freezing (med_SurfaceTemp < 273.15 ° K), the row is flagged flag_temp_min as 2 (otherwise 0 for valid data for this flag or 1 for no data after GEE filters). This is our best attempt at identifying data during any freezing period and also is an attempt at detecting clouds, which can impact the thermal estimates. In particular, cirrus clouds can go unnoticed in the cloud detection algorithms that inform our masking procedures.
if the median temperature is greater than or equal to 40°C (313.15 °K) the row is flagged flag_temp_max as 2 (otherwise 0 for valid data for this flag or 1 for no data after GEE filters). This maximum was defined based on the maximum value included in Willard et al. (2021), a dataset composed of in situ measurements and remote sensing estimates.

When thermal data are outside of these ranges (and the flag column is 2) we suggest removing the thermal data from your analysis. While the thermal data and optical data are processed separately, when either flag_temp_min or flag_temp_max is 2, users should consider whether the optical data is appropriate to use in your analysis or whether there may be contamination of some kind that could impact the validity optical data. For instance, if your site is far from shore and the flag_temp_min is 2, it is possible there is either unidentified ice or unidentified cloud contamination.

7.2.1 Data Truncation

Data coming out of Google Earth Engine carry nearly unending significant digits due to the data aggregation that occurs in that step. Prior to exporting any final lakeSR or siteSR files, we also truncate optical data to 3 significant digits and thermal data to 2 significant digits.

7.2.2 Landsat 4 QA Summary

Summary of rows retained during lakeSR Landsat 4’s quality filtering for DSWE1 (confident water) remote sensing summaries. Prior to post-hoc QA, there were 827,630 records, after post-hoc QA there were 384,324 records.

Summary of rows retained during siteSR Landsat 4’s quality filtering for DSWE1 (confident water) remote sensing summaries. Prior to post-hoc QA, there were 496,827 records, after post-hoc QA there were 356,824 records.

7.2.3 Landsat 5 QA Summary

Summary of rows retained during lakeSR Landsat 5’s quality filtering for DSWE1 (confident water) remote sensing summaries. Prior to post-hoc QA, there were 61,286,631 records, after post-hoc QA there were 35,285,627 records.

Summary of rows retained during siteSR Landsat 5’s quality filtering for DSWE1 (confident water) remote sensing summaries. Prior to post-hoc QA, there were 89,645,418 records, after post-hoc QA there were 74,295,228 records.

7.2.4 Landsat 7 QA Summary

Summary of rows retained during lakeSR Landsat 7’s quality filtering for DSWE1 (confident water) remote sensing summaries. Prior to post-hoc QA, there were 66,838,911 records, after post-hoc QA there were 37,409,839 records.

Summary of rows retained during siteSR Landsat 7’s quality filtering for DSWE1 (confident water) remote sensing summaries. Prior to post-hoc QA, there were 73,452,572 records, after post-hoc QA there were 61,650,399 records.

7.2.5 Landsat 8 QA Summary

Summary of rows retained during lakeSR quality filtering for Landsat 8 DSWE1 (confident water) remote sensing summaries. Prior to post-hoc QA, there were 23,152,027 records, after post-hoc QA there were 10,656,479 records.

Summary of rows retained during siteSR quality filtering for Landsat 8 DSWE1 (confident water) remote sensing summaries. Prior to post-hoc QA, there were 83,945,134 records, after post-hoc QA there were 24,416,786 records.

7.2.6 Landsat 9 QA Summary

Summary of rows retained during lakeSR quality filtering for Landsat 9 DSWE1 (confident water) remote sensing summaries. Prior to post-hoc QA, there were 5,845,374 records, after post-hoc QA there were 2,733,304 records.

Summary of rows retained during siteSR quality filtering for Landsat 9 DSWE1 (confident water) remote sensing summaries. Prior to post-hoc QA, there were 9,347,528 records, after post-hoc QA there were 6,753,349 records.