7 Post-Hoc Quality Assurance
Up to this point in the workflow, the only data filtering has been the masking procedures described in Section 6.3.1, which eliminate pixel-level data that are obviously contaminated by sensor malfunction, clouds, haze, or glint. To ensure high-quality data, we apply further QA filters to the remote sensing stacks. These post-hoc filters are described in Section 7.2.
7.1 Raw lakeSR & siteSR stack collation
The data exported from the GEE tasks described in Section 6
were collated into large compressed .feather files per DSWE type (DSWE1 and
DSWE1a), per Landsat mission, and per path prefix. The path prefix subset is
designed to reduce data corruption due to file size in the upload/download
process to Google Drive, where the collated files are stored for posterity and
to add functionality to this workflow. The Google Drive identifiers can be found
at the folder path c_collate_Landsat_data/out/ and can be accessed without
special permissions by downstream users. Many of these files are quite large (on
the order of many gigabytes) and are very difficult to handle outside of
a programming or database environment. We provide these files for advanced users
who wish to make changes to the QA or intermission handoff procedures. When we
ran the pipeline, we set a maximum vector size (‘R_MAX_VSIZE’) of 40 GB using
the .Renviron document in this repository, on a machine with 64 GB of memory.
If your system has less than 40 GB of memory, this portion of the workflow may
not run successfully.
7.2 Post-hoc filters
We acknowledge that even with the best masking procedures, our workflow may still result in erroneous or misleading SR values. To address this, we apply the following post-hoc filters to the stack to reduce uncertainty in the remote sensing data as much as possible:
- Image quality (from scene-level metadata files) must be ≥ 8. For Landsat 4, 5, and 7, this value is stored in the metadata column IMAGE_QUALITY; for Landsat 8 and 9, it comes from the column IMAGE_QUALITY_OLI. These columns indicate the overall quality of the image on a scale of 0 (worst) to 9 (best).
- The total count of pixels (pCount_dswe1 or pCount_dswe1a) contributing to summary values must be ≥ 8. This is a slight change from AquaSat v1, which required 10 pixels. We believe that the more rigorous masking procedures used here (relative to AquaSat v1) allow for this reduction in the number of pixels required for inclusion in the lakeSR and siteSR datasets.
- Either NIR surface reflectance, or both SWIR1 and SWIR2 surface reflectance, must be < 0.1. This removes any extracted samples where sun glint that was not masked in the red, green, or blue bands has likely affected the data. In aquatic environments, NIR, SWIR1, and SWIR2 surface reflectance values should be very low, and these bands can be used for detecting areas affected by sun glint (e.g., Mondejar and Tongco 2019; Vanhellemont 2019). Because NIR can be elevated in high-sediment waters (e.g., Doxaran, Froidefond, and Castaing 2002) and SWIR can be elevated in high-chlorophyll environments (e.g., Hu 2009), and we did not wish to bias our dataset by removing data of this nature, we adopted this conditional approach.
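The three filters above can be sketched as a single row-level check. This is a minimal illustration in Python, not the pipeline's own code: the dictionary-based row, the function name passes_qa, and the band column names med_Nir, med_Swir1, and med_Swir2 are assumptions made for the example; only IMAGE_QUALITY, IMAGE_QUALITY_OLI, pCount_dswe1, and pCount_dswe1a come from the text.

```python
def passes_qa(row, pcount_col="pCount_dswe1"):
    """Return True when a summary row passes all three post-hoc filters.

    `row` is a plain dict; the band column names (med_Nir, med_Swir1,
    med_Swir2) are hypothetical and chosen only for this illustration.
    """
    # Scene-level image quality: Landsat 8/9 rows carry IMAGE_QUALITY_OLI,
    # Landsat 4/5/7 rows carry IMAGE_QUALITY.
    quality = row.get("IMAGE_QUALITY_OLI", row.get("IMAGE_QUALITY"))
    if quality is None or quality < 8:
        return False

    # At least 8 pixels must contribute to the summary values.
    if row[pcount_col] < 8:
        return False

    # Conditional glint check: NIR < 0.1, OR both SWIR bands < 0.1.
    nir_ok = row["med_Nir"] < 0.1
    swir_ok = row["med_Swir1"] < 0.1 and row["med_Swir2"] < 0.1
    return nir_ok or swir_ok


# A high-sediment scene: NIR is elevated, but both SWIR bands are low,
# so the row is retained by the conditional glint check.
turbid = {"IMAGE_QUALITY": 9, "pCount_dswe1": 12,
          "med_Nir": 0.15, "med_Swir1": 0.02, "med_Swir2": 0.01}
print(passes_qa(turbid))
```

The conditional form means a row is dropped for glint only when NIR and at least one SWIR band are both at or above 0.1, which is what keeps high-sediment and high-chlorophyll observations in the dataset.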
While there are numerous other diagnostic tests we could apply to filter the RS data, these are the QA filters that we deem to be the most universal for reliable RS data across the US and territories. Figures in the sections below show how many rows of data were dropped through this QA process. For the purposes of this documentation, we show summaries for DSWE1 data only, but running the pipeline locally results in graphical summaries for both DSWE1 and DSWE1a data.
We also added use flags in the remote sensing summary file indicating when the thermal data were outside of specific minimum and maximum thresholds:
- If the median temperature is below freezing (med_SurfaceTemp < 273.15 K), flag_temp_min is set to 2 (otherwise 0 for valid data for this flag, or 1 for no data after GEE filters). This is our best attempt at identifying data during any freezing period, and it is also an attempt at detecting clouds, which can impact the thermal estimates. In particular, cirrus clouds can go unnoticed by the cloud detection algorithms that inform our masking procedures.
- If the median temperature is greater than or equal to 40 °C (313.15 K), flag_temp_max is set to 2 (otherwise 0 for valid data for this flag, or 1 for no data after GEE filters). This maximum was defined based on the maximum value included in Willard et al. (2021), a dataset composed of in situ measurements and remote sensing estimates.
When thermal data are outside of these ranges (and the flag column is 2), we
suggest removing the thermal data from your analysis. Although the thermal data
and optical data are processed separately, when either flag_temp_min or
flag_temp_max is 2, you should consider whether the optical data are
appropriate to use in your analysis or whether there may be contamination of
some kind that could impact the validity of the optical data. For instance, if
your site is far from shore and flag_temp_min is 2, it is possible there is
either unidentified ice or unidentified cloud contamination.
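The two thermal use flags can be sketched as follows. This is an illustrative Python example rather than the pipeline's implementation: the function name temp_flags and the use of None to represent "no data after GEE filters" are assumptions; the thresholds and the 0/1/2 flag values follow the description above.

```python
FREEZING_K = 273.15   # 0 °C, lower threshold from the text
MAX_TEMP_K = 313.15   # 40 °C, upper threshold from the text


def temp_flags(med_surface_temp):
    """Return flag_temp_min and flag_temp_max for one row.

    `med_surface_temp` is the median surface temperature in kelvin.
    None stands in for missing thermal data (an assumption made for
    this sketch), in which case both flags are set to 1.
    """
    if med_surface_temp is None:
        return {"flag_temp_min": 1, "flag_temp_max": 1}
    return {
        # 2 marks a value below freezing, 0 marks valid data.
        "flag_temp_min": 2 if med_surface_temp < FREEZING_K else 0,
        # 2 marks a value at or above 40 °C, 0 marks valid data.
        "flag_temp_max": 2 if med_surface_temp >= MAX_TEMP_K else 0,
    }


print(temp_flags(270.0))
```

Because the flags are stored rather than used to drop rows, a downstream user can decide whether to remove only the thermal value or the whole observation.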
7.2.1 Data Truncation
Data coming out of Google Earth Engine carry far more significant digits than are meaningful, a side effect of the data aggregation that occurs in that step. Prior to exporting any final lakeSR or siteSR files, we truncate optical data to 3 significant digits and thermal data to 2 significant digits.
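As an illustration of truncating to a fixed number of significant digits, the sketch below uses Python's decimal module. It is not the routine used in the pipeline: the function name truncate_significant is hypothetical, and the example assumes that "truncate" means dropping the extra digits (rounding toward zero) rather than rounding to nearest.

```python
from decimal import Decimal, ROUND_DOWN


def truncate_significant(value, digits):
    """Truncate `value` toward zero to `digits` significant digits."""
    d = Decimal(str(value))
    if d == 0:
        return 0.0
    # adjusted() is the exponent of the most significant digit, so this
    # quantum marks the position of the last digit that is kept.
    quantum = Decimal(1).scaleb(d.adjusted() - digits + 1)
    return float(d.quantize(quantum, rounding=ROUND_DOWN))


print(truncate_significant(0.0123456, 3))
```

Working in decimal rather than binary floating point avoids cases where multiplying by a power of ten lands just below the intended digit and the truncation drops one digit too many.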
7.2.2 Landsat 4 QA Summary
7.2.3 Landsat 5 QA Summary
7.2.4 Landsat 7 QA Summary
7.2.5 Landsat 8 QA Summary
7.2.6 Landsat 9 QA Summary