9 CDOM harmonization process

Following the completion of the {dataRetrieval} download process described previously, the pipeline contains raw WQP data for each parameter of interest. Before we harmonize each parameter we run through a series of universal “pre-harmonization” steps, which ensure that the datasets are appropriately formatted when entering their harmonization routines.

The text below first walks through the pre-harmonization steps for the colored dissolved organic matter (CDOM) dataset and then delves into the specifics of the harmonization process.

9.1 Pre-harmonization of the raw CDOM WQP dataset

At the start of the pre-harmonization process the raw CDOM WQP dataset contains 186.4 thousand rows.

Bar plot showing the row count update for the current harmonization step.


9.1.1 Missing results

Next, records that have missing data are dropped from the dataset. Several criteria are used when checking for missing data. If any of the criteria below are met, the row is flagged as missing:

  1. Both the result column and the detection limit column contain NA
  2. The result, result unit, activity comment, laboratory comment, and result comment columns are all NA
  3. The result comment column contains any user-provided text indicating a missing value. This currently includes: analysis lost, not analyzed, not recorded, not collected, or no measurement taken

473 rows are dropped, resulting in a final count of 185.9 thousand.
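
The three criteria above can be sketched in base R as follows. This is a minimal illustration rather than the pipeline's actual code: the data frame `wqp` and its simplified column names are hypothetical stand-ins for the corresponding WQP fields.

```r
# Hypothetical toy records; column names are simplified stand-ins for WQP fields.
wqp <- data.frame(
  result           = c("1.2", NA, NA, "0.4", NA),
  detection_limit  = c(NA, NA, "0.05", NA, "0.05"),
  result_unit      = c("AU/cm", NA, "AU/cm", "AU/cm", NA),
  activity_comment = NA_character_,
  lab_comment      = NA_character_,
  result_comment   = c(NA, NA, NA, "Sample not analyzed", NA),
  stringsAsFactors = FALSE
)

# Phrases from criterion 3, combined into one case-insensitive pattern.
missing_phrases <- c("analysis lost", "not analyzed", "not recorded",
                     "not collected", "no measurement taken")
missing_regex <- paste(missing_phrases, collapse = "|")

# Criterion 1: result and detection limit are both NA.
crit_1 <- is.na(wqp$result) & is.na(wqp$detection_limit)

# Criterion 2: result, unit, and all three comment columns are NA.
check_cols <- c("result", "result_unit", "activity_comment",
                "lab_comment", "result_comment")
crit_2 <- rowSums(!is.na(wqp[, check_cols])) == 0

# Criterion 3: the result comment contains a "missing" phrase.
crit_3 <- grepl(missing_regex, wqp$result_comment, ignore.case = TRUE)

# A row is flagged if ANY criterion is met, then dropped.
wqp$missing_flag <- unname(crit_1 | crit_2 | crit_3)
wqp_kept <- wqp[!wqp$missing_flag, ]
```

In this toy example rows 2, 4, and 5 are flagged (one per criterion) and two rows are kept.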

Bar plot showing the row count update for the current harmonization step.


9.1.2 Filter status

The final step in pre-harmonization is to filter the ResultStatusIdentifier column to include only the following statuses:

  • "Accepted"
  • "Final"
  • "Historical"
  • "Validated"
  • "Preliminary"
  • NA

These statuses generally indicate that a reliable result was obtained; we also include NA in an effort to be conservative. More specifically, when making decisions for this and other columns we occasionally retain NA values if removing the records would otherwise drop 10% or more of the available data.
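
A minimal base R sketch of this filter is shown below. The `status` vector is made-up example data, and the variable names are illustrative only. The point of interest is that `%in%` treats NA as a matchable value, so NA statuses are retained without a separate `is.na()` check.

```r
# Statuses retained by the filter, including NA.
keep_statuses <- c("Accepted", "Final", "Historical", "Validated",
                   "Preliminary", NA)

# Hypothetical ResultStatusIdentifier values for five records.
status <- c("Accepted", "Rejected", NA, "Preliminary", "Provisional")

# %in% matches NA against NA, so the third record is kept.
keep <- status %in% keep_statuses
kept_status <- status[keep]
```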

This step removes 0 rows, leaving 185.9 thousand rows.

Bar plot showing the row count update for the current harmonization step.


9.2 Harmonization-ready CDOM dataset

Once ready for harmonization, the CDOM WQP dataset contains the following user-defined CharacteristicNames:

  • Absorbance at 280 nanometers
  • Absorbance at 280 nm
  • Absorbance at 370 nanometers
  • Absorbance at 412 nm
  • Absorbance at 440 nm
  • Absorption coefficient at 440 nm
  • Absorption spectral slope (Sag)
  • Colored dissolved organic matter (CDOM)
  • Emission intensity ratio
  • Fluorescence index
  • Fluorescence, excitation 260 emission 450
  • Fluorescence, excitation 275 emission 304
  • Fluorescence, excitation 275 emission 340
  • Fluorescence, excitation 280 emission 370
  • Fluorescence, excitation 300 emission 390
  • Fluorescence, excitation 340 emission 440
  • Fluorescence, excitation 370 emission 460
  • Fluorescence, excitation 390 emission 510
  • Fluorescence, excitation 420 emission 460
  • Specific UV Absorbance at 254 nm
  • Specific UV Absorbance at 254 nm, corrected for Fe
  • UV 254
  • UV Absorption, relative conc. of organic constituents

Stacked bar chart with counts of CharacteristicName records

These CharacteristicNames are chosen to select only those measurements that pertain to CDOM. We selected them by reviewing the options in the Characteristics field of the Water Quality Portal Filter Results page.


9.2.1 Filter media and fractions, categorize parameters

We next ensure that the media type for all CDOM data is "Surface Water", "Water", "Estuary", or NA. Any records not meeting these criteria are dropped.

We then assign each record to one of the parameter values in the table below, based on its values for CharacteristicName, ResultAnalyticalMethod.MethodName, and ResultMeasure.MeasureUnitCode. This helps us to group comparable measurements throughout the rest of the process and generally simplify the way the dataset is organized:


| Parameter | Definition |
|---|---|
| Absorbance at 254 nm | CharacteristicName is either “Specific UV Absorbance at 254 nm” or “UV 254” and ResultMeasure.MeasureUnitCode is neither “L/mg-cm” nor “L/mgDOC*m”. Alternatively, CharacteristicName is “UV Absorption, relative conc. of organic constituents” and ResultAnalyticalMethod.MethodIdentifier is “5910-B” |
| Absorbance at 280 nm | CharacteristicName is “Absorbance at 280 nanometers” |
| Absorbance at 370 nm | CharacteristicName is “Absorbance at 370 nanometers” |
| Absorbance at 412 nm | CharacteristicName is “Absorbance at 412 nm” |
| Absorbance at 440 nm | CharacteristicName is “Absorption coefficient at 440 nm” OR “Absorbance at 440 nm” OR CharacteristicName is “Colored dissolved organic matter (CDOM)” and ResultAnalyticalMethod.MethodName is “CDOM absorption (440nm)” |
| Absorption spectral slope, 275 to 295 nm | CharacteristicName is “Absorption spectral slope (Sag)” and USGSPCode is 32300 |
| Absorption spectral slope, 290 to 350 nm | CharacteristicName is “Absorption spectral slope (Sag)” and USGSPCode is 32301 |
| Absorption spectral slope, 350 to 400 nm | CharacteristicName is “Absorption spectral slope (Sag)” and USGSPCode is 32302 |
| Absorption spectral slope, 400 to 500 nm | ResultAnalyticalMethod.MethodName is “Slope Of CDOM Absorption Coefficient Spectrum (400 To 500 Nm)” and StatisticalBaseCode is “Slope” |
| Absorption spectral slope, 412 to 600 nm | CharacteristicName is “Absorption spectral slope (Sag)” and USGSPCode is 32331 |
| Absorption spectral slope, 412 to 676 nm | CharacteristicName is “Absorption spectral slope (Sag)” and USGSPCode is 32303 |
| FDOM | CharacteristicName is “Colored dissolved organic matter (CDOM)” AND ResultAnalyticalMethod.MethodName is NA or mentions a fluorometer AND ResultMeasure.MeasureUnitCode is one of “RFU”, “ug/l QSE”, “mg/l”, or “ug/L”. Alternatively, CharacteristicName is formatted like “Fluorescence, excitation 275 emission 304”, where the exact numbers vary |
| Fluorescence index | CharacteristicName is “Fluorescence index” |
| SUVA | ResultMeasure.MeasureUnitCode is “L/mg-cm” or “L/mgDOC*m” OR CharacteristicName is “UV Absorption, relative conc. of organic constituents” and ResultAnalyticalMethod.MethodIdentifier is “415.3” |
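
To illustrate how rules like those in the table above might be applied, here is a hedged base R sketch covering only four of the parameters (SUVA, Absorbance at 254 nm, Absorbance at 280 nm, and Fluorescence index). The function `assign_parameter()` and its argument names are hypothetical; the real pipeline may structure this differently. Records matching none of the implemented rules are left as NA.

```r
# Hypothetical helper implementing a SUBSET of the parameter rules.
assign_parameter <- function(char_name, unit, method_id) {
  param <- rep(NA_character_, length(char_name))
  suva_units <- c("L/mg-cm", "L/mgDOC*m")
  uv_abs <- "UV Absorption, relative conc. of organic constituents"

  # SUVA: identified by its units, or by UV Absorption with method 415.3.
  is_suva <- unit %in% suva_units |
    (char_name %in% uv_abs & method_id %in% "415.3")

  # Absorbance at 254 nm: named 254 nm characteristics without SUVA units,
  # or UV Absorption with method 5910-B.
  is_254 <- (char_name %in% c("Specific UV Absorbance at 254 nm", "UV 254") &
               !(unit %in% suva_units)) |
    (char_name %in% uv_abs & method_id %in% "5910-B")

  param[is_254] <- "Absorbance at 254 nm"
  param[is_suva] <- "SUVA"
  param[char_name %in% "Absorbance at 280 nanometers"] <- "Absorbance at 280 nm"
  param[char_name %in% "Fluorescence index"] <- "Fluorescence index"
  param
}

# Made-up example records.
params <- assign_parameter(
  char_name = c("UV 254", "UV 254", "Absorbance at 280 nanometers",
                "Fluorescence index", "Emission intensity ratio",
                "UV Absorption, relative conc. of organic constituents"),
  unit      = c("units/cm", "L/mgDOC*m", "units/cm", "None", "None", "None"),
  method_id = c("UV006", "ALGOR", "UV002", "FL020", "FL020", "415.3")
)
```

Using `%in%` rather than `==` keeps NA inputs from propagating NA into the logical masks. The fifth record matches no rule and stays NA, which corresponds to a record that would be dropped.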


Records that do not meet the criteria of one of the parameter entries above are dropped (n = 1,113). A summary of the types of dropped records and their row counts is below:

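# Render the dropped-record summary as a scrollable table
# (kable() is from knitr; kable_paper() and scroll_box() are from kableExtra)
param_drop_record %>%
  kable() %>%
  kable_paper() %>%
  scroll_box(width = "750px", height = "500px")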
| CharacteristicName | ResultAnalyticalMethod.MethodName | USGSPCode | ResultAnalyticalMethod.MethodIdentifier | ResultAnalyticalMethod.MethodIdentifierContext | ResultMeasure.MeasureUnitCode | ResultSampleFractionText | n |
|---|---|---|---|---|---|---|---|
| Colored dissolved organic matter (CDOM) | 9222 B ~ Standard Total Coliform Membrane Filter Procedure | NA | 9222B | APHA | ug/L | NA | 7 |
| Colored dissolved organic matter (CDOM) | CHROMOPHORIC DISSOLVED ORGANIC MATTER | NA | NASA-CDOM | 21VASWCB | None | Dissolved | 839 |
| Colored dissolved organic matter (CDOM) | NASA/TM-2003-211621/Rev4-Vol.IV | NA | NASA/TM-2003-211621 | RTI | m | NA | 96 |
| Colored dissolved organic matter (CDOM) | NA | 32295 | NA | NA | NA | Dissolved | 2 |
| Colored dissolved organic matter (CDOM) | NA | 32322 | NA | NA | NA | Total | 2 |
| Colored dissolved organic matter (CDOM) | NA | NA | NA | NA | m | NA | 130 |
| Emission intensity ratio | Fluorescence scan, 250-620 nm | 32347 | FL020 | USGS | None | Dissolved | 5 |
| UV Absorption, relative conc. of organic constituents | UNKNOWN | NA | UNKNOWN | AZDEQ_SW | units/cm | Total | 32 |


Below is a stacked barplot of the dataset organized by the new parameter column:

Stacked bar chart with counts of parameter records

1.4 thousand rows are removed. The final row count after this is 184.6 thousand.

Bar plot showing the row count update for the current harmonization step.


9.2.2 Document and remove fails

In this step we filter out records based on indications that they have failed data quality assurance or quality control for some reason given by the data provider (these instances are referred to here as “failures”).

After reviewing the contents of the ActivityCommentText, ResultLaboratoryCommentText, ResultCommentText, ResultDetectionConditionText, and ResultMeasureValue_original columns, we developed a list of terms that captured the majority of instances where records had failures or unacceptable measurements. (Note: ResultMeasureValue_original is a duplicate, character version of the ResultMeasureValue column that we use as a reference for the column’s contents before it was converted to a numeric type.) We found the phrasing to be consistent across columns, so we searched for the same (case-insensitive) text in all five locations. The terms are: “beyond accept”, “cancelled”, “contaminat”, “error”, “fail”, “improper”, “interference”, “invalid”, “no result”, “no test”, “not accept”, “outside of accept”, “problem”, “questionable”, “suspect”, “unable”, “violation”, “reject”, “no data”, “time exceed”, “value extrapolated”, “exceeds”, “biased”, “parameter not required”, “not visited”, “warm”, “broken”.
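
The search described above can be sketched in base R as a single case-insensitive regular expression applied to each text column. The `records` data frame below is made-up example data and `fail_flag` is a hypothetical column name; only the term list and the column names come from the text.

```r
# Failure terms from the text, joined into one alternation pattern.
fail_terms <- c("beyond accept", "cancelled", "contaminat", "error", "fail",
                "improper", "interference", "invalid", "no result", "no test",
                "not accept", "outside of accept", "problem", "questionable",
                "suspect", "unable", "violation", "reject", "no data",
                "time exceed", "value extrapolated", "exceeds", "biased",
                "parameter not required", "not visited", "warm", "broken")
fail_regex <- paste(fail_terms, collapse = "|")

text_cols <- c("ActivityCommentText", "ResultLaboratoryCommentText",
               "ResultCommentText", "ResultDetectionConditionText",
               "ResultMeasureValue_original")

# Hypothetical example records.
records <- data.frame(
  ActivityCommentText          = c("Routine sample", "Bottle BROKEN in transit", NA),
  ResultLaboratoryCommentText  = c(NA, NA, "Holding time exceeded"),
  ResultCommentText            = NA_character_,
  ResultDetectionConditionText = NA_character_,
  ResultMeasureValue_original  = c("0.12", "0.30", "1.1"),
  stringsAsFactors = FALSE
)

# One logical column per text column; grepl() returns FALSE for NA input.
hits <- sapply(records[text_cols],
               function(col) grepl(fail_regex, col, ignore.case = TRUE))

# A record fails if any column matched; colSums(hits) gives per-column counts.
records$fail_flag <- unname(rowSums(hits) > 0)
hits_per_column <- colSums(hits)
```

Because the terms are matched as substrings, a short term such as “warm” also matches longer words that contain it, which is worth keeping in mind when reviewing the per-column counts.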

Below are pie charts that break down the number of failure detections by column. Note that the plotting below is automated so if one or more of the columns listed above are not plotted, this indicates that the column(s) did not return any matches for the failure phrases. Also note that a single record can contain multiple failure phrases; therefore, failure phrases are not mutually exclusive.


9.2.2.1 ActivityCommentText fail detects

ActivityCommentText failure detects

9.2.2.2 ResultCommentText fail detects

ResultCommentText failure detects

9.2.2.3 ResultLaboratoryCommentText fail detects

ResultLaboratoryCommentText failure detects


9.2.2.4 ResultDetectionConditionText fail detects

ResultDetectionConditionText failure detects

9.2.2.5 ResultMeasureValue_original fail detects

ResultMeasureValue_original failure detects


8806 rows are removed after detecting failure-related phrases and 175.7 thousand rows remain.

Bar plot showing the row count update for the current harmonization step.


9.2.3 Clean MDLs

In this step method detection limits (MDLs) are used to clean up the reported values. When a numeric value is missing for the data record (i.e., NA or text that became NA during an as.numeric call) we check for non-detect language in the ResultLaboratoryCommentText, ResultCommentText, ResultDetectionConditionText, and ResultMeasureValue columns. This language can be "non-detect", "not detect", "non detect", "undetect", or "below".

If non-detect language exists then we use the DetectionQuantitationLimitMeasure.MeasureValue column for the MDL. Otherwise if there is a < and a number in the ResultMeasureValue column, we use that number instead.

We then use a random number between 0 and 0.5 * MDL as the record’s value moving forward.

We produce a new column, mdl_flag, from the MDL cleaning process. Records where no MDL-based adjustment was made and which are at or above the MDL are assigned a 0. Records with corrected values based on the MDL method are assigned a 1. Finally, records where no MDL-based adjustment was made and which contain a numeric value below the provided MDL are assigned a 2.
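
The MDL logic can be sketched in base R as below. This is an illustration under stated assumptions, not the pipeline's implementation: `recs` is made-up data, the single `comment` column stands in for the several comment and condition columns, and records with no MDL available are assumed to receive a flag of 0.

```r
set.seed(42)  # only so the illustrative random draws are reproducible

# Hypothetical toy records; `comment` stands in for the comment/condition columns.
recs <- data.frame(
  value_text = c("0.8", "ND", "<0.05", "0.01", NA),
  comment    = c(NA, "Not detected", NA, NA, NA),
  mdl        = c(0.1, 0.2, NA, 0.05, NA),
  stringsAsFactors = FALSE
)

# Text that is not a plain number becomes NA here.
value <- suppressWarnings(as.numeric(recs$value_text))

# Non-detect language in the comment or in the value text itself.
nondetect_regex <- "non-detect|not detect|non detect|undetect|below"
has_nd <- grepl(nondetect_regex, recs$comment, ignore.case = TRUE) |
  grepl(nondetect_regex, recs$value_text, ignore.case = TRUE)

# "<" followed somewhere by a number, e.g. "<0.05".
has_lt <- grepl("<", recs$value_text, fixed = TRUE) &
  grepl("[0-9]", recs$value_text)
lt_number <- suppressWarnings(as.numeric(gsub("[^0-9.]", "", recs$value_text)))

# Choose the MDL: detection limit column first, otherwise the "<" number.
use_mdl <- rep(NA_real_, nrow(recs))
from_col <- is.na(value) & has_nd
from_lt  <- is.na(value) & !has_nd & has_lt
use_mdl[from_col] <- recs$mdl[from_col]
use_mdl[from_lt]  <- lt_number[from_lt]

# Replace the value with a random draw between 0 and 0.5 * MDL.
adjust <- !is.na(use_mdl)
value[adjust] <- runif(sum(adjust), min = 0, max = 0.5 * use_mdl[adjust])

# 1 = adjusted; 2 = unadjusted numeric value below the provided MDL; 0 = otherwise.
below_mdl <- !is.na(value) & !is.na(recs$mdl) & value < recs$mdl
mdl_flag <- ifelse(adjust, 1L, ifelse(below_mdl, 2L, 0L))
```

In this example the second and third records are adjusted (flag 1), the fourth is a reported value below its MDL (flag 2), and the fifth remains NA for the later cleaning steps to handle.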

This should not result in a change in rows, but we still check: 0 rows are removed. The final row count after this is 175.7 thousand.

Bar plot showing the row count update for the current harmonization step.


9.2.4 Clean approximate values

Cleaning approximate values follows a process similar to MDL cleaning. We flag “approximated” values in the dataset. The ResultMeasureValue column gets checked for all three of the following conditions:

  1. Numeric-only version of the column is still NA after MDL cleaning
  2. The original column text contained a number
  3. Any of ResultLaboratoryCommentText, ResultCommentText, or ResultDetectionConditionText match this regular expression, ignoring case: "result approx|RESULT IS APPROX|value approx"

We then use the approximate value as the record’s value moving forward.

Records with corrected values based on the above method are noted with a 1 in the approx_flag column. Note, however, that occasionally approximate language will be used in a record but not changed or flagged. This occurs when the language is used in a comment-related column and not the result column itself, meaning that there is a usable numeric value provided (and thus doesn’t need correction).
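
A minimal base R sketch of the approximate-value check follows. The data are made up, the single `comment` vector stands in for the three comment columns, and stripping non-numeric characters with `gsub()` is a simplifying assumption that only works when the text holds a single number.

```r
# Hypothetical example values and comments.
value_text <- c("~12.5", "3.1", "approx", "~7")
comment    <- c("Result approx.", "value approximate", "RESULT IS APPROX", NA)

value <- suppressWarnings(as.numeric(value_text))
approx_regex <- "result approx|RESULT IS APPROX|value approx"

# All three conditions must hold for a record to be corrected.
still_na   <- is.na(value)
has_number <- grepl("[0-9]", value_text)
has_approx <- grepl(approx_regex, comment, ignore.case = TRUE)
fix <- still_na & has_number & has_approx

# Keep only digits and the decimal point, then convert.
value[fix] <- as.numeric(gsub("[^0-9.]", "", value_text[fix]))
approx_flag <- as.integer(fix)
```

The second record shows the case noted above: its comment contains approximate language, but the value was already numeric, so it is neither changed nor flagged.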

This should not result in a change in rows, but we still check: 0 rows are removed. The final row count after this is 175.7 thousand.

Bar plot showing the row count update for the current harmonization step.


9.2.5 Clean values with “greater than” data

The next step is similar to the MDL and approximate value cleaning processes, and follows the approximate cleaning process most closely. The goal is to clean up values that were entered as “greater than” some value. The ResultMeasureValue column gets checked for all three of the following conditions:

  1. Numeric-only version of the column is still NA after MDL & approximate cleaning
  2. The original column text contained a number
  3. The original column text contained a >

We then use the “greater than” value (without >) as the record’s value moving forward.

Records with corrected values based on the above method are noted with a 1 in the greater_flag column.
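
The “greater than” check can be sketched the same way in base R. As before, the input vector is made-up data and the `gsub()` extraction assumes a single number in the text.

```r
# Hypothetical example values.
value_text <- c(">200", "15.2", "> 45.5", "high")

value <- suppressWarnings(as.numeric(value_text))

# All three conditions: still NA, contains a number, contains ">".
fix <- is.na(value) &
  grepl("[0-9]", value_text) &
  grepl(">", value_text, fixed = TRUE)

# Use the number without the ">" as the record's value.
value[fix] <- as.numeric(gsub("[^0-9.]", "", value_text[fix]))
greater_flag <- as.integer(fix)
```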

This should not result in a change in rows, but we still check: 0 rows are removed. The final row count after this is 175.7 thousand.

Bar plot showing the row count update for the current harmonization step.


9.2.6 Drop unresolved NA measurements

The goal of the preceding three steps was to prevent records with seemingly missing measurement data from being dropped if there was still a chance of recovering a usable value. At this point we’ve finished with that process and we proceed to check for remaining records with NA values in their harmonized_value column. If they exist, they are dropped. We also filter out any negative values in the dataset at this point because CDOM cannot be negative. The exception is the “Absorption spectral slope” parameters, for which we take the absolute value instead of removing the record.
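
As a hedged illustration of this step in base R (the `harmonized` data frame is made-up example data; only the `harmonized_value` column name and the parameter names come from the text):

```r
# Hypothetical example records.
harmonized <- data.frame(
  parameter = c("Absorbance at 254 nm", "Absorbance at 254 nm",
                "Absorption spectral slope, 275 to 295 nm", "FDOM", "SUVA"),
  harmonized_value = c(0.12, -0.3, -0.015, NA, 2.4),
  stringsAsFactors = FALSE
)

# Spectral slope parameters keep their magnitude instead of being dropped.
is_slope <- grepl("^Absorption spectral slope", harmonized$parameter)
harmonized$harmonized_value[is_slope] <-
  abs(harmonized$harmonized_value[is_slope])

# Drop unresolved NA values and any remaining negative values.
keep <- !is.na(harmonized$harmonized_value) & harmonized$harmonized_value >= 0
cleaned <- harmonized[keep, ]
```

Taking the absolute value before filtering matters here: reversing the order would drop the negative slope record before it could be corrected.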

6 rows are removed. The final row count after this is 175.7 thousand.

Bar plot showing the row count update for the current harmonization step.


9.2.7 Harmonize record units

The next step in CDOM harmonization is converting the units of WQP records. Records that don’t make sense for CDOM or can’t be converted to a standardized unit are dropped. We standardize units to “AU/m” unless the parameter is “SUVA” (“L/mgDOC*m”), “FDOM” (“RFU”, “RU”, “ug/l QSE”), “Fluorescence index” (“None”), or one of the “Absorption spectral slope” variants (“nm-1”). We use the unit conversion table below for this.

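| ResultMeasure.MeasureUnitCode | conversion |
|---|---|
| AU/cm | 100 |
| units/cm | 100 |
| #/cm | 100 |
| cm | 100 |
| None | 1 |
| nm | 100 |
| NA | 1 |
| m | 1 |
| L/mg-cm | 100 |
| L/mgDOC*m | 1 |
| mg/l | 1000 |
| ug/L | 1 |
| ug/l QSE | 1 |
| RFU | 1 |
| RU | 1 |
| per m | 1 |
| nm-1 | 1 |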


Note that in the table above “nm” units are given a conversion factor of 100. This is because we assessed that these records are likely intended to use cm-1 units. For example, the DetectionQuantitationLimitMeasure.MeasureUnitCode for these records is “units/cm”, their measured values fall within expected ranges for natural waters for absorbance at 254 nm reported in cm-1 units, and the detection limit listed as 0.005 also aligns with cm-1 measurements. Additionally, the table lists “None” with a conversion factor of 1, which allows it to be retained as-is for records with the “Fluorescence index” parameter. Records with parameter variants of “Absorption spectral slope” are often provided in the WQP with units of “None” as well, but we convert these to “nm-1” in this step.
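
The conversion can be sketched in base R as a lookup against the table above followed by a multiplication. This is a simplified illustration: `unit_table`, `recs`, and the `harmonized_units` column are hypothetical names, the example records are made up, and the unit labelling for parameters other than absorbance, SUVA, and spectral slope simply keeps the original unit.

```r
# Conversion factors copied from the table above (NA is a valid unit entry).
unit_table <- data.frame(
  unit = c("AU/cm", "units/cm", "#/cm", "cm", "None", "nm", NA, "m",
           "L/mg-cm", "L/mgDOC*m", "mg/l", "ug/L", "ug/l QSE", "RFU", "RU",
           "per m", "nm-1"),
  conversion = c(100, 100, 100, 100, 1, 100, 1, 1,
                 100, 1, 1000, 1, 1, 1, 1, 1, 1),
  stringsAsFactors = FALSE
)

# Hypothetical example records; "ppb" is not in the table.
recs <- data.frame(
  parameter = c("Absorbance at 254 nm", "SUVA",
                "Absorption spectral slope, 275 to 295 nm",
                "Absorbance at 440 nm", "FDOM"),
  value = c(0.25, 0.03, 0.015, 2.1, 40),
  unit  = c("units/cm", "L/mg-cm", "None", "per m", "ppb"),
  stringsAsFactors = FALSE
)

# match() also pairs NA units with the NA row of the table.
idx <- match(recs$unit, unit_table$unit)
recs$harmonized_value <- recs$value * unit_table$conversion[idx]

# Records whose unit has no entry in the table are dropped.
recs <- recs[!is.na(idx), ]

# Label the standardized unit by parameter (simplified).
recs$harmonized_units <- ifelse(
  grepl("^Absorbance", recs$parameter), "AU/m",
  ifelse(recs$parameter == "SUVA", "L/mgDOC*m",
         ifelse(grepl("^Absorption spectral slope", recs$parameter),
                "nm-1", recs$unit))
)
```

Here the “ppb” record is dropped, the “units/cm” absorbance is scaled by 100 to “AU/m”, and the spectral slope record reported as “None” is relabelled “nm-1” with its value unchanged.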


Below is a pie chart that breaks down the different unit codes that were dropped in the unit harmonization process, and how many records were lost with each code. (Note that if there is no pie chart below that is because no unit codes were dropped).


ResultMeasure.MeasureUnitCode mismatched codes


0 rows are removed. The final row count after this is 175.7 thousand.

Bar plot showing the row count update for the current harmonization step.


9.2.8 Clean depth data

The next harmonization step cleans the four depth-related columns obtained from the WQP. The details behind this step are covered in the Depth flags section of the Tiering, flagging, and quality control philosophy chapter.

This should not result in a change in rows, but we still check: 0 rows are removed. The final row count after this is 175.7 thousand.

Bar plot showing the row count update for the current harmonization step.


9.2.9 Filter and tier methods

We next review information related to the analytical methods, sample fraction, and result units for each record to construct hierarchical tiers as described in the Method tiering section of the Tiering, flagging, and quality control philosophy chapter.

In preparation for tiering, we check the ResultAnalyticalMethod.MethodName column for names that indicate non-CDOM measurements.

Finally, we group the data into three tiers as described in Tiering, flagging, and quality control philosophy. Broadly, these tiers are:


| Tier | Name | Description | CDOM Details |
|---|---|---|---|
| 0 | Restrictive | Data that are verifiably self-similar across organizations and time-periods and can be considered highly reliable and interoperable | Has appropriate (and consistent) information in CharacteristicName, methods columns, and ResultMeasure.MeasureUnitCode. ResultSampleFractionText is filtered |
| 1 | Narrowed | Data that we have good reason to believe are self-similar, but for which we can’t verify full interoperability across data providers | There is a mismatch in information, e.g., between the methods columns |
| 2 | Inclusive | Data that are assumed to be reliable and are harmonized to our best ability given the information available from the data provider. This tier includes NA or non-resolvable descriptions for the analytical method, which often make up the majority of methods descriptions for any given parameter | ResultSampleFractionText is reported as total or important information is missing from one or more relevant columns |


At this point we create a file (in/cdom/cdom_tiering_record.csv) that records how specific method text was tiered and how many rows corresponded to each method. This file is also included as an interactive HTML table below to illustrate how the tiering decisions in the table above are carried out.

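# Render the tiering record as a scrollable table
# (kable() is from knitr; kable_paper() and scroll_box() are from kableExtra)
tiering_record %>%
  kable() %>%
  kable_paper() %>%
  scroll_box(width = "750px", height = "500px")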
parameter CharacteristicName ResultAnalyticalMethod.MethodName USGSPCode ResultAnalyticalMethod.MethodIdentifier ResultAnalyticalMethod.MethodIdentifierContext ResultMeasure.MeasureUnitCode ResultSampleFractionText tier min_value max_value n
SUVA UV 254 Computation by NWIS algorithm 63162 ALGOR USGS L/mgDOC*m Dissolved 0 0.0000000 1.980000e+02 37767
Absorbance at 254 nm UV 254 UV Absorb, 254 nm, Supor (NWQL) 50624 UV006 USGS units/cm Dissolved 0 0.7000000 1.260000e+02 12301
Absorbance at 254 nm UV 254 UV Absorb, 254 nm, GFF (NWQL) 50624 UV005 USGS units/cm Dissolved 0 1.0000000 6.320000e+02 12301
Absorbance at 280 nm Absorbance at 280 nanometers UV Absorb, 280 nm, Supor (annon) 61726 UV002 USGS units/cm Dissolved 0 0.2000000 9.810000e+01 12272
Absorbance at 280 nm Absorbance at 280 nanometers UV Absorb, 280 nm, GFF (NWQL) 61726 UV007 USGS units/cm Dissolved 0 0.7000000 4.950000e+02 12221
Absorbance at 254 nm UV 254 UV absorbance, wf, 254 nm 50624 UV008 USGS units/cm Dissolved 0 0.0000000 1.905000e+02 6856
Absorbance at 254 nm UV 254 UV Absorb, 254 nm,GFF (USGS-NYL) 50624 UV019 USGS units/cm Dissolved 0 0.3000000 1.680000e+02 5991
Absorbance at 254 nm UV 254 NA 50624 NA NA units/cm Dissolved 0 0.0000000 2.140000e+02 4980
SUVA UV 254 SUVA, wf, calculation 63162 UV003 USGS L/mgDOC*m Dissolved 0 0.1000000 1.000000e+01 4759
Absorption spectral slope, 275 to 295 nm Absorption spectral slope (Sag) Sag, specified wavelength range 32300 ABS02 USGS None Dissolved 0 0.0077000 2.840000e-02 2680
Absorbance at 280 nm Absorbance at 280 nanometers Absorbance, 200 to 800 nm 32296 ABS01 USGS AU/cm Dissolved 0 0.0700000 1.135000e+02 2678
Absorbance at 370 nm Absorbance at 370 nanometers Absorbance, 200 to 800 nm 32297 ABS01 USGS AU/cm Dissolved 0 0.0000000 2.369000e+01 2676
Absorption spectral slope, 290 to 350 nm Absorption spectral slope (Sag) Sag, specified wavelength range 32301 ABS02 USGS None Dissolved 0 0.0000000 2.770000e-02 2670
Absorption spectral slope, 350 to 400 nm Absorption spectral slope (Sag) Sag, specified wavelength range 32302 ABS02 USGS None Dissolved 0 0.0071000 4.930000e-02 2662
Absorbance at 440 nm Absorbance at 440 nm Absorbance, 200 to 800 nm 32299 ABS01 USGS AU/cm Dissolved 0 0.0500000 8.470000e+00 2639
Absorbance at 412 nm Absorbance at 412 nm Absorbance, 200 to 800 nm 32298 ABS01 USGS AU/cm Dissolved 0 0.0500000 1.094000e+01 2629
Fluorescence index Fluorescence index Fluorescence scan, 250-620 nm 32312 FL020 USGS None Dissolved 0 1.1340000 2.171000e+00 2476
FDOM Fluorescence, excitation 300 emission 390 Fluorescence scan, 250-620 nm 32309 FL020 USGS RU Dissolved 0 0.0030000 5.971000e+00 2476
FDOM Fluorescence, excitation 340 emission 440 Fluorescence scan, 250-620 nm 52901 FL020 USGS RU Dissolved 0 0.0010000 7.140000e+00 2474
FDOM Fluorescence, excitation 370 emission 460 Fluorescence scan, 250-620 nm 52902 FL020 USGS RU Dissolved 0 0.0010000 5.260000e+00 2419
FDOM Fluorescence, excitation 260 emission 450 Fluorescence scan, 250-620 nm 32304 FL020 USGS RU Dissolved 0 0.0096000 1.502000e+01 2379
FDOM Fluorescence, excitation 390 emission 510 Fluorescence scan, 250-620 nm 32307 FL020 USGS RU Dissolved 0 0.0015000 3.236000e+00 2338
FDOM Fluorescence, excitation 280 emission 370 Fluorescence scan, 250-620 nm 32310 FL020 USGS RU Dissolved 0 0.0117000 5.096000e+00 2319
Absorbance at 254 nm UV 254 UV Absorb, 254 nm, silver (NWQL) 50624 UV004 USGS units/cm Dissolved 0 0.1000000 1.430000e+02 1990
SUVA UV 254 NA 63162 NA NA L/mgDOC*m Dissolved 0 0.3000000 6.300000e+00 1809
FDOM Fluorescence, excitation 275 emission 340 Fluorescence scan, 250-620 nm 32311 FL020 USGS RU Dissolved 0 0.0026000 3.517000e+00 1713
Absorbance at 280 nm Absorbance at 280 nanometers UV Absorb, 280 nm,silver (annon) 61726 UV001 USGS units/cm Dissolved 0 0.1000000 9.930000e+06 1697
FDOM Colored dissolved organic matter (CDOM) Colorized Dissolved Organic Matter NA FLUORO COEOMAHA_WQX ug/L NA 2 10.0000000 2.680000e+02 1683
Absorbance at 254 nm Specific UV Absorbance at 254 nm 5910 B ~ UV - Absorbing Organic Compounds NA 5910-B APHA #/cm Dissolved 0 0.0008584 6.520000e+01 1577
FDOM Colored dissolved organic matter (CDOM) Turner Designs Trilogy Fluorometer with CDOM specific excitation/emission filters for monitoring CDOM in natural waters NA TurnerDesigns-CDOM 31DELRBC_WQX RFU Dissolved 2 472.5600000 9.261500e+03 1557
Absorption spectral slope, 412 to 676 nm Absorption spectral slope (Sag) Sag, specified wavelength range 32303 ABS02 USGS None Dissolved 0 0.0029000 2.920000e-02 1509
FDOM Colored dissolved organic matter (CDOM) NA 32295 NA NA ug/l QSE Dissolved 0 0.0000000 3.230000e+04 1449
Absorbance at 440 nm Absorbance at 440 nm Dissolved Organic Matter Absorption Coefficient (Gelbstoff) NA L01~CDOM_440 CBP_WQX per m NA 2 0.1228000 2.246920e+01 1414
Absorption spectral slope, 400 to 500 nm Absorbance at 440 nm Slope Of CDOM Absorption Coefficient Spectrum (400 To 500 Nm) NA L01~CDOM_SLOPE CBP_WQX nm-1 NA 2 0.0075000 2.580000e-02 1414
FDOM Colored dissolved organic matter (CDOM) NA 32322 NA NA RFU Total 2 0.0000000 4.060000e+03 1345
Absorption spectral slope, 412 to 600 nm Absorption spectral slope (Sag) Sag, specified wavelength range 32331 ABS02 USGS None Dissolved 0 0.0030000 1.700000e-02 973
Absorbance at 254 nm UV 254 5910 B ~ UV - Absorbing Organic Compounds NA 5910-B APHA units/cm Total 2 1.6000000 1.460000e+02 821
SUVA Specific UV Absorbance at 254 nm UV absorbance, 254 nm NA UV absorbance, 254 nm YRITWC L/mg-cm Dissolved 0 0.0000000 1.296472e+05 816
Absorbance at 254 nm UV 254 UV absorbance, 254 nm NA UV absorbance, 254 nm YRITWC nm Dissolved 2 0.0000000 7.151227e+02 770
Absorbance at 440 nm Absorbance at 440 nm Gross Alpha and Beta Activity in Water NA 00-01 USEPA units/cm Total 2 0.1000000 2.130000e+01 719
FDOM Fluorescence, excitation 275 emission 304 Fluorescence scan, 250-620 nm 32305 FL020 USGS RU Dissolved 0 0.0041000 1.075000e+00 699
Absorbance at 254 nm UV 254 Absorbance, 254 nm, wf (SM5910) 50624 UV011 USGS units/cm Dissolved 0 0.0000000 3.510000e+01 634
FDOM Colored dissolved organic matter (CDOM) YSI EXO fluorometer, 365/480 nm 32295 FL017 USGS ug/l QSE Dissolved 0 0.1000000 2.117000e+02 631
Absorbance at 254 nm UV 254 Determination of Total Organic Carbon and Specific UV Absorbance at 254 nm in Source Water and Drinking Water NA 415.3 USEPA units/cm Dissolved 0 1.5000000 3.880000e+01 400
FDOM Colored dissolved organic matter (CDOM) NA 32289 NA NA mg/l Dissolved 2 670.0000000 2.510000e+03 385
Absorbance at 440 nm Absorbance at 440 nm Determination of UV Absorbance at 440 nm in Water, by MDH modified method NA MDH055 MNPCA units/cm Total 2 0.1000000 8.150000e+00 324
Absorbance at 254 nm UV 254 UV Absorbance at 254 nm NA UVA_UV_254 11NPSWRD_WQX #/cm Dissolved 0 1.3000000 1.840000e+01 319
FDOM Colored dissolved organic matter (CDOM) Turner Designs fluoro.,365/470nm 32295 FL013 USGS ug/l QSE Dissolved 0 37.9000000 1.087000e+03 207
Absorbance at 254 nm UV 254 5910 B ~ UV - Absorbing Organic Compounds NA 5910-B APHA units/cm NA 2 2.8000000 1.370000e+02 194
Fluorescence index Fluorescence index Fluorescence Index 32312 ABS06 USGS None Dissolved 2 1.0000000 2.000000e+00 187
FDOM Fluorescence, excitation 260 emission 450 Spectrometry - 5 x 2 matrix scan 32304 ABS04 USGS RU Dissolved 0 0.0480000 2.452000e+00 187
FDOM Fluorescence, excitation 275 emission 340 Spectrometry - 5 x 2 matrix scan 32311 ABS04 USGS RU Dissolved 0 0.0120000 6.921000e-01 187
FDOM Fluorescence, excitation 280 emission 370 Spectrometry - 5 x 2 matrix scan 32310 ABS04 USGS RU Dissolved 0 0.0190000 1.048000e+00 187
FDOM Fluorescence, excitation 300 emission 390 Spectrometry - 5 x 2 matrix scan 32309 ABS04 USGS RU Dissolved 0 0.0220000 1.218000e+00 187
FDOM Fluorescence, excitation 340 emission 440 Spectrometry - 5 x 2 matrix scan 32332 ABS04 USGS RU Dissolved 0 0.0290000 1.312000e+00 187
FDOM Fluorescence, excitation 370 emission 460 Spectrometry - 5 x 2 matrix scan 32333 ABS04 USGS RU Dissolved 0 0.0200000 1.105000e+00 187
FDOM Fluorescence, excitation 390 emission 510 Spectrometry - 5 x 2 matrix scan 32307 ABS04 USGS RU Dissolved 0 0.0110000 6.395000e-01 187
Absorbance at 412 nm Absorbance at 412 nm UV absorbance, wf, 412 nm 66700 UV010 USGS AU/cm Dissolved 0 0.4000000 1.570000e+01 165
Absorbance at 254 nm UV 254 5910 B ~ UV - Absorbing Organic Compounds NA 5910-B APHA units/cm Dissolved 0 18.4000000 2.930000e+02 165
FDOM Colored dissolved organic matter (CDOM) NA 32330 NA NA mg/l Total 2 120.0000000 2.964000e+04 154
SUVA Specific UV Absorbance at 254 nm, corrected for Fe 5910 B ~ UV - Absorbing Organic Compounds NA 5910-B APHA L/mg-cm Dissolved 0 246.0000000 4.510000e+02 153
FDOM Colored dissolved organic matter (CDOM) NA NA NA NA ug/L NA 2 19.0714216 1.065636e+02 130
Absorbance at 440 nm Absorption coefficient at 440 nm 5910 B ~ UV - Absorbing Organic Compounds NA 5910-B APHA #/cm NA 2 0.2053158 1.745181e+01 123
SUVA Specific UV Absorbance at 254 nm 5910 B ~ UV - Absorbing Organic Compounds NA 5910-B APHA L/mg-cm NA 2 1.2248924 4.844491e+00 123
FDOM Fluorescence, excitation 420 emission 460 Fluorescence scan, 250-620 nm 32355 FL020 USGS RU Dissolved 0 0.0215000 5.180000e-01 119
Absorbance at 254 nm UV Absorption, relative conc. of organic constituents 5910 B ~ UV - Absorbing Organic Compounds NA 5910-B APHA units/cm Total 2 9.0800000 1.460000e+02 97
Absorbance at 254 nm Specific UV Absorbance at 254 nm 5910 B ~ UV - Absorbing Organic Compounds NA 5910-B APHA units/cm Total 2 0.0300000 2.514000e+01 95
FDOM Fluorescence, excitation 275 emission 304 Spectrometry - 5 x 2 matrix scan 32305 ABS04 USGS RU Dissolved 0 0.0186000 2.625000e-01 93
Absorption spectral slope, 412 to 676 nm Absorption spectral slope (Sag) Sag, specified wavelength range 32303 ABS02 USGS NA Dissolved 0 0.0001329 1.718640e-02 85
Absorbance at 280 nm Absorbance at 280 nanometers NA 61726 NA NA units/cm Dissolved 0 0.3000000 6.780000e+01 78
Absorbance at 254 nm UV 254 Determination of Total Organic Carbon and Specific UV Absorbance at 254 nm in Source Water and Drinking Water NA 415.3 USEPA None NA 2 0.0120000 2.270000e-01 73
SUVA UV Absorption, relative conc. of organic constituents Determination of Total Organic Carbon and Specific UV Absorbance at 254 nm in Source Water and Drinking Water NA 415.3 USEPA None NA 2 1.6000000 4.400000e+00 73
FDOM Fluorescence, excitation 275 emission 304 Spectrometry - 5 x 2 matrix scan 32305 ABS04 USGS NA Dissolved 2 0.0017291 2.483674e-01 73
Absorbance at 440 nm Colored dissolved organic matter (CDOM) CDOM absorption (440nm) NA CDOM absorption RTI m NA 2 0.0793651 2.813000e+01 58
FDOM Colored dissolved organic matter (CDOM) YSI EXO, 365/480/80 nm 32295 FL029 USGS ug/l QSE Dissolved 0 5.2500000 9.320000e+01 54
FDOM Colored dissolved organic matter (CDOM) YSI EXO, 365/480/80 nm 32322 FL029 USGS RFU Total 2 1.7500000 2.448000e+01 48
Absorbance at 440 nm Absorbance at 440 nm Gross Alpha and Beta Activity in Water NA 00-01 USEPA NA Total 2 0.0000073 4.830000e-04 41
Absorbance at 440 nm Absorbance at 440 nm Determination of UV Absorbance at 440 nm in Water, by MDH modified method NA MDH055 MNPCA NA Total 2 0.0000239 3.582000e-03 38
Absorbance at 412 nm Absorbance at 412 nm UV absorbance, wf, 412 nm 66700 UV010 USGS NA Dissolved 2 0.0000468 5.267600e-03 34
Absorbance at 254 nm UV 254 Determination of Total Organic Carbon and Specific UV Absorbance at 254 nm in Source Water and Drinking Water NA 415.3 USEPA units/cm Total 2 4.5500000 4.280000e+01 34
Absorption spectral slope, 350 to 400 nm Absorption spectral slope (Sag) Sag, specified wavelength range 32302 ABS02 USGS NA Dissolved 0 0.0000533 8.895900e-03 18
FDOM Fluorescence, excitation 275 emission 304 Fluorescence scan, 250-620 nm 32305 FL020 USGS NA Dissolved 2 0.0009441 2.497560e-02 18
Absorption spectral slope, 290 to 350 nm Absorption spectral slope (Sag) Sag, specified wavelength range 32301 ABS02 USGS NA Dissolved 0 0.0000715 7.584400e-03 14
Absorbance at 254 nm UV 254 Determination of UV Absorbance at 440 nm in Water, by MDH modified method NA MDH055 MNPCA None Dissolved 2 1.0000000 1.000000e+00 11
Absorption spectral slope, 412 to 600 nm Absorption spectral slope (Sag) Sag, specified wavelength range 32331 ABS02 USGS NA Dissolved 0 0.0002522 2.449000e-03 10
Absorbance at 254 nm UV 254 Computation by NWIS algorithm 63162 ALGOR USGS NA Dissolved 2 0.0091553 7.818130e-02 8
Absorbance at 254 nm UV 254 5910 B ~ UV - Absorbing Organic Compounds NA 5910-B APHA units/cm None 2 0.0000000 3.540000e+00 6
FDOM Colored dissolved organic matter (CDOM) YSI EXO fluorometer, 365/480 nm 32322 FL017 USGS RFU Total 2 0.0600000 1.266000e+01 5
Absorbance at 254 nm UV 254 UV Absorb, 254 nm, Supor (NWQL) 50624 UV006 USGS NA Dissolved 2 0.0000515 1.520700e-03 5
Absorbance at 280 nm Absorbance at 280 nanometers UV Absorb, 280 nm, Supor (annon) 61726 UV002 USGS NA Dissolved 2 0.0001215 1.596300e-03 5
Absorption spectral slope, 275 to 295 nm Absorption spectral slope (Sag) Sag, specified wavelength range 32300 ABS02 USGS NA Dissolved 0 0.0002761 1.240770e-02 4
FDOM Colored dissolved organic matter (CDOM) NA 32314 NA NA ug/l QSE Dissolved 2 49.8000000 1.210000e+02 3
Absorbance at 254 nm UV 254 UV Absorb, 254 nm, silver (NWQL) 50624 UV004 USGS NA Dissolved 2 0.0014522 1.645300e-03 2
Fluorescence index Fluorescence index Fluorescence scan, 250-620 nm 32312 FL020 USGS NA Dissolved 0 0.0524613 2.471460e-01 2
Absorbance at 280 nm Absorbance at 280 nanometers UV Absorb, 280 nm, GFF (NWQL) 61726 UV007 USGS NA Dissolved 2 0.0004394 1.803810e-02 2
Absorbance at 254 nm UV 254 UV Absorb, 254 nm, GFF (NWQL) 50624 UV005 USGS NA Dissolved 2 0.0010575 1.838950e-02 2
Absorbance at 254 nm UV 254 NA 50624 NA NA NA Dissolved 2 0.0002493 2.774000e-04 2
Absorbance at 440 nm Absorbance at 440 nm Absorbance, 200 to 800 nm 32299 ABS01 USGS NA Dissolved 2 0.0006541 1.257400e-03 2
Absorbance at 254 nm UV 254 UV - Absorbing Organic Compounds NA 5910-B KAWNATON_WQX cm NA 2 12.4000000 1.240000e+01 1
Absorbance at 280 nm Absorbance at 280 nanometers UV Absorb, 280 nm,silver (annon) 61726 UV001 USGS NA Dissolved 2 0.0002786 2.786000e-04 1
Absorbance at 440 nm Absorbance at 440 nm Determination of UV Absorbance at 440 nm in Water NA USEPA 415.3_440 MNPCA #/cm Total 2 4.3300000 4.330000e+00 1
SUVA Specific UV Absorbance at 254 nm Determination of Total Organic Carbon and Specific UV Absorbance at 254 nm in Source Water and Drinking Water NA 415.3 USEPA L/mg-cm Total 2 258.2000000 2.582000e+02 1
SUVA Specific UV Absorbance at 254 nm, corrected for Fe Determination of Total Organic Carbon and Specific UV Absorbance at 254 nm in Source Water and Drinking Water NA 415.3 USEPA L/mg-cm Total 2 238.5000000 2.385000e+02 1
Absorbance at 254 nm UV 254 Determination of Total Organic Carbon and Specific UV Absorbance at 254 nm in Source Water and Drinking Water NA 415.3 USEPA #/cm Total 2 22.3300000 2.233000e+01 1
Absorbance at 280 nm Absorbance at 280 nanometers NA 61726 NA NA NA Dissolved 0 0.0000386 3.860000e-05 1
Absorbance at 254 nm UV 254 UV absorbance, wf, 254 nm 50624 UV008 USGS NA Dissolved 2 0.0001090 1.090000e-04 1
Absorbance at 280 nm Absorbance at 280 nanometers Absorbance, 200 to 800 nm 32296 ABS01 USGS NA Dissolved 2 0.0025234 2.523400e-03 1
Absorbance at 370 nm Absorbance at 370 nanometers Absorbance, 200 to 800 nm 32297 ABS01 USGS NA Dissolved 2 0.0014879 1.487900e-03 1


The above process drops 0 rows, leaving 175.7 thousand remaining.


Bar plot showing the row count update for the current harmonization step.


9.2.10 Flag based on field methods

Next, we use the field_flag column to flag records based on the type of sampling equipment that was used:

  • field_flag = 0 when discrete sampling methods were used. Indicated by detection of “grab”, “bucket”, “point”, “kemmerer”, “van dorn”, “bailer”, or “bottle” in SampleCollectionEquipmentName
  • field_flag = 1 when integrated sampling methods were used. Indicated by detection of “integrated”, “multiple”, or “pump” in SampleCollectionEquipmentName
  • field_flag = 2 for all other records
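
The keyword rules above can be sketched as follows. This is a minimal illustration in Python, not the pipeline's own code: the keyword lists come from the bullets above, while the function name `assign_field_flag`, the case-insensitive matching, and the choice to test discrete keywords before integrated ones are assumptions.

```python
# Keyword lists are copied from the text; everything else is illustrative.
DISCRETE_TERMS = ("grab", "bucket", "point", "kemmerer", "van dorn", "bailer", "bottle")
INTEGRATED_TERMS = ("integrated", "multiple", "pump")


def assign_field_flag(equipment_name):
    """Return 0 (discrete), 1 (integrated), or 2 (other) for one equipment name."""
    if equipment_name is None:
        return 2  # a missing SampleCollectionEquipmentName falls into "all other records"
    name = equipment_name.lower()  # assumption: matching ignores case
    # Assumption: discrete keywords take precedence if both kinds are present.
    if any(term in name for term in DISCRETE_TERMS):
        return 0
    if any(term in name for term in INTEGRATED_TERMS):
        return 1
    return 2


# Hypothetical equipment names, used only to exercise each branch.
examples = ["Water Bottle", "Van Dorn Bottle", "Peristaltic pump", "Probe/Sensor", None]
flags = [assign_field_flag(name) for name in examples]
```

Because the flag is only a label, every input record yields exactly one output value, which is consistent with no rows being dropped at this step.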

No records are removed by this process, so 0 rows are dropped, leaving 175.7 thousand in the harmonized CDOM dataset.

Bar plot showing the row count update for the current harmonization step.


9.2.11 Miscellaneous flag

Next, we populate the miscellaneous flag (misc_flag) column. In this dataset the miscellaneous flag marks measurements that fall outside published ranges (below a published minimum or above a published maximum), based on a review of the relevant literature. We flag these records rather than filtering them out; the only filter that still follows is the depth-related one in the next harmonization step.

For this flag column, values are as follows: 0 = does not exceed published range, 1 = below published minimum, and 2 = above published maximum. Absorption spectral slope parameters, SUVA, and Fluorescence index are checked for values below minimum or above maximum published records, while the rest of the parameter options are just checked for values above published maximums. We use the absolute value of absorption spectral slope when testing against thresholds.

The thresholds used and their citations are listed below, followed by a table of record counts by parameter, unit, and flag value:

Parameter Units Min value Max value Source
Absorbance at 254 nm AU/m - 450 Yates et al. 2023
Absorbance at 280 nm AU/m - 36.1 Korak and McKay 2024
Absorbance at 370 nm AU/m - 11 Korak and McKay 2024
Absorbance at 412 nm AU/m - 14 We use the threshold for 440 nm (below) as a conservative estimate
Absorbance at 440 nm AU/m - 14 Brezonik et al. 2019
Absorption spectral slope (multiple) nm-1 0.0026 0.0473 Korak and McKay 2024
FDOM RU - 4.38 Korak and McKay 2024
FDOM ug/L, ug/l QSE - 84.4 Avouris et al. 2025
Fluorescence index None 1.31 2.18 Korak and McKay 2024
SUVA L/mgDOC*m 0.651 6.83 Korak and McKay 2024


parameter harmonized_units misc_flag n
Absorbance at 254 nm AU/m 0 49634
Absorbance at 254 nm AU/m 2 3
Absorbance at 280 nm AU/m 0 27431
Absorbance at 280 nm AU/m 2 1525
Absorbance at 370 nm AU/m 0 2653
Absorbance at 370 nm AU/m 2 24
Absorbance at 412 nm AU/m 0 2826
Absorbance at 412 nm AU/m 2 2
Absorbance at 440 nm AU/m 0 5347
Absorbance at 440 nm AU/m 2 12
Absorption spectral slope, 275 to 295 nm nm-1 0 2681
Absorption spectral slope, 275 to 295 nm nm-1 1 3
Absorption spectral slope, 290 to 350 nm nm-1 0 2679
Absorption spectral slope, 290 to 350 nm nm-1 1 5
Absorption spectral slope, 350 to 400 nm nm-1 0 2671
Absorption spectral slope, 350 to 400 nm nm-1 1 8
Absorption spectral slope, 350 to 400 nm nm-1 2 1
Absorption spectral slope, 400 to 500 nm nm-1 0 1414
Absorption spectral slope, 412 to 600 nm nm-1 0 973
Absorption spectral slope, 412 to 600 nm nm-1 1 10
Absorption spectral slope, 412 to 676 nm nm-1 0 1543
Absorption spectral slope, 412 to 676 nm nm-1 1 51
FDOM RFU 0 2955
FDOM RU 0 18391
FDOM RU 2 38
FDOM ug/L 0 1719
FDOM ug/L 2 633
FDOM ug/l QSE 0 2053
FDOM ug/l QSE 2 291
Fluorescence index None 0 2628
Fluorescence index None 1 37
SUVA L/mgDOC*m 0 43998
SUVA L/mgDOC*m 1 130
SUVA L/mgDOC*m 2 1374
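
The flagging rule described in this section can be sketched as follows. This is an illustration in Python under stated assumptions, not the pipeline's implementation: only a few thresholds from the table above are included, the function name `assign_misc_flag` is hypothetical, comparisons are assumed to be strict, and parameter/unit combinations without a listed threshold are assumed to receive a flag of 0.

```python
# Thresholds copied from the table in this section: (minimum or None, maximum).
THRESHOLDS = {
    ("Absorbance at 254 nm", "AU/m"): (None, 450.0),
    ("Fluorescence index", "None"): (1.31, 2.18),
    ("SUVA", "L/mgDOC*m"): (0.651, 6.83),
}
SLOPE_LIMITS = (0.0026, 0.0473)  # nm-1, shared by all spectral slope ranges


def assign_misc_flag(parameter, units, value):
    """Return 0 (within range), 1 (below minimum), or 2 (above maximum)."""
    if parameter.startswith("Absorption spectral slope"):
        low, high = SLOPE_LIMITS
        value = abs(value)  # the text tests the absolute value of the slope
    else:
        # Assumption: combinations with no listed threshold are never flagged.
        low, high = THRESHOLDS.get((parameter, units), (None, None))
    if low is not None and value < low:  # assumption: strict comparison
        return 1
    if high is not None and value > high:  # assumption: strict comparison
        return 2
    return 0


# Hypothetical values chosen to hit each outcome.
low_suva = assign_misc_flag("SUVA", "L/mgDOC*m", 0.5)
high_suva = assign_misc_flag("SUVA", "L/mgDOC*m", 7.0)
negative_slope = assign_misc_flag("Absorption spectral slope, 275 to 295 nm", "nm-1", -0.02)
```

Here the negative slope is compared as 0.02, which lies inside the published range, so it receives a flag of 0.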


No records are removed by this process, so 0 rows are dropped, leaving 175.7 thousand in the harmonized CDOM dataset.

Bar plot showing the row count update for the current harmonization step.


9.2.12 Remove unrealistic values

Before finalizing the dataset we check for and remove any records with depths greater than 592 m, the depth of the deepest lake in the U.S.

0 rows are removed. The final row count after this is 175.7 thousand.
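
A minimal sketch of such a depth check is shown below. It is illustrative Python rather than the pipeline's code, and it assumes that all three harmonized depth columns are tested, that depths are already in meters, and that records with missing depths are retained; the helper name `has_realistic_depth` and the sample records are hypothetical.

```python
MAX_LAKE_DEPTH_M = 592.0  # threshold stated in the text

# Assumption: each of the harmonized depth columns is checked.
DEPTH_COLUMNS = (
    "harmonized_top_depth_value",
    "harmonized_bottom_depth_value",
    "harmonized_discrete_depth_value",
)


def has_realistic_depth(record):
    """True unless a reported depth exceeds the threshold; missing depths pass."""
    depths = (record.get(column) for column in DEPTH_COLUMNS)
    return all(depth is None or depth <= MAX_LAKE_DEPTH_M for depth in depths)


# Hypothetical records: one shallow sample, one with no depth, one implausibly deep.
records = [
    {"harmonized_discrete_depth_value": 1.5},
    {"harmonized_discrete_depth_value": None},
    {"harmonized_top_depth_value": 0.0, "harmonized_bottom_depth_value": 900.0},
]
kept = [record for record in records if has_realistic_depth(record)]
```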

Bar plot showing the row count update for the current harmonization step.


9.2.13 Aggregate simultaneous records

The final step of CDOM harmonization is to aggregate simultaneous observations.

The aggregation process: any group of samples determined to be simultaneous is simplified into a single record containing the mean and coefficient of variation (CV) of the group. These groups can consist of either true duplicate entries in the WQP or records with non-identical values recorded at the same time and place and by the same organization (field and/or lab replicates/duplicates). The CV can be used to filter the dataset based on the amount of variability that is tolerable for a specific use case. Note, however, that many entries will have a CV that is NA because there are no simultaneous records, or 0 because the records are duplicates and all entries have the same harmonized_value.
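
The per-group summary can be illustrated with a short Python sketch. It assumes the CV is the sample standard deviation divided by the mean, which the text does not state explicitly, and the function name `mean_and_cv` is hypothetical; `None` stands in for NA.

```python
import statistics


def mean_and_cv(values):
    """Return (mean, CV) for one group of simultaneous harmonized values."""
    mean = statistics.fmean(values)
    if len(values) < 2 or mean == 0:
        # A single record has no spread to measure, so the CV is missing.
        # Assumption: a zero mean also yields a missing CV to avoid dividing by zero.
        return mean, None
    # Assumption: CV = sample standard deviation / mean.
    return mean, statistics.stdev(values) / mean


single = mean_and_cv([2.0])          # no simultaneous records: CV is missing
duplicates = mean_and_cv([2.0, 2.0])  # true duplicates: CV is 0
replicates = mean_and_cv([1.0, 3.0])  # differing replicates: CV is positive
```

The three calls correspond to the cases described above: a lone record, exact duplicates, and non-identical replicates.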

We identify simultaneous records to aggregate by creating identical subgroups (subgroup_id) from the following columns: parameter, OrganizationIdentifier, MonitoringLocationIdentifier, MonitoringLocationTypeName, ResolvedMonitoringLocationTypeName, ActivityStartDate, ActivityStartTime.Time, ActivityStartTime.TimeZoneCode, harmonized_tz, harmonized_local_time, harmonized_utc, ActivityStartDateTime, harmonized_top_depth_value, harmonized_top_depth_unit, harmonized_bottom_depth_value, harmonized_bottom_depth_unit, harmonized_discrete_depth_value, harmonized_discrete_depth_unit, depth_flag, mdl_flag, approx_flag, greater_flag, tier, field_flag, misc_flag, harmonized_units. This selection limits the columns included in the final dataset. We therefore also provide a copy of the AquaMatch dataset prior to its aggregation (pipeline target p3_cdom_preagg_grouped; available as an RDS file in the data release: AquaMatch_harmonize_WQP/_targets/objects/p3_cdom_preagg_grouped) that includes the subgroup_id column, so that users can work with the disaggregated data as well and make joins between dataset versions.

Note: We hold out any records where misc_flag is 1 or 2 from the aggregation step above, and as a result those records are not included in the p3_cdom_preagg_grouped object, nor its RDS export. The records are added back into the final dataset once the aggregation is complete, but they have NA for subgroup_id, harmonized_value_cv, and harmonized_row_count because they are never grouped or aggregated. We do this to allow end users to decide how they want to handle values that fall outside published ranges.

The final values are presented in the harmonized_value and harmonized_value_cv columns. The number of rows used per group is recorded in the harmonized_row_count column.
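
The grouping, hold-out, and re-attachment steps can be sketched together as follows. This is illustrative Python, not the pipeline's code: the grouping key is shortened to four of the columns listed above for readability, subgroup_id is assumed to be a simple integer counter, the CV is assumed to be the sample standard deviation divided by the mean, and the function name `aggregate` and the sample records are hypothetical.

```python
import statistics
from collections import defaultdict

# Assumption: a shortened key for readability; the text lists the full column set.
GROUP_COLUMNS = ("parameter", "MonitoringLocationIdentifier", "harmonized_utc", "misc_flag")


def aggregate(records):
    """Aggregate unflagged records by group, then append held-out records unchanged."""
    held_out = [r for r in records if r["misc_flag"] in (1, 2)]
    groups = defaultdict(list)
    for r in records:
        if r["misc_flag"] == 0:
            groups[tuple(r[c] for c in GROUP_COLUMNS)].append(r["harmonized_value"])

    result = []
    for subgroup_id, (key, values) in enumerate(groups.items(), start=1):
        mean = statistics.fmean(values)
        # CV is missing for single-record groups (and, by assumption, for a zero mean).
        cv = statistics.stdev(values) / mean if len(values) > 1 and mean != 0 else None
        row = dict(zip(GROUP_COLUMNS, key))
        row.update(subgroup_id=subgroup_id, harmonized_value=mean,
                   harmonized_value_cv=cv, harmonized_row_count=len(values))
        result.append(row)

    # Held-out records rejoin the output with missing grouping fields.
    for r in held_out:
        row = {c: r[c] for c in GROUP_COLUMNS}
        row.update(subgroup_id=None, harmonized_value=r["harmonized_value"],
                   harmonized_value_cv=None, harmonized_row_count=None)
        result.append(row)
    return result


def make_record(site, time, flag, value):
    # Hypothetical helper that builds a minimal SUVA record for the example.
    return {"parameter": "SUVA", "MonitoringLocationIdentifier": site,
            "harmonized_utc": time, "misc_flag": flag, "harmonized_value": value}


records = [
    make_record("SITE-1", "2020-06-01T12:00:00Z", 0, 1.0),
    make_record("SITE-1", "2020-06-01T12:00:00Z", 0, 3.0),
    make_record("SITE-2", "2020-06-01T12:00:00Z", 0, 5.0),
    make_record("SITE-2", "2020-06-01T12:00:00Z", 2, 500.0),
]
final = aggregate(records)
```

In this example the two SITE-1 records collapse into one row with a row count of 2, the unflagged SITE-2 record stays as a single-record group, and the flagged SITE-2 record is carried through with missing grouping fields.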

16,263 rows are dropped, leaving 159.5 thousand remaining in the final harmonized and aggregated CDOM dataset.

Bar plot showing the row count update for the current harmonization step.


9.2.14 Harmonized CDOM

At this point the harmonization of the CDOM data from the WQP is complete and we export the final dataset for use later in the workflow.

Below are several sets of figures illustrating different qualities of the dataset:

  1. Histograms showing the distribution of harmonized measurements (top) and CVs (bottom) broken down by tier after aggregating simultaneous records. There are very few simultaneous records in the dataset, so only a few CVs are produced.


Distribution of CDOM values by tier

Distribution of CDOM CVs by tier


  2. Maps showing the geographic distribution of records by tier within the US:

Geographic distribution of CDOM records by tier in the conterminous US


  3. Bar charts showing the distribution of values by tier across years, months, and days of the week:

Temporal distribution of CDOM records by tier across years

Temporal distribution of CDOM records by tier across months

Temporal distribution of CDOM records by tier across days of the week


  4. Lastly, bar charts showing the distribution of depth values by parameter, location type, and tier:

Distribution of harmonized_top_depth_value column values by parameter, tier, and ResolvedMonitoringLocationTypeName

Distribution of harmonized_bottom_depth_value column values by parameter, tier, and ResolvedMonitoringLocationTypeName

Distribution of harmonized_discrete_depth_value column values by parameter, tier, and ResolvedMonitoringLocationTypeName