6 DOC harmonization process

Following the completion of the {dataRetrieval} download process described previously, the pipeline contains raw WQP data for each parameter of interest. Before we harmonize each parameter, we run through a series of universal “pre-harmonization” steps, which ensure that the datasets are appropriately formatted when entering their harmonization routines.

The text below first walks through the pre-harmonization steps for the raw dissolved organic carbon (DOC) dataset and then delves into the specifics of the harmonization process.

6.1 Pre-harmonization of the raw DOC WQP dataset

At the start of the pre-harmonization process the raw DOC WQP dataset contains 2.7 million rows.


6.1.1 Missing results

First, records that have missing data are dropped from the dataset. Several criteria are used when checking for missing data; if any of the criteria below are met, the row is flagged as missing:

  1. Both the result column and the detection limit column are NA
  2. The result, result unit, activity comment, laboratory comment, and result comment columns are all NA
  3. The result comment column contains any user-provided text indicating a missing value, currently including: analysis lost, not analyzed, not recorded, not collected, or no measurement taken
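
A minimal sketch of this check using dplyr and stringr (raw_doc and doc_nonmissing are illustrative object names, not the pipeline’s actual code):

```r
library(dplyr)
library(stringr)

# Comment text indicating a missing value (criterion 3 above)
missing_phrases <- regex(
  "analysis lost|not analyzed|not recorded|not collected|no measurement taken",
  ignore_case = TRUE
)

doc_nonmissing <- raw_doc %>%
  mutate(
    missing_result =
      # 1. Both the result and the detection limit are NA
      (is.na(ResultMeasureValue) &
         is.na(DetectionQuantitationLimitMeasure.MeasureValue)) |
      # 2. Result, result unit, and all comment columns are NA
      (is.na(ResultMeasureValue) &
         is.na(ResultMeasure.MeasureUnitCode) &
         is.na(ActivityCommentText) &
         is.na(ResultLaboratoryCommentText) &
         is.na(ResultCommentText)) |
      # 3. The result comment contains missing-value language;
      #    `%in% TRUE` treats NA comments as "no match"
      str_detect(ResultCommentText, missing_phrases) %in% TRUE
  ) %>%
  filter(!missing_result)
```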

94.7 thousand rows are dropped, resulting in a final count of 2.7 million.


6.1.2 Filter status

The final step in pre-harmonization is to filter the ResultStatusIdentifier column to include only the following statuses:

  • "Accepted"
  • "Final"
  • "Historical"
  • "Validated"
  • "Preliminary"
  • NA

These statuses generally indicate that a reliable result has been reached; however, we also include NA to be conservative about retaining data. More specifically, when making decisions for this and other columns, we occasionally retain NA values if removing the records would otherwise drop 10% or more of the available data.
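
In code this amounts to a single filter; a sketch, continuing from the illustrative objects above:

```r
library(dplyr)

ok_status <- c("Accepted", "Final", "Historical", "Validated", "Preliminary")

doc_status <- doc_nonmissing %>%
  filter(ResultStatusIdentifier %in% ok_status | is.na(ResultStatusIdentifier))
```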

This step removes 2037 rows, leaving 2.6 million rows remaining.


6.2 Harmonization-ready DOC dataset

Once ready for harmonization, the DOC WQP dataset contains the following CharacteristicNames: Organic carbon, Total carbon, and Non-purgeable Organic Carbon (NPOC).

These names were chosen to select only those measurements that pertain to DOC.


6.2.1 Filter media and fractions

We next ensure that the media type for all DOC data is "Surface Water", "Water", "Estuary", or NA. Additionally, at this step we select a subset of CharacteristicName and ResultSampleFractionText combinations that are relevant to DOC. These are:


CharacteristicName                    ResultSampleFractionText
Non-purgeable Organic Carbon (NPOC)   Dissolved
Non-purgeable Organic Carbon (NPOC)   Filtered, lab
Non-purgeable Organic Carbon (NPOC)   None
Non-purgeable Organic Carbon (NPOC)   Total
Non-purgeable Organic Carbon (NPOC)   NA
Organic carbon                        Dissolved
Organic carbon                        Filtered, lab
Organic carbon                        Filtered, field
Organic carbon                        Filterable
Organic carbon                        Total
Organic carbon                        Organic
Total carbon                          Filterable
Total carbon                          NA
Total carbon                          Total
Total carbon                          Organic
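
A sketch of both filters, assuming the media type is stored in ActivityMediaName (the pipeline may use a different or derived column) and using a lookup table plus semi_join for the combinations:

```r
library(dplyr)
library(tibble)

# Media filter: retain water-related media and NA
doc_media <- doc_status %>%
  filter(ActivityMediaName %in% c("Surface Water", "Water", "Estuary") |
           is.na(ActivityMediaName))

# Acceptable CharacteristicName x ResultSampleFractionText combinations
# (abbreviated; see the full table above)
keep_fractions <- tribble(
  ~CharacteristicName,                   ~ResultSampleFractionText,
  "Non-purgeable Organic Carbon (NPOC)", "Dissolved",
  "Non-purgeable Organic Carbon (NPOC)", "Filtered, lab",
  "Organic carbon",                      "Dissolved",
  "Total carbon",                        "Total"
  # ... remaining combinations from the table
)

# semi_join() keeps only rows whose combination appears in the lookup table
doc_fractions <- doc_media %>%
  semi_join(keep_fractions,
            by = c("CharacteristicName", "ResultSampleFractionText"))
```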


Below are pie charts that break down the types of fractions dropped for each CharacteristicName, including record counts:

6.2.1.1 Non-purgeable organic carbon (NPOC) fractions dropped

Non-purgeable organic carbon (NPOC) ResultSampleFractionText types dropped

6.2.1.2 Organic carbon fractions dropped

Organic carbon ResultSampleFractionText types dropped

6.2.1.3 Total carbon fractions dropped

Total carbon ResultSampleFractionText types dropped


303 thousand rows are removed. The final row count after this is 2.3 million.


6.2.2 Document and remove fails

In this step we filter out records based on indications that they have failed data quality assurance or quality control for some reason given by the data provider (these instances are referred to here as “failures”).

After reviewing the contents of the ActivityCommentText, ResultLaboratoryCommentText, ResultCommentText, ResultDetectionConditionText, and ResultMeasureValue_original columns, we developed a list of terms that captured the majority of instances where records had failures or unacceptable measurements. (Note: ResultMeasureValue_original is a duplicate, character version of the ResultMeasureValue column that we use as a reference for the column’s contents before it was converted to a numeric type.) We found the phrasing to be consistent across columns, so we searched for the same (case-insensitive) terms in all five columns. The terms are: “beyond accept”, “cancelled”, “contaminat”, “error”, “fail”, “improper”, “interference”, “invalid”, “no result”, “no test”, “not accept”, “outside of accept”, “problem”, “questionable”, “suspect”, “unable”, “violation”, “reject”, “no data”, “time exceed”, “value extrapolated”, “exceeds”, “biased”, “parameter not required”, “not visited”.
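
A sketch of the search, collapsing the terms into one case-insensitive pattern and testing all five columns at once with if_any():

```r
library(dplyr)
library(stringr)

fail_terms <- c("beyond accept", "cancelled", "contaminat", "error", "fail",
                "improper", "interference", "invalid", "no result", "no test",
                "not accept", "outside of accept", "problem", "questionable",
                "suspect", "unable", "violation", "reject", "no data",
                "time exceed", "value extrapolated", "exceeds", "biased",
                "parameter not required", "not visited")

fail_pattern <- regex(paste(fail_terms, collapse = "|"), ignore_case = TRUE)

doc_clean <- doc_fractions %>%
  filter(
    # if_any() is TRUE when any of the five columns matches a failure phrase;
    # `%in% TRUE` treats NA (empty) comments as "no match"
    !if_any(c(ActivityCommentText, ResultLaboratoryCommentText,
              ResultCommentText, ResultDetectionConditionText,
              ResultMeasureValue_original),
            ~ str_detect(.x, fail_pattern) %in% TRUE)
  )
```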

Below are pie charts that break down the number of failure detections by column. Note that the plotting below is automated, so if one or more of the columns listed above is not plotted, the column(s) did not return any matches for the failure phrases. Also note that a single record can contain multiple failure phrases; therefore, failure phrases are not mutually exclusive.


6.2.2.1 ActivityCommentText fail detects

ActivityCommentText failure detects

6.2.2.2 ResultCommentText fail detects

ResultCommentText failure detects

6.2.2.3 ResultLaboratoryCommentText fail detects

ResultLaboratoryCommentText failure detects


6.2.2.4 ResultDetectionConditionText fail detects

ResultDetectionConditionText failure detects

6.2.2.5 ResultMeasureValue_original fail detects

ResultMeasureValue_original failure detects


58.5 thousand rows are removed after detecting failure-related phrases, and 2.3 million rows remain.


6.2.3 Clean MDLs

In this step, method detection limits (MDLs) are used to clean up the reported values. When a numeric value is missing for a data record (i.e., NA, or text that became NA during an as.numeric call), we check for non-detect language in the ResultLaboratoryCommentText, ResultCommentText, ResultDetectionConditionText, and ResultMeasureValue columns. This language can be "non-detect", "not detect", "non detect", "undetect", or "below".

If non-detect language exists, we use the DetectionQuantitationLimitMeasure.MeasureValue column for the MDL; otherwise, if there is a "<" and a number in the ResultMeasureValue column, we use that number as the MDL instead.

We then use a random number between 0 and 0.5 * MDL as the record’s value moving forward.

We produce a new column, mdl_flag, from the MDL cleaning process. Records where no MDL-based adjustment was made and which are at or above the MDL are assigned a 0. Records with corrected values based on the MDL method are assigned a 1. Finally, records where no MDL-based adjustment was made and which contain a numeric value below the provided MDL are assigned a 2.
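
A sketch of the MDL logic (a simplification of the real routine; ResultMeasureValue_original is used for the text checks because ResultMeasureValue is numeric at this stage):

```r
library(dplyr)
library(stringr)

nd_pattern <- regex("non-detect|not detect|non detect|undetect|below",
                    ignore_case = TRUE)

doc_mdl <- doc_clean %>%
  mutate(
    non_detect = if_any(c(ResultLaboratoryCommentText, ResultCommentText,
                          ResultDetectionConditionText,
                          ResultMeasureValue_original),
                        ~ str_detect(.x, nd_pattern) %in% TRUE),
    # MDL source: the reported detection limit, else a "<" value in the text
    mdl = case_when(
      is.na(ResultMeasureValue) & non_detect ~
        DetectionQuantitationLimitMeasure.MeasureValue,
      is.na(ResultMeasureValue) &
        str_detect(ResultMeasureValue_original, "<") %in% TRUE &
        str_detect(ResultMeasureValue_original, "[0-9]") %in% TRUE ~
        as.numeric(str_remove_all(ResultMeasureValue_original, "[^0-9.]")),
      TRUE ~ NA_real_
    ),
    # Substitute a uniform random draw between 0 and 0.5 * MDL;
    # rows without an MDL keep their reported numeric value
    harmonized_value = if_else(!is.na(mdl),
                               runif(n()) * 0.5 * mdl,
                               ResultMeasureValue),
    mdl_flag = case_when(
      !is.na(mdl) ~ 1,  # value substituted from the MDL
      harmonized_value <
        DetectionQuantitationLimitMeasure.MeasureValue ~ 2,  # below MDL, unadjusted
      TRUE ~ 0          # at/above MDL (or no MDL), unadjusted
    )
  )
```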

This step should not change the number of rows, but we check anyway: 0 rows are removed. The final row count after this is 2.3 million.


6.2.4 Clean approximate values

Cleaning approximate values involves a process similar to MDL cleaning: we flag “approximated” values in the dataset. A record is treated as approximate when all three of the following conditions are met:

  1. The numeric version of the ResultMeasureValue column is still NA after MDL cleaning
  2. The original ResultMeasureValue text contained a number
  3. Any of ResultLaboratoryCommentText, ResultCommentText, or ResultDetectionConditionText match this regular expression, ignoring case: "result approx|RESULT IS APPROX|value approx"

We then use the approximate value as the record’s value moving forward.

Records with corrected values based on the above method are noted with a 1 in the approx_flag column. Note, however, that occasionally approximate language appears in a record but the record is not changed or flagged. This occurs when the language is used in a comment-related column and a usable numeric value is still provided in the result column, meaning no correction is needed.
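
A sketch mirroring the MDL logic above:

```r
library(dplyr)
library(stringr)

approx_pattern <- regex("result approx|RESULT IS APPROX|value approx",
                        ignore_case = TRUE)

doc_approx <- doc_mdl %>%
  mutate(
    approx_flag = if_else(
      is.na(harmonized_value) &                                       # 1. still NA
        str_detect(ResultMeasureValue_original, "[0-9]") %in% TRUE &  # 2. has a number
        if_any(c(ResultLaboratoryCommentText, ResultCommentText,
                 ResultDetectionConditionText),
               ~ str_detect(.x, approx_pattern) %in% TRUE),           # 3. approx language
      1L, 0L),
    # Recover the numeric value from the original result text
    harmonized_value = if_else(
      approx_flag == 1L,
      as.numeric(str_remove_all(ResultMeasureValue_original, "[^0-9.]")),
      harmonized_value)
  )
```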

This step should not change the number of rows, but we check anyway: 0 rows are removed. The final row count after this is 2.3 million.


6.2.5 Clean values with “greater than” data

The next step is similar to the MDL and approximate value cleaning processes, and follows the approximate value cleaning process most closely. The goal is to clean up values that were entered as “greater than” some value. A record is corrected when all three of the following conditions are met:

  1. The numeric version of the ResultMeasureValue column is still NA after MDL and approximate value cleaning
  2. The original ResultMeasureValue text contained a number
  3. The original ResultMeasureValue text contained a >

We then use the “greater than” value (without >) as the record’s value moving forward.

Records with corrected values based on the above method are noted with a 1 in the greater_flag column.
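
The corresponding sketch differs from the approximate-value step only in its third condition and its flag column:

```r
library(dplyr)
library(stringr)

doc_greater <- doc_approx %>%
  mutate(
    greater_flag = if_else(
      is.na(harmonized_value) &                                       # 1. still NA
        str_detect(ResultMeasureValue_original, "[0-9]") %in% TRUE &  # 2. has a number
        str_detect(ResultMeasureValue_original, ">") %in% TRUE,       # 3. has a ">"
      1L, 0L),
    # Keep the numeric part of the ">" entry as the value moving forward
    harmonized_value = if_else(
      greater_flag == 1L,
      as.numeric(str_remove_all(ResultMeasureValue_original, "[^0-9.]")),
      harmonized_value)
  )
```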

This step should not change the number of rows, but we check anyway: 0 rows are removed. The final row count after this is 2.3 million.


6.2.6 Drop unresolved NA measurements

The goal of the preceding three steps was to prevent records with seemingly missing measurement data from being dropped when there was still a chance of recovering a usable value. With that process complete, we check for remaining records with NA values in the harmonized_value column and drop them. We also filter out any negative values in the dataset at this point.

3920 rows are removed. The final row count after this is 2.3 million.


6.2.7 Harmonize record units

The next step in DOC harmonization is converting the units of WQP records. We create the following conversion table, which is used to translate units provided in the WQP into milligrams per liter (mg/L):

ResultMeasure.MeasureUnitCode   conversion
mg/L                            1.00000e+00
mg/l                            1.00000e+00
ppm                             1.00000e+00
ug/l                            1.00000e-03
ug/L                            1.00000e-03
mg/m3                           1.00000e-03
ppb                             1.00000e-03
mg/cm3                          1.00000e+03
ug/ml                           1.00000e+00
ug/mL                           1.00000e+00
mg/ml                           1.00000e+03
ppt                             1.00000e+03
umol/L                          1.20107e-02
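
A sketch of the conversion as a join-and-multiply (table abbreviated). For reference, the umol/L factor is the molar mass of carbon, 12.0107 g/mol, expressed in mg per µmol:

```r
library(dplyr)
library(tibble)

# Conversion factors to mg/L (abbreviated; see the full table above)
unit_conversions <- tribble(
  ~ResultMeasure.MeasureUnitCode, ~conversion,
  "mg/L",                         1,
  "ug/L",                         1e-3,
  "ppb",                          1e-3,
  "umol/L",                       1.20107e-2
  # ... remaining codes from the table
)

doc_units <- doc_greater %>%
  # inner_join() drops records whose unit code has no conversion factor
  inner_join(unit_conversions, by = "ResultMeasure.MeasureUnitCode") %>%
  mutate(harmonized_value = harmonized_value * conversion,
         harmonized_units = "mg/L")
```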


Below is a pie chart that breaks down the different unit codes that were dropped in the unit harmonization process, and how many records were lost with each code.


ResultMeasure.MeasureUnitCode mismatched codes


Additionally, we provide a set of histograms of harmonized measurements by CharacteristicName:


Distribution of DOC values by CharacteristicName


8.2 thousand rows are removed. The final row count after this is 2.3 million.


6.2.8 Clean depth data

The next harmonization step cleans the four depth-related columns obtained from the WQP. The details behind this step are covered in the Depth flags section of the Tiering, flagging, and quality control philosophy chapter.

This step should not change the number of rows, but we check anyway: 0 rows are removed. The final row count after this is 2.3 million.


6.2.9 Filter and tier methods

We next review the analytical methods used in measuring DOC, primarily by classifying the text provided with each record in ResultAnalyticalMethod.MethodName. Once these methods are classified we arrange them into hierarchical tiers as described in the Method tiering section of the Tiering, flagging, and quality control philosophy chapter.

However, prior to classification we check the ResultAnalyticalMethod.MethodName column for names that indicate non-DOC measurements. Phrases used to flag and remove unrelated methods from DOC data are: “Oxygen”, “Nitrogen”, “Ammonia”, “Metals”, “E. coli”, “Anion”, “Cation”, “Phosphorus”, “Silica”, “PH”, “HARDNESS”, “Nutrient”, “Turbidity”, “Nitrate”, “Conductance”, “Alkalinity”, “Chlorophyll”, “Solids”, “Temperature”, “Color in Water”, “Coliform”, “PARTICULATE CARBON (inorg+org)”, “5210”, “bed sed”, “bs, calc”, “5220”, “Suspended-Sediment in Water”.

This process drops 82.9 thousand rows, leaving 2.2 million remaining.

The next step toward creating tiers is to classify the methods in ResultAnalyticalMethod.MethodName into one of five groups: Combustion+IR methods, Persulfate-UV/Heated Persulfate Oxidation+IR detection methods, Wet Oxidation+Persulfate+IR methods, Elemental Analyzer methods, or Carbonaceous Analyzer methods. These classifications are not the final tiers, but they inform the tiering in the final step of this process (a combined sketch of the classification and tiering logic follows the tier descriptions below). The criteria for each of the above classifications are:

  • Combustion+IR: Detection of “5310 B”, “5310B”, “415.1”, or “combustion” (but only when “5310 C” or “5310C” is absent), or detection of “0.45u silver, persulfate IR”, “0.7 um GFF, combust IR”, “Laboratory Procedures for Water Quality Chemical Analysis”, “CO2 formation”, “Qian & Mopper 1996”, or “Shimadzu TOC Analyzer” in ResultAnalyticalMethod.MethodName
  • Persulfate-UV/Heated Persulfate Oxidation+IR: Detection of “5310 C”, “5310C”, “persulfate oxid”, “415.2”, “Ultraviolet”, “Heated-Persulfate”, or “UV” in ResultAnalyticalMethod.MethodName
  • Wet Oxidation+Persulfate+IR detection: Detection of “0.45 ?um cap”, “0.45um cap”, “wet oxidation”, “5310 D”, “5310D”, “415.3”, or “O-3100 ~ Total Organic Carbon in Water” in ResultAnalyticalMethod.MethodName
  • Elemental Analyzer: Detection of “440” in ResultAnalyticalMethod.MethodName
  • Carbonaceous Analyzer: Detection of “9060 A” in ResultAnalyticalMethod.MethodName

Finally, we group the data into three tiers as described in Tiering, flagging, and quality control philosophy. These tiers are:


  • Tier 0 (Restrictive): Data that are verifiably self-similar across organizations and time periods and can be considered highly reliable and interoperable. DOC details: includes records using Wet Oxidation+Persulfate+IR (EPA 415.3, SM 5310D) or Persulfate-UV/Heated Persulfate Oxidation+IR (EPA 415.2, SM 5310C) methods.
  • Tier 1 (Narrowed): Data that we have good reason to believe are self-similar, but for which we can’t verify full interoperability across data providers. DOC details: includes records using Combustion+IR (EPA 415.1, SM 5310B) or Elemental Analyzer (EPA 440) methods.
  • Tier 2 (Inclusive): Data that are assumed to be reliable and are harmonized to our best ability given the information available from the data provider. This tier includes NA or non-resolvable descriptions of the analytical method, which often make up the majority of method descriptions for any given parameter. DOC details: Carbonaceous Analyzer (EPA 9060A) records and all other records by default, including NA methods.
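
A combined sketch of the classification and tiering logic above, with abbreviated regular expressions. Because case_when() returns the first matching condition, the persulfate check precedes the combustion check, which enforces the ‘without “5310 C”’ rule:

```r
library(dplyr)
library(stringr)

doc_tiered <- doc_units %>%
  mutate(
    method_group = case_when(
      str_detect(ResultAnalyticalMethod.MethodName,
                 regex("5310 ?C|persulfate oxid|415\\.2|ultraviolet|heated-persulfate|UV",
                       ignore_case = TRUE)) ~
        "Persulfate-UV/Heated Persulfate Oxidation+IR",
      str_detect(ResultAnalyticalMethod.MethodName,
                 regex("wet oxidation|5310 ?D|415\\.3", ignore_case = TRUE)) ~
        "Wet Oxidation+Persulfate+IR",
      str_detect(ResultAnalyticalMethod.MethodName,
                 regex("5310 ?B|415\\.1|combustion", ignore_case = TRUE)) ~
        "Combustion+IR",
      str_detect(ResultAnalyticalMethod.MethodName, "440") ~ "Elemental Analyzer",
      str_detect(ResultAnalyticalMethod.MethodName, "9060 ?A") ~ "Carbonaceous Analyzer",
      TRUE ~ "Other/NA"  # NA or non-resolvable method names land here
    ),
    tier = case_when(
      method_group %in% c("Wet Oxidation+Persulfate+IR",
                          "Persulfate-UV/Heated Persulfate Oxidation+IR") ~ 0,
      method_group %in% c("Combustion+IR", "Elemental Analyzer") ~ 1,
      TRUE ~ 2
    )
  )
```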


At this point we export a file (3_harmonize/out/doc_tiering_record.csv) that contains a record of how specific method text was tiered and how many rows corresponded to each method.



6.2.10 Flag based on field methods

Unlike variables such as chlorophyll a, DOC doesn’t have field sampling methods that lend themselves well to comparison with analytical methods and flag assignment. We therefore fill the field_flag column with NA (“not applicable”) for DOC.

No records should be removed by this process, and so 0 rows are dropped, leaving 2.2 million remaining in the harmonized DOC dataset.


6.2.11 Miscellaneous flag

Next we add a placeholder for the miscellaneous flag column, misc_flag.

Like the field_flag column, misc_flag will be filled with NA. Some parameters have additional flagging requirements that DOC does not, so we include this placeholder to maintain the same columns across all parameter data products.

No records should be removed by this process, and so 0 rows are dropped, leaving 2.2 million remaining in the harmonized DOC dataset.


6.2.12 Remove unrealistic values

Before finalizing the dataset we remove DOC values that are beyond a realistic threshold. We use 300 mg/L as our cutoff for removal: typical values are often less than 120 mg/L (Hotchkiss and DelSontro, 2024), but values of 300 mg/L can occur in some climates (Mulholland, 2003).

We also remove any depths > 592 m, the depth of the deepest lake in the U.S. (Crater Lake).
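
A sketch, assuming the discrete sample depth is stored in the harmonized_discrete_depth_value column produced by the depth-cleaning step:

```r
library(dplyr)

doc_realistic <- doc_tiered %>%
  filter(harmonized_value <= 300,                   # drop unrealistic DOC values
         is.na(harmonized_discrete_depth_value) |
           harmonized_discrete_depth_value <= 592)  # drop impossible depths
```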

19.6 thousand rows are removed. The final row count after this is 2.2 million.


6.2.13 Aggregate simultaneous records

The final step of DOC harmonization is to aggregate simultaneous observations. Any group of samples determined to be simultaneous is simplified into a single record containing the mean and coefficient of variation (CV) of the group. These can be either true duplicate entries in the WQP or records with non-identical values recorded at the same time and place by the same organization (field and/or lab replicates/duplicates). The CV can be used to filter the dataset based on the amount of variability that is tolerable for specific use cases. Note, however, that many entries will have a CV of NA (because there are no simultaneous records to aggregate) or 0 (because the records are duplicates and all entries share the same harmonized_value).

We identify simultaneous records to aggregate by creating identical subgroups (subgroup_id) from the following columns: parameter, OrganizationIdentifier, MonitoringLocationIdentifier, MonitoringLocationTypeName, ResolvedMonitoringLocationTypeName, ActivityStartDate, ActivityStartDateTime, ActivityStartTime.TimeZoneCode, harmonized_tz, harmonized_utc, harmonized_top_depth_value, harmonized_top_depth_unit, harmonized_bottom_depth_value, harmonized_bottom_depth_unit, harmonized_discrete_depth_value, harmonized_discrete_depth_unit, depth_flag, mdl_flag, approx_flag, greater_flag, tier, field_flag, misc_flag, harmonized_units. This selection limits the columns included in the final dataset, but we also provide a copy of the AquaMatch dataset prior to its aggregation (pipeline target p3_doc_preagg_grouped), including the subgroup_id column, so that users can work with the disaggregated data and make joins between dataset versions.

The final, aggregated values are presented in the harmonized_value and harmonized_value_cv columns. The number of rows used per group is recorded in the harmonized_row_count column.
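
A sketch of the aggregation over the subgroup_id groups (metadata columns abbreviated; the grouping columns are identical within a subgroup by construction):

```r
library(dplyr)

doc_aggregated <- p3_doc_preagg_grouped %>%
  group_by(subgroup_id) %>%
  summarise(
    # Grouping metadata is constant within a subgroup; keep the first row's
    across(c(parameter, MonitoringLocationIdentifier, ActivityStartDate,
             tier, harmonized_units), first),
    # CV is NA for single-record groups (sd of one value is NA) and 0 for
    # exact duplicates
    harmonized_value_cv = sd(harmonized_value) / mean(harmonized_value),
    harmonized_row_count = n(),
    harmonized_value = mean(harmonized_value),
    .groups = "drop"
  )
```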

0.3 million rows are dropped, leaving 1.8 million remaining in the final harmonized and aggregated DOC dataset.


6.2.14 Harmonized DOC

At this point the harmonization of the DOC data from the WQP is complete and we export the final dataset for use later in the workflow.

Below are several sets of figures illustrating different qualities of the dataset:

  1. Histograms showing the distribution of harmonized measurements (top) and CVs (bottom) broken down by tier after aggregating simultaneous records.


Distribution of DOC values by tier

Distribution of DOC CVs by tier


  2. Maps showing the geographic distribution of records by tier within the US:

Geographic distribution of DOC records by tier in the conterminous US


  3. Lastly, bar charts showing the distribution of values by tier across years, months, and days of the week:

Temporal distribution of DOC records by tier across years

Temporal distribution of DOC records by tier across months

Temporal distribution of DOC records by tier across days of the week