Use Cases
NWWDB-Tools: Harmonizing lithologic logs using natural language processing
Keywords: bedrock geologic units; data integration; environment; geography; geoscientificInformation; groundwater; hydrogeology; inlandWaters; stratigraphy; unconsolidated deposits; water budget; well drilling
Domain: Hydrogeology
Language: Python
Description:
The NWWDB-Tools Python package contains functionality for harmonizing lithologic logs using natural language processing.
NWWDB-Tools: Harmonizing lithologic logs using natural language processing
Lithologic data collected during the completion of a water well can be used for wide-ranging geologic mapping applications and evaluations of groundwater resources.
However, drillers’ lithologic descriptions are often provided in a free-text format that is not readily usable in a structured or repeatable way.
The
nwwdb
Python package offers functionality for the harmonization of
free-text lithologic logs. Within the nwwdb.lith module there are three functions named revise_lithology(), parse_lithology(), and map_lithology(),
which are executed in series to translate raw descriptions to a standardized set of lithology codes, or in other words, to “harmonize” the lithologic log.

Methods
The first step towards translating free-text descriptions to a standardized code set is to clean the original text through a series of find-and-replace routines, i.e., string substitions, so that the revised text can be reliably and systematically parsed. String substitutions involve the elimination of special characters, condensing or trimming extra whitespace, replacing abbreviated terms with full terms, correcting misspellings and typos, etc.
The second step in the process is to parse the revised lithologic description for occurrences of recognized or accepted lithology terms. The parser moves in a left-to-right fashion across the lithologic description and treats single spaces as delimiters for isolating the words that make up the description. It is therefore a requirement that the list of accepted lithology terms supplied by the end user not contain spaces. The parser returns an ordered list of up to 10 words encountered in the description. From this list, the first 3 words that have matches in the list of list of accepted lithology terms are extracted from the free-text description and characterized as the representative (dominant), secondary, and minor lithologies, respectively. Some descriptions may not have three simple lithology terms, and others may not have any, in which case the parser extracts as many lithology terms as it can but not more than three.
Lastly, combinations of representative, secondary, and minor lithology terms are mapped to their respective lithology codes. These mappings must be established by the end user unless it is desirable to accept the USGS-defined default mappings.
Note: At the time of this writing, the USGS-controlled code set for lithology consists of 91 distinct 4-character codes, each representing a simple lithology term, e.g., “SAND,” or in some cases a compound lithology term, e.g., “SDGL” for “sand and gravel.” The USGS lithology code set does not provide texture or color qualifiers by convention. Generalizations such as these have practical applications for performing simple queries that do not require detailed information on lithology composition or qualifiers, for example, a query that sums the thicknesses of all units having lithology codes that are suggestive of coarse-grained aquifer material, however, some use cases may find greater utility in the individual terms that are mined from the verbatim lithology description in the second processing step.
Workflow
In summary, the workflow for harmonizing lithologic logs follows three steps:
- Revise (clean)
- Parse (extract)
- Map (relate)

1) Revise lithology descriptions
The revise_lithology() function produces a revisedUnitDescription from the originalUnitDescription in the GW_LithologyLog table of the input GeoPackage.
input_gpkg: The input GPKG to be processed (must be a valid OGC GeoPackage formatted according to the USGS-NWWDB data model).
string_substitutions: A dictionary of string substitutions, or “find-and-replace” routines, to perform on the original unit description.
show_substitutions: Optionally, create a table within the input GPKG showing the string substitutions that were performed to revise the description and the order in which they were performed.
#
import os
import sqlite3
import json
import pandas as pd
import nwwdb
# input GeoPackage
input_gpkg = 'filename-of-nwwdb-formatted-geopackage.gpkg'
# create dictionary of string substitutions
string_substitutions = {
"gvl": "gravel"
"bldrs": "boulders"
"cl": "clay"
"cly": "clay"
"dol": "dolomite"
"hrdpn": "hardpan"
"sndy": "sandy"
"ls": "limestone"
"limetsone": "limestone"
"grnite": "granite"
"lime rock": "limestone"
"cobles": "cobbles"
"withcalcite": "with calcite"
"sandrock": "sandstone"
"sandy clay": "clay and sand"
"clayey gravel": "gravel and clay"
"gravelly sand": "sand and gravel"
...
}
# revise lithology descriptions given a list of string substitutions and the input GeoPackage
nwwdb.revise_lithology(input_gpkg, string_substitutions = string_substitutions)
| gwWellName | fromDepth | toDepth | originalUnitDescription | revisedUnitDescription |
|---|---|---|---|---|
| 8BH075 | 0 | 2 | CLAY | clay |
| 8BH075 | 2 | 30 | SANDY CLAY | clay and sand |
| 8BH075 | 30 | 53 | GRAVEL | gravel |
| 8BH075 | 53 | 95 | LIME ROCK | limestone |
| XQ760 | 0 | 33 | DIRTY SAND & SMALL ROCKS | dirty sand and small gravel |
| XQ760 | 33 | 40 | GREY SANDY CLAY/GRAVEL | gray clay and sand gravel |
| XQ760 | 40 | 76 | GREY DIRTY GRAVELLY SAND | gray dirty sand and gravel |
| XQ760 | 76 | 85 | TAN SAND | brown sand |
2) Parse up to three simple lithology terms
The parse_lithology() function populates representativeLithology, secondaryLithology, and minorLithology with terms parsed from the revisedUnitDescription.
input_gpkg: The input GPKG to be processed (must be a valid OGC GeoPackage formatted according to the USGS-NWWDB data model).
lithology_terms: A list of accepted lithology terms to parse from the revisedUnitDescription.
show_unhandled: Optionally, create a table within the input GPKG showing terms encountered in the revisedUnitDescription that are without matches in the list of accepted lithology_terms.
# create list of simple lithology terms
simple_lithology = [
"sand",
"clay",
"gravel",
"lignite",
"limestone",
"schist",
"shale",
"siltstone",
"alluvium",
"outwash",
"hardpan",
"soil",
"boulders",
"caliche",
"pyrite"
...
]
# parse up to three simple lithology terms from the revised unit description
nwwdb.parse_lithology(input_gpkg, lithology_terms = simple_lithology, show_unhandled = True)
| gwWellName | fromDepth | toDepth | ... | revisedUnitDescription | representativeLithology | secondaryLithology | minorLithology |
|---|---|---|---|---|---|---|---|
| 8BH075 | 0 | 2 | ... | clay | clay | ||
| 8BH075 | 2 | 30 | ... | clay and sand | clay | sand | |
| 8BH075 | 30 | 53 | ... | gravel | gravel | ||
| 8BH075 | 53 | 95 | ... | limestone | limestone | ||
| XQ760 | 0 | 33 | ... | dirty sand and small gravel | sand | gravel | |
| XQ760 | 33 | 40 | ... | gray clay and sand gravel | clay | sand | gravel |
| XQ760 | 40 | 76 | ... | gray dirty sand and gravel | sand | gravel | |
| XQ760 | 76 | 85 | ... | brown sand | sand |
3) Map combinations of lithology terms to a standard code set
Combinations of up to three simple lithology terms are now mapped to a standard set of lithology codes. These mappings/relations can be assigned in one of three ways:
Supervised: With this approach, the user has complete control over how combinations of lithology terms are mapped to lithology codes. Only the mappings that are explicitly defined in the
supervised_mappingvariable are performed. Harmonizing well logs using this method tends to run faster than the other methods because the relationships are explictly defined and the processing does not involve loops.Unsupervised: The software determines the mappings based on minimal information stored in a configuration file named
unsupervised_mapping.json. As of the current software release, this file is not customizable by the end user. It stores mappings in the same format assupervised_mapping.jsonbut only one mapping is defined per lithology code. For example, a description with “limestone” and “shale” components maps to the code “LMSH,” which could also be associated with a description consisting of “limestone”, “shale”, and “shale”, but only the former is defined inunsupervised_mapping.json, whereas the latter would be inferred by the software. The utility of this method is that the user does not have to imagine every possile combination of representative, secondary, and minor lithology components for a given lithology code, rather, the software accepts any and all combinations as long as the minimum components are matched. This approach has the disadvantage of taking longer to run because it involves a for-loop, and it does not provide any safeguards against mapping illogical combinations of lithology, for example, “granite”, “sand”, and “gravel”.Hybrid: This approach is exactly as the name would suggest. First, the software applies the mappings defined by the user in
supervised_mappingand after the supervised mapping has finished, it runs the unsupervised approach to “fill in” the rest. It inherits the advantages and disadvantages of both methods.
The map_lithology() function assigns a geologicUnitCode based on the representativeLithology, secondaryLithology, and minorLithology components.
input_gpkg: The input GPKG to be processed (must be a valid OGC GeoPackage formatted according to the USGS-NWWDB data model).
supervised_mapping: User-supplied dictionary with instructions for mapping combinations of lithology terms to their respective lithology codes. Defaults to None.
remove_unmatched: Optionally, delete well logs if one or more units could not be mapped to a lithology code. Defaults to False.
show_unmatched: Optionally, create a table within the input GPKG showing the lithology combinations that were encountered but not mapped to a lithology code. Defaults to False.
show_mapping: Optionally, create a table within the input GPKG showing the lithology mapping instructions that were applied. Defaults to False.
method: The method to be used for mapping cominations of lithology terms to their respective codes. Takes one of three options: “supervised”, “unsupervised”, or “hybrid” (explained above). Defaults to “unsupervised”.
overwrite: Whether to overwrite existing data in the input GPKG. Defaults to True.
# create dictionary with lithology mapping instructions
supervised_mapping = {
"ALVM": [
{
"representativeLithology": "alluvium",
"secondaryLithology": "sand",
"minorLithology": ""
},
{
"representativeLithology": "alluvium",
"secondaryLithology": "sand",
"minorLithology": "gravel"
},
{
"representativeLithology": "alluvium",
"secondaryLithology": "silt",
"minorLithology": "sand"
}
],
"ARKS": [
{
"representativeLithology": "arkose",
"secondaryLithology": "",
"minorLithology": ""
}
],
"BLDR": [
{
"representativeLithology": "boulders",
"secondaryLithology": "",
"minorLithology": ""
},
{
"representativeLithology": "rock",
"secondaryLithology": "boulder",
"minorLithology": ""
}
],
"BLSC": [
{
"representativeLithology": "boulders",
"secondaryLithology": "silt",
"minorLithology": "clay"
}
],
"BLSD": [
{
"representativeLithology": "boulders",
"secondaryLithology": "sand",
"minorLithology": ""
},
{
"representativeLithology": "boulder",
"secondaryLithology": "gravel",
"minorLithology": "sand"
}
],
"CLAY": [
{
"representativeLithology": "clay",
"secondaryLithology": "",
"minorLithology": ""
},
{
"representativeLithology": "clay",
"secondaryLithology": "clay",
"minorLithology": ""
},
{
"representativeLithology": "clay",
"secondaryLithology": "clay",
"minorLithology": "clay"
}
],
"CLSD": [
{
"representativeLithology": "clay",
"secondaryLithology": "sand",
"minorLithology": ""
},
{
"representativeLithology": "clay",
"secondaryLithology": "silt",
"minorLithology": "sand"
},
{
"representativeLithology": "clay",
"secondaryLithology": "sand",
"minorLithology": "silt"
}
],
"CLVM": [
{
"representativeLithology": "colluvium",
"secondaryLithology": "",
"minorLithology": ""
}
],
"COAL": [
{
"representativeLithology": "coal",
"secondaryLithology": "",
"minorLithology": ""
},
{
"representativeLithology": "coal",
"secondaryLithology": "slate",
"minorLithology": ""
},
{
"representativeLithology": "rock",
"secondaryLithology": "coal",
"minorLithology": ""
},
{
"representativeLithology": "coal",
"secondaryLithology": "claystone",
"minorLithology": ""
},
{
"representativeLithology": "coal",
"secondaryLithology": "sandstone",
"minorLithology": ""
},
{
"representativeLithology": "sand",
"secondaryLithology": "coal",
"minorLithology": ""
}
],
"DMSH": [
{
"representativeLithology": "dolomite",
"secondaryLithology": "shale",
"minorLithology": ""
}
],
"GRSC": [
{
"representativeLithology": "gravel",
"secondaryLithology": "silt",
"minorLithology": "clay"
},
{
"representativeLithology": "clay",
"secondaryLithology": "gravel",
"minorLithology": "silt"
},
{
"representativeLithology": "gravel",
"secondaryLithology": "clay",
"minorLithology": "silt"
},
{
"representativeLithology": "silt",
"secondaryLithology": "clay",
"minorLithology": "gravel"
},
{
"representativeLithology": "silt",
"secondaryLithology": "gravel",
"minorLithology": "clay"
}
],
"LMDM": [
{
"representativeLithology": "limestone",
"secondaryLithology": "dolomite",
"minorLithology": ""
},
{
"representativeLithology": "limestone",
"secondaryLithology": "dolomite",
"minorLithology": "limestone"
},
{
"representativeLithology": "limestone",
"secondaryLithology": "dolomite",
"minorLithology": "rock"
}
]
}
# parse up to three simple lithology terms from the revised unit description
nwwdb.map_lithology(input_gpkg, lithology_terms = simple_lithology, show_unhandled = True)
| gwWellName | fromDepth | toDepth | ... | representativeLithology | secondaryLithology | minorLithology | geologicUnitCode |
|---|---|---|---|---|---|---|---|
| 8BH075 | 0 | 2 | ... | clay | CLAY | ||
| 8BH075 | 2 | 30 | ... | clay | sand | CLSD | |
| 8BH075 | 30 | 53 | ... | gravel | GRVL | ||
| 8BH075 | 53 | 95 | ... | limestone | LMSN | ||
| XQ760 | 0 | 33 | ... | sand | gravel | SDGL | |
| XQ760 | 33 | 40 | ... | clay | sand | gravel | SGVC |
| XQ760 | 40 | 76 | ... | sand | gravel | SDGL | |
| XQ760 | 76 | 85 | ... | sand | SAND |
Selected References
- Arihood, L.D., 2009, Processing, analysis, and general evaluation of well-driller records for estimating hydrogeologic parameters of the glacial sediments in a ground-water flow model of the Lake Michigan Basin: Scientific Investigations Report 2008-5184, 26 p.
- Bayless, E.R., Arihood, L.D., Reeves, H.W., Sperl, B.J.S., Qi, S.L., Stipe, V.E., and Bunch, A.R., 2017, Maps and grids of hydrogeologic information created from standardized water-well drillers’ records of the glaciated United States: U.S. Geological Survey Scientific Investigations Report 2015–5105, 34 p., https://doi.org/10.3133/sir20155105 .
