National Water-Quality Assessment (NAWQA) Project

Adapted from the original article published in Environmental Science &
Technology, v. 31, 1997.

Although censored regression has been widely used by statisticians and
economists, it has rarely been used to analyze environmental data
(*4*). For environmental data, however, the total concentrations
of several contaminants in the environment, each having some censored
observations, may be of interest. This total concentration could be
calculated using a simple summation or a weighted summation, depending
on the interest of the study and the compounds being considered. For
example, the potential threats of atrazine contamination to humans and
terrestrial and aquatic ecosystems may depend on total concentrations
of several compounds such as atrazine (ATZ), deethylatrazine (DEA) and
deisopropylatrazine (DIA) in the environment. DEA and DIA can both be
derived from the degradation of ATZ (*5*). DEA and DIA are important
because they are structurally and toxicologically similar to ATZ
(*6-8*). Therefore, the risks from the same concentration level
of ATZ, DEA, and DIA likely are similar. The actual risk of using
groundwater as a source of drinking water may depend on the total
atrazine-residue concentration (ATZ+DEA+DIA) and not that of ATZ alone.
Therefore, when designing appropriate policies to improve or protect
the groundwater, it is necessary to consider total atrazine residue to
properly determine risk levels. Direct summation of these three
atrazine compounds, however, is not feasible because censored
observations of ATZ, DEA, and DIA are common in groundwater (*9*).
An appropriate statistical method has been developed to deal with this
issue of censored environmental data.

The purpose of this paper is to present a statistical method developed for estimating significant factors that affect the total concentrations of several contaminants from measurements that include censored observations. The procedure is demonstrated using concentrations of ATZ, DEA, and DIA in groundwater. The methodology presented in this paper, however, can be used to analyze similar cases of environmental contamination involving censored data.

Two groups of ancillary factors
also were collected for each well: hydrogeologic and land use. Because
multicollinearity among the regressors causes inefficiency and
inconsistency, an initial screening procedure implemented in previous
research (*13, 14*) was used to eliminate variables that have
limited explanatory power. Details of specific factors collected in the
survey are given elsewhere (*11, 12*). The sample statistics for
selected factors are given in Table 2.
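
Consistent with the symbol definitions given next, the general model referred to as equation (1) can be sketched as follows. The additive form and the symbol ε for the error term are assumptions made for this sketch:

```latex
G = g(\mathbf{x}) + \varepsilon \qquad (1)
```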

where G indicates the response (dependent) variable,
__x__ is a vector of the explanatory (independent) variables,
g(__x__) is a function of __x__, and the error term
represents the modeling error (*15*). For this study, __x__ is the set of
land use and hydrogeological characteristics, and the response
variable, atrazine residue, G, is expressed as

G = ATZ + DEA + DIA    (2)

where ATZ, DEA, and DIA are concentration levels of atrazine and two of its degradation products in groundwater. To estimate significant factors that affect G in equation (1), it is necessary to estimate values for G given in equation (2) first. The problem arises because some observations of ATZ, DEA, and DIA are below the analytical reporting limit (0.05 µg/L). For levels below the analytical reporting limit, the observations were recorded as "less than 0.05 µg/L", and their precise values are unknown.

When a sample is censored, applying a standard estimation procedure such as
simple linear regression, either by substituting an arbitrary value for the
censored data or by treating the censored data as missing values, produces biased
and inconsistent parameter estimates (*16, 17*). Censored regression analysis provides an
appropriate method to accommodate censoring in the response variable. The
censored regression model is characterized by a latent regression equation
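
In its standard textbook form, the latent regression and the likelihood for a single left-censored compound can be sketched as below. The notation is assumed for illustration and may differ from the original equations: y_i^* is the latent log concentration at site i, c is the log of the reporting limit, β and σ are the regression coefficients and scale, and φ and Φ are the standard normal density and distribution functions.

```latex
y_i^{*} = \mathbf{x}_i^{\top}\boldsymbol{\beta} + \sigma\,\varepsilon_i,
\qquad \varepsilon_i \sim N(0, 1)

y_i =
\begin{cases}
  y_i^{*} & \text{if } y_i^{*} \ge c \\
  \text{censored at } c & \text{if } y_i^{*} < c
\end{cases}

L(\boldsymbol{\beta}, \sigma) =
\prod_{i:\, y_i^{*} \ge c}
  \frac{1}{\sigma}\,
  \phi\!\left(\frac{y_i - \mathbf{x}_i^{\top}\boldsymbol{\beta}}{\sigma}\right)
\prod_{i:\, y_i^{*} < c}
  \Phi\!\left(\frac{c - \mathbf{x}_i^{\top}\boldsymbol{\beta}}{\sigma}\right)
```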

The above likelihood function is for only one compound. The likelihood function becomes extremely complicated for the case of three compounds, which has 8 possible combinations of observed and censored values. The 8 cases are: (1) ATZ, DEA, and DIA are all observed; (2) ATZ and DEA are observed but DIA is censored; (3) ATZ and DIA are observed but DEA is censored; (4) DEA and DIA are observed but ATZ is censored; (5) DIA is observed but ATZ and DEA are censored; (6) DEA is observed but ATZ and DIA are censored; (7) ATZ is observed but DEA and DIA are censored; and (8) ATZ, DEA, and DIA are all censored. For each case, the likelihood function is a product of either the cumulative distribution function or the probability density function of ATZ, DEA, and DIA, depending on which data are censored. This makes estimation extremely difficult and is one of the major reasons for estimating the parameters of the regression equation for each compound separately. Furthermore, ATZ, DEA, and DIA are log-normally distributed, and a sum of log-normal variables is not itself log-normal. Therefore, the distribution of G (sum of ATZ, DEA, and DIA) is unclear. The details of the likelihood function can be obtained from the authors (because of space considerations and complexity, they are not provided here).

[An additional alternative approach to deal with the issue we are addressing here is to incorporate the censoring levels into the sum of the three compounds. By doing so, the sum of the three compounds is either uncensored (all three compounds are observed), interval censored (one or two compounds are censored), or left censored (all three compounds are censored). This approach, however, oversimplifies the problem for the case of at least one compound being censored. It is not clear what censoring level should be used. For example, 0.15 may be used as a censoring level for the case where all three compounds are censored. But G <0.15 can come from an infinite number of different combinations (such as ATZ <0.05, DIA <0.05, and DEA <0.05; or from ATZ <0.10, DIA <0.04, and DEA <0.01). The same issue arises for the case of either one or two compounds being censored. In contrast, the methodology proposed in this study provides a precise restriction for each compound when it is censored. The procedure presented for this study is one that is both practically and theoretically justified to analyze the type of data used in this demonstration.]
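
The eight censoring patterns, and the bounds that the alternative "censor the sum" approach would place on G, can be illustrated with a short Python sketch. The function names are hypothetical, and treating each censored value as lying between 0 and the 0.05 µg/L reporting limit is an assumption made for illustration.

```python
from itertools import product

REPORTING_LIMIT = 0.05  # µg/L, the analytical reporting limit stated in the text
COMPOUNDS = ("ATZ", "DEA", "DIA")


def censoring_patterns():
    """Return all 2**3 observed/censored combinations (True means censored)."""
    return [dict(zip(COMPOUNDS, flags)) for flags in product((False, True), repeat=3)]


def sum_bounds(values, censored):
    """Bounds on G = ATZ + DEA + DIA, assuming censored values lie in [0, limit).

    values   -- concentrations in µg/L (ignored where the matching flag is True)
    censored -- one boolean flag per compound
    """
    lower = sum(v for v, c in zip(values, censored) if not c)
    upper = lower + REPORTING_LIMIT * sum(censored)
    return lower, upper
```

With all three compounds censored, `sum_bounds` returns a lower bound of 0 and an upper bound of about 0.15, but it carries no information about how that total is split among the compounds, which is the loss of information the passage describes.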

The procedure just described can be applied to ATZ, DEA, and DIA to impute the values for those observations below the analytical reporting limit. With the estimated dependent variable, atrazine-residue concentrations, the significant factors that relate to environmental contaminant concentrations can be found by standard regression procedures.
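
As an illustration of the imputation step, the sketch below computes the conditional mean of a normally distributed (log-scale) response given that it falls below the censoring limit. This is the standard truncated-normal mean; whether it matches the form of equation (8) exactly is an assumption, and the function name is hypothetical.

```python
import numpy as np
from scipy import stats


def censored_conditional_mean(mu, sigma, limit):
    """E[y | y < limit] for y ~ N(mu, sigma**2), with y on the log scale.

    mu    -- fitted mean (x'beta) at the censored site
    sigma -- estimated scale of the censored regression
    limit -- censoring limit on the same (log) scale as y
    """
    a = (limit - mu) / sigma
    # phi(a) / Phi(a), computed through logs so it stays stable for very negative a
    ratio = np.exp(stats.norm.logpdf(a) - stats.norm.logcdf(a))
    return mu - sigma * ratio
```

For example, `censored_conditional_mean(-4.0, 1.2, np.log(0.05))` gives a log-scale value that is necessarily below `np.log(0.05)`.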

To identify the significant factors that relate to atrazine-residue concentrations in groundwater, an appropriate model selection procedure has to be used. There is, however, no standard procedure available for selecting terms in a regression with censored data. LIFEREG and other statistical programs for analyzing censored data are designed only for estimating the parameters of a given regression model. They are not programmed to perform model selection.

To select an
appropriate model with censored regression analysis, the censored
forward regression procedure is used in this study (*21*).
The procedure is a forward stepwise procedure used with the Tobit model. In
this procedure, variables are added one at a time as long as they
contribute significantly to the fit. A Wald-type statistic is used to
judge whether a new variable should be added to the model. The
significance level is set, somewhat arbitrarily, at 0.10.
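
A forward stepwise search of this kind can be sketched in Python as follows. This is not the procedure of reference (21), whose details are not given here; it is a minimal illustration that assumes a normal Tobit likelihood on the log scale, a single left-censoring limit, and standard errors taken from a finite-difference Hessian. All function names are hypothetical.

```python
import numpy as np
from scipy import optimize, stats


def negloglik(params, Z, y, cens, limit):
    """Negative Tobit log-likelihood; the last parameter is log(sigma)."""
    beta, sigma = params[:-1], np.exp(params[-1])
    mu = Z @ beta
    ll = np.where(cens,
                  stats.norm.logcdf((limit - mu) / sigma),    # censored: P(y* < limit)
                  stats.norm.logpdf(y, loc=mu, scale=sigma))  # observed: density
    return -ll.sum()


def numerical_hessian(f, x, eps=1e-4):
    """Central finite-difference Hessian of f at x."""
    k = len(x)
    H = np.zeros((k, k))
    for i in range(k):
        for j in range(k):
            ei, ej = np.zeros(k), np.zeros(k)
            ei[i], ej[j] = eps, eps
            H[i, j] = (f(x + ei + ej) - f(x + ei - ej)
                       - f(x - ei + ej) + f(x - ei - ej)) / (4 * eps ** 2)
    return H


def fit_tobit(Z, y, cens, limit):
    """Maximize the likelihood; return estimates and their approximate covariance."""
    start = np.zeros(Z.shape[1] + 1)
    start[0] = y.mean()  # column 0 of Z is the intercept
    f = lambda p: negloglik(p, Z, y, cens, limit)
    res = optimize.minimize(f, start, method="BFGS")
    return res.x, np.linalg.inv(numerical_hessian(f, res.x))


def forward_select(X, y, cens, limit, alpha=0.10):
    """Add one column of X at a time while its Wald p-value is below alpha."""
    n, p = X.shape
    selected = []
    while True:
        best = None
        for j in [c for c in range(p) if c not in selected]:
            cols = selected + [j]
            Z = np.column_stack([np.ones(n), X[:, cols]])
            params, cov = fit_tobit(Z, y, cens, limit)
            k = len(cols)  # index of the candidate's coefficient in params
            wald = params[k] ** 2 / cov[k, k]
            pval = stats.chi2.sf(wald, df=1)
            if pval < alpha and (best is None or pval < best[1]):
                best = (j, pval)
        if best is None:
            return selected
        selected.append(best[0])
```

Here `y` holds the log concentration, with censored entries set to `limit`, and `cens` is a boolean array marking the censored sites. The default `alpha=0.10` mirrors the significance level stated above.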

With the selected model for each compound, the
concentration levels for those sites where observations are recorded as
below the analytical reporting limit could be imputed based on equation
(8). The total atrazine-residue concentration for each site can then be
calculated by using observed data for those sites where observations
are above the analytical reporting limit and imputed data for those
below the analytical reporting limit. Finally, the all-subset model
selection procedure (*22*) can be used to select the final model
for atrazine residue. The adjusted R^{2} is used for selecting the
final model. Therefore, the all-subset regression procedure simply
picks the model with the highest adjusted R^{2}.
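
The all-subset search by adjusted R^{2} can be illustrated with the following Python sketch, which fits ordinary least squares to every non-empty subset of candidate columns. The function names are hypothetical, and the sketch assumes a complete (imputed) response vector.

```python
from itertools import combinations

import numpy as np


def adjusted_r2(X, y):
    """Adjusted R^2 of an ordinary least-squares fit of y on X plus an intercept."""
    n, k = X.shape
    Z = np.column_stack([np.ones(n), X])
    coef, *_ = np.linalg.lstsq(Z, y, rcond=None)
    resid = y - Z @ coef
    ss_res = resid @ resid
    ss_tot = ((y - y.mean()) ** 2).sum()
    return 1.0 - (ss_res / (n - k - 1)) / (ss_tot / (n - 1))


def best_subset(X, y):
    """Return (column indices, adjusted R^2) for the best non-empty subset."""
    p = X.shape[1]
    best_cols, best_score = None, -np.inf
    for size in range(1, p + 1):
        for cols in combinations(range(p), size):
            score = adjusted_r2(X[:, list(cols)], y)
            if score > best_score:
                best_cols, best_score = cols, score
    return best_cols, best_score
```

With 20 candidate variables, this enumerates 2^{20} - 1 = 1,048,575 subsets, so a brute-force loop of this kind is feasible but slow.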

With the estimated censored regression equations, the values of censored observations can be imputed based on equation (8). The values of û in equation (6) were calculated by using the estimated parameters from Tables 3-5. The conditional mean for each compound at each censored site is imputed with equation (8) also using the estimated parameters from Tables 3-5.

With the estimated mean for each compound (ATZ, DEA, DIA) at each censored site and a standard deviation, five sets of observations were generated for each site. With randomly generated values, the inverse transformation based on equation (8) was used to obtain an imputed value for each site. At the end of this process, five pseudo-complete data sets were obtained.
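
One way to generate such imputations is sketched below. It is an illustration under stated assumptions, not the authors' exact procedure: values are drawn on the log scale from a normal distribution truncated above at the log of the reporting limit (by inverse-CDF sampling) and then exponentiated. The function name and the fixed seed are hypothetical.

```python
import numpy as np
from scipy import stats


def impute_censored(mu, sigma, limit, n_sets=5, seed=0):
    """Draw n_sets imputed concentrations for each censored site.

    mu    -- fitted log-scale means at the censored sites
    sigma -- estimated log-scale standard deviation
    limit -- reporting limit on the concentration scale (for example 0.05)
    Returns an array of shape (n_sets, number of sites).
    """
    mu = np.asarray(mu, dtype=float)
    rng = np.random.default_rng(seed)
    # P(y < log(limit)) at each site under N(mu, sigma^2)
    upper = stats.norm.cdf((np.log(limit) - mu) / sigma)
    # Uniform draws on [0, upper) map to the truncated normal through the inverse CDF
    u = rng.uniform(0.0, upper, size=(n_sets, mu.size))
    log_conc = mu + sigma * stats.norm.ppf(u)
    return np.exp(log_conc)
```

Each row of the result corresponds to one pseudo-complete data set once it is combined with the observed values at the uncensored sites.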

The statistics of the pseudo-complete data are
given in Table 6. The means of the imputed minimum concentrations for
censored data based on the five pseudo-complete data sets were
4.48 × 10^{-5}, 1.274 × 10^{-4}, and 9.576 × 10^{-9}
for ATZ, DEA, and DIA, respectively, much less than the reporting
limit of 0.05 µg/L. Previous
research has confirmed the prevalence of ATZ and DEA concentrations in
groundwater below 0.05 µg/L
(*24*). The frequency of atrazine detection roughly doubles if the
reporting limit is lowered from 0.05 to 0.003 µg/L.

By using the pseudo-complete data, the final model for atrazine residue was
estimated by an all-subset model selection procedure (*22*),
although other methods, such as the sum of squares analysis, could also
have been used. The final selected models are not given here because of
the issue discussed in the multiple imputation section. The model is
not unique; it depends on which pseudo-complete data set is used. In
Table 7, only the sign and significance level for the variables are
given. The five final models selected from the five pseudo-complete data sets
for atrazine residue were identical in terms of variables selected,
significance levels, and signs of coefficients, indicating the stability
of the model. The model selected from each
pseudo-complete data set is the best model obtained from the list
of 20 potential explanatory variables available
(Table 1).

(2) Tobin, J. *Econometrica* 1958, *26*, 24-36.

(3) Greene, W. H. *Econometric Analysis*, Macmillan Publishing
Company, New York, NY, 1990.

(4) Slymen, D.; Peyster, A.D. *Environ. Sci. Technol.* 1994,
*28*, 898-902.

(5) Paris, D.F.; Lewis, D.L. *Residue Rev.* 1973, *45*,
95-124.

(6) Ciba-Geigy Corporation. *Summary of Toxicological Data on Atrazine and
Its Chlorotriazine Metabolites*; Attachment 12, 56 FR 3526,
Ciba-Geigy, 1993.

(7) Kaufman, D.D.; Kearney, P.C. *Residue Rev.*
1970, *32*, 235-265.

(8) Moreau, C.; Mouvet, C. *J.
Environ. Qual.* 1997, *26*, 416-424.

(9) Kolpin, D.W.; Thurman, E.M.; Goolsby, D.A. *Environ.
Sci. Technol*. 1996, *30*, 335-340.

(10) Burkart, M.R.; Kolpin, D.W. *J. Environ.
Qual.* 1993, *22*, 646-656.

(11) Kolpin, D.W.; Burkart, M.R.; Thurman, E.M. *Open-File Rep.
U.S. Geol. Surv*. 1993, *No.* 93-114.

(12) Kolpin, D.W.; Burkart, M.R.; Thurman,
E.M. *U.S. Geol. Surv. Water-Supply Pap.* 1994, *No.* 2413.

(13) Liu, S.; Yen, S.T.; Kolpin, D.W.
*J. Environ. Qual.* 1996, *25*, 992-999.

(14) Liu, S.; Yen, S.T.;
Kolpin, D.W. *Water Resour. Bull.* 1996, *32*, 845-853.

(15) Johnston, J. *Econometric
Methods*, Third Edition, McGraw-Hill Publishing Company, New York, NY,
1984.

(16) Amemiya, T., *Advanced Econometrics*, Harvard
University Press, Cambridge, MA, 1985.

(17) Liu, S.; Stedinger, J. R. *Water Resources Planning
and Management and Urban Water Resources*, The American Society of Civil
Engineers, 1991; pp. 27-31.

(18) Miller, D.M. *American Statistician*
1984, *38*, 124-126.

(19) Schmee, J.; Hahn, G. J., *Technometrics*
1979, *21*, 417-432.

(20) SAS, Version 6, *A Statistical Software System
Registered Trademark of SAS Institute Inc.*, Cary, North Carolina, USA,
1989.

(21) Lu, J.-C.; Liu, S.; Unal, C. Department of Statistics, North Carolina State University, Raleigh, NC, written communication, 1995.

(22) Draper, N. R.; Smith, H.
*Applied Regression Analysis*, Second Edition, John Wiley, New York, NY,
1981.

(23) Rubin, D. B.; Schenker, N. *J. Am. Stat.
Assoc.* 1986, *81*, 366-374.

(24) Kolpin, D.W.; Goolsby, D.A.; Thurman, E.M. *J.
Environ. Qual.* 1995, *24*, 1125-1132.

Shiping Liu

BF&S Consulting

Global Business Intelligence Solutions, IBM, San Francisco, CA

Jye-Chyi Lu

Associate Professor, Department of Statistics

North Carolina State University

Raleigh, NC, 27695

Dana W. Kolpin

U.S. Geological Survey

400 S. Clinton St.

Iowa City, IA, 52244

(319) 358-3614

Fax: (319) 358-3606

William Q. Meeker

Distinguished Professor, Department of Statistics

Iowa State University

Ames, IA 50011