Pesticide Persistence in the Environment - Collected Data and Structure-Based Analysis
Sokratis ALIKHANIDI and Yoshimasa TAKAHASHI
A great number of new chemical substances are employed by present-day industries such as the chemical, pharmaceutical, and agricultural sectors. Alongside its target functionality, the disposability of a new compound is a key property wherever it is applied. This is especially true of pesticides, which are normally released onto the ground.
The experimental evaluation of chemical degradation in the environment is complicated for several reasons. These include variation in the moisture, temperature, and chemical and microbiological composition of soil, as well as the ability of a chemical to volatilize and photodegrade. The involvement of some longer-term processes, primarily related to changes in the population of microorganism species in soil or natural waters, cannot be excluded. The assessment test for disposability is expensive and usually time-consuming. In some cases, when a chemical is not mineralized or broken down to nontoxic products, there is a possibility of extensive contamination that creates health hazards for fauna and humans.
Thus, the theoretical estimation of disposability is very valuable. Unfortunately, its modeling is extremely complex, owing to the large number of reaction mechanisms involved in environmental degradation. On the other hand, Quantitative Structure-Biodegradation Relationship (QSBR) [1, 2] techniques are a reasonable alternative for developing so-called "Expert Systems" for estimating environmental degradability. QSBR approaches are based on the postulate that the degradation ability of a compound can be described by some modeling function of numerical descriptors representing its molecular structure. These numerical descriptors may be of many types: calculable physicochemical properties (octanol-water partition coefficient, surface area, refractivity, polarizability), various spatial or topological molecular indices, charge-distribution-related parameters, quantum chemical and molecular field parameters, and occurrences of certain structural features. Multiple QSBR models have been suggested for estimating the degradability of homogeneous series of molecules belonging to specific chemical classes [3 - 9]; most of them operate with small, congeneric training sets of up to 40 compounds and may have only limited application. In a newer model [10], the size of the training set is not stated, but the model was validated only on a homologous set of 177 mono benzene derivatives and 168 acyclic compounds. Some models have been created with very small data sets of 18 compounds [11] and 36 compounds [12].
Models dealing with larger sets of heterogeneous molecules are of more practical importance. Howard and coworkers developed linear and nonlinear models to estimate the probability of aerobic biodegradation of 264 compounds from the BIODEG database [13, 14]. To classify compounds as rapidly or slowly biodegradable, they used 35 fragmental descriptors, achieving accuracies of 90.5 and 89.8% for the training set and 81.5 and 88.8% for the test set (27 chemicals) with the linear and nonlinear models, respectively. Klopman and coworkers created the program CASE [15], which automatically identifies and analyzes molecular fragments of a data set to create a discriminant function between rapid and slow biodegradability. They used 283 aliphatic and aromatic compounds from the BIODEG database and found 37 important substructures. For a test set of 27 compounds, 74% of the predictions were correct. Devillers and colleagues developed a new database of aerobic biodegradability of 184 chemicals with the help of 17 experts [16]. They mapped the experts' responses 'hours', 'days', 'weeks', 'months', and 'longer' to the integers 1 to 5. As the next step, they used 66 autocorrelation indices encoding the hydrophobicity of molecules, followed by extraction of the first nine principal components. Biodegradability was modeled by a back-propagation neural network, resulting in a squared correlation coefficient of 0.76 for the training set of 172 molecules and 0.49 for the test set of 12 molecules [17].
The newer studies deal with the more comprehensive MITI I data set on the biodegradability of about 900 diverse compounds [18]. Loonen et al. used PLS analysis of 127 predefined structural fragments and obtained 85% and 83% correct predictions for the total set and, on average, for four test sets, respectively [19]. Sabljic and coworkers, in turn, applied an inductive machine learning method to the whole MITI I data set, correctly classifying 84% of the chemicals [20]. These published results are significant and may have potential for practical use.
Our study was specifically directed at the pesticide area. In this work, (i) a pesticide degradability database was compiled and (ii) an estimation scheme was developed on its basis.
2 Methodology and Experimental Part
2.1 Data set
As a first approximation, the environmental disappearance rate of a chemical is proportional to its concentration, so first-order kinetics may be assumed:

$$[A] = [A]_0 \, e^{-kt} \qquad (1)$$

where [A]0 and [A] are the initial concentration of the chemical and the concentration remaining after time t, and k is the rate constant. The convenient half-life period (HL) of a chemical is the time needed for the concentration to decrease by a factor of 2. Then equation (1) can be transformed to

$$\mathrm{HL} = \frac{\ln 2}{k} \qquad (2)$$
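The first-order assumption described above can be illustrated numerically. The following is a minimal sketch, assuming simple exponential decay; the function names `rate_constant` and `remaining_fraction` are illustrative and do not come from the original work.

```python
import math


def rate_constant(half_life_days):
    """Rate constant k (1/day) of a first-order decay with the given half-life."""
    return math.log(2) / half_life_days


def remaining_fraction(t_days, half_life_days):
    """Fraction [A]/[A]0 remaining after t_days, assuming first-order decay."""
    return math.exp(-rate_constant(half_life_days) * t_days)


# After one half-life half of the chemical remains; after two, a quarter.
one_hl = remaining_fraction(30, 30)
two_hl = remaining_fraction(60, 30)
```

Under this assumption a single number (HL or k) fully describes the decay curve, which is why the half-life is a convenient summary even though field data rarely follow the idealized curve exactly.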
Unfortunately, the environmental HL of a chemical compound is a highly uncertain value, for a number of reasons. In addition, degradation mechanisms of different kinetic order may operate, and there are other peculiarities such as the accumulation of hazardous but stable decay products.
Because of this uncertainty, a discrete value is a reasonable way to describe the degradation ability of a chemical compound. Several scales have been used, such as 2-level (i.e. a chemical degrades rapidly or not), 3-level [21], and 4 (and 5)-level. We consider the 3-level scale (i.e. a chemical degrades rapidly, moderately, or slowly) a reasonable choice, because the 2-level scale is too rough, whereas in scales with many levels it is often more difficult to assign a chemical to a single appropriate level. The 3-level scale [21] categorizes the pesticides into the following classes:
Compiling the degradation rates of many compounds is a significant problem because test conditions ought to be uniform. Only the MITI (Japanese Ministry of International Trade and Industry) [18] assessment of 894 chemicals satisfies this basic requirement. However, it provides only a 2-level evaluation.
A massive compilation of degradation data for about 240 pesticides was made by the U.S. Environmental Protection Agency [22]. The data were drawn from many references, so the test conditions varied appreciably. The project is ongoing, and the updated database is available on the Internet [23] (we have processed 334 pesticides).
Nevertheless, problems often arose while using Refs. [22, 23], related to the small number of observations for certain chemicals and, more generally, to the absence of comments on the given half-life data. A good solution was to use the rich Hazardous Substances Data Bank of the U.S. National Library of Medicine (HSDB) [24]. It explains the degradation details and frequently provides additional references.
The Pesticide Management Education Program at Cornell University was another, often complementary, database [25]. For several compounds with unreliable or too scanty persistence reports, it offered data not listed in Refs. [22 - 24]. Those chemicals are dienochlor, dodine (cyprex), mepiquat chloride, propamocarb, propoxur, pyrithiobac, quizalofop-ethyl, temephos, terbufos, triadimefon, triflumizole, and trimethacarb. For trichloroacetic acid (TCA), a review of its environmental behavior has been published recently [26].
Gathering all the available data from Refs. [22, 23], with the help of the other above-mentioned sources, a data set on the persistence of 315 pesticides under field conditions was collected and is presented in Table 1 [27]. It consists of organic and 'organic-like' (carbon disulfide, ammonium sulfamidate, chloropicrin) compounds for which HL data exist. Several pesticides contain nontoxic aluminium, iron, or silicon. Compounds containing toxic arsenic and tin were excluded, as these elements are dangerous for the environment and the use of such pesticides is specially regulated by law.
On the other hand, it was recognized that HSDB still contains many pesticides not included in our data set. To include them, we used the comprehensive Compendium of Pesticide Common Names [28] (1086 compounds, up to October 2002) to extract all the new CAS numbers, followed by the extraction of the corresponding information on environmental degradability from HSDB. As the last step, we cleaned the data by removing compounds with insufficient degradability information, substances containing toxic elements (As, Sn, Hg), and overly complex mixtures of compounds under a single trade name (camphechlor, a mixture of at least 177 chlorinated camphenes). We often found specific relevant facts in the Pesticide Management Education Program at Cornell University [25]. As a result, the new data set contains 105 pesticides and is presented in Table 2.
The pesticides were assigned to persistence classes according to the rules (3). The persistence data were critically evaluated using additional rules: (i) for wide data ranges, overestimating the half-life was considered the cheaper (i.e. safer) mistake for the environment; (ii) for ranges of HL, the geometric mean was used (as for log-normally distributed data). For instance, the following HL values were collected for the pesticide aldrin: 28, 43-63, 10, 183, 273-365, 21-584, and 20-100 days. Class 1 is clearly not appropriate (rule (i)), so the smallest values (10, 20, 21, and 28) were removed to retain the greater values only. For the remaining values (43, 63, 183, 273, 365, 30, 584, 30, and 100, where 30 was used twice to replace 21 and 20), the geometric mean (rule (ii)) is 111 days, which places aldrin in class 3. Such serious data inconsistency problems arose frequently during the preparation of the data set. Compounds often had either too wide a range of HL values (as for aldrin), which led to a fuzzy persistence class, or only 1 or 2 observations, which resulted in an unreliable conclusion; these two kinds of cases were flagged with separate marks in Table 1.
For each compound, the structural information was prepared. The ChemIDPlus database [29] of the National Library of Medicine (350K structures) was used as a convenient source of molecular structures. For the remaining compounds, molecular structures were constructed from the chemical names.
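The aldrin arithmetic above can be checked directly. The following is a small sketch that takes the nine retained half-life values exactly as listed in the text and computes their geometric mean; the helper name `geometric_mean` and the variable names are illustrative.

```python
import math


def geometric_mean(values):
    """Geometric mean, computed via the mean of natural logarithms."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Retained aldrin half-lives in days, as listed in the text
# (30 appears twice, replacing the lower range bounds 21 and 20).
aldrin_retained_hl_days = [43, 63, 183, 273, 365, 30, 584, 30, 100]

aldrin_gm = geometric_mean(aldrin_retained_hl_days)
print(round(aldrin_gm))  # about 111 days, matching the value quoted in the text
```

The geometric mean is the natural average here because it corresponds to the arithmetic mean on a logarithmic scale, which is the appropriate scale for log-normally distributed data.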
Preprocessing of structures before QSBR analysis included:
(i) Compounds with weak ionic bonds were processed as the protonated free acid (for organic acid salts) or as the tertiary free amine, or the quaternary cation where applicable (for organic ammonium salts).
(ii) The aromaticity of benzene-like and heterocyclic fragments was carefully checked by the Hückel rule. In particular, ring compounds with an exocyclic carbonyl group or with nitrogen and oxygen atoms in the same ring were considered non-aromatic. As the only exception, pyrimidine-2,4(1H,3H)-diones with a chlorine or bromine atom in the 5-position (terbacil and bromacil) were assumed to be aromatic, owing to the positive mesomeric effect of the halogen in this case. Nevertheless, the unambiguous assignment of aromaticity in rings remains a subject for further clarification [30, 31].
2.2 QSBR Analysis
In classical QSAR/QSBR analysis, the predictive model is built using many kinds of descriptors: physicochemical properties, molecular indices, and molecular fragments [32]. However, many authors have noted the inadequacy of the first two descriptor types for QSBR purposes, especially for diverse molecules. The reason is the redundancy and low discriminative power of those descriptors, because a small structural modification of a chemical can often change its degradation ability appreciably [2, 33]. Some success has been obtained only for sets of homogeneous compounds.
Fragmental approaches, in which the presence or absence of particular functional groups and/or substructures in the molecule influences the model's output, are designed especially for sets of diverse molecules. The better quality of descriptive models in this case is generally accepted. Even the very large and heterogeneous data set of the MITI test (2-level scale) [18] was well explained [19, 20].
A number of learning approaches have been applied to determine the relationship between degradability and molecular fragments: classical MLR [11, 14, 15], PLS [19], rule-based methods [10, 20], and other methods including neural networks [12, 16, 34]; the newer data-mining arsenal is available from the specific literature [35].
In this work, the decision tree approach has been employed for developing the QSBR model. It has the following advantages: (i) simplicity of results: in most cases, the interpretation of results summarized in a tree is simple, and the final result can be expressed as a series of (usually few) logical if-then rules (tree nodes); (ii) tree methods are nonparametric and nonlinear, so there is no implicit assumption that the underlying relationships between the predictor variables and the response (here the degradability class) are linear, follow some specific non-linear link function, or are even monotonic in nature. In contrast, sophisticated learning methods such as variants of neural networks usually behave as "black boxes" with tangled inner interrelations and a lack of transparency. This in turn makes the structure of the solution hard to interpret and carries the danger of unexpected "surprises" in predictions.
However, processing a huge number of possible substructural descriptors is a challenge even for algorithms especially designed to handle many predictors. The decision tree was developed manually, and most effort was directed to the design and selection of appropriate fragments. The "manual" way was chosen because it provides flexibility in generating and combining rules while keeping the results chemically believable. Hypotheses were tested with supporting programs written in Statistica's Basic [38] and Perl.
The training set of 315 compounds used to create the model is presented in Table 1. The model was finally validated on the test set of 105 compounds from Table 2.
The general procedure for constructing the decision tree was as follows: a compound is assumed to be stable against degradation unless it contains fragments associated with a tendency to break down more quickly. The first step was to find features able to discriminate diverse pesticides of class 1 from more persistent ones. It was found that phosphorus compounds, derivatives of carbamic acid, amides of dicarboxylic acids, amides of α-chlorocarboxylic acids, and some other types of compounds are generally of class 1 (low persistence), while structures with aromatic nitrogen behave in a special way, so that other parts of the molecule must be analyzed. As the next step, the decision tree was grown and optimized to decrease the total number of decision rules and thereby achieve better generalization of the model.
The key criterion for selecting a decision rule for the model was its high discrimination. No additional rule or fragment was added to explain only 1 or 2 compounds, but extending the generality of an existing fragment or rule was allowed. Rules were generalized by using Boolean operators to combine several descriptors into one rule. Special attention was paid to the design of fuzzy fragmental descriptors with multiple atom types or bond types, as can be seen in Figure 2 (fragments N-CA-X, N-CA-C, Ph-COO, and others), for better generalization. Some descriptors (N-CO, CN, NO2, CYC_3C, nBO, nDB) were used several times in different places in the decision tree to reduce the descriptor dimensionality of the solution.
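The "stable unless a labile fragment is present" strategy, with Boolean combination of descriptors, can be sketched as a pair of if-then rules. This is only an illustration and not the tree of Figure 1: the real model has 12 decision nodes plus an algorithmic part, none of which is reproduced here. The descriptor name `N-CO-O` (carbamic acid derivatives) is taken from the text, while the key `P` for a phosphorus atom count and the function name `classify_persistence` are hypothetical.

```python
def classify_persistence(fragment_counts):
    """Toy persistence classifier (1 = low persistence, 3 = high persistence).

    fragment_counts maps a descriptor name to its number of occurrences
    in the molecule. Only two illustrative rules are implemented.
    """
    def has(name):
        return fragment_counts.get(name, 0) > 0

    # One rule combining two descriptors with a Boolean OR:
    # phosphorus compounds and carbamic acid derivatives are
    # generally of class 1 according to the text.
    if has("P") or has("N-CO-O"):  # "P" is a hypothetical descriptor name
        return 1

    # Default assumption: stable against degradation.
    return 3


carbamate_class = classify_persistence({"N-CO-O": 1})
default_class = classify_persistence({})
```

Because each rule is an explicit condition on fragment counts, the path that leads a compound to its predicted class can be read off directly, which is the transparency argument made for tree methods above.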
3 Results and Discussion
The constructed decision tree and the respective molecular fragments are shown in Figure 1 and Figure 2. There are 12 decision nodes and one algorithmic rule of 7 steps. The model employs 31 topological descriptors in total. Contributions of the core rules are listed in Table 3; classification results for the training set and the test set are presented in Table 4 and Table 5, respectively.
The discrimination power of the descriptors is presented in Table 3 in terms of the total number of separated compounds. The table shows that the highest discrimination is achieved by the UNSATW and N-CO-O (carbamic acid derivatives) descriptors. The UNSATW parameter, a kind of weighted unsaturation index, summarizes the instability of a compound, determined mostly by the number of C=C, C=N, or C=O double bonds present, normalized by the total number of bonds excluding those to hydrogen (the nBO parameter). UNSATW, along with the HETERO value (absence of sulfur or halogens except iodine) and nBO, was introduced to separate heterogeneous structures of weak persistence. The fragments O-C (single or aromatic bond between carbon and oxygen in a molecule with an amide group), O-CO (esters), and Ph-NHC (aromatic secondary or tertiary amines) are highly significant as well. Another interesting fact is that the presence of a 3-membered ring in a molecule noticeably increases its persistence (leaves 8 and 12 in Figure 1).
Generally, the opening nodes of the decision tree move compounds into classes of lower persistence, where separation is easier and covers more compounds. Taking into account Figure 1 and Table 3, the opening nodes (leaves 1-4) may be regarded as the most important, together with the special algorithmic part for processing aromatic nitrogen-bearing heterocycles. Compounds with an aromatic nitrogen atom were handled separately from the decision tree because no fragment was found that could discriminate a particular degradability class. The presence of fragment N-CA-X decreases the pesticide stability, possibly through a coupling effect, whereas the presence of fragment N-CA-C has the opposite effect. Other fragments of the algorithmic part decrease the estimated degradability class (denoted 'I' in the algorithm).
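To make the idea of an unsaturation index concrete, the sketch below computes the plain, unweighted ratio of C=C, C=N, and C=O double bonds to the number of bonds between non-hydrogen atoms (nBO). It is an assumption-laden simplification: the actual weighting used in UNSATW is not given in the text and is not reproduced here, and the bond-list input format and the function name `unsaturation_ratio` are hypothetical.

```python
def unsaturation_ratio(heavy_atom_bonds):
    """Unweighted unsaturation ratio: counted double bonds divided by nBO.

    heavy_atom_bonds is a list of (element_a, element_b, bond_order)
    tuples covering only bonds between non-hydrogen atoms.
    """
    # Element pairs whose double bonds are counted: C=C, C=N, C=O.
    counted_pairs = {
        frozenset(("C", "C")),
        frozenset(("C", "N")),
        frozenset(("C", "O")),
    }
    n_bo = len(heavy_atom_bonds)
    n_db = sum(
        1
        for a, b, order in heavy_atom_bonds
        if order == 2 and frozenset((a, b)) in counted_pairs
    )
    return n_db / n_bo if n_bo else 0.0


# Acetone, CH3-C(=O)-CH3: two C-C single bonds and one C=O double bond.
acetone_bonds = [("C", "C", 1), ("C", "O", 2), ("C", "C", 1)]
acetone_ratio = unsaturation_ratio(acetone_bonds)
```

Normalizing by nBO makes the index comparable between small and large molecules, so that one double bond in a large saturated skeleton counts for less than one in a small molecule.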
Table 4 and Table 5 summarize the classification results for the training and test sets. For the extremely heterogeneous training set of 315 compounds, which includes almost all types of pesticides, only 31 descriptors and 19 rules in total were employed, giving 86.3% correctly calculated values among the three classes (see Table 4). For the test set of 105 compounds, 78.1% of the values were predicted correctly. Many misclassified pesticides have either a very wide range of half-life values or few observations and may not belong unambiguously to one specific class; such compounds are marked in Table 1 and Table 2. There are also many unique structural classes for which no reliable rule can be produced.
The case of two-unit misclassifications deserves special discussion. (i) A negative value means that a compound breaks down very quickly, contrary to the model output; benodanil and bentazon have unique fragments (an iodine atom and a sulfuric diamide chain, respectively), while the instability of propanil in soil is caused by its known specific sensitivity to microbial hydrolysis [39]. (ii) A positive value of the mistake is potentially dangerous for the environment. Both fenac and chloroneb are persistent but were falsely marked as quickly degradable by the UNSATW descriptor. For the test set, there were only positive two-unit misclassifications, also incorrectly marked by the UNSATW descriptor (leaf 4): 2,3,6-TBA acid, clofencet (a rare phenyl-hydrazide), and cycloheximide (heterogeneous and very limited persistence information). The UNSATW parameter was designed to discriminate molecules of weak persistence across widely varied structural classes and has the highest discriminating power for the training set (see Table 3). Such high generality may therefore lead to mistakes in specific cases.
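The layout of Table 4 and Table 5 (interclass mistake, Observed - Predicted, against count) can be produced from paired class labels as in the following sketch. The five label pairs are made-up toy data used purely for illustration, not values from the paper, and the function name `summarize_mistakes` is hypothetical.

```python
from collections import Counter


def summarize_mistakes(observed, predicted):
    """Count interclass mistakes (observed - predicted) and percent correct."""
    mistakes = Counter(o - p for o, p in zip(observed, predicted))
    percent_correct = 100.0 * mistakes[0] / len(observed)
    return mistakes, percent_correct


# Toy data only (hypothetical): persistence classes 1..3 for five compounds.
observed_classes = [1, 2, 3, 3, 1]
predicted_classes = [1, 2, 3, 1, 2]

mistake_counts, percent_correct = summarize_mistakes(
    observed_classes, predicted_classes
)
```

In this convention a mistake of 0 is a correct prediction, +2 means a persistent compound was predicted to degrade quickly (the environmentally dangerous direction), and -2 means the opposite.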
Comparing this result with earlier work, we note that the prediction rate is comparable with that of the best models handling two classes (rapidly/slowly degradable) [19, 20], whereas we processed three classes (low, moderate, or high persistence).
We developed a data set of pesticide persistence and explained it with a decision tree model. The structure of the QSBR solution is visual and simple. For every pesticide in Table 1 and Table 2, the terminal leaf is given; therefore, the classification path through the tree can be reproduced with ease. This is an important achievement because our model not only allows prediction of field persistence but also assists in the design of new compounds with a desired degradation ability.
On the basis of this research, a computer expert system, EKeeper, has been developed and made available for downloading [40]. Future investigations will be directed toward a completely automated QSBR analysis of the presented data set using the new concept of fuzzy fragments.
Table 1. Classified Observed and Predicted persistence of pesticides in environment, and predicting Terminal Leaves (see Figure 1) for the training set of 315 compounds.
Table 2. Classified Observed and Predicted persistence of pesticides in environment, and predicting Terminal Leaves (see Figure 1) for the test set of 105 compounds.
Table 3. Contribution of the core rules.
Figure 1. Decision tree model for classification of pesticide persistence in environment. Each decision node is accompanied by the numbers of compounds that arrive at the node and flow away. Terminal leaves are marked by a double border; their index numbers are given in circles. The algorithmic part is shown in a BASIC-like style.
Figure 2. Chart of topological descriptors. Parameters nDB and nBO are the constitutional descriptors; UNSATW is the empirical parameter. Each other predictor is the number of non-overlapped occurrences of the corresponding fragment in a molecular structure.
Table 4. Summary of classification results for the training set.
|Interclass mistake (Observed - Predicted)||Count|
Table 5. Summary of classification results for the test set.
|Interclass mistake (Observed - Predicted)||Count|
This work was partially supported by the Japan Chemical Industry Association. The authors are also thankful to the U.S. Environmental Protection Agency, the U.S. National Library of Medicine, and the Pesticide Management Education Program at Cornell University for free access to the corresponding databases.
[ 1] We accepted the common tendency and used the QSBR abbreviation (Quantitative Structure-Biodegradation Relationship). In fact, the biodegradation is one of the major ways of chemical decay, and often is associated with the whole degradation process.
[ 2] G. Klopman and M. Tu, Encyclopedia of Computational Chemistry, Wiley, Chichester (1998), pp. 128-135.
[ 3] W. J. G. M. Peijnenburg, Pure Appl. Chem., 66, 1931 (1994).
[ 4] J. R. Parsons and H. A. J. Govers, Ecotoxicol. Environ. Safety, 19, 212 (1990).
[ 5] G. J. Niemi, G. D. Veith, R. R. Regal, and D. D. Vaishnav, Environ. Toxicol. Chem., 6, 515 (1987).
[ 6] R. S. Boethling, B. Gregg, F. R. Gabel, N. W. Campbell, and A. Sabljic, Ecotoxicol. Environ. Safety, 18, 252 (1989).
[ 7] S. M. Desai, R. Govind, and H. H. Tabak, Environ. Toxicol. Chem., 9, 473 (1990).
[ 8] P. Bhagat, Chem. Eng. Prog., 86, 55 (1990).
[ 9] G. Klopman and M. J. McGonigal, J. Chem. Inf. Comput. Sci., 21, 48 (1981).
[10] K. Hiromatsu, Y. Yakabe, K. Katagiri, and Tsu. Nishihara, Chemosphere, 41, 1749 (2000).
[11] H. H. Tabak, C. Gao, S. Desai, and R. Govind, Water Sci. Technol., 26, 763 (1992).
[12] H. H. Tabak and R. Govind, Environ. Toxicol. Chem., 12, 251 (1993).
[13] BIODEG, Environmental Fate Database of Syracuse Research Corporation, Environmental Science Center division, 301 Plainfield Road, Syracuse, NY 13212 USA.
[14] P. H. Howard, R. S. Boethling, W. M. Stiteler, W. M. Meylan, A. E. Hueber, H. A. Beauman, and M. E. Larosche, Environ. Toxicol. Chem., 11, 593 (1992).
[15] G. Klopman, D. M. Balthasar, and H. S. Rosenkranz, Environ. Toxicol. Chem., 12, 231 (1993).
[16] J. Devillers, D. Domine, and R. S. Boethling, Neural Networks in QSAR and Drug Design, ed by J. Devillers, Academic Press, New York (1996), pp. 65-82.
[17] The correlation coefficient for this test set was absent in the original, but was calculated by ourselves from the tabulated results of the original's Table II.
[18] Japan Chemical Industry Ecology-Toxicology & Information Center (JETOC), "Biodegradation and Bioaccumulation Data of Existing Chemicals Based on the Chemical Substances Control Law (CSCL Japan)," Tokyo (1992).
[19] H. Loonen, F. Lindgren, B. Hansen, W. Karcher, J. Niemela, K. Hiromatsu, M. Takatsuki, W. Peijnenburg, E. Rorije, and J. Struijs, Environ. Toxicol. Chem., 18, 1763 (1999).
[20] A. Sabljic and W. Peijnenburg, Pure Appl. Chem., 73, 1331 (2001).
[21] A. C. Waldron, "Pesticides and Groundwater Contamination," Ohio State University Extension Bulletin, 820, Columbus (Ohio) (1992). Available on the Internet at http://ohioline.ag.ohio-state.edu/b820/index.html.
[22] A. G. Hornsby, R. Don Wauchope, and A. E. Herner, Pesticide Properties in the Environment, Springer, New York (1995).
[23] Pesticide Property Database of the Alternate Crops and Systems Laboratory of Beltsville Agricultural Research Center. Available on the Internet at http://wizard.arsusda.gov/acsl/ppdb.html.
[24] Hazardous Substances Data Bank of U.S. National Library of Medicine. Available on the Internet at http://toxnet.nlm.nih.gov/cgi-bin/sis/htmlgen?HSDB.
[25] Pesticide Management Education Program at Cornell University. Available on the Internet at http://pmep.cce.cornell.edu.
[26] A. McCulloch, Chemosphere, 47, 667 (2002).
[27] Only a very short record for each chemical is given in Table 1, because of space limitations. Detailed information about all HL values and molecular structures is available from the authors on request.
[28] Compendium of Pesticide Common Names. Available on the Internet at http://www.hclrss.demon.co.uk.
[29] ChemIDPlus Database of U.S. National Library of Medicine. Available on the Internet at http://chem.sis.nlm.nih.gov/chemidplus/.
[30] J. March, Advanced Organic Chemistry: Reactions, Mechanisms, and Structure, Wiley, New York (1992).
[31] M. K. Cyranski, T. M. Krygowski, A. R. Katritzky, and P. von R. Schleyer, J. Org. Chem., 67, 1333 (2002).
[32] M. Karelson, Molecular Descriptors in QSAR/QSPR, Wiley, New York (2000).
[33] J. Devillers, Encyclopedia of Computational Chemistry, Wiley, Chichester (1998), pp. 932-941.
[34] J. W. Raymond, T. N. Rogers, D. R. Shonnard, and A. A. Kline, J. Hazard. Mater., 84, 189 (2001).
[35] "KDnuggets News," the e-newsletter on Data Mining, Data Mining Books section. Available on the Internet at http://www.kdnuggets.com/publications/books.html.
[36] J. R. Rose, Encyclopedia of Computational Chemistry, Wiley, Chichester (1998), pp. 1521-1525.
[37] L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone, Classification and Regression Trees, Wadsworth, Belmont (1984).
[38] Data analysis and statistical programming environment STATISTICA v. 5-6. Information is available on the Internet at http://www.statsoft.com.
[39] R. Bartha, J. Agr. Food Chem., 19, 385 (1971).
[40] EKeeper software for evaluation of the level of persistence of chemicals in environment. Available on the Internet at http://www.mis.tutkie.tut.ac.jp/; go to "English" / "MIS-services".