Prediction of Carcinogenicity of Noncongeneric Chemical Substances by a Support Vector Machine

Kazutoshi TANABE (a)*, Takahiro SUZUKI (b), Mikio KAIHARA (c) and Natsuo ONODERA (a)

(a) Graduate School of Library, Information and Media Studies, University of Tsukuba
Kasuga 1-2, Tsukuba 305-8550, Japan
(b) Faculty of Economics, Toyo University, Hakusan
Bunkyo-ku, Tokyo 112-8606, Japan
(c) Department of Chemical Engineering, Ichinoseki National College of Technology
Takanashi, Hagisho, Ichinoseki 021-8511, Japan

(Received: December 11, 2007; Accepted for publication: August 4, 2008)

The ability to assess the toxicity of a chemical substance depends on the information available on the compound and/or its related compounds. Among the chemicals currently in commerce, the toxicity of very few has been established, and reliable carcinogenicity data are especially limited for pharmaceutical chemicals. Attempts have therefore been made to estimate carcinogenicity with quantitative structure-activity relationship (QSAR) models. However, none of the models developed so far shows satisfactory performance in predicting the carcinogenicity of noncongeneric chemicals from their structures.
The support vector machine (SVM) technique was applied to develop a QSAR model that relates the structures of diverse chemicals to their carcinogenicity, and its predictive performance was compared with that of our previous artificial neural network (ANN) model. The relationship between the experimental carcinogenicity data on 454 chemicals used in the Predictive Toxicology Challenge (PTC) 2000-2001 contest and 37 molecular descriptors calculated from their structures alone was analyzed by support vector regression (SVR) using the LIBSVM software, ver. 2.85. Models were optimized by cross-validation on the training dataset, and their performance was evaluated on the test dataset.
Training the ANN models took several months on seven PCs because of problems such as over-training, over-fitting and local minima, whereas the SVM achieved a predictability of 74 %, comparable to that of the ANN, in a much shorter computation time. This reflects an advantage of SVM: training yields a single global optimum, whereas ANN training yields numerous local minima. Moreover, the prediction accuracy of the SVM model was higher than the best predictability of 71 % reported in the literature for the same dataset. It is concluded that the support vector machine, a novel nonlinear machine learning approach, yields a model that predicts the carcinogenicity of noncongeneric chemicals from molecular structure information alone with higher performance than any approach proposed so far.
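
The model-selection step mentioned in the abstract (optimizing on the training dataset by cross-validation, then scoring on held-out data) can be sketched roughly as follows. This is only an illustrative sketch, not the authors' procedure: the regressor is a simple Gaussian-kernel weighted average used as a stand-in for SVR so that the example needs nothing beyond the Python standard library, and the toy data, the 0.5 classification threshold, the candidate `gamma` values and all function names are assumptions introduced here.

```python
import math
import random


def rbf_kernel(x, z, gamma):
    """Gaussian (RBF) kernel between two descriptor vectors."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))


def kernel_predict(train_X, train_y, x, gamma):
    """Stand-in regressor (kernel-weighted average of training targets), NOT SVR."""
    weights = [rbf_kernel(t, x, gamma) for t in train_X]
    total = sum(weights)
    if total == 0.0:  # all weights underflowed: fall back to the mean target
        return sum(train_y) / len(train_y)
    return sum(w * y for w, y in zip(weights, train_y)) / total


def accuracy(train_X, train_y, test_X, test_y, gamma, threshold=0.5):
    """Fraction of test items whose thresholded prediction matches the label.

    The 0.5 threshold is an assumption made for this sketch.
    """
    hits = 0
    for x, y in zip(test_X, test_y):
        pred = kernel_predict(train_X, train_y, x, gamma)
        hits += int((pred >= threshold) == (y >= threshold))
    return hits / len(test_y)


def k_fold_indices(n, k, seed=0):
    """Shuffle 0..n-1 and deal the indices into k disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cross_validated_accuracy(X, y, gamma, k=5):
    """Mean held-out accuracy over k folds of the training data."""
    scores = []
    for held_out in k_fold_indices(len(X), k):
        held = set(held_out)
        train = [i for i in range(len(X)) if i not in held]
        scores.append(accuracy([X[i] for i in train], [y[i] for i in train],
                               [X[i] for i in held_out], [y[i] for i in held_out],
                               gamma))
    return sum(scores) / len(scores)


def select_gamma(X, y, candidates, k=5):
    """Pick the kernel width with the best cross-validated accuracy."""
    return max(candidates, key=lambda g: cross_validated_accuracy(X, y, g, k))


def make_toy_data(n=60, n_descriptors=3, seed=1):
    """Synthetic descriptors and 0/1 labels; purely illustrative data."""
    rng = random.Random(seed)
    X = [[rng.random() for _ in range(n_descriptors)] for _ in range(n)]
    y = [1.0 if sum(row) > n_descriptors / 2 else 0.0 for row in X]
    return X, y


if __name__ == "__main__":
    X, y = make_toy_data()
    best = select_gamma(X, y, [0.01, 0.1, 1.0, 10.0])
    print("selected gamma:", best)
    print("cross-validated accuracy:", cross_validated_accuracy(X, y, best))
```

In a real study the stand-in regressor would be replaced by an SVR implementation such as LIBSVM, and the selected hyperparameters would then be scored once on a separate test dataset that played no part in the selection.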

Keywords: Support vector machine (SVM), Support vector regression (SVR), Carcinogenicity prediction, Artificial neural network (ANN), Quantitative structure-activity relationship (QSAR), Predictive Toxicology Challenge (PTC)
