Pharmacophore Refinement in the Chemical Structure Space

Satoshi FUJISHIMA, Yoshimasa TAKAHASHI and Takashi OKADA


1 Introduction

It is important to establish relationships between the structures of chemical compounds and their physiological activities. After King et al. succeeded in applying the inductive logic programming method to mutagenicity analysis [1], several techniques were proposed for extracting characteristic substructures from chemical compounds with a variety of structures [2]. These methods are divided into two approaches.
The first approach tries to discover characteristic substructures by using a relational learning scheme, such as inductive logic programming or graph-based mining methods [3, 4]. The main shortcoming of this approach lies in the huge search space within the graph space, and the problem becomes obvious when the search space includes unconnected graphs. Therefore, the mining process must often use features that characterize very minute points like atomic charge, and chemists cannot identify the pharmacophore in their language, i.e., the structural formula. Otherwise, the discovered substructures are limited to ones that are very significant statistically, and most of these are already well-known to medicinal chemists. As a result, this kind of approach has not gained popularity in pharmacological research.
In the second approach, the mining method is confined to propositional logic, and numerous substructures are generated initially and used as features. Then, a combination of a few features is expected to explain the pharmacophore. The problem with this approach is that the description of the pharmacophore is limited by the space expressed in the features. Therefore, the classifiers do not always show a substructure that medicinal chemists can understand easily, even if the classifiers possess strong discriminating power.
We applied the second approach using linear fragments and the cascade model [5, 6], and succeeded in determining new pharmacophore structures. However, chemists must spend a great deal of time in translating rules to reveal the pharmacophore structures.
In this paper, we start from obtained rules, and place the essential fragment of the rule into the chemical structural formula space. The seed fragment is then refined to give the pharmacophore structure using a simple hill-climbing procedure. We expect this method to provide information that chemists can understand easily. The next section outlines the problem in terms of analyzing the dopamine D1 agonist. Here, we can see an obtained rule, as well as a final pharmacophore structure hand-carved by a chemist. The third section describes an implementation of the knowledge-refinement system using the SMILES/SMARTS chemical structure representation language. The results of applying this to the rules for dopamine D1 agonists are discussed in the last section.

2 Formulation of the Problem

2. 1 The Cascade Model

One of the authors developed the cascade model: a mining method that can detect local correlations from among hundreds of attributes [7]. The cascade model can be considered an extension of association rule mining. The method creates an itemset lattice in which an [attribute: value] pair is used as an item to constitute itemsets. Links in the lattice are selected and interpreted as rules. That is, we observe the distribution of the RHS (right hand side) attribute values along all links, and if a distinct change in the distribution appears along some link, then we focus on the two terminal nodes of the link. Consider that the itemset at the upper end of a link is [A: y] and item [B: n] is added along the link. If a marked activity change occurs along this link, we can write the rule:

where the added item [B: n] is the main condition of the rule, and the items at the upper end of the link ([A: y]) are considered preconditions. In the following "THEN" clauses, (y n) means pair of ratios of the presence and absence of an attribute described after "THEN" clause, respectively. The first "THEN" clause describes that while the number of supporting instances decreases from 200 to 50, the main condition changes the ratio of the active compounds increases from 0.8 to 0.3, whereas the ratio of the inactive compounds from 0.2 to 0.7. BSS means the between-groups sum of squares, which is derived from the decomposition of the sum of squares for a categorical variable. Its value can be used as a measure of the strength of a rule [9].
The BSS is calculated as follows:

where i designates an attribute, the superscripts U and L indicate the upper and lower nodes, respectively, n is the number of cases supporting a node by applying the main condition; and pi(a) is the probability of obtaining the value a for attribute i. The second "THEN" clause indicates that the distribution of the values of attribute [C] also changes sharply with the application of the main condition. This description is called the collateral correlation.

2. 2 Computation of Rules in the Previous Study

The method was successfully applied to the analysis of the relationship between physiological activity and a set of linear fragments generated from chemical structures [5]. Recently, our efforts have concentrated on the problem of revealing the pharmacophore structures of dopamine agonists and antagonists precisely.
The MDDR database (version 2001.1) by MDL Inc. was used as the data source [8]. It contains 369 records that describe dopamine (D1, D2, and autoreceptor) agonist activity, with 63, 143, and 186 active compounds possessing D1, D2, and autoreceptor activity, respectively. Some of the compounds affected multiple receptors.
In the dopamine D1 agonist study [6], 345 fragments and 4 physicochemical properties were used to construct the itemset lattice. The resulting lattice contained 9,508 nodes, and 1,762 links (BSS > 2.583 = 0.007*#compounds) were selected as rule candidates. Greedy optimization of these links resulted in 407 rules (BSS > 5.535 = 0.015*#compounds). Organization of these rules gave us 14 principal rules and 53 relative rules. Many rules indicate characteristics leading to inactive compounds or have few supporting compounds, and those rules with activity ratio < 0.8 and those with #compounds < 10 after the application of the main condition were omitted. Finally, 2 principal and 14 relative rules were inspected.
The following rule was the strongest rule derived in the dopamine D1 agonist study. Where item [O2H-c3:c3-O2H:y] is the main condition of the rule, and the items in the final parentheses ([ ]) are considered preconditions. No preconditions are applied here. The number of supporting compounds decreases from 369 to 52 on application of this main condition. The first "THEN" clause describes how the ratio of active compounds, which have D1 agonist activity, increases from 17% in 369 compounds to 96% in 52 compounds with this main condition. The cascade model searches for such links that change the activity ratio significantly.

The following "THEN" clauses are called collateral correlations. For example, the second "THEN" clause reveals that the dopamine autoreceptor agonist activity ratio is reduced to 0%. Other collateral correlations suggest the existence of an N3H-C4H-C4H-c3 group located at the positions meta and para to the OH group.
The final pharmacophore derived from this rule is shown in Figure 1. However, in the former studies, chemists must inspect all of the rule conditions and collateral correlations carefully, referring to the supporting chemical structures, and thus the time and effort involved in this work is truly prohibitive. The purpose of this study was to find an easier way to obtain the pharmacophore.

Figure 1. Pharmacophore structure derived from the strongest rule for dopamine D1 agonist activity.

3 Refinement Method

The knowledge refinement system accepts a seed and searches its neighborhood to reach a pertinent piece of information. The pharmacophore knowledge is a substructure pattern that affects an activity, and we use the SMILES/SMARTS language to describe the pattern, which can be interpreted by the structure database system. Once molecules matching the pattern are retrieved, their adequacy as the pharmacophore can be judged using the BSS values, and we can apply a hill-climbing approach in the search. It shares the idea of using SMILES string in the detection of the pharmacophore with SMIREP [10], but its building blocks and the optimization criterion is completely different from the refinement process proposed here. This section first introduces the SMILES/SMARTS language used to represent chemical structures and patterns, and then describes the refinement algorithm and user support facilities.

3. 1 SMILES/SMARTS Language

The Simplified Molecular Input Line Entry System (SMILES) is a line notation developed to enter and represent molecules and reactions using short ASCII strings [11]. Its expression is already well established as having the capability of expressing general organic chemicals. SMARTS is a straightforward extension of SMILES language that describes molecular patterns [12]. It allows us to specify substructures including wildcard atoms (*) and bonds (~). Both of these languages are widely used, such as in keys for database access, entry systems for chemical data, and specifying search structures in chemical database search applications.
We used SMILES/SMARTS as a way to specify chemical structures and patterns. The fragment expression that appears in a rule shares many points with SMILES/SMARTS, and it is easy to compare them. We can easily retrieve molecules with a specified pattern when we apply a chemical structure database system using SMILES/SMARTS expressions. In addition, a chemist can enter a seed pattern based on an idea and start the refinement procedure.
Figure 2 shows examples of SMARTS notation for the fragment enclosed by the dotted line, which appeared in the main condition of the rule discussed earlier. Strict translation of the fragment results in notation A, while organic chemists usually use the simplified notation B.

Figure 2. Examples of SMARTS notation for the fragment enclosed by a dotted line.

3. 2 Refining a SMARTS Pattern

The refinement process necessitates a seed SMARTS pattern, a database of specific structures written in SMILES, and lists of atom and bond types (Table 1) to be inserted into the pattern. The following algorithm describes the refinement process:

Algorithm 1. Refinement algorithm from a SMARTS pattern.

  1. Set Alist to atom types and Blist to bond types.
  2. Set ptrn to initial seed fragment written in SMARTS.
  3. Compute BSS from the instances retrieved by ptrn.
  4. Set maxBSS = 0.0.
  5. For each combination of atom in Alist and bond in Blist,
  6.   For each position in ptrn,
  7.     Generate newptrn by inserting atom and bond at
        the position in ptrn.
  8.     Set tuples to the instances retrieved by newptrn.
  9.     Compute newBSS from tuples.
  10.     If newBSS > maxBSS,
          set maxBSS = newBSS; maxptrn = newptrn.
  11. If maxBSS < BSS then return ptrn.
  12. Set ptrn = maxptrn; BSS = maxBSS
  13. Goto step 4.

Table 1. Default atom and bond types lists that can be inserted into the seed fragment in the system. Lowercase letters indicate atoms in a conjugated system, and [ ] implies that one of the atoms in parentheses is to be chosen.
Atom list:C, O, N, S, c, o, n, s, [I, Br, Cl, F]
Bond list:-, =, :

In step (9), the BSS value for refinement process is computed as follows:

where p(a) and pnewptrn(a) are the probabilities of active/inactive compounds retrieved using the initial seed fragment and the newptrn, respectively. The refinement process is continued until the BSS value of the selected pattern in the current step is smaller than that of the previous step, and the previous pattern is returned as refinement pattern (in step (11)).
Table 2 shows three candidate patterns generated in the first refinement step, where the initial seed is set to the SMARTS string in Figure 2(B). The first two columns in Table 2 show the candidate patterns and corresponding structural formula. The atoms depicted in bold are the inserted atoms. For example, the first and second patterns are the results of aromatic ring expansion using carbon and nitrogen atoms, respectively, while a carbon is attached to the oxygen atom in the last example.
The third and final columns show the distributions of active/inactive compounds and their BSS values, respectively. In this case, the pattern in the first row has the highest BSS value, and it is selected for further refinement in the succeeding steps.

Table 2. Three candidate patterns generated at the initial refinement step starting from OccO.

3. 3 Implementation of the System

The refinement system has been implemented as a Web application program using python and OEChem toolkit. OEChem is a program library for chemistry and chemical informatics developed by OpenEye Scientific Software [13]. We can use the functionalities from python.
Database access using SMARTS and SMILES is one of the basic functions of OEChem, and we can easily develop a system that stores molecules written in SMILES and retrieves molecules that match a SMARTS pattern. The handling of SMARTS expressions has been implemented in python.
Another function that makes it easier for the user is the display of the supporting specific chemical structures. Users can recognize the molecular environment surrounding the pharmacophore, and judge the adequacy of the proposed information. In fact, chemists will not accept the pharmacophore hypotheses proposed by the refinement system unless they can check the structural formulae of the supporting compounds. We implemented this display function using OpenEye's OEDepict (Ogham) library, which can illustrate structural formulae from a list of SMILES notations. The actual display window is shown in the next section.

4 Results and Discussion

Table 3. Rules suggesting the characteristics of dopamine D1 agonist activity. #compounds and Distribution change indicate that change of the number of supporting compounds and activity ratio increase on before/after application of main condition.
Rule ID#compoundsMain conditionPreconditionsDistribution change
R1369 ® 52[O2H-c3:c3-O2H: y]none17% 96%
R1-UL9288 ® 16[C4H-N3----:c3-O2H: y][N3H: y]19% ®100%
R1-UL12170 ® 12[N3H-C4H---::c3-O2H: y][N3H-C4H--:c3H:c3H: n]14%®100%
[O2-c3:c3: n]
R1472 ® 11[C4H-C4H-O2-c3: n][LUMO: 1-3]15% ® 91%
[C4H-N3---c3:c3: n]
[O2-c3:c3:c3H: y]

The knowledge refinement system is applied to the pharmacophore of the dopamine D1 agonist activity in this section. We analyzed this activity using the cascade model and obtained 16 rules, from which chemists selected the four rules shown in Table 3. Here, we examine rules R1 and R1-UL9. The linear fragment that appeared in these rules is given to the refinement system as a seed pattern, and the resulting pharmacophores are evaluated.

Table 4. Refinement process starting from OccO with structural formula of the seed fragment and five patterns generated from the seed. Support (Y/N) shows the distributions of active/inactive compounds.

4. 1 Rule 1: The Strongest Rule

Rule R1 is already explained in Section 2, and no preconditions exist. The change in the activity ratio is very sharp, from 17% in 369 compounds to 96% in 52 compounds, and this might be detected using any mining method. However, this is not the final pharmacophore that chemists expect. Let us examine the refinement results when we start from two initial patterns: OccO and [O;H1]cc[O;H1] (Figure 2).
Table 4 summarizes the refinement process starting from OccO and depicts the structural formula expressions of the patterns at steps 0-4 and 8. We can see that the BSS value increases at steps 2 and 4, while the value does not change at the other steps. Few alternative pattern candidates occur with the same BSS at every refinement step, but we reach the same final pattern even if we use other pattern expressions in the refinement process.
The BSS increases at steps 2 and 4 suggest that the attached components at these steps have significant meaning for understanding the pharmacophore. The decreases in the number of supporting negative instances are the main reason that the BSS increases, and chemists can recognize the group of compounds excluded by growth of this pattern.
Some supporting molecules that form the final pattern are shown in Figure 3, where the bold line indicates the pattern. Active and inactive molecules are shown at the top and bottom of the browser window. This illustration makes it is very easy for chemists to evaluate the adequacy of the proposed pharmacophore.

Figure 3. Structures supporting the refined pharmacophore at step 8 in Table 4. The bold line part indicates the final pattern obtained. Red color is used instead of the bold line in the actual application.

Table 5. Refinement results using the other SMARTS pattern for the fragment under the main condition of Rule1.
SMARTS patternSupport (Y / N)BSS
Seed pattern[O;H1]cc[O;H1]50 / 232.557
Refined pattern[O;H1]c(:c(:c(:c(:c))(-C(-C(-N)))))c[O;H1]50 / 133.471

Table 5 summarizes another refinement starting from the other initial pattern: [O;H1]cc[O;H1]. This seed is the strict interpretation of the rule fragment shown in Figure 2(A). It had 52 supporting compounds, of which 50 had dopamine D1 agonist activity. The resulting pharmacophore shown in Figure 4 is identical to the hand-carved one illustrated in Figure 1, and it differs from that in step 8 (Table 4) in two points: the attachment of hydrogen atoms at the oxygen and the absence of a carbon atom connected to the aromatic ring. This result shows that the fragment that appeared under the main condition of R1 has captured the essential point of the pharmacophore.
The group of compounds excluded by growth of the pattern can also be recognized by comparing the supporting compounds before and after pattern growth. Chemists can compare the structures of active/inactive compounds to devise a structural explanation for their difference.
A previous study proposed the same pharmacophore after a great deal of manual work. The refinement system presented here proposes the same substructure efficiently, and it helps chemists to evaluate hypotheses easily. The group of compounds excluded by growth of the pattern can also be recognized by comparing the supporting compounds before and after pattern growth. Chemists can compare the structures of active/inactive compounds to devise a structural explanation for their difference.

4. 2 Rule 1-UL9: A Rule with a Precondition

In R1-UL9, the activity ratio increases from 19% in 288 compounds to 100% in 16 compounds with the presence of the fragment C4H-N3----:c3-O2H (Table 3). However, a precondition indicates that this activity change takes place when an N3H fragment exists. Since the placement of these two fragments is not given, it is difficult to deduce the pharmacophore from this rule.
The SMARTS language allows the disconnected structure as the query pattern. That is, we can retrieve structures containing these two fragment patterns when we give a query concatenating two SMARTS patterns by using a period (".") delimiter.
The refinement process using this technique is summarized in Table 6. Several structures supporting the final pattern are shown in Figure 5, and we can see that a large connected pattern has developed. Note that the lengths between the two initial patterns differ in these specific structures, as shown by the two structural formulae in Table 6. The result was compared with that of the preceding research shown in Figure 6. The two results have almost the same chemical meaning, and chemists judged that it was easy to reach the pattern in Figure 6 from the structures shown in Figure 5.

Table 6. Refinement results using the fragment of main condition and precondition of the R1-UL9.

Figure 4. The pharmacophore extracted starting from [O;H1]cc[O;H1] (Figure 2(A)).

Figure 5. Specific structures retrieved from the final pattern in Table 6.

Figure 6. The pharmacophore extracted from R1-UL9 in the preceding work.

5 Conclusion

We have shown that knowledge refinement is very effective at determining valid pharmacophore hypotheses. The method used was a simple hill-climbing approach based on optimizing the BSS value. We cannot attain a valid conclusion if we start from an invalid initial pattern. Therefore, the success of our approach can be attributed to two reasons. First, we can start the refinement process from a fragment with good characteristics. Second, the BSS criterion used is an effective guide leading to a satisfactory hypothesis.
Similar approaches are expected to provide high-quality knowledge in other domains. That is, a good starting hypothesis is obtained using the standard mining method, and then refinement proceeds in the original problem space. Some examples involve mining from sentences in natural language and from plant operations in which the quality of knowledge should be judged in the problem-specific representation space.
From the viewpoint of applications in chemistry, the burden of rule interpretation has been reduced greatly. Moreover, the system allows chemists to incorporate their own ideas in the initial hypothesis, which can then give a somewhat unconstrained feeling to users. Our system is now one of the principal software components in the pharmacophore knowledge-base project being developed in our laboratory.

The authors thank Dr. Masumi Yamakawa for his work on dopamine activity analysis as a medicinal chemist. This research was partially supported by the Ministry of Education, Science, Sports and Culture, Grant-in-Aid for Scientific Research (A) 14208032.


[ 1] R. D. King, S. H. Muggleton, A. Srinivasan, and M. J. E. Sternberg, Structure-activity relationships derived by machine learning: The use of atoms and their bond connectivities to predict mutagenicity by inductive logic programming, Proc. Natl. Acad. Sci. U. S. A., 93, 438-442 (1996).
[ 2] T. Okada, Mining from chemical graphs, Mining Graph Data, ed. by Holder, L. B., and Cook, D. J., Wiley-Interscience (2006), pp.347-379.
[ 3] A. Inokuchi, T. Washio, and H. Motoda., An apriori-based algorithm for mining frequent substructures from graph data, Proc. PKDD 2000, ed. by D.A. Zighed, H.J.Komorowski, and J.M. Zytkow, Springer-Verlag, LNAI 1910 (2000), pp.13-23.
[ 4] S. Kramer, L. De Raedt, and C. Helma, Molecular feature mining in HIV data, Proc. of the 7th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, ed. by F. Provost and R. Srikant, ACM Press (2001), pp.136-143.
[ 5] T. Okada, Discovery of structure activity relationships using the cascade model: the mutagenicity of aromatic nitro compounds, J. Comput. Aided Chem., 2, 79-86 (2001).
[ 6] T. Okada, M. Yamakawa, Characteristic Ligand Substructures to Dopamine Receptors, Joint Workshop of Vietnamese Society of AI, SIGKBS-JSAI, ICS-IPSJ and IEICE-SIGAI on Active Mining, 2004, 123-128.
[ 7] T. Okada, Efficient detection of local interactions in the cascade model, Knowledge Discovery and Data Mining PAKDD-2000, ed. by Terano, T., Liu, H., and Chen A. L. P., Springer-Verlag, LNAI 1805 (2000), pp.193-203.
[ 8] MDL, Drug Data Report, MDL, 2001.1 (2001).
[ 9] T. Okada, Rule induction in cascade model based on sum of squares decomposition, Principles of Data Mining and Knowledge Discovery PKDD99, ed. by Zytkow, J. M., and Rauch, J., Springer-Verlag, LNAI 1704 (1999), pp.468-474.
[10] A. Karwath and L. De Raedt, SMIREP: Predicting Chemical Activity from SMILES, J. Chem. Inf. Model., 46, 2432-2444 (2006).
[11] D. Weininger, SMILES, a chemical language and information system. 1. Introduction and encoding rules, J. Chem. Inf. Comput. Sci., 28, 31-36 (1988).
[12] Daylight Chemical Information Systems, Inc.. Daylight Theory: SMARTS, Available online at
[13] OpenEye Scientific Software, OEChem Toolkit,