3D Molecular Similarity: Method and Algorithms

Oleg URSU, Mircea V. DIUDEA and Shin-ichi NAKAYAMA


1 Introduction

The investigation of a molecular structure involves research on its constitution - the number and chemical identity of atoms and bonds joining them along with the configuration in 3D space. Molecular similarity has been studied from two different major view points: (i) topological similarity defined in connectivity and constitutional terms and (ii) geometrical similarity, when geometrical aspects of the molecular structure are taken into account.
Similarity searching in databases of 2D chemical structures is widely used for virtual screening and lead discovery. A similarity measure, that quantifies the degree of structural resemblance between the target structure and each of the database structure, is based on fingerprint or molecular descriptor encoding of the molecular structure with similarity between pairs of such representations being computed using the Tanimoto coefficient. Another topological similarity measure of increasing interest (although more computationally demanding) is detection of 2D maximum common subgraphs (MCS) proposed by Raymond et. al [1, 2]. The binding affinity of the ligand to the receptor site, which usually expresses the biological activity, is related to a single geometrical configuration of the ligand.
The procedures and algorithms used in detection of the MCS can be extended to 3D similarity searching, with several modifications, the most important one being conformational flexibility in the matching algorithm. This study presents a complete implementation of such an algorithm. The effectiveness of the proposed method in classifying chemical structures with respect to a given bioactive leader is evaluated.

2 Methods

Chemical graphs. All molecular structures can be represented as simple, undirected graphs. In a 3D chemical graph, the vertices denote atoms but an edge here can indicate the geometric distance or a range of distances between a pair of atoms (vertices). The main difference between these two representations is that the 3D chemical graph is usually weighted by geometrical distance (see Figure 1).

Figure 1. Acetic acid chemical graph representations

Distance geometry. In the ligand-receptor interaction mechanism, a ligand usually exhibits some degree of flexibility and thus the distances between atoms are not fixed.
One approach to cope with this drawback is to generate several conformations of low energy for each structure under consideration and then to compare all possible pairs of the resulting conformations. This approach is, however, computationally demanding, if exhaustive sampling of the molecules' conformational space is to be achieved and still cannot guarantee that the optimal similarity has been identified.
Distance geometry, herein considered, encodes the molecules' conformational flexibility within a single graphical representation by Crippen [3]. Specifically, each edge of a 3D molecular graph is represented by a range of distances spanning the maximum and minimum allowable distances between two atoms. Distance ranges are imposed by some constrains, e.g., distance and chirality. The distance constrains are simply the lower and upper bounds of the interatomic distances; the chirality includes the handedness of the asymmetric centers in the molecule.
Covalent distance constrains. The local covalent structure of a molecule is easily defined by distance constrains. Unless one is dealing with a highly strained ring system, it is sufficient to use the exact distance constrains in which the lower and upper bounds are equal. For example, the distances among covalently bonded pairs of atoms, are determined with high precision by the bond order and the types of atoms connected. Similarly, the bond angles can usually be determined from the covalent structure, while for fixed bond lengths there is a one-to-one relation between the bond angle and the geminal distance, so that these distances can also be determined. The relation between the geminal distances and the bond angle q is given explicitly by the law of cosines:

where, l13 = |d12 - d23| and u13 = d12 + d13 are called the lower and upper triangle inequality limits, respectively.
Vicinal distance constrains. Similarly, when the incident and geminal distances are held fixed, there is a one-to-one relation between the absolute value of the torsion angle j and the vicinal distance, given by:

where l14 and u14 are cis and trans limits on the 1,4 distance.
Chirality constrains. The chirality c1234 of an ordered quadruple of points numbered 1,2,3,4 is given in terms of their Cartesian coordinates by the sign of the following determinant:

Torsion angle constrains. As shown above, the absolute value of a torsion angle can be constrained to any range of values by means of suitable 1,4 distance constraints, including its cis and trans limits. Moreover, since the chirality c1234 of a chain of four bonded atoms A1-A2-A3-A4 is equal to the sign of the torsion angle sgn(j) = 0,±1 about the 2,3 bond, by a suitable combination of distance and chirality constraints we can obtain any range of values with a given sign. This is sufficient to specify the rotameric state (gauche+, gauche- or anti) about the single bonds.
Steric distance constrains. Since two atoms cannot be in nearly the same place at the same time, in order to obtain reasonable conformations it is necessary to impose lower bound constraints on the distances between all pairs of atoms, separated by more than three bonds. For the sake of simplicity, these lower bounds are generally set to the sum of suitable hard sphere radii (van der Waals radii):

After applying the preceding distance and chirality constrains, we ensure that the structures which satisfy them are not grossly unreasonable on energetic grounds. In order to get the correct conformation, it is necessary to impose constrains on interatomic distances for atoms that are separated by four or more bonds. Such constrains are determined by bound smoothing procedures: the triangle bound smoothing and tetrangle bound smoothing.
Triangle bound smoothing. Triangle inequality bound smoothing is based upon the well-known triangle inequality among the distances:

for all triples of atoms i, j, k. It follows that if dik <= uik and djk <=ujk then:

So, if , uij > uik + ujk then uij > dij and hence uij can be replaced by the upper limit uik + ujk on dij without eliminating any conformations that satisfy the constrains dik <= uik and djk <= ujk (see Figure 2).

Figure 2. Triangle inequality.gif

In order to compute the triangle inequality limits efficiently, we can reduce their calculations to the all pairs, shortest path problem in a certain digraph. A path in such a digraph is a sequence of nodes such that any consecutive nodes in the sequence are connected by an arc in digraph (Figure 3), whose arrow points from the first to the second. It is easily seen that the upper triangle limits are equal to the lengths of the shortest path in an undirected digraph, whose arc lengths are equal to the given upper bounds. It can be shown that all the lower triangle limits are of the form:

where i, j, k, m are not necessarily distinct atom indices, lij and uij which respectively denote lower and upper bounds on the corresponding indexed atoms, and the overbars indicate the corresponding triangle inequality limits (eq. 5). This shows that the upper limits can be computed independently of the lower and also that the greatest lower limit cannot exceed the greatest lower bound.

Figure 3. The digraph whose shortest path determines the triangle inequality limits

Current implementation of the method uses Floyd's [7] shortest path algorithm. This algorithm takes each node k of the digraph in turn, and then makes a pass through all ordered pairs of other nodes (i, j). If the length of the path i ® k® j is shorter than the length of the direct path i ® j, the latter is set to the former. This ensures that after each pass all the path lengths are at least as short as any path that goes through node k, and hence iterating on this procedure for k = I, N produces the desired matrix of shortest paths.
Tetrangle inequality bound smoothing. Unfortunately, the triangle inequality limits represent a rather poor approximation to the actual Euclidean limits, so that the triangle inequality bound smoothing is not a very effective approach to locating errors in the bounds. A somewhat more effective (although much more time-consuming) approach looks at four atoms at a time, rather than three. In this case, the algebraic form of the relations among the distances is far more complicated, so that the tetrangle inequality limits are best described pictorially as in Figure 4.

Figure 4. Tetrangle inequality limits

The mathematical form of this inequality can be expressed in terms of Cayley-Menger determinants Easthope [4]:

Thus the corresponding mathematical forms of the above tetrangle inequalities are:

together with

The following pseudocode implements the procedures for triangle and tetrangle bound smoothing as described above.

Figure 5 illustrates a pair of structures and their corresponding 3D MCS; the initial coordinates are generated by Hyperchem molecular modeling package.

Figure 5. 3D MCS example, bold lines denote maximum common subgraphs

Upper and lower distance matrices for structure (a) Figure 5, before bound smoothing and after bound smoothing procedures are illustrated in Figure 6.
A significant improvement in upper and lower bound is observed after applying distance geometry bound smoothing procedure (Figure 6).

Figure 6. Upper and lower bound matrices, the upper (lower) diagonal values indicate upper (lower) bound distances, respectively

Once the set of structures to be queried has been preprocessed and stored using the upper and lower bound values calculated using the previously described distance bounding techniques, similarity searching is performed by using an MCS graph matching algorithm. The algorithm is a clique based method which computes the MCS by determining the maximum clique in the correspondence graph, this process is described in detail by Raymond [1, 2]. In the current approach the maximum clique algorithm as described in Ostergard [8] is used. The following pseudocode implements the maximum clique detection algorithm.

3 Experimental Part, Results and Discussion

All algorithms were implemented in C# programming language and require an ECMA compliant implementation of .NET framework. The program can run on any platform which supports such an implementation. Binaries are available electronically from authors free of charge. Input file format for program is Hyperchem HIN.
Validation studies were carried out on a set of 19 dopamine receptor antagonists. Dopamine receptors in the brain are important in modulating motor, endocrine, and emotional functions by Strange [5] and Waddington [6]. The antagonist affinity was measured at recombinant receptors selectively expressed in cloned cells (Figure 7).

Figure 7. Structures of dopamine receptor antagonists

The test set used here was published by Brusniak[7] it contains experimental data, log(1/Kd) where Kd is the dissociation constant of the receptor-antagonist complex (see Table 1). For this simulation, the most active compound ((R) SKF82526) was used as a query; 18 pairwise comparisons were performed, with distance tolerance value e = 0.1A. Initial coordinates are obtained by Hyperchem program, followed by a bound smoothing procedure. The 3D similarity threshold was set to 0.2 to prevent unnecessary pairwise comparisons. On a PC with 2.8 GHz Intel processor, 512 RAM, Windows XP, the computations for overall pairs comparisons took less than 2 s. The results are summarized in Table 1. It is a noticeable fact that the receptor can discriminate between stereo-isomers (R)-SKF-82526 and (S)-SKF82526, although the difference is only one carbon atom configuration; the procedure is discriminative, giving appropriate similarity index.

Table 1. Drug affinities and similarity calculation results
aquery structure
bestimate values for BA

The similarity index values were calculated using a weighing atom scheme. Thus the atoms that ensemble the active scaffold necessary for a structure to be active have a higher rank than the irrelevant atoms. All the active structures contain this scaffold, so that, after the overlapping procedure, it is easy to identify these atoms.

Figure 8. Linear dependence between Similarity scores and BA

Linear regression analysis showed good correlations between similarity score and BA. Thus an attempt to give estimative values of BA for two compounds of interest (INDOMETHACIN and LSD) was made. The predicted values (by using regression eq. in Figure 8) showed mild activity of these known antagonists. Considering the simplicity of the used model, we can draw the conclusion that the similarity scores can classify correctly the unknowns, which is the most desirable feature (see Figure 8).

Figure 9. Pharmacophore map for dopamine receptor antagonists

The obtained results reveal the important structural features, i.e., the pharmacophore, of antagonists of the dopamine receptor (see Figure 9). In the high affinity compounds, the distance between the cationic nitrogen and the m-hydroxyl oxygen ranged from 6 to 6.45 A with highest activity compound (R)-SKF-82526 having a distance of 6 A. The distance between the cationic nitrogen and the first carbon in the second benzene cycle ranged from 3.78 to 3.81 A; in the lowest activity compounds, this pharmacophore is missing.

4 Conclusions

In this paper we described an advanced method for the calculation of intermolecular structural similarity, useful in mining databases of 3D structures. This method takes full account of the conformational flexibility, being in the mean time sufficiently rapid to allow search in databases of nontrivial size. Validation studies demonstrated that the method is more accurate than fingerprint screening, allowing discrimination even between stereo-chemical isomers. It would be also possible to use the fingerprint screening procedure prior to graph matching in view of improving the overall efficiency. The method provides an effective extension to current approaches in virtual screening and lead optimization procedures.


[ 1] J. Raymond, E. Gardiner, P. Willett, Comput. J., 45, 631-644 (2002).
[ 2] J. Raymond, E. Gardiner, P. Willett, J. Chem. Inf. Comput. Sci., 42, 305-316 (2002).
[ 3] G. Crippen, T. Havel, Distance Geometry and Molecular Conformation, Research Studies Press (1988).
[ 4] P. Easthope, T. F. Havel, Bull. Math. Biol., 51, 173-194 (1989).
[ 5] P. G. Strange, Brain biochemistry and brain disorders, Oxford University Press, New York (1993).
[ 6] J. Waddington, D1:D2 Dopamine receptor interactions, Academic Press, New York (1993).
[ 7] R. Floyd, Communications of the ACM, 7, 345 (1962).
[ 8] R. J. Ostergard, Discrete Applied Math., 120, 197-207 (2002).