Return

Information obtained by comparing protein sequences is thought to be useful in studies of molecular biology. Thus, various algorithms comparing two protein sequences have been developed. Needleman and Wunsch [1] presented a dynamic programming algorithm, in which a comparison matrix of two protein sequences, consisting of similarity weights between two amino acids in the sequences, was transformed into a maximum match matrix. From this matrix, a similarity score and an alignment between the two protein sequences were obtained. Although this algorithm gives reasonable similarity scores and alignments, implementation of the algorithm requires computational times proportional to (

Murata [2] modified the algorithm of Needleman and Wunsch by introducing a maximum array for a column of interests in the original comparison matrix. Elements of the array consist of maximum values in submatrixes, which are the lower right-hand areas viewed from each point in the column. His algorithm reduces the computational times to values proportional to nm and gives the same scores and alignments obtained by Needleman and Wunsch.

If similarities between different types of amino acids were not taken into account, computation involving such similarities could be eliminated from the above algorithm. Thus, we modified the algorithm of Murata by introducing a lookup table and achieved rapid computational times. In this paper we describe our algorithm and discuss the effect of gap penalty scores on alignments of two protein sequences and the sensitivity of our similarity scores.

To obtain the similarity scores Needleman and Wunsch developed the following algorithm[1]:

for (wherei = m;i>= 1;i-- ) { for (j = n;j>= 1;j-- ) {max_d = M(i+1,j+1); for (x=m;x>=i+2 ) {max_d= max[max_d, M(x,j+1) -gamma ]; } for (y=n;y>=j+2 ) {max_d= max[max_d, M(i+1,y) -gamma ]; }M(i, j) = M(i, j)+max_d; } }

To the above algorithm, Murata introduced [2] a maximum array mu composed of

for (Implementation of this algorithm to obtain the similarity score requires computational time proportional toi = m;i>= 1;i-- ) { for (j = n;j>= 1;j-- ) {=M(i, j)+ max[M(i, j)+1,M(ij+1), mu(j+1) -gamma ]; if (j<n) { mu(j+1) = max[,M(ij+1), mu(j+1), mu(j+2)]; } } }

Comparison to obtain the elements of the maximum array at the 0-element in the original comparison matrix is unnecessary. As the number of amino acids types in protein sequences is 20, the probability of existence of any particular amino acid type in a protein sequence is 1/20. Thus, if the weights of the similarity between different types of amino acids are set to 0, the proportion of the elements with 0 to all elements of the original comparison matrix is 19/20. As we could transform the original comparison matrix by skipping those elements, the computational time for obtaining a similarity score between two sequences decreases to 1/20 of Murata's algorithm. Lookup tables are useful to carry out the above idea [3]. The lookup table of a sequence is constructed for each type of amino acid using its position in the sequence. For example, a sequence z, consisting of NLWRNWN, gives the table of

Using this lookup table, we modified Murata's algorithm as follows:

for (wherei = m;i>= 1;i-- ) {new_m= 0;count= 1;j = Tb(Aa(i),count); while (j<> NULL ) { delta(i-j) =W(Aa(i))+ max[delta(i-j), mu(j+1) -gamma ]; mu(j+1) = max[mu(j+1),new_m, delta(i-j)];before_j = j;new_m= mu(j+1);count++;j = Tb(Aa(i), count); for (k = before_j;k>j+ 1 && mu(before_j+1) > muk) ;k-- ) { mu(j) = mu(before_j+1); } } for (k = before_j;k> 0 && mu(before_j+1) > mu(k);k-- ) { mu(j) = mu(before_j+1); } }

Using the above modified algorithm, the alignment of two sequences is obtained. In the first step the next positions of pathway in each element are stored in

for (wherei = m;i>= 1;i-- ) {new_m= 0;count= 1;j = Tb(Aa(i), count); while (j<> NULL ) { delta(i-j) =W(Aa(i))+ max[delta(i-j), mu(j+1) -gamma ]; if (delta(i-j) > mu(j+1) - gamma){Pi (i, j) = dpi(i-j);Pj(i, j) = dpj(i-j); }else{Pi(i, j) = mpi(j+1);Pj(i, j) = mpj(j+1); }dpi(i-j) = i;dpj(i-j) = j; mu(j+1) = max[mu(j+1),new_m, delta(i-j)]; if (new_m> mu(j+1) &&new_m>delta(i-j)){mpi(j+1) =new_pi;mpj(j+1) =new_pj; } else { if ( delta(i-j) > mu(j+1) ) {mpi(j+1) =Pi(i, j);mpj(j+1) =Pj(i, j); } }before_j = j;new_m= mu(j+1);new_pi = mpi(j+1);new_pj = mpj(j+1);count++;j = Tb(Aa(i), count); for (k = before_j;k>j+ 1 && mu(before_j+1) > mu(k);k-- ) { mu(j) = mu(before_j+1);mpi(k) = mpi(before_j+1);mpj(k) = mpj(before_j+1); } } for (k = before_j;k> 0 && mu(before_j+1) > mu(k);k-- ) { mu(j) = mu(before_j+1);mpi(k) = mpi(before_j+1);mpj(k) = mpj(before_j+1); } }

whereip = mpi(1);jp = mpj(1);nsq = i = j= 1; while (ip - jp>nsq){sq_Aa(nsq) = Aa(i);sq_Ab(nsq)= '-';nsq++;i++; } while (lp - ip>nsq){sq_Aa(nsq)= '-';sq_Ab(nsq) = Ab(j);nsq++;j++; } while (ip>=i&&jp>j){sq_Aa(nsq) = Aa(i);sq_Abnsq) = Ab(j);nsq++;i++;j++; }ip = Pi(i-1,j-1);jp = Pj(i-1,j-1); while (m>i&&n>j){diff = (ip - i) - (jp - j);old_i = i;old_j = j; if (diff> 0) { while (old_i+diff>i) {sq_Aa(nsq) = Aa(i);sq_Ab(nsq)= '-';nsq++;i++; } } else { while (old_j+diff>j) {sq_Aa(nsq)= '-';sq_Ab(nsq) = Ab(j);nsq++;j++; } } while (ip>=i&&jp>=j){sq_Aa(nsq) = Aa(i);sq_Ab(nsq) = Ab(j);nsq++;i++;j++; }ip = Pi(i-1,j-1);jp = Pj(i-1,j-1); } while (ip>=i&&jp>=j){sq_Aa(nsq) = Aa(i);sq_Ab(nsq) = Ab(j);nsq++;i++;j++; } while (m>=i) {sq_Aa(nsq) = Aa(i);sq_Ab(nsq)= '-';nsq++;i++; } while (n>=j) {sq_Aa(nsq)= '-';sq_Ab(nsq) = Ab(j);nsq++;j++; }

**Experiments:** To investigate the advantage of our algorithm in computation, systems for calculating the similarity scores between two protein sequences based on Murata's and our algorithms were constructed. Also, the systems for aligning two protein sequences based on both algorithms were made. In those systems, the similarity weights between two types of amino acids were taken from the PAM250 matrix [4]. Those programs were written in language C and run on the SONY NEWS-5000 workstation, which uses the R4000SC CPU (66.7MHz).

For the computation times, similarity scores between one of four sequences described below and all sequences in the Protein Sequence Database Release 44.00 of the Protein Identification Resource (PIR), which contains 12,404 proteins, were calculated and the times for those calculations were measured. The sequences examined are proteins of different lengths: cytochrome *c* (CCDG), trypsinogen (TRBOTR), helix-destabilizing protein (DDRT), and pepsinogen A (PEPS), which have 104, 229, 320, and 388 amino acid residues, respectively.

For the sequence alignments, human hemoglobin alpha (HAHU) was aligned to human hemoglobin beta (HBHU) and human myoglobin (MYHU), respectively, where the gap penalty was set to 6 in Murata's algorithm and set to 6, 12, and 18 in our one. Furthermore, CCDG was aligned to three distantly related cytochromes, where the gap penalty was set to 6 in Murata's algorithm and set to 18 in ours.

As to the similarity scores, those by Murata's and our algorithm, and the optimized score of FASTA [5] for the two sequences produced from CCDG with first 300 proteins in the protein sequence database were calculated. FASTA is used widely in searching protein sequence databases. In Murata's algorithm, similarities between different types of amino acids are taken into account. The gap penalty score is set to 6 in Murata's algorithm and is set to 18 in ours. In FASTA, the ktup is set to 1. The ktup parameter determines how many consecutive amino acids are required in match. The most sensitive searches can be done using ktup = 1.

The computational times by our algorithm are about ten times shorter than those obtained using the Murata algorithm. If the occurrence of each type of amino acid is even in each sequence used in this experiment, 19/20 of the original comparison matrix would be zero. Thus, the number of elements transformed to construct the maximum match matrix would decrease to 1/20 and the computational times with our algorithm would be 1/20 of those

Table 1. Computational Times (s) for Calculating Similarity Scores by Both Algorithms

-------------------------------------------------------------------- Algorithms Proteins Lengths ------------------------------------------- Murata[2] This work -------------------------------------------------------------------- CCDG 104 353 39 TRBOTR 229 765 73 DDRT 331 1080 99 PEPS 388 1290 128 --------------------------------------------------------------------of Murata. The computation times with our algorithm obtained in this experiment are, however, only ten times shorter than those achieved by Murata. This delay may be ascribed to unbalanced distributions in the numbers of each type of amino acids. A similar effect is observed in the FASTA [5]. There is little effect on the number of times that the values of the maximum array are renewed.

These results show that the similarity scores by our algorithm are computed faster than by Murata's algorithm. The similarity scores obtained by both algorithms give the same values, if the similarity weights between different types of amino acids are set to 0 in Murata's algorithm.

The alignments obtained by Murata's and our algorithms with various gap penalties are shown in Figure 1. In smaller gap penalty scores, the alignments by our algorithm are found to include many more gaps than that of Murata. The number of gaps is affected by the magnitude of the gap penalty scores. If the score is set to 18, the alignments by our algorithm are almost equal to those of Murata. The results suggest that our algorithm can give proper alignments with the gap penalty scores of around 18. To confirm this estimation, some alignment pairs, obtained by both algorithms, were compared. The results are shown in Figure 2. The alignment pairs are almost equal and the estimation is confirmed. In these experiments Murata's similarity scores between those aligned proteins are about 1000 and it is suggested that these pairs of proteins have distantly related sequences. Thus, it is concluded that our algorithm, setting gap penalty score to 18, can give proper alignments between lowly similar proteins.

The same similarity scores and alignments could be obtained by the algorithm of sparse dynamic programming [6]. The algorithm is a modification of Wilber and Lipman's fragment alignment algorithm [7]. In the sparse dynamic programming algorithm, the array, called ACTIVE, is used to keep the comparison places to determine the maximum score pathway. The number of comparisons for obtaining similarity scores and alignments in the algorithm is similar to ours. However the sparse dynamic programming algorithm has the restriction that the weight of the gap penalty is a linear function of the difference of fragments. In the case where a constant value was used for the gap penalty score, our algorithm was advantageous.

The correlation between Murata's and our similarity scores is shown in Figure 3. Our similarity scores quickly decrease from 1350 to 300 against Murata's similarity scores in between 1350 and 900, while our scores gradually decrease from 300 to 150 against Murata's between 900 and 300. If two similarity scores have the same sensitivity the graph is a straight line descending to the left. Since similarities between different types of amino acids are not taken into account in our algorithm, our similarity scores have lower sensitivity in distantly related proteins compared to those of Murata.

The correlation between Murata's similarity scores and the FASTA's optimized scores is

Figure 1. Alignments of human hemoglobin alpha (HAHU) (under sequence) to human hemoglobin beta (HBHU) and human myoglobin (MYHU) (over sequence) by Murata's and our algorithms.Murata's algorithm with gap penalty = 6HBHUVHLTPEEKSAVTALWGKV--NVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFR -VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF-DLSH-----GSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFK LLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKYH LLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYRMYHUGLSDGEWQLVLNVWGKVEADIPGHGQEVLIRLFKGHPETLEKFDKFKHLKSEDEMKASEDLKKHGATVLTALGGILKKKGHHEAEIKPLAQSHATKHKIPVKYLEF VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF-DLS-----HGSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKL ISECIIQVLQSKHPGDFGADAQGAMNKALELFRKDMASNYKELGFQG LSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR------Our algorithm with gap penalty = 6HBHUVHLTPEEKSAVTALWGKVNVDEVG-------GEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLGAFSDGL----AHLDNLKGTFATLSELHCDK -VLSPADKTNVKAAWGKV-----GAHAGEYGAEALER-MFLSFPTTKTYFPHFDLSHGSA-----QVKGHGKKV----ADALTNAVAHVDDMPNALSALSDLHAHK LHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKYH LRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYRMYHUGLSDGEWQLVLNVWGKVEADIPGHGQEVLIRLFKGHPETLEKFDKFKHLKSEDEMKASEDL-------KKHGATVLTALGGILKKKGHHEAEIKPLAQSHATKHK- VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF-------------DLSHGSAQVKGHGKKVADAL--TNAVAHVDDMPNALSALSDLHAHKL --IPV--KYLEFISECIIQVLQSKHPGDFGADAQGAMNKALELFRKDMASNYKELGFQG- RVDPVNFKLL---SHCLLVTLAAHLPAEF-------TPAVHASLDKFLASVSTVLTSKYROur algorithm with gap penalty = 12HBHUVHLTPEEKSAVTALWGKV--NVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFR -VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF-DLS-----HGSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFK LLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKYH LLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYRMYHUGLSDGEWQLVLNVWGKVEADIPGHGQEVLIRLFKGHPETLEKFDKFKHLKSEDEMKASEDLKKHGATVLTALGGILKKKGHH----EAEIKPLAQSHATKHKIPVK VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF------DLSHGSAQVKGHGKKVADAL----TNAVAHVDDMPNALSALSDLHA--HKLRVD YLEF--ISECIIQVLQSKHPGDFGADAQGAMNKALELFRKDMASNYKELGFQG PVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR------Our algorithm with gap penalty = 18HBHUVHLTPEEKSAVTALWGKV--NVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFR -VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF-DLS-----HGSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFK LLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKYH LLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYRMYHUGLSDGEWQLVLNVWGKVEADIPGHGQEVLIRLFKGHPETLEKFDKFKHLKSEDEMKASEDLKKHGATVLTALGGILKKKGHHEAEIKPLAQSHATKHKIPVKYLEF VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF------DLSHGSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHA--HKLRVDPVNF --ISECIIQVLQSKHPGDFGADAQGAMNKALELFRKDMASNYKELGFQG KLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR------

Figure 2. Alignments of cytochromeCCRF2GMurata's algorithmGSAPPGDPVEGKHLFHTICILCHTDIKG-RNKVGPSLYGVVGRHSGIEPGYNYSEANIKSGIVWTPDVLFKYIEHPQKIVPGTKMGYPGQPDPQKRADIIAYLETL -----GDVEKGKKIFVQKCAQCHTVEKGGKHKTGPNLHGLFGRKTGQAPGFSYTDANKNKGITWGEETLMEYLENPKKYIPGTKMIFAGIKKTGERADLIAYLKKA K-- TKEOur algorithmGSAPPGDPVEGKHLFHTICILCHTDIKG-RNKVGPSLYGVVGRHSGIEPGYNYSEANIKSGIVWTPDVLFKYIEHPQKIVPGTKMGYPGQPDPQKRADIIAYLETL -----GDVEKGKKIFVQKCAQCHTVEKGGKHKTGPNLHGLFGRKTGQAPGFSYTDANKNKGITWGEETLMEYLENPKKYIPGTKMIFAGIKKTGERADLIAYL--- K----- KKATKECCQF2FMurata's algorithmADAPTA---FNQ-CKACHSIE-AGKNGVGPSLSGAYGRKVGLAPNYKYSAAHLASGMTIDEAMLTNYLANPKATIPGNKMGASFGGLKKPEDVKAVIEYLKTVK-- GDVEKGKKIFVQKCAQCHTVEKGGKHKTGPNLHGLFGRKTGQAPGFSYTDANKNKGITWGEETLMEYLENPKKYIPGTKM--IFAGIKKTGERADLIAYLKKATKEOur algorithm---ADAPTAFNQCKA-CHSIE-AGKNGVGPSLSGAYGRKVGLAPNYKYSAAHLASGMTIDEAMLTNYLANPKATIPGNKMGASFGGLKKPEDVKAVIEYLKTVK-- GDVEKGKKIFVQKCAQCHTVEKGGKHKTGPNLHGLFGRKTGQAPGFSYTDANKNKGITWGEETLMEYLENPKKYIPGTKM--IFAGIKKTGERADLIAYLKKATKECCAGC6Murata's algorithmAGEVEKREGMMKQIGGSMGALAAISKGQKPYDAEAVKAAVTTISTNAKAFPDQFPPGSETGSAAAPAIW--ENFDDFKSKAAKLGADADKVLASLPADQAGVTAAM -GDVEK--GK-KIFVQKCAQCHTVEKGGKHKTGPNLHGLFGRKTGQAPGFSYT-DANKNKGIT-----WGEETLMEYLENPKKYIPGTKMIFAGI--KKTGERADL QTLGADCGACHQTYRLKK IAYLKKAT--------KEOur algorithmAGEVEKREGMMKQIGGSMGALAAISKGQKPYDAEAVKAAVTTISTNAKAFP-DQFPPGSETGSAAAPAIWENFDDFKSKAAKLGADADKVLASLPADQAGVTAAMQ -GDVEK--------------------GKKIFVQKCAQCHTVEKGGKHKTGPNLHGLFGRKTGQA--P-GFSYTDANKNKGITWGEETLMEYLENPKKYIPGTKMIF TLGADCG--ACHQTYRLKK---- AGIKKTGERADLIAY-LKKATKE

shown in Figure 4. The FASTA's optimized scores also quickly decrease from 550 to 25 against Murata's scores in between 1350 and 1000. On the contrary, the FASTA's optimized scores do not change to Murata's scores in between 1000 and 300. It is shown that the FASTA's optimized scores have no sensitivity in distantly related proteins.

The above results indicate that the similarity scores by our algorithm are more sensitive than the FASTA's optimized scores in searching distantly related proteins. The computation times by FASTA are two to three times shorter than those by our algorithm. Thus, it is concluded that our system is slower than FASTA in computation times but our scores are more sensitive than those by FASTA.

Our algorithm may be useful for obtaining multiple alignments. The amount of calculation

Figure 3. Correlation between Murata's and our similarity scores.

Figure 4. Correlation between Murata's similarity scores and the FASTA's optimezed scores.

for obtaining a multiple alignment increases exponentially with the number of sequences [8]. As our algorithm works ten times faster than that of Murata, the calculations for obtaining the multiple alignments of three protein sequences with our algorithm are about 1/100 of those of Murata. In our algorithm, the similarity between different types of amino acids is omitted, but this has less influence on obtaining alignments, because the examination of multiple alignments is usually carried out only among similar proteins, which have many amino acid pairs of the same type.

2) M. Murata, Comput. Chem., 12, 21 (1988) .

3) J. P. Dumas and J. Ninio, Nucleic Acids Res., 10, 197 (1982).

4) M. O. Dayhoff, R. M. Schwartz, and B. C. Orcutt, in M. O. Dayhoff (ed.), "Atlas of protein sequence and structure," Vol 5 Suppl 3, National Biomed. Res. Foun., Silver Spring, Maryland (1978).

5) W. R. Pearson, Methods enzymol., 183, 63 (1990) .

6) D. Eppstein, Z. Galil, R. Giancarlo, and G.F.Italiano, J. ACM, 39, 519 (1992).

7) W. J. Wilbur and D. J. Lipman, Proc. Natl. Acad. Sci. USA, 80, 726 (1983).

8) M. Murata, J. S. Richardson, and J.L.Sussman, Proc. Natl. Acad. Sci. USA , 82, 3073 (1985).

Return