Protein Sequence Identity Comparison Essay

Abstract

In recent years improvements to existing programs and the introduction of new iterative algorithms have changed the state-of-the-art in protein sequence alignment. This paper presents the first systematic study of the most commonly used alignment programs using BAliBASE benchmark alignments as test cases. Even below the ‘twilight zone’ at 10–20% residue identity, the best programs were capable of correctly aligning on average 47% of the residues. We show that iterative algorithms often offer improved alignment accuracy though at the expense of computation time. A notable exception was the effect of introducing a single divergent sequence into a set of closely related sequences, causing the iteration to diverge away from the best alignment. Global alignment programs generally performed better than local methods, except in the presence of large N/C-terminal extensions and internal insertions. In these cases, a local algorithm was more successful in identifying the most conserved motifs. This study enables us to propose appropriate alignment strategies, depending on the nature of a particular set of sequences. The employment of more than one program based on different alignment techniques should significantly improve the quality of automatic protein sequence alignment methods. The results also indicate guidelines for improvement of alignment algorithms.

Introduction

The multiple alignment of protein sequences has become an essential tool in molecular biology. It has traditionally been used to find characteristic motifs and conserved regions in protein families, in the determination of evolutionary linkage and in the improved prediction of secondary and tertiary structure. With the rapid increase in the number of protein sequences, notably from the genome sequencing projects, automatic methods of searching protein databases for homologous sequences ( 1 , 2 ), followed by the multiple alignment of the top scoring hits ( 3–6 ) are becoming standard practice. These automatic systems frequently involve the alignment of large numbers of sequences, of very divergent sequences and of multi-domain proteins often with large N/C-terminal extensions or internal insertions. Moreover, with the available sequenced genomes, the alignment of single divergent sequences (typically of eukaryotic origin) with a large closely related group (typically of prokaryotic origin) is now commonplace. The development of accurate, reliable multiple alignment programs capable of handling these divergent sets of data is therefore of major importance. Although a dynamic programming algorithm ( 7 ) exists which guarantees a mathematically optimal alignment, the method is limited to a small number of short sequences since the computing power required for larger alignments becomes too prohibitive. To overcome this problem, various heuristic approaches have been developed leading to a huge quantity of programs using fundamentally different strategies (progressive, iterative, mixed, etc.) based on very different algorithms. Figure 1 shows some of the most commonly used programs today, together with examples of the main algorithms that have been developed recently. Traditionally the most popular approach has been the progressive alignment method ( 8 ). A multiple alignment is built up gradually by aligning the closest sequences first and successively adding in the more distant ones. A number of alignment programs based on this method exist, for example MULTALIGN ( 9 ), MULTAL ( 10 ), PILEUP (Wisconsin Package v.8; Genetics Computer Group, Madison, WI) and CLUSTALX ( 11 ), which provides a graphical interface for CLUSTALW ( 12 ). They use a global alignment algorithm ( 13 ) to construct an alignment of the entire length of the sequences. They differ mainly in the method used to determine the order of alignment of the sequences. MULTAL uses a sequential branching method to align the two closest sequences first and then subsequently align the next closest sequence to those already aligned. MULTALIGN and PILEUP construct a guide tree using the UPGMA method ( 14 ). A consensus method is then used to align larger and larger groups of sequences according to the branching order of the tree. CLUSTALX uses the alternative Neighbour-Joining algorithm ( 15 ) to construct a guide tree, incorporating in addition sequence weighting, position-specific gap penalties and a choice of residue comparison matrix depending on the degree of identity of the sequences. In contrast to the above global methods, PIMA ( 16 ) uses a local dynamic programming algorithm ( 17 ) to align only the most conserved motifs. PIMA offers two alignments by default using maximum linkage and sequential branching algorithms to decide the order of alignment, which we will refer to as MLPIMA and SBPIMA, respectively.

In addition, numerous new alignment algorithms have recently been developed which offer fresh approaches to the multiple alignment problem. A common point of interest has been the application of iterative strategies to refine and improve the initial alignment. A local alignment approach is implemented in the DIALIGN program ( 18 ) to construct multiple alignments based on segment-to-segment comparisons rather than the residue-to-residue comparisons used previously. The segments are incorporated into a multiple alignment using an iterative procedure. The PRRP program ( 19 ) optimises a progressive, global alignment by iteratively dividing the sequences into two groups, which are subsequently realigned using a global group-to-group alignment algorithm. SAGA ( 20 ) uses a genetic algorithm to select from an evolving population the alignment which optimises the COFFEE Objective Function (OF) ( 21 ). The OF is a measure of the consistency between the multiple alignment and a library of CLUSTALW pairwise alignments. Hidden Markov models (HMMs) have also been used as statistical models of the primary structure consensus of a sequence family ( 22 , 23 ). The program HMMT ( 24 ) uses a simulated annealing method to maximise the probability that an HMM represents the sequences to be aligned.

In spite of this wide variety of alignment programs, there are few comparisons available of their relative performance and reliability. Twelve different global and local progressive alignment programs were compared ( 25 ) using alignments of four different protein domains as test cases. In general, the global methods performed better than local methods in the tests, but the performance of all the programs was affected by the number of sequences, the degree of identity of the sequences and the number of insertions/deletions in the alignment. Seven multiple alignment Web servers covering various global and local methods have been compared ( 26 ) to evaluate their ability to identify the reliable regions in an alignment. However, no comprehensive study and comparison of the numerous new alignment algorithms exists. The lack of a standard set of reference alignments has meant that existing programs could not be benchmarked and the increase in performance realised by the new iterative alignment methods could not be accurately measured. A benchmark alignment database called BAliBASE ( 27 ) has recently been developed specifically for this purpose. The 142 validated test alignments of real proteins based on three-dimensional superimpositions are organised into reference sets which represent some of the most common problems currently encountered when aligning real families of proteins. Core blocks in each alignment define those regions that can be reliably aligned. BAliBASE is available on the World Wide Web at http://www-igbmc.u-strasbg.fr/BioInfo/BAliBASE

In this paper, we present a systematic analysis and comparison of the main alignment programs currently in use ( Fig. 1 ), using the BAliBASE reference alignments as test cases. A comparison of different scoring methods has highlighted the importance of the non-superimposable regions in the evaluation of a program. We show that the ‘twilight zone’ still exists as a real barrier for all the programs in this study, but that some alignment is possible below the twilight zone. The strong and weak points of the programs are highlighted, in particular the effect on alignment accuracy of different criteria such as the sequence length, the degree of identity of the sequences, their re-partition into subfamilies and the presence of large N/C-terminal extensions and internal insertions. This has enabled us to define possible strategies for improving the programs and guidelines for optimising alignments.

Materials and Methods

All the programs were installed on a DEC Alpha 6100 computer running OSF Unix and each program was tested using default parameters (with the exception of the PRRP-b option, which indicates that the input sequences have not been pre-aligned). We assume that the parameters chosen by the authors have been selected to give a near optimal alignment under normal conditions and, therefore, for the purposes of this study no optimisation of parameters such as residue weight matrix and gap penalties was performed. The test alignments produced here provide a reference which will be used as a basis for further study of optimum parameters (work in progress).

Reference alignments

In order to evaluate and compare the 10 alignment programs selected for this study, we needed objective criteria to assess the quality of an alignment. The BAliBASE benchmark alignment database contains 142 reference alignments, divided into five hierarchical reference sets each containing at least 12 representative alignments ( Table 1 ). The alignments of sequences sharing the same three-dimensional fold have been validated to ensure the alignment of functional and other conserved residues. Core blocks are defined for each alignment as being the regions that can be reliably aligned. The core blocks (representing 58% of the residues in the alignments) specifically exclude ambiguous or non-superimposable three-dimensional regions such as distinct secondary structures, unrelated secondary structure borders or structurally unreliable loop regions.

Reference 1 alignments consist of a small number of equidistant sequences of similar length, i.e. the per cent residue identity (% ID) between any two sequences is within a specified range and no large extensions or insertions have been introduced.

Reference 2 contains alignments of a family (closely related sequences with >25% ID), plus up to three ‘orphan’ sequences (distant members of the family with <20% ID, sharing a common fold). It is designed to evaluate program accuracy according to two criteria: (i) the stability of the family alignment when orphans are introduced into the sequence set; (ii) the quality of the alignment of the orphan sequences. The program MULTAL has been removed from this test since it frequently excludes the divergent orphans as unrelated or unalignable sequences.

Reference 3 demonstrates the ability of the programs to correctly align equidistant divergent families into a single alignment. The reference alignments consist of up to four families, with <25% ID between any two sequences from different families. MULTAL is not included in reference 3 (see explanation in reference 2).

References 4 and 5 contain sequences with large N/C-terminal extensions or internal insertions, respectively. In order to evaluate a program's ability to identify the presence of the insertions, the core blocks in BAliBASE define only the most conserved motifs flanking the extension/insertion. These tests are not designed to judge the overall quality of an alignment. MULTAL is not included in these tests (see explanation in reference 2). HMMT is also excluded because many of the alignments contain only a small number of sequences.

Alignment scores

To assess the performance of the programs in this study, we calculate two different scores which estimate the quality of an alignment compared to the BAliBASE reference. The sum-of-pairs score (SPS) is calculated such that the score increases with the number of sequences correctly aligned. It is used to determine the extent to which the programs succeed in aligning some, if not all, of the sequences in an alignment. The column score (CS) is a binary score which tests the ability of the programs to align ALL of the sequences correctly.

Sum-of-pairs score . Suppose we have a test alignment of N sequences consisting of M columns. We can designate the ith column in the alignment by A i1 , A i2 , …, A iN . For each pair of residues A ij and A ik we define p ijk such that p ijk = 1 if residues A ij and A jk are aligned with each other in the reference alignment, otherwise p ijk = 0. The score S i for the ith column is defined as:
The SPS for the alignment is then: where Mr is the number of columns in the reference alignment and S ri is the score S i for the ith column in the reference alignment.
Column score . For the ith column in the alignment described above, the score C i = 1 if all the residues in the column are aligned in the reference alignment, otherwise, C i = 0. The CS for the alignment is then:

For each reference test we have selected the most suitable scoring function according to the nature of the test and the particular question posed. The two scoring systems have been implemented in the program BaliScore, which takes as input a reference alignment and a test alignment in MSF format, plus an optional BAliBASE annotation file describing the core blocks in the reference alignment. The output includes the two scores described above, plus an optional representation of the scores for each column in the test alignment. BAliScore is available by ftp from ftp-igbmc.u-strasbg.fr/pub/BAliBASE/ BAliScore

Statistical methods

In each reference, BAliBASE provides a number of representative alignments that were used as a sample in statistical analyses. For each reference alignment we calculate a score estimating the accuracy of the alignment produced by every program. Since the distribution of scores is expected to be neither normal nor symmetric, we use the median as a measure of location and the interquartile range as a measure of dispersion. The first and third quartiles give an idea of the shape of the distribution.

Friedman tests ( 28 ) were used to assess whether or not there is a systematic pattern in the way programs are ranked by score for every alignment, i.e. whether or not some programs significantly tend to perform better than others across reference alignments.

Wilcoxon signed rank tests ( 29 ) were used to determine whether a change in the conditions of an alignment, such as the addition of orphans (reference 2) or an increase in the number of family members (reference 3), leads to a significant difference between paired scores. The scores for HMMT are not included in the Wilcoxon signed rank tests, because different alignments may be obtained for the same input sequences each time the program is executed. Therefore, the difference in the scores obtained under different alignment conditions cannot be reliably compared.

Results

Reference 1: a small number of approximately equidistant sequences

This test is designed to study the effect of sequence length and % ID on alignment program performance and provides a basis for the remaining tests. The importance of the ambiguous or non-superimposable regions in the evaluation of alignment program performance has been studied by comparing alignment scores based only on the core blocks defined in BAliBASE with scores obtained over the full-length sequences. The ambiguous regions represent 42% of the residues in BAliBASE and account for a mean 32, 22 and 11% of the full-length scores calculated in categories V1, V2 and V3, respectively. Obviously, some discrepencies in the program evaluation may arise using either of the scoring methods. Here we will present the results of this study using the core block scores, unless a comparison of the two scores sheds light on interesting results.

How do percent identity and sequence length affect program performance?Figure 2a shows the median core block scores obtained in the nine variability/length categories in reference 1. A decrease in accuracy of the alignments with decreasing identity is clearly demonstrated, with the greatest loss occurring in category V1 (<25% ID), which corresponds to the so-called ‘twilight zone’ of evolutionary relatedness. Nevertheless, some alignment is still possible below the twilight zone. The best alignment in V1 was achieved by PRRP, with 72% of the total residues correctly aligned. The highest scoring programs, PRRP, CLUSTALX and SAGA, correctly align on average 61% of the residues (or 42% of the columns) in the core blocks and 47% of the total residues (or 26% of the total columns) in V1. In contrast, between 20 and 40% identity, 92% of the residues (or 87% of the columns) in the core blocks and 82% (or 72% of the total columns) of the total residues are successfully aligned by these three programs. Figure 2b shows a plot of the median core block scores obtained by each of the 10 programs in identity ranges V1, V2 and V3. It can be seen that loss of accuracy with decreasing sequence identity is exhibited by all the programs in this study. The greatest difference in program scores is always observed in category V1. According to a Friedman test used to compare the performance of the alignment programs ( Fig. 2c ), PRRP ranks significantly higher (α = 0.05) than the other programs, CLUSTALX and SAGA rank second highest, with the global alignment programs generally performing better than the local methods.

Figure 2

( a ) SPS for reference 1, showing the median score in each category, ( b ) Median SPS for the programs in reference 1, categories V1, V2 and V3. Programs are displayed in the order of the Friedman test, with the highest scoring program on the left, ( c ) Results of the Friedman rank test to compare the performance of the programs in reference 1 (S = 9, N = 81, test statistic = 106.9). For each test alignment, the programs are assigned a rank between 1 and 10 (with 1 indicating the highest scoring program). The ranks are then summed over all alignments. Thus, a lower rank sum indicates that a program tends to achieve higher scores. The programs are listed in rank sum order. The grey boxes indicate that the two corresponding programs cannot be differentiated using the Friedman test (α = 5%). L , local alignment program; I iterative alignment program.

Figure 2

( a ) SPS for reference 1, showing the median score in each category, ( b ) Median SPS for the programs in reference 1, categories V1, V2 and V3. Programs are displayed in the order of the Friedman test, with the highest scoring program on the left, ( c ) Results of the Friedman rank test to compare the performance of the programs in reference 1 (S = 9, N = 81, test statistic = 106.9). For each test alignment, the programs are assigned a rank between 1 and 10 (with 1 indicating the highest scoring program). The ranks are then summed over all alignments. Thus, a lower rank sum indicates that a program tends to achieve higher scores. The programs are listed in rank sum order. The grey boxes indicate that the two corresponding programs cannot be differentiated using the Friedman test (α = 5%). L , local alignment program; I iterative alignment program.

Figure 3a shows the median core block scores for the length categories short, medium and long in V1. It is important to note that for long sequences the local programs achieve a similar quality of alignment to the global ones, with the exception only of PRRP, which ranks significantly higher (α = 0.05) than the other programs in a Friedman test. In general, the core blocks in medium and long sequences are aligned better than in short ones by all the programs except CLUSTALX. However, an analysis of the full-length scores ( Fig. 3b ) reveals: (i) a general decrease in full-length scores compared to core block scores; (ii) an inversion of order observed above for global programs with the scores now decreasing with increasing sequence length; (iii) in contrast, the scores for the local programs maintain the same order as before, with the scores increasing with increasing sequence length.

Dataset and template library

All the sequence alignment programs are benchmarked on the same set of 538 non-redundant proteins randomly collected from the PDB library3. These proteins have a pair-wise sequence identity less than 30% and length ranging from 34 to 804 residues. Proteins with broken chains or missing residues were not included. The sequences were divided into three categories: Easy, Medium and Hard targets, based on the consensus confidence score of the meta-threading LOMETS program44, which consists of 9 protein threading programs (dPPAS, MUSTER, HHsearch-I, HHsearch-II, PPAS, PROSPECT, SAM, SPARKS and SP3). A target is defined as Easy if at least one strong template hit can be detected for the target by each program with the Z-score higher than the confidence cutoff; a target is defined as Hard if none of the threading programs has a strong template hit; otherwise, it is considered a Medium target. In total, the 538 proteins are selected to include a balanced category distribution of difficulty with 137 Easy, 177 Medium, and 224 Hard targets. Here, we have put more focus on the challenging targets by arbitrarily increasing the number of Medium and Hard proteins in our benchmark protein set, although a naturally collected sample from the PDB would have a much lower portion of Medium/Hard cases. A list of all the 538 proteins, together with the classification, are provided in http://zhanglab.ccmb.med.umich.edu/publicdata/benchmark1/protein_types.txt.

The existence of template structures in the library is a precondition for template identification. To eliminate potential bias of the alignment algorithms from the template structure library, we constructed the libraries of all threading programs using the same sequence identity cutoff updated to the same time stamp (by Jan, 2013). In fact, the template libraries for NW-align, SW-align, BLAST, PSI-BLAST, PSA, PPA, PPAS, dPPAS, MUSTER, SAM, PRC, PROSPECT, SPARKS, SP3 and FFAS are generated from the same set of non-redundant PDB proteins with a pair-wise sequence identify cutoff 70% (see http://zhanglab.ccmb.med.umich.edu/library/). The libraries for HHsearch-I and HHsearch-II are downloaded from ftp://toolkit.lmb.uni-muenchen.de, which has also a sequence identity cutoff of 70%. The size of these two libraries is about the same. The programs of all the tested methods are described in METHODS.

Summary of performance by individual alignment methods

Table 1 presents a summary of the 3D structural models, which are built by copying the framework of the highest ranked and the best in the top ten scoring templates based on the alignments generated by different alignment programs. The quality of alignments is generally measured by the root-mean-square deviation (RMSD) of the models (Columns 4–5), where BLAST, PSI-BLAST, PRC, FFAS and HH- search programs have the lowest RMSD to the targets (~7–9 Å). However, the alignments by these programs tend to have a smaller number of residues aligned (i.e. lower alignment coverage, Columns 6–7), typically below 80%. Such short alignments can have a negative impact on the full-length structure construction by homology modeling since structure information is missed for a large portion of unaligned sequences. In fact, the full-length models by MODELLER45 have a very high RMSD (>20 Å) for all these local alignment methods (see values in parentheses). Here, the full-length models ware generated by the script model-default.py in the MODELLER package. The modeling results from MODELLER are deterministic in the sense that more runs do not change the quality result of the final models.

In Columns 2–3, we also list the result of the alignment models on TM-score, which is defined to combine the alignment accuracy and coverage as a unique score46: where L is the length of target sequence, Lali the number of aligned residues, and di the pair-wise distance of ith residue in the model and target after the optimal superposition. In this scoring function, the programs, which have a better balance of alignment coverage and RMSD, excel, including MUSTER, HHsearch and PPAS programs. The simple sequence-to-sequence based alignment algorithms generally have a lower TM-score.

Meanwhile, the local-to-local alignment based algorithms generally have a lower coverage and TM-score compared to the global-to-global alignment methods. A typical example is SW-align based on Smith-Waterman, which only identifies the highly conserved regions and has on average 56% residues aligned, while Needleman-Wunsch based NW-align uses the same parameter and scoring function but generates alignments with a much higher coverage (84.9%). Accordingly, the TM-score of NW-align is 21.1% higher than that of SW-align. The completeness of alignment searching also plays a role in final model determination. For instance, both BLAST and SW-align are local sequence alignments based on BLOSUM62 mutation scores. But BLAST searches are based on an incomplete heuristic word search algorithm, which has an average TM-score 7% lower than SW-align. BLAST is however 39 times faster than SW-align in our test.

Although TM-score aims to balance the accuracy and coverage of alignments, it still favors algorithms that have a higher coverage, since including additional residues in the alignments always has a positive contribution to TM-score according to Eq. 1, although the contribution is small if the added residues from templates are far away from the target. To examine the effect of such bias, we constructed full-length models of the targets based on the alignments, using the widely-used comparative modeling tool MODELLER45. Although TM-score is now normalized by the same length of the target sequence, the TM-score ranking of full-length models is largely consistent with that of the original alignments, except for some small but notable variations. Taking the top hits as an example, the original alignments by HHsearch-I have a lower TM-score than those by dPPAS (0.422 vs. 0.426) due to the low coverage of the sequences (76.3% vs. 81.9%). After full-length modeling, the TM-score of HHsearch-I becomes higher than that of dPPAS (0.444 vs. 0.438) and several other related algorithms (e.g. PPAS and SP3). Here, the more precise alignments by HHsearch-I in the aligned regions have probably introduced some restraint/guidance to the structure modeling of the unaligned regions, e.g. through bond-length and change connectivity, which have resulted in models of a higher overall TM-score.

Performance of sequence alignment programs in different target categories

The performance of different alignment programs varies with the difficulty of the targets, i.e. the evolutionary distance between target and template proteins. If we use the target structure as a probe to search through the PDB library by TM-align47, the average TM-scores of aligned regions of the best structural templates in the three categories of Easy, Medium and Hard are 0.779, 0.666 and 0.586, respectively, after excluding homologous templates with a sequence identity > 30%. This data on one hand sets up an upper-bar for template identifications by fold-recognition; on the other hand, it demonstrates that the target category as defined by the LOMETS prediction is largely consistent with the actual difficulty of the template identification for the targets.

In supplementary Tables S1, S2, and S3, we summarize the results of different programs on the Easy, Medium and Hard targets, respectively. Figure 1 is the histogram of the average TM-score achieved by different programs. As shown in Table S1, HHsearch programs generate the highest TM-score in the Easy targets. MUSTER and other structure-assisted alignment methods (dPPAS, SP3 etc) generally outperform the HHsearch programs in the Medium and Hard targets. This data demonstrates the usefulness of structure-based features in detecting the distant homologous templates.

Specificity of alignment programs

Except for the accuracy of the template alignments (or sensitivity), the specificity of the alignments (i.e. the correlation of the scoring function and the accuracy of the final alignments) is another important measurement of the alignment algorithms, as this correlation essentially decides how the results can be used in the comparative structural modeling and function annotations.

In Figure 2, we present the TM-score data of the highest ranked alignment models versus the alignment scores by the 20 alignment programs. Here, we tried both the default alignment score of the programs and the Z-score (defined as the difference between the raw alignment score and average in units of standard deviation), and chose the one with the highest correlation to the TM-score of the final models to present in the plot. As expected, positive correlations are observed for all the alignment programs, with PPAS, SAM and MUSTER having the highest Pearson correlation coefficients (0.789, 0.787, and 0.782, respectively). The NW-align and BLAST programs have the lowest correlation coefficient because a number of targets have a high alignment score but with low quality (TM-score < 0.5), indicating a low specificity of the programs.

We also mark in Figure 2 an alignment score cut-off that minimizes the false positive rate, FPR = FP/(FP + TN), and the false negative rate, FNR = FN/(TP + FN), where a model of TM-score > 0.5 is defined as a positive hit that has the correct fold48. The score cut-offs, false positive and false negative rates are listed in Table 2. The programs with alignment score that are calibrated by the statistics of random samples, including PSI-BLAST, SAM and FFAS, have the lowest FPR + FNR values, i.e. the highest specificity, based on this measurement. Meanwhile, the Easy and Hard targets are clearly grouped in the right-up and left-bottom regions in Figure 2 for all programs, demonstrating the dependence of the performance of the alignment algorithms on the evolutionary distance of target and templates.

Profile-based alignments versus sequence-based alignments

Depending on whether the homologous sequences are included in the target-template alignments, the alignment methods can be grouped into the three general categories of sequence-to-sequence alignment (including NW-align, SW-align and BLAST), sequence-to-profile alignment (PSI-BLAST, SAM and PSA), and profile-to-profile alignment (PRC, HHsearch-I, HHsearch-II, PPA, PPAS, dPPAS, MUSTER, PROSPECT, SPARKS, SP3 and FFAS). Since the sequence profiles derived from multiple sequence alignment of protein families contain important information of conserved/diverged locations along the sequences, the profile-based alignments can generally generate more accurate target-template alignments than that made by single sequence-based alignments16,49.

Such insight is also observed in our data analysis. As shown in Table 1 (rows highlighted in bold), the average TM-score obtained by the sequence-to-profile based methods is 18.4% higher than the TM-score from the sequence-to-sequence based methods. Similarly, the TM-score from profile-to-profile alignment methods is 49.8% higher than that of sequence-to-sequence based methods. These increases in TM-score are not only due to the higher coverage of alignments (81.8% vs. 62.6%), but also the enhanced accuracy of alignments as the average RMSD is reduced slightly in the profile-profile methods from 10.4 to 10.2 Å. This tendency is also seen in Tables S1,S2,S3 where the targets were categorized into different groups of Easy, Medium, and Hard, demonstrating that the profile-based alignments enhance both close and distant homology identifications.

Two types of sequence profiles, PSSM and HMM, are often exploited in various alignment methods. Given a MSA, the PSSM profile is designed to account for the estimated frequency of amino acids at each position, while the HMM profile accounts for both amino acid frequency and position-specific probabilities for insertion and deletion. Although the HMMs seem to contain additional gap information from MSAs, there is no obvious difference between the HMM-based (e.g. HHsearch-I and –II) and PSSM-based alignment algorithms (e.g. PPAS), in terms of the TM-score of the alignment models (Table 1). However, HMM-based methods did generate higher TM-scores than PSSM-based methods in the Easy targets (Table S1). Additionally, the HMM-based methods have generally a lower RMSD and lower coverage of alignments, indicating that the HMM method is more sensitive in detecting local structural motifs and scaffolds.

Meanwhile, there are a number of targets that have the correct templates identified by either HMM- or PSSM-based methods (but not both), demonstrating that these two types of methods are complementary to each other. This complementarity from multiple alignment algorithms is essential to the success of meta-server based structure modeling approaches44,50.

How much space is left for improvement by structural feature prediction?

The performance of profile alignments could be further improved by incorporating structural information. One example is secondary structure comparison, which has been used by almost all contemporary alignment/threading methods to guide the target-template alignments. As a quantitative test of the impact of secondary structure information on alignment accuracy, we developed two sequence profile-profile based methods, PPA and PPAS, where the only difference is that PPAS contains a secondary structure match in the scoring function but PPA doesn't (see Eqs. 3 and 4 in METHODS). As a result, the inclusion of secondary structure prediction increases the TM-score of PPA by 6.8%. MUSTER is another typical profile-profile alignment based algorithm that incorporates multiple composite structure features in the alignments, including secondary structure, residue depth, solvent accessibility, and backbone torsion angle predictions. These features result in a TM-score increase of 9.6% compared to the PPA method, corresponding to a p-value < 10−14 in paired student t-test.

The performance of the structure feature assisted algorithms relies on the accuracy of structure feature predictions for the target sequence. In our test on the 538 non-homologous proteins, the average Q3 score (three structure states per residue overall accuracy) for PSSpred and PSI-pred is 83.1% and 80.7%, respectively; the mean absolute errors in ψ and φ angle predictions are 28° and 41°, respectively; and the Pearson correlation coefficient correlation between predicted and actual solvent accessibility is 0.678. Incorrect assignments of the structure features can compromise the performance of MUSTER. In fact, we observed a number of cases where the TM-score of the alignments by MUSTER, which considers additional structural features, is lower than that of PPAS.

In order to explore the potential of the alignment improvement obtained by considering structural features, we implemented MUSTER using the native structure features derived from the target structures, where the weighting parameters were re-optimized in a separate test of 100 proteins. As shown in Table 1, the average TM-score of the full-length models from MUSTER alignments showed a gradual increase from 0.449 to 0.511, when we exploited more native structure features from secondary structures (MUSTERSS), backbone torsion angle (MUSTERSS+BTA), and solvent accessibility (MUSTERSS+BTA+SA). This change corresponds to an overall increase of 13.8% in the average TM-score.

In Figure 3, we present an illustrative example from the PP7 bacteriophage coat protein (PDB ID: 2qudA), which has the secondary structures arranged as β1-β2-β3-β4-β5-β6-α1-α2 from the N- to C-terminals. The PSI-pred method has however mis-assigned most secondary structure elements in 12S-90M (see ‘*’ in Figure 3A), which resulted in the first two beta-strands (7T-11S and 15A-25T) in the target mis-aligned to the coiled regions in the template 1qbeB by MUSTER. This alignment has a TM-score = 0.607. When using the actual secondary structure assignment, the MUSTERSS program correctly matches the two beta strands of the target with the strands on the template. Based on the same template, the correction of the secondary structure comparison increases the TM-score of the model to 0.7 in this example.

Despite the significant increase in alignment accuracy brought by the integration of structure features, the quality of the alignments using the best structure features from the native is still far from the best templates detected by structural alignments, i.e. the average TM-score by TM-align is 37.1% higher than that by MUSTERSS + BTA + SA (Table 1), which demonstrates considerable room for further alignment improvement. The gap is relatively small in Easy targets (7.4%) according to the data in Table S1, which indicates that the current state-of-the-art alignment methods generate nearly optimal alignments for close homology targets. But for the Medium and Hard targets, the gaps become highly significant, which correspond to a TM-score difference of 35.6% and 79.2%, respectively (Tables S2 and S3). Apparently, such gaps cannot be filled by solely improving the structure feature prediction methods, and a completely different alignment system based on novel scoring and alignment schemes might be required.

0 Replies to “Protein Sequence Identity Comparison Essay”

Lascia un Commento

L'indirizzo email non verrà pubblicato. I campi obbligatori sono contrassegnati *