 
- 积分
- 773
- 威望
- 773
- 包包
- 1251
|

http://cshprotocols.cshlp.org/cgi/content/full/2009/7/pdb.top44/ X) L: b4 M9 u
Using Iterative Methods for Global Multiple Sequence Alignment
! }. G9 S1 T+ R- a- hDavid W. Mount
/ H% q4 f# _1 d8 B& Z* uAdapted from Bioinformatics: Sequence and Genome Analysis, 2nd edition, by David W. Mount. CSHL Press, Cold Spring Harbor, NY, USA, 2004.
! B K: _8 r; U# q+ t6 Z0 f2 W4 m" h
1 C' ~9 R. [% h5 Q8 m
INTRODUCTION
$ D* T6 b8 P2 h3 j+ Y. c% b/ I( t! i- J( i2 _
Finding a global optimal alignment of more than two sequences that includes matches, mismatches, and gaps and that takes into account the degree of variation in all of the sequences at the same time is especially difficult. The dynamic programming algorithm used for optimal alignment of pairs of sequences can be extended to global alignment of three sequences, but for more than three sequences, only a small number of relatively short sequences may be analyzed. Thus, approximate methods are used for global alignment. One class of these is iterative global alignment, which makes an initial global alignment of groups of sequences and then revises the alignment to achieve a more reasonable result. This article discusses several iterative alignment methods. In particular, steps are provided for using the Sequence Alignment by Genetic Algorithm (SAGA).
5 B/ Y+ K) h% v' ~( u2 K& ^; z+ E7 A! V8 [8 Q- V: h
8 Z& Z+ r: X4 L1 b" t/ }7 Z Q: ARELATED INFORMATION. K9 p- S' [' k; u
5 o+ c6 B m+ c3 g/ L; m' w& h/ pOther approximate methods for global alignment of multiple sequences are discussed in Using Progressive Methods for Global Multiple Sequence Alignment (Mount 2008a) and Using Hidden Markov Models to Align Multiple Sequences (Mount 2008b). In Comparing Programs and Methods to Use for Global Multiple Sequence Alignment (Mount 2008c), additional alignment methods are introduced, and the utility of different methods is compared under various conditions. Programs that format and edit multiple sequence alignments (msas) are presented in Using Multiple Sequence Alignment Editors and Formatters (Mount 2008d). ; t/ t' x; Y: \% t4 a
" i% ^6 |' v$ o' C5 M! x8 p6 a
3 O" r5 K4 h% T+ N; d3 yITERATIVE METHODS FOR GLOBAL MULTIPLE SEQUENCE ALIGNMENT0 o0 p! i% ~, Y: @+ J
6 `0 ~7 @& C$ e2 c" s
Iterative methods are an alternative to progressive alignment methods. The major problem with progressive alignment methods (see Using Progressive Methods for Global Multiple Sequence Alignment [Mount 2008a]) is that errors in the initial alignments of the most closely related sequences are propagated to the msa. This problem is more acute when the starting alignments are between more distantly related sequences. Iterative methods attempt to correct for this problem by repeatedly realigning subgroups of the sequences and then by aligning these subgroups into a global alignment of all of the sequences. The objective is to improve the overall alignment score, such as a Sum-of-Pairs (SP) score. Selection of these groups may be based on the ordering of the sequences on a phylogenetic tree predicted in a manner similar to that of progressive alignment, separation of one or two of the sequences from the rest, or a random selection of the groups. These methods are compared in Hirosawa et al. (1995).
) h9 y+ N4 P+ H: U" W& S# l' u) z/ K2 m5 ^& u2 L# q
Three programs that use iterative methods are MultiAlin, PRRP, and DIALIGN. MultiAlin (Corpet 1988) recalculates pairwise scores during the production of a progressive alignment and uses these scores to recalculate the tree, which is then used to refine the alignment in an effort to improve the score. The program PRRP uses iterative methods to produce an alignment. An initial pairwise alignment is made to predict a tree. The tree is then used to produce weights for making alignments in the same manner as the MSA program except that the sequences are analyzed for the presence of aligned regions that include gaps rather than being globally aligned. These regions are iteratively recalculated to improve the alignment score. The best-scoring alignment is then used in a new cycle of calculations to predict a new tree, new weights, and new alignments, as illustrated in Figure 1 . The program repeats this process until there is no further increase in the alignment score (Gotoh 1994, 1995, 1996).
2 b. r$ I8 J1 Q6 b' V' }) C
/ r( i. H# m2 j9 M) T( [7 d T- l% F m# Y* u
View larger version (29K):
" D# U7 d$ o4 o8 m" x' {[in this window]1 \# u& x3 v) {' D. ~
[in a new window]
8 Y4 L& l u3 O- q8 Y( T/ f O2 {# @; c8 a
Figure 1. The iterative procedures used by PRRP to compute a multiple sequence alignment. (Reproduced from Gotoh 1996, with permission from Elsevier © 1996.)+ a5 P% H- {9 Y5 s* g9 W) ]
% y, D# \" b% G
The program DIALIGN finds an alignment by a different iterative method. Pairs of sequences are aligned to locate aligned regions that do not include gaps, much like continuous diagonals in a dot-matrix plot. Diagonals of various lengths are identified. A consistent collection of weighted diagonals that provides an alignment, which is a maximum sum of weights, is then found. The result is an alignment of the sequences based on alignment of these weighted diagonals. Additional methods that use iterative methods--specifically, genetic algorithms, partial-order graphs, and hidden Markov models--are introduced in this article.
5 ]1 K5 H9 s( o3 ^. A3 F* V' J) b5 f0 s
Genetic Algorithm" q& l# n3 n' Y
# w, D, P; l0 d& b6 m H
The genetic algorithm is a general type of machine-learning algorithm that has no direct relationship to biology and that was invented by computer scientists. The method has been recently adapted for msa by Notredame and Higgins (1996) in a computer program package called SAGA (see the next section for the steps used in the SAGA algorithm). Zhang and Wong (1997) have developed a similar program. The method is of considerable interest because the algorithm can find high-scoring alignments that are as good as those found by other methods such as CLUSTALW. Similar genetic algorithms have been used for RNA sequence alignment (Notredame et al. 1997) and for prediction of RNA secondary structure (Shapiro and Navetta 1994). Although the method is relatively new and not used extensively, it likely represents the first of a series of sequence analysis programs that produce alignments by attempted simulation of the evolutionary changes in sequences. Genetic algorithms are quite complex, and user experience with them is necessary.
2 K2 t" Y2 w# v, D$ n2 [7 v, h# S) B9 x
The basic idea behind SAGA is to try to generate many different msas by rearrangements that simulate gap insertion and recombination events during replication, to produce a higher and higher score for the msa. The alignments are not guaranteed to be optimal (highest scoring that is achievable). Although SAGA can generate alignments for many sequences, the program is slow for more than about 20 sequences.
: ^1 Q9 S) s+ i, f+ R& I0 r2 p
# B: f, G# I3 t2 FA similar approach for obtaining a higher-scoring msa by rearranging an existing alignment uses a probability approach called "simulated annealing" (Kim et al. 1994). The program Multiple Sequence Alignment by Simulated Annealing (MSASA) starts with a heuristic msa and then changes the alignment by following an algorithm designed to identify changes that increase the alignment score. 7 B" o+ K. n% f! _# Z
& R; z. N6 Y' YSteps in the SAGA Genetic Algorithm for Global Sequence Alignment
/ ]& ]5 F% {- {, r/ f) _. S/ C* `# K. `5 w7 S
The success of the genetic algorithm may be attributed to the steps used to rearrange sequences, many of which might be expected to have occurred during the evolution of the protein family.
6 K; |* i, o: e6 k: n) U
9 M/ x5 S) @. V+ J# }; i) {The steps in the algorithm are as follows:
! P7 b- w% B' S+ {2 F0 ^) b2 Z2 M$ V9 L+ U0 h1 m
8 o, H# g: {6 C# x" v* |: h3 B' P. k, {1. The sequences to be aligned (up to approximately 20 in number) are written in rows, as on a page, except that they are made to overlap by a random amount of sequence, up to 50 residues long for sequences that are ~200 residues in length. The ends are then padded with gaps. A typical population of 100 of these msas is made, although other numbers may be set. Shown below in Figure 2 is an initial msa for the genetic algorithm (1 of approximately 100):1 R' s0 A1 C3 `4 z
2 x) Y- W' Z4 H0 g, B0 y. n1 v: sFigure 2. 9 v' w3 t2 O! q3 H+ d4 v
3 l! ]; b0 G `4 I2 n- E8 q# r$ ]: D$ s% I7 p) L+ z
2. The 100 initial msas are scored by the SP method, using both natural and quasi-natural gap-scoring schemes. Standard amino acid scoring matrices and gap opening and extension penalties are used by SAGA. 0 T, z8 v& E; u- [4 w' c
5 q: L+ w( U% \& g0 _
) h& c2 A- T! ]5 ]5 u# w$ }
3. These initial msas are now replicated to give another generation of msas. The half of the replicates with the lowest SP scores are sent to the next generation unchanged. The remaining half for the next generation are selectively chosen by lot, like picking marbles from a bag, except that the chance for a particular choice is inversely proportional to the msa score (the lower the score, the better the msa, therefore giving that one a greater chance of replicating). 2 P4 S- r: n" ?6 u
2 O( o# s5 S8 A) j: K/ P
6 |9 A3 g+ A0 d# d0 xThis latter one-half of the choices for the next generation is now subject to mutation, as described in Step 4 below, to produce the children of the next generation. All members of the next-generation msas undergo recombination to make new child msas derived from the two parents, as described in Step 5 below. The relative probabilities of these separate events are governed by program parameters. These parameters are also adjusted dynamically as the program is running to favor those processes that have been most useful for improving msa scores. + S# C( J0 u6 Z' S, \' i
9 X5 `% h- _1 c4 p
( `- e5 b& B( M0 m T4. In the mutation process, the sequence is not changed (else it would no longer be an alignment), but gaps are inserted and rearranged in an attempt to create a better-scoring msa. In the gap insertion process, the sequences in a given msa are divided into two groups based on an estimated phylogenetic tree, and gaps of random length are inserted into random positions in the alignment. Alternatively, in a "hill-climbing" version of the procedure, the position is so chosen as to provide the best possible score following the change. Shown below (see Fig. 3) are random gap insertions into phylogenetically related sequences. The first two and last three sequences comprise the two related groups in this example. x indicates any sequence character.' D% [! F3 V7 M4 J1 A- o/ L" [; H1 R
# t- n9 s3 J. p. m( s8 `Figure 3.
& y6 Q$ R; A+ E' |$ u' Q0 L# `' d! t: e
, ^. f ?6 m4 [1 a0 l3 d x% G
Another mutational process is to move common blocks of sequence (overlapping ungapped regions) delineated by a gap, or blocks of gaps (overlapping gaps). Some of the possible moves are illustrated below (see Fig. 4). These moves may also be tailored to improve the alignment score.* ?; e. [7 W5 |. _+ v
) n: }; u8 I v1 y) j: g5 @$ HFigure 4. , E+ T* P) k- n# I# S4 B0 u
; I. }6 ^( S# F* ~
# X3 P% b/ M( c$ w& t
5. Recombination among next-generation parent msas is accomplished by one of two mechanisms. The first is not homology-driven. One msa is cut vertically through, and the other msa is cut in a staggered manner that does not lose any sequence after the fragments are spliced. The higher scoring of the two reciprocal recombinants is kept. The second mechanism, illustrated below (see Fig. 5), is recombination between msas driven by conserved sequence positions. It is driven by homology expressed as a vertical column of the same residue and is very much like standard homologous recombination.! d/ s+ m4 f3 i* }' c
, o( ~5 f9 N/ x
Figure 5. 0 C, R# D6 d9 B6 I
* ^# G8 A: Q* Y4 H
5 P: z. k* p6 Z; W9 d
6. The next generation, an overlapping one of the previous one-half of the best-scoring parental msas and the mutated children, is now evaluated as in Step 2, and the cycle of Steps 2-5 is typically repeated as many as 100 times, although as many as 1000 generations can be run. The best-scoring msa is then kept. 2 P! y/ r) J" {# y! c8 i( ~( t& O* ]
! f* v$ r( t$ h; C% }
: x/ l+ b( x- V& P; G( E
7. The entire process of producing a set of msas for replication and mutation is repeated several times to obtain several possible msas, and the best-scoring one is chosen. 8 w8 f7 K T( Q4 L J+ n% b
; C1 |# m$ H1 Y- d q! O+ \" oPartial-Order Graphs3 |" Z1 C% [5 ~9 }
+ Y- I' p# }, [6 F3 M
A dramatic improvement in the speed of producing an msa has been achieved by representing sequences and msas as partial-order graphs, a class of directed acyclic graphs (Lee et al. 2002). An example of a partial-order graph representation of an msa is illustrated in Figure 6 . The partial-order graph representing an msa can be rapidly aligned with a sequence or with another msa-representing graph by dynamic programming in an amount of time proportional to the average number of branches per node. This method is particularly efficient for sequences that share many identities, as in overlapping sets of expressed sequence tag (EST) sequences. Another advantage to this method is that each stage of the alignment can store and use information from previous alignments, in contrast to progressive alignment methods, which use profiles that do not have this information. The method of alignment is illustrated in Figure 2C. The program POA (Partial Order Alignment) for a UNIX or Linux environment is available from http://packages.debian.org/etch/poa, and a web page for data input is also available. * X4 U8 @' o& i& V/ r
& V' d' b ?# [2 a8 b( V/ W! V5 E, L! b: ?8 G# c
View larger version (12K):
9 c5 Q4 X1 b0 }: Y: N[in this window]
% X) x' [' Y9 H; i3 p. _9 w. R" N[in a new window]7 }; s& g' `5 j) j4 v1 T
, B. s9 a I- @
Figure 6. A partial-order graph representation of an msa (see Lee et al. 2002). (A) A partial-order graph. The graph has a line of nodes representing columns of (black) conserved sequence positions in the msa joined by directed edges representing consecutive sequence letters that run from the start to the end of the graph. (Purple) Aligned substitutions and (blue) an internal insertion in the msa are depicted by loops with edges representing consecutive positions in the divergent sequences. (Green) Initial unaligned termini of the sequences are also depicted. Each pair of nodes is joined by only one edge, but one node can have more than one edge entering or leaving it, thereby serving as a junction. There are no connections between branches. (B) The msa represented by the graph. The graph depicts the msa in a compacted form from which the original msa can be derived, because the nodes representing conserved msa positions store information about the location of the character in each sequence. (C) Possible moves between cells in the dynamic programming matrix by aligning the graph in A at position P with a new sequence, also represented by a graph, by the Smith-Waterman dynamic programming algorithm. Branch joining at this position gives two possible diagonal moves (purple arrows, one for each branch), two horizontal moves, and one vertical move (blue arrows to each branch). In addition, there is a start move at each cell of score zero that is the graphical equivalent of the scoring system used in the Smith-Waterman local alignment of sequences. In simpler unbranched regions, only the three standard sequence alignment moves are used. Optimal scores are calculated in each cell, and the best overall score is found. The optimal alignment between sequence and nodes is then produced by a trace-back procedure similar to that used for sequence alignment until a start move is reached. The existing graph is then updated with the new alignment information according to a set of rules as described in Lee et al. (2002). As more sequences are added to a node, the lists of sequence and aligned sequence positions are also updated. This method does not follow a progressive alignment strategy as do CLUSTALW and T-COFFEE. The order of addition of sequences to the msa can influence the results in regions of low sequence similarity. Although not implemented in the version of POA described here, this information can be used to analyze rearrangements of conserved domains in the sequences.+ x' @4 l3 R g" { R4 R. n
3 a4 q. }' ^) i/ F$ N/ `Directed acyclic graphs are also used to model gene ontologies, which classify gene functions.
5 {- W2 X( I1 }" C) x! ~% j5 a; N. y3 ~/ e" {- t2 H, i' }
Hidden Markov Models of a Global Multiple Sequence Alignment* @5 b9 a3 t% I k9 P: s) g; I
* b4 g; k7 j8 S! Y. aThe hidden Markov model (HMM) is a probabilistic, statistical model that considers all possible combinations of matches, mismatches, and gaps to generate an alignment of a set of sequences. Both global and local msas (PROFILE HMMs) may be modeled, and the methods are quite similar. A discussion of HMMs can be found in Using Hidden Markov Models to Align Multiple Sequences (Mount 2008b).
; x% x* g' G% w+ y3 ^+ c3 X: \; Z2 m& _, w
. {* x1 c+ s) e6 y! T. e$ f* @; r: p% W) [; s9 W E' ~% g% p
REFERENCES9 Z' K- F; |( J4 r8 m
/ S2 W/ F7 n, `
Corpet F. 1988. Multiple sequence alignment with hierarchical clustering. Nucleic Acids Res 16: 10881–10890.[Abstract/Free Full Text]& Z( F7 w( p. X) m8 }' I9 |
6 Q6 [3 |8 d3 t, _Gotoh O. 1994. Further improvement in methods of group-to-group sequence alignment with generalized profile operations. Comput Appl Biosci 10: 379–387.[Abstract/Free Full Text]
$ N6 n9 p: C8 o4 Z t
) p$ G- S, E9 K! L1 e, ?: E1 d1 @ OGotoh O. 1995. A weighting system and algorithm for aligning many phylogenetically related sequences. Comput Appl Biosci 11: 543–551.[Abstract/Free Full Text]. q+ Z- [% ]) r9 s5 A! i
9 x8 x4 H' e8 O+ wGotoh O. 1996. Significant improvement in accuracy of multiple protein sequence alignments by iterative refinement as assessed by reference to structural alignments. J Mol Biol 264: 823–838.[Medline]7 s( y7 Q) J+ m
- Q% X' v$ D! D ^Hirosawa M, Totoki Y, Hoshida M, Ishikawa M. 1995. Comprehensive study on iterative algorithms of multiple sequence alignment. Comput Appl Biosci 11: 13–18.[Abstract/Free Full Text]0 h3 e5 A/ Y, c- v& o
5 I0 N' m- ^! Z8 MKim J, Pramanik S, Chung MJ. 1994. Multiple sequence alignment by simulated annealing. Comput Appl Biosci 10: 419–426.[Abstract/Free Full Text]
4 l2 d) e7 m8 B: i; h* z
& Y4 w3 G, Z2 _0 P6 Q% KLee C, Grasso C, Sharlow MF. 2002. Multiple sequence alignment using partial order graphs. Bioinformatics 18: 452–464.[Abstract/Free Full Text]' B/ `: [0 ?, Z& k8 _0 b6 u' \
# U: C. u* o; c N' S; WMount DW. 2008a. Using progressive methods for global multiple sequence alignment. Cold Spring Harb Protoc (this issue). doi: 10.1101/pdb.top43.[Abstract/Free Full Text]
2 r& p3 @/ Y! G& L& d5 F& l# k
6 X" C% I) n3 IMount DW. 2008b. Using hidden Markov models to align multiple sequences. Cold Spring Harb Protoc (this issue). doi: 10.1101/pdb.top41.[Abstract/Free Full Text]) \6 F; Z" i- |2 K: ]* U1 S
+ A) O1 Z# y) w2 I4 ^! fMount DW. 2008c. Comparing programs and methods to use for global multiple sequence alignment. Cold Spring Harb Protoc (this issue). doi: 10.1101/pdb.ip61.[Abstract/Free Full Text]
$ ~: g: V' p W) G0 X8 U2 D( j* P
Mount DW. 2008d. Using multiple sequence alignment editors and formatters. Cold Spring Harb Protoc (this issue). doi: 10.1101/pdb.top45.[Abstract/Free Full Text]! S: }. P4 ~" P" g7 V
2 T/ _: j- n8 P: e- L( o) A
Notredame C, Higgins DG. 1996. SAGA: Sequence alignment by genetic algorithm. Nucleic Acids Res 24: 1515–1524.[Abstract/Free Full Text]
0 n, Q2 q5 P8 `, L3 P, r
" N z. V& W& @( @Notredame C, O’Brien EA, Higgins DG. 1997. RAGA: RNA sequence alignment by genetic algorithm. Nucleic Acids Res 25: 4570–4580.[Abstract/Free Full Text]
; u/ D5 ^6 t: v4 D- t1 F* Z p; D
/ H! E% I- s" B9 B- CShapiro B, Navetta J. 1994. A massively parallel genetic algorithm for RNA secondary structure prediction. J Supercomput 8: 195–207.
# Y% ]# F4 L( m' X& ?
- _2 v0 K1 v0 t- ^4 @! GZhang C, Wong AK. 1997. A genetic algorithm for multiple molecular sequence alignment. Comput Appl Biosci 13: 565–581.[Abstract/Free Full Text] |
-
总评分: 威望 + 15
包包 + 30
查看全部评分
|