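Request: three paywalled Cold Spring Harbor articles on sequencing bioinformatics computation, thanks - Literature Exchange board, 干细胞之家 (Stem Cell Home)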




Views: 9007 | Replies: 3 [Solved request] Request: three paywalled Cold Spring Harbor articles on sequencing bioinformatics computation, thanks

ziyunfei1982

Moderator

Original poster

Posted on 2010-8-3 16:24

1. http://cshprotocols.cshlp.org/cgi/content/full/2009/7/pdb.ip61

2. http://cshprotocols.cshlp.org/cgi/content/full/2009/7/pdb.top45

3. http://cshprotocols.cshlp.org/cgi/content/full/2009/7/pdb.top44

Reply 1

Posted on 2010-8-3 17:04

No PDF.
http://cshprotocols.cshlp.org/cgi/content/full/2009/7/pdb.ip61
Comparing Programs and Methods to Use for Global Multiple Sequence Alignment
David W. Mount
Adapted from Bioinformatics: Sequence and Genome Analysis, 2nd edition, by David W. Mount. CSHL Press, Cold Spring Harbor, NY, USA, 2004.

INTRODUCTION

It is difficult to find a global optimal alignment of more than two sequences (and, especially, more than three) that includes matches, mismatches, and gaps and that takes into account the degree of variation in all of the sequences at the same time. Thus, approximate methods are used, such as progressive global alignment, iterative global alignment, alignments based on locally conserved patterns found in the same order in the sequences, statistical methods that generate probabilistic models of the sequences, and multiple sequence alignments produced by graph-based methods. When 10 or more sequences are being compared, it is common to begin by determining sequence similarities between all pairs of sequences in the set. A variety of methods are then available to cluster the sequences into the most related groups or into a phylogenetic tree. This article discusses several of these methods and provides data that compare their utility under various conditions.
RELATED INFORMATION

Some of the approximate methods for global alignment of multiple sequences are discussed in more detail in Using Iterative Methods for Global Multiple Sequence Alignment (Mount 2008a), Using Progressive Methods for Global Multiple Sequence Alignment (Mount 2008b), and Using Hidden Markov Models to Align Multiple Sequences (Mount 2008c). Programs that format and edit multiple sequence alignments (msas) are presented in Using Multiple Sequence Alignment Editors and Formatters (Mount 2008d). A discussion of Distance Methods for Phylogenetic Prediction (Mount 2008e) is also available.

PROGRAMS AND METHODS FOR GLOBAL MULTIPLE SEQUENCE ALIGNMENT

A common msa approach (especially for 10 or more sequences) is first to determine the sequence similarity between all pairs of sequences in the set. On the basis of these similarities, various methods are used to cluster the sequences into the most related groups or into a phylogenetic tree.

There are several approaches to using these methods:

1. In the group approach, a consensus is produced for each group of sequences and then used to make further alignments between groups. Two examples of programs using the group approach are the program PIMA (Smith and Smith 1992), which uses several novel alignment techniques, and the program MULTAL, described by Taylor (1990, 1996).

2. The tree method uses the distance method of phylogenetic analysis to arrange the sequences (see Distance Methods for Phylogenetic Prediction [Mount 2008e]). The two closest sequences are first aligned, and the resulting consensus alignment is aligned with the next best sequence or cluster of sequences, and so on, until an alignment is obtained that includes all of the sequences. The program CLUSTALW (see Using Progressive Methods for Global Multiple Sequence Alignment [Mount 2008b]) is an example of this approach. The ALIGN set of programs (Feng and Doolittle 1996) and the MS-DOS program by Corpet (1988) use this method. Additional programs for msa are also described in Barton (1994), Kim et al. (1994), and Morgenstern et al. (1996).

3. Another approach (Vingron and Argos 1991) aligns all possible pairs of sequences to create a set of dot matrices, and the matrices are then filtered sequentially to find motifs that provide a starting point for sequence alignment. A set of programs for interactive msa by dot-matrix analysis and other alignment techniques has also been developed (Boguski et al. 1992).

4. The program TREEALIGN takes the approach that msas should be done in a fashion that simultaneously minimizes the number of changes needed during evolution to generate the observed sequence variation (Hein 1990). TREEALIGN (also called "ALIGN" in the program versions) performs the alignment and the most parsimonious tree construction at the same time. The initial steps are similar to other msa methods, except that TREEALIGN uses a distance scale: That is, the sequences are aligned pairwise, and the resulting distance scores are used sequentially to produce a tree, which is rearranged as more sequences are added. The sequences are then realigned so that the same tree can be produced by maximum parsimony. Finally, the tree is rearranged to maximize parsimony. The advantage to this method is the increased use of phylogenetic analysis to improve the msa.
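
The first stage shared by these approaches (score all pairs, then cluster) can be sketched in a few lines of Python. This is only an illustration, not the procedure of CLUSTALW or any other program named above: `pairwise_identity` is a crude stand-in for a real pairwise alignment score, the average-linkage clustering is an assumed choice, and all function names are hypothetical.

```python
from itertools import combinations


def pairwise_identity(a: str, b: str) -> float:
    """Fraction of identical positions over the shorter length (no gaps considered)."""
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / n


def distance_matrix(seqs: dict) -> dict:
    """Distance (1 - identity) for every unordered pair of sequence names."""
    return {
        frozenset((p, q)): 1.0 - pairwise_identity(seqs[p], seqs[q])
        for p, q in combinations(seqs, 2)
    }


def progressive_order(seqs: dict) -> list:
    """Order in which clusters would be merged, closest pair first (average linkage)."""
    dist = distance_matrix(seqs)
    clusters = [(name,) for name in seqs]
    merges = []

    def cluster_dist(c1, c2):
        # Mean distance between all members of the two clusters.
        total = sum(dist[frozenset((a, b))] for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2), key=lambda pair: cluster_dist(*pair))
        clusters.remove(c1)
        clusters.remove(c2)
        clusters.append(c1 + c2)
        merges.append((c1, c2))
    return merges


if __name__ == "__main__":
    example = {"A": "ACGTACGT", "B": "ACGTACGA", "C": "TTTTACGA", "D": "TTTTAAAA"}
    for left, right in progressive_order(example):
        print(left, "+", right)
```

A progressive aligner would align the sequences in the order of these merges; the sketch stops at the ordering because the alignment step itself depends on the program.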

PERFORMANCE OF GLOBAL MULTIPLE SEQUENCE ALIGNMENT PROGRAMS

The performance of global msa programs is commonly assessed by comparing the computed msa with a structural alignment of the proteins and by other objective methods (Notredame et al. 1998). The programs are compared for their ability to reproduce structurally derived alignments from BAliBASE (Thompson et al. 1999b), a database of protein families, each with a known three-dimensional structure and a documented msa based on a scheme of expected sequence changes using the program Rose (Stoye et al. 1998). Reviews of the performance of msa software are given in McClure et al. (1994) (progressive alignment methods), Gotoh (1996), and Thompson et al. (1999a). A review of websites is given in Briffeuil et al. (1998), and reviews of iterative algorithms are given in Hirosawa et al. (1995) and Gotoh (1999).

In a recent set of program comparisons, T-COFFEE slightly outperformed the iterative program PRRP (11%-17%) in matching BAliBASE alignments (Notredame et al. 2000). A later study compared the programs Partial Order Alignment (POA), DIALIGN, T-COFFEE, and CLUSTALW for speed and alignment quality (Lassmann and Sonnhammer 2002). Of these, DIALIGN was most accurate for msas of sequences of low sequence identity, T-COFFEE was best for sequences of high sequence identity, POA was fastest and almost as accurate as DIALIGN and T-COFFEE, and CLUSTALW was only as good as the others for global msas of sequences with high sequence similarity.
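
To make "matching a reference alignment" concrete, here is a minimal Python sketch of one common agreement measure: the fraction of residue pairs aligned in the reference that are also aligned in the test msa. It is an assumption that this resembles the scores used in the cited comparisons; the function names are hypothetical, and every row of an alignment is assumed to have the same length with `-` as the gap character.

```python
from itertools import combinations


def aligned_pairs(alignment):
    """Set of residue pairs ((row_i, index_i), (row_j, index_j)) sharing a column."""
    counters = [0] * len(alignment)  # ungapped residue index reached in each row
    pairs = set()
    for col in range(len(alignment[0])):
        present = []
        for i, row in enumerate(alignment):
            if row[col] != "-":
                present.append((i, counters[i]))
                counters[i] += 1
        pairs.update(combinations(present, 2))
    return pairs


def sp_agreement(test, reference):
    """Fraction of reference residue pairs that the test alignment reproduces."""
    ref = aligned_pairs(reference)
    if not ref:
        return 0.0
    return len(ref & aligned_pairs(test)) / len(ref)


if __name__ == "__main__":
    reference = ["AC-GT", "ACAGT"]
    test = ["ACG-T", "ACAGT"]
    print(sp_agreement(test, reference))
```

Because the measure counts residue pairs rather than columns, an alignment that misplaces a single gap still gets partial credit.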

REFERENCES

Barton GJ. 1994. The AMPS package for multiple protein sequence alignment. Computer analysis of sequence data. Part II. Methods Mol Biol 25: 327–347.

Boguski M, Hardison RC, Schwartz S, Miller W. 1992. Analysis of conserved domains and sequence motifs in cellular regulatory proteins and locus control regions using software tools for multiple alignment and visualization. New Biol 4: 247–260.

Briffeuil P, Baudoux G, Reginster I, Debolle X, Depiereux E, Feytmans E. 1998. Comparative analysis of seven multiple protein sequence alignment servers: Clues to enhance reliability of predictions. Bioinformatics 14: 357–366.

Corpet F. 1988. Multiple sequence alignment with hierarchical clustering. Nucleic Acids Res 16: 10881–10890.

Feng DF, Doolittle RF. 1996. Progressive alignment of amino acid sequences and construction of phylogenetic trees from them. Methods Enzymol 266: 368–382.

Gotoh O. 1996. Significant improvement in accuracy of multiple protein sequence alignments by iterative refinement as assessed by reference to structural alignments. J Mol Biol 264: 823–838.

Gotoh O. 1999. Multiple sequence alignment: Algorithms and applications. Adv Biophys 36: 159–206.

Hein J. 1990. Unified approach to alignment and phylogenies. Methods Enzymol 183: 626–645.

Hirosawa M, Totoki Y, Hoshida M, Ishikawa M. 1995. Comprehensive study on iterative algorithms of multiple sequence alignment. Comput Appl Biosci 11: 13–18.

Kim J, Pramanik S, Chung MJ. 1994. Multiple sequence alignment by simulated annealing. Comput Appl Biosci 10: 419–426.

Lassmann T, Sonnhammer EL. 2002. Quality assessment of multiple alignment programs. FEBS Letters 529: 126–130.

McClure MA, Vasi TK, Fitch WM. 1994. Comparative analysis of multiple protein-sequence alignment methods. Mol Biol Evol 11: 571–592.

Morgenstern B, Dress A, Werner T. 1996. Multiple DNA and protein sequence alignment based on segment-to-segment comparison. Proc Natl Acad Sci 93: 12098–12103.

Mount DW. 2008a. Using iterative methods for global multiple sequence alignment. Cold Spring Harb Protoc (this issue). doi: 10.1101/pdb.top44.

Mount DW. 2008b. Using progressive methods for global multiple sequence alignment. Cold Spring Harb Protoc (this issue). doi: 10.1101/pdb.top43.

Mount DW. 2008c. Using hidden Markov models to align multiple sequences. Cold Spring Harb Protoc (this issue). doi: 10.1101/pdb.top41.

Mount DW. 2008d. Using multiple sequence alignment editors and formatters. Cold Spring Harb Protoc (this issue). doi: 10.1101/pdb.top45.

Mount DW. 2008e. Distance methods for phylogenetic prediction. Cold Spring Harb Protoc doi: 10.1101/pdb.top33.

Notredame C, Holm L, Higgins DG. 1998. COFFEE: A new objective function for multiple sequence alignment. Bioinformatics 14: 407–422.

Notredame C, Higgins DG, Heringa J. 2000. T-Coffee: A novel method for fast and accurate multiple sequence alignment. J Mol Biol 302: 205–217.

Smith RF, Smith TF. 1992. Pattern-induced multi-sequence alignment (PIMA) algorithm employing secondary structure-dependent gap penalties for use in comparative protein modelling. Protein Eng 5: 35–41.

Stoye J, Evers D, Meyer F. 1998. Rose: Generating sequence families. Bioinformatics 14: 157–163.

Taylor WR. 1990. Hierarchical method to align large numbers of biological sequences. Methods Enzymol 183: 456–474.

Taylor WR. 1996. Multiple protein sequence alignment: Algorithms and gap insertion. Methods Enzymol 266: 343–367.

Thompson JD, Plewniak F, Poch O. 1999a. A comprehensive comparison of multiple sequence alignment programs. Nucleic Acids Res 27: 2682–2690.

Thompson JD, Plewniak F, Poch O. 1999b. BAliBASE: A benchmark alignment database for the evaluation of multiple alignment programs. Bioinformatics 15: 87–88.

Vingron M, Argos P. 1991. Motif recognition and alignment for many sequences by comparison of dot matrices. J Mol Biol 218: 33–43.


lixiaodong

Intermediate member

Reply 2

Posted on 2010-8-3 17:05

http://cshprotocols.cshlp.org/cgi/content/full/2009/7/pdb.top45
Using Multiple Sequence Alignment Editors and Formatters
David W. Mount
Adapted from Bioinformatics: Sequence and Genome Analysis, 2nd edition, by David W. Mount. CSHL Press, Cold Spring Harbor, NY, USA, 2004.

INTRODUCTION

Sequence alignment editors enable the user to manually edit a multiple sequence alignment (msa) in order to obtain a more reasonable or expected alignment. Editors allow sequences to be reordered and/or modified using the computer’s cut and paste commands. They are designed to accept various msa formats and to provide the output file in a suitable user-designated format. Sequence formatters provide various output formatting options, such as color and shading schemes to enhance visualization of residue alignments. The formatters can output files in Postscript, EPS, RTF, and other widely recognized formats, while accepting the standard input formats, such as MSF, ALN, and FASTA. This article introduces a number of sequence alignment editors and formatters, and provides links to sites where they can be found.
RELATED INFORMATION

Approximate methods for global alignment of multiple sequences are discussed in Using Iterative Methods for Global Multiple Sequence Alignment (Mount 2008a), Using Progressive Methods for Global Multiple Sequence Alignment (Mount 2008b), and Using Hidden Markov Models to Align Multiple Sequences (Mount 2008c). In Comparing Programs and Methods to Use for Global Multiple Sequence Alignment (Mount 2008d), additional alignment methods are introduced and the utility of different methods is compared under various conditions.
CHOOSING AN MSA EDITOR OR FORMATTER

Once an msa has been obtained by a global msa program, it may be necessary to edit the alignment manually to obtain a more reasonable or expected result. Several considerations must be kept in mind when choosing a sequence editor, which should include as many of the following features as possible:

1. Provision for displaying the sequence on a color monitor with residue colors to aid in a clear visual representation of the alignment.

2. Recognition of the multiple sequence format that was output by the msa program and maintenance of the alignment in a suitable format when the editing is completed.

3. Provision of a suitable windowed interface, allowing use of the mouse to add, delete, or move sequences followed by an updated display of the alignment.

In addition, there are other types of editing that are commonly performed on msas, for example, shading conserved residues in the alignment.
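
As a rough illustration of what marking conserved residues involves, the Python sketch below builds a marker line with `*` under every column in which all sequences carry the same residue. It is a simplified assumption of how a formatter might decide what to shade, not the logic of any particular program; the function name is hypothetical.

```python
def conservation_line(alignment):
    """Return '*' for fully conserved, gap-free columns and ' ' elsewhere."""
    marks = []
    for column in zip(*alignment):
        residues = set(column)
        conserved = len(residues) == 1 and "-" not in residues
        marks.append("*" if conserved else " ")
    return "".join(marks)


if __name__ == "__main__":
    msa = ["ACG-T", "ACGAT", "ATGAT"]
    for row in msa:
        print(row)
    print(conservation_line(msa))
```

Real formatters usually also group similar (not only identical) residues, which would require a residue similarity table in place of the set test used here.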

A large number of msa formats are in use. Two commonly encountered examples are the Genetics Computer Group’s MSF format and the CLUSTALW ALN format. Because these formats follow a precise outline, one may be readily converted to another by computer programs. READSEQ by D.G. Gilbert at Indiana University, Bloomington, is one such program. This program will run on almost any computer platform and may be obtained by anonymous FTP from ftp://ftp.bio.indiana.edu/molbio/readseq. There is also a web-based interface for READSEQ from Baylor College of Medicine at http://searchlauncher.bcm.tmc.edu/seq-util/seq-util.html. The software package SEQIO, which provides C program modules for conversion of sequence files from one format to another, is available from http://www.cs.ucdavis.edu/~gusfield/seqio.html.
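
Because these formats are line-oriented, a converter is mostly bookkeeping. The following Python sketch reads an aligned FASTA file and writes an interleaved, CLUSTAL-style block layout. It is not READSEQ or SEQIO, the function names are hypothetical, and the output is a simplified approximation of the ALN layout that is not guaranteed to be accepted by every program.

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (name, sequence) tuples."""
    records, name, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            header = line[1:].split()
            # Fall back to a generated name if the header line is empty.
            name = header[0] if header else f"seq{len(records) + 1}"
            chunks = []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def to_aln_like(records, width=60):
    """Write records as interleaved blocks of `width` columns (simplified layout)."""
    lines = ["CLUSTAL-style alignment (simplified sketch)", ""]
    length = len(records[0][1])
    pad = max(len(name) for name, _ in records) + 4
    for start in range(0, length, width):
        for name, seq in records:
            lines.append(f"{name:<{pad}}{seq[start:start + width]}")
        lines.append("")
    return "\n".join(lines)


if __name__ == "__main__":
    fasta = ">s1 example\nAC-G\nTT\n>s2\nACAG\nTT\n"
    print(to_aln_like(parse_fasta(fasta), width=4))
```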

Some of the many available programs that meet or exceed the features listed above are discussed below. For a more comprehensive list, visit the catalog of software at http://www.biocatalogue.org/.

SEQUENCE EDITORS

1. CINEMA (Colour Interactive Editor for Multiple Alignments) at http://www.biochem.ucl.ac.uk/bsm/dbbrowser/CINEMA2.02/kit.html is a broadly functional program for sequence editing and analysis, including dot matrix analysis (Parry-Smith et al. 1998; Lord et al. 2002). It features drag-and-drop editing, sequence shifting to left or right, viewing of different parts of an alignment using the split-screen option, multiple motif selection and manipulation, and a number of added features such as viewing of protein structures. CINEMA was developed by A.W.R. Payne, D.J. Parry-Smith, A.D. Michie, and T.K. Attwood. CINEMA is an applet that runs under a web browser and therefore will run on almost any computer platform.

2. GDE (Genetic Data Environment) provides a general interface on UNIX machines for sequence analysis, sequence alignment editing, and display (Smith et al. 1994) and is available from several anonymous FTP sites including ftp.ebi.ac.uk/pub/software/unix. GDE is described at http://bimas.dcrt.nih.gov/gde_sw.html. GDE features are incorporated into the Seqlab interface for the GCG software, vers. 9. This interface requires communication with a host UNIX machine running the Genetics Computer Group software. Interface with MS-DOS or Macintosh is possible if the computer is equipped with the appropriate X-Window client software.

3. GeneDoc is an alignment editing and display program by K. Nicholas and H. Nicholas of the Pittsburgh Supercomputing Center for MSF-formatted msas (Fig. 1). It can also import files in other formats. GeneDoc can move residues by inserting or deleting gaps, and features drag-and-drop editing. As the alignment is edited, a new alignment score is calculated by the Sum-of-Pairs (SP) method or based on a phylogenetic tree. GeneDoc is available from http://www.nrbsc.org/gfx/genedoc/index.html and runs under MS Windows.

Figure 1. GeneDoc, a multiple sequence alignment editor with many useful features. Shown is an illustrative msa of three DNA repair genes similar to the Saccharomyces cerevisiae Rad1 gene. The sequences were aligned with CLUSTALW, and the FASTA-formatted alignment was imported into GeneDoc on a PC.

4. MACAW is both a local multiple sequence alignment program and a sequence-editing tool (Schuler et al. 1991). Given a set of sequences, the program finds ungapped blocks in the sequences and gives their statistical significance. Later versions of the program find blocks by one of three user-chosen methods: by searching for maximum segment pairs or common patterns present in the sequences scored by a scoring matrix such as PAM250 or BLOSUM matrices (the methods used by the BLAST algorithm); by using the Gibbs sampling strategy, a statistical method; or by searching for user-provided patterns provided in a particular format called a regular expression. Executable programs that run under MS Windows, Macintosh, and other computer platforms are available by anonymous FTP from ftp://ftp.ncbi.nlm.nih.gov/pub/schuler/macaw.

5. DCSE, Dedicated Comparative Sequence Editor, was developed by Peter De Rijk for editing of protein, DNA, and RNA sequences using the X Windows interface.

6. SEAVIEW, developed by Galtier et al. (1996), edits and converts multiple sequence alignment formats on a variety of computer platforms and is available from http://pbil.univ-lyon1.fr/software/seaview.html.

7. SEQPUP, developed by Don Gilbert at Indiana University, is a sequence editor and sequence analysis tool that runs on most computer platforms and is available from ftp://iubio.bio.indiana.edu/molbio/seqpup/java/.

SEQUENCE FORMATTERS

1. Boxshade is a formatting program by K. Hofmann for marking identical or similar residues in msas with shaded boxes, and is available by anonymous FTP from ftp://www.isrec.isb-sib.ch/pub/. The web server at http://www.ch.embnet.org/software/BOX_form.html takes a multiple-alignment file in either the Genetics Computer Group MSF format or CLUSTAL ALN format and can output a file in many forms, including PostScript/EPS and PICT, for editing on Macintosh and MS-DOS machines.

2. CLUSTALX is a sequence formatting tool that provides a windows interface for a CLUSTALW msa and is available for many computer platforms, including MS-DOS and Macintosh machines, by anonymous FTP from ftp://ftp-igbmc.u-strasbg.fr/pub/ClustalX/ (Thompson et al. 1997).

REFERENCES

Galtier N, Gouy M, Gautier C. 1996. SeaView and Phylo_win, two graphic tools for sequence alignment and molecular phylogeny. Comput Appl Biosci 12: 543–548.

Lord PW, Selley N, Attwood TK. 2002. CINEMA-MX: A modular multiple alignment editor. Bioinformatics 18: 1402–1403.

Mount DW. 2008a. Using iterative methods for global multiple sequence alignment. Cold Spring Harb Protoc (this issue). doi: 10.1101/pdb.top44.

Mount DW. 2008b. Using progressive methods for global multiple sequence alignment. Cold Spring Harb Protoc (this issue). doi: 10.1101/pdb.top43.

Mount DW. 2008c. Using hidden Markov models to align multiple sequences. Cold Spring Harb Protoc (this issue). doi: 10.1101/pdb.top41.

Mount DW. 2008d. Comparing programs and methods to use for global multiple sequence alignment. Cold Spring Harb Protoc (this issue). doi: 10.1101/pdb.ip61.

Parry-Smith DJ, Payne AW, Michie AD, Attwood TK. 1998. CINEMA-A novel colour INteractive editor for multiple alignments. Gene 221: GC57–GC63.

Schuler GD, Altschul SF, Lipman DJ. 1991. A workbench for multiple alignment construction and analysis. Proteins 9: 180–190.

Smith SW, Overbeek R, Woese CR, Gilbert W, Gillevet PM. 1994. The genetic data environment and expandable GUI for multiple sequence analysis. Comput Appl Biosci 10: 671–675.

Thompson JD, Gibson TJ, Plewniak F, Jeanmougin F, Higgins DG. 1997. The CLUSTAL X windows interface: Flexible strategies for multiple sequence alignment aided by quality analysis tools. Nucleic Acids Res 25: 4876–4882.


lixiaodong

Intermediate member

Reply 3

Posted on 2010-8-3 17:07

http://cshprotocols.cshlp.org/cgi/content/full/2009/7/pdb.top44
Using Iterative Methods for Global Multiple Sequence Alignment
David W. Mount
Adapted from Bioinformatics: Sequence and Genome Analysis, 2nd edition, by David W. Mount. CSHL Press, Cold Spring Harbor, NY, USA, 2004.

INTRODUCTION

Finding a global optimal alignment of more than two sequences that includes matches, mismatches, and gaps and that takes into account the degree of variation in all of the sequences at the same time is especially difficult. The dynamic programming algorithm used for optimal alignment of pairs of sequences can be extended to global alignment of three sequences, but for more than three sequences, only a small number of relatively short sequences may be analyzed. Thus, approximate methods are used for global alignment. One class of these is iterative global alignment, which makes an initial global alignment of groups of sequences and then revises the alignment to achieve a more reasonable result. This article discusses several iterative alignment methods. In particular, steps are provided for using the Sequence Alignment by Genetic Algorithm (SAGA).
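
For reference, the pairwise dynamic programming that this paragraph builds on can be sketched in Python as follows (the standard Needleman-Wunsch recurrence, score only). The match, mismatch, and gap values are illustrative assumptions, and the function name is hypothetical. The table has (m+1) x (n+1) cells for two sequences; the three-sequence extension needs a cube of cells, which is why the exact method becomes impractical as sequences are added.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Optimal global alignment score of a and b with a linear gap penalty."""
    # prev[j] holds the best score for aligning a[:i-1] with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # aligning a[:i] against an empty prefix of b
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # gap in b
            left = curr[j - 1] + gap  # gap in a
            curr.append(max(diag, up, left))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(global_alignment_score("ACGT", "AGT"))
```

Only two rows of the table are kept because the sketch returns the score alone; recovering the alignment itself would require the full table or a traceback structure.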

RELATED INFORMATION

Other approximate methods for global alignment of multiple sequences are discussed in Using Progressive Methods for Global Multiple Sequence Alignment (Mount 2008a) and Using Hidden Markov Models to Align Multiple Sequences (Mount 2008b). In Comparing Programs and Methods to Use for Global Multiple Sequence Alignment (Mount 2008c), additional alignment methods are introduced, and the utility of different methods is compared under various conditions. Programs that format and edit multiple sequence alignments (msas) are presented in Using Multiple Sequence Alignment Editors and Formatters (Mount 2008d).

ITERATIVE METHODS FOR GLOBAL MULTIPLE SEQUENCE ALIGNMENT

Iterative methods are an alternative to progressive alignment methods. The major problem with progressive alignment methods (see Using Progressive Methods for Global Multiple Sequence Alignment [Mount 2008a]) is that errors in the initial alignments of the most closely related sequences are propagated to the msa. This problem is more acute when the starting alignments are between more distantly related sequences. Iterative methods attempt to correct for this problem by repeatedly realigning subgroups of the sequences and then by aligning these subgroups into a global alignment of all of the sequences. The objective is to improve the overall alignment score, such as a Sum-of-Pairs (SP) score. Selection of these groups may be based on the ordering of the sequences on a phylogenetic tree predicted in a manner similar to that of progressive alignment, separation of one or two of the sequences from the rest, or a random selection of the groups. These methods are compared in Hirosawa et al. (1995).
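
The Sum-of-Pairs objective mentioned above can be written down compactly. The Python sketch below scores every pair of rows in every column; the match, mismatch, and gap values are illustrative assumptions (real programs use amino acid scoring matrices and separate gap opening and extension penalties), higher totals are treated as better here, and the function name is hypothetical.

```python
from itertools import combinations


def sp_score(alignment, match=1, mismatch=-1, gap=-2):
    """Sum over all columns of the scores of all pairs of rows in that column."""
    total = 0
    for column in zip(*alignment):
        for x, y in combinations(column, 2):
            if x == "-" and y == "-":
                continue  # two gaps facing each other contribute nothing
            if x == "-" or y == "-":
                total += gap
            elif x == y:
                total += match
            else:
                total += mismatch
    return total


if __name__ == "__main__":
    print(sp_score(["AC-G", "ACTG", "A-TG"]))
```

An iterative method would call a function like this after each realignment of a subgroup and keep the change only if the total improves.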

Three programs that use iterative methods are MultiAlin, PRRP, and DIALIGN. MultiAlin (Corpet 1988) recalculates pairwise scores during the production of a progressive alignment and uses these scores to recalculate the tree, which is then used to refine the alignment in an effort to improve the score. The program PRRP also produces its alignment iteratively. An initial pairwise alignment is made to predict a tree. The tree is then used to produce weights for making alignments in the same manner as the MSA program except that the sequences are analyzed for the presence of aligned regions that include gaps rather than being globally aligned. These regions are iteratively recalculated to improve the alignment score. The best-scoring alignment is then used in a new cycle of calculations to predict a new tree, new weights, and new alignments, as illustrated in Figure 1. The program repeats this process until there is no further increase in the alignment score (Gotoh 1994, 1995, 1996).

Figure 1. The iterative procedures used by PRRP to compute a multiple sequence alignment. (Reproduced from Gotoh 1996, with permission from Elsevier © 1996.)

The program DIALIGN finds an alignment by a different iterative method. Pairs of sequences are aligned to locate aligned regions that do not include gaps, much like continuous diagonals in a dot-matrix plot. Diagonals of various lengths are identified. A consistent collection of weighted diagonals that provides an alignment, which is a maximum sum of weights, is then found. The result is an alignment of the sequences based on alignment of these weighted diagonals. Additional methods that use iterative methods--specifically, genetic algorithms, partial-order graphs, and hidden Markov models--are introduced in this article.
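
The gap-free "diagonals" described here can be illustrated with a short Python sketch that scans every diagonal of the implied dot matrix for runs of identical residues. This is a deliberate simplification: DIALIGN weights diagonals and then selects a consistent set of them, whereas the sketch only lists candidate runs of exact matches. The function name and the `min_len` threshold are assumptions.

```python
def ungapped_diagonals(a, b, min_len=3):
    """Return (start_in_a, start_in_b, length) for runs of identical residues."""
    found = []
    # offset = (index in b) - (index in a) identifies one diagonal.
    for offset in range(-(len(a) - 1), len(b)):
        i = max(0, -offset)
        j = i + offset
        run = 0
        while i < len(a) and j < len(b):
            if a[i] == b[j]:
                run += 1
            else:
                if run >= min_len:
                    found.append((i - run, j - run, run))
                run = 0
            i += 1
            j += 1
        if run >= min_len:  # run that reaches the end of the diagonal
            found.append((i - run, j - run, run))
    return found


if __name__ == "__main__":
    print(ungapped_diagonals("GGACGTTT", "CCACGTAA"))
```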

Genetic Algorithm

The genetic algorithm is a general type of machine-learning algorithm that has no direct relationship to biology and that was invented by computer scientists. The method was adapted for msa by Notredame and Higgins (1996) in a computer program package called SAGA (see the next section for the steps used in the SAGA algorithm). Zhang and Wong (1997) have developed a similar program. The method is of considerable interest because the algorithm can find high-scoring alignments that are as good as those found by other methods such as CLUSTALW. Similar genetic algorithms have been used for RNA sequence alignment (Notredame et al. 1997) and for prediction of RNA secondary structure (Shapiro and Navetta 1994). Although the method is relatively new and not used extensively, it likely represents the first of a series of sequence analysis programs that produce alignments by attempted simulation of the evolutionary changes in sequences. Genetic algorithms are quite complex, and user experience with them is necessary.

The basic idea behind SAGA is to try to generate many different msas by rearrangements that simulate gap insertion and recombination events during replication, to produce a higher and higher score for the msa. The alignments are not guaranteed to be optimal (highest scoring that is achievable). Although SAGA can generate alignments for many sequences, the program is slow for more than about 20 sequences.

A similar approach for obtaining a higher-scoring msa by rearranging an existing alignment uses a probability approach called "simulated annealing" (Kim et al. 1994). The program Multiple Sequence Alignment by Simulated Annealing (MSASA) starts with a heuristic msa and then changes the alignment by following an algorithm designed to identify changes that increase the alignment score.
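
The general shape of a simulated annealing search can be sketched in Python as below. The acceptance rule is the textbook Metropolis criterion and the cooling schedule is an arbitrary assumption; neither is taken from MSASA, and the names `accept_change`, `anneal`, `propose`, and `score` are hypothetical. In an msa setting, `propose` would rearrange gaps and `score` would be an alignment score such as SP.

```python
import math
import random


def accept_change(old_score, new_score, temperature, rng=random):
    """Always accept improvements; accept worse states with a temperature-dependent chance."""
    if new_score >= old_score:
        return True
    if temperature <= 0:
        return False
    return rng.random() < math.exp((new_score - old_score) / temperature)


def anneal(state, propose, score, steps=1000, t_start=1.0, cooling=0.995, rng=random):
    """Generic annealing loop that returns the best state seen and its score."""
    current, current_score = state, score(state)
    best, best_score = current, current_score
    temperature = t_start
    for _ in range(steps):
        candidate = propose(current, rng)
        candidate_score = score(candidate)
        if accept_change(current_score, candidate_score, temperature, rng):
            current, current_score = candidate, candidate_score
            if current_score > best_score:
                best, best_score = current, current_score
        temperature *= cooling  # geometric cooling (assumed schedule)
    return best, best_score


if __name__ == "__main__":
    # Toy problem: walk on the integers toward the value 3.
    result = anneal(0, lambda s, r: s + r.choice((-1, 1)), lambda s: -abs(s - 3))
    print(result)
```

Accepting some score-lowering changes early on is what lets the search escape a poor starting alignment; as the temperature falls, the loop behaves more and more like plain hill climbing.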

Steps in the SAGA Genetic Algorithm for Global Sequence Alignment

The success of the genetic algorithm may be attributed to the steps used to rearrange sequences, many of which might be expected to have occurred during the evolution of the protein family.

The steps in the algorithm are as follows:

1. The sequences to be aligned (up to approximately 20 in number) are written in rows, as on a page, except that they are made to overlap by a random amount of sequence, up to 50 residues long for sequences that are ~200 residues in length. The ends are then padded with gaps. A typical population of 100 of these msas is made, although other numbers may be set. Shown below in Figure 2 is an initial msa for the genetic algorithm (1 of approximately 100):

Figure 2.

2. The 100 initial msas are scored by the SP method, using both natural and quasi-natural gap-scoring schemes. Standard amino acid scoring matrices and gap opening and extension penalties are used by SAGA.

3. These initial msas are now replicated to give another generation of msas. The half of the replicates with the lowest SP scores are sent to the next generation unchanged. The remaining half for the next generation are selectively chosen by lot, like picking marbles from a bag, except that the chance for a particular choice is inversely proportional to the msa score (the lower the score, the better the msa, therefore giving that one a greater chance of replicating).

This latter one-half of the choices for the next generation is now subject to mutation, as described in Step 4 below, to produce the children of the next generation. All members of the next-generation msas undergo recombination to make new child msas derived from the two parents, as described in Step 5 below. The relative probabilities of these separate events are governed by program parameters. These parameters are also adjusted dynamically as the program is running to favor those processes that have been most useful for improving msa scores.

4. In the mutation process, the sequence is not changed (else it would no longer be an alignment), but gaps are inserted and rearranged in an attempt to create a better-scoring msa. In the gap insertion process, the sequences in a given msa are divided into two groups based on an estimated phylogenetic tree, and gaps of random length are inserted into random positions in the alignment. Alternatively, in a "hill-climbing" version of the procedure, the position is so chosen as to provide the best possible score following the change. Shown below (see Fig. 3) are random gap insertions into phylogenetically related sequences. The first two and last three sequences comprise the two related groups in this example. x indicates any sequence character.

Figure 3.

Another mutational process is to move common blocks of sequence (overlapping ungapped regions) delineated by a gap, or blocks of gaps (overlapping gaps). Some of the possible moves are illustrated below (see Fig. 4). These moves may also be tailored to improve the alignment score.

Figure 4.

5. Recombination among next-generation parent msas is accomplished by one of two mechanisms. The first is not homology-driven. One msa is cut vertically through, and the other msa is cut in a staggered manner that does not lose any sequence after the fragments are spliced. The higher scoring of the two reciprocal recombinants is kept. The second mechanism, illustrated below (see Fig. 5), is recombination between msas driven by conserved sequence positions. It is driven by homology expressed as a vertical column of the same residue and is very much like standard homologous recombination.

Figure 5.

6. The next generation, which overlaps the previous one because it consists of the best-scoring half of the parental msas plus the mutated children, is now evaluated as in Step 2. The cycle of Steps 2-5 is typically repeated as many as 100 times, although as many as 1000 generations can be run. The best-scoring msa is then kept.

7. The entire process of producing a set of msas for replication and mutation is repeated several times to obtain several possible msas, and the best-scoring one is chosen.
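The cycle in Steps 1-4 and 6 can be sketched roughly as below. This is a simplified illustration, not SAGA itself: recombination (Step 5) and the dynamic adjustment of operator probabilities are left out, the rows are split into two groups at random rather than by an estimated phylogenetic tree, and the cost function is a toy stand-in for SP scoring with a matrix and affine gap penalties. All function names and default values are hypothetical. As in the text, a lower score is treated as better.

```python
import random
from itertools import combinations


def sp_cost(msa, mismatch=1, gap=2):
    """Toy sum-of-pairs cost over all columns; lower is better."""
    cost = 0
    for column in zip(*msa):
        for a, b in combinations(column, 2):
            if a == b:
                continue  # identical residues or gap against gap cost nothing
            cost += gap if "-" in (a, b) else mismatch
    return cost


def initial_msa(seqs, max_offset, rng):
    """Step 1: shift each sequence by a random offset and pad the ends with gaps."""
    offsets = [rng.randint(0, max_offset) for _ in seqs]
    width = max(off + len(s) for off, s in zip(offsets, seqs))
    return tuple(("-" * off + s).ljust(width, "-") for off, s in zip(offsets, seqs))


def drop_empty_columns(msa):
    """Remove columns that contain only gaps."""
    columns = [c for c in zip(*msa) if any(ch != "-" for ch in c)]
    return tuple("".join(row) for row in zip(*columns))


def insert_gaps(msa, rng, max_len=3):
    """Step 4: insert a gap of random length at one random position per group.

    Assumption: the two groups are drawn at random here; SAGA derives them
    from an estimated phylogenetic tree.
    """
    k = rng.randint(1, max_len)
    width = len(msa[0])
    in_first_group = [rng.random() < 0.5 for _ in msa]
    p1, p2 = rng.randint(0, width), rng.randint(0, width)
    rows = []
    for row, first in zip(msa, in_first_group):
        p = p1 if first else p2
        rows.append(row[:p] + "-" * k + row[p:])
    return drop_empty_columns(tuple(rows))


def next_generation(population, rng):
    """Step 3: keep the best half unchanged, draw the rest by weighted lot, mutate."""
    ranked = sorted(population, key=sp_cost)
    keep = ranked[: len(ranked) // 2]
    # Chance of being drawn is inversely related to cost; +1 avoids dividing by zero.
    weights = [1.0 / (1 + sp_cost(m)) for m in ranked]
    parents = rng.choices(ranked, weights=weights, k=len(ranked) - len(keep))
    return keep + [insert_gaps(p, rng) for p in parents]


def evolve(seqs, pop_size=100, generations=100, max_offset=5, seed=0):
    """Steps 1-4 and 6 for one run; Step 7 would repeat this with other seeds."""
    rng = random.Random(seed)
    population = [initial_msa(seqs, max_offset, rng) for _ in range(pop_size)]
    for _ in range(generations):
        population = next_generation(population, rng)
    return min(population, key=sp_cost)
```

For example, `evolve(["GARFIELD", "GARFILD", "ARFIELD"], pop_size=20, generations=30)` returns the lowest-cost alignment in the final population. Because the best half is always carried over unchanged, the best cost in the population can never get worse from one generation to the next.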

Partial-Order Graphs

A dramatic improvement in the speed of producing an msa has been achieved by representing sequences and msas as partial-order graphs, a class of directed acyclic graphs (Lee et al. 2002). An example of a partial-order graph representation of an msa is illustrated in Figure 6. The partial-order graph representing an msa can be rapidly aligned with a sequence or with another msa-representing graph by dynamic programming in an amount of time proportional to the average number of branches per node. This method is particularly efficient for sequences that share many identities, as in overlapping sets of expressed sequence tag (EST) sequences. Another advantage to this method is that each stage of the alignment can store and use information from previous alignments, in contrast to progressive alignment methods, which use profiles that do not have this information. The method of alignment is illustrated in Figure 6C. The program POA (Partial Order Alignment) for a UNIX or Linux environment is available from http://packages.debian.org/etch/poa, and a web page for data input is also available.

Figure 6. A partial-order graph representation of an msa (see Lee et al. 2002). (A) A partial-order graph. The graph has a line of nodes representing columns of (black) conserved sequence positions in the msa joined by directed edges representing consecutive sequence letters that run from the start to the end of the graph. (Purple) Aligned substitutions and (blue) an internal insertion in the msa are depicted by loops with edges representing consecutive positions in the divergent sequences. (Green) Initial unaligned termini of the sequences are also depicted. Each pair of nodes is joined by only one edge, but one node can have more than one edge entering or leaving it, thereby serving as a junction. There are no connections between branches. (B) The msa represented by the graph. The graph depicts the msa in a compacted form from which the original msa can be derived, because the nodes representing conserved msa positions store information about the location of the character in each sequence. (C) Possible moves between cells in the dynamic programming matrix by aligning the graph in A at position P with a new sequence, also represented by a graph, by the Smith-Waterman dynamic programming algorithm. Branch joining at this position gives two possible diagonal moves (purple arrows, one for each branch), two horizontal moves, and one vertical move (blue arrows to each branch). In addition, there is a start move at each cell of score zero that is the graphical equivalent of the scoring system used in the Smith-Waterman local alignment of sequences. In simpler unbranched regions, only the three standard sequence alignment moves are used. Optimal scores are calculated in each cell, and the best overall score is found. The optimal alignment between sequence and nodes is then produced by a trace-back procedure similar to that used for sequence alignment until a start move is reached. 
The existing graph is then updated with the new alignment information according to a set of rules as described in Lee et al. (2002). As more sequences are added to a node, the lists of sequence and aligned sequence positions are also updated. This method does not follow a progressive alignment strategy as do CLUSTALW and T-COFFEE. The order of addition of sequences to the msa can influence the results in regions of low sequence similarity. Although not implemented in the version of POA described here, this information can be used to analyze rearrangements of conserved domains in the sequences.
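The graph representation described in the Figure 6 legend can be illustrated with a small sketch. It only shows how an msa might be stored as a partial-order graph and read back; it does not implement POA's dynamic programming alignment or its graph-update rules. The node layout (one node per distinct letter in a column) and the function names are assumptions chosen for the example.

```python
from collections import defaultdict


def msa_to_po_graph(msa):
    """Build a partial-order graph from equal-length gapped rows.

    nodes: {(column, letter): {sequence_index: position_in_ungapped_sequence}}
    edges: {node: set of successor nodes}
    """
    nodes = defaultdict(dict)
    edges = defaultdict(set)
    for s, row in enumerate(msa):
        previous = None
        position = 0
        for column, letter in enumerate(row):
            if letter == "-":
                continue  # gaps are not stored; they are implied by skipped columns
            node = (column, letter)
            nodes[node][s] = position  # remember where this letter sits in sequence s
            if previous is not None:
                edges[previous].add(node)  # directed edge between consecutive letters
            previous = node
            position += 1
    return dict(nodes), dict(edges)


def sequence_from_graph(nodes, s):
    """Recover ungapped sequence s from the positions stored on the nodes."""
    carried = [(members[s], node[1]) for node, members in nodes.items() if s in members]
    return "".join(letter for _, letter in sorted(carried))


def mean_out_degree(nodes, edges):
    """Average number of outgoing branches per node."""
    return sum(len(edges.get(node, ())) for node in nodes) / len(nodes)
```

With `msa = ("ACG-T", "ACGAT", "AGG-T")`, the conserved columns collapse into single shared nodes, the C/G substitution in the second column and the inserted A each form a short side branch, and every edge runs from a lower column to a higher one, so the graph has no cycles.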

Directed acyclic graphs are also used to model gene ontologies, which classify gene functions.

Hidden Markov Models of a Global Multiple Sequence Alignment

The hidden Markov model (HMM) is a probabilistic, statistical model that considers all possible combinations of matches, mismatches, and gaps to generate an alignment of a set of sequences. Both global and local msas (PROFILE HMMs) may be modeled, and the methods are quite similar. A discussion of HMMs can be found in Using Hidden Markov Models to Align Multiple Sequences (Mount 2008b).

REFERENCES

Corpet F. 1988. Multiple sequence alignment with hierarchical clustering. Nucleic Acids Res 16: 10881–10890.

Gotoh O. 1994. Further improvement in methods of group-to-group sequence alignment with generalized profile operations. Comput Appl Biosci 10: 379–387.

Gotoh O. 1995. A weighting system and algorithm for aligning many phylogenetically related sequences. Comput Appl Biosci 11: 543–551.

Gotoh O. 1996. Significant improvement in accuracy of multiple protein sequence alignments by iterative refinement as assessed by reference to structural alignments. J Mol Biol 264: 823–838.

Hirosawa M, Totoki Y, Hoshida M, Ishikawa M. 1995. Comprehensive study on iterative algorithms of multiple sequence alignment. Comput Appl Biosci 11: 13–18.

Kim J, Pramanik S, Chung MJ. 1994. Multiple sequence alignment by simulated annealing. Comput Appl Biosci 10: 419–426.

Lee C, Grasso C, Sharlow MF. 2002. Multiple sequence alignment using partial order graphs. Bioinformatics 18: 452–464.

Mount DW. 2008a. Using progressive methods for global multiple sequence alignment. Cold Spring Harb Protoc (this issue). doi: 10.1101/pdb.top43.

Mount DW. 2008b. Using hidden Markov models to align multiple sequences. Cold Spring Harb Protoc (this issue). doi: 10.1101/pdb.top41.

Mount DW. 2008c. Comparing programs and methods to use for global multiple sequence alignment. Cold Spring Harb Protoc (this issue). doi: 10.1101/pdb.ip61.

Mount DW. 2008d. Using multiple sequence alignment editors and formatters. Cold Spring Harb Protoc (this issue). doi: 10.1101/pdb.top45.

Notredame C, Higgins DG. 1996. SAGA: Sequence alignment by genetic algorithm. Nucleic Acids Res 24: 1515–1524.

Notredame C, O’Brien EA, Higgins DG. 1997. RAGA: RNA sequence alignment by genetic algorithm. Nucleic Acids Res 25: 4570–4580.

Shapiro B, Navetta J. 1994. A massively parallel genetic algorithm for RNA secondary structure prediction. J Supercomput 8: 195–207.

Zhang C, Wong AK. 1997. A genetic algorithm for multiple molecular sequence alignment. Comput Appl Biosci 13: 565–581.

