The recent determination of numerous bacterial and eukaryotic genome sequences poses new challenges for comparative sequence analysis. In addition to identifying local changes in the sequences of individual genes, the availability of genome sequences provides a basis for comparison of the structure and organization of genomes as a whole. Genomes are known to undergo several types of large-scale evolutionary events. Gene duplication can result in the existence of paralogous genes, whereas gene loss may remove a copy and obscure the assumption of orthology. Reordering of genetic elements occurs by mechanisms such as repeated inversion or translocation. Horizontal transfer introduces new genetic elements into bacterial genomes (Hacker and Carniel 2001). Furthermore, the rates and patterns of each event depend on the particular set of genomes being compared. For example, observations of gene duplication and repetitive sequences are much more common among higher eukaryotes than bacteria, whereas genome rearrangements can be readily observed between both closely related and divergent organisms of all types (Tillier and Collins 2000; Eichler and Sankoff 2003). Genome comparison systems must account for all of these evolutionary phenomena to provide a complete picture of genetic differences among organisms.
Early sequence comparison methods were designed to identify nucleotide substitutions and small insertions and deletions by computing an alignment of pairs of short sequences. Such early techniques as Needleman-Wunsch global alignment and Smith-Waterman local alignment use methods whose computation time scales as O(n2), where n is the length of input sequences. Numerous multiple sequence alignment and comparison methods are based on dynamic programming algorithms similar to Smith-Waterman and Needleman-Wunsch (Thompson et al. 1994; Morgenstern et al. 1996; Morgenstern 1999; Notredame et al. 2000; Lee et al. 2002). Such pairwise and multiple sequence alignment methods suffer the limitation that application to long (typically n > 10 kb) sequences is prohibitively timeconsuming (Ureta-Vidal et al. 2003).
The availability of genome sequences demands methods for aligning long genomic DNA sequences. Several heuristic approaches to align long sequences have been developed under the assumption that highly similar subsequences can be found quickly and are likely to be part of the correct global alignment. These local alignments are used to anchor a global alignment, reducing the number of possible global alignments considered during a subsequent O(n2) dynamic programming step. Some spurious local alignments are typically found because of random sequence similarity, particularly when using a sensitive local alignment method. A method for selecting alignment anchors must be used to filter out spurious matching regions. Alignment tools such as MUMmer, GLASS, AVID, and WABA align pairs of long sequences, implementing various methods to discover local alignments (Delcher et al. 1999; Batzoglou et al. 2000; Kent and Zahler 2000; Morgenstern 2000; Bray et al. 2003). Similar multiple sequence alignment methods for long sequences have been developed and implemented in software packages such as MAVID, MLAGAN, and MGA (Hohl et al. 2002; Bray and Pachter 2003; Brudno et al. 2003a). All of these pairwise and multiple sequence aligners assume the input sequences are free from significant rearrangements of sequence elements, selecting a single collinear set of alignment anchors.
Recently, methods have been developed to perform pairwise genome comparison in the presence of rearrangements. Shuffle-LAGAN, a variant of the LAGAN alignment system, was the first genome comparison method described that explicitly deals with genome rearrangements during the alignment process (Brudno et al. 2003b). Like other genome alignment methods, Shuffle-LAGAN uses an anchored alignment approach. Rather than selecting a single collinear set of anchors, Shuffle-LAGAN selects anchors collinear in the first sequence with rearrangements permitted in the other sequence. Although Shuffle-LAGAN’s alignment approach works for pairwise comparison, an extension of the method to multiple genome sequences has not yet been suggested.
MultiPipMaker, based on BLASTZ, is a tool that can align multiple genomes to a single reference genome in the presence of rearrangements (Schwartz et al. 2003a). MultiPipMaker uses BLASTZ (Schwartz et al. 2003b) on each pair of reference and nonreference genomes to calculate pairwise local alignments. These local alignments are used to construct a rough global alignment that is iteratively refined. Because MultiPipMaker does not provide a mechanism for global alignment of regions not included in the initial local alignments, more divergent homologous regions between local alignments may remain unaligned. As such, MultiPipMaker can best be described as a multiple local aligner for genome sequences, rather than a global aligner. Furthermore, neither Shuffle-LAGAN nor MultiPipMaker provides a means to precisely identify the breakpoints of multiple genome rearrangements.
During the past several years, researchers from around the world published the finished genome sequences of several enterobacteria, nine of which we presently consider (Table 1). Previous studies have shown that these nine enterobacterial genomes have undergone significant horizontal transfer and numerous genome rearrangements since their divergence. However, a lack of effective tools has constrained comparison of the rates and patterns of large-scale evolutionary processes in these bacteria to pairwise and three-way studies.
We describe a genome comparison method that identifies conserved genomic regions, rearrangements and inversions in conserved regions, and the exact sequence breakpoints of such rearrangements across multiple genomes. Furthermore, our comparison method performs traditional multiple alignment of conserved regions to identify nucleotide substitutions and small insertions and deletions (indels). We implemented our methods in a genome alignment package called Mauve. Mauve represents the first alignment system that integrates analysis of large-scale evolutionary events with traditional multiple sequence alignment. By integrating these previously separate analysis steps, Mauve provides additional ease-of-use and sensitivity over other systems when comparing genomes with significant rearrangements.
Like other genome alignment methods, Mauve uses anchoring as a heuristic to speed alignment. Unlike other multiple genome alignment systems, Mauve’s anchor selection method relaxes the assumption that the genomes under study are collinear. Instead, Mauve identifies and aligns regions of local collinearity called locally collinear blocks (LCBs). Each locally collinear block is a homologous region of sequence shared by two or more of the genomes under study, and does not contain any rearrangements of homologous sequence. The algorithms described in this paper are limited to identifying LCBs that contain sequence elements conserved among all the genomes being aligned; in the general case, however, an LCB may be composed solely of sequence regions shared by a subset of the genomes. Remaining unaligned regions conserved among a subset of the genomes can be extracted and aligned using other methods.
The locally collinear blocks identified by Mauve’s anchor selection algorithm are required to meet a user-specified minimum weight criteria as described in the Methods section. The weight of an LCB provides a measure of confidence that it is a true genome rearrangement rather than a spurious match. By selecting a high minimum weight during alignment, the user can identify genome rearrangements that are very likely to exist, whereas by selecting a lower minimum weight, the user can trade some specificity for sensitivity to smaller genome rearrangements.
Prior to Mauve, other methods have been developed to identify homologous regions of genome sequence in the presence of large-scale rearrangements, a problem also known as strip generation. Such methods typically use some metric to cluster matches between two or more genomes then evaluate which “clusters” represent homologous regions of interest rather than spurious matches. GRIMM-Synteny is one such method that clusters matches within some given gap distance and then removes clusters that span less than a given length of the chromosome (Pevzner and Tesler 2003a). FISH, another software package, implements a similar clustering method but uses a statistical framework to determine which clusters of matches are significant (Calabrese et al. 2003). Unlike Mauve, GRIMM and FISH do not identify strictly collinear clusters of matches necessary for genome alignment, nor do they perform recursive homology detection. However, with extensions similar in nature to the Mauve algorithm, GRIMM and FISH could become suitable methods for alignment anchor selection.
In addition to the Mauve alignment algorithm, a simple viewing system has been developed to display the rearrangement structure of several genome sequences. The viewer uses the first sequence to assign a reference orientation to LCBs in the remaining sequences. Thus, regions that are in the reverse-complement orientation relative to the first sequence appear inverted in the viewer. Because the boundaries of rearrangement have been determined, the viewer is able to draw a single line that logically connects the entire homologous collinear blocks from each genome. Previous visualization systems drew one line per local alignment, often yielding a confusing picture of complex rearrangement structures.
Finally, to make an informed decision when choosing between alignment tools, it is important to have not only an understanding of the algorithms used but also the empirical performance of the alignment system. Toward this end, we empirically characterized our alignment system and compared its performance with other well-known genome alignment systems. Manually validating a benchmark alignment on the genome scale is too labor-intensive. Instead, we developed a simple genome evolution simulation system that incorporates large- and small-scale evolutionary events. Because the evolutionary history is known, the simulator can generate the “correct” alignment in addition to the evolved sequences. We measured the ability of Mauve and other genome aligners to reproduce the “correct” alignment for the evolved sequences.
The Mauve alignment system and visualization environment are available for download from http://gel.ahabs.wisc.edu/mauve.