[13:27 5/1/2011 Bioinformatics-btq674.tex] Page: 311 311–316 BIOINFORMATICS ORIGINAL PAPER Vol. 27 no. 3 2011, pages 311–316 doi:10.1093/bioinformatics/btq674 Genome analysis Advance Access publication December 6, 2010 Algorithms for sorting unsigned linear genomes by the DCJ operations Haitao Jiang1,2, Binhai Zhu1,∗ and Daming Zhu2 1Department of Computer Science, Montana State University, Bozeman, MT 59717, USA and 2School of Computer Science and Technology, Shandong University, Jinan, China Associate Editor: Martin Bishop ABSTRACT Motivation: The double cut and join operation (abbreviated as DCJ) has been extensively used for genomic rearrangement. Although the DCJ distance between signed genomes with both linear and circular (uni- and multi-) chromosomes is well studied, the only known result for the NP-complete unsigned DCJ distance problem is an approximation algorithm for unsigned linear unichromosomal genomes. In this article, we study the problem of computing the DCJ distance on two unsigned linear multichromosomal genomes (abbreviated as UDCJ). Results: We devise a 1.5-approximation algorithm for UDCJ by exploiting the distance formula for signed genomes. In addition, we show that UDCJ admits a weak kernel of size 2k and hence an FPT algorithm running in O(22kn) time. Contact: bhz@cs.montana.edu Received on September 27, 2010; revised on December 1, 2010; accepted on December 2, 2010 1 INTRODUCTION Computing genomic distance on gene order is a fundamental problem in computational biology. In the last two decades, a variety of biological operations, such as reversals, translocations, fusions, fissions, transpositions and block-interchanges, have been proposed to handle gene order. The double cut and join operation, introduced byYancopoulos et al., 2005, unifies all the classical operations. In the past, the rearrangement distance for signed genomes is well studied for single operations, like reversals (Hannenhalli and Pevzner, 1999), combinations of operations (reversals, translocations, fusions and fissions) (Hannenhalli and Pevzner, 1995) and universal operations (double cut and join) (Bergeron et al., 2006; Yancopoulos et al., 2005). Unfortunately, as for unsigned genomes, most of these problems seem to be NP-hard. Then it is natural to devise relevant approximation algorithms. A 1.5-approximation algorithm was devised for sorting by unsigned reversals (Christie, 1998), and the approximation factor was improved to 1.375 by Berman et al., 2002. The problem of sorting by unsigned translocations was investigated by Cui et al., 2008, and an algorithm with an approximation factor of 1.5+ε was proposed. Transposition, though occurring much less than reversal and translocation, is an indispensable operation in the evolutionary events. The problem of sorting by transpositions was first studied by Bafna and Pevzner, 1998, who devised a ∗To whom correspondence should be addressed. 1.5-approximation algorithm running in quadratic time. Later, the approximation factor was improved to 1.375 by Elias and Hartman, 2006. The problem of sorting by short block-moves, a special but more practical case of transpositions, was studied by Jiang and Zhu, 2011, and they obtained an 14/11-approximation algorithm. The design of FPT algorithms for genome rearrangement problems was started very recently, with the help of weak kernels. (Intuitively, an FPT algorithm is an exact algorithm which runs in polynomial time when the problem solution size, like the number of unsigned reversals to sort a sequence, is bounded by a constant. The relevant formal definitions will be given in the next section.) Both sorting by unsigned reversals and sorting by unsigned translocations admit small weak kernels, hence are in FTP (Jiang et al., 2010). As far as we know, the only known positive result for sorting unsigned genomes by minimum DCJ operations (or interchangeably, the unsigned DCJ distance problem) is a factor-1.416 approximation for the case of linear unichromosomal genomes (Chen, 2010). Of course, even in this case the problem involves computing a maximum alternating-cycle decomposition (MAX-ACD) of the breakpoint graph, which is NP-complete (Caprara, 1999); therefore, it is not surprising that the unsigned DCJ distance problem is NP-complete, even for linear unichromosomal genomes (Chen, 2010). Prior to our current work, there has been no FPT algorithm known for the unsigned DCJ distance problem. Our contributions: In this article, we introduce DCJ operations on unsigned linear multichromosomal genomes to compute the corresponding genomic distance. We devise a 1.5-approximation algorithm for linear multichromosomal genomes in Section 3. In Section 4, we obtain a weak kernel of size 2k for UDCJ; moreover, we present an FPT algorithm running in O(22kn) time. 2 PRELIMINARIES Gene, chromosome and genome: An unsigned gene is a sequence of DNA, usually denoted by a positive integer. A chromosome can be viewed as a sequence of genes and denoted by a permutation, while a genome is a set of chromosomes. A gene that lies at the end of some linear chromosome is called an end-gene. Gene i and j form an adjacency if they are consecutive in some chromosome. An adjacency (gi,gi+1) is perfect if it satisfies |gi+1 −gi|=1. A chromosome is perfect if every adjacency is perfect. A genome is perfect if all its chromosomes are perfect. As a convention, we always list the genes in a perfect genome in increasing order. For instance, a perfect genome with 3 chromosomes and 10 genes can be listed as (1, 2, 3, 4), (5, 6, 7) and (8, 9, 10). We study unsigned linear © The Author 2010. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oup.com 311 atMasarykUniversityonFebruary21,2011bioinformatics.oxfordjournals.orgDownloadedfrom [13:27 5/1/2011 Bioinformatics-btq674.tex] Page: 312 311–316 H.Jiang et al. multichromosomal (multilinear or simply linear, for short) genomes in this article. Breakpoint graph: Above all, we recall the well-known tool for computing the genomic rearrangement distance, the Breakpoint Graph (Bafna and Pevzner, 1998). Given two unsigned genomes A and B on the same set of n genes, the Breakpoint Graph BG(A,B)= (V,Eb ∪Eg), where |V|=n and each vertex in V corresponds to a gene, every adjacency in A forms a black edge belonging to Eb and every adjacency in B forms a gray edge belonging to Eg. It is known that in this case computing a maximum alternating-cycle decomposition in BG(A,B) is NP-complete (Caprara, 1999). As for signed genomes F and H, the breakpoint graph BGs(F,H) is a bit different. Due to the sign, each gene has one head and one tail corresponding to two vertices in the breakpoint graph. Consequently, the head has only one adjacency in F and H respectively, so does the tail. Then each vertex in the breakpoint graph has degree at most two, which means that the breakpoint graph is composed of cycles and paths, and the black edges and gray edges appear alternatively in the cycles or paths. So the maximum alternating-cycle decomposition is easy in this case. A cycle that contains l black edges is called an l-cycle. The double cut and join operations: The Double Cut and Join operation (abbreviated as DCJ) unifies all the traditional genome rearrangement operations such as reversal, translocation, fusion, fission, transposition and block interchange, as well as excision, integration, circularization and linearization. The formal definition of a DCJ operation on the breakpoint graph is as follows. Definition 1. The double cut and join operation acts on the breakpoint graph in the following four ways (Fig. 1): (1) For two black edges b1 =(gi,gi+1) and b2 =(gj,gj+1), cut them, and either form two new black edges b1 =(gi,gj+1) and b2 =(gj,gi+1) or form two new black edges b1 =(gi,gj) and b2 =(gi+1,gj+1). (2) For a black edge b=(gi,gi+1) and an end-gene gj, cut the black edge, and either form a new black edge b =(gi,gj) and a new end-gene gi+1 or form a new black edge b =(gj,gi+1) and a new end-gene gi. (3) For two end-genes gi and gj, join them with a black edge (gi,gj). (4) For a black edge b=(gi,gi+1), cut it into two end-genes gi and gi+1. Note that the black edges and end-genes involved in one DCJ operation can be in the same chromosome, then a circular chromosome may form after some DCJ operations. Problem statement: We now formally formulate the problem to be investigated in this article. Sorting unsigned genomes by the DCJ operations (UDCJ): Input: Two unsigned linear genomes A and B and an integer k. Question: Can A be converted into B by a series of k DCJ operations ρ1,ρ2,...,ρk? The minimum k is the unsigned DCJ distance between A and B. Following the results in (Caprara, 1999; Chen, 2010), UDCJ is also NP-complete. W.L.O.G, assume that B is perfect. Let lA and lB be the number of linear chromosomes in A and B, respectively, we can also assume Fig. 1. The DCJ operation. that lA ≥lB, since all the DCJ operations are reversible, which means that if there exists consecutive DCJ operations ρ1ρ2...ρm that convert A into B, then we can also convert B into A by ρ−1 m ρ−1 m−1...ρ−1 1 , where ρ−1 i is the reversed operation of ρi. FPT and weak kernel: Basically, a fixed parameter tractable (FPT) algorithm for a decision problem with solution value k is an algorithm, which solves the problem in O(f (k)nc)=O∗(f (k)) time, where f is any function only on k, n is the input size and c is some fixed constant not related to k. FPT also stands for the set of problems that admit such an algorithm (Downey and Fellows, 1999; Flum and Grohe, 2006). Weak kernel is a relatively new concept; intuitively, it refers to the direct or indirect ‘search space’ to solve a search problem. For a search problem in NP, if it admits a weak kernel of size g(k), then it is in FPT (Jiang et al., 2010). We comment that weak kernel is different from the traditional kernel in which the problem instance size is reduced (to a function of k), while a weak kernel only implies that the direct or indirect solution search space is reduced (to a function of k). More details can be found in (Jiang et al., 2010). 3 A 1.5-APPROXIMATION ALGORITHM In this section, we present a 1.5-approximation algorithm for double cut and join distance on unsigned multilinear genomes. We first comment that the method by Chen (2010) cannot be converted to solve our problem, as with multilinear genomes the underlying breakpoint graph is more complex (i.e. possibly with many paths). Given an original genome A with lA chromosomes and a target perfect genome B with lB chromosomes, our goal is to convert A into B by a series of DCJ operations so that the number of DCJ operations is as few as possible. To design an approximation algorithm, we first need the structure properties of UDCJ, which in fact can be obtained from the corresponding signed genomes. 3.1 Structure properties of UDCJ For an unsigned genome A, a signed-version of A is obtained by assigning ‘+ or ‘− to each gene in A, with ‘+ signs usually omitted. Obviously, every genome of n genes has exponential, i.e. 2n, signed versions. Given two signed genomes F,H, we use DCJs(F,H) to denote their signed DCJ distance. Theorem 1. Given two unsigned linear multichromosomal genomes A and B, let the minimum DCJ distance between A,B be DCJ(A,B). 312 atMasarykUniversityonFebruary21,2011bioinformatics.oxfordjournals.orgDownloadedfrom [13:27 5/1/2011 Bioinformatics-btq674.tex] Page: 313 311–316 Sorting by DCJ operations Then DCJ(A,B)=DCJs(A∗,B+), where A∗ is some signed version of A, and B+ is a special signed version of B with all signs being positive. Proof. Notice that, loosely speaking, we can take B=B+. (⇒) Assume that there exists a series of consecutive DCJ operations ρ1ρ2...ρm that convert A into B. We say that a DCJ operation ρ changes the sign of a gene g if ρ involves reversing a segment of genes including g. For each gene g in A, let Tg denote the number of times that the sign of g is changed if we trace all the m DCJ operations. g is assigned ‘− , if Tg is odd; and ‘+ , if Tg is even. Then we obtain a signed version of A, A∗, which can be converted into B+ by the m equivalent signed DCJ operations. Thus, DCJ(A,B)≥DCJs(A∗,B+). (⇐) If there exists a signed version A∗ of A that can be converted into B+ by m signed DCJ operations ρ1ρ2...ρm, then we can also use these m (signed) DCJ operations to convert A into B, ignoring the gene signs. Thus DCJ(A,B)≤DCJs(A∗,B+). We now proceed to obtain the necessary properties of the optimal solution. First of all, in order to avoid distinct end points of chromosomes in A and B, we add unlabeled caps to both ends of each linear chromosome in genomes A and B, respectively, then connect the A-cap and its adjacent end-gene with a black edge and the B-cap and its adjacent end-gene with a gray edge in BG(A,B). The above preprocess is called capping. Note that each gene in BG(A,B) has degree 4 after capping, i.e. with two black edges and two gray edges. After capping, genomes A and B become ¯A and ¯B, respectively. We denote the resulting graph by BG(¯A, ¯B). As it seems to be hard to extract the properties of the optimal solution from BG(¯A, ¯B) directly, we take a detour. We notice that, for signed genomes F and H, after capping each vertex in the breakpoint graph BGs(F,H) has degree two and each cap has degree one, which means that all the paths end with caps. A path with an A-cap end and a B-cap end (respectively, two A-cap ends, two B-cap ends) is an AB-path (respectively, AA-path, BB-path). There are three ways to construct cycles from BGs(F,H)in the breakpoint graph BGs(F,H) of signed genomes F and H, after capping . (1) single-identifying: identify the two caps of each AB-path, close the path into a cycle containing just one A-cap (with the B-cap eliminated). (2) double-identifying: identify each B-cap of a BB-path and each A-cap of an AA-path, join an AA-path and a BB-path into a cycle containing twoA-caps (with the two B-caps eliminated). (3) joining: connect the two A-caps of an AA-path with a gray edge. Let BGs( ¯F, ¯H) denote the resulting breakpoint graph after constructing cycles from BGs(F,H) following the above three ways. Then the signed DCJ distance between the signed genomes ¯F and ¯H, DCJs( ¯F, ¯H)=b−c, where b is the number of black edges and c is the number of cycles in BGs( ¯F, ¯H) (Yancopoulos et al., 2005). In Figure 2, we show an example of ¯F, ¯H and BGs( ¯F, ¯H), before the identifying and joining operations are performed. In the figure, an empty round (respectively, square) node is anA-cap (respectively, B-cap); moreover, in BGs( ¯F, ¯H), a signed gene +i (respectively, −i) is already converted to (2i−1,2i) [respectively, (2i,2i−1)]. After two single-identifying operations are performed, we have two new Fig. 2. The breakpoint graph BGs( ¯F, ¯H), before the identifying and joining operations are performed. cycles (6) and (9, 8, 4, 5, 14). After a double-identifying operation is performed, we have a new cycle (2, 3, 7, 1).After a joining operation is performed, we have a new cycle (12, 13). It is worth mentioning that this distance formula is equivalent to that of Bergeron et al., 2006, i.e. DCJs(F,H)=n−C− I/2 , where n is the number of genes, C is the number of cycles and I is the number of odd paths in their corresponding adjacency graph. To see this, note that I also equals to the number of AB-paths in the breakpoint graph; in addition, we have b=n+lA, c=C+I + (2lA −I)/2 . So DCJs( ¯F, ¯H)=DCJs(F,H). Corollary 1. Given two unsigned linear multichromosomal genomes A and B, let A∗ and B+ be defined as in Theorem 1. Then DCJ(A,B)=DCJs(¯A∗, ¯B+), where ¯A∗ (respectively, ¯B+) is a capping of A∗ (respectively, B+). Proof. It follows from Theorem 1 that DCJ(A,B)= DCJs(A∗,B+). The statements in the last paragraph show that DCJs(A∗,B+)=DCJs(¯A∗, ¯B+). Then the corollary follows. Notice that computing an alternating-cycle decomposition of BG(¯A, ¯B) is equivalent to finding a signed version of ¯A. To extract the properties of the optimal solution, we first try to make use of the breakpoint graph BG(¯A, ¯B) instead of BG(A,B). Following Corollary 1, we can now make use of the breakpoint graph BGs(¯A∗, ¯B+). From the way BGs(¯A∗, ¯B+) is constructed, we only need to find an optimal ¯A∗ such that the number of disjoint alternating-cycles in BGs(¯A∗, ¯B+) is maximized. The reason is that the number of black edges in BG(¯A, ¯B) is fixed. Then we have dopt =DCJ(A,B)=DCJs(¯A∗, ¯B+)=b−c1 −c2 − c3, where b is the number of black edges in BGs(¯A∗, ¯B+), c1 and c2 are the number of 1-cycles and 2-cycles in BGs(¯A∗, ¯B+), respectively, and c3 is the number of cycles with three or more black edges in BGs(¯A∗, ¯B+). Obviously, c3 ≤(b−c1 −2c2)/3, thus we have the following formula: dopt = b−c1 −c2 −c3 ≥ b−c1 −c2 −(b−c1 −2c2)/3 313 atMasarykUniversityonFebruary21,2011bioinformatics.oxfordjournals.orgDownloadedfrom [13:27 5/1/2011 Bioinformatics-btq674.tex] Page: 314 311–316 H.Jiang et al. (a) (b) (c) (d) Fig. 3. 1-cycle containing two genes. = 2(b−c1)/3−c2/3 = 2 3 ·(b−c1 −c2/2). The above formula implies that, if we can convert A into B by at most b−c1 −c2/2 DCJ operations, then we obtain a 1.5-approximation algorithm for UDCJ. 3.2 The algorithm The idea of our approximation algorithm is as follows. We compute BG(¯A, ¯B) and try to first keep all the 1-cycles in it. Then we compute many 2-cycles from BG(¯A, ¯B) (in fact, at least c2/2 such 2-cycles). We comment that a similar idea was used by Christie (1998) on sorting by unsigned reversals. On the other hand, the LP-relaxation algorithm by Chen, 2010 cannot handle paths (and caps) so it cannot be immediately generalized to solve our problem. The following lemma, which involves handling paths and caps, shows that keeping all the 1-cycles in BG(¯A, ¯B) is a good strategy to obtain some optimal alternating-cycle decomposition of it. Lemma 1. There is some maximum alternating-cycle decomposition of BG(¯A, ¯B) in which all c1 1-cycles in BG(¯A, ¯B) are kept. Proof. We modify the optimal alternating-cycle decomposition in BG(¯A, ¯B) in such a way: if two genes, say gi and gi+1, are connected by a black edge and a gray edge, then we reassign the signs of these two genes to obtain a 1-cycle; if a gene, say gi, is connected to an A-cap by a black edge and to a B-cap by a gray edge, then we reassign the sign of the gene and identify the two caps to obtain a 1-cycle. If the newly obtained 1-cycle contains two genes, then there are two cases. Case (I): Only one of the signs of gi and gi+1 is changed. W.L.O.G, assume that the sign of gi is changed, see Figure 3a. The number of cycles is increased by one. Case (II): Both of the signs of gi and gi+1 are changed, see Figure 3b–d. The number of cycles is increased by two or one or unchanged, respectively. If the newly obtained 1-cycle contains one gene and a cap (which is identified by an A-cap a and a B-cap b), then there are four cases. Note that b must be identified with some A-cap a . Case (1): The A-cap a joins with another A-cap a . The number of cycles is unchanged. See Figure 4a. Case (2): TheA-cap a is identified with a B-cap b and a,b belongs to distinct cycles. The number of cycles is unchanged. See Figure 4b. Case (3): TheA-cap a is identified with a B-cap b and a,b belongs to the same cycle. The number of cycles is increased by one. See Figure 4c. Case (4): The A-cap a is identified with the B-cap b but the cycle containing a,b also contains two identified caps a and b . The number of cycles is increased by one. See Figure 4d. Following Lemma 1, we know that keeping all the 1-cycles in in BG(¯A, ¯B) will not affect the value of some optimal alternating-cycle decomposition of it. Therefore, from now on we only focus on the optimal alternating-cycle decomposition of BG(¯A, ¯B), which always keeps all the 1-cycles. Consequently, in order to approximate the optimal DCJ distance, we just need to find out as many as at least half of the 2-cycles in an optimal alternating-cycle decomposition of BG(¯A, ¯B) (which keeps all 1-cycles). Now we present the algorithm 2-Cycle Decomposition to compute such 2-cycles. In this algorithm, we first construct a graph G1 whose vertices are the black edges (not in any 1-cycle) in BG(¯A, ¯B) and M is a maximum matching in G1. Note that the maximum matching M can be computed in polynomial time (Galil et al., 1986); moreover, each edge in M results in a candidate 2-cycle. In order to bound the cardinality of S, we need the following lemmas. Lemma 2. Let M be a maximum matching in G1, then |M|≥c2. Proof. Following the discussion in Section 3.1, c2 corresponds to the number of 2-cycles in an optimal alternating-cycle decomposition of BG(¯A, ¯B). These 2-cycles clearly form a matching in G1. By the maximality of M, we have |M|≥c2. Algorithm 2-Cycle Decomposition Input: BG(¯A, ¯B) Output: A set of edge-disjoint 2-cycles 1 Construct a graph G1 =(P,E1) as follows: 1.1 Each black edge in BG(¯A, ¯B), not contained in any 1-cycle, corresponds to a vertex in P. 1.2 For each pair of vertices u=(gi,gi+1) and v=(gj,gj+1) corresponding to black edges between two genes, (u,v)∈E1 iff there exist two gray edges which can form a 2-cycle together with these two black edges. 1.3 For each pair of vertices u=(gi,ai) and v=(gj,aj) corresponding to black edges between a gene and a A-cap, (u,v)∈E1 iff there exist a gray edge (gi,gj) in BG(¯A, ¯B). 1.4 For a 2-gene vertex u=(gi,gi+1) and a 1-gene-1-cap vertex v=(gj,aj), (u,v)∈E1 iff there exist two gray edges (gi,gj) and (gi+1,bj) or two gray edges (gi+1,gj) and (gi,bj+1) in BG(¯A, ¯B), where bj and bj+1 are B-caps. 2 Compute a maximum matching M in G1. 3 Construct a graph G2 =(Q,E2) as follows: 3.1 Each 2-cycle computed at Step 2 corresponds to a vertex in Q. 3.2 Two vertices in Q form an edge in E2 iff their corresponding cycles share a gray edge. 4 Compute a maximum independent set S of G2. 5 Return S which is a set of edge-disjoint 2-cycles. Lemma 3. The graph G2 is composed of simple paths and isolated vertices. Proof. All 2-cycles computed at Step 2 cannot share black edges. Since each gray edge is connected to at most four black edges, at most two 2-cycles which do not share black edges can share this gray edge. Equivalently, each gray edge can belong to at most two cycles computed from M. Each 2-cycle has two gray edges, so each vertex in G2 has degree at most two. 314 atMasarykUniversityonFebruary21,2011bioinformatics.oxfordjournals.orgDownloadedfrom [13:27 5/1/2011 Bioinformatics-btq674.tex] Page: 315 311–316 Sorting by DCJ operations (a) (b) (c) (d) Fig. 4. 1-cycle containing one gene and one cap. Fig. 5. An example of 2-cycles sharing gray edges. It is sufficient to prove that G2 does not contain cycles. Assume to the contrary that 2-cycles C1C2...Cr form a cycle in G2, where Ci shares gray edge gi with Ci+1, 1≤i≤r−1, and Cr shares gr with C1. Then C1 contains two gray edges g1 and gr, but the end points of g1 and gr cannot form two black edges (otherwise these two black edges will force into some black cycle—which implies that the input genome contains some circular chromosome). See Figure 5. It is obvious that every 2-cycle containing caps has degree at most one in G2, because the gray edge containing caps cannot be shared by two 2-cycles computed from M. The property we just proved in Lemma 3 is important for us to compute a maximum independent set in G2 (without this property, the computation of a maximum independent set might be intractable). Lemma 3 immediately implies the next lemma. Lemma 4. Let S be a maximum independent set in G2, then |S|≥ |M| 2 . Note that if a gene is contained in some 1-cycle, then its sign can be fixed easily, i.e. if the black edge reads from (left to right) like (i,i+1) then both genes i and i+1 will be given positive signs, otherwise they will be given negative signs. If a gene is contained in some 2-cycle, its sign is fixed similarly. For instance, if in a 2-cycle the two black edges read like (i,j),(i+1,j+1) (from left to right), then the signing should be +i,−j,−(i+1),+(j+1). The other cases, e.g., when the directions of these black edges are possibly changed, are very much symmetric hence omitted. To complete the cycle decomposition, we arbitrarily assign signs to the remaining genes, then properly identify and join the remaining caps in the corresponding breakpoint graph. The complete Whole-Cycle Decomposition algorithm is presented as follows. Algorithm Whole-Cycle Decomposition Input: BG(¯A, ¯B) Output: BGs( ¯A , ¯B+) 1 Keep all 1-cycles and assign proper signs to genes involved in the 1-cycles in BG(¯A, ¯B). 2 Call 2-Cycle Decomposition, and assign proper signs to genes involved in the resulting 2-cycles. 3 Assign arbitrary signs to the remaining genes to have a signed genome ¯A . 4 Construct BGs( ¯A , ¯B+) by identifying and joining caps with single-identifying, double-identifying and joining operations. Notice that once we have BGs( ¯A , ¯B+) it is straightforward to compute the signed DCJ distance dwcd =DCJs( ¯A , ¯B+) in linear time (Bergeron et al., 2006; Yancopoulos et al., 2005). Theorem 2. Algorithm Whole-Cycle Decomposition approximates the DCJ distance between two unsigned linear multichromosomal genomes with a factor of 1.5. Proof. From Lemma 4, we know that |S|≥ c2/2 ; it follows from Lemma 1 that c1 =c1. The distance computed by Algorithm Whole-Cycle Decomposition is dwcd ≤b−c1 −|S|. The optimal distance dopt satisfies that dopt ≥2(b−c1 −c2/2)/3. Thus, dwcd ≤ (b−c1 − c2/2 )≤1.5dopt. 4 A WEAK KERNEL AND AN FPT ALGORITHM Similar to the problem of sorting by unsigned reversals and sorting by unsigned translocations (Jiang et al., 2010), the UDCJ problem also possesses a (small and indirect) weak kernel. Let k be the minimum number of DCJ operations converting A into B. A weak kernel for UDCJ is a set of genes in A whose signs cannot be fixed after the genes involved in all 1-cycles have been properly signed (following Lemma 1). Before computing the size of the weak kernel, we state the following lemma, which is simple but critical. Lemma 5. Each DCJ operation can generate at most two 1-cycles. Proof. Each DCJ operation cuts two black edges and forms at most two new black edges. Each new black edge can form at most one 1-cycle. Theorem 3. The UDCJ problem has a weak kernel of size 2k, hence can be solved in O∗(22k) time. Proof. The 2k weak kernel is straightforward from Lemma 1 and Lemma 5. For any optimal alternating-cycle decomposition of BG(¯A, ¯B) which contains all possible number of c1 1-cycles, we have k =b−c1 −c2 and c2 ≤(b−c1)/2, where c2 is the number of cycles of length at least 2 in the optimal alternating-cycle decomposition of BG(¯A, ¯B). Thus, k ≥(b−c1)/2, equivalently, (b−c1)≤2k. Following Lemma 1, we can assign signs to all genes involved in 1-cycles. So each of the remaining gene is connected to two black edges, and each black edge has at most two unsigned genes as its end points, which means that the number of unsigned genes N is bounded by the number of black edges not involved in any 1-cycle, e.g. N ≤b−c1 ≤2k. Hence, the problem admits a weak kernel of size 2k. 315 atMasarykUniversityonFebruary21,2011bioinformatics.oxfordjournals.orgDownloadedfrom [13:27 5/1/2011 Bioinformatics-btq674.tex] Page: 316 311–316 H.Jiang et al. In other words, if the DCJ distance is equal to or smaller than k, there are at most 22k signed versions of A among which there must be an optimal one (e.g. A∗ in Theorem 1). For each signed version of A, we can exploit the algorithm in Bergeron et al. (2006); Yancopoulos et al. (2005) to check whether it can be converted into B+ by k or few DCJ operations. If so, we can compute the corresponding k unsigned DCJ operations to convert A into B. If no valid solution is found, we report NO. This algorithm clearly runs in O(22kn)=O∗(22k) time. 5 DISCUSSION In this article, we devise the first approximation algorithm with a factor of 1.5 and an FPT algorithm running in O(22kn) time for the NP-complete problem of sorting linear multichromosomal genomes under unsigned DCJ distance. It is interesting to improve the approximation factor as well as the running time of the FPT algorithm. For genomes containing circular chromosomes, our approximation algorithm cannot achieve the same performance as linear genomes, so it is also meaningful to handle the problem of sorting mixed genomes (i.e. with both linear and circular chromosomes) under unsigned DCJ distance. ACKNOWLEDGEMENT We thank anonymous reviewers for their valuable comments. Funding: NSF grant (DMS-0918034); NSF of China under grant (60928006 and 61070019 in part). Conflict of Interest: none declared. REFERENCES Bafna,V. and Pevzner,P. (1998) Sorting by Transpositions. SIAM J. Discrete Math., 11, 224–240. Bergeron,A. et al. (2006) A unifying view of genome rearrangements. In Proceedings of the 6th International Workshop on Algorithms in Bioinformatics (WABI’06), Springer, Germany, pp. 163–173. Berman,P. et al. (2002) 1.375-Approximation algorithm for sorting by reversals. In Proceedings of the 10th Annual European Symposium on Algorithms (ESA’02), Springer, Germany, pp. 200–210. Caprara,A. (1999) Sorting permutations by reversals and Eulerian cycle decompositions. SIAM J. Discrete Math., 12, 91–110. Chen,X. (2010) On sorting permutations by double-cut-and-joins. In Proceedings of the 16th International Conf. on Computing and Combinatorics (COCOON’10), Springer, Germany, pp. 439–448. Christie,D. (1998) A 3/2-Approximation algorithm for sorting by reversals. In Proceedings of the 9th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA’98), ACM, New York, pp. 244–252. Cui,Y. et al. (2008) A (1.5 + )-Approximation algorithm for unsigned translocation distance. IEEE/ACM Trans. Comput. Biol. Bioinform., 5, 56–66. Downey,D. and Fellows,M. (1999) Parameterized Complexity, Springer. Elias,I. and Hartman,T. (2006) A 1.375-Approximation algorithm for sorting by transpositions. IEEE/ACM Trans. Comput. Biol. Bioinform., 3, 369–379. Flum,J. and Grohe,M. (2006) Parameterized Complexity Theory. Springer, Germany. Galil,Z. et al. (1986) An O(EVlog V) algorithm for finding a maximal weighted matching in general graphs. SIAM J. Comput., 15, 120–130. Hannenhalli,S. and Pevzner,P. (1995) Transforming men into mice (polynomial algorithm for genomic distance problem). In Proceedings of the 36th Annual IEEE Symposium on Foundations of Computer Science (FOCS’95), IEEE Computer Society, pp. 581–589. Hannenhalli,S. and Pevzner,P. (1999) Transforming cabbage into turnip: polynomial algorithm for sorting signed permutations by reversals. J. ACM, 46, 1–27. Jiang,H. et al. (2010) Weak kernels. ECCC Report, TR10-005. Jiang,H. and Zhu,D. (2011) A 14/11-Approximation algorithm for sorting by short block-moves. To appear in Science in China Series F. Yancopoulos,S. et al. (2005) Efficient sorting of genomic permutations by translocation, inversion and block interchange. Bioinformatics, 21, 3340–3346. 316 atMasarykUniversityonFebruary21,2011bioinformatics.oxfordjournals.orgDownloadedfrom