[13:27 5/1/2011 Bioinformatics-btq674.tex] Page: 311 311–316
BIOINFORMATICS ORIGINAL PAPER
Vol. 27 no. 3 2011, pages 311–316
doi:10.1093/bioinformatics/btq674
Genome analysis Advance Access publication December 6, 2010
Algorithms for sorting unsigned linear genomes by the DCJ
operations
Haitao Jiang1,2, Binhai Zhu1,∗ and Daming Zhu2
1Department of Computer Science, Montana State University, Bozeman, MT 59717, USA and 2School of Computer
Science and Technology, Shandong University, Jinan, China
Associate Editor: Martin Bishop
ABSTRACT
Motivation: The double cut and join operation (abbreviated as DCJ)
has been extensively used for genomic rearrangement. Although
the DCJ distance between signed genomes with both linear and
circular (uni- and multi-) chromosomes is well studied, the only
known result for the NP-complete unsigned DCJ distance problem
is an approximation algorithm for unsigned linear unichromosomal
genomes. In this article, we study the problem of computing the
DCJ distance on two unsigned linear multichromosomal genomes
(abbreviated as UDCJ).
Results: We devise a 1.5-approximation algorithm for UDCJ by
exploiting the distance formula for signed genomes. In addition, we
show that UDCJ admits a weak kernel of size 2k and hence an FPT
algorithm running in O(22kn) time.
Contact: bhz@cs.montana.edu
Received on September 27, 2010; revised on December 1, 2010;
accepted on December 2, 2010
1 INTRODUCTION
Computing genomic distance on gene order is a fundamental
problem in computational biology. In the last two decades, a variety
of biological operations, such as reversals, translocations, fusions,
ﬁssions, transpositions and block-interchanges, have been proposed
to handle gene order. The double cut and join operation, introduced
byYancopoulos et al., 2005, uniﬁes all the classical operations. In the
past, the rearrangement distance for signed genomes is well studied
for single operations, like reversals (Hannenhalli and Pevzner,
1999), combinations of operations (reversals, translocations, fusions
and ﬁssions) (Hannenhalli and Pevzner, 1995) and universal
operations (double cut and join) (Bergeron et al., 2006; Yancopoulos
et al., 2005).
Unfortunately, as for unsigned genomes, most of these problems
seem to be NP-hard. Then it is natural to devise relevant
approximation algorithms. A 1.5-approximation algorithm was
devised for sorting by unsigned reversals (Christie, 1998), and the
approximation factor was improved to 1.375 by Berman et al., 2002.
The problem of sorting by unsigned translocations was investigated
by Cui et al., 2008, and an algorithm with an approximation factor
of 1.5+ε was proposed. Transposition, though occurring much less
than reversal and translocation, is an indispensable operation in
the evolutionary events. The problem of sorting by transpositions
was ﬁrst studied by Bafna and Pevzner, 1998, who devised a
∗To whom correspondence should be addressed.
1.5-approximation algorithm running in quadratic time. Later, the
approximation factor was improved to 1.375 by Elias and Hartman,
2006. The problem of sorting by short block-moves, a special but
more practical case of transpositions, was studied by Jiang and
Zhu, 2011, and they obtained an 14/11-approximation algorithm.
The design of FPT algorithms for genome rearrangement problems
was started very recently, with the help of weak kernels. (Intuitively,
an FPT algorithm is an exact algorithm which runs in polynomial
time when the problem solution size, like the number of unsigned
reversals to sort a sequence, is bounded by a constant. The relevant
formal deﬁnitions will be given in the next section.) Both sorting
by unsigned reversals and sorting by unsigned translocations admit
small weak kernels, hence are in FTP (Jiang et al., 2010).
As far as we know, the only known positive result for sorting
unsigned genomes by minimum DCJ operations (or interchangeably,
the unsigned DCJ distance problem) is a factor-1.416 approximation
for the case of linear unichromosomal genomes (Chen, 2010).
Of course, even in this case the problem involves computing
a maximum alternating-cycle decomposition (MAX-ACD) of the
breakpoint graph, which is NP-complete (Caprara, 1999); therefore,
it is not surprising that the unsigned DCJ distance problem is
NP-complete, even for linear unichromosomal genomes (Chen,
2010). Prior to our current work, there has been no FPT algorithm
known for the unsigned DCJ distance problem.
Our contributions: In this article, we introduce DCJ operations
on unsigned linear multichromosomal genomes to compute the
corresponding genomic distance. We devise a 1.5-approximation
algorithm for linear multichromosomal genomes in Section 3. In
Section 4, we obtain a weak kernel of size 2k for UDCJ; moreover,
we present an FPT algorithm running in O(22kn) time.
2 PRELIMINARIES
Gene, chromosome and genome: An unsigned gene is a sequence
of DNA, usually denoted by a positive integer. A chromosome can
be viewed as a sequence of genes and denoted by a permutation,
while a genome is a set of chromosomes. A gene that lies at the
end of some linear chromosome is called an end-gene. Gene i and j
form an adjacency if they are consecutive in some chromosome.
An adjacency (gi,gi+1) is perfect if it satisﬁes |gi+1 −gi|=1.
A chromosome is perfect if every adjacency is perfect. A genome
is perfect if all its chromosomes are perfect. As a convention, we
always list the genes in a perfect genome in increasing order. For
instance, a perfect genome with 3 chromosomes and 10 genes can be
listed as (1, 2, 3, 4), (5, 6, 7) and (8, 9, 10). We study unsigned linear
© The Author 2010. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oup.com 311
atMasarykUniversityonFebruary21,2011bioinformatics.oxfordjournals.orgDownloadedfrom
[13:27 5/1/2011 Bioinformatics-btq674.tex] Page: 312 311–316
H.Jiang et al.
multichromosomal (multilinear or simply linear, for short) genomes
in this article.
Breakpoint graph: Above all, we recall the well-known tool for
computing the genomic rearrangement distance, the Breakpoint
Graph (Bafna and Pevzner, 1998). Given two unsigned genomes A
and B on the same set of n genes, the Breakpoint Graph BG(A,B)=
(V,Eb ∪Eg), where |V|=n and each vertex in V corresponds to
a gene, every adjacency in A forms a black edge belonging to
Eb and every adjacency in B forms a gray edge belonging to Eg.
It is known that in this case computing a maximum alternating-cycle
decomposition in BG(A,B) is NP-complete (Caprara, 1999).
As for signed genomes F and H, the breakpoint graph BGs(F,H)
is a bit different. Due to the sign, each gene has one head and one tail
corresponding to two vertices in the breakpoint graph. Consequently,
the head has only one adjacency in F and H respectively, so does the
tail. Then each vertex in the breakpoint graph has degree at most two,
which means that the breakpoint graph is composed of cycles and
paths, and the black edges and gray edges appear alternatively in the
cycles or paths. So the maximum alternating-cycle decomposition
is easy in this case. A cycle that contains l black edges is called an
l-cycle.
The double cut and join operations: The Double Cut and Join
operation (abbreviated as DCJ) uniﬁes all the traditional genome
rearrangement operations such as reversal, translocation, fusion,
ﬁssion, transposition and block interchange, as well as excision,
integration, circularization and linearization. The formal deﬁnition
of a DCJ operation on the breakpoint graph is as follows.
Deﬁnition 1. The double cut and join operation acts on the
breakpoint graph in the following four ways (Fig. 1):
(1) For two black edges b1 =(gi,gi+1) and b2 =(gj,gj+1), cut
them, and either form two new black edges b1 =(gi,gj+1)
and b2 =(gj,gi+1) or form two new black edges b1 =(gi,gj)
and b2 =(gi+1,gj+1).
(2) For a black edge b=(gi,gi+1) and an end-gene gj, cut the
black edge, and either form a new black edge b =(gi,gj) and
a new end-gene gi+1 or form a new black edge b =(gj,gi+1)
and a new end-gene gi.
(3) For two end-genes gi and gj, join them with a black edge
(gi,gj).
(4) For a black edge b=(gi,gi+1), cut it into two end-genes gi
and gi+1.
Note that the black edges and end-genes involved in one
DCJ operation can be in the same chromosome, then a circular
chromosome may form after some DCJ operations.
Problem statement: We now formally formulate the problem to be
investigated in this article.
Sorting unsigned genomes by the DCJ operations (UDCJ):
Input: Two unsigned linear genomes A and B and an integer k.
Question: Can A be converted into B by a series of k DCJ operations
ρ1,ρ2,...,ρk?
The minimum k is the unsigned DCJ distance between A and B.
Following the results in (Caprara, 1999; Chen, 2010), UDCJ is also
NP-complete.
W.L.O.G, assume that B is perfect. Let lA and lB be the number of
linear chromosomes in A and B, respectively, we can also assume
Fig. 1. The DCJ operation.
that lA ≥lB, since all the DCJ operations are reversible, which
means that if there exists consecutive DCJ operations ρ1ρ2...ρm
that convert A into B, then we can also convert B into A by
ρ−1
m ρ−1
m−1...ρ−1
1 , where ρ−1
i is the reversed operation of ρi.
FPT and weak kernel: Basically, a ﬁxed parameter tractable (FPT)
algorithm for a decision problem with solution value k is an
algorithm, which solves the problem in O(f (k)nc)=O∗(f (k)) time,
where f is any function only on k, n is the input size and c is some
ﬁxed constant not related to k. FPT also stands for the set of problems
that admit such an algorithm (Downey and Fellows, 1999; Flum and
Grohe, 2006). Weak kernel is a relatively new concept; intuitively,
it refers to the direct or indirect ‘search space’ to solve a search
problem. For a search problem in NP, if it admits a weak kernel of
size g(k), then it is in FPT (Jiang et al., 2010). We comment that
weak kernel is different from the traditional kernel in which the
problem instance size is reduced (to a function of k), while a weak
kernel only implies that the direct or indirect solution search space
is reduced (to a function of k). More details can be found in (Jiang
et al., 2010).
3 A 1.5-APPROXIMATION ALGORITHM
In this section, we present a 1.5-approximation algorithm for double
cut and join distance on unsigned multilinear genomes. We ﬁrst
comment that the method by Chen (2010) cannot be converted
to solve our problem, as with multilinear genomes the underlying
breakpoint graph is more complex (i.e. possibly with many paths).
Given an original genome A with lA chromosomes and a target
perfect genome B with lB chromosomes, our goal is to convert A into
B by a series of DCJ operations so that the number of DCJ operations
is as few as possible. To design an approximation algorithm, we ﬁrst
need the structure properties of UDCJ, which in fact can be obtained
from the corresponding signed genomes.
3.1 Structure properties of UDCJ
For an unsigned genome A, a signed-version of A is obtained by
assigning ‘+ or ‘− to each gene in A, with ‘+ signs usually
omitted. Obviously, every genome of n genes has exponential,
i.e. 2n, signed versions. Given two signed genomes F,H, we use
DCJs(F,H) to denote their signed DCJ distance.
Theorem 1. Given two unsigned linear multichromosomal genomes
A and B, let the minimum DCJ distance between A,B be DCJ(A,B).
312
atMasarykUniversityonFebruary21,2011bioinformatics.oxfordjournals.orgDownloadedfrom
[13:27 5/1/2011 Bioinformatics-btq674.tex] Page: 313 311–316
Sorting by DCJ operations
Then DCJ(A,B)=DCJs(A∗,B+), where A∗ is some signed version
of A, and B+ is a special signed version of B with all signs being
positive.
Proof. Notice that, loosely speaking, we can take B=B+.
(⇒) Assume that there exists a series of consecutive DCJ
operations ρ1ρ2...ρm that convert A into B. We say that a DCJ
operation ρ changes the sign of a gene g if ρ involves reversing a
segment of genes including g. For each gene g in A, let Tg denote
the number of times that the sign of g is changed if we trace all
the m DCJ operations. g is assigned ‘− , if Tg is odd; and ‘+ , if
Tg is even. Then we obtain a signed version of A, A∗, which can
be converted into B+ by the m equivalent signed DCJ operations.
Thus, DCJ(A,B)≥DCJs(A∗,B+).
(⇐) If there exists a signed version A∗ of A that can be converted
into B+ by m signed DCJ operations ρ1ρ2...ρm, then we can also
use these m (signed) DCJ operations to convert A into B, ignoring
the gene signs. Thus DCJ(A,B)≤DCJs(A∗,B+).
We now proceed to obtain the necessary properties of the optimal
solution. First of all, in order to avoid distinct end points of
chromosomes in A and B, we add unlabeled caps to both ends of each
linear chromosome in genomes A and B, respectively, then connect
the A-cap and its adjacent end-gene with a black edge and the B-cap
and its adjacent end-gene with a gray edge in BG(A,B). The above
preprocess is called capping. Note that each gene in BG(A,B) has
degree 4 after capping, i.e. with two black edges and two gray edges.
After capping, genomes A and B become ¯A and ¯B, respectively. We
denote the resulting graph by BG(¯A, ¯B).
As it seems to be hard to extract the properties of the optimal
solution from BG(¯A, ¯B) directly, we take a detour. We notice that, for
signed genomes F and H, after capping each vertex in the breakpoint
graph BGs(F,H) has degree two and each cap has degree one, which
means that all the paths end with caps. A path with an A-cap end
and a B-cap end (respectively, two A-cap ends, two B-cap ends) is
an AB-path (respectively, AA-path, BB-path).
There are three ways to construct cycles from BGs(F,H)in the
breakpoint graph BGs(F,H) of signed genomes F and H, after
capping .
(1) single-identifying: identify the two caps of each AB-path,
close the path into a cycle containing just one A-cap (with
the B-cap eliminated).
(2) double-identifying: identify each B-cap of a BB-path and each
A-cap of an AA-path, join an AA-path and a BB-path into a
cycle containing twoA-caps (with the two B-caps eliminated).
(3) joining: connect the two A-caps of an AA-path with a gray
edge.
Let BGs( ¯F, ¯H) denote the resulting breakpoint graph after
constructing cycles from BGs(F,H) following the above three ways.
Then the signed DCJ distance between the signed genomes ¯F and
¯H, DCJs( ¯F, ¯H)=b−c, where b is the number of black edges and c
is the number of cycles in BGs( ¯F, ¯H) (Yancopoulos et al., 2005).
In Figure 2, we show an example of ¯F, ¯H and BGs( ¯F, ¯H), before
the identifying and joining operations are performed. In the ﬁgure,
an empty round (respectively, square) node is anA-cap (respectively,
B-cap); moreover, in BGs( ¯F, ¯H), a signed gene +i (respectively, −i)
is already converted to (2i−1,2i) [respectively, (2i,2i−1)]. After
two single-identifying operations are performed, we have two new
Fig. 2. The breakpoint graph BGs( ¯F, ¯H), before the identifying and joining
operations are performed.
cycles (6) and (9, 8, 4, 5, 14). After a double-identifying operation is
performed, we have a new cycle (2, 3, 7, 1).After a joining operation
is performed, we have a new cycle (12, 13).
It is worth mentioning that this distance formula is equivalent
to that of Bergeron et al., 2006, i.e. DCJs(F,H)=n−C− I/2 ,
where n is the number of genes, C is the number of cycles and I
is the number of odd paths in their corresponding adjacency graph.
To see this, note that I also equals to the number of AB-paths in
the breakpoint graph; in addition, we have b=n+lA, c=C+I +
(2lA −I)/2 . So DCJs( ¯F, ¯H)=DCJs(F,H).
Corollary 1. Given two unsigned linear multichromosomal
genomes A and B, let A∗ and B+ be deﬁned as in Theorem 1.
Then DCJ(A,B)=DCJs(¯A∗, ¯B+), where ¯A∗ (respectively, ¯B+) is a
capping of A∗ (respectively, B+).
Proof. It follows from Theorem 1 that DCJ(A,B)=
DCJs(A∗,B+). The statements in the last paragraph show
that DCJs(A∗,B+)=DCJs(¯A∗, ¯B+). Then the corollary follows.
Notice that computing an alternating-cycle decomposition of
BG(¯A, ¯B) is equivalent to ﬁnding a signed version of ¯A. To extract
the properties of the optimal solution, we ﬁrst try to make use
of the breakpoint graph BG(¯A, ¯B) instead of BG(A,B). Following
Corollary 1, we can now make use of the breakpoint graph
BGs(¯A∗, ¯B+). From the way BGs(¯A∗, ¯B+) is constructed, we only
need to ﬁnd an optimal ¯A∗ such that the number of disjoint
alternating-cycles in BGs(¯A∗, ¯B+) is maximized. The reason is that
the number of black edges in BG(¯A, ¯B) is ﬁxed.
Then we have dopt =DCJ(A,B)=DCJs(¯A∗, ¯B+)=b−c1 −c2 −
c3, where b is the number of black edges in BGs(¯A∗, ¯B+), c1
and c2 are the number of 1-cycles and 2-cycles in BGs(¯A∗, ¯B+),
respectively, and c3 is the number of cycles with three or more
black edges in BGs(¯A∗, ¯B+). Obviously, c3 ≤(b−c1 −2c2)/3, thus
we have the following formula:
dopt = b−c1 −c2 −c3
≥ b−c1 −c2 −(b−c1 −2c2)/3
313
atMasarykUniversityonFebruary21,2011bioinformatics.oxfordjournals.orgDownloadedfrom
[13:27 5/1/2011 Bioinformatics-btq674.tex] Page: 314 311–316
H.Jiang et al.
(a) (b) (c) (d)
Fig. 3. 1-cycle containing two genes.
= 2(b−c1)/3−c2/3
=
2
3
·(b−c1 −c2/2).
The above formula implies that, if we can convert A into B by at most
b−c1 −c2/2 DCJ operations, then we obtain a 1.5-approximation
algorithm for UDCJ.
3.2 The algorithm
The idea of our approximation algorithm is as follows. We compute
BG(¯A, ¯B) and try to ﬁrst keep all the 1-cycles in it. Then we compute
many 2-cycles from BG(¯A, ¯B) (in fact, at least c2/2 such 2-cycles).
We comment that a similar idea was used by Christie (1998) on
sorting by unsigned reversals. On the other hand, the LP-relaxation
algorithm by Chen, 2010 cannot handle paths (and caps) so it cannot
be immediately generalized to solve our problem.
The following lemma, which involves handling paths and caps,
shows that keeping all the 1-cycles in BG(¯A, ¯B) is a good strategy
to obtain some optimal alternating-cycle decomposition of it.
Lemma 1. There is some maximum alternating-cycle decomposition
of BG(¯A, ¯B) in which all c1 1-cycles in BG(¯A, ¯B) are kept.
Proof. We modify the optimal alternating-cycle decomposition
in BG(¯A, ¯B) in such a way: if two genes, say gi and gi+1, are
connected by a black edge and a gray edge, then we reassign the
signs of these two genes to obtain a 1-cycle; if a gene, say gi, is
connected to an A-cap by a black edge and to a B-cap by a gray
edge, then we reassign the sign of the gene and identify the two
caps to obtain a 1-cycle. If the newly obtained 1-cycle contains two
genes, then there are two cases.
Case (I): Only one of the signs of gi and gi+1 is changed. W.L.O.G,
assume that the sign of gi is changed, see Figure 3a. The number of
cycles is increased by one.
Case (II): Both of the signs of gi and gi+1 are changed, see
Figure 3b–d. The number of cycles is increased by two or one or
unchanged, respectively.
If the newly obtained 1-cycle contains one gene and a cap (which
is identiﬁed by an A-cap a and a B-cap b), then there are four cases.
Note that b must be identiﬁed with some A-cap a .
Case (1): The A-cap a joins with another A-cap a . The number
of cycles is unchanged. See Figure 4a.
Case (2): TheA-cap a is identiﬁed with a B-cap b and a,b belongs
to distinct cycles. The number of cycles is unchanged. See Figure 4b.
Case (3): TheA-cap a is identiﬁed with a B-cap b and a,b belongs
to the same cycle. The number of cycles is increased by one. See
Figure 4c.
Case (4): The A-cap a is identiﬁed with the B-cap b but the
cycle containing a,b also contains two identiﬁed caps a and b .
The number of cycles is increased by one. See Figure 4d.
Following Lemma 1, we know that keeping all the 1-cycles in in
BG(¯A, ¯B) will not affect the value of some optimal alternating-cycle
decomposition of it. Therefore, from now on we only focus on the
optimal alternating-cycle decomposition of BG(¯A, ¯B), which always
keeps all the 1-cycles. Consequently, in order to approximate the
optimal DCJ distance, we just need to ﬁnd out as many as at least
half of the 2-cycles in an optimal alternating-cycle decomposition of
BG(¯A, ¯B) (which keeps all 1-cycles). Now we present the algorithm
2-Cycle Decomposition to compute such 2-cycles. In this algorithm,
we ﬁrst construct a graph G1 whose vertices are the black edges (not
in any 1-cycle) in BG(¯A, ¯B) and M is a maximum matching in G1.
Note that the maximum matching M can be computed in
polynomial time (Galil et al., 1986); moreover, each edge in M
results in a candidate 2-cycle. In order to bound the cardinality of
S, we need the following lemmas.
Lemma 2. Let M be a maximum matching in G1, then |M|≥c2.
Proof. Following the discussion in Section 3.1, c2 corresponds
to the number of 2-cycles in an optimal alternating-cycle
decomposition of BG(¯A, ¯B). These 2-cycles clearly form a matching
in G1. By the maximality of M, we have |M|≥c2.
Algorithm 2-Cycle Decomposition
Input: BG(¯A, ¯B)
Output: A set of edge-disjoint 2-cycles
1 Construct a graph G1 =(P,E1) as follows:
1.1 Each black edge in BG(¯A, ¯B), not contained in any 1-cycle,
corresponds to a vertex in P.
1.2 For each pair of vertices u=(gi,gi+1) and v=(gj,gj+1)
corresponding to black edges between two genes, (u,v)∈E1
iff there exist two gray edges which can form a 2-cycle
together with these two black edges.
1.3 For each pair of vertices u=(gi,ai) and v=(gj,aj)
corresponding to black edges between a gene and a A-cap,
(u,v)∈E1 iff there exist a gray edge (gi,gj) in BG(¯A, ¯B).
1.4 For a 2-gene vertex u=(gi,gi+1) and a 1-gene-1-cap
vertex v=(gj,aj), (u,v)∈E1 iff there exist two gray edges
(gi,gj) and (gi+1,bj) or two gray edges (gi+1,gj) and
(gi,bj+1) in BG(¯A, ¯B), where bj and bj+1 are B-caps.
2 Compute a maximum matching M in G1.
3 Construct a graph G2 =(Q,E2) as follows:
3.1 Each 2-cycle computed at Step 2 corresponds to a vertex in Q.
3.2 Two vertices in Q form an edge in E2 iff their corresponding
cycles share a gray edge.
4 Compute a maximum independent set S of G2.
5 Return S which is a set of edge-disjoint 2-cycles.
Lemma 3. The graph G2 is composed of simple paths and isolated
vertices.
Proof. All 2-cycles computed at Step 2 cannot share black edges.
Since each gray edge is connected to at most four black edges, at
most two 2-cycles which do not share black edges can share this
gray edge. Equivalently, each gray edge can belong to at most two
cycles computed from M. Each 2-cycle has two gray edges, so each
vertex in G2 has degree at most two.
314
atMasarykUniversityonFebruary21,2011bioinformatics.oxfordjournals.orgDownloadedfrom
[13:27 5/1/2011 Bioinformatics-btq674.tex] Page: 315 311–316
Sorting by DCJ operations
(a) (b)
(c)
(d)
Fig. 4. 1-cycle containing one gene and one cap.
Fig. 5. An example of 2-cycles sharing gray edges.
It is sufﬁcient to prove that G2 does not contain cycles. Assume
to the contrary that 2-cycles C1C2...Cr form a cycle in G2, where
Ci shares gray edge gi with Ci+1, 1≤i≤r−1, and Cr shares gr
with C1. Then C1 contains two gray edges g1 and gr, but the end
points of g1 and gr cannot form two black edges (otherwise these
two black edges will force into some black cycle—which implies
that the input genome contains some circular chromosome). See
Figure 5.
It is obvious that every 2-cycle containing caps has degree at most
one in G2, because the gray edge containing caps cannot be shared
by two 2-cycles computed from M. The property we just proved in
Lemma 3 is important for us to compute a maximum independent
set in G2 (without this property, the computation of a maximum
independent set might be intractable). Lemma 3 immediately implies
the next lemma.
Lemma 4. Let S be a maximum independent set in G2, then |S|≥
|M|
2 .
Note that if a gene is contained in some 1-cycle, then its sign
can be ﬁxed easily, i.e. if the black edge reads from (left to right)
like (i,i+1) then both genes i and i+1 will be given positive
signs, otherwise they will be given negative signs. If a gene is
contained in some 2-cycle, its sign is ﬁxed similarly. For instance,
if in a 2-cycle the two black edges read like (i,j),(i+1,j+1) (from
left to right), then the signing should be +i,−j,−(i+1),+(j+1).
The other cases, e.g., when the directions of these black edges
are possibly changed, are very much symmetric hence omitted. To
complete the cycle decomposition, we arbitrarily assign signs to the
remaining genes, then properly identify and join the remaining caps
in the corresponding breakpoint graph. The complete Whole-Cycle
Decomposition algorithm is presented as follows.
Algorithm Whole-Cycle Decomposition
Input: BG(¯A, ¯B)
Output: BGs( ¯A , ¯B+)
1 Keep all 1-cycles and assign proper signs to genes involved
in the 1-cycles in BG(¯A, ¯B).
2 Call 2-Cycle Decomposition, and assign proper signs to genes
involved in the resulting 2-cycles.
3 Assign arbitrary signs to the remaining genes to have a signed
genome ¯A .
4 Construct BGs( ¯A , ¯B+) by identifying and joining caps with
single-identifying, double-identifying and joining operations.
Notice that once we have BGs( ¯A , ¯B+) it is straightforward to
compute the signed DCJ distance dwcd =DCJs( ¯A , ¯B+) in linear
time (Bergeron et al., 2006; Yancopoulos et al., 2005).
Theorem 2. Algorithm Whole-Cycle Decomposition approximates
the DCJ distance between two unsigned linear multichromosomal
genomes with a factor of 1.5.
Proof. From Lemma 4, we know that |S|≥ c2/2 ; it follows
from Lemma 1 that c1 =c1. The distance computed by Algorithm
Whole-Cycle Decomposition is dwcd ≤b−c1 −|S|. The optimal
distance dopt satisﬁes that dopt ≥2(b−c1 −c2/2)/3. Thus, dwcd ≤
(b−c1 − c2/2 )≤1.5dopt.
4 A WEAK KERNEL AND AN FPT ALGORITHM
Similar to the problem of sorting by unsigned reversals and sorting
by unsigned translocations (Jiang et al., 2010), the UDCJ problem
also possesses a (small and indirect) weak kernel.
Let k be the minimum number of DCJ operations converting A
into B. A weak kernel for UDCJ is a set of genes in A whose signs
cannot be ﬁxed after the genes involved in all 1-cycles have been
properly signed (following Lemma 1). Before computing the size of
the weak kernel, we state the following lemma, which is simple but
critical.
Lemma 5. Each DCJ operation can generate at most two 1-cycles.
Proof. Each DCJ operation cuts two black edges and forms at
most two new black edges. Each new black edge can form at most
one 1-cycle.
Theorem 3. The UDCJ problem has a weak kernel of size 2k, hence
can be solved in O∗(22k) time.
Proof. The 2k weak kernel is straightforward from Lemma 1
and Lemma 5. For any optimal alternating-cycle decomposition of
BG(¯A, ¯B) which contains all possible number of c1 1-cycles, we
have k =b−c1 −c2 and c2 ≤(b−c1)/2, where c2 is the number
of cycles of length at least 2 in the optimal alternating-cycle
decomposition of BG(¯A, ¯B). Thus, k ≥(b−c1)/2, equivalently,
(b−c1)≤2k. Following Lemma 1, we can assign signs to all genes
involved in 1-cycles. So each of the remaining gene is connected
to two black edges, and each black edge has at most two unsigned
genes as its end points, which means that the number of unsigned
genes N is bounded by the number of black edges not involved in
any 1-cycle, e.g. N ≤b−c1 ≤2k. Hence, the problem admits a weak
kernel of size 2k.
315
atMasarykUniversityonFebruary21,2011bioinformatics.oxfordjournals.orgDownloadedfrom
[13:27 5/1/2011 Bioinformatics-btq674.tex] Page: 316 311–316
H.Jiang et al.
In other words, if the DCJ distance is equal to or smaller than
k, there are at most 22k signed versions of A among which there
must be an optimal one (e.g. A∗ in Theorem 1). For each signed
version of A, we can exploit the algorithm in Bergeron et al. (2006);
Yancopoulos et al. (2005) to check whether it can be converted
into B+ by k or few DCJ operations. If so, we can compute the
corresponding k unsigned DCJ operations to convert A into B. If no
valid solution is found, we report NO. This algorithm clearly runs
in O(22kn)=O∗(22k) time.
5 DISCUSSION
In this article, we devise the ﬁrst approximation algorithm with
a factor of 1.5 and an FPT algorithm running in O(22kn) time
for the NP-complete problem of sorting linear multichromosomal
genomes under unsigned DCJ distance. It is interesting to improve
the approximation factor as well as the running time of the FPT
algorithm. For genomes containing circular chromosomes, our
approximation algorithm cannot achieve the same performance as
linear genomes, so it is also meaningful to handle the problem
of sorting mixed genomes (i.e. with both linear and circular
chromosomes) under unsigned DCJ distance.
ACKNOWLEDGEMENT
We thank anonymous reviewers for their valuable comments.
Funding: NSF grant (DMS-0918034); NSF of China under grant
(60928006 and 61070019 in part).
Conﬂict of Interest: none declared.
REFERENCES
Bafna,V. and Pevzner,P. (1998) Sorting by Transpositions. SIAM J. Discrete Math., 11,
224–240.
Bergeron,A. et al. (2006) A unifying view of genome rearrangements. In Proceedings
of the 6th International Workshop on Algorithms in Bioinformatics (WABI’06),
Springer, Germany, pp. 163–173.
Berman,P. et al. (2002) 1.375-Approximation algorithm for sorting by reversals. In
Proceedings of the 10th Annual European Symposium on Algorithms (ESA’02),
Springer, Germany, pp. 200–210.
Caprara,A. (1999) Sorting permutations by reversals and Eulerian cycle
decompositions. SIAM J. Discrete Math., 12, 91–110.
Chen,X. (2010) On sorting permutations by double-cut-and-joins. In Proceedings of
the 16th International Conf. on Computing and Combinatorics (COCOON’10),
Springer, Germany, pp. 439–448.
Christie,D. (1998) A 3/2-Approximation algorithm for sorting by reversals. In
Proceedings of the 9th Annual ACM-SIAM Symposium on Discrete Algorithms
(SODA’98), ACM, New York, pp. 244–252.
Cui,Y. et al. (2008) A (1.5 + )-Approximation algorithm for unsigned translocation
distance. IEEE/ACM Trans. Comput. Biol. Bioinform., 5, 56–66.
Downey,D. and Fellows,M. (1999) Parameterized Complexity, Springer.
Elias,I. and Hartman,T. (2006) A 1.375-Approximation algorithm for sorting by
transpositions. IEEE/ACM Trans. Comput. Biol. Bioinform., 3, 369–379.
Flum,J. and Grohe,M. (2006) Parameterized Complexity Theory. Springer, Germany.
Galil,Z. et al. (1986) An O(EVlog V) algorithm for ﬁnding a maximal weighted
matching in general graphs. SIAM J. Comput., 15, 120–130.
Hannenhalli,S. and Pevzner,P. (1995) Transforming men into mice (polynomial
algorithm for genomic distance problem). In Proceedings of the 36th Annual IEEE
Symposium on Foundations of Computer Science (FOCS’95), IEEE Computer
Society, pp. 581–589.
Hannenhalli,S. and Pevzner,P. (1999) Transforming cabbage into turnip: polynomial
algorithm for sorting signed permutations by reversals. J. ACM, 46, 1–27.
Jiang,H. et al. (2010) Weak kernels. ECCC Report, TR10-005.
Jiang,H. and Zhu,D. (2011) A 14/11-Approximation algorithm for sorting by short
block-moves. To appear in Science in China Series F.
Yancopoulos,S. et al. (2005) Efﬁcient sorting of genomic permutations by translocation,
inversion and block interchange. Bioinformatics, 21, 3340–3346.
316
atMasarykUniversityonFebruary21,2011bioinformatics.oxfordjournals.orgDownloadedfrom