On Data Structures for
String Searching Problems
Shunsuke Inenaga
February, 2002
Department of Informatics
Kyushu University
Contents
1 Introduction 1
1.1 A Research Area Called ‘Stringology’ . . . . . . . . . . . . . . . . . . . . . 1
1.2 Linear-Space Index Structures . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Our Contribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2 Preliminaries 6
2.1 Graphs and Trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2 Deterministic Finite State Automata . . . . . . . . . . . . . . . . . . . . . 10
2.3 Index Structures for Text Strings . . . . . . . . . . . . . . . . . . . . . . . 10
3 On-Line Construction of Compact Directed Acyclic Word Graphs 18
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
3.2 On-Line Construction of the Suﬃx Trie for a Single String . . . . . . . . . 20
3.3 On-Line Construction of the Suﬃx Tree for a Single String . . . . . . . . . 22
3.4 On-Line Construction of the CDAWG for a Single String . . . . . . . . . . 25
3.5 Construction of the CDAWG for a Set of Strings . . . . . . . . . . . . . . . 37
4 Uniﬁcation of Algorithms for Constructing Index Structures 39
4.1 Background and Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . 39
4.2 A Generic Algorithm Constructing Index Structures . . . . . . . . . . . . . 41
5 Construction of the CDAWG for a Trie 53
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
5.2 Trie and Reversed Trie . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
5.3 Algorithm to Construct the CDAWG for a Trie . . . . . . . . . . . . . . . 56
5.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
i
6 On-Line Construction of Symmetric Compact Directed Acyclic Word
Graphs 61
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
6.2 Bidirectional Index Structures . . . . . . . . . . . . . . . . . . . . . . . . . 63
6.3 On-Line Construction of STree(w) with DAWG(wrev
) . . . . . . . . . . . . 65
6.4 On-Line Construction of SCDAWG . . . . . . . . . . . . . . . . . . . . . . 74
6.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
7 Bidirectional Construction of Index Structures 81
7.1 Background and Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . 81
7.2 Bidirectional Construction of Suﬃx Trees . . . . . . . . . . . . . . . . . . . 82
7.3 Bidirectional Construction of DAWGs . . . . . . . . . . . . . . . . . . . . . 89
7.4 Concluding Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
8 The Minimum All-Suﬃxes Directed Acyclic Word Graphs 92
8.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
8.2 All-Suﬃxes Directed Acyclic Word Graphs . . . . . . . . . . . . . . . . . . 94
8.3 On-Line Construction of MASDAWGs . . . . . . . . . . . . . . . . . . . . 98
8.4 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
9 Finding Patterns to Best Separate Two Sets of Strings 107
9.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
9.2 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
9.3 Pruning Heuristics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
9.4 Using Eﬃcient Data Structures . . . . . . . . . . . . . . . . . . . . . . . . 115
9.5 Finding Variable-Length-Don’t-Care’s Patterns . . . . . . . . . . . . . . . 117
9.6 Concluding Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
Acknowledgment 120
Bibliography 121
ii
Chapter 1
Introduction
1.1 A Research Area Called ‘Stringology’
In Theoretical Computer Science, sequences of characters are often referred to as strings.
The design of algorithms that processes and manipulates strings is one of the most active
subﬁeld of general algorithmic research. Stringology has been the nickname of this new
area [14, 3]. It covers any, and various, kinds of problems related to strings and things
that can be regarded as or transformed into strings such as music data [13, 34], DNA and
protein data [20].
Natural language texts and biological sequences, like those electrically available via
WWW or from biological databases, can be arose as typical examples of vast strings
frequently utilized today. Since a huge, and growing, amount of string data is at present
stored and requires processing, once acceptable naive algorithms are no longer eﬀective.
In the present information age, it directly causes neccessity to develop innovative string
processing algorithms, which must be correct and eﬃcient both in time and space. That
is the reason why Stringology is one particular area that has been extremely active for
the last few decades.
String processing problems include a variaty of applications such as exact pattern
matching, approximate pattern matching, similarity measurement between two strings,
locating repetition in strings, string compression, pattern matching on compressed strings,
two-dimensional pattern matching, and so on. Among those, the most fundamental and
important problem is the ﬁrst one, exact pattern matching on strings.
The exact pattern matching problem is formalized as follows: Given a text string w
1
and a pattern string p, return “yes” if p occurs in w, and “no” otherwise. There are two
situations for the problem, namely, one that w is static and p is dynamic, or vice versa.
In the former case, it is appropriate to construct a data structure for w that serves as
indexes of w. Such a structure is called an index structure for the text string w. Let n
and m be the length of w and p, respectively. An index structure for w must support all
factors of w, where a factor of w is a string occuring within w. Unfortunately, the number
of factors of w is O(n2
), although it is favorable to implement the structure with as little
memory space as possible, desirably with O(n) space.
1.2 Linear-Space Index Structures
In 1973, there was a ‘big bang’ in Stringology due to Weiner [58]. He succeeded to
introduce an index structure called a suﬃx tree, which only requires O(n) space. He
furthermore developed an algorithm to construct the suﬃx tree for a string w in O(n)
time. By means of the suﬃx tree of w, it is possible to solve the pattern matching problem
in O(n) preprocessing time and O(m) searching time. This result is very surprising and
its contribution is immeasurable. In fact, it is claimed that Knuth later on referred to
Weiner’s algorithm as ‘the algorithm of 1973’ [2]. Suﬃx trees are useful not only for the
above simple pattern matching problem but also for a ‘myriad’ [2] of other applications
represented by the problem to ﬁnd the longest common factors of two strings in linear
time, and so on. It is claimed that Knuth had conjectured in 1970 that linear time solution
to this problem was impossible to achieve [36]. This fact also tells us that the invention
of suﬃx trees was truly revolutionary.
Following Weiner’s invention, McCreight in 1976 proposed a more space-economical
algorithm for the construction of suﬃx trees [43]. It has permitted us to save more memory
space on building suﬃx trees (in constant term).
However, both algorithms above mentioned have been thought to be considerably
complicated. It might have caused the delay of the spread of suﬃx trees, in spite of their
remarkable usefulness. More recently, this matter was settled in 1995 by Ukkonen’s suﬃx
tree construction algorithm [55]. Ukkonen’s algorithm is believed to be the conceptually
easiest to understand, and elegant [20, 18]. It has a certain helpful property called on-line,
i.e., it processes a given string from left to right, one by one, while constructing the suﬃx
tree for the string already scanned at each step, with no need to read the whole string
2
beforehand. In addition, it allows us to construct the suﬃx tree for a set of strings in
time linear in the total length of the strings.
On the other hand, in 1985, Blumer et al. introduced counterpart of suﬃx trees, called
directed acyclic word graphs (DAWGs) [7]. The DAWG of a string w is known to be the
smallest automaton that accepts all suﬃxes of w [12]. One drawback of the suﬃx tree for
a string w is that the label of each edge needs to be implemented by a pair of integers
which respectively represent the beginning and ending positions of the label string in w.
It means that we have to store the input string w in order to keep the suﬃx tree with
linear space. Conversely, the label of any edge of the DAWG for w consists of a character,
not a string. Since there is no need to keep the input string w if the DAWG for w is once
completed, we can then delete w from main memory.
An algorithm to construct the DAWG of a given string was also proposed by Blumer
et al., which processes the string on-line, and runs in linear time [7]. In 1987 Blumer et al.
moreover gave an on-line algorithm that constructs the DAWG for a given set of strings,
running in time linear in the total length of strings [8].
A lot more space-economical index structure is a compact directed acyclic word graph
(CDAWG), introduced by Blumer et al. as well [8]. It has been shown that CDAWGs
require strictly smaller space than suﬃx trees and DAWGs, both theoretically and experimentally.
A typical biological sequence may be many millions of characters long. Thus it
is quite signiﬁcant to reduce the constant term usually ignored due to the big-O notation
for the space and time complexity.
Blumer et al. gave an algorithm that constructs the CDAWG for a string w, by once
building the DAWG of w and then shrinking it into the corresponding CDAWG. The
CDAWG for a set of strings can also be built similarly. The drawback of this method is that
we have to construct the corresponding DAWG as an intermediate, which contains many
nodes and edges turning out to be redundant in the CDAWG. It was 1997 when the ﬁrst
algorithm to directly construct the CDAWG of a given string was developed by Crochemore
and V´erin [16]. Their algorithm is based on McCreight’s suﬃx tree construction algorithm.
Thereby, their algorithm does not have the on-line property which can be useful in some
situations.
All the structures mentioned above are automata-oriented. Another idea to represent
all factors of a given string in a data structure is based on arrays. The ﬁrst array of this
kind is the suﬃx array [41], followed by the suﬃx cactus [35], the compact suﬃx array [40],
3
and the compressed suﬃx array [19, 47]. They are in general more space-economical than
those automata-oriented structures (in constant term), but they instead sacriﬁce searching
time. Namely, searching p in w by using an array takes O(m + log n) time.
1.3 Our Contribution
In the following chapters, we report our contribution to Stringology. We have chosen the
automata-oriented index structures to work for, since we wish the situation where we can
ﬁnd the occurrence of a pattern p in a text w as fast as possible, in O(m) searching time.
In Chapter 2.3, we deﬁne index structures basing on the equivalence classes on strings.
They are deﬁned in common notations, thus give us a good ‘uniﬁed’ view on the structures
which reveals their insights and relationship among them.
In Chapter 3, we focus our attention on CDAWGs. An algorithm which constructs
CDAWGs in on-line manner had been a long-term missing piece, as remarked in Section
1.2. In the chapter, we report the success of invention of an on-line algorithm that
constructs the CDAWG for a given string. We prove that the algorithm proposed runs in
linear time. Also, we show that the CDAWG for a set of strings can also be built by a
straightforward extension of the algorithm.
We dedicate Chapter 4 to the introduction of more surprising fact about the uniﬁed
view for the index structures. To be concrete, we introduce a generic algorithm that is
capable of constructing any of suﬃx tries, suﬃx trees, DAWGs, and CDAWGs. This saves
us a great deal of eﬀort to implement distinct algorithms for all the index structures. In
addition, it gives us insight into their properties, what is common to and what is diﬀerent
amongst them, from algorithmic point of view.
Chapter 5 is devoted to a report of an algorithm for the construction of the CDAWG
for a given trie that ‘compactly’ represents a set of strings. Since the trie for a set of
strings shares their common preﬁxes, the number of nodes in it is generally smaller than
the total length of the strings. The algorithm is a non-trivial extension of the one given
in Chapter 3. We establish that the algorithm performs in time linear in the number of
nodes in a given trie.
In Chapter6, we consider not only an index structure for a given string w, but also
that for the reversal of w, denoted by wrev
. It is a well known property that the suﬃx
tree of w and the DAWG for wrev
can share the same nodes [11]. Therefore, they can be
4
represented together with, and can be seen as one structure. In the chapter, we introduce
an on-line algorithm which simultaneously constructs both the suﬃx tree of w and the
DAWG for wrev
. Furthermore, the chapter is dedicated to the symmetric compact directed
acyclic word graph (SCDAWG) of w, which also supports both indexes of w and wrev
[7].
We report that we developed an on-line algorithm which builds the SCDAWG for a given
string w in O(n) time.
In Chapter 7, a bidirectional on-line linear-time construction algorithm for suﬃx trees
is presented. As mentioned in Section 1.2, Ukkonen’s algorithm allows us to update an
input string by adding new strings at its right. Nevertheless, we cannot extend an input
string to the left direction in using his algorithm. The result to be reported in this chapter
means that we can extend an input string to both direction, without re-constructing the
suﬃx tree from scratch. Furthermore, we show the DAWG for w can also be constructed
in bidirectional on-line manner, and in linear time.
In Chapter 8, we consider the collection of DAWGs for all suﬃxes of a given string w.
It is called the naive all-suﬃxes directed acyclic word graph (naive ASDAWG) for w. It
is clear that the size of the naive ASDAWG for w is O(n2
). We report that we succeeded
to develop a new structure, named the minimum all-suﬃxes directed acyclic word graph
(MASDAWG). The MASDAWG of w is the minimized version of the naive ASDAWG for
w. We prove its size is Θ(n) if the alphabet Σ is unary, and Θ(n2
) otherwise. An on-line
algorithm that directly constructs the MASDAWG for w in time proportional to its size
is also given.
Given two sets of strings, it is a quite important problem in Knowledge Discovery
and Data Mining to ﬁnd a rule which separates them. The accuracy of the separation
and the simplicity of the rule depend on what sort of pattern we adopt. There had been
algorithms in which substring patterns [48] and subsequence patterns [23] are utilized in
order to distinguish two given sets of strings. In Chapter 9, we report the work where we
extended the algorithms by applying episode patterns as rules. We succeeded to develop
a practical, eﬃcient algorithm to ﬁnd the best episode patterns to separate two sets of
strings. In [27], it is experimentally shown that the result of our work is superior to its
previous versions. Also, we in this chapter emphasize that MASDAWGs, introduced in
Chapter 8, are believed to be powerful ‘weapon’ to propose a new practical algorithm to
ﬁnd a variable-length-don’t-care’s pattern (VLDC-pattern) that eﬃciently separates given
two sets of strings.
5
Chapter 2
Preliminaries
2.0.1 Notation
Let Σ be a ﬁnite alphabet. An element of Σ∗
is called a string. Let x be a string such that
x = a1a2 · · ·an where n ≥ 1 and ai ∈ Σ for 1 ≤ i ≤ n. The length of x is n and denoted by
|x|, that is, |x| = n. If n = 0, x is said to be the empty string. It is denoted by ε, that is,
|ε| = 0. Let Σ+
= Σ∗
− {ε}. Let y be a string such that y = b1b2 · · · bm where m ≥ 1 and
bj ∈ Σ for 1 ≤ j ≤ m. Then, string a1a2 · · ·anb1b2 · · · bm is said to be the concatenation
of x and y, and denoted by x · y, or simply, by xy. For any string x ∈ Σ∗
,
xε = εx = x.
Strings x, y, and z are said to be a preﬁx, factor, and suﬃx of string w = xyz,
respectively. The sets of preﬁxes, factors, and suﬃxes of a string w are denoted by
Preﬁx(w), Factor(w), and Suﬃx(w), respectively.
Let w be a string and |w| = n. The i-th character of w is denoted by w[i] for 1 ≤ i ≤ n,
and the factor of w that begins at position i and ends at position j is denoted by w[i : j]
for 1 ≤ i ≤ j ≤ n. For convenience, let w[i : j] = ε for j < i.
For a set S of strings w1, w2, . . . , w , let |S| denote the cardinality of S, namely, |S| = .
We denote by S the total length of strings in S, that is,
S =
k=1
|wk|.
The sets of preﬁxes, factors, and suﬃxes of the strings in S are denoted by Preﬁx(S),
Factor(S), and Suﬃx(S), respectively.
6
Deﬁnition 2.1 Let S = {w1, . . . , wk} where wi ∈ Σ∗
for 1 ≤ i ≤ k and k ≥ 1. We say
that S has the preﬁx property iﬀ wi /∈Preﬁx(wj) for any 1≤i=j ≤ k.
2.0.2 Equivalence Relations on Strings
Let S ⊆ Σ∗
. For any string u ∈ Σ∗
, let Su−1
= {x | xu ∈ S} and u−1
S = {x | ux ∈ S}.
Deﬁnition 2.2 Let S ⊆ Σ∗
. The equivalence relations ≡L
S and ≡R
S on Σ∗
are deﬁned by
x ≡L
S y ⇔ Preﬁx(S)x−1
= Preﬁx(S)y−1
,
x ≡R
S y ⇔ x−1
Suﬃx(S) = y−1
Suﬃx(S).
The equivalence class of a string x ∈ Σ∗
with respect to ≡L
S (resp. ≡R
S ) is denoted by
[x]L
S
(resp. [x]R
S
).
If S = {cocoa, cola}, [c]L
S
= {c, co}, [o]L
S
= {o}, [l]L
S
= {l, la}, [c]R
S
= {c}, [o]R
S
=
{o, co}, [l]R
S
= {l, ol, col}, and so on.
Note that all strings that are not in Factor(S) form one equivalence class under ≡L
S .
This equivalence class is called the degenerate class. All other classes are called nondegenerate.
It follows from the deﬁnition of ≡L
S that, if two strings x, y ∈ Factor(S) are in
the same equivalence class under ≡L
S , then either x is a preﬁx of y, or vice versa. Therefore,
each equivalence class in ≡L
S other than the degenerate class has a unique longest member.
A similar argument holds for ≡R
S .
Deﬁnition 2.3 For any string x ∈ Factor(S),
S
−→x (resp.
S
←−x ) denotes the unique longest
member of [x]L
S
(resp. [x]R
S
). We call
S
−→x (resp.
S
←−x ) the representative of [x]L
S
(resp. [x]R
S
).
For any string x ∈ Factor(S), there uniquely exist strings α and β such that
S
←−x = αx
and
S
−→x = xβ. In the running example,
S
−→c = co,
S
−→o = o,
S
−→
l = la,
S
←−c = c,
S
←−o = co,
S
←−
l = col.
Deﬁnition 2.4 For any string x ∈ Factor(S), let
S
←→x be the string αxβ (α, β ∈ Σ∗
) such
that
S
←−x = αx and
S
−→x = xβ.
What
S
←→x = αxβ implies is that:
(1) Every time x occurs in w ∈ S, it is preceded by α and followed by β within w.
(2) α and β are the longest strings satisfying (1).
7
In the running example,
S
←→c = co and
S
←→o = co,
S
←→
l = cola.
Deﬁnition 2.5 Let x, y be arbitrary strings in Σ∗
. We write x ≡S y if,
1. x, y ∈ Factor(S) and
S
←→x =
S
←→y , or
2. x /∈ Factor(S) and y /∈ Factor(S).
The equivalence class of a string x ∈ Σ∗
with respect to ≡S is denoted by [x]S.
For any string x ∈ Factor(S),
S
←→x is the unique longest member of [x]S, and is called the
representative of [x]S.
Lemma 2.1 (Blumer et al. [8]) The equivalence relation ≡S is the transitive closure
of the relation ≡L
S ∪ ≡R
S .
It follows from the lemma above that
Corollary 2.1 For any string x ∈ Factor(S),
S
←→x =
S
−−→
(
S
←−x ) =
S
←−−
(
S
−→x ).
The number of the strings in Factor(S) is O( S 2
). However, the number of strings x
such that x =
S
−→x (or
S
←−x ) is O( S ). The following lemma gives tighter upper bounds.
Lemma 2.2 (Blumer et al. [8]) Assume that S > 1. The number of the non-degenerate
equivalence classes in ≡L
S (or ≡R
S ) is at most 2 S −1. The number of the non-degenerate
equivalence classes in ≡S is at most S + |S|.
If S is a singleton {w} where w ∈ Σ∗
, throughout this paper the notations deﬁned for
S are written by using w instead of S, as,
≡L
w, ≡R
w, ≡w, [(·)]L
w
, [(·)]R
w
, [(·)]w,
w
−→
(·),
w
←−
(·),
w
←→
(·).
8
2.1 Graphs and Trees
Let V be a ﬁnite set of nodes. Let E be a ﬁnite set of edges, namely, that of pairs of
nodes. Then G = (V, E) is said to be a directed graph.
In a directed graph G = (E, V ), the sequence of nodes u0, u1, . . . , un is called a path
if (ui−1, ui) ∈ E for each i (1 ≤ i ≤ n). The depth of the path is n. A path with u0 = un
is called a cycle. If G has no cycle, it is called a directed acyclic graph (DAG for short).
An edge (u, v) is said to be an out-going edge of u and an in-coming edge of v. The
number of in-coming (resp. out-going) edges of a node u is said to be the in-degree (resp.
out-degree) of u.
A directed graph T with the following properties is called a tree.
- There uniquely exists a node of in-degree 0 in T. It is called the root node.
- For any node u in T, there uniquely exists a path from the root node to u.
If (u, v) ∈ T, then u is said to be a parent node of v, and v is said to be a child node of
u. Any node in a tree other than the root node has its unique parent node. A node of
out-degree zero is called a leaf node. A node that is neither the root node nor a leaf node
is called an internal node. If there is a path from a node u to a node v, u is said to be an
ancestor of v, and v is said to be a descendant of u.
2.1.1 Tries
We here consider an edge-labeled tree T = (V, E) with E ⊆ V × Σ+
× V where the
second component of each edge represents its label. Let S be a set of strings. The tree
representing all strings in S is called the trie and denoted by Trie(S).
Deﬁnition 2.6 Trie(S) is the tree (V, E) such that
V = {x | x ∈ Preﬁx(S)},
E = {(x, a, xa) | x, xa ∈ Preﬁx(S) and a ∈ Σ}.
If S has the preﬁx property, each string in S is represented by a leaf node in Trie(S).
It is sometimes favorable if all strings are associated with leaf nodes. In such case, we
consider the set S such that
S = {wi$i | wi ∈ S and $i /∈ Σ for 1 ≤ i ≤ |S|}.
9
For any set S of strings, S has the preﬁx property. Hence every string in S is represented
by a lead node in Trie(S ).
2.2 Deterministic Finite State Automata
A deterministic ﬁnite automaton (DFA for short) is a quintuplet M = (Q, Σ, δ, q0, F),
where;
Q is a non-empty set. Its elements are called states.
Σ is an alphabet.
δ is a function Q × Σ → Q. It is called the state-transition function.
q0 ∈ Q is the initial state.
F is a subset of Q. Its elements are called accepting states.
We extend the state-transition function δ : Q × Σ → Q to ˆδ : Q × Σ∗
→ Q, as follows.



ˆδ(q, ε) = q (q ∈ Q)
ˆδ(q, xa) = δ(ˆδ(q, x), a) (q ∈ Q, a ∈ Σ, x ∈ Σ∗
)
Let w be an arbitrary string in Σ∗
. If ˆδ(q0, w) ∈ F, we say that w is accepted by DFA M.
We can examine in O(|w|) time whether or not w is accepted by DFA M.
2.3 Index Structures for Text Strings
In this section, we recall four index structures, the suﬃx trie, the suﬃx tree, the directed
acyclic word graph (DAWG), and the compact directed acyclic word graph (CDAWG)
for a set S of strings, denoted by STrie(S), STree(S), DAWG(S), and CDAWG(S),
respectively. All these structures represent every string x ∈ Factor(S). We deﬁne them
as edge-labeled graphs (V, E) with E ⊆ V × Σ+
× V where the second component of each
edge represents its label.
We also deﬁne the suﬃx links of each index structure. Suﬃx links are kinds of failure
function often utilized for time-eﬃcient construction of the index structures [58, 43, 55,
7, 8, 16].
10
2.3.1 Suﬃx Tries
Deﬁnition 2.7 STrie(S) is the tree (V, E) such that
V = {x | x ∈ Factor(S)},
E = {(x, a, xa) | x, xa ∈ Factor(S) and a ∈ Σ},
and its suﬃx links are the set
F = {(ax, x) | x, ax ∈ Factor(S) and a ∈ Σ}.
Each string x ∈ Factor(S) has a one-to-one correspondence to a certain node in
STrie(S). The root node of STrie(S) corresponds to ε. If Suﬃx(S) − {ε} has the preﬁx
property, every string in Suﬃx(S) − {ε} is represented by a leaf node in STrie(S).
If |S| = 1, STrie(S) is also written as STrie(w) where S = {w}. STrie(coco) and
STrie(cocoa) are displayed in Figure 2.1 together with their suﬃx links.
oc
o
c
c
o
o
oc
o
c
c
o
o
a
a
a
a
a
Figure 2.1: STrie(coco) on the left, and STrie(cocoa) on the right. The solid arrows
represent the edges, while the dotted arrows denote the suﬃx links.
2.3.2 Suﬃx Trees
Deﬁnition 2.8 STree(S) is the tree (V, E) such that
V = {
S
−→x | x ∈ Factor(S)},
E = {(
S
−→x , aβ,
S
−→xa) | x, xa ∈ Factor(S), a ∈ Σ, β ∈ Σ∗
,
S
−→xa = xaβ, and
S
−→x =
S
−→xa},
and its suﬃx links are the set
F = {(
S
−→ax,
S
−→x ) | x, ax ∈ Factor(S), a ∈ Σ, and
S
−→ax = a ·
S
−→x }.
11
The root node of STree(S) is associated with
S
−→ε . If Suﬃx(S) − {ε} has the preﬁx
property, every string in Suﬃx(S) − {ε} is represented by a leaf node in STree(S). If
|S| = 1, STree(S) is also written as STree(w) where S = {w}.
The node set of STree(S) is a subset of that of STrie(S), as seen in the deﬁnitions. It
means that a string in Factor(S) might be represented on an edge in STree(S). In this
case, we say that the string is represented in an implicit node. Conversely, every string in
the node set V of STree(S) is said to be represented in an explicit node. For example, in
STree(coco) of Figure 2.2, string c is represented by an implicit node, while string co is
on an explicit node.
STree(S) can be seen as the compacted version of STrie(S) with “
S
−→
(·) operation”.
See STrie(cocoa) in Figure 2.1 and STree(cocoa) in Figure 2.2. STree(cocoa) can be
obtained by removing any internal nodes of out-degree one in STrie(cocoa), and suﬃx
links associated with the removed nodes are also deleted. However, this approach cannot
derive STree(coco) from STrie(coco) (see Figure 2.1 and Figure 2.2). That is, even if a
node
S
−→x is of out-degree one in STrie(S), it is not removed if
S
−→x ∈ Suﬃx(S).
o
c
o
c
c
o
o
o
c
o
c
c
o
o
a
a
a
a
a
Figure 2.2: STree(coco) on the left, and STree(cocoa) on the right. The solid arrows
represent the edges, while the dotted arrows denote the suﬃx links.
Theorem 2.1 (McCreight [43]) Let STree(S) = (V, E). Assume S > 1. Then
|V | ≤ 2 S + |S| and |E| ≤ 2 S + |S| − 1.
12
2.3.3 DAWGs
Deﬁnition 2.9 DAWG(S) is the directed acyclic graph (V, E) such that
V = {[x]R
S
| x ∈ Factor(S)},
E = {([x]R
S , a, [xa]R
S ) | x, xa ∈ Factor(S) and a ∈ Σ},
and its suﬃx links are the set
F = {([ax]R
S
, [x]R
S
) | x, ax ∈ Factor(S), a ∈ Σ, and [ax]R
S
= [x]R
S
}.
The node [ε]R
S is called the source node. A node of out-degree zero is called a sink
node of DAWG(S). If Suﬃx(S) − {ε} has the preﬁx property, then there exactly exist
S sink nodes, each of which represents Suﬃx(w) for each w ∈ S. If |S| = 1, DAWG(S)
is also written as DAWG(w) where S = {w}.
We deﬁne the length of a node [x]R
S
by |
S
←−x |. Suppose that
S
←−x · a ∈ [y]R
S
for x, y ∈
Factor(S) and a ∈ Σ. If length([y]R
S
) = length([x]R
S
) + |a| = length([x]R
S
) + 1 (in other
words, if
S
←−x · a =
S
←−y ), the edge ([x]R
S
, a, [y]R
S
) is said to be solid. Otherwise, it is said to be
non-solid. For example, in DAWG(w) of Figure 2.3 where w = coco, edge ([c]R
w, o, [co]R
w)
is solid, whereas edge ([ε]R
w
, o, [co]R
w
) is non-solid.
As seen in the deﬁnition, each node of DAWG(S) is a non-degenerate equivalence
class with respect to ≡R
S . One can see that nodes of STrie(cocoa) are ‘merged’ by the
equivalence class under ≡R
S . In this sense, DAWG(S) can be seen as the minimized version
of STrie(S) with “[(·)]R
S
operation”.
Suppose that ax is the shortest member of [ax]R
S , for some character a ∈ Σ and string
x ∈ Factor(S). Then the suﬃx link of node [ax]R
S
in DAWG(S) points to the node [x]R
S
(for example, see nodes [oco]R
S and [co]R
S of DAWG(cocoa) in Figure 2.3).
Theorem 2.2 (Blumer et al. [8]) Let DAWG(S) = (V, E). Assume S > 1. Then
|V | ≤ 2 S − 1 and |E| ≤ 3 S − 3.
2.3.4 CDAWGs
Deﬁnition 2.10 CDAWG(S) is the directed acyclic graph (V, E) such that
V = {[
S
−→x ]R
S
| x ∈ Factor(S)},
E = {([
S
−→x ]R
S , aβ, [
S
−→xa]R
S ) | x, xa ∈ Factor(S), a ∈ Σ, β ∈ Σ∗
,
S
−→xa = xaβ, and
S
−→x =
S
−→xa},
and its suﬃx links are the set
F = {([
S
−→ax]R
S , [
S
−→x ]R
S ) | x, ax ∈ Factor(S), a ∈ Σ,
S
−→ax = a ·
S
−→x , and [
S
−→x ]R
S = [
S
−→ax]R
S }.
13
o
c
o
c
o
o
c
o
c
o
a
a
a
Figure 2.3: DAWG(coco) on the left, and DAWG(cocoa) on the right. The solid arrows
represent the edges, while the dotted arrows denote the suﬃx links.
The node [
S
−→ε ]R
S
is called the source node. A node of out-degree zero is called a sink
node of CDAWG(S). If Suﬃx(S) − {ε} has the preﬁx property, then there exactly exist
S sink nodes, each of which represents Suﬃx(w) for each w ∈ S. If |S| = 1, CDAWG(S)
is also written as CDAWG(w) where S = {w}.
We deﬁne the length of a node [
S
−→x ]R
S by
S
←−−
(
S
−→x ) = |
S
←→x |. Suppose that
S
←→x · α ∈ [
S
−→y ]R
S
for x, y ∈ Factor(S) and α ∈ Σ∗
. If length([
S
−→y ]R
S ) = length([
S
−→x ]R
S ) + |α| (in other words,
if
S
←→x · α =
S
←→y ), the edge ([
S
−→x ]R
S
, α, [
S
−→y ]R
S
) is said to be solid. Otherwise, it is non-solid.
In CDAWG(w) of Figure 2.4 where w = coco, edge ([
w
−→ε ]R
w, co, [
w
−→co]R
w) is solid, while edge
([
w
−→ε ]R
w
, o, [
w
−→co]R
w
) is non-solid.
It follows from the deﬁnition that CDAWG(S) is the minimization of STree(S) with
“[(·)]R
S
operation”. In fact, CDAWG(cocoa) in Figure 2.4 can be obtained by ‘merging’
the isomorphic subtrees in STree(cocoa). Similarly, CDAWG(S) can also be seen as the
compaction of DAWG(S) with “
S
−→
(·) operation”, as seen in DAWG(cocoa) in Figure 2.3
and CDAWG(cocoa).
Suppose that ay =
S
−→ax is the shortest member of [
S
−→ax]R
S
for some character a ∈ Σ and
strings x, y ∈ Factor(S). Then the suﬃx link of node [
S
−→ax]R
S points to the node [y]R
S , where
y =
S
−→y .
Theorem 2.3 (Blumer et al. [8]) Let CDAWG(S) = (V, E). Assume S > 1. Then
|V | ≤ S + |S| and |E| ≤ 2 S + |S| − 1.
14
o
c
o
c
o
o
c
o
c
o
a
a
a
Figure 2.4: CDAWG(coco) on the left, and CDAWG(cocoa) on the right. The solid
arrows represent the edges, while the dotted arrows denote the suﬃx links.
2.3.5 Suﬃx Trees Redeﬁned
For any non-empty set S of strings, let STree (S) denote the tree obtained by removing
all internal nodes of out-degree one from STree(S). As seen in Figure 2.5, nodes
w
−→co and
w
−→o in STree(coco) are omitted in STree (coco) together with their suﬃx links. Ukkonen’s
suﬃx tree construction algorithm [55] builds STree (S), not STree(S). The following
preparation is necessary for the formal deﬁnition of STree (S).
We introduce a relation XS over Σ∗
such that
XS = (x, xa) x ∈ Factor(S) and a ∈ Σ is the unique character such that xa ∈ Factor(S) ,
and let ≡ L
S be the equivalence closure of XS, i.e., the smallest superset of XS that is
symmetric, reﬂexive, and transitive. It can be readily shown that ≡L
S is a reﬁnement of
≡ L
S, namely, every equivalence class under ≡ L
S is a union of one or more equivalence classes
in ≡L
S. For a string x ∈ Factor(S), let
S
=⇒
x denote the longest string in the equivalence
class to which x belongs under the equivalence relation ≡ L
S.
Proposition 2.1 For any string x ∈ Σ∗
,
S
−→x is a preﬁx of
S
=⇒
x . If
S
−→x =
S
=⇒
x , then
S
−→x ∈
Suﬃx(S).
Proposition 2.2 If set Suﬃx(S) − {ε} satisﬁes the preﬁx property,
S
=⇒
x =
S
−→x for any
string x ∈ Factor(S).
We are now ready to deﬁne STree (S).
15
Deﬁnition 2.11 STree (S) is the tree (V, E) such that
V = {
S
=⇒
x | x ∈ Factor(S)},
E = {(
S
=⇒
x , aβ,
S
=⇒
xa) | x, xa ∈ Factor(S), a ∈ Σ, β ∈ Σ∗
,
S
=⇒
xa = xaβ, and
S
=⇒
x =
S
=⇒
xa},
and its suﬃx links are the set
F = {(
S
=⇒
ax,
S
=⇒
x ) | x, ax ∈ Factor(S), a ∈ Σ, and
S
=⇒
ax = a ·
S
=⇒
x }.
This deﬁnition is the same as the one obtained by replacing “
S
−→
(·) operation” with “
S
=⇒
(·)
operation” in Deﬁnition 2.8.
Corollary 2.2 If Suﬃx(S) − {ε} has the preﬁx property, STree (S) = STree(S).
As previously stated, Ukkonen’s algorithm constructs STree (S). Even in case the
set Suﬃx(S) − {ε} does not have the preﬁx property, STree (S ) = STree(S ) where
S = {wi$i | wi ∈ S and $i /∈ Factor(S) for 1 ≤ i ≤ |S|}.
oc
o
c
c
o
o
o
c
o
c
c
o
o
a
a
a
a
a
Figure 2.5: STree (coco) on the left, and STree (cocoa) on the right. The solid arrows
represent the edges, while the dotted arrows denote the suﬃx links.
2.3.6 CDAWGs Redeﬁned
Similarly to STree (S), for any non-empty set S of strings, CDAWG (S) has no internal
node of out-degree one. Our algorithm to be introduced in Section 3.4 and Section 3.5
constructs CDAWG (S).
16
Deﬁnition 2.12 CDAWG (S) is the directed acyclic graph (V, E) such that
V = {[
S
=⇒
x ]R
S
| x ∈ Factor(S)},
E = {([
S
=⇒
x ]R
S
, aβ, [
S
=⇒
xa]R
S
) | x, xa ∈ Factor(S), a ∈ Σ, β ∈ Σ∗
,
S
=⇒
xa = xaβ, and
S
=⇒
x =
S
=⇒
xa},
and its suﬃx links are the set
F = {([
S
=⇒
ax]R
S
, [
S
=⇒
x ]R
S
) | x, ax ∈ Factor(S), a ∈ Σ,
S
=⇒
ax = a ·
S
=⇒
x , and [
S
=⇒
x ]R
S
= [
S
=⇒
ax]R
S
}.
As in case of STree (S) and STree(S), the deﬁnition equals the one obtained by substituting
“
S
−→
(·) operation” with “
S
=⇒
(·) operation” in Deﬁnition 2.10.
Corollary 2.3 If Suﬃx(S) − {ε} has the preﬁx property, CDAWG (S) = CDAWG(S).
Even in case that Suﬃx(S) − {ε} does not have the preﬁx property, CDAWG (S ) =
CDAWG(S ) where S = {wi$i | wi ∈ S and $i /∈ Factor(S) for 1 ≤ i ≤ |S|}.
o
c
o
c
o
c
o
o
c
o
c
o
a
a
a
Figure 2.6: CDAWG (coco) on the left, and CDAWG (cocoa) on the right. The solid
arrows represent the edges, while the dotted arrows denote the suﬃx links.
17
Chapter 3
On-Line Construction of Compact
Directed Acyclic Word Graphs
3.1 Introduction
Several diﬀerent string problems, like those deriving from the analysis of biological sequences,
can be eﬃciently solved by means of a suitable index structure. The most widely
known and studied structure of this kind seems to be the suﬃx tree [58, 43, 11, 55, 20, 38,
54], perhaps because there are a “myriad” [2] of applications for it. For any string w the
suﬃx tree of w requires only O(n) space and can be built in O(n) time, where n is the
length of w. Although its theoretical space complexity is linear, much attention has been
devoted to the reduction of the practical space requirement of the structure. This has
led to the introduction of more space-economical index structures, like suﬃx arrays [41],
suﬃx cacti [35], compact suﬃx arrays [40], compressed suﬃx arrays [19, 47], and so on.
Blumer et al. [7] introduced the directed acyclic word graph (DAWG) for a string,
which is the smallest ﬁnite state automaton to recognize all suﬃxes of the string [12].
DAWGs are also involved in several combinatorial algorithms on strings [14, 8, 24, 5, 56],
since they serve as indexes of the string, as well as other index structures such as suﬃx
tries and suﬃx trees.
In this work, we focus our attention on the compact directed acyclic word graph
(CDAWG) ﬁrst introduced by Blumer et al. in [8]. Crochemore and V´erin displayed a
relationship among suﬃx tries, suﬃx trees, DAWGs, and CDAWGs [16]. Suﬃx trees
(resp. DAWGs) are the compacted (resp. minimized) version of suﬃx tries, as shown
18
c
o
a
o
c
o
a
a
c
o
a
o
c
o
a
a
o
c
a
o
o
c
a
o
a
c
a a
Suffix Tree
CDAWG
DAWG
minimization
minimization compaction
compaction
o
c
a
o
o
c
a
o
a
c
a a
Suffix Trie
Figure 3.1: Relationship among the suﬃx trie, the suﬃx tree, the DAWG, and the
CDAWG for string cocoa.
in Figure 3.1. Similarly, CDAWGs can be obtained by either compacting DAWGs or
minimizing suﬃx trees.
Not only in theory as stated above, but also in practice, CDAWGs provide signiﬁcant
reductions of the memory space required by suﬃx trees and DAWGs, as experimental results
have shown in [8, 16]. In Bioinformatics a considerable amount of DNA sequences has
to be processed eﬃciently, both in space and time. Therefore, from a practical viewpoint,
CDAWGs could also play an important role in Bioinformatics.
The ﬁrst algorithm to construct the CDAWG for a given string w was presented in [7].
It once builds the DAWG of w, then removes every node of out-degree one and modiﬁes
its edges accordingly, so that the resulting structure becomes the CDAWG for w. It runs
in liner time, but its main drawback is the construction of the DAWG as an intermediate
structure, which takes larger space. A solution to this matter was provided by Crochemore
and V´erin [16]: a linear-time algorithm to construct the CDAWG for a string directly.
Their algorithm is based on McCreight’s suﬃx tree construction algorithm [43]. Both
algorithms are oﬀ-line, that is, the whole input string has to be known beforehand. Thus,
the structure (suﬃx tree or CDAWG) has to be rebuilt from scratch, if a new character
is added to the input string. Table 3.1 summarizes some properties of typical algorithms
to construct index structures. As seen there, a missing piece, which we have been looking
for, is an on-line algorithm for constructing CDAWGs.
In this chapter, we present a new linear-time algorithm to directly construct the
19
Index Structure Algorithm linear time on-line multiple strings
suﬃx tries Ukkonen [55]
√ √
suﬃx trees
Weiner [58]
McCreight [43]
Ukkonen [55]
√
√
√ √ √
DAWGs
Blumer et al. [7]
Blumer et al. [8]
√
√
√
√ √
CDAWGs
Blumer et al. [8]
Crochemore and V´erin [16]
√
√
√
Table 3.1: The properties of algorithms for the construction of index structures.
CDAWG for a given string, which is based on Ukkonen’s suﬃx tree construction algorithm
[55]. Our algorithm is on-line: it processes the characters of the input string from
left to right, one by one, with no need to know the whole string beforehand. Our algorithm
would be more eﬃcient than the one in [16], in the sense that our algorithm allows
us to update the input string. Furthermore, we show that the algorithm can be easily
applied to building the CDAWG for a set of strings. The CDAWG for a set of strings
can be constructed by the algorithm given in [8] which compacts the DAWG for the set.
However, the drawback of this approach is that, when a new string is added to the set,
the DAWG has to be built from scratch. Instead, our algorithm permits us the addition
of a new string to the set.
3.2 On-Line Construction of the Suﬃx Trie for a Single
String
The on-line CDAWG construction algorithm we will give later on is based on Ukkonen’s
on-line suﬃx tree construction algorithm [55]. Moreover, Ukkonen’s algorithm is based
on an intuitive on-line algorithm that constructs suﬃx tries. We ﬁrstly consider the case
that we are given a single string as an input for the algorithm.
For a string x ∈ Factor(w), let suf (x) denote the node reachable via the suﬃx link of
the node x. It derives from Deﬁnition 2.7 that suf (x) = y for some y ∈ Factor(w) such
that x = ay for some character a ∈ Σ. For the case that x = ε, let suf (ε) =⊥ where
⊥ is an auxiliary node called the bottom node. We suppose that there exists an edge
(⊥, Σ, ε), where the symbol Σ here means every character in the alphabet. Assuming that
the bottom node ⊥ corresponds to the inverse a−1
for any a ∈ Σ, the edge (⊥, Σ, ε) is
20
consistently deﬁned as well as other edges, since a−1
· a = ε. The auxiliary node ⊥ allows
us to formalize the algorithm avoiding the distinction between the empty suﬃx and other
non-empty suﬃxes (in other words, between the root node and other nodes). We leave
suf (⊥) undeﬁned.
The algorithm reads a given string w ∈ Σ∗
from left to right, while building STrie(w[1 :
i]) for 1 ≤ i ≤ |w|. It is easy to construct STrie(w[1 : i + 1]) by updating STrie(w[1 : i]).
What is necessary here is to insert suﬃxes of w[1 : i + 1] into STrie(w[1 : i]).
Deﬁnition 3.1 For an arbitrary string u ∈ Σ∗
and an arbitrary character a ∈ Σ,
the longest repeated suﬃx (the LRS for short) of ua is the longest element of the set
Factor(u) ∩ Suﬃx(ua).
It is guaranteed that the LRS always exists for any string u ∈ Σ∗
since the empty string
ε belongs to the set Factor(u) ∩ Suﬃx(ua) for any character a ∈ Σ.
The suﬃxes of w[1 : i + 1] can be divided into the following two groups, by the LRS
of w[1 : i + 1].
(1) Suﬃxes w[h : i + 1] for 1 ≤ h ≤ j where w[j + 1 : i + 1] is the LRS of w[1 : i + 1].
(2) Suﬃxes w[h : i + 1] for j + 1 ≤ h ≤ i + 2.
The group (2) is empty in case the LRS of w[1 : i + 1] = ε, that is, in case j + 1 = i + 2.
There is no need to newly insert any suﬃxes in the group (2), simply because they
have already been represented in STree (w[1 : i]). The algorithm creates a new node
corresponding to w[h : i + 1] for each h (1 ≤ h ≤ j), together with a new edge (w[h :
i], w[i + 1], w[h : i + 1]), by traversing suf (w[h : i]) to move to the next node w[h − 1 : j].
When it ﬁnds the node corresponding to the LRS w[j +1 : i], the algorithm stops and the
update then gets completed. The node with respect to the LRS w[j + 1 : i + 1] is called
the end point of STrie(w[1 : i + 1]).
The on-line construction of STrie(cocoa) is shown in Figure 3.2.
Theorem 3.1 (Ukkonen [55]) Assume Σ is a ﬁxed alphabet. For any string w ∈ Σ∗
,
STrie(w) can be constructed on-line and in O(|w|2
) time, using O(|w|2
) space.
21
oc
o
c
c
o
o
a
oc
o
c
c
o
o
a
a
a
a
Σ Σ
oc
o
c
c
Σ
oc
o
Σ
c
ΣΣ
c co coc
coco cocoa
ε
Figure 3.2: On-line construction of STrie(w) with w = cocoa.
3.3 On-Line Construction of the Suﬃx Tree for a Single
String
3.3.1 Informal Description
We ﬁrstly summarize Ukkonen’s suﬃx tree construction algorithm in the comparison
with the previous suﬃx trie algorithm. Figure 3.3 shows the on-line construction of
STree (cocoa). Focus on the update of STree (co) to STree (coc). Diﬀerently from that
of STrie(co) to STrie(coc), the edges leading to leaf nodes are automatically extended
with the new character c in STree (coc). This is feasible by the idea so-called open edges.
See the ﬁrst and second steps of the update of STree (coco) to STree (cocoa). The
gray star mark indicates the active point from which a new edge is created in each step.
After the new edge (co, a, coa) is inserted, the active point moves to the implicit node
for string o. In case of the suﬃx trie, it is possible to move there by traversing the suﬃx
link of node co. However, there is yet to be the suﬃx link of node co in the suﬃx tree.
Thereof, Ukkonen’s algorithm simulates the traversal of the suﬃx link as follows: First,
it goes up to the explicit parent node ε of node co which has its suﬃx link. After that, it
moves to the bottom node ⊥ via the suﬃx link of the root node, and then advances along
the path spelling out co. Note that the string co corresponds to the label of the edge the
22
active point went up backward. This way, in Ukkonen’s algorithm the active point moves
via ‘implicit’ suﬃx links. Since suﬃx links of leaf nodes are never utilized in Ukkonen’s
algorithm, it does not create any of them.
3.3.2 Ukkonen’s Algorithm
Ukkonen’s on-line suﬃx tree construction algorithm is based on the on-line algorithm to
build suﬃx tries recalled in Section 3.2. As stated in Deﬁnition 2.11, an edge of STree (w)
is labeled by a string α ∈ Factor(w). The key to achieve a linear-space implementation of
the suﬃx tree is to label the edge (
w
=⇒
x , α,
w
==⇒
xα ) in STree (w) by (k, p), such that w[k : p] = α.
An implicit node y ∈ Factor(w) can be represented by a pair (
w
=⇒
x , α) of an explicit
node
w
=⇒
x and a string α ∈ Factor(w) such that y =
w
=⇒
x · α. The pair (
w
=⇒
x , α) is called a
reference pair for the implicit node y. Note that explicit nodes can also be represented by
reference pairs. There can be one or more reference pairs for a node y. The reference pair
(
w
=⇒
x , α) for y in which |α| is minimized is called the canonical reference pair for y. The
reference pair can also be written as (
w
=⇒
x , (k, p)) such that w[k : p] = α.
Ukkonen’s algorithm reads a given string w ∈ Σ∗
from left to right, while building
STree (w[1 : i]) for 1 ≤ i ≤ |w|. Suppose that we from now on update STree (w[1 : i]) to
STree (w[1 : i+1]). The group (1) of the suﬃxes of w[1 : i+1], mentioned in the previous
section, can moreover be divided into two as follows by integer j .
(1-a) Suﬃxes w[l : i + 1] for 1 ≤ l ≤ j where w[j + 1 : i] is the LRS of w[1 : i].
(1-b) Suﬃxes w[ : i + 1] for j + 1 ≤ ≤ j.
We remark that all the suﬃxes of the group (1-a) are those represented by leaf nodes in
STree (w[1 : i]). Note that, for any l,
w[1:i]
====⇒
w[l : i] = w[l : i] and
w[1:i+1]
=======⇒
w[l : i + 1] = w[l : i + 1]. That
is, intuitively, every leaf node of STree (w[1 : i]) is also a leaf node in STree (w[1 : i + 1]).
This fact is crucial to Ukkonen’s algorithm in order that it automatically inserts those in
the group (1-a) into STree (w[1 : i + 1]), by means of open edges.
Suppose that (
w[1:i]
=⇒
x , α,
w[1:i]
==⇒
xα ) is an edge of STree (w[1 : i]) where
w[1:i]
==⇒
xα is a leaf node.
Letting k be the integer such that w[k : i] = α, it is feasible to label the edge by (k, ∞).
This way we need no explicit insertion of the suﬃxes of w[1 : i + 1] in the group (1-a).
The location from which a suﬃx w[ : i+ 1] with respect to the group (1-b) is inserted
is called the active point of STree (w[1 : i + 1]). The active point for w[1 : i + 1] begins at
23
the node w[j + 1 : i], where w[j + 1 : i] is the end point of STree (w[1 : i]). Assume we
are now inserting suﬃx w[ : i + 1] into STree (w[1 : i]), where j + 1 ≤ ≤ j. There are
two cases to consider for the active point.
(Case 1) The active point is on an explicit node w[ : i]. In this case,
w[1:i]
====⇒
w[ : i] =
w[1:i+1]
====⇒
w[ : i] = w[ : i].
Let x = w[ : i]. In this case a new edge (
w[1:i+1]
==⇒
x , α,
w[1:i+1]
===⇒
xα ) is created, where α =
w[i + 1 : i + 1]. Note
w[1:i+1]
===⇒
xα = w[ : i + 1]. The edge is actually labeled by (i + 1, ∞).
After that, the active point moves to the explicit node suf (
w[1:i+1]
==⇒
x ), corresponding to
w[ − 1 : i], in order to insert the next suﬃx w[ − 1 : i + 1].
(Case 2) The active point is on an implicit node w[ : i]. In this case,
w[1:i]
====⇒
w[ : i] = w[ : i] but
w[ :i+1]
====⇒
w[ : i] = w[ : i].
Let (
w[1:i]
=⇒
x , α) be the canonical reference pair for the active point, namely,
w[1:i]
=⇒
x · α =
w[ : i]. Focus on the edge (
w[1:i]
=⇒
x , αβ,
w[1:i]
==⇒
xαβ) where β = ε. The edge is replaced by
the edges (
w[1:i+1]
=⇒
x , α,
w[1:i+1]
==⇒
xα ) and (
w[1:i+1]
==⇒
xα , β,
w[1:i+1]
===⇒
xαβ ) where
w[1:i+1]
==⇒
xα is a new explicit node.
Then a new edge (
w[1:i+1]
==⇒
xα , γ,
w[1:i+1]
===⇒
xαγ ) is created, where γ = w[i + 1 : i + 1]. Note
w[1:i+1]
===⇒
xαγ = w[ : i + 1]. The edge is actually labeled by (i + 1, ∞).
After that, we need to move to the (implicit or explicit) node corresponding to
w[ − 1 : i], the next active point, but the table suf is yet to be computed for the
new node
w[1:i+1]
==⇒
xα . Thus we once move to its parent node
w[1:i+1]
=⇒
x for which suf (
w[1:i+1]
=⇒
x )
must have already been computed. Let suf (
w[1:i+1]
=⇒
x ) =
w[1:i+1]
=⇒
y , where there exists some
character a such that
w[1:i+1]
=⇒
x = a·
w[i+1]
=⇒
y . Note that
w[1:i+1]
=⇒
y ·α = w[ −1 : i]. We go down
from the node
w[1:i+1]
=⇒
y with spelling out α, to obtain the canonical reference pair for
the active point w[ − 1 : i]. The node w[ − 1 : i] either is already, or will in this
step become, explicit. The value of suf (
w[1:i+1]
==⇒
xα ) is then set to w[ − 1 : i]. This way
the algorithm ‘simulates’ the suﬃx-link-traversal of suﬃx tries.
Figure 3.3 shows the on-line construction of STree (cocoa).
24
oc
o
c
c
o
o
Σ
coco
cocoa
oc
o
c
c
o
o
a
a
a
Σ
o
c
o
c
c
Σ
coc
o
c
o
Σ
co
c
Σ
c
Σ
ε
o
c
o
c
c
o
o
a
a
a
a
Σ
a
o
c
o
c
c
o
o
a
a
a
a
Σ
o
c
o
c
c
o
o
a
a
Σ
Figure 3.3: On-line construction of STree (w) with w = cocoa. The star represents the
active point for each step.
A pseudo-code for Ukkonen’s algorithm is shown in Fig 3.4. There, function canonize is
a routine to canonize a given reference pair. Function check end point is one that returns
true if a given reference pair is the end point, and false otherwise. Function split edge
splits an edge into two, by creating a new explicit node at the position to which the given
reference pair corresponds.
Theorem 3.2 (Ukkonen [55]) Assume Σ is a ﬁxed alphabet. For any string w ∈ Σ∗
,
STree (w) can be constructed on-line and in O(|w|) time, using O(|w|) space.
3.4 On-Line Construction of the CDAWG for a Single
String
3.4.1 Informal Description
Before delving into the technical detail of the algorithm for on-line construction of CDAWGs,
we informally describe how a CDAWG is built on-line. See Figure 3.5 that shows the
on-line construction of CDAWG (cocoa), in comparison with Figure 3.3 displaying the
on-line construction of STree (cocoa). Compare CDAWG (co) and STree (co). While
25
Algorithm for on-line construction of STree (w$)
in alphabet Σ = {w[−1], w[−2], . . .w[−m]}.
/* $ is the end-marker appearing nowhere in w. */
1 create nodes root and ⊥;
2 for j := 1 to m do create edge (⊥, (−j, −j), root);
3 suf (root) := ⊥;
4 (s, k) := (root, 1); i := 0;
5 repeat
6 i := i + 1;
7 (s, k) := update(s, (k, i));
8 until w[i] = $;
function update(s, (k, p)): pair of integers;
/* (s, (k, p − 1)) is the canonical reference pair for the active point. */
1 c := w[p]; oldr := nil;
2 while not check end point(s, (k, p − 1), c) do
3 if k ≤ p − 1 then r := split edge(s, (k, p − 1)); /* implicit case. */
4 else r := s; /* explicit case. */
5 create node r ; create edge (r, (p, ∞), r );
6 if oldr = nil then suf (oldr) := r;
7 oldr := r;
8 (s, k) := canonize(suf (s), (k, p − 1));
9 if oldr = nil then suf (oldr) := s;
10 return canonize(s, (k, p));
function check end point(s, (k, p), c): boolean;
1 if k ≤ p then /* implicit case. */
2 let (s, (k , p ), s ) be the w[k]-edge from s;
3 return (c = w[k + p − k + 1]);
4 else return (there is a c-edge from s);
function canonize(s, (k, p)): pair of node and integers;
1 if k > p then return (s, k); /* explicit case. */
2 ﬁnd the w[k]-edge (s, (k , p ), s ) from s;
3 while p − k ≤ p − k do
4 k := k + p − k + 1; s := s ;
5 if k ≤ p then ﬁnd the w[k]-edge (s, (k , p ), s ) from s;
6 return (s, k);
function split edge(s, (k, p)): node;
1 let (s, (k , p ), s ) be the w[k]-edge from s;
2 create node r;
3 replace the edge by edges (s, (k , k + p − k), r) and (r, (k + p − k + 1, p ), s );
4 return r;
Figure 3.4: Ukkonen’s on-line algorithm for constructing suﬃx trees.
26
Σ
c
ΣΣ
c coε
o
c
o
Σ
coco
o
c
o
c
o
c
o
Σ
coc
o
c
o
c
c
cocoa
Σ
o
c
o
c
o
c
o
a
a
Σ
o
c
o
c
o
a
a
c
o
a
Σ
o
c
o
c
o
a
a
Σ
o
c
o
c
o
a
a
a
Figure 3.5: On-line construction of CDAWG (w) with w = cocoa. The star mark represents
the active point for each step.
strings co and o are separately represented in STree (co), they are in the same node in
CDAWG (co). The destination of any open edge of a CDAWG is all the same, the sink
node. Open edges of a CDAWG are also automatically extended, as well as those of a
suﬃx tree (see CDAWG (coc) and CDAWG (coco)).
Focus on the ﬁrst step of the update of CDAWG (coco) to CDAWG (cocoa). String
co there gets to be explicitely represented, and at the second step the active point is
on implicit node o. In case of the construction of STree (cocoa), edge (ε, ocoa, ocoa) is
split into two edges (ε, o, o) and (o, coa, ocoa), and then an open edge (o, a, oa) is newly
created. However, in case of the CDAWG, edge (ε, ocoa, ocoa) is redirected to node co,
and the label is simultaneously modiﬁed. Since strings co and o are equivalent under the
equivalent relation ≡R
cocoa, they are merged into a single node in CDAWG (cocoa).
3.4.2 The Algorithm
The algorithm presented in this section for the on-line construction of CDAWGs behaves
similarly to Ukkonen’s algorithm. Let u = w[1 : i] and ua = w[1 : i + 1], namely,
a = w[i + 1]. The diﬀerence between them is summarized as follows.
27
- All the suﬃxes in the group (1) are equivalent under ≡R
ua. Thus all of them are
represented in the sink node [
ua
=⇒
ua]R
ua. Namely, the destinations of the open edges are
all the same. According to this property, we can generalize the idea of open edges
as follows. For any open edge (s, (k, ∞), t) of CDAWG (w) where t denotes the sink
node [
ua
=⇒
ua]R
ua
, we actually implement it as (s, (k, e), t) where e is a global variable that
denotes |ua|. Thus, when a new character added after u, we can extend all open
edges only with increasing the value of e by 1. Obviously, it only takes O(1) time.
- Consider (Case 2). There can be integers 1, 2 with j + 1 ≤ 1 < 2 ≤ j such
that w[ 1 : i] ≡R
ua w[ 2 : i]. In such case, they are merged into a single explicit node
[
ua
=====⇒
w[ 1 : i]]R
ua, during the update of CDAWG (u) to CDAWG (ua). The equivalence
test is performed on the basis of Lemma 3.1 to be given in the sequel.
- Consider strings x, y ∈ Factor(u) such that
u
=⇒
x = x and
u
=⇒
y = y. Assume that
x ≡R
u y, that is, they are represented in the same explicit node [x]R
u in CDAWG (u).
Note that, however, x, y might not be equivalent under ≡R
ua. When CDAWG (u)
is updated to CDAWG (ua), then the node has to be separated into two nodes
[x]R
ua
and [y]R
ua
. Since this node separation happens only when x /∈ Suﬃx(ua) but
y ∈ Suﬃx(ua), we can do this procedure after we ﬁnd the end point. The condition
of the node separation will be given later on, in Lemma 3.2.
Merging Implicit Nodes.
As mentioned above, it can happen that two or more nodes implicit in CDAWG (u)
are merged into one explicit node in CDAWG (ua). As a concrete example, we show
in Figure 3.7 the snapshot of the conversion of CDAWG (u) into CDAWG (ua) with
u = abcabcab and a = a. It can be observed that the implicit nodes for abcab, bcab,
and cab are merged into a single explicit node, and the implicit nodes for ab and b are
also merged into another single explicit node. The examination whether to merge implicit
nodes can be done by testing the equivalence of two nodes under the equivalence relation
≡R
ua. The equivalence test can be performed on the basis of the following proposition and
lemma.
Proposition 3.1 Let x ∈ Factor(w) for a string w, and let z =
w
←→x . Then, string x
occurs within string z exactly once.
28
b
a
b
c
a
b
c
a
b
c
a
b
c
a
b
a
c
a
b
a
c
a
b
a
a
aa
a a
b
a
b
c
a
b
c
a
b
c
a
b
c
a
b
c
a
b
c
a
b
c
a
b
b
a
c
a
b
b
c
a
b
a
a
a
c
c
a
a
b
b
b
a
c
c
a
b
b
b
a
c
c
a
b
b
a
Figure 3.6: Comparison of conversions. One is from STree (u) to STree (ua), while the
other is from CDAWG (u) to CDAWG (ua) for u = abcabcab and a = a. The black
circles represent implicit nodes to be merged in the next step, connected by implicit suﬃx
links corresponding to the traversal by the active point.
Lemma 3.1 Let w ∈ Σ∗
. For any strings x, y ∈ Factor(w) with y ∈ Suﬃx(x),
x ≡R
w y ⇔ [
w
−→x ]R
w = [
w
−→y ]R
w.
Proof. If x ≡R
w y, we have
w
←−x =
w
←−y by Deﬁnition 2.3. By Corollary 2.1, we know
w
−−→
(
w
←−x ) =
w
←−−
(
w
−→x ) and
w
−−→
(
w
←−y ) =
w
←−−
(
w
−→y ), which yield
w
←−−
(
w
−→x ) =
w
←−−
(
w
−→y ). Again by Deﬁnition 2.3, we have
[
w
−→x ]R
w = [
w
−→y ]R
w.
Conversely, suppose [
w
−→x ]R
w
= [
w
−→y ]R
w
. Recall that
w
←→x =
w
←−−
(
w
−→x ) by Corollary 2.1 and
w
←−−
(
w
−→x ) is
the unique longest member of [
w
−→x ]R
w. Similarly,
w
←→y is the unique longest member of [
w
−→y ]R
w.
Thus we have
w
←→x =
w
←→y . Let z =
w
←→x =
w
←→y . Then z = αxβ for some strings α and β.
Since y is a suﬃx of x, there exists a string δ such that x = δy. We thus have z = αδyβ.
This occurrence of y in z must be the only one due to Proposition 3.1. Since
w
←→y = αδyβ,
we conclude that every occurrence of y within w must be preceded by δ. Thus we have
29
Σ
c
c
a
a
b
b
b
a
c
c
a
b
b
b
a
c
c
a
b
b
a
abcabcab abcabcaba
Σ
c
c
a
a
b
b
b
a
c
c
a
b
b
b
a
c
c
a
b
b
a
a
a
a
Σ
c
c
a
a
b
b
b
a
c
c
a
b
b
b
a
c
c
a
b
b
a
a
a
aa
Σ
c
c
a
a
b
b
b
a
c
a
b
b
c
c
a
b
b
a
a
a
a
Σ
c
c
a
a
b
b
b
a
c
a
b
b
c
a
b
a
a
Σ
c
c
a
a
b
b
b
a
c
a
b
b
c
a
b
a
a
a
Σ
c
a
b
b
a
c
a
b
b
c
a
b
a
a
a
Σ
c
a
b
b
a
c
a
b
b
c
a
b
a
a
a
Figure 3.7: Detailed conversion from CDAWG (u) to CDAWG (ua) for u = abcabcab and
a = a.
30
x ≡R
w y.
For any string x ∈ Factor(w), the equivalence class [
w
−→x ]R
w is the closest explicit child of
the node for x in CDAWG(w). Thus we can test the equivalence of two suﬃxes x, y of w
with Lemma 3.1.
The matter is that, for a string v ∈ Suﬃx(w), the node
w
−→v might not be explicit in
CDAWG (w). Namely, on the equivalence test, we might refer to the node [
w
=⇒
x ]R
w instead
of [
w
−→x ]R
w
. Nevertheless, it does not actually happen in our on-line manner in which suﬃxes
are processed in decreasing order of their length.
See CDAWG (u) shown on the right of Figure 3.6, where u = abcabcab. The black
points are the implicit nodes the active point traverses in the next step via ‘implicit’
suﬃx links. In CDAWG (u), [
u
==⇒
cab]R
u = [
u
=⇒
ab]R
u = [
u
=⇒
u ]R
u . However, in CDAWG (ua), cab ≡R
ua
ab where a = a. See Figure 3.7 in which the detail of the update of CDAWG (u) to
CDAWG (ua) is displayed. Notice that there is no trouble on merging the implicit nodes.
Separating Explicit Nodes.
When CDAWG (u) is updated to CDAWG (ua), an explicit node [
u
=⇒
x ]R
u
with x ∈ Factor(u)
might be separated into two explicit nodes [
ua
=⇒
x ]R
ua
and [
ua
=⇒
y ]R
ua
if x /∈ Suﬃx(ua), y ∈
Suﬃx(x), and y ∈ Suﬃx(ua). It is inherently the same ‘phenomenon’ as the node separation
occurring in the on-line construction of DAWGs [7]. Therefor we brieﬂy recall the
essence of the node separation of DAWGs. For u ∈ Σ∗
and a ∈ Σ, ≡R
ua is a reﬁnement of
≡R
u . Furthermore, we have the following lemma.
Lemma 3.2 (Blumer et al. [7]) Let u ∈ Σ∗
and a ∈ Σ. Let z be the LRS of ua. For a
string x ∈ Factor(u), assume x =
u
←−x . Then,
[x]R
u
=
[x]R
ua ∪ [z]R
ua, if z ∈ [x]R
u and x = z;
[x]R
ua
, otherwise.
As stated in the above lemma, we need only to care about the node [x]R
u
where z ∈ [x]R
u
and z is the LRS of ua. Namely only one node can be separated when a DAWG is updated
with a new character added. If z =
u
←−x , the node is not separated (the latter case). If
z =
u
←−x , it is separated into two nodes [x]R
ua and [z]R
ua when DAWG(u) is updated to
DAWG(ua) (the former case). We examine whether z =
u
←−x or not by checking the length
of
u
←−x and z, as follows. Let y ∈ Factor(u) be the string such that
u
←−y · a = z. Note that
31
there then exists an edge ([y]R
u
, a, [x]R
u
). Then,
z =
u
←−x ⇔ length([y]R
u ) + |a| = length([x]R
u ), and
z =
u
←−x ⇔ length([y]R
u
) + |a| < length([x]R
u
).
If we deﬁne the length of the bottom node ⊥ by −1, no contradiction occurs even in case
that z = ε.
Figure 3.8 shows the conversion from DAWG(u) to DAWG(ua) with u = cocoa and
a = a. The LRS of the string cocoao is o, therefore we focus on edge ([ε]R
u
, o, [o]R
u
). Since
length([ε]R
u
) + |o| = 1 < length([o]R
u
) = 2, node [o]R
u
is separated into two nodes [co]R
ua
and
[o]R
ua, as shown in Figure 3.8.
Σ
o
c
o
c
o
a
a
a
cocoa cocoao
Σ
o
c
o
a
a
o
c
o
a a
c
Figure 3.8: The update of DAWG(u) to DAWG(ua), where u = cocoa and a = o.
Now we go back to the update of CDAWG (u) to CDAWG (ua). The test whether to
separate a node when a CDAWG is updated can also be done on the basis of Lemma 3.2
in the very similar way. Since only explicit nodes can be separated, we only need to care
about the case that z =
ua
=⇒
z where z is the LRS of ua. It is not diﬃcult to establish the
following lemma.
Lemma 3.3 Let w ∈ Σ∗
. Assume the LRS of w is z. Then, if z =
w
=⇒
z ,
w
=⇒
x =
w
−→x for any
string x ∈ Factor(w).
This lemma guarantees that the representative of [
u
=⇒
x ]R
u is equal to
u
←→x if the conditions
in the lemma are satisﬁed. We can therefore execute the node separation test as follows:
If z =
u
←→x , the node [x]R
u is not separated (the latter case). If z =
u
←→x , it is separated
32
into two nodes [x]R
ua
and [z]R
ua
when CDAWG (u) is updated to CDAWG (ua) (the former
case). We examine if z =
u
←→x or not by the length of
u
←→x and z in the following way. Let
y ∈ Factor(u) be the string such that
u
←→y · α = z for some string α ∈ Factor(u). Note
that there then exists an edge ([y]R
u
, α, [x]R
u
). Then,
z =
u
←→x ⇔ length([y]R
u
) + |α| = length([x]R
u
), and
z =
u
←→x ⇔ length([y]R
u
) + |α| < length([x]R
u
).
Figure 3.9 shows the update of CDAWG (u) to CDAWG (ua), where u = cocoa and
a = o. The LRS of the string cocoao is o, therefore we focus on edge ([
u
=⇒
ε ]R
u
, o, [
u
=⇒
o ]R
u
).
Since length([
u
=⇒
ε ]R
u ) + |o| = 1 < length([
u
=⇒
o ]R
u ) = 2, node [
u
=⇒
o ]R
u is separated into two nodes
[
ua
=⇒
co]R
ua
and [
ua
=⇒
o ]R
ua
, as shown in Figure 3.9.
Σ
o
c
o
c
o
a
a
a
cocoa cocoao
Σ
o
c
o
c
o
a
a
a
o
o
o
c
o
a
o
a
o
Figure 3.9: The update of CDAWG (u) to CDAWG (ua), where u = cocoa and a = o.
Pseudo-Code.
The algorithm is described in Figure 3.10 and Figure 3.11. Function extension returns the
explicit child node of a given node (implicit or explicit). Function redirect edge redirects
a given edge to a given node, with modifying the label of the edge accordingly. Function
split edge is the same as the one used in Ukkonen’s algorithm, except that it also computes
the length of nodes. Function separate node separates a given node into two, if necessary.
It is essentially the same as the separation procedure for DAWG(w) given by Blumer et
al. [7], except that implicit nodes are also treated.
33
Algorithm for on-line construction of CDAWG (w$)
in alphabet Σ = {w[−1], w[−2], . . . , w[−m]}.
/* $ is the end-marker appearing nowhere in w. */
1 create nodes source, sink, and ⊥;
2 for j := 1 to m do create a new edge (⊥, (−j, −j), source);
3 suf (source) := ⊥;
4 length(source) := 0; length(⊥) := −1;
5 e := 0; length(sink) := e;
6 (s, k) := (source, 1); i := 0;
7 repeat
8 i := i + 1; e := i; /* e is a global variable. */
9 (s, k) := update(s, (k, i));
10 until w[i] = $;
function update(s, (k, p)): pair of node and integers;
/* (s, (k, p − 1)) is the canonical reference pair for the active point. */
1 c := w[p]; oldr := nil;
2 while not check end point(s, (k, p − 1), c) do
3 if k ≤ p − 1 then /* implicit case. */
4 if s = extension(s, (k, p − 1)) then
5 redirect edge(s, (k, p − 1), r);
6 (s, k) := canonize(suf (s), (k, p − 1));
7 continue;
8 else
9 s := extension(s, (k, p − 1));
10 r := split edge(s, (k, p − 1));
11 else /* explicit case. */
12 r := s;
13 create edge (r, (p, e), sink);
14 if oldr = nil then suf (oldr) := r;
15 oldr := r;
16 (s, k) := canonize(suf (s), (k, p − 1));
17 if oldr = nil then suf (oldr) := s;
18 return separate node(s, (k, p));
Figure 3.10: Main routine, function update, and function check end point of the on-line
algorithm to construct CDAWGs.
34
function extension(s, (k, p)): node;
/* (s, (k, p)) is a canonical reference pair. */
1 if k > p then return s; /* explicit case. */
2 ﬁnd the w[k]-edge (s, (k , p ), s ) from s;
3 return s ;
function redirect edge(s, (k, p), r);
1 let (s, (k , p ), s ) be the w[k]-edge from s;
2 replace the edge by edge (s, (k , k + p − k), r);
function split edge(s, (k, p)): node;
1 let (s, (k , p ), s ) be the w[k]-edge from s;
2 create node r;
3 replace the edge by edges (s, (k , k + p − k), r) and (r, (k + p − k + 1, p ), s ),
4 length(r) := length(s) + (p − k + 1);
5 return r;
function separate node(s, (k, p)): pair of node and integer;
1 (s , k ) := canonize(s, (k, p));
2 if k ≤ p then return (s , k ); /* implicit case. */
3 /* explicit case. */
4 if length(s ) = length(s) + (p − k + 1) then return (s , k ); /* solid case. */
5 /* non-solid case. */
6 create node r as a duplication of s ; /* together with the out-going edges of s */
7 suf (r ) := suf (s ); suf (s ) := r ;
8 length(r ) := length(s) + (p − k + 1);
9 repeat
10 replace the w[k]-edge from s to s by edge (s, (k, p), r );
11 (s, k) := canonize(suf (s), (k, p − 1));
12 until (s , k ) = canonize(s, (k, p));
13 return (r , p + 1);
Figure 3.11: Other functions for the on-line algorithm to construct CDAWGs. Since
function check end point and function canonize used here are identical to those shown
in Fig. 3.4, they are omitted.
35
Complexity of the Algorithm.
Theorem 3.3 Assume Σ is a ﬁxed alphabet. For any string w ∈ Σ∗
, the proposed algorithm
constructs CDAWG (w) on-line and in O(|w|) time, using O(|w|) space.
Proof. The linearity proof is in a sense the combination of the one of the on-line
algorithm for DAWGs [7] and the one of the on-line algorithm for suﬃx trees [55]. We
divide the time requirement into two components, both turn out to be linear. The ﬁrst
component consists of the total computation time by canonize. The second component
consists of the rest.
Let x ∈ Factor(w). We deﬁne the suﬃx chain started at x on w, denoted by SCw(x),
to be the sequence of (possibly implicit) nodes reachable via suﬃx links from the (possibly
implicit) node associated with x to the source node in CDAWG (w), as in [7]. We deﬁne its
length by the number of nodes contained in the chain, and let |SCw(x)| denote it. Let k1 be
the number of iterations of the while loop of update and let k2 be the number of iterations
in the repeat-until loop in separate node, when CDAWG(w) is updated to CDAWG(wa).
By a similar argument in [7], it can be derived that |SCwa(wa)| ≤ |SCw(w)|−(k1 +k2)+2.
Initially |SCw(w)| = 1 because w = ε, and then it grows at most two (possibly implicit)
nodes longer in each call of update. Since |SCw(w)| decreases by an amount proportional
to the sum of the number of iterations in the while loop and in the repeat-until loop on
each call of update, the second time component is linear in the length of the input string.
For the analysis of the ﬁrst time component we have only to consider the number of
iterations in the while loop in canonize. By concerning the calls of canonize executed in
the while loop in update, it results in that the total number of the iterations is linear (by
the same argument in [55]). Thus we shall consider the number of iterations of the while
loop in canonize called in separate node. There are two cases to consider:
1. When the end point is on an implicit node. Then the computation in canonize takes
only constant time.
2. When the end point is on an explicit node. Let z be the LRS of w, which corresponds
to the end point. Consider the last edge in the path spelling out z from the source
node to the explicit node, and let the length of its label be k (≥ 1). The total
number of iterations of the while loop of canonize in the call of separate node is
at most k. Since the value of k increases at most by 1 each time a new character
36
is scanned, the time requirement of the while loop of canonize in separate node is
bounded by the total length of the input string.
As a result of the above discussion, we can ﬁnally conclude that the ﬁrst and second
components take overall linear time.
3.5 Construction of the CDAWG for a Set of Strings
For a set S of strings w1, w2, . . . , wk, we consider the set S = {wi$i | wi ∈ S and $i /∈
Factor(S) for 1 ≤ i ≤ |S|}. We remark that CDAWG (S ) = CDAWG(S ) for any set S
of strings.
CDAWG (S ) can be constructed by the same algorithm proposed in the previous
section, with a slight modiﬁcation. We use a global variable ei for each string in S , where
1 ≤ i ≤ |S|, which indicates the ending position of open edges for each string. We treat
the set S like a single sequence t = w1$1w2$2 · · · wk$k. Whenever we encounter an endmarker
$i, we stop increasing the value of ei. Then we create the new (i+1)-th sink node,
and start increasing the value of ei+1 each time a new character is scanned. Thereby we
have the following.
Theorem 3.4 Assume Σ is a ﬁxed alphabet. For any set S of strings, the proposed
algorithm constructs CDAWG (S ) on-line and in O( S ) time, using O( S ) space.
In Figure 3.12 the construction of CDAWG (S ) is displayed, where S = {cocoa$1, cola$2}.
Remark. As a secondary eﬀect of the end-markers $i, we obtain a good feature on
CDAWG (S ). For the set S = {cocoa, cola}, CDAWG(S) is shown in Figure 3.13.
One can see there are three sink nodes though S contains only two strings in it. This is
obviously because [
S
−→a ]R
S
= {a}. In such case, we cannot readily specify what string(s) in
S the string a is a factor of. However, there is no diﬃculty to specify it in CDAWG (S ),
since there exactly exist |S | = |S| sink nodes in it (see the right of Figure 3.13).
37
Σ
o
c
o
c
o
a
a
a
$1
$1
$1
$1
cocoa$1
o
o
a
a
a
$1
$1
$1
$1
Σ
c
o
c
cocoa$1,c
Σ
o
c
o
c
o
a
a
a
$1
$1
$1
$1
cocoa$1,co
cocoa$1,col
Σ
o
c
o
c
o
a
a
a
$1
$1
$1
$1
l
l
cocoa$1,cola
Σ
o
c
o
c
o
a
a
a
$1
$1
$1
$1
l
l
a
a
cocoa$1,cola$2
Σ
o
c
o
c
o
a
a
a
$1
$1
$1
$1
l
l
a
a
$2
$2
$2
$2
Figure 3.12: Construction of CDAWG (S ) for S = {cocoa$1, cola$2}.
c
o
a
a
a
l
l
a
a
c
o
o
c
o
a a
a
l
l
a
a
c
o o
$1 $1
$1
$2
$2
$2
$2$1
Figure 3.13: For set S = {cocoa, cola}, CDAWG(S) is shown on the left. For set
S = {cocoa$1, cola$2}, CDAWG (S ) is displayed on the right.
38
Chapter 4
Uniﬁcation of Algorithms for
Constructing Index Structures
4.1 Background and Motivation
When constructing an index structure for a given text string w, what is required primarily
is to build it in time linear in the length of w. To realize it, much attention and eﬀort has
been paid so far [58, 43, 11, 55, 7, 8, 16, 32]. From viewpoints of practice and algorithmics,
it is also very important to construct an index structure in on-line manner: for example,
we can obtain the suﬃx tree for wa only with small change to append new character a to
the existing suﬃx tree of w. If it is oﬀ-line, we have to construct the suﬃx tree of wa from
scratch, even if we beforehand had the suﬃx tree of w. Therefore, an on-line algorithm
constructs an index structure very eﬃciently, and it allows us to update the input string.
Another important factor in constructing an index structure is to build it for a set of
strings easily. Once constructing index structures for all strings in the set, by merging
them it may be possible to obtain an index structure for the set. However, the method
is rather straightforward and ineﬃcient, and takes us considerably much time. Hence an
algorithm that can directly build an index structure for a set of strings is truly helpful.
Table 4.1 shows the on-line algorithms for constructing index structures. Each index
structure is suitable to solve particular problems. For example, a suﬃx tree is optimal to
ﬁnd all the occurrences of a given pattern in a text string [20], a DAWG is a good structure
to ﬁnd the longest common factor of two strings [12], a CDAWG is ideal when we want to
save memory space, since its space complexity is strictly smaller than those of the other
index structures [8], and so on. Therefore, we should need every index structure in order
39
Index Structure Algorithm linear time on-line multiple strings
suﬃx tries Ukkonen [55]
√ √
suﬃx trees Ukkonen [55]
√ √ √
DAWGs Blumer et al. [8]
√ √ √
CDAWGs Inenaga et al. [32]
√ √ √
Table 4.1: On-line algorithms to construct index structures for a set of strings.
to solve various problems. The matter is, however, that we then have to implement at
least four diﬀerent algorithms, as there exist four index structures.
We had thereby been motivated to get rid of this trouble, and ﬁnally succeeded to unify
the four distinct algorithms, each of which constructs suﬃx tries, suﬃx trees, DAWGs,
and CDAWGs, respectively. That is, we produce a generic algorithm that is capable of
constructing any of suﬃx tries, suﬃx trees, DAWGs, and CDAWGs. The algorithm is
endowed with all the desired properties: it runs on-line and in linear time, and can apply
to a set of text strings. However, as an exception the construction of suﬃx tries cannot
always be achieved in linear time since they can require quadratic space.
A complete pseudo-code of our algorithm is shown in Fig. 4.1, Fig. 4.2, Fig. 4.3 and
Fig. 4.4. We have marked each line with four symbols: There, ♣ (♠, ♥ and ♦, resp.)
indicates the lines that can be executed when a suﬃx trie (a suﬃx tree, a DAWG, and a
CDAWG, reps.) is constructed. It has succeeded to reveal the essential common points
and separate the small diﬀerences among the typical algorithms [55, 7, 32]. In fact, all the
control blocks are exactly the same and all diﬀerences can be packed into the only one
procedure in Fig. 4.2 to create a new edge. This means that we can choose which index
structure to build, by ‘switching’ the one procedure.
Furthermore, by comparing the deﬁnitions of index structures given in Chapter 2.3
with the algorithm proposed here, some correspondence between them are revealed. Thus,
in a sense we provide an algorithmic uniﬁed view for the index structures.
The result reported in this chapter was published in [31].
40
4.2 A Generic Algorithm Constructing Index Struc-
tures
Our algorithm to be shown later on constructs STree (S) and CDAWG (S) rather than
STree(S) and CDAWG(S), since it processes input strings in on-line manner. In the following,
we explain the algorithm starting with the common part to every index structure,
then we separately give an exposition for the diﬀerent part.
4.2.1 Construction of an Index Structure for a Single String
We ﬁrstly consider the case that the input for the algorithm is a single string w ∈ Σ∗
.
Let Index(w[1 : i]) be an arbitrary index structure of string w[1 : i] for 1 ≤ i ≤ |w|.
Our algorithm updates Index(w[1 : i]) to Index(w[1 : i + 1]) by inserting the suﬃxes of
w[1 : i + 1] into Index(w[1 : i]). Recall Deﬁnition 3.1 of the LRS, given in Section 3.2.
The suﬃxes of w[1 : i + 1] can be divided into the following two groups, by the LRS of
w[1 : i + 1].
(1) Suﬃxes w[h : i + 1] for 1 ≤ h ≤ j where w[j + 1 : i + 1] is the LRS of w[1 : i + 1].
(2) Suﬃxes w[l : i + 1] for j + 1 ≤ l ≤ i + 2.
The group (2) is empty in such case that the LRS of w[1 : i + 1] = ε, that is, in case
j + 1 = i + 2.
Notice that we need not newly insert any suﬃxes in case (2), simply because they
have already been represented in Index(w[1 : i]). Meanwhile, we insert each suﬃx of
case (1) into Index(w[1 : i]), from w[1 : i + 1] to w[j : i + 1]. We call w[j + 1 : i + 1]
the longest duplicated suﬃx, the LRS for short, of w[1 : i + 1]. Let us call the start
point of Index(w[1 : i + 1]) the location where the LRS of w[1 : i] is represented in
Index(w[1 : i + 1]), and call the end point of Index(w[1 : i + 1]) the location where the
LRS of w[1 : i + 1] is represented in Index(w[1 : i + 1]). The suﬃxes of case (1) can
moreover be divided into the following two sub-cases by integer j .
(1-a) Suﬃxes w[h : i + 1] for 1 ≤ h ≤ j where w[j + 1 : i] is the LRS of w[1 : i].
(1-b) Suﬃxes w[h : i + 1] for j + 1 ≤ h ≤ j.
41
♣♠♥♦ Algorithm Construction of an index structure on Σ = {w[−1], . . . , w[−m]}.
♣♠♥♦ 1 create nodes root and ⊥;
♣♠♥♦ 2 for j := 1 to m do create a new edge (⊥, (−j, −j), root);
♣♠♥♦ 3 suf (root) := ⊥; suf (⊥) := nil /* suﬃx link */
♣♠♥♦ 4 length(root) := 0; length(⊥) := −1;
♣♠♥♦ 5 (s, k) := (root, 1); i := 0; h := 1; (t, q) := (root, 1);
♣♠♥♦ 6 repeat
♣♠♥♦ 7 i := i + 1;
♣♠♥♦ 8 ((s, k), (t, q)) := update((s, k), (t, q), i);
♣♠♥♦ 9 if w[i] = endmarker then
♣♠♥♦ 10 h := h + 1;
♣♠♥♦ 11 (t, q) := (root, i + 1);
♣♠♥♦ 12 until w[i] = EOF;
Figure 4.1: Main routine of our algorithm.
The main routine of our new algorithm is shown in Fig. 4.1. In the algorithm, an
edge (u, α, v) is represented by (u, (k, p), v) such that k (resp. p) represents the beginning
position (resp. the ending position) of the label in the input string w. The main routine
calls the function update, shown in Fig. 4.2, each time a new character is scanned. The
function update plays the main role to update the index structure with a newly scanned
character.
In update the suﬃxes of case (1) are inserted, while it is checked whether or not the
suﬃx currently focused on is the LRS of w[1 : i + 1]. This is examined by the function
check end point, in the 2nd line of update. In the 12th line a new edge is created for each
of the suﬃxes in case (1), and the way to do it depends on which index structure we are
constructing, as shown in the lower part of Fig. 4.2. The detail of the dependence will be
mentioned in the sequel. The location from which the algorithm should insert each suﬃx
in case (1) is called the active point for the suﬃx. Where the active point should start on
updating the structure also depends on which index structure we are constructing.
Now suppose that we have just before ﬁnished inserting a suﬃx w[h : i + 1] where
j +1 ≤ h ≤ j −1, which is in case (1-a). Then, in the 15th line of update the active point
is moved to the location where the string w[h + 1 : i + 1] is associated, via the suﬃx link.
The reference pair for w[h + 1 : i + 1] is then canonized by the function canonize. This
operation is continued until the LRS of w[1 : i + 1], i.e. the end point, is found.
42
♣♠♥♦ function update((s, k), (t, q), p): pairs of node and integer;
♣♠♥♦ /* (t, (q, p − 1)) is the canonical reference pair for the advanced point. */
♣♠♥♦ 1 c := w[p]; oldr := nil; s := nil;
♣♠♥♦ 2 while not check end point(s, (k, p − 1), c) do
♣♠♥♦ 3 if k ≤ p − 1 then /* implicit */
♠ ♦ 4 if s = extension(s, (k, p − 1)) then
♦ 5 redirect edge(s, (k, p − 1), r);
♦ 6 (s, k) := canonize(suf (s), (k, p − 1));
♦ 7 continue;
♠ ♦ 8 else
♠ ♦ 9 s := extension(s, (k, p − 1));
♠ ♦ 10 r := split edge(s, (k, p − 1));
♣♠♥♦ 11 else r := s; /* explicit */
♣♠♥♦ 12 CreateNewEdge /* Change only this line */
♣♠♥♦ 13 if oldr = nil then suf (oldr) := r;
♣♠♥♦ 14 oldr := r;
♣♠♥♦ 15 (s, k) := canonize(suf (s), (k, p − 1));
♣♠♥♦ 16 if oldr = nil then suf (oldr) := s;
♣♠♥♦ 17 (s, k) := separate node(s, (k, p));
♣♠♥♦ 18 (t, q) := canonize(t, (q, p));
♣♠♥♦ 19 if q > p then (s, k) := (t, q); /* the advanced point is explicit */
♣♠♥♦ 20 return ((s, k), (t, q));
CreateNewEdge should be replaced as follows respectively.
For Suﬃx Tries
create a new node v;
length(v) := length(r) + 1;
create a new edge (r, (p, p), v);
For Suﬃx Trees
create a new node v;
length(v) := ∞;
create a new edge (r, (p, ∞), v);
For DAWGs
if v has not been deﬁned yet
create a new node v;
length(v) := length(r) + 1;
create a new edge (r, (p, p), v);
For CDAWGs
if sh has not been deﬁned yet
create a sink node sh; /* sh is a global variable */
length(sh) := ∞;
create a new edge (r, (p, ∞), sh);
Figure 4.2: Function update.
43
♣♠♥♦ function check end point(s, (k, p), c): boolean;
♣♠♥♦ 1 if k ≤ p then /* implicit */
♠ ♦ 2 let (s, (k , p ), s ) be the w[k]-edge from s;
♠ ♦ 3 return (c = w[k + p − k + 1]);
♣♠♥♦ 4 else /* explicit */
♣♠♥♦ 5 return (there is a c-edge from s);
♣♠♥♦ function extension(s, (k, p)): node;
♣♠♥♦ /* (s, (k, p)) is a canonical reference pair. */
♣♠♥♦ 1 if k > p then return s; /* explicit */
♠ ♦ 2 ﬁnd the w[k]-edge (s, (k , p ), s ) from s; return s ; /* implicit */
♦ function redirect edge(s, (k, p), r);
♦ 1 let (s, (k , p ), s ) be the w[k]-edge from s;
♦ 2 replace this edge by edge (s, (k , k + p − k), r);
♣♠♥♦ function canonize(s, (k, p)): pair of integers;
♣♠♥♦ 1 if k > p then return (s, k); /* explicit */
♠ ♦ /* (s, (k, p)) is an implicit node. */
♠ ♦ 2 ﬁnd the w[k]-edge (s, (k , p ), s ) from s;
♠ ♦ 3 while p − k ≤ p − k do
♠ ♦ 4 k := k + p − k + 1; s := s ;
♠ ♦ 5 if k ≤ p then ﬁnd the w[k]-edge (s, (k , p ), s ) from s;
♠ ♦ 6 return (s, k);
Figure 4.3: Functions check end point, extension, and canonize.
44
♠ ♦ function split edge(s, (k, p)): node;
♠ ♦ 1 let (s, (k , p ), s ) be the w[k]-edge from s;
♠ ♦ 2 replace the edge by edges (s, (k , k +p−k), r) and (r, (k +p−k+1, p ), s )
♠ ♦ where r is a new node;
♠ ♦ 3 length(r) := length(s) + (p − k + 1);
♠ ♦ 4 return r;
♣♠♥♦ function separate node(s, (k, p)): pair of node integer;
♣♠♥♦ 1 (s , k ) := canonize(s, (k, p));
♣♠♥♦ 2 if k ≤ p then return (s , k ); /* implicit */
♣♠♥♦ /* (s , (k , p)) is an explicit node. */
♣♠♥♦ 3 if length(s ) = length(s) + p − k + 1 then return (s , k ); /* solid */
♠ ♦ /* non-solid case */
♠ ♦ 4 create a new node r as a duplication of s ;
♠ ♦ 5 suf (r ) := suf (s ); suf (s ) := r ;
♠ ♦ 6 length(r ) := length(s) + (p − k + 1);
♠ ♦ 7 repeat
♠ ♦ 8 replace the w[k]-edge from s to s by edge (s, (k, p), r );
♠ ♦ 9 (s, k) := canonize(suf (s), (k, p − 1));
♠ ♦ 10 until (s , k ) = canonize(s, (k, p));
♠ ♦ 11 return (r , p + 1);
Figure 4.4: Other functions.
What we mentioned above are common to all the four index structures. From now on,
let us treat the diﬀerences, CreateNewEdge .
4.2.2 CreateNewEdge in Case of Suﬃx Tries
Assume that we now have STrie(w[1 : i]). First, we insert the suﬃxes of case (1-a) into
the suﬃx trie. Deﬁnition 2.7 tells that every edge of a suﬃx trie must be labeled with a
single character. Therefore, from each leaf node of STrie(w[1 : i]) a new edge labeled by
w[i+1 : i+1] is created together with a new leaf node. This way the suﬃxes of case (1-a)
are inserted. The update from STrie(w[1 : i]) to STrie(w[1 : i + 1]) should begin at the
node w[1 : i]. We call this location the advanced point of STrie(w[1 : i]). The active point
was reset to node w[1 : i] after the construction of STrie(w[1 : i]) had been ﬁnished, and
this was done in the 19th line of update. Second, we insert the suﬃxes of case (1-b). By
creating new edges labeled by w[i + 1 : i + 1] from nodes w[h : i] where j + 1 ≤ h ≤ j,
they are inserted.
45
4.2.3 CreateNewEdge in Case of Suﬃx Trees
Assume that we now have STree (w[1 : i]). Recall Deﬁnition 2.8. A careful consideration
reveals the fact that any leaf node of STree (w[1 : i]) will also be a leaf node in STree (w[1 :
k]) for any k where i + 1 ≤ k ≤ |w|. Hence we refer the second value of label of an edge
directing a leaf node of STree (w[1 : i]) to “∞” as Ukkonen did in [55], and do the length
of the leaf node to “∞” as well. This way is so ingenious that we need not explicitly
insert any suﬃx in case (1-a). In addition, it inherently corresponds to the “compaction”
from a suﬃx trie to a suﬃx tree, shown in Fig. 3.1. Then all we have to do is to insert
the suﬃxes of case (1-b) into the suﬃx tree. That is why the active point should be on
the start point of STree (w[1 : i]) at the beginning of the update. Consider the case that
the active point is on an edge (on an implicit node) and corresponds to string w[h : i] for
some j +1 ≤ h ≤ j. Since w[h : i+1] is not the LRS of w[h : i+1], naturally it has to be
inserted into the suﬃx tree. To do it, a new node is created where the active point is. In
other words, the implicit node becomes explicit. This is done by the function split edge
called in the 10th line of update.
4.2.4 CreateNewEdge in Case of DAWGs
Assume that we now have DAWG(w[1 : i]). Deﬁnition 2.9 tells that only strings ending at
the same position in w must be represented in the same node. String w[h : i + 1] belongs
to [w[1 : i + 1]]R
w[1:i+1]
for any h with 1 ≤ h ≤ j , which is a suﬃx in case (1-a). These
result in the fact that an edge labeled by w[i + 1 : i + 1] and the new sink node should
be created from the last sink node [w[1 : i]]R
w[1:i+1]
, and by this procedure all the suﬃxes of
case (1-a) are inserted. Therefore, as in case of suﬃx tries, in the 19th line of update the
active point was moved to the advanced point of DAWG(w[1 : i]) after its construction
had been completed. To insert a suﬃx w[h : i + 1] in case (1-b) for j + 1 ≤ h ≤ j, a new
edge labeled with w[i + 1 : i + 1] is created from node [w[h : i]]R
w[1:i+1]
to the new sink node
[w[1 : i + 1]]R
w[1:i+1]
. It corresponds to the “minimization” from a suﬃx trie to a DAWG.
Suppose that the end point has already found, that is, the insertion of all the suﬃxes of
w[1 : i + 1] has been ﬁnished, and focus on the LRS w[j + 1 : i + 1] of w[1 : i + 1]. In the
17th line of update the function separate node is called, which examines whether or not
w[j + 1 : i + 1] = u, where u =
w[1:i+1]
←−−−−−−−−−−
w[j + 1 : i + 1]. If not, w[j + 1 : i + 1] cannot any longer
be represented in the same node as u. Therefore, the node is separated into two nodes,
46
[u]R
w[1:i+1]
and [w[j + 1 : i + 1]]R
w[1:i+1]
.
4.2.5 CreateNewEdge in Case of CDAWGs
Assume that we now have CDAWG (w[1 : i]). The way to update CDAWG (w[1 : i])
is like combination of those to update STree (w[1 : i]) and DAWG(w[1 : i]). Let us
call the terminal edges the edges directing the sink node of a CDAWG. It follows from
Deﬁnition 2.10 that any terminal edge of CDAWG (w[1 : i]) will also be a terminal edge
of CDAWG (w[1 : k]) for any k where i + 1 ≤ k ≤ |w|. Hence we refer the second
value of any terminal edge to “∞”, like the case of suﬃx trees. This way every suﬃx of
case (1-a) is implicitly inserted to the CDAWG, therefore the active point starts at the
start point of CDAWG (w[1 : i]). Suppose the active point is on an edge (on an implicit
node) right before inserting w[j + 1 : i + 1]. Then the edge is split into two, due to the
creation of the node from which an edge with label w[i + 1 : ∞] is created. This way to
label the edge corresponds to the “compaction” from a DAWG to a CDAWG. The edge
is directed to the sink node. It corresponds to the “minimization” from a suﬃx tree to a
CDAWG. To insert w[j + 2 : i + 1], the active point is moved to the location with which
w[j + 2 : i] is associated. If it is an implicit node, in the 4th line of update it is examined
if w[j + 2 : i + 1] is to belong to [w[j + 1 : i]]R
w[1:i+1]
. If so, the edge is redirected to the
node last created, [w[j + 1 : i]]R
w[1:i+1], and its label is modiﬁed accordingly. The function
redirect edge accomplishes the operation above. If not, a new node for [w[j + 2 : i]]R
w[1:i]
is newly created, so that a new edge labeled with w[j + 1 : ∞] can be created from it to
the sink node.
4.2.6 Extension to a Set of Strings
Given a set S = {w1, w2, . . . , wk}, we regard it as a sequence t = w1$1w2$2 · · ·wk$k, where
$h is the end-marker of wh for 1 ≤ h ≤ k. This way we can treat S like one string. In
the 9th line of the main routine in Fig. 4.1, if t[i] is an end-marker, integer h counting the
number of the input strings is increased one, and the advanced point is reset to the root
node preparing for the next string. The active point is also to be on the root node, since
any end-marker never appears in any string in S. Consequently, the algorithm builds
STrie(S), STree (S), DAWG(S), and CDAWG (S), for a given set S of strings.
As a result of the discussion, we have the following.
47
Theorem 4.1 For any set of strings the proposed algorithm constructs STrie(S), STree (S),
DAWG(S), and CDAWG (S) on-line, and in liner time except for STrie(S), by changing
the 12th line of the function update accordingly.
For comparison, for S = {abab, abb}, the on-line constructions of STrie(S), STree (S),
DAWG(S), and CDAWG (S) are shown in respectively Fig. 4.5, Fig. 4.6, Fig. 4.7, and
Fig. 4.8.
48
b
$1
b
a
a
b
b
b
a
$1
$1
$1
$1
bb
$1
b
a
a
b
b
b
a
$1
$1
$1
$1
$1
b
a
a
b
b
b
a
$1
$1
$1
$1
$1
b
a
a
b
b
b
a
$1
$1
$1
$1
$1
b
a
a
b
b
b
a
$1
$1
$1
$1
b
a
a
b
b
b
a
a
a
b
b
aa
b
b
a
b
$2
$2
$2
ε abababaaba
abab$1 ababab$1 aabab$1
abab$1 abb$2abab$1 abb
$2
Figure 4.5: A snapshot of the on-line construction of STrie(S) where S = {abab, abb}.
The gray star and the triangle represent the active point and the advanced point of each
step, respectively. The broken lines are suﬃx links.
49
b b$1
b
a
a
b b
b
a$1
$1
$1
$1
b
$1
b
a
a
b b
b
a$1
$1
$1
$1
$1
b
a
a
b b
b
a$1
$1
$1
$1
$1
b
a
a
b b
b
a$1
$1
$1
$1
b
a
a
b
b
b
a
a
a
b b
a
a
b b
a
b
$2
$2
ε abababaaba
abab$1 ababab$1 aabab$1
abab$1 abb$2abab$1 abb
$2
$1
b
a
a
b b
b
a$1
$1
$1
$1
$2
Figure 4.6: A snapshot of the on-line construction of STree (S) where S = {abab, abb}.
The gray star and the triangle represent the active point and the advanced point of each
step, respectively. The broken line are suﬃx links.
50
$1
a
b
a
b
$1
$1
b
b
b
$1
a
b b
a
b
$1
$1
a
b b
a
a
b b
$1
b
a
aa
b b
$2
ε abababaaba
abab$1 ababab$1 aabab$1
abab$1 abb$2abab$1 abb
a
b b
a
b
$1
$1
$1
a
b b
a
b
$1
$1
$1
a
b
a
b
$1
$1
b
b
b
$2
$2
a
$1
a
$1
Figure 4.7: A snapshot of the on-line construction of DAWG(S) where S = {abab, abb}.
The gray star and the triangle represent the active point and the advanced point of each
step, respectively. The broken lines are suﬃx links.
51
b
b
b$1
a
b
a
b
$1
$1
$1
ab
$1
$1
a
b
b
a
b
$1
$1
$1
a
b
b
a
b
$1
$1
a
b
b
a
a
b
b
$1
b
a
a
a
b
b
$2
ε abababaaba
abab$1 ababab$1 aabab$1
abab$1 abb$2abab$1 abb
a
b
b
a
b
$1
$1
b
b
b
$2
$2
a
a
a
b
$1
a
b
a
b
$1
$1
$1
ab
$1
$2
Figure 4.8: A snapshot of the on-line construction of CDAWG (S) where S = {abab, abb}.
The gray star and the triangle represent the active point and the advanced point of each
step, respectively. The broken lines are suﬃx links.
52
Chapter 5
Construction of the CDAWG for a
Trie
5.1 Introduction
In Chapter 3 and Chapter 4, we have discussed the construction of the CDAWG for a set
S of strings. In this chapter, we consider the case that the set S is given in the form of
a trie (see Section 2.1.1). Namely, our input is Trie(S) and output is CDAWG (S). Let
S = . Since the trie shares common preﬁxes of the strings in S, in general the number
N of nodes of the trie is less than . We show a non-trivial extension of the algorithm that
constructs CDAWG for a trie in O(N) time and space. The algorithm is designed on the
basis of the one for constructing CDAWGs for a set of strings, which has been introduced
in Chapter 3.
Some related work can be seen in literature: Kosaraju [37] introduced the suﬃx tree for
a reversed trie. We denote it by Trierev
(S). Let M be the number of nodes in Trierev
(S).
Kosaraju showed an algorithm to construct STree(S) in O(M log M) time. Later on,
Breslauer [9] reduced it to O(M) time.
On the other hand, our algorithm constructs a CDAWG for a (normal) trie. We
believe our assumption that a set S of strings is given as Trie(S) is natural. In addition,
we remark that the algorithm to be proposed also becomes capable of building STree(S)
and DAWG(S) in O(N) time, in the combination with the generic algorithm introduced
in Chapter 4. STrie(S) can also be constructed in O(N2
) time.
The result involved in this chapter was published in [29].
53
5.2 Trie and Reversed Trie
We deﬁne the reversed trie for a set S of strings as a reverse-directed tree. We denote it by
Trierev
(S). The root node of Trierev
(S) is out-degree zero and any leaf node is in-degree
zero. Formally, Trierev
(S) is deﬁned as follows.
Deﬁnition 5.1 Trierev
(S) is the tree (V, E) such that
V = {x | x ∈ Suﬃx(S)},
E = {(xa, a, x) | x, xa ∈ Suﬃx(S) and a ∈ Σ}.
We deﬁne a counterpart of the preﬁx property for a set of strings.
Deﬁnition 5.2 Let S = {w1, . . . , wk} where wi ∈ Σ∗
for 1 ≤ i ≤ k and k ≥ 1. We say
that S has the suﬃx property iﬀ wi /∈Suﬃx(wj) for any 1≤i=j ≤ k.
Then, the following obvious proposition holds.
Proposition 5.1 Any string in Preﬁx(S) can be spelled out from a leaf node in Trierev
(S)
iﬀ a set S of strings has the suﬃx property.
It directly follows from the contraposition of the above proposition that, if S does not
have the suﬃx property, we cannot spell out every string in Preﬁx(S). Since Breslauer’s
algorithm [9] traverses a given reversed trie from a leaf node, it is not supposed to construct
the suﬃx tree for a set of strings that does not have the suﬃx property.
Proposition 5.2 Given a set S = {w1, . . . , wk} such that wi /∈Suﬃx(wj) for any 1 ≤ i =
j ≤ k, let S = {w1$, . . . , wk$}. Then, Trierev
(S ) has at most S −|S | + 2 = S + 2
nodes.
The input of Breslauer’s algorithm is Trierev
(S ). If the strings in S have long and many
common suﬃxes, the number of nodes in Trierev
(S ) is by far smaller than the upper
bound S −|S |+2.
Trierev
(S ) for S = {aaab$, aac$, aa$, abc$, bab$, ba$} is shown in Fig. 5.1.
Theorem 5.1 (Breslauer [9]) Breslauer’s algorithm constructs the suﬃx tree of a reversed
trie in time proportional to the number of nodes in the reversed trie.
On the other hand, given a set S = {w1, . . . , wk} with k ≥ 1, we consider the set
S = {w1$1, . . . , wk$k} where $i denotes the unique end-marker for wi (1 ≤ i ≤ k).
54
$
a
a
aa
a
a
a
a
b
b
b
b
c
01
2
3
4
6
78
9
101112
13
5
14
Figure 5.1: Trierev
(S ) for S = {aaab$, aac$, aa$, abc$, bab$, ba$}.
Proposition 5.3 Given a set S = {w1, . . . , wk} with k ≥ 1, let S = {w1$1, . . . , wk$k}.
Then, Trie(S ) has at most S + 1 = S + |S| + 1 nodes.
See Fig. 5.2 in which Trie(S ) is displayed, where S = {aaab$1, aac$2, aa$3, abc$4, bab$5, ba$6}.
Even if set S does not have the preﬁx property, every string x ∈ S corresponds to a leaf
node. In fact, although a string aa is a preﬁx of a string aaab, the path spelling out aa$3
ends at leaf node 8 in Fig. 5.2.
a
a
a
a
b
b
b
b
c
c
$1
$2
$3
$4
$5
$6
0
1
2
3 4 5
6 7
8
9 10 11
12
13
14 15
16
Figure 5.2: Trie(S ) for S = {aaab$1, aac$2, aa$3, abc$4, bab$5, ba$6}.
Trie(S ) is the input of our algorithm to be introduced in Section 5.3.
55
5.3 Algorithm to Construct the CDAWG for a Trie
We ﬁrstly note that the CDAWG for Trie(S ) is the same as the CDAWG of S for any set S
of string, except the following point. The label of an edge in CDAWG(S ) is implemented
by a triple of integers (h, i, j) representing the starting position i and ending position j of
the label in the h-th string in S. Meanwhile, that in the CDAWG for Trie(S ) refers to a
pair of nodes in Trie(S ), between which the string corresponding to the label is lying.
The basic action of the algorithm is to update the CDAWG incrementally, synchronized
with the depth-ﬁrst traversal on Trie(S ). The key idea to achieve the linear time
construction is as follows.
(1) Keep track of the advanced point q in the CDAWG so that the path from the root
node to q coincides with the path from the root node to node v, where v is the node
currently visited in the trie.
(2) Create a new node in the CDAWG where the advanced point q is, before stepping
into the ﬁrst branch at each branching node in the trie.
We will explain the detail in the sequel. Suppose that, after having traveled nodes with
scanning α ∈ Preﬁx(S ) in Trie(S ), the algorithm encounters a node v having k (≥ 2)
branches in Trie(S ). Moreover suppose that it then chooses an edge from which to a
leaf node a string β ∈ Suﬃx(S ) is spelled out. After updating the CDAWG with string
αβ, the algorithm has to update it with the other strings represented in Trie(S ). Notice
that the current CDAWG already has the path representing α from the source node,
which corresponds to preﬁxes of at least k strings in S. Thus the algorithm has to restart
updating the CDAWG from the location to which α corresponds, and has to continue
traversing Trie(S ) from the node v. For that purpose, we trace the advanced point q
mentioned in (1) above.
Let us now clarify the aim of (2). The aim is to make the advanced point q be an
explicit node whenever the algorithm encounters a branching node in Trie(S ). That is,
the reference pair of q should then become of the form (s, ε) for some node s. What is
the matter if the advanced point q is not explicit before stepping into the ﬁrst branch?
Assume that the advanced point q was referred to as (u, γ) with some node u and string
γ = ε when the algorithm encountered the node v corresponding to α in Trie(S ). After
ﬁnishing updating the CDAWG with αβ, the algorithm focuses back on v and q =(u, γ).
56
The matter is that the reference (u, γ) might not be canonical any longer: the path
spelling out γ may contain extra nodes. Namely, the path spelling out γ may have been
split while the algorithm updated the CDAWG with string β. A concrete example is
shown in Fig. 5.3.
a ab bc $1
$2
0 1 2 5 643
a
98b
a
b
b
c
S
F1
$1
a
b b
S
cc
c
$1
F1
3
2
1
a
b
a
b
a
$1
$1
c
a
b
a
$1
7
7
input trie
Figure 5.3: Trie(S ) for S = {abcaab$1, abcb$2} is shown left. When the algorithm focuses
on node 3 in Trie(S ), it needs to memorize the location in the CDAWG corresponding
to abc. Since there is no node but F1 at the location, it is memorized by a reference pair
(0, abc). After having visited node 7 in Trie(S ), the algorithm updates the CDAWG
from (0, abc), and with node 3 in the trie. However, since the path spelling abc dose not
consist of an edge any more, the algorithm has to ﬁnd the nearest node from the location
the path ends on, that is, node 2. We have to avoid this, because traversing the path
spelling abc in the CDAWG just deserves traversing Trie(S ) from node 0 to 3 .
If the algorithm scans such extra nodes, its time complexity can become quadratic
with respect to the number of nodes in Trie(S ). In order to avoid this matter, the
algorithm creates a new node s so that the active point is guaranteed to be on an explicit
node. However, the algorithm dose not merge any other edges because at the moment it
is unknown how many edges should be merged into the new node s. Of course, if γ = ε,
there is no need to create any new node.
The algorithm is described is Fig, 5.4. The variable current node indicates the node on
which the algorithm currently focuses in Trie(S ). The variable advanced point is of the
57
Algorithm to construct the CDAWG for Trie(S ) /* The input is Trie(S ) */
1 current node := root; /* the root node of Trie(S ) */
2 active point := (source, ε);
3 advanced point := (source, ε);
4 traverse and update(current node, active point, advanced point);
procedure traverse and update(current node, active point, advanced point)
1 Let label set be the set of labels of the outgoing edges of current node;
2 if |label set| = 0 then return;
3 else if |label set| ≥ 2 then create node(advanced point);
4 for each c ∈ label set do
5 new active point := update CDAWG(c, active point);
6 Let new advanced point be the location where active point advances with c;
7 Let v be the node to which the edge labeled c points;
8 traverse and update(v, new active point, new advanced point);
Figure 5.4: Algorithm to construct the CDAWG for a trie.
form of a reference pair (u, β), where u is the parent node nearest to advanced point. As
mentioned above, the string β is actually implemented by a pair of nodes in Trie(S ). In
the procedure traverse and update, the function update CDAWG updates the CDAWG
with a letter c. The function update CDAWG is the same as the one for the construction
of the CDAWG for a set of strings, introduced in Chapter 3, excepting that
update CDAWG creates a new edge stemming from the node latest created by function
create node.
An example of the construction of the CDAWG for a trie is shown in Fig. 5.3.
Finally, we have the following theorem.
Theorem 5.2 The proposed algorithm constructs the CDAWG for a trie in linear time
and space with respect to the number of nodes in the trie.
Proof. We ﬁrst explain that the modiﬁcation of the function update CDAWG and the
function create node itself do not aﬀect the linearity of the algorithm.
Suppose that an input trie has n nodes. It is clear that the number of nodes visited
by advanced point in the CDAWG is at most n. Hence it takes O(n) time to calculate
advanced point all through the construction. Furthermore suppose that m nodes in
Trie(S ) are branching. It is clear that m < n, because any trie has at least one leaf node.
58
0
a ab b
c
$1
$2
0 71 2 5 6
43
input trie :
0 1
0
a
2
a
b b
0
4
a
b
b
c
c
$1
1
0
F1
$1
$1
c
2 5
a
6 7
a
b b
c
$11
0
F1
$1
c
a
b
F2
$2
$2
3
a
b
b
0
1
c
c
c
$1
a
b
b
c
c
$1
1
0
F1
$1
$1
c
$1
a
b b
c
$1
1
0
F1
$1
c
$1
a
a
b
b
c
$1
1
0
F1
$1
c
$1
b
$1$2
current_node
F1
1
F1
F1
F2
F2
Figure 5.5: Construction of the CDAWG for Trie(S ), where S = {abc$1, abab$2}. The gray starred
point represents active point, and the black dotted point represents advanced point. For simplicity, the
bottom node is omitted. As node 2 in the trie is branching, a new node 1 is created in the CDAWG
when current node arrives at node 2 for the ﬁrst time. After current node visits node 4, the algorithm
updates the CDAWG with current node = 2 and advanced point = 1.
59
Therefore, the function create node creates at most m nodes in the CDAWG, and it
implies that the time complexity of create node is O(m). This implies the modiﬁcation,
creating new edges due to the nodes made by the function create node, takes O(m) time
as well.
We from now on verify the overall linearity of the proposed algorithm. The matter we
have to clarify is the upper bound of the number of nodes active point visits throughout
the construction. Assume that a node v in the trie has k branches and there is a path
spelling α between the root node and v. When current node arrives at node v in the trie
for the ﬁrst time, the function create node creates a new node u where advanced point
is in the CDAWG. Then active point may traverse at most k|α| nodes from p to the
initial node via suﬃx links until ﬁnding the location it can stop on. However, k ≤ |Σ|.
Therefore, for a trie with n nodes, the number of nodes active point visits throughout the
construction is O(|Σ|n). Thus, if Σ is a ﬁxed alphabet, the proposed algorithm constructs
the CDAWG for a trie in O(n) time and space.
5.4 Conclusion
We gave an algorithm for constructing the CDAWG for a trie in linear time and space
with respect to the number of nodes in the trie. When input strings are given in the form
of a trie, the proposed algorithm constructs the CDAWG for the strings faster than the
one presented in [32] directly does from a set of the strings, especially when the strings
have many common preﬁxes. As the space complexity of CDAWGs is bounded strictly
lower than that of suﬃx trees, the algorithm presented in this chapter also allows to save
memory space.
60
Chapter 6
On-Line Construction of Symmetric
Compact Directed Acyclic Word
Graphs
6.1 Introduction
As seen in literature [58, 43, 7, 8, 55, 16, 32] and the previous chapters, suﬃx links are
often used, and essential, for eﬃcient constructions of index structures such as suﬃx
tries, suﬃx trees, DAWGs, and CDAWGs. An interesting fact is that, for any string
w ∈ Σ∗
, the suﬃx links of STrie(w) form STrie(wrev
) [18]. A DAWG also has a similar
property, that is, the suﬃx links of the DAWG(w) compose STree(wrev
) [11]. However,
this duality is damaged in case of suﬃx trees. Namely, the suﬃx links of STree(w) do not
form a structure supporting indexes of wrev
. However, the set of suﬃx links of STree(w)
corresponds to a subset of the set of edges of DAWG(wrev
) [14].
In order to obtain the complete duality on suﬃx trees, the aﬃx tree is developed by
Stoye [50, 51]. Aﬃx trees are the modiﬁcation of suﬃx trees so that the suﬃx links of
ATree(w) form ATree(wrev
) (see Fig. 6.3). Stoye could not prove his on-line algorithm for
constructing aﬃx trees runs in linear time, but Maaß [39] later succeeded to improve it so
as to run in linear time. Meanwhile, Blumer et al. [8] showed that the nodes of a CDAWG
are invariant under reversal: the nodes of the CDAWG for a string w exactly correspond
to those of the CDAWG for wrev
, which they call the symmetric compact directed acyclic
word graph (SCDAWG) for w (see Fig. 6.2, right).
61
Ukkonen [55] gave intuitive and excellent on-line algorithms for the construction of
STrie(w) and STree(w), as recalled in Chapter 3. Since the suﬃx links of STrie(w) are
equal to the edges of STrie(wrev
), it turns out that STrie(w) and STrie(wrev
) sharing the
same nodes can be simultaneously built on-line, scanning w from left to right. Also, as
the algorithm to construct DAWGs which Blumer et al. gave in [7] is on-line, it results
in that their algorithm builds DAWG(w) and STree(wrev
) at the same time, in on-line
(left to right) fashion. Moreover, the fact is that the ﬁrst algorithm that constructs suﬃx
trees, given by Weiner in [58], becomes more interesting when considered as an on-line
algorithm. His algorithm builds the suﬃx tree for a string w by appending the suﬃxes
of w to the current suﬃx tree in increasing order. In other words, his algorithm builds
STree(w) on-line, right to left. In addition to that, his algorithm can be modiﬁed so as to
create the edges of the DAWG for wrev
at the same time [16]. It implies that his algorithm
also simultaneously constructs DAWG(w) together with STree(wrev
) on-line, left to right.
In this chapter, we ﬁrst give an algorithm that simultaneously builds STree(w) with
DAWG(wrev
) on-line, left to right. This algorithm constructs STree(w) in the same way
as Ukkonen’s algorithm does, while computing the shortest extension links (sext links)
that form DAWG(wrev
) at the same time. Moreover, we show an algorithm that directly
constructs SCDAWG(w) on-line, left to right. It builds CDAWG(w) similarly to the
algorithm we introduced in [32], and computes the sext links that are equal to the edges
of CDAWG(wrev
).
From a practical point of view, SCDAWGs and aﬃx trees have the essentially same
range of applications. However, the number of nodes in SCDAWG(w) is much smaller
than that of ATree(w), although both are linear with respect to the length of a given
string w. In fact, an inequality comparing the number of nodes
|SCDAWG(w)|
≤ min{|STree(w)|, |STree(wrev
)|}
≤ max{|STree(w)|, |STree(wrev
)|}
≤ |ATree(w)|
holds for any string w. This is because, intuitively, the set of nodes in SCDAWG(w) is the
intersection of those in STree(w) and STree(wrev
), while the set of nodes in ATree(w) is
the union of them. Therefore, SCDAWGs considerably save space, compared to aﬃx trees.
Moreover, not only an SCDAWG is attractive as index structure, but also the underlying
equivalence relation is useful in Data Mining or Machine Discovery from textual databases.
62
Actually, the equivalence relation plays a central role in supporting human experts who
are involved in evaluation/interpretation task for mined expressions from anthologies of
classical Japanese poems [52].
The result shown in the chapter was published in [30].
6.2 Bidirectional Index Structures
If an index structure represents all the strings not only in Factor(w) but also in Factor(wrev
),
let us call it a bidirectional index structure for string w. We deﬁne such a structure as a
graph with two kinds of edges: the ones for a string w, and the others for wrev
.
Giegerich and Kurtz [18] observed that STrie(w) and STrie(wrev
) are dual in the sense
that they share the same nodes. We refer this bidirectional index structure as “STrie(w)
with STrie(wrev
)”. The formal deﬁnition follows.
Deﬁnition 6.1 STrie(w) with STrie(wrev
) is the bidirectional tree (V, EL→R, ER→L) such
that
V = {x | x ∈ Factor(w)},
EL→R = {(x, a, xa) | x, xa ∈ Factor(w) and a ∈ Σ},
ER→L = {(x, a, ax) | x, ax ∈ Factor(w) and a ∈ Σ}.
It is obvious that there is a trivial one-to-one correspondence between the set ER→L and
the set F for the suﬃx links of STrie(w) in Deﬁnition 2.7.
The duality of STree(w) and DAWG(wrev
), which was pointed out in [11, 14], is shown
in Deﬁnition 6.2.
Deﬁnition 6.2 STree(w) with DAWG(wrev
) is the bidirectional dag (V, EL→R, ER→L)
such that
V = {
w
−→x | x ∈ Factor(w)},
EL→R = {(
w
−→x , aβ,
w
−→xa) | x, xa ∈ Factor(w), a ∈ Σ, β ∈ Σ∗
,
w
−→xa = xaβ, and
w
−→x =
w
−→xa},
ER→L = {(
w
−→x , a,
w
−→ax) | x, ax ∈ Factor(w) and a ∈ Σ}.
Let V = {[x]L
w | x ∈ Factor(w)}. It is easy to see that there is a trivial one-to-one correspondence
between the node set V of Deﬁnition 6.2 and V . Using this correspondence,
we can identify ER→L of Deﬁnition 6.2 with
{([x]L
w
, a, [ax]L
w
) | x, ax ∈ Factor(w) and a ∈ Σ}
= {([y]R
wrev , a, [ya]R
wrev ) | y, ya ∈ Factor(wrev
) and a ∈ Σ},
63
b
a
g
g
gg
g
g
g
a
aa
a
e
ee
e
e
g
a
g
g
ab
a
g
g
g
a
e
e g
a
e
g
a
g
gg a
a
a
b
b
g
b
b
b
b
b
a
g
g
g
g
g
g
a
a
a
e
e
e
ab
g
g
g
a
e
g
a
g
g
a
g
g
a
e
e
e
b
a
b
g
g
Figure 6.1: STrie(w) with STrie(wrev
) on the left, and STree(w) with DAWG(wrev
) on
the right, where w = baggage. The thick solid lines represent the edges of STrie(w) and
STree(w), while the thin break lines do the ones of STrie(wrev
) and DAWG(wrev
). Since
the string baggage ends with a unique character e, the end-marker $ is omitted.
which is equivalent to the deﬁnition of DAWG(wrev
).
The edges ER→L of Deﬁnition 6.2 are the so-called shortest extension links (sext links)
of STree(w), which were introduced by Crochemore and Rytter in [14]. Moreover, a part
of the reversed sext links are known as suﬃx links. Recalling the deﬁnition, the suﬃx
links are the set
{(
w
−→ax,
w
−→x ) | x, ax ∈ Factor(w), a ∈ Σ, and
w
−→ax = a ·
w
−→x }.
The reversal of the suﬃx links are called reversed suﬃx link, deﬁned as
{(
w
−→x , a,
w
−→ax) | x, ax ∈ Factor(w), a ∈ Σ, and
w
−→ax = a ·
w
−→x }.
It can be observed that the suﬃx link set is a subset of the sext link set, under the
‘
w
−→ax = a ·
w
−→x ’-condition.
In Fig. 6.1 we illustrate STrie(w) with STrie(wrev
) and STree(w) with DAWG(wrev
),
where w = baggage.
By the duality, we omit the deﬁnition of the bidirectional index structure DAWG(w)
with STree(wrev
).
Now we pay our attention to CDAWG(w). Deﬁnition 2.10 can be transformed as
64
follows:
V {
w
←→x | x ∈ Factor(w)},
E {(
w
←→x , aβ,
w
←→xa) | x, xa ∈ Factor(w), a ∈ Σ, β ∈ Σ∗
,
w
−→xa = xaβ, and
w
−→x =
w
−→xa},
F {(
w
←→ax,
w
←→x ) | x, ax ∈ Factor(w), a ∈ Σ, and
w
←→x =
w
←→ax},
In Deﬁnition 6.3, we show the deﬁnition of the symmetric CDAWG (SCDAWG) of a
string w, denoted by SCDAWG(w), originally deﬁned by Blumer et al. [8].
Deﬁnition 6.3 SCDAWG(w) is the bidirectional dag (V, EL→R, ER→L) such that
V = {
w
←→x | x ∈ Factor(w)},
EL→R = {(
w
←→x , aβ,
w
←→xa) | x, xa ∈ Factor(w), a ∈ Σ, β ∈ Σ∗
,
w
−→xa = xaβ, and
w
−→x =
w
−→xa},
ER→L = {(
w
←→x , γa,
w
←→ax) | x, ax ∈ Factor(w), a ∈ Σ, γ ∈ Σ∗
,
w
←−ax = γax, and
w
←−x =
w
←−ax}.
The edges ER→L are called the sext links of CDAWG(w), as well. The reversed suﬃx
links of CDAWG(w) are the set
{(
w
←→x , γa,
w
←→ax) | x, ax ∈ Factor(w), a ∈ Σ, γ ∈ Σ∗
,
w
←−ax = γax,
w
←→x =
w
←→ax, and
w
←→ax = a ·
w
←→x }.
The suﬃx link set is a subset of the sext link set, under the ‘
w
←→ax = a ·
w
←→x ’-condition.
We illustrate DAWG(w) with STree(wrev
), and SCDAWG(w) in Fig. 6.2, where w =
baggage.
Another symmetric bidirectional index structure, called aﬃx tree, was introduced by
Stoye [50]. ATree(w) and ATree(wrev
) for w = baggage are shown in Fig. 6.3 without a
formal deﬁnition for comparison. Intuitively, the set of the nodes in SCDAWG(w) is the
intersection of those in STree(w) and STree(wrev
), while the set of the nodes in ATree(w)
is the union of them.
6.3 On-Line Construction of STree(w) with DAWG(wrev
)
In this section, we give an algorithm that simultaneously constructs STree(w) with DAWG(wrev
)
for a string w ∈ Σ∗
, on-line and in linear time with respect to |w|.
6.3.1 “STree(w) with DAWG(wrev
)” Redeﬁned
Our algorithm constructs STree (w) in the same fashion as the Ukkonen algorithm, and
therefore the DAWG(wrev
) being constructed at the same time is incomplete in the sense
65
a
b
g
g
g
a
a
e
g
a
a
g
g
e
e
e
g
g
b
a
a
g
e
g
g
b
g
g
g
e
g
g
a
b
a
g
g
g
a
a
a
b
g
b
b
b
a
g
g
e
ee
g
g
b
a
a
g
e
g
g
b
g
a
e
g
a
e
g
a
e
a
a
b
a
g
g
b
b
b
a
g
g
Figure 6.2: DAWG(w) with STree(wrev
) on the left, and SCDAWG(w) on the right, for
string w = baggage.
b
a
g
g
g
g
g
aa
a
e
e
e
e
g
a
g
g
ab
a
g
g
g
a
e
e
g
a
e
g
a
g
g
a
b
b
b
b
g
a
b
g
g
a
e
gab
b
a
g
g
g
a
e
b
a
g
g
g
a
b
a
g
gg
ab
a
b
b
be
b
ag
a
g
g
a
g
e
g
e
a
g
g
e
e
a
g
g
e
ag
e
Figure 6.3: ATree(w) on the left and ATree(wrev
) on the right, where w = baggage.
that it lacks the nodes corresponding to the non-branching internal nodes of STree (w)
and the sext links from/to them. However, the ﬁnally obtained structure for input w$ is
exactly the same as STree(w$) with DAWG($wrev
).
Deﬁnition 6.4 “STree(w) with sext links” is the bidirectional dag (V, EL→R, ER→L) such
66
that
V = {
w
=⇒
x | x ∈ Factor(w)},
EL→R = {(
w
=⇒
x , aβ,
w
=⇒
xa) | x, xa ∈ Factor(w), a ∈ Σ, β ∈ Σ∗
,
w
=⇒
xa = xaβ, and
w
=⇒
x =
w
=⇒
xa},
ER→L = {(
w
=⇒
x , a,
w
=⇒
ax) | x, ax ∈ Factor(w) and a ∈ Σ}.
Since
w$
=⇒
x =
w$
−→x , this structure is identical to that deﬁned in Deﬁnition 6.2 for input
string w$.
6.3.2 Main Idea of the Algorithm
As for STree (w) for a string w ∈ Σ∗
, our algorithm creates it in entirely the same way as
Ukkonen’s algorithm (see Section 3.3 or [55]). Every time a new node is created during
the construction of STree (w), the sext links of the new node, which correspond to certain
edges of DAWG(wrev
), are computed. Ukkonen’s algorithm creates no leaf node for the use
of so-called ‘∞-trick’ that enables his algorithm to achieve an O(|w|)-time construction
of STree (w), and an edge directed to a ‘transparent’ leaf node is called an open edge.
However, we modify it so as to create every leaf node not only because
(i) we need a leaf node in order to deﬁne its sext links, but also
(ii) the sext link of a leaf node is to be a clue to deﬁne the sext links of a node to be
created just above the leaf node.
First of all, one may wonder that if creating leaf nodes, the time complexity of the construction
of STree (w) can be quadratic due to a series of updating the open edges. However,
recall the fact that label α of an edge of STree (w) is usually implemented with a pair of
integers (i, j) such that α = w[i : j]. Furthermore, note that the second value of the label
of any open edge in STree (w[1 : h]) is h for 1 ≤ h ≤ n. Therefore, if we implement the
second value with a global variable, we can update all the open edges in constant time
with an increment of the variable h.
Let us pay our attention back to the two reasons (i) and (ii). We have an obvious
proposition about (i).
Proposition 6.1 Suppose that in STree (w) the reversed suﬃx link of a leaf node x, which
is labeled a, points to a node y. Then node y is also a leaf node in STree (w).
67
Proof. From the deﬁnition the reversed suﬃx link of node x is a triple (
w
=⇒
x , a,
w
=⇒
ax) such
that
w
=⇒
ax = a ·
w
=⇒
x . String x is a suﬃx of w because x is represented by a leaf in STree (w).
Hence
w
=⇒
x = x. Consequently,
w
=⇒
ax = a ·
w
=⇒
x = ax = y. This means that y is also a suﬃx of
w and is represented by a leaf node in STree (w).
The above proposition tells us that, in a suﬃx tree, the reversed suﬃx link of the newest
leaf node points to the last created leaf node. Conversely, the suﬃx link of the last created
leaf node is pointing to the leaf node which will be created next.
In the sequel, we shall clarify what the reason (ii) implies.
On the construction of “STree (w) with sext links”, we use a two dimensional table
sext. The description “sext[x, a] = y” means “the sext link of node x labeled with a points
to node y.” Similarly, we use tables suf and rsuf which correspond to the suﬃx link and
the reversed suﬃx link, respectively.
6.3.3 How to Maintain Sext Links
Here, we explain how the sext links of a new node are computed during the Ukkonentype
construction of STree (w). See Fig. 6.4 that shows each phase of the construction
of STree (#abab$). The starred point in Fig. 6.4 is called the active point. For a string
w ∈ Σ∗
, at the beginning of each phase w[1 : i] (i = 0, 1, . . . , |w| − 1), the active point
stays at which the algorithm should start to update STree (w[1 : i]) to STree (w[1 : i+1]).
Let acti denote the active point in phase w[1 : i]. In phase w[1 : i + 1], acti+1 moves until
it can stop with spelling out w[i + 1].
If it is possible for acti+1 to move ahead from the current location while spelling out
w[i + 1] (say case (a)), it moves and stops there, and then becomes acti+2. Notice that
no new node is created in case (a), as seen in phase #aba and phase #abab in Fig. 6.4.
Otherwise (say case (b)), a new edge labeled with w[i+1] has to be created from where
acti+1 currently stays. Case (b) is divided into two sub-cases:
• acti+1 is on a node u (case (b1)).
• acti+1 is on an edge (case (b2)).
In case (b1), the algorithm just creates a new edge labeled by w[i+1] with a new leaf node
v (see Fig. 6.5, left). Only v is the newly created node in case (b1). Concrete examples
68
can be seen in phases #, #a, #ab, and the third step of phase #abab$ in Fig. 6.4. As
for case (b2), the algorithm needs to create a new node u where acti+1 stays now, in the
middle of the edge, to insert a new edge labeled with w[i+1] from there (see Fig. 6.5,
right). Concrete examples of case (b2) can bee seen at the ﬁrst and second steps in phase
#abab$ in Fig. 6.4. After having making node u, it creates a new edge together with a
new leaf node v. These nodes u and v are all the nodes newly created in case (b2).
$
a
a
a
#
b
b
b
b
a#
#
#
a
a
b
a
b
$
$
$
$
b
b
b
$
a
a
a
a
a
#
b
b
b
b
a#
#
#
a
a
b
a
b
$
$
$
$
b
b
b
a
a
a
#
bb
b
b
a#
#
#abab:
a
a
b
aaa
a
#
bbb
a#
#
#aba:
a
a
b
a
#
bbb
a#
#
#ab:
a
#
a
##
a a
#
#
#a:#:
#abab$:
a
b a
b
a
a
a
# bb
b
b
a#
#
#
a
a
ba
b
$
$
$
$
b
b
b
b
$
a
a
b
$
Figure 6.4: The on-line construction of STree (#abab$) with the sext links represented
by the broken arrows. At the third step of phase #abab$, the sext links form
DAWG($baba#).
Sext Link of a Leaf Node
In both cases (b1) and (b2), it follows from Proposition 6.1 that the reversed suﬃx link
of a new leaf node v points to the last created leaf node v . Suppose v is the jth created
leaf node and v is (j−1)th one during the construction of STree (w), where 2 ≤ j ≤ |w|.
Then the reversed suﬃx link of node v pointing to v is labeled by w[j−1], in formula,
rsuf [v, w[j−1]] = v . We have the following proposition which concerns with the sext link
69
uu
v
ss
v
u
(b2)(b1)
Figure 6.5: The two cases of the position of the active point, which is denoted by a gray
star. Since the active point is on a node u in case (b1) displayed on the left, only leaf
node v is newly created. On the other hand, in case (b2) on the right, internal node u is
also created where the active point is at present, in the middle of an edge.
of v.
Proposition 6.2 Suppose that v and v are jth and (j − 1)th created leaf nodes of
STree (w), respectively, where 1 ≤ j ≤ |w|. Then sext[v, w[j−1]] = v is the sole sext link
of leaf node v.
Proof. Since v is a leaf node, v is a factor which has occurred only once in w, as a suﬃx.
Because v is the jth suﬃx, it is preceded by w[j − 1] and w[j − 1] · v = v . Therefore, for
any c ∈ Σ such that c = w[j − 1], the string cv is not in Factor(w).
For example, leaf node b is created in phase #ab of Fig. 6.4, and it is the third one.
Therefore, rsuf [b, a] = sext[b, a] = ab, where a is the second character in string #abab$.
Sext Links of an Internal Node
Since the leaf node v is the only node newly created in case (b1), the algorithm then
has only to do the above maintenance for node v. Meanwhile, because the node u is also
newly created in case (b2), we have to determine the sext links of u. Assume that in phase
w[1 : i] the internal node u is created in the middle of an edge between node s and node
r. It then results in that u has two children, r and v. If there exists a node u such that
suf [u ] = u, then let a be the character such that rsuf [u, a] = u . Suppose there is a node
r such that sext[r, b] = r with b = a. Then sext[u, b] is set to point to r as well, since
70
r =
w[1:i]
=⇒
bu in this case (remember the deﬁnition of sext links). For instance, at the ﬁrst
step of phase #abab$ in Fig. 6.4, u = ab, r = abab$ and r = sext[abab$, #] = #abab$.
Since rsuf [ab, #] is undeﬁned, we deﬁne sext[ab, #] = #abab$. If b = a, then sext[u, b]
stays pointing to node u , because obviously u =
w[1:i]
=⇒
bu =
w[1:i]
=⇒
au . For example, at the second
step of phase #abab$ in Fig. 6.4, sext[bab$, a] = abab$. However, since rsuf [b, a] = ab,
sext[b, a] = ab. As previously remarked in the reason (ii) in Section 6.3.2, we also refers
to the sext link of leaf node v in order to determine the sext links of node u, in the same
way as mentioned above about the sext links of r. Formally, we have the following lemma.
Lemma 6.1 When an internal node u is newly created in phase w[1 : i] during the
construction of STree (w) with sext links, let r be the existing child node of u and v be the
new leaf node which is also a child of u. Then, sext[u, c] is created for each character c
such that either sext[r, c] or sext[v, c] was present at the beginning of the phase.
Proof. It follows from the deﬁnition that a node x has a sext link labeled by a character
c if and only if an occurrence of the string x is preceded by c. Note that the string u is
a suﬃx of the string w[1 : i], and that each of the occurrences of u within w[1 : i − 1] is
followed by the string α such that uα = r. Therefore, if there is an occurrence of u within
w[1 : i − 1] which is preceded by c, then the node r has a sext link labeled by c. On the
other hand, if c is the preceding character of the occurrence of u within w[1 : i] that ends
at the last character of w[1 : i], then the node v has a sext link labeled by c.
On the other hand, if the active point arrives at a node when case (a) is applied, a
new sext link of the node is created. Suppose that, just after a leaf node v had been
created, the active point stopped on a node in phase w[1 : i] during the construction of
STree (w), where 1 ≤ i ≤ |w|. In addition, assume that v is the jth created leaf node,
where 1 ≤ j ≤ |w|. That is to say, v = w[j : i]. Notice that j ≤ i. After that, if the
active point stops on a node p with case (a) in the next phase, phase w[1 : i + 1], then
a sext link of node p which is labeled w[j] is created and set to point to node v, where v
now represents w[j : i + 1]. Let us clarify the reason for the above. Let u and u be the
parent nodes of v and p, respectively. Notice that then u · w[i : i + 1] = v = w[j : i + 1].
Furthermore, u · w[i + 1 : i + 1] = u · w[i + 1] = w[j + 1 : i + 1] since suf [u] = u . Namely,
node v currently represents w[j : i + 1] and node p corresponds to w[j + 1 : i + 1]. That is
why sext[p, w[j]] = v. If the active point again stops on a node until the algorithm faces
71
case (b), the sext link of the node whose label is w[j] is created and set to point to the
leaf node v as well. A concrete example is shown in Fig. 6.6.
a
a
a
a
a
#
b
b
a#
#
#
#
a
a
b
b
b
a
a
#aaab:
a
a
b
a
a
a
a
a
a
#
b
b
a#
#
#
#
a
a
b
b
b
a
a
#aaaba:
a
a
a
a
a
b
a
b
a
a
a
a
a
a
a
#
b
b
a#
#
#
#
a
a
b
b
b
a
a
#aaabaa:
a
a
a
a
a
b
a
b
a
a
a
a
b
Figure 6.6: STree (#aaab) with the sext links is shown on the left. Node b is the last
created leaf node in that phase. Scanning a new character a, the active point moves
to node a, as seen in the center ﬁgure STree (#aaaba). Then, sext[a, b] is set to point
to the last created leaf node ba. Also in the right ﬁgure representing STree (#aaabaa),
sext[aa, b] = baa, because the active point has arrived at node aa.
Sext Links Pointing to a New Node
The only thing we have not accounted for yet is to change the sext links that point to the
newly created nodes u and v. Let us ﬁrst mention the case of v, a new leaf node. The
following remark about a new leaf node v is common to case (b1) and case (b2). Whenever
a character w[i] appears in string w[1 : i] for the ﬁrst time, a new edge labeled with w[i] is
created from the root node, and v is associated with w[i]. Then, sext[ε, w[i]] = v, because
the root node corresponds to the empty string ε. This can be seen in phases #, #a, #ab,
and #abab$ in Fig. 6.4. If the character w[i] has already appeared in string w[1 : i − 1],
then leaf node v should be pointed to by the leaf node which will be created next.
We now treat how to decide what sext link of STree (w[1 : i]) should be modiﬁed so
as to point to a newly created internal node u, in case (b2). Recall that node u has two
children r and v. Let us suppose that node r is pointed to by a c-labeled sext link of a
node p in STree (w[1 : j]) where j = i − 1, that is, sext[p, c] = r. In other words,
w[1:j]
=⇒
cp = r.
If |u| > |p|, then the sext link of p is modiﬁed so as to point to u (sext[p, c] = u), because
72
w[1:i]
=⇒
cp = u. A concrete example can be seen between phase #abab and phase #abab$ in
Fig. 6.4. sext[ε, a] = abab in phase #abab is modiﬁed as sext[ε, a] = ab at the ﬁrst step
of phase #abab$, where node ab is the internal node newly created in phase #abab$.
In another case (if |u| ≤ |p|), the sext link of node p remains pointing to node r, since
w[1:i]
=⇒
cp = r in this case. Similar discussion holds for the sext links pointing to node v, another
child of node u.
6.3.4 Correctness and Complexity of the Algorithm
The algorithm is summarized as Fig. 6.7. If we compute the sext links of the nodes in
“STree (w) with sext links” according to the algorithm, we have the following:
Theorem 6.1 For any string w ∈ Σ∗
, STree (w) with sext links can be constructed on-line
and in linear time and space with respect to |w|.
Proof. Since it has been proven in [55] that STree (w) can be obtained on-line and in
O(|w|) time, all we have to clarify are the correctness and complexity of the construction
of sext links. The data structure we newly add to the Ukkonen algorithm are the table
sext and rsuf. It is clear that they require O(|Σ|·|w|) space. Therefore, if Σ is a ﬁxed
alphabet, the space complexity of our algorithm is linear.
We have assumed that a string w ends with a unique end-marker $. After $ is scanned,
a new edge labeled with $ is absolutely created from the root node, and the corresponding
new leaf node is also created. After that, the sext link of the root node, which is labeled
$, is set to point to the new leaf node. Then, the chain formed by the sext links of all
the leaf nodes in STree (w) exactly spells wrev
, i.e., the path of DAWG(wrev
) which corresponds
to string wrev
is then completed. This guarantees that the paths of DAWG(wrev
)
corresponding to the suﬃxes of wrev
are also created as the sext links of the internal nodes
of STree (w). This algorithm constructs DAWG(wrev
) on-line, because new sext links are
computed each time a new node is created.
From here on, we establish the sext links can be computed in linear time with respect
to |w|. It is obvious that to decide the sext link of any new leaf node takes only constant
time. When we determine the sext links of a newly created internal node, we copy the
sext links of the two children of the new node. It takes O(|Σ|) time, since each of the two
children has at most |Σ| sext links. Therefore, if Σ is a ﬁxed alphabet, it takes constant
73
time. The matter is the change of sext links due to a new-created internal node. Suppose
that, in phase w[1 : i], acti stays somewhere depth m in STree (w[1:i]). At the beginning
of phase w[1 : i + 1], the algorithm begins to seek for the location where the active point
can stop. Then, at most m sext links are changed until the active point stops. This
implies that the overall complexity of the change of sext links due to new internal nodes
takes O(|w|) time.
6.4 On-Line Construction of SCDAWG
In this section, we propose how to construct SCDAWG for a string w, on-line in O(|w|)
time. Deﬁne CDAWG (w) and SCDAWG (w) in a similar way to the deﬁnition of STree (w).
Our on-line algorithm builds CDAWG (w) in the same way as in [32], and builds certain
edges of CDAWG(wrev
) as the sext links of the nodes of CDAWG (w).
We stress that the algorithm of [32] is based on the Ukkonen suﬃx tree construction
algorithm. This implies, if we add the functions “redirect” and “separate node” in [32]
to the pseudo-code of the algorithm in Section 6.3, we obtain CDAWG (w). The matter
is how to build the edges of CDAWG(wrev
), the sext links of CDAWG(w), of course.
However, we fortunately have the fact that CDAWGs can have “the same amount of information”
as suﬃx trees. The loss of information comes from the property that CDAWGs
have a node having two or more incoming edges, which correspond to two or more nodes
connected by suﬃx links in suﬃx trees. Namely, the lost information is strings obtained
by concatenating labels of some suﬃx links. One hint has been given in [20] as an exercise.
Furthermore, the CDAWG construction algorithm in [32] is capable of storing the “lost”
information as integers in nodes. Notice that if we can treat CDAWGs like suﬃx trees, it
means we can obtain the sext links of CDAWGs.
In the following subsections, we show how the algorithm of Section 6.3 should be
changed when constructing CDAWGs, by using examples. If again turning our attention
to the pseudo-code, the 8th line of update function is changed to as “create a new edge
(r, (p, e), sink);” and labels of reversed suﬃx links and sext links can be of strings, not a
character.
74
6.4.1 Sext Link Corresponding to a Newly Created Edge
A sequence of snapshots on the on-line construction of SCDAWG (#abab$) is shown in
Fig. 6.8. Since character a has appeared in string #a, the edge labeled with a is created
and directed to the ﬁnal node in phase #a. After that, the sext link of the initial node
labeled with a# is set to point to the ﬁnal node. Comparing it with the corresponding
phase in Fig. 6.4, one can see that character # in the label a# of the CDAWG corresponds
to the label # of the sext link from the leaf node a to node #a of the suﬃx tree in the phase
#a. In general, in phase w[1 : i] of the construction of CDAWG (w), the representative
of the ﬁnal node is w[1:i]. Assume that an edge is then created from a node u and it is
the jth edge entering to the ﬁnal node, where 1 ≤ j ≤ i. Then, the jth edge is associated
with w[j :i]. There then exists a “gap” w[1:j − 1] between the representative w[1:i] and
w[j : i]. Notice that this “gap” corresponds to the reversal of the concatenation of the
labels of the sext links between leaf node w[j : i] and leaf node w[1 : i] in STree(w[1 : i]).
On the grounds of this gap w[1 : j − 1], a new sext link of node u is set to point to the
ﬁnal node with label (w[1:j])rev
.
6.4.2 Change of Sext Links
See phases #abab and #abab$ in Fig. 6.8. The active point stays on the middle of the
edge labeled abab in phase #abab, and the edge is split into two edges due to the creation
of the new edge labeled $. Notice that the sext link labeled with a# is also cut into two.
One labeled with a is set to point to the new node ab, and the other labeled # is set from
node ab. It is because
w[1:6]
=⇒
εa = ab and
w[1:6]
=⇒
#ab = #abab$ in this time, where w[1 : 6] = #abab$.
What if a sext link, whose label is of length more than 1, is cut? See Fig. 6.9 that
displays CDAWG (#abb) and CDAWG (#abba). There is a sext link of the initial node
pointing to the ﬁnal node, which is labeled with ba# in CDAWG (#abb). At the beginning
of the conversion to CDAWG (#abba), a new node b is created where the active point
currently stays. Then, the sext link labeled ba# is cut and its former part is set to point
to the new node b, labeled with b. In general, if a new node is created in the middle of an
edge, the sext link corresponding to the edge is cut into two, and its former part is labeled
with the single initial character of the label of the cut sext link. It does not depend on
the length of the label of the sext link to be cut.
To realize the operation above mentioned, we need to associate the sext link labeled
75
ba# with string bb in the ﬁnal node, where bb is not the representative of the ﬁnal node.
This is because if we associate that just with the representative, like sext[ε, ba#] = #abb,
we cannot recognize which sext link pointing to the ﬁnal node should be cut owing to the
newly created node (notice there exist other sext links from the initial node to the ﬁnal
node). Therefore, we make a sext link point to a string represented in a certain node, not
to the representative. For example, on CDAWG (#abb) in Fig. 6.9, sext[ε, #] = #abb,
sext[ε, a#] = abb, sext[ε, ba#] = bb.
As seen in phase #abab$ of Fig. 6.8, the edge labeled with bab$ is merged into the
node ab and its label is modiﬁed to b. According to it, sext[ε, ba#] = bab$ becomes
sext[ε, ba] = b. The character a at the tail of label ba of the sext link corresponds to the
label of the sext link from node b to ab in STree(#abab$) in Fig. 6.4.
Fig. 6.10 displays a node separation that can happen during the construction of
CDAWGs. In Fig. 6.10, as the active point arrives at node ab via the edge labeled b
which belongs to a non-longest path from the initial node to the node ab, the node is
cloned as seen in the CDAWG (#ababcb). Then, sext[ε, ba] = b in CDAWG (#ababc) is
cut into two, one of which is sext[b, a] = ab and the other sext[ε, b] = b.
6.4.3 Implementation of Factors Represented in a Node
As is mentioned above, a sext link in CDAWG (w) is set to point to a certain factor of w
represented in a node. However, if we actually implement all of such strings naively, the
space requirement can be quadratic. Therefore, we implement them with integers referring
to the positions in the input string w. Suppose that the representative of a node p is α in
CDAWG (w[1:i]) for 1 ≤ i ≤ |w|. Then, node p has integers j and k (1 ≤ j ≤ k ≤ i) such
that α = w[j : k] where j represents the beginning position of the left most occurrence
of α in w[1 : i]. In addition to it, each edge entering node p has an integer representing
the entrance order to node p. See the left ﬁgure in Fig. 6.10, CDAWG (#ababc). For
example, the edge labeled ab is the ﬁrst one and the edge labeled b is the second one
entering to node ab. Note that, in CDAWG (#ababc), the edge labeled abc entering
to the ﬁnal node represents two factors ababc and babc, which are the second and the
third members of the ﬁnal node, respectively. Thus the edge labeled abc is associated
with the set {2, 3}. In this way, the edges entering to the ﬁnal node are associated with
the sets {1}, {2, 3}, {4, 5}, {6} from left to right. In general, an edge in a node may
correspond to more than two strings represented in the node. However, the truth is that
76
such strings always occur sequentially in string w, for any w. Therefore, even if an edge
corresponds to more than two strings, we can represent all of them with a pair of integers,
the minimum and the maximum elements in the set associated with. As a result of the
above discussions, we now have:
Theorem 6.2 For any string w ∈ Σ∗
, SCDAWG for w can be constructed on-line in
linear time and space with respect to |w|.
6.5 Conclusion
First, we gave an on-line linear time algorithm to construct the suﬃx tree for a string
together with the DAWG for the reversal of the string. It builds the suﬃx tree based on
Ukkonen’s on-line algorithm [55], and simultaneously builds the DAWG as the sext links
of the nodes of the suﬃx tree.
Blumer et al. [8] gave an oﬀ-line linear time algorithm for the construction of the
SCDAWG for a string: ﬁrst builds the DAWG with the suﬃx links, and then compacts
the DAWG and its suﬃx links to the SCDAWG. Meanwhile, the algorithm we proposed
in this paper directly constructs the SCDAWG for a given string, on-line in linear time:
builds the CDAWG for the string on-line, with the sext links that compose the CDAWG
for the reversal of the string. This enables us to save time and space at the same time
when constructing an SCDAWG.
77
Algorithm Construction of STree (w$) with sext links
in alphabet Σ = {w[−1], . . . , w[−m]}.
1 create nodes root and ⊥;
2 for j := 1 to m do create a new edge (⊥, (−j, −j), root);
3 suf[root] := ⊥;
4 length(root) := 0; length(⊥) := −1;
5 (s, k) := (root, 1); i := 0;
6 lastleaf := nil; n := 0; /* lastleaf is the last (n-th) created leaf node */
7 repeat
8 i := i + 1;
9 (s, k, lastleaf, n) := update(s, (k, i), lastleaf, n);
10 until w[i] = $;
function update(s, (k, p), lastleaf, n):
1 oldr := nil; s := nil;
2 while not check end point(s, (k, p − 1), w[p]) do
3 if k ≤ p − 1 then /* implicit */
4 s := extension(s, (k, p − 1));
5 r := split edge(s, (k, p − 1));
6 else r := s; /* explicit */
7 create a new leaf node v and a new edge (r, (p, e), v);
8 /* e is the global variable representing the scanned length of the input string. */
9 let length(v) be e − n;
10 if oldr = nil then set suﬃx link(oldr, r);
11 if lastleaf = nil then set suﬃx link(lastleaf, v);
12 if r = s then /* maintenance of sext links */
13 c := w[n];
14 if rsuf[r, c] = nil then sext[r, c] := sext[v, c];
15 for each character a such that sext[s , a] = nil do
16 if rsuf[r, a] = nil then sext[r, a] := sext[s , a];
17 for each sext link sext[x, a] = s do /* modify sext links pointing to s */
18 if length(r) > length(x) then sext[x, a] := r;
19 oldr := r; lastleaf := v; n := n + 1;
20 (s, k) := canonize(suf[s], (k, p − 1));
21 if oldr = nil then set suﬃx link(oldr, s);
22 (s, k) := canonize(s, (k, p));
23 if k > p then sext[s, w[n]] := lastleaf;
24 return (s, k, lastleaf, n);
procedure set suﬃx link(s, t):
1 let c be the ﬁrst character of the string represented by s;
2 suf[s] := t; rsuf[t, c] := s; sext[t, c] := s;
Figure 6.7: The algorithm to construct “STree (w) with sext links”. check end point,
extension, canonize, and split edge are identical to those used in Chapter 3.
78
a
b
bba
a
$
$
$b
a
a
b
b
a
a
b
#
#
#
b
b b
a
#
$
$
$
$b
a
b
a
a
b
a
#
#
#
#
b b
a
#
b
a
a
bb
a
a b
a
#
a
#
# #
b
b
aa
a
bb
a
a b
a
#
a
#
#
# b
a
bb
a
b
a
#
a
#
#
#
# a#
a
a##
#
b
a
b
$
$
$$b
a
a
b
b
a
a b
a
#
aa
#
#
#
b
b
a
b
b
a
#
a
b
a
$
#abab:#aba:#ab:#a:#:
#abab$:
Figure 6.8: The on-line construction of SCDAWG (w), where w = #abab$. The solid
arrows represent the edges of CDAWG (w), whereas the broken arrows represent the sext
links of the nodes of CDAWG (w), that are equivalent to the edges of CDAWG(wrev
).
b
a
bb
a
a
b
a
#
a#
# #
b
b
b
#abba:
b
a
b
b
a b
a
#
a
#
#
#
b
b
b
#abb:
a aa b
a
#
Figure 6.9: CDAWG (#abb) and CDAWG (#abba) with their sext links. The sext link
sext[ε, ba#] is cut into two sext links sext[ε, b] and sext[b, a#].
79
a
b bba
a
cc
cc
b
b
a
b
a
a
b
a
#
#
#
#
b
b
b
a
#
b
a
c
a
b
bba
a
cc
c
cb
a
b
a
a
b
a
#
#
#
#
b b
a
#
b
a
c
#ababcb:#ababc:
c
b
a
b b
b
c
b
Figure 6.10: CDAWG (#ababc) and CDAWG (#ababcb) with their sext links.
80
Chapter 7
Bidirectional Construction of Index
Structures
7.1 Background and Motivation
As repeatedly remarked in the previous sections, the invention of Ukkonen’s algorithm
has allowed us to update an input string by adding new strings afterward. Without his
algorithm, we would have to construct the suﬃx tree from scratch. However, since his
algorithm is supposed to only read a string from left to right, even it is not capable of
permitting us to update the input string to the left direction. Namely, if wanting to
extend the current input string by appending another string to its front, we still now have
to reconstruct the suﬃx tree from the beginning. As typically seen in Bioinformatics, a
considerable amount of strings, like DNA sequences for example, has to be treated. It is
easy to imagine that reconstructing the suﬃx tree of such a quite long string is a very
huge task.
In this chapter we give the very one to settle the matter above. Namely, that is an
algorithm to construct a suﬃx tree bidirectionally, both left to right and right to left. The
idea of updating a suﬃx tree from right to left is based on Weiner’s algorithm. However,
his original algorithm is not suitable for updating a suﬃx tree which is specialized to
be treaded by Ukkonen’s algorithm. Thereby we modify Weiner’s algorithm. Mutually,
Ukkonen’s algorithm is also modiﬁed according to the inﬂuence from Weiner’s algorithm.
As once mentioned in Chapter 6, Chen and Seiferas ﬁrst declared the duality of suﬃx
trees and DAWGs [11]. For any string w, the suﬃx tree of w and the DAWG of wrev
can
81
share the same nodes, where wrev
is the reversed string of w. In [30] the on-line algorithm
to construct both the suﬃx tree for w and the DAWG for wrev
was given, whose basis is
Ukkonen’s suﬃx tree construction algorithm. In this sense the DAWG of string u = wrev
can be constructed on-line, from right to left. Also, it is remarked in [14] that Weiner’s
algorithm can be modiﬁed so to build the DAWG of wrev
together with the suﬃx tree of
w. Therefore it can build the DAWG of string u = wrev
from left to right. It results in
that the algorithm we present in this paper can also build the DAWG of a given string
bidirectionally, on-line.
Some related work can be seen in literature: Stoye [50] invented a variant of suﬃx
trees, called aﬃx trees. He gave an algorithm for bidirectional construction of aﬃx trees,
and Maaß improved the time complexity of the algorithm to O(n) [39].
The result of this chapter was published in [28].
7.2 Bidirectional Construction of Suﬃx Trees
7.2.1 Left Extension
The part to extend a given suﬃx tree with a character added at the left of the string is
based on Weiner’s algorithm [58]. This operation is denoted by Left Extension.
Weiner’s Algorithm.
For a smart implementation of the algorithm, we use the bottom node Ukkonen also used
in his algorithm [55]. For any string w ∈ Σ∗
, we suppose that between the bottom node
and the root node of STree (w) there is an edge labeled by symbol Σ, which means any
string in the alphabet. The main idea of Weiner’s algorithm is as follows. Suppose that we
are given string w ∈ Σ∗
for which we are constructing the suﬃx tree. Weiner’s algorithm
reads the string from right to left. The algorithm constructs STree (w[i − 1 : n]) from
STree (w[i : n]) for n + 1 ≥ i ≥ 2, by inserting preﬁxes of w[i − 1 : n] into STree (w[i : n]).
We divide the preﬁxes of w[i − 1 : n] into the following two groups.
(1) Preﬁxes w[i − 1 : h] for 0 ≤ h ≤ j where j is the largest number satisfying w[i − 1 :
j] ∈ Factor(w[i : n]).
(2) Preﬁxes w[i − 1 : h] for j + 1 ≤ h ≤ n.
82
Any preﬁxes in the group (1) do not need to be newly inserted into STree (w[i : n]),
simply because they are already represented in the suﬃx tree. We call the longest string
in the group (1), w[i−1 : j+1], the head for w[i−1 : n], and denote it by headw[i−1:n]. The
preﬁxes in the group (2) are inserted into the suﬃx tree from the location associated with
headw[i−1:n]. Since at least the empty string ε belongs to the group (1), there always exists
the location for headw[i−1:n] in STree (w[i : n]). We start the search of headw[i−1:n] from
the leaf node corresponding to the longest suﬃx of w[i : n], which is w[i : n] itself. We call
it the active leaf and denote it by active leafw[i−1:n]. The path from it to the bottom node
is called the working path. The search starts from active leafw[i−1:n] and proceeds along the
working path up to the bottom node. Naive method to localize it takes quadratic time.
To avoid it, we use two dimensional tables test and link, both of whose ﬁrst components
are characters and second are nodes.
Deﬁnition 7.1 Let w be an arbitrary string in Σ∗
and a be an arbitrary character in Σ.
Assuming that v is a node in STree (w) (i.e.
w
=⇒
v = v), then:
1. test[a, v] = true iﬀ av ∈ Factor(w).
2. link[a, v] = u if u = av for some node u in STree (w).
Otherwise, link[a, v] = nil.
Notice that test[a, v] = true if link[a, v] = nil for any character a ∈ Σ and node
v in STree (w). It is easy to see that link equals the reversed (and labeled) link of a
suﬃx link of STree (w) for any string w ∈ Σ∗
(see Deﬁnition 2.11). Let the ﬁrst test
be the node on the working path and nearest to active leafw[i−1:n] with which the table
test for w[i − 1] is true, and denote it by ﬁrst testw[i−1:n]. Similarly, let the ﬁrst link
be the node on the working path and nearest to active leafw[i−1:n] with which the table
link for w[i − 1] is u for some node u, and denote it by ﬁrst linkw[i−1:n]. Let α be the
concatenation of labels of edges between ﬁrst testw[i−1:n] and active leafw[i−1:n], and β the
one from ﬁrst linkw[i−1:n] to ﬁrst testw[i−1:n]. Note that α can never be ε whereas β might
be ε. Remark that headw[i−1:n] = link[w[i − 1], ﬁrst linkw[i−1:n]]. When ﬁrst testw[i−1:n] =
ﬁrst linkw[i−1:n] (β = ε) and headw[i−1:n] is on an edge (r, γ, s), the edge is split into the two
edges (r, β, headw[i−1:n]) and (headw[i−1:n], δ, s), where γ = βδ. These are followed by the
creation of the new edge (headw[i−1:n], α, new leafw[i−1:n]), where new leafw[i−1:n] denotes
the leaf node newly created. Because new leafw[i−1:n] corresponds to w[i − 1 : n] and all
83
suﬃxes of w[i − 1 : n] other than w[i − 1 : n] itself have already been inserted, it is the
completion of the construction of STree (w[i − 1 : n]). Then, the tables test and link are
maintained. Obviously, new leafw[i−1:n] becomes active leafw[i−2:n] in the next step i − 2.
Modifying Weiner’s Algorithm.
Consider the longest factor y of w[i : n] such that y ∈ Suﬃx(w[i − 1 : n]). We call it
the active point of STree (w[i − 1 : n]) the location y corresponds to, and denote it by
active pointw[i−1:n]. Notice that active pointw[i−1:n] is the longest suﬃx of w[i − 1 : n] not
represented by a leaf node.
Lemma 7.1 Let w ∈ Σ∗
. Suppose that we are updating STree (w[i : n]) to STree (w[i−1 :
n]), where n = |w| and n + 1 ≥ i ≥ 2. Let headw[i−1:n] = ax with a ∈ Σ and x ∈ Σ∗
. If
ax ∈ Suﬃx(w[i − 1 : n]), then x is always active pointw[i−1:n].
Let xy = active leafw[i−1:n] for some string y ∈ Σ+
. Consider the case that
w
=⇒
x = x.
(Then there is no node corresponding to active pointw[i:n] in STree (w[i : n]).) It implies
that ﬁrst testw[i−1:n] is the root node and ﬁrst linkw[i−1:n] is the bottom node. If x = ε,
then headw[i−1:n] = ax cannot be localized by the information from ﬁrst testw[i−1:n] and
ﬁrst linkw[i−1:n]. To avoid this trouble, we treat the active point as though there exists a
node for x = active pointw[i:n]. Concretely, the values of tables test and link are computed
also for active pointw[i:n].
When active pointw[i−1:n] = x is on an edge (r, α, s) in STree (w[i − 1 : n]), it is
represented by a reference pair (r , δ) such that r is an ancestor of r and r δ = x. For the
string β such that rβ = x, (r, β) is called the canonical reference pair of active pointw[i−1:n].
When STree (w[i : n]) is updated to STree (w[i − 1 : n]), the active point is maintained
as follows. Let active pointw[i:n] = u. The suﬃx of w[i : n] corresponding to
active pointw[i:n] is w[ : n], where = n − |u| + 1. There are three cases to consider:
(Case 1) w[ : n] is not a preﬁx of w[i : n].
(Case 2) w[ : n] is a preﬁx of w[i : n] but w[ − 1 : n] is not a preﬁx of w[i − 1 : n].
(Case 3) w[ : n] is a preﬁx of w[i : n] and w[ − 1 : n] is a preﬁx of w[i − 1 : n].
In both (Case 1) and (Case 2), active pointw[i−1:n] = active pointw[i:n]. Meanwhile, in
(Case 3), active pointw[i−1:n] corresponds to au where a = w[i − 1].
84
To implement the suﬃx tree using only linear space, an edge (r, α, s) in STree (w[i : n])
is actually implemented as (r, (k, p), s) such that w[k : p] = α, where k ≥ i and p ≤ n.
In case that s is a leaf node, p = n. However, we cannot specify p since the suﬃx tree
might be also extended to the right direction by Right Extension to be introduced in the
next section. If specifying the value of p, it becomes impossible that the algorithm runs
in linear time. Therefore, we adopt Ukkonen’s so-called “∞-trick” [55]. That is to say,
we implement the edge as (r, (k, ∞), s) such that w[k : n] = α.
Theorem 7.1 For any string w ∈ Σ∗
, the modiﬁed Weiner’s algorithm constructs STree (w)
on-line (right to left) and in linear time with respect to |w|.
The construction of STree (cocoo) by Left Extension is displayed in Fig 7.1.
ε
Σ
Σ
o
o
o
Σ
Σ
oo
Σ
o
o
Σ
o o
coo
Σ
Σ
o
o
o o
o
o
c
c
ocoo
Σ
Σ
o
o
o
o
o
o
c
c
o
o
c
o
cocoo
Σ
Σ
o
o
o
o
o
c
c
o
o
c
o
o
c
o
o
c
c
Figure 7.1: Construction of STree (cocoo) by Left Extension. The gray star denotes the
active point for each step. The broken arrows represent the links for the table link.
85
7.2.2 Right Extension
The part to extend a given suﬃx tree with a new character added to the right of the string
is based on Ukkonen’s algorithm [55]. This operation is denoted by Right Extension.
See Section 3.3 where Ukkonen’s algorithm has been recalled.
Modifying Ukkonen’s Algorithm.
We have to modify Ukkonen’s algorithm so that it also computes the tables test and link,
which are used by Left Extension.
Lemma 7.2 Ukkonen’s algorithm can be modiﬁed so as to compute the tables test and
link during the construction of STree (w) for any w ∈ Σ∗
. The overall time complexity is
still O(|w|).
We here omit the detail of proving the above lemma, but remark that computing the
tables test and link is inherently the same thing as computing the sext links, abbreviation
for the shortest extension links, introduced in [14]. It was showed in [30] that a modiﬁed
Ukkonen’s algorithm can compute the sext links in O(|w|) time. As for the active leaf
stated in the previous section, it is obvious that active leafw[1:i] = active leafw[1:i+1] for any
1 ≤ i ≤ n − 1. Therefore we need no maintenance for it.
The construction of STree (cocoo) by Right Extension is displayed in Fig 7.2.
7.2.3 Bidirectional On-Line Construction of Suﬃx Trees
We here explain how our algorithm constructs a suﬃx tree in bidirectional on-line manner.
Given a string w ∈ Σ∗
, the algorithm can get the construction started with any position
i in w, where 1 ≤ i ≤ |w|. Firstly STree (w[i : i]) is constructed, by either Left Extension
or Right Extension. Then by Left Extension STree (w[i : i]) is updated with characters
w[j] where j is from i − 1 down to 1, and by Right Extension with characters w[k]
where k is from i + 1 up to n. For any j and k the structure built by the algorithm is
STree (w[j : k]). Here a symbol $ appearing nowhere in w is appended after w if w[n] is
not a unique character in w. Since STree (w$) = STree(w$), after the construction we
obtain inherently the same structure as STree(w) for any string w. In Fig. 7.3 an example
of bidirectional construction of STree (cocoon) is shown.
86
ε
Σ
Σ
c
Σ
Σ
c
c o
o
co
Σ
Σ
c
o
c
cocoo
Σ
Σ
o
o
o
o
o
c
c
o
o
c
o
o
c
o
o
c
c
o
coc
Σ
Σ
c
o
c
c
c
c
o
o
coco
Σ
Σ
c
o
c
c
c
oo
o
Figure 7.2: Construction of STree (cocoo) by Right Extension. The gray star denotes the
active point for each step. Here only links for the table link are drawn as broken arrows,
so the links of the table suf correspond to the broken arrows in the reversed direction.
Theorem 7.2 For any string w ∈ Σ∗
the proposed algorithm constructs STree (w) in
bidirectional on-line manner and in time linear in |w|.
Proof. We brieﬂy prove that from any position of a given string the suﬃx tree can be
built in linear time.
Given a string w ∈ Σ∗
, suppose that we get the construction started with the position
i in w, where 1 ≤ i ≤ |w|. Let |w| = n. Suppose we here construct STree (w[i : i]) by
Right Extension. Then the suﬃx tree is updated by Right Extension (n − i + 1) times.
The matter needed to be considered is the cost to compute the active point. Though the
active point is computed (n − i + 1) times, it might take more than O(n − i + 1) time if
the suﬃx tree is extended also by Left Extension during the update by Right Extension.
87
ε
Σ
Σ
c
Σ
Σ
c
o
o
c
co
Σ
Σ
c
o
c
o
o
oco
Σ
Σ
c
o
c c
o
o
ocoo
Σ
Σ
o
o
o
o
o
o
c
c
o
o
c
o
ocoon
Σ
Σ
o o
o
o
o
o
c
c
o
o
c
o
o
o
o
c
c
n
n
n
n
n
o
n
cocoon
Σ
Σ
o o
o
o
o
c
c
o
o
c
o
n
n
n
n
n
o
n
c
n
Figure 7.3: Bidirectional Construction of STree (cocoon). The construction has begun
with c in the position 3 of the input string cocoon.
However, the cost is bounded by O(n − i + 1 + k), where k (0 ≤ k ≤ i − 1) is the number
of update of the suﬃx tree by Left Extension. Consequently, the cost of Right Extension
is O(n) for any i with 1 ≤ i ≤ n.
From now on we consider the cost of Left Extension. Suppose that we now have
STree (w[k : j]) with 1 ≤ k ≤ i − 1 and i ≤ j ≤ n, that is, the suﬃx tree has been
updated (i − k) times by Left Extension and (j − i + 1) times by Right Extension. Then,
the cost of the next iteration by Left Extension can be charged to the diﬀerence between
the depths of active leafw[k:j] and active leafw[k−1:j] (for detail see [14]). Thus the sum of
all these diﬀerences is proportional to j. Since j ≤ n, the cost of Left Extension is O(n)
for any i with 1 ≤ i ≤ n.
88
7.3 Bidirectional Construction of DAWGs
The directed acyclic word graph (DAWG) is a well known and widely studied index
structure, originally introduced in [7]. The DAWG is the smallest ﬁnite state automaton
recognizing all suﬃxes of a given string [12].
Let us go back to the suﬃx trees. We introduce a table called the shortest extension
links, the sext links for short, of STree (w) for a string w ∈ Σ∗
. The idea of the sext
link was originally given in [14], but it was for STree(w). What we treat here is one for
STree (w).
Deﬁnition 7.2 Let w be an arbitrary string in Σ∗
and a be an arbitrary character in Σ.
For a string x ∈ Factor(w), sext[a,
w
=⇒
x ] =
w
=⇒
ax.
It derives from Deﬁnition 7.2 that the sext links can be described as the set T =
{(
w
=⇒
x , a,
w
=⇒
ax) | x, ax ∈ Factor(w) and a ∈ Σ}. Assume that a string w ends with a unique
character. On this assumption
w
=⇒
x =
w
−→x for any string x ∈ Factor(w). Therefore, the
following set T is identical to T for any such string w, T = {(
w
−→x , a,
w
−→ax) | x, ax ∈
Factor(w) and a ∈ Σ}. By considering the reversal wrev
of the string w, the set T can be
transformed as T = {(
z
←−y , a,
z
←−ya) | y, ya ∈ Factor(z), a ∈ Σ, and z = wrev
}. Observe that
there is a trivial one-to-one correspondence between E in Deﬁnition 2.9 and T . Hence
the following lemma stands.
Lemma 7.3 Let w be an arbitrary string in Σ∗
whose right most character is unique in
w. The set T of sext links of w is identical to the set E of edges of DAWG(w).
Meanwhile, the table sext has a very close correspondence with tables test and link.
Let v be a node of STree (w) and a be a character in Σ, then;
(1) if sext[a, v] = nil, then test[a, v] = false.
(2) else test[a, v] = true. Let sext[a, v] = u, then
(a) if |u| = |v| + 1, then link[a, v] = u.
(b) else link[a, v] = nil.
If we use table length for a node x whose value is |x|, it is possible to examine the condition
of (2)-(a), whether |u| = |v| + 1, in constant time. It implies that when constructing
STree (w) in bidirectional on-line manner, we can use the table sext instead of the tables
89
test and link. When we are given a string w ∈ Σ∗
, we consider the string z = wrev
and
use a symbol $ occurring nowhere in z, as z$. It then follows from Theorem 7.2 and
Lemma 7.3 that:
Theorem 7.3 For any string w ∈ Σ∗
, the algorithm presented in Section 7.2 can construct
DAWG(w) in bidirectional on-line manner and in time linear in |w|.
It is shown in Fig 7.4 how DAWG(w) is constructed to both direction, where w =
cocoo.
Σ
Σ
Σ
Σ
Σ
Σ
ε
Σ
Σ
c
c
o
o
c
oc
c
o
c
o
o
oco
c
o
c
c
oc
c
o
Σ
Σ
o
o
ococ
c
o
c
c
o
c
o
oooooo
c
c
Σ
Σ
o
o
oococ
c
o
c
c
o
c
o
c
c
c
o
c
o
o
o
c o
Σ
Σ
o
o
oococ$
c
o
c
c
c
o
c
c
o
c
o
o
o
c
o
$
$
o
c
$
c
$
$
o
o
c
c
$
$
Figure 7.4: We are given string cocoo and now constructing DAWG(cocoo). In so doing,
we build STree (oococ$) while computing the sext links. The broken edges represent
the edges of the suﬃx tree, whereas the solid edges do the sext links. Ignoring the
path $co from the root node, the sext links in STree (oococ$) are equal to the edges of
DAWG(cocoo).
90
7.4 Concluding Remarks
We developed a bidirectional, on-line, and linear-time suﬃx tree construction algorithm.
We also showed that the algorithm can also construct DAWGs in bidirectional on-line
manner and in linear time with respect to the length of an input string.
It might be interesting to consider bidirectional construction of compact directed
acyclic word graphs (CDAWGs), which require smaller space than both suﬃx trees and
DAWGs [8, 16]. Extending a given CDAWG to the right direction should be based on the
on-line (left to right) algorithm introduced in [32]. The left extension of a CDAWG could
perhaps be accomplished by a modiﬁed version of Weiner’s algorithm, which is going to
be our future work.
91
Chapter 8
The Minimum All-Suﬃxes Directed
Acyclic Word Graphs
8.1 Introduction
A variety of patterns have been considered so far, according to various kinds of purposes
and aims. The most basic one is a substring pattern. Let Σ be a ﬁnite alphabet. We
call an element in Σ a character, and one in Σ∗
a string. We say a pattern string p is a
substring of a text string w if w = upv for some strings u, v ∈ Σ∗
. When a text w is ﬁxed
and a pattern p is ﬂexible, once constructing a suitable data structure for w, we can solve
the substring matching problem in O(|p|) time, where |p| denotes the length of p. In order
to solve the problem eﬃciently, much attention has extensively been paid to inventing
eﬃcient data structures, such as suﬃx trees [58, 43, 55], directed acyclic word graphs
(DAWGs) [7, 12], compact directed acyclic word graphs (CDAWGs) [8, 16, 32], suﬃx
arrays [41], compact suﬃx arrays [40], suﬃx cacti [35], compressed suﬃx arrays [47, 19],
that are also mentioned in previous chapters.
Meanwhile, the problem ﬁnding a subsequence pattern has also been widely studied.
We say a pattern p is a subsequence of a text w if p can be obtained by removing zero
or more characters from w. By means of the directed acyclic subsequence graph (DASG)
for w, we can examine whether or not p is a subsequence of w in O(|p|) time [4, 15]. An
episode pattern is a “length-bounded” version of a subsequence pattern [42], as mentioned
in Chapter 9. An episode pattern is given in the form of a pair of a string p and an integer
k, as p, k . If p is a subsequence of x such that x is a substring of w with |x| ≤ k, we
92
say that the episode pattern p, k matches w. The episode directed acyclic subsequence
graphs (EDASGs) were given in [53], for a practical solution of the problem.
Now we propose a new kind of pattern matching problem: Given a text string w =
w1w2 · · · wn, a string p and an integer i, examine whether or not p is a substring of w[i :]
where w[i :] = wi . . . wn (NOTE: if i > |w|, the answer is always NO). We name the
pattern p, i a beginning-sensitive pattern, a BS-pattern for short. For any string w ∈ Σ∗
we denote by DAWG(w) the DAWG of w. Using the DAWGs for all suﬃxes of w, this
problem is solvable in O(|p|) time. This simple collection of the DAWGs is called the
naive all-suﬃxes directed acyclic word graph for w, written as the naive ASDAWG(w).
Since the size of DAWG(w) is O(|w|), that of the naive ASDAWG(w) is O(|w|2
).
In this paper we introduce a new data structure, named the minimum ASDAWG(w)
and denoted by MASDAWG(w). MASDAWG(w) is the minimization of the naive ASDAWG(w).
We show that the size of MASDAWG(w) is Θ(|w|) if |Σ| = 1, and Θ(|w|2
) if |Σ| ≥ 2.
Also, we produce an on-line algorithm that directly constructs MASDAWG(w) in time
linear in the size of MASDAWG(w).
We show further two applications of MASDAWG(w), one of which is called the region
sensitive pattern matching problem: Given a text string w = w1w2 · · · wn, a string p and
integers i, j, examine whether or not p is a substring of w[i : j] where w[i : j] = wi . . . wj
(NOTE: if i > |w|, the answer is always NO). We name the pattern p, (i, j) a regionsensitive
pattern, an RS-pattern for short. Using MASDAWG(w) with additional data,
the RS-pattern problem is solvable in O(|p|) time.
Finding a good rule to separate given two sets of strings, often referred to as positive
examples and negative examples, is a critical task in Knowledge Discovery and Data
Mining. In [22], an eﬃcient method, with which a subsequence pattern is considered as a
rule for the separation, was given, and in [23] one using an episode pattern was proposed
(see also Chapter 9). MASDAWG(w) is believed certainly to be a good “weapon” to
develop a practical algorithm to ﬁnd the best VLDC-patterns to distinguish given two
sets of strings eﬃciently. In fact, our experimental result has shown that the average size
of the minimum ASDAWGs for random texts of length 1 to 500 over a binary alphabet is
proportional to |w|1.24
, in spite of the theoretical space complexity, Θ(|w|2
).
The result was published in [33].
93
8.2 All-Suﬃxes Directed Acyclic Word Graphs
In this chapter we introduce a new structure named the all-suﬃxes directed acyclic word
graph (ASDAWG for short). Also, the minimized version of the structure, called the
minimum ASDAWG (MASDAWG for short), is also deﬁned. We give a tight bound for
the space complexity of the MASDAWG for an input string.
Throughout this chapter, let w[i :] = w[i : |w|] for 1 ≤ i ≤ |w| + 1.
Deﬁnition 8.1 (All-Suﬃxes DAWG (ASDAWG)) ASDAWG(w) is a kind of deterministic
automaton with |w|+1 initial states, designated by integers 0, 1, . . . , |w|, in which
the subgraph consisting of the states reachable from an initial state k and of their outgoing
edges is DAWG(w[k + 1 :]).
The simple collection of DAWG(w[1 :]), DAWG(w[2 :]),. . . , DAWG(w[n]), DAWG(w[n+
1 :]) (n = |w|) is an example of ASDAWG(w), to which we refer as the naive ASDAWG(w).
The number of nodes of the naive ASDAWG(w) is Θ(|w|2
). By minimizing the naive
ASDAWG(w), we can obtain the minimum ASDAWG(w), which is denoted by MASDAWG(w).
The naive ASDAWG(abba) and MASDAWG(abba) are shown in Fig. 8.1. The minimization
is performed based on the equivalence relation deﬁned as follows. Let denote a node
[x]R
u of DAWG(u) by an ordered pair u, [x]R
u . Every node of the naive ASDAWG(w) can
be represented by a pair u, [x]R
u
with u ∈ Suﬃx(w) and x ∈ Factor(u). The equivalence
relation, denoted by ∼w, is deﬁned by
u, [x]R
u ∼w v, [y]R
v ⇔ x−1
Suﬃx(u) = y−1
Suﬃx(v)
A node of MASDAWG(w) corresponds to an equivalence class under ∼w. We write
u, [x]R
u
simply as u, [x] if no confusion occurs.
Proposition 8.1 Let u ∈ Suﬃx(w). Let x be a nonempty factor of u. We factorize
u as u = hxt and assume h is the shortest such string. Then, hxt, [x] is equivalent to
sxt, [x] for every suﬃx s of h. (NOTE: The string x is not necessarily the representative
of [x]R
u
.)
Let h0, h1, . . . , hr be the suﬃxes of the string h arranged in the decreasing order of
their length. The above proposition implies an existence of the chain of equivalent nodes
h0xt, [x] , h1xt, [x] , . . . , hrxt, [x] .
94
MASDAWG(abba)
a b
b
b
b
b
b
a
aa
a
0
1
2
3
a
DAWG(abba)
a b
b
b
b
a
a
b b a
a
a
DAWG(bba)
b
a
a
a
DAWG(ba)
DAWG(a)
DAWG(e)
4
Figure 8.1: On the left DAWG(x) for any string x in Suﬃx(abba) are shown, and the
collection of those is the naive ASDAWG(w). On the right MASDAWG(w) is displayed.
While there are 16 states and 16 transitions in the former in total, there are 9 states and
12 transitions in the latter. For example, the nodes abba, [b] and bba, [b] are equivalent
due to Case 1 and merged into one. Also, abba, [abb] , bba, [bb] , and ba, [b] are merged
into one node, where the ﬁrst two are equivalent due to Case 2 and the last two are
equivalent due to Case 3. The upper four sink nodes are equivalent due to Case 2 and
the lowest one is equivalent to them (see Lemma 8.2), and therefore the ﬁve are merged
into one sink node.
In case more than one string exist in [x]R
u , the chain length r is maximized if we choose the
shortest one as x. The chain, however, does not necessarily break at the node hrxt, [x] .
The shortest string in [x]R
u is not necessarily the shortest in [x]R
hrxt: Shorter one may exist.
Thus we need more precise discussion.
Lemma 8.1 Let h ∈ Σ+
and u, hu ∈ Suﬃx(w). If a node of DAWG(u) is equivalent to
some node of DAWG(hu), then it is also equivalent to some node of DAWG(au) where a
is the last character of the string h.
Proof. Let h = ta (t ∈ Σ∗
). Assume t = ε. Let x ∈ Factor(u) with x = ε, and
y ∈ Factor(tau) with y = ε. Assume x−1
Suﬃx(u) = y−1
Suﬃx(tau). We have two cases
to consider.
95
• x ≡R
u y. In this case, every occurrence of the string y within tau must be included
within the u part. Thus, we have x−1
Suﬃx(u) = y−1
Suﬃx(au).
• x ≡R
u y. In this case, (1) y is written as y = sx where s is a nonempty string, and
(2) there is an occurrence of y within tau that covers the boundary between a and
u but the x part of the occurrence of y = sx is contained in the u part of the string
tau. In this case, by truncating an appropriate length preﬁx of s we can obtain a
string z as a suﬃx of y = sx such that x−1
Suﬃx(u) = z−1
Suﬃx(au).
The proof is now complete.
The above lemma guarantees that the DAWGs sharing one node of MASDAWG(w) are
‘consecutive.’ We therefore concentrate on the relation between two consecutive DAWGs.
First, we consider the equivalence of the initial state.
Lemma 8.2 Suppose b ∈ Σ and u, bu ∈ Suﬃx(w). Let y ∈ Factor(bu) and assume y is
the representative of [y]R
bu. Then, the node u, [ε] and bu, [y] are equivalent under ∼w if
and only if y = b and u is of the form b with ≥ 0.
See, for example, MASDAWG(bbbbb) shown in Fig. 8.2.
b
0 1
b
2
b
3
b
4
b
5
Figure 8.2: MASDAWG(w) for w = b5
. For every i = 0, 1, . . . , 4, the initial node [ε]R
bi of
DAWG(bi
) is equivalent to the node [b]R
bi+1 of DAWG(bi+1
).
As an extreme case of Lemma 8.2 where = 0, the node [ε]R
ε
of DAWG(ε) is always
equivalent to the sink node [b]R
b
of the previous DAWG(b).
Next, we consider the case of the states but the initial state.
Lemma 8.3 Suppose b ∈ Σ and u, bu ∈ Suﬃx(w). Let x ∈ Factor(u) with x = ε.
Let y ∈ Factor(bu) with y = ε. Assume x and y are the representatives of [x]R
u
and
[y]R
bu, respectively. The equivalence u, [x] ∼w bu, [y] implies that if y ∈ Preﬁx(bu) then
y = bx and x ∈ Preﬁx(u), and otherwise y = x. Moreover, u, [x] ∼w bu, [y] holds if
and only if either
96
(Case 1) x ∈ Preﬁx(bu) and y = x;
(Case 2) x ∈ Preﬁx(u), x ≡R
bu y, and y = bx; or
(Case 3) x = bi
, y = bi+1
, and u is of the form b s such that i ≤ , and s ∈ Σ∗
does not
begin with b and does not contain an occurrence of bi
.
Proof. Suppose x−1
Suﬃx(u) = y−1
Suﬃx(bu). Let u[i + 1 :] (0 < i ≤ |u|) be the longest
member of this set.
1. When y ∈ Preﬁx(bu). Then, i = |y| − 1 and y = by with y = u[1 : i]. Since
u[i+1 :] ∈ Suﬃx(x), we have u = hxu[i+1 :] for some h ∈ Σ∗
. Namely, x is a suﬃx
of y = u[1 : i].
(a) When y ∈ Preﬁx(bu). We have y ≡R
bu y and
(y )−1
Suﬃx(u) = (y )−1
Suﬃx(bu) = y−1
Suﬃx(bu) = x−1
Suﬃx(u),
which implies x ≡R
u y . Since y ∈ Preﬁx(u), y must be the representative of
[y ]R
u = [x]R
u , thus we have x = y .
(b) When y ∈ Preﬁx(bu). String y is a preﬁx of y = by , and therefore has a period
of 1. Hence we have y = bi
and y = bi+1
. Since x is a suﬃx of y = bi
, x = bj
for
some j with 0 < j ≤ i. If j < i, then u[j + 1 :] ∈ x−1
Suﬃx(u), a contradiction.
Thus we have j = i, i.e., x = bi
. On the other hand, u[1 : i] = y = bi
and thus
u is of the form b s such that ≥ i and s ∈ Σ∗
does not begin with b. We can
show that the string s cannot contain an occurrence of x = bi
.
Note that we have x ∈ Preﬁx(u) in both the cases.
2. When y ∈ Preﬁx(bu). We have y−1
Suﬃx(u) = y−1
Suﬃx(bu) = x−1
Suﬃx(u), which
implies x ≡R
u y. From the choice of x, y must be a suﬃx of x and x = δy with δ ∈ Σ∗
.
Assume, for a contradiction, that x−1
Suﬃx(bu) = x−1
Suﬃx(u). Then there must
be a suﬃx u[j + 1 :] of u such that j < i and bu = hxu[j + 1 :] with h ∈ Σ∗
. Since
x = δy, we have bu = hδyu[j + 1 :], which implies u[j + 1 :] ∈ y−1
Suﬃx(bu), a
contradiction. Hence we have x ≡R
bu y. From the choice of y, x must be a suﬃx of
y. Thus we have x = y.
97
It should be noted that Case 1 and Case 2 of Lemma 8.3 ﬁt to Proposition 8.1, whereas
Case 3 is irregular in the sense that the two equivalence classes [x]R
u and [y]R
bu have no
common member despite u, [x] ∼w bu, [y] . See Fig. 8.1, which includes instances of
Case 1, Case 2, and Case 3.
The owner of a node of MASDAWG(w) is deﬁned to be the DAWG(w[k :]) such that
k is the smallest integer for which DAWG(w[k :]) shares the node. We are now ready to
estimate the lower bound of the number of nodes of MASDAWG(w).
Theorem 8.1 When |Σ| ≥ 2, the number of nodes of MASDAWG(w) for a string w is
Θ(|w|2
). It is Θ(|w|) for a unary alphabet.
Proof. The proof for the case of a unary alphabet Σ = {a} is not diﬃcult. We can use
Lemma 8.2. We now prove the lower bound in the case of |Σ| ≥ 2. Let us consider a string
w = (ab)m
(ba)m
, where a, b are distinct characters from Σ. For each i = 2, . . . , m − 1, let
ui = (ab)i
(ba)m
. Let x = (ba)j
with 0 < j < i. It is not diﬃcult to show that x ≡R
ui
ax
and x ≡R
ui
b−1
x, and therefore [x]R
ui
= {x}. Thus x is the representative of [x]R
ui
, and we
can use the above lemma. Since x ∈ Preﬁx(bui), x ∈ Preﬁx(ui), and the ﬁrst character of
ui is not b, none of the three conditions is satisﬁed, and therefore DAWG(ui) is the owner
of the node corresponding to [x]R
ui
. Thus, the nodes of MASDAWG(w) corresponding to
[(ba)1
]R
ui
, [(ba)2
]R
ui
, . . . , [(ba)i−1
]R
ui
are distinct and are owned by DAWG(ui). For each i with 1 < i < m, DAWG(ui) has at
least i − 1 own nodes. Thus, MASDAWG(w) has Ω(m2
) = Ω(|w|2
) nodes.
8.3 On-Line Construction of MASDAWGs
Since the construction of the naive ASDAWG(w) takes O(|w|2
) and the minimization
can be performed in linear time proportional to the number of the edges of the naive
ASDAWG(w) (see [46]), we can build MASDAWG(w) in O(|w|2
). On the other hand,
we have shown that the number of nodes in MASDAWG(w) is Θ(|w|2
). We are therefore
interested in on-line and direct construction of MASDAWG(w). We have obtained the
following result.
Theorem 8.2 MASDAWG(w) can be constructed directly and on-line in time linear with
respect to its size.
98
The algorithm for on-line construction of MASDAWG(w) basically simulates the online
constructions of the DAWGs for all suﬃxes of a string w. Figure 8.3 illustrates the
on-line construction of MASDAWG(abbab).
a a b
b
b
a ab
0 0
1
a b
b
b
b
b
abb
0
1
2
a b
b
b
b
b
b
a
aa
a
abba
0
1
2
3
a b
b
b
b
b
b
a
a
a
a
abbab
0
1
2
3
a
a
b
b
4
b
a
1 2 3
4 5
Figure 8.3: On-line construction of MASDAWG(w) for w = abbab. Each initial state
becomes independent whenever the newly appended character violates the condition of
Lemma 8.2. Node separation of other type occurs only twice. One happens during the
update of MASDAWG(ab) to MASDAWG(abb). The sink node consisting of abb, [ab] and
b, [b] is separated into two nodes. This is recognized as a node separation in DAWG(abb).
The other occurs during the update of MASDAWG(abba) to MASDAWG(abbab). The
node consisting of abba, [abb] , bba, [bb] , and ba, [b] is separated into two. This is
a special case in the sense that no node separation occurs inside any of DAWG(abba),
DAWG(bba), and DAWG(ba). (See the ﬁrst case of Lemma 8.7.) (Note: Though each
accepting state is marked double-circled in any step in this ﬁgure, we do not maintain
it on-line. After the construction of MASDAWG(w) is completed, we mark every node
reached during the suﬃx-links-traversal from the sink node.)
99
In the following sections, we present a basic idea of the algorithm together with several
lemmas which support it.
8.3.1 Suﬃx Links
In the construction, the suﬃx links play a key role. One main diﬀerence compared with
constructing a single DAWG is that a node may have more than one suﬃx link. This
happens because MASDAWG(w) may contain two distinct, equivalent nodes u, [x] and
v, [y] such that the node to which the suﬃx link from u, [x] points is not equivalent
to the node to which the suﬃx link from v, [y] points. We update MASDAWG(w)
into MASDAWG(wa) as if the underlying DAWGs for w[1 :], w[2 :], . . . were updated
simultaneously, as follows. Conceptually, we reserve all suﬃx links of these DAWGs, by
associating each suﬃx link with the corresponding DAWG. Whenever two or more suﬃx
links are duplicated, the corresponding DAWGs are consecutive due to Lemma 8.1, so
that we can handle them at once. This is critical for the linearity of our algorithm. We
traverse the dag induced by the suﬃx links rooted from the sink node, in the order of the
corresponding DAWGs, and process each encountered node appropriately (creating a new
edge to the new sink node, separating the node, or redirecting an edge to the separated
node).
8.3.2 Compact Representation of Node Length Information
Remember in the on-line construction of the DAWG for a single string, there occurs an
event so-called node separation. Formally, this event is described as follows. We store in
each node [x]R
w of DAWG(w) its length, namely, the length of the representative of [x]R
w.
Consider updating DAWG(w) to DAWG(wa) where a is a character. Let z be the longest
suﬃx of wa that also occurs within w. We call it the longest repeated suﬃx of wa. A node
separation happens iﬀ z is not the representative of [z]R
w
. The node [z]R
w
can be detected
by traversing the suﬃx link chain from the sink in order to ﬁnd its parent node [z ]R
w
,
which is the ﬁrst encountered node on the chain that has an outgoing edge labelled by a.
Whenever the length of [z]R
w
is greater than that of its parent [z ]R
w
plus one, the node [z]R
w
of DAWG(w) is separated into two nodes [x]R
wa and [z]R
wa of DAWG(wa), where x is the
representative of [z]R
w
.
Note that a node of MASDAWG(w) corresponds to an equivalence class under the
100
equivalence relation ∼w, and therefore two or more DAWGs may share a node of MASDAWG(w).
We need to know the length of the corresponding node of an arbitrary one among them.
Naive solution would be to store into a node of MASDAWG(w) a (|w| + 1)-tuple of integers,
the ith value of which indicates the length of the corresponding node of the i-th
DAWG, where i = 0, 1, . . . , |w|. The space requirement is, however, proportional to |w|3
.
Below we give an idea of compact representation of the tuple.
Lemma 8.4 Let w[i + 1 :], [x1] , . . . , w[i + :], [x ] be the nodes of the naive ASDAWG(w)
which are merged into one node in MASDAWG(w), where 0 ≤ i and i + ≤ |w| + 1. We
assume each of the strings x1, . . . , x is the representatives of the equivalence class of it.
Then, there exists an integer k with 1 ≤ k ≤ such that
xj =



xk, if 1 ≤ j ≤ k;
xk[j − k + 1 :], if k < j ≤ .
(See Fig. 8.4.)
Proof. By Lemma 8.3.
For example, MASDAWG(abb) in Fig. 8.3 has a node consisting of abb, [b] and
bb, [b] . Also, MASDAWG(abba) has a node consisting of abba, [abb] , bba, [bb] , and
ba, [b] .
It follows from the above lemma that the function that takes as input an integer s
and returns |xs| if 1 ≤ s ≤ can be represented as a quartet (i, , k, |xk|), which requires
only a constant space (or O(log |w|) space). The update procedure of the quartet for each
node is basically apparent, except for the nodes in which node separations occur.
8.3.3 Node Separation
Recall that two or more DAWGs can share one node of MASDAWG(w), and each of them
has a possibility of being separated into two nodes. This seems to complicate the update
of MASDAWG(w). However, we can readily show the following lemma.
Lemma 8.5 Suppose b ∈ Σ and u, bu ∈ Suﬃx(w). Let x ∈ Factor(u) with x = ε. Let
y ∈ Factor(bu) with y = ε. Assume x and y are the representatives of [x]R
u and [y]R
bu,
respectively. Suppose u, [x] ∼w bu, [y] . Let a ∈ Σ, and let z be the longest repeated
suﬃx of bua. Suppose z ∈ [y]R
bu. If |z| < |y|, then z is also the longest repeated suﬃx of
101
i + 1: x1
x2
· · ·
· · ·
xk
xk+1
xk+2
· · ·
· · ·
x
Figure 8.4: The representatives xj of [xj]R
w[i+j:] such that the nodes w[i + j :], [xj] of the
naive ASDAWG(w) are merged into one node of MASDAWG(w).
ua, and z ∈ [x]R
u . If |z| = |y|, then x is a repeated suﬃx of ua (not necessarily to be the
longest).
The next lemma characterizes the node separations that occur during the update of
MASDAWG(w) to MASDAWG(wa).
Lemma 8.6 Consider the node of MASDAWG(w) stated in Lemma 8.4 (see Fig. 8.4).
Let z be the longest repeated suﬃx of w[i + j :]a. Suppose z ∈ [xj]R
w[i+j:]
.
1. When |z| = |xk|: Node separation occurs in none of the DAWGs for the strings
w[i + j :], . . . , w[i + :].
2. When |z| < |xk|: Let t be the maximum integer such that z is a proper suﬃx
of xt. Node separation occurs in each of the DAWGs for the strings w[i + j :
], . . . , w[i+t :]. That is, for each j = 1, . . . , t, the node [xj]R
w[i+j:]
of DAWG(w[i+j :])
is separated into [xj]R
w[i+j:]a and [z]R
w[i+j:]a inside DAWG(w[i + j :]a). The nodes
w[i + j :]a, [x1] , . . . , w[i + :]a, [x ] are equivalent under ∼wa, and the new nodes
w[i + j :]a, [z] , . . . , w[i + t :], [z] are also equivalent under ∼wa.
The node separations of DAWGs characterized in the above lemma lead to a node separation
in the update of MASDAWG(w) to MASDAWG(wa). It simultaneously performs
102
the node separations within each DAWG caused by the common z. (For the same z, we
can take j as small as possible.)
The remaining problem to be overcome is that there is another kind of node separation
in the update of MASDAWG(w).
Lemma 8.7 In the update of MASDAWG(w) to MASDAWG(wa), node separation of
the following types may occur, where w ∈ Σ∗
and a ∈ Σ.
1. When w[i + 1 :] is of the form b +1
s such that w[i] = b or i = 0, ≥ 1, and s in Σ∗
does not begin with b and does not contain an occurrence of b :
Let d be the largest integer such that s contains an occurrence of bd
. MASDAWG(w)
has a node consisting of
w[i + j + 1 :], [bd+k
] , w[i + j + 2 :], [bd+k−1
] , . . . , w[i + j + k], [bd+1
] ,
where k = − (d + j) + 1, for each j = 0, 1, . . . , d. If |s| > 0, s ends with bd
, and
a = b, then the node is separated into two nodes, one of which consists of
w[i + j + 1 :]a, [bd+k
] , w[i + j + 2 :]a, [bd+k−1
] , . . . , w[i + j + k − 1]a, [bd+2
] ,
and the other consists only of w[i + j + k :]a, [bd+1
] .
2. When w[i + 1 :] is of the form b with ≥ 1 such that w[i] = b or i = 0:
MASDAWG(w) has a node consisting of
b , [bj
] , b −1
, [bj−1
] , . . . , b −j
, [ε] ,
for each j = 1, . . . , . Whenever b = a, the node is separated into two nodes, one of
which consists of
b a, [bj
] , b −1
a, [bj−1
] , . . . , b −j+1
a, [b] ,
and the other consists only of b −j
a, [ε] ,
For an example of the ﬁrst case of the above lemma, consider the update of MASDAWG(w)
to MASDAWG(wb) for w = bbbbbab, which can be found in Fig. 8.5 and Fig. 8.6.
It should be emphasized that in the node separation mentioned in the above lemma
no node separation occurs inside a DAWG. This kind of node separation can also be
performed during the suﬃx link traversal started at the sink node, although the detail is
omitted in this chapter.
103
0
1
2
3
4
5
6
b b b b b a
aaaaa
b
b b b b a
aaaa
b
b b b a
aaa
b
b b a
aa
b
b a
a
b
a b
b
b
7
0
1
2
3
4
5
6
b b b b b a
aaaaa
b
b b b b a
aaaa
b
b b b a
aaa
b
b b a
aa
b
b a
a
b
a b
b
b
7
b
b
b
b
b
b
b
b
8
b
b
Figure 8.5: The naive ASDAWG(bbbbbab) on the left, and the naive ASDAWG(w) on the
right. The nodes connected by the broken lines are equivalent due to Case 3. Recall the
value of “d” mentioned in Lemma 8.7. In string bbbbbab the value of d is 1, whereas in
string bbbbbabb d = 2 since the new b is added afterward.
8.4 Applications
In this section we show some applications to which the data structure ASDAWG and its
variants eﬀectively contribute.
8.4.1 Finding Beginning-Sensitive Patterns
Deﬁnition 8.2 (Beginning-Sensitive Pattern) A beginning-sensitive pattern (a BSpattern
for short) is a pair p, i where p is a string in Σ∗
and i is a positive integer.
Deﬁnition 8.3 (BS-Pattern Matching Problem)
Instance: a text w and a BS-pattern p, i .
104
3
0
b b b b b a
aaaaa
b
b
a
b
b
b a
a
a
b
a b a
a b
b
1
2
4
5
6
7
a b 3
0
b b b b b a
aaaaa
b
b
a b
b
a
a
a
b
a a
a b
1
2
4
5
6
8
a
b
b
a
b
b
a
b
a
b
b
b
b
7
a
Figure 8.6: MASDAWG(bbbbbab) is on the left, and MASDAWG(bbbbbabb) is on the right.
Compare the update of MASDAWG(bbbbbab) to MASDAWG(bbbbbabb) with that of the
naive ASDAWG(bbbbbab) to the naive ASDAWG(bbbbbabb) shown in Fig. 8.5.
Determine: whether p is a substring of w[i :].
This is a natural extension of the substring pattern matching problem with i = 1.
The BS-pattern matching problem is solvable in O(|p|) for an arbitrary pair p, i , by
using ASDAWG(w). For a given text w, we construct MASDAWG(w) with the on-line
algorithm proposed in Section 8.3. For a BS-pattern p, i , if i > |w|, the BS-pattern
never matches w. Otherwise, we start with the i-th initial state of MASDAWG(w) and
examine whether or not the string p is recognized.
8.4.2 Pattern Matching within a Speciﬁc Region
Deﬁnition 8.4 (Region-Sensitive Pattern) A region-sensitive pattern (an RS-pattern
for short) is a triple p, (i, j) where p is a string in Σ∗
and i, j are positive integers.
Deﬁnition 8.5 (RS-Pattern Matching Problem)
Instance: a text w and an RS-pattern p, (i, j) .
105
Determine: whether p occurs within the region w[i : j] in the text w.
This is a natural extension of the BS-pattern matching problem in which j = |w|. For
a given text w, we construct MASDAWG(w). It is a trivial fact that, while constructing
MASDAWG(w), the on-line algorithm is able to mark each state with the integer for the
position of the right most occurrence of the string corresponding to the state. For an
RS-pattern p, (i, j) , if i > |w|, the RS-pattern never matches w. Otherwise, we start
with the i-th initial state of MASDAWG(w) and examine whether or not the string p is
recognized. If it is recognized, we compare j with the integer k stored in the state at
which p ﬁnally arrived. Then: If j ≤ k, YES; Otherwise, NO. Obviously, the problem can
be solved in O(|p|) time.
8.4.3 Finding Variable-Length-Don’t-Care’s Patterns
We remark that MASDAWG(w) can also be a powerful structure to ﬁnd a variable-lengthdon’t-care’s
patterns. The detail will be treated in Section 9.5 with strong possibility that
MASDAWG(w) can contribute to Knowledge Discovery and Data Mining.
106
Chapter 9
Finding Patterns to Best Separate
Two Sets of Strings
9.1 Introduction
In these days, a lot of text data or sequential data are available, and it is quite important
to discover useful rules from these data. Finding a good rule to separate two given sets,
often referred as positive examples and negative examples, is a critical task in Knowledge
Discovery and Data Mining.
In [22], Hirao et al. considered subsequence patterns as rules. A subsequence pattern
s matches a string t if s can be obtained by deleting zero or more characters from t. They
introduced a practical algorithm to ﬁnd the best subsequence pattern that separates
positive examples from negative examples, and showed some experimental results. A
drawback of subsequence patterns is that they are not suitable for classifying long strings
over small alphabet, since a short subsequence pattern matches almost all long strings.
In this chapter, we consider episode patterns, which were originally introduced by Mannila
et al. [42]. An episode pattern v, k , where v is a string and k is an integer, matches
a string t if v is a subsequence for some substring u of t with |u| ≤ k. Episode patterns
are generalization of subsequence patterns since a subsequence pattern v is equivalent to
the episode pattern v, ∞ . We give a practical solution to ﬁnd the best episode pattern
which separates one set of strings from the other set of strings. We propose a practical
implementation of exact search algorithm that practically avoids exhaustive search. The
key idea is to introduce some heuristics to reduce the search space based on the com-
107
binatorial properties of episode patterns, and to utilize an eﬃcient data structure that
helps to determine whether an episode pattern matches with a ﬁxed string, at the cost
of preprocessing time and space requirement to construct it. The result was published
in [23], and also reported in [49].
At the end of the chapter, we also consider the variable-length-don’t-care’s patterns.
Let Π = (Σ ∪ { })∗
, where is a wildcard that matches any string. A pattern q ∈ Π
such as q = a ba c is called a variable-length-don’t-care’s pattern (VLDC-pattern), where
a, b ∈ Σ. The language L(q) of a pattern q ∈ Π is the set of strings obtained by replacing
’s in q with strings. For example, L(a ba c) = {aubavc | u, v ∈ Σ∗
}. This language
corresponds to a class of the pattern languages proposed by Angluin [1]. We have strongly
believed that the VLDC-patterns can be quite good rules to separate given two sets of
strings. We declare that the smallest automaton to recognize all possible VLDC-patterns
matching a text w is a variant of MASDAWG(w), introduced in Chapter 8.
9.2 Preliminaries
9.2.1 Notation
Let N be the set of integers. Throughout this chapter, we use the word substring in the
same meaning as the word factor. We say that a string v is a subsequence of a string w if
v can be obtained by removing zero or more characters from w. We write as v str w if v
is a substring of w, and as v seq w if v is a subsequence of w.
For a string v, we deﬁne the substring language Lstr
(v) and subsequence language Lseq
(v)
as follows:
Lstr
(v) = {w ∈ Σ∗
| v str w},
Lseq
(v) = {w ∈ Σ∗
| v seq w}.
An episode pattern is a pair of a string v and an integer k, and we deﬁne the episode
language Leps
( v, k ) by
Leps
( v, k ) = {w ∈ Σ∗
| ∃
u str w such that v seq u and |u| ≤ k}.
The following lemma is obvious from the deﬁnitions.
Lemma 9.1 (Hirao et al. [22]) For any strings v, w ∈ Σ∗
,
108
1. if v is a preﬁx of w, then v str w,
2. if v is a suﬃx of w, then v str w,
3. if v str w then v seq w,
4. v str w if and only if Lstr
(v) ⊇ Lstr
(w),
5. v seq w if and only if Lseq
(v) ⊇ Lseq
(w).
9.2.2 Formulation of the Problem
Let good be a function from Σ∗
× 2Σ∗
× 2Σ∗
to the set of real numbers. We formulate the
problem to be solved as follows.
Deﬁnition 9.1 (Finding the best pattern according to good)
Input: Two sets S, T ⊆ Σ∗
of strings.
Output: A string w ∈ Σ∗
that maximizes the value good(w, S, T).
Intuitively, the value good(w, S, T) expresses the goodness w to separate S, T. The deﬁnition
of good varies for each application. For examples, the χ2
values, entropy information
gain, and gini index can be used. Essentially, these statistical measures are deﬁned by the
numbers of strings that satisfy the rule speciﬁed by w. Any of the above examples of the
measures can be described in the following form:
good(w, S, T) = f(xw, yw, |S|, |T|), where
xw = |S ∩ Leps
(w)|,
yw = |T ∩ Leps
(w)|.
When the sets S and T are ﬁxed, the values |S| and |T| become constants. Thus, we
abbreviate the function to f(x, y) in the sequel.
Since the function good(w, S, T) expresses the goodness of a episode pattern w to
distinguish two sets, it is natural to assume that the function f satisﬁes the conicality,
deﬁned as follows.
Deﬁnition 9.2 We say that a function f(x, y) is conic if
• for any 0 ≤ y ≤ ymax, there exists an x1 such that
109
– f(x, y) ≥ f(x , y) for any 0 ≤ x < x ≤ x1, and
– f(x, y) ≤ f(x , y) for any x1 ≤ x < x ≤ xmax.
• for any 0 ≤ x ≤ xmax, there exists a y1 such that
– f(x, y) ≥ f(x, y ) for any 0 ≤ y < y ≤ y1, and
– f(x, y) ≤ f(x, y ) for any y1 ≤ y < y ≤ ymax.
We assume that f is conic and can be evaluated in constant time in the sequel. The
optimization problem to be tackled follow.
Deﬁnition 9.3 (Finding the best substring pattern according to f)
Input: Two sets S, T ⊆ Σ∗
of strings.
Output: A string v that maximizes the value f(xv, yv), where xv = |S ∩ Lstr
(v)| and
yv = |T ∩ Lstr
(v)|.
Deﬁnition 9.4 (Finding the best subsequence pattern according to f)
Input: Two sets S, T ⊆ Σ∗
of strings.
Output: A string v that maximizes the value f(xv, yv), where xv = |S ∩ Lseq
(v)| and
yv = |T ∩ Lseq
(v)|.
Deﬁnition 9.5 (Finding the best episode pattern according to f)
Input: Two sets S, T ⊆ Σ∗
of strings.
Output: An episode pattern wpat that maximizes the value f(x v,k , y v,k ), where x v,k =
|S ∩ Leps
( v, k )| and y v,k = |T ∩ Leps
( v, k )|.
We remark that the ﬁrst problem can be solved in linear time [22], while the latter
two are NP-hard [44].
We review the basic idea of our algorithms. Fig. 9.1 shows a naive algorithm which
exhaustively examines and evaluate all possible patterns one by one, and returns the best
pattern that gives the maximum value. The most time-consuming part is obviously the
lines 3 and 4. In order to reduce the search time, we should (1) reduce the possible
patterns in line 2 dynamically by using some appropriate pruning method, and (2) speed
up the computation of |S ∩ L(π)| and |T ∩ L(π)| for each π. In Section 9.3, we deal with
(1), and in Section 9.4, we treat (2).
110
pattern FindMaxPattern(StringSet S, T)
1 maxVal = −∞;
2 for all possible pattern π do
3 x = |S ∩ L(π)|;
4 y = |T ∩ L(π)|;
5 val = f(x, y);
6 if val > maxVal then
7 maxVal = val;
8 maxSeq = π;
9 return maxSeq;
Figure 9.1: Exhaustive search algorithm.
9.3 Pruning Heuristics
In this section, we introduce some pruning heuristics, inspired by Morishita and Sese [45].
For a function f(x, y), we denote F(x, y) = max{f(x, y), f(x, 0), f(0, y), f(0, 0)}. From
the deﬁnition of conic function, we can prove the following lemma.
Lemma 9.2 For any patterns v and w with L(v) ⊇ L(w), we have
f(xw, yw) ≤ F(xv, yv).
9.3.1 For Subsequence Patterns
We consider ﬁnding subsequence pattern in this subsection. By Lemma 9.1 (5) and
Lemma 9.2, we have the following lemma.
Lemma 9.3 (Hirao et al. [22]) For any strings v, w ∈ Σ∗
with v seq w, we have
f(xw, yw) ≤ F(xv, yv).
In Fig. 9.2, we show our algorithm to ﬁnd the best subsequence pattern from given
two sets of strings, according to the function f. Optionally, we can specify the maximum
length of subsequences. We use the following data structures in the algorithm.
StringSet Maintain a set S of strings.
• int numOfSubseq(string seq) : return the cardinality of the set {w ∈ S | seq seq
w}.
111
string FindMaxSubsequence(StringSet S, T, int maxLength = ∞)
1 string preﬁx, seq, maxSeq;
2 double upperBound = ∞, maxVal = −∞, val;
3 int x, y;
4 PriorityQueue queue; /* Best First Search*/
5 queue.push(””, ∞);
6 while not queue.empty() do
7 (preﬁx, upperBound) = queue.pop();
8 if upperBound < maxVal then break;
9 foreach c ∈ Σ do
10 seq= preﬁx+ c; /* string concatenation */
11 x = S.numOfSubseq(seq);
12 y = T.numOfSubseq(seq);
13 val = f(x, y);
14 if val > maxVal then
15 maxVal = val;
16 maxSeq = seq;
17* upperBound = F(x, y);
18 if |seq| < maxLength then
19 queue.push(seq, upperBound);
20 return maxSeq;
Figure 9.2: Algorithm FindMaxSubsequence.
PriorityQueue Maintain strings with their priorities.
• bool empty() : return true if the queue is empty.
• void push(string w, double priority) : push a string w into the queue with priority
priority.
• (string, double) pop() : pop and return a pair (string, priority), where priority is
the highest in the queue.
The next theorem guarantees the completeness of the algorithm.
Theorem 9.1 (Hirao et al. [22]) Let S and T be sets of strings, and be a positive
integer. The algorithm FindMaxSubsequence(S, T, ) will return a string w that maximizes
the value f(xv, yv) among the strings of length at most , where xv = |S ∩ Lseq
(s)|
and ys = |T ∩ Lseq
(s)|.
112
9.3.2 For Episode Patterns
We now show a practical algorithm to ﬁnd the best episode patterns. We should remark
that the search space of episode patterns is Σ∗
×N , while the search space of subsequence
patterns was Σ∗
. A straight-forward approach based on the last subsection might be as
follows. First we observe that the algorithm FindMaxSubsequence in Fig. 9.2 can be easily
modiﬁed to ﬁnd the best episode pattern v, k for any ﬁxed threshold k: we have only to
replace the lines 11 and 12 so that they compute the numbers of strings in S and T that
match with the episode pattern seq, k , respectively. Thus, for each possible threshold
value k, repeat his algorithm, and get the maximum. A short consideration reveals that
we have only to consider the threshold values up to l, that is the length of the longest
string in given S and T.
However, here we give a more eﬃcient solution. Let us consider the following problem,
that is a subproblem of ﬁnding the best episode pattern in Deﬁnition 9.5.
Deﬁnition 9.6 (Finding the best threshold value)
Input: Two sets S, T ⊆ Σ∗
of strings, and a string v ∈ Σ∗
.
Output: Integer k that maximizes the value f(x v,k , y v,k ), where x v,k = |S∩Leps
( v, k )|
and y v,k = |T ∩ Leps
( v, k )|.
The next lemma give a basic containment of episode pattern languages.
Lemma 9.4 (Hirao et al. [23]) For any two episode patterns v, l and w, k , if v seq
w and l ≥ k then Leps
( v, l ) ⊇ Leps
( w, k ).
By Lemma 9.2 and 9.4, we have the next lemma.
Lemma 9.5 (Hirao et al. [23]) For any two episode patterns v, l and w, k , if v seq
w and l ≥ k then f(x w,k , y w,k ) ≤ F(x v,l , y v,l ).
For strings v, s ∈ Σ∗
, we deﬁne the threshold value θ of v for s by θ = min{k ∈ N |
s ∈ Leps
( v, k )}. If no such value, let θ = ∞. Note that s ∈ Leps
( v, k ) for any k < θ,
and s ∈ Leps
( v, k ) for any k ≥ θ. For a set S of strings and a string v, let us denote by
ΘS,v the set of threshold values of v for some s ∈ S.
A key observation is that a best threshold value for given S, T ⊆ Σ∗
and a string
v ∈ Σ∗
can be found in ΘS,v ∪ ΘT,v without loss of generality. Thus we can restrict the
search space of the best threshold values to ΘS,v ∪ ΘT,v.
113
From now on, we consider the numerical sequence {x v,k }∞
k=0. (We will treat {y v,k }∞
k=0
in the same way.) It clearly follows from Lemma 9.4 that the sequence is non-decreasing.
Remark that 0 ≤ x v,k ≤ |S| for any k. Moreover, x v,l = x v,l+1 = x v,l+2 = · · · , where
l is the length of the longest string in S. Hence, we can represent {x v,k }∞
k=0 with a list
having at most min{|S|, l} elements. We call this list a compact representation of the
sequence {x v,k }∞
k=0 (CRS, for short).
We show how to compute CRS for each v and a ﬁxed S. Observe that x v,k increases
only at the threshold values in ΘS,v. By computing a sorted list of all threshold values in
ΘS,v, we can construct the CRS of {x v,k }∞
k=0. If using the counting sort, we can compute
the CRS for any v ∈ Σ∗
in O(|S|ml + |S|) = O(||S||m) time, where m = |v|.
We emphasize that the time complexity of computing the CRS of {x v,k }∞
k=0 is the
same as that of computing x v,k for a single k (0 ≤ k ≤ ∞), by our method.
After constructing CRSs ¯x of {x v,k }∞
k=0 and ¯y of {y v,k }∞
k=0, we can compute the best
threshold value in O(|¯x| + |¯y|) time. Thus we have the following, which gives an eﬃcient
solution to the ﬁnding the best threshold value problem.
Lemma 9.6 Given S, T ⊆ Σ∗
and v ∈ Σ∗
, we can ﬁnd the best threshold value in
O( (||S|| + ||T||)·|v| ) time.
By substituting this procedure into the algorithm FindMaxSubsequence, we get an
algorithm to ﬁnd a best episode pattern from given two sets of strings, according to the
function f, shown in Fig. 9.3. We add a method crs(v) to the data structure StringSet
that returns CRS of {x v,k }∞
k=0, as mentioned above.
By Lemma 9.5, we can use the value upperBound = F(xv,∞, yv,∞) to prune branches
in the search tree computed at line 19 marked by (*). We emphasize that the value
F(x v,k , y v,k ) is insuﬃcient as upperBound. Note also that x v,∞ and y v,∞ can be
extracted from ¯x and ¯y in constant time, respectively. The next theorem guarantees the
completeness of the algorithm.
Theorem 9.2 Let S and T be sets of strings, and be a positive integer. The algorithm
FindBestEpisode(S, T, ) will return an episode pattern that maximizes f(x v,k , y v,k ),
with x v,k = |S ∩ Leps
( v, k )| and y v,k = |T ∩ Leps
( v, k )|, where v varies any string of
length at most and k varies any integer.
114
string FindBestEpisode(StringSet S, T, int )
1 string preﬁx, v;
2 episodePattern maxSeq; /* pair of string and int */
3 double upperBound = ∞, maxVal = −∞, val;
4 int k ;
5 CompactRepr ¯x, ¯y; /* CRS */
6 PriorityQueue queue; /* Best First Search*/
7 queue.push(””, ∞);
8 while not queue.empty() do
9 (preﬁx, upperBound) = queue.pop();
10 if upperBound < maxVal then break;
11 foreach c ∈ Σ do
12 v = preﬁx+ c; /* string concatenation */
13 ¯x = S.crs(v);
14 ¯y = T.crs(v);
15 k = argmaxk{f(x v,k , y v,k )} and val = f(x v,k , y v,k );
16 if val > maxVal then
17 maxVal = val;
18 maxEpisode = v, k ;
19(*) upperBound = F(x v,∞ , y v,∞ );
20 if upperBound > maxVal and |v| < then
21 queue.push(v, upperBound);
22 return maxEpisode;
Figure 9.3: Algorithm FindBestEpisode.
9.4 Using Eﬃcient Data Structures
In this chapter, we introduce a data structure eﬃcient for the speedup of answering the
queries.
9.4.1 Episode Directed Acyclic Subsequence Graphs
We now analyze the complexity of episode pattern matching. Given an episode pattern
v, k and a string t, determine whether t ∈ Leps
( v, k ) or not. This problem can be
answered by ﬁlling up the edit distance table between v and t, where only insertion
operation with cost one is allowed. It takes Θ(mn) time and space using a standard
dynamic programming method, where m = |v| and n = |t|. For a ﬁxed string, automatabased
approach is useful. Thereby we use the Episode Directed Acyclic Subsequence
Graph (EDASG) for string t, which was recently introduced by Tro´ıˇcek in [53]. Hereafter,
115
9876543210 a a b a a b a b b
b b
b
b ba a
a
b
a
ba
a
b
Figure 9.4: EDASG(t), where t = aabaababb. Solid arrows denote the forward edges, and
broken arrows denote the backward edges. The number in each circle denotes the state
number.
let EDASG(t) denote the EDASG for t. With the use of EDASG(t), episode pattern
matching can be answered quickly in practice, although the worst case behavior is still
O(mn). EDASG(t) is also useful to compute the threshold value θ of given v for t quickly
in practice. As an example, EDASG(aabaababb) is shown in Fig. 9.4. When examining
if an episode pattern abb, 4 matches with t or not, we start from the initial state 0 and
arrive at state 6, by traversing the forward edges spelling abb. It means that the shortest
preﬁx of t that contains abb as a subsequences is t[0 : 6] = aabaab, where t[i : j] denotes
the substring ti+1 . . . tj of t. Moreover, the diﬀerence between the state numbers 6 and 0
corresponds to the length of matched substring aabaab of t, that is, 6−0 = |aabaab|. Since
it exceeds the threshold 4, we move backwards spelling bba and reach state 1. It means
that the shortest suﬃx of t[0 : 6] that contains abb as a subsequence is t[1 : 6] = abaab.
Since 6 − 1 > 4, we have to examine other possibilities. It is not hard to see that we
have only to consider the string t[2 : ∗]. Thus we continue the same traversal started from
state 2, that is the next state of state 1. By forward traversal spelling abb, we reach state
8, and then backward traversal spelling bba bring us to state 4. In this time, we found
the matched substring t[4 : 8] = abab which contains the subsequence abb, and the length
8 − 4 = 4 satisﬁes the threshold. Therefore we report the occurrence and terminate the
procedure.
It is not diﬃcult to see that the EDASGs are useful to compute the threshold value
of v for a ﬁxed t. We have only to repeat the above forward and backward traversal up
to the end, and return the minimum length of the matched substrings. Although the
time complexity is still Θ(mn), practical behavior is usually better than using standard
dynamic programming method.
116
9.5 Finding Variable-Length-Don’t-Care’s Patterns
In this chapter, we consider other kinds of patterns, called variable-length-don’t-care’s
patterns. We here remark that the MASDAWG, introduced in Chapter 8, is a powerful
structure to ﬁnd patterns of these kinds.
The variable-length-don’t-care’s patterns are deﬁned as follows.
Deﬁnition 9.7 (Variable-Length-Don’t-Care’s Patterns) Let Π = (Σ∪{ })∗
, where
is a wildcard that matches any string. An element q ∈ Π is called a variable-lengthdon’t-care’s
pattern (a VLDC-pattern for short).
For instance, a ab ba is a VLDC-pattern for a, b ∈ Σ. We say that a VLDC-pattern
q matches a text string w ∈ Σ∗
if w can be obtained by replacing ’s in q with some
strings. In the running example, the VLDC-pattern a ab ba matches text abababbbaa
by replacing the ’s with ab, b, b and a, respectively.
We write as q vldc w if a VLDC-pattern q matches w. For a VLDC-pattern q, we
deﬁne the VLDC-language Lvldc
(q) as
Lvldc
(q) = {w ∈ Σ∗
| q vldc w}.
Now we consider the following problem.
Deﬁnition 9.8 (Finding the best VLDC-pattern according to f)
Input: Two sets S, T ⊆ Σ∗
of strings.
Output: A string v that maximizes the value f(xv, yv), where xv = |S ∩ Lvldc
(v)| and
yv = |T ∩ Lvldc
(v)|.
It is known that this problem is NP-hard [44].
Given a set S, T of strings, we have to examine whether or not every possible VLDCpattern
matches each string in S and T. Since the problem is NP-hard, we are forced to
exponentially many candidate. Thus, when considering some eﬃcient heuristics, we need
some fast method for the examination. We address that the MASDAWGs are remarkably
helpful for this. In fact, the smallest index structure capable of solving the following
pattern matching problem in O(|q|) time is a variant of MASDAWG(w).
Deﬁnition 9.9 (VLDC-Pattern Matching Problem)
Instance: a text w and a VLDC-pattern q.
Determine: whether q matches w.
117
The automaton recognizing all possible VLDC-patterns matching w is called the wildcard
ASDAWG for w, and denoted by WASDAWG(w). The automata were originally
introduced in [25]. WASDAWG(abbab) is displayed in Fig. 9.5. In WASDAWG(w), a
-transition is added between each state and the initial state of the “same layer” in
MASDAWG(w) (see also MASDAWG(abbab) in Fig. 8.3). Note that there exist two additional
states, one of which is a unique initial state of WASDAWG(abbab). They are added
in order that VLDC-patterns beginning with a can be recognized. For any q ∈ Π, the
VLDC-pattern matching problem can be solved in O(|q|) time, by using WASDAWG(w).
a b
b
b
b
b
b
a
a
a
a
a
a
b
b
b
a
b
Figure 9.5: WASDAWG(w) where w = abbab.
9.6 Concluding Remarks
This chapter was mainly devoted to the problem for ﬁnding the best episode patterns that
eﬀectively separate given two sets of strings.
Hamuro et al. [21] implemented the algorithm for ﬁnding the best subsequences, and
reported a quite successful experiment on business data. In the meantime, Bannai et
al. [6] and Iida et al. [27, 26] have installed our algorithms into the core of the decision
tree generator in the BONSAI system [48]. They reported that they achieved signiﬁcant
results by using episode patterns as rules, on biological sequences.
118
On the other hand, we can easily extend our algorithm to enumerate all strings whose
values of the objective function exceed the given threshold, since we essentially examine
all strings, with eﬀective pruning heuristics. Enumeration may be more preferable in the
context of text data mining [10, 17, 57].
We also discussed the use of VLDC-patterns as rules for the separation. The invention
of WASDAWGs is believed to be a clue to a practical algorithm to ﬁnd the best VLDCpatterns
that distinguish two sets of strings. The development of the pruning heuristic on
using VLDC-patterns is our future work.
119
Acknowledgment
First and foremost, I wish to thank Prof. Ayumi Shinohara for his throughout supervision
over the past two years. His enthusiastic suggestion and encouragement have truly been
part of my motivation to tackle lots of diﬃcult problems. He helped me every time I needed
advice. Above all, I appreciate that he has oﬀered me perfect environment and situation
in which I have been able to focus only on my study. I add that the common experiences
in the travels to Jerusalem (Israel), Laguna de San Rafael (Chile) and Washington DC
(USA) are impressively nice, and unforgettable.
I would also like to thank Prof. Masayuki Takeda, my second supervisor. It was he
who gave me the ﬁrst opportunity to work for index structures, to which all topics in this
thesis correspond. His attitude forward researching has been my model. The only thing
remaining undone is that we have never been abroad together, though one of our ‘prime’
purposes is, as we are repeatedly saying, to go abroad and have fun there! I just hope we
can do it this year.
I would also like to express my appreciation to Prof. Setsuo Arikawa, Prof. Hiroki
Arimura and Dr. Hiroshi Sakamoto. They gave me fruitful advice on what to do and how
to be as a researcher.
I also wish to thank my senior colleagues, Dr. Takuya Kida, Mr. Masahiro Hirao, Mr.
Hiromasa Hoshino, and Mr. Tetsuya Matsumoto. They kindly helped me when I did not
know much about the department and research, and they used to be my good model of
presentation as well.
I am appreciative of my colleagues, Mr. Keisuke Iida, Mr. Katsuaki Taniguchi, Mr.
Tatsuya Furuzono, Mr. Shuichi Mitarai, and Mr. Koichiro Yamamoto. They are my good
friends, as well as give me some good stimulus on research aspect. Also, I am grateful to
my friend Mr. Takaaki Miyachi, without whom I would never have become able to make
use of English.
Last but not least, I thank my parents for their support.
120
Bibliography
[1] D. Angluin. Finding patterns common to a set of strings. J. Comput. Sys. Sci.,
21:46–62, 1980.
[2] A. Apostolico. The myriad virtues of subword trees. In A. Apostolico and Z. Galil,
editors, Combinatorial Algorithm on Words, volume 12 of NATO Advanced Science
Institutes, Series F, pages 85–96. Springer-Verlag, 1985.
[3] A. Apostolico and Z. Galil. Pattern Matching Algorithms. Oxford University Press,
New York, 1997.
[4] R. A. Baeza-Yates. Searching subsequences (note). Theoretical Computer Science,
78(2):363–376, Jan. 1991.
[5] M. Bal´ık. Implementation of dawg. In Proc. The Prague Stringology Club Workshop
’98 (PSCW’98). Czech Technical University, 1998.
[6] H. Bannai, K. Iida, A. Shinohara, M. Takeda, and S. Miyano. More speed and more
pattern variations for knowledge discovery system BONSAI. In Proc. the International
Conference on Genome Informatics ’01 (GIW’01), pages 454–455, 2001.
[7] A. Blumer, J. Blumer, D. Haussler, A. Ehrenfeucht, M. T. Chen, and J. Seiferas.
The smallest automaton recognizing the subwords of a text. Theoretical Computer
Science, 40:31–55, 1985.
[8] A. Blumer, J. Blumer, D. Haussler, R. McConnell, and A. Ehrenfeucht. Complete
inverted ﬁles for eﬃcient text retrieval and analysis. J. ACM, 34(3):578–595, 1987.
[9] D. Breslauer. The suﬃx tree of a tree and minimizing sequential transducers. Theoretical
Computer Science, 191:131–144, 1998.
121
[10] A. Califano. SPLASH: Structural pattern localization analysis by sequential histograms.
Bioinformatics, Feb. 1999.
[11] M. T. Chen and J. Seiferas. Eﬃcient and elegant subword tree construction. In
A. Apostolico and Z. Galil, editors, Combinatorial Algorithm on Words, volume 12
of NATO Advanced Science Institutes, Series F, pages 97–107. Springer-Verlag, 1985.
[12] M. Crochemore. Transducers and repetitions. Theoretical Computer Science, 45:63–
86, 1986.
[13] M. Crochemore, C. S. Iliopoulos, T. Lecroq, and Y. J. Pinzon. Approximate string
matching in musical sequences. In Proc. The Prague Stringology Conference ’01
(PSC’01), pages 26–36. Czech Technical University, Sept. 2001.
[14] M. Crochemore and W. Rytter. Text Algorithms. Oxford University Press, New York,
1994.
[15] M. Crochemore and Z. Tron´ıˇcek. Directed acyclic subsequence graph for multiple
texts. Technical Report IGM-99-13, Institut Gaspard-Monge, June 1999.
[16] M. Crochemore and R. V´erin. On compact directed acyclic word graphs. In J. Mycielski,
G. Rozenberg, and A. Salomaa, editors, Structures in Logic and Computer
Science, volume 1261 of Lecture Notes in Computer Science, pages 192–211. SpringerVerlag,
1997.
[17] R. Feldman, Y. Aumann, A. Amir, A. Zilberstein, and W. Klosgen. Maximal association
rules: A new tool for mining for keyword co-occurrences in document collections.
In Proc. of the 3rd International Conference on Knowledge Discovery and Data Mining,
pages 167–170. AAAI Press, Aug. 1997.
[18] R. Giegerich and S. Kurtz. From Ukkonen to McCreight and Weiner: A unifying
view of linear-time suﬃx tree construction. Algorithmica, 19(3):331–353, 1997.
[19] R. Grossi and J. S. Vitter. Compressed suﬃx arrays and suﬃx trees with applications
to text indexing and string matching. In Proc. of 32nd ACM Symposium on Theory
of Computing (STOC’00), pages 397–406, 2000.
[20] D. Gusﬁeld. Algorithms on Strings, Trees, and Sequences. Cambridge University
Press, New York, 1997.
122
[21] Y. Hamuro, H. Kawata, N. Katoh, and K. Yada. A machine learning algorithm
for analyzing string patterns helps to discover simple and interpretable business rules
from purchase history. In S. Arikawa and A. Shinohara, editors, Progress in Discovery
Science, Lecture Notes in Computer Science. Springer-Verlag, 2002.
[22] M. Hirao, H. Hoshino, A. Shinohara, M. Takeda, and S. Arikawa. A practical algorithm
to ﬁnd the best subsequence patterns. In S. Arikawa and S. Morishita, editors,
Proc. The Third International Conference on Discovery Science, volume 1967 of Lecture
Notes in Artiﬁcial Intelligence, pages 141–154. Springer-Verlag, 2000.
[23] M. Hirao, S. Inenaga, A. Shinohara, M. Takeda, and S. Arikawa. A practical algorithm
to ﬁnd the best episode patterns. In K. P. Jantke and A. Shinohara, editors,
Proc. The Fourth International Conference on Discovery Science, volume 2226 of
Lecture Notes in Artiﬁcial Intelligence, pages 435–440. Springer-Verlag, 2001.
[24] J. Holub and B. Melichar. Approximate string matching using factor automata.
Theoretical Computer Science, 249:305–311, 2000.
[25] H. Hoshino. Construction of data structure for knowledge discovery on text data.
Master’s thesis, Department of Informatics, Kyushu University, Feb. 2001. (in
Japanese).
[26] K. Iida. Knowledge discovery on molecular biological data. Master’s thesis, Department
of Informatics, Kyushu University, Feb. 2002. (in Japanese).
[27] K. Iida, H. Bannai, A. Shinohara, M. Takeda, and S. Miyano. Extension and speed up
of knowledge discovery system BONSAI. In Proc. the 7th annual Paciﬁc Symposium
on Biocomputing (PSB’02), page 100, 2002.
[28] S. Inenaga. Bidirectional on-line linear-time construction of suﬃx trees and directed
acyclic word graphs. Technical Report DOI-TR-CS-203, Department of Informatics,
Kyushu University, Feb. 2002.
[29] S. Inenaga, H. Hoshino, A. Shinohara, M. Takeda, and S. Arikawa. Construction of
the CDAWG for a trie. In Proc. The Prague Stringology Conference ’01 (PSC’01),
pages 37–48. Czech Technical University, Sept. 2001.
123
[30] S. Inenaga, H. Hoshino, A. Shinohara, M. Takeda, and S. Arikawa. On-line construction
of symmetric compact directed acyclic word graphs. In Proc. of 8th International
Symposium on String Processing and Information Retrieval (SPIRE’01), pages 96–
110. IEEE Computer Society, 2001.
[31] S. Inenaga, H. Hoshino, A. Shinohara, M. Takeda, and S. Arikawa. Uniﬁcation of
algorithms to construct index structures for texts. Technical Report DOI-TR-CS-196,
Department of Informatics, Kyushu University, Aug. 2001.
[32] S. Inenaga, H. Hoshino, A. Shinohara, M. Takeda, S. Arikawa, G. Mauri, and
G. Pavesi. On-line construction of compact directed acyclic word graphs. In A. Amir
and G. M. Landau, editors, Proc. 12th Annual Symposium on Combinatorial Pattern
Matching (CPM’01), volume 2089 of Lecture Notes in Computer Science, pages
169–180. Springer-Verlag, 2001.
[33] S. Inenaga, M. Takeda, A. Shinohara, H. Hoshino, and S. Arikawa. The minimum
dawg for all suﬃxes of a string and its applications. Technical Report DOI-TR-CS-
204, Department of Informatics, Kyushu University, Feb. 2002.
[34] T. Kadota, M. Hirao, A. Ishino, M. Takeda, A. Shinohara, and F. Matsuo. Musical
sequence comparison for melodic and rhythmic similarities. In Proc. of 8th International
Symposium on String Processing and Information Retrieval (SPIRE’01), pages
111–122. IEEE Computer Society, 2001.
[35] J. K¨arkk¨ainen. Suﬃx cactus: A cross between suﬃx tree and suﬃx array. In Z. Galil
and E. Ukkonen, editors, Proc. 6th Annual Symposium on Combinatorial Pattern
Matching (CPM’95), volume 973 of Lecture Notes in Computer Science, pages 191–
204. Springer-Verlag, 1995.
[36] D. E. Knuth, J. H. Morris, and V. R. Pratt. Fast pattern matching in strings. SIAM
J. Comput., 6(2):323–350, 1977.
[37] S. R. Kosaraju. Fast pattern matching in trees. In Proc. 30th IEEE Symp. on
Foundations of Computer Science, pages 178–183, 1989.
[38] S. Kurtz. Reducing the space requirement of suﬃx trees. Software - Practice and
Experience, 29(13):1149–1171, 1999.
124
[39] M. G. Maaß. Linear bidirectional on-line construction of aﬃx trees. In R. Giancarlo
and D. Sankoﬀ, editors, Proc. 11th Annual Symposium on Combinatorial Pattern
Matching (CPM’00), volume 1848 of Lecture Notes in Computer Science, pages 320–
334. Springer-Verlag, 2000.
[40] V. M¨akinen. Compact suﬃx array. In R. Giancarlo and D. Sankoﬀ, editors, Proc. 11th
Annual Symposium on Combinatorial Pattern Matching (CPM’00), volume 1848 of
Lecture Notes in Computer Science, pages 305–319. Springer-Verlag, 2000.
[41] U. Manber and G. Myers. Suﬃx arrays: A new method for on-line string searches.
SIAM J. Compt., 22(5):935–948, 1993.
[42] H. Mannila, H. Toivonen, and A. I. Verkamo. Discovering frequent episode in sequences.
In U. M. Fayyad and R. Uthurusamy, editors, Proc. 1st International Conference
on Knowledge Discovery and Data Mining, pages 210–215. AAAI Press, Aug.
1995.
[43] E. M. McCreight. A space-economical suﬃx tree construction algorithm. J. ACM,
23(2):262–272, Apr. 1976.
[44] S. Miyano, A. Shinohara, and T. Shinohara. Polynomial-time learning of elementary
formal systems. New Generation Computing, 18:217–242, 2000.
[45] S. Morishita and J. Sese. Traversing itemset lattices with statistical metric pruning.
In Proc. of the 19th ACM SIGACT-SIGMOD-SIGART Symposium on Principles of
Database Systems, pages 226–236. ACM Press, May 2000.
[46] D. Revuz. Minimization of acyclic deterministic automata in linear time. Theoretical
Computer Science, 92(1):181–189, Jan. 1992.
[47] K. Sadakane. Compressed text databases with eﬃcient query algorithms based on
the compressed suﬃx array. In Proc. of 11th International Symposium on Algorithms
and Computation (ISAAC’00), volume 1969 of Lecture Notes in Computer Science,
pages 410–421. Springer-Verlag, 2000.
[48] S. Shimozono, A. Shinohara, T. Shinohara, S. Miyano, S. Kuhara, and S. Arikawa.
Knowledge acquisition from amino acid sequences by machine learning system BON-
125
SAI. Transactions of Information Processing Society of Japan, 35(10):2009–2018,
Oct. 1994.
[49] A. Shinohara, M. Takeda, S. Arikawa, M. Hirao, H. Hoshino, and S. Inenaga. Practical
algorithms to ﬁnd the best string patterns. In S. Arikawa and A. Shinohara,
editors, Progress in Discovery Science, volume 2281 of Lecture Notes in Artiﬁcial
Intelligence, pages 308–318. Springer-Verlag, 2002.
[50] J. Stoye. Aﬃxb¨aume. Master’s thesis, Universit¨at Bielefeld, may 1995. (in German).
[51] J. Stoye. Aﬃx trees. Technical Report 2000–4, Universit¨at Bielefeld, Technische
Fakult¨at, 2000.
[52] M. Takeda, T. Matsumoto, T. Fukuda, and I. Nanri. Discovering characteristic
expressions from literary works: A new text analysis method beyond n-gram and
KWIC. Theoretical Comput. Sci., 2002. (to appear).
[53] Z. Tron´ıˇcek. Episode matching. In A. Amir and G. M. Landau, editors, Proc. 12th
Annual Symposium on Combinatorial Pattern Matching (CPM’01), volume 2089 of
Lecture Notes in Computer Science, pages 143–146. Springer-Verlag, 2001.
[54] E. Ukkonen. Approximate string matching over suﬃx trees. volume 684 of Lecture
Notes in Computer Science, pages 228–242. Springer-Verlag, 1993.
[55] E. Ukkonen. On-line construction of suﬃx trees. Algorithmica, 14(3):249–260, 1995.
[56] E. Ukkonen and D. Wood. Approximate string matching with suﬃx automata. Algorithmica,
10(5):353–364, 1993.
[57] J. T. L. Wang, G.-W. Chirn, T. G. Marr, B. A. Shapiro, D. Shasha, and K. Zhang.
Combinatorial pattern discovery for scientiﬁc data: Some preliminary results. In
Proc. of the 1994 ACM SIGMOD International Conference on Management of Data,
pages 115–125. ACM Press, May 1994.
[58] P. Weiner. Linear pattern matching algorithms. In Proc. 14th Annual Symposium
on Switching and Automata Theory, pages 1–11, Oct. 1973.
126