IA167: String Algorithms                                     Spring 2014
Lecture 1: February 17
Lecturer: Alex Popa                           Scribes: Jiří Zárevúcky

1.1 Notation

Σ: The alphabet; a set of characters that any string consists of.
x[i]: The i-th character of the string x.
t = t[1] . . . t[n]: The text.
p = p[1] . . . p[m]: The pattern.
x[i . . . j] = x[i]x[i + 1] . . . x[j − 1]x[j]: The substring of x starting at i and ending at j. E.g. t[1 . . . i] is a prefix of t, and p[i . . . m] is a suffix of p.
Pr[x]: The probability that x is true, as a fraction (0 ≤ Pr[x] ≤ 1).

1.2 The problem

Problem 1 (Exact pattern matching). Given two strings, a text t and a pattern p, find the position of every substring of t identical to p.

We can formulate the problem more precisely: given t = t[1] . . . t[n] and p = p[1] . . . p[m], find every i for which t[i . . . i + m − 1] = p.

From this formulation we get a naive algorithm simply by iterating over every possible i. The index can range from 1 to n − m + 1, and checking the equality of two m-character strings takes O(m) operations, resulting in a total complexity of O(n · m).

Exercise: Implement this algorithm in your favorite language.

1.3 Rabin-Karp algorithm

The naive algorithm suffers from poor time complexity because it can process the same portion of the text many times. The Rabin-Karp algorithm avoids this problem by comparing short fingerprints in place of longer strings. We denote the fingerprint of p by φ(p). The fingerprint is defined in such a way that φ(t[i + 1 . . . i + m]) can be computed from φ(t[i . . . i + m − 1]) in O(1) time, regardless of m. Such a function is called a rolling hash function. The fingerprint function is selected at random from a family of similar functions, which makes Rabin-Karp a randomized algorithm.

If two strings have different fingerprints, we immediately know they are different, so we can avoid checking equality character by character. However, since many strings share the same fingerprint (this is unavoidable, as we want the fingerprint to be much shorter than the string), some positive answers may be false positives. We can naively check every positive result for correctness. Omitting this check gives a Monte Carlo algorithm with O(n + m) worst-case complexity but a small probability of an erroneous result. Including the check gives a Las Vegas algorithm with zero probability of error and complexity O(n + m + k · m), where k is the number of matches according to the fingerprint function; this can be as much as O(n · m) in the worst case.

1.3.1 The fingerprint function

The algorithm as originally presented selects φ at random as follows:

• Fix a large integer T.
• Select uniformly at random a prime p < T.
• Let

    φ(s) = (∑_{i=1}^{|s|} s[i] · 2^{|s|−i}) mod p

We assume that the characters of the alphabet have a numerical representation, so that s[i] is a natural number as well as a character.

Example: s = 01001, p = 7

    φ(s) = (0 · 2^4 + 1 · 2^3 + 0 · 2^2 + 0 · 2^1 + 1 · 2^0) mod 7 = 9 mod 7 = 2

In the Rabin-Karp algorithm, we need to compute the fingerprint of every m-character substring of the text, in order to match it against the pattern. Fortunately, we only need to evaluate the full formula for the first substring, because each subsequent substring differs from the previous one only by adding one character at one end and removing one character at the other.
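The rolling update can be written down directly. The following is a minimal Python sketch of the Monte Carlo variant, under a few simplifying assumptions: the prime q is fixed instead of being chosen at random below T (the name q avoids a clash with the pattern p), character values come from ord(), and reported positions are not verified.

    def rabin_karp(t, p, q=2_147_483_647):
        # Monte Carlo Rabin-Karp: q is a fixed large prime (a simplifying
        # assumption; the lecture chooses a random prime below a bound T).
        n, m = len(t), len(p)
        if m == 0 or m > n:
            return []
        base = 2                      # radix of the lecture's fingerprint
        high = pow(base, m - 1, q)    # weight of the character leaving the window
        fp_p = fp_t = 0
        for i in range(m):            # full formula, needed only once
            fp_p = (fp_p * base + ord(p[i])) % q
            fp_t = (fp_t * base + ord(t[i])) % q
        matches = []
        for i in range(n - m + 1):
            if fp_t == fp_p:
                matches.append(i + 1) # 1-based position; unverified (Monte Carlo)
            if i < n - m:             # roll: drop t[i], append t[i + m]
                fp_t = ((fp_t - ord(t[i]) * high) * base + ord(t[i + m])) % q
        return matches

Replacing the unverified report with an explicit comparison t[i : i + m] == p turns this into the Las Vegas variant.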
Example: Suppose for simplicity that the string is a decimal number, and the fingerprint is the value of this number modulo a fixed prime. (The value of a decimal string is a sum very similar to the above fingerprint function, with 10 in place of 2.)

    s = “415926”, m = 5
    s1 = “41592”, s2 = “15926”

    φ(s1) = 41592 mod 97 = · · · = 76
    φ(s2) = 15926 mod 97
          = ((41592 − 4 · 10^{m−1}) · 10 + 6) mod 97
          = ((φ(s1) − s1[1] · 10^{m−1}) · 10 + s2[m]) mod 97
          = 18

This simple trick makes it easy to compute each next fingerprint in constant time.

1.3.2 *Probability of error

It is somewhat more difficult to reason about uniformly random primes. In order to show that the chance of a collision is small, we demonstrate the idea on a slightly different fingerprint function.

Alternative fingerprint:
1. Fix a large prime p.
2. Choose uniformly at random a number r ∈ {1, . . . , p − 1}.

    φ_r(s) = (∑_{i=1}^{|s|} s[i] · r^{|s|−i}) mod p

Theorem 1.1 (Lagrange's). A polynomial of degree d modulo a prime number p has at most d roots.

Proof: See http://en.wikipedia.org/wiki/Lagrange%27s_theorem_(number_theory)

Theorem 1.2. Suppose we have fixed s, s′ with |s| = |s′| and s ≠ s′, and a random r. Then

    Pr[φ_r(s) = φ_r(s′)] ≤ |s| / (p − 1)

Proof: The probability that s and s′ have the same fingerprint is the probability that the r we chose is a root of the polynomial φ_r(s) − φ_r(s′), viewed as a polynomial in the variable r. This polynomial has degree at most |s| − 1, which is easy to see from the definition of φ_r. By Lagrange's theorem it has fewer than |s| roots, while there are p − 1 choices of r, so for any pair s, s′ the chance of a collision cannot be more than |s| / (p − 1).

Consequently, for a big enough p, the probability of a collision is very small. For the original fingerprint the analysis is more complicated, but the result is similar.

1.4 Knuth-Morris-Pratt algorithm

The KMP algorithm improves on the naive algorithm using a different idea than the Rabin-Karp algorithm. Let us illustrate the problem with an example.

Example: Consider the following alignment of p = “aaaabaaab” against a text.

    aaaaaaabaaaabaaabaaa
       aaaabaaab

Only the last character of the pattern is mismatched. However, the naive algorithm shifts by one, and all the previously matched characters are read and checked again. This is the reason for the algorithm's quadratic complexity.

Ideally, we should not have to re-read characters that were matched once, because we have already seen them and we know they form a prefix of the pattern. We can achieve that by shifting the pattern so that the already matched characters still match the pattern in its new position. This means we are looking for a proper suffix of the matched text that is a prefix of p. In fact, we want the longest such suffix, as it corresponds to the shortest shift. Since the matched characters are known to equal a prefix of the pattern, the shift for any given matched prefix of p is independent of the text.

Example:

    aaaaaaabaaaabaaabaaa
       aaaabaaab
            aaaabaaab

In case nothing matched before the mismatch, we just shift by one:

    amalimalamalimalo
    malimalo
     malimalo
         malimalo
            malimalo
             malimalo

Definition 1.3. A border of a word w is an index i < |w| such that w[1 . . . i] = w[|w| − i + 1 . . . |w|].

Informally: it is the length of a proper prefix that is at the same time a suffix.

The suffixes we use when shifting the pattern are the longest borders of prefixes of p, so we can precompute them in a table for constant-time lookup. Let us denote the longest border of p[1 . . . i] by Π[i].
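Example: For the pattern p = “aaaabaaab” used above, the border table is:

    i      1  2  3  4  5  6  7  8  9
    p[i]   a  a  a  a  b  a  a  a  b
    Π[i]   0  1  2  3  0  1  2  3  0

The shift in the earlier example is exactly Π[8] = 3: after matching eight characters, the pattern moves forward by 8 − 3 = 5 positions.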
The following lemma formalizes what we intuitively see from the above description.

Lemma 1.4. The longest suffix of p[1 . . . l]c that is also a prefix of p is:
1. p[1 . . . l + 1], if c = p[l + 1];
2. the longest suffix of p[1 . . . Π[l]]c that is a prefix of p, otherwise.

1.4.1 The search

Suppose that we already have the table Π ready. We keep an index i_t into the text and an index i_p into the pattern, both denoting the last matched character (or 0 if no character has been matched yet).

• If t[i_t + 1] = p[i_p + 1], we simply advance both by one.
• Otherwise, if i_p > 0, we assign to i_p the value Π[i_p], “shifting” the pattern.
• Finally, if neither of the two conditions holds, we just advance i_t by one.

This is repeated until either i_t exceeds n, in which case the search failed, or i_p reaches m, in which case a match is reported and i_p may be set to Π[m] to look for the next match. Note that advancing both indices together corresponds to the pattern staying in place, while decreasing only i_p or increasing only i_t corresponds to shifting the pattern forward.

Theorem 1.5. The search takes O(n) operations.

Proof: i_t never decreases and is bounded by n. The only case in which i_t is not increased is one in which i_p is decreased, and i_p only ever increases together with i_t. Therefore the number of times i_p is decreased can never be larger than the number of times i_t is increased, which is at most n.

1.4.2 Precomputing borders

That leaves us the problem of computing Π. Fortunately, that is quite easily achievable in O(m) time.

For the prefix p[1 . . . 1], trivially Π[1] = 0. Now suppose we know Π[1], . . . , Π[i]. The longest border Π[i + 1] is either 0, or some border of p[1 . . . i] extended by one character. Borders naturally nest: every border of a string, except the longest one, is also a border of the longest border. Thus Π already encodes all borders of p[1 . . . i], not just the longest one, and we can iterate over them from longest to shortest by repeatedly applying Π.

The proof that iterating over borders does not break the overall linear complexity is similar to the analysis of the search. The border length can only increase by one per step (Π[i + 1] ≤ Π[i] + 1), so it cannot decrease more than m times during the entire computation. In fact, if you look carefully, you can see that the precomputation is very similar to the search algorithm; the difference is only in the information stored at every step.

1.5 Boyer-Moore Algorithm

Exercise: Assume that the pattern does not contain the same character twice. Modify the naive algorithm to have O(n) time complexity.

Exercise: Modify the algorithm from the previous exercise to have O(n/m) complexity for an infinite class of inputs, while still being correct for all patterns that do not contain the same character twice.

These exercises show that for some patterns the search can take many fewer than n character comparisons. The Boyer-Moore algorithm takes advantage of this by matching the pattern against the text from right to left. The result is an algorithm that can run as fast as O(n/m + m) for some inputs, while still being O(n + m) in the worst case. It uses a combination of two (sometimes three) independent rules.

1.5.1 Bad character rule

The first rule: every time a mismatch occurs, the pattern is shifted so that the mismatched text character becomes aligned with the nearest occurrence of that character to its left in the pattern. If the character does not occur in the pattern at all, the pattern is shifted just past the mismatched character.
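A common way to implement this rule (a sketch with helper names of my own, not from the lecture) is a table of rightmost occurrences: store, for each character, the rightmost position where it occurs in p, and on a mismatch at pattern position i against text character c shift by i minus that position, falling back to a shift of one whenever the stored occurrence is not to the left of i.

    def last_occurrence(p):
        # rightmost 1-based position of each character in the pattern;
        # characters that do not occur in p are treated as position 0
        table = {}
        for j, c in enumerate(p, start=1):
            table[c] = j
        return table

    def bad_character_shift(table, c, i):
        # mismatch of text character c against pattern position i (1-based):
        # align c with its rightmost occurrence in p, or move just past it;
        # shift by at least one so the search always makes progress
        return max(1, i - table.get(c, 0))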
Example:

    aaaaabaaabaaaaaaaaa
    aabaaaaa
       aabaaaaa
           aabaaaaa

    aaaaaacabababaaaaaaa
     ababab
           ababab

1.5.2 Good suffix rule

The idea behind the good suffix rule is that when a mismatch occurs, the pattern is shifted so that the already matched portion of the text still matches the shifted pattern. I.e. when a suffix s = p[i . . . m] matches the text and p[i − 1] mismatches, we shift so that s becomes aligned with a substring p[j . . . k] = s with k < m, where k is the largest possible. If there is no such substring, we shift p to the longest suffix of the matched portion that is a prefix of p.

Example:

    aaaaaabaaacaa
    aabaaacaa
       aabaaacaa
        aabaaacaa

1.5.3 *Galil rule

When using just the first two rules, the algorithm has worst-case linear complexity only in the case when the pattern does not appear in the text. To achieve linear complexity in all cases, we need one more rule. See http://en.wikipedia.org/wiki/Boyer%E2%80%93Moore_string_search_algorithm#The_Galil_Rule.
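To tie the right-to-left scan and the bad character rule together, here is a minimal Python sketch of a simplified search that uses the bad character rule alone. This is an illustration, not the full algorithm from this lecture: without the good suffix and Galil rules the worst case degrades to O(n · m).

    def boyer_moore_bad_char(t, p):
        # Simplified Boyer-Moore: right-to-left comparison in each window,
        # bad character rule only (no good suffix rule, no Galil rule).
        n, m = len(t), len(p)
        if m == 0 or m > n:
            return []
        last = {c: j for j, c in enumerate(p, start=1)}  # rightmost occurrences
        matches = []
        pos = 1                            # 1-based alignment of p in t
        while pos <= n - m + 1:
            i = m
            while i >= 1 and p[i - 1] == t[pos + i - 2]:
                i -= 1                     # scan the window right to left
            if i == 0:
                matches.append(pos)        # full match
                pos += 1                   # minimal shift, keeps overlapping matches
            else:
                c = t[pos + i - 2]         # mismatched text character
                pos += max(1, i - last.get(c, 0))
        return matches

On the first example above, boyer_moore_bad_char("aaaaabaaabaaaaaaaaa", "aabaaaaa") performs exactly the two bad-character shifts shown and reports the match at position 8.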