The FM Index
Algorithms for Sequence Analysis
Sven Rahmann
Summer 2021
Reminder about the FM Index:
Backward Search with the BWT
Ferragina and Manzini, Opportunistic Data Structures with Applications, 2000
Algorithmic Bioinformatics 2
Backward Search with BWT, Occ and C
s = ctatatat$ and bwt=tttt$aaac:
C	  [$]	   C	  [a]	   C	  [c]	   C	  [t]	  
0	   1	   4	   5	  
t	   t	   t	   t	   $	   a	   a	   a	   c	  
0	   1	   2	   3	   4	   5	   6	   7	   8	  
Occ	  [$]	   0	   0	   0	   0	   1	   1	   1	   1	   1	  
Occ	  [a]	   0	   0	   0	   0	   0	   1	   2	   3	   3	  
Occ	  [c]	   0	   0	   0	   0	   0	   0	   0	   0	   1	  
Occ	  [t]	   1	   2	   3	   4	   4	   4	   4	   4	   4	  
t
t
t
t
$
a
a
a
c
S[pos[i]]bwtindex
0
1
2
3
4
5
6
7
8
$
a t $
a t a t $
a t a t a t $
c t a t a t a t $
t $
t a t $
t a t a t $
t a t a t a t $
Occ(c, r) returns the number of occurrences of c ∈ Σ in the preﬁx bwt[0 . . . r].
The k-th c in the BWT is the k-th c in the ﬁrst characters of the sorted suﬃxes.
LF(r) = C[c] + Occ(c, r) − 1
Algorithmic Bioinformatics 3
Backward Search with BWT, Occ and C
s = ctatatat$ and bwt=tttt$aaac:
C	  [$]	   C	  [a]	   C	  [c]	   C	  [t]	  
0	   1	   4	   5	  
t	   t	   t	   t	   $	   a	   a	   a	   c	  
0	   1	   2	   3	   4	   5	   6	   7	   8	  
Occ	  [$]	   0	   0	   0	   0	   1	   1	   1	   1	   1	  
Occ	  [a]	   0	   0	   0	   0	   0	   1	   2	   3	   3	  
Occ	  [c]	   0	   0	   0	   0	   0	   0	   0	   0	   1	  
Occ	  [t]	   1	   2	   3	   4	   4	   4	   4	   4	   4	  
t
t
t
t
$
a
a
a
c
S[pos[i]]bwtindex
0
1
2
3
4
5
6
7
8
$
a t $
a t a t $
a t a t a t $
c t a t a t a t $
t $
t a t $
t a t a t $
t a t a t a t $
Let [i, j] be the interval for µ; let [i , j ] be the interval for cµ. Then
i = C[c] + Occ(c, i − 1)
j = C[c] + Occ(c, j) − 1
Algorithmic Bioinformatics 3
Implementation of Occ
As shown, Occ uses O(|Σ|n) words of space – worse than the suﬃx tree!
(A word is a number of size O(poly(n)), i.e., using O(log n) bits.)
Algorithmic Bioinformatics 4
Implementation of Occ
As shown, Occ uses O(|Σ|n) words of space – worse than the suﬃx tree!
(A word is a number of size O(poly(n)), i.e., using O(log n) bits.)
Simple Idea
Store the entries of Occ only for every k-th position (typically k ∈ {32, 64, 128}).
To obtain Occ(a, r), look up Occ(a, r/k ),
and count the as in the remaining r − r/k characters in bwt.
Algorithmic Bioinformatics 4
Implementation of Occ
As shown, Occ uses O(|Σ|n) words of space – worse than the suﬃx tree!
(A word is a number of size O(poly(n)), i.e., using O(log n) bits.)
Simple Idea
Store the entries of Occ only for every k-th position (typically k ∈ {32, 64, 128}).
To obtain Occ(a, r), look up Occ(a, r/k ),
and count the as in the remaining r − r/k characters in bwt.
t	   t	   t	   t	   $	   a	   a	   a	   c	  
0	   1	   2	   3	   4	   5	   6	   7	   8	  
Occ	  [$]	   0	   1	   1	  
Occ	  [a]	   0	   0	   3	  
Occ	  [c]	   0	   0	   1	  
Occ	  [t]	   1	   4	   4	  
Algorithmic Bioinformatics 4
Occ Queries in Constant Time
and in Small Space
Algorithmic Bioinformatics 5
Implementation of Occ: Advanced Ideas
Data Structures
Succinct data structure for rank queries (binary alphabet)
Wavelet tree (for converting alphabet to sequence of binary alphabets)
Wavelet matrix (alternative to wavelet tree)
Algorithmic Bioinformatics 6
Implementation of Occ: Advanced Ideas
Data Structures
Succinct data structure for rank queries (binary alphabet)
Wavelet tree (for converting alphabet to sequence of binary alphabets)
Wavelet matrix (alternative to wavelet tree)
Rank queries
Let ranks,a(i) be the number of as in s[0 . . . i], i.e., Occ(a, i) for s.
Next goal
Rank queries in O(1) time for s over binary alphabet {0, 1} with o(n) additional bits,
i.e., for binary s, compute ranks(i) := ranks,1(i) in constant time for all i.
Algorithmic Bioinformatics 6
Implementation of Occ: Advanced Ideas
Data Structures
Succinct data structure for rank queries (binary alphabet)
Wavelet tree (for converting alphabet to sequence of binary alphabets)
Wavelet matrix (alternative to wavelet tree)
Rank queries
Let ranks,a(i) be the number of as in s[0 . . . i], i.e., Occ(a, i) for s.
Next goal
Rank queries in O(1) time for s over binary alphabet {0, 1} with o(n) additional bits,
i.e., for binary s, compute ranks(i) := ranks,1(i) in constant time for all i.
Note: ranks,0(i) = (i + 1) − ranks,1(i).
Algorithmic Bioinformatics 6
First Observations
ranks(i): number of ones in s[. . . i] for i = 0, . . . , n − 1, where n = |s|
Trivial ideas
Slow, but lightweight: O(n) time, O(1) additional memory
Slightly faster: loop with popcount: O(n/W ) time, O(1) additional memory
(popcount: number of 1-bits in a machine word, elementary instruction)
Fast but heavy-weight: full Occ table: O(1) time, but O(n log n) bits (n words)
Slightly slower, but lighter: sparse table (every k-th entry):
O(k) time and O(n/k · log n) bits (n/k words)
Algorithmic Bioinformatics 7
First Observations
ranks(i): number of ones in s[. . . i] for i = 0, . . . , n − 1, where n = |s|
Trivial ideas
Slow, but lightweight: O(n) time, O(1) additional memory
Slightly faster: loop with popcount: O(n/W ) time, O(1) additional memory
(popcount: number of 1-bits in a machine word, elementary instruction)
Fast but heavy-weight: full Occ table: O(1) time, but O(n log n) bits (n words)
Slightly slower, but lighter: sparse table (every k-th entry):
O(k) time and O(n/k · log n) bits (n/k words)
Desired
Data structure that supports O(1) time, but needs only o(n) bits, i.e.,
if x(n) is the additional number of bits (plus n for s), we want x(n)/n → 0 for n → ∞.
Algorithmic Bioinformatics 7
A small rank data structure for (binary) Occ
Basic Idea
Store every S-th entry, so S bits form a superblock:
O(log n · n/S) bits for superblock table
Algorithmic Bioinformatics 8
A small rank data structure for (binary) Occ
Basic Idea
Store every S-th entry, so S bits form a superblock:
O(log n · n/S) bits for superblock table
Choose S := Θ((log n)2); so need O(n/ log n) = o(n) bits
Remaining problem: Count ones in superblocks of size S
Algorithmic Bioinformatics 8
A small rank data structure for (binary) Occ
Basic Idea
Store every S-th entry, so S bits form a superblock:
O(log n · n/S) bits for superblock table
Choose S := Θ((log n)2); so need O(n/ log n) = o(n) bits
Remaining problem: Count ones in superblocks of size S
So far: time O(log2
n); memory o(n) bits
Reﬁnement
Partition each superblock into Θ(log n) blocks of size B := Θ(log n)
Each superblock has a table with rank diﬀerences for each block start.
Values up to Θ(log2
n) need O(log log n) Bits.
Number of blocks is Θ(log n · n/S) = Θ(n/ log n).
Total size: O(n log log n/ log n) = o(n) Bits.
Algorithmic Bioinformatics 8
Answering Rank Queries in Constant Time
Query ranks(i)
Given i, compute index s of superblock and index b of block inside superblock,
such that i = s · S + b · B + j with 0 ≤ j < B.
Look up rank Rs for superblock s in ﬁrst table
Look up rank diﬀerence rs,b for block b in second table
Compute number of ones rs,b,j in remaining j bits;
constant time with bitmask and popcount, because j < B = Θ(log n)
Answer is Rs + rs,b + rs,b,j :
sum of three terms, each in constant time.
Algorithmic Bioinformatics 9
Practical Implementation
Theory (RAM model): popcount of O(log n) bits in constant time
Practical popcount of up to 64 Bits in constant time
Choose B := 64 = Θ(log n), assume n ≤ 264
Choose S := 16 · (64)2 = 65536 = 216 = Θ((log n)2)
64-bit ints for superblock ranks, 16-bit ints for block ranks
Values can be adjusted for diﬀerent n, but these choices are convenient.
Algorithmic Bioinformatics 10
Practical Implementation
Theory (RAM model): popcount of O(log n) bits in constant time
Practical popcount of up to 64 Bits in constant time
Choose B := 64 = Θ(log n), assume n ≤ 264
Choose S := 16 · (64)2 = 65536 = 216 = Θ((log n)2)
64-bit ints for superblock ranks, 16-bit ints for block ranks
Values can be adjusted for diﬀerent n, but these choices are convenient.
We have n/216 superblocks with 64-bit rank values
Each superblock has 1024 blocks (64 bits) with 16-bit rank values
Total: n/65536 · (64 + 1024 · 16) ≈ 0.25 · n bits
Algorithmic Bioinformatics 10
From Binary to General Alphabet
Algorithmic Bioinformatics 11
Simple but Wasteful Idea
Use one bit vector per letter to represent BWT;
compute a separate succinct rank data structure for each letter.
Algorithmic Bioinformatics 12
Simple but Wasteful Idea
Use one bit vector per letter to represent BWT;
compute a separate succinct rank data structure for each letter.
Example: T = banana$, bwt = annb$aa
Bits 0 1 2 3 4 5 6
$ 0 0 0 0 1 0 0
a 1 0 0 0 0 1 1
b 0 0 0 1 0 0 0
n 0 1 1 0 0 0 0
Algorithmic Bioinformatics 12
Simple but Wasteful Idea
Use one bit vector per letter to represent BWT;
compute a separate succinct rank data structure for each letter.
Example: T = banana$, bwt = annb$aa
Bits 0 1 2 3 4 5 6
$ 0 0 0 0 1 0 0
a 1 0 0 0 0 1 1
b 0 0 0 1 0 0 0
n 0 1 1 0 0 0 0
Wasteful: Needs O(n|Σ|) + o(n|Σ|) bits.
But BWT itself only needs O(n log |Σ|) bits!
Algorithmic Bioinformatics 12
Wavelet Tree
Deﬁnition (Wavele tree)
Let T be a text with |T| = n; let Σ = {0, . . . , s − 1} be an alphabet with |Σ| = s.
The wavelet tree for T is a balanced binary tree with s leaves.
Algorithmic Bioinformatics 13
Wavelet Tree
Deﬁnition (Wavele tree)
Let T be a text with |T| = n; let Σ = {0, . . . , s − 1} be an alphabet with |Σ| = s.
The wavelet tree for T is a balanced binary tree with s leaves.
Each node corresponds to a sub-interval of the alphabet.
The root corresponds to Σ; each leaf corresponds to a single character.
Algorithmic Bioinformatics 13
Wavelet Tree
Deﬁnition (Wavele tree)
Let T be a text with |T| = n; let Σ = {0, . . . , s − 1} be an alphabet with |Σ| = s.
The wavelet tree for T is a balanced binary tree with s leaves.
Each node corresponds to a sub-interval of the alphabet.
The root corresponds to Σ; each leaf corresponds to a single character.
Each non-leaf node represents the sub-sequence of T whose characters are in a
sub-alphabet {a, . . . , b}. It partitions the sub-alphabet into two parts of equal size:
lower alphabet a, . . . , (a + b)/2 ,
upper alphabet (a + b)/2 + 1, . . . , b.
A bit vector indicates which letter belongs to which sub-sub-sequence.
Algorithmic Bioinformatics 13
Wavelet Tree: Example
Algorithmic Bioinformatics 14
Properties of the Wavelet Tree
There are log |Σ| levels in the wavelet tree.
In each level, we have n bits (summed over all nodes in the level).
The wavelet tree thus needs n · log |Σ| bits, as T does.
An additional O(|Σ| log n) bits are required for representing the tree structure.
Algorithmic Bioinformatics 15
Properties of the Wavelet Tree
There are log |Σ| levels in the wavelet tree.
In each level, we have n bits (summed over all nodes in the level).
The wavelet tree thus needs n · log |Σ| bits, as T does.
An additional O(|Σ| log n) bits are required for representing the tree structure.
With a rank data structure for each node, we need an additional o(n log |Σ|) bits.
With these, we can answer character and rank queries in O(log |Σ|) time.
We can replace T by the wavelet tree (and delete T).
Algorithmic Bioinformatics 15
Rank Queries on the Wavelet Tree
Root query: rankσ(i)
In the root, is σ in the lower or upper alphabet (0-bit or 1-bit) ?
0-bit: Compute k := i − rank1(i) + 1, go to left child.
1-bit: Compute k := rank1(i), go to right child.
Algorithmic Bioinformatics 16
Rank Queries on the Wavelet Tree
Root query: rankσ(i)
In the root, is σ in the lower or upper alphabet (0-bit or 1-bit) ?
0-bit: Compute k := i − rank1(i) + 1, go to left child.
1-bit: Compute k := rank1(i), go to right child.
With the child as new root, query for rankσ(k).
Result is found when no child exists for σ.
Algorithmic Bioinformatics 16
Wavelet Tree: Example Query rankn(15)
Algorithmic Bioinformatics 17
Saving Space for the suﬃx array pos
Algorithmic Bioinformatics 18
Storing the Suﬃx Array pos
Problem
So far, we can answer the pattern search decision problem and the counting problem.
How do we obtain the positions of the suﬃxes in the BWT interval?
Algorithmic Bioinformatics 19
Storing the Suﬃx Array pos
Problem
So far, we can answer the pattern search decision problem and the counting problem.
How do we obtain the positions of the suﬃxes in the BWT interval?
Answer
Easy: Enumerate the interval of the suﬃx array pos. However. . .
Algorithmic Bioinformatics 19
Storing the Suﬃx Array pos
Problem
So far, we can answer the pattern search decision problem and the counting problem.
How do we obtain the positions of the suﬃxes in the BWT interval?
Answer
Easy: Enumerate the interval of the suﬃx array pos. However. . .
Storing the complete suﬃx array pos takes space (less than suﬃx tree, but still. . . ).
We are looking for a more space-eﬃcient solution.
Sparse suﬃx array?
For the Occ table, we store only every k-th entry and recompute the rest on demand.
Can we do the same for the suﬃx array?
Algorithmic Bioinformatics 19
Successor Array Ψ and Predecessor Array Ψ−1
Deﬁnition: successor array Ψ
Ψ[r] is the index in the suﬃx array pos, such that
pos[Ψ[r]] = pos[r] + 1
Algorithmic Bioinformatics 20
Successor Array Ψ and Predecessor Array Ψ−1
Deﬁnition: successor array Ψ
Ψ[r] is the index in the suﬃx array pos, such that
pos[Ψ[r]] = pos[r] + 1
We used this in principle for Kasai’s linear-time lcp computation:
Ψ[r] = rank[pos[r] + 1]
Algorithmic Bioinformatics 20
Successor Array Ψ and Predecessor Array Ψ−1
Deﬁnition: successor array Ψ
Ψ[r] is the index in the suﬃx array pos, such that
pos[Ψ[r]] = pos[r] + 1
We used this in principle for Kasai’s linear-time lcp computation:
Ψ[r] = rank[pos[r] + 1]
We call Ψ−1, the inverse of Ψ, the predecessor array. This is the LF mapping.
pos[Ψ−1
[r]] = pos[r] − 1
Ψ−1
[r] = rank[pos[r] − 1]
Algorithmic Bioinformatics 20
Example: The Predecessor Array Ψ−1
or LF
LF L F
r Ψ−1
[r] pos[r] bwt[r] T[pos[r] :]
0 1 13 i $
1 2 i i$
2 8 p ii$
3 7 1 m iississippii$
4 10 s ippii$
5 11 s issippii$
6 3 2 i ississippii$
7 0 $ miississippii$
8 9 p pii$
9 4 9 i ppii$
10 12 s sippii$
11 13 s sissippii$
12 5 6 i ssippii$
13 6 i ssissippii$
LF[r] = C[a] + Occ(a, r) − 1,
where a = bwt[r]:
Reconstruct T from bwt
Reconstruct pos
Algorithmic Bioinformatics 21
Using LF to (Partially) Reconstruct the Suﬃx Array
Approach
pos[Ψ−1[r]] = pos[r] − 1
⇔ pos[r] = pos[Ψ−1[r]] + 1
Applying that relationship recursively yields
pos[r] = pos[(Ψ−1
)k
[r]] + k
Data structure
We compute Ψ−1 = LF from Occ and C
Store every t-th entry of suﬃx array pos (e.g. t = 32: BWA read mapper)
Reconstruct the rest of the suﬃx array (pos) on-the-ﬂy:
Apply Ψ−1 and increase k until we hit a stored value
Algorithmic Bioinformatics 22
Summary
FM Index
BWT, C, Occ
implementation: succinct rank data structure on wavelet tree
sampled pos, every t-th entry
original text is not required!
Backward Search
Compute interval for cµ from interval for µ
Constant time per character, O(m) for pattern P with |P| = m
Enumeration of text positions from sampled pos,
expected O(t) time, but worst-case O(n) time per position
Algorithmic Bioinformatics 23
Possible Exam Questions
Why and how is the FM index compressed?
How can rank (Occ) queries be implemented in constant time with succinct space?
What is a wavelet tree? How does it support character and rank queries?
What are the successor / predecessor arrays? Construct an example.
Explain sparse suﬃx arrays.
How long does a query on a sparse suﬃx array take in the worst case?
How can one determine the position of pattern matches for a BWT interval?
Algorithmic Bioinformatics 24