The FM Index Algorithms for Sequence Analysis Sven Rahmann Summer 2021 Reminder about the FM Index: Backward Search with the BWT Ferragina and Manzini, Opportunistic Data Structures with Applications, 2000 Algorithmic Bioinformatics 2 Backward Search with BWT, Occ and C s = ctatatat$ and bwt=tttt$aaac: C  [$]   C  [a]   C  [c]   C  [t]   0   1   4   5   t   t   t   t   $   a   a   a   c   0   1   2   3   4   5   6   7   8   Occ  [$]   0   0   0   0   1   1   1   1   1   Occ  [a]   0   0   0   0   0   1   2   3   3   Occ  [c]   0   0   0   0   0   0   0   0   1   Occ  [t]   1   2   3   4   4   4   4   4   4   t t t t $ a a a c S[pos[i]]bwtindex 0 1 2 3 4 5 6 7 8 $ a t $ a t a t $ a t a t a t $ c t a t a t a t $ t $ t a t $ t a t a t $ t a t a t a t $ Occ(c, r) returns the number of occurrences of c ∈ Σ in the prefix bwt[0 . . . r]. The k-th c in the BWT is the k-th c in the first characters of the sorted suffixes. LF(r) = C[c] + Occ(c, r) − 1 Algorithmic Bioinformatics 3 Backward Search with BWT, Occ and C s = ctatatat$ and bwt=tttt$aaac: C  [$]   C  [a]   C  [c]   C  [t]   0   1   4   5   t   t   t   t   $   a   a   a   c   0   1   2   3   4   5   6   7   8   Occ  [$]   0   0   0   0   1   1   1   1   1   Occ  [a]   0   0   0   0   0   1   2   3   3   Occ  [c]   0   0   0   0   0   0   0   0   1   Occ  [t]   1   2   3   4   4   4   4   4   4   t t t t $ a a a c S[pos[i]]bwtindex 0 1 2 3 4 5 6 7 8 $ a t $ a t a t $ a t a t a t $ c t a t a t a t $ t $ t a t $ t a t a t $ t a t a t a t $ Let [i, j] be the interval for µ; let [i , j ] be the interval for cµ. Then i = C[c] + Occ(c, i − 1) j = C[c] + Occ(c, j) − 1 Algorithmic Bioinformatics 3 Implementation of Occ As shown, Occ uses O(|Σ|n) words of space – worse than the suffix tree! (A word is a number of size O(poly(n)), i.e., using O(log n) bits.) Algorithmic Bioinformatics 4 Implementation of Occ As shown, Occ uses O(|Σ|n) words of space – worse than the suffix tree! (A word is a number of size O(poly(n)), i.e., using O(log n) bits.) Simple Idea Store the entries of Occ only for every k-th position (typically k ∈ {32, 64, 128}). To obtain Occ(a, r), look up Occ(a, r/k ), and count the as in the remaining r − r/k characters in bwt. Algorithmic Bioinformatics 4 Implementation of Occ As shown, Occ uses O(|Σ|n) words of space – worse than the suffix tree! (A word is a number of size O(poly(n)), i.e., using O(log n) bits.) Simple Idea Store the entries of Occ only for every k-th position (typically k ∈ {32, 64, 128}). To obtain Occ(a, r), look up Occ(a, r/k ), and count the as in the remaining r − r/k characters in bwt. t   t   t   t   $   a   a   a   c   0   1   2   3   4   5   6   7   8   Occ  [$]   0   1   1   Occ  [a]   0   0   3   Occ  [c]   0   0   1   Occ  [t]   1   4   4   Algorithmic Bioinformatics 4 Occ Queries in Constant Time and in Small Space Algorithmic Bioinformatics 5 Implementation of Occ: Advanced Ideas Data Structures Succinct data structure for rank queries (binary alphabet) Wavelet tree (for converting alphabet to sequence of binary alphabets) Wavelet matrix (alternative to wavelet tree) Algorithmic Bioinformatics 6 Implementation of Occ: Advanced Ideas Data Structures Succinct data structure for rank queries (binary alphabet) Wavelet tree (for converting alphabet to sequence of binary alphabets) Wavelet matrix (alternative to wavelet tree) Rank queries Let ranks,a(i) be the number of as in s[0 . . . i], i.e., Occ(a, i) for s. Next goal Rank queries in O(1) time for s over binary alphabet {0, 1} with o(n) additional bits, i.e., for binary s, compute ranks(i) := ranks,1(i) in constant time for all i. Algorithmic Bioinformatics 6 Implementation of Occ: Advanced Ideas Data Structures Succinct data structure for rank queries (binary alphabet) Wavelet tree (for converting alphabet to sequence of binary alphabets) Wavelet matrix (alternative to wavelet tree) Rank queries Let ranks,a(i) be the number of as in s[0 . . . i], i.e., Occ(a, i) for s. Next goal Rank queries in O(1) time for s over binary alphabet {0, 1} with o(n) additional bits, i.e., for binary s, compute ranks(i) := ranks,1(i) in constant time for all i. Note: ranks,0(i) = (i + 1) − ranks,1(i). Algorithmic Bioinformatics 6 First Observations ranks(i): number of ones in s[. . . i] for i = 0, . . . , n − 1, where n = |s| Trivial ideas Slow, but lightweight: O(n) time, O(1) additional memory Slightly faster: loop with popcount: O(n/W ) time, O(1) additional memory (popcount: number of 1-bits in a machine word, elementary instruction) Fast but heavy-weight: full Occ table: O(1) time, but O(n log n) bits (n words) Slightly slower, but lighter: sparse table (every k-th entry): O(k) time and O(n/k · log n) bits (n/k words) Algorithmic Bioinformatics 7 First Observations ranks(i): number of ones in s[. . . i] for i = 0, . . . , n − 1, where n = |s| Trivial ideas Slow, but lightweight: O(n) time, O(1) additional memory Slightly faster: loop with popcount: O(n/W ) time, O(1) additional memory (popcount: number of 1-bits in a machine word, elementary instruction) Fast but heavy-weight: full Occ table: O(1) time, but O(n log n) bits (n words) Slightly slower, but lighter: sparse table (every k-th entry): O(k) time and O(n/k · log n) bits (n/k words) Desired Data structure that supports O(1) time, but needs only o(n) bits, i.e., if x(n) is the additional number of bits (plus n for s), we want x(n)/n → 0 for n → ∞. Algorithmic Bioinformatics 7 A small rank data structure for (binary) Occ Basic Idea Store every S-th entry, so S bits form a superblock: O(log n · n/S) bits for superblock table Algorithmic Bioinformatics 8 A small rank data structure for (binary) Occ Basic Idea Store every S-th entry, so S bits form a superblock: O(log n · n/S) bits for superblock table Choose S := Θ((log n)2); so need O(n/ log n) = o(n) bits Remaining problem: Count ones in superblocks of size S Algorithmic Bioinformatics 8 A small rank data structure for (binary) Occ Basic Idea Store every S-th entry, so S bits form a superblock: O(log n · n/S) bits for superblock table Choose S := Θ((log n)2); so need O(n/ log n) = o(n) bits Remaining problem: Count ones in superblocks of size S So far: time O(log2 n); memory o(n) bits Refinement Partition each superblock into Θ(log n) blocks of size B := Θ(log n) Each superblock has a table with rank differences for each block start. Values up to Θ(log2 n) need O(log log n) Bits. Number of blocks is Θ(log n · n/S) = Θ(n/ log n). Total size: O(n log log n/ log n) = o(n) Bits. Algorithmic Bioinformatics 8 Answering Rank Queries in Constant Time Query ranks(i) Given i, compute index s of superblock and index b of block inside superblock, such that i = s · S + b · B + j with 0 ≤ j < B. Look up rank Rs for superblock s in first table Look up rank difference rs,b for block b in second table Compute number of ones rs,b,j in remaining j bits; constant time with bitmask and popcount, because j < B = Θ(log n) Answer is Rs + rs,b + rs,b,j : sum of three terms, each in constant time. Algorithmic Bioinformatics 9 Practical Implementation Theory (RAM model): popcount of O(log n) bits in constant time Practical popcount of up to 64 Bits in constant time Choose B := 64 = Θ(log n), assume n ≤ 264 Choose S := 16 · (64)2 = 65536 = 216 = Θ((log n)2) 64-bit ints for superblock ranks, 16-bit ints for block ranks Values can be adjusted for different n, but these choices are convenient. Algorithmic Bioinformatics 10 Practical Implementation Theory (RAM model): popcount of O(log n) bits in constant time Practical popcount of up to 64 Bits in constant time Choose B := 64 = Θ(log n), assume n ≤ 264 Choose S := 16 · (64)2 = 65536 = 216 = Θ((log n)2) 64-bit ints for superblock ranks, 16-bit ints for block ranks Values can be adjusted for different n, but these choices are convenient. We have n/216 superblocks with 64-bit rank values Each superblock has 1024 blocks (64 bits) with 16-bit rank values Total: n/65536 · (64 + 1024 · 16) ≈ 0.25 · n bits Algorithmic Bioinformatics 10 From Binary to General Alphabet Algorithmic Bioinformatics 11 Simple but Wasteful Idea Use one bit vector per letter to represent BWT; compute a separate succinct rank data structure for each letter. Algorithmic Bioinformatics 12 Simple but Wasteful Idea Use one bit vector per letter to represent BWT; compute a separate succinct rank data structure for each letter. Example: T = banana$, bwt = annb$aa Bits 0 1 2 3 4 5 6 $ 0 0 0 0 1 0 0 a 1 0 0 0 0 1 1 b 0 0 0 1 0 0 0 n 0 1 1 0 0 0 0 Algorithmic Bioinformatics 12 Simple but Wasteful Idea Use one bit vector per letter to represent BWT; compute a separate succinct rank data structure for each letter. Example: T = banana$, bwt = annb$aa Bits 0 1 2 3 4 5 6 $ 0 0 0 0 1 0 0 a 1 0 0 0 0 1 1 b 0 0 0 1 0 0 0 n 0 1 1 0 0 0 0 Wasteful: Needs O(n|Σ|) + o(n|Σ|) bits. But BWT itself only needs O(n log |Σ|) bits! Algorithmic Bioinformatics 12 Wavelet Tree Definition (Wavele tree) Let T be a text with |T| = n; let Σ = {0, . . . , s − 1} be an alphabet with |Σ| = s. The wavelet tree for T is a balanced binary tree with s leaves. Algorithmic Bioinformatics 13 Wavelet Tree Definition (Wavele tree) Let T be a text with |T| = n; let Σ = {0, . . . , s − 1} be an alphabet with |Σ| = s. The wavelet tree for T is a balanced binary tree with s leaves. Each node corresponds to a sub-interval of the alphabet. The root corresponds to Σ; each leaf corresponds to a single character. Algorithmic Bioinformatics 13 Wavelet Tree Definition (Wavele tree) Let T be a text with |T| = n; let Σ = {0, . . . , s − 1} be an alphabet with |Σ| = s. The wavelet tree for T is a balanced binary tree with s leaves. Each node corresponds to a sub-interval of the alphabet. The root corresponds to Σ; each leaf corresponds to a single character. Each non-leaf node represents the sub-sequence of T whose characters are in a sub-alphabet {a, . . . , b}. It partitions the sub-alphabet into two parts of equal size: lower alphabet a, . . . , (a + b)/2 , upper alphabet (a + b)/2 + 1, . . . , b. A bit vector indicates which letter belongs to which sub-sub-sequence. Algorithmic Bioinformatics 13 Wavelet Tree: Example Algorithmic Bioinformatics 14 Properties of the Wavelet Tree There are log |Σ| levels in the wavelet tree. In each level, we have n bits (summed over all nodes in the level). The wavelet tree thus needs n · log |Σ| bits, as T does. An additional O(|Σ| log n) bits are required for representing the tree structure. Algorithmic Bioinformatics 15 Properties of the Wavelet Tree There are log |Σ| levels in the wavelet tree. In each level, we have n bits (summed over all nodes in the level). The wavelet tree thus needs n · log |Σ| bits, as T does. An additional O(|Σ| log n) bits are required for representing the tree structure. With a rank data structure for each node, we need an additional o(n log |Σ|) bits. With these, we can answer character and rank queries in O(log |Σ|) time. We can replace T by the wavelet tree (and delete T). Algorithmic Bioinformatics 15 Rank Queries on the Wavelet Tree Root query: rankσ(i) In the root, is σ in the lower or upper alphabet (0-bit or 1-bit) ? 0-bit: Compute k := i − rank1(i) + 1, go to left child. 1-bit: Compute k := rank1(i), go to right child. Algorithmic Bioinformatics 16 Rank Queries on the Wavelet Tree Root query: rankσ(i) In the root, is σ in the lower or upper alphabet (0-bit or 1-bit) ? 0-bit: Compute k := i − rank1(i) + 1, go to left child. 1-bit: Compute k := rank1(i), go to right child. With the child as new root, query for rankσ(k). Result is found when no child exists for σ. Algorithmic Bioinformatics 16 Wavelet Tree: Example Query rankn(15) Algorithmic Bioinformatics 17 Saving Space for the suffix array pos Algorithmic Bioinformatics 18 Storing the Suffix Array pos Problem So far, we can answer the pattern search decision problem and the counting problem. How do we obtain the positions of the suffixes in the BWT interval? Algorithmic Bioinformatics 19 Storing the Suffix Array pos Problem So far, we can answer the pattern search decision problem and the counting problem. How do we obtain the positions of the suffixes in the BWT interval? Answer Easy: Enumerate the interval of the suffix array pos. However. . . Algorithmic Bioinformatics 19 Storing the Suffix Array pos Problem So far, we can answer the pattern search decision problem and the counting problem. How do we obtain the positions of the suffixes in the BWT interval? Answer Easy: Enumerate the interval of the suffix array pos. However. . . Storing the complete suffix array pos takes space (less than suffix tree, but still. . . ). We are looking for a more space-efficient solution. Sparse suffix array? For the Occ table, we store only every k-th entry and recompute the rest on demand. Can we do the same for the suffix array? Algorithmic Bioinformatics 19 Successor Array Ψ and Predecessor Array Ψ−1 Definition: successor array Ψ Ψ[r] is the index in the suffix array pos, such that pos[Ψ[r]] = pos[r] + 1 Algorithmic Bioinformatics 20 Successor Array Ψ and Predecessor Array Ψ−1 Definition: successor array Ψ Ψ[r] is the index in the suffix array pos, such that pos[Ψ[r]] = pos[r] + 1 We used this in principle for Kasai’s linear-time lcp computation: Ψ[r] = rank[pos[r] + 1] Algorithmic Bioinformatics 20 Successor Array Ψ and Predecessor Array Ψ−1 Definition: successor array Ψ Ψ[r] is the index in the suffix array pos, such that pos[Ψ[r]] = pos[r] + 1 We used this in principle for Kasai’s linear-time lcp computation: Ψ[r] = rank[pos[r] + 1] We call Ψ−1, the inverse of Ψ, the predecessor array. This is the LF mapping. pos[Ψ−1 [r]] = pos[r] − 1 Ψ−1 [r] = rank[pos[r] − 1] Algorithmic Bioinformatics 20 Example: The Predecessor Array Ψ−1 or LF LF L F r Ψ−1 [r] pos[r] bwt[r] T[pos[r] :] 0 1 13 i $ 1 2 i i$ 2 8 p ii$ 3 7 1 m iississippii$ 4 10 s ippii$ 5 11 s issippii$ 6 3 2 i ississippii$ 7 0 $ miississippii$ 8 9 p pii$ 9 4 9 i ppii$ 10 12 s sippii$ 11 13 s sissippii$ 12 5 6 i ssippii$ 13 6 i ssissippii$ LF[r] = C[a] + Occ(a, r) − 1, where a = bwt[r]: Reconstruct T from bwt Reconstruct pos Algorithmic Bioinformatics 21 Using LF to (Partially) Reconstruct the Suffix Array Approach pos[Ψ−1[r]] = pos[r] − 1 ⇔ pos[r] = pos[Ψ−1[r]] + 1 Applying that relationship recursively yields pos[r] = pos[(Ψ−1 )k [r]] + k Data structure We compute Ψ−1 = LF from Occ and C Store every t-th entry of suffix array pos (e.g. t = 32: BWA read mapper) Reconstruct the rest of the suffix array (pos) on-the-fly: Apply Ψ−1 and increase k until we hit a stored value Algorithmic Bioinformatics 22 Summary FM Index BWT, C, Occ implementation: succinct rank data structure on wavelet tree sampled pos, every t-th entry original text is not required! Backward Search Compute interval for cµ from interval for µ Constant time per character, O(m) for pattern P with |P| = m Enumeration of text positions from sampled pos, expected O(t) time, but worst-case O(n) time per position Algorithmic Bioinformatics 23 Possible Exam Questions Why and how is the FM index compressed? How can rank (Occ) queries be implemented in constant time with succinct space? What is a wavelet tree? How does it support character and rank queries? What are the successor / predecessor arrays? Construct an example. Explain sparse suffix arrays. How long does a query on a sparse suffix array take in the worst case? How can one determine the position of pattern matches for a BWT interval? Algorithmic Bioinformatics 24