Data Warehouse Indexing & Optimization PA220: Database systems for data analytics Contents • Approaches to indexing • Data partitioning • Joins • Materialized views October 31, 2024 PA220 DB for Analytics 2 Why Indexes? • Consider a 100 GB table; at 100 MB/s read speed we need 17 minutes for a full table scan • Query for the number of “Bosch S500” washing machines sold in Germany last month • Applying restrictions (product, location) the selectivity would be strongly improved • If we have 30 locations, 10,000 products and 24 months in the DW, the selectivity value is 1/30 * 1/ 10,000 * 1/24 = 0,000 000 14 • So… we read 100 GB for 1,4KB of data • The problem is: how to filter data in a fact table as much as possible October 31, 2024 PA220 DB for Analytics 3 The low value of selectivity → Highly selective predicate! Why Indexes? • Reduce the size of read pages of data cube to a minimum with indexes October 31, 2024 PA220 DB for Analytics 4 Index Types • Tree structures • B+-tree, R-tree, … • Hash based • Dynamic hash table • Special • Bitmap index • Block-Range INdex (in Pg) October 31, 2024 PA220 DB for Analytics 5 Multidimensional Data • B+-tree • classic structure – very efficient in updates • supports point and range queries • limited to 1D data • UB-tree • uses B*-tree and • Z-curve to linearize n-dim data • R-tree • wrapping by n-dim rectangles • R+, R*, X-tree October 31, 2024 PA220 DB for Analytics 6 UB-Trees • Convert n-dim data to a single dimension by the Z-curve • and Index by B* tree • The Z-curve provides for good performance for range queries! • Consecutive values on the Z-curve index similar data • Similarity by means of neighborhood October 31, 2024 PA220 DB for Analytics 7 UB-Trees • Z-Value address representation • Calculate the z-values such that neighboring data is clustered together • Calculated through bit interleaving of the coordinates of the tuple • To localize a value with coordinates one must perform de-interleaving October 31, 2024 PA220 DB for Analytics 8 For Z-value 51, we have the offset 50. 50 in binary is 110010 Z-value = 110010 Y = 101 = 5 X = 100 = 4 row at x=4, y=5 We have Z-regions – describes one block in storage. E.g. [1-9], [10-18]. UB-Trees – Range Query • Range queries (RQ) in UB-Trees • Each query can be specified by 2 coordinates • qa (the upper left corner of the query rectangle) • qb (the lower right corner of the query rectangle) • Range Query Algorithm 1. Calculate z-values for qa and qb 2. Get a node with Z-Region containing qa • e.g., Z-Region of qa is [10:18] 3. The corresponding page is loaded and filtered with the query predicate • E.g., value 10 has after de-interleaving x=1 and y=2, which is outside the query rectangle October 31, 2024 PA220 DB for Analytics 9 Q: x[2;5], y[3;6] UB-Trees – Range Query • Range Query Algorithm (cont.) 4. After qa, all values on the Z-curve are de-interleaved and checked by their coordinates • The data is only accessed from the disk. • The next jump point on the Z-curve is 27. 5. Repeat Steps 2 and 3 until the decoded end-address of the last filtered region is bigger than qb October 31, 2024 PA220 DB for Analytics 10 Calculating the jump point mostly involves: • Performing bit operations and comparisons • 3 points: qa, qb and the current Z-Value UB-Trees – next “jump” point • Idea of getting next jump point October 31, 2024 PA220 DB for Analytics 11 Index comparison • Area of records to scan (dark-grey) to get the answer (light-grey) October 31, 2024 PA220 DB for Analytics 12 R-Trees • Like B-trees • Data objects stored in leaf nodes • Nodes represented by minimum bounding rectangles • High-balanced structure October 31, 2024 PA220 DB for Analytics 13 Query S: 3 out of 11 nodes are checked. (root, R2, R8) R1 R2 R3 root R7 R8 R9 xx Check all the objects in node R8. R-Trees Querying • Many MBR overlaps deteriorate query performance • All nodes get visited in the worst case. • Key is insertion/split optimization • Minimize volume by MBR → overlaps. October 31, 2024 PA220 DB for Analytics 14 Bad split Better split R+ Tree • Eliminates overlaps by replication of objects in leaves • Improves performance of point queries October 31, 2024 PA220 DB for Analytics 15 A B C P root K J I Read the object G. G H Bitmap Index • A bitmap index for an attribute is a data structure composed of: • A collection of bitmaps (bit-vectors) • The number of bit-vectors represents the count of distinct values of an attr. in the relation • Bitmap (bit vector/array) is an array data structure that stores individual bits • Bit signals the presence of value in the row with the relative index of the bit’s position. • The length of each bit-vector is the cardinality of the relation. • It is compressed by Run-length encoding. October 31, 2024 PA220 DB for Analytics 16 Shop dim Sales fact Bitmap on Shop of Sales 3 2 1 Bitmap Index • Records are allocated under permanent numbers. • There is a mapping between record numbers and record addresses. • Insertion • bit-vectors are extended, and the new record is appended to the table • Update • toggle the bits in the old bit-vector array and in the new one. • Deletion • in the fact table → tombstones • in the index → bit is cleared October 31, 2024 PA220 DB for Analytics 17 Bitmap Index – Queries • Combine OR/AND values • OR/AND bit ops on vectors • E.g., Saturn | P&C • Combine different indexes on the same table October 31, 2024 PA220 DB for Analytics 18 3 2 1 100010 001000 ---------- 101010 Bitmap Index • Good for data which has a “small” number of distinct values • E.g., gender data, clothing sizes • Thus, the selectivity value is low. • Combinations of multiple indexes lead to highly-selective predicates. • Similar performance as B+ tree for static (read-only) data • also, when all values are distinct • Many distinct values cause many bit arrays. • Transform to multiple components and project to decrease it. • → Multi-component Bitmap Index • Not very good for range queries on values. • → Range-encoded Bitmap Index October 31, 2024 PA220 DB for Analytics 19 Multi-component Bitmap Index • Encoding using a different numeration system to reduce storage space • E.g., classes • Idea: • transform values into more dimensions and project • intersection of projections gives the original value • E.g., the month attribute has values between 0 and 11. • Encode by X = 3*Z+Y October 31, 2024 PA220 DB for Analytics 20 1 0 A_,1 A_,0A_,2A_,3A1,_ A0,_A2,_ Q1 Q2 Q3 Q4 1st Jan Apr Jul Oct 2nd Feb May Aug Nov 3rd Mar Jun Sep Dec Multi-component Bitmap Index • Encoding using a different numeration system to reduce storage space, e.g., classes • Idea: • transform values into more dimensions and project • intersection of projections gives the original value • E.g., the month attribute October 31, 2024 PA220 DB for Analytics 21 1 0 A_,1 A_,0A_,2A_,3A1,_ A0,_A2,_ Q1 Q2 Q3 Q4 1st Jan Apr Jul Oct 2nd Feb May Aug Nov 3rd Mar Jun Sep Dec Encode by X = 3*Z+Y i.e., June: 5=3*1+2 New bit-arrays Multi-component Bitmap Index • If we have 100 (0..99) different days to index we can use a multicomponent bitmap index with the basis of <10,10> • The storage is reduced from 100 to 20 bitmap-vectors • 10 for y and 10 for z • The read-access for a point query (1 day out of 100) needs however 2 read operations instead of just 1 • plus, the bit-and operation on the bit-arrays October 31, 2024 PA220 DB for Analytics 22 Range-encoded Bitmap Index • Requires a logical ordering of values • Idea: • set the bit in all bit-vectors of the values following this current one • range queries will check just 2 bit-vectors • matches are: NOT previous AND current • Disadvantage: • a point query requires reading 2 vectors October 31, 2024 PA220 DB for Analytics 23 Range-encoded Bitmap Index • Query: Persons born between March and August • So, persons who didn’t exist in February but existed in August. • Just 2 vectors read: ((NOT A1) AND A7) • Normal bitmap would require 6 vectors to read. October 31, 2024 PA220 DB for Analytics 24 Index on month of birth Summary of Indexes • B-Trees are not fit for multidimensional data • UB-trees can be applicable • R-Trees may not scale to many dimensions • Bitmap indexes are typically only a fraction of the size of the indexed data in the table • Bitmap indexes reduce response time for large classes of ad hoc queries October 31, 2024 PA220 DB for Analytics 25 Data Partitioning • Breaking data into “non-overlapping” parts • May correspond to the granularity of a dimension and use ranges to define partitions of a fact table. • Improves: • Business query performance, • i.e., minimize the amount of data to scan • Data availability, • e.g., back-up/restore can run at the partition level • Database administration, • e.g., archiving data, recreating indexes, loading tables October 31, 2024 PA220 DB for Analytics 26 Data Partitioning • Approaches: • Logical partitioning by • Date, Line of business, Geography, Organizational unit, Combinations of these factors, … • Physical partitioning • Makes data available to different processing nodes • Possible parallelization on multiple disks/machines • Implementation: • Application level • Database system October 31, 2024 PA220 DB for Analytics 27 Data Partitioning: Two Options • Horizontal – splitting out the rows of a table into multiple tables • Vertical – splitting out the columns of a table into multiple tables October 31, 2024 PA220 DB for Analytics 28 Partitioning key Horizontal Partitioning • Distributes records into disjoint “tables” • Typically, “view” over the union of the table is created • Types of partition function: • range – a range of values per table • list – enumeration of values per table • hash – result of a hash function determines the table • Data warehouse context: • Fact table is partitioned by, e.g., • Time dimension – weeks, months, or age of data • Another dim if it does not change often – branch, region • Individual partitions (tables) • require defining constraints on their contents • to use a subset of partitions in query execution October 31, 2024 PA220 DB for Analytics 29 Vertical Partitioning • Involves creating tables with fewer columns and using additional tables to store the remaining columns • Usually called row splitting • Row splitting creates one-to-one relationships between the partitions • Create a view that merges them • Different physical storage might be used • E.g., storing infrequently used or very wide columns on a different device • Data warehouse context : • move seldom-used columns from a highly-used table to another • Sometimes done as a side effect when an “outrigger” dimension is used. • This is relevant to fact tables and their measure columns! October 31, 2024 PA220 DB for Analytics 30 Vertical Partitioning (contrast to dimensions) • Mini-dimension with outrigger is a solution • Many dimension attributes are used very frequently as browsing constraints • In big dimensions these constraints can be hard to find among the lesser used ones • Logical groups of often used constraints can be separated into small dimensions • which are very well indexed and easily accessible for browsing • E.g., demography dimension • Notice the foreign key in customer October 31, 2024 PA220 DB for Analytics 31 Summary of Data Partitioning • Advantages • Records used together are grouped together • Each partition can be optimized for performance • Security, recovery • Partitions stored on different disks reduce contention • Take advantage of parallel processing capability • Disadvantages • Slow retrieval across partitions (expensive joins in vertical partitioning) • Complexity • Recommendations • A table is larger than 2GB (from Oracle) • A table has more than 100 million rows (practice) October 31, 2024 PA220 DB for Analytics 32 Join Optimization in DWH • Queries over several dimensions are often needed • This results in joins over the tables • Though joins are generally expensive operations, the overall cost of the query may strongly differ with the chosen evaluation plan for the joins • Joins are commutative and associative • R ⋈ S ≡ S ⋈ R • R ⋈ (S ⋈ T) ≡ (S ⋈ R) ⋈ T October 31, 2024 PA220 DB for Analytics 33 Join Optimization • This allows evaluating individual joins in any order • Results in join trees • Different join trees may show very different evaluation performance • Number of possible join trees may grow rapidly (n!) • DBMS optimizer considers • statistics to minimize result size • all possibilities → impossible for large n • heuristics to pick promising ones • when the number of relations is high (e.g., >6) • e.g., genetic algorithms October 31, 2024 PA220 DB for Analytics 34 Join Selection Heuristics • Join relations that relate by an attribute/condition • which avoids cross joins • Minimize the result size (A is the common attr.) • 𝑇 𝑅 ∗𝑇(𝑆) max(𝑉 𝑅,𝐴 ,𝑉 𝑆,𝐴 ) • Availability of indexes and selectivity of other conditions • User tuning • Hints in Oracle • Change join_collapse_limit in PostgreSQL October 31, 2024 PA220 DB for Analytics 35 Join Selection Heuristics in DWs • OLTP’s heuristics are not suitable in DWs • E.g., join Sales with Geo in the following case: • Sales has 10 mil records, in Germany there are 10 stores, in January 2016 there were products sold in 20 days, and the Electronics group has 50 products • If 20 % of our sales were performed in Germany, • the selectivity value is high. • so, an index would not help that much • The intermediate result would still comprise 2 mil records October 31, 2024 PA220 DB for Analytics 36 Join Selection Heuristics in DWs • The cross join of the dimension tables is recommended • Geo dimension – 10 stores in Germany • Time dimension – 20 days in Jan 2016 • Product dimension – 50 products in Electronics • 10m facts in Sales • 10*20*50 = 10,000 records after performing the cross product • But can also be expensive! • Cross-join of dimensions allows • a single pass over Sales • using an index on the most selective attribute yet. October 31, 2024 PA220 DB for Analytics 37 Join Selection Heuristics in DWs • If cross join is too large, intersect partial joins • applicable when all dimension FKs are indexed • in fact, it is a semi-join (no record duplication can take place) October 31, 2024 PA220 DB for Analytics 38 Summary of Joins • Prefer a cross-join on dimensions • If not all dimension FKs are indexed • Intersect semi-joins otherwise • Avoid standard DBMS’s plans • But check the plan first ☺ October 31, 2024 PA220 DB for Analytics 39 Materialized Views • Views whose tuples are stored in the database are said to be materialized • They provide fast access, like a (very high-level) cache • Need to maintain the view’s contents as the underlying tables change • Ideally, we want incremental view maintenance algorithms October 31, 2024 PA220 DB for Analytics 40 Materialized Views • How can we use MV in DW? • E.g., we have queries requiring us to join the Sales table with another dimension table and aggregate the result • SELECT P.Categ, SUM(S.Qty) FROM Product P, Sales S WHERE P.ProdID=S.ProdID GROUP BY P.Categ • SELECT G.Store, SUM(S.Qty) FROM Geo G, Sales S WHERE G.GeoID=S.GeoID GROUP BY G.Store • … • There are more solutions to speed up such queries • Pre-compute the two joins involved (product with sales and geo with sales) • Pre-compute each query in its entirety • Or use a common and already materialized view October 31, 2024 PA220 DB for Analytics 41 Materialized Views • Having the following view materialized • CREATE MATERIALIZED VIEW Totalsales(ProdID, GeoID, total) AS SELECT S.ProdID, S.GeoID, SUM(S.Qty) FROM Sales S GROUP BY S.ProdID, S.GeoID • We can use it in our queries • SELECT P.Categ, SUM(T.Total) FROM Product P, Totalsales T WHERE P.ProdID=T.ProdID GROUP BY P.Categ • SELECT G.Store, SUM(T.Total) FROM Geo G, Totalsales T WHERE G.GeoID=T.GeoID GROUP BY G.Store October 31, 2024 PA220 DB for Analytics 42 Materialized Views - Issues • Choice of materialized views • What views should we materialize, and what indexes should we build on the pre-computed results? • Utilization • Given a query and a set of materialized views, can we use the materialized views to answer the query? • Maintenance • How frequently should we refresh materialized views to make them consistent with the underlying tables? • And how can we do this incrementally? October 31, 2024 PA220 DB for Analytics 43 Materialized Views: Utilization • Utilization must be transparent • Queries are internally rewritten to use the available MVs by the query rewriter • The query rewriter performs integration of the MV based on the query execution graph October 31, 2024 PA220 DB for Analytics 44 Materialized Views: Utilization • E.g., mono-block query (perfect match) October 31, 2024 PA220 DB for Analytics 45 Materialized Views: Utilization • Correctness: • A query Q` represents a valid replacement of query Q by utilizing the materialized view M, if Q and Q` always deliver the same result. • Implementation requires the following: • The selection condition in M cannot be more restrictive than the one in Q. • The projection from Q must be a subset of the projection from M. • It must be possible to derive the aggregation functions in Q from ones in M. • Additional selection conditions in Q must also be possible on M. October 31, 2024 PA220 DB for Analytics 46 Materialized Views: Integration • A way to integrate a more restrictive view: • Split the query Q in two parts, Qa and Qb, such that • σ(Qa) = (σ(Q) ⋀ σ(M)) and • σ(Qb) = (σ(Q) ⋀ ¬σ(M)) October 31, 2024 PA220 DB for Analytics 47 Materialized Views & DWs • Often store aggregated results • For a set of “n” group-by attributes, there are 2n possible combinations • Too many to materialize all • What to materialize? October 31, 2024 PA220 DB for Analytics 48 Materialized Views & DWs • Choosing the views to materialize • Static choice: • The choice is performed at a certain time point • by the DB administrator (not very often) or by an algorithm • The set of MVs remains unmodified until the next refresh • The chosen MVs correspond to older queries • Dynamic choice: • The MV set adapts itself according to new queries October 31, 2024 PA220 DB for Analytics 49 Views to Materialize • Static choice • Choose which views to materialize, in concordance with the “benefit” they bring • The benefit is computed based on a cost function • The cost function involves • Query costs • Statistical approximations of the frequency of the query • Actualization/maintenance costs • Classical knapsack problem – a limit on MV storage and the cost of each MV • Greedy algorithm • Input: the lattice of cuboids, the expected cardinality of each node, and the maximum storage size available to save MVs • It calculates the nodes from the lattice which bring the highest benefit according to the cost function, until there is no more space to store MVs • Output: the list of lattice nodes to be materialized October 31, 2024 PA220 DB for Analytics 50 Views to Materialize • Disadvantages of static choice • OLAP applications are interactive • Usually, the user runs a series of queries to explain a behavior he has observed, which happened for the first time • So now the query set comprises hard to predict, ad-hoc queries • Even if the query pattern is observed after a while, it is unknown for how much time the pattern will remain valid • Queries are always changing • Often modification to the data leads to high update effort • There are, however, also for OLAP applications, some often repeating queries that should in any case be statically materialized. October 31, 2024 PA220 DB for Analytics 51 Views to Materialize • Dynamic choice • Monitor the queries being executed over time • Maintain a materialized view processing plan (MVPP) by incorporating most frequently executed queries • Modify MVPP incrementally by executing MVPP generation algorithm • as a background process • Decide on the views to be materialized • Reorganize the existing views • It works on the same principle as caching, but with semantic knowledge October 31, 2024 PA220 DB for Analytics 52 Views to Materialize • Dynamic choice • Updates of cached MV: • In each step, the cost of MV in the cache as well as of the query is calculated • All MVs as well as the query result are sorted according to their costs • The cache is then filled with MV in the order of their costs, from high to low • This way it can happen that one or more old MVs are replaced with the current query • Factors consider in the cost function: • Time of the last access • Frequency of query • Size of the materialized view • The costs a new calculation or actualization would produce for a MV • Number of queries which were answered with the MV • Number of queries which could be answered with this MV October 31, 2024 PA220 DB for Analytics 53 Maintenance of Materialized Views • Keeping a materialized view up-to-date with the underlying data • How do we refresh a view when an underlying table is refreshed? • When should we refresh a view in response to a change in the underlying table? • Approaches: • Re-computation – re-calculated from the scratch • Incremental – updated by new data, not easy to implement • Immediate – as part of the transaction that modifies the underlying data tables • Advantage: materialized view is always consistent • Disadvantage: updates are slowed down • Deferred – some time later, in a separate transaction • Advantage: can scale to maintain many views without slowing updates • Disadvantage: view briefly becomes inconsistent October 31, 2024 PA220 DB for Analytics 54 Maintenance of Materialized Views • Incremental maintenance • Changes to database relations are used to compute changes to the materialized view, which is then updated • Considering that we have a materialized view V, and that the basis relations suffer modifications through inserts, updates or deletes, we can calculate V` as follows • V` = (V - Δ-) ∪ Δ+, where Δ- and Δ+ represent deleted and inserted tuples, respectively October 31, 2024 PA220 DB for Analytics 55 Maintenance of Materialized Views • Deferred update options: • Lazy • delay refresh until next query on view, then refresh before answering the query • Periodic (Snapshot) • refresh periodically – queries are possibly answered using outdated version of view tuples • widely used in DWs • Event-based • e.g., refresh after a fixed number of updates to underlying data tables October 31, 2024 PA220 DB for Analytics 56 Summary • The term selectivity and its interpretation. • Bitmap indexes are universal, space-efficient • UB-trees, R*-trees, X-trees for multidimensional data • Partitioning • Records used together should be stored together • Mini-dimension • Joins • Computing cross join on dimension table is an option • Materialized views can replace parts of a query • Select what to materialize (not everything) statically or dynamically October 31, 2024 PA220 DB for Analytics 57