Data Warehouse Implementation & Querying PA220: Database systems for data analytics Contents • Implementation of Dimensional modelling • Querying by dimensions • Grouping possibilities • Aggregate functions • Window functions • Case Study: Grocery Store October 20, 2020 PA220 DB for Analytics 2 ER Model vs. Multidimensional Model • The multidimensional model • Its only purpose: data analysis • It is not suitable for OLTP systems • More built in “meaning” • What is important • What describes the important • What we want to optimize • Easy for query operations • Recognized by OLAP/BI tools • Tools offer powerful query facilities based on MD design October 20, 2020 PA220 DB for Analytics 3 The multidimensional model • Data is divided into: • Facts • Dimensions • Facts are the important entity: a sale • Facts have measures that can be aggregated: sales price • Dimensions describe facts • A sale has the dimensions Product, Store and Time • Facts “live” in a multidimensional cube (dice) • Think of an array from programming languages • Goal for dimensional modeling: • Surround facts with as much context (dimensions) as possible • Hint: redundancy may be ok (in well-chosen places) • But you should not try to model all relationships in the data (unlike E/R and OO modeling!) October 20, 2020 PA220 DB for Analytics 4 Cube Example October 20, 2020 PA220 DB for Analytics 5 Cubes • A “cube” may have many dimensions! • More than 3 - the term ”hypercube” is sometimes used • Theoretically no limit for the number of dimensions • Typical cubes have 4-12 dimensions • But only 2-4 dimensions can be viewed at a time • Dimensionality reduced by queries via projection/aggregation • A cube consists of cells • A given combination of dimension values • A cell can be empty (no data for this combination) • A sparse cube has few non-empty cells • A dense cube has many non-empty cells • Cubes become sparser for many/large dimensions October 20, 2020 PA220 DB for Analytics 6 Dimensions • Dimensions are the core of multidimensional databases • Other types of databases do not support dimensions • Dimensions are used for • Selection of data • Grouping of data at the right level of detail • Dimensions consist of dimension values • Product dimension have values ”milk”, ”cream”, … • Time dimension have values ”1/1/2001”, ”2/1/2001”,… • Dimension values may have an ordering • Used for comparing cube data across values • Example: ”percent sales increase compared with last month” • Especially used for Time dimension October 20, 2020 PA220 DB for Analytics 7 Schema Documentation • No well-defined standard • T level corresponds to ALL • Record the measures • You could also use a UML like notation • Modeling and OLAP tools may have their own notation October 20, 2020 PA220 DB for Analytics 8 OLAP Systems • A key concept of OLAP systems is multidimensional analysis: • Examining data from many dimensions. • Show total sales across all products at increasing aggregation levels for a geography dimension, from state to country to region, for 1999 and 2000. • Create a cross-tabular analysis of our operations showing expenses by territory in South America for 1999 and 2000. Include all possible subtotals. • List the top 10 sales representatives in Asia according to 2000 sales revenue for food products and rank their commissions. • Organization of cubes to efficiently answer the requests • Response time of seconds / few minutes October 20, 2020 PA220 DB for Analytics 9 Relational OLAP (ROLAP) • Store data in relational databases and simulate multidimensionality with special schemas • Data stored in relational tables • Star (or snowflake) schemas used for modeling • SQL used for querying • Pros • Leverages investments in relational technology • Scalable (billions of facts) • Flexible, designs easier to change • New, performance enhancing techniques adapted from MOLAP • Indices, materialized views • Cons • Storage use (often 3-4 times MOLAP) • Response times October 20, 2020 PA220 DB for Analytics 10 Multidimensional OLAP (MOLAP) • Physically stage the processed multidimensional information to deliver consistent and rapid response times to end users • Data stored in special multidimensional data structures • E.g., multidimensional array on hard disk • Pros • Less storage use (“foreign keys” not stored) • Faster query response times • Cons • Up till now not so good scalability • Less flexible, e.g., cube must be re-computed when design changes • Does not reuse an existing investment (but often bundled with RDBMS) • Not as open technology October 20, 2020 PA220 DB for Analytics 11 Hybrid OLAP (HOLAP) • Detail data stored in relational tables (ROLAP) • Aggregates stored in multidimensional structures (MOLAP) • Pros • Scalable (as ROLAP) • Fast (as MOLAP) • Cons • High complexity October 20, 2020 PA220 DB for Analytics 12 Question time • Suppose that we want to replace the original Store hierarchy A by a new hierarchy B • How do we modify the schema to reflect it in ROLAP / MOLAP? October 20, 2020 PA220 DB for Analytics 13 Question time (2) • New store is being open and fact table needs to be populated • How do we modify the fact table in ROLAP / MOLAP ? October 20, 2020 PA220 DB for Analytics 14 Relational OLAP Cubes • Two kinds of queries • Navigation queries examine one dimension • SELECT DISTINCT l FROM d [WHERE p] • Aggregation queries summarize fact data • SELECT d1.l1, d2.l2, SUM(f.m) FROM d1, d2, f WHERE f.dk1 = d1.dk1 AND f.dk2 = d2.dk2 [AND p] GROUP BY d1.l1,d2.l2 • Fast, interactive analysis of large amounts of data October 20, 2020 PA220 DB for Analytics 15 OLAP Queries October 20, 2020 PA220 DB for Analytics 16 OLAP Cube in MS Analysis Services Project October 20, 2020 PA220 DB for Analytics 17 Multidimensional database implementation • Microsoft SQL Server Analysis Services (SSAS) in MS SQL Server • Oracle Database OLAP Option within Oracle database October 20, 2020 PA220 DB for Analytics 18 Microsoft SQL Server Analysis Services (SSAS) • Introduced in SQL Server 2008 • Nice features built-in • Analysis Services • Integration Services • Reporting Services • Easy to use • Graphical “Management Studio” and “BI Developer Studio” • Allows all flavors of MOLAP, ROLAP and HOLAP to be used within the same model • Intelligent pre-aggregation (for improving query performance) • Uses the query language MDX (MultiDimensional eXpressions) October 20, 2020 PA220 DB for Analytics 19 OLAP Operations & Queries • DW queries are big queries • Imply a large portion of the data • Mostly read queries • Redundancy a necessity • Materialized views, special purpose indexes, denormalized schemas • Data is refreshed periodically • Daily or weekly • Their purpose is to analyze data • OLAP (OnLine Analytical Processing) October 20, 2020 PA220 DB for Analytics 20 OLAP Operations • Typical OLAP operations • Roll up • Drill down • Slice and dice • Pivot (rotate) (aka crosstab) • Other operations • Aggregate functions • Ranking and comparing • Drill across • Drill through • Data densification (partitioned outer join) October 20, 2020 PA220 DB for Analytics 21 Roll Up • Roll up (drill up) • Taking the current aggregation level of fact values and doing a further aggregation • Summarize data by • Climbing up hierarchy (hierarchical roll up) • By dimensional reduction (dimensional roll up) • Or by a mix of these 2 techniques • Used for obtaining an increased generalization • E.g., from Time.Week to Time.Year October 20, 2020 PA220 DB for Analytics 22 Roll Up • Hierarchical roll ups • Performed on the fact table and some dimension tables by climbing up the attribute hierarchies • E.g., climbed the Time hierarchy to Quarter and Article hierarchy to Prod. group October 20, 2020 PA220 DB for Analytics 23 Roll Up • Dimensional roll ups • Are done solely on the fact table by dropping one or more dimensions • E.g., drop the Client dimension October 20, 2020 PA220 DB for Analytics 24 Roll Up • Climbing above the top in hierarchical roll up • In an ultimate case, hierarchical roll up above the top level of an attribute hierarchy (attribute “ALL”) can be viewed as converting to a dimensional roll up October 20, 2020 PA220 DB for Analytics 25 Drill Down • Drill down (roll down) • Reverse of roll up • Represents a de aggregate operation • From higher level of summary to lower level of summary detailed data • Introducing new dimensions • Requires the existence of materialized finer grained data • You can’t drill if you don’t have the data October 20, 2020 PA220 DB for Analytics 26 Roll Up & Drill Down Example October 20, 2020 PA220 DB for Analytics 27 Slice • Slice: a subset of the multi-dimensional array corresponding to a single value of one or more dimensions and projection on the rest of dimensions • E.g., project on Geo (store) and Time from values corresponding to Laptops in the product dimension 𝜋 𝑆𝑡𝑜𝑟𝑒𝐼𝑑,𝑇𝑖𝑚𝑒𝐼𝑑,𝐴𝑚𝑜𝑢𝑛𝑡 𝜎𝐴𝑟𝑡𝑖𝑐𝑙𝑒𝐼𝑑=<𝐿𝑎𝑝𝑡𝑜𝑝𝐼𝑑> 𝑆𝑎𝑙𝑒𝑠 October 20, 2020 PA220 DB for Analytics 28 Slice • Amounts to equality select condition • WHERE clause in SQL • E.g., slice Laptops October 20, 2020 PA220 DB for Analytics 29 Dice • Dice: amounts to range select condition on one dimension, or to equality select condition on more than one dimension • E.g. range SELECT 𝜋 𝑆𝑡𝑜𝑟𝑒𝐼𝑑,𝑇𝑖𝑚𝑒𝐼𝑑,𝐴𝑚𝑜𝑢𝑛𝑡 𝜎𝐴𝑟𝑡𝑖𝑐𝑙𝑒𝐼𝑑=<𝐿𝑎𝑝𝑡𝑜𝑝𝐼𝑑>∨𝐴𝑟𝑡𝑖𝑐𝑙𝑒𝐼𝑑=<𝐶𝑒𝑙𝑙𝑃ℎ𝑜𝑛𝑒𝐼𝑑> 𝑆𝑎𝑙𝑒𝑠 October 20, 2020 PA220 DB for Analytics 30 Dice • E.g. equality SELECT on 2 dimensions Product and Time • 𝜋 𝑆𝑡𝑜𝑟𝑒𝐼𝑑,𝑇𝑖𝑚𝑒𝐼𝑑,𝐴𝑚𝑜𝑢𝑛𝑡 𝜎𝐴𝑟𝑡𝑖𝑐𝑙𝑒𝐼𝑑=<𝐿𝑎𝑝𝑡𝑜𝑝𝐼𝑑>∧𝑀𝑜𝑛𝑡ℎ𝐼𝑑=<𝐷𝑒𝑐𝑒𝑚𝑏𝑒𝑟> 𝑆𝑎𝑙𝑒𝑠 October 20, 2020 PA220 DB for Analytics 31 Pivoting • Pivot (rotate): re-arranging data for viewing purposes • The simplest view of pivoting is that it selects two dimensions to aggregate the measure • The aggregated values are often displayed in a grid where each point in the (x, y) coordinate system corresponds to an aggregated value of the measure • The x and y coordinate values are the values of the selected two dimensions • The result of pivoting is also called cross tabulation • This is space efficient for dense data only (thus, few dimensions) • Shows data at different “granularities” October 20, 2020 PA220 DB for Analytics 32 Pivoting in ROLAP • Tabular representation for the cross-tabular report with totals. • ALL is a dummy value and stands for all or multiple values. • Probably not as nice to read as the crosstab. • Information content is the same as in the crosstab. • Is more space efficient than crosstab if the data is sparse. October 20, 2020 PA220 DB for Analytics 33 SQL & OLAP • The idea is to • Select by Attributes of Dimensions • E.g., region = „Europe“ • Group by Attributes of Dimensions • E.g., region, month, quarter • Aggregate on measures • E.g., sum(price * volume) • OLAP queries in SQL October 20, 2020 PA220 DB for Analytics 34 SQL & OLAP • No standard query language for OLAP • SQL99 for ROLAP • SQL:2003 OLAP extensions • New SQL commands • GROUPING SETS • ROLLUP • CUBE • Queries of type “top k” • New aggregate functions October 20, 2020 PA220 DB for Analytics 35 SQL & OLAP • Shortcomings of SQL/92 with regard to OLAP queries • Hard or impossible to express in SQL • Multiple aggregations • Comparisons (with aggregation) • Reporting features • Performance penalty • Poor execution of queries with many AND and OR conditions • Lack of support for advanced statistical functions October 20, 2020 PA220 DB for Analytics 36 SQL92 • Multiple aggregations in SQL/92 • Create a 2D spreadsheet that shows sum of sales by maker as well as car model • Each subtotal requires a separate aggregation query October 20, 2020 PA220 DB for Analytics 37 SELECT model, maker, sum(amt) FROM sales GROUP BY model, maker union SELECT model, sum(amt) FROM sales GROUP BY model union SELECT maker, sum(amt) FROM sales GROUP BY maker union SELECT sum(amt) FROM sales SQL92 • Comparisons in SQL/92 • This year’s sales vs. last year’s sales for each product • Requires a self join • CREATE VIEW v_sales AS SELECT prod_id , year, sum(qty) AS sale_sum FROM sales GROUP BY prod_id, year; • SELECT cur.prod_id , cur.year , cur.sale_sum , last.year, last.sale_sum FROM v_sales cur, v_sales last WHERE cur.year = (last.year+1) AND cur.prod_id = last.prod_id October 20, 2020 PA220 DB for Analytics 38 SQL92 • Reporting features in SQL/92 • Too complex to express • RANK (top k) and NTILE (“top X%” of all products) • Median • Running total, moving average, cumulative totals • E.g. moving average over a 3-day window of total sales for each product • CREATE OR REPLACE VIEW v_sales AS SELECT prod_id , time_id , sum(qty) AS sale_sum FROM sales GROUP BY prod_id , time_id • SELECT end.time , avg start.sale_sum ) FROM v_sales start, v_sales end WHERE end.time >= start.time AND end.time <=start.time + 2 GROUP BY end.time October 20, 2020 PA220 DB for Analytics 39 SQL99: Grouping Operators • GROUP BY ROLLUP(gcols) • Roll-up hierarchically • GROUP BY CUBE (gcols) • Roll-up to all possible combinations • GROUP BY gcols1, CUBE(gcols2) • Partial roll-up • GROUP BY GROUPING SETS (gcols1, …, gcolsN) • Explicit specification of roll-ups • GROUP BY groupings1, groupings2, … • Cross-product of groupings • SELECT … GROUPING_ID(gcols)… • Identification of roll-up level October 20, 2020 PA220 DB for Analytics 40 Roll Up • ROLLUP creates subtotals at n+1 levels, where n is the number of grouping columns • Rows that would be produced by GROUP BY without ROLLUP • First-level subtotals • Second-level subtotals • … • A grand total row • It is very helpful for subtotaling along a hierarchical dimensions such as time or geography • ROLLUP(y, m, day) or ROLLUP(country, state, city) • Order of attributes is significant! October 20, 2020 PA220 DB for Analytics 41 Roll Up • Roll up operation, e.g.: • SELECT year, brand, SUM(qty) FROM sales GROUP BY ROLLUP(year, brand); October 20, 2020 PA220 DB for Analytics 42 Cube • CUBE creates 2n combinations of subtotals, where n is the number of grouping columns • Includes all the rows produced by ROLLUP • CUBE is typically most suitable in queries that use columns from multiple dimensions rather than columns representing different levels of a single dimension • e.g., subtotals for all combinations of month, state, and product • Partial CUBE similar to partial ROLLUP October 20, 2020 PA220 DB for Analytics 43 Cube October 20, 2020 PA220 DB for Analytics 44 Cube • Example • SELECT year, brand, SUM(qty) FROM sales GROUP BY CUBE (year, brand); October 20, 2020 PA220 DB for Analytics 45 Grouping Sets • Grouping sets produce just the specified groupings. • No (automatic) rollup is performed. • E.g. GROUPING SET ( (A,B), (D), (C, E, F) ) • Collection of columns in paratheses → composite column • Efficiently replaces the series of UNIONed queries • SELECT dept_name , CAST(NULL AS CHAR(10)) AS job_title , COUNT(*) FROM personnel GROUP BY dept_name UNION ALL SELECT CAST(NULL AS CHAR(8)) AS dept_name , job_title , COUNT(*) FROM personnel GROUP BY job_title; • Can be re written as: • SELECT dept_name , job_title , COUNT(*) FROM Personnel GROUP BY GROUPING SET ( dept_name , job_title ); October 20, 2020 PA220 DB for Analytics 46 Grouping Sets • The issue of NULL values • The new grouping functions generate NULL values at the subtotal levels • How do we tell the difference between “generated NULLs” and “real NULLs” from the data itself? • The GROUPING function call returns 0 for NULL in the data and 1 for generated NULL October 20, 2020 PA220 DB for Analytics 47 Grouping Operators: Equivalences • CUBE(a,b) ≡ GROUPING SETS ((a,b), (a), (b), ()) • ROLLUP(a,b,c) ≡ GROUPING SETS ((a,b,c), (a,b), (a), ()) • GROUP BY GROUPING SETS(a,b,c) ≡ GROUP BY a UNION ALL GROUP BY b UNION ALL GROUP BY c • GROUP BY GROUPING SETS((a,b,c)) ≡ GROUP BY a, b, c • GROUP BY GROUPING SETS(a,b,(b,c)) ≡ GROUP BY a UNION ALL GROUP BY b UNION ALL GROUP BY b, c • GROUP BY GROUPING SETS(a,ROLLUP(b,c)) ≡ GROUP BY a UNION ALL GROUP BY ROLLUP(b, c) October 20, 2020 PA220 DB for Analytics 48 Identification of Groupings • With rollup and cube we must provide a possibility to programmatically determine the rollup level. • The GROUPING_ID function is designed for this. • GROUPING_ID takes a list of grouping columns as an argument. • For each column it returns 1 if its value is NULL because of a rollup, and 0 otherwise. • The list of binary digits is interpreted as a binary number and returned as a base-10 number. • Example: GROUPING_ID(a,b) for CUBE(a,b) October 20, 2020 PA220 DB for Analytics 49 Concatenated Groupings • A concatenated grouping is specified by listing multiple grouping sets, cubes, and rollups, and produces the cross-product of groupings from each grouping set • Example: • GROUP BY GROUPING SETS(a,b), GROUPING SETS(c,d) produces (a,c), (a,d), (b,c), (b,d) • A concise and easy way to generate useful combinations of groupings • A small number of concatenated groupings can generate a large number of final groups • One of the most important uses for concatenated groupings is to generate the aggregates for a hierarchical cube October 20, 2020 PA220 DB for Analytics 50 Hierarchical Cubes • A hierarchical cube is a data set where the data is aggregated along the rollup hierarchy of each of its dimensions. • The aggregations are combined across dimensions • Example: • ROLLUP(year, quarter, month), ROLLUP(category,subcategory,name), ROLLUP(region,subregion,country,state,city) • Produces a total of 4x4x6=96 aggregate groups • Compare to 2^12 = 4096 groupings by CUBE and 96 explicit group specifications • Groups: (year,category,region), (quarter,category,region), (month,category,region), … October 20, 2020 PA220 DB for Analytics 51 Window Functions • The window clause specifies an action to perform over a set of rows • 3 sub clauses: Partitioning, ordering and aggregation grouping • OVER ([PARTITION BY ] ORDER BY []) • SELECT … , AVG(sales) OVER (PARTITION BY region ORDER BY month ASC ROWS 2 PRECEDING) AS SMA3, … FROM … • moving average of 3 rows October 20, 2020 PA220 DB for Analytics 52 Ranking Operators • Ranking operators in SQL • Row numbering is the most basic ranking function • Old style: ROW_NUMBER() returns a column that contains the row’s number within the result set • E.g., SELECT SalesOrderID , CustomerID , ROW_NUMBER() OVER (ORDER BY SalesOrderID ) as RunningCount FROM Sales WHERE SalesOrderID > 10000 ORDER BY SalesOrderID October 20, 2020 PA220 DB for Analytics 53 Ranking Operators • ROW_NUMBER doesn’t consider tied values • Each 2 equal values get 2 different row numbers • The behavior is nondeterministic • Each tied value could have its number switched! • We need something deterministic October 20, 2020 PA220 DB for Analytics 54 Ranking Operators • RANK and DENSE_RANK functions • Allow ranking items in a group • Syntax: • RANK ( ) OVER ( [ query_partition_clause ] order_by_clause ) • DENSE_RANK ( ) OVER ( [ query_partition_clause ] order_by_clause ) • DENSE_RANK leaves no gaps in ranking sequence when there are ties • PERCENT_RANK → (rank - 1) / (total rows - 1) • CUME_DIST - the cumulative distribution • the number of partition rows preceding (or peers with) the current row / total partition rows • The value ranges from 1/N to 1 October 20, 2020 PA220 DB for Analytics 55 Ranking Operators • E.g., SELECT channel, calendar, TO_CHAR(TRUNC(SUM(amount_sold), -6), '9,999,999’) AS sales, RANK() OVER (ORDER BY TRUNC(amount_sold, -6)) DESC) AS rank, DENSE_RANK() OVER (ORDER BY TRUNC(SUM(amount_sold), -6)) DESC) AS dense_rank FROM sales, products … October 20, 2020 PA220 DB for Analytics 56 Ranking Operators • Group ranking - RANK function can operate within groups: the rank gets reset whenever the group changes • A single query can contain more than one ranking function, each partitioning the data into different groups. • PARTITION BY clause October 20, 2020 PA220 DB for Analytics 57 SELECT … RANK() OVER (PARTITION BY channel ORDER BY SUM(amount_sold) DESC) AS rank_by_channel Ntile • NTILE splits a set into equal groups • It divides an ordered partition into buckets and assigns a bucket number to each row in the partition • Buckets are calculated so that each bucket has exactly the same number of rows assigned to it or at most 1 row more than the others • NTILE(4) - quartile • NTILE(100) - percentage • Not a part of the SQL99 standard, but adopted by major vendors October 20, 2020 PA220 DB for Analytics 58 SELECT … NTILE(3) OVER (ORDER BY sales) NT_3 FROM … Window Frame • Obtain a value of a particular row of a window frame defined by window clause (PARTITION BY…) • first_value(expression) • last_value(expression) • nth_value (expression) October 20, 2020 PA220 DB for Analytics 59 SELECT … FIRST_VALUE(sales) OVER (ORDER BY sales) AS lowest_sale CHANNEL CALENDAR SALES LOWEST_SALE Direst sales 02.2016 10,000 4,000 Direst sales 03.2016 9,000 4,000 Internet 02.2016 6,000 4,000 Internet 03.2016 6,000 4,000 Partners 03.2016 4,000 4,000 SELECT … FIRST_VALUE(sales) OVER (PARTITION BY channel ORDER BY sales) AS lowest_sales Window Frame • Access to a row that comes before the current row at a specified physical offset with the current window frame (partition) • LAG(expression [,offset [,default_value]]) • … after the current row • LEAD(expression [,offset [,default_value]] October 20, 2020 PA220 DB for Analytics 60 CHANNEL CALENDAR SALES PREV_SALE Direst sales 02.2016 10,000 NULL Direst sales 03.2016 9,000 10,000 Internet 02.2016 6,000 NULL Internet 03.2016 6,000 6,000 Partners 03.2016 4,000 NULL SELECT … LAG(sales, 1) OVER (PARTITION BY channel ORDER BY calendar) AS prev_sales Data Densification • Enrich the existing “holey” data with default values • Goal: produce dense result, i.e. incl. the weeks 24, 25, and 26. October 20, 2020 PA220 DB for Analytics 61 Data Densification • Partitioned Outer Join • Apply outer join on each partition • Implemented in Oracle October 20, 2020 PA220 DB for Analytics 62 Case Study: Grocery Store • Stock Keeping Units (SKUs) • Point Of Sale (POS) system • Stores/Branches • Promotions • Task: Analyze how promotions affect sales October 20, 2020 PA220 DB for Analytics 63 Case Study: DW Design Steps • Choose the business process(es) to model • Sales • Choose the granularity of the business process • Sales by Product by Store by Promotion by Day • Low granularity is needed • Are individual transactions necessary/feasible? • Choose the dimensions • Time, Branch (store), Advert (promotion), Product • Choose the measures • Dollar_sales, unit_sales, dollar_cost, customer_count • Resisting normalization and preserving browsing • Flat dimension tables makes browsing easy and fast October 20, 2020 PA220 DB for Analytics 64 Case Study: Dimensions • Time dimension • Explicit time dimension is needed (events, holidays,..) • Product dimension • Many-level hierarchy allows drill-down/roll-up • Many descriptive attributes (often more than 50) • Branch (store) dimension • Many descriptive attributes • Advert (promotion) dimension • Example of a causal dimension • Used to see if promotions work/are profitable • Ads, price reductions, end-of-aisle displays, coupons October 20, 2020 PA220 DB for Analytics 65 Case Study: Measures (Facts) • All additive across all dimensions • Dollar_sales • Unit_sales • Dollar_cost • Gross profit (derived) • Computed from sales and cost: sales – cost • Additive • Gross margin (derived) • Computed from gross profit and sales: (sales – cost)/cost • Non-additive across all dimensions • Customer_count • Additive across time, promotion, and store • Non-additive across product. Why? • Semi-additive October 20, 2020 PA220 DB for Analytics 66 Case Study: DW Size • Estimated number of fact records: • Time dimension: 2 years = 730 days • Branch dimension: 300 stores reporting each day • Product dimension: 30,000 products, only 3000 sell per day • Promotion dimension: 5000 combinations, but a product only appears in one combination per day • 730*300*3000*1 = 657,000,000 • Total data warehouse size: 657,000,000 facts* 8 fields/fact * 4 bytes/field = 21 GB • Number of fields: 4 FKs + 4 measures = 8 fields • Assuming sizes of dimensions negligible • Small size (by today’s standard), feasible to store at transaction level detail October 20, 2020 PA220 DB for Analytics 67 Case Study: Data Model t(time) t_id (PK) t_day_name t_cal_month_num t_cal_year t_cal_week_num t_date t_holiday … p(roduct) p_id p_name p_desc p_cat p_subcat p_list_price b(ranch) b_id b_name b_city b_country b_country_code b_region … a(dvert) a_id a_name a_aisle_display a_coupon a_price_reduction m(edia) m_id m_desc m_class … c(ustomer) c_id c_first_name c_last_name c_country c_country_code c_region s(ales) s_p_id (FK) s_c_id (FK) s_t_id (FK) s_m_id (FK) s_a_id (FK) s_b_id (FK) s_quant_sold s_dollar_sold s_dollar_cost s_custom_cnt Case Study: Common View • A common view is created to simplify queries October 20, 2020 PA220 DB for Analytics 69 CREATE VIEW spcbatm SELECT * FROM sales s JOIN product p ON (s_p_id=p_id) JOIN customer c ON (s_c_id=c_id) JOIN branch b ON (s_b_id=b_id) JOIN advert a ON (s_a_id=a_id) JOIN time t ON (s_t_id=t_id) JOIN media m ON (s_m_id=m_id) Case Study: Roll up Example • Rollup from right to left • Computes and combines the following groupings • m_desc, t_cal_month_desc, c_country_code • m_desc, t_cal_month_desc • m_desc • October 20, 2020 PA220 DB for Analytics 70 SELECT m_desc, t_cal_month_desc, c_country_code, SUM(s_dollar_sold) FROM spcbatm WHERE m_desc IN ('Direct Sales', 'Internet') AND t_cal_month_desc IN ('2000-09', '2000-10') AND c_country_code IN ('GB', 'US') GROUP BY ROLLUP(m_desc, t_cal_month_desc, c_country_code); Case Study: Roll up Example October 20, 2020 PA220 DB for Analytics 71 C_ Case Study: Partial Roll up • m_desc is always present and not part of the rollup hierarchy • Computes and combines the following groupings • m_desc, t_cal_month_desc, c_country_code • m_desc, t_cal_month_desc • m_desc October 20, 2020 PA220 DB for Analytics 72 SELECT m_desc, t_cal_month_desc, c_country_code, SUM(s_dollar_sold) FROM spcbatm WHERE m_desc IN ('Direct Sales', 'Internet') AND t_cal_month_desc IN ('2000-09', '2000-10') AND c_country_code IN ('GB', 'US') GROUP BY m_desc, ROLLUP(t_cal_month_desc, c_country_code); Case Study: Partial Roll up October 20, 2020 PA220 DB for Analytics 73 C_ SUM(S_DOLLAR_SOLD) Case Study: Cube Example • Produces all possible roll-up combinations • Computes and combines the following groupings • m_desc, t_cal_month_desc, n_iso_code • m_desc, t_cal_month_desc • m_desc, n_iso_code • t_cal_month, n_iso_code • m_desc • October 20, 2020 PA220 DB for Analytics 74 SELECT m_desc, t_cal_month_desc, c_country_code, SUM(s_dollar_sold) FROM spcbatm WHERE m_desc IN ('Direct Sales', 'Internet') AND t_cal_month_desc IN ('2000-09', '2000-10') AND c_country_code IN ('GB', 'US') GROUP BY CUBE(m_desc, t_cal_month_desc, c_country_code); Case Study: Cube Example October 20, 2020 PA220 DB for Analytics 75 SUM(S_DOLLAR_SOLD) Grouping Sets October 20, 2020 PA220 DB for Analytics 76 SELECT m_desc, t_cal_month_desc, c_country_code, SUM(s_dollar_sold) FROM spcbatm WHERE m_desc IN ('Direct Sales', 'Internet’) AND t_cal_month_desc IN ('2000-09', '2000-10’) AND c_country_code IN ('GB', 'US') GROUP BY GROUPING SETS ((m_desc, t_cal_month_desc, c_country_code), (m_desc, c_country_code), (t_cal_month_desc, c_country_code)); Grouping Sets October 20, 2020 PA220 DB for Analytics 77 SUM(S_DOLLAR_SOLD)C_ Grouping ID Example • Replaces all NULLs from rollup with string ‘*’. • Leaves NULL that are not the result of rollup untouched. • Could easily make selective replacements of NULL. October 20, 2020 PA220 DB for Analytics 78 SELECT CASE WHEN GROUPING_ID(m_desc)=1 THEN ‘*' ELSE m_desc END, CASE WHEN GROUPING_ID(c_country_code)=1 THEN ‘*' ELSE c_country_code END, SUM(s_dollar_sold) FROM spcbatm WHERE m_desc IN ('Direct Sales', 'Internet’) AND t_cal_month_desc= '2000-09’ AND c_country_code IN ('GB', 'US') GROUP BY CUBE(m_desc, c_country_code); Grouping ID Example October 20, 2020 PA220 DB for Analytics 79 SUM(S_DOLLAR_SOLD) Ranking Example • Rank the media (‘Internet’ versus ‘Direct sales’) used for selling products according to their dollar sales. Use the number of unit sales to break ties. Do the analysis for August until November 2000. October 20, 2020 PA220 DB for Analytics 80 SELECT m_desc, t_cal_month_desc, SUM(s_dollar_sold), SUM(s_quant_sold), RANK() OVER (ORDER BY SUM(s_dollar_sold) DESC, SUM(s_quant_sold) DESC) AS Rank FROM spcbatm WHERE m_desc IN ('Direct Sales', 'Internet'), AND t_cal_month_desc IN ('2000-08', '2000-09’, '2000-10', '2000-11') GROUP BY m_desc, t_cal_month_desc; Ranking Example (2) • Determine the two least and most successful sales media, respectively (in terms of total amount sold). October 20, 2020 PA220 DB for Analytics 81 SELECT * FROM (SELECT m_desc, SUM(s_dollar_sold), RANK() OVER (ORDER BY SUM(s_dollar_sold)) worst, RANK() OVER (ORDER BY SUM(s_dollar_sold) DESC) best FROM spcbatm GROUP BY m_desc) WHERE worst < 3 OR best < 3; SUM(S_DOLLAR_SOLD) Ranking Example (3) • Determine the output of the following statement: October 20, 2020 PA220 DB for Analytics 82 SELECT c_id, p_id, RANK() OVER (ORDER BY p_id) AS r1, RANK() OVER (ORDER BY c_id) AS r2, RANK() OVER (ORDER BY 1) AS r3, RANK() OVER (PARTITION BY c_id ORDER BY p_id) AS r4, RANK() OVER (PARTITION BY p_id ORDER BY c_id) AS r5 FROM spcbatm WHERE c_id in (214, 608, 699) AND p_id in (42, 98, 123) GROUP BY c_id, p_id; Summary • ROLAP is a good option • exploits existing investments • SQL:2003 has added a lot of support for OLAP operations • SQL is not just select-from-where • Extensions of GROUP BY clause • rollup, grouping sets, cube • functions to identify grouping • Case study for Grocery store October 20, 2020 PA220 DB for Analytics 83