Data Warehouse Implementation & Querying PA220: Database systems for data analytics Contents • Implementation of dimensional model • Querying by dimensions • Grouping possibilities • Aggregate functions • Window functions • Case Study: Grocery Store October 17, 2024 PA220 DB for Analytics 2 Data Cube • Implication of using dimensional model (star schema) • A symbolic representation of aggregated data October 17, 2024 PA220 DB for Analytics 3 (Data) Cubes • A “cube” may have many dimensions! • More than 3 - the term ”hyper-cube” can sometimes be used • Theoretically no limit for the number of dimensions • Typical cubes have 4-12 dimensions • But only 2-4 dimensions can be viewed at a time • Dimensionality reduced by queries via projection/aggregation • A cube consists of cells • A given combination of dimension values • A cell can be empty (no data for this combination) • A sparse cube has few non-empty cells • A dense cube has many non-empty cells • Cubes become sparser for many/large dimensions October 17, 2024 PA220 DB for Analytics 4 Dimensions • Dimensions are used for • Selection of data • Grouping of data at the right level of detail • Dimensions consist of dimension values • Product dimension have values ”milk”, ”cream”, … • Time dimension have values ”1/1/2001”, ”2/1/2001”,… • Dimension values may have an ordering • Used for comparing cube data across values • Example: ”percent sales increase compared with last month” • Especially used for Time dimension • Dimensions are the core of multidimensional databases • Other types of databases do not support dimensions October 17, 2024 PA220 DB for Analytics 5 Schema Documentation • No well-defined standard • T level corresponds to ALL • Record the measures • You could also use a UML like notation • Modeling and OLAP tools may have their own notation October 17, 2024 PA220 DB for Analytics 6 OLAP Systems • A key concept of OLAP systems is multidimensional analysis: • Examining data from many dimensions. • Show total sales across all products at increasing aggregation levels for a geography dimension, from state to country to geographical region, for 1999 and 2000. • Create a cross-tabular analysis of our operations showing expenses by territory in South America for 1999 and 2000. Include all possible subtotals. • List the top 10 sales representatives in Asia according to 2000 sales revenue for food products and rank their commissions. • Organization of cubes to efficiently answer the requests • Response time of seconds / few minutes October 17, 2024 PA220 DB for Analytics 7 Relational OLAP (ROLAP) • Store data in relational databases and simulate multidimensionality with special schemas • Data stored in relational tables • Star (or snowflake) schemas used for modeling • SQL used for querying • Pros • Leverages investments in relational technology • Scalable (billions of facts) • Flexible, designs easier to change • New, performance enhancing, techniques adapted from MOLAP • Indices, materialized views • Cons • Storage use (often 3-4 times MOLAP) • Response times October 17, 2024 PA220 DB for Analytics 8 Multidimensional OLAP (MOLAP) • Physically stages the processed multidimensional information to deliver consistent and rapid response times to end users • Data stored in special multidimensional data structures • E.g., multidimensional array A on hard disk • Pros • Less storage use (“foreign keys” not stored) • Faster query response times • Direct access to a cell by giving position, e.g. A[1][2] → 2 • Cons • Up till now not so good scalability • Less flexible, e.g., cube must be re-computed when design changes • Does not reuse an existing investment (but often bundled with RDBMS) • Not as open technology October 17, 2024 PA220 DB for Analytics 9 Hybrid OLAP (HOLAP) • Detail data stored in relational tables (ROLAP) • Aggregates stored in multidimensional structures (MOLAP) • Pros • Scalable (as ROLAP) • Fast (as MOLAP) • Cons • High complexity October 17, 2024 PA220 DB for Analytics 10 Question time • Suppose that we want to replace the original Store hierarchy A by a new hierarchy B • How do we modify the schema to reflect it in ROLAP / MOLAP? October 17, 2024 PA220 DB for Analytics 11 Question time (2) • New store is being open and fact table needs to be populated • How do we modify the fact table in ROLAP / MOLAP ? October 17, 2024 PA220 DB for Analytics 12 Relational OLAP Cubes • Two kinds of queries • Navigation queries examine one dimension • SELECT DISTINCT l FROM d [WHERE p] • Aggregation queries summarize fact data • SELECT d1.l1, d2.l2, SUM(f.m) FROM d1, d2, f WHERE f.dk1 = d1.dk1 AND f.dk2 = d2.dk2 [AND p] GROUP BY d1.l1,d2.l2 • Fast, interactive analysis of large amounts of data October 17, 2024 PA220 DB for Analytics 13 A mere symbolic representation as a cube OLAP Queries October 17, 2024 PA220 DB for Analytics 14 OLAP Cube in MS Analysis Services Project October 17, 2024 PA220 DB for Analytics 15 Multidimensional database implementation • Microsoft SQL Server Analysis Services (SSAS) in MS SQL Server • Oracle Database OLAP Option within Oracle database October 17, 2024 PA220 DB for Analytics 16 Microsoft SQL Server Analysis Services (SSAS) • Introduced in SQL Server 2008 • Nice built-in features • Analysis Services • Integration Services • Reporting Services • Easy to use • Graphical “Management Studio” and “BI Developer Studio” • Allows all flavors of MOLAP, ROLAP and HOLAP to be used within the same model • Intelligent pre-aggregation (for improving query performance) • Uses the query language MDX (MultiDimensional eXpressions) October 17, 2024 PA220 DB for Analytics 17 OLAP Operations & Queries • Redundancy is necessity • Materialized views, special purpose indexes, denormalized schemas • Data is refreshed periodically • Daily or weekly • DW queries are big queries • Imply a large portion of the data • Mostly read queries October 17, 2024 PA220 DB for Analytics 18 OLAP Queries • Typical OLAP query operations • Roll up • Drill down • Slice and dice • Pivot (rotate) (aka crosstab) • Other operations • Drill across • navigate from a fact table to another one by using a shared dimension • Drill through • allows users to view relational transactions that make up a multidimensional point in cube • Data densification (partitioned outer join) • Based on • Aggregate functions • Ranking and comparing operators October 17, 2024 PA220 DB for Analytics 19 Roll Up • Roll up (drill up) • Taking the current aggregation level of fact values and doing a further aggregation • Summarize data by • Climbing up hierarchy (hierarchical roll up) • By dimensional reduction (dimensional roll up) • Or by a mix of these 2 techniques • Used for obtaining an increased generalization • E.g., from Time.Week to Time.Year October 17, 2024 PA220 DB for Analytics 20 Roll Up • Hierarchical roll ups • Performed on the fact table and some dimension tables by climbing up the attribute hierarchies • E.g., climbed the Time hierarchy to Quarter and Article hierarchy to Prod. group October 17, 2024 PA220 DB for Analytics 21 Roll Up • Dimensional roll ups • Are done solely on the fact table by dropping one or more dimensions • E.g., drop the Client dimension October 17, 2024 PA220 DB for Analytics 22 Roll Up • Climbing above the top in hierarchical roll up • In an ultimate case, hierarchical roll up above the top level of an attribute hierarchy (attribute “ALL”) can be viewed as converting to a dimensional roll up October 17, 2024 PA220 DB for Analytics 23 Drill Down • Drill down (roll down) • Reverse of roll up • Represents a de-aggregate operation • From higher level of summary to lower level of summary detailed data • Introducing new dimensions • Requires the existence of materialized finer grained data • You can not drill if you do not have the data October 17, 2024 PA220 DB for Analytics 24 Roll Up & Drill Down Example October 17, 2024 PA220 DB for Analytics 25 Slice • Slice: a subset of the multi-dimensional array corresponding to a single value of one or more dimensions and projection on the rest of dimensions • E.g., project on Geo (store) and Time from values corresponding to Laptops (cat. of products) in the product dimension October 17, 2024 PA220 DB for Analytics 26 SELECT store_id, time_id, amount FROM Sales WHERE article_id = 𝜋 𝑆𝑡𝑜𝑟𝑒𝐼𝑑,𝑇𝑖𝑚𝑒𝐼𝑑,𝐴𝑚𝑜𝑢𝑛𝑡 𝜎𝐴𝑟𝑡𝑖𝑐𝑙𝑒𝐼𝑑=<𝐿𝑎𝑝𝑡𝑜𝑝𝐼𝑑> 𝑆𝑎𝑙𝑒𝑠 Slice • Amounts to equality select condition on one dimension • WHERE clause in SQL • E.g., slice Laptops October 17, 2024 PA220 DB for Analytics 27 Dice • Dice: amounts to range select condition on one dimension, or to equality select condition on more than one dimension • E.g., range SELECT October 17, 2024 PA220 DB for Analytics 28 𝜋 𝑆𝑡𝑜𝑟𝑒𝐼𝑑,𝑇𝑖𝑚𝑒𝐼𝑑,𝐴𝑚𝑜𝑢𝑛𝑡 𝜎𝐴𝑟𝑡𝑖𝑐𝑙𝑒𝐼𝑑=<𝐿𝑎𝑝𝑡𝑜𝑝𝐼𝑑>∨𝐴𝑟𝑡𝑖𝑐𝑙𝑒𝐼𝑑=<𝐶𝑒𝑙𝑙𝑃ℎ𝑜𝑛𝑒𝐼𝑑> 𝑆𝑎𝑙𝑒𝑠 SELECT store_id, time_id, amount FROM Sales WHERE article_id = OR article_id = Dice • E.g., equality SELECT on 2 dimensions Product and Time October 17, 2024 PA220 DB for Analytics 29 SELECT store_id, time_id, amount FROM Sales WHERE article_id = AND month_id = 𝜋 𝑆𝑡𝑜𝑟𝑒𝐼𝑑,𝑇𝑖𝑚𝑒𝐼𝑑,𝐴𝑚𝑜𝑢𝑛𝑡 𝜎𝐴𝑟𝑡𝑖𝑐𝑙𝑒𝐼𝑑=<𝐿𝑎𝑝𝑡𝑜𝑝𝐼𝑑>∧𝑀𝑜𝑛𝑡ℎ𝐼𝑑=<𝐷𝑒𝑐𝑒𝑚𝑏𝑒𝑟> 𝑆𝑎𝑙𝑒𝑠 Pivoting • Pivot (rotate): re-arrange data for viewing purposes • The simplest view of pivoting is that it selects two dimensions to aggregate the measure • The aggregated values are often displayed in a grid where each point in the (x, y) coordinate system corresponds to an aggregated value of the measure • The x and y coordinate values are the values of the selected two dimensions • Shows data at different “granularities” • The result of pivoting is also called cross tabulation • This is space efficient for dense data only (thus, few dimensions) October 17, 2024 PA220 DB for Analytics 30 Pivoting in ROLAP • Tabular representation for the cross-tabular report with totals. • ALL is a dummy value and stands for all or multiple values. (≈ NULL in SQL) • Probably not as nice to read as the crosstab. • Information content is the same as in the crosstab. • Is more space efficient than crosstab if the data is sparse. October 17, 2024 PA220 DB for Analytics 31 SQL & OLAP • The idea is to • Select by Attributes of Dimensions • E.g., region = „Europe“ • Group by Attributes of Dimensions • E.g., region, month, quarter • Aggregate on measures • E.g., sum(price * volume) • OLAP queries in SQL October 17, 2024 PA220 DB for Analytics 32 SQL & OLAP • No standard query language for OLAP • SQL99 for ROLAP • SQL:2003 OLAP extensions • New SQL SELECT clauses • GROUPING SETS • ROLLUP • CUBE • Queries of type “top k” • New aggregate functions October 17, 2024 PA220 DB for Analytics 33 SQL & OLAP • Shortcomings of SQL/92 with regard to OLAP queries • Hard or impossible to express in SQL • Multiple aggregations • Comparisons (with aggregation) • Reporting features • Performance penalty • Poor execution of queries with many AND and OR conditions • Lack of support for advanced statistical functions October 17, 2024 PA220 DB for Analytics 34 SQL92 • Multiple aggregations in SQL/92 • Create a 2D spreadsheet that shows sum of sales by maker as well as car model • Each subtotal requires a separate aggregation query October 17, 2024 PA220 DB for Analytics 35 SELECT model, maker, sum(amt) FROM sales GROUP BY model, maker union SELECT model, sum(amt) FROM sales GROUP BY model union SELECT maker, sum(amt) FROM sales GROUP BY maker union SELECT sum(amt) FROM sales SQL92 • Comparisons in SQL/92 • This year’s sales vs. last year’s sales for each product • Requires a self join • CREATE VIEW v_sales AS SELECT prod_id , year, sum(qty) AS sale_sum FROM sales GROUP BY prod_id, year; • SELECT cur.prod_id , cur.year , cur.sale_sum , last.year, last.sale_sum FROM v_sales cur, v_sales last WHERE cur.year = (last.year+1) AND cur.prod_id = last.prod_id October 17, 2024 PA220 DB for Analytics 36 SQL92 • Reporting features in SQL/92 • Too complex to express • RANK (top k) and NTILE (“top X%” of all products) • Median • Running/rolling/moving total, moving average, cumulative totals • E.g., moving average over a 3-day window of total sales for each product • CREATE OR REPLACE VIEW v_sales AS SELECT prod_id, time_id, sum(qty) AS sale_sum FROM sales GROUP BY prod_id, time_id • SELECT end.time, avg( start.sale_sum ) FROM v_sales NATURAL JOIN d_time start, v_sales NATURAL JOIN d_time end WHERE end.time >= start.time AND end.time <=start.time + 2 GROUP BY end.time October 17, 2024 PA220 DB for Analytics 37 SQL99: Grouping Operators • GROUP BY ROLLUP(gcols) • Roll-up hierarchically • GROUP BY CUBE (gcols) • Roll-up to all possible combinations • GROUP BY gcols1, CUBE(gcols2) • Partial roll-up • GROUP BY GROUPING SETS (gcols1, …, gcolsN) • Explicit specification of roll-ups • GROUP BY groupings1, groupings2, … • Cross-product of groupings • SELECT … GROUPING_ID(gcols)… • Identification of roll-up level October 17, 2024 PA220 DB for Analytics 38 Roll Up • ROLLUP creates subtotals at n+1 levels, where n is the number of grouping columns • Rows that would be produced by GROUP BY without ROLLUP • First-level subtotals • Second-level subtotals • … • A grand total row • It is very helpful for subtotaling along a hierarchical dimensions such as time or geography • ROLLUP(y, m, day) or ROLLUP(country, state, city) • Order of attributes is significant! October 17, 2024 PA220 DB for Analytics 39 Roll Up • Roll up operation, e.g.: • SELECT year, brand, SUM(qty) FROM sales GROUP BY ROLLUP(year, brand); October 17, 2024 PA220 DB for Analytics 40 Cube • CUBE creates 2n combinations of subtotals, where n is the number of grouping columns • Includes all the rows produced by ROLLUP • CUBE is typically most suitable in queries that • use columns from multiple dimensions • rather than columns representing different levels of a single dimension • e.g., subtotals for all combinations of month, state, and product • Partial CUBE similar to partial ROLLUP October 17, 2024 PA220 DB for Analytics 41 Cube October 17, 2024 PA220 DB for Analytics 42 Cube • Example • SELECT year, brand, SUM(qty) FROM sales GROUP BY CUBE (year, brand); October 17, 2024 PA220 DB for Analytics 43 Grouping Sets • Grouping sets produce just the specified groupings. • No (automatic) rollup is performed. • E.g., GROUPING SETS ( (A,B), (D), (C, E, F) ) • Collection of columns in paratheses → composite column • Efficiently replaces the series of UNIONed queries • SELECT dept_name , CAST(NULL AS CHAR(10)) AS job_title , COUNT(*) FROM personnel GROUP BY dept_name UNION ALL SELECT CAST(NULL AS CHAR(8)) AS dept_name , job_title , COUNT(*) FROM personnel GROUP BY job_title; • Can be rewritten as: • SELECT dept_name , job_title , COUNT(*) FROM Personnel GROUP BY GROUPING SETS ( (dept_name), (job_title) ); October 17, 2024 PA220 DB for Analytics 44 Grouping Sets • The issue of NULL values • The new grouping functions generate NULL values at the subtotal levels • How do we tell the difference between “generated NULLs” and “real NULLs” from the data itself? • The GROUPING function call returns 0 for NULL in the data and 1 for generated NULL October 17, 2024 PA220 DB for Analytics 45 Grouping Operators: Equivalences • CUBE(a,b) ≡ GROUPING SETS ((a,b), (a), (b), ()) • ROLLUP(a,b,c) ≡ GROUPING SETS ((a,b,c), (a,b), (a), ()) • GROUP BY GROUPING SETS (a,b,c) ≡ GROUP BY a UNION ALL GROUP BY b UNION ALL GROUP BY c • GROUP BY GROUPING SETS ((a,b,c)) ≡ GROUP BY a, b, c • GROUP BY GROUPING SETS (a,b,(b,c)) ≡ GROUP BY a UNION ALL GROUP BY b UNION ALL GROUP BY b, c • GROUP BY GROUPING SETS (a,ROLLUP(b,c)) ≡ GROUP BY a UNION ALL GROUP BY ROLLUP(b, c) October 17, 2024 PA220 DB for Analytics 46 Identification of Groupings • With rollup and cube, we must provide a possibility to determine the rollup level programmatically. • The GROUPING_ID function is designed for this. • GROUPING_ID takes a list of grouping columns as an argument. • For each column it returns 1 if its value is NULL because of a rollup, and 0 otherwise. • The list of binary digits is interpreted as a binary number and returned as a base-10 number. • Example for CUBE(a,b): October 17, 2024 PA220 DB for Analytics 47 Concatenated Groupings • A concatenated grouping is specified by listing multiple • grouping sets, cubes, and rollups, and produces the cross-product of groupings from each grouping set • Example: • GROUP BY GROUPING SETS (a,b), GROUPING SETS (c,d) produces (a,c), (a,d), (b,c), (b,d) • A concise and easy way to generate useful combinations of groupings • A small number of concatenated groupings can generate a large number of final groups • One of the most important uses for concatenated groupings is to generate the aggregates for a hierarchical cube October 17, 2024 PA220 DB for Analytics 48 Hierarchical Cubes • A hierarchical cube is a data set where the data is aggregated along the rollup hierarchy of each of its dimensions. • The aggregations are combined across dimensions • Example: • ROLLUP(year, quarter, month), ROLLUP(category,subcategory,name), ROLLUP(region,subregion,country,state,city) • Groups: (year,category,region), (quarter,category,region), (month,category,region), … • Produces a total of 4x4x6=96 aggregate groups • Compare to 2^11 = 2048 groupings by CUBE and 96 explicit group specifications October 17, 2024 PA220 DB for Analytics 49 Window Functions • The window clause specifies an action to perform over a set of rows • 3 sub-clauses: partitioning, ordering and aggregation grouping • OVER ([PARTITION BY ] ORDER BY []) • Moving average of 3 rows: • SELECT … , AVG(sales) OVER (PARTITION BY region ORDER BY month ASC ROWS 2 PRECEDING) AS SMA3, … FROM … October 17, 2024 PA220 DB for Analytics 50 Ranking Operators • Ranking operators in SQL • Row numbering is the most basic ranking function • Old style: ROW_NUMBER() returns a column that contains the row’s number within the result set • E.g., SELECT SalesOrderID , CustomerID , ROW_NUMBER() OVER (ORDER BY SalesOrderID ) as RunningCount FROM Sales WHERE SalesOrderID > 10000 ORDER BY SalesOrderID October 17, 2024 PA220 DB for Analytics 51 Ranking Operators • ROW_NUMBER doesn’t consider tied values • Each 2 equal values get 2 different row numbers • The behavior is nondeterministic • Each tied value could have its number switched! • We need something deterministic October 17, 2024 PA220 DB for Analytics 52 Ranking Operators • RANK and DENSE_RANK functions • Allow ranking items in a group • Syntax: • RANK ( ) OVER ( [ query_partition_clause ] order_by_clause ) • DENSE_RANK ( ) OVER ( [ query_partition_clause ] order_by_clause ) • DENSE_RANK leaves no gaps in ranking sequence when there are ties • PERCENT_RANK → (rank - 1) / (total rows - 1) • CUME_DIST - the cumulative distribution • the number of partition rows preceding (or peers with) the current row / total partition rows • The value ranges from 1/N to 1 October 17, 2024 PA220 DB for Analytics 53 Ranking Operators • E.g., SELECT channel, calendar, TO_CHAR(TRUNC(SUM(amount_sold), -6), '9,999,990’) AS sales, RANK() OVER (ORDER BY TRUNC(amount_sold, -6)) DESC) AS rank, DENSE_RANK() OVER (ORDER BY TRUNC(SUM(amount_sold), -6)) DESC) AS dense_rank FROM sales, products WHERE … GROUP BY channel, calendar October 17, 2024 PA220 DB for Analytics 54 Ranking Operators • Group ranking - RANK function can operate within groups: the rank gets reset whenever the group changes • A single query can contain more than one ranking function, each partitioning the data into different groups. • PARTITION BY clause October 17, 2024 PA220 DB for Analytics 55 SELECT … RANK() OVER (PARTITION BY channel ORDER BY SUM(amount_sold) DESC) AS rank_by_channel Ntile • NTILE splits a set into equal-sized groups • It divides an ordered partition into buckets and assigns a bucket number to each row in the partition • Buckets are calculated so that each bucket has exactly the same number of rows assigned to it or at most 1 row more than the others • NTILE(4) - quartile • NTILE(100) - percentage • Not a part of the SQL99 standard, but adopted by major vendors October 17, 2024 PA220 DB for Analytics 56 SELECT … NTILE(3) OVER (ORDER BY sales) NT_3 FROM … Window Frame • Obtain a value of a particular row of a window frame defined by window clause (PARTITION BY…) • first_value(expression) • last_value(expression) • nth_value (expression) October 17, 2024 PA220 DB for Analytics 57 SELECT … FIRST_VALUE(sales) OVER (ORDER BY sales) AS lowest_sale CHANNEL CALENDAR SALES LOWEST_SALE Direst sales 02.2016 10,000 4,000 Direst sales 03.2016 9,000 4,000 Internet 02.2016 6,000 4,000 Internet 03.2016 6,000 4,000 Partners 03.2016 4,000 4,000 SELECT … FIRST_VALUE(sales) OVER (PARTITION BY channel ORDER BY sales) AS lowest_sale_per_channel Window Frame • Access to a row that comes before the current row at a specified physical offset with the current window frame (partition) • LAG(expression [,offset [,default_value]]) • … after the current row • LEAD(expression [,offset [,default_value]] October 17, 2024 PA220 DB for Analytics 58 CHANNEL CALENDAR SALES PREV_SALE Direst sales 02.2016 10,000 NULL Direst sales 03.2016 9,000 10,000 Internet 02.2016 6,000 NULL Internet 03.2016 6,000 6,000 Partners 03.2016 4,000 NULL SELECT … LAG(sales, 1) OVER (PARTITION BY channel ORDER BY calendar) AS prev_sale Data Densification • Enrich the existing “holey” data with default values • Goal: produce dense result for all products • i.e., incl. the weeks 24, 25, and 26. October 17, 2024 PA220 DB for Analytics 59 Data Densification • Partitioned Outer Join • Apply outer join on each partition • Implemented in Oracle October 17, 2024 PA220 DB for Analytics 60 Case Study: Grocery Store (from prev. lecture) • Stock Keeping Units (SKUs) • Point Of Sale (POS) system • Stores/Branches • Promotions • Task: Analyze how promotions (adverts) affect sales October 17, 2024 PA220 DB for Analytics 61 Case Study: Data Model t(time) t_id (PK) t_day_name t_cal_month_num t_cal_year t_cal_week_num t_date t_holiday … p(roduct) p_id p_name p_desc p_cat p_subcat p_list_price b(ranch) b_id b_name b_city b_country b_country_code b_region … a(dvert) a_id a_name a_aisle_display a_coupon a_price_reduction m(edia) m_id m_desc m_class … c(ustomer) c_id c_first_name c_last_name c_country c_country_code c_region s(ales) s_p_id (FK) s_c_id (FK) s_t_id (FK) s_m_id (FK) s_a_id (FK) s_b_id (FK) s_quant_sold s_dollar_sold s_dollar_cost s_custom_cnt October 17, 2024 PA220 DB for Analytics 62 Case Study: Dimensions • Time dimension • Explicit time dimension is needed (events, holidays,..) • Product dimension • Many-level hierarchy allows drill-down/roll-up • Many descriptive attributes (often more than 50) • Branch (store) dimension • Many descriptive attributes • Advert (promotion) dimension • Example of a causal dimension • Used to see if promotions work/are profitable • Ads, price reductions, end-of-aisle displays, coupons October 17, 2024 PA220 DB for Analytics 63 Case Study: Measures (Facts) • Dollar_sales, Unit_sales, Dollar_cost • All additive across all dimensions • Gross profit (derived) • Computed from sales and cost: sales – cost • Additive • Gross margin (derived) • Computed from gross profit and sales: (sales – cost)/cost • Non-additive • sum() makes not sense across any dimension since it is a ratio; rather avg() is sensible • Customer_count • Semi-additive • i.e., additive across time, promotion, and store • i.e., non-additive across product. Why? October 17, 2024 PA220 DB for Analytics 64 Case Study: Common View • A common view is created to simplify queries October 17, 2024 PA220 DB for Analytics 66 CREATE VIEW spcbatm SELECT * FROM sales s JOIN product p ON (s_p_id=p_id) JOIN customer c ON (s_c_id=c_id) JOIN branch b ON (s_b_id=b_id) JOIN advert a ON (s_a_id=a_id) JOIN time t ON (s_t_id=t_id) JOIN media m ON (s_m_id=m_id) Case Study: Query Formulation • Task: Analyze how promotions (adverts) affect sales • Assume specific customer countries (GB and US), media types (Direct and Internet) and months (Sep, Oct 2000). October 17, 2024 PA220 DB for Analytics 67 SELECT m_desc, t_cal_month_desc, c_country_code, SUM(s_dollar_sold) FROM spcbatm WHERE m_desc IN ('Direct Sales', 'Internet') AND t_cal_month_desc IN ('2000-09', '2000-10') AND c_country_code IN ('GB', 'US') GROUP BY m_desc, t_cal_month_desc, c_country_code; s(ales) s_p_id (FK) s_c_id (FK) s_t_id (FK) s_m_id (FK) s_a_id (FK) s_b_id (FK) s_quant_sold s_dollar_sold s_dollar_cost s_custom_cnt t(time) s_cal_month_desc m(edia) m_desc c(ustomer) c_country_code Case Study: Make a pivot table • Compute subtotals: • Rollup from right to left • Computes and combines the following groupings • m_desc, t_cal_month_desc, c_country_code • m_desc, t_cal_month_desc • m_desc • October 17, 2024 PA220 DB for Analytics 68 SELECT m_desc, t_cal_month_desc, c_country_code, SUM(s_dollar_sold) FROM spcbatm WHERE m_desc IN ('Direct Sales', 'Internet') AND t_cal_month_desc IN ('2000-09', '2000-10') AND c_country_code IN ('GB', 'US') GROUP BY ROLLUP(m_desc, t_cal_month_desc, c_country_code); Case Study: Roll up Example October 17, 2024 PA220 DB for Analytics 69 C_ Case Study: Make a pivot table (2) • Partial Roll up • m_desc is always present and not part of the rollup hierarchy • Computes and combines the following groupings • m_desc, t_cal_month_desc, c_country_code • m_desc, t_cal_month_desc • m_desc October 17, 2024 PA220 DB for Analytics 70 SELECT m_desc, t_cal_month_desc, c_country_code, SUM(s_dollar_sold) FROM spcbatm WHERE m_desc IN ('Direct Sales', 'Internet') AND t_cal_month_desc IN ('2000-09', '2000-10') AND c_country_code IN ('GB', 'US') GROUP BY m_desc, ROLLUP(t_cal_month_desc, c_country_code); Case Study: Partial Roll up October 17, 2024 PA220 DB for Analytics 71 C_ SUM(S_DOLLAR_SOLD) Case Study: Make a pivot table (3) • Produces all possible roll-up combinations • CUBE operator • Computes and combines the following groupings • m_desc, t_cal_month_desc, c_country_code • m_desc, t_cal_month_desc • m_desc, c_country_code • m_desc • t_cal_month, c_country_code • t_cal_month • c_country_code • October 17, 2024 PA220 DB for Analytics 72 SELECT m_desc, t_cal_month_desc, c_country_code, SUM(s_dollar_sold) FROM spcbatm WHERE m_desc IN ('Direct Sales', 'Internet') AND t_cal_month_desc IN ('2000-09', '2000-10') AND c_country_code IN ('GB', 'US') GROUP BY CUBE(m_desc, t_cal_month_desc, c_country_code); Case Study: Cube Example October 17, 2024 PA220 DB for Analytics 73 SUM(S_DOLLAR_SOLD) Case Study: Make a pivot table (4) October 17, 2024 PA220 DB for Analytics 74 SELECT m_desc, t_cal_month_desc, c_country_code, SUM(s_dollar_sold) FROM spcbatm WHERE m_desc IN ('Direct Sales', 'Internet’) AND t_cal_month_desc IN ('2000-09', '2000-10’) AND c_country_code IN ('GB', 'US') GROUP BY GROUPING SETS ((m_desc, t_cal_month_desc, c_country_code), (m_desc, c_country_code), (t_cal_month_desc, c_country_code)); Grouping Sets • Specific group-by’s • Grouping sets Grouping Sets October 17, 2024 PA220 DB for Analytics 75 SUM(S_DOLLAR_SOLD)C_ Grouping ID Example • Replaces all NULLs from rollup with string ‘*’. • Leaves NULL that are not the result of rollup untouched. • Could easily make selective replacements of NULL. October 17, 2024 PA220 DB for Analytics 76 SELECT CASE WHEN GROUPING_ID(m_desc)=1 THEN ‘*' ELSE m_desc END, CASE WHEN GROUPING_ID(c_country_code)=1 THEN ‘*' ELSE c_country_code END, SUM(s_dollar_sold) FROM spcbatm WHERE m_desc IN ('Direct Sales', 'Internet’) AND t_cal_month_desc= '2000-09’ AND c_country_code IN ('GB', 'US') GROUP BY CUBE(m_desc, c_country_code); Grouping ID Example October 17, 2024 PA220 DB for Analytics 77 SUM(S_DOLLAR_SOLD) Case Study: Ranking sales • Rank dollar sales by the media (‘Internet’ versus ‘Direct sales’) • Do the analysis for August until November 2000. • Use the number of unit sales to break ties. October 17, 2024 PA220 DB for Analytics 78 M_DESC T_CAL_MONTH SUM(s_dollar_s SUM(s_quant_s Rank ------------------------------------------------------------ Direct Sales 2000-11 799830.48 396180 1 Direct Sales 2000-10 774222.02 306868 2 Direct Sales 2000-09 723423.73 286950 3 Direct Sales 2000-08 723423.73 286045 4 Internet 2000-10 151593.43 100425 5 Internet 2000-11 145874.92 98760 6 Internet 2000-09 140793.11 96461 7 Internet 2000-08 138953.02 94568 8 Case Study: Ranking sales • Rank dollar sales by the media (‘Internet’ versus ‘Direct sales’) • Do the analysis for August until November 2000. • Use the number of unit sales to break ties. October 17, 2024 PA220 DB for Analytics 79 SELECT m_desc, t_cal_month_desc, SUM(s_dollar_sold), SUM(s_quant_sold), RANK() OVER (ORDER BY SUM(s_dollar_sold) DESC, SUM(s_quant_sold) DESC) AS Rank FROM spcbatm WHERE m_desc IN ('Direct Sales', 'Internet'), AND t_cal_month_desc IN ('2000-08', '2000-09’, '2000-10', '2000-11') GROUP BY m_desc, t_cal_month_desc; Case Study: Ranking sales (2) • Determine the two least and most successful sales media, respectively (in terms of total amount sold). October 17, 2024 PA220 DB for Analytics 80 SELECT * FROM (SELECT m_desc, SUM(s_dollar_sold), RANK() OVER (ORDER BY SUM(s_dollar_sold)) worst, RANK() OVER (ORDER BY SUM(s_dollar_sold) DESC) best FROM spcbatm GROUP BY m_desc) WHERE worst < 3 OR best < 3; SUM(S_DOLLAR_SOLD) Case Study: Ranking sales (3) • Determine the output of the following statement: October 17, 2024 PA220 DB for Analytics 81 SELECT c_id, p_id, RANK() OVER (ORDER BY p_id) AS r1, RANK() OVER (ORDER BY c_id) AS r2, RANK() OVER (ORDER BY 1) AS r3, RANK() OVER (PARTITION BY c_id ORDER BY p_id) AS r4, RANK() OVER (PARTITION BY p_id ORDER BY c_id) AS r5 FROM spcbatm WHERE c_id in (214, 608, 699) AND p_id in (42, 98, 123) GROUP BY c_id, p_id; 3 1 3 1 3 1 2 2 4 4 1 1 1 1 1 1 1 2 1 2 1 1 2 2 3 Summary (Takeaways) • ROLAP is a good option that exploits existing investments • You should know difference between ROLAP and MOLAP • SQL:2003 has added a lot of support for OLAP operations • SQL is not just select-from-where • Types of queries • Navigational vs aggregation • Operations: roll up / drill down, slicing, dicing, pivoting • dimensional vs hierarchical roll up • Data densification October 17, 2024 PA220 DB for Analytics 82