Data Warehouse
Implementation & Querying
PA220: Database systems for data analytics
Contents
• Implementation of dimensional model
• Querying by dimensions
• Grouping possibilities
• Aggregate functions
• Window functions
• Case Study: Grocery Store
October 17, 2024 PA220 DB for Analytics 2
Data Cube
• Implication of using dimensional model (star schema)
• A symbolic representation of aggregated data
October 17, 2024 PA220 DB for Analytics 3
(Data) Cubes
• A “cube” may have many dimensions!
• More than 3 - the term ”hyper-cube” can sometimes be used
• Theoretically no limit for the number of dimensions
• Typical cubes have 4-12 dimensions
• But only 2-4 dimensions can be viewed at a time
• Dimensionality reduced by queries via projection/aggregation
• A cube consists of cells
• A given combination of dimension values
• A cell can be empty (no data for this combination)
• A sparse cube has few non-empty cells
• A dense cube has many non-empty cells
• Cubes become sparser for many/large dimensions
October 17, 2024 PA220 DB for Analytics 4
Dimensions
• Dimensions are used for
• Selection of data
• Grouping of data at the right level of detail
• Dimensions consist of dimension values
• Product dimension have values ”milk”, ”cream”, …
• Time dimension have values ”1/1/2001”, ”2/1/2001”,…
• Dimension values may have an ordering
• Used for comparing cube data across values
• Example: ”percent sales increase compared with last month”
• Especially used for Time dimension
• Dimensions are the core of multidimensional databases
• Other types of databases do not support dimensions
October 17, 2024 PA220 DB for Analytics 5
Schema Documentation
• No well-defined standard
• T level corresponds to ALL
• Record the measures
• You could also use a UML like notation
• Modeling and OLAP tools may have their own notation
October 17, 2024 PA220 DB for Analytics 6
OLAP Systems
• A key concept of OLAP systems is multidimensional analysis:
• Examining data from many dimensions.
• Show total sales across all products at increasing aggregation levels for a geography
dimension, from state to country to geographical region, for 1999 and 2000.
• Create a cross-tabular analysis of our operations showing expenses by territory in South
America for 1999 and 2000. Include all possible subtotals.
• List the top 10 sales representatives in Asia according to 2000 sales revenue for food
products and rank their commissions.
• Organization of cubes to efficiently answer the requests
• Response time of seconds / few minutes
October 17, 2024 PA220 DB for Analytics 7
Relational OLAP (ROLAP)
• Store data in relational databases and simulate multidimensionality with special
schemas
• Data stored in relational tables
• Star (or snowflake) schemas used for modeling
• SQL used for querying
• Pros
• Leverages investments in relational technology
• Scalable (billions of facts)
• Flexible, designs easier to change
• New, performance enhancing, techniques adapted from MOLAP
• Indices, materialized views
• Cons
• Storage use (often 3-4 times MOLAP)
• Response times
October 17, 2024 PA220 DB for Analytics 8
Multidimensional OLAP (MOLAP)
• Physically stages the processed multidimensional information to deliver
consistent and rapid response times to end users
• Data stored in special multidimensional data structures
• E.g., multidimensional array A on hard disk
• Pros
• Less storage use (“foreign keys” not stored)
• Faster query response times
• Direct access to a cell by giving position, e.g. A[1][2] → 2
• Cons
• Up till now not so good scalability
• Less flexible, e.g., cube must be re-computed when design changes
• Does not reuse an existing investment (but often bundled with RDBMS)
• Not as open technology
October 17, 2024 PA220 DB for Analytics 9
Hybrid OLAP (HOLAP)
• Detail data stored in relational tables (ROLAP)
• Aggregates stored in multidimensional structures (MOLAP)
• Pros
• Scalable (as ROLAP)
• Fast (as MOLAP)
• Cons
• High complexity
October 17, 2024 PA220 DB for Analytics 10
Question time
• Suppose that we want to replace the original Store hierarchy A by a
new hierarchy B
• How do we modify the schema to reflect it in ROLAP / MOLAP?
October 17, 2024 PA220 DB for Analytics 11
Question time (2)
• New store is being open and fact table needs to be populated
• How do we modify the fact table in ROLAP / MOLAP ?
October 17, 2024 PA220 DB for Analytics 12
Relational OLAP Cubes
• Two kinds of queries
• Navigation queries examine one dimension
• SELECT DISTINCT l FROM d [WHERE p]
• Aggregation queries summarize fact data
• SELECT d1.l1, d2.l2, SUM(f.m)
FROM d1, d2, f
WHERE f.dk1 = d1.dk1 AND f.dk2 = d2.dk2 [AND p]
GROUP BY d1.l1,d2.l2
• Fast, interactive analysis of large amounts of data
October 17, 2024 PA220 DB for Analytics 13
A mere symbolic
representation as a cube
OLAP Queries
October 17, 2024 PA220 DB for Analytics 14
OLAP Cube in MS Analysis Services Project
October 17, 2024 PA220 DB for Analytics 15
Multidimensional database implementation
• Microsoft SQL Server Analysis Services (SSAS) in MS SQL Server
• Oracle Database OLAP Option within Oracle database
October 17, 2024 PA220 DB for Analytics 16
Microsoft SQL Server Analysis Services (SSAS)
• Introduced in SQL Server 2008
• Nice built-in features
• Analysis Services
• Integration Services
• Reporting Services
• Easy to use
• Graphical “Management Studio” and “BI Developer Studio”
• Allows all flavors of MOLAP, ROLAP and HOLAP to be used within the
same model
• Intelligent pre-aggregation (for improving query performance)
• Uses the query language MDX (MultiDimensional eXpressions)
October 17, 2024 PA220 DB for Analytics 17
OLAP Operations & Queries
• Redundancy is necessity
• Materialized views, special purpose indexes, denormalized schemas
• Data is refreshed periodically
• Daily or weekly
• DW queries are big queries
• Imply a large portion of the data
• Mostly read queries
October 17, 2024 PA220 DB for Analytics 18
OLAP Queries
• Typical OLAP query operations
• Roll up
• Drill down
• Slice and dice
• Pivot (rotate) (aka crosstab)
• Other operations
• Drill across
• navigate from a fact table to another one by using a shared dimension
• Drill through
• allows users to view relational transactions that make up a multidimensional point in cube
• Data densification (partitioned outer join)
• Based on
• Aggregate functions
• Ranking and comparing operators
October 17, 2024 PA220 DB for Analytics 19
Roll Up
• Roll up (drill up)
• Taking the current aggregation level of fact values and doing a further
aggregation
• Summarize data by
• Climbing up hierarchy (hierarchical roll up)
• By dimensional reduction (dimensional roll up)
• Or by a mix of these 2 techniques
• Used for obtaining an increased generalization
• E.g., from Time.Week to Time.Year
October 17, 2024 PA220 DB for Analytics 20
Roll Up
• Hierarchical roll ups
• Performed on the fact table and some dimension tables by climbing up the
attribute hierarchies
• E.g., climbed the Time hierarchy to Quarter and Article hierarchy to Prod. group
October 17, 2024 PA220 DB for Analytics 21
Roll Up
• Dimensional roll ups
• Are done solely on the fact table by dropping one or more dimensions
• E.g., drop the Client dimension
October 17, 2024 PA220 DB for Analytics 22
Roll Up
• Climbing above the top in hierarchical roll up
• In an ultimate case, hierarchical roll up above the top level of an attribute
hierarchy (attribute “ALL”) can be viewed as converting to a dimensional roll
up
October 17, 2024 PA220 DB for Analytics 23
Drill Down
• Drill down (roll down)
• Reverse of roll up
• Represents a de-aggregate operation
• From higher level of summary to lower level of summary detailed data
• Introducing new dimensions
• Requires the existence of materialized finer grained data
• You can not drill if you do not have the data
October 17, 2024 PA220 DB for Analytics 24
Roll Up & Drill Down Example
October 17, 2024 PA220 DB for Analytics 25
Slice
• Slice: a subset of the multi-dimensional array corresponding to a
single value of one or more dimensions and projection on the rest of
dimensions
• E.g., project on Geo (store) and Time from values corresponding to Laptops
(cat. of products) in the product dimension
October 17, 2024 PA220 DB for Analytics 26
SELECT store_id, time_id, amount
FROM Sales
WHERE article_id = <laptop_id>
𝜋 𝑆𝑡𝑜𝑟𝑒𝐼𝑑,𝑇𝑖𝑚𝑒𝐼𝑑,𝐴𝑚𝑜𝑢𝑛𝑡 𝜎𝐴𝑟𝑡𝑖𝑐𝑙𝑒𝐼𝑑=<𝐿𝑎𝑝𝑡𝑜𝑝𝐼𝑑> 𝑆𝑎𝑙𝑒𝑠
Slice
• Amounts to equality select condition on one dimension
• WHERE clause in SQL
• E.g., slice Laptops
October 17, 2024 PA220 DB for Analytics 27
Dice
• Dice: amounts to range select condition on one dimension, or to
equality select condition on more than one dimension
• E.g., range SELECT
October 17, 2024 PA220 DB for Analytics 28
𝜋 𝑆𝑡𝑜𝑟𝑒𝐼𝑑,𝑇𝑖𝑚𝑒𝐼𝑑,𝐴𝑚𝑜𝑢𝑛𝑡 𝜎𝐴𝑟𝑡𝑖𝑐𝑙𝑒𝐼𝑑=<𝐿𝑎𝑝𝑡𝑜𝑝𝐼𝑑>∨𝐴𝑟𝑡𝑖𝑐𝑙𝑒𝐼𝑑=<𝐶𝑒𝑙𝑙𝑃ℎ𝑜𝑛𝑒𝐼𝑑> 𝑆𝑎𝑙𝑒𝑠
SELECT store_id, time_id, amount
FROM Sales
WHERE article_id = <laptop_id>
OR article_id = <cellphone_id>
Dice
• E.g., equality SELECT on 2 dimensions Product and Time
October 17, 2024 PA220 DB for Analytics 29
SELECT store_id, time_id, amount
FROM Sales
WHERE article_id = <laptop_id>
AND month_id = <December>
𝜋 𝑆𝑡𝑜𝑟𝑒𝐼𝑑,𝑇𝑖𝑚𝑒𝐼𝑑,𝐴𝑚𝑜𝑢𝑛𝑡 𝜎𝐴𝑟𝑡𝑖𝑐𝑙𝑒𝐼𝑑=<𝐿𝑎𝑝𝑡𝑜𝑝𝐼𝑑>∧𝑀𝑜𝑛𝑡ℎ𝐼𝑑=<𝐷𝑒𝑐𝑒𝑚𝑏𝑒𝑟> 𝑆𝑎𝑙𝑒𝑠
Pivoting
• Pivot (rotate): re-arrange data for viewing purposes
• The simplest view of pivoting is that it selects two dimensions to aggregate
the measure
• The aggregated values are often displayed in a grid where each point in the (x, y)
coordinate system corresponds to an aggregated value of the measure
• The x and y coordinate values are the values of the selected two dimensions
• Shows data at different “granularities”
• The result of pivoting is also called cross tabulation
• This is space efficient for dense data only
(thus, few dimensions)
October 17, 2024 PA220 DB for Analytics 30
Pivoting in ROLAP
• Tabular representation for the cross-tabular
report with totals.
• ALL is a dummy value and stands for all or multiple
values. (≈ NULL in SQL)
• Probably not as nice to read as the crosstab.
• Information content is the same as in the
crosstab.
• Is more space efficient than crosstab if the data is
sparse.
October 17, 2024 PA220 DB for Analytics 31
SQL & OLAP
• The idea is to
• Select by Attributes of Dimensions
• E.g., region = „Europe“
• Group by Attributes of Dimensions
• E.g., region, month, quarter
• Aggregate on measures
• E.g., sum(price * volume)
• OLAP queries in SQL
October 17, 2024 PA220 DB for Analytics 32
SQL & OLAP
• No standard query language for OLAP
• SQL99 for ROLAP
• SQL:2003 OLAP extensions
• New SQL SELECT clauses
• GROUPING SETS
• ROLLUP
• CUBE
• Queries of type “top k”
• New aggregate functions
October 17, 2024 PA220 DB for Analytics 33
SQL & OLAP
• Shortcomings of SQL/92 with regard to OLAP queries
• Hard or impossible to express in SQL
• Multiple aggregations
• Comparisons (with aggregation)
• Reporting features
• Performance penalty
• Poor execution of queries with many AND and OR conditions
• Lack of support for advanced statistical functions
October 17, 2024 PA220 DB for Analytics 34
SQL92
• Multiple aggregations in SQL/92
• Create a 2D spreadsheet that shows sum of sales by maker as well as car
model
• Each subtotal requires a separate aggregation query
October 17, 2024 PA220 DB for Analytics 35
SELECT model, maker, sum(amt) FROM sales GROUP BY model, maker
union
SELECT model, sum(amt) FROM sales GROUP BY model
union
SELECT maker, sum(amt) FROM sales GROUP BY maker
union
SELECT sum(amt) FROM sales
SQL92
• Comparisons in SQL/92
• This year’s sales vs. last year’s sales for each product
• Requires a self join
• CREATE VIEW v_sales AS
SELECT prod_id , year, sum(qty) AS sale_sum
FROM sales GROUP BY prod_id, year;
• SELECT cur.prod_id , cur.year , cur.sale_sum , last.year, last.sale_sum
FROM v_sales cur, v_sales last
WHERE cur.year = (last.year+1) AND cur.prod_id = last.prod_id
October 17, 2024 PA220 DB for Analytics 36
SQL92
• Reporting features in SQL/92
• Too complex to express
• RANK (top k) and NTILE (“top X%” of all products)
• Median
• Running/rolling/moving total, moving average, cumulative totals
• E.g., moving average over a 3-day window of total sales for each product
• CREATE OR REPLACE VIEW v_sales AS
SELECT prod_id, time_id, sum(qty) AS sale_sum
FROM sales GROUP BY prod_id, time_id
• SELECT end.time, avg( start.sale_sum )
FROM v_sales NATURAL JOIN d_time start, v_sales NATURAL JOIN d_time end
WHERE end.time >= start.time AND end.time <=start.time + 2
GROUP BY end.time
October 17, 2024 PA220 DB for Analytics 37
SQL99: Grouping Operators
• GROUP BY ROLLUP(gcols)
• Roll-up hierarchically
• GROUP BY CUBE (gcols)
• Roll-up to all possible combinations
• GROUP BY gcols1, CUBE(gcols2)
• Partial roll-up
• GROUP BY GROUPING SETS (gcols1, …, gcolsN)
• Explicit specification of roll-ups
• GROUP BY groupings1, groupings2, …
• Cross-product of groupings
• SELECT … GROUPING_ID(gcols)…
• Identification of roll-up level
October 17, 2024 PA220 DB for Analytics 38
Roll Up
• ROLLUP creates subtotals at n+1 levels, where n is the number of
grouping columns
• Rows that would be produced by GROUP BY without ROLLUP
• First-level subtotals
• Second-level subtotals
• …
• A grand total row
• It is very helpful for subtotaling along a hierarchical dimensions such
as time or geography
• ROLLUP(y, m, day) or ROLLUP(country, state, city)
• Order of attributes is significant!
October 17, 2024 PA220 DB for Analytics 39
Roll Up
• Roll up operation, e.g.:
• SELECT year, brand, SUM(qty) FROM sales GROUP BY ROLLUP(year, brand);
October 17, 2024 PA220 DB for Analytics 40
Cube
• CUBE creates 2n combinations of subtotals, where n is the number of
grouping columns
• Includes all the rows produced by ROLLUP
• CUBE is typically most suitable in queries that
• use columns from multiple dimensions
• rather than columns representing different levels of a single dimension
• e.g., subtotals for all combinations of month, state, and product
• Partial CUBE similar to partial ROLLUP
October 17, 2024 PA220 DB for Analytics 41
Cube
October 17, 2024 PA220 DB for Analytics 42
Cube
• Example
• SELECT year, brand, SUM(qty) FROM sales GROUP BY CUBE (year, brand);
October 17, 2024 PA220 DB for Analytics 43
Grouping Sets
• Grouping sets produce just the specified groupings.
• No (automatic) rollup is performed.
• E.g., GROUPING SETS ( (A,B), (D), (C, E, F) )
• Collection of columns in paratheses → composite column
• Efficiently replaces the series of UNIONed queries
• SELECT dept_name , CAST(NULL AS CHAR(10)) AS job_title , COUNT(*) FROM personnel
GROUP BY dept_name
UNION ALL
SELECT CAST(NULL AS CHAR(8)) AS dept_name , job_title , COUNT(*) FROM personnel
GROUP BY job_title;
• Can be rewritten as:
• SELECT dept_name , job_title , COUNT(*) FROM Personnel GROUP BY GROUPING SETS
( (dept_name), (job_title) );
October 17, 2024 PA220 DB for Analytics 44
Grouping Sets
• The issue of NULL values
• The new grouping functions generate NULL values at the subtotal levels
• How do we tell the difference between “generated NULLs” and “real NULLs” from the
data itself?
• The GROUPING function call returns 0 for NULL in the data and 1 for generated NULL
October 17, 2024 PA220 DB for Analytics 45
Grouping Operators: Equivalences
• CUBE(a,b) ≡ GROUPING SETS ((a,b), (a), (b), ())
• ROLLUP(a,b,c) ≡ GROUPING SETS ((a,b,c), (a,b), (a), ())
• GROUP BY GROUPING SETS (a,b,c) ≡
GROUP BY a UNION ALL GROUP BY b UNION ALL GROUP BY c
• GROUP BY GROUPING SETS ((a,b,c)) ≡ GROUP BY a, b, c
• GROUP BY GROUPING SETS (a,b,(b,c)) ≡
GROUP BY a UNION ALL GROUP BY b UNION ALL GROUP BY b, c
• GROUP BY GROUPING SETS (a,ROLLUP(b,c)) ≡
GROUP BY a UNION ALL GROUP BY ROLLUP(b, c)
October 17, 2024 PA220 DB for Analytics 46
Identification of Groupings
• With rollup and cube, we must provide a possibility to determine the
rollup level programmatically.
• The GROUPING_ID function is designed for this.
• GROUPING_ID takes a list of grouping columns as an argument.
• For each column it returns 1 if its value is NULL because of a rollup, and 0
otherwise.
• The list of binary digits is interpreted as a binary number and returned as a
base-10 number.
• Example for CUBE(a,b):
October 17, 2024 PA220 DB for Analytics 47
Concatenated Groupings
• A concatenated grouping is specified by listing multiple
• grouping sets, cubes, and rollups,
and produces the cross-product of groupings from each grouping set
• Example:
• GROUP BY GROUPING SETS (a,b), GROUPING SETS (c,d)
produces (a,c), (a,d), (b,c), (b,d)
• A concise and easy way to generate useful combinations of groupings
• A small number of concatenated groupings can generate a large number of
final groups
• One of the most important uses for concatenated groupings is to generate the
aggregates for a hierarchical cube
October 17, 2024 PA220 DB for Analytics 48
Hierarchical Cubes
• A hierarchical cube is a data set where the data is aggregated along
the rollup hierarchy of each of its dimensions.
• The aggregations are combined across dimensions
• Example:
• ROLLUP(year, quarter, month),
ROLLUP(category,subcategory,name),
ROLLUP(region,subregion,country,state,city)
• Groups: (year,category,region), (quarter,category,region),
(month,category,region), …
• Produces a total of 4x4x6=96 aggregate groups
• Compare to 2^11 = 2048 groupings by CUBE and 96 explicit group specifications
October 17, 2024 PA220 DB for Analytics 49
Window Functions
• The window clause specifies an action to perform over a set of rows
• 3 sub-clauses: partitioning, ordering and aggregation grouping
• <aggregate function>
OVER ([PARTITION BY <column list>]
ORDER BY <sort column list>
[<aggregation grouping>])
• Moving average of 3 rows:
• SELECT … ,
AVG(sales) OVER (PARTITION BY region
ORDER BY month ASC
ROWS 2 PRECEDING) AS SMA3, …
FROM …
October 17, 2024 PA220 DB for Analytics 50
Ranking Operators
• Ranking operators in SQL
• Row numbering is the most basic ranking function
• Old style: ROW_NUMBER() returns a column that contains the row’s number within the
result set
• E.g., SELECT SalesOrderID , CustomerID ,
ROW_NUMBER() OVER (ORDER BY SalesOrderID ) as RunningCount
FROM Sales WHERE SalesOrderID > 10000
ORDER BY SalesOrderID
October 17, 2024 PA220 DB for Analytics 51
Ranking Operators
• ROW_NUMBER doesn’t consider tied values
• Each 2 equal values get 2 different row numbers
• The behavior is nondeterministic
• Each tied value could have its number switched!
• We need something deterministic
October 17, 2024 PA220 DB for Analytics 52
Ranking Operators
• RANK and DENSE_RANK functions
• Allow ranking items in a group
• Syntax:
• RANK ( ) OVER ( [ query_partition_clause ] order_by_clause )
• DENSE_RANK ( ) OVER ( [ query_partition_clause ] order_by_clause )
• DENSE_RANK leaves no gaps in ranking sequence when there are ties
• PERCENT_RANK → (rank - 1) / (total rows - 1)
• CUME_DIST - the cumulative distribution
• the number of partition rows preceding (or peers with) the current row / total partition
rows
• The value ranges from 1/N to 1
October 17, 2024 PA220 DB for Analytics 53
Ranking Operators
• E.g.,
SELECT channel, calendar,
TO_CHAR(TRUNC(SUM(amount_sold), -6), '9,999,990’) AS sales,
RANK() OVER (ORDER BY TRUNC(amount_sold, -6)) DESC) AS rank,
DENSE_RANK() OVER (ORDER BY TRUNC(SUM(amount_sold), -6)) DESC) AS dense_rank
FROM sales, products WHERE …
GROUP BY channel, calendar
October 17, 2024 PA220 DB for Analytics 54
Ranking Operators
• Group ranking - RANK function can operate within groups: the rank
gets reset whenever the group changes
• A single query can contain more than one ranking function, each partitioning
the data into different groups.
• PARTITION BY clause
October 17, 2024 PA220 DB for Analytics 55
SELECT … RANK() OVER (PARTITION BY channel ORDER BY SUM(amount_sold) DESC) AS rank_by_channel
Ntile
• NTILE splits a set into equal-sized groups
• It divides an ordered partition into buckets and assigns a bucket number to
each row in the partition
• Buckets are calculated so that each bucket has exactly the same number of
rows assigned to it or at most 1 row more than the others
• NTILE(4) - quartile
• NTILE(100) - percentage
• Not a part of the SQL99 standard, but adopted by major vendors
October 17, 2024 PA220 DB for Analytics 56
SELECT … NTILE(3) OVER (ORDER BY sales) NT_3 FROM …
Window Frame
• Obtain a value of a particular row of a window frame defined by
window clause (PARTITION BY…)
• first_value(expression)
• last_value(expression)
• nth_value (expression)
October 17, 2024 PA220 DB for Analytics 57
SELECT … FIRST_VALUE(sales) OVER (ORDER BY sales) AS lowest_sale
CHANNEL CALENDAR SALES LOWEST_SALE
Direst sales 02.2016 10,000 4,000
Direst sales 03.2016 9,000 4,000
Internet 02.2016 6,000 4,000
Internet 03.2016 6,000 4,000
Partners 03.2016 4,000 4,000
SELECT … FIRST_VALUE(sales) OVER (PARTITION BY channel ORDER BY sales) AS lowest_sale_per_channel
Window Frame
• Access to a row that comes before the current row at a specified
physical offset with the current window frame (partition)
• LAG(expression [,offset [,default_value]])
• … after the current row
• LEAD(expression [,offset [,default_value]]
October 17, 2024 PA220 DB for Analytics 58
CHANNEL CALENDAR SALES PREV_SALE
Direst sales 02.2016 10,000 NULL
Direst sales 03.2016 9,000 10,000
Internet 02.2016 6,000 NULL
Internet 03.2016 6,000 6,000
Partners 03.2016 4,000 NULL
SELECT … LAG(sales, 1) OVER (PARTITION BY channel ORDER BY calendar) AS prev_sale
Data Densification
• Enrich the existing “holey” data with default values
• Goal: produce dense result for all products
• i.e., incl. the weeks 24, 25, and 26.
October 17, 2024 PA220 DB for Analytics 59
Data Densification
• Partitioned Outer Join
• Apply outer join on each partition
• Implemented in Oracle
October 17, 2024 PA220 DB for Analytics 60
Case Study: Grocery Store (from prev. lecture)
• Stock Keeping Units (SKUs)
• Point Of Sale (POS) system
• Stores/Branches
• Promotions
• Task: Analyze how promotions (adverts) affect sales
October 17, 2024 PA220 DB for Analytics 61
Case Study: Data Model
t(time)
t_id (PK)
t_day_name
t_cal_month_num
t_cal_year
t_cal_week_num
t_date
t_holiday
…
p(roduct)
p_id
p_name
p_desc
p_cat
p_subcat
p_list_price
b(ranch)
b_id
b_name
b_city
b_country
b_country_code
b_region
…
a(dvert)
a_id
a_name
a_aisle_display
a_coupon
a_price_reduction
m(edia)
m_id
m_desc
m_class
…
c(ustomer)
c_id
c_first_name
c_last_name
c_country
c_country_code
c_region
s(ales)
s_p_id (FK)
s_c_id (FK)
s_t_id (FK)
s_m_id (FK)
s_a_id (FK)
s_b_id (FK)
s_quant_sold
s_dollar_sold
s_dollar_cost
s_custom_cnt
October 17, 2024 PA220 DB for Analytics 62
Case Study: Dimensions
• Time dimension
• Explicit time dimension is needed (events, holidays,..)
• Product dimension
• Many-level hierarchy allows drill-down/roll-up
• Many descriptive attributes (often more than 50)
• Branch (store) dimension
• Many descriptive attributes
• Advert (promotion) dimension
• Example of a causal dimension
• Used to see if promotions work/are profitable
• Ads, price reductions, end-of-aisle displays, coupons
October 17, 2024 PA220 DB for Analytics 63
Case Study: Measures (Facts)
• Dollar_sales, Unit_sales, Dollar_cost
• All additive across all dimensions
• Gross profit (derived)
• Computed from sales and cost: sales – cost
• Additive
• Gross margin (derived)
• Computed from gross profit and sales: (sales – cost)/cost
• Non-additive
• sum() makes not sense across any dimension since it is a ratio; rather avg() is sensible
• Customer_count
• Semi-additive
• i.e., additive across time, promotion, and store
• i.e., non-additive across product. Why?
October 17, 2024 PA220 DB for Analytics 64
Case Study: Common View
• A common view is created to simplify queries
October 17, 2024 PA220 DB for Analytics 66
CREATE VIEW spcbatm
SELECT *
FROM sales s JOIN product p ON (s_p_id=p_id)
JOIN customer c ON (s_c_id=c_id)
JOIN branch b ON (s_b_id=b_id)
JOIN advert a ON (s_a_id=a_id)
JOIN time t ON (s_t_id=t_id)
JOIN media m ON (s_m_id=m_id)
Case Study: Query Formulation
• Task: Analyze how promotions (adverts) affect sales
• Assume specific customer countries (GB and US),
media types (Direct and Internet) and months (Sep, Oct 2000).
October 17, 2024 PA220 DB for Analytics 67
SELECT m_desc, t_cal_month_desc, c_country_code,
SUM(s_dollar_sold)
FROM spcbatm
WHERE m_desc IN ('Direct Sales', 'Internet')
AND t_cal_month_desc IN ('2000-09', '2000-10')
AND c_country_code IN ('GB', 'US')
GROUP BY m_desc, t_cal_month_desc, c_country_code;
s(ales)
s_p_id (FK)
s_c_id (FK)
s_t_id (FK)
s_m_id (FK)
s_a_id (FK)
s_b_id (FK)
s_quant_sold
s_dollar_sold
s_dollar_cost
s_custom_cnt
t(time)
s_cal_month_desc
m(edia)
m_desc
c(ustomer)
c_country_code
Case Study: Make a pivot table
• Compute subtotals:
• Rollup from right to left
• Computes and combines the following groupings
• m_desc, t_cal_month_desc, c_country_code
• m_desc, t_cal_month_desc
• m_desc
• October
17, 2024 PA220 DB for Analytics 68
SELECT m_desc, t_cal_month_desc, c_country_code,
SUM(s_dollar_sold)
FROM spcbatm
WHERE m_desc IN ('Direct Sales', 'Internet')
AND t_cal_month_desc IN ('2000-09', '2000-10')
AND c_country_code IN ('GB', 'US')
GROUP BY ROLLUP(m_desc, t_cal_month_desc, c_country_code);
Case Study: Roll up Example
October 17, 2024 PA220 DB for Analytics 69
C_
Case Study: Make a pivot table (2)
• Partial Roll up
• m_desc is always present and not part of the rollup hierarchy
• Computes and combines the following groupings
• m_desc, t_cal_month_desc, c_country_code
• m_desc, t_cal_month_desc
• m_desc
October 17, 2024 PA220 DB for Analytics 70
SELECT m_desc, t_cal_month_desc, c_country_code,
SUM(s_dollar_sold)
FROM spcbatm
WHERE m_desc IN ('Direct Sales', 'Internet')
AND t_cal_month_desc IN ('2000-09', '2000-10')
AND c_country_code IN ('GB', 'US')
GROUP BY m_desc, ROLLUP(t_cal_month_desc, c_country_code);
Case Study: Partial Roll up
October 17, 2024 PA220 DB for Analytics 71
C_ SUM(S_DOLLAR_SOLD)
Case Study: Make a pivot table (3)
• Produces all possible roll-up combinations
• CUBE operator
• Computes and combines the following groupings
• m_desc, t_cal_month_desc, c_country_code
• m_desc, t_cal_month_desc
• m_desc, c_country_code
• m_desc
• t_cal_month, c_country_code
• t_cal_month
• c_country_code
• October
17, 2024 PA220 DB for Analytics 72
SELECT m_desc, t_cal_month_desc, c_country_code,
SUM(s_dollar_sold)
FROM spcbatm
WHERE m_desc IN ('Direct Sales', 'Internet')
AND t_cal_month_desc IN ('2000-09', '2000-10')
AND c_country_code IN ('GB', 'US')
GROUP BY CUBE(m_desc, t_cal_month_desc,
c_country_code);
Case Study: Cube Example
October 17, 2024 PA220 DB for Analytics 73
SUM(S_DOLLAR_SOLD)
Case Study: Make a pivot table (4)
October 17, 2024 PA220 DB for Analytics 74
SELECT m_desc, t_cal_month_desc, c_country_code,
SUM(s_dollar_sold)
FROM spcbatm
WHERE m_desc IN ('Direct Sales', 'Internet’)
AND t_cal_month_desc IN ('2000-09', '2000-10’)
AND c_country_code IN ('GB', 'US')
GROUP BY GROUPING SETS ((m_desc, t_cal_month_desc, c_country_code),
(m_desc, c_country_code),
(t_cal_month_desc, c_country_code));
Grouping Sets
• Specific group-by’s
• Grouping sets
Grouping Sets
October 17, 2024 PA220 DB for Analytics 75
SUM(S_DOLLAR_SOLD)C_
Grouping ID Example
• Replaces all NULLs from rollup with string ‘*’.
• Leaves NULL that are not the result of rollup untouched.
• Could easily make selective replacements of NULL.
October 17, 2024 PA220 DB for Analytics 76
SELECT
CASE WHEN GROUPING_ID(m_desc)=1 THEN ‘*' ELSE m_desc END,
CASE WHEN GROUPING_ID(c_country_code)=1 THEN ‘*' ELSE c_country_code END,
SUM(s_dollar_sold)
FROM spcbatm
WHERE m_desc IN ('Direct Sales', 'Internet’)
AND t_cal_month_desc= '2000-09’
AND c_country_code IN ('GB', 'US')
GROUP BY CUBE(m_desc, c_country_code);
Grouping ID Example
October 17, 2024 PA220 DB for Analytics 77
SUM(S_DOLLAR_SOLD)
Case Study: Ranking sales
• Rank dollar sales by the media (‘Internet’ versus ‘Direct sales’)
• Do the analysis for August until November 2000.
• Use the number of unit sales to break ties.
October 17, 2024 PA220 DB for Analytics 78
M_DESC T_CAL_MONTH SUM(s_dollar_s SUM(s_quant_s Rank
------------------------------------------------------------
Direct Sales 2000-11 799830.48 396180 1
Direct Sales 2000-10 774222.02 306868 2
Direct Sales 2000-09 723423.73 286950 3
Direct Sales 2000-08 723423.73 286045 4
Internet 2000-10 151593.43 100425 5
Internet 2000-11 145874.92 98760 6
Internet 2000-09 140793.11 96461 7
Internet 2000-08 138953.02 94568 8
Case Study: Ranking sales
• Rank dollar sales by the media (‘Internet’ versus ‘Direct sales’)
• Do the analysis for August until November 2000.
• Use the number of unit sales to break ties.
October 17, 2024 PA220 DB for Analytics 79
SELECT m_desc, t_cal_month_desc, SUM(s_dollar_sold), SUM(s_quant_sold),
RANK() OVER (ORDER BY SUM(s_dollar_sold) DESC,
SUM(s_quant_sold) DESC) AS Rank
FROM spcbatm
WHERE m_desc IN ('Direct Sales', 'Internet'),
AND t_cal_month_desc IN ('2000-08', '2000-09’, '2000-10', '2000-11')
GROUP BY m_desc, t_cal_month_desc;
Case Study: Ranking sales (2)
• Determine the two least and most successful sales media,
respectively (in terms of total amount sold).
October 17, 2024 PA220 DB for Analytics 80
SELECT * FROM (SELECT m_desc, SUM(s_dollar_sold),
RANK() OVER (ORDER BY SUM(s_dollar_sold)) worst,
RANK() OVER (ORDER BY SUM(s_dollar_sold) DESC) best
FROM spcbatm
GROUP BY m_desc)
WHERE worst < 3 OR best < 3;
SUM(S_DOLLAR_SOLD)
Case Study: Ranking sales (3)
• Determine the output of the following statement:
October 17, 2024 PA220 DB for Analytics 81
SELECT c_id, p_id, RANK() OVER (ORDER BY p_id) AS r1,
RANK() OVER (ORDER BY c_id) AS r2,
RANK() OVER (ORDER BY 1) AS r3,
RANK() OVER (PARTITION BY c_id ORDER BY p_id) AS r4,
RANK() OVER (PARTITION BY p_id ORDER BY c_id) AS r5
FROM spcbatm
WHERE c_id in (214, 608, 699)
AND p_id in (42, 98, 123)
GROUP BY c_id, p_id;
3
1
3
1
3
1
2
2
4
4
1
1
1
1
1
1
1
2
1
2
1
1
2
2
3
Summary (Takeaways)
• ROLAP is a good option that exploits existing investments
• You should know difference between ROLAP and MOLAP
• SQL:2003 has added a lot of support for OLAP operations
• SQL is not just select-from-where
• Types of queries
• Navigational vs aggregation
• Operations: roll up / drill down, slicing, dicing, pivoting
• dimensional vs hierarchical roll up
• Data densification
October 17, 2024 PA220 DB for Analytics 82