Data Warehouse Modelling PA220: Database systems for data analytics Contents • Business event • Dimensional modelling • Star schema • Data warehouse design • Slowly changing dimension • Date dimension • Degenarate dimension December 14, 2022 PA220 DB for Analytics 2 Transactional Database Design (OLTP) • OLTP Performance is about Inserting and Updating Quickly • Locking Must Be Minimized • Very Small Sets of Data Is Retrieved in a Query • Data Consistency is Critical • Laws of Normalization December 14, 2022 PA220 DB for Analytics 3 Reporting Database Design (Data Warehouse) • Copy of OLTP • Data Performance is about retrieving the data quickly • Locking is not an issue • Large sets of data are retrieved in a query • Insert and Update speed is not important. December 14, 2022 PA220 DB for Analytics 4 Business Process Measurements • Want to store information about business processes. • -> Store “business process measurement events” • Example: Retail sales • -> Could store information like date/time, product, store number, promotion, customer, clerk, sales dollars, sales units, … • -> Implies a level of detail, or grain. • Observe: These stored data have different flavors: • Ones that refer to other entities, e.g., to describe the context of the event (e.g., product, store, clerk) • -> dimensions • Ones that look more like “measurement values” (sales dollars, sales units) • -> facts or measures December 14, 2022 PA220 DB for Analytics 5 Business Process Measurement Events • A flat table view of the events could look like December 14, 2022 PA220 DB for Analytics 6 Business Analysts • Businesspeople are used to analyzing such data using pivot tables in spreadsheet software. December 14, 2022 PA220 DB for Analytics 7 Dimensional Modeling • Ralph Kimball: "Dimensional Modeling is a design technique for databases intended to support end-user queries in a data warehouse" • Event table attributes: • use some as dimensions and some as measures to aggregate (facts) • Descriptive data is separated from quantity • e.g., customer name is separated from the order amount. December 14, 2022 PA220 DB for Analytics 8 Dimensional Modeling (2) December 14, 2022 PA220 DB for Analytics 9 Dimensional Modeling (3) • Data is denormalized as needed to support reporting. • The resulting model reflects the kinds of questions the business wants to ask rather than the functions of the underlying operational system. • Data maintenance performance is secondary. December 14, 2022 PA220 DB for Analytics 10 What is a Dimension? • Dimensions describe business events like the sale of a product. • They are what users would want to sort, group and filter • e.g., dates, customer number, store number December 14, 2022 PA220 DB for Analytics 11 Dimension • Typical criterion for grouping • Many dimensions support some form of hierarchy • E.g., city -> region -> state -> country • Sometimes: more than one natural hierarchy • E.g., date -> day -> month -> quarter -> year • -> weekday • Bottom level of hierarchy • the finest granularity – atomic grain • Top level of hierarchy • typically, “ALL” December 14, 2022 PA220 DB for Analytics 12 What is a fact? • A fact, also called a measure, is a measurable metric which is described by the dimensions. • e.g., the sale amount or order quantity • There are usually many more dimensions than facts. December 14, 2022 PA220 DB for Analytics 13 Facts • Facts represent the subject of the desired analysis • The ”important” in the business that should be analyzed • A fact is most often identified via its dimension values • A fact is a non-empty cell • Some models give facts an explicit identity • Generally, a fact should • be attached to exactly one value in each dimension; • only be attached to dimension values in the bottom levels • e.g., if the lowest time granularity is day, for each fact the exact day should be specified; • some models do not require this. December 14, 2022 PA220 DB for Analytics 14 Facts (2) • Different types of facts are distinguished • Transaction (Event) facts • A fact for every business event (sale) • Snapshot facts • A fact for every dimension combination at given time interval (period), which is the grain of it. • Typically, uniformly dense – zero/null-value for periods of “no activity” • Captures current status (inventory) • Cumulative snapshot facts • A fact for every dimension combination at given time interval • from the beginning to the end of a process (e.g., claim processing) • Captures cumulative status up to now, e.g., sales to date • So, the fact row is updated • ”Fact-less” facts • A fact per event (customer contact) • No numerical measures • An event happened for a dimension value combination December 14, 2022 PA220 DB for Analytics 15 Facts (3) • Different types of facts are distinguished • Aggregate facts • simple numeric rollups of atomic fact table • just to accelerate queries • Consolidated facts • facts combined from multiple processes if they can be expressed at the same grain. • e.g., sales actuals combined with sales forecasts • Every type of facts answers different questions • Often transaction facts and snapshot facts exist December 14, 2022 PA220 DB for Analytics 16 Fact Granularity (Grain) • Granularity of facts is important • What does a single fact mean? • Determines the level of detail • Given by the combination of bottom levels • e.g., ”total sales per store per day per product” • Important for scalability of the number of facts • Often the granularity is a single business transaction • Example: sale • Sometimes the data is aggregated (total sales per store per day per product) • Aggregation might be necessary for scalability • Generally, transaction detail can be handled • Except perhaps huge clickstreams, etc. December 14, 2022 PA220 DB for Analytics 17 Measures • Measures represent the fact property that users want to study and analyze • e.g., the total sales price • A measure has two components • Numerical value: (sales price) • Aggregation formula: (SUM) • used for aggregating / combining several measure values into one • Additivity is an important property for measures • Single fact table rows are (almost) never retrieved, but aggregations over millions of fact rows. • Measure value determined by the combination of dimension values • Measure value is meaningful for all aggregation levels December 14, 2022 PA220 DB for Analytics 18 Measures (2) • Three types of measures are distinguished • (Fully-)Additive measures • Can be aggregated over all dimensions using SUM • e.g., sales price, gross profit computed from sales and cost • Often occur in transaction (event) facts • Semi-additive measures • Cannot be aggregated over some dimensions – typically time • e.g., banking: account balance is additive across all dimensions but time • Often occur in snapshot facts • Non-additive measures • Cannot be aggregated over any dimensions • e.g., unit cost cannot be added, ratios • Occur in all types of facts December 14, 2022 PA220 DB for Analytics 19 Star Schema • Rather than a flat table, use a star schema for dimensional modelling in a relational database. • What types of queries are supported? • How will “slice and dice” queries look like on such a schema? December 14, 2022 PA220 DB for Analytics 20 A dimension can contain many attributes A fact table has facts (measures) and a key to each related dimension. Star Schema – Key Points • Dimensions relate directly to the fact table only. • The dimensions are denormalized • i.e., Sales Person Region does not have a related region lookup table • as an OLTP design would likely have. • Usually, the dimension keys are NOT keys from the source systems • they are generated by DW load process • -> called surrogate keys • Grain of the facts • defined by the dimension attributes • i.e., how detailed the measures are. • Warning! This is not a relational design so be careful if you are an OLTP developer. December 14, 2022 PA220 DB for Analytics 21 The Snowflake Schema December 14, 2022 PA220 DB for Analytics 22 • When a dimension relates to another dimension • Beware! OLTP designers must resist the urge to normalize by creating snowflakes. • Snowflakes cause a number of performance and usability issues and are rarely justified. The Snowflake Schema • + Hierarchies are made explicit/visible • + Very flexible • + Dimension tables use less space • However, this is a minor saving • Disk space of dimensions is typically less than 5 percent of disk for DW • - Harder to use due to many joins • - Worse performance • e.g., efficient bitmap indexes are not applicable December 14, 2022 PA220 DB for Analytics 23 Some Key Terms • Surrogate Keys • artificially created keys (usually integers) used only by the data warehouse to uniquely identify a row in a dimension table. • Grain • level of detail a fact row represents. For example, sale amount of a single item at a given date/time by salesperson A in the Boston store. • Must be also consistent with the level of detail of dimensions of the fact table. • Conformed Dimension • Different source systems (CRM versus Sales) often have differences in the list of dimension values they support. The CRM system may not have enclosed branches, but the sales system does. A consolidated list of dimension values that supports all the source systems values is called a conformed dimension. Conformed dimensions are critical for a successful data warehouse. December 14, 2022 PA220 DB for Analytics 24 Surrogate Keys • Surrogate keys are integers that are assigned sequentially in a dimension table, e.g., 1, 2, 3, … • Should be used instead of natural operational production codes. • Many advantages over operational codes • Make the DW independent from operational changes • e.g., re-use of old operational keys after some time • Avoid key overlap problem when consolidating data • Dimension keys should not contain “intelligence” • Should be stored explicitly as additional attribute • Performance: Small integer vs. long alpha-numeric code • Results in smaller fact tables • 1 Byte in a 1 billion fact table translates into 1 GB disk space December 14, 2022 PA220 DB for Analytics 25 Steps of Dimensional Modelling 1. Choose the business process 2. Declare the grain 3. Identify the dimensions 4. Identify the facts December 14, 2022 PA220 DB for Analytics 26 DM Steps: 1. Choose the business process The basics in the design build on the actual business process which the data warehouse should cover. Therefore, the first step in the model is to describe the business process which the model builds on. This could for instance be a sales situation in a retail store. To describe the business process, one can choose to do this in plain text or use basic Business Process Modeling Notation (BPMN) or other design guides like the Unified Modeling Language (UML). December 14, 2022 PA220 DB for Analytics 27 DM Steps: 2. Declare the grain The grain of the model is the exact description of what the dimensional model should be focusing on. This could for instance be "An individual line item on a customer slip from a retail store". To clarify what the grain means, you should pick the central process and describe it with one sentence. Furthermore, the grain (sentence) is what you are going to build your dimensions and fact table from. You might find it necessary to go back to this step to alter the grain due to new information gained on what your model is supposed to be able to deliver. December 14, 2022 PA220 DB for Analytics 28 DM Steps: 3. Identify the dimensions The dimensions must be defined within the grain from the second step of this 4-step process. Dimensions are the foundation of the fact table and is where the data for the fact table is collected. Typically, dimensions are nouns like date, store, inventory etc. These dimensions are where all the data is stored. For example, the date dimension could contain data such as year, month and weekday. December 14, 2022 PA220 DB for Analytics 29 Mary Jones buys 1 book for $22.50 entitled "Agile Data Warehouse Design" on December 2, 2013 at 3:12 PM via Amazon.com using her Visa card to be delivered on December 10, 2013 by UPS. DM Steps: 4. Identify the facts After defining the dimensions, the next step in the process is to make values for the fact table. This step is to identify the numeric facts that will populate each fact table row. This step is closely related to the business users of the system, since this is where they get access to data stored in the data warehouse. Therefore, most of the fact table rows are numerical, additive figures such as quantity or cost per unit, etc. December 14, 2022 PA220 DB for Analytics 30 The Seven W’s of DW Design December 14, 2022 PA220 DB for Analytics 31 • One instance of the event: • Mary Jones buys 1 book for $22.50 entitled "Agile Data Warehouse Design" on December 2, 2013 at 3:12 PM via Amazon.com using her Visa card to be delivered on December 10, 2013 by UPS. • The W’s: • How • What • When • Where • Who • How many/much • Why The Seven W’s of DW Design December 14, 2022 PA220 DB for Analytics 32 • One instance of the event: • Mary Jones buys 1 book for $22.50 entitled "Agile Data Warehouse Design" on December 2, 2013 at 3:12 PM via Amazon.com using her Visa card to be delivered on December 10, 2013 by UPS. Who What How much (fact) When Where How The Seven W’s of DW Design • Intuitive and natural for business users • Efficient way to get the required details • Provides a jumping-off point to get other information • e.g., “Mary ordered via the internet. Are there other outlets for buying products?” or “Mary is an individual, do you have groups or corporate customers?” • Helps you focus on a single process at a time. December 14, 2022 PA220 DB for Analytics 33 Data Warehouse Bus Matrix • Finding common dimensions • Create one dimension per column • Create one fact table per row December 14, 2022 PA220 DB for Analytics 34 Slowly Changing Dimension (SCD) • Dimension values can change • How the change is handled depends on the value the business places on knowing the historical values of a dimension. • Example: • The New York store is reassigned from the Northeast Region to the Middle Region. • Does management want to be able to see changes in sales as the Northeast Region before the change and the Middle Region after the change? December 14, 2022 PA220 DB for Analytics 35 SCD Type 1: Overwrite • Simply overwrite the existing dimension data with the new information • Adv: Easy to implement • Disadv: Lose ability to see how data looked in past -> no history of values • Example: • After the change, all sales in history of the NY store will be counted to the Middle Region. December 14, 2022 PA220 DB for Analytics 36 SCD Type 2: Add New Row • Keep all historical values – take snapshots / make versioning • Adv: Better ability to report accurately historically. • Disadv: Most complex to implement. • Example: • Sales in New York store made prior to the change will be reported in the Northeast Region • Any sales after that will be reported in the Middle Region. December 14, 2022 PA220 DB for Analytics 37 id store_name store_city store_region valid_from valid_to current_row 123 SkirtJoyce New York NorthEast 2010-01-14 2020-09-21 expired 124 SkirtJoyce New York Middle 2020-09-22 NULL current Are NULLs allowed? New surrogate key SCD Type 2: Implications to data • Updating a row in the dimension table implies • a new surrogate key value, which implies • updating the fact table (change all recs to the new value) → slow… impossible; or • querying for all data – use multiple keys of the same store. • But can I find the previous versions from the current one? • We must add a new attribute for making the relationship, e.g., store_id, or the original key. • leaving the same value → dropping the primary key constraint • no changes to the fact table, but the foreign key must be dropped too. • not common December 14, 2022 PA220 DB for Analytics 38 SCD Type 3: Add New Attribute • Keep the prior value and the new value • Adv: Easier to implement than SCD type 2 while still providing some support for historical reporting • Disadv: Historical reporting is limited. • Example: • Sales in New York store made prior to the change will be reported in the Northeast Region but any sales after that will be reported in the Middle Region. • However, if the store was moved to the Southeast Region subsequently, you would lose that it was ever in the Northeast Region. December 14, 2022 PA220 DB for Analytics 39 id store_name store_city store_region prev_store_region valid_from 123 SkirtJoyce New York Middle NorthEast 2020-09-22 SCD Type 4: Add Mini-Dimension • For rapidly changing attributes of a dimension • Introduces a new mini-dimension over such attributes • Fact table is extended by the new dimension December 14, 2022 PA220 DB for Analytics 40 id store_name store_city 123 SkirtJoyce New York id store_region valid_from valid_to 1 NorthEast 2010-01-14 2020-09-21 2 Middle 2020-09-22 9999-12-31 Store dimension Region dimension Attributes valid_from/to were added here just to keep the information when the change happened. It is not required by design. SCD Type 5: Mini-dim & Type 1 Outrigger • Preserves historical attribute value and historical facts as of current attribute values. December 14, 2022 PA220 DB for Analytics 41 id store_name store_city cur_region 123 SkirtJoyce New York 2 id store_region valid_from valid_to 1 NorthEast 2010-01-14 2020-09-21 2 Middle 2020-09-22 9999-12-31 Store dimension Region dimension SCD Type 6 • Preserves history as Type 5 • current value is added on top. • on change, all rows of the “durable” key are systematically updated. • Mix of types 1, 2 and 3 • Makes roll-ups easier for current value • cur_store_region is available in all rows December 14, 2022 PA220 DB for Analytics 42 id store_id store_name store_city cur_store_region prev_store_region valid_from valid_to current_row 123 NY1 SkirtJoyce New York MiddleEast NorthEast 2010-01-14 2020-09-21 expired 124 NY1 SkirtJoyce New York MiddleEast Middle 2020-09-22 2021-07-31 expired 348 NY1 SkirtJoyce New York MiddleEast MiddleEast 2021-08-01 9999-12-31 current SCD Type 0: Retain Original • Passive method • when source value changes, we do not do the update in DW. • Ignore changes ((-: • E.g., the attribute is obsolete – fax number December 14, 2022 PA220 DB for Analytics 43 SCD Types – Summary • SCD 0 - Passive method (No Change) • SCD 1 - Overwriting the old value (Latest Record Only) • SCD 2 - Creating a new additional record (Maintains History) • SCD 3 - Creating an additional Column (Rarely Used) • SCD 4 - Using history table, for rapidly changing dimensions • SCD 5 - History table and “nested” dimension • SCD 6 - The hybrid approach of overwriting, adding attrs and history • SCD 7 - Dual type 1 and type 2 dimensions • places both the durable and surrogate key to the fact table December 14, 2022 PA220 DB for Analytics 44 The Date Dimension • Typically uses the date value in integer form, e.g., 20130101 as its key. • Used to provide descriptive date information, i.e., month name, quarter, day of week, etc. • Often has multiple keys in the fact table that point to it. There are called the role-playing dimensions. • Can be substituted by view to differentiate the references semantically. • Load with all past and future dates possible from the data. • aka created in advance • SSAS has a wizard to generate this dimension data. December 14, 2022 PA220 DB for Analytics 45 The Date Dimension • “Meaningful” values are important, e.g., for report generation • Holiday/Nonholiday vs. Yes/no • Some other events, …. • Time-of-day is a separate dimension • Date dimension vs. SQL date type • Many date attributes are not supported in SQL, e.g., fiscal month • Business user is not versed in SQL • Date dimension is relatively small • 10 years = 3,650 rows December 14, 2022 PA220 DB for Analytics 46 Degenerate Dimensions • “empty”, i.e., dimension keys without dimension table • Examples are operational control numbers • e.g., order #, invoice #, POS transaction #, etc. • Contain important information • Still useful to serve as part of primary key in fact table or for grouping • e.g., grouping by POS transaction number to retrieve all products purchased in a single transaction December 14, 2022 PA220 DB for Analytics 47 Amount of Dimensions • Typically, 4-15 per fact table • Too many dimensions is bad • A sign that dimensions are not independent, and hence should be combined • Significantly increases space requirements of fact table December 14, 2022 PA220 DB for Analytics 48 Redundancy in DW • Only very little redundancy in fact tables • The same fact data (generally) only stored in one fact table • Redundancy is mostly in dimension tables • Star dimension tables have redundant entries for the higher levels • Redundancy problems? • Inconsistent data – the central load process helps with this • Update time – the DW is optimized for querying, not updates • Space use – dimension tables typically take up less than 5% of DW • So: controlled redundancy is good • Up to a certain limit December 14, 2022 PA220 DB for Analytics 49 Summary • Dimensional modelling – four steps • Key terms: surrogate keys, grain, conformed dim., bus matrix • Star schema vs. snowflake schema • Slowly changing dimensions • Date dimension • can become a role-playing dimension, when referenced multiple times (issue date, settlement date, due date, …) December 14, 2022 PA220 DB for Analytics 50