ETL Process
PA220: Database Systems for Data Analytics

Contents
• Overview of ETL
• Data Cleaning
• Loading Tips
• Issues
• Summary

ETL Process Overview
• Data is periodically brought from the ODS to the data warehouse.
• In most DW systems, the ETL process is the most complex part,
  • and also the most underestimated and time-consuming one.
• Often, 80% of development time is spent on ETL.

[Figure: ETL architecture. Raw data flows from the sources (e-shop, ERP, CRM, services, web logs, files, transaction data, stock management) into the data staging area (capture of data changes; cleaning, integrating, transforming; staging data, aggregates, pre-calculated storage), then into the data warehouse (metadata, data dimensions, summary data) and on to the data marts (sales, marketing), which feed reporting, visualization, and analysis.]

Data Staging Area
• Transit storage for data underway in the ETL process.
• Transformations/cleansing are done here.
• No user queries (though some systems allow them).
• Few, sequential operations on large data volumes.
• Performed by central ETL logic.
• Easily restarted.
• No need for locking, logging, etc.
• RDBMS or flat files? (DBMSs have become better at this.)
• Finished dimensions are copied from the staging area to the relevant marts.

ETL Process Types
• When do we run the ETL process?
• Considerations:
  • Overhead on the data warehouse and source sides.
    • E.g., online propagation puts a permanent burden on both sides and cannot benefit from bulk-loading mechanisms.
  • Data staleness.
    • Frequent updates reduce staleness but increase overhead.
  • Debugging, failure handling.
    • With online/stream-based mechanisms, it may be more difficult to track down problems.
  • A different process for different flavors of data?
    • E.g., a periodic refresh may work well for small (dimension) tables.

[Figure: star-schema model]

Data Extraction: Getting Data
• Source → staging table:
  • Tool selection depends on the data source
    • database, XML, flat files, etc.
  • Use SQL, XQuery, Perl, awk, etc. to query the source system.
• Often:
  • extract the source data to a flat file (e.g., CSV),
  • then bulk-load it into the staging table.
• Data compression for large data transfers.
• Data encryption if the transfer runs over public networks.

Data Extraction: Capturing Data Changes
• Detecting changes is a challenge. Four common approaches, with a small sketch of the first one below:
• Audit columns
  • E.g., a "last modified" timestamp.
  • Set timestamps or "new" flags on every row update. How?
  • Unset "new" flags on every load into the DW. Why?
• Full diff
  • Keep the old snapshot and diff it against the current version.
  • Thorough: will detect any change.
  • Resource-intensive: large volumes must be moved and scanned.
  • Optimization: hashes/checksums to speed up the comparison.
• Database log scraping
  • The database's write-ahead log contains all change information.
  • Scraping the log may get messy, though.
  • Variant: create a message stream ODS → DW.
• Message queue monitoring
  • The source system must use a messaging framework; then the overhead is low.
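The audit-column approach can be sketched in a few lines of SQL. A minimal illustration in PostgreSQL syntax; the names (ods.orders, staging.orders_delta, etl.load_log, last_modified) are assumptions for the example, not part of the lecture:

    -- Incremental extract via an audit column: pull only the rows
    -- modified since the previous run's high-water mark.
    INSERT INTO staging.orders_delta
    SELECT o.*
    FROM   ods.orders AS o
    WHERE  o.last_modified >
           (SELECT max(loaded_until) FROM etl.load_log);

    -- Remember the new high-water mark for the next run.
    INSERT INTO etl.load_log (loaded_until) VALUES (now());

In practice, the high-water mark should be captured before the extract starts, so that rows updated while the extract runs are not silently skipped by the next run.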
Data Cleansing
• After extraction, data must be normalized and cleaned.

Data Quality (Revision)
• Data in the DW must be:
  • Precise: DW data must match known numbers, or an explanation is needed.
  • Complete: the DW has all relevant data, and the users know it.
  • Consistent: no contradictory data; aggregates fit with the detail data.
  • Unique: the same thing is called the same and has the same key (e.g., customers).
  • Timely: data is updated "frequently enough", and the users know when.
• Data almost never has decent quality.

Data Cleansing
• Problem:
  • Real-world data is messy.
  • Consistency rules in the OLTP system?
  • A lot of data is still entered by people.
  • Data warehouses serve as an integration platform.
• Typical cleaning and normalization tasks:
  • Correct spelling errors / convert data types.
  • Handle missing/null values.
  • Identify record matches and duplicates.
  • Resolve conflicts and inconsistencies.
  • Normalize ("conform") data.

Data Cleansing: Primitives
• Parsing
  • E.g., the source table has an 'address' column, whereas the target table has 'street', 'zip', and 'city' columns; also normalize pieces of a string (e.g., "Road" → "Rd").
• Similarity join: bring similar data together
  • for record matching (same-entity recognition) and deduplication.
• Clustering: put items into groups based on "similarity"
  • e.g., as pre-processing for deduplication.
• Outlier detection: find values not matching the pattern
  • e.g., the failure of a sensor, or the detection of a new "class"/entity.

Data Cleansing: Similarity Join
• The process of identifying duplicates.
• Similarity measures:
  • edit distance, Jaccard coefficient, Soundex.
• The similarity threshold is important:
  • it limits the number of candidates for duplicates!

Similarity Join – Edit Distance
• What is the "similarity" between strings s1 and s2?
• d_edit(s1, s2) = the minimum number of edit operations needed to transform s1 into s2.
• E.g., for s1 = "Sweet" and s2 = "Sweat", a single replacement (e → a) suffices, so the distance is 1.
• Levenshtein distance: insert, delete, replace only.
• Longest common subsequence (LCS): insert, delete only.
(Source: Jens Teubner, Data Warehousing)

Similarity Join – Jaccard Coefficient
• Similarity of two sets S1 and S2,
  • obtained by comparing the cardinalities of their intersection and union: J(S1, S2) = |S1 ∩ S2| / |S1 ∪ S2|.
• q-grams: converting a string to a set,
  • i.e., the set of all its substrings of length q:
  • 2-grams("Sweet") = { Sw, we, ee, et }.
(Source: Jens Teubner, Data Warehousing)

Similarity Join – Soundex
• A phonetic algorithm to index words by sound.
1. Retain the first letter.
2. Replace the remaining letters with numbers.
  • Alike sounds map to the same number.
  • Letters with no mapping are dropped.
3. Drop a letter if the preceding letter yielded the same number.
4. Collect three numbers, padding with zeros if necessary.
(Source: Jens Teubner, Data Warehousing)
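To make the measures concrete: a minimal similarity-join sketch in PostgreSQL, assuming the fuzzystrmatch extension (which provides soundex() and levenshtein()); the staging.customers table and its columns are illustrative assumptions. Soundex serves as a cheap phonetic pre-filter, and the edit-distance threshold limits the number of candidate duplicates:

    CREATE EXTENSION IF NOT EXISTS fuzzystrmatch;

    -- Candidate duplicate pairs: same Soundex code, edit distance <= 2.
    SELECT c1.id, c2.id, c1.last_name, c2.last_name
    FROM   staging.customers AS c1
    JOIN   staging.customers AS c2
      ON   c1.id < c2.id                                  -- symmetry: each pair once
     AND   soundex(c1.last_name) = soundex(c2.last_name)  -- phonetic pre-filter
    WHERE  levenshtein(c1.last_name, c2.last_name) <= 2;  -- similarity threshold

The Soundex equality condition is, in effect, a blocking strategy: only pairs that fall into the same phonetic block are ever compared with the more expensive edit distance (see the algorithms on the next slides).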
Similarity Join – Algorithms
• Naïve strategy
  • O(n²)
• Blocking naïve strategy
  • assume b blocks and compare only within each block:
  • O(½ (n²/b − n))
• Sorted neighborhood
  • sort the inputs and scan with a sliding window of size w.
• An index can accelerate the similarity range query R(q, threshold).

Similarity Join – Naïve Strategy
• Compare every record with every other record.
• Assume symmetry of the distance, so each pair is compared only once.
(Source: Jens Teubner, Data Warehousing)

Similarity Join – Blocking Naïve Strategy
• Partition the input data into blocks.
  • Duplicates must end up in the same block!
  • Disadvantage: typically uneven block sizes…
• Compare all pairs within blocks only.
• E.g., for matching customers:
  • use their ZIP code (assuming the ZIP has not changed, i.e., the customer has not moved), or
  • use the first character of the last name.
(Source: Jens Teubner, Data Warehousing)

Similarity Join – Sorted Neighborhood
• Assign a sort key to each record.
• Sort the records.
• Apply a sliding window of size w across the sorted list and join within it.
• E.g., sort customers by
  • the first 3 consonants of the last name, or
  • the first letter of the last name and the first 2 digits of the ZIP code.
(Source: Jens Teubner, Data Warehousing)

Data Cleansing: Detecting Inconsistencies
• Data (quality) screening system:
• Column screens: test the data within a column.
  • Correct value ranges, value formatting, null values?
  • Detect random/noise values.
• Structure screens: test relationships across columns.
  • Two or more columns implement a hierarchy (e.g., a series of m:n relationships).
  • Foreign-key relationships between tables.
  • A combination of columns forms a valid item, e.g., an existing postal address.
• Business rule screens: is the data plausible according to business rules?
  • E.g., customer status X requires Y years of loyalty and Z EUR total revenue in the previous period.

Data Cleansing: Error Handling
• Halt the process on error.
  • Requires manual intervention: diagnose, then restart/resume the job or abort it.
• Create a suspense file (logging).
  • Log the errors in a side channel for later processing.
  • It is not clear when to handle its contents: fix the records and re-introduce them to the job?
  • Until these data items are restored, the overall DB integrity is questionable.
• Tag the data and continue.
  • Bad fact records: create an audit dimension.
  • Bad dimension data: use unique error values.
  • The best solution whenever possible.

Data Cleansing: Error Handling by Logging
• A special error event schema can be created
  • as a result of "tag the data and continue".
• The grain corresponds to one error occurrence.
• Batch dimension: information about the job.
• Date dimension: not the minute and second of the error,
  • but rather the weekday, the last day of the fiscal period, etc., to constrain and summarize errors.
• Time of day: the timestamp of when the error occurred.
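The "create a suspense file" option above can be sketched as two SQL statements: offending rows are diverted into a side table together with a reason, then removed from the main flow. A minimal illustration; the tables and the value-range rule are assumptions for the example:

    -- Column screen: quantity must lie in a plausible range.
    INSERT INTO etl.suspense_orders
    SELECT s.*, 'quantity out of range' AS error_reason, now() AS logged_at
    FROM   staging.orders AS s
    WHERE  s.quantity NOT BETWEEN 1 AND 10000;

    -- Remove the offending rows so that the load can continue.
    DELETE FROM staging.orders
    WHERE  quantity NOT BETWEEN 1 AND 10000;

Until the suspended rows are fixed and re-introduced, aggregates over the loaded data remain questionable, which is why tagging is usually the preferred option.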
Data Cleansing: Error Handling by Tagging
• Audit dimension
  • attached to the resulting fact table,
  • created during data cleansing,
  • stores audit conditions.
• Example:
  • An ETL job finished with no error:
    • a new audit record describing it is created,
    • and all new fact records are associated with it.
  • If an error occurred (e.g., a value out of bounds):
    • another audit record is created, and the failing fact records are attached to it.

Improving Data Quality
• Appoint "data stewards" responsible for data quality.
  • A given steward has the responsibility for certain tables.
  • This includes manual inspections and corrections!
• DW-controlled improvements
  • Default values.
  • A "Not yet assigned 157" note to the data steward.
• Source-controlled improvements
  • The optimal solution?
• Construct programs that check data quality.
  • Are the totals as expected?
  • Do the results agree with an alternative source?
• Do not fix all problems with data quality.
  • Allow management to see "weird" data in their reports?

Data Transformation: Schema Integration
• Different source systems, types, and schemas must be integrated.
• Infer the mapping between schemas (automatically)?
• Tools:
  • compare table and attribute names; consider synonyms and homonyms;
  • infer data types/formats and mapping rules;
  • techniques like similarity joins and deduplication.
• Still: often a lot of manual work is needed.

Data Loading: Prepare Dimension Tables
• For each dimension, do the following checks:
  • Is the dimension row new?
    • Generate the surrogate keys.
  • Have attributes in the dimension changed?
    • Handle the updates respecting the SCD type of the dimension.

Data Loading: Prepare Dim Tables - Problems
• "Upsert": update if the row exists, else insert (aka SQL-based update).
  • Often a real performance killer.
  • Better: separate the updates from the inserts, and bulk-load the inserts.
• Generate and look up dimension surrogate keys.
  • E.g., use the key generator of the back-end DB.
  • Maintain a "Dim Cross Ref" table in memory or in the back-end DB.
• Dimensions must be updated before facts.
  • The relevant dimension rows for new facts must be in place.
• Special key considerations apply if the initial load must be performed again.
• Aggregates may need re-computing (Type 1 updates).
  • Again, bulk loading/changing is a good choice.

Data Loading: Performance Tips
1. Turn off logging.
  • Databases maintain a write-ahead log to implement failure-tolerance mechanisms.
  • Row-by-row logging causes huge overhead.
2. Disable indexes and reindex after the updates.
3. Pre-sort the data.
  • Depending on the system, this may speed up index construction.
  • Additional benefit: it may result in a better physical layout.
4. Truncate the table
  • when loading from scratch.
5. Enable "fast mode".
  • If the data is prepared properly, the database may use faster parsing mechanisms,
  • e.g., the "copy from" command.
6. Make sure the data is correct.
  • Transformation, field truncation, and error reporting may slow down bulk loading significantly.
7. Temporarily disable integrity control.
  • Avoid checking during the load, but do the deferred checks in bulk, too;
  • e.g., foreign keys in the fact table.
8. Parallelize.
  • Dimensions can be loaded concurrently.
  • Fact tables can be loaded concurrently.
  • Partitions of one fact table can be loaded concurrently
    • when horizontal partitioning is used.
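Tips 2 and 5 combined in one hedged PostgreSQL sketch (the table, index, and file path are illustrative): drop the index, bulk-load with COPY, and rebuild the index once at the end instead of maintaining it row by row:

    BEGIN;
    DROP INDEX IF EXISTS sales_fact_date_idx;    -- tip 2: no per-row index maintenance

    COPY dw.sales_fact FROM '/staging/sales.csv'
         WITH (FORMAT csv);                      -- tip 5: the fast bulk-load path

    CREATE INDEX sales_fact_date_idx
        ON dw.sales_fact (date_key);             -- reindex once, after the load
    COMMIT;

Note that COPY reads the file on the database server; the psql \copy variant streams it from the client instead.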
Hints on ETL Design
• Do not try to implement all transformations in one step!
• Do one (or just a few) thing(s) at a time:
  • Copy the source data one-to-one to the staging area.
  • Compute deltas
    • only if doing an incremental load.
  • Handle versions and generate DW keys (see the SCD sketch at the end of this section)
    • versions only if handling slowly changing dimensions.
  • Implement complex transformations.
  • Load dimensions.
  • Load facts.

General Issues / Decisions
• Files versus streams/pipes
  • Streams/pipes: no disk overhead, fast throughput.
  • Files: easier restart, often the only possibility.
• ETL tool or self-coding?
  • Code: easy start, co-existence with the IT infrastructure.
  • Tool: better productivity on subsequent projects.
• Load frequency
  • ETL time depends on the processed data volumes.
  • A daily load is much faster than a monthly one.
  • This applies to all steps in the ETL process.
• Should the DW be online 24/7?
  • Use partitions or several sets of tables.

Summary
• ETL is very time-consuming (80% of the entire project).
• It needs to be implemented as a sequence of many small steps.
• Types of ETL processes.
• Extraction of data from source systems can be very time-consuming;
  • an incremental approach is suggested.
• Transformation into the DW format includes many steps, such as
  • building keys, cleansing the data, handling inconsistent/duplicate data, etc.
• Load includes loading the data into the DW, updating indexes, computing pre-aggregates, etc.
• Performance issues and tips.
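The SCD sketch referenced under "Hints on ETL Design" above: handling versions and DW keys for a Type-2 slowly changing dimension in PostgreSQL, as a separate bulk UPDATE plus bulk INSERT rather than a row-by-row upsert. All names (dw.customer_dim, the tracked attribute city, the key sequence) are assumptions for the example:

    -- Step 1: expire the current version of every customer whose
    -- tracked attribute changed in the source.
    UPDATE dw.customer_dim AS d
    SET    valid_to = current_date, is_current = false
    FROM   staging.customers AS s
    WHERE  d.customer_bk = s.customer_bk
      AND  d.is_current
      AND  d.city IS DISTINCT FROM s.city;

    -- Step 2: bulk-insert a new version for changed and brand-new
    -- customers, drawing surrogate keys from a sequence.
    INSERT INTO dw.customer_dim
           (customer_sk, customer_bk, city, valid_from, valid_to, is_current)
    SELECT nextval('dw.customer_sk_seq'), s.customer_bk, s.city,
           current_date, DATE '9999-12-31', true
    FROM   staging.customers AS s
    LEFT JOIN dw.customer_dim AS d
           ON d.customer_bk = s.customer_bk AND d.is_current
    WHERE  d.customer_bk IS NULL;  -- no current version left (new or just expired)

Loading dimensions this way, before the facts, ensures that every incoming fact row can look up its surrogate key.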