ETL Process
PA220: Database Systems for Data Analytics

Contents
• Overview of ETL
• Data Cleaning
• Loading Tips
• Issues
• Summary

ETL Process Overview
• Data is periodically brought from the ODS to the data warehouse.
• In most DW systems, the ETL process is the most complex part,
  • and also the most underestimated and time-consuming one.
• Often, 80% of development time is spent on ETL.

[Figure: ETL architecture. Raw data flows from the sources (e-shop, ERP, CRM, services, web logs, files, transaction data, stock management) into the data staging area (capture of data changes; cleaning, integrating, transforming; staging data, aggregates, pre-calculated storage), then into the data warehouse (metadata, data dimensions, summary data) and on to the data marts (sales, marketing), which feed reporting, visualization, and analysis.]

Data Staging Area
• Transit storage for data underway in the ETL process.
• Transformations/cleansing are done here.
• No user queries (though some systems allow them).
• Few, sequential operations on large data volumes.
• Performed by central ETL logic.
• Easily restarted.
• No need for locking, logging, etc.
• RDBMS or flat files? (DBMSs have become better at this.)
• Finished dimensions are copied from the staging area to the relevant marts.

ETL Process Types
• When do we run the ETL process?
• Considerations:
  • Overhead on the data warehouse and source sides.
    • E.g., online propagation puts a permanent burden on both sides and cannot benefit from bulk-loading mechanisms.
  • Data staleness.
    • Frequent updates reduce staleness but increase overhead.
  • Debugging, failure handling.
    • With online/stream-based mechanisms, it may be more difficult to track down problems.
  • A different process for different flavors of data?
    • E.g., a periodic refresh may work well for small (dimension) tables.

[Figure: star-schema model]

Data Extraction: Getting Data
• Source → staging table:
  • Tool selection depends on the data source
    • database, XML, flat files, etc.
  • Use SQL, XQuery, Perl, awk, etc. to query the source system.
• Often:
  • extract the source data to a flat file (e.g., CSV),
  • then bulk-load it into the staging table.
• Data compression for large data transfers.
• Data encryption if the transfer runs over public networks.

Data Extraction: Capturing Data Changes
• Detecting changes is a challenge. Four common approaches, with a small sketch of the first one below:
• Audit columns
  • E.g., a "last modified" timestamp.
  • Set timestamps or "new" flags on every row update. How?
  • Unset "new" flags on every load into the DW. Why?
• Full diff
  • Keep the old snapshot and diff it against the current version.
  • Thorough: will detect any change.
  • Resource-intensive: large volumes must be moved and scanned.
  • Optimization: hashes/checksums to speed up the comparison.
• Database log scraping
  • The database's write-ahead log contains all change information.
  • Scraping the log may get messy, though.
  • Variant: create a message stream ODS → DW.
• Message queue monitoring
  • The source system must use a messaging framework; then the overhead is low.
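The audit-column approach can be sketched in a few lines of SQL. A minimal illustration in PostgreSQL syntax; the names (ods.orders, staging.orders_delta, etl.load_log, last_modified) are assumptions for the example, not part of the lecture:

    -- Incremental extract via an audit column: pull only the rows
    -- modified since the previous run's high-water mark.
    INSERT INTO staging.orders_delta
    SELECT o.*
    FROM   ods.orders AS o
    WHERE  o.last_modified >
           (SELECT max(loaded_until) FROM etl.load_log);

    -- Remember the new high-water mark for the next run.
    INSERT INTO etl.load_log (loaded_until) VALUES (now());

In practice, the high-water mark should be captured before the extract starts, so that rows updated while the extract runs are not silently skipped by the next run.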
Data Cleansing
• After extraction, data must be normalized and cleaned.

Data Quality (Revision)
• Data in the DW must be:
  • Precise: DW data must match known numbers, or an explanation is needed.
  • Complete: the DW has all relevant data, and the users know it.
  • Consistent: no contradictory data; aggregates fit with the detail data.
  • Unique: the same thing is called the same and has the same key (e.g., customers).
  • Timely: data is updated "frequently enough", and the users know when.
• Data almost never has decent quality.

Data Cleansing
• Problem:
  • Real-world data is messy.
  • Consistency rules in the OLTP system?
  • A lot of data is still entered by people.
  • Data warehouses serve as an integration platform.
• Typical cleaning and normalization tasks:
  • Correct spelling errors / convert data types.
  • Handle missing/null values.
  • Identify record matches and duplicates.
  • Resolve conflicts and inconsistencies.
  • Normalize ("conform") data.

Data Cleansing: Primitives
• Parsing
  • E.g., the source table has an 'address' column, whereas the target table has 'street', 'zip', and 'city' columns; also normalize pieces of a string (e.g., "Road" → "Rd").
• Similarity join: bring similar data together
  • for record matching (same-entity recognition) and deduplication.
• Clustering: put items into groups based on "similarity"
  • e.g., as pre-processing for deduplication.
• Outlier detection: find values not matching the pattern
  • e.g., the failure of a sensor, or the detection of a new "class"/entity.

Data Cleansing: Similarity Join
• The process of identifying duplicates.
• Similarity measures:
  • edit distance, Jaccard coefficient, Soundex.
• The similarity threshold is important:
  • it limits the number of candidates for duplicates!

Similarity Join – Edit Distance
• What is the "similarity" between strings s1 and s2?
• d_edit(s1, s2) = the minimum number of edit operations needed to transform s1 into s2.
• E.g., for s1 = "Sweet" and s2 = "Sweat", a single replacement (e → a) suffices, so the distance is 1.
• Levenshtein distance: insert, delete, replace only.
• Longest common subsequence (LCS): insert, delete only.
(Source: Jens Teubner, Data Warehousing)

Similarity Join – Jaccard Coefficient
• Similarity of two sets S1 and S2,
  • obtained by comparing the cardinalities of their intersection and union: J(S1, S2) = |S1 ∩ S2| / |S1 ∪ S2|.
• q-grams: converting a string to a set,
  • i.e., the set of all its substrings of length q:
  • 2-grams("Sweet") = { Sw, we, ee, et }.
(Source: Jens Teubner, Data Warehousing)

Similarity Join – Soundex
• A phonetic algorithm to index words by sound.
1. Retain the first letter.
2. Replace the remaining letters with numbers.
  • Alike sounds map to the same number.
  • Letters with no mapping are dropped.
3. Drop a letter if the preceding letter yielded the same number.
4. Collect three numbers, padding with zeros if necessary.
(Source: Jens Teubner, Data Warehousing)
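To make the measures concrete: a minimal similarity-join sketch in PostgreSQL, assuming the fuzzystrmatch extension (which provides soundex() and levenshtein()); the staging.customers table and its columns are illustrative assumptions. Soundex serves as a cheap phonetic pre-filter, and the edit-distance threshold limits the number of candidate duplicates:

    CREATE EXTENSION IF NOT EXISTS fuzzystrmatch;

    -- Candidate duplicate pairs: same Soundex code, edit distance <= 2.
    SELECT c1.id, c2.id, c1.last_name, c2.last_name
    FROM   staging.customers AS c1
    JOIN   staging.customers AS c2
      ON   c1.id < c2.id                                  -- symmetry: each pair once
     AND   soundex(c1.last_name) = soundex(c2.last_name)  -- phonetic pre-filter
    WHERE  levenshtein(c1.last_name, c2.last_name) <= 2;  -- similarity threshold

The Soundex equality condition is, in effect, a blocking strategy: only pairs that fall into the same phonetic block are ever compared with the more expensive edit distance (see the algorithms on the next slides).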
Similarity Join – Algorithms
• Naïve strategy
  • O(n²)
• Blocking naïve strategy
  • assume b blocks and compare only within each block:
  • O(½ (n²/b − n))
• Sorted neighborhood
  • sort the inputs and scan with a sliding window of size w.
• An index can accelerate the similarity range query R(q, threshold).

Similarity Join – Naïve Strategy
• Compare every record with every other record.
• Assume symmetry of the distance, so each pair is compared only once.
(Source: Jens Teubner, Data Warehousing)

Similarity Join – Blocking Naïve Strategy
• Partition the input data into blocks.
  • Duplicates must end up in the same block!
  • Disadvantage: typically uneven block sizes…
• Compare all pairs within blocks only.
• E.g., for matching customers:
  • use their ZIP code (assuming the ZIP has not changed, i.e., the customer has not moved), or
  • use the first character of the last name.
(Source: Jens Teubner, Data Warehousing)

Similarity Join – Sorted Neighborhood
• Assign a sort key to each record.
• Sort the records.
• Apply a sliding window of size w across the sorted list and join within it.
• E.g., sort customers by
  • the first 3 consonants of the last name, or
  • the first letter of the last name and the first 2 digits of the ZIP code.
(Source: Jens Teubner, Data Warehousing)

Data Cleansing: Detecting Inconsistencies
• Data (quality) screening system:
• Column screens: test the data within a column.
  • Correct value ranges, value formatting, null values?
  • Detect random/noise values.
• Structure screens: test relationships across columns.
  • Two or more columns implement a hierarchy (e.g., a series of m:n relationships).
  • Foreign-key relationships between tables.
  • A combination of columns forms a valid item, e.g., an existing postal address.
• Business rule screens: is the data plausible according to business rules?
  • E.g., customer status X requires Y years of loyalty and Z EUR total revenue in the previous period.

Data Cleansing: Error Handling
• Halt the process on error.
  • Requires manual intervention: diagnose, then restart/resume the job or abort it.
• Create a suspense file (logging).
  • Log the errors in a side channel for later processing.
  • It is not clear when to handle its contents: fix the records and re-introduce them to the job?
  • Until these data items are restored, the overall DB integrity is questionable.
• Tag the data and continue.
  • Bad fact records: create an audit dimension.
  • Bad dimension data: use unique error values.
  • The best solution whenever possible.

Data Cleansing: Error Handling by Logging
• A special error event schema can be created
  • as a result of "tag the data and continue".
• The grain corresponds to one error occurrence.
• Batch dimension: information about the job.
• Date dimension: not the minute and second of the error,
  • but rather the weekday, the last day of the fiscal period, etc., to constrain and summarize errors.
• Time of day: the timestamp of when the error occurred.
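The "create a suspense file" option above can be sketched as two SQL statements: offending rows are diverted into a side table together with a reason, then removed from the main flow. A minimal illustration; the tables and the value-range rule are assumptions for the example:

    -- Column screen: quantity must lie in a plausible range.
    INSERT INTO etl.suspense_orders
    SELECT s.*, 'quantity out of range' AS error_reason, now() AS logged_at
    FROM   staging.orders AS s
    WHERE  s.quantity NOT BETWEEN 1 AND 10000;

    -- Remove the offending rows so that the load can continue.
    DELETE FROM staging.orders
    WHERE  quantity NOT BETWEEN 1 AND 10000;

Until the suspended rows are fixed and re-introduced, aggregates over the loaded data remain questionable, which is why tagging is usually the preferred option.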
Data Cleansing: Error Handling by Tagging
• Audit dimension
  • attached to the resulting fact table,
  • created during data cleansing,
  • stores audit conditions.
• Example:
  • An ETL job finished with no error:
    • a new audit record describing it is created,
    • and all new fact records are associated with it.
  • If an error occurred (e.g., a value out of bounds):
    • another audit record is created, and the failing fact records are attached to it.

Improving Data Quality
• Appoint "data stewards" responsible for data quality.
  • A given steward has the responsibility for certain tables.
  • This includes manual inspections and corrections!
• DW-controlled improvements
  • Default values.
  • A "Not yet assigned 157" note to the data steward.
• Source-controlled improvements
  • The optimal solution?
• Construct programs that check data quality.
  • Are the totals as expected?
  • Do the results agree with an alternative source?
• Do not fix all problems with data quality.
  • Allow management to see "weird" data in their reports?

Data Transformation: Schema Integration
• Different source systems, types, and schemas must be integrated.
• Infer the mapping between schemas (automatically)?
• Tools:
  • compare table and attribute names; consider synonyms and homonyms;
  • infer data types/formats and mapping rules;
  • techniques like similarity joins and deduplication.
• Still: often a lot of manual work is needed.

Data Loading: Prepare Dimension Tables
• For each dimension, do the following checks:
  • Is the dimension row new?
    • Generate the surrogate keys.
  • Have attributes in the dimension changed?
    • Handle the updates respecting the SCD type of the dimension.

Data Loading: Prepare Dim Tables - Problems
• "Upsert": update if the row exists, else insert (aka SQL-based update).
  • Often a real performance killer.
  • Better: separate the updates from the inserts, and bulk-load the inserts.
• Generate and look up dimension surrogate keys.
  • E.g., use the key generator of the back-end DB.
  • Maintain a "Dim Cross Ref" table in memory or in the back-end DB.
• Dimensions must be updated before facts.
  • The relevant dimension rows for new facts must be in place.
• Special key considerations apply if the initial load must be performed again.
• Aggregates may need re-computing (Type 1 updates).
  • Again, bulk loading/changing is a good choice.

Data Loading: Performance Tips
1. Turn off logging.
  • Databases maintain a write-ahead log to implement failure-tolerance mechanisms.
  • Row-by-row logging causes huge overhead.
2. Disable indexes and reindex after the updates.
3. Pre-sort the data.
  • Depending on the system, this may speed up index construction.
  • Additional benefit: it may result in a better physical layout.
4. Truncate the table
  • when loading from scratch.
5. Enable "fast mode".
  • If the data is prepared properly, the database may use faster parsing mechanisms,
  • e.g., the "copy from" command.
6. Make sure the data is correct.
  • Transformation, field truncation, and error reporting may slow down bulk loading significantly.
7. Temporarily disable integrity control.
  • Avoid checking during the load, but do the deferred checks in bulk, too;
  • e.g., foreign keys in the fact table.
8. Parallelize.
  • Dimensions can be loaded concurrently.
  • Fact tables can be loaded concurrently.
  • Partitions of one fact table can be loaded concurrently
    • when horizontal partitioning is used.
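Tips 2 and 5 combined in one hedged PostgreSQL sketch (the table, index, and file path are illustrative): drop the index, bulk-load with COPY, and rebuild the index once at the end instead of maintaining it row by row:

    BEGIN;
    DROP INDEX IF EXISTS sales_fact_date_idx;    -- tip 2: no per-row index maintenance

    COPY dw.sales_fact FROM '/staging/sales.csv'
         WITH (FORMAT csv);                      -- tip 5: the fast bulk-load path

    CREATE INDEX sales_fact_date_idx
        ON dw.sales_fact (date_key);             -- reindex once, after the load
    COMMIT;

Note that COPY reads the file on the database server; the psql \copy variant streams it from the client instead.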
Hints on ETL Design
• Do not try to implement all transformations in one step!
• Do one (or just a few) thing(s) at a time:
  • Copy the source data one-to-one to the staging area.
  • Compute deltas
    • only if doing an incremental load.
  • Handle versions and generate DW keys (see the SCD sketch at the end of this section)
    • versions only if handling slowly changing dimensions.
  • Implement complex transformations.
  • Load dimensions.
  • Load facts.

General Issues / Decisions
• Files versus streams/pipes
  • Streams/pipes: no disk overhead, fast throughput.
  • Files: easier restart, often the only possibility.
• ETL tool or self-coding?
  • Code: easy start, co-existence with the IT infrastructure.
  • Tool: better productivity on subsequent projects.
• Load frequency
  • ETL time depends on the processed data volumes.
  • A daily load is much faster than a monthly one.
  • This applies to all steps in the ETL process.
• Should the DW be online 24/7?
  • Use partitions or several sets of tables.

Summary
• ETL is very time-consuming (80% of the entire project).
• It needs to be implemented as a sequence of many small steps.
• Types of ETL processes.
• Extraction of data from source systems can be very time-consuming;
  • an incremental approach is suggested.
• Transformation into the DW format includes many steps, such as
  • building keys, cleansing the data, handling inconsistent/duplicate data, etc.
• Load includes loading the data into the DW, updating indexes, computing pre-aggregates, etc.
• Performance issues and tips.
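The SCD sketch referenced under "Hints on ETL Design" above: handling versions and DW keys for a Type-2 slowly changing dimension in PostgreSQL, as a separate bulk UPDATE plus bulk INSERT rather than a row-by-row upsert. All names (dw.customer_dim, the tracked attribute city, the key sequence) are assumptions for the example:

    -- Step 1: expire the current version of every customer whose
    -- tracked attribute changed in the source.
    UPDATE dw.customer_dim AS d
    SET    valid_to = current_date, is_current = false
    FROM   staging.customers AS s
    WHERE  d.customer_bk = s.customer_bk
      AND  d.is_current
      AND  d.city IS DISTINCT FROM s.city;

    -- Step 2: bulk-insert a new version for changed and brand-new
    -- customers, drawing surrogate keys from a sequence.
    INSERT INTO dw.customer_dim
           (customer_sk, customer_bk, city, valid_from, valid_to, is_current)
    SELECT nextval('dw.customer_sk_seq'), s.customer_bk, s.city,
           current_date, DATE '9999-12-31', true
    FROM   staging.customers AS s
    LEFT JOIN dw.customer_dim AS d
           ON d.customer_bk = s.customer_bk AND d.is_current
    WHERE  d.customer_bk IS NULL;  -- no current version left (new or just expired)

Loading dimensions this way, before the facts, ensures that every incoming fact row can look up its surrogate key.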