PA152: Efficient Use of DB 10. Schema Tuning Vlastislav Dohnal PA152, Vlastislav Dohnal, FI MUNI, 2022 2 Schema ◼ Relation schema relation name and a list of attributes, their types and integrity constraints E.g., ◼ Table student(uco, name, last_name, day_of_birth) ◼ Database schema Schema of all relations PA152, Vlastislav Dohnal, FI MUNI, 2022 3 Differences in Schema ◼ Same data organized differently ◼ Example of business requirements Suppliers ◼ Address Orders ◼ Part/product, quantity, supplier PA152, Vlastislav Dohnal, FI MUNI, 2022 4 Differences in Schema ◼ Alternatives Schema 1 ◼ Order1(supplier_id, part_id, quantity, supplier_address) Schema 2 ◼ Order2(supplier_id, part_id, quantity) ◼ Supplier(id, address) ◼ Differences Schema 2 saves space. Schema 1 may not keep address when there is no order. PA152, Vlastislav Dohnal, FI MUNI, 2022 5 Differences in Schema ◼ Performance trade-off Frequent access to address of supplier given an ordered part ◼ → schema 1 is good (no need for join) Many new orders ◼ → schema 1 wastes space (address duplicates) ◼ → relation will be stored in more blocks PA152, Vlastislav Dohnal, FI MUNI, 2022 6 Theory of Good Schema ◼ Normal forms 1NF, 2NF, 3NF, Boyce-Codd NF, … ◼ Functional dependency A → B ◼ B functionally depends on A ◼ Value of attr. B is determined if we know the value of attr. A ◼ Let t, s be rows of a relation, then t[A] = s[A]  t[B] = s[B] PA152, Vlastislav Dohnal, FI MUNI, 2022 7 Theory of Good Schema ◼ Order1(supplier_id, part_id, quantity, supplier_address) ◼ Functional dependency example: supplier_id → supplier_address supplier_id, part_id → quantity PA152, Vlastislav Dohnal, FI MUNI, 2022 8 Theory of Good Schema ◼ K is a primary key K → R L → R for any L  K ◼ i.e., for each attribute A in R holds: K → A and L → A ◼ Example Supplier(id, address) id → address id is the (primary) key PA152, Vlastislav Dohnal, FI MUNI, 2022 9 Theory of Good Schema ◼ Example Order1(supplier_id, part_id, quantity, supplier_address) supplier_id → supplier_address supplier_id, part_id → quantity supplier_id, part_id is the primary key PA152, Vlastislav Dohnal, FI MUNI, 2022 10 Schema Normalization ◼ 1NF – all attributes are atomic ◼ 2NF – all attributes depend on a whole super-key ◼ 3NF – all attributes depend directly on a candidate key no transitive dependency ◼ Normalization = transformation to BCNF/3NF PA152, Vlastislav Dohnal, FI MUNI, 2022 11 Schema Normalization ◼ A relation R is normalized if every functional dependency X → A involving attributes in R has the property that X is a (super-)key. ◼ Example Order1(supplier_id, part_id, quantity, supplier_address) ◼ supplier_id → supplier_address ◼ supplier_id, part_id → quantity Is not normalized PA152, Vlastislav Dohnal, FI MUNI, 2022 12 Schema Normalization ◼ Example Order2(supplier_id, part_id, quantity) ◼ supplier_id, part_id → quantity Supplier(id, address) ◼ id → address Schema is normalized PA152, Vlastislav Dohnal, FI MUNI, 2022 13 Schema Normalization: Example ◼ Bank Customer has an account Customer has an address Account is open at a branch of the bank ◼ Is relation normalized? Bank(customer, account, address, branch) PA152, Vlastislav Dohnal, FI MUNI, 2022 14 Schema Normalization: Example ◼ Relation Bank(customer, account, address, branch) customer → account customer → address account → branch ◼ Primary key is customer Proven by functional dependencies… ◼ Relation is not normalized There is a transitive dependency. PA152, Vlastislav Dohnal, FI MUNI, 2022 15 Schema Normalization: Example ◼ Relation decomposition Bank(customer, account, address, branch) ◼ customer → account ◼ customer → address Account(account, branch) ◼ account → branch Normalized now… PA152, Vlastislav Dohnal, FI MUNI, 2022 16 Practical Schema Design ◼ Identify entities Customer, supplier, order, … ◼ Each entity has attributes Customer has an address, phone number, … ◼ There are two constraints on attributes: 1. An attribute cannot have attribute of its own (is atomic). 2. The entity associated with an attribute must functionally determine that attribute. ◼ A functional dependency for each non-key attribute. PA152, Vlastislav Dohnal, FI MUNI, 2022 17 Practical Schema Design ◼ Each entity becomes a relation ◼ To these relations, add relations that reflect relationships between entities E.g., WorksOn(emp_id, project_id) ◼ Identify the functional dependencies among all attributes and check that the schema is normalized If functional dependency AB → C, then ABC should be part of the same relation. PA152, Vlastislav Dohnal, FI MUNI, 2022 18 Vertical Partitioning ◼ Example: Telephone Provider Customer entity has id, address and remaining credit value. ◼ Deps:  id → address  id → credit Normalized schema design  Customer(id, address, credit) ◼ Or  CustAddr(id, address)  CustCredit(id, credit) Which design is better? PA152, Vlastislav Dohnal, FI MUNI, 2022 19 Vertical Partitioning ◼ Which design is better, depends on the query pattern: The application that sends a monthly statement. The credit is updated or examined several times a day. ◼ → The second schema might be better Relation CustCredit is smaller ◼ Fewer blocks; may fit in main memory ◼ → faster table/index scan PA152, Vlastislav Dohnal, FI MUNI, 2022 20 Vertical Partitioning ◼ Single relation is better than two if attributes are queried together → no need for join ◼ Two relations are better if Attributes queried separately (or some much more often) Attributes are large (long strings, …) ◼ Caveat: LOBs are stored apart of the relation. Or some attributes are updated more often than the others. PA152, Vlastislav Dohnal, FI MUNI, 2022 21 Vertical Partitioning ◼ Another example Customer has id and address (street, city, zip) ◼ Is this normalization convenient? CustStreet(id, street) CustCity(id, city, zip) PA152, Vlastislav Dohnal, FI MUNI, 2022 22 Vertical Partitioning: Performance ◼ R(X,Y,Z) - X integer, Y and Z large strings Performance depends on query pattern 0 0,005 0,01 0,015 0,02 No Partitioning Query XYZ Vertical Partitioning Query XYZ No Partitioning Query XY Vertical Partitioning Query XY Througput(queries/sec)Table-scan No partitioning: R(X,Y,Z) Vert. part.: R1(X,Y) R2(X,Z) SQLServer 2k Windows 2k PA152, Vlastislav Dohnal, FI MUNI, 2022 23 Vertical Partitioning: Performance ◼ R(X,Y,Z) - X integer, Y and Z long strings Selection X=?, project XY or XYZ 0 200 400 600 800 1000 0 20 40 60 80 100 Throughput(queries/sec) % of access that only concern XY no vertical partitioning vertical partitioning Index Scan Vert. part. gives advantage if proportion of accessing XY is greater than 25%. Join requires 2 index accesses. PA152, Vlastislav Dohnal, FI MUNI, 2022 24 Vertical Antipartitioning ◼ Start with normalized schema ◼ Add attributes of a relation to the other ◼ Example Stock market (brokers) ◼ Price trends for last 3 000 trading days ◼ Broker’s decision based on last 10 day mainly Schema ◼ StockDetail(stock_id, issue_date, company) ◼ StockPrice(stock_id, date, price) PA152, Vlastislav Dohnal, FI MUNI, 2022 25 Vertical Antipartitioning ◼ Schema StockDetail(stock_id, issue_date, company) StockPrice(stock_id, date, price) ◼ Queries for all 10-day prices are expensive Even though there is an index on stock_id, date Join is needed for further information from StockDetail PA152, Vlastislav Dohnal, FI MUNI, 2022 26 Vertical Antipartitioning ◼ Replicate some data ◼ Schema StockDetail(stock_id, issue_date, company, price_today, price_yesterday, …, price_10d_ago) StockPrice(stock_id, date, price) ◼ Queries for all 10-day prices 1x index scan; no join PA152, Vlastislav Dohnal, FI MUNI, 2022 27 Vertical Antipartitioning ◼ Disadvantage Data replication ◼ Not high ◼ Can diminish by not storing in StockPrice  → queries for average price get complicated, … PA152, Vlastislav Dohnal, FI MUNI, 2022 28 Tuning Denormalization ◼ Denormalization violating normalization for the sake of performance! ◼ Good for Attributes from different normalized relations are often accessed together ◼ Bad for Updates are frequent ◼ → locate “source” data to update replicas PA152, Vlastislav Dohnal, FI MUNI, 2022 29 Tuning Denormalization ◼ Example (TPC-H)  region(r_regionkey, r_name, r_comment)  nation(n_nationkey, n_name, n_regionkey, n_comment)  supplier(s_suppkey, s_name, s_address, s_nationkey, s_phone, s_acctbal, s_comment)  item(i_orderkey, i_partkey, i_suppkey, i_linenumber, i_quantity, i_extendedprice, i_discount, i_tax, i_returnflag, i_linestatus, i_shipdate, i_commitdate, i_receiptdate, i_shipmode, i_comment)  T(item) = 600 000 T(supplier) = 500, T(nation) = 25, T(region) = 5 ◼ Query: Find items of European suppliers PA152, Vlastislav Dohnal, FI MUNI, 2022 30 Tuning Denormalization ◼ Denormalization of item  itemdenormalized (i_orderkey, i_partkey , i_suppkey, i_linenumber, i_quantity, i_extendedprice, i_discount, i_tax, i_returnflag, i_linestatus, i_shipdate, i_commitdate, i_receiptdate, i_shipmode, i_comment, i_regionname);  600 000 rows PA152, Vlastislav Dohnal, FI MUNI, 2022 31 Tuning Denormalization ◼ Queries: SELECT i_orderkey, i_partkey, i_suppkey, i_linenumber, i_quantity, i_extendedprice, i_discount, i_tax, i_returnflag, i_linestatus, i_shipdate, i_commitdate, i_receiptdate, i_shipinstruct, i_shipmode, i_comment, r_name FROM item, supplier, nation, region WHERE i_suppkey = s_suppkey AND s_nationkey = n_nationkey AND n_regionkey = r_regionkey AND r_name = 'Europe'; SELECT i_orderkey, i_partkey, i_suppkey, i_linenumber, i_quantity, i_extendedprice, i_discount, i_tax, i_returnflag, i_linestatus, i_shipdate, i_commitdate, i_receiptdate, i_shipinstruct, i_shipmode, i_comment, i_regionname FROM itemdenormalized WHERE i_regionname = 'Europe'; PA152, Vlastislav Dohnal, FI MUNI, 2022 32 Tuning Denormalization: Performance ◼ Query: Find items of European suppliers 0,0000 0,0005 0,0010 0,0015 0,0020 normalized denormalized Throughput(Queries/sec) Normalized: join of 4 relations Denormalized: one relation 54% perf. gain Oracle 8i EE Windows 2k 3x 18GB disk (10 000 rpm) 54% gain PA152, Vlastislav Dohnal, FI MUNI, 2022 33 Clustered Storage of Relations ◼ An alternative to denormalization ◼ Not always supported by DB system ◼ Oracle Clustered storage of two relations ◼ Order(supplier_id, product_id, quantity) ◼ Supplier(id, address) Storage ◼ Order records stored at the corresponding supplier record PA152, Vlastislav Dohnal, FI MUNI, 2022 34 Clustered Storage of Relations ◼ Example ◼ Order(supplier_id, product_id, quantity) ◼ Supplier(id, address) 10, Inter-pro.cz Hodonín 10, 235, 5 10, 545, 10 11, Unikov Bzenec 11, 123, 30 11, 234, 2 11, 648, 10 11, 956, 1 12, Školex Modřice 12, 12, 50 12, 34, 120 … PA152, Vlastislav Dohnal, FI MUNI, 2022 35 Horizontal Partitioning ◼ Divides table by its rows Vertical partitioning = by columns ◼ Motivation Smaller volume of data to process Rapid deletions ◼ Use Data archiving Spatial partitioning … PA152, Vlastislav Dohnal, FI MUNI, 2022 36 Horizontal Partitioning ◼ Automatically Modern (commercial) DB systems ◼ MS SQL Server 2005 and later ◼ Oracle 9i and later, … ◼ PostgreSQL 10 ◼ Manually With DBMS support ◼ Query optimizer Without DBMS support PA152, Vlastislav Dohnal, FI MUNI, 2022 37 Horizontal Partitioning ◼ Query rewrites Automatic partitioning ◼ No rewrites necessary Manual partitioning ◼ With DB support  No rewrites necessary  Table inheritance / definition of views with UNION ALL ◼ Without DB support  Manual query rewrite  List of tables in FROM clause must be changed PA152, Vlastislav Dohnal, FI MUNI, 2022 38 Horizontal Partitioning: SQL Server ◼ MS SQL Server 2005 and later  Define partitioning function ◼ CREATE PARTITION FUNCTION ◼ Partitioning to intervals  Define partitioning scheme ◼ CREATE PARTITION SCHEME ◼ Where to store data (what storage partitions)  Create partitioned table ◼ CREATE TABLE … ON partitioning scheme ◼ Stored data are automatically split into partitions  Create indexes ◼ CREATE INDEX ◼ Indexes are created on table partitions, i.e., automatically partitioned PA152, Vlastislav Dohnal, FI MUNI, 2022 39 Horizontal Partitioning: Oracle ◼ Oracle 9i and later Partitioning by intervals, enums, hashing ◼ Composite partitioning supported  Partitions split into subpartitions Included in syntax of CREATE TABLE http://docs.oracle.com/cd/B19306_01/server.102/b14200/statements_7002.htm#i2129707 ◼ PostgreSQL 10 and later Partitioning by intervals, enums, hashing ◼ CREATE TABLE … ( … ) PARTITION BY RANGE (…); Horizontal Partitioning: MariaDB ◼ Part of SQL syntax, applies to indexes ◼ Types:  hash, range, list; also double partitioning ◼ Limitation on UNIQUE constraints  All columns used in the table's partitioning expression must be part of every unique key the table may have. PA152, Vlastislav Dohnal, FI MUNI, 2022 40 CREATE TABLE ti (id INT, amount DECIMAL(7,2), tr_date DATE) ENGINE=MyISAM PARTITION BY HASH( MONTH(tr_date) ) PARTITIONS 6 CREATE TABLE ti … PARTITION BY RANGE (MONTH(tr_date)) ( PARTITION spring VALUES LESS THAN (4), PARTITION summer VALUES LESS THAN (7), PARTITION fall VALUES LESS THAN (10), PARTITION winter VALUES LESS THAN MAXVALUE ); Including primary key PA152, Vlastislav Dohnal, FI MUNI, 2022 41 Horizontal Partitioning: PostgreSQL ◼ PostgreSQL 8.2 and later Partitioning by intervals, enums ◼ Principle (http://www.postgresql.org/docs/current/static/ddl-partitioning.html) Table inheritance ◼ Create a base table  No data stored, no indexes, … ◼ Individual partitions are inherited tables  For each table, a CHECK constraint to limit data is defined ◼ Create necessary indexes Disadvantage: ref. integrity cannot be used PA152, Vlastislav Dohnal, FI MUNI, 2022 42 Horizontal Partitioning: PostgreSQL ◼ Principle Inserting records ◼ Inserted into base table ◼ Insert rules defined on the base table  Insertion to the “newest” partition only → one RULE  In general, one rule per partition is defined  Triggers can be used too… In case views are used, ◼ Define INSTEAD OF triggers PA152, Vlastislav Dohnal, FI MUNI, 2022 43 Horizontal Partitioning: PostgreSQL ◼ Example in xdohnal schema (db.fi.muni.cz) Not partitioned table account ◼ Primary key id ◼ R(account) = 200 000 ◼ V(account,home_city) = 5 Partitioned table account_parted ◼ by home_city (5 partitions)  Partitions: account_parted1 .. account_parted5 home_city | count home_city1 | 40020 home_city2 | 40186 home_city3 | 39836 home_city4 | 39959 home_city5 | 39999 PA152, Vlastislav Dohnal, FI MUNI, 2022 44 Horizontal Partitioning: PostgreSQL ◼ Statistics Table Rows Sizes Indexes account 200 000 41 984 kB 4 408 kB account_parted 0 0 kB 8 kB account_parted1 40 020 8 432 kB 896 kB account_parted2 40 186 8 464 kB 896 kB account_parted3 39 836 8 392 kB 888 kB account_parted4 39 959 8 416 kB 896 kB account_parted5 39 999 8 424 kB 896 kB Totals: 200 000 42 128 kB 4 472 kB PA152, Vlastislav Dohnal, FI MUNI, 2022 45 Horizontal Partitioning: PostgreSQL ◼ Query optimizer Allow checking constraint on partitions ◼ Queries (compare execution plans) select * from account where id=8; select * from account_parted where id=8; select count(*) from account where home_city='home_city1'; select count(*) from account_parted where home_city='home_city1'; select * from account where home_city='home_city1' and id=8; select * from account_parted where home_city='home_city1' and id=8; set constraint_exclusion=on; Transaction Tuning ◼ Application view on a transaction It runs isolated – without any concurrent activity. ◼ Database view on a transaction Atomic and consistent change of data; many can be run concurrently. So, correctness of result must be ensured. PA152, Vlastislav Dohnal, FI MUNI, 2022 46 Transaction Concurrency ◼ Two transactions are concurrent if their executions overlap in time. Can happen on a single thread/processor too, e.g., one waiting for I/O to complete. ◼ Concurrency control Controls activity of transactions and make the result appear equivalent to serial execution. Typically achieved by mutual exclusion ◼ E.g., semaphore PA152, Vlastislav Dohnal, FI MUNI, 2022 47 Transaction Concurrency ◼ A semaphore on entire database (one transaction at a time) Good for in-memory databases. ◼ Locking mechanism is good for secondary memory databases. Read (shared) locks and write (exclusive) locks. Record level and relation (table) level PA152, Vlastislav Dohnal, FI MUNI, 2022 48 Concurrency through locking ◼ Rules 1. A transaction must hold a lock on x before accessing it. 2. A transaction must not acquire a lock on any item y after releasing a lock on any item x. ◼ This ensures correctness no update can be made to data read (and locked) by someone else. PA152, Vlastislav Dohnal, FI MUNI, 2022 49 Duration of Transaction ◼ Duration effects on performance More locks a transaction requests, more likely it is that it will wait for some other transaction to finish. The longer T executes, the longer some other transaction may wait if it is blocked by T. ◼ In operational DBs, shorter transactions are preferred. since updates are frequent PA152, Vlastislav Dohnal, FI MUNI, 2022 50 Transaction Design ◼ Avoid user-interaction during a transaction ◼ Lock only what you need  E.g., do not filter recs in an app ◼ Chop transaction  E.g., T accesses x and y. Any other T’ accesses at most one of x or y and nothing else. T can be divided into two transaction (each modifying x and y separately). ◼ Weaken isolation level  Many DBMSes default to releasing read locks on completing the read IO. PA152, Vlastislav Dohnal, FI MUNI, 2022 51 Levels of Isolation ◼ Serializable ◼ Repeatable read Phantom reads (newly inserted recs) ◼ Read committed Non-repeatable reads (a transaction has committed an update) ◼ Read uncommitted Dirty reads (non-committed recs); writes are still atomic ◼ No locking PA152, Vlastislav Dohnal, FI MUNI, 2022 52 Query Tuning: Takeaways ◼ Five basic principles Think globally; fix locally Break bottlenecks by partitioning ◼ transactions, relations, also more HW ((-: Start-up costs are high; running costs are low ◼ E.g., it is expensive to begin a read operation on a disk. Render unto server what is due unto server Be prepared for trade-offs PA152, Vlastislav Dohnal, FI MUNI, 2022 53