BKM_DATS: Databázové systémy 4. Advanced SQL Vlastislav Dohnal BKM_DATS, Vlastislav Dohnal, FI MUNI, 2022 2 Contents Advanced Aggregate Functions Analytic Functions Data Aggregation Sequences JSON Processing Advanced Aggregate Functions General functions array_agg(expression) packs all input values into one array Statistical functions stddev_samp(expression) calculates the (sample) standard deviation over the values var_samp(expression) calculates the (sample) variance over the values corr(a,b) correlation coefficient between the two sets of values regr_slope(y,x) slope of the least-squares-fit linear function determined by the (x, y) pairs regr_intercept(y, x) y-intercept of the least-squares-fit linear equation determined by the (X, Y) pairs BKM_DATS, Vlastislav Dohnal, FI MUNI, 2022 3 Advanced Aggregate Functions (Inverse) Distribution functions mode() WITHIN GROUP (ORDER BY expression) returns the most frequent input value choosing the first one arbitrarily if there are multiple equallyfrequent results percentile_cont(fraction) WITHIN GROUP (ORDER BY expression) continuous percentile: returns a value corresponding to the specified fraction in the ordering, interpolating between adjacent input items if needed percentile_disc(fraction) WITHIN GROUP (ORDER BY expression) discrete percentile: returns the first input value whose position in the ordering equals or exceeds the specified fraction BKM_DATS, Vlastislav Dohnal, FI MUNI, 2022 4 fraction  <0;1> Advanced Aggregate Functions Hypothetical-set functions rank(value) WITHIN GROUP (ORDER BY expr) rank of the hypothetical value, with gaps for duplicate rows, over all values of expr. dense_rank(value) WITHIN GROUP (ORDER BY expr) rank of the hypothetical value, without gaps percent_rank(value) WITHIN GROUP (ORDER BY expr) relative rank of the hypothetical value, ranging from 0 to 1 cume_dist(value) WITHIN GROUP (ORDER BY expr) relative rank of the hypothetical value, ranging from 1/N to 1 BKM_DATS, Vlastislav Dohnal, FI MUNI, 2022 5 Analytic Functions provide the ability to perform calculations across sets of rows that are related to the current query row generally called Window functions OVER ([PARTITION BY ] ORDER BY []) E.g., SELECT … , AVG(sales) OVER (PARTITION BY region ORDER BY month ASC ROWS 2 PRECEDING), … FROM … moving average over 3 rows BKM_DATS, Vlastislav Dohnal, FI MUNI, 2022 6 Analytic Functions Ranking operators Row numbering is the most basic ranking function E.g., SELECT SalesOrderID , CustomerID , ROW_NUMBER() OVER (ORDER BY SalesOrderID ) as RunningCount FROM Sales WHERE SalesOrderID > 10000 ORDER BY SalesOrderID October 20, 2022 PA220 DB for Analytics7 Analytic Functions ROW_NUMBER doesn’t consider tied values Each 2 equal values get 2 different row numbers The behavior is nondeterministic Each tied value could have its number switched! We need something deterministic RANK() and DENSE_RANK() October 20, 2022 PA220 DB for Analytics8 Analytic Functions RANK and DENSE_RANK functions Allow ranking items in a group Syntax: RANK ( ) OVER ( [ query_partition_clause ] order_by_clause ) DENSE_RANK ( ) OVER ( [ query_partition_clause ] order_by_clause ) DENSE_RANK leaves no gaps in ranking sequence when there are ties PERCENT_RANK → (rank - 1) / (total rows - 1) CUME_DIST - the cumulative distribution the number of partition rows preceding (or peers with) the current row / total partition rows The value ranges from 1/N to 1 October 20, 2022 PA220 DB for Analytics9 Analytic Functions Example SELECT channel, calendar, TO_CHAR(TRUNC(SUM(amount_sold), -6), '9,999,999’) AS sales, RANK() OVER (ORDER BY TRUNC(amount_sold, -6)) DESC) AS rank, DENSE_RANK() OVER (ORDER BY TRUNC(SUM(amount_sold), -6)) DESC) AS dense_rank FROM sales, products … GROUP BY channel, calendar ORDER BY sales DESC October 20, 2022 PA220 DB for Analytics10 Analytic Functions Group ranking - RANK function can operate within groups: the rank gets reset whenever the group changes A single query can contain more than one ranking function, each partitioning the data into different groups. PARTITION BY clause October 20, 2022 PA220 DB for Analytics11 SELECT … RANK() OVER (PARTITION BY channel ORDER BY SUM(amount_sold) DESC) AS rank_by_channel Analytic Functions NTILE splits a set into equal-sized groups It divides an ordered partition into buckets and assigns a bucket number to each row in the partition Buckets are calculated so that each bucket has exactly the same number of rows assigned to it or at most 1 row more than the others NTILE(4) - quartile NTILE(100) - percentage Not a part of the SQL99 standard, but adopted by major vendors October 20, 2022 PA220 DB for Analytics12 SELECT … NTILE(3) OVER (ORDER BY sales) NT_3 FROM … Analytic Functions Obtain a value of a particular row of a window frame defined by window clause (PARTITION BY…) first_value(expression) last_value(expression) nth_value (expression) October 20, 2022 PA220 DB for Analytics13 SELECT … FIRST_VALUE(sales) OVER (ORDER BY sales) AS lowest_sale CHANNEL CALENDAR SALES LOWEST_SALE Direst sales 02.2016 10,000 4,000 Direst sales 03.2016 9,000 4,000 Internet 02.2016 6,000 4,000 Internet 03.2016 6,000 4,000 Partners 03.2016 4,000 4,000 SELECT … FIRST_VALUE(sales) OVER (PARTITION BY channel ORDER BY sales) AS lowest_sales Analytic Functions Access to a row that comes before the current row at a specified physical offset with the current window frame (partition) LAG(expression [,offset [,default_value]]) … after the current row LEAD(expression [,offset [,default_value]] October 20, 2022 PA220 DB for Analytics14 CHANNEL CALENDAR SALES PREV_SALE Direst sales 02.2016 10,000 NULL Direst sales 03.2016 9,000 10,000 Internet 02.2016 6,000 NULL Internet 03.2016 6,000 6,000 Partners 03.2016 4,000 NULL SELECT … LAG(sales, 1) OVER (PARTITION BY channel ORDER BY calendar) AS prev_sales Data Aggregations Used in GROUP BY clause instead of mere list of attributes ROLLUP (e1, e2, e3, …) represents the given list of expressions and all prefixes of the list including the empty list CUBE (e1, e2, e3, …) represents the given list and all of its possible subsets (i.e., the power set) GROUPING SETS ( (e1,e2), (e4,e5), (e6), () …) rows are grouped separately by each specified grouping set Function to obtain which “GROUP BY” takes place GROUPING(args...) Integer bit mask indicating which arguments are not being included in the current grouping set BKM_DATS, Vlastislav Dohnal, FI MUNI, 2022 15 Data Aggregations Pivoting table for make and model over sales data SELECT make, model, sum(amount) FROM sales GROUP BY CUBE (make, model) BKM_DATS, Vlastislav Dohnal, FI MUNI, 2022 16 Data Aggregations Example of CUBE on table of car sales (year, make, model, amount) GROUP BY CUBE (year, make, model) calculates: BKM_DATS, Vlastislav Dohnal, FI MUNI, 2022 17