概述 我们在数据分析过程中,会需要在数据集中按照某个维度进行分组,并且需要在组内进行一个类似排名、统计、相邻数据提取等分析操作。因此,在SQL 2003标准中,引入了窗口函数的能力。本文将基于Spark 3.0 版本的Spark SQL的窗口函数能力 进行介绍,并给出一些实验示例。最后,基于主要的窗口函数能力,来介绍能够解决哪些业务场景的分析需求。
窗口函数 语法 1 2 3 4 window_function OVER ( [ { PARTITION | DISTRIBUTE } BY partition_col_name = partition_col_val ( [ , ... ] ) ] { ORDER | SORT } BY expression [ ASC | DESC ] [ NULLS { FIRST | LAST } ] [ , ... ] [ window_frame ] )
参数
实验示例 导入数据 1 2 3 4 5 6 7 8 9 10 11 CREATE TABLE employees (name STRING , dept STRING , salary INT , age INT );INSERT INTO employees VALUES ("Lisa" , "Sales" , 10000 , 35 );INSERT INTO employees VALUES ("Evan" , "Sales" , 32000 , 38 );INSERT INTO employees VALUES ("Fred" , "Engineering" , 21000 , 28 );INSERT INTO employees VALUES ("Alex" , "Sales" , 30000 , 33 );INSERT INTO employees VALUES ("Tom" , "Engineering" , 23000 , 33 );INSERT INTO employees VALUES ("Jane" , "Marketing" , 29000 , 28 );INSERT INTO employees VALUES ("Jeff" , "Marketing" , 35000 , 38 );INSERT INTO employees VALUES ("Paul" , "Engineering" , 29000 , 23 );INSERT INTO employees VALUES ("Chloe" , "Engineering" , 23000 , 25 );
RANK函数 rank(): 计算值在一组值中的排名。结果是1加上当前行前面的行数或当前行相等的行数。这些值在序列中会产生间隙。
1 2 3 4 5 6 7 8 9 10 11 12 13 14 select name ,dept,salary,rank () over (partition by dept order by salary) as rank from employees;+ | name | dept | salary | rank | + | Lisa | Sales | 10000 | 1 | | Alex | Sales | 30000 | 2 | | Evan | Sales | 32000 | 3 | | Fred | Engineering | 21000 | 1 | | Tom | Engineering | 23000 | 2 | | Chloe | Engineering | 23000 | 2 | | Paul | Engineering | 29000 | 4 | | Jane | Marketing | 29000 | 1 | | Jeff | Marketing | 35000 | 2 | +
dense_rank(): 计算值在一组值中的排名。结果是1加上先前分配的排名值。在排名序列中不会产生间隙。
1 2 3 4 5 6 7 8 9 10 11 12 13 14 select name ,dept,salary, dense_rank () over (partition by dept order by salary) as dense_rank from employees;+ | name | dept | salary | dense_rank | + | Lisa | Sales | 10000 | 1 | | Alex | Sales | 30000 | 2 | | Evan | Sales | 32000 | 3 | | Fred | Engineering | 21000 | 1 | | Chloe | Engineering | 23000 | 2 | | Tom | Engineering | 23000 | 2 | | Paul | Engineering | 29000 | 3 | | Jane | Marketing | 29000 | 1 | | Jeff | Marketing | 35000 | 2 | +
row_number(): 窗口内一行一行有序编号。
1 2 3 4 5 6 7 8 9 10 11 12 13 14 select name ,dept,salary, row_number() over (partition by dept order by salary) as row_number from employees;+ | name | dept | salary | row_number | + | Lisa | Sales | 10000 | 1 | | Alex | Sales | 30000 | 2 | | Evan | Sales | 32000 | 3 | | Fred | Engineering | 21000 | 1 | | Tom | Engineering | 23000 | 2 | | Chloe | Engineering | 23000 | 3 | | Paul | Engineering | 29000 | 4 | | Jane | Marketing | 29000 | 1 | | Jeff | Marketing | 35000 | 2 | +
percent_rank():计算值在一组值中的百分比排名 (rank - 1)/(total_row - 1)。
1 2 3 4 5 6 7 8 9 10 11 12 13 14 select name ,dept,salary, percent_rank () over (partition by dept order by salary) as percent_rank from employees;+ | name | dept | salary | percent_rank | + | Lisa | Sales | 10000 | 0.0 | | Alex | Sales | 30000 | 0.5 | | Evan | Sales | 32000 | 1.0 | | Fred | Engineering | 21000 | 0.0 | | Chloe | Engineering | 23000 | 0.3333333333333333 | | Tom | Engineering | 23000 | 0.3333333333333333 | | Paul | Engineering | 29000 | 1.0 | | Jane | Marketing | 29000 | 0.0 | | Jeff | Marketing | 35000 | 1.0 | +
ntile(n): 将窗口内的值分为n组。
1 2 3 4 5 6 7 8 9 10 11 12 13 14 select name ,dept,ntile(3 ) over (order by rand ()) as group from employees;+ | name | dept | group | + | Chloe | Engineering | 1 | | Jeff | Marketing | 1 | | Paul | Engineering | 1 | | Alex | Sales | 2 | | Evan | Sales | 2 | | Jane | Marketing | 2 | | Tom | Engineering | 3 | | Lisa | Sales | 3 | | Fred | Engineering | 3 | +
分析函数 lag(col,offset[,default]): 从指定列的前offset位置的值,如果超出了起始位置,返回default值; lead(col,offset[,default]): 从指定列的后offset位置的值,如果超出了结束位置,返回default值。
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 select name ,dept,salary, salary-lag(salary,1 ,salary) over (partition by dept order by salary) as add , lead (salary,1 ,salary) over (partition by dept order by salary) - salary as minus from employees;+ | name | dept | salary | add | minus | + | Lisa | Sales | 10000 | 0 | 20000 | | Alex | Sales | 30000 | 20000 | 2000 | | Evan | Sales | 32000 | 2000 | 0 | | Fred | Engineering | 21000 | 0 | 2000 | | Tom | Engineering | 23000 | 2000 | 0 | | Chloe | Engineering | 23000 | 0 | 6000 | | Paul | Engineering | 29000 | 6000 | 0 | | Jane | Marketing | 29000 | 0 | 6000 | | Jeff | Marketing | 35000 | 6000 | 0 | +
cume_dist(): 计算小于等于当前行值的行数/分组内总行数。
1 2 3 4 5 6 7 8 9 10 11 12 13 14 select name ,dept,salary, cume_dist () over (partition by dept order by salary) as cume_dist from employees;+ | name | dept | salary | cume_dist | + | Lisa | Sales | 10000 | 0.3333333333333333 | 1/3 | Alex | Sales | 30000 | 0.6666666666666666 | 2/3 | Evan | Sales | 32000 | 1.0 | 3/3 | Fred | Engineering | 21000 | 0.25 | | Chloe | Engineering | 23000 | 0.75 | | Tom | Engineering | 23000 | 0.75 | | Paul | Engineering | 29000 | 1.0 | | Jane | Marketing | 29000 | 0.5 | | Jeff | Marketing | 35000 | 1.0 | +
聚合函数 unbounded: 无边界,没有限制 preceding: 前多少行 following: 后多少行 frame_end不设置默认为current row
range: 逻辑窗口,指定区间的范围取值 rows: 物理窗口,指定区间的行数取值
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 select name ,dept,salary, sum (salary) over (partition by dept order by salary rows between unbounded preceding and current row ) as rows_total, sum (salary) over (partition by dept order by salary range between unbounded preceding and current row ) as range_total from employees;+ | name | dept | salary | rows_total | range_total | + | Lisa | Sales | 10000 | 10000 | 10000 | | Alex | Sales | 30000 | 40000 | 40000 | | Evan | Sales | 32000 | 72000 | 72000 | | Fred | Engineering | 21000 | 21000 | 21000 | | Chloe | Engineering | 23000 | 44000 | 67000 | | Tom | Engineering | 23000 | 67000 | 67000 | | Paul | Engineering | 29000 | 96000 | 96000 | | Jane | Marketing | 29000 | 29000 | 29000 | | Jeff | Marketing | 35000 | 64000 | 64000 | +
业务场景
经典的SparkSQL/Hive-SQL/MySQL面试-练习题
使用SQL窗口函数进行商务数据分析
Hive开窗函数实战
经典Hive-SQL面试题