Overview

In data analysis we often need to group a dataset along some dimension and then, within each group, perform operations such as ranking, aggregation, or extracting neighboring rows. The SQL:2003 standard introduced window functions for exactly this purpose. This article introduces the window function capabilities of Spark SQL as of Spark 3.0, walks through hands-on examples, and closes with the kinds of business analysis scenarios the main window functions can address.

Window Functions

Syntax

window_function OVER
( [ { PARTITION | DISTRIBUTE } BY partition_col_name = partition_col_val ( [ , ... ] ) ]
{ ORDER | SORT } BY expression [ ASC | DESC ] [ NULLS { FIRST | LAST } ] [ , ... ]
[ window_frame ] )

Parameters

  • Window function

    • Ranking: RANK | DENSE_RANK | PERCENT_RANK | NTILE | ROW_NUMBER
    • Analytic: CUME_DIST | LAG | LEAD
    • Aggregate: MAX | MIN | COUNT | SUM | AVG | ...
  • Window frame

    • Syntax for specifying the frame's start and end rows

    { RANGE | ROWS } { frame_start | BETWEEN frame_start AND frame_end }

    • Syntax for a frame start or end boundary

      UNBOUNDED PRECEDING | offset PRECEDING | CURRENT ROW | offset FOLLOWING | UNBOUNDED FOLLOWING

      offset: an offset relative to the current row's position

    Note: if frame_end is not specified, it defaults to CURRENT ROW.
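This default is part of the SQL standard, so it can be checked quickly without a Spark cluster. A minimal sketch using Python's built-in sqlite3 module (an assumption of this example: SQLite's window functions follow the same standard frame defaults as Spark SQL) shows that omitting frame_end is equivalent to writing BETWEEN ... AND CURRENT ROW:

```python
import sqlite3

# Demonstrate that `ROWS UNBOUNDED PRECEDING` (no frame_end) behaves the
# same as `ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW`.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (v INT)")
conn.executemany("INSERT INTO t VALUES (?)", [(1,), (2,), (3,)])

rows = conn.execute("""
    SELECT v,
           SUM(v) OVER (ORDER BY v ROWS UNBOUNDED PRECEDING)                         AS short_form,
           SUM(v) OVER (ORDER BY v ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS long_form
    FROM t
""").fetchall()

for v, short_form, long_form in rows:
    assert short_form == long_form  # both are the running sum: 1, 3, 6
print(rows)
```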

Examples

Loading the data

CREATE TABLE employees (name STRING, dept STRING, salary INT, age INT);

INSERT INTO employees VALUES ("Lisa", "Sales", 10000, 35);
INSERT INTO employees VALUES ("Evan", "Sales", 32000, 38);
INSERT INTO employees VALUES ("Fred", "Engineering", 21000, 28);
INSERT INTO employees VALUES ("Alex", "Sales", 30000, 33);
INSERT INTO employees VALUES ("Tom", "Engineering", 23000, 33);
INSERT INTO employees VALUES ("Jane", "Marketing", 29000, 28);
INSERT INTO employees VALUES ("Jeff", "Marketing", 35000, 38);
INSERT INTO employees VALUES ("Paul", "Engineering", 29000, 23);
INSERT INTO employees VALUES ("Chloe", "Engineering", 23000, 25);

Ranking Functions

rank(): computes the rank of a value within a group of values. The result is one plus the number of rows preceding or equal to the current row in the partition's ordering. Ties receive the same rank, which produces gaps in the sequence.

select name,dept,salary,rank() over (partition by dept order by salary) as rank from employees;
+--------+--------------+---------+-------+
| name | dept | salary | rank |
+--------+--------------+---------+-------+
| Lisa | Sales | 10000 | 1 |
| Alex | Sales | 30000 | 2 |
| Evan | Sales | 32000 | 3 |
| Fred | Engineering | 21000 | 1 |
| Tom | Engineering | 23000 | 2 |
| Chloe | Engineering | 23000 | 2 |
| Paul | Engineering | 29000 | 4 |
| Jane | Marketing | 29000 | 1 |
| Jeff | Marketing | 35000 | 2 |
+--------+--------------+---------+-------+

dense_rank(): computes the rank of a value within a group of values. The result is one plus the previously assigned rank value, so ties do not produce gaps in the ranking sequence.

select name,dept,salary, dense_rank() over (partition by dept order by salary) as dense_rank from employees;
+--------+--------------+---------+-------------+
| name | dept | salary | dense_rank |
+--------+--------------+---------+-------------+
| Lisa | Sales | 10000 | 1 |
| Alex | Sales | 30000 | 2 |
| Evan | Sales | 32000 | 3 |
| Fred | Engineering | 21000 | 1 |
| Chloe | Engineering | 23000 | 2 |
| Tom | Engineering | 23000 | 2 |
| Paul | Engineering | 29000 | 3 |
| Jane | Marketing | 29000 | 1 |
| Jeff | Marketing | 35000 | 2 |
+--------+--------------+---------+-------------+

row_number(): assigns a unique, sequential number to each row within the window, starting at 1; ties are numbered arbitrarily but distinctly.

select name,dept,salary, row_number() over (partition by dept order by salary) as row_number from employees;
+--------+--------------+---------+-------------+
| name | dept | salary | row_number |
+--------+--------------+---------+-------------+
| Lisa | Sales | 10000 | 1 |
| Alex | Sales | 30000 | 2 |
| Evan | Sales | 32000 | 3 |
| Fred | Engineering | 21000 | 1 |
| Tom | Engineering | 23000 | 2 |
| Chloe | Engineering | 23000 | 3 |
| Paul | Engineering | 29000 | 4 |
| Jane | Marketing | 29000 | 1 |
| Jeff | Marketing | 35000 | 2 |
+--------+--------------+---------+-------------+

percent_rank(): computes the percentile rank of a value within a group of values, as (rank - 1) / (total_rows - 1).

select name,dept,salary, percent_rank() over (partition by dept order by salary) as percent_rank from employees;
+--------+--------------+---------+---------------------+
| name | dept | salary | percent_rank |
+--------+--------------+---------+---------------------+
| Lisa | Sales | 10000 | 0.0 |
| Alex | Sales | 30000 | 0.5 |
| Evan | Sales | 32000 | 1.0 |
| Fred | Engineering | 21000 | 0.0 |
| Chloe | Engineering | 23000 | 0.3333333333333333 |
| Tom | Engineering | 23000 | 0.3333333333333333 |
| Paul | Engineering | 29000 | 1.0 |
| Jane | Marketing | 29000 | 0.0 |
| Jeff | Marketing | 35000 | 1.0 |
+--------+--------------+---------+---------------------+
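As a quick arithmetic check of the formula, the Engineering partition above has four rows whose rank() values are 1, 2, 2, 4. A short Python sketch reproduces the percent_rank column:

```python
# Verify percent_rank = (rank - 1) / (total_rows - 1) against the
# Engineering partition (salaries 21000, 23000, 23000, 29000).
ranks = [1, 2, 2, 4]  # rank() values, with the gap ties produce
total = len(ranks)
percent_ranks = [(r - 1) / (total - 1) for r in ranks]
print(percent_ranks)  # [0.0, 0.3333333333333333, 0.3333333333333333, 1.0]
```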

ntile(n): divides the rows in the window into n roughly equal buckets and returns each row's bucket number. Note that the example below orders by rand(), so the bucket assignments differ from run to run.

select name,dept,ntile(3) over (order by rand()) as group from employees;
+--------+--------------+--------+
| name | dept | group |
+--------+--------------+--------+
| Chloe | Engineering | 1 |
| Jeff | Marketing | 1 |
| Paul | Engineering | 1 |
| Alex | Sales | 2 |
| Evan | Sales | 2 |
| Jane | Marketing | 2 |
| Tom | Engineering | 3 |
| Lisa | Sales | 3 |
| Fred | Engineering | 3 |
+--------+--------------+--------+

Analytic Functions

lag(col, offset[, default]): returns the value of col at offset rows before the current row within the partition; if that position falls before the partition start, returns default.
lead(col, offset[, default]): returns the value of col at offset rows after the current row within the partition; if that position falls past the partition end, returns default.

select name,dept,salary, 
salary-lag(salary,1,salary) over (partition by dept order by salary) as add,
lead(salary,1,salary) over(partition by dept order by salary) - salary as minus
from employees;

+--------+--------------+---------+--------+--------+
| name | dept | salary | add | minus |
+--------+--------------+---------+--------+--------+
| Lisa | Sales | 10000 | 0 | 20000 |
| Alex | Sales | 30000 | 20000 | 2000 |
| Evan | Sales | 32000 | 2000 | 0 |
| Fred | Engineering | 21000 | 0 | 2000 |
| Tom | Engineering | 23000 | 2000 | 0 |
| Chloe | Engineering | 23000 | 0 | 6000 |
| Paul | Engineering | 29000 | 6000 | 0 |
| Jane | Marketing | 29000 | 0 | 6000 |
| Jeff | Marketing | 35000 | 6000 | 0 |
+--------+--------------+---------+--------+--------+

cume_dist(): computes the cumulative distribution of a value: the number of rows whose value is less than or equal to the current row's value, divided by the total number of rows in the partition.

select name,dept,salary, cume_dist() over (partition by dept order by salary) as cume_dist from employees;
+--------+--------------+---------+---------------------+
| name | dept | salary | cume_dist |
+--------+--------------+---------+---------------------+
| Lisa | Sales | 10000 | 0.3333333333333333 | 1/3
| Alex | Sales | 30000 | 0.6666666666666666 | 2/3
| Evan | Sales | 32000 | 1.0 | 3/3
| Fred | Engineering | 21000 | 0.25 |
| Chloe | Engineering | 23000 | 0.75 |
| Tom | Engineering | 23000 | 0.75 |
| Paul | Engineering | 29000 | 1.0 |
| Jane | Marketing | 29000 | 0.5 |
| Jeff | Marketing | 35000 | 1.0 |
+--------+--------------+---------+---------------------+

Aggregate Functions

unbounded: no boundary (the frame extends to the partition edge)
preceding: a number of rows before the current row
following: a number of rows after the current row
If frame_end is not specified, it defaults to current row.

range: a logical window; the frame is chosen by value range, so rows with the same ordering value fall into the same frame
rows: a physical window; the frame is chosen by row count relative to the current row

select name,dept,salary,
sum(salary) over (partition by dept order by salary rows between unbounded preceding and current row) as rows_total,
sum(salary) over (partition by dept order by salary range between unbounded preceding and current row) as range_total
from employees;

+--------+--------------+---------+-------------+--------------+
| name | dept | salary | rows_total | range_total |
+--------+--------------+---------+-------------+--------------+
| Lisa | Sales | 10000 | 10000 | 10000 |
| Alex | Sales | 30000 | 40000 | 40000 |
| Evan | Sales | 32000 | 72000 | 72000 |
| Fred | Engineering | 21000 | 21000 | 21000 |
| Chloe | Engineering | 23000 | 44000 | 67000 |
| Tom | Engineering | 23000 | 67000 | 67000 |
| Paul | Engineering | 29000 | 96000 | 96000 |
| Jane | Marketing | 29000 | 29000 | 29000 |
| Jeff | Marketing | 35000 | 64000 | 64000 |
+--------+--------------+---------+-------------+--------------+
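Frames that extend on both sides of the current row are also common, for example a centered moving average. A minimal sketch using Python's built-in sqlite3 module (assumptions: SQLite's standard ROWS frames behave like Spark SQL's here, and the table metrics is made up for illustration):

```python
import sqlite3

# A 3-row centered moving average using a physical (ROWS) frame:
# 1 row before, the current row, and 1 row after.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE metrics (day INT, value INT)")
conn.executemany("INSERT INTO metrics VALUES (?, ?)",
                 [(1, 10), (2, 20), (3, 30), (4, 40)])

rows = conn.execute("""
    SELECT day,
           AVG(value) OVER (ORDER BY day
                            ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING) AS moving_avg
    FROM metrics
""").fetchall()

# At the partition edges the frame simply shrinks to the rows that exist,
# so day 1 averages days 1-2 and day 4 averages days 3-4.
print(rows)  # [(1, 15.0), (2, 20.0), (3, 30.0), (4, 35.0)]
```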

Business Scenarios

  1. Classic SparkSQL/Hive-SQL/MySQL interview and practice problems
  2. Business data analysis with SQL window functions
  3. Hive window functions in practice
  4. Classic Hive-SQL interview questions