KEMBAR78
Advance sql - window functions patterns and tricks | PPTX
Advance SQL
Windows Functions
Eyal Trabelsi
What is a window function
https://blog.matters.tech/sql-window-functions-basics-e9a9fa17ce7e
Did they forget anything ?
NULLS !!!!!
• NULLS values are treated as their own group over
the partitioned columns.
NULLS !!!!!
• NULLS values are treated as their own group over
the partitioned columns.
• They sorted and ranked according to the NULLS
FIRST or NULLS LAST option.
NULLS !!!!!
• NULLS values are treated as their own group over
the partitioned columns.
• They sorted and ranked according to the NULLS
FIRST or NULLS LAST option.
• By default, NULL values are sorted and ranked last
in ASC ordering, and sorted and ranked first in
DESC ordering.
NULLS !!!!!
Supported window functions
Supported window functions
• Window function are vendor ‘ dependent!!!!
Supported window functions
• Window function are vendor ‘ dependent!!!!
• Most vendors support the all the basic window functions
like avg, min, max, count, sum, lead, lag, rank etc.
Supported window functions
• Window function are vendor ‘ dependent!!!!
• Most vendors support the all the basic window functions
like avg, min, max, count, sum, lead, lag, rank etc.
• These are the vendor specific supported window
functions:
Supported window functions
• Window function are vendor ‘ dependent!!!!
• Most vendors support the all the basic window functions
like avg, min, max, count, sum, lead, lag, rank etc.
• These are the vendor specific supported window
functions:
Our Business use case
Lets say we have the following events table
Our Business use case
Our Business use case
Event_id time user_id type click_num
1 18:00 1 a 2
2 18:20 1 a 3
3 18:59 1 b 4
4 18:00 2 b 1
Lets say we have the following events table
Calculating Cumulative sum via sum
Calculating Cumulative sum via sum
Event_id time user_id type click_num cumsum
1 18:0
0
1 a 2 2
2 18:2
0
1 a 3 5
3 18:5
9
1 b 4 9
4 18:0
0
2 b 1 1
Calculating Cumulative sum via sum
Calculating Cumulative sum via sum
SELECT event_id,
time,
user_id,
type,
sum(click_num) over (PARTITION BY user_id ORDER BY time
ROWS UNBOUNDED PRECEDING)
FROM t
Calculating Cumulative sum via sum
Calculating Cumulative sum via sum
DEFAULTS!!!
Calculating Cumulative sum via sum
DEFAULTS!!!
Calculating Cumulative sum via sum
• Calculate cumulative sum of lead per month
• Calculate cumulative sum of orders’ amount per store
Calculate Growth
Calculate Growth
Event_id time user_id type click_num growth
1 18:0
0
1 a 2 NULL
2 18:2
0
1 b 3 1/2
3 18:5
9
1 a 4 1/3
4 18:0
0
2 b 1 NULL
Calculate Growth
Calculate Growth
SELECT event_id,
time,
user_id,
type,
lag(num_click) over (PARTITION BY user_id ORDER BY time) AS
last_num_click,
(num_click - last_num_click) / last_num_click
FROM t
Calculate Growth
SELECT event_id,
time,
user_id,
type,
lag(num_click) over (PARTITION BY user_id ORDER BY time) AS
last_num_click,
(num_click - last_num_click) / last_num_click
FROM t
*
Calculate Growth
• Calculate growth of accounts month after month
• Calculate growth of orders amount per store month after month
• Calculate growth of orders amount per store and opportunity
Calculate Growth
• Calculate growth of accounts month after month
• Calculate growth of orders amount per store month after month
• Calculate growth of orders amount per store and opportunity HARD :P
Indicate the first event for each user
Indicate the first event for each user
Event_id time user_id type first
1 18:00 1 a 1
2 18:20 1 b 0
3 18:59 1 a 0
4 18:00 2 b 1
Indicate the first event for each user
Indicate the first event for each user
SELECT event_id,
time,
user_id,
type
floor(1/ row_number() over (PARTITION BY user_id ORDER BY time))
FROM t
Indicate the first row template
Indicate the first row template
SELECT c1,
c2,
floor(1/ row_number() over (PARTITION BY c1 ORDER BY c2))
FROM t
Indicate the first template use-cases
Indicate the first template use-cases
• Our analysis need to indicate rank without changing.
granularity (filtering for example or summation of sub
granularity).
Indicate the first template use-cases
• Our analysis need to indicate rank without changing.
granularity (filtering for example or summation of sub
granularity).
• When we want to consolidate tables which indicate
changes and the source doesn’t support it. For example
table which indicate transitions to plans might miss the
first plans.
Indicate the first template use-cases
• Our analysis need to indicate rank without changing.
granularity (filtering for example or summation of sub
granularity).
• When we want to consolidate tables which indicate
changes and the source doesn’t support it. For example
table which indicate transitions to plans might miss the
first plans.
• Our query goal is not clear whether changing granularity
is needed.
Indicate the first template questions
• Show indicator on the first segment touch point on the store
• Show indicator for the first event of product change for store
Now for more
“complicated”
patterns
Creating session for user within 30 min
Creating session for user within 30 min
Event_id time user_id type session
1 18:00 1 a 1
2 18:20 1 b 1
3 18:59 1 a 2
4 18:00 2 b 1
Creating session for user within 30 min
Creating session for user within 30 min
https://blog.modeanalytics.com/finding-user-sessions-sql/
Creating session template
Creating session template
SELECT *,
SUM(new_session) over (PARTITION BY user_id ORDER BY time) AS new_session
FROM ( SELECT *
CASE WHEN (time - LEAD(time) over (PARTITION BY user_id ORDER BY time)) >
<INTERVAL>
THEN 1
ELSE 0
END AS new_session
FROM t
)
Creating session with condition
Creating session with condition
Lets say we don’t want to sessionize events with type b
Creating session with condition
Event_id time user_id type session
1 18:00 1 a 1
2 18:20 1 b
3 18:59 1 a 2
4 18:00 2 a 1
Lets say we don’t want to sessionize events with type b
Creating session with condition
Creating session with condition
Creating session with condition
SELECT *,
SUM(new_session) over (PARTITION BY user_id ,condition ORDER BY time) AS new_session
FROM ( SELECT *
CASE WHEN (time - LEAD(time) over (PARTITION BY user_id,condition ORDER BY time)) >
<INTERVAL>
THEN 1
ELSE 0
END AS new_session
FROM t
)
Creating session with condition
template use-cases
• Our analysis need to indicate rank without changing.
granularity (filtering for example).
Finding Series length
Finding Series length
https://www.youtube.com/watch?v=mgipNdAgQ3o&t
=1070s%3Fstart=20:08&end=24:34
Finding Series length use-cases
Finding Series length use-cases
• Our analysis need to check sequence of improvement:
Finding Series length questions
• find the largest sequence of month with increase in lead creation
• find the largest sequence of month with decrease in ticket for product x
• find the difference in sequence of month with decrease in ticket for product
x after bug was fixed
Time decay template use-cases
Our analysis has time related importance calculation
• Calculated recommended movie , if I saw a movie a week
ago it will probably be more relevant to my interest than
movie I saw 5 years ago .
Our Business use case
Event_id time user_id package
1 1/1/18 1 a
2 1/2/18 1 b
3 1/6/18 1 a
4 1/7/18 1 c
Lets say we have the following package change table
Join on time interval
Join on time interval
Lets say we have the following package change
table
Join on time interval
Event_id From_time To_time user_id package
1 1/1/18 1/2/18 1 a
2 1/2/18 1/6/18 1 b
3 1/6/18 1/7/18 1 a
4 1/7/18 NULL 1 c
Lets say we have the following package change
table
Join on time interval
Event_id From_time To_time user_id package
1 1/1/18 1/2/18 1 a
2 1/2/18 1/6/18 1 b
3 1/6/18 1/7/18 1 a
4 1/7/18 1/1/99 1 c
Lets say we have the following package change
table
*
Join on time interval
*
Join on time interval
SELECT event_id,
type,
user_id,
package,
time AS from_time,
ISNULL(LEAD(time) over (PARTITION BY user_id ORDER BY time)),’1/1/99’)
AS to_time
FROM t
Join on time interval template
Join on time interval template
SELECT c1,
c2 AS from_time,
ISNULL(LEAD(c2) over (PARTITION BY c1 ORDER BY c2)),’2999-01-01’)
AS to_time
FROM t
Join on time interval template use-cases
Join on time interval template use-cases
• We need to join between two events which didn’t occur in
the exact same time
Join on time interval template use-cases
• We need to join between two events which didn’t occur in
the exact same time
• When we want to create “contracts” when we have
events only.
Join on time interval template use-cases
• We need to join between two events which didn’t occur in
the exact same time
• When we want to create “contracts” when we have
events only.
• When we want to consolidate two “contracts” from
different sources.
Join on time interval template questions
• Connecting orders to the browsing session which resulted in the
purchase
• For each month understand which package the customer is.
Our Business use case
Event_id time user_id type click_num
1 18:00 1 a 2
2 18:20 1 a 3
3 18:59 1 b 4
3 18:59 1 b 4
Lets say we have the following events table
Deduplication
Deduplication
Event_id time user_id type click_num
1 18:00 1 a 2
2 18:20 1 a 3
3 18:59 1 b 4
Deduplication
Deduplication
SELECT event_id,
time,
user_id,
type,
click_num
FROM (SELECT *
row_number() over (PARTITION BY event_id ORDER BY time)) AS
rnk
FROM t)
WHERE rnk=1
Deduplication template
Deduplication template
SELECT *
FROM (SELECT c1,
c2,
row_number() over (PARTITION BY pk ORDER BY c2)) AS rnk
FROM t)
WHERE rnk=1
Deduplication template questions
• Return all won opportunities except ones which occur
in the same month (keep only last)
Window Functions Rocks
Window Functions Rocks

Advance sql - window functions patterns and tricks

  • 1.
  • 2.
    What is awindow function https://blog.matters.tech/sql-window-functions-basics-e9a9fa17ce7e
  • 3.
    Did they forgetanything ?
  • 4.
  • 5.
    • NULLS valuesare treated as their own group over the partitioned columns. NULLS !!!!!
  • 6.
    • NULLS valuesare treated as their own group over the partitioned columns. • They sorted and ranked according to the NULLS FIRST or NULLS LAST option. NULLS !!!!!
  • 7.
    • NULLS valuesare treated as their own group over the partitioned columns. • They sorted and ranked according to the NULLS FIRST or NULLS LAST option. • By default, NULL values are sorted and ranked last in ASC ordering, and sorted and ranked first in DESC ordering. NULLS !!!!!
  • 9.
  • 10.
    Supported window functions •Window function are vendor ‘ dependent!!!!
  • 11.
    Supported window functions •Window function are vendor ‘ dependent!!!! • Most vendors support the all the basic window functions like avg, min, max, count, sum, lead, lag, rank etc.
  • 12.
    Supported window functions •Window function are vendor ‘ dependent!!!! • Most vendors support the all the basic window functions like avg, min, max, count, sum, lead, lag, rank etc. • These are the vendor specific supported window functions:
  • 13.
    Supported window functions •Window function are vendor ‘ dependent!!!! • Most vendors support the all the basic window functions like avg, min, max, count, sum, lead, lag, rank etc. • These are the vendor specific supported window functions:
  • 15.
    Our Business usecase Lets say we have the following events table
  • 16.
  • 17.
    Our Business usecase Event_id time user_id type click_num 1 18:00 1 a 2 2 18:20 1 a 3 3 18:59 1 b 4 4 18:00 2 b 1 Lets say we have the following events table
  • 18.
  • 19.
    Calculating Cumulative sumvia sum Event_id time user_id type click_num cumsum 1 18:0 0 1 a 2 2 2 18:2 0 1 a 3 5 3 18:5 9 1 b 4 9 4 18:0 0 2 b 1 1
  • 20.
  • 21.
    Calculating Cumulative sumvia sum SELECT event_id, time, user_id, type, sum(click_num) over (PARTITION BY user_id ORDER BY time ROWS UNBOUNDED PRECEDING) FROM t
  • 22.
  • 23.
    Calculating Cumulative sumvia sum DEFAULTS!!!
  • 24.
    Calculating Cumulative sumvia sum DEFAULTS!!!
  • 25.
    Calculating Cumulative sumvia sum • Calculate cumulative sum of lead per month • Calculate cumulative sum of orders’ amount per store
  • 27.
  • 28.
    Calculate Growth Event_id timeuser_id type click_num growth 1 18:0 0 1 a 2 NULL 2 18:2 0 1 b 3 1/2 3 18:5 9 1 a 4 1/3 4 18:0 0 2 b 1 NULL
  • 29.
  • 30.
    Calculate Growth SELECT event_id, time, user_id, type, lag(num_click)over (PARTITION BY user_id ORDER BY time) AS last_num_click, (num_click - last_num_click) / last_num_click FROM t
  • 31.
    Calculate Growth SELECT event_id, time, user_id, type, lag(num_click)over (PARTITION BY user_id ORDER BY time) AS last_num_click, (num_click - last_num_click) / last_num_click FROM t *
  • 32.
    Calculate Growth • Calculategrowth of accounts month after month • Calculate growth of orders amount per store month after month • Calculate growth of orders amount per store and opportunity
  • 33.
    Calculate Growth • Calculategrowth of accounts month after month • Calculate growth of orders amount per store month after month • Calculate growth of orders amount per store and opportunity HARD :P
  • 35.
    Indicate the firstevent for each user
  • 36.
    Indicate the firstevent for each user Event_id time user_id type first 1 18:00 1 a 1 2 18:20 1 b 0 3 18:59 1 a 0 4 18:00 2 b 1
  • 37.
    Indicate the firstevent for each user
  • 38.
    Indicate the firstevent for each user SELECT event_id, time, user_id, type floor(1/ row_number() over (PARTITION BY user_id ORDER BY time)) FROM t
  • 39.
    Indicate the firstrow template
  • 40.
    Indicate the firstrow template SELECT c1, c2, floor(1/ row_number() over (PARTITION BY c1 ORDER BY c2)) FROM t
  • 41.
    Indicate the firsttemplate use-cases
  • 42.
    Indicate the firsttemplate use-cases • Our analysis need to indicate rank without changing. granularity (filtering for example or summation of sub granularity).
  • 43.
    Indicate the firsttemplate use-cases • Our analysis need to indicate rank without changing. granularity (filtering for example or summation of sub granularity). • When we want to consolidate tables which indicate changes and the source doesn’t support it. For example table which indicate transitions to plans might miss the first plans.
  • 44.
    Indicate the firsttemplate use-cases • Our analysis need to indicate rank without changing. granularity (filtering for example or summation of sub granularity). • When we want to consolidate tables which indicate changes and the source doesn’t support it. For example table which indicate transitions to plans might miss the first plans. • Our query goal is not clear whether changing granularity is needed.
  • 45.
    Indicate the firsttemplate questions • Show indicator on the first segment touch point on the store • Show indicator for the first event of product change for store
  • 47.
  • 49.
    Creating session foruser within 30 min
  • 50.
    Creating session foruser within 30 min Event_id time user_id type session 1 18:00 1 a 1 2 18:20 1 b 1 3 18:59 1 a 2 4 18:00 2 b 1
  • 51.
    Creating session foruser within 30 min
  • 52.
    Creating session foruser within 30 min https://blog.modeanalytics.com/finding-user-sessions-sql/
  • 53.
  • 54.
    Creating session template SELECT*, SUM(new_session) over (PARTITION BY user_id ORDER BY time) AS new_session FROM ( SELECT * CASE WHEN (time - LEAD(time) over (PARTITION BY user_id ORDER BY time)) > <INTERVAL> THEN 1 ELSE 0 END AS new_session FROM t )
  • 55.
  • 56.
    Creating session withcondition Lets say we don’t want to sessionize events with type b
  • 57.
    Creating session withcondition Event_id time user_id type session 1 18:00 1 a 1 2 18:20 1 b 3 18:59 1 a 2 4 18:00 2 a 1 Lets say we don’t want to sessionize events with type b
  • 58.
  • 59.
  • 60.
    Creating session withcondition SELECT *, SUM(new_session) over (PARTITION BY user_id ,condition ORDER BY time) AS new_session FROM ( SELECT * CASE WHEN (time - LEAD(time) over (PARTITION BY user_id,condition ORDER BY time)) > <INTERVAL> THEN 1 ELSE 0 END AS new_session FROM t )
  • 61.
    Creating session withcondition template use-cases • Our analysis need to indicate rank without changing. granularity (filtering for example).
  • 62.
  • 63.
  • 64.
  • 65.
    Finding Series lengthuse-cases • Our analysis need to check sequence of improvement:
  • 66.
    Finding Series lengthquestions • find the largest sequence of month with increase in lead creation • find the largest sequence of month with decrease in ticket for product x • find the difference in sequence of month with decrease in ticket for product x after bug was fixed
  • 67.
    Time decay templateuse-cases Our analysis has time related importance calculation • Calculated recommended movie , if I saw a movie a week ago it will probably be more relevant to my interest than movie I saw 5 years ago .
  • 68.
    Our Business usecase Event_id time user_id package 1 1/1/18 1 a 2 1/2/18 1 b 3 1/6/18 1 a 4 1/7/18 1 c Lets say we have the following package change table
  • 69.
    Join on timeinterval
  • 70.
    Join on timeinterval Lets say we have the following package change table
  • 71.
    Join on timeinterval Event_id From_time To_time user_id package 1 1/1/18 1/2/18 1 a 2 1/2/18 1/6/18 1 b 3 1/6/18 1/7/18 1 a 4 1/7/18 NULL 1 c Lets say we have the following package change table
  • 72.
    Join on timeinterval Event_id From_time To_time user_id package 1 1/1/18 1/2/18 1 a 2 1/2/18 1/6/18 1 b 3 1/6/18 1/7/18 1 a 4 1/7/18 1/1/99 1 c Lets say we have the following package change table *
  • 73.
    Join on timeinterval *
  • 74.
    Join on timeinterval SELECT event_id, type, user_id, package, time AS from_time, ISNULL(LEAD(time) over (PARTITION BY user_id ORDER BY time)),’1/1/99’) AS to_time FROM t
  • 75.
    Join on timeinterval template
  • 76.
    Join on timeinterval template SELECT c1, c2 AS from_time, ISNULL(LEAD(c2) over (PARTITION BY c1 ORDER BY c2)),’2999-01-01’) AS to_time FROM t
  • 77.
    Join on timeinterval template use-cases
  • 78.
    Join on timeinterval template use-cases • We need to join between two events which didn’t occur in the exact same time
  • 79.
    Join on timeinterval template use-cases • We need to join between two events which didn’t occur in the exact same time • When we want to create “contracts” when we have events only.
  • 80.
    Join on timeinterval template use-cases • We need to join between two events which didn’t occur in the exact same time • When we want to create “contracts” when we have events only. • When we want to consolidate two “contracts” from different sources.
  • 81.
    Join on timeinterval template questions • Connecting orders to the browsing session which resulted in the purchase • For each month understand which package the customer is.
  • 83.
    Our Business usecase Event_id time user_id type click_num 1 18:00 1 a 2 2 18:20 1 a 3 3 18:59 1 b 4 3 18:59 1 b 4 Lets say we have the following events table
  • 84.
  • 85.
    Deduplication Event_id time user_idtype click_num 1 18:00 1 a 2 2 18:20 1 a 3 3 18:59 1 b 4
  • 86.
  • 87.
    Deduplication SELECT event_id, time, user_id, type, click_num FROM (SELECT* row_number() over (PARTITION BY event_id ORDER BY time)) AS rnk FROM t) WHERE rnk=1
  • 88.
  • 89.
    Deduplication template SELECT * FROM(SELECT c1, c2, row_number() over (PARTITION BY pk ORDER BY c2)) AS rnk FROM t) WHERE rnk=1
  • 90.
    Deduplication template questions •Return all won opportunities except ones which occur in the same month (keep only last)
  • 92.
  • 93.

Editor's Notes

  • #14 Spark supported window functions: https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-functions-windows.html Redshift supported window functions: https://docs.aws.amazon.com/redshift/latest/dg/c_Window_functions.html
  • #22 TODO cummulitive sum
  • #25 Redshift defaults- https://docs.aws.amazon.com/redshift/latest/dg/r_Window_function_synopsis.html Spark defaults - https://stackoverflow.com/questions/47130030/whats-the-default-window-frame-for-window-functions
  • #53 Nice trick: https://blog.jooq.org/2016/10/31/a-little-known-sql-feature-use-logical-windowing-to-aggregate-sliding-ranges/ Not supported in most engines
  • #56 Example segment
  • #57 Example segment
  • #58 Example segment
  • #59 Example segment
  • #60 Example segment
  • #61 Example segment
  • #63 Can be used for sessionization as well
  • #64  - Link is https://www.youtube.com/watch?v=mgipNdAgQ3o&t=1070s%3Fstart=20:08&end=24:34