Analytical Functions inTeradata
25/12/2013
Data Delivery - BI
Telecom/ Belgacom
Vinay Pratap Singh Visen
Business Intelligence/Teradata
vinay.visen@tcs.com
Confidentiality Statement
Confidentiality and Non-Disclosure Notice
The information contained in this document is confidential and proprietary to TATA
Consultancy Services. This information may not be disclosed, duplicated or used for any
other purposes. The information contained in this document may not be released in
whole or in part outside TCS for any purpose without the express written permission of
TATA Consultancy Services.
Tata Code of Conduct
We, in our dealings, are self-regulated by a Code of Conduct as enshrined in the Tata
Code of Conduct. We request your support in helping us adhere to the Code in letter and
spirit. We request that any violation or potential violation of the Code by any person be
promptly brought to the notice of the Local Ethics Counselor or the Principal Ethics
Counselor or the CEO of TCS. All communication received in this regard will be treated
and kept as confidential.
TableofContent
1.
HowtodetectoverlaprowsfromahistorytableusingteradataOLAP.............................................................4
1.1Introduction ............................................................................................................................................................ 4
1.2OverlapsFunction ................................................................................................................................................... 4
1.3Solutionwithexample ............................................................................................................................................ 4
1.4Algorithm ................................................................................................................................................................ 5
2.
MergeadjacentperiodsinhistorytableusingTeradataOLAP..........................................................................6
2.1Introduction ............................................................................................................................................................ 6
2.2Solutionwithexample ............................................................................................................................................ 6
2.3Algorithm ................................................................................................................................................................ 7
3.
HowtoUseTeradataWindowAggregateFunctionsForDataMining...............................................................8
3.1Introduction ............................................................................................................................................................ 8
3.2Cumulativeandmovingsumusingsum/over......................................................................................................... 8
3.3CumulativeandmovingsumusingCSUM .............................................................................................................. 9
3.4Cumulativeandmovingsumwithreset ............................................................................................................... 10
3.5CumulativeandmovingsumwithresetusingCSUM ........................................................................................... 11
3.6MovingDifference(MDIFF)................................................................................................................................... 12
1.
How to detect overlap rows from a history table using teradata OLAP
1.1 Introduction
If a Teradata table maintains history of any particular concept and two time periods overlap with each other due to
data quality issues, they can be detected using Teradata OVERLAPS function.
1.2 Overlaps Function
Teradata OVERLAPS is a Boolean function which tests whether two time periods overlap one another. It returns
TRUE if the two time periods overlap else returns FALSE.
The syntax of the OVERLAPS function is:
SELECT <LITERAL>
WHERE
(<START_DATE_TIME>,<END_DATE_TIME>)OVERLAPS(<START_DATE_TIME>,<END_DATE_TIME>)
1.3 Solution with example
CUST_BILL_ACCT_HIST table maintains history of customer and their billing accounts. It contains customer, its
billing account, status and start and end dates
CUST_ID
645890
645890
645890
45890
BILL_ACCT_ID
345
589
640
234
START_DATE
20/05/2004
13/07/2006
13/04/2008
13/04/2008
END_DATE
12/07/2006
12/04/2008
20/08/2009
31/12/9999
CUST_STATUS
A
A
?
I
Overlaprecords
In the above example, 3rd and 4th record overlaps with each other. A customer can be linked to only one billing
account at a given moment. Here same customer is linked to two billing account at the same time resulting due to
overlap in the time period. Therefore, this overlap record should be detected and removed from the table to avoid
data quality issues.
Below query using OVERLAPS function can be used to detect and flag OVERLAP records-
SELECT
a.CUST_ID
, a.BILL_ACCT_ID
, a.START_DT
, a.END_DT
,a.CUST_STATUS
, MAX( CASE
WHEN (a.START_DT, a.END_DT) OVERLAPS (b.START_DT, b.END_DT)
THEN
CASE
WHEN ( a.START_DT = b.START_DT) AND (a.END_DT = b.END_DT) -- Overlap with itself, ignored
THEN
CASE
WHEN (
(a.CUST_ID = b.CUST_ID)
AND
(a.BILL_ACCT_ID = b.BILL_ACCT_ID) AND
(COALESCE(a.CUST_STATUS, 'NA') = COALESCE(b.CUST_STATUS, 'NA'))
)
THEN 0
ELSE 1
END
ELSE 1
END
ELSE 0
END
) as OVERLAP_FLAG
FROM
CUST_DB.CUST_BILL_ACCT_HIST A
INNER JOIN
CUST_DB.CUST_BILL_ACCT_HIST B
ON
a.BILL_ACCT_ID = b.BILL_ACCT_ID
GROUP BY 1,2,3,4,5
OUTPUT
CUST_ID
645890
645890
645890
645890
BILL_ACCT_ID
345
589
640
234
START_DATE
20/05/2004
13/07/2006
13/04/2008
13/04/2008
END_DATE
12/07/2006
12/04/2008
20/08/2009
31/12/9999
CUST_STATUS
A
A
?
I
OVERLAP_FLAG
0
0
1
1
1.4 Algorithm
Table is joined with itself.
To identify and flag overlapped records two CASE statements are used
First CASE statement is for detecting real overlap
Second CASE statement is for detecting overlap with itself (since table is joined with itself) and therefore
ignored.
If any of the columns contain NULL values in data, apply COALESCE function to get consistent results (Avoid
NULL comparison).
2.
Merge adjacent periods in history table using Teradata OLAP
2.1 Introduction
If a table maintains history of any particular concept and there are time intervals adjacent to each other in the
subsequent records, they can be merged together using Teradata analytical functions. This will result in a single
record out of multiple records.
2.2 Solution with example
E.g. CUST_BILL_ACCT_HIST table maintains history of customer and their billing accounts. It contains customer, its
billing account, status and start and end dates.
Below records, there are adjacent periods in subsequent rows. From row number 2 to 6, same CUST_ID (=5338) is
linked to BILL_ACCT_ID (=131). In each subsequent record, START_DATE of CURRENT row equals to END_DATE of
PREVIOUS row.
CUST_ID
2468
5338
5338
5338
5338
5338
1121
BILL_ACCT_ID
131
131
131
131
131
131
104
CUST_STATUS
A
A
A
A
A
A
A
START_DT
09/11/1999
26/03/2002
04/04/2002
08/04/2002
15/03/2004
20/03/2004
25/09/2004
END_DT
26/03/2002
03/04/2002
07/04/2002
14/03/2004
19/03/2004
24/09/2004
13/01/2005
Adjacentrecords
Below query using Teradata analytical functions can be used to merge adjacent time periods:-
SELECT
CUST_ID
,BILL_ACCT_ID
,CUST_STATUS
, min(a.START_DT) as START_DT
, max(a.END_DT) as END_DT
FROM
(
SELECT
CUST_ID
,BILL_ACCT_ID
,CUST_STATUS
,START_DT
,END_DT
, SUM(BREAK_FLAG) OVER (PARTITION BY CUST_ID, BILL_ACCT_ID, CUST_STATUS ORDER BY a.START_DT, a.END_DT
ROWS UNBOUNDED PRECEDING) as RUNNING_BREAK_FLAG
FROM
(
SELECT
CUST_ID
,BILL_ACCT_ID
,CUST_STATUS
,START_DT
,END_DT
, MIN(a.END_DT) OVER
(PARTITION BY CUST_ID, BILL_ACCT_ID, CUST_STATUS ORDER BY a.START_DT, a.END_DT ROWS BETWEEN 1 PRECEDING
AND 1 PRECEDING) as PREVIOUS_END_DATE
, CASE
WHEN (a.START_DT <> PREVIOUS_END_DATE) OR PREVIOUS_END_DATE IS NULL
THEN 1
ELSE 0
END as BREAK_FLAG
FROM CUST_DB.CUST_BILL_ACCT_HIST a
)a
)a
GROUP BY
CUST_ID
,BILL_ACCT_ID
,CUST_STATUS
,RUNNING_BREAK_FLAG
OUTPUT
CUST_ID
2468
5338
1121
BILL_ACCT_ID
131
131
104
CUST_STATUS
A
A
A
START_DT
09/11/1999
26/03/2002
25/09/2004
END_DT
26/03/2002
24/09/2004
13/01/2005
2.3 Algorithm
END_DATE of the previous row is calculated and is compared with START_DATE of the current row.
The gap between the two time periods is flagged as BREAK_FLAG.
All such BREAK_FLAGs are summed up and grouped together.
Minimum START_DATE of first row is calculated and will be used as START_DATE of merged row.
Maximum END_DATE of last row is calculated and will be used as END_DATE of merged row.
3.
How to Use Teradata Window Aggregate Functions For Data Mining
3.1 Introduction
A Window aggregate function (also called ordered analytic function) is an aggregate function on which a
window specification is applied and for which the result set is dependent on the ordering of the data.
It is used to perform on-line analytical processing (OLAP), which permits analysis and mining of data in the data
warehouse.
Like traditional aggregate functions, OLAP functions operate on groups of rows and permit qualification and
filtering of the group result. Unlike aggregation, OLAP functions return individual rows of information, not just
aggregate functions.
3.2 Cumulative and moving sum using sum/over
An accumulative sum implies an ordering on a particular column or expression.
The following ANSI syntax is used with SUM to provide accumulative sum OLAP functionality:
SELECT
< COLUMN-NAME>
SUM(<COLUMN_NAME> ) OVER (ORDER BY <COLUMN_A> [ASC|DESC],
[...., < COLUMN_B> [ASC|DESC]
[ROWS [BETWEEN] (UNBOUNDED | <NUMBER> ) PRECEDING [AND
(UNBOUNDED | <NUMBER> )
FOLLOWING] )
The ORDER BY clause defines the sequencing for accumulation. It may specify a single column or multiple columns
in a comma-separated list. It is not optional and it is part of the SUM / OVER. The default sequence is ascending
(ASC).
The ROWS UNBOUND PRECEDING clause indicates that all preceding rows in the listing are to be included in the
accumulation.
Below query shows the cumulative sales using the SUM / OVER:
SELECT
PRODUCT_ID
,SALE_DATE
,DAILY_SALES
,SUM(DAILY_SALES) OVER (ORDER BY PRODUCT_ID, SALE_DATE ROWS UNBOUNDED PRECEDING) AS
CUMULATIVE_SUM
FROM SALES_TAB
Product_ID
1000
1000
1000
2000
2000
2000
2000
Sale_Date
28/09/2000
29/09/2000
30/09/2000
28/09/2000
29/09/2000
30/09/2000
01/10/2000
Daily_Sales
48850.40
54500.22
36000.07
41888.88
48000.00
49850.03
54850.29
Cumulative_Sum
48850.40
103350.62
139350.69
373093.60
421093.60
470943.63
525793.92
3.3 Cumulative and moving sum using CSUM
Teradata provides standard OLAP function CSUM to achieve the cumulative sum functionality.
The syntax for CSUM is:
SELECT
<COLUMN-NAME>
CSUM(<COLUMN_NAME>, <SORT-KEY> [ASC|DESC] [,<SORT-KEY> [ASC|DESC] )
GROUP BY (<COLUMN_NAME>
The first parameter is entered as the column containing a numeric value to sum.
The second parameter is entered as a sort key that sequences the output rows.
Below query shows the cumulative sales using CSUM OLAP function:
SELECT
PRODUCT_ID
,SALE_DATE
,DAILY_SALES
,CSUM(DAILY_SALES, PRODUCT_ID) AS CSUM_RESET
FROM SALES_TAB
Product_ID
1000
1000
1000
2000
2000
2000
2000
Sale_Date
28/09/2000
29/09/2000
30/09/2000
28/09/2000
29/09/2000
30/09/2000
01/10/2000
Daily_Sales
48850.40
54500.22
36000.07
41888.88
48000.00
49850.03
54850.29
Cumulative_Sum
48850.40
103350.62
139350.69
373093.60
421093.60
470943.63
525793.92
3.4 Cumulative and moving sum with reset
An accumulative sum with reset is used to reset the accumulation value back to zero when the data value of a
particular group changes.
The following ANSI syntax is used with SUM to provide accumulative sum with reset OLAP functionality:
SELECT
<COLUMN-NAME>
SUM(<COLUMN_NAME> ) OVER (PARTITION BY (<COLUMN_NAME> ORDER BY <COLUMN_A> [ASC|DESC],
[...., < COLUMN_B>
[ASC|DESC] [ROWS [BETWEEN] (UNBOUNDED | <NUMBER> ) PRECEDING [AND
(UNBOUNDED | <NUMBER> ) FOLLOWING] )
The PARTITION BY clause is used to GROUP data and reset the accumulation value when the group value changes.
10
Below, query shows the cumulative sales with a reset on a change in the PRODUCT_ID:
SELECT
PRODUCT_ID
,SALE_DATE
,DAILY_SALES
,SUM(DAILY_SALES) OVER (PARTITION BY SALE_DATE ORDER BY PRODUCT_ID, SALE_DATE DESC ROWS
UNBOUNDED PRECEDING) AS CSUM_RESET
FROM SALES_TAB
Product_ID
1000
1000
1000
2000
2000
2000
3000
3000
3000
Sale_Date
03/10/2000
02/10/2000
01/10/2000
03/10/2000
02/10/2000
01/10/2000
03/10/2000
02/10/2000
01/10/2000
Daily_Sales
64300.00
32800.50
40200.43
43200.18
36021.93
54850.29
21553.79
19678.94
28000.00
CSUM_Reset
64300.00
97100.50
13,300.93
43200.18
79222.11
13072.40
ResetonchangeofGroup(ANSI)
21553.79
41232.73
69232.73
3.5 Cumulative and moving sum with reset using CSUM
An accumulative sum with reset using CSUM is used to reset the accumulation value back to zero when the data
value of a particular group changes.
The syntax for achieving this using CSUM is:
SELECT
<COLUMN-NAME>
CSUM(<COLUMN_NAME>, <SORT-KEY> [ASC|DESC] [,<SORT-KEY> [ASC|DESC] )
GROUP BY (<COLUMN_NAME>
GROUP by clause is used to reset the accumulation value back to zero when the data value of a particular group
changes.
Below query shows the cumulative sales with reset when the SALE_DATE changes
11
SELECT
PRODUCT_ID
,SALE_DATE
,DAILY_SALES
,CSUM(DAILY_SALES, SALE_DATE, PRODUCT_ID ) AS CSUM_RESET
FROM SALES_TAB
GROUP BYSALE_DATE
Product_ID Sale_Date
Daily_Sales CSUM_Reset
1000
01/10/2000 40200.43
13,300.93
2000
01/10/2000 54850.29
13072.40
3000
01/10/2000 28000.00
1000
02/10/2000 32800.50
69232.73
97100.50
2000
02/10/2000 36021.93
79222.11
3000
02/10/2000 19678.94
41232.73
1000
03/10/2000 64300.00
64300.00
2000
03/10/2000 43200.18
43200.18
3000
03/10/2000 21553.79
21553.79
ResetonchangeofGroup(OLAP)
Please note that here GROUP BY is used only to reset when the value changes.
3.6 Moving Difference (MDIFF)
The moving difference MDIFF () computes the moving difference of a column based on the on the value of n row
before.
The SYNTAX for this OLAP function is:
SELECT
<COLUMN_NAME>
MDIFF(<COLUMN_NAME>e, N, SORT LIST[1], [2]).
EXAMPLE
SELECT
PRODUCT_ID , SALE_DATE,
MDIFF(AMOUNT,2, SALE_DATE) as MOVING_DIFF
FROM TABLE_A
ORDER BY 1;
12
Product_ID Sale_Date
1000
2000
Amount
01/10/2000 2000
02/10/2000 3000
MDIFF
NULL
NULL
2500
3000
03/10/2000 4500
04/10/2000 6500
3500
2000
05/10/2000 4500
3000
06/10/2000 7000
500
3000
N
=2
Please note that for differences calculated on the first rows (where no value can be found on the preceding rows),
the function will return NULL.
13
Thank You
Contact
For more information, contact vinay.visen@tcs.com
AboutTataConsultancyServices(TCS)
TataConsultancyServicesisanITservices,consultingandbusinesssolutions
organizationthatdeliversrealresultstoglobalbusiness,ensuringalevelofcertaintyno
otherfirmcanmatch.TCSoffersaconsultingled,integratedportfolioofITandIT
enabledinfrastructure,engineeringandassuranceservices.Thisisdeliveredthroughits
TM
uniqueGlobalNetworkDeliveryModel ,recognizedasthebenchmarkofexcellencein
softwaredevelopment.ApartoftheTataGroup,Indiaslargestindustrialconglomerate,
TCShasaglobalfootprintandislistedontheNationalStockExchangeandBombay
StockExchangeinIndia.
Formoreinformation,visitusatwww.tcs.com.
ITServices
BusinessSolutions
Outsourcing
All content / information present here is the exclusive property of Tata Consultancy Services Limited (TCS). The content /
information contained here is correct at the time of publishing. No material from here may be copied, modified, reproduced,
republished, uploaded, transmitted, posted or distributed in any form without prior written permission from TCS.
Unauthorized use of the content / information appearing here may violate copyright, trademark and other applicable laws,
and could result in criminal or civil penalties. Copyright  2011 Tata Consultancy Services Limited
14