Data Science
Part 2 - SQL
CANTILEVER LABS IS OFFICIAL TRAINING PARTNER OF
IIT BOMBAY | IIT MADRAS | IIT KHARGPUR | IIT HYDERABAD
BITS PILANI | BITS HYDERABAD | NIT ROURKELA | SYMBIOSIS
JNTU HYDERABAD | SREENIDHI | MAHINDRA UNIVERSITY | GITAM
srm
chennai
| gnits
| cmr-cet
| GEETHANJALI
| CHITKARA
& many more
Index
Part 2/4 - SQL
Part 1 - Python Part 3 - P&S Part 4 - Deep into Data Science
01
Lets understand first, As a Data scientist why do we need SQL?
02
Case study 1: (PAYPAL interview)
03
Case study 3 : Joins
04
Case study 4 : Analyzing telecom data
05
GENERAL SQL QUESTIONS
Lets understand first, As a Data scientist why do we need SQL?
When we learn machine learning academically, we use datasets from Kaggle or other
such websites. Many of those datasets are readymade (we directly get a csv file with x
rows and y columns). In industrial setting, we have to prepare datasets from multiple
data sources(tables), build hypothesis and test them. There can be multiple SQL queries
running at backend to prepare just one column(feature) in your dataset, which may
involve aggregation, ordering, windowing, joins and many such SQL operations.
As machine learning models performance majorly depends on quality features its been
trained on, So while project is in development phase, we have to do lot of
experimentation with hyperparameters and quality features, for which we have to try
new set of features for improvements, that’s where at least basic understanding of SQL
comes handy. Later when pilots are successful, these data extraction pipelines will be
automated by data engineers where you need advanced knowledge to optimize
workflows.
Also for general purpose analytics to gain more insights from data we need SQL as
Excel has its limitations when it comes to big data.
Some examples of features that you will be using as input dataset of your machine
learning model in various industries are,
Telecom : How many times customer did recharge after expiry of his prepaid plan, Avg
of last 3 recharges MRP
Finance : Sum/avg of top 3 high value transactions of customer, days passed since
recent transaction, creditworthiness
Manufacturing : Number of times maintenance activity performed, days between each
maintenance or breakdowns of machine
Ecommerce : Tag customers who did > 50$ purchase in their 2nd transaction (high
value repeating customers)
In following sections you will find 2 types of question sets, one will be case study
based, which are mostly asked in interviews, second will be fundamental questions. At
the end we also have interview checklist for SQL and some useful links to learn and
practice sql.
01 Data Science | Part 2 - SQL
Case study 1: (PAYPAL interview)
Table 1 (daily transaction data) columns : Pymt_ID, Pymt_Date, Sndr_ID, Rcvr_ID, Amt
Table 2 columns : Rcvr_ID, Rcvr_name, Rcvr_Industry
Table 3 columns : Sndr_id, Sndr_name, Sndr_age
Q. Which industry has 3rd highest total receiving amount.
Case study 2 : Window functions
Table Name: Employee_MST (keeps record of active employee salary and dept)
Table Name: Employee_DTL (keeps record of all employees associated with company)
Q . Refer Above tables and Write a Query which gives below output,
02 Data Science | Part 2 - SQL
A. Output table has only employee names which have joined recently. Here concept
used is first get department wise ranking using window function with descending order
of dates and use that table with alias and then get the data which has recent date rank
(row_number) to get only recently joined employee
Case study 3 : Joins
Employee_name
Employee_dtl
Q. Write a Query which gives below Output.
03 Data Science | Part 2 - SQL
A:
Case study 4 : Analyzing telecom data
Table : 1 year data of recharges done by subscribers
Q1. HOW MANY TOTAL RECHARGES EACH SUBSCRIBER HAS DONE IN JUNE MONTH
Q2. WHICH RECHARGE PLAN MRP IS SUBSCRIBED MOST
Q3. EXTRACT CUSTOMERS WHO HAVE DONE MORE THAN 15 RECHARGES
04 Data Science | Part 2 - SQL
Q4. GET TOTAL, AVG AND MAXIMUM OF RECENT 3 RECHARGE AMOUNT OF
SUBSCRIBER
Q5. THERE ARE HOW MANY SUCH CUSTOMERS IN SYSTEM, WHO HAVE NOT DONE
ANY RECHARGE FOR LAST 35 DAYS
Q6. GET RECENT RECHARGE OF SUBSCRIBERS
GENERAL SQL QUESTIONS
Q1) What is the difference between ISNULL and COALESCE?
ISNULL is used when we want null values as imputed by our specified value in final table.
COALESCE returns first non null entry
Q2. What are different SQL commands : ( As a data scientist we majorly deal with
DDL, DML,DQL )
05 Data Science | Part 2 - SQL
Q3 . Types of joins in SQL :
Q4 . Data types in SQL :
Q5. What is the difference between Delete, Truncate and Drop ?
Delete : We can delete all rows or targeted rows based on condition
Truncate : We can delete all rows from table at once
Drop : We can delete entire table from database
06 Data Science | Part 2 - SQL
Q6. How is “PARTITION BY” different from “GROUP BY”?
PARTITION BY gives aggregated columns with each record in the specified table. If we have 15 records in
the table, the query output SQL PARTITION BY also gets 15 rows. On the other hand, GROUP BY gives
one row per group in result set.
E.g. : Suppose we have below table of student heights in class A and B,
We want to know avg. height of students from class A and B,
Group by clause will give below output
But, Now if I want to see each students height compared to their class avg. height, we will use partition
by clause as below.
Output :
Now its more informative for me to see each student height as well as class avg.
Q7. What is order of each SQL clause
07 Data Science | Part 2 - SQL
Q8. What is the difference between RANK() ,ROW_NUMBER() and DENSE_RANK() ?
Rank() : it is used in window function, it ranks the data as per order given in window. It skips the ranking
if it finds similar record for that window
Dense_rank() : it works in similar way as of rank(), but it does not skip ranking if it finds duplicate in
window
Row_num() : it returns simply row number of record in window function.
Q9.
Grouping Data and Using Aggregate Functions
Ordering Data Results
Selecting Data from Multiple Tables ( Joins )
Q10. Different types of aggregate functions
COUNT()
SUM()
MIN()
MAX()
AVG()
STDEV()
VAR()
08 Data Science | Part 2 - SQL
Q11. What are Constraints in SQL?
NOT NULL - Restricts NULL value from being inserted into a column.
CHECK - Verifies that all values in a field satisfy a condition.
DEFAULT - Automatically assigns a default value if no value has been specified for the field.
UNIQUE - Ensures unique values to be inserted into the field.
INDEX - Indexes a field providing faster retrieval of records.
PRIMARY KEY - Uniquely identifies each record in a table.
FOREIGN KEY - Ensures referential integrity for a record in another table.
Q12 . What are ACID properties?
Atomicity: This property ensures that the transaction is completed in all-or-nothing way.
Consistency: This ensures that updates made to the database is valid and follows rules and
restrictions.
Isolation: This property ensures integrity of transaction that are visible to all other transactions.
Durability: This property ensures that the committed transactions are stored permanently in the
database
Q13 . How to find the 5th highest salary in SQL?
Q14. What is cte in SQL
CTEs are Common Table Expressions that are used to create temporary result tables from which data
can be retrieved/ used.
Interview checklist for SQL :
Before interview, you should have at least solved problems that contain following SQL
clauses.
Group by, Order by, having, window functions, is null, rank, dense_rank, row_number,
min, max, avg, stdev, count, all types of joins, like, wildcards,.
09 Data Science | Part 2 - SQL
Useful links :
https://www.w3schools.com/sql/
https://www.codecademy.com/courses/learn-sql/lessons/aggregate-functions/exercises/
intro\
https://www.hackerrank.com/domains/sql\
09 Data Science | Part 2 - SQL
Part 1/4 - Python
Part 2 - SQL
Next Part 3 - P&S
Part 4 - Deep into Data Science
@cantilever_labs
@cantilever labs
@cantilever labs
www.cantileverlabs.com
Data Science | Part 1 - Python