Module 1 Applied Data Science 1.1 and 1.2
B. Tech. Semester - VIII
Scheme
Applied Data Science
• Mid Semester Examination – 20 Marks
• CCE Activity – (20 Marks (Activity-1), 20 Marks (Activity-2), Attendance – 10 Marks) – 20 Marks
• End Semester Examination – (100 Marks) – 60 Marks
Applied Data Science Laboratory
• Experiment – 5 Marks
• CIAP – (20 Marks (Activity-1), 20 Marks (Activity-2), Attendance – 10 Marks) – 40 Marks
Course Outcome
Module 1: Introduction
Module 2: Data Processing & Visualization
What is Data Science?
• Definition of Data Science.
• Differences between Data Science, ML, AI and Big Data Analytics.
Skills Required for Data Science
• Programming (Python, R)
• Statistics and probability
• Data visualization tools
Definition of Data Science
• Data Science is an interdisciplinary field that involves extracting meaningful insights and knowledge from
structured and unstructured data using scientific methods, processes, algorithms, and systems. It
combines elements of statistics, mathematics, computer science, and domain expertise to analyze data
and solve complex real-world problems.
• Data Handling
• Exploratory Data Analysis (EDA)
• Modeling and Machine Learning
• Visualization and Communication
• Applications Across Domains
Example
• Predicting customer behavior in e-commerce to improve product recommendations and increase sales.
Artificial Intelligence
Definition:
Applications:
Definition:
Applications:
Definition:
Applications:
Components of Data Science
Machine learning models
Data interpretation
1. Data Collection
This is the foundational step in any Data Science project.
Data Collection:
• The process of gathering data from various sources for analysis.
• Sources of Data:
• Databases (e.g., SQL, NoSQL)
• Web scraping (e.g., BeautifulSoup, Scrapy)
• APIs (e.g., Twitter API, REST APIs)
• Sensors and IoT devices
• Surveys and questionnaires
• Challenges in Data Collection:
• Missing data
• Inconsistent formats
• High volume of unstructured data
Key Tasks:
• Removing duplicates
• Handling missing values (e.g., imputation, deletion)
• Encoding categorical data (e.g., one-hot encoding)
• Standardizing and normalizing numerical data
• Tools Used:
• Python Libraries: Pandas, NumPy
• Software: Excel, OpenRefine
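The key tasks above can be sketched with Pandas. This is a minimal illustration, not a complete cleaning pipeline: the table, its column names (`customer_id`, `age`, `city`), and the choice of mean imputation and min-max scaling are assumptions made for the example.

```python
import pandas as pd

# Hypothetical raw records; column names and values are illustrative only.
raw = pd.DataFrame({
    "customer_id": [1, 2, 2, 3, 4],
    "age": [25.0, 32.0, 32.0, None, 41.0],
    "city": ["Pune", "Mumbai", "Mumbai", "Delhi", "Pune"],
})

# 1. Remove exact duplicate rows.
clean = raw.drop_duplicates().reset_index(drop=True)

# 2. Handle missing values by mean imputation
#    (deletion with dropna() is the other option named above).
clean["age"] = clean["age"].fillna(clean["age"].mean())

# 3. Normalize the numeric column to the [0, 1] range (min-max scaling).
age_min, age_max = clean["age"].min(), clean["age"].max()
clean["age_scaled"] = (clean["age"] - age_min) / (age_max - age_min)

# 4. One-hot encode the categorical column.
clean = pd.get_dummies(clean, columns=["city"], prefix="city")

print(clean)
```

Mean imputation is only one choice; median imputation or deleting the row may suit skewed or sparse data better.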
The goal is to explore and summarize the data to
understand trends, patterns, and relationships.
Data Analysis:
Machine Learning
Types of Models:
• Supervised Learning:
• Examples: Linear Regression, Decision Trees
• Use case: Predicting house prices
• Unsupervised Learning:
• Examples: K-Means Clustering, PCA
• Use case: Customer segmentation
• Reinforcement Learning:
Tools Used:
Key Elements:
Communicating Insights:
• Creating reports, dashboards, and presentations
Real-World Impact:
Importance of Data Science
1. Data-Driven Decision Making
2. Improved Business Efficiency
3. Personalized Customer Experiences
4. Competitive Advantage
6. Fraud Detection and Risk Management
7. Enhanced Decision Accuracy in Complex Systems
8. Supports Sustainable Development
Data Science Techniques
Hybrid Recommendation Systems: Combine collaborative
and content-based filtering for better accuracy.
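A hybrid recommender can be sketched as a weighted blend of a collaborative score and a content-based score. The sketch below is a toy illustration only and says nothing about how Amazon's system works: the purchase baskets, item tags, the Jaccard similarity, and the blending weight `alpha` are all assumptions made for the example.

```python
# Hypothetical toy data: purchase baskets and item tags are invented for illustration.
purchases = {
    "alice": {"laptop", "mouse"},
    "bob": {"laptop", "mouse", "keyboard"},
    "carol": {"laptop", "backpack"},
}
item_tags = {
    "laptop": {"electronics", "computer"},
    "mouse": {"electronics", "accessory"},
    "keyboard": {"electronics", "accessory"},
    "backpack": {"bags", "accessory"},
}


def jaccard(a, b):
    """Overlap between two sets, from 0 (disjoint) to 1 (identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def collaborative_score(user, item):
    """Similarity of the other users who bought the item, averaged over all other users."""
    others = [u for u in purchases if u != user]
    votes = [jaccard(purchases[user], purchases[u]) for u in others if item in purchases[u]]
    return sum(votes) / len(others) if others else 0.0


def content_score(user, item):
    """Best tag overlap between the candidate item and anything the user already owns."""
    overlaps = (jaccard(item_tags[item], item_tags[owned]) for owned in purchases[user])
    return max(overlaps, default=0.0)


def hybrid_recommend(user, alpha=0.5):
    """Rank unseen items by alpha * collaborative + (1 - alpha) * content."""
    candidates = set(item_tags) - purchases[user]
    scores = {
        item: alpha * collaborative_score(user, item) + (1 - alpha) * content_score(user, item)
        for item in candidates
    }
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)


ranked = hybrid_recommend("alice")
print(ranked)
```

Setting `alpha` closer to 1 trusts the behavior of similar users more; closer to 0 trusts item attributes more, which helps for new items with little purchase history.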
Benefits
How It Helps:
• Personalized recommendations increase the likelihood
of customers making additional purchases.
• Recommendations like "Customers who bought this also
bought" encourage users to explore related products,
leading to upselling and cross-selling.
Proof:
• McKinsey Report: Suggests that 35% of Amazon's total
revenue is generated from its recommendation system.
• Case Study: Research published in Journal of Big Data
highlights that personalized product recommendations
can increase average order value by 10-15%.
Increases Customer Retention and Loyalty
How It Helps:
• By offering a personalized shopping experience,
customers feel understood and valued, leading to higher
retention rates.
• Repeat customers tend to purchase more, contributing
significantly to overall revenue.
Proof:
• Forbes Study (2022): Found that 75% of customers are
more likely to return to an online retailer that provides
personalized recommendations.
• Amazon Prime Impact: Prime members are offered highly
personalized recommendations, which helps Amazon
achieve higher retention rates compared to competitors.
Enhances Customer Experience
How It Helps:
• Simplifies product discovery by filtering through millions
of items and presenting relevant options.
• Reduces decision fatigue, saving customers time and
effort in searching for products.
Proof:
• Baymard Institute Study: Reports that 58% of users
abandon e-commerce sites due to difficulty in finding
relevant products. Amazon mitigates this by offering
tailored recommendations, improving user satisfaction.
• Statista (2023): Amazon scored 78 out of 100 in
customer satisfaction surveys, partly attributed to its
recommendation system.
Drives User Engagement
How It Helps:
• By continuously suggesting new and relevant
products, users spend more time browsing Amazon’s
platform.
• Increased time on site correlates with a higher
probability of making a purchase.
Proof:
• Deloitte Research: Shows that personalized product
recommendations increase site engagement by up to
30%.
• Internal Amazon Analytics (reported by CNBC): Users
who engage with recommendations are 3 times more
likely to complete a purchase than those who don’t.
Facilitates Efficient Inventory Management
How It Helps:
• Data insights help Amazon understand product
demand patterns, allowing for better inventory
planning.
• Recommending slow-moving items to targeted
customers helps reduce inventory holding costs.
Proof:
• Business Insider Report: Amazon's efficient inventory
turnover is attributed to its recommendation system’s
ability to boost sales of underperforming products.
• Walmart Benchmarking Study: Shows that Amazon’s
inventory management outperforms traditional
retailers, with Data Science playing a crucial role.
Improves Marketing ROI
How It Helps:
• Amazon uses recommendation insights to target
customers with personalized emails and ads, improving
the return on investment (ROI) for marketing campaigns.
• Targeted advertising increases the relevance of
promotions, driving better conversion rates.
Proof:
• Econsultancy Research: Personalized email
recommendations result in 26% higher click-through
rates and 760% more revenue compared to non-
personalized emails.
• Amazon’s Sponsored Product Ads: These leverage
recommendation data to show relevant ads, which have
significantly higher conversion rates than generic ads.
Competitive Advantage and Market Leadership
How It Helps:
• The recommendation system differentiates Amazon from
competitors, providing a unique and seamless shopping
experience.
• Competitors like Walmart and Alibaba have tried to
replicate Amazon's recommendation engine but have not
matched its efficiency.
Proof:
• eMarketer (2023): Amazon holds a 39.5% share of the U.S.
e-commerce market, far ahead of competitors, with its
personalized shopping experience being a key factor.
• Comparison Study: Studies show that Amazon’s conversion
rate (13%) is significantly higher than the industry average
(2-3%), largely due to its recommendation engine.
Empowers Small Sellers on Amazon Marketplace
How It Helps:
• The recommendation system promotes products from small
and medium-sized sellers, leveling the playing field.
• By analyzing data on customer preferences, even niche
products gain visibility through recommendations.
Proof:
• Amazon Seller Statistics (2022): Over 50% of Amazon’s sales
come from third-party sellers. Many report significant sales
growth due to visibility from Amazon’s recommendation
system.
• Seller Testimonials: Several small businesses credit
Amazon’s recommendation system for boosting their sales
without requiring extensive marketing efforts.
Challenges Addressed Using Data Science
Overwhelming Product Choices:
• With millions of products, customers can
feel overwhelmed.
• The recommendation system narrows
options, presenting relevant and desirable
choices.
Dynamic Customer Preferences:
• Customer interests change over time.
• Data Science models adapt by continuously
learning from new data to stay relevant.
Impact on Amazon’s Business
Customer Retention:
• A seamless and personalized shopping
experience encourages repeat visits
and fosters customer loyalty.
Market Leadership:
• By leveraging Data Science, Amazon
maintains its competitive edge and sets
the benchmark for e-commerce
personalization.
Core Data Science Roles
Role | Primary Responsibility | Key Skills
Data Scientist | Extract insights and build predictive models to solve business problems. | Python, R, Machine Learning (Scikit-learn, TensorFlow), Statistics, Data Visualization (Tableau, Power BI)
Data Engineer | Build and maintain data infrastructure and pipelines. | SQL, Apache Spark, Data Pipelines, Cloud Platforms (AWS, Azure), Big Data (Hadoop, Kafka)
Data Analyst | Analyze data and create actionable insights for decision-making. | Excel, SQL, Tableau, Power BI, Exploratory Data Analysis, Statistical Techniques
Machine Learning Engineer | Develop and deploy machine learning models into production. | Python, TensorFlow, PyTorch, MLOps (MLflow), Model Deployment (APIs)
Business Intelligence Analyst | Provide data-driven insights through dashboards and reports. | BI Tools (Power BI, QlikView), KPI Analysis, Communication, Business Acumen
Specialized Data Science Roles
Role | Primary Responsibility | Key Skills
Data Architect | Design and manage the overall data framework and architecture. | Data Modeling (Erwin), Cloud Data Architecture, SQL, NoSQL, Data Governance
Data Administrator | Ensure the smooth operation and security of databases. | Database Management (MySQL, Oracle), Performance Optimization, Data Backup & Recovery
Statistician | Apply statistical techniques to analyze and interpret data. | SAS, SPSS, R, Hypothesis Testing, Regression Analysis
AI Research Scientist | Conduct research to develop advanced AI algorithms and solutions. | Deep Learning, NLP, TensorFlow, PyTorch, Research & Publication, Advanced Mathematics
Data Governance Specialist | Ensure compliance, data integrity, and ethical data use. | Data Privacy Laws (GDPR), Data Management Tools (Collibra), Risk Management, Communication
Skills Required for Data Science
Technical Skills
• Programming Language: Python, R, SQL, Java/Scala, etc.
• Data Manipulation and Processing: Pandas, NumPy, Excel, etc.
• Data Visualization Tools:
Analytical Skills
• Statistical Knowledge
• Data Exploration & Analysis
• Problem Solving
• Critical Thinking
Soft Skills
• Communication Skills
• Business Acumen
• Teamwork & Collaboration
• Adaptability & Curiosity
1. Problem Formulation
2. Data Acquisition
3. Data Preparation
4. Exploratory Data Analytics
5. Build Models
6. Interpret & Communicate Result
Problem Formulation
Understanding the problem statement requires a thorough study of the
business model.
Problem Formulation
• Problem Formulation:
• Understand the Context:
• The company operates 100 retail outlets.
• Seasonal trends and promotional events significantly affect product demand.
• Manual inventory tracking is time-consuming and prone to errors.
• Define the Scope:
• Analyze data for 50 high-demand product categories.
• Focus on sales data from the last two years.
• Exclude niche products and data from outlets that do not generate significant revenue.
• Identify Constraints:
• Incomplete data for certain periods due to system outages.
• Limited IT infrastructure for real-time analytics.
• Deadline to implement the solution before the upcoming holiday season.
• Set Success Criteria:
• Achieve a 90% demand forecast accuracy for key products.
• Reduce stockouts to fewer than 5 occurrences per month.
• Lower inventory holding costs by at least 10%.
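The success criteria above can be turned into a simple programmatic check. This is a sketch under stated assumptions: the slide does not define "forecast accuracy", so it is taken here as 100% minus the mean absolute percentage error (MAPE), and all function names and sample numbers are hypothetical.

```python
def forecast_accuracy(actual, forecast):
    """Accuracy in percent, defined here (as an assumption) as 100 minus MAPE."""
    errors = [abs(a - f) / a for a, f in zip(actual, forecast) if a != 0]
    return 100.0 * (1 - sum(errors) / len(errors))


def meets_success_criteria(actual, forecast, stockouts_per_month, cost_before, cost_after):
    """Check the three targets: >= 90% accuracy, < 5 stockouts, >= 10% cost reduction."""
    cost_reduction = 100.0 * (cost_before - cost_after) / cost_before
    return {
        "forecast_accuracy_ok": forecast_accuracy(actual, forecast) >= 90.0,
        "stockouts_ok": stockouts_per_month < 5,
        "holding_cost_ok": cost_reduction >= 10.0,
    }


# Hypothetical monthly figures for one product category.
actual_demand = [100, 120, 80, 150]
forecast_demand = [95, 126, 84, 141]

result = meets_success_criteria(
    actual_demand,
    forecast_demand,
    stockouts_per_month=3,
    cost_before=50_000,
    cost_after=44_000,
)
print(forecast_accuracy(actual_demand, forecast_demand), result)
```

Writing the criteria down as code early forces the team to agree on exact definitions before any modeling starts.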
Data Acquisition
Identify the data sources relevant to the defined problem statement,
and decide on the format and tools for data acquisition.
Identify Relevant Data Sources in Data Science
• Identifying the right data sources is a crucial step in any Data Science project. The quality and
relevance of data directly impact the accuracy and usefulness of the insights generated. Data
sources can be broadly categorized into internal and external sources, and specific tools are
used to acquire this data effectively.
• Types of Data
• Tools for Data Acquisition
Data Preparation
The process of cleaning, organizing, and
transforming raw data into a format that can
be analyzed.
Data Cleaning
Data cleaning, also known as data cleansing or preprocessing, is a crucial step in the
Data Science process. It involves preparing raw data for analysis by identifying and
correcting errors, filling in missing values, standardizing formats, and addressing
inconsistencies. Clean data ensures the accuracy, reliability, and efficiency of subsequent
analyses or machine learning models.
Why Data Cleaning Is Important
• Improves Data Quality: Ensures accuracy, consistency, and reliability.
• Enhances Model Performance: Clean data results in better-performing machine learning models.
• Facilitates Analysis: Reduces noise, making it easier to detect meaningful patterns.
• Saves Time: Prevents issues during later stages of data analysis.
Key Steps in Data Cleaning
• Date Formats: Convert all date entries to a consistent format (e.g., YYYY-MM-DD).
• Units and Scales: Convert measurements to a common unit (e.g., inches to centimeters, USD to EUR).
• Categorical Encoding: Standardize categorical variables using consistent labels or encoding methods (e.g., one-hot encoding, label encoding).
Other Common Data Cleaning Tasks
1. Remove Duplicate Records: Eliminate redundant entries to avoid skewing results.
2. Fix Structural Errors: Correct typos, inconsistent capitalization, or misnamed categories (e.g., “NY” vs. “New York”).
3. Filter Irrelevant Data: Remove data that does not contribute to solving the problem (e.g., unrelated columns).
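Several of the standardization steps above (date formats, units, and structural errors such as “NY” vs. “New York”) can be sketched with the Python standard library. The accepted date formats, the alias table, and the sample records are assumptions made for the example; real data usually needs a longer list of each.

```python
from datetime import datetime

# Assumed input formats and aliases; extend these for real data.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%Y/%m/%d")
CITY_ALIASES = {"ny": "New York", "new york": "New York"}
INCH_TO_CM = 2.54


def standardize_date(text):
    """Try each known format and return the date as YYYY-MM-DD."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    raise ValueError(f"Unrecognized date format: {text!r}")


def standardize_city(text):
    """Map known aliases to one label; otherwise just tidy spacing and case."""
    key = text.strip().lower()
    return CITY_ALIASES.get(key, text.strip().title())


def inches_to_cm(value):
    """Convert a length in inches to centimeters, rounded to 2 decimals."""
    return round(value * INCH_TO_CM, 2)


# Hypothetical records with mixed formats.
records = [
    {"date": "2024-03-05", "city": "NY", "height_in": 70},
    {"date": "05/03/2024", "city": "new york", "height_in": 65.5},
    {"date": "2024/03/05", "city": " New York ", "height_in": 70},
]

cleaned = [
    {
        "date": standardize_date(r["date"]),
        "city": standardize_city(r["city"]),
        "height_cm": inches_to_cm(r["height_in"]),
    }
    for r in records
]
print(cleaned)
```

Note that a string such as 05/03/2024 is ambiguous (5 March or 3 May); the sketch assumes day-first, and the right choice depends on where the data came from.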
Importance of Data Cleaning
Exploratory Data Analytics
A process used by data scientists to validate data, generate hypotheses,
identify trends, and summarize data characteristics.
Build Model
The process of deploying machines for understanding a system.
Building the Model
Define the Problem Type
• Supervised Learning (e.g., regression, classification)
• Unsupervised Learning (e.g., clustering, anomaly detection)
• Reinforcement Learning
Select the Appropriate Model
• Regression: Linear Regression, Random Forest
• Classification: Logistic Regression, SVM, Neural Networks
• Clustering: k-Means, Hierarchical Clustering
• Anomaly Detection: Isolation Forest, Autoencoders
Train the Model
• Data splitting: Training and testing sets
• Cross-validation techniques
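Data splitting and cross-validation can be illustrated without any library. The sketch below is a simplified stand-in for what tools such as scikit-learn provide; the function names, the 80/20 split, and the fixed seed are assumptions made for the example.

```python
import random


def train_test_split(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of the rows and hold out a fraction for testing."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def k_fold_indices(n_rows, k=5):
    """Yield (train_idx, val_idx) pairs; each row is used for validation exactly once."""
    indices = list(range(n_rows))
    fold_sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size


rows = list(range(10))                       # stand-in for 10 data records
train_rows, test_rows = train_test_split(rows)
folds = list(k_fold_indices(len(train_rows), k=4))

print(len(train_rows), len(test_rows), len(folds))
```

The test set is split off first and left untouched; cross-validation is then run only on the training portion, so the final evaluation uses data the model has never seen.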
Building the Model
Evaluate Model Performance
• Regression Metrics: MAE, MSE, RMSE, R-squared
• Classification Metrics: Accuracy, Precision, Recall, F1-Score, ROC-AUC
• Clustering Metrics: Silhouette Score, Davies-Bouldin Index
Fine-Tune the Model
• Hyperparameter tuning: Grid search, random search
Test and Deploy the Model
• Final evaluation using test data
• Deployment for real-time predictions
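Most of the regression and classification metrics listed above follow directly from their definitions. The sketch below computes them in plain Python for small hypothetical prediction lists; ROC-AUC and the clustering metrics are left out because they need scores or cluster geometry rather than plain labels.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return MAE, MSE, RMSE, and R-squared for paired numeric lists."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    mean_true = sum(y_true) / n
    ss_total = sum((t - mean_true) ** 2 for t in y_true)
    ss_residual = sum(e * e for e in errors)
    return {"MAE": mae, "MSE": mse, "RMSE": math.sqrt(mse), "R2": 1 - ss_residual / ss_total}


def classification_metrics(y_true, y_pred, positive=1):
    """Return accuracy, precision, recall, and F1 for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": (tp + tn) / len(pairs), "precision": precision, "recall": recall, "f1": f1}


# Hypothetical predictions, invented for illustration.
reg = regression_metrics([3, 5, 7, 9], [2, 5, 8, 9])
clf = classification_metrics([1, 0, 1, 1, 0, 0], [1, 0, 0, 1, 1, 0])
print(reg)
print(clf)
```

Accuracy alone can mislead on imbalanced classes, which is why precision, recall, and F1 are usually reported alongside it.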
Interpret & Communicate
Summarize key findings and provide recommendations.
Summarize Key Findings
• Highlight essential insights derived from the model.
• Explain the significance of key features driving the predictions.
Use Clear and Effective Visualizations
• Types of visualizations: Bar charts, pie charts, line graphs, scatter plots, heatmaps, confusion matrix.
• Tools for visualization: Matplotlib, Seaborn, Tableau, Power BI.
• Netflix:
https://www.analyticsvidhya.com/blog/2023/06/netflix-case-study-eda-unveiling-data-driven-strategies-for-streaming/#h-official-documentation-and-resources
• Airbnb, AstraZeneca, Johnson & Johnson, IMD, HDFC:
https://www.upgrad.com/blog/top-data-science-case-studies-for-inspiration/
Lecture 3: Types of Analytics
Objective: Understand the different types of analytics and their applications.
Outline
Descriptive Analytics
• Definition and examples (e.g., sales reports, website traffic)
• Tools and techniques
Techniques:
Examples in Business:
Tools:
• Answers "Why did it happen?" by drilling down into data to identify the root causes of events.
• Enables businesses to uncover insights about failures, inefficiencies, or unexpected trends.
Techniques:
• Drill-Down Analysis: Breaking data into finer levels to uncover deeper insights.
• Correlation Analysis: Identifying relationships between variables (e.g., increased social media ad spend is associated with
higher sales).
• Root Cause Analysis: Tracing anomalies or unexpected results to their source.
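Correlation analysis, one of the techniques above, can be sketched with the Pearson correlation coefficient. The ad-spend and sales figures are invented for illustration, and the variable names are hypothetical.

```python
import math


def pearson_correlation(x, y):
    """Pearson r: covariance of x and y divided by the product of their spreads."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sum((a - mean_x) ** 2 for a in x)
    spread_y = sum((b - mean_y) ** 2 for b in y)
    return covariance / math.sqrt(spread_x * spread_y)


# Hypothetical weekly figures (in thousands).
ad_spend = [10, 20, 30, 40, 50]
sales = [12, 24, 33, 46, 55]

r = pearson_correlation(ad_spend, sales)
print(round(r, 3))
```

A value of r near +1 or -1 signals a strong linear relationship, but correlation alone does not show that one variable causes the other; root cause analysis is still needed to confirm the mechanism.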
Examples in Business:
Tools:
• SQL for querying data, Python for deeper statistical analysis, and tools like Splunk for system diagnostics.
Predictive Analytics
Purpose:
• Answers "What will happen?" by forecasting future outcomes based on historical data.
• Provides actionable insights to prepare for potential scenarios or risks.
Techniques:
Examples in Business:
Tools:
Purpose:
Techniques:
• Optimization Models: Finding the best solution among various alternatives (e.g., minimizing costs or maximizing profits).
• Simulation: Modeling scenarios to understand potential outcomes and their impact.
Examples in Business:
• Dynamic Pricing: Adjusting prices in real-time based on demand, competition, and market trends (e.g., surge pricing in
ride-hailing apps).
• Inventory Optimization: Ensuring the right amount of stock to meet demand without overstocking or understocking.
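Inventory optimization can be sketched as a small search over candidate stock levels, scoring each one against a set of demand scenarios. Everything specific here is an assumption for illustration: the demand scenarios and their probabilities, the price and cost figures, and the brute-force search, which real systems would replace with a proper optimization solver.

```python
def expected_profit(stock, demand_scenarios, unit_price, unit_cost, holding_cost):
    """Probability-weighted profit of a stock level across demand scenarios."""
    total = 0.0
    for demand, probability in demand_scenarios:
        sold = min(stock, demand)
        unsold = stock - sold
        profit = sold * unit_price - stock * unit_cost - unsold * holding_cost
        total += probability * profit
    return total


def best_stock_level(candidates, demand_scenarios, **costs):
    """Pick the candidate stock level with the highest expected profit."""
    return max(candidates, key=lambda s: expected_profit(s, demand_scenarios, **costs))


# Hypothetical scenarios: (units demanded, probability).
scenarios = [(80, 0.3), (100, 0.5), (120, 0.2)]
costs = {"unit_price": 10.0, "unit_cost": 6.0, "holding_cost": 1.0}

best = best_stock_level(range(60, 141, 10), scenarios, **costs)
print(best, expected_profit(best, scenarios, **costs))
```

Understocking loses the margin on each missed sale, while overstocking pays the purchase and holding cost on each unsold unit; the best level balances the two.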
Tools:
Retail:
Finance:
Marketing & Advertisement
Legal
Cybersecurity
Aerospace
Fashion
Lecture 5: Data Ethics and Challenges
Objective: Discuss ethical considerations and challenges in Data Science.
Outline
Ethical Issues
• Data privacy and security
• Bias and fairness in algorithms
• Transparency and accountability
Regulations and Standards
• GDPR, HIPAA, and other data protection laws
Data exploration can be broadly classified into two types—descriptive statistics and
data visualization.
Visualization is the process of projecting the data, or parts of it, into multi-dimensional
space or abstract images. All the useful (and adorable) charts fall under this category.
Data exploration in the context of data science uses both descriptive statistics and
visualization techniques.
OBJECTIVES OF DATA EXPLORATION
In the data science process, data exploration is leveraged in many different steps including
preprocessing or data preparation, modeling, and interpretation of the modeling results
Data Understanding
Data Preparation
Binary Attribute
Numeric Attributes
Continuous
Qualitative & Quantitative Attributes
Nominal Attribute
Nominal means “relating to names”. The values of a nominal attribute are symbols or names of
things. Each value represents some kind of category, code, or state, and so nominal
attributes are also referred to as categorical.
Example – Suppose that skin color and education status are two attributes describing
person objects. In our application, possible values for skin color are dark, white, and
brown. The attribute education status can contain the values undergraduate,
postgraduate, and matriculate. Both skin color and education status are nominal attributes.
Binary Attribute
Example – Given the attribute drinker describing a patient object, 1 indicates that the
patient drinks, while 0 indicates that the patient does not. Similarly, suppose the
patient undergoes a medical test that has two possible outcomes.
Ordinal Attribute
An ordinal attribute is an attribute whose possible values have
a meaningful order or ranking among them, but the
magnitude of the difference between successive values is not known.
Discrete Attribute: A discrete attribute has a finite or countably infinite set
of values, which may appear as integers.
The attributes skin color, drinker, medical report, and drink size each have a finite number
of values, and so are discrete.
• Accuracy: There are many possible reasons for flawed or inaccurate data, i.e., properties with incorrect values
caused by human or computer errors.
• Completeness: Incomplete data can occur for several reasons; attributes of interest, such as customer information for sales
& transaction data, may not always be available.
• Consistency: Incorrect data can also result from inconsistencies in naming conventions or data codes, or from inconsistent
input field formats. Duplicate tuples also need to be cleaned.
• Timeliness: Timeliness also affects the quality of the data. At the end of the month, several sales representatives fail to file their sales
records on time. There are also several corrections & adjustments that flow in after the end of the month. Data stored in
the database are therefore incomplete for a time after each month.
• Believability: It reflects how much users trust the data.
• Interpretability: It reflects how easily users can understand the data.
• Descriptive statistics refers to the study of the aggregate
quantities of a dataset.
Multivariate exploration is the study of more than one attribute in the dataset
simultaneously. This technique is critical to understanding the relationship
between the attributes, which is central to data science methods.
Univariate Exploration
• Mean
• Median
• Mode
Measure of Spread
• Range
• Deviation
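The measures listed above can be computed with Python's built-in `statistics` module. The `ages` list is hypothetical, and "Deviation" is taken here to mean the sample standard deviation, which is an assumption about the slide's intent.

```python
import statistics

# Hypothetical attribute values, invented for illustration.
ages = [23, 25, 25, 29, 31, 35, 42]

# Measures of central tendency
mean_age = statistics.mean(ages)        # arithmetic average
median_age = statistics.median(ages)    # middle value of the sorted data
mode_age = statistics.mode(ages)        # most frequent value

# Measures of spread
age_range = max(ages) - min(ages)       # largest minus smallest value
age_stdev = statistics.stdev(ages)      # sample standard deviation

print(mean_age, median_age, mode_age, age_range, round(age_stdev, 2))
```

The mean is pulled toward extreme values such as 42, while the median is not, so comparing the two gives a quick hint about skew in the attribute.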
Univariate Exploration: Measure of Central Tendency
The objective of finding the central location of an attribute is to
quantify the dataset with one central or most common number.
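\[
\text{Mean or } \bar{X} = \frac{\sum X_i}{n} = \frac{X_1 + X_2 + \cdots + X_n}{n}
\]
\[
\text{Weighted Mean or } X_w = \frac{\sum w_i X_i}{\sum w_i} = \frac{w_1 X_1 + w_2 X_2 + \cdots + w_n X_n}{w_1 + w_2 + \cdots + w_n}
\]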
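The two formulas translate directly into code. In this small sketch the course scores and credit weights are hypothetical values chosen so the arithmetic is easy to check by hand.

```python
def mean(values):
    """Arithmetic mean: the sum of the values divided by how many there are."""
    return sum(values) / len(values)


def weighted_mean(values, weights):
    """Weighted mean: sum of w_i * X_i divided by the sum of the weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(w * x for w, x in zip(weights, values)) / total_weight


# Hypothetical example: course scores weighted by credit hours.
scores = [80, 90, 70]
credits = [3, 4, 1]

print(mean(scores))                    # (80 + 90 + 70) / 3
print(weighted_mean(scores, credits))  # (240 + 360 + 70) / 8
```

When every weight is equal, the weighted mean reduces to the ordinary mean.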
Univariate Exploration: Measure of Central Tendency: Median