Unit 2
The Five Steps of Data Science
Overview of the five steps: obtain the data, scrub the data, explore the data, model the data, and communicate and visualize the results.

The OSEMN framework

1. Obtain Data

Definition: Obtain the Data

In data science, obtaining the data is the first crucial step. It involves gathering all the necessary
information from various sources before you can start analyzing and drawing conclusions.
Think of this step as laying the foundation for your project: without accurate and relevant
data, the analysis cannot be done effectively.

Key Activities

1. Query Databases
o Relational Databases (SQL):
▪ What It Is: Relational databases store data in structured tables. SQL
(Structured Query Language) is used to interact with these databases.
▪ How It Works: You write specific commands to request the data you need.
▪ Example: If you need all sales records from July 2024, you might use a
command like SELECT * FROM sales WHERE date BETWEEN '2024-07-01'
AND '2024-07-31';. This command retrieves rows from the sales table
for that date range.
o Non-Relational Databases (NoSQL):

▪ What It Is: Non-relational databases, such as MongoDB, store data in a more
flexible format (often as documents).
▪ How It Works: Instead of SQL commands, you use methods specific to the
database.
▪ Example: To find active products, you might use a command like
db.products.find({ "status": "active" }).
2. File Formats
o Excel and CSV:
▪ What It Is: These are common file formats for storing data in tables. Excel
files have more features (like formatting), while CSV files are simpler and use
commas to separate values.
▪ How It Works: To work with these files, special software or libraries are used.
▪ Example: If you have a CSV file of sales data, you might use a tool
like Microsoft Excel or a software library to open and analyze it.
o TSV (Tab-Separated Values):
▪ What It Is: Similar to CSV but uses tabs instead of commas to separate values.
▪ How It Works: Many of the same tools used for CSV files can handle TSV
files as well.
3. Web Scraping
o What It Is: Web scraping involves extracting data from websites.
o How It Works: Tools or scripts are used to automatically gather information from web
pages.
▪ Example: If you need product reviews from an online store, you might use a
tool to scrape the reviews directly from the store’s website.
4. Web APIs
o What It Is: APIs (Application Programming Interfaces) allow you to fetch data from
online services.
o How It Works: You make requests to a service (like Twitter) to get data in a structured
format.
▪ Example: To collect recent tweets about a product, you would use Twitter’s
API to request this data.
5. File Downloads
o What It Is: You can download datasets from online repositories or corporate archives.
o How It Works: These datasets can come in various formats (CSV, JSON, etc.), and
you use appropriate software or libraries to open and read them.
▪ Example: Downloading a dataset of customer feedback from Kaggle and
opening it with Excel or a data analysis tool.
6. Big Data Technologies
o What It Is: For handling very large datasets, specialized technologies like Hadoop or
Spark are used.
o How It Works: These tools distribute the data across multiple machines to manage
and process it efficiently.
▪ Example: Analyzing a massive dataset of user activity on a social media
platform might use Spark to handle the large volume of data.
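
The sketch below pulls items 2 through 5 together in code. It is only a minimal illustration, assuming Python with the pandas, requests, and beautifulsoup4 libraries; every file name, URL, and CSS selector is a hypothetical stand-in.

python

import pandas as pd
import requests
from bs4 import BeautifulSoup

# File formats: load tabular data from CSV, TSV, and Excel files
sales = pd.read_csv('sales.csv')              # comma-separated values
logs = pd.read_csv('logs.tsv', sep='\t')      # tab-separated values
budget = pd.read_excel('budget.xlsx')         # requires an Excel engine such as openpyxl

# Web scraping: extract review text from a (hypothetical) product page
page = requests.get('https://shop.example.com/product/123')
soup = BeautifulSoup(page.text, 'html.parser')
reviews = [tag.get_text(strip=True) for tag in soup.select('.review-text')]

# Web API: request structured JSON from a (hypothetical) endpoint
resp = requests.get('https://api.example.com/v1/posts',
                    params={'query': 'my_product', 'limit': 100})
resp.raise_for_status()                       # stop early on HTTP errors
posts = resp.json()

# File download: read a published dataset straight from its URL
feedback = pd.read_csv('https://example.com/datasets/feedback.csv')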

Hadoop Architecture:
Hadoop is an open-source framework used for distributed storage and processing of large
data sets. Its main components are:

1. Distributed Storage (HDFS): Hadoop Distributed File System (HDFS) is used for storing data across
multiple machines in a distributed manner.
2. MapReduce: This component processes data stored in HDFS. It divides tasks into smaller sub-tasks
and processes them in parallel.
3. YARN (Yet Another Resource Negotiator): Manages resources and schedules jobs in the Hadoop
cluster.
4. Hadoop Common: Provides common utilities and libraries (like Java libraries) that support the other
Hadoop modules.
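
To make the big-data step concrete, here is a minimal PySpark sketch, assuming the pyspark package is installed; the HDFS path and column name are hypothetical. On a real cluster, YARN would schedule this job and HDFS would hold the input.

python

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Start a Spark session (local here; a cluster manager like YARN in production)
spark = SparkSession.builder.appName('user-activity').getOrCreate()

# Read a large CSV that may be spread across many HDFS blocks
activity = spark.read.csv('hdfs:///data/user_activity.csv',
                          header=True, inferSchema=True)

# Count events per user; Spark executes this in parallel across the data
per_user = activity.groupBy('user_id').agg(F.count('*').alias('events'))
per_user.show(10)

spark.stop()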

Skills Required
1. Technical Skills:
o SQL: Ability to write commands to extract and manipulate data from relational
databases.
o File Handling: Knowing how to work with different file formats like CSV, Excel, and
JSON.
o Web Scraping & APIs: Understanding how to gather data from websites and online
services.
2. Programming Skills:
o Python/R: Proficiency in programming languages used for handling and analyzing
data. Python and R are commonly used for data manipulation.
o Big Data Tools: Familiarity with tools like Hadoop and Spark for large-scale data
processing.

Example Scenario
Imagine you're working on a project analyzing e-commerce sales data:

1. Query Databases: Extract sales records from a MySQL database to analyze trends over time.
2. Web Scraping: Scrape customer reviews from the e-commerce website to understand customer
sentiment.
3. Web APIs: Use a Twitter API to gather customer feedback on social media about your
products.

2. Scrub Data
Definition: Data Scrubbing

Data scrubbing, also known as data cleaning or data cleansing, is the process of preparing and
improving data to ensure its accuracy and consistency before it’s used for analysis. The goal is
to make sure that the data is clean, reliable, and in a format that is easy to work with.

Key Activities

1. What Is Data Conversion?

Definition: Data conversion is the process of changing data from one format or structure to
another to ensure consistency and usability across different systems and applications. This
often involves transforming data to fit a standard format that can be used seamlessly in different
platforms or applications.

Purpose:

• Consistency: Ensures that data from various sources is unified and follows a standardized
format, making it easier to work with and analyze.
• Accuracy: Reduces the risk of errors caused by inconsistent data formats or structures.
• Integration: Facilitates the merging of data from multiple sources, allowing for more
comprehensive analysis.
• Efficiency: Streamlines data processing and reporting by having data in a consistent format.

2. How Data Conversion Works

Step-by-Step Process:

1. Data Collection:
o Objective: Gather data from various sources. These sources might include databases,
spreadsheets, APIs, and more.
o Considerations: Ensure all relevant data is collected, noting the formats and
structures used.
2. Format Analysis:
o Objective: Analyze the formats and structures of the collected data.
o Activities: Identify different column names, data types (e.g., integers, strings, dates),
and formats (e.g., date formats like MM/DD/YYYY vs. YYYY-MM-DD).
3. Transformation:
o Objective: Convert data into a standardized format.
o Activities:
▪ Rename columns to have consistent names.
▪ Convert date formats to a single standard.
▪ Normalize numeric values (e.g., currency, decimal places).
▪ Reformat text (e.g., upper case to lower case).
4. Validation:
o Objective: Ensure the converted data is accurate and maintains its integrity.
o Activities:
▪ Check for data consistency and completeness.

▪ Verify that no data has been lost or corrupted during the conversion.
▪ Perform tests to ensure data accuracy.
5. Integration:
o Objective: Combine the converted data into a single dataset or system.
o Activities:
▪ Merge datasets with consistent formats.
▪ Update or replace existing data with the newly converted data.
▪ Ensure that the integrated data works as expected in the target system.

3. Example of Data Conversion

Scenario: You have several CSV files from different departments with varying column names
and formats.

Files:

• File 1: Contains a column labeled "Email" and dates in MM/DD/YYYY format.


• File 2: Contains a column labeled "email address" and dates in DD-MM-YYYY format.

Conversion Process:

• Column Renaming: Rename "email address" in File 2 to "Email" to match File 1.


• Date Format Standardization: Convert all dates to YYYY-MM-DD format.
• Data Merging: Combine both files into a single dataset with standardized columns and
formats.

Result:

• A unified dataset with consistent column names and standardized date formats, making it
ready for analysis or reporting.
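
A minimal pandas sketch of this conversion, assuming Python with pandas; the file names and the 'Date' column label are hypothetical stand-ins for the scenario above:

python

import pandas as pd

# Load the two departmental files (hypothetical names)
file1 = pd.read_csv('dept1.csv')   # has column 'Email', dates as MM/DD/YYYY
file2 = pd.read_csv('dept2.csv')   # has column 'email address', dates as DD-MM-YYYY

# Column renaming: align File 2 with File 1
file2 = file2.rename(columns={'email address': 'Email'})

# Date format standardization: parse each source format, emit YYYY-MM-DD
file1['Date'] = pd.to_datetime(file1['Date'], format='%m/%d/%Y').dt.strftime('%Y-%m-%d')
file2['Date'] = pd.to_datetime(file2['Date'], format='%d-%m-%Y').dt.strftime('%Y-%m-%d')

# Data merging: stack the standardized rows into one dataset
unified = pd.concat([file1, file2], ignore_index=True)
unified.to_csv('unified.csv', index=False)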

4. Common Challenges in Data Conversion

• Inconsistent Data Formats: Data might come in various formats, such as different date styles
or number representations.
• Data Quality Issues: Data may be missing, incomplete, or contain errors, which can complicate
the conversion process.
• Complex Data Structures: Some data may be nested or hierarchical, making it challenging to
flatten or standardize.
• Scalability: Handling large volumes of data efficiently can be difficult, requiring robust tools
and techniques.

5. Best Practices for Data Conversion

• Define Standard Formats:


o Establish clear rules and formats for how data should be standardized. Document
these rules to maintain consistency.
• Automate Where Possible:
o Use tools and scripts to automate repetitive tasks in the conversion process, reducing
manual errors and saving time.

• Validate Data Thoroughly:


o Implement thorough validation checks to ensure that the converted data is accurate
and complete. This includes performing spot checks and comparing results with
original data.
• Document Processes:
o Keep detailed records of the conversion methods, rules, and transformations used.
This documentation helps with troubleshooting and maintaining consistency in future
conversions.

1. What Is Data Filtering?

Definition: Data filtering is the process of identifying and removing data that is irrelevant,
incorrect, or corrupted. This step is crucial in data preparation to ensure that subsequent
analysis is based on high-quality, reliable data.

Purpose:

• Improve Accuracy: Ensures that the data used in analysis is accurate and relevant, which leads
to more reliable results.
• Enhance Efficiency: By filtering out unnecessary or erroneous data, you streamline the
dataset, which speeds up processing and analysis.
• Ensure Reliability: Reduces the risk of drawing incorrect conclusions based on flawed data.
• Provide Clarity: Helps in focusing on the most relevant data, making analysis easier and more
effective.

2. How Data Filtering Works

Step-by-Step Process:

1. Identify Irrelevant Data:


o Objective: Determine which data points do not contribute to the analysis.
o Examples:
▪ Missing values in critical columns.
▪ Outliers that fall outside expected ranges.
▪ Duplicated records.
2. Define Filtering Criteria:
o Objective: Establish rules for what constitutes irrelevant or incorrect data.
o Criteria Examples:
▪ Missing Data: Remove rows where essential fields are missing (e.g., customer
names, purchase dates).
▪ Range Constraints: Exclude data that falls outside logical or acceptable ranges
(e.g., negative sales amounts).
▪ Consistency: Filter out data that does not adhere to expected formats (e.g.,
dates in an incorrect format).
3. Apply Filters:
o Objective: Use software or scripts to apply the defined criteria and remove or flag
irrelevant data.
o Methods:

▪ Manual Filtering: Using built-in features in spreadsheets like Excel or Google Sheets.
▪ Automated Filtering: Using programming languages like Python with libraries
such as Pandas.
4. Validate Results:
o Objective: Ensure that the filtering process did not inadvertently remove useful data.
o Activities:
▪ Review Sample Data: Check a sample of the filtered data to confirm accuracy.
▪ Check Statistics: Compare summary statistics before and after filtering to
ensure data integrity.

3. Example of Data Filtering

Scenario: You have a dataset with customer information. Some rows are incomplete, missing
critical details like customer names or purchase dates.

Original Data:

Customer ID | Name        | Email             | Purchase Date
001         | John Doe    | john@example.com  | 08/22/2024
002         |             | jane@example.com  |
003         | Alice Smith | alice@example.com | 08/23/2024
004         | Bob Brown   |                   | 07/01/2024

Filtering Process:

• Criteria: Remove rows where the 'Name' or 'Purchase Date' is missing.
• Filtered Data:

Customer ID | Name        | Email             | Purchase Date
001         | John Doe    | john@example.com  | 08/22/2024
003         | Alice Smith | alice@example.com | 08/23/2024
004         | Bob Brown   |                   | 07/01/2024

Outcome: Rows with both a name and a purchase date are retained (including row 004, whose missing email does not violate the stated criteria), while the incomplete row 002 is removed. The data is now ready for further analysis.

4. Tools and Techniques for Data Filtering

Excel/Google Sheets:

• Built-in Filters: Use filter options to display only rows that meet certain criteria.
• Conditional Formatting: Highlight cells that meet specific conditions (e.g., empty cells).

Python (Pandas):

• Code Example:

python

import pandas as pd

# Load data
df = pd.read_csv('data.csv')

# Filter out rows with missing 'Name' or 'Purchase Date'
filtered_df = df.dropna(subset=['Name', 'Purchase Date'])

# Save the cleaned data
filtered_df.to_csv('filtered_data.csv', index=False)

o dropna(): Removes rows with missing values in specified columns.

Database Queries:

• SQL Example:

sql
SELECT *
FROM customers
WHERE Name IS NOT NULL AND Purchase_Date IS NOT NULL;

o WHERE Clause: Filters rows based on criteria defined in the SQL query.

5. Common Challenges

• Incomplete Data:
o Issue: Missing values in critical columns can skew results.
o Solution: Implement checks and fill missing values where possible, or remove
incomplete rows.
• Inconsistent Data:
o Issue: Data that does not adhere to expected formats or ranges can be problematic.
o Solution: Standardize data formats and apply rules to detect and correct
inconsistencies.
• Large Datasets:
o Issue: Filtering large datasets can be resource-intensive.
o Solution: Use efficient algorithms and tools that can handle large volumes of data.

6. Best Practices

• Set Clear Criteria:


o Action: Define specific rules for what constitutes irrelevant or incorrect data.
o Example: Decide on acceptable ranges, formats, and completeness for each column.
• Document Filtering Steps:
o Action: Keep detailed records of the filtering process and criteria used.
o Purpose: Ensures reproducibility and transparency in the data cleaning process.
• Validate Filtered Data:

o Action: Review the filtered data to ensure that useful information has not been
inadvertently removed.
o Purpose: Maintain data integrity and reliability.

Summary

Data filtering is a critical step in data preparation that involves removing or correcting data that
does not meet certain criteria. It helps ensure that the data used for analysis is accurate, relevant,
and reliable. By following systematic processes and using appropriate tools, you can effectively
clean and prepare your data for insightful analysis.

Value Replacement
1. What Is Value Replacement?

Definition: Value replacement is the process of handling missing, non-standard, or incorrect
entries in a dataset by substituting them with appropriate values or indicators. This ensures that
the dataset is complete and consistent, making it suitable for analysis.

Purpose:

• Complete Data: Fills gaps where data is missing to avoid incomplete datasets.
• Standardize Entries: Converts non-standard values into a uniform format.
• Facilitate Analysis: Ensures that all data entries are usable and fit for analysis.

2. Why Value Replacement Is Important

Completeness:

• Why It Matters: Missing values can hinder analysis, leading to incomplete insights or biased
results.
• Example: If customer email addresses are missing, it’s challenging to perform email marketing
analysis.

Consistency:

• Why It Matters: Inconsistent entries can lead to errors or confusion during analysis.
• Example: Having “N/A”, “none”, and empty cells for missing email addresses can complicate
data processing.

Accuracy:

• Why It Matters: Incorrect or incomplete data can result in misleading conclusions.


• Example: If a dataset used for predicting sales has missing sales figures, any analysis or model
built on this data will be flawed.

Data Integrity:

• Why It Matters: Ensuring that all entries are standardized and complete maintains the
reliability of the dataset.
• Example: Replacing “unknown” with a standardized missing value helps maintain data
integrity.

3. How Value Replacement Works

Step-by-Step Process:

1. Identify Missing or Non-Standard Values:


o Objective: Locate entries that are missing or don’t follow the expected format.
o Techniques:
▪ Data Profiling: Analyze the dataset to identify missing values and non-
standard entries.
▪ Visual Inspection: Use data visualization tools to spot anomalies or gaps.
2. Define Replacement Strategy:
o Objective: Decide on the best approach to replace missing or non-standard values.
o Strategies:
▪ Placeholder Values: Use a fixed value to indicate missing data.
▪ Example: Use “no-email@domain.com” for missing email addresses.
▪ Statistical Methods: Estimate missing values based on statistical properties
of the dataset.
▪ Example: Replace missing values with the mean or median of the
column.
▪ Predictive Models: Use machine learning models to predict and fill in missing
values based on other features.
▪ Example: Predict missing sales figures using historical data.
3. Apply Value Replacement:
o Objective: Implement the replacement strategy using appropriate tools or methods.
o Methods:
▪ Manual Replacement: For small datasets, manually enter replacement
values.
▪ Automated Replacement: Use scripts or data cleaning tools to apply
replacements programmatically.
4. Verify and Validate:
o Objective: Ensure that the replacements are correct and do not introduce new errors.
o Activities:
▪ Review Samples: Check a subset of replaced values to confirm accuracy.
▪ Update Documentation: Record changes made during the replacement
process for future reference.

4. Tools and Techniques for Value Replacement

Excel/Google Sheets:

• Find & Replace:


o Feature: Allows you to search for specific values and replace them with new ones.
o Use Case: Quickly replace placeholder text in small datasets.
• Formulas:
o Example: =IF(ISBLANK(A2), "no-email@domain.com", A2)
o Use Case: Automatically fill missing values in a column based on conditions.

Python (Pandas):

• Code Example:

import pandas as pd

# Load data
df = pd.read_csv('data.csv')

# Replace missing values in the 'Email' column with a placeholder
df['Email'] = df['Email'].fillna('no-email@domain.com')

# Replace missing values in the 'Age' column with the median age
median_age = df['Age'].median()
df['Age'] = df['Age'].fillna(median_age)

# Save the updated data
df.to_csv('updated_data.csv', index=False)

o fillna(): Fills missing values with a specified value or method.

Database Queries:

• SQL Example:

sql
UPDATE customers
SET Email = 'no-email@domain.com'
WHERE Email IS NULL;

o UPDATE Statement: Sets a default value for missing email addresses in the database.

5. Common Challenges

Choosing the Right Replacement:

• Challenge: Selecting an appropriate replacement can be difficult, especially when the missing
data is crucial.
• Solution: Use domain knowledge to select values that make sense in the context of your data.

Impact on Analysis:

• Challenge: Replacement values may affect the results of analysis, leading to potential bias.
• Solution: Consider using multiple methods or sensitivity analysis to understand the impact of
replacements.

Data Integrity:

• Challenge: Replacements can sometimes introduce bias or distort the data if not done
carefully.
• Solution: Validate replacements and document changes to maintain transparency.


6. Best Practices

Define Clear Replacement Criteria:

• Action: Establish rules for what values to use as replacements.


• Example: For missing email addresses, use a placeholder; for numerical values, use the
median or mean.

Document Replacement Process:

• Action: Keep detailed records of what values were replaced and why.
• Purpose: Ensures transparency and allows for reproducibility.

Verify Accuracy:

• Action: Regularly review and verify the replaced values to ensure they are appropriate.
• Purpose: Maintain the integrity and quality of the dataset.

Summary

Value replacement is essential for maintaining data quality by addressing missing or non-
standard entries. By carefully selecting replacement values and following best practices, you
can ensure that your dataset remains complete and reliable for analysis.

Data Transformation
o What It Is: Changing the structure of data to make it more useful for analysis.
o How It Works: This might involve combining, splitting, or restructuring columns in
your dataset.
▪ Example: If you have a column with full names (e.g., “John Smith”), you
might split it into two separate columns: “First Name” and “Last Name.”
Conversely, if you have separate columns for “Street,” “City,” and “Zip Code,”
you might combine them into a single “Address” column.
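
A short pandas sketch of both transformations, assuming Python with pandas; the column names and sample values are hypothetical:

python

import pandas as pd

df = pd.DataFrame({
    'Full Name': ['John Smith', 'Jane Doe'],
    'Street': ['1 Main St', '9 Oak Ave'],
    'City': ['Pune', 'Mumbai'],
    'Zip Code': ['411001', '400001'],
})

# Split one 'Full Name' column into 'First Name' and 'Last Name'
df[['First Name', 'Last Name']] = df['Full Name'].str.split(' ', n=1, expand=True)

# Combine 'Street', 'City', and 'Zip Code' into a single 'Address' column
df['Address'] = df[['Street', 'City', 'Zip Code']].agg(', '.join, axis=1)

print(df[['First Name', 'Last Name', 'Address']])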

Skills Required

1. Scripting Tools:
o Python: Python is commonly used for data scrubbing, with libraries like Pandas that
offer powerful tools for data manipulation and cleaning.
▪ Example: You can use Pandas to read CSV files, replace missing values, and
merge datasets (df.fillna() for missing values, pd.concat() for combining
datasets).
o R: R is another popular tool for data cleaning, using packages like dplyr to filter and
transform data.
▪ Example: dplyr functions like filter() to remove unwanted rows and mutate() to
modify columns.
2. Big Data Tools:
o Hadoop/MapReduce: For very large datasets, you might use Hadoop’s MapReduce
framework to distribute data cleaning tasks across many machines.
o Spark: Apache Spark provides fast, scalable data processing and cleaning capabilities
through its DataFrame API.

Example Scenario

Imagine you have several CSV files containing customer records with varying formats:

1. Combine Data: You merge all the CSV files into a single dataset, ensuring that column names
(like “Email” and “Email Address”) are consistent across the merged file.
2. Handle Missing Values: If some records have missing email addresses, you replace these with
a placeholder value like “no-email@domain.com.”
3. Standardize Dates: If dates are in different formats (e.g., "MM/DD/YYYY" vs. "YYYY-MM-
DD"), you convert all dates into a standard format for consistency.

3. Explore Data

Definition: Exploratory Data Analysis (EDA)

Exploratory Data Analysis (EDA) is the process of examining a dataset to understand its
characteristics and discover patterns or insights before performing more detailed statistical
analysis. It helps you get a feel for the data, identify trends, and spot anomalies.

Key Activities

1. Inspect Data

Introduction to Inspecting Data

Inspecting data is the crucial initial step in data analysis. It involves examining a dataset to
understand its structure, content, and quality. This process helps in determining how data is
organized, identifying any issues, and preparing it for subsequent analysis. Effective data
inspection is essential because it lays the groundwork for accurate and insightful data analysis.

Purpose of Inspecting Data

The primary purpose of inspecting data is to gain a comprehensive understanding of what the
dataset contains. This involves:

1. Determining Data Types: Identifying what kinds of data are present, such as
numerical, categorical, ordinal, or nominal. This helps in selecting appropriate methods
for further analysis.

2. Understanding Organization: Reviewing how the data is structured, including the
arrangement of rows and columns, and understanding how data points relate to each other.
3. Spotting Anomalies: Detecting any issues such as missing values, duplicates, or
inconsistencies that could impact the quality and reliability of the analysis.

How Data Inspection Works

Data inspection involves a few key steps:

1. Examine Data Types: Determine whether the data is numerical (quantitative),
categorical (grouped by categories), ordinal (ordered categories), or nominal
(unordered categories). This categorization informs how the data should be processed
and analyzed.
2. Understand Organization: Look at the dataset's structure, including how data is
organized into columns and rows. Check for the data types of each column and how
they are organized.
3. Spot Anomalies: Identify any discrepancies or issues within the data, such as
inconsistent formats, missing values, or duplicate entries. Addressing these issues early
is crucial for maintaining data quality.

Data Types

1. Numerical Data: Represents quantities and can be used for calculations. Examples
include sales figures and temperatures. Numerical data is essential for performing
statistical analyses and creating quantitative models.
2. Categorical Data: Represents distinct categories or groups and is used for
classification purposes. Examples include product types and customer feedback ratings.
This data type helps in grouping and summarizing data.
3. Ordinal Data: A subset of categorical data where categories have a meaningful order.
Examples are survey ratings such as "Poor," "Fair," "Good," and "Excellent." While
ordinal data can be ranked, the intervals between ranks are not necessarily equal.
4. Nominal Data: Categorical data without any inherent order or ranking. Examples
include colors and names. Nominal data is used for labeling and classification without
implying any ranking.

Example of Inspecting Data

Consider a dataset of customer feedback:

• Numerical Data: Ratings given on a scale of 1 to 10.


• Categorical Data: Types of feedback such as Compliment, Complaint, or Suggestion.
• Ordinal Data: Satisfaction levels like Dissatisfied, Neutral, and Satisfied.

By analyzing these data types, you can understand how customers feel about your services and
identify patterns or issues.

Analyzing Data Structure

When analyzing the structure of data, focus on:

• Columns and Rows: Columns represent different variables or features (e.g., Customer
Name, Feedback Type), while rows represent individual records or observations (e.g.,
a single customer's feedback entry).
• Data Formats: Understand the different formats present, such as text for names and
descriptions, numbers for quantitative measures, and dates for temporal information.

Identifying Potential Issues

Common issues to look out for include:

• Missing Values: Entries that are empty or null, which need to be addressed to avoid
gaps in analysis.
• Duplicate Entries: Repeated records that can skew results and need to be removed.
• Inconsistent Data Formats: Variations in data formatting that should be standardized
to ensure uniformity.

Addressing these issues early helps in maintaining the integrity of the data and improving the
accuracy of the analysis.

Tools for Data Inspection

Several tools can assist in data inspection:

• Excel: Useful for basic data inspection and manipulation tasks.


• Pandas (Python): Provides advanced functions for data manipulation and inspection.
• R: Equipped with various statistical tools and functions for detailed data analysis and
inspection.
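
A brief pandas sketch of a first inspection pass, assuming Python with pandas and a hypothetical feedback.csv file:

python

import pandas as pd

df = pd.read_csv('feedback.csv')

df.info()                          # column names, data types, non-null counts
print(df.describe(include='all'))  # summary statistics for every column
print(df.isna().sum())             # missing values per column
print(df.duplicated().sum())       # count of exact duplicate rows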

Descriptive Statistics
Introduction to Descriptive Statistics

Descriptive statistics is a branch of statistics that focuses on summarizing and describing the
main features of a dataset. The goal is to provide a concise overview of the data, making it
easier to understand and analyze. This involves calculating various summary statistics that
reveal central tendencies, dispersion, and distribution characteristics.

Key Points:

• Summary of Data: Provides a snapshot of the data’s essential aspects.


• Central Tendency and Variability: Descriptive statistics help us understand where
most data points lie (central tendency) and how spread out the data points are
(variability).

Purpose of Descriptive Statistics

Descriptive statistics serve several purposes:

1. Summarization: They summarize large datasets into simple metrics that highlight key
aspects of the data.
2. Understanding Central Tendency: They help identify where most of the data values
fall, indicating the typical value.
3. Assessing Variability: They measure how much the data values differ from one
another, providing insights into the spread of the data.
4. Facilitating Comparisons: They allow for comparisons between different datasets or
groups by providing a common set of summary measures.

Key Statistics Overview

Descriptive statistics include various key metrics:

1. Central Tendency: Measures like mean, median, and mode help us understand where
most data points are concentrated.
2. Variability: Measures like standard deviation and variance reveal how data points are
dispersed around the central value.

Mean

Definition: The mean, or average, is calculated by summing all the data values and dividing
by the number of values. It provides a measure of the central location of the data.

Median

Definition: The median is the middle value of a dataset when it is ordered from smallest to
largest. It divides the dataset into two equal halves.

Calculation:

• Odd Number of Values: The median is the middle value.


• Even Number of Values: The median is the average of the two middle values.

Purpose: The median is less affected by outliers and skewed data than the mean, providing a
better measure of central tendency in such cases.

Mode

Definition: The mode is the value that occurs most frequently in the dataset. A dataset may
have one mode, more than one mode, or no mode if no value repeats.

Calculation: Identify the value with the highest frequency.

Purpose: The mode provides insight into the most common or popular value in the dataset.

Standard Deviation

Definition: The standard deviation measures the average distance of each data point from the
mean. It quantifies the spread of the data around the mean.

Example: Sales Revenue Analysis

Consider analyzing a dataset of sales revenue:

• Mean: $287.50
• Median: $275
• Mode: $300 (most frequent value)
• Standard Deviation: Approximately $75.90

This example demonstrates how these descriptive statistics provide insights into the central
tendency and variability of sales revenue, helping to summarize and understand the dataset
effectively.
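
These summary statistics are straightforward to compute in code. A sketch assuming Python with pandas; the revenue values here are hypothetical, so the outputs will differ from the figures above:

python

import pandas as pd

revenue = pd.Series([200, 250, 275, 300, 300, 400])  # hypothetical revenue values

print('Mean:  ', revenue.mean())
print('Median:', revenue.median())
print('Mode:  ', revenue.mode().tolist())  # may contain more than one value
print('Std:   ', revenue.std())            # sample standard deviation (ddof=1)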

Data Visualization
Overview of Data Visualization

Data visualization is the practice of creating visual representations of data to uncover patterns,
trends, and relationships. The goal is to transform complex data into a format that is easier to
understand and interpret, enabling users to make informed decisions based on visual insights.

How Data Visualization Works

Data visualization simplifies the analysis of data by using charts and graphs to illustrate:

• Patterns: Recurring structures or regularities in the data.
• Trends: Changes or movements in data over time.
• Relationships: Connections or correlations between different variables.

This approach helps in making complex data more accessible and comprehensible, allowing
for better analysis and decision-making.

Common Visualizations

1. Histograms:
o Purpose: To show the distribution of a single variable.
o Example: Visualizing customer ratings on a scale of 1 to 5.
o How It Works: Bars represent the frequency of ratings within specified intervals,
helping to understand how often each rating was given.
2. Scatter Plots:
o Purpose: To show the relationship between two numerical variables.
o Example: Analyzing the correlation between customer spending and satisfaction
ratings.
o How It Works: Data points are plotted on a two-dimensional axis, revealing patterns
or correlations between the variables.
3. Box Plots:
o Purpose: To display the spread and identify outliers in the data.
o Example: Comparing satisfaction ratings across different service categories like
Delivery, Product Quality, and Customer Support.

o How It Works: Box plots show the median, quartiles, and outliers, providing insights
into data distribution and variability.

Example: Customer Feedback Analysis

Histograms

• Scenario: You have data on customer ratings for your e-commerce service.
• Visualization: Create a histogram to see how often each rating (from 1 to 5) was given by
customers.
• Insight: The histogram helps you understand the overall distribution of customer satisfaction.
For instance, if most ratings are concentrated at the high end (4 or 5), it indicates general
satisfaction. Conversely, a concentration of lower ratings (1 or 2) suggests dissatisfaction and
areas for improvement.

Scatter Plots

• Scenario: You want to explore if there's a relationship between how much customers spend
and their satisfaction ratings.
• Visualization: Plot customer spending on the x-axis and their satisfaction ratings on the y-axis.
• Insight: The scatter plot reveals whether higher spending correlates with higher satisfaction.
For example, if data points trend upwards, it suggests a positive relationship where higher
spending is associated with better ratings. If there’s no clear pattern, spending may not be a
significant factor in satisfaction.

Box Plots

• Scenario: Compare customer satisfaction ratings across different service categories such as
Delivery, Product Quality, and Customer Support.
• Visualization: Use box plots to show the distribution of ratings within each category.
• Insight: Box plots illustrate the median satisfaction level, the range of ratings, and any
potential outliers. For example, if Delivery has a wide range and many outliers, it indicates
significant variability and potential issues with delivery service.
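
All three chart types can be produced in a few lines. A sketch assuming Python with pandas, matplotlib, and seaborn; the feedback data below is hypothetical:

python

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Hypothetical customer feedback data
df = pd.DataFrame({
    'rating': [5, 4, 4, 3, 5, 2, 4, 5, 1, 4],
    'spend': [120, 80, 95, 40, 150, 20, 70, 130, 15, 90],
    'category': ['Delivery', 'Product', 'Support', 'Delivery', 'Product',
                 'Support', 'Delivery', 'Product', 'Support', 'Delivery'],
})

fig, axes = plt.subplots(1, 3, figsize=(12, 4))
sns.histplot(df['rating'], bins=5, ax=axes[0])               # rating distribution
sns.scatterplot(data=df, x='spend', y='rating', ax=axes[1])  # spend vs. satisfaction
sns.boxplot(data=df, x='category', y='rating', ax=axes[2])   # spread per category
plt.tight_layout()
plt.show()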

Skills Required

1. Visualization Tools:
o Python Libraries:
▪ Matplotlib: A basic plotting library for creating charts and graphs.
▪ Seaborn: Built on top of Matplotlib, it offers more advanced visualizations
and easier-to-create plots.
o R Packages:
▪ ggplot2: A powerful package for creating a wide variety of visualizations with
detailed customization options.
2. Statistical Skills:
o Understanding Statistics: Knowledge of how to compute and interpret descriptive
statistics to summarize and analyze data.

Example Scenario

Suppose you have a dataset with sales data, including advertising spend and revenue:

1. Inspect Data:
o Determine that you have numerical data for advertising spend and revenue, and
categorical data for product categories.
2. Descriptive Statistics:
o Calculate the mean and median of advertising spend and sales revenue to understand
average values.
o Compute the standard deviation to gauge the variability in spending and revenue.
3. Data Visualization:
o Create a scatter plot to explore the relationship between advertising spend and sales
revenue.
o Generate a histogram to show the distribution of sales revenue across different product
categories.

4. Model Data

Definition: Applying Statistical Models and Machine Learning Algorithms

In this stage of a data science project, statistical models and machine learning algorithms
are used to analyze data with the goal of making predictions, classifying data into categories,
or uncovering hidden patterns. This step involves using mathematical and computational
techniques to understand the data and derive actionable insights.

Key Activities
Overview of Feature Selection

Feature Selection is a crucial step in the data preprocessing phase of building machine learning
models. It involves identifying and selecting the most relevant variables (features) from your
dataset to build a model that is both accurate and efficient.

What is Feature Selection?

Feature selection is the process of choosing the most important variables from your dataset that
contribute significantly to the predictive power of the model. By focusing on these key
variables, you can enhance the model’s performance, reduce its complexity, and improve its
interpretability.

Purpose: Selecting the most relevant features helps in building a model that is not only more
accurate but also less prone to overfitting. This means the model will perform better on new,
unseen data. Additionally, it reduces the computational resources needed for training and
prediction.

Why Feature Selection Matters

Improves Model Performance: By removing irrelevant or redundant features, the model can
focus on the variables that have the most predictive power. This leads to better performance
metrics, such as accuracy and precision.

Reduces Overfitting: A model with too many features might learn noise or irrelevant patterns
in the training data, which can lead to overfitting. Feature selection helps in creating a simpler
model that generalizes better to new data.

Enhances Computational Efficiency: Fewer features mean less data to process, which
reduces the computational resources and time required for training and making predictions.

Simplifies Model Interpretation: A model with fewer features is easier to understand and
interpret. This helps in understanding how different features affect the predictions.

Methods of Feature Selection

Filter Methods: These methods evaluate features independently of the model. They use
statistical tests and metrics to select features based on their relevance to the target variable.
Examples include the Chi-square test for categorical data and correlation coefficients for
numerical data.

Wrapper Methods: These methods evaluate subsets of features by actually training and
testing the model with these subsets. Examples include Recursive Feature Elimination (RFE),
which iteratively removes the least important features based on model performance.

Embedded Methods: These methods perform feature selection as part of the model training
process. They integrate feature selection into the model-building process itself. Examples
include Lasso Regression, which uses regularization to penalize less important features, and
Decision Trees, which inherently rank features by importance.
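
As one possible illustration of a filter method and a wrapper method side by side, a sketch assuming Python with scikit-learn and NumPy; the dataset is randomly generated:

python

import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression

# Synthetic dataset: 10 features, only 3 of them truly informative
X, y = make_classification(n_samples=500, n_features=10,
                           n_informative=3, random_state=0)

# Filter method: score features independently (ANOVA F-test), keep the top 3
filter_sel = SelectKBest(score_func=f_classif, k=3).fit(X, y)
print('Filter keeps features:', np.flatnonzero(filter_sel.get_support()))

# Wrapper method: recursively eliminate the weakest features using a model
rfe = RFE(estimator=LogisticRegression(max_iter=1000),
          n_features_to_select=3).fit(X, y)
print('RFE keeps features:   ', np.flatnonzero(rfe.support_))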

Example: Predicting Customer Churn

In a scenario where you’re building a model to predict customer churn (whether a customer
will leave or stay), feature selection helps identify which features are most useful for predicting
churn.

Relevant Features: Features such as customer activity patterns (e.g., frequency of purchases)
and demographic information (e.g., age, location) are likely important for predicting churn.
These features provide insights into customer behavior and profile.

Irrelevant Features: Features like customer ID are unique to each customer but do not provide
any meaningful information for predicting churn. Therefore, such features are excluded from
the model to avoid unnecessary complexity.

Feature Selection Process

Data Preparation: Start by cleaning and preprocessing the data to ensure it’s in a suitable
format for analysis. This includes handling missing values and normalizing data if necessary.

Feature Evaluation: Apply various methods to evaluate the importance of each feature. This
might involve statistical tests, model-based techniques, or a combination of approaches.

Model Building: Use the selected features to build your model and evaluate its performance.
Compare different sets of features to determine which combination yields the best results.

Iteration: Feature selection is an iterative process. Based on the model’s performance, you
may need to refine the feature set by adding or removing features and re-evaluating the model.

Tools and Techniques

Statistical Tests: Tools like correlation coefficients and Chi-square tests can help in assessing
the relationship between features and the target variable.

Algorithms: Techniques such as Recursive Feature Elimination (RFE) and Lasso Regression
are useful for selecting important features based on model performance and regularization.

Libraries and Frameworks: In Python, libraries like Scikit-learn and Statsmodels provide
tools for feature selection. In R, packages like Caret and Boruta can be used for similar
purposes.

Model Training
o What It Is: Building and training statistical or machine learning models using your
data.
o How It Works: You use various algorithms to train the model to learn patterns from
the data.
▪ Common Algorithms:
▪ Logistic Regression: Used for classification tasks, such as predicting
whether a customer will churn (yes or no).
▪ Linear Regression: Used for predicting continuous values, such as
forecasting sales revenue.
▪ Clustering Algorithms (e.g., k-means): Used for grouping data into
clusters based on similarity, such as grouping customers by purchasing
behavior.
▪ Example: To predict whether a customer will churn, you might train a logistic
regression model using data on customer activities and demographics.
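
A minimal training sketch, assuming Python with scikit-learn; the churn data here is randomly generated as a stand-in for real activity and demographic features:

python

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for customer features; 1 = churn, 0 = stay
X, y = make_classification(n_samples=1000, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=0)

# Fit a logistic regression classifier on the training split
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print('Predicted labels for five customers:', model.predict(X_test[:5]))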

Model Evaluation
o What It Is: Assessing how well your model performs on unseen data to ensure it makes
accurate predictions or classifications.

o How It Works: You use different metrics to evaluate the model’s performance.
▪ Evaluation Metrics:
▪ For Classification:
▪ Accuracy: The percentage of correctly classified instances.
▪ Precision: The ratio of true positive predictions to the total
predicted positives.
▪ Recall: The ratio of true positive predictions to the total actual
positives.
▪ F1 Score: The harmonic mean of precision and recall, useful
for imbalanced datasets.
▪ For Regression:
▪ R² (R-squared): The proportion of variance in the dependent
variable that is predictable from the independent variables.
▪ MAE (Mean Absolute Error): The average of the absolute
errors between predicted and actual values.
▪ RMSE (Root Mean Square Error): The square root of the
average of squared errors, which gives more weight to larger
errors.
▪ Example: For a churn prediction model, you might use accuracy and F1 score
to evaluate how well the model identifies customers who will churn.
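
Continuing the training sketch above (model, X_test, and y_test come from that snippet), the classification metrics can be computed with scikit-learn; regression tasks would use r2_score, mean_absolute_error, and mean_squared_error instead:

python

from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

y_pred = model.predict(X_test)  # model and X_test from the training sketch

print('Accuracy :', accuracy_score(y_test, y_pred))
print('Precision:', precision_score(y_test, y_pred))
print('Recall   :', recall_score(y_test, y_pred))
print('F1 score :', f1_score(y_test, y_pred))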

Skills Required

1. Machine Learning:
o Proficiency: Knowledge of different machine learning algorithms and their
applications.
o Tools: Familiarity with libraries and frameworks that support machine learning.
▪ Python Libraries: Scikit-learn is commonly used for implementing and
evaluating machine learning algorithms.
▪ R Packages: CARET is used for creating and evaluating machine learning
models.
2. Model Evaluation:
o Understanding Metrics: Ability to calculate and interpret metrics to understand model
performance and make improvements.

Example Scenario

Let’s say you are working on a churn prediction project for a company that wants to reduce
customer attrition:

1. Feature Selection:
o Identify key features such as customer purchase frequency, customer service
interactions, and demographic details.
2. Model Training:
o Train a logistic regression model using these features to predict whether each customer
is likely to churn (leave the company) or not.
3. Model Evaluation:
o Evaluate the model using metrics like accuracy (how often the model is correct) and
F1 score (how well the model balances precision and recall), ensuring it performs well
in predicting customer churn.

5. Interpret Data

Definition: Presenting Results and Insights

The final step in a data science project involves presenting the results and insights in a way
that effectively communicates findings to stakeholders and provides actionable
recommendations. This step ensures that the insights gained from the data are clearly
understood and can be used to make informed decisions.

Key Activities

1. Present Findings
o What It Is: Creating reports and presentations to share the results of your analysis with
stakeholders.
o How It Works: Use clear and concise language to explain your findings, supported by
visual aids like charts and graphs.
▪ Reports: Detailed documents that include data analysis, visualizations, and
interpretations. These are often used for formal documentation and in-depth
review.
▪ Presentations: Slide decks or other visual formats used to summarize key
findings and discuss them in meetings.
▪ Example: Prepare a report summarizing the results of a customer churn
analysis, including charts that show trends and a summary of key factors
affecting customer retention.
2. Data Visualization
o What It Is: Using charts, graphs, and dashboards to make the results of your analysis
more accessible and understandable.
o How It Works: Visualizations help to highlight important patterns, trends, and insights
from the data.
▪ Types of Visualizations:
▪ Charts and Graphs: Bar charts, line graphs, pie charts, and scatter
plots can illustrate different aspects of your data.
▪ Dashboards: Interactive tools that provide an overview of key metrics
and allow users to explore data dynamically.

▪ Example: Create a dashboard that displays customer segments and their
purchasing behaviors, making it easy for marketing teams to identify target groups.
3. Actionable Insights
o What It Is: Translating data findings into practical recommendations that can be used
to make business decisions.
o How It Works: Connect the results of your analysis to specific actions or strategies
that can address business challenges or seize opportunities.
▪ Example: If your analysis shows that a particular customer segment is highly
engaged but underserved, you might recommend targeted marketing
campaigns or product offers to this group.

Skills Required

1. Business Acumen:
o Understanding Context: Knowing the business context helps frame your findings in
a way that addresses relevant questions and informs decisions.
o Example: If analyzing sales data for a retail company, understanding the company’s
market and objectives allows you to provide insights that are directly applicable to their
goals.
2. Communication Skills:
o Presenting Clearly: The ability to explain complex results in simple terms for non-
technical audiences.
o Data Visualization Tools: Proficiency in tools that create effective visualizations,
such as Excel, Tableau, or Power BI.
o Effective Storytelling: Crafting a narrative that connects your findings to business
needs and demonstrates the value of your insights.

Example Scenario

Let’s say you conducted a customer segmentation analysis for a marketing campaign:

1. Present Findings:
o Create a detailed report and a presentation that outlines the different customer segments
you identified. Include visualizations like pie charts to show the proportion of each
segment and bar graphs to illustrate their spending patterns.
2. Data Visualization:
o Develop a dashboard that allows stakeholders to interact with the data. Include filters
to view different customer segments and charts that show key metrics like average
purchase frequency and engagement levels.
3. Actionable Insights:
o Based on your analysis, suggest targeted marketing strategies for each customer
segment. For example, recommend personalized email campaigns for high-value
customers and special promotions for frequent buyers to increase engagement and
sales.

Conclusion: The five steps of data science—obtaining data, scrubbing data, exploring data,
modeling data, and interpreting data—form a comprehensive framework for transforming raw
data into actionable insights. Mastering these steps and associated skills ensures effective data
analysis and supports informed decision-making.
