Data Fundamentals
1. Data
Raw facts, figures, or information collected from various sources. It has no meaning until it is
processed or analyzed.
2. Dataset
A collection of related data, usually presented in tabular form (rows and columns), that can be used
for analysis.
Example: A table of customer orders with columns like Order ID, Date, and Amount.
3. Structured Data
Data organized in a fixed format like rows and columns (usually in relational databases). Easy to store
and analyze.
Example: Excel sheets, SQL tables.
4. Unstructured Data
Data that doesn't have a predefined format. Harder to process and analyze.
Example: Emails, images, videos, social media posts.
5. Semi-Structured Data
Data that doesn’t fit into strict tables but has some structure using tags or markers.
Example: JSON, XML files.
6. Big Data
Large, complex datasets that can't be handled with traditional tools. Characterized by the 5 V’s:
Volume, Velocity, Variety, Veracity, and Value.
Example: Real-time social media data, sensor data from IoT.
7. Data Pipeline
A sequence of processes that move data from one system to another (often from source to storage
or analysis). It includes data collection, transformation, and loading.
Think: Like a factory assembly line for data.
8. Data Source
The origin or provider of the data. It can be a database, file, web API, sensor, survey, etc.
Example: Google Analytics, SQL Server, Excel file.
9. Data Warehouse
A centralized system where structured data from different sources is stored for reporting and
analysis. Optimized for querying and analysis.
Example: Amazon Redshift, Snowflake.
10. Data Lake
A large storage repository that holds raw data in its original format (structured + unstructured). Used
for big data and advanced analytics.
Example: AWS S3 as a data lake.
11. Data Mart
A smaller, more focused version of a data warehouse designed for a specific team or business unit
(like sales or marketing).
Think: A mini data warehouse for one department.
12. ETL (Extract, Transform, Load)
A traditional data integration process:
• Extract data from sources
• Transform it into the right format
• Load it into a data warehouse or destination
Used when data needs heavy cleaning or reformatting before loading.
13. ELT (Extract, Load, Transform)
A modern variation of ETL where:
• Data is Extracted
• Loaded into a data lake or warehouse
• Transformed afterward using the storage engine’s power
Common in cloud systems with strong processing power like Snowflake.
14. Data Lifecycle
The complete journey of data from creation to deletion, typically involving these stages:
• Data creation
• Storage
• Processing
• Analysis
• Usage
• Archival or deletion
Managing the lifecycle ensures security, efficiency, and compliance.
Data Cleaning & Wrangling:
1. Missing Values
Definition: Data points that are absent in a dataset.
Example:
Name Age City
John 25 New York
Lisa Chicago
Why fix? Can affect analysis and model accuracy.
2. Outliers
Definition: Data points significantly different from others.
Example:
A salary dataset: [30K, 32K, 33K, 35K, 500K] ← 500K is an outlier.
Why fix? They may skew the results or indicate data entry errors.
3. Data Normalization
Definition: Rescaling data to a range (usually 0 to 1).
Example:
Before: [10, 20, 30] → After normalization: [0, 0.5, 1]
Used in: Algorithms like KNN, neural networks.
4. Data Standardization
Definition: Rescaling data to have mean = 0 and standard deviation = 1.
Example:
Before: [100, 110, 130]
After: [−1.2, 0.0, 1.2] (Standardized Z-scores)
Used in: Regression, PCA, etc.
5. Data Transformation
Definition: Changing data format or values to make analysis easier.
Examples:
• Converting yes/no to 1/0
• Applying log transform to reduce skewness
6. Null Values
Definition: A special marker for missing or unknown data (e.g., NaN in Python).
Example:
Score
90
NaN
Handled by: Imputation or deletion.
7. Duplicates
Definition: Repeated rows or records in a dataset.
Example:
ID Name
1 John
1 John
Fix: Use .drop_duplicates() in Pandas.
8. Data Types (int, float, string, datetime, etc.)
Definition: Type of values in each column.
Examples:
• int: 1, 2, 3
• float: 1.5, 2.75
• string: "apple", "HR"
• datetime: 2023-01-01
Why it matters? Wrong type can cause errors in analysis.
9. Encoding (Label, One-hot)
Definition: Converting categorical values into numerical format.
• Label Encoding:
"Male" = 0, "Female" = 1
• One-Hot Encoding:
"Red", "Green", "Blue" → [1,0,0], [0,1,0], [0,0,1]
Used in: Machine learning models that require numeric input.
10. Imputation
Definition: Filling missing values with estimates.
Examples:
• Mean imputation: Fill missing age with average age
• Forward fill: Fill from previous value
• Mode imputation: Fill with most common value
11. Feature Scaling
Definition: Adjusting features to the same scale to ensure fair treatment in models.
Types:
• Normalization (Min-Max Scaling)
• Standardization (Z-score)
Example:
Height (in cm): [150, 160, 170]
Scaled: [0.0, 0.5, 1.0]
Data Analysis:
1. Exploratory Data Analysis (EDA)
Definition: A process of analyzing data sets to summarize their main characteristics using statistics
and visualizations.
Example:
• Plotting histograms to understand distribution
• Using box plots to detect outliers
• Checking null values or data types
2. Descriptive Statistics
Definition: Basic statistics that describe the features of data.
Includes: Mean, Median, Mode, Min, Max, Standard Deviation
Example:
For column Sales = [100, 200, 300]:
• Mean = 200
• Max = 300
• Std Dev = 100
3. Summary Statistics
Definition: A combined set of descriptive statistics for each column.
Example (using Pandas):
df.describe()
Returns count, mean, std, min, 25%, 50%, 75%, max for numeric columns.
4. Correlation
Definition: Measures the strength and direction of a relationship between two variables (ranges
from -1 to 1).
Example:
• Height vs Weight → +0.9 (strong positive correlation)
• Age vs Screen Time → -0.5 (negative correlation)
5. Covariance
Definition: Shows the direction of the linear relationship between two variables, but not the
strength.
Example:
• Income and Spending may have positive covariance (both rise together).
(More difficult to interpret directly than correlation)
6. Grouping
Definition: Splitting data into groups based on a column.
Example:
Group sales data by region:
df.groupby('Region')
Each region’s data is separated for analysis.
7. Aggregation
Definition: Performing calculations like sum, mean, count on groups.
Example:
Average salary by department:
df.groupby('Department')['Salary'].mean()
8. Filtering
Definition: Selecting rows that meet certain conditions.
Example:
df[df['Age'] > 30]
→ Returns only rows where Age is more than 30.
9. Sorting
Definition: Arranging data in ascending or descending order.
Example:
df.sort_values('Revenue', ascending=False)
→ Sorts by highest revenue first.
10. Pivot Tables
Definition: A tool to reorganize and summarize data using rows, columns, and aggregation.
Example:
Show total sales per product per region.
df.pivot_table(values='Sales', index='Product', columns='Region', aggfunc='sum')
11. Cross-tabulation (Crosstab)
Definition: A summary table showing the frequency distribution of variables.
Example:
Compare gender and department:
pd.crosstab(df['Gender'], df['Department'])
→ Shows count of males and females per department.
12. Time Series Analysis
Definition: Analyzing data collected over time to identify trends, seasonality, or patterns.
Example:
• Plotting daily sales for 1 year
• Detecting monthly seasonality in electricity usage
Visualization
Bar Chart
Definition: Displays data using rectangular bars.
Used for: Comparing quantities across categories.
Example: Sales by region
X-axis = Region, Y-axis = Sales
Line Graph
Definition: Connects data points with lines.
Used for: Showing trends over time.
Example: Website traffic over 12 months.
Histogram
Definition: A bar chart that shows the distribution of numerical data.
Used for: Understanding frequency of values within intervals.
Example: Distribution of student scores.
Box Plot (Box and Whisker Plot)
Definition: Shows data distribution, median, quartiles, and outliers.
Used for: Comparing distributions.
Example: Box plots of salary across departments.
Scatter Plot
Definition: Dots plotted on X and Y axes to show relationships.
Used for: Detecting correlation or patterns.
Example: Age vs Spending score of customers.
Pie Chart
Definition: Circular chart divided into slices.
Used for: Showing percentage composition.
Example: Market share by product.
Use sparingly — harder to compare than bar charts.
Heatmap
Definition: Uses color to represent values in a matrix.
Used for: Correlation matrices, frequency maps.
Example: Correlation heatmap of features in a dataset.
Treemap
Definition: Nested rectangles sized by value.
Used for: Showing part-to-whole relationships.
Example: Sales contribution by product subcategories.
Area Chart
Definition: Like a line chart, but the area under the line is filled.
Used for: Showing cumulative data trends over time.
Example: Website visits over time by traffic source.
Dashboard
Definition: A visual display of key metrics and trends, updated in real-time or regularly.
Used for: Monitoring business performance in one view.
Example: Sales dashboard showing revenue, customer count, and profit by region.
Slicers/Filters
Definition: Tools in dashboards to narrow down data dynamically.
Example: A slicer in Power BI to view sales by month or region.
Storytelling with Data
Definition: Using visualizations to guide users through insights, patterns, and conclusions.
Example: Showing how customer churn dropped after a new loyalty program using graphs and
annotations.
5. Business Intelligence (BI) Tools
Power BI
Definition: Microsoft’s BI tool for building dashboards, reports, and visual analytics.
Features: DAX, slicers, real-time dashboards, strong Excel integration.
Tableau
Definition: A visual analytics platform for interactive dashboards.
Known for: Drag-and-drop ease, powerful visuals, Story feature.
Looker
Definition: Google Cloud’s BI tool with strong data modeling (LookML).
Used for: Embedded analytics, governed data exploration.
Excel (Advanced features)
Features to know:
• PivotTable – Summarize data dynamically
• VLOOKUP/XLOOKUP – Find data from another table
• Power Query – Load & transform data
• Charts/Slicers – Create basic dashboards
KPI (Key Performance Indicator)
Definition: A measurable value that indicates how effectively a goal is being achieved.
Example:
• Conversion rate
• Customer retention
• Monthly recurring revenue
Metric
Definition: A single quantifiable measure (can support a KPI).
Example: Total website visitors, revenue, number of transactions.
Dashboard vs. Report
Feature Dashboard Report
Purpose Monitor real-time performance Detailed, often static, analysis
Format Visual + interactive Tabular, charts, text-heavy
Interactivity High (filters, slicers, drilldowns) Low to Medium
Update Real-time or scheduled Periodic or static
SQL Terms
Tables
Structured data arranged in rows and columns.
Example:
A Customers table with columns: ID, Name, Email.
Columns / Rows
• Columns: Data fields (attributes) like Name, Age.
• Rows: Each record/entry.
Example:
| ID | Name | Age |
|----|------|-----|
| 1 | John | 25 | ← Row
Primary Key / Foreign Key
• Primary Key: Uniquely identifies each row.
• Foreign Key: A column that refers to the primary key in another table.
Example:
Orders.CustomerID → foreign key linking to Customers.ID.
Joins (INNER, LEFT, RIGHT, FULL)
• INNER JOIN: Matching rows in both tables
• LEFT JOIN: All rows from left + matching from right
• RIGHT JOIN: All rows from right + matching from left
• FULL JOIN: All rows from both tables, matched where possible
Subquery
A query inside another query.
Example:
SELECT * FROM Employees
WHERE Salary > (SELECT AVG(Salary) FROM Employees)
CTE (Common Table Expression)
Temporary result set used in a query with better readability.
Example:
WITH SalesData AS (
SELECT Region, SUM(Sales) AS TotalSales
FROM Orders
GROUP BY Region
)
SELECT * FROM SalesData WHERE TotalSales > 100000;
Window Functions (ROW_NUMBER, RANK, etc.)
Perform calculations across a set of rows related to the current row.
Example:
ROW_NUMBER() OVER (PARTITION BY Department ORDER BY Salary DESC)
Aggregate Functions (SUM, COUNT, AVG, MIN, MAX)
Used for summarizing data.
Example:
SELECT AVG(Salary) FROM Employees;
GROUP BY / HAVING
• GROUP BY: Groups rows with same values
• HAVING: Filters grouped results
Example:
SELECT Department, COUNT(*)
FROM Employees
GROUP BY Department
HAVING COUNT(*) > 5;
WHERE / ORDER BY
• WHERE: Filters rows before grouping
• ORDER BY: Sorts result set
Example:
SELECT * FROM Employees WHERE Age > 30 ORDER BY Salary DESC;
Indexing
Improves query performance by quickly locating data.
Example: Creating index on EmployeeID for faster search.
Normalization (1NF, 2NF, 3NF)
Process of organizing data to reduce redundancy.
• 1NF: No repeating groups, atomic columns
• 2NF: No partial dependency
• 3NF: No transitive dependency
Goal: Store data efficiently and avoid anomalies.
7. Python for Data Analysis
Pandas (DataFrame, Series)
• DataFrame: 2D table (like Excel sheet)
• Series: 1D labeled array
Example:
df = pd.read_csv('data.csv')
df['Revenue'].mean()
NumPy
Used for numerical computations and arrays.
Example:
import numpy as np
arr = np.array([1, 2, 3])
Matplotlib / Seaborn
Libraries for data visualization.
Example (Seaborn):
sns.boxplot(data=df, x='Region', y='Sales')
Functions
Reusable blocks of code.
Example:
def add(a, b):
return a + b
List Comprehension
Short way to create lists.
Example:
squares = [x**2 for x in range(5)]
Lambda Functions
Anonymous functions.
Example:
add = lambda x, y: x + y
Map / Filter / Reduce
• map: Apply function to all items
• filter: Filter items by condition
• reduce: Combine items using function
Example:
list(map(lambda x: x*2, [1,2,3])) # Output: [2, 4, 6]
Exception Handling
Manages errors gracefully.
Example:
try:
val = 10 / 0
except ZeroDivisionError:
print("Can't divide by zero")
File Handling (CSV, Excel, JSON)
Read/write data from/to files.
Example (CSV):
df = pd.read_csv('sales.csv')
Regex (Regular Expressions)
Used to search or match patterns in strings.
Example:
import re
re.findall(r'\d+', 'Order123') # Output: ['123']
JSON Handling
Reading and writing JSON data.
Example:
import json
data = json.loads('{"name": "John"}')
print(data['name'])
Intro to Analytics & Modeling
1. Predictive Analytics
Definition: Uses historical data to make predictions about the future.
Example:
• Predicting customer churn
• Forecasting next month’s sales using past sales data
Techniques used: Regression, classification, time series forecasting
2. Descriptive Analytics
Definition: Summarizes historical data to understand what has happened.
Example:
• Total sales in last quarter
• Number of users who visited a website in June
Tools used: Dashboards, summary statistics, charts
3. Prescriptive Analytics
Definition: Recommends actions based on data to achieve desired outcomes.
Example:
• Suggesting discount percentages to maximize revenue
• Recommending the best delivery routes to reduce time
Often uses: Optimization, simulation, decision trees
4. Regression
Definition: Predicts a continuous numerical value.
Example:
• Predicting house prices based on area, number of rooms
• Salary prediction from years of experience
Common types: Linear regression, multiple regression
5. Classification
Definition: Predicts a category/label.
Example:
• Email = spam or not spam
• Customer = will buy or not buy
Algorithms: Logistic Regression, Decision Trees, Random Forest
6. Clustering
Definition: Groups similar data points together without predefined labels.
Example:
• Grouping customers based on behavior (unsupervised learning)
• Market segmentation: Cluster customers by age, spending, and location
Popular algorithm: K-Means Clustering
7. Hypothesis Testing
Definition: A statistical method to test an assumption about a population.
• p-value: Tells how likely the observed result is due to chance
o p < 0.05 → statistically significant
• t-test: Compares means of two groups
o Example: Is average spending different between male and female customers?
• chi-square test: Checks relationships between categorical variables
o Example: Is product type related to return frequency?
8. Confidence Interval
Definition: A range of values that’s likely to contain the true population parameter with a given level
of confidence (commonly 95%).
Example:
• We are 95% confident that average customer age is between 32 and 35 years
Used in: Reporting estimate reliability
9. A/B Testing
Definition: A method to compare two versions of a product or strategy to see which one performs
better.
Example:
• A group sees Website Version A
• B group sees Website Version B
→ Whichever gives higher conversion rate is chosen.
Used in: Marketing, UX design, product development
Soft Skills / Business Thinking
These are essential non-technical skills every data analyst needs to work effectively in business
environments.
Problem-solving
Definition: The ability to understand a challenge, break it down, and find data-driven solutions.
Example:
• Sales are dropping → Analyze customer segments, regions, and product trends to find
causes.
Communication (Verbal and Visual)
Definition:
• Verbal: Clearly explaining insights to technical and non-technical audiences
• Visual: Presenting insights via dashboards, charts, and reports
Example:
Explaining complex analysis to a marketing manager using a Power BI dashboard and storytelling.
Business Acumen
Definition: Understanding how the business works — strategy, goals, KPIs, and industry trends.
Example:
Knowing how profit margins, customer acquisition cost, and churn affect business decisions.
Domain Knowledge (Retail, HR, Finance, etc.)
Definition: Industry-specific understanding that helps in interpreting data accurately.
Example:
• In HR analytics, knowing what attrition means
• In retail, understanding inventory turnover
Stakeholder Management
Definition: Collaborating with team members, managers, and clients while managing expectations
and delivering insights.
Example:
Handling multiple requests from marketing and finance teams and prioritizing based on business
value.
Requirement Gathering
Definition: Collecting details about what business users need from the data or report.
Example:
Asking the client:
"Do you want sales broken down by product, region, or time period?"
Root Cause Analysis
Definition: Digging deep into data to find the actual reason behind a problem.
Example:
Sales dropped → Found product out of stock for 10 days due to supplier delay.
Storytelling
Definition: Presenting insights in a narrative format to engage and guide decision-makers.
Example:
“Customer engagement dropped after we removed feature X. Here’s the timeline, charts, and
recommendations.”
10. Data Governance / Security
This ensures data is secure, accurate, and compliant, especially when dealing with sensitive or large-
scale data.
PII (Personally Identifiable Information)
Definition: Any data that can identify an individual.
Examples:
Name, email, Aadhaar number, phone number, address
Important: Must be protected with encryption and access controls.
Data Privacy
Definition: Protecting users’ personal data and ensuring it’s collected and used ethically.
Example:
Only collecting necessary data and getting user consent.
GDPR (General Data Protection Regulation)
Definition: A European Union regulation that governs how companies collect and use personal data.
Key points:
• User consent
• Right to be forgotten
• Data breach reporting
Even Indian firms must comply if they deal with EU citizens.
Data Accuracy
Definition: Ensuring data is correct and free from errors.
Example:
A customer’s age recorded as 250 → Data accuracy issue
Data Lineage
Definition: Tracking where data originates, how it changes over time, and where it flows.
Used for: Debugging issues, audits, and transparency
Example:
Order data → cleaned in ETL → loaded to dashboard → used for weekly sales report
Data Integrity
Definition: Ensures data is complete, consistent, and trustworthy across its lifecycle.
Example:
If the same customer ID appears in two places with different names → integrity issue