KEMBAR78
Data Fundamentals | PDF | Statistics | Analytics
0% found this document useful (0 votes)
9 views21 pages

Data Fundamentals

The document provides a comprehensive overview of data fundamentals, including definitions and examples of various data types, data processing techniques, and data analysis methods. It covers topics such as data cleaning, visualization, business intelligence tools, SQL terms, and Python for data analysis. Additionally, it introduces predictive analytics and the data lifecycle, emphasizing the importance of managing data effectively for analysis and decision-making.

Uploaded by

himanshii7802
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
9 views21 pages

Data Fundamentals

The document provides a comprehensive overview of data fundamentals, including definitions and examples of various data types, data processing techniques, and data analysis methods. It covers topics such as data cleaning, visualization, business intelligence tools, SQL terms, and Python for data analysis. Additionally, it introduces predictive analytics and the data lifecycle, emphasizing the importance of managing data effectively for analysis and decision-making.

Uploaded by

himanshii7802
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 21

Data Fundamentals

1. Data

Raw facts, figures, or information collected from various sources. It has no meaning until it is
processed or analyzed.

2. Dataset

A collection of related data, usually presented in tabular form (rows and columns), that can be used
for analysis.
Example: A table of customer orders with columns like Order ID, Date, and Amount.

3. Structured Data

Data organized in a fixed format like rows and columns (usually in relational databases). Easy to store
and analyze.
Example: Excel sheets, SQL tables.

4. Unstructured Data

Data that doesn't have a predefined format. Harder to process and analyze.
Example: Emails, images, videos, social media posts.

5. Semi-Structured Data

Data that doesn’t fit into strict tables but has some structure using tags or markers.
Example: JSON, XML files.

6. Big Data

Large, complex datasets that can't be handled with traditional tools. Characterized by the 5 V’s:
Volume, Velocity, Variety, Veracity, and Value.
Example: Real-time social media data, sensor data from IoT.

7. Data Pipeline

A sequence of processes that move data from one system to another (often from source to storage
or analysis). It includes data collection, transformation, and loading.
Think: Like a factory assembly line for data.
8. Data Source

The origin or provider of the data. It can be a database, file, web API, sensor, survey, etc.
Example: Google Analytics, SQL Server, Excel file.

9. Data Warehouse

A centralized system where structured data from different sources is stored for reporting and
analysis. Optimized for querying and analysis.
Example: Amazon Redshift, Snowflake.

10. Data Lake

A large storage repository that holds raw data in its original format (structured + unstructured). Used
for big data and advanced analytics.
Example: AWS S3 as a data lake.

11. Data Mart

A smaller, more focused version of a data warehouse designed for a specific team or business unit
(like sales or marketing).
Think: A mini data warehouse for one department.

12. ETL (Extract, Transform, Load)

A traditional data integration process:

• Extract data from sources

• Transform it into the right format

• Load it into a data warehouse or destination


Used when data needs heavy cleaning or reformatting before loading.

13. ELT (Extract, Load, Transform)

A modern variation of ETL where:

• Data is Extracted

• Loaded into a data lake or warehouse

• Transformed afterward using the storage engine’s power


Common in cloud systems with strong processing power like Snowflake.

14. Data Lifecycle


The complete journey of data from creation to deletion, typically involving these stages:

• Data creation

• Storage

• Processing

• Analysis

• Usage

• Archival or deletion

Managing the lifecycle ensures security, efficiency, and compliance.

Data Cleaning & Wrangling:

1. Missing Values

Definition: Data points that are absent in a dataset.


Example:

Name Age City

John 25 New York

Lisa Chicago

Why fix? Can affect analysis and model accuracy.

2. Outliers

Definition: Data points significantly different from others.


Example:
A salary dataset: [30K, 32K, 33K, 35K, 500K] ← 500K is an outlier.

Why fix? They may skew the results or indicate data entry errors.

3. Data Normalization

Definition: Rescaling data to a range (usually 0 to 1).


Example:
Before: [10, 20, 30] → After normalization: [0, 0.5, 1]

Used in: Algorithms like KNN, neural networks.


4. Data Standardization

Definition: Rescaling data to have mean = 0 and standard deviation = 1.


Example:
Before: [100, 110, 130]
After: [−1.2, 0.0, 1.2] (Standardized Z-scores)

Used in: Regression, PCA, etc.

5. Data Transformation

Definition: Changing data format or values to make analysis easier.


Examples:

• Converting yes/no to 1/0

• Applying log transform to reduce skewness

6. Null Values

Definition: A special marker for missing or unknown data (e.g., NaN in Python).
Example:

Score

90

NaN

Handled by: Imputation or deletion.

7. Duplicates

Definition: Repeated rows or records in a dataset.


Example:

ID Name

1 John

1 John

Fix: Use .drop_duplicates() in Pandas.

8. Data Types (int, float, string, datetime, etc.)

Definition: Type of values in each column.


Examples:
• int: 1, 2, 3

• float: 1.5, 2.75

• string: "apple", "HR"

• datetime: 2023-01-01

Why it matters? Wrong type can cause errors in analysis.

9. Encoding (Label, One-hot)

Definition: Converting categorical values into numerical format.

• Label Encoding:
"Male" = 0, "Female" = 1

• One-Hot Encoding:
"Red", "Green", "Blue" → [1,0,0], [0,1,0], [0,0,1]

Used in: Machine learning models that require numeric input.

10. Imputation

Definition: Filling missing values with estimates.


Examples:

• Mean imputation: Fill missing age with average age

• Forward fill: Fill from previous value

• Mode imputation: Fill with most common value

11. Feature Scaling

Definition: Adjusting features to the same scale to ensure fair treatment in models.
Types:

• Normalization (Min-Max Scaling)

• Standardization (Z-score)

Example:
Height (in cm): [150, 160, 170]
Scaled: [0.0, 0.5, 1.0]
Data Analysis:

1. Exploratory Data Analysis (EDA)

Definition: A process of analyzing data sets to summarize their main characteristics using statistics
and visualizations.
Example:

• Plotting histograms to understand distribution

• Using box plots to detect outliers

• Checking null values or data types

2. Descriptive Statistics

Definition: Basic statistics that describe the features of data.


Includes: Mean, Median, Mode, Min, Max, Standard Deviation
Example:
For column Sales = [100, 200, 300]:

• Mean = 200

• Max = 300

• Std Dev = 100

3. Summary Statistics

Definition: A combined set of descriptive statistics for each column.


Example (using Pandas):

df.describe()

Returns count, mean, std, min, 25%, 50%, 75%, max for numeric columns.

4. Correlation

Definition: Measures the strength and direction of a relationship between two variables (ranges
from -1 to 1).
Example:

• Height vs Weight → +0.9 (strong positive correlation)

• Age vs Screen Time → -0.5 (negative correlation)

5. Covariance
Definition: Shows the direction of the linear relationship between two variables, but not the
strength.
Example:

• Income and Spending may have positive covariance (both rise together).

(More difficult to interpret directly than correlation)

6. Grouping

Definition: Splitting data into groups based on a column.


Example:
Group sales data by region:

df.groupby('Region')

Each region’s data is separated for analysis.

7. Aggregation

Definition: Performing calculations like sum, mean, count on groups.


Example:
Average salary by department:

df.groupby('Department')['Salary'].mean()

8. Filtering

Definition: Selecting rows that meet certain conditions.


Example:

df[df['Age'] > 30]

→ Returns only rows where Age is more than 30.

9. Sorting

Definition: Arranging data in ascending or descending order.


Example:

df.sort_values('Revenue', ascending=False)

→ Sorts by highest revenue first.


10. Pivot Tables

Definition: A tool to reorganize and summarize data using rows, columns, and aggregation.
Example:
Show total sales per product per region.

df.pivot_table(values='Sales', index='Product', columns='Region', aggfunc='sum')

11. Cross-tabulation (Crosstab)

Definition: A summary table showing the frequency distribution of variables.


Example:
Compare gender and department:

pd.crosstab(df['Gender'], df['Department'])

→ Shows count of males and females per department.

12. Time Series Analysis

Definition: Analyzing data collected over time to identify trends, seasonality, or patterns.
Example:

• Plotting daily sales for 1 year

• Detecting monthly seasonality in electricity usage

Visualization

Bar Chart

Definition: Displays data using rectangular bars.


Used for: Comparing quantities across categories.
Example: Sales by region
X-axis = Region, Y-axis = Sales

Line Graph

Definition: Connects data points with lines.


Used for: Showing trends over time.
Example: Website traffic over 12 months.
Histogram

Definition: A bar chart that shows the distribution of numerical data.


Used for: Understanding frequency of values within intervals.
Example: Distribution of student scores.

Box Plot (Box and Whisker Plot)

Definition: Shows data distribution, median, quartiles, and outliers.


Used for: Comparing distributions.
Example: Box plots of salary across departments.

Scatter Plot

Definition: Dots plotted on X and Y axes to show relationships.


Used for: Detecting correlation or patterns.
Example: Age vs Spending score of customers.

Pie Chart

Definition: Circular chart divided into slices.


Used for: Showing percentage composition.
Example: Market share by product.

Use sparingly — harder to compare than bar charts.

Heatmap

Definition: Uses color to represent values in a matrix.


Used for: Correlation matrices, frequency maps.
Example: Correlation heatmap of features in a dataset.

Treemap

Definition: Nested rectangles sized by value.


Used for: Showing part-to-whole relationships.
Example: Sales contribution by product subcategories.

Area Chart

Definition: Like a line chart, but the area under the line is filled.
Used for: Showing cumulative data trends over time.
Example: Website visits over time by traffic source.
Dashboard

Definition: A visual display of key metrics and trends, updated in real-time or regularly.
Used for: Monitoring business performance in one view.
Example: Sales dashboard showing revenue, customer count, and profit by region.

Slicers/Filters

Definition: Tools in dashboards to narrow down data dynamically.


Example: A slicer in Power BI to view sales by month or region.

Storytelling with Data

Definition: Using visualizations to guide users through insights, patterns, and conclusions.
Example: Showing how customer churn dropped after a new loyalty program using graphs and
annotations.

5. Business Intelligence (BI) Tools

Power BI

Definition: Microsoft’s BI tool for building dashboards, reports, and visual analytics.
Features: DAX, slicers, real-time dashboards, strong Excel integration.

Tableau

Definition: A visual analytics platform for interactive dashboards.


Known for: Drag-and-drop ease, powerful visuals, Story feature.

Looker

Definition: Google Cloud’s BI tool with strong data modeling (LookML).


Used for: Embedded analytics, governed data exploration.

Excel (Advanced features)

Features to know:

• PivotTable – Summarize data dynamically

• VLOOKUP/XLOOKUP – Find data from another table

• Power Query – Load & transform data


• Charts/Slicers – Create basic dashboards

KPI (Key Performance Indicator)

Definition: A measurable value that indicates how effectively a goal is being achieved.
Example:

• Conversion rate

• Customer retention

• Monthly recurring revenue

Metric

Definition: A single quantifiable measure (can support a KPI).


Example: Total website visitors, revenue, number of transactions.

Dashboard vs. Report

Feature Dashboard Report

Purpose Monitor real-time performance Detailed, often static, analysis

Format Visual + interactive Tabular, charts, text-heavy

Interactivity High (filters, slicers, drilldowns) Low to Medium

Update Real-time or scheduled Periodic or static

SQL Terms

Tables

Structured data arranged in rows and columns.


Example:
A Customers table with columns: ID, Name, Email.

Columns / Rows

• Columns: Data fields (attributes) like Name, Age.


• Rows: Each record/entry.
Example:
| ID | Name | Age |
|----|------|-----|
| 1 | John | 25 | ← Row

Primary Key / Foreign Key

• Primary Key: Uniquely identifies each row.

• Foreign Key: A column that refers to the primary key in another table.
Example:
Orders.CustomerID → foreign key linking to Customers.ID.

Joins (INNER, LEFT, RIGHT, FULL)

• INNER JOIN: Matching rows in both tables

• LEFT JOIN: All rows from left + matching from right

• RIGHT JOIN: All rows from right + matching from left

• FULL JOIN: All rows from both tables, matched where possible

Subquery

A query inside another query.


Example:

SELECT * FROM Employees

WHERE Salary > (SELECT AVG(Salary) FROM Employees)

CTE (Common Table Expression)

Temporary result set used in a query with better readability.


Example:

WITH SalesData AS (

SELECT Region, SUM(Sales) AS TotalSales

FROM Orders

GROUP BY Region

)
SELECT * FROM SalesData WHERE TotalSales > 100000;

Window Functions (ROW_NUMBER, RANK, etc.)

Perform calculations across a set of rows related to the current row.


Example:

ROW_NUMBER() OVER (PARTITION BY Department ORDER BY Salary DESC)

Aggregate Functions (SUM, COUNT, AVG, MIN, MAX)

Used for summarizing data.


Example:

SELECT AVG(Salary) FROM Employees;

GROUP BY / HAVING

• GROUP BY: Groups rows with same values

• HAVING: Filters grouped results


Example:

SELECT Department, COUNT(*)

FROM Employees

GROUP BY Department

HAVING COUNT(*) > 5;

WHERE / ORDER BY

• WHERE: Filters rows before grouping

• ORDER BY: Sorts result set


Example:

SELECT * FROM Employees WHERE Age > 30 ORDER BY Salary DESC;

Indexing

Improves query performance by quickly locating data.


Example: Creating index on EmployeeID for faster search.
Normalization (1NF, 2NF, 3NF)

Process of organizing data to reduce redundancy.

• 1NF: No repeating groups, atomic columns

• 2NF: No partial dependency

• 3NF: No transitive dependency


Goal: Store data efficiently and avoid anomalies.

7. Python for Data Analysis

Pandas (DataFrame, Series)

• DataFrame: 2D table (like Excel sheet)

• Series: 1D labeled array


Example:

df = pd.read_csv('data.csv')

df['Revenue'].mean()

NumPy

Used for numerical computations and arrays.


Example:

import numpy as np

arr = np.array([1, 2, 3])

Matplotlib / Seaborn

Libraries for data visualization.


Example (Seaborn):

sns.boxplot(data=df, x='Region', y='Sales')

Functions

Reusable blocks of code.


Example:

def add(a, b):

return a + b
List Comprehension

Short way to create lists.


Example:

squares = [x**2 for x in range(5)]

Lambda Functions

Anonymous functions.
Example:

add = lambda x, y: x + y

Map / Filter / Reduce

• map: Apply function to all items

• filter: Filter items by condition

• reduce: Combine items using function


Example:

list(map(lambda x: x*2, [1,2,3])) # Output: [2, 4, 6]

Exception Handling

Manages errors gracefully.


Example:

try:

val = 10 / 0

except ZeroDivisionError:

print("Can't divide by zero")

File Handling (CSV, Excel, JSON)

Read/write data from/to files.


Example (CSV):

df = pd.read_csv('sales.csv')
Regex (Regular Expressions)

Used to search or match patterns in strings.


Example:

import re

re.findall(r'\d+', 'Order123') # Output: ['123']

JSON Handling

Reading and writing JSON data.


Example:

import json

data = json.loads('{"name": "John"}')

print(data['name'])

Intro to Analytics & Modeling

1. Predictive Analytics

Definition: Uses historical data to make predictions about the future.


Example:

• Predicting customer churn

• Forecasting next month’s sales using past sales data

Techniques used: Regression, classification, time series forecasting

2. Descriptive Analytics

Definition: Summarizes historical data to understand what has happened.


Example:

• Total sales in last quarter

• Number of users who visited a website in June

Tools used: Dashboards, summary statistics, charts


3. Prescriptive Analytics

Definition: Recommends actions based on data to achieve desired outcomes.


Example:

• Suggesting discount percentages to maximize revenue

• Recommending the best delivery routes to reduce time

Often uses: Optimization, simulation, decision trees

4. Regression

Definition: Predicts a continuous numerical value.


Example:

• Predicting house prices based on area, number of rooms

• Salary prediction from years of experience

Common types: Linear regression, multiple regression

5. Classification

Definition: Predicts a category/label.


Example:

• Email = spam or not spam

• Customer = will buy or not buy

Algorithms: Logistic Regression, Decision Trees, Random Forest

6. Clustering

Definition: Groups similar data points together without predefined labels.


Example:

• Grouping customers based on behavior (unsupervised learning)

• Market segmentation: Cluster customers by age, spending, and location

Popular algorithm: K-Means Clustering

7. Hypothesis Testing

Definition: A statistical method to test an assumption about a population.

• p-value: Tells how likely the observed result is due to chance

o p < 0.05 → statistically significant


• t-test: Compares means of two groups

o Example: Is average spending different between male and female customers?

• chi-square test: Checks relationships between categorical variables

o Example: Is product type related to return frequency?

8. Confidence Interval

Definition: A range of values that’s likely to contain the true population parameter with a given level
of confidence (commonly 95%).
Example:

• We are 95% confident that average customer age is between 32 and 35 years

Used in: Reporting estimate reliability

9. A/B Testing

Definition: A method to compare two versions of a product or strategy to see which one performs
better.
Example:

• A group sees Website Version A

• B group sees Website Version B


→ Whichever gives higher conversion rate is chosen.

Used in: Marketing, UX design, product development

Soft Skills / Business Thinking


These are essential non-technical skills every data analyst needs to work effectively in business
environments.

Problem-solving

Definition: The ability to understand a challenge, break it down, and find data-driven solutions.
Example:

• Sales are dropping → Analyze customer segments, regions, and product trends to find
causes.

Communication (Verbal and Visual)


Definition:

• Verbal: Clearly explaining insights to technical and non-technical audiences

• Visual: Presenting insights via dashboards, charts, and reports

Example:
Explaining complex analysis to a marketing manager using a Power BI dashboard and storytelling.

Business Acumen

Definition: Understanding how the business works — strategy, goals, KPIs, and industry trends.
Example:
Knowing how profit margins, customer acquisition cost, and churn affect business decisions.

Domain Knowledge (Retail, HR, Finance, etc.)

Definition: Industry-specific understanding that helps in interpreting data accurately.


Example:

• In HR analytics, knowing what attrition means

• In retail, understanding inventory turnover

Stakeholder Management

Definition: Collaborating with team members, managers, and clients while managing expectations
and delivering insights.
Example:
Handling multiple requests from marketing and finance teams and prioritizing based on business
value.

Requirement Gathering

Definition: Collecting details about what business users need from the data or report.
Example:
Asking the client:

"Do you want sales broken down by product, region, or time period?"

Root Cause Analysis

Definition: Digging deep into data to find the actual reason behind a problem.
Example:
Sales dropped → Found product out of stock for 10 days due to supplier delay.
Storytelling

Definition: Presenting insights in a narrative format to engage and guide decision-makers.


Example:
“Customer engagement dropped after we removed feature X. Here’s the timeline, charts, and
recommendations.”

10. Data Governance / Security

This ensures data is secure, accurate, and compliant, especially when dealing with sensitive or large-
scale data.

PII (Personally Identifiable Information)

Definition: Any data that can identify an individual.


Examples:
Name, email, Aadhaar number, phone number, address

Important: Must be protected with encryption and access controls.

Data Privacy

Definition: Protecting users’ personal data and ensuring it’s collected and used ethically.
Example:
Only collecting necessary data and getting user consent.

GDPR (General Data Protection Regulation)

Definition: A European Union regulation that governs how companies collect and use personal data.
Key points:

• User consent

• Right to be forgotten

• Data breach reporting


Even Indian firms must comply if they deal with EU citizens.

Data Accuracy

Definition: Ensuring data is correct and free from errors.


Example:
A customer’s age recorded as 250 → Data accuracy issue

Data Lineage
Definition: Tracking where data originates, how it changes over time, and where it flows.
Used for: Debugging issues, audits, and transparency
Example:
Order data → cleaned in ETL → loaded to dashboard → used for weekly sales report

Data Integrity

Definition: Ensures data is complete, consistent, and trustworthy across its lifecycle.
Example:
If the same customer ID appears in two places with different names → integrity issue

You might also like