Unit 1 1 | PDF | Infographics | Principal Component Analysis

Unit 1 1

Uploaded by

Manasa Bogam

UNIT-I

INTRODUCTION AND DATA FOUNDATION


Introduction:
Data visualization is the graphical representation of information and data.

By using visual elements like charts, graphs, and maps, data visualization tools provide an accessible way
to see and understand trends, outliers, and patterns in data.

Importance of Data Visualization:

The importance of data visualization is simple: it helps people see, interact with, and better understand
data. Whether simple or complex, the right visualization can put everyone on the same page, regardless
of their level of expertise.

1. Simplifies Complex Data: Transforms large datasets into a more understandable format.

2. Reveals Patterns and Trends: Helps in identifying trends, correlations, and outliers.

3. Enhances Data Analysis: Makes data analysis more efficient and insightful.

4. Improves Decision Making: Facilitates quicker and better decision-making.

5. Communication Tool: Aids in communicating data-driven insights clearly and effectively.

Types of Data Visualizations

 Chart: Information presented in a tabular, graphical form with data displayed along two axes. Can
be in the form of a graph, diagram, or map.

 Table: A set of figures displayed in rows and columns.


 Graph: A diagram of points, lines, segments, curves, or areas that represents certain variables in
comparison to each other, usually along two axes at a right angle.

 Geospatial: A visualization that shows data in map form using different shapes and colors to show
the relationship between pieces of data and specific locations.

1. Charts:

Examples:

o Bar Charts: Compare quantities across categories.

o Line Charts: Show trends over time.

o Pie Charts: Represent proportions within a whole.

o Histogram: Display the distribution of a dataset.

o Scatter Plots: Show relationships between two variables.

2. Graphs:

Examples:

o Network Graphs: Display relationships and interconnections.

o Flowcharts: Illustrate processes or workflows.

o Tree Maps: Represent hierarchical data.

3. Maps:

o Heat Maps: Show data density on a geographical map.

o Choropleth Maps: Use color gradients to represent data values across geographical
regions.

Tools and Software for Data Visualization

 Microsoft Excel: Basic charts and graphs.

 Tableau: Advanced interactive visualizations.

 Power BI: Integrates with various data sources for interactive reports.

 D3.js: JavaScript library for producing dynamic, interactive data visualizations in web browsers.

 Google Data Studio (now Looker Studio): Free tool for creating dashboards and reports.

Relationship between Data Visualization and Other Fields
Data visualization is a multidisciplinary field that intersects with various other domains, enhancing the way
information is interpreted and communicated. Here’s a look at how data visualization interacts with and
benefits different fields:

1. Statistics

Relationship:

 Enhancement of Statistical Analysis: Data visualization tools are used to illustrate statistical
findings, making complex data more accessible and understandable.

 Exploratory Data Analysis (EDA): Visual techniques help in identifying patterns, trends, and
outliers in data, which are crucial for statistical analysis.

2. Computer Science

Relationship:

 Algorithms and Programming: Data visualization requires efficient algorithms to process and
render data effectively.

 Human-Computer Interaction (HCI): Focuses on designing user-friendly visualization tools that
facilitate interaction and understanding.

3. Business Intelligence (BI)

Relationship:

 Decision Support Systems: Data visualization is a key component of BI tools, helping businesses
make data-driven decisions.

 Performance Metrics: Visual dashboards display key performance indicators (KPIs) and metrics in
an easily digestible format.

4. Healthcare

Relationship:

 Medical Imaging: Visualization techniques are used to interpret complex medical images (e.g.,
MRIs, CT scans).

 Epidemiology: Visualizing data helps track the spread of diseases and the effectiveness of
interventions.

5. Environmental Science

Relationship:

 Climate Data Analysis: Visualization helps in understanding and communicating climate change
data and environmental impacts.

 Geospatial Analysis: Maps and geographic visualizations are used to study environmental
phenomena and resource distribution.

6. Finance

Relationship:

 Market Analysis: Financial data visualization aids in analyzing stock market trends and investment
performance.
 Risk Management: Visualization tools help in assessing and communicating financial risks.

7. Education

Relationship:

 Interactive Learning Tools: Visual aids and interactive dashboards are used in educational settings
to facilitate learning and engagement.

 Curriculum Development: Data visualization assists educators in analyzing student performance
and curriculum effectiveness.

8. Social Sciences

Relationship:

 Survey Data Analysis: Visualization helps in interpreting data from social science research, such as
surveys and experiments.

 Behavioral Studies: Visual tools are used to analyze and present findings in psychology and
sociology.

9. Journalism

Relationship:

 Data Journalism: Visualizations are used to tell compelling stories with data, making complex
information accessible to a broad audience.

 Infographics: Journalists use infographics to summarize and highlight key points in their articles.

Benefits:

 Increased reader engagement and comprehension.

 Effective communication of complex stories.

10. Marketing

Relationship:

 Customer Insights: Visualization tools analyze customer data to understand behavior and
preferences.

 Campaign Performance: Marketers use dashboards to track and visualize the performance of
marketing campaigns.

Data Visualization Process:


Data visualization is a process that transforms raw data into graphical representations to help
communicate insights and findings effectively. Here's a step-by-step guide to the data visualization process:

1. Define Your Objectives:

o Purpose: Understand why you need to visualize the data. Is it to identify trends, make
decisions, or communicate findings?

2. Collect and Prepare Data:

o Gather Data: Collect the necessary data from various sources such as databases,
spreadsheets, or APIs.
o Clean Data: Clean the data by handling missing values, removing duplicates, and
correcting errors. Ensure the data is in a suitable format for analysis.

3. Understand the Data:

o Explore Data: Use statistical methods and exploratory data analysis (EDA) to understand
the data’s structure, patterns, and relationships.

o Identify Key Metrics: Determine the key metrics and dimensions that are most relevant to
your objectives.

4. Choose the Right Visualization Type:

o Match Data to Visualization: Select the most appropriate type of visualization based on
the data and the insights you want to convey. Common types include bar charts, line
graphs, scatter plots, pie charts, histograms, heatmaps, and more.

o Consider Complexity: For complex data sets, consider using advanced visualizations like
treemaps, network diagrams, or interactive dashboards.

5. Design the Visualization:

o Create Layout: Design a clear and logical layout for your visualization. Organize the
elements in a way that guides the viewer’s eye to the most important information.

o Use Colors and Styles: Use color, shapes, and styles effectively to highlight key insights and
make the visualization aesthetically pleasing.

o Add Labels and Annotations: Include titles, axis labels, legends, and annotations to
provide context and make the visualization self-explanatory.

6. Implement the Visualization:

o Choose Tools: Select appropriate tools and software for creating the visualization. Popular
tools include:

 Excel and Google Sheets: For simple charts and graphs.

 Tableau and Power BI: For interactive and complex dashboards.

 Python Libraries (Matplotlib, Seaborn, Plotly): For customizable and advanced
visualizations.

 R (ggplot2, Shiny): For statistical and customized visualizations.

o Create Visualization: Use the chosen tool to create your visualization, applying the design
principles you’ve planned.

7. Present and Share:

o Present: Share the visualization in presentations, reports, or online platforms, ensuring it
is accessible to your intended audience.

o Provide Context: Include a narrative or explanation to guide viewers through the
visualization and highlight the key takeaways.

Pseudo code Conventions - The Scatter plot.


Pseudocode is a method of describing a process, program, or algorithm using a natural language such as
English. It is not the code itself, but rather a description of what the code should do. In other words, it
is a detailed yet understandable step-by-step plan or blueprint from which a program can be written. It is
like a rough draft of a program or an algorithm before it is implemented in a programming language.

Def: A scatter plot is a graph that presents the relationship between two variables in a data set. It
represents data points on a two-dimensional plane: the independent variable or attribute is plotted on
the X-axis, while the dependent variable is plotted on the Y-axis. These plots are often called scatter graphs
or scatter diagrams.

Each pair of values in the data set is represented by a dot.

Day of the week    Sales in $
1                  250
2                  280
3                  380
4                  260
5                  300
6                  240
7                  180

Creating a scatter plot involves plotting points on a two-dimensional graph based on pairs of numerical
values. Here's a pseudo code outline for generating a scatter plot:
Scatter Plot

1. Initialize Data

o Define arrays/lists for X and Y coordinates.

2. Setup Plot

o Create a canvas or graph where the scatter plot will be drawn.

3. Plot Points

o Iterate over the data points and plot each (X, Y) coordinate on the canvas.

4. Add Labels and Titles

o Add axis labels and a title to the scatter plot for clarity.

5. Display Plot

o Render the scatter plot on the screen.

Here's a detailed pseudo code:

BEGIN
    // Step 1: Initialize Data
    DECLARE List X = [x1, x2, x3, ..., xn]
    DECLARE List Y = [y1, y2, y3, ..., yn]

    // Step 2: Setup Plot
    CALL CreateCanvas(width, height)
    SET CanvasTitle = "Scatter Plot"
    SET XAxisLabel = "X-Axis"
    SET YAxisLabel = "Y-Axis"

    // Step 3: Plot Points
    FOR i FROM 0 TO LENGTH(X) - 1 DO
        PLOT_POINT(X[i], Y[i])
    END FOR

    // Step 4: Add Labels and Titles
    SET Canvas.XLabel = XAxisLabel
    SET Canvas.YLabel = YAxisLabel
    SET Canvas.Title = CanvasTitle

    // Step 5: Display Plot
    CALL DisplayCanvas()
END
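
The pseudocode steps map fairly directly onto Python. The following is a minimal sketch, assuming Matplotlib is installed. It uses the day and sales values from the table earlier in this section. The function name draw_scatter is an illustrative choice, not part of any library.

```python
import matplotlib.pyplot as plt

# Step 1: Initialize data (the day/sales table from this section)
days = [1, 2, 3, 4, 5, 6, 7]
sales = [250, 280, 380, 260, 300, 240, 180]


def draw_scatter(x, y, title="Scatter Plot", xlabel="X-Axis", ylabel="Y-Axis"):
    """Plot one dot per (x, y) pair and return the figure and axes.

    draw_scatter is a hypothetical helper written for this example.
    """
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    # Step 2: Set up the plot (the "canvas")
    fig, ax = plt.subplots(figsize=(6, 4))
    # Step 3: Plot the points; scatter() iterates over the pairs internally
    ax.scatter(x, y)
    # Step 4: Add labels and a title
    ax.set_xlabel(xlabel)
    ax.set_ylabel(ylabel)
    ax.set_title(title)
    return fig, ax


fig, ax = draw_scatter(days, sales, title="Daily Sales",
                       xlabel="Day of the week", ylabel="Sales in $")

if __name__ == "__main__":
    # Step 5: Display the plot
    plt.show()
```

Returning the figure and axes, rather than only showing the plot, lets the caller adjust or save the chart afterwards.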

Introduction to Data Foundation:


Building a strong data foundation is essential for any data-driven initiative, including data visualization.

A data foundation refers to the fundamental infrastructure, processes, and strategies that lay the
groundwork for effectively collecting, managing, storing, organizing, and leveraging enterprise data.

A robust data foundation ensures that the data is accurate, reliable, and prepared for analysis, which is
crucial for generating meaningful insights and making informed decisions.

Key Components of Data Foundation

1. Data Collection: The process of gathering data from various sources.

2. Data Storage: Storing the collected data in a structured manner.

3. Data Cleaning: Ensuring the data is free from errors and inconsistencies.

4. Data Integration: Combining data from different sources to provide a unified view.

5. Data Preparation: Transforming data into a format suitable for analysis and visualization.

Data Collection

Data Sources

1. Internal Sources: Data generated within an organization.

o Examples: Operational databases, CRM systems, financial records.

2. External Sources: Data obtained from outside the organization.

o Examples: Market research reports, social media data, third-party datasets.

Data Collection Methods

1. Surveys and Questionnaires: Gathering data directly from respondents.

2. Interviews: Collecting detailed information through personal or group interviews.

3. Observations: Recording data based on observed behaviors or events.

4. Transactional Data: Capturing data from transactions or activities.


5. Web Scraping: Extracting data from websites using automated tools.

Data Storage

Types of Data Storage Systems

1. Databases: Structured storage systems for organizing and retrieving data.

o Examples: SQL databases (MySQL, PostgreSQL), NoSQL databases (MongoDB, Cassandra).

2. Data Warehouses: Centralized repositories for storing large volumes of data from multiple
sources.

o Examples: Amazon Redshift, Google BigQuery, Snowflake.

3. Data Lakes: Storage systems that hold raw data in its native format.

o Examples: Hadoop, Azure Data Lake, AWS Lake Formation.

4. Cloud Storage: Scalable storage solutions provided by cloud service providers.

o Examples: Amazon S3, Google Cloud Storage, Microsoft Azure Blob Storage.

Data Cleaning

Common Data Cleaning Tasks

1. Handling Missing Data: Filling in or removing missing values.

2. Removing Duplicates: Identifying and removing duplicate records.

3. Correcting Errors: Fixing inaccuracies in the data.

4. Standardizing Data: Ensuring consistency in data formats and values.

Data Integration

Data Integration Techniques

1. Merging Datasets: Combining data from different sources into a single dataset.

2. Joining Tables: Linking tables based on common keys.

3. ETL (Extract, Transform, Load): Extracting data from sources, transforming it into the desired
format, and loading it into a storage system.

4. APIs: Using Application Programming Interfaces to integrate data from different systems.
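
As a rough illustration of merging and joining, the sketch below links two small tables on a common key with pandas and then totals the result. The tables, column names, and values are hypothetical and exist only for this example.

```python
import pandas as pd

# Hypothetical data from two separate systems
customers = pd.DataFrame({
    "CustomerID": [1, 2, 3],
    "Name": ["Asha", "Ben", "Chen"],
})
orders = pd.DataFrame({
    "OrderID": [101, 102, 103, 104],
    "CustomerID": [1, 1, 3, 4],  # customer 4 has no matching customer record
    "Amount": [250.0, 100.0, 75.0, 60.0],
})

# Joining tables: link the two on their common key.
# how="left" keeps every order, even when no customer matches.
joined = orders.merge(customers, on="CustomerID", how="left")

# A small transform step in the ETL sense: total order amount per customer
totals = joined.groupby("CustomerID", as_index=False)["Amount"].sum()

print(joined)
print(totals)
```

The unmatched order keeps a missing Name after the left join, which is one way integration problems surface and get handed on to the data cleaning step.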

Data Preparation

Data Preparation Steps

1. Data Transformation: Converting data into a suitable format for analysis.

o Techniques: Normalization, aggregation, encoding categorical variables.

2. Data Enrichment: Enhancing data with additional information.

o Examples: Adding geolocation data, appending demographic information.

3. Data Validation: Ensuring data accuracy and completeness.

o Techniques: Cross-checking with other data sources, validating against known
benchmarks.

Tools for Data Handling

Data Collection Tools

 Survey Tools: SurveyMonkey, Google Forms.

 Web Scraping Tools: BeautifulSoup, Scrapy.

Data Storage Tools

 Database Management Systems: MySQL, MongoDB.

 Cloud Storage Solutions: Amazon S3, Google Cloud Storage.

Data Cleaning Tools

 Data Cleaning Software: OpenRefine, Trifacta.

 Programming Libraries: Pandas (Python), dplyr (R).

Data Integration Tools

 ETL Tools: Talend, Apache Nifi.

 API Management Tools: Postman, Swagger.

Data Preparation Tools

 Data Transformation Libraries: Pandas (Python), DataWrangler.

 Data Validation Tools: Great Expectations, DataCleaner.

DATA: Data is defined as facts, figures, or other information stored in or used by a computer. Examples
of data include the information collected for a research paper and the contents of an email.

Types of Data
Understanding the different types of data (in statistics, marketing research, or data science) allows you to
pick the data type that most closely matches your needs and goals.

Qualitative Data (Categorical Data)

As the name suggests, qualitative data describes the qualities or characteristics of what is being observed.
It is also called categorical data because it sorts observations into categories. Examples include the
gender of people and their family names in a sample of a population.

Qualitative data is further divided into two categories:

 Nominal Data

 Ordinal Data

Nominal Data

Nominal data is a type of data that consists of categories or names that cannot be ordered or ranked.
Nominal data is often used to categorize observations into groups, and the groups are not comparable. In
other words, nominal data has no inherent order or ranking. Examples of nominal data include gender
(male or female), race (White, Black, Asian), religion (Hinduism, Christianity, Islam, Judaism), and blood
type (A, B, AB, O).

Nominal data can be represented using frequency tables and bar charts, which display the number or
proportion of observations in each category. For example, a frequency table for gender might show the
number of males and females in a sample of people.
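
A frequency table of this kind can be sketched with pandas as follows. The blood-type sample is hypothetical, made up only to illustrate the counting step.

```python
import pandas as pd

# Hypothetical sample of blood types (nominal data: no inherent order)
blood_type = pd.Series(["A", "O", "B", "O", "AB", "A", "O", "A"])

counts = blood_type.value_counts()                     # number per category
proportions = blood_type.value_counts(normalize=True)  # share per category

freq_table = pd.DataFrame({"count": counts, "proportion": proportions})
print(freq_table)
```

Calling freq_table["count"].plot(kind="bar") would draw the corresponding bar chart, assuming Matplotlib is installed.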

Nominal data is analyzed using non-parametric tests, which do not make any assumptions about the
underlying distribution of the data. Common non-parametric tests for nominal data include Chi-Squared
Tests and Fisher’s Exact Tests. These tests are used to compare the frequency or proportion of
observations in different categories.

Ordinal Data

Ordinal data is a type of data that consists of categories that can be ordered or ranked. However, the
distance between categories is not necessarily equal. Ordinal data is often used to measure subjective
attributes or opinions, where there is a natural order to the responses. Examples of ordinal data include
education level (Elementary, Middle, High School, College), job position (Manager, Supervisor, Employee),
etc.

Ordinal data can be represented using bar charts or line charts. These displays show the order or ranking of
the categories, but they do not imply that the distances between categories are equal.

Ordinal data is analyzed using non-parametric tests, which make no assumptions about the underlying
distribution of the data. Common non-parametric tests for ordinal data include the Wilcoxon Signed-Rank
test and Mann-Whitney U test.
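
One way to keep the ranking of ordinal categories in code is an ordered categorical type. The sketch below uses pandas and the education levels listed above. The individual responses are hypothetical.

```python
import pandas as pd

levels = ["Elementary", "Middle", "High School", "College"]

# Hypothetical responses; ordered=True records the ranking of the categories
education = pd.Series(
    pd.Categorical(
        ["College", "Middle", "High School", "College", "Elementary"],
        categories=levels,
        ordered=True,
    )
)

lowest = education.min()                           # lowest-ranked level present
highest = education.max()                          # highest-ranked level present
at_least_hs = (education >= "High School").sum()   # comparisons respect the order
median_level = levels[int(education.cat.codes.median())]  # middle rank position

print(lowest, highest, at_least_hs, median_level)
```

The integer codes give rank positions only. They do not imply equal distances between categories, which is why the median is used here rather than the mean.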

Quantitative Data (Numerical Data)

Quantitative data represents numerical values and is also called numerical data. It is used for
measurements and counts such as height, weight, and length. Quantitative data is further classified into
two categories:

 Discrete Data

 Continuous Data

Discrete Data
Discrete data is a type of quantitative data that takes only distinct, separate values. These values can be
counted, typically as whole numbers. Examples of discrete data are,

 Number of students in a class

 Marks of the students in a class test

 Number of members in a family, etc.

Continuous Data

Continuous data is a type of quantitative data that can take any value within a range; it is measured
rather than counted. Examples of continuous data are,

 Temperature

 Height and weight of students in a class

 Salaries of workers in a factory, etc.

Structure of data Within Records


Within a record, data is organized as a single unit, typically corresponding to a row in a table or a record in
a database. Each record consists of multiple fields or attributes, each containing a piece of data.

Key Elements

1. Attributes/Fields:

o These are individual pieces of data within a record. For example, in a dataset of customer
information, fields might include CustomerID, Name, Age, and PurchaseAmount.

2. Data Types:

o Each attribute has a data type, such as integer, float, string, or date. Proper data types
ensure that the data is correctly interpreted and manipulated.
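
A single record with typed fields can be sketched in Python with a dataclass. The field names follow the customer example above, written in snake_case. The values and the extra purchase_date field are hypothetical, added only to show a date type.

```python
from dataclasses import dataclass, fields
from datetime import date


@dataclass
class CustomerRecord:
    """One record (row); each field is an attribute with a declared type."""
    customer_id: int
    name: str
    age: int
    purchase_amount: float
    purchase_date: date  # hypothetical extra field to show a date type


record = CustomerRecord(
    customer_id=1001,
    name="Asha",
    age=34,
    purchase_amount=249.99,
    purchase_date=date(2024, 1, 5),
)

# List each attribute with its declared type and stored value
for f in fields(record):
    print(f.name, f.type.__name__, getattr(record, f.name))
```

Dataclasses do not enforce the declared types at runtime, so a real pipeline would still need a validation step.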

Structure of Data in data visualization:


In data visualization, the structure of data plays a crucial role in determining how the data is represented
and visualized. There are various ways data can be structured depending on the type of visualization, the
purpose, and the characteristics of the data itself. Here are some common structures:

1. Tabular Structure

 Description: Data is arranged in rows and columns, much like a spreadsheet.

 Example: A table with columns like "Date," "Sales," and "Region."

 Use Cases: Best for simple data visualizations like bar charts, line graphs, or heat maps.

 Visualization: Bar chart, line chart, scatter plot.

2. Hierarchical Structure

 Description: Data is organized in a tree-like format, with parent-child relationships.


 Example: Organization charts, file directory structures, or any dataset with levels of categorization
(e.g., family trees).

 Use Cases: Useful for visualizing relationships and categorizations.

 Visualization: Tree maps, dendrograms, sunburst charts.

3. Network/Graph Structure

 Description: Data consists of nodes (entities) and edges (relationships) that connect the nodes.

 Example: Social networks, transportation networks, or web pages connected via hyperlinks.

 Use Cases: Shows relationships and interactions between entities.

 Visualization: Network graphs, force-directed graphs, radial graphs.

4. Geospatial Structure

 Description: Data is associated with geographical locations, often with latitude and longitude
coordinates.

 Example: Population density by region, meteorological data, or crime rates across cities.

 Use Cases: When geographic context is important for understanding the data.

 Visualization: Choropleth maps, heat maps, point maps.

5. Temporal Structure

 Description: Data is structured around time, where each data point is connected to a specific point
in time.

 Example: Time-series data, such as stock prices over time or website traffic trends.

 Use Cases: Analyzing changes and trends over time.

 Visualization: Line charts, area charts, Gantt charts.

6. Matrix Structure

 Description: Data is structured in a two-dimensional grid where rows and columns intersect to
form cells.

 Example: A confusion matrix in machine learning or correlation matrices between variables.

 Use Cases: Often used when comparing relationships between variables.

 Visualization: Heatmaps, correlation plots.

7. Textual/Unstructured Data

 Description: Text data, which does not follow a fixed format or structure.

 Example: Documents, social media posts, or articles.

 Use Cases: Useful in natural language processing (NLP) and sentiment analysis.

 Visualization: Word clouds, text graphs, frequency distributions.

8. Multi-dimensional (n-D) Structure

 Description: Data with more than two dimensions (e.g., features, categories) is represented.

 Example: A dataset with multiple attributes like age, gender, income, and education level.
 Use Cases: Analyzing multi-dimensional datasets where several factors play a role.

 Visualization: Parallel coordinate plots, 3D scatter plots, radar charts.

Data Flow in Visualization

 Raw Data: Initially in a structured or unstructured format.

 Transformation: Data is cleaned, transformed, and possibly aggregated.

 Mapping: Data is mapped to visual attributes like position, size, color, shape, and orientation.

 Rendering: Visualization is rendered based on mapped attributes, using tools like D3.js, Tableau, or
Python’s Matplotlib.

Data Preprocessing:
The process of converting raw data into an understandable format.

Data preprocessing is a crucial step in data visualization, as it prepares raw data for analysis and
visualization. The goal is to clean and transform the data to make it suitable for the intended analysis.

Steps Involved in Data Preprocessing:


1. Data Cleaning:
Raw data can have many irrelevant and missing parts. Data cleaning handles these; it involves
dealing with missing data, noisy data, etc.

 (a). Missing Data:


This situation arises when some values are absent from the dataset. It can be handled in various ways.
Some of them are:
1. Ignore the tuples:
This approach is suitable only when the dataset is quite large and multiple values
are missing within a tuple.

2. Fill the missing values:
There are various ways to do this task. The missing values can be filled in manually,
with the attribute mean, or with the most probable value.

 (b). Noisy Data:


Noisy data is meaningless data that cannot be interpreted by machines. It can be generated by
faulty data collection, data entry errors, etc. It can be handled in the following ways:
1. Binning Method:
This method works on sorted data in order to smooth it. The data is divided into
segments (bins) of equal size, and each segment is handled separately. All values in a
segment can be replaced by the segment mean, or by the nearest segment boundary
value.

2. Regression:
Here data is smoothed by fitting it to a regression function. The regression used
may be linear (having one independent variable) or multiple (having multiple independent
variables).

3. Clustering:
This approach groups similar data into clusters. Outliers either fall outside the
clusters or may go undetected.
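
The binning method can be illustrated with smoothing by bin means. The following is a minimal sketch using NumPy. The helper name smooth_by_bin_means and the sample values are assumptions made for this example.

```python
import numpy as np


def smooth_by_bin_means(values, bin_size):
    """Sort the values, split them into equal-size bins, and replace
    every value with the mean of its bin (hypothetical helper)."""
    ordered = np.sort(np.asarray(values, dtype=float))
    smoothed = np.empty_like(ordered)
    for start in range(0, len(ordered), bin_size):
        segment = ordered[start:start + bin_size]
        smoothed[start:start + bin_size] = segment.mean()
    return smoothed


# Hypothetical noisy measurements
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
smoothed_prices = smooth_by_bin_means(prices, bin_size=3)
print(smoothed_prices)
```

With a bin size of 3, the nine sorted values fall into three bins whose means are 9, 22, and 29, so each value is replaced by one of those three numbers.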
2. Data Transformation:
This step transforms the data into forms suitable for the mining process.
It involves the following methods:
1. Normalization:
It is done in order to scale the data values to a specified range (-1.0 to 1.0 or 0.0 to 1.0).

2. Attribute Selection:
In this strategy, new attributes are constructed from the given set of attributes to help the mining
process. This is also known as attribute construction.

3. Discretization:
This is done to replace the raw values of a numeric attribute with interval labels or conceptual labels.

4. Concept Hierarchy Generation:
Here attributes are converted from a lower level to a higher level in a hierarchy. For example, the
attribute “city” can be generalized to “country”.
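
Two of these transformations, normalization and discretization, can be sketched with pandas. The ages, bin edges, and labels below are hypothetical, and the min-max formula assumes the column's maximum differs from its minimum.

```python
import pandas as pd

# Hypothetical ages
ages = pd.Series([18, 25, 32, 47, 60], name="age")

# Min-max normalization: rescale values into the range 0.0 to 1.0
# (assumes ages.max() != ages.min(), otherwise this divides by zero)
normalized = (ages - ages.min()) / (ages.max() - ages.min())

# Discretization: replace raw numbers with conceptual labels.
# Bins are right-inclusive: (0, 30], (30, 50], (50, 120]
age_group = pd.cut(
    ages,
    bins=[0, 30, 50, 120],
    labels=["young", "middle-aged", "senior"],
)

print(pd.DataFrame({"age": ages, "normalized": normalized, "group": age_group}))
```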

3. Data Reduction:
Data reduction is a crucial step in the data mining process that involves reducing the size of the
dataset while preserving the important information. This is done to improve the efficiency of data
analysis and to avoid overfitting of the model. Some common steps involved in data reduction are:
Feature Selection: This involves selecting a subset of relevant features from the dataset. Feature
selection is often performed to remove irrelevant or redundant features from the dataset. It can
be done using various techniques such as correlation analysis, mutual information, and principal
component analysis (PCA).
Feature Extraction: This involves transforming the data into a lower-dimensional space while
preserving the important information. Feature extraction is often used when the original features
are high-dimensional and complex. It can be done using techniques such as PCA, linear
discriminant analysis (LDA), and non-negative matrix factorization (NMF).
Sampling: This involves selecting a subset of data points from the dataset. Sampling is often used
to reduce the size of the dataset while preserving the important information. It can be done using
techniques such as random sampling, stratified sampling, and systematic sampling.
Clustering: This involves grouping similar data points together into clusters. Clustering is often
used to reduce the size of the dataset by replacing similar data points with a representative
centroid. It can be done using techniques such as k-means, hierarchical clustering, and density-
based clustering.
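
The three sampling techniques named above can be sketched with pandas. The dataset is hypothetical, and random_state is fixed only so that the example is repeatable.

```python
import pandas as pd

# Hypothetical dataset: 100 rows, 80 in region "North" and 20 in "South"
df = pd.DataFrame({
    "id": range(100),
    "region": ["North"] * 80 + ["South"] * 20,
})

# Random sampling: 10 rows chosen uniformly at random
random_sample = df.sample(n=10, random_state=0)

# Stratified sampling: 10% of each region, so the 80/20 split is preserved
stratified_sample = df.groupby("region").sample(frac=0.1, random_state=0)

# Systematic sampling: every 10th row, starting from the first
systematic_sample = df.iloc[::10]

print(len(random_sample), len(stratified_sample), len(systematic_sample))
```

Stratified sampling is the usual choice when a small group, such as the "South" rows here, must not disappear from the reduced dataset.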

Example: Preparing Data for Visualization

Let’s walk through a complete example of preprocessing a dataset for visualization:

Date        Product   Sales   Revenue
1/1/2024    A         100     2000
1/2/2024    B         150     3000
1/3/2024    A         200     4000
1/4/2024    C         50      1000
1/5/2024    B         300     6000
1/6/2024    A         250     5000

Key Steps in Data Preprocessing for Visualization

1. Data Collection
o Objective: Gather data from various sources, such as databases, APIs, or
spreadsheets.
o Example: Collect sales data from a company's sales database.
2. Data Cleaning
o Objective: Identify and correct errors, inconsistencies, and missing values in the
dataset.
o Tasks:
 Remove Duplicates: Ensure there are no duplicate records.
 Handle Missing Values: Fill, interpolate, or remove missing data.
 Correct Errors: Fix any inconsistencies or inaccuracies.

import pandas as pd

# Sample data with missing values and a repeated date
df = pd.DataFrame({
    'Date': ['2024-01-01', '2024-01-02', None, '2024-01-04', '2024-01-05', '2024-01-05'],
    'Product': ['A', 'B', 'A', 'C', None, 'B'],
    'Sales': [100, 150, 200, None, 300, 300],
    'Revenue': [2000, 3000, 4000, 1000, 6000, 6000]
})

# Drop rows that are exact duplicates across all columns.
# Note: the last two rows differ in 'Product' (None vs 'B'), so both are kept.
df = df.drop_duplicates()

# Fill missing values
df['Date'] = df['Date'].ffill()                       # carry the previous date forward
df['Product'] = df['Product'].fillna('Unknown')       # placeholder category
df['Sales'] = df['Sales'].fillna(df['Sales'].mean())  # mean of the known sales

3. Data Transformation

 Objective: Convert data into a format suitable for analysis and visualization.
 Tasks:
o Normalization/Scaling: Adjust the range of data values.
o Encoding Categorical Variables: Convert categorical data into numerical format.
o Date/Time Conversion: Ensure date and time data are in the correct format.

Example:
from sklearn.preprocessing import StandardScaler, LabelEncoder

# Convert 'Date' to datetime type
df['Date'] = pd.to_datetime(df['Date'])

# Scale 'Sales' and 'Revenue' by standardization (zero mean, unit variance)
scaler = StandardScaler()
df[['Sales', 'Revenue']] = scaler.fit_transform(df[['Sales', 'Revenue']])

# Encode 'Product' labels as integer codes
le = LabelEncoder()
df['Product'] = le.fit_transform(df['Product'])

4. Data Aggregation

 Objective: Summarize and group data for easier visualization.


 Tasks:
o Group By: Aggregate data based on certain columns.
o Compute Summary Statistics: Calculate totals, averages, or other statistics.

Example:

# Group by 'Product' and calculate total 'Sales'
product_sales = df.groupby('Product')['Sales'].sum().reset_index()

5. Feature Engineering
 Objective: Create new features or variables that can provide additional insights.
 Tasks:
o Create Derived Variables: Generate new columns based on existing data.
o Binning: Group numerical data into bins or categories.

Example:

# Add a 'Month' column
df['Month'] = df['Date'].dt.to_period('M')

# Compute monthly sales totals
monthly_sales = df.groupby('Month')['Sales'].sum().reset_index()

Data Preprocessing Complete code:

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler

# Load dataset
df = pd.DataFrame({
    'Date': ['2024-01-01', '2024-01-02', '2024-01-03', '2024-01-04', '2024-01-05', '2024-01-06'],
    'Product': ['A', 'B', 'A', 'C', 'B', 'A'],
    'Sales': [100, 150, 200, 50, 300, 250],
    'Revenue': [2000, 3000, 4000, 1000, 6000, 5000]
})

# Convert 'Date' to datetime type
df['Date'] = pd.to_datetime(df['Date'])

# Standardize 'Sales' and 'Revenue' into new columns so the raw values stay
# available (standardized values sum to zero, so they are not aggregated)
scaler = StandardScaler()
df[['Sales_scaled', 'Revenue_scaled']] = scaler.fit_transform(df[['Sales', 'Revenue']])

# Add 'Month' column
df['Month'] = df['Date'].dt.to_period('M')

# Aggregate raw sales by month (this sample covers only January 2024,
# so the result has a single row and the plot a single point)
monthly_sales = df.groupby('Month')['Sales'].sum().reset_index()

# Visualization
plt.figure(figsize=(10, 6))
plt.plot(monthly_sales['Month'].astype(str), monthly_sales['Sales'], marker='o')
plt.xlabel('Month')
plt.ylabel('Total Sales')
plt.title('Monthly Sales Trend')
plt.xticks(rotation=45)
plt.grid(True)
plt.show()
