Data Analytics Preparation & Visualization

Uploaded by

Zegar Pradipta
Data Analytics Preparation & Visualization
Data Science Learning Studio

Speakers
Hamimah Alatas
Data Analytics at Tokopedia
Contents

01 Intro to Python
02 Intro to Numpy, Pandas, and Matplotlib
03 Intro to Data Preparation Using Pandas
04 Intro EDA & Visualization using Matplotlib
05 Analytical Reports
1. Introduction to Python

What is Python?
Python is a high-level programming language, with applications in numerous areas,
including web programming, scripting, scientific computing, and artificial intelligence.

It is very popular and used by organizations such as Google, NASA, the CIA, and Disney.
Python is processed at runtime by the interpreter. There is no need to compile your
program before executing it.

The three major versions of Python are 1.x, 2.x and 3.x. These are subdivided into minor versions, such as 2.7 and 3.3. Code written for one 3.x release generally runs on later 3.x releases, although deprecated features are occasionally removed, so this is not a strict guarantee.
Python 2.x reached end of life in January 2020, but legacy 2.x code is still in use. This course covers Python 3.x, and it isn't hard to change from one version to another.

An interpreter is a program that runs scripts written in an interpreted language such as Python.

What is Python?
• Top Programming Language 2019*
• Easy to learn
• Huge Community
• Has numerous libraries for different needs
• Readable and Maintainable Code
What is a library?
A library is an umbrella term referring to a reusable chunk of code. Usually, a Python library contains a collection of related modules and packages.

A few examples of Python libraries:
● Numpy
● Pandas
● Matplotlib
● Scikit-learn
● etc.
2. Numpy, Pandas and Matplotlib

What is Numpy?
NumPy is an open source Python library that’s used in almost every field of science and
engineering. It’s the universal standard for working with numerical data in Python, and it’s at
the core of the scientific Python and PyData ecosystems. (According to numpy.org)
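
To make "working with numerical data" concrete, here is a minimal sketch of NumPy arrays. The sample values and variable names are made up for illustration and are not from the slides.

```python
import numpy as np

# Build a one-dimensional array from a Python list (made-up sample values)
prices = np.array([10.0, 12.5, 9.0, 11.5])

# Arithmetic applies element by element, with no explicit loop
discounted = prices * 0.9

# Aggregations are available as array methods
average_price = prices.mean()
highest_price = prices.max()

print(discounted)
print(average_price, highest_price)
```

The element-wise arithmetic is the main reason NumPy sits underneath libraries such as Pandas.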
What is Pandas?

The name Pandas is derived from panel data. Panel data comprises observations over multiple time periods for the same individuals.

● One of the most widely used Python libraries for data analysis and engineering.
● Development was started in 2008 by Wes McKinney
● Open source
● Built on top of NumPy, with performance-critical parts written in C and Cython, hence it’s fast
● Introduced the DataFrame, Series and Panel objects (Panel has since been removed from pandas)

What is Matplotlib?

Matplotlib is a Python 2D plotting library which produces publication quality figures in a variety of hardcopy formats and interactive environments across platforms.

Matplotlib can be used in Python scripts, the Python and IPython shells, the Jupyter notebook, web application servers, and four graphical user interface toolkits.

How to install and use it?
- There are many ways to install these libraries. I prefer to install them from PyPI, using pip in your terminal.
- To use them, simply import them in your Python code.
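
The commands on this slide did not survive as text, so the following is only a sketch of the usual approach. A common way to install from PyPI is `pip install numpy pandas matplotlib`, run in the terminal. The aliases `np`, `pd` and `plt` below are widely used community conventions, assumed here rather than taken from the slide.

```python
# Conventional import aliases (an assumption; any alias would work)
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Printing the versions is a quick check that the installation worked
print(np.__version__)
print(pd.__version__)
```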


3. Intro to Data Preparation Using Pandas

What is a Series?
Definition:
● A Series is one-dimensional, like an array. It is a mutable data object: existing items can be updated or deleted, and new items can be added.
● It can hold data of any type.
● It can be instantiated with an array or a dictionary. The keys of the dictionary become the index labels.
● The Series constructor accepts the data, an index, a data type and a Boolean flag indicating whether to copy the data. If no index is given, the items are labeled with integers starting from 0.

Think of a Series as a vertical column that holds multiple rows.
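
A minimal sketch of the Series behaviour described above, using made-up values: default integer labels, dictionary keys as labels, and mutability.

```python
import pandas as pd

# From a list: the index defaults to 0, 1, 2, ...
from_list = pd.Series([10, 20, 30])

# From a dictionary: the keys become the index labels
from_dict = pd.Series({"a": 1, "b": 2, "c": 3})

# A Series is mutable: update, add and delete items
from_dict["a"] = 100   # update an existing item
from_dict["d"] = 4     # add a new item
del from_dict["b"]     # delete an item

print(from_list.index.tolist())
print(from_dict.to_dict())
```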


What is a Data Frame?

Definition:
● Possibly the most used data structure in a data science project.
● It is a table with rows and columns, like a SQL table or an Excel sheet.
● The table is held in memory and can store data of different types.
● Offers high-performance time series data analysis and engineering.
● A data frame is a container of one or more Series.
● DataFrame is mutable.
What can we do with pandas DataFrame?
Manipulate:
1. Create
2. Selection
3. Addition
4. Deletion
5. Rename
6. Looping
7. Sorting
1. Create

class pandas.DataFrame(data=None, index=None, columns=None, dtype=None, copy=False)

Two-dimensional size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns). Arithmetic operations align on both row and column labels. Can be thought of as a dict-like container for Series objects. The primary pandas data structure.
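
As a small illustration of creating a DataFrame with the constructor, here is a sketch built from a dictionary of lists. The bakery-style sample data and the row labels are hypothetical.

```python
import pandas as pd

# Hypothetical sample data: each dictionary key becomes a column
data = {
    "product": ["bread", "cake", "cookie"],
    "price": [2.5, 15.0, 1.0],
    "orders": [40, 5, 60],
}

# The optional index argument supplies the row labels
df = pd.DataFrame(data, index=["p1", "p2", "p3"])

print(df.shape)    # (rows, columns)
print(df.dtypes)   # each column keeps its own data type
```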
2. Selection
● Selection by column
● Selection by rows
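
The selection examples on this slide were images, so this is a sketch of common ways to select by column and by rows, using a hypothetical DataFrame.

```python
import pandas as pd

df = pd.DataFrame(
    {"product": ["bread", "cake", "cookie"], "price": [2.5, 15.0, 1.0]},
    index=["p1", "p2", "p3"],
)

# Selection by column
prices = df["price"]                # one column -> Series
subset = df[["product", "price"]]   # list of columns -> DataFrame

# Selection by rows
by_label = df.loc["p2"]             # row with index label "p2"
by_position = df.iloc[0]            # first row, by integer position
expensive = df[df["price"] > 2.0]   # rows matching a condition
```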
3. Addition
● Adding a column
● Adding rows
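
A sketch of adding a column and adding rows, on hypothetical data. Rows are added with `pd.concat` here, because `DataFrame.append` was removed in pandas 2.0; the slide's own example may have used a different call.

```python
import pandas as pd

df = pd.DataFrame({"product": ["bread", "cake"], "price": [2.5, 15.0]})

# Adding a column: assign to a new column name
df["orders"] = [40, 5]
df["revenue"] = df["price"] * df["orders"]   # column derived from others

# Adding rows: concatenate with another DataFrame that has the same columns
new_rows = pd.DataFrame(
    {"product": ["cookie"], "price": [1.0], "orders": [60], "revenue": [60.0]}
)
df = pd.concat([df, new_rows], ignore_index=True)
```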
4. Deletion
● Deleting a column
● Deleting rows
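
A sketch of deleting a column and deleting rows with `drop`, on hypothetical data. `drop` returns a new DataFrame and leaves the original untouched.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "product": ["bread", "cake", "cookie"],
        "price": [2.5, 15.0, 1.0],
        "orders": [40, 5, 60],
    }
)

# Deleting a column
without_column = df.drop(columns=["orders"])

# Deleting rows, by index label
without_rows = df.drop(index=[0, 2])
```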
5. Rename
● Rename column
● Rename index
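
A sketch of renaming columns and index labels with `rename`, which takes a mapping from old names to new names. The data and the new names are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"product": ["bread", "cake"], "price": [2.5, 15.0]})

# Rename a column: map the old name to the new name
renamed_columns = df.rename(columns={"price": "unit_price"})

# Rename index labels the same way
renamed_index = df.rename(index={0: "first", 1: "second"})
```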
6. Looping
● Loop over items
● Loop over rows
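
A sketch of looping over items (columns) and over rows, on hypothetical data. It uses `items()`, since the older `iteritems()` was removed in pandas 2.0. The last line shows the vectorized equivalent, which is usually preferred over row loops.

```python
import pandas as pd

df = pd.DataFrame(
    {"product": ["bread", "cake"], "price": [2.5, 15.0], "orders": [40, 5]}
)

# Loop over items: yields (column name, column as a Series) pairs
column_names = []
for name, column in df.items():
    column_names.append(name)

# Loop over rows: yields (index label, row as a Series) pairs
revenues = []
for index, row in df.iterrows():
    revenues.append(row["price"] * row["orders"])

# Same result without a loop
vectorized = (df["price"] * df["orders"]).tolist()
```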
7. Sorting
● Sort by index
● Sort by columns
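
A sketch of sorting by index and by column values, using a hypothetical DataFrame whose index is deliberately out of order.

```python
import pandas as pd

df = pd.DataFrame(
    {"product": ["bread", "cake", "cookie"], "price": [2.5, 15.0, 1.0]},
    index=[2, 0, 1],
)

# Sort by index labels
by_index = df.sort_index()

# Sort by the values of a column, highest first
by_price = df.sort_values(by="price", ascending=False)
```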

Pandas Functionality

Functions:
1. Read Data
2. Basic Function
3. Data Engineering
4. Aggregate
5. Simple Plotting
1. Read Data

pandas.read_csv(filepath_or_buffer: Union[str, pathlib.Path, IO[~AnyStr]], ...)
Read a comma-separated values (csv) file into DataFrame.

pandas.read_excel(io, sheet_name=0, ...)
Read an Excel file into a pandas DataFrame.
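
A minimal sketch of `read_csv`. To stay self-contained it reads from an in-memory text buffer instead of a file on disk; the CSV content is made up. With a real file you would pass its path instead.

```python
import io

import pandas as pd

# Made-up CSV content; the first line holds the column names
csv_text = "product,price,orders\nbread,2.5,40\ncake,15.0,5\n"

# read_csv accepts a path or any file-like object
df = pd.read_csv(io.StringIO(csv_text))

print(df)
```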
2. Basic Function

Head
Use head(n) to return the first n records
r = df.head(10) # returns the first 10 records

Tail
Use tail(n) to return the last n records
r = df.tail(10) # returns the last 10 records

Transpose
If you want to swap rows and columns, use the T attribute
transposed = df.T

There are also key attributes of a DataFrame such as:
shape: the dimensions of the DataFrame as a (rows, columns) tuple
size: number of items
ndim: number of axes

Describe
If you want a quick summary of your data frame, including the count, mean, standard deviation, minimum, maximum and a number of percentiles for each numeric column, use the describe method:
df.describe()

DataFrame also offers a number of statistical functions such as:
● abs(): absolute values
● mean(): mean values. It also offers median() and mode()
● min(): minimum value. It also offers max(), count() and std() (standard deviation)
● prod(): product of the values
● cumsum(): cumulative sum
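
A sketch that exercises several of these basic functions on a small hypothetical DataFrame.

```python
import pandas as pd

df = pd.DataFrame(
    {"price": [2.5, 15.0, 1.0, 4.0], "orders": [40, 5, 60, 20]}
)

first_two = df.head(2)      # first 2 records
last_two = df.tail(2)       # last 2 records
transposed = df.T           # rows and columns swapped
summary = df.describe()     # count, mean, std, min, percentiles, max

mean_price = df["price"].mean()
running_orders = df["orders"].cumsum()

print(df.shape, df.size, df.ndim)
```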
3. Data Engineering

To Check For Missing Values
df.isnull() # True where a value is missing; df.notnull() is the inverse

To Drop Missing Values
df.dropna()

Filling Missing Values: Direct Replace
df.fillna(ScalarValue)
We can also pass a dictionary to the replace() method to map specific values to their replacements.

Filling Missing Values: Backward Or Forward
df.bfill() # backward fill; use df.ffill() for forward fill
Older code writes df.fillna(method='backfill'), which recent pandas versions deprecate.

Computing Correlation
df.corr() # between all columns; pass numeric_only=True if some columns are not numeric
df['columnA'].corr(df['columnB']) # between two columns
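
A sketch of the missing-value and correlation calls on small made-up frames. It uses `ffill()` and `bfill()` directly, since passing `method=` to `fillna` is deprecated in recent pandas versions.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [4.0, 5.0, np.nan]})

missing_per_column = df.isnull().sum()   # count of missing values per column
dropped = df.dropna()                    # keep only rows with no missing values
filled_zero = df.fillna(0)               # replace missing values with a scalar
forward = df.ffill()                     # carry the previous value forward
backward = df.bfill()                    # pull the next value backward

# Correlation between two columns of a complete frame
complete = pd.DataFrame({"x": [1, 2, 3, 4], "y": [2, 4, 6, 8]})
correlation = complete["x"].corr(complete["y"])
```

Note that a backward fill cannot fill a missing value in the last row, because there is no later value to pull from.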
4. Aggregate

Grouping Rows
groupedDataFrame = df.groupby('ColumnName')
# multiple grouping columns
groupedDataFrame = df.groupby(['ColumnA', 'ColumnB'])

Filtering
groupedDataFrame.filter(myCustomFunction) # myCustomFunction takes each group as a DataFrame and returns True to keep it

Merging
The function to merge is called merge(). It takes a left data frame, a right data frame, an on parameter (or left_on and right_on) defining which columns to join on, and a how parameter naming the join type, e.g. left, right, outer or inner.
merged = pd.merge(left, right, left_on='name', right_on='id', how='left')

Union Data Frames
To concatenate two data frames, use the concat() function:
pd.concat([one, two])
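
A sketch of grouping, group filtering, merging and concatenating on hypothetical order data. The table and column names are illustrative only.

```python
import pandas as pd

orders = pd.DataFrame(
    {
        "store": ["north", "north", "south", "south", "east"],
        "product_id": [1, 2, 1, 2, 1],
        "quantity": [10, 5, 7, 3, 1],
    }
)
products = pd.DataFrame({"id": [1, 2], "name": ["bread", "cake"]})

# Grouping rows, then aggregating each group
quantity_per_store = orders.groupby("store")["quantity"].sum()

# Filtering groups: keep stores whose total quantity is at least 10
busy_stores = orders.groupby("store").filter(
    lambda group: group["quantity"].sum() >= 10
)

# Merging: left join of orders with products on differently named keys
merged = pd.merge(
    orders, products, left_on="product_id", right_on="id", how="left"
)

# Union: stack two data frames on top of each other
stacked = pd.concat([orders, orders], ignore_index=True)
```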
5. Simple Plotting

Pandas data frames offer a range of graphical plotting options: line plots, box plots, area plots, scatter plots, stacked charts, bar charts, histograms, etc.
● df.plot.scatter(x='columnA', y='columnB') # plots a scatter chart
● df['diff'].hist(bins=10) # creates a histogram of a column named 'diff'; bracket access is needed because df.diff is a DataFrame method
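
A sketch of the two pandas plotting calls on made-up data. The `Agg` backend line is an assumption so that the code runs without a display; in a notebook or interactive session, drop it and call `plt.show()` instead.

```python
import matplotlib

matplotlib.use("Agg")  # assumption: non-interactive backend, no display needed
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame(
    {"price": [2.5, 15.0, 1.0, 4.0, 8.0], "orders": [40, 5, 60, 20, 12]}
)

# One figure with two axes, so each plot gets its own panel
fig, (ax_scatter, ax_hist) = plt.subplots(1, 2)

df.plot.scatter(x="price", y="orders", ax=ax_scatter)   # scatter chart
df["price"].hist(bins=10, ax=ax_hist)                   # histogram of one column
```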
Hands-On
4. Intro Exploratory Data Analytics (EDA) & Data Visualization using Matplotlib

What can we do in matplotlib?

Visualization:
1. Line plot
2. Histogram
3. Bar Chart
4. Scatter Plot
5. Pie Chart
6. Boxplot

1. Line plot
The plot() function in Matplotlib’s pyplot module draws y against x as lines and/or markers. It takes arguments such as plot(x, y, scalex, scaley, data, **kwargs).
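
A minimal line plot sketch with `plt.plot`, using made-up data. The `Agg` backend line is an assumption so the code runs without a display; in an interactive session, drop it and call `plt.show()`.

```python
import matplotlib

matplotlib.use("Agg")  # assumption: non-interactive backend, no display needed
import matplotlib.pyplot as plt

days = [1, 2, 3, 4, 5]
orders = [12, 15, 9, 20, 18]

plt.figure()
lines = plt.plot(days, orders, marker="o", label="orders")  # returns a list of Line2D
plt.xlabel("Day")
plt.ylabel("Orders")
plt.title("Orders per day")
plt.legend()
```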
2. Histogram
We can use the plt.hist() function to plot histograms. It takes arguments such as the data, bins, color, etc.
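
A sketch of `plt.hist` on randomly generated sample data. The seed, distribution and color are arbitrary choices for illustration, and the `Agg` backend is assumed so the code runs without a display.

```python
import matplotlib

matplotlib.use("Agg")  # assumption: non-interactive backend, no display needed
import matplotlib.pyplot as plt
import numpy as np

# 200 made-up values drawn from a normal distribution
rng = np.random.default_rng(seed=0)
values = rng.normal(loc=50, scale=10, size=200)

plt.figure()
# hist returns the count per bin, the bin edges and the drawn bars
counts, bin_edges, patches = plt.hist(values, bins=10, color="steelblue")
plt.xlabel("Value")
plt.ylabel("Frequency")
```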
3. Bar Chart
In a bar chart, one axis represents the categories of a column and the other axis represents the value or count for each category. Bar charts can be plotted vertically or horizontally.
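
A sketch of a vertical and a horizontal bar chart with `bar` and `barh`, on made-up category counts. The `Agg` backend is assumed so the code runs without a display.

```python
import matplotlib

matplotlib.use("Agg")  # assumption: non-interactive backend, no display needed
import matplotlib.pyplot as plt

categories = ["bread", "cake", "cookie"]
counts = [40, 5, 60]

fig, (ax_vertical, ax_horizontal) = plt.subplots(1, 2)

vertical = ax_vertical.bar(categories, counts)        # categories on the x-axis
horizontal = ax_horizontal.barh(categories, counts)   # categories on the y-axis
```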
4. Scatter Plot
Scatter plots use dots to show the relationship between two numeric variables. The scatter() method in the Matplotlib library is used for plotting.
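
A sketch of `plt.scatter` with two made-up numeric variables. The `Agg` backend is assumed so the code runs without a display.

```python
import matplotlib

matplotlib.use("Agg")  # assumption: non-interactive backend, no display needed
import matplotlib.pyplot as plt

price = [2.5, 15.0, 1.0, 4.0, 8.0]
orders = [40, 5, 60, 20, 12]

plt.figure()
points = plt.scatter(price, orders)   # one dot per (price, orders) pair
plt.xlabel("Price")
plt.ylabel("Orders")
```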
5. Pie Chart
A pie chart (or circular chart) shows each category as a percentage of the whole. Hence it is used when we want to compare the individual categories with the whole.

Tips: Don’t Use Pie Charts
Study after study has shown that pie charts are not the most effective means of communicating data.
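
For completeness, a sketch of `plt.pie` on made-up shares. Given the tip above, a bar chart of the same values is often the clearer choice. The `Agg` backend is assumed so the code runs without a display.

```python
import matplotlib

matplotlib.use("Agg")  # assumption: non-interactive backend, no display needed
import matplotlib.pyplot as plt

labels = ["bread", "cake", "cookie"]
shares = [50, 20, 30]

plt.figure()
# autopct prints each wedge's percentage of the whole
plt.pie(shares, labels=labels, autopct="%1.0f%%")
wedge_count = len(plt.gca().patches)   # one wedge per category
```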
6. Boxplot
A box plot shows a summary of the numeric values in a dataset. The summary contains the minimum, first quartile, median, third quartile, and maximum; the median lies between the first and third quartiles. In a vertical box plot, the y-axis carries the data values and each box along the x-axis represents one variable or group.
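
A sketch that computes the quartiles with NumPy and draws them with `plt.boxplot`, on made-up values. The `Agg` backend is assumed so the code runs without a display.

```python
import matplotlib

matplotlib.use("Agg")  # assumption: non-interactive backend, no display needed
import matplotlib.pyplot as plt
import numpy as np

values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])

# First quartile, median and third quartile of the data
q1, median, q3 = np.percentile(values, [25, 50, 75])

plt.figure()
box = plt.boxplot(values)   # returns a dict of the drawn lines
plt.ylabel("Value")

# The median line of the first (and only) box sits at the median value
median_line_y = box["medians"][0].get_ydata()[0]
```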
Hands-On
5. Analytical Reporting

Writing tips
1. Shorter is better.
○ Be concise.
2. One recommendation per document.
○ State the recommendation first.
3. One topic per paragraph
○ State the topic first.
4. Rule of Three
○ Have 3 supporting ideas for any arguments.
5. Use active voice rather than passive
○ Active voice: “We recommend buying Company A.”
○ Passive voice: “It is recommended that we buy Company A.”
Pyramid Principle

a. Inductive vs Deductive writing
Inductive writing is clearer for the reader

b. Mutually Exclusive Collectively Exhaustive (“MECE”)
How to separate arguments for maximum clarity
Deductive vs Inductive Writing

Deductive Writing
3 components:
1. Statement about a situation
2. Statement about a subject relating to the situation in (1)
3. Conclusion about the subject based on (2)

Example:
Any company that meets these three criteria will be worth buying → Company A meets these criteria → Therefore Company A is worth buying

● Easier to construct than inductive reasoning
● Naturally the way we think (sequential)

Inductive Writing
Components:
1. Start with the conclusion
2. Support the argument with groups of ideas that are similar to each other

Example:
Company A is worth buying
● Industry will continue to grow
● Internal capability to meet growth
● Potential for efficiency gains

● Groups of ideas can be described by a plural noun
● Top-down reasoning needs to be verified / logical
Deductive vs Inductive Writing
Inductive Writing is Easier for the Reader

Deductive
Here’s what’s going wrong (A1, B1, C1) → Here’s what’s causing it (A2, B2, C2) → Therefore, here’s what you should do about it (Recommendation: A3, B3, C3)

1. To understand the overall reasoning, the reader must:
- Read A1, B1, C1, A2, B2, C2,
- Remember them,
- And then connect each of these to the recommendations (A3, B3, C3).
2. The reader will have to wait to see the recommendation.
3. To get there, the reader will have to re-enact the writer’s entire problem-solving process.

Inductive
Recommendation: You must change
● How: A3 (Why: A1, A2)
● How: B3 (Why: B1, B2)
● How: C3 (Why: C1, C2)

1. The reader’s major question is answered directly
2. Clear separation in the thinking between subject areas
3. All information related to A / B / C is in one place

This is the Pyramid Structure
Pyramid Structure

1. Every single document will always be structured to support 1 single answer
○ Start with the answer first
2. Ideas at any level in the pyramid must always be summaries of the ideas grouped below them
3. Ideas in each grouping must always be the same kind of idea
4. Ideas in each grouping must always be logically ordered
MECE
1. MECE: “Mutually Exclusive, Collectively Exhaustive”
2. Makes the supporting ideas flow to the recommendation with maximal clarity
3. When you divide a recommendation into its parts, you must make sure the pieces produced are:
a. Mutually Exclusive: “No overlaps”
b. Collectively Exhaustive: “Nothing left out”
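MECE
Example Case: Sales in a Bakery

This is not MECE. Why?
Distribution of Orders Per Day (# of orders): 1-5, 5-10, 10-15, 15-20
The buckets overlap at 5, 10 and 15, so they are not mutually exclusive. Days with 0 orders or more than 20 orders have no bucket, so they are not collectively exhaustive.

This is MECE.
Distribution of Orders Per Day (# of orders): 0, 1-5, 6-10, 11-15, 16-20, >20
Each day falls into exactly one bucket (mutually exclusive) and every possible count has a bucket (collectively exhaustive).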
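
One way to sketch the MECE buckets in code is `pd.cut`, which assigns each value to exactly one interval. The daily order counts below are made up, and the bin edges are an assumed encoding of the buckets 0, 1-5, 6-10, 11-15, 16-20 and >20.

```python
import pandas as pd

# Hypothetical daily order counts, including the edge cases 0, 5, 10 and 20
orders_per_day = pd.Series([0, 3, 5, 10, 12, 20, 25])

# Each interval is open on the left and closed on the right: (-1, 0], (0, 5], ...
edges = [-1, 0, 5, 10, 15, 20, float("inf")]
labels = ["0", "1-5", "6-10", "11-15", "16-20", ">20"]

buckets = pd.cut(orders_per_day, bins=edges, labels=labels)

# Collectively exhaustive: no value is left without a bucket
all_assigned = bool(buckets.notna().all())
print(buckets.astype(str).tolist())
```

Because the intervals do not overlap, a boundary value such as 5 lands in exactly one bucket, which is the mutually exclusive half of MECE.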
MECE
Example Case: Sales in a Bakery

This is not MECE. Why?
Revenue declined last month
● Orders per day decreasing
● Lowered prices
● Reduced store opening hours
● Negative review in online forum
Reduced opening hours and the negative review are causes of the drop in orders, so they overlap with it instead of standing beside it.

This is MECE.
Revenue declined last month
● Orders per day decreasing
  ○ Reduced store opening hours
  ○ Negative review in online forum
● Lowered prices incorrectly
Pyramid Structure

1. Every single document will always be structured to support 1 single answer
○ Start with the answer first
2. Ideas at any level in the pyramid must always be summaries of the ideas grouped below them
3. Ideas in each grouping must always be the same kind of idea
4. Ideas in each grouping must always be logically ordered
5. Ideas in each grouping must always be MECE
Thank you!
Do you have any questions?
da@tokopedia.com
tokopedia.com

If you’re interested in being part of Tokopedia’s data team that strives together in transforming data to knowledge (and having fun while doing it), visit:
bit.ly/dataanalytics-hiring (For Data Analytics)
bit.ly/datascience-hiring (For Data Science)
