PYTHON PANDAS
Introduction
 Pandas is a Python package providing fast, flexible,
and expressive data structures designed to make
working with relational or labeled data both easy
and intuitive.
 It aims to be the fundamental high-level building
block for doing practical, real world data analysis in
Python.
 It has the broader goal of becoming the most
powerful and flexible open source data analysis /
manipulation tool available in any language.
 The name Pandas is derived from "Panel Data", an
econometrics term for multidimensional data.
 The Pandas library provides high-performance, easy-to-use
data structures and data analysis tools for the Python
programming language. Python with Pandas is used in a
wide range of academic and commercial domains, including
finance, economics, statistics, analytics, etc.
 Before Pandas, Python was mainly used for data munging and
preparation; it contributed little to data analysis itself.
Pandas solved this problem.
 Using Pandas, we can accomplish five typical steps in
the processing and analysis of data, regardless of the
origin of the data (a sketch follows this list):
 load
 prepare
 manipulate
 model, and
 analyze.
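As a quick illustration, here is a minimal sketch mapping these steps onto Pandas calls (the file sales.csv and all column names are hypothetical):
import pandas as pd
# load: read raw data from a file (hypothetical file and columns)
df = pd.read_csv('sales.csv')
# prepare: drop rows with missing values
df = df.dropna()
# manipulate: add a derived column
df['total'] = df['price'] * df['quantity']
# model / analyze: aggregate by group
print(df.groupby('region')['total'].sum())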
Pandas Features
 Fast and efficient DataFrame object with default and
customized indexing.
 Tools for loading data into in-memory data objects from
different file formats.
 Data alignment and integrated handling of missing data.
 Reshaping and pivoting of data sets.
 Label-based slicing, indexing and subsetting of large data sets.
 Columns from a data structure can be deleted or inserted.
 Group by data for aggregation and transformations.
 High performance merging and joining of data.
 Time Series functionality.
Installation of Pandas
 Python Anaconda is a free Python distribution with SciPy
stack and Spyder IDE for Windows OS.
 It is also available for Linux and Mac.
 The standard Python distribution doesn't come bundled with
the Pandas module. A lightweight alternative is to install
Pandas using the popular Python package installer, pip:
C:\Users\Sony>pip install pandas
Highlights of Pandas
 A fast and efficient DataFrame object for data manipulation with
integrated indexing;
 Tools for reading and writing data between in-memory data
structures and different formats: CSV and text files, Microsoft Excel,
SQL databases, and the fast HDF5 format;
 Intelligent data alignment and integrated handling of missing data:
gain automatic label-based alignment in computations and easily
manipulate messy data into an orderly form;
 Flexible reshaping and pivoting of data sets;
 Intelligent label-based slicing, fancy indexing, and subsetting of
large data sets;
 Columns can be inserted and deleted from data structures for size
mutability.
Dataset in Pandas
 Pandas deals with the following three data structures −
 Series
 DataFrame
 Panel
 These data structures are built on top of the NumPy array.
 All Pandas data structures are value mutable (their contents
can be changed). All except Series are also size mutable;
a Series is size immutable.
 DataFrame is widely used and one of the most
important data structures. Panel is used much less.
Series
 Series is a one-dimensional array like structure with
homogeneous data. For example, the following series is a
collection of integers 10, 23, 56.
Panel
 Panel is a three-dimensional data structure with
heterogeneous data. A panel is hard to represent
graphically, but it can be illustrated as a container of
DataFrames. (Note: Panel has been removed in recent
versions of Pandas; a DataFrame with a MultiIndex is
used instead.)
DataFrame
 DataFrame is a two-dimensional array with heterogeneous
data. For example,
 The table represents the data of a sales team of an
organization with their overall performance rating. The data is
represented in rows and columns. Each column represents an
attribute and each row represents a person.
Name   Age  Gender  Rating
Steve  32   Male    3.45
Lia    28   Female  4.6
Vin    45   Male    3.9
Katie  38   Female  2.78
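The same table can be built directly as a DataFrame; a minimal sketch:
import pandas as pd
# Build the sales-team table shown above
data = {'Name': ['Steve', 'Lia', 'Vin', 'Katie'],
        'Age': [32, 28, 45, 38],
        'Gender': ['Male', 'Female', 'Male', 'Female'],
        'Rating': [3.45, 4.6, 3.9, 2.78]}
df = pd.DataFrame(data)
print(df)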
Series
 Series is a one-dimensional labeled array capable of
holding data of any type (integer, string, float, python
objects). The axis labels are collectively called index.
 A series can be created using various inputs like −
 Array
 Dict
Create a Series by array
 If data is an ndarray, then the index passed must be of
the same length. If no index is passed, then the default
index will be range(n), where n is the array length,
i.e., 0, 1, 2, ..., len(array)-1.
 Ex: series_1.py
import pandas as pd
import numpy as np
data = np.array(['a','b','c','d'])
s = pd.Series(data)
print(s)
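For reference, running series_1.py prints the values with the default integer index:
0    a
1    b
2    c
3    d
dtype: object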
Create a Series from dict
 A dict can be passed as input. If no index is specified,
the dictionary keys are used to construct the index (in
older versions of Pandas the keys were sorted; in recent
versions their insertion order is preserved).
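A minimal sketch (the keys and values are illustrative):
import pandas as pd
# Dictionary keys become the index labels
data = {'a': 0.0, 'b': 1.0, 'c': 2.0}
s = pd.Series(data)
print(s)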
DataFrames
 A DataFrame is a two-dimensional data structure, i.e., data
is aligned in a tabular fashion in rows and columns.
Features of DataFrame
 Potentially columns are of different types
 Size – Mutable
 Labeled axes (rows and columns)
 Can Perform Arithmetic operations on rows and columns
 A pandas DataFrame can be created using the following
constructor −
pandas.DataFrame( data, index, columns, dtype, copy)
Create DataFrame
Pandas DataFrame can be created using various inputs like :
 Lists
 dict
 Series
 Syntax:
import pandas as pd
df = pd.DataFrame()
print (df)
The above syntax will generate an empty dataframe
with no columns and no index
Dataframe using Lists
 The DataFrame can be created using a single list or a
list of lists.
 Syntax:
import pandas as pd
df = pd.DataFrame([1, 2, 3, 4, 5])
print(df)
The above creates a single-column dataframe from the list,
with a default integer index (0 to 4).
DataFrame using Dict of arrays & Lists
 All the arrays must be of the same length. If an index is
passed, then the length of the index should be equal
to the length of the arrays.
 If no index is passed, then by default the index will be
range(n), where n is the array length.
 A list of dictionaries can also be passed as input data to
create a DataFrame. The dictionary keys are by
default taken as column names.
 We can also create a DataFrame with a list of
dictionaries, row indices, and column indices.
 Note: Here the df2 DataFrame is created with a
column index that is not among the dictionary keys,
so that column is filled with NaN's; whereas df1 is
created with column indices matching the dictionary
keys, so no NaN's are appended.
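The slides refer to df1 and df2 without showing the code; a sketch consistent with the note above (the data values are illustrative) might be:
import pandas as pd
# Dict of lists: keys become column names, default index is range(n)
data = {'Name': ['Tom', 'Jack'], 'Age': [28, 34]}
df = pd.DataFrame(data)
# List of dicts with explicit row and column indices
rows = [{'a': 1, 'b': 2}, {'a': 5, 'b': 10, 'c': 20}]
df1 = pd.DataFrame(rows, index=['first', 'second'], columns=['a', 'b'])
df2 = pd.DataFrame(rows, index=['first', 'second'], columns=['a', 'b1'])
print(df1)  # columns match the dict keys, so no NaN's
print(df2)  # 'b1' is not a dict key, so that column is all NaN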
DataFrame from Dict of Series
 Dictionary of Series can be passed to form a
DataFrame.
 The resultant index is the union of all the series
indexes passed.
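A minimal sketch (the column names 'one' and 'two' are illustrative):
import pandas as pd
# The resulting index is the union of both Series' indexes;
# positions missing from a Series are filled with NaN
d = {'one': pd.Series([1, 2, 3], index=['a', 'b', 'c']),
     'two': pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)
print(df)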
Dataset Manipulations
 Column wise manipulations in a dataframe
 We can perform Dataframe manipulations like:
Selecting required columns for display
Adding new columns
Deleting the columns
 Row wise manipulations in a dataframe
 We can do the following:
Row selection
Selecting using label
Selecting using integer location
Selecting using slicing
Addition of a row, and
Deletion of a row
 Dataset concatenating
 Dataset Merging
 Dataset Joining
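A combined sketch of these operations on a small hypothetical DataFrame:
import pandas as pd
df = pd.DataFrame({'one': [1, 2, 3], 'two': [10, 20, 30]},
                  index=['a', 'b', 'c'])
# Column-wise manipulations
print(df['one'])                     # select a column
df['three'] = df['one'] + df['two']  # add a new column
del df['two']                        # delete a column
# Row-wise manipulations
print(df.loc['b'])    # select by label
print(df.iloc[1])     # select by integer location
print(df[0:2])        # select by slicing
df = df.drop('a')     # delete a row by label
# Concatenating, merging and joining two DataFrames
left = pd.DataFrame({'id': [1, 2], 'name': ['x', 'y']})
right = pd.DataFrame({'id': [1, 2], 'score': [90, 80]})
print(pd.concat([left, right]))                    # stack rows
print(pd.merge(left, right, on='id'))              # SQL-style merge on a key
print(left.join(right.set_index('id'), on='id'))   # join on index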
Data Preprocessing
 In the real world, we usually come across lots of
raw data that is not fit to be readily processed by
machine learning algorithms. In other words, the raw
data must be preprocessed before it is fed into
machine learning algorithms.
Why preprocessing?
 Real world data are generally
Incomplete: lacking attribute values, lacking certain
attributes of interest, or containing only aggregate data
Noisy: containing errors or outliers
Inconsistent: containing discrepancies in codes or names
Tasks in data preprocessing
Data cleaning: fill in missing values, smooth noisy data,
identify or remove outliers, and resolve inconsistencies.
Data integration: using multiple databases, data cubes, or
files.
Data transformation: normalization and aggregation.
Data reduction: reducing the volume but producing the
same or similar analytical results.
Data discretization: part of data reduction, replacing
numerical attributes with nominal ones.
Data cleaning
 Fill in missing values (attribute or class value):
 Ignore the tuple: usually done when class label is missing.
 Use the attribute mean (or majority nominal value) to fill in the missing value.
 Use the attribute mean (or majority nominal value) for all samples belonging to the
same class.
 Predict the missing value by using a learning algorithm: consider the attribute with the
missing value as a dependent (class) variable and run a learning algorithm (usually
Bayes or decision tree) to predict the missing value.
 Identify outliers and smooth out noisy data:
 Binning
 Sort the attribute values and partition them into bins (see "Unsupervised discretization" below);
 Then smooth by bin means, bin median, or bin boundaries.
 Clustering: group values in clusters and then detect and remove outliers (automatic or
manual)
 Regression: smooth by fitting the data into regression functions.
 Correct inconsistent data: use domain knowledge or expert decision.
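A minimal Pandas sketch of mean imputation and bin-based smoothing (the age values are hypothetical):
import pandas as pd
import numpy as np
df = pd.DataFrame({'age': [23, 25, np.nan, 24, 31, 29]})
# Fill the missing value with the attribute mean
df['age'] = df['age'].fillna(df['age'].mean())
# Binning: partition values into equal-width bins, then smooth by bin means
df['bin'] = pd.cut(df['age'], bins=2)
df['smoothed'] = df.groupby('bin')['age'].transform('mean')
print(df)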
Data transformation
 Normalization:
 Scaling attribute values to fall within a specified range.
 Example: to transform V in [min, max] to V' in [0,1],
apply V'=(V-Min)/(Max-Min)
 Scaling by using mean and standard deviation (useful when min
and max are unknown or when there are
outliers): V'=(V-Mean)/StDev
 Aggregation: moving up in the concept hierarchy on numeric
attributes.
 Generalization: moving up in the concept hierarchy on nominal
attributes.
 Attribute construction: replacing or adding new attributes
inferred by existing attributes.
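Both normalizations are one-liners in Pandas; a minimal sketch with illustrative values:
import pandas as pd
v = pd.Series([10, 20, 30, 40, 50])
# Min-max normalization: V' = (V - Min) / (Max - Min), maps onto [0, 1]
v_minmax = (v - v.min()) / (v.max() - v.min())
# Z-score normalization: V' = (V - Mean) / StDev
v_zscore = (v - v.mean()) / v.std()
print(v_minmax)
print(v_zscore)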
Data reduction
 Reducing the number of attributes
 Data cube aggregation: applying roll-up, slice or dice operations.
 Removing irrelevant attributes: attribute selection (filtering and
wrapper methods), searching the attribute space
 Principal component analysis (numeric attributes only): searching for
a lower-dimensional space that can best represent the data.
 Reducing the number of attribute values
 Binning (histograms): reducing the number of attributes by grouping
them into intervals (bins).
 Clustering: grouping values in clusters.
 Aggregation or generalization
 Reducing the number of tuples
 Sampling
Discretization and generating concept hierarchies
 Unsupervised discretization - class variable is not used.
 Equal-interval (equiwidth) binning: split the whole range of numbers in
intervals with equal size.
 Equal-frequency (equidepth) binning: use intervals containing equal
number of values.
 Supervised discretization - uses the values of the class variable.
 Using class boundaries. Three steps:
 Sort values.
 Place breakpoints between values belonging to different classes.
 If too many intervals, merge intervals with equal or similar class distributions.
 Entropy (information)-based discretization.
Generating concept hierarchies: recursively applying partitioning or
discretization methods.
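In Pandas, pd.cut performs equal-interval binning and pd.qcut performs equal-frequency binning; a minimal sketch with illustrative values:
import pandas as pd
values = pd.Series([1, 3, 4, 5, 6, 7, 12, 13, 15])
# Equal-interval (equiwidth) binning: 3 intervals of equal width
print(pd.cut(values, bins=3))
# Equal-frequency (equidepth) binning: 3 intervals with equal counts
print(pd.qcut(values, q=3))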
Missing Values in the array set or the Dataset
 Identifying the number of missing values in a dataset.
Function: data.isna() or data.isnull()
 These functions return a boolean DataFrame with True
wherever a value is missing.
 We can also count the number of null values per column
or per row:
Function: data.isnull().sum() or data.isna().sum()
data.isnull().sum(axis=0) [column level] /
data.isnull().sum(axis=1) [row level]
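A minimal sketch with an illustrative DataFrame:
import pandas as pd
import numpy as np
data = pd.DataFrame({'A': [1, np.nan, 3], 'B': [np.nan, np.nan, 6]})
print(data.isnull())              # element-wise True where a value is missing
print(data.isnull().sum())        # missing values per column (axis=0, default)
print(data.isnull().sum(axis=1))  # missing values per row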