Lecture 1 Pandas Basics.pptx machine learning | PPTX
DISCOVER . LEARN . EMPOWER
Lecture – 1
Pandas Basics
APEX INSTITUTE OF TECHNOLOGY
DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING
MACHINE LEARNING (22CSH-286)
Faculty: Prof. (Dr.) Madan Lal Saini(E13485)
Machine Learning: Course Objectives
COURSE OBJECTIVES
The course aims to:
1. Understand and apply various data handling and visualization techniques.
2. Understand some basic learning algorithms and techniques and their
applications, as well as general questions related to analysing and handling large data
sets.
3. Develop skills in supervised and unsupervised learning techniques and
implement them to solve real-life problems.
4. Develop basic knowledge of machine learning techniques to build an intelligent
machine that makes decisions on behalf of humans.
5. Develop skills for selecting an algorithm and model parameters and applying them to
design optimized machine learning applications.
COURSE OUTCOMES
On completion of this course, the students shall be able to:
CO1: Describe and apply various data pre-processing and visualization techniques on a dataset.
CO2: Understand some basic learning algorithms and analyse their applications, as
well as general questions related to analysing and handling large data sets.
CO3: Describe machine learning techniques to build an intelligent machine for making
decisions on behalf of humans.
CO4: Develop supervised and unsupervised learning techniques and implement them to
solve real-life problems.
CO5: Analyse the performance of a machine learning model and apply optimization techniques to
improve the performance of the model.
Unit-1 Syllabus
Unit-1 Data Pre-processing Techniques
Data Pre-Processing: Data Frame Basics, CSV File, Libraries for Pre-processing, Handling
Missing data, Encoding Categorical data, Feature Scaling, Handling Time Series data.
Feature Extraction: Dimensionality Reduction: Feature Selection Techniques, Feature
Extraction Techniques; Data Transformation, Data Normalization.
Data Visualization: Different types of plots, Plotting fundamentals using Matplotlib, Plotting
fundamentals using Seaborn.
SUGGESTIVE READINGS
TEXT BOOKS:
• T1: Tom M. Mitchell, “Machine Learning”, McGraw Hill, International Edition, 2018.
• T2: Ethem Alpaydin, “Introduction to Machine Learning”, Eastern Economy Edition, Prentice Hall of
India, 2015.
• T3: Andreas C. Müller, Sarah Guido, “Introduction to Machine Learning with Python”, O’Reilly
(2018).
REFERENCE BOOKS:
• R1: Sebastian Raschka, Vahid Mirjalili, “Python Machine Learning”, Packt Publishing (2019).
• R2: Aurélien Géron, “Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow”, Wiley,
2nd Edition, 2022.
• R3: Christopher Bishop, “Pattern Recognition and Machine Learning”, Illustrated Edition, Springer,
2016.
Table of Contents
• Introduction to Pandas
• Data frame
• Series
• Operation
• Plots
Data Structures
• Series: a one-dimensional labeled array capable of holding data
of any type (integer, string, float, Python objects, etc.). A pandas Series
is comparable to a single column in an Excel sheet.
• import numpy as np
• import pandas as pd
• data = np.array(['d', 'e', 'e', 'k', 's', 'h', 'a'])
• ser = pd.Series(data)
• Data Frame: a two-dimensional, size-mutable, potentially heterogeneous
tabular data structure with labeled axes (rows and columns).
• d = pd.date_range('20200301', periods=10)
• df = pd.DataFrame(np.random.randn(10, 4), index=d, columns=['A', 'B', 'C', 'D'])
…continued
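df.head()                                    # first five rows
df.columns                                   # column labels
df.index                                     # row labels
df.describe()                                # summary statistics per numeric column
df.sort_values(by='C')                       # sort rows by the values in column C
df[0:3]                                      # first three rows
df.loc['20200301':'20200306', ['D', 'C']]    # label-based selection
df.iloc[3:5, 0:2]                            # position-based selection
df[df['A'] > 0]                              # rows where column A is positive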
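The selection calls above can be tried on a small reproducible frame. The following is a minimal sketch, assuming deterministic values in place of the random ones from the slide; the variable names `by_label`, `by_position` and `positive_a` are illustrative only.

```python
import numpy as np
import pandas as pd

# Ten consecutive days starting 2020-03-01, as on the slide
dates = pd.date_range("20200301", periods=10)
# Assumption: fixed values instead of np.random.randn, so results are reproducible
df = pd.DataFrame(np.arange(40).reshape(10, 4) - 10, index=dates, columns=list("ABCD"))

# Label-based slice: both the start and the stop label are included
by_label = df.loc["20200301":"20200306", ["D", "C"]]
# Position-based slice: the stop position is excluded
by_position = df.iloc[3:5, 0:2]
# Boolean filter: keep only rows where column A is positive
positive_a = df[df["A"] > 0]

print(by_label.shape)
print(by_position.shape)
print(len(positive_a))
```

Note the difference between the two slices: the label slice returns six rows for six dates, while the position slice `3:5` returns only two rows.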
Handle Missing Values
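Missing data (null values) in a dataset can cause errors or biased results in later
stages of the data science life cycle.
It is therefore important to handle missing data deliberately: detect it first, then
either drop it or fill it.
• Ex.
• df.isnull().count()   # number of entries per column (missing or not)
• df.isnull().sum()     # number of missing values per column
• df.dropna()           # drop rows that contain any missing value
• df.fillna(value=2)    # replace every missing value with 2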
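A short sketch of these calls on a tiny frame may help. The data and variable names below are made up for illustration, and filling with the column mean is shown only as one common alternative to a constant.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a few missing entries
df = pd.DataFrame({
    "A": [1.0, np.nan, 3.0],
    "B": [np.nan, np.nan, 6.0],
    "C": [7.0, 8.0, 9.0],
})

missing_per_column = df.isnull().sum()   # count of NaN values in each column
dropped = df.dropna()                    # keep only rows without any NaN
filled_const = df.fillna(value=2)        # replace NaN with a constant
filled_mean = df.fillna(df.mean())       # replace NaN with each column's mean

print(missing_per_column)
print(dropped)
print(filled_mean)
```

Dropping rows is simple but discards data (here only one of three rows survives), so filling is often preferred when many rows have at least one gap.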
Series
data = np.array(['a', 'b', 'c', 'd'])
s = pd.Series(data, index=[100, 101, 102, 103])
print(s)
Create Data Frame
A DataFrame can be created from any of the following inputs:
• List
• Dict
• Series
• NumPy ndarrays
• Another DataFrame
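As a rough illustration, each input type can be passed to the `pd.DataFrame` constructor. The column names and values here are invented for the sketch.

```python
import numpy as np
import pandas as pd

# From a list: one column, default integer index
from_list = pd.DataFrame([10, 20, 30], columns=["value"])
# From a dict of equal-length lists: keys become column names
from_dict = pd.DataFrame({"name": ["Alex", "Bob"], "age": [10, 12]})
# From a Series (wrapped in a dict): the Series index becomes the row index
from_series = pd.DataFrame({"one": pd.Series([1, 2, 3], index=["a", "b", "c"])})
# From a NumPy ndarray: rows and columns follow the array shape
from_ndarray = pd.DataFrame(np.zeros((2, 3)), columns=["x", "y", "z"])
# From another DataFrame: copy=True requests an independent copy
from_frame = pd.DataFrame(from_dict, copy=True)

for frame in (from_list, from_dict, from_series, from_ndarray, from_frame):
    print(frame.shape)
```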
Data Frame Examples
# From a flat list: one column with a default integer index
data = [1, 2, 3, 4, 5]
df = pd.DataFrame(data)
print(df)
# From a list of lists, with explicit column names
data = [['Alex', 10], ['Bob', 12], ['Clarke', 13]]
df = pd.DataFrame(data, columns=['Name', 'Age'])
print(df)
# From a dict of Series: indexes are aligned, missing entries become NaN
d = {'one': pd.Series([1, 2, 3], index=['a', 'b', 'c']),
     'two': pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)
print(df['one'])
# Deleting columns
d = {'one': pd.Series([1, 2, 3], index=['a', 'b', 'c']),
     'two': pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd']),
     'three': pd.Series([10, 20, 30], index=['a', 'b', 'c'])}
df = pd.DataFrame(d)
print("Our dataframe is:")
print(df)
# using the del statement
print("Deleting the first column using DEL function:")
del df['one']
print(df)
# using the pop method (also returns the removed column)
print("Deleting another column using POP function:")
df.pop('two')
print(df)
Data Frame Functionality
Sr.No. Attribute or Method Description
1 T Transposes rows and columns.
2 axes Returns a list with the row axis labels and column axis labels as the only members.
3 dtypes Returns the dtypes in this object.
4 empty True if the NDFrame is entirely empty (no items), i.e. if any of the axes has length 0.
5 ndim Number of axes / array dimensions.
6 shape Returns a tuple representing the dimensionality of the DataFrame.
7 size Number of elements in the NDFrame.
8 values NumPy representation of the NDFrame.
9 head() Returns the first n rows.
10 tail() Returns the last n rows.
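The attributes in the table can be inspected on a small frame. This sketch reuses the Name/Age data from the earlier example.

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Alex", "Bob", "Clarke"], "Age": [10, 12, 13]})

print(df.T)          # transposed: columns become rows
print(df.axes)       # [row index, column index]
print(df.dtypes)     # data type of each column
print(df.empty)      # False, the frame holds data
print(df.ndim)       # 2 for any DataFrame
print(df.shape)      # (rows, columns)
print(df.size)       # rows * columns
print(df.values)     # underlying NumPy array
print(df.head(2))    # first two rows
print(df.tail(1))    # last row
```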
Continued..
• rename(): relabels an axis based
on some mapping (a dict or Series) or an arbitrary function.
• get_dummies(): pd.get_dummies() returns the DataFrame with one-hot encoded
values.
• loc: purely label-based
indexing. When slicing, both the start and the stop bound are included.
• iloc: purely
integer-position-based indexing. As in Python and NumPy, positions are 0-based and the stop bound is excluded.
df = pd.DataFrame(np.random.randn(8, 4), index=['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'], columns=['A', 'B', 'C', 'D'])
# Select specific rows and columns by passing lists of labels
print(df.loc[['a', 'b', 'f', 'h'], ['A', 'C']])
print(df.loc['a':'h'])
print(df.iloc[:4])
print(df.iloc[1:5, 2:4])
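The following sketch puts `rename`, `pd.get_dummies`, `loc` and `iloc` side by side on a toy frame. The `colour` and `price` columns are assumptions made up for the example.

```python
import pandas as pd

# Hypothetical frame with one categorical and one numeric column
df = pd.DataFrame(
    {"colour": ["red", "green", "red"], "price": [1.0, 2.5, 4.0]},
    index=["a", "b", "c"],
)

# rename: a dict maps old column names to new ones, a function relabels the index
renamed = df.rename(columns={"price": "cost"}, index=str.upper)
# get_dummies: one indicator column per category of "colour"
encoded = pd.get_dummies(df, columns=["colour"])
# loc slices by label and includes the stop label
loc_rows = df.loc["a":"b"]
# iloc slices by position and excludes the stop position
iloc_rows = df.iloc[0:1]

print(renamed)
print(encoded)
print(len(loc_rows), len(iloc_rows))
```

The last line shows the practical consequence of the two slicing conventions: the label slice returns two rows while the position slice returns one.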
More Functions..
Sr.No. Function Description
1 count() Number of non-null observations
2 sum() Sum of values
3 mean() Mean of Values
4 median() Median of Values
5 mode() Mode of values
6 std() Standard Deviation of the Values
7 min() Minimum Value
8 max() Maximum Value
9 abs() Absolute Value
10 prod() Product of Values
11 cumsum() Cumulative Sum
12 cumprod() Cumulative Product
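A quick way to get a feel for these functions is to call them on a short Series. The values below are arbitrary.

```python
import pandas as pd

s = pd.Series([1, 2, 2, 3, 4])

print(s.count())            # number of non-null observations
print(s.sum())              # sum of values
print(s.mean())             # arithmetic mean
print(s.median())           # middle value
print(s.mode().tolist())    # most frequent value(s), returned as a Series
print(s.std())              # sample standard deviation (ddof=1 by default)
print(s.min(), s.max())     # smallest and largest value
print(s.prod())             # product of values
print(s.cumsum().tolist())  # running sum
print(s.cumprod().tolist()) # running product
print(pd.Series([-1, 2]).abs().tolist())  # element-wise absolute value
```

Most of these reduce the Series to a single number, while `mode()`, `abs()`, `cumsum()` and `cumprod()` return a Series.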
Data Frame: filtering
To subset the data we can apply Boolean indexing, commonly
known as a filter. In the examples that follow, df is a table of professor salaries with
columns such as rank, sex and salary. For example, to select the rows in which the salary
value is greater than $120K:
In [ ]: #Select only those rows with a salary above 120000:
df_sub = df[ df['salary'] > 120000 ]
In [ ]: #Select only those rows that contain female professors:
df_f = df[ df['sex'] == 'Female' ]
Any comparison operator can be used to subset the data:
> greater; >= greater or equal;
< less; <= less or equal;
== equal; != not equal;
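The salary dataset itself is not included in the slides, so the sketch below builds a small stand-in with the same column names (`rank`, `sex`, `salary`). The rows are invented. It also shows how two conditions can be combined with `&`; each condition needs its own parentheses.

```python
import pandas as pd

# Hypothetical stand-in for the professor salary data used on the slides
df = pd.DataFrame({
    "rank": ["Prof", "AssocProf", "AsstProf", "Prof"],
    "sex": ["Male", "Female", "Female", "Female"],
    "salary": [150000, 110000, 90000, 130000],
})

# Single condition: salary above 120000
high_paid = df[df["salary"] > 120000]
# Two conditions combined with & (element-wise AND)
high_paid_women = df[(df["salary"] > 120000) & (df["sex"] == "Female")]
# Inequality: everyone who is not a full professor
not_prof = df[df["rank"] != "Prof"]

print(high_paid)
print(high_paid_women)
print(not_prof)
```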
Data Frames groupby method
Using the "group by" method we can:
• Split the data into groups based on some criteria
• Calculate statistics (or apply a function) to each group
• Similar to group_by() in the dplyr package in R
In [ ]: #Group data using rank
df_rank = df.groupby(['rank'])
In [ ]: #Calculate mean value for each numeric column per each group
df_rank.mean(numeric_only=True)
Data Frames groupby method
Once the groupby object is created we can calculate various statistics for each group:
In [ ]: #Calculate mean salary for each professor rank:
df.groupby('rank')[['salary']].mean()
Note: If single brackets are used to specify the column (e.g. ['salary']), then the output is a pandas Series object.
When double brackets are used the output is a DataFrame.
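The single versus double bracket distinction is easy to check directly. This is a minimal sketch on invented salary rows, since the original dataset is not part of the slides.

```python
import pandas as pd

# Hypothetical stand-in for the professor salary data
df = pd.DataFrame({
    "rank": ["Prof", "AssocProf", "AsstProf", "Prof"],
    "salary": [150000, 110000, 90000, 130000],
})

# Single brackets select one column, so the result is a Series
mean_series = df.groupby("rank")["salary"].mean()
# Double brackets select a list of columns, so the result is a DataFrame
mean_frame = df.groupby("rank")[["salary"]].mean()

print(type(mean_series).__name__)
print(type(mean_frame).__name__)
print(mean_series["Prof"])
```

Both results hold the same numbers; the choice mainly matters for what is done with the result next, for example whether further columns will be added to it.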
Data Frames groupby method
groupby performance notes:
- No grouping/splitting occurs until it's needed. Creating the groupby object
only verifies that you have passed a valid mapping.
- By default the group keys are sorted during the groupby operation. You may
want to pass sort=False for a potential speedup:
In [ ]: #Calculate mean salary for each professor rank, without sorting the group keys:
df.groupby(['rank'], sort=False)[['salary']].mean()
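The effect of `sort=False` on the result is visible in the order of the group keys. The sketch below uses invented rows and does not measure speed; it only shows the ordering difference.

```python
import pandas as pd

# Hypothetical stand-in for the professor salary data
df = pd.DataFrame({
    "rank": ["Prof", "AsstProf", "Prof", "AssocProf"],
    "salary": [150000, 90000, 130000, 110000],
})

# Default: group keys are sorted alphabetically
sorted_means = df.groupby("rank")[["salary"]].mean()
# sort=False: group keys keep the order of first appearance in the data
unsorted_means = df.groupby("rank", sort=False)[["salary"]].mean()

print(sorted_means.index.tolist())
print(unsorted_means.index.tolist())
```

The computed means are identical in both cases; only the row order of the result differs.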
Graphics to explore the data
To show graphs within a Python (Jupyter) notebook, include the inline directive:
In [ ]: %matplotlib inline
The Seaborn package is built on matplotlib but provides a high-level
interface for drawing attractive statistical graphics, similar to the ggplot2
library in R. It specifically targets statistical data visualization.
Graphics
Function Description
distplot histogram (deprecated in newer Seaborn releases; histplot or displot replace it)
barplot estimate of central tendency for a numeric variable
violinplot similar to boxplot, also shows the probability density of the
data
jointplot scatterplot
regplot regression plot
pairplot pairplot
boxplot boxplot
swarmplot categorical scatterplot
factorplot general categorical plot (renamed catplot in newer Seaborn releases)
Key Features
• Fast and efficient DataFrame object with default and customized
indexing.
• Tools for loading data into in-memory data objects from different file
formats.
• Data alignment and integrated handling of missing data.
• Reshaping and pivoting of data sets.
• Label-based slicing, indexing and subsetting of large data sets.
• Columns from a data structure can be deleted or inserted.
• Group by data for aggregation and transformations.
• High performance merging and joining of data.
• Time Series functionality.
Questions?
• How Do You Handle Missing or Corrupted Data in a Dataset?
• How Can You Choose a Classifier Based on a Training Set Data Size?
• What Are the Three Stages of Building a Model in Machine Learning?
• What Are the Different Types of Machine Learning?
• What is ‘training Set’ and ‘test Set’ in a Machine Learning Model? How
Much Data Will You Allocate for Your Training, Validation, and Test Sets?
References
Book:
• Ethem Alpaydin, “Introduction to Machine Learning”, Eastern Economy Edition, Prentice
Hall of India, 2015.
• Andreas C. Müller, Sarah Guido, “Introduction to Machine Learning with Python”, O’Reilly
(2018).
Research Paper:
• Bi, Qifang, et al. "What is machine learning? A primer for the epidemiologist." American
journal of epidemiology 188.12 (2019): 2222-2239.
• Jordan, Michael I., and Tom M. Mitchell. "Machine learning: Trends, perspectives, and
prospects." Science 349.6245 (2015): 255-260.
Websites:
• https://www.geeksforgeeks.org/machine-learning/
• https://www.javatpoint.com/machine-learning
Videos:
• https://www.youtube.com/playlist?list=PLIg1dOXc_acbdJo-AE5RXpIM_rvwrerwR
THANK YOU
For queries
Email: madan.e13485@cumail.in
