Data Analysis With Python
Dr. Amar Singh
Professor
Lovely Professional University
Libraries in Python
• Scientific Computing Libraries
  • NumPy
  • Pandas
  • SciPy
• Visualization Libraries
  • Matplotlib
  • Seaborn
• Algorithmic Libraries
  • Scikit-learn
Importing Data in Python
• Importing is the process of loading or reading data from different
sources.
• The data may be in different formats.
• .csv, .json, .xlsx
• The path to the dataset can be written as:
• C:\\mydata\\data.csv
• To read a CSV file, use the following command:
• pd.read_csv("C:\\mydata\\data.csv")
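The read step can be tried without a file on disk. The sketch below is only an illustration: the CSV text and its column names are hypothetical, and an in-memory buffer stands in for the file path.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a file such as C:\mydata\data.csv
csv_text = """EmployeeName,JobTitle,BasePay
Alice,Engineer,85000
Bob,Analyst,62000
Carol,Engineer,91000
"""

# pd.read_csv accepts a file path or any file-like object
df = pd.read_csv(io.StringIO(csv_text))
print(df)

# With a real file, a raw string avoids doubling the backslashes:
# df = pd.read_csv(r"C:\mydata\data.csv")
```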
Libraries
Check data type
• Dataframe.dtypes
Printing Dataframe
• df["BasePay"]  # selects only the BasePay column
• df.head(n)  # shows the first n rows of the data frame
• df.tail(n)  # shows the last n rows of the data frame
• df.dtypes  # shows the data type of each column
Dataframe.describe()
• Returns a statistical summary of the numeric columns
Dataframe.describe(include="all")
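The inspection commands above can be tried together on a small data frame. This is a minimal sketch; the data and the column names other than BasePay are made up for illustration.

```python
import pandas as pd

# Hypothetical data for illustration
df = pd.DataFrame({
    "JobTitle": ["Engineer", "Analyst", "Engineer", "Manager"],
    "BasePay": [85000.0, 62000.0, 91000.0, 105000.0],
    "Year": [2011, 2012, 2013, 2014],
})

print(df["BasePay"])   # a single column (a Series)
print(df.head(2))      # first 2 rows
print(df.tail(2))      # last 2 rows
print(df.dtypes)       # data type of each column

numeric_summary = df.describe()            # numeric columns only
full_summary = df.describe(include="all")  # also covers non-numeric columns
print(full_summary)
```

With include="all", the summary gains rows such as unique, top and freq for the text column, while the numeric statistics are left empty (NaN) for it.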
Data Pre-processing
• Pre-processing converts raw data into a format suitable for
further data analysis.
• Also known as data cleaning or data wrangling.
Data-Preprocessing
• Deal with missing values
• Data Formatting
• Data Normalization
• Converting Categorical Values to Numerical Values
Missing Values
• A value is missing when nothing is stored for a column in an observation.
• May be represented as ?, NA, or a blank cell.
How to deal with missing data
• Drop missing values
• Drop the variable
• Replace missing values with the average (numeric data) or the most frequent value (categorical data).
• Leave it as missing data.
How to drop missing values in Python
• Use dataframe.dropna()
How to replace a missing value with a new value?
• df.replace(missing_value, new_value)
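The strategies above can be sketched as follows. The data frame, its column names and the "?" placeholder are assumptions chosen for illustration; fillna is used here as one common way to do the replacement.

```python
import numpy as np
import pandas as pd

# Hypothetical data: "?" and None mark missing entries
df = pd.DataFrame({
    "price": ["13495", "?", "16500", "13950"],
    "doors": ["two", "four", None, "four"],
})

# Convert the "?" placeholder to NaN so pandas recognises it as missing
df = df.replace("?", np.nan)

# Option 1: drop every row that contains a missing value
dropped = df.dropna()

# Option 2: fill numeric gaps with the column mean
filled = df.copy()
filled["price"] = filled["price"].astype(float)
filled["price"] = filled["price"].fillna(filled["price"].mean())

# Option 3: fill categorical gaps with the most frequent value
filled["doors"] = filled["doors"].fillna(filled["doors"].mode()[0])

print(dropped)
print(filled)
```

Dropping rows is the simplest option but discards data; filling keeps every observation at the cost of introducing estimated values.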
Data Formatting
• Data are usually collected from different sources and stored in
different formats.
• Bringing data into a common standard of expression allows users to make
meaningful comparisons.
Incorrect Data Types
• Sometimes the wrong data type is assigned to a column.
Apply calculations to an entire column
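Both ideas, correcting a data type and applying a calculation to a whole column, can be sketched in a few lines. The column names and values are hypothetical, and the constant 235 is the usual rounded factor for converting miles per gallon to litres per 100 km.

```python
import pandas as pd

# Hypothetical data: "price" was read as text instead of as a number
df = pd.DataFrame({
    "price": ["13495", "16500", "13950"],
    "city-mpg": [21, 19, 24],
})
print(df.dtypes)

# Fix the data type of the column
df["price"] = df["price"].astype("int64")

# Apply a calculation to the entire column at once:
# miles per gallon -> litres per 100 km (approximate factor 235)
df["city-L/100km"] = 235 / df["city-mpg"]
print(df)
```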
Data Normalization
• Normalization is the process of transforming values of several
variables into a similar range.
• Normalized values typically range from 0 to 1.
Normalization
Age Income
20 20000
25 45000
37 28000
• Age and income are in different ranges.
• Hard to compare.
• “Income” will influence the results more.
Methods for normalization
Simple feature scaling
• df['length'] = df['length']/df['length'].max()
• df['width'] = df['width']/df['width'].max()
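Simple feature scaling can be tried on the Age and Income values from the table above. This is a minimal sketch that scales every column in one step by dividing by the column maximum.

```python
import pandas as pd

# Values taken from the Age / Income example
df = pd.DataFrame({
    "Age": [20, 25, 37],
    "Income": [20000, 45000, 28000],
})

# Simple feature scaling: divide each column by its own maximum
scaled = df / df.max()
print(scaled)
```

After scaling, the largest value in each column is 1, so Age and Income are on comparable ranges.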
Categorical
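One common way to convert categorical values to numerical values is to create one indicator column per category. The sketch below uses pd.get_dummies for this; the choice of method and the "fuel" column are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical categorical column
df = pd.DataFrame({"fuel": ["gas", "diesel", "gas", "gas"]})

# One indicator (0/1) column per category
dummies = pd.get_dummies(df["fuel"], dtype=int)

# Attach the indicator columns to the original data frame
df = pd.concat([df, dummies], axis=1)
print(df)
```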
Exploratory Data Analysis (EDA)
• Preliminary step in data analysis.
• Get a better understanding of the data set.
• Summarize the main characteristics of the data set.
• Uncover relationships between different variables.
• Extract important variables.
Descriptive Statistics
• Describes the basic features of the data.
• Gives short summaries of the sample and measures of the data set.
Descriptive Statistics
• df.describe()
• df.value_counts()
• Summarizing categorical data
• Example: df["drive-wheels"].value_counts()
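A small sketch of summarizing a categorical column with value_counts(). The values in the "drive-wheels" column are made up for illustration.

```python
import pandas as pd

# Hypothetical categorical data
df = pd.DataFrame({
    "drive-wheels": ["fwd", "rwd", "fwd", "4wd", "fwd", "rwd"],
})

# Count how often each category occurs, most frequent first
counts = df["drive-wheels"].value_counts()
print(counts)
```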
Scatterplot
• Represents the relationships between variables
• Predictor variable on x-axis.
• Target variable on y-axis.
Scatterplot : Example
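A scatterplot of this kind can be drawn with Matplotlib. The sketch below is only an illustration: the data and the column names "engine-size" and "price" are hypothetical.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical data: engine size as predictor, price as target
df = pd.DataFrame({
    "engine-size": [97, 109, 130, 152, 183, 209],
    "price": [7100, 9500, 13500, 16500, 21000, 30700],
})

fig, ax = plt.subplots()
ax.scatter(df["engine-size"], df["price"])  # predictor on x, target on y
ax.set_xlabel("engine-size (predictor)")
ax.set_ylabel("price (target)")
ax.set_title("Price vs. engine size")
# Call plt.show() to display the figure in an interactive session
```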
Grouping Data
• The groupby method is used to group the data.
• Can be applied on categorical variables.
• Groups the data into categories.
• Example:
• test_Data1 = test_Data.groupby('JobTitle')
• test_Data1.mean(numeric_only=True)
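The grouping example can be run on a small stand-in for test_Data. The rows below are hypothetical; numeric_only=True restricts the mean to numeric columns, which recent pandas versions require when text columns are present.

```python
import pandas as pd

# Hypothetical stand-in for test_Data
test_Data = pd.DataFrame({
    "JobTitle": ["Engineer", "Analyst", "Engineer", "Analyst"],
    "BasePay": [85000, 62000, 91000, 58000],
})

# Group the rows by job title, then average the numeric columns per group
test_Data1 = test_Data.groupby("JobTitle")
means = test_Data1.mean(numeric_only=True)
print(means)
```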
Correlation
• A measure of the extent of interdependence between variables.
• 1: Total positive linear correlation.
• 0: No linear correlation; the two variables show no linear relationship (a non-linear one is still possible).
• -1: Total negative linear correlation.
• df.corr()
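The two extreme cases can be reproduced with a tiny data frame. This is a minimal sketch with made-up columns: "up" rises exactly with x and "down" falls exactly with x.

```python
import pandas as pd

# Hypothetical data with perfectly linear relationships
df = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "up": [2, 4, 6, 8, 10],     # rises exactly with x
    "down": [10, 8, 6, 4, 2],   # falls exactly with x
})

# Pairwise Pearson correlation between all numeric columns
corr = df.corr()
print(corr)
```

The entry for x and "up" comes out as total positive correlation, and the entry for x and "down" as total negative correlation.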
Correlation using scatter plot
Exercise
• Import SocialAds.csv dataset.
• Show first five rows of the dataset.
• Show last five rows of the dataset.
• Give the statistical description of the dataset.
• Count the number of males and females in the dataset.