EDA - Session-2 - Data Frame Basics-2

The document discusses exploring and cleaning a dataset containing 25480 rows and 12 columns using Pandas. It imports necessary libraries, reads the data, and examines the number of rows and columns, data types, missing values, and duplicates. Key steps taken include viewing the head and tail, shapes, dtypes, selecting numeric/categorical columns, checking for nulls, and dropping duplicates.

Uploaded by

jeeshu048

Import the packages

In [1]: import numpy as np

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

Read the data

In [3]: path=r"C:\Users\omkar\OneDrive\Documents\Data science\Naresh IT\Datafiles\Vi

df=pd.read_csv(path)
df

Out[3]:
case_id continent education_of_employee has_job_experience requires_job_trainin

0 EZYV01 Asia High School N

1 EZYV02 Asia Master's Y

2 EZYV03 Asia Bachelor's N

3 EZYV04 Asia Bachelor's N

4 EZYV05 Africa Master's Y

... ... ... ... ...

25475 EZYV25476 Asia Bachelor's Y

25476 EZYV25477 Asia High School Y

25477 EZYV25478 Asia Master's Y

25478 EZYV25479 Asia Master's Y

25479 EZYV25480 Asia Bachelor's Y

25480 rows × 12 columns

 

head
First rows of the DataFrame (5 by default)

In [6]: # DataFrame name: df

# head() returns the first 5 rows by default; here we ask for only 2
df.head(2)

Out[6]:
case_id continent education_of_employee has_job_experience requires_job_training no_o

0 EZYV01 Asia High School N N

1 EZYV02 Asia Master's Y N

 

tail
Last rows of the DataFrame (5 by default)
In [9]: df.tail()

Out[9]:
case_id continent education_of_employee has_job_experience requires_job_trainin

25475 EZYV25476 Asia Bachelor's Y

25476 EZYV25477 Asia High School Y

25477 EZYV25478 Asia Master's Y

25478 EZYV25479 Asia Master's Y

25479 EZYV25480 Asia Bachelor's Y
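The behaviour of head and tail with and without an argument can be checked on a small frame. This is a minimal sketch: `demo` and its values are made up for illustration and are not the visa dataset.

```python
import pandas as pd

# Hypothetical 8-row frame (illustrative values only)
demo = pd.DataFrame({
    "case_id": [f"ID{i:02d}" for i in range(1, 9)],
    "value": range(8),
})

first_default = demo.head()   # no argument: first 5 rows
first_two = demo.head(2)      # first 2 rows
last_default = demo.tail()    # no argument: last 5 rows
last_three = demo.tail(3)     # last 3 rows

print(len(first_default), len(first_two), len(last_default), len(last_three))
# 5 2 5 3
```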

 

shape
Number of rows and number of columns

In [11]: df.shape

Out[11]: (25480, 12)

In [12]: print("The number of rows:",df.shape[0])

print("The number of columns:",df.shape[1])

The number of rows: 25480

The number of columns: 12

size
Total number of elements (rows × columns)

In [13]: df.size

Out[13]: 305760

In [14]: 25480*12

Out[14]: 305760
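The relationship between shape, size and len can be verified directly. A minimal sketch on a made-up 3 × 2 frame (`demo` is hypothetical):

```python
import pandas as pd

# Hypothetical frame with 3 rows and 2 columns
demo = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})

n_rows, n_cols = demo.shape          # shape is a (rows, columns) tuple
assert demo.size == n_rows * n_cols  # size counts every cell
assert len(demo) == n_rows           # len counts rows only
assert demo["a"].size == n_rows      # for a single column, size is its length

print(n_rows, n_cols, demo.size)
# 3 2 6
```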

columns
In [15]: df.columns # all the column names

Out[15]: Index(['case_id', 'continent', 'education_of_employee', 'has_job_experience',
'requires_job_training', 'no_of_employees', 'yr_of_estab',
'region_of_employment', 'prevailing_wage', 'unit_of_wage',
'full_time_position', 'case_status'],
dtype='object')

In [16]: type(df)

Out[16]: pandas.core.frame.DataFrame
In [17]: type(df.columns)

Out[17]: pandas.core.indexes.base.Index

dtypes
Data type of each column

In [18]: df.dtypes

# object usually means text (string) data, treated here as categorical
# anything other than object is numerical here (int or float)

Out[18]: case_id object

continent object
education_of_employee object
has_job_experience object
requires_job_training object
no_of_employees int64
yr_of_estab int64
region_of_employment object
prevailing_wage float64
unit_of_wage object
full_time_position object
case_status object
dtype: object

In [19]: type(df.dtypes)

Out[19]: pandas.core.series.Series

task-1
Extract the numerical and the categorical columns separately using the dtypes output

In [25]: # Convert the dtypes Series into a dictionary: {column name: dtype}

d1=dict(df.dtypes)

# Loop version of the same idea:
# for i in d1:
#     if d1[i]=='object':
#         print(i)

# List comprehensions: object columns are categorical, the rest are numerical
cat=[i for i in d1 if d1[i]=='object']
num=[i for i in d1 if d1[i]!='object']
In [26]: cat

Out[26]: ['case_id',
'continent',
'education_of_employee',
'has_job_experience',
'requires_job_training',
'region_of_employment',
'unit_of_wage',
'full_time_position',
'case_status']

In [28]: # Categorical columns available

df.select_dtypes(include='object').columns

Out[28]: Index(['case_id', 'continent', 'education_of_employee', 'has_job_experience',
'requires_job_training', 'region_of_employment', 'unit_of_wage',
'full_time_position', 'case_status'],
dtype='object')

In [29]: df.select_dtypes(exclude='object').columns

Out[29]: Index(['no_of_employees', 'yr_of_estab', 'prevailing_wage'], dtype='object')

In [ ]: # df has 12 columns
# df.select_dtypes(include='object') has 9 columns
# df.select_dtypes(exclude='object') has 3 columns
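As an alternative to matching on 'object', select_dtypes also accepts the generic "number" selector, which picks up every int and float column. A minimal sketch, assuming a made-up four-column frame (`demo` and its values are hypothetical):

```python
import pandas as pd

# Hypothetical frame with two text columns and two numeric columns
demo = pd.DataFrame({
    "case_id": ["ID01", "ID02"],
    "no_of_employees": [10, 20],
    "prevailing_wage": [100.0, 200.5],
    "has_job_experience": ["N", "Y"],
})

# Numeric columns (int and float), and everything else
num_cols = demo.select_dtypes(include="number").columns.tolist()
cat_cols = demo.select_dtypes(exclude="number").columns.tolist()

print(num_cols)  # ['no_of_employees', 'prevailing_wage']
print(cat_cols)  # ['case_id', 'has_job_experience']

# The two groups together cover every column exactly once
assert len(num_cols) + len(cat_cols) == demo.shape[1]
```

Note that a text column is not automatically a useful category: an identifier such as case_id is stored as text but has a different value on every row.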

isnull
Identify whether the data has any missing (null) values
In [31]: df.isnull()
# True means yes, the value is null
# False means no, the value is not null

Out[31]:
case_id continent education_of_employee has_job_experience requires_job_training no_o

0 False False False False False

1 False False False False False

2 False False False False False

3 False False False False False

4 False False False False False

... ... ... ... ... ...

25475 False False False False False

25476 False False False False False

25477 False False False False False

25478 False False False False False

25479 False False False False False

25480 rows × 12 columns
 

In [ ]: # When you open the file in Excel, some cells may be empty,

# which means that data is missing.
# When you read the file with pandas,
# those positions are shown as null (NaN).

In [32]: df.isnull().sum()

Out[32]: case_id 0
continent 0
education_of_employee 0
has_job_experience 0
requires_job_training 0
no_of_employees 0
yr_of_estab 0
region_of_employment 0
prevailing_wage 0
unit_of_wage 0
full_time_position 0
case_status 0
dtype: int64
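The dataset above has no missing values, so every count is 0. To see what the same calls report when values are missing, here is a minimal sketch on a made-up frame with a few gaps (`demo` and its values are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical frame with deliberately missing entries
demo = pd.DataFrame({
    "continent": ["Asia", None, "Africa", "Asia"],
    "prevailing_wage": [100.0, np.nan, np.nan, 250.0],
    "case_status": ["Certified", "Denied", "Certified", "Certified"],
})

missing_per_column = demo.isnull().sum()                  # count of nulls per column
missing_share = demo.isnull().mean() * 100                # percentage of nulls per column
total_missing = int(demo.isnull().sum().sum())            # nulls in the whole frame
rows_with_missing = int(demo.isnull().any(axis=1).sum())  # rows with at least one null

print(missing_per_column.to_dict())
# {'continent': 1, 'prevailing_wage': 2, 'case_status': 0}
print(total_missing, rows_with_missing)
# 3 2
```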

drop duplicates
Drop duplicate rows
In [33]: df.drop_duplicates()

Out[33]:
case_id continent education_of_employee has_job_experience requires_job_trainin

0 EZYV01 Asia High School N

1 EZYV02 Asia Master's Y

2 EZYV03 Asia Bachelor's N

3 EZYV04 Asia Bachelor's N

4 EZYV05 Africa Master's Y

... ... ... ... ...

25475 EZYV25476 Asia Bachelor's Y

25476 EZYV25477 Asia High School Y

25477 EZYV25478 Asia Master's Y

25478 EZYV25479 Asia Master's Y

25479 EZYV25480 Asia Bachelor's Y

25480 rows × 12 columns
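The result above still has 25480 rows, which suggests the data contains no fully duplicated rows. The sketch below, on a made-up frame with one repeated row, shows how duplicates can be counted first and that drop_duplicates returns a new DataFrame rather than changing the original (`demo` is hypothetical):

```python
import pandas as pd

# Hypothetical frame in which the row ("A2", "Africa") appears twice
demo = pd.DataFrame({
    "case_id": ["A1", "A2", "A2", "A3"],
    "continent": ["Asia", "Africa", "Africa", "Asia"],
})

n_dupes = int(demo.duplicated().sum())   # rows that repeat an earlier row
deduped = demo.drop_duplicates()         # new frame; demo itself is unchanged
by_continent = demo.drop_duplicates(subset="continent")  # compare one column only
deduped_reset = deduped.reset_index(drop=True)           # renumber rows 0..n-1

print(n_dupes, len(demo), len(deduped), len(by_continent))
# 1 4 3 2
```

To keep the result, assign it back, for example `demo = demo.drop_duplicates()`.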

 

info
In [35]: df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25480 entries, 0 to 25479
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 case_id 25480 non-null object
1 continent 25480 non-null object
2 education_of_employee 25480 non-null object
3 has_job_experience 25480 non-null object
4 requires_job_training 25480 non-null object
5 no_of_employees 25480 non-null int64
6 yr_of_estab 25480 non-null int64
7 region_of_employment 25480 non-null object
8 prevailing_wage 25480 non-null float64
9 unit_of_wage 25480 non-null object
10 full_time_position 25480 non-null object
11 case_status 25480 non-null object
dtypes: float64(1), int64(2), object(9)
memory usage: 2.3+ MB

In [36]: len(df)

Out[36]: 25480

head
tail
shape
size
columns
dtypes
isnull
isnull().sum()
drop duplicates
info
len

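Bound method

The method was used without brackets; add them, e.g. df.head()

Not callable

Brackets were added to an attribute; remove them, e.g. df.shape

Attribute error

The method or attribute is not available on the object; check for a spelling mistake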
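The three messages can be reproduced on purpose. A minimal sketch, assuming a tiny made-up frame; `hed` is a deliberate misspelling of `head`:

```python
import pandas as pd

demo = pd.DataFrame({"a": [1, 2, 3]})  # hypothetical frame

# 1) Method without brackets: you get the method object, not the rows
bound = demo.head
assert callable(bound)
assert repr(bound).startswith("<bound method")

# 2) Brackets on an attribute: shape is a tuple, and a tuple cannot be called
try:
    demo.shape()
except TypeError as err:
    not_callable_msg = str(err)

# 3) Misspelt name: the DataFrame has nothing called "hed"
try:
    demo.hed()
except AttributeError as err:
    attribute_msg = str(err)

print(not_callable_msg)
print(attribute_msg)
```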

In [ ]: # We want to read a sample of the data.

# head gives the first 5 rows
# tail gives the last 5 rows
# For specific rows or columns, use take, iloc or loc

take-loc-iloc
In [43]: df.take((2,5,7))
# 2, 5, 7 are positions (of rows or of columns, depending on axis)
# axis=1 refers to columns
# axis=0 refers to rows
# the default is axis=0, i.e. rows

Out[43]:
case_id continent education_of_employee has_job_experience requires_job_training no_of_e

2 EZYV03 Asia Bachelor's N Y

5 EZYV06 Asia Master's Y N

7 EZYV08 North America Bachelor's Y N

 
In [41]: df.take([2,5,7],axis=1)
# Python indexing starts at 0, so these are the 3rd, 6th and 8th columns

Out[41]:
education_of_employee no_of_employees region_of_employment

0 High School 14513 West

1 Master's 2412 Northeast

2 Bachelor's 44444 West

3 Bachelor's 98 West

4 Master's 1082 South

... ... ... ...

25475 Bachelor's 2601 South

25476 High School 3274 Northeast

25477 Master's 1121 South

25478 Master's 1918 West

25479 Bachelor's 3195 Midwest

25480 rows × 3 columns

In [44]: df.take([100,200,300])

Out[44]:
case_id continent education_of_employee has_job_experience requires_job_training n

100 EZYV101 Asia Master's Y N

200 EZYV201 Asia Doctorate Y N

300 EZYV301 Asia Master's Y N

 

In [ ]: # I want rows 100, 200, 300 from columns 4, 8, 11

In [45]: df.take([100,200,300]).take([4,8,11],axis=1)

Out[45]:
requires_job_training prevailing_wage case_status

100 N 28243.79 Certified

200 N 74441.11 Certified

300 N 101371.21 Certified

take selects along one axis per call, so picking rows and columns needs two chained calls
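Chaining two take calls gives the same result as a single positional selection with iloc. A minimal sketch on a made-up frame (`demo`, `rows` and `cols` are hypothetical names):

```python
import pandas as pd

# Hypothetical frame with 10 rows and 4 columns
demo = pd.DataFrame({
    "c0": range(10),
    "c1": range(10, 20),
    "c2": range(20, 30),
    "c3": range(30, 40),
})

rows = [1, 4, 7]   # row positions
cols = [0, 3]      # column positions

via_take = demo.take(rows).take(cols, axis=1)  # rows first, then columns
via_iloc = demo.iloc[rows, cols]               # both axes in one step

assert via_take.equals(via_iloc)
print(via_take.shape)
# (3, 2)
```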

iloc
In [ ]: #df.iloc[<rows>,<columns>]
#df.iloc[<start:end>,<start:end>]
#rows=[]
#cols=[]
#df.iloc[rows,cols]

In [46]: df.iloc[5:10] # all the columns

Out[46]:
continent education_of_employee has_job_experience requires_job_training no_of_employees yr_of_

Asia Master's Y N 2339

Asia Bachelor's N N 4985

North America Bachelor's Y N 3035

Asia Bachelor's N N 4810

Europe Doctorate Y N 2251

 

In [47]: df.iloc[5:10,2:5]

Out[47]:
education_of_employee has_job_experience requires_job_training

5 Master's Y N

6 Bachelor's N N

7 Bachelor's Y N

8 Bachelor's N N

9 Doctorate Y N

In [48]: df.iloc[:,2:5]

Out[48]:
education_of_employee has_job_experience requires_job_training

0 High School N N

1 Master's Y N

2 Bachelor's N Y

3 Bachelor's N N

4 Master's Y N

... ... ... ...

25475 Bachelor's Y Y

25476 High School Y N

25477 Master's Y N

25478 Master's Y Y

25479 Bachelor's Y N

25480 rows × 3 columns

In [ ]: df.iloc[5:10] # all the columns
df.iloc[5:10,2:5] # specific rows and specific columns
df.iloc[:,2:5] # all the rows

In [49]: df.iloc[[100,200,300]]

Out[49]:
case_id continent education_of_employee has_job_experience requires_job_training n

100 EZYV101 Asia Master's Y N

200 EZYV201 Asia Doctorate Y N

300 EZYV301 Asia Master's Y N

 

In [50]: df.iloc[[100,200,300],[4,8,11]]

Out[50]:
requires_job_training prevailing_wage case_status

100 N 28243.79 Certified

200 N 74441.11 Certified

300 N 101371.21 Certified

In [ ]: df.iloc[5:10] # all the columns

df.iloc[5:10,2:5] # specific rows and specific columns
df.iloc[:,2:5] # all the rows
df.iloc[[100,200,300]]
df.iloc[[100,200,300],[4,8,11]]
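iloc slices follow normal Python slicing: the end position is excluded, and negative positions and steps also work. A minimal sketch on a made-up 20-row frame (`demo` is hypothetical):

```python
import pandas as pd

# Hypothetical frame with 20 rows and 3 columns
demo = pd.DataFrame({"a": range(20), "b": range(20, 40), "c": range(40, 60)})

block = demo.iloc[5:10, 0:2]    # rows 5 to 9, columns 0 and 1 (end excluded)
last_rows = demo.iloc[-3:]      # last three rows, all columns
every_other = demo.iloc[::2]    # every second row
single_cell = demo.iloc[5, 1]   # one value: row position 5, column position 1

print(list(block.index), list(block.columns))
# [5, 6, 7, 8, 9] ['a', 'b']
print(single_cell)
# 25
```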

In [51]: df.columns

Out[51]: Index(['case_id', 'continent', 'education_of_employee', 'has_job_experienc

e',
'requires_job_training', 'no_of_employees', 'yr_of_estab',
'region_of_employment', 'prevailing_wage', 'unit_of_wage',
'full_time_position', 'case_status'],
dtype='object')

In [55]: # Only prevailing_wage

df.iloc[[100,200,300],[8]]

# Column index without brackets (8): the result is a Series
# Column index in brackets ([8]): the result is a DataFrame

Out[55]:
prevailing_wage

100 28243.79

200 74441.11

300 101371.21
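The Series versus DataFrame difference can be confirmed by checking the type and shape of each result. A minimal sketch on a made-up frame (`demo` and its column names are hypothetical):

```python
import pandas as pd

# Hypothetical two-column frame
demo = pd.DataFrame({"wage": [10.0, 20.0, 30.0], "status": ["C", "D", "C"]})

as_series = demo.iloc[[0, 2], 0]    # bare column position: one-dimensional Series
as_frame = demo.iloc[[0, 2], [0]]   # column position in a list: two-dimensional DataFrame

print(type(as_series).__name__, as_series.shape)
# Series (2,)
print(type(as_frame).__name__, as_frame.shape)
# DataFrame (2, 1)

# The same rule applies when selecting by name
assert isinstance(demo["wage"], pd.Series)
assert isinstance(demo[["wage"]], pd.DataFrame)
```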
In [56]: # Only full_time_position
df.iloc[[100,200,300],[10]]

# iloc selects columns by integer position

Out[56]:
full_time_position

100 Y

200 Y

300 Y

loc
In [57]: df.loc[[100,200,300],['full_time_position']]

# loc selects columns directly by name (label)

Out[57]:
full_time_position

100 Y

200 Y

300 Y

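In [59]: #df.loc[[100,200,300],[10]]
# left commented out: loc looks for a column labelled 10, which does not exist, so it raises a KeyError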
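In df the row labels happen to equal the row positions (0, 1, 2, ...), which hides the difference between loc and iloc on rows. The sketch below uses a made-up frame whose row labels are 100, 200, 300, 400 so the two can be told apart (`demo` and its column names are hypothetical):

```python
import pandas as pd

# Hypothetical frame: row labels differ from row positions
demo = pd.DataFrame(
    {"wage": [10.0, 20.0, 30.0, 40.0], "full_time": ["Y", "N", "Y", "Y"]},
    index=[100, 200, 300, 400],
)

by_label = demo.loc[[100, 300], ["full_time"]]  # row labels, column name
by_position = demo.iloc[[0, 2], [1]]            # row positions, column position
assert by_label.equals(by_position)

# Passing a column position to loc fails, because no column is labelled 1
try:
    demo.loc[[100, 300], [1]]
    key_error_raised = False
except KeyError:
    key_error_raised = True

# Slices differ too: loc includes the end label, iloc excludes the end position
loc_slice = demo.loc[100:300]   # labels 100, 200, 300
iloc_slice = demo.iloc[0:2]     # positions 0 and 1

print(key_error_raised, len(loc_slice), len(iloc_slice))
# True 3 2
```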

In [ ]:
