Pandas
GEETA DESAI
KrishnanGEE
1. Pandas
Pandas is an open-source Python library for data manipulation and analysis. It provides the data structures and
functions needed to work efficiently with structured data, particularly tabular data.
Key features:
Fast and efficient data manipulation, including built-in handling of missing data.
Flexible data structures: the Series and DataFrame structures hold and manipulate data loaded from various
formats (e.g., CSV, Excel).
Integration with other libraries: Pandas works well with NumPy, Matplotlib, and Seaborn for
numerical computation and visualization.
Pandas – Data Science using Python
Series
Creating a Series from a list:
import pandas as pd
data = [10, 20, 30, 40]
series = pd.Series(data)
print(series)
Output:
0 10
1 20
2 30
3 40
dtype: int64
Here, 0, 1, 2, 3 are the indices, and 10, 20, 30, 40 are the values.
Creating a Series from a dictionary:
import pandas as pd
data = {'a': 10, 'b': 20, 'c': 30}
series = pd.Series(data)
print(series)
Output:
a 10
b 20
c 30
dtype: int64
In the example above, the dictionary keys become the index labels.
We can access Series elements using their index or by using their key label:
import pandas as pd
Output:
10
20
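A minimal sketch of both access styles, reusing the two Series defined earlier. The variable names by_position and by_label are illustrative and not from the original example:

```python
import pandas as pd

# Series with the default integer index 0, 1, 2, 3
by_position = pd.Series([10, 20, 30, 40])
# Series whose index labels come from the dictionary keys
by_label = pd.Series({'a': 10, 'b': 20, 'c': 30})

print(by_position[0])   # element stored under index 0
print(by_label['b'])    # element stored under key label 'b'
```

With these inputs the two print calls give 10 and 20, matching the output shown above.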
Vectorized Operations:
import pandas as pd
Output:
0 15
1 25
2 35
3 45
dtype: int64
a 15
b 25
c 35
dtype: int64
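The output above is consistent with adding a constant to every element at once. A minimal sketch, assuming the operation is `+ 5` applied to the two Series from the previous section (the variable names are illustrative):

```python
import pandas as pd

list_series = pd.Series([10, 20, 30, 40])
dict_series = pd.Series({'a': 10, 'b': 20, 'c': 30})

# The scalar is applied to every element without an explicit loop
print(list_series + 5)
print(dict_series + 5)
```

The index of each Series is preserved; only the values change.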
import pandas as pd
DataFrame
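import pandas as pd
data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'Occupation': ['Engineer', 'Doctor', 'Artist']
}
# Each dictionary key becomes a column; each list becomes that column's values
df = pd.DataFrame(data)
print(df)
Output:
Name Age Occupation
0 Alice 25 Engineer
1 Bob 30 Doctor
2 Charlie 35 Artist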
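Viewing data
head(): Returns the first rows of the DataFrame (5 by default).
tail(): Returns the last rows of the DataFrame (5 by default).
info(): Displays information about the DataFrame, including column data types, non-null counts, and memory usage.
describe(): Generates descriptive statistics such as count, mean, standard deviation, min, max, and quartiles for numeric columns.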
import pandas as pd
Output:
Head
Name Age Place
0 Alice 25 BLR
1 Bob 30 TRPL
2 Charlie 35 GLB
3 Dread Wing 30 CYB
4 Bumble Bee 40 CYB
Tail
Name Age Place
1 Bob 30 TRPL
2 Charlie 35 GLB
3 Dread Wing 30 CYB
4 Bumble Bee 40 CYB
5 Arcee 45 CYB
Description
Age
count 6.000000
mean 34.166667
std 7.359801
min 25.000000
25% 30.000000
50% 32.500000
75% 38.750000
max 45.000000
Info
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 3 columns):
# Column Non-Null Count Dtype
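The calls behind the Head, Tail, Description, and Info outputs can be sketched as follows. This assumes the six-row Name/Age/Place data used in the later row-selection example; the original code for this page is not reproduced here:

```python
import pandas as pd

data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'Dread Wing', 'Bumble Bee', 'Arcee'],
    'Age': [25, 30, 35, 30, 40, 45],
    'Place': ['BLR', 'TRPL', 'GLB', 'CYB', 'CYB', 'CYB']
}
df = pd.DataFrame(data)

print(df.head())      # first 5 rows (index 0 to 4)
print(df.tail())      # last 5 rows (index 1 to 5)
print(df.describe())  # summary statistics for the numeric column Age
df.info()             # prints column types and non-null counts directly
```

Note that info() prints its report itself, so it does not need to be wrapped in print().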
Selecting Columns
import pandas as pd
print("\n")
Output:
0 Alice
1 Bob
2 Charlie
3 Dread Wing
4 Bumble Bee
5 Arcee
Name: Name, dtype: object
Name Age
0 Alice 25
1 Bob 30
2 Charlie 35
3 Dread Wing 30
4 Bumble Bee 40
5 Arcee 45
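The two outputs above match selecting one column and then two columns. A minimal sketch, assuming the same six-row data as the surrounding examples (the names names and subset are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Dread Wing', 'Bumble Bee', 'Arcee'],
    'Age': [25, 30, 35, 30, 40, 45],
    'Place': ['BLR', 'TRPL', 'GLB', 'CYB', 'CYB', 'CYB']
})

names = df['Name']             # a single label returns a Series
subset = df[['Name', 'Age']]   # a list of labels returns a DataFrame

print(names)
print("\n")
print(subset)
```

The double brackets in the second selection are a list of column names inside the indexing brackets.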
import pandas as pd
# Data as a dictionary
data = {
'Name': ['Alice', 'Bob', 'Charlie', 'Dread Wing', 'Bumble Bee', 'Arcee'],
'Age': [25, 30, 35, 30, 40, 45],
'Place': ['BLR', 'TRPL', 'GLB', 'CYB', 'CYB', 'CYB']
}
Output:
Name Alice
Age 25
Place BLR
Name: 0, dtype: object
Name Alice
Age 25
Place BLR
Name: 0, dtype: object
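The same row printed twice suggests one selection by label and one by position. A minimal sketch under that assumption, using loc and iloc on the data defined above:

```python
import pandas as pd

data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'Dread Wing', 'Bumble Bee', 'Arcee'],
    'Age': [25, 30, 35, 30, 40, 45],
    'Place': ['BLR', 'TRPL', 'GLB', 'CYB', 'CYB', 'CYB']
}
df = pd.DataFrame(data)

row_by_label = df.loc[0]      # row whose index label is 0
row_by_position = df.iloc[0]  # row at integer position 0

print(row_by_label)
print(row_by_position)
```

With the default integer index, label 0 and position 0 refer to the same row, which is why both results are identical here. They would differ if the index were relabeled or reordered.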
Adding Column
import pandas as pd
# Sample DataFrame
data = {'Name': ['Rahul', 'Ananya', 'Vikram'],
'Age': [25, 30, 22]}
df = pd.DataFrame(data)
print(df)
print('\n')
# Adding a new column based on existing column
df['Age+5'] = df['Age'] + 5
print(df)
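Output:
Name Age
0 Rahul 25
1 Ananya 30
2 Vikram 22
Name Age Age+5
0 Rahul 25 30
1 Ananya 30 35
2 Vikram 22 27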
Deleting Columns
import pandas as pd
# Sample DataFrame
data = {'Name': ['Rahul', 'Ananya', 'Vikram'],
'Age': [25, 30, 22],
'Age+5': [30, 35, 27]}
df = pd.DataFrame(data)
print(df)
print('\n')
del df['Name']
print(df)
Output:
Name Age Age+5
0 Rahul 25 30
1 Ananya 30 35
2 Vikram 22 27
Age Age+5
0 25 30
1 30 35
2 22 27
Renaming Columns
import pandas as pd
# Sample DataFrame
data = {'Name': ['Rahul', 'Ananya', 'Vikram'],
'Age': [25, 30, 22]}
df = pd.DataFrame(data)
print('\n')
Output:
Name Years
0 Rahul 25
1 Ananya 30
2 Vikram 22
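The output shows the Age column renamed to Years. One way to get that result is DataFrame.rename with a columns mapping; this is a sketch under that assumption, since the renaming line itself is not shown above:

```python
import pandas as pd

data = {'Name': ['Rahul', 'Ananya', 'Vikram'],
        'Age': [25, 30, 22]}
df = pd.DataFrame(data)
print(df)
print('\n')

# rename returns a new DataFrame; the original df keeps its column names
renamed = df.rename(columns={'Age': 'Years'})
print(renamed)
```

Columns that are not listed in the mapping, such as Name, are left unchanged.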
A 2
B 1
dtype: int64
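Per-column counts like these are what isnull().sum() returns when counting missing values. A sketch with placeholder data, chosen only so that column A has two missing values and column B has one; the actual data behind the output above is not given:

```python
import pandas as pd

# Placeholder data (an assumption): None becomes NaN in the DataFrame
df = pd.DataFrame({
    'A': [1, None, None],
    'B': [None, 2, 3]
})

# isnull() marks each missing cell as True; sum() counts the True values per column
missing_counts = df.isnull().sum()
print(missing_counts)
```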
dropna()
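import pandas as pd
# Sample DataFrame with missing values (None becomes NaN)
df = pd.DataFrame({'A': [1, 2, None, 4], 'B': [None, 2, 3, 4], 'C': [1, 2, 3, 4]})
print("Original DataFrame:")
print(df)
# Drop every row that contains at least one missing value
df_dropped_any = df.dropna()
print("\nDataFrame after dropping rows with any missing values:")
print(df_dropped_any)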
Output:
Original DataFrame:
A B C
0 1.0 NaN 1
1 2.0 2.0 2
2 NaN 3.0 3
3 4.0 4.0 4
DataFrame after dropping rows with any missing values:
A B C
1 2.0 2.0 2
3 4.0 4.0 4
import pandas as pd
print("Original DataFrame:")
print(df)
Output:
Original DataFrame:
A B C
0 NaN NaN NaN
1 NaN NaN NaN
2 NaN NaN NaN
3 4.0 4.0 4.0
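A DataFrame in which three rows are entirely NaN is the typical setup for dropna(how='all'), which removes only rows where every value is missing. A minimal sketch under that assumption, rebuilding the data from the output above:

```python
import pandas as pd

# Rows 0 to 2 are entirely missing; row 3 is complete
df = pd.DataFrame({
    'A': [None, None, None, 4],
    'B': [None, None, None, 4],
    'C': [None, None, None, 4]
})
print("Original DataFrame:")
print(df)

# how='all' drops a row only if all of its values are NaN
df_dropped_all = df.dropna(how='all')
print("\nDataFrame after dropping rows where all values are missing:")
print(df_dropped_all)
```

The default, how='any', would drop a row as soon as a single value in it is missing.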
import pandas as pd
data = {
    'A': [1, 2, None, 4],
    'B': [None, 2, 3, 4],
    'C': [1, 2, 3, 4]
}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)
# axis=1 drops columns (instead of rows) that contain any missing value
df_dropped_cols_any = df.dropna(axis=1)
print("\nDataFrame after dropping columns with any missing values:")
print(df_dropped_cols_any)
Output:
Original DataFrame:
A B C
0 1.0 NaN 1
1 2.0 2.0 2
2 NaN 3.0 3
3 4.0 4.0 4
DataFrame after dropping columns with any missing values:
C
0 1
1 2
2 3
3 4
data = {
'A': [1, 2, None, 4],
'B': [None, 2, 3, 4],
'C': [1, 2, 3, 4]
}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)
Output:
Original DataFrame:
A B C
0 1.0 NaN 1
1 2.0 2.0 2
2 NaN 3.0 3
3 4.0 4.0 4
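dropna() also accepts a subset argument that limits the missing-value check to chosen columns. A minimal sketch on the same data; checking only column 'A' is an assumption made for illustration and may not be the variant this example originally demonstrated:

```python
import pandas as pd

data = {
    'A': [1, 2, None, 4],
    'B': [None, 2, 3, 4],
    'C': [1, 2, 3, 4]
}
df = pd.DataFrame(data)

# Only column 'A' is checked, so row 0 survives even though 'B' is missing there
df_dropped_subset = df.dropna(subset=['A'])
print(df_dropped_subset)
```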
fillna()
print("Original DataFrame:")
print(df)
Output:
Original DataFrame:
A B C
0 1.0 NaN 1.0
1 2.0 2.0 2.0
2 NaN 3.0 NaN
3 4.0 NaN 4.0
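The simplest use of fillna() replaces every missing value with one constant. A minimal sketch on the data shown above; the fill value 0 is an assumption, as the original call is not shown:

```python
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, None, 4],
    'B': [None, 2, 3, None],
    'C': [1, 2, None, 4]
})
print("Original DataFrame:")
print(df)

# Every NaN is replaced by 0; fillna returns a new DataFrame
df_filled = df.fillna(0)
print("\nDataFrame after filling missing values with 0:")
print(df_filled)
```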
import pandas as pd
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)
Output:
Original DataFrame:
A B C
0 1.0 NaN 1.0
1 NaN 2.0 2.0
2 3.0 NaN 3.0
3 4.0 4.0 NaN
DataFrame after forward filling:
A B C
0 1.0 NaN 1.0
1 1.0 2.0 2.0
2 3.0 2.0 3.0
3 4.0 4.0 3.0
DataFrame after backward filling:
A B C
0 1.0 2.0 1.0
1 3.0 2.0 2.0
2 3.0 4.0 3.0
3 4.0 4.0 NaN
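Forward filling copies the last valid value downward, and backward filling copies the next valid value upward. A minimal sketch that reproduces the outputs above using the ffill() and bfill() methods; the original may have used fillna(method='ffill') instead, a spelling that recent pandas releases deprecate:

```python
import pandas as pd

data = {
    'A': [1, None, 3, 4],
    'B': [None, 2, None, 4],
    'C': [1, 2, 3, None]
}
df = pd.DataFrame(data)

df_ffill = df.ffill()   # propagate the previous valid value forward
df_bfill = df.bfill()   # propagate the next valid value backward

print("DataFrame after forward filling:")
print(df_ffill)
print("\nDataFrame after backward filling:")
print(df_bfill)
```

A leading NaN has nothing before it to copy, so it survives a forward fill (B at row 0). Likewise a trailing NaN survives a backward fill (C at row 3).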
import pandas as pd
print("Original DataFrame:")
print(df)
Output:
Original DataFrame:
A B C
0 1.0 NaN 1
1 2.0 2.0 2
2 NaN 3.0 3
3 4.0 4.0 4
import pandas as pd
print("Original DataFrame:")
print(df)
Output:
Original DataFrame:
A B C
0 1.0 NaN 1.0
1 NaN 2.0 NaN
2 3.0 NaN 3.0
3 NaN 4.0 4.0
DataFrame after filling missing values with different values for each column:
A B C
0 1.0 99.0 1.0
1 0.0 2.0 5.0
2 3.0 99.0 3.0
3 0.0 4.0 4.0
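Passing a dictionary to fillna() gives each column its own fill value. A minimal sketch that reproduces the output above; the values 0, 99, and 5 are read off that output:

```python
import pandas as pd

df = pd.DataFrame({
    'A': [1, None, 3, None],
    'B': [None, 2, None, 4],
    'C': [1, None, 3, 4]
})

# Keys are column names, values are the replacement for NaN in that column
df_filled = df.fillna({'A': 0, 'B': 99, 'C': 5})
print("DataFrame after filling missing values with different values for each column:")
print(df_filled)
```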
import pandas as pd
# Sample DataFrames
df1 = pd.DataFrame({
'A': [None, 2, None, 4],
'B': [1, None, 3, None]
})
df2 = pd.DataFrame({
'A': [5, 6, 7, 8],
'B': [9, 10, 11, 12]
})
print("Original DataFrame:")
print(df1)
Output:
Original DataFrame:
A B
0 NaN 1.0
1 2.0 NaN
2 NaN 3.0
3 4.0 NaN
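With two DataFrames of the same shape, a likely next step is to fill the gaps in df1 with the values at the same positions in df2. A sketch under that assumption, using fillna() with a DataFrame argument:

```python
import pandas as pd

df1 = pd.DataFrame({
    'A': [None, 2, None, 4],
    'B': [1, None, 3, None]
})
df2 = pd.DataFrame({
    'A': [5, 6, 7, 8],
    'B': [9, 10, 11, 12]
})

# Each NaN in df1 takes the df2 value with the same row and column label;
# existing df1 values are left untouched
df_filled = df1.fillna(df2)
print(df_filled)
```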
import pandas as pd
# Sample DataFrame
data = {
'A': [None, 2, None, 4],
'B': [1, None, 3, None]
}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)
Output:
Original DataFrame:
A B
0 NaN 1.0
1 2.0 NaN
2 NaN 3.0
3 4.0 NaN
DataFrame after filling missing values with mean for 'A' and median for 'B':
A B
0 3.0 1.0
1 2.0 2.0
2 3.0 3.0
3 4.0 2.0
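The filled values above follow from the column statistics: the mean of A (2 and 4) is 3.0, and the median of B (1 and 3) is 2.0. A minimal sketch of one way to apply them, column by column:

```python
import pandas as pd

data = {
    'A': [None, 2, None, 4],
    'B': [1, None, 3, None]
}
df = pd.DataFrame(data)

# mean() and median() skip NaN by default
df['A'] = df['A'].fillna(df['A'].mean())
df['B'] = df['B'].fillna(df['B'].median())

print("DataFrame after filling missing values with mean for 'A' and median for 'B':")
print(df)
```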
Pandas aligns data automatically when performing arithmetic on objects whose labels may differ
(for both Series and DataFrames). Values are matched by row and column labels, and any label that
appears in only one of the operands produces NaN in the result.
import pandas as pd
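A minimal sketch of label alignment with two Series whose indexes only partly overlap. The data and names are illustrative, not taken from the original example:

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([10, 20, 30], index=['b', 'c', 'd'])

# Values are added label by label, not position by position.
# 'a' and 'd' exist in only one Series, so their results are NaN.
result = s1 + s2
print(result)
```

The result index is the union of both indexes, and the shared labels 'b' and 'c' hold 12 and 23.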
Broadcasting in Pandas refers to performing operations between objects of different dimensions, such
as a Series and a DataFrame.
import pandas as pd
df1 = pd.DataFrame({
    'A': [1, 2, 3],
    'B': [4, 5, 6]
}, index=['X', 'Y', 'Z'])
print(df1)
# Series for broadcasting
series = pd.Series([10, 20, 30],
                   index=['A', 'B', 'C'])
# Broadcasting addition
df_broadcast = df1 + series
print(df_broadcast)
Output:
A B
X 1 4
Y 2 5
Z 3 6
A B C
X 11 24 NaN
Y 12 25 NaN
Z 13 26 NaN
Sorting Data
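import pandas as pd
# Creating a DataFrame
data = {'Name': ['Ravi', 'Anita', 'Priya', 'Amit'],
        'Age': [28, 22, 25, 32]}
df = pd.DataFrame(data, index=['b', 'a', 'd', 'c'])
# Sort rows by index label
sorted_df = df.sort_index()
print(sorted_df)
print('\n')
# Sort rows by the values in the 'Age' column
sorted_by_values = df.sort_values(by='Age')
print(sorted_by_values)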
Output:
Name Age
a Anita 22
b Ravi 28
c Amit 32
d Priya 25
Name Age
a Anita 22
d Priya 25
b Ravi 28
c Amit 32
Let’s say we have a CSV file called employees.csv, stored in the same folder as the Python file,
with the following content:
Name,Department,Salary
Vijay,Sales,50000
Nisha,HR,60000
Anil,Finance,70000
Example
import pandas as pd
Output
Name Department Salary
0 Vijay Sales 50000
1 Nisha HR 60000
2 Anil Finance 70000
Breakdown
Step 2: Reading CSV File pd.read_csv() is used to read a CSV file and store its contents in a DataFrame.
Step 3: Displaying Data The .head() method displays the first 5 rows of the DataFrame to quickly inspect the data.
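The steps in the breakdown can be sketched as follows. To keep the sketch self-contained, it first writes employees.csv into a temporary folder, which is an addition for illustration; with the file already in place, only the read_csv() and head() lines are needed:

```python
import os
import tempfile

import pandas as pd

csv_text = (
    "Name,Department,Salary\n"
    "Vijay,Sales,50000\n"
    "Nisha,HR,60000\n"
    "Anil,Finance,70000\n"
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'employees.csv')
    with open(path, 'w') as f:
        f.write(csv_text)

    # Step 2: read the CSV file into a DataFrame
    df = pd.read_csv(path)

# Step 3: display the first rows (up to 5)
print(df.head())
```

The first line of the file is used as the column names, and the Salary column is parsed as integers.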
Let’s create and write the following data to a new CSV file new_employees.csv.
Example
import pandas as pd
data = {
'Name': ['Ramesh', 'Sunita', 'Amit'],
'Department': ['IT', 'Marketing', 'Operations'],
'Salary': [80000, 65000, 72000]
}
df = pd.DataFrame(data)
df.to_csv('new_employees.csv', index=False)
df_new = pd.read_csv('new_employees.csv')
print(df_new)
Breakdown
Step 1: Creating Data The sample data is stored in a dictionary format where each key represents a
column name.
Step 3: Writing CSV File The .to_csv() method is used to write the DataFrame to a CSV file named
new_employees.csv. We set index=False to avoid writing row indices into the CSV.
Step 4: Verifying Data We read the newly created CSV file back to verify its content.
Let’s assume we have an Excel file students.xlsx with the following sheet:
Name Grade
Amit A
Priya B
Sohan A
Example
import pandas as pd
Output
Student Marks
0 Ravi 85
1 Kiran 90
2 Anjali 88
Breakdown
Step 2: Reading Excel File The pd.read_excel() function reads the students.xlsx file.
Step 3: Displaying Data We use .head() to inspect the first few rows of the DataFrame.
Example
import pandas as pd
Breakdown
Step 1: Creating Data A dictionary is used to store the student names and their marks.
Step 3: Writing Excel File The DataFrame is written to an Excel file named marks.xlsx. The index=False
option prevents Pandas from writing row numbers into the Excel file.
Step 4: Verifying Data The newly created Excel file is read back into a DataFrame for verification.
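The write-and-verify steps can be sketched as below, using the Student and Marks values that appear in the output earlier in this section. Two assumptions are made for illustration: the file goes into a temporary folder, and the openpyxl package serves as the Excel engine, so the sketch skips the round trip if it is not installed:

```python
import importlib.util
import os
import tempfile

import pandas as pd

# Step 1: the data as a dictionary
data = {
    'Student': ['Ravi', 'Kiran', 'Anjali'],
    'Marks': [85, 90, 88]
}
df = pd.DataFrame(data)

df_check = None
if importlib.util.find_spec('openpyxl') is not None:
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, 'marks.xlsx')
        # Step 3: write without the row index
        df.to_excel(path, index=False)
        # Step 4: read the file back to verify its content
        df_check = pd.read_excel(path)
    print(df_check)
else:
    print("openpyxl is not installed; skipping the Excel round trip")
```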
Reading a JSON File: Let’s assume we have a JSON file data.json with the following content:
[
{"Name": "Nikhil", "Age": 23},
{"Name": "Sonal", "Age": 25},
{"Name": "Pooja", "Age": 22}
]
Example Output:
Breakdown
Step 2: Reading JSON File The pd.read_json() function reads the JSON file and loads it into a DataFrame.
Step 3: Displaying Data The .head() method is used to display the first few rows of the DataFrame.
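A minimal sketch of these steps. It writes data.json into a temporary folder first so that it runs on its own, which is an addition for illustration; with the file already present, read_json() and head() are the only calls needed:

```python
import json
import os
import tempfile

import pandas as pd

records = [
    {"Name": "Nikhil", "Age": 23},
    {"Name": "Sonal", "Age": 25},
    {"Name": "Pooja", "Age": 22}
]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'data.json')
    with open(path, 'w') as f:
        json.dump(records, f)

    # Step 2: each JSON object becomes one row of the DataFrame
    df = pd.read_json(path)

# Step 3: display the first rows
print(df.head())
```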
Example
import pandas as pd
Breakdown
Step 1: Creating Data A dictionary is used to store city names and their population.
Step 3: Writing JSON File The DataFrame is written to a JSON file named cities.json.
Step 4: Verifying Data The JSON file is read back to ensure it was written correctly.
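The write-and-verify steps can be sketched as follows. The city names and population figures are placeholders, since the original data is not shown. Writing to a temporary folder and using orient='records' (one JSON object per row, like data.json above) are also assumptions:

```python
import os
import tempfile

import pandas as pd

# Step 1: placeholder data, not real population figures
data = {
    'City': ['CityA', 'CityB', 'CityC'],
    'Population': [100000, 200000, 300000]
}
df = pd.DataFrame(data)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'cities.json')
    # Step 3: write the DataFrame as a list of row objects
    df.to_json(path, orient='records')
    # Step 4: read the file back to verify its content
    df_check = pd.read_json(path)

print(df_check)
```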