Data analysis - 5th unit
Pandas Series:
Pandas Series are useful for organizing and working with one-dimensional
data. They provide labeled indexing, making it easy to access and manipulate
data. Series support operations like data alignment, handling missing data, and
mathematical calculations. They can be easily integrated with other Pandas
structures like DataFrames, making them a versatile tool for data analysis and
manipulation in Python.
Creating data using Pandas Series:
import pandas as pd
s = pd.Series([10, 20, 30, 40, 50])
print(s)
Output:
0 10
1 20
2 30
3 40
4 50
dtype: int64
The pd.Series() function creates a Series object from the provided list [10, 20, 30, 40, 50]. The index is generated automatically as integers starting from 0.
import pandas as pd
k= pd.Series([1,7,2],index=[1,2,3])
print(k)
The index parameter sets custom labels for the Series. Here the labels are 1, 2, 3, but any labels of matching length, such as a, b, c, can be passed instead.
Output:
1 1
2 7
3 2
dtype: int64
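As a small sketch of the a, b, c case mentioned above (the values are the same as in the previous example; only the labels change), string labels can then be used to look up individual elements:

```python
import pandas as pd

# Same values as before, but with string labels instead of 1, 2, 3
k = pd.Series([1, 7, 2], index=["a", "b", "c"])

print(k["b"])      # access by label
print(k.iloc[0])   # access by integer position
```

Label-based access uses k[...] or k.loc[...], while k.iloc[...] always counts positions from 0 regardless of the labels.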
Pandas Series operations:
Pandas Series operations allow you to perform various operations on Series
objects, such as arithmetic operations, boolean operations, and more. Here are
some common operations:
1. Arithmetic Operations: You can perform arithmetic operations like addition,
subtraction, multiplication, and division on Series objects. For example:
import pandas as pd
# Create two Series
s1 = pd.Series([1, 2, 3, 4])
s2 = pd.Series([5, 6, 7, 8])
# Addition
result = s1 + s2
print(result)
Output:
0 6
1 8
2 10
3 12
dtype: int64
# Subtraction
result = s1 - s2
print(result)
Output:
0 -4
1 -4
2 -4
3 -4
dtype: int64
# Multiplication
result = s1 * s2
print(result)
Output:
0 5
1 12
2 21
3 32
dtype: int64
# Division
result = s1 / s2
print(result)
Output:
0 0.200000
1 0.333333
2 0.428571
3 0.500000
dtype: float64
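The examples above use two Series with the same default index. As a minimal sketch of the data alignment mentioned earlier, the following uses two made-up Series whose labels only partly overlap, to show that arithmetic matches values by label rather than by position:

```python
import pandas as pd

# Two Series whose labels only partly overlap (values are illustrative)
a = pd.Series([10, 20, 30], index=["x", "y", "z"])
b = pd.Series([1, 2, 3], index=["y", "z", "w"])

# Addition pairs up values that share a label
total = a + b
print(total)

# Labels found in only one Series ("x" and "w") come out as NaN
print(total.isnull())
```

Because NaN is a floating-point value, the result has dtype float64 even though both inputs hold integers.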
2. Boolean Indexing:
Boolean indexing filters a Series with a boolean mask. The expression res > 2 compares each element of the Series against the value 2 and produces a mask of True/False values; indexing with that mask keeps only the elements where it is True.
import pandas as pd
res = pd.Series([1,2,3,4],index = ["a","b","c","d"])
k = res[res>2]
print(k)
This code creates a Pandas Series res with values [1, 2, 3, 4] and custom
index labels ["a", "b", "c", "d"] . Then, it filters the Series using boolean
indexing to select only the elements that are greater than 2, and stores the
result in a new Series k .
Output:
c 3
d 4
dtype: int64
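To make the mask itself visible, here is a short sketch on the same Series. The combined condition at the end is an extra illustration, not part of the original example:

```python
import pandas as pd

res = pd.Series([1, 2, 3, 4], index=["a", "b", "c", "d"])

# The comparison alone returns a Series of True/False values
mask = res > 2
print(mask)

# Conditions can be combined with & (and) or | (or);
# each condition needs its own parentheses
both = res[(res > 1) & (res < 4)]
print(both)
```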
3. Descriptive Statistics: Pandas provides several methods to calculate
descriptive statistics for Series objects, such as mean(), median(), sum(),
min(), max(), and count(). For example:
import pandas as pd
# Create a Series
s = pd.Series([1, 2, 3, 4])
# Mean
print("Mean:", s.mean())
# Median
print("Median:", s.median())
# Sum
print("Sum:", s.sum())
# Minimum
print("Minimum:", s.min())
# Maximum
print("Maximum:", s.max())
# Count
print("Count:", s.count())
Output:
Mean: 2.5
Median: 2.5
Sum: 10
Minimum: 1
Maximum: 4
Count: 4
4. Handling Missing values:
import pandas as pd
# Create a Series with missing values
s = pd.Series([15,85,None,74,56,None,87])
# Check for missing values
print(s.isnull())
0 False
1 False
2 True
3 False
4 False
5 True
6 False
dtype: bool
# Drop missing values
print(s.dropna())
0 15.0
1 85.0
3 74.0
4 56.0
6 87.0
dtype: float64
# Fill missing values with a specific value (here, 19)
print(s.fillna(19))
0 15.0
1 85.0
2 19.0
3 74.0
4 56.0
5 19.0
6 87.0
dtype: float64
These operations demonstrate how to check for missing values in a Pandas
Series (isnull()), how to drop those missing values (dropna()), and how to fill
missing values with a specific value (fillna()).
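Instead of a fixed number, the fill value is often computed from the data. The sketch below is one possible variation, not part of the original example; it fills the gaps with the mean of the values that are present:

```python
import pandas as pd

s = pd.Series([15, 85, None, 74, 56, None, 87])

# Count how many values are missing
print(s.isnull().sum())

# mean() skips missing values by default
mean_value = s.mean()

# Fill the gaps with that mean; s itself is not modified
filled = s.fillna(mean_value)
print(filled)
```

Note that dropna() and fillna() both return a new Series, so the result has to be assigned to a variable if it is needed later.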
Pandas DataFrame:
A DataFrame in Pandas is a two-dimensional, size-mutable, and heterogeneous
tabular data structure with labeled axes (rows and columns). It is similar to a
spreadsheet or SQL table, where each column can have a different data type.
DataFrames are particularly useful for data manipulation and analysis tasks, as
they provide powerful methods for handling and processing structured data.
Creating a DataFrame using Pandas:
import pandas as pd
# Creating a DataFrame from a dictionary
data = {'Name': ['Alice', 'Bob', 'Charlie'],
'Age': [25, 30, 35],
'City': ['New York', 'San Francisco', 'Los Angeles']}
df = pd.DataFrame(data)
print(df)
Name Age City
0 Alice 25 New York
1 Bob 30 San Francisco
2 Charlie 35 Los Angeles
This code creates a Pandas DataFrame df from a dictionary data , where each
key in the dictionary becomes a column in the DataFrame and the
corresponding list of values becomes the data in that column. The DataFrame is
then printed to the console, displaying the data in a tabular format.
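To see the "different data type per column" property described above, a quick sketch on the same dictionary can inspect the shape and column types of the DataFrame:

```python
import pandas as pd

data = {'Name': ['Alice', 'Bob', 'Charlie'],
        'Age': [25, 30, 35],
        'City': ['New York', 'San Francisco', 'Los Angeles']}
df = pd.DataFrame(data)

print(df.shape)             # (number of rows, number of columns)
print(df.columns.tolist())  # column names, taken from the dictionary keys
print(df.dtypes)            # one data type per column
```

The Age column holds integers while Name and City hold text, so df.dtypes reports different types for them.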
DataFrame operations:
1. Adding a new column:
df['Gender'] = ['Female', 'Male', 'Male']
print(df)
Name Age City Gender
0 Alice 25 New York Female
1 Bob 30 San Francisco Male
2 Charlie 35 Los Angeles Male
Here we added a column called Gender.
2. Dropping rows and columns:
Dropping a column:
k = df.drop("City", axis=1)
print(k)
Name Age Gender
0 Alice 25 Female
1 Bob 30 Male
2 Charlie 35 Male
Here we dropped the City column; axis=1 refers to columns. drop() returns a new DataFrame, so df itself is unchanged.
Dropping a row:
k = df.drop(1, axis=0)
print(k)
Name Age City Gender
0 Alice 25 New York Female
2 Charlie 35 Los Angeles Male
Here we dropped the row with index label 1; axis=0 refers to rows.
3. Grouping by Gender and calculating mean age:
grouped_df = df.groupby('Gender')['Age'].mean()
print(grouped_df)
Gender
Female 25.0
Male 32.5
Name: Age, dtype: float64
The groupby method in Pandas is used to split the DataFrame into groups based
on some criteria, such as a specific column value. In this case,
df.groupby('Gender') groups the DataFrame df by the 'Gender' column.
After grouping, the ['Age'] part selects the 'Age' column from each group, and
the mean() method calculates the mean age for each group.
Finally, print(grouped_df) displays the resulting Series, where each index
corresponds to a unique value in the 'Gender' column, and each value is the
mean age of the group.
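The same split-then-summarize pattern works with more than one statistic at a time. As a sketch, assuming the same Name/Age/Gender data as above, agg() can compute several summaries per group in one call:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'Gender': ['Female', 'Male', 'Male'],
})

# One row per group, one column per statistic
summary = df.groupby('Gender')['Age'].agg(['mean', 'min', 'max', 'count'])
print(summary)
```

The result is a DataFrame indexed by the unique Gender values, so a single number can be read with, for example, summary.loc['Male', 'mean'].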
4. Selection and indexing:
import pandas as pd
# Create a DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David'],
'Age': [25, 30, 35, 40],
'City': ['New York', 'San Francisco', 'Los Angeles', ...]}  # the fourth city is cut off in the source; ... is a placeholder
df = pd.DataFrame(data)
# Access a single value
print("\nValue at index 1, column 'Name':")
print(df.at[1, 'Name'])
# Access a row using integer position
print("\nRow at index 2:")
print(df.iloc[2])
# Access a column using label
print("\nColumn 'Age':")
print(df['Age'])
Value at index 1, column 'Name':
Bob
Row at index 2:
Name Charlie
Age 35
City Los Angeles
Name: 2, dtype: object
Column 'Age':
0 25
1 30
2 35
3 40
Name: Age, dtype: int64
Selection and indexing in a Pandas DataFrame let you access and
manipulate data efficiently. Here is a short explanation:
1. Single Column Selection: You can select a single column by using square
brackets [] with the column name as a string. For example,
df['Column_Name'] selects the column named 'Column_Name' .
2. Multiple Columns Selection: To select multiple columns, you can pass a list
of column names inside the square brackets. For example, df[['Column1',
'Column2']] selects columns 'Column1' and 'Column2' .
3. Row Selection Using Label: You can use the loc[] indexer to select rows
by label. For example, df.loc['Label'] selects the row with the specified
label.
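The three kinds of selection listed above can be sketched as follows. The row labels r1, r2, r3 are hypothetical and are used only so that loc[] has non-default labels to work with:

```python
import pandas as pd

df = pd.DataFrame(
    {'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 35]},
    index=['r1', 'r2', 'r3'],  # hypothetical row labels
)

# 1. Single column: returns a Series
ages = df['Age']
print(ages)

# 2. Multiple columns: pass a list, returns a DataFrame
subset = df[['Name', 'Age']]
print(subset)

# 3. Row by label: returns that row as a Series
row = df.loc['r2']
print(row)
```

With the default integer index, df.loc[1] and df.iloc[1] happen to return the same row; with custom labels such as these, only loc[] accepts the label and only iloc[] accepts the position.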
5. Reading a CSV file in pandas:
The pd.read_csv() function reads a CSV file, for example one named 'data.csv',
and returns a DataFrame object, here called df. The DataFrame is a tabular data
structure that stores the data from the CSV file in rows and columns, allowing
for easy manipulation and analysis of the data using pandas' tools and
functions.
• head() : View the first few rows of the DataFrame.
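A minimal sketch of read_csv() and head() follows. So that it runs without a file on disk, it reads from an in-memory string; the column names and values are made up and merely stand in for the contents of 'data.csv':

```python
import io
import pandas as pd

# Stand-in for the contents of 'data.csv' (made-up columns and values)
csv_text = """name,score
A,10
B,20
C,30
"""

# With a real file on disk this would be: df = pd.read_csv('data.csv')
df = pd.read_csv(io.StringIO(csv_text))

print(df.head())    # first 5 rows by default
print(df.head(2))   # first 2 rows
```

The first line of the CSV is used as the column names unless read_csv() is told otherwise.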
Filtering Data: Here we keep only the rows where first_ings_score is greater
than 150.
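This filter is boolean indexing applied to a DataFrame. In the sketch below only the column name first_ings_score comes from the text; the match_id column and all values are hypothetical:

```python
import pandas as pd

# Hypothetical match data; only the column name first_ings_score
# is taken from the text above
matches = pd.DataFrame({
    'match_id': [1, 2, 3, 4],
    'first_ings_score': [142, 187, 150, 201],
})

# Keep only the rows whose first-innings score is above 150
high_scores = matches[matches['first_ings_score'] > 150]
print(high_scores)
```

A score of exactly 150 is excluded because the comparison is strictly greater than; use >= to include it.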
Matplotlib:
Line plot:
A line plot is a type of plot where data points are connected by straight lines. It
is often used to visualize data over a continuous interval.
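A minimal line plot might look like the sketch below. The x and y values are made up for illustration:

```python
import matplotlib.pyplot as plt

# Made-up data points
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]

# plot() connects the points with straight lines;
# marker="o" also draws a dot at each point
lines = plt.plot(x, y, marker="o")

plt.xlabel("x")
plt.ylabel("y")
plt.title("Line plot")
plt.show()
```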
Bar Graph:
Creating a bar graph:
Now let's import a dataset using pandas and visualize it using Matplotlib.
Here we are using the IPL 2022 dataset.
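The dataset itself is not reproduced in these notes, so the sketch below uses a small made-up DataFrame in its place. The column names team and wins and all values are hypothetical and are not real IPL 2022 figures:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up stand-in for a dataset loaded with pd.read_csv();
# these are not real IPL 2022 figures
df = pd.DataFrame({
    'team': ['Team A', 'Team B', 'Team C'],
    'wins': [9, 7, 5],
})

# One bar per row: categories on the x-axis, values as bar heights
bars = plt.bar(df['team'], df['wins'])

plt.xlabel("Team")
plt.ylabel("Wins")
plt.title("Bar graph")
plt.show()
```

With a real file, the DataFrame would come from pd.read_csv() and the two column names would be replaced by the ones in that file.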
Histogram:
In Matplotlib, you can create a histogram using the hist function.
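As a sketch with made-up values, hist() groups the data into bins and draws one bar per bin, where the bar height is the number of values that fall in that bin:

```python
import matplotlib.pyplot as plt

# Made-up scores for illustration
scores = [142, 187, 150, 201, 165, 178, 133, 190, 156, 169]

# bins=5 splits the range of the data into 5 equal-width intervals
counts, bin_edges, patches = plt.hist(scores, bins=5, edgecolor="black")

plt.xlabel("Score")
plt.ylabel("Frequency")
plt.title("Histogram")
plt.show()
```

hist() also returns the counts per bin and the bin edges, which is useful when the numbers are needed and not just the picture.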
Pie chart:
To draw a pie chart with multiple colors, labels, and percentages in Matplotlib,
you can use the pie function along with the autopct parameter to display the
percentages.
Each slice of the pie chart is colored according to the colors list, and the
autopct='%1.1f%%' parameter displays each percentage with one decimal place.
Adjust the labels and colors lists to suit your data and color preferences.
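A sketch of such a pie chart is shown below. The labels, sizes, and colors are made up; only the pie function and the autopct='%1.1f%%' setting come from the description above:

```python
import matplotlib.pyplot as plt

# Made-up categories and values
labels = ['A', 'B', 'C', 'D']
sizes = [40, 30, 20, 10]
colors = ['gold', 'lightcoral', 'lightskyblue', 'yellowgreen']

# autopct formats each slice's share as a percentage with one decimal place
wedges, texts, autotexts = plt.pie(
    sizes, labels=labels, colors=colors, autopct='%1.1f%%'
)

plt.title("Pie chart")
plt.show()
```

The three lists must have the same length, one entry per slice.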