Data Analysis - 5th Unit

The document provides an overview of data analysis using Pandas, focusing on Pandas Series and DataFrames. It covers creating Series and DataFrames, performing operations like arithmetic calculations, boolean indexing, and handling missing values, as well as basic DataFrame manipulations such as adding and dropping columns. Additionally, it introduces data visualization techniques using Matplotlib, including line plots, bar graphs, histograms, and pie charts.

Uploaded by

niharikap229

Data analysis - 5th unit

Pandas Series:
Pandas Series are useful for organizing and working with one-dimensional
data. They provide labeled indexing, making it easy to access and manipulate
data. Series support operations like data alignment, handling missing data, and
mathematical calculations. They can be easily integrated with other Pandas
structures like DataFrames, making them a versatile tool for data analysis and
manipulation in Python.

Creating data using Pandas Series:

import pandas as pd
s = pd.Series([10, 20, 30, 40, 50])
print(s)

Output:
0 10
1 20
2 30
3 40
4 50
dtype: int64

The pd.Series() function creates a Series object from the provided list [10, 20,
30, 40, 50] . The index is automatically generated as integers starting from 0.

import pandas as pd
k = pd.Series([1, 7, 2], index=[1, 2, 3])
print(k)

The index parameter sets custom labels for the Series, here index=[1, 2, 3].
The labels can be integers such as 1, 2, 3 or strings such as 'a', 'b', 'c'.

Output:
1    1
2    7
3    2
dtype: int64
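
String labels work the same way as the integer labels above. The following is a minimal sketch, reusing the values [1, 7, 2] with the labels 'a', 'b', 'c' chosen only for illustration, showing how elements can then be accessed by label:

```python
import pandas as pd

# Same values as above, but with string labels instead of integers
k = pd.Series([1, 7, 2], index=["a", "b", "c"])

print(k["b"])            # access one element by its label
print(k[["a", "c"]])     # access several elements with a list of labels
print(list(k.index))     # the labels themselves
```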

Pandas Series operations:


Pandas Series operations allow you to perform various operations on Series
objects, such as arithmetic operations, boolean operations, and more. Here are
some common operations:

1. Arithmetic Operations: You can perform arithmetic operations like addition,
subtraction, multiplication, and division on Series objects. For example:

import pandas as pd

# Create two Series


s1 = pd.Series([1, 2, 3, 4])
s2 = pd.Series([5, 6, 7, 8])

# Addition
result = s1 + s2
print(result)

Output: 0 6
1 8
2 10
3 12
dtype: int64

# Subtraction
result = s1 - s2
print(result)

Output: 0 -4
1 -4
2 -4
3 -4
dtype: int64

# Multiplication
result = s1 * s2
print(result)

Output: 0 5
1 12
2 21
3 32
dtype: int64

# Division
result = s1 / s2
print(result)

Output: 0 0.200000
1 0.333333
2 0.428571
3 0.500000
dtype: float64
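
The two Series above share the same default index, so their elements pair up one to one. Arithmetic between Series is matched on index labels rather than on position, which is the data alignment mentioned at the start of this unit. A minimal sketch, assuming two made-up Series whose labels only partly overlap:

```python
import pandas as pd

# Illustrative Series; the labels 'b' and 'c' are shared, 'a' and 'd' are not
a = pd.Series([1, 2, 3], index=["a", "b", "c"])
b = pd.Series([10, 20, 30], index=["b", "c", "d"])

# Labels present in only one Series produce NaN in the result
total = a + b
print(total)

# add() with fill_value treats a missing label as 0 instead of NaN
filled_total = a.add(b, fill_value=0)
print(filled_total)
```

Because NaN is a floating-point value, both results have dtype float64 even though the inputs are integers.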

2. Boolean Indexing:

This operation is called boolean indexing. The expression res > 2 creates a
boolean mask in which each element of the Series is compared against the
value 2. Indexing the Series with that mask keeps only the elements where the
mask is True.

import pandas as pd
res = pd.Series([1,2,3,4],index = ["a","b","c","d"])
k = res[res>2]
print(k)

This code creates a Pandas Series res with values [1, 2, 3, 4] and custom
index labels ["a", "b", "c", "d"]. Then, it filters the Series using boolean
indexing to select only the elements that are greater than 2, and stores the
result in a new Series k.

Output:
c 3
d 4
dtype: int64
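
To make the intermediate step visible, the sketch below prints the mask itself and then combines two conditions. It reuses the Series res from the example; the second condition (res < 4) is an assumption added for illustration.

```python
import pandas as pd

res = pd.Series([1, 2, 3, 4], index=["a", "b", "c", "d"])

# The comparison alone returns a Series of True/False values (the mask)
mask = res > 2
print(mask)

# Conditions are combined with & (and) or | (or); each one needs parentheses
between = res[(res > 1) & (res < 4)]
print(between)
```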

3. Descriptive Statistics: Pandas provides several methods to calculate
descriptive statistics for Series objects, such as mean(), median(), min(),
max(), etc. For example:

import pandas as pd

# Create a Series
s = pd.Series([1, 2, 3, 4])

# Mean
print("Mean:", s.mean())

# Median
print("Median:", s.median())

# Sum
print("Sum:", s.sum())

# Minimum
print("Minimum:", s.min())

# Maximum
print("Maximum:", s.max())

# Count
print("Count:", s.count())

Output:
Mean: 2.5
Median: 2.5
Sum: 10
Minimum: 1
Maximum: 4
Count: 4
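
When several of these statistics are needed at once, the describe() method returns them together as a new Series. This is a small sketch on the same data, not part of the original example:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4])

# describe() returns count, mean, std, min, quartiles and max in one Series
stats = s.describe()
print(stats)

# Individual statistics can be read back by label
print("Mean:", stats["mean"])
print("Median (50%):", stats["50%"])
```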

4. Handling Missing values:

import pandas as pd

# Create a Series with missing values (None is stored as NaN)
s = pd.Series([15, 85, None, 74, 56, None, 87])

# Check for missing values
print(s.isnull())

0    False
1    False
2     True
3    False
4    False
5     True
6    False
dtype: bool

# Drop missing values
print(s.dropna())

0    15.0
1    85.0
3    74.0
4    56.0
6    87.0
dtype: float64

# Fill missing values with a specific value (here, 19)
print(s.fillna(19))

0    15.0
1    85.0
2    19.0
3    74.0
4    56.0
5    19.0
6    87.0
dtype: float64

These operations demonstrate how to check for missing values in a Pandas
Series (isnull()), how to drop those missing values (dropna()), and how to fill
missing values with a specific value (fillna()).
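
The fill value does not have to be a fixed constant. One common choice, shown here as a sketch rather than as the method used in these notes, is to fill the gaps with the mean of the values that are present:

```python
import pandas as pd

s = pd.Series([15, 85, None, 74, 56, None, 87])

# Count the missing values: True counts as 1 when summed
missing_count = s.isnull().sum()
print("Missing:", missing_count)

# mean() skips NaN by default, so it averages the five known values
filled = s.fillna(s.mean())
print(filled)
```

None of these methods change s itself; each returns a new Series.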

Pandas DataFrame:
A DataFrame in Pandas is a two-dimensional, size-mutable, and heterogeneous
tabular data structure with labeled axes (rows and columns). It is similar to a
spreadsheet or SQL table, where each column can have a different data type.
DataFrames are particularly useful for data manipulation and analysis tasks, as
they provide powerful methods for handling and processing structured data.

Creating a DataFrame using Pandas:

import pandas as pd

# Creating a DataFrame from a dictionary
data = {'Name': ['Alice', 'Bob', 'Charlie'],
        'Age': [25, 30, 35],
        'City': ['New York', 'San Francisco', 'Los Angeles']}
df = pd.DataFrame(data)
print(df)

      Name  Age           City
0    Alice   25       New York
1      Bob   30  San Francisco
2  Charlie   35    Los Angeles

This code creates a Pandas DataFrame df from a dictionary data , where each
key in the dictionary becomes a column in the DataFrame and the
corresponding list of values becomes the data in that column. The DataFrame is
then printed to the console, displaying the data in a tabular format.
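
Once a DataFrame exists, a few attributes help confirm its structure. The sketch below rebuilds the same DataFrame and inspects its size, column labels, and per-column data types, which illustrates that each column can hold a different type:

```python
import pandas as pd

data = {'Name': ['Alice', 'Bob', 'Charlie'],
        'Age': [25, 30, 35],
        'City': ['New York', 'San Francisco', 'Los Angeles']}
df = pd.DataFrame(data)

print(df.shape)           # (number of rows, number of columns)
print(list(df.columns))   # column labels, taken from the dictionary keys
print(df.dtypes)          # one data type per column
```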

DataFrame operations:
1. Adding a new column:

df['Gender'] = ['Female', 'Male', 'Male']
print(df)

      Name  Age           City  Gender
0    Alice   25       New York  Female
1      Bob   30  San Francisco    Male
2  Charlie   35    Los Angeles    Male

Here we added a column called Gender. The list must contain one value per row.
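
A new column can also be computed from an existing one instead of being typed in by hand. In this sketch the column name Age_in_months is hypothetical and chosen only for illustration:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'],
                   'Age': [25, 30, 35]})

# The arithmetic is applied to every row of the Age column at once
df['Age_in_months'] = df['Age'] * 12   # hypothetical column name
print(df)
```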

2. Dropping columns and rows:

Dropping a column:

k = df.drop("City", axis=1)
k

      Name  Age  Gender
0    Alice   25  Female
1      Bob   30    Male
2  Charlie   35    Male

Here we dropped the City column; axis=1 refers to columns. drop() returns a new
DataFrame and leaves df unchanged.

Dropping a row:

k = df.drop(1, axis=0)
print(k)

      Name  Age         City  Gender
0    Alice   25     New York  Female
2  Charlie   35  Los Angeles    Male

Here we dropped the row with index label 1; axis=0 refers to rows.
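
drop() also accepts the keyword arguments columns= and index=, which some readers find easier to follow than axis numbers. A short sketch on the same data, which also confirms that the original DataFrame is not modified:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'],
                   'Age': [25, 30, 35],
                   'City': ['New York', 'San Francisco', 'Los Angeles']})

no_city = df.drop(columns="City")   # same effect as df.drop("City", axis=1)
no_bob = df.drop(index=1)           # same effect as df.drop(1, axis=0)

print(no_city)
print(no_bob)
print(list(df.columns))             # df still has all three columns
```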

3. Grouping by Gender and calculating mean age:

grouped_df = df.groupby('Gender')['Age'].mean()
print(grouped_df)

Gender
Female 25.0
Male 32.5
Name: Age, dtype: float64

The groupby method in Pandas is used to split the DataFrame into groups based
on some criteria, such as a specific column value. In this case,
df.groupby('Gender') groups the DataFrame df by the 'Gender' column.
After grouping, the ['Age'] part selects the 'Age' column from each group, and
the mean() method calculates the mean age for each group.
Finally, print(grouped_df) displays the resulting Series, where each index
corresponds to a unique value in the 'Gender' column, and each value is the
mean age of the group.
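
The same grouping can produce several statistics at once by replacing mean() with agg() and a list of function names. This is a sketch that extends the example above, using the same names, ages, and genders:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'],
                   'Age': [25, 30, 35],
                   'Gender': ['Female', 'Male', 'Male']})

# One row per gender, one column per requested statistic
summary = df.groupby('Gender')['Age'].agg(['mean', 'min', 'max', 'count'])
print(summary)
```

The result is a DataFrame indexed by Gender, so a single figure can be read with summary.loc['Male', 'mean'].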

4. Selection and indexing:

import pandas as pd

# Create a DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David'],
'Age': [25, 30, 35, 40],
'City': ['New York', 'San Francisco', 'Los Angeles',
df = pd.DataFrame(data)

# Access a single value
print("\nValue at index 1, column 'Name':")
print(df.at[1, 'Name'])

# Access a row using integer position
print("\nRow at index 2:")
print(df.iloc[2])

# Access a column using label
print("\nColumn 'Age':")
print(df['Age'])

Value at index 1, column 'Name':
Bob

Row at index 2:
Name Charlie
Age 35
City Los Angeles
Name: 2, dtype: object

Column 'Age':
0 25
1 30
2 35
3 40
Name: Age, dtype: int64

Selection and indexing in a Pandas DataFrame allow you to access and
manipulate data efficiently. Here's a short explanation:

1. Single Column Selection: You can select a single column by using square
brackets [] with the column name as a string. For example,
df['Column_Name'] selects the column named 'Column_Name'.

2. Multiple Columns Selection: To select multiple columns, you can pass a list
of column names inside the square brackets. For example, df[['Column1',
'Column2']] selects columns 'Column1' and 'Column2'.

3. Row Selection Using Label: You can use the loc[] indexer to select rows
by label. For example, df.loc['Label'] selects the row with the specified
label.
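
The three kinds of selection listed above can be sketched as follows. The example uses the three-row DataFrame from earlier in this unit; making Name the index with set_index() is an assumption added so that loc[] can be shown with a text label.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'],
                   'Age': [25, 30, 35],
                   'City': ['New York', 'San Francisco', 'Los Angeles']})

ages = df['Age']                  # single column: returns a Series
subset = df[['Name', 'Age']]      # list of columns: returns a DataFrame
row = df.loc[1]                   # row whose index label is 1

# With Name as the index, rows can be selected by name
by_name = df.set_index('Name')
bob = by_name.loc['Bob']

print(subset)
print(row)
print(bob)
```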

5. Reading a CSV file in pandas:

The pd.read_csv() function reads a CSV file, here one named 'data.csv', into a
DataFrame object called df. The DataFrame is a tabular data structure that
stores the data from the CSV file in rows and columns, allowing for easy
manipulation and analysis of the data using pandas' tools and functions.

• head(): View the first few rows of the DataFrame.

Filtering Data: here we keep only the rows whose first_ings_score is greater
than 150.
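
Reading, previewing, and filtering can be sketched together as below. The original file is not available here, so the sketch reads a small made-up CSV from an in-memory string; the column match_id and all the numbers are assumptions for illustration, and only the column name first_ings_score comes from the text.

```python
import io
import pandas as pd

# Made-up sample standing in for the CSV file; values are illustrative only
csv_text = """match_id,first_ings_score
1,131
2,210
3,149
4,162
"""

# read_csv accepts a file path or a file-like object such as StringIO
df = pd.read_csv(io.StringIO(csv_text))

# head() shows the first rows (5 by default; pass a number to change that)
preview = df.head(2)
print(preview)

# Boolean indexing keeps only rows where first_ings_score is greater than 150
high_scores = df[df["first_ings_score"] > 150]
print(high_scores)
```

With a real file on disk, the io.StringIO(...) argument would simply be replaced by the file path, for example pd.read_csv('data.csv').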

Matplotlib:
Line plot:
A line plot is a type of plot where data points are connected by straight lines. It
is often used to visualize data over a continuous interval.
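
A minimal line plot might look like the sketch below. The data values and axis labels are made up for illustration, and the Agg backend is selected only so the code runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line in a notebook
import matplotlib.pyplot as plt

# Made-up data: cumulative runs after each over
overs = [1, 2, 3, 4, 5]
runs = [6, 14, 19, 30, 41]

fig, ax = plt.subplots()
ax.plot(overs, runs, marker="o")   # data points joined by straight lines
ax.set_xlabel("Over")
ax.set_ylabel("Total runs")
ax.set_title("Runs over time")
# In an interactive session, plt.show() would display the figure
```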

Bar Graph:
Creating a bar graph
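
A bar graph compares values across categories. The sketch below uses the bar function with made-up category names and values, and again selects the Agg backend only so it runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line in a notebook
import matplotlib.pyplot as plt

# Made-up categories and values for illustration
teams = ["Team A", "Team B", "Team C"]
wins = [9, 7, 4]

fig, ax = plt.subplots()
bars = ax.bar(teams, wins)         # one bar per category
ax.set_xlabel("Team")
ax.set_ylabel("Wins")
ax.set_title("Wins per team")
# In an interactive session, plt.show() would display the figure
```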

Now let's import a dataset using pandas and perform data visualization
operations using Matplotlib. Here we are using the IPL 2022 dataset.

Histogram:
In Matplotlib, you can create a histogram using the hist function.
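
A short sketch of the hist function follows. The scores are made up and stand in for a numeric column such as first_ings_score; the choice of 5 bins is also an assumption.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line in a notebook
import matplotlib.pyplot as plt

# Made-up scores for illustration
scores = [131, 210, 149, 162, 175, 158, 189, 144, 168, 153]

fig, ax = plt.subplots()
# hist() returns the count per bin, the bin edges, and the drawn bars
counts, bin_edges, patches = ax.hist(scores, bins=5, edgecolor="black")
ax.set_xlabel("Score")
ax.set_ylabel("Number of matches")
ax.set_title("Distribution of scores")
# In an interactive session, plt.show() would display the figure
```

Every value falls into exactly one bin, so the counts add up to the number of scores.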

Pie chart:
To draw a pie chart with multiple colors, labels, and percentages in Matplotlib,
you can use the pie function along with the autopct parameter to display the
percentages.

Each slice of the pie chart is colored according to the colors list, and the
autopct='%1.1f%%' parameter displays the percentage values with one decimal
place. Adjust the labels and colors lists according to your data and color
preferences.
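
The description above can be sketched as follows. The labels, sizes, and colors are made up for illustration; only the use of pie with labels, colors, and autopct='%1.1f%%' comes from the text.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line in a notebook
import matplotlib.pyplot as plt

# Made-up data for illustration
labels = ["A", "B", "C", "D"]
sizes = [40, 30, 20, 10]
colors = ["gold", "lightblue", "lightgreen", "salmon"]

fig, ax = plt.subplots()
# With autopct set, pie() returns the wedges, the label texts,
# and the percentage texts drawn inside the slices
wedges, texts, autotexts = ax.pie(sizes, labels=labels, colors=colors,
                                  autopct='%1.1f%%')
ax.set_title("Share per category")
# In an interactive session, plt.show() would display the figure
```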
