Introduction to Python Pandas
Pandas is a powerful open-source Python library for data analysis and manipulation. It provides
data structures like DataFrame and Series that make handling structured data (like tables and
time-series) easy and efficient. Pandas is widely used in data science, machine learning, and
analytics due to its versatility and high-level abstractions for managing datasets.
Key Features of Pandas
   1. Data Structures:
          o Series: One-dimensional, similar to a column in Excel or a 1D NumPy array.
          o DataFrame: Two-dimensional, like a table with rows and columns.
   2. Data Manipulation:
          o Filtering, sorting, grouping, and aggregation.
   3. Integration:
          o Works seamlessly with other libraries like NumPy, Matplotlib, and Scikit-learn.
   4. Data I/O:
          o Read and write data from various formats like CSV, Excel, SQL, JSON, etc.
   5. Time-Series Support:
          o Provides functionality for analyzing and processing time-series data.
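The two core data structures above can be sketched in a few lines (the column names and values here are illustrative):

```python
import pandas as pd

# A Series is one-dimensional: a single labeled column of values
s = pd.Series([10, 20, 30], name="scores")

# A DataFrame is two-dimensional: a labeled table of rows and columns
df = pd.DataFrame({"name": ["Alice", "Bob"], "score": [85, 90]})

print(type(s).__name__, s.shape)    # Series (3,)
print(type(df).__name__, df.shape)  # DataFrame (2, 2)
```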
Applications of Pandas
1. Data Cleaning
Real-time Example:
      Task: Cleaning customer data by handling missing values.
import pandas as pd
# Sample data
data = {'Name': ['Alice', 'Bob', None, 'Eve'], 'Age': [25, 30, None, 40]}
df = pd.DataFrame(data)
# Handling missing values (assign back instead of inplace=True;
# chained inplace fillna on a column is deprecated in newer pandas)
df['Name'] = df['Name'].fillna('Unknown')
df['Age'] = df['Age'].fillna(df['Age'].mean())
print(df)
2. Financial Analysis
Real-time Example:
      Task: Analyzing stock market data.
import pandas as pd
# Loading sample stock data
df = pd.read_csv('https://example.com/stock_prices.csv', parse_dates=['Date'])
# Filtering for a specific company (copy() avoids SettingWithCopyWarning)
apple_stock = df[df['Company'] == 'Apple'].copy()
# Calculating moving average
apple_stock['Moving_Avg'] = apple_stock['Close'].rolling(window=20).mean()
print(apple_stock.head())
3. Exploratory Data Analysis (EDA)
Real-time Example:
      Task: Analyzing a dataset of sales.
# Loading sales data
sales_data = pd.read_csv('https://example.com/sales_data.csv')
# Grouping sales by region
region_sales = sales_data.groupby('Region')['Sales'].sum()
# Plotting the data
region_sales.plot(kind='bar', title='Sales by Region')
4. Time Series Analysis
Real-time Example:
      Task: Forecasting electricity demand based on past data.
# Loading time series data
df = pd.read_csv('https://example.com/electricity_demand.csv',
parse_dates=['Timestamp'])
# Resampling data to hourly averages ('h' replaces the deprecated 'H' alias)
hourly_demand = df.resample('h', on='Timestamp')['Demand'].mean()
print(hourly_demand.head())
5. Machine Learning Preprocessing
Real-time Example:
      Task: Preparing data for a machine learning model.
# Loading data
data = pd.read_csv('https://example.com/housing_data.csv')
# Dropping irrelevant columns
data.drop(['ID'], axis=1, inplace=True)
# Encoding categorical features
data = pd.get_dummies(data, columns=['City'], drop_first=True)
# Normalizing numerical features
data['Price'] = (data['Price'] - data['Price'].mean()) / data['Price'].std()
print(data.head())
6. Web Scraping and Analysis
Real-time Example:
      Task: Scraping live product prices and analyzing them.
import pandas as pd
import requests
from bs4 import BeautifulSoup
# Scraping data from a website
response = requests.get('https://example.com/products')
soup = BeautifulSoup(response.text, 'html.parser')
# Extracting product names and prices
products = {'Name': [], 'Price': []}
for product in soup.select('.product-item'):
    products['Name'].append(product.select_one('.name').text)
    products['Price'].append(float(product.select_one('.price').text.strip('$')))
df = pd.DataFrame(products)
# Analyzing product prices
print(df.describe())
Why Use Pandas?
      Handles large datasets efficiently.
      Provides intuitive data manipulation tools.
      Simplifies working with different data formats.
      Integrates well with visualization and machine learning tools.
What Can Pandas Do?
      Pandas gives you answers about the data, such as:
      Is there a correlation between two or more columns?
      What is the average value?
      What is the maximum value?
      What is the minimum value?
      Pandas can also delete rows that are not relevant or that contain wrong values, such as
       empty or NULL values. This is called cleaning the data.
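The questions above map directly onto built-in methods; a small sketch with made-up study-hours data:

```python
import pandas as pd

# Hypothetical study-hours vs. marks data, just to illustrate the questions above
df = pd.DataFrame({"hours": [1, 2, 3, 4], "marks": [40, 55, 65, 80]})

print(df["hours"].corr(df["marks"]))  # correlation between two columns
print(df["marks"].mean())             # average value -> 60.0
print(df["marks"].max())              # max value -> 80
print(df["marks"].min())              # min value -> 40
```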
Pandas Series:
A Pandas Series can be defined as a one-dimensional labeled array that is capable of storing
various data types. We can easily convert a list, tuple, or dictionary into a Series using the
Series() constructor. The row labels of a Series are called the index. A Series cannot contain
multiple columns. It has the following parameters:
      data: It can be any list, dictionary, or scalar value.
      index: The values of the index should be unique and hashable. It must be of the same
       length as data. If we do not pass any index, a default RangeIndex (0 to n-1) is used.
      dtype: It refers to the data type of the series.
      copy: It is used for copying the data.
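A quick sketch passing all four parameters in one call (the values and labels are illustrative):

```python
import pandas as pd

s = pd.Series(data=[10, 20, 30],
              index=['a', 'b', 'c'],  # labels, same length as data
              dtype='float64',        # element type
              copy=True)              # copy the input instead of referencing it
print(s)
```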
Creating a Series:
We can create a Series in two ways:
   1. Create an empty Series
   2. Create a Series using inputs.
Create an Empty Series:
We can easily create an empty Series in Pandas, which means it will not have any values.
The syntax that is used for creating an Empty Series:
1. <series object> = pandas.Series()
   The below example creates an Empty Series object that has no values. In pandas 2.0 and
   later the default datatype of an empty Series is object (older versions defaulted to
   float64), so it is good practice to pass dtype explicitly.
     Example
1. import pandas as pd
2. x = pd.Series(dtype='float64')
3. print (x)
   Output
     Series([], dtype: float64)
     Creating a Series using inputs:
     We can create Series by using various inputs:
         o    Array
         o    Dict
         o    Scalar value
     Creating Series from Array:
     Before creating a Series, firstly, we have to import the numpy module and then use array()
     function in the program. If the data is ndarray, then the passed index must be of the same
     length.
     If we do not pass an index, then by default an index of range(n) is used, where n is the
     length of the array, i.e., [0, 1, 2, ..., len(array)-1].
     Example
1. import pandas as pd
2.   import numpy as np
3.   info = np.array(['P','a','n','d','a','s'])
4.   a = pd.Series(info)
5.   print(a)
     Output
     0    P
     1    a
     2    n
     3    d
     4    a
     5    s
     dtype: object
     Create a Series from dict
     We can also create a Series from a dict. If a dictionary object is passed as input and
     the index is not specified, then the dictionary keys are taken in insertion order to
     construct the index (before pandas 0.23 they were taken in sorted order).
     If an index is passed, then the values corresponding to each label in the index will be
     extracted from the dictionary.
1. #import the pandas library
2.   import pandas as pd
3.   import numpy as np
4.   info = {'x' : 0., 'y' : 1., 'z' : 2.}
5.   a = pd.Series(info)
6.   print (a)
     Output
     x    0.0
     y    1.0
     z    2.0
     dtype: float64
     Create a Series using Scalar:
     If we take the scalar values, then the index must be provided. The scalar value will be
     repeated for matching the length of the index.
1. #import pandas library
2.   import pandas as pd
3.   import numpy as np
4.   x = pd.Series(4, index=[0, 1, 2, 3])
5.   print (x)
     Output
     0    4
     1    4
     2    4
     3    4
     dtype: int64
     Accessing data from series with Position:
     Once you create the Series type object, you can access its indexes, data, and even
     individual elements.
     The data in the Series can be accessed similar to that in the ndarray.
1. import pandas as pd
2. x = pd.Series([1,2,3],index = ['a','b','c'])
3. #retrieve the first element by position (x[0] with a label index is deprecated)
4. print (x.iloc[0])
   Output
1
Series object attributes
The Series attribute is defined as any information related to the Series object such as size,
datatype. etc. Below are some of the attributes that you can use to get the information about
the Series object:
 Attributes             Description
 Series.index           Defines the index of the Series.
 Series.shape           It returns a tuple of the shape of the data.
 Series.dtype           It returns the data type of the data.
 Series.size            It returns the size of the data.
 Series.empty           It returns True if the Series object is empty, otherwise returns False.
 Series.hasnans         It returns True if there are any NaN values, otherwise returns False.
 Series.nbytes          It returns the number of bytes in the data.
 Series.ndim            It returns the number of dimensions in the data.
 Series.dtype.itemsize  It returns the size in bytes of one item (the old Series.itemsize
                        attribute was removed in pandas 1.0).
Retrieving Index array and data array of a series object
We can retrieve the index array and data array of an existing Series object by using the
attributes index and values.
1. import numpy as np
2.   import pandas as pd
3.   x=pd.Series(data=[2,4,6,8])
4.   y=pd.Series(data=[11.2,18.6,22.5], index=['a','b','c'])
5.   print(x.index)
6.   print(x.values)
7.   print(y.index)
8.   print(y.values)
     Output
     RangeIndex(start=0, stop=4, step=1)
     [2 4 6 8]
     Index(['a', 'b', 'c'], dtype='object')
     [11.2 18.6 22.5]
     Retrieving Type (dtype) and Item Size (dtype.itemsize)
     You can use the dtype attribute of a Series object to retrieve the data type of its
     elements. To see the number of bytes allocated to each data item, use dtype.itemsize
     (the old Series.itemsize attribute was removed in pandas 1.0).
1. import numpy as np
2.   import pandas as pd
3.   a=pd.Series(data=[1,2,3,4])
4.   b=pd.Series(data=[4.9,8.2,5.6],
5.   index=['x','y','z'])
6.   print(a.dtype)
7.   print(a.dtype.itemsize)
8.   print(b.dtype)
9.   print(b.dtype.itemsize)
     Output
     int64
     8
     float64
     8
     Retrieving Shape
     The shape of the Series object defines the total number of elements, including missing or
     empty values (NaN).
1. import numpy as np
2.   import pandas as pd
3.   a=pd.Series(data=[1,2,3,4])
4.   b=pd.Series(data=[4.9,8.2,5.6],index=['x','y','z'])
5.   print(a.shape)
6.   print(b.shape)
     Output
     (4,)
     (3,)
     Retrieving Dimension, Size and Number of bytes:
1. import numpy as np
2.   import pandas as pd
3.   a=pd.Series(data=[1,2,3,4])
4.   b=pd.Series(data=[4.9,8.2,5.6],
5.   index=['x','y','z'])
6.   print(a.ndim, b.ndim)
7.   print(a.size, b.size)
8.   print(a.nbytes, b.nbytes)
     Output
     1 1
     4 3
     32 24
     Checking Emptiness and Presence of NaNs
     To check whether a Series object is empty, you can use the empty attribute. Similarly, to
     check if a Series object contains any NaN values, you can use the hasnans attribute.
     Example
1. import numpy as np
2.   import pandas as pd
3.   a=pd.Series(data=[1,2,3,np.nan])
4.   b=pd.Series(data=[4.9,8.2,5.6],index=['x','y','z'])
5.   c=pd.Series(dtype='float64')
6.   print(a.empty,b.empty,c.empty)
7.   print(a.hasnans,b.hasnans,c.hasnans)
8.   print(len(a),len(b))
9.   print(a.count(),b.count())
     Output
     False False True
     True False False
     4 3
     3 3
Series Functions
There are some functions used in Series which are as follows:
 Pandas Series.map()             Substitutes each value in the Series using a mapping
                                 (a dict, another Series, or a function).
 Pandas Series.std()             Calculates the standard deviation of the given set of numbers.
 Pandas Series.to_frame()        Converts the Series object to a DataFrame.
 Pandas Series.value_counts()    Returns a Series containing counts of unique values.
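A short sketch exercising each of the four functions above (the data is made up):

```python
import pandas as pd

s = pd.Series(['cat', 'dog', 'cat'])
print(s.map({'cat': 0, 'dog': 1}))   # substitute each value via a mapping
print(s.value_counts())              # counts of unique values: cat 2, dog 1

n = pd.Series([2, 4, 6, 8])
print(n.std())        # sample standard deviation
print(n.to_frame())   # Series -> one-column DataFrame
```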
Python DataFrame: Reading CSV and JSON, and Performing Analysis Functions
Python's pandas library provides powerful tools for handling, manipulating, and analyzing
structured data.
1. Python DataFrame: Reading CSV
Definition
pd.read_csv(): Reads a comma-separated values (CSV) file into a DataFrame.
CSV files are widely used for storing tabular data in various fields such as finance, healthcare,
and e-commerce.
Real-Time Scenario
Finance: Reading a CSV containing stock market data to analyze trends.
E-commerce: Reading product sales data for generating reports.
Example: Reading CSV and Basic Operations
import pandas as pd
# Reading a CSV file
df = pd.read_csv("sales_data.csv")
# Displaying first 5 rows
print(df.head())
# Scenario: Calculate total sales
total_sales = df["sales_amount"].sum()
print(f"Total Sales: {total_sales}")
2. Python DataFrame: Reading JSON
Definition
pd.read_json(): Reads a JSON file into a DataFrame.
JSON is a popular format for transmitting data in web applications and APIs.
Real-Time Scenario
Web Development: Reading user details from a JSON API response.
Social Media Analysis: Reading JSON containing user activity for engagement reports.
Example: Reading JSON and Basic Operations
import pandas as pd
# Reading a JSON file
df = pd.read_json("user_data.json")
# Displaying first 5 rows
print(df.head())
# Scenario: Filter users above age 30
filtered_users = df[df["age"] > 30]
print(filtered_users)
3. Python DataFrame: Analysis Functions
Definition
Pandas provides a wide range of functions to analyze and manipulate data, such as
summarization, filtering, grouping, and visualization.
Real-Time Scenario
Healthcare: Summarizing patient data for trend analysis.
Marketing: Grouping customer purchases by region for targeted campaigns.
Analysis Functions
Summarization Functions
df.describe(): Provides statistical summary.
df.mean(), df.sum(), df.count(): Calculate mean, sum, or count of values.
Example: Statistical Summary
df = pd.read_csv("employee_data.csv")
print(df.describe()) # Summary of numerical columns
# Scenario: Calculate average salary
avg_salary = df["salary"].mean()
print(f"Average Salary: {avg_salary}")
Filtering and Querying
df.loc[]: Select rows and columns by label.
df[df["column_name"] > value]: Conditional filtering.
Example: Filter Data
# Scenario: Employees with salary > 50000
high_salary = df[df["salary"] > 50000]
print(high_salary)
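The list above mentions df.loc[], but the example only shows boolean filtering; a minimal sketch of label-based selection, using a small hypothetical frame (df2) so as not to clash with the employee data:

```python
import pandas as pd

# A minimal frame with a label index (names and salaries are illustrative)
df2 = pd.DataFrame({"salary": [45000, 60000]}, index=["emp1", "emp2"])

print(df2.loc["emp2", "salary"])       # single cell by row and column label
print(df2.loc[df2["salary"] > 50000])  # .loc also accepts boolean masks
```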
Grouping and Aggregation
df.groupby(): Groups data by specified columns and applies aggregation functions.
Example: Group Sales by Region
# Scenario: Total sales by region
grouped_sales = df.groupby("region")["sales_amount"].sum()
print(grouped_sales)
Sorting
df.sort_values(): Sorts the DataFrame by specified columns.
Example: Sort Employees by Salary
sorted_employees = df.sort_values(by="salary", ascending=False)
print(sorted_employees)
Handling Missing Data
df.isnull(): Checks for missing values.
df.fillna(): Fills missing values with a specified value.
df.dropna(): Drops rows/columns with missing values.
Example: Handle Missing Values
# Scenario: Replace missing salaries with 30000
df["salary"] = df["salary"].fillna(30000)
print(df)
Merging and Joining
pd.merge(): Merges two DataFrames.
df.join(): Joins DataFrames on indices.
Example: Merge Employee and Department Data
departments = pd.DataFrame({"dept_id": [1, 2], "dept_name": ["HR", "Finance"]})
merged_df = pd.merge(df, departments, left_on="dept_id", right_on="dept_id")
print(merged_df)
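For completeness, df.join() (mentioned above but not demonstrated) aligns on the index rather than on columns; a minimal sketch with illustrative employee ids:

```python
import pandas as pd

# Two frames sharing an index of employee ids (illustrative data)
employees = pd.DataFrame({"name": ["Alice", "Bob"]}, index=[1, 2])
salaries = pd.DataFrame({"salary": [50000, 60000]}, index=[1, 2])

# join() aligns rows on the index (a left join by default)
joined = employees.join(salaries)
print(joined)
```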
Visualization
df.plot(): Generates basic plots.
df.hist(): Creates histograms.
Example: Plot Sales Data
import matplotlib.pyplot as plt
# Scenario: Sales Trend
df["sales_amount"].plot(kind="line")
plt.title("Sales Trend")
plt.show()
Functions Summary
 Function           Purpose                                       Example Use Case
 pd.read_csv()      Load data from a CSV file into a DataFrame.   Load sales or employee data.
 pd.read_json()     Load data from a JSON file into a DataFrame.  Load API response for user activity.
 df.describe()      Statistical summary of numerical columns.     Summarize patient statistics.
 df.groupby()       Group data and apply aggregation functions.   Calculate total sales per region.
 df.sort_values()   Sort data by specified columns.               Rank employees by salary.
 df.fillna()        Fill missing values.                          Replace missing product prices.
 df.plot()          Visualize data using basic plots.             Analyze sales trends over months.
Data Cleaning Functions in Python DataFrames
Data cleaning is a crucial step in preparing datasets for analysis. Pandas provides several
functions to clean and preprocess data. Below is a detailed explanation of key data-cleaning
techniques, real-time scenarios, and example codes.
Common Data Issues and Pandas Cleaning Functions
 Issue                    Function/Technique                        Description
 Missing Values           isnull(), notnull(), fillna(), dropna()   Identify, fill, or remove missing data.
 Duplicate Rows           duplicated(), drop_duplicates()           Detect and remove duplicate rows.
 Incorrect Data Types     astype(), to_datetime()                   Convert data to appropriate types.
 Outliers                 clip(), replace(), filtering              Handle extreme values.
 Invalid Values           Filtering, apply(), replace()             Replace or correct invalid entries.
 Inconsistent Formatting  str.lower(), str.strip(), str.replace()   Standardize text data for consistency.
 Removing Unwanted Data   Filtering, drop()                         Drop irrelevant rows or columns.
1. Handling Missing Data
Scenario: A sales dataset has missing values for revenue.
import pandas as pd
# Sample DataFrame with missing values
data = {
    "Product": ["A", "B", "C", None],
    "Sales": [100, 200, None, 150],
    "Revenue": [500, None, 300, 400],
}
df = pd.DataFrame(data)
# Identify missing values
print(df.isnull())
# Fill missing values
df["Revenue"] = df["Revenue"].fillna(df["Revenue"].mean())
# Drop rows with missing Product
df = df.dropna(subset=["Product"])
print(df)
Functions
      isnull(): Checks for missing values.
      fillna(value): Replaces missing values with a specified value.
      dropna(): Removes rows or columns with missing values.
2. Removing Duplicates
Scenario: A customer dataset has duplicate entries.
# Sample DataFrame with duplicates
data = {"Customer": ["Alice", "Bob", "Alice"], "Purchase": [200, 300, 200]}
df = pd.DataFrame(data)
# Detect duplicates
print(df.duplicated())
# Remove duplicates
df = df.drop_duplicates()
print(df)
Functions
      duplicated(): Identifies duplicate rows.
      drop_duplicates(): Removes duplicate rows.
3. Converting Data Types
Scenario: Date data is in string format and needs conversion.
data = {"Date": ["2024-12-01", "2024-12-02", "2024-12-03"],
        "Sales": ["100", "200", "300"]}
df = pd.DataFrame(data)
# Convert Sales to numeric
df["Sales"] = pd.to_numeric(df["Sales"])
# Convert Date to datetime
df["Date"] = pd.to_datetime(df["Date"])
print(df.dtypes)
Functions
      astype(type): Converts a column to the specified type.
      pd.to_datetime(): Converts a column to datetime format.
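astype() is listed above, but the example uses pd.to_numeric(); a one-line sketch of the cast on hypothetical string data:

```python
import pandas as pd

s = pd.Series(["100", "200", "300"])

# astype() casts a column to a given type
print(s.astype(int).tolist())   # [100, 200, 300]
```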
4. Handling Outliers
Scenario: Sales data contains extreme outliers.
data = {"Sales": [100, 200, 300, 10000]}
df = pd.DataFrame(data)
# Cap sales at 500
df["Sales"] = df["Sales"].clip(upper=500)
print(df)
Functions
      clip(lower, upper): Limits values within a specified range.
      replace(): Replaces specified values.
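A quick sketch contrasting the two functions on the same outlier data as above:

```python
import pandas as pd

s = pd.Series([100, 200, 300, 10000])

# replace() swaps exact values; clip() caps a whole range
print(s.replace(10000, 300).tolist())          # [100, 200, 300, 300]
print(s.clip(lower=150, upper=500).tolist())   # [150, 200, 300, 500]
```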
5. Replacing Invalid or Incorrect Values
Scenario: Age column has invalid negative values.
data = {"Name": ["Alice", "Bob"], "Age": [25, -5]}
df = pd.DataFrame(data)
# Clamp negative ages to 0
df["Age"] = df["Age"].apply(lambda x: max(x, 0))
print(df)
Functions
      replace(to_replace, value): Replaces values based on conditions.
      apply(func): Applies a custom function to transform data.
6. Standardizing Text
Scenario: Product names have inconsistent capitalization and whitespace.
data = {"Product": [" apple", "Orange         ", "BANANA"]}
df = pd.DataFrame(data)
# Standardize text
df["Product"] = df["Product"].str.strip().str.lower()
print(df)
Functions
      str.strip(): Removes leading/trailing whitespace.
      str.lower(): Converts text to lowercase.
      str.replace(pattern, replacement): Replaces text based on a pattern.
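str.replace() is listed above but not demonstrated; a minimal sketch with hypothetical product codes:

```python
import pandas as pd

s = pd.Series(["apple_red", "apple_green"])

# Literal replacement; pass regex=True to treat the pattern as a regex instead
print(s.str.replace("_", " ", regex=False).tolist())
```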
7. Dropping Irrelevant Data
Scenario: Drop unnecessary columns like "ID".
data = {"ID": [1, 2, 3], "Name": ["Alice", "Bob", "Charlie"],
        "Score": [85, 90, 95]}
df = pd.DataFrame(data)
# Drop ID column
df = df.drop(columns=["ID"])
print(df)
Functions
      drop(columns): Removes specified columns.
      drop(index): Removes specified rows.
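The example above drops a column; dropping rows works the same way through the index parameter (illustrative data):

```python
import pandas as pd

df3 = pd.DataFrame({"Name": ["Alice", "Bob", "Charlie"]})

# Dropping a row by index label (the default RangeIndex here)
print(df3.drop(index=[0]))
```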
8. Applying Filters
Scenario: Retain rows where revenue > 300.
data = {"Product": ["A", "B", "C"], "Revenue": [200, 400, 300]}
df = pd.DataFrame(data)
# Filter rows
filtered_df = df[df["Revenue"] > 300]
print(filtered_df)
9. Handling Categorical Data
Scenario: Replace categorical values with labels.
data = {"Gender": ["Male", "Female", "Male"]}
df = pd.DataFrame(data)
# Replace categories with numeric labels
df["Gender"] = df["Gender"].replace({"Male": 0, "Female": 1})
print(df)
Summary of Functions
 Function            Use Case                                 Real-Time Scenario
 isnull()            Identify missing values                  Detect missing survey responses.
 fillna()            Fill missing data                        Replace missing prices with the average.
 dropna()            Remove rows/columns with missing data    Drop incomplete customer records.
 duplicated()        Detect duplicate rows                    Find duplicate orders in e-commerce data.
 drop_duplicates()   Remove duplicate rows                    Clean duplicate customer entries.
 astype()            Convert column data types                Convert numeric strings to integers.
 replace()           Replace specific values                  Replace "NA" with a default value in a column.
 clip()              Cap outliers                             Limit revenue to a specific range.
 str.strip()         Remove extra spaces                      Clean up messy product names.
 drop()              Drop irrelevant data                     Remove ID columns for analysis.