Introduction to Pandas: Data Manipulation in Python
1. Installing Pandas
Pandas is a powerful Python library specifically designed for handling structured data. It
simplifies tasks like data cleaning, transformation, and analysis by providing user-friendly data
structures and functions. To begin using it, you first need to install the library.
Example:
You can easily install Pandas using pip, Python's package manager.
pip install pandas
After installation, you can import it and check the version to confirm it's ready to use.
import pandas as pd
print(f"Pandas version: {pd.__version__}")
Output (the version number depends on your installation):
Pandas version: 2.1.4
2. Understanding Pandas Data Structures
The two fundamental data structures in Pandas are the Series and the DataFrame. A Series is a
one-dimensional, array-like object with a labeled index. It can hold any data type, but each
Series has a single dtype. A DataFrame is a two-dimensional, table-like structure, comparable
to a spreadsheet or a SQL table.
Example:
A Series can represent a single column of data.
A DataFrame is a collection of Series objects, where each Series represents a column.
import pandas as pd
# Creating a Series for a list of daily temperatures
temperatures = pd.Series([25, 27, 24, 26])
print(temperatures)
# Creating a DataFrame for student data
student_data = pd.DataFrame({
    'Student_ID': [101, 102],
    'Score': [85, 92]
})
print(student_data)
Output:
0    25
1    27
2    24
3    26
dtype: int64
   Student_ID  Score
0         101     85
1         102     92
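The one-dimensional versus two-dimensional distinction can be checked directly. The following is a minimal sketch that reuses the objects from the example above; the variable name score_column is introduced here only for illustration.

```python
import pandas as pd

temperatures = pd.Series([25, 27, 24, 26])
student_data = pd.DataFrame({"Student_ID": [101, 102], "Score": [85, 92]})

# ndim reports the number of dimensions; shape reports the size along each one
print(temperatures.ndim, temperatures.shape)    # 1 (4,)
print(student_data.ndim, student_data.shape)    # 2 (2, 2)

# Selecting one column of the DataFrame returns a Series
score_column = student_data["Score"]
print(type(score_column).__name__)              # Series
```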
3. Different Ways to Create a Series Object
A Pandas Series can be constructed from several different types of data sources, making it
highly versatile.
Example:
import pandas as pd
import numpy as np
# From a simple list
fruits = pd.Series(["apple", "banana", "orange"])
# From a NumPy array
np_array = np.array([10, 20, 30])
numbers = pd.Series(np_array)
# From a Python dictionary
# Keys become the index labels, and values become the data
product_prices = pd.Series({"Laptop": 1200, "Mouse": 25, "Keyboard": 75})
# From a single scalar value, repeated for a given index
# The value '50' is assigned to each index label
single_value_series = pd.Series(50, index=["item1", "item2", "item3"])
print(fruits)
print(numbers)
print(product_prices)
print(single_value_series)
Output:
0     apple
1    banana
2    orange
dtype: object
0    10
1    20
2    30
dtype: int64
Laptop      1200
Mouse         25
Keyboard      75
dtype: int64
item1    50
item2    50
item3    50
dtype: int64
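When a dictionary is combined with an explicit index, Pandas looks up each index label in the dictionary. The sketch below illustrates this; the "Monitor" label is a hypothetical entry added only to show what happens when a label has no matching key.

```python
import pandas as pd

prices = {"Laptop": 1200, "Mouse": 25, "Keyboard": 75}

# Passing index= alongside a dictionary selects and orders entries by label.
# "Monitor" is not a key in the dictionary, so its value is NaN (missing),
# and the Series is stored as float64 to accommodate the NaN.
selected = pd.Series(prices, index=["Mouse", "Laptop", "Monitor"])
print(selected)
```

Note that "Keyboard" is dropped because it does not appear in the requested index.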
4. Series as a Specialized NumPy Array
A Series can be seen as an enhanced version of a NumPy array. While it shares core features
like vectorized operations, it adds the crucial element of a labeled index, which allows for
more intuitive data access and alignment.
Example:
The .values attribute of a Series provides access to the underlying NumPy array, while the
.index attribute reveals the added labels.
import pandas as pd
gpa_scores = pd.Series([3.8, 3.5, 4.0], index=["A-1", "A-2", "A-3"])
# The core values (like a NumPy array)
print(f"Series values: {gpa_scores.values}")
# The labeled index (the extra feature)
print(f"Series index: {gpa_scores.index}")
Output:
Series values: [3.8 3.5 4. ]
Series index: Index(['A-1', 'A-2', 'A-3'], dtype='object')
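Vectorized operations and label alignment can be seen together in a short sketch. The bonus Series and its values are hypothetical and exist only to demonstrate that arithmetic between two Series matches values by index label rather than by position.

```python
import pandas as pd

gpa_scores = pd.Series([3.8, 3.5, 4.0], index=["A-1", "A-2", "A-3"])

# Vectorized arithmetic applies to every element, as with a NumPy array
percent = gpa_scores / 4.0 * 100
print(percent)

# Hypothetical adjustments, listed in a different order than gpa_scores
bonus = pd.Series([0.1, 0.2, 0.3], index=["A-3", "A-1", "A-2"])

# Addition pairs values by index label, not by position:
# "A-1" receives 0.2, "A-2" receives 0.3, and "A-3" receives 0.1
adjusted = gpa_scores + bonus
print(adjusted)
```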
5. Series as a Specialized Dictionary
A Series acts similarly to a Python dictionary, where the index labels serve as keys and the
data values are the associated values. This allows for quick and efficient data retrieval using
familiar dictionary-style syntax.
Example:
You can access data points in a Series using their index label, just as you would use a key to
look up a value in a dictionary.
import pandas as pd
city_populations = pd.Series([1000000, 250000, 500000], index=["Tokyo", "London", "Paris"])
# Accessing the population of "London"
print(f"Population of London: {city_populations['London']}")
Output:
Population of London: 250000
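Several other dictionary habits carry over as well. The sketch below is illustrative rather than exhaustive; the "Berlin" entry and its value are hypothetical.

```python
import pandas as pd

city_populations = pd.Series(
    [1000000, 250000, 500000], index=["Tokyo", "London", "Paris"]
)

# Membership tests check the index labels, like dictionary keys
print("Paris" in city_populations)          # True
print("Berlin" in city_populations)         # False

# .get() returns a default instead of raising KeyError for a missing label
print(city_populations.get("Berlin", 0))    # 0

# Assigning to a new label adds an entry, as with a dictionary
city_populations["Berlin"] = 750000         # hypothetical value
print(list(city_populations.keys()))        # ['Tokyo', 'London', 'Paris', 'Berlin']
```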
6. Understanding DataFrame Objects
A Pandas DataFrame is the most widely used data structure in Pandas. It’s a two-dimensional,
mutable table of data with labeled axes (rows and columns). It’s essentially a container for
multiple Series objects that share the same index.
Example:
import pandas as pd
# Creating a DataFrame from a dictionary of lists
# Each list becomes a column in the table
employee_data = {
    'Employee_ID': [1, 2, 3],
    'Department': ['IT', 'HR', 'Finance']
}
employee_df = pd.DataFrame(employee_data)
print(employee_df)
Output:
   Employee_ID  Department
0            1          IT
1            2          HR
2            3     Finance
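The labeled axes, the shared index, and mutability can each be inspected directly. This is a minimal sketch built on the same employee data; the "Remote" column is a hypothetical addition used only to show in-place modification.

```python
import pandas as pd

employee_df = pd.DataFrame({
    "Employee_ID": [1, 2, 3],
    "Department": ["IT", "HR", "Finance"],
})

# The two labeled axes: row labels (index) and column labels (columns)
print(list(employee_df.index))      # [0, 1, 2]
print(list(employee_df.columns))    # ['Employee_ID', 'Department']

# Each column is a Series that carries the same row index as the DataFrame
print(employee_df["Department"].index.equals(employee_df.index))    # True

# DataFrames are mutable: assigning to a new column name adds a column
employee_df["Remote"] = [True, False, True]    # hypothetical column
print(employee_df.shape)            # (3, 3)
```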
7. DataFrame as a Specialized NumPy Array
Just as a Series extends a NumPy array, a DataFrame can be viewed as an extended
two-dimensional NumPy array. It not only contains a grid of data but also provides labels for
both rows and columns, making it much easier to work with.
Example:
You can create a DataFrame from a NumPy array and then add meaningful labels for the
columns and rows.
import numpy as np
import pandas as pd
# A 2x3 NumPy array
np_matrix = np.array([[10, 20, 30], [40, 50, 60]])
# Creating a DataFrame with column and row labels
df_from_array = pd.DataFrame(
    np_matrix,
    columns=["Col A", "Col B", "Col C"],
    index=["Row 1", "Row 2"],
)
print(df_from_array)
Output:
       Col A  Col B  Col C
Row 1     10     20     30
Row 2     40     50     60
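The array-like behavior remains available after labels are attached. The following sketch, which rebuilds the same DataFrame, shows a few array-style attributes next to a label-based lookup with .loc.

```python
import numpy as np
import pandas as pd

np_matrix = np.array([[10, 20, 30], [40, 50, 60]])
df_from_array = pd.DataFrame(
    np_matrix, columns=["Col A", "Col B", "Col C"], index=["Row 1", "Row 2"]
)

# Array-style attributes are still available
print(df_from_array.shape)      # (2, 3)
print(df_from_array.T.shape)    # (3, 2) after transposing

# Labels replace integer positions when looking up a cell
print(df_from_array.loc["Row 2", "Col B"])    # 50

# to_numpy() returns the values as a plain NumPy array
print(np.array_equal(df_from_array.to_numpy(), np_matrix))    # True
```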
8. DataFrame as a Specialized Dictionary
A DataFrame can also be understood as a dictionary where the keys are the column names
and the values are the corresponding Series objects. This means you can access a column
using dictionary-like syntax.
Example:
Accessing a specific column from a DataFrame is straightforward using bracket notation.
import pandas as pd
dataset = pd.DataFrame({
    'Product': ['Phone', 'Tablet'],
    'Price': [800, 450]
})
# Accessing the 'Price' column
prices = dataset['Price']
print(f"The prices are: \n{prices}")
Output:
The prices are:
0    800
1    450
Name: Price, dtype: int64
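One consequence of the dictionary view is that bracket notation looks up column names, not row positions. The sketch below illustrates this with the same dataset.

```python
import pandas as pd

dataset = pd.DataFrame({"Product": ["Phone", "Tablet"], "Price": [800, 450]})

# Membership tests and keys() operate on column names, like a dictionary
print("Price" in dataset)       # True
print("Phone" in dataset)       # False: "Phone" is a value, not a column name
print(list(dataset.keys()))     # ['Product', 'Price']

# Brackets look up a column, so an integer that is not a column name fails
try:
    dataset[0]
except KeyError:
    print("0 is not a column name")
```

To select a row by its label, use dataset.loc[0] instead.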
9. Constructing DataFrame Objects (Multiple Methods)
DataFrames are incredibly flexible and can be created from a wide variety of data sources.
Here are some of the most common methods.
(a) From a Single Series
A single Series can be converted directly into a DataFrame. The Series' index becomes the
DataFrame's row index, and its values become a single column labeled with the Series' name
(here, Exam_Scores).
import pandas as pd
scores_series = pd.Series([95, 88, 72], name="Exam_Scores")
scores_df = pd.DataFrame(scores_series)
print(scores_df)
Output:
   Exam_Scores
0           95
1           88
2           72
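The same conversion is also available as a method on the Series itself. This brief sketch compares Series.to_frame() with the constructor call used above.

```python
import pandas as pd

scores_series = pd.Series([95, 88, 72], name="Exam_Scores")

# Series.to_frame() builds the same one-column DataFrame as pd.DataFrame(series)
scores_df = scores_series.to_frame()
print(scores_df.equals(pd.DataFrame(scores_series)))    # True

# The Series name became the column label
print(list(scores_df.columns))    # ['Exam_Scores']
```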
(b) From a List of Dictionaries
This is a very common method, where each dictionary in the list represents a single row, and
the dictionary keys become the column names.
import pandas as pd
project_members = [
    {"Name": "Alex", "Role": "Developer"},
    {"Name": "Ben", "Role": "Designer"},
    {"Name": "Chris", "Role": "Manager"}
]
project_df = pd.DataFrame(project_members)
print(project_df)
Output:
    Name       Role
0   Alex  Developer
1    Ben   Designer
2  Chris    Manager
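The dictionaries do not all need the same keys. In the sketch below, one record omits "Role" and another adds a hypothetical "Team" key; Pandas uses the union of the keys as columns and marks the gaps as missing.

```python
import pandas as pd

project_members = [
    {"Name": "Alex", "Role": "Developer"},
    {"Name": "Ben"},                                        # no "Role" key
    {"Name": "Chris", "Role": "Manager", "Team": "Core"},   # extra "Team" key
]
project_df = pd.DataFrame(project_members)

# Columns are the union of all keys, in order of first appearance
print(list(project_df.columns))             # ['Name', 'Role', 'Team']

# Count the missing entries in each column
print(project_df.isna().sum().to_dict())    # {'Name': 0, 'Role': 1, 'Team': 2}
```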
(c) From a Dictionary of Series Objects
By using a dictionary where the keys are column names and the values are Series objects, you
can build a DataFrame with aligned columns.
import pandas as pd
# Creating two Series with a shared index
units = pd.Series([150, 200], index=["Q1", "Q2"])
revenue = pd.Series([5000, 7500], index=["Q1", "Q2"])
sales_report = pd.DataFrame({"Units_Sold": units, "Total_Revenue": revenue})
print(sales_report)
Output:
    Units_Sold  Total_Revenue
Q1         150           5000
Q2         200           7500
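Alignment matters most when the Series do not share exactly the same index. The sketch below changes the example data for illustration: the "Q3" entry and the reordered revenue Series are hypothetical.

```python
import pandas as pd

units = pd.Series([150, 200, 180], index=["Q1", "Q2", "Q3"])
revenue = pd.Series([7500, 5000], index=["Q2", "Q1"])   # different order, no Q3

# Values are matched by index label; labels missing from a Series become NaN
sales_report = pd.DataFrame({"Units_Sold": units, "Total_Revenue": revenue})
print(sales_report)
```

Here "Q1" is paired with 5000 and "Q2" with 7500 regardless of the order in which they were listed, and the Total_Revenue entry for "Q3" is NaN.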
(d) From a Two-Dimensional NumPy Array
A 2D NumPy array can be used as the foundation for a DataFrame. You can pass column and row
labels for better readability. The example below sets only the column labels, so the rows keep
the default integer index.
import numpy as np
import pandas as pd
data_array = np.array([[1, 2, 3], [4, 5, 6]])
dataset_df = pd.DataFrame(data_array, columns=["A", "B", "C"])
print(dataset_df)
Output:
   A  B  C
0  1  2  3
1  4  5  6
(e) From a NumPy Structured Array
This method is useful when you have data with a mix of data types (e.g., numbers and strings)
that you want to organize into a DataFrame.
import numpy as np
import pandas as pd
# A structured array with a defined data type for each field
employee_info = np.array([
    (101, "John", 60000),
    (102, "Jane", 75000)
], dtype=[("ID", "i4"), ("Name", "U10"), ("Salary", "i4")])
employee_info_df = pd.DataFrame(employee_info)
print(employee_info_df)
Output:
    ID  Name  Salary
0  101  John   60000
1  102  Jane   75000
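The column labels in this output come from the field names declared in the array's dtype. The following sketch, using the same structured array, shows where those names live and how they map onto the DataFrame.

```python
import numpy as np
import pandas as pd

employee_info = np.array(
    [(101, "John", 60000), (102, "Jane", 75000)],
    dtype=[("ID", "i4"), ("Name", "U10"), ("Salary", "i4")],
)

# Field names are stored on the array's dtype
print(employee_info.dtype.names)     # ('ID', 'Name', 'Salary')

# Indexing by field name returns that field for every record
print(employee_info["Salary"])       # [60000 75000]

# The DataFrame's column labels come from those field names
employee_info_df = pd.DataFrame(employee_info)
print(list(employee_info_df.columns) == list(employee_info.dtype.names))    # True
```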