Data Manipulation with Pandas and NumPy
Dr. Nana Yaw Duodu
Computer Science Department
Accra Technical University
DATA MANIPULATION COMPUTER SCIENCE DEPARTMENT
7/15/2025 FACULTY OF APPLIED SCIENCES 2
• Data Manipulation is a core skill in data science, enabling analysts and
scientists to clean, reshape, and prepare data for analysis.
• Definition
“Data manipulation is the process of organizing or arranging data in order
to make it easier to interpret.”
• Data manipulation is the process of transforming data to make it more useful for
analysis.
• It involves cleaning, filtering, merging, and reshaping data.
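The four operations named above can be sketched on a tiny table. This is only an illustration: the product names, prices, and column names are made up for the example and are not part of the course data.

```python
import pandas as pd

# Hypothetical sales records with a duplicate row and untidy product names
sales = pd.DataFrame({
    "product": [" shoes", "Shirts ", "Shirts ", "jeans", "bags"],
    "month": ["Jan", "Jan", "Jan", "Feb", "Feb"],
    "units": [120, 150, 150, 110, 90],
})
# Hypothetical price list used for the merge step
prices = pd.DataFrame({
    "product": ["Shoes", "Shirts", "Jeans", "Bags"],
    "price": [50.0, 20.0, 35.0, 40.0],
})

# Cleaning: normalise the text and drop exact duplicate rows
sales["product"] = sales["product"].str.strip().str.title()
clean = sales.drop_duplicates()

# Filtering: keep only rows with more than 100 units sold
filtered = clean[clean["units"] > 100]

# Merging: attach the unit price to each remaining sale
merged = filtered.merge(prices, on="product", how="left")

# Reshaping: one row per product, one column per month
reshaped = merged.pivot_table(index="product", columns="month", values="units")
print(reshaped)
```

Products with no sales in a given month show up as NaN in the reshaped table, which is itself a case of missing data to handle.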
IMPORTANCE OF DATA MANIPULATION
i. Enhancing data quality
ii. Extracting relevant information
iii. Transforming data structures
iv. Handling missing data
v. Creating derived variables
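As a small illustration of items iv and v, the sketch below fills a missing value and then derives a new column. The figures and column names are assumptions chosen for the example, and filling with the column mean is just one possible strategy.

```python
import numpy as np
import pandas as pd

# Hypothetical monthly sales with one missing February value
df = pd.DataFrame({
    "product": ["Shoes", "Shirts", "Jeans"],
    "jan": [120.0, 150.0, 100.0],
    "feb": [130.0, np.nan, 110.0],
})

# Handling missing data: count the gaps, then fill them with the column mean
missing_count = int(df["feb"].isna().sum())
df["feb"] = df["feb"].fillna(df["feb"].mean())

# Creating a derived variable: change in units from January to February
df["change"] = df["feb"] - df["jan"]

print(missing_count)
print(df)
```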
• Python offers powerful libraries like Pandas and NumPy, which simplify working with
structured and numerical data.
• To prepare your Python environment for data manipulation and numerical analysis,
import the Pandas and NumPy libraries.
• Pandas provides a wide range of functions for data manipulation,
including data selection, filtering, and aggregation.
• The library is designed to work with two main data structures:
Series and DataFrame.
• A Series is a one-dimensional array-like object that can hold any
data type, while a DataFrame is a two-dimensional table-like
structure that can hold multiple Series.
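A minimal sketch of selection, filtering, and aggregation follows. The student names, departments, and scores are invented for illustration.

```python
import pandas as pd

# Hypothetical example data
df = pd.DataFrame({
    "name": ["Ama", "Kofi", "Esi", "Yaw"],
    "dept": ["CS", "CS", "Maths", "Maths"],
    "score": [72, 85, 90, 64],
})

# Selection: a single column of a DataFrame is a Series
scores = df["score"]
print(type(scores).__name__)  # Series

# Filtering: keep only the rows that satisfy a condition
passed = df[df["score"] >= 70]

# Aggregation: average score per department
dept_mean = df.groupby("dept")["score"].mean()
print(dept_mean)
```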
• The import statements follow the syntax below.
The statement:
“import pandas as pd”
This imports the Pandas library and assigns it the alias pd, allowing you to
work with data structures like DataFrames and Series, which are
essential for handling and analyzing structured data in Python.
“import numpy as np”
This imports the NumPy library and assigns it the alias np, enabling
efficient handling of numerical operations, arrays, and mathematical
computations. Note that the module name is written in lowercase (numpy);
Python is case-sensitive, so “Import NumPy as np” would fail.
Feature             | NumPy                              | Pandas
Main Data Structure | ndarray (multi-dimensional)        | Series, DataFrame
Focus Area          | Numerical computation              | Data analysis and manipulation
Data Type Support   | Primarily numeric                  | Numeric, text, dates, categories
Indexing            | Integer-based                      | Labeled indexing (rows & columns)
File I/O Support    | Limited                            | Excellent (CSV, Excel, SQL, etc.)
Use Case Example    | Linear algebra, FFTs, simulations  | Cleaning, filtering, merging data
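The indexing and data type rows of the comparison can be seen in a short sketch. The labels and values are made up for illustration.

```python
import numpy as np
import pandas as pd

# NumPy: elements are addressed by integer position
arr = np.array([120, 150, 100])
second_by_position = arr[1]

# Pandas: the same values can carry labels (these labels are hypothetical)
sales = pd.Series([120, 150, 100], index=["Shoes", "Shirts", "Jeans"])
shirts_by_label = sales.loc["Shirts"]    # look up by label
shirts_by_position = sales.iloc[1]       # positional access is still available

# A DataFrame can mix data types across columns; an ndarray has a single dtype
df = pd.DataFrame({"product": ["Shoes", "Shirts"], "units": [120, 150]})
print(arr.dtype)
print(df.dtypes)
```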
NumPy Arrays and Array Operations
• NumPy (Numerical Python) provides support for large, multi-dimensional
arrays and matrices, along with a collection of mathematical functions.
Creating Arrays
    import numpy as np
    import pandas as pd

    # Creating a 1D array
    arr1 = np.array([1, 2, 3, 4])

    # Creating a 2D array
    arr2 = np.array([[1, 2], [3, 4]])

    # Element-wise operations
    sum_arr = arr1 + 5     # adds 5 to every element
    prod_arr = arr1 * 2    # doubles every element
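Beyond adding or multiplying by a constant, a few of NumPy's mathematical functions can be sketched on the same small arrays. The selection of functions here is only a sample, not a complete tour.

```python
import numpy as np

arr1 = np.array([1, 2, 3, 4])
arr2 = np.array([[1, 2], [3, 4]])

# Shape of each array
print(arr1.shape, arr2.shape)  # (4,) (2, 2)

# Element-wise arithmetic between two arrays of the same shape
squares = arr1 * arr1

# A mathematical function applied to every element
roots = np.sqrt(squares)

# Aggregations over the whole array or along an axis
total = arr1.sum()
col_means = arr2.mean(axis=0)  # mean of each column

# Matrix product (different from element-wise *)
product = arr2 @ arr2
```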
NumPy
➢ NumPy is highly efficient and forms the basis of many data science operations.
Introduction to Pandas DataFrames
➢ Pandas provides two core data structures: Series and DataFrame.
➢ A Series is a one-dimensional array-like object, while a DataFrame is a
two-dimensional, table-like structure with labeled axes.
Creating Series and DataFrames
    import pandas as pd

    # Series
    s = pd.Series([10, 20, 30, 40])

    # DataFrame
    data = {'Name': ['Alice', 'Bob'], 'Age': [25, 30]}
    df = pd.DataFrame(data)
DataFrames
• DataFrames are ideal for handling tabular data such as CSV files
or SQL tables.
Reading and Writing CSV/Excel Files
• Pandas makes it easy to load and save data from various sources.
    import pandas as pd

    # Reading a CSV file
    df = pd.read_csv('data.csv')

    # Reading an Excel file
    df_excel = pd.read_excel('data.xlsx')
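Saving works in the opposite direction. The sketch below writes a small DataFrame to a CSV file and reads it back; the file name people.csv is hypothetical, and a temporary folder is used so the example does not depend on any existing file.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({"Name": ["Alice", "Bob"], "Age": [25, 30]})

# Write to a temporary folder so the sketch does not depend on existing files
with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "people.csv")  # hypothetical file name

    # index=False leaves the row labels out of the file
    df.to_csv(csv_path, index=False)

    # Read the file back into a new DataFrame
    df_loaded = pd.read_csv(csv_path)

print(df_loaded)
```

Excel files can be written in a similar way with df.to_excel, which typically needs an extra engine package such as openpyxl to be installed.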
What is a Dataset?
• A dataset is a collection of data, typically organized in tables, arrays or specific
formats such as CSV or JSON, for easy retrieval and analysis.
• Datasets are essential for data analysis, machine learning (ML), artificial
intelligence (AI) and other applications that require reliable, accessible data.
• A dataset in machine learning and artificial intelligence is used to train and test
algorithms and models.
• A Dataset is a set of data grouped into a collection with which developers can
work to meet their goals.
• In a dataset, the rows represent the individual data points and the columns
represent the features of the dataset.
Types of Datasets
• Numerical Dataset: These contain numerical data points that can be used in
calculations. Examples include temperature, humidity, marks and so on.
• Categorical Dataset: These contain categories such as colour, gender,
occupation, games, sports and so on.
• Web Dataset: These are created by calling APIs using HTTP requests and
collecting the returned values for data analysis. They are mostly stored in
JSON (JavaScript Object Notation) format.
• Time Series Dataset: These contain data recorded over a period of time, for
example, changes in geographical terrain over time.
• Image Dataset: These consist of images. They are often used, for example, to
distinguish between types of diseases, heart conditions and so on.
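Several of these types can appear in one table. The sketch below loads a small JSON payload of the kind an API might return and marks one column as categorical and one as a time index. The payload, city name, and temperatures are invented; no real API is called.

```python
import json

import pandas as pd

# Hypothetical JSON text, standing in for the body of an HTTP API response
payload = """
[
  {"date": "2025-01-01", "city": "Accra", "temp_c": 31.0},
  {"date": "2025-01-02", "city": "Accra", "temp_c": 30.0},
  {"date": "2025-01-03", "city": "Accra", "temp_c": 32.0}
]
"""

# Web data: parse the JSON records into a DataFrame
records = json.loads(payload)
df = pd.DataFrame(records)

# Time series data: parse the dates and use them as the row index
df["date"] = pd.to_datetime(df["date"])
df = df.set_index("date")

# Categorical data: store the city as a category
df["city"] = df["city"].astype("category")

# Numerical data: compute a summary value
mean_temp = df["temp_c"].mean()
print(mean_temp)  # 31.0
```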
• Ordered Dataset: These datasets contain data that are ordered in ranks, for
example, customer reviews, movie ratings and so on.
• Partitioned Dataset: These datasets have data points segregated into different
members or different partitions.
• File-Based Dataset: These datasets are stored in files, such as .csv or .xlsx
files that can be opened in Excel.
• Bivariate Dataset: In this dataset, two classes or features are directly related
to each other. For example, height and weight in a dataset are directly related
to each other.
• Multivariate Dataset: In these types of datasets, as the name suggests, more
than two classes or features are related to each other. For example, attendance
and assignment grades are directly related to a student’s overall grade.
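One common way to check how strongly two features of a bivariate dataset are related is the Pearson correlation coefficient. The sketch below uses made-up height and weight values purely for illustration.

```python
import numpy as np
import pandas as pd

# Made-up measurements, for illustration only
df = pd.DataFrame({
    "height_cm": [150, 160, 170, 180, 190],
    "weight_kg": [50, 58, 66, 77, 85],
})

# Pearson correlation between the two features, computed with Pandas
r = df["height_cm"].corr(df["weight_kg"])

# The same quantity computed with NumPy
r_np = np.corrcoef(df["height_cm"], df["weight_kg"])[0, 1]

print(round(r, 3), round(r_np, 3))
```

A value close to 1 indicates that the two features rise together. For a multivariate dataset, df.corr() returns the same measure for every pair of numeric columns.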
FEATURES OF A DATASET
• Numerical Features: These include numerical values such as height, weight, and
so on.
• Categorical Features: These include multiple classes or categories, such as gender,
colour, and so on.
• Metadata: A general description of the dataset.
• Size of the Data: The number of entries (rows) and features (columns) the
dataset contains.
• Formatting of Data: Datasets available online come in several formats.
Some of them are JSON (JavaScript Object Notation), CSV (Comma-Separated
Values), XML (eXtensible Markup Language), DataFrame, and Excel files (xlsx or xlsm).
• Target Variable: The feature whose values are predicted from the other features
using machine learning techniques.
• Data Entries: The individual values of data present in the dataset.
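Several of these properties can be inspected directly in Pandas. The sketch below assumes a small invented dataset in which the column named passed plays the role of the target variable; all names and values are hypothetical.

```python
import pandas as pd

# Hypothetical dataset; "passed" is treated as the target variable
df = pd.DataFrame({
    "height_cm": [160, 172, 181, 168],
    "gender": ["F", "M", "M", "F"],
    "attendance": [0.9, 0.6, 0.8, 0.95],
    "passed": [1, 0, 1, 1],
})

# Size of the data: (number of entries, number of columns)
n_entries, n_columns = df.shape

# Numerical and categorical features, separated by data type
numeric_cols = df.select_dtypes(include="number").columns.tolist()
categorical_cols = df.select_dtypes(exclude="number").columns.tolist()

# Target variable: split it from the remaining features
X = df.drop(columns="passed")
y = df["passed"]

print(n_entries, n_columns)
print(numeric_cols, categorical_cols)
```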
Assignment One (1)
• You are a data analyst at a retail company. The marketing team has provided you with a
list of products and their monthly sales (in units) for January and February. Your
task is to create the data, manipulate it using both Pandas and NumPy, and
extract meaningful insights.
• The variables you use to manipulate your data should include your Student ID, e.g.
“df_12243_sort.value”.
• Use the data below:
➢ Product names: ["Shoes", "Shirts", "Jeans", "Bags"]
➢ January sales: [120, 150, 100, 90]
➢ February sales: [130, 145, 110, 80]
➢ Identify the following: mean (average), median, and mode of sales in January
THANK You !!