Pandas
DR. ARCHANA RAJE
What is Pandas?
➢ Python pandas is one of the most widely-used Python libraries in data
science and analytics.
➢ It provides high-performance, easy-to-use data structures and data analysis
tools.
➢ Pandas is a powerful Python library that is specifically designed to work with
"relational" or "labeled" data.
➢ Its two main objects are the Series (a one-dimensional labeled array) and the
DataFrame (a two-dimensional table).
➢ A DataFrame is a structure that contains column names and row labels.
➢ This Python package works well for data manipulation, operating on datasets,
exploring a data frame, data analysis, and machine learning-related tasks.
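As a minimal sketch of the two objects described above, the following builds a Series and a DataFrame by hand. The student names, marks, and cities are made-up illustration data, not part of the original slides.

```python
import pandas as pd

# A Series: one-dimensional, labeled, holding a single data type.
marks = pd.Series([82, 91, 77], index=["asha", "ravi", "meena"], name="marks")

# A DataFrame: two-dimensional, with row labels (the index) and column names.
students = pd.DataFrame(
    {"marks": [82, 91, 77], "city": ["Pune", "Mumbai", "Nagpur"]},
    index=["asha", "ravi", "meena"],
)

print(marks["ravi"])                  # label-based access on a Series
print(list(students.columns))         # column names
print(list(students.index))           # row labels
print(students.loc["meena", "city"])  # one cell, addressed by row and column label
```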
Why Pandas?
Pandas simplifies working with data frames and takes care of many of the
time-consuming, repetitive tasks involved, such as:
➢ Import datasets - available in the form of spreadsheets, comma-
separated values (CSV) files, and more.
➢ Data cleansing - dealing with missing values and representing them as
NaN, NA, or NaT.
➢ Size mutability - columns can be added and removed from DataFrame
and higher-dimensional objects.
➢ Data normalization – normalize the data into a suitable format for
analysis.
➢ Data alignment - objects can be explicitly aligned to a set of labels.
➢ Intuitive merging and joining of datasets – datasets can be combined on
common columns or on their row labels.
➢ Reshaping and pivoting of datasets – datasets can be reshaped
and pivoted as per the need.
➢ Efficient manipulation and extraction - manipulation and
extraction of specific parts of extensive datasets using intelligent
label-based slicing, indexing, and subsetting techniques.
➢ Statistical analysis - to perform statistical operations on datasets.
➢ Data visualization - Visualize datasets and uncover insights.
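Several of the tasks listed above can be sketched in a few lines. The example below is only an illustration under stated assumptions: the `sales` and `managers` tables and all their values are hypothetical, and filling a missing value with 0.0 is one possible cleaning choice, not a recommendation.

```python
import numpy as np
import pandas as pd

# Hypothetical dataset with one missing revenue value.
sales = pd.DataFrame(
    {
        "region": ["north", "south", "north", "south"],
        "quarter": ["Q1", "Q1", "Q2", "Q2"],
        "revenue": [100.0, np.nan, 120.0, 90.0],
    }
)

# Data cleansing: the missing value is stored as NaN; count it, then fill it.
n_missing = int(sales["revenue"].isna().sum())
sales["revenue"] = sales["revenue"].fillna(0.0)

# Size mutability: add a derived column, then remove it again.
sales["revenue_k"] = sales["revenue"] / 1000
sales = sales.drop(columns="revenue_k")

# Merging: attach a lookup table on the common "region" column.
managers = pd.DataFrame({"region": ["north", "south"], "manager": ["Asha", "Ravi"]})
merged = sales.merge(managers, on="region", how="left")

# Reshaping: pivot from long format to one row per region, one column per quarter.
wide = sales.pivot(index="region", columns="quarter", values="revenue")

# Label-based extraction: pick one cell by row and column label.
north_q2 = wide.loc["north", "Q2"]

# Statistical analysis: a simple summary statistic.
mean_revenue = sales["revenue"].mean()

# Data alignment: arithmetic matches on labels, not on position.
a = pd.Series([1, 2], index=["x", "y"])
b = pd.Series([10, 20], index=["y", "z"])
aligned_sum = a + b  # only "y" exists in both; "x" and "z" become NaN

print(wide)
print(aligned_sum)
```

Note that the alignment step produces NaN for labels present in only one operand, which is why missing-value handling and alignment usually go together.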
Applications of Pandas
The most common applications of Pandas are as follows:
➢ Data Cleaning: Pandas provides functionalities to clean messy data, deal with incomplete or
inconsistent data, handle missing values, remove duplicates, and standardize formats to do
effective data analysis.
➢ Data Exploration: Pandas makes it easy to summarize statistics, find trends, and visualize data
using built-in plotting functions or its Matplotlib and Seaborn integration.
➢ Data Preparation: Pandas can pivot, melt, convert variables, and merge datasets based on
common columns to prepare data for analysis.
➢ Data Analysis: Pandas supports descriptive statistics, time series analysis, group-by operations, and
custom functions.
➢ Data Visualisation: Pandas itself has basic plotting capabilities; it also works with data
visualisation libraries such as Matplotlib, Seaborn, and Plotly to create richer visualisations.
➢ Time Series Analysis: Pandas supports date/time indexing, resampling, frequency conversion, and
rolling statistics for time series data.
➢ Data Aggregation and Grouping: The pandas groupby() method lets you aggregate data and
compute group-wise summary statistics or apply functions to groups.
➢ Data Input/Output: Pandas makes data input and export easy by reading and writing CSV,
Excel, JSON, SQL databases, and more.
➢ Machine Learning: Pandas works well with Scikit-learn for data preparation, feature
engineering, and model input data.
➢ Web Scraping: Pandas can be used together with BeautifulSoup or Scrapy to organise and
analyse structured data extracted from web pages.
➢ Financial Analysis: Pandas is commonly used in finance for stock market data analysis,
financial indicator calculation, and portfolio optimization.
➢ Text Data Analysis: Pandas' vectorised string methods (the .str accessor), including
regular-expression support, help clean and analyse textual data.
➢ Experimental Data Analysis: Pandas makes manipulating and analysing large datasets,
performing statistical tests, and visualising results easy.
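A few of the applications above (input/output, cleaning, grouping, text handling, and rolling statistics) can be sketched together. This is a minimal illustration, assuming a small made-up price table; the tickers `AAA` and `bbb` and all values are hypothetical, and the CSV is read from an in-memory string so the example runs without any files.

```python
import io
import pandas as pd

# Hypothetical CSV content, including one duplicated row and inconsistent casing.
csv_text = """date,ticker,close
2024-01-01,AAA,10.0
2024-01-02,AAA,11.0
2024-01-02,AAA,11.0
2024-01-03,AAA,12.0
2024-01-01,bbb,20.0
2024-01-02,bbb,22.0
"""

# Data input: read CSV text and parse the date column.
prices = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"])

# Data cleaning: remove exact duplicate rows and standardise the ticker format.
prices = prices.drop_duplicates().reset_index(drop=True)
prices["ticker"] = prices["ticker"].str.upper()

# Aggregation and grouping: mean closing price per ticker.
mean_close = prices.groupby("ticker")["close"].mean()

# Time series: a 2-row rolling mean on a date-indexed Series.
aaa = prices[prices["ticker"] == "AAA"].set_index("date")["close"]
rolling = aaa.rolling(window=2).mean()

# Data output: write the cleaned table back to CSV text.
out = prices.to_csv(index=False)

print(mean_close)
print(rolling)
```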
Introduction to Data Structures
Pandas deals with the following three data structures −
➢ Series
➢ DataFrame
➢ Panel
These data structures are built on top of NumPy arrays, which means they are fast.
The best way to think of these data structures is that the higher dimensional data structure is a
container of its lower dimensional data structure. For example, DataFrame is a container of Series,
Panel is a container of DataFrame.

Data Structure   Dimensions   Description
Series           1            1D labeled homogeneous array, size-immutable.
DataFrame        2            General 2D labeled, size-mutable tabular structure with
                              potentially heterogeneously typed columns.
Panel            3            General 3D labeled, size-mutable array.
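The "container" idea can be checked directly: a DataFrame can be assembled from Series, and each column taken back out is again a Series backed by a NumPy array. The sketch below uses made-up names and measurements purely for illustration.

```python
import numpy as np
import pandas as pd

# Two Series that share the same row labels (hypothetical values).
height = pd.Series([1.62, 1.75], index=["asha", "ravi"], name="height_m")
weight = pd.Series([58, 70], index=["asha", "ravi"], name="weight_kg")

# A DataFrame built from Series: each Series becomes one column.
people = pd.concat([height, weight], axis=1)

# Taking a column back out gives a Series again.
col = people["height_m"]

# The values of a Series can be viewed as a NumPy array.
arr = col.to_numpy()

print(people.ndim, col.ndim)  # number of dimensions of each object
print(people.dtypes)          # columns may have different types
print(type(arr))
```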
➢ Series: Series is a one-dimensional array-like structure with homogeneous data.
➢ DataFrame: DataFrame is a two-dimensional array with heterogeneous data.
➢ Panel: Panel is a three-dimensional data structure with heterogeneous data. It is hard to
represent a panel graphically, but a panel can be illustrated as a container of DataFrame.
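Note − DataFrame is widely used and one of the most important data structures. Panel is used much less; it was deprecated in pandas 0.20 and removed in pandas 0.25, so current versions of pandas provide only Series and DataFrame.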
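Because Panel is not available in current pandas releases, three-dimensional data is commonly held in a DataFrame with a two-level row index instead. The following is a sketch of that approach, not the only option; the yearly tables and their values are hypothetical.

```python
import pandas as pd

# Two hypothetical "sheets" with the same layout, one per year.
y2023 = pd.DataFrame({"sales": [10, 20], "cost": [4, 9]}, index=["north", "south"])
y2024 = pd.DataFrame({"sales": [12, 25], "cost": [5, 11]}, index=["north", "south"])

# Stack them into one DataFrame whose rows are labeled by (year, region).
cube = pd.concat({2023: y2023, 2024: y2024}, names=["year", "region"])

# Selecting on the outer level returns one "sheet" as an ordinary DataFrame.
sheet_2024 = cube.loc[2024]

# A single value is addressed by (year, region) and a column name.
value = cube.loc[(2024, "south"), "sales"]

print(cube)
print(sheet_2024)
print(value)
```

This keeps the "container of DataFrame" picture from the slides: each value of the outer index level corresponds to one DataFrame.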