KEMBAR78
DAP Module3 | PDF | Databases | Database Index
0% found this document useful (0 votes)
152 views42 pages

DAP Module3

Uploaded by

Niki Nikhil
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
152 views42 pages

DAP Module3

Uploaded by

Niki Nikhil
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 42

Module 1 [22MCA31] Data Analytics using Python

Module 3
Data Pre-processing and Data Wrangling
Topics Covered

Loading from CSV files, Accessing SQL databases. Cleansing Data with Python:
Stripping out extraneous information, Normalizing data AND Formatting data.
Combining and Merging Data Sets – Reshaping and Pivoting – Data
Transformation – String Manipulation, Regular Expressions.

Loading from CSV files

Pandas features a number of functions for reading tabular data as a DataFrame object. Table
below has a summary of all of them.

Functions, which are meant to convert text data into a DataFrame. The options for these
functions fall into a few categories:
• Indexing: can treat one or more columns as the returned DataFrame, and whether to get
column names from the file, the user, or not at all.

• Type inference and data conversion: this includes the user-defined value conversions and
custom list of missing value markers.

• Datetime parsing: includes combining capability, including combining date and time
information spread over multiple columns into a single column in the result.

• Iterating: support for iterating over chunks of very large files.

• Unclean data issues: skipping rows or a footer, comments, or other minor things like
numeric data with thousands separated by commas.

ROOPA.H.M, Dept of MCA, RNSIT Page 1


Module 1 [22MCA31] Data Analytics using Python

read_csv and read_table are most used functions.


Before using any methods in the pandas library , import the library with the following
statement: Import pandas as pd

Let’s start with a small comma-separated (CSV) text file:


ex1.csv df = pd.read_csv('ex1.csv') pd.read_table('ex1.csv', sep=',')

Since ex1.csv is comma-delimited, we can use read_csv to read it into a DataFrame. If file
contains any other delimiters then, read_table can be used by specifying the delimiter.

pandas allows to assign column names by specifing names argument:

Suppose we wanted the message column to be the index of the returned DataFrame. We can
either indicate we want the column at index 4 or named 'message' using the index_col
argument:

ROOPA.H.M, Dept of MCA, RNSIT Page 2


Module 1 [22MCA31] Data Analytics using Python

To form a hierarchical index from multiple columns, just pass a list of column numbers or
names:

The parser functions have many additional arguments to help you handle the wide variety of
exception file formats that occur. For example, you can skip the first, third, and fourth rows
of a file with skiprows:

Handling missing values


• Handling missing values is an important and frequently nuanced part of the file parsing
process. Missing data is usually either not present (empty string) or marked by some
sentinel value. By default, pandas uses a set of commonly occurring sentinels, such as
NA, -1.#IND, and NULL:

ROOPA.H.M, Dept of MCA, RNSIT Page 3


Module 1 [22MCA31] Data Analytics using Python

• The na_values option can take either a list or set of strings to consider missing values:

• Different NA sentinels can be specified for each column in a dict:

ROOPA.H.M, Dept of MCA, RNSIT Page 4


Module 1 [22MCA31] Data Analytics using Python

Accessing SQL databases

• A database is a file that is organized for storing data. Most databases are organized
like a dictionary in the sense that they map from keys to values. The biggest
difference is that the database is on disk (or other permanent storage), so it persists
after the program ends. Because a database is stored on permanent storage, it can
store far more data than a dictionary, which is limited to the size of the memory in
the computer.

• Like a dictionary, database software is designed to keep the inserting and accessing
of data very fast, even for large amounts of data. Database software maintains its
performance by building indexes as data is added to the database to allow the
computer to jump quickly to a particular entry.

• There are many different database systems which are used for a wide variety of
purposes including: Oracle, MySQL, Microsoft SQL Server, PostgreSQL, and SQLite.

• Python uses SQLite database. SQLite is designed to be embedded into other


applications to provide database support within the application.

• Python to work with data in SQLite database files, many operations can be done more
conveniently using software called the Database Browser for SQLite which is freely
available from:
http://sqlitebrowser.org/

• Using the browser you can easily create tables, insert data, edit data, or run simple
SQL queries on the data in the database

Database concepts

• For the first look, database seems to be a spreadsheet consisting of multiple sheets.
The primary data structures in a database are tables, rows and columns.

• In a relational database terminology, tables, rows and columns are referred as


relation, tuple and attribute respectively. Typical structure of a database table is as
shown below Table 3.1.

• Each table may consist of n number of attributes and m number of tuples (or
records).

• Every tuple gives the information about one individual. Every cell (i, j) in the table
indicates value of jth attribute for ith tuple.

ROOPA.H.M, Dept of MCA, RNSIT Page 5


Module 1 [22MCA31] Data Analytics using Python

Table 3.1: Typical Relational database table

Consider the problem of storing details of students in a database table. The format may
look like –
Roll No Name DOB Marks
Student1 1 Akshay 22/10/2001 82.5
Student 2 2 Arun 20/12/2000 81.3
............... ............... ............... ...............
............... ............... ............... ...............
Student m ............... ............... ............... ...............

Thus, table columns indicate the type of information to be stored, and table rows gives
record pertaining to every student. We can create one more table say department
consisting of attributes like dept_id, homephno, City. To relate this table with a
respective Rollno stored in student, and dept_id stored in department table. Thus, there
is a relationship between two tables in a single database. There are softwares that can
maintain proper relationships between multiple tables in a single database and are
known as Relational Database Management Systems (RDBMS).

Creating a database table


The code to create a database name(music.db) and a table named Tracks with two
columns in the database is as follows:

import sqlite3
conn = sqlite3.connect('music.db') #create database name music
cur = conn.cursor()
cur.execute('CREATE TABLE Tracks (title TEXT, plays INTEGER)')
conn.close()

• The connect operation makes a “connection” to the database stored in the file
music.db in the current directory. If the file does not exist, it will be created.

• A cursor is like a file handle that we can use to perform operations on the data
stored in the database. Calling cursor() is very similar conceptually to calling
open() when dealing with text files.

ROOPA.H.M, Dept of MCA, RNSIT Page 6


Module 1 [22MCA31] Data Analytics using Python

• Once we have the cursor, we can begin to execute commands on the contents of
the database using the execute() method is as shown in figure below.

Figure : Database Cursor

cur.execute(INSERT INTO Tracks (title, plays) VALUES ('My Way', 15))


This command inserts one record into the table Tracks where values for the attributes title
and plays are ‘My Way’ and 15 respectively.
cur.execute( SELECT * FROM Tracks)
Retrieves all the records from the table Tracks
cur.execute(SELECT * FROM Tracks WHERE title = 'My Way’)
Retrieves the records from the table Tracks having the value of attribute title as ‘My Way’
cur.execute(UPDATE Tracks SET plays = 16 WHERE title = 'My Way’)
Whenever we would like to modify the value of any particular attribute in the table, we
can use UPDATE command. Here, the value of attribute plays is assigned to a new value for
the record having value of title as ‘My Way’.

cur.execute(DELETE FROM Tracks WHERE title = 'My Way')


A particular record can be deleted from the table using DELETE command. Here, the record
with value of attribute title as ‘My Way’ is deleted from the table Tracks.
cur.execute('DROP TABLE IF EXISTS Tracks ')
This command will delete the contents of entire table

Example1: Write a python to create student Table from college database.(the attributes of
student like Name,USN,Marks.)Perform the following operations like insert,delete and
retrieve record from student Table.

ROOPA.H.M, Dept of MCA, RNSIT Page 7


Module 1 [22MCA31] Data Analytics using Python

import sqlite3
conn = sqlite3.connect(‘college.db’)
cur=conn.cursor()
print(“Opened database successfully”)
cur.execute(‘CREATE TABLE student(name TEXT, usn NUMERIC, Marks INTEGER)’)
print(“Table created successfully”)
cur.execute(‘INSERT INTO student(name,usn,marks) values (?,?,?)’,(‘akshay’,’1rn16mca16’,30))
cur.execute(‘insert into student(name,usn,marks) values (?,?,?)’,(‘arun’,’1rn16mca17’,65))
print(‘student’)
cur.execute(‘SELECT name, usn ,marks from student’)
for row in cur:
print(row)
cur.execute(‘DELETE FROM student WHERE Marks < 40’)
cur.execute(‘select name,usn,marks from student’)
conn.commit()
cur.close()
Output:
Opened database successfully
Table created successfully
student
('akshay', '1rn16mca16', 30)
('arun', '1rn16mca17', 65)
Example 2: Write a python code to create a database file(music.sqlite) and a table named
Tracks with two columns- title , plays. Also insert , display and delete the contents of the
table
import sqlite3
conn = sqlite3.connect('music.sqlite')
cur = conn.cursor()
cur.execute('CREATE TABLE Tracks (title TEXT, plays INTEGER)')
cur.execute(“INSERT INTO Tracks (title, plays) VALUES ('Thunderstruck', 200)”)
cur.execute(“INSERT INTO Tracks (title, plays) VALUES (?, ?)”,('My Way', 15))
conn.commit()
print('Tracks:')
cur.execute('DELETE FROM Tracks WHERE plays < 100')
cur.execute('SELECT title, plays FROM Tracks')
for row in cur:
print(row)
cur.close()

Output
Tracks:
('Thunderstruck', 200)

ROOPA.H.M, Dept of MCA, RNSIT Page 8


Module 1 [22MCA31] Data Analytics using Python

Cleansing Data with Python: Stripping out extraneous information

Extraneous information refers to irrelevant or unnecessary data that can clutter a dataset and
make it difficult to analyze. This could include duplicate entries, empty fields, or irrelevant
columns. Stripping out this information involves removing it from the dataset, resulting in a more
concise and manageable dataset.

To strip out extraneous information in a Pandas DataFrame, you can use various methods and
functions provided by the library. Some commonly used methods include:
• dropna( ): This method removes rows with missing values (NaN or None) from the DataFrame.
You can specify the axis (0 for rows and 1 for columns) along which the rows or columns with
missing values should be dropped.
Example:
df = df.dropna()
#This will remove all rows that contain at least one missing value.

• drop( ): The drop() method in Pandas is used to remove columns from a DataFrame. It can be
used to drop a single column or multiple columns at once.
df.drop(columns, axis=1, inplace=False)
Ex:
cars2 = cars_data.drop(['Doors','Weight'],axis='columns')

• drop_duplicates(): methods to remove missing values and duplicate rows specify the
columns based on which the duplicates should be checked.

ROOPA.H.M, Dept of MCA, RNSIT Page 9


Module 1 [22MCA31] Data Analytics using Python

• loc[ ] and iloc[ ]: These indexing methods allow you to select specific rows and columns from
the DataFrame. They are used to select only the relevant data and exclude the unwanted
information.
Ex:1

Ex:2

ROOPA.H.M, Dept of MCA, RNSIT Page 10


Module 1 [22MCA31] Data Analytics using Python

• Filtering: conditional statements can be used to filter the DataFrame and select only the
rows that meet certain criteria. This allows to remove unwanted data based on specific
conditions.
Example:

data = {'Name': [‘Anitha', ‘Barathi', 'Charlie', 'David'],


'Age': [25, 30, 35, 40],
'Salary': [50000, 60000, 70000, 80000]}
df = pd.DataFrame(data)

# Example of filtering based on a condition


filtered_df = df[df['Age'] > 30]

# Display the filtered DataFrame


print(filtered_df)

Normalizing data AND Formatting data


Normalizing Data:

Data normalization is the process of transforming data into a consistent format to facilitate
comparison and analysis. This may involve converting data to a common unit of measurement,
formatting dates and times consistently, or standardizing data formats. Normalization ensures
that data is comparable and can be easily processed and analysed.

Normalization is a crucial step in data preprocessing for machine learning tasks. It involves
transforming numerical features to have a mean of 0 and a standard deviation of 1. This process
ensures that all features are on the same scale, enabling efficient and accurate learning by
machine learning algorithms.

In Python, several libraries provide functions for data normalization .

Method 1: Using sklearn


The sklearn method is a very famous method to normalize the data.

ROOPA.H.M, Dept of MCA, RNSIT Page 11


Module 1 [22MCA31] Data Analytics using Python

We import all the required libraries, NumPy and sklearn. You can see that we import the
preprocessing from the sklearn itself. That’s why this is the sklearn normalization method. We
created a NumPy array with some integer value that is not the same. We called the normalize
method from the preprocessing and passed the numpy_array, which we just created as a
parameter. We can see from the results, all integer data are now normalized between 0 and 1.

Method 2: Normalize a particular column in a dataset using sklearn

We can also normalize the particular dataset column. In this, we are going to discuss about that.

We import the library pandas and sklearn. We created a dummy CSV file, and we are now
loading that CSV file with the help of the pandas (read_csv) package. We print that CSV file
which we recently loaded. We read the particular column of the CSV file using the np. array and
store the result to value_array. We called the normalize method from the preprocessing and
passed the value_array parameter.

Method 3: Convert to normalize without using the columns to array (using sklearn)

In the previous method 2, we discussed how to a particular CSV file column we could normalize.
But sometimes we need to normalize the whole dataset, then we can use the below method
where we do normalize the whole dataset but along column-wise (axis = 0). If we mention the
axis = 1, then it will do row-wise normalize. The axis = 1 is by default value.

ROOPA.H.M, Dept of MCA, RNSIT Page 12


Module 1 [22MCA31] Data Analytics using Python

Now, we pass the whole CSV file along with one more extra parameter axis =0, which said to the
library that the user wanted to normalize the whole dataset column-wise.

Method 4: Using MinMaxScaler()


The sklearn also provides another method of normalization, which we called it MinMaxScalar.
This is also a very popular method because it is easy to use.

ROOPA.H.M, Dept of MCA, RNSIT Page 13


Module 1 [22MCA31] Data Analytics using Python

We called the MinMaxScalar from the preprocessing method and created an object
(min_max_Scalar) for that. We did not pass any parameters because we need to normalize the
data between 0 and 1. But if you want, you can add your values which will be seen in the next
method.

We first read all the names of the columns for further use to display results. Then we call the
fit_tranform from the created object min_max_Scalar and passed the CSV file into that. We get
the normalized results which are between 0 and 1.

Method 5: Using MinMaxScaler(feature_range=(x,y))

The sklearn also provides the option to change the normalized value of what you want. By
default, they do normalize the value between 0 and 1. But there is a parameter which we called
feature_range, which can set the normalized value according to our requirements.

Here, We call the MinMaxScalar from the preprocessing method and create an object
(min_max_Scalar) for that. But we also pass another parameter inside of the MinMaxScaler
(feature_range). That parameter value we set 0 to 2. So now, the MinMaxScaler will normalize the
data values between 0 to 2. We first read all the names of the columns for further use to display
results. Then we call the fit_tranform from the created object min_max_Scalar and passed the
CSV file into that. We get the normalized results which are between 0 and 2.

ROOPA.H.M, Dept of MCA, RNSIT Page 14


Module 1 [22MCA31] Data Analytics using Python

Method 6: Using the maximum absolute scaling

We can also do normalize the data using pandas. These features are also very popular in
normalizing the data. The maximum absolute scaling does normalize values between 0 and 1.
We are applying here .max () and .abs() as shown below:

We call each column and then divide the column values with the .max() and .abs(). We print the
result, and from the result, we confirm that our data normalize between 0 and 1.

Method 7: Using the z-score method

The next method which we are going to discuss is the z-score method. This method converts the
information to the distribution. This method calculates the mean of each column and then
subtracts from each column and, at last, divides it with the standard deviation. This normalizes
the data between -1 and 1.

ROOPA.H.M, Dept of MCA, RNSIT Page 15


Module 1 [22MCA31] Data Analytics using Python

We calculate the column’s mean and subtract it from the column. Then we divide the column
value with the standard deviation. We print the normalized data between -1 and 1.

One popular library is Scikit-Learn, which offers the StandardScaler class for normalization.
Here's an example of how to use StandardScaler to normalize a dataset:

ROOPA.H.M, Dept of MCA, RNSIT Page 16


Module 1 [22MCA31] Data Analytics using Python

Formatting Data:

• Formatting data in Pandas involves transforming and presenting data in a structured and
readable manner. Pandas, a popular Python library for data analysis, offers various methods
and techniques to format data effectively.

• One of the key features of Pandas is its ability to handle different data types and structures.
It provides specific formatting options for each data type, ensuring that data is displayed in
a consistent and meaningful way. For example, numeric data can be formatted with specific
number of decimal places, currency symbols, or percentage signs. Date and time data can be
formatted in various formats, such as "dd/mm/yyyy" or "hh:mm:ss".

• Pandas also allows users to align data within columns, making it easier to read and compare
values. This can be achieved using the "justify" parameter, which takes values such as "left",
"right", or "center". Additionally, Pandas provides options to control the width of columns,
ensuring that data is presented in a visually appealing manner.

• Furthermore, Pandas offers methods to format entire dataframes, applying consistent


formatting rules to all columns. This can be done using the "style" attribute, which allows
users to specify formatting options for different aspects of the dataframe, such as font,
background color, and borders.

• By leveraging the formatting capabilities of Pandas, users can effectively communicate


insights and patterns in their data, making it easier to analyze and interpret. Overall,
formatting data in Pandas is a crucial skill for data analysts and scientists to present their
findings in a clear and professional manner.
Ex 1 : Formatting Numeric Data

ROOPA.H.M, Dept of MCA, RNSIT Page 17


Module 1 [22MCA31] Data Analytics using Python

Ex 2: Formatting Date and Time Data

Ex 3: Aligning Data in Columns

ROOPA.H.M, Dept of MCA, RNSIT Page 18


Module 1 [22MCA31] Data Analytics using Python

Combining and merging data sets

Data contained in pandas objects can be combined together in a number of built-in ways:
• pandas.merge connects rows in DataFrames based on one or more keys. This will be
familiar to users of SQL or other relational databases, as it implements database join
operations.

• pandas.concat glues or stacks together objects along an axis.

• combine_first instance method enables splicing together overlapping data to fill in missing
values in one object with values from another.

Database-style DataFrame Merges


Merge or join operations combine data sets by linking rows using one or more keys. The merge
function in pandas is used to combine datasets.

Let’s start with a simple example:

import pandas as pd
df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'], 'data1': range(7)})
df2 = pd.DataFrame({'key': ['a', 'b', 'd'],'data2': range(3)})
df1 df2

• The below examples shows many-to-one merge situation; the data in df1 has multiple rows
labeled a and b, whereas df2 has only one row for each value in the key column.

# performs inner join # performs outer join


# pd.merge(df1, df2)
pd.merge(df1, df2, how='outer')
pd.merge(df1, df2, on='key')

ROOPA.H.M, Dept of MCA, RNSIT Page 19


Module 1 [22MCA31] Data Analytics using Python

Observe that the 'c' and 'd' values and associated data are missing from the result. By default
merge does an 'inner' join; the keys in the result are the intersection. The outer join takes the
union of the keys, combining the effect of applying both left and right joins.

• The below examples shows Many-to-many merges:

df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'],'data1': range(6)})

df2 = pd.DataFrame({'key': ['a', 'b', 'a', 'b', 'd'],'data2': range(5)})

df1 df2

Many-to-many joins form the Cartesian product of the rows. Since there were 3 'b' rows in
the left DataFrame and 2 in the right one, there are 6 'b' rows in the result. The join method
only affects the distinct key values appearing in the result:

pd.merge(df1, df2, how='inner') pd.merge(df1, df2, on='key', how='left')

Merging on Index
The merge key or keys in a DataFrame will be found in its index. In this case, you can pass
left_index=True or right_index=True (or both) to indicate that the index should be used as the
merge key

ROOPA.H.M, Dept of MCA, RNSIT Page 20


Module 1 [22MCA31] Data Analytics using Python

ROOPA.H.M, Dept of MCA, RNSIT Page 21


Module 1 [22MCA31] Data Analytics using Python

Concatenating Along an Axis


Another kind of data combination operation is alternatively referred to as concatenation,
binding, or stacking. NumPy has a concatenate function for doing this with raw NumPy
arrays:

By default concat works along axis=0, producing another Series. If you pass axis=1, the result
will instead be a DataFrame (axis=1 is the columns):

ROOPA.H.M, Dept of MCA, RNSIT Page 22


Module 1 [22MCA31] Data Analytics using Python

Reshaping and pivoting

There are a number of fundamental operations for rearranging tabular data. These are
alternatingly referred to as reshape or pivot operations.

Reshaping with Hierarchical Indexing

Hierarchical indexing provides a consistent way to rearrange data in a DataFrame.


There are two primary actions:
• stack: this changes from the columns in the data to the rows.
• unstack: this changes from the rows into the columns.

• Using the stack method on this data pivots the columns into the rows, producing a
Series.

From a hierarchically-indexed Series, we can rearrange the data back into a DataFrame with
unstack.

ROOPA.H.M, Dept of MCA, RNSIT Page 23


Module 1 [22MCA31] Data Analytics using Python

• By default the innermost level is unstacked (same with stack). You can unstack a different
level by passing a level number or name:

• Unstacking might introduce missing data if all of the values in the level aren’t found in each
of the subgroups:

Pivoting “long” to “wide” Format


• A common way to store multiple time series in databases and CSV is in so-called long or
stacked format :

• Data is frequently stored this way in relational databases like MySQL as a fixed schema allows
the number of distinct values in the item column to increase or decrease as data is added or
deleted in the table.

• The data may not be easy to work with in long format; it is preferred to have a DataFrame
containing one column per distinct item value indexed by timestamps in the date column.
ROOPA.H.M, Dept of MCA, RNSIT Page 24
Module 1 [22MCA31] Data Analytics using Python

DataFrame’s pivot method performs exactly this transformation.

The pivot() function is used to reshape a given DataFrame organized by given index / column
values. This function does not support data aggregation, multiple values will result in a
MultiIndex in the columns.

Syntax:
DataFrame.pivot(self, index=None, columns=None, values=None)

Example:

ROOPA.H.M, Dept of MCA, RNSIT Page 25


Module 1 [22MCA31] Data Analytics using Python

Suppose you had two value columns that you wanted to reshape simultaneously:

• By omitting the last argument, you obtain a DataFrame with hierarchical columns:

ROOPA.H.M, Dept of MCA, RNSIT Page 26


Module 1 [22MCA31] Data Analytics using Python

Data transformation

Data transformation is the process of converting raw data into a format that is suitable for
analysis and modeling. It's an essential step in data science and analytics workflows, helping to
unlock valuable insights and make informed decisions.
Few of the data transfer mechanisms are :
• Removing Duplicates
• Replacing Values
• Renaming Axis Indexes
• Discretization and Binning
• Detecting and Filtering Outliers
• Permutation and Random Sampling

i) Removing duplicates
Duplicate rows may be found in a DataFrame using method duplicated which returns a
boolean Series indicating whether each row is a duplicate or not. Relatedly,
drop_duplicates returns a DataFrame where the duplicated array is True.

data = DataFrame(
{ 'k1': ['one'] * 3 + ['two'] * 4,
'k2': [1, 1, 2, 3, 3, 4, 4] } )

data.duplicated() data.drop_duplicates()
data

# rows 1, 4 and 6 are


dropped

ii) Filtering outliers


Filtering or transforming outliers is largely a matter of applying array operations.
Consider a DataFrame with some normally distributed data. (Note : while writing answers,
write your own random numbers between 0 and 1)

ROOPA.H.M, Dept of MCA, RNSIT Page 27


Module 1 [22MCA31] Data Analytics using Python

• Suppose we wanted to find values in one of the columns exceeding one in magnitude:

• To select all rows having a value exceeding 1 or -1, we can use the any method on a
boolean DataFrame:

iii) Replacing Values

• Some times it is necessary to replace missing values with some specific values or NAN
values. It can be done by using replace method. Let’s consider this Series:

data = Series([1., -999., 2., -999., -1000., 3.])


data

• The -999 values might be sentinel values for missing data. To replace these with NA
values that pandas understands, we can use replace, producing a new Series:

data.replace(-999, np.nan)

ROOPA.H.M, Dept of MCA, RNSIT Page 28


Module 1 [22MCA31] Data Analytics using Python

• If we want to replace multiple values at once, you instead pass a list then the substitute
value:

data.replace([-999, -1000], np.nan)

• To use a different replacement for each value, pass a list of substitutes:

data.replace([-999, -1000], [np.nan, 0]) data.replace({-999: np.nan, -1000: 0})

iv. Renaming Axis Indexes

Like values in a Series, axis labels can be similarly transformed by a function or mapping
of some form to produce new, differently labeled objects. The axes can also be modified in
place without creating a new data structure.

• We can assign to index, modifying the DataFrame in place:

import pandas as pd

data = pd.DataFrame(np.arange(12).reshape((3, 4)),


index=['Ohio', 'Colorado', 'New York'], columns=['one', 'two', 'three', 'four'])
data.index = data.index.map(str.upper)
data

• To create a transformed version of a data set without modifying the original, a useful
method is rename:

data.rename(index=str.title, columns=str.upper)

ROOPA.H.M, Dept of MCA, RNSIT Page 29


Module 1 [22MCA31] Data Analytics using Python

• rename can be used in conjunction with a dict-like object providing new values for a subset
of the axis labels:
data.rename(index={'OHIO': 'INDIANA'},
columns={'three': 'peekaboo'})

v. Discretization and binning

• Continuous data is often discretized or otherwise separated into “bins” for analysis.
Suppose we have data about a group of people in a study, and we want to group them into
discrete age buckets:

ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]

Let’s divide these into bins of 18 to 25, 26 to 35, 35 to 60, and finally 60 and older. To do
so, we have to use cut, a function in pandas:

import pandas as pd
ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
bins = [18, 25, 35, 60, 100]
cats = pd.cut(ages, bins)
cats

• The object pandas returns is a special Categorical object. We can treat it like an array of
strings indicating the bin name; internally it contains a levels array indicating the distinct
category names along with a labeling for the ages data in the labels attribute:

cats.labels

cats.levels
Index([(18, 25], (25, 35], (35, 60], (60, 100]], dtype=object)
pd.value_counts(cats)

Consistent with mathematical notation for intervals, a parenthesis means that the side is
open while the square bracket means it is closed (inclusive).

ROOPA.H.M, Dept of MCA, RNSIT Page 30


Module 1 [22MCA31] Data Analytics using Python

vi. Permutation and Random Sampling

Permuting (randomly reordering) a Series or the rows in a DataFrame is easy to do using the
numpy.random.permutation function. Calling permutation with the length of the axis you
want to permute produces an array of integers indicating the new ordering:

df

df = DataFrame(np.arange(5 * 4).reshape(5, 4))

sampler = np.random.permutation(5)
array([1, 0, 2, 3, 4])
sampler

That array can then be used in ix-based indexing or the take function:

df.take(sampler)

vii. Computing Indicator/Dummy Variables

Another type of transformation for statistical modeling or machine learning applica tions
is converting a categorical variable into a “dummy” or “indicator” matrix. If a column in a
DataFrame has k distinct values, you would derive a matrix or DataFrame containing k
columns containing all 1’s and 0’s. pandas has a get_dummies function for doing this,
though devising one yourself is not difficult. Let’s return to an earlier ex ample
DataFrame:

df = DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'], 'data1': range(6)})

pd.get_dummies(df['key'])

In some cases, you may want to add a prefix to the columns in the indicator DataFrame,
which can then be merged with the other data. get_dummies has a prefix argument for
doing just this:

ROOPA.H.M, Dept of MCA, RNSIT Page 31


Module 1 [22MCA31] Data Analytics using Python

dummies = pd.get_dummies(df['key'], prefix='key')

df_with_dummy = df[['data1']].join(dummies)

df_with_dummy

String Manipulation

Python has long been a popular data munging language in part due to its ease-of-use for
string and text processing. Most text operations are made simple with the string object’s built-
in methods. For more complex pattern matching and text manipulations, regular expressions
may be needed. pandas adds to the mix by enabling you to apply string and regular
expressions concisely on whole arrays of data, additionally handling the annoyance of missing
data.

String Object Methods


• In many string munging and scripting applications, built-in string methods are sufficient.
Examples:
Description Code Output
a comma-separated val = 'a,b, guido'
string can be broken val.split(',') ['a', 'b', ' guido']
into pieces with split

split is often combined pieces = [x.strip()


with strip to trim for x in val.split(',')]
['a', 'b', 'guido']
whitespace (including
newlines)
The above substrings first,second,third= pieces
could be concatenated
together with a two- first+'::'+second+'::'+ 'a::b::guido'
colon delimiter using third
addition
A faster and more '::'.join(pieces)
Pythonic way is to pass
a list or tuple to the 'a::b::guido'
join method on the
string '::'

ROOPA.H.M, Dept of MCA, RNSIT Page 32


Module 1 [22MCA31] Data Analytics using Python

Note: refer module 2 for remaining string methods

Regular Expressions
Expressions.

• A RegEx, or Regular Expression, is a sequence of characters that forms a search


pattern.
• RegEx can be used to check if a string contains the specified search pattern.
• Python has a built-in package called re, which can be used to work with Regular
Expressions

RegEx Functions
The re module offers a set of functions that allows us to search a string for a match.
By using these functions we can search required pattern. They are as follows:

• match(): re.match() determine if the RE matches at the beginning of the string. The
method returns a match object if the search is successful. If not, it returns None.

import re Output:
abyss
pattern = '^a...s$'
Search successful.
test_string = 'abyss'
result = re.match(pattern,test_string)

if result:
print("Search successful.")
else:
print("Search unsuccessful.")

• search(): The search( ) function searches the string for a match, and returns a Match
object if there is a match. If there is more than one match found, only the first
occurrence of the match will be returned.
import re Output:
pattern='Tutorials' <re.Match object;
line ='Python Tutorials' span=(7,16),match='Tutorials'>
result = re.search(pattern, line)
Tutorials
print(result)
print(result.group())

#group( ) returns matched string

ROOPA.H.M, Dept of MCA, RNSIT Page 33


Module 1 [22MCA31] Data Analytics using Python

• findall() : Find all substrings where the RE matches, and returns them as a list. It
searches from start or end of the given string and returns all occurrences of the
pattern. While searching a pattern, it is recommended to use re.findall() always, it
works like re.search() and re.match() both.
import re Output:
str = "The rain in Spain" ['ai', 'ai']
x = re.findall("ai", str)
print(x)

Character matching in regular expressions

• Python provides a list of meta-characters to match search strings.

• Metacharacters are characters that are interpreted in a special way by a RegEx


engine. Here's a list of metacharacters:

[] A set of characters "[a-m]"

[^…] Matches any single character NOT in "[^a-m]"


brackets
\ Signals a special sequence (can also be "\d"
used to escape special characters)
. Any character (except newline character) "he..o"

^ Starts with "^hello"

$ Ends with "world$"

* Zero or more occurrences "aix*"

+ One or more occurrences "aix+"

{} Exactly the specified number of "al{2}"


occurrences
| Either or "falls|stays"

() Capture and group

When parentheses are added to a regular


expression, they are ignored for the
purpose of matching, but allow you to
extract a particular subset of the
matched string rather than the whole
string when using findall()

ROOPA.H.M, Dept of MCA, RNSIT Page 34


Module 1 [22MCA31] Data Analytics using Python

Special Sequences
A special sequence is a \ followed by one of the characters in the list below, and has a
special meaning:
\d Returns a match where the string contains "\d"
digits (numbers from 0-9)
\D Returns a match where the string DOES NOT "\D"
contain digits

\s Returns a match where the string contains a "\s"


white space character

\S Returns a match where the string DOES NOT "\S"


contain a white space character

\w Returns a match where the string contains "\w"


any word characters (characters from a to Z,
digits from 0-9, and the underscore _
character)

\W Returns a match where the string DOES NOT "\W"


contain any word characters

\A Returns a match if the specified characters "\AThe"


are at the beginning of the string

\b Returns a match where the specified r"\bain"


characters are at the beginning or at the end r"ain\b"
of a word
\B Returns a match where the specified r"\Bain"
characters are present, but NOT at the r"ain\B"
beginning (or at the end) of a word
\Z Returns a match if the specified characters "Spain\Z"
are at the end of the string

Few examples on set of characters for pattern matching are as follows:

Set Description Examples

[arn] Returns a match where >>> str = "The rain in


one of the specified Spain"
characters (a, r, or n) >>>re.findall("[arn]",
are present str)
['r', 'a', 'n', 'n', 'a',
'n']

ROOPA.H.M, Dept of MCA, RNSIT Page 35


Module 1 [22MCA31] Data Analytics using Python

[a-n] Returns a match for >>> str = "The rain in Spain"


any lower case >>>re.findall("[a-n]",str)
character, ['h', 'e', 'a', 'i', 'n', 'i',
alphabetically 'n', 'a', 'i', 'n']
between a and n

[^arn] Returns a match for >>> str = "The rain in Spain"


any character >>>re.findall("[^arn]", str)
EXCEPT a, r, and n ['T', 'h', 'e', ' ', 'i', ' ',
'i', ' ', 'S', 'p', 'i']

[0123] Returns a match where >>> str = "The rain in Spain"


any of the specified >>>re.findall("[0123]", str)
digits (0, 1, 2, or 3) are []
present

[0-9] Returns a match for >>>str ="8 times before 11:45 AM"
any digit >>>re.findall("[0-9]", str)
between 0 and 9 ['8', '1', '1', '4', '5']

[0- Returns a match for >>>str = "8 times before 11:45 AM"
5][0- any two-digit numbers >>>re.findall("[0-5][0-9]", str)
9] ['11', '45']
from 00 and 59

[a-zA-Z] Returns a match for >>>str = "8 times before


any character 11:45 AM"
>>>re.findall("[a-zA-Z]",
alphabetically
str)
between a and z, lower ['t', 'i', 'm', 'e', 's', 'b',
case OR upper case 'e', 'f', 'o', 'r', 'e', 'A',
'M']

[+] In
sets, +, *, ., |, (), $,{} has
no special meaning, >>>str ="8 times before 11:45 AM"
so [+] means: return a >>>re.findall("[+]", str)
match for []
any + character in the
string

ROOPA.H.M, Dept of MCA, RNSIT Page 36


Module 1 [22MCA31] Data Analytics using Python

Few more examples for searching the pattern in files:


Let us consider a text file pattern.txt
#pattern.txt
From: Bengaluru^560098
From:<ravi123@gmail.com>
ravi
rohan
Mysore^56788
From:Karnataka
From:
<roopa.hm@rnsit.ac.in>

EX:1 Search for lines that start with 'F', followed by 2 characters, followed by 'm:'
import re Output:
hand = open('pattern.txt')
for line in hand: From: Bengaluru^560098
line = line.rstrip() From:<ravi123@gmail.com>
if re.search('^F..m:', From: <roopa.hm@rnsit.ac.in>
line):
print(line)

The regular expression F..m: would match any of the strings “From:”, “Fxxm:”, “F12m:”,
or “F!@m:” since the period characters in the regular expression match any character.

Ex:2 Search for lines that start with From and have an at sign
import re Output:
hand = open('pattern.txt')
for line in hand: From:<ravi123@gmail.com>
line = line.rstrip() From:
if re.search('^From:.+@', <roopa.hm@rnsit.ac.in>
line):
print(line)

The search string ˆFrom:.+@ will successfully match lines that start with “From:”,
followed by one or more characters (.+), followed by an at-sign.

Extracting data using regular expressions


If we want to extract data from a string in Python we can use the findall() method to
extract all of the substrings which match a regular expression.

Ex:1 Extract anything that looks like an email address from the line.
import re
s = 'A message from csev@umich.edu to cwen@iupui.edu about meeting @2PM'
lst = re.findall('\S+@\S+', s)
print(lst)
Output: ['csev@umich.edu', 'cwen@iupui.edu']

ROOPA.H.M, Dept of MCA, RNSIT Page 37


Module 1 [22MCA31] Data Analytics using Python

In the above example:


— The findall() method searches the string in the second argument and returns a list of all of
the strings that look like email addresses.

— Translating the regular expression, we are looking for substrings that have at least One or
more non-whitespace character, followed by an at-sign, followed by at least one more non-
whitespace character.

— The “\S+” matches as many non-whitespace characters as possible.

Ex:2 Search for lines that have an at sign between characters


import re Output:
hand = open('pattern.txt')
for line in hand: ['From:<ravi123@gmail.com>']
line = line.rstrip() ['<roopa.hm@rnsit.ac.in>']
x = re.findall('\S+@\S+',
line)
if len(x) > 0:
print(x)
We read each line and then extract all the substrings that match our regular
expression. Since findall() returns a list, we simply check if the number of elements in
our returned list is more than zero to print only lines where we found at least one
substring that looks like an email address.
Observe the above output, email addresses have incorrect characters like “<” or “>” at
the beginning or end. To eliminate those characters, refer to the below example
program.

Ex:3 Search for lines that have an at sign between characters .The characters
must be a letter or number
import re Output:
hand = open('pattern.txt')
for line in hand: ['From:ravi123@gmail.com']
line = line.rstrip() ['roopa.hm@rnsit.ac.in']
x=re.findall('[a-zA-Z0-9]\S+@\S+[a-zA-
Z]',line)
if len(x) > 0:
print(x)

Here, we are looking for substrings that start with a single lowercase letter, uppercase
letter, or number “[a-zA-Z0-9]”, followed by zero or more non-blank characters (\S*),
followed by an at-sign, followed by zero or more non-blank characters (\S*), followed by
an uppercase or lowercase letter. Note that we switched from + to * to indicate zero or
more non-blank characters since [a-zA-Z0-9] is already one non-blank character.
Remember that the * or + applies to the single character immediately to the left of the
plus or asterisk.

ROOPA.H.M, Dept of MCA, RNSIT Page 38


Module 1 [22MCA31] Data Analytics using Python

Combining searching and extracting


• Sometimes we may want to extract the lines from the file that match with specific
pattern, let say
X-DSPAM-Confidence: 0.8475
X-DSPAM-Probability: 0.0000

We can use the following regular expression to select the lines:


^X-.*: [0-9.]+
Let us consider a sample file called file.txt
File.txt

X-DSPAM-Confidence:
0.8475
X-DSPAM-Probability:
0.0000
X-DSPAM-Confidence:
0.6178
X-DSPAM-Probability:
0.0000
X-DSPAM-Confidence:
0.6961
X-DSPAM
done with the file
content

Ex:1 Search for lines that start with 'X' followed by any non whitespace
characters and ':' followed by a space and any number. The number can
include a decimal.
import re Output:
hand = open('file.txt') X-DSPAM-Confidence:
for line in hand: 0.8475
line = line.rstrip() X-DSPAM-Probability:
x =re.search('^X-.*: ([0-9.]+)', 0.0000
line) X-DSPAM-Confidence:
if x: 0.6178
print(x.group()) X-DSPAM-Probability:
0.0000
X-DSPAM-Confidence:
0.6961

Here, it select the lines that


— start with X-,

ROOPA.H.M, Dept of MCA, RNSIT Page 39


Module 1 [22MCA31] Data Analytics using Python

— followed by zero or more characters (.*),


— followed by a colon (:) and then a space.
— After the space we are looking for one or more characters that are either a digit (0-
9) or a period [0-9.]+.
— Note that inside the square brackets, the period matches an actual period (i.e., it is
not a meta character between the square brackets).

• But, if we want only the numbers in the above output. We can use split() function on
extracted string. However, it is better to refine regular expression. To do so, we need
the help of parentheses.

When we add parentheses to a regular expression, they are ignored when matching
the string(with search()). But when we are using findall(), parentheses indicate that
while we want the whole expression to match, we are only interested in extracting a
portion of the substring that matches the regular expression.

Ex:2 Search for lines that start with 'X' followed by any non whitespace
characters and ':' followed by a space and any number. The number can include
a decimal. Then print the number if it is greater than zero.

import re Output:
hand = open('file.txt') ['0.8475']
for line in hand: ['0.0000']
line = line.rstrip() ['0.6178']
x = re.findall('^X.*: ([0-9.]+)', ['0.0000']
line) ['0.6961']
if len(x) > 0:
print(x)

• Let us consider another example, assume that the file contain lines of the form:

Details: http://source.sakaiproject.org/viewsvn/?view=rev&rev=39772

If we wanted to extract all of the revision numbers (the integer number at the end of
these lines) using the same technique as above, we could write the following program:

ROOPA.H.M, Dept of MCA, RNSIT Page 40


Module 1 [22MCA31] Data Analytics using Python

Ex:3 Search for lines that start with 'Details: rev=' followed by numbers and '.'
Then print the number if it is greater than zero.
import re
str="Details:http://source.sakaiproject.org/viewsvn/?view=rev&rev=3
9772"
x = re.findall('^Details:.*rev=([0-9]+)', str)
if len(x) > 0:
print(x)
Output:
['39772']

In the above example, we are looking for lines that start with Details:, followed by
any number of characters (.*), followed by rev=, and then by one or more digits. We
want to find lines that match the entire expression but we only want to extract the
integer number at the end of the line, so we surround [0-9]+ with parentheses.
Note that, the expression [0-9] is greedy, because, it can display very large
number. It keeps grabbing digits until it finds any other character than the digit.

• Consider another example – we may be interested in knowing time of a day of each


email. The file may have lines like –
From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008
Here, we would like to extract only the hour 09. That is, we would like only two digits
representing hour. This can done by following code-
line="From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16
2008"
x = re.findall('^From .* ([0-9][0-9]):', line)
if len(x) > 0:
print(x)
Output:
['09']

Escape character
Character like dot, plus, question mark, asterisk, dollar etc. are meta characters in
regular expressions. Sometimes, we need these characters themselves as a part of
matching string. Then, we need to escape them using a backslash. For example,
import re Output:
x = 'We just received $10.00 for
cookies.' matched string: $10.00
y = re.search('\$[0-9.]+',x)
print("matched string:",y.group())

ROOPA.H.M, Dept of MCA, RNSIT Page 41


Module 1 [22MCA31] Data Analytics using Python

Here, we want to extract only the price $10.00. As, $ symbol is a metacharacter, we
need to use \ before it. So that, now $ is treated as a part of matching string, but not
as metacharacter.

Question Bank

Explain merge methods with example demonstrating the following joins


1
i) Outer ii)left iii)right
2 Discuss various techniques for stripping out extraneous information in the dataset.
3 What is data normalization? Explain with an example.
4 Illustrate with examples to handle missing data while reading the CSV file.
5 Describe reshaping with hierarchical indexing with suitable examples.
6 Write a short note string manipulation.
7 Write a short note on pivoting mechanism.
8 List and describe different functions used for pattern matching in re module with example.
9 Discuss the data transformation mechanisms with examples.
10 Briefly discuss Discretization and Binning
Implement a python program to demonstrate
11 (i) Importing Datasets
(ii) Cleaning the Data
(iii) Data frame manipulation using NumPy
REGULAR EXPRESSION
12 What is the need of regular expressions in programming? Explain.
13 Discuss any 5 meta characters used in regular expressions with suitable example.
14 Discuss match() , search() and findall() functions of re module.
15 What is the need of escape characters in regular expressions? Give suitable code snippet
Write a Python program to search for lines that start with the word ‘From’ and a character
16 followed by a two digit number between 00 and 99 followed by “:” Print the number if it is
greater than zero. Assume any input file.
17 How to extract a substring from the selected lines from the file
Explain the intention/meaning of the following Regular expressions
18 1. ^From .* ([0-9][0-9]):
2. ^Details:.*rev=([0-9.]+
3. . ^X\S*: ([0-9.]+)

ROOPA.H.M, Dept of MCA, RNSIT Page 42

You might also like