Pandas and Python

This document describes how to use the Pandas library in Python to manipulate and analyze data. Pandas allows importing data from various sources such as Excel files, CSV, and SQL databases, and representing them in DataFrames. DataFrames allow selecting rows and columns, sorting and filtering data, applying functions to columns, and removing duplicate rows and columns.

Pandas is an open-source Python library that provides data analysis and
manipulation tools for programming in Python.
It is a very promising library for data representation, filtering, and statistical
programming. The most important piece in pandas is the DataFrame, where it stores and works
with the data.

In this tutorial, you will learn what a DataFrame is, how to create it from different sources,
how to export it to different formats, and how to manipulate its data.

Install pandas
You can install pandas using pip. Run the following command in cmd:
pip install pandas

Also, you can install pandas using conda like this:

conda install pandas

Read an Excel file


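You can read from an Excel file using the read_excel() method from pandas. For this,
pandas needs an Excel engine module. For .xlsx files that module is openpyxl; xlrd,
which older setups used, reads only the legacy .xls format since its version 2.0.

Install openpyxl using pip:


pip install openpyxl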

The following example shows how to read from an Excel sheet:

1. We create an Excel sheet with the following contents:

2. Import the pandas module.


import pandas
3. Provide the name of the Excel file and the name of the sheet that we need, and
read the data using the read_excel() method.
pandas.read_excel('pandasExcel.xlsx', sheet_name='Sheet1')
The previous fragment will generate the following result:
If you check the output type using the built-in type() function, it will give you the following result:

<class 'pandas.core.frame.DataFrame'>
This result is called a DataFrame! It is the basic unit of pandas, and we will work
with it until the end of the tutorial.
The DataFrame is a labeled two-dimensional structure where we can store
data of different types. A DataFrame is similar to a SQL table or an Excel
spreadsheet.

Import CSV file


To read a CSV file, you can use the read_csv() method from pandas.

Import the pandas module:

import pandas
Now call the read_csv() method as follows:

pandas.read_csv('Book1.csv')
Book1.csv has the following content:
The code will generate the following DataFrame:
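
Because the contents of Book1.csv are not reproduced here, the following is a minimal sketch that reads CSV text from an in-memory buffer instead of a file. The column names and values are hypothetical; read_csv() takes a file path in the same position.

```python
import io

import pandas

# Hypothetical CSV content standing in for Book1.csv.
csv_text = "name,age\nJames,18\nJason,20\n"

# read_csv() accepts a path or any file-like object.
df = pandas.read_csv(io.StringIO(csv_text))
print(df)
```

The first line of the text becomes the column labels, and pandas adds a numeric row index starting at 0.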

Read a text file


We can also use the read_csv() method from pandas to read from a text file.
Consider the following example:

import pandas

pandas.read_csv('myFile.txt')
The myFile.txt has the following format:
The output of the previous code will be:

This text file is treated like a CSV file because its elements are separated
by commas. The file can also use another delimiter, such as a semicolon, a
tab, etc.

Suppose we have a tab delimiter and the file looks like this:
When the delimiter is a tab, we will have the following result:

Since pandas does not know the delimiter, it shows each tab as \t.

To define the tab character as a delimiter, pass the delimiter argument
this way:

pandas.read_csv('myFile.txt', delimiter='\t')
Now the output will be:

It seems correct now.
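
The effect of the delimiter argument can be checked without a file on disk. This is a small sketch with hypothetical tab-separated text standing in for myFile.txt; it compares the shape of the result with and without the delimiter.

```python
import io

import pandas

# Hypothetical tab-separated content standing in for myFile.txt.
tsv_text = "name\tage\nJames\t18\nJason\t20\n"

# With the default comma delimiter, each line ends up in a single column.
wrong = pandas.read_csv(io.StringIO(tsv_text))

# With delimiter='\t', the fields are split into separate columns.
right = pandas.read_csv(io.StringIO(tsv_text), delimiter='\t')

print(wrong.shape)
print(right.shape)
```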


Read SQL
You can use the read_sql() method of pandas to read from a SQL database. This is
demonstrated in the following example:

import sqlite3

import pandas

con = sqlite3.connect('mydatabase.db')

pandas.read_sql('select * from Employee', con)


In this example, we connect to a SQLite3 database that has a table called
"Employee". Using the read_sql() method from pandas, we pass a query and a
connection object. The query retrieves all the data from the table.
Our employee table looks like the following:

When you run the above code, the output will be as follows:
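
To try this without an existing mydatabase.db, the sketch below builds a throwaway in-memory SQLite database first. The table layout and rows are assumptions made up for the illustration, not the contents of the table used in this tutorial.

```python
import sqlite3

import pandas

# In-memory database, so the example leaves no file behind.
con = sqlite3.connect(':memory:')
con.execute('create table Employee (Name text, Job text)')
con.executemany(
    'insert into Employee values (?, ?)',
    [('James', 'Assistant'), ('Jason', 'Manager')],  # hypothetical rows
)
con.commit()

# The query result comes back as a DataFrame.
df = pandas.read_sql('select * from Employee', con)
print(df)
con.close()
```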
Select columns
Let's assume we have three columns in the Employee table like this:

To select columns from the table, we will run the following query:

select Name, Job from Employee


The statement of the pandas code will be as follows:
pandas.read_sql('select Name, Job from Employee', con)

We can also select a column from a table by accessing the DataFrame.


Consider the following example:

x = pandas.read_sql('select * from Employee', con)

x['Name']
The result will be the following:
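
Column access with square brackets can be sketched as follows. The DataFrame is built directly from hypothetical data here instead of being read from the database, so the snippet runs on its own.

```python
import pandas

# Hypothetical data standing in for the Employee table.
x = pandas.DataFrame({'Name': ['James', 'Jason'], 'Job': ['Assistant', 'Manager']})

names = x['Name']            # a single label returns a Series
subset = x[['Name', 'Job']]  # a list of labels returns a DataFrame

print(type(names).__name__)
print(type(subset).__name__)
```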

Select rows by value


First, we will create a DataFrame from which we will select rows.

To create a DataFrame, consider the following code:

import pandas

frame_data = {'name': ['James','Jason','Rogers'],'age': [18,20,22],'job': ['Assistant','Manager','Clerk']}

df = pandas.DataFrame(frame_data)
In this code, we create a DataFrame with three columns and three rows using the
pandas DataFrame() constructor. The result will be as follows:

To select a row according to its value, execute the following statement:

df.loc[df['name'] == 'Jason']
df.loc[] (DataFrame.loc[]) is an indexer that accepts labels or a boolean array and can be used to access rows or
columns. In the previous code, the comparison produces a boolean array, and df.loc[] returns the row where the
name is Jason.

The output will be:
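
The two steps hidden in that statement, building the boolean mask and passing it to df.loc[], can be sketched separately. The second condition at the end is an extra illustration that is not part of the original example.

```python
import pandas

df = pandas.DataFrame({
    'name': ['James', 'Jason', 'Rogers'],
    'age': [18, 20, 22],
    'job': ['Assistant', 'Manager', 'Clerk'],
})

# The comparison produces a boolean Series, one entry per row.
mask = df['name'] == 'Jason'
print(mask.tolist())

# .loc keeps only the rows where the mask is True.
selected = df.loc[mask]
print(selected)

# Conditions can be combined with & (and) or | (or); each needs parentheses.
combined = df.loc[(df['age'] > 18) & (df['job'] != 'Clerk')]
print(combined)
```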

Select row by index


To select a row by its index, we can use the slicing operator (:) or the
df.loc[] indexer.

Consider the following code:

>>> frame_data = {'name': ['James','Jason','Rogers'],'age': [18,20,22],'job': ['Assistant','Manager','Clerk']}

>>> df = pandas.DataFrame(frame_data)
We create a DataFrame. Now we are going to access a row using df.loc[]:

>>> df.loc[1]
As you can see, we retrieved a row. We can do the same using the slicing
operator in the following way:

>>> df[1:2]
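
The two forms return different kinds of objects, which the sketch below makes visible. The df.iloc[] call is an addition for comparison: it selects by position, while df.loc[] selects by label.

```python
import pandas

df = pandas.DataFrame({
    'name': ['James', 'Jason', 'Rogers'],
    'age': [18, 20, 22],
    'job': ['Assistant', 'Manager', 'Clerk'],
})

row = df.loc[1]      # label 1: a Series holding that row's values
sliced = df[1:2]     # slice: a DataFrame that contains one row
by_pos = df.iloc[1]  # position 1: same row here, since labels are 0, 1, 2

print(type(row).__name__, type(sliced).__name__)
print(row['name'], by_pos['name'])
```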

Change column type


The data type of a column can be changed using the astype() method of
DataFrame. To check the data type of the columns, we use the dtypes attribute of
DataFrame.

df.dtypes
The output will be:

Now to convert the data type from one to another:

>>> df.name = df.name.astype(str)


We select the 'name' column of our DataFrame and convert its values from object to
strings of characters.
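
A numeric column shows the effect of astype() more visibly than a text column. The following sketch uses a shortened, hypothetical version of the DataFrame and converts 'age' from integers to floats and then to text.

```python
import pandas

df = pandas.DataFrame({'name': ['James', 'Jason'], 'age': [18, 20]})
print(df.dtypes)

# astype() returns a new Series; assign it back to keep the change.
df['age'] = df['age'].astype(float)
print(df['age'].dtype)

# Converting to str produces the text form of each value.
age_text = df['age'].astype(str)
print(age_text.tolist())
```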

Apply a function to columns / rows


To apply a function to a column or row, you can use the apply() method of
DataFrame.

Consider the following example:

>>> frame_data = {'A': [1,2,3],'B': [18,20,22],'C': [54,12,13]}


>>> df = pandas.DataFrame(frame_data)
We create a DataFrame and add integer values in the rows. To apply a
function, for example, the square root, to the values, we will import the numpy module to
use the sqrt function like this:
>>> import numpy as np

>>> df.apply(np.sqrt)
The output will be as follows:

To apply a sum function, the code will be:

>>> df.apply(np.sum)

To apply the function to a specific column, you can specify the column as
follows:

>>> df['A'].apply(np.sqrt)
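
The heading mentions rows as well as columns; the axis argument of apply() controls which one the function receives. In this sketch, value_range is a hypothetical helper written only for the illustration.

```python
import numpy as np
import pandas

df = pandas.DataFrame({'A': [1, 2, 3], 'B': [18, 20, 22], 'C': [54, 12, 13]})


def value_range(values):
    """Hypothetical helper: difference between largest and smallest value."""
    return values.max() - values.min()


per_column = df.apply(value_range)       # default axis=0: one result per column
per_row = df.apply(value_range, axis=1)  # axis=1: one result per row
roots = df['A'].apply(np.sqrt)           # element by element on one column

print(per_column.tolist())
print(per_row.tolist())
```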

Sort values / sort by column


To sort the values in a DataFrame, use the DataFrame's sort_values() method.

Create a DataFrame with integer values:

>>> frame_data = {'A': [23,12,30],'B': [18,20,22],'C': [54,112,13]}

>>> df = pandas.DataFrame(frame_data)
Now to sort the values:

>>> df.sort_values(by=['A'])
The output will be:
The sort_values() method has a required 'by' parameter. In the previous code, the
values are sorted by column A. To sort by multiple columns, the code is as
follows:

>>> df.sort_values(by=['A', 'B'])


If you want to sort in descending order, set the ascending parameter of sort_values() to
False in the following way:

>>> df.sort_values(by=['A'], ascending=False)
The output will be:
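
Sorting by several columns only matters when the first column contains ties. The sketch below uses hypothetical values with a tie in column A, and also shows that ascending accepts one flag per column.

```python
import pandas

# Hypothetical values; rows 1 and 2 tie on column A.
df = pandas.DataFrame({'A': [23, 12, 12], 'B': [18, 20, 5], 'C': [54, 112, 13]})

# Ties in A are broken by B.
by_two = df.sort_values(by=['A', 'B'])

# A ascending, B descending.
mixed = df.sort_values(by=['A', 'B'], ascending=[True, False])

print(by_two.index.tolist())
print(mixed.index.tolist())
```

sort_values() returns a new DataFrame; df itself keeps its original row order unless the result is assigned back.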

Remove / Delete duplicates


To remove duplicate rows from a DataFrame, use the drop_duplicates() method of the
DataFrame.

Consider the following example:

>>> frame_data = {'name': ['James','Jason','Rogers','Jason'],'age': [18,20,22,20],'job': ['Assistant','Manager','Clerk','Manager']}

>>> df = pandas.DataFrame(frame_data)
Here we create a DataFrame with a duplicate row. To check for duplicate rows in
the DataFrame, use the DataFrame's duplicated() method.

>>> df.duplicated()
The result will be:

It can be seen that the last row is a duplicate. To delete this row, execute the following
line of code:

>>> df.drop_duplicates()
Now the result will be:
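
The same steps as a self-contained sketch. The keep='last' call is an extra illustration: it retains the last occurrence of each duplicated row instead of the first.

```python
import pandas

df = pandas.DataFrame({
    'name': ['James', 'Jason', 'Rogers', 'Jason'],
    'age': [18, 20, 22, 20],
    'job': ['Assistant', 'Manager', 'Clerk', 'Manager'],
})

flags = df.duplicated()                      # True for repeats of an earlier row
deduped = df.drop_duplicates()               # keeps the first occurrence
keep_last = df.drop_duplicates(keep='last')  # keeps the last occurrence

print(flags.tolist())
print(deduped.index.tolist())
print(keep_last.index.tolist())
```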

Remove duplicates by column


Sometimes we have data where the values in one column are repeated and we want to
remove those rows. We can delete duplicates by column by passing the name of the column that
should be checked.

For example, we have the following DataFrame:

>>> frame_data = {'name': ['James','Jason','Rogers','Jason'],'job': ['Assistant','Manager','Clerk','Employee']}

>>> df = pandas.DataFrame(frame_data)
Here you can see that Jason appears twice. If you want to remove duplicates by column,
just pass the column name as follows:

>>> df.drop_duplicates(['name'])
The result will be as follows:
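
The difference between whole-row and per-column duplicate checks can be sketched like this. The data is hypothetical: two rows share a name but differ in job.

```python
import pandas

df = pandas.DataFrame({
    'name': ['James', 'Jason', 'Rogers', 'Jason'],
    'job': ['Assistant', 'Manager', 'Clerk', 'Employee'],
})

# No row is identical to another, so nothing is removed.
full = df.drop_duplicates()

# Only the 'name' column is compared; the first Jason is kept.
by_name = df.drop_duplicates(subset=['name'])

print(len(full))
print(by_name)
```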

Delete a column
To remove a whole column or row, we can use the drop() method of the DataFrame
specifying the name of the column or row.

Consider the following example:

>>> df.drop(['job'], axis=1)


In this line of code, we are removing the column called 'job'. The axis
argument is necessary here. An axis value of 1 means that we want to remove columns, and an
axis value of 0 means that rows will be deleted. In axis values, 0 is for the index and 1
is for columns.

The result will be:
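
As a runnable sketch: drop() also accepts a columns keyword, which has the same effect as axis=1, and it returns a new DataFrame instead of changing df.

```python
import pandas

df = pandas.DataFrame({
    'name': ['James', 'Jason', 'Rogers'],
    'age': [18, 20, 22],
    'job': ['Assistant', 'Manager', 'Clerk'],
})

without_job = df.drop(['job'], axis=1)
same_result = df.drop(columns=['job'])  # equivalent spelling

print(list(without_job.columns))
print(list(df.columns))  # the original still has 'job'
```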


Remove rows
We can use the drop() method to remove a row by passing the index of the row.

Let's assume we have the following DataFrame:

>>> frame_data = {'name': ['James','Jason','Rogers'],'age': [18,20,22],'job': ['Assistant','Manager','Clerk']}

>>> df = pandas.DataFrame(frame_data)
To delete the row with index 0, where the name is James, the age is 18 and the job
is Assistant, use the following code:

>>> df.drop([0])

We are going to create a DataFrame where the indices are the names:

>>> frame_data = {'name': ['James','Jason','Rogers'],'age': [18,20,22],'job': ['Assistant','Manager','Clerk']}

>>> df = pandas.DataFrame(frame_data, index = ['James','Jason','Rogers'])

Now we can delete a row with a certain value. For example, if we want to delete a
row where the name is Rogers, then the code will be:

>>> df.drop(['Rogers'])
The output will be:

You can also delete a range of rows as follows:


>>> df.drop(df.index[[0, 1]])
This will delete the rows at positions 0 and 1, and only one row will remain since our DataFrame is
composed of 3 rows:

If you want to delete the last row of the DataFrame and do not know the total number of rows,
you can use negative indexing as shown below:

>>> df.drop(df.index[-1])
-1 deletes the last row. Similarly, -2 deletes the second-to-last row; to delete the last 2 rows, pass the slice df.index[-2:].
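
The label-based and position-based variants can be compared side by side. This sketch uses a name-indexed DataFrame with a reduced, hypothetical set of columns.

```python
import pandas

df = pandas.DataFrame(
    {'age': [18, 20, 22], 'job': ['Assistant', 'Manager', 'Clerk']},
    index=['James', 'Jason', 'Rogers'],
)

no_rogers = df.drop(['Rogers'])            # by label
no_last = df.drop(df.index[-1])            # last row, whatever its label is
no_last_two = df.drop(df.index[-2:])       # slice: the last two rows
no_second_last = df.drop(df.index[-2])     # single position: only that row

print(no_last_two.index.tolist())
print(no_second_last.index.tolist())
```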

Sum a column
You can use the sum() method of the DataFrame to sum the elements of the column.

Let's suppose we have the following DataFrame:

>>> frame_data = {'A': [23,12,12],'B': [18,18,22],'C': [13,112,13]}

>>> df = pandas.DataFrame(frame_data)
Now to sum the elements of column A, use the following line of code:

>>> df['A'].sum()

You can also use the apply() method of the DataFrame and pass the sum function of
numpy to sum the values.
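
Beyond a single column, sum() can also be called on the whole DataFrame. A short sketch with the same values:

```python
import pandas

df = pandas.DataFrame({'A': [23, 12, 12], 'B': [18, 18, 22], 'C': [13, 112, 13]})

total_a = df['A'].sum()      # one column
column_totals = df.sum()     # every column
row_totals = df.sum(axis=1)  # every row

print(total_a)
print(column_totals.tolist())
print(row_totals.tolist())
```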

Count unique values


To count unique values in a column, you can use the nunique() method of
DataFrame.

Let's suppose we have a DataFrame as follows:


>>> frame_data = {'A': [23,12,12],'B': [18,18,22],'C': [13,112,13]}

>>> df = pandas.DataFrame(frame_data)
To count the unique values in column A:

>>> df['A'].nunique()

As you can see, column A has only 2 unique values, 23 and 12; the other 12 is a
duplicate, which is why we have 2 in the output.

If you want to count all the values in a column, you can use the count() method as
follows:

>>> df['A'].count()
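
The two counts can be compared directly. The value_counts() call is an addition to the original example; it reports how often each distinct value occurs.

```python
import pandas

df = pandas.DataFrame({'A': [23, 12, 12], 'B': [18, 18, 22], 'C': [13, 112, 13]})

unique_count = df['A'].nunique()   # number of distinct values
total_count = df['A'].count()      # number of non-missing values
per_value = df['A'].value_counts() # occurrences of each value

print(unique_count, total_count)
print(per_value)
```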

Select a subset of rows
To select a subset of a DataFrame, you can use brackets.

For example, we have a DataFrame that contains some integers. We can select
a subset of rows like this:

df[start:stop]
The starting point will be included in the subset, but the stopping point is not included.
For example, to select 3 rows starting from the first row, you will write:

>>> df[0:3]
The output will be:

That code means to start from the first row, which is 0, and stop before row 3, so 3 rows are selected.

Similarly, to select the first 2 rows, you will write:


>>> df[0:2]

To select or retrieve a subset with the last row, use negative indexing:

>>> df[-1:]
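
A runnable sketch of these slices, using a hypothetical one-column DataFrame. The comparison with df.iloc[] is an addition: it makes the same position-based selection explicit.

```python
import pandas

# Hypothetical data with four rows.
df = pandas.DataFrame({'A': [10, 20, 30, 40]})

first_three = df[0:3]  # rows 0, 1, 2; row 3 is excluded
last = df[-1:]         # only the last row

print(first_three['A'].tolist())
print(last['A'].tolist())
print(first_three.equals(df.iloc[0:3]))
```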

Write to an Excel file
To write a DataFrame to an Excel sheet, we can use the to_excel() method.

To write .xlsx files, pandas relies on the openpyxl module, so it has to be
installed first.

Install openpyxl using pip:

pip install openpyxl

Consider the following example:

>>> import openpyxl


>>> frame_data = {'name': ['James','Jason','Rogers'],'age': [18,20,22],'job': ['Assistant','Manager','Clerk']}

>>> df = pandas.DataFrame(frame_data)

>>> df.to_excel("pandasExcel.xlsx", sheet_name="Sheet1")


The Excel file will look like the following:

Write to a CSV file


Similarly, to write a DataFrame to CSV, you can use the to_csv() method,
as shown in the following line of code.

>>> df.to_csv("pandasCSV.csv")


The output file will be like the following:
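
When to_csv() is called without a file name, it returns the CSV text as a string, which is convenient for inspecting the output. The sketch below passes index=False, an option not used in the original line, so that the row index is not written as an extra first column.

```python
import io

import pandas

df = pandas.DataFrame({
    'name': ['James', 'Jason', 'Rogers'],
    'age': [18, 20, 22],
    'job': ['Assistant', 'Manager', 'Clerk'],
})

# No path given: the CSV content is returned instead of written to disk.
csv_text = df.to_csv(index=False)
print(csv_text)

# Reading the text back gives the same values.
restored = pandas.read_csv(io.StringIO(csv_text))
print(restored)
```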
Write to SQL
To write data in SQL, we can use the to_sql() method.

Consider the following example:

import sqlite3

import pandas

con = sqlite3.connect('mydatabase.db')

frame_data = {'name': ['James','Jason','Rogers'],'age': [18,20,22],'job': ['Assistant','Manager','Clerk']}

df = pandas.DataFrame(frame_data)

df.to_sql('users', con)
In this code, we create a connection to a sqlite3 database. Then we create a
DataFrame with three rows and three columns.

Finally, we use the to_sql method of our DataFrame (df) and pass the name of
the table where the data will be stored along with the connection object.
The SQL database will look like this:
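
The write can be verified by querying the table afterwards. This sketch uses an in-memory database, and the index=False and if_exists='replace' options are additions: by default, to_sql() also stores the row index as a column and raises an error if the table already exists.

```python
import sqlite3

import pandas

con = sqlite3.connect(':memory:')

df = pandas.DataFrame({
    'name': ['James', 'Jason', 'Rogers'],
    'age': [18, 20, 22],
    'job': ['Assistant', 'Manager', 'Clerk'],
})

# Create (or overwrite) the 'users' table from the DataFrame.
df.to_sql('users', con, index=False, if_exists='replace')

# Read the rows back with plain SQL to confirm they were stored.
rows = con.execute('select name, age from users order by age').fetchall()
print(rows)
con.close()
```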

Write to JSON
You can use the DataFrame's to_json() method to write to a JSON file.

This is demonstrated in the following example:

df.to_json("myJson.json")
In this line of code, the name of the JSON file is passed as an argument. The
DataFrame will be stored in the JSON file. The file will contain the following content:
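
The layout of the JSON can be inspected by calling to_json() without a file name, which returns the text instead of writing it. The orient='records' variant is an extra illustration of an alternative layout with one object per row.

```python
import json

import pandas

df = pandas.DataFrame({
    'name': ['James', 'Jason', 'Rogers'],
    'age': [18, 20, 22],
    'job': ['Assistant', 'Manager', 'Clerk'],
})

# Default layout: column name -> {row label -> value}.
data = json.loads(df.to_json())
print(data['name'])

# One JSON object per row.
records = json.loads(df.to_json(orient='records'))
print(records[0])
```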
Write to an HTML file
You can use the DataFrame's to_html() method to create an HTML file with the
content of the DataFrame.

Consider the following example:

>>> df.to_html("myhtml.html")
The resulting file will have the following content:

When you open the HTML file in the browser, it will look like this:
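
As with the other writers, to_html() returns the markup as a string when no file name is given. A minimal sketch, with index=False added to leave out the row index column:

```python
import pandas

df = pandas.DataFrame({
    'name': ['James', 'Jason', 'Rogers'],
    'age': [18, 20, 22],
    'job': ['Assistant', 'Manager', 'Clerk'],
})

# The result is a single <table> element with one <tr> per row.
html = df.to_html(index=False)
print(html)
```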
