Pandas: Import

Pandas is a Python library used for working with and analyzing datasets. It allows users to clean messy data, explore and manipulate it, and make conclusions based on statistical analysis. Pandas provides functions for reading CSV files into DataFrames. DataFrames allow viewing and exploring data through methods like head(), tail(), and info(). Pandas can clean data by handling empty values, wrong formats, incorrect values, and duplicates. It also allows visualizing data through plots like bar plots, line plots, histograms, and more generated by the DataFrame plot method.


Pandas

Pandas is a Python library used for working with data sets.

It has functions for analyzing, cleaning, exploring, and manipulating data.

The name "Pandas" refers to both "Panel Data" and "Python Data Analysis".
The library was created by Wes McKinney in 2008.

Pandas allows us to analyze large data sets and draw conclusions based on
statistical theories.

Pandas can clean messy data sets, and make them readable and relevant.

Relevant data is very important in data science.

Read CSV Files

A simple way to store big data sets is to use CSV files (comma-separated
values files). The read_csv() function loads such a file into a DataFrame:

import pandas as pd

df = pd.read_csv('data.csv')
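As a minimal sketch of what read_csv() does, the example below parses a small CSV from an in-memory string instead of a file. The column names (Duration, Pulse, Calories) and the values are made up for illustration; with a real file you would pass its path, as in the line above.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for data.csv.
csv_text = """Duration,Pulse,Calories
60,110,409.1
45,117,282.4
30,103,
"""

# read_csv accepts a file path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)    # (rows, columns)
print(df.dtypes)   # column types inferred from the text
```

The empty field at the end of the last row is read as a missing value (NaN), which is the kind of empty cell the cleaning steps later in this document deal with.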

Viewing the Data


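The head() method returns the headers and a specified number of rows,
starting from the top. If no number is given, it returns the first 5 rows.
df.head()
df.head(20)
The tail() method returns the headers and a specified number of rows,
starting from the bottom. It also defaults to 5 rows.
df.tail()
Info About the Data
The info() method prints a summary of the DataFrame, and the columns
attribute holds the column labels.
df.info()
df.columns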
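The viewing methods can be tried on a small DataFrame built in memory. This is only a sketch; the column names and values are invented for the example.

```python
import pandas as pd

# Hypothetical data: 10 rows, two columns.
df = pd.DataFrame({"Duration": range(10, 110, 10), "Pulse": range(100, 110)})

first = df.head()         # default: first 5 rows
first_three = df.head(3)  # first 3 rows
last = df.tail()          # default: last 5 rows

print(first_three)
print(list(df.columns))
df.info()                 # prints the index range, column dtypes, and non-null counts
```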

Cleaning Data
1. Clean Empty Cells
2. Clean Wrong Format
3. Clean Wrong Data
4. Remove Duplicates

Data cleaning means fixing bad data in your data set.

Bad data could be:

• Empty cells
• Data in wrong format
• Wrong data
• Duplicates

1. Pandas - Cleaning Empty Cells

Empty Cells
Empty cells can potentially give you a wrong result when you analyze data.

Remove Rows
One way to deal with empty cells is to remove rows that contain empty cells.

import pandas as pd

df = pd.read_csv('data.csv')
new_df = df.dropna()

print(new_df.to_string())

If you want to change the original DataFrame, use the inplace = True
argument:

df.dropna(inplace = True)

print(df.to_string())
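The difference between the two forms can be seen on a small made-up DataFrame. This is a sketch with invented values, not the data.csv file used above.

```python
import numpy as np
import pandas as pd

# Hypothetical data with two incomplete rows.
df = pd.DataFrame({
    "Duration": [60, 45, np.nan, 30],
    "Calories": [409.1, np.nan, 300.0, 195.1],
})

new_df = df.dropna()       # returns a new DataFrame; df is unchanged
print(len(df), len(new_df))

df.dropna(inplace=True)    # modifies df itself and returns None
print(len(df))
```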

Replace Empty Values


Another way of dealing with empty cells is to insert a new value instead.

This way you do not have to delete entire rows just because of some empty
cells.

The fillna() method allows us to replace empty cells with a value:

df.fillna(130, inplace = True)

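Replace Only for Specified Columns


The example above replaces all empty cells in the whole DataFrame.

To only replace empty values for one column, select the column, fill it, and
assign the result back to the DataFrame. (Calling fillna() with inplace = True
on a selected column triggers a warning in recent pandas versions and may not
update the DataFrame.)

df["Calories"] = df["Calories"].fillna(130)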
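A sketch of column-specific filling on made-up data. It shows two forms that do not rely on modifying a selected column in place: assigning the filled column back, and passing a dict that maps column names to fill values. The Pulse column is an assumption added only to show that other columns are left alone.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Pulse": [110, np.nan, 103],
    "Calories": [409.1, np.nan, 300.0],
})

# Form 1: fill one column and assign it back.
filled = df.copy()
filled["Calories"] = filled["Calories"].fillna(130)

# Form 2: a dict maps column names to their fill values.
filled_dict = df.fillna({"Calories": 130})

print(filled)
```

In both results the empty Pulse cell is still empty, because only Calories was named.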

Replace Using Mean, Median, or Mode


A common way to replace empty cells is to calculate the mean, median, or
mode value of the column.
Pandas provides the mean(), median(), and mode() methods to calculate the
respective values for a specified column:

# Mean: the average value
x = df["Calories"].mean()

df["Calories"] = df["Calories"].fillna(x)

# Median: the middle value after sorting
x = df["Calories"].median()

df["Calories"] = df["Calories"].fillna(x)

# Mode: the most frequent value; mode() returns a Series, so take the first entry
x = df["Calories"].mode()[0]

df["Calories"] = df["Calories"].fillna(x)
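To make the three statistics concrete, here is a small worked example on an invented Calories column with one empty cell. All three methods skip the empty cell when computing their result.

```python
import numpy as np
import pandas as pd

calories = pd.Series([100.0, 200.0, 200.0, 500.0, np.nan], name="Calories")

mean_value = calories.mean()       # (100 + 200 + 200 + 500) / 4 = 250.0
median_value = calories.median()   # middle of the sorted values = 200.0
mode_value = calories.mode()[0]    # most frequent value = 200.0

print(calories.fillna(mean_value).tolist())
# [100.0, 200.0, 200.0, 500.0, 250.0]
```

The single large value (500) pulls the mean up to 250, while the median and mode stay at 200, so the choice of statistic changes what gets filled in.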

2. Data in wrong format

Data of Wrong Format


Cells with data in the wrong format can make it difficult, or even impossible,
to analyze data.

To fix it, you have two options: remove the rows, or convert all cells in the
column into the same format.

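Convert Into a Correct Format


The to_datetime() function converts the values in a column to dates:

df['Date'] = pd.to_datetime(df['Date'])

Removing Rows
Empty dates become NaT (Not a Time) after the conversion. Rows that contain
them can be removed with dropna():
df.dropna(subset=['Date'], inplace = True)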
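The two steps can be combined as in the sketch below. The data is made up, and errors="coerce" is an option not used in the text above: it tells to_datetime() to turn values it cannot parse into NaT instead of raising an error.

```python
import pandas as pd

# Hypothetical data: one empty date and one unparseable date.
df = pd.DataFrame({
    "Date": ["2020/12/01", "2020/12/02", None, "not a date"],
    "Duration": [60, 45, 30, 60],
})

# errors="coerce" turns unparseable values into NaT instead of raising.
df["Date"] = pd.to_datetime(df["Date"], errors="coerce")

# NaT counts as missing, so dropna can remove those rows.
df = df.dropna(subset=["Date"])
print(df)
```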

3. Wrong Data
"Wrong data" does not have to be "empty cells" or "wrong format"; it can
just be wrong, like a duration of "450" instead of "60".

Replacing Values
One way to fix wrong values is to replace them with something else.

df.loc[7, 'Duration'] = 120

For larger data sets, loop through all values in the "Duration" column and
apply a rule. Here, any value above 120 is set to 120:

for x in df.index:
  if df.loc[x, "Duration"] > 120:
    df.loc[x, "Duration"] = 120
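The same rule can also be written without an explicit loop. The sketch below, on made-up values, uses a Boolean mask with loc, and shows clip() as an alternative that returns a capped copy of the column.

```python
import pandas as pd

original = pd.DataFrame({"Duration": [60, 450, 45, 130]})

# A Boolean mask selects the rows to change; no explicit loop is needed.
df = original.copy()
df.loc[df["Duration"] > 120, "Duration"] = 120
print(df["Duration"].tolist())   # [60, 120, 45, 120]

# clip() caps the values and returns a new Series.
clipped = original["Duration"].clip(upper=120)
print(clipped.tolist())          # [60, 120, 45, 120]
```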

Removing Rows
Another way of handling wrong data is to remove the rows that contain
wrong data.

for x in df.index:
  if df.loc[x, "Duration"] > 120:
    df.drop(x, inplace = True)
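A loop-free sketch of the same removal, on invented data. The mask is negated rather than written as Duration <= 120 so that rows with an empty Duration are kept, which matches what the loop does.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Duration": [60, 450, np.nan, 45],
    "Pulse": [110, 104, 109, 100],
})

# Keep every row that is NOT above 120 (rows with NaN are kept too).
df = df[~(df["Duration"] > 120)]

print(df)
```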

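4. Removing Duplicates

The duplicated() method returns a Boolean value for each row: True if the row
repeats an earlier row, otherwise False.

df.duplicated()

Removing Duplicates
The drop_duplicates() method removes the repeated rows:
df.drop_duplicates(inplace = True)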
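A short sketch on made-up data, where the third row repeats the first:

```python
import pandas as pd

df = pd.DataFrame({
    "Duration": [60, 45, 60, 30],
    "Pulse": [110, 117, 110, 103],
})

# True for every row that repeats an earlier row.
flags = df.duplicated()
print(flags.tolist())    # [False, False, True, False]

# The first occurrence is kept; later repeats are dropped.
deduped = df.drop_duplicates()
print(len(deduped))      # 3
```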
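
Plotting Data

The plotting examples below assume a DataFrame df of monthly weather data
with columns such as Month, Tmax, Tmin, Rain, and Sun.

import matplotlib.pyplot as plt
import pandas as pd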

df.plot(y='Tmax', x='Month')

To plot both maximum and minimum temperatures, pass a list of column names to y:

df.plot(y=['Tmax','Tmin'], x='Month')

The Pandas Plot Function

Pandas has a built-in .plot() method as part of the DataFrame
class. It has several key parameters:

kind: 'bar', 'barh', 'pie', 'scatter', 'kde', etc.; the full list can be
found in the docs.
color: accepts an array of hex codes, applied in sequence to each data
series / column.
linestyle: 'solid', 'dotted', 'dashed' (applies to line graphs only)
xlim, ylim: a tuple (lower limit, upper limit) giving the axis range
over which the plot will be drawn
legend: a boolean value to display or hide the legend
labels: a list corresponding to the number of columns in the
dataframe; a descriptive name can be provided here for the legend
title: the string title of the plot
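Several of these parameters can be combined in one call, as in the sketch below. The weather values are invented, and the Agg backend is selected only so that the example runs without a display; in a notebook or desktop session that line is not needed.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so the sketch runs without a display
import pandas as pd

# Hypothetical monthly weather data.
df = pd.DataFrame({
    "Month": [1, 2, 3, 4],
    "Tmax": [7.0, 8.5, 11.2, 14.1],
    "Tmin": [1.5, 1.9, 3.4, 5.0],
})

ax = df.plot(
    kind="line",
    x="Month",
    y=["Tmax", "Tmin"],
    color=["#d62728", "#1f77b4"],  # one hex code per plotted column
    linestyle="dashed",
    ylim=(0, 20),
    legend=True,
    title="Monthly temperature",
)
```

The call returns a matplotlib Axes object, which can be adjusted further or saved through its figure.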

Bar Charts

df.plot(kind='bar', y='Tmax', x='Month')

df.plot.bar(y='Tmax', x='Month')

df.plot.barh(y='Tmax', x='Month')

df.plot.bar(y=['Tmax','Tmin'], x='Month')

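Different color for each series

df.plot.bar(y=['Tmax','Tmin'], x='Month', color=['blue', 'red'])

Control color of border

df.plot.bar(y='Tmax', x='Month', edgecolor='blue')

Axis labels, title, and figure size

df.plot.bar(y='Tmax', x='Month', xlabel='Class', ylabel='Amounts', title='I am title', figsize=(8, 6))

Rotate Label

df.plot.bar(y='Tmax', x='Month', rot=70)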

Multiple charts

df.plot(y=['Tmax','Tmin','Rain','Sun'], x='Month', subplots=True, layout=(2,2))

df.plot.bar(y=['Tmax','Tmin','Rain','Sun'], x='Month', subplots=True, layout=(2,2))

Scatter plot of two columns


df.plot.scatter(x='Sun', y='Rain')

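# Line plot; 'line' is the default kind, so both calls are equivalent
df.plot(kind='line', y='Tmax', x='Month')

df.plot(y='Tmax', x='Month')

# Histogram of the Tmax values
df.plot(kind='hist', y='Tmax', x='Month')

# Hexagonal binning plot; x and y must both be numeric columns
df.plot(kind='hexbin', y='Tmax', x='Month')

# Kernel density estimate of each numeric column (requires SciPy)
df.plot.kde()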

Pie

df = pd.DataFrame({'cost': [79, 40, 60]}, index=['Oranges', 'Bananas', 'Apples'])

df.plot.pie(y='cost', figsize=(8, 6))
plt.show()

Python Pandas - Series
Series is a one-dimensional labeled array capable of holding data of any type (integer,
string, float, Python objects, etc.). The axis labels are collectively called the index.
pandas.Series( data, index, dtype, copy)

data: takes various forms, such as an ndarray, a list, or a constant.

index: index values must be hashable and the same length as data; they do not
have to be unique. Defaults to np.arange(n) if no index is passed.

dtype: the data type of the values. If omitted, it is inferred from data.

copy: whether to copy the input data. Default False.

A series can be created using various inputs, such as:

• Array
• Dict
• Scalar value or constant

Create an Empty Series


import pandas as pd
s = pd.Series()
print(s)

In a notebook, the last expression in a cell is displayed automatically, so
print() is optional:
import pandas as pd
s = pd.Series()
s

Create a Series from ndarray


import pandas as pd
import numpy as np
data = np.array(['FY','SY','TY','BE'])
s = pd.Series(data)
s

Customized indexed values


import pandas as pd
import numpy as np
data = np.array(['FY','SY','TY','BE'])
s = pd.Series(data,index=[100,101,102,103])
s

Create a Series from dictionary


import pandas as pd
import numpy as np
data = {'a' : 0., 'b' : 1., 'c' : 2.}
s = pd.Series(data)
s

import pandas as pd
import numpy as np
data = {'a' : 0., 'b' : 1., 'c' : 2.}
s = pd.Series(data,index=['b','c','d','a'])
s
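When both a dict and an index are passed, values are pulled out of the dict by matching labels, in the order given by the index. A label with no matching key gets NaN. The sketch below repeats the example above and checks which entries are missing.

```python
import pandas as pd

data = {'a': 0., 'b': 1., 'c': 2.}
s = pd.Series(data, index=['b', 'c', 'd', 'a'])

print(s)
# 'd' is not a key in data, so its value is NaN (missing).
print(s.isna().tolist())   # [False, False, True, False]
```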

Create a Series from Scalar


If data is a scalar value, an index must be provided. The value will be repeated to match the
length of index.

import pandas as pd
import numpy as np
s = pd.Series(5, index=[0, 1, 2, 3])
s

Accessing Data from Series with Position


Data in the series can be accessed by position, similar to an ndarray. Use iloc
for a single position; slices with integer bounds are positional as well.
s = pd.Series([1,2,3,4,5],index = ['a','b','c','d','e'])

#retrieve the first element


s.iloc[0]

Retrieve the first three elements in the Series.


s[:3]

Retrieve the last three elements.


s[-3:]
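A compact sketch of position-based access that uses iloc throughout, which makes it explicit that the numbers are positions and not labels:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])

first = s.iloc[0]          # 1
first_three = s.iloc[:3]   # labels a, b, c
last_three = s.iloc[-3:]   # labels c, d, e

print(first, first_three.tolist(), last_three.tolist())
# 1 [1, 2, 3] [3, 4, 5]
```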

Retrieve Data Using Label (Index)
s = pd.Series([1,2,3,4,5],index = ['a','b','c','d','e'])

#retrieve a single element


print(s['d'])

Retrieve multiple elements using a list of index label values.


s = pd.Series([1,2,3,4,5],index = ['a','b','c','d','e'])

#retrieve multiple elements


s[['a','c','d']]

If a label is not in the index, a KeyError is raised.


s[['aa','c','d']]
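The sketch below catches the exception and, as an alternative not covered above, uses reindex(), which accepts unknown labels and fills them with NaN instead of raising.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])

missing = None
try:
    s[['aa', 'c', 'd']]
except KeyError as exc:
    missing = exc
    print("Missing label:", exc)

# reindex tolerates unknown labels and fills them with NaN.
safe = s.reindex(['aa', 'c', 'd'])
print(safe)
```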

DataFrame:
A DataFrame is a two-dimensional data structure, i.e., data is aligned in a tabular fashion in
rows and columns.

Features of DataFrame

• Columns can be of different types
• Size is mutable
• Labeled axes (rows and columns)
• Arithmetic operations can be performed on rows and columns

pandas.DataFrame( data, index, columns, dtype, copy)

data:
Data takes various forms, such as ndarray, Series, map, lists, dict, constants, or another
DataFrame.

index:
The row labels to use for the resulting frame. Optional; defaults to np.arange(n) if no
index is passed.

columns:
The column labels to use for the resulting frame. Optional; defaults to np.arange(n) if no
column labels are passed.

dtype:
Data type to force on the columns; only a single dtype is allowed. If omitted, it is
inferred from data.

copy:
Whether to copy the input data. Older pandas releases default to False; newer releases
choose the default based on the type of the input.
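A sketch of how the index, columns, and dtype parameters affect the result. The array values and the labels r1..r3, A, and B are invented for the example.

```python
import numpy as np
import pandas as pd

values = np.array([[1, 2], [3, 4], [5, 6]])

df = pd.DataFrame(
    values,
    index=['r1', 'r2', 'r3'],   # row labels
    columns=['A', 'B'],         # column labels
    dtype=float,                # cast every column to float
)

# Without index and columns, both axes are labeled 0..n-1.
default_labels = pd.DataFrame(values)

print(df)
print(default_labels.columns.tolist())   # [0, 1]
```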

Create DataFrame

Create an Empty DataFrame

import pandas as pd
df = pd.DataFrame()
print(df)

Create a DataFrame from Lists


import pandas as pd
data = [1,2,3,4,5]
df = pd.DataFrame(data)
print(df)

import pandas as pd
data = [['SY',10],['TY',12],['FY',13]]
df = pd.DataFrame(data,columns=['CLASS','RN'])
print(df)

Create a DataFrame from Dict of ndarrays / Lists


import pandas as pd
data = {'CLASS':['FY', 'SY', 'TY', 'BTECH'],'RN':[28,34,29,42]}
df = pd.DataFrame(data)
print(df)
