How to read a file
import pandas as pd
df = pd.read_csv(r'file_location\filename.txt', delimiter="\t")  # raw string, so \f is not read as an escape
or
data = pd.read_csv('output_list.txt', sep=" ", header=None)
data.columns = ["a", "b", "c", "etc."]
or
df = pd.read_csv('output_list.txt', sep=" ", header=None, names=["a", "b", "c"])
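The read_csv variants above can be tried without a file on disk. The following is a minimal sketch that assumes a small space-separated text with no header row; the in-memory buffer and its values are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical stand-in for a space-separated text file without a header row.
raw = io.StringIO("1 2 3\n4 5 6\n")

# header=None says the first line is data; names supplies the column labels.
df = pd.read_csv(raw, sep=" ", header=None, names=["a", "b", "c"])
print(df)
```

Passing names directly avoids the separate data.columns assignment shown in the second variant.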
Pandas
We're going to deepen our investigation into how Python can be used to manipulate,
clean, and query data by looking at the pandas data toolkit.
The pandas Series
The Series is the base data structure of pandas. A Series is similar to a NumPy
array, but it differs by having an index, which allows for much richer lookup of
items instead of just a zero-based array index value.
import pandas as pd
d=pd.Series([11,12,13,14])
d
Multiple items can be retrieved by specifying their labels in a Python list.
import pandas as pd
d[[1,3]]
Pandas: Series
d=pd.Series([11,12,13,14],index=['a','b','c','d'])
d[['a','b']]  # or, by position: d.iloc[[0,1]]
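A short sketch of the two lookup styles on a labelled Series, assuming the same values as above. It uses .loc and .iloc explicitly so that label-based and position-based access cannot be confused.

```python
import pandas as pd

d = pd.Series([11, 12, 13, 14], index=["a", "b", "c", "d"])

by_label = d.loc[["a", "c"]]     # lookup by index label
by_position = d.iloc[[0, 2]]     # lookup by 0-based position

print(by_label)
print(by_position)
```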
We can examine the index of a Series using the index property:
d.index
Two Series objects can be combined with an arithmetic operation; values are matched by index label
d1=pd.Series([11,12,13,14],index=['a','b','c','d'])
d2=pd.Series([1,2,3,5],index=['a','b','c','d'])
diff=d1-d2
print(diff)
diff.mean()
diff
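Arithmetic between two Series is aligned on index labels rather than on position. The sketch below assumes two Series whose indexes only partly overlap (the values are illustrative) to show what happens to labels present on one side only.

```python
import pandas as pd

s1 = pd.Series([11, 12, 13], index=["a", "b", "c"])
s2 = pd.Series([1, 2, 3], index=["b", "c", "d"])

# Labels are matched before subtracting; labels found in only one
# Series ("a" and "d" here) produce NaN in the result.
diff = s1 - s2
print(diff)
```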
Pandas: DataFrame
A pandas series can only have a single value associated with each index label.
To have multiple values per index label we can use a data frame. A data frame
represents one or more Series objects aligned by index label.
Each Series will be a column in the data frame, and each column can have an
associated name.
d1=pd.Series([11,12,13,14],index=['a','b','c','d'])
d2=pd.Series([1,2,3,4],index=['a','b','c','d'])
temp_df=pd.DataFrame({'value1':d1,'value2':d2})
temp_df
Columns in a DataFrame object can be accessed using an array indexer with the name
of the column or a list of column names
temp_df['value1']
temp_df[['value1','value2']]
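As a quick check of the two indexer forms, the sketch below rebuilds the same small DataFrame and inspects the type of each result: a single column name gives a Series, while a list of names gives a DataFrame, even when the list has one entry.

```python
import pandas as pd

d1 = pd.Series([11, 12, 13, 14], index=["a", "b", "c", "d"])
d2 = pd.Series([1, 2, 3, 4], index=["a", "b", "c", "d"])
temp_df = pd.DataFrame({"value1": d1, "value2": d2})

one_column = temp_df["value1"]       # a Series
as_frame = temp_df[["value1"]]       # a one-column DataFrame

print(type(one_column).__name__, type(as_frame).__name__)
```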
Passing a list to the [] operator of DataFrame retrieves the specified columns
whereas a Series would return rows.
A new column can be added to a DataFrame simply by assigning another Series to a
column using the array indexer [] notation
temp_dfs=pd.DataFrame()
g=temp_df['value1']-temp_df['value2']
print(g)
temp_df['diff']=temp_df['value1']-temp_df['value2']
temp_df
The names of the columns in a DataFrame are accessible via the columns
property
temp_df.columns
The DataFrame and Series objects can be sliced to retrieve specific rows
temp_df[0:3]
temp_df.value1[0:3]
Entire rows from a data frame can be retrieved using the .loc and .iloc properties.
.loc looks rows up by index label, whereas .iloc uses the 0-based position.
temp_df.loc['a']
temp_df.iloc[0]
temp_df.iloc[[1,3]].value1  # positions must exist; temp_df has only four rows
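The difference between .loc and .iloc is easiest to see when the index labels are integers that do not match the row order. A minimal sketch, assuming a made-up frame whose labels are deliberately shuffled:

```python
import pandas as pd

# Hypothetical frame: the index labels 2, 0, 1 do not follow row order.
df = pd.DataFrame({"value1": [11, 12, 13]}, index=[2, 0, 1])

by_label = df.loc[0]      # the row whose label is 0 (second row)
by_position = df.iloc[0]  # the first row, whatever its label

print(by_label["value1"], by_position["value1"])
```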
The following code shows values in the IMO column that are greater than 7
df2.IMO>7
Loading data from files into a DataFrame
import pandas as pd
df2 = pd.read_excel('2010.xlsx')
df2=pd.read_csv('2010.csv')
Get type of column
type(df2.IMO[0])
To transpose a DataFrame (swap rows and columns), we use the T attribute
df2=df2.T
Reading data from rows (after transposing, the former column names are the row labels)
df2.loc[['IYR','IMO']]
df2.loc['IYR'][0]
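The effect of transposing can be checked on a tiny frame. This sketch assumes two columns named IYR and IMO, as in the examples above, with made-up values.

```python
import pandas as pd

# Hypothetical two-row sample with the column names used above.
df = pd.DataFrame({"IYR": [2010, 2011], "IMO": [1, 2]})

t = df.T                  # former columns become the row labels
iyr_row = t.loc["IYR"]    # all IYR values, now stored as one row

print(t)
print(iyr_row.tolist())
```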
Deleting data from DataFrames using drop for rows or del for columns
df2.drop('IYR')
del df2['IMO']
df = df.drop(['IMO'], axis=1)  # axis=1 drops columns; the default axis=0 drops rows
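The following sketch contrasts dropping a row with dropping a column. The row labels r1 and r2 and the values are assumptions made for illustration; drop returns a new object and leaves the original unchanged unless the result is assigned back.

```python
import pandas as pd

df = pd.DataFrame({"IYR": [2010, 2011], "IMO": [1, 2]}, index=["r1", "r2"])

without_row = df.drop("r1")                # axis=0 (default): remove a row by label
without_col = df.drop(columns=["IMO"])     # same effect as drop(["IMO"], axis=1)

print(without_row)
print(without_col)
```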
Add column to DataFrames
df2['IMO']=0
Update data in DataFrames
df2['IMO']=df2['IMO']+2
Query for DataFrames
To find accidents that happened in a month later than 6, we write the condition below:
df2['IMO']>6
Now mask the result with the where method (non-matching rows become NaN), or filter with a Boolean index (non-matching rows are dropped):
dfbigger=df2.where(df2['IMO']>6)
dfbigger=df2[(df2['IMO']>6) & (df2['DAY']>10)]
dfbigger
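where and Boolean indexing behave differently, which a small example makes visible. This is a sketch on made-up IMO and DAY values, not on the real accident data.

```python
import pandas as pd

df = pd.DataFrame({"IMO": [3, 7, 9], "DAY": [5, 12, 20]})
mask = df["IMO"] > 6

masked = df.where(mask)                    # same shape; rows failing the test become NaN
filtered = df[mask & (df["DAY"] > 10)]     # only the rows passing both tests are kept

print(masked)
print(filtered)
```

Each condition needs its own parentheses when combined with &, because & binds more tightly than the comparison operators.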
Set or reset index for DataFrames
dfbigger=dfbigger.set_index('IYR')
print(dfbigger)
dfbigger=dfbigger.reset_index('IYR')
dfbigger
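A minimal round trip through set_index and reset_index, assuming a small frame with the IYR and IMO columns used above (values invented):

```python
import pandas as pd

df = pd.DataFrame({"IYR": [2010, 2010, 2011], "IMO": [7, 8, 9]})

indexed = df.set_index("IYR")      # IYR moves from the columns into the index
restored = indexed.reset_index()   # IYR becomes an ordinary column again

print(indexed)
print(restored)
```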
DataFrames: preProcess
Count non-NA cells for each column or row
df2.count(axis=0, numeric_only=False)
Get numeric columns or object columns
df2.dtypes
df2.select_dtypes(include='number').columns  # public alternative to the private _get_numeric_data()
df2.select_dtypes(include=['object'])
df4=df2.select_dtypes(include=['object'])
df2[~df2.isin(df4)]
Find empty cells and replace them with NaN
DataFrame.replace(to_replace=None, value=None, inplace=False, limit=None, regex=False, method='pad')
import numpy as np
df2 = df2.replace(r'^\s*$', np.nan, regex=True)  # anchored, so only empty or whitespace-only cells match
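The pattern matters here: an unanchored r'\s' matches any string that merely contains a space, while r'^\s*$' matches only cells that are empty or consist of whitespace. A small sketch on invented data, assuming that the anchored behaviour is what is wanted:

```python
import numpy as np
import pandas as pd

# Hypothetical column mixing a real value, a blank and an empty string.
df = pd.DataFrame({"city": ["New York", " ", ""], "n": [1, 2, 3]})

# Only cells that are entirely whitespace (or empty) are replaced with NaN.
cleaned = df.replace(r"^\s*$", np.nan, regex=True)
print(cleaned)
```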
Plot
matplotlib.pyplot is a collection of command-style functions that make matplotlib
work like MATLAB.
import matplotlib.pyplot as plt
plt.plot([1,2,3], [1,2,3], 'go-', linewidth=2)
plt.plot([1,2,3], [1,4,9], 'rs', markersize=14)
plt.show()
Another sample
import numpy as np
import matplotlib.pyplot as plt
t = np.arange(0., 5., 0.2)
plt.plot(t, t, 'r--', t, t**2, 'bs', t, t**3, 'g^')
plt.title('some values')
plt.xlabel('x')
plt.ylabel('y')
plt.legend(['t','t**2','t**3'])
plt.show()
Plot dataframe
another sample
t=np.arange(0,5,0.2)
df=pd.DataFrame({0:t , 1:t**1.5 , 2:t**2 , 3:t**2.5 , 4:t**3})
legend_labels=['Solid' , 'Dashed' , 'Dotted' , 'Dot-dashed' , 'Points']
df.plot(style=['r-','g--', 'b:', 'm-.' , 'k.'])  # 'k.' draws points, matching the 'Points' label
plt.legend(legend_labels )
plt.show()
matplotlib.pyplot.subplot
matplotlib.pyplot.subplots returns a Figure instance and either a single Axes or an array of Axes
(which one depends on the number of subplots requested)
matplotlib.pyplot.subplot(*args, **kwargs)
import matplotlib.pyplot as plt
import numpy as np
# Simple data to display in various forms
x = np.linspace(0, 2 * np.pi, 400)
y = np.sin(x ** 2)
plt.close('all')
# Just a figure and one subplot
f, ax = plt.subplots()
ax.plot(x, y)
ax.set_title('Simple plot')
plt.show()
A scatter plot displays the correlation between a pair of variables
Define two subplots, stacked vertically
f, axarr = plt.subplots(2, sharex=True)
axarr[0].plot(x, y)
axarr[0].set_title('Sharing X axis')
axarr[1].scatter(x, y)
plt.show()
Define two subplots in one row
f, (ax1, ax2) = plt.subplots(1, 2, sharey=True)
ax1.plot(x, y)
ax1.set_title('Sharing Y axis')
ax2.scatter(x, y)
plt.show()
Define three subplots sharing both x and y axes
f, (ax1, ax2, ax3) = plt.subplots(3, sharex=True, sharey=True)
ax1.plot(x, y)
ax1.set_title('Sharing both axes')
ax2.scatter(x, y)
ax3.scatter(x, 2 * y ** 2 - 1, color='r')
plt.show()
Define four axes, returned as a 2-d array
f, axarr = plt.subplots(2, 2)
axarr[0, 0].plot(x, y)
axarr[0, 0].set_title('Axis [0,0]')
axarr[0, 1].scatter(x, y)
axarr[0, 1].set_title('Axis [0,1]')
axarr[1, 0].plot(x, y ** 2)
axarr[1, 0].set_title('Axis [1,0]')
axarr[1, 1].scatter(x, y ** 2)
axarr[1, 1].set_title('Axis [1,1]')
plt.show()
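What plt.subplots returns depends on the grid requested, which the sketch below checks directly. It assumes a non-interactive backend (Agg) so that it runs without a display; in a notebook or desktop session that line can be dropped and plt.show() called instead.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: no display available, so draw off-screen
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 2 * np.pi, 50)

fig1, ax = plt.subplots()           # one subplot: a single Axes object
fig2, axarr = plt.subplots(2, 2)    # 2x2 grid: a 2-d NumPy array of Axes

axarr[1, 0].plot(x, np.sin(x))      # index the grid as [row, column]
print(type(ax).__name__, axarr.shape)
plt.close("all")
```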
Calculate correlations with pandas and plot them with the seaborn package
1-
colNames = ["Age", "type_employer", "fnlwgt", "Education", "Education-Num", "Martial","Occupation",
"Relationship", "Race", "Sex", "Capital Gain", "Capital Loss",
"H-per-week", "Country", "Label"]
data = pd.read_csv("adult-data.txt", names=colNames,delimiter=',',header=None)
data
2- conda install seaborn
3-
import seaborn as sns
%matplotlib inline
sns.heatmap(data.corr(numeric_only=True))  # correlation is defined only for the numeric columns
plt.show()
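The matrix that the heatmap displays can be inspected on its own. The sketch below uses a tiny invented table with two numeric columns and one text column; numeric_only=True (available in recent pandas versions) leaves the text column out of the calculation.

```python
import pandas as pd

# Hypothetical sample; column names follow the adult dataset used above.
data = pd.DataFrame({
    "Age": [20, 30, 40, 50],
    "H-per-week": [10, 20, 30, 40],
    "Sex": ["M", "F", "M", "F"],
})

# Pairwise Pearson correlation of the numeric columns only.
corr = data.corr(numeric_only=True)
print(corr)
```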
Read data
from sklearn import preprocessing
import numpy as np
import pandas as pd
df2 = pd.read_excel('2010.xlsx')
df2
Show Numeric Columns
df2.select_dtypes(include=[np.number])
Replace empty cells with NaN
df2 = df2.replace(r'^\s*$', np.nan, regex=True)  # anchored, so only empty or whitespace-only cells match
Drop all empty columns
df2=df2.dropna(axis='columns', how='all')
#df2.isnull().mean()
df2.fillna(df2.mean(numeric_only=True),inplace=True)
Keep only the columns in which less than half of the values are missing (threshold 0.5)
#df2.columns[df2.isnull().mean() < 0.8]
df2=df2[df2.columns[df2.isnull().mean() < 0.5]]
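isnull().mean() gives the fraction of missing values per column, because True counts as 1 and False as 0. A minimal sketch with invented columns a, b and c shows which ones survive the 0.5 threshold:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1, np.nan, 3, 4],             # 25% missing
    "b": [np.nan, np.nan, np.nan, 4],   # 75% missing
    "c": [1, 2, 3, 4],                  # nothing missing
})

missing_fraction = df.isnull().mean()
kept = df.loc[:, missing_fraction < 0.5]   # keep columns below the threshold

print(missing_fraction)
print(list(kept.columns))
```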
Find Missing values
Now let's see if we have any missing value
df2.isnull()
df2.notnull()
df2.isnull()[15:20]
It is possible to drop every row that contains a NaN value:
df2 = df2.dropna()
df2 = df2.dropna(axis='index', how='all')  # drop only the rows in which every value is NaN
If a column like IMO2 is all NaN, we can drop it:
df2 = df2.drop(['IMO2'], axis=1)
Show the number of null values in each column
df2.isnull().sum()
Delete Missing values or replace
Fill the NaN cells of every numeric column with that column's mean
df2.fillna(df2.mean(numeric_only=True),inplace=True)
If the IYR value of some accidents is NaN in our dataset, we can
replace NaN with the mean value of that column
df2.loc[df2.index[[1, 2, 3]], 'IYR'] = np.nan  # or by label: df2.loc[[0, 11, 12, 13, 14, 15, 16], 'IYR'] = np.nan
df2=df2.fillna({'IYR': df2['IYR'].mean()})
df2[1:10]
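Filling a single column with its mean can be verified on a few rows. This sketch assumes a made-up IYR column with one missing value; passing a dict to fillna limits the replacement to the named column.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"IYR": [2010.0, np.nan, 2012.0], "name": ["x", "y", "z"]})

# mean() skips NaN by default, so the mean here comes from the two known values.
filled = df.fillna({"IYR": df["IYR"].mean()})
print(filled)
```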