Data Analysis with Python - Pandas | WeiYuan
Data Analysis with Python - Pandas
Python Programming for Beginners | WeiYuan
site: v123582.github.io
line: weiwei63
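§ Full-stack engineer + data scientist
Knows a little about front-end and back-end web development, and has
picked up the basics of data mining and machine learning. Enjoys joining
technical community meetups and the fun of contributing to open source.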
Outline
§ What is Pandas?
§ Series
§ DataFrame
§ Descriptive statistics and statistical functions
§ Merging and grouping data
§ Missing data and sparse data
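What is Pandas?
§ Pandas is a data analysis library built on NumPy. It provides a large
set of high-level data structures and data processing methods designed
for efficient data analysis. Its two main data structures are Series and
DataFrame, both built on top of NumPy's ndarray. A DataFrame can be
thought of as a container of Series; in other words, a DataFrame is
made up of Series.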
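The "DataFrame is a container of Series" idea can be checked directly. The snippet below is a minimal sketch; the column names and values are made-up placeholders, not data from these slides.

```python
import numpy as np
import pandas as pd

# Hypothetical example data; the column names are placeholders.
df = pd.DataFrame({"height": [170.0, 165.5, 180.2], "weight": [65, 58, 77]})

col = df["height"]                              # selecting one column gives a Series
print(type(col).__name__)                       # Series
print(col.index.equals(df.index))               # True: the column shares the row index
print(isinstance(col.to_numpy(), np.ndarray))   # True: the data is a NumPy array
```

Each column is a Series that shares the DataFrame's row index, which is why the two structures behave so similarly.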
Import pandas into python
import numpy as np
import pandas as pd
Series and DataFrame
s = pd.Series([1, 3, 5, np.nan, 6, 8])
# 0    1.0
# 1    3.0
# 2    5.0
# 3    NaN
# 4    6.0
# 5    8.0
# dtype: float64
Series and DataFrame
d = pd.DataFrame(np.random.randn(6, 4),
                 index=np.arange(6),
                 columns=['A', 'B', 'C', 'D'])
# Example output (the values are random and differ on every run):
#           A         B         C         D
# 0  0.358221 -0.870112 -1.393456 -0.902327
# 1  1.210681 -0.484630  1.551892 -1.747265
# 2  0.587932 -0.433354 -0.742197 -0.128311
# 3 -0.100495 -0.742343 -0.356780 -0.346326
# 4 -0.789095  0.494642 -0.368307  0.614529
# 5  1.689294 -1.468678  2.886471  1.076100
Outline
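§ What is Pandas?
§ Series
§ DataFrame
§ Descriptive statistics and statistical functions
§ Merging and grouping data
§ Missing data and sparse data
§ Series Definition
§ Basic usage of Series
§ Create a Series
§ Access a Series
§ Reindex
§ Inserting or dropping data
§ Arithmetic operations and data alignment
§ Sorting and ranking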
Series
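§ A Series is a one-dimensional array container, similar to a NumPy
one-dimensional array. Besides a set of values it also holds a set of
index labels, so it can be understood as a labeled array. It can hold
data of any type (integers, strings, floating-point numbers, and so on);
the labels are collectively called the index.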
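A short sketch of the "labeled array" idea described above. The labels and values are arbitrary placeholders chosen for illustration.

```python
import pandas as pd

s = pd.Series([10, 20, 30], index=["x", "y", "z"])

print(s.index.tolist())    # ['x', 'y', 'z']  -> the labels
print(s.values.tolist())   # [10, 20, 30]     -> the data
print(s["y"])              # 20   (lookup by label, like a dictionary)
print("y" in s)            # True (membership checks the labels, not the values)

# Mixed types are allowed; pandas then stores them with dtype object.
mixed = pd.Series([1, "two", 3.0])
print(mixed.dtype)         # object
```

The label lookup is what makes a Series resemble a dictionary, while the underlying values still behave like a NumPy array.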
Series Definition
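pandas.Series(data, index, dtype, copy)
• data => The data, which can take several forms, such as an ndarray, a list, or a constant (scalar).
• index => Index labels must be hashable and have the same length as the data (duplicate labels are allowed). Defaults to np.arange(n) if no index is passed.
• dtype => The data type. If omitted, it is inferred from the data.
• copy => Whether to copy the input data. Defaults to False.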
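The parameters listed above can be tried out as follows. This is a minimal sketch with made-up values; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

data = np.array([1, 2, 3])

s_default = pd.Series(data)                                          # index defaults to 0, 1, 2
s_labeled = pd.Series(data, index=["a", "b", "c"], dtype="float64")  # explicit index and dtype
s_copy = pd.Series(data, copy=True)                                  # independent copy of the input

data[0] = 99                      # mutate the original array afterwards

print(s_default.index.tolist())   # [0, 1, 2]
print(s_labeled.dtype)            # float64
print(s_copy.iloc[0])             # 1  (the copy is unaffected by the mutation)
```

Passing copy=True is the safe choice when the input array will be modified later.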
Series vs Dictionary
Series vs NdArray
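Basic Usage of Series
§ axes: returns a list of the row axis labels.
§ dtype: returns the data type (dtype) of the object.
§ empty: returns True if the Series is empty.
§ ndim: returns the number of dimensions of the underlying data, which is 1 by definition.
§ size: returns the number of elements in the underlying data.
§ values: returns the Series as an ndarray.
§ head(): returns the first n rows.
§ tail(): returns the last n rows.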
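The attributes and methods above can be explored on a small Series. The values and labels here are placeholders for illustration.

```python
import pandas as pd

s = pd.Series([1.5, 2.5, 3.5, 4.5], index=["a", "b", "c", "d"])

print(s.axes)      # a list holding the single row index
print(s.dtype)     # float64
print(s.empty)     # False
print(s.ndim)      # 1
print(s.size)      # 4
print(s.values)    # the data as a NumPy array
print(s.head(2))   # first two rows (labels a, b)
print(s.tail(2))   # last two rows (labels c, d)
```

Both head() and tail() default to n=5 when called without an argument.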
Create a Series
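import pandas as pd
# pass dtype explicitly: recent pandas versions no longer default an empty Series to float64
s = pd.Series(dtype="float64")
print(s)
# Series([], dtype: float64)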
Create a Series
import pandas as pd
import numpy as np
data = np.array(['a','b','c','d'])
s = pd.Series(data)
print(s)
# 0 a
# 1 b
# 2 c
# 3 d
# dtype: object
Create a Series
import pandas as pd
import numpy as np
data = np.array(['a','b','c','d'])
s = pd.Series(data, index=[100,101,102,103])
print(s)
# 100 a
# 101 b
# 102 c
# 103 d
# dtype: object
Create a Series
import pandas as pd
data = {'a' : 0., 'b' : 1., 'c' : 2.}
s = pd.Series(data)
print(s)
# a 0.0
# b 1.0
# c 2.0
# dtype: float64
Create a Series
import pandas as pd
data = {'a' : 0., 'b' : 1., 'c' : 2.}
s = pd.Series(data,index=['b','c','d','a'])
print(s)
# b 1.0
# c 2.0
# d NaN
# a 0.0
# dtype: float64
Create a Series
import pandas as pd
s = pd.Series(5, index=[0, 1, 2, 3])
print(s)
# 0 5
# 1 5
# 2 5
# 3 5
# dtype: int64
Try it!
§ Exercise: Write a Python program to create and display a
one-dimensional array-like object containing an array of
data using the Pandas module.
Try it!
§ Exercise: Write a Python program to convert a Pandas
Series to a Python list and display its type.
Try it!
§ Exercise: Write a Python program to convert a Python
dictionary to a Pandas Series and display its type.
Access a Series
import pandas as pd
s = pd.Series([1,2,3,4,5],index = ['a','b','c','d','e'])
# inspect the index labels and the underlying values
print(s.index) # labels: 'a', 'b', 'c', 'd', 'e'
print(s.values) # values: 1, 2, 3, 4, 5
Access a Series
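import pandas as pd
s = pd.Series([1,2,3,4,5],index = ['a','b','c','d','e'])
# retrieve the first element
print(s.iloc[0]) # 1 (by position; plain s[0] is deprecated when the index holds labels)
print(s['a']) # 1 (by label)
print(s.get('a')) # 1 (by label; returns None if the label is missing)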
Access a Series
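import pandas as pd
s = pd.Series([1,2,3,4,5],index = ['a','b','c','d','e'])
# retrieve the first three elements
print(s[:3]) # positional slice: the end position is excluded
print(s[:'c']) # label slice: the end label is included
# both print:
# a    1
# b    2
# c    3
# dtype: int64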
Access a Series
import pandas as pd
s = pd.Series([1,2,3,4,5],index = ['a','b','c','d','e'])
# retrieve the last three elements
print(s[-3:])
print(s['c':])
# both print:
# c    3
# d    4
# e    5
# dtype: int64
Access a Series
import pandas as pd
s = pd.Series([1,2,3,4,5],index = ['a','b','c','d','e'])
# retrieve multiple elements
print(s.iloc[[0, 2, 3]]) # by position
print(s[['a','c','d']]) # by label
# both print the rows labeled a, c and d (values 1, 3, 4)
Access a Series
import pandas as pd
s = pd.Series([1,2,3,4,5],index = ['a','b','c','d','e'])
# a label that is not in the index raises an error
print(s['f'])
# KeyError: 'f'
Reindex
import pandas as pd
s = pd.Series([1,2,3],index = ['c','b','a'])
# assigning to .index relabels in place; the values keep their order
s.index = ['a', 'b', 'c']
print(s) # a=1, b=2, c=3
Reindex
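import pandas as pd
s = pd.Series([1,2,3],index = ['c','b','a'])
# reindex() reorders rows by label; fill_value is used only for labels missing from the original index
s = s.reindex(['a', 'b', 'c'], fill_value = 0)
print(s) # a=3, b=2, c=1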
Reindex
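import pandas as pd
s = pd.Series([1, 2, 3])
# forward fill: the new labels 3-5 take the last known value
ffill = s.reindex(range(6), method = 'ffill')
print(ffill) # values: 1, 2, 3, 3, 3, 3
# backward fill: there is no later value to fill from, so labels 3-5 become NaN
bfill = s.reindex(range(6), method = 'bfill')
print(bfill) # values: 1, 2, 3, NaN, NaN, NaN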
Inserting or Dropping Data
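import pandas as pd
ds = pd.Series([1,2,3,4,5])
# append() expected another Series, not a scalar (the method itself was removed in pandas 2.0)
ds.append(6)
# on the older pandas version used for these slides:
# AttributeError: 'int' object has no attribute 'index'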
Inserting or Dropping Data
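import pandas as pd
ds = pd.Series([1,2,3,4,5])
# pd.concat replaces Series.append, which was removed in pandas 2.0
pd.concat([ds, pd.Series([6])])
# 0    1
# 1    2
# 2    3
# 3    4
# 4    5
# 0    6   (original labels are kept, so label 0 appears twice)
# dtype: int64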
Inserting or Dropping Data
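# set_value() was removed in pandas 1.0; assigning to a new label through .loc enlarges the Series in place
ds.loc[max(ds.index) + 1] = 6
# 0    1
# 1    2
# 2    3
# 3    4
# 4    5
# 5    6
# dtype: int64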
Inserting or Dropping Data
import numpy as np
# build a new Series from the raw values plus the new element; the index is reset to 0..n-1
pd.Series(np.concatenate((ds.values, [6])))
Inserting or Dropping Data
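import pandas as pd
s = pd.Series([1,2,3])
del s[0] # removes label 0 in place
s.pop(1) # removes label 1 in place and returns its value (2)
s.drop(2) # returns a new Series without label 2; s itself is unchanged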
Arithmetic Operations and Data Alignment
import pandas as pd
s1 = pd.Series([1,2,3])
s2 = pd.Series([3,2,1])
s1 + s2
s1 + 1
s1 - s2
s1 - 1
s1 * s2
s1 * 2
s1 / s2
s1 / 2
Arithmetic Operations and Data Alignment
import pandas as pd
s1 = pd.Series([1,2,3], index=['a', 'b', 'c'])
s2 = pd.Series([3,2,1], index=['d', 'e', 'f'])
# s1 and s2 share no labels, so every Series-with-Series result is NaN across the union a..f
s1 + s2
s1 + 1
s1 - s2
s1 - 1
s1 * s2
s1 * 2
s1 / s2
s1 / 2
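To see alignment with partially overlapping labels, here is a minimal sketch. The values are illustrative, and the use of add() with fill_value is one possible way to avoid NaN, not something taken from the slides.

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=["a", "b", "c"])
s2 = pd.Series([10, 20, 30], index=["b", "c", "d"])

total = s1 + s2                      # labels are matched, not positions
print(total)
# a     NaN
# b    12.0
# c    23.0
# d     NaN
# dtype: float64

filled = s1.add(s2, fill_value=0)    # treat a label missing on one side as 0
print(filled.tolist())               # [1.0, 12.0, 23.0, 30.0]
```

Labels present in only one Series produce NaN with the plain operator, which also turns the result into float64.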
Try it!
§ Exercise: Write a Python program to add, subtract,
multiply and divide two given Pandas Series.
• Sample Input: [2, 4, 6, 8, 10], [1, 3, 5, 7, 9]
Try it!
§ Exercise: Write a Python program to add, subtract,
multiply and divide two Pandas Series read from user
input.
• Sample Input: [2, 4, 6, 8, 10], [1, 3, 5, 7, 9]
Try it!
§ Exercise: Write a Python program to compare the
elements of the two Pandas Series.
• Sample Input: [2, 4, 6, 8, 10], [1, 3, 5, 7, 9]
Sorting and Ranking
import pandas as pd
s = pd.Series([1,2,3])
s.sort_index()
s.sort_index(ascending=False)
s.sort_values()
s.sort_values(ascending=False)
s.rank()
s.rank(ascending=False)
Thanks for listening.
2017/08/03 (Thu.) Scientific Computing with Python – NumPy
Wei-Yuan Chang
v123582@gmail.com
v123582.github.io
