PYTHON FOR DATA SCIENCE Iteration Also see NumPy Arrays Combining Data
Cheat Sheet Selecting
>>> df3.loc[:,(df3>1).any()] Select cols with any vals >1 X1
data1
X2 X1
data2
X3
>>> df3.loc[:,(df3>1).all()] Select cols with vals > 1
a 11.432 a 20.784
>>> df3.loc[:,df3.isnull().any()] Select cols with NaN
PANDAS
>>> df3.loc[:,df3.notnull().all()] Select cols without NaN b 1.303 b NaN
Indexing With isin c 99.906 d 20.784
>>> df[(df.Country.isin(df2.Type))] Find same elements
Learn Python for Data Science Interactively at >>> df3.filter(items=”a”,”b”])
>>> df.select(lambda x: not x%5)
Filter on values
Select specific elements
www.DataCamp.com Where
>>> s.where(s > 0) Subset the data
Setting/Resetting Index
Query
>>> df6.query(‘second > first’) Query DataFrame >>> pd.merge(data1, X1 X2 X3
data2, a 11.432 20.784
Reshaping Data how=’left’,
b 1.303 NaN
on=’X1’)
c 99.906 NaN
Pivot Setting/Resetting Index
>>> pd.merge(data1, X1 X2 X3
>>> df3= df2.pivot (index=’Date’, Spread rows into columns >>> df.set_index(‘Country’) Set the index
columns=’Type’, >>> df4 = df.reset_index() Reset the index data2, a 11.432 20.784
values=’Value’) >>> df = df.rename (index=str, Rename DataFrame how=’right’,
b 1.303 NaN
columns={“Country”:”cntry”, on=’X1’)
d NaN 20.784
“Capital”:”cptl”,
Date Type Value Type a b c “Population”:”ppltn”})
>>> pd.merge(data1, X1 X2 X3
0 2016-03-01 a 11.432 Date data2, a 11.432 20.784
1 2016-03-02 b 13.031 2016-03-01 11.432 NaN 20.784 Reindexing how=’inner’,
b 1.303 NaN
on=’X1’)
2 2016-03-01 c 20.784 2016-03-02 1.303 13.031 NaN
>>> s2 = s.reindex([‘a’,’c’,’d’,’e’,’b’])
3 2016-03-03 a 99.906 2016-03-03 99.906 NaN 20.784
4 2016-03-02 a 1.303 >>> pd.merge(data1, X1 X2 X3
5 2016-03-03 c 20.784 Forward Filling Backward Filling data2, a 11.432 20.784
>>> df.reindex (range(4), >>> s3 = s.reindex (range(5), how=’outer’,
b 1.303 NaN
method=’ffill’) method=’bfill’) on=’X1’)
c 99.906 NaN
Pivot Table Country Capital Population 0 3
0 Belgium Brussels 11190846 1 3 d NaN 20.784
>>> df4 = pd.pivot_table(df2, Spread rows into columns 1 India New Delhi 1303171035 2 3
values=’Value’, 2 Brazil Brasília 207847528 3 3
index=’Date’, 3 Brazil Brasília 207847528 4 3 Join
columns=’Type’])
>>> data1.join(data2, how=’right’)
MultiIndexing
Stack / Unstack
>>> arrays = [np.array([1,2,3]), MultiIndexing
>>> stacked = df5.stack() Pivot a level of column labels np.array([5,4,3])]
>>> stacked.unstack() Pivot a level of index labels >>> df5 = pd.DataFrame(np.random.rand(3, 2), index=arrays) Vertical
>>> tuples = list(zip(*arrays)) >>> s.append(s2)
0 1 1 5 0 0.233482 Horizontal/Vertical
>>> index = pd.MultiIndex.from_tuples(tuples,
1 5 0.233482 0.390959 1 0.390959 names=[‘first’, ‘second’]) >>> pd.concat([s,s2],axis=1, keys=[‘One’,’Two’])
2 4 0.184713 0.237102 2 4 0 0.184713 >>> df6 = pd.DataFrame(np.random.rand(3, 2), index=index) >>> pd.concat([data1, data2], axis=1, join=’inner’)
>>> df2.set_index([“Date”, “Type”])
3 3 0.433522 0.429401 1 0.237102
Unstacked 3 3 0 0.433522
1 0.429401
Stacked Dates
Duplicate Data >>> df2[‘Date’]= pd.to_datetime(df2[‘Date’])
Melt >>> df2[‘Date’]= pd.date_range(‘2000-1-1’,
>>> s3.unique() Return unique values periods=6,
>>> pd.melt(df2, Gather columns into rows >>> df2.duplicated(‘Type’) Check duplicates freq=’M’)
id_vars=[“Date”], >>> df2.drop_duplicates(‘Type’, keep=’last’) Drop duplicates >>> dates = [datetime(2012,5,1), datetime(2012,5,2)]
value_vars=[“Type”,“Value”], >>> df.index.duplicated() Check index duplicates >>> index = pd.DatetimeIndex(dates)
value_name=”Observations”) >>> index = pd.date_range(datetime(2012,2,1), end, freq=’BM’)
Date Type Value Date Variable Observations
0 2016-03-01 a 11.432 0 2016-03-01 Type a
Grouping Data
1 2016-03-02 b 13.031 1 2016-03-02 Type b Visualization
2 2016-03-01 c 20.784 2 2016-03-01 Type c
Aggregation
3 2016-03-03 a 99.906 3 2016-03-03 Type a >>> df2.groupby(by=[‘Date’,’Type’]).mean() >>> import matplotlib.pyplot as plt
4 2016-03-02 a 1.303 4 2016-03-02 Type a >>> df4.groupby(level=0).sum()
>>> df4.groupby(level=0).agg({‘a’:lambda x:sum(x)/len(x), >>> s.plot() >>> df2.plot()
5 2016-03-03 c 20.784 5 2016-03-03 Type c
‘b’: np.sum}) >>> plt.show() >>> plt.show()
6 2016-03-01 Value 11.432 Transformation
7 2016-03-02 Value 13.031 >>> customSum = lambda x: (x+x%2)
>>> df4.groupby(level=0).transform(customSum)
8 2016-03-01 Value 20.784
9 2016-03-03 Value 99.906
10 2016-03-02 Value 1.303
11 2016-03-03 Value 20.784
Missing Data
>>> df.dropna() Drop NaN values
Iteration >>> df3.fillna(df3.mean()) Fill NaN values with a pre
determined value
>>> df.iteritems() (Column-index, Series) pairs >>> df2.replace(“a”, “f”) Replace values with others
>>> df.iterrows() (Row-index, Series) pairs