Data Analysis With PANDAS: Cheat Sheet

The document discusses various data structures and operations in pandas including Series, DataFrames, hierarchical indexing, and common operations. It provides an overview of how to create, access, and manipulate Series and DataFrames through various methods like getting column and row values, assigning new columns, deleting columns, and switching rows and columns. It also covers hierarchical indexing and how to perform operations like sorting, swapping, and summarizing data at different index levels in DataFrames.

Uploaded by

nesayu

Created By: Arianne Colton and Sean Chen

Data Structures

SERIES (1D)
One-dimensional array-like object containing an array of data (of any NumPy data type) and an associated array of data labels, called its "index". If the index is not specified, a default one consisting of the integers 0 through N-1 is created.

Create Series
    series1 = pd.Series([1, 2], index = ['a', 'b'])
    series1 = pd.Series(dict1) *
Get Series Values
    series1.values
Get Values by Index
    series1['a']
    series1[['b', 'a']]
Get Series Index
    series1.index
Get Name Attribute (None is default)
    series1.name
    series1.index.name
** Common Index Values are Added
    series1 + series2
Unique But Unsorted
    series2 = series1.unique()

* Can think of Series as a fixed-length, ordered dict. Series can be substituted into many functions that expect a dict.
** Differently-indexed data is auto-aligned in arithmetic operations.

DATAFRAME (2D)
Tabular data structure with an ordered collection of columns, each of which can be a different value type. A DataFrame (DF) can be thought of as a dict of Series.

Create DF (from a dict of equal-length lists or NumPy arrays)
    dict1 = {'state': ['Ohio', 'CA'], 'year': [2000, 2010]}
    df1 = pd.DataFrame(dict1)
    # columns are placed in sorted order
    df1 = pd.DataFrame(dict1, index = ['row1', 'row2'])
    # specifying index
    df1 = pd.DataFrame(dict1, columns = ['year', 'state'])
    # columns are placed in your given order
* Create DF (from nested dict of dicts; the inner keys become the row indices)
    dict1 = {'col1': {'row1': 1, 'row2': 2}, 'col2': {'row1': 3, 'row2': 4}}
    df1 = pd.DataFrame(dict1)

* Dicts of Series are treated the same as a nested dict of dicts.

Get Columns and Row Names
    df1.columns
    df1.index
Get Name Attribute (None is default)
    df1.columns.name
    df1.index.name
Get Values
    df1.values
    # returns the data as a 2D ndarray; the dtype will be chosen to accommodate all of the columns
** Get Column as Series
    df1['state'] or df1.state
** Get Row as Series
    df1.ix['row2'] or df1.ix[1]
Assign a column that doesn't exist (creates a new column)
    df1['eastern'] = df1.state == 'Ohio'
Delete a column
    del df1['eastern']
Switch Columns and Rows
    df1.T

** Data returned is a view on the underlying data, NOT a copy. Thus, any in-place modifications to the data will be reflected in df1.

Data Structures continued

PANEL DATA (3D)
Create Panel Data : (Each item in the Panel is a DF)
    import pandas_datareader.data as web
    panel1 = pd.Panel({stk : web.get_data_yahoo(stk, '1/1/2000', '1/1/2010')
                       for stk in ['AAPL', 'IBM']})
    # panel1 Dimensions : 2 (item) * 861 (major) * 6 (minor)

"Stacked" DF form : (Useful way to represent panel data)
    panel1 = panel1.swapaxes('item', 'minor')
    panel1.ix[:, '6/1/2003', :].to_frame() *
    => Stacked DF (with hierarchical indexing **) :
    #                     Open  High  Low  Close  Volume  Adj-Close
    # major       minor
    # 2003-06-01  AAPL
    #             IBM
    # 2003-06-02  AAPL
    #             IBM

* DF has a "to_panel()" method which is the inverse of "to_frame()".
** Hierarchical indexing makes N-dimensional arrays unnecessary in a lot of cases. Aka prefer to use Stacked DF, not Panel data.

INDEX OBJECTS
Immutable objects that hold the axis labels and other metadata (i.e. axis name), i.e. Index, MultiIndex, DatetimeIndex, PeriodIndex.
Any sequence of labels used when constructing a Series or DF is internally converted to an Index.
Can function as a fixed-size set in addition to being array-like.

HIERARCHICAL INDEXING
Multiple index levels on an axis : A way to work with higher dimensional data in a lower dimensional form.

MultiIndex :
    series1 = pd.Series(np.random.randn(6),
                        index = [['a', 'a', 'a', 'b', 'b', 'b'], [1, 2, 3, 1, 2, 3]])
    series1.index.names = ['key1', 'key2']
Series Partial Indexing
    series1['b']    # Outer Level
    series1[:, 2]   # Inner Level
DF Partial Indexing
    df1['outerCol3', 'InnerCol2']
    Or
    df1['outerCol3']['InnerCol2']

Swapping and Sorting Levels
Swap Level (levels interchanged) *
    swapSeries1 = series1.swaplevel('key1', 'key2')
Sort Level
    series1.sortlevel(1)
    # sorts according to first inner level
Common Ops : Swap and Sort **
    series1.swaplevel(0, 1).sortlevel(0)
    # the order of rows also changes

* The order of the rows does not change. Only the two levels get swapped.
** Data selection performance is much better if the index is sorted starting with the outermost level, as a result of calling sortlevel(0) or sort_index().

Summary Statistics by Level
Most stats functions in DF or Series have a "level" option with which you can specify the level you want on an axis.
Sum rows (that have the same 'key2' value)
    df1.sum(level = 'key2')
Sum columns
    df1.sum(level = 'col3', axis = 1)
Under the hood, the functionality provided here utilizes pandas "groupby".

DataFrame's Columns as Indexes
DF's "set_index" will create a new DF using one or more of its columns as the index.
New DF using columns as index
    df2 = df1.set_index(['col3', 'col4']) *
    # col3 becomes the outermost index, col4 becomes the inner index. Values of col3, col4 become the index values.

* "reset_index" does the opposite of "set_index" : the hierarchical index levels are moved into columns.
By default, 'col3' and 'col4' will be removed from the DF, though you can leave them with the option : 'drop = False'.

Missing Data
Python : NaN - np.nan (not a number)
Pandas * : NaN or Python built-in None mean missing/NA values

* Use pd.isnull(), pd.notnull() or series1/df1.isnull() to detect missing data.

FILTERING OUT MISSING DATA
dropna() returns ONLY the non-null data; the source data is NOT modified.
    df1.dropna()               # drop any row containing a missing value
    df1.dropna(axis = 1)       # drop any column containing missing values
    df1.dropna(how = 'all')    # drop rows that are all missing
    df1.dropna(thresh = 3)     # drop any row containing fewer than 3 observations

FILLING IN MISSING DATA
    df2 = df1.fillna(0)               # fill all missing data with 0
    df1.fillna(0, inplace = True)     # modify in-place
Use a different fill value for each column :
    df1.fillna({'col1' : 0, 'col2' : -1})
Only forward fill the 2 missing values in front :
    df1.fillna(method = 'ffill', limit = 2)
    i.e. for column1, if rows 3-6 are missing, rows 3 and 4 get filled with the value from row 2, NOT rows 5 and 6.
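The limited forward fill described above can be checked on a tiny frame. This is a minimal sketch with made-up data; it uses `ffill(limit = 2)`, which in recent pandas releases is assumed to behave like the `fillna(method = 'ffill', limit = 2)` call from the sheet.

```python
import numpy as np
import pandas as pd

# Hypothetical single-column frame: rows 3-6 are missing.
df1 = pd.DataFrame(
    {'col1': [1.0, 2.0, np.nan, np.nan, np.nan, np.nan, 7.0]},
    index=range(1, 8),
)

# Forward fill at most 2 consecutive missing values.
filled = df1.ffill(limit=2)

# Rows 3 and 4 take the value from row 2; rows 5 and 6 stay missing.
print(filled)
```

Without the limit, rows 5 and 6 would also receive the value from row 2.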
Essential Functionality

INDEXING (SLICING/SUBSETTING)
Same as ndarray : indexing returns a "view" of the source array.
The endpoint is inclusive in pandas slicing with labels : series1['a':'c'], whereas Python slicing is NOT. Note that pandas non-label (i.e. integer) slicing is still non-inclusive.

Index by Column(s)
    df1['col1']
    df1[ ['col1', 'col3'] ]
Index by Row(s)
    df1.ix['row1']
    df1.ix[ ['row1', 'row3'] ]
Index by Both Column(s) and Row(s)
    df1.ix[['row2', 'row1'], 'col3']
Boolean Indexing
    df1[ [True, False] ]
    df1[df1['col2'] > 6] *
    # returns a DF that has col2 value > 6

* Note that df1['col2'] > 6 returns a boolean Series, with each True/False value determining whether the respective row is in the result.

Note : Avoid integer indexing since it might introduce subtle bugs (e.g. series1[-1]). If you have to use position-based indexing, use "iget_value()" from Series and "irow/icol()" from DF instead of integer indexing.

DROPPING ROWS/COLUMNS
Drop operation returns a new object (i.e. DF) :
Remove Row(s) (axis = 0 is default)
    df1.drop('row1')
    df1.drop(['row1', 'row3'])
Remove Column(s)
    df1.drop('col2', axis = 1)

REINDEXING
Create a new object with the data rearranged to conform to a new index, introducing missing values if any index values were not already present.
Change df1 Date Index Values to the New Index Values (ReIndex default is row index)
    date_index = pd.date_range('01/23/2010', periods = 10, freq = 'D')
    df1.reindex(date_index)
Replace Missing Values (NaN) with 0
    df1.reindex(date_index, fill_value = 0)
ReIndex Columns
    df1.reindex(columns = ['a', 'b'])
ReIndex Both Rows and Columns
    df1.reindex(index = [..], columns = [..])
Succinct ReIndex
    df1.ix[[..], [..]]

ARITHMETIC AND DATA ALIGNMENT
df1 + df2 : For indices that don't overlap, internal data alignment introduces NaN.
1, Instead of NaN, use 0 for the indices that are not found in one of the DFs :
    df1.add(df2, fill_value = 0)
2, Useful Operations :
    df1 - df1.ix[0]    # subtract the first row from every row in df1

SORTING AND RANKING
Sort Index/Column
sort_index() returns a new, sorted object. Default is "ascending = True".
The row index is sorted by default; "axis = 1" is used for sorting columns.
* Sorting Index/Column means sorting the row/column labels, not sorting the data.

Sort Data
* Missing values (np.nan) are sorted to the end of the Series by default.
Series Sorting
    sortedS1 = series1.order()
    series1.sort()    # In-place sort
DF Sorting
    df1.sort_index(by = ['col2', 'col1'])
    # sort by col2 first then col1

Ranking
Break rank ties by assigning each tie-group the mean rank. (e.g. 3, 3 are tied at the 5th place; thus, the result is 5.5 for each)
Output Rank of Each Element (Rank starts from 1)
    series1.rank()
    df1.rank(axis = 1)    # rank each row's values

FUNCTION APPLICATIONS
NumPy works fine with pandas objects : np.abs(df1)
Applying a Function to Each Column or Row (Default is to apply to each column : axis = 0)
    f = lambda x: x.max() - x.min()    # return a scalar value
    def f(x): return pd.Series([x.max(), x.min()])    # return multiple values
    df1.apply(f)
Applying a Function Element-Wise
    f = lambda x: '%.2f' % x
    df1.applymap(f)    # format each entry to 2 decimals

UNIQUE, COUNTS
It's NOT mandatory for index labels to be unique although many functions require it. Check via : series1/df1.index.is_unique
pd.value_counts() returns value frequency.

Data Aggregation and Group Operations
Categorizing a data set and applying a function to each group, whether an aggregation or transformation.
Note : For aggregation of Time Series data, please see the Time Series section. A special use case of "groupby" is used there, called "resampling".

GROUPBY (SPLIT-APPLY-COMBINE) - Similar to SQL groupby
Compute Group Mean
    df1.groupby('col2').mean()
GroupBy More Than One Key
    df1.groupby([df1['col2'], df1['col3']]).mean()
    # results in a hierarchical index consisting of unique pairs of keys
"GroupBy" Object : (ONLY computes intermediate data about the group key - df1['col2'])
    grouped = df1['col1'].groupby(df1['col2'])
    grouped.mean()    # gets the mean of each group formed by 'col2'
Indexing "GroupBy" Object
    # select 'col1' for aggregation :
    df1.groupby('col2')['col1']
    or
    df1['col1'].groupby(df1['col2'])
Note : Any missing values in the group are excluded from the result.

1. Iterating over GroupBy object
"GroupBy" object supports iteration : generating a sequence of 2-tuples containing the group name along with the chunk of data.
    for name, groupdata in df1.groupby('col2'):
    # name is a single value, groupdata is the filtered DF containing only the data that matches that value.
    for (k1, k2), groupdata in df1.groupby(['col2', 'col3']):
    # If grouping by multiple keys : the first element in the tuple is a tuple of key values.
Convert Groups to Dict
    dict(list(df1.groupby('col2')))
    # col2 unique values will be the keys of the dict
Group Columns by "dtype"
    grouped = df1.groupby(df1.dtypes, axis = 1)
    dict(list(grouped))
    # separates data into different types

2. Grouping with functions
Any function passed as a group key will be called once per index value (default is the row index), with the return values being used as the group names. (This assumes the row index labels are names.)
    df1.groupby(len).sum()
    # returns a DF whose row index is the length of the names. Thus, names of the same length will have their values summed. Column names are retained.

DATA AGGREGATION
Data aggregation means any data transformation that produces scalar values from arrays, such as "mean", "max", etc.
Use Self-Defined Function
    def func1(array): ...
    grouped.agg(func1)
Get DF with Column Names as Function Names
    grouped.agg(['mean', 'std'])
Get DF with Self-Defined Column Names
    grouped.agg([('col1', 'mean'), ('col2', 'std')])
Use Different Function Depending on the Column
    grouped.agg({'col1' : ['min', 'max'], 'col3' : 'sum'})

GROUP-WISE OPERATIONS AND TRANSFORMATIONS
Agg() is a special case of data transformation, aka reducing a one-dimensional array to a scalar.
Transform() is a specialized data transformation : It applies a function to each group; if the function produces a scalar value, the value will be placed in every row of the group. Thus, if the DF has 10 rows, after "transform()" there will still be 10 rows, each one holding the scalar value the function returned for its group.
The passed function must either produce a scalar value or a transformed array of the same size.

General purpose transformation : apply()
    df1.groupby('col2').apply(your_func1)
    # your func ONLY needs to return a pandas object or a scalar.

    # Example 1 : Yearly Correlations with SPX
    # "close_price" is a DF with stocks and SPX closing price columns and a dates index
    returns = close_price.pct_change().dropna()
    by_year = returns.groupby(lambda x : x.year)
    spx_corr = lambda x : x.corrwith(x['SPX'])
    by_year.apply(spx_corr)

    # Example 2 : Exploratory Regression
    import statsmodels.api as sm
    def regress(data, y, x):
        Y = data[y]; X = data[x]
        X['intercept'] = 1
        result = sm.OLS(Y, X).fit()
        return result.params
    by_year.apply(regress, 'AAPL', ['SPX'])

Created by Arianne Colton and Sean Chen
www.datasciencefree.com
Based on content from Python for Data Analysis by Wes McKinney
Updated: August 22, 2016
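The difference between agg() and transform() described in the group-wise operations section can be seen side by side. This is a minimal sketch on made-up data; the column names and values are hypothetical.

```python
import pandas as pd

# Hypothetical frame: 'col2' is the group key, 'col1' holds the values.
df1 = pd.DataFrame({
    'col1': [1.0, 3.0, 10.0, 20.0, 30.0],
    'col2': ['a', 'a', 'b', 'b', 'b'],
})

grouped = df1.groupby('col2')['col1']

# agg() reduces each group to a scalar: one row per group.
agg_result = grouped.agg('mean')

# transform() broadcasts the group scalar back to every row of the group,
# so the result has the same length and index as df1.
transform_result = grouped.transform('mean')

print(agg_result)
print(transform_result)
```

Because the transform() result lines up with the original rows, it can be assigned straight back as a new column, e.g. to subtract each group's mean from its members.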
Data Wrangling : Merge, Reshape, Clean, Transform

COMBINING AND MERGING DATA
1. pd.merge() aka database "join" : connects rows in DFs based on one or more keys.
Merge via Column (Common)
    df3 = pd.merge(df1, df2, on = 'col2') *
    # INNER join is default. Or use option : how = 'outer/left/right'
    # the indexes of df1 and df2 are discarded in df3
* If "on" is omitted, ALL overlapping column names are used as the keys to merge. Good practice is to specify the keys : on = ['col2', 'col3'].
* If the key name differs between df1 and df2, use option : left_on = 'lkey', right_on = 'rkey'
Merge via Row (Uncommon)
    df3 = pd.merge(df1, df2, left_index = True, right_index = True)
    # Use indexes as merge key : aka rows with the same index value are joined together.

2. pd.concat() : glues or stacks objects along an axis (default is along rows : axis = 0).
    df3 = pd.concat([df1, df2], ignore_index = True)
    # ignore_index = True : Discard indexes in df3
    # If df1 has 2 rows, df2 has 3 rows, then df3 has 5 rows

3. combine_first() : combine data with overlap, patching missing values.
    df3 = df1.combine_first(df2)
    # df1 and df2 indexes overlap in full or part. If a row does NOT exist in df1 but exists in df2, it will be in df3. If row1 of df1 and row3 of df2 have the same index value, but row1's col3 value is NA, df3 gets this row with the col3 data from df2.

RESHAPING AND PIVOTING
1. Reshaping with Hierarchical Indexing
    series1 = df1.stack()
    # Rotates (innermost level *) columns to rows as the innermost index level, resulting in a Series with a hierarchical index.
    df1 = series1.unstack()
    # Rotates (innermost level *) rows to columns as the innermost column level.
* Default is to stack/unstack the innermost level. If you want a different level, pass it, i.e. stack(level = 0) for the outermost level.
* Note : Unstacking might introduce missing data if not all of the values in the level are found in each of the subgroups. Stacking filters out missing data by default, i.e. data.unstack().stack()

2. Pivoting
Common format of storing multiple time series in databases and CSV is :
    Long/Stacked Format : date, stock_name, price
However, a DF with these 3 columns will be difficult to work with. Thus, the "wide" format is preferred : 'date' as row index, 'stock_name' as columns, 'price' as DF data values.
    pivotedDf2 = df1.pivot('date', 'stock_name', 'price')
    # Example pivotedDf2 :
    #              AAPL    IBM     JD
    # 2003-06-01   120.2   100.1   20.8

COMMON OPERATIONS
1. Removing Duplicate Rows
    series1 = df1.duplicated()       # Boolean series1 indicating whether each row is a duplicate or not.
    df2 = df1.drop_duplicates()      # Duplicates have been dropped in df2.

2. Add New Column Based On Value of Column(s)
    df1['newCol'] = df1['col2'].map(dict1)
    # Maps col2 value as dict1's key, gets dict1's value
    df1['newCol'] = df1['col2'].map(func1)
    # Apply a function to each col2 value

3. Replacing Values
    # Replace is NOT In-Place
    df2 = df1.replace(np.nan, 100)
    # Replace Multiple Values At Once
    df2 = df1.replace([-1, np.nan], 100)
    df2 = df1.replace([-1, np.nan], [1, 2])
    # Argument Can Be a Dict As Well
    df2 = df1.replace({-1: 1, np.nan : 2})

4. Renaming Axis Indexes
Convert Index to Upper Case
    df1.index = df1.index.map(str.upper)
Rename 'row1' to 'newRow1'
    df2 = df1.rename(index = {'row1' : 'newRow1'}, columns = str.upper)
    # Optionally inplace = True

5. Discretization and Binning
Continuous data is often discretized into "bins" for analysis.
    # Divide Data Into 2 Bins of Number (18 - 26], (26 - 35]
    # ']' means inclusive, '(' is NOT inclusive
    bins = [18, 26, 35]
    cat = pd.cut(array1, bins, labels = [..])
    # cat is a "Categorical" object.
    pd.value_counts(cat)
    cat = pd.cut(array1, numofBins)     # Compute equal-length bins based on min and max values in array1
    cat = pd.qcut(array1, numofBins)    # Bins the data based on sample quantiles - roughly equal-size bins

6. Detecting and Filtering Outliers
any() tests along an axis whether any element is True. Default is to test along the column (axis = 0).
    df1[(np.abs(df1) > 3).any(axis = 1)]
    # Select all rows having a value > 3 or < -3.
    # Another useful function : np.sign() returns 1 or -1 (0 for zero).

7. Permutation and Random Sampling
    randomOrder = np.random.permutation(df1.shape[0])
    df2 = df1.take(randomOrder)

8. Computing Indicator/Dummy Variables
If a column in a DF has "K" distinct values, derive an "indicator" DF containing K columns of 0s and 1s. 1 means the value exists, 0 means it does NOT exist.
    dummyDf = pd.get_dummies(df1['col2'], prefix = 'col-')    # Add prefix to the K column names
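The bin-edge rule from the Discretization and Binning item (right edge inclusive, left edge not) can be verified directly. This is a minimal sketch; the `ages` list is made-up illustration data.

```python
import pandas as pd

# Hypothetical ages; 18 sits exactly on the open left edge of the first bin.
ages = [18, 19, 26, 27, 35]
bins = [18, 26, 35]

# Default bins are (18, 26] and (26, 35]: right edge inclusive, left edge exclusive.
cat = pd.cut(ages, bins)

# codes holds the bin number for each value; -1 marks a value outside every bin.
print(cat.categories)
print(list(cat.codes))
```

Here 18 falls outside both bins (code -1), while 26 lands in the first bin and 35 in the second.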

Getting Data

TEXT FORMAT (CSV)
    df1 = pd.read_csv(file/URL/file-like-object, sep = ',', header = None)
    # Type-Inference : do NOT have to specify which columns are numeric, integer, boolean or string.
    # In Pandas, missing data in the source data is usually an empty string, NA, -1.#IND or NULL. You can specify missing values via option i.e. : na_values = ['NULL'].
    # Default delimiter is comma.
    # Default is that the first row is the column header.
    df1 = pd.read_csv(.., names = [..])
    # Explicitly specify the column header; also implies the first row is data
    df1 = pd.read_csv(.., names = [.., 'date'], index_col = 'date')
    # Want the 'date' column to be the row index of the returned DF
    df1.to_csv(filepath/sys.stdout, sep = ',')
    # Missing values appear as empty strings in the output. Thus, it is better to add an option i.e. : na_rep = 'NULL'
    # Default is that row and column labels are written. Disabled by options : index = False, header = False

JSON (JAVASCRIPT OBJECT NOTATION) DATA
One of the standard formats for sending data by HTTP request between web browsers and other applications. It is a much more flexible data format than tabular text forms like CSV.
Convert JSON string to Python form
    data = json.loads(jsonObj)
Convert Python object to JSON
    asJson = json.dumps(data)
Create DF from JSON
    pd.DataFrame(data['name'], columns = ['field1'])

XML AND HTML DATA
HTML :
    doc = lxml.html.parse(urlopen('http://..')).getroot()
    tables = doc.findall('.//table')
    rows = tables[1].findall('.//tr')
XML :
    lxml.objectify.parse(open(filepath)).getroot()

Descriptive Statistics Methods
Compared with the equivalent methods of ndarray, descriptive statistics methods in Pandas are built from the ground up to exclude missing data.
NA (i.e. NaN) values are excluded. This can be disabled using the "skipna = False" option.
Column Sums (Use axis = 1 to sum over rows)
    series1 = df1.sum()
Returns Index Labels Where Min/Max Values are Attained
    df1.idxmin() or df1.idxmax()
Multiple Summary Statistics (i.e. count, mean, std); On Non-Numeric Data, Alternate Statistics (i.e. count, unique)
    df1.describe()

CORRELATION AND COVARIANCE
cov(), corr()
corrwith() - pairwise correlations : aka compute the correlations of a DF with a Series. If the input is not a Series but another DF, it will compute the correlations of matching column names. i.e. returns.corrwith(volumes)

    # Example : Correlation
    import pandas_datareader.data as web
    data = {}
    for ticker in ['AAPL', 'JD']:
        data[ticker] = web.get_data_yahoo(ticker, '1/1/2000', '1/1/2010')
    prices = pd.DataFrame({ticker : d['Adj Close'] for ticker, d in data.items()})
    volumes = ...
    returns = prices.pct_change()
    returns.AAPL.corr(returns.JD)
    # Series corr() computes the correlation of overlapping, non-NA, aligned-by-index values in two Series.
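The missing-data handling of the descriptive statistics methods can be checked on small made-up Series. This is a minimal sketch: it shows sum() skipping NaN unless "skipna = False" is passed, and corr() using only the labels that both Series share and that are non-NA in both.

```python
import numpy as np
import pandas as pd

# sum() skips NaN by default; skipna=False propagates it instead.
s = pd.Series([1.0, np.nan, 3.0])
total_skip = s.sum()
total_noskip = s.sum(skipna=False)

# Two hypothetical Series with different indexes and one missing value.
s1 = pd.Series([1.0, 2.0, 3.0, 4.0], index=list('abcd'))
s2 = pd.Series([2.0, 4.0, 6.0, np.nan, 100.0], index=list('abcde'))

# Only labels a, b, c are shared and non-NA in both, so the outlier at 'e'
# and the NaN at 'd' do not affect the result.
r = s1.corr(s2)

print(total_skip, total_noskip, r)
```

On the three usable points s2 is exactly twice s1, so the correlation comes out at 1 (up to floating-point rounding).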
Time Series
Python standard library data types for date and time : datetime, time, calendar.
Pandas data type for date and time : Timestamp. *

Convert String to DateTime
    from datetime import datetime
    datetime.strptime('8/8/2008', '%m/%d/%Y')
Get Time Now
    now = datetime.now()
DateTime Arithmetic
    from datetime import timedelta
    datetime(2011, 1, 8) + timedelta(12) => 2011-01-20
    # Timedelta represents the temporal difference between two datetime objects.
Convert String to Pandas Timestamp Type
    timestamps = pd.to_datetime(['8/8/2008', ..])
    # NaT (Not a Time) is the Pandas NA value for Timestamp data
    pd.to_datetime('') => NaT
    # Missing value (i.e. empty string)
    pd.isnull(NaT) => True

"datetime" is widely used; it stores both the date and time down to the microsecond.
* A "Timestamp" object can be substituted anywhere you would use a "datetime" object.

PANDA TIME SERIES
Create Time Series
    ts1 = pd.Series(np.random.randn(8), index = [ datetime(2011, 1, 2), .. ])
    ts1 = pd.Series(..., index = pd.date_range('1/1/2000', periods = 1000))
    # ts1.index is the "DatetimeIndex" Panda class
* Index value "ts1.index[0]" is a Panda Timestamp object which stores the timestamp using NumPy's "datetime64" type at nanosecond resolution. Further, the Timestamp class stores the frequency information as well as the timezone.
    ts1.index.dtype => datetime64[ns]

Indexing (Slicing/Subsetting)
Argument can be a string date, datetime or Timestamp.
Select Year of 2001
    ts1['2001']
    df1.ix['2001']
Select June 2001
    ts1['2001-06']
Select From 2001-01-01 to 2001-08-01
    ts1['1/1/2001':'8/1/2001']
Select From 2001-01-08 On
    ts1[datetime(2001, 1, 8):]

Common Operations
Get Time Series Data Before 2011-01-09
    ts1.truncate(after = '1/8/2011')

DATE RANGES, FREQUENCIES AND SHIFTING
Generic time series in Pandas are assumed to be irregular, aka have no fixed frequency. However, we prefer to work with fixed frequency, i.e. daily, monthly, etc.
(Take a look at the Resampling section)
    # Convert to Fixed Daily Frequency.
    # Introduce Missing Value (NaN) If Needed
    ts1.resample('D', how = ..)

1. Frequencies and Date Offsets
Frequencies in Pandas are composed of a base frequency and a multiplier. Base frequencies are typically referred to by a string alias, like 'M' for monthly or 'H' for hourly.
    freq = '4H'
    freq = '1h30min'
    # Standard US equity option monthly expiration, every third Friday of the month : freq = 'WOM-3FRI'

2. Generating Date Ranges
Default Frequency is Daily
    pd.date_range(begin, end)
    Or
    pd.date_range(begin or end, periods = n)
    # Option freq = 'BM' means last business day at end of the month

3. Shifting (Leading and Lagging) Data
Shifting refers to moving data backward and forward through time.
Series and DF "shift()" does a naive shift, aka the index does not shift, only the values. *
    # ts1 is Daily Data
    ts1.shift(1)    # move yesterday's value to today, today's value to tomorrow, etc.
    # ts1 is Any Time Series Data. Shift Data By 3 Days
    ts1.shift(3, freq = 'D')
    Or
    ts1.shift(1, freq = '3D')
    # Common Use of Shift : To Compute % Change
    ts1 / ts1.shift(1) - 1
* In the return result from shift(), some data values might be NaN.

Other ways to shift data :
    from pandas.tseries.offsets import Day, MonthEnd
    datetime(2008, 8, 8) + 3*Day() => 2008-08-11
    datetime(2008, 8, 8) + MonthEnd(2) => 2008-09-30
    MonthEnd().rollforward(datetime(2008, 8, 8)) => 2008-08-31

TIME ZONE HANDLING
Daylight saving time (DST) transitions are a common source of complication.
UTC is the current international standard. Time zones are expressed as offsets from UTC. *
* NY is 4 hours behind UTC during daylight saving time and 5 hours behind the rest of the year.

1. Python Time Zone (From 3rd-party pytz library)
Get List of Timezone Names
    pytz.common_timezones
Get a Timezone Object
    pytz.timezone('US/Eastern')

2. Localization and Conversion
Time Series By Default is Time Zone Naive
    ts1.index.tz => None
Specify Time Zone When Creating a Time Series
    Use option : tz = 'UTC' in pd.date_range()
Localization From Naive
    ts1_utc = ts1.tz_localize('UTC')
Convert to Another Time Zone Once the Time Series Has Been Localized
    ts1_eastern = ts1_utc.tz_convert('US/Eastern')

3. ** Time Zone-aware Timestamp Objects
    stamp_utc = pd.Timestamp('2008-08-08 03:00', tz = 'UTC')
    stamp_eastern = stamp_utc.tz_convert(...)
Pandas Time Arithmetic - Daylight Savings Time Transitions Are Respected :
    from pandas.tseries.offsets import Hour
    stamp = pd.Timestamp('2012-11-04 00:30', tz = 'US/Eastern') => 2012-11-04 00:30:00 -0400 EDT
    stamp + 2 * Hour() => 2012-11-04 01:30:00 -0500 EST
** If two time series with different time zones are combined, i.e. ts1 + ts2, the timestamps will auto-align with respect to time zone. The result will be in UTC.

RESAMPLING
Process of converting a time series from one frequency to another frequency :

1. Downsampling - Aggregating higher frequency data to lower frequency.
    ts1.resample('M', how = 'mean') *
    => Index : 2000-01-31, 2000-02-29, ...
    ts1.resample('M', ..., kind = 'period')
    # 'period' - Use time-span representation
    => Index : 2000-01, 2000-02, ...

    # ts1 is one-minute data with values 1 to 100 at times : 00:00:00, 00:01:00, ...
    ts1.resample('5min', how = 'sum') =>
    00:00:00    15 (aka : 1 + 2 + 3 + 4 + 5)
    00:05:00    40
    # Default is that the left bin edge is inclusive, thus the 00:00:00 value is included in the 00:00:00 to 00:05:00 interval.
    # Option : closed = 'right' changes the interval to right inclusive. Also include option label = 'right' as well :
    00:00:00    1
    00:05:00    20 (aka : 2 + 3 + 4 + 5 + 6)

    ts1.resample('5min', how = 'ohlc')
    # returns a DF with 4 columns - open, high, low, close
* Alternate way to downsample : ts1.groupby(lambda x : x.month).mean()

2. Upsampling and Interpolation * - Interpolate low frequency to higher frequency. By default missing values (NaN) are introduced.
    df1.resample('D', fill_method = 'ffill')
    # Forward fills values : i.e. a missing value at index 3 will copy the value from index 2.
* Interpolation will ONLY apply row-wise.

TIME SERIES PLOTTING
    # Example : Source Data Format - First Column is Date.
    # Use the first column as the Index, then parse the index values as Date.
    prices = pd.read_csv(.., parse_dates = True, index_col = 0)
    px = prices[['AAPL', 'IBM']]
    px = px.resample('B', fill_method = 'ffill')
    px['AAPL'].plot()
    px['AAPL'].ix['01-2008':'03-2012'].plot()
    px.ix['2008'].plot()

MOVING WINDOW FUNCTIONS
Like other statistical functions, these functions also automatically exclude missing data.
    pd.rolling_mean(px.AAPL, 200).plot()
    pd.rolling_std(px.AAPL.pct_change(), 22, min_periods = 20).plot()
    pd.rolling_corr(px.AAPL.pct_change(), px.IBM.pct_change(), 22).plot()

PERFORMANCE
Since Timestamps are represented as 64-bit integers using NumPy's datetime64 type, each data point has an associated 8 bytes of memory per timestamp.
Creating views on an existing time series or DF does not cause any more memory to be used.
Indexes for lower frequencies (daily and up) are stored in a central cache, so any fixed-frequency index is a view on the date cache. Thus, the memory footprint of low-frequency indexes is not significant.
Performance-wise, Pandas has been highly optimized for data alignment operations (i.e. ts1 + ts2) and resampling.
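The bin-edge behaviour of the 5-minute downsampling example above can be reproduced on a short series. This is a minimal sketch with made-up data; it assumes a recent pandas release, where the aggregation is written as `.resample(...).sum()` in place of the `how = 'sum'` argument used in this sheet.

```python
import pandas as pd

# Hypothetical one-minute data: values 1..12 starting at midnight.
ts1 = pd.Series(
    range(1, 13),
    index=pd.date_range('2000-01-01', periods=12, freq='min'),
)

# Default: left bin edge inclusive, label taken from the left edge.
left = ts1.resample('5min').sum()

# Right bin edge inclusive, label taken from the right edge.
right = ts1.resample('5min', closed='right', label='right').sum()

print(left)
print(right)
```

With the default settings the 00:00:00 bin holds 1 + 2 + 3 + 4 + 5; with `closed = 'right'` and `label = 'right'` the 00:00:00 bin holds only the first value and the 00:05:00 bin holds 2 + 3 + 4 + 5 + 6.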
