KEMBAR78
Data Loading and Wrangling Guide | PDF | Comma Separated Values | File Format
0% found this document useful (0 votes)
97 views22 pages

Data Loading and Wrangling Guide

The document discusses techniques for loading, storing, and manipulating data in Python using the pandas library. It covers: 1) Reading and writing tabular data from text files like CSV and HTML tables, as well as handling missing data values. 2) Loading Excel files using pandas and the xlrd/openpyxl libraries. 3) Techniques for working with large datasets in chunks, as well as examples of accessing specific rows or columns of data.

Uploaded by

FucKerWengie
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
97 views22 pages

Data Loading and Wrangling Guide

The document discusses techniques for loading, storing, and manipulating data in Python using the pandas library. It covers: 1) Reading and writing tabular data from text files like CSV and HTML tables, as well as handling missing data values. 2) Loading Excel files using pandas and the xlrd/openpyxl libraries. 3) Techniques for working with large datasets in chunks, as well as examples of accessing specific rows or columns of data.

Uploaded by

FucKerWengie
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 22

UECM 1534 Programming Techniques for Data Processing Jan 18/19

Chapter 05: Data Loading,


Storage and Wrangling
Accessing data is a necessary first step for using most of the tools. In this chapter, student will
learn to reading and writing in text format, reading and rewriting Microsoft excel file,
handling missing data, and combining and merging datasets.

1. Reading and Writing Data in Text Format


pandas features a number of functions for reading tabular data as a DataFrame object. Table
below summarizes some of them, though .read_csv and .read_table are likely the ones
you’ll use the most.

We can use .read_csv to read it into a DataFrame:

In [1]: df = pd.read_csv('your path/5_1.csv')

1|Page
UECM 1534 Programming Techniques for Data Processing Jan 18/19

In [2]: df
Out[2]:
a b c d message
0 1 2 3 4 hello
1 5 6 7 8 world
2 9 10 11 12 foo

When reading the csv file, make sure the csv file is located in your current working directory.
To get current working directory and set a new working directory, we can use os package.

In [3]: import os
In [4]: currentpath = os.getcwd()

In [5]: currentpath
Out[5]: 'C:\\Users\\YourUserName'

In [6]: os.chdir('C:\\Users\\UECM1534')

In [7]: currentpath = os.getcwd()

Once you successfully change the working directory, you may read the csv file in the working
directory you have set by typing:

In [8]: mypath = currentpath + '/5_1.csv'


In [9]: df = pd.read_csv(mypath)

We could also have used .read_table and specified the delimiter:

In [10]: pd.read_table(mypath, sep=',')


Out[10]:
a b c d message
0 1 2 3 4 hello
1 5 6 7 8 world
2 9 10 11 12 foo

A file will not always have a header row. To read this file, you have a couple of options. You can
allow pandas to assign default column names, or you can specify names yourself:

In [11]: mypath1 = currentpath + '/5_2.csv'

In [12]: pd.read_csv(mypath1, header=None)


Out[12]:
0 1 2 3 4
0 1 2 3 4 hello
1 5 6 7 8 world
2 9 10 11 12 foo

2|Page
UECM 1534 Programming Techniques for Data Processing Jan 18/19

In [13]: pd.read_csv(mypath1, names=['a', 'b', 'c', 'd', 'message'])


Out[13]:
a b c d message
0 1 2 3 4 hello
1 5 6 7 8 world
2 9 10 11 12 foo

Suppose you wanted the message column to be the index of the returned DataFrame. You can
either indicate you want the column at index 4 or named 'message' using the index_col
argument:

In [14]: names = ['a', 'b', 'c', 'd', 'message']

In [15]: pd.read_csv(mypath1, names=names, index_col='message')


Out[15]:
a b c d
message
hello 1 2 3 4
world 5 6 7 8
foo 9 10 11 12

In the event that you want to form a hierarchical index from multiple columns, pass a list of
column numbers or names:

In [16]: mypath2 = currentpath + '/5_3.csv'

In [17]: pd.read_csv(mypath2, index_col=[0,1])


Out[17]:
value1 value2
key1 key2
one a 1 2
b 3 4
c 5 6
d 7 8
two a 9 10
b 11 12
c 13 14
d 15 16

In these cases, you can pass a regular expression as a delimiter for .read_table. This can be
expressed by the regular expression \s+, so we have then:

In [18]: mypath3 = currentpath + '/5_4.txt'

In [19]: result = pd.read_table(mypath3, sep='\s+')

3|Page
UECM 1534 Programming Techniques for Data Processing Jan 18/19

In [20]: result
Out[20]:
A B C
aaa -0.264438 -1.026059 -0.619500
bbb 0.927272 0.302904 -0.032399
ccc -0.264273 -0.386314 -0.217601
ddd -0.871858 -0.348382 1.100491

Because there was one fewer column name than the number of data rows, .read_table infers
that the first column should be the DataFrame’s index in this special case.

You can also skip certain rows of a file with skiprows:

In [21]: mypath4 = currentpath + '/5_5.csv'

In [22]: pd.read_csv(mypath4, skiprows=[0, 2, 3])


Out[22]:
a b c d message
0 1 2 3 4 hello
1 5 6 7 8 world
2 9 10 11 12 foo

Handling missing values is an important. Missing data is usually either not present (empty
string) or marked by some sentinel value. By default, pandas uses a set of commonly occurring
sentinels, such as NA and NULL:

In [23]: mypath5 = currentpath + '/5_6.csv'

In [24]: result = pd.read_csv(mypath5)

In [25]: result
Out[25]:
something a b c d message
0 one 1 2 3.0 4 NaN
1 two 5 6 NaN 8 world
2 three 9 10 11.0 12 foo

In [26]: pd.isnull(result)
Out[26]:
something a b c d message
0 False False False False False True
1 False False False True False False
2 False False False False False False

Different NA sentinels can be specified for each column in a dict:

In [27]: sentinels = {'message': ['foo', 'NA'], 'something': ['two']}

4|Page
UECM 1534 Programming Techniques for Data Processing Jan 18/19

In [28]: pd.read_csv(mypath5, na_values=sentinels)


Out[28]:
something a b c d message
0 one 1 2 3.0 4 NaN
1 NaN 5 6 NaN 8 world
2 three 9 10 11.0 12 NaN

1.1 Reading Text Files in Pieces


When processing very large files or figuring out the right set of arguments to correctly
process a large file, you may only want to read in a small piece of a file or iterate through
smaller chunks of the file.

Before we look at a large file, we make the pandas display settings more compact:

In [29]: pd.options.display.max_rows = 10

Now we have:

In [30]: mypath6 = currentpath + '/5_7.csv'

In [31]: result = pd.read_csv(mypath6)

In [32]: result
Out[32]:
one two three four key
0 0.467976 -0.038649 -0.295344 -1.824726 L
1 -0.358893 1.404453 0.704965 -0.200638 B
2 -0.501840 0.659254 -0.421691 -0.057688 G
3 0.204886 1.074134 1.388361 -0.982404 R
4 0.354628 -0.133116 0.283763 -0.837063 Q
... ... ... ... ... ..
9995 2.311896 -0.417070 -1.409599 -0.515821 L
9996 -0.479893 -0.650419 0.745152 -0.646038 E
9997 0.523331 0.787112 0.486066 1.093156 K
9998 -0.362559 0.598894 -1.843201 0.887292 G
9999 -0.096376 -1.012999 -0.657431 -0.573315 0
[10000 rows x 5 columns]

If you want to only read a small number of rows (avoiding reading the entire file), specify
that with nrows:

5|Page
UECM 1534 Programming Techniques for Data Processing Jan 18/19

In [33]: pd.read_csv(mypath6, nrows=5)


Out[33]:
one two three four key
0 0.467976 -0.038649 -0.295344 -1.824726 L
1 -0.358893 1.404453 0.704965 -0.200638 B
2 -0.501840 0.659254 -0.421691 -0.057688 G
3 0.204886 1.074134 1.388361 -0.982404 R
4 0.354628 -0.133116 0.283763 -0.837063 Q

1.2 Writing Data to Text Format

Data can also be exported to a delimited format. Let’s consider one of the CSV files read
before:

In [34]: result = pd.read_csv(mypath5)

In [35]: result
Out[35]:
something a b c d message
0 one 1 2 3.0 4 NaN
1 two 5 6 NaN 8 world
2 three 9 10 11.0 12 foo

Using DataFrame’s .to_csv method, we can write the data out to a comma-separated file:

In [36]: data.to_csv(currentpath +'/out.csv')

1.3 Reading html Table


By using the .to_csv method, you can now save any table you created or read from other
file to excel. To read html file, the pandas.read_html function has a number of options,
but by default it searches for and attempts to parse all tabular data contained within
<table> tags. The result is a list of DataFrame objects:

In [37]: url =
‘https://www.fdic.gov/bank/individual/failed/banklist.html'

In [38]: tables=pd.read_html('https://www.fdic.gov/bank/individual
....: /failed/bankl ist.html')

In [39]: len(tables)
Out[39]: 1

6|Page
UECM 1534 Programming Techniques for Data Processing Jan 18/19

In [40]: failures = tables[0]

In [41]: failures.sort_values(by = 'Bank Name', ascending = False)


Out[41]:

From here we could proceed to do some data cleaning and analysis, like computing the
number of bank failures by year:

In [42]: close_time = pd.to_datetime(failures['Closing Date'])

In [43]: close_time.dt.year.value_counts()
Out[43]:
2010 157
2009 140
2011 92
2012 51
2008 25
...
2004 4
2001 4
2007 3
2003 3
2000 2
Name: Closing Date, Length: 15, dtype: int64

2. Binary Data Formats – Microsoft Excel Files


pandas also supports reading tabular data stored in Excel 2003 (and higher) files using either
the ExcelFile class or pandas.read_excel function. Internally these tools use the add-on
packages xlrd and openpyxl to read XLS and XLSX files, respectively. You may need to install these
manually with pip or conda.

To use pd.ExcelFile, create an instance by passing a path to an xls or xlsx file:

In [44]: xlsx = pd.ExcelFile(mypath + '/5_8.xlsx')

7|Page
UECM 1534 Programming Techniques for Data Processing Jan 18/19

Data stored in a sheet can then be read into DataFrame with parse:

In [45]: pd.read_excel(xlsx, 'Sheet1')


Out[45]:
a b c d message
0 1 2 3 4 hello
1 5 6 7 8 world
2 9 10 11 12 foo

If you are reading multiple sheets in a file, then it is faster to create the ExcelFile, but you can
also simply pass the filename to pandas.read_excel:

In [46]: frame = pd.read_excel(mypath + '/5_8.xlsx', 'Sheet1')

In [47]: frame
Out[47]:
a b c d message
0 1 2 3 4 hello
1 5 6 7 8 world
2 9 10 11 12 foo

To write pandas data to Excel format, you must first create an ExcelWriter, then write data to
it using pandas objects’ to_excel method:

In [48]: writer = pd.ExcelWriter(mypath + '/test.xlsx')

In [49]: frame.to_excel(writer, 'Sheet2')

In [50]: writer.save()

You can also pass a file path to to_excel and avoid the ExcelWriter:

In [51]: frame.to_excel(mypath + '/test2.xlsx')

3. Handling Missing Data


Missing data occurs commonly in many data analysis applications. One of the goals of pandas
is to make working with missing data as painless as possible. For example, all of the descriptive
statistics on pandas objects exclude missing data by default.

The way that missing data is represented in pandas objects is somewhat imperfect, but it is
functional for a lot of users. For numeric data, pandas uses the floating-point value NaN (Not a
Number) to represent missing data. We call this a sentinel value that can be easily detected:

8|Page
UECM 1534 Programming Techniques for Data Processing Jan 18/19

In [52]: string_data = pd.Series(['aardvark', 'artichoke', np.nan,


....: 'avocado'])

In [53]: string_data
Out[53]:
0 aardvark
1 artichoke
2 NaN
3 avocado
dtype: object

In [54]: string_data.isnull()
Out[54]:
0 False
1 False
2 True
3 False
dtype: bool

The built-in Python None value is also treated as NA in object arrays:

In [55]: string_data[0] = None

In [56]: string_data.isnull()
Out[56]:
0 True
1 False
2 True
3 False
dtype: bool

There is work ongoing in the pandas project to improve the internal details of how missing data
is handled, but the user API functions, like pandas.isnull, abstract away many of the annoying
details. See table below for a list of some functions related to missing data handling.

9|Page
UECM 1534 Programming Techniques for Data Processing Jan 18/19

3.1 Filtering Out Missing Data


There are a few ways to filter out missing data. While you always have the option to do it
by hand using pandas.isnull and boolean indexing, the .dropna can be helpful. On a
Series, it returns the Series with only the non-null data and index values:

In [57]: from numpy import nan as NA

In [58]: data = pd.Series([1, NA, 3.5, NA, 7])

In [59]: data.dropna()
Out[59]:
0 1.0
2 3.5
4 7.0
dtype: float64

This is equivalent to:

In [60]: data[data.notnull()]
Out[60]:
0 1.0
2 3.5
4 7.0
dtype: float64

With DataFrame objects, things are a bit more complex. You may want to drop rows or
columns that are all NA or only those containing any NAs. Method .dropna by default
drops any row containing a missing value:

In [61]: data = pd.DataFrame([[1., 6.5, 3.], [1., NA, NA],


....: [NA, NA, NA], [NA, 6.5, 3.]])

In [62]: cleaned = data.dropna()

In [63]: data
Out[63]:
0 1 2
0 1.0 6.5 3.0
1 1.0 NaN NaN
2 NaN NaN NaN
3 NaN 6.5 3.0

In [64]: cleaned
Out[64]:
0 1 2
0 1.0 6.5 3.0

10 | P a g e
UECM 1534 Programming Techniques for Data Processing Jan 18/19

Passing how='all'will only drop rows that are all NA:

In [65]: data.dropna(how='all')
Out[65]:
0 1 2
0 1.0 6.5 3.0
1 1.0 NaN NaN
3 NaN 6.5 3.0

To drop columns in the same way, pass axis=1:

In [66]: data[4] = NA

In [67]: data
Out[67]:
0 1 2 4
0 1.0 6.5 3.0 NaN
1 1.0 NaN NaN NaN
2 NaN NaN NaN NaN
3 NaN 6.5 3.0 NaN

In [68]: data.dropna(axis=1, how='all')


Out[68]:
0 1 2
0 1.0 6.5 3.0
1 1.0 NaN NaN
2 NaN NaN NaN
3 NaN 6.5 3.0

Suppose you want to keep only rows containing a certain number of observations. You
can indicate this with the thresh argument:

In [69]: df = pd.DataFrame(np.random.randn(6, 3))

In [70]: df.iloc[:4, 1] = NA

In [71]: df.iloc[:2, 2] = NA

In [72]: df
Out[72]:
0 1 2
0 -0.204708 NaN NaN
1 -0.555730 NaN NaN
2 0.092908 NaN 0.769023
3 1.246435 NaN -1.296221
4 0.274992 0.228913 1.352917
5 0.886429 -2.001637 -0.371843

11 | P a g e
UECM 1534 Programming Techniques for Data Processing Jan 18/19

In [73]: df.dropna()
Out[73]:
0 1 2
4 0.274992 0.228913 1.352917
5 0.886429 -2.001637 -0.371843
In [74]: df.dropna(thresh=2)
Out[74]:
0 1 2
2 0.092908 NaN 0.769023
3 1.246435 NaN -1.296221
4 0.274992 0.228913 1.352917
5 0.886429 -2.001637 -0.371843

3.2 Filling In Missing Data


Rather than filtering out missing data (and potentially discarding other data along with it),
you may want to fill in the “holes” in any number of ways. For most purposes, the .fillna
method is the workhorse function to use. Calling method .fillna with a constant replaces
missing values with that value:

In [75]: df.fillna(0)
Out[75]:
0 1 2
0 -0.204708 0.000000 0.000000
1 -0.555730 0.000000 0.000000
2 0.092908 0.000000 0.769023
3 1.246435 0.000000 -1.296221
4 0.274992 0.228913 1.352917
5 0.886429 -2.001637 -0.371843

Calling fillna with a dict, you can use a different fill value for each column:
In [76]: df.fillna({1: 0.5, 2: 0})
Out[76]:
0 1 2
0 -0.204708 0.500000 0.000000
1 -0.555730 0.500000 0.000000
2 0.092908 0.500000 0.769023
3 1.246435 0.500000 -1.296221
4 0.274992 0.228913 1.352917
5 0.886429 -2.001637 -0.371843

With fillna you can do lots of other things with a little creativity. For example, you
might pass the mean or median value of a Series:

In [77]: data = pd.Series([1., NA, 3.5, NA, 7])

12 | P a g e
UECM 1534 Programming Techniques for Data Processing Jan 18/19

In [78]: data.fillna(data.mean())
Out[78]:
0 1.000000
1 3.833333
2 3.500000
3 3.833333
4 7.000000
dtype: float64

4. Combining and Merging Dataset


Data contained in pandas objects can be combined together in a number of ways:

 pandas.merge connects rows in DataFrames based on one or more keys. This will be familiar
to users of SQL or other relational databases, as it implements database join operations.
 pandas.concat concatenates or “stacks” together objects along an axis.
 The combine_first instance method enables splicing together overlapping data to fill in
missing values in one object with values from another.

They’ll be utilized in examples throughout the rest of the chapter.

4.1 Database-Style DataFrame Joins


Merge or join operations combine datasets by linking rows using one or more keys. These
operations are central to relational databases (e.g., SQL-based). The merge function in
pandas is the main entry point for using these algorithms on your data.

In [79]: df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a',


....: 'b'],'data1': range(7)})

In [80]: df2 = pd.DataFrame({'key': ['a', 'b', 'd'],


....: 'data2': range(3)})

In [81]: df1
Out[81]:
data1 key
0 0 b
1 1 b
2 2 a
3 3 c
4 4 a
5 5 a
6 6 b

13 | P a g e
UECM 1534 Programming Techniques for Data Processing Jan 18/19

In [82]: df2
Out[82]:
data2 key
0 0 a
1 1 b
2 2 d

This is an example of a many-to-one join; the data in df1 has multiple rows labeled a
and b, whereas df2 has only one row for each value in the key column. Calling .merge
with these objects we obtain:

In [83]: pd.merge(df1, df2)


Out[83]:
data1 key data2
0 0 b 1
1 1 b 1
2 6 b 1
3 2 a 0
4 4 a 0
5 5 a 0

Note that no specification which column to join on. If that information is not specified,
merge uses the overlapping column names as the keys. It’s a good practice to specify
explicitly, though:

In [84]: pd.merge(df1, df2, on='key')


Out[84]:
data1 key data2
0 0 b 1
1 1 b 1
2 6 b 1
3 2 a 0
4 4 a 0
5 5 a 0

If the column names are different in each object, you can specify them separately:

In [85]: df3 = pd.DataFrame({'lkey': ['b', 'b', 'a', 'c', 'a',


....: 'a', 'b'],'data1': range(7)})

In [86]: df4 = pd.DataFrame({'rkey': ['a', 'b', 'd'],


....: 'data2': range(3)})

14 | P a g e
UECM 1534 Programming Techniques for Data Processing Jan 18/19

In [87]: pd.merge(df3, df4, left_on='lkey', right_on='rkey')


Out[87]:
data1 lkey data2 rkey
0 0 b 1 b
1 1 b 1 b
2 6 b 1 b
3 2 a 0 a
4 4 a 0 a
5 5 a 0 a

Other possible options are 'left', 'right', and 'outer'. The outer join takes the
union of the keys, combining the effect of applying both left and right joins:

In [88]: pd.merge(df1, df2, how='outer')


Out[88]:
data1 key data2
0 0.0 b 1.0
1 1.0 b 1.0
2 6.0 b 1.0
3 2.0 a 0.0
4 4.0 a 0.0
5 5.0 a 0.0
6 3.0 c NaN
7 NaN d 2.0

See table below for a summary of the options for how=.

Many-to-many merges have well-defined, though not necessarily intuitive, behavior.


Here’s an example:

In [89]: df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a',


....: 'b'], 'data1': range(6)})

In [90]: df2 = pd.DataFrame({'key': ['a', 'b', 'a', 'b', 'd'],


....: 'data2': range(5)})

15 | P a g e
UECM 1534 Programming Techniques for Data Processing Jan 18/19

In [91]: df1
Out[91]:
data1 key
0 0 b
1 1 b
2 2 a
3 3 c
4 4 a
5 5 b

In [92]: df2
Out[92]:
data2 key
0 0 a
1 1 b
2 2 a
3 3 b
4 4 d

In [93]: pd.merge(df1, df2, on='key', how='left')


Out[93]:
data1 key data2
0 0 b 1.0
1 0 b 3.0
2 1 b 1.0
3 1 b 3.0
4 2 a 0.0
5 2 a 2.0
6 3 c NaN
7 4 a 0.0
8 4 a 2.0
9 5 b 1.0
10 5 b 3.0

In [94]: pd.merge(df1, df2, how='inner')


Out[94]:
data1 key data2
0 0 b 1
1 0 b 3
2 1 b 1
3 1 b 3
4 5 b 1
5 5 b 3
6 2 a 0
7 2 a 2
8 4 a 0
9 4 a 2

16 | P a g e
UECM 1534 Programming Techniques for Data Processing Jan 18/19

To merge with multiple keys, pass a list of column names:

In [95]: left = pd.DataFrame({'key1': ['foo', 'foo', 'bar'],


....: 'key2': ['one', 'two', 'one'],
....: 'lval': [1, 2, 3]})

In [96]: right = pd.DataFrame({'key1': ['foo', 'foo', 'bar',


....: 'bar'], 'key2': ['one', 'one', 'one', 'two'],
....: 'rval': [4, 5, 6, 7]})

In [97]: pd.merge(left, right, on=['key1', 'key2'], how='outer')


Out[97]:
key1 key2 lval rval
0 foo one 1.0 4.0
1 foo one 1.0 5.0
2 foo two 2.0 NaN
3 bar one 3.0 6.0
4 bar two NaN 7.0

A last issue to consider in merge operations is the treatment of overlapping column names.

In [98]: pd.merge(left, right, on='key1')


Out[98]:
key1 key2_x lval key2_y rval
0 foo one 1 one 4
1 foo one 1 one 5
2 foo two 2 one 4
3 foo two 2 one 5
4 bar one 3 one 6
5 bar one 3 two 7

4.2 Merging on Index


In some cases, the merge key(s) in a DataFrame will be found in its index. In this case, you
can pass left_index=True or right_index=True (or both) to indicate that the index
should be used as the merge key:

In [99]: left1 = pd.DataFrame({'key': ['a', 'b', 'a', 'a', 'b',


....: 'c'],'value': range(6)})

In [100]: right1 = pd.DataFrame({'group_val': [3.5, 7]},


....: index=['a', 'b'])

17 | P a g e
UECM 1534 Programming Techniques for Data Processing Jan 18/19

In [101]: left1
Out[101]:
key value
0 a 0
1 b 1
2 a 2
3 a 3
4 b 4
5 c 5

In [102]: right1
Out[102]:
group_val
a 3.5
b 7.0

In [103]: pd.merge(left1, right1, left_on='key', right_index=True)


Out[103]:
key value group_val
0 a 0 3.5
2 a 2 3.5
3 a 3 3.5
1 b 1 7.0
4 b 4 7.0

Since the default merge method is to intersect the join keys, you can instead form the
union of them with an outer join:

In [104]: pd.merge(left1, right1, left_on='key', right_index=True,


....: how='outer')
Out[104]:
key value group_val
0 a 0 3.5
2 a 2 3.5
3 a 3 3.5
1 b 1 7.0
4 b 4 7.0
5 c 5 NaN

Using the indexes of both sides of the merge is also possible:

In [105]: left2 = pd.DataFrame([[1., 2.], [3., 4.], [5., 6.]],


....: index=['a', 'c', 'e'],
....: columns=['Ohio', 'Nevada'])

In [106]: right2 = pd.DataFrame([[7., 8.], [9., 10.], [11., 12.],


....: [13, 14]], index=['b', 'c', 'd', 'e'],
....: columns=['Missouri', 'Alabama'])

18 | P a g e
UECM 1534 Programming Techniques for Data Processing Jan 18/19

In [107]: left2
Out[107]:
Ohio Nevada
a 1.0 2.0
c 3.0 4.0
e 5.0 6.0

In [108]: right2
Out[108]:
Missouri Alabama
b 7.0 8.0
c 9.0 10.0
d 11.0 12.0
e 13.0 14.0

In [109]: pd.merge(left2, right2, how='outer', left_index=True,


....: right_index=True)
Out[109]:
Ohio Nevada Missouri Alabama
a 1.0 2.0 NaN NaN
b NaN NaN 7.0 8.0
c 3.0 4.0 9.0 10.0
d NaN NaN 11.0 12.0
e 5.0 6.0 13.0 14.0

DataFrame has a convenient .join instance for merging by index. It can also be used to
combine together many DataFrame objects having the same or similar indexes but non-
overlapping columns. In the prior example, we could have written:

In [110]: left2.join(right2, how='outer')


Out[110]:
Ohio Nevada Missouri Alabama
a 1.0 2.0 NaN NaN
b NaN NaN 7.0 8.0
c 3.0 4.0 9.0 10.0
d NaN NaN 11.0 12.0
e 5.0 6.0 13.0 14.0

DataFrame’s join method preserving the left frame’s row index.

In [111]: left1.join(right1, on='key')


Out[111]:
key value group_val
0 a 0 3.5
1 b 1 7.0
2 a 2 3.5
3 a 3 3.5
4 b 4 7.0
5 c 5 NaN

19 | P a g e
UECM 1534 Programming Techniques for Data Processing Jan 18/19

4.3 Concatenating Along an Axis


The .concat function in pandas provides a consistent way to address each of these
concerns. I’ll give a number of examples to illustrate how it works. Suppose we have
three Series with no index overlap:

In [112]: s1 = pd.Series([0, 1], index=['a', 'b'])

In [113]: s2 = pd.Series([2, 3, 4], index=['c', 'd', 'e'])

In [114]: s3 = pd.Series([5, 6], index=['f', 'g'])

Calling .concat with these objects in a list glues together the values and indexes:

In [115]: pd.concat([s1, s2, s3])


Out[115]:
a 0
b 1
c 2
d 3
e 4
f 5
g 6
dtype: int64

By default .concat works along axis=0, producing another Series. If you pass axis=1,
the result will instead be a DataFrame (axis=1is the columns):

In [116]: pd.concat([s1, s2, s3], axis=1)


Out[116]:
0 1 2
a 0.0 NaN NaN
b 1.0 NaN NaN
c NaN 2.0 NaN
d NaN 3.0 NaN
e NaN 4.0 NaN
f NaN NaN 5.0
g NaN NaN 6.0

The same logic extends to DataFrame objects:

In [117]: df1 = pd.DataFrame(np.arange(6).reshape(3, 2),


....: index=['a', 'b', 'c'],
....: columns=['one', 'two'])

In [118]: df2 = pd.DataFrame(5 + np.arange(4).reshape(2, 2),


....: index=['a', 'c'],
....: columns=['three', 'four'])

20 | P a g e
UECM 1534 Programming Techniques for Data Processing Jan 18/19

In [119]: df1
Out[119]:
one two
a 0 1
b 2 3
c 4 5

In [120]: df2
Out[120]:
three four
a 5 6
c 7 8

In [121]: pd.concat([df1, df2], axis=1, keys=['level1', 'level2'])


Out[121]:
level1 level2
one two three four
a 0 1 5.0 6.0
b 2 3 NaN NaN
c 4 5 7.0 8.0

A last consideration concerns DataFrames in which the row index does not contain
any relevant data:

In [122]: df1 = pd.DataFrame(np.random.randn(3, 4), columns=['a',


....: 'b', 'c', 'd'])

In [123]: df2 = pd.DataFrame(np.random.randn(2, 3), columns=['b',


....: 'd', 'a'])

In [124]: df1
Out[124]:
a b c d
0 1.246435 1.007189 -1.296221 0.274992
1 0.228913 1.352917 0.886429 -2.001637
2 -0.371843 1.669025 -0.438570 -0.539741

In [125]: df2
Out[125]:
b d a
0 0.476985 3.248944 -1.021228
1 -0.577087 0.124121 0.302614

In this case, you can pass ignore_index=True:

21 | P a g e
UECM 1534 Programming Techniques for Data Processing Jan 18/19

In [126]: pd.concat([df1, df2], ignore_index=True)


Out[126]:
a b c d
0 1.246435 1.007189 -1.296221 0.274992
1 0.228913 1.352917 0.886429 -2.001637
2 -0.371843 1.669025 -0.438570 -0.539741
3 -1.021228 0.476985 NaN 3.248944
4 0.302614 -0.577087 NaN 0.124121

22 | P a g e

You might also like