KEMBAR78
UNIT 5 Python Aktu | PDF | Computer Programming | Computing
0% found this document useful (0 votes)
31 views49 pages

UNIT 5 Python Aktu

The document provides an overview of NumPy, a crucial library for numerical computing in Python, detailing its properties such as N-dimensional arrays, vectorization, and broadcasting. It covers basic operations, array creation, and the use of universal functions for efficient computation. Additionally, it explains array indexing, slicing, and boolean indexing for data manipulation.

Uploaded by

aryamann khare
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
31 views49 pages

UNIT 5 Python Aktu

The document provides an overview of NumPy, a crucial library for numerical computing in Python, detailing its properties such as N-dimensional arrays, vectorization, and broadcasting. It covers basic operations, array creation, and the use of universal functions for efficient computation. Additionally, it explains array indexing, slicing, and boolean indexing for data manipulation.

Uploaded by

aryamann khare
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 49

UNIT 5

PYTHON
TOPIC INCLUDE – PANDAS , NUMPY , MATPLOTLIB AND TKINTER….
NumPy is a foundational package for numerical computing in Python. Let's dive deep into
NumPy, its properties, and its built-in functions with examples to thoroughly understand its
capabilities and functionalities.

Introduction to NumPy

NumPy, short for Numerical Python, is a powerful library that provides support for large multi-
dimensional arrays and matrices, along with a collection of mathematical functions to operate on
these arrays. It is the foundation of many other scientific computing packages in Python, such as
SciPy, Pandas, and scikit-learn.

Properties of NumPy

1. N-dimensional array object (ndarray)

The core feature of NumPy is its powerful N-dimensional array object, ndarray. This array object
allows for efficient storage and manipulation of large datasets in a compact and efficient manner.

2. Vectorization

NumPy enables vectorized operations on arrays, which allows for operations to be applied
element-wise without the need for explicit loops. This results in faster and more concise code.

3. Broadcasting

Broadcasting is a powerful mechanism that allows NumPy to work with arrays of different
shapes when performing arithmetic operations. This can lead to significant simplifications in
code and improved performance.

4. Universal Functions (ufuncs)

Universal functions are functions that operate element-wise on ndarrays, providing a means to
perform vectorized operations on data.

5. Integration with C/C++ and Fortran

NumPy can interface with C, C++, and Fortran code, making it highly versatile and suitable for
high-performance computing tasks.

6. Random Number Generation

NumPy includes functions for generating random numbers, which are essential for statistical
sampling and simulations.

Basic NumPy Operations


Numpy

Numpy is the most popular python library for matrix/vector computations. Due to
python’s popularity, it is also one of the leading libraries for numerical analysis, and a
frequent target for computing benchmarks and optimization.

It is important to keep in mind that numpy is a separate library that is not part of the
base python. Unlike R, base python is not vectorized, and one has to load numpy (or
another vectorized library, such as pandas) in order to use vectorized operations.
This also causes certain differences between the base python approach and the way
to do vectorized operations.

3.1.1 Importing numpy


Numpy is typically imported as np:
import numpy as np
np is pretty much the
standard acronym for the numpy and widely used in online
documentation. Below we assume numpy has been imported as np.

3.1.2 Array: The Fundamental Data Structure in


Numpy
Numpy is fundamentally based on arrays, N-dimensional data structures. Here we
mainly stay with one- and two-dimensional structures (vectors and matrices) but the
arrays can also have higher dimension (called tensors). Besides arrays, numpy also
provides a plethora of functions that operate on the arrays, including vectorized
mathematics and logical operations.

Arrays can be created with np.array. For instance, we can create a 1-D vector of
numbers from 1 to 4 by feeding a list of desired numbers to the np.array:
a = np.array([1,2,3,4])
print("a:\n", a)
## a:
## [1 2 3 4]
Note that it is printed in brackets as list, but unlike a list, it does not have commas
separating the components.

If we want to create a matrix (two-dimensional array), we can feed np.array with a list
of lists, one sublist for each row of the matrix:
b = np.array([[1,2], [3,4]])
print("b:\n", b)
## b:
## [[1 2]
## [3 4]]
The output does not have the best formatting but it is clear enough.

One of the fundamental property of arrays its dimension, called shape in numpy.
Shape is array’s size along all of its dimensions. This can be queried by
attribute .shape which returns the sizes in a form of a tuple:
a.shape
## (4,)
b.shape
## (2, 2)
One can see that vector a has a single dimension of size 4, and matrix b has two
dimensions, both of size 2 (remember: (4,) is a tuple of length 1!).
One can also reshape arrays, i.e. change their shape into another compatible shape.
This can be achieved with .reshape() method. .reshape takes one argument, the new
shape (as a tuple) of the array. For instance, we can reshape the length-4 vector into
a 2x2 matrix as
a.reshape((2,2))
## array([[1, 2],
## [3, 4]])
and we can “straighten” matrix b into a vector with
b.reshape((4,))
## array([1, 2, 3, 4])

3.1.3 Creating Arrays


Sometimes it is practical to create arrays manually as we did above, but usually it is
much more important to make those by computation. Below we list a few options.

np.arange creates sequences, quite a bit like range, but the result will be a numpy
vector. If needed, we can reshape the vector into a desired format:
np.arange(10) # vector of length 10
## array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
np.arange(10).reshape((2,5)) # 2x5 matrix
## array([[0, 1, 2, 3, 4],
## [5, 6, 7, 8, 9]])
np.zeros and np.ones create arrays filled with zeros and ones respectively:
np.zeros((5,))
## array([0., 0., 0., 0., 0.])
np.ones((2,4))
## array([[1., 1., 1., 1.],
## [1., 1., 1., 1.]])
Arrays can be combined in different ways, e.g. np.column_stack combines them as
columns (next to each other), and np.row_stack combines these as rows (underneath
each other). For instance, we can combine a column of ones and two columns of
zeros as follows:
oneCol = np.ones((5,)) # a single vector of ones
zeroCols = np.zeros((5,2)) # two columns of zeros
np.column_stack((oneCol, zeroCols)) # 5x3 columns
## array([[1., 0., 0.],
## [1., 0., 0.],
## [1., 0., 0.],
## [1., 0., 0.],
## [1., 0., 0.]])
Note that column_stack expects all arrays to be passed as a single tuple (or list).
Exercise 3.1 Use np.zeros, np.ones, mathematical operations and concatenation to
create the following array:
## array([[-1., -1., -1., -1.],
## [ 0., 0., 0., 0.],
## [ 2., 2., 2., 2.]])

3.1.4 Vectorized Functions (Universal Functions)


It is possible to use loops to do computation with numpy objects exactly in the same
way when working with lists. However, one should use vectorized operations instead
whenever possible. Vectorized operations are easier to code, easier to read, and
result in faster code.

Numpy offers a plethora of vectorized functions and operators, called universal


functions. Many of these work as expected. For instance, mathematical operations.
We create a matrix, and then add “100” to it, and then rise “2” to the power of the
values:

a = np.arange(12).reshape((3,4))
print(a)
## [[ 0 1 2 3]
## [ 4 5 6 7]
## [ 8 9 10 11]]
print(100 + a, "\n")
## [[100 101 102 103]
## [104 105 106 107]
## [108 109 110 111]]
print(2**a, "\n") # remember: exponent with **, not with ^
## [[ 1 2 4 8]
## [ 16 32 64 128]
## [ 256 512 1024 2048]]
Both of these mathematical operations, + and ** are performed elementwise2 for
every single element of the matrix.
Exercise 3.2 Create the following array:

## array([[ 2, 4, 6, 8, 10],
## [12, 14, 16, 18, 20],
## [22, 24, 26, 28, 30],
## [32, 34, 36, 38, 40]])
Comparison operators are vectorized too:

a > 6
## array([[False, False, False, False],
## [False, False, False, True],
## [ True, True, True, True]])
a == 7
## array([[False, False, False, False],
## [False, False, False, True],
## [False, False, False, False]])
As comparison operators are vectorized, one might expect that the other logical
operators, and, or and not, are also vectorized. But this is not the case. There are
vectorized logical operators, but they differ from the base python version. These are
more similar to corresponding operators in R or C, namely & for logical and, | for
logical or, and ~ for logical not:
(a < 3) | (a > 8) # logical or
## array([[ True, True, True, False],
## [False, False, False, False],
## [False, True, True, True]])
(a > 4) & (a < 7) # logical and
## array([[False, False, False, False],
## [False, True, True, False],
## [False, False, False, False]])
~(a > 6) # logical not
## array([[ True, True, True, True],
## [ True, True, True, False],
## [False, False, False, False]])
There is no vectorized multi-way comparison like 1 < x < 2.
3.1.5 Array Indexing and Slicing
Indexing refer to extracting elements based on their position or certain criteria. This is
one of the fundamental operations with arrays. There are two ways to extract
elements: based on position, and based on logical criteria. Unfortunately, this also
makes indexing somewhat confusing, and it needs some time to become familiar
with.

3.1.5.1 Extracting elements based on position


Array indexing is very similar to list indexing. As matrices have two dimensions, we
need two indices.

a = np.arange(12)
print(a[::2]) # every second element
## [ 0 2 4 6 8 10]
However, unlike lists, one can do vectorized assignments in numpy:

a[5:11] = -1 # assign multiple elements


a
## array([ 0, 1, 2, 3, 4, -1, -1, -1, -1, -1, -1, 11])
One can also extract multiple elements from a vector:

a[[4,5,7]] # extract 3 elements in one go


## array([ 4, -1, -1])
When working with matrices (2-D arrays), we need two indices, separated by comma.
Comma separates two slices

c = np.arange(12).reshape((3,4))
c
## array([[ 0, 1, 2, 3],
## [ 4, 5, 6, 7],
## [ 8, 9, 10, 11]])
c[1,2] # 2nd row, 3rd column
## 6
c[1] # 2nd row
## array([4, 5, 6, 7])
Comma can separate not just two indices but two slices, so we can write

c[:,2] # all rows, 3rd column


## array([ 2, 6, 10])
c[:2] # 1st, 2nd row
## array([[0, 1, 2, 3],
## [4, 5, 6, 7]])
c[:2, :3] # 1s, 2nd row, first three columns
## array([[0, 1, 2],
## [4, 5, 6]])
Exercise 3.3 Create matrix and access rows and columns

• create a 4x5 array of even numbers: 10, 12, 14, …

• extract third column

• set the fourth row to 1,2,3,4,5

3.1.5.2 Boolean indexing


An extremely widely used approach is to extract elements of an array based on a
logical criteria. Fundamentally, it is just using a logical vector for indexing. The vector
must be of the same lengts as the array in question, and the results contains only
those elements the correspond to True in the indexing vector. Here is an example
how we can do this manually:
a = np.array([1,2,7,8])
i = np.array([True, False, True, False])
a[i] # 1, 7
## array([1, 7])
It is important you understand what is going on here: arrays a and i will be “matched”,
so each element of a will have its “match” in i. Next, only those elements of a that are
matched with True are extracted, in this case just 1 and 7.
The previous example–manually creating a logical index vectors of trues and falses is
hardly ever useful. Almost always we use logical operations instead. For instance, we
can extract all elements of a that are greater than 5:
i = a > 5
i
## array([False, False, True, True])
a[i]
## array([7, 8])
This is often written in a more compact manner by skipping explicit logical vector i:
a[a > 5]
## array([7, 8])
New users of numpy (and other languages that support logical indexing) sometimes
forget that the logical condition does not have to be related to the same array that we
are attempting to extract. For instance, we can extract all results for a certain perso n:

names = np.array(["Cyrus", "Darius", "Xerxes", "Artaxerxes", "Cyrus", "Darius"])


results = np.array([17, 14, 20, 18, 13, 15])
results[names == "Darius"]
## array([14, 15])
Here index vector is based on the variable name only and is not directly related
to results. However, we use it to extract values from the latter.
Finally, we also can extract rows (or columns) from a 2-D array in a fairly similar
fashion:

names = np.array(["Cyrus", "Darius", "Xerxes"])


results = np.array([[17, 14], [20, 18], [13, 15]])
results
## array([[17, 14],
## [20, 18],
## [13, 15]])
results[names == "Darius",:]
## array([[20, 18]])
The results is the second row of the 2-D array results, corresponding to the name
“Darius”.
Logical indexing can also be used on the left-hand-side of the expression, in order to
replace elements. Below is an example where we replace all the negative elements
of a with zero.
a = np.random.randn(2,3)
a
## array([[ 1.3315865 , 0.71527897, -1.54540029],
## [-0.00838385, 0.62133597, -0.72008556]])
a[a < 0] = 0
a
## array([[1.3315865 , 0.71527897, 0. ],
## [0. , 0.62133597, 0. ]])
When replacing elements in such fashion then we need to supply the replacement
vector that is either length 1 (all elements are replaced by “0” in the example above),
or alternatively we should supply a replacement vector of correct length. For instance,
we can replace the positive numbers left in a with 1, 2, 3:
a[a > 0] = np.array([1, 2, 3])
a
## array([[1., 2., 0.],
## [0., 3., 0.]])
Exercise 3.4 consider two vectors

names = np.array(["Roxana", "Statira", "Roxana", "Statira", "Roxana"])


score = np.array([126, 115, 130, 141, 132])
Do the following using a single one-line vectorized operation.
• Extract all test scores that are smaller than 130

• Extract all test scores by Statira

• Add 10 points to Roxana’s scores. (You need to extract it first.)

3.1.6 Random numbers


TBD: make a separate section with some simulations and related examples.
Numpy offer a large set of random number generators. These can be invoked
as np.random.generator(params, size). For instance, np.random.choice(N) can be used to
create random numbers from 0 to N−1𝑁−1. size determines the shape of the resulting
object.

Here is an example to simulate roll of a die for 5 times:

x = np.random.choice(6, size=5)
x
## array([0, 2, 0, 4, 3])
But maybe we prefer not to label the results as 0..5 but 1..6. So we can just add one
to the result. Here is an example that creates 2-D array of die rolls:

1 + np.random.choice(6, size=(2,4))
## array([[1, 5, 4, 1],
## [4, 3, 2, 1]])
Numpy offers a large set of various random values. Here we list a few more:

3.1.6.1 Random elements from list


random.choice can also extract random elements from a list:
nucleotides = ["A", "G", "C", "T"]
dna = np.random.choice(nucleotides, 20)
"".join(dna)
## 'ACGTCGGGTGCGACCCGAGT'
As the example demonstrates, random.choice picks random elements with
replacement (use replace option to change this behavior).

3.1.6.2 Random normals


random.normal(loc, scale, size) generates normally distributed random numbers.
The distribution is centered at loc and its variance is scale:
np.random.normal(1000, 100, size=10)
## array([ 969.13995143, 984.62645147, 1067.89311948, 808.64457868,
## 905.75419444, 825.4457021 , 897.90547332, 983.79309738,
## 934.20291005, 1042.21130617])

3.1.6.3 Binomial random numbers


random.binomial(n, p, size) creates random binomials where probability of success
is p and sample size is n:
np.random.binomial(2, 0.5, size=(2,4))
## array([[2, 2, 1, 1],
## [2, 2, 1, 2]])
Exercise 3.5 We can describe a coin toss as Binomial(1, 0.5) where 1 refers to the
fact that we toss a single coin, and 0.5 means it has probability 0.5 to come heads up.
So such random variables are sequences of zeros and ones. But how can we get a
sequence of -1 and 1 instead? Demonstrate it on computer!

3.1.6.4 Uniform random numbers


random.uniform(low, high, size) creates uniformly distributed random numbers in the
interval [low, high]:
np.random.uniform(-1, 1, size=(3,4)) # random numbers in [-1, 1]
## array([[-0.4078626 , -0.73741789, 0.68563587, 0.31807261],
## [ 0.19087921, -0.1272926 , -0.28749935, 0.17426185],
## [-0.70105733, -0.6575228 , -0.20567095, 0.27590313]])

3.1.6.5 Repeating the exact same random sequence


The random numbers are often called pseudorandom as they are not truly random–
they are computed based on a well-defined algorithm, so when feeding the same
initial values to the algorithm, one always gets the same random numbers. However,
normally the initial values are taken from certain hart-to-control parameters outside of
the program control, such as time in microseconds and hard disk serial number, so in
practice it is impossible to replicate the same sequence.

However, if you need to replicate your results exactly, you have to set the initial
values explicitly using random.seed(value). This re-initializes RNG-s to the given initial
state:
np.random.seed(1)
np.random.uniform(size=5) # 1st batch of numbers
## array([4.17022005e-01, 7.20324493e-01, 1.14374817e-04, 3.02332573e-01,
## 1.46755891e-01])
np.random.uniform(size=5) # 2nd batch is different
## array([0.09233859, 0.18626021, 0.34556073, 0.39676747, 0.53881673])
np.random.seed(1)
np.random.uniform(size=5) # repeat the 1st batch
## array([4.17022005e-01, 7.20324493e-01, 1.14374817e-04, 3.02332573e-01,
## 1.46755891e-01])

3.1.7 Statistical functions


Numpy offers a set of basic statistical functions, including sum, mean, and standard
deviations std. These can be applied to the array as a whole, or separately to rows or
columns. In the latter case one has to specify the argument axis, where the value 0
means to apply the operation row-wise (and preserve columns) and axis=1 means to
apply the operation column-wise (and preserve rows). Here is an example:
a = np.arange(12).reshape((3,4))
a # 3 rows, 4 columns
## array([[ 0, 1, 2, 3],
## [ 4, 5, 6, 7],
## [ 8, 9, 10, 11]])
a.sum() # total sum
## 66
a.sum(axis=0) # add rows, preserve columns
## array([12, 15, 18, 21])
a.sum(axis=1) # add columns, preserve rows
## array([ 6, 22, 38])
The functions come in two forms: as a method x.sum(), and as a separate
function np.sum(x). These two ways are pretty much equivalent.
By default, a missing value of an array causes the function to return missing:

a = a.astype(float) # as np.nan is float, need a float array


a[1,2] = np.nan
a
## array([[ 0., 1., 2., 3.],
## [ 4., 5., nan, 7.],
## [ 8., 9., 10., 11.]])
np.sum(a)
## nan
This differs from the corresponding functionality in pandas where missings are
ignored by default!

The other statistical functions include

• mean for average


• median for median
• var for variance
• std for standard deviation
• np.percentile and np.quantile for quantiles
Pandas
.

Pandas is the standard python library to work with dataframes. Unlike in R, this is not
a part of base python and must be imported separately. It is typically imported as pd:
import pandas as pd
Pandas relies heavily on numpy but is a separate package. Unfortunately, it also uses
a somewhat different syntax and somewhat different defaults. However, as it is “made
of” numpy, it works very well together with the latter.

Pandas contains two central data types: Series and DataFrame. Series is often used
as a second-class citizen, just as a single variable (column) in data frame. But it can
also be used as a vectorized dict that links keys (indices) to values. DataFrame is
broadly similar to other dataframes as implemented in R or spark. When you extract
its individual columns and rows you normally get those in the form of Series. So it is
extremely useful to know the basics of Series when working with data frames.
Both DataFrame and Series include index, a glorified row name, which is very useful
for extracting information based on names, or for merging different variables into a
data frame (See Section Concatenating data with pd.concat).
We start by introducing Series as this is a simpler data structure than DataFrame,
and allows us to introduce index.

3.2.1 Series
Series is a one-dimensional positional column (or row) of values. It is in some sense
similar to list, but from another point of view it is more like a dict, as it contains index,
and you can look up values based on index as a key. So it allows not only positional
access but also index-based (key-based) access. In terms of internal structure, it is
implemented with vectorized operations in mind, so it supports vectorized arithmetic,
and vectorized logical, string, and other operations. Unlike dicts, it also supports
multi-element extraction.

Let’s create a simple series:

s = pd.Series([1,2,5,6])
s
## 0 1
## 1 2
## 2 5
## 3 6
## dtype: int64
Series is printed in two columns. The first one is the index, the second one is the
value. In this example, index is essentially just the row number and it is not very
useful. This is because we did not provide any specific index and hence pandas
picked just the row number. Underneath the two columns, you can also see the data
type, in this case it is 64-bit integer, the default data type for integers in python.

Now let’s make another example with a more informative index:

pop = pd.Series( [ 38, 26, 19, 19],


index = ['ca', 'tx', 'ny', 'fl'])
# population, in millions
pop
## ca 38
## tx 26
## ny 19
## fl 19
## dtype: int64
Now the index is helpful: we are looking at state populations, and index tells us which
state is in which row. Another advantage of possessing index is that even when we
filter and manipulate the series, it’s index will still retain the original row label. So we
know that index “fl” will always correspond to Florida. But if we have removed a few
cases, or re-ordered the series, then Florida may not be on the fourth position any
more.

Exercise 3.6 Create a series of 4 capital cities where the index is the name of
corresponding country.

We can extract values and index using the corresponding attributes:

pop.values
## array([38, 26, 19, 19])
pop.index
## Index(['ca', 'tx', 'ny', 'fl'], dtype='object')
Note that values are returned as np array, and index is a special index object. If
desired, this can be converted to a list:

list(pop.index)
## ['ca', 'tx', 'ny', 'fl']
Series also supports ordinary mathematics, e.g. we can do operations like

pop > 20
## ca True
## tx True
## ny False
## fl False
## dtype: bool
the result will be another series, here of logical values, as indicated by the “bool” data
type.

3.2.2 DataFrame
DataFrame is the central data structure for holding 2-dimensional rectangular data. It
is in many ways similar to R dataframes. However, it also shares a number of
features with Series, in particular the index, so you can imagine a data frame is just a
number of series stacked next to each other. Also, extracting single rows or columns
from DataFrames typically results in a series.

3.2.2.1 Creating data frames


DataFrame can be created manually as a dict of lists (or series). The keys of the list
are the variable names and values are the variable values, normally these are lists or
series. As an example, let’s create a data frame with three variables, ca, tx and md,
and three rows:

df = {'ca': [35, 37, 38], 'tx': [23, 24, 26], 'md': [5,5,6]}
pop = pd.DataFrame(df)
print('population:\n', pop, '\n')
## population:
## ca tx md
## 0 35 23 5
## 1 37 24 5
## 2 38 26 6
The data frame is printed as four columns. Exactly as in case of series, the first
column is index. In the example above we did not specify the index and hence
pandas picked just row numbers. But we can provide an explicit index, for instance
the year of observation:

pop = pd.DataFrame(df, index = [2010,2012,2014])


print('population:\n', pop, '\n')
## population:
## ca tx md
## 2010 35 23 5
## 2012 37 24 5
## 2014 38 26 6
In this case the index is rather useful.

Exercise 3.7 Create a dataframe of (at least 4) countries, with 2 variables: population
and capital. Country name should be the index.

Hint: feel free to invent populations!


3.2.2.2 Read data from file
To create data frames manually is useful for testing and debugging, in real
applications we typically read data from disk. This can be done with pd.read_csv that
takes the file name as the first argument, and also supports many other options. In
the example below, we read data about G.W.Bush approval rate in fall
2001. pd.read_csv assumes files are comma-separated by default, but as this
example file is tab-separated we have to declare it using sep="\t" as an extra
argument. We also read the first 10 rows only for demonstration:
approval = pd.read_csv("../data/gwbush-approval.csv", sep="\t", nrows=10)
approval
## date approve disapprove dontknow
## 0 2001 Dec 14-16 86 11 3
## 1 2001 Dec 6-9 86 10 4
## 2 2001 Nov 26-27 87 8 5
## 3 2001 Nov 8-11 87 9 4
## 4 2001 Nov 2-4 87 9 4
## 5 2001 Oct 19-21 88 9 3
## 6 2001 Oct 11-14 89 8 3
## 7 2001 Oct 5-6 87 10 3
## 8 2001 Sep 21-22 90 6 4
## 9 2001 Sep 14-15 86 10 4
Exercise 3.8 In the example above: how many columns are printed? How many
variables does the dataframe contain?

What happens if we use a wrong separator? This can be easily checked with printing
the number of columns, and printing a few lines of data. Here is an example:

a = pd.read_csv("../data/gwbush-approval.csv") # wrong separator


a.shape
## (31, 1)
a.head(2)
## date\tapprove\tdisapprove\tdontknow
## 0 2001 Dec 14-16\t86\t11\t3
## 1 2001 Dec 6-9\t86\t10\t4
Two problems are immediately visible: first, the file contains a single column only
(because it does not consider tab symbols as separators), and the two lines we
printed look weird. If you ask for variable names, you can also see that all variable
names are combined together into a single weird name:

a.columns
## Index(['date\tapprove\tdisapprove\tdontknow'], dtype='object')
The tab markers \t in printout give strong hints that the correct separator is tab.
It may initially be quite confusing to understand how to specify the file name. If you
load data in a jupyter notebook, then the working directory is normally the same
directory where the notebook is located3. Notebook also let’s you to complete file
names with TAB key. But in any case, the working directory can be found
with os.getcwd (get current working directory):
import os
os.getcwd()
## '/home/siim/tyyq/lecturenotes/machinelearning-py'
This helps to specify the relative path if your data file is not located in the same place
as your code. You can also find which files does python find in a given folder,
e.g. in ../data/:
files = os.listdir("../data/")
files[:5]
## ['house-votes-84.csv.bz2', 'marathon.csv.bz2', 'growth-unemployment-
2016.csv.bz2', 'hadcrut-5.0.1.0-annual.csv.bz2', 'trump-approval.csv']
As we see, this function returns a list of file names it finds in the given location.

Exercise 3.9 Refresh your knowledge of relative paths!

• What is your current working directory?

• List all files in

o your current folder

o in the parent folder of it.

As another complication, notebooks are often run on a separate server or in a docker


container. These may have no access to files in your computer (as the server), or
only have a limited access (like docker container). In such a case, you should upload
the file to notebook, even if it is running on your computer!

3.3 Indexing data frames and series


Indexing refers to selecting data from data frames and series based on variable
names, logical conditions, and position. It is a complex task with many different
methods, and unfortunately also with many caveats. Below, the topic is split into
several subsections:

• Select variables explains how to select desired variables from a data frame

• Filter observations with logical operations describes how to filter rows

• Positional indexing of Series introduces positional indexing, indexing based on row


number, and how to do it with series
• Positional indexing of data frames explains positional indexing, indexing based on
both row and column numbers, for data frames

• Modifying data frames: there are slight differences when modifying data instead of
extracting, these are discussed here.

• Indexing: summary and comparison provides a summary of all methods.

Fortunately, Series and data frames behave in a broadly similar way, e.g. selecting
cases by logical conditions, based on index, and location are rather similar. As series
do not have columns, we cannot access elements by column name or by column
position though.

These notes do not provide a comprehensive overview, consult e.g. McKinney


"Python for Data Analysis for more details.

3.3.1 Select variables in data frames


We use the G.W.Bush approval data we loaded above to demonstrate variable
access. For a refresher, the first lines of the data frame look like

approval.head(4)
## date approve disapprove dontknow
## 0 2001 Dec 14-16 86 11 3
## 1 2001 Dec 6-9 86 10 4
## 2 2001 Nov 26-27 87 8 5
## 3 2001 Nov 8-11 87 9 4
To begin with, data frames have variable names. We can extract a single variable
either with ["varname"] or a shorthand as attribute .varname (note:
replace varname with the name of the relevant variable):
approval["approve"] # approval, as series
## 0 86
## 1 86
## 2 87
## 3 87
## 4 87
## 5 88
## 6 89
## 7 87
## 8 90
## 9 86
## Name: approve, dtype: int64
approval.approve # the same, as series
## 0 86
## 1 86
## 2 87
## 3 87
## 4 87
## 5 88
## 6 89
## 7 87
## 8 90
## 9 86
## Name: approve, dtype: int64
These constructs return the column as a series. If we prefer to get a single-column
data frame, we can wrap the variable name into a list:

approval[["approve"]] # approval, as data frame


## approve
## 0 86
## 1 86
## 2 87
## 3 87
## 4 87
## 5 88
## 6 89
## 7 87
## 8 90
## 9 86
The attribute shorthand is usually the easier way, but it does not work if you need to
use indirect variable name (variable name that is stored in another variable) or if the
variable name contains spaces or other special characters. It also does not work for
creating new variables in the data frame. See more in Section 3.3.5.

The previous example where we extracted a single column as a data frame instead
of Series also hints how to extract more than one variable: just wrap all the required
variable names into a list:

vars = ["date", "approve"]


approval[vars]
## date approve
## 0 2001 Dec 14-16 86
## 1 2001 Dec 6-9 86
## 2 2001 Nov 26-27 87
## 3 2001 Nov 8-11 87
## 4 2001 Nov 2-4 87
## 5 2001 Oct 19-21 88
## 6 2001 Oct 11-14 89
## 7 2001 Oct 5-6 87
## 8 2001 Sep 21-22 90
## 9 2001 Sep 14-15 86
There are no attribute shortcuts to extract multiple columns.

3.3.2 Filter observations with logical operations


Filtering refers to extracting only a subset of rows from the dataframe based on
certain conditions. The conditions are logical operations that can be either true or
false, depending on the values in each row. Filtering produces a sub-dataframe
where only those observations that meet the selection criteria are present: Here is an
example:

approval[approval.approve > 88]


## date approve disapprove dontknow
## 6 2001 Oct 11-14 89 8 3
## 8 2001 Sep 21-22 90 6 4
Note that we have to refer to data variables as approval.approve, not just approve,
unlike in R dplyr where one can just write approve. This is somewhat harder to write
but it is less ambiguous and produces fewer hard-to-find bugs.
Obviously we can use more complex selection conditions, for instance we can look
for very low or very high approval rates as follows:

approval[(approval.approve < 86) | (approval.approve > 89)]


## date approve disapprove dontknow
## 8 2001 Sep 21-22 90 6 4
Note that we are using the vectorized “or” operator not the base python or. We also
|,
need to wrap both the “less than” and “greater than” parts in parenthesis.
See more in Section 3.1.4.
Exercise 3.10 How many polls in the data show the president’s approval rate at least
88%? At which dates are those polls conducted?

See the solution

The filtered object is not a new data frame but a view of the original data frame. This
may give you warnings and errors later when you attempt to modify the filtered data.
If you intend to do that, perform a deep copy of data using the .copy method. See
more in Section 3.3.5.

3.3.3 Positional indexing of Series


Besides selecting variables and filtering by logical conditions, we occasionally need
to access elements by index, or by position (location). Here we demonstrate the
positional indexing using a series object, positional indexing of data frames is
discussed in Section 3.3.4 below:

pop = pd.Series([32.7, 267.7, 15.3], # in millions


index=["MY", "ID", "KH"])
pop
## MY 32.7
## ID 267.7
## KH 15.3
## dtype: float64
We can access series’ values in two ways: by position, and by index. In order to
access elements by position, we have to use attribute .iloc[] where i loc refers to
“integer”. Unlike most other methods, .iloc expects arguments in brackets. A single
number in brackets returns the element as an element (e.g. a single number), if
brackets contain a list (this looks like double brackets), it returns a series, potentially
containing only a single element. So in order to extract 2nd and 3rd element in the
population series, we can write:
pop.iloc[1] # extract 2nd element as a number
## 267.7
pop.iloc[[1,2]] # extract 2nd, 3th as a series
## ID 267.7
## KH 15.3
## dtype: float64
Alternatively, we can also extract the elements by index. This works in a similar
fashion, except we have to use .loc[] instead of .iloc[]. The rules for single and
double brackets apply in the similar fashion as in case of positional access.
pop.loc["ID"] # extract Indonesian population as a number
## 267.7
pop.loc[["ID", "MY"]] # extract Indonesian and Malaysian population
# as a series
## ID 267.7
## MY 32.7
## dtype: float64
Exercise 3.11 Use your series of capital cities (see the exercise above). Extract:
• 1st, 3rd element by position as single elements (city names)

• 2nd element by country name as a 1-element series.

One can also drop the .loc[] syntax and just use square brackets, so instead of
writing pop.loc[["ID", "MY"]], one can just write pop[["ID", "MY"]].
The fact that there are several ways to extract positional data causes a lot of
confusion for beginners. It is not helped by the common habit of not using indices and
just relying on the automatic row-numbers. In this case positional access
by .iloc[] produces exactly the same results as the index access by .loc[], and one
can conveniently forget about the index and use whatever feels easier. But
sometimes the index changes as a result of certain operations and that may lead to
errors or unexpected results. For instance, we can create an alternative population
series without explicit index:
pop1 = pd.Series([np.nan, 26, 19, 13]) # index is 0, 1, ...
pop1
## 0 NaN
## 1 26.0
## 2 19.0
## 3 13.0
## dtype: float64
In this example, position and index are equivalent and hence it is easy to forget
that .loc[] is index-based access, not positional access! So one may freely mix both
methods (and remember, .loc is not needed):
pop1.loc[2]
## 19.0
pop1.iloc[2]
## 19.0
pop1[2]
## 19.0
This becomes a problem if a numeric index is not equivalent to row number any more,
for instance after we drop missings:

pop2 = pop1.dropna() # remove missings


pop2 # note: the first row has index 1
## 1 26.0
## 2 19.0
## 3 13.0
## dtype: float64
pop2.iloc[2] # this is by position
## 13.0
pop2.loc[2] # this is by index
## 19.0
pop2[2] # also by index
## 19.0
Additionally, if pop2 for some reason turns into a numpy array, then pop2[2] is is based
on position as arrays do not have index!

3.3.4 Positional indexing of data frames


We use a small data frame of capital cities to demonstrate how indexing on data
frames works. The data frame contains two variables, name of the capital city and
population as variables, index is the country name:

countries = pd.DataFrame({"capital":["Kuala Lumpur", "Jakarta", "Phnom Penh"],


"population":[32.7, 267.7, 15.3]}, # in millions
index=["MY", "ID", "KH"])
countries
## capital population
## MY Kuala Lumpur 32.7
## ID Jakarta 267.7
## KH Phnom Penh 15.3
(MY is Malaysia, ID Indonesia and KH is Cambodia).

Exactly as series, data frames allow positional access by .iloc[]. However, as data
frames are two-dimensional objects, .iloc accepts two arguments (in brackets,
separated by comma), the first one for rows, the second one for columns. So we can
write
countries.iloc[2] # 3rd row, as series
## capital Phnom Penh
## population 15.3
## Name: KH, dtype: object
countries.iloc[[2]] # 3rd row, as data frame
## capital population
## KH Phnom Penh 15.3
countries.iloc[2,1] # 3rd row, 2nd column, as a number
## 15.3
There is also an index-based extractor .loc[] that accepts one (for rows) or two (for
rows and columns) indices. In case of data frames, the default row index is just the
row number; but the column index is the variable names. So we can write
countries.loc["MY","capital"] # Malaisian capital
## 'Kuala Lumpur'
countries.loc[["KH", "ID"], ["population", "capital"]]
# Extract a sub-dataframe
## population capital
## KH 15.3 Phnom Penh
## ID 267.7 Jakarta
Unfortunately, data frames add their confusing constructs. When accessing data
frames with .loc[] then we have to specify rows first, and possibly columns second. If
we drop .loc then we cannot specify rows. That is, unless we extract one variable
with brackets, get a series and extract the desired row in the second set of brackets…
countries["capital"]
## MY Kuala Lumpur
## ID Jakarta
## KH Phnom Penh
## Name: capital, dtype: object
countries["capital"]["MY"]
## 'Kuala Lumpur'
Finally, remember that 2-D numpy arrays will use similar integer-positional syntax
as .iloc[], just without .iloc.
In conclusion, it is very important to know what is your data type when using numpy
and pandas. Indexing is all around us when working with data, there are many
somewhat similar ways to extract elements, and which way is correct depends on the
exact data type.

3.3.5 Modifying data frames


Modifying data frames can be done in a broadly similar way as extracting elements,
you just need to put the expression on the left-hand side.

We can create a new column by assigning a list to a new column name:

countries["temperature"] = [27.7, 26.1, 26.6] # daily mean, January


countries
## capital population temperature
## MY Kuala Lumpur 32.7 27.7
## ID Jakarta 267.7 26.1
## KH Phnom Penh 15.3 26.6
One can also overwrite an existing column in a similar fashion.

However, there are several exceptions and caveats. Let’s demonstrate this by
modifying the data frame of three countries we created above.

3.3.5.1 One cannot create variables with dot-attribute


We can extract a single series as countries.capital , but we cannot create a new
variable with dot-notation. For instance:
countries.foo = [1, 2, 3] # does not work
## <string>:1: UserWarning: Pandas doesn't allow columns to be created via a new
attribute name - see https://pandas.pydata.org/pandas-
docs/stable/indexing.html#attribute-access
But after we have created a column with bracket-notation, we can access it using dot-
notation.

3.3.5.2 Explicitly make copy when working with filtered data


A typical data science workflow consists of a) filtering data to relevant cases only,
and b) modifying the resulting subset. The first step often involves removing missing
values, or limiting the analysis to a certain subset of interest. It is important to realize
that Pandas’ filtering does not copy the interesting cases in memory, it may instead
just create a view, i.e. re-use the same location in computer memory but just limit
access to certain part of it.4 This is a very good idea in terms of conserving memory
and avoiding unnecessary copy operations. However, this may cause warnings and
errors when modifying the filtered data later. We demonstrate this on the same
dataset.

Select only large countries (population over 20M):

large = countries[countries.population > 20]


large
## capital population temperature
## MY Kuala Lumpur 32.7 27.7
## ID Jakarta 267.7 26.1
We got a subset of Malaysia and Indonesia. Now let’s add another variable to these
large countries:

large["language"] = ["Malay", "Indonesian"]


## <string>:1: SettingWithCopyWarning:
## A value is trying to be set on a copy of a slice from a DataFrame.
## Try using .loc[row_indexer,col_indexer] = value instead
##
## See the caveats in the documentation: https://pandas.pydata.org/pandas-
docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
large
## capital population temperature language
## MY Kuala Lumpur 32.7 27.7 Malay
## ID Jakarta 267.7 26.1 Indonesian
Note the warning: A value is trying to be set on a copy of a slice…. This tells you that
filtering countries[countries.population > 20] did not create a new data frame but a
view of the existing one in memory, and Pandas is unhappy with the code modifying
just a part of the original data frame.
Although the result appears correct here, do not rely on this approach! It may or may
not work, depending on the exact memory layout of the dataset!

Fortunately, the solution is very simple. We need to make an explicit copy


with .copy method before we start any modifications:
large = countries[countries.population > 20].copy() # explicit copy
large["language"] = ["Malay", "Indonesian"]
large
## capital population temperature language
## MY Kuala Lumpur 32.7 27.7 Malay
## ID Jakarta 267.7 26.1 Indonesian
Now the modification works without a warning.

Explicit copy is not needed before you start modifying data, you can do various
filtering steps without .copy as long as you make the copy before modifications.

3.3.5.3 Modifying index


The index that is attached to series’ and data frames is potentially a useful and
iformative tool. But sometimes it is not very useful. For instance, when you load data
from disk, then the index defaults to be the row number, and this is rarely what we are
interested in. In such cases one may want to change the index. If you want to create
a new index then you can just assign it to df.index. For instance, we can just assign
country names as index to our data frame of large countries:
large.index = ["Malaysia", "Indonesia"]
large
## capital population temperature language
## Malaysia Kuala Lumpur 32.7 27.7 Malay
## Indonesia Jakarta 267.7 26.1 Indonesian
Alternatively, we can convert a column to index with .set_index() method:
large.set_index("capital")
## population temperature language
## capital
## Kuala Lumpur 32.7 27.7 Malay
## Jakarta 267.7 26.1 Indonesian
This will remove the column “capital” from data frame as its values will be in index
instead. Note that by default, .set_index() returns a new data frame instead of
modifying it in place, so if you want to preserve it, you have to store it in a new
variable. The opposite–converting the index into a column can be done
with .reset_index().
Exercise 3.12 Take the data frame of capital-population data frame from
Section 3.3.4.

• Replace the index by country names

• Convert the index into a variable “country”

Ensure that you store and print the final data frame!

3.3.6 Indexing: summary and comparison


Indexing data is complex. Here we repeat and summarize the main methods we have
discussed so far. First create three objects, a numpy matrix, a data frame, and a
series. The first two are 2-dimensional but the last one 1-dimensional.

M = np.array([[1507, 12478],
[-500, 11034],
[1537, 8443],
[1591, 6810]])
M
## array([[ 1507, 12478],
## [ -500, 11034],
## [ 1537, 8443],
## [ 1591, 6810]])
df = pd.DataFrame(M, columns=["established", "population"],
index=["Mumbai", "Delhi", "Bangalore", "Hyderabad"])
df
## established population
## Mumbai 1507 12478
## Delhi -500 11034
## Bangalore 1537 8443
## Hyderabad 1591 6810
s = pd.Series(M[:,0], index=df.index)
s
## Mumbai 1507
## Delhi -500
## Bangalore 1537
## Hyderabad 1591
## dtype: int64
(This is data about four cities, the year when those were established, and population
in thousands).

Exercise 3.13 Create another numpy matrix and a data frame about cities in a similar
fashion: create a matrix of data, and create a data frame from it using pd.DataFrame.
Specify index (row names) and columns (variable names). Include at least 3 cities
and 3 variables (e.g. population in millions, size in km2, and population density
people per km2).
Hint: you may invent both city names and the figures!

Extract rows/columns by number (integer):

• Numpy array: just use the numbers in brackets:

M[1,0] # second row, first column


## -500
M[2,:] # third row
## array([1537, 8443])

• Data frames: use iloc and brackets:


df.iloc[1,0] # second row, first column
## -500
df.iloc[2,:] # third row
## established 1537
## population 8443
## Name: Bangalore, dtype: int64

• Series: use iloc and brackets (but these are just 1-dimensional):
s.iloc[1] # second row
## -500
Extract using index (city names/column names):

• numpy array: not possible

• Data frames: use loc and brackets:


df.loc["Delhi","established"] # second row, first column
## -500
df.loc["Bangalore",:] # third row
## established 1537
## population 8443
## Name: Bangalore, dtype: int64
• Series: use loc (but not columns here):
s.loc["Delhi"]
## -500
If we want to extract individual columns, we can do the following:

• Numpy arrays: use brackets and use a colon : in row indicators place:
M[:,0]
## array([1507, -500, 1537, 1591])

• Data frames: you can use iloc and brackets, exactly as in case of numpy arrays. You
can also use brackets and column names (column index) without iloc, or dot-column
name:
df.iloc[:,0]
## Mumbai 1507
## Delhi -500
## Bangalore 1537
## Hyderabad 1591
## Name: established, dtype: int64
df["established"]
## Mumbai 1507
## Delhi -500
## Bangalore 1537
## Hyderabad 1591
## Name: established, dtype: int64
df.established
## Mumbai 1507
## Delhi -500
## Bangalore 1537
## Hyderabad 1591
## Name: established, dtype: int64
If you want to extract rows and columns in a mixed, e.g. rows by number, and
columns by column names (index), you can use double extraction (two sets of
brackets) and chain your extractions into a single line:

df.iloc[:3,:]["population"]
## Mumbai 12478
## Delhi 11034
## Bangalore 8443
## Name: population, dtype: int64
Exercise 3.14 Take your own city matrix and city data frame. From both of these
extract:
• population density (for all cities)

• data for the third city. For the data frame do it in two ways: using index, and using row
number!

• area of the second city. For the data frame, do it in two ways: using column name
(column index), and column number!

Finally, if asking for a single entry (singleton), pandas simplifies the result into a
lower-ranked object (series instead of data frame, or a number instead of series). If
you want to retain a similar data structure as the original one, wrap your selector in a
list. For instance, the previous example that returns a data frame: single line:

df.iloc[:3,:][["population"]]
## population
## Mumbai 12478
## Delhi 11034
## Bangalore 8443
All these methods can create rather confusing situations sometimes. For instance, if
we do not specify index, it will be automatically created as row numbers (but starting
from 0, not 1). In that case df.iloc[i] and df.loc[i] give the same result
(assuming i is a list of row numbers). Even worse, if the index skips some numbers,
then df.loc[i] may or may not work, and even where it works, it may give wrong
results! In a similar fashion, M[i,j] works but df[i,j] does not
work, df.loc[i,j] works but M.loc[i,j] does not work. In order to tell if the syntax is
correct it is necessary to know what is the data structure.

2. There are also operations that are not performed elementwise when using
array, in particular matrix product↩︎

3. If you run your code from command line, the working directory is the directory
where you run the command, not the directory where the program is located.↩︎

4. Pandas decides whether to make a copy or a view in each case separately,


depending on what is the more efficient approach.↩︎
Plotting: matplotlib
import numpy as np
np.random.seed(10)
Python has many plotting libraries. Here we discuss some of the simplest
ones, matplotlib and seaborn. Matplotlib is in a sense a very basic plotting library,
oriented on vectors, not datasets (in this sense comparable to base-R plotting). But it
is very widely used, and with a certain effort, it allows to create very nice looking
plots. It is also easier to tinker with the lower level features in matplotlib, compared to
the more high-level data oriented libraries.

Seaborn is such a high-level data oriented plotting library (comparable to ggplot in R


in this sense). It has ready-made functionality to pick variables from datasets and
modify the visual properties of lines and points depending on other values in data.

We assume you have imported the following modules:

import numpy as np
import pandas as pd

4.1 Matplotlib
Matplotlib is designed to be similar to the plotting functionality in the popular matrix
language matlab. This is a library geared to scientific plotting. In these notes, we are
mainly interested in the pyplot module but matplotlib contains more functionality,
e.g. handling of images. Typically we import pyplot as plt:
import matplotlib.pyplot as plt
This page is compiled using matplotlib 3.5.1.

4.1.1 Introductory examples


The module provides the basic functions like scatterplot and line plot, both of these
functions should be called with x and y vectors. Here is a demonstration of a simple
scatterplot:

x = np.random.normal(size=50)
y = np.random.normal(size=50)
_ = plt.scatter(x, y)
_ = plt.show()
This small code demonstrates several functions:

• first, we create 50 random dots using numpy

• plt.scatter creates a scatterplot (point plot). It takes arguments x and y for the
horizontal and vertical placement of dots
• plt.scatter returns an object, we may want to assign it to a temporary variable to
avoid printing. We use variable name _ (just underscore) for a value we are not really
storing but just avoiding printing.
• Scatterplot automatically computes the suitable axis range.

• plt.show makes the plot visible. It may not be necessary, depending on the
environment you use. For instance, when you run a notebook cell, it will automatically
make the plot visible at the end. However, if you want to make two plots inside of the
cell, you still need to call plt.show to tell matplotlib that you are done with the first plot
and now it is time to show it.
• Finally, plt.show also returns an object, and we assign it to a temporary variable to
avoid printing.
Next, here is another simple example of line plot:

x = np.linspace(-5, 5, 100)
y = np.sin(x)
_ = plt.plot(x, y)
_ = plt.show()
Most of the functionality should be clear by now, but here are a few notes:

• The first lines create a linear sequence of 100 numbers between -5 and 5, and
compute sin of these numbers.

• Line plots are done using plt.plot, it also takes arguments x and y.

4.1.2 Tuning plots


Matplotlib offers a number of arguments and additional functions to improve the look
of the plots. Below we demonstrate a few:

x = np.random.normal(size=50)
y = np.random.normal(size=50)
_ = plt.scatter(x, y,
color="red", # dot color
edgecolor="black",
alpha=0.5 # transparency
)
_ = plt.xlabel("x") # axis labels
_ = plt.ylabel("y")
_ = plt.title("Random dots") # main label
_ = plt.xlim(-5, 5) # axis limits
_ = plt.ylim(-5, 5)
_ = plt.show()
Most of the features demonstated above are obvious from the code and comments.
However, some explanations are still needed:

• Argument color denotes dot color when specified as color name, like “red” or “black”.
There is also another way to specify colors, c, see below.
• Alpha denotes transparency with alpha=0 being completely transparent (invisible)
and alpha=1 being completely oblique (default).

• All the additional functions return an object that we store into a temporary variable in
order to avoid printing.

• All the additional functions in plt are executed before the actual plot is drawn on
screen. In particular, despite we specify the axis limits after plt.scatter, they still
apply to the scatterplot.
Sometimes we want to make color of the dots dependent of another variable. In this
case we can use argument c instead of color:
x = np.random.normal(size=50)
y = np.random.normal(size=50)
z = np.random.choice([1,2,3], size=50)
_ = plt.scatter(x, y,
c=z # color made of variable "z"
)
_ = plt.show()
Now the dots are of different color, depending on the value of z. Note that the values
submitted to c argument must be numbers, strings will not work.

4.1.3 Histograms
Histograms are a quick and easy way to get an overview of 1-D data distributions.
These can be plotted using plt.hist. As hist returns bin data, one may want to
assign the result into a temporary variable to avoid spurious printing in ipython -based
environments (such as notebooks):
x = np.random.normal(size=1000)
_ = plt.hist(x)
_ = plt.show()
Not surprisingly, the histogram of normal random variables looks like, well, a normal
curve.

We may tune the picture somewhat using arguments bins to specify the desired
number of bins, and make bins more distinct by specifying edgecolor:
_ = plt.hist(x, bins=30, edgecolor="w")
_ = plt.show()
Introduction to Tkinter Module in Python

Tkinter module in Python is a standard library in Python used for creating Graphical
User Interface (GUI) for Desktop Applications. With the help
of Tkinter developing desktop applications is not a tough task.

The Tkinter module in Python is a good way to start creating simple projects in Python

Before starting with Tkinter you should have basic knowledge of Python.

The Tkinter library provides us with a lot of built-in widgets (also called Tk widgets or
Tk interface) that can be used to create different desktop applications.

What is Tkinter?

Tkinter in Python helps in creating GUI Applications with minimum hassle. Among
various GUI Frameworks, Tkinter is the only framework that is built-in into Python's
Standard Library.

• An important feature in favor of Tkinter is that it is cross-platform, so the same


code can easily work on Windows, macOS, and Linux.
• Tkinter is a lightweight module.
• It comes as part of the standard Python installation, so you don't have to install it
separately.
• It supports a lot of built-in widgets that can be used directly to create desktop
applications.

What are Tcl, Tk, and Tkinter?

Let's try to understand more about the Tkinter module by discussing more about its
origin.

• Tkinter is based upon the Tk toolkit, which was originally designed for the Tool
Command Language (Tcl). As Tk is very popular thus it has been ported to
a variety of other scripting languages, including Perl (Perl/Tk), Ruby
(Ruby/Tk), and Python (Tkinter).
• The wide variety of widgets, portability, and flexibility of Tk makes it the right
tool which can be used to design and implement a wide variety of simple and
complex projects.
• Python with Tkinter provides a faster and more efficient way to build useful
desktop applications that would have taken much time if you had to program
directly in C/C++ with the help of native OS system libraries.

How to install Tkinter?

Tkinter comes as part of standard Python installation. So if you have installed the
latest version of Python, then you do not have to do anything else.

If you do not have Python installed on your system - Install Python (the current version is 3.11.4 at the
time of writing this article) first, and then check for Tkinter.

You can determine whether Tkinter is available for your Python interpreter by
attempting to import the Tkinter module.

import tkinter

If Tkinter is available, then there will be no errors, otherwise, you will see errors in the
console.

Using Tkinter to Create Desktop Applications

The basic steps of creating a simple desktop application using the Tkinter module in
Python are as follows:

1. First of all, import the Tkinter module.


2. The second step would be to create a basic window for the desktop application.
3. Then you can add different GUI components to the window and functionality to
these components or widgets.
4. Then enter the main event loop using mainloop() function to run the Desktop
application.

So let's write some code and create a basic desktop application using the Tkinter
module in Python.
Hello World tkinter Example

• When you create a desktop application, the first thing that you will have to do is
create a new window for the desktop application.
• The main window object is created by the Tk class in Tkinter.
• Once you have a window, you can text, input fields, buttons, etc. to it.

Here's a code example,

import tkinter as tk

win = tk.Tk()

win.title('Hello World!')

# you can add widgets here

win.mainloop()
The window may look different depending on the operating system.

Tkinter Methods used above:

The two main methods that are used while creating desktop applications in Python are:

1. Tk( )

The syntax for the Tk() method is:

Tk(screenName=None, baseName=None, className='Tk', useTk=1)

This method is used to create the main window.

This is how you can use it, just like in the Hello World code example,

win = tkinter.Tk() ## where win indicates name of the main


window object

2. The mainloop() Function

This method is used to start the application. The mainloop() function is an infinite
loop that is used to run the application.

It will wait for events to occur and process the events as long as the window is not
closed.

Common Tkinter Widgets

Here are some of the commonly used Tkinter widgets:

1. Button: A simple button that can be clicked to perform an action.


2. Label: This is used to display simple text.
3. Entry: Simple input field to take input from the user. This is used for single-line
input.
4. Text: This can be used for multi-line text fields, to show text, or to take input
from the user.
5. Canvas: This widget can be used for drawing shapes, images, and custom
graphics.
6. Frame: This acts as a container widget that can be used to group other widgets
together.
7. Checkbutton: A checkbox widget used for creating toggle options on or off.
8. Radiobutton: This is used to create Radio buttons.
9. Listbox: This is used to create a List widget used for displaying a list of values.
10. Scrollbar: This widget is used for scrolling content in other widgets like Text and
Listbox.

Here's a simple example where we have used some of the widgets,

import tkinter as tk

root = tk.Tk()

root.title("Tkinter World")

label = tk.Label(root, text="Hello, Studytonight!")

label.pack()

entry = tk.Entry(root)

entry.pack()

button = tk.Button(root, text="Click Me!")

button.pack()
root.mainloop()

Let's create two simple applications using Tkinter: a greet app and a counter app.

1. Greet App

This app will have a text entry where the user can input their name and a button to display a
greeting message.

Explanation:

1. Import tkinter and messagebox: We import the necessary modules.


2. Define the greet function: This function retrieves the name from the entry widget and
displays a greeting message using a message box.
3. Create the main window (root): This is the main window of the application.
4. Create a label: This widget instructs the user to enter their name.
5. Create a text entry: This widget allows the user to input their name.
6. Create a greet button: When clicked, this button calls the greet function.
7. Run the application: root.mainloop() starts the Tkinter event loop.
2. Counter App

This app will have a label displaying a counter and two buttons to increment and decrement the
counter.
Explanation:

1. Import tkinter: We import the necessary module.


2. Define the increment function: This function increments the global counter variable
and updates the counter label.
3. Define the decrement function: This function decrements the global counter variable
and updates the counter label.
4. Initialize the counter: We start the counter at 0.
5. Create the main window (root): This is the main window of the application.
6. Create a label to display the counter: This widget displays the current value of the
counter.
7. Create increment and decrement buttons: These buttons call the increment and
decrement functions when clicked.
8. Run the application: root.mainloop() starts the Tkinter event loop.

Creating an advanced calculator using Tkinter involves implementing


more complex functionalities, such as handling various operations
(addition, subtraction, multiplication, division), support for parentheses,
and perhaps even more advanced mathematical functions like
trigonometric operations, logarithms, etc. Below is a step-by-step guide
to creating an advanced calculator using Tkinter.

Advanced Calculator with Tkinter


Explanation

1. Import tkinter and math modules: Import necessary modules for the GUI and
mathematical functions.
2. Calculator Class: Create a class Calculator to manage the calculator's functionalities.
o Initialization: Initialize the calculator, create input and button frames, and add
buttons.
o create_input_frame: Create the input frame where the expression is displayed.
o create_buttons_frame: Create the frame for the buttons.
o add_buttons: Define the buttons and add them to the button frame.
o create_button: Create individual buttons and place them in the grid.
o on_button_click: Define the behavior when a button is clicked, handling
operations, special functions, and updating the expression.
3. Main Loop: Instantiate the Calculator class and start the Tkinter main loop.

You might also like