Module-1
An Introduction to data analysis and Python
programming
Chapter-1
Introduction to data analysis
● Knowledge domains of data
● Understanding the nature of data
● The data analysis process
● Qualitative and quantitative data analysis
Data Analysis
In a world increasingly centralized around information
technology, huge amounts of data are produced and stored each
day. Often these data come from automatic detection systems,
sensors, and scientific instrumentation.
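What are data?
● Raw data are not information, at least in terms of their
form: a formless stream of bytes is hard to interpret at first glance.
● Information is the result of processing the data to extract
conclusions from them. This process of extracting information
from raw data is called data analysis.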
Knowledge Domains of the Data Analyst
● Data analysis is a discipline suited to the study of
problems that may occur in several fields of application.
● Moreover, data analysis includes many tools and
methodologies that require good knowledge of computing,
mathematical, and statistical concepts.
● A good data analyst must be able to move and act in many
different disciplinary areas.
Example: Mathematics and Statistics
Computer science gives you the tools for data analysis, while
statistics provides the concepts that form the basis of data
analysis.
Among the most commonly used statistical techniques in data analysis
are
• Bayesian methods
• Regression
• Clustering
Example: Machine Learning and Artificial Intelligence
● One of the most advanced tools in the data analysis field is
machine learning.
● Machine learning is a discipline that uses a whole series of
procedures and algorithms that analyze the data in order to
recognize patterns and clusters.
● It then extracts useful information for data analysis in an
automated way.
Understanding the Nature of the Data
The data constitute the raw material to be processed. Thanks to
their processing and analysis, it is possible to extract a variety of
information in order to increase the level of knowledge of the
system under study, that is, the system from which the data came.
When the Data Become Information?
Data are the events recorded in the world. Anything that can be
measured or categorized can be converted into data.
Cont..
Types of Data
Data can be divided into two distinct categories:
• Categorical (nominal and ordinal)
• Numerical (discrete and continuous)
Categorical data are subdivided into
1. nominal and
2. ordinal
A nominal variable has no intrinsic order among its
categories. An ordinal variable instead has a predetermined
order.
Cont..
Numerical data are values or observations that come from
measurements.
Numerical data are subdivided into
1. discrete and
2. continuous numbers.
● Discrete values can be counted and are distinct and separated
from each other.
● Continuous values are values produced by measurements or
observations that assume any value within a defined range.
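The four kinds of data described above can be sketched in plain Python. This is only an illustration: the sample values and the names `SIZE_ORDER` and `size_rank` are hypothetical and do not come from the slides.

```python
# Hypothetical sample values, chosen only to illustrate the four kinds of data.
eye_colors = ["brown", "blue", "green", "blue"]   # nominal: labels with no order
sizes = ["small", "large", "medium", "small"]     # ordinal: labels with an order
children = [0, 2, 1, 3]                           # discrete: counted integers
heights_m = [1.62, 1.80, 1.75, 1.68]              # continuous: measured real values

# An ordinal variable needs its order stated explicitly.
SIZE_ORDER = ["small", "medium", "large"]


def size_rank(label):
    """Return the position of an ordinal label in its predefined order."""
    return SIZE_ORDER.index(label)


# Sorting is meaningful for ordinal data ...
sorted_sizes = sorted(sizes, key=size_rank)

# ... while for nominal data only counting per category makes sense.
color_counts = {color: eye_colors.count(color) for color in set(eye_colors)}

print(sorted_sizes)
print(color_counts)
```

Sorting `eye_colors` would also run, but the result would only be alphabetical, which carries no meaning for a nominal variable.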
The Data Analysis Process
Data analysis can be described as a process consisting of
several steps in which the raw data are transformed and
processed in order to produce visualizations of the collected
data.
Data analysis consists of the following sequence of stages:
• Problem definition
• Data extraction
• Data preparation - Data cleaning
• Data preparation - Data transformation
• Data exploration and visualization
Cont..
• Predictive modeling
• Model validation/test
• Deploy - Visualization and interpretation of results
• Deploy - Deployment of the solution
Cont..
Problem Definition
● The process of data analysis actually begins long before
the collection of raw data.
● Data analysis always starts with a problem to be solved,
which needs to be defined.
● Once the problem has been defined and documented, you
can move to the project planning stage of data analysis.
Data Extraction
● Once the problem has been defined, the first step is to
obtain the data in order to perform the analysis.
● The data must be chosen with the basic purpose of
building the predictive model, and so data selection is
crucial for the success of the analysis as well.
● The sample data collected must reflect as much as possible
the real world
Data Preparation
● Among all the steps involved in data analysis, data
preparation, although seemingly less problematic, in fact
requires more resources and more time to be completed.
● Data are often collected from different data sources
● The preparation of the data is concerned with obtaining,
cleaning, normalizing, and transforming data into an
optimized dataset.
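As a rough illustration of cleaning and transforming, the sketch below drops missing entries and then applies min-max normalization. The readings are hypothetical, and min-max scaling is just one common normalization choice, not one named in the slides.

```python
# Hypothetical raw readings from two sources; None and "" mark missing values.
raw = ["12.5", None, "7.0", "", "20.0", "7.0"]

# Cleaning: drop missing entries and convert the remaining text to floats.
cleaned = [float(v) for v in raw if v not in (None, "")]

# Transformation: min-max normalization rescales the values to the range [0, 1].
lo, hi = min(cleaned), max(cleaned)
normalized = [(v - lo) / (hi - lo) for v in cleaned]

print(cleaned)
print(normalized)
```

Real datasets usually need more care, for example deciding whether repeated readings are duplicates to remove or valid observations to keep.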
Data Exploration/Visualization
● Exploring the data involves essentially searching the data
in a graphical or statistical presentation in order to find
patterns, connections, and relationships.
● Data visualization is the best tool to highlight possible
patterns.
● In fact, numerous technologies are utilized exclusively to
display data, and many display types are applied to extract
the best possible information from a dataset.
Cont..
Generally, this phase, in addition to a detailed study of charts
through data visualization, may consist of one or more of
the following activities:
• Summarizing data
• Grouping data
• Exploring the relationship between the various attributes
• Identifying patterns and trends
• Constructing regression models
• Constructing classification models
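The first two activities in this list, summarizing and grouping, can be sketched with the standard library alone. The records below are hypothetical and serve only as an illustration.

```python
from collections import defaultdict
from statistics import mean, median

# Hypothetical records: (city, temperature in degrees Celsius).
records = [
    ("Rome", 21.0),
    ("Oslo", 9.0),
    ("Rome", 25.0),
    ("Oslo", 11.0),
    ("Rome", 23.0),
]

# Summarizing: describe the whole sample with a few numbers.
temps = [t for _, t in records]
summary = {"count": len(temps), "mean": mean(temps), "median": median(temps)}

# Grouping: collect the values by category, then summarize each group.
groups = defaultdict(list)
for city, t in records:
    groups[city].append(t)
group_means = {city: mean(values) for city, values in groups.items()}

print(summary)
print(group_means)
```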
Predictive Modeling
Predictive modeling is a process used in data analysis to create or
choose a suitable statistical model to predict the probability of a
result.
These models are used for two main purposes.
● The first is to make predictions about the data values produced by
the system; in this case, you will be dealing with regression
models.
● The second purpose is to classify new data products, and in this
case, you will be using classification models or clustering models.
Cont..
In fact, it is possible to divide the models according to the
type of result they produce:
• Classification models: If the result obtained by the model
type is categorical
• Regression models: If the result obtained by the model type
is numeric.
• Clustering models: If the result obtained by the model type
is descriptive.
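To make the idea of a regression model concrete, here is a minimal sketch that fits a straight line by ordinary least squares and uses it to predict a numeric value. The observations and the helper names `fit_line` and `predict` are hypothetical; a classification model would return a category label instead of a number.

```python
# Hypothetical observations of an input x and a numeric outcome y.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]


def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


slope, intercept = fit_line(xs, ys)


def predict(x):
    """Regression model: return a numeric prediction for a new x."""
    return slope * x + intercept


print(round(predict(6.0), 2))
```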
Model Validation
● Validation of the model, that is, the test phase, is an important
phase that allows you to validate the model built on the basis
of starting data.
● Generally, you will refer to the data as the training set when
you are using them for building the model, and as the
validation set when you are using them for validating the
model.
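The split into a training set and a validation set can be sketched as follows. The dataset, the 70/30 split ratio, and the very simple model (a slope through the origin) are assumptions made only for illustration.

```python
import random

# Hypothetical dataset of (x, y) pairs that follow y = 2x exactly.
data = [(x, 2 * x) for x in range(10)]

# Shuffle with a fixed seed so the split is reproducible, then hold out 30%.
rng = random.Random(0)
shuffled = data[:]
rng.shuffle(shuffled)
split = int(len(shuffled) * 0.7)
training_set, validation_set = shuffled[:split], shuffled[split:]

# Model: a least-squares slope through the origin, estimated from the
# training set only.
slope = sum(x * y for x, y in training_set) / sum(x * x for x, _ in training_set)

# Validation: mean absolute error on data the model has not seen.
mae = sum(abs(y - slope * x) for x, y in validation_set) / len(validation_set)

print(len(training_set), len(validation_set), mae)
```

Because the hypothetical data are noise-free, the validation error here is zero; with real data the error on the validation set is what tells you how well the model generalizes.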
Deployment
● This is the final step of the analysis process, which
aims to present the results, that is, the conclusions of
the analysis
● There are several ways to deploy the results of data
analysis
Cont..
In the documentation supplied by the analyst, each of these
four topics will be discussed in detail:
• Analysis results
• Decision deployment
• Risk analysis
• Measuring the business impact
Quantitative and Qualitative Data Analysis
● Data analysis is completely focused on data, and two types of
analysis can be distinguished depending on the nature of the data.
● When the analyzed data have a strictly numerical or
categorical structure, you are talking about quantitative
analysis.
● When you are dealing with values that are expressed through
descriptions in natural language, you are talking about
qualitative analysis.
Cont..
Figure 1-2 shows the differences between the two types of analysis.
Chapter-2
Introduction to Python
● Python the programming language
● Python the interpreter
● Python 2 and Python 3
● PyPI
● Introduction to SciPy
Python—The Programming Language
The Python programming language was created by Guido van
Rossum and first released in 1991.
This language can be characterized by a series of adjectives:
• Interpreted
• Portable
• Object-oriented
• Interactive
• Interfaced
• Open source
• Easy to understand and use
Cont…
● Python is an interpreted language
● Unlike with languages such as C, C++, and Java, there is
no separate compile step before a Python program is run.
● Python is a highly portable programming language.
Cont..
• The same Python code runs unchanged on any operating
system (Linux, Windows, or Mac).
• Python is an object-oriented programming language: it
allows you to specify classes of objects and implement
their inheritance.
• Python is an interactive programming language.
• Python is an open-source programming language
Python—The Interpreter
● When the Python interpreter starts, it shows a >>> prompt.
● The Python interpreter is simply a program that reads and
interprets the commands passed to the prompt.
● Each time you press the Enter key, the interpreter begins to
scan the code (either a row or a full file of code) token by
token (called tokenization).
Cont..
● These tokens are fragments of text that the interpreter arranges
in a tree structure.
● The tree obtained is the logical structure of the program, which
is then converted to bytecode (.pyc or .pyo).
● The process chain ends with the bytecode that will be executed
by a Python virtual machine (PVM). See Figure 2-1.
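The chain described above (tokens, tree, bytecode, execution) can be observed from Python itself. The sketch below uses the standard-library modules `tokenize` and `ast` and the built-ins `compile` and `eval`; these expose analogous stages for inspection and are not necessarily the exact routines the interpreter calls internally. The expression and the value of `a` are hypothetical.

```python
import ast
import io
import tokenize

source = "a * 2 + 1\n"

# 1. Tokenization: split the text into tokens.
tokens = [
    tok.string
    for tok in tokenize.generate_tokens(io.StringIO(source).readline)
    if tok.type in (tokenize.NAME, tokenize.NUMBER, tokenize.OP)
]

# 2. Parsing: arrange the tokens in a tree structure.
tree = ast.parse(source.strip(), mode="eval")

# 3. Compilation: convert the tree to a bytecode object.
code = compile(tree, filename="<sketch>", mode="eval")

# 4. Execution: the virtual machine runs the bytecode.
result = eval(code, {"a": 20})

print(tokens)
print(type(tree.body).__name__)
print(result)
```

The standard-library `dis` module can additionally print the bytecode instructions of `code`; the exact instructions vary between Python versions.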
• The Cython project is based on creating a compiler that
translates Python code into C.
• In parallel to Cython, there is a version totally built and
compiled in Java, named Jython
• The PyPy interpreter uses a JIT (just-in-time) compiler, which
converts the Python code directly into machine code at runtime.
Python 2 and Python 3
• When Guido van Rossum decided to make significant changes to
the Python language, he soon found that these changes would
make the new version incompatible with a lot of existing
code.
• Thus he decided to start with a new version of Python called
Python 3.0.
• To overcome the problem of incompatibility and avoid
creating huge amounts of unusable code, it was decided to
maintain a compatible version (the Python 2 series).
• Python 3.0 made its first appearance in 2008
Difference Between Python 2 and 3
Python 2
• Python 2 was released in the year 2000.
• In Python 2, print is a statement and not a function.
• In Python 2, strings are stored as ASCII by default.
• Python 2 has not been officially supported since 2020.
• Python 2 was mostly used in DevOps engineering.
Python 3
• Python 3 was released in the year 2008.
• In Python 3, print is a function and not a statement.
• In Python 3, strings are stored as Unicode by default.
• Python 3 is more popular than Python 2 and is still in use
today.
• Python 3 is used in many fields, such as software
engineering and data science.
Example Code
• Python 2:
def main():
    print "Hi! This is Python 2"
if __name__ == "__main__":
    main()
• Python 3:
def main():
    print("Hi! This is Python 3")
if __name__ == "__main__":
    main()
Installing Python
• In order to develop programs in Python you have to install it
on your operating system.
• Linux distributions and MacOS X machines should already
have a preinstalled version of Python.
• If not, or if you would like to replace that version with
another, you can easily install it
Cont..
Anaconda
• Anaconda is a free distribution of Python packages distributed
by Continuum Analytics (https://www.anaconda.com).
• This distribution supports Linux, Windows, and MacOS X
operating systems.
• Anaconda provides the latest packages released in the Python
ecosystem.
Run an Entire Program
• The best way to become familiar with Python is to write an
entire program and then run it from the terminal.
• First write a program using a simple text editor.
• For example, you can use the code in Listing 2-1 and save it as
MyFirstProgram.py.
Listing 2-1. MyFirstProgram.py
myname = input("What is your name? ")
print("Hi " + myname + ", I'm glad to say: Hello world!")
Cont..
• Now you’ve written your first program in Python, and you can
run it directly from the command line:
python MyFirstProgram.py
What is your name? Fabio Nelli
Hi Fabio Nelli, I'm glad to say: Hello world!
Make Calculations
• You have already seen that the print() function is useful for
printing almost anything.
Cont..
• Start a session on the Python shell and begin to perform these
mathematical operations:
>>> 1 + 2
3
>>> (1.045 * 3)/4
0.78375
>>> 4 ** 2
16
Import New Libraries and Functions
• You saw that Python is characterized by the ability to extend its
functionality by importing numerous packages and modules.
• To import a module in its entirety, you have to use the import
command.
>>> import math
Cont..
• To call a function of an imported module, prefix it with the
module name:
library_name.function_name()
• For example, you can now calculate the sine of the value
contained in the variable a.
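>>> math.sin(a)
0.040693257349864856
• A function imported by name can be called without the module
prefix:
>>> from math import sin
>>> sin(a)
0.040693257349864856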
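The slides do not show where the variable a is defined. A self-contained version of the sine example might look like the sketch below, assuming a = 40.8, a value chosen because it is consistent with the result printed in the slide.

```python
import math              # import the whole module
from math import sin     # import a single function by name

a = 40.8  # assumed value; the slides do not show where a is defined

prefixed = math.sin(a)   # call with the module name as a prefix
direct = sin(a)          # direct call, possible because of the from-import

print(prefixed, direct)
```

Both calls reach the same function, so the two results are identical.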
Data Structure
• Python provides a number of extremely useful data structures.
• These data structures are able to contain lots of data
Cont..
>>> dict = {'name':'William', 'age':25, 'city':'London'}
• If you want to access a specific value within the dictionary,
you have to indicate the name of the associated key.
>>> dict["name"]
'William'
• If you want to iterate over the key-value pairs of a
dictionary, you can use a for loop with the items() method.
Cont..
>>> list = [1,2,3,4]
>>> list[2]
3
>>> list[1:3]
[2, 3]
>>> list[-1]
4
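The dictionary and list operations above can be combined in a short script. The slides name their variables dict and list; the sketch below uses the hypothetical names `person` and `numbers` instead so that the built-in `dict` and `list` types are not shadowed.

```python
# Same data as in the slides, under names that do not shadow built-in types.
person = {"name": "William", "age": 25, "city": "London"}

# items() yields (key, value) pairs, which a for loop can unpack.
pairs = []
for key, value in person.items():
    pairs.append(f"{key} = {value}")

numbers = [1, 2, 3, 4]
third = numbers[2]        # indexing starts at 0
middle = numbers[1:3]     # slice: start index included, end index excluded
last = numbers[-1]        # negative indices count from the end

print(pairs)
print(third, middle, last)
```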
Indentation
• Python indentation assumes an integral role in the
implementation of the code, by dividing it into logical blocks.
Cont..
● In Java, C, and C++, each statement is terminated by a
semicolon (;) and blocks are delimited by braces.
● In Python, you do not specify any symbol to separate
statements; indentation alone defines the logical blocks.
Example:
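A minimal sketch of how indentation defines logical blocks; the function name `classify` and the input values are hypothetical.

```python
def classify(values):
    """Split numbers into a list of even values and a list of odd values."""
    evens, odds = [], []
    for v in values:          # the indented lines below form the loop body
        if v % 2 == 0:        # one level deeper: the body of the if
            evens.append(v)
        else:
            odds.append(v)
    return evens, odds        # back at function level: runs after the loop


evens, odds = classify([1, 2, 3, 4, 5])
print(evens, odds)
```

Moving the `return` line one level to the right would place it inside the loop, and the function would return after the first value.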
IPython
• IPython provides the IPython shell, a powerful interactive
shell that greatly enhances the Python terminal.
• It also provides the IPython Notebook, a web interface that
allows you to mix text, executable code, graphics, and formulas
in a single representation.
• As you can see, a particular prompt appears with the value In
[1]. This means that it is the first line of input.
Example:
In [1]: print("Hello World!")
Hello World!
In [2]: 3/2
Out[2]: 1.5
Cont..
Jupyter Notebook
● Jupyter Notebook is the latest evolution of this interactive
environment
● In fact, with Jupyter Notebook, you can merge executable code,
text, formulas, images, and animations into a single Web
document.
Cont..
PyPI—The Python Package Index
• The Python Package Index (PyPI) is a software repository
that hosts the packages needed for programming in
Python, for example, the packages that make up other Python
libraries.
• The content repository is managed directly by the developers
of individual packages that deal with updating the repository
with the latest versions of their released libraries.
Cont..
● The official page of PyPI is at https://pypi.python.org/pypi
● Packages are managed with the pip application. By launching
it from the command line, you can manage all the packages and
individually decide if a package should be installed,
upgraded, or removed.
SciPy
• SciPy is a set of open-source Python libraries specialized for
scientific computing.
• Among the libraries that are part of the SciPy group, there
are three in particular that are discussed in the following
chapters:
• NumPy
• Pandas
• matplotlib
Cont..
NumPy
● The name of this library stands for Numerical Python.
● NumPy is the foundation library for scientific computing in
Python
● It provides a high-performance multidimensional array object,
and tools for working with these arrays.
Cont..
This package provides some features that extend standard
Python:
● ndarray
● Element-wise computation
● Reading-writing datasets
● Integration with other languages
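The first two features, the ndarray and element-wise computation, can be sketched as follows. The example assumes NumPy is installed and imported under its conventional alias `np`; the array values are hypothetical.

```python
import numpy as np

# ndarray: a multidimensional array whose elements share a single type.
a = np.array([[1, 2, 3], [4, 5, 6]])

# Element-wise computation: operators apply to every element without a loop.
doubled = a * 2
squared = a ** 2
total = a + squared

print(a.shape)
print(total)
```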
Cont..
Pandas
● This package provides complex data structures and functions
specifically designed to make the work on them easy, fast, and
effective.
● The fundamental concept of this package is the DataFrame, a
two-dimensional tabular data structure with row and column
labels.
● Pandas builds on the high-performance arrays of the NumPy
library.
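A DataFrame with row and column labels might be built as in the sketch below. It assumes pandas is installed and imported under its conventional alias `pd`; the data, the column names, and the row labels are hypothetical.

```python
import pandas as pd

# Hypothetical data; column names and row labels are made up for illustration.
df = pd.DataFrame(
    {"age": [25, 32], "city": ["London", "Rome"]},
    index=["William", "Anna"],
)

age_of_william = df.loc["William", "age"]   # select by row and column label
mean_age = df["age"].mean()                 # computation over a whole column

print(df)
print(age_of_william, mean_age)
```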
Cont..
matplotlib
● matplotlib is the Python library that is currently most popular for
producing plots and other data visualizations in 2D.
● Since data analysis requires visualization tools, this is the library
that best suits this purpose.
● Matplotlib is a comprehensive library for creating static,
animated, and interactive visualizations in Python.