02-04-2017
Introduction
to
Data Science
Disclaimer: This material is protected under copyright act AnalytixLabs ©, 2011-2016. Unauthorized use and/ or duplication of this material or any part of this material
including data, in any form without explicit and written permission from AnalytixLabs is strictly prohibited. Any violation of this copyright will attract legal actions
What is Data Science?
“To gain insights into data through
computation, statistics, and visualization.”
Quora Threads for Expert Definitions
What is Data Science?
What does a Data Scientist do?
Data Science is a Process
Ask an interesting question
Get the data
Explore the data
Model the data
Communicate and visualize your results
Data Science is Multidisciplinary
The Scientific Method (wiki)
Programming
Databases
Statistics
Machine Learning
Domain Knowledge
Science Paradigm
Why Data Science?
• “The ability to take data – to be able to understand it, to process it, to
extract value from it, to visualize it, to communicate it – that's going to be a
hugely important skill in the next decades, not only at the professional
level but even at the educational level for elementary school kids, for
high school kids, for college kids. Because now we really do have
essentially free and ubiquitous data.”
– Hal Varian
Who is a Data Scientist?
“A data scientist… excels at analyzing data, particularly large amounts of data, to
help a business gain a competitive edge.”
“The analysis of data using the scientific method”
“A data scientist is an individual, organization or application that performs
statistical analysis, data mining and retrieval processes on a large amount of data
to identify trends, figures and other relevant information.”
WHO’S A DATA SCIENTIST
• “A data scientist is someone who knows more statistics than a
computer scientist and more computer science than a statistician.”
- Josh Blumenstock
“Data Scientist = statistician + programmer + coach +
storyteller + artist”
- Shlomo Argamon
Who’s a Data Scientist?
What does a Data Scientist Do?
OSEMN Things!
Obtain data
Scrub data
Explore data
Build Models
iNterpret results
BUILD DATA PRODUCTS (DESCRIPTIVE, PREDICTIVE, PRESCRIPTIVE): tools built with data to inform decision making
Hence the acronym
O-S-E-M-N
(pronounced, ‘awesome’)
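The OSEMN steps above can be sketched in a few lines of plain Python. This is only an illustration under made-up assumptions: the tiny CSV text, the column names `hours` and `score`, and the helper `fit_line` are invented for the example and are not part of the course material.

```python
import csv
import io
import statistics

# Obtain: a made-up CSV held in memory (a real project would read a file or an API).
RAW = """hours,score
1,52
2,55
3,
4,61
5,64
"""

def fit_line(xs, ys):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, my - slope * mx

rows = list(csv.DictReader(io.StringIO(RAW)))

# Scrub: drop rows with a missing score and convert text to numbers.
clean = [(float(r["hours"]), float(r["score"])) for r in rows if r["score"].strip()]
hours = [h for h, _ in clean]
scores = [s for _, s in clean]

# Explore: a quick summary of what is left.
print("rows kept:", len(clean), "mean score:", statistics.mean(scores))

# Model: fit a straight line to the cleaned data.
slope, intercept = fit_line(hours, scores)

# iNterpret: state the result in plain language.
print(f"each extra hour is associated with about {slope:.2f} more points")
```

Each stage is a single line or two here; in practice the Obtain and Scrub stages usually take most of the effort.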
.. And This
Hypothesis Testing
Data Visualization
Machine Learning
Parallel Computing
Deep Learning
Database Querying
Coding
Optimization
Key Concepts
use many data sources
understand how the data were collected (sampling is essential)
weight the data thoughtfully (not all polls are equally good)
use statistical models (not just hacking around in Excel)
understand correlations (e.g., states that trend similarly)
think like a Bayesian, check like a frequentist (reconciliation)
have good communication skills (What does a 60% probability even mean?)
visualize, validate, and understand the conclusions
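One common reading of "a 60% probability" is a long-run frequency: of many events that each have a 60% chance, roughly 60 in every 100 should occur. The sketch below simulates that reading with the standard `random` module; the function name `simulate_frequency` and the trial count are illustrative choices, not something taken from the slides.

```python
import random

def simulate_frequency(p, trials, seed=0):
    """Fraction of `trials` simulated events that occur when each has probability p."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    hits = sum(rng.random() < p for _ in range(trials))
    return hits / trials

freq = simulate_frequency(0.6, 100_000)
print(f"observed frequency: {freq:.3f}")
```

With only a handful of trials the observed fraction can sit far from 0.6, which is one reason a single outcome says little about whether a forecast was good.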
Common Challenges
Big (massive) data (millions of users, billions of events)
curse of dimensionality (hundreds of variables)
missing data (not missing at random)
need to avoid overfitting (test data vs. training data)
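The last point, test data versus training data, can be illustrated with a small NumPy sketch. The data are made up (a straight line plus noise) and the polynomial degrees 1 and 9 are arbitrary choices. A more flexible model always fits the training points at least as well, and it typically does worse on held-out points, although that depends on the data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up data: a straight line plus noise.
x = np.linspace(0, 1, 30)
y = 2.0 * x + rng.normal(scale=0.3, size=x.size)

# Hold out every third point as test data; fit only on the rest.
test_mask = np.arange(x.size) % 3 == 0
x_train, y_train = x[~test_mask], y[~test_mask]
x_test, y_test = x[test_mask], y[test_mask]

def mse(degree):
    """Train and test mean squared error of a polynomial fit of the given degree."""
    coeffs = np.polyfit(x_train, y_train, degree)
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    return float(train_err), float(test_err)

for degree in (1, 9):
    train_err, test_err = mse(degree)
    print(f"degree {degree}: train MSE {train_err:.3f}, test MSE {test_err:.3f}")
```

Comparing the two columns of output is the basic overfitting check: a model whose training error keeps falling while its test error does not is memorising noise.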
Common Tasks
data munging/scraping/sampling/cleaning in order to get an
informative, manageable data set;
data storage and management in order to be able to access data
quickly and reliably during subsequent analysis;
exploratory data analysis to generate hypotheses and intuition
about the data;
prediction based on statistical tools such as regression,
classification, clustering, forecasting and optimization; and
communication of results through visualization, stories, and
interpretable summaries.
Tools for the course
Tools for the course - Python
Python Is IOSEMN
Inquire
Obtain
Scrub
Explore
Model
iNterpret
Python Data Science Ecosystem
Packages - Data Manipulation
NumPy: Low level array operations
Data tables and in-memory manipulation
Dask: Parallel out-of-core array manipulation
High level interface for databases and different computational backends
Packages - Modelling
SciPy: FFTs, integration, other general algorithms
Statistical distributions and tests
Machine Learning pipelines
PyMC3: Bayesian Probabilistic Programming
Packages - Visualisation
IPython Notebooks
Widely used and powerful plotting package
seaborn: Opinionated but beautiful data visualisations
Bokeh: Interactive plotting with server option
Graphics API with translation between languages (e.g. Python -> D3)
Packages - Description
• NumPy
NumPy is a low level library written in C (and FORTRAN) for high level mathematical functions. NumPy cleverly
overcomes the problem of running slower algorithms on Python by using multidimensional arrays and functions that
operate on arrays. Any algorithm can then be expressed as a function on arrays, allowing the algorithms to be run
quickly.
NumPy is part of the SciPy project, and is released as a separate library so people who only need the basic
requirements can use it without installing the rest of SciPy.
Older NumPy releases supported Python versions 2.4 through to 2.7.2 and 3.1+; current releases require Python 3.
• SciPy
SciPy is a library that uses NumPy for more mathematical functions. SciPy uses NumPy arrays as the basic data
structure, and comes with modules for various commonly used tasks in scientific programming, including linear
algebra, integration (calculus), ordinary differential equation solving and signal processing.
• Numba
Numba is a NumPy aware Python compiler (just-in-time (JIT) specializing compiler) which compiles annotated Python
(and NumPy) code to LLVM (Low Level Virtual Machine) through special decorators. Briefly, Numba uses a system that
compiles Python code with LLVM to code which can be natively executed at runtime.
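The NumPy idea described above, expressing an algorithm as operations on whole arrays, can be sketched as follows. The two function names are illustrative. Both compute the same root mean square, one with an explicit Python loop and one with array operations that run in compiled code.

```python
import numpy as np

values = np.arange(10_000, dtype=np.float64)

def rms_loop(data):
    """Root mean square with an explicit Python loop (slow for large inputs)."""
    total = 0.0
    for v in data:
        total += v * v
    return (total / len(data)) ** 0.5

def rms_vectorized(data):
    """The same computation expressed as whole-array operations."""
    return float(np.sqrt(np.mean(data ** 2)))

print(rms_loop(values), rms_vectorized(values))
```

The two results agree up to floating-point rounding; on large arrays the vectorized version is usually much faster because the loop happens inside NumPy rather than in the interpreter.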
• scikit-learn
scikit-learn is a Python module for machine learning built on top of SciPy and distributed under
the 3-Clause BSD license.
• Pandas
Pandas is a data manipulation library based on NumPy which provides many useful functions for
accessing, indexing, merging and grouping data easily. The main data structure (DataFrame) is
close to what could be found in the R statistical package; that is, heterogeneous data tables with
name indexing, time series operations and auto-alignment of data.
• Matplotlib
Matplotlib is a flexible plotting library for creating interactive 2D and 3D plots that can also be
saved as manuscript-quality figures. The API in many ways reflects that of MATLAB, easing
transition of MATLAB users to Python. Many examples, along with the source code to re-create
them, are available in the matplotlib gallery.
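The pandas operations mentioned above (indexing, merging and grouping) can be sketched on two tiny tables. The tables, column names and values are made up for illustration, and the example assumes pandas is installed.

```python
import pandas as pd

# Two small made-up tables: orders and the customers who placed them.
orders = pd.DataFrame({
    "customer_id": [1, 2, 1, 3],
    "amount": [20.0, 35.5, 12.5, 8.0],
})
customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "city": ["Delhi", "Mumbai", "Delhi"],
})

# Merging: attach each order's city by joining on the shared key.
merged = orders.merge(customers, on="customer_id", how="left")

# Grouping: total order amount per city.
totals = merged.groupby("city")["amount"].sum()
print(totals)

# Indexing: select rows by a condition on a named column.
large = merged[merged["amount"] > 15]
print(large)
```

The DataFrame keeps columns of different types (integers, floats, strings) side by side, which is the "heterogeneous data table" described above.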
• Rpy2
Rpy2 is a Python binding for the R statistical package allowing the execution of R
functions from Python and passing data back and forth between the two environments.
Rpy2 is the object oriented implementation of the Rpy bindings.
• PsychoPy
PsychoPy is a library for cognitive scientists allowing the creation of cognitive psychology
and neuroscience experiments. The library handles presentation of stimuli, scripting of
experimental design and data collection.
• datetime (or) time
Date and time functions to manage date and time data
• math
Core math functions and the constants like pi, e etc.
• pickle
Serializes objects to file
• os (or) os.path
Operating system interfaces.
• re
A library of Perl-like regular expression operations
• string
Useful constants and classes related to strings.
• sys
System parameters and functions
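A short tour of the standard-library modules listed above, one small call per module. The sample date, file path and strings are made up for illustration.

```python
import datetime
import math
import os.path
import pickle
import re
import string
import sys

# datetime: arithmetic on dates.
start = datetime.date(2017, 1, 30)
one_week_later = start + datetime.timedelta(days=7)

# math: core functions and constants such as pi.
circle_area = math.pi * 2.0 ** 2

# pickle: serialize an object to bytes and restore it.
record = {"name": "sample", "values": [1, 2, 3]}
restored = pickle.loads(pickle.dumps(record))

# os.path: build and split file paths portably.
path = os.path.join("data", "raw", "sales.csv")
extension = os.path.splitext(path)[1]

# re: find every run of digits in a string.
numbers = re.findall(r"\d+", "order 12 shipped on day 345")

# string: ready-made character constants.
vowels = [c for c in string.ascii_lowercase if c in "aeiou"]

# sys: information about the running interpreter.
print(sys.version_info.major, one_week_later, extension, numbers, vowels)
```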
Who is using Python?
Why should I become a Data Scientist?
DEMAND & SUPPLY
"We project a need for 1.5 million additional managers and analysts in the United States who can ask the
right questions and consume the results of the analysis of Big Data effectively."
"A significant constraint on realizing value from Big Data will be a shortage of talent, particularly of people
with deep expertise in statistics and machine learning, and the managers and analysts who know how to
operate companies by using insights from Big Data."
Big data: The next frontier for innovation, competition, and productivity, McKinsey report
"By 2018 the United States will experience a shortage of 190,000 skilled data scientists, and 1.5 million
managers and analysts capable of reaping actionable insights from the big data deluge."
Game changers: Five opportunities for US growth and renewal, McKinsey report
OK. So how do I become a Data Scientist?
Read books on
- Statistics
- Machine Learning
- Programming
- Databases
Take University courses
Apply for internships to work on real-life projects
Spend hours debugging on StackOverflow
Participate in Data Hackathons/Data Driven competitions
What is Python?
• Programming language
• You write instructions to the computer
• Python “interpreter” runs those instructions
Why Python?
• It’s awesome and popular!
• Free and open source language.
• Readable syntax.
• Great for interactive work.
• Easy to learn and has an active community.
• Large number of libraries.
• High level, general purpose.
• Backed by fast C & Fortran numerical libraries.
Python - Applications
Python is a powerful multi-paradigm computer programming language. With Python, we can do many things.
Below are some of the things that can be achieved using Python.
Systems Programming: Python’s built-in interfaces to operating-system services make it ideal for writing
portable, maintainable system-administration tools and utilities (sometimes called shell tools). Python
programs can search files and directory trees, etc.
GUIs: Python’s simplicity and rapid turnaround make it a good match for graphical user
interface programming on the desktop. Python comes with a standard object-oriented interface to the Tk GUI
API called tkinter (Tkinter in 2.X) that allows Python programs to implement portable GUIs with a native look
and feel.
Internet Scripting: Python comes with standard Internet modules that allow Python programs to perform a
wide variety of networking tasks in client and server modes.
Database Programming: For traditional database demands, there are Python interfaces to all commonly used
relational database systems like Sybase, Oracle, Informix, ODBC, MySQL, PostgreSQL, SQLite, and more.
Rapid Prototyping: To Python programmers the components written in Python and C look the same. Because
of this, it’s possible to prototype systems in Python initially, and then move selected components to a compiled
language such as C or C++ for delivery.
Numeric and Scientific Programming: Python is also heavily used in numeric programming, a domain that would
not traditionally have been considered to be in the scope of scripting languages, but has grown to become one of
Python’s most compelling use cases.
Some well-known users of Python:
Google makes extensive use of Python in its web search systems.
The popular YouTube video sharing service is largely written in Python.
The Dropbox storage service codes both its server and desktop client software primarily in
Python.
The Raspberry Pi single-board computer promotes Python as its educational language.
The widespread BitTorrent peer-to-peer file sharing system began its life as a Python program.
Industrial Light & Magic, Pixar, and others use Python in their production of animated movies.
Google’s App Engine web development framework uses Python as an application language.
Is Python a Scripting Language?
Python is a general-purpose programming language that is often applied in scripting roles. It is
commonly defined as an object-oriented scripting language, a definition that blends support for OOP
with an overall orientation toward scripting roles.
A scripting language or script language is a programming language that supports scripts, programs
written for a special run-time environment that can interpret (rather than compile) and automate the
execution of tasks that could alternatively be executed one-by-one by a human operator. Python falls
into this category, which is why it is often called a scripting language.
Still, the term ‘scripting’ seems to have stuck to Python like glue. This may be because people often use
the word ‘script’ instead of ‘program’ to describe a Python code file.
How to get Python & Anaconda?
How to get Python?
There are two ways to get Python.
Base Python:
You can download Python from www.python.org/downloads.
Once it has downloaded, run the installer to install Python.
Ensure that you have pip installed. pip is the package manager for Python and lets you easily
install the third-party packages that you'll need to perform data science tasks.
ANACONDA:
Free enterprise-ready cross platform Python distribution for large-scale data processing, predictive analytics,
and scientific computing.
Download here http://docs.continuum.io/anaconda/install
Features
https://www.continuum.io/why-anaconda
We use ANACONDA for our sessions
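After installing either base Python or Anaconda, a quick way to confirm the setup is to ask the interpreter which packages it can find. The sketch below uses only the standard library. The package list is an assumption based on the libraries named in this deck, and the helper name `installed` is illustrative.

```python
import importlib.util
import sys

# Packages discussed in this deck; adjust the list to your own needs.
PACKAGES = ["numpy", "scipy", "pandas", "matplotlib", "sklearn"]

def installed(name):
    """True if the named top-level package can be found by the import system."""
    return importlib.util.find_spec(name) is not None

print("Python", sys.version.split()[0])
for name in PACKAGES:
    status = "ok" if installed(name) else "missing"
    print(f"{name:12s} {status}")
```

A package reported as missing can then be added with pip or, in an Anaconda environment, with conda.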
Why ANACONDA?
Python for big data
Streaming
Pig UDFs in Jython
HADOOPY
Anaconda for Big Data
Remote Conda commands
Anaconda for Analytics
How to Run Python Code?
Three ways to run Python from the terminal window:
• type ‘python’
• type ‘ipython’
• type ‘python helloworld.py’
A fourth way is to use an IDE or notebook environment such as Jupyter, Spyder, PyCharm, Canopy, or
Rodeo.
We are using Jupyter notebook as part of our training.
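For the third option, `python helloworld.py`, the file could look like the sketch below. Only the file name comes from the slide; the contents, including the `greeting` function and the optional command-line argument, are illustrative.

```python
# helloworld.py (the file name used above; the contents are illustrative)
import sys

def greeting(name="world"):
    """Build the message that the script prints."""
    return f"Hello, {name}!"

if __name__ == "__main__":
    # An optional command-line argument replaces the default name,
    # e.g. `python helloworld.py Asha`.
    name = sys.argv[1] if len(sys.argv) > 1 else "world"
    print(greeting(name))
```

The `if __name__ == "__main__":` guard means the print only happens when the file is run as a script, so the same file can also be imported from `python` or `ipython` without side effects.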
Introduction to IPython Notebook
IPython Notebook
One of Python’s most useful features is its interactive interpreter.
It allows for very fast testing of ideas without the overhead of creating test files as is typical in most programming
languages. However, the interpreter supplied with the standard Python distribution is somewhat limited for extended
interactive use.
IPython:
A comprehensive environment for interactive and exploratory computing.
Three main components:
An enhanced interactive Python shell.
A decoupled two-process communication model, which allows for multiple clients to connect to a computation kernel, most notably the web-based notebook.
An architecture for interactive parallel computing.
Some of the many useful features of IPython include:
Command history, which can be browsed with the up and down arrows on the keyboard.
Tab auto-completion.
In-line editing of code.
Object introspection, and automatic extraction of documentation strings from Python objects like classes and functions.
Good interaction with the operating system shell.
Support for multiple parallel back-end processes, that can run on computing clusters or cloud services like Amazon EC2.
IPython provides a rich architecture for interactive computing with:
A powerful interactive shell.
A kernel for Jupyter.
Easy to use, high performance tools for parallel computing.
Support for interactive data visualization and use of GUI toolkits.
Flexible, embeddable interpreters to load into your own projects.
Beyond the Terminal ...
The REPL (read, eval, print loop) as a Network Protocol
Kernels: execute code
Clients: read input, present output
Simple abstractions enable rich, sophisticated clients
The Four Most Helpful Commands
The four most helpful commands are shown to you in a banner every time you start IPython:
Command Description
? Introduction and overview of IPython’s features.
%quickref Quick reference.
help Python’s own help system.
object? Details about object, use object?? for extra details.
Tab Completion:
Tab completion, especially for attributes, is a convenient way to explore the structure of any object you’re dealing
with. Simply type object_name.<TAB> to view the object’s attributes. Besides Python objects and keywords, tab
completion also works on file and directory names.
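The `object?` and `object_name.<TAB>` features are IPython syntax and will not run in a plain Python file. As a rough analogue only, the standard library exposes the same information through `inspect` and `dir`. The `scale` function below is a made-up example.

```python
import inspect

# Roughly what `object?` shows first: the docstring of an object.
doc = inspect.getdoc(str.upper)
print(doc)

# Roughly what `object_name.<TAB>` offers: the public attribute names.
public_attrs = [name for name in dir(str) if not name.startswith("_")]
print(public_attrs[:5])

# The call signature, which `object?` also prints for functions.
def scale(value, factor=2):
    """Multiply value by factor."""
    return value * factor

signature = str(inspect.signature(scale))
print("scale" + signature)
```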
The %run magic command allows you to run any python script and load all of its data directly into the interactive
namespace. Since the file is re-read from disk each time, changes you make to it are reflected immediately (unlike
imported modules, which have to be specifically reloaded). IPython also includes dreload, a recursive reload function.
%run has special flags for timing the execution of your scripts (-t), or for running them under the control of either
Python’s pdb debugger (-d) or profiler (-p).
The %edit command gives a reasonable approximation of multiline editing, by invoking your favorite editor on the spot.
IPython will execute the code you type in there as if it were typed interactively.
Magic Functions ...
The following examples show how to call the builtin %timeit magic, both in line and cell mode:
The builtin magics include:
Functions that work with code: %run, %edit, %save, %macro, %recall, etc.
Functions which affect the shell: %colors, %xmode, %autoindent, %automagic, etc.
Other functions such as %reset, %timeit, %%writefile, %load, or %paste.
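The `%timeit` magic mentioned above only works inside IPython or Jupyter. The sketch below shows the magic syntax in comments and a plain-Python counterpart using the standard `timeit` module. The statements being timed are arbitrary examples.

```python
import timeit

# In IPython, line mode times a single statement:
#     %timeit sum(range(1000))
# Cell mode uses the first line as untimed setup and times the rest of the cell:
#     %%timeit x = list(range(10000))
#     max(x)

# Plain-Python counterpart of the line-mode call.
line_seconds = timeit.timeit("sum(range(1000))", number=1000)

# Plain-Python counterpart of the cell-mode call: `setup` runs once, untimed.
cell_seconds = timeit.timeit("max(x)", setup="x = list(range(10000))", number=1000)

print(f"line: {line_seconds:.4f}s  cell: {cell_seconds:.4f}s for 1000 runs each")
```

Unlike the magic, `timeit.timeit` does not choose the number of repetitions for you, so `number` is set explicitly here.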
Exploring your Objects
Typing object_name? will print all sorts of details about any object, including docstrings, function definition lines (for
call arguments) and constructor details for classes.
To get specific information on an object, you can use the magic commands %pdoc, %pdef, %psource and %pfile.
Magic Functions:
IPython has a set of predefined magic functions that you can call with a command line style syntax.
There are two kinds of magics, line-oriented and cell-oriented.
Line magics are prefixed with the % character and work much like OS command-line calls: they get as an argument the
rest of the line, where arguments are passed without parentheses or quotes.
Cell magics are prefixed with a double %%, and they are functions that get as an argument not only the rest of the line,
but also the lines below it in a separate argument.
With automagic enabled (the default in IPython), line magics can be called without the leading %, so you can simply type run script.py. You can toggle this behaviour by running the %automagic magic.
A more detailed explanation of the magic system can be obtained by calling %magic.
To see all the available magic functions, call %lsmagic.
System Shell Commands:
To run any command at the system shell, simply prefix it with !. You can capture the output into a Python list. To pass the
values of Python variables or expressions to system commands, prefix them with $.
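The `!` prefix and `$` substitution are IPython syntax. Outside IPython, the standard `subprocess` module does a similar job. The sketch below is a plain-Python analogue, not IPython's own mechanism. It runs the current Python interpreter as the external command so that it behaves the same on any operating system.

```python
import subprocess
import sys

# In IPython you might write:   files = !ls
# or pass a Python value:       !echo $message
# Outside IPython, subprocess covers the same ground.

message = "hello from Python"

# Run a command, passing a Python value as an argument, and capture its output.
result = subprocess.run(
    [sys.executable, "-c", "import sys; print(sys.argv[1])", message],
    capture_output=True,
    text=True,
    check=True,
)

# Like IPython's captured output, split the text into a list of lines.
lines = result.stdout.splitlines()
print(lines)
```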
System Aliases:
It’s convenient to have aliases to the system commands you use most often.
This allows you to work seamlessly from inside IPython with the same commands you are used to in your system shell.
IPython comes with some pre-defined aliases and a complete system for changing directories, both via a stack (%pushd,
%popd and %dhist) and via direct %cd.
The latter keeps a history of visited directories and allows you to go to any previously visited one.
History
IPython stores both the commands you enter, and the results it produces. You can easily go through previous
commands with the up- and down-arrow keys, or access your history in more sophisticated ways.
Input and output history are kept in variables called In and Out, keyed by the prompt numbers. The last three objects in
output history are also kept in variables named _, __ and ___.
You can use the %history magic function to examine past input and output. Input history from previous sessions is saved
in a database, and IPython can be configured to save output history.
Several other magic functions can use your input history, including %edit, %rerun, %recall, %macro, %save and
%pastebin.
You can use a standard format to refer to lines, for example:
%pastebin 3 18-20 ~1/1-5
This will take line 3 and lines 18 to 20 from the current session, and lines 1-5 from the previous session.
Debugging
After an exception occurs, you can call %debug to jump into the Python debugger (pdb) and examine the problem.
Alternatively, if you call %pdb, IPython will automatically start the debugger on any uncaught exception.
You can print variables, see code, execute statements and even walk up and down the call stack to track down the true
source of the problem. This can be an efficient way to develop and debug code, in many cases eliminating the need for
print statements or external debugging tools.
You can also step through a program from the beginning by calling %run -d theprogram.py.
Introduction to Jupyter
Jupyter Notebook
Jupyter – Getting Started
Exercise
1. Launch a new Jupyter (IPython) notebook and save the code
2. Practice assignment statements (create basic variables and perform mathematical operations)
3. Create a markdown file with some descriptions
4. Practice shortcuts (using the Jupyter cheat sheet), such as:
   1. Run the code
   2. Insert/delete a cell
   3. Add/remove the output and line numbers
   4. Merge/split cells
   5. Toggle between different types of code formats and comments
   6. etc.
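As a starting point for step 2 of the exercise, a notebook cell might contain assignment statements like these. The variable names and values are made up; any numbers will do.

```python
# Assignment statements bind a name to a value.
price = 250          # int
quantity = 3         # int
tax_rate = 0.18      # float

# Basic arithmetic on those variables.
subtotal = price * quantity
total = subtotal * (1 + tax_rate)

# Other operators worth trying: divmod gives the integer quotient and remainder,
# and ** raises to a power.
boxes, loose = divmod(17, 5)
squared = quantity ** 2

print(subtotal, round(total, 2), boxes, loose, squared)
```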
Introduction to Canopy
Canopy: Integrated Analysis Environment
Canopy is a comprehensive Python analysis environment that provides easy installation of the core
scientific and analytic Python packages, creating a robust platform on which you can explore,
develop, and visualize. In addition to its pre-built, tested Python distribution, Canopy has valuable
tools for iterative data analysis, visualization and application development including:
• One-Click Python Package Deployment with a Graphical Package Manager
• Code Editor with IPython Notebook Support
• Interactive Graphical Python Code Debugger
• Integrated IPython Prompt
• Convenient Documentation Browser
• Python for Excel with PyXLL (add-on)
• Integration with the Intel MKL and Microsoft Python Tools for Visual Studio
Download Canopy from https://store.enthought.com/downloads/#default
Create IPython Notebook in Canopy
In Canopy, go to File -> New -> Jupyter (IPython) Notebook
Name the new notebook and click OK
The same notebook will open in a web browser
When you open a new IPython Notebook, an IPython interactive cell with
the prompt In[ ]: to the left, appears. You can type code into this cell just as
you would in the IPython shell of the Canopy window.
Contact us
Visit us on: http://www.analytixlabs.in/
For course registration, please visit: http://www.analytixlabs.co.in/course-registration/
For more information, please contact us: http://www.analytixlabs.co.in/contact-us/
Or email: info@analytixlabs.co.in
Call us, we would love to speak with you: (+91) 88021-73069
Join us on:
Twitter - http://twitter.com/#!/AnalytixLabs
Facebook - http://www.facebook.com/analytixlabs
LinkedIn - http://www.linkedin.com/in/analytixlabs
Blog - http://www.analytixlabs.co.in/category/blog/