
02-04-2017

Introduction to Data Science

Disclaimer: This material is protected under copyright law, AnalytixLabs ©, 2011-2016. Unauthorized use and/or duplication of this material or any part of it, including data, in any form without explicit written permission from AnalytixLabs is strictly prohibited. Any violation of this copyright will attract legal action.


What is Data Science?

“To gain insights into data through computation, statistics, and visualization.”

Quora Threads for Expert Definitions
• What is Data Science?
• What does a Data Scientist do?

Data Science is a Process

• Ask an interesting question
• Get the data
• Explore the data
• Model the data
• Communicate and visualize your results


Data Science is Multidisciplinary

• The Scientific Method (wiki)
• Programming
• Databases
• Statistics
• Machine Learning
• Domain Knowledge


Science Paradigm

Why Data Science?

• “The ability to take data – to be able to understand it, to process it, to extract value from it, to visualize it, to communicate it – that's going to be a hugely important skill in the next decades, not only at the professional level but even at the educational level for elementary school kids, for high school kids, for college kids. Because now we really do have essentially free and ubiquitous data.”

– Hal Varian


Who is a Data Scientist?

“A data scientist… excels at analyzing data, particularly large amounts of data, to help a business gain a competitive edge.”

“The analysis of data using the scientific method”

“A data scientist is an individual, organization or application that performs statistical analysis, data mining and retrieval processes on a large amount of data to identify trends, figures and other relevant information.”

WHO’S A DATA SCIENTIST?

• “A data scientist is someone who knows more statistics than a computer scientist and more computer science than a statistician.”
- Josh Blumenstock

• “Data Scientist = statistician + programmer + coach + storyteller + artist”
- Shlomo Argamon



What does a Data Scientist Do?

OSEMN Things!
• Obtain data
• Scrub data
• Explore data
• Build Models
• iNterpret results

Hence the acronym O-S-E-M-N (pronounced ‘awesome’).

BUILD DATA PRODUCTS (descriptive, predictive, prescriptive): tools built with data to inform decision making.
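The OSEMN steps above can be sketched end to end in a few lines of standard-library Python. This is only an illustration under stated assumptions: the CSV data, the column names (hours, score) and the helper function names are all hypothetical and not part of the course material.

```python
import csv
import io

# Hypothetical raw data: hours studied vs. exam score, with two dirty rows.
RAW_CSV = """hours,score
1,52
2,55
,61
4,64
5,abc
6,71
"""


def obtain(text):
    """Obtain: parse CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def scrub(rows):
    """Scrub: keep only rows where both fields parse as numbers."""
    clean = []
    for row in rows:
        try:
            clean.append((float(row["hours"]), float(row["score"])))
        except ValueError:
            continue  # drop rows with missing or non-numeric values
    return clean


def explore(points):
    """Explore: compute simple summary statistics."""
    n = len(points)
    return {
        "n": n,
        "mean_hours": sum(x for x, _ in points) / n,
        "mean_score": sum(y for _, y in points) / n,
    }


def model(points):
    """Model: fit score = intercept + slope * hours by least squares."""
    stats = explore(points)
    mx, my = stats["mean_hours"], stats["mean_score"]
    sxy = sum((x - mx) * (y - my) for x, y in points)
    sxx = sum((x - mx) ** 2 for x, _ in points)
    slope = sxy / sxx
    return slope, my - slope * mx


def interpret(slope, intercept):
    """iNterpret: turn the fitted numbers into a sentence."""
    return (f"Each extra hour is associated with about {slope:.1f} more "
            f"points (baseline {intercept:.1f}).")


points = scrub(obtain(RAW_CSV))
slope, intercept = model(points)
print(explore(points))
print(interpret(slope, intercept))
```

Each step is a separate function so that any one of them (for example the scrubbing rule) can be changed without touching the others.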


... And This

• Hypothesis Testing
• Data Visualization
• Machine Learning
• Parallel Computing
• Deep Learning
• Database Querying
• Coding
• Optimization

Key Concepts

• use many data sources
• understand how the data were collected (sampling is essential)
• weight the data thoughtfully (not all polls are equally good)
• use statistical models (not just hacking around in Excel)
• understand correlations (e.g., states that trend similarly)
• think like a Bayesian, check like a frequentist (reconciliation)
• have good communication skills (What does a 60% probability even mean?)
• visualize, validate, and understand the conclusions


Common Challenges

• Big (massive) data (millions of users, billions of events)
• curse of dimensionality (hundreds of variables)
• missing data (not missing at random)
• need to avoid overfitting (test data vs. training data)
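The last challenge, overfitting, is easiest to see by scoring a model on data it was trained on and on data it has never seen. The sketch below is an illustration only: the synthetic data, the 20% label noise and the 1-nearest-neighbour "memorising" model are assumptions chosen to make the gap visible, not part of the course material.

```python
import random


def make_data(n, noise, rng):
    """Build n points on [0, 1); the true label is x > 0.5, flipped with probability `noise`."""
    xs = [i / n for i in range(n)]  # unique x values
    rng.shuffle(xs)
    data = []
    for x in xs:
        label = x > 0.5
        if rng.random() < noise:
            label = not label  # inject label noise
        data.append((x, label))
    return data


def predict_1nn(train, x):
    """Predict the label of the closest training point (pure memorisation)."""
    nearest = min(train, key=lambda point: abs(point[0] - x))
    return nearest[1]


def accuracy(train, points):
    """Fraction of `points` whose label the 1-NN model predicts correctly."""
    hits = sum(predict_1nn(train, x) == y for x, y in points)
    return hits / len(points)


rng = random.Random(0)
data = make_data(300, noise=0.2, rng=rng)
train, test = data[:200], data[200:]  # hold out the last 100 points

train_acc = accuracy(train, train)  # model has memorised these points
test_acc = accuracy(train, test)    # honest estimate on unseen points
print(f"training accuracy: {train_acc:.2f}")
print(f"test accuracy:     {test_acc:.2f}")
```

The training score is perfect because every training point is its own nearest neighbour. Only the held-out score says anything about how the model would behave on new data.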

Common Tasks

• data munging/scraping/sampling/cleaning in order to get an informative, manageable data set;
• data storage and management in order to be able to access data quickly and reliably during subsequent analysis;
• exploratory data analysis to generate hypotheses and intuition about the data;
• prediction based on statistical tools such as regression, classification, clustering, forecasting and optimization; and
• communication of results through visualization, stories, and interpretable summaries.


Tools for the course


Tools for the course - Python


Python Is IOSEMN

• Inquire
• Obtain
• Scrub
• Explore
• Model
• iNterpret

Python Data Science Ecosystem


Packages - Data Manipulation
• NumPy: Low level array operations
• Data tables and in-memory manipulation
• Dask: Parallel out-of-core array manipulation
• High level interface for databases and different computational backends

Packages - Modelling
• SciPy: FFTs, integration, other general algorithms
• Statistical distributions and tests
• Machine Learning pipelines
• PyMC3: Bayesian Probabilistic Programming

Packages - Visualisation
• Widely used and powerful plotting package
• seaborn: Opinionated but beautiful data visualisations
• Bokeh: Interactive plotting with server option
• Graphics API with translation between languages (e.g. Python -> D3)

IPython Notebooks


Packages - Description
• NumPy
NumPy is a low-level library written in C (and Fortran) that provides high-level mathematical functions. It works around the slowness of loops in pure Python by providing multidimensional arrays and functions that operate on whole arrays. An algorithm expressed as operations on arrays then runs at the speed of compiled code.
NumPy is part of the SciPy ecosystem and is released as a separate library, so people who only need the basics can use it without installing the rest of SciPy.
When these slides were written, NumPy supported Python 2.4 through 2.7.2 and 3.1+; current NumPy releases support Python 3 only.

• SciPy
SciPy is a library that builds on NumPy to provide further mathematical functions. SciPy uses NumPy arrays as its basic data structure, and comes with modules for various commonly used tasks in scientific programming, including linear algebra, integration (calculus), ordinary differential equation solving and signal processing.

• Numba
Numba is a NumPy aware Python compiler (just-in-time (JIT) specializing compiler) which compiles annotated Python
(and NumPy) code to LLVM (Low Level Virtual Machine) through special decorators. Briefly, Numba uses a system that
compiles Python code with LLVM to code which can be natively executed at runtime.
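The NumPy idea described above, expressing an algorithm as operations on whole arrays instead of element-by-element Python loops, can be sketched as follows. The standardisation (z-score) task and the function names are illustrative assumptions; the example needs NumPy installed.

```python
import numpy as np


def standardize_loop(values):
    """Pure-Python version: explicit loops over every element."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / n
    std = variance ** 0.5
    return [(v - mean) / std for v in values]


def standardize_array(values: np.ndarray) -> np.ndarray:
    """NumPy version: the same formula written as whole-array operations."""
    return (values - values.mean()) / values.std()


data = np.arange(1.0, 6.0)  # array([1., 2., 3., 4., 5.])
loop_result = standardize_loop(data.tolist())
array_result = standardize_array(data)

print(array_result)
print(np.allclose(loop_result, array_result))
```

Both functions compute the same values. The array version hands the arithmetic to NumPy's compiled routines, which is where the speed advantage on large arrays comes from.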

Packages - Description
• scikit-learn
scikit-learn is a Python module for machine learning built on top of SciPy and distributed under
the 3-Clause BSD license.

• Pandas
pandas is a data manipulation library built on NumPy that provides many useful functions for accessing, indexing, merging and grouping data easily. Its main data structure, the DataFrame, is close to the data frame of the R statistical language: a heterogeneous data table with name indexing, time series operations and automatic alignment of data.

• Matplotlib
Matplotlib is a flexible plotting library for creating interactive 2D and 3D plots that can also be
saved as manuscript-quality figures. The API in many ways reflects that of MATLAB, easing
transition of MATLAB users to Python. Many examples, along with the source code to re-create
them, are available in the matplotlib gallery.
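The pandas operations named above (indexing, merging and grouping) can be sketched on two tiny tables. The table contents and column names are made up for illustration, and the example assumes pandas is installed.

```python
import pandas as pd

# Hypothetical data: one row per order, and one row per customer.
orders = pd.DataFrame({
    "customer": ["ann", "bob", "ann", "cy"],
    "amount": [10.0, 25.0, 5.0, 40.0],
})
regions = pd.DataFrame({
    "customer": ["ann", "bob", "cy"],
    "region": ["north", "south", "north"],
})

# Indexing: boolean mask selecting orders above a threshold.
big_orders = orders[orders["amount"] > 8]

# Merging: attach each customer's region to their orders.
merged = orders.merge(regions, on="customer", how="left")

# Grouping: total order amount per region.
totals = merged.groupby("region")["amount"].sum()

print(big_orders)
print(totals)
```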


Packages - Description
• Rpy2
Rpy2 is a Python binding for the R statistical package allowing the execution of R
functions from Python and passing data back and forth between the two environments.
Rpy2 is the object oriented implementation of the Rpy bindings.

• PsychoPy
PsychoPy is a library for cognitive scientists allowing the creation of cognitive psychology
and neuroscience experiments. The library handles presentation of stimuli, scripting of
experimental design and data collection.

Packages - Description
• datetime (or) time
Date and time functions to manage date and time data
• math
Core math functions and the constants like pi, e etc.
• pickle
Serializes objects to file
• os (or) os.path
Operating system interfaces.
• re
A library of Perl-like regular expression operations
• string
Useful constants and classes related to strings.
• sys
System parameters and functions
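A short tour of the standard-library modules listed above. The specific values (dates, strings, file names) are arbitrary examples chosen for illustration.

```python
import datetime
import math
import os.path
import pickle
import re
import string
import sys

# datetime: date arithmetic
start = datetime.date(2017, 1, 31)
next_day = start + datetime.timedelta(days=1)

# math: core functions and constants
circle_area = math.pi * 2 ** 2

# pickle: serialise an object to bytes and restore it
# (pickle.dump / pickle.load do the same with an open file)
record = {"name": "ann", "scores": [1, 2, 3]}
restored = pickle.loads(pickle.dumps(record))

# os.path: build and split file paths portably
path = os.path.join("data", "sales.csv")
file_name = os.path.basename(path)

# re: regular expressions
numbers = re.findall(r"\d+", "3 apples and 12 pears")

# string: useful constants
first_letters = string.ascii_lowercase[:3]

# sys: information about the running interpreter
major_version = sys.version_info.major

print(next_day, round(circle_area, 2), restored, file_name,
      numbers, first_letters, major_version)
```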


Who is using Python?

Why should I become a Data Scientist?

DEMAND & SUPPLY

"We project a need for 1.5 million additional managers and analysts in the United States who can ask the
right questions and consume the results of the analysis of Big Data effectively."

"A significant constraint on realizing value from Big Data will be a shortage of talent, particularly of people
with deep expertise in statistics and machine learning, and the managers and analysts who know how to
operate companies by using insights from Big Data."

Big data: The next frontier for innovation, competition, and productivity, McKinsey report

"By 2018 the United States will experience a shortage of 190,000 skilled data scientists, and 1.5 million
managers and analysts capable of reaping actionable insights from the big data deluge."

Game changers: Five opportunities for US growth and renewal, McKinsey report


OK. So how do I become a Data Scientist?

Read books on
- Statistics
- Machine Learning
- Programming
- Databases

Take University courses

Apply for internships to work on real-life projects

Spend hours debugging on StackOverflow

Participate in Data Hackathons/Data Driven competitions


What is Python?

• Programming language

• You write instructions to the computer

• Python “interpreter” runs those instructions


Why Python?

• It’s awesome and popular!
• Free and open source language.
• Readable syntax.
• Great for interactive work.
• Easy to learn, with an active community.
• Large number of libraries.
• High level, general purpose.
• Backed by fast C and Fortran numerical libraries.


Python - Applications
Python is a powerful multi-paradigm programming language that can be used for many kinds of work. Below are some of the things that can be achieved using Python.

• Systems Programming: Python’s built-in interfaces to operating-system services make it ideal for writing portable, maintainable system-administration tools and utilities (sometimes called shell tools). Python programs can search files and directory trees, etc.

• GUIs: Python’s simplicity and rapid turnaround make it a good match for graphical user interface programming on the desktop. Python comes with a standard object-oriented interface to the Tk GUI API called tkinter (Tkinter in 2.X) that allows Python programs to implement portable GUIs with a native look and feel.

• Internet Scripting: Python comes with standard Internet modules that allow Python programs to perform a wide variety of networking tasks in client and server modes.

• Database Programming: For traditional database demands, there are Python interfaces to all commonly used relational database systems like Sybase, Oracle, Informix, ODBC, MySQL, PostgreSQL, SQLite, and more.
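As a small illustration of the database programming point above, the sketch below uses the sqlite3 module from the standard library with an in-memory database. The table name, columns and rows are hypothetical.

```python
import sqlite3


def top_customers(rows, limit=2):
    """Load (customer, amount) rows into SQLite and return the biggest spenders."""
    conn = sqlite3.connect(":memory:")  # throwaway in-memory database
    try:
        conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
        # Parameter placeholders (?) keep data separate from the SQL text.
        conn.executemany("INSERT INTO orders VALUES (?, ?)", rows)
        cursor = conn.execute(
            "SELECT customer, SUM(amount) AS total "
            "FROM orders GROUP BY customer "
            "ORDER BY total DESC LIMIT ?",
            (limit,),
        )
        return cursor.fetchall()
    finally:
        conn.close()


sample_rows = [("ann", 10.0), ("bob", 25.0), ("ann", 5.0), ("cy", 40.0)]
print(top_customers(sample_rows))
```

Interfaces to other relational databases follow the same connect / execute / fetch pattern, since most Python database drivers implement the common DB-API.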


Python - Applications
• Rapid Prototyping: To Python programmers, components written in Python and C look the same. Because of this, it’s possible to prototype systems in Python initially, and then move selected components to a compiled language such as C or C++ for delivery.

• Numeric and Scientific Programming: Python is also heavily used in numeric programming, a domain that would not traditionally have been considered to be in the scope of scripting languages, but has grown to become one of Python’s most compelling use cases.

Google makes extensive use of Python in its web search systems.

Other use cases are as follows:

• The popular YouTube video sharing service is largely written in Python.
• The Dropbox storage service codes both its server and desktop client software primarily in Python.
• The Raspberry Pi single-board computer promotes Python as its educational language.
• The widespread BitTorrent peer-to-peer file sharing system began its life as a Python program.
• Industrial Light & Magic, Pixar, and others use Python in their production of animated movies.
• Google’s App Engine web development framework uses Python as an application language.


Is Python a Scripting Language?

• Python is a general-purpose programming language that is often applied in scripting roles. It is commonly defined as an object-oriented scripting language, a definition that blends support for OOP with an overall orientation toward scripting roles.

• A scripting language or script language is a programming language that supports scripts: programs written for a special run-time environment that can interpret (rather than compile) and automate the execution of tasks that could alternatively be executed one-by-one by a human operator. Python falls into this category, so it is called a scripting language.

• Still, the term ‘scripting’ seems to have stuck to Python like glue. This may be because people often use the word ‘script’ instead of ‘program’ to describe a Python code file.


How to get Python & Anaconda?

How to get Python?


There are two ways to get Python.
Base Python:
• Download Python from www.python.org/downloads.
• Once it is downloaded, run the installer.
• Ensure that pip, the package manager for Python, is installed. It lets you easily install the third-party packages you will need for data science tasks.

ANACONDA:
A free, enterprise-ready, cross-platform Python distribution for large-scale data processing, predictive analytics, and scientific computing.

Download here http://docs.continuum.io/anaconda/install

Features
https://www.continuum.io/why-anaconda

We use ANACONDA for our sessions
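After installing by either route, a quick way to confirm that the interpreter and the expected packages are available is a small check script. This is a sketch only: the function name and the default package list are assumptions, so adjust the list to whatever your sessions require.

```python
import importlib.util
import sys


def check_environment(packages=("pip", "numpy", "pandas", "matplotlib")):
    """Report the Python version and whether each named package can be imported."""
    report = {"python": sys.version.split()[0]}
    for name in packages:
        # find_spec returns None when a top-level package is not installed.
        report[name] = importlib.util.find_spec(name) is not None
    return report


for key, value in check_environment().items():
    print(f"{key}: {value}")
```

A package reported as False can usually be added with pip (base Python) or conda (Anaconda).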


Why ANACONDA?


Python for big data

Streaming
Pig UDFs in Jython
HADOOPY

Anaconda for Big Data


Remote Conda commands

Anaconda for Analytics


How to Run Python Code?

Three ways to run the Python interpreter from a terminal window:

• type ‘python’ (standard interactive interpreter)
• type ‘ipython’ (enhanced interactive shell)
• type ‘python helloworld.py’ (run a script file)

A fourth way is to use an IDE or notebook environment such as Jupyter, Spyder, PyCharm, Canopy or Rodeo.

We are using Jupyter notebook as part of our training.
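For the script route, a file such as helloworld.py might contain the following. The slides do not show the file's contents, so this is only an assumed minimal example.

```python
"""helloworld.py: a minimal script, run with `python helloworld.py`."""


def greeting(name: str = "world") -> str:
    """Build the greeting text."""
    return f"Hello, {name}!"


if __name__ == "__main__":
    # This branch runs when the file is executed as a script,
    # but not when it is imported as a module.
    print(greeting())
```

The same function can be tried interactively by typing `python` or `ipython` and then `from helloworld import greeting`.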


Introduction to IPython Notebook

IPython Notebook
One of Python’s most useful features is its interactive interpreter.
It allows for very fast testing of ideas without the overhead of creating test files as is typical in most programming
languages. However, the interpreter supplied with the standard Python distribution is somewhat limited for extended
interactive use.

IPython:
A comprehensive environment for interactive and exploratory computing.

Three Main Components:
• An enhanced interactive Python shell.
• A decoupled two-process communication model, which allows multiple clients to connect to a computation kernel, most notably the web-based notebook.
• An architecture for interactive parallel computing.

Some of the many useful features of IPython include:
• Command history, which can be browsed with the up and down arrows on the keyboard.
• Tab auto-completion.
• In-line editing of code.
• Object introspection, and automatic extraction of documentation strings from Python objects like classes and functions.
• Good interaction with the operating system shell.
• Support for multiple parallel back-end processes that can run on computing clusters or cloud services like Amazon EC2.


IPython Notebook

• IPython provides a rich architecture for interactive computing with:
  - A powerful interactive shell.
  - A kernel for Jupyter.
  - Easy to use, high performance tools for parallel computing.
  - Support for interactive data visualization and use of GUI toolkits.
  - Flexible, embeddable interpreters to load into your own projects.

• Beyond the Terminal ...
  - The REPL (read, eval, print loop) as a Network Protocol
  - Kernels: execute code
  - Clients: read input, present output
  - Simple abstractions enable rich, sophisticated clients

IPython Notebook
The Four Most Helpful Commands
• The four most helpful commands are shown in a banner every time you start IPython:

Command      Description
?            Introduction and overview of IPython’s features.
%quickref    Quick reference.
help         Python’s own help system.
object?      Details about object; use object?? for extra details.

Tab Completion:
• Tab completion, especially for attributes, is a convenient way to explore the structure of any object you’re dealing with. Simply type object_name.<TAB> to view the object’s attributes. Besides Python objects and keywords, tab completion also works on file and directory names.


IPython Notebook
• The %run magic command allows you to run any Python script and load all of its data directly into the interactive namespace. Since the file is re-read from disk each time, changes you make to it are reflected immediately (unlike imported modules, which have to be specifically reloaded). IPython also includes dreload, a recursive reload function.
• %run has special flags for timing the execution of your scripts (-t), or for running them under the control of either Python’s pdb debugger (-d) or profiler (-p).
• The %edit command gives a reasonable approximation of multiline editing, by invoking your favorite editor on the spot. IPython will execute the code you type in there as if it were typed interactively.

Magic Functions ...

• The following examples show how to call the builtin %timeit magic, both in line and cell mode:
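A minimal sketch of both modes. The magic syntax is shown as comments because it only works inside IPython or Jupyter; the runnable part uses the standard library's timeit module, which is assumed here as a plain-Python stand-in and is not the slide's own example. The helper name time_statement is hypothetical.

```python
import timeit

# Line mode in IPython times a single statement on the same line:
#     %timeit sum(range(1000))
#
# Cell mode (first line of a cell) times the whole cell body:
#     %%timeit
#     total = 0
#     for i in range(1000):
#         total += i


def time_statement(stmt: str, number: int = 200) -> float:
    """Return the average seconds per run of `stmt` over `number` runs."""
    return timeit.timeit(stmt, number=number) / number


line_mode = time_statement("sum(range(1000))")
cell_mode = time_statement("total = 0\nfor i in range(1000):\n    total += i")

print(f"one-liner:     {line_mode * 1e6:.1f} microseconds per run")
print(f"explicit loop: {cell_mode * 1e6:.1f} microseconds per run")
```

Unlike this sketch, %timeit chooses the number of runs automatically and reports the spread across repeats.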

The builtin magics include:

• Functions that work with code: %run, %edit, %save, %macro, %recall, etc.
• Functions which affect the shell: %colors, %xmode, %autoindent, %automagic, etc.
• Other functions such as %reset, %timeit, %%writefile, %load, or %paste.

IPython Notebook
Exploring your Objects
• Typing object_name? will print all sorts of details about any object, including docstrings, function definition lines (for call arguments) and constructor details for classes.
• To get specific information on an object, you can use the magic commands %pdoc, %pdef, %psource and %pfile.
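Outside IPython, roughly the same information is available from plain Python through dir() and the standard inspect module. The sketch below is an approximate mapping, not what IPython runs internally, and the area function is a hypothetical example object.

```python
import inspect


def area(width: float, height: float = 1.0) -> float:
    """Return the area of a width-by-height rectangle."""
    return width * height


# Roughly what `area?` and %pdoc show: the docstring.
doc = inspect.getdoc(area)

# Roughly what %pdef shows: the call signature.
signature = str(inspect.signature(area))

# Roughly what <TAB> completion offers: the object's attribute names.
string_methods = [name for name in dir("text") if not name.startswith("_")]

# %psource and %pfile correspond to inspect.getsource(obj) and
# inspect.getsourcefile(obj), which need the object to be defined in a file.

print(doc)
print("area" + signature)
print(string_methods[:5])
```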

Magic Functions:
IPython has a set of predefined magic functions that you can call with a command line style syntax.
There are two kinds of magics, line-oriented and cell-oriented.
• Line magics are prefixed with the % character and work much like OS command-line calls: they get as an argument the rest of the line, where arguments are passed without parentheses or quotes.
• Cell magics are prefixed with a double %%, and they are functions that get as an argument not only the rest of the line, but also the lines below it in a separate argument.
• With automagic enabled (the default), line magics can be called without the leading %, for example run script.py. You can toggle this behaviour by running the %automagic magic.
• A more detailed explanation of the magic system can be obtained by calling %magic.
• To see all the available magic functions, call %lsmagic.


IPython Notebook
System Shell Commands:
To run any command at the system shell, simply prefix it with !. You can capture the output into a Python list. To pass the
values of Python variables or expressions to system commands, prefix them with $.
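The three behaviours described above are shown as comments in IPython syntax, followed by a plain-Python equivalent using the standard subprocess module. The helper name run_lines is hypothetical, and the child command is the Python interpreter itself so that the sketch runs on any operating system.

```python
import subprocess
import sys

# Inside IPython:
#     !ls                 run a command at the system shell
#     files = !ls         capture the output lines into a Python list
#     name = "notes.txt"
#     !echo $name         pass the value of a Python variable to the command


def run_lines(args):
    """Run a command and return its standard output as a list of lines."""
    result = subprocess.run(args, capture_output=True, text=True, check=True)
    return result.stdout.splitlines()


name = "data science"
# The Python variable is interpolated into the command before it runs.
lines = run_lines([sys.executable, "-c", f"print({name!r}.upper())"])
print(lines)
```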

System Aliases:
• It’s convenient to have aliases for the system commands you use most often.
• This allows you to work seamlessly from inside IPython with the same commands you are used to in your system shell.
• IPython comes with some pre-defined aliases and a complete system for changing directories, both via a stack (%pushd, %popd and %dhist) and via direct %cd.
• The latter keeps a history of visited directories and allows you to go to any previously visited one.

System Shell Commands ...

IPython Notebook
History
• IPython stores both the commands you enter and the results it produces. You can easily go through previous commands with the up- and down-arrow keys, or access your history in more sophisticated ways.
• Input and output history are kept in variables called In and Out, keyed by the prompt numbers. The last three objects in output history are also kept in variables named _, __ and ___.
• You can use the %history magic function to examine past input and output. Input history from previous sessions is saved in a database, and IPython can be configured to save output history.
• Several other magic functions can use your input history, including %edit, %rerun, %recall, %macro, %save and %pastebin.

You can use a standard format to refer to lines:

    %pastebin 3 18-20 ~1/1-5

This will take line 3 and lines 18 to 20 from the current session, and lines 1-5 from the previous session.


IPython Notebook
Debugging
• After an exception occurs, you can call %debug to jump into the Python debugger (pdb) and examine the problem. Alternatively, if you call %pdb, IPython will automatically start the debugger on any uncaught exception.
• You can print variables, see code, execute statements and even walk up and down the call stack to track down the true source of the problem. This can be an efficient way to develop and debug code, in many cases eliminating the need for print statements or external debugging tools.
• You can also step through a program from the beginning by calling %run -d theprogram.py.

Introduction to Jupyter


Jupyter Notebook


Jupyter – Getting Started


Exercise
1. Launch a new Jupyter notebook (IPython) and save the code
2. Practice assignment statements (create basic variables and perform mathematical operations)
3. Create a Markdown file with some descriptions
4. Practice shortcuts (using the Jupyter cheat sheet), such as:
   1. run the code
   2. insert/delete a cell
   3. add/remove the output and line numbers
   4. merge/split cells
   5. toggle between different types of code formats and comments
   6. etc.


Introduction to Canopy


Canopy: Integrated Analysis Environment


Canopy is a comprehensive Python analysis environment that provides easy installation of the core scientific and analytic Python packages, creating a robust platform on which you can explore, develop, and visualize. In addition to its pre-built, tested Python distribution, Canopy has valuable tools for iterative data analysis, visualization and application development, including:
• One-Click Python Package Deployment with a Graphical Package Manager
• Code Editor with IPython Notebook Support
• Interactive Graphical Python Code Debugger
• Integrated IPython Prompt
• Convenient Documentation Browser
• Python for Excel with PyXLL (add-on)
• Integration with the Intel MKL and Microsoft Python Tools for Visual Studio

Download Canopy from https://store.enthought.com/downloads/#default


Create IPython Notebook in Canopy


In Canopy, go to File -> New -> Jupyter (IPython) Notebook.
Name the new notebook and click OK.
The same notebook will open in a web browser.
When you open a new IPython Notebook, an IPython interactive cell appears with the prompt In[ ]: to its left. You can type code into this cell just as you would in the IPython shell of the Canopy window.

Contact us

Visit us on: http://www.analytixlabs.in/

For course registration, please visit: http://www.analytixlabs.co.in/course-registration/

For more information, please contact us: http://www.analytixlabs.co.in/contact-us/


Or email: info@analytixlabs.co.in
Call us we would love to speak with you: (+91) 88021-73069

Join us on:
Twitter - http://twitter.com/#!/AnalytixLabs
Facebook - http://www.facebook.com/analytixlabs
LinkedIn - http://www.linkedin.com/in/analytixlabs
Blog - http://www.analytixlabs.co.in/category/blog/
