Introduction to Data Science
Data Science
“You can have data without information, but you cannot have information
without data.”
- Daniel Keys Moran
1
Reference Book:
Data Science from Scratch by Joel Grus
2
Outline
◉ What is Data Science?
◉ Tools / Languages
◉ Getting Data
◉ Linear Algebra
◉ Statistics & Probability
◉ Visualizing Data
3
1. What is Data Science?
“Data! Data! Data!” he cried impatiently. “I can’t make bricks
without clay.”
—Arthur Conan Doyle
4
[Figure: Venn diagram placing Data Science at the intersection of Hacking Skills, Math and Statistics Knowledge, and Substantive Expertise]
5
Data Scientist?
❖ “Someone who knows more statistics than a computer scientist and more computer science than a statistician”
❖ Someone who extracts insights from messy data
6
2. Tools / Languages
People are still crazy about Python after twenty-five years, which I find hard
to believe.
—Michael Palin
9
Tools / Languages
❖ R
❖ Python
❖ Matlab
❖ SQL
❖ Excel
❖ Java
❖ SAS (Statistical Analysis System)
❖ SPSS (Modeler and Analytics)
❖ Hadoop (distributed file system and computing)
10
Python
❖ Easy
❖ Python 2.7
❖ Libraries for data mining:
    NumPy
    SciPy
    pandas
    Matplotlib
    scikit-learn
11
3. Getting Data
To write it, it took three months; to conceive it, three minutes;
to collect the data in it, all my life.
—F. Scott Fitzgerald
12
Different ways of getting data
◉ stdin and stdout
◉ Reading files
◉ Scraping the web
◉ Using APIs
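The first two approaches in the list above can be sketched in a few lines of Python. This is only a minimal illustration: the function names `count_matching_lines` and `read_nonempty_lines`, the sample records, and the temporary file are all hypothetical and do not come from the slides.

```python
import re
import tempfile


def count_matching_lines(lines, pattern):
    """Count the lines (from any iterable of strings) that match a regex."""
    regex = re.compile(pattern)
    return sum(1 for line in lines if regex.search(line))


def read_nonempty_lines(path):
    """Read a text file and return its non-empty lines without the newline."""
    with open(path, "r", encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f if line.strip()]


# Write a small made-up file so the sketch is self-contained.
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("alice,34\n\nbob,27\n")
    demo_path = tmp.name

demo_lines = read_nonempty_lines(demo_path)
print(demo_lines)
print(count_matching_lines(demo_lines, r"\d+"))
```

Because `count_matching_lines` accepts any iterable of strings, passing `sys.stdin` instead of a list lets the same function work in a shell pipeline.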
13
Using the Twitter API
◉ Python 2.7
◉ Python Twitter libraries (Birdy, TwitterAPI, Twitter search, Twython)
◉ Twython
    pip install twython
◉ Go to https://apps.twitter.com/.
◉ Click Create New App.
◉ Click “Create my access token.”
◉ Run SearchAPI.py
14
4. Linear Algebra
Is there anything more useless or less useful than Algebra?
—Billy Connolly
15
Vectors
❖ Vectors are points in some finite-dimensional space
❖ A good way to represent numeric data
❖ The simplest from-scratch approach is to represent vectors as lists of numbers
Example: If you have the heights, weights, and ages of a large number of people, you can treat your data as three-dimensional vectors (height, weight, age)
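As a rough sketch of the list-of-numbers representation, the helpers below implement a few common vector operations in plain Python. The function names and the two sample people are illustrative assumptions, not taken from the slides.

```python
import math


def vector_add(v, w):
    """Add two vectors of the same length element by element."""
    return [v_i + w_i for v_i, w_i in zip(v, w)]


def scalar_multiply(c, v):
    """Multiply every element of v by the number c."""
    return [c * v_i for v_i in v]


def dot(v, w):
    """Sum of the element-wise products of v and w."""
    return sum(v_i * w_i for v_i, w_i in zip(v, w))


def distance(v, w):
    """Euclidean distance between v and w."""
    return math.sqrt(sum((v_i - w_i) ** 2 for v_i, w_i in zip(v, w)))


# Hypothetical people as (height, weight, age) vectors.
person_a = [70, 170, 40]
person_b = [67, 160, 36]

print(vector_add(person_a, person_b))
print(scalar_multiply(2, person_a))
print(dot(person_a, person_b))
print(distance(person_a, person_b))
```

Plain lists keep the idea transparent; for real workloads a library such as NumPy is usually much faster.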
16
Matrices
❖ A matrix is a two-dimensional collection of numbers.
❖ We can represent matrices as lists of lists
❖ We can use a matrix to represent a data set consisting of multiple vectors
Example: If you had the heights, weights, and ages of 1,000 people, you could put them in a 1,000 × 3 matrix
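A minimal sketch of the list-of-lists representation follows, using three made-up people instead of 1,000 so the output stays readable. The helper names `shape`, `get_row`, and `get_column` are illustrative choices.

```python
def shape(matrix):
    """Return (number of rows, number of columns) of a list-of-lists matrix."""
    num_rows = len(matrix)
    num_cols = len(matrix[0]) if matrix else 0
    return num_rows, num_cols


def get_row(matrix, i):
    """Row i is simply the i-th inner list."""
    return matrix[i]


def get_column(matrix, j):
    """Column j is the j-th element of every row."""
    return [row[j] for row in matrix]


# Hypothetical data: each row is one person as (height, weight, age).
people = [
    [70, 170, 40],
    [65, 120, 26],
    [77, 250, 19],
]

print(shape(people))
print(get_row(people, 1))
print(get_column(people, 2))
```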
17
Linear Algebra + Data Science
Some data mining applications use huge matrices to extract useful information from large, often unstructured, data sets.
Example: Search engines extract information from the Web pages available on the Internet. The core of the Google search engine is a matrix computation.
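The slide does not name the matrix computation. Assuming it refers to a PageRank-style ranking, the sketch below runs a simple power iteration over a tiny made-up link graph; each pass is equivalent to multiplying the rank vector by a link matrix. The function name, the damping factor of 0.85, and the three-page graph are all assumptions for illustration.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Power iteration; links[i] lists the pages that page i links to."""
    n = len(links)
    ranks = [1.0 / n] * n  # start from a uniform distribution
    for _ in range(iterations):
        # Every page receives a small baseline share of rank.
        new_ranks = [(1.0 - damping) / n] * n
        for page, outgoing in enumerate(links):
            if outgoing:
                # Split this page's rank evenly among the pages it links to.
                share = damping * ranks[page] / len(outgoing)
                for target in outgoing:
                    new_ranks[target] += share
            else:
                # A page with no links spreads its rank over all pages.
                share = damping * ranks[page] / n
                for target in range(n):
                    new_ranks[target] += share
        ranks = new_ranks
    return ranks


# Hypothetical graph: page 0 links to 1 and 2, page 1 to 2, page 2 to 0.
links = [[1, 2], [2], [0]]
ranks = pagerank(links)
print(ranks)
```

On a real Web-scale graph the same idea is carried out with sparse matrices, since almost all entries of the link matrix are zero.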
18
5. Statistics & Probability
22
Statistics
Statistics refers to the mathematics and techniques with which we
understand data.
Mean
Median
Range
Variance
Standard Deviation, and so on
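The measures listed above can each be computed from scratch in a few lines. In this sketch the variance is the sample variance (dividing by n - 1), which is one common convention and an assumption here; the function names and the sample data are made up for illustration.

```python
import math


def mean(xs):
    """Arithmetic mean."""
    return sum(xs) / len(xs)


def median(xs):
    """Middle value of the sorted data (average of the two middle values if even)."""
    s = sorted(xs)
    n = len(s)
    mid = n // 2
    return s[mid] if n % 2 == 1 else (s[mid - 1] + s[mid]) / 2


def data_range(xs):
    """Difference between the largest and smallest value."""
    return max(xs) - min(xs)


def variance(xs):
    """Sample variance: squared deviations from the mean, divided by n - 1."""
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)


def standard_deviation(xs):
    """Square root of the sample variance."""
    return math.sqrt(variance(xs))


values = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up sample data
print(mean(values), median(values), data_range(values))
print(variance(values), standard_deviation(values))
```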
23
Statistics
Framing questions statistically allows us to leverage data resources to extract knowledge and obtain better answers.
A statistical framework allows researchers to distinguish between causation and correlation, and thus to identify interventions that will cause changes in outcomes.
It also establishes methods for prediction and estimation, and quantifies their degree of certainty.
24
Probability
It is hard to do data science without some understanding of probability and its mathematics.
Conditional Probability
Bayes’s Theorem
Random Variables
Continuous Distributions
Normal Distribution, and so on
In an uncertain world, it helps immensely to know and understand the chances of various events, so that you can plan accordingly.
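As one worked illustration of conditional probability and Bayes’s theorem, consider a test for a rare condition. The numbers below (a prior of 1 in 10,000, a 99% true-positive rate, and a 1% false-positive rate) are hypothetical, as is the function name `bayes_posterior`.

```python
def bayes_posterior(prior, likelihood, false_positive_rate):
    """P(condition | positive test) via Bayes's theorem.

    prior               = P(condition)
    likelihood          = P(positive | condition)
    false_positive_rate = P(positive | no condition)
    """
    # Total probability of a positive test, with or without the condition.
    evidence = likelihood * prior + false_positive_rate * (1 - prior)
    return likelihood * prior / evidence


# Hypothetical numbers chosen only for illustration.
posterior = bayes_posterior(prior=0.0001, likelihood=0.99,
                            false_positive_rate=0.01)
print(posterior)
```

Under these assumed numbers the posterior is just under 1%, because false positives among the many people without the condition far outnumber the true positives.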
25
6. Visualizing Data
I believe that visualization is one of the most powerful means of achieving
personal goals.
—Harvey Mackay
26
Visual comprehension speed: the brain receives 8.96 megabits of data from the eye every second.
Reading comprehension speed: the average person comprehends 120 words per minute when reading.
27
Why Visualization?
❖ A fundamental part of the data scientist’s toolkit is data visualization.
❖ To explore data
❖ To communicate data
29
Current Examples
A Day in the Life, NYC Taxis
http://chriswhong.github.io/nyctaxi/
U.S. Gun Deaths in 2013
http://www.guns.periscopic.com/?year=2013
31
Tools for Data Visualization
❖ Matplotlib
❖ Seaborn
❖ D3.js
❖ Bokeh
❖ Ggplot
❖ R
32
Example with R
◉ Iris data set
◉ iris is a data frame with 150 cases (rows) and 5 variables (columns) named Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, and Species.
33
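data(iris)                   # Load the built-in iris data set
iris                         # Print the full data frame (150 rows)
summary(iris)                # Summary statistics for every column
summary(iris$Petal.Length)   # Summary statistics for petal length only
barplot(iris$Petal.Length)   # Simple bar graph, one bar per flower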
34
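# Scatter plot of petal length against observation index
plot(x = iris$Petal.Length)
# Petal length vs. petal width, with a different point shape per species
plot(iris$Petal.Length, iris$Petal.Width,
     pch = c(23, 24, 25)[unclass(iris$Species)], main = "Iris Data")
# Same plot, with a different fill colour per species
plot(iris$Petal.Length, iris$Petal.Width, pch = 21,
     bg = c("red", "green3", "blue")[unclass(iris$Species)], main = "Iris Data")
# Scatter plot matrix of the four numeric columns
pairs(iris[1:4], main = "Iris Data", pch = 21,
      bg = c("red", "green3", "blue")[unclass(iris$Species)])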
35
