Course Introduction
Prof. Sourav Saha
Expectations from the Course?
• Learn tools
• Learn R, Python, etc
• Become programmers
• Become analysts
• Become Data Scientists
Tools you can expect to encounter
Tool name & Version: Functionality & Usability
• Microsoft Excel 2016: Spreadsheet – formulae based
• SPSS 23.0: Statistical Tool – GUI based
• R version 3.4, Python version 2.7: Programming & Analytics Tool – command based
• SAS 18: Enterprise Analytics Tool – Menu based
• Tableau, Power BI: Visualization Tool – GUI based
Assessment
• Class Problems & Assignments: 20%
• Mid-Term (system): 20%
• Group Project: 20%
• End-Term Examination: 40%
Computer Program?
[Diagram: Input → Some mysterious processing → Output]
Pick your choice
What is a Program
• Data Structures + Algorithms
• Data Structure = A Container stores Data
• Algorithm = Logic + Control
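A minimal Python sketch of the "Data Structures + Algorithms" idea. The list `marks` and the function `linear_search` are illustrative names and values, not part of the course material: the list is the container that stores the data, and the function is the algorithm, combining logic (the comparison) with control (the loop and early exit).

```python
# Data structure: a list is a container that stores the data.
marks = [72, 85, 64, 91, 58]  # illustrative values only


def linear_search(items, target):
    """Algorithm = logic (compare each item) + control (loop, early exit)."""
    for position, value in enumerate(items):
        if value == target:      # logic: is this the item we want?
            return position      # control: stop as soon as it is found
    return -1                    # target is not in the container


found_at = linear_search(marks, 91)   # 91 is stored at index 3
missing = linear_search(marks, 100)   # 100 is not stored, so -1
print(found_at, missing)
```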
Data Types & Data Structures
• Applications/programs read data, store data temporarily, process it and
finally output results.
• What is data? Numbers, Characters, etc.
[Diagram: Input Data → Programs / Applications (Instructions / Logic / Algorithms / Deduction) → Output Data]
Data Types & Data Structures
• Data is classified into data types. e.g. char, float, integer, etc.
• A data type is (i) a domain of allowed values and (ii) a set of
operations on these values.
• The system signals an error if a wrong operation is performed on data
of a certain type. For example, in a language where characters are not
numeric, char x,y,z; z = x*y is not allowed.
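The same idea can be sketched in Python, where a character is a one-character string. Multiplying two strings is outside the set of operations defined for the type, so the interpreter raises a `TypeError`. The variable names below are illustrative.

```python
x, y = "a", "b"

try:
    z = x * y                      # str * str is not a defined operation
except TypeError as error:
    outcome = f"not allowed: {error}"
else:
    outcome = f"allowed: {z!r}"

# str * int IS defined for the type: it means repetition.
repeated = x * 3

print(outcome)
print(repeated)
```

A data type therefore fixes both the allowed values and the operations that make sense on them.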
Data Types & Data Structures
• Examples
Data Type | Domain | Operations
boolean | 0, 1 | and, or, =, etc.
char | ASCII | =, <>, <, etc.
integer | -maxint to +maxint | +, -, =, ==, <>, <, etc.
Data Types & Data Structures
• int i,j; i, j can take only integer values and only integer operations can
be carried out on i, j.
• Built-in types: defined within the language e.g. int,float, etc.
• User-defined types: defined and implemented by the user e.g. using
typedef or class.
Data Types & Data Structures
• Simple Data types: also known as Atomic data types have no
component parts. E.g. int, char, float, etc.
Examples: 21, 3.14, ‘a’
Data Types & Data Structures
• Compound Data or Structured Data types: can be broken into component
parts. E.g. an object, array, set, file, etc. Example: a student object.
Name: AHMAD
Age: 20
Branch: CSC
Each field (for example, Age) is a component part.
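The student object above can be sketched as a user-defined compound type. This is one possible rendering in Python using a dataclass. The class name `Student` and its field names are assumptions chosen to mirror the slide.

```python
from dataclasses import dataclass


@dataclass
class Student:
    """A compound (structured) type built from simpler component parts."""
    name: str    # itself compound: a sequence of characters
    age: int     # atomic
    branch: str  # compound: a sequence of characters


student = Student(name="AHMAD", age=20, branch="CSC")

component = student.age        # pick out one component part
letters = list(student.name)   # the name decomposes further into characters

print(student)
print(component, letters)
```

The atomic value `20` cannot be broken down further, while `name` decomposes into individual characters.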
Data Types & Data Structures
• A data structure is a data type whose values
• (i) can be decomposed into a set of component elements each of which is
either simple (atomic) or another data structure
• (ii) include a structure involving the component parts.
More Data Structure
Possible Structures: Set, Linear, Tree, Graph.
[Figure: example diagrams of Linear, Set, Tree, and Graph structures]
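As a rough sketch, the four structures can be represented with Python built-ins. The values and the helper `height` are illustrative assumptions. What differs between the four is the relationship among elements: none (set), one-after-another (linear), one parent per node (tree), or arbitrary links (graph).

```python
# Set: members belong together, with no order or links between them.
branches = {"CSC", "ECE", "MEC"}

# Linear: every element has at most one predecessor and one successor.
queue = ["first", "second", "third"]

# Tree: each node is (label, children); every node has exactly one parent.
tree = ("root", [("left", []), ("right", [("leaf", [])])])

# Graph: adjacency list; any node may be linked to any other node.
graph = {"A": ["B", "C"], "B": ["A", "C"], "C": ["A", "B"]}


def height(node):
    """Number of levels in a tree given as (label, children)."""
    _label, children = node
    return 1 + max((height(child) for child in children), default=0)


print("CSC" in branches, queue[1], height(tree), graph["A"])
```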
Functions of Data Structures
• Add
• Index
• Key
• Position
• Priority
• Retrieve
• Modify
• Delete
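The operations listed above can be illustrated with Python's built-in containers. This sketch assumes that "Index", "Key", "Position" and "Priority" name different ways of adding or reaching an element. That reading is an interpretation of the slide, and all names and values are made up for illustration.

```python
import heapq

# Add by position / index (list)
roster = ["Asha", "Bilal"]
roster.append("Chen")          # add at the end
roster.insert(0, "Zara")       # add at position 0

# Add, retrieve, modify, delete by key (dict)
ages = {}
ages["Asha"] = 20              # add
retrieved = ages["Asha"]       # retrieve
ages["Asha"] = 21              # modify
del ages["Asha"]               # delete

# Add and retrieve by priority (heap: the smallest priority comes out first)
tasks = []
heapq.heappush(tasks, (2, "grade assignments"))
heapq.heappush(tasks, (1, "prepare lecture"))
most_urgent = heapq.heappop(tasks)

print(roster, retrieved, ages, most_urgent)
```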
Which Data Structure or Algorithm is better?
• Must Meet Requirement
• High Performance
• Low System footprint
• Easy to implement
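One way to make "high performance" concrete is to count the work two algorithms do on the same data. The sketch below is illustrative and not from the course material: it counts how many elements linear search and binary search inspect when looking for the last item of a sorted list. Both meet the requirement, but they differ in performance and in how easy they are to implement.

```python
def linear_steps(sorted_items, target):
    """Inspect items one by one; return how many were inspected."""
    steps = 0
    for value in sorted_items:
        steps += 1
        if value == target:
            break
    return steps


def binary_steps(sorted_items, target):
    """Halve the search range each time; return how many items were inspected."""
    steps = 0
    low, high = 0, len(sorted_items) - 1
    while low <= high:
        steps += 1
        middle = (low + high) // 2
        if sorted_items[middle] == target:
            break
        if sorted_items[middle] < target:
            low = middle + 1
        else:
            high = middle - 1
    return steps


data = list(range(1, 1001))            # 1000 sorted values
linear_cost = linear_steps(data, 1000)
binary_cost = binary_steps(data, 1000)
print(linear_cost, binary_cost)
```

Linear search is simpler to write and works on unsorted data; binary search needs sorted input but inspects far fewer elements.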
Which Package to Use?
Agenda
Present an overview of the packages and solutions available in the
market for data analysis
Understand what is popular today and what the trends are for
tomorrow
Overview of some individual software packages
Assess their demand and a few features
Some Definitions
• SPSS: Statistical Package for the Social Sciences (now IBM SPSS Statistics)
• SAS: Statistical Analysis System (now just SAS)
• Minitab
• Excel
• SPSS, SAS, and Minitab are statistical packages while Excel is a spreadsheet
Available Options for Statistical Analysis
Proprietary: Excel, SPSS, Minitab, SAS, EViews, Stata
Free Software: R, Python, Weka, Gretl
What people are using
[Charts: R (blue) vs SAS (orange); R (blue) vs Python (orange)]
Scholarly Articles / Research?
No. of Jobs
Python & SQL are in most demand
Microsoft Excel
COST
• Individual license for Microsoft Office Professional: $350
• Microsoft Office University student license: $99
• Volume discounts available for large organizations and universities
• Free Starter version available on some new PCs
PRO
• Nearly ubiquitous and often pre-installed on new computers
• User friendly
• Very good for basic descriptive statistics, charts and plots
CON
• Costs money
• Not sufficient for anything beyond the most basic statistical analysis
Minitab
CON
• Costs money: $1,395 per single user
• Unsuitable for very complicated statistical computation and analysis
• Not often used in academic research
PRO
• Easy to learn and use
• Often taught in schools in introductory statistics courses
• Widely used in engineering for process improvement
SPSS
COST
• From $1,000 to $12,000 per license, depending on license type
PRO
• One of the most widely used statistical packages in academia and industry
• More powerful than Minitab, and also easy to learn and use
• Has a command line interface in addition to the menu-driven user interface
CON
• Very expensive
• Not adequate for modeling and cutting-edge statistical analysis
• Complicated: too many options
SAS
COST
• Complicated pricing model
• $8,500 first-year license fee
PRO
• Widely accepted as the leader in statistical analysis and modeling
• Widely used in industry and academia
• Very flexible and very powerful
CON
• Very, very expensive
• Not user friendly
• Steep learning curve
• Relatively poor graphics capabilities
R
PRO
• Widely used and accepted in industry and academia
• Very powerful and flexible
• Very large user base
• Lots of books and manuals
• Several user interface shells available
COST
• Free / Open Source
CON
• Not user friendly
• Steep learning curve
R brownies
Functionality:
• R provides a wide variety of statistical and graphical techniques, including linear and nonlinear modeling, classical statistical tests,
time-series analysis, classification, clustering, and others.
• R is easily extensible through functions and the R community is noted for its active contributions in terms of packages.
• Many of R's standard functions are written in R itself, which makes it easy for users to follow the algorithmic choices made.
Packages:
• R is highly extensible through the use of user-submitted packages for specific functions or specific features
• R has stronger object-oriented programming facilities than most statistical computing languages.
• Extending R is also eased by its permissive lexical scoping rules.
Graphics:
• Another strength of R is static graphics, which can produce publication-quality graphs, including mathematical symbols.
• Dynamic and interactive graphics are available through additional packages
Installing R
Screenshots
How Easy to Use?
• Excel is very easy to use!
• SPSS and Minitab are relatively easy to use
• SAS is a bit more difficult to use but is made easier via Enterprise Guide (EG)
• R & Python are mainly command based and require active learning
• In general, most software is easy to use once you learn how to use it! Getting
Started Guides are available for R, SPSS, Minitab, Excel and SAS.
Descriptive (Summary) Statistics Options
[Screenshots: descriptive statistics options in SPSS, Minitab, SAS EG, and Excel]
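For comparison with the menu-driven packages, the same kind of summary can be produced in a command-based tool. The sketch below uses Python's standard `statistics` module. The list `scores` holds made-up illustrative values, not course data.

```python
import statistics

scores = [12, 15, 11, 18, 15, 20, 14]  # illustrative values only

summary = {
    "n": len(scores),
    "mean": statistics.mean(scores),
    "median": statistics.median(scores),
    "mode": statistics.mode(scores),
    "stdev": statistics.stdev(scores),   # sample standard deviation
    "min": min(scores),
    "max": max(scores),
}

for name, value in summary.items():
    print(f"{name:<7}{value}")
```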
Advanced Statistics: Model Building
[Screenshots: model building in Excel, SPSS, Minitab, and SAS]
Fertility: Average number of kids.
Infant mortality: deaths per 1000 live births
Bar Charts
[Figure: bar charts of percent by house style (CONDO, RANCH, SPLIT, TWOSTORY) as produced by SPSS, SAS, Minitab, and Excel. Percent within all data.]
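A "percent within all data" bar chart like the ones shown by the packages can be sketched in a few lines of Python. The list `styles` is hypothetical sample data using the four category names from the chart. It does not reproduce the percentages in the original figure.

```python
from collections import Counter

# Hypothetical observations; only the category names come from the chart.
styles = ["RANCH", "CONDO", "RANCH", "SPLIT", "TWOSTORY",
          "RANCH", "CONDO", "TWOSTORY", "SPLIT", "RANCH"]

counts = Counter(styles)

# Percent within all data: each category count divided by the total count.
percents = {style: 100 * n / len(styles) for style, n in sorted(counts.items())}

# Text bar chart: one '#' per 5 percentage points.
for style, pct in percents.items():
    print(f"{style:<9} {'#' * round(pct / 5)} {pct:.1f}%")
```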
Package Assessment
Usage In Industry Vs Academia
• SPSS and Minitab are heavily used in academia; they are used in industry, but less so
• SAS is not heavily used in academia but is heavily used in industry (most clinical
trials use SAS)
• Excel is heavily used in both academia and industry
• R and Python usage is picking up fast in both academia and industry
Operating Systems
• R runs on Windows, Macintosh & Linux
• SPSS runs on Windows, Macintosh, and Unix
• Python runs on Linux, Macintosh, Windows and Unix
• Minitab runs mainly on Windows and Macintosh
• SAS runs on Windows, Macintosh, Unix, Linux
• Excel runs mainly on Windows and Macintosh
Pivot Tables
Very good for displaying information online. Can be very interactive
• Minitab: static, not interactive
• Excel: interactive
• SPSS: interactive
• SAS: interactive
• R: ??
• Python: Absent
Text Analytics
• SPSS: SPSS Modeler (different from IBM SPSS Statistics)
• SAS: SAS Enterprise Miner (different from SAS Base and Enterprise Guide)
• R: with external packages
• Python: with external libraries
• Minitab: not available
• Excel: with extension
Statistical Modelling
“A statistical model is a class of mathematical model, which embodies a set of assumptions
concerning the generation of some sample data, and similar data from a larger population.
A statistical model represents, often in considerably idealized form, the data-generating
process”
• There are three purposes for a statistical model:
• Predictions
• Extraction of information
• Description of stochastic structures
R, SAS, SPSS and Minitab are good for statistical modelling
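The three purposes above can be illustrated with the simplest statistical model, a straight line plus random noise. The sketch below is a hedged illustration rather than any package's method: it simulates data from an assumed data-generating process (the intercept 2.0, slope 0.5, and noise level are arbitrary choices) and recovers the parameters by ordinary least squares using only the standard library.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

# Assumed data-generating process: y = 2.0 + 0.5 * x + Gaussian noise
true_intercept, true_slope = 2.0, 0.5
xs = [float(i) for i in range(50)]
ys = [true_intercept + true_slope * x + random.gauss(0, 1) for x in xs]

# Ordinary least squares estimates for a single predictor
mean_x = sum(xs) / len(xs)
mean_y = sum(ys) / len(ys)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
sxx = sum((x - mean_x) ** 2 for x in xs)
slope = sxy / sxx
intercept = mean_y - slope * mean_x

# Extraction of information: the estimated effect of x on y
print(f"estimated slope = {slope:.3f}, intercept = {intercept:.3f}")

# Prediction: expected y at a new x value
prediction = intercept + slope * 60
print(f"predicted y at x = 60: {prediction:.2f}")

# Description of stochastic structure: spread of the residuals
residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
residual_sd = (sum(r ** 2 for r in residuals) / (len(xs) - 2)) ** 0.5
print(f"residual standard deviation = {residual_sd:.3f}")
```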
ANOVA
Product (Software) One-Way Two-Way MANOVA GLM Post-hoc Tests Latin Squares Analysis
Minitab Yes Yes Yes Yes Yes Yes
R Yes Yes Yes Yes Yes
SAS Yes Yes Yes Yes Yes Yes
SPSS Yes Yes Yes Yes Yes Yes
Excel Yes Add on Add on Add on
Regression
Product (Software) OLS WLS 2SLS NLLS Logistic GLM LAD Stepwise
Minitab Yes Yes No Yes Yes Yes No Yes
R Yes Yes Yes Yes Yes Yes Yes Yes
SAS Yes Yes Yes Yes Yes Yes Yes Yes
Excel Yes
SPSS Yes Yes Yes Yes Yes Yes No Yes
Ordinary Least Squares (OLS); Weighted Least Squares (WLS); Two-Stage Least Squares (2SLS);
Non-Linear Least Squares (NLLS); General Linear Model (GLM); Least Absolute Deviation regression (LAD)
Time Series Analysis
Product ARIMA GARCH Unit root test Cointegration test VAR Multivariate GARCH
Minitab Yes No No No No
R Yes Yes Yes Yes Yes
SAS Yes Yes Yes Yes Yes Yes
Excel No No No No No No
SPSS Yes Yes No No No No
Big Data Analytics
• Python
• SAS Enterprise Miner
• IBM SPSS Modeler
• R
Summary
• DOE: Minitab or SAS
• Power / Sample size calculation: Minitab or SAS
• Best way to store your data: Notepad
• Automatic Update of Output: Minitab or Excel
• SAS very popular in industry
• Pivot table: SAS, SPSS or Excel
• Modelling: SAS, SPSS or Minitab
• Summary statistics: SAS, SPSS or Minitab
References
https://sites.google.com/site/r4statistics/popularity
http://en.freestatistics.info/
http://lib.stat.cmu.edu/
http://www.comfsm.fm/~dleeling/statistics/notes000.html
Questions