Introduction to Business Analytics
By Dr. Kingshuk Srivastava
Changing Life in the Digital Age

Transformation is critical: companies must shift to a data-driven business.
72% of companies are vulnerable to disruption within three years.
Why? … Suddenly!
Why we are all vulnerable to seismic shifts

Internal Threats
• Siloed data and systems
• Gaps in expertise and skills
• Inability to react quickly

External Threats
• Born-on-digital companies that steal market share or rewrite customer expectations
• New business models that reinvent our industry and change the game altogether
• An estimated 274,000 startups worldwide each day
The Shift to a Data-Driven Organization

[Figure: value rises as uses of data mature]
Stages: Operations → Reporting & Data Warehousing → Self-Service Analytics → Decision Science → New Business Models
Value delivered: Efficiency → Data Modernization → Data Monetization
What is Data Science?
Data science is a "concept to unify statistics, data analysis
and their related methods" in order to "understand and
analyze an actual phenomena" with data.
Why Analytics?
• The process of collecting, organizing, and analyzing large data sets to discover the most useful and important information
• Organizations have far more data than ever before
• Analytics solutions help organizations make better and faster decisions
• Analytics identifies opportunities for improvement
• Companies are increasingly using business analytics to understand their data and pursue their business goals
Analytics is the discovery and communication of meaningful patterns in data, and is especially valuable in areas rich with recorded information.
Significance of Analytics
• Convert extensive data into powerful insights that drive efficient decisions
• Base your decisions and strategies on data rather than intuition
• Apply the right analytics to your data to achieve the desired improvements
• Achieve breakthrough results
What is Data Analytics?
Analytics is the use of:
• data,
• information technology,
• statistical analysis,
• quantitative methods, and
• mathematical or computer-based models
to help managers gain improved insight into their business operations and make better, fact-based decisions.
Business Analytics & Business Intelligence is a subset of Data Analytics
Components Leading to Analytics

Data science projects require multiple skills:
• Domain Expertise: supply chain, CRM, financials, networking, engineering research
• Computer Science: scripting, SQL, Python, R, Scala, data pipelines, Big Data / Apache Spark
• Math & Stats: machine learning, computational mathematics
Another approach to differentiating types of analytics is by segment: Understanding, Decision, and Action.
Components of Data Analytics: Further Understanding
• Application areas and domains
• Sector-specific specializations
• Graph analytics
Types of Data
The V’s of Big Data
Volume, Velocity, Variety, Veracity, and Value
Data Collection Techniques
• Observations,
• Tests,
• Surveys,
• Document analysis
(the research literature)
Quantitative Methods
Experiment: A research situation with at least one independent
variable, which is manipulated by the researcher.
Independent Variable: The variable manipulated in the study; the
presumed cause of the outcome.
Dependent Variable: The variable affected by the independent
variable; the effect of the study.
y = f(x)
Which is which here? (x is the independent variable; y is the dependent variable.)
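A minimal Python sketch of this relationship (the data points are hypothetical): x is set by the researcher, y is measured, and f is estimated by fitting a line.

```python
import numpy as np

# Hypothetical experiment: x is manipulated, y is observed.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])   # independent variable
y = np.array([1.1, 2.9, 5.2, 7.1, 8.8])   # dependent variable

# Estimate f in y = f(x) with a degree-1 (linear) fit.
slope, intercept = np.polyfit(x, y, deg=1)
print(f"estimated f(x) = {slope:.2f}*x + {intercept:.2f}")
```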
Key Factors for High-Quality Experimental Design
Data should not be contaminated by poor measurement or errors
in procedure.
Eliminate confounding variables from the study, or minimize their
effects on the variables of interest.
Representativeness: does your sample represent the population
you are studying? Use random sampling techniques.
What Makes a Good Quantitative
Research Design?
4 Key Elements
1. Freedom from Bias
2. Freedom from Confounding
3. Control of Extraneous Variables
4. Statistical Precision to Test Hypothesis
Bias: When observations favor some individuals in the
population over others.
Confounding: When the effects of two or more variables cannot
be separated.
Extraneous Variables: Any variable other than the independent
variable that has an effect on the dependent variable.
These variables need to be identified and minimized.
e.g., For erosion potential as a function of clay content, rainfall
intensity, vegetation, and duration would be considered extraneous
variables.
Precision versus accuracy
"Precise" means sharply defined or measured.
"Accurate" means truthful or correct.
The four combinations:
• Both accurate and precise
• Accurate but not precise
• Precise but not accurate
• Neither accurate nor precise
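One way to see the distinction, sketched in Python with simulated measurements (the true value and noise levels are made up): the bias of the sample mean reflects accuracy, while the spread reflects precision.

```python
import numpy as np

# Simulated measurements of a quantity whose true value is 10.0 (assumption).
rng = np.random.default_rng(seed=0)
true_value = 10.0

accurate_not_precise = rng.normal(loc=10.0, scale=2.0, size=100)  # unbiased, noisy
precise_not_accurate = rng.normal(loc=12.0, scale=0.1, size=100)  # biased, tight

for name, sample in [("accurate, not precise", accurate_not_precise),
                     ("precise, not accurate", precise_not_accurate)]:
    bias = sample.mean() - true_value  # accuracy: closeness to the truth
    spread = sample.std()              # precision: scatter of the measurements
    print(f"{name}: bias={bias:+.2f}, spread={spread:.2f}")
```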
Sampling
Sampling is the problem of accurately acquiring the necessary
data in order to form a representative view of the problem.
This is much more difficult to do than is generally realized.
Overall Methodology:
* State the objectives of the survey
* Define the target population
* Define the data to be collected
* Define the variables to be determined
* Define the required precision & accuracy
* Define the measurement "instrument"
* Define the sample size & sampling method, then
select the sample
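A minimal Python sketch of the last step, simple random sampling, under assumed inputs (the population of customer IDs and the sample size are hypothetical):

```python
import random

# Hypothetical population: 10,000 customer IDs.
population = list(range(1, 10001))

random.seed(42)  # reproducibility

# The sample size would come from the required precision & accuracy.
sample_size = 100
sample = random.sample(population, sample_size)  # without replacement
print(sample[:10])
```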
Data Preprocessing
Data Quality: Why Preprocess the Data?
• Measures for data quality: A multidimensional view
• Accuracy: correct or wrong, accurate or not
• Completeness: not recorded, unavailable, …
• Consistency: some modified but some not, dangling, …
• Timeliness: timely update?
• Believability: how much are the data to be trusted?
• Interpretability: how easily can the data be understood?
Major Tasks in Data Preprocessing
• Data cleaning
• Fill in missing values, smooth noisy data, identify or remove outliers, and
resolve inconsistencies
• Data integration
• Integration of multiple databases, data cubes, or files
• Data reduction
• Dimensionality reduction
• Numerosity reduction
• Data compression
• Data transformation and data discretization (a short sketch follows this list)
• Normalization
• Concept hierarchy generation
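A short pandas sketch of two of the transformation tasks above, min-max normalization and equal-frequency discretization, on a hypothetical income column:

```python
import pandas as pd

# Hypothetical attribute to transform.
df = pd.DataFrame({"income": [12000, 35000, 47000, 81000, 150000]})

# Normalization: min-max scaling to [0, 1].
lo, hi = df["income"].min(), df["income"].max()
df["income_scaled"] = (df["income"] - lo) / (hi - lo)

# Discretization: equal-frequency binning into three concept levels.
df["income_level"] = pd.qcut(df["income"], q=3, labels=["low", "medium", "high"])
print(df)
```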
Data Cleaning
• Data in the real world is dirty: lots of potentially incorrect data, e.g., faulty instruments, human or computer
error, transmission error
• incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
• e.g., Occupation=“ ” (missing data)
• noisy: containing noise, errors, or outliers
• e.g., Salary=“−10” (an error)
• inconsistent: containing discrepancies in codes or names (a consistency-check sketch follows this list), e.g.,
• Age=“42”, Birthday=“03/07/2010”
• Was rating “1, 2, 3”, now rating “A, B, C”
• discrepancy between duplicate records
• Intentional (e.g., disguised missing data)
• Jan. 1 as everyone’s birthday?
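A small pandas sketch of catching the Age/Birthday inconsistency above (the table, the day/month/year date format, and the one-year tolerance are assumptions):

```python
import pandas as pd

# Hypothetical records; assumes birthdays are day/month/year.
df = pd.DataFrame({
    "age": [42, 15],
    "birthday": ["03/07/2010", "03/07/2010"],
})
df["birthday"] = pd.to_datetime(df["birthday"], format="%d/%m/%Y")

# Derive age from birthday and flag rows that disagree with the stated age.
as_of = pd.Timestamp("2025-01-01")
derived_age = (as_of - df["birthday"]).dt.days // 365
inconsistent = df[(df["age"] - derived_age).abs() > 1]  # 1-year tolerance
print(inconsistent)
```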
Incomplete (Missing) Data
• Data is not always available
• E.g., many tuples have no recorded value for several
attributes, such as customer income in sales data
• Missing data may be due to
• equipment malfunction
• deletion due to inconsistency with other recorded data
• data not entered due to misunderstanding
• certain data not considered important at the time of entry
• history or changes of the data not registered
• Missing data may need to be inferred
How to Handle Missing Data?
• Ignore the tuple: usually done when class label is missing (when
doing classification)—not effective when the % of missing values per
attribute varies considerably
• Fill in the missing value manually: tedious + infeasible?
• Fill it in automatically (a short sketch follows this list) with
• a global constant: e.g., “unknown”, a new class?!
• the attribute mean
• the attribute mean for all samples belonging to the same class:
smarter
• the most probable value: inference-based such as Bayesian
formula or decision tree
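A pandas sketch of the automatic fill-in strategies above on a hypothetical table (a -1 sentinel plays the role of the “unknown” global constant for a numeric attribute):

```python
import pandas as pd

# Hypothetical table with missing incomes.
df = pd.DataFrame({
    "class":  ["A", "A", "B", "B", "B"],
    "income": [50.0, None, 30.0, None, 40.0],
})

# Global constant ("-1" stands in for "unknown").
df["income_const"] = df["income"].fillna(-1)

# Attribute mean.
df["income_mean"] = df["income"].fillna(df["income"].mean())

# Attribute mean per class: the "smarter" option above.
df["income_class_mean"] = df["income"].fillna(
    df.groupby("class")["income"].transform("mean")
)
print(df)
```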
Noisy Data
• Noise: random error or variance in a measured variable
• Incorrect attribute values may be due to
• faulty data collection instruments
• data entry problems
• data transmission problems
• technology limitation
• inconsistency in naming conventions
• Other data problems which require data cleaning
• duplicate records
• incomplete data
• inconsistent data
How to Handle Noisy Data?
• Binning
• first sort the data and partition it into (equal-frequency) bins
• then smooth by bin means, bin medians, or bin boundaries (a binning sketch
follows this list)
• Regression
• smooth by fitting the data into regression functions
• Clustering
• detect and remove outliers
• Combined computer and human inspection
• detect suspicious values and check by human (e.g., deal with
possible outliers)
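A pandas sketch of equal-frequency binning and smoothing by bin means, on a hypothetical already-sorted series of prices:

```python
import pandas as pd

# Hypothetical values, already sorted as the first step requires.
prices = pd.Series([4, 8, 15, 21, 21, 24, 25, 28, 34])

# Partition into three equal-frequency bins, then smooth by bin means.
bins = pd.qcut(prices, q=3)
smoothed = prices.groupby(bins, observed=True).transform("mean")
print(pd.DataFrame({"raw": prices, "bin": bins.astype(str), "smoothed": smoothed}))
```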
Data Cleaning as a Process
• Data discrepancy detection
• Use metadata (e.g., domain, range, dependency, distribution)
• Check field overloading
• Check the uniqueness rule, consecutive rule, and null rule (a rule-check sketch follows this list)
• Use commercial tools
• Data scrubbing: use simple domain knowledge (e.g., postal code,
spell-check) to detect errors and make corrections
• Data auditing: analyze the data to discover rules and relationships and to
detect violators (e.g., use correlation and clustering to find outliers)
• Data migration and integration
• Data migration tools: allow transformations to be specified
• ETL (Extraction/Transformation/Loading) tools: allow users to specify
transformations through a graphical user interface
• Integration of the two processes
• Iterative and interactive (e.g., Potter’s Wheel)
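A pandas sketch of the simple rule checks mentioned above, covering a uniqueness rule, a null rule, and a postal-code domain rule (the orders table and the six-digit code format are assumptions):

```python
import pandas as pd

# Hypothetical orders table; assumes postal codes must be six digits.
orders = pd.DataFrame({
    "order_id": [101, 102, 102, 104],
    "postal_code": ["248007", None, "110001", "ZZ999"],
})

# Uniqueness rule: order_id must not repeat.
duplicates = orders[orders["order_id"].duplicated(keep=False)]

# Null rule: postal_code must be present.
nulls = orders[orders["postal_code"].isna()]

# Simple domain rule: postal codes must be exactly six digits
# (missing values also fail this check).
bad_codes = orders[~orders["postal_code"].fillna("").str.fullmatch(r"\d{6}")]

print(duplicates, nulls, bad_codes, sep="\n\n")
```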
THANK YOU