Statistics and Data
Outline
Relevance of Statistics
Introduction to Basic Concepts
Course Details
Relevance of Statistics
Case: Pepsi’s Exclusivity Agreement
• A large university with a total enrollment of
about 50,000 students has offered Pepsi an
exclusivity agreement that would give Pepsi
exclusive rights to sell its products at all
university facilities for the next year with an
option for future years.
• In return, the university would receive 35% of
the on-campus revenues and an additional lump
sum of $200,000 per year.
• Pepsi has been given 2 weeks to respond.
Case 1: Background Details
• The market for soft drinks is measured in terms of 12-ounce cans.
• Pepsi currently sells an average of 22,000 cans per week (over the 40 weeks of the year that the university operates).
• The cans sell for an average of 75 cents each. The costs, including labor, amount to 20 cents per can.
Case 1: A Problem
• Pepsi is unsure of its market share.
• However, they suspect that it is considerably less
than 50%.
Source: https://99designs.com/icon-button-design/contests/icon-button-design-wanted-guessing-game-167222
Profit-Loss Calculation
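• Suppose the current market share were around 25%.
• The total market would then be 88,000 cans per week (22,000 is 25% of 88,000).
• With exclusivity, Pepsi would sell all 88,000 cans per week, or 3,520,000 cans per year over the 40 weeks.
• From these volumes, the profit or loss can be calculated.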
Source: https://www.score.org/resource/12-month-profit-and-loss-projection
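The profit-or-loss calculation can be sketched as follows. This is a minimal sketch under stated assumptions: the 25% market share is the hypothetical value from the slide, price and cost per can stay at 75 and 20 cents, and under exclusivity Pepsi captures all on-campus soft drink sales. The sketch also assumes Pepsi currently pays nothing to the university. The function names are illustrative, and amounts are held in cents so the arithmetic stays exact.

```python
# Figures from the case; amounts are in cents so the arithmetic stays exact.
WEEKS_PER_YEAR = 40              # weeks the university operates
PRICE_CENTS = 75                 # average selling price per can
COST_CENTS = 20                  # cost per can, including labor
UNIVERSITY_SHARE_PCT = 35        # share of on-campus revenue paid to the university
LUMP_SUM_CENTS = 200_000 * 100   # additional annual payment to the university


def total_market_per_week(pepsi_cans_per_week, market_share_pct):
    """Total weekly cans sold on campus implied by Pepsi's sales and share."""
    return pepsi_cans_per_week * 100 // market_share_pct


def annual_profit_without_agreement(pepsi_cans_per_week):
    """Current annual profit in dollars (assumes no payments to the university)."""
    cans = pepsi_cans_per_week * WEEKS_PER_YEAR
    return cans * (PRICE_CENTS - COST_CENTS) / 100


def annual_profit_with_agreement(market_cans_per_week):
    """Annual profit in dollars if Pepsi captures the whole campus market."""
    cans = market_cans_per_week * WEEKS_PER_YEAR
    revenue = cans * PRICE_CENTS
    university_cut = revenue * UNIVERSITY_SHARE_PCT // 100
    costs = cans * COST_CENTS
    return (revenue - university_cut - costs - LUMP_SUM_CENTS) / 100


market = total_market_per_week(22_000, 25)
print(market)                                   # 88000 cans per week
print(annual_profit_without_agreement(22_000))  # 484000.0
print(annual_profit_with_agreement(market))     # 812000.0
```

Under these assumptions the agreement would raise annual profit from $484,000 to $812,000. The result is sensitive to the assumed share: a higher actual market share implies a smaller total market and a less favorable outcome, which is why the survey matters.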
Case 1: Market Survey
• The only problem is that Pepsi does not know
how many soft drinks are sold weekly at the
university.
• Pepsi assigned a recent university graduate to
survey the university's students to supply the
missing information.
• Accordingly, she organizes a survey that asks 500
students to keep track of the number of soft drinks
they purchase in the next 7 days.
Source: https://getthematic.com/insights/customer-survey-design/
Simple Random Sample
• A simple random sample is a sample of n observations that has the same probability of being selected from the population as any other sample of n observations.
• Most statistical methods presume
simple random samples.
• However, in some situations
other sampling methods have an
advantage over simple random
samples.
Source: https://www.statisticshowto.com/simple-random-sample/
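A simple random sample can be drawn with Python's standard library. This is a minimal sketch: the sampling frame of student IDs 1 to 50,000 and the helper name `simple_random_sample` are assumptions made for illustration.

```python
import random


def simple_random_sample(population, n, seed=None):
    """Draw n distinct units so that every subset of size n is equally likely."""
    rng = random.Random(seed)          # the seed only makes the draw reproducible
    return rng.sample(population, n)   # sampling without replacement


# Hypothetical sampling frame: student IDs 1 to 50,000.
student_ids = range(1, 50_001)
surveyed = simple_random_sample(student_ids, 500, seed=1)
print(len(surveyed))        # 500
print(len(set(surveyed)))   # 500, so no student is picked twice
```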
Stratified Random Sampling
• Divide the population into mutually
exclusive and collectively exhaustive
groups, called strata.
• Randomly select observations from each stratum, with the number
selected proportional to the stratum’s size.
• Advantages:
Guarantees that each population
subdivision is represented in the
sample.
Parameter estimates have greater
precision than those estimated from
simple random sampling.
Source: https://www.netquest.com/blog/en/random-sampling-stratified-sampling
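Proportional stratified sampling can be sketched as below. The population of undergraduate and graduate students, the 10% sampling fraction, and the helper name `stratified_sample` are all hypothetical; the sketch simply draws the same fraction from every stratum.

```python
import random
from collections import defaultdict


def stratified_sample(units, stratum_of, fraction, seed=None):
    """Proportional stratified sample: draw the same fraction from every stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for unit in units:
        strata[stratum_of(unit)].append(unit)   # each unit lands in exactly one stratum

    sample = []
    for label in sorted(strata):
        members = strata[label]
        k = max(1, round(len(members) * fraction))  # at least one unit per stratum
        sample.extend(rng.sample(members, k))
    return sample


# Hypothetical population: 300 undergraduates and 100 graduate students.
students = [("undergrad", i) for i in range(300)] + [("grad", i) for i in range(100)]
chosen = stratified_sample(students, stratum_of=lambda s: s[0], fraction=0.10, seed=1)
counts = {label: sum(1 for s in chosen if s[0] == label) for label in ("undergrad", "grad")}
print(counts)   # {'undergrad': 30, 'grad': 10}
```

Both strata appear in the sample in the same 3:1 proportion as in the population, which is the guarantee listed among the advantages.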
Cluster Sampling
• Divide population into mutually exclusive
and collectively exhaustive groups, called
clusters.
• Randomly select clusters.
• Sample every observation in those randomly
selected clusters.
• Advantages and disadvantages:
Less expensive than other sampling
methods.
Less precision than simple random
sampling or stratified sampling.
Useful when clusters occur naturally in
the population.
Source: https://www.netquest.com/blog/en/cluster-sampling
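The cluster sampling steps can be sketched as follows. The residence halls used as clusters and the helper name `cluster_sample` are hypothetical; the point is that whole clusters are selected at random and every unit inside them is surveyed.

```python
import random


def cluster_sample(clusters, n_clusters, seed=None):
    """Pick whole clusters at random and keep every unit inside them."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)   # randomly select clusters
    units = [unit for name in chosen for unit in clusters[name]]
    return chosen, units


# Hypothetical clusters: four residence halls with 3 students each.
halls = {
    "Hall A": ["a1", "a2", "a3"],
    "Hall B": ["b1", "b2", "b3"],
    "Hall C": ["c1", "c2", "c3"],
    "Hall D": ["d1", "d2", "d3"],
}
chosen_halls, surveyed = cluster_sample(halls, n_clusters=2, seed=1)
print(len(chosen_halls), len(surveyed))   # 2 6
```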
A Simple Representation of Survey Data (First 8 Rows)
Student Id    No. of Cans Purchased in a Week
1             14
2             10
3             8
4             6
5             9
6             12
7             13
8             4
Decision-Making
1. Design a Market Survey and Collect Data
2. Estimate the Potential Volume
3. Profit and Loss Calculation
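The step of estimating the potential volume can be illustrated with the eight survey rows shown earlier. This is only a sketch: it assumes the surveyed students are representative of all 50,000 students and scales the sample mean by enrollment, which is one simple estimator and not necessarily the one used in the case. Eight rows are far too few for a real estimate.

```python
from statistics import mean

# First 8 rows of the survey: cans purchased in a week by each student.
cans_per_student = [14, 10, 8, 6, 9, 12, 13, 4]
ENROLLMENT = 50_000   # total enrollment stated in the case

sample_mean = mean(cans_per_student)                 # sample statistic
estimated_weekly_volume = sample_mean * ENROLLMENT   # scaled to the population

print(sample_mean)               # 9.5
print(estimated_weekly_volume)   # 475000.0
```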
What is Statistics?
Data → Statistics → Information
Statistics is a tool for creating new understanding from a set of numbers.
Steps for Good Statistical Analysis
Find the right data
Use the appropriate statistical tools
Communicate the numerical information clearly
Basic Concepts
Population and Sample
Population: described by a Parameter
Sample (a subset of the population): described by a Statistic
Populations have Parameters, Samples have Statistics.
Population and Sample
• Population
A population is the group of all items of interest to a statistics
practitioner.
Frequently very large.
• Sample
A sample is a set of data drawn from the population.
Large enough to be informative, but smaller than the population.
Parameter and Statistic
• Parameter
A descriptive measure of a population.
• Statistic
A descriptive measure of a sample.
Need for Sampling
• Too expensive to gather information on the entire population
• Often impossible to gather information on the entire population
Two Branches of Statistics
• Descriptive Statistics
• Inferential Statistics
Descriptive Statistics
• Descriptive Statistics provides a set of
methods for organizing, summarizing, and
presenting data in a convenient and informative
way.
• These methods include:
Graphical Techniques and
Numerical Techniques.
A Problem…
• Descriptive statistics describe the data set that is being analyzed
but do not allow us to draw any conclusions about the population.
Inferential Statistics
• Statistical inference is the process of making an estimate, prediction, or
decision about a population based on a sample.
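• A sample is drawn from the population; the sample yields a statistic.
• Inference runs from the sample’s statistic back to the population’s parameter.
• What can we infer about a population’s parameters based on a sample’s statistics?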
Inferential Statistics
• We use statistics to make inferences about parameters.
• Therefore, we can make an estimate, prediction, or decision about a
population based on sample data.
• Thus, we can apply what we know about a sample to the larger
population from which it was drawn!
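The idea of using a statistic to estimate a parameter can be shown with a small simulation. Everything here is hypothetical: the population of 50,000 weekly purchase counts is generated at random, and in practice the population mean would be unknown. The sketch draws a simple random sample of 500 and compares the sample mean with the population mean.

```python
import random
from statistics import mean

rng = random.Random(42)   # fixed seed so the illustration is reproducible

# Hypothetical population: weekly can purchases of 50,000 students (0 to 20 cans).
population = [rng.randint(0, 20) for _ in range(50_000)]
parameter = mean(population)          # population mean, normally unknown

sample = rng.sample(population, 500)  # simple random sample of 500 students
statistic = mean(sample)              # sample mean, used as the estimate

print(f"parameter (population mean): {parameter:.2f}")
print(f"statistic (sample mean):     {statistic:.2f}")
print(f"estimation error:            {statistic - parameter:+.2f}")
```

The sample mean will rarely equal the population mean exactly, but with a random sample of this size the gap is typically small.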
Data Types
Types of Data
• Cross-Sectional
• Time Series
Case 1: Survey Data
Cross-sectional Data
• Data collected by recording a characteristic of many subjects at the same point in time, or without regard to differences in time.
• Subjects might include individuals, households, firms, industries, regions, and countries.
Student Id    No. of Cans Purchased in a Week
1             14
2             10
3             8
4             6
5             9
6             12
7             13
8             4
Time Series Data
• Data collected by recording a characteristic of a subject over several time periods.
• Data can include daily, weekly, monthly, quarterly, or annual observations.
• The graph shows e-3 wheeler registrations in India.
• 3-wheeler EVs like e-autos and e-rickshaws account for close to 65% of all EVs registered in India.
• For more details, check our article: https://www.thehindu.com/opinion/op-ed/indias-ev-ambition-rides-on-three-wheels/article65480119.ece
[Figure: e-3 Wheeler Registrations in India; vertical axis from 0 to 800,000, horizontal axis from Jan 2013 to Jan 2022]
Case 2: Tween Survey
• Luke McCaffrey owns a ski resort two hours outside Boston.
• Luke is in need of a new marketing manager.
• Luke is particularly interested in serving the needs of the “tween” population
(children aged 8 to 12 years old).
• He believes that tween spending power has grown over the past few years, and
he wants their skiing experience to be memorable so that they want to return.
Tween Survey
• At the end of last year’s ski season, Luke asked 20 tweens four specific
questions:
Q1. On your car drive to the resort, which music streaming service
was playing?
Q2. Rate the quality of the food at the resort on a scale of 1 to 4.
Q3. What time should the main dining area close?
Q4. How much of your own money did you spend at the resort today?
Tween Survey Data
Variables and Scales of Measurement
Variable
• A variable is the general characteristic being observed on an object of
interest.
Types of Variables
• Qualitative – gender, race, political affiliation
• Quantitative – test scores, age, weight
Quantitative variables are either Discrete or Continuous.
Discrete Variable
• A discrete variable assumes a countable number of distinct values.
• Examples: Number of children in a family, number of points scored in a
basketball game.
Continuous Variable
• A continuous variable can assume an uncountable number of values within
some interval.
• Examples: Weight, height, investment return.
Scales of Measurement
Qualitative Variables
- Nominal
- Ordinal
Quantitative Variables
- Interval
- Ratio
Nominal Scale
• The least sophisticated level of measurement.
• Data are simply categories for grouping the data.
• Qualitative values may be converted to quantitative values for analysis purposes.
Ordinal Scale
• Ordinal data may be categorized and ranked with respect to some
characteristic or trait.
• For example, students are often evaluated on an ordinal scale
(excellent, good, fair, poor).
• Differences between categories are meaningless because the actual
numbers used may be arbitrary.
• There is no objective way to interpret the difference between categories
of student quality.
Tween Survey
• What is the scale of measurement for the music streaming data?
• Solution: These are nominal data—the values in the data differ merely in
name or label.
Tween Survey
• How are the data based on the ratings of the food quality similar to or
different from the music streaming data?
• Solution: These are ordinal data. Like the music streaming data they can be categorized, but unlike those data they can also be ranked.
Interval Scale
• Differences between values are equal and meaningful. Thus, the
arithmetic operations of addition and subtraction are meaningful.
• No “absolute 0” or starting point defined. Meaningful ratios may not be
obtained.
Interval Scale
• For example, consider the Fahrenheit scale of temperature.
• This scale is interval because the data are ranked and differences (+ or -) may be obtained.
• But there is no “absolute 0”: 0°F does not mean the absence of temperature.
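A short worked example shows why ratios are not meaningful on an interval scale while differences are. The temperatures chosen are arbitrary, and the helper name `fahrenheit_to_celsius` is illustrative; exact fractions are used to avoid rounding noise.

```python
from fractions import Fraction


def fahrenheit_to_celsius(f):
    """Exact conversion using fractions to avoid rounding noise."""
    return Fraction(f - 32) * 5 / 9


warm_f, cool_f = 80, 40
warm_c, cool_c = fahrenheit_to_celsius(warm_f), fahrenheit_to_celsius(cool_f)

# Ratios depend on the unit, so "twice as hot" has no meaning on an interval scale.
print(Fraction(warm_f, cool_f))   # 2
print(warm_c / cool_c)            # 6

# Equal differences stay equal after conversion, so differences are meaningful.
gap_1 = fahrenheit_to_celsius(80) - fahrenheit_to_celsius(40)
gap_2 = fahrenheit_to_celsius(60) - fahrenheit_to_celsius(20)
print(gap_1 == gap_2)             # True
```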
Ratio Scale
• The strongest level of measurement.
• Differences between values are equal and meaningful.
• There is an “absolute 0” or defined starting point. “0” does mean
“the absence of …” Thus, meaningful ratios may be obtained.
Ratio Scale
• The following variables are measured on a ratio scale:
General Examples: Weight and Distance
Business Examples: Sales, Profits, and Inventory Levels
Tween Survey
• How are the time data classified? In what ways do the time data differ from
ordinal data? What is a potential weakness of this measurement scale?
• Solution: Clock time responses are on an interval scale. Unlike ordinal data,
they allow meaningful differences to be calculated. However, there is no apparent
zero point, so ratios are not meaningful.
Tween Survey
• What is the measurement scale of the money data? Why is it considered the
most sophisticated form of data?
• Solution: Since the tweens’ responses are in dollar amounts, this is ratio-scaled
data; ratio-scaled data has a natural zero point which allows the calculation of
ratios.
Synopsis of Tween Survey
• 60% of the tweens listened to Spotify. The resort may want to direct its
advertising dollars to this streaming service.
• 55% of the tweens felt that the food was, at best, fair.
• 95% of the tweens would like the dining area to remain open later.
• 85% of the tweens spent their own money at the lodge.
Course Details
Course Plan
• Introduction
• Descriptive Statistics
• Introduction to Probability and Probability Distributions
• Sampling Distribution and Interval Estimation
• Hypothesis Testing
• ANOVA
• Regression Analysis
Textbook
• Jaggia and Kelly (2021), Business Statistics, McGraw Hill Education (India).
Evaluation Components
Components (Tentative) Weightage
Quiz 20%
Mid-Term 25%
End-Term 35%
Project 20%
Quiz
6 In-Class Quizzes and 4 Scheduled Quizzes
Best 4 out of 6 In-Class Quizzes and Best 2 out of 4 Scheduled
Quizzes will be taken for final grading
Quizzes will be mainly concept-based and may require minor
computing
In-Class Quizzes: 5 Questions, 5 Minutes
Scheduled Quizzes: 10 Questions, 12 Minutes
Mid-Term & End-Term Exams
Mainly Descriptive Questions
5/6 Questions
Total Marks: 50
Open book with Excel
Duration: 2-3 Hours
Project
Group project (groups to be decided by the end of the second week; final date
will be updated by TA)
Analysis on primary data is preferred
Two Submissions: Project Proposal & Final Submission
Project Proposal submission at the end of the 18th session (exact date
will be updated by TA)
Final Submission most likely on the day of the end-term exam (exact
date will be updated by TA)
Project Proposal
• Project Proposal Preparation Details (one page)
Title of the project
Introduction and Motivation for the Problem
Data Source/ Data Collection (If it is a survey, then a brief discussion about the
questionnaire)
Final Project Submission
• Final project report should have sections as follows:
A title of the project with introduction and motivation for the problem
Data Source(s)/Data Collection
Descriptive Statistics
Methodology
Results
Conclusion
• The data set should be provided in the Appendix.
• Project submission should include the data set and the report.
Potential Project Topics
• Effect of Pandemic on Online Shopping
• Students’ Perceptions towards the Quality of Online Education
• Future of EV Industry in India
• Future of Startup Industry in India
• Maternal Education and Child Health
Reading Materials
• Chapter 1 of Jaggia and Kelly
Sections 1.1, 1.2
• Homework: Introductory Case: Gaining Insights into Retail Customer Data