LECTURE 2
Data Collection
Lecturer: Nguyen Thi Thu Van
Email: van.nguyen@ueh.edu.vn
What is Statistics?
Primary Data vs. Secondary Data
Primary data
• Collected directly by the investigator.
Secondary data
• Collected and published earlier by someone else.
Types of Data
Business data comes in a wide variety of formats. All data, whatever its
format, can be divided into two main categories:
Structured data
Structured data is data that is in a form that can be used to develop
statistical models, typically a matrix where rows are records and
columns are variables or features. For instance, numerical data,
Google Sheets, star ratings, etc.
Unstructured data
Unstructured data is data, often text data, that is heterogeneous in
format and requires considerable pre-processing before it can be
used in a model. For instance, videos, images, sounds, etc.
In the Statistics course, we work with structured data.
Data Quality Criteria
Assessing the quality of the data is very important
because it is impossible to produce high quality
statistical analysis and make good decisions from
poor quality data.
Contents
Basic Concepts
Scales of Measurement
Sampling Concepts
Sampling Methods
Surveys
Basic Concepts
Let us start with the following data table
Student Name   Gender   DOB           State          Tuition fee ($)
David Gold     Female   14-Oct-2006   Pennsylvania   10,600
Lemon Site     Male     17-Dec-2006   Michigan        8,500
Wilson Su      Male     22-Dec-2006   Texas           7,000
Keven Lee      Male     12-Sep-2006   Chicago         8,500
This table has 4 observations (rows) and 5 variables (columns), so
there are 4 × 5 = 20 data values/points in total.
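The counts above can be checked with a minimal Python sketch. The values are copied from the table; the key names (`name`, `gender`, `dob`, `state`, `tuition_fee`) are labels chosen here for illustration and are not part of the lecture.

```python
# Each dict is one observation (row); each key is one variable (column).
students = [
    {"name": "David Gold", "gender": "Female", "dob": "14-Oct-2006",
     "state": "Pennsylvania", "tuition_fee": 10600},
    {"name": "Lemon Site", "gender": "Male", "dob": "17-Dec-2006",
     "state": "Michigan", "tuition_fee": 8500},
    {"name": "Wilson Su", "gender": "Male", "dob": "22-Dec-2006",
     "state": "Texas", "tuition_fee": 7000},
    {"name": "Keven Lee", "gender": "Male", "dob": "12-Sep-2006",
     "state": "Chicago", "tuition_fee": 8500},
]

n_observations = len(students)                 # number of rows
n_variables = len(students[0])                 # number of columns
n_values = sum(len(row) for row in students)   # rows x columns

print(n_observations, n_variables, n_values)
```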
Basic Terminology
Variable: a characteristic about the items that
we want to study (e.g., Name, Gender, DOB).
Observation: a single member of items that we
want to study, such as a student, firm, or region.
An observation is a set of variable values.
Data set: all the values of all of the variables for
all of the observations we chose to study.
Variable/Feature: each column of the table above (Student Name, Gender,
DOB, State, Tuition fee).
Observation: each row of the table above (e.g., the row for Wilson Su).
The data set consists of 4 observations and 5 variables. There
are 20 data values/points in total.
Types of Variables
Time Series Data
A sequence of observations on the same unit, collected at equally
spaced points in time.
Periodicity may be annual, quarterly, monthly,
weekly, daily, hourly, etc.
Example: daily closing price of a certain stock
recorded last week.
Used to study trends and patterns over time.
Cross-Sectional Data
Each observation represents a different
individual unit at the same point in time.
Example: daily closing prices of a group of 10
stocks recorded on Aug 25, 2023.
Used to study variation among observations or
relationships.
Pooled Data
Combine the two data types to get pooled cross-sectional and
time series data.
Example:
Daily closing price of a group of 10 stocks recorded
last week.
GNP per capita of all European countries over ten
years.
         1st Quarter   2nd Quarter   3rd Quarter   4th Quarter
East     20.4          27.4          59            20.4
West     30.6          38.6          34.6          31.6
North    45.9          46.9          45            43.9
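As a rough illustration of how pooled data contains both of the other two types, the sketch below stores the quarterly table and slices it two ways. The labels `Q1` to `Q4` and the variable names are shorthand introduced here; the lecture does not state what the figures measure, so no unit is assumed.

```python
quarters = ["Q1", "Q2", "Q3", "Q4"]

# Pooled data: several units (regions) observed over several periods.
pooled = {
    "East":  [20.4, 27.4, 59.0, 20.4],
    "West":  [30.6, 38.6, 34.6, 31.6],
    "North": [45.9, 46.9, 45.0, 43.9],
}

# Time series: one unit (East) followed across all periods.
east_series = dict(zip(quarters, pooled["East"]))

# Cross-section: all units observed in the same period (3rd quarter).
q3_index = quarters.index("Q3")
q3_cross_section = {region: values[q3_index] for region, values in pooled.items()}

print(east_series)
print(q3_cross_section)
```

Reading along a row gives a time series; reading down a column gives a cross-section.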
Scales of Measurement
How are variables defined and categorized?
Interval scale examples:
• Temperature in Fahrenheit or Celsius
• Times of the day (1 pm, 2 pm, etc.)
• IQ scores (100, 110, 120, etc.)
• pH (pH of 2, pH of 3, etc.)
• SAT scores (900, 950, 1000, etc.)
Ratio scale examples:
• Speed
• Height
• Weight
• Money
Sampling Concepts
Basic concepts of sampling
Population vs. Sample
Census vs. Sample
Parameter vs. Statistic
Sampling with/without replacement
Sampling error vs. Bias
Sampling method
Population vs. Sample
A (target) population is the collection of all
items of interest or under investigation. It may
be finite or infinite.
A census method is an examination of all items in a
defined population.
A sample is an observed subset of the
population.
A sample method is an examination of all items in a
subset of the population.
A statistic is a specific characteristic of
a sample.
A parameter is a specific characteristic of
a population.
A population may be treated as infinite when the population
size N is at least 20 times the sample size n (i.e., N/n ≥ 20).
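The distinction between a parameter and a statistic, and the N/n ≥ 20 rule of thumb, can be sketched as follows. This is only an illustration: it treats the four tuition fees from the earlier table as a whole population and the first two as a sample, which is an assumption made here, and the function name `is_effectively_infinite` is hypothetical.

```python
from statistics import mean

def is_effectively_infinite(population_size, sample_size):
    """Rule of thumb from the lecture: treat as infinite when N/n >= 20."""
    return population_size / sample_size >= 20

# Assumption for illustration: these four fees are the entire population.
tuition_population = [10600, 8500, 7000, 8500]
tuition_sample = tuition_population[:2]    # an observed subset

parameter = mean(tuition_population)   # characteristic of the population
statistic = mean(tuition_sample)       # characteristic of the sample

print(parameter, statistic)
print(is_effectively_infinite(2000, 100))   # N/n = 20
print(is_effectively_infinite(1000, 100))   # N/n = 10
```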
Two Ways of Collecting Samples
Sampling with replacement can be defined as random
sampling that allows sampling units to occur more than
once, i.e., duplicates are allowed.
If we do not allow duplicates
when sampling, then we are
sampling without replacement.
Duplicates are unlikely when n < 0.5% N.
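A minimal sketch of the two ways of drawing a sample, using Python's standard `random` module. The population of 20 numbered units and the seed are arbitrary choices made here so the run is repeatable.

```python
import random

rng = random.Random(42)            # fixed seed so the run is repeatable
population = list(range(1, 21))    # 20 hypothetical unit IDs

# With replacement: the same unit may be drawn more than once.
with_replacement = rng.choices(population, k=10)

# Without replacement: every drawn unit is distinct.
without_replacement = rng.sample(population, k=10)

print(sorted(with_replacement))
print(sorted(without_replacement))
```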
What is Sampling Error?
Sampling error is the difference between a survey's result
(the sample statistic) and the true population value.
Sampling bias is a systematic error that affects multiple samples,
which occurs when some members of a population are
systematically more likely to be selected in a sample than others.
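The difference between random sampling error and sampling bias can be seen in a small simulation. Everything here is an assumption made for illustration: the population is the numbers 1 to 1000, and the biased scheme can only reach the upper half of it.

```python
import random
from statistics import mean

rng = random.Random(0)
population = list(range(1, 1001))    # hypothetical values 1..1000
true_mean = mean(population)         # the population parameter

# Unbiased: simple random samples. Errors vary in sign and average near zero.
random_errors = [mean(rng.sample(population, 50)) - true_mean
                 for _ in range(200)]

# Biased: only the upper half can be selected, so every sample overshoots.
upper_half = population[500:]
biased_errors = [mean(rng.sample(upper_half, 50)) - true_mean
                 for _ in range(200)]

print(round(mean(random_errors), 2))
print(round(mean(biased_errors), 2))
```

Taking more samples averages out the random errors, but the biased errors all point the same way and do not cancel.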
Sampling Methods
How to Select a Sample that is Representative of a Group?
Nonstatistical sampling: convenience, judgment, focus group
Statistical sampling: simple random, systematic, stratified, cluster
Nonstatistical/Nonrandom Sampling
Convenience Sample
Use a sample that happens to be available (e.g., ask
co-worker opinions at lunch).
Judgment Sample
Use expert knowledge to choose “typical” items (e.g.,
which employees to interview).
Focus Groups
In-depth dialog with a representative panel of
individuals (e.g. iPhone users).
Simple Random Sampling
Every member of the population has an equal
chance of being selected
Every possible sample of a given size has an
equal chance of being selected
Selection may be with replacement or without
replacement
The sample can be obtained using a table of
random numbers or computer random number
generator
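One way to mimic the random-number-table procedure in code is sketched below, assuming the frame units are numbered 1 to N. The function name `srs_by_random_numbers` and the unit labels are hypothetical; repeated numbers are skipped, so this version samples without replacement.

```python
import random

def srs_by_random_numbers(frame, n, seed=None):
    """Draw random positions 1..N, skipping repeats, until n units are chosen."""
    if n > len(frame):
        raise ValueError("sample size cannot exceed the frame size")
    rng = random.Random(seed)
    chosen = []
    while len(chosen) < n:
        number = rng.randint(1, len(frame))   # inclusive on both ends
        if number not in chosen:              # skip duplicates
            chosen.append(number)
    return [frame[number - 1] for number in chosen]

frame = [f"unit_{i:02d}" for i in range(1, 65)]   # hypothetical frame, N = 64
sample = srs_by_random_numbers(frame, 8, seed=1)
print(sample)
```

In practice `random.sample(frame, n)` gives the same kind of sample in one call; the loop is spelled out only to mirror the manual procedure.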
Systematic Random Sampling
Decide on sample size: n
Divide frame of N individuals into n groups of k
individuals: k = N/n
Randomly select one individual from the first
group
Select every kth individual thereafter
Example: N = 64, n = 8, so k = N/n = 8. One individual is randomly
selected from the first group of 8.
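The steps above can be sketched in a few lines of Python for N = 64 and n = 8. The function name `systematic_sample` and the seed are choices made here, and the sketch assumes N is an exact multiple of n.

```python
import random

def systematic_sample(frame, n, seed=None):
    """Pick a random start in the first group, then every k-th unit."""
    rng = random.Random(seed)
    k = len(frame) // n          # interval k = N/n (assumes N divisible by n)
    start = rng.randrange(k)     # random position within the first group
    return [frame[start + i * k] for i in range(n)]

frame = list(range(1, 65))       # units numbered 1..64, so N = 64
sample = systematic_sample(frame, 8, seed=3)
print(sample)
```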
Stratified Random Sampling
Divide population into subgroups (called strata)
according to some common characteristic (e.g.
age, gender, occupation)
Select a simple random sample from each
subgroup
Combine samples from subgroups into one
[Figure: a population divided into 4 strata, with a sample drawn from
each stratum]
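A minimal sketch of stratified random sampling follows. It assumes the same number of units is drawn from every stratum (the lecture does not say whether allocation is equal or proportional), and the `people` records, age groups, and function name are hypothetical.

```python
import random
from collections import defaultdict

def stratified_sample(units, stratum_of, n_per_stratum, seed=None):
    """Group units into strata, then draw a simple random sample from each."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for unit in units:
        strata[stratum_of(unit)].append(unit)
    sample = []
    for name in sorted(strata):       # fixed order keeps runs repeatable
        sample.extend(rng.sample(strata[name], n_per_stratum))
    return sample

# Hypothetical population of 40 people in 4 age-group strata.
groups = ["18-29", "30-44", "45-59", "60+"]
people = [{"id": i, "age_group": groups[i % 4]} for i in range(40)]

sample = stratified_sample(people, lambda p: p["age_group"], 3, seed=7)
print([(p["id"], p["age_group"]) for p in sample])
```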
Cluster Sampling
Divide population into several “clusters” (e.g.
regions), each representative of the population
One-stage cluster sampling: randomly select k clusters and include every element in them.
Two-stage cluster sampling: randomly select k clusters and then
choose a random sample of elements within each cluster.
[Figure: a population divided into 16 clusters, with randomly selected
clusters forming the sample]
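One-stage and two-stage cluster sampling can be sketched with a single function. The 16 regions of 10 units each, the function name `cluster_sample`, and the parameter `n_within` are all hypothetical choices made for this illustration.

```python
import random

def cluster_sample(clusters, k, n_within=None, seed=None):
    """Randomly pick k clusters; take all members (one-stage) or
    n_within random members from each (two-stage)."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), k)
    sample = []
    for name in chosen:
        members = clusters[name]
        if n_within is None:
            sample.extend(members)                          # one-stage
        else:
            sample.extend(rng.sample(members, n_within))    # two-stage
    return chosen, sample

# Hypothetical population: 16 clusters (regions) of 10 units each.
clusters = {f"region_{c:02d}": [f"r{c:02d}_u{u}" for u in range(10)]
            for c in range(1, 17)}

one_chosen, one_stage = cluster_sample(clusters, 4, seed=5)
two_chosen, two_stage = cluster_sample(clusters, 4, n_within=3, seed=5)
print(one_chosen, len(one_stage), len(two_stage))
```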
Survey
How to Conduct a Survey?
S1. Determine your objectives
S2. Select respondents
S3. Create a data analysis plan
S4. Develop the survey
S5. Pre-test the survey
S6. Distribute and conduct the survey
S7. Analyze the survey
S8. Report the results
How to Design a Survey Questionnaire?
S1. Define your goals and objectives
S2. Use questions that are suitable
for your sample
S3. Decide on your questionnaire
length and question order
S4. Pre-test your questionnaire
What Types of Questions Should Be Used?
Open-ended
Fill-in-the-blank
Check boxes
Ranked choices
Pictograms
Likert scale
Likert Scales
“College-bound high school students should be required to
study a foreign language.” (check one)
Strongly Agree / Somewhat Agree / Neither Agree Nor Disagree /
Somewhat Disagree / Strongly Disagree
A Likert scale is a rating scale
used to measure opinions,
attitudes, or behaviors.
Strictly speaking, the Likert scale is an ordinal scale, but it is
often treated as an interval scale in social research.
Cronbach’s Alpha
Cronbach's alpha coefficient measures the internal
consistency, or reliability, of a set of survey items.
This coefficient helps determine whether a collection
of items consistently measures the same
characteristic.
Cronbach’s alpha quantifies the level of agreement
on a standardized 0 to 1 scale. Higher values
indicate higher agreement between items.
Example. [https://statisticsbyjim.com/basics/cronbachs-alpha/]
A bank wants to survey customers to evaluate how satisfied
they are with the timeliness of its service. We develop the
following four survey questions:
Item 1 – My telephone, email, or letter inquiry was answered in
a reasonable amount of time.
Item 2 – I am satisfied with the timeliness of the service
provided.
Item 3 – The time I waited for services was reasonable.
Item 4 – I am satisfied with the services I received.
These questions all use a 5-point Likert scale ranging from 1
Very Dissatisfied to 5 Very Satisfied.
60 customers are asked to take the survey during the pilot
study phase before distributing the survey more widely.
[Figure: Cronbach's alpha output, annotated as "acceptable"]
Removing Item 4 causes Cronbach’s alpha to increase from
0.7853 to 0.921674. This result suggests that only items 1, 2,
and 3 measure customer service timeliness. In conclusion,
we should either remove item 4 or reword and retest it.
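To make the "alpha if item removed" comparison concrete, here is a small sketch using the standard formula alpha = k/(k − 1) × (1 − sum of item variances / variance of the total score), where k is the number of items. The responses below are hypothetical numbers invented for illustration, not the bank's pilot data, so the resulting values will not match 0.7853 or 0.921674.

```python
from statistics import variance

def cronbach_alpha(responses):
    """responses: one row per respondent, each row a list of item scores."""
    k = len(responses[0])
    item_variances = [variance([row[j] for row in responses]) for j in range(k)]
    total_variance = variance([sum(row) for row in responses])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)

# Hypothetical 5-point answers from six respondents to four items.
# Items 1 to 3 move together; item 4 does not follow the same pattern.
responses = [
    [5, 4, 5, 3],
    [4, 4, 4, 1],
    [2, 2, 3, 4],
    [3, 3, 3, 2],
    [1, 2, 1, 5],
    [4, 5, 4, 2],
]

alpha_all = cronbach_alpha(responses)
alpha_without_item4 = cronbach_alpha([row[:3] for row in responses])
print(round(alpha_all, 3), round(alpha_without_item4, 3))
```

As in the bank example, alpha rises once the item that does not track the others is dropped.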
-- The End of Topic --
Thank You!