
Lecture 02

This lecture covers data collection methods in statistics, distinguishing between primary and secondary data, as well as structured and unstructured data. It discusses various sampling methods, including statistical and nonstatistical sampling, and emphasizes the importance of data quality and survey design. Key concepts such as variables, observations, and scales of measurement are also introduced.


LECTURE 2

Data Collection

Lecturer: Nguyen Thi Thu Van


Email: van.nguyen@ueh.edu.vn
What is Statistics?
Primary Data vs. Secondary Data

Primary data

• Collected by the investigator themselves.

Secondary data

• Collected and published earlier by someone else.
Types of Data
Business data comes in a wide variety of formats. All data, in all its
different formats, can be divided into two main categories:

Structured data

Structured data is data that is in a form that can be used to develop
statistical models, typically a matrix where rows are records and
columns are variables or features. For instance, numerical data,
Google Sheets, star ratings, etc.

Unstructured data

Unstructured data is data, often text data, that is heterogeneous in
format and requires considerable pre-processing before it can be
used in a model. For instance, videos, images, sounds, etc.

In the Statistics course, we work with structured data.


Data Quality Criteria
Assessing the quality of the data is very important
because it is impossible to produce high quality
statistical analysis and make good decisions from
poor quality data.
Contents

• Basic Concepts
• Scales of Measurement
• Sampling Concepts
• Sampling Methods
• Surveys
Basic Concepts
Let us start with the following data table.

Student Name   Gender   DOB           State          Tuition fee ($)
David Gold     Female   14-Oct-2006   Pennsylvania   10,600
Lemon Site     Male     17-Dec-2006   Michigan       8,500
Wilson Su      Male     22-Dec-2006   Texas          7,000
Keven Lee      Male     12-Sep-2006   Chicago        8,500

In this table, there are 4 observations (rows) and 5 columns, i.e., 5 variables.
There are 4 × 5 = 20 data values/points in total.
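The counts above can be checked with a short sketch. It assumes the table is stored as a list of dictionaries, one per observation; that storage choice and the variable names are illustrative, not part of the lecture.

```python
# Student table from the slide: one dict per observation (row),
# one key per variable (column).
students = [
    {"Student Name": "David Gold", "Gender": "Female", "DOB": "14-Oct-2006",
     "State": "Pennsylvania", "Tuition fee $": 10600},
    {"Student Name": "Lemon Site", "Gender": "Male", "DOB": "17-Dec-2006",
     "State": "Michigan", "Tuition fee $": 8500},
    {"Student Name": "Wilson Su", "Gender": "Male", "DOB": "22-Dec-2006",
     "State": "Texas", "Tuition fee $": 7000},
    {"Student Name": "Keven Lee", "Gender": "Male", "DOB": "12-Sep-2006",
     "State": "Chicago", "Tuition fee $": 8500},
]

n_observations = len(students)                 # number of rows
variables = list(students[0].keys())           # column names
n_variables = len(variables)                   # number of columns
n_values = sum(len(row) for row in students)   # rows x columns

print(n_observations, n_variables, n_values)   # 4 5 20
```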
Basic Terminology

• Variable: a characteristic of the items that we want to study (e.g., Name, Gender, DOB).

• Observation: a single member of the items that we want to study, such as a student, firm, or region. An observation is a set of variable values.

• Data set: all the values of all of the variables for all of the observations we chose to study.
Variable/Feature and Observation

In the student table above, each column (e.g., Gender) is a variable/feature, and each row (e.g., the row for Wilson Su) is an observation.

The data set consists of 4 observations and 5 variables. There are 20 data values/points in total.
Types of Variables
Time Series Data
• A sequence of observations collected at equally spaced points in time.
• Periodicity may be annual, quarterly, monthly, weekly, daily, hourly, etc.
• Example: daily closing price of a certain stock recorded last week.
• Used to study trends and patterns over time.


Cross-Sectional Data
• Each observation represents a different individual unit at the same point in time.
• Example: daily closing prices of a group of 10 stocks recorded on Aug 25, 2023.
• Used to study variation among observations or relationships.
Pooled Data
• Combine the two data types to get pooled cross-sectional and time series data.
• Examples:
  • Daily closing price of a group of 10 stocks recorded last week.
  • GNP per capita of all European countries over ten years.

          1st Quarter   2nd Quarter   3rd Quarter   4th Quarter
East      20.4          27.4          59            20.4
West      30.6          38.6          34.6          31.6
North     45.9          46.9          45            43.9
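The relationship among the three data types can be sketched with the region-by-quarter table. The dictionary layout and the names `pooled`, `east_series`, and `q1_cross_section` are assumptions made for illustration; the slide does not state what the numbers measure.

```python
# Pooled data: several units (regions) observed over several periods (quarters).
# Values are copied from the slide's table.
quarters = ["Q1", "Q2", "Q3", "Q4"]
pooled = {
    "East":  [20.4, 27.4, 59.0, 20.4],
    "West":  [30.6, 38.6, 34.6, 31.6],
    "North": [45.9, 46.9, 45.0, 43.9],
}

# Time series: ONE unit followed over time.
east_series = dict(zip(quarters, pooled["East"]))

# Cross-section: ALL units at ONE point in time (here, the 1st quarter).
q1_cross_section = {region: values[0] for region, values in pooled.items()}

print(east_series)
print(q1_cross_section)
```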
Scales of Measurement
How are variables defined and categorized?

Interval scale (examples)
• Temperature in Fahrenheit or Celsius
• Times of the day (1 pm, 2 pm, etc.)
• IQ scores (100, 110, 120, etc.)
• pH (pH of 2, pH of 3, etc.)
• SAT scores (900, 950, 1000, etc.)

Ratio scale (examples)
• Speed
• Height
• Weight
• Money
Sampling Concepts
Basic concepts of sampling
Population vs. Sample

Census vs. Sample

Parameter vs. Statistic

Sampling with/without replacement

Sampling error vs. Bias

Sampling method
Population vs. Sample
• A (target) population is the collection of all items of interest or under investigation; it could be finite or infinite.
  • A census method is an examination of all items in a defined population.
• A sample is an observed subset of the population.
  • A sample method is an examination of all items in a subset of the population.

A statistic is a specific characteristic of a sample.

A parameter is a specific characteristic of a population.

A population may be treated as infinite when the population size N is at least 20 times the sample size n (i.e., N/n ≥ 20).
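A minimal sketch of these definitions, assuming a made-up population of 2,000 tuition fees (the numbers are hypothetical and generated only for illustration): the population mean plays the role of a parameter, the sample mean is a statistic, and the N/n ≥ 20 rule is checked directly.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical finite population: 2,000 tuition fees (illustrative values only).
population = [rng.gauss(8500, 1500) for _ in range(2000)]
N = len(population)

# A sample is an observed subset of the population.
n = 50
sample = rng.sample(population, n)

parameter = statistics.mean(population)  # characteristic of the population
statistic = statistics.mean(sample)      # characteristic of the sample

# Rule of thumb from the slide: treat the population as infinite if N/n >= 20.
effectively_infinite = N / n >= 20

print(round(parameter, 1), round(statistic, 1), effectively_infinite)
```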
Two Ways of Collecting Samples

• Sampling with replacement can be defined as random sampling that allows sampling units to occur more than once, i.e., duplicates are allowed.

• If we do not allow duplicates when sampling, then we are sampling without replacement.

Duplicates are unlikely when n < 0.5% of N.
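The two ways of collecting a sample map onto two functions in Python's standard `random` module: `choices` draws with replacement and `sample` draws without replacement. The frame of ten numbered units below is a made-up example.

```python
import random

rng = random.Random(42)       # fixed seed so the sketch is repeatable
frame = list(range(1, 11))    # hypothetical sampling frame: units 1..10

# With replacement: the same unit may be drawn more than once.
with_replacement = rng.choices(frame, k=5)

# Without replacement: every drawn unit is distinct.
without_replacement = rng.sample(frame, k=5)

print(with_replacement)
print(without_replacement)
```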


What is Sampling Error?

Sampling error is the difference between a survey's result (the sample statistic) and the population value (the parameter).

Sampling bias is a systematic error that affects multiple samples, which occurs when some members of a population are systematically more likely to be selected in a sample than others.
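The difference between the two ideas can be illustrated with a small simulation. Everything here is an assumption made for the sketch: the population is the integers 1 to 1000, and the "biased" scheme can only ever select members from the upper half.

```python
import random
import statistics

rng = random.Random(1)
population = list(range(1, 1001))   # hypothetical population
mu = statistics.mean(population)    # population value (parameter)

def sampling_error(sample):
    """Sample mean minus the population mean."""
    return statistics.mean(sample) - mu

# Random sampling: errors vary in sign and tend to average out over many samples.
random_errors = [sampling_error(rng.sample(population, 50)) for _ in range(200)]

# Biased sampling: only the upper half can be selected, so every
# sample overshoots the population mean in the same direction.
upper_half = [x for x in population if x > 500]
biased_errors = [sampling_error(rng.sample(upper_half, 50)) for _ in range(200)]

avg_random_error = statistics.mean(random_errors)
avg_biased_error = statistics.mean(biased_errors)
print(round(avg_random_error, 2), round(avg_biased_error, 2))
```

Taking more samples shrinks the average random error toward zero, but it does not remove the bias.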
Sampling Methods
How to Select a Sample that is Representative of a Group?
Sampling Methods

Nonstatistical Sampling
• Convenience
• Judgment
• Focus group

Statistical Sampling
• Simple Random
• Systematic
• Stratified
• Cluster
Nonstatistical/Nonrandom Sampling
• Convenience Sample
  • Use a sample that happens to be available (e.g., ask co-workers' opinions at lunch).

• Judgment Sample
  • Use expert knowledge to choose "typical" items (e.g., which employees to interview).

• Focus Groups
  • In-depth dialog with a representative panel of individuals (e.g., iPhone users).
Simple Random Sampling
• Every member of the population has an equal chance of being selected.
• Every possible sample of a given size has an equal chance of being selected.
• Selection may be with replacement or without replacement.
• The sample can be obtained using a table of random numbers or a computer random number generator.
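The random-number approach can be sketched as follows: number the frame, generate random positions, and skip any position already drawn. The function name `simple_random_sample` and the frame of 20 student labels are hypothetical; in practice `random.sample` does the same job in one call.

```python
import random

def simple_random_sample(frame, n, rng):
    """Draw n distinct items by generating random positions and skipping repeats."""
    if n > len(frame):
        raise ValueError("sample size cannot exceed the frame size")
    seen = set()
    chosen = []
    while len(chosen) < n:
        i = rng.randrange(len(frame))   # every position is equally likely
        if i not in seen:               # skip duplicates (without replacement)
            seen.add(i)
            chosen.append(frame[i])
    return chosen

# Hypothetical sampling frame of 20 students.
frame = [f"student_{i:02d}" for i in range(1, 21)]
srs = simple_random_sample(frame, 5, random.Random(7))
print(srs)
```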
Systematic Random Sampling
• Decide on sample size: n
• Divide the frame of N individuals into n groups of k individuals: k = N/n
• Randomly select one individual from the first group
• Select every kth individual thereafter

Example: N = 64 and n = 8, so k = 64/8 = 8. One individual is selected at random from the first group of 8, then every 8th individual after it.
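The steps above can be sketched for the N = 64, n = 8 example. The function name `systematic_sample` is an illustrative choice, and the sketch assumes N is a multiple of n, as in the example.

```python
import random

def systematic_sample(frame, n, rng):
    """Pick a random start in the first group of k, then take every k-th item."""
    k = len(frame) // n          # group size, k = N / n
    start = rng.randrange(k)     # random position within the first group
    return [frame[start + j * k] for j in range(n)]

frame = list(range(1, 65))       # N = 64 individuals, numbered 1..64
sample = systematic_sample(frame, 8, random.Random(3))
print(sample)
```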
Stratified Random Sampling
• Divide the population into subgroups (called strata) according to some common characteristic (e.g., age, gender, occupation)
• Select a simple random sample from each subgroup
• Combine the samples from the subgroups into one

[Figure: a population divided into 4 strata, with a sample drawn from each stratum.]
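A minimal sketch of the three steps, assuming 40 hypothetical units split into 4 strata and an equal number drawn from each stratum. Equal allocation is an assumption of this sketch; the slide does not say how many units to take per stratum.

```python
import random
from collections import defaultdict

def stratified_sample(units, stratum_of, n_per_stratum, rng):
    """Group units into strata, then draw a simple random sample from each."""
    strata = defaultdict(list)
    for unit in units:
        strata[stratum_of(unit)].append(unit)
    combined = []
    for name in sorted(strata):
        combined.extend(rng.sample(strata[name], n_per_stratum))
    return combined

# Hypothetical population: 40 units, each tagged with one of 4 strata.
units = [(i, "ABCD"[i % 4]) for i in range(40)]
sample = stratified_sample(units, lambda u: u[1], 3, random.Random(5))
print(sample)
```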
Cluster Sampling
• Divide the population into several "clusters" (e.g., regions), each representative of the population
• One-stage cluster sampling: randomly select k clusters and include every element in them
• Two-stage cluster sampling: randomly select k clusters and then choose a random sample of elements within each cluster.

[Figure: a population divided into 16 clusters, with the randomly selected clusters highlighted.]
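Both variants can be sketched with one function. The 16 clusters of 10 units each, the function name `cluster_sample`, and the `n_within` parameter are all hypothetical choices for illustration.

```python
import random

def cluster_sample(clusters, k, rng, n_within=None):
    """Select k clusters at random.

    One-stage (n_within is None): keep every element of the chosen clusters.
    Two-stage: draw n_within elements at random inside each chosen cluster.
    """
    chosen = rng.sample(sorted(clusters), k)
    sample = []
    for c in chosen:
        members = clusters[c]
        if n_within is None:
            sample.extend(members)
        else:
            sample.extend(rng.sample(members, n_within))
    return chosen, sample

# Hypothetical population: 16 clusters with 10 units each.
clusters = {c: [f"c{c}_u{j}" for j in range(10)] for c in range(16)}
rng = random.Random(11)

chosen_one, one_stage = cluster_sample(clusters, 4, rng)
chosen_two, two_stage = cluster_sample(clusters, 4, rng, n_within=3)
print(chosen_one, len(one_stage))
print(chosen_two, len(two_stage))
```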
Survey
How to Conduct a Survey?
S1. Determine your objectives

S2. Select respondents

S3. Create a data analysis plan

S4. Develop the survey

S5. Pre-test the survey

S6. Distribute and conduct the survey

S7. Analyze the survey

S8. Report the results


How to Design a Survey Questionnaire?

S1. Define your goals and objectives

S2. Use questions that are suitable for your sample

S3. Decide on your questionnaire length and question order

S4. Pre-test your questionnaire

What Types of Questions Should Be Used?

• Open-ended
• Fill-in-the-blank
• Check boxes
• Ranked choices
• Pictograms
• Likert scale
Likert Scales
"College-bound high school students should be required to
study a foreign language." (check one)
☐ Strongly Agree
☐ Somewhat Agree
☐ Neither Agree Nor Disagree
☐ Somewhat Disagree
☐ Strongly Disagree

A Likert scale is a rating scale used to measure opinions, attitudes, or behaviors.

Strictly speaking, the Likert scale is an ordinal scale, but it is often treated as an interval scale in social research.
Cronbach’s Alpha
Cronbach's alpha coefficient measures the internal
consistency, or reliability, of a set of survey items.
This coefficient helps determine whether a collection
of items consistently measures the same
characteristic.
Cronbach’s alpha quantifies the level of agreement
on a standardized 0 to 1 scale. Higher values
indicate higher agreement between items.
Example. [https://statisticsbyjim.com/basics/cronbachs-alpha/]
A bank wants to survey customers to evaluate how satisfied
they are with the timeliness of its service. We develop the
following four survey questions:
Item 1 – My telephone, email, or letter inquiry was answered in
a reasonable amount of time.
Item 2 – I am satisfied with the timeliness of the service
provided.
Item 3 – The time I waited for services was reasonable.
Item 4 – I am satisfied with the services I received.
These questions all use a 5-point Likert scale ranging from 1
Very Dissatisfied to 5 Very Satisfied.
60 customers are asked to take the survey during the pilot
study phase before distributing the survey more widely.

[Figure: Cronbach's alpha output for the four items, annotated "acceptable".]

Removing Item 4 causes Cronbach's alpha to increase from 0.7853 to 0.921674. This result suggests that only Items 1, 2, and 3 measure customer service timeliness. In conclusion, we should either remove Item 4 or reword and retest it.
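The same kind of check can be sketched in code using the standard formula alpha = k/(k-1) × (1 − sum of item variances / variance of total scores), where k is the number of items. The six respondents' scores below are made up for illustration and are not the bank's pilot data, so the resulting values differ from those above.

```python
import statistics

def cronbach_alpha(items):
    """items: list of item columns, each holding one score per respondent."""
    k = len(items)
    totals = [sum(scores) for scores in zip(*items)]   # total score per respondent
    item_variance_sum = sum(statistics.variance(col) for col in items)
    return k / (k - 1) * (1 - item_variance_sum / statistics.variance(totals))

def alpha_if_item_deleted(items):
    """Alpha recomputed with each item left out in turn."""
    return [cronbach_alpha(items[:i] + items[i + 1:]) for i in range(len(items))]

# Hypothetical 5-point Likert responses from 6 respondents.
item1 = [5, 4, 2, 3, 1, 4]
item2 = [5, 4, 1, 3, 2, 4]
item3 = [4, 5, 2, 3, 1, 5]
item4 = [3, 1, 4, 2, 5, 3]   # deliberately out of step with items 1-3
items = [item1, item2, item3, item4]

alpha_all = cronbach_alpha(items)
alpha_without_item4 = alpha_if_item_deleted(items)[3]
print(round(alpha_all, 3), round(alpha_without_item4, 3))
```

A large jump in alpha when one item is left out is the signal that the item does not measure the same characteristic as the others.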
-- The End of Topic --
Thank You!
