MODULE STATISTICAL ANALYSIS WITH SOFTWARE APPLICATION – CAE11
CHAPTER 2: Data Collection and Basic Concepts in Sampling Design
Objectives:
1. Determine the sources of data (primary and secondary data).
2. Determine the appropriate sample size.
3. Differentiate various sampling techniques.
DATA COLLECTION
Everybody collects, interprets and uses information, much of it in numerical or statistical form, in
day-to-day life. People receive large quantities of information every day through
conversations, television, computers, radio, newspapers, posters, notices and instructions. It is
because so much information is available that people need to be able to absorb, select and reject it. In
everyday life, in business and in industry, certain statistical information is necessary, and it is important to
know where to find it and how to collect it.
Analysis of data can lead to powerful results. Data can be used to offset anecdotal claims, such as
the suggestion that cellular telephones cause brain cancer. Anecdotal means that the information being
conveyed is based on casual observation, not scientific research. Because data are powerful, they can be
dangerous when misused. The misuse of data usually occurs when data are incorrectly obtained or analyzed.
For example, radio or television talk shows regularly ask poll questions for which respondents must call in or
use the Internet to supply their vote. Most likely, the individuals who are going to call in are those who have
a strong opinion about the topic. This group is not likely to be representative of people in general, so the
results of the poll are not meaningful. Whenever we look at data, we should be mindful of where the data
come from.
Even when data tell us that a relation exists, we need to investigate. For example, a study showed
that breast-fed children have higher IQs than those who were not breast-fed. Does this study mean that a
mother who breast-feeds her child will increase the child’s IQ? Not necessarily. It may be that some factor
other than breast-feeding contributes to the IQ of the children. In this case, it turns out that mothers who
breastfeed generally have higher IQs than those who do not. Therefore, it may be genetics that leads to the
higher IQ, not breast-feeding.
Data collection is the process of gathering and measuring information on variables of interest, in an
established systematic fashion that enables one to answer stated research questions, test hypotheses, and
evaluate outcomes.
Without proper planning for data collection, a number of problems can occur. If the data collection
steps and processes are not properly planned, the research project can ultimately end up with a data set that
does not serve the purpose for which it was intended. For example, if more than one person is involved in
the data collection, but data collectors do not follow consistent data collection practices, they can end up with
data with different units, collection processes, and variable names.
Consequences of Improperly Collected Data
• Inability to answer research questions accurately.
• Inability to repeat and validate the study.
• Distorted findings resulting in wasted resources.
• Misleading other researchers to pursue fruitless avenues of investigation.
• Compromising decisions for public policy.
• Causing harm to human participants and animal subjects.
Steps in Data Gathering
1. Set the objectives for collecting data.
2. Determine the data needed based on the set objectives.
3. Determine the method to be used in data gathering and define the comprehensive
data collection points.
4. Design data gathering forms to be used.
5. Collect data.
Choosing the Method of Data Collection
Decision-makers need information that is relevant, timely, accurate and usable. The cost of obtaining,
processing and analyzing these data is high. The challenge is to find methods that yield information that is
cost-effective, relevant, timely and important for immediate use. Some methods pay attention to timeliness
and reduction in cost. Others pay attention to accuracy and the strength of the method in using scientific
approaches.
Statistical data may be classified under two categories, depending upon the source:
Primary Data and Secondary Data.
SOURCES OF DATA
Whether conducting research in the social sciences, humanities, arts, or natural sciences, the ability to
distinguish between primary and secondary sources is essential.
• Primary Sources - Provide a first-hand account of an event or time period and are considered to be
authoritative. They represent original thinking, reports on discoveries or events, or they can share new
information. Often these sources are created at the time the events occurred but they can also include
sources that are created later. They are usually the first formal appearance of original research.
Primary Data - are data documented by the primary source. The data collectors documented the data
themselves. First-hand information obtained by the investigator is more reliable and accurate, since the
investigator can extract the correct information by removing any doubts in the minds of the respondents
regarding certain questions. High response rates might be obtained, since the answers to various questions
are obtained on the spot. It also permits explanation of questions concerning difficult subject matter.
• Secondary Sources - Offer an analysis, interpretation or a restatement of primary sources and are
considered to be persuasive. They often involve generalisation, synthesis, interpretation,
commentary or evaluation in an attempt to convince the reader of the creator's argument. They often
attempt to describe or explain primary sources.
Secondary Data - are data documented by a secondary source. The data collectors had the data documented
by other sources.
Data are primary for the agency that collected them, and become secondary
for someone else who uses them for their own purposes.
Secondary data are less expensive to collect both in money and time. These data can also be better
utilized and sometimes the quality of such data may be better because these might have been collected by
persons who were specially trained for that purpose.
On the other hand, such data must be used with great care, because such data may also be full of
errors due to the fact that the purpose of the collection of the data by the primary agency may have been
different from the purpose of the user of these secondary data.
Secondly, there may have been bias introduced, the size of the sample may have been inadequate, or there
may have been arithmetic or definition errors, hence, it is necessary to critically investigate the validity of
the secondary data.
The primary data can be collected by the following five methods:
1. Direct personal interviews – The researcher has direct contact with the interviewee. The researcher gathers
information by asking questions to the interviewee.
2. Indirect/Questionnaire Method – This method of data collection involves sourcing and accessing
data that are collected for the purpose of the study. Designing good “questioning tools” forms an
important and time-consuming phase in the development of most research proposals. Once the decision has
been made to use these techniques, the following questions should be considered before designing the
tools:
• What exactly do we want to know, according to the objectives and variables we identified earlier? Is
questioning the right technique to obtain all answers, or do we need additional techniques, such as
observations or analysis of records?
• Of whom will we ask questions and what techniques will we use? Do we understand the topic sufficiently to
design a questionnaire, or do we need some loosely structured interviews with key informants or a focus
group discussion first to orient ourselves?
• Are our informants mainly literate or illiterate? If illiterate, the use of self-administered questionnaires is not
an option.
• How large is the sample that will be interviewed? Studies with many respondents often use shorter, highly
structured questionnaires, whereas smaller studies allow more flexibility and may use questionnaires with a
number of open-ended questions.
Key Design Principles of a Good Questionnaire
1. Keep the questionnaire as short as possible.
2. Decide on the type of questionnaire (Open Ended or Closed Ended).
3. Write the questions properly.
4. Order the questions appropriately.
5. Avoid questions that prompt or motivate the respondent to say what you would like to hear.
6. Write an introductory letter or an introduction.
7. Write special instructions for interviewers or respondents.
8. Translate the questions if necessary.
9. Always pre-test your questions before conducting the survey.
An open-ended question is a type of question that does not include response categories. The respondent is
not given any possible answers to choose from. This type of question is usually appropriate for collecting
subjective data. It permits free responses that should be recorded in the respondent’s own words.
Example:
• Can you describe exactly what the traditional birth attendant did when your labor started?
• What do you think are the reasons for a high drop-out rate of village health committee members?
A closed-ended question is a type of question that includes a list of response categories from which the
respondent will select his answer. It is useful if the range of possible responses is known. This type of
question is usually appropriate for collecting objective data.
Example:
Did you eat any of the following foods yesterday?
• Fish or meat      Yes   No
• Eggs              Yes   No
• Milk or cheese    Yes   No
Take Note!
Question wording and question order have a large effect on the responses obtained.
Example:
Two surveys were taken in late 1993/early 1994 about Elvis Presley.
One survey asked: “In the past few years, there have been a lot of rumors and stories about whether Elvis
Presley is really dead. How do you feel about this? Do you think there is any possibility that these rumors are
true and that Elvis Presley is still alive, or don’t you think so?”
The second survey asked: “A recent television show examined various theories about Elvis Presley’s death. Do
you think it is possible that Elvis is alive or not?”
8% of the respondents to the first question said it is possible that Elvis is still alive and 16% of respondents
to the second question said it is possible that Elvis is still alive.
3. A focus group is a group interview of approximately six to twelve people who share similar characteristics
or common interests. A facilitator guides the group based on a predetermined set of topics.
4. Experiment is a method of collecting data where there is direct human intervention on the conditions that
may affect the values of the variable of interest.
Bear in mind that the experimental method has several limitations that you should be aware of.
- Ethical, moral, and legal concerns
- Unrealistic controlled environments
- Inability to control for all variables
5. Observation is a technique that involves systematically selecting, watching and recording behaviors of
people or other phenomena and aspects of the setting in which they occur, for the purpose of gaining
specified information. It includes all methods from simple visual observations to the use of high-level
machines and measurements, sophisticated equipment or facilities such as:
- Radiographic
- Biochemical
- X-ray machines
- Microscope
- Clinical examinations
- Microbiological examinations
It gives relatively more accurate data on behavior and activities, but it is subject to the investigator's or
observer's own biases, prejudices and desires, and it needs more resources and skilled manpower when high-level
machines are used.
The secondary data can be collected by the following five methods:
1. Published reports in newspapers and periodicals.
2. Financial data reported in annual reports.
3. Records maintained by the institution.
4. Internal reports of the government departments.
5. Information from official publications.
Take Note!
• Always investigate the validity and reliability of the data by examining the collection method employed by
your source.
• Do not use inappropriate data for your research.
• The choice of methods of data collection is largely based on the accuracy of the information they yield.
SAMPLE SIZE
“How many participants should be chosen for a survey?”
One of the most frequent problems in statistical analysis is the determination of the appropriate sample size.
One may ask why sample size is so important. The answer is that an appropriate sample size is
required for validity. If the sample size is too small, it will not yield valid results, and any conclusions
drawn from it will be questionable. An appropriate sample size produces accurate results.
A sample size that is too large wastes money and time, because a sample that is just large enough will normally
give an accurate result.
The sample size is typically denoted by n and it is always a positive integer. No exact sample size can be
prescribed here, because it varies across research settings. However, all else being equal, a larger
sample leads to increased precision in estimates of various properties of the population.
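The link between sample size and precision can be illustrated with a short Python sketch. It assumes the standard margin-of-error relation for a sample mean, e = Zσ/√n, and uses a hypothetical standard deviation of 10 together with the 95% z-score of 1.96. These numbers are chosen only for illustration and do not come from the module.

```python
import math


def margin_of_error(sigma, n, z=1.96):
    """Margin of error for a sample mean: e = z * sigma / sqrt(n)."""
    if n <= 0:
        raise ValueError("n must be a positive integer")
    return z * sigma / math.sqrt(n)


# Hypothetical population standard deviation, used only for illustration.
SIGMA = 10.0

# Print how the margin of error shrinks as the sample grows.
for n in (30, 100, 400, 1600):
    print(f"n = {n:5d}  margin of error = {margin_of_error(SIGMA, n):.2f}")
```

Under this relation, quadrupling the sample size only halves the margin of error, which is one reason why a very large sample can waste resources.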
Take Note!
- Representativeness, not size, is the more important consideration.
- Use no less than 30 subjects if possible.
- If you use complex statistics, you may need a minimum of 100 or more in your sample (varies with
method).
Representative Sample
Desired Confidence Level    Z-Score
80%                         1.28
85%                         1.44
90%                         1.65
95%                         1.96
99%                         2.58
The choice of sample size depends on non-statistical considerations and statistical considerations.
• Non-statistical considerations – These may include availability of resources, manpower, budget, ethics and sampling
frame.
• Statistical considerations – These include the desired precision of the estimate.
Three criteria need to be specified to determine the appropriate sample size:
1. Level of Precision
Also called sampling error, the level of precision is the range in which the true value of the population is
estimated to be.
2. Confidence Level
It is a statistical measure of the number of times out of 100 that results can be expected to fall within a
specified range. For example, a confidence level of 90% means that about 90 out of 100 samples drawn in
the same way are expected to give results within that range.
To find the right z-score to use, refer to the table of desired confidence levels and z-scores.
3. Degree of Variability
Depending upon the target population and attributes under consideration, the degree of variability varies
considerably. The more heterogeneous a population is, the larger the sample size required to get an
optimum level of precision.
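The z-scores in the confidence-level table can be reproduced, approximately, from the standard normal distribution. The sketch below assumes two-sided confidence levels and uses `statistics.NormalDist` from the Python standard library. The function name `z_score` is only illustrative.

```python
from statistics import NormalDist


def z_score(confidence_level):
    """Two-sided z-score for a confidence level given as a fraction, e.g. 0.95."""
    if not 0.0 < confidence_level < 1.0:
        raise ValueError("confidence_level must be between 0 and 1")
    alpha = 1.0 - confidence_level
    # The upper tail holds alpha / 2 of the probability.
    return NormalDist().inv_cdf(1.0 - alpha / 2.0)


for level in (0.80, 0.85, 0.90, 0.95, 0.99):
    print(f"{level:.0%}  z = {z_score(level):.3f}")
```

The computed values agree with the table to about two decimal places. For 90% the computed value is close to 1.645, which the table lists as 1.65.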
Methods in Determining the Sample Size
• Estimating the Mean or Average
The sample size required to estimate the population mean µ with a given level of confidence and a specified
margin of error e is given by
$$ n = \left( \frac{Z\sigma}{e} \right)^2 $$
where:
n is the required sample size, rounded up to the next whole number.
Z is the z-score corresponding to the level of confidence.
σ is the population standard deviation.
e is the level of precision (margin of error).
Take Note:
When σ is unknown, it is common practice to conduct a preliminary survey to determine s and use it as
an estimate of σ, or to use results from previous studies to obtain an estimate of σ. When using this approach,
the size of the sample should be at least 30. The formula for the sample standard deviation s is
$$ s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}} $$
Example:
A soft drink machine is regulated so that the amount of drink dispensed is approximately normally
distributed with a standard deviation equal to 0.5 ounce. Determine the sample size needed if we wish to
be 95% confident that our sample mean will be within 0.03 ounce from the true mean.
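One way to work this example is sketched below in Python, assuming the standard formula n = (Zσ/e)² with the result rounded up to the next whole number. The z-scores are the ones listed in the module's confidence-level table. The function name `sample_size_for_mean` is only illustrative.

```python
import math

# z-scores as listed in the module's confidence-level table.
Z_TABLE = {0.80: 1.28, 0.85: 1.44, 0.90: 1.65, 0.95: 1.96, 0.99: 2.58}


def sample_size_for_mean(sigma, e, confidence=0.95):
    """Smallest whole n satisfying n >= (Z * sigma / e) ** 2."""
    if sigma <= 0 or e <= 0:
        raise ValueError("sigma and e must be positive")
    z = Z_TABLE[confidence]
    return math.ceil((z * sigma / e) ** 2)


# Soft drink machine: sigma = 0.5 ounce, e = 0.03 ounce, 95% confidence.
n = sample_size_for_mean(sigma=0.5, e=0.03, confidence=0.95)
print(n)  # (1.96 * 0.5 / 0.03) ** 2 is about 1067.11, so n = 1068
```

The result is always rounded up, never to the nearest integer, because a smaller sample would not achieve the stated margin of error.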
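• Estimating Proportion (Infinite Population)
The sample size required to obtain a confidence interval for the population proportion p with specified margin of error e is given by
$$ n = \frac{Z^2 \, p(1 - p)}{e^2} $$
where Z is the z-score corresponding to the level of confidence and p is an estimate of the population proportion.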
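The sample size for estimating a proportion can be sketched in Python as follows, assuming the standard infinite-population formula n = Z²p(1 − p)/e², rounded up. The margin of error of 0.05 and the values of p below are hypothetical, not taken from the module. When no prior estimate of p exists, p = 0.5 is a conservative choice because p(1 − p) is largest there.

```python
import math


def sample_size_for_proportion(e, p=0.5, z=1.96):
    """Smallest whole n satisfying n >= z**2 * p * (1 - p) / e**2."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must be strictly between 0 and 1")
    if e <= 0:
        raise ValueError("e must be positive")
    return math.ceil(z ** 2 * p * (1.0 - p) / e ** 2)


# Hypothetical inputs: 95% confidence (z = 1.96), margin of error 0.05.
print(sample_size_for_proportion(e=0.05))         # no prior estimate, p = 0.5
print(sample_size_for_proportion(e=0.05, p=0.2))  # prior estimate p = 0.2
```

With these inputs the first call gives 385 and the second 246, which shows how a prior estimate of p away from 0.5 reduces the required sample.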
To learn more about this lesson, please check the link provided:
https://www.youtube.com/watch?v=RxNUZRx00YA