CHAPTER 5: RELIABILITY
reliability ➢ refers to consistency in measurement (produces similar results)
➢ not necessarily consistently good or bad, but simply consistent.
Reliable test ➢ Reliable tests give scores that closely approximate true scores.
➢ Reliability refers to the proportion of the total variance attributed to true variance. The
greater the proportion of the total variance attributed to true variance, the more
reliable the test.
reliability coefficient ➢ is a statistic that quantifies reliability, ranging from 0 (not at all reliable) to 1
(perfectly reliable).
➢ An index of reliability; a proportion that indicates the ratio between the true score
variance on a test and the total variance.
measurement error refers to the inherent uncertainty associated with any measurement, even after care has
been taken to minimize preventable mistakes.
True score the long-term average of many measurements, free of carryover effects. Can only be
approximated.
Carryover effect Measurement processes that alter what is measured.
Practice effects The test itself provides an opportunity to learn and practice the ability being measured.
Fatigue effect repeated testing reduces overall mental energy or motivation to perform on a test.
standard error of measurement represents the typical distance from an observed score to the true score.
construct score a person’s standing on a theoretical variable independent of any particular measurement.
Classical test theory (CTT) models the observed score as X = T + E, where:
X = observed score
T = true score
E = error score
variance (σ²) the standard deviation squared.
A statistic useful in describing sources of test score variability.
true variance Variance from true differences
error variance Variance from irrelevant, random sources.
The total variance in an observed distribution of test scores (σ²) equals the sum of the
true variance (σ²_t) and the error variance (σ²_e): σ² = σ²_t + σ²_e.
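A minimal numeric sketch of this decomposition (simulated, hypothetical scores; numpy assumed available), showing that the reliability coefficient is the share of total variance that is true variance:

```python
import numpy as np

# Hypothetical data: each observed score X is a true score T plus random error E.
rng = np.random.default_rng(0)
true_scores = rng.normal(loc=50, scale=8, size=10_000)  # T
errors = rng.normal(loc=0, scale=4, size=10_000)        # E
observed = true_scores + errors                         # X = T + E

total_var = observed.var()     # sigma^2
true_var = true_scores.var()   # sigma^2_t
error_var = errors.var()       # sigma^2_e

# CTT: total variance = true variance + error variance (approximately, in a sample).
print(total_var, true_var + error_var)

# Reliability coefficient: the proportion of total variance that is true variance.
reliability = true_var / total_var
print(reliability)  # ~ 64 / (64 + 16) = .80 with these parameters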
Measurement error (systematic or random) Systematic error
➢ influences test scores in a consistent direction. Systematic errors either
consistently inflate scores or consistently deflate scores. (Systematic error does
not affect score consistency.)
➢ Once a systematic error becomes known, it becomes predictable, as well as
fixable.
➢ E.g., a faulty measurement scale.
Random error
➢ unpredictable fluctuations and inconsistencies of other variables in the
measurement process.
➢ E.g., noise, a bad day, etc.
Bias ➢ the technical term for the degree to which a measure predictably overestimates or
underestimates a quantity.
➢ In statistics, bias refers to the degree to which systematic error influences the
measurement.
Sources of Error Variance
Test construction One source of error variance is item sampling or content sampling, terms that refer
to variation among items within a test as well as to variation among items between tests.
The extent to which a testtaker’s score is affected by the content sampled on a test and
by the way the content is sampled (i.e., the way in which the item is constructed) is itself
a source of error variance.
Test administration ➢ Sources of error variance that occur during test administration may influence the
testtaker’s attention or motivation (e.g., room temperature, level of lighting, etc.).
➢ Testtaker variables, such as pressing emotional problems, physical discomfort,
lack of sleep, and the effects of drugs or medication.
➢ Examiner-related variables are potential sources of error variance (e.g., the
level of professionalism exhibited by examiners).
Test scoring and interpretation Scorers and scoring systems are potential sources of error variance.
➢ E.g., a technical glitch might contaminate the data.
➢ If subjectivity is involved in scoring, then the scorer (or rater) can be a source of
error variance.
Other sources of error Methodological error.
Examples:
➢ the researchers may have gotten such factors right but simply did not include
enough people in their sample to draw the conclusions that they did (sampling
bias);
➢ interviewers may not have been trained properly;
➢ the wording in the questionnaire may have been ambiguous;
➢ or the items may have somehow been biased to favor one or another of the
candidates.
Reliability Estimates
Test-Retest Reliability Estimates ➢ an estimate of reliability obtained by correlating pairs of scores from the same
people on two different administrations of the same test.
➢ appropriate when evaluating the reliability of a test that purports to measure
something that is relatively stable over time, such as a personality trait.
➢ It is generally the case (although there are exceptions) that, as the time interval
between administrations of the same test increases, the correlation between the
scores obtained on each testing decreases.
➢ A low estimate of test-retest reliability might be found even when the interval
between testings is relatively brief.
coefficient of stability ➢ The longer the time that passes, the greater the likelihood that the reliability
coefficient will be lower.
➢ When the interval between testings is greater than six months, the estimate of
test-retest reliability is often referred to as the coefficient of stability.
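A minimal sketch of the computation, using hypothetical scores for ten testtakers (numpy assumed available). The same correlation, computed between two forms taken by the same sample, would give an alternate-forms estimate:

```python
import numpy as np

# Hypothetical scores from the same ten people on two administrations of one test.
time1 = np.array([12, 15, 9, 20, 14, 18, 11, 16, 13, 17])
time2 = np.array([13, 14, 10, 19, 15, 17, 12, 18, 12, 16])

# The test-retest reliability estimate is simply the Pearson r between the two sets.
r_test_retest = np.corrcoef(time1, time2)[0, 1]
print(round(r_test_retest, 2))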
coefficient of equivalence The degree of the relationship between various forms of a test can be evaluated by
means of an alternate-forms or parallel-forms coefficient of reliability.
Parallel forms ➢ Two sets of tests (equivalent forms).
➢ For each form of the test, the means and the variances of observed test scores
are equal.
➢ In theory, the means of scores obtained on parallel forms correlate equally with
the true score.
parallel forms reliability ➢ refers to an estimate of the extent to which item sampling and other errors have
affected test scores on versions of the same test when, for each form of the test,
the means and variances of observed test scores are equal.
➢ Developing alternate forms of tests can be time-consuming and expensive.
Alternate forms typically designed to be equivalent with respect to variables such as content and level of
difficulty.
alternate forms reliability ➢ refers to an estimate of the extent to which these different forms of the same test
have been affected by item sampling error, or other error.
➢ Estimating alternate-forms reliability is straightforward: calculate the correlation
between scores from a representative sample of individuals who have taken both
tests.
internal consistency estimate of reliability (estimate of inter-item consistency) Deriving this type of estimate entails an evaluation of the internal consistency of the test
items. An estimate of the reliability of a test can be obtained without developing an
alternate form of the test and without having to administer the test twice to the same
people.
Split-Half Reliability Estimates ➢ obtained by correlating two pairs of scores obtained from equivalent halves of a
single test administered once.
➢ It is a useful measure of reliability when it is impractical or undesirable to assess
reliability with two tests or to administer a test twice (because of factors such as
time or expense).
Step 1. Divide the test into equivalent halves.
Step 2. Calculate a Pearson r between scores on the two halves of the test.
Step 3. Adjust the half-test reliability using the Spearman–Brown formula
(discussed shortly).
➢ In general, a primary objective in splitting a test in half for the purpose of
obtaining a split-half reliability estimate is to create what might be called
“mini-parallel-forms,” with each half equal to the other—or as nearly equal as
humanly possible—in format, stylistic, statistical, and related aspects.
odd-even reliability An acceptable way to split a test is to assign odd-numbered items to one half of the test
and even-numbered items to the other half.
The Spearman–Brown formula ➢ allows a test developer or user to estimate internal consistency reliability from a
correlation between two halves of a test.
➢ Usually, but not always, reliability increases as test length increases. Ideally, the
additional test items are equivalent with respect to the content and the range of
difficulty of the original items.
➢ If test developers or users wish to shorten a test, the Spearman–Brown formula
may be used to estimate the effect of the shortening on the test’s reliability.
Reduction in test size for the purpose of reducing test administration time is a
common practice in certain situations.
➢ It could also be used to determine the number of items needed to attain a desired
level of reliability.
➢ Useful in measuring the reliability of homogeneous tests and, with separately
timed halves, speed tests.
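A hedged sketch of the three steps listed above, applied to a hypothetical matrix of dichotomous item responses with an odd-even split (numpy assumed available); the Spearman–Brown adjustment uses the standard formula r_SB = n·r / (1 + (n − 1)·r):

```python
import numpy as np

# Hypothetical item-response matrix: rows are testtakers, columns are items (1 = correct).
scores = np.array([
    [1, 1, 0, 1, 1, 0, 1, 1],
    [0, 1, 0, 0, 1, 0, 1, 0],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [0, 0, 0, 1, 0, 0, 0, 1],
    [1, 0, 1, 1, 1, 0, 1, 1],
    [0, 1, 0, 0, 0, 1, 0, 0],
])

# Step 1: odd-even split (0-based columns 0, 2, 4, ... are the odd-numbered items).
odd_half = scores[:, 0::2].sum(axis=1)
even_half = scores[:, 1::2].sum(axis=1)

# Step 2: Pearson r between scores on the two halves.
r_half = np.corrcoef(odd_half, even_half)[0, 1]

# Step 3: Spearman-Brown adjustment to full-test length (n = 2 halves):
# r_sb = n * r / (1 + (n - 1) * r)
r_full = 2 * r_half / (1 + r_half)
print(round(r_half, 2), round(r_full, 2))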
Other Methods of Estimating Internal Consistency
Inter-item consistency refers to the degree of correlation among all the items on a scale. A measure of
inter-item consistency is calculated from a single administration of a single form of a test.
Inter-item consistency is useful in assessing homogeneity.
Homogeneous Contains items that measure a single trait.
It is often an insufficient tool for measuring multifaceted psychological variables.
Heterogeneous Contains items that measure different factors (e.g., more than one trait).
KR-20 (Kuder–Richardson formula 20) ➢ Is the statistic of choice for determining the inter-item consistency of
dichotomous items (e.g., items scored right or wrong).
➢ If items are more heterogeneous, KR-20 will yield a lower reliability estimate than
the split-half method.
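A minimal sketch of the KR-20 computation on hypothetical dichotomous responses (the data are illustrative only; numpy assumed available):

```python
import numpy as np

# Hypothetical dichotomous (0/1) item responses: rows = testtakers, columns = items.
X = np.array([
    [1, 1, 1, 0, 1],
    [1, 0, 1, 0, 0],
    [1, 1, 1, 1, 1],
    [0, 0, 1, 0, 0],
    [1, 1, 0, 1, 1],
])

k = X.shape[1]                   # number of items
p = X.mean(axis=0)               # proportion passing each item
q = 1 - p                        # proportion failing each item
total_var = X.sum(axis=1).var()  # variance of total scores

# KR-20 = (k / (k - 1)) * (1 - sum(p*q) / total variance)
kr20 = (k / (k - 1)) * (1 - (p * q).sum() / total_var)
print(round(kr20, 2))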
Coefficient alpha (Cronbach’s alpha) ➢ Developed by Cronbach (1951).
➢ Ranges from 0 to 1.
➢ May be thought of as the mean of all possible split-half correlations.
➢ Appropriate for use on tests with non-dichotomous items.
➢ Is the most frequently used measure of internal consistency, but it has several
well-known limitations. It accurately measures internal consistency only under
highly specific conditions that are rarely met in real measures.
In the measurement model underlying these estimates, item coefficients called loadings
represent the strength of the relationship between the true score and the observed
scores. Coefficient alpha is accurate when these loadings are equal. If they are nearly
equal, Cronbach’s alpha is still quite accurate, but when the loadings are quite unequal,
Cronbach’s alpha underestimates reliability.
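A minimal sketch of coefficient alpha computed from item and total-score variances (hypothetical Likert-type data; numpy assumed available):

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Coefficient alpha for an items matrix (rows = testtakers, cols = items)."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)      # variance of each item
    total_var = items.sum(axis=1).var(ddof=1)  # variance of total scores
    # alpha = (k / (k - 1)) * (1 - sum of item variances / total variance)
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Hypothetical 5-point Likert responses (non-dichotomous items).
ratings = np.array([
    [4, 5, 4, 3],
    [2, 3, 2, 2],
    [5, 5, 4, 5],
    [3, 3, 3, 2],
    [4, 4, 5, 4],
])
print(round(cronbach_alpha(ratings), 2))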
McDonald’s (1978) omega It accurately estimates internal consistency even when the test loadings are unequal.
Average proportional distance (APD) Measures the average degree of difference that exists between item scores.

➢ .2 or lower = excellent internal consistency
➢ between .2 and .25 = acceptable range
➢ above .25 = problems with internal consistency
Measures of Inter-Scorer Reliability ➢ the degree of agreement or consistency between two or more scorers (or judges
or raters) with regard to a particular measure.
➢ This reduction of potential bias can be accomplished by having at least one other
individual observe and rate the same behaviors. If consensus can be
demonstrated in the ratings, the researchers can be more confident regarding the
accuracy of the ratings.
coefficient of inter-scorer reliability One way of determining the degree of consistency among scorers in the scoring of a
test is to calculate a coefficient of correlation.
E.g., kappa.
transient error a source of error attributable to variations in the testtaker’s feelings, moods, or mental
state over time.
homogeneous ➢ is functionally uniform throughout. Tests designed to measure one factor,
such as one ability or one trait, are expected to be homogeneous in items.
➢ it is reasonable to expect a high degree of internal consistency.
It is important to note that high internal consistency does not guarantee item
homogeneity. As long as the items are positively correlated, adding many items
eventually results in high internal consistency coefficients, homogeneous or not.
heterogeneous ➢ Tests designed to measure multiple factors. If a test is heterogeneous
in items, an estimate of internal consistency might be low relative to a more
appropriate estimate of test-retest reliability.
dynamic characteristic is a trait, state, or ability presumed to be ever-changing as a function of situational
and cognitive experiences.
static characteristic ➢ a trait, state, or ability presumed to be relatively unchanging.
➢ The obtained measurement would not be expected to vary significantly as a
function of time, and either the test-retest or the alternate-forms method
would be appropriate.
power test When a time limit is long enough to allow testtakers to attempt all items, and some
items are so difficult that no testtaker is able to obtain a perfect score, the test is
considered a power test.
Speed test ➢ contains items of uniform level of difficulty (typically uniformly low) so that,
when given generous time limits, all testtakers should be able to complete
all the test items correctly. In practice, however, the time limit on a speed
test is established so that few if any of the testtakers will be able to complete
the entire test.
➢ Score differences on a speed test are therefore based on performance
speed.
➢ A reliability estimate of a speed test should be based on performance from
two independent testing periods using one of the following: (1) test-retest
reliability, (2) alternate-forms reliability, or (3) split-half reliability from two
separately timed half tests. If a split-half procedure is used, then the
obtained reliability coefficient is for a half test and should be adjusted using
the Spearman–Brown formula.
Criterion-referenced tests ➢ designed to provide an indication of where a testtaker stands with respect to
some variable or criterion (such as an educational or a vocational
objective).
➢ tend to contain material that has been mastered in hierarchical fashion.
➢ Scores on criterion-referenced tests tend to be interpreted in pass–fail (or,
perhaps more accurately, “master–failed-to-master”) terms, and any scrutiny
of performance on individual items tends to be for diagnostic and remedial
purposes.
classical test theory (CTT) The most widely used and accepted model in the psychometric literature today.
CTT assumptions are characterized as “weak” precisely because they are so
readily met.
true score a value that, according to CTT, genuinely reflects an individual’s ability (or trait)
level as measured by a particular test.
domain sampling theory ➢ items on a test are a sample from a much larger, potentially infinite, domain
of potential items representing a specific construct.
➢ seeks to estimate the extent to which specific sources of variation under
defined conditions are contributing to the test score. A test’s reliability is
conceived of as an objective measure of how precisely the test score
assesses the domain from which the test draws a sample (Thorndike, 1985).
generalizability theory ➢ Developed by Lee J. Cronbach.
➢ is based on the idea that a person’s test scores vary from testing to testing
because of variables in the testing situation.
➢ Instead of conceiving of all variability in a person’s scores as error,
Cronbach encouraged test developers and researchers to describe the
details of the particular test situation or universe leading to a specific test
score. This universe is described in terms of its facets.
➢ Facets include considerations such as the number of items in the
test, the amount of training the test scorers have had, and the purpose
of the test administration.
➢ Generalizability theory has not replaced CTT. Perhaps one of its chief contributions
has been its emphasis on the fact that a test’s reliability does not reside
within the test itself.
➢ From the perspective of generalizability theory, a test’s reliability is a
function of the circumstances under which the test is developed,
administered, and interpreted.
generalizability study examines how generalizable scores from a particular test are if the test is
administered in different situations.
A generalizability study examines how much of an impact different facets of the
universe have on the test score.
coefficients of generalizability The influence of particular facets on the test score is represented by coefficients
of generalizability. These coefficients are similar to reliability coefficients in the
true score model.
Decision study ➢ After the generalizability study is done, Cronbach et al. (1972)
recommended that test developers do a decision study, which involves the
application of information from the generalizability study.
➢ In the decision study, developers examine the usefulness of test scores in
helping the test user make decisions.
➢ In practice, test scores are used to guide a variety of decisions (e.g., from
placing a child in special education, to hiring new employees, to discharging
mental patients from the hospital).
Item response theory (IRT) / latent-trait theory ➢ models the probability that a person with X ability will be able to perform at a
level of Y.
➢ Stated in terms of personality assessment, it models the probability that a
person with X amount of a particular personality trait will exhibit Y amount of
that trait on a personality test designed to measure it. Because the
psychological or educational construct being measured is so often physically
unobservable (i.e., latent), IRT is also referred to as latent-trait theory.
➢ refers to a family of theories and methods, and quite a large family at
that, with many other names used to distinguish specific approaches.
There are well over a hundred varieties of IRT models. Each model is
designed to handle data with certain assumptions and data characteristics.
discrimination signifies the degree to which an item differentiates among people with higher or
lower levels of the trait, ability, or whatever it is that is being measured.
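As a concrete illustration, here is a sketch of one common member of the IRT family, the two-parameter logistic (2PL) model; the parameter names a (discrimination) and b (difficulty) follow standard IRT notation, and the values are hypothetical:

```python
import math

def item_response_probability(theta: float, a: float, b: float) -> float:
    """Two-parameter logistic (2PL) model: probability that a person with
    ability theta answers an item correctly, given the item's
    discrimination (a) and difficulty (b)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# A highly discriminating item (a = 2.0) separates testtakers near its
# difficulty (b = 0.0) more sharply than a weakly discriminating one (a = 0.5).
for theta in (-1.0, 0.0, 1.0):
    print(theta,
          round(item_response_probability(theta, a=2.0, b=0.0), 2),
          round(item_response_probability(theta, a=0.5, b=0.0), 2))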
polytomous test items test items or questions with three or more alternative responses, where only one is
scored correct or scored as being consistent with a targeted trait or other construct.
dichotomous test items test items or questions that can be answered with only one of two alternative
responses, such as true–false, yes–no, or correct–incorrect questions.
Reliability and Individual Scores
The Standard Error of Measurement is the tool used to estimate or infer the extent to which an observed score deviates
from a true score.
It provides an estimate of the amount of error inherent in an observed score or
measurement. In general, the relationship between the SEM and the reliability of a
test is inverse; the higher the reliability of a test (or individual subtest within a test),
the lower the SEM.
In practice, the standard error of measurement is most frequently used in the
interpretation of individual test scores.
standard error of a score another name for the standard error of measurement.
confidence interval a range or band of test scores that is likely to contain the true score.
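A minimal sketch of the standard computations (SEM = SD × √(1 − reliability), and a 95% confidence band around an observed score); the test statistics are hypothetical:

```python
import math

# Hypothetical test: standard deviation 10, reliability coefficient .91.
sd, reliability = 10.0, 0.91

# SEM = SD * sqrt(1 - reliability); higher reliability -> lower SEM.
sem = sd * math.sqrt(1 - reliability)  # = 3.0 here

# 95% confidence interval around an observed score of 50: observed +/- 1.96 * SEM.
observed = 50
lower, upper = observed - 1.96 * sem, observed + 1.96 * sem
print(round(sem, 1), (round(lower, 1), round(upper, 1)))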