PSYCHOMETRIC PROPERTIES: RELIABILITY

THE CONCEPT OF RELIABILITY
Reliability: Consistency in measurement.
The reliability coefficient is an index of reliability: a proportion that indicates the ratio between the true score variance on a test and the total variance.
Observed score = true score plus error (X = T + E).
Error refers to the component of the observed score that does not have to do with a test taker’s true ability or the trait being measured.

VARIANCE AND MEASUREMENT ERROR
Variance = standard deviation squared.
Total variance equals true variance plus error variance: $\sigma^2_X = \sigma^2_T + \sigma^2_E$.
Reliability is the proportion of the total variance attributed to true variance.
Measurement error: All of the factors associated with the process of measuring some variable, other than the variable being measured.

THE CONCEPT OF RELIABILITY: MEASUREMENT ERROR
Measurement error:
1. Random error: A source of error in measuring a targeted variable caused by unpredictable fluctuations and inconsistencies of other variables in the measurement process (i.e., noise).
2. Systematic error: A source of error in measuring a variable that is typically constant or proportionate to what is presumed to be the true value of the variable being measured.

SOURCES OF ERROR VARIANCE
Test construction - Variation may exist within items in a test or between tests (i.e., item sampling or content sampling).
Test administration - Sources of error may stem from the testing environment; test taker variables such as pressing emotional problems, physical discomfort, lack of sleep, and the effects of drugs or medication; and examiner-related variables such as physical appearance and demeanor.
Test scoring and interpretation - Computer testing reduces error in test scoring, but many tests still require expert interpretation (e.g., projective tests); subjectivity in scoring can enter into behavioral assessment.
Surveys and polls usually contain a disclaimer as to the margin of error associated with their findings.
Sampling error - The extent to which the population of voters in the study actually was representative of voters in the election.
Methodological error - Interviewers may not have been trained properly, the wording in the questionnaire may have been ambiguous, or the items may have somehow been biased to favor one or another of the candidates.

RELIABILITY ESTIMATES
Test-retest reliability: An estimate of reliability obtained by correlating pairs of scores from the same people on two different administrations of the same test.
Most appropriate for variables that should be stable over time (e.g., personality) and not appropriate for variables expected to change over time (e.g., mood).
As time passes, the correlation between the scores obtained on each testing decreases.
With intervals greater than 6 months, the estimate of test-retest reliability is called the coefficient of stability.
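
As a rough illustration of the true-score model and the test-retest estimate, the sketch below simulates X = T + E for two administrations and compares the test-retest correlation with the proportion of total variance that is true variance. Everything in it is an assumption made for illustration: the sample size, the score mean of 100, the true-score and error standard deviations, and the helper names pearson_r and simulate_test_retest. It also assumes purely random, independent error and perfectly stable true scores, which real retest data will not satisfy exactly.

```python
import random
from statistics import mean, pvariance


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def simulate_test_retest(n_people=5000, true_sd=10.0, error_sd=5.0, seed=0):
    """Simulate X = T + E twice, with fresh random error on each administration."""
    rng = random.Random(seed)
    true_scores = [rng.gauss(100.0, true_sd) for _ in range(n_people)]
    time1 = [t + rng.gauss(0.0, error_sd) for t in true_scores]
    time2 = [t + rng.gauss(0.0, error_sd) for t in true_scores]
    return true_scores, time1, time2


true_scores, time1, time2 = simulate_test_retest()

# Reliability by construction: true variance / (true variance + error variance).
theoretical = 10.0 ** 2 / (10.0 ** 2 + 5.0 ** 2)  # 100 / 125 = 0.8

# The same ratio estimated from the simulated sample.
variance_ratio = pvariance(true_scores) / pvariance(time1)

# Test-retest estimate: correlate scores from the two administrations.
r_test_retest = pearson_r(time1, time2)

print(f"theoretical reliability: {theoretical:.2f}")
print(f"sample variance ratio:   {variance_ratio:.2f}")
print(f"test-retest r:           {r_test_retest:.2f}")
```

In practice true scores are never observed, so only the last quantity can be computed from real data; the simulation simply shows why the correlation between two administrations can serve as an estimate of the variance ratio.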
Parallel-forms and alternate-forms reliability:
Coefficient of equivalence: The degree of the relationship between various forms of a test.
Parallel forms: For each form of the test, the means and the variances of observed test scores are equal.
Alternate forms: Different versions of a test that have been constructed so as to be parallel; they do not meet the strict requirements of parallel forms, but item content and difficulty are similar between tests.
Reliability is checked by administering two forms of a test to the same group; scores may be affected by error related to the state of test takers (e.g., practice, fatigue) or item sampling.

Split-half reliability:
Obtained by correlating two sets of scores obtained from equivalent halves of a single test administered once; entails three steps:
Step 1 - Divide the test into equivalent halves.
Step 2 - Calculate a Pearson r between scores on the two halves of the test.
Step 3 - Adjust the half-test reliability using the Spearman-Brown formula.
The Spearman-Brown formula allows a test developer or user to estimate internal consistency reliability from a correlation of two halves of a test; for two halves with correlation $r_{hh}$, the adjusted estimate is $r_{SB} = 2r_{hh} / (1 + r_{hh})$.

OTHER METHODS OF ESTIMATING INTERNAL CONSISTENCY
Inter-item consistency: The degree of relatedness of items on a scale; this helps gauge the homogeneity of a test.
Kuder-Richardson formula 20: Statistic of choice for determining the inter-item consistency of dichotomous items.
Coefficient alpha: Mean of all possible split-half correlations, corrected by the Spearman-Brown formula; it is the most popular approach for internal consistency, and its values typically range from 0 to 1.
Average proportional distance (APD): Focuses on the degree of difference between scores on test items; it involves averaging the absolute differences between scores on all of the items and then dividing by the number of response options on the test minus one.

MEASURES OF INTER-SCORER RELIABILITY
Inter-scorer reliability: The degree of agreement or consistency between two or more scorers (or judges or raters) with regard to a particular measure.
It is often used with behavioral measures.
Guards against biases in scoring.
Coefficient of inter-scorer reliability: The scores from different raters are correlated with one another.

RELIABILITY INTERPRETATION
Approaching 1 - Higher Reliability
High Standard = 0.90-0.95
Acceptable = 0.80-0.89
Barely Acceptable = 0.60-0.70

TRUE SCORE MODEL VS. ALTERNATIVES
The true-score model is often referred to as classical test theory (CTT), which is the most widely used model due to its simplicity.
True score: A value that, according to classical test theory, genuinely reflects an individual’s ability (or trait) level as measured by a particular test.
CTT assumptions are more readily met than those of item response theory (IRT).
A problematic assumption of CTT has to do with the equivalence of items on a test.
Domain sampling theory: Estimates the extent to which specific sources of variation under defined conditions are contributing to the test score.
Generalizability theory: Based on the idea that a person’s test scores vary from testing to testing because of variables in the testing situation.
Instead of conceiving of variability in a person’s scores as error, Cronbach encouraged test developers and researchers to describe the details of the particular test situation, or universe, leading to a specific test score.
This universe is described in terms of its facets, including the number of items in the test, the amount of training the test scorers have had, and the purpose of the test administration.
Item response theory: Provides a way to model the probability that a person with X ability will be able to perform at a level of Y.
IRT refers to a family of theories and methods rather than a single technique; different names are used to distinguish the specific approaches.
IRT incorporates considerations of an item's level of difficulty and discrimination.
Difficulty relates to an item not being easily accomplished, solved, or comprehended.
Discrimination refers to the degree to which an item differentiates among people with higher or lower levels of the trait, ability, or other variables being measured.

THE STANDARD ERROR OF MEASUREMENT
Standard error of measurement, often abbreviated as SEM, provides a measure of the precision of an observed test score; it is an estimate of the amount of error inherent in an observed score or measurement.
The higher the reliability of the test, the lower the standard error.
Standard error can be used to estimate the extent to which an observed score deviates from a true score.
Confidence interval: A range or band of test scores that is likely to contain the true score.

THE STANDARD ERROR OF THE DIFFERENCE
The standard error of the difference: A measure that can aid a test user in determining how large a difference in test scores should be before it is considered statistically significant.
It can be used to address three types of questions:
1. How did this individual’s performance on test 1 compare with his or her performance on test 2?
2. How did this individual’s performance on test 1 compare with someone else’s performance on test 1?
3. How did this individual’s performance on test 1 compare with someone else’s performance on test 2?
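
The standard error of measurement, a confidence interval around an observed score, and the standard error of the difference can be sketched numerically as below. The notes do not write out the formulas, so the sketch assumes the usual classical-test-theory expressions: SEM = SD * sqrt(1 - reliability), a normal-theory band of observed score plus or minus z * SEM, and SED = sqrt(SEM_1^2 + SEM_2^2) for two scores on the same scale. The function names, the SD of 15, the reliabilities, and the scores are all hypothetical.

```python
import math


def standard_error_of_measurement(sd, reliability):
    """SEM = SD * sqrt(1 - reliability); higher reliability gives a smaller SEM."""
    return sd * math.sqrt(1.0 - reliability)


def confidence_interval(observed, sem, z=1.96):
    """Band around an observed score; z = 1.96 corresponds to roughly 95%."""
    return observed - z * sem, observed + z * sem


def standard_error_of_difference(sem_1, sem_2):
    """SED = sqrt(SEM_1^2 + SEM_2^2), assuming both scores share one scale."""
    return math.sqrt(sem_1 ** 2 + sem_2 ** 2)


# Two hypothetical tests, both scaled to SD = 15.
sem_a = standard_error_of_measurement(sd=15.0, reliability=0.91)  # about 4.5
sem_b = standard_error_of_measurement(sd=15.0, reliability=0.84)  # about 6.0

# Confidence band around an observed score of 110 on test A.
low, high = confidence_interval(observed=110.0, sem=sem_a)

# Compare a score of 110 on test A with a score of 98 on test B.
sed = standard_error_of_difference(sem_a, sem_b)                  # about 7.5
difference = 110.0 - 98.0
is_significant = abs(difference) > 1.96 * sed

print(f"SEM A = {sem_a:.2f}, SEM B = {sem_b:.2f}")
print(f"95% band for 110 on test A: {low:.2f} to {high:.2f}")
print(f"SED = {sed:.2f}, difference = {difference:.1f}, significant: {is_significant}")
```

With these made-up numbers the 12-point difference falls short of 1.96 * SED, so it would not be treated as statistically significant at about the .05 level. The same comparison applies to each of the three question types; only the choice of which two SEMs to combine changes.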