Unit 2

The document outlines the classification of psychological tests, including maximal performance tests, behavioral observation, self-report methods, standardized and non-standardized tests, and various types of measurement scales. It discusses the importance of measurement in psychology, detailing its benefits, levels, and potential errors, as well as introducing psychometrics and Item Response Theory (IRT) for evaluating test responses. Overall, it emphasizes the significance of reliable and valid assessments in understanding psychological constructs and abilities.

UNIT 2

Classification of Psychological tests


I
Maximal performance test
The objective is to obtain the best possible, or maximum, level of performance from the test takers. These
tests are designed to assess the upper limits of the examinee’s knowledge and abilities. Maximal performance
items present problems that the test takers have to solve.

Examples include:

 Achievement Tests: These measure what the student has learnt.
 Aptitude Tests: These measure the student’s ability to learn in the future, for example JEE, NEET, CAT,
and MAT.
 Intelligence Tests: These assess a person’s ability to cope with the environment, for example the WAIS
(verbal comprehension, working memory, perceptual reasoning, and processing speed) and the WISC.

Behavioural Observation

A behavioral assessment test is performed through observations, questionnaires, and interviews. It can be
either clinical or functional in nature depending on whether it is used for diagnostic purposes or to determine
the antecedents for challenging behaviors.

Self-Report

Self-report is a method for collecting data whose source is the subject’s verbal message about him- or
herself. Self-reports are the most common assessment procedures for collecting data in psychology and
psychological assessment. A typical self-report inventory presents a number of questions or statements that
may or may not describe certain qualities or characteristics of the test subject.
Examples:
The Beck Depression Inventory (BDI) is a 21-question self-report questionnaire that measures the severity of
depression in adults. It was developed by Aaron T. Beck in 1961.
The 16 Personality Factor (16PF) Questionnaire is a personality test that measures 16 personality traits. It
was developed by Raymond B. Cattell, Maurice Tatsuoka, and Herbert Eber.
II
Standardized Test
A standardized test is a test that is administered and scored in a consistent, or "standard", manner.
Standardized tests are designed so that the questions, conditions for administering, scoring
procedures, and interpretations are consistent, and they are administered and scored in a predetermined,
standard manner.
FORMS OF STANDARDIZED TEST
 Achievement test
 Diagnostic test
 Aptitude test
 Intelligence test
 College-admission test
 Psychological test
Non-Standardized Test
 A non-standardized test is one that has not first been administered to a group of people in order to
standardize it.
 It allows for an assessment of an individual’s abilities or performance, but does not allow for a fair
comparison of one student with another.
 The most common non-standardized tests are the classroom tests that teachers give routinely.
 These tests are designed by the teacher to determine or monitor the progress of the students.
 Classroom tests play a central role in the evaluation of student learning.
III
Objective Tests
 Cattell (e.g., 1957) defined objective tests as tests which can be objectively scored and whose meaning or
purpose is hidden from subjects (even if they are knowledgeable in psychology).
 They are useful because they can test a wide sample of the curriculum in a short time, rely less on the
language skills of the students, and are useful for diagnostic purposes.
 The 16 Personality Factor (16PF) Questionnaire is a personality test that measures 16 personality
traits. It was developed by Raymond B. Cattell, Maurice Tatsuoka, and Herbert Eber.
 The Minnesota Multiphasic Personality Inventory-2 (MMPI-2) is a psychological test that measures
a person's mental health. The test consists of 567 true-false questions.
Projective Tests
 Projective tests provide qualitative, subjective data from individuals' open-ended responses to
ambiguous stimuli, which are subjected to interpretation. They offer a richer analysis of one's
personality and unconscious mind but are considered less reliable. A common definition of a
projective test is that it constitutes an ambiguous stimulus to which subjects have to respond.
 The Rorschach test is a projective method of psychological testing in which a person is asked to describe
what he or she sees in 10 inkblots, of which some are black or gray and others have patches of
colour. The test was introduced in 1921 by Swiss psychiatrist Hermann Rorschach.
 The Thematic Apperception Test, or TAT, involves showing around 10-12 cards to the test taker, with
each card prompting a story response. The test taker tells a story about each ambiguous scene, and the
examiner uses these stories to learn more about the person's emotions, motivations, and personality. It was
developed by American psychologists Henry A. Murray and Christina D. Morgan at Harvard University in the 1930s.
IV
 Personality tests are tools that assess a person's motivations, interests, and how they interact with
others. An example is the 16 Personality Factor (16PF) Questionnaire, which measures 16
personality traits and was developed by Raymond B. Cattell, Maurice Tatsuoka, and Herbert Eber.
 Intelligence tests measure a person's cognitive ability. An example is the Wechsler Adult Intelligence
Scale (WAIS), which measures an adult's cognitive abilities.
 An aptitude test is designed to assess what a person is capable of doing or to predict what a person is
able to learn or do given the right education and instruction.
 The SAT measures aptitudes in areas including math, reasoning, and language. The Graduate
Record Examinations (GRE) measure a student's readiness for graduate school.
V
 Speed Tests: These tests have strict time limits, requiring quick responses. An example is a clerical
speed test, like data entry. Speed tests generate more variable results, which are useful for predicting
performance in roles where fast decision-making is crucial (e.g., pilots or police officers). However,
they can be unfair to older individuals or those with disabilities, as processing speed may decline
with age, leading to potential legal challenges.
 Power Tests: These tests have no strict time limits, focusing instead on knowledge or skills. For
example, a technical skills test might assess problem-solving ability. Power tests are better suited for
roles where deep understanding is more important than quick responses.
 Paper-and-Pencil Tests: Common in industrial testing, these tests evaluate knowledge or traits, such
as a multiple-choice test or digital questionnaire.
 Performance Tests: These tests require candidates to complete tasks involving physical objects,
such as a dental hygienist preparing tools. Performance tests measure practical skills relevant to job
tasks.
Measurement
Measurement is the act of assigning numbers or symbols to characteristics of things (people, events,
whatever) according to rules. The rules used in assigning numbers are guidelines for representing the
magnitude (or some other characteristic) of the object being measured.
A scale is a set of numbers (or other symbols) whose properties model empirical properties of the objects to
which the numbers are assigned.
Benefits of measurement
1. Objectivity
 Measurement helps ensure objectivity by reducing personal bias or subjective interpretation.
 When attributes like intelligence, stress, or job performance are measured using standardized tools,
the results become more reliable and independent of the observer’s opinions or assumptions.
 This makes scientific research more valid and reproducible.
2. Quantification
 Allows the conversion of abstract psychological concepts into numerical values.
 Makes it easier to compare individuals, track changes over time, and establish clear relationships
between variables.
 Quantification also supports systematic data collection, enhancing research quality.
3. Observation of Subtle Effects
Many psychological effects or changes are too small to detect through casual observation. Measurement
tools, such as psychological tests or physiological sensors, can capture these subtle changes, enabling
researchers to identify patterns that would otherwise go unnoticed.
4. Statistical Analysis
 It enables the application of statistical techniques to analyze data. With numerical data, researchers
can compute averages, correlations, regressions, and more.
 Statistical analysis helps uncover relationships, test hypotheses, and draw generalizable conclusions
from sample data.
5. Better Communication
 When psychological phenomena are measured and expressed numerically, communication becomes
clearer between researchers, practitioners, and even the general public.
 Numbers are easier to interpret, compare, and report than vague qualitative descriptions, enhancing
clarity in research papers, reports, and professional discussions.
Levels of measurement
Nominal Scales
 Nominal scales are the most basic type of measurement.
 They are used to sort things into different groups or categories based on some characteristic.
 Each item can belong to only one group, and every item must fit into a group.
 These groups do not have any order — they are just labels, like male/female or types of jobs.
 For example, in the DSM, each disorder listed in the manual was assigned its own number: 303.00
identified alcohol intoxication, and 307.00 identified stuttering. These numbers were used
exclusively for classification purposes and could not be meaningfully added, subtracted, ranked, or
averaged. Hence, the middle number between these two diagnostic codes, 305.00, did not identify an
intoxicated stutterer.
Ordinal scales
 They allow classification and rank ordering based on a characteristic.
 However, the exact difference between ranks is unknown.
 Numbers used in ordinal scales indicate order, not precise measurement units. For example, the gap
between the first and second job applicants may be small, while the gap between the second and third
could be large.
 Ordinal scales lack an absolute zero point, meaning no applicant has “zero” ability. Because of this,
statistical analysis is limited.
Interval scales
 They have the properties of nominal and ordinal scales, with the added feature of equal intervals
between points.
 Each unit represents the same amount, allowing meaningful comparisons of differences between scores.
 However, like ordinal scales, interval scales lack an absolute zero point. This means zero does not
indicate the complete absence of the measured trait.
 Intelligence tests, for example, use interval scales, where the difference between IQ 80 and 100
equals the difference between IQ 100 and 120. However, a zero IQ is impossible and does not mean
zero intelligence.
 Interval scales allow statistical analysis, including meaningful averages.
Ratio scales
 They have all the properties of nominal, ordinal, and interval scales, with the added feature of a true
zero point.
 This allows for meaningful mathematical operations, including ratios.
 In psychology, ratio scales are used in certain tests, such as neurological assessments. For example,
in a hand grip test, the pressure exerted is measured on a ratio scale.
 Timed perceptual-motor tasks, like assembling a puzzle, also use ratio scales. Since time has a true
zero (0 seconds), comparisons like "twice as fast" are meaningful.
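As a rough illustration of how the level of measurement limits which statistics are meaningful, the sketch below uses Python's standard statistics module. All variable names and data values are made up for illustration and do not come from any real test.

```python
import statistics

# Hypothetical data for each level of measurement (values are made up).
job_types = ["clerk", "pilot", "clerk", "nurse"]  # nominal: labels only
applicant_ranks = [1, 2, 3, 4, 5]                  # ordinal: order, unknown gaps
iq_scores = [80, 100, 120]                         # interval: equal units, no true zero
puzzle_seconds = [30.0, 60.0]                      # ratio: true zero at 0 seconds

# Nominal: only counting and the mode are meaningful.
most_common_job = statistics.mode(job_types)

# Ordinal: the median respects order without assuming equal gaps.
median_rank = statistics.median(applicant_ranks)

# Interval: differences and means are meaningful.
mean_iq = statistics.mean(iq_scores)
equal_gaps = (iq_scores[1] - iq_scores[0]) == (iq_scores[2] - iq_scores[1])

# Ratio: ratios are meaningful because zero means none of the quantity.
# Here the second solver took twice as long as the first.
time_ratio = puzzle_seconds[1] / puzzle_seconds[0]

print(most_common_job, median_rank, mean_iq, equal_gaps, time_ratio)
```

Taking the mean of the job labels, or the ratio of two IQ scores, would run into exactly the problems described above: the numbers would not correspond to anything in the trait being measured.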
Measurement Errors
Fallibility of the Measurement Instrument
 It refers to the inherent imperfections or limitations present in the instrument, which cause
measurement error.
 Every test, scale, or assessment tool is subject to fallibility, meaning that the observed scores will
differ to some degree from the "true" scores.
 This fallibility can be due to factors such as poorly worded items, unclear instructions etc.
 Example scenarios: A digital scale might show slightly different readings for the same object when
measured multiple times due to random fluctuations in weight distribution.
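The idea that an observed score differs from the "true" score can be sketched as observed = true + random error, which is the usual classical test theory way of writing it. The simulation below is only an illustration: the true score, the size of the error, and the function name observe are all hypothetical.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

TRUE_SCORE = 50.0  # hypothetical "true" score, unknowable in practice
ERROR_SD = 3.0     # hypothetical spread of random measurement error

def observe(true_score, error_sd):
    """Return one observed score: the true score plus random error."""
    return true_score + random.gauss(0.0, error_sd)

observed = [observe(TRUE_SCORE, ERROR_SD) for _ in range(1000)]

# Single observations differ from the true score, but random errors
# tend to cancel out when many observations are averaged.
print(round(observed[0], 2), round(statistics.mean(observed), 2))
```

Only random error averages out in this way. A systematic flaw, such as a poorly worded item, would shift every observation in the same direction.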
Data entry errors refer to mistakes made during the process of transferring data from measurement
instruments (like test sheets, survey responses, or observation records) into databases or statistical software
for analysis. These errors are administrative or procedural rather than inherent to the test itself, but they still
affect the accuracy and reliability of the final data set.
Common causes of data entry errors include:
 Typographical mistakes (entering 67 instead of 76).
 Skipping entries.
 Misinterpreting responses.
 Errors in coding (assigning wrong numeric values to categorical data).
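One common safeguard, not described in these notes, is to key the data in twice and compare the two passes, combined with a simple range check. The sketch below assumes hypothetical scores on a 0 to 100 scale; the function names are made up for illustration.

```python
def find_entry_mismatches(first_pass, second_pass):
    """Return the positions where two independent entries of the same data disagree."""
    if len(first_pass) != len(second_pass):
        raise ValueError("Both passes must contain the same number of entries")
    return [i for i, (a, b) in enumerate(zip(first_pass, second_pass)) if a != b]

def find_out_of_range(scores, low, high):
    """Return the positions of missing (None) or out-of-range scores."""
    return [i for i, s in enumerate(scores) if s is None or not (low <= s <= high)]

# Hypothetical test scores on a 0-100 scale, keyed in twice.
first_pass = [76, 54, None, 88, 120]
second_pass = [67, 54, 61, 88, 12]

mismatches = find_entry_mismatches(first_pass, second_pass)
suspect = find_out_of_range(first_pass, 0, 100)
print(mismatches, suspect)
```

A transposition such as 76 versus 67 stays inside the valid range, so only the comparison of the two passes flags it. The range check catches the skipped entry and the impossible value of 120.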

Respondent errors refer to errors caused by factors related to the person being measured (the respondent),
which interfere with their ability to accurately represent their true abilities, attitudes, or knowledge during
the measurement process. This may occur due to the individual’s state, motivation, understanding, and
cooperation which directly affects the accuracy of the measurement.

Respondent errors reduce reliability and validity, because they introduce variance unrelated to the
actual trait being measured.

Types of rating error, each with an explanation and an example:
 Leniency Error: Tendency to give inflated or higher ratings than warranted. Example: During a self-rating
on a personality inventory (like NEO-PI-R), a respondent rates themselves as highly conscientious even if
they frequently miss deadlines.
 Severity Error: Tendency to give lower ratings than deserved. Example: In a self-assessment of anxiety
symptoms, a respondent rates their coping abilities as "very poor," even though they effectively manage
mild stress.
 Central Tendency Error: Tendency to avoid extremes and give average ratings to everyone. Example: Rating
of "3" on a 5-point scale regardless of actual performance.
 Halo Effect: One positive trait influences all other ratings. Example: If an employee is very punctual,
the rater assumes they are also good at teamwork.
 Logical Error: Assuming logically related traits must have similar ratings. Example: Rating someone high
on leadership automatically results in high communication ratings.
 Contrast Error: Comparing the person to others instead of to objective standards. Example: An average
worker looks outstanding compared to weaker colleagues.
 Recency Error: Rating influenced by most recent behavior, ignoring earlier performance. Example: A
therapist rates a client’s coping skills as excellent because of good progress in the last two sessions,
ignoring earlier struggles.
 First Impression Error: Initial impression anchors all future ratings, even if it no longer fits.
Example: A strong first day leads to consistently inflated ratings despite later performance.
 Attribution Error: Incorrectly attributing performance to personal traits rather than external
circumstances. Example: Assuming low productivity is due to laziness instead of technical issues.
 Spillover Effect: Past performance spills over into the current rating period. Example: An employee’s
strong performance last year inflates this year’s rating.
 Social Desirability Bias: Respondents give answers to questions that they believe will make them look
good to others, concealing their true opinions or experiences. Example: On a personality test, a client
underreports impulsive behavior and exaggerates positive traits, trying to appear "perfect."


Psychometrics
Test theory refers to the body of principles and techniques used to design, develop, evaluate, and interpret
psychological and educational tests. It focuses on ensuring the reliability, validity, and fairness of the tests
used to measure psychological constructs, abilities, or traits (Cohen, Swerdlik, & Sturman, 2018).
Item Response Theory
IRT has been around since the mid-twentieth century. Modern test theory really got underway with the
seminal work by Lord and Novick (1968).
Item Response Theory (IRT) is a framework used to understand how people respond to test questions,
survey items, or assessment tools. Instead of simply counting how many items a person got right (like in
traditional scoring), IRT looks deeper. It connects a person’s ability (or their level on a hidden trait) with
the characteristics of each item they answered.
In IRT, the goal is to understand both sides:
 The people (respondents): How much of the trait (like anxiety, knowledge, or satisfaction) does a
person have?
 The items: How difficult is each item? How good is it at telling apart people with high or low levels
of the trait?
IRT aims to achieve three major goals:
1. Estimating Person’s Trait Level (Theta, θ)
 Every person’s theta (θ) score shows where they stand on the trait being measured.
 For example, if the trait is depression, a lower theta means low depression, and a higher theta means
higher depression.
 Theta is not a simple sum of correct answers. Instead, it reflects the underlying level of the trait,
based on both which items the person answered and how difficult those items were.
2. Estimating Item Properties (Parameters)
There are several properties (parameters) that IRT estimates for each item:
 Difficulty (b): This tells us how “hard” the item is. For a difficult item, only people with high ability
(high theta) are likely to get it right; for an easy item, even people with low ability are likely to get it
right.
 Discrimination (a): This shows how well the item can separate high-ability people from low-ability
people. A highly discriminating item distinguishes between high and low scorers more sharply than a poorly
discriminating item.
 Guessing (c): In some tests, especially multiple-choice tests, there’s a chance that someone can guess
the right answer. The guessing parameter estimates this probability — how likely it is that a person
with very low ability could still get the item correct by guessing.
3. Improving Measurement Efficiency
IRT helps create shorter and smarter tests by looking at both the person’s ability and how good each
item is at measuring the trait.
Example: In adaptive tests, if someone answers hard questions correctly, the test skips easier ones and gives
tougher ones, saving time while still giving an accurate score. This works because IRT shows how much
information each item gives about the person’s ability.
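The notion of "how much information each item gives" can be sketched with the standard item information function for a two-parameter logistic item, a² · P · (1 − P). The item bank, its parameter values, and the function names below are hypothetical, and a real adaptive test would also re-estimate ability after every response.

```python
import math

def prob_correct(theta, a, b):
    """2PL probability of a correct response at ability theta."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def item_information(theta, a, b):
    """Information of a 2PL item at ability theta: a^2 * P * (1 - P)."""
    p = prob_correct(theta, a, b)
    return a * a * p * (1.0 - p)

# Hypothetical item bank: name -> (discrimination a, difficulty b).
item_bank = {"easy": (1.0, -2.0), "medium": (1.0, 0.0), "hard": (1.0, 2.0)}

def most_informative_item(theta, bank):
    """Pick the item that gives the most information at the current ability estimate."""
    return max(bank, key=lambda name: item_information(theta, *bank[name]))

print(most_informative_item(-1.5, item_bank))
print(most_informative_item(1.8, item_bank))
```

With equal discrimination, the most informative item is the one whose difficulty lies closest to the current ability estimate. This is why a high-ability respondent is routed to harder items.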
Assumptions
IRT rests on four assumptions, which are quite restrictive:
 Monotonicity – As a person's trait level increases, the probability of answering correctly also
increases.
 Unidimensionality – All items measure only one main trait, and the item characteristic curves
follow a set form, such as a two-parameter model.
 Local Independence – Each item functions independently, meaning the response to one item does
not influence responses to other items.
 Invariance – Item parameters, like difficulty, can be estimated from any group of people who
answered the item, allowing flexibility in analysis without relying on a specific sample.
Features of IRT
1. Many parameters are estimated simultaneously, so IRT usually requires a large sample of items
(20 or more) per test as well as a large sample of respondents for results to be stable.
2. IRT allows for the assessment of measurement error at any level of theta (θ).
3. One of the more interesting features of IRT is that neutral response categories (e.g., "neither
agree nor disagree") can be examined to determine whether they are actually a neutral choice or, instead,
a category individuals use randomly.
4. The fundamental assumption in IRT is that there is a linkage between a response to any item on a test
and the characteristic being assessed by the test.
5. The characteristic is called a latent trait and it is denoted by the symbol theta (θ).
6. The linkage is that the probability of a positive response to any single item on a test is a function of
the individual's level on the latent trait.
7. The critical feature that is analyzed in IRT models is the entire pattern of responses to all test items
by an individual.
8. IRT allows for accurate assessments of item bias against or for selected subgroups.
9. IRT has found tremendous use in computer adaptive testing, where an item of moderate difficulty is
typically presented to the respondent first and later items are chosen according to the responses given.
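To see why the pattern of responses matters more than the raw total, the sketch below estimates theta for two hypothetical respondents who each answer two of four items correctly. It assumes a two-parameter logistic model with made-up item parameters and uses a simple grid search for the maximum-likelihood estimate. Operational IRT software uses more refined estimation methods.

```python
import math

def prob_correct(theta, a, b):
    """2PL probability of a correct response at ability theta."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def log_likelihood(theta, items, responses):
    """Log-likelihood of a 0/1 response pattern at a given theta."""
    total = 0.0
    for (a, b), u in zip(items, responses):
        p = prob_correct(theta, a, b)
        total += math.log(p) if u == 1 else math.log(1.0 - p)
    return total

def estimate_theta(items, responses):
    """Grid-search maximum-likelihood estimate of theta on [-4, 4]."""
    grid = [i / 100.0 for i in range(-400, 401)]
    return max(grid, key=lambda t: log_likelihood(t, items, responses))

# Hypothetical items as (discrimination a, difficulty b).
items = [(0.5, -1.0), (0.5, 0.0), (2.0, 0.0), (2.0, 1.0)]

pattern_a = [1, 1, 0, 0]  # correct only on the weakly discriminating items
pattern_b = [0, 0, 1, 1]  # correct only on the highly discriminating items

theta_a = estimate_theta(items, pattern_a)
theta_b = estimate_theta(items, pattern_b)
print(sum(pattern_a), sum(pattern_b), theta_a, theta_b)
```

Both respondents have the same number-correct score, yet the second receives a higher theta estimate because the items answered correctly are harder and more discriminating.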
How does IRT work?
 Item Response Theory (IRT) models predict how people will respond to items based on their ability
level (θ) and the items’ characteristics (like difficulty).
 Each response gives some clue about the person’s ability. The higher the ability, the higher the
chance of answering correctly. This relationship is shown using a graph called the Item
Characteristic Curve.
 The curve is usually S-shaped. The curve starts flat on the left (for low-ability people), rises steeply
in the middle, and flattens out at the top (for high-ability people).
 The curve shows that as ability increases, the chance of a correct answer also increases. Ability (θ)
can range from very low to very high, but in practice, it usually ranges from -3 to +3.
 This curve helps show how useful the item is. For example, a bad item (like a confusing or poorly worded
question) might have a flatter or messy curve.
IRT answers questions like:
 How difficult is each question?
 How well does each question discriminate high-ability from low-ability test-takers?
 Can people guess the correct answer?
The 3 IRT Models
1PL (One-Parameter Logistic Model)
 In this model, the only item-specific parameter is difficulty (b) — how hard the item is.
 The discrimination (a) is fixed across all items (often set to 1). This means all items are equally
good at separating high- and low-ability people.
 This is a simple and strict model, focusing mostly on difficulty.
 All curves are parallel; they differ only in where they sit on the difficulty scale. Easier items shift
left, and harder items shift right.
 Example: How hard is each question?
 Use: Simpler tests where all questions are equally good at measuring ability
2PL (Two-Parameter Logistic Model)
 This model introduces two item parameters: Difficulty (b) and Discrimination (a) — how well the
item separates high-ability from low-ability respondents.
 Discrimination varies across items. Some items are better at distinguishing ability than others.
 Curves can have different slopes. Steeper curves mean the item is very sensitive to ability — small
changes in θ cause big changes in the probability of a correct response.
 Advantages: Provides more flexibility — not all items need to have the same discrimination.
 Example: Which questions are harder and which ones are better at telling apart strong and weak
students?
 Use: Common when test quality varies (some items are better than others).
3PL (Three-Parameter Logistic Model)
 This model has three item parameters: Difficulty (b), Discrimination (a), and a third, the Guessing
parameter (c), which is the chance that low-ability people can guess the correct answer.
 This is important for multiple-choice tests, where guessing can play a large role.
 Curves never touch zero — even for very low-ability people, there’s still a small chance (c) of
getting the item right through guessing.
 Advantages:
 Realistic for tests with multiple-choice items.
 Gives a more accurate picture of what’s happening at the low end of ability.
 Example: In multiple-choice tests, some questions might be answered correctly by guessing, and this model
accounts for that.
Figure: Probability of a correct response under the 1PL, 2PL, and 3PL models.
1. Blue curve (1PL): Considers only item difficulty. The probability increases symmetrically as ability
(θ) rises above the difficulty level (b = 0).
2. Green curve (2PL): Includes discrimination (a = 1.5), making the curve steeper; it differentiates
better between high and low abilities.
3. Red curve (3PL): Adds guessing (c = 0.2), lifting the baseline. Even low-ability individuals have at
least a 20% chance of answering correctly due to guessing.
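The three models can be written as one logistic function, with the 1PL and 2PL as special cases of the 3PL. The sketch below uses b = 0, a = 1.5 and c = 0.2 from the figure description. Using a = 1.5 for the 3PL curve as well is an assumption, as is the plain logistic form with no scaling constant (some texts multiply a by D = 1.7). The function name irt_probability is made up for illustration.

```python
import math

def irt_probability(theta, b=0.0, a=1.0, c=0.0):
    """P(correct) under the 3PL model.

    With c = 0 this is the 2PL; with c = 0 and a = 1 it is the 1PL.
    """
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

for theta in (-3, 0, 3):
    p_1pl = irt_probability(theta, b=0.0)
    p_2pl = irt_probability(theta, b=0.0, a=1.5)
    p_3pl = irt_probability(theta, b=0.0, a=1.5, c=0.2)
    print(theta, round(p_1pl, 3), round(p_2pl, 3), round(p_3pl, 3))
```

At θ = b the 1PL and 2PL curves both pass through 0.5, while the 3PL curve sits higher because the guessing floor c lifts the whole curve. At very low θ the 3PL probability approaches c rather than zero.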
Advantages of IRT
 Provides deeper insights into both people’s abilities and item quality.
 Works well with adaptive tests, even if people get different sets of items.
 Allows for shorter, smarter, and personalized tests without losing accuracy.
 Gives detailed item feedback, helping improve the test over time.
Drawbacks of IRT
 Requires strict assumptions to apply the model correctly.
 Needs large sample sizes (both items and respondents) for reliable results.
 Many IRT software tools are complex and not user-friendly for beginners.
2.4 & 2.5 continued
Norms are differentiated by certain characteristics of the reference groups on which they are based:
 Age Norms: These norms show the average performance of test-takers at a specific age (e.g.,
average scores of 10-year-olds, 12-year-olds, etc.). They are commonly used in developmental tests
(like intelligence tests or cognitive ability tests).
 Grade Norms: These norms indicate the average performance of students in a particular school
grade (e.g., Grade 5 norms, Grade 8 norms). These are more relevant for academic achievement tests.

 Local Norms: These are derived from a sample taken from a smaller, more localized population
(e.g., a particular school district, city, or region).
 National Norms: These are based on a sample drawn from the entire country, ideally representing
the whole population in terms of demographics like age, gender, socio-economic status, and
educational background.

 Group Norms: These represent the average performance of a group of participants tested at the
same time or within a shared context.
 Individual Norms: These focus on comparing one person’s repeated performance over time,
allowing for personal growth tracking.
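The choice of norm group changes how the same raw score is interpreted. The sketch below compares one raw score against two hypothetical norm groups using a z-score and a percentile rank. The groups are far too small to serve as real norms, and the function names are made up for illustration.

```python
import statistics

def z_score(raw, norm_scores):
    """Standing of a raw score relative to a norm group, in SD units."""
    return (raw - statistics.mean(norm_scores)) / statistics.pstdev(norm_scores)

def percentile_rank(raw, norm_scores):
    """Percentage of the norm group scoring below the raw score."""
    below = sum(1 for s in norm_scores if s < raw)
    return 100.0 * below / len(norm_scores)

# Hypothetical norm groups (illustrative only).
local_norms = [40, 45, 50, 55, 60]
national_norms = [50, 55, 60, 65, 70]

raw = 55
print(round(z_score(raw, local_norms), 2), percentile_rank(raw, local_norms))
print(round(z_score(raw, national_norms), 2), percentile_rank(raw, national_norms))
```

In this made-up case the same raw score is above average against the local group and below average against the national group.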
To be accurate, test norms must be based on the scores of large and representative samples of participants
• who have been tested under standard conditions, and
• who take the test as seriously as will other students for whom the norms are needed.
Three R's are most often used to judge the appropriateness of a set of norms for a given testing situation:
a) Representativeness: This refers to how well the norm group reflects the population for which the test
is intended. If a test is meant for high school students in India, the norm group should represent Indian
students across regions, school types, and socio-economic backgrounds.
b) Relevance: This refers to how closely the norm group matches the specific target group for whom you
are interpreting scores. For example, using norms derived from urban students might not be relevant when
testing students from rural backgrounds.
c) Recency: This refers to how up-to-date the norms are. Older norms may no longer be valid if
educational methods, curriculum, or social conditions have changed significantly. Generally, norms should
be updated periodically to reflect current realities.
Standard scores
Name      Mean   SD   Explanation
Z score   0      1    The basis of other scales
T score   50     10   Usually limited to a range of 20 to 80
Sten      5.5    2    Standard ten (1 to 10 scale)
Stanine   5      2    Standard nine (1 to 9)
IQ        100    15   The common intellectual …
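Each scale in the table is a linear transformation of the z score: new score = mean + SD × z. The sketch below applies that rule using the means and SDs from the table. The raw score and norm-group values are hypothetical, and the function name convert is made up for illustration.

```python
# (mean, SD) for each standard-score scale, taken from the table above.
SCALES = {"T": (50, 10), "sten": (5.5, 2), "stanine": (5, 2), "IQ": (100, 15)}

def convert(raw, norm_mean, norm_sd):
    """Convert a raw score to z and then to each standard-score scale."""
    z = (raw - norm_mean) / norm_sd
    scores = {"z": z}
    for name, (mean, sd) in SCALES.items():
        scores[name] = mean + sd * z  # linear transformation of z
    return scores

# Hypothetical raw score of 65 in a norm group with mean 50 and SD 10.
result = convert(65, 50, 10)
print(result)
```

In practice, stens and stanines are reported as whole numbers within their 1 to 10 and 1 to 9 ranges, so the linear values computed here would still need to be rounded and capped.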
