Background of Test Design
Over the last 15 years, there has been a proliferation in the use of assessment
for accountability purposes at the national, state, and local district levels. Test
results have been used as indices in making decisions about individual students,
such as advancement from one grade to the next or graduation from high school.
Test results have also been aggregated across individuals to make decisions about
groups; they have been used to judge the quality of schools or to determine
funding allotments within a district or state. Furthermore, test results have been
aggregated to the state level and used as a tool to make comparisons among
states. In short, tests have been used for many purposes.
At the workshop, Pamela Moss, a member of the joint committee that developed
the Standards for Educational and Psychological Testing (American Educational
Research Association [AERA] et al., 1999), observed that tests in educational
settings are typically designed to fulfill one of three general purposes: (1) to
provide diagnostic information, (2) to evaluate student progress, or (3) to evaluate
programs. During the discussion about the purposes tests can serve in educational
settings and in her overview of the Standards, Moss alluded to a number of
measurement concepts.
I. Definition of Test Design
Assessing learners' outcomes is not an easy task, since every teacher has a
different view of what assessment is and of what the process implies. One
element, however, seems to be common: evaluation is a natural activity.
"Evaluation is not restricted to the context of education; it is part of our
everyday lives" (Dickins & Germaine, 1992, p. 3). Taking this into account, we
might say that we are engaged in an implicit process of evaluating and
assessing our daily lives, trying to understand people, actions, and phenomena,
and, of course, our own teaching and learning development. The assumptions we
gather from lifetime experience are linked to our understanding of the nature
of language teaching and learning. More precisely, assessment is an umbrella
term that encompasses the instruments used to measure learners' achievement,
such as tests, project work, and observation. 'Testing' is different from
assessment: a test is "a method of measuring a person's ability or knowledge
in a given area" (Brown, 1994, p. 252). Cohen (1994, p. 196) argues that "a
single test of overall ability … does not give an accurate picture of an
individual's proficiency" and that "a range of different assessment procedures
are necessary". Because of the numerous issues and biases associated with
standardized tests (Garcia & Pearson, 1991, 1994; Wrigley & Guth, 1992),
developments in language assessment methods and procedures have resulted in an
increase in the use of 'alternative' methods.
II. Steps of Designing a Test
Educational Testing Service (ETS) develops assessments that are of the highest
quality, accurately measure the necessary knowledge and skills, and are fair to all
test takers. We understand that creating a fair, valid and reliable test is a complex
process that involves multiple checks and balances. That’s why dozens of
professionals — including test specialists, test reviewers, editors, teachers and
specialists in the subject or skill being tested — are involved in developing every
test question, or "test item." And it's why all questions (or "items") are put
through multiple, rigorous reviews and meet the highest standards for quality and
fairness in the testing industry. Below is an overview of the key steps ETS takes
when developing a new test.
Step 1: Defining Objectives
Educators, licensing boards or professional associations identify a need to
measure certain skills or knowledge. Once a decision is made to develop a test to
accommodate this need, test developers ask some fundamental questions:
Who will take the test and for what purpose?
What skills and/or areas of knowledge should be tested?
How should test takers be able to use their knowledge?
What kinds of questions should be included? How many of each kind?
How long should the test be?
How difficult should the test be?
Step 2: Item Development Committees
The questions in Step 1 are usually answered with the help of
item development committees, which typically consist of educators and/or other
professionals appointed by ETS with the guidance of the sponsoring agency or
association. Responsibilities of these item development committees may include:
a. Defining test objectives and specifications.
b. Helping ensure test questions are unbiased.
c. Determining test format (e.g., multiple-choice, essay, constructed-response).
d. Considering supplemental test materials.
e. Reviewing test questions, or test items, written by ETS staff.
f. Writing test questions.
Step 3: Writing and Reviewing Questions
Each test question — written by ETS staff or item development committees —
undergoes numerous reviews and revisions to ensure it is as clear as possible, that
it has only one correct answer among the options provided on the test and that it
conforms to the style rules used throughout the test. Scoring guides for open-
ended responses, such as short written answers, essays and oral responses, go
through similar reviews.
Step 4: The Pretest
After the questions have been written and reviewed, many are pretested with a
sample group similar to the population to be tested. The results enable test
developers to determine:
The difficulty of each question
Whether any questions are ambiguous or misleading
Whether any questions should be revised or eliminated
Whether incorrect alternative answers should be revised or replaced
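As a rough illustration of the kind of pretest item analysis described above, the sketch below computes an item's difficulty (the proportion of pretest takers who answered it correctly) and counts how often each option was chosen, which is one common way to spot a misleading alternative answer. The key, responses, and code are hypothetical, not ETS's actual analysis tools.

```python
from collections import Counter

key = "C"                                    # intended correct option for this item
responses = ["C", "B", "C", "D", "C", "A",
             "C", "C", "B", "C"]             # one chosen option per pretest taker

counts = Counter(responses)
difficulty = counts[key] / len(responses)    # proportion answering correctly
print(f"item difficulty (proportion correct): {difficulty:.2f}")

for option in sorted(counts):                # how often each option, right or wrong, was chosen
    flag = " <- key" if option == key else ""
    print(f"option {option}: {counts[option]} of {len(responses)}{flag}")
```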
Step 5: Detecting and Removing Unfair Questions
To meet the stringent ETS Standards for Quality and Fairness guidelines,
trained reviewers must carefully inspect each individual test question, the test as
a whole and any descriptive or preparatory materials to ensure that language,
symbols, words, phrases and content generally regarded as sexist, racist or
otherwise inappropriate or offensive to any subgroup of the test-taking
population are eliminated.
Through a process called Differential Item Functioning (DIF), ETS statisticians
can also identify questions on which two groups of test takers who have
demonstrated similar knowledge or skills perform differently. If one group performs
consistently better than another on a particular question, that question receives
additional scrutiny and may be deemed biased or unsatisfactory. Note: If people in
different groups actually differ in their average levels of relevant knowledge or
skills, a fair test question will reflect those differences.
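The core idea behind DIF screening can be illustrated with a small simulation. The sketch below matches test takers from two groups on their score on the rest of the test and then compares, within each matched score band, how often each group answers the item under review correctly; a persistent gap inside the matched bands would flag the item for additional scrutiny. This is only a sketch of the matching idea on simulated data; operational DIF analyses typically use formal statistics such as the Mantel-Haenszel procedure.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
group = rng.choice(["reference", "focal"], size=n)    # two groups of simulated test takers
matching_score = rng.integers(0, 41, size=n)          # total score on the rest of the test
# simulated right/wrong responses to the single item under review
item_correct = (rng.random(n) < 0.25 + 0.015 * matching_score).astype(int)

# Compare proportion correct within bands of comparable overall ability.
for lo, hi in [(0, 10), (11, 20), (21, 30), (31, 40)]:
    in_band = (matching_score >= lo) & (matching_score <= hi)
    p_ref = item_correct[in_band & (group == "reference")].mean()
    p_foc = item_correct[in_band & (group == "focal")].mean()
    print(f"scores {lo:2d}-{hi:2d}: reference {p_ref:.2f}, focal {p_foc:.2f}, gap {p_ref - p_foc:+.2f}")
```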
Step 6: Assembling the Test
After the test is assembled, it is reviewed by other specialists, committee
members and sometimes other outside experts. Each reviewer answers all
questions independently and submits a list of correct answers to the test
developers. The lists are compared with the ETS answer keys to verify that the
intended answer is, indeed, the correct answer. Any discrepancies are resolved
before the test is published.
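A toy version of that key-verification step is sketched below: each reviewer's independently produced answer list is compared against the intended key, and any mismatched items are reported so the discrepancy can be resolved. The key, reviewer names, and answers are invented for illustration.

```python
# Hypothetical five-item key and two reviewers' independent answer lists.
answer_key = ["A", "C", "B", "D", "B"]
reviewer_keys = {
    "reviewer_1": ["A", "C", "B", "D", "B"],
    "reviewer_2": ["A", "C", "D", "D", "B"],
}

for reviewer, answers in reviewer_keys.items():
    mismatches = [i + 1 for i, (given, keyed) in enumerate(zip(answers, answer_key))
                  if given != keyed]
    status = f"discrepancies at item(s) {mismatches}" if mismatches else "matches the key"
    print(f"{reviewer}: {status}")
```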
Step 7: Making Sure — Even After the Test is Administered — that the Test
Questions are Functioning Properly
Even after the test has been administered, statisticians and test developers review the
results to make sure that test questions are working as intended. Before final scoring
takes place, each question undergoes preliminary statistical analysis and results
are reviewed question by question. If a problem is detected, such as the
identification of a misleading answer to a question, corrective action, such as not
scoring the question, is taken before final scoring and score reporting takes place.
Tests are also reviewed for reliability. Performance on one version of the test
should reasonably predict performance on any other version of the test. If
reliability is high, results will be similar no matter which version a test taker
completes.
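One common way to quantify this kind of reliability is alternate-forms reliability: the same test takers sit two versions of the test and their two sets of scores are correlated, with values near 1 indicating that performance on one form closely predicts performance on the other. The sketch below computes that correlation on invented scores; it illustrates the general technique rather than ETS's own procedures.

```python
import numpy as np

# Hypothetical total scores for the same eight test takers on two test versions.
form_a = [32, 28, 40, 35, 22, 30, 38, 27]
form_b = [31, 30, 39, 33, 24, 29, 37, 28]

# Pearson correlation between the two forms serves as the reliability estimate.
r = np.corrcoef(form_a, form_b)[0, 1]
print(f"estimated alternate-forms reliability: r = {r:.2f}")
```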
III. Strategies of Designing Tests
There are several assessment strategies that can be used to promote
21st century learning, as suggested by the National Research Council (US)
Board on Science Education (2010). They are structured interviews, situational
judgment tests, role plays, group exercises, in-basket exercises, work
samples, and performance standards/appraisal.
1. Structured interviews: A structured interview uses a standard set of
questions to assess students’ inter-personal, communication and
leadership/team skills.
2. Situational judgment tests: A situational judgment test is intended to
measure a variety of soft skills by presenting individuals with short scenarios
as well as a number of responses. Students are asked to choose the best
response for that scenario or to rank the responses in order of most
appropriate to least appropriate.
3. Role plays: In a role play assessment, the students are provided with
written information about a realistic situation that may involve a non-
routine problem. After a period of time to prepare for role plays, they
present their responses to the situation. The assessors rate the response
using behaviorally anchored rating scales, which describe specific behaviors.
4. Group exercises: The development, administration, and scoring of a group
exercise are similar to the role play. The critical difference is that the
students work in groups to address a problem or respond to a situation,
making it possible to assess their interactive skills such as negotiation,
persuasion and teamwork.
5. In-basket exercises: An in-basket exercise is an activity that can assess
how well students perform job-related tasks within a certain period of time.
6. Work samples: Work sample tests require students to perform tasks or
work activities that mirror the tasks employees perform on the job.
7. Performance standards/appraisal: A performance-based assessment is a
summative strategy to assess student knowledge as well as their ability to
apply knowledge in a “real-world” situation.
IV. Types of Designing Tests
Common types of tests and test items are discussed below:
Objective Tests: An objective test is one in which a student's
performance is measured against a standard and specific set of
answers (i.e., for each question there is a right or wrong answer).
When composing test questions, it is important to be direct and use
language that is straightforward and familiar to the students. In
addition, the answer choices provided on the test should be
challenging enough that students aren't able to guess the correct
answer simply by comparing how all of the options are written.
Examples of objective test items include the following:
Multiple-choice
True-false
Matching
Problem-based questions: these require the student to
complete or solve an equation or prompt and are commonly
used in application-based courses such as mathematics,
chemistry, and physics.
Subjective Tests: Unlike objective tests for which there is a
definitive standardized or formulated answer, subjective tests
are evaluated based on the judgment or opinion of the
examiner. Tests of this nature are often designed so that the
student is presented with a number of questions or writing prompts
and demonstrates mastery of the learning objective in his/her responses.
When composing prompts as test questions, it is crucial that
you phrase the prompt clearly and precisely. You want to make
sure that the prompt elicits the type of thinking skill that you want
to measure and that the students' task is clear. For example, if
you want students to compare two items, you need to provide
or list the criteria to be used as the basis for comparison.
Examples of subjective test items include the following:
Essay
Short answer
When grading subjective tests or test items, the use of an established set of
scoring criteria or a well-developed rubric helps to level the playing field
and increase the test's reliability. For more information on rubric
development, please see the additional online resources provided.
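As a rough illustration of how a fixed set of scoring criteria supports consistent grading, the sketch below represents a hypothetical essay rubric and totals one rater's scores, rejecting values outside the allowed range. The criteria, point scales, and scores are all invented.

```python
# Hypothetical essay rubric: maximum points per criterion.
rubric = {"thesis": 4, "evidence": 4, "organization": 4, "mechanics": 4}

def total_score(ratings):
    """Sum one rater's scores, refusing any value outside the rubric's range."""
    for criterion, score in ratings.items():
        if not 0 <= score <= rubric[criterion]:
            raise ValueError(f"{criterion}: score {score} is outside 0-{rubric[criterion]}")
    return sum(ratings.values())

print(total_score({"thesis": 3, "evidence": 2, "organization": 4, "mechanics": 3}))  # 12 of 16
```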
The table below shows the advantages and disadvantages of
a selection of test item types. It's important to note that this is not an
exhaustive list, and remember that as the course instructor, you have the
freedom to choose what form of assessment most aptly measures your
specific learning objective.
Table: Advantages and Disadvantages of Commonly Used Types of Achievement Test Items

True-False
Advantages: Many items can be administered in a relatively short time. Moderately easy to write and easily scored.
Disadvantages: Limited primarily to testing knowledge of information. Easy to guess correctly on many items, even if the material has not been mastered.

Multiple Choice
Advantages: Can be used to assess a broad range of content in a brief period. Skillfully written items can measure higher-order cognitive skills. Can be scored quickly.
Disadvantages: Difficult and time-consuming to write good items. It is possible to assess higher-order cognitive skills, but most items assess only knowledge. Some correct answers can be guesses.

Matching
Advantages: Items can be written quickly. A broad range of content can be assessed. Scoring can be done efficiently.
Disadvantages: Higher-order cognitive skills are difficult to assess.

Short Answer or Completion
Advantages: Many can be administered in a brief amount of time. Relatively efficient to score. Moderately easy to write items.
Disadvantages: Difficult to identify defensible criteria for correct answers. Limited to questions that can be answered or completed in a few words.

Essay
Advantages: Can be used to measure higher-order cognitive skills. Easy to write questions. Difficult for the respondent to get the correct answer by guessing.
Disadvantages: Time-consuming to administer and score. Difficult to identify reliable criteria for scoring. Only a limited range of content can be sampled during any one testing period.
V. The Purposes behind Designing Tests
Assessment can serve different purposes, which include:
• Selection: e.g., to determine whether learners have sufficient language
proficiency to be able to undertake tertiary study;
• Certification: e.g., to provide people with a statement of their language
ability for employment purposes;
• Accountability: e.g., to provide educational funding authorities with
evidence that intended learning outcomes have been achieved and to
justify expenditure;
• Diagnosis: e.g., to identify learners' strengths and weaknesses;
• Instructional decision-making: e.g., to decide what material to present
next or what to revise;
• Motivation: e.g., to encourage learners to study harder.
VI. Conclusion
Tests and exams often play a significant role in the overall assessment of
students' learning. Therefore, as instructors, it is essential that we pay
particular attention to the manner in which we construct these instruments.
Remember to always keep your course goals and learning objectives at the
forefront of your mind as you begin to determine what kind of test is the
best measure of your students' learning. To that end, if it fits with your
course design and content, you may want to consider alternate forms of
assessment such as group projects, student portfolios or other activities
that extend and build throughout the course of the semester. These
alternative or non-traditional forms of assessment frequently offer students
a more authentic opportunity to apply their knowledge and higher-order
thinking skills.
References
https://www.nap.edu/read/10366/chapter/4#41
https://www.depts.ttu.edu/tlpdc/Resources/Teaching_resources/TLPDC_teaching_resources/createtests.php
Octavio Garay Álvarez, Assessment and Testing in the EFL Classroom, p. 2.
Assia Ben Tayeb Ouahiani, Assessment in the EFL University Classroom:
Between Tradition and Innovation, Department of English, Abu Baker
Belkaid University of Tlemcen, Algeria.
Atiqah Nurul Asri, Designing a 21st Century Assessment in EFL
Learning Context, Politeknik Negeri Malang, Malang.