
LANGUAGE ASSESSMENT: PRINCIPLES AND CLASSROOM PRACTICES (EVALUATION)

H. DOUGLAS BROWN
LIC. LISSETTE DE LANDAVERDE

CHAPTER 1: TESTING, ASSESSING, AND TEACHING


If you hear the word test in any classroom setting, your thoughts are not likely to be positive,
pleasant, or affirming. The anticipation of a test is almost always accompanied by feelings of anxiety and
self-doubt—along with a fervent hope that you will come out of it alive. Tests seem as unavoidable as
tomorrow’s sunrise in virtually every kind of educational setting. Courses of study in every discipline are
marked by periodic tests—milestones of progress (or inadequacy)—and you intensely wish for a
miraculous exemption from these ordeals. We live by tests and sometimes (metaphorically) die by them.

For a quick revisiting of how tests affect many learners, take the following vocabulary quiz. All the
words are found in standard English dictionaries, so you should be able to answer all six items correctly,
right? Okay, take the quiz and circle the correct definition for each word.

Circle the correct answer. You have 3 minutes to complete this examination!

1. polygene

a. the first stratum of lower-order protozoa containing multiple genes

b. a combination of two or more plastics to produce a highly durable material

c. one of a set of cooperating genes, each producing a small quantitative effect

d. any of a number of multicellular chromosomes

2. cynosure

a. an object that serves as a focal point of attention and admiration; a center of interest or attention

b. a narrow opening caused by a break or fault in limestone caves

c. the cleavage in rock caused by glacial activity

d. one of a group of electrical impulses capable of passing through metals

3. gudgeon

a. a jail for commoners during the Middle Ages, located in the villages of Germany and France

b. a strip of metal used to reinforce beams and girders in building construction

c. a tool used by Alaskan Indians to carve totem poles

d. a small Eurasian freshwater fish

4. hippogriff

a. a term used in children’s literature to denote colorful and descriptive phraseology

b. a mythological monster having the wings, claws, and head of a griffin and the body of a horse

c. ancient Egyptian cuneiform writing commonly found on the walls of tombs

d. a skin transplant from the leg or foot to the hip



5. reglet

a. a narrow, flat molding

b. a musical composition of regular beat and harmonic intonation

c. an Australian bird of the eagle family

d. a short sleeve found on women’s dresses in Victorian England

6. fictile

a. a short, oblong-shaped projectile used in early eighteenth-century cannons

b. an Old English word for the leading character of a fictional novel

c. moldable plastic; formed of a moldable substance such as clay or earth

d. pertaining to the tendency of certain lower mammals to lose visual depth perception with increasing
age.

Now, how did that make you feel? Probably just the same as many learners feel when they take many
multiple-choice (or shall we say multiple-guess?), timed, “tricky” tests. To add to the torment, if this were
a commercially administered standardized test, you might have to wait weeks before learning your
results. You can check your answers on this quiz now by turning to page 16. If you correctly identified
three or more items, congratulations! You just exceeded the average.

Of course, this little pop quiz on obscure vocabulary is not an appropriate example of classroom-based
achievement testing, nor is it intended to be. It’s simply an illustration of how tests make us feel much of
the time. Can tests be positive experiences? Can they build a person’s confidence and become learning
experiences? Can they bring out the best in students? The answer is a resounding yes! Tests need not be
degrading, artificial, anxiety-provoking experiences. And that’s partly what this book is all about: helping
you to create more authentic, intrinsically motivating assessment procedures that are appropriate for their
context and designed to offer constructive feedback to your students.

Before we look at tests and test design in second language education, we need to understand three
basic interrelated concepts: testing, assessment, and teaching. Notice that the title of this book is
Language Assessment, not Language Testing. There are important differences between these two
constructs, and an even more important relationship among testing, assessing, and teaching.

WHAT IS A TEST?

A test, in simple terms, is a method of measuring a person's ability, knowledge, or performance in a
given domain. Let’s look at the components of this definition. A test is first a
method. It is an instrument—a set of techniques, procedures, or items— that requires performance on
the part of the test-taker. To qualify as a test, the method must be explicit and structured: multiple-choice
questions with prescribed correct answers; a writing prompt with a scoring rubric; an oral interview based
on a question script and a checklist of expected responses to be filled in by the administrator.

Second, a test must measure. Some tests measure general ability, while others focus on very specific
competencies or objectives. A multi-skill proficiency test determines a general ability level; a quiz on
recognizing correct use of definite articles measures specific knowledge. The way the results or
measurements are communicated may vary. Some tests, such as a classroom-based short-answer essay
test, may earn the test-taker a letter grade accompanied by the instructor’s marginal comments. Others,
particularly large-scale standardized tests, provide a total numerical score, a percentile rank, and perhaps
some subscores. If an instrument does not specify a form of reporting measurement—a means for
offering the test-taker some kind of result—then that technique cannot appropriately be defined as a test.

Next, a test measures an individual’s ability, knowledge, or performance. Testers need to understand
who the test-takers are. What is their previous experience and background? Is the test appropriately
matched to their abilities? How should test-takers interpret their scores?

A test measures performance, but the results imply the test-taker’s ability, or, to use a concept
common in the field of linguistics, competence. Most language tests measure one’s ability to perform
language, that is, to speak, write, read, or listen to a subset of language. On the other hand, it is not
uncommon to find tests designed to tap into a test-taker’s knowledge about language: defining a
vocabulary item, reciting a grammatical rule, or identifying a rhetorical feature in written discourse.
Performance-based tests sample the test-taker’s actual use of language, but from those samples the test
administrator infers general competence. A test of reading comprehension, for example, may consist of
several short reading passages each followed by a limited number of comprehension questions—a small
sample of a second language learner’s total reading behavior. But from the results of that test, the
examiner may infer a certain level of general reading ability.

Finally, a test measures a given domain. In the case of a proficiency test, even though the actual
performance on the test involves only a sampling of skills, that domain is overall proficiency in a
language—general competence in all skills of a language. Other tests may have more specific criteria. A
test of pronunciation might well be a test of only a limited set of phonemic minimal pairs. A vocabulary
test may focus on only the set of words covered in a particular lesson or unit. One of the biggest obstacles
to overcome in constructing adequate tests is to measure the desired criterion and not include other
factors inadvertently, an issue that is addressed in Chapters 2 and 3.

A well-constructed test is an instrument that provides an accurate measure of the test-taker’s ability
within a particular domain. The definition sounds fairly simple, but in fact, constructing a good test is a
complex task involving both science and art.

ASSESSMENT AND TEACHING

Assessment is a popular and sometimes misunderstood term in current educational practice. You
might be tempted to think of testing and assessing as synonymous terms, but they are not. Tests are
prepared administrative procedures that occur at identifiable times in a curriculum when learners muster
all their faculties to offer peak performance, knowing that their responses are being measured and
evaluated.

Assessment, on the other hand, is an ongoing process that encompasses a much wider domain.
Whenever a student responds to a question, offers a comment, or tries out a new word or structure, the
teacher subconsciously makes an assessment of the student’s performance. Written work—from a jotted-
down phrase to a formal essay—is performance that ultimately is assessed by self, teacher, and possibly
other students. Reading and listening activities usually require some sort of productive performance that
the teacher implicitly judges, however peripheral that judgment may be. A good teacher never ceases to
assess students, whether those assessments are incidental or intended.

Tests, then, are a subset of assessment; they are certainly not the only form of assessment that a
teacher can make. Tests can be useful devices, but they are only one among many procedures and tasks
that teachers can ultimately use to assess students.

But now, you might be thinking, if you make assessments every time you teach something in the
classroom, does all teaching involve assessment? Are teachers constantly assessing students with no
interaction that is assessment-free?

The answer depends on your perspective. For optimal learning to take place, students in the classroom
must have the freedom to experiment, to try out their own hypotheses about language without feeling
that their overall competence is being judged in terms of those trials and errors. In the same way that
tournament tennis players must, before a tournament, have the freedom to practice their skills with no
implications for their final placement on that day of days, so also must learners have ample opportunities
to “play” with language in a classroom without being formally graded. Teaching sets up the practice
games of language learning: the opportunities for learners to listen, think, take risks, set goals, and
process feedback from the “coach” and then recycle through the skills that they are trying to master. (A
diagram of the relationship among testing, teaching, and assessment is found in Figure 1.1.)

Figure 1.1. Tests, assessment, and teaching (the figure depicts tests as a subset of assessment, and
assessment as a subset of teaching)

At the same time, during these practice activities, teachers (and tennis coaches) are indeed observing
students’ performance and making various evaluations of each learner: How did the performance compare
to previous performance? Which aspects of the performance were better than others? Is the learner
performing up to an expected potential? How does the performance compare to that of others in the same
learning community? In the ideal classroom, all these observations feed into the way the teacher provides
instruction to each student.

Informal and Formal Assessment

One way to begin untangling the lexical conundrum created by distinguishing among tests,
assessment, and teaching is to distinguish between informal and formal assessment. Informal
assessment can take a number of forms, starting with incidental, unplanned comments and responses,
along with coaching and other impromptu feedback to the student. Examples include saying “Nice job!”
“Good work!” “Did you say can or can’t?” “I think you meant to say you broke the glass, not you break
the glass,” or putting a ✓ on some homework.

Informal assessment does not stop there. A good deal of a teacher’s informal assessment is embedded
in classroom tasks designed to elicit performance without recording results and making fixed judgments
about a student’s competence. Examples at this end of the continuum are marginal comments on papers,
responding to a draft of an essay, advice about how to better pronounce a word, a suggestion for a
strategy for compensating for a reading difficulty, and showing how to modify a student's note-taking to
better remember the content of a lecture.

On the other hand, formal assessments are exercises or procedures specifically designed to tap into
a storehouse of skills and knowledge. They are systematic, planned sampling techniques constructed to
give teacher and student an appraisal of student achievement. To extend the tennis analogy, formal
assessments are the tournament games that occur periodically in the course of a regimen of practice.

Is formal assessment the same as a test? We can say that all tests are formal assessments, but not all
formal assessment is testing. For example, you might use a student’s journal or portfolio of materials as a
formal assessment of the attainment of certain course objectives, but it is problematic to call those two
procedures “tests.” A systematic set of observations of a student’s frequency of oral participation in class
is certainly a formal assessment, but it too is hardly what anyone would call a test. Tests are usually
relatively time-constrained (usually spanning a class period or at most several hours) and draw on a
limited sample of behavior.

Formative and Summative Assessment

Another useful distinction to bear in mind is the function of an assessment: How is the procedure to be
used? Two functions are commonly identified in the literature: formative and summative assessment. Most
of our classroom assessment is formative assessment: evaluating students in the process of “forming”
their competencies and skills with the goal of helping them to continue that growth process. The key to
such formation is the delivery (by the teacher) and internalization (by the student) of appropriate
feedback on performance, with an eye toward the future continuation (or formation) of learning.

For all practical purposes, virtually all kinds of informal assessment are (or should be) formative. They
have as their primary focus the ongoing development of the learner’s language. So when you give a
student a comment or a suggestion, or call attention to an error, that feedback is offered in order to
improve the learner’s language ability.

Summative assessment aims to measure, or summarize, what a student has grasped, and typically
occurs at the end of a course or unit of instruction. A summation of what a student has learned implies
looking back and taking stock of how well that student has accomplished objectives, but does not
necessarily point the way to future progress. Final exams in a course and general proficiency exams are
examples of summative assessment.

One of the problems with prevailing attitudes toward testing is the view that all tests (quizzes, periodic
review tests, midterm exams, etc.) are summative. At various points in your past educational experiences,
no doubt you’ve considered such tests as summative. You may have thought, “Whew! I’m glad that’s
over. Now I don’t have to remember that stuff anymore!” A challenge to you as a teacher is to change
that attitude among your students: Can you instill a more formative quality to what your students might
otherwise view as a summative test? Can you offer your students an opportunity to convert tests into
“learning experiences”? We will take up that challenge in subsequent chapters in this book.

Norm-Referenced and Criterion-Referenced Tests

Another dichotomy that is important to clarify here and that aids in sorting out common terminology in
assessment is the distinction between norm-referenced and criterion-referenced testing. In norm-
referenced tests, each test-taker’s score is interpreted in relation to a mean (average score), median
(middle score), standard deviation (extent of variance in scores), and/or percentile rank. The purpose in
such tests is to place test-takers along a mathematical continuum in rank order. Scores are usually
reported back to the test-taker in the form of a numerical score (for example, 230 out of 300) and a
percentile rank (such as 84 percent, which means that the test-taker’s score was higher than 84 percent
of the total number of test-takers, but lower than 16 percent in that administration). Typical of norm-
referenced tests are standardized tests like the Scholastic Aptitude Test (SAT®) or the Test of English as a
Foreign Language (TOEFL®), intended to be administered to large audiences, with results efficiently
disseminated to test-takers. Such tests must have fixed, predetermined responses in a format that can be
scored quickly at minimum expense. Money and efficiency are primary concerns in these tests.
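
To make the arithmetic behind such score reports concrete, here is a minimal sketch (not from the text; the raw scores and the 300-point scale are invented for illustration) that computes the mean, median, standard deviation, and a percentile rank for one hypothetical administration:

# Illustrative sketch only: hypothetical raw scores from one administration
# of a 300-point norm-referenced test.
from statistics import mean, median, pstdev

scores = [182, 195, 201, 210, 214, 219, 225, 230, 238, 246, 255, 261]

def percentile_rank(score, all_scores):
    """Percentage of test-takers in this administration who scored lower."""
    below = sum(1 for s in all_scores if s < score)
    return 100 * below / len(all_scores)

print("mean:", round(mean(scores), 1))               # average score
print("median:", median(scores))                     # middle score
print("standard deviation:", round(pstdev(scores), 1))  # extent of variance
print("percentile rank of a 230:", round(percentile_rank(230, scores)))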

Criterion-referenced tests, on the other hand, are designed to give test-takers feedback, usually in
the form of grades, on specific course or lesson objectives. Classroom tests involving the students in only
one class, and connected to a curriculum, are typical of criterion-referenced testing. Here, much time and
effort on the part of the teacher (test administrator) are sometimes required in order to deliver useful,
appropriate feedback to students, or what Oller (1979, p. 52) called “instructional value.” In a criterion-
referenced test, the distribution of students’ scores across a continuum may be of little concern as long as
the instrument assesses appropriate objectives. In Language Assessment, with an audience of classroom
language teachers and teachers in training, and with its emphasis on classroom-based assessment (as
opposed to standardized, large-scale testing), criterion-referenced testing is of more prominent interest
than norm-referenced testing.

APPROACHES TO LANGUAGE TESTING: A BRIEF HISTORY

Now that you have a reasonably clear grasp of some common assessment terms, we now turn to one
of the primary concerns of this book: the creation and use of tests, particularly classroom tests. A brief
history of language testing over the past half-century will serve as a backdrop to an understanding of
classroom-based testing.

Historically, language-testing trends and practices have followed the shifting sands of teaching
methodology (for a description of these trends, see Brown, Teaching by Principles [hereinafter TBP],
Chapter 2). For example, in the 1950s, an era of behaviorism and special attention to contrastive analysis,
testing focused on specific language elements such as the phonological, grammatical, and lexical contrasts
between two languages. In the 1970s and 1980s, communicative theories of language brought with them
a more integrative view of testing in which specialists claimed that “the whole of the communicative event
was considerably greater than the sum of its linguistic elements” (Clark, 1983, p. 432). Today, test
designers are still challenged in their quest for more authentic, valid instruments that simulate real-world
interaction.

Discrete-Point and Integrative Testing

This historical perspective underscores two major approaches to language testing that were debated in
the 1970s and early 1980s. These approaches still prevail today, even if in mutated form: the choice
between discrete-point and integrative testing methods (Oller, 1979). Discrete-point tests are
constructed on the assumption that language can be broken down into its component parts and that those
parts can be tested successfully. These components are the skills of listening, speaking, reading, and
writing, and various units of language (discrete points) of phonology/graphology, morphology, lexicon,
syntax, and discourse. It was claimed that an overall language proficiency test, then, should sample all
four skills and as many linguistic discrete points as possible.

Such an approach demanded a decontextualization that often confused the test-taker. So, as the
profession emerged into an era of emphasizing communication, authenticity, and context, new approaches
were sought. Oller (1979) argued that language competence is a unified set of interacting abilities that
cannot be tested separately. His claim was that communicative competence is so global and requires such
integration (hence the term “integrative” testing) that it cannot be captured in additive tests of grammar,
reading, vocabulary, and other discrete points of language. Others (among them Cziko, 1982, and
Savignon, 1982) soon followed in their support for integrative testing.

What does an integrative test look like? Two types of tests have historically been claimed to be
examples of integrative tests: cloze tests and dictations. A cloze test is a reading passage (perhaps 150
to 300 words) in which roughly every sixth or seventh word has been deleted; the test-taker is required
to supply words that fit into those blanks. (See Chapter 8 for a full discussion of cloze testing.) Oller
(1979) claimed that cloze test results are good measures of overall proficiency. According to theoretical
constructs underlying this claim, the ability to supply appropriate words in blanks requires a number of
abilities that lie at the heart of competence in a language: knowledge of vocabulary, grammatical
structure, discourse structure, reading skills and strategies, and an internalized “expectancy” grammar
(enabling one to predict an item that will come next in a sequence). It was argued that successful
completion of cloze items taps into all of those abilities, which were said to be the essence of global
language proficiency.
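
As a rough illustration of the fixed-ratio deletion procedure just described, the sketch below (not from the text; the passage, deletion ratio, and function name are invented for illustration) blanks out every nth word of a passage and keeps the deleted words as an answer key:

# Illustrative sketch: build a fixed-ratio cloze passage by deleting every
# nth word and retaining the deleted words as an answer key.
def make_cloze(passage, n=7, start=2):
    words = passage.split()
    answer_key = []
    for i in range(start, len(words), n):
        answer_key.append(words[i])
        words[i] = "______"
    return " ".join(words), answer_key

sample = ("Assessment is an ongoing process that encompasses a much wider "
          "domain than testing, because teachers judge learner performance "
          "whenever students produce language in the classroom.")
gapped_text, key = make_cloze(sample, n=7)
print(gapped_text)
print("Answer key:", key)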

Dictation is a familiar language-teaching technique that evolved into a testing technique. Essentially,
learners listen to a passage of 100 to 150 words read aloud by an administrator (or audiotape) and write
what they hear, using correct spelling. The listening portion usually has three stages: an oral reading
without pauses; an oral reading with long pauses between every phrase (to give the learner time to write
down what is heard); and a third reading at normal speed to give test-takers a chance to check what they
wrote.

Supporters argue that dictation is an integrative test because it taps into grammatical and discourse
competencies required for other modes of performance in a language. Success on a dictation requires
careful listening, reproduction in writing of what is heard, efficient short-term memory, and, to an extent,
some expectancy rules to aid the short-term memory. Further, dictation test results tend to correlate
strongly with other tests of proficiency. Dictation testing is usually classroom- centered since large-scale
administration of dictations is quite impractical from a scoring standpoint. Reliability of scoring criteria for
dictation tests can be improved by designing multiple-choice or exact-word cloze test scoring.
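
One way to picture the more objective scoring mentioned above is an exact-word procedure, in which a blank (or dictated word) is counted correct only when the response matches the original word exactly. The sketch below is only an illustration; the answer key and learner responses are invented:

# Illustrative sketch: exact-word scoring -- a response is correct only when
# it matches the deleted word exactly (ignoring case and surrounding spaces).
def exact_word_score(responses, answer_key):
    return sum(1 for given, expected in zip(responses, answer_key)
               if given.strip().lower() == expected.strip().lower())

key = ["process", "wider", "teachers", "performance", "language"]
learner = ["process", "broader", "teachers", "performance", "languages"]
print(exact_word_score(learner, key), "out of", len(key))  # -> 3 out of 5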

Proponents of integrative test methods soon centered their arguments on what became known as the
unitary trait hypothesis, which suggested an “indivisible” view of language proficiency: that vocabulary,
grammar, phonology, the “four skills,” and other discrete points of language could not be disentangled
from each other in language performance. The unitary trait hypothesis contended that there is a general
factor of language proficiency such that all the discrete points do not add up to that whole.

Others argued strongly against the unitary trait position. In a study of students in Brazil and the
Philippines, Farhady (1982) found significant and widely varying differences in performance on an ESL
proficiency test, depending on subjects’ native country, major field of study, and graduate versus
undergraduate status. For example, Brazilians scored very low in listening comprehension and relatively
high in reading comprehension. Filipinos, whose scores on five of the six components of the test were
considerably higher than Brazilians’ scores, were actually lower than Brazilians in reading comprehension
scores. Farhady’s contentions were supported in other research that seriously questioned the unitary trait
hypothesis. Finally, in the face of the evidence, Oller retreated from his earlier stand and admitted that
“the unitary trait hypothesis was wrong” (1983, p. 352).

Communicative Language Testing

By the mid-1980s, the language-testing field had abandoned arguments about the unitary trait
hypothesis and had begun to focus on designing communicative language-testing tasks. Bachman and
Palmer (1996, p. 9) include among “fundamental” principles of language testing the need for a
correspondence between language test performance and language use: “In order for a particular language
test to be useful for its intended purposes, test performance must correspond in demonstrable ways to
language use in non-test situations.” The problem that language assessment experts faced was that tasks
tended to be artificial, contrived, and unlikely to mirror language use in real life. As Weir (1990, p. 6)
noted, “Integrative tests such as cloze only tell us about a candidate’s linguistic competence. They do not
tell us anything directly about a student’s performance ability.”

And so a quest for authenticity was launched, as test designers centered on communicative
performance. Following Canale and Swain’s (1980) model of communicative competence, Bachman (1990)
proposed a model of language competence consisting of organizational and pragmatic competence,
respectively subdivided into grammatical and textual components, and into illocutionary and sociolinguistic
components. (Further discussion of both Canale and Swain’s and Bachman’s models can be found in PLLT,
Chapter 9.) Bachman and Palmer (1996, pp. 70f) also emphasized the importance of strategic
competence (the ability to employ communicative strategies to compensate for breakdowns as well as to
enhance the rhetorical effect of utterances) in the process of communication. All elements of the model,
especially pragmatic and strategic abilities, needed to be included in the constructs of language testing
and in the actual performance required of test-takers.

Communicative testing presented challenges to test designers, as we will see in subsequent chapters
of this book. Test constructors began to identify the kinds of real-world tasks that language learners were
called upon to perform. It was clear that the contexts for those tasks were extraordinarily widely varied
and that the sampling of tasks for any one assessment procedure needed to be validated by what
language users actually do with language. Weir (1990, p. 11) reminded his readers that “to measure
language proficiency ... account must now be taken of: where, when, how, with whom, and why language
is to be used, and on what topics, and with what effect.” And the assessment field became more and more
concerned with the authenticity of tasks and the genuineness of texts. (See Skehan, 1988, 1989, for a
survey of communicative testing research.)

Performance-Based Assessment

In language courses and programs around the world, test designers are now tackling this new and
more student-centered agenda (Alderson, 2001, 2002). Instead of just offering paper-and-pencil selective
response tests of a plethora of separate items, performance-based assessment of language typically
involves oral production, written production, open-ended responses, integrated performance (across skill
areas), group performance, and other interactive tasks. To be sure, such assessment is time-consuming
and therefore expensive, but those extra efforts are paying off in the form of more direct testing because
students are assessed as they perform actual or simulated real-world tasks. In technical terms, higher
content validity (see Chapter 2 for an explanation) is achieved because learners are measured in the
process of performing the targeted linguistic acts.

In an English language-teaching context, performance-based assessment means that you may have a
difficult time distinguishing between formal and informal assessment. If you rely a little less on formally
structured tests and a little more on evaluation while students are performing various tasks, you will be
taking some steps toward meeting the goals of performance-based testing. (See Chapter 10 for a further
discussion of performance-based assessment.)

A characteristic of many (but not all) performance-based language assessments is the presence of
interactive tasks. In such cases, the assessments involve learners in actually performing the behavior that
we want to measure. In interactive tasks, test-takers are measured in the act of speaking, requesting,
responding, or in combining listening and speaking, and in integrating reading and writing. Paper-and-
pencil tests certainly do not elicit such communicative performance.

A prime example of an interactive language assessment procedure is an oral interview. The test-taker
is required to listen accurately to someone else and to respond appropriately. If care is taken in the test
design process, language elicited and volunteered by the student can be personalized and meaningful, and
tasks can approach the authenticity of real-life language use.

CURRENT ISSUES IN CLASSROOM TESTING

The design of communicative, performance-based assessment rubrics continues to challenge both
assessment experts and classroom teachers. Such efforts to improve various facets of classroom testing
are accompanied by some stimulating issues, all of which are helping to shape our current understanding
of effective assessment. Let’s look at three such issues: the effect of new theories of intelligence on the
testing industry; the advent of what has come to be called “alternative” assessment; and the increasing
popularity of computer-based testing.

New Views on Intelligence

Intelligence was once viewed strictly as the ability to perform (a) linguistic and (b) logical-
mathematical problem solving. This “IQ” (intelligence quotient) concept of intelligence has permeated the
Western world and its way of testing for almost a century. Since “smartness” in general is measured by
timed, discrete-point tests consisting of a hierarchy of separate items, why shouldn’t every field of study
be so measured? For many years, we have lived in a world of standardized, norm-referenced tests that
are timed in a multiple-choice format consisting of a multiplicity of logic-constrained items, many of which
are inauthentic.

However, research on intelligence by psychologists like Howard Gardner, Robert Sternberg, and Daniel
Goleman has begun to turn the psychometric world upside down. Gardner (1983, 1999), for example,
extended the traditional view of intelligence to seven different components. He accepted the traditional
conceptualizations of linguistic intelligence and logical-mathematical intelligence on which standardized IQ
tests are based, but he included five other “frames of mind” in his theory of multiple intelligences:

• spatial intelligence (the ability to find your way around an environment, to form mental images of
reality)

• musical intelligence (the ability to perceive and create pitch and rhythmic patterns)

• bodily-kinesthetic intelligence (fine motor movement, athletic prowess)

• interpersonal intelligence (the ability to understand others and how they feel, and to interact
effectively with them)

• intrapersonal intelligence (the ability to understand oneself and to develop a sense of self-identity)

Robert Sternberg (1988, 1997) also charted new territory in intelligence research in recognizing
creative thinking and manipulative strategies as part of intelligence. All “smart” people aren’t necessarily
adept at fast, reactive thinking. They may be very innovative in being able to think beyond the normal
limits imposed by existing tests, but they may need a good deal of processing time to enact this creativity.
Other forms of smartness are found in those who know how to manipulate their environment, namely,
other people. Debaters, politicians, successful salespersons, smooth talkers, and con artists are all smart
in their manipulative ability to persuade others to think their way, vote for them, make a purchase, or do
something they might not otherwise do.

More recently, Daniel Goleman’s (1993) concept of “EQ” (emotional quotient) has spurred us to
underscore the importance of the emotions in our cognitive processing. Those who manage their
emotions—especially emotions that can be detrimental—tend to be more capable of fully intelligent
processing. Anger, grief, resentment, self-doubt, and other feelings can easily impair peak performance in
everyday tasks as well as higher-order problem solving.

These new conceptualizations of intelligence have not been universally accepted by the academic
community (see White, 1998, for example). Nevertheless, their intuitive appeal infused the decade of the
1990s with a sense of both freedom and responsibility in our testing agenda. Coupled with parallel
educational reforms at the time (Armstrong, 1994), they helped to free us from relying exclusively on
timed, discrete-point, analytical tests in measuring language. We were prodded to cautiously combat the
potential tyranny of “objectivity” and its accompanying impersonal approach. But we also assumed the
responsibility for tapping into whole language skills, learning processes, and the ability to negotiate
meaning. Our challenge was to test interpersonal, creative, communicative, interactive skills, and in doing
so to place some trust in our subjectivity and intuition.

Traditional and “Alternative” Assessment

Implied in some of the earlier description of performance-based classroom assessment is a trend to
supplement traditional test designs with alternatives that are more authentic in their elicitation of
meaningful communication. Table 1.1 highlights differences between the two approaches (adapted from
Armstrong, 1994, and Bailey, 1998, p. 207).

Two caveats need to be stated here. First, the concepts in Table 1.1 represent some
overgeneralizations and should therefore be considered with caution. It is difficult, in fact, to draw a clear
line of distinction between what Armstrong (1994) and Bailey (1998) have called traditional and
alternative assessment. Many forms of assessment fall in between the two, and some combine the best of
both.

Second, it is obvious that the table shows a bias toward alternative assessment, and one should not be
misled into thinking that everything on the left-hand side is tainted while the list on the right-hand side
offers salvation to the field of language assessment! As Brown and Hudson (1998) aptly pointed out, the
assessment traditions available to us should be valued and utilized for the functions that they provide. At
the same time, we might all be stimulated to look at the right-hand list and ask ourselves if, among those
concepts, there are alternatives to assessment that we can constructively use in our classrooms.

It should be noted here that considerably more time and higher institutional budgets are required to
administer and score assessments that presuppose more subjective evaluation, more individualization,
and more interaction in the process of offering feedback. The payoff for the latter, however, comes with
more useful feedback to students, the potential for intrinsic motivation, and ultimately a more complete
description of a student’s ability. (See Chapter 10 for a complete treatment of alternatives in assessment.)
More and more educators and advocates for educational reform are arguing for a de-emphasis on large-
scale standardized tests in favor of building budgets that will offer the kind of contextualized,
communicative performance-based assessment that will better facilitate learning in our schools. (In
Chapter 4, issues surrounding standardized testing are addressed at length.)

Table 1.1. Traditional and alternative assessment

Traditional Assessment              Alternative Assessment

One-shot, standardized exams        Continuous long-term assessment
Timed, multiple-choice format       Untimed, free-response format
Decontextualized test items         Contextualized communicative tasks
Scores suffice for feedback         Individualized feedback and washback
Norm-referenced scores              Criterion-referenced scores
Focus on the “right” answer         Open-ended, creative answers
Summative                           Formative
Oriented to product                 Oriented to process
Non-interactive performance         Interactive performance
Fosters extrinsic motivation        Fosters intrinsic motivation

As you read this book, I hope you will do so with an appreciation for the place of testing in
assessment, and with a sense of the interconnection of assessment and teaching. Assessment is an
integral part of the teaching-learning cycle. In an interactive, communicative curriculum, assessment is
almost constant. Tests, which are a subset of assessment, can provide authenticity, motivation, and
feedback to the learner. Tests are essential components of a successful curriculum and one of several
partners in the learning process. Keep in mind these basic principles:

1. Periodic assessments, both formal and informal, can increase motivation by serving as milestones of
student progress.

2. Appropriate assessments aid in the reinforcement and retention of information.

3. Assessments can confirm areas of strength and pinpoint areas needing further work.

4. Assessments can provide a sense of periodic closure to modules within a curriculum.

5. Assessments can promote student autonomy by encouraging students’ self-evaluation of their
progress.

6. Assessments can spur learners to set goals for themselves.

7. Assessments can aid in evaluating teaching effectiveness.

Answers to the vocabulary quiz on pages 1 and 2: 1c, 2a, 3d, 4b, 5a, 6c.

EXERCISES

[Note: (I) Individual work; (G) Group or pair work; (C) Whole-class discussion.]

1. (G) In a small group, look at Figure 1.1 on page 5 that shows tests as a subset of assessment and
the latter as a subset of teaching. Do you agree with this diagrammatic depiction of the three terms?
Consider the following classroom teaching techniques: choral drill, pair pronunciation practice, reading
aloud, information gap task, singing songs in English, writing a description of the weekend’s activities.
What proportion of each has an assessment facet to it? Share your conclusions with the rest of the class.

2. (G) The chart below shows a hypothetical line of distinction between formative and summative
assessment, and between informal and formal assessment. As a group, place the following
techniques/procedures into one of the four cells and justify your decision. Share your results with other
groups and discuss any differences of opinion.

Placement tests

Diagnostic tests

Periodic achievement tests

Short pop quizzes

Standardized proficiency tests

Final exams

Portfolios

Journals

Speeches (prepared and rehearsed)

Oral presentations (prepared, but not rehearsed)

Impromptu student responses to teacher’s questions

Student-written response (one paragraph) to a reading assignment

Drafting and revising writing

Final essays (after several drafts)

Student oral responses to teacher questions after a videotaped lecture

Whole class open-ended discussion of a topic

                Formative                 Summative

Informal        _______________           _______________

Formal          _______________           _______________

3. (I/C) Review the distinction between norm-referenced and criterion-referenced testing. If norm-
referenced tests typically yield a distribution of scores that resemble a bell-shaped curve, what kinds of
distributions are typical of classroom achievement tests in your experience?

4. (I/C) Restate in your own words the argument between unitary trait proponents and discrete-point
testing advocates. Why did Oller back down from the unitary trait hypothesis?

5. (I/C) Why are cloze and dictation considered to be integrative tests?

6. (G) Look at the list of Gardner’s seven intelligences. Take one or two intelligences, as assigned to
your group, and brainstorm some teaching activities that foster that type of intelligence. Then, brainstorm
some assessment tasks that may presuppose the same intelligence in order to perform well. Share your
results with other groups.

7. (C) As a whole-class discussion, brainstorm a variety of test tasks that class members have
experienced in learning a foreign language. Then decide which of those tasks are performance-based,
which are not, and which ones fall in between.

8. (G) Table 1.1 lists traditional and alternative assessment tasks and characteristics. In pairs, quickly
review the advantages and disadvantages of each, on both sides of the chart. Share your conclusions with
the rest of the class.

9. (C) Ask class members to share any experiences with computer-based testing and evaluate the
advantages and disadvantages of those experiences.

FOR YOUR FURTHER READING

McNamara, Tim. (2000). Language testing. Oxford: Oxford University Press.

One of a number of Oxford University Press’s brief introductions to various areas of language study,
this 140-page primer on testing offers definitions of basic terms in language testing with brief
explanations of fundamental concepts. It is a useful little reference book to check your understanding of
testing jargon and issues in the field.

Mousavi, Seyyed Abbas. (2002). An encyclopedic dictionary of language testing. Third Edition. Taipei:
Tung Hua Book Company.

This publication may be difficult to find in local bookstores, but it is a highly useful compilation of
virtually every term in the field of language testing, with definitions, background history, and research
references. It provides comprehensive explanations of theories, principles, issues, tools, and tasks. Its
exhaustive 88-page bibliography is also downloadable at http://www.abbas-mousavi.com. A shorter
version of this 942-page tome may be found in the previous version, Mousavi’s (1999) Dictionary of
language testing (Tehran: Rahnama Publications).

CHAPTER 2: PRINCIPLES OF LANGUAGE ASSESSMENT


This chapter explores how principles of language assessment can and should be applied to formal
tests, but with the ultimate recognition that these principles also apply to assessments of all kinds. In this
chapter, these principles will be used to evaluate an existing, previously published, or created test.
Chapter 3 will center on how to use those principles to design a good test.

How do you know if a test is effective? For the most part, that question can be answered by
responding to such questions as: Can it be given within appropriate administrative constraints? Is it
dependable? Does it accurately measure what you want it to measure? These and other questions help to
identify five cardinal criteria for “testing a test”: practicality, reliability, validity, authenticity, and
washback. We will look at each one, but with no priority order implied in the order of presentation.

PRACTICALITY

An effective test is practical. This means that it

• is not excessively expensive,

• stays within appropriate time constraints,

• is relatively easy to administer, and

• has a scoring/evaluation procedure that is specific and time-efficient.

A test that is prohibitively expensive is impractical. A test of language proficiency that takes a student
five hours to complete is impractical—it consumes more time (and money) than necessary to accomplish
its objective. A test that requires individual one-on-one proctoring is impractical for a group of several
hundred test-takers and only a handful of examiners. A test that takes a few minutes for a student to take
and several hours for an examiner to evaluate is impractical for most classroom situations. A test that can
be scored only by computer is impractical if the test takes place a thousand miles away from the nearest
computer. The value and quality of a test sometimes hinge on such nitty-gritty, practical considerations.

Here’s a little horror story about practicality gone awry. An administrator of a six-week summertime
short course needed to place the 50 or so students who had enrolled in the program. A quick search
yielded a copy of an old English Placement Test from the University of Michigan. It had 20 listening items
based on an audiotape and 80 items on grammar, vocabulary, and reading comprehension, all multiple-
choice format. A scoring grid accompanied the test. On the day of the test, the required number of test
booklets had been secured, a proctor had been assigned to monitor the process, and the administrator
and proctor had planned to have the scoring completed by later that afternoon so students could begin
classes the next day. Sounds simple, right? Wrong.

The students arrived, test booklets were distributed, and directions were given. The proctor started the
tape. Soon students began to look puzzled. By the time the tenth item played, everyone looked
bewildered. Finally, the proctor checked a test booklet and was horrified to discover that the wrong tape
was playing; it was a tape for another form of the same test! Now what? She decided to randomly select a
short passage from a textbook that was in the room and give the students a dictation. The students
responded reasonably well. The next 80 non-tape-based items proceeded without incident, and the
students handed in their score sheets and dictation papers.

When the red-faced administrator and the proctor got together later to score the tests, they faced the
problem of how to score the dictation—a more subjective process than some other forms of assessment
(see Chapter 6). After a lengthy exchange, the two established a point system, but after the first few
papers had been scored, it was clear that the point system needed revision. That meant going back to the
first papers to make sure the new system was followed.

The two faculty members had barely begun to score the 80 multiple-choice items when students began
returning to the office to receive their placements. Students were told to come back the next morning for
their results. Later that evening, having combined dictation scores and the 80-item multiple-choice scores,
the two frustrated examiners finally arrived at placements for all students.

It’s easy to see what went wrong here. While the listening comprehension section of the test was
apparently highly practical, the administrator had failed to check the materials ahead of time (which, as
you will see below, is a factor that touches on unreliability as well). Then, they established a scoring
procedure that did not fit into the time constraints. In classroom-based testing, time is almost always a
crucial practicality factor for busy teachers with too few hours in the day!

RELIABILITY

A reliable test is consistent and dependable. If you give the same test to the same student or matched
students on two different occasions, the test should yield similar results. The issue of reliability of a test
may best be addressed by considering a number of factors that may contribute to the unreliability of a
test. Consider the following possibilities (adapted from Mousavi, 2002, p. 804): fluctuations in the student,
in scoring, in test administration, and in the test itself.

Student-Related Reliability

The most common learner-related issue in reliability is caused by temporary illness, fatigue, a “bad
day,” anxiety, and other physical or psychological factors, which may make an “observed” score deviate
from one’s “true” score. Also included in this category are such factors as a test-taker’s “test-wiseness” or
strategies for efficient test taking (Mousavi, 2002, p. 804).

Rater Reliability

Human error, subjectivity, and bias may enter into the scoring process. Inter-rater unreliability occurs
when two or more scorers yield inconsistent scores of the same test, possibly for lack of attention to
scoring criteria, inexperience, inattention, or even preconceived biases. In the story above about the
placement test, the initial scoring plan for the dictations was found to be unreliable—that is, the two
scorers were not applying the same standards.

Rater-reliability issues are not limited to contexts where two or more scorers are involved. Intra-rater
unreliability is a common occurrence for classroom teachers because of unclear scoring criteria, fatigue, bias
toward particular “good” and “bad” students, or simple carelessness. When I am faced with up to 40 tests
to grade in only a week, I know that the standards I apply—however subliminally—to the first few tests
will be different from those I apply to the last few. I may be “easier” or “harder” on those first few papers
or I may get tired, and the result may be an inconsistent evaluation across all tests. One solution to such
intra-rater unreliability is to read through about half of the tests before rendering any final scores or
grades, then to recycle back through the whole set of tests to ensure an even-handed judgment. In tests
of writing skills, rater reliability is particularly hard to achieve since writing proficiency involves numerous
traits that are difficult to define. The careful specification of an analytical scoring instrument, however, can
increase rater reliability (J. D. Brown, 1991).
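
One rough, commonly used indicator of inter-rater consistency is the correlation between two raters' scores on the same set of papers. The sketch below is not from the text; the scores are invented, and a correlation is only one partial check, not a substitute for the analytical scoring instrument mentioned above. A coefficient near 1.0 suggests the raters rank the papers similarly; a low coefficient signals the kind of inconsistency described above.

# Illustrative sketch: Pearson correlation between two raters' scores on the
# same essays, as one rough check on inter-rater consistency.
from statistics import mean, pstdev

def pearson(x, y):
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)
    return cov / (pstdev(x) * pstdev(y))

rater_a = [78, 85, 62, 90, 71, 88, 66, 74]  # hypothetical essay scores
rater_b = [75, 88, 60, 92, 70, 84, 70, 72]

print(f"inter-rater correlation: {pearson(rater_a, rater_b):.2f}")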

Test Administration Reliability

Unreliability may also result from the conditions in which the test is administered. I once witnessed the
administration of a test of aural comprehension in which a tape recorder played items for comprehension,
but because of street noise outside the building, students sitting next to windows could not hear the tape
accurately. This was a clear case of unreliability caused by the conditions of the test administration. Other
sources of unreliability are found in photocopying variations, the amount of light in different parts of the
room, variations in temperature, and even the condition of desks and chairs.

Test Reliability

Sometimes the nature of the test itself can cause measurement errors. If a test is too long, test-takers
may become fatigued by the time they reach the later items and hastily respond incorrectly. Timed tests
may discriminate against students who do not perform well on a test with a time limit. We all know people
(and you may be included in this category!) who “know” the course material perfectly but who are
adversely affected by the presence of a clock ticking away. Poorly written test items (that are ambiguous
or that have more than one correct answer) may be a further source of test unreliability.

VALIDITY

By far the most complex criterion of an effective test—and arguably the most important principle—is
validity, “the extent to which inferences made from assessment results are appropriate, meaningful, and
useful in terms of the purpose of the assessment” (Gronlund, 1998, p. 226). A valid test of reading ability
actually measures reading ability—not 20/20 vision, nor previous knowledge in a subject, nor some other
variable of questionable relevance. To measure writing ability, one might ask students to write as many
words as they can in 15 minutes, then simply count the words for the final score. Such a test would be
easy to administer (practical), and the scoring quite dependable (reliable). But it would not constitute a
valid test of writing ability without some consideration of comprehensibility, rhetorical discourse elements,
and the organization of ideas, among other factors.

How is the validity of a test established? There is no final, absolute measure of validity, but several
different kinds of evidence may be invoked in support. In some cases, it may be appropriate to examine
the extent to which a test calls for performance that matches that of the course or unit of study being
tested. In other cases, we may be concerned with how well a test determines whether or not students
have reached an established set of goals or level of competence. Statistical correlation with other related
but independent measures is another widely accepted form of evidence. Other concerns about a test’s
validity may focus on the consequences—beyond measuring the criteria themselves—of a test, or even on
the test-taker’s perception of validity. We will look at these five types of evidence below.

Content-Related Evidence

If a test actually samples the subject matter about which conclusions are to be drawn, and if it
requires the test-taker to perform the behavior that is being measured, it can claim content-related
evidence of validity, often popularly referred to as content validity (e.g., Mousavi, 2002; Hughes, 2003).
You can usually identify content-related evidence observationally if you can clearly define the achievement
that you are measuring. A test of tennis competency that asks someone to run a 100-yard dash
obviously lacks content validity. If you are trying to assess a person’s ability to speak a second language
in a conversational setting, asking the learner to answer paper-and-pencil multiple-choice questions
requiring grammatical judgments does not achieve content validity. A test that requires the learner
actually to speak within some sort of authentic context does. And if a course has perhaps ten objectives
but only two are covered in a test, then content validity suffers.

Consider the following quiz on English articles for a high-beginner level of a conversation class
(listening and speaking) for English learners.

English articles quiz

Directions: The purpose of this quiz is for you and me to find out how well you know and can apply the
rules of article usage. Read the following passage and write a/an, the, or 0 (no article) in each blank.

Last night, I had (1) … very strange dream. Actually, it was (2) … nightmare! You know how much I love (3) … zoos. Well, I dreamt that I went to (4) … San Francisco zoo with (5) … few friends. When we got there, it was very dark, but (6) … moon was out, so we weren't afraid. I wanted to see (7) … monkeys first, so we walked past (8) … merry-go-round and (9) … lions' cages to (10) … monkey section.

(The story continues, with a total of 25 blanks to fill.)

The students had had a unit on zoo animals and had engaged in some open discussions and group
work in which they had practiced articles, all in listening and speaking modes of performance. In that this
quiz uses a familiar setting and focuses on previously practiced language forms, it is somewhat content
valid. The fact that it was administered in written form, however, and required students to read the
passage and write their responses makes it quite low in content validity for a listening/speaking class.

There are a few cases of highly specialized and sophisticated testing instruments that may have
questionable content-related evidence of validity. It is possible to contend, for example, that standard
language proficiency tests, with their context- reduced, academically oriented language and limited
stretches of discourse, lack content validity since they do not require the full spectrum of communicative
performance on the part of the learner (see Bachman, 1990, for a full discussion). There is good reasoning
behind such criticism; nevertheless, what such proficiency tests lack in content-related evidence they may
gain in other forms of evidence, not to mention practicality and reliability.

Another way of understanding content validity is to consider the difference between direct and indirect
testing. Direct testing involves the test-taker in actually performing the target task. In an indirect test,
learners are not performing the task itself but rather a task that is related in some way. For example, if
you intend to test learners’ oral production of syllable stress and your test task is to have learners mark
(with written accent marks) stressed syllables in a list of written words, you could, with a stretch of logic,
argue that you are indirectly testing their oral production. A direct test of syllable production would have
to require that students actually produce target words orally.

The most feasible rule of thumb for achieving content validity in classroom assessment is to test
performance directly. Consider, for example, a listening/ speaking class that is doing a unit on greetings
and exchanges that includes discourse for asking for personal information (name, address, hobbies, etc.)
with some form-focus on the verb to be, personal pronouns, and question formation. The test on that unit
should include all of the above discourse and grammatical elements and involve students in the actual
performance of listening and speaking.

What all the above examples suggest is that content is not the only type of evidence to support the
validity of a test, but classroom teachers have neither the time nor the budget to subject quizzes,
midterms, and final exams to the extensive scrutiny of a full construct validation (see below). Therefore, it
is critical that teachers hold content-related evidence in high esteem in the process of defending the
validity of classroom tests.

Criterion-Related Evidence

A second form of evidence of the validity of a test may be found in what is called criterion-related
evidence, also referred to as criterion-related validity, or the extent to which the “criterion” of the test has
actually been reached. You will recall that in Chapter 1 it was noted that most classroom-based
assessment with teacher- designed tests fits the concept of criterion-referenced assessment. In such
tests, specified classroom objectives are measured, and implied predetermined levels of performance are
expected to be reached (80 percent is considered a minimal passing grade).

In the case of teacher-made classroom assessments, criterion-related evidence is best demonstrated through a comparison of results of an assessment with results of some other measure of the same
criterion. For example, in a course unit whose objective is for students to be able to orally produce voiced
and voiceless stops in all possible phonetic environments, the results of one teacher’s unit test might be
compared with an independent assessment—possibly a commercially produced test in a textbook—of the
same phonemic proficiency. A classroom test designed to assess mastery of a point of grammar in
communicative use will have criterion validity if test scores are corroborated either by observed
subsequent behavior or by other communicative measures of the grammar point in question.

Criterion-related evidence usually falls into one of two categories: concurrent and predictive validity. A
test has concurrent validity if its results are supported by other concurrent performance beyond the
assessment itself. For example, the validity of a high score on the final exam of a foreign language course
will be substantiated by actual proficiency in the language. The predictive validity of an assessment
becomes important in the case of placement tests, admissions assessment batteries, language aptitude
tests, and the like. The assessment criterion in such cases is not to measure concurrent ability but to
assess (and predict) a test-taker’s likelihood of future success.
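
To make the idea of criterion-related evidence slightly more concrete, here is a minimal sketch, offered only as an illustration and not as part of the original discussion, of the kind of computation that underlies concurrent validity evidence: correlating scores on a teacher-made unit test with scores on an independent measure of the same criterion. The score lists and the interpretation are invented for the example.

# Illustrative sketch (hypothetical data): concurrent criterion-related
# evidence as a correlation between two measures of the same criterion.
from math import sqrt

def pearson(x, y):
    # Pearson correlation coefficient between two equal-length score lists
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

unit_test = [72, 85, 90, 64, 78, 95, 58, 81, 69, 88]            # teacher-made unit test
independent_measure = [70, 82, 93, 60, 75, 97, 55, 84, 72, 90]  # independent measure of the same criterion

r = pearson(unit_test, independent_measure)
print(f"r = {r:.2f}")  # strong positive r lends concurrent support; weak r does not

A strong positive correlation between the two sets of scores would corroborate the classroom test; a weak or negative one would suggest that the two instruments are not tapping the same criterion.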

Construct-Related Evidence

A third kind of evidence that can support validity, but one that does not play as large a role for
classroom teachers, is construct-related validity, commonly referred to as construct validity. A construct is
any theory, hypothesis, or model that attempts to explain observed phenomena in our universe of
perceptions. Constructs may or may not be directly or empirically measured—their verification often
requires inferential data. “Proficiency” and “communicative competence” are linguistic constructs; “self-esteem” and “motivation” are psychological constructs. Virtually every issue in language learning and
teaching involves theoretical constructs. In the field of assessment, construct validity asks, “Does this test
actually tap into the theoretical construct as it has been defined?” Tests are, in a manner of speaking,
operational definitions of constructs in that they operationalize the entity that is being measured (see
Davidson, Hudson, & Lynch, 1985).

For most of the tests that you administer as a classroom teacher, a formal construct validation
procedure may seem a daunting prospect. You will be tempted, perhaps, to run a quick content check and
be satisfied with the test’s validity. But don’t let the concept of construct validity scare you. An informal
construct validation of the use of virtually every classroom test is both essential and feasible.

Imagine, for example, that you have been given a procedure for conducting an oral interview. The
scoring analysis for the interview includes several factors in the final score: pronunciation, fluency,
grammatical accuracy, vocabulary use, and sociolinguistic appropriateness. The justification for these
five factors lies in a theoretical construct that claims those factors to be major components of oral
proficiency. So if you were asked to conduct an oral proficiency interview that evaluated only
pronunciation and grammar, you could be justifiably suspicious about the construct validity of that test.
Likewise, let’s suppose you have created a simple written vocabulary quiz, covering the content of a
recent unit, that asks students to correctly define a set of words. Your chosen items may be a perfectly

adequate sample of what was covered in the unit, but if the lexical objective of the unit was the
communicative use of vocabulary, then the writing of definitions certainly fails to match a construct of
communicative language use.

Construct validity is a major issue in validating large-scale standardized tests of proficiency. Because
such tests must, for economic reasons, adhere to the principle of practicality, and because they must
sample a limited number of domains of language, they may not be able to contain all the content of a
particular field or skill. The TOEFL®, for example, has until recently not attempted to sample oral
production, yet oral production is obviously an important part of academic success in a university course
of study. The TOEFL’s omission of oral production content, however, is ostensibly justified by research that
has shown positive correlations between oral production and the behaviors (listening, reading,
grammaticality detection, and writing) actually sampled on the TOEFL (see Duran et al., 1985). Because of
the crucial need to offer a financially affordable proficiency test and the high cost of administering and
scoring oral production tests, the omission of oral content from the TOEFL has been justified as an
economic necessity. (Note: As this book goes to press, oral production tasks are being included in the
TOEFL, largely stemming from the demands of the professional community for authenticity and content
validity.)

Consequential Validity

As well as the above three widely accepted forms of evidence that may be introduced to support the
validity of an assessment, two other categories may be of some interest and utility in your own quest for
validating classroom tests. Messick (1989), Gronlund (1998), McNamara (2000), and Brindley (2001),
among others, underscore the potential importance of the consequences of using an assessment.
Consequential validity encompasses all the consequences of a test, including such considerations as its
accuracy in measuring intended criteria, its impact on the preparation of test-takers, its effect on the
learner, and the (intended and unintended) social consequences of a test’s interpretation and use.

As high-stakes assessment has gained ground in the last two decades, one aspect of consequential
validity has drawn special attention: the effect of test preparation courses and manuals on performance.
McNamara (2000, p. 54) cautions against test results that may reflect socioeconomic conditions such as
opportunities for coaching that are “differentially available to the students being assessed (for example,
because only some families can afford coaching, or because children with more highly educated parents
get help from their parents).” The social consequences of large-scale, high-stakes assessment are
discussed in Chapter 6.

Another important consequence of a test falls into the category of washback, to be more fully
discussed below. Gronlund (1998, pp. 209-210) encourages teachers to consider the effect of
assessments on students’ motivation, subsequent performance in a course, independent learning, study
habits, and attitude toward school work.

Face Validity

An important facet of consequential validity is the extent to which “students view the assessment as
fair, relevant, and useful for improving learning” (Gronlund, 1998, p. 210), or what is popularly known as
face validity. “Face validity refers to the degree to which a test looks right, and appears to measure the
knowledge or abilities it claims to measure, based on the subjective judgment of the examinees who take
it, the administrative personnel who decide on its use, and other psychometrically unsophisticated
observers” (Mousavi, 2002, p. 244).

Sometimes students don’t know what is being tested when they tackle a test. They may feel, for a
variety of reasons, that a test isn’t testing what it is “supposed” to test. Face validity means that the
students perceive the test to be valid. Face validity asks the question “Does the test, on the ‘face’ of it,
appear from the learner’s perspective to test what it is designed to test?” Face validity will likely be high if
learners encounter

• a well-constructed, expected format with familiar tasks,

• a test that is clearly doable within the allotted time limit,

• items that are clear and uncomplicated,

• directions that are crystal clear,

• tasks that relate to their course work (content validity), and

• a difficulty level that presents a reasonable challenge.

Remember, face validity is not something that can be empirically tested by a teacher or even by a
testing expert. It is purely a factor of the “eye of the beholder”—how the test-taker, or possibly the test
giver, intuitively perceives the instrument. For this reason, some assessment experts (see Stevenson,
1985) view face validity as a superficial factor that is dependent on the whim of the perceiver.

The other side of this issue reminds us that the psychological state of the learner (confidence, anxiety,
etc.) is an important ingredient in peak performance by a learner. Students can be distracted and their
anxiety increased if you “throw a curve” at them on a test. They need to have rehearsed test tasks before
the fact and feel comfortable with them. A classroom test is not the time to introduce new tasks because
you won’t know if student difficulty is a factor of the task itself or of the objectives you are testing.

I once administered a dictation test and a cloze test (see Chapter 8 for a discussion of cloze tests) as a
placement test for a group of learners of English as a second language. Some learners were upset because
such tests, on the face of it, did not appear to them to test their true abilities in English. They felt that a
multiple-choice grammar test would have been the appropriate format to use. A few claimed they didn’t
perform well on the cloze and dictation because they were not accustomed to these formats. As it turned
out, the tests served as superior instruments for placement, but the students would not have thought so.
Face validity was low, content validity was moderate, and construct validity was very high.

As already noted above, content validity is a very important ingredient in achieving face validity. If a
test samples the actual content of what the learner has achieved or expects to achieve, then face validity
will be more likely to be perceived.

Validity is a complex concept, yet it is indispensable to the teacher’s understanding of what makes a
good test. If in your language teaching you can attend to the practicality, reliability, and validity of tests of
language, whether those tests are classroom tests related to a part of a lesson, final exams, or proficiency
tests, then you are well on the way to making accurate judgments about the competence of the learners
with whom you are working.

AUTHENTICITY

A fourth major principle of language testing is authenticity, a concept that is a little slippery to define,
especially within the art and science of evaluating and designing tests. Bachman and Palmer (1996, p. 23)
define authenticity as “the degree of correspondence of the characteristics of a given language test task to
the features of a target language task,” and then suggest an agenda for identifying those target language
tasks and for transforming them into valid test items.

Essentially, when you make a claim for authenticity in a test task, you are saying that this task is likely
to be enacted in the “real world.” Many test item types fail to simulate real-world tasks. They may be
contrived or artificial in their attempt to target a grammatical form or a lexical item. The sequencing of
items that bear no relationship to one another lacks authenticity. One does not have to look very long to
find reading comprehension passages in proficiency tests that do not reflect a real-world passage.

In a test, authenticity may be present in the following ways:

• The language in the test is as natural as possible.

• Items are contextualized rather than isolated.

• Topics are meaningful (relevant, interesting) for the learner.

• Some thematic organization to items is provided, such as through a story line or episode.

• Tasks represent, or closely approximate, real-world tasks.

The authenticity of test tasks in recent years has increased noticeably. Two or three decades ago,
unconnected, boring, contrived items were accepted as a necessary component of testing. Things have
changed. It was once assumed that large-scale testing could not include performance of the productive
skills and stay within budgetary constraints, but now many such tests offer speaking and writing
components. Reading passages are selected from real-world sources that test-takers are likely to have
encountered or will encounter. Listening comprehension sections feature natural language with
hesitations, white noise, and interruptions. More and more tests offer items that are “episodic” in that they
are sequenced to form meaningful units, paragraphs, or stories.

You are invited to take up the challenge of authenticity in your classroom tests. As we explore many
different types of task in this book, especially in Chapters 6 through 9, the principle of authenticity will be
very much in the forefront.

WASHBACK

A facet of consequential validity, discussed above, is “the effect of testing on teaching and learning”
(Hughes, 2003, p. 1), otherwise known among language-testing specialists as washback. In large-scale
assessment, washback generally refers to the effects the tests have on instruction in terms of how
students prepare for the test.

“Cram” courses and “teaching to the test” are examples of such washback. Another form of washback
that occurs more in classroom assessment is the information that “washes back” to students in the form of
useful diagnoses of strengths and weaknesses. Washback also includes the effects of an assessment on
teaching and learning prior to the assessment itself, that is, on preparation for the assessment. Informal
performance assessment is by nature more likely to have built-in washback effects because the teacher is
usually providing interactive feedback. Formal tests can also have positive washback, but they provide no
washback if the students receive a simple letter grade or a single overall numerical score.

The challenge to teachers is to create classroom tests that serve as learning devices through which
washback is achieved. Students’ incorrect responses can become windows of insight into further work.
Their correct responses need to be praised, especially when they represent accomplishments in a
student’s interlanguage. Teachers can suggest strategies for success as part of their “coaching” role.
Washback enhances a number of basic principles of language acquisition: intrinsic motivation, autonomy,
self-confidence, language ego, interlanguage, and strategic investment, among others. (See PLLT and TBP
for an explanation of these principles.)

One way to enhance washback is to comment generously and specifically on test performance. Many
overworked (and underpaid!) teachers return tests to students with a single letter grade or numerical
score and consider their job done. In reality, letter grades and numerical scores give absolutely no
information of intrinsic interest to the student. Grades and scores reduce a mountain of linguistic and
cognitive performance data to an absurd molehill. At best, they give a relative indication of a formulaic
judgment of performance as compared to others in the class— which fosters competitive, not cooperative,
learning.

With this in mind, when you return a written test or a data sheet from an oral production test, consider
giving more than a number, grade, or phrase as your feedback. Even if your evaluation is not a neat little
paragraph appended to the test, you can respond to as many details throughout the test as time will
permit. Give praise for strengths—the “good stuff”—as well as constructive criticism of weaknesses. Give
strategic hints on how a student might improve certain elements of performance. In other words, take
some time to make the test performance an intrinsically motivating experience from which a student will
gain a sense of accomplishment and challenge.

A little bit of washback may also help students through a specification of the numerical scores on the
various subsections of the test. A subsection on verb tenses, for example, that yields a relatively low score
may serve the diagnostic purpose of showing the student an area of challenge.
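
As an illustration of that diagnostic use of subsection scores, the short sketch below, with invented section names and item results, shows how a teacher might report a breakdown rather than a single total; the low verb-tense ratio is exactly the kind of signal that points a student toward an area of challenge.

# Illustrative sketch (hypothetical sections and results): reporting
# subsection scores so a test can serve a diagnostic purpose.
results = {
    "verb tenses":        [1, 0, 0, 1, 0],   # 1 = correct, 0 = incorrect
    "articles":           [1, 1, 1, 0, 1],
    "question formation": [1, 1, 0, 1, 1],
}

total_correct = total_items = 0
for section, items in results.items():
    correct = sum(items)
    total_correct += correct
    total_items += len(items)
    print(f"{section:20s} {correct}/{len(items)}")
print(f"{'TOTAL':20s} {total_correct}/{total_items}")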

Another viewpoint on washback is achieved by a quick consideration of differences between formative and summative tests, mentioned in Chapter 1. Formative tests, by definition, provide washback in the
form of information to the learner on progress toward goals. But teachers might be tempted to feel that
summative tests, which provide assessment at the end of a course or program, do not need to offer much
in the way of washback. Such an attitude is unfortunate because the end of every language course or
program is always the beginning of further pursuits, more learning, more goals, and more challenges to
face. Even a final examination in a course should carry with it some means for giving washback to
students.

In my courses I never give a final examination as the last scheduled classroom session. I always
administer a final exam during the penultimate session, then complete the evaluation of the exams in
order to return them to students during the last class. At this time, the students receive scores, grades,
and comments on their work, and I spend some of the class session addressing material on which the
students were not completely clear. My summative assessment is thereby enhanced by some beneficial
washback that is usually not expected of final examinations.

Finally, washback also implies that students have ready access to you to discuss the feedback and
evaluation you have given. While you almost certainly have known teachers with whom you wouldn’t dare
argue about a grade, an interactive, cooperative, collaborative classroom nevertheless can promote an
atmosphere of dialogue between students and teachers regarding evaluative judgments. For learning to
continue, students need to have a chance to give feedback on your feedback, to seek clarification of any issues
that are fuzzy, and to set new and appropriate goals for themselves for the days and weeks ahead.

APPLYING PRINCIPLES TO THE EVALUATION OF CLASSROOM TESTS

The five principles of practicality, reliability, validity, authenticity, and washback go a long way toward
providing useful guidelines for both evaluating an existing assessment procedure and designing one on
your own. Quizzes, tests, final exams, and standardized proficiency tests can all be scrutinized through
these five lenses.

Are there other principles that should be invoked in evaluating and designing assessments? The
answer, of course, is yes. Language assessment is an extraordinarily broad discipline with many branches,

interest areas, and issues. The process of designing effective assessment instruments is far too complex
to be reduced to five principles. Good test construction, for example, is governed by research-based rules
of test preparation, sampling of tasks, item design and construction, scoring responses, ethical standards,
and so on. But the five principles cited here serve as an excellent foundation on which to evaluate existing
instruments and to build your own.

We will look at how to design tests in Chapter 3 and at standardized tests in Chapter 4. The questions
that follow here, indexed by the five principles, will help you evaluate existing tests for your own
classroom. It is important for you to remember, however, that the sequence of these questions does not
imply a priority order. Validity, for example, is certainly the most significant cardinal principle of
assessment evaluation. Practicality may be a secondary issue in classroom testing. Or, for a particular
test, you may need to place authenticity as your primary consideration. When all is said and done,
however, if validity is not substantiated, all other considerations may be rendered useless.

1. Are the test procedures practical?

Practicality is determined by the teacher’s (and the students’) time constraints, costs, and
administrative details, and to some extent by what occurs before and after the test. To determine whether
a test is practical for your needs, you may want to use the checklist below.

Practicality checklist

1. Are administrative details clearly established before the test?

2. Can students complete the test reasonably within the set time frame?

3. Can the test be administered smoothly, without procedural "glitches"?

4. Are all materials and equipment ready?

5. Is the cost of the test within budgeted limits?

6. Is the scoring/evaluation system feasible in the teacher's time frame?

7. Are methods for reporting results determined in advance?

As this checklist suggests, after you account for the administrative details of giving a test, you need to
think about the practicality of your plans for scoring the test. In teachers’ busy lives, time often emerges
as the most important factor, one that overrides other considerations in evaluating an assessment. If you
need to tailor a test to fit your own time frame, as teachers frequently do, you need to accomplish this
without damaging the test’s validity and washback. Teachers should, for example, avoid the temptation to
offer only quickly scored multiple-choice selection items that may be neither appropriate nor well-
designed. Everyone knows teachers secretly hate to grade tests (almost as much as students hate to take
them!) and will do almost anything to get through that task as quickly and effortlessly as possible. Yet
good teaching almost always implies an investment of the teacher’s time in giving feedback—comments
and suggestions—to students on their tests.

2. Is the test reliable?

Reliability applies to both the test and the teacher, and at least four sources of unreliability must be
guarded against, as noted in the second section of this chapter. Test and test administration reliability can
be achieved by making sure that all students receive the same quality of input, whether written or
auditory. Part of achieving test reliability depends on the physical context—making sure, for example, that

• every student has a cleanly photocopied test sheet,

• sound amplification is clearly audible to everyone in the room,

• video input is equally visible to all,

• lighting, temperature, extraneous noise, and other classroom conditions are equal (and optimal) for
all students, and

• objective scoring procedures leave little debate about correctness of an answer.

Rater reliability, another common issue in assessments, may be more difficult, perhaps because we
too often overlook this as an issue. Since classroom tests rarely involve two scorers, inter-rater reliability
is seldom an issue. Instead, intra-rater reliability is of constant concern to teachers: What happens to our
fallible concentration and stamina over the period of time during which we are evaluating a test? Teachers
need to find ways to maintain their concentration and stamina over the time it takes to score
assessments. In open-ended response tests, this issue is of paramount importance. It is easy to let
mentally established standards erode over the hours you require to evaluate the test.

Intra-rater reliability for open-ended responses may be enhanced by the following guidelines:

• Use consistent sets of criteria for a correct response.

• Give uniform attention to those sets throughout the evaluation time.

• Read through tests at least twice to check for your consistency.

• If you have made “mid-stream” modifications of what you consider as a correct response, go back
and apply the same standards to all.

• Avoid fatigue by reading the tests in several sittings, especially if the time requirement is a matter of
several hours.
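
One informal way to check the last two guidelines on yourself, sketched here purely as an illustration with invented numbers, is to re-score a small sample of papers after a break and compare the two passes; a large average discrepancy suggests that your standards drifted during the scoring session.

# Illustrative sketch (hypothetical scores): a rough intra-rater consistency check.
first_pass  = [4, 3, 5, 2, 4, 3]   # scores given during the main scoring session (1-5 scale)
second_pass = [4, 2, 5, 3, 4, 4]   # scores given on a later re-read of the same papers

diffs = [abs(a - b) for a, b in zip(first_pass, second_pass)]
mean_abs_diff = sum(diffs) / len(diffs)
agreement = sum(a == b for a, b in zip(first_pass, second_pass)) / len(diffs)

print(f"mean absolute difference: {mean_abs_diff:.2f}")
print(f"exact agreement rate: {agreement:.0%}")
# On a 1-5 scale, a mean difference well under one point and a high agreement
# rate suggest that the rater's standards held reasonably steady.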

3. Does the procedure demonstrate content validity?

The major source of validity in a classroom test is content validity: the extent to which the assessment
requires students to perform tasks that were included in the previous classroom lessons and that directly
represent the objectives of the unit on which the assessment is based. If you have been teaching an
English language class to fifth graders who have been reading, summarizing, and responding to short
passages, and if your assessment is based on this work, then to be content valid, the test needs to include
performance in those skills.

There are two steps to evaluating the content validity of a classroom test.

1. Are classroom objectives identified and appropriately framed? Underlying every good classroom test
are the objectives of the lesson, module, or unit of the course in question. So the first measure of an
effective classroom test is the identification of objectives. Sometimes this is easier said than done. Too
often teachers work through lessons day after day with little or no cognizance of the objectives they seek
to fulfill. Or perhaps those objectives are so poorly framed that determining whether or not they were
accomplished is impossible. Consider the following objectives for lessons, all of which appeared on lesson
plans designed by students in teacher preparation programs:

a. Students should be able to demonstrate some reading comprehension.

b. To practice vocabulary in context.

c. Students will have fun through a relaxed activity and thus enjoy their learning.

d. To give students a drill on the /i/ - /I/ contrast.

e. Students will produce yes/no questions with final rising intonation.

Only the last objective is framed in a form that lends itself to assessment. In (a), the modal should is
ambiguous and the expected performance is not stated. In (b), everyone can fulfill the act of “practicing”;
no standards are stated or implied. For obvious reasons, (c) cannot be assessed. And (d) is really just a
teacher’s note on the type of activity to be used.

Objective (e), on the other hand, includes a performance verb and a specific linguistic target. By
specifying acceptable and unacceptable levels of performance, the goal can be tested. An appropriate test
would elicit an adequate number of samples of student performance, have a clearly framed set of
standards for evaluating the performance (say, on a scale of 1 to 5), and provide some sort of feedback to
the student.

2. Are lesson objectives represented in the form of test specifications? The next content-validity issue
that can be applied to a classroom test centers on the concept of test specifications. Don’t let this word
scare you. It simply means that a test should have a structure that follows logically from the lesson or unit
you are testing. Many tests have a design that

• divides them into a number of sections (corresponding, perhaps, to the objectives that are being
assessed),

• offers students a variety of item types, and

• gives an appropriate relative weight to each section.

Some tests, of course, do not lend themselves to this kind of structure. A test in a course in academic
writing at the university level might justifiably consist of an in-class written essay on a given topic—only
one “item” and one response, in a manner of speaking. But in this case the specs (specifications) would be
embedded in the prompt itself and in the scoring or evaluation rubric used to grade it and give feedback.
We will return to the concept of test specs in the next chapter.
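
As a concrete, though entirely hypothetical, sketch of what such specifications might look like in skeletal form, the sections, item types, counts, and weights below are invented; the point is simply that each section corresponds to an objective and carries an explicit relative weight.

# Hypothetical test specifications: sections mapped to objectives, each with
# an item type, an item count, and a relative weight; the weights sum to 1.0.
specs = [
    {"section": "Listening: short dialogues",  "item_type": "multiple-choice", "items": 10, "weight": 0.30},
    {"section": "Grammar: question formation", "item_type": "fill-in",         "items": 10, "weight": 0.30},
    {"section": "Speaking: guided interview",  "item_type": "oral task",       "items": 1,  "weight": 0.40},
]

assert abs(sum(s["weight"] for s in specs) - 1.0) < 1e-9, "weights should sum to 1"

for s in specs:
    print(f'{s["section"]:32s} {s["item_type"]:16s} items={s["items"]:2d} weight={s["weight"]:.0%}')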

The content validity of an existing classroom test should be apparent in how the objectives of the unit
being tested are represented in the form of the content of items, clusters of items, and item types. Do you
clearly perceive the performance of test-takers as reflective of the classroom objectives? If so, and you
can argue this, content validity has probably been achieved.

4. Is the procedure face valid and “biased for best”?

This question integrates the concept of face validity with the importance of structuring an assessment
procedure to elicit the optimal performance of the student. Students will generally judge a test to be face
valid if

• directions are clear,

• the structure of the test is organized logically,



• its difficulty level is appropriately pitched,

• the test has no “surprises,” and

• timing is appropriate.

A phrase that has come to be associated with face validity is “biased for best,” a term that goes a little beyond how the student views the test to a degree of strategic involvement on the part of student and teacher in preparing for, setting up, and following up on the test itself. According to Swain (1984), to give an assessment procedure that is “biased for best,” a teacher

• offers students appropriate review and preparation for the test,

• suggests strategies that will be beneficial, and

• structures the test so that the best students will be modestly challenged and the weaker students
will not be overwhelmed.

It’s easy for teachers to forget how challenging some tests can be, and so a well-planned testing
experience will include some strategic suggestions on how students might optimize their performance. In
evaluating a classroom test, consider the extent to which before-, during-, and after-test options are
fulfilled.

Test-taking strategies

Before the Test

1. Give students all the information you can about the test: Exactly what will the test cover? Which topics will be the most important? What kind of items will be on it? How long will it be?

2. Encourage students to do a systematic review of material. For example, they should skim the
textbook and other material, outline major points, write down examples.

3. Give them practice tests or exercises, if available.

4. Facilitate formation of a study group, if possible.

5. Caution students to get a good night's rest before the test.

6. Remind students to get to the classroom early.

During the Test

1. After the test is distributed, tell students to look over the whole test quickly in order to get a good
grasp of its different parts.

2. Remind them to mentally figure out how much time they will need for each part.

3. Advise them to concentrate as carefully as possible.

4. Warn students a few minutes before the end of the class period so that they can finish on time,
proofread their answers, and catch careless errors.

After the Test

1. When you return the test, include feedback on specific things the student did well, what he or she
did not do well, and, if possible, the reasons for your comments.

2. Advise students to pay careful attention in class to whatever you say about the test results.

3. Encourage questions from students.

4. Advise students to pay special attention in the future to points on which they are weak.

Keep in mind that what comes before and after the test also contributes to its face validity. Good class
preparation will give students a comfort level with the test, and good feedback—washback—will allow
them to learn from it.

5. Are the test tasks as authentic as possible?

Evaluate the extent to which a test is authentic by asking the following questions:

• Is the language in the test as natural as possible?

• Are items as contextualized as possible rather than isolated?

• Are topics and situations interesting, enjoyable, and/or humorous?

• Is some thematic organization provided, such as through a story line or episode?

• Do tasks represent, or closely approximate, real-world tasks?

Consider the following two excerpts from tests, and the concept of authenticity may become a little
clearer.

Multiple-choice tasks—contextualized

"Going To"

1. What … this weekend?
a. you are going to do
b. are you going to do
c. your gonna do

2. I'm not sure. … anything special?
a. Are you going to do
b. You are going to do
c. Is going to do

3. My friend Melissa and I … a party. Would you like to come?
a. am going to
b. are going to go to
c. go to

4. I'd love to! …
a. What's it going to be?
b. Who's going to be?
c. Where's it going to be?

5. It is … to be at Ruth's house.
a. go
b. going
c. gonna

—Sheila Viotti, from Dave's ESL Café

Multiple-choice tasks—decontextualized

1. There are three countries I would like to visit. One is Italy.
a. The other is New Zealand and other is Nepal.
b. The others are New Zealand and Nepal.
c. Others are New Zealand and Nepal.

2. When I was twelve years old, I used … every day.
a. swimming
b. to swimming
c. to swim

3. When Mr. Brown designs a website, he always creates it …
a. artistically
b. artistic
c. artist

4. Since the beginning of the year, I … at Millennium Industries.
a. am working
b. had been working
c. have been working

5. When Mona broke her leg, she asked her husband … her to work.
a. to drive
b. driving
c. drive

—Brown (2000), New Vistas, Book 4



The sequence of items in the contextualized tasks achieves a modicum of authenticity by contextualizing all the items in a story line. The conversation is one that might occur in the real world, even if with a little less formality. The sequence of items in the decontextualized tasks takes the test-taker into five different topic areas with no context for any. Each sentence is likely to be written or spoken in the real world, but not in that sequence. Given the constraints of a multiple-choice format, on a measure of authenticity I would say the first excerpt is “good” and the second excerpt is only “fair.”

6. Does the test offer beneficial washback to the learner?

The design of an effective test should point the way to beneficial washback. A test that
achieves content validity demonstrates relevance to the curriculum in question and thereby
sets the stage for washback. When test items represent the various objectives of a unit,
and/or when sections of a test clearly focus on major topics of the unit, classroom tests can
serve in a diagnostic capacity even if they aren’t specifically labeled as such.

Other evidence of washback may be less visible from an examination of the test itself.
Here again, what happens before and after the test is critical. Preparation time before the
test can contribute to washback since the learner is reviewing and focusing in a potentially
broader way on the objectives in question. By spending classroom time after the test
reviewing the content, students discover their areas of strength and weakness. Teachers
can raise the washback potential by asking students to use test results as a guide to setting
goals for their future effort. The key is to play down the “Whew, I'm glad that’s over” feeling
that students are likely to have, and play up the learning that can now take place from their
knowledge of the results.

Some of the “alternatives” in assessment referred to in Chapter 1 may also enhance washback from tests. (See also Chapter 10.) Self-assessment may sometimes be an
appropriate way to challenge students to discover their own mistakes. This can be
particularly effective for writing performance: once the pressure of assessment has come
and gone, students may be able to look back on their written work with a fresh eye. Peer
discussion of the test results may also be an alternative to simply listening to the teacher
tell everyone what they got right and wrong and why. Journal writing may offer students a
specific place to record their feelings, what they learned, and their resolutions for future
effort.

The five basic principles of language assessment were expanded here into six essential
questions you might ask yourself about an assessment. As you use the principles and the
guidelines to evaluate various forms of tests and procedures, be sure to allow each one of
the five to take on greater or lesser importance, depending on the context. In large-scale
standardized testing, for example, practicality is usually more important than washback, but
the reverse may be true of a number of classroom tests. Validity is of course always the
final arbiter. And remember, too, that these principles, important as they are, are not the
only considerations in evaluating or making an effective test. Leave some space for other
factors to enter in.

In the next chapter, the focus is on how to design a test. These same five principles
underlie test construction as well as test evaluation, along with some new facets that will
expand your ability to apply principles to the practicalities of language assessment in your
own classroom.

EXERCISES

[Note: (I) Individual work; (G) Group or pair work; (C) Whole-class discussion.]

1. (I/C) Review the five basic principles of language assessment that are defined and
explained in this chapter. Be sure to differentiate among several types of evidence that
support the validity of a test, as well as four kinds of reliability.

2. (G) A checklist for gauging practicality is provided earlier in this chapter. In your group,
construct a similar checklist for either face validity, authenticity, or washback, as assigned
to your group. Present your lists to the class and, in the case of multiple groups, synthesize
findings into one checklist for each principle.

3.(I/C) Do you think that consequential and face validity are appropriate considerations
in classroom-based assessment? Explain.

4. (G) In the section on washback, it is stated that “Washback enhances a number of basic principles of language acquisition: intrinsic motivation, autonomy, self-confidence, language ego, interlanguage, and strategic investment, among others.” In a group, discuss the connection between washback and the above-named general principles of language learning and teaching. Come up with some specific examples for each. Report your examples to the rest of the class.

5. (I/C) Washback is described here as a positive effect. Can tests provide negative
washback? Explain.

6. (G) In a small group, decide how you would evaluate each of the assessment scenarios described in the chart below, according to the six factors listed there. Fill in the chart with 5-4-3-2-1 scores, with 5 indicating that the principle is highly fulfilled and 1 indicating very low or no fulfillment. Use your best intuition to supply these evaluations, even though you don’t have complete information on each context. Report your group’s findings to the rest of the class and compare.

Factors to rate for each scenario: Practicality, Rater reliability, Test reliability, Content validity, Face validity, Authenticity

Scenario 1: Standardized multiple-choice proficiency test, no oral or written production. S receives a report form listing a total score and part scores for listening, grammar, proofreading, and reading comprehension.

Scenario 2: Timed impromptu test of written English (TWE). S receives a report form listing one holistic score ranging between 0 and 6.

Scenario 3: One-on-one oral interview to assess overall oral production ability. S receives one holistic score ranging between 0 and 5.

Scenario 4: Multiple-choice listening quiz provided by a textbook with taped prompts, covering the content of a three-week module of a course. S receives a total score from T with no indication of which items were correct/incorrect.

Scenario 5: S is given a sheet with 10 vocabulary items and directed to write 10 sentences using each word. T marks each item as acceptable/unacceptable, and S receives the test sheet back with items marked and a total score ranging from 0 to 10.

Scenario 6: S gives a 5-minute prepared oral presentation in class. T evaluates by filling in a rating sheet indicating S's success in delivery, rapport, pronunciation, grammar, and content.

Scenario 7: S listens to a 15-minute video lecture and takes notes. T makes individual comments on each S's notes.

Scenario 8: S writes a take-home (overnight) one-page essay on an assigned topic. T reads paper and comments on organization and content only, and returns essay to S for a subsequent draft.

Scenario 9: S creates multiple drafts of a three-page essay, peer- and T-reviewed, and turns in a final version. T comments on grammatical/rhetorical errors only, and returns it to S.

Scenario 10: S assembles a portfolio of materials over a semester-long course. T conferences with S on the portfolio at the end of the semester.

Scenario 11: S writes a dialogue journal over the course of a semester. T comments on entries every two weeks.

7. (G) The discussion of content validity in this chapter stresses the importance of stating objectives in terms of performance verbs that can be observed and assessed. In pairs, write two or three other potential lesson
objectives (addressing a proficiency level and skill area as assigned to your pair) that you
think are effective. Present them to the rest of the class for analysis and evaluation.

8. (I/G) In an accessible language class, ask the teacher to allow you to observe an
assessment procedure that is about to take place (a test, an in-class periodic assessment, a
quiz, etc.). Conduct (a) a brief interview with the teacher before the test, (b) an observation
(if possible) of the actual administration of the assessment, and (c) a short interview with
the teacher after the fact to form your data. Evaluate the effectiveness of the assessment in
terms of (a) the five basic principles of assessment and/or (b) the six steps for test
evaluation described in this chapter. Present your findings either as a written report to your
instructor and/or orally to the class.

FOR YOUR FURTHER READING

Alderson, J. Charles (2001). Language testing and assessment (Part 1). Language Teaching, 34, 213-236.

Alderson, J. Charles (2002). Language testing and assessment (Part 2). Language Teaching, 35, 79-113.

These two highly informative state-of-the-art articles summarize current issues and
controversies in the field of language testing. A comprehensive bibliography is provided at
the end of each part. Part 1 covers such issues as ethics and politics in language testing,
standards-based assessment, computer-based testing, self-assessment, and other
alternatives in testing. Part 2 focuses on assessment of the skills of reading, listening,
speaking, and writing, along with grammar and vocabulary.

Hughes, Arthur. (2003). Testing for language teachers. Second Edition. Cambridge:
Cambridge University Press.

A widely used training manual for teachers, Hughes’s book contains useful information
on basic principles and techniques for testing language across the four skills. The chapters
on validity, reliability, and washback (“backwash”) provide quick alternatives to the
definitions of the same terms used in this book.

CHAPTER 3: DESIGNING CLASSROOM LANGUAGE TESTS


The previous chapters introduced a number of building blocks for designing language
tests. You now have a sense of where tests belong in the larger domain of assessment. You
have sorted through differences between formal and informal tests, formative and
summative tests, and norm- and criterion-referenced tests. You have traced some of the
historical lines of thought in the field of language assessment. You have a sense of major
current trends in language assessment, especially the present focus on communicative and
process-oriented testing that seeks to transform tests from anguishing ordeals into
challenging and intrinsically motivating learning experiences. By now, certain foundational
principles have entered your vocabulary: practicality, reliability, validity, authenticity, and
washback. And you should now possess a few tools with which you can evaluate the
effectiveness of a classroom test.

In this chapter, you will draw on those foundations and tools to begin the process of
designing tests or revising existing tests. To start that process, you need to ask some
critical questions:

1. What is the purpose of the test? Why am I creating this test or why was it created by
someone else? For an evaluation of overall proficiency? To place students into a course? To
measure achievement within a course? Once you have established the major purpose of a
test, you can determine its objectives.

2. What are the objectives of the test? What specifically am I trying to find out?
Establishing appropriate objectives involves a number of issues, ranging from relatively
simple ones about forms and functions covered in a course unit to much more complex ones
about constructs to be operationalized in the test. Included here are decisions about what
language abilities are to be assessed.

3. How will the test specifications reflect both the purpose and the objectives? To
evaluate or design a test, you must make sure that the objectives are incorporated into a
structure that appropriately weights the various competencies being assessed. (These first
three questions all center, in one way or another, on the principle of validity.)

4. How will the test tasks be selected and the separate items arranged? The tasks that the test-takers must perform need to be practical in the ways defined in the previous chapter. They should also achieve content validity by presenting tasks that
mirror those of the course (or segment thereof) being assessed. Further, they should be
able to be evaluated reliably by the teacher or scorer. The tasks themselves should strive
for authenticity, and the progression of tasks ought to be biased for best performance.

5. What kind of scoring, grading, and/or feedback is expected? Tests vary in the form
and function of feedback, depending on their purpose. For every test, the way results are
reported is an important consideration. Under some circumstances a letter grade or a
holistic score may be appropriate; other circumstances may require that a teacher offer
substantive washback to the learner.

These five questions should form the basis of your approach to designing tests for your
classroom.

TEST TYPES

The first task you will face in designing a test for your students is to determine the
purpose for the test. Defining your purpose will help you choose the right kind of test, and it
will also help you to focus on the specific objectives of the test. We will look first at two test
types that you will probably not have many opportunities to create as a classroom teacher—
language aptitude tests and language proficiency tests—and three types that you will almost
certainly need to create—placement tests, diagnostic tests, and achievement tests.

Language Aptitude Tests

One type of test—although admittedly not a very common one—predicts a person’s success prior to exposure to the second language. A language aptitude test is designed to
measure capacity or general ability to learn a foreign language and ultimate success in that
undertaking. Language aptitude tests are ostensibly designed to apply to the classroom
learning of any language.

Two standardized aptitude tests have been used in the United States: the Modern
Language Aptitude Test (MLAT) (Carroll & Sapon, 1958) and the Pimsleur Language
Aptitude Battery (PLAB) (Pimsleur, 1966). Both are English language tests and require
students to perform a number of language-related tasks. The MLAT, for example, consists of
five different tasks.

Tasks in the Modern Language Aptitude Test

1. Number learning: Examinees must learn a set of numbers through aural input and
then discriminate different combinations of those numbers.

2. Phonetic script: Examinees must learn a set of correspondences between speech sounds and phonetic symbols.

3. Spelling clues: Examinees must read words that are spelled somewhat phonetically,
and then select from a list the one word whose meaning is closest to the "disguised" word.

4. Words in sentences: Examinees are given a key word in a sentence and are then
asked to select a word in a second sentence that performs the same grammatical function
as the key word.

5. Paired associates: Examinees must quickly learn a set of vocabulary words from
another language and memorize their English meanings.
More information on the MLAT may be obtained from the following website:
http://www.2lti.com/mlat.htm#2.

The MLAT and PLAB show some significant correlations with ultimate performance of
students in language courses (Carroll, 1981). Those correlations, however, presuppose a
foreign language course in which success is measured by similar processes of mimicry,
memorization, and puzzle-solving. There is no research to show unequivocally that those
kinds of tasks predict communicative success in a language, especially untutored acquisition
of the language.

Because of this limitation, standardized aptitude tests are seldom used today. Instead,
attempts to measure language aptitude more often provide learners with information about
their preferred styles and their potential strengths and weaknesses, with follow-up
strategies for capitalizing on the strengths and overcoming the weaknesses. Any test that
claims to predict success in learning a language is undoubtedly flawed because we now
know that with appropriate self-knowledge, active strategic involvement in learning, and/or
strategies-based instruction, virtually everyone can succeed eventually. To pigeon-hole
learners a priori, before they have even attempted to learn a language, is to presuppose
failure or success without substantial cause. (A further discussion of language aptitude can
be found in PLLT, Chapter 4.)

Proficiency Tests

If your aim is to test global competence in a language, then you are, in conventional
terminology, testing proficiency. A proficiency test is not limited to any one course,
curriculum, or single skill in the language; rather, it tests overall ability. Proficiency tests
have traditionally consisted of standardized multiple-choice items on grammar, vocabulary,
reading comprehension, and aural comprehension. Sometimes a sample of writing is added,
and more recent tests also include oral production performance.

As noted in the previous chapter, such tests often have content validity weaknesses, but
several decades of construct validation research have brought us much closer to
constructing successful communicative proficiency tests.

Proficiency tests are almost always summative and norm-referenced. They provide
results in the form of a single score (or at best two or three subscores, one for each section
of a test), which is a sufficient result for the gate-keeping role they play of accepting or
denying someone passage into the next stage of a journey. And because they measure
performance against a norm, with equated scores and percentile ranks taking on paramount
importance, they are usually not equipped to provide diagnostic feedback.

A typical example of a standardized proficiency test is the Test of English as a Foreign Language (TOEFL®) produced by the Educational Testing Service. The TOEFL is used by
more than a thousand institutions of higher education in the United States as an indicator of
a prospective student’s ability to undertake academic work in an English-speaking milieu.
The TOEFL consists of sections on listening comprehension, structure (or grammatical
accuracy), reading comprehension, and written expression. The new computer-scored
TOEFL announced for 2005 will also include an oral production component. With the
exception of its writing section, the TOEFL (as well as many other large-scale proficiency
tests) is machine-scorable for rapid turnaround and cost effectiveness (that is, for reasons
of practicality). Research is in progress (Bernstein et al., 2000) to determine, through the
technology of speech recognition, if oral production performance can be adequately
machine-scored. (Chapter 4 provides a comprehensive look at the TOEFL and other
standardized tests.)

A key issue in testing proficiency is how the constructs of language ability are specified.
The tasks that test-takers are required to perform must be legitimate samples of English
language use in a defined context. Creating these tasks and validating them with research is
a time-consuming and costly process. Language teachers would be wise not to create an
overall proficiency test on their own. A far more practical method is to choose one of a
number of commercially available proficiency tests.

Placement Tests

Certain proficiency tests can act in the role of placement tests, the purpose of which is
to place a student into a particular level or section of a language curriculum or school. A
placement test usually, but not always, includes a sampling of the material to be covered in
the various courses in a curriculum; a student’s performance on the test should indicate the
point at which the student will find material neither too easy nor too difficult but
appropriately challenging.

The English as a Second Language Placement Test (ESLPT) at San Francisco State
University has three parts. In Part I, students read a short article and then write a summary
essay. In Part II, students write a composition in response to an article. Part III is multiple-
choice: students read an essay and identify grammar errors in it. The maximum time
allowed for the test is three hours. Justification for this three-part structure rests largely on
the test’s content validation. Most of the ESL courses at San Francisco State involve a
combination of reading and writing, with a heavy emphasis on writing. The first part of the
test acts as both a test of reading comprehension and a test of writing (a summary). The
second part requires students to state opinions and to back them up, a task that forms a
major component of the writing courses. Finally, proofreading drafts of essays is a useful
academic skill, and the exercise in error detection simulates the proofreading process.

Teachers and administrators in the ESL program at SFSU are satisfied with this test’s
capacity to discriminate appropriately, and they feel that it is a more authentic test than its
multiple-choice, discrete-point, grammar-vocabulary predecessor. The practicality of the
ESLPT is relatively low: human evaluators are required for the first two parts, a process
more costly in both time and money than running the multiple-choice Part III responses through a pre-programmed scanner. Reliability problems are also present but are mitigated by conscientious training of all evaluators of the test. What is lost
in practicality and reliability is gained in the diagnostic information that the ESLPT provides.
Statistical analysis of errors in the multiple-choice section furnishes data on each student’s
grammatical and rhetorical areas of difficulty, and the essay responses are available to
teachers later as a preview of their students’ writing.
Placement tests come in many varieties: assessing comprehension and production,
responding through written and oral performance, open-ended and limited responses,
selection (e.g., multiple-choice) and gap-filling formats, depending on the nature of a
program and its needs. Some programs simply use existing standardized proficiency tests
because of their obvious advantage in practicality—cost, speed in scoring, and efficient
reporting of results. Others prefer the performance data available in more open-ended
written and/or oral production. The ultimate objective of a placement test is, of course, to
correctly place a student into a course or level. Secondary benefits to consider include face
validity, diagnostic information on students’ performance, and authenticity.

In a recent one-month special summer program in English conversation and writing at San Francisco State University, 30 students were to be placed into one of two sections. The
ultimate objective of the placement test (consisting of a five-minute oral interview and an
essay-writing task) was to find a performance-based means to divide the students evenly
into two sections. This objective might have been achieved easily by administering a simple
grid-scorable multiple-choice grammar-vocabulary test. But the interview and writing
sample added some important face validity, gave a more personal touch in a small program,
and provided some diagnostic information on a group of learners about whom we knew very
little prior to their arrival on campus.

Diagnostic Tests

A diagnostic test is designed to diagnose specified aspects of a language. A test in pronunciation, for example, might diagnose the phonological features of English that are
difficult for learners and should therefore become part of a curriculum. Usually, such tests
offer a checklist of features for the administrator (often the teacher) to use in pinpointing
difficulties. A writing diagnostic would elicit a writing sample from students that would allow
the teacher to identify those rhetorical and linguistic features on which the course needed to
focus special attention.

Diagnostic and placement tests, as we have already implied, may sometimes be indistinguishable from each other. The San Francisco State ESLPT serves dual purposes. Any
placement test that offers information beyond simply designating a course level may also
serve diagnostic purposes.

There is also a fine line of difference between a diagnostic test and a general
achievement test. Achievement tests analyze the extent to which students have acquired
language features that have already been taught; diagnostic tests should elicit information
on what students need to work on in the future. Therefore, a diagnostic test will typically
offer more detailed subcategorized information on the learner. In a curriculum that has a
form-focused phase, for example, a diagnostic test might offer information about a learner's
acquisition of verb tenses, modal auxiliaries, definite articles, relative clauses, and the like.

A typical diagnostic test of oral production was created by Clifford Prator (1972) to
accompany a manual of English pronunciation. Test-takers are directed to read a 150-word
passage while they are tape-recorded. The test administrator then refers to an inventory of
phonological items for analyzing a learner’s production. After multiple listenings, the
administrator produces a checklist of errors in five separate categories, each of which has
several subcategories. The main categories include

1. stress and rhythm,


2. intonation,

3. vowels,

4. consonants, and

5. other factors.

An example of subcategories is shown in this list for the first category (stress and
rhythm):

a. stress on the wrong syllable (in multi-syllabic words)

b. incorrect sentence stress

c. incorrect division of sentences into thought groups

d. failure to make smooth transitions between words or syllables

(Prator, 1972)

Each subcategory is appropriately referenced to a chapter and section of Prator’s manual. This information can help teachers make decisions about aspects of English
phonology on which to focus. This same information can help a student become aware of
errors and encourage the adoption of appropriate compensatory strategies.

Achievement Tests

An achievement test is related directly to classroom lessons, units, or even a total curriculum. Achievement tests are (or should be) limited to particular material addressed in
a curriculum within a particular time frame and are offered after a course has focused on
the objectives in question. Achievement tests can also serve the diagnostic role of indicating what a student needs to continue to work on in the
future, but the primary role of an achievement test is to determine whether course
objectives have been met and appropriate knowledge and skills acquired by the end of a
period of instruction.

Achievement tests are often summative because they are administered at the end of a
unit or term of study. They also play an important formative role. An effective achievement
test will offer washback about the quality of a learner’s performance in subsets of the unit
or course. This washback contributes to the formative nature of such tests.

The specifications for an achievement test should be determined by

• the objectives of the lesson, unit, or course being assessed,

• the relative importance (or weight) assigned to each objective,

• the tasks employed in classroom lessons during the unit of time,

• practicality issues, such as the time frame for the test and turnaround time, and

• the extent to which the test structure lends itself to formative washback.
Achievement tests range from five- or ten-minute quizzes to three-hour final
examinations, with an almost infinite variety of item types and formats. Here is the outline
for a midterm examination offered at the high-intermediate level of an intensive English
program in the United States. The course focus is on academic reading and writing; the
structure of the course and its objectives may be implied from the sections of the test.

Midterm examination outline, high-intermediate

Section A. Vocabulary

Part 1 (5 Items): match words and definitions

Part 2 (5 items): use the word in a sentence

Section B. Grammar

(10 sentences): error detection (underline or circle the error)

Section C. Reading comprehension

(2 one-paragraph passages): four short-answer items for each

Section D. Writing

respond to a two-paragraph article on Native American culture

SOME PRACTICAL STEPS TO TEST CONSTRUCTION

The descriptions of types of tests in the preceding section are intended to help you
"understand how to answer the first question posed in this chapter: What is the purpose of
the test? It is unlikely that you would be asked to design an aptitude test or a proficiency
test, but for the purposes of interpreting those tests, it is important that you understand
their nature. However, your opportunities to design placement, diagnostic, and achievement
tests—especially the latter—will be plentiful. In the remainder of this chapter, we will
explore the four remaining questions posed at the outset, and the focus will be on equipping
you with the tools you need to create such classroom-oriented tests.

You may think that every test you devise must be a wonderfully innovative instrument
that will garner the accolades of your colleagues and the admiration of your students. Not
so. First, new and innovative testing formats take a lot of effort to design and a long time to
refine through trial and error. Second, traditional testing techniques can, with a little
creativity, conform to the spirit of an interactive, communicative language curriculum. Your
best tack as a new teacher is to work within the guidelines of accepted, known, traditional
testing techniques. Slowly, with experience, you can get bolder in your attempts. In that
spirit, then, let us consider some practical steps in constructing classroom tests.

Assessing clear, Unambiguous Objectives

In addition to knowing the purpose of the test you’re creating, you need to know as
specifically as possible what it is you want to test. Sometimes teachers give tests simply
because it’s Friday of the third week of the course, and after hasty glances at the chapter(s)
covered during those three weeks, they dash off some test items so that students will have
something to do during the class. This is no way to approach a test. Instead, begin by
taking a careful look at everything that you think your students should “know” or be able to
“do,” based on the material that the students are responsible for. In other words, examine
the objectives for the unit you are testing.

Remember that every curriculum should have appropriately framed assessable objectives, that is, objectives that are stated in terms of overt performance by students (see Chapter 2, page 32). Thus, an objective that states “Students will learn tag questions” or simply names the grammatical focus “Tag questions” is not testable. You don’t know whether students should be able to understand
them in spoken or written language, or whether they should be able to produce them orally
or in writing. Nor do you know in what context (a conversation? an essay? an academic
lecture?) those linguistic forms should be used. Your first task in designing a test, then, is to
determine appropriate objectives.

If you’re lucky, someone will have already stated those objectives clearly in performance
terms. If you’re a little less fortunate, you may have to go back through a unit and
formulate them yourself. Let’s say you have been teaching a unit in a low-intermediate integrated-skills class with an emphasis on social conversation, and involving some reading and writing, that includes the objectives outlined below, either stated already or as you
have reframed them. Notice that each objective is stated in terms of the performance
elicited and the target linguistic domain.

Selected objectives for a unit in a low-intermediate integrated-skills course

Form-focused objectives (listening and speaking)

Students will

1. recognize and produce tag questions, with the correct grammatical form and
final intonation pattern, in simple social conversations.

2. recognize and produce Wh- information questions with correct final intonation
pattern.

Communication skills (speaking)

Students will
3. state completed actions and events in a social conversation.

4. ask for confirmation in a social conversation.

5. give opinions about an event in a social conversation.

6. produce language with contextually appropriate intonation, stress, and rhythm.

Reading skills (simple essay or story)

Students will

7. recognize irregular past tense of selected verbs in a story or essay.

Writing skills (simple essay or story)

Students will

8. write a one-paragraph story about a simple event in the past.

9. use conjunctions so and because in a statement of opinion.

You may find, in reviewing the objectives of a unit or a course, that you cannot possibly
test each one. You will then need to choose a possible subset of the objectives to test.

Drawing up Test Specifications

Test specifications for classroom use can be a simple and practical outline of your test.
(For large-scale standardized tests [see Chapter 4] that are intended to be widely
distributed and therefore are broadly generalized, test specifications are much more formal
and detailed.) In the unit discussed above, your specifications will simply comprise (a) a
broad outline of the test, (b) what skills you will test, and (c) what the items will look like.
Let’s look at the first two in relation to the midterm unit assessment already referred to
above.

(a) Outline of the test and (b) skills to be included. Because of the constraints of your
curriculum, your unit test must take no more than 30 minutes. This is an integrated
curriculum, so you need to test all four skills. Since you have the luxury of teaching a small
class (only 12 students!), you decide to include an oral production component in the
preceding period (taking students one by one into a separate room while the rest of the
class reviews the unit individually and completes workbook exercises). You can therefore
test oral production objectives directly at that time. You determine that the 30-minute test
will be divided equally in time among listening, reading, and writing.

(c) Item types and tasks. The next and potentially more complex choices involve the
item types and tasks to use in this test. It is surprising that there are a limited number of
modes of eliciting responses (that is, prompting) and of responding on tests of any kind.
Consider the options: the test prompt can be oral (student listens) or written (student
reads), and the student can respond orally or in writing. It’s that simple. But some
complexity is added when you realize that the types of prompts in each case vary widely,
and within each response mode, of course, there are a number of options, all of which are
depicted in Figure 3.1.

Elicitation mode:

Oral (student listens): word, pair of words; sentence(s), question; directions; monologue, speech; pre-recorded conversation; interactive (live) dialogue

Written (student reads): word, set of words; sentence(s), question; directions; paragraph; essay, excerpt; short story, book

Response mode (oral or written, under either elicitation mode):

Oral: repeat; read aloud; yes / no; short response; describe; role play; monologue (speech); interactive dialogue

Written: mark multiple-choice option; fill in the blank; spell a word; define a term (with a phrase); short answer (2-3 sentences); essay

Figure 3.1. Elicitation and response modes in test construction

Granted, not all of the response modes correspond to all of the elicitation modes. For
example, it is unlikely that directions would be read aloud, nor would spelling a word be
matched with a monologue. A modicum of intuition will eliminate these non sequiturs.

Armed with a number of elicitation and response formats, you have decided to design
your specs as follows, based on the objectives stated earlier:

Speaking (5 minutes per person, previous day)

Format: oral interview, T and S
Task: T asks questions of S (objectives 3, 5; emphasis on 6)

Listening (10 minutes)

Format: T makes audiotape in advance, with one other voice on it
Tasks: a. 5 minimal pair items, multiple-choice (objective 1)
       b. 5 interpretation items, multiple-choice (objective 2)

Reading (10 minutes)

Format: cloze test items (10 total) in a story line
Task: fill-in-the-blanks (objective 7)

Writing (10 minutes)

Format: prompt for a topic: why I liked/didn't like a recent TV sitcom
Task: writing a short opinion paragraph (objective 9)

These informal, classroom-oriented specifications give you an indication of

• the implied elicitation and response formats for items,

• the number of items in each section, and

• the time to be allocated for each.

Notice that three of the six possible speaking objectives are not directly tested. This
decision may be based on the time you devoted to these objectives, but more likely on the
feasibility of testing that objective or simply on the finite number of minutes available to
administer the test. Notice, too, that objectives 4 and 8 are not assessed. Finally, notice
that this unit was mainly focused on listening and speaking, yet 20 minutes of the 35-
minute test is devoted to reading and writing tasks. Is this an appropriate decision?

One more test spec that needs to be included is a plan for scoring and assigning relative
weight to each section and each item within. This issue will be addressed later in this
chapter when we look at scoring, grading, and feedback.

Devising Test Tasks

Your oral interview comes first, and so you draft questions to conform to the accepted
pattern of oral interviews (see Chapter 7 for information on constructing oral interviews).
You begin and end with no scored items (warm-up and wind-down) designed to set students
at ease, and then sandwich between them items intended to test the objective (level check)
and a little beyond (probe).

Oral interview format

A. Warm-up: questions and comments

B. Level-check questions (objectives 3, 5, and 6)

1. Tell me about what you did last weekend.

2. Tell me about an interesting trip you took in the last year.

3. How did you like the TV show we saw this week?

C. Probe (objectives 5, 6)

1. What is your opinion about ? (news event)

2. How do you feel about ? (another news event)

D. Wind-down: comments and reassurance

You are now ready to draft other test items. To provide a sense of authenticity and
interest, you have decided to conform your items to the context of a recent TV sitcom that
you used in class to illustrate certain discourse and form-focused factors. The sitcom
depicted a loud, noisy party with lots of small talk. As you devise your test items, consider
such factors as how students will perceive them (face validity), the extent to which
authentic language and contexts are present, potential difficulty caused by cultural
schemata, the length of the listening stimuli, how well a story line comes across, how things
like the cloze testing format will work, and other practicalities.

Let’s say your first draft of items produces the following possibilities within each section:
Test items, first draft

Listening, part a. (sample Item)

Directions: Listen to the sentence [on the tape]. Choose the sentence on
your test page that is closest in meaning to the sentence you heard.

Voice: They sure made a mess at that party, didn’t they?

S reads: a. They didn’t make a mess, did they?

b. They did make a mess, didn’t they?

Listening, part b. (sample item)

Directions: Listen to the question [on the tape]. Choose the sentence on
your test page that is the best answer to the question.

Voice: Where did George go after the party last night?

S reads: a. Yes, he did.

b. Because he was tired

c. To Elaine’s place for another party.

d. He went home around eleven o'clock.

Reading (sample items)

Directions: Fill in the correct tense of the verb (in parentheses) that
should go in each blank.

Then, in the middle of this loud party they (hear) … the loudest thunder
you have ever heard! And then right away lightning (strike) … right outside
their house!

Writing

Directions: Write a paragraph about what you liked or didn't like about
one of the characters at the party in the TV sitcom we saw.

As you can see, these items are quite traditional. You might self-critically admit that the
format of some of the items is contrived, thus lowering the level of authenticity. But the
thematic format of the section, the authentic language within each item, and the
contextualization add face validity, interest, and some humor to what might otherwise be a
mundane test. All four skills are represented, and the tasks are varied within the 30 minutes
of the test.

In revising your draft, you will want to ask yourself some important questions:

1. Are the directions to each section absolutely clear?

2. Is there an example item for each section?

3. Does each item measure a specified objective?


4. Is each item stated in clear, simple language?

5. Does each multiple-choice item have appropriate distractors; that is, are the wrong items clearly wrong and yet sufficiently “alluring” that they aren’t ridiculously easy? (See below for a primer on creating effective distractors.)

6. Is the difficulty of each item appropriate for your student?

7. Is the language of each item sufficiently authentic?

8. Does the sum of the items and the test as a whole adequately reflect the learning objectives?

In the current example that we have been analyzing, your revising process is likely to
result in at least four changes or additions:

1. In both the interview and the writing sections, you recognize that a scoring rubric will be essential. For the interview, you decide to create a holistic scale (see Chapter 7), and for the writing section you devise a simple analytic scale (see Chapter 9) that captures only the objectives you have focused on.

2. In the interview questions, you realize that follow-up questions may be needed for students who give one-word or very short answers.

3. In the listening section, part b, you intend choice “c” as the correct answer,
but you realize that choice “d” is also acceptable. You need an answer that is
unambiguously incorrect. You shorten it to “d, Around eleven o’clock”. You also note
that providing the prompts for this section on an audio recording will be logistically
difficult, and so you opt to read these items to your students.

4. In the writing prompt, you can see how some students would not use the
words so or because, which were in your objectives, so you reword the prompt:
“Name one of the characters at the party in the TV sitcom we saw. Then, use the
word so at least once and the word because at least once to tell why you liked or
didn’t like that person.”

Ideally, you would try out all your tests on students not in your class before actually
administering the tests. But in our daily classroom teaching, the tryout phase is almost
impossible. Alternatively, you could enlist the aid of a colleague to look over your test. And
so you must do what you can to bring to your students an instrument that is, to the best of
your ability, practical and reliable.

In the final revision of your test, imagine that you are a student taking the test. Go
through each set of directions and all items slowly and deliberately. Time yourself. (Often
we underestimate the time students will need to complete a test.) If the test should be
shortened or lengthened, make the necessary adjustments. Make sure your test is neat and
uncluttered on the page, reflecting all the care and precision you have put into its
construction. If there is an audio component, as there is in our hypothetical test, make sure
that the script is clear, that your voice and any other voices are clear, and that the audio
equipment is in working order before starting the test.

Designing Multiple-Choice Test Items


In the sample achievement test above, two of the five components (both of the listening
sections) specified a multiple-choice format for items. This was a bold step to take. Multiple-
choice items, which may appear to be the simplest kind of item to construct, are extremely
difficult to design correctly. Hughes (2003, pp. 76-78) cautions against a number of
weaknesses of multiple-choice items:

• The technique tests only recognition knowledge.

• Guessing may have a considerable effect on test scores.

• The technique severely restricts what can be tested.

• It is very difficult to write successful items.

• Washback may be harmful.

• Cheating may be facilitated.

The two principles that stand out in support of multiple-choice formats are, of course,
practicality and reliability. With their predetermined correct responses and time-saving
scoring procedures, multiple-choice items offer overworked teachers the tempting possibility
of an easy and consistent process of scoring and grading. But is the preparation phase
worth the effort? Sometimes it is, but you might spend even more time designing such
items than you save in grading the test. Of course, if your objective is to design a large-
scale standardized test for repeated administrations, then a multiple-choice format does
indeed become viable.

First, a primer on terminology.

1. Multiple-choice items are all receptive, or selective, response items in that the test-taker chooses from a set of responses rather than creating a response (the latter commonly called a supply type of response). Other receptive item types include true-false questions and
matching lists. (In the discussion here, the guidelines apply primarily to multiple-choice
item types and not necessarily to other receptive types.)

2. Every multiple-choice item has a stem, which presents a stimulus, and several
(usually between three and five) options or alternatives to choose from.

3. One of those options, the key, is the correct response, while the others serve as
distractors.

Since there will be occasions when multiple-choice items are appropriate, consider the
following four guidelines for designing multiple-choice items for both classroom-based and
large-scale situations (adapted from Gronlund, 1998, pp. 60-75, and J. D. Brown, 1996, pp.
54-57).

1. Design each item to measure a specific objective.

Consider this item introduced, and then revised, in the sample test above:

Multiple-choice item, revised

Voice: Where did George go after the party last night?


S reads: a. Yes, he did.

b. Because he was tired.

c. To Elaine’s place for another party.

d. Around eleven o’clock.

The specific objective being tested here is comprehension of wh-questions. Distractor (a) is designed to ascertain that the student knows the difference between an answer to a wh-question and a yes/no question. Distractors (b) and (d), as well as the key item (c), test comprehension of the meaning of where as opposed to why and when. The objective has been directly addressed.

On the other hand, here is an item that was designed to test recognition of the correct
word order of indirect questions.

Multiple-choice item, flawed

Excuse me, do you know…?

a. where is the post office

b. where the post office is

c. where post office is

This item is flawed because it tests more than one objective at once. The key is (b), but distractor (c) differs from it only by omitting the article the, so the item also taps knowledge of article usage. A test-taker could therefore miss the item, or answer it correctly, for reasons unrelated to the intended focus on word order in indirect questions. Distractors should differ from the key only in the feature the item is designed to test.

2. State both stem and options as simply and directly as possible.

We are sometimes tempted to make multiple-choice items too wordy. A good rule of
thumb is to get directly to the point. Here’s an example.

Multiple-choice cloze item, flawed

My eyesight has really been deteriorating lately. I wonder if I need glasses. I think I’d
better go to the … to have my eyes checked.

a. pediatrician

b. dermatologist

c. optometrist

You might argue that the first two sentences of this item give it some authenticity and
accomplish a bit of schema setting. But if you simply want a student to identify the type of
medical professional who deals with eyesight issues, those sentences are superfluous.
Moreover, by lengthening the stem, you have introduced a potentially confounding lexical
item, deteriorate, that could distract the student unnecessarily.

Another rule of succinctness is to remove needless redundancy from your options. In the
following item, which were is repeated in all three options. It should be placed in the stem
to keep the item as succinct as possible.

Multiple-choice item, flawed

We went to visit the temples, … fascinating.

a. which were beautiful

b. which were especially

c. which were holy

3. Make certain that the intended answer is clearly the only correct one.

In the proposed unit test described earlier, the following item appeared in its original draft:

Multiple-choice item, flawed

Voice: Where did George go after the party last night?

S reads:

a. Yes, he did.

b. Because he was tired.

c. To Elaine’s place for another party.

d. He went home around eleven o’clock.

A quick consideration of distractor (d) reveals that it is a plausible answer, along with the intended key, (c). Eliminating unintended possible answers is often the most difficult problem of designing multiple-choice items. With only a minimum of context in each stem, a wide variety of responses may be perceived as correct.
4. Use item indices to accept, discard, or revise items.

The appropriate selection and arrangement of suitable multiple-choice items on a test can best be accomplished by measuring items against three indices: item facility (or item difficulty), item discrimination (sometimes called item differentiation), and distractor analysis. Although measuring these factors on classroom tests would be useful, you probably will have neither the time nor the expertise to do this for every classroom test you create, especially one-time tests. But they are a must for standardized norm-referenced tests that are designed to be administered a number of times and/or administered in multiple forms.

1. Item facility (or IF) is the extent to which an item is easy or difficult for the proposed group of test-takers. You may wonder why that is important if in your estimation the item achieves validity. The answer is that an item that is too easy (say, 99 percent of respondents get it right) or too difficult (99 percent get it wrong) really does nothing to separate high-ability and low-ability test-takers. It is not really performing much “work” for you on a test.

IF simply reflects the percentage of students answering the item correctly. The formula
looks like this:

IF = # of Ss answering the item correctly/ Total # of Ss responding to that item

For example, if you have an item on which 13 out of 20 students respond correctly, your
IF index is 13 divided by 20 or .65 (65 percent). There is no absolute IF value that must be
met to determine if an item should be included in the test as is, modified, or thrown out, but
appropriate test items will generally have IFs that range between .15 and .85. Two good
reasons for occasionally including a very easy item (.85 or higher) are to build in some
affective feelings of “success” among lower-ability students and to serve as warm-up items.
And very difficult items can provide a challenge to the highest-ability students.
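
To make the arithmetic concrete, here is a minimal sketch in Python of how a teacher might compute IF for every item on a test. The response data are invented for illustration and are not tied to any particular scoring program.

```python
# Minimal sketch: computing item facility (IF) from a simple record of responses.
# The data below are hypothetical; True means the student answered the item correctly.

responses = [
    [True, False, True],   # student 1
    [True, True, True],    # student 2
    [False, False, True],  # student 3
    [True, False, True],   # student 4
]

def item_facility(responses, item_index):
    """IF = # of students answering the item correctly / total # responding to it."""
    answers = [student[item_index] for student in responses]
    return sum(answers) / len(answers)

for i in range(len(responses[0])):
    print(f"Item {i + 1}: IF = {item_facility(responses, i):.2f}")
# With these invented data: Item 1 = 0.75, Item 2 = 0.25, Item 3 = 1.00.
# By the rule of thumb above, Item 3 (IF = 1.00) is doing no "work" on the test.
```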

2. Item discrimination (ID) is the extent to which an item differentiates between high-
and low-ability test-takers. An item on which high-ability students (who did well in the test)
and low-ability students (who didn’t) score equally well would have poor ID because it did
not discriminate between the two groups. Conversely, an item that garners correct
responses from most of the high-ability group and incorrect responses from most of the
low-ability group has good discrimination power.

Suppose your class of 30 students has taken a test. Once you have calculated final
scores for all 30 students, divide them roughly into thirds—that is, create three rank-
ordered ability groups including the top 10 scores, the middle 10, and the lowest 10. To find out which of your 50 or so test
items were most “powerful” in discriminating between high and low ability, eliminate the
middle group, leaving two groups with results that might look something like this on a
particular item:

Item #23 # Correct # Incorrect

High-ability Ss (top 10) 7 3

Low-ability Ss (bottom 10) 2 8



Using the ID formula (7 – 2 = 5; 5/10 = .50), you would find that this item has an ID of .50, or a moderate level.

The formula for calculating ID is

ID = (high group # correct – low group # correct) / (1/2 x total of your two comparison groups) = (7 – 2) / (1/2 x 20) = 5/10 = .50

The result of this example item tells you that the item has a moderate level of ID. High
discriminating power would approach a perfect 1.0, and no discriminating power at all would
be zero. In most cases, you would want to discard an item that scored near zero. As with
IF, no absolute rule governs the establishment of acceptable and unacceptable ID indices.
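
For readers who want to see the whole procedure (ranking the class, splitting off the top and bottom groups, applying the formula) in one place, here is a minimal sketch in Python. The scores and the group size are hypothetical; a real class of 30 would use groups of 10, as described above.

```python
# Minimal sketch: item discrimination (ID) for a single item, using invented data.
# Each tuple is (student's total test score, whether the student got this item right).

results = [
    (95, True), (91, True), (88, True),    # strongest students
    (80, True), (76, False), (72, True),   # middle group (ignored in the ID formula)
    (65, False), (58, True), (51, False),  # weakest students
]

def item_discrimination(results, group_size):
    """ID = (high-group # correct - low-group # correct) / (1/2 x both groups combined)."""
    ranked = sorted(results, key=lambda r: r[0], reverse=True)
    high = ranked[:group_size]
    low = ranked[-group_size:]
    high_correct = sum(1 for _, correct in high if correct)
    low_correct = sum(1 for _, correct in low if correct)
    return (high_correct - low_correct) / (0.5 * (2 * group_size))

print(f"ID = {item_discrimination(results, group_size=3):.2f}")
# Here the high group answers correctly 3 times and the low group once,
# so ID = (3 - 1) / 3 = 0.67, a reasonably discriminating item.
```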

One clear, practical use for ID indices is to select items from a test bank that includes
more items than you need. You might decide to discard or improve some items with lower
ID because you know they won’t be as powerful an indicator of success on your test.

For most teachers who are using multiple-choice items to create a classroom- based unit
test, juggling IF and ID indices is more a matter of intuition and “art” than a science. Your
best calculated hunches may provide sufficient support for retaining, revising, and
discarding proposed items. But if you are constructing a large-scale test, or one that will be
administered multiple times, these indices are important factors in creating test forms that
are comparable in difficulty. By engaging in a sophisticated procedure using what is called
item response theory (IRT), professional test designers can produce test forms whose
equated test scores are reliable measures of performance. (For more information on IRT,
see Bachman, 1990, pp. 202-209.)

3. Distractor efficiency is one more important measure of a multiple-choice item’s value in a test, and one that is related to item discrimination. The efficiency of distractors is
the extent to which (a) the distractors “lure” a sufficient number of test- takers, especially
lower-ability ones, and (b) those responses are somewhat evenly distributed across all
distractors. Those of you who have a fear of mathematical formulas will be happy to read
that there is no formula for calculating distractor efficiency and that an inspection of a
distribution of responses will usually yield the information you need.

Consider the following. The same item (#23) used above is a multiple-choice item with
five choices, and responses across upper- and lower-ability students are distributed as
follows:

Choices                 A    B    C*   D    E

High-ability Ss (10)    0    1    7    0    2

Low-ability Ss (10)     3    5    2    0    0

*Note: C is the correct response


No mathematical formula is needed to tell you that this item successfully attracts seven
of the ten high-ability students toward the correct response, while only two of the low-
ability students get this one right. As shown above, its ID is .50, which is acceptable, but
the item might be improved in two ways: (a) Distractor D doesn’t fool anyone. No one
picked it, and therefore it probably has no utility. A revision might provide a distractor that
actually attracts a response or two. (b) Distractor E attracts more responses (2) from the
high-ability group than the low-ability group (0). Why are good students choosing this one?
Perhaps it includes a subtle reference that entices the high group but is “over the head” of
the low group, and therefore the latter students don’t even consider it.

The other two distractors (A and B) seem to be fulfilling their function of attracting some
attention from lower-ability students.
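
A simple tally is all the inspection requires. The sketch below, in Python with invented choices that mirror the distribution in the table above (C is the key), shows how quickly such a distribution can be produced for a whole class.

```python
# Minimal sketch: tallying distractor choices for one item (hypothetical data
# mirroring the table above; option C is the key).
from collections import Counter

high_group_choices = ["C", "C", "C", "C", "C", "C", "C", "B", "E", "E"]
low_group_choices = ["A", "A", "A", "B", "B", "B", "B", "B", "C", "C"]

def tally(choices):
    """Return a count for every option, including options that no one chose."""
    counts = Counter(choices)
    return {option: counts.get(option, 0) for option in "ABCDE"}

print("High-ability:", tally(high_group_choices))
print("Low-ability: ", tally(low_group_choices))
# A distractor no one chooses (here D) is doing no work, and a distractor that
# attracts more high-ability than low-ability students (here E) deserves a second look.
```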

SCORING, GRADING, AND GIVING FEEDBACK

Scoring

As you design a classroom test, you must consider how the test will be scored and
graded. Your scoring plan reflects the relative weight that you place on each section and
items in each section. The integrated-skills class that we have been using as an example
focuses on listening and speaking skills with some attention to reading and writing. Three of
your nine objectives target reading and writing skills. How do you assign scoring to the
various components of this test?

Because oral production is a driving force in your overall objectives, you decide to place
more weight on the speaking (oral interview) section than on the other three sections. Five
minutes is actually a long time to spend in a one-on-one situation with a student, and some
significant information can be extracted from such a session. You therefore designate 40
percent of the grade to the oral interview. You consider the listening and reading sections to
be equally important, but each of them, especially in this multiple-choice format, is of less
consequence than the oral interview. So you give each of them a 20 percent weight. That
leaves 20 percent for the writing section, which seems about right to you given the time and
focus on writing in this unit of the course.

Your next task is to assign scoring for each item. This may take a little numerical
common sense, but it doesn’t require a degree in math. To make matters simple, you
decide to have a 100-point test in which

• the listening and reading items are each worth 2 points.

• the oral interview will yield four scores ranging from 5 to 1, reflecting fluency, prosodic
features, accuracy of the target grammatical objectives, and discourse appropriateness. To
weight these scores appropriately, you will double each individual score and then add them
together for a possible total score of 40. (Chapters 4 and 7 will deal more extensively with
scoring and assessing oral production performance.)

• the writing sample has two scores: one for grammar/mechanics (including the correct
use of so and because) and one for overall effectiveness of the message, each ranging from
5 to 1. Again, to achieve the correct weight for writing, you will double each score and add
them, so the possible total is 20 points. (Chapters 4 and 9 will deal in depth with scoring
and assessing writing performance.)

Here are your decisions about scoring your test:

                   Percent of Total Grade     Possible Total Correct

Oral Interview     40%                        4 scores, 5 to 1 range, x 2 = 40

Listening          20%                        10 items @ 2 points each = 20

Reading            20%                        10 items @ 2 points each = 20

Writing            20%                        2 scores, 5 to 1 range, x 2 = 20

Total                                         100
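
To see how these weights combine into a single 100-point score, here is a minimal sketch in Python for one imaginary student; the raw ratings and item counts are hypothetical but follow the weighting scheme just described.

```python
# Minimal sketch: combining section scores under the weighting described above.
# All raw scores below are hypothetical, for one imaginary student.

# Oral interview: four ratings on a 5-to-1 scale, each doubled (maximum 40).
oral_ratings = [4, 5, 3, 4]
oral_score = sum(r * 2 for r in oral_ratings)          # 32 out of 40

# Listening and reading: 10 items each, 2 points per correct answer (maximum 20 each).
listening_score = 8 * 2                                # 16 out of 20
reading_score = 9 * 2                                  # 18 out of 20

# Writing: two ratings on a 5-to-1 scale, each doubled (maximum 20).
writing_ratings = [4, 3]
writing_score = sum(r * 2 for r in writing_ratings)    # 14 out of 20

total = oral_score + listening_score + reading_score + writing_score
print(f"Total: {total} / 100")                         # Total: 80 / 100
```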

At this point you may wonder if the interview should carry less weight or the written
essay more, but your intuition tells you that these weights are plausible representations of
the relative emphases in this unit of the course.

After administering the test once, you may decide to shift some of these weights or to
make other changes. You will then have valuable information about how easy or difficult the
test was, about whether the time limit was reasonable, about your students’ affective
reaction to it, and about their general performance. Finally, you will have an intuitive
judgment about whether this test correctly assessed your students. Take note of these
impressions, however nonempirical they may be, and use them for revising the test in
another term.

Grading

Your first thought might be that assigning grades to student performance on this test
would be easy: just give an “A” for 90-100 percent, a “B” for 80-89 percent, and so on. Not
so fast! Grading is such a thorny issue that all of Chapter 11 is devoted to the topic. How
you assign letter grades to this test is a product of

• the country, culture, and context of this English classroom,

• institutional expectations (most of them unwritten),

• explicit and implicit definitions of grades that you have set forth,

• the relationship you have established with this class, and


• student expectations that have been engendered in previous tests and quizzes in this
class.

For the time being, then, we will set aside issues that deal with grading this test in
particular, in favor of the comprehensive treatment of grading in Chapter 11.

Giving Feedback

A section on scoring and grading would not be complete without some consideration of
the forms in which you will offer feedback to your students, feedback that you want to
become beneficial washback. In the example test that we have been referring to here—
which is not unusual in the universe of possible formats for periodic classroom tests—
consider the multitude of options. You might choose to return the test to the student with
one of, or a combination of, any of the possibilities below:

1. a letter grade

2. a total score

3. four sub scores (speaking, listening, reading, writing)

4. for the listening and reading sections

a. an indication of correct/incorrect responses

b. marginal comments

5. for the oral interview

a. scores for each element being rated

b. a checklist of areas needing work

c. oral feedback after the interview

d. a post-interview conference to go over the results

6. on the essay

a. scores for each element being rated

b. a checklist of areas needing work

c. marginal and end-of-essay comments, suggestions

d. a post-test conference to go over work

e. a self-assessment

7. on all or selected parts of the test, peer checking of results

8. a whole-class discussion of results of the test

9. individual conferences with each student to review the whole test


Obviously, options 1 and 2 give virtually no feedback. They offer the student only a
modest sense of where that student stands and a vague idea of overall performance, but
the feedback they present does not become washback. Washback is achieved when
students can, through the testing experience, identify their areas of success and challenge.
When a test becomes a learning experience, it achieves washback.

Option 3 gives a student a chance to see the relative strength of each skill area and so
becomes minimally useful. Options 4, 5, and 6 represent the kind of response a teacher can
give (including stimulating a student self-assessment) that approaches maximum washback.
Students are provided with individualized feedback that has good potential for “washing
back” into their subsequent performance. Of course, time and the logistics of large classes
may not permit 5d and 6d, which for many teachers may be going above and beyond expectations for a test like this. Likewise option 9 may be impractical. Options 7 and 8, however, are clearly viable possibilities that solve some of the practicality issues that are so
important in teachers’ busy schedules.

In this chapter, guidelines and tools were provided to enable you to address the five
questions posed at the outset: (1) how to determine the purpose or criterion of the test, (2)
how to state objectives, (3) how to design specifications, (4) how to select and arrange test
tasks, including evaluating those tasks with item indices, and (5) how to ensure appropriate
washback to the student. This five-part template can serve as a pattern as you design
classroom tests.

In the next two chapters, you will see how many of these principles and guidelines apply
to large-scale testing. You will also assess the pros and cons of what we’ve been calling
standards-based assessment, including its social and political consequences. The chapters
that follow will lead you through a wide selection of test tasks in the separate skills of
listening, speaking, reading, and writing and provide a sense of how testing for form-
focused objectives fits in to the picture. You will consider an array of possibilities of what
has come to be called “alternative” assessment (Chapter 10), only because portfolios,
conferences, journals, self- and peer-assessments are not always comfortably categorized
among more traditional forms of assessment. And finally (Chapter 11) you will take a long,
hard look at the dilemmas of grading students.

EXERCISES

[Note: (I) Individual work; (G) Group or pair work; (C) Whole-class discussion.]

1. (I/C) Consult the MLAT website address on page 44 and obtain as much information
as you can about the MLAT. Aptitude tests propose to predict one’s performance in a
language course. Review the rationale supporting such testing, and then summarize the
controversy surrounding aptitude tests. What can you say about the validity and the ethics
of aptitude testing?

2. (G) In pairs, each assigned to one type of test (aptitude, proficiency, placement,
diagnostic, or achievement), create a list of broad specifications for the test type you have
been assigned: What are the test criteria? What kinds of items should be used? How would
you sample among a number of possible objectives?
3. (G) Look again at the discussion of objectives (page 49). In a small group, discuss
the following scenario: In the case that a teacher is faced with more objectives than are
possible to sample in a test, draw up a set of guidelines for choosing which objectives to
include on the test and which ones to exclude. You might start with considering the issue of
the relative importance of all the objectives in the context of the course in question. How
does one adequately sample objectives?

4. (I/C) Figure 3.1 depicts various modes of elicitation and response. Are there other
modes of elicitation that could be included in such a chart? Justify your additions with an
example of each.

5. (G) Select a language class in your immediate environment for the following project:
In small groups, design an achievement test for a segment of the course (preferably a unit
for which there is no current test or for which the present test is inadequate). Follow the
guidelines in this chapter for developing an assessment procedure. When it is completed,
present your assessment project to the rest of the class.

6. (G) Find an existing, recently used standardized multiple-choice test for which there
is accessible data on student performance. Calculate the item facility (IF) and item
discrimination (ID) index for selected items. If there are no data for an existing test, select
some items on the test and analyze the structure of those items in a distractor analysis to
determine whether they have (a) any bad distractors, (b) any bad stems, or (c) more than one
potentially correct answer.

7. (I/C) On page 63, nine different options are listed for giving feedback to students on
assessments. Review the practicality of each and determine the extent to which practicality
(principally, more time expended) is justifiably sacrificed in order to offer better washback
to learners.
