testing and assessment demands the development of local solutions. Starting from
their own experience with the development of tests in their own contexts, the
authors manage to cover all the important issues to be taken into account, from
construct definition to score reporting.
Gladys Quevedo-Camargo, University of Brasília
Local Language Testing: Design, Implementation, and Development captures the authors’
collective years of local test development wisdom in reader-friendly language that
helps to demystify the often-vexing test development process. Professors Slobodanka
Dimova, Xun Yan, and April Ginther offer sage advice to a broad audience includ-
ing language program directors and classroom teachers who seek to create tests
locally for purposes such as placement and diagnosis. Chapters such as Scaling and
Data Collection and Management serve as roadmaps to lead local test developers to
an understanding of the steps in the process of test creation. Local Language Testing
should be on the reading list of anyone seeking know-how for the creation of fair
and usable assessments.
Deborah Crusan, Wright State University
A lucid and accessible account of the art of developing language tests for use in
local contexts. Teachers and program directors who may have thought language
testing was beyond them should think again. Three seasoned testing practitioners
(Dimova, Yan and Ginther) characterize language testing as a problem-solving
activity best tackled by those with an understanding of the context of concern.
Reflecting on their own practical experience in diverse teaching environments,
the authors outline the processes and likely challenges involved in developing
and delivering custom-built tests designed for particular local needs. This book is
a welcome addition to a field dominated thus far by research on large-scale
commercial language tests.
Catherine Elder, University of Melbourne
LOCAL LANGUAGE TESTING
• the ability of local tests to represent local contexts and values, explicitly and
purposefully embed test results within instructional practice, and provide
data for program evaluation and research;
• local testing practices grounded in the theoretical principles of language testing,
drawing from experiences with local testing and providing practical examples
of local language tests, illustrating how they can be designed to effectively
function within and across different institutional contexts;
• examples of how local language tests and assessments are developed for use
within a specific context and how they serve a variety of purposes (e.g.,
entry-level proficiency testing, placement testing, international teaching
assistant testing, writing assessment, and program evaluation).
CONTENTS
List of figures x
List of tables xii
Preface xiii
Acknowledgements xv
1 Introduction 1
Why local tests? 1
Introduction 1
Our tests 3
Contexts and problem solving 7
Large-scale language tests and academic language testing 9
Organization of the rest of this volume 12
Further reading 14
References 15
Test design 48
Technical manuals 61
Summary 62
Further reading 63
References 64
4 Test tasks 66
Introduction 66
Discrete-point lexico-grammar items 67
Integrative measures of general language ability 68
Skill-based performance assessments 71
Summary 82
Further reading 84
References 86
6 Scaling 114
Introduction 114
Why do we need a rating scale? 116
Different types of rating scales 117
Approaches to scale development and validation 122
Issues to consider during scale validation 126
Unique challenges and opportunities for scaling in local testing contexts 131
Summary 133
Further reading 134
References 135
9 Reflections 181
Introduction 181
Some reflections on the revision of the TOEPAS 182
Some reflections on the EPT scale revision process 185
Some reflections on the development of the OEPT 189
Conclusions 193
References 193
Appendix A 195
Index 205
ACKNOWLEDGEMENTS
We would like to express our gratitude to Catherine Elder for her constructive
feedback and creative suggestions, to Kyle McIntosh for his careful edits and
useful comments, and to Shareh T. Vahed and Ji-young Shin for their assistance
with the further readings.
1
INTRODUCTION
Introduction
Tests and assessments that are developed for use within a local context serve
a variety of purposes that cannot be addressed effectively using either large-scale,
commercial tests or classroom assessments. This volume, Local Language Testing:
Design, Implementation, and Development, describes language testing practices that
exist in the intermediate space between the two. We define a “local test” as one designed to represent the values and priorities of a local instructional program and to address problems that emerge from needs within the context in which the test will be used. That is, local tests
do not stand alone; they are embedded. We operationalize “local context” first
as the specific language curriculum with which the local test is associated
and second as the broader target language use domain, i.e., the “situation or
context in which the test-taker will be using the language outside of the test
itself” (Bachman & Palmer, 1996, p. 18).
Another key feature that distinguishes a local test from a large-scale, commercial test is that the development of a local test must, by both necessity and choice, rely on local expertise. Such expertise may be embodied in teachers, students, researchers, and program administrators who work in the local context. Local test development requires the participation of each of these critical actors; however, within local test development teams, the contributions of instructors, because of their familiarity with and representation of the local instructional context, are central to test development efforts.
The advantages of local tests include their design as representations of local
contexts and values, the potential to link scores and performances to instruc-
tional practice, and the provision of data for program evaluation and research.
While large-scale standardized tests provide scores, local tests generate both scores and data. Unfettered access to test data (item-level data, raw scores, and actual performance data) is possible only when a test is locally developed.
In addition, local tests can be developed not only to provide test users with
information about a test-taker’s language abilities but also to provide test-takers
with information about the language use characteristic of the broader context in
which they will be required to perform. In other words, local tests have the
potential to function as instructional devices that offer information about the
performance expectations associated with the local context.
The development of local tests and their representation of local contexts and
values creates the conditions for the emergence of a community of practice
(Lave, 1991). Fostering this community through continued and iterative test
development efforts and the use of test results to examine the success of the
instructional program can lead to improvements in both the test and the pro-
gram it is designed to serve. Evaluation of the test will include analyses of the fit between the test and the instructional program. For the potential contributions of a local test to be fully realized, test development, program development, and test and program evaluation should be understood as ongoing processes, with ample time and resources devoted to these complementary efforts.
In our discussions of the defining characteristics of local tests, we realized that our description requires at least two qualifications. Both qualifications arose in discussions of assumptions that we believe logically accompany the notion of local
tests. First, because local tests are embedded in local instructional programs, it
might be assumed that local tests are always small-scale. However, the local
values and instructional emphases associated with the development of a local test
may align with district, provincial, state, and/or national instructional standards.
In this sense, even a large, national test could be considered a local test when
the values represented by the test reflect distinctive features of a broader instruc-
tional context. However, the directionality of the development of a local test is
a critical distinguishing feature of a local test; that is, because local tests are
Our tests
Because local tests are the products of the contexts in which they are embedded
and because we will be referring to our experiences with local test development,
administration, and maintenance throughout the book, we will take the oppor-
tunity to introduce our tests and contexts at this point.
which also offers instruction for those who fail the test. The instructional side of the program consists of small language classes (eight students) for prospective international teaching assistants (ITAs) that emphasize individualized instruction, with four hours of class time and one and a half hours of individual tutoring each week.
Purdue University is a large public institution (33,000 undergraduate and
10,000 graduate students) of higher education in the state of Indiana in the
United States (US). Like most large public universities in the US, Purdue has
large populations of international undergraduate (15%) and graduate (50%)
students. In the US, graduate students, particularly doctoral students, are funded
through graduate research and teaching assistantships that cover tuition and
provide living expenses. International students who want to receive teaching
assistantships must demonstrate satisfactory English language proficiency before
they can be offered teaching assistantships by their departments. There are many
opportunities for teaching assistantships because, particularly at large public US
institutions, introductory undergraduate classes are often taught by graduate
teaching assistants (TAs). Science, technology, engineering, and mathematics
(STEM) programs are highly populated by international graduate students, and
the Oral English Proficiency Program (OEPP) diagnoses and/or prepares the population of international TAs to communicate effectively in English with undergraduate students in classroom contexts.
The considerable increase in the international graduate student population in
the 1980s led many US universities to require English language proficiency
screening for prospective TAs (see Ginther, 2003). In 27 of the 50 states, such English language proficiency screening is mandated by state legislation. From 1987 to
2001, the OEPP used the Speaking Proficiency Assessment Kit (SPEAK), retired
versions of the Test of Spoken English (TSE), which had been administered as
an additional component of the Test of English as a Foreign Language (TOEFL)
until the introduction of its Internet version, the TOEFL iBT. The Oral English Proficiency Test (OEPT) was developed in 2001 to replace the SPEAK and has been revised three times since.
The test is now part of a much larger testing and instructional system, which
includes the OEPT itself, the OEPT Practice Test and Tutorial, the OEPT
Rater Training Application, the OEPT Rater Interface, and the OEPT research
database. The OEPT is administered to more than 500 prospective ITAs each
academic year. Each test is rated by at least two trained raters. Approximately
half of OEPT administrations are conducted, and scores reported, the week
before school starts in August.
have an adequate level of oral proficiency for lecturing and interacting with students in an English-medium instruction (EMI) university setting. The Test of Oral English Proficiency for Academic Staff (TOEPAS) is based on a 20-minute simulated lecture in which the test-takers present new material, give instructions, and interact with students. The lecturers/test-takers are provided with a score, a video recording, personalized written feedback, and a one-on-one feedback conference with a test administrator.
TOEPAS was developed and is administered by The Centre for International-
ization and Parallel Language Use (CIP), established in 2008 to augment the
University’s efforts to implement a language policy based on the principles of
parallel language use, which is an important aspect of the internationalization
process at the University of Copenhagen. The implementation of the parallel
language use policy ensures consistently high standards of language use in
Danish, as well as in English.
CIP functions both as a research and training center; its principal aim is to
develop a research-based strategy for the enhancement of Danish and English lan-
guage skills among various groups at the University. The objective of this strategy
is to contribute to the strengthening of the University’s international profile by
supporting employees and students in meeting language-related challenges.
CIP offers language courses to a range of different groups at the University,
including teachers, researchers, administrative staff, and students at all educational
levels, both domestic and international. One of CIP’s most important tasks is to
develop and offer tailor-made language courses for employees and students at
the University. Danish courses are available to employees who are non-native
speakers of Danish, and English courses are available for employees and students
who wish to improve their English language skills. Two important features
characterize CIP’s language courses: (1) they are shaped by the CIP’s research
findings, and (2) they are tailor-made to meet each participant’s professional
requirements, existing language skills, career development goals and, where
relevant, teaching and preferred modes of academic publication.
demands of the campus. Students who are placed in one or more ESL courses
must complete all their required ESL courses before graduation.
The English Placement Test (EPT) has a long history that dates back to the 1970s, but the current version was first developed in 2001. The EPT has both on-campus and online versions.
While the on-campus EPT is administered mostly to graduate students, the online
EPT is administered only to undergraduate students. The EPT has traditionally been administered on the campus of the University of Illinois at Urbana-Champaign (UIUC) before the first week of instruction each
fall and spring semester. However, the large increase in the number of international
undergraduate students at UIUC in 2008 led to the development of an online EPT
to satisfy the growing need of undergraduate academic advisors for timely advising
for incoming international undergraduate students. The online EPT is administered
in summer only to undergraduate students via Moodle.
The EPT consists of two parts. The first part is a written test, and the second
part is an oral test. The written test requires students to produce an academic
essay based on the information obtained from a series of short reading passages
and a short lecture. For the oral test, students are asked to complete a series of
speaking tasks related to the same topic as in the writing section. If students
speak intelligibly and coherently, they will be exempted from further oral
testing. The EPT is administered to close to 2,000 students each academic year,
the majority of whom complete the online version before arriving on campus.
Although these students have been admitted, the EPT is used to place students
into different levels of first-year composition classes. Each test is rated by two
trained raters and the results determine whether test-takers will need to complete
one or two composition courses and additional instruction in speaking.
Originally developed as a placement test for all incoming students, the test took on a new purpose when Purdue decided to require all incoming students who score 100 or below on the TOEFL iBT (or a comparable score on the IELTS) to enroll in the PLaCE sequence. However, students may be exempted from
the second semester by meeting PLaCE exemption requirements, which include
satisfactory completion of the first course, an instructor recommendation, and
a passing score on all sections of the Ace-IN. Modules of the Ace-IN are scored
as needed (e.g., for additional information contributing to exemption decisions).
PLaCE existed in pilot form for four years before the program was provided
recurring funding in 2018. Because students must still complete additional writ-
ing and communication courses, PLaCE has focused its instructional efforts on
helping students adapt to Purdue’s academic environment, develop intercultural
competence, and improve their speaking skills. We are still evaluating how
information provided by the Ace-IN can be used most effectively within the
instructional program.
some ITA programs today, although the test is no longer supported by the
Educational Testing Service (ETS).
The administration of the SPEAK at Purdue was time-consuming and ineffi-
cient, especially with the growing influx of international graduate students. It
was conducted one-on-one by test administrators who presented the prompts,
recorded responses on cassette tapes, and then rated the cassette recordings
later. Two language testing administrators/raters were needed to administer the SPEAK and rate the responses of approximately 500 prospective international teaching assistants each academic year. In 1999, we began the development of the
OEPT and decided to adopt a semi-direct, computer-administered testing
format, primarily to ease the administrative burdens associated with our admin-
istration of the SPEAK. The adoption of the semi-direct, computer-delivered
testing format allowed for group administration of the new test and reduced
the time needed for administration and rating by 50%.
The development of this new test also afforded the opportunities associated
with local test development that were alluded to above. Because the SPEAK
was a general language proficiency test, it was not linked in any way to the
ITA context, and the development of a new test created an opportunity to
represent the local context. However, our initial OEPT test development
efforts can be considered a failure because our first pass resulted in items that
were, for the most part, simply replications of SPEAK items. The influence of
our long use of and familiarity with the SPEAK was considerable. Had we
stopped there, we would have simply had an old test on a new platform.
Instead, we shifted development efforts in order to represent the kinds of
tasks and activities that ITAs perform in instructional contexts at Purdue. Test
prompts were designed to introduce aspects of the context (When you are
a teaching assistant, you will have occasion to …) along with instructions for
successful completion of the test task. Representing the local ITA instructional
context became a central part of our revised test development efforts. These
efforts were successful and, in response to our pilot, one test-taker com-
mented, “Taking the test gave me an idea about what teaching assistants
actually have to do.” In addition, when we developed the scale for the OEPT,
scores were directly related to the test-taker’s readiness to perform duties as
a classroom instructor. It is important to note that for some incoming gradu-
ate students, the OEPT serves as one of the few – perhaps only – introductions
to ITA duties and responsibilities provided before the first day of class.
At the same time, we decided to train all OEPP classroom instructors as
OEPT raters. OEPP instructors are well versed in the characteristics of the
test; indeed, because instructors are raters, they are responsible for place-
ment into and out of the instructional side of the program. This dual role as
instructor and rater sets the stage for our instructional interventions. Those
test-takers who do not pass the OEPT can enroll in the instructional side of
the ITA program, which includes four hours of class time, a 30-minute individual session with the instructor, and a 50-minute tutorial with a tutor. OEPP
class size is limited to eight students, and the generous support that stu-
dents receive allows us to individualize instruction. One of the first instruc-
tional encounters that enrolled students have with their instructors and
tutors, referred to as the OEPT Review, focuses on each test-taker’s OEPT
score and recorded item responses. After one student completed this OEPT
Review, he commented, “This is the first time I have fully understood why
I got the score that I got on a language test.”
The development of the OEPT allowed us to address not only the initial,
motivating problem of test administration, but also construct representation,
instructors’ understandings and uses of test scores, and test-takers’ understandings
of test scores. Over time, the OEPT scale was extended to provide the founda-
tion for the self-evaluations, volunteer audience evaluations, and instructor
evaluations associated with the four presentations that students in the program
complete each semester. We consider the OEPT a good example of a local test
that is fully integrated within the instructional context it was developed to serve
and of the opportunities that local test development offers.
stakes are high for both the test developer and the test-taker because a high-stakes test developer must deliver an instrument that produces a score that is an accurate and valid estimate of the test-taker’s ability regardless of who that student is, where the student is from, or what the student has experienced. Establishing a test’s technical qualities and developing arguments for the test’s representation of underlying and necessarily abstract language constructs therefore take center stage.
Language testers attend to both the measurement and the linguistic characteristics of tests, and the field of language testing has been described as a hybrid of these two fields (Davies, 1984). As a result, language testing is often perceived as esoteric –
beyond the reach and interest of many program directors and language instructors.
Language testers contribute to this impression by emphasizing the difficulty and
complexity of language testing, along with reported deficiencies in the assessment
literacy of teachers and practitioners. However, many appropriate test develop-
ment and analysis procedures are well within teachers’ and program directors’ abil-
ities. Indeed, the tenets of good language testing practice are familiar to instructors
and program directors and will be recognized in the following section.
In 1961, John Carroll laid the foundation for the development of language test-
ing as an academic sub-specialty in applied linguistics with his paper “Fundamental
Considerations in Testing for English Language Proficiency of Foreign Students”,
presented at a conference organized to address the need for and development of
a test of English language proficiency for university admissions purposes. As
Ginther and McIntosh (2018) explain, “Carroll contrasted the discrete-
structuralist approach, represented by Lado (1961), with his own integrative
approach that emphasized productive and receptive modes, ‘real world’ settings,
and communicative purposes” (p. 848). Carroll’s approach is familiar to language
teachers and program directors because his arguments were foundational to the
development of communicative language teaching.
Of Carroll’s presentation, Spolsky (1995) noted, “This may have well been the
first call for testing communicative effect” (p. 224), providing impetus for the rich
and varied theoretical and practical discussions of language testing and assessment
that followed, including Canale and Swain’s (1980) “Theoretical Bases of
Communicative Approaches to Second Language Teaching and Testing”, along
with Bachman’s (1990) and Bachman and Palmer’s (1996) extended models of com-
municative competence. Furthermore, Carroll’s (1961) paper not only heralded the development of the TOEFL, but also anticipated its transformations across versions:
from the TOEFL Paper-based test (PBT), with its decidedly discrete/structuralist
orientation, to the TOEFL Computer-based test (CBT), with its then innovative
adaptive listening and structure subsections but more traditional, linear, multiple-
choice reading subsection, and finally to the introduction of the TOEFL Internet-
based test (TOEFL iBT) in 2005, which now includes subsections that incorporate
and integrate writing and speaking skills (Ginther & McIntosh, 2018).
An important point for the purposes of our examination of local tests is that
Carroll’s (1961) discussions of testing tasks and methods, from discrete items
focusing on points of grammar to integrated items and performance tasks, all
remain viable options for the local test developer with a particular purpose in
mind (see Chapter 4, Test Tasks). The selection of items, formats, and methods
will depend on the instructional values and purposes of the language program in
which a local test is embedded. While the assessment of the productive skills of
speaking and writing is now part of TOEFL’s standard test administration, in
sharp contrast to the absence of speaking and the indirect measurement of writ-
ing with multiple-choice structure/vocabulary items that characterized the earli-
est versions of the TOEFL PBT, we believe that Carroll’s vision (i.e., productive and receptive modes, “real world” settings, and communicative purposes) can be more fully realized when tests are linked to local instructional contexts in which communicative purposes are emphasized.
The shift toward functional, communicative tests led to considerations of inter-
activity and a focus on performance (McNamara, 1996) and was accompanied by the
development and popularity of the American Council on the Teaching of Foreign
Languages (ACTFL, 1986, 2004) proficiency guidelines and the Common European
Framework of Reference (CEFR, Council of Europe, 2001), both of which are now
part and parcel of the communicative language teaching and testing movement. Few
would deny the intuitive appeal of language proficiency guidelines that easily lend
themselves to the development of classroom activities and provide a common under-
standing of an assumed, if not empirically demonstrated, developmental progression
(see Hawkins & Filipovic, 2012 on the relation of selected linguistic features and their
relation to CEFR levels; Leahy & Wiliam, 2011 on developing learner progressions;
and Baten & Cornillie, 2019 on elicited imitation and proficiency levels). Both
teachers and program directors are well versed in the communicative movement, and
local language tests linked to an instructional program can function as laboratories
where links between testing, assessment, and instruction can be forged and examined.
Local language tests begin where large-scale tests end. As briefly discussed
above, the primary purpose of large-scale tests (e.g., TOEFL and IELTS) is gen-
eral language proficiency testing for admission/selection into institutions of higher
education, and they serve that purpose well. However, their use is limited for
placement, diagnosis, and achievement. The scope of a local test often involves
a closer look at a limited range of proficiency, and the restriction of range results
in the challenge and the benefit of focus or a shift in grain size – like changing
optics to get a more detailed look at a microscopic specimen on a slide. A local
test isolates a segment of a scale, expands the segment, and then provides a new and revised scale through which relevant qualities of performance are highlighted.
Local language tests also provide the opportunity to shed light on and exam-
ine the utility of language proficiency guidelines. While the authors of guidelines
are careful to state that guidelines may or may not actually represent develop-
mental progressions, guidelines provide a resource for the development of
instructional objectives and goals, classroom activities, and local test tasks within
selected levels. Some activities, items, and tasks will work better than others.
The development of local language tests provides an opportunity to examine
what works in relation to instructional programs.
In the brief overview above, we hope to have highlighted both some of the common values that large-scale and local tests share and the ways in which the language testing literature can serve as a valuable resource for local test development efforts. At the end of each chapter, we also provide briefly anno-
tated suggestions for further reading that may be of particular interest to pro-
spective local language test developers.
We have argued that the advantage of local language tests is that they can be
used to inform teaching, curriculum, and ongoing development of language sup-
port programs. This chapter has discussed the potential for local language tests to
become an integral part of language teaching programs by providing baseline
information about test-takers’ entry proficiency levels, information about stu-
dents’ progress over time, and information that can be used for program evalu-
ation. The remainder of this volume focuses on the information that prospective
local language test developers need to consider in order to realize these goals.
The volume is organized into the following chapters, in which we consider, fol-
lowing both Carroll and Bachman, fundamental considerations in the develop-
ment of local language tests.
Chapter 6: Scaling
This chapter focuses on different types of language test scoring systems, with
particular attention to scale design for performance-based tests in speaking and
writing. Alongside the presentation of different types of scales (analytic, holistic,
primary-trait), this chapter includes discussions of different scale design methods
(data-driven, theory-driven) and benchmarking, i.e., finding representative performances for each scale level.
Chapter 9: Reflections
In this final chapter, we discuss the lessons learned through the development,
administration, and maintenance of the four tests that we have developed and
used as examples throughout the book.
Further reading
Bachman, L. F. (1990). Fundamental considerations in language testing. Oxford:
Oxford University Press.
Bachman’s Fundamental considerations provides an introduction to the issues that must
be addressed when developing language tests and an analysis of some of the chal-
lenges that a language test developer may encounter during the development and
the use of an instrument used for measurement. The book provides a thorough
explanation of classical measurement theory, generalizability theory, and item
response theory. However, the reader of this book must keep in mind that the
book is more concerned with “considerations” rather than “applications” of these
concepts. The discussion of validity is also one of the salient features of this book, as Bachman describes validity as a unitary concept that requires evidence to support the various inferences we make on the basis of test scores.
Bachman, L. F., & Palmer, A. S. (1996). Language testing in practice: Designing and
developing useful language tests (Vol. 1). Oxford: Oxford University Press.
Bachman and Palmer’s (1996) main objective is to develop readers’ competency in
language test development and use. The book consists of three parts. Part one
explores the concepts of test usefulness, test task characteristics, and characteristics of
language use and language test performance. The second part, “language test devel-
opment”, discusses the practice of test development and use, from planning for
a test to its administration and scoring. Some of the topics discussed in this section
of the book include the design of a test (i.e., test purpose description, TLU domain
analysis, construct definition, evaluation of usefulness, and allocation of resources),
operationalization (i.e., development of test tasks and a blueprint, writing instruc-
tions, and choosing a scoring method), and test administration (i.e., collecting feed-
back, analyzing test scores, and archiving). The last part of the book provides ten
sample test development projects, which include good examples of activities lan-
guage test developers can use in their own test development endeavors.
Elder, C. (2017). Language assessment in higher education. In E. Shohamy, I. G. Or,
& S. May (Eds.), Language testing and assessment (3rd Ed, pp. 271–286). New York:
Springer.
Elder (2017) critically reviews validity issues concerning English language assess-
ment in higher education, focusing on entry, post-entry, and exit tests. The
author begins the review by discussing early developments in language testing led
by two large-scale English language admission tests and the theoretical foundations
of language proficiency. The author connects large-scale admission tests to alter-
native pathways such as preparatory courses or/and post-entry local tests and
emphasizes the importance of not only identifying language needs of linguistically
at-risk students but also providing appropriate intervention throughout their
learning trajectories. She calls for attention to refining construct representation of
English proficiency in the context of English as a lingua franca in higher education
References
ACTFL. (1986). ACTFL proficiency guidelines (Revised). Hastings-on-Hudson, NY: American Council on the Teaching of Foreign Languages.
ACTFL. (2004). ACTFL proficiency guidelines (Revised). Hastings-on-Hudson, NY: American Council on the Teaching of Foreign Languages. Retrieved January 1, 2004, from www.sil.org/lingualinks/LANGUAGELEARNING/OtherResources/ACTFLProficiencyGuidelines/ACTFLProficiencyGuidelines.htm
Bachman, L. F. (1990). Fundamental considerations in language testing. Oxford: Oxford University Press.
Bachman, L. F., & Palmer, A. S. (1996). Language testing in practice: Designing and developing useful language tests. Oxford: Oxford University Press.
Baten, K., & Cornillie, F. (2019). Elicited imitation as a window into developmental stages. Journal of the European Second Language Association, 3(1), 23–34.
Canale, M., & Swain, M. (1980). Theoretical bases of communicative approaches to second language teaching and testing. Applied Linguistics, 1(1), 1–47.
Carroll, J. B. (1961). Fundamental considerations in testing for English language proficiency of foreign students. In H. B. Allen & R. N. Campbell (Eds.) (1972), Teaching English as a second language: A book of readings (pp. 313–321). New York: McGraw Hill.
Centre for Internationalisation and Parallel Language Use. (2009). Test of Oral English Proficiency for Academic Staff (TOEPAS). Copenhagen: The University of Copenhagen.
Council of Europe. (2001). Common European framework of reference for languages: Learning, teaching, assessment. Cambridge: Cambridge University Press.
Davies, A. (1984). Validating three tests of language proficiency. Language Testing, 1(1), 50–69.
Educational Testing Service. (1980). Test of spoken English. Princeton, NJ: Educational Testing Service.
Educational Testing Service. (1982). Guide to SPEAK. Princeton, NJ: Educational Testing Service.
Educational Testing Service. (2019). Performance descriptors for the TOEFL iBT test. Princeton, NJ: Educational Testing Service. https://www.ets.org/s/toefl/pdf/pd-toefl-ibt.pdf
2
LOCAL TESTS, LOCAL CONTEXTS
This chapter begins by discussing the variety of contexts that need to be con-
sidered when developing a local test. For example, it presents variations of the
contexts in which English is used as a lingua franca and the various contexts of
language instruction. Then, test purpose and instructional goals and objectives are
considered. The chapter argues that within local programs, the original purpose of a local test may be extended from summative to formative purposes; the data generated by a local test have the potential to serve as the basis for diagnosis and feedback. Local tests can be designed to contribute to a larger assessment system when the test aligns with and is embedded in the local instructional context.
Introduction
As we argued in Chapter 1, the local context is central to the design, implemen-
tation, and maintenance of local tests. This relationship with the local context
may, in fact, be the most important distinguishing feature of a local test. Because
large-scale tests must reasonably focus on commonalities across a broad spectrum
of language use, they may exclude specific characteristics of language use associ-
ated with particular contexts. Local tests can address the gap between large-scale
and local representations of language use by targeting the domain that lies
between the general and specific.
Influential characteristics of context can be represented in many ways because
educational endeavors are influenced by the broad sociocultural, economic, and
political contexts in which they are embedded, as well as the personal values,
practical needs, and available resources of the stakeholders who interact within
these broader contexts. In this chapter, we discuss the different levels of context
that have influenced the development of our local English language tests and
how the needs of stakeholders within these contexts were addressed, in part,
through the development of local language tests. We begin with a discussion of
the ever-increasing demand for developing proficiency in English in differing
educational contexts around the world.
The fact that so much of human knowledge and the world’s information
in the present day is transmitted and developed in English means that at
higher levels of learning, a knowledge of English is a virtual necessity. To
be competitive in an increasingly international academic marketplace and
to keep on top of developments in most fields requires a high level of
knowledge in English, and this fact has contributed significantly to the
accelerated and continuing spread of the language.
(pp. 4–5)
a closer look at the characteristics of performance within a level that is only gen-
erally represented on a large-scale test.
Developing a test may help prospective language testing practitioners address
a variety of problems. In broad terms, language tests can be categorized as attempts to
address four kinds of problems: proficiency, placement, achievement, and diagnosis.
Although a student may initially be placed within an instructional level, the
proficiency of test-takers often requires a closer look. Two students, both having
completed four years of foreign language instruction, may obtain different levels
of proficiency and may be placed into beginning, intermediate, or advanced lan-
guage classes. Although a test-taker’s proficiency may be considered sufficient
for entry into a domain of use (such as the minimum needed for admission), it
may be insufficient for advanced or specialized purposes (such as teaching an
introductory class within a program). Given different proficiency levels at entry,
test-takers may benefit from support. Once support is indicated, both students
and instructors want to make the most efficient use of the time available for
instruction. Appropriate placement into instructional levels is the primary pur-
pose of placement testing.
When students are found to be in need of additional support, instructors may
want to get a better idea of test-takers’ strengths and weaknesses. Students with
varying educational opportunities and experiences often have different sets of
skills. Test-takers may have sub-scores that add up to the same total score but rep-
resent very different skills. A score user, such as an advisor or instructor, may infer
that a student’s strengths are consistent across skills, particularly when basing the
assessment on face-to-face encounters with a student whose speaking abilities are
high, but a closer look may reveal lower abilities in other skills (e.g., reading,
writing). Even within a given skill (e.g., speaking), the sub-skills (e.g., intelligibil-
ity, fluency of delivery, coherence of message) may differ when examined in dif-
ferent ways. Uncovering students’ strengths and weaknesses across and within
skills is the primary purpose of diagnostic testing.
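To make concrete how identical totals can mask different diagnostic profiles, the short Python sketch below compares two hypothetical test-takers; the students, sub-skills, and scores are invented for illustration and are not drawn from any of the tests described in this book.

    # Hypothetical sub-scores for two test-takers with identical totals.
    subscores = {
        "Student A": {"reading": 24, "listening": 26, "speaking": 18, "writing": 22},
        "Student B": {"reading": 18, "listening": 20, "speaking": 27, "writing": 25},
    }

    for student, skills in subscores.items():
        total = sum(skills.values())
        weakest = min(skills, key=skills.get)
        print(f"{student}: total = {total}, weakest sub-skill = {weakest}")

    # Both totals are 90, but Student A's profile points toward speaking support,
    # while Student B's points toward reading support.

Run against real item-level data, the same kind of summary can feed directly into the diagnostic conversations between advisors, instructors, and students described above.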
When instructors and program directors want – or are required – to provide
evidence that students have met instructional goals and objectives, they need to
be able to demonstrate how a curriculum provides opportunities to meet
intended program outcomes. The importance of evidence has grown with the
expectation that programs, institutions, school districts, states, and even nations
can and should demonstrate the effectiveness of their programs (known as the
“accountability movement” in the United States). Providing appropriate and accurate estimates of success or failure in a program of study is the primary purpose of achievement testing.
The broad purposes of proficiency, placement, diagnosis, and achievement are
complementary and often overlap. Language testers regularly caution test users
and developers that a test designed for one purpose should not be used for
another purpose unless it is validated for both. While we agree in principle,
local language tests often serve multiple purposes. Local language tests produce
more than a score; they provide accessible data. For example, a local test
designed for placement that provides both item-level scores and records of actual
speaking and/or writing performances may also be used for diagnosis. In add-
ition, since the purpose of local language tests is embedded in instructional goals
and outcomes, a test that serves well for placement (a pre-test) may also serve
well for achievement, examination of gain scores, and/or program evaluation
(post-test). Therefore, developing multiple forms of a single test and adopting it
for different uses within a single local context makes sense. A final important
purpose served by local testing and assessment is research. Local language tests
produce a wealth of data that can be adapted to address program and research
purposes simultaneously.
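As a sketch of what such reuse might look like in practice, the following Python fragment computes gain scores from paired placement (pre) and end-of-program (post) results; the record structure and the numbers are hypothetical and are not taken from the EPT, the OEPT, or any other test discussed in this book.

    # Hypothetical paired records from a local test used as both pre- and post-test.
    records = [
        {"id": "s01", "pre": 3.5, "post": 4.5},
        {"id": "s02", "pre": 4.0, "post": 4.5},
        {"id": "s03", "pre": 3.0, "post": 4.0},
    ]

    gains = [r["post"] - r["pre"] for r in records]
    mean_gain = sum(gains) / len(gains)
    print(f"Mean gain across the program: {mean_gain:.2f} scale points")

    # The same data set could be filtered by course section or entry level
    # to support program evaluation and research questions.

Because the scores and performances already sit in a local database, extensions of this kind require analysis time rather than new data collection.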
While the decision to embark on the development of a test is often associated
with the broad purposes introduced above, the actual trigger for test development
is often something far more practical, such as the desire to improve administrative
efficiency or to reduce expenses. Fortunately, technology like audio capture and
playback, which was once available only to large-scale test developers or the
technologically savvy, now lies within the reach of the average computer user. It
is likely, as a growing number of digital instructional and administrative platforms
(e.g., Blackboard, Bright Space, Moodle, GoSignMeUp) become more accessible
and begin to offer additional functions, that language program administrators and
classroom instructors will take full advantage of these resources to aid in their test
development efforts. Perhaps the next stage in the development of language test-
ing as a field (Spolsky, 1995) will be the era of local language tests.
In summary, the purpose of a test, in addition to addressing the construct of inter-
est, must also take into account the decisions that will be made based on test scores.
The initial purpose of a local test is to make a decision about test-takers (e.g., screen-
ing, placement, and/or diagnosis). These purposes are complementary. The screening
purpose of tests helps select individuals who meet the language requirements associated with an educational program of study or a job. When screening is
the initial purpose of a local test, results are often extended to placement. If the lan-
guage programs in educational institutions offer courses that include different profi-
ciency levels or that focus on different language skills, then these programs may
extend the use of scores to place incoming students into an appropriate level of the
language course. For example, some high schools in Denmark offer Danish courses
for Danish as a second language students at three different proficiency levels, and so
they need a test that assists in placement. Similarly, results from the OEPT are used to
place students into two instructional levels. The use of a local test first for screening,
then for placement, anticipates and may then support subsequent instruction.
the extended, viable functions that test results might serve, such as placement.
However, the more interesting and potentially beneficial extension of test pur-
pose lies with diagnosis. The diagnostic purpose extends test use to the identifi-
cation of the language-related strengths and weaknesses that test-takers possess so
that they can obtain the relevant language instruction that will allow them to
meet program outcomes. Extensions of test use to diagnosis are possible because
local tests produce a wealth of performance data in addition to a score. The following examples illustrate the extended purpose of
language tests (Examples 2.1 and 2.2).
level when defining the construct. Examples of commonly used standards include:
the CEFR; the ACTFL Guidelines, developed by the American Council on
Teaching Foreign Languages (ACTFL, 2004); and the WIDA Guidelines (WIDA,
2018), developed by a consortium of US states to establish standards for early
childhood and K-12 multilingual learners.
In local language test development, you may also want to consider other standards not directly associated with language teaching and learning. In US tertiary education, for example, there is growing interest in students’ development of the set of skills referred to as “21st century skills” (National Research Council, 2011). These skills, which the National Research Council (2011) groups into several broad categories, clearly interact with language development and easily lend themselves to inclusion in language program curricula.
Most 21st century skills are dependent on the concomitant development of lan-
guage abilities, which include observing and listening, eliciting critical information,
interpreting the information, and conveying the interpretation to others (Levy &
Murnane, 2004). For example, the National Research Council operationalizes Active Listening
as: “Paying close attention to what is being said, asking the other party to explain
exactly what he or she means, and requesting that ambiguous ideas or statements are
repeated” (p. 43); this demonstrates that these skills are largely language dependent.
Indeed, if we expect language learners to effectively use 21st century skills, they
must develop not only the foundational language skills associated with the com-
mencement of a program, but also the sophisticated language abilities required
throughout the course of study.
get them a job interview, but it is the ability to effectively communicate that
is essential for securing a job offer.
When instructors shifted the instructional focus toward professional devel-
opment with an emphasis on presentation skills, students could see the rele-
vance to their goals after obtaining their degrees. The skills emphasized in the
OEPP lend themselves easily to both undergraduate instructional contexts and
professional graduate contexts at the University.
The only way to break this circle is to try to build up a body of independ-
ent evidence on language ability and use through investigation of samples
of actual task performance. As Messick (1994) notes: “We need to move
beyond traditional professional judgement of content to accrue construct-
related evidence that the ostensibly sampled processes are actually engaged
by respondents in task performance.”
(p. 19)
The strength of local instructional programs is that they provide optimal envir-
onments in which to accrue construct-related evidence about how local tests
function in the local context because of the opportunity to collect and analyze
data as an ongoing process (Angoff, 1988). You can gather different types of
independent construct-related evidence for local tests, ranging from analysis of
test content and test performances, to evaluation of needs for local test imple-
mentation and use.
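By way of illustration, the sketch below (standard-library Python, version 3.10 or later) checks whether a temporal measure of fluency patterns with holistic ratings; all values are invented, and the snippet is not the authors' actual procedure, only an example of the general kind of performance-based evidence such an analysis might yield.

    from statistics import correlation  # available in Python 3.10+

    # Invented per-test-taker values: speech rate (syllables per second)
    # and the holistic score awarded by trained raters.
    speech_rate = [2.1, 2.6, 3.0, 3.4, 3.9, 4.2]
    holistic_score = [35, 40, 45, 50, 50, 55]

    r = correlation(speech_rate, holistic_score)
    print(f"Pearson r between speech rate and holistic score: {r:.2f}")

    # A strong positive association is one piece of construct-related evidence;
    # a weak or inconsistent one would prompt a closer look at the scale or tasks.

Analyses of this sort can be repeated each administration cycle, which is what makes the ongoing, local collection of performance data so valuable.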
If you want to collect evidence about the relevance of the test content, you
may take the opportunity to obtain subject specialists’, instructors’, and students’
evaluation of the test content. Fulcher’s (1997) discussion of the development
and validation of a local language test used for placement at the University of
Surrey is an example of test content evaluation. You can also gather construct-
related evidence through analysis of the test performance characteristics. For
example, Ginther, Dimova, and Yang (2010) looked at temporal measures of
fluency and their relation to the OEPT holistic scale and argued that
In the first place, the fact that outcomes are described in performance
terms means that learners are focused on language as a tool for communi-
cation rather than on language knowledge as an end in itself. They are
also able to obtain diagnostic feedback on the success of their learning
since explicit performance criteria are provided against which they can
judge their progress. Second, since there is a direct link between attain-
ment targets, course objectives and learning activities, assessment is closely
integrated with instruction: what is taught is directly related to what is
assessed and (in theory at least) what is assessed is, in turn, linked to the
outcomes that are reported. Third, instructors, by comparing students’
progress and achievement with the standards statements, are able to make
better-informed judgements about what individual learners need.
(Brindley, 1998, p. 52)
the key to more valid and useful English assessment in higher education
lies in effective long-term institutional policy-making which puts language
tests in their rightful place as part of an integrated teaching and learning
program which makes explicit not only the content objectives of courses
and the standards of achievement expected but also the nature and level of
English proficiency expected at various points in students’ study trajectory.
Effective policymaking, if concerns for language development are to be
anything more than tokenistic, must stipulate not only the mechanism for
ensuring adequate proficiency at entry but also the means of identifying
learning needs and monitoring language development throughout students’
courses and on graduation.
(p. 282)
learning. The development of local language tests is part and parcel of these
larger, positive developments, the potential of which is just beginning to be real-
ized. The following are examples of outcome-based assessment. Example 2.3
provides information about the development of learning outcomes for
a language program that are reflected in the test, while Example 2.4 discusses
a language test designed on the basis of curricular learning outcomes.
The OEPT Review (Example 2.5) is a good example of how the summative
use of an instrument can be extended for formative purposes.
The OEPT Review, individual goal-setting consultations, and the extension and expansion of the OEPT scale across assessment activities comprise a system and a set of feedback loops in which assessment is thoroughly embedded in the instructional context. A detailed analysis of the use of objectives, goals, and formative assessment in OEPP practice can be found in Haugen (2017).
Involving stakeholders
You could effectively embed assessment activities in a program and more fully
contribute to an assessment system when you design an assessment that engages
all critical classroom actors: instructors, peers, and students. Brindley (1998)
describes some of the ways in which such engagement might occur.
You have the opportunity to involve both instructors and students in the devel-
opment and evaluation of the test tasks. Instructors are most familiar with the
task types that students at different proficiency levels can complete, while stu-
dents can evaluate prototype tasks, provide information about relevant topics,
and help to select adequate assessment methods.
Involving instructors in the test development, administration, and rating can
also strengthen the link between instruction and testing. For example, instructors
can contribute to the development of the rating scale by providing descriptors
with which they are familiar and which they may already use during classroom
assessment. Given that instructors are often raters for local tests, their participa-
tion in the scale development facilitates their interpretation, internalization, and
use of the scale during the rating process.
In the following chapters, we will turn towards central aspects of the actual
test development process.
Further reading
Dimova, S., & Kling, J. (forthcoming). Current considerations on ICL in multilin-
gual universities. In S. Dimova & J. Kling (Eds.), Integrating Content and Lan-
guage in Multilingual Universities. Dordrecht: Springer.
References
ACTFL. (2004). ACTFL proficiency guidelines (Revised). Hastings-on-Hudson, NY:
American Council on the Teaching of Foreign Languages. Retrieved January 1, 2004
from the World Wide Web www.sil.org/lingualinks/LANGUAGELEARNING/
OtherResources/ACTFLProficiencyGuidelines/ACTFLProficiencyGuidelines.htm
Alderson, J. C. (Ed.). (2002). Common European framework of reference for languages: Learning,
teaching, assessment: Case studies. Strasbourg: Council of Europe.
Angoff, W. H. (1988). Validity: An evolving concept. In H. Wainer & H. I. Braun (Eds.),
Test validity (pp. 19–32). Hillsdale, NJ: Lawrence Erlbaum.
Black, P., Harrison, C., Lee, C., Marshall, B., & Wiliam, D. (2002). Working inside the black
box: Assessment for learning in the classroom. London: King’s College London, Department
of Education and Professional Studies.
Black, P., & Wiliam, D. (1998). Assessment and classroom learning. Assessment in Education, 5(1), 7–72.
Brindley, G. (1991). Defining language ability: The criteria for criteria. In S. Anivan (Ed.), Cur-
rent developments in language testing (pp. 139–164). Singapore: Regional Language Centre.
Brindley, G. (1998). Outcomes-based assessment and reporting in language learning pro-
grammes: A review of the issues. Language Testing, 15, 45–85.
Brown, C., Boser, U., Sargrad, S., & Marchitello, M. (2016). Implementing the every student
succeeds act: Toward a coherent, aligned assessment system. Washington, DC: Center for
American Progress.
Chapelle, C., Chung, Y., Hegelheimer, V., Pendar, N., & Xu, J. (2010). Towards a computer-delivered test of productive grammatical ability. Language Testing, 27, 443–469.
Cronquist, K., & Fiszbein, A. (2017). El aprendizaje del inglés en América Latina. Retrieved from www.thedialogue.org/wp-content/uploads/2017/09/El-aprendizaje-del-ingl%C3%A9s-en-Am%C3%A9rica-Latina-1.pdf
Deygers, B., Zeidler, B., Vilcu, D., & Carlsen, C. H. (2018). One framework to unite
them all? Use of the CEFR in European university entrance policies. Language Assess-
ment Quarterly, 15(1), 3–15.
Dimova, S., & Kling, J. (forthcoming). Current considerations on ICL in multilingual
universities. In S. Dimova & J. Kling (Eds.), Integrating content and language in multilingual
universities. Dordrecht: Springer.
Educational Testing Service. (2019). Performance descriptors for the TOEFL iBT® test. Prince-
ton, NJ: Educational Testing Service.
WIDA. (2018). Alternative Access for ELLs: Interpretive guide for score reports, grades 1–12. Madison, WI: Board of Regents of the University of Wisconsin System, on behalf of WIDA.
Wiliam, D. (2004). Keeping learning on track: Integrating assessment with instruction.
Invited address to the 30th annual conference of the International Association for Edu-
cational Assessment (IAEA) held June 2004, Philadelphia, PA.
Wiliam, D., & Thompson, M. (2013). Integrating assessment with learning: What will it
take to make it work? In C. A. Dwyer (Ed.), The future of assessment: Shaping teaching and
learning (pp. 53–84). New York: Routledge.
Yan, X., Thirakunkovit, S., Kauper, N., & Ginther, A. (2016). What do test-takers say: Test-taker feedback as input for quality control. In J. Read (Ed.), Post-admission language assessments of university students (pp. 113–136). Melbourne, AU: Springer.
3
LOCAL TEST DEVELOPMENT
You are embarking on a new test development project because you have
a testing need in your local context – you need a new test or your current test
fails to fulfill the testing purpose, and none of the off-the-shelf tests seem appropriate for your context. For example, you need to screen prospective international
teaching assistants (ITAs) for oral English proficiency, but your institution lacks
an oral English proficiency test or the current one has become inadequate for
the intended testing purposes. You are tasked with test development, so the first
question that comes to mind is: where do I start?
In this chapter, we will discuss the basic considerations during the develop-
ment of a local language test. We will address the importance of specifying the
purpose in the test specification design and the role of needs analysis in the pro-
cess of item type selection and context representation.
Introduction
Test development has traditionally been viewed as a cyclical process that can be
divided into three main stages: design, operationalization, and administration.
Activities undertaken during the design stage include defining the test purpose,
the test-takers, the task types, and the resources; activities undertaken during the
operationalization stage include specifying the test structure, the tasks, and the
scoring methods. The administration stage comes last, when the designed test
tasks are administered and test data are collected. The data gathered in the
administration stage are then evaluated to inform the further design and opera-
tionalization, which makes the process cyclical (Figure 3.1).
This model outlines the reciprocal relationships among the stages: decisions
made in one stage influence activities in the next, while data gathered later
inform revisions of decisions made in earlier stages. However, the cyclical
model cannot capture the overlap of activities across all stages of the develop-
ment process of local tests. Despite best efforts to streamline the test develop-
ment process into distinct stages, the design and operationalization stages tend to
be blended, and test developers regularly move back and forth between decisions
about test structure and task types, language ability descriptions, and resource
allocation.
Based on our experience with local tests, instead of distinct stages, we concep-
tualize the test development process as sets of different activities that are related
to test planning, design, and implementation (see Figure 3.2). Activities related
to planning include: 1) identifying the test purpose, 2) identifying the test-
takers, 3) budgeting the test development and maintenance costs, 4) identifying
available local resources (people, materials), and 5) building the test development
team. Then, design activities relate to 1) designing tasks, 2) developing test
structure, 3) establishing comparable forms, 4) establishing test administration
procedures, and 5) piloting the tasks and administration. The planning and
design activities, which tend to run concurrently in local test development,
precede test implementation and continue after the test becomes operational.
The blue arrows in Figure 3.2 represent the ongoing evaluation that other test
development models represent as a cycle.
Development costs
The management of a test development project depends on the same constraints
as any other project, namely time, cost, and scope (see Figure 3.3). In other
words, you want to design a test with certain features within a given time
period on a specific budget. The quality of the test will depend on how you
define the scope, manage the budget, and allocate your time. Of course, to
make sure your test adequately fulfills its purpose, you will need plenty of time
and a good budget, but all resources are limited. In local test development pro-
jects, the greatest constraint is usually the budget. Schools, universities, and
other institutions involved in test design have limited financial resources, and
thus developing a good test can take a long time.
The project management triangle presented above is intended to highlight the
trade-offs across quality, scope, time, and cost in any development project. In
other words, the quality is constrained by the project’s budget, deadlines, and
scope (or the project size). For example, if you want to develop an inexpensive
test in a short period of time, the test’s scope should be reduced or a trade-off
in quality should be tolerated. High-quality projects require extended time if
costs are low or increased costs if time is short.
The costs are associated with test development, administration, and mainten-
ance. When tasked with test design, test developers primarily focus on develop-
ment expenses. However, it is a good idea to also plan for the test’s
sustainability. In other words, a local test will have to adjust to the transitions in
the local context because the availability of local resources will change over
time, and the test will need to accommodate technological advances and shifts in
institutional policies.
While these shifts may lead to additional costs, the good news is that programs
can often absorb costs by redistributing workloads and responsibilities and allow-
ing a generous amount of time for a test to be fully embedded in the local con-
text. As mentioned in previous chapters, the OEPT, now fully embedded in the
instructional program at Purdue, is part of a larger testing system that includes
a practice test, a video tutorial, a rater training program, a rating platform, and
a database; however, the parts of the system were developed over a period of
five years, and each part requires regular revision and updating. Unlike large-
scale tests, local tests can more readily adapt to change, so test development is
iterative and ongoing. In other words, it is never completely finished.
Test design activities involve the utilization of human and material resources
for test development, administration, and maintenance. Human resources include
both the core and external team members who participate in the different test
development and administration activities, while material resources refer to
equipment and space. In the following sections, we provide lists of team
member roles and material resources that test developers need during different
testing activities. Although you may have to outsource some of these activities
and purchase new equipment, many human and material resources are probably
already available in your institution. Therefore, identifying existing resources as
part of your planning activities is crucial so that you can maximize their use.
If you work in a university context, for example, you may have access to
a graphic design department, an IT department, or a computer science depart-
ment, which can offer both human and material resources for test
development. Your institution may have also purchased software licenses for
operating systems and audio/video editing applications, as well as licenses for
different learning platforms. Using rooms at your institution for test
administration can likewise reduce costs.
Human resources
Team members can participate in test development and administration as item
writers, scale developers and raters (in the case of performance-based tests),
graphic designers, actors/speakers and audio/video editors (if developing
listening items), software developers (if designing a digitally enhanced test), and
test administrators.
The item writers are responsible for the design of the different items/tasks
included in the test, which includes writing the test and task instructions, task
prompts, and the actual tasks. Test instructions are the general instructions test-
takers receive regarding the test structure and length (number of items and
approximate time), as well as information about whether additional resources
(dictionaries, conjugation tables, notes) are allowed. Test instructions may also
include information about how to navigate through the test and how to play audio
input and record responses, especially when the test is digitally delivered. Item
writers are also responsible for collecting existing texts, graphics, and audio and
video files to include in the test.
When designing a performance-based test, some team members will be
responsible for development of the rating scale, i.e., determining the scale levels
and level descriptors. When responses from performance-based tests are col-
lected, raters are trained to rate written or oral performance-based test responses
and to assign scores based on established criteria. You can read more about
scales, rating criteria, and rater training in Chapters 6 and 7.
Regardless of whether the test is paper or digitally delivered, it needs to be
graphically designed so that the representation of textual and visual information
is professionally formatted and easy to navigate. In some cases, pictures and
graphics must be designed specifically for the test so that a similar style is
maintained across all tasks. For example, when developing an English language test
for first graders in Danish elementary schools, engaging illustrations related
to different themes and pictures with recurring elements (e.g., the same charac-
ter in different situations or positions) were needed so that the children would
recognize characters and contexts across all tasks. For that purpose, an illustrator
was hired to draw all of the pictures for the test.
When test instructions and prompts are provided in both textual and audio/
video form, and if the test has listening tasks, actors may be hired to record the
voiceovers or video inputs. Once the input is audio/video recorded, audio/
video editors are engaged to edit the recordings. In the case of digitally delivered
tests, software developers are responsible for designing a platform for test deliv-
ery and response collection, as well as a system for data management.
Finally, test administrators often include clerical staff (those responsible for test
registration, score report generation, and liaison with stakeholders) and
test proctors (invigilators, or those who set up and monitor the actual testing
session). In direct oral tests, the examiners and/or the raters may also take on the
role of test administrators.
• The test coordinator organizes the work of the testing team, conducts
continuous test analyses, and plans test revisions and updates. The test
coordinator also serves as a rater.
• The two test administrators (clerical staff) are responsible for liaising
with stakeholders and external collaborators, as well as purchasing and
maintaining technical equipment.
• The five raters are responsible for administering the test sessions, rating
performances, recording scores, and providing written and oral feed-
back reports to test-takers.
• Local and external collaborators are also engaged to provide consult-
ancy and expertise when needed.
• A local IT specialist from the IT department at the university provides
consultancy regarding selection of digital, technical, and infrastructural
solutions for test data recording and storage. He also liaises the com-
munication with external software developers regarding the design and
maintenance of the test database.
• External software developers were hired to design a database for stor-
age of test data and an online interface for test data entry and retrieval.
Material resources
Material resources include the necessary equipment and space for test produc-
tion. We present some of the resources in this section, but you can read about
the selection of alternatives needed for the production of traditional or digitally
delivered tests in Chapter 5.
The types of equipment needed for test production may include computer
devices (desktops, laptops, tablets, smartphones), monitors, microphones (lapel
and/or room mics), video cameras, servers (or server space), Internet routers,
loudspeakers, headsets, scanners, and printers. The pieces of equipment that you
need will depend on the task types and test delivery platform. For example,
when developing tests of writing, video cameras, microphones, and loudspeakers
may not be required if the task prompt does not include audiovisual material.
Moreover, printers are rarely used if the test development is completely digi-
tized. Purchasing high-quality equipment for test development can be expensive,
and some equipment may need to be replaced in a few years when upgraded
versions become available.
The equipment needed during test administration may be similar to that used
during test development. The difference lies in the number of pieces and devices
needed, which depends on how many people take the test simultaneously.
Speakers, monitors, or projectors may be necessary if the input for listening tasks
is not delivered on computer devices. For online tests and digital data storage,
servers (or server space), internet routers, or hard drives may be needed. Oral
tests may require a microphone (lapel and transmitter if the test-taker is allowed
to move) and an audio or a video recording device (e.g., camera or computer).
If the oral test is digitally delivered, then a headset may be used so that the test-
taker can hear the instructions and prompts and then record responses, as is the
case with the OEPT. Moreover, if the recorded responses are stored remotely
on a server, an internet router may be required (see Chapter 8 for data storage
and management). In some cases, you can avoid purchasing equipment by
having test-takers bring their own equipment (e.g., laptop, headset) to the test
session, unless that compromises test security and administration consistency. In
Chapter 8, we discuss the security measures test developers should implement to
ensure reliable test administration.
Most of the equipment listed above functions with the use of certain software
programs and applications. Computer devices, for example, need updated oper-
ating systems and different word processing applications. Software for digitizing
and editing audio and video data (e.g., Adobe Premiere Pro, Corel Video
Studio Ultimate, CyberLink PowerDirector), as well as software programs and
applications for developing digital test delivery platforms, are usually part of the
test production process.
In addition to digital resources, other materials include paper and ink for
printing out paper-based tests. Whiteboards, markers, and other stationery may
be needed during meetings and planning activities of test production. Although
test booklets and pens or pencils are typical materials for paper-delivered tests,
even digitally delivered tests may require booklets with test information, as well as
scratch paper and pencils for note taking or planning responses.
Envelopes or folders may also be needed to collect paper-based responses at the
end of the testing session. Other materials could include dictionaries or conjugation
tables, if such resources are allowed during testing.
When thinking about material resources, the space needed for test administra-
tion must also be considered. The building where the test is scheduled to take
place should be able to accommodate the administration, and a room – or sev-
eral rooms – with a specific setup may need to be booked in advance. For
example, when the paper-based version of the EPT was administered, a large audi-
torium was reserved so that the test could be administered simultaneously to
many test-takers. For the administration of the OEPT, a single, relatively large
(40-station) computer laboratory is reserved so that the test can be administered
to many test-takers simultaneously with the use of standard university equip-
ment. However, because only 20 test-takers are scheduled per session, multiple
sessions must be run. The TOEPAS, on the other hand, requires
a different setup, i.e., a small room with long desks in the middle so that one
student and one rater can sit across from each other (see Figure 3.5).
Maintenance costs
Most tests need some revision after a certain period of time due to changes in
the technological infrastructure of the testing context, updates to software and
applications, and the modification of institutional policies. Maintenance costs are
very similar to development costs for both human and material resources. In
terms of digitally enhanced tests, you will need software developers to update
the digital platforms, which may involve item writers if the platform changes
lead to item revisions. There may also be a need for software updates or pur-
chases of additional software for test development, delivery, and response collec-
tion, as well as software needed for test data analysis (e.g., SPSS, SAS,
FACETS). Therefore, you may want to plan from the beginning to establish test
sustainability by acquiring financial and other resources over a certain period, so
that these resources are available when the time comes for large-scale revisions.
Table 3.1 provides an overview of resources related to test development, admin-
istration, and maintenance.
Test design
Once the test purpose is established and the available resources are identified,
test design activities can be planned. These activities include (1) designing tasks,
(2) designing the test structure, (3) developing comparable forms, (4) establishing
test administration procedures, and (5) piloting the tasks and procedures. Although
planning facilitates the completion of the different design activities, some recalibration
of costs and reallocation of resources can still be expected because unanticipated
situations and factors may occur. For example, with the English test for
elementary school children in Denmark, the technical infrastructure in the
schools was analyzed when planning the test development, but the problem of
intermittent internet connectivity was not identified until the actual pilot. This
led to difficulties with test delivery and response collection, so purchasing
internet routers and wireless data plans became necessary. This purchase was not
originally budgeted and, therefore, represented an additional cost. It is always a good
idea to reserve part of the budget (e.g., 10%) for unexpected needs.

TABLE 3.1 Resources needed for development, administration, and maintenance of tests
Instructions
Test and task instructions are important because they help test-takers familiarize
themselves with the test purpose, task format, and results. When developing the
instructions, you need to consider the content, specificity, language, and mode
of delivery.
The kind of information included in the instructions depends on when the
instructions are delivered to the test-takers, i.e., before, during, or after the test-
ing session. Instructions are distributed before the testing session mainly to
inform test-takers about the date, time, and location of the test administration,
as well as information regarding what they can or should bring to the test site:
identification (e.g., student ID, passport, driver’s license), materials (e.g., diction-
ary, notes, PowerPoint), stationery (e.g., pencil, eraser), and equipment (e.g.,
computer, tablet, headset). These instructions are usually distributed to test-
takers when they register for the test, most often in an email.
In addition to basic test instructions, information about the testing purpose,
description of the test structure and tasks, explanation of the scoring system, and
information about how the results are reported is useful before the test session.
Sample and practice tasks, as well as sample responses and scale descriptors, help
prospective test-takers not only to familiarize themselves with the tasks but also
to understand the rating criteria. Finally, test-takers may need information about
how the test results relate to national or international standards (e.g., linking the
score bands of the test and the CEFR levels), as well as the kinds of language
support that exist in the local context in case their test results are unsatisfactory.
All of these instructions can be distributed in test booklets and flyers that test-
takers receive when they register. Another option is to provide these instructions
on the test (or program) website so that test-takers are able to learn about the
test purpose, format, and procedures before they decide to register for it.
Example 3.5 discusses the distribution of instructions provided to test-takers
before the test.
OEPT: instructions
When we first developed the OEPT, we were aware that, because it was
a semi-direct, computer-delivered test of oral English proficiency, the format
represented a radical departure from what test-takers might be expecting.
While the semi-direct format has become much more familiar with the inclu-
sion of speaking in the TOEFL iBT, we could not assume familiarity when the
OEPT was introduced in 2000. The provision of adequate test preparation
materials was essential.
Our first attempts included a paper booklet that described the semi-direct
format, computer-based administration, tasks, and scale. This was clearly
inadequate because of the mismatch between the paper-based description and
the computer-based format of the test itself, but it was the best we could
do at the time. In 2002, we provided prospective test-takers with a CD ver-
sion of an entire practice test along with the information that was provided
in the paper-based test preparation booklet. This solution was better than
the booklet alone, but departments had to distribute the CDs to students
and/or provide the materials in a computer lab. Test-taker completion rates
for the practice test were only around 50%.
Our third solution was developed in 2006. The online OEPT practice test
and tutorial consists of two practice tests, the OEPT scale, two sets of
sample passing responses, and a video tutorial introducing the Purdue
instructional and ITA contexts. We track completion of the practice test and
have achieved 80–90% completion rates across departments. We believe
that is about as high as it will ever be. The graduate school and departments
alert incoming graduate students about the need to be certified for English
proficiency, and students are also informed about OEPT test preparation
materials when they register.
OEPT practice test materials are freely available and can be accessed at:
http://tutorial.ace-in-testing.com/Default.aspx?p=test
The instructions provided during the test sessions are related to the overall
test structure (e.g., the number and order of sections, time allotted for each sec-
tion, breaks between sections, number of items per section) and the tasks (how
to complete the task and how to record and submit responses). Examples of tasks or
warm-up questions are also included to give test-takers an opportunity to prac-
tice before they respond to rated/scored tasks. In digitally delivered tests,
instructions about how to navigate the test, adjust the different settings (e.g.,
volume, light), record oral and written responses, and submit them are
essential (see Figure 3.7).
The instructions provided after the test usually relate to when and how test-
takers will receive their results and whom they should contact if they have ques-
tions or complaints.
All instructions need to be straightforward and simple in order to avoid the pos-
sibility that the test-takers would misunderstand or overlook important informa-
tion. If you include too many details, the test-takers may experience difficulties
remembering them, which could result in confusion. Every attempt should be
made to move the test-takers through the test as easily and comfortably as possible
because you want their performance to reflect the ability being tested, not their
familiarity with the format and procedure. Repeating important instructions in
different places or at different points in time may also be useful. The specificity of
the instructions depends on the test-takers. If they are children, then the instruc-
tions should probably be short, simple, and symbolic. Once the instructions are
developed, their clarity can be confirmed through a pilot.
In local contexts where the test-takers share the same first language (L1), another
consideration regarding instructions is the selection of language, i.e., L1 or target lan-
guage (L2). Some would argue that providing instructions in the test-takers’ L1
enhances comprehension, especially among test-takers at lower L2 proficiency levels.
In many tests, general information regarding the test is provided in the test-takers’ L1,
while the instructions during the test are in their L2. The TOEPAS is an example
where the instructions sent to the registered test-takers before the session are in both
Danish and English, while the instructions during the test are in English (L2) only.
In addition to language, the selection of instruction delivery modes (textual,
audio, video, visual) should also be considered. If you have a digitally enhanced
test, for example, you could provide the instructions in both textual and audio
modes to maximize test-takers’ comprehension. Digitally enhanced tests also
allow for the inclusion of video instructions, especially when the instructions are
about a certain process (response recording or navigating). Video instructions
may be particularly useful for younger test-takers who are accustomed to using
YouTube as a source. More information about the opportunities that digitally
enhanced delivery offers can be found in Chapter 5.
Task format
The task format refers to the type of output (i.e., response) a task elicits, the number of
language skills it involves, and the mode of delivery to test-takers. Tasks may contain
items or prompts (see Figure 3.8). Items elicit selected responses (i.e., test-takers select
the correct response from a set of multiple-choice options) or limited responses (i.e.,
test-takers insert single words, phrases, or short sentences). An example of a limited-
response task is the cloze test, where the test-taker is asked to enter words that have been
removed from a written text in order to assess the test-taker's reading skills (see Chapter 4
for more examples of limited-response tasks). Tasks may also contain prompts that elicit
constructed responses, which require oral or written language production in the form of
short or extended discourse; these are also known as performance-based responses
(e.g., essays, presentations, dialogues) (see Chapter 4 for more examples).
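To make the notion of a limited-response task concrete, the following minimal sketch
(in Python, which is not used by any of the tests described in this book) generates a
fixed-ratio cloze passage by blanking every nth word of a source text. The passage, the
deletion ratio, and the function name make_cloze are all invented for illustration.

# Minimal sketch: build a fixed-ratio cloze task by blanking every nth word.
# The passage, the deletion ratio, and the function name are illustrative only.

def make_cloze(text: str, every_nth: int = 7, start: int = 3):
    """Return (cloze_text, answer_key) with every nth word blanked."""
    words = text.split()
    answers = {}
    for i in range(start, len(words), every_nth):
        answers[len(answers) + 1] = words[i]        # store the deleted word
        words[i] = f"({len(answers)}) ________"      # numbered blank
    return " ".join(words), answers

passage = (
    "The university library offers quiet study rooms that students can "
    "reserve online for up to three hours per day during the semester."
)

cloze_text, key = make_cloze(passage)
print(cloze_text)
print(key)

In practice, developers would review machine-generated blanks and adjust them so that
untestable or trivial words are not targeted blindly.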
In terms of number of skills, you must decide whether the tasks should focus on
one skill (single-skill tasks) or whether they should involve the use of several skills
(integrated-skill tasks). In single-skill tasks, the test-taker needs to activate only one
language skill (e.g., speaking, listening, reading) in order to complete the task. In
integrated-skill tasks, the test-taker needs to activate different language skills in order
to complete the task. For instance, if test-takers are required to write an essay based
on a written text, then they will not be able to complete the task successfully if they
lack reading comprehension or writing skills. Understanding the text would be
insufficient if the test-taker lacks the ability to write an essay, and writing an essay
would be insufficient if the test-taker cannot understand the text.
The delivery mode of tasks (e.g., paper, digital, interview) can influence the ease
of task administration and response collection. The design of integrated-skill tasks is
easier in digitally enhanced tests because these tests offer solutions for multimodal
representation of information. For example, the audio playback controls and the writing
field can appear on the same screen in tasks that integrate listening and writing, which
simplifies task delivery. However, despite the convenience that digital devices
(e.g., computers, tablets) offer for task delivery, you have to consider whether digital
delivery is relevant for the purpose of your test. The delivery mode is discussed in Chapter 5.
Input
The task input is the content – be it linguistic, nonlinguistic, or both – with which
the test-taker is presented in order to elicit the relevant language skill for the assess-
ment. In listening tasks, the input is the audio or video information that test-takers are
required to listen to or watch and process in order to complete the task (Brunfaut &
Révész, 2015; Vandergrift, 2007). The input in reading tasks is in the form of verbal
(texts) and non-verbal information (e.g., pictures, graphs, tables). Factors like text
length, topic, and genre familiarity may influence performance (Alderson, 2000).
Test-takers are required to process the written text and non-verbal information, if
present, so that they can provide responses for the reading tasks. Existing audio/video
recordings or texts that are publicly available could be selected or they could be
created specifically to serve as input in the reading and listening tasks. You may intui-
tively select existing materials or develop your own based on topics that you find suit-
able for your test-taker population. If you select existing materials, you need to make
sure you have copyright permission to use them in the test.
In order to make sure that the input is at a specific level of difficulty, you need to
analyze its lexical, syntactic, and discourse complexity, as well as the explicitness of
the information. For audio/video input, you also need to analyze the phonological
complexity, that is, the speed, rhythm, and pitch of the speech, as well as the frequency
of elisions, i.e., omissions of sounds due to the use of reduced forms (e.g., "I'm gonna"
or "I'ma" instead of "I am going to"). The more linguistically or phonologically complex
the input is, the more difficult it will be for test-takers to process.
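As a rough illustration of how such an analysis might begin, the hypothetical Python
sketch below computes three quick indicators of written input difficulty: average
sentence length, type-token ratio, and the share of words falling outside a small
high-frequency list. The word list shown is a tiny placeholder rather than an actual
frequency list, and these counts are only a screening device, not a substitute for
careful review of candidate texts.

# Rough, illustrative indicators of text difficulty for screening candidate input.
# The high-frequency word list below is a tiny placeholder, not a real wordlist.
import re

HIGH_FREQ = {"the", "a", "an", "is", "are", "was", "were", "and", "or", "but",
             "to", "of", "in", "on", "for", "with", "it", "this", "that", "they"}

def difficulty_indicators(text: str) -> dict:
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[a-zA-Z']+", text.lower())
    if not words:
        return {}
    return {
        "words_per_sentence": len(words) / max(len(sentences), 1),
        "type_token_ratio": len(set(words)) / len(words),
        "share_off_list": sum(w not in HIGH_FREQ for w in words) / len(words),
    }

print(difficulty_indicators(
    "The committee postponed the vote. Its members requested further analysis."
))

Higher values on all three indicators generally suggest a more demanding text, but such
counts only support, and never replace, expert judgment about the input.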
Regarding input for listening and reading tasks, the number of times that test-
takers can access the input, the possibility of taking notes, and the time that the
test-takers have to process the input all require consideration. For example, can
the test-takers hear the audio input only once, twice, or can they replay differ-
ent parts if there is a need for it? While listening to an audio input only once
may increase authenticity, allowing test-takers to listen twice enhances the bias
for their best performance. Allowing test-takers to take notes while listening to
or watching long recordings will help them avoid mistakes in their responses
caused by forgetting details due to working memory overload.
In performance-based speaking and writing tasks, the input material that is used
to elicit the extended responses may be included in the prompt. The prompt
should provide sufficient contextual and linguistic cues, which will stimulate test-
takers’ interaction with the task in the intended manner. For example, a prompt
for a writing task may include information about the context, the genre, the audi-
ence, and the length of the task response. Figure 3.9 below presents an example of
a prompt for a speaking task from the OEPT. Three elements are present in the
prompt: 1) introduction (establishes the context and audience), 2) input (genre
and the topic), and 3) question (prompts the response). The introduction allows
test developers to represent the local context in which test-takers will use the
target language outside of the test. This contextualization enhances test-takers’
interaction with the tasks and promotes elicitation of the relevant language func-
tions for the assessment purpose. While the OEPT’s introductions/contextualiza-
tions are minimal, test-takers appreciate having a better understanding about why
they are being asked to provide a particular response. These introductions also
serve to enhance test-takers’ understanding of the instructional contexts in which
they will be required to perform once they pass the test.
Local tests also present a unique opportunity for representation of endonorma-
tive (localized or indigenized) rather than exonormative (imported) language
varieties in the task input, which has been difficult to achieve in large-scale tests
because of their uses across a wide range of contexts with different linguistic
norms. This opportunity is particularly useful in English as a lingua franca (ELF)
contexts in which native-speaker linguistic and functional norms may not be
applicable.
FIGURE 3.9 Example of a prompt for a speaking task from the OEPT
Scoring method
Scoring refers to assigning numerical scores based on the level of task fulfillment
or response correctness. For selected- or limited-response tasks, such as multiple-
choice items or cloze tests, scoring is based on correct/incorrect judgments or
partial credit. If these items are digitally administered, then the scoring
method may be automated so that test-takers can receive their results as soon as
they finish the test. If the test has performance-based tasks, you will need to
develop a rating scale and train raters to use it. Chapters 6 and 7 provide
detailed information about the different types of scales, the scale development
process, and rater training. You can read about result reporting in Chapter 8.
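The difference between dichotomous and partial-credit scoring can be illustrated with
a small, hypothetical Python sketch; the item keys, the acceptable alternative answers,
and the half-credit rule below are invented for illustration rather than drawn from any
of the tests discussed in this book.

# Illustrative scoring of selected/limited-response items.
# 'key' holds the full-credit answer; 'acceptable' holds answers worth partial credit.
ANSWER_KEY = {
    "q1": {"key": "B"},                              # multiple choice: right/wrong
    "q2": {"key": "went", "acceptable": {"goes"}},   # cloze: partial credit possible
    "q3": {"key": "C"},
}

def score_item(item_id: str, response: str) -> float:
    entry = ANSWER_KEY[item_id]
    answer = response.strip().lower()
    if answer == entry["key"].lower():
        return 1.0                    # full credit
    if answer in {a.lower() for a in entry.get("acceptable", set())}:
        return 0.5                    # partial credit
    return 0.0                        # no credit

responses = {"q1": "B", "q2": "goes", "q3": "A"}
total = sum(score_item(i, r) for i, r in responses.items())
print(f"Total score: {total} / {len(ANSWER_KEY)}")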
Test structure
In order to elicit test-takers' performances that include different language func-
tions across target language domains, test developers often decide to include
more than one task in the test. Most books on language test development
recommend designing a detailed description, also known as a "blueprint," of the
test structure.
Comparable forms
For security reasons, designing multiple forms of the language test is recom-
mended. Two test forms may suffice in the beginning, one to administer in
regular sessions and the other in case some test-takers need to be re-tested (Bae
& Lee, 2011). A new form for every administration is developed when high test
security is essential or when the correct task responses are made public after
each test administration.
The multiple test forms have to be comparable, which means that each form
must have the same instructions, structure, and task types. The comparability of
test forms is analyzed by randomly assigning each of the two forms to test-
takers and then comparing the two groups' overall performance. Random form
assignment makes the two groups comparable in proficiency, so any statistical
difference between the groups' scores on the two forms can be interpreted
as a difference between the forms themselves. If the average performance of the
group who took form one does not differ meaningfully from that of the group
who took form two, then the forms can be considered comparable.
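As a sketch of what this comparison might look like in practice, the hypothetical Python
example below compares the total scores of two randomly assigned groups using an
independent-samples t-test (via SciPy). The scores are invented, and a real analysis would
also examine score distributions and effect sizes rather than relying on a p-value alone.

# Illustrative comparison of two randomly assigned groups taking Form A and Form B.
# The scores are made up; a real analysis would use the pilot or operational data.
from scipy import stats

form_a_scores = [34, 41, 38, 45, 29, 36, 40, 33, 37, 42]
form_b_scores = [36, 39, 35, 44, 31, 38, 41, 30, 35, 43]

t_stat, p_value = stats.ttest_ind(form_a_scores, form_b_scores)

print(f"Form A mean: {sum(form_a_scores) / len(form_a_scores):.1f}")
print(f"Form B mean: {sum(form_b_scores) / len(form_b_scores):.1f}")
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
# A non-significant difference is consistent with, but does not prove, comparability.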
You may have noticed that we are using the term comparable test forms instead
of parallel or equivalent test forms. In order to claim that the test scores are parallel
across forms, equating methods based on Item Response Theory (IRT) models
are applied to resolve the statistical differences among the different forms (Boek-
kooi-Timminga, 1990; Weir & Wu, 2006). In addition, equating involves the
use of anchor items to compare the performance of target subpopulations over
time. Large data sets are required to perform such analyses, and so they are
rarely used in local testing. Moreover, in local contexts, the ability to change
and adapt the test to different and evolving instructional efforts is prioritized
over the ability to track the characteristics of a subgroup over time (e.g., students'
performance on the Scholastic Assessment Test (SAT) in 1969 versus 2015).
Pre-operational testing
Three types of pre-operational testing are performed to ensure the quality of the
testing procedures: 1) pre-testing, or exploratory analysis of the tasks,
scale, administration procedures, and score reporting system; 2) pilot testing, or
a small scale trial used for basic item analysis; and 3) field testing, or confirma-
tory analysis of the test’s psychometric properties (Kenyon & MacGregor, 2012).
In this section, we will focus on pre-testing and pilot testing because pre-
operational field testing requires the collection of large data sets, which are
rarely available in local contexts. In large-scale tests, the psychometric properties
of test items are further examined by embedding prospective items within
a regular administration and then abandoning those items that do not work psy-
chometrically. Local tests seldom have the opportunity to conduct extensive
embedded pretesting; however, quality control can be enhanced by pilot testing.
Pre-testing of the items, test administration procedures, and score reporting
system helps to examine whether all of these function according to the plan, or
whether adjustments are necessary. Pre-testing of items allows test developers to
identify major issues with item format and delivery, as well as response collection.
The pre-test can help identify problems related to the accessibility of the test loca-
tion, size of the room, seating plan, functionality of the heating/cooling system,
and levels of noise or possibility for disturbance. The functionality of the equip-
ment in the room, as well as the availability and speed of the internet should also
be tested. The day and time of the administration may affect the functionality of the
room and equipment (e.g., afternoons are noisier than mornings, the internet is
slower during the weekend), so you should also consider whether the chosen day and
time will affect test-takers' ability to participate in the test or their performance.
Finally, the pre-test allows the test administrators to try the planned procedures
(check-in, seating, instructions) and figure out whether these procedures need to be
adjusted. Most importantly, test administrators have an opportunity to check whether
they can handle the administration alone or whether more staff are needed.
Pilot testing requires the recruitment of a number of participants who are
willing to try out the test in order to gather data for analysis. Whether the tasks
are adequate for the testing purpose and whether they elicit the intended
performance can only be checked by administering them to test-takers.
Of particular interest is identifying and keeping those tasks that distinguish test-
takers across different proficiency levels. A task/item
may not be useful if test-takers at all proficiency levels can complete it, or if
high-proficiency test-takers struggle with it while lower-proficiency test-
takers complete it easily. Piloting will also allow you to check whether the test-
takers find the test and task instructions clear and easy to follow, and whether
colors, fonts, visuals, and screen or page layouts are distracting. The scales for
performance-based tasks also have to be piloted in order to examine whether
the raters can use the scale effectively to distinguish performances across profi-
ciency levels. You can read more about scaling in Chapter 6.
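For dichotomously scored pilot items, two basic statistics are usually enough to flag
problematic tasks: item facility (the proportion of test-takers who answer correctly) and
item discrimination (here, the correlation between an item score and the total score on
the remaining items). The Python sketch below is a minimal, hypothetical illustration
with an invented 0/1 response matrix, not data from any test described in this book.

# Basic pilot item analysis: facility and (corrected) item-total discrimination.
# The 0/1 response matrix is invented; rows = test-takers, columns = items.
from statistics import mean

def pearson(x, y):
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den if den else 0.0

responses = [  # columns: item1 ... item4
    [1, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
]

n_items = len(responses[0])
for i in range(n_items):
    item_scores = [row[i] for row in responses]
    rest_totals = [sum(row) - row[i] for row in responses]  # total excluding the item
    facility = mean(item_scores)
    discrimination = pearson(item_scores, rest_totals)
    print(f"Item {i + 1}: facility = {facility:.2f}, discrimination = {discrimination:.2f}")

Items that nearly everyone answers correctly (or incorrectly), or that correlate weakly or
negatively with the rest of the test, are candidates for revision or removal.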
Several different data collection methods can be applied during the pilot admin-
istration: basic item analysis, observation, feedback, and questionnaires. By observ-
ing the test administration procedures, test-takers, and rater behavior, certain
problems and inconsistencies can be identified and addressed before the test
becomes operational. For example, you may observe some confusion during the
setup of the equipment and, as a result, you develop clearer instructions. Collection
of feedback from all participants in the testing process (e.g., test-takers, administra-
tors, raters) can be performed through interviews, written comments, and focus
group discussions. When a large number of test-takers participates in the pilot,
feedback could be collected with the use of questionnaires. Test-takers can provide
informative feedback about the functionality of the testing procedure, the difficulty
or ease of tasks, equipment, and response recording, as well as test navigation and
score reporting. Test administrators’ feedback can include information about the
functionality and order of the testing procedures, the need to include more admin-
istrative staff, and the clarity of instructions for setting up the room and equipment.
The raters, on the other hand, can discuss the ease or difficulty of using the
scale descriptors to distinguish among proficiency levels, the need to include
additional descriptors that underline salient features of the level, and the ease/
difficulty of accessing responses and submitting scores.
Technical manuals
The purpose of the technical manual is to document the technical character-
istics of the test based on the research and development work completed in
the task and test development process. Therefore, technical manuals usually
describe the test's purpose, structure, administration and scoring procedures, and
the analyses conducted during development and revision.
Summary
In this chapter, we discussed different activities involved in the planning, design,
and implementation of test development. As you may have noticed, we avoided
recommending a specific order in which these activities should take place because
the test development approach you take depends on the contextual factors and
resources. However, we would like to provide you with a concrete example of the
flow of activities undertaken in an actual test development process (see Figure 3.10)
so that you can obtain an overview of the test development process in practice.
As you can see in Figure 3.10, the first stage of the development process of
TOEPAS focused on the establishment of the core testing team, analysis of the
local context, and description of the communicative language functions observed
in teaching situations in the local university context. Based on the context
analyses, a testing procedure was developed to reflect local language uses and
assessment values. Then, the rating scale descriptors and the result reporting
system were designed and piloted with 20 test-takers. In the last stages of the
test development process, the pilot data were analyzed, and the procedure and
scale were adjusted based on the results. Once the testing procedure and scale
were finalized, the raters were trained before the test became fully operational.
This chapter provided a general overview of the test development process. In
the following chapters, we provide more detailed discussions about task types,
test delivery, scaling, and data management.
Further reading
Fulcher, G. (1997). An English language placement test: Issues in reliability and
validity. Language Testing, 14(2), 113–139.
Fulcher (1997) investigates the reliability and validity of a placement test at the Uni-
versity of Surrey, which is designed for decision-making regarding English language
support provision for undergraduate and postgraduate students. Reliable and valid
test scores are crucial in achieving the purpose of the test and in minimizing the
possibility of students experiencing academic challenges or failures due to low
English language ability. The author evaluates reliability and validity comprehensively
by considering different quantitative estimates that complement each other: for
example, correlations, inter- and intra-rater reliability, logistic model fit, and Rasch
modeling for reliability, and correlation, principal components analysis, cut scores,
concurrent validity, content validity, and student feedback for validity. The results
demonstrate how various methods can be used effectively to
equate test forms and investigate the reliability and validity of a placement test that
has a small number of examinees and logistical and administrative constraints.
Henning, G. (1987). A guide to language testing: Development, evaluation, research.
New York: Newbury House.
As the subtitle suggests, this book introduces the technical principles of measure-
ment needed in the development of tests, course evaluation, and language-related
research. The first chapter introduces some of the key concepts in language testing,
such as test purposes and test types (objective vs. subjective, direct vs. indirect,
discrete-point vs. integrative, aptitude, achievement, and proficiency, criterion- or
domain-referenced vs. norm-referenced, speed vs. power). The next two chapters
introduce measurement scales and test data analysis; technical topics such as
the standardization of scores, Fisher's Z-transformation, and correction-for-guessing
formulae are described in these chapters. Other important topics covered
in the remaining chapters include item analysis, correlation and regression, reliability
and validity analysis, threats to reliability and validity, item response models, Rasch
analysis, and teacher and course evaluations. This book can be considered an
introduction to many technical and statistical principles that are important
when considering test development, evaluation, and research.
Lynch, B. K., & Davidson, F. (1994). Criterion-referenced language test devel-
opment: Linking curricula, teachers, and tests. TESOL Quarterly, 28(4),
727–743.
After presenting the difference between norm-referenced and criterion-referenced
testing, Lynch and Davidson (1994) argue that criterion-referenced testing can be
used to strengthen the relationship between testing and the curriculum. The authors
argue that one of the most important features of criterion-referenced language test
development (CRLTD) is the development of test specifications and discuss the pro-
cess of using CRLTD to develop and redevelop a well-written specification. They
also argue that there must be clear communication between test specification writers
and item writers, since CRLTD's primary focus is on refining the process of imple-
menting specifications to write test tasks. They also propose a "workshop
approach" to CRLTD, which can help classroom teachers develop the ability to
articulate their curricula's aims and objectives in test specifications and make con-
nections between those objectives and the test development process. The authors
mention the UCLA ESLPE and the UIUC placement test as examples of local tests
that have used CRLTD to refine their test specifications and test items.
References
Alderson, J. C. (2000). Assessing reading. Cambridge: Cambridge University Press.
Bachman, L. F., & Palmer, A. S. (1996). Language testing in practice: Designing and developing
useful language tests (Vol. 1). Oxford: Oxford University Press.
Bae, J., & Lee, Y. S. (2011). The validation of parallel test forms: ‘Mountain’ and ‘beach’
picture series for assessment of language skills. Language Testing, 28(2), 155–177.
4
TEST TASKS
This chapter presents a variety of task types that can be used to assess a number
of language skills, including lexico-grammar, listening, reading, speaking, and
writing. It briefly introduces the history of development of different task types,
which is closely related to theoretical movements in linguistics and psychology.
Examples of various task types designed for assessment of different languages are
also presented and the advantages and disadvantages of their uses are discussed.
In this chapter, we argue that, while there is no single best method to assess
language ability, awareness of different assessment methods enhances one’s test
development options.
Introduction
There are a thousand Hamlets in a thousand people’s eyes. Readers of Wil-
liam Shakespeare may each construct their own version of the character
based on their own experiences and interpretations, even though they all
read the same text. There is no single interpretation that is inherently super-
ior to the others. While this argument speaks directly about the situatedness
of interpretation, it also applies to how we judge the value and quality of
test tasks. A healthy mindset for evaluating language tasks is to think that all
methods can succeed and all methods can fail. Of course, it is possible to
develop some criteria to evaluate the quality of a particular language profi-
ciency measure, but that criteria should be embedded in the local context,
which includes the test purpose, development and scoring resources, and the
stakeholder network.
When developing a language test, you often start with tasks, especially if you
already have a clear idea of what you want to assess and why you want to assess
it. Sometimes, however, you might feel puzzled or even frustrated about which
tasks to use. Since the beginning of language testing as a field (Lado, 1961), the
tasks used in language tests have expanded from discrete-point items to open-
ended, integrated tasks. In this chapter, we survey the tasks used to assess differ-
ent components of language ability. Following a somewhat chronological order,
we review the popular language tasks in different eras of language assessment,
not to draw attention to fashions or trends, but to provide a range of options for
you to contemplate as a starting point for test development. We follow the ter-
minology in Chapter 3, Local test development, by using task as an umbrella term,
which is interchangeable with item when used to elicit selected responses and
with prompt when used to elicit constructed responses. Furthermore, an item
includes optional input materials, a stem, and options; a prompt includes optional
input materials and a question.
FIGURE 4.1 True/false reading item for young learners in English and German
(Tidligere sprogstart, https://tidligeresprogstart.ku.dk)
Me ___________ Alejandro.
A. llama
B. llaman
C. llamo
D. llame
understand language beyond the sentence level (McCray & Brunfaut, 2018; Winke,
Yan, & Lee, in press). Similarly, there have been doubts about whether elicited imi-
tation tasks measure imitation or the comprehension and reconstruction of meaning.
Research findings have suggested that elicited imitation tasks are a valid measure of
general language proficiency (Vinther, 2002; Yan, Maeda, Lv, & Ginther, 2016).
FIGURE 4.4 Fill-in-the-blank/cloze test of English vocabulary for first-grade EFL learners
(Tidligere sprogstart, https://tidligeresprogstart.ku.dk)
Reading
In the language testing literature, it is commonplace to regard some of the general
proficiency measures (cloze test, c-test) as reading tasks (e.g., Alderson, 2000). In
some ways, this is reasonable, as these tasks involve the processing of a written text.
However, we have classified those tasks under general language proficiency meas-
ures because of their relative simplicity in task design, as well as the skills and abil-
ities they are claimed to measure. In this section, we introduce reading tasks that are
more specifically designed to measure reading speed and reading comprehension.
Reading speed is typically measured by asking test-takers to read words, sentences,
paragraphs, or passages while recording the speed of reading. The focus of the meas-
urement is on speed while the readers read the text for comprehension. Reading flu-
ency is often assessed through oral reading tasks. Oral reading (sometimes referred to as
guided oral reading) is a popular instructional technique for teaching reading to young
learners in which a teacher (or parent) reads a passage aloud to model fluent reading,
and then the students reread the text (Figure 4.7). This technique has been demon-
strated to be effective with adult learners as well (Allen, 2016). The key to oral reading
as an instructional technique is to make sure that the text is at the appropriate reading
level, so that students are able to recognize, understand, and pronounce most (about 95%)
of the words. Using this technique, one can effectively assess students' reading levels
by asking them to read aloud texts at different complexity/grade levels. By assessing
the pronunciation, parsing, fluency, and disfluency patterns of the read-aloud, testers
are able to make judgments about the appropriate reading level of each student.
When students are reading above their level, their attention is often consumed by
a heavy amount of lexico-grammatical decoding, which interferes with comprehension.
Instructors can assess students’ comprehension by asking for a summary or having
them complete a comprehension task. Figure 4.7 is an example of an oral reading task
in Danish as a second language.
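The 95% guideline can be checked mechanically before a text is used with students. The
hypothetical Python sketch below computes the proportion of running words in a candidate
passage that appear on a list of words the learners are assumed to know; the word list here
is a placeholder for whatever graded wordlist or course vocabulary is available locally.

# Illustrative check of lexical coverage against a list of presumably known words.
# KNOWN_WORDS is a placeholder; substitute a graded wordlist or course vocabulary.
import re

KNOWN_WORDS = {"the", "dog", "runs", "in", "park", "every", "day", "a", "big",
               "and", "plays", "with", "ball", "children"}

def coverage(text: str, known: set) -> float:
    tokens = re.findall(r"[a-zA-Z']+", text.lower())
    if not tokens:
        return 0.0
    return sum(t in known for t in tokens) / len(tokens)

passage = "The big dog runs in the park every day and plays with the children."
print(f"Coverage: {coverage(passage, KNOWN_WORDS):.0%}")  # aim for roughly 95% or higher

A coverage estimate of this kind only narrows the pool of candidate texts; the final choice
still depends on judgments about content, genre, and the learners themselves.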
In contrast to oral reading (see Figure 4.7), silent reading tasks tend to be
employed to measure reading comprehension. A reading comprehension task
includes a reading passage followed typically by a number of selected-response
or short answer items. Depending on the target proficiency level, the passage
and items need to be carefully selected or modified in terms of genre (advertise-
ment, poster, narrative, or expository writing) and complexity (language and
content). The challenge in developing good reading comprehension tasks lies in
both the creation of a reading passage at a particular complexity level and the
creation of items following the reading passage.
There are a number of ways to manipulate the complexity level of the reading
passage. You can (1) alter the length of the passage, (2) alter the use of complex
syntactic structures and sophisticated lexical items, (3) manipulate the information
density or content complexity of the text, or (4) select texts from different genres
and registers. For example, when measuring the reading level of third-grade Eng-
lish language learners in the US, you might want to choose short narrative texts
and avoid argumentative or expository ones. Figure 4.8 is an example of a typical
reading comprehension task.
Listening comprehension
Measuring listening skills is arguably more challenging than assessing reading
skills. While listening tasks tend to assess a similar set of subskills as reading com-
prehension tasks, they also require test-takers to attend to speech characteristics
(e.g., speech rate, speaker accent) since the passage is conveyed orally. When
developing listening comprehension tasks, the speech characteristics discussed
below should be carefully considered.
The first speech characteristic to consider is register. In corpus linguistics and
speech sciences, research has examined the differences between written and
spoken discourse. In terms of complexity, written discourse tends to be structur-
ally more complex (e.g., featuring more nominalization) than spoken discourse,
whereas spoken discourse, especially dialogue, tends to feature more coordin-
ation and subordination. However, academic lectures or speeches tend to feature
an “intermediate” register between written and spoken discourse (Thirakunko-
vit, Rodríguez-Fuentes, Park, & Staples, 2019). Therefore, depending on the
target level and purposes of the test, you will need to select the register carefully
and script the stimulus accordingly. To make the listening stimulus authentic,
you should consider either adding disfluency features in a scripted speech or
using unscripted speech. In natural speech, disfluency is ubiquitous, as it is
normal for speakers to pause or repair mistakes. Thus, the listening material
might sound unnatural if it is completely free of disfluencies. You might assume
that adding disfluencies would confuse the listener or make the task unnecessar-
ily challenging, but in fact, research has shown the opposite; disfluencies at
expected places in speech either have little impact on listening comprehension
(Wagner & Toth, 2016) or help to alleviate the processing load for the listener,
thus making the task less difficult (Bosker, Quené, Sanders, & De Jong, 2014;
Fox Tree, 2001).
The second characteristic to consider is the accent of the speaker. In recent
discussions in the field of language testing and applied linguistics, the nature of
communication and interaction in English has been problematized by views
from the World Englishes (Nelson, 2012), English as an International Language
(Matsuda & Friedrich, 2011), and English as a Lingua Franca (Pickering, 2006)
perspectives. Researchers have gradually recognized that conversations in English
happen as frequently between speakers of different varieties (e.g., British and
American English) or between non-native speakers (e.g., English speakers from
South Korea and Spain) as they do between native speakers of the same variety.
Thus, from the perspective of authenticity, listening tasks should incorporate
a variety of accents and dialects. However, inclusion of multiple varieties intro-
duces fairness issues; while it is likely that English learners around the world are
familiar with British or American accents, due largely to film and other media, it
is less certain that they have heard Korean or Spanish accents before. Language
testing and applied linguistics researchers have started to explore the impact of
incorporating different varieties of English on test-takers’ performance on listen-
ing tasks (Kang, Thomson, & Moran, 2018; Major, Fitzmaurice, Bunta, & Bala-
subramanian, 2005). Those studies find that listeners tend to have an advantage
when processing the accent of their L1, but it remains unclear whether or not
that same accent creates a disadvantage for test-takers from different L1 back-
grounds. While the jury is still out on the impact of different varieties of Eng-
lish, test developers can justifiably incorporate them into listening tasks for
a local test, especially when it is important for the students to be able to under-
stand particular varieties represented in the local assessment and learning context.
For example, when developing an occupational English test in Thailand, one
might wish to include English accents from the ASEAN (Association of South-
east Asian Nations) community, as test-takers are more likely to use English to
communicate with speakers from neighboring nations in Southeast Asia (e.g.,
Vietnam, Malaysia). In contrast, it might be problematic, or at least unnecessary,
to include these same varieties in a listening test for test-takers in Latin America.
The last characteristic to consider is the inclusion of visual aids for listening tasks.
With recent advances in technology, digitally-delivered language tests have become
increasingly common (see Chapter 5 for more about technology). This change
makes it easier to incorporate visual aids, which can provide contextual or content
information. The purpose of these visuals is to either help test-takers activate their
schemata (prior knowledge and experiences related to the content of the passage) or
to help alleviate the cognitive pressure of processing meaning while listening.
Figure 4.9 provides an example of a context visual and a content visual for a listening
task on the Sun, Earth, and Moon for young learners. However, research has sug-
gested that content visuals tend to have a stronger impact on test-taker performance
on listening tasks (e.g., Ginther, 2002). Thus, test developers should consider the
choice of visual aids according to both the complexity of the listening material and
the target proficiency level of the test-takers.
Speaking
How to assess speaking is a longstanding question in the language testing litera-
ture, since it involves multiple components, including pronunciation, vocabu-
lary, grammar, coherence, task completion and, in some cases, knowledge of
a particular content domain. In linguistics, the study of language has been more
concerned with how language is spoken than with how it is written. The assessment of lin-
guistic and related psychological abilities is often mediated through the speaking
mode. For example, in a particular assessment, a speaker may be prompted to
pronounce minimal pairs of sound contrasts, repeat words, phrases and sentences,
read aloud pre-scripted sentences and paragraphs, converse with another inter-
locutor, deliver a public speech, engage in a group discussion, and/or interpret
from one language to another. While speaking is a productive skill, the targeted
knowledge and abilities underlying this skill will vary according to the assess-
ment context, purpose, and tasks.
Tasks used to elicit the production of sounds, words, phrases, or even sen-
tences are common in linguistic and psychological experiments. These tasks are
often designed to assess the knowledge of specific linguistic structures and may
be viewed as quite limited or undesirable, if you consider speaking skills broadly
as the ability to communicate with another interlocutor. However, these tasks
can also be quite effective when your purpose is to assess subskills of oral com-
munication such as the pronunciation of words, the automaticity of using certain
grammatical structures, or sentence stress and intonation patterns. Similar to MC
items, elicitation tasks are not easy to develop, as test developers need to control
or take into account the impact of other linguistic structures and socio-cognitive
factors in language processing. For example, it might take a group of item
writers anywhere from one month to a year to develop a set of elicited imi-
tation items for beginning-level learners in a Chinese program at an American
university. Because the learners are of low proficiency and may have limited
exposure to Chinese languages and cultures, item writers will need to carefully
select words and grammatical structures that will both cover the key lexico-
grammar in the curriculum and provide meaningful and natural sentences that
are comparable to the proficiency levels of the examinees. In fact, to assess
speaking, or language proficiency in general, for low-level learners, controlled
speaking tasks tend to be more effective, as handling authentic and complex lan-
guage tasks can be too challenging for them.
In the assessment contexts for higher-proficiency learners, speaking ability tends
to be conceptualized and operationalized in a much narrower sense. That is,
speaking assessment tends to involve a holistic evaluation of someone’s ability to
achieve a particular communicative purpose. In these contexts, controlled speaking
tasks become less useful than open-ended ones; moreover, the open-ended tasks
are often embedded in a specific purpose (e.g., speaking for academic study, con-
tent instruction, business transactions, and nurse-patient interaction).
Based on the interactive modes of communication, we distinguish two broad
categories of speaking tasks: tasks eliciting dialogues (direct) and tasks eliciting
monologues (semi-direct). Dialogues tend to be used to assess conversation
skills, which can be measured through both one-on-one interviews and
group oral tasks. In interview tasks, the interviewer, who is often the rater/
examiner as well, asks a list of questions to get the examinees to share personal
information and opinions on a particular topic. In group oral tasks, the exam-
inees are asked to share opinions on a particular topic, play a game, or work
collaboratively to solve a problem. Table 4.4 illustrates the typical conversation
tasks used to measure speaking ability.
Monologic speaking tasks can be classified into two subcategories: narration
tasks and argument-based tasks. Narration tasks typically ask test-takers to describe
a particular story or scene depicted in a picture. These tasks can be used to meas-
ure general speaking proficiency for both lower- and higher-proficiency learners
and for both young and adult learners. Figures 4.10 and 4.11 show examples of
picture narration tasks for adult learners and young learners respectively.
TABLE 4.4 Conversation tasks for speaking

Face-to-face interview
• Tell me a little bit about your … (e.g., hometown, area of study, favorite movie)
• What do you think about … ? (e.g., a current event, cultural value)

Group oral task
• Opinion exchange
  ○ Current events
  ○ Cultural values
  ○ Educational policy
• Game
  ○ Word puzzles
  ○ Desert survival
• Problem solving
  ○ Develop a lesson plan
  ○ Develop a treatment plan for a patient
FIGURE 4.10 Picture narration task for ESL university students
Writing
Writing tasks vary according to the context and target proficiency level. In the
beginning stages, writing tasks can include sentence completion and paragraph writ-
ing tasks. In ESL contexts, especially in higher education contexts, writing assess-
ment typically involves composing essays, and argumentative essays in particular, to
measure students’ academic writing ability. However, it is also important to measure
students’ writing skills in other genres (e.g., email, memoir) to capture a wider spec-
trum of academic writing. The independent and integrated writing tasks used in
English for academic purposes in higher education contexts resemble the speaking
tasks discussed above, with the only differences being that (1) the task asks exam-
inees to write an essay, and that (2) the scoring places a stronger emphasis on rhet-
orical features (e.g., organization and coherence) than in speaking. In this section, we
will focus on sentence and paragraph writing tasks and email writing tasks.
In the early stages of writing development, students learn letters, words, and
then sentences. Later, they learn how to write a paragraph by composing and
organizing multiple sentences around a common topic. These skills form the
foundation for essay writing in different domains. Sentence and paragraph writing
tasks can effectively assess foundational writing skills. Sentence writing tasks
require the test-takers to know how to group individual words together in
a grammatical and meaningful way. Paragraph writing tasks require knowledge
about how to write an effective topic sentence, provide and elaborate on support-
ing details, and use discourse markers to connect sentences effectively in support
of the topic sentence. Although paragraph writing is more common for lower-
proficiency learners or in the earlier years of language learning, based on our
experience, even students in higher education contexts can face challenges in pro-
ducing an effective paragraph. These skills become more important in advanced
L2 writing programs, where the ultimate goal is to prepare students to write in
different genres (e.g., research paper). Thus, paragraph tasks should be under con-
sideration when assessing writing skills for learners across multiple levels.
Another writing skill that is often overlooked is the ability to write an email.
You might be tempted to think that essay writing is the most important lan-
guage skill for college students, but many of them spend an average of 1 to 2
hours per day reading and writing emails. The ability to write an appropriate
email can have an impact not only on learning, but on social relations in both
school and the workplace. However, in most second and foreign language edu-
cation settings, email writing is not emphasized. Email writing tasks tend to be
quite homogeneous, with the prompt creating a scenario that specifies the
intended audience, context, and intended speech act. The writer is expected
to evaluate this information to identify the register and other prag-
matic considerations, and then write the email accordingly. Thus, the scoring of
email writing will likely include pragmatics, along with grammaticality. An
example of an email writing task can be found in Figure 4.12.
While an email task (as in the example in Figure 4.12) appears to be easy to
develop, test developers need to consider the pragmatic functions of email writing;
that is, an email is often intended to achieve a specific purpose (e.g., a request)
and delivered to a particular audience (e.g., professor). This differentiates email
writing from other types of writing such as argumentative essays or narratives. An
email task should not elicit writing performances that deviate substantially from
the characteristics of an email. You might often find email writing tasks (especially
for young learners) that require test-takers to write longer, narrative texts. For
example, an email writing prompt might simply state, “Imagine you visited Berlin
last weekend. Write about what you did during the visit.” This kind of prompt
lacks information about the target genre or audience, making it difficult for the
test-takers to write an authentic email. In addition, this type of email writing task
might also elicit writing performances that are similar to a regular narrative or
argumentative essay task. Thus, depending on the purpose of the local test, it
might be fruitful to include a mix of essay and email writing tasks, to obtain
a more comprehensive evaluation of students’ writing ability.
Summary
So far, we have provided a survey of tasks that can be used to assess language
performance. In this summary section, we provide a few recommendations for
the selection and control of language tasks. In terms of selection, as we argued
at the beginning of this chapter, all methods both succeed and fail. The selection
of language tasks should be governed by the purpose of the test and the target
proficiency levels of the test-takers. While certain tasks may appear to be more
authentic or popular within the fields of TESOL, applied linguistics, or language
testing, the selection of tasks should not be governed solely by trends. For
example, research-based writing courses, which have long been the dominant
approach to writing instruction in ESL writing programs in US higher educa-
tion, have now spread to a number of non-Anglophone universities. While the
ability to write academically in English is an important skill for undergraduate
and graduate students in this increasingly globalized world, assessment tasks that
require test-takers to produce a research paper can be overwhelming for those
who have just graduated from high school, where writing tasks rarely go beyond
the sentence or paragraph level. In this particular context, we recommend that
test developers create a sequence of writing tasks of increasing difficulty and
embed the assessment sequence in different placement levels within the writing
program (e.g., sentence combining, paragraph writing ⇒ summary ⇒ literature
review). Paragraph writing tasks can be used to target the lowest course/place-
ment level to build upon students’ foundational writing skills. At the higher
course levels, tests can include summary writing tasks and eventually ask students
to combine summaries of individual research papers into a coherent literature
review. This type of assessment sequence can effectively foster longitudinal
development of writing abilities among students. In contrast, when developing
assessment tasks for a short, advanced-level course on academic writing, teachers
and test developers can select more authentic, integrated writing tasks, since stu-
dents at that level can be expected to have developed foundation skills in
research-based writing. From these two examples, we hope to demonstrate that,
while language tasks that are in vogue at a particular time might attract attention
from a wide group of test developers and users, they are of limited value if they
do not match the purpose and target proficiency level of the test.
Following the same line of reasoning, we also recommend that test developers
dedicate time and effort to controlling the complexity level of the tasks. Task
complexity can come from both (1) the nature of the language performance
elicited by the prompt questions, and (2) the linguistic and content complexity of
the input material. The careful selection of writing tasks recommended in the pre-
vious paragraph addresses the first source of complexity. To control the linguistic
and content complexity of the input materials, test developers need to first identify
some benchmark materials and compare the input materials they create to the
benchmarks. However, the selection of benchmark materials differs across testing
contexts. If the test is embedded in a local language program, we recommend
using the instructional materials (e.g., textbooks, exercises) as a benchmark. If the
local test is not embedded in a language program, then we recommend using
materials (e.g., texts, graphs, audio) that test-takers are likely to encounter in the
target language use domain as a benchmark. Of course, it is desirable that test
developers simplify the content complexity of the tasks by selecting less advanced
topics in the target domain. Selecting less complex content topics can ensure that
content knowledge does not have a large impact on test-takers’ language perform-
ance, unless the task was also developed to assess content knowledge. For
example, when measuring oral proficiency for ITAs in US universities, it might
Further reading
Alderson, J. C. (2000). Assessing reading (Cambridge language assessment series).
Cambridge, UK; New York, NY: Cambridge University Press.
Alderson (2000) provides a clear and accessible overview of the research con-
ducted in the field of second language reading. In Assessing Reading, Alderson
first presents a general introduction to the assessment of reading in both first
and second language learning, then discusses variables that affect the nature of
reading and defines the construct of reading using the reading subskill literature.
Although the book focuses mostly on small-scale test design, one chapter
presents a critique of high-stakes summative tests. The book also contains some
practical examples of reading assessment techniques for the use of teachers in
their classes. Alderson’s prose is very accessible to both practitioners and non-
practitioners. The practicality of this work is what distinguishes it from other
books related to the assessment of reading.
Buck, G. (2001). Assessing listening (Cambridge language assessment series). Cam-
bridge, UK; New York, NY: Cambridge University Press.
Buck (2001) is a guide to the design and administration of listening tests, which
fills a significant gap in the literature by connecting the emerging need for
research on the assessment of listening with listening-related
research that already exists in other fields. He begins the discussion by noting
the different types of knowledge used in listening, i.e., the top-down and
bottom-up views. The chapter goes on to introduce and discuss some of the
most important, yet basic, concepts which are necessary to understand before
discussing any listening-related research studies. In the subsequent chapters,
Buck summarizes the research on the importance of accent, provides a historical
background to the approaches used in the assessment of listening, defines the
construct of listening comprehension, discusses listening comprehension task
characteristics, and provides a guide to acquiring appropriate listening texts and
developing a listening test. The rest of the book is devoted to a discussion about
large-scale tests that involve listening comprehension: TOEFL, TOEIC, FCE,
and Finnish National Foreign Language Certificate.
Luoma, S. (2004). Assessing speaking (Cambridge language assessment series).
Cambridge, UK; New York, NY: Cambridge University Press.
Luoma (2004) relies heavily on past research and the theories derived from that
research. At the same time, the book is highly practical and “is aimed at those
who need to develop assessments of speaking ability” (p. x). In the introductory
chapter, the author presents scenarios of speaking assessment, which helps the
readers have a good grasp of the variables involved in the assessment of speaking.
In the subsequent chapters, various factors involved in the assessment of speaking
are discussed in detail. Overall, Assessing Speaking is a very accessible and well-
written book that is attractive for both those who want to learn more about the
assessment of speaking and those who have immediate needs to assess speaking
in their instructional context.
Purpura, J. (2004). Assessing grammar. Cambridge, England: Cambridge University
Press.
Purpura argues that the field of language testing needs to be better
informed by research into grammar in order to improve the quality of language
tests that include the assessment of grammar. Similar to the other books in Cam-
bridge’s assessment series, Assessing Grammar encompasses both the theoretical
and the practical aspects of the assessment of grammar. The author begins the
discussion by defining the concept of grammar, arguing that the elusive defin-
ition of grammar makes it difficult to assess grammatical competence. Purpura
explains that grammar is more than its linguistic definition, and that when assessing
grammar, language test developers need to consider its semantic-pragmatic
meaning along with its strictly linguistic meaning, comprising syntax and morph-
ology. The book presents a detailed discussion of how grammar has been
assessed in several language tests and concludes with a discussion of the dir-
ections future research can take.
Weigle, S. (2002). Assessing writing (Cambridge language assessment series). Cam-
bridge, UK: Cambridge University Press.
Weigle (2002) provides an excellent overview of the assessment of writing for
a large audience, including test constructors, test score users, teachers, and other
stakeholders. Like the other books in the Cambridge Language Assessment
series, Assessing Writing balances breadth and depth of treatment by
covering many of the most important topics related to the assessment of writing
in considerable detail. The book begins with an introduction to the assess-
ment of writing by categorizing types of writing based on their purpose. In the
subsequent chapters, a variety of topics are presented, including a summary of
the differences between speaking and writing, the roles of topical knowledge
and strategic competence in text production, writing test task design, task scor-
ing, classroom writing assessment, portfolio assessment, and large-scale writing
assessment. The author concludes the book by discussing the future of writing
assessment as it is affected by computer technology.
References
Alderson, J. C. (1980). Native and non-native speaker performance on cloze tests. Language
Learning, 30(1), 219–223.
Alderson, J. C. (2000). Assessing reading. Cambridge: Cambridge University Press.
Allen, M. C. (2016). Developing L2 reading fluency: Implementation of an assisted repeated reading
program with adult ESL learners. Doctoral dissertation: Purdue University, West Lafayette,
IN, USA.
Bachman, L. F. (1990). Fundamental considerations in language testing. Oxford: Oxford University
Press.
Bosker, H. R., Quené, H., Sanders, T., & De Jong, N. H. (2014). Native ‘um’s elicit predic-
tion of low-frequency referents, but non-native ‘um’s do not. Journal of Memory and Lan-
guage, 75, 104–116.
Canale, M., & Swain, M. (1980). Theoretical bases of communicative approaches
to second language teaching and testing. Applied Linguistics, 1(1), 1–47.
Carroll, J. B. (1961). Fundamental considerations in testing for English language proficiency of foreign
students. Washington, DC: Center for Applied Linguistics.
Fox Tree, J. E. (2001). Listeners’ uses of um and uh in speech comprehension. Memory &
Cognition, 29(2), 320–326.
Fulcher, G. (2010). Practical language testing. London: Hodder Education.
Ginther, A. (2002). Context and content visuals and performance on listening comprehension
stimuli. Language Testing, 19(2), 133–167.
Hughes, A. (2003). Testing for language teachers (2nd ed.). Cambridge: Cambridge University
Press.
Kang, O., Thomson, R., & Moran, M. (2018). Which features of accent affect understand-
ing? Exploring the intelligibility threshold of diverse accent varieties. Applied Linguistics,
online first, 1–29.
Lado, R. (1961). Language testing: The construction and use of foreign language tests. New York,
NY: McGraw-Hill.
Major, R. C., Fitzmaurice, S. M., Bunta, F., & Balasubramanian, C. (2005). Testing the
effects of regional, ethnic, and international dialects of English on listening
comprehension. Language Learning, 55(1), 37–69.
Matsuda, A., & Friedrich, P. (2011). English as an international language: A curriculum
blueprint. World Englishes, 30(3), 332–344.
McCray, G., & Brunfaut, T. (2018). Investigating the construct measured by banked
gap-fill items: Evidence from eye-tracking. Language Testing, 35(1), 51–73.
Nelson, C. L. (2012). Intelligibility in world Englishes: Theory and application. New York, NY:
Routledge.
Norris, J. M., & Ortega, L. (2009). Towards an organic approach to investigating CAF in
instructed SLA: The case of complexity. Applied Linguistics, 30(4), 555–578.
Pickering, L. (2006). Current research on intelligibility in English as a lingua franca. Annual
Review of Applied Linguistics, 26, 219–233.
Thirakunkovit, S., Rodríguez-Fuentes, R. A., Park, K., & Staples, S. (2019). A corpus-
based analysis of grammatical complexity as a measure of international teaching assistants’
oral English proficiency. English for Specific Purposes, 53, 74–89.
Vinther, T. (2002). Elicited imitation: A brief overview. International Journal of Applied Lin-
guistics, 12(1), 54–73.
Wagner, E., & Toth, P. D. (2016). The role of pronunciation in the assessment of second
language listening ability. In T. Isaacs & P. Trofimovich (Eds.), Second language pronunci-
ation assessment: Interdisciplinary perspectives (pp. 72–92). Bristol: Multilingual Matters Ltd.
Winke, P., Yan, X., & Lee, S. (in press). What does the cloze test really test? A conceptual
replication of Tremblay (2011) with eye-tracking data. In G. Yu & J. Xu (Eds.), Lan-
guage test validation in a digital age. Cambridge: Cambridge University Press.
Xi, X. (2010). Aspects of performance on line graph description tasks: Influenced by graph
familiarity and different task features. Language Testing, 27(1), 73–100.
Yan, X., Cheng, L., & Ginther, A. (2019). Factor analysis for fairness: Examining the
impact of task type and examinee L1 background on scores of an ITA speaking test.
Language Testing, 36(2), 207–234.
Yan, X., Maeda, Y., Lv, J., & Ginther, A. (2016). Elicited imitation as a measure of second
language proficiency: A narrative review and meta-analysis. Language Testing, 33(4),
497–528.
5
LOCAL TEST DELIVERY
This chapter addresses issues related to the selection of a test delivery platform
(paper, digital, online). On one hand, it examines the advantages of technology in
improving test administration efficiency. On the other, it outlines the possible dif-
ficulties test developers may encounter in the process of digital platform design,
especially regarding cost-effectiveness, communication with software developers,
and maintenance of the system over time. The chapter considers how the change
of delivery method may elicit a different set of abilities, which may impact test-
takers’ success, and emphasizes the need to consider test-takers’ digital literacy and
the effects of the test delivery mode on authenticity, washback, and security.
Introduction
When designing the various assessment task types presented in the previous chapter,
you need to think about the delivery platform, i.e., how the task instructions and the
task input will be delivered to the test-takers. In this chapter, we differentiate between
traditional, digitally delivered, and hybrid (or digitally enhanced) tests. Traditional test
methods include paper-based written tests (e.g., multiple-choice items, performance-
based writing) and oral language tests based on presentations, interviews, and conver-
sations (see Chapter 4 for task descriptions and uses). Digitally delivered tests are those
delivered via the application of electronic technology that generates, stores, and pro-
cesses data using a binary system, with the digits referred to as bits (a string of bits in
computer technology is known as a byte). For example, computers are digital devices
because they apply this binary system to process data. We use the term “digitally
delivered” instead of the more commonly used “computer-based” or “computer-
assisted” because test delivery is no longer limited to computers; tablets and smart-
phones are increasingly being used as test delivery platforms.
Tests are often delivered in a hybrid format, utilizing both traditional and
digitized elements in the task input, test delivery, or data collection. Strictly
speaking, these hybrid tests may not be considered digitally delivered, but their deliv-
ery and data collection are digitally enhanced. An example of hybrid delivery is
when, in a listening test, digital audio files are played on an MP3 player or simi-
lar device to deliver the task input, but test-takers have the task instructions and
the items in a paper booklet where they also enter their responses. Another
common example of a hybrid delivery/collection method is when writing task
input is provided in a paper booklet, but the test-takers type their responses in
a word processing document. You can find more information about digital test
data collection and management in Chapter 8.
Selection of the test delivery platform is an important consideration because the
platform needs to be compatible with the intended task format and because the task
development requires various digital and programming resources. Since the decisions
about the formats, the types, and the number of tasks to include in the language
test depend on the test purpose, the assessed skill(s), and/or the test uses (see Intro-
duction), the compatibility of the delivery platform with the test purpose is essential.
For example, if a speaking test focuses on eliciting conversation skills and inter-
action, a digital delivery of tasks may not be a viable option, but digital recording
of the responses might be.
In terms of resources, the delivery method influences the choice of materials
needed during the test development process. It also affects decisions about response
collection, rating procedures, and test data management and storage. If the local
context lacks basic infrastructure for implementation of a digitally delivered test
(e.g., data storage, internet access, digital devices), then the development of digitized
test tasks may not be reasonable.
In the following sections, we present the digital devices used for digital test
delivery, and then discuss the production, maintenance, and administration effi-
ciency of each delivery method. We outline the challenges that you may
encounter in the process of digital platform design, including cost-effectiveness,
communication with software developers, and maintenance of the system over
time. Then, we discuss what considerations are essential in the selection process
of the appropriate platform. These include possible digital literacy, authenticity,
washback effects, and test security.
Digital devices
Computers are the digital devices that are most commonly used for digital tests.
When digital tests were introduced, desktop computers were the default elec-
tronic machines used for test administration because at that time, laptop proces-
sors were much less powerful and, therefore, inadequate for the delivery and
collection of large data sets. In the past decade, however, laptops have almost
caught up with desktops in processing power and, due to their portability, have become
widely used. For example, if you use laptops, it will not be necessary to schedule
the test in computer labs, which tend to be quite small and limit the simultan-
eous administration to a small number of test-takers. In fact, you may ask test-
takers to bring their own laptops to the testing site and thus avoid investing in
computers or laptops for digital test delivery. The advantage of desktop com-
puters remains that they can have large monitors and full size keyboards, which
may be beneficial, especially for writing tests.
In addition to desktop and laptop computers, tablets and smartphones have
recently gained popularity in test administration because they contain several
features that render them useful, namely enhanced portability, long battery life,
and 4G internet connectivity. If you intend to administer the test in different
locations, transporting a number of lightweight tablets would be much easier
than transporting heavier laptops. Moreover, unlike laptops that need to be
recharged after several hours of use, most tablets can go for up to 10 hours
without charging. Finally, the possibility for a 4G LTE connection allows you
to use a SIM card (usually inserted into the tablet) to administer online-
delivered tests even if there is no Wi-Fi connection. Smartphones exhibit simi-
lar characteristics to tablets, with the main difference being that they are much
smaller in size.
However, tablets and smartphones have a number of limitations. First,
their operating systems lack the capabilities of laptops’ full-blown operating
systems (e.g., Windows and MacOS), which offer extensive control over
how you store and manage data and support installation of professional soft-
ware. Importing documents and images from external sources is also compli-
cated because tablets and smartphones lack USB ports, HDMI ports,
and SD card readers. Tablets and smartphones are particularly inconvenient
for writing tests because their keyboards appear on the screen and cover the
writing area. Tablets can be connected to external keyboards via Bluetooth,
but they still remain inferior to laptops in that respect. Table 5.1 provides
a comparative list of the technical features of desktop computers, laptops, and
tablets and smartphones.
When digital language test methods were introduced in the last decades of the
20th century, they were deemed most appropriate for delivery and scoring of dis-
crete-point items in grammar, vocabulary, and reading comprehension tests, rather
than performance-based tests, because they did not require advanced programming
and were easier to administer. The discrete-point item types can easily be context-
ualized with the use of visual (pictures and video) and sound files already available
on the internet. According to Roever (2001), the internet can provide a highly
authentic environment for language testing if the tasks involve retrieval of online
information (e.g., writing emails or searching Google). With rapid advances in
technology, digital delivery methods are increasingly applied in development,
delivery, and scoring of performance-based speaking and writing tests.
In addition to graphic design and illustration, which are essential in traditional
paper-based tests, digital test production requires careful consideration of the
availability and accessibility of appropriate hardware and software for item devel-
opment, as well as possible issues with browser incompatibility, test security,
server failure, and data storage in the local context. In the design process, special
attention must also be paid to the creation of a relevant computer or tablet inter-
face, ease of navigation, page layout, and textual and visual representation on the
screen (Fulcher, 2003).
If you are developing a language test in an institution (e.g., school, university,
language center) whose main activity is not language testing, then a lack of
resources may be an issue; the development of digitally delivered tests requires
IT expertise and software licenses that can be expensive. Institutions’ limited
budgets for test development tend to preclude outsourcing test digitization by
hiring professional software development companies, not to mention institutions’
constant negotiations to adapt the envisioned task format to the technical possi-
bilities. Such communication can be challenging because the two parties often
focus on different aspects of test development; test developers focus on the
effects of the task format on the elicited performance, while software developers
focus on the available tools for development of the test application, often in the
least elaborate way possible. Miscommunication may not be obvious in the ini-
tial discussions with software developers, and so it is essential that test developers
follow every stage of the design process, which tends to be iterative rather
than linear. Example 5.2 recounts such communication between software and
test developers.
Control over task delivery and task completion processes can easily be supported
with technology (Ockey, 2009). For example, the mouse-over functionality can be
applied to provide instructions in both the target language and test-takers’ L1,
which minimizes the possibility that test-takers’ misunderstanding of instructions
will affect their task performance. Technology can also provide access to different
digital resources and tools, such as dictionaries, conjugation tables, glossaries, the-
sauri, or tutorials.
Test administration
The functionality of test administration is another aspect to consider when
selecting traditional or digital platforms for test delivery. Digital platforms may
be preferred because they tend to facilitate test administration. However, trad-
itional test delivery platforms may be maintained or enhanced with digital data
collection. The following is a typical process of paper-based test delivery.
Once the test booklets are printed, they need to be distributed to the testing site,
i.e., the place where the test-takers will take the test. In high-stakes test situations,
the transport of the test booklets must be secured to prevent unauthorized access to
the tasks before the scheduled administration. After the tests are distributed and
responses are collected, they need to be delivered to the raters. Sometimes rating
sessions are organized with raters gathered to rate the responses, or else the tests are
sent to raters who perform the ratings individually and then send the results back to
the testing coordinators. When the rating is completed, the results are recorded
manually, and the test booklets are stored according to the planned storage
procedures (see Chapter 8 for data management and storage). Given the need for
large storage spaces, sometimes decisions are made to destroy actual test responses
after a certain period of storage. The manual transportation, processing, and storage
of secured physical data may compromise the data security and complicate the ana-
lysis of response data over longer periods of time. To minimize possible complica-
tions with test security and data handling, many hours of
continuous involvement by dedicated administrative staff are unavoidable.
The administration of traditional oral language tests can be extremely time-
consuming for the interlocutor and the examiners. For example, it will take at
least 5 hours to administer a 15-minute test to just 20 test-takers, not to mention
the additional time to record the results. Oral language test administrations are
time-consuming and therefore expensive. In many local contexts, the test
administrators are language teachers who do not receive a course release, and
the oral test administration may be considered an additional professional activity.
To ensure test quality through teachers’ full engagement with the testing activ-
ities, institution or program leaders must account for the expenses related to
teachers’ test administration and rating time.
When planning the test administration, it is also important to consider interlocu-
tor and examiner fatigue as a factor that negatively affects the consistency of test
administration, also known as reliability. Fatigue can lower examiners’ attention and
affect their rating behavior, which can lead to inconsistent test administration and
rating. In other words, test-takers’ performances and results may not depend solely
on their language proficiency but also on the time of day when they took the test
(Ling, Mollaun, & Xi, 2014). To avoid compromising the consistency, you may
need to engage several interlocutors and examiners or, if a limited number of test
administrators are available, you need to arrange for frequent pauses. Unlike trad-
itional test methods, digital platforms allow for simultaneous administration of per-
formance-based tests to many test-takers in multiple testing sites. An added
advantage for oral test administration is that digital delivery minimizes the interlocu-
tor effect that can occur in traditional oral test settings. Example 5.5 exemplifies the
use of digital platforms for test administration efficiency.
to one test taker. It took around 40 minutes to administer the test to each
individual test taker.
The OEPT is administered in a computer lab, where around 20 students
take the test simultaneously. The latest version of the OEPT is timed and test
takers complete the entire exam in about 30 minutes. In approximately
1 hour, responses from 20 test takers are collected. In a traditional oral test
setting, at least 7 hours would be needed to test 20 people.
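The efficiency gain is easy to quantify with rough arithmetic. The sketch below (in Python) reproduces the approximate figures above; the seat count, session length, and check-in allowance are illustrative assumptions rather than properties of any particular test.

    import math

    def sequential_hours(n_test_takers, minutes_per_interview):
        """Examiner time needed when test-takers are interviewed one at a time."""
        return n_test_takers * minutes_per_interview / 60

    def lab_hours(n_test_takers, minutes_per_session, seats, turnover_minutes=30):
        """Lab time needed when up to `seats` test-takers respond simultaneously."""
        sessions = math.ceil(n_test_takers / seats)
        return sessions * (minutes_per_session + turnover_minutes) / 60

    print(sequential_hours(20, 15))     # 5.0 hours -- the "at least 5 hours" estimate above
    print(lab_hours(20, 30, seats=20))  # 1.0 hour  -- one 20-seat lab session plus check-in time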
In terms of writing tests, internet technologies provide the possibility for test
administration in multiple sites, as the only requirements are a digital device and
internet access (Example 5.6). This means that some screening for course or pro-
gram admission or for the identification of language support needs at universities
could be performed even before students arrive on campus, which would greatly
facilitate early course and program planning.
Digital delivery platforms facilitate the collection of responses, provide secure storage, and are easily accessed for rating purposes.
The digital format of oral and written responses and the digital rating platforms
facilitate the rating process for human raters, which minimizes scoring errors
resulting from data mishandling and score recording (see Chapter 8 for
a detailed discussion on data management and storage).
Digitally delivered testing has evolved to the point of developing automated
scoring systems for performance-based tests of writing and speaking. By using
computational linguistics and speech recognition technology, machines are
“trained” to recognize written or spoken features that distinguish
language performances at different proficiency levels and to assign scores
accordingly. The purpose of automated scoring systems is to improve rating effi-
ciency and consistency, hence score reliability, and to reduce the expenses
related to engaging human raters (Ockey, 2009). At this time, automated scoring
systems are rarely used in locally developed language tests because the develop-
ment of such systems is very expensive and time-consuming. In fact, the neces-
sity to have large amounts of data to “train” the machine prevents local test
developers from working on such systems, as data generation tends to be more
limited in local test administration than in large standardized tests such as the
TOEFL. These limitations may be overcome if you have local resources, like
language technology labs, computational linguistics programs, or natural language
processing programs.
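As a very rough illustration of what such “training” involves, the sketch below fits a simple regression from a handful of surface features of transcribed responses to human ratings. It assumes scikit-learn is available; the features, the invented rated responses, and the model choice are illustrative only and bear little resemblance to the far richer features and data sets used in operational scoring engines.

    import numpy as np
    from sklearn.linear_model import Ridge

    def surface_features(text):
        """Crude illustrative features: length, lexical diversity, mean word length."""
        words = text.split()
        n_words = len(words)
        type_token_ratio = len(set(w.lower() for w in words)) / max(n_words, 1)
        mean_word_length = sum(len(w) for w in words) / max(n_words, 1)
        return [n_words, type_token_ratio, mean_word_length]

    # Hypothetical human-rated responses (text, rating on the local scale).
    rated_responses = [
        ("I go store yesterday buy food", 2),
        ("Yesterday I went to the store to buy some food for dinner", 3),
        ("Although I rarely cook, yesterday I went to the market and chose fresh "
         "ingredients so that I could prepare dinner for my roommates", 5),
    ]

    X = np.array([surface_features(text) for text, _ in rated_responses])
    y = np.array([score for _, score in rated_responses])

    model = Ridge(alpha=1.0).fit(X, y)      # "train" the scoring model on rated data
    new_response = "I went to the library and studied for two hours"
    print(model.predict([surface_features(new_response)]))   # machine-predicted score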
Table 5.2 provides a comparison of traditional and digital test delivery platforms
in terms of development costs and resources, administration, elicited response types,
data storage, and maintenance. Traditional platforms are cost effective and do not
require many resources for development and maintenance, but they offer limited
task innovation and response types. Digital delivery platforms, on the other hand,
provide the possibility for task innovation and various response types, but their
development and maintenance may require costly resources.
TABLE 5.2 Summary of the characteristics of traditional and digitally delivered tests
Digital literacy
If you are planning to develop a digitally delivered test, it is important to take
into account the effects of digital literacy on test-takers’ task completion and over-
all test performance. Depending on the purpose of your test and the language
skill(s) you want to assess, you may consider including digital literacy as an integral
part of the assessed skill or else minimizing its effects. Lack of digital literacy could
result in differential test performance, which means that test-takers’ performance
may be influenced by their digital literacy rather than their language
proficiency (Chapelle & Douglas, 2006; Jin & Yan, 2017; Roever, 2001).
Digital literacy is defined as “a person’s ability to perform tasks effectively in
a digital environment” (Jones & Flannigan, 2006, p. 9). This means that digital
literacy is not confined to familiarity with and the ability to use computers, the
internet, or software; it includes the ability to find, understand, evaluate, and use
information available in multiple formats via computers, tablets, smartphones,
and the internet (Buckingham, 2007; Jones & Flannigan, 2006). Van Deursen
and Van Dijk (2009) operationalized digital literacy as a set of different skills:
operational (using computer hardware and software), formal (accessing computer
network and web environments), and informational (finding, selecting, evalu-
ating, and processing information). To the informational dimension, Ananiadou and Claro
(2009) added the ability to restructure and transform information to develop
knowledge, i.e., information as a product (p. 20).
Given the recent expansion of computer and internet use among different
test-taker groups, and the claim that the younger generations are “digital
natives”, the term digital literacy has evolved from skills-oriented concepts
(computer literacy, media literacy) to more process-oriented concepts that
involve understanding and organizing digital information. For instance, the
Danish writing tests that students take at the end of their compulsory education in
Denmark require searching for, evaluating, and using relevant information from
the internet rather than just typing responses into a word processor.
Despite the widespread use of digital devices, you still need to examine the
operational literacy of test-takers in relation to the particular digital device and
software applications that you want to use. For example, when piloting
a digitally delivered English language test for first graders in the Danish elemen-
tary schools (see Example 5.4), we realized that, although the students were
exposed to computers and videogames daily, they experienced difficulties using
a mouse for tasks that involved drag-and-drop elements because they had been
additional qualitative analyses have revealed that the different methods produce
different response characteristics (see Ginther, 2003, for a review). In addition,
discussions regarding digital speaking tests have also focused on restrictions
imposed by the delivery method in relation to limitations on interactivity
(Galaczi, 2010; O’Loughlin, 2001). Regarding the assessment of writing, Leijten,
Van Waes, Schriver, and Hayes (2014) held that writing ability models need an
update to include the effect that technologies have on the cognitive processes
involved in writing. They called for the inclusion of subprocesses that account
for online searches, motivation management, and the construction of verbal and
visual content, or design schemas (pp. 324–326). The comparability of offline
reading to reading in a digital environment has also been challenged. Research
findings have suggested that readers need a new set of skills and strategies to suc-
cessfully comprehend digital texts (Afflerbach & Cho, 2009a, 2009b; Coiro &
Dobler, 2007). Although readers may be quite proficient offline, they
may experience difficulties not only when searching for textual information
online (Eagleton & Guinee, 2002), but also when they are required to critically
evaluate and understand search results (Coiro, 2011; Fabos, 2008; Henry, 2006).
Spiro (2004) argued that, unlike the processing of offline reading, the cognitive
processing of online texts presents a more demanding task for readers because
they need to respond and adapt to various contextual cues that arise from the
multi-layered and more complex representations of information that occur in
digital environments. In preparation for the introduction of the TOEFL iBT,
Taylor, Jamieson, Eignor, and Kirsch (1998) and Powers (1999) found some debilitat-
ing effects of computer familiarity on TOEFL performance but maintained
that, because the effects were negligible, no additional preparation materials
would be required in support of the TOEFL iBT.
It is likely the case that differences between traditional paper-and-pencil and
digital delivery formats attract the most attention in times of transition. Certainly,
while the role of computer familiarity in the composition process received a great
deal of attention as word processing programs were being introduced and adopted,
discussions now focus on the cognitive processes underlying language production
in the digital domain. Indeed, the differences between direct and semi-direct
speaking assessment formats no longer receive the same attention they once did,
despite the fact that the inclusion of interactivity remains an issue with respect to
construct representation. Technological advancements and the widespread use of
CMC (e.g., blogs, YouTube videos, Skype) have resulted in a situation where the
impact of digital delivery has diminished. Nevertheless, the absence of deliv-
ery format effects cannot be assumed and should be taken into consideration,
especially when switching from one test delivery format to another.
Authenticity
Test-takers engage with the intended language domains if test tasks include sufficient
contextual cues that elicit their background knowledge of the communicative
situation (Douglas, 2000). These interactions with the tasks help to elicit relevant lan-
guage performances that will be used to make inferences about test-takers’ ability to
communicate in real-life situations. Therefore, the degree to which the test format
(traditional or digital) represents the situational characteristics (setting, participants,
content, tone, and genre) and enhances test-takers’ interaction with the task is vitally
important (Bachman & Palmer, 1996; Chapelle & Douglas, 2006; Douglas, 2000).
Digitally delivered tests, more so than paper delivered tests, offer possibilities
for a wide range of innovative tasks that facilitate the inclusion of relevant con-
textual cues and promote interactivity through exploitation of multimedia. Local
test developers can take full advantage of multimedia to design input with
graphic and audio enhancements, which enables them to maximize task authen-
ticity. However, although digital devices can assist in accomplishing realistic rep-
resentations of the real-world situations in which the communication takes
place, they may not be preferable for every testing situation. For instance, simu-
lated role-plays may be included in language for specific purposes (LSP) tests or
tests for employment certification because certain communicative tasks need
interlocutors to represent interactions across modalities in the communicative
contexts (Norris, 2016, p. 232). Paper delivered tests may also be relevant when
testing young learners because they allow for the inclusion of tasks that require
drawing and coloring.
The development of digital and web-based tasks is essential when measuring
test-takers’ abilities to participate in computer-mediated communication (CMC).
The characteristics of CMC discourse differ from those of discourse in other
domains, and so digitally delivered tasks need sufficient cues to allow test-takers to
differentiate these domains. For example, the search, selection, and
evaluation of online texts as information sources requires a different set of abil-
ities than those needed for using paper-based textual sources. For that reason,
using a paper delivered test may lack authenticity because it fails to engage test-
takers with the task in the intended manner.
Security
Regardless of the local test delivery methods (traditional or digital), an important
consideration is ensuring the security of test items, item responses, and test-
takers’ personal information. Lack of test security can compromise the results
and affect the test consequences for individuals. For example, if test-takers have
access to test tasks, then opportunities for cheating are created, which means
that scores will not accurately reflect their abilities. Unauthorized access to per-
sonal data, including both personal identifiers and test result information,
breaches data confidentiality and results in unethical testing practices and nega-
tive consequences for test-takers.
With paper delivered tests, item security can be maintained if new items are
selected from a large item pool or new items are created for each test administra-
tion. Although these strategies may enhance security, the drawback is insufficient
data for item analysis. In local test development, the opportunities to run trial
items before their actual uses are limited, as you do not have access to large num-
bers of test-takers. Therefore, although items can be trialed before use, the data
collected may be insufficient for closer analyses of consistency, item difficulty, and
discrimination. Further security precautions include test booklet distribution to
and from testing sites in sealed envelopes or boxes and test storage in locked cab-
inets and rooms.
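One way to operationalize the item-pool strategy is to assemble each administration’s form from the least-exposed items and to record exposure, so that reuse is postponed until new items are written. The sketch below is a minimal illustration; the pool size, form length, and exposure cap are invented for the example.

    import random

    # Hypothetical item pool: item IDs mapped to the number of administrations used in.
    item_pool = {f"ITEM_{i:03d}": 0 for i in range(1, 101)}   # 100 items, none exposed yet

    def assemble_form(pool, n_items=30, max_exposure=2, seed=None):
        """Draw a form of n_items from items still under the exposure cap."""
        rng = random.Random(seed)
        eligible = [item for item, used in pool.items() if used < max_exposure]
        if len(eligible) < n_items:
            raise ValueError("Item pool is over-exposed; write new items before reuse.")
        form = rng.sample(eligible, n_items)
        for item in form:                      # record exposure for this administration
            pool[item] += 1
        return form

    spring_form = assemble_form(item_pool, seed=2024)    # one administration
    fall_form = assemble_form(item_pool, seed=2025)      # next administration, fresh draw
    print(len(set(spring_form) & set(fall_form)))        # item overlap between the two forms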
One issue with the security of online-delivered tests is that the use of applica-
tions and machine translation programs, such as Google Translate, may constitute cheat-
ing, even though such tools have been found to assist communication for L2 beginners
(Garcia & Pena, 2011). Similarly, questions of authorship are real threats that
can compromise test security (Chapelle & Douglas, 2006), and it is believed that
internet access leads to increased instances of plagiarism (Evering & Moor-
man, 2012; Pecorari & Petrić, 2014). Having an internet connection makes it
possible for students to copy large amounts of text. Cheating by sharing can also
be facilitated, as it is difficult to monitor test-takers’ access to social platforms.
Issues with test security have received much attention with regard to large-scale
standardized English proficiency tests. Given that these tests are administered at
different testing sites, concerns about the identity of test-takers, fraudulent exam
invigilators/proctors, and collective memorization of test items have been raised
(Watson, 2014; Xinying, 2015). Rigorous measures against identified fraudulent
activities may have negative consequences for all test-takers, including those
who did not cheat, as testing agencies lose credibility and stakeholders begin to
mistrust the test results (Wilkins, Pratt, & Sturge, 2018). Although security
measures in local tests may not receive public attention, this does not mean that
security precautions are less important in such contexts.
Test security may be an issue, especially in locally developed computer-
delivered tests, as there may be a lack of awareness regarding how and where test data
are stored and how easily the data can be accessed without authorization. It is essential that you
familiarize yourself with the institutional, national, and international laws and
regulations for the handling and manipulation of personal and sensitive data to
ensure appropriate collection and storage. For example, specific regulations may
exist regarding the location and ownership of the servers on which test data are
stored. This means that some of the freely or commercially available test develop-
ment platforms and applications may not satisfy legal security requirements.
Notwithstanding the legal requirements, development of computer/tablet-,
web-based and, most recently, cloud-based tests needs to follow the general
security procedures in a digital or online environment. According to the Code of
practice for information security management (ISO/IEC 27000, 2009; ISO/IEC
27001, 2005; ISO/IEC 27002, 2005), the basic information security components
for the online environment include: 1) confidentiality, 2) integrity, 3) availabil-
ity, 4) authenticity, and 5) non-repudiation.
Confidentiality, integrity, and availability refer to the assurance that test data are
not released to unauthorized individuals, processes, or devices, that they have not
been modified or destroyed, and that they are reliably accessible in a timely manner (ISO/
IEC 27001, 2005; ISO/IEC 27002, 2005). In other words, only assigned examin-
ers can access the test data in their original form on approved machines (e.g., per-
sonal computers, tablets), and computer-, web-, or cloud-based test delivery
to test-takers must be unobstructed and reliable. In web-based and cloud-based testing,
digital identities need to be created for each test-taker.
While authenticity refers to verifying the identity of users, be they test-takers,
examiners, or administrators, non-repudiation is the ability to obtain proof of
delivery and reception (ISO/IEC 27000, 2009). Establishment of a digital identity
to secure authenticity and non-repudiation is crucial (Raitman, Ngo, Augar, &
Zhou, 2005). Digital identities are certain data attributes or distinctive characteris-
tics (e.g., usernames and passwords, test IDs, rater IDs, date of birth) that allow
for the assessment, validation, and authentication of test-takers when they interact
with the testing system on the internet (Bertino, Paci, Ferrini, & Shang, 2009).
Usernames and passwords are the most basic way to authenticate test-takers
and authorize their access to the assessment tasks, as well
as to authorize registered examiners and test administrators to enter and access
test data. Clear definition of the different user roles in the data management
system is imperative so that users are granted access only to authorized infor-
mation. For instance, test-takers’ logins allow them to access the tasks assigned
to them, while raters’ logins authorize them to access the responses they are
assigned to rate (see Chapter 8 for examples of secure data management systems).
Example 5.7 illustrates how test security could be breached if the password pro-
tection system does not function properly.
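To make the idea of role-based access concrete, here is a minimal sketch of how such permissions might be represented in a data management system. The role names, resource labels, and the can_access helper are our own illustrations and do not correspond to any particular testing platform.

```python
# A minimal sketch of role-based access to test data, using a simple
# in-memory permission map; all role and resource names are illustrative.

PERMISSIONS = {
    "test_taker": {"assigned_tasks"},            # may open only their own assigned tasks
    "rater": {"assigned_responses"},             # may open only responses assigned for rating
    "administrator": {"assigned_tasks", "assigned_responses", "score_reports"},
}

def can_access(role: str, resource: str) -> bool:
    """Return True if the given user role is authorized for the resource type."""
    return resource in PERMISSIONS.get(role, set())

# Usage: a rater may open responses to rate, but not score reports.
assert can_access("rater", "assigned_responses")
assert not can_access("rater", "score_reports")
```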
Finally, you may face the dilemma of whether to develop test tasks that require
finding and using internet sources for task completion. While these tasks authen-
tically reflect CMC or writing with digital sources, internet access compromises
test security. You will have to decide whether to consider online dictionaries,
conjugation websites, and translation websites as test security threats, or whether
these resources strengthen the authenticity of the test as L2 writers constantly use
them in real-life situations. The major security threat with internet access is the
possibility of using Google Drive, Dropbox, or similar cloud-based storage
domains and devices to exchange task responses with other individuals. This could be prevented by blocking access to these websites or by using window or keystroke tracking programs. These programs keep logs of the websites that test-takers access or the words they type during the test session, which can reveal
any cheating practices.
Washback
Local language tests are often used to inform educational programs or to support
larger language programs by providing formative feedback, language courses, or
individual instruction. In such cases, you may reasonably expect that the task
type and delivery platform would have an impact on teaching and learning, i.e.,
the test washback effect (Bailey, 1996; Cheng, 2013). Since learning with information technology (IT) has become one of the main curricular goals across
educational levels (elementary to university), implementing digital language tests
may help you to integrate the IT goals with the language learning goals in the
curricula and, in that way, create opportunities for positive washback effects on
teaching and learning (Chapelle & Douglas, 2006).
One positive washback effect of digitally delivered language tests is the access
to digital resources (e.g., dictionaries, conjugation tables, language corpora,
translation applications) as well as digital information sources. Inclusion of such
digital resources in language tests may raise awareness about their legitimacy in
language teaching and learning. Teachers may include more activities that
develop learners’ abilities to apply the digital resources effectively (e.g., how to
use language corpora for learning collocations or how to find relevant synonyms
in dictionaries), but also activities that improve learners’ understanding of priv-
acy, copyright, and plagiarism issues related to digital source use.
Digitally delivered language tests also enable language programs to develop
platforms for systematic data collection, which can help track learners’ progress
and provide continuous diagnostic feedback, as well as develop adequate instruc-
tional goals and objectives. Such a platform can facilitate the model of Cognitively Based Assessment of, for, and as Learning, the purpose of which is to document what students have achieved (assessment of learning), to plan instruction (assessment for learning), and to offer a worthwhile educational experience in and of itself (assessment as learning) (Bennett, 2010).
Summary
This chapter has discussed the advantages and disadvantages of traditional and
digital delivery of language tests, as well as fundamental considerations regarding
digital literacy, authenticity, security, and washback. Deciding on the delivery
method is a crucial step that determines the task design, test administration, and
maintenance. Therefore, you should examine the availability of human and
technological resources and the adequacy of the technological infrastructure at
your institution before making any decisions about the test delivery platform. If
you are considering changing the delivery method of an existing language test
from traditional to digital, you should first investigate the degree to which this
change will affect test-takers’ performance and whether it aligns with the testing
purpose.
Further reading
Chapelle, C. A., & Douglas, D. (2006). Assessing language through computer technology.
Cambridge, UK: Cambridge University Press.
Chapelle and Douglas (2006) present and discuss the most recent developments
in computer technology and how technology has affected language test develop-
ment and delivery. The first chapter introduces the topic of technology in rela-
tion to language assessment and explains which groups of people might benefit
from reading this book. The next chapters touch upon a variety of topics related
to technology in language testing, including how the interaction between input
and response can be different when a test is technology-based, and what the
threats to test validity are when technologies such as automated scoring or adap-
tive item selection are used. One chapter speaks directly to classroom teachers' needs regarding computerized testing and works as a step-by-step guide to creating computer-based tests on a commercial platform. The administration of computerized tests is also discussed. The final chapter concludes the discussion by explaining how promising the future of computerized language testing looks, given the fast-increasing transformative power of technology.
Chapelle, C. A., & Voss, E. (2016). 20 years of technology and language assess-
ment in Language Learning & Technology. Language Learning & Technology,
20(2), 116–128.
Chapelle and Voss (2016) examine the use of technology in language testing by
synthesizing language assessment research published in Language Learning & Tech-
nology (LLT) from 1997 to 2015. The authors selected 25 articles and reviews
for the synthesis and categorized them into two main themes: technology for
efficiency and for innovation. The scholarship that emphasizes technology to
improve test efficiency covers research on adaptive testing, automated writing
evaluation (AWE), and comparability of different test versions. Going beyond efficiency, research on technology for innovation opens opportunities to connect language testing, learning, and teaching. Examples include individualized assessment and feedback within AWE systems and new learning and assessment environments in distance-learning contexts. Technology-mediated assessment also leads to redefining language constructs, introducing novel task designs (e.g., an asynchronous online discussion), attending to the consequences of innovative technology-based assessment, and evaluating authoring software. The authors call for more field-wide efforts to develop expertise in appropriately addressing issues in technology-mediated testing.
Douglas, D., & Hegelheimer, V. (2007). Assessing language using computer
technology. Annual Review of Applied Linguistics, 27, 115–132.
After a brief review of Jamieson's well-known (2005) article on computer-based language assessment, the authors discuss both the promises and threats associated with computer-based tests. Their discussion includes the definition of the construct measured in computer-delivered tests and how the nature of the language construct differs between these tests and paper-and-pencil tests. The authors also provide a technical discussion of the authoring tools used in computer-based or web-based tests. When it comes to computer-based tests, a discussion of how scoring and reporting procedures differ from traditional paper-and-pencil tests is inevitable. Automated scoring of speaking and
writing is briefly touched upon in this section of the paper. The last section of
the article is about the validation issues that surface in computer-based testing.
In this section, the authors argue that research to validate the specific interpret-
ation of a computer-delivered test is of utmost importance. This discussion is
supported by examples of validation research conducted by various researchers
to validate TOEFL iBT.
Fulcher, G. (2003). Interface design in computer-based language testing. Language
Testing, 20(4), 384–408.
Fulcher (2003) describes the three phases of developing a computer-delivered lan-
guage test interface. The author states that the article is unique because interfaces used for computer-based language tests are not publicly available for
test-taker identity pose practical issues. CBT also brought innovations to lan-
guage skill assessments mainly in terms of new independent tasks types, skill
integration, and automatic rating. While addressing concerns related to the
innovative tasks and rating, the author predicts continued growth in CBT as
technology advances.
References
Afflerbach, P., & Cho, B. (2009a). Determining and describing reading strategies. In
H. S. Waters & W. Schneider (Eds.), Metacognition, strategy use, and instruction, (pp. 201–225).
New York: Guildford Press.
Afflerbach, P., & Cho, B. Y. (2009b). Identifying and describing constructively responsive
comprehension strategies in new and traditional forms of reading. In S. E. Israel &
G. G. Duffy (Eds.), Handbook of research on reading comprehension, (pp. 69–90). New York:
Routledge.
Ananiadou, K., & Claro, M. (2009). 21st century skills and competences for new millennium
learners in OECD countries. OECD Education Working Papers, No. 41, OECD Publish-
ing. Doi:10.1787/218525261154
Bachman, L. F., & Palmer, A. S. (1996). Language testing in practice: Designing and developing
useful language tests. Oxford, UK: Oxford University Press.
Bailey, K. M. (1996). Working for washback: A review of the washback concept in language
testing. Language Testing, 13(3), 257–279.
Bangert-Drowns, R. L. (1993). The word processor as an instructional tool: A
meta-analysis of word processing in writing instruction. Review of Educational Research, 63
(1), 69–93.
Bennett, R. E. (2010). Cognitively Based Assessment of, for, and as Learning (CBAL):
A preliminary theory of action for summative and formative assessment. Measurement, 8
(2–3), 70–91.
Bertino, E., Paci, F., Ferrini, R., & Shang, N. (2009). Privacy-preserving digital identity
management for cloud computing. IEEE Computer Society Data Engineering, 32(1), 21–27.
Buckingham, D. (2007). Digital media literacies: Rethinking media education in the age of
the internet. Research in Comparative and International Education, 2(1), 43–55.
Chapelle, C. A., & Douglas, D. (2006). Assessing language through computer technology. Cam-
bridge, UK: Cambridge University Press.
Cheng, L. (2013). Consequences, impact, and washback. In A. J. Kunnan (Ed.), The com-
panion to language assessment (Vol. 3, pp. 1130–1145). Chichester, UK: John Wiley
and Sons.
Coiro, J. (2011). Predicting reading comprehension on the internet: Contributions of off-
line reading skills, online reading skills, and prior knowledge. Journal of Literacy Research,
43(4), 352–392.
Coiro, J., & Dobler, E. (2007). Exploring the online reading comprehension strategies used
by sixth-grade skilled readers to search for and locate information on the internet. Read-
ing Research Quarterly, 42(2), 214–257.
Csapó, B., Ainley, J., Bennett, R. E., Latour, T., & Law, N. (2012). Technological issues
for computer-based assessment. In E. Care, P. Griffin, & M. Wilson (Eds.),
Assessment and teaching of 21st century skills (pp. 143–230). Dordrecht, NL: Springer.
Dooey, P. (2008). Language testing and technology: Problems of transition to a new era.
ReCALL, 20(1), 21–34.
Douglas, D. (2000). Assessing languages for specific purposes. Cambridge, UK: Cambridge Univer-
sity Press.
Douglas, D., & Hegelheimer, V. (2007). Assessing language using computer technology.
Annual Review of Applied Linguistics, 27 (2007), 115–132. Doi:10.1017/S0267190508070062
Eagleton, M. B., & Guinee, K. (2002). Strategies for supporting student internet inquiry.
New England Reading Association Journal, 38(2), 39–47.
Evering, L. C., & Moorman, G. (2012). Rethinking plagiarism in the digital age. Journal of
Adolescent & Adult Literacy, 56(1), 35–44.
Fabos, B. (2008). The price of information: Critical literacy, education, and today’s inter-
net. In J. Coiro, M. Knobel, D. Leu, & C. Lankshear (Eds.), Handbook of research on new
literacies (pp. 839–870). Mahwah, NJ: Lawrence Erlbaum.
Galaczi, E. D. (2010). Face-to-face and computer-based assessment of speaking: Challenges
and opportunities. In L. Araújo (Ed.), Proceedings of the Computer-Based Assessment (CBA)
of foreign language speaking skills (pp. 29–51). Brussels, Belgium: European Council.
Garcia, I., & Pena, M. I. (2011). Machine translation-assisted language learning: Writing for
beginners. Computer Assisted Language Learning, 24(5), 471–487.
Ginther, A. (2003). International teaching assistant testing: Policies and methods. In
D. Douglas, (Ed.), English language testing in US colleges and universities (pp. 57–84) Wash-
ington, DC: NAFSA.
Henry, L. A. (2006). SEARCHing for an answer: The critical role of new literacies while
reading on the internet. The Reading Teacher, 59(7), 614–627.
Huff, K. L., & Sireci, S. G. (2001). Validity issues in computer-based testing. Educational
Measurement: Issues and Practice, 20(3), 16–25. Doi:10.1111/j.1745-3992.2001.tb00066.x
ISO/IEC 27000 (2009). Information technology – Security techniques – Information security man-
agement systems – Overview and vocabulary. 2009. ISO/IEC 27000:2009(E).
ISO/IEC 27001 (2005). Information technology – Security techniques – Information security man-
agement systems – Requirements. 2005. ISO/IEC FDIS 27001:2005(E).
ISO/IEC 27002 (2005). Information technology – Security techniques – Code of practice for infor-
mation security management. 2005. ISO/IEC 27002:2005(E).
Jin, Y., & Yan, M. (2017). Computer literacy and the construct validity of a high-stakes
computer-based writing assessment. Language Assessment Quarterly, 14(2), 101–119.
Jones, B., & Flannigan, S. L. (2006). Connecting the digital dots: Literacy of the 21st
century. Educause Quarterly, 29(2), 8–10.
Laborda, J. G. (2007). From Fulcher to PLEVALEX : Issues in interface design, validity and
reliability in internet based language testing. CALL-EJ Online, 9(1), 1–9.
Leijten, M., Van Waes, L., Schriver, K., & Hayes, J. R. (2014). Writing in the workplace: Con-
structing documents using multiple digital sources. Journal of Writing Research, 5(3), 285–337.
Ling, G., Mollaun, P., & Xi, X. (2014). A study on the impact of fatigue on human raters
when scoring speaking responses. Language Testing, 31(4), 479–499.
Norris, J. M. (2016). Current uses for task-based language assessment. Annual Review of
Applied Linguistics, 36(2016), 230–244.
O’Loughlin, K. J. (2001). The equivalence of direct and semi-direct speaking tests. Cambridge,
UK: Cambridge University Press.
Ockey, G. J. (2007). Construct implications of including still image or video in
computer-based listening tests. Language Testing, 24(4), 517–537.
Ockey, G. J. (2009). Developments and challenges in the use of computer-based testing for
assessing second language ability. Modern Language Journal, 93(SUPPL. 1), 836–847.
Pecorari, D., & Petrić, B. (2014). Plagiarism in second-language writing. Language Teaching,
47(3), 269–302.
Powers, D. E. (1999). Test anxiety and test performance: Comparing paper-based and
computer-adaptive versions of the GRE general test. ETS Research Report Series, 1999
(2), i–32.
Raitman, R., Ngo, L., Augar, N., & Zhou, W. (2005). Security in the online e-learning
environment. In 5th IEEE International Conference on Advanced Learning Technologies :
ICALT 2005: proceedings: 5–8 July, 2005, Kaohsiung, Taiwan (pp. 702–706).
Roever, C. (2001). Web-based language testing. Language Learning & Technology, 5(2),
84–94.
Sawaki, Y., Stricker, L. J., & Oranje, A. H. (2009). Factor structure of the TOEFL
internet-based test. Language Testing, 26(1), 5–30.
Spiro, R. (2004). Principled pluralism for adaptive flexibility in teaching and learning to
read. In R. B. Ruddell & N. Unrau (Eds.), Theoretical models and processes of reading (5th
ed., pp. 654–659). Newark, DE: International Reading Association.
Taylor, C., Jamieson, J., Eignor, D., & Kirsch, I. (1998). The relationship between com-
puter familiarity and performance on computer-based TOEFL test tasks. ETS Research
Report Series, 1998(1), i–30.
Van Deursen, A. J., & Van Dijk, J. A. (2009). Using the internet: Skill related problems in
users’ online behavior. Interacting with Computers, 21(5-6), 393–402.
Watson, R. (2014, February 10). Student visa system fraud exposed in BBC investigation.
BBC News. Retrieved from www.bbc.com/news/uk-26024375
Wilkins, H., Pratt, A., & Sturge, G. (2018, August 29). TOEIC visa cancellations.
Retrieved from https://researchbriefings.parliament.uk/ResearchBriefing/Summary/
CDP-2018-0195.
Xinying, Z. (2015, October 12). Examiners get tough on students who cheat. China Daily.
Retrieved from www.chinadaily.com.cn/china/2015-10/12/content_22159695.htm
Zenisky, A. L., & Sireci, S. G. (2002). Technological innovations in large-scale assessment.
Applied Measurement in Education, 15(4), 337–362.
6
SCALING
This chapter focuses on the different types of language test scoring systems, with
particular attention to scale design for performance-based tests in speaking and
writing. Alongside the presentation of the different types of scales (analytic, holis-
tic, primary-trait), this chapter discusses different scale design methods (data-
driven, theory-driven) and benchmarking, i.e., finding representative performances
of each scale level. It argues that although the choice of approach depends on the
resources available in the local context, it is important to consider three main
components throughout the scale development process: (1) alignment between the
scale and theories of language and language development, (2) ability of the scale
to distinguish levels of test-taker performance, and (3) scale usability for test raters.
The chapter ends with a discussion on the issues that must be considered during
scale validation, and the unique challenges and opportunities for scaling in local
testing contexts.
Introduction
To many language teachers and program administrators, scoring language per-
formances might seem like an easy task. Some may wonder why much effort
needs to be dedicated to scoring, as “the data (performance) will speak for
itself.” This may be true if you are measuring weight, height, or even diabetes,
as there are widely accepted objective measurements for these biological or
physiological traits. However, the evaluation of psychological constructs, such as
language ability, is a “complex, error-prone cognitive process” (Cronbach, 1990,
p. 584). Oftentimes, the test performance is too complex to convey a clear pic-
ture of the examinee’s language proficiency without interpretation from experi-
enced and well-trained experts, or else our current understanding of language
Imagine that you have to evaluate a 500-word essay in 10 minutes while consid-
ering all of the above criteria; you might think “it is impossible to summarize
everything with a single score,” but you could probably bite the bullet and finish the task, even though it would require great concentration. If you are asked to
evaluate ten essays per day for a week, however, you might start to ignore the
criteria and instead rely solely on your general impression to assign the scores.
Of course, as you are using this coping strategy, you might question the value
of language testing and want to run away from assessment for good. Indeed,
scoring language performance should not be merely about quantity, but the
above situations, emotions, and strategies are common among language teachers
and program administrators, especially when appropriate design and quality con-
trol are missing during the scoring process.
One way to address these problems is to design an effective rating scale to
guide the scoring process. In this chapter, we will delineate what scaling entails
by introducing the advantages and disadvantages of rating scales, the types of
rating scale used in assessment practice, and different approaches to scale devel-
opment and validation. At the end of the chapter, we will discuss the unique
challenges and exciting opportunities in scale development for local language
tests.
There is no scale type that is inherently better than others; however, depending
on the purpose and resources available for your local test, there is usually
a better option that can meet your assessment needs. This section will introduce
different types of rating scales commonly used in language testing, along with
the kinds of information they can provide. From the perspective of measure-
ment, all abilities can be assessed on four levels of measurement: namely, nom-
inal, ordinal, interval, and ratio scales (Stevens, 1946). In language testing, there
are a wide variety of rating scales used in scoring language performance. How-
ever, they all belong to two general categories of rating scale: holistic and ana-
lytic. A holistic scale features a single construct, and raters only need to assign one
score to each performance; in contrast, an analytic scale, also referred to as a multiple-trait scoring rubric, has more than one (sub-)construct, and raters need to assign a score to each aspect of the assessed ability or skill and then derive a total score, either by summing the subscores or by weighting the subscores to obtain some kind of average.
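To illustrate the arithmetic involved, the sketch below combines a set of hypothetical analytic subscores into a simple sum and a weighted composite; the criteria and weights are invented for demonstration and are not taken from any scale discussed in this chapter.

```python
# Combining analytic subscores into a total: a plain sum and a weighted composite.
# Criteria names, scores, and weights are purely illustrative.

subscores = {"content": 4, "organization": 3, "lexico_grammar": 3, "mechanics": 4}
weights = {"content": 0.4, "organization": 0.3, "lexico_grammar": 0.2, "mechanics": 0.1}

total = sum(subscores.values())                                # simple sum: 14
weighted = sum(subscores[c] * weights[c] for c in subscores)   # weighted composite: 3.5

print(f"Summed total: {total}, weighted composite: {weighted:.2f}")
```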
As research has gradually uncovered factors in the design of language tasks that may affect language performance and its evaluation by human raters, primary trait scales have been developed. This type of rating scale
is a variety of holistic scale but is developed specifically for each prompt or
task. Primary trait scales can provide rich, prompt-specific information
about the performance at each level, although it is important to note that
this kind of scale can be time-consuming to develop. Thus, it is not widely
used in large-scale language tests. However, it can be useful in local con-
texts, given the narrower scope of the test. For example, a post-admission
writing placement test can include two writing tasks: (1) email writing,
which assesses students’ pragmatic knowledge of daily written communica-
tion, and (2) integrated argumentative essay, which assesses students’ com-
mand of lexico-grammar and knowledge of the rhetorical structures of
academic writing. As the two tasks assess fairly different writing skills, it
might be desirable to develop a separate scale for each task.
As language is complex and involves many subcomponents, the development and use of hybrid scales have become common practice. Testing
researchers and practitioners have explored the development and application
of a hybrid holistic scale. This type of scale features lengthy or rich descriptors
for each level, covering a range of analytic components. An example of the
hybrid holistic rating scale can be found in the Oral English Proficiency Test
(OEPT) at Purdue University, which was developed to measure the English
speaking skills of prospective international teaching assistants (see Bailey,
1983, for a review of the history behind oral proficiency testing for inter-
national teaching assistants in the US) (see Example 6.1). The scale levels and
descriptors are presented in Figure 6.2 (taken from Kauper, Yan, Thirakun-
kovit, & Ginther, 2013).
The field has also witnessed the use of a binary analytic scale. Like regular analytic
scales, this type of scale still requires a score on each criterion, but reduces the scale
to a set of binary questions or options to lower the difficulty in rating. Two types of
binary analytic scales are common in language testing. The first type is called an
empirically-derived, binary-choice, boundary-definition (EBB) scale, developed by Upshur
and Turner (1995). An EBB scale is a task-specific scale that decomposes a holistic
construct or scale into “a hierarchical (ordered) set of explicit binary questions relat-
ing to the performance being rated” (p. 6). The answer to each question breaks the
performances into two levels. The first question often constitutes the most important
cut on a holistic element (e.g., is the speech comprehensible?). As the level progresses,
the questions become more fine-grained (analytic). By answering these questions in
the specified order, the raters are guided to assign a holistic level to a particular lan-
guage performance at the end. For demonstration purposes, Figure 6.3 presents an
example of a simplified 4-point EBB scale for speaking ability. The speaking ability
was decomposed into two levels of binary choices, with a total of three binary ques-
tions. The questions start from a holistic element (i.e., comprehensibility) and grad-
ually move down to more specific elements (i.e., coherence and lexico-grammar at
the second level).
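For readers who find it helpful to see the decision logic written out, the following sketch encodes a hypothetical 4-point EBB scale of the kind illustrated in Figure 6.3. The wording of the questions and the level boundaries are our own simplifications, not a published scale.

```python
# A hypothetical 4-point EBB scale: three ordered yes/no questions,
# starting from a holistic cut (comprehensibility) and moving to finer criteria.

def ebb_level(comprehensible: bool, coherent: bool, accurate_lexico_grammar: bool) -> int:
    """Return a holistic level (1-4) by walking the binary-question hierarchy."""
    if not comprehensible:                              # Q1: the most important holistic cut
        return 1 if not coherent else 2                 # Q2: coherence splits the lower band
    return 3 if not accurate_lexico_grammar else 4      # Q3: lexico-grammar splits the upper band

# Usage: a comprehensible but lexico-grammatically weak performance lands at level 3.
print(ebb_level(comprehensible=True, coherent=True, accurate_lexico_grammar=False))  # 3
```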
The other type of binary analytic scale is called a performance decision tree. Similar to the EBB scale, performance decision trees also feature a number of binary questions. However, in a decision tree, the raters have more flexibility in choosing the order in which they answer the binary questions, and raters need to sum the points from the different questions at the end of the scoring process to derive holistic scores. The principles behind performance decision trees are similar to those for
EBB scales. That is, the scale helps simplify or decompose the holistic construct
into analytic questions. However, as compared to EBB scales, performance deci-
sion trees are composed of more analytic questions and elements, and these ques-
tions do not necessarily form a hierarchy. Because of this flexibility, performance
decision trees can be used to tackle more complex diagnosis of language perform-
ance (e.g., the processing of specific lexico-grammatical items, and the sequencing
of turns in conversation). Moreover, raters do not have as much pressure to keep
a holistic view of the performance during the scoring process; instead, they can
focus on more fine-grained performance characteristics that are relevant to the
target language ability and work on the holistic score as the last step. An example
of a decision tree can be found in Figure 6.4, which is a performance decision
tree for the English Placement Test at University of Illinois at Urbana-Champaign
that was developed to help raters derive subscores on argumentation and lexico-
grammar and decide on the relative strengths and weaknesses of the two criteria.
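A performance decision tree can be sketched in a similar fashion, except that the binary checks may be answered in any order and their points are summed at the end. The criteria and checks below are invented for illustration and do not reproduce the EPT tree shown in Figure 6.4.

```python
# A hypothetical performance decision tree: independent yes/no checks whose
# points are summed into subscores, with the holistic score derived last.

CHECKS = {
    "argumentation": ["clear thesis", "relevant support", "source integration"],
    "lexico_grammar": ["accurate verb forms", "varied vocabulary", "few intrusive errors"],
}

def pdt_score(answers: dict) -> dict:
    """answers maps each check description to True/False, answered in any order."""
    subscores = {
        criterion: sum(answers[check] for check in checks)
        for criterion, checks in CHECKS.items()
    }
    subscores["holistic"] = sum(subscores.values())   # holistic score derived as the last step
    return subscores

# Usage with illustrative ratings of one essay:
answers = {"clear thesis": True, "relevant support": True, "source integration": False,
           "accurate verb forms": True, "varied vocabulary": False, "few intrusive errors": False}
print(pdt_score(answers))  # {'argumentation': 2, 'lexico_grammar': 1, 'holistic': 3}
```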
So far, we have browsed through a variety of rating scales commonly used in
language assessment. You might be overwhelmed or confused about which type
to use in your own local context. Sometimes, it is difficult to choose a clear
winner from these scales or to make connections among them, as they look
quite different. In theory, however, all of these scales can be placed along
a continuum of specificity, with holistic scales being the least specific and ana-
lytic scales being the most specific (see Figure 6.5). Nevertheless, in practice, it
is still convenient and perhaps more meaningful to classify rating scales according
to the holistic vs. analytic dichotomy. If you need fine-grained information or
diagnosis from assessment results, you will fare better with an analytic type of
rating scale. Conversely, if you need a rather efficient and effective measure to
place students into different levels, a holistic scale is the better option. We will
introduce different approaches that you can employ to develop a rating scale
from scratch in the following section.
raters, so that they can identify these key features easily when scoring the test
performances. Once the scale descriptors are established, the scale levels should
also be empirically established using statistical analysis (e.g., discriminant function
analysis). This procedure helps to reveal the number of performance levels that
the scores, based on the scale, can statistically distinguish.
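As an illustration of what such an analysis might look like, the sketch below assumes that you already have a matrix of quantified performance features and a set of provisional level labels; it uses linear discriminant analysis with cross-validation to gauge how well the proposed levels can be separated. The data and feature names are placeholders rather than real test data.

```python
# A sketch of checking how many scale levels the data can support:
# fit a linear discriminant model on quantified performance features and
# see how well provisional level labels are recovered under cross-validation.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 4))          # placeholder features, e.g., fluency, accuracy, complexity, cohesion
levels = rng.integers(1, 5, size=120)  # placeholder provisional levels 1-4 from rank ordering

accuracy = cross_val_score(LinearDiscriminantAnalysis(), X, levels, cv=5).mean()
print(f"Cross-validated level classification accuracy: {accuracy:.2f}")
# Low accuracy suggests the provisional levels are not well separated and fewer levels may be warranted.
```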
In practice, how the features are identified can vary depending on the extent
to which the developers appeal to theory during the development stages. For
example, when developing a scale for more domain-specific language tests, the
developers often refer to theoretical discussions about language performance in
those specific domains to identify key knowledge and skills. They may also
evaluate performances of high-proficiency speakers to identify representative
performance features of the key knowledge and skills. In contrast, when dealing
with the assessment of general language skills or proficiency, scale developers are
often left with only one option; they have to evaluate and rank order the
sample performances based on impressionistic ratings. Once consensus is reached
regarding the rank order, they then go about describing the differences across
those performances in an effort to uncover features that can differentiate per-
formances between adjacent levels.
In either approach, the key features are further classified and interpreted to
form the scoring criteria. Oftentimes, the performances will undergo a stage of
fine-grained analysis based on those key features. Then, these features will be
quantified and subjected to statistical analysis to reveal the number of performance
levels that are statistically distinguishable. This step helps to identify the scale
levels. Once the levels of performance are decided, the key performance features
identified in the first step are used to describe each level in the scale. Even
though this method allows for a close analysis of actual performance samples and
strengthens the link between scale and actual performance, it is not without criti-
cisms; researchers have noted that the data-driven approach to scale development
can be time-consuming, and it produces analytic descriptors – often linguistic con-
structs – that human raters might find difficult to use in real-time rating (e.g.,
Banerjee, Yan, Chapman, & Elliott, 2015; Fulcher et al., 2011; Upshur & Turner,
1999). In this regard, the measurement-driven approach has more advantages, as
scale descriptors developed by experienced instructors tend to be based on their
teaching experience and thus more teacher-friendly. Since teachers form the main
group of raters in local language programs, scales developed from the measure-
ment-driven approach are more usable for the raters.
worlds by involving experienced teachers and testers to evaluate sample test per-
formances during the scale development process. In doing so, the scale develop-
ment process should triangulate input from three information sources to reach
an optimal outcome: theory, rater, and performance. An example of the applica-
tion of this approach can be found in Banerjee et al. (2015, p. 8), where they
consider input from the three sources in revising the rating scale for the writing
section of the Examination for the Certificate of Proficiency in English (ECPE).
As Banerjee et al.’s (2015) study dealt with scale revision rather than scale development (although the two processes share more commonalities than differences),
we will sketch out a more general scale development process using the hybrid
approach based on their model. More importantly, they dealt with a large-scale
testing context. In local contexts, an additional source that we need to consider is
the alignment between assessment and instruction. This brings up a vital point
about scale development in local language programs; that is, teachers and testers
should collaborate in the process as they complement each other in terms of their
expertise. While testers can ensure the process is guided by sound assessment prin-
ciples, teachers know the curriculum and the students better. The following is an
example of a hybrid scale development (Yan, Kim, & Kotnarowski, in press). The
stages of the scale development process are represented in Figure 6.6.
rankings. During this discussion, the committee members noted the reasons
for both the consensus and disagreement among their rankings. The
agreed-upon reasons were maintained to develop scale descriptors, while
disagreement was discussed and revisited in the subsequent round to see if
consensus could be reached. The goal was to reach alignment but, at the
same time, allow all members to freely explain their own rationales in case
disagreement occurred. Until all members could reach consensus on most of
their evaluations, they needed to revisit the performances and consider how
many levels they could differentiate among the performances and whether
performances at each level shared similar characteristics. If such features did
exist at each level, the committee members tried to describe those features
in concise and precise language, and those descriptions formed the initial
draft of the rating scale.
The third stage was the piloting phase. Once the scale was drafted, the
developers pilot-rated more performances using the new scale to see if it
could be applicable to a larger pool of test performances. More teachers who were new to the scale were invited, to see whether the new scale allowed them to capture the differences among those performances as the committee members had. In this process, feedback was gathered and, based on that
feedback, adjustments to the scale were made.
After a couple of iterations of piloting and refining, the scale was launched
for operational use (the fourth stage). At this point, the developers should
start creating rater training materials and activities to help new raters learn to
use the scale. The specific procedures during the rater training were similar to
those in the second stage. However, the difference was that, at this stage,
new raters were assigned to benchmark performances, i.e., essays that showed the typical performance characteristics of each level and had received the same scores from the majority of the experienced raters. The use of benchmark per-
formances during rater training was crucial, as new raters could be easily over-
whelmed by the number of performance features to which they needed to
attend. Thus, using benchmarks could help raters ease into the rating scale by
simplifying the variability among examinee performances while still maintain-
ing the full range of performances represented on the test.
examinee population and the needed resources are lacking, as is often the case for
local testing programs, we need to look for viable alternatives. One way to exam-
ine the psychometric quality of a rating scale is to look at the raw scores that
examinees receive. First, the distribution of the final score can be calculated to
check whether examinees’ scores on the test follow a normal distribution.
A highly skewed distribution tends to suggest low discrimination power for the
scale. To examine how a scale functions across criteria, test conditions, or exam-
inee backgrounds, we can divide examinee scores by those factors and then exam-
ine the average and distributions for each condition or group to determine
whether examinees tend to receive the same, similar, or consistently higher or
lower scores across conditions. Put simply, the score distribution and descriptive statistics provide very useful information for examining the psychometric quality of rating scales and thus should be employed in all contexts.
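If you keep the scores in a simple data table, these checks require only a few lines of analysis. The sketch below, with invented scores and groups, computes the skewness of the final scores and descriptive statistics by examinee group.

```python
# A sketch of basic score-distribution checks: overall skewness and
# descriptive statistics by examinee group. The data are illustrative only.
import pandas as pd
from scipy.stats import skew

scores = pd.DataFrame({
    "score": [3, 4, 2, 5, 3, 4, 3, 2, 4, 5, 3, 4],
    "group": ["undergrad", "grad"] * 6,   # e.g., examinee background or test condition
})

print(f"Skewness of final scores: {skew(scores['score']):.2f}")   # values far from 0 suggest low discrimination
print(scores.groupby("group")["score"].describe())                # mean, spread, and range per group
```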
Summary
This chapter introduced the fundamental concepts, procedures, and issues related
to scaling for local language tests. If you are reading this chapter for a general
learning purpose, we hope that you have obtained a general sense of what scaling
entails in practice and why the procedures involved in scale development and val-
idation are necessary. If you are reading this chapter to find a solution to
a practical scaling problem, we hope that we have offered possible solutions to
your problem.
In any case, following a systematic approach to scale development and use is
essential. The development, implementation, use, and validation of a rating scale
should be an iterative process. During this process, the scale developers should
keep in mind that theory cannot be ignored, and teaching and assessment
experience is just as informative as the analysis of performance data.
Lastly, through the iterative discussions during scale development, implemen-
tation, and use, teachers and testers can help standardize the conceptualization
and operationalization of language ability in both assessment and instructional
contexts. The involvement of different stakeholders in scale development and
validation can promote collaborative dialogue and practice in language assess-
ment among stakeholder groups. Ultimately, this collaboration will strengthen
the alignment between assessment and pedagogy in local language programs.
Further reading
Alderson, J. C. (1991). Bands and scores. In J. C. Alderson & B. North (Eds.),
Language testing in the 1990s: The communicative legacy (pp. 71–86). London, England: Macmillan.
Alderson (1991) discusses the purposes, uses, and issues centered around band
scores of language tests, particularly focusing on the ELTS Revision Project.
After providing background on language test proficiency scales (e.g., the Interagency Language Roundtable scales, the Australian Second Language Proficiency Ratings, and ELTS bands), the author introduces three main purposes of scales (i.e., user-oriented, assessor-oriented, and constructor-oriented), which serve, respectively, to report test results, guide the assessment process, and control test construction. The author also addresses issues that impede
successfully achieving the functions and purposes, for example, inconsistency
between performance descriptions in scales and actual performance elicited by
test tasks, and mismatch between components in rating scales and reporting. The
author further discusses the process and issues of arriving at finalized perform-
ance-based test band score descriptors as well as converting comprehension-based test raw scores into band scores. The author points out the commonality
in the issues of band and test scores and emphasizes the benefits of using scales
for various stakeholders.
Fulcher, G., Davidson, F., & Kemp, J. (2011). Effective rating scale development
for speaking tests: Performance decision trees. Language Testing, 28(1), 5–29.
Fulcher et al. (2011) present the development of a Performance Decision Tree (PDT), a speaking test rating scale developed from a performance data-based approach. The authors begin by comparing performance data-based and measurement-driven approaches to designing and developing scales. They argue for performance data-based approaches because performance data-based scales provide sufficient descriptions directly linked to examinee performance, which enables rigorous inferences from scores to domain-specific performance, whereas measurement model-based scales misrepresent language abilities as simplified linear levels and offer decontextualized descriptions of abilities. Hence, the authors developed the PDT, which features binary choice-based decision-making and rich description based on a thorough review of the literature on interactional competences in the service encounter domain (e.g., discourse competence, discourse management competence, and pragmatic competence) and on analysis of performance data. The authors claim superior validity and efficiency of the PDT for assessing communication competence in service encounters in management contexts and call for continued research on performance data-based scales.
Upshur, J. A., & Turner, C. E. (1995). Constructing rating scales for second language
tests. ELT Journal, 49(1), 3–12.
Upshur and Turner (1995) introduce “empirically-derived, binary-choice, bound-
ary-definition (EBB) scales,” (p. 6) to remedy reliability and validity issues of using
standard rating scales in instructional settings. The authors argue that the theory-based approach of standard rating scales does not reflect actual learner performance due to characteristics inherent to instructional settings, for example, narrow proficiency ranges and frequent interactions between teaching conditions and progress. The authors propose EBB scales as an alternative; these consist of binary questions in a hierarchically ordered decision-making structure. The authors developed two
EBB scales to respectively assess grammatical accuracy and communicative effective-
ness of a story-retell task that 99 French ESL fifth graders took. They provide
detailed descriptions of the test development and administration procedures, and
report high reliabilities for both communicative effectiveness (r=0.81) and gram-
matical accuracy (r=0.87). The authors highlight that focusing on differences and boundaries between scale categories rather than on the midpoint of a category, and using precise and simple descriptors, improved rater reliability. They also stress the positive influence of locally developed scales on enhancing the validity of ratings.
References
Bailey, K. (1983). Foreign teaching assistants at US universities: Problems in interaction and
communication. TESOL Quarterly, 17(2), 308–310.
Banerjee, J., Yan, X., Chapman, M., & Elliott, H. (2015). Keeping up with the times:
Revising and refreshing a rating scale. Assessing Writing, 26, 5–19.
Bridgeman, B., Cho, Y., & DiPietro, S. (2016). Predicting grades from an English language
assessment: The importance of peeling the onion. Language Testing, 33(3), 307–318.
Brindley, G. (1998). Outcomes-based assessment and reporting in language learning pro-
grammes: A review of the issues. Language Testing, 15(1), 45–85.
Chapelle, C. A., Chung, Y. R., Hegelheimer, V., Pendar, N., & Xu, J. (2010). Towards a
computer-delivered test of productive grammatical ability. Language Testing, 27(4),
443–469.
Cho, Y., & Bridgeman, B. (2012). Relationship of TOEFL iBT® scores to academic per-
formance: Some evidence from American universities. Language Testing, 29(3), 421–442.
Council of Europe. (2009). Relating language examinations to the common European framework
of reference for languages: Learning, teaching, assessment. A manual. Strasbourg: Language
Policy Division.
Council of Europe (2011). Common European Framework of Reference for Languages:
Learning, Teaching, Assessment. Council of Europe.
Cronbach, L. J. (1990). Essentials of psychological testing (5th ed.). New York, NY: Harper
Collins Publishers.
Crusan, D., Plakans, L., & Gebril, A. (2016). Writing assessment literacy: Surveying second
language teachers’ knowledge, beliefs, and practices. Assessing Writing, 28, 43–56.
Dimova, S. (2017). Life after oral English certification: The consequences of the test of oral
English proficiency for academic staff for EMI lecturers. English for Specific Purposes, 46,
45–58.
Fulcher, G., Davidson, F., & Kemp, J. (2011). Effective rating scale development for speak-
ing tests: Performance decision trees. Language Testing, 28(1), 5–29.
Ginther, A., & Yan, X. (2018). Interpreting the relationships between TOEFL iBT
scores and GPA: Language proficiency, policy, and profiles. Language Testing, 35(2),
271–295.
Hamp-Lyons, L. (1995). Rating nonnative writing: The trouble with holistic scoring. TESOL
Quarterly, 29(4), 759–762.
Jarvis, S., Grant, L., Bikowski, D., & Ferris, D. (2003). Exploring multiple profiles of
highly rated learner compositions. Journal of Second Language Writing, 12(4), 377–403.
Jin, Y. (2010). The place of language testing and assessment in the professional preparation
of foreign language teachers in China. Language Testing, 27(4), 555–584.
Kauper, N., Yan, X., Thirakunkovit, S., & Ginther, A. (2013). The Oral English Proficiency
Test (OEPT) technical manual. (pp. 21–32). West Lafayette, IN: Purdue University Oral
English Proficiency Program.
Knoch, U. (2009). Diagnostic writing assessment: The development and validation of a rating scale
(Vol. 17). New York: Peter Lang.
Knoch, U., & Chapelle, C. A. (2018). Validation of rating processes within an
argument-based framework. Language Testing, 35(4), 477–499.
Lee, I. (2010). Writing teacher education and teacher learning: Testimonies of four EFL
teachers. Journal of Second Language Writing, 19(3), 143–157.
McReynolds, P., & Ludwig, K. (1987). On the history of rating scales. Personality and Indi-
vidual Differences, 8(2), 281–283.
Pollitt, A., & Murray, N. L. (1996). What raters really pay attention to. Studies in Language
Testing, 3, 74–91.
Popham, W. J. (2001). The truth about testing: An educator’s call to action. Alexandria, Virginia,
USA: ASCD.
Purpura, J. E. (2004). Assessing grammar. New York: Cambridge University Press.
Stevens, S. S. (1946). On the theory of scales of measurement. Science, 103(2684), 677–680.
Upshur, J. A., & Turner, C. E. (1995). Constructing rating scales for second language tests.
ELT Journal, 49(1), 3–12.
Upshur, J. A., & Turner, C. E. (1999). Systematic effects in the rating of second-language
speaking ability: Test method and learner discourse. Language Testing, 16(1), 82–111.
Weigle, S. C. (2007). Teaching writing teachers about assessment. Journal of Second Language
Writing, 16(3), 194–209.
Yan, X., Kim, H., & Kotnarowski, J. (in press). The development of a profile-based writing
scale: How collaboration with teachers enhanced assessment practice in a post-admission
ESL writing program in the USA. In B. Lanteigne, C. Coombe, & J. Brown (Eds.), Issues
in language testing around the world: Insights for language test users. New York: Springer.
Yan, X., & Staples, S. (2019). Fitting MD analysis in an argument-based validity framework
for writing assessment: Explanation and generalization inferences for the ECPE. Language
Testing, 0265532219876226.
Yan, X., Zhang, C., & Fan, J. J. (2018). “Assessment knowledge is important, but … ”:
How contextual and experiential factors mediate assessment practice and training needs
of language teachers. System, 74, 158–168.
7
RATERS AND RATER TRAINING
This chapter discusses the process of rater selection, initial training, and continu-
ous norming. It addresses the different methods of rater training, the importance of rater agreement for the reliability of test results, and the different instruments and methods that can be employed to analyze rater behavior and inter- and intra-rater reliability. More specifically, the chapter presents how to compute consensus, consistency, and measurement estimates, as well as how to examine rater effects in the local testing context. The chapter ends with an overview of
best practices for rater training, including rater cognition, benchmarking, as well
as establishing a rater group as a community of practice.
Introduction
In local testing, be it for placement, screening, or certification purposes, the
raters are usually language teachers. While teachers are familiar with test-takers
and test-taker performance in the local context, they need training to rate test
performance in a consistent manner. When rater training is not in place, we
might observe two kinds of raters or rater behavior. The first kind of raters
appear quite confident about how the scale works, but their ratings show that
they do not apply the scale or interpret the scale descriptors in the intended way. The other kind of raters are the opposite of the confident raters; they are less confident or comfortable with the rating scale (or new scales in general). When struc-
tured rater training is not in place, they may avoid using the scale and instead
resort to their regular assessment practices in class or their own conceptualiza-
tions of language proficiency to evaluate language performances. In either case,
because of the lack of standardized rater training, we might observe a great
degree of individual variation in their scores and scoring behavior. This does not
necessarily mean that teachers do not know what language ability is. In fact, if
teachers are asked to rank a batch of language performances, they are usually
able to distinguish between high and low proficiency performances. However,
agreement on the evaluation of mid-level performance can be difficult to
achieve. Unfortunately, this level or range of proficiency is what you might be
dealing with in a local testing context. However, given teachers’ background in
language pedagogy and sensitivity to different profiles of language performances,
with a structured rater training program, they can be easily trained to reach con-
sensus on the meaning and value associated with language proficiency in the
particular local context and on aspects of language performances to focus on
during evaluation.
If you are a teacher working predominantly in language classrooms, you
might also wonder why rater agreement is important for local tests. It is
a common attitude or stance regarding assessment among teachers and teacher
educators that teachers should have the freedom to hold assessment close to
their own values and beliefs; they should not be standardized in terms of how
they evaluate students’ performance, as long as learning takes place. In fact,
standardization and freedom do not necessarily form a relationship of opposition.
A teacher should be granted freedom in terms of instructional approach and
materials selection. However, when multiple sections of the same class are
taught by different instructors, when students need to be placed or advanced
into different levels of courses, or when the instructional program is required to
achieve certain teaching or learning goals, standardization in assessment proced-
ures becomes important for program evaluation purposes (see Chapter 8 for pro-
gram evaluation). Similarly, in the classroom, teachers interact extensively with
the students and can repeatedly observe (that’s what assessment entails, in
a broad sense!) student performance across contexts. Their evaluation of the stu-
dents is less likely to be biased and more likely to reflect the students’ true
knowledge or skills. Thus, in classroom-based assessment contexts, educational
expertise and experience should be valued, and teachers should be encouraged
to use their daily observations to complement the formal assessments that they
use to evaluate students’ learning outcomes. In measurement terms, if repeated
testing or assessment provides similar test scores, the test or assessment possesses
high test-retest reliability. However, the assessment context for a local language
test is different. Teachers often do not know the examinees or have more than
one shot to accurately assess their language abilities. Instead, they have to use
a standardized rating scale to evaluate examinees’ performance. During the scor-
ing process, teachers’ instructional experience and familiarity with student per-
formance profiles in the courses can still be valuable for helping them distinguish
examinees across levels. However, rater training is needed to ensure that they
use the rating scale in similar ways.
Rater training for large-scale and local language tests can be very different.
First, large-scale tests tend to have large rater pools and thus can afford to select
the most reliable raters for each test administration. Local tests tend to have small
pools of raters and, for practical purposes, must employ all raters for operational
scoring. Thus, rater training tends to be more intensive as the goal is to align all
raters to the scale. However, compared to local tests, large-scale tests lack the abil-
ity to make finer distinctions among a subpopulation of examinees because they
lose the rich information that the test scores can provide about the examinees and
the opportunities the examinees can be given, as well as the resources and support
they can access. Because of their experience and familiarity with the local context, language teachers or other local members (e.g., other stakeholders in the local context) often possess sensitivity during the scoring process to the strengths and weaknesses of the examinees’ linguistic knowledge, as well as their impact on lan-
guage performance. Such sensitivity can contribute to the design of tasks, scoring
methods, and even research projects for local language testing. However, all of
these benefits are conditioned upon raters’ ability to reliably rate test performances
in the local context. In this chapter, we will discuss rater behavior and rater train-
ing models that are suitable for local language testing.
the scale, rater, and rater training. We want to make sure that the raters assign
the same score to the performance to ensure that they are using the scale con-
sistently. In other words, we want to make sure that, regardless of which rater
rates the performance, the test-taker receives the same score. This expected
agreement between raters is referred to as rater reliability in language testing or
psychological measurement literature. If two raters regularly assign different
scores to the same examinees, then the test scores will become less reliable and
meaningful, and the quality of the test will be compromised. In practice, if the
examinee takes a test twice with two different and unreliable raters, s/he will
likely receive two different scores. In that case, which score would be a better
reflection of his or her knowledge or skills? You might be tempted to average
the scores of the two raters, but oftentimes this is not the best option and should be a last resort, used only after other possible ways to improve rater reliability have been exhausted. This assessment principle applies to both large-scale and local language
tests.
The involvement of teachers as content experts or raters who are familiar
with the local contexts can bring both benefits and challenges to the scoring
process. Unlike large-scale tests, local tests typically cannot afford to lose raters,
even though some raters do not exhibit satisfactory rater reliability. They all
have to rate most of the time. Thus, rater reliability becomes essential to the
accuracy of test scores. However, the reliability of local tests often depends on
the group dynamic among the raters, as well as their individual differences. It is
within this constraint that rating and rater training for local language tests often
reside, but this constraint also gives the rater group a strong sense of community
and makes rater training a much more communal practice. When raters are
familiar with the local context, they are more aware of the purposes and uses of
the test. This awareness can keep them on track when scoring language per-
formances, rather than focusing on criteria not included in the rating scale.
Moreover, involving members in the local context can help to reinforce positive
test impact. If the test purports to place students into different ESL courses, and
the raters are instructors of those courses, then it is possible that they can
enhance the alignment between teaching and testing.
In summary, the involvement of members in the local context is often
a characteristic of the scoring process for local language tests. While local members
are fallible as raters, their involvement makes raters and rater training a much
more integrated system than the scoring process for large-scale language tests.
Consensus estimates
Consensus estimates reflect the percentage of agreement between the scores assigned
by two or more raters. Some commonly used consensus estimates include exact
agreement (percentage), adjacent agreement (percentage), and kappa statistics.
While exact agreement is defined as the percentage of exams on which two or
more raters assign exactly the same scores, adjacent agreement refers to the percent-
age of exams on which the scores assigned by two raters are only one point apart.
For example, if two raters assign a 3 on a 5-point rating scale for a particular exam-
inee, then we consider that the two raters have reached exact agreement on that
examinee. If, however, one rater assigns a 2, whereas the other assigns a 3, then we
would consider the two raters to have reached adjacent agreement. For some tests,
the sum of exact and adjacent agreement percentage is used as the estimate of rater
reliability to allow for human error and individual variation, because it is rather difficult to reach exact agreement on all occasions, even if the two raters might have similar qualitative evaluations of the same performance. In contrast, kappa statistics take an extra step to account for agreement by chance in the calculation of the agreement percentage; they treat part of the observed agreement as something the raters could have reached by guessing while scoring examinee performances. Thus, among the three consensus estimates, kappa is the most conservative, because it tends to be lower than the exact agreement statistic. Nevertheless, the interpretation of these estimates is the same; that is,
the closer to 1, the higher the level of rater agreement. To showcase an example of
consensus estimates of rater reliability, we present in Table 7.1 the rater agreement
statistics used to examine rater performance for regular test administrations of the
English Placement Test at UIUC.
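To make these definitions concrete, here is a minimal sketch, in Python, that computes exact agreement, adjacent agreement, and Cohen's kappa for two hypothetical raters on a 5-point scale. The scores are invented for illustration and are not drawn from the EPT statistics in Table 7.1.

from collections import Counter

rater_a = [3, 2, 4, 5, 1, 3, 4, 2, 3, 5]
rater_b = [3, 3, 4, 4, 1, 2, 4, 2, 3, 5]
n = len(rater_a)

# Exact agreement: proportion of exams on which both raters assign the same score.
exact = sum(a == b for a, b in zip(rater_a, rater_b)) / n

# Adjacent agreement: proportion of exams on which the two scores differ by one point.
adjacent = sum(abs(a - b) == 1 for a, b in zip(rater_a, rater_b)) / n

# Cohen's kappa: exact agreement corrected for the agreement expected by chance,
# where chance is estimated from each rater's marginal score distribution.
count_a, count_b = Counter(rater_a), Counter(rater_b)
p_chance = sum((count_a[s] / n) * (count_b[s] / n) for s in set(rater_a) | set(rater_b))
kappa = (exact - p_chance) / (1 - p_chance)

print(f"exact agreement:    {exact:.2f}")
print(f"adjacent agreement: {adjacent:.2f}")
print(f"exact + adjacent:   {exact + adjacent:.2f}")
print(f"Cohen's kappa:      {kappa:.2f}")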
Consistency estimates
The assumption underlying consistency estimates is that raters need to demon-
strate a similar or consistent way of ranking examinees according to the scale.
That is, more proficient examinees should receive higher scores than less profi-
cient examinees, even though two raters might not assign the exact same scores
to the examinees. Thus, consistency estimates differ from consensus estimates, as
they represent the extent to which raters rank order examinee performances in
a consistent manner. The most commonly used consistency estimates are vari-
ations of the correlation coefficient, including Pearson’s r, Spearman’s rho, and
Cronbach’s alpha. Both Pearson’s r and Spearman’s rho are correlation coeffi-
cients that are commonly used to estimate inter-rater reliability. The difference
is that, whereas the computation of Pearson’s r is based on the raw scores of the
examinees, Spearman’s rho is calculated based on the ranking of the examinees.
Understanding the real difference between Pearson's r and Spearman's rho
requires some foundational knowledge of statistics. However, a rule of thumb
for the choice of reliability coefficient is that, if the raw scores are not normally
distributed, then Spearman’s rho is recommended. While the first two coeffi-
cients apply to a pair of raters, Cronbach’s alpha is applicable to a group of
raters, indicating the extent to which the raters are functioning interchangeably
or aligned to the rating scale. All three of these consistency estimates are interpreted
in the same way: although correlation coefficients can, in principle, be negative,
values in rating data typically fall between 0 and 1, and the closer to 1, the more
aligned the raters are to one another or to the rating scale. Keep in mind, however,
that it is very unlikely that you will achieve 100% exact agreement, regardless of how
hard you train the raters. You should consider your training a success if you can
achieve 50% exact agreement. If two raters do not give the same scores, you can
always assign the exam to a third rater and then assign the final score more con-
servatively. This procedure is common for local language tests.
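As a companion to the consensus sketch above, the following minimal sketch (again with invented scores) computes Pearson's r and Spearman's rho for a pair of raters and Cronbach's alpha for a group of three raters, treating each rater as an "item". It assumes that the numpy and scipy libraries are available.

import numpy as np
from scipy.stats import pearsonr, spearmanr

# Rows = raters, columns = examinees (hypothetical scores on a 6-point scale).
scores = np.array([
    [2, 3, 4, 5, 3, 4, 6, 2, 3, 5],   # rater 1
    [3, 3, 4, 6, 3, 5, 6, 2, 4, 5],   # rater 2
    [2, 4, 5, 5, 2, 4, 5, 3, 3, 6],   # rater 3
])

r, _ = pearsonr(scores[0], scores[1])
rho, _ = spearmanr(scores[0], scores[1])

# Cronbach's alpha: k / (k - 1) * (1 - sum of rater variances / variance of total scores).
k = scores.shape[0]
rater_variances = scores.var(axis=1, ddof=1)
total_variance = scores.sum(axis=0).var(ddof=1)
alpha = k / (k - 1) * (1 - rater_variances.sum() / total_variance)

print(f"Pearson's r (raters 1 & 2):    {r:.2f}")
print(f"Spearman's rho (raters 1 & 2): {rho:.2f}")
print(f"Cronbach's alpha (all raters): {alpha:.2f}")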
You might think that consistency estimates are a less strict version of consensus
estimates. However, this is not the case. One drawback of consensus estimates is
that, if they are used as the only criteria to evaluate rater performance, then raters
may choose to use strategies to “game” the evaluation system. For example, sup-
pose we are training raters to use a 6-point rating scale to evaluate essay perform-
ances, and we set as the goal for the rater training program that, by the end of the
training, raters should achieve 40% exact agreement. This is not an easy task for
a local test, especially when you have 6 levels on the scale. To game the evalu-
ation system, a rater can play it safe by assigning only 3s and 4s to the exams s/he
is assigned to rate (i.e., the middle categories or the center of the scale). If the
scores on a test follow a normal distribution (i.e., most scores are distributed
around the center of the scale, and very few scores are distributed on both ends
of the scale), then there is a good chance that this rater will hit the target of 40%
exact agreement. This kind of playing-it-safe is a well-known rating strategy or
pattern called the central tendency effect. If we only use exact agreement consensus
estimates, we would perhaps not detect this strategy and might conclude that the
rater's performance is satisfactory when it is not. Consistency estimates, in contrast,
help to expose the pattern: a rater who compresses every score into the middle
categories is unlikely to rank examinees in the same order as raters who use the full
scale, so the correlations between their scores tend to drop.
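The simulation below is a minimal sketch of this point: a rater who assigns only 3s and 4s can reach roughly the same exact agreement with a set of reference scores as a careful rater does, while the Spearman correlation collapses. The data are simulated for illustration and are not drawn from any real rating session.

import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(7)

# Reference scores for 200 examinees on a 6-point scale, roughly normally distributed.
true_scores = np.clip(np.round(rng.normal(3.5, 1.0, 200)), 1, 6)

# A conscientious rater scores close to the reference score; a central-tendency rater
# assigns only 3s and 4s, more or less at random.
careful_rater = np.clip(true_scores + rng.integers(-1, 2, 200), 1, 6)
safe_rater = rng.integers(3, 5, 200)

for label, ratings in [("careful rater", careful_rater), ("central-tendency rater", safe_rater)]:
    exact = np.mean(ratings == true_scores)
    rho, _ = spearmanr(ratings, true_scores)
    print(f"{label:>24}: exact agreement = {exact:.2f}, Spearman's rho = {rho:.2f}")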
Measurement estimates
In contrast to consensus and consistency estimates, measurement estimates are
a third category of complex procedures that analyze rater performance at
a much more fine-grained level and often provide rich information about rater
behavior. We won’t elaborate on these estimates in this book, but some of these
procedures include the many-facet Rasch measurement model, generalizability theory,
and factor or principal component analysis. However, we will introduce two add-
itional rater effects that these measurement estimates often analyze and discuss
how they can be detected using less complicated methods.
First, imagine that raters are assigned to rate a speaking test using a 6-point
rating scale. This scale has four criteria: pronunciation, fluency, vocabulary, and
coherence, and each criterion requires raters to assign an independent score.
Suppose you have to finish rating 100 speaking performances; that would mean
500 scores (4 criterion scores plus 1 total score multiplied by 100). Would you
be tempted to assign the same scores across the 4 criteria just to speed up the
process? In fact, it could happen that raters do assign similar scores across criteria.
In the measurement literature, the phenomenon that score assignment on one
criterion affects score assignment on other criteria is called the halo effect.
Because the halo effect tends to involve multiple criteria, it occurs mostly in
analytic scoring (recall this term from Chapter 6). The occurrence of a halo
effect often defeats the purpose of analytic scoring: the rationale behind
analytic scoring is to provide more fine-grained information regarding different
aspects of examinees' language performance, allowing test users to make judgments
about the performance profiles (i.e., strengths and weaknesses) of each examinee,
and a halo effect flattens these profiles into a single, undifferentiated impression.
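One low-tech way to screen for a possible halo effect, short of a full measurement model, is to look at how strongly a rater's criterion scores are intercorrelated. The sketch below does this with invented scores; very high correlations across all pairs are only a warning sign, since genuinely flat examinee profiles can produce the same pattern.

import numpy as np

criteria = ["pronunciation", "fluency", "vocabulary", "coherence"]

# Rows = examinees, columns = the four criterion scores assigned by one rater.
ratings = np.array([
    [4, 4, 4, 4],
    [3, 3, 3, 3],
    [5, 5, 5, 4],
    [2, 2, 2, 2],
    [4, 4, 4, 4],
    [3, 3, 4, 3],
])

# Correlations near 1.0 across all criterion pairs suggest a possible halo effect;
# a measurement model (e.g., many-facet Rasch analysis) can then be used to confirm it.
corr = np.corrcoef(ratings, rowvar=False)
print("Inter-criterion correlations:")
for i in range(len(criteria)):
    for j in range(i + 1, len(criteria)):
        print(f"  {criteria[i]} vs {criteria[j]}: {corr[i, j]:.2f}")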
In local testing contexts, the raters are often teachers who already have many
responsibilities in their professional life. When they are involved in the scoring process, some of
them may have low motivation to learn about and use a new rating scale, espe-
cially when the rating scale differs from their regular assessment activities or
approaches. Below, we list a few quotes from language teachers to indicate their
motivation, attitudes, and beliefs toward (scoring) speaking or writing assessment.
These are important factors to consider when examining rater performance and
reliability in local testing contexts.
• We are language teachers, not testers. We should not be asked to rate writing
or speaking tests (both speaking and writing, low motivation).
• We teach writing, not language. We should only rate argument development,
not lexico-grammatical correctness (writing, attitudes toward different criteria in
the rating scale).
• This essay is very strong. It has five paragraphs, a clear stance at the end of
the introductory paragraph, and a topic sentence in each body paragraph
(writing, a very structural approach to argument development, without looking into
the content of the essay).
• I cannot understand anything he says. His accent is too strong (speaking,
focusing on accent instead of intelligibility).
• I don’t find the grammatical errors problematic. Everything she says is intelligible
to me (speaking, focusing only on intelligibility).
• I just disagree with this argument (speaking and writing, evaluating test performances
on ideas rather than language).
The above statements are not meant to provide an exhaustive list of rater issues that
you are likely to encounter in local testing contexts. However, they are listed here
to demonstrate the need for ongoing rater training for local language tests. More
importantly, the purpose of these example statements is to encourage you to con-
sider the background, work context, and prior experiences of the raters in their
local context and anticipate potential issues raters and rater trainers might encounter
during the rating and rater training process. As a rater trainer, you should keep in
mind that rating for a local test can be time-consuming and that teachers should be
compensated for scoring the test; alternatively, the hours they need to spend on test
scoring should be included in their normal workload. When giving feedback on the
teachers’ rating performance, you might also want to strike a balance between posi-
tive and negative comments to prevent them from developing negative feelings
towards the task. Low motivation among teachers might make rater training difficult
initially, but this issue usually improves over time if you continue to communicate
with the teachers about their concerns and needs.
Rating is thus not only a means of measuring examinees' knowledge and skills (i.e., language ability) on the local test, but also a communal
expedition that leads to a better understanding of language learners and users, which
thus demands collaborative effort from all members of the local (rater) community.
Through conversations on the analysis of examinee performance, new knowledge
can be co-constructed in a bottom-up fashion about the practice of rating to the
scale, the placement of different examinee profiles, the instructional activities for
certain language issues, and more. This is why we strongly advocate for rater train-
ing as an ongoing, regular quality control procedure for local tests. We will elabor-
ate on the best practice of rater training at the end of this section, but before we
make our recommendations, a general introduction of what rater training entails
and whether it is effective in practice is in order.
As the name suggests, rater training is a quality control procedure where
raters are trained to rate performances. In performance-based assessments,
a common goal of rater training is to develop a satisfactory level of consistency
and alignment among the raters in terms of their interpretation and application
of the scale. In order to do so, raters have to minimize any rater effects they
display in their rating behavior. That is, rater training should help raters recog-
nize effects such as rater severity, rater bias, central tendency, and halo effect,
and then negotiate a set of effective training activities to help them reduce these
rater effects and achieve higher rater agreement. In theory, rater training should
focus on the analysis of test performances and reflection upon rating behaviors.
However, in practice, rater training for local language tests often goes beyond
the immediate assessment situation, tapping into other related settings within the
local context where the test has an impact. This practice is a distinct feature of
rater training for local tests, where raters participate in assessment-related
conversations both in their role as raters and in their regular roles in the local
context. This kind of practice can ultimately create a community of practice: it
helps raters integrate language assessment into their professional lives and helps
test developers improve the quality of the test in a sustainable way.
Interestingly, the field of language testing does not seem to have a broad con-
sensus regarding the effectiveness of rater training. While the majority of the
research in language testing has indicated that rater training helps to improve
rater reliability (Weigle, 1994), some researchers have failed to find a meaningful
effect of rater training on rater performance (Elder, Barkhuizen, Knoch, & Von
Randow, 2007; Vaughan, 1991). Woehr and Huffcutt (1994) conducted
a meta-analysis on the effectiveness of rater training for performance appraisal.
Their study revealed that, on average, rater training appears to have a moderate
effect on rating accuracy, rater severity, and halo effect. However, there is con-
siderable variability in the design and implementation of rater training (e.g., con-
tent, intensity, and duration of training). These characteristics can have an
impact on the effectiveness of rater training (Ivancevich, 1979). Interestingly, if
we look at rater training across fields, we may be surprised to find out that the
length and frequency of rater training can range from a one-time, 5-minute
briefing to a repeated training module that lasts for an entire semester or year.
In language testing, rater training is rarely done in the course of a few minutes.
However, in the published studies, rater training is often a one-time event.
Based on our experiences in different contexts, it is difficult to see an immediate
effect of rater training for language tests, and that effect rarely lasts long if it is
provided only once. Research in the assessment of other psychological traits has
reported similar findings (e.g., Ivancevich, 1979).
The challenge of making rater training effective with a single meeting makes sense
from a rater cognition perspective. Cronbach (1990) argues that rating is “a complex,
error-prone cognitive process” (p. 684; later echoed by Myford & Wolfe, 2003). The
rating process for language tests is arguably more challenging than other psychological
constructs (for reasons highlighted at the beginning of Chapter 6), a point that
Hamp-Lyons (1995) makes forcefully in her discussion of the difficulties of rating
essay performances. Given these difficulties, rater training has to proceed in stages:
once raters have developed a basic familiarity with the scale, the training can move
on to finer details, namely, distinguishing perform-
ances between adjacent levels and characterizing performances at each level.
Example 7.1 presents rater training in a local testing context, i.e., the written
section on the English Placement Test (EPT) at the University of Illinois at
Urbana-Champaign.
• Meeting 1: introducing the certification program and the rating rubric; rate and discuss 6 benchmark essays
• Rating 1: 18 essays rated individually
• Meeting 2: discuss rating performance; rate and discuss 6 benchmark essays
• Rating 2: 18 essays rated individually
• Meeting 3: discuss rating performance; rate and discuss 6 benchmark essays
• Rating 3: 12 essays rated individually
• Meeting 4: discuss rating performance
• Rating 4: 12 essays rated individually
• Rater performance analysis after each rating round
• Certification: evaluation of overall individual performance at the end of the semester
During each month of the training program, the raters are asked to rate
a sample of benchmark essays individually and receive feedback from the
rater trainer regarding their performance. Then, they meet at the end of
each month to evaluate and discuss a smaller set of benchmark essays as
a group during the meeting. They are asked to provide justifications for the
score each essay receives, as a way to achieve a mutual interpretation and
application of the rating scale. During the meeting, the discussion often starts
with raters identifying performance characteristics that are aligned to the scale
descriptors. However, as soon as they have developed a good intuition about
the scale, the raters also draw on their relevant teaching experience in the local
context to enrich these discussions.
There is not really a single model of best practice for rater training. However,
we would like to recommend a few important considerations when developing
a rater training program for your own language test.
Rater cognition
One cannot overstress the fact that rating is a complex cognitive process. This
is, first and foremost, the factor that should guide the consideration of all other
factors listed below, as well as the overall design and implementation of your
rater training program. As much as you might like to address all rater effects
during rater training, it is highly unlikely that you will be able to achieve this
goal. Take baby steps and give yourself time to develop rater reliability and con-
fidence. Research on the longitudinal development of rater performance (Lim, 2011)
suggests that raters tend not to develop a firm grasp of the rating scale
until they have practiced with it three or even four times. Therefore, if you have an
analytic scale or a hybrid holistic scale with analytic components (refresh your
memory about the different types of scales from the previous chapter if need
be), you will need to have patience with the raters and allow them enough time
to explore and internalize the rating scale.
Rater background
Before you design or deliver rater training, you should consider the background
of the raters. Human raters are not machines. When they read an essay or listen
to a speech sample, they will develop assumptions, hypotheses, and attitudes
toward the examinee based on their personal, linguistic, cultural, educational, and
professional backgrounds. This is human nature. However, while some of these
perceptions are relevant to the language ability of the examinee, others are biases
that need to be recognized and minimized. Thus, an understanding of rater
backgrounds is likely to help develop training materials and activities to prevent
certain biases from surfacing.
Benchmarking
Another important suggestion for rater training is benchmarking. In a language
test, benchmarks usually refer to the most typical test performances at each level
of the rating scale. Because they are typical of a particular level, these perform-
ances tend to render higher rater agreement (they can also be envisioned as the
center of each level, if we think of scale levels as bands). When training raters to
the scale, especially during the initial stage of training, it is very important to use
benchmark test performances and provide the “correct” scores to the raters.
In contrast to benchmarks are borderlines, which are test performances that are
not typical of a particular level and display characteristics of performances at
two adjacent scores or levels. These performances tend to be less
frequent and representative of the examinee population than benchmarks. Thus,
borderline performances should not be used during rater training. In addition,
we recommend using only benchmark performances because rater training is not
a guessing game or a test for raters, although certain stakes are inevitable for
rater training in some contexts. Rater trainers should not use borderlines to trick
the raters or force them to agree on these essays. In fact, because of the cogni-
tive load of scoring language performances, it is not an easy task to align raters
to the rating scale, even when using benchmarks.
Community of practice
If you consider all of the factors above, you are likely to get a positive, or at
least a less negative, reaction from your raters regarding scoring and rater training.
There is, however, more that you can do to make scoring, rater training, and
even language assessment a more communal experience. To do so, you should
seek frequent feedback from raters about the scale and make adjustments accord-
ingly. This can help the raters to develop a sense of ownership of the scale or
test. Raters should be encouraged to express their intuitions, hypotheses, and
opinions regarding the scale. They should be encouraged to develop their own
internalized versions of the rating scale (e.g., tweaking scale descriptors or devel-
oping their own “decision trees” for score assignment), as long as these internal-
ized versions are largely aligned to the original scale. Raters are not machines, so
don’t try to make them interchangeable, because you are bound to fail. In fact,
if you allow some freedom during rating and rater training, you might be sur-
prised to find out that there are many occasions when the raters’ voices can lead
to an improved version of the rating scale. Therefore, developing some level of
autonomy and ownership among the raters can create a community of assess-
ment practice. As we discussed in previous chapters, the raters for local language
tests tend not to be very confident about their language assessment literacy. This
kind of rater training practice can help develop language assessment literacy in
the local context.
That said, while you grant teachers a certain level of autonomy, you also need to
train raters to accept majority rule during rater training. When a rater
assigns a different score from the rest of the community, you should try to prevent
that rater from going into defense mode. Ideally, when discussing raters’ scores and
comments on certain test performances, you should keep the ratings anonymous.
After all, rating is a shared practice, and it should not be about getting the “right”
answer. Example 7.3 describes the development of a rater community of practice
in the case of the TOEPAS.
raters try to write feedback reports in which they describe the performance
they watched.
6. Observation and rating of a live TOEPAS session. New raters observe
a live TOEPAS session administered and rated by two experienced raters.
They also practice rating the observed session.
7. Individual rating. After the intensive four-day sessions, new raters are
assigned performances that they rate individually. New raters are certified
when they achieve 80% agreement.
In addition to the initial group training and regular group norming sessions,
raters further develop collaboration through post-score discussions and feed-
back report writing. More specifically, after independent score submission,
the two raters discuss the performance to decide on the final score that will
be reported to the test-taker. These post-score discussions allow raters to
adjust their uses of the scale and their shared understanding of the scale
descriptors. These discussions also help the raters to relate scale descriptors
to concrete examples (quotes) from the performance in order to produce
the written feedback report that test-takers receive as part of the test results.
Summary
This chapter discusses issues related to raters and rater training. We have introduced
a catalog of rater effects that can occur during test scoring, their impacts on test
scores, and ways to address them during rater training. It is important to keep in mind
that rating is a complex cognitive process and that training takes time to affect rater
performance and reliability. Equally important, although we cannot treat human
raters as machines, we do need to develop a mutual understanding among them of
what constitutes language ability for the particular assessment purpose and of how
to assess strengths and weaknesses in language performances. Embedding these
questions in raters' daily professional lives helps the language test become a living
organism that grows and integrates itself into the local context.
Further reading
Carey, M. D., Mannell, R. H., & Dunn, P. K. (2011). Does a rater’s familiarity
with a candidate’s pronunciation affect the rating in oral proficiency inter-
views? Language Testing, 28(2), 201–219.
Carey, Mannell, and Dunn investigate how examiners' familiarity with examinees'
interlanguage affects pronunciation ratings in International English Language
Testing System (IELTS) oral proficiency interviews (OPIs). Ninety-nine
IELTS examiners across five test centers (Australia, Hong Kong, India, Korea, and
New Zealand) participated in the study by rating the OPI performances of three
candidates with Chinese, Korean, and Indian language backgrounds, respectively.
Most examiners were native speakers of English, except at the Indian center, but
their levels of familiarity with non-native English accents varied. The results
showed that examiners' interlan-
guage familiarity had significant positive associations with pronunciation ratings across
languages. Examiners with higher familiarity to the candidate’s interlanguage tended
to give higher pronunciation ratings. In addition, the location of test centers was sig-
nificant to the inter-rater variability, independent of the interlanguage familiarity.
Examiners’ ratings were significantly higher when they rated the candidate’s pronunci-
ation in their home country, for example, when the Chinese candidate’s performance
was rated in the Hong Kong center. The authors suggested that language test developers,
rater trainers, and researchers consider interlanguage phonology familiarity and the
test center effect as important sources of bias in OPI testing.
Davis, L. (2016). The influence of training and experience on rater performance
in scoring spoken language. Language Testing, 33(1), 117–135.
Davis (2016) investigates the effects of rater training, rating experience, and scor-
ing aids (e.g., rubrics, exemplars) on rater performance in the TOEFL iBT
speaking test. Twenty native-English-speaking teachers with varied teaching and
rating experience, but no previous TOEFL speaking rating experience, participated.
Each rater scored 400 responses over four scoring sessions and completed one rater
training session. The training effects were investigated via rater severity,
internal consistency, and accuracy of scores. Many-facet Rasch measurement
analysis results showed little effect of training and experience on overall rater
severity and internal consistency. However, increases were detected in pairwise
inter-rater correlations and inter-rater agreements over time. In addition, scoring
accuracy measured by correlations and agreement with previously established
reference scores was significantly improved immediately after the rater training.
The frequency of exemplar use also distinguished the most and least accurate raters:
the most accurate raters referred to exemplars more frequently, and for longer, than
the least accurate raters did. Using scoring rubrics also increased scoring accuracy. The
study corroborates the positive effects of rater training on rater performance reported
in previous literature while providing new insights into the relationships between
the use of scoring aids and rating performance.
Deane, P. (2013). On the relation between automated essay scoring and modern
views of the writing construct. Assessing Writing, 18(1), 7–24.
Deane (2013) examines how the writing construct is measured in automated essay
scoring (AES) systems. After reviewing definitions of writing as a construct, the author
discusses the construct that AES systems measure, focusing on the e-rater scoring
engine. AES systems primarily measure text production skills or text quality (e.g.,
text structure, linguistic features) but do not directly measure other writing skills
such as deploying meaningful content, sophisticated argumentation, or rhetorical
effectiveness. While acknowledging the limitations of AES systems, the author
highlights fluency and language control as important components. The author
provides literature that supports a strong correlation between efficient text pro-
duction skills and other cognitive skills necessary for successful writing, which
connects AES and human ratings. The strong correlations exist because mastery
of core text production skills enables writers to use more cognitive resources for
socio-cognitive writing strategies, which otherwise needs to be used for produ-
cing text. The author discusses three common criticisms of AES (construct rep-
resentations, measurement methods, and technical inadequacies of AES) and
responds to the concerns by suggesting a socio-cognitive approach to AES in
large-scale settings.
Myford, C. M., & Wolfe, E. W. (2003). Detecting and measuring rater effects using
many-facet Rasch measurement: Part I. Journal of Applied Measurement, 4(4),
386–422.
And
Myford, C. M., & Wolfe, E. W. (2004). Detecting and measuring rater effects
using many-facet Rasch measurement: Part II. Journal of Applied Measurement,
5(2), 189–227.
Myford and Wolfe (2003, 2004) is a two-part paper that describes how rater
effects can be measured using many-facet Rasch measurement (MFRM). In Part I, the
authors provide a historical review, focusing on how the previous measurement
literature conceptualized and measured rater effects, theorized the relationships
between rater effects and rating quality, and sought to keep rater effects under
control to protect rating quality. The authors also introduce Andrich's (1978) rating
scale model as modified within the MFRM framework and three hybrid MFRM
models grounded in the mathematical and conceptual foundations of MFRM, and
they make thorough comparisons of the models. In Part II, the authors offer
practical guidance on using the
Facets (Linacre, 2001) computer program to measure five rater effects, namely, leni-
ency/severity, central tendency, randomness, halo, and differential leniency/sever-
ity, and on interpreting the results. The authors emphasize the importance of
embracing diverse psychometric perspectives in researching rater effects.
Yan, X. (2014). An examination of rater performance on a local oral English pro-
ficiency test: A mixed-methods approach. Language Testing, 31(4), 501–527.
Yan (2014) examines rater performance and behavior on a local speaking test for
prospective international teaching assistants (ITAs). Rater performance was
evaluated by investigating inter-rater reliability estimates (consistency, consensus,
and measurement statistics) of 6,338 ratings. The author also qualitatively analyzed
506 sets of rater comments for listener effort, fluency, pronunciation, grammar,
and language sophistication. The results revealed that overall reliability was high
(r=0.73) but exact agreement rate was relatively low, which was associated with
examinee oral English proficiency levels: higher agreement was found on passing
scores across different language groups. Many-facet Rasch measurement results
indicated that the rating scales were used consistently, but severity varied among
raters, with a small effect size. Chinese raters were more lenient towards Chinese exam-
inees and more severe towards Indian examinees than native speaker raters. Quali-
tative analysis indicated that Chinese and native speaker raters had different
perceptions towards intelligibility of Indian and low-proficient Chinese examinees.
The author suggested considering the interactions among linguistic features when
interpreting rater disagreement, and addressing intelligibility when representing the
construct of ITA English proficiency in real-world communication.
References
Andrich, D. (1978). A rating formulation for ordered response categories. Psychometrika,
43, 561–573.
Ballard, L. (2017). The effects of primacy on rater cognition: An eye-tracking study. Doctoral disser-
tation. Michigan State University.
Cronbach, L. J. (1990). Essentials of psychological testing (5th ed.). New York: Harper & Row.
Davies, A. (2008). Textbook trends in teaching language testing. Language Testing, 25(3),
327–347.
Elder, C., Barkhuizen, G., Knoch, U., & Von Randow, J. (2007). Evaluating rater responses
to an online training program for L2 writing assessment. Language Testing, 24(1), 37–64.
Fulcher, G. (2012). Assessment literacy for the language classroom. Language Assessment
Quarterly, 9(2), 113–132.
Hamp-Lyons, L. (1995). Rating nonnative writing: The trouble with holistic scoring. TESOL
Quarterly, 29(4), 759–762.
Inbar-Lourie, O. (2008). Constructing a language assessment knowledge base: A focus on
language assessment courses. Language Testing, 25(3), 385–402.
Ivancevich, J. M. (1979). Longitudinal study of the effects of rater training on psychometric
error in ratings. Journal of Applied Psychology, 64(5), 502.
Lim, G. S. (2011). The development and maintenance of rating quality in performance
writing assessment: A longitudinal study of new and experienced raters. Language Testing,
28(4), 543–560.
Linacre, J. M. (2001). Facets Rasch measurement software. Chicago: Winsteps.com.
Myford, C. M., & Wolfe, E. W. (2003). Detecting and measuring rater effects using
many-facet Rasch measurement: Part I. Journal of Applied Measurement, 4(4), 386–422.
Stemler, S. E. (2004). A comparison of consensus, consistency, and measurement approaches
to estimating interrater reliability. Practical Assessment, Research & Evaluation, 9(4), 1–19.
Upshur, J. A., & Turner, C. E. (1999). Systematic effects in the rating of second-language
speaking ability: Test method and learner discourse. Language Testing, 16(1), 82–111.
Vaughan, C. (1991). Holistic assessment: What goes on in the rater’s mind? In L. Hamp-Lyons
(Ed.), Assessing second language writing in academic contexts (pp. 111–125). Norwood, NJ:
Ablex.
Weigle, S. C. (1994). Effects of training on raters of ESL compositions. Language Testing, 11
(2), 197–223.
Woehr, D. J., & Huffcutt, A. I. (1994). Rater training for performance appraisal:
A quantitative review. Journal of Occupational and Organizational Psychology, 67(3), 189–205.
Yan, X., Zhang, C., & Fan, J. J. (2018). “Assessment knowledge is important, but … ”:
How contextual and experiential factors mediate assessment practice and training needs
of language teachers. System, 74, 158–168.
8
DATA COLLECTION, MANAGEMENT,
AND SCORE REPORTING
While Chapter 3, Local test development, and Chapter 5, Local test delivery, deal with
the front-end of test development, this chapter discusses the back-end, i.e., data
collection and management. More specifically, this chapter discusses what types of
test-taker data need to be collected (e.g., background bio data), and how the data
are stored, organized, and retrieved (e.g., relational databases). Database design is
at the center of these discussions. We argue that appropriate data collection and
management need to be explicitly incorporated into test design at the earliest
stages of development because they are essential for the ongoing analyses of test
reliability, validity, and research. The data collection and management systems are
sometimes used not only for result storage but also for score reporting. Therefore,
this chapter also discusses the relevance of information included in different score/
result reports in relation to different score users and uses.
Introduction
When you design a local language test, the focus tends to rest primarily on test con-
tent and delivery, as well as the development of a rating scale in the case of perform-
ance-based tests. Of course, you are concerned about whether the tasks you design
engage test-takers as expected and, therefore, elicit relevant language performances.
You are also concerned about whether the test delivery platform (traditional or digi-
tal) serves the test purpose and is cost-effective and whether the rating scale allows
you to discriminate among the different proficiency levels. Response collection and
storage, however, tend to receive less attention during the test development stages
because they may seem naturally determined by the test delivery platform you select. For
example, if your L2 writing test is computer-delivered, then the responses you collect
are probably in the form of saved word processing documents (e.g., Microsoft Word
or text), which may be stored on the computer’s hard drive, a local server, or a USB
drive, or they may be sent to a specific email address. You may also have a situation
(e.g., oral interviews) in which the test responses are not recorded because they are
rated live, although a number of oral responses may be recorded for examiner- and
rater-training or task analysis. Even though the test delivery may influence the data
collection to some extent, hybrid approaches to delivery and response collection
facilitate quality control and reporting. For instance, test responses or other test-
related data (e.g., test-takers’ bio data) may be collected on paper and subsequently
digitized and stored in a database. Scores and raters’ notes from oral interviews may
also be stored in a digital database where they are linked with test-taker data. With
the widespread use of computers, digital databases for test data storage and retrieval
have become commonplace, especially because task responses are distributed to raters
for scoring and, together with other test data, are used for test analysis, program
evaluation, and research.
The test data are also used to process score reports for test-takers and score
users. A well-structured database facilitates the preparation and distribution of
test results to different stakeholders, and the production of result reports can be
automatized if the report parameters are well-defined. The automatic production
of reports is efficient and minimizes human data processing errors.
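As a small illustration of what automated report production can look like, the sketch below merges a score record into a plain-text template; the field names, scale, and wording are hypothetical and are not the report format of any test discussed in this book.

REPORT_TEMPLATE = """\
Result report for {name} (ID: {test_taker_id})
Test date: {test_date}
Overall score: {overall} (scale 1-6)

Section scores:
{sections}

Rater comments:
{comments}
"""

def build_report(record: dict) -> str:
    # Merge one score record into the template so every report has the same structure.
    sections = "\n".join(f"  {skill}: {score}" for skill, score in record["sections"].items())
    return REPORT_TEMPLATE.format(
        name=record["name"],
        test_taker_id=record["test_taker_id"],
        test_date=record["test_date"],
        overall=record["overall"],
        sections=sections,
        comments=record.get("comments", "None recorded."),
    )

record = {
    "test_taker_id": "2024-0007",
    "name": "Example Test-Taker",
    "test_date": "2024-03-15",
    "overall": 4,
    "sections": {"speaking": 4, "writing": 5},
    "comments": "Clear organization; work on hedging and cohesion.",
}
print(build_report(record))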
Given that appropriate data collection and management is central for the ongoing
test analyses and research, the test database needs to be explicitly incorporated into
the test design in the earliest stages of test development. In this chapter, we first
introduce the flow, storage, and management of data, and then focus on test data
uses for result reporting, item analysis, program evaluation, and long-term research.
Data flow
In order to highlight the importance of planning test data storage and mainten-
ance as part of the test design, we outline the test data flow at all stages of test
administration. As Figure 8.1 shows, data collection and storage occur mainly
during registration, testing, and rating. When the test-takers register to take the
test, their demographic information (e.g., name, L1, age) is gathered. Then, the
collected test responses are accessed by the raters during the rating session. The
scores, feedback, and, sometimes, rater notes are recorded and linked to test-
takers’ background data and test responses for the purpose of producing scores/
result reports. Test data can also be used for analysis of the test’s reliability and
validity, as well as research and program evaluation.
The basic data we collect during registration are related to test-taker identifica-
tion information, such as name, age (date of birth), gender, personal identification
number, contact information (email, phone, address), and test session (place, date,
and time when the test-taker is registered to take the test). When tests are admin-
istered to test-takers with different language backgrounds and nationalities, infor-
mation regarding the languages they speak (mother tongue and other languages),
country of residence, and nationality may also be collected. In university contexts,
test-taker data could also contain information about test-taker’s department, study
program, degree (BA, MA, PhD), and student advisor or program liaison. If the
test is administered across several schools, school branches, or language centers,
information about test-taker’s school/center may also be needed. All of these data
are usually associated with a test-taker ID, which is used to protect personal infor-
mation through anonymous data handling.
Test-taker IDs are essential when there is a need to connect different data
associated with the same test-taker. For example, if test-takers are tested several
times over a certain period (e.g., pre- and post-test), then the IDs help to con-
nect the results from each test so that the individual test-taker’s growth can be
measured. Test-taker IDs are also useful to identify the test-taker correctly when
several test-takers in the database have the same first and last names. They can
be automatically generated by the database software, or they can be designed
with a particular structure (see the role of IDs in section Research, digitization of
test data, and corpus development below). The test-taker IDs in the OEPT 1.0 con-
tained information about the semester and year when the test was administered,
along with the test-taker number. For example, SP20020039 meant that the
test-taker was the 39th person who took the test in the spring semester of 2002.
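A minimal sketch of how a structured test-taker ID in the style of the OEPT 1.0 example can be generated and parsed is shown below; the helper functions are illustrative and are not part of any existing system.

def make_id(semester: str, year: int, sequence: int) -> str:
    """e.g. make_id("SP", 2002, 39) -> 'SP20020039'."""
    return f"{semester}{year}{sequence:04d}"

def parse_id(test_taker_id: str) -> dict:
    # Split the ID back into its semester, year, and sequence-number components.
    return {
        "semester": test_taker_id[:2],
        "year": int(test_taker_id[2:6]),
        "sequence": int(test_taker_id[6:]),
    }

print(make_id("SP", 2002, 39))   # SP20020039
print(parse_id("SP20020039"))    # {'semester': 'SP', 'year': 2002, 'sequence': 39}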
Which type of data you need to collect and store depends on the needs of
your local context. Personal data protection laws and regulations require that
personal data collection is limited only to the most necessary information needed
for the purpose, and so you have to carefully consider what personal data are
relevant for your test, i.e., what data you need to report the results and to
perform test analyses for quality purposes. For example, collecting information
about the visa or residency status of international students at your university may
provide interesting information, but it may not be directly related to the purpose
of the test, and thus the data should not be collected. Collecting irrelevant per-
sonal data may also breach research and testing ethics.
Personal data protection regulations tend to be enacted at the institutional, national,
or international levels. For example, the General Data Protection Regulation (GDPR)
(Regulation (EU) 2016/679) of the European Parliament and the Council is enacted
at the international level in the European Union and the European Economic Area
(European Commission, 2019). GDPR stipulates that institutions involved in personal
data handling must employ a data protection officer whose responsibility is to manage
organizational data protection and oversee GDPR compliance. Similar international
privacy protection frameworks, such as the APEC Privacy Framework, exist in the
Asia-Pacific (APAC) region, and individual countries in the region have also
implemented national privacy acts (Asia-Pacific Economic Cooperation, n.d.). For example,
the national Australian Notifiable Data Breach (NDB) scheme, which is an addendum
to the Privacy Act of 1988, shares many similarities with GDPR, especially in relation
to management of electronic data (Australian Government, n.d.). In the US, however,
personal data management is regulated by the Family Educational Rights and Privacy
Act of 1974 (FERPA), which is a Federal law that applies only to personal data man-
agement in the educational sector (US Department of Education, 2018).
If your institution has a data protection officer, a research review board (e.g.,
IRB), or a research ethics committee, then you may need to obtain their approval
regarding the type and use of the test data that you collect and the type, place, and
length of data storage. Such a board or committee would be cognizant of the per-
sonal data protection laws relevant for your context. Example 8.1 discusses the pro-
cedures used to ensure that the TOEPAS data collection complies with GDPR.
1. Approval was requested from the data protection officer at UCPH regard-
ing the acceptability of data collection and storage.
Test responses are collected during the testing session and can be in the form of
correct/incorrect (0/1) answers, text, audio files, or video files, depending on
whether the test contains multiple-choice or performance-based items (see Chapter
4 for task descriptions). Test response data are collected to facilitate the scoring or
rating process, but they are also used for subsequent item and test analyses, as well
as research. Digital language tests allow for the collection of other types of test-
related data that could be used for further item/test development and an improved
understanding of language ability. For example, you could record response-
preparation time, time spent on prompts, or response time.
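One possible way to capture such timing data alongside the response itself is sketched below; the record structure and field names are illustrative rather than the schema of any particular test.

from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class ItemResponse:
    test_taker_id: str
    item_id: str
    response: str                      # text answer, or a path to an audio/video file
    preparation_seconds: float
    response_seconds: float
    recorded_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

# One hypothetical record for a recorded speaking item.
resp = ItemResponse(
    test_taker_id="SP20020039",
    item_id="speaking_03",
    response="responses/SP20020039_speaking_03.wav",
    preparation_seconds=30.0,
    response_seconds=88.4,
)
print(resp)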
In terms of rating data, however, we usually refer to scores assigned to each
item, as well as overall test scores. Additionally, rating data can include rater
comments and notes, which could be used to support score decisions, as well as
feedback reports. Rating data also include rater information, that is, a record of
the scores that each rater assigns, rater background data (e.g., mother tongue,
gender, rater certification date), and task information if multiple tasks are
involved. Table 8.1 summarizes the types of test data commonly collected.
TABLE 8.1 Types of test data commonly collected

Registration data
• test-taker ID
• name
• origin
• age
• mother tongue
• educational background

Response data
• answers (text, audio files, correct/incorrect)
• preparation time
• response time
• notes

Rating data
• rater information (mother tongue, gender, rater certification date)
• rater comments, notes, feedback reports
• scores
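Building on Table 8.1, the sketch below shows one way the three data types can be kept in a relational database and linked by the test-taker ID; SQLite is used only because it ships with Python, and the table and column names are illustrative.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE registration (
    test_taker_id TEXT PRIMARY KEY,
    name          TEXT,
    mother_tongue TEXT,
    age           INTEGER
);
CREATE TABLE response (
    response_id   INTEGER PRIMARY KEY,
    test_taker_id TEXT REFERENCES registration(test_taker_id),
    item_id       TEXT,
    answer        TEXT,
    response_time REAL
);
CREATE TABLE rating (
    rating_id     INTEGER PRIMARY KEY,
    response_id   INTEGER REFERENCES response(response_id),
    rater_id      TEXT,
    score         INTEGER,
    comments      TEXT
);
""")

conn.execute("INSERT INTO registration VALUES (?, ?, ?, ?)",
             ("SP20020039", "Example Test-Taker", "Mandarin", 24))
conn.execute("INSERT INTO response VALUES (1, 'SP20020039', 'speaking_03', "
             "'responses/SP20020039_speaking_03.wav', 88.4)")
conn.execute("INSERT INTO rating VALUES (1, 1, 'rater_07', 5, 'Fluent but frequent article errors.')")

# Retrieve all ratings for one test-taker by joining the three tables on their keys.
rows = conn.execute("""
    SELECT r.name, rt.rater_id, rt.score
    FROM registration r
    JOIN response rs ON rs.test_taker_id = r.test_taker_id
    JOIN rating   rt ON rt.response_id   = rs.response_id
    WHERE r.test_taker_id = ?
""", ("SP20020039",)).fetchall()
print(rows)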
App reports
The OEPT App has a number of report functions. Some reports allow testers
to view exam and survey data; others serve as quality assurance measures;
still other reports are used in the analysis of test, examinee, and rater data.
The OEPT Survey Report allows administrators to view OEPT Survey
responses and statistics. The OEPT Practice Test Report creates an entry for
each use, showing the date accessed, which tests were accessed, and user
comments. The RITA Report (for rater training) is organized by user; it
shows the user’s name, progress, and practice rating scores.
The SAS Report contains all rater item and overall scores, as well as final
scores, along with rater IDs and rater types. When raters are assigned to rate
a test, each one is designated as either “Scoring” (meaning they are up to
date on training and their scores will contribute to the final test score) or
“Apprentice” (meaning they are still training and their scores do not contrib-
ute to final score determination). Raters can also be designated as “Training”
when the test is assigned only for training purposes.
The Examinee Score report includes exam IDs, dates, and final scores,
along with examinee department, college, country, language, sex, and year
of birth. The SAS and Examinee reports are used to create data sets for use
with SAS, FACETS, and other statistical software.
Data storage
The storage of test data has evolved from the storage of physical data (e.g., paper
booklets, tapes) to digital data storage on internal and external hard drives and,
more recently, to client-server technology and cloud computing services. Despite
the evolution of digital data storage, physical data, or parallel systems of physical
and digital data, still exist in local settings because the volume of local test data is
relatively small when compared to data collected with large-scale tests.
In the past, the storage and management of test data required large cabinets and
storage rooms, as well as efficient indexing systems. Boxes with test booklets and
audiotapes and file cabinets with rating sheets and forms with test-taker data were
commonly found in institutions where local tests were administered. Locating data
was time-consuming because it required a physical search through boxes and cab-
inets, and retrieval was only possible if an efficient filing and labeling system was in
place. Due to limited storage space and search capacity, the data could easily
become unusable and had to be destroyed after a certain period of time. Figure 8.2
is an example of how the testing data (including exam sheets, essays, and score
reports) for the EPT at the University of Illinois at Urbana-Champaign was stored
before the online platform and database were developed.
With the increased accessibility to computers in local institutional contexts
and the development of digitally-delivered tests, the storage of directories with
test responses and other test data on institutional or program internal hard drives
became standard practice. Due to the limited size of computers’ hard drives in
the 1990s and early 2000s, CDs, DVDs and, later, external hard drives were
used for data storage. For example, when the OEPT became operational in
2001, the database with test-taker information and scores was stored on the pro-
gram’s computer hard drive, and backed up on CDs. Many institutions continue
to store test data on internal and external hard drives, especially because the hard
drives that are commercially available today have much larger storage capacity
than they did 20 or 30 years ago. The drawback is that, if the hard drive is dam-
aged, then the stored test data may be rendered irretrievable. All data storage
systems require at least one viable back-up system.
In order to secure safe storage and retrieval of test data, IT specialists recommend
the application of network storage through client-server technology or cloud com-
puting. In addition to safe storage, client-server and cloud applications allow for
data to be accessed from different devices. Furthermore, since client-server and
cloud applications are independent of the local device (computer, tablet, smart-
phone), any damage to the local device’s hardware or software will not result in
data loss (Figure 8.3).
The client-server model allows local devices (e.g., computer, tablet, smart-
phone), which are called “clients”, to obtain resources or services from other
devices (usually computers), which are called “servers”, through a computer net-
work or the internet. In other words, you can use local devices to collect response
or other test data, which are then exported to and stored on the server. The inter-
net allows for the communication between the client and the server. In fact, the
server has the same functionality as an external drive. The only difference is that
the external drive is local, while the server is remote and can be accessed through
the internet. To give an example, simulated lectures, which are the test response
data from TOEPAS, are recorded as video files. Through a file transfer protocol
(FTP), the recorded video files are immediately transferred to and stored on
a server instead of a local internal or external hard drive. To access the video files
after storage, we need the unique URL address associated with those files.
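A minimal sketch of this kind of transfer, using Python's standard ftplib module, is shown below. The host, credentials, directory, and file name are placeholders, and in practice a secure variant (SFTP or FTPS) with credentials stored outside the script would be preferable.

from ftplib import FTP
from pathlib import Path

local_file = Path("recordings/SP20020039_lecture.mp4")   # placeholder file name

with FTP("files.example.edu") as ftp:                      # placeholder server
    ftp.login(user="test_upload", passwd="***")            # placeholder credentials
    ftp.cwd("/responses")                                   # placeholder directory
    # Upload the recording in binary mode so the video file is not corrupted.
    with local_file.open("rb") as f:
        ftp.storbinary(f"STOR {local_file.name}", f)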
The newest type of network storage is offered through cloud storage services.
In many ways, cloud services are similar to client-server models because they
both allow for the remote storage of data. The cloud is different, however, in
that you use the services of cloud providers who provide you with storage space
for a fee. Therefore, the advantage of cloud services is that the cloud provider
maintains the storage space, and so you do not need to invest in server updates
and maintenance as you do with client-server technologies. Moreover, the data
on the cloud are synced automatically and at regular intervals, so they remain safe
despite any potential damage to your local hardware or software. You
may be familiar with the cloud services of Dropbox and OneDrive, which are
used primarily for storage of personal data files. For institutional use, Amazon
Web Services is a more robust provider, offering a range of cloud computing
services (Amazon, n.d.).
Although cloud computing has great potential and seems to represent the future
of data storage solutions, concerns are often raised regarding data security and priv-
acy. Many institutions do not allow storage of sensitive data, like test scores and per-
sonal information, on cloud servers because they do not trust a storage security
system over which they do not have control. When we tried to use the ACE-In at the
University of Copenhagen, we were required to purchase Amazon's cloud services for
Europe instead of the Amazon United States services that hosted the ACE-In
administration at Purdue. Moreover, no personal test-taker identifiers
(e.g., names, national ID numbers) were stored on the cloud server. All personal
identifiers were associated with test-takers’ IDs and stored on a local computer hard
drive. Therefore, checking the local policies and requirements for personal data
management before designing the data storage system is essential.
The technical information about different storage alternatives may seem over-
whelming for some, but we hope it provides a starting point for discussions with
your local IT specialists or programmers when you are deciding on storage options.
Data management
Although you may find a storage system where you can keep the test data in an
organized and secure environment, this system will be of little use if exporting,
reorganizing, and recovering data from the storage is difficult. Just as boxes filled with
tests on the shelves of a storage room need to be labeled and indexed, digital data also
need consistent naming and organization to facilitate retrieval. You need a data man-
agement system that will enable you to send the data to the correct place and to find
what you want to use. In other words, the data management system can consist of
interfaces that allow (1) test administrators to input registration data, (2) raters to
access test responses and to submit their scores, (3) test administrators to process score
reports, and (4) researchers/analysts to access data for analysis.
The learning platform(s) used at your institution can serve as data management
systems that require little additional maintenance, because institutions usually
already have staff who manage the platform. Example 8.3 illustrates how the
English Placement Test (EPT) at the University of Illinois at Urbana-
Champaign (UIUC) utilizes the Moodle learning platform for test delivery, scor-
ing, and data management.
Examinees log in to the Moodle site to take the test during the available window.
The learning platform is set up such that it saves both examinees' spoken and
written responses into downloadable files. After each test window, the examinees'
responses are automatically downloaded and saved in the EPT database using
Python scripts (a sketch of this kind of housekeeping script follows this example).
The rating interface is also available on the Moodle site, which randomly
assigns each exam to two raters. Raters also need to log in to each EPT site
to access the exams and to submit their scores and comments. There are
several advantages of using learning platforms like Moodle. First, UIUC and
other universities use these learning platforms to host their regular course
sites so that the learning site is available at no additional cost. Furthermore,
when the test is delivered on the same learning platform, students’ test
scores can be easily linked to their course grades. This allows teaching and
testing to be better connected, thereby enhancing the effectiveness of
instructional programs (see Chapter 2 on linking testing to instruction). In
addition, there is a technical team on campus that provides technical sup-
port for the Moodle site and makes recommendations for test site setup in
order to allow for/optimize different task design features.
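The sketch below is hypothetical (it is not the actual EPT script) and illustrates the kind of housekeeping such a post-download script might perform: sorting exported response files into a folder per test window and logging each file in a CSV index so that raters and analysts can locate responses later. All paths and file-naming conventions are invented.

import csv
import shutil
from pathlib import Path

EXPORT_DIR = Path("moodle_exports")          # files exported from the learning platform
ARCHIVE_DIR = Path("ept_database/responses") # archive organized by test window
INDEX_FILE = Path("ept_database/response_index.csv")

def archive_window(window: str) -> None:
    """Move files named like '<test_taker_id>_<task>.<ext>' into the archive and index them."""
    target = ARCHIVE_DIR / window
    target.mkdir(parents=True, exist_ok=True)
    INDEX_FILE.parent.mkdir(parents=True, exist_ok=True)
    with INDEX_FILE.open("a", newline="") as f:
        writer = csv.writer(f)
        for path in sorted(EXPORT_DIR.glob("*_*.*")):
            test_taker_id, task = path.stem.split("_", 1)
            shutil.move(str(path), str(target / path.name))
            writer.writerow([window, test_taker_id, task, str(target / path.name)])

archive_window("2024_window_03")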
report (see Figure 8.7). Once Rater A submits the written feedback, Rater
B receives an email that the feedback is ready to be approved. If Rater
B approves the feedback, then the admin is informed that the results (final
score, video, and feedback) are ready to be sent to the test-taker. When the
admin sends the report, the test-taker receives an email with information
that the results could be accessed through a unique link (see Figure 8.8).
Data uses
Before you read this chapter, you might have been tempted to think that your
job is done after the test is administered and scored. However, the bulk of the
work starts after test administration and data collection. In general, after test
administration and scoring, the work on a local test involves (1) result reporting,
(2) test analysis, (3) program evaluation, and (4) long-term research agenda and
database management.
Result reports
In most testing contexts, after test administration, scoring, and data storage is
complete, the first thing to do is to report the test results to different stake-
holders. Depending on the context and purposes of your test, the stakeholders
will vary (see Chapter 2). As you prepare the test score report, keep in mind
that different stakeholders may have different levels of knowledge and familiarity
with the assessment terminology, which is known as language assessment liter-
acy. Thus, the language in the score report should be adjusted to the level of
assessment literacy of the different users. In addition, test result reporting should
be tailored towards the uses of test scores for different stakeholder groups by
including different sets of test information or presenting the results from different
perspectives (see also Example 8.2).
For example, an EFL end-of-semester achievement test for middle-school stu-
dents will likely have the following stakeholders: students, parents, teachers,
English course coordinators (loosely defined here to include head teachers or
headmasters of the grade level), the school principal, and local education author-
ities. Therefore, the reporting system should differentiate the level of informa-
tion accessible and useful for different stakeholder groups, which can effectively
facilitate their use of the test results. In Table 8.2, we show a sample test result
reporting system used for a municipal end-of-semester English achievement test
for eighth graders in China (other test-related information can be found in
Zhang & Yan, 2018). In this local testing context, the EFL teachers use the item
TABLE 8.2 A web-based test result reporting system for a municipal K-12 EFL test in China

School principal
• Descriptive statistics for test and section score distributions (a)
• Item difficulty, discrimination, test reliability (b)
• Ranking (c) of individual classes in the school, district, and city
• Ranking (c) of the school in the district and city

Teacher
• Descriptive statistics for test and section score distributions
• Item difficulty, discrimination, test reliability
• Students' mastery of target linguistic forms and functions (d)
• Individual students' item and test scores
• Ranking of individual students in the class, school, district, and city
• Ranking (c) of the classes that they teach in the school, district, and city

Student and parents
• The student's item and total scores
• Mastery of target linguistic forms and functions
• Ranking of the student in the class, school, district, and city

Note. (a) Score distributions were reported for individual classes, the school, the district, and the city. (b) Item statistics were estimated based on the item responses of all students within the city. (c) Rankings of classes and schools were based on the students' average scores on the exam. (d) Mastery of different forms and functions was estimated based on scores on items targeting that same form or function.
In this local testing context, the EFL teachers use the item scores on the English exam to assess students’ learning outcomes and to make instructional adjustments based on students’ performance on the exams. In addition, teachers use item scores, total scores, school averages, and ranking to com-
municate with students and parents regarding the students’ learning progress and
the need for one-on-one tutoring after school or cram schools during summer
and winter vacations. However, the school principal would more likely be inter-
ested in the overall score distribution on the English exam at School A, as well
as the average scores on every English exam for individual classes at School
A compared to the district and city averages.
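To make the kind of differentiation shown in Table 8.2 concrete, the sketch below filters a full set of results down to what each stakeholder group receives. The field names, values, and the REPORT_FIELDS mapping are invented for illustration; they are not drawn from the test described above.

```python
# Illustrative sketch of stakeholder-differentiated score reporting (cf. Table 8.2).
# All field names and values below are hypothetical.
FULL_RESULTS = {
    "item_scores": {"Q1": 1, "Q2": 0, "Q3": 1},
    "total_score": 78,
    "mastery_of_forms": {"past tense": "mastered", "relative clauses": "developing"},
    "rankings": {"class": 5, "school": 42, "district": 310, "city": 1205},
    "descriptive_statistics": {"mean": 71.2, "sd": 9.8},
    "item_statistics": {"Q1": {"difficulty": 0.82, "discrimination": 0.31}},
}

# Which result fields each stakeholder group sees (simplified from Table 8.2).
REPORT_FIELDS = {
    "student_and_parents": ["item_scores", "total_score", "mastery_of_forms", "rankings"],
    "teacher": ["descriptive_statistics", "item_statistics", "mastery_of_forms",
                "item_scores", "rankings"],
    "school_principal": ["descriptive_statistics", "item_statistics", "rankings"],
}


def build_report(stakeholder: str) -> dict:
    """Return only the result fields intended for the given stakeholder group."""
    return {field: FULL_RESULTS[field] for field in REPORT_FIELDS[stakeholder]}


print(build_report("student_and_parents"))
```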
If your test assesses speaking and writing skills, it is important to include
a simplified version of your rating scale to explain the meanings of the scores. If
the test is intended to have an impact on language teaching and learning, it
might also be desirable to include comments from the raters to justify the scores
and offer more fine-grained information about the test-takers’ performances.
This information can help language teachers (and learners) to identify the
strengths and weaknesses of the test-takers’ performance and to establish appro-
priate learning goals. For example, the TOEPAS reports include a score, a video
recording of the performance, a written report, and an oral feedback session.
The first version of the feedback report included descriptions of performance in
terms of fluency, pronunciation, grammar, vocabulary, and interaction. The
revised version of the feedback report, however, focused more on the functions
of the test-taker’s language in the EMI classroom so that awareness is raised
about different language strategies in the classroom (Dimova, 2017). Contextualized feedback is more feasible in local tests because the raters (often teachers
in the program) are familiar with the specific language domains needed in the
context of language use (see Figure 8.8).
Score reporting is also an opportunity to strengthen the connection between
different test-related resources in the local context. Because local language tests
tend to be constrained in terms of resources, they need the support of their
stakeholders to make the test sustainable. Frequent communication with stake-
holders can effectively enhance their understanding of the test purpose, item
types, and score meanings. This kind of communication is also a recommended practice based on ethical considerations for language tests (see the guidelines for practice in EALTA, 2006; ILTA, 2007).
Item analysis
In most local testing contexts, the same test items are repeatedly used for test
administration. Thus, as part of quality control, it is crucial to continue monitor-
ing whether the items are doing a good job of placing examinees into the
appropriate levels. For that purpose, item analysis is a necessary procedure that
test developers perform after regular administrations. A range of statistical analyses can be performed to examine the quality of the items (Bachman, 2004; Bachman & Kunnan, 2005; Brown, 2005). The more sophisticated
analyses can provide fine-grained information regarding test items, but they also
tend to be more technical and require larger sample sizes, making those analyses
viable only for large-scale, high-stakes tests. Nevertheless, local test developers
can gain a great deal of useful information about test items and test-takers from
even the most basic type of item analysis. Here, we introduce three basic and
yet vital statistics that all types of item analyses can generate: score distribution,
item difficulty, and item discrimination.
• Descriptive statistics are the first step of most data analyses. A local language
test, although much smaller in scale compared to large-scale tests, tends to have
enough test-takers whose scores will show an approximately normal distribu-
tion. Therefore, the first step is to examine the score distribution in order to
see if the distribution is roughly normal. When examining the score distribu-
tion, be sure to check the central tendency (i.e., mean, median, and mode), as
well as the shape of the distribution (i.e., a histogram or bar chart of the
scores). A normal distribution would indicate that, while some test-takers perform well on the test, others do not perform as well. That also means that the test items are neither too easy nor too difficult. This kind of dis-
tribution allows you to better differentiate test-takers across proficiency levels.
If the test items are too easy or too difficult, then test-takers might tend to
score close to the highest or lowest score point on the test, respectively. When
either situation occurs, the distribution is considered to be skewed, which can
make the results of the item analysis unreliable or misleading. Thus, examining
the score distribution is very important.
• Item difficulty is the first item-level statistic that we typically look at. It is also called the facility index. This statistic is computed as the proportion of test-takers who answer the item correctly. It ranges from 0 to 1, with a value closer to 1 indicating an easy item and a value closer to 0 indicating
a difficult item. If we assume a normal distribution of test-taker ability, then
the difficulty level of a good item should be somewhere around 0.5. That
is, around half of the test-takers get a particular item correct. In that case,
we would say that the item targets the average proficiency level of the test-
takers reasonably well. However, not every item needs to be at this level, as we
prefer to have items with a range of difficulty levels on a test so that we
can target different proficiency levels within the test-taker population.
• Item discrimination is another important statistic when evaluating the qual-
ity of a test item. What this statistic tells us is the extent to which an item can
differentiate between test-takers of high and low proficiency, assuming that all
items are measuring the same or similar knowledge or abilities. This index can
take two forms. One type of item discrimination statistic is the discrimination index. In order to compute the discrimination index, you need to rank the test-takers by their total score and divide them into three equal groups. Then, you calculate the item difficulty for the top third and the bottom third separately, and find the difference between the two. The other type of item discrimination statistic is a correlation coefficient between the score on the item and the total score of the test, often referred to as the point-biserial correlation. In both cases, item discrimination ranges from -1 to 1. Ideally, we prefer items with discrimination values higher than .25. If an item has a discrimination value close to 0, or even a negative one, it should be flagged for further scrutiny and revision. Sometimes the answer key has simply been specified incorrectly, which is an easy fix. At other times, the problem is more complicated, for example, item content that elicits responses other than those intended; in that case, item writers need to look further into the item or item type to see whether content revision is needed. It should be noted that, when a test has different sections (e.g., reading, listening, speaking, and writing), item discrimination should be calculated within each section, as you will want to examine the quality of items in reference to other items that measure similar things. A brief worked sketch of these computations follows this list.
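The three statistics above can be computed with a few lines of code. The sketch below is a minimal illustration for dichotomously scored (0/1) items with an invented response matrix; the function names are ours rather than from any statistics package, and the point-biserial correlation relies on statistics.correlation, which requires Python 3.10 or later.

```python
# Minimal sketch of basic item analysis for dichotomously scored (0/1) items.
# The response matrix is invented for illustration only.
import statistics


def item_difficulty(item_scores):
    """Facility index: proportion of test-takers who answered the item correctly."""
    return sum(item_scores) / len(item_scores)


def discrimination_index(item_scores, total_scores):
    """Item difficulty in the top third minus item difficulty in the bottom third."""
    order = sorted(range(len(total_scores)), key=lambda i: total_scores[i])
    third = len(order) // 3
    bottom, top = order[:third], order[-third:]
    return (sum(item_scores[i] for i in top) / len(top)
            - sum(item_scores[i] for i in bottom) / len(bottom))


def point_biserial(item_scores, total_scores):
    """Correlation between the 0/1 item score and the total score (Python 3.10+)."""
    return statistics.correlation(item_scores, total_scores)


# Rows = test-takers, columns = items (1 = correct, 0 = incorrect).
responses = [
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [0, 1, 0, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 1],
    [1, 1, 0, 1],
]
totals = [sum(row) for row in responses]

# Check the central tendency of the score distribution first.
print("mean =", statistics.mean(totals), "median =", statistics.median(totals))

for j in range(len(responses[0])):
    item = [row[j] for row in responses]
    print(f"item {j + 1}: difficulty = {item_difficulty(item):.2f}, "
          f"discrimination index = {discrimination_index(item, totals):.2f}, "
          f"point-biserial = {point_biserial(item, totals):.2f}")
```

With real data, items flagged by either discrimination statistic (values near 0 or negative) would be set aside for the kind of review described above.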
In some cases, after a long period of testing, you might need to perform a more
sophisticated analysis in order to examine more complex testing problems, such as
whether a Japanese reading passage unfairly disadvantages test-takers who
speak Chinese as their L1, or whether all of the lexico-grammar items are measur-
ing grammatical competence. In such situations, it might be helpful to include
a statistician or data analyst on your team. Even so, having a basic understanding
of the core concepts and statistics in item analysis can facilitate communication between you and the data analysts and help them better address your needs.
Program evaluation
Test data are important for program evaluation. By definition, program evalu-
ation is a systematic assessment of the effectiveness and efficiency of a particular
program. The evaluation process tends to involve the collection and analysis of
information from a variety of projects and activities conducted in the program.
For a local language testing program, such information can include, but is not
limited to, the appropriateness of test purpose, procedures of test development,
quality of test items, summary of examinees and their test results, quality control
procedures, research conducted on the test, and feedback from the test stake-
holders. Test purpose and quality control are typically examined in a qualitative
fashion, which can be done through conversations with stakeholders regarding
their test score uses and documentation of the quality control procedures in
place. In contrast, the other types of information are typically examined quanti-
tatively, by analyzing the examinees’ scores, performances, and feedback on vari-
ous aspects of the test. While examinee scores and performances can be obtained
from the test performance itself, test-taker feedback needs to be collected after
the test administration through either a post-test questionnaire or semi-
structured interviews. Local test developers should collect and analyze this information as part of the quality control procedures, so that the information can be regularly assessed in order to improve the effectiveness and efficiency of the test.
For example, when the TOEPAS was introduced at Roskilde University, each
test-taker filled out a survey after they had taken the test. The survey consisted
of open-ended questions related to the dissemination of the information before
the test (e.g., invitation, place/date, preparation, procedure), the actual test
administration, and the usefulness of the oral and the written feedback. The
survey also asked for suggestions for improvement.
A local language test can help with the evaluation of a language program,
especially when the test is an entry and/or exit test embedded in the program.
Language programs are asked to demonstrate the effectiveness of instruction
and learning to their stakeholders. A desirable way to demonstrate learning is to assess language before and after instruction to show change. Therefore,
a local test embedded in a language program can be used to achieve this
purpose.
acoustic information of speech, and oftentimes, you need to use both sources of
data in tandem to answer your research questions. That said, it is not necessary
to transcribe all of the speech samples if you have a larger test-taker population.
Speech corpora tend to be much smaller than text corpora because of the com-
plexity involved in speech processing techniques and the rich information that
speech provides. Thus, you may want to consider selecting a stratified sample of
benchmark performances and having them transcribed. If you keep this as
a regular practice, it will be possible to build a sizable speech corpus in just
a few years. For example, as part of the OEPT validation research, we used test
data to analyze the characteristics of speaking fluency across the different OEPT
scalar levels and across speakers with different language backgrounds (for more
information see Ginther, Dimova, & Yang, 2010).
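Drawing such a stratified sample can be done programmatically. The sketch below groups performances by scale level and first language and samples a few from each cell; the records and field names are invented for illustration and do not reflect any particular test database.

```python
# Sketch: selecting a stratified sample of speech performances for transcription.
# Records and field names are hypothetical.
import random
from collections import defaultdict

performances = [
    {"id": "P001", "scale_level": 4, "l1": "Chinese"},
    {"id": "P002", "scale_level": 4, "l1": "Hindi"},
    {"id": "P003", "scale_level": 5, "l1": "Chinese"},
    {"id": "P004", "scale_level": 3, "l1": "Korean"},
    # ... in practice, one record per archived test performance
]


def stratified_sample(records, per_cell=2, seed=42):
    """Sample up to `per_cell` performances from each (scale level, L1) cell."""
    cells = defaultdict(list)
    for record in records:
        cells[(record["scale_level"], record["l1"])].append(record)
    rng = random.Random(seed)
    sample = []
    for cell_records in cells.values():
        rng.shuffle(cell_records)
        sample.extend(cell_records[:per_cell])
    return sample


to_transcribe = stratified_sample(performances)
print([p["id"] for p in to_transcribe])
```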
Although building a test performance corpus for local language tests can be
time-consuming, the kinds of data included in the corpus can be unique and
high in quality. Because the test administration is standardized, the performances
in a test corpus are more consistent in terms of tasks and response conditions.
Therefore, it is suitable for more in-depth investigations than other natural lan-
guage corpora collected from non-controlled environments. In addition, because
the corpus is built upon a local language test, it can be used to address all sorts
of interesting questions regarding the local context (Barker, 2012).
Summary
This chapter focused on the local test data flow, which includes (1) collection of
different data types (test-taker background information, responses, scores), (2) stor-
age and management of test data, and (3) data uses for result reporting, item ana-
lysis, program evaluation, and research. Collection of relevant data and the
establishment of a well-structured database facilitate result reporting and ongoing
test analysis for quality assurance. Digital technology enhances test data storage,
management, and use, and so you may want to consider digitizing paper-based
written responses or taped speech responses for long-term research. We would
emphasize, however, the importance of ensuring that the data collection, storage,
and use comply with the personal data protection laws in your context.
Further reading
Davidson, F. (2000). The language tester’s statistical toolbox. System, 28(4),
605–617.
Davidson (2000) discusses procedures and purposes of statistical tools that are
used for common activities in developing language tests, as well as for analyzing
and managing test data. The statistical methods are based mainly on Classical
Test Theory and Item Response Theory. The author also introduces computer
software that can perform each statistical analysis method, categorized into three
tiers: computer spreadsheets (e.g., Excel); statistical packages (e.g., SPSS and SAS);
and specialized software (e.g., FACETS and Bilog). The author stresses the
importance of meaningful interpretation of statistical analyses and warns against statistical determinism in language test development, expressing a preference for statistical simplicity over complexity. The author recommends combining statistical and non-statistical evidence (e.g., expert judgment) when evaluating and making decisions about the educational quality of language tests and test tasks, reminding us that the ultimate goal is to develop and implement valid assessments.
Shin, S., & Lidster, R. (2017). Evaluating different standard-setting methods in
an ESL placement testing context. Language Testing, 34(3), 357–381.
Shin and Lidster (2017) examine three commonly practiced standard-setting procedures for making cut-off score decisions: the Bookmark method, the Bor-
derline group method, and cluster analysis, which are differentiated by their
orientations (i.e., test-centered, examinee-centered, and statistical orientation,
respectively). Motivated by the issue of frequent discrepancy among the differ-
ent methods in cut-score decisions, the authors investigated the validity and reli-
ability of placement cut-offs that each method derives in the context of an
intensive English program at a large US public university. The analysis
revealed that each method resulted in significantly different cut-off scores,
which calls for a more rigorous selection process. The authors outlined the
strengths and weaknesses of each standard-setting approach and, in particular,
noted procedural and internal issues of using cluster analysis for placement tests
in programs with fixed curricula in English for academic purposes contexts.
Suggestions for improving the Bookmark and the Borderline group method
were also made.
Zenisky, A. L., & Hambleton, R. K. (2012). Developing test score reports that
work: The process and best practices for effective communication. Educational
Measurement: Issues and Practice, 31(2), 21–26.
Zenisky and Hambleton (2012) discuss the development of score reports and
reporting practice to advance clear and meaningful communication of interpret-
able test scores with various stakeholders, including non-professionals. The
authors begin by reviewing recent empirical, research-based efforts to systematize and synthesize score reports, development methods, and scoring approaches following the passage of the No Child Left Behind Act.
The focus of the review is on current and recommended best practices concern-
ing the development process, design, format, contents, ancillary materials, and
dissemination of score reports. The authors further introduce a score report
development model that guides design and validation with seven principles to
meet non-professionals’ information and usability needs. For future directions,
the authors stress the need to investigate the areas of online reporting, subscore
reporting, and practitioners’ understanding and use of reports.
References
Amazon Web Services (AWS) – Cloud Computing Services. (n.d.). Retrieved from
https://aws.amazon.com/
Asia-Pacific Economic Cooperation. (n.d.). APEC privacy framework. Retrieved from www.apec.
org/Publications/2005/12/APEC-Privacy-Framework
Australian Government. (n.d.). Notifiable data breaches. Retrieved from www.oaic.gov.
au/privacy/notifiable-data-breaches/
Bachman, L. F. (2004). Statistical analyses for language assessment. Cambridge: Cambridge
University Press.
Bachman, L. F., & Kunnan, A. J. (2005). Workbook and CD for statistical analyses for language assessment. Cambridge: Cambridge University Press.
Barker, F. (2012). How can corpora be used in language testing? In A. O’Keeffe &
M. McCarthy (Eds.), The Routledge handbook of corpus linguistics (pp. 633–645).
New York: Routledge.
Brown, J. D. (2005). Testing in language programs: A comprehensive guide to English language
assessment (2nd ed.). New York: McGraw-Hill College.
Dimova, S. (2017). Life after oral English certification: The consequences of the Test of
Oral English Proficiency for Academic Staff for EMI lecturers. English for Specific Pur-
poses, 46, 45–58.
EALTA. (2006). EALTA Guidelines for good practice in language testing and assessment.
Retrieved from www.ealta.eu.org/guidelines.htm
European Commission. (2019, July 24). EU data protection rules. Retrieved from https://ec.
europa.eu/commission/priorities/justice-and-fundamental-rights/data-protection/2018-
reform-eu-data-protection-rules_en
Ginther, A., Dimova, S., & Yang, R. (2010). Conceptual and empirical relationships
between temporal measures of fluency and oral English proficiency with implications
for automated scoring. Language Testing, 27(3), 379–399.
ILTA. (2007). International language testing association: Guidelines for practice. Retrieved
from cdn.ymaws.com/www.iltaonline.com/resource/resmgr/docs/ilta_guidelines.pdf
US Department of Education. (2018, March 1). Family Educational Rights and Privacy Act
(FERPA). Retrieved from www2.ed.gov/policy/gen/guid/fpco/ferpa/index.html
Zhang, C., & Yan, X. (2018). Assessment literacy of secondary EFL teachers: Evidence
from a regional EFL test. Chinese Journal of Applied Linguistics, 41, 25–46.
9
REFLECTIONS
Introduction
In the previous chapters, we discussed the essential considerations when making
decisions about activities related to test planning, design, and implementation.
Instead of proposing a single approach to test development, we emphasized that
the type and order of activities in which you engage during the test develop-
ment process will depend on the needs and the resources available in your local
context. The characteristics of the local context may create unique dilemmas or challenges for test development, and you have to learn lessons and find solutions in these situations. Thus, we would like to conclude this
book by sharing our reflections on some of the dilemmas that we experienced
while developing and revising our tests, as well as the lessons that we learned in
the process.
Due to the performance-based format of our tests, most of our reflections refer
to scale development and revision, along with rater training. Although we shared
some experiences in the development of the three tests, we each reflect on the
most salient aspects of test development, implementation, and grounding in our
particular contexts. The reflections on the TOEPAS focus on issues related to
designing a platform for data management, as well as the process of scale revision
and the adequacy of scale descriptors, especially language norms. The EPT section
continues with reflections on the scale revision process, including reflections on
building a testing team, communicating with stakeholders, and considering the
time needed to implement test revisions. The OEPT reflections emphasize the
role of instructors in the design of scales, the rating of responses, and the embed-
ding of the test in an instructional program. We hope that these points will pro-
vide some support, either technical or emotional, whenever you feel perplexed,
frustrated, or even discouraged in your efforts to develop your own test.
the observed performances, but also to write up formative feedback reports that
test-takers receive alongside their scores.
An MS Access database was developed to keep the TOEPAS test data in an
organized manner. The flow of data consisted of several stages. The test-takers registered to take the test via an online registration form, and the TOEPAS administrator entered the registration data in the MS Access database. The raters recorded the scores they assigned on a paper rating form. The administrator
collected the completed rating forms and entered the scores in the MS Access data-
base. The TOEPAS coordinator accessed the database to perform queries in order
to find the relevant performances for rater training and data analysis.
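The stages of this data flow map naturally onto a small relational schema. The sketch below uses SQLite purely for illustration; the actual TOEPAS database was built in MS Access, and the table and column names here are hypothetical.

```python
# Sketch of the registration -> rating -> query flow described above.
# SQLite stands in for MS Access; table and column names are hypothetical.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE test_takers (
    taker_id   INTEGER PRIMARY KEY,
    name       TEXT,
    department TEXT
);
CREATE TABLE administrations (
    admin_id  INTEGER PRIMARY KEY,
    taker_id  INTEGER REFERENCES test_takers(taker_id),
    test_date TEXT
);
CREATE TABLE ratings (
    rating_id INTEGER PRIMARY KEY,
    admin_id  INTEGER REFERENCES administrations(admin_id),
    rater     TEXT,
    score     INTEGER,
    comments  TEXT
);
""")

# Stage 1: the administrator enters the online registration data.
conn.execute("INSERT INTO test_takers VALUES (1, 'Lecturer A', 'Biology')")
conn.execute("INSERT INTO administrations VALUES (1, 1, '2017-03-01')")

# Stage 2: scores from the paper rating forms are transferred to the database.
conn.execute("INSERT INTO ratings VALUES (1, 1, 'Rater 1', 4, 'Fluent, well organized')")
conn.execute("INSERT INTO ratings VALUES (2, 1, 'Rater 2', 4, 'Effective interaction')")

# Stage 3: the coordinator queries for performances at a given score level,
# e.g., to select material for rater training.
rows = conn.execute("""
    SELECT t.name, r.rater, r.score
    FROM ratings r
    JOIN administrations a ON a.admin_id = r.admin_id
    JOIN test_takers t ON t.taker_id = a.taker_id
    WHERE r.score = 4
""").fetchall()
print(rows)
```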
Five years after the implementation of TOEPAS, several issues were raised
regarding the scale and data management. The following points refer to those
issues and our decisions on how to address them.
several times in order to check whether the data flow functioned as expected.
The database platform was updated several times based on the bugs identified
during the trials before it became operational in 2017. In the next stages, we
plan to develop a rater training interface, which will relate rater training data to
the existing database platform.
Identify the language norms that are appropriate for your context
The analyses that led to the development of the TOEPAS focused on identify-
ing the appropriate tasks and the needs of the EMI teachers and programs, but
they failed to take into consideration the English as a lingua franca (ELF) context
of the UCPH. Given that UCPH is traditionally a non-Anglophone university
and that most teachers and students are L2 speakers of English, the need to
reconsider the “well-educated native-speaker” norm and the emphasis on pro-
nunciation and grammar accuracy became apparent. However, selecting
adequate norms for an ELF context was not an easy feat. The uses of English in ELF contexts are characterized not only by structural variation at the morphological and phrasal levels, but also by communicative discourses based on negotiation of meaning and practices of mediation
(Hynninen, 2011; Mauranen, 2007, 2010, 2012; Ranta, 2006). The revision of
the TOEPAS became quite challenging because we were uncertain of how to
deal with the tension between the observed variation of norms in the ELF con-
text and our need to standardize the scale (Dimova, 2019).
Based on additional interviews with EMI teachers, analysis of the video
recordings with performances at different proficiency levels, and previous
research findings regarding pragmatically effective speakers in the EMI classroom
(Björkman, 2010; Suviniitty, 2012), we decided to remove the native speaker
reference from the scale and to revise the scalar categories to reflect the prag-
matic strategies lecturers use when they teach. Some of the new categories we
added include descriptors related to lecturers’ abilities to discuss material effect-
ively through summaries, examples, or emphasis of important points, as well as
their abilities to deliver their lectures in an organized and structured manner
exploiting different types of explicit and implicit cohesive devices. Although we
let some structural linguistic descriptors (e.g., grammar, syntax) remain, we made sure they were strongly deemphasized (Dimova, 2019).
formative feedback for the test-takers, we decided to either remove the technical
descriptors or to use them together with explanations. For example, although
“intonation” may be a more precise linguistic term, “speech melody” seems more
comprehensible for test-takers.
Another issue with technicality was identified with the content of the written
feedback report for test-takers. The first version focused on performance descrip-
tions based on the five scale categories (i.e., fluency, pronunciation, grammar,
vocabulary, and interaction). When we analyzed test-takers’ uses of the feedback
report, we realized that the test-takers, i.e., the EMI teachers, could not identify
the implications of the feedback for their teaching in the classroom (Dimova,
2019). Moreover, the feedback had the negative effect of reinforcing their inse-
curities related to native-like pronunciation and grammar.
In order to communicate the feedback to the test-takers, we decided to contextual-
ize the descriptions of test-takers’ performance by using references to the EMI class-
room. In other words, we emphasized the role of different language characteristics and
strategies in teaching a linguistically diverse student body. This approach also allowed
us to discuss the interaction of different aspects of the performance in relation to class-
room communication. For example, instead of describing test-takers’ fluency and then
their pronunciation, we refer to both speech rate (fluency) and pronunciation, as well
as other strategies like repetition and emphasis, in relation to intelligibility.
These reflections about the rescaling of TOEPAS have dealt with scale
descriptors and norms. The following section continues with reflection on
rescaling, but with a focus on building the testing team, communicating with
the stakeholders, and considering local resources for test revision.
time, and the range of proficiency has also shrunk. Over the years, the ESL program
where the EPT is housed has trimmed the rating scale from five levels to two that
only classify students into higher and lower levels. While trimming the scale increased
the efficiency of scoring to a certain extent, since raters only have to choose between
two levels, it lost the ability to provide more fine-grained information about test-
taker performance. As the descriptors were all in relative terms (higher vs. lower level,
more vs. fewer errors), the scale does not offer much useful information about the
test-takers to the instructors. Against this backdrop, we started a three-year test rescal-
ing project that was accomplished through tester-teacher collaboration. The rescaling
process and the resultant scale have been described in different chapters throughout
this book. In this section, we provide some reflections on the rescaling project, dis-
cussing what we have learned and what we wish we could have avoided.
You do not have to change the whole test at once, even if that is the
long-term goal
Besides concerns with personnel and budget, when revising the test, I was faced
with the question of whether I should wait to develop a completely new test at
once or make changes bit by bit. There is no absolute advantage to taking one
approach over the other, and you should choose according to the assessment ques-
tions at hand and the resources available. I started with the former, even though
I had very limited resources. This proved to be unproductive: I intentionally stayed
away from any connection to the old test and spent a month with three core mem-
bers contemplating a new test that could achieve the different purposes a local test might ful-
fill. At first, we all felt excited about the prospects of a new test for the ESL
program. However, that feeling quickly wore off, as we kept adding design features
to the test at every meeting, until it soon became an impossible task. Eventually, we
compromised (which helped us to keep our sanity) by only changing the scale and
examining the need for further changes after the new scale was in use for a while.
Incremental change is possible for local language tests as the stakes are relatively
lower and the context tolerates ongoing test development. If you have a similar dilemma in your
own context, then you should consider developing the test bit by bit rather than all
at once, even if that is the long-term goal.
The feelings of uncertainty will weaken and eventually disappear after you see
the positive changes that the new test brings.
When introducing the new test, expect pushback and give it some
time to settle in
After the new scale was drafted and piloted within the development team, we
introduced the scale to the rater community within the ESL program. While we
anticipated questions and concerns about the differences between the old and
new scale, we underestimated the pushback we would receive from the rater
community, despite general excitement about the improvement that the new
scale would bring. Many raters felt reluctant to use the new scale, even those
who had suggested changes to the old scale during the planning stage. The
pushback was so strong that some raters remained quiet throughout the entire
semester of rater training. The raters’ pushback was also reflected in their rating
performance during the initial rounds of training. Although rater performance
improved substantially over the course of the semester, the raters maintained
a rather negative feeling about the introduction of the new scale.
While it was a disheartening experience for the scale development team, in the
ensuing semesters, the amount of negative energy surrounding the new scale grad-
ually decreased. Instead, we started to hear comments and conversations from the
raters about how the new scale matches the range of writing performance on the
test better and how it also made them view student performance in class differ-
ently. This shift in the raters’ reaction took time, but it brought about a stronger
sense of achievement and comfort than if it had occurred during the first semester.
Now that three years have passed since the introduction of the new scale, rater
performance has increased substantially from the first semester, and rater training
has become a less fraught process. Looking back at this experience, we feel that
the initial pushback was perhaps inevitable, as we were introducing changes that
required time and commitment from the raters. These changes had an impact not
only on their rating practice but also on the courses they teach. However, as the
test settled in and the positive changes became more apparent, the raters eventu-
ally embraced the new scale. Just remember, you need to give your new test a bit
of time to settle into your local assessment context, and expect some emotional
turbulence from the raters or other stakeholders during the transition stage.
background in statistics, coupled with the fear that they do not have the ability
to model large sets of quantitative data. Such attitudes are reflective of a rather
narrow view of what language testing involves. In fact, teachers’ or local mem-
bers’ assessment literacy should not be underestimated. There are many elements
of language testing that are inseparable from teaching and learning. We might
even argue that teachers assess all the time when they teach. In the context of
the EPT, the new scale would have been impossible without feedback from the teachers, who
helped refine the scale in such a way that the descriptors became more user-
friendly and interpretable to other stakeholders. Moreover, inspired by the
changes to the rating scale, a committee of teachers in the ESL program was
formed and tasked to redesign the rubric used for classroom assessment and to
develop a norming session for the rubric. At first, one of the committee mem-
bers joined our development meeting and the language testing reading group at
our institution to brush up her assessment knowledge so that she could develop
the rubric in an informed way. During meetings and conversations, she realized
that there is nothing in assessment development and use that is completely new
for language teachers. It is just that, when thinking about tests, the principles for
assessment practice and the quality of the test are prioritized.
Overall, our original optimism has been rewarded, but our confidence has
been tempered. What we thought would proceed smoothly and quickly, with
a clear end in sight, was revealed as an ongoing process. Once started, test devel-
opment never ends (consider the different versions of the TOEFL). Revising an
existing test or developing a new one is like inheriting or building a house.
After a while, you realize that the house owns you as much as you own it, and
that there’s always something that needs attention.
We (grad student instructors and staff) began conducting research early on.
OEPT data has been used by student researchers in Second Language Studies,
Linguistics, Speech and Audiology, and Education. Other studies have taken
a while to complete and have been dependent on the continued interest of gradu-
ate student researchers (e.g., Ginther, Dimova, & Yang, 2010). In the past twenty
years, the OEPP has produced a prodigious amount of instructional materials and
test data, most of which remains underutilized. There is never enough time.
Despite the fact that there’s always too much work, the rewards can be great.
As stated in previous chapters, the OEPT was developed to replace the SPEAK, which consisted of retired versions of the Test of Spoken English (Educational Testing Service,
1980a, 1980b). SPEAK items represented generic contexts, and we had the
opportunity to represent our local context in our development of its replacement.
OEPT items/tasks are always introduced with the local context in mind (e.g., In
teaching and research contexts at Purdue, you will have many opportunities to introduce
yourself … provide a brief introduction). The test’s orientation to the local contexts of
use provides test-takers with a short but meaningful introduction to instructional and research contexts at the university. For many, this is the only introduction that
they will receive before they start teaching. Because many will be assigned teach-
ing duties in the following week, and because those who pass have limited options
for support, we have always rated conservatively. We argue that, all things con-
sidered, the best outcome is for test-takers to fail and then receive support. So far
and for the most part (there’s always some noise), both students and graduate
advisors agree. However, the extent to which test-takers, graduate advisors, and
program administrators support instruction remains an area of ongoing scrutiny
and negotiation. Winning that argument means that the test and program must
stand on their merits and be seen as enhancing student opportunities.
Shelter as structure
Like local language tests, local language programs occupy an in-between, inter-
mediate space where instructors introduce and guide students from sheltered
into increasingly unsheltered contexts of language. Most of us welcome shelter,
especially hard-pressed international students, but perhaps their international and
domestic instructors welcome it even more; the metaphor of program as shelter
works best when the program provides shelter for everyone.
Because we spend so many hours at work, the importance of having a comfortable,
productive, engaging work environment cannot be overstated. We reside in
a field that is associated, for better or worse, with teaching and service, where incen-
tives and rewards are limited. Instructors are the central players. When they are con-
vinced that they are adding value for their students, the students tend to agree.
Providing common structures for instructors goes a long way for creating an environ-
ment that feels like a shelter for both instructors and students. Instructors want to
experiment and come up with creative interventions for classroom challenges; at the
same time, being excellent students, they want to come up with the right answers.
We have found that the 50/50 rule works well for providing a foundation, allow-
ing experimentation, and encouraging adaptability. Assigned, common activities
(conversations, presentations) occur on half of the days that students are in class, and
teachers are free to develop or select activities for the remainder. Instructors may
select from a library of tried-and-true teaching activities, organized by week and
theme, that we have collected over the years and are available in the OEPP Course
Reader and Instructor’s Manual (Haugen, Kauper, & Ginther, 2019).
Instructors are usually fairly busy, many are overworked, and all are under-
paid. The development of a local test requires realignment and reassignment of
work duties; program instructors, staff, and administrators must accept additional
responsibilities and contribute to test development and administration for a local
test to function. OEPP instructors, who had experience training teaching assist-
ants, were central to the development of OEPT tasks and the OEPT scale (actu-
ally, graduate instructors were the only labor available). Fortunately, we have
been able to adjust workloads along with instructors’ changing roles and respon-
sibilities. For example, we were able to reduce instructors’ class hours to make
room for the hours they would need to rate exams. Instructors appreciated
their lighter instructional loads and the value of their dual instructor/rater roles
because of the training involved (they may serve as program administrators and
may need to develop tests and assessments) and because of the greater instruc-
tional coherence that can occur when testing is embedded. Instructors often
accept rater responsibilities if/when they see that their involvement in rating
tests improves practice, but adaptations and adjustments must be made.
Even with half of the schedule fixed, we still have a lot going on. One of the
best ways we found to manage all activities seems obvious: providing a semester
schedule for instruction and testing/rating (see Appendix A). I used to think that
schedules were about managing time, but now I think that they are primarily an
expression of values: what merits space and time in the schedule indicates where
the values of the program lie. Regularly scheduled activities provide the basis for
technical reports (regular statistical analyses), annual reports for administrators
(testing volumes, pass rates, class enrollments by department and college), and
program newsletters (look what we’ve done for you lately).
duties, rater training can be highly effective. Importantly, training is less effective if conducted only as an introduction and/or a one-time fix. Raters
have been found to perform well after initial training, and then diverge, but
recalibrate and increase their effectiveness over multiple training/rating sessions.
Large-scale tests usually require raters to calibrate to the scale by rating
a previously rated set of exams before beginning a rating session.
Local programs can institute similar calibration practices by assigning sets of previously rated exams as part of the regular rater training process. There are
many ways to train raters. After experimenting with some, we now emphasize
benchmark performances – those responses that will be most likely to elicit agree-
ment. In our experience with OEPT rater training, we have also found it useful
to ask raters to focus on why a performance was assigned a score rather than why
an individual rater may disagree with a score assignment; this is another way to
enhance agreement. Rater training can be tricky, especially in tight-knit commu-
nities. So, finding ways to enhance agreement is worth the effort.
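One simple way to operationalize calibration against previously rated benchmark sets is to check each rater’s exact and adjacent agreement with the benchmark scores before live rating begins. The sketch below is a generic illustration rather than the OEPT procedure; the scores and the 80% threshold are invented.

```python
# Sketch: checking a rater's agreement with benchmark scores during calibration.
# Scores (on a 35-55 scale in steps of 5) and the 80% threshold are invented.
def agreement(rater_scores, benchmark_scores):
    pairs = list(zip(rater_scores, benchmark_scores))
    exact = sum(r == b for r, b in pairs) / len(pairs)
    # "Adjacent" here means within one scale step (5 points on this scale).
    adjacent = sum(abs(r - b) <= 5 for r, b in pairs) / len(pairs)
    return exact, adjacent


benchmarks = [40, 45, 50, 35, 45, 50, 40, 55]
rater = [40, 45, 45, 35, 50, 50, 40, 55]

exact, adjacent = agreement(rater, benchmarks)
print(f"exact agreement: {exact:.0%}, adjacent agreement: {adjacent:.0%}")
if exact < 0.8:
    print("Discuss the divergent benchmark performances before the live rating session.")
```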
Nevertheless, it is always tempting to focus on those performances that result
in split scores to determine, once and for all, whether the performance is best
identified as representing a higher or lower score. These conversations are
always interesting, some of the best, but seldom resolve the underlying issues.
Some performances will always produce split scores because they capture aspects
of performance that lie between score points. Perfect rater agreement is impos-
sible and undesirable. Some examinees will always fall on the margins, and raters
will focus on different aspects of performance. Discussion of similarities and dif-
ferences across examinee performance at the same and different levels of the
scale is fascinating, sometimes problematic, always interesting, and never fully
resolvable. Regular rater training can be something instructor/raters look for-
ward to, but it may take a while to get there.
Extend the local scale from the test to classroom assessments and
self- and peer-evaluations
Test developers and instructors may be tempted to create unique scales, rubrics, and
guidelines for each test task or assessment activity; however, we have found that
extending a single scale across assessment opportunities can produce great benefits.
Herman (1992) argues that effective classroom assessment practices benefit when
the use of scales and tasks can be extended and she recommends that instructors:
Translated to language programs, these guidelines mean that the same scale can
be applied across items and classroom assessments. For example, the holistic
OEPT scale is used to evaluate performance across different item types within
the test (e.g., express an opinion, compare and contrast, or respond to
a complaint), and also as the basis for the evaluation of classroom presentations.
Evaluation and instruction overlap.
I am fortunate to be in a position where I direct both the test and the pro-
gram, which has made it much easier to embed the OEPT in the OEPP.
Coordinating across program directors and programs requires negotiation and
makes the aligning of a test with a program more complicated. Despite compli-
cations and setbacks, however, when considering all aspects of my academic
career, my experience as a program director is what I value most.
Conclusions
In the test development and analysis literature, most discussions focus on the
principles of developing large-scale tests or classroom-based assessments, while
the development of local language tests tends to be overlooked. Local language
tests can be positioned at different points on the continuum between large-scale
tests and classroom assessment and, hence, they may share characteristics with
both; while local language tests undergo a certain degree of standardization, they
are also used to inform classroom instruction. The central characteristic of local
tests is their embeddedness in context, as they represent local language uses and
local educational, professional, or cultural values.
Given the close ties of local language tests with the local context, language
test developers may experience constraints related to a lack of material, spatial,
and human resources, or unrealistic stakeholder expectations. On the other
hand, local stakeholders may be interested and supportive when they are
involved in the test development process and when they see that the subsequent
use of the test benefits the community.
The primary aim of this book has been to raise awareness of the opportunities and challenges of local test development and, more importantly, to show how language testers can better represent the characteristics of the local context. Our reflections highlight our individual contexts and
our preoccupations and ongoing concerns. All things considered, we are confident
that the potential benefits of local test development are well worth the effort.
References
Björkman, B. (2010). So you think you can ELF: English as a lingua franca as the medium
of instruction. Hermes–Journal of Language and Communication Studies, 45, 77–96.
Dimova, S. (2017). Life after oral English certification: The consequences of the Test of
Oral English Proficiency for Academic Staff for EMI lecturers. English for Specific Pur-
poses, 46, 45–58.
OEPT Review
Listen to each OEPT item response and provide descriptions of and comments
about the student’s performance. Complete the item, synthesis, and formative
section prior to meeting with the student. Work on the goal section with
the student in the first few weeks of the semester. Think about strengths as well
as areas that need improvement and describe these to your student during your
first conference or tutoring session. It is important that students understand why
they did not pass the test and were recommended to take ENGL 620. You will
also use this information while preparing goals and the Mid-term Evaluation.
Item Section
(Columns: Item | Descriptions/Comments)

Synthesis Section
Rate elements of oral proficiency using OEPT scale
(Columns: Language Skills | Developing | Sufficient | Descriptions/Comments; OEPT scale points 35 40 45 50 55)

Intelligibility
  Overall Intelligibility
  Pronunciation
  Fluency/Pausing
  Stress/Intonation/Rhythm
  Volume/Projection
  Speed
Vocabulary
  Range
  Usage
  Idiomaticity
  Discourse Markers
Grammar
  Word Order/Syntax
  Verb Use
  Bound Morphology
  Article Use
  Other
Listening Comprehension
Goal Section: By the end of September, finalize the three selected language skills for improvement and formalize weekly and/or daily task based goals for homework and Performance Goals for what is expected in PRs, ICs, and class/conference.
At Final Evaluation, rate achievement of goals by +, *, or -.
+ = Has met the performance goal.
* = Showing improvement and making good progress towards the performance goal.
- = Showing little or no progress in meeting the goal.
(Columns: Language Skill | Task Based Goals (Practice Strategies for Homework) | Performance Goals (Outcomes) | Final Evaluation)
1
2
3
Mid-term Evaluation and Final Evaluation
(Columns, repeated for each evaluation: 35-40 | 45 | 50-55 | Description of skills)

Core Language Skills
Intelligibility
  Overall intelligibility
  Pronunciation
  Fluency/Pausing
  Stress/Intonation/Rhythm
  Volume/Projection
  Speed
Vocabulary
  Range
  Usage
  Idiomaticity
Grammar
  Word order/syntax
  Verb use
  Bound morphology
Core Interactive Skills
  Indicating Comp.
  Seeking Clarification
  Providing Clarification
(Rating categories: Pronunciation, Fluency/Pausing, Rhythm/Stress/Intonation, Volume/Projection, Speed, Grammar, Vocabulary, Seeking clarification)
Written comments about your overall PR1 performance and your ability to
communicate successfully with the audience:
1. Strengths:
2. Weaknesses:
1. How did you encourage your audience to ask questions? Transcribe what
you said.
2. How many comprehension checks did you make? ____ Transcribe two
of your best comprehension checks (write the exact words you said):
Question:
Your restatement:
5. How many times did you rehearse your presentation, actually speaking
your sentences? ____