English Testing for S1 Students
ENGLISH LANGUAGE
                           TESTING
                            Compiled By:
                  BERTARIA SOHNATA HUTAURUK
            English Education Study Program
FACULTY OF TEACHER TRAINING AND EDUCATION
           UNIVERSITAS HKBP NOMMENSEN
                      PEMATANGSIANTAR
                               2015
3. Informal vs. Formal Assessments: Tests are not the only end-all-be-all
of how we assess
9. Performance-Based Assessment
“If mathematics teachers were to focus their efforts on classroom assessment that is
primarily formative in nature, students’ learning gains would be impressive. These
efforts would include gathering data through classroom questioning and discourse,
using a variety of assessment tasks, and attending primarily to what students know and
understand” (Wilson & Kenney, page 55).
“Most experienced teachers will say that they know a great deal about their students in
terms of what the students know, how they perform in different situations, their attitudes
and beliefs, and their various levels of skill attainment. Unfortunately, when it comes to
grades, they often ignore this rich storehouse of information and rely on test scores and
rigid averages that tell only a small fraction of the story.
The reason this is a problem is that students learn what is valued and they strive to do
well on those things. If the end-of-unit tests are what are used to determine your grade,
guess what kids want to do well on, the end-of-unit test! You can do all the great
activities you want, but if the bottom line is the test, then that is what is going to be
valued most by everyone: teachers, students, and parents, alike.
It is very different from what we are used to doing. We are used to teaching and then
assessing. In reality, the line between teaching and assessment should be blurred
(NCTM, 2000). “Interestingly, in some languages, learning and teaching are the same
word” (Fosnot and Dolk, page 1). We need to assess on a daily basis to give us the
information to make choices about what to teach the next day. If we just teach the
whole unit and wait until the end-of-unit test to find out what the kids know, we may be
very unhappily surprised. On the other hand, if we are assessing on a daily basis
throughout the unit, we do not need to average all those assessments to come up with a
final evaluation. Instead, we could just use the most recent assessments to make that
evaluation. In this way, we do not penalize the student who did not know much at the
beginning of the unit and worked really hard to learn what we felt were the big ideas.
Instead, we rate students on where they are when the unit is finished. This gives a more
accurate report or evaluation of where they are performing when the evaluation is made.
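The difference between averaging every assessment and using only the most recent ones can be seen in a small sketch. The scores, the function names, and the choice of a three-assessment window are illustrative assumptions, not figures from any real class.

```python
def average_grade(scores):
    """Traditional approach: every assessment in the unit counts equally."""
    return sum(scores) / len(scores)


def most_recent_grade(scores, window=3):
    """Approach described above: use only the latest `window` assessments."""
    recent = scores[-window:]
    return sum(recent) / len(recent)


# Hypothetical scores (0-100) for one student, in the order they were collected.
scores = [40, 50, 60, 80, 90, 100]

print(average_grade(scores))      # 70.0
print(most_recent_grade(scores))  # 90.0
```

The student who started low but finished strong is rated 70 by the average and 90 by the most recent work, which is closer to where the student actually is at the end of the unit.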
Evaluation: Procedures used to determine whether the subject (i.e. student) meets
preset criteria, such as qualifying for special education services. This uses assessment
(remember that an assessment may be a test) to make a determination of qualification
in accordance with predetermined criteria.
Measurement, beyond its general definition, refers to the set of procedures and the
principles for how to use those procedures in educational evaluations. Examples would be raw
scores, percentile ranks, derived scores, standard scores, etc.
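As a rough illustration of these measurement terms, the sketch below turns a raw score into a percentile rank, a standard score (z-score), and one common derived score (a T-score with mean 50 and standard deviation 10). The data are made up, and the percentile-rank formula shown (counting half of any tied scores) is only one of several definitions in use.

```python
from statistics import mean, pstdev


def percentile_rank(score, group):
    """Percentage of the group scoring below `score`, counting half of ties."""
    below = sum(1 for s in group if s < score)
    equal = sum(1 for s in group if s == score)
    return 100.0 * (below + 0.5 * equal) / len(group)


def z_score(score, group):
    """Standard score: distance from the group mean in standard deviations."""
    return (score - mean(group)) / pstdev(group)


def t_score(score, group):
    """Derived score: the z-score rescaled to mean 50 and standard deviation 10."""
    return 50 + 10 * z_score(score, group)


# Hypothetical raw scores for a class of eight learners (mean 5, SD 2).
raw_scores = [2, 4, 4, 4, 5, 5, 7, 9]

print(percentile_rank(7, raw_scores))  # 81.25
print(z_score(9, raw_scores))
print(t_score(9, raw_scores))
```

A raw score of 9 sits two standard deviations above the class mean, so its z-score is about 2 and its T-score about 70.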
Example
At the end of the course, the learners have a final exam to see if they pass to the next
course or not. Alternatively, the results of a structured continuous assessment process
are used to make the same decision.
In the classroom
Informal and formal assessments are both useful for making valid and
useful assessments of learners' knowledge and performance. Many teachers combine the
two, for example by evaluating one skill using informal assessment such as observing
group work, and another using formal tools, for example a discrete item grammar test.
Formative assessment
Formative assessment is the use of assessment to give the learner and the teacher
information about how well something has been learnt so that they can decide what to
do next. It normally occurs during a course. Formative assessment can be compared
Example
The learners have just finished a project on animals, which had as a language aim better
understanding of the use of the present simple to describe habits. The learners now
prepare gap-fill exercises for each other based on some of their texts. They analyse the
results and give each other feedback.
In the classroom
One of the advantages of formative feedback is that peers can do it.
Learners can test each other on language they have been learning, with the additional
aim of revising the language themselves.
It has once been said that “Everybody is a
genius. But if you judge a fish by its ability to climb a tree, it will live its whole life
believing that it is stupid.” Our students must be assessed relative to what their skills
are. This can be done through formal assessments, informal assessments, or a
combination of both.
The result of a formal test (e.g. a long test) alone does not necessarily reflect the entire
academic ability of our students. When a student fails a formal test
(e.g. a periodical test), we cannot conclude that his or her entire learning capability for
that subject has failed as well.
Assessing students is not limited to formal methods (e.g. giving out tests,
quizzes, summative exams, etc.); it also depends on the informal assessments
(e.g. coaching sessions, reflective logs, fly-by questions and answers, etc.) that reinforce
the formal ones.
There are many reasons why a student could fail a test (e.g. lack of sleep,
emotional and family distress, etc.), but there would be only a few reasons why he/she
would not be able to provide a reflective insight on the lesson. But how do we separate
formal assessments from informal ones?
The table and concept map below could give some help.
1. We want to gauge the students' cognitive, affective and manipulative skills in the
simplest way possible. We ask students to recite or write down essays to easily
determine if they understood a specific lesson well or poorly, if they are enthusiastic or
bored with the lesson, if they are already familiar or completely unfamiliar with the
topic, etc.
2. We deem that the results of the formal examinations are not enough to give a
concluding mark for the students’ performance. If a specific student performs
excellently in class activities but suddenly fails a summative test, it could tell us that
there is a discrepancy between our formal and informal assessments, or that
other factors might have been involved (e.g. student factors: did not
review, physically/emotionally troubled, etc.).
Although informal assessments provide teachers with a solid basis for judging how the students
are performing, this does not imply that they can replace formal assessments.
The two should work hand-in-hand and interdependently. One should complement the
other. For instance, if we opt to use role plays and recitals in assessing students’
communication skills informally, we should also align our formal exams with the
activities our students previously engaged in. In this way, we can ensure the validity and
fairness of our assessments. Moreover, we may find that these methods relieve our
burden of analyzing, comparing, and understanding our students’ “true” abilities.
We cannot just give (formal) tests or quizzes, in the same way that we cannot
consume course time with only (informal) class activities. Arriving at valid
and reliable grades for our students requires maximizing both formal and
informal assessments.
By contrast, a test is criterion-referenced when provision is made for translating the test
score into a statement about the behavior to be expected of a person with that score. The
same test can be used in both ways. Robert Glaser originally coined the terms norm-
referenced test and criterion-referenced test.
Standards-based education reform is based on the belief that public education should
establish what every student should know and be able to do. Students should be tested
against a fixed yardstick, rather than against each other or sorted into a mathematical
bell curve.
By requiring that every student pass these new, higher standards, education
officials believe that all students will achieve a diploma that prepares them for success
in the 21st century. Most state achievement tests are criterion-referenced. In other words,
Many college entrance exams and nationally used school tests use norm-referenced
tests. The SAT, Graduate Record Examination (GRE), and Wechsler Intelligence Scale
for Children (WISC) compare individual student performance to the performance of a
normative sample. Test takers cannot "fail" a norm-referenced test, as each test taker
receives a score that compares the individual to others that have taken the test, usually
given by a percentile. This is useful when there is a wide range of acceptable scores that
is different for each college.
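The statement that the same test can be used in both ways can be illustrated with a short sketch. The cut score, the normative sample, and the function names below are hypothetical and chosen only to show the two interpretations side by side.

```python
def criterion_referenced(score, cut_score):
    """Interpret a score against a fixed standard, ignoring other test takers."""
    return "pass" if score >= cut_score else "fail"


def norm_referenced(score, norm_group):
    """Interpret a score as the percentage of the normative sample scoring lower."""
    below = sum(1 for s in norm_group if s < score)
    return 100.0 * below / len(norm_group)


# Hypothetical normative sample and cut score for the same test.
norm_group = [35, 42, 48, 55, 61, 67, 70, 74, 82, 90]
cut_score = 60

for score in (55, 74):
    print(score,
          criterion_referenced(score, cut_score),
          norm_referenced(score, norm_group))
# 55 fail 30.0
# 74 pass 70.0
```

The criterion-referenced result depends only on the cut score, while the norm-referenced result would change if the comparison group changed, even though the test taker's own score stayed the same.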
Advantages to this type of assessment include that students and teachers know what to
expect from the test and just how the test will be conducted and graded. Likewise, all
schools will conduct the exam in the same manner, reducing such inaccuracies as time
differences or environmental differences that may cause distractions to the students.
This also makes these assessments fairly accurate as far as results are concerned, a
major advantage for a test.
Critics of criterion-referenced tests point out that judges set bookmarks around items of
varying difficulty without considering whether the items actually are compliant with
grade-level content standards or are developmentally appropriate. Thus, the original
1997 sample problems published for the WASL 4th grade mathematics contained items
that were difficult for college-educated adults, or easily solved with 10th grade level
methods such as similar triangles. The difficulty level of the items themselves and the
cut scores that determine passing levels are also changed from year to year. Pass rates also
vary greatly from the 4th to the 7th and 10th grade graduation tests in some states.
One of the limitations of No Child Left Behind is that each state can choose or construct
its own test, which cannot be compared to any other state. A Rand study of Kentucky
results found indications of artificial inflation of pass rates which were not reflected in
increasing scores in other tests such as the NAEP or SAT given to the same student
populations over the same time. Graduation test standards are typically set at a level
consistent for native-born four-year university applicants. An unusual side effect is that while
colleges often admit immigrants with very strong math skills who may be deficient in
English, there is no such leeway in high school graduation tests, which usually require
passing all sections, including language. Thus, it is not unusual for institutions like the
University of Washington to admit strong Asian American or Latino students who did
Although tests such as the WASL are intended as a minimal bar for high school, 27
percent of 10th graders applying for Running Start in Washington State failed the math
portion of the WASL. These students applied to take college-level courses in high
school, and achieve at a much higher level than average students. The same study
concluded the level of difficulty was comparable to, or greater than, that of tests
intended to place students already admitted to the college.
A norm-referenced test has none of these problems because it does not seek to enforce
any expectation of what all students should know or be able to do other than what actual
students demonstrate. Present levels of performance and inequity are taken as fact, not
as defects to be removed by a redesigned system. Goals of student performance are not
raised every year until all are proficient. Scores are not required to show continuous
improvement through Total Quality Management systems. A disadvantage is that
such assessments measure students' current level against where their peers
currently are, instead of against the level that all students should
be at.
A rank-based system produces only data that tell which average students perform at an
average level, which students do better, and which students do worse, contradicting
fundamental beliefs, whether optimistic or simply unfounded, that all will perform at
one uniformly high level in a standards-based system if enough incentives and
punishments are put into place. This difference in beliefs underlies the most significant
differences between a traditional and a standards-based education system.
Examples
    1.   IQ tests are norm-referenced tests, because their goal is to see which test taker is
         more intelligent than the other test takers.
A criterion-referenced test is one that provides for translating test scores into a
statement about the behavior to be expected of a person with that score or their
relationship to a specified subject matter. Most tests and quizzes that are written by
school teachers can be considered criterion-referenced tests. The objective is simply to
see whether the student has learned the material. Criterion-referenced assessment can be
contrasted with norm-referenced assessment and ipsative assessment.
Sample scoring for the history question: What caused World War II?

Student answers                     Criterion-referenced   Norm-referenced assessment
                                    assessment

Student #1:                         This answer is         This answer is worse than
                                                           Student #2's

Student #2:                         This answer is         This answer is better than
WWII was caused by multiple         correct.               Student #1's and Student #3's
factors, including the Great                               answers.
Depression and the general
economic situation, the rise of
nationalism, fascism, and
imperialist expansionism, and
unresolved resentments related
to WWI. The war in Europe began
with the German invasion of
Poland.

Student #3:                         This answer is         This answer is worse than
WWII was caused by the              wrong.                 Student #1's and Student #2's
assassination of Archduke                                  answers.
Ferdinand.
Many high-profile criterion-referenced tests are also high-stakes tests, where the results
of the test have important implications for the individual examinee. Examples of this
include high school graduation examinations and licensure testing where the test must
be passed to work in a profession, such as to become a physician or attorney. However,
being a high-stakes test is not specifically a feature of a criterion-referenced test. It is
instead a feature of how an educational or government agency chooses to use the results
of the test.
   1.   Driving tests are criterion-referenced tests, because their goal is to see whether
        the test taker is skilled enough to be granted a driver's license, not to see whether
        one test taker is more skilled than another test taker.
   2.   Citizenship tests are usually criterion-referenced tests, because their goal is to
        see whether the test taker is sufficiently familiar with the new country's history
        and government, not to see whether one test taker is more knowledgeable than
        another test taker.
Should language be tested by discrete points or by integrative testing? Traditionally, language tests have
been constructed on the assumption that language can be broken down into its component parts and that those
component parts can be duly tested.
What is discrete point?
Language is segmented into many small linguistic points and the four language skills of
listening, speaking, reading and writing. Test questions are designed to test these skills
and linguistic points. A discrete point test consists of many questions on a large number
of linguistic points, but each question tests only one linguistic point. Examples of
discrete point tests are:
1. Phoneme recognition.
2. Yes/No, True/False answers.
3. Spelling.
4. Word completion.
5. Grammar items.
6. Multiple choice tests.
Such tests have a down side in that they take language out of context and usually bear no relationship to
the concept or use of whole language. Discrete point tests met with some criticism,
particularly in view of more recent trends toward viewing the units of language and
its communicative nature and purpose, rather than viewing language as the arithmetic sum of all
its parts. That is why John Oller (1976) introduced “integrative testing”.
Oller (1979:38) has refined the integrative concept further by proposing what he calls
the pragmatic test. A pragmatic test is “...any procedure or task that causes the learner to
process sequences of elements in a language that conform to the normal contextual
constraints of that language and which requires the learner to relate sequences of linguistic
elements via pragmatic mappings to extralinguistic contexts.”
A step in a positive direction would be to concentrate on tests of communicative
competence. The recent direction of linguistic study has been toward viewing language
as an integrated and pragmatic skill, but since we cannot be certain that a test like a cloze test
meets the criterion of predicting or assessing a unified and integrated underlying
linguistic competence, we must be cautious in selecting and constructing tests
of language. There is nothing wrong with using the traditional tests of discrete points of
language, especially in achievement and other classroom-oriented testing in which
certain discrete points are very important.
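The cloze test mentioned above is commonly built by deleting words from a passage at a fixed interval. The sketch below illustrates that construction only; the function name `make_cloze`, the deletion interval, the number of words left intact at the start, and the sample passage are all illustrative assumptions rather than a procedure given in this text.

```python
def make_cloze(text, n=5, keep_first=5):
    """Replace every n-th word with a numbered blank.

    The first `keep_first` words are left intact to give the reader context.
    Punctuation attached to a deleted word is deleted with it, which a real
    test writer would want to tidy by hand.
    Returns the gapped passage and the list of deleted words (the answer key).
    """
    words = text.split()
    answers = []
    for i in range(keep_first + n - 1, len(words), n):
        answers.append(words[i])
        words[i] = f"({len(answers)})____"
    return " ".join(words), answers


# Hypothetical passage.
passage = "one two three four five six seven eight nine ten"
gapped, key = make_cloze(passage, n=3, keep_first=2)
print(gapped)  # one two three four (1)____ six seven (2)____ nine ten
print(key)     # ['five', 'eight']
```

Because the deleted words are chosen mechanically rather than to target a particular grammar point, the testee has to draw on the surrounding context to restore them, which is why the cloze is usually discussed as an integrative rather than a discrete point test.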
Sabria and Samer (other students) have pointed me in the direction of Cambridge
ToEFL exams (conformant with the Common European Framework of Reference for
Languages) as a potential basis for communicative testing. The tests are divided into the
4 principal language dimensions (Speaking, Listening, Writing and Reading) and
provide tests and marking criteria at all levels of competency including that for the
research context (Young Learners English – YLE starters).
Communicative language tests are intended to be a measure of how the testees are able
to use language in real life situations. In testing productive skills, emphasis is placed on
appropriateness rather than on ability to form grammatically correct sentences. In
testing receptive skills, emphasis is placed on understanding the communicative intent
of the speaker or writer rather than on picking out specific details. And, in fact, the two
are often combined in communicative testing, so that the testee must both comprehend
Tasks
Communicative tests are often very context-specific. A test for testees who are going to
British universities as students would be very different from one for testees who are
going to their company's branch office in the United States. If at all possible, a
communicative language test should be based on a description of the language that the
testees need to use. Though communicative testing is not limited to English for Specific
Purposes situations, the test should reflect the communicative situation in which the
testees are likely to find themselves. In cases where the testees do not have a specific
purpose, the language that they are tested on can be directed toward general social
situations where they might be in a position to use English.
This basic assumption influences the tasks chosen to test language in communicative
situations. A communicative test of listening, then, would not test whether the testee
could understand what the utterance "Would you mind putting the groceries away
before you leave?" means, but would place it in a context and see if the testee can respond
appropriately to it.
Tests intended to test communicative language are judged, then, on the extent to which
they simulate real life communicative situations rather than on how reliable the results
are. In fact, there is an almost inevitable loss of reliability as a result of the loss of
control in a communicative testing situation. If, for example, a test is intended to test the
ability to participate in a group discussion for students who are going to a British
university, it is impossible to control what the other participants in the discussion will
say, so not every testee will be observed in the same situation, which would be ideal for
test reliability. However, according to the basic assumptions of communicative
language testing, this is compensated for by the realism of the situation.
Evaluation
Speaking/Listening
Information gap. An information gap activity is one in which two or more testees work
together, though it is possible for a confederate of the examiner rather than a testee to
Student A
You are planning to buy a tape recorder. You don't want to spend more than about 80
pounds, but you think that a tape recorder that costs less than 50 pounds is probably not
of good quality. You definitely want a tape recorder with auto reverse, and one with a
radio built in would be nice. You have investigated three models of tape recorder and
your friend has investigated three models. Get the information from him/her and share
your information. You should start the conversation and make the final decision, but
you must get his/her opinion, too.
Student B
Your friend is planning to buy a tape recorder, and each of you investigated three types
of tape recorder. You think it is best to get a small, light tape recorder. Share your
information with your friend, and find out about the three tape recorders that your friend
investigated. Let him/her begin the conversation and make the final decision, but don't
hesitate to express your opinion.
This kind of task would be evaluated using a system of band scales. The band scales
would emphasize the testee's ability to give and receive information, express and elicit
opinions, etc. If its intention were communicative, it would probably not emphasize
pronunciation, grammatical correctness, etc., except to the extent that these might
interfere with communication. The examiner should be an observer and not take part in
Role Play. In a role play, the testee is given a situation to play out with another person.
The testee is given in advance information about what his/her role is, what specific
functions he/she needs to carry out, etc. A role play task would be similar to the above
information gap activity, except that it would not involve an information gap. Usually
the examiner or a confederate takes one part of the role play.
Student
You missed class yesterday. Go to the teacher's office and apologize for having missed
the class. Ask for the handout from the class. Find out what the homework was.
Examiner
You are a teacher. A student who missed your class yesterday comes to your office.
Accept her/his apology, but emphasize the importance of attending classes. You do not
have any extra handouts from the class, so suggest that she/he copy one from a friend.
Tell her/him what the homework was.
Again, if the intention of this test were to test communicative language, the testee would
be assessed on his/her ability to carry out the functions (apologizing, requesting, asking
for information, responding to a suggestion, etc.) required by the role.
Your boss has received a letter from a customer complaining about problems with a
coffee maker that he bought six months ago. Your boss has instructed you to check the
company policy on returns and repairs and reply to the letter. Read the letter from the
customer and the statement of the company policy about returns and repairs below and
write a formal business letter to the customer.
The letter would be evaluated using a band scale, based on compliance with formal
letter writing layout, the content of the letter, inclusion of correct and relevant
information, etc.
          Performance-Based Assessment
Performance-based assessment is an alternative form of assessment that moves away
from traditional paper and pencil tests. Performance-based assessment involves having
the students produce a project, whether it is oral, written or a group performance. The
students are engaged in creating a final project that exhibits their understanding of a
concept they have learned.
There are two parts to performance-based assessments. The first part is a clearly
defined task for the students to complete. This is called the product descriptor. The
assessments are either product related, specific to certain content or specific to a given
task. The second part is a list of explicit criteria that are used to assess the students.
Generally this comes in the form of a rubric. Rubrics can be either analytical,
meaning that they assess the final product in parts, or holistic, meaning that they assess the
final product as a whole.
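The contrast between analytic and holistic rubrics can be sketched as follows. The criteria, point values, and level descriptions are hypothetical and stand in for whatever a teacher would actually write into the rubric.

```python
# Hypothetical analytic rubric: each criterion has its own maximum point value.
ANALYTIC_RUBRIC = {"content": 10, "organization": 5, "grammar": 5}

# Hypothetical holistic rubric: one overall level for the whole product.
HOLISTIC_LEVELS = {
    4: "exceeds expectations",
    3: "meets expectations",
    2: "approaching expectations",
    1: "beginning",
}


def score_analytic(points_awarded):
    """Score the product in parts: sum the points given for each criterion."""
    total = 0
    for criterion, maximum in ANALYTIC_RUBRIC.items():
        points = points_awarded[criterion]
        if not 0 <= points <= maximum:
            raise ValueError(f"{criterion}: {points} is outside 0-{maximum}")
        total += points
    return total, sum(ANALYTIC_RUBRIC.values())


def score_holistic(level):
    """Score the product as a whole: return the single overall judgment."""
    return HOLISTIC_LEVELS[level]


print(score_analytic({"content": 8, "organization": 4, "grammar": 3}))  # (15, 20)
print(score_holistic(3))  # meets expectations
```

The analytic score shows the student where points were gained or lost, while the holistic score gives one overall judgment and is quicker to assign.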
The goal for assessment is to accurately determine whether students have learned the
materials or information taught and reveal whether they have complete mastery of the
content with no misunderstandings. Just as researchers use multiple data sources to
determine the truthfulness of the results, teachers can use multiple types of assessment
to evaluate the level of student learning. Because assessments involve the gathering of
data or information, some type of product, performance, or recording sheet must be
generated. The following are some examples of various types of performance-based
assessments used in physical education.
1. Journals
2. Letters
The students will create original language compositions through producing a letter.
They will be asked to write about something relevant to their own life using the target
language. The letter assignment will be accompanied by a rubric for assessment
purposes.
3. Oral Reports
The students will need to do research in groups about a given topic. After they have
completed their research, the students will prepare an oral presentation to present to the
class explaining their research. The main component of this project will be the oral
production of the target language.
4. Original Stories
  The students will write an original fictional story. The students will be asked to
include several specified grammatical structures and vocabulary words. This assignment
will be assessed analytically; each component will have a point value.
6. Skit
  The students will work in groups in order to create a skit about a real-world situation.
They will use the target language. The vocabulary used should be specific to the
situation. The students will be assessed holistically, based on the overall presentation of
the skit.
7. Poetry Recitations
  After studying poetry, the students will select a poem in the target language of their
choice to recite to the class. The students will be assessed based on their pronunciation,
rhythm and speed. The students will also have an opportunity to share with the class
what they think the poem means.
8. Portfolios
  Portfolios allow students to compile their work over a period of time. The students
will have a checklist and rubric along with the assignment description. The students will
assemble their best work, including their drafts so that the teacher can assess the
process.
9. Puppet Show
  The students can work in groups or individually to create a short puppet show. The
puppet show can have several characters that are involved in a conversation of real-
world context. These would most likely be assessed holistically.
Human performance provides many opportunities for students to exhibit behaviors that
may be directly observed by others, a unique advantage of working in the psychomotor
domain. Wiggins (1998) uses physical activity when providing examples to illustrate
complex assessment concepts, as they are easier to visualize than would be the case
with a cognitive example. The nature of performing a motor skill makes assessment
through observational analysis a logical choice for many physical education teachers. In
fact, investigations of measurement practices of physical educators have consistently
shown a reliance on observation and related assessment methods (Hensley and East
1989; Matanin and Tannehill 1994; Mintah 2003).
Teachers and peers can assess others using observation. They might use a checklist or
some type of event recording scheme to tally the number of times a behavior occurred.
Keeping game play statistics is an example of recording data using event recording
techniques. Students can self-analyze their own performance and record their
performances using criteria provided on a checklist or a game play rubric. Table 14.1 is
an example of a recording form that could be used for peer assessment. When using
peer assessment, it is best to have the assessor do only the assessment. When the person
recording assessment results is also expected to take part in the assessment (e.g., tossing
the ball to the person being assessed), he or she cannot both toss and do an accurate
observation. In the case of large classes, teachers might even use groups of four, in
The following example of a project designed for middle school or high school students
involves a research component, analysis and synthesis of information, problem solving,
and effective communication.
It’s a good idea to limit the portfolio to a certain number of pieces of work to prevent
the portfolio from becoming a scrapbook that has little meaning to the student and to
avoid giving teachers a monumental evaluation task. This also requires students to
exercise some judgment about which artifacts best fulfill the requirements of the
portfolio task and document their level of achievement. The portfolio itself is usually a
file or folder that contains the student’s collected work. The contents could include
items such as a training log, student journal or diary, written reports, photographs or
sketches, letters, charts or graphs, maps, copies of certificates, computer disks or
computer-generated products, completed rating scales, fitness test results, game
A rubric (scoring tool) should be used to evaluate portfolios in much the same manner
as any other product or performance. Providing a rubric to students in advance allows
them to self-assess their work and thus be more likely to produce a portfolio of high
quality. Portfolios, since they are designed to show growth and improvement in student
learning, are evaluated holistically. The reflections that describe the artifact and why the
artifact was selected for inclusion in the portfolio provide insights into levels of student
learning and achievement. Teachers should remember that format is less important than
content and that the rubric should be weighted to reflect this. Table 14.2 illustrates a
qualitative analytic rubric for judging a portfolio along three dimensions.
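One way to make a rubric reflect the advice that content matters more than format is to weight the dimensions. The sketch below is only an illustration: the three dimension names, the weights, and the 1 to 4 rating scale are assumptions and are not taken from Table 14.2.

```python
# Hypothetical weights: content counts most, format least.
WEIGHTS = {"content": 5, "reflection": 3, "format": 2}


def weighted_portfolio_score(ratings, weights=WEIGHTS, max_rating=4):
    """Combine per-dimension ratings (1..max_rating) into a 0-100 score."""
    earned = sum(weights[d] * ratings[d] for d in weights)
    possible = max_rating * sum(weights.values())
    return 100.0 * earned / possible


# Two hypothetical portfolios with the same ratings placed on different dimensions.
strong_content = {"content": 4, "reflection": 3, "format": 2}
strong_format = {"content": 2, "reflection": 3, "format": 4}

print(weighted_portfolio_score(strong_content))  # 82.5
print(weighted_portfolio_score(strong_format))   # 67.5
```

With these weights, a portfolio that is strong on content but plain in format scores higher than one that is attractively formatted but thin on content.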
For additional information about portfolio assessments, Lund and Kirk (2010) have a
chapter on developing portfolio assessments. An article published as part of a JOPERD
feature presents a suggested scoring scale for a portfolio (Kirk 1997). Melograno’s
Assessment Series publication (2000) on portfolios also contains helpful information.
Performances
Organizing and performing a jump rope show at the half-time of a basketball game
Although performances do not produce a written product, there are several ways to
gather data to use for assessment purposes. A score sheet can be used to record student
performance using the criteria from a game play rubric. Game play statistics are another
example of a way to document performance. Performances can also be video recorded
to provide evidence of learning. In some cases teachers might want to shorten the time
used to gather evidence of learning from a performance. Event tasks are performances
that are completed in a single class period. Students might demonstrate their knowledge
of net or wall game strategies by playing a scripted game that is video recorded during a
single class. The ability to create movement sequences or a dance that uses different
levels, effort, or relationships could be demonstrated during a single class period with
an event task. Many adventure education activities that demonstrate affective domain
attributes can be assessed using event tasks.
Student Logs
For the statistical consultant working with social science researchers the estimation of
reliability and validity is a task frequently encountered. Measurement issues differ in
the social sciences in that they are related to the quantification of abstract, intangible
and unobservable constructs. In many instances, then, the meaning of quantities is only
inferred.
Let us begin with a general description of the paradigm that we are dealing with. Most
concepts in the behavioral sciences have meaning within the context of the theory that
they are a part of. Each concept, thus, has an operational definition which is governed
by the overarching theory. If a concept is involved in the testing of a hypothesis to
support the theory, it has to be measured. So the first decision that the researcher is faced
with is “how shall the concept be measured?”, that is, the type of measure. At a very
broad level the type of measure can be observational, self-report, interview, etc. These
types ultimately take the shape of a more specific form, like observation of ongoing activity,
observing video-taped events, self-report measures like questionnaires that can be
open-ended or close-ended, Likert-type scales, and interviews that are structured,
semi-structured or unstructured and open-ended or close-ended. Needless to say, each type
of measure has specific issues that need to be addressed to make the measurement
meaningful, accurate, and efficient.
Another important feature is the population for which the measure is intended. This
decision does not depend entirely on the theoretical paradigm but more on the
immediate research question at hand.
It is important to bear in mind that validity and reliability are not all-or-none issues but
a matter of degree.
Measurement Error
       All measurements may contain some element of error; validity and reliability
concern the amount and type of error that typically occurs, and they also show how we
can estimate the amount of error in a measurement.
There are three chief sources of error:
   1. in the thing being measured (my weight may fluctuate so it's difficult to get an
       accurate picture of it);
   2. in the observer (on Mondays I may knock a pound off my weight if I binged on my
       mother's cooking at the weekend. Obviously the binging doesn't reflect my true
       weight!);
   3. or in the recording device (our clinic weigh scale has been acting up; we really
       should get it recalibrated).
And there are two types of error:
Random errors are not attributable to a specific cause. If sufficiently large numbers of
observations are made, random errors average to zero, because some readings overestimate
and some underestimate.
Systematic errors tend to fall in a particular direction
and are likely due to a specific cause. Because systematic errors fall in one direction
(e.g., I always exaggerate my athletic abilities) they bias a measurement.
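The difference between the two types of error can be illustrated with a small simulation. The following is only a sketch under stated assumptions: the true weight, the size of the noise, and the size of the bias are all made-up values, and the function name simulate_readings is hypothetical.

```python
import random
import statistics

def simulate_readings(true_value, n, noise_sd, bias=0.0, seed=0):
    """Return n simulated readings: true value + constant bias + random noise."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    return [true_value + bias + rng.gauss(0.0, noise_sd) for _ in range(n)]

true_weight = 70.0  # hypothetical true weight in kg (assumption)

# Random error only: single readings scatter, but their mean approaches the true value.
random_only = simulate_readings(true_weight, n=10_000, noise_sd=1.0)

# Random error plus a systematic error: a scale that always reads 0.5 kg low.
biased = simulate_readings(true_weight, n=10_000, noise_sd=1.0, bias=-0.5)

mean_random_only = statistics.mean(random_only)
mean_biased = statistics.mean(biased)

print(f"mean with random error only: {mean_random_only:.2f}")
print(f"mean with systematic error:  {mean_biased:.2f}")
```

Taking more readings shrinks the effect of the random noise on the mean, but it does nothing to remove the constant shift; only recalibrating the instrument (or correcting for the known bias) would do that.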
Reliability
The reliability of an assessment tool is the extent to which it consistently and accurately
measures learning. When the results of an assessment are reliable, we can be confident
that repeated or equivalent assessments will provide consistent results. This puts us in a
better position to make generalised statements about a student’s level of achievement,
which is especially important when we are using the results of an assessment to make
decisions about teaching and learning, or when we are reporting back to students and
their parents or caregivers. No results, however, can be completely reliable. There is
always some random variation that may affect the assessment, so educators should
always be prepared to question results.
  The length of the assessment – a longer assessment generally produces more reliable
results.
  The suitability of the questions or tasks for the students being assessed.
  The phrasing and terminology of the questions.
  The consistency in test administration – for example, the length of time given for the
assessment, instructions given to students before the test.
  The design of the marking schedule and moderation of marking procedures.
  The readiness of students for the assessment – for example, a hot afternoon or straight
after physical activity might not be the best time for students to be assessed.
Check the user manual for evidence of the reliability coefficient. Reliability coefficients
range between zero and 1. A coefficient of 0.9 or more indicates a high degree of reliability.
Educational assessment should always have a clear purpose. Nothing will be gained
from assessment unless the assessment has some validity for the purpose. For that
reason, validity is the most important single attribute of a good test.
The validity of an assessment tool is the extent to which it measures what it was
designed to measure, without contamination from other characteristics. For example, a
test of reading comprehension should not require mathematical ability.
It is fairly obvious that a valid assessment should have a good coverage of the criteria
(concepts, skills and knowledge) relevant to the purpose of the examination. The
important notion here is the purpose. For example:
  The PROBE test is a form of reading running record which measures reading
behaviours and includes some comprehension questions. It allows teachers to see the
reading strategies that students are using, and potential problems with decoding. The
test would not, however, provide in-depth information about a student’s comprehension
strategies across a range of texts.
  STAR (Supplementary Test of Achievement in Reading) is not designed as a
comprehensive test of reading ability. It focuses on assessing students’ vocabulary
understanding, basic sentence comprehension and paragraph comprehension. It is most
appropriately used for students who don’t score well on more general testing (such as
PAT or e-asTTle) as it provides a more fine-grained analysis of basic comprehension
strategies.
There is an important relationship between reliability and validity. An assessment that
has very low reliability will also have low validity; clearly a measurement with very
poor accuracy or consistency is unlikely to be fit for its purpose. But, by the same token,
the things required to achieve a very high degree of reliability can impact negatively on
validity. For example, consistency in assessment conditions leads to greater reliability
because it reduces 'noise' (variability) in the results. On the other hand, one of the things
that can improve validity is flexibility in assessment tasks and conditions. Such
flexibility allows assessment to be set appropriate to the learning context and to be
made relevant to particular groups of students. Insisting on highly consistent assessment
conditions to attain high reliability will result in little flexibility, and might therefore
limit validity.
Validity:
        Very simply, validity is the extent to which a test measures what it is supposed
to measure. The question of validity is raised in the context of the three points made
above: the form of the test, the purpose of the test and the population for whom it is
intended. Therefore, we cannot ask the general question “Is this a valid test?”. The
question to ask is “How valid is this test for the decision that I need to make?” or “How
valid is the interpretation I propose for the test?” We can divide the types of validity
into logical and empirical.
VALIDITY refers to what conclusions we can draw from the results of a measurement.
Introductory-level definitions are "Does the test measure what we are intending to
measure?", or "How closely do the results of a measurement correspond to the true state
of the phenomenon being measured?"
Random error
   Thing being measured: test-retest reliability
   Observer: correlation between observers
   Recording device (e.g., screening test): calibration trial (variation with standard object)
Systematic error
   Thing being measured: record diurnal (etc.) variation (e.g., BP higher on Mondays)
   Observer: agreement between observers (e.g., nurses or patients)
   Recording device (e.g., screening test): construct & criterion validity;
   sensitivity & specificity
Validity of a screening test. This can be used to illustrate the way validity is assessed.
Here, it is commonly reported in terms of sensitivity and specificity.
        Sensitivity refers to what fraction of all the actual cases of disease a test detects.
If the test is not very good, it may miss cases it should detect. Its sensitivity is low and it
generates "false negatives" (i.e., people score negatively on the test when they should
have scored positive). This can be extremely serious if early treatment would have
saved the person's life.
Mnemonics to help you: The word 'sensitivity' is intuitive: a sensitive test is one that
can identify the disease.
SeNsitivity is inversely associated with the false Negative rate of a test (high sensitivity
= few false negatives).
       Specificity refers to whether the test identifies only those with the disease, or
whether it mistakenly classifies some healthy people as being sick. Errors of this type are
called "false positives." These can lead to worry and expensive further investigations.
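Both quantities can be computed directly from the counts of correct and incorrect test results. The sketch below uses made-up screening counts purely for illustration; the function names are hypothetical.

```python
def sensitivity(true_pos, false_neg):
    """Fraction of all actual cases of disease that the test detects."""
    return true_pos / (true_pos + false_neg)

def specificity(true_neg, false_pos):
    """Fraction of healthy people that the test correctly classifies as healthy."""
    return true_neg / (true_neg + false_pos)

# Hypothetical screening results for 1,000 people (made-up numbers).
true_pos, false_neg = 90, 10    # the 100 people who have the disease
true_neg, false_pos = 855, 45   # the 900 people who do not

sens = sensitivity(true_pos, false_neg)
spec = specificity(true_neg, false_pos)

print(f"sensitivity = {sens:.2f}")  # 0.90: 10 of 100 cases are missed (false negatives)
print(f"specificity = {spec:.2f}")  # 0.95: 45 of 900 healthy people are flagged (false positives)
```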
Types of Validity
   1. Content Validity:
       When we want to find out if the entire content of the behavior/construct/area is
represented in the test we compare the test task with the content of the behavior. This is
a logical method, not an empirical one. For example, if we want to test knowledge of
American geography, it is not fair to have most questions limited to the geography of
New England.
   2. Face Validity:
       Basically face validity refers to the degree to which a test appears to measure
what it purports to measure. Face Validity ascertains that the measure appears to be
assessing the intended construct under study. The stakeholders can easily assess face
validity. Although this is not a very “scientific” type of validity, it may be an essential
component in enlisting motivation of stakeholders. If the stakeholders do not believe the
measure is an accurate assessment of the ability, they may become disengaged with the
task. Example: If a measure of art appreciation is created all of the items should be
related to the different components and types of art. If the questions are regarding
historical time periods, with no reference to any artistic movement, stakeholders may
not be motivated to give their best effort or invest in this measure because they do not
believe it is a true assessment of art appreciation.
    4. Concurrent Validity:
        Concurrent validity is the degree to which the scores on a test are related to the
scores on another, already established, test administered at the same time, or to some
other valid criterion available at the same time. For example, when a new simple test is to
be used in place of an old cumbersome one that is considered useful, measurements are
obtained on both at the same time. Logically, predictive and concurrent validation are
the same; the term concurrent validation is used to indicate that no time elapsed between
measures.
    5. Construct Validity:
Construct validity is used to ensure that the measure actually measures what it is
intended to measure (i.e. the construct), and not other variables. Using a panel of
“experts” familiar with the construct is a way in which this type of validity can be
assessed. The experts can examine the items and decide what each specific item is
intended to measure. Students can be involved in this process to obtain their feedback.
Example: A women’s studies program may design a cumulative assessment of learning
throughout the major. If the questions are written with complicated wording and
phrasing, the test can inadvertently become a test of reading
comprehension, rather than a test of women’s studies. It is important that the measure is
actually assessing the intended construct, rather than an extraneous factor.
       Construct validity is the degree to which a test measures an intended
hypothetical construct. Many times psychologists assess/measure abstract attributes or
constructs. The process of validating the interpretations about that construct as
indicated by the test score is construct validation. This can be done experimentally.
For example, suppose we want to validate a measure of anxiety. If we hypothesize that anxiety
increases when subjects are under the threat of an electric shock, then the threat of an
electric shock should increase anxiety scores (note: not all construct validation is this
dramatic!).
       A correlation coefficient is a statistical summary of the relation between two
variables. It is the most common way of reporting the answer to such questions as the
following: Does this test predict performance on the job? Do these two tests measure
the same thing? Do the ranks of these people today agree with their ranks a year ago?
       Two common forms are the rank correlation and the product-moment correlation.
       According to Cronbach, to the question “what is a good validity coefficient?”
the only sensible answer is “the best you can get”, and it is unusual for a validity
coefficient to rise above 0.60, though that is far from perfect prediction.
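As an illustration of the two kinds of correlation mentioned above, the sketch below computes a product-moment (Pearson) coefficient and a rank (Spearman) coefficient by hand. The test scores and job ratings are invented for the example, and the function names are hypothetical; the rank correlation is computed here as the product-moment correlation of the ranks, with tied values given the average of their ranks.

```python
import math

def pearson(x, y):
    """Product-moment correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)  # undefined if either list has no variation

def ranks(values):
    """Ranks starting at 1 for the smallest value; ties share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over a run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def spearman(x, y):
    """Rank correlation: the product-moment correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical data: selection-test scores and later job ratings for six people.
test_scores = [52, 61, 70, 75, 88, 93]
job_ratings = [3.1, 2.9, 3.8, 4.0, 4.2, 4.9]

r = pearson(test_scores, job_ratings)
rho = spearman(test_scores, job_ratings)
print(f"product-moment r = {r:.3f}")
print(f"rank rho         = {rho:.3f}")
```

The same pearson function could be applied to Time 1 and Time 2 scores, or to two tests taken at the same time, since each of those questions is answered with a correlation coefficient.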
       All in all we need to always keep in mind the contextual questions: what is the
test going to be used for? how expensive is it in terms of time, energy and money? what
implications are we intending to draw from test scores?
Formative validity, when applied to outcomes assessment, is used to assess how well a
measure is able to provide information to help improve the program under study.
Example: When designing a rubric for history one could assess students’ knowledge
across the discipline. If the measure can provide information that students are lacking
knowledge in a certain area, for instance the Civil Rights Movement, then that
assessment tool is providing meaningful information that can be used to improve the
course or program requirements.
       Sampling Validity (similar to content validity) ensures that the measure covers
the broad range of areas within the concept under study.           Not everything can be
covered, so items need to be sampled from all of the domains. This may need to be
completed using a panel of “experts” to ensure that the content area is adequately
sampled. Additionally, a panel can help limit “expert” bias (i.e. a test reflecting what an
individual personally feels are the most important or relevant areas).
Example: When designing an assessment of learning in the theatre department, it would
not be sufficient to only cover issues related to acting. Other areas of theatre such as
lighting, sound, functions of stage managers should all be included. The assessment
should reflect the content area in its entirety.
Reliability:
         Research requires dependable measurement. According to Nunnally, measurements are
reliable to the extent that they are repeatable, and any random influence which tends
to make measurements different from occasion to occasion or circumstance to
circumstance is a source of measurement error. According to Gay, reliability is the degree
to which a test consistently measures whatever it measures. Errors of measurement that
affect reliability are random errors, and errors of measurement that affect validity are
systematic or constant errors.
         Test-retest, equivalent forms and split-half reliability are all determined through
correlation.
RELIABILITY refers to consistency or dependability. Your patient Jim is
unpredictable; sometimes he comes to his appointment on time, sometimes he's late and
once or twice he was early.
Types of Reliability
   1. Test-retest Reliability:
       Test-retest reliability is the degree to which scores are consistent over time. It
indicates score variation that occurs from testing session to testing session as a result of
errors of measurement. Problems: Memory, Maturation, Learning.
  Test-retest reliability is a measure of reliability obtained by administering the same
test twice over a period of time to a group of individuals. The scores from Time 1 and
Time 2 can then be correlated in order to evaluate the test for stability over time.
Example: A test designed to assess student learning in psychology could be given to a
group of students twice, with the second administration perhaps coming a week after the
first. The obtained correlation coefficient would indicate the stability of the scores.
    3. Inter-rater reliability
           Inter-rater reliability is a measure of reliability used to assess the degree to which
different judges or raters agree in their assessment decisions. Inter-rater reliability is
useful because human observers will not necessarily interpret answers the same way;
raters may disagree as to how well certain responses or material demonstrate knowledge
of the construct or skill being assessed.
Example: Inter-rater reliability might be employed when different judges are evaluating
the degree to which art portfolios meet certain standards.            Inter-rater reliability is
especially useful when judgments can be considered relatively subjective. Thus, the use
of this type of reliability would probably be more likely when evaluating artwork as
opposed to math problems.
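One simple way to put a number on the agreement between two raters is sketched below. The text above does not name a particular statistic, so the choice is an assumption: the sketch reports plain percent agreement and also Cohen's kappa, a widely used index that adjusts for the agreement expected by chance. The portfolio ratings are invented, and the function names are hypothetical.

```python
from collections import Counter

def percent_agreement(ratings_a, ratings_b):
    """Proportion of items on which the two raters gave the same rating."""
    matches = sum(a == b for a, b in zip(ratings_a, ratings_b))
    return matches / len(ratings_a)

def cohens_kappa(ratings_a, ratings_b):
    """Agreement corrected for chance: (observed - expected) / (1 - expected)."""
    n = len(ratings_a)
    observed = percent_agreement(ratings_a, ratings_b)
    counts_a, counts_b = Counter(ratings_a), Counter(ratings_b)
    categories = set(ratings_a) | set(ratings_b)
    # Chance agreement: probability that both raters pick the same category independently.
    expected = sum(counts_a[c] * counts_b[c] for c in categories) / (n * n)
    return (observed - expected) / (1 - expected)  # undefined if expected == 1

# Hypothetical ratings of ten art portfolios by two judges (made-up data).
rater_a = ["meets", "meets", "below", "exceeds", "meets",
           "below", "meets", "exceeds", "below", "meets"]
rater_b = ["meets", "below", "below", "exceeds", "meets",
           "below", "meets", "meets", "below", "meets"]

agreement = percent_agreement(rater_a, rater_b)
kappa = cohens_kappa(rater_a, rater_b)
print(f"percent agreement = {agreement:.2f}")
print(f"Cohen's kappa     = {kappa:.2f}")
```

Kappa is lower than raw agreement because some matches would occur even if the judges rated at random.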
Split-Half Reliability:
        Requires only one administration. It is especially appropriate when the test is very
long. The most commonly used method to split the test into two is the odd-even
strategy. Since longer tests tend to be more reliable, and since split-half reliability
represents the reliability of a test only half as long as the actual test, a correction
formula, the Spearman-Brown prophecy formula, must be applied to the coefficient.
        Split-half reliability is a form of internal consistency reliability.
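The odd-even procedure and the correction can be sketched as follows. The item responses are invented (1 = correct, 0 = incorrect), the function names are hypothetical, and the correction used is the usual Spearman-Brown form for doubling test length, r_full = 2 r_half / (1 + r_half).

```python
import math

def pearson(x, y):
    """Product-moment correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)  # undefined if either half has no variation

def split_half_reliability(item_scores):
    """item_scores holds one list of item scores per test taker.

    Returns (correlation between halves, Spearman-Brown corrected estimate).
    """
    odd_totals = [sum(row[0::2]) for row in item_scores]   # items 1, 3, 5, ...
    even_totals = [sum(row[1::2]) for row in item_scores]  # items 2, 4, 6, ...
    r_half = pearson(odd_totals, even_totals)
    r_full = 2 * r_half / (1 + r_half)  # Spearman-Brown correction to full length
    return r_half, r_full

# Hypothetical responses of six test takers to an eight-item test (made-up data).
responses = [
    [1, 1, 1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 0, 1, 0],
    [1, 1, 1, 0, 1, 1, 0, 0],
    [1, 0, 1, 1, 0, 0, 0, 1],
    [1, 0, 0, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 0, 0, 0],
]

r_half, r_full = split_half_reliability(responses)
print(f"correlation between halves: {r_half:.3f}")
print(f"Spearman-Brown corrected:   {r_full:.3f}")
```

With a positive correlation between the halves, the corrected value is always higher than the raw one, which reflects the point that the full-length test is more reliable than either half.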
Internal Consistency Reliability:
        Determining how all items on the test relate to all other items. The
Kuder-Richardson formula gives an estimate of reliability that is essentially equivalent to
the average of the split-half reliabilities computed for all possible halves.
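A minimal sketch of one Kuder-Richardson estimate is given below. The text names only "Kuder-Richardson", so the use of formula 20 (KR-20) here is an assumption; it applies to items scored 0 or 1 and is computed as (k / (k - 1)) * (1 - sum of p*q / variance of total scores), where k is the number of items, p is the proportion answering an item correctly and q = 1 - p. The responses are invented and the function name is hypothetical.

```python
import statistics

def kr20(item_scores):
    """KR-20 for items scored 0/1; item_scores holds one list per test taker."""
    n_people = len(item_scores)
    n_items = len(item_scores[0])
    # Sum of p*q over items, where p is the proportion of correct answers on the item.
    sum_pq = 0.0
    for i in range(n_items):
        p = sum(row[i] for row in item_scores) / n_people
        sum_pq += p * (1 - p)
    totals = [sum(row) for row in item_scores]
    total_variance = statistics.pvariance(totals)  # population variance of total scores
    return (n_items / (n_items - 1)) * (1 - sum_pq / total_variance)

# Hypothetical responses of six test takers to an eight-item test (made-up data).
responses = [
    [1, 1, 1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 0, 1, 0],
    [1, 1, 1, 0, 1, 1, 0, 0],
    [1, 0, 1, 1, 0, 0, 0, 1],
    [1, 0, 0, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 0, 0, 0],
]

reliability = kr20(responses)
print(f"KR-20 = {reliability:.3f}")
```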
Validity refers to measuring what we intend to measure: how well a test measures what it is
supposed to measure. For example, if math and vocabulary truly represent intelligence,
then a math and vocabulary test might be said to have high validity when used as a
measure of intelligence.
Estimating the Validity of a Measure:
Content Validity:
      1. Does the test contain items from the desired “content domain”?
      2. Based on assessment by experts in that content domain.
      3. Is especially important when a test is designed to have low face validity.
      4. Is generally simpler for “other tests” than for “psychological constructs”
For example, it is easier for math experts to agree on an item for an algebra test than it is
for psych experts to agree on whether or not an item should be placed in an EI or a
personality measure.
For Example:
In developing a nursing licensure exam, experts in the field of nursing would identify
the information and issues required to be an effective nurse and then choose (or rate)
items that represent those areas of information and skills.
Face Validity
   1. Face validity refers to the extent to which a measure ‘appears’ to measure what
       it is supposed to measure
   2. Not statistical—involves the judgment of the researcher (and the participants)
   3. A measure has face validity ‘if people think it does’
   4. Just because a measure has face validity does not ensure that it is a valid
       measure (and measures lacking face validity can be valid)
   1. Inadequate sample
   2. Items that do not function as intended
   3. Improper arrangement/unclear directions
   4. Too few items for interpretation
   5. Improper test administration
   6. Scoring that is subjective
    1. the longer the test, the more reliable it is likely to be [though there is a point of
        no extra return]
    2. items which discriminate will add to reliability, therefore, if the items are too
        easy / too difficult, reliability is likely to be lower
    3. if there is a wide range of abilities amongst the test takers, test is likely to have
        higher reliability
    4. the more homogeneous the items are, the higher the reliability is likely to be
Practicality:
    1. items can be replicated in terms of resources needed e.g. time, materials, people
    2. can be administered
    3. can be graded
    4. results can be interpreted
    1. quality of items
    2. number of items
    3. difficulty level of items
    4. level of item discrimination
    5. type of test methods
    6. number of test methods
    7. time allowed
    8. clarity of instructions
3.Practicality
CONSTRUCTING TESTS
Writing items requires decisions about the nature of the item or question to which we
ask students to respond: whether it is discrete or integrative; how we will score the
item (for example, objectively or subjectively); the skill we purport to test; and so on. We
also consider the characteristics of the test takers and the test taking strategies
respondents will need to use. What follows is a short description of these considerations
for constructing items.
Test Items
A test item is a specific task test takers are asked to perform. Test items can assess one
or more points or objectives, and the actual item itself may take on a different
constellation depending on the context. For example, an item may test one point
(understanding of a given vocabulary word) or several points (the ability to obtain facts
from a passage and then make inferences based on the facts). Likewise, a given
objective may be tested by a series of items. For example, there could be five items all
testing one grammatical point (e.g., tag questions). Items of a similar kind may also be
grouped together to form subtests within a given test.
Classifying Items
Discrete – A completely discrete-point item would test simply one point or objective
such as testing for the meaning of a word in isolation. For example:
Choose the correct meaning of the word paralysis.
Integrative – An integrative item would test more than one point or objective at a time.
(e.g., comprehension of words, and ability to use them correctly in context). For
example:
Sometimes an integrative item is really more a procedure than an item, as in the case of
a free composition, which could test a number of objectives; for example, use of
appropriate vocabulary, use of sentence level discourse, organization, statement of
thesis and supporting evidence. For example:
Write a one-page essay describing three sports and the relative likelihood of being
injured while playing them competitively.
Objective – A multiple-choice item, for example, is objective in that there is only one
right answer.
Subjective – A free composition may be more subjective in nature if the scorer is not
looking for any one right answer, but rather for a series of factors (creativity, style,
cohesion and coherence, grammar, and mechanics).
Items may require test takers to employ different levels of intellectual operation in order
to produce a response (Valette, 1969, after Bloom et al., 1956). The following levels of
intellectual operation have been identified:
analysis (breaking down a message into its constituent parts in order to make explicit
the relationships between ideas, including tasks like recognizing the connotative
meanings of words and correctly processing a dictation, and making inferences);
synthesis (arranging parts so as to produce a pattern not clearly there before, such as in
effectively organizing ideas in a written composition); and
It has been popularly held that these levels demand increasingly greater cognitive
control as one moves from knowledge to evaluation – that, for example, effective
operation at more advanced levels, such as synthesis and evaluation, would call for
more advanced control of the second language. Yet this has not necessarily been borne
Items can also assess different types of response behavior. Respondents may be tested
for accuracy in pronunciation or grammar. Likewise, they could be assessed for fluency,
for example, without concern for grammatical correctness. Aside from accuracy and
fluency, respondents could also be assessed for speed – namely, how quickly they can
produce a response, to determine how effectively the respondent replies under time
pressure. In recent years, there has also been an increased concern for developing
measures of performance – that is, measures of the ability to perform real-world tasks,
with criteria for successful performance based on a needs analysis for the given task
(Brown, 1998; Norris, Brown, Hudson, & Yoshioka, 1998).
Performance tasks might include “comparing credit card offers and arguing for the best
choice” or “maximizing the benefits from a given dating service.” At the same time that
there is a call for tasks that are more reflective of the real world, there is a
commensurate concern for more authentic language assessment. At least one study,
however, notes that the differences between authentic and pedagogic written and spoken
texts may not be readily apparent, even to an audience specifically listening for
differences (Lewkowicz, 1997). In addition, test takers may not necessarily concern
themselves with task authenticity in a test situation. Test familiarity may be the
overriding factor affecting performance.
Characteristics of Respondents
With regard to language ability, both Bachman and Palmer (1996) and Alderson (2000)
detail the many types of knowledge that respondents may need to draw on to perform
well on a given item or task: world knowledge and culturally-specific knowledge,
knowledge of how the specific grammar works, knowledge of different oral and written
text types, knowledge of the subject matter or topic, and knowledge of how to perform
well on the given task.
Item-Elicitation Format
The format for item elicitation has to be determined for any given item. An item can
have a spoken, written, or visual stimulus, as well as any combination of the three.
Thus, while an item or task may ostensibly assess one modality, it may also be testing
some other as well. So, for example, a subtest referred to as “listening” which has
respondents answer oral questions by means of written multiple-choice responses is
testing reading as well as listening. It would be possible to avoid introducing this
reading element by having the multiple-choice alternatives presented orally as well. But
then the tester would be introducing yet another factor, namely, short-term memory
ability, since the respondents would have to remember all the alternatives long enough
to make an informed choice.
Item-Response Format
The item-response format can be fixed, structured, or open-ended. Item responses with a
fixed format include true/false, multiple-choice, and matching items. Item responses
that call for a structured format include ordering (where respondents are requested to
Grammatical competence
Major grammatical errors might be considered those that either interfere with
intelligibility or stigmatize the speaker. Minor errors would be those that neither get in
the way of the listener's comprehension nor annoy the listener to any
extent. Thus, getting the tense wrong in the above example, "We have had a great time at
your house last night" could be viewed as a minor error, whereas in another case,
producing "I don't have what to say" ("I really have no excuse" by translating directly
from the appropriate Hebrew language) could be considered a major error since it is not
only ungrammatical but also could stigmatize the speaker as rude and unconcerned,
rather than apologetic.
Student Placement,
Diagnosis of Difficulties,
Checking Student Progress,
Reports to Student and Superiors,
Evaluation of Instruction.
Unfortunately the most common perception is that tests are designed to statistically rank
all students according to a sampling of their knowledge of a subject and to report that
ranking to superiors or anyone else interested in using that information to adversely
influence the student's feeling of self-worth. It is even more unfortunate that the
perception matches reality in the majority of testing situations. Consequently, tests are
highly stressful, anxiety-producing events for most persons.
All too often tests are constructed to determine how much a student knows rather than
determining what he/she must learn. Frequently tests are designed to "trap" the student
and in still other situations tests are designed to ensure a "bell curve" distribution of
results. Most of the other numerous testing designs and strategies fail to help the student
in his learning process and in many cases are quite detrimental to that process.
In a Mastery Based system of instruction the two main reasons for testing are to
determine mastery and to diagnose difficulties. When tests are constructed for these
purposes, the other four purposes will also be satisfied. For example, consider a test
which requires the student to demonstrate mastery and at the same time rigorously
diagnoses learning difficulties. If no difficulties are indicated, it may be safely assumed
that the learner has mastered the concept. That information may then be used to record
student progress and to make reports to the student and superiors. Examining student
performance collectively for a group of students provides information about the quality
of instruction. Examining a single student's performance collectively for a group of
It is therefore important that the instructional developer construct each question so that a
correct response indicates mastery of the learning objective and any incorrect response
provides information about the nature of the student's lack of mastery. Furthermore,
each student should have ample opportunity to "inform" the instructor of any form of
lack of mastery. Unfortunately the mere presence of a test question influences the
student's response to the question. The developer should minimize that influence by
constructing questions which permit the student to make any error he would make in the
absence of such influence. For example, a multiple choice question should have all the
wrong answers the student might want to select and should also have as many correct
answers as the student might want to provide.
True/False Questions:
True/false questions should be written without ambiguity. That is, the statement of the
question should be clear and the decision whether the statement is true or false should
not depend on an obscure interpretation of the statement. A true/false question may
easily be used, and most commonly is used, to determine if the student recalls facts.
However, a true/false question may also be used to determine if the learner has mastered
the learning objective well enough to correctly analyze a statement.
It is important to be aware that only two choices are available to the student and
therefore the nature of the question gives the student a 50% chance of being correct. A
single True/False question therefore is helpful only if the student answers the question
incorrectly and the incorrect response indicates a specific misunderstanding of the
learning objective. A collection of true/false questions, about a single learning objective,
all answered correctly by a student is a much stronger indication of mastery. It is
therefore important that the instructional developer construct a "test bank" containing a
large number of true/false questions. It is also important to include numerous true/false
questions on any test which utilizes true/false questions. Ideally a true/false question
should be constructed so that an incorrect response indicates something about the
Multiple choice questions should be written without ambiguity. That is, the statement of
the question stem should be clear and should leave no doubt about how to select
choices. Additionally the choices should be written without ambiguity and should
contain all information required to make a decision whether or not to choose it. The
decision whether to select or not select a choice should not depend on an obscure
interpretation of either the stem or the choice. A multiple choice question may easily be
used to determine if the student recalls facts. However, a multiple choice question may
also be used to determine if the student has mastered the learning objective well enough
to correctly analyze a statement.
The instructional developer should not construct multiple choice questions with a
uniform number of choices, a uniform number of valid choices, or any other
recognizable pattern for construction of choices. Instead the instructional developer
should include as many valid and invalid choices as is required to determine the
student's deficiencies with respect to the learning objective. Moreover, each choice
should appear to be a valid choice to some student.
Multiple choice questions should therefore contain any number of choices with one or
more valid choices. The student is of course required to select all valid choices and
failure to select any one of the valid choices will provide information about the student's
misunderstanding of the learning objective in the same way that selection of an invalid
choice reveals the nature of his/her misunderstanding. The nature of the choices
provided in a multiple choice question may be of two types: those which require merely
recall of facts and those which require additional activity such as synthesis, analysis,
computation, comparison, or diagramming. The instructional developer who is seriously
concerned with the student's success will use both types extensively.
The temptation, when constructing fill-in-the-blank questions, is to construct traps for the
student. The instructional developer should avoid this problem. Ensure that there is only
one acceptable word for the student to provide and that the word (or words) is
significant. Avoid asking the student to supply "minor" words. Avoid fill-in-the-blank
questions with so many blanks that the student is unable to determine what is to be
completed.
Sometimes/Always/Never Questions:
SAN questions (especially the sometimes statements) are the most difficult to construct
but can be the most significant part of a test. SAN questions should be constructed to
force the student to engage in some critical thinking about the learning objective. When
used properly, SAN questions force the student to consider important details about the
learning objective. Careful use of this type of question and careful analysis of the student's
responses will provide detailed information about some of the student's deficiencies.
SAN questions are especially appropriate, and easy to construct, for learning objectives
addressing concepts which are "black" or "white" except in a few cases. The true
statements in a collection of true/false questions are of course always true statements,
while the set of false statements may be further subdivided into those which are true
sometimes and those which are never true.
Test Construction
Since the practical arguments for giving objective exams are compelling, we offer a few
suggestions for writing multiple-choice items. The first is to find and adapt existing test
items. Teachers’ manuals containing collections of items accompany many textbooks.
(AIs: Your course supervisor or former teachers of the same course may be willing to
share items with you.) However, the general rule is adapt rather than adopt. Existing
items will rarely fit your specific needs; you should tailor them to more adequately
reflect your objectives.
Second, design multiple choice items so that students who know the subject or material
adequately are more likely to choose the correct alternative and students with less
adequate knowledge are more likely to choose a wrong alternative. That sounds simple
enough, but you want to avoid writing items that lead students to choose the right
answer for the wrong reasons. For instance, avoid making the correct alternative the
longest or most qualified one, or the only one that is grammatically appropriate to the
stem. Even a careless shift in tense or verb-subject agreement can often suggest the
correct answer.
Finally, it is very easy to disregard the above advice and slip into writing items which
require only rote recall but are nonetheless difficult because they are taken from obscure
passages (footnotes, for instance). Some items requiring only recall might be
appropriate, but try to design most of the items to tap the students’ understanding of the
subject (Adapted with permission from Farris, 1985). One way to write multiple choice
questions that require more than recall is to develop questions that resemble miniature
“cases” or situations. Provide a small collection of data, such as a description of a
situation, a series of graphs, quotes, a paragraph, or any cluster of the kinds of raw
Here are a few additional guidelines to keep in mind when writing multiple-choice tests
(Adapted with permission from Yonge, 1977):
The item-stem (the lead-in to the choices) should clearly formulate a problem.
Randomize occurrence of the correct response (e.g., you don’t always want “C” to be
  the right answer).
Make sure there is only one clearly correct answer (unless you are instructing
  students to select more than one).
Make the wording in the response choices consistent with the item stem.
Use negatives sparingly in the question or stem; do not use double negatives.
Beware of using sets of opposite answers unless more than one pair is presented (e.g.,
  go to work, not go to work).
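The guideline about randomizing the position of the correct response can be illustrated with a short Python sketch. The function name shuffle_choices and the sample item are hypothetical, chosen only for illustration. The sketch assumes the correct answer does not also appear among the distractors.

```python
import random

def shuffle_choices(correct, distractors, rng=random):
    """Return the labelled choices in random order plus the letter of the key."""
    choices = [correct] + list(distractors)
    rng.shuffle(choices)  # the correct answer lands in a random position
    letters = "ABCDEFGH"
    key = letters[choices.index(correct)]
    return list(zip(letters, choices)), key

# Hypothetical item: "She ____ to school yesterday."
labelled, key = shuffle_choices("went", ["go", "goes", "gone"])
for letter, text in labelled:
    print(f"{letter}. {text}")
print("Key:", key)
```

Shuffling each item independently makes it unlikely that any one letter is consistently the right answer across a test.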
Grading of multiple choice exams can be done by hand or through the use of computer
scannable answer sheets available from your departmental office. Take completed
answer sheets to IUB Evaluation Services and Testing (BEST) located in Franklin Hall
If you choose the computer-grading route, you must be sure students have number 2
pencils to mark answers on their sheets. These are often available from your
department’s main office. At the time of the exam it is helpful to write on the
chalkboard all pertinent information required on the answer sheet (course name, course
number, section number, instructor’s name, etc.). Also, remind students to fill in their
university identification numbers carefully so that you can have a roster showing the ID
number and grade for each student. If you would like to consult with someone about
developing test items, call the Center for Innovative Teaching and Learning at 855-9023.
If you would like to consult with someone about how to interpret your test results, call
BEST at 855-1595.
Essay Tests
If you want students to study in both depth and breadth, don't give them a choice among
topics. This allows them to choose not to answer questions about those things they
didn’t study. Instructors generally expect a great deal from students, but remember that
their mastery of a subject depends as much on prior preparation and experience as it
does on diligence and intelligence; even at the end of the semester some students will be
struggling to understand the material. Design your questions so that all students can
answer at their own levels.
The following are some suggestions that may enhance the quality of the essay tests that
you produce (Adapted with permission from Ronkowski, 1986):
1. Have in mind the processes that you want measured (e.g., analysis, synthesis).
1. DISCRIMINATIVE LISTENING
Hearing ability
The ability to hear helps in sound differentiation; therefore, if one can hear well,
there is a high likelihood that one can get the message well (Lengel, 1998).
The next step beyond discriminating between different sounds and sights is to make
sense of them. To comprehend the meaning requires first having a lexicon of words at
our fingertips and also all the rules of grammar and syntax by which we can understand
what others are saying.
The same is true, of course, for the visual components of communication, and an
understanding of body language helps us understand what the other person really
means.
In communication, some words are more important and some less so, and
comprehension often benefits from extraction of key facts and items from a long
spiel. Comprehension listening is also known as content listening, informative listening
and full listening.
Listening Comprehension Sample Questions Transcript
Sample Item A
On the recording, you will hear:
 (Man):          I have a very special announcement to make. This year, not just
                 one, but three of our students will be receiving national awards for
                 their academic achievements. Krista Conner, Martin Chan, and
                 Shriya Patel have all been chosen for their hard work and
                 consistently high marks. It is very unusual for one school to have so
                 many students receive this award in a single year.
(Girl): Hi, Jeff. Hey, have you been to the art room today?
  (Girl):          Well, Mr. Jennings hung up a notice about a big project that's going
                   on downtown. You know how the city's been doing a lot of work to
                   fix up Main Street—you know, to make it look nicer? Well, they're
                   going to create a mural.
 (Girl):       It's that big wall on the side of the public library. And students from
               this school are going to do the whole thing ... create a design, and paint
               it, and everything. I wish I could be a part of it, but I'm too busy.
 (Boy):        [excitedly] Cool! I'd love to help design a mural. Imagine everyone in
               town walking past that wall and seeing my artwork, every day.
 (Girl):       I thought you'd be interested. They want the mural to be about nature,
               so I guess all the design ideas students come up with should have a
               nature theme.
 (Boy):        That makes sense—they've been planting so many trees and plants
               along the streets and in the park.
 (Boy):        [half listening, daydreaming] This could be so much fun. Maybe I'll try
               to visit the zoo this weekend ... you know, to see the wild animals and
               get some ideas, something to inspire me!
 (Girl):       [with humor] Well maybe you should go to the art room first to get
               more information from Mr. Jennings.
 (Boy):        [slightly sheepishly] Oh yeah. Good idea. Thanks for letting me know,
               Lisa! I'll go there right away.
Sample Set B
On the recording, you will hear:
Script Text:
 (Woman):       We've talked before about how ants live and work together in huge
                communities. Well, one particular kind of ant community also grows
                its own food. So you could say these ants are like people, like farmers.
                And what do these ants grow? They grow fungi [FUN-guy]. Fungi are
                kind of like plants—mushrooms are a kind of fungi. These ants have
                gardens, you could say, in their underground nests. This is where the
                fungi are grown.
   1. C
   2. D
   3. A
   4. B
   5. A
   6. D
   7. B
   8. B
   9. A
   10. D
Critical listening is listening in order to evaluate and judge, forming an opinion about what
is being said. Judgment includes assessing strengths and weaknesses, agreement and
approval.
This form of listening requires significant real-time cognitive effort as the listener
analyzes what is being said, relating it to existing knowledge and rules, whilst
simultaneously listening to the ongoing words from the speaker.
2. BIASED LISTENING
Biased listening happens when the person hears only what they want to hear, typically
misinterpreting what the other person says based on the stereotypes and other biases that
they have. Such biased listening is often very evaluative in nature.
3. EVALUATIVE LISTENING
In evaluative listening, or critical listening, we make judgments about what the other
person is saying. We seek to assess the truth of what is being said. We also judge what
they say against our values, assessing them as good or bad, worthy or unworthy.
Evaluative listening is particularly pertinent when the other person is trying to persuade
us, perhaps to change our behavior and maybe even to change our beliefs. Within this,
we also discriminate between subtleties of language and comprehend the inner meaning
of what is said. Typically also we weigh up the pros and cons of an argument,
determining whether it makes sense logically as well as whether it is helpful to
us. Evaluative listening is also called critical, judgmental or interpretive listening.
4. APPRECIATIVE LISTENING
Examine the following statements and choose the answer option that best applies to you.
There may be some questions describing situations that may not be relevant to you. In
such cases, select the answer you would most likely choose if you ever found yourself
in that type of situation. In order to receive the most accurate results, please answer as
truthfully as possible.
It's time to learn something new. Which class would you be most interested in taking
up?
G.    I would rather take:
          Acting classes
          Creative writing classes
H.   I would rather take:
         Survival skills classes
         Speed reading classes
I.   I would rather take:
         Kickboxing classes
         Tai Chi classes
Which of the following would you rather visit or spend some time in?
J.   I would rather go to:
         An Inuit igloo
         A Buddhist monastery
5. SYMPATHETIC LISTENING
        In sympathetic listening we care about the other person and show this concern in
the way we pay close attention and express our sorrow for their ills and happiness at
their joys.
EMPATHETIC LISTENING
6. THERAPEUTIC LISTENING
In therapeutic listening, the listener's purpose is not only to empathize with the
speaker but also to use this deep connection in order to help the speaker understand,
change or develop in some way. This not only happens when you go to see a therapist
but also in many social situations, where friends and family seek to both diagnose
problems from listening and also to help the speaker cure themselves, perhaps by some
cathartic process. This also happens in work situations, where managers, HR people,
trainers and coaches seek to help employees learn and develop.
7. DIALOGIC LISTENING
The word 'dialogue' stems from the Greek words 'dia', meaning 'through' and 'logos'
meaning 'words'. Thus dialogic listening means learning through conversation and an
engaged interchange of ideas and information in which we actively seek to learn more
about the person and how they think. Dialogic listening is sometimes known as
'relational listening'.
Listen carefully to the dialog between Nick and Jimmy, then complete the
conversation.
Nick     : I heard (1).......... as a computer programmer.
Jimmy    : Yes, and I had already (2)..............
Nick     : Really? I’m happy (3)...
Jimmy    : Thank you.
Nick     : Your parents must be (4)........
Jimmy    : They want me to run their business. They’re (5)......
Nick     : That’s a pity! Did you explain your reasons?
Jimmy    : I did and I hope they’ll accept my decision.
Dialog II
Margaret : Look at you! You look so great now. What have you been doing?
Joe      : Really? (1)................. I’ve been in Canada for two weeks. By the way, how
           about your job?
Margaret : (2)............ It’s in a big new hospital. My working conditions are much
           better than the last place.
Tony     : Attention, please. Today, we have a surprise. We’ve been offered a trip from
           our boss.
Joe      : Really? (3)........................?
Tony     : Bandung.
Joe      : (4).................. But where is it located?
Tony     : Aren’t you pleased?
Joe      : Yes, of course. (5)........................ But tell me where it is.
Margaret : It’s in Indonesia.
Joe      : Oh, I see. That’s not so good.
Tony     : Don’t worry, Joe. My friend, Lisa, who lives there, wrote to me about the
           conditions in Indonesia. Indonesia is safe now, especially in that town. There
           is no riot. It’s just a rumour.
Key Answer
      1) I think it’s usual
      2) That’s great
      3) Where to
      4) Marvellous
      5) I’m delighted to hear that
8. RELATIONSHIP LISTENING
English is a very important language in the world. It plays a very big role in
communication and education. Everything which is served by technology should be
related to English. By and by, English will be the global language in every part of the
world. Since English is an international language, people all over the world try to learn
as much as possible about English. To develop our skill in English we always encounter
grammar, and we practice our English through grammar testing, so that we know how well
we understand English.
A. Definition of grammar
Grammar is the structural foundation of our ability to express ourselves. The more we
are aware of how it works, the more we can monitor the meaning and effectiveness of
the way we and others use language. It can help foster precision, detect ambiguity, and
Both kinds of grammar are concerned with rules--but in different ways. Specialists in
descriptive grammar (called linguists) study the rules or patterns that underlie our use of
words, phrases, clauses, and sentences. On the other hand, prescriptive grammarians
(such as most editors and teachers) lay out rules about what they believe to be the
“correct” or “incorrect” use of language.
B. Types of test
Before writing a test it is vital to think about what it is you want to test and what its
purpose is. We must make a distinction here between proficiency tests, achievement
tests, diagnostic tests and prognostic tests.
There are of course many other types of tests. It is important to choose elicitation
techniques carefully when you prepare one of the aforementioned tests. There are many
elicitation techniques that can be used when writing a test. Below are some widely used
types with some guidance on their strengths and weaknesses. Using the right kind of
1. Multiple choice
Choose the correct word to complete the sentence.
Cook is ________________today for being one of Britain's most famous explorers.
Multiple choice can be used to test most things such as grammar, vocabulary, reading,
listening etc. but you must remember that it is still possible for students to just 'guess'
without knowing the correct answer.
2. Transformation
Complete the second sentence so that it has the same meaning as the first.
'Do you know what the time is, John?' asked Dave.
Dave asked John __________ (what) _______________ it was.
This time a candidate has to rewrite a sentence based on an instruction or a key word
given. This type of task is fairly easy to mark, but the problem is that it doesn't test
understanding. A candidate may simply be able to rewrite sentences to a formula. The
fact that a candidate has to paraphrase the whole meaning of the sentence in the
example above however minimizes this drawback.
3. Gap-filling
Complete the sentence.
Check the exchange ______________ to see how much your money is worth.
The candidate fills the gap to complete the sentence. A hint may sometimes be included
such as a root verb that needs to be changed, or the first letter of the word etc. This
usually tests grammar or vocabulary. Again this type of task is easy to mark and
relatively easy to write. The teacher must bear in mind though that in some cases there
may be many possible correct answers.
       Gap-fills can be used to test a variety of areas such as vocabulary and grammar,
        and are very effective at testing listening for specific words.
4. True / False
Decide if the statement is true or false.
Here the candidate must decide if a statement is true or false. Again this type is easy to
mark but guessing can result in many correct answers. The best way to counteract this
effect is to have a lot of items.
       This question type is mostly used to test listening and reading comprehension.
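The effect of guessing on true/false items, and the reason a larger number of items counteracts it, can be checked with a small calculation. The sketch below is illustrative only: the function name prob_at_least is hypothetical, and it assumes each item is guessed independently with a 50% chance of success.

```python
from math import comb

def prob_at_least(correct, items, p=0.5):
    """Probability of guessing at least `correct` of `items` answers right."""
    return sum(
        comb(items, k) * p**k * (1 - p) ** (items - k)
        for k in range(correct, items + 1)
    )

# One item: a pure guess is right half of the time.
print(prob_at_least(1, 1))    # 0.5
# Ten items: guessing all ten correctly is very unlikely.
print(prob_at_least(10, 10))  # 0.0009765625
# Ten items: guessing at least seven correctly is still quite possible.
print(prob_at_least(7, 10))   # 0.171875
```

Under these assumptions, a perfect score on ten items is far stronger evidence of mastery than a correct answer on a single item.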
5. Open questions
Answer the questions.
Here the candidate must answer simple questions after a reading or listening or as part
of an oral interview. It can be used to test anything. If the answer is open-ended it will
be more difficult and time consuming to mark and there may also be an element of
subjectivity involved in judging how 'complete' the answer is, but it may also be a more
accurate test.
       These question types are very useful for testing any of the four skills, but less
        useful for testing grammar or vocabulary.
6. Error Correction
Find the mistakes in the sentence and correct them.
Errors must be found and corrected in a sentence or passage. It could be an extra word,
mistakes with verb forms, words missed etc. One problem with this question type is that
some errors can be corrected in more than one way.
       Error correction is useful for testing grammar and vocabulary as well as reading
        and listening.
7. Other Techniques
There are of course many other elicitation techniques such as translation, essays,
dictations, ordering words/phrases into a sequence and sentence construction
(He/go/school/yesterday).
It is important to ask yourself what exactly you are trying to test, which techniques suit
this purpose best and to bear in mind the drawbacks of each technique. Awareness of
this will help you to minimize the problems and produce a more effective test.
The study of grammar all by itself will not necessarily make you a better writer. But by
gaining a clearer understanding of how our language works, you should also gain
greater control over the way you shape words into sentences and sentences into
paragraphs. In short, studying grammar may help you become a more effective
writer. Descriptive grammarians generally advise us not to be overly concerned with
matters of correctness: language, they say, isn't good or bad; it simply is. As the history
of the glamorous word grammar demonstrates, the English language is a living system
of communication, a continually evolving affair. Within a generation or two, words and
phrases come into fashion and fall out again. Over centuries, word endings and entire
sentence structures can change or disappear.
Introduction
What does interpret mean? To interpret is to decide what the intended meaning of
something is (Cambridge Advanced Learner’s Dictionary). To interpret is to conceive
the significance of; construe (thefreedictionary.com). Thus, to interpret is to understand
the meaning and the significance of something. Interpreting test scores means understanding
the meaning and the significance of test scores, which can be used to plan the next action:
to fix or to retain. There are many ways to do it, but the three most common are
frequency distribution, measures of central tendency, and measures of dispersion.
Frequency distribution describes the distribution of scores and the frequency
of each category. Measures of central tendency refer to measures of the
“middle” value, and are expressed using the mode, median, and mean. Last but not
least are the measures of dispersion, which relate to the range or spread of scores. All
three can help teachers interpret the meaning behind test scores.
A. Frequency Distribution
Frequency distribution deals with the distribution of scores and the frequency of the
distribution. Each entry in the table contains the frequency or count of the occurrences
of scores within a particular group or interval, and in this way, the table summarizes the
distribution of scores.
The example case here is: a teacher administers a test of 40 questions to 26 students.
Marks are awarded by counting the number of correct answers on the test scripts. These
are known as raw marks.
Here are the steps to create a table of frequency distribution:
1. Create Table 1 and put the raw mark of every student in it.
   TABLE 1
   Testee       Mark
   A            20
   B            25
   C            33
   D            35
   E            29
   F            25
   G            30
   H            26
   I            19
   J            27
   K            26
   L            32
   M            34
   N            27
   O            27
   P            29
   Q            25
2. Create Table 2. Sort the marks from the highest to the lowest score. This is called
   descending sorting. It is easier and faster to use a tool like Microsoft Excel to do the
   sorting.
    TABLE 2
    Testee     Mark
    D          35
    M          34
    C          33
    W          33
    L          32
    G          30
    S          30
    E          29
    P          29
    J          27
    N          27
    O          27
    H          26
    K          26
    T          26
   Now, we determine the rank. We start from rank 1 and go up to rank 26, for there are 26
   students.
The problem comes when two or more students have the same mark. Here we
highlight the identical marks to make them easier to distinguish. Then, we write an imaginary
rank to the right of the Rank column, from 1 to 26. The imaginary ranks of the students with the
same mark are then added and divided by the number of students who obtained that mark. For
example, students C and W have the same mark, 33. Their imaginary ranks are 3 and 4. To get the
actual rank, we add 3 and 4 (3+4=7). The result, 7, is then divided by the number of students
with the same score, which is 2 here. The final result is 3.5. Thus, the rank of both students
is 3.5.
    TABLE 2
    Testee     Mark      Rank     Imaginary rank
    D          35        ?        1
    M          34        ?        2
    C          33        ?        3               (3+4) / 2 = 3.5
    W          33        ?        4
    L          32        ?        5
   The result will be like this. Table 2 shows the students’ scores in order of merit and
   their rank as well.
    TABLE 2
    Testee    Mark       Rank
    D         35         1
    M         34         2
    C         33         3.5
    W         33         3.5
    L         32         5
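The ranking rule for tied marks can also be sketched in Python. This is an illustration only: the function name average_ranks is hypothetical, and the data are just the five rows shown in the table above, not the full class of 26.

```python
def average_ranks(marks):
    """Rank marks from highest to lowest; tied marks share the mean of their imaginary ranks."""
    # Indices of the testees, ordered from the highest mark to the lowest
    order = sorted(range(len(marks)), key=lambda i: marks[i], reverse=True)
    ranks = [0.0] * len(marks)
    pos = 0
    while pos < len(order):
        # Find the last position holding the same mark as position `pos`
        end = pos
        while end + 1 < len(order) and marks[order[end + 1]] == marks[order[pos]]:
            end += 1
        # Imaginary ranks run from pos + 1 to end + 1; their mean is the shared rank
        shared = ((pos + 1) + (end + 1)) / 2
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks

testees = ["D", "M", "C", "W", "L"]
marks = [35, 34, 33, 33, 32]
for name, mark, rank in zip(testees, marks, average_ranks(marks)):
    print(name, mark, rank)
```

For these five rows the sketch gives C and W the shared rank 3.5, matching the worked example.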
3. Create Table 3, which consists of a Mark column, a Tally column, and a Frequency
   column.
   In the Mark column, we can extend the range to run from 40 down to 15, since the
   highest score is 35 and the lowest score is 19. We usually do this to give more space
   and enhance readability.
   A tally is a stroke recording each student who obtains a certain score. It is simply a
   method of counting the frequency of scores.
   The Frequency column lists the number of students obtaining each score. It is easier to
   count thanks to the tallies.
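The tally-and-frequency step can be sketched with Python's collections.Counter. The function name frequency_table is hypothetical, and the marks listed are only those visible in Table 2 above, so the counts are not those of the full class of 26.

```python
from collections import Counter

# Marks visible in Table 2 above (an incomplete sample, for illustration only)
marks = [35, 34, 33, 33, 32, 30, 30, 29, 29, 27, 27, 27, 26, 26, 26]

def frequency_table(marks):
    """Return (mark, tally, frequency) rows, from the highest mark to the lowest."""
    counts = Counter(marks)
    return [(mark, "/" * counts[mark], counts[mark])
            for mark in sorted(counts, reverse=True)]

print("Mark  Tally   Frequency")
for mark, tally, freq in frequency_table(marks):
    print(f"{mark:>4}  {tally:<6}  {freq}")
```

Each stroke in the Tally column stands for one testee, so the length of the tally equals the frequency.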
B.1. Mode
Mode refers to the score which most candidates obtained. We can easily spot it from
Table 3. The most frequent score in Table 3 is 26, as five testees have scored this mark.
Thus, the mode is 26.
B.2. Median
Median refers to the score gained by the middle candidate after the data is put in order.
We use Table 2, which has been ordered in descending order, to find the median. In the
case of 26 students here, there can obviously be no middle student and thus the score
halfway between the lowest score in the top half and the highest score in the bottom half
is taken as the median. The median score in this case is 26.
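The mode and the median can be computed as in the sketch below. The eight marks are hypothetical, chosen so that there is an even number of testees and therefore no middle testee. The helper name median_of is also an assumption.

```python
import statistics

# Hypothetical marks for eight testees (not the class data from the tables)
marks = [35, 33, 30, 27, 26, 26, 25, 19]

def median_of(marks):
    """Middle score, or the value halfway between the two middle scores."""
    ordered = sorted(marks, reverse=True)  # descending, as in Table 2
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    # Even count: halfway between the lowest score of the top half
    # and the highest score of the bottom half
    return (ordered[mid - 1] + ordered[mid]) / 2

print("Mode:", statistics.mode(marks))  # the most frequent mark
print("Median:", median_of(marks))
```

The standard library function statistics.median applies the same halfway rule for an even number of scores.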
B.3. Mean
C. Measures of Dispersion
Measures of dispersion are important for describing the spread of the scores, or their
variation around a central value. There are various methods that can be used to measure
the dispersion of a dataset, but the most common ones are the range and the standard
deviation.
C.1. Range
A simple way of measuring the spread of marks is based on the difference between the
highest and the lowest scores. It is called the range. From Table 2 above, we can see
the highest score is 35 and the lowest score is 19. The range is 16.
         Range = Xmax – Xmin
         Range = 35 – 19 = 16
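C.2. Standard Deviation
The standard deviation (s.d.) is calculated as follows, where d is the deviation of each
score from the mean and N is the number of scores:
\[ \text{s.d.} = \sqrt{\frac{\sum d^2}{N}} \]
\[ \frac{\sum d^2}{N} = \frac{432}{26} = 16.62 \]
\[ \text{s.d.} = \sqrt{16.62} = 4.077 \approx 4.08 \]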
Thus, the standard deviation (s.d.) is 4.08. That means that, on average, the scores are about
4 points away from the mean.
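The range and the standard deviation can be verified with a few lines of Python. The function names score_range and standard_deviation are hypothetical. The sketch assumes the population form of the standard deviation: the square root of the sum of squared deviations from the mean, divided by the number of scores N. The figures 35, 19, 432 and 26 are those quoted in the text.

```python
import math

def score_range(scores):
    """Difference between the highest and the lowest score."""
    return max(scores) - min(scores)

def standard_deviation(scores):
    """Square root of the sum of squared deviations from the mean, divided by N."""
    n = len(scores)
    mean = sum(scores) / n
    sum_d_squared = sum((x - mean) ** 2 for x in scores)
    return math.sqrt(sum_d_squared / n)

# Figures quoted in the text
print(35 - 19)                        # range: 16
print(round(math.sqrt(432 / 26), 2))  # standard deviation: 4.08
```

Given a complete list of raw marks, standard_deviation(marks) performs the same calculation without first tabulating each deviation by hand.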
Classroom and Large-Scale Assessment. Wilson and Kenney. This article appeared
in A Research Companion to Principles and Standards for School Mathematics
(NCTM), 2003, (pages 53-67).
Cronbach, L., 1990. Essentials of psychological testing. Harper & Row, New York.
Carmines, E., and Zeller, R., 1979. Reliability and Validity Assessment. Sage
Publications, Beverly Hills, California.
Gay, L., 1987. Educational research: competencies for analysis and application. Merrill
Pub. Co., Columbus.
Winer, B., Brown, D., and Michels, K., 1991. Statistical Principles in Experimental
Design, Third Edition. McGraw-Hill, New York.
Cozby, P.C. (2001). Measurement Concepts. Methods in Behavioral Research (7th ed.).
Moskal, B.M., & Leydens, J.A. (2000). Scoring rubric development: Validity and
The Center for the Enhancement of Teaching. How to improve test reliability and
http://spiritize.blogspot.com/2007/10/active-listening.html
http://wiki.answers.com/Q/Examples_of_poetry#ixzz1xYYnZXmU
Hatch, E. & Lazaraton, A. 1991 The Research Manual - Design & Statistics for Applied
Linguistics Newbury House
Messick, S. 1988 The once and future issues of validity: Assessing the meaning and
consequences of measurement. In H. Wainer & H. Braun [Eds.] Test validity [pp. 33-
45], Hillsdale, NJ: Erlbaum.