WEEK 1&2: ASSESSMENT, CONCEPTS AND ISSUES
Key contents:
A. Assessment and testing
1. Assessment and test
Should tests be at the highest level of seriousness?
Should they be degrading or threatening to students?
Build students' confidence?
A learning experience?
An integral part of students' ongoing classroom development?
Bring out the best in students?
● Definitions:
ASSESSMENT:
+ is appraising or estimating the level or magnitude of some attribute of a person.
+ is the systematic process of documenting and using empirical data on knowledge,
skills, attitudes, and beliefs.
ex: question responses, assignments, projects, homework, diary/journal,
comments/suggestions, quizzes, reflections, tests,... .
TESTS: + are a subset of assessment, a genre of assessment techniques.
They are prepared administrative procedures occurring at identifiable times in a curriculum.
- A test is used to examine someone’s knowledge of something to determine what that
person knows or has learned. It measures the level of skill or knowledge that has been
reached.
ex: driving test, grammar test, swimming test,... .
4 BASICS OF A TEST:
A test is a method of measuring a person's ability, knowledge, or performance in a given
domain.
#1:A test is a method.
An instrument
Set of techniques
Procedures
Performance from test-takers
Preparation from organ
How to be qualified:
Explicit test structure
Prescribed answer keys
Scoring rubrics (Writing)
Question prompts (Speaking)
#2:A test must measure.
- Measure general ability, specific competencies or objectives
- Quantify a test-taker's performance (Bachman, 1990)
- Provide letter grade, numerical score, a percentile rank…
#3: A test measures an individual's ability, knowledge, or performance
- Appropriate
- Result interpretable
Example:
IELTS Speaking test (Band 6)
Reading comprehension test TELL (design a test using technology)
Phonetics and Phonology test
#4: a test measures a given domain
- appropriate (level, purpose)
- measure the desired criterion
ex: - end of unit test
- end of course test (UWE IE1)
- proficiency test
2. MEASUREMENT AND EVALUATION
MEASUREMENT:
• The process of quantifying the observed performance of classroom
learners → requires some standardized tools for measuring
• Can be quantitative and qualitative
Quantitative
- Test scores, letter grade ...
- Advantage: provides exact descriptions of student performance and makes it easier to compare one
student with another → greater objectivity
Qualitative:
- Oral feedback, comments ...
– Advantage: individualize feedback → supportive, encouraging, psychologically beneficial
EVALUATION:
• Involved when the results of a test are used to make decisions
• Involves the interpretation of information
Example - Assignment, Midterm, Final → Pass/ Fail
- Subject, Average → Excellent, Good, Fair, Average, Weak
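The difference between measurement (producing a number) and evaluation (interpreting that number as a decision) can be sketched in a few lines. This is only an illustration: the weights and the pass mark below are assumptions, not values given in these notes.

```python
# Hypothetical weights (in percent) and pass mark; a real course defines its own.
WEIGHTS = {"assignment": 30, "midterm": 30, "final": 40}
PASS_MARK = 50  # assumed pass mark on a 0-100 scale


def course_average(scores):
    """Measurement: combine component scores into one weighted number."""
    return sum(scores[part] * weight for part, weight in WEIGHTS.items()) / 100


def evaluate(scores):
    """Evaluation: interpret the number as a Pass/Fail decision."""
    return "Pass" if course_average(scores) >= PASS_MARK else "Fail"


student = {"assignment": 70, "midterm": 55, "final": 40}
print(course_average(student), evaluate(student))
```

The same average could feed a different evaluation (for example Excellent/Good/Fair/Average/Weak) simply by swapping the decision rule, which is the point of keeping the two steps separate.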
3. ASSESSMENT AND LEARNING
Should we assess/test learners all the time?
For optimal learning
• Have freedom to experiment, to try out own hypotheses about language
without feeling their overall competence is judged
• Have ample opportunities to "play" with language in a classroom
without being formally graded
4. INFORMAL AND FORMAL ASSESSMENT
INFORMAL
– Nonjudgmental, simply coaching
- Make comments, give suggestions/advice ...
Example: "Nice job!", smiley face, Like/Love reactions
FORMAL
- Exercises or procedures specifically designed to tap into a storehouse of skills and
knowledge
Is formal assessment the same as a test?
All tests are formal assessment, but NOT all formal assessment is testing. Example: journal,
project
5. FORMATIVE AND SUMMATIVE ASSESSMENT
FORMATIVE ASSESSMENT
- Evaluates students in the process of "forming" their competencies and skills, with the goal of
helping them to continue that growth process.
- Used to monitor student's learning to provide ongoing feedback that can be used by
instructors or teachers to improve their teaching and by students to improve their learning
Are all kinds of informal assessment formative?-> YES
SUMMATIVE ASSESSMENT
- Aims to measure, or summarize, what a student has grasped and typically occurs at the end
of a course or unit of instruction
- Used to evaluate student's learning at the end of an instructional unit by comparing it
against some standard or benchmark
Are quizzes, periodic review tests, midterm exams summative? -> YES
6. NORM-REFERENCED AND CRITERION-REFERENCED TESTS
Norm-Referenced Tests
- measure broad skill areas, then rank students with respect to how others
(norm group) performed on the same test.
- each test-taker's score is interpreted in relation to a mean (average score), median (middle
score), standard deviation (extent of variance in scores), and/ or percentile rank
Score reported
- Numerical score (90, 780...)
- Percentile rank (80th percentile)
Examples - TOEFL iBT / IELTS / FCE - CAT (International University HCM) - HS National
Graduation Examination
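To make the norm-referenced statistics concrete, here is a minimal sketch that interprets one score against a norm group. The scores are invented for illustration, and percentile rank is taken here as the percentage of the norm group scoring strictly below the given score (other definitions exist).

```python
import statistics


def percentile_rank(score, norm_scores):
    """Percentage of the norm group scoring strictly below `score`."""
    below = sum(1 for s in norm_scores if s < score)
    return 100 * below / len(norm_scores)


# Hypothetical norm group; not real test data.
norm_group = [45, 52, 58, 60, 63, 67, 70, 74, 81, 90]

mean = statistics.mean(norm_group)      # average score
median = statistics.median(norm_group)  # middle score
sd = statistics.pstdev(norm_group)      # spread of scores around the mean

print(mean, median, round(sd, 2), percentile_rank(74, norm_group))
```

A score of 74 means little on its own; it becomes interpretable only when placed relative to how the norm group performed.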
Criterion-Referenced Tests
- determine whether students have achieved certain defined skills. An individual is compared
with a preset standard for expected achievement.
- designed to give test-takers feedback, usually in the form of grades, on specific course or
lesson objectives
- Test scores tell more or less how well someone performs on a task.
Score reported - Numerical score (90)
Examples - Quizzes - In-class assessment - Midterm/ Final
B. TYPES AND PURPOSES OF ASSESSMENTS
5 TYPES OF TESTS
1. Achievement Tests
- The primary role is to determine whether course objectives have been met
- Measure learners' ability within a classroom lesson, a unit, or even an entire
curriculum
- Be a form of summative assessment, administered at the end of a lesson/ unit/
semester
- Short-term/ Long-term/ Accumulative
- Specifications for an achievement test are determined by:
● objectives of the lesson, unit, or course being assessed
● relative importance (or weight) assigned to each objective
● tasks used in classroom lessons during the unit of time
● time frame for the test itself and for returning evaluations to students
● potential for formative feedback
2. Diagnostic Tests
- Identify aspects of a language that a student needs to develop or that a course should include
- Help a teacher know what needs to be reviewed or reinforced in class, enable the student to
identify areas of weakness
- offers more detailed, subcategorized information about the learner
3. Placement Tests
- Place a student into a particular level or section of a language curriculum or school
- Include points to be covered in the various courses in a curriculum
- Come in many varieties (formats, question types)
- use existing standardized proficiency tests due to obvious advantage in practicality,
cost, speed in scoring, and efficient reporting of result
UWE
Listening: 20 MCQs (2 parts)
Grammar: 20 MCQs (2 parts)
Vocabulary: 20 MCQs (2 parts)
Reading: 20 MCQs (2 parts)
Writing: essay (200 words)
IU (Full IELTS – Proficiency test)
Writing: Visuals + Essay (2 parts)
Speaking: various Qs (3 parts)
Listening: 40 Qs (4 sections)
Reading: 40 Qs (3 passages)
4. Proficiency Tests
- Test overall ability, not limited to any one course, curriculum, or single
skill in the language
- Traditionally consisted of standardized MC items on grammar, vocabulary, reading
and aural comprehension
- Almost always summative and norm-referenced
- Play a gatekeeping role in accepting/ denying someone academically
- A key issue is how the constructs of language ability are specified
5. Aptitude Tests
- Measure capacity/ general ability to learn a foreign language (before
taking a course)
- Predict success in academic courses
- Significant correlations with the ultimate performance of students in language courses
(Carroll, 1981), but measured by similar processes of mimicry, memorization, and puzzle-
solving → less popular
Other Tests:
internal test, external test, objective test, subjective test, combination test,... .
C. ISSUES IN LANGUAGE ASSESSMENT
1. Behavioral Influences on Language Testing
- Strongly influenced by behavioral psychology and structural linguistics
- Assumption that language can be broken down into its component parts and that those
parts can be tested successfully
- → discrete-point tests
Main skills: listening, speaking, reading, and writing
Units of language: phonology, morphology, lexicon, syntax, and discourse
ECPE (Examination for the Certificate of Proficiency in English): Writing, Listening, GCVR,
Speaking
ECCE (Examination for the Certificate of Competency in English): Listening, GVR, Writing,
Speaking
2. Integrative Approaches
Language pedagogy rapidly moving in more communicative directions
→ discrete-point approach → inauthentic
- Language competence: a unified set of interacting abilities that could not be tested
separately (John Oller, 1979) → integrative testing
Cloze test:
- Reading passage
- a blank in place of every 7th word
- Integration of vocabulary, structure, grammar, reading comprehension, prediction,
discourse ...
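A fixed-ratio cloze passage can be produced mechanically. The sketch below assumes that the rule above means every 7th word is deleted; the function name and the sample passage are made up for illustration.

```python
def make_cloze(passage, n=7, blank="_____"):
    """Replace every n-th word with a blank; return the cloze text and answer key."""
    words = passage.split()
    answers = []
    for i in range(n - 1, len(words), n):
        answers.append(words[i])  # keep the deleted word for the answer key
        words[i] = blank
    return " ".join(words), answers


# Hypothetical passage, 20 words long.
text = ("The students opened their books and began to read the short story "
        "that the teacher had chosen for the lesson")
cloze, key = make_cloze(text)
print(cloze)
print(key)
```

This simple version splits on whitespace, so punctuation attached to a deleted word disappears with it; a real test writer would also review each gap by hand.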
Dictation:
- Listen and write
- integration of listening, writing, grammar, structure, vocabulary, efficient short-term
memory, discourse, ..
3. Communicative Language Testing
- By the mid-1980s, a switch to work on communicative competence → communicative test
tasks
- Integrative tests such as cloze only reveal a candidate's linguistic competence and say nothing directly
about a student's performance ability → a quest for authenticity in communicative performance
- Bachman (1990) proposed a model of language competence:
A. Organizational Competence
1. Grammatical (including lexicon, morphology, and phonology)
2. Textual (discourse)
B. Pragmatic Competence
1. Illocutionary (functions of language)
2. Sociolinguistic (including culture, context, pragmatics, and purpose)
4. Traditional and "Alternative" Assessment
Traditional Assessment          | Alternative Assessment
One-shot, standardized exams    | Continuous, long-term assessment
Timed, multiple-choice format   | Untimed, free-response format
Decontextualized test items     | Contextualized communicative tasks
Scores sufficient for feedback  | Individualized feedback and washback
Norm-referenced scores          | Criterion-referenced scores
Focus on discrete answers       | Open-ended, creative answers
Summative                       | Formative
Oriented to product             | Oriented to process
Noninteractive performance      | Interactive performance
Fosters extrinsic motivation    | Fosters intrinsic motivation
5. Performance-Based Assessment
- General educational reform movement: standardized tests DO NOT elicit actual
performance on the part of test-takers. → Performance-Based Assessment
Involves: oral production, written production, open-ended responses, integrated
performance, group performance, and interactive tasks
Drawbacks: time-consuming, expensive
Pay-off: more direct testing, actual or simulated real-world tasks → higher content validity
Appearance: Task-based assessment , Classroom-based assessment
D. CURRENT HOT TOPICS IN LANGUAGE ASSESSMENT
Assessing for Learning
Dynamic Assessment
Zone of Proximal Development (ZPD)
• Learner's potential abilities > the actual performance in a task
• What learner can do with assistance/ feedback
• Assessment NOT complete without observation and assistance (Poehner and Lantolf, 2003) →
Important to explore/ discover ZPD
What to do:
- provide clear tasks and activities
- pose questions for Ss to demonstrate understanding and knowledge
- intervene with feedback and student reflections on their learning
Assessing Pragmatics
Levels of linguistic analysis: Phonetics → Phonology → Morphology → Syntax → Semantics → Pragmatics
● Focuses of pragmatics research: speech acts (e.g., requests, apologies, refusals,
compliments, advice, complaints, agreements, and disagreements)
● Research instruments: discourse completion tasks, role plays, and socio-pragmatic
judgment tasks, all referenced against a native speaker norm
● This underrepresents L2 pragmatic competence (Roever, 2011) → need to include
assessment of learners' participation in extended discourse
● Other aspects to be assessed: recognizing and producing formulaic expressions (e.g.,
Do you have the time? Have a good day.)
Use of Technology in Testing:
Advantages:
1. easily administered classroom-based tests
2. self-directed testing on various aspects of a language (vocabulary, grammar, four
skills, etc.)
3. practice for upcoming high-stakes standardized tests
4. individualization with customized, targeted test items
5. large-scale standardized tests (administered easily in quantity, locations, then scored
electronically, results reported quickly)
6. improved automated essay evaluation/ scoring
Concerns:
1. Lack of security and the possibility of cheating
2. "Homegrown" quizzes on unofficial websites
3. Probably mistaken for validated assessments
4. MCQ format preferred for most CBT
5. Usual potential for flawed item design
6. Open-ended responses less likely to appear: (i) expense + potential unreliability of
human scoring (ii) complexity of recognition software for automated scoring
7. No human interactive element (esp. in oral exam)
8. Validation issues stem from test-takers approaching tasks as test tasks
WEEK 4: PRINCIPLES OF LANGUAGE ASSESSMENT
CONTENTS:
1. Practicality
- the logistical, administrative issues involved in making, giving, and scoring
an assessment instrument, including costs, the amount of time it takes to
construct and to administer,
- the considerations of cost of a test, time allotment, test administration, human
resource, test construction,
- 6 attributes of PRACTICALITY
+ Stay within budgetary limits
+ can be completed by the test-taker within appropriate time constraints
+ have clear directions for administration
+ appropriately utilize available human resources
+ not exceed available material resources
+ consider time and effort involved to both design and score
2. Validity
- the consideration that it ‘really measures what it purports to measure’ (McCall, 1922,
p.196)
- the consideration that tests really assess what they are intended to assess (Davies,
1990; Brown, 2005)
- 6 attributes of VALIDITY
+ measure what it purports to measure
+ not measure irrelevant or ‘contaminating’ variables
+ rely as much as possible on empirical evidence (performance)
+ involve performance that samples the test’s criterion (objective)
+ offer useful, meaningful information about a test-taker’s ability
+ supported by a theoretical rationale or argument
Evidence of VALIDITY
CONTENT (content-related evidence)
- Clearly define the achievement you are measuring
- Ex: + speaking test (write down the responses)
+ 10 course objectives (test 2 out of 10)
+ 1 objective tested in midterm & final (some deleted)
- Direct testing: test-taker actually performing the target task.
Ex: phonetics & phonology course: students pronounce words/ sound analysis system to check
spectrum.
- Indirect testing: test-taker NOT actually performing the target task, but doing related task(s).
Ex: students write down transcription, differentiate minimal pairs.
CRITERION (criterion-related evidence)
- Based on classroom objectives
- Minimal passing grade
- Examples: + research proposal:
- Meet requirements of components of a research paper
- Clear descriptions of each part
- Use of hedges in discussion and conclusion
- CONCURRENT VALIDITY -> results supported by other concurrent performances (Brown &
Abeywickrama, 2018)
Scores validated through comparison with various teachers’ grades (Green, 2013).
Ex: + IU CAT (English) – correlation test with high school year-end score/ high school
graduation exam.
+ regarding for final exams (asserted by another lecturer)
+ questionnaire: adjusted from FW (then checked by advisor)
- PREDICTIVE VALIDITY -> determine students’ readiness to ‘move on’ to another unit
Ex: placement test, scholastic aptitude test (SAT)
- What if the test designer just pays attention to concurrent validity but ignores content
validity?
CONSTRUCT (construct-related evidence)
- Any theory, hypothesis, or model that attempts to explain observed phenomena in our
universe of perceptions (Brown & Abeywickrama, 2018)
- The theory underpinning the assessment relevant and adequate to support the intended
decision (Green, 2013).
- Ex: linguistics construct
Presentation skill course (persuasion): content, pronunciation, vocabulary, body language,
visual aids, Q/A, organization (logical ideas)
- A major issue in validating large-scale standardized tests of proficiency -> Practicality:
Omit oral test
Very low correlation: W & S/ R & L
+ micro level: test takers, families (preparation, accuracy,
psychology/motivation/attitude/habits/...)
+ macro level: society, educational systems (conditions for coaching, teaching
methodologies, …)
FACE (face validity)
- The degree to which a test looks right, appears to measure the knowledge or abilities it
claims to measure
- The extent to which students view the assessment as fair, relevant and useful for improving
learning
- A fallacy to some experts.
- Reflect the quality of a test
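Concurrent validity is usually checked by correlating scores on the test with another measure taken at about the same time, such as teachers' grades. The sketch below computes a Pearson correlation by hand; the two score lists are invented and the function name is only illustrative.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical scores for the same six students on two measures.
new_test = [55, 62, 70, 74, 81, 90]
teacher_grades = [50, 65, 68, 72, 85, 88]
print(round(pearson_r(new_test, teacher_grades), 2))
```

A coefficient close to 1 would count as supporting evidence; a low one (as noted for W & S versus R & L) suggests the two measures tap different abilities.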
3. Reliability
Definitions:
5 principles of reliability
1. Has consistent conditions across two or more administrations
2. Gives clear direction for scoring/evaluation
3. Has uniform rubrics for scoring/evaluation
4. Lends itself to consistent application of rubrics by the scorer
5. Contains items/tasks that are unambiguous to the test-taker
Factors causing unreliability:/ Possible sources of fluctuations:
- The student:
+ Health problems (fatigue, bad stomach, …)
+ Psychological issues (anxiety, break-up, ...)
+ Test wiseness (strategies,..)
What can teachers do to help students reveal their true competence? -> test
bank, mock-test, share tips, awards
- The scoring/ the rater reliability (interobserver reliability)
The degree of agreement between different people observing or assessing the same
thing.
Interrater reliability: when two or more scorers yield consistent scores of the same
test
Why do fluctuations exist?
+ Lack of adherence to scoring criteria
+ Inexperience
+ Inattention
+ Preconceived biases
What can be done to avoid the big gap between two raters?
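One way to see how big the gap between two raters is: compute their agreement on the same set of scripts. The sketch below uses simple percent agreement plus Cohen's kappa, a standard statistic that corrects for chance agreement; kappa is not named in these notes and is brought in here as an assumption, and the band scores are invented.

```python
from collections import Counter


def percent_agreement(rater1, rater2):
    """Share of scripts on which both raters gave exactly the same score."""
    return sum(a == b for a, b in zip(rater1, rater2)) / len(rater1)


def cohen_kappa(rater1, rater2):
    """Agreement corrected for chance; undefined if chance agreement equals 1."""
    n = len(rater1)
    p_observed = percent_agreement(rater1, rater2)
    counts1, counts2 = Counter(rater1), Counter(rater2)
    # Chance agreement: probability both raters pick the same band independently.
    p_chance = sum(counts1[band] * counts2[band] for band in counts1) / (n * n)
    return (p_observed - p_chance) / (1 - p_chance)


# Hypothetical band scores given by two raters to the same eight scripts.
rater_a = [6, 6, 7, 5, 6, 7, 5, 6]
rater_b = [6, 7, 7, 5, 6, 6, 5, 6]
print(percent_agreement(rater_a, rater_b), round(cohen_kappa(rater_a, rater_b), 2))
```

If agreement is low, the usual remedies are clearer rubrics, rater training, and double marking with a third rater for large discrepancies.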
Intra-rater reliability (internal factor): when a single rater repeats administrations of
a test. (1 rater rates the test twice)
=> Why do fluctuations exist?
+ unclear scoring criteria
+ fatigue
+ bias towards ‘good’ and ‘bad’
+ carelessness
What can be done to maintain intra-rater reliability? ->
- The test administration
The conditions in which the test is administered. This may cause unfairness
Why do fluctuations exist?
+ Background noise
+ Photocopying variations
+ Lighting systems
+ Location of air conditioners/fans
+ Arrangement/ condition of chairs/desks
+ Session of exam (morning/afternoon/…)
+ Locations of speakers (listening exam)
Scenario: think about a midterm/final test in IU
- The test itself
the nature of the test itself can cause measurement errors
Why do fluctuations exist?
+ test format
+ test design items
- Characteristics (to categorize learners)
- Poorly designed items (ambiguous, two similar answers)
- Subjective test (open-ended questions)
- Objective test (MCQs)
4. Authenticity
Definitions:
Signals of authenticity:
An authentic test:
1. Contains language that is as natural as possible
2. Has items that are contextualized rather than isolated
3. Includes meaningful, relevant, interesting topics
4. Provides some thematic organization to items, such as through
a story line or episode.
5. Offers tasks that replicate real-world tasks.
Example of authentic assessment: portfolios, role-play, memos, presentations, case studies,
proposal, reports, projects, …
5. Washback
Definitions: the effect of testing on teaching and learning.
….
CONTRIBUTORS TO POSITIVE WASHBACK:
A test that provides beneficial washback …
1. Positively influences what and how teachers teach
2. Positively influences what and how learners learn
3. Offers learners a chance to adequately prepare
4. Gives learners feedback that enhances their language
development
5. Is more formative in nature than summative
6. Provides conditions for peak performance by the learner
What can be done to maintain positive washback?
Suggested research topics:
Washback from the on-going assessment in the Writing 1 course
Students’ perceptions towards face validity of exams in the presentation skill course
WEEK 6:
Analyzing the test:
1. Validity:
Content-Validity: in the test: role-play -> test speaking competence -> high content validity
Criterion-validity:
Chapter 3:
1. Determine the purpose of the test
When to give a test?
-when needed (with clear purpose)
-test usefulness
Think: a teacher finishes the class earlier than expected. He then shows a free test online for
students. -> give a related game instead
Innovative (using technology) or traditional?
Innovative: (+) impressive, time-saving, cost-effective, no power cut-off, fast results
(?) effort (teacher), effectiveness (cheating)
Traditional: (+) format, construct -> familiar to students
(?): old-fashioned, boring, time consuming
Think and respond: A teacher wants to test speaking, and he asks students to record
and upload onto teacher’s Google Drive. -> save time, all students can have place to
perform/ students may just read the prepared script, takes lots of time for teacher to
check.
2. STATE CONSTRUCT OR ABILITIES
1. Know clearly what you want to test
2. Review what students are able to do.
Think and respond: what should teachers do to know what students are able to do? ->
base on the class’s objectives, give test which is similar to the things students have in
class, check the curriculum, make a checklist (skills, vocab, grammar, structures,…)
3. DESIGN TEST SPECIFICATIONS:
Class objectives should be presented through a variety of appropriate task types and
weights and a logical sequence.
Test specs should include:
1. The skill/abilities assessed
2. A description of its content
3. Task type and item types (MCQ, cloze,
T/F, short answer, …)
4. Weights of each part/section
5. Specific procedures to be used to score
the test
6. An explanation of how test results will be
reported to students.
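The items in a test specification can be written down as a small data structure, which makes it easy to check that the section weights add up before the test is built. Everything in this sketch (section names, item counts, weights) is hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Section:
    skill: str      # skill/ability assessed
    item_type: str  # task or item type
    n_items: int    # number of items in the section
    weight: int     # share of the total score, in percent


# Hypothetical specification for a four-part test.
spec = [
    Section("Listening", "MCQ", 20, 25),
    Section("Reading", "MCQ", 20, 25),
    Section("Grammar and vocabulary", "cloze", 20, 20),
    Section("Writing", "essay scored with a rubric", 1, 30),
]


def weights_are_complete(sections):
    """True when the section weights add up to exactly 100 percent."""
    return sum(s.weight for s in sections) == 100


def weighted_score(sections, fractions_correct):
    """Total out of 100, given the fraction correct (0-1) for each section in order."""
    return sum(s.weight * f for s, f in zip(sections, fractions_correct))


print(weights_are_complete(spec), weighted_score(spec, [0.8, 0.6, 0.5, 0.7]))
```

The scoring procedure and the way results are reported to students would still need to be described in words alongside such a table.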
4. DESIGN/DEVISE OR SELECT TEST ITEMS:
Example:
Considerations:
A. Using multiple-choice items:
- Drawback of MCQs:
+ the technique tests only recognition knowledge.
+ guessing may have a considerable effect on test scores.
+ the technique severely restricts what can be tested.
+ successful items are difficult to write.
+ beneficial washback may be minimal.
+ cheating may be facilitated.
- Strengths of MCQs
+ Practicality
+ Reliability
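The effect of guessing on MCQ scores can be estimated with simple arithmetic. The sketch below shows the score expected from blind guessing and the classical correction-for-guessing formula (rights minus wrongs divided by options minus one); the correction formula is a standard one from the testing literature, not something these notes prescribe, and the numbers are invented.

```python
def expected_chance_score(n_items, n_options):
    """Average number of items answered correctly by blind guessing alone."""
    return n_items / n_options


def corrected_score(right, wrong, n_options):
    """Classical correction for guessing: R - W / (k - 1); omitted items are ignored."""
    return right - wrong / (n_options - 1)


# Hypothetical 20-item test with 4 options per item.
print(expected_chance_score(20, 4))         # score a pure guesser would average
print(corrected_score(14, 6, n_options=4))  # 14 right, 6 wrong
```

On a 20-item, 4-option test a quarter of the marks can come from chance alone, which is why short MCQ tests give weak evidence about ability.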
B. Design each item to measure a single objective
Food for thought:
1. Test single items
2. Include distractors
3. Provide sufficient evidence, not
WORDY.
C. Pay attention to IF, ID
IF: item facility: the extent to which an item is easy or difficult for the proposed group
of test-takers.
- Too easy (99% of respondents get it right)
- Too difficult (99% of respondents get it wrong)
→ Such items do not separate high-ability and low-ability test-takers.
ID: item discrimination: the extent to which an item differentiates between high and
low-ability test-takers.
Item with good discrimination power:
- Correct responses from most of the high-ability group
- Incorrect responses from most of the low-ability group.
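Both indices can be computed from a simple record of who answered an item correctly. In this sketch, IF is the proportion of test-takers who got the item right, and ID is the difference between the number correct in the high-ability and low-ability groups, divided by the size of one group; equal-sized groups and the sample responses are assumptions made for illustration.

```python
def item_facility(responses):
    """IF: proportion of test-takers who answered the item correctly (0 to 1)."""
    return sum(responses) / len(responses)


def item_discrimination(high_group, low_group):
    """ID: (correct in high group - correct in low group) / size of one group."""
    if len(high_group) != len(low_group):
        raise ValueError("groups must be the same size")
    return (sum(high_group) - sum(low_group)) / len(high_group)


# Hypothetical responses to one item (True = correct), 10 test-takers per group.
high = [True] * 7 + [False] * 3  # 7 of the 10 strongest test-takers got it right
low = [True] * 2 + [False] * 8   # 2 of the 10 weakest test-takers got it right

print(item_facility(high + low))       # 9 correct out of 20
print(item_discrimination(high, low))  # (7 - 2) / 10
```

An ID near 0 means the item does not separate the groups, and a negative ID means weaker test-takers did better on it than stronger ones, which usually signals a flawed item.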
5. ADMINISTERING THE TEST
A. Pre-test consideration:
1. The condition for the test (time limits, no
portable electronics, breaks, etc.)
2. Material that students should bring with
them
3. The kinds of items that will be on the test
4. Suggested strategies for optimal
performance
5. Evaluation criteria (rubrics)
- Offer a view of components
- Give students a chance to ask any questions and provide responses.
B. Test administration (onsite)
- Arrive early and check classroom conditions (lighting, temperature, a clock, furniture
arrangement, etc.)
- If audio or video or other technology is needed for administration, try everything out
in advance.
- Have extra paper, writing instruments, or other response materials on hand.
- Start on time
- Distribute the test in a fair/appropriate manner
- Remain quietly seated at the teacher’s desk, available for questions from students as
they proceed.
- For a timed test, warn students when the time is about to run out, and encourage them
to complete their work.
Make a checklist
Planning required.
C. Providing feedback
- Consider making adjustment …
WEEK 7: CHAPTER 4
1. None
2. Role of standards in standardized tests
3. Standards-based education