Evaluating User Interface
Evaluation
Conceptual design/
formal design
Evaluation
• Evaluation
• tests usability and functionality of system
• occurs in laboratory, field and/or in collaboration with
users
• evaluates both design and implementation
• should be considered at all stages in the design life cycle
Evaluation
• Concerned with gathering data about the usability
of
• a design or product
• by a specific group of users
• for a particular activity
• in a specified environment or work context
• Informal feedback …… controlled lab
experiments
Goals of Evaluation
• assess extent of system functionality
• Comparing designs
• compare with competitors or among design options
• Later to
• identify user difficulties / fine tune
• improve an upgrade of product
Case Study: 1984 Olympic
Messaging System
• Voice mail for 10,000 athletes in LA -> was successful
• Kiosks placed around Olympic village -- 12 languages
• Approach to design (user-centered design)
• printed scenarios of UI prepared, comments obtained from designers,
management, prospective users -> functions altered, dropped
• produced brief user guides, tested on Olympians, families & friends, 200+
iterations before final form decided
• early simulations constructed, tested with users --> need for ‘undo’
• toured Olympic village sites, early demos, interviews with people
involved in Olympics, ex-Olympian on the design team -> early prototype
-> more iterations and testing
Case Study: 1984 Olympic
Messaging System
• Approach to design (continued)
• “Hallway” method: -- put prototype in hallway, collect opinions on height
and layout from people who walk past
• “Try to destroy it” method -- CS students invited to test robustness by
trying to “crash” it
• Principles of User-Centered Design:
• focus on users & tasks early in design process
• measure reactions using prototype manuals, interfaces, simulations
• design iteratively
• usability factors must evolve together
Case Study: Air Traffic Control
• UK, 1991
• Original system -- data in variety of formats
• analog and digital dials
• CCTV, paper, books
• some line of sight, others on desks or ceiling mountings outside
view
• Goal: integrated display system, as much info as practical
on common displays
• Major concern: safety
Air Traffic Control, continued
• Evaluate controller’s task
• want key info sources on one workstation (windspeed, direction, time,
runway use, visual range, meteorological data, maps, special procedures)
• Develop first-cut design (London City airport, then Heathrow)
• Establish user-systems design group
• Concept testing / user feedback
• modify info requirements
• different layouts for different controllers and tasks
• greater use of color for exceptional situations and different lighting conditions
• ability to make own pages for specific local conditions
• simple editing facilities for rapid updates
ATC, continued
• Produce upgraded prototype
• “Road Show” to five airports
• Develop system specification
• Build and Install system
• Heathrow, 1989
• other airports, 1991
• Establish new needs
Case Study: Forte Travelodge
• System goal: more efficient central room booking
• IBM Usability Evaluation Centre, London
• Evaluation goals:
• identify and eliminate problems before going live
• avoid business difficulties during implementation
• ensure system easy to use by inexperienced staff
• develop improved training material and documentation
The Usability Lab
• Similar to TV studio: microphones, audio, video,
one-way mirror
Particular aspects of interest
• System navigation, speed of use
• screen design: ease of use, clarity, efficiency
• effectiveness of onscreen help and error messages
• complexity of keyboard for computer novices
• effectiveness of training program
• clarity and ease-of-use of documentation
Procedure
• Developed set of 15 common scenarios, enacted by
cross-section of staff
• eight half-day sessions, several scenarios per session
• emphasize that evaluation is of system not staff
• video cameras operated by remote control
• debriefing sessions after each testing period, get info
about problems and feelings about system and document
these
Results:
• Operators and staff had received useful training
• 62 usability failures identified
tifi d
• Priority given to:
• speed of navigation through system
• problems with titles and screen formats
• operators unable to find key points in doc
• need to redesign telephone headsets
• uncomfortable furniture
• New system: higher productivity, low turnover, faster
booking, greater customer satisfaction
Evaluation Methods
• Observing and monitoring usage
• field or lab
• observer takes notes / video
• keystroke logging / interaction logging
• Collecting users’ opinions
• interviews / surveys
• Experiments and benchmarking
• semi-scientific approach (can’t control all variables, size of
sample)
Evaluation Methods
• Interpretive Evaluation
• informal, try not to disturb user; user participation common
• includes participatory evaluation, contextual evaluation
• Predictive Evaluation
• predict problems users will encounter without actually testing
the system with the users
• keystroke analysis or expert review based on specification,
mock-up, low-level prototype
• Pilot Study for all types!! -- small study before main study to work
out problems with experiment itself
• Human Subjects concerns --
Usage Data: Observations,
Monitoring, Users’ Opinions
• Observing users
• Verbal protocols
• Software logging
• Users’ opinions: Interviews and Questionnaires
Direct Observation
• Difficulties:
• people “see what they want to see”
• “Hawthorne effect” -- users aware that performance is monitored,
altering behavior and performance levels
• single pass / record of observation usually incomplete
• Performance-based analysis
• obtain clearly defined performance measures from the data
collected (frequency of task completion, task timing, use of
commands, frequency of errors, time for cognitive tasks)
• classification of errors
• repeatability of study
• time (5:1) -- tools can help
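The performance measures named above (task completion, task timing, error frequency) can be computed from observation records along these lines. This is a minimal sketch: the record layout, the participant IDs, the task name and all numbers are hypothetical and are not data from any study in these notes.

```python
from statistics import mean

# Hypothetical observation records:
# (participant, task, seconds_taken, error_count, completed)
sessions = [
    ("P1", "book_room", 42.0, 1, True),
    ("P2", "book_room", 55.5, 3, True),
    ("P3", "book_room", 61.0, 2, False),
    ("P4", "book_room", 38.5, 0, True),
]


def performance_measures(records):
    """Summarize completion rate, mean time of completed attempts, and errors per attempt."""
    completed = [r for r in records if r[4]]
    return {
        "completion_rate": len(completed) / len(records),
        "mean_time_completed_s": mean(r[2] for r in completed),
        "errors_per_attempt": sum(r[3] for r in records) / len(records),
    }


measures = performance_measures(sessions)
print(measures)
```

Defining the measures before the session starts keeps the analysis repeatable, since a second observer can apply the same function to a new set of records.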
Verbal protocols
• User’s spoken observations, provides info on:
• what user planned to do
• user’s identification of menu names or icons for controlling the
system
• reactions when things go wrong, tone of voice, subjective feelings
about activity
• “Think aloud protocol” -- user says out loud what he is thinking while
working on a task or problem-solving
• Post-Event protocols -- users view videos of their actions and provide
commentary on what they were trying to do
Think Aloud
• user observed performing task
• user asked to describe what he is doing and why, what he
thinks is happening, etc.
• Advantages
• simplicity - requires little expertise
• can provide useful insight
• can show how system is actually used
• Disadvantages
• subjective
• selective
• act of describing may alter task performance
Software Logging
• Researcher need not be present
• part of data analysis process automated
• Time-stamped keypresses
• Interaction logging -- recording made in real time and can
be replayed in real time so evaluator can see interaction
as it happened
• Neal & Simons playback system -- researcher adds own
comments to timestamped log
• Remaining problems: expense, volume
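A time-stamped interaction log with evaluator comments could be sketched as follows. The class name `InteractionLog`, its methods and the event kinds are assumptions made for illustration; this is not the Neal & Simons playback system, only a small model of the idea that user events and researcher comments share one timeline.

```python
import time


class InteractionLog:
    """Records time-stamped events and lets an evaluator annotate the same timeline."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._start = clock()
        self.events = []  # list of (seconds_since_start, kind, detail)

    def record(self, kind, detail):
        self.events.append((self._clock() - self._start, kind, detail))

    def annotate(self, comment):
        # Evaluator comments are stored alongside the user's events.
        self.record("comment", comment)

    def gaps(self):
        """Return the pause in seconds between each pair of consecutive events."""
        times = [t for t, _, _ in self.events]
        return [later - earlier for earlier, later in zip(times, times[1:])]


# A fake clock makes the example deterministic; real use would keep the default.
ticks = iter([0.0, 0.5, 0.9, 4.0])
log = InteractionLog(clock=lambda: next(ticks))
log.record("key", "d")
log.record("key", "Enter")
log.annotate("long pause before next action")
print(log.gaps())
```

Long gaps between events are one simple signal an evaluator might look for in a large log, which helps with the volume problem noted above.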
Protocol analysis
• paper and pencil – cheap, limited to writing speed
• audio – good for think aloud, difficult to match with other protocols
• video – accurate and realistic, needs special equipment, obtrusive
• computer logging – automatic and unobtrusive, large amounts of data
difficult to analyze
• user notebooks – coarse and subjective, useful insights, good for
longitudinal studies
            Yes   No    Maybe
DUPLICATE   [ ]   [ ]   [ ]
PASTE       [ ]   [ ]   [ ]
Closed question -- six-point
scale
Rate the usefulness of the DUPLICATE command on the
following scale:
very useful |____|____|____|____|____|____| of no use
Closed question - Likert scale
Computers can simplify complex problems
|________|________|________|________|________|________|________|
strongly  agree   slightly  neutral slightly disagree strongly
 agree             agree            disagree          disagree
Closed question - semantic
differential
Rate the Beauxarts drawing package on the
following dimensions:
___ PASTE
___ DUPLICATE
___ GROUP
___ CLEAR
Questionnaires
• Responses converted to numerical values
• Statistical analysis performed (mean, standard deviation; SPSS
often used if more statistical detail required)
• Increase chances of respondents completing and
returning:
• short
• small fee or token
• send copy of report
• stamped, self-addressed envelope
• Pre- / post- questionnaires
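Converting scale responses to numbers and summarizing them might look like the sketch below. The 7-to-1 scoring direction and the sample responses are assumptions for illustration; the labels follow the seven-point agreement scale shown earlier.

```python
from statistics import mean, stdev

# Assumed coding: higher number means stronger agreement.
LIKERT_7 = {
    "strongly agree": 7,
    "agree": 6,
    "slightly agree": 5,
    "neutral": 4,
    "slightly disagree": 3,
    "disagree": 2,
    "strongly disagree": 1,
}

# Hypothetical answers to one questionnaire item.
responses = ["agree", "strongly agree", "neutral", "slightly agree", "agree"]

scores = [LIKERT_7[r] for r in responses]
summary = {
    "n": len(scores),
    "mean": mean(scores),
    "std_dev": stdev(scores),  # sample standard deviation
}
print(summary)
```

Whether a mean is meaningful for ordinal scale data is a judgment call; reporting the median or the full distribution alongside it is a common precaution.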
Questionnaire on User Interaction Satisfaction
(QUIS)
• Styles of question
• general
• open-ended
• scalar
• multi-choice
• ranked
How to write a good survey
• Write a short questionnaire
• what is essential to know? what would be useful to know? what
would be unnecessary?
• Use simple words
• Don’t: "What is the frequency of your automotive travel to your
parents' residence in the last 30 days?"
• Do: "About how many times have you driven to your parents'
home in the last 30 days?"
How to write a good survey
• Relax your grammar
• if the questions sound too formal.
• For example, the word "who" is appropriate in many instances
when "whom" is technically correct.
• Assure a common understanding
• Write questions that everyone will understand in the same way.
Don't assume that everyone has the same understanding of the
facts or a common basis of knowledge. Identify even commonly
used abbreviations to be certain that everyone understands.
d t d
How to write a good survey
• Start with interesting questions
• Start the survey with questions that are likely to sound interesting and
attract the respondents' attention.
• Save the questions that might be difficult or threatening for later.
• Voicing questions in the third person can be less threatening than
questions voiced in the second person.
• Don't write leading questions
• Leading questions demand a specific response. For example: the
question "Which day of the month is best for the newly established
company-wide monthly meeting?" leads respondents to pick a date
without first determining if they even want another meeting.
How to write a good survey
• Avoid double negatives
• Respondents can easily be confused deciphering the
meaning of a question that uses two negative words.
• Balance rating scales
• When the question requires respondents to use a
rating scale, mediate the scale so that there is room for
both extremes.
How to write a good survey
• Don't make the list of choices too long
• If the list of answer categories is long and unfamiliar, it
is difficult for respondents to evaluate all of them. Keep
the list of choices short.
• Avoid difficult concepts
• Some questions involve concepts that are difficult for
many people to understand.
How to write a good survey
• Avoid difficult recall questions
• People's memories are increasingly unreliable as you ask them to recall
events farther and farther back in time. You will get more accurate
information from people if you ask about the recent past (past month)
versus the more distant past (last year).
• emphasizes usefulness of findings to the people concerned, rather
than statement of goals, objective tests, research reports
• good for feasibility study, design feedback, post-implementation
review
Interpretive Evaluation
• Experimental: Formal and objective
• Interpretive: More subjective
• Concerned with humans, so no objective reality
• Sociological, anthropological approach
• Quantitative
• How often was something done, what per cent of the time did
something occur, how many different …
Predictive Evaluation
• Predict aspects of usage rather than observe and
measure
• doesn’t involve users
• cheaper
Why Predictive Evaluation
• User testing is expensive and time consuming, and
requires a prototype
• Predictive techniques use expertise of human-computer
interaction specialists (in person or via heuristics or
models they develop) to identify usability problems
without testing or (in some cases) prototypes
Predictive Evaluation Methods
• Inspection Methods
• Standards inspections
• Consistency inspection
• Heuristic evaluation
• Walkthroughs
• Example heuristics
• system behaviour is predictable
• system behaviour is consistent
• feedback is provided
• Requires
• specification of system functionality
• task analysis, breakdown of each task into its
components
Keystroke-level modeling
• Time to execute is the sum of:
• Tk - keystroking (0.35 sec)
• Tp - pointing (1.10)
• Td - drawing (problem-dependent)
• Tm - mental (1.35)
• Th - homing (0.4)
• Tr - system response (1.2)
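A worked estimate using the operator times listed above can be sketched as follows. The task sequence is hypothetical, the single-letter operator codes are shorthand introduced here, and counting a mouse-button press as one keystroke (K) is an assumption of this sketch. Td is passed in separately because it is problem-dependent.

```python
# Operator times in seconds, taken from the list above.
OPERATOR_TIMES = {
    "K": 0.35,  # Tk - keystroking
    "P": 1.10,  # Tp - pointing
    "M": 1.35,  # Tm - mental
    "H": 0.40,  # Th - homing
    "R": 1.20,  # Tr - system response
}


def klm_time(operators, drawing_time=0.0):
    """Sum the operator times for a sequence such as 'HMPK', plus any drawing time (Td)."""
    return sum(OPERATOR_TIMES[op] for op in operators) + drawing_time


# Hypothetical task: home to mouse, think, point at a word, click,
# home back to keyboard, think, type five characters.
task = "HMPK" + "HM" + "K" * 5
estimate = klm_time(task)
print(f"{estimate:.2f} s")  # 3.20 + 1.75 + 1.75 = 6.70 s
```

Because the estimate is a plain sum, two candidate designs can be compared by writing out their operator sequences for the same task and comparing totals, without building either one.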
Keystroke Modeling Example
[Table: selection schemes A-G by mouse button; only fragments survive: "Button3: W, S, P" and "Drwthru"]
Methodology
• Between-subjects paradigm
• six groups, 4 subjects per group
• in each group: 2 experienced w/mouse, 2 not
• each subject first trained in use of mouse and in editing
techniques in Star w.p. system
• Assigned scheme taught
• Each subject performs 10 text-editing tasks, 6 times each
Results: selection time
Time:
Scheme A: 12.25 s
Scheme B: 15.19 s
Scheme C: 13.41 s
Scheme D: 13.44 s
Scheme E: 12.85 s
Scheme F: 9.89 s
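The reported times can be ranked with a few lines of code. The numbers are the ones listed above; the variable names are illustrative, and the sketch says nothing about statistical significance, which would need the per-subject data.

```python
# Selection times in seconds for schemes A-F, as reported above.
selection_time_s = {
    "A": 12.25, "B": 15.19, "C": 13.41,
    "D": 13.44, "E": 12.85, "F": 9.89,
}

# Sort schemes from fastest to slowest.
ranked = sorted(selection_time_s.items(), key=lambda item: item[1])
fastest, slowest = ranked[0], ranked[-1]
spread = slowest[1] - fastest[1]

for scheme, seconds in ranked:
    print(f"Scheme {scheme}: {seconds:.2f} s")
print(f"Spread between fastest and slowest: {spread:.2f} s")
```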
Results: Selection Errors
• Average: 1 selection error per four tasks
• 65% of errors were drawthrough errors, same
across all selection schemes
• 20% of errors were “too many clicks”, schemes
with less clicking better
• 15% of errors were “click wrong mouse button”,
schemes with fewer buttons better
Selection scheme: test 2
• Results of test 1 lead to conclusion to avoid:
• drawthroughs
• three buttons
• multiple clicking
• Scheme “G” introduced -- avoids drawthrough, uses only
2 buttons
• New test, but test groups were 3:1 experienced w/mouse
to not
Results of test 2
• Mean selection time: 7.96 s for scheme G,
frequency of “too many clicks” stayed about the
same
• Conclusion: scheme G acceptable
• selection time shorter
• advantage of quick selection balances moderate error
rate of multi-clicking
Experimental design - concerns
• What to change? What to keep constant? What
to measure?
• Hypothesis, stated in a way that can be tested.
• Statistical tests: which ones, why?
Selecting subjects - avoiding
bias
• Age bias -- Cover target age range
• Gender bias -- equal numbers of male/female
• Experience bias -- similar level of experience with
computers
• etc. ...
Experimental Designs
• Independent subject design
• single group of subjects allocated randomly to each of the
experimental conditions
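Random allocation of one pool of subjects to conditions might be sketched like this. The function name `allocate`, the subject IDs and the fixed seed are assumptions for illustration; the 24 subjects and six conditions mirror the group sizes in the selection-scheme study above.

```python
import random


def allocate(subjects, conditions, seed=None):
    """Shuffle subjects, then deal them round-robin into equal-sized condition groups."""
    rng = random.Random(seed)
    shuffled = list(subjects)
    rng.shuffle(shuffled)
    groups = {condition: [] for condition in conditions}
    for index, subject in enumerate(shuffled):
        groups[conditions[index % len(conditions)]].append(subject)
    return groups


subjects = [f"S{i:02d}" for i in range(1, 25)]  # 24 hypothetical subject IDs
groups = allocate(subjects, list("ABCDEF"), seed=1)
for condition, members in groups.items():
    print(condition, members)
```

Plain shuffling does not balance experience levels across groups. A study that needs, say, two experienced and two inexperienced mouse users per group would shuffle and deal each experience stratum separately.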