PA198 Augmented Reality Interfaces
Lecture 9: Evaluating Augmented Reality Interfaces
Fotis Liarokapis
liarokap@fi.muni.cz
15th October 2019

Introduction

Evaluating User Interfaces
• Assess the effect of the interface on user performance and satisfaction
• Identify specific usability problems
• Evaluate users' access to the functionality of the system
• Compare alternative systems/designs

Major Parameters
• The major parameters in user interface evaluation activities are:
  – Stage of the design
  – Inspection methods vs. usability testing
  – Formative vs. summative
http://www.topdesignmag.com/navigate-usability-evaluation/

Influence of the Parameters
• These parameters influence:
  – How the design is represented to evaluators
  – Documents/deliverables required
  – Need for resources (personnel, equipment, lab)
  – Methodology
    • For data gathering
    • For analysis of results

Methodologies for Data Gathering
• Structured inspection
• Interviews
• Focus groups
• Questionnaires
• Field studies
• Controlled experiments
  – Quantitative metrics
  – Thinking aloud, cooperative evaluation

Evaluating User Interface Designs
• Stage of the design process
  – Early design (prototype)
  – Intermediate
  – Full design
  – After deployment
• Evaluation should be done throughout the usability life cycle, not just at the end
  – Called iterative design
• Different evaluation methods are appropriate at different stages of the cycle

Evaluating User Interface Designs (cont.)
• Inspection methods: Cognitive Walkthrough, Heuristic Evaluation, Guidelines Review
• Usability testing: Field Study, Laboratory Experiment

Formative vs. Summative Evaluation
• Formative evaluation
  – Identify usability problems
    • Qualitative measures
    • Ethnographic methods
• Summative evaluation
  – Measure/compare user performance
    • Quantitative measures
    • Statistical methods

Participatory or User-centered Design
• Users are active members of the design team
• Characteristics
  – Context and task oriented rather than system oriented
  – Collaborative
  – Iterative
• Methods
  – Brainstorming ("focus groups")
  – Storyboarding
  – Workshops
  – Pencil and paper exercises

Cognitive Walkthrough
• Evaluates the design on how well it supports the user in learning a task
• Usually performed by an expert in cognitive psychology
• The expert 'walks through' the design to identify potential problems using psychological principles
• Scenarios may be used to guide the analysis

Cognitive Walkthrough (cont.)
• For each task, the walkthrough considers:
  – What impact will the interaction have on the user?
  – What cognitive processes are required?
  – What learning problems may occur?
• Analysis focuses on the user's goals and knowledge
  – Does the design lead the user to generate the correct goals?

Cognitive Walkthrough Video
https://www.youtube.com/watch?v=bzvQY68lm8c

Heuristic Evaluation
• Usability criteria (heuristics) are identified
• The design is examined by experts to see if these are violated
• Example heuristics
  – System behavior is consistent
  – Feedback is provided
• Heuristic evaluation debugs the design

Heuristic Evaluation Video
https://www.youtube.com/watch?v=lkbBc4aF5FA

Guidelines Inspection
• A usability group should have a designated inspector!
• Written guidelines are recommended for larger projects:
  – Screen layout
  – Appearance of objects
  – Terminology
  – Wording of prompts and error messages
  – Menus
  – Direct manipulation actions and feedback
  – On-line help and other documentation
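Findings from a heuristic evaluation or guidelines inspection are typically logged as individual violations and then aggregated per heuristic. A minimal sketch of one possible way to record and summarize such findings; the data structure, severity scale and example entries are illustrative assumptions, not part of the lecture material:

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class Violation:
    heuristic: str   # which heuristic/guideline was violated
    location: str    # where in the UI the problem was observed
    severity: int    # e.g. 1 = cosmetic ... 4 = catastrophic (illustrative scale)
    note: str        # evaluator's description

# Hypothetical findings recorded during an inspection
findings = [
    Violation("Consistency", "Settings screen", 3, "Different labels for the same action"),
    Violation("Feedback", "Save button", 2, "No confirmation that the file was saved"),
    Violation("Feedback", "Marker tracking", 4, "No indication when tracking is lost"),
]

# Aggregate: number of violations per heuristic and the worst severity seen
per_heuristic = Counter(f.heuristic for f in findings)
worst = {h: max(f.severity for f in findings if f.heuristic == h) for h in per_heuristic}

for heuristic, count in per_heuristic.items():
    print(f"{heuristic}: {count} violation(s), worst severity {worst[heuristic]}")
```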
Usability Experiment

What is a Usability Experiment?
• Usability testing in a controlled environment
  – There is a test set of users
  – They perform pre-specified tasks
  – Data is collected (quantitative and qualitative)
  – Take the mean and/or median value of the measured attributes
  – Compare to a goal or to another system
• Contrasted with expert review and field study evaluation methodologies
• Note the growth of usability groups and usability laboratories

Experimental Factors
• Subjects
  – Representative
  – Sufficient sample
• Variables
  – Independent variable (IV)
    • Characteristic changed to produce different conditions
    • e.g. interface style, number of menu items
  – Dependent variable (DV)
    • Characteristic measured in the experiment
    • e.g. time to perform a task, number of errors

Experimental Factors (cont.)
• Hypothesis
  – Prediction of the outcome framed in terms of the IV and DV
  – Null hypothesis: states that there is no difference between conditions; the aim is to disprove this
• Experimental design
  – Within-groups design
  – Between-groups design

Independent Variables
• The hypothesis includes the independent variables that are to be altered
  – The things you manipulate independently of a subject's behaviour
  – Determine a modification to the conditions the subjects undergo
  – May arise from subjects being classified into different groups

Dependent Variables
• The hypothesis includes the dependent variables that will be measured
  – Variables dependent on the subject's behaviour / reaction to the independent variable
  – The specific things you set out to quantitatively measure / observe

Independent and Dependent Variables
https://www.youtube.com/watch?v=aeH1FzqdQZ0

Within-Groups Design
• Each subject performs the experiment under each condition
• Advantages
  – Fewer subjects needed
  – Less likely to suffer from user variation
• Disadvantages
  – Transfer of learning possible

Between-Groups Design
• Each subject performs under only one condition
• Advantages
  – No transfer of learning
• Disadvantages
  – More subjects required (therefore more costly)
  – User variation can bias results

How Many Test Users?
• Problems-found(i) = N * (1 - (1 - λ)^i)
  – i = number of test users
  – N = number of existing problems
  – λ = probability of finding a single problem with a single user
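A minimal sketch of the problems-found formula above, estimating how many of N existing problems we would expect to uncover with i test users. The example values for N and λ are illustrative assumptions; λ ≈ 0.31 is a commonly quoted average, but the right value depends on the system and the users:

```python
def problems_found(i: int, N: int, lam: float) -> float:
    """Expected number of distinct usability problems found with i test users.

    N   = number of existing problems
    lam = probability that a single user exposes a single problem
    """
    return N * (1 - (1 - lam) ** i)

# Illustrative assumption: 40 existing problems, lambda = 0.31
N, lam = 40, 0.31
for i in (1, 3, 5, 10, 15):
    found = problems_found(i, N, lam)
    print(f"{i:2d} users -> ~{found:4.1f} problems ({found / N:.0%} of the total)")
```

With these illustrative numbers, five users already uncover around 85% of the problems, which is the usual argument for running several small iterative tests rather than one large one.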
Data Collection Techniques
• Paper and pencil
  – Cheap, but limited to writing speed
• Audio
  – Good for think-aloud; difficult to match with other protocols
• Video
  – Accurate and realistic, but needs special equipment and is obtrusive
• Computer logging
  – Automatic and unobtrusive
  – Large amounts of data are difficult to analyze

Data Collection Techniques (cont.)
• User notebooks
  – Coarse and subjective, but can give useful insights
  – Good for longitudinal studies
• Brain logging
  – A more difficult technique

Summative Evaluation
• What to measure?
  – Total task time
  – User "think time" (dead time?)
  – Time spent not moving toward the goal
  – Ratio of successful actions to errors
  – Commands used/not used
  – Frequency of user expressions of:
    • Confusion, frustration, satisfaction
  – Frequency of reference to manuals/help system
  – Percentage of time such reference provided the needed answer

Measuring User Performance
• Measuring learnability
  – Time to complete a set of tasks by a novice
  – Learnability/efficiency trade-off
• Measuring efficiency
  – Time to complete a set of tasks by an expert
  – How to define and locate 'experienced' users
• Measuring memorability
  – The most difficult, since 'casual' users are hard to find for experiments
  – Memory quizzes may be misleading

Measuring User Performance (cont.)
• Measuring user satisfaction
  – Likert scale (agree or disagree)
  – Semantic differential scale
  – Physiological measures of stress
  – EEG measures
• Measuring errors
  – Classification of minor vs. serious
  – Removing noise

Reliability and Validity
• Reliability means repeatability
  – Statistical significance is a measure of reliability
  – Difficult to achieve because of high variability in individual user performance
• Validity means that the results will transfer to a real-life situation
  – Depends on matching the users, task and environment
  – Difficult to achieve because real-world users, environments and tasks are difficult to duplicate in the laboratory

Formative Evaluation
• What is a usability problem?
  – Unclear
    • The planned method for using the system is not readily understood or remembered (task, mechanism, visual)
  – Error-prone
    • The design leads users to stray from the correct operation of the system (task, mechanism, visual)

Formative Evaluation (cont.)
• What is a usability problem?
  – Mechanism overhead
    • The mechanism design creates awkward workflow patterns that slow down or distract users
  – Environment clash
    • The design of the system does not fit well with the users' overall work processes (task, mechanism, visual)
    • e.g. an incomplete transaction cannot be saved

Formative vs Summative
https://www.youtube.com/watch?v=bTGnJnuVNt8

Methods

Qualitative Methods for Collecting Usability Problems
• Thinking-aloud method and related alternatives:
  – Constructive interaction, coaching method, retrospective walkthrough
• Output: notes on what users did and expressed:
  – Goals, confusions or misunderstandings, errors, reactions expressed
• Questionnaires
  – Focus groups, interviews

Observational Methods - Think Aloud
• The user is observed performing a task
  – The user is asked to describe what he is doing and why, what he thinks is happening, etc.
• Advantages
  – Simplicity: requires little expertise
  – Can provide useful insight
  – Can show how the system is actually used
• Disadvantages
  – Subjective
  – Difficult to conduct
  – The act of describing may alter task performance

Observational Methods - Cooperative Evaluation
• Variation on think aloud
• The user collaborates in the evaluation
• Both user and evaluator can ask each other questions throughout
• Additional advantages
  – Less constrained and easier to use
  – The user is encouraged to criticize the system
  – Clarification is possible

Observational Methods
• Post-task walkthrough
  – The user reflects on their actions after the event
  – Used to fill in intention
• Advantages
  – The analyst has time to focus on relevant incidents
  – Avoids excessive interruption of the task
• Disadvantages
  – Lack of freshness
  – May be a post-hoc interpretation of events

Query Techniques - Interviews
• The analyst questions the user on a one-to-one basis
• Usually based on prepared questions
• Informal, subjective and relatively cheap
• Advantages
  – Can be varied to suit the context
  – Issues can be explored more fully
  – Can elicit user views and identify unanticipated problems
• Disadvantages
  – Very subjective
  – Time consuming

Query Techniques - Questionnaires
• Set of fixed questions given to users
• Advantages
  – Quick and reaches a large user group
  – Can be analyzed quantitatively
• Disadvantages
  – Less flexible
  – Less probing

Query Techniques - Questionnaires (cont.)
• Need careful design
  – What information is required?
  – How are answers to be analyzed?
• Should be pilot tested for usability!
• Styles of question
  – General
  – Open-ended
  – Scalar
  – Multi-choice
  – Ranked
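Scalar (Likert-style) items are what make the quantitative analysis mentioned above possible. A minimal sketch of how responses might be aggregated per question; the questions, the 1-5 scale and the response values are illustrative assumptions:

```python
from statistics import mean, median

# Hypothetical responses on a 1-5 agreement scale (1 = strongly disagree, 5 = strongly agree)
responses = {
    "The AR overlay was easy to read":         [4, 5, 3, 4, 4, 2, 5],
    "I always knew what the system was doing": [2, 3, 2, 4, 3, 2, 3],
}

for question, scores in responses.items():
    agree = sum(s >= 4 for s in scores) / len(scores)  # share of 4s and 5s
    print(f"{question}: n={len(scores)}, mean={mean(scores):.2f}, "
          f"median={median(scores)}, agree={agree:.0%}")
```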
Laboratory Studies: Pros and Cons
• Advantages:
  – Specialist equipment available
  – Uninterrupted environment
• Disadvantages:
  – Lack of context
  – Difficult to observe several users cooperating
• Appropriate
  – If the actual system location is dangerous or impractical, or to allow controlled manipulation of use

Conducting a Usability Experiment

Main Steps
• The planning phase
• The execution phase
• Data collection techniques
• Data analysis

The Planning Phase
• Who, what, where, when and how much?
  – Who are the test users, and how will they be recruited?
  – Who are the experimenters?
  – When, where, and how long will the test take?
  – What equipment/software is needed?
  – How much will the experiment cost?
  – Outline of the test protocol

Outline of Test Protocol
• What tasks?
• Criteria for completion?
• User aids
• What will users be asked to do
  – e.g. thinking-aloud studies
• Interaction with the experimenter
• What data will be collected?

Designing Test Tasks
• Tasks:
  – Are representative
  – Cover the most important parts of the UI
  – Don't take too long to complete
  – Are goal or result oriented (possibly with a scenario)
• Tips:
  – The first task should build confidence
  – The last task should create a sense of accomplishment

Detailed Test Protocol
• All materials to be given to users as part of the test, including a detailed description of the tasks
• Deliverables from the detailed test protocol
  – What test tasks? (written task sheets)
  – What user aids? (written manual)
  – What data collected? (include questionnaire)
  – How will results be analyzed/evaluated? (sample tables/charts)
• Then do a pilot with a few users

Pilot Studies
• A small trial run of the main study
  – Can identify the majority of issues with the interface design
• Pilot studies check:
  – That the evaluation plan is viable
  – That you can conduct the procedure
  – That interview scripts, questionnaires, experiments, etc. work appropriately
• Iron out problems before doing the main study
Billinghurst, M. Evaluating AR Applications, HIT Lab NZ, University of Canterbury

The Execution Phase
• Prepare the environment, materials and software
• The introduction should include:
  – Purpose (evaluating the software)
  – Participation is voluntary and confidential
  – Explain all procedures
    • e.g. recording, question handling
  – Invite questions
• During the experiment
  – Give the user written task description(s), one at a time
  – Only one experimenter should talk
• De-briefing

Ethics of Human Experimentation
• Users feel exposed when using unfamiliar tools and making errors
• Guidelines:
  – Reassure users that individual results will not be revealed
  – Reassure users that they can stop at any time
  – Provide a comfortable environment
  – Don't laugh, and don't refer to users as subjects or guinea pigs
  – Don't volunteer help, but don't allow the user to struggle too long
  – In de-briefing
    • Answer all questions
    • Reveal any deception
    • Thank them for helping

Data Collection
• Pad and paper are the only absolutely necessary data collection tools!
• Observation areas (for other experimenters, developers, customer reps, etc.) should be shown to users
• Videotape (may be overrated): users must sign a release
• Video display capture
• Portable usability labs
• Usability kiosks

Data Analysis
• Before you start to do any statistics:
  – Look at the data
  – Save the original data
• The choice of statistical technique depends on
  – Type of data
  – Information required
• Type of data
  – Discrete: finite number of values
  – Continuous: any value
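As a concrete illustration of the kind of summary the Data Analysis slide above and the Statistics slides below talk about, a minimal sketch that computes mean, median and standard deviation of task completion times for two conditions; the condition names and task times are made-up example data:

```python
from statistics import mean, median, stdev

# Hypothetical task completion times in seconds, one list per interface condition
times = {
    "AR interface": [42.1, 55.3, 47.8, 61.0, 39.5, 50.2],
    "Paper manual": [58.4, 72.9, 66.1, 80.3, 61.7, 69.0],
}

for condition, data in times.items():
    print(f"{condition}: n={len(data)}, mean={mean(data):.1f}s, "
          f"median={median(data):.1f}s, SD={stdev(data):.1f}s")
```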
Statistics
• The mean time to perform a task (or the mean number of errors or of another event type)
• Measures of variance: standard deviation
• For a normal distribution:
  – 1 standard deviation covers ~2/3 of the cases
  – In usability studies:
    • Expert time SD ~33% of the mean
    • Novice time SD ~46% of the mean
    • Error rate SD ~59% of the mean

Statistics (cont.)
• Confidence intervals (the smaller the better)
  – The "true mean" is within ±N of the observed mean, with confidence level (probability) 0.95
• Since the confidence interval gets smaller as the number of users grows:
  – How many test users are required to get a given confidence interval and confidence level?

Testing Usability in the Field
• Direct observation in actual use
  – Discover new uses
  – Take notes, don't help, chat later
• Logging actual use
  – Objective, not intrusive
  – Great for identifying errors
  – Which features are/are not used
  – Privacy concerns
• Bulletin boards and user groups

Testing Usability in the Field (cont.)
• Questionnaires and interviews with real users
  – Ask users to recall critical incidents
  – Questionnaires must be short and easy to return
• Focus groups
  – 6-9 users
  – Skilled moderator with a pre-planned script
  – Computer conferencing
  – Virtual environments
• On-line direct feedback mechanisms
  – Initiated by users
  – May signal a change in user needs
  – Trust but verify

Field Studies: Pros and Cons
• Advantages:
  – Natural environment
  – Context retained (though observation may alter it)
  – Longitudinal studies possible
• Disadvantages:
  – Distractions
  – Noise
• Appropriate:
  – For beta testing
  – Where context is crucial, and for longitudinal studies

Choosing an Evaluation Method
• When in the process
  – Design vs. implementation
• Style of evaluation
  – Laboratory vs. field
• How objective
  – Subjective vs. objective
• Type of measures
  – Qualitative vs. quantitative

Choosing an Evaluation Method (cont.)
• Level of information
  – High level vs. low level
• Level of interference
  – Obtrusive vs. unobtrusive
• Resources available
  – Time
  – Subjects
  – Equipment
  – Expertise

Subjects
• The choice of subjects is critical to the validity of the results of an experiment
  – The subject group should be representative of the expected user population
• In selecting the subjects it is important to consider things such as their
  – Age group, education, skills, culture
  – How does the sample influence the results?
• Report the selection criteria and give relevant demographic information in your publication
Billinghurst, M. Evaluating AR Applications, HIT Lab NZ, University of Canterbury

Subjects (cont.)
• How many participants you need depends on how big the effect you want to measure is
  – Large effects can be detected with smaller samples
    • e.g. a small n is needed to discriminate speed between turtles and rabbits
  – The more participants, the "smoother" the data
  – Central Limit Theorem: as n increases (n > 30) the sample mean approaches a normal distribution
  – Extreme data have less influence (e.g. one sleepy participant does not mess up the results that much)
• For quantitative analysis:
  – Minimum 15-20 or more per group/cell
Billinghurst, M. Evaluating AR Applications, HIT Lab NZ, University of Canterbury

Experimental Measures
Billinghurst, M. Evaluating AR Applications, HIT Lab NZ, University of Canterbury

Evaluators & Problems
Billinghurst, M. Evaluating AR Applications, HIT Lab NZ, University of Canterbury
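The Statistics and Subjects slides above note that the confidence interval around a mean shrinks as more participants are added. A minimal sketch that makes this concrete by computing an approximate 95% confidence interval for a mean task time at different sample sizes; the mean and standard deviation are illustrative assumptions, and 1.96 is the usual normal-approximation multiplier for 95% confidence:

```python
import math

# Illustrative assumptions: observed mean task time of 50 s with SD of 20 s
mean_time, sd = 50.0, 20.0

# Approximate 95% CI half-width = 1.96 * standard error (normal approximation;
# for small n, a t-distribution multiplier would give a slightly wider interval)
for n in (5, 10, 20, 30, 60):
    half_width = 1.96 * sd / math.sqrt(n)
    print(f"n={n:2d}: mean {mean_time:.0f} s, 95% CI approx. "
          f"[{mean_time - half_width:.1f}, {mean_time + half_width:.1f}] s "
          f"(half-width {half_width:.1f} s)")
```

Because the half-width scales with 1/sqrt(n), halving the interval requires roughly four times as many participants, which is why the slides ask how many users are needed for a given confidence interval and confidence level.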
Evaluate AR Apps

Why Evaluate AR Applications?
• To test and compare interfaces, new technologies and interaction techniques
• Test usability
  – Learnability, efficiency, satisfaction, ...
• Get user feedback
• Refine the interface design
• Better understand your end users
Billinghurst, M. Evaluating AR Applications, HIT Lab NZ, University of Canterbury

Types of User Studies in AR
• Perception
• User Performance
• Collaboration
• Usability of Complete Systems
• Brain Analysis

Types of AR User Studies
Billinghurst, M. Evaluating AR Applications, HIT Lab NZ, University of Canterbury

Types of Experimental Measures Used
Billinghurst, M. Evaluating AR Applications, HIT Lab NZ, University of Canterbury

Typical Hardware
• Eye tracking
• HMDs
• Physiological devices

Eye Tracking
• Head-mounted or desk-mounted equipment tracks the position of the eye
• Eye movement reflects the amount of cognitive processing a display requires
• Measurements include
  – Fixations: the eye maintains a stable position; their number and duration indicate the level of difficulty with the display
  – Saccades: rapid eye movements from one point of interest to another
  – Scan paths: moving straight to a target with a short fixation at the target is optimal
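Fixations and saccades are usually not recorded directly; they are derived from raw gaze samples. A minimal sketch of a dispersion-threshold style fixation detector over 2D gaze points, one common approach in eye-tracking analysis; the sampling rate, thresholds and gaze data below are illustrative assumptions:

```python
def detect_fixations(samples, max_dispersion=30.0, min_duration=0.1):
    """Dispersion-threshold fixation detection over (t, x, y) gaze samples.

    t in seconds, x/y in pixels. Returns (centroid_x, centroid_y, duration) tuples.
    """
    fixations = []
    i = 0
    while i < len(samples):
        # Grow an initial window that spans at least min_duration
        j = i
        while j < len(samples) and samples[j][0] - samples[i][0] < min_duration:
            j += 1
        if j >= len(samples):
            break
        window = samples[i:j + 1]
        xs = [s[1] for s in window]
        ys = [s[2] for s in window]
        if (max(xs) - min(xs)) + (max(ys) - min(ys)) <= max_dispersion:
            # Expand the window while the gaze points stay tightly clustered
            while j + 1 < len(samples):
                xs.append(samples[j + 1][1])
                ys.append(samples[j + 1][2])
                if (max(xs) - min(xs)) + (max(ys) - min(ys)) > max_dispersion:
                    xs.pop()
                    ys.pop()
                    break
                j += 1
            duration = samples[j][0] - samples[i][0]
            fixations.append((sum(xs) / len(xs), sum(ys) / len(ys), duration))
            i = j + 1
        else:
            i += 1
    return fixations

# Hypothetical 60 Hz gaze data: a fixation near (100, 100), then a saccade to (300, 200)
gaze = [(k / 60.0, 100 + (k % 3), 100 + (k % 2)) for k in range(20)] + \
       [(k / 60.0, 300 + (k % 3), 200 + (k % 2)) for k in range(20, 40)]
for fx, fy, dur in detect_fixations(gaze):
    print(f"fixation at ({fx:.0f}, {fy:.0f}) lasting {dur * 1000:.0f} ms")
```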
Physiological Measurements
• Emotional responses are linked to physical changes
  – May help determine a user's reaction to an interface
• Measurements include:
  – Heart activity, including blood pressure, volume and pulse
  – Activity of the sweat glands: Galvanic Skin Response (GSR)
  – Electrical activity in muscle: electromyogram (EMG)
  – Electrical activity in the brain: electroencephalogram (EEG)
• There is some difficulty in interpreting these physiological responses
  – More research is needed

Survey of AR Papers
• Edward Swan (2005)
  – Surveyed major conferences/journals (1992-2004)
    • Presence, ISMAR, ISWC, IEEE VR
• Summary
  – 1104 total papers
  – 266 AR papers
  – 38 AR HCI papers (interaction)
  – 21 AR user studies
• Only 21 of the 266 AR papers had a formal user study
  – Less than 8% of all AR papers
Billinghurst, M. Evaluating AR Applications, HIT Lab NZ, University of Canterbury

AR User Evaluations

Perceptual Evaluation of PhotoRealism AR
https://www.youtube.com/watch?v=lrtUZKl9v34

User Experiences with AR Mobile Navigation
https://www.youtube.com/watch?v=qoOMDP2uHq0

Some Questionnaires

Immersion
• In a virtual environment (VE), immersion, defined in technical terms, is what is capable of producing in the user a sensation of presence, the sensation of being there (being part of the VE) (IJsselsteijn & Riva, 2003)
IJsselsteijn, W.A., Riva, G. Being there: the experience of presence in mediated environments. In Being There: Concepts, effects and measurement of user presence in synthetic environments. G. Riva, F. Davide, W.A. IJsselsteijn (Eds.), IOS Press, Amsterdam, The Netherlands, 1-14, 2003

Types of Immersion
• Tactical immersion
  – Tactical immersion is experienced when performing tactile operations that involve skill
  – Players feel "in the zone" while perfecting actions that result in success
• Strategic immersion
  – Strategic immersion is more cerebral, and is associated with mental challenge
  – Chess players experience strategic immersion when choosing a correct solution among a broad array of possibilities
• Narrative immersion
  – Narrative immersion occurs when players become invested in a story, and is similar to what is experienced while reading a book or watching a movie
https://en.wikipedia.org/wiki/Immersion_(virtual_reality)

Presence
• Presence can be defined as a special case of situation awareness, in which self-orientation and self-location are defined with respect to a media environment, not the real environment
IJsselsteijn, W.A., Riva, G. Being there: the experience of presence in mediated environments. In Being There: Concepts, effects and measurement of user presence in synthetic environments. G. Riva, F. Davide, W.A. IJsselsteijn (Eds.), IOS Press, Amsterdam, The Netherlands, 1-14, 2003

Immersion Requirements
• A wide field of view (80 degrees or better)
• Adequate resolution (1080p or better)
• Low pixel persistence (3 ms or less)
• A high enough refresh rate (> 60 Hz; 95 Hz is enough, but less may be adequate)
• A global display where all pixels are illuminated simultaneously (a rolling display may work with eye tracking)
https://en.wikipedia.org/wiki/Immersion_(virtual_reality)

Immersion Requirements (cont.)
• Optics (at most two lenses per eye with trade-offs; ideal optics are not practical using current technology)
• Optical calibration
• Rock-solid tracking: translation with millimeter accuracy or better, orientation with quarter-degree accuracy or better, and a tracked volume of 1.5 meters or more on a side
• Low latency (20 ms motion-to-last-photon; 25 ms may be good enough)
https://en.wikipedia.org/wiki/Immersion_(virtual_reality)
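The two Immersion Requirements slides are essentially a checklist of thresholds. A minimal sketch that encodes those thresholds and checks a hypothetical headset specification against them; only the numeric limits come from the slides, while the field names and the example device values are illustrative assumptions:

```python
from dataclasses import dataclass

@dataclass
class HeadsetSpec:
    fov_deg: float            # horizontal field of view
    resolution_lines: int     # vertical resolution, e.g. 1080 for "1080p"
    persistence_ms: float     # pixel persistence
    refresh_hz: float         # display refresh rate
    tracking_trans_mm: float  # translational tracking accuracy
    tracking_rot_deg: float   # rotational tracking accuracy
    latency_ms: float         # motion-to-last-photon latency

def check_immersion(spec: HeadsetSpec) -> dict:
    """Compare a spec against the thresholds listed on the slides above."""
    return {
        "field of view >= 80 deg":       spec.fov_deg >= 80,
        "resolution >= 1080p":           spec.resolution_lines >= 1080,
        "persistence <= 3 ms":           spec.persistence_ms <= 3,
        "refresh rate > 60 Hz":          spec.refresh_hz > 60,
        "tracking <= 1 mm translation":  spec.tracking_trans_mm <= 1,
        "tracking <= 0.25 deg rotation": spec.tracking_rot_deg <= 0.25,
        "latency <= 20 ms":              spec.latency_ms <= 20,
    }

# Hypothetical device values, purely for illustration
device = HeadsetSpec(fov_deg=100, resolution_lines=1200, persistence_ms=2.5,
                     refresh_hz=90, tracking_trans_mm=0.8, tracking_rot_deg=0.2,
                     latency_ms=18)
for requirement, ok in check_immersion(device).items():
    print(f"{'PASS' if ok else 'FAIL'}  {requirement}")
```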
Immersive Virtual Reality
• Direct interaction with the nervous system
  – The most frequently considered method would be to induce the sensations that make up the virtual reality directly in the nervous system
  – In functionalism/conventional biology we interact with consensus reality through the nervous system
  – Thus we receive all input from all the senses as nerve impulses
  – This gives the neurons a feeling of heightened sensation
https://en.wikipedia.org/wiki/Immersion_(virtual_reality)

VR Immersion Requirements
• Understanding of the nervous system
  – A comprehensive understanding of which nerve impulses correspond to which sensations, and which motor impulses correspond to which muscle contractions, would be required
• Ability to manipulate the central nervous system (CNS)
  – Manipulation could occur at any stage of the nervous system; the spinal cord is likely to be the simplest, as all nerves pass through it, so it could be the only site of manipulation
• Computer hardware/software to process inputs/outputs
  – A very powerful computer would be necessary to process a virtual reality complex enough to be nearly indistinguishable from consensus reality and to interact with the central nervous system fast enough
https://en.wikipedia.org/wiki/Immersion_(virtual_reality)

Presence Questionnaire
http://w3.uqo.ca/cyberpsy/docs/qaires/pres/PQ_va.pdf

FLOW
• In positive psychology, flow (also known as "the zone") is the mental state of operation in which a person performing an activity is fully immersed in a feeling of energized focus, full involvement, and enjoyment in the process of the activity
  – Characterized by complete absorption in what one does
  – Named by Mihály Csíkszentmihályi, the concept has been widely referenced across a variety of fields (and is especially widely recognized in occupational therapy), though it has existed for thousands of years under other guises, notably in some Eastern religions
https://en.wikipedia.org/wiki/Flow_(psychology)

FLOW Model
https://en.wikipedia.org/wiki/Flow_(psychology)

FLOW Conditions
• One must be involved in an activity with a clear set of goals and progress
  – This adds direction and structure to the task
• The task at hand must have clear and immediate feedback
  – This helps the person negotiate any changing demands and allows them to adjust their performance to maintain the flow state
• One must have a good balance between the perceived challenges of the task at hand and one's own perceived skills
  – One must have confidence in one's ability to complete the task at hand
https://en.wikipedia.org/wiki/Flow_(psychology)

FLOW Example
• Development and validation of a scale to measure optimal experience: the Flow State Scale
https://ess22012.files.wordpress.com/2012/09/flow-state-scale004.pdf

Virtual Reality Sickness
• Virtual reality sickness (also known as cybersickness) occurs when exposure to a virtual environment causes symptoms that are similar to motion sickness symptoms
• The most common symptoms are general discomfort, headache, stomach awareness, nausea, vomiting, pallor, sweating, fatigue, drowsiness, disorientation, and apathy
• Other symptoms include postural instability and retching
https://en.wikipedia.org/wiki/Virtual_reality_sickness

Virtual Reality Sickness (cont.)
• Virtual reality sickness differs from motion sickness in that it can be caused by the visually induced perception of self-motion; real self-motion is not needed
• It is also different from simulator sickness
  – Non-virtual-reality simulator sickness tends to be characterized by oculomotor disturbances
  – Virtual reality sickness tends to be characterized by disorientation
https://en.wikipedia.org/wiki/Virtual_reality_sickness
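Symptoms like these are commonly quantified with the Simulator Sickness Questionnaire (SSQ), linked a couple of slides below. A minimal sketch of SSQ-style scoring, assuming the weighted-subscale scheme usually attributed to Kennedy et al. (1993); the raw subscale sums used here are made-up example values, and the exact item-to-subscale mapping is omitted:

```python
# SSQ-style scoring: each symptom cluster gets a raw sum of its item ratings
# (items are rated 0-3), which is then multiplied by a fixed weight.
WEIGHTS = {"nausea": 9.54, "oculomotor": 7.58, "disorientation": 13.92}
TOTAL_WEIGHT = 3.74

def ssq_scores(raw_sums: dict) -> dict:
    """raw_sums: raw sum of item ratings per subscale, e.g. {'nausea': 4, ...}."""
    scores = {name: raw_sums[name] * w for name, w in WEIGHTS.items()}
    scores["total"] = sum(raw_sums.values()) * TOTAL_WEIGHT
    return scores

# Hypothetical participant with mild nausea and disorientation symptoms
print(ssq_scores({"nausea": 4, "oculomotor": 6, "disorientation": 3}))
```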
Simulator Sickness
• Simulator sickness is a subset of motion sickness that is typically experienced by pilots who undergo training for extended periods of time in flight simulators
• It is similar to motion sickness in many ways, but occurs in simulated environments and can be induced without actual motion
• Symptoms of simulator sickness include discomfort, apathy, drowsiness, disorientation, fatigue, vomiting, and many more
https://en.wikipedia.org/wiki/Simulator_sickness

Simulator Sickness Questionnaire
http://w3.uqo.ca/cyberpsy/docs/qaires/ssq/SSQ_va.pdf

Conclusions
• Evaluation is a very extensive field
• It is not easy to select the best approach
• The biggest problems:
  – Understanding the problem
  – Getting a large sample
  – Analysing the data properly
• AR evaluation is still not properly explored
  – There is a need for more research

Questions