PV182 Human Computer Interaction
Lecture 6: Evaluating Controlled Experiments
Fotis Liarokapis, liarokap@fi.muni.cz
30th September 2019

Controlled Experiments
• What is experimental design?
• What is an experimental hypothesis?
• How do I plan an experiment?
• Why are statistics used?
• What are the important statistical methods?

Quantitative Evaluation of Systems
• Quantitative:
– Precise measurement, numerical values
– Bounds on how correct our statements are
• Methods:
– User performance data collection
– Controlled experiments

Collecting User Performance Data
• Data collected on system use (often lots of data)
• Exploratory:
– Hope something interesting shows up
– But difficult to analyze

Collecting User Performance Data .
• Targeted:
– Look for specific information, but may miss something
• Frequency of requests for on-line assistance
– What did people ask for help with?
• Frequency of use of different parts of the system
– Why are parts of the system unused?
• Number of errors and where they occurred
– Why does an error occur repeatedly?
• Time it takes to complete some operation
– Which tasks take longer than expected?

Controlled Experiments
• Traditional scientific method
• Reductionist:
– Clear, convincing results on specific issues
• In HCI:
– Insights into cognitive processes, human performance limitations, ...
– Allows system comparison, fine-tuning of details, ...

Controlled Experiments .
• Strives for:
– A lucid and testable hypothesis
– Quantitative measurement
– A measure of confidence in the results obtained (statistics)
– Replicability of the experiment
– Control of variables and conditions
– Removal of experimenter bias

Controlled Experiments ..
• Subjects in experiments:
– Between-subjects (randomized design: each participant is assigned to a different condition)
– Within-subjects (repeated measures: each participant performs under every condition)

Controlled Experiments Example
https://www.youtube.com/watch?v=D3ZB2RTylR4

Clear and Testable Hypothesis
• State a clear, testable hypothesis
– This is a precise problem statement
• Example:
– "There is no difference in user performance (time and error rate) when selecting a single item from a pop-up or a pull-down menu of 4 items, regardless of the subject's previous expertise in using a mouse or using the different menu types"
[Figure: a pull-down and a pop-up menu (File, Edit, View, Insert menu bar; New, Open, Close, Save items)]

Independent Variables
• The hypothesis includes the independent variables that are to be altered
– The things you manipulate independently of a subject's behaviour
– They determine a modification to the conditions the subjects undergo
– They may arise from subjects being classified into different groups

Independent Variables .
• Menu experiment (the resulting experimental conditions are sketched below):
– Menu type: pop-up or pull-down
– Menu length: 3, 6, 9, 12, 15
– Subject type: expert or novice
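As an illustration of how the independent variables define the experimental conditions, the sketch below enumerates the full factorial design for the menu experiment. This is only a hypothetical sketch: the variable names and the assumption that every combination of menu type, menu length and subject type is tested are mine, not part of the original slides.

```python
from itertools import product

# Independent variables of the (hypothetical) menu experiment
menu_types = ["pop-up", "pull-down"]
menu_lengths = [3, 6, 9, 12, 15]
subject_types = ["expert", "novice"]

# A full factorial design crosses every level of every independent variable:
# 2 menu types x 5 menu lengths x 2 subject types = 20 conditions.
conditions = list(product(menu_types, menu_lengths, subject_types))

for menu_type, menu_length, subject_type in conditions:
    # For each condition the dependent variables would later be recorded
    # (selection time, selection errors, time to reach proficiency).
    print(f"{menu_type:9s}  length={menu_length:2d}  {subject_type}")

print(f"{len(conditions)} conditions in total")
```

Note that subject type is not manipulated but arises from classifying participants, so it would normally be a between-subjects factor, while menu type and menu length could be varied within subjects.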
Dependent Variables
• The hypothesis includes the dependent variables that will be measured
– Variables dependent on the subject's behaviour / reaction to the independent variables
– The specific things you set out to quantitatively measure / observe

Dependent Variables .
• Menu experiment:
– Time to select an item
– Selection errors made
– Time to learn to use it to proficiency

Independent and Dependent Variables
https://www.youtube.com/watch?v=aeH1FzqdQZ0

Scales of Measurement
• Four major scales of measurement:
– Nominal
– Ordinal
– Interval
– Ratio

Nominal Scale
• Classification into named or numbered unordered categories
– e.g. country of birth, user groups, gender, ...
• Allowable manipulations:
– Whether an item belongs in a category
– Counting items in a category
• Statistics:
– Number of cases in each category
– Most frequent category
– No means, medians, ...

Nominal Scale .
• Sources of error:
– Agreement in labelling, vague labels, vague differences between objects
• Testing for error:
– Agreement between different judges for the same object

Ordinal Scale
• Classification into named or numbered ordered categories
– No information on the magnitude of differences between categories
– e.g. preference, social status, gold/silver/bronze medals
• Allowable manipulations:
– As with the nominal scale, plus:
– Merge adjacent classes
– Transitivity: if A > B and B > C, then A > C
• Statistics:
– Median (central value)
– Percentiles, e.g. 30% were less than B
• Sources of error:
– As for the nominal scale

Interval Scale
• Classification into ordered categories with equal differences between categories
– Zero only by convention
– e.g. temperature (°C or °F), time of day
• Allowable manipulations:
– Add, subtract
– Cannot multiply, as this requires an absolute zero
• Statistics:
– Mean, standard deviation, range, variance
• Sources of error:
– Instrument calibration, reproducibility and readability
– Human error, skill, ...

Ratio Scale
• Interval scale with an absolute, non-arbitrary zero
– e.g. temperature (K), length, weight, time periods
• Allowable manipulations:
– Multiply, divide

Example: Apples
• Nominal:
– Apple variety: Macintosh, Delicious, Gala, ...
• Ordinal:
– Apple quality: U.S. Extra Fancy, U.S. Fancy, U.S. Combination Extra Fancy / Fancy, U.S. No. 1, U.S. Early, U.S. Utility, U.S. Hail

Example: Apples .
• Interval:
– Apple 'liking scale': "After taking at least 2 bites, how much do you like the apple?" (Dislike extremely ... Neither like nor dislike ... Like extremely)
– Marin, A. Consumers' evaluation of apple quality. Washington Tree Postharvest Conference, 2002.
• Ratio:
– Apple weight, size, ...

Subject Selection
• Judiciously select and assign subjects to groups
• Ways of controlling subject variability:
– A reasonable number of subjects
– Random assignment (a short sketch follows the next slide)
– Make different user groups an independent variable
– Screen for anomalies in the subject group (superstars versus poor performers)
[Figure: cartoon contrasting an expert and a novice subject]

Controlling Bias
• Unbiased instructions
• Unbiased experimental protocols
– Prepare scripts ahead of time
• Unbiased subject selection
[Figure: a biased experimenter telling a subject "Now you get to do the pop-up menus. I think you will really like them... I designed them myself!"]
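The sketch below shows one way to operationalise unbiased, random assignment of subjects for a between-subjects design. It is only an illustration under my own assumptions (the subject identifiers, two conditions, and a fixed random seed for replicability); it is not taken from the original experiment.

```python
import random

# Hypothetical pool of recruited subjects (S1..S20) and the two menu conditions.
subjects = [f"S{i}" for i in range(1, 21)]
conditions = ["pop-up menu", "pull-down menu"]

# Fix the seed so the assignment itself is documented and replicable.
rng = random.Random(42)
shuffled = subjects[:]
rng.shuffle(shuffled)

# Split the shuffled pool into equally sized groups, one per condition
# (between-subjects: each participant sees exactly one condition).
group_size = len(shuffled) // len(conditions)
assignment = {
    condition: shuffled[i * group_size:(i + 1) * group_size]
    for i, condition in enumerate(conditions)
}

for condition, group in assignment.items():
    print(condition, "->", ", ".join(group))
```

For a within-subjects design one would instead counterbalance the order in which each participant meets the conditions (for example with a Latin square) rather than split the pool.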
Statistical Analysis
• Apply statistical methods to data analysis
• Confidence limits:
– The confidence that your conclusion is correct
– "The hypothesis that computer experience makes no difference is rejected at the .05 level" means a 95% chance that your statement is correct and a 5% chance that you are wrong

Interpretation
• What you believe the results really mean
• Their implications for your research
• Their implications for practitioners
• How generalizable they are
• Limitations and critique

Planning Flowchart for Experiments
• Stage 1: Problem definition – research idea, literature review, statement of the problem, hypothesis development
• Stage 2: Planning – define variables, controls, apparatus, procedures, select subjects, experimental design, preliminary testing (with feedback to Stage 1)
• Stage 3: Conduct research – data collection
• Stage 4: Analysis – data reduction, statistics, hypothesis testing
• Stage 5: Interpretation – interpretation, generalization, reporting (with feedback to the earlier stages)
• (Flowchart copied from an early ACM CHI tutorial)

Statistical Analysis .
• Calculations that tell us:
– Mathematical attributes of our data sets (mean, amount of variance, ...)
– How data sets relate to each other (whether we are "sampling" from the same or different distributions)
– The probability that our claims are correct ("statistical significance")

Statistical vs Practical Significance
• When n is large, even a trivial difference may show up as a statistically significant result
– e.g. menu choice: the mean selection time of menu A is 3.00 sec and of menu B is 3.05 sec
• Statistical significance does not imply that the difference is important!
– A matter of interpretation
– Statistical significance is often abused and used to misinform

Problem with Visual Inspection of Data
• You will almost always see variation in collected data
• Differences between data sets may be due to:
– Normal variation, e.g. two sets of ten tosses with different but fair dice: the differences between the data and the means are accountable by expected variation
– Real differences between the data, e.g. ten tosses of a loaded die versus ten tosses of a fair die: the differences between the data and the means are not accountable by expected variation
• (A small simulation of this point appears after the t-test slides below)

T-test
• A simple statistical test
– Allows one to say something about differences between means at a certain confidence level
• Null hypothesis of the t-test:
– No difference exists between the means of the two sets of collected data
• Possible results:
– I am 95% sure that the null hypothesis is rejected (there is probably a true difference between the means)
– I cannot reject the null hypothesis (the means are likely the same)

Wikipedia
• The t statistic was introduced in 1908 by William Sealy Gosset, a statistician working for the Guinness brewery in Dublin, Ireland ("Student" was his pen name). Gosset had been hired due to Claude Guinness's innovative policy of recruiting the best graduates from Oxford and Cambridge to apply biochemistry and statistics to Guinness's industrial processes. Gosset devised the t-test as a way to cheaply monitor the quality of beer. He published the test in Biometrika in 1908, but was forced to use a pen name by his employer, who regarded the fact that they were using statistics as a trade secret. In fact, Gosset's identity was known to fellow statisticians.
• Today, the t-test is more generally applied to the confidence that can be placed in judgements made from small samples.
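To make the "visual inspection" point above concrete, the sketch below simulates two sets of ten tosses of fair dice, and a loaded die against a fair die, then applies an unpaired t-test to each pair. It is only an illustrative simulation under my own assumptions (how strongly the die is loaded, the number of tosses, the random seed); it is not part of the original lecture, and it uses NumPy and SciPy.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Two samples of ten tosses from identical fair dice: any difference between
# the sample means is just normal variation.
fair_a = rng.integers(1, 7, size=10)
fair_b = rng.integers(1, 7, size=10)

# A loaded die that favours 6 versus a fair die: here there is a real
# difference between the underlying distributions.
loaded = rng.choice([1, 2, 3, 4, 5, 6], size=10,
                    p=[0.05, 0.05, 0.1, 0.1, 0.2, 0.5])
fair_c = rng.integers(1, 7, size=10)

for label, x, y in [("fair vs fair", fair_a, fair_b),
                    ("loaded vs fair", loaded, fair_c)]:
    t, p = stats.ttest_ind(x, y)  # two-tailed unpaired t-test
    print(f"{label}: means {x.mean():.2f} vs {y.mean():.2f}, t={t:.2f}, p={p:.3f}")
```

Depending on the run, the fair-versus-fair means still differ visibly but the t-test does not treat the difference as significant, while the loaded-versus-fair comparison usually does reach significance; with only ten tosses neither outcome is guaranteed, which is exactly why a statistical test rather than visual inspection is needed.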
Different Types of T-tests
• Comparing two sets of independent observations
– Usually different subjects in each group
– The number per group may differ as well
– e.g. Condition 1: subjects S1–S20; Condition 2: subjects S21–S43
• Paired observations
– Usually a single group studied under both experimental conditions
– The data points of one subject are treated as a pair
– e.g. Condition 1: S1–S20; Condition 2: the same subjects S1–S20

Different Types of T-tests .
• Non-directional vs directional alternatives
– Non-directional (two-tailed): no expectation that the direction of the difference matters
– Directional (one-tailed): only interested in whether the mean of a given condition is greater than the other

T-test Assumptions
• Assumptions of t-tests:
– The data points of each sample are normally distributed (but the t-test is very robust in practice)
– The population variances are equal (the t-test is reasonably robust for differing variances, but this deserves consideration)
– The individual observations (data points) in a sample are independent (this must be adhered to)
• Significance level:
– Decide upon the level before you do the test!
– Typically stated at the .05 or .01 level

Two-tailed Unpaired T-test
• Definitions: N is the number of data points in one sample, ΣX the sum of all data points in one sample, X̄ the mean of the data points in a sample, Σ(X²) the sum of squares of the data points in a sample, s² the unbiased (pooled) estimate of the population variance, t the t ratio, and df = N1 + N2 - 2 the degrees of freedom
• Formulas:

$$ s^2 = \frac{\left(\sum X_1^2 - \frac{(\sum X_1)^2}{N_1}\right) + \left(\sum X_2^2 - \frac{(\sum X_2)^2}{N_2}\right)}{N_1 + N_2 - 2} $$

$$ t = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\frac{s^2}{N_1} + \frac{s^2}{N_2}}} $$

Level of Significance for a Two-tailed Test
df    .05      .01
1     12.706   63.657
2     4.303    9.925
3     3.182    5.841
4     2.776    4.604
5     2.571    4.032
6     2.447    3.707
7     2.365    3.499
8     2.306    3.355
9     2.262    3.250
10    2.228    3.169
11    2.201    3.106
12    2.179    3.055
13    2.160    3.012
14    2.145    2.977
15    2.131    2.947
16    2.120    2.921
18    2.101    2.878
20    2.086    2.845
22    2.074    2.819
24    2.064    2.797

Two-tailed Unpaired T-test .
• Or, use a statistics package (e.g. Excel has simple stats); a sketch reproducing these numbers in Python follows below
– Condition one: 3, 4, 4, 4, 5, 5, 5, 6
– Condition two: 4, 4, 5, 5, 6, 6, 7, 7
• Unpaired t-test: df = 14, t = -1.871, two-tailed probability = .0824
• Group one: count 8, mean 4.5, std. dev. 0.926, std. error 0.327
• Group two: count 8, mean 5.5, std. dev. 1.195, std. error 0.423
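As a cross-check of the pooled-variance formulas and the package output above, the short sketch below computes the same unpaired two-tailed t-test in Python, once by hand from the formulas and once with scipy.stats.ttest_ind. The data are the two conditions from the slide; everything else (the use of SciPy, the variable names) is my illustration, not part of the original material.

```python
import math
from scipy import stats

cond1 = [3, 4, 4, 4, 5, 5, 5, 6]
cond2 = [4, 4, 5, 5, 6, 6, 7, 7]
n1, n2 = len(cond1), len(cond2)

# Pooled (unbiased) estimate of the population variance, as in the formula:
# s^2 = [ (sum(X1^2) - (sum X1)^2 / N1) + (sum(X2^2) - (sum X2)^2 / N2) ] / (N1 + N2 - 2)
ss1 = sum(x * x for x in cond1) - sum(cond1) ** 2 / n1
ss2 = sum(x * x for x in cond2) - sum(cond2) ** 2 / n2
s2 = (ss1 + ss2) / (n1 + n2 - 2)

mean1, mean2 = sum(cond1) / n1, sum(cond2) / n2
t_by_hand = (mean1 - mean2) / math.sqrt(s2 / n1 + s2 / n2)
df = n1 + n2 - 2
print(f"by hand: df={df}, t={t_by_hand:.3f}")   # df=14, t=-1.871

# The same test via SciPy (equal variances assumed, two-tailed p value).
t, p = stats.ttest_ind(cond1, cond2, equal_var=True)
print(f"scipy:   t={t:.3f}, p={p:.4f}")
```

The computed values match the package output quoted on the slide (t = -1.871, p = .0824), and since p > .05 the null hypothesis of equal means cannot be rejected at the .05 level.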
ANOVA
• Analysis of variance (ANOVA) is a collection of statistical models used to analyze the differences among group means and their associated procedures (such as "variation" among and between groups)
– Developed by the statistician and evolutionary biologist Ronald Fisher
• In the ANOVA setting, the observed variance in a particular variable is partitioned into components attributable to different sources of variation
https://en.wikipedia.org/wiki/Analysis_of_variance

ANOVA .
• In its simplest form, ANOVA provides a statistical test of whether or not the means of several groups are equal, and therefore generalizes the t-test to more than two groups
• ANOVAs are useful for comparing (testing) three or more means (groups or variables) for statistical significance
– Conceptually similar to multiple two-sample t-tests, but more conservative (results in less Type I error) and therefore suited to a wide range of practical problems
https://en.wikipedia.org/wiki/Analysis_of_variance

Significance Levels and Errors
• Type 1 error:
– Rejecting the null hypothesis when it is, in fact, true
• Type 2 error:
– Accepting the null hypothesis when it is, in fact, false
• Effects of the level of significance:
– A high confidence level (e.g. p < .0001) gives a greater chance of Type 2 errors
– A low confidence level (e.g. p < .1) gives a greater chance of Type 1 errors
• You can 'bias' your choice depending on the consequences of these errors

Type I and Type II Errors
• Type 1 error:
– Rejecting the null hypothesis when it is, in fact, true
• Type 2 error:
– Accepting the null hypothesis when it is, in fact, false

Decision \ "Reality"       Null is true      Null is false
Accept the null ("true")   correct           Type II error
Reject the null ("false")  Type I error      correct

Which is Worse?
• Type I errors are considered worse
– Because the null hypothesis is meant to reflect the incumbent theory
• BUT:
– You must use your judgement to assess the actual risk of being wrong in the context of your study

Significance Levels and Errors .
• Null hypothesis: there is no difference between pie menus and traditional pop-up menus
[Figure: a traditional linear menu and a pie menu, each containing New, Open, Close, Save]
• What is the consequence of each error type?
– Type 1: extra work developing software, and people must learn a new idiom for no benefit
– Type 2: users keep a less efficient (but already familiar) menu
• Which error type is preferable?
– Redesigning a traditional GUI interface: a Type 2 error is preferable to a Type 1 error
– Designing a digital mapping application where experts perform extremely frequent menu selections: a Type 1 error is preferable to a Type 2 error

You Know Now
• Controlled experiments can provide clear, convincing results on specific issues
• Creating testable hypotheses is critical to good experimental design
• Experimental design requires a great deal of planning
• Statistics inform us about:
– Mathematical attributes of our data sets
– How data sets relate to each other
– The probability that our claims are correct

You Know Now .
• There are many statistical methods that can be applied to different experimental designs
– T-tests

[Figure: overview diagram of the user-centred design process – articulate who the users are and their key tasks; brainstorm designs (task-centred system design, participatory design, user-centred design, psychology of everyday things); refine designs with low-fidelity prototyping methods (throw-away paper prototypes, participatory interaction, task scenario walk-throughs); produce testable high-fidelity prototypes evaluated with usability testing and heuristic evaluation; field test completed alpha/beta systems or a complete specification]

Questions

Acknowledgements
• Prof. Ing. Jiří Sochor