1/7/2020

Introduction

Research workflow

  • we have a problem
  • how do others deal with it? -> analysis
  • if not sufficient, how can we solve it? -> idea and design
  • solve it -> implementation
  • has our solution solved the problem? -> evaluation and discussion
  • what’s next? -> conclusion and future work

Has our solution solved the problem?

Compare

  • the new and old version of the algorithm
  • two algorithms (heuristics)
  • results from the model with the actual result (fMRI)
  • results from the same model with different parameters
  • results from two models (MD)
  • subjects with/without treatment
  • subjects before/after treatment

Comparison

  • what does “better” mean? \(\rightarrow\) metric
  • datasets varying in size
  • datasets varying in other characteristics
  • for the same input
    • same metric value: shortest path with greedy
    • similar metric value: running time
    • different metric value: almost everything else

Simple example

Does vitamin E influence IQ?

  • two groups: control and treatment
  • treatment group took vitamin E
  • measure IQ for both and compare
  • simulated data
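
A minimal sketch of this simulated experiment in R (all numbers are assumptions: IQ \(\sim N(100, 15)\) in both groups, so H0 happens to be true):

```r
# Hypothetical simulation of the vitamin E experiment: both groups are
# drawn from the same IQ distribution, i.e. the vitamin has no real effect.
set.seed(1)
control   <- rnorm(30, mean = 100, sd = 15)   # no vitamin E
treatment <- rnorm(30, mean = 100, sd = 15)   # vitamin E, no true effect
t.test(treatment, control)                    # Welch two-sample t-test
```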

Real-world example

Is the new version of our algorithm in CaverDock better than the old one?

  • stochastic algorithm
  • old heuristic and new heuristic
  • more and less complex input data
  • three metrics of what “better” means
  • compare all metrics for all data
  • measured data

Basics

Usual routine

  • results should look like this
  • guess the sample size
  • run some experiments
  • calculate statistics, until it looks good
  • conclude that your idea is correct
  • if no “good” results, run more experiments
  • maybe this parameter also matters
  • maybe you found a bug

Exploratory research

  • what is going on?
  • look around the data
  • figure things out
  • generate a hypothesis
  • fewer restrictions
  • requires further research to confirm
  • useful to do even if you already have a hypothesis

Confirmatory research

  • is this going on?
  • the hypothesis is formed ahead of time
  • measure data and analyze them without bias
  • confirm/reject the hypothesis and interpret the results
  • more restrictions
  • requires further research to rule out errors and increase confidence

What should not happen when exploring

  • results should look like this
  • guess the sample size
  • run some experiments
  • calculate statistics, until it looks good
  • conclude that your idea is correct
  • if no “good” results, run more experiments
  • maybe this parameter also matters
  • maybe you found a bug

What should not happen when confirming

  • results should look like this
  • guess the sample size
  • run some experiments
  • calculate statistics, until it looks good
  • conclude that your idea is correct
  • if no “good” results, run more experiments
  • maybe this parameter also matters
  • maybe you found a bug

We will talk about …

  • results should look like this -> Hypotheses
  • guess -> Sample size justification
  • run experiments -> Pre-registration and error control
  • calculate, until … -> Statistics and multiple testing
  • conclude -> Interpretation of statistics
  • …, so run more experiments -> Optional stopping
  • maybe this also matters -> (In)dependent variables

Variables

Hypothesis

  • a supposition or proposed explanation
  • made on the basis of limited evidence
  • as a starting point for further investigation

  • observe and ask questions
  • then hypothesize how or why
  • make predictions and test them
  • rinse and repeat


Null Hypothesis

  • H0
  • there is nothing going on
  • there is no difference between two *
  • sometimes interesting on its own
  • mobile phones do not cause brain cancer
  • usually accompanied by H1
  • H0 true == H1 false
  • vitamin E does not influence IQ
  • vitamin E does not increase IQ by more than two points

Alternative hypothesis

  • H1
  • there is something going on
  • there is a difference between two *
  • nothing about how big a difference
  • H1 true == H0 false
  • vitamin E influences IQ
  • vitamin E increases IQ
  • vitamin E increases IQ by more than two points
  • vitamin E increases IQ by 5 points

Hypothesis

  • formed from theory or exploratory research
  • when confirming, must be fully formed before data collection
  • what prediction could confirm it?
  • what prediction could disprove it?

What can happen?

Was the prediction right? & What was predicted?

                      predicted H0                predicted H1
  prediction right    true negative               true positive
  prediction wrong    false negative (Type 2)     false positive (Type 1)

Statistically significant?

  • statistically significant
    • the difference is probably NOT due to random chance
    • rejecting H0
    • not confirming H1
  • not statistically significant
    • meh
    • not rejecting H0
    • not confirming H1

Statistically significant?

  • calculate test statistic
  • calculate the critical value (based on \(\alpha\), distribution, parameters of the test, etc.)
  • if |test statistic| > critical value \(\rightarrow\) statistically significant result
  • if |test statistic| < critical value \(\rightarrow\) not a statistically significant result
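
A sketch of this decision rule in R (the data and \(\alpha = 0.05\) are assumptions; a two-sided t-test is used for concreteness):

```r
# Decide significance by comparing the t statistic to its critical value.
set.seed(1)
x <- rnorm(30, 100, 15)
y <- rnorm(30, 105, 15)
tt    <- t.test(x, y)                          # Welch two-sample t-test
alpha <- 0.05
crit  <- qt(1 - alpha / 2, df = tt$parameter)  # two-sided critical value
abs(tt$statistic) > crit                       # TRUE -> significant
tt$p.value < alpha                             # the equivalent p-value check
```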

Test statistic

Take the data and calculate one scalar that describes how they differ.

  • based on mean: z-test, t-test, f-test
  • based on variance: correlation, ANOVA, \(\chi^2\) test
  • based on predicting the change: regression
  • non-parametric
  • and many more (Wikipedia has 101 pages under category Statistical tests)

What can happen?

Was prediction right? & What was predicted?

True negative

  • when H0 is true and we are saying H0 is true
  • interesting if H0 itself is interesting
  • mobile phones do not cause cancer
  • vaccination does not cause autism

True positive

  • when H1 is true and we are saying H1 is true
  • what we need to do:
    • come up with H1 that are true in this universe
    • have enough statistical power to be able to see it
  • statistically significant or not is a binary outcome
  • it says nothing about the size of the effect

Effect size

  • assuming H1 is true, how big is the effect?
  • standardized difference in means
  • correlations
  • the smaller the effect, the more statistical power is needed to see it
  • practical significance of the effect
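
A sketch of the “standardized difference in means” (Cohen’s d) computed by hand on simulated data (all numbers are assumptions):

```r
# Cohen's d: difference in means divided by the pooled standard deviation.
set.seed(1)
x <- rnorm(30, 100, 15)   # e.g. control group
y <- rnorm(30, 105, 15)   # e.g. treatment group
pooled_sd <- sqrt((var(x) + var(y)) / 2)   # equal group sizes assumed
(mean(y) - mean(x)) / pooled_sd
# rule of thumb: 0.2 small, 0.5 medium, 0.8 large
```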

Statistical power

  • when H1 is true, I will see it in my data with probability \(1-\beta\)
  • informational value of the study

Sample size and effect

Interactive demo:

http://shiny.ieis.tue.nl/d_p_power/
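
The same relationship can be computed directly with the pwr package (the numbers below are assumptions):

```r
# Power of a two-sample t-test with 30 subjects per group and an assumed
# true effect of d = 0.5; pwr solves for whichever argument is left out.
library(pwr)
pwr.t.test(n = 30, d = 0.5, sig.level = 0.05, type = "two.sample")
# the resulting power is well below the conventional 0.8 target
```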

False negative

  • when H1 is true and we are saying H0 is true
  • Type 2 error rate \(\beta\)
  • in the long run, this error occurs with probability \(\beta\)
  • usually \(\beta = 0.2\)
  • you can decrease it by
    • increasing sample size
    • decreasing measurement error
    • predicting the direction of the effect
  • potentially even more dangerous than Type 1 error

False positive

  • when H0 is true and we are saying H1 is true
  • Type 1 error rate \(\alpha\)
  • in the long run, this error occurs with probability \(\alpha\)
  • substantially decreased if we replicate studies
  • usually \(\alpha = 0.05\)
  • inflates in case of any tinkering with data
  • needs to be controlled carefully

Tricky question

It is known that an effect exists in the population.

In a pilot study, a difference between groups was observed.

Group 1: n = 22, M = 5.68, SD = 0.98.

Group 2: n = 23, M = 6.28, SD = 1.11.

p < 0.05

When you replicate the study in the same way, what is the chance that you will observe a statistically significant result?
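
One way to reason about it (a sketch; the pwr package is assumed): the chance of a significant replication is the statistical power of the replication study, computed from the pilot’s observed effect size.

```r
# Standardized effect size observed in the pilot (pooled SD), then the
# power of an identical replication with the same group sizes.
library(pwr)
d <- (6.28 - 5.68) / sqrt((0.98^2 + 1.11^2) / 2)
pwr.t2n.test(n1 = 22, n2 = 23, d = d, sig.level = 0.05)
# roughly a coin flip; a significant pilot does not guarantee replication
```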

What can happen?

We will talk about …

  • results should look like this -> Hypotheses
  • guess -> Sample Size Justification
  • run experiments -> Pre-registration and Error Control
  • calculate, until … -> Calculating statistics and Multiple Testing
  • conclude -> Interpreting statistics
  • …, so run more experiments -> Optional stopping
  • maybe this also matters -> Dependent and Independent Variables, Hypotheses

Pre-registration and Further Details

Pre-registration

Plan the analysis AHEAD

  • justify sample size
  • what data do you measure and how?
  • how do you randomize between groups?
  • how do you clean data?
  • which statistical tests do you want to run?
  • what are dependent and independent variables for each test?
  • specify parameters for each test
  • specify error rates \(\alpha\) and \(\beta\) and their control

Write this down and ideally publish it (e.g. aspredicted.org).

Sample size justification

  • according to available resources
  • according to required accuracy
    • distribution and standard error
  • according to required statistical power
    • \(1-\beta\), \(\alpha\) and estimated effect size
    • or the smallest effect size of interest
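
A sketch of the power-based justification with base R (the smallest effect of interest, 5 IQ points with SD 15, is an assumption):

```r
# power.t.test solves for whichever argument is NULL: here the per-group
# sample size n needed to detect delta with 80% power at alpha = 0.05.
power.t.test(delta = 5, sd = 15, sig.level = 0.05, power = 0.80)
# round n up to the next whole subject per group
```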

Data

Collect

  • needed for metric
  • maybe a bit more for further exploration

Clean

  • missing values
  • outliers

Error rate control

Type 2 error rate control

  • \(\beta\)
  • easy to neglect
  • power analysis beforehand
  • usually 20%, but think for yourself
  • in reality, it is estimated at 65% in psychology
  • in reality, it is estimated at 79% in neuroscience

Type 1 error rate control

  • \(\alpha\)
  • easy to inflate
  • pre-registration ahead
  • usually 5%, but think for yourself

Misuse of statistics

  • techniques that tinker with results to get a “good” p-value
  • HARKing
  • biased data
  • not including confounding variables
  • manipulating data (e.g. outliers)
  • p-hacking
    • multiple testing, i.e. look elsewhere
    • optional stopping, i.e. collect more data
  • cherry-picking
  • some are acceptable when exploring
  • inflates \(\alpha\) to unknown level when confirming

Multiple testing

Multiple testing

  • what to do with multiple p-values
    • different metrics
    • different parameters
    • different inputs
  • three statistical tests, each with \(\alpha=0.05\)
  • total Type 1 error rate \(1-0.95^3 = 0.14\)
  • with 50 tests, it’s above 90%

Control for multiple testing

  • family of tests
  • independent tests, otherwise (M)ANOVA
  • Bonferroni correction and its variants
    • \(\alpha_i = \alpha / n\)
    • single false positive is a disaster
    • more false negatives
  • False discovery rate control
    • Benjamini-Hochberg procedure
    • if I make 50 discoveries, having 2 false positives is okay
    • controls the expected proportion of false positives among discoveries
    • useful with many variables (e.g. genes)
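
A sketch of both the inflation and the corrections in R (the p-values in the vector are made-up examples):

```r
# Familywise Type 1 error rate without correction, for 3 and 50 tests:
1 - 0.95^c(3, 50)                        # ~0.14 and ~0.92

p <- c(0.001, 0.012, 0.03, 0.04, 0.20)   # hypothetical p-values
p.adjust(p, method = "bonferroni")       # controls familywise error rate
p.adjust(p, method = "BH")               # Benjamini-Hochberg: controls FDR
```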

Optional stopping

[simulated demo of optional stopping: the lowest p-value, 0.006, was observed at sample size 100; the p-value dropped below 0.05 for the first time at sample size 73]

Optional stopping

  • willing to collect 100 data points
  • will look at data three times
  • each test has \(\alpha=0.05\)
  • total Type 1 error rate \(\sim\) 0.1
  • with 100 looks, it’s \(\sim\) 0.35
  • control with sequential analysis
  • you can spend \(\alpha\) gradually
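
A sketch of such a demo (all settings are assumptions): H0 is true, yet peeking after every added observation often finds a “significant” moment.

```r
# Optional stopping under H0: test after each new observation and note
# whether (and when) the p-value ever dips below 0.05.
set.seed(1)
n_max <- 100
x <- rnorm(n_max)                     # no true effect in either group
y <- rnorm(n_max)
p <- sapply(10:n_max, function(n) t.test(x[1:n], y[1:n])$p.value)
any(p < 0.05)                         # did we ever "see" an effect?
(10:n_max)[which(p < 0.05)[1]]        # first n with p < 0.05 (NA if never)
```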

Run and analyze

  • only after all this is planned
  • run experiments
  • measure data
  • run the analysis as planned

Interpretation of the Results

Interpreting p-values

  • we found p < 0.05, so …
  • they separate signal from the noise in the data
  • is it random variation or is there true difference?
  • how surprising the data are, assuming H0 is true
[simulated demo: 100 p-values from studies in which H0 is true; 6 of the 100 fell below 0.05 purely by chance, and a few more came close]
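
A sketch that reproduces this kind of demo (sample sizes and the number of repetitions are assumptions):

```r
# Many studies where H0 is true: p-values are uniformly distributed,
# so about 5% cross the 0.05 line by luck alone.
set.seed(1)
p <- replicate(1e4, t.test(rnorm(30), rnorm(30))$p.value)
mean(p < 0.05)   # close to 0.05
hist(p)          # roughly flat under H0
```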

p-value distribution

[simulation: the observed fraction of p-values below 0.05 was 0.0549]

p-value < 0.05

it DOES NOT mean that

  • H1 is true
  • H1 is probably true (with 1-\(\alpha\) probability)
  • H0 is probably false (with 1-\(\alpha\) probability)
  • you will get the same result in replication study
  • your Type 1 error rate is really 0.05
  • the value of p depends on the size of the effect
  • the lower the p-value, the more confidence you can have

p-value < 0.05

it DOES mean that

  • you have set your Type 1 error rate to 5%
  • if H0 were true, it would be unlikely to observe these data
  • if H0 were true, we would get this result in 5% of studies
  • somebody should do a replication study
  • if H1 is true, lower values of p are more probable
  • values between 0.04 and 0.05 are suspicious
    • only four times more likely under H1 than under H0

p-value >= 0.05

it DOES NOT mean that

  • H0 is true
  • H1 is false
  • there is no effect
  • you should tinker with the results to get p < 0.05
  • you are a bad scientist
  • you should hide these data (although you probably will not get them published)

p-value >= 0.05

it DOES mean that

  • the data are not surprising, if H0 is true
  • H1 might still be true
  • not enough statistical power to observe it
  • or just bad luck

p-curve

Why are so many studies false?

Interpreting effect sizes

  • effects can be statistically significant, but practically insignificant
  • with enough data you have statistical power to show even meaningless effects
  • two families: Cohen’s d and correlations
  • Cohen’s d: what is the difference between x and y?
  • visualization
  • correlations: how much does the relationship between x and y reduce the error in the data?
  • visualization

Conclusion

Statistics is tricky

  • \(P(H|D) \neq P(D|H)\)
  • p < \(\alpha\) does not mean that H1 is true
  • p > \(\alpha\) does not mean that H1 is false
  • if H1 is true, lowering \(\alpha\) does not make it more probable to see H1 in data
  • if H1 is true, p-values close to \(\alpha\) are rather unlikely
  • with enough tests, you can “prove” anything
  • with enough data, you can detect statistically significant, but practically meaningless effects
  • even large effects do not affect everybody

What should happen

  • forming hypotheses ahead when confirming
  • ensuring high statistical power
  • controlling false positive error rate
  • sound, careful statistical analysis planned ahead
  • no tinkering with the results
  • reporting effect size and confidence intervals
  • publishing even statistically insignificant results
  • replicating studies

Conclusion

  • statistics is tricky
  • our brains do not understand it intuitively
  • useful to understand research
  • essential to properly do research
  • one study does not prove anything
  • trouble: low power, a p-value only just below \(\alpha\), and a surprising result
  • nothing is certain, but sometimes we are pretty sure

Resources