Week 8: Statistics
Introduction to Bioinformatics (LF:DSIB01)

What is statistics?
• Statistics is the science of learning from data, and of measuring, controlling and communicating uncertainty. - American Statistical Association (ASA)
[Figure: PPDAC cycle]

Sampling Design
• Randomization
  ‒ Sampling designs should be as random as possible
• Overrepresentation
  ‒ Preferentially select units where the dispersion is larger
  ‒ The sample is not necessarily a “scale copy” of the population
  ‒ It makes sense to increase sampling depth in categories that are more variable
• Restriction
  ‒ Restrict or exclude problematic samples, such as samples with empty categories
  ‒ Stratification: fixing the sample size in categories of the population
  ‒ Does not conflict with Randomization as long as there are enough possible samples
Review: Probability Sampling Designs: Principles for Choice of Design and Balancing, Tillé and Wilhelm, 2016

Statistical Hypothesis Testing
• To ask questions of data, we use statistical methods that provide a confidence or likelihood about the answers.
• Null Hypothesis (H0): the default position that there is nothing new happening
• How can we test our confidence in the Null Hypothesis?
• Usual goal:
  ‒ Reject the Null Hypothesis at some significance level (e.g. 0.05)
  ‒ Confirm a statistically significant effect
• Example
• You can’t confirm the Null Hypothesis!

Frequentist (classical) vs. Bayesian statistics
• A probability value can be thought of in several ways:
  1. Long-term frequency
  2. Degree of belief
  3. Degree of logical support
• Frequentist statistics works with (1), while Bayesian statistics works with (2) and (3)

Frequentist Statistics
- Only repeatable random events (like the result of flipping a coin) have probabilities.
- These probabilities are equal to the long-term frequency of occurrence of the events in question.
- Probabilities cannot be applied to hypotheses or, in general, to any fixed but unknown values.

Bayesian Statistics
- Probabilities can represent the uncertainty in any event or hypothesis.
- Newly collected data narrows down the probability distribution over the parameter.

Example:
The doctor knows that 20% of the population has Disease. A Test shows + for 90% of individuals with Disease but also shows + for 30% of Healthy individuals.
A patient comes into the doctor’s office and takes the test. What is the probability that the patient has Disease?
20% (odds 1:4)
The patient takes the Test and it comes back +. What is the probability that the patient has Disease (given that the Test showed +)?
43% (odds 3:4, i.e. a probability of 3/7)
How? Answered in Lecture 10: Bayesian Inference

Response vs. explanatory variable
• Dependent vs. measured variable
• Example
• What if the variables are independent?
  ‒ Correlation
• What if it is a chicken-and-egg scenario?
  ‒ Select the assignment that fits your model/test
• Why do we care?
  ‒ Formula in R

Response vs. explanatory variable
Explanatory variable type | Response variable type | Example test type
Categorical | Categorical | Fisher test
Categorical (two groups) | Continuous | t-test
Categorical (multiple groups) | Continuous | ANOVA
Continuous | Continuous | Linear regression
Continuous | Categorical (two groups) | Logistic regression

Linear regression
Simple linear regression is used to model the relationship between two continuous variables. Multiple linear regression is used when we have multiple explanatory variables. Beware of overfitting!

Nonlinear regression
• Nonlinear regression is a form of regression analysis in which observational data are modeled by a function that is nonlinear. ;-)
• Polynomial regression
• Poisson distribution
[Figure: negative binomial distribution]

Polynomial regression
• Overfitting

Parametric vs. non-parametric tests
Parametric tests assume underlying statistical distributions in the data. Therefore, several conditions of validity must be met for the result of a parametric test to be reliable. For example, Student’s t-test for two independent samples is reliable only if each sample follows a normal distribution and the sample variances are homogeneous. Nonparametric tests do not rely on any specific distribution.
They can thus be applied even if the parametric conditions of validity are not met. They generally have less statistical power than parametric tests when those conditions do hold.
[Figure: table of parametric vs. nonparametric tests]

Wilcoxon rank-sum test (Mann–Whitney U test)
• Compares the medians of two populations
• No assumptions about the population distributions, but the data must be at least ordinal
  ‒ Beware of ties
• One population can be a sub-sample of the other
  ‒ Do the selected genes have higher expression?

Kruskal–Wallis test

Wilcoxon signed-rank test
• Paired data points

Student's t-test
• One-sided vs. two-sided
• Also a paired version

ANOVA

p-value
• “Many researchers in various areas use standard routines in statistical software in the expectation that the software can condense their research into a single summary (most often a p-value) that ‘objectifies’ their results. This idea of objectivity is in stark contrast with the realization by many of these researchers at some point that depending on individual inventiveness there are many ways to arrive at such a number.”

p-value adjustment
• Multiple testing problem
  ‒ Minimize the false positive error rate

www.ceitec.eu CEITEC @CEITEC_Brno
Thank you for your attention!
60 minutes lunch break.
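
Supplementary sketch for the doctor's-office example on the Frequentist vs. Bayesian slide: the numbers given there (20% prevalence, + for 90% of Disease individuals, + for 30% of Healthy individuals) can be plugged into Bayes' theorem to check the 43% answer. The function and variable names below are illustrative assumptions, not something taken from the lecture; the full treatment is deferred to Lecture 10.

```python
from fractions import Fraction


def posterior(prior, sensitivity, false_positive_rate):
    """Return P(Disease | Test +) using Bayes' theorem."""
    true_pos = sensitivity * prior                  # P(+ and Disease)
    false_pos = false_positive_rate * (1 - prior)   # P(+ and Healthy)
    return true_pos / (true_pos + false_pos)


# Values from the slide, kept as exact fractions.
prior = Fraction(20, 100)                # 20% of the population has Disease
sensitivity = Fraction(90, 100)          # Test shows + for 90% of Disease individuals
false_positive_rate = Fraction(30, 100)  # Test shows + for 30% of Healthy individuals

post = posterior(prior, sensitivity, false_positive_rate)
prior_odds = prior / (1 - prior)         # Disease : Healthy before the test
post_odds = post / (1 - post)            # Disease : Healthy after a + result

print(prior_odds)                        # 1/4
print(post, f"{float(post):.1%}")        # 3/7 42.9%
print(post_odds)                         # 3/4
```

Exact fractions are used so that the odds come out as 1:4 before the test and 3:4 after a positive result, without floating-point rounding.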
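
Supplementary sketch for the p-value adjustment slide: the slide names the multiple testing problem but no specific method, so the two adjustments below (Bonferroni and Benjamini-Hochberg) are assumptions chosen because they are widely used, and the input p-values are hypothetical. Bonferroni controls the chance of any false positive; Benjamini-Hochberg controls the expected fraction of false positives among the rejected hypotheses.

```python
def bonferroni(pvalues):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw p-values from four tests (not from the lecture).
pvals = [0.01, 0.04, 0.03, 0.005]

print([round(p, 4) for p in bonferroni(pvals)])
print([round(p, 4) for p in benjamini_hochberg(pvals)])
```

With these inputs, all four tests stay below 0.05 after Benjamini-Hochberg, while Bonferroni pushes two of them above it, which shows how much stricter it is.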