Experimental Humanities II (HUMB002) 2016
STATISTICAL ANALYSIS
Lecture 2
Z-SCORES AND OTHER STANDARDIZED SCORES, NORMAL DISTRIBUTION, CORRELATION, SIMPLE LINEAR REGRESSION

Score transformations
•We often transform observed scores for easier understanding and interpretation
•Making interpretation easier – linear transformations
•e.g. multiplying by 10 or 100 to get rid of the decimals
•the shape of the distribution remains unchanged
•descriptive statistics change in a predictable way
•possibility of standardization
•Changing the shape of the distribution – nonlinear transformations
•log/exp functions, square (root), …
•Changing the measurement level – ordinal transformation (ranking)

Nonlinear transformations
•Example: ln transformation, usually applied to make a distribution "more normal"

Ordinal transformation: ranking
•Transformation from observed values to rankings:
•eliminates extreme values, but neglects the magnitude of differences between people
•usually ascending (the lowest value gets rank 1)
•tied values get the average of their ranks (=RANK.AVG)
•the resulting distribution is approximately uniform
•Percentiles – a standardized form of the ranking transformation

Linear transformation – standardization
•The most common transformation: standardization (z-scores)
•scores are transformed so that M=0, SD=1
•the measurement unit becomes the SD – we can easily compare scores from different scales (but differences in distribution shape remain!)
•zi = (Xi – M) / SD
•Scores derived from z-scores:
•T scores: M=50, SD=10; Ti = 50 + 10zi
•IQ scores: M=100, SD=15; IQi = 100 + 15zi
•Stens (Standard TENs, cz: steny): M=5.5, SD=2; Steni = 2zi + 5.5
•Stanines (STAndard NINEs, cz: staniny): M=5, SD=2; Staninei = 2zi + 5
•A normal distribution is required for the correct interpretation of standardized scores!

Psychodiagnostic calculator
•Online tool for score transformations
•Developed by Hynek Cígler and Martin Šmíra from the Department of Psychology, Faculty of Social Studies MU
•http://kalkulacka.testforum.cz/transformace-skoru
•In Czech only

Normal distribution (Gaussian, bell curve)
•The distribution of natural phenomena influenced by many independent factors – many variables follow it
•The distribution of random errors
•Advantage: we can estimate how many of which values occur in the target population
•Many statistical procedures work with the normal distribution (-> many statistical procedures require normally distributed data)

Characteristics of normal distribution
•Symmetrical, unimodal
•Mean = median = mode
•Skewness = 0
•Kurtosis = 3
•Standardized normal distribution: scores transformed to z-scores (M=0, SD=1)

Computing quantiles in Excel
•NORM.S.DIST(z;1) – returns the percentile corresponding to a given z-score (= the proportion of people with the same or a lower z-score)
•Percentage of people between two given z-scores: NORM.S.DIST(higher z;1) minus NORM.S.DIST(lower z;1)
•NORM.S.INV(p) – returns the z-score corresponding to a given percentile
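The same transformations and quantile lookups can be scripted outside Excel. Here is a minimal Python sketch (assuming scipy is available; scipy.stats.norm.cdf and norm.ppf play the roles of Excel's NORM.S.DIST(z;1) and NORM.S.INV(p)):

    from scipy.stats import norm

    def standardize(x, m, sd):
        """z-score: how many SDs the value x lies from the mean m."""
        return (x - m) / sd

    z = standardize(7, m=5, sd=2.2)       # 0.909...
    t_score = 50 + 10 * z                 # T score (M=50, SD=10)
    iq = 100 + 15 * z                     # IQ metric (M=100, SD=15)
    sten = 5.5 + 2 * z                    # sten (M=5.5, SD=2)
    stanine = 5 + 2 * z                   # stanine (M=5, SD=2)

    percentile = norm.cdf(z)              # like NORM.S.DIST(z;1) -> ~0.82
    z_back = norm.ppf(percentile)         # like NORM.S.INV(p) -> 0.909...
    within_1sd = norm.cdf(1) - norm.cdf(-1)   # share of people within +/-1 SD, ~0.68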
Example
•John scored 7 on a math test and 13 on an English test. The math test has M=5 and SD=2.2; the English test has M=9 and SD=3.6. On which of the tests did John do better?
•Math
•z-score: (7-5)/2.2 = 0.91
•T-score: 50 + 10*0.91 = 59.1
•percentile: NORM.S.DIST(0.91;1) = 0.82 = 82nd percentile
•on the math test, 82% of the other students scored the same as John or lower
•English
•z-score: (13-9)/3.6 = 1.11
•T-score: 50 + 10*1.11 = 61.1
•percentile: NORM.S.DIST(1.11;1) = 0.87 = 87th percentile
•on the English test, 87% of the other students scored the same as John or lower
•So relative to the other students, John did better in English.

ASSOCIATIONS BETWEEN VARIABLES
CORRELATION

Associations between variables
•The statistical treatment of associations between variables depends on the measurement level: categorical vs. metric variables

Variables classification according to their function in the association
•We are usually interested in causal relationships
•Statistics itself cannot detect or test causality
•Causality can be established only by the research design and theoretical assumptions
•Variables classification:
•dependent, independent, intervening
•exogenous, endogenous, mediators, moderators
•Usually we can't identify all intervening variables…
[diagram: independent variable, dependent variable, moderator, mediator, intervening variable]

Compound bar chart
[figure: compound bar chart]

Contingency table

                          Math grades
                      1    2    3    4    5   Total
Czech language   1   82   40    8    1    0     131
grades           2   71  200   73   17    0     361
                 3    4   75  109   25    0     213
                 4    1    7   23   24    1      56
                 5    0    0    2    1    2       5
Total               158  322  215   68    3     766

•Usable for any variables, but most suitable for discrete variables with few values
•The cells can contain either absolute or relative frequencies (row, column, and total relative frequencies)
•The last row and column contain the so-called row/column marginal frequencies
•A graphical representation of a contingency table is a 3D bar chart or 3D histogram
•High frequencies on the diagonal indicate a linear relationship between the variables

Scatterplot
•Substitutes for the contingency table with continuous variables
•Each axis represents one variable
•Each point represents one subject (unit)
•The frequency of identical value pairs can be represented e.g. by the dot size

Different forms of associations
[figure: scatterplots illustrating different forms of association]

LINEAR RELATIONSHIPS

Linear association
•A monotonic relationship that can be put into words as: the higher X, the higher/lower Y
•By "correlation" we usually mean linear association
•In the scatterplot, an "ideal" line can be placed through the points
•The linear function (line) Y = a + bX describes the slope of the association
•Correlation describes the strength of the linear association (cz: těsnost vztahu)

Strength of association
•The stronger the association, the closer the points lie to the line
•The strength of the association is not related to the slope of the line
•The strength of the association can be described by a correlation coefficient ranging from -1 to 1
•-1 means maximum negative association, 0 means no association, 1 means maximum positive association
•positive values: the higher X, the higher Y
•negative values: the higher X, the lower Y
[figure: scatterplots illustrating correlations of different strengths]

Covariance (shared variance)
•Covariance expresses the extent of the shared variance
•It is a numerical expression of association strength
•covxy = Σ x*y / (n – 1), where x and y are deviation scores (deviations from the mean): x = Xi – Mx, y = Yi – My
•covariance is not very practical, similarly to variance, because it is expressed in a "squared" unit

Remember the formula for the variance: Σx2 / (n – 1). The covariance formula is Σ x*y / (n – 1). So, instead of x*x, we have x*y here – that is why it is called co-variance. The sum gets higher with every xy pair in which both the x and the y value are above average, or both below average. The sum gets lower with every xy pair in which one of the values is above average and the other below average.
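To make the parallel between variance and covariance concrete, here is a minimal Python sketch (hypothetical example data; numpy assumed available) that computes the covariance from deviation scores and checks it against numpy's built-in sample covariance:

    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([2.0, 1.0, 4.0, 3.0, 5.0])

    dx = x - x.mean()                          # deviation scores for X
    dy = y - y.mean()                          # deviation scores for Y

    var_x = np.sum(dx * dx) / (len(x) - 1)     # variance: sum of x*x over n-1
    cov_xy = np.sum(dx * dy) / (len(x) - 1)    # covariance: sum of x*y over n-1

    assert np.isclose(cov_xy, np.cov(x, y)[0, 1])   # matches numpy's sample covariance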
Correlation (= standardized shared variance)
•rxy = covxy / (SDx * SDy) – the covariance rescaled by both standard deviations

Characteristics of Pearson's correlation coefficient
•A deviation statistic:
•interval or higher measurement level required
•great impact of extreme values
•suitable for describing normally distributed variables (an approximately normal distribution of both variables is required)
•expresses only the strength of the association, not causality!!!
•Takes values between -1 and +1
•0 = no association
•+1 (-1) = perfect positive (negative) association = identity of the variables = a direct (inverse) relationship
•r2 = coefficient of determination (R2, D) = proportion of shared variance
•Consequence: the difference between r = 0.3 and r = 0.1 is not the same as between r = 0.7 and r = 0.5 (0.09 - 0.01 = 0.08 vs. 0.49 - 0.25 = 0.24 of shared variance)
•r = 0 doesn't mean there is no relationship between the variables; it only means there is no linear association between them

Computing correlation
1.Check the assumptions: interval or higher measurement level, normal distribution of both variables, extreme values, assumption of a linear relationship (plot a scatterplot and histograms)
2.Compute z-scores for all observed scores – you will need M and SD of both variables: zi = (Xi – M) / SD
Excel: =AVERAGEA(data), =STDEVA(data), =STANDARDIZE(X;M;SD) (Czech-locale names: =PRŮMĚR(data), =SMODCH.VÝBĚR.S(data))
3.Compute the correlation as the average product of the paired z-scores:
rxy = [(-1.52*1.67) + (-1.01*1.01) + (0.00*0.02) + (1.52*-0.97) + (0.51*-0.97) + (0.01*-0.31) + (0.25*-0.47)] / (7-1) = -0.94
Excel: =COVARIANCE.P(var1, var2), =COVARIANCE.S(var1, var2), =CORREL(var1, var2)

Characteristics of Pearson's correlation coefficient
•When does correlation not make sense?
•Q1: How many hours a day do you watch TV?
•Q2: How many hours a day do you watch TV news?
•…why?
•Correlation of variables with a common cause:
•priests' salaries and vodka prices correlate
•children's IQ and their height correlate as well…
•…why?
•Age and number of birthdays…
•Covariance of variables with a common cause is the basis of other analytical methods: scale reliability analysis and factor analysis

Rank (ordinal) correlation coefficients
•Suitable not only for ordinal data, but also for interval data that deviate from a normal distribution
•Also capture nonlinear monotonic relationships
•Measure to what extent the rankings on the two correlated variables agree
•Spearman rho coefficient – ρ, rs
•based on the magnitude of the differences in rankings
•the ordinal equivalent of Pearson's correlation
•r2 can be interpreted
•usually used as a more robust variant of Pearson's r
•calculated in the same way as Pearson's r, but on the rankings
•Kendall tau coefficient – τ (+ "b" and "c" variants)
•based on the number of values "out of order"
•no effect of outliers
•the b and c variants deal with tied rankings
[figure: illustration of Spearman correlation – https://upload.wikimedia.org/wikipedia/commons/thumb/4/4e/Spearman_fig1.svg/300px-Spearman_fig1.svg.png]

Kendall rank correlation: example
τ = (K – D) / [N(N – 1)/2] = (3 – 7) / (5*4/2) = -4/10 = -0.4
(K = number of pairs ranked in the same order on both variables, D = number of pairs ranked "out of order", N = 5)

Partial correlation: partitioning variance
[diagram: overlapping variance of variable A, variable B and variable C]
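The coefficients above can be compared directly in Python (a minimal sketch with hypothetical data; scipy assumed available). It also verifies "by hand" that Pearson's r is the average product of the paired z-scores:

    import numpy as np
    from scipy.stats import pearsonr, spearmanr, kendalltau

    x = np.array([10.0, 12.0, 14.0, 20.0, 16.0, 14.1, 15.0])
    y = np.array([8.0, 6.0, 3.0, 1.0, 1.1, 2.5, 2.0])

    # Pearson r "by hand": average product of z-scores (ddof=1 for the sample SD)
    zx = (x - x.mean()) / x.std(ddof=1)
    zy = (y - y.mean()) / y.std(ddof=1)
    r_by_hand = np.sum(zx * zy) / (len(x) - 1)

    r, _ = pearsonr(x, y)        # linear association
    rho, _ = spearmanr(x, y)     # monotonic association, based on rankings
    tau, _ = kendalltau(x, y)    # based on pairs "in order" vs. "out of order"

    assert np.isclose(r_by_hand, r)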
STATISTICAL PREDICTION
LINEAR REGRESSION

Statistical prediction
•Statistical prediction is a qualified estimate of the most probable value of a variable, made from data we already know by modelling the relationship between that variable and its correlates
•From one (or more) variables (predictors, independent variables) we try to predict another variable (the predicted, dependent variable)
•E.g. how well can an intelligence test taken at age 10 predict grades at the end of high school?
•We make a model: we collect data on both variables, i.e. intelligence test results and high school grades from the same people (so we already need them to be finishing high school).
•If the model works well, we can use the intelligence test scores of 10-year-old children to predict their future grades…

•Example 1: Imagine that all students have exactly the same grades in math and physics. The variables would be identical:
•What would be the value of the correlation between the variables?
•r = 1
•What would be the value of the coefficient of determination?
•r2 (R2, D) = 12 = 1
•What would be the proportion of shared variance? What does it mean?
•R2 * 100 = 1 * 100 = 100%
•It means that we can predict 100% of the math grades correctly from the physics grades (or the other way around).
•Which of the information above would change if all students had exactly the opposite math grades to their physics grades?
•r = -1, R2 = (-1)2 = 1, R2 * 100 = 100%
•Of course, we can usually predict with much less precision, but we try to predict as precisely as possible. For that, we need correlates that are highly correlated with the predicted variable.

Statistical prediction
•Example 2: What score on an intelligence test could we predict for a random respondent, if we know that the test has an approximately normal distribution with M=100 and SD=15? (Without any further information, the most probable value is the mean, 100.)
•What information could make our prediction more precise?
•Height?
•Education?
•Score on a memory test?
•…

Statistical prediction
[figure: scatterplot – prediction of middle finger length from index finger length]

Linear regression
•For prediction, we need a function (how to compute variable Y from the known variable X)
•For linear regression – prediction based on a linear relationship – it is the linear equation Y' = a + bX (a straight line, the regression line)
•We model the linear function: we estimate the values of variable Y by evaluating the linear equation on the values of variable X
•The estimated value of Y is denoted Y'
•Regression of Y on X: Y = Y' + e = f(X) + e, where e = Y – Y'
•e is the residual, Y is the dependent variable, X is the independent variable (predictor)
•e represents all sources of variance other than X

Linear regression
•If a Pearson correlation describes the relationship between two variables well, we can express the relationship by a linear function:
•Y' = a + bX; Y = Y' + e = a + bX + e
•a = intercept (cz: průsečík), b = slope (cz: směrnice)
•a, b – regression coefficients
•How can we find the best regression line?
•by least squares estimation – we minimize the sum of the squared residuals
•b = rxy(SDy/SDx)
•a = My – bMx
•if the values of X and Y are in z-scores, then b = rxy
•The line goes through the point (Mx, My)
•The sum of the residuals is zero; the sum of the squared residuals is the least possible

Linear regression
•Mm = 7.109; SDm = 0.843 (middle finger length – dependent variable Y)
•Mi = 6.983; SDi = 0.658 (index finger length – predictor X)
•rmi = 0.917
•b = rxy(SDy/SDx) = 0.917(0.843/0.658) = 1.175
•a = My – bMx = 7.109 – 1.175*6.983 = -1.096
•Y' = 1.175*X – 1.096

Prediction of middle finger length from index finger length
Predicted values:

  IF    MF    MF'
  6.5   6.4   6.5413
  7     7     7.1291
  7.5   7.5   7.7169
  5.2   4.8   5.0130
  6.6   6.7   6.6589
  6.6   6.8   6.6589
  7     7     7.1291
  6.8   ?

Y' = 1.175*X – 1.096 = 1.175*6.8 – 1.096 = 6.894

Distribution of predicted values
•MMF' = 7.109 = MMF
•SDMF' = 0.773 (= r * SDMF)
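The slope and intercept formulas can be applied directly in code. A minimal Python sketch using the summary statistics reported above:

    # Summary statistics from the slides
    m_y, sd_y = 7.109, 0.843     # middle finger length (dependent variable Y)
    m_x, sd_x = 6.983, 0.658     # index finger length (predictor X)
    r_xy = 0.917                 # Pearson correlation between X and Y

    b = r_xy * (sd_y / sd_x)     # slope: ~1.175
    a = m_y - b * m_x            # intercept: ~-1.096

    def predict(x):
        """Predicted middle finger length: Y' = a + bX."""
        return a + b * x

    print(predict(6.8))          # ~6.894, as in the table above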
Linear regression: model fit
•How good, i.e. how precise, are the predicted values?
•Precision = the smallest possible residuals
•How large are the residuals?

  IF    MF    MF'      e = (MF - MF')
  6.5   6.4   6.5413   -0.1413
  7     7     7.1291   -0.1291
  7.5   7.5   7.7169   -0.2169
  5.2   4.8   5.0130   -0.2130
  6.6   6.7   6.6589    0.0411
  6.6   6.8   6.6589    0.1411
  7     7     7.1291   -0.1291

Distribution of residuals
•Me = 0
•SDe = 0.337

Linear regression: model fit
•sy2 = sreg2 + sres2 (total variance = variance of the predicted values + residual variance)
•R2 = sreg2 / sy2 … sres2 = sy2(1 – R2)
•Coefficient of determination: R2
•the proportion of explained variance
•a measure of the model's fit to the data (the success of the regression)
•For simple linear regression: R2 = r2

Linear regression: assumptions
•The assumptions are the same as for the Pearson correlation:
•the basic assumption: the relationship really is linear
•the residuals have a normal distribution with M=0 and SD = sres
•this means that about 95% of the prediction errors lie approximately between -2sres and +2sres
•homoscedasticity (cz: homoskedasticita): the residual variance is constant – it does not change with increasing X
•The validity of the model depends on the data from which it was derived – beware of extrapolating beyond them
•Watch out for extreme values (as with all deviation statistics)

Other regression types
•Simple linear regression: one independent and one dependent variable
•Multiple linear regression: several independent variables (predictors)
•Y = a + b1X1 + b2X2 + … + bmXm
•complicated by the relationships among the predictors
•Logistic regression:
•the dependent variable is dichotomous (nominal)
•predicts the probability of the dependent variable's values
•If the relationship isn't linear:
•we can try to transform the variables so that the relationship becomes linear
•we can divide the sample into subgroups within which the relationship is linear
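For illustration, multiple linear regression can be fitted by least squares just like the simple case. A minimal Python sketch with hypothetical data (numpy assumed available; the intercept a and the slopes b1, b2 are obtained in one step from a design matrix):

    import numpy as np

    # Hypothetical data: two predictors X1, X2 and one dependent variable Y
    x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
    x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])
    y  = np.array([3.1, 3.9, 7.2, 7.8, 11.1, 11.9])

    # Design matrix with a column of ones for the intercept: Y = a + b1*X1 + b2*X2 + e
    X = np.column_stack([np.ones_like(x1), x1, x2])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)   # least squares -> [a, b1, b2]

    y_hat = X @ coef                               # predicted values Y'
    e = y - y_hat                                  # residuals
    r2 = 1 - e.var(ddof=1) / y.var(ddof=1)         # coefficient of determination R^2

    # The variance decomposition from the model-fit slide holds for least squares:
    assert np.isclose(y.var(ddof=1), y_hat.var(ddof=1) + e.var(ddof=1))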