Measures of association and effect
Hynek Pikhart

Revision - Measures of disease frequency
• Used for binary outcomes
• Require a numerator and a denominator:

      number of persons with disease
      ------------------------------
      number of persons examined

• Expressed as X per 1,000 persons (or per 100,000, etc.)

Prevalence
• Number of existing cases / population of interest at a defined time

Incidence
• Number of new cases in a given time period / total population at risk

Measures of association
• Risk of disease, rate of disease in different groups of the population
• Comparison of risks/rates

Constructing 2-way table
For binary health outcomes (Y/N), it is possible to construct a 2x2 table and to estimate either relative or absolute measures of risk.

                   Disease
Exposure     Yes      No       Total
Yes          a        b        a+b
No           c        d        c+d
Total        a+c      b+d      a+b+c+d

Relative measures of effect (relative risk)
We have 2 groups of individuals:
• An exposed group (with the risk factor of interest) and an unexposed group (without the factor of interest)
• We are interested in comparing the amount of disease (mortality or other health outcome) in the exposed group with that in the unexposed group

Risk/rate
• Incidence rate or risk in the exposed (r1)
• Incidence rate or risk in the unexposed (r0)

Risk ratio
• We calculate the risk ratio (RR) as: RR = r1 / r0

Risk difference
• The absolute difference between two risks (or rates): RD = r1 - r0

Example: Alcohol drinking and heart attack

                        Heart attack
Alcohol drinking    Yes      No       Total
Yes                 25       400      425
No                  75       1500     1575
Total               100      1900     2000

Risk (exposed)   = 25/425  = 0.059
Risk (unexposed) = 75/1575 = 0.048
Relative risk    = 0.059/0.048 = 1.23

• We can also have different strata of exposure. We may calculate ratio measures for each stratum: we compare the measure of frequency in each level with the measure of frequency in the baseline (unexposed) level.

Example: Death rates from CHD in smokers and non-smokers by age

Age         Smokers rate   Non-smokers rate   Rate ratio
35-44            0.61             0.11            5.5
45-54            2.40             1.12            2.1
55-64            7.20             4.90            1.5
65-74           14.69            10.83            1.4
75-84           19.18            21.20            0.9
85+             35.93            32.66            1.1
All ages         4.29             3.30            1.3

What can you say about this table?
• The rate ratio decreases with increasing age, from 5.5 at ages 35-44 to about 1 at ages 75 and over.
• This may also suggest that the relative effect of smoking on the rate of CHD is larger at younger ages.

Alternative measure of risk: odds ratio
The odds of disease is the number of cases divided by the number of non-cases:

      Odds = cases / non-cases

The odds ratio (OR) is the ratio of the odds of disease among the exposed (odds_exp) to the odds of disease among the unexposed (odds_unexp):

      OR = odds_exp / odds_unexp

Using the alcohol drinking and heart attack table above, we can calculate:
• Odds (exposed):   O_exp   = 25/400  = 0.0625
• Odds (unexposed): O_unexp = 75/1500 = 0.05
• Odds ratio:       OR = O_exp / O_unexp = 1.25

Odds ratio as an approximation to the risk ratio
• For a rare disease, the odds ratio is approximately equal to the risk ratio (because the denominators are very similar)
• For common conditions, the OR lies further from 1 than the true RR (it overestimates an RR above 1)
• Rare disease → OR ≈ RR: the RR divides cases by the whole population (N population), the OR divides cases by the non-cases (N controls), and for a rare disease these two denominators are nearly equal

If disease is common:

Disease     Exposed   Unexposed   Total
Yes           50         25         75
No            50         75        125
Total        100        100        200

R1 = 50/100 = 0.5     R0 = 25/100 = 0.25     RR = 2.0
O1 = 50/50  = 1.0     O0 = 25/75  = 0.33     OR = 3.0
(in the notation of the general 2x2 table, O1 = a/b and O0 = c/d)

Measure of effect: use of the measure and how to interpret results

Risk difference
• Use: public health; interested in the excess disease burden due to the factor ("attributable risk")
• Interpretation: close to 0 = little effect; large difference = large effect

Risk ratio
• Use: epidemiology, causation; "This factor doubles the risk of the disease"
• Interpretation: close to 1 = little effect; large ratio = large effect; close to 0 = large effect!

Odds ratio
• Use: as for the risk ratio; "This factor doubles the odds of the disease"
• The only possibility in a case-control study
• Used in more advanced statistical methods (logistic regression)

Example
• A random sample of individuals was questioned about their occupation, and their BP was measured. Based on the SBP and DBP measurements they were classified as hypertensive or non-hypertensive.
• Among 300 people in non-manual jobs, there were 72 hypertensive individuals. Among 240 people in manual jobs, there were 96 hypertensive individuals.

Constructing 2-way table
As a first step we need to organize our data in a formal way: we construct a 2-way table.

                  Hypertension
              Yes      No       Total
Manual        96       144      240
Non-manual    72       228      300
Total         168      372      540

What does it mean when we speak about an association between two categorical variables?
• It means that knowing the value of one variable tells us something about the value of the other variable.
• Two variables are therefore said to be associated if the distribution of one variable varies according to the value of the other variable.
• In our example, the two variables, occupation and hypertension, are associated if the distribution of hypertension varies between occupational groups.
• If the distribution of hypertension is the same in both occupational groups, we can say that there is no association between hypertension and occupational category, because knowing an individual's occupational category will not tell us anything about hypertension.
• Having constructed a two-way table, the next step is to look at whether the distribution of one variable differs according to the value of the other variable.
• We need to calculate either row or column percentages.
• Often, one variable can be regarded as the response variable and the other as the explanatory variable; this should help to decide which percentages to show.
• If the columns represent the explanatory variable, then column percentages are more appropriate, and vice versa.

Constructing 2-way table
As a second step we calculate the proportion of hypertensive individuals among manual workers, among non-manual workers and in the whole sample.

                      Hypertension
              Yes            No             Total
Manual        96 (40.0%)     144 (60.0%)    240
Non-manual    72 (24.0%)     228 (76.0%)    300
Total         168 (31.1%)    372 (68.9%)    540

• The numbers in the four categories of the 2-way table above are called OBSERVED NUMBERS.
• The data seem to suggest some association between hypertension and occupation (40% of manual workers with hypertension compared with 24% of non-manual workers).
• The calculation and examination of such percentages is an essential step in the analysis of a two-way table, and should always be done before starting formal significance tests.
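The row percentages above, and the effect measures defined earlier (risk ratio, risk difference, odds ratio), all follow from the four observed counts. A minimal Python sketch, assuming the a, b, c, d layout of the general 2x2 table; the function name `two_by_two_measures` is illustrative and not part of the lecture, and treating manual work as the "exposed" row is an assumption made only for this example:

```python
def two_by_two_measures(a, b, c, d):
    """Effect measures from a 2x2 table.

    a = exposed with disease,   b = exposed without disease
    c = unexposed with disease, d = unexposed without disease
    """
    r1 = a / (a + b)  # risk in the exposed group
    r0 = c / (c + d)  # risk in the unexposed group
    return {
        "risk_exposed": r1,
        "risk_unexposed": r0,
        "risk_ratio": r1 / r0,
        "risk_difference": r1 - r0,
        "odds_ratio": (a / b) / (c / d),
    }


# Hypertension by occupation (manual workers taken as the exposed row:
# an assumption for illustration).
hypertension = two_by_two_measures(96, 144, 72, 228)

# Alcohol drinking and heart attack, from the earlier example.
alcohol = two_by_two_measures(25, 400, 75, 1500)

for name, result in [("hypertension", hypertension), ("alcohol", alcohol)]:
    print(name)
    for measure, value in result.items():
        print(f"  {measure}: {value:.3f}")
```

For the hypertension table the two risks come out as 0.400 and 0.240, matching the row percentages. For the alcohol table the unrounded risk ratio is about 1.235; the 1.23 quoted earlier comes from dividing the rounded risks 0.059 and 0.048.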
Significance test for the association
• Although it seems that there is an association in the table, the question is whether this may be attributable to sampling variability
• Each of the percentages in the table is subject to sampling error, and we need to assess whether the differences between them may be due to chance
• This is done by conducting a significance test
• The null hypothesis is "there is no association between the two variables"

Expected numbers
• The significance test is the chi-squared test
• This test compares the observed numbers in each of the four categories of the contingency table with the numbers to be expected if there were no difference in the proportion of hypertensive individuals between the two occupational groups
• From the table above, the overall proportion of hypertensive individuals is 168/540 (31.1%)
• If the null hypothesis were true, the expected number of manual subjects with hypertension would be 31.1% of 240, which is 74.64 (using the unrounded proportion 168/540 gives 74.67)
• Expected numbers in the other cells of the table can be calculated similarly, using the general formula:

                        Row total x Column total
      Expected number = ------------------------
                             Overall total

                      Hypertension
              Yes            No             Total
Manual        74.64          165.36         240
Non-manual    93.36          206.64         300
Total         168 (31.1%)    372 (68.9%)    540

Next step: compare observed and expected numbers

              Hypertension: observed (expected)
              Yes            No              Total
Manual        96 (74.64)     144 (165.36)    240
Non-manual    72 (93.36)     228 (206.64)    300
Total         168            372             540

Chi-squared test (χ² test)
• Calculate (O - E)² / E for each cell and sum over all cells:

      χ² = Σ [ (O - E)² / E ]

• In our example:

      χ² = (96 - 74.64)²/74.64 + (144 - 165.36)²/165.36 + (72 - 93.36)²/93.36 + (228 - 206.64)²/206.64 = 15.97

  (with unrounded expected numbers χ² is about 15.93; the conclusion is unchanged)
• If the χ² value is large, then (O - E) is, in general, large and the data do not support H0: there is evidence of an association
• If the χ² value is small, then (O - E) is, in general, small and the data are consistent with H0: there is no evidence of an association
• Large values of χ² suggest that the data are inconsistent with the null hypothesis, and therefore that there is an association between the two variables

Obtaining p-value
• Under H0, the test statistic follows a χ² distribution
• The P-value is obtained by referring the calculated value of χ² to tables of the chi-squared distribution
• The P-value corresponds to the tail probability shown in the tables
• The degrees of freedom are given by the formula d.f. = (r - 1) x (c - 1), where r = number of rows and c = number of columns

Back to our example:
• χ² = 15.97
• The table is 2x2, so d.f. = 1
• From the table, P < 0.001

Larger tables (r x c tables)

      χ² = Σ [ (O - E)² / E ]        d.f. = (r - 1) x (c - 1)

• Valid if fewer than 20% of the expected numbers are under 5 and none is less than 1
• If expected numbers are low, combine either rows or columns to overcome this problem
• The expected number in a particular cell is calculated as before: row total x column total / overall total

Interpretation of chi-square test results: Chi-squared tests in STATA
• We try to evaluate whether there is an association between current smoking and age
• Age is grouped into 4 groups (30-39, 40-49, 50-59, 60-69)
• Smoking (variable smok) is coded 1 = current smoker, 0 = non-smoker

Let's check the proportion of smokers in each age category (tab smok agegroup, col) and then add the chi-squared test:

. tab smok agegroup, col chi

           |          30-39,40-49,50-59,60-69
1=yes 0=no |        30         40         50         60 |     Total
-----------+--------------------------------------------+----------
         0 |       337        357        490        491 |     1,675
           |     54.71      56.31      72.38      78.81 |     65.69
-----------+--------------------------------------------+----------
         1 |       279        277        187        132 |       875
           |     45.29      43.69      27.62      21.19 |     34.31
-----------+--------------------------------------------+----------
     Total |       616        634        677        623 |     2,550
           |    100.00     100.00     100.00     100.00 |    100.00

          Pearson chi2(3) = 118.7458   Pr = 0.000

Reading the last line of the output:
• chi2(3): 3 is the degrees of freedom
• 118.7458 is the chi-squared test value
• Pr = 0.000 means p < 0.001

Measures of population impact
• Population attributable risk (PAR) is the absolute difference between the risk (or rate) in the whole population and the risk (or rate) in the unexposed group:

      PAR = r - r0

Population attributable risk fraction (PARF, PAR% or PAF)
• It is a measure of the proportion of all cases in the study population (exposed and unexposed) that may be attributed to the exposure, on the assumption of a causal association
• It is also called the aetiologic fraction, the percentage population attributable risk or the attributable fraction
• If r is the rate in the total population:

      PAF = PAR / r = (r - r0) / r

Exercise
• 50 persons attended a garden party
• 25 of them developed diarrhoea in the next 3 days
• What was the risk of diarrhoea among the participants of the party?

Exercise (cont.)
• 30 party visitors had a BBQ (minced meat)
• 24 of them developed diarrhoea
• 20 people did not eat BBQ
• 1 of them developed diarrhoea
• How would you calculate the RR related to eating BBQ?

Exercise (cont.)
• Risk among unexposed: R0 = 1/20
• Risk among exposed: R1 = 24/30
• Relative risk: RR = R1/R0 = (24/30)/(1/20) = 16
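
The chi-squared calculation described earlier in this section (expected numbers from the row and column totals, then the sum of (O - E)²/E over all cells) can be sketched in plain Python. This is an illustrative sketch, not the lecture's own code: the function names `chi_squared` and `chi2_p_value` are made up here, and the P-value uses a closed-form expression for the upper tail of the chi-squared distribution that applies only to whole-number degrees of freedom.

```python
import math


def chi_squared(observed):
    """Return (statistic, degrees of freedom, expected counts) for an r x c table."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    # Expected number = row total x column total / overall total
    expected = [[r * c / total for c in col_totals] for r in row_totals]
    statistic = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, df, expected


def chi2_p_value(statistic, df):
    """Upper-tail probability of a chi-squared distribution with integer df."""
    z = statistic / 2.0
    if df % 2 == 0:
        # Even df: exp(-z) times a finite sum of z^j / j!
        terms = sum(z ** j / math.gamma(j + 1) for j in range(df // 2))
        return math.exp(-z) * terms
    # Odd df: start from erfc(sqrt(z)) (the df = 1 case) and add correction terms
    terms = sum(z ** (j + 0.5) / math.gamma(j + 1.5) for j in range((df - 1) // 2))
    return math.erfc(math.sqrt(z)) + math.exp(-z) * terms


# Hypertension by occupation (rows: manual, non-manual; columns: yes, no)
hyp_stat, hyp_df, hyp_expected = chi_squared([[96, 144], [72, 228]])
hyp_p = chi2_p_value(hyp_stat, hyp_df)

# Smoking by age group, counts from the Stata output (rows: smok = 0, smok = 1)
smk_stat, smk_df, _ = chi_squared([[337, 357, 490, 491], [279, 277, 187, 132]])
smk_p = chi2_p_value(smk_stat, smk_df)

print(f"hypertension:   chi2({hyp_df}) = {hyp_stat:.2f}, p = {hyp_p:.5f}")
print(f"smoking by age: chi2({smk_df}) = {smk_stat:.4f}, p = {smk_p:.3g}")
```

Because the expected numbers are not rounded here, the hypertension statistic comes out at about 15.93 rather than the 15.97 obtained from the rounded 31.1%; both give P < 0.001 on 1 degree of freedom. The smoking table should agree with the Stata value of 118.7458 on 3 degrees of freedom.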