1. Upload the "salary_data.csv" data set.
2. Build a plot to examine the relationship between the variables. Which variable is the dependent variable (outcome), and which is the independent variable (predictor)?
3. Perform a linear regression analysis (fit a simple linear regression model between the variables). Draw the best-fit regression line.
4. Check the main assumptions of the model using the four main diagnostic plots:
   Plot 1. Linearity of the data, independence of residuals
   Plot 2. Normality of residuals (Q-Q plot)
   Plot 3. Constant variance of residuals
   Plot 4. No influential outliers
5. Check the "Normality of residuals" assumption using a histogram and normality tests, and check the "Zero mean of residuals" assumption. Don't forget to look at the Q-Q plot from the previous step.
6. Obtain the parameters of the regression line (the intercept and the slope) and check their significance. Fill in the checklist.
7. Obtain the criteria for model evaluation (adjusted R-squared, RSE, AIC, the 95% confidence intervals). Fill in the checklist.
8. After checking all the assumptions, what conclusion can you make?
9. Remove the outlier (observation number 5 on the previous plots), which has a high influence on the regression line. To identify the outlier, first look at the histograms of the variables.
10. Delete the outlier using the tidyverse package.
11. Now that you have the data without the outlier, fit the adjusted simple linear regression model (repeat steps 2-7).
12. Fill in the checklist, compare the models, choose the better model, and draw your conclusions.
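The arithmetic behind steps 6, 7, 10 and 11 can be sketched in a few lines. The assignment itself is meant to be done in R with the tidyverse; the block below is only a language-neutral illustration in plain Python. Everything in it is an assumption made for the example: the helper name `fit_simple_ols`, the six made-up data points (they are not taken from salary_data.csv), and the AIC convention, which counts the intercept, the slope and the residual variance as three estimated parameters. Diagnostic plots and normality tests are not covered here.

```python
import math


def fit_simple_ols(x, y):
    """Fit y = alpha + beta * x by ordinary least squares (hypothetical helper)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    beta = sxy / sxx                    # slope of the line
    alpha = mean_y - beta * mean_x      # intercept
    residuals = [yi - (alpha + beta * xi) for xi, yi in zip(x, y)]
    rss = sum(r ** 2 for r in residuals)        # residual sum of squares
    tss = sum((yi - mean_y) ** 2 for yi in y)   # total sum of squares
    r2 = 1 - rss / tss
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - 2)   # one predictor
    rse = math.sqrt(rss / (n - 2))              # residual standard error
    # Gaussian log-likelihood at the fitted values; AIC assumes 3 parameters.
    log_lik = -n / 2 * (math.log(2 * math.pi) + math.log(rss / n) + 1)
    aic = 2 * 3 - 2 * log_lik
    return {
        "alpha": alpha,
        "beta": beta,
        "mean_residual": sum(residuals) / n,
        "adj_r2": adj_r2,
        "rse": rse,
        "aic": aic,
    }


# Made-up illustration data; observation 5 is deliberately an outlier.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 3.9, 6.2, 7.8, 30.0, 12.1]

model_1 = fit_simple_ols(x, y)

# Drop observation number 5 (1-based row number) and refit.
outlier_row = 5
kept = [(xi, yi) for i, (xi, yi) in enumerate(zip(x, y), start=1)
        if i != outlier_row]
x2 = [xi for xi, _ in kept]
y2 = [yi for _, yi in kept]

model_2 = fit_simple_ols(x2, y2)

for name, model in (("Model_version_1", model_1), ("Model_version_2", model_2)):
    print(name, {key: round(value, 3) for key, value in model.items()})
```

With an intercept in the model, the mean of the residuals is zero up to rounding error, so the "Zero mean of residuals" check is a sanity check on the fit rather than an independent assumption. On this toy data, removing the outlier lowers RSE and AIC and raises the adjusted R-squared, which is the same kind of comparison the checklist asks for.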

| Checklist | Model_version_1 | Model_version_2 |
| --- | --- | --- |
| Assumptions after linear regression: | | |
| Plot 1: Linearity of the data, independence of residuals | | |
| Plot 2: Normality of residuals + histogram + normality tests | | |
| Zero mean of residuals | | |
| Plot 3: Constant variance of residuals | | |
| Plot 4: No influential outliers | | |
| Results interpretation and model evaluation: | | |
| Parameters of the regression: intercept (α), slope of the line (β) | | |
| Significance of β and the model | | |
| Criteria for the model evaluation: Adjusted R^2; RSE; 95% CI; AIC | | |

Conclusion based on the chosen model: The assumptions are _______________; the model and the independent variable (_______________) are _________________ (p____________). The _______________ variable explains ________% of the ________________________ variability, RSE equals ______________. The estimate of the β-coefficient equals ________________ (95% CI [_______________________]), the intercept α equals _______. Y(____________) = __________________________________ (for each one-unit shift of ___________________, ____________________ increases by ___________).

| Checklist | Model_version_1 | Model_version_2 |
| --- | --- | --- |
| Assumptions after linear regression: | | |
| Plot 1: Linearity of the data, independence of residuals | met | met |
| Plot 2: Normality of residuals + histogram + normality tests | not met | met |
| Zero mean of residuals | met | met |
| Plot 3: Constant variance of residuals | met | met |
| Plot 4: No influential outliers | not met | met |
| Results interpretation and model evaluation: | | |
| Parameters of the regression: intercept (α), slope of the line (β) | α = -28.63, β = 0.62; Y(productivity) = -28.63 + 0.62*X(salary) | α = -41.75, β = 0.71; Y(productivity) = -41.75 + 0.71*X(salary) |
| Significance of β and the model | p<0.001 | p<0.001 |
| Criteria for the model evaluation: Adjusted R^2; RSE; 95% CI; AIC | R^2[adj.] = 33%, RSE = 7.95 (thousand dollars per year), 95% CI [0.44; 0.79], AIC = 702.5 | R^2[adj.] = 51%, RSE = 6.22 (thousand dollars per year), 95% CI [0.57; 0.85], AIC = 646.7 |

Conclusion based on the chosen model: The assumptions are met; the model and the independent variable (salary) are significant (p<0.001). The salary variable explains 51% of the productivity variability, RSE equals 6.22 thousand dollars per year. The estimate of the β-coefficient equals 0.71 (95% CI [0.57; 0.85]), the intercept α equals -41.75. Y(productivity) = -41.75 + 0.71*X(salary) (for each one-unit shift of salary, productivity increases by 0.71).
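As a small worked example of how the fitted equations in the checklist are read, the sketch below plugs a predictor value into both models. The coefficients are copied from the checklist; the helper name `predict` and the input value of 100 are hypothetical, chosen only for illustration and not taken from salary_data.csv.

```python
def predict(alpha, beta, x):
    """Return the fitted value alpha + beta * x (hypothetical helper)."""
    return alpha + beta * x


# Coefficients as reported in the checklist.
MODEL_1 = {"alpha": -28.63, "beta": 0.62}
MODEL_2 = {"alpha": -41.75, "beta": 0.71}

x_value = 100.0  # hypothetical predictor value, for illustration only

pred_1 = predict(MODEL_1["alpha"], MODEL_1["beta"], x_value)
pred_2 = predict(MODEL_2["alpha"], MODEL_2["beta"], x_value)

# A one-unit shift in the predictor changes the fitted value by beta.
one_unit_effect = predict(MODEL_2["alpha"], MODEL_2["beta"], x_value + 1) - pred_2

print(f"Model_version_1 at X={x_value}: {pred_1:.2f}")
print(f"Model_version_2 at X={x_value}: {pred_2:.2f}")
print(f"Model_version_2 change per one-unit shift: {one_unit_effect:.2f}")
```

The last line reproduces the interpretation in the conclusion: under the chosen model, the fitted outcome rises by the slope (0.71) for each one-unit increase in the predictor, regardless of the starting value.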