---
title: "1st homework assignment"
output: pdf_document
highlight: zenburn
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

## Task 1 - cleaning data (2 points)

Work with the **customer_behaviour** dataset.

```{r}
load("customer_behaviour.RData")
```

The dataset has 5 columns, and each row represents an individual customer: *money_spent* is the average amount of money the customer spends during one visit, *age* is self-explanatory, *web_visits* is the number of times per month the customer checks the shop website, *mail_ads* is the number of advertisement emails the customer receives per month, and *shop_visits* is the number of times per month the customer visits a shop in person.

Explore each variable and **delete** any rows that contain mistakes. **Do not fix the mistakes, delete whole rows.**

| number of rows in the cleaned dataset |
| -------------- |
| `insert number` |

## Task 2 - descriptive statistics (3 points)

Work with the cleaned dataset from the previous month, **customer_behaviour2**.

```{r}
load("customer_behaviour2.RData")
```

First, create a new variable called `big` where each value equals either 1 (if the customer spent more than 5000 USD) or 0 (if the customer spent 5000 USD or less):

```{r}
# insert your code here
```

Plot two boxplots of the variable *money_spent* in one figure: the first for observations with *big* equal to 0, the second for observations with *big* equal to 1. Then create a histogram of the variable *money_spent* together with its kernel density estimate.

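As a hint, the indicator variable and the two plots described above can be sketched on made-up data. The data frame `demo` and its columns `spend` and `flag` are hypothetical stand-ins, not the homework dataset, and the log-normal distribution is only an assumption chosen to produce skewed values.

```r
# Synthetic stand-in data (assumption): NOT the homework dataset.
set.seed(1)
demo <- data.frame(spend = rlnorm(200, meanlog = 8, sdlog = 0.6))

# Indicator: 1 if spend is strictly above the threshold, 0 otherwise.
demo$flag <- as.integer(demo$spend > 5000)

# One figure containing one boxplot per value of the indicator.
boxplot(spend ~ flag, data = demo, xlab = "flag", ylab = "spend")

# Histogram on the density scale (freq = FALSE) so that the kernel
# density curve drawn on top shares the same vertical axis.
hist(demo$spend, freq = FALSE, main = "spend", xlab = "spend")
lines(density(demo$spend))
```

Setting `freq = FALSE` matters here: with the default count scale, the density curve would sit flat along the bottom of the histogram.
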
```{r, echo = FALSE, out.width="49%"}
# insert your code here
```

Finally, compute the following numerical characteristics of the variable *age*:

| mean | median | $1^{st}$ quartile | $3^{rd}$ quartile | interquartile range | variance |
| -------------- | --------------- | --------------- | --------------- | --------------------- | -------------- |
| `insert` | `insert` | `insert` | `insert` | `insert` | `insert` |

Choose one appropriate measure of location and one appropriate measure of variability for the *money_spent* variable. Enter the name of each measure in the following table and briefly explain why you chose it.

| measure of location | measure of variability |
|-----------------|-------------------|
| `insert name` | `insert name` |
| `insert explanation` | `insert explanation` |

## Task 3 - correlation (2 points)

Compute the correlation matrix of the data from the previous task (excluding the *money_spent* and *big* variables) and the sum of all its diagonal elements. Explain the result of the sum:

```{r, include = FALSE}
# compute the correlation matrix here:

# compute the sum of all diagonal elements here:

```

| Sum of diagonal elements | Explanation |
|---------------|--------------------------------|
| `number` | `text` |

Compute and **interpret** the correlation coefficients between the following variables:

| Variables | Results | Interpretation |
|-----------------|-----------|--------------------------------------|
| `example` | 0 | The correlation is zero, which means... |
| `money_spent`, `age` | `insert` | `insert` |
| `money_spent`, `web_visits` | `insert` | `insert` |

An interpretation of the form "the correlation coefficient is 0.8, which means the correlation is high" will not be accepted.

## Task 4 - PCA (3 points)

Use PCA on the dataset from the previous task (**customer_behaviour2**, excluding the variables *money_spent* and *big*). Use as few components as possible to capture at least 80% of the data variance.

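As a hint, one way to find the smallest number of components that reaches a variance threshold is sketched below on made-up data. The data frame `demo` and its columns `a` to `d` are hypothetical, and standardising the variables (`scale. = TRUE`) is an assumption; decide for yourself whether it is appropriate for the homework data.

```r
# Synthetic stand-in data (assumption): two correlated and two independent columns.
set.seed(2)
n <- 150
base <- rnorm(n)
demo <- data.frame(
  a = base + rnorm(n, sd = 0.3),
  b = base + rnorm(n, sd = 0.3),
  c = rnorm(n),
  d = rnorm(n)
)

# PCA on centred and standardised variables (standardising is an assumption).
pca <- prcomp(demo, center = TRUE, scale. = TRUE)

# Share of total variance captured by each component, and the running total.
var_share <- pca$sdev^2 / sum(pca$sdev^2)
cum_share <- cumsum(var_share)

# Smallest number of components whose cumulative share is at least 80 %.
k <- which(cum_share >= 0.80)[1]
```

The same cumulative shares are also printed in the "Cumulative Proportion" row of `summary(pca)`.
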
| Number of components used |
|-----------------|
| `insert` |

State which variable has the most influence on each component.

| - | Component 1 | Component 2 | Component 3 | Component 4 |
|-----------------|-----------------|-----------------|-----------------|-----------------|
| most impactful variable | `insert` | `insert` | `insert` | `insert` |

Create a scatter plot of the data points using the first two components. Plot the points in different colours depending on the value of the *big* variable. What is your evaluation of the final plot? Can you tell from the plot which variable(s) seem best at separating *big* shoppers from the customers who spend less?

```{r, include = TRUE}

```

| Scatter plot evaluation | Which variable(s) best separate heavy spenders |
|-----------------|-----------------|
| `insert text explaining what you can see in the scatter plot` | `insert name of the variable(s) and explain why you came to such a conclusion` |
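
As a hint for Task 4, the loadings and the coloured scatter plot can be sketched on made-up data. The data frame `demo`, its columns `a` to `c`, and the group vector `grp` are hypothetical. Reading "most influence" as the largest absolute loading is one common interpretation and is an assumption here.

```r
# Synthetic stand-in data (assumption): column a is shifted for group 1.
set.seed(3)
n <- 120
grp <- rep(c(0L, 1L), each = n / 2)
demo <- data.frame(
  a = rnorm(n, mean = 2 * grp),
  b = rnorm(n),
  c = rnorm(n)
)

# PCA on centred and standardised variables (standardising is an assumption).
pca <- prcomp(demo, center = TRUE, scale. = TRUE)

# For each component, the variable with the largest absolute loading.
top_var <- rownames(pca$rotation)[apply(abs(pca$rotation), 2, which.max)]
names(top_var) <- colnames(pca$rotation)

# Scores on the first two components, coloured by group.
plot(pca$x[, 1], pca$x[, 2],
     col = ifelse(grp == 1L, "red", "blue"), pch = 19,
     xlab = "PC1", ylab = "PC2")
legend("topright", legend = c("group 0", "group 1"),
       col = c("blue", "red"), pch = 19)
```

If the two colours separate along one axis of the plot, the variables with large loadings on that component are the natural candidates to examine.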