Homework 2

Due by 30/10/2016.

Complete the exercises and replicate the outputs. Note that only PDF and HTML formats will be accepted. All R code you used to generate figures should be included in the document.

There are plenty of options for creating PDF documents with inline R code, such as knitr, Sweave, sense.io, Jupyter or Anaconda Cloud. Alternatively, simply save the plots as .png images and add them to your document manually, together with the R code.

Your output can be stylistically different from the outputs below (e.g. vectors or matrices will show up with row/column numbers). This doesn't matter as long as you have the correct numbers, labels and graphs.

Note. If you decide to use Jupyter Notebook, run the command options(jupyter.plot_mimetypes = 'image/png') in the first cell to get around a bug with SVG images.

1. Location and spread characteristics

Load the fivemin.csv dataset, which contains network usage data recorded every 5 minutes over 8 hours. Select the data for type=Server and complete the following tasks:

  1. Plot histogram of $sessions$ and $log(sessions + 1)$
  2. Report all location and spread characteristics for $sessions$. (It's up to you whether you decide to use print commands or put all characteristics in a dataframe or list and print that)
  3. Repeat 1. and 2. for the $packets$ column

Sessions
=================================
Minimum: 1293 
Maximum: 2678 
Mean: 1904.573 
... TODO ...
Packets
=================================
Minimum: 14857 
Maximum: 4015541 
Mean: 204540.1 
... TODO ...
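
The tasks above could be approached along the following lines. This is only a minimal sketch, not a reference solution: it uses a synthetic data frame in place of fivemin.csv (the column names type, sessions and packets are taken from the exercise text, the values are made up), the helper name describe is hypothetical, and the set of characteristics it reports is an assumption about what "all location and spread characteristics" covers in this course.

```r
# Synthetic stand-in for fivemin.csv (assumption: the real file has columns
# type, sessions and packets). 8 hours at 5-minute intervals = 96 rows per type.
set.seed(1)
fivemin <- data.frame(
  type     = rep(c("Server", "Client"), each = 96),
  sessions = round(rnorm(192, mean = 1900, sd = 300)),
  packets  = round(rlnorm(192, meanlog = 11, sdlog = 1))
)
# With the real data you would instead use: fivemin <- read.csv("fivemin.csv")

# Keep only the rows for type = Server
server <- fivemin[fivemin$type == "Server", ]

# Histograms on the raw scale and on the log scale
hist(server$sessions, main = "sessions", xlab = "sessions")
hist(log(server$sessions + 1), main = "log(sessions + 1)",
     xlab = "log(sessions + 1)")

# Hypothetical helper collecting common location and spread measures
describe <- function(x) {
  c(
    Minimum  = min(x),
    Maximum  = max(x),
    Mean     = mean(x),
    Median   = median(x),
    Variance = var(x),          # sample variance, denominator n - 1
    SD       = sd(x),
    IQR      = IQR(x),
    Range    = max(x) - min(x)
  )
}

print(describe(server$sessions))
```

The same helper can be applied to server$packets for the third task. Returning a named vector keeps the labels next to the numbers, which makes the printed result easy to compare with the expected output.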

2. Parameter estimation for normal distribution

Let $X$ be a random variable representing the number of sessions for type=Server from the previous exercise. Answer the following questions:

  • a) What is the probability $P(X > 2000)$ (i.e. probability of high network load)?
  • b) What is the probability $P(1500 < X < 2000)$ (i.e. probability of usual network load)?
  • c) What is the probability $P(X > 2700)$ (i.e. probability of extreme network load)?

Answer these questions using two different approaches:

  • i) Do not make any assumptions about the distribution of $X$ and calculate empirical probabilities (relative frequencies), i.e. $$ P(a < X < b) = \frac{\text{number of times } X \text{ is between } a \text{ and } b}{\text{number of all occurrences}} $$
  • ii) Assume that $X$ is continuous and follows a normal distribution $N(\mu, \sigma^2)$. Its parameters can be estimated by $$ \hat{\mu} = \frac{\sum_{i=1}^{n} x_i}{n}, \quad \quad \hat{\sigma}^2 = s^2 = \frac{\sum_{i=1}^{n} (x_i - \hat{\mu})^2}{n - 1}. $$ Calculate the given probabilities from the distribution $N(\hat{\mu}, \hat{\sigma}^2)$.
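
The two approaches can be compared side by side as in the sketch below. It is an illustration only: the vector x is synthetic and stands in for the Server sessions column, and all variable names are made up for this example. The empirical part uses the fact that the mean of a logical vector is a relative frequency. The normal part relies on R's sd(), which uses the $n - 1$ denominator and therefore matches the estimator $s^2$ above, and on pnorm(), which returns $P(X \le q)$ by default.

```r
# Synthetic stand-in for the Server sessions (assumption: 96 observations)
set.seed(1)
x <- round(rnorm(96, mean = 1900, sd = 300))

# i) Empirical probabilities: share of observations satisfying each condition
p_emp_a <- mean(x > 2000)
p_emp_b <- mean(x > 1500 & x < 2000)
p_emp_c <- mean(x > 2700)

# ii) Normal approximation with estimated parameters
mu_hat    <- mean(x)
sigma_hat <- sd(x)  # square root of s^2, denominator n - 1

# Upper tails via lower.tail = FALSE, the interval as a difference of CDF values
p_norm_a <- pnorm(2000, mean = mu_hat, sd = sigma_hat, lower.tail = FALSE)
p_norm_b <- pnorm(2000, mean = mu_hat, sd = sigma_hat) -
            pnorm(1500, mean = mu_hat, sd = sigma_hat)
p_norm_c <- pnorm(2700, mean = mu_hat, sd = sigma_hat, lower.tail = FALSE)

print(data.frame(
  question  = c("a) P(X > 2000)", "b) P(1500 < X < 2000)", "c) P(X > 2700)"),
  empirical = c(p_emp_a, p_emp_b, p_emp_c),
  normal    = c(p_norm_a, p_norm_b, p_norm_c)
))
```

Note that an empirical probability is exactly zero whenever no observation exceeds the threshold, while the normal model always assigns a small positive probability to the tail. This difference is most visible in question c).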

3. Finish exercise 9 (Interactive Normal Distribution 2)

Finish exercise 9 and deploy your app to shinyapps.io. Here's a tutorial to help you get started: http://shiny.rstudio.com/articles/shinyapps.html

For this task, please submit your app in zip format and include a link to your app in the PDF.