---
title: "03. Importing Data"
subtitle: "R101"
author: "Vít Gabrhel"
output: 
  html_document:
    toc: true
    toc_float: true
    theme: yeti
    code_folding: "show"
---

---
## **Working directory**

* Find the current working directory (**get working directory**)
```{r eval = FALSE} 
getwd()
```

* Set the working directory (**set working directory**)
```{r eval= FALSE}
setwd("~/Data")
```
<ul> **or** </ul>

```{r eval= FALSE}
setwd("...//Data")
```

This approach is not very [**efficient**](https://twitter.com/hadleywickham/status/940021008764846080?lang=en): a hard-coded path tends to break as soon as the script is moved to another folder or computer.

## **Project approach**

Why should we work [**"the project way"**](https://r4ds.had.co.nz/workflow-projects.html#rstudio-projects)?

* R experts keep all the files associated with a project together — input data, R scripts, analytical results, figures. 
  + This is such a wise and common practice that RStudio has built-in support for this via projects.

* **Click File > New Project**...
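
One practical consequence of the project approach is that files can be addressed by paths relative to the project root. The following is a minimal sketch, not the only way to organise a project. The folder `my_project`, its `Data` subfolder and the file `example.csv` are hypothetical and are created in a temporary directory so the chunk runs anywhere.

```{r eval = TRUE}
# Hypothetical project root; in RStudio this would be the folder holding the .Rproj file
project_root = file.path(tempdir(), "my_project")
dir.create(file.path(project_root, "Data"), recursive = TRUE, showWarnings = FALSE)

# Store a small made-up data set inside the project
write.csv(data.frame(id = 1:3, value = c(2.5, 3.1, 4.8)),
          file.path(project_root, "Data", "example.csv"),
          row.names = FALSE)

# Path relative to the project root: no machine-specific prefix
relative_path = file.path("Data", "example.csv")
example = read.csv(file.path(project_root, relative_path))
example
```

With an RStudio project open, the working directory is the project root, so `read.csv(relative_path)` alone would be enough.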

## **Importing data**
### *Flat Files - Utils - .csv*

* **Import** swimming_pools.csv:
```{r eval = TRUE}
pools = read.csv("swimming_pools.csv")
```
* **Print the structure** of pools:
```{r eval = TRUE}
str(pools)
```
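* **Import** swimming_pools.csv again, keeping text columns as character vectors instead of factors (since R 4.0.0 this is the default, so the argument matters mainly on older versions):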
```{r eval = TRUE}
pools = read.csv("swimming_pools.csv", 
                 stringsAsFactors = FALSE)
```
* Check the **structure** of pools again:
```{r eval = TRUE}
str(pools)
```
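
The effect of `stringsAsFactors` can be checked without the course data. This sketch writes a tiny made-up CSV to a temporary file (the column names and values are invented for illustration) and reads it both ways.

```{r eval = TRUE}
# Write a small, made-up CSV file to a temporary location
tmp_csv = tempfile(fileext = ".csv")
writeLines(c("Name,Address,Latitude",
             "Pool A,Street 1,-27.5",
             "Pool B,Street 2,-27.4"),
           tmp_csv)

# Text columns become factors
pools_factor = read.csv(tmp_csv, stringsAsFactors = TRUE)
class(pools_factor$Name)

# Text columns stay character vectors
pools_char = read.csv(tmp_csv, stringsAsFactors = FALSE)
class(pools_char$Name)
```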

### *Flat Files - Utils - .txt*

* Read data from a .txt file in which **the first row holds the variable names**:
```{r eval = TRUE}
hotdogs_1 = read.delim("hotdogs_1.txt", 
                       header = TRUE)
```

* Read data from a .txt file while **setting the variable names manually**:
```{r eval = TRUE}
hotdogs_2 = read.delim("hotdogs_2.txt", 
                        header = FALSE, 
                        col.names = c("type", "calories", "sodium"))
```

* Basic descriptive statistics of the "hotdogs_1" data set:
```{r eval = TRUE}
summary(hotdogs_1)
```

* Structure of the "hotdogs_1" data set:
```{r eval = TRUE}
str(hotdogs_1)
```

* Structure of the "hotdogs_2" data set:
```{r eval = TRUE}
str(hotdogs_2)
```
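
The difference between default and manual column names can be seen on a small tab-separated file. This is a sketch with made-up values written to a temporary file; it does not use the course data.

```{r eval = TRUE}
# Tab-separated file without a header row (values are made up)
tmp_txt = tempfile(fileext = ".txt")
writeLines(c("Beef\t186\t495",
             "Meat\t173\t458",
             "Poultry\t129\t430"),
           tmp_txt)

# Without col.names, R assigns V1, V2, V3
no_names = read.delim(tmp_txt, header = FALSE)
names(no_names)

# With col.names, the columns get meaningful names
with_names = read.delim(tmp_txt,
                        header = FALSE,
                        col.names = c("type", "calories", "sodium"))
names(with_names)
```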

* **Select** the hot dog with the **least calories (Cal)**:
```{r eval = TRUE}
Cal <- hotdogs_1[which.min(hotdogs_1$Calories), ]
Cal
```

* **Select** the observation with the **most sodium (Sod)**:
```{r eval = TRUE}
Sod = hotdogs_1[which.max(hotdogs_1$Sodium), ]
```
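
`which.min()` and `which.max()` return the position of the smallest and largest value, which is then used as a row index. A small sketch on a made-up data frame (the name `toy` and its values are illustrative only):

```{r eval = TRUE}
# Made-up data frame standing in for the hot dog data
toy = data.frame(type = c("Beef", "Meat", "Poultry"),
                 calories = c(186, 173, 129),
                 sodium = c(495, 458, 430))

# Row with the fewest calories
lowest_cal = toy[which.min(toy$calories), ]
lowest_cal

# Row with the most sodium
highest_sod = toy[which.max(toy$sodium), ]
highest_sod
```

If several rows share the extreme value, both functions return only the first of them.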

### *Excel - [readxl](https://cran.r-project.org/web/packages/readxl/readxl.pdf)*
<br>

* Install the package:
```{r eval = FALSE}
install.packages("readxl")
```

* Load the package:
```{r eval = TRUE}
library(readxl)
```

* The two basic functions:
```{r eval = FALSE}
excel_sheets() # Lists the sheets in a given Excel (.xls, .xlsx) file
read_excel() # Reads a file in Excel format
```

* **List the sheets** of *latitude.xlsx*:
```{r eval = TRUE}
excel_sheets("latitude.xlsx")
```

* **Read the first sheet** of *latitude.xlsx*:
```{r eval = TRUE}
latitude_1 = read_excel("latitude.xlsx", 
                        sheet = "1700")
latitude_1
```

* Read the **second sheet** of *latitude.xlsx*:
```{r eval = TRUE}
latitude_2 = read_excel("latitude.xlsx", 
                        sheet = 2)
latitude_2
```

* Put latitude_1 and latitude_2 **in a list**:
```{r eval = TRUE}
lat_list = list(latitude_1, latitude_2)
```
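
When a workbook has many sheets, reading them one by one gets repetitive. As a sketch of an alternative, the chunk below loops over all sheet names with `lapply()`. It uses `datasets.xlsx`, an example workbook bundled with readxl and located via `readxl_example()`, rather than the course file.

```{r eval = TRUE}
library(readxl)

# Example workbook shipped with readxl (not the course data)
path = readxl_example("datasets.xlsx")

# Names of all sheets in the workbook
sheet_names = excel_sheets(path)

# Read every sheet into one list, one data frame per sheet
sheet_list = lapply(sheet_names, function(s) read_excel(path, sheet = s))
names(sheet_list) = sheet_names

# Overview of what was read
str(sheet_list, max.level = 1)
```
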
### *Excel - [readxl](https://cran.r-project.org/web/packages/readxl/readxl.pdf) - col_names*

* Apart from path and sheet, there are several other arguments you can specify in `read_excel()`. 
  + One of these arguments is called col_names.

* Import the third sheet of *latitude.xlsx*, which has no header row (R assigns the column names):
```{r eval = TRUE}
latitude_3 = read_excel("latitude.xlsx", 
                        sheet = 3, 
                        col_names = FALSE)
latitude_3
```

* Import the third sheet of *latitude.xlsx* again, this time specifying col_names:
```{r eval = TRUE}
latitude_4 = read_excel("latitude.xlsx", 
                        sheet = 3, 
                        col_names = c("country", "latitude"))
latitude_4
```

* Print the summary of *latitude_3*:
```{r eval = TRUE}
summary(latitude_3)
```

* Print the summary of *latitude_4*:
```{r eval = TRUE}
summary(latitude_4)
```
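
Both `col_names = FALSE` and a character vector tell `read_excel()` that the first row of the sheet is data, not a header. The sketch below shows this on the `chickwts` sheet of `datasets.xlsx`, an example workbook bundled with readxl. That sheet does have a header row, so the header ends up as the first data row, which makes the behaviour easy to see.

```{r eval = TRUE}
library(readxl)

# Example workbook shipped with readxl (not the course data)
path = readxl_example("datasets.xlsx")

# Default: the first row is used as column names
with_header = read_excel(path, sheet = "chickwts")

# col_names = FALSE: readxl generates names and keeps the first row as data
auto_names = read_excel(path, sheet = "chickwts", col_names = FALSE)
head(auto_names, 3)

# Character vector: own names, the first row is still kept as data
own_names = read_excel(path, sheet = "chickwts",
                       col_names = c("weight", "feed"))
head(own_names, 3)
```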

### *Excel - [readxl](https://cran.r-project.org/web/packages/readxl/readxl.pdf) - skip*

* Another argument that can be very useful when reading in Excel files that are **less tidy**, is **skip**.
  + With skip, you can tell R to **ignore a specified number of rows** inside the Excel sheets you're trying to pull data from.
  
* Have a look at this **example**:
```{r eval = TRUE}
read_excel("latitude.xlsx", 
           skip = 15)
```
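* In this case, the **first 15 rows** in the **first sheet** of *latitude.xlsx* **are ignored**.
  + Watch out for the **shift**: with the default `col_names = TRUE`, the first row that is not skipped is consumed as the header. Pass `col_names = FALSE` to keep it as data: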
```{r eval = TRUE}
read_excel("latitude.xlsx", 
           skip = 15, 
           col_names = FALSE)
```
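
The shift can be reproduced on a small scale. This sketch again relies on the `chickwts` sheet of `datasets.xlsx`, an example workbook bundled with readxl, and skips only its header row.

```{r eval = TRUE}
library(readxl)

# Example workbook shipped with readxl (not the course data)
path = readxl_example("datasets.xlsx")

# skip = 1 drops the header; the first data row is then misread as column names
shifted = read_excel(path, sheet = "chickwts", skip = 1)
names(shifted)

# Supplying col_names keeps every data row and gives proper names
kept = read_excel(path, sheet = "chickwts", skip = 1,
                  col_names = c("weight", "feed"))
head(kept, 3)

# One observation was lost to the header in the first call
nrow(kept) - nrow(shifted)
```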

### *Excel - [readxl](https://cran.r-project.org/web/packages/readxl/readxl.pdf) - binding sheets and missing values*

```{r eval = TRUE}
latitude_all <- cbind(latitude_1, latitude_2[-1])
tail(latitude_all)
```
<ul> *The index [-1] drops the first column of the given data frame, so the first column of latitude_2 is not repeated* </ul>

* Remove all rows with NAs from latitude_all
```{r eval = TRUE}
latitude_all_clean = na.omit(latitude_all)
```
* Print out a summary of latitude_all_clean
```{r eval = TRUE}
summary(latitude_all_clean)
```
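
The combination of `cbind()`, negative indexing and `na.omit()` can be tried on two small made-up data frames (names and values are illustrative only):

```{r eval = TRUE}
# Two made-up tables sharing the same first column
lat_a = data.frame(country = c("A", "B", "C"),
                   latitude_1700 = c(10.5, NA, -3.2))
lat_b = data.frame(country = c("A", "B", "C"),
                   latitude_1900 = c(11.0, 45.1, NA))

# [-1] drops the first column of lat_b, so "country" appears only once
combined = cbind(lat_a, lat_b[-1])
combined

# na.omit() keeps only rows without any missing value
cleaned = na.omit(combined)
cleaned
```

Note that `cbind()` glues rows together by position, so it assumes both tables list the same units in the same order. When that is not guaranteed, joining on a key with `merge()` is the safer choice.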

### *SPSS - [foreign](https://cran.r-project.org/web/packages/foreign/foreign.pdf)*

* Load the *foreign* package:
```{r eval = TRUE}
library(foreign)
```

* The **read.spss()** function reads data from SPSS (.sav, .por)
  + For the imported data to be a data frame, pass the argument **"to.data.frame = TRUE"** to read.spss()

* **Load the data**:
```{r eval = TRUE}
demo_1 = read.spss("international.sav", 
                   to.data.frame = TRUE)
```

* **Show the first few rows**:
```{r eval = TRUE}
head(demo_1)
```

* *How do we import SPSS "value labels" as "factors" in R?*

* Through the "**use.value.labels**" argument of "**read.spss()**". It specifies whether variables with "**value labels**" should be converted into R "**factors**".
  + The argument is TRUE by default, so the conversion described above happens unless it is switched off

* Load the data without converting value labels
```{r eval = TRUE}
demo_2 = read.spss("international.sav", 
                   to.data.frame = TRUE, 
                   use.value.labels = FALSE)
```

* Show the first few rows
```{r eval = TRUE}
head(demo_2)
```

* *How do we turn individual variables into "factors" in R when the SPSS "value labels" were not imported?*

* Summary of demo_2$contint
```{r eval = TRUE}
summary(demo_2$contint)
class(demo_2$contint)
```

* Convert demo_2$contint to a factor
```{r eval = TRUE}
demo_2$contint = as.factor(demo_2$contint)
```

* Summary of demo_2$contint again
```{r eval = TRUE}
summary(demo_2$contint)
class(demo_2$contint)
```

* *How do we attach the SPSS "value labels" as factor labels to individual variables in R?*
```{r eval = TRUE}
continents = c("Africa", "Americas", "Asia", "Europe")
demo_2$contint = factor(demo_2$contint, 
                        levels = c(1, 2, 3, 4), 
                        labels = continents)
summary(demo_2$contint)
```
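
The mapping from numeric codes to labels works the same way on any vector. A minimal sketch with made-up codes (the vector `codes` is illustrative and not taken from the SPSS file):

```{r eval = TRUE}
# Made-up numeric codes, as they would arrive with use.value.labels = FALSE
codes = c(1, 3, 2, 4, 1)
continents = c("Africa", "Americas", "Asia", "Europe")

# levels gives the codes present in the data, labels the text shown for each code
continent_factor = factor(codes,
                          levels = 1:4,
                          labels = continents)
continent_factor
summary(continent_factor)
```

Any code that is not listed in `levels` is turned into `NA`, so it is worth checking `summary()` for unexpected missing values after the conversion.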
