---
title: "**04. Data Cleaning**"
subtitle: "R101"
author: "Vít Gabrhel"
output:
html_notebook:
toc: true
toc_float: true
theme: yeti
code_folding: "show"
---
---
# Setting up the library
* **Packages**:
```{r echo = FALSE, message = FALSE, warning=FALSE, error=FALSE}
# Load packages via function
ipak <- function(pkg){
new.pkg <- pkg[!(pkg %in% installed.packages()[, "Package"])]
if (length(new.pkg))
install.packages(new.pkg, dependencies = TRUE)
sapply(pkg, require, character.only = TRUE)
}
packages <- c("ggplot2", "dplyr", "psych",
"summarytools", "readxl", "lubridate",
"stringr")
ipak(packages)
```
# Data Exploration
## *Exploration of the raw data - [base](https://stat.ethz.ch/R-manual/R-devel/library/base/html/00Index.html) and [dplyr](https://cran.r-project.org/web/packages/dplyr/dplyr.pdf)*
* **Data** entry:
```{r}
bmi_1 = read_excel("bmi.xlsx", sheet = 2)
```
* Check the **class** of *bmi*:
```{r}
class(bmi_1)
```
* Check the **dimensions** of *bmi*:
```{r}
dim(bmi_1)
```
* View the **column names** of *bmi*:
```{r}
colnames(bmi_1)
```
* **Structure** of the data:
```{r}
str(bmi_1)
```
* **Glimpse**:
```{r}
# library(dplyr), install.packages(“dplyr“)
glimpse(bmi_1)
```
* **Summary**:
```{r}
summary(bmi_1)
```
* **First** 10 rows:
```{r}
head(bmi_1, n = 10)
```
* **Last** 10 rows:
```{r}
tail(bmi_1, n = 10)
```
## *Exploration of the raw data - [psych](https://cran.r-project.org/web/packages/psych/psych.pdf)*
* Load psych:
```{r}
# library(psych), install.packages("psych")
```
* Check the **structure** of *bmi*, the **psych** way:
```{r, warning=FALSE}
describe(bmi_1)
```
* Check the **structure** of *bmi*, the **psych** way - **by group**:
```{r, warning = FALSE}
bmi_2 <- bmi_1 %>%
group_by(Country) %>%
mutate(Health = if_else(BMI_1980 <= 18.5 | BMI_1980 >= 24.9,
"Unhealthy", "Normal"))
describe.by(bmi_2$BMI_1980, group = factor(bmi_2$Health))
```
## *Exploration of the raw data - [summarytools](https://cran.r-project.org/web/packages/summarytools/summarytools.pdf)*
* Load summarytools:
```{r}
# library(summarytools), install.packages("summarytools")
```
* Data:
```{r, warning=FALSE}
Manpower <- read.csv('Manpower.csv')
```
* Check the **structure** of *bmi*, the **summarytools** way:
```{r, warning=FALSE, eval=FALSE, message=FALSE}
view(dfSummary(Manpower))
```
# Introduction to Data Wrangling
* Data:
```{r}
Infrastructure <- read.csv2("Infrastructure.csv")
```
* Preview *Infrastructure* with **str()**:
```{r}
str(Infrastructure)
```
* Coerce *Country* to **character**:
```{r, warning=FALSE}
Infrastructure$Country <- as.character(Infrastructure$Country)
```
* Coerce *Rank* to **factor**:
```{r, warning=FALSE}
Infrastructure$Rank <- as.character(Infrastructure$Rank)
```
* Look at *Infrastructure* once more with **str()**:
```{r, warning=FALSE}
str(Infrastructure)
```
## *Strings*
* Load the stringr package:
```{r}
# library("stringr") install.packages("stringr")
```
* **Trim** all leading and trailing whitespace:
```{r}
name = c(" Filip ", "Nick ", " Jonathan")
str_trim(name)
```
* **Pad** these strings with leading zeros:
```{r}
pad = c("23485W", "8823453Q", "994Z")
str_pad(pad, width = 9, side = "left", pad = "0")
```
* **Print** state abbreviations:
```{r}
head(Manpower$Country)
```
* Make states all **uppercase** and save result:
```{r}
# to states_upper
states_upper <- toupper(Manpower$Country)
head(states_upper)
```
* Make states_upper all **lowercase** again:
```{r}
states_lower <- tolower(Manpower$Country)
head(states_lower)
```
* Look at the **head** of *Infrastructure*:
```{r}
head(Infrastructure)
```
* **Detect** all "Republic" in *Country*:
```{r}
str_detect(Infrastructure$Country, "Republic")
```
* In the Country column, **replace** "Republic" with "R"...:
```{r}
Infrastructure$Country <- str_replace(Infrastructure$Country, "Republic", "R")
```
## *Valid Values*
* In R, represented as NA
* May appear in other forms
+ N/A (Excel)
+ Single dot (SPSS, SAS)
+ Empty string
* **Inf** - "Infinite value" (indicative of outliers?):
```{r}
1/0
1/0 + 1/0
33333^33333
```
* **NaN** - "Not a number" (rethink a variable?):
```{r}
0/0
1/0 - 1/0
```
* **character**:
```{r eval=FALSE}
"treatment", "123", "A"
```
* **numeric**:
```{r eval=FALSE}
23.44, 120, NaN, Inf
```
* **integer**:
```{r eval=FALSE}
4L, 1123L
```
* **factor**:
```{r eval=FALSE}
factor("Hello"), factor(8)
```
* **logical**:
```{r eval=FALSE}
TRUE, FALSE, NA
```
## *Missing Values*
* **Data**:
```{r}
name = c("Jerry", "Beth", "Rick", "Morty")
n_friends = c(NaN, NA, Inf, 2)
status = c("Listening to human music", "Happy Family", "Garage", "")
social_df = data.frame(cbind(name, n_friends, status))
```
* Call **is.na()** on the **full** *social_df* to **spot all NAs**:
```{r}
is.na(social_df)
```
* Use the **any()** function to ask whether there are **any NAs** in the data:
```{r}
any(is.na(social_df))
```
* View a **summary()** of the dataset:
```{r}
summary(social_df)
```
* Call **table()** on the *status* column:
```{r}
table(social_df$status)
```
* **Replace all empty strings** in *status* with **NA**:
```{r}
social_df$status[social_df$status == ""] <- NA
```
* **Print** *social_df* to the console:
```{r}
social_df
```
* Use **complete.cases()** to see which **rows have no missing values**:
```{r}
complete.cases(social_df)
```
* Use **na.omit()** to **remove all rows** with **any missing values**:
```{r}
na.omit(social_df)
```
## *Outliers*
* **Data**:
```{r}
Infrastructure = read.csv2("Infrastructure.csv")
```
* **Histogram**:
```{r}
hist(Infrastructure$Ports)
```
* **Boxplot**:
```{r}
boxplot(Infrastructure$Airports)
```
* **Scatterplot**:
```{r}
plot(Infrastructure$Railway_Coverage, Infrastructure$Roadway_Coverage)
```
## Resources
[DataCamp's Free Classroom Model Used in 180 Countries](https://www.datacamp.com/community/blog/datacamps-free-classroom-model-used-in-180-countries?fbclid=IwAR1Bo0B4-D2M7LDAGUg6LVbYKyeJn4O6FYxKs0wqEgA7u0EzM3cUOzHSGKY)
[This R Data Import Tutorial Is Everything You Need](https://www.datacamp.com/community/tutorials/r-data-import-tutorial)
[The Landscape of R Packages for
Automated Exploratory Data Analysis
](https://journal.r-project.org/archive/2019/RJ-2019-033/RJ-2019-033.pdf)