--- title: "**04. Data Cleaning**" subtitle: "R101" author: "Vít Gabrhel" output: html_notebook: toc: true toc_float: true theme: yeti code_folding: "show" --- --- # Setting up the library
* **Packages**: ```{r echo = FALSE, message = FALSE, warning=FALSE, error=FALSE} # Load packages via function ipak <- function(pkg){ new.pkg <- pkg[!(pkg %in% installed.packages()[, "Package"])] if (length(new.pkg)) install.packages(new.pkg, dependencies = TRUE) sapply(pkg, require, character.only = TRUE) } packages <- c("ggplot2", "dplyr", "psych", "summarytools", "readxl", "lubridate", "stringr") ipak(packages) ```
# Data Exploration ## *Exploration of the raw data - [base](https://stat.ethz.ch/R-manual/R-devel/library/base/html/00Index.html) and [dplyr](https://cran.r-project.org/web/packages/dplyr/dplyr.pdf)*
* **Data** entry: ```{r} bmi_1 = read_excel("bmi.xlsx", sheet = 2) ```
* Check the **class** of *bmi*: ```{r} class(bmi_1) ```
* Check the **dimensions** of *bmi*: ```{r} dim(bmi_1) ```
* View the **column names** of *bmi*: ```{r} colnames(bmi_1) ```
* **Structure** of the data: ```{r} str(bmi_1) ```
* **Glimpse**: ```{r} # library(dplyr), install.packages(“dplyr“) glimpse(bmi_1) ```
* **Summary**: ```{r} summary(bmi_1) ```
* **First** 10 rows: ```{r} head(bmi_1, n = 10) ```
* **Last** 10 rows: ```{r} tail(bmi_1, n = 10) ```
## *Exploration of the raw data - [psych](https://cran.r-project.org/web/packages/psych/psych.pdf)*
* Load psych: ```{r} # library(psych), install.packages("psych") ```
* Check the **structure** of *bmi*, the **psych** way: ```{r, warning=FALSE} describe(bmi_1) ```
* Check the **structure** of *bmi*, the **psych** way - **by group**: ```{r, warning = FALSE} bmi_2 <- bmi_1 %>% group_by(Country) %>% mutate(Health = if_else(BMI_1980 <= 18.5 | BMI_1980 >= 24.9, "Unhealthy", "Normal")) describe.by(bmi_2$BMI_1980, group = factor(bmi_2$Health)) ```
## *Exploration of the raw data - [summarytools](https://cran.r-project.org/web/packages/summarytools/summarytools.pdf)*
* Load summarytools: ```{r} # library(summarytools), install.packages("summarytools") ```
* Data: ```{r, warning=FALSE} Manpower <- read.csv('Manpower.csv') ```
* Check the **structure** of *bmi*, the **summarytools** way: ```{r, warning=FALSE, eval=FALSE, message=FALSE} view(dfSummary(Manpower)) ``` # Introduction to Data Wrangling
* Data: ```{r} Infrastructure <- read.csv2("Infrastructure.csv") ```
* Preview *Infrastructure* with **str()**: ```{r} str(Infrastructure) ```
* Coerce *Country* to **character**: ```{r, warning=FALSE} Infrastructure$Country <- as.character(Infrastructure$Country) ```
* Coerce *Rank* to **factor**: ```{r, warning=FALSE} Infrastructure$Rank <- as.character(Infrastructure$Rank) ```
* Look at *Infrastructure* once more with **str()**: ```{r, warning=FALSE} str(Infrastructure) ```
## *Strings*
* Load the stringr package: ```{r} # library("stringr") install.packages("stringr") ```
* **Trim** all leading and trailing whitespace: ```{r} name = c(" Filip ", "Nick ", " Jonathan") str_trim(name) ```
* **Pad** these strings with leading zeros: ```{r} pad = c("23485W", "8823453Q", "994Z") str_pad(pad, width = 9, side = "left", pad = "0") ```
* **Print** state abbreviations: ```{r} head(Manpower$Country) ```
* Make states all **uppercase** and save result: ```{r} # to states_upper states_upper <- toupper(Manpower$Country) head(states_upper) ```
* Make states_upper all **lowercase** again: ```{r} states_lower <- tolower(Manpower$Country) head(states_lower) ```
* Look at the **head** of *Infrastructure*: ```{r} head(Infrastructure) ```
* **Detect** all "Republic" in *Country*: ```{r} str_detect(Infrastructure$Country, "Republic") ```
* In the Country column, **replace** "Republic" with "R"...: ```{r} Infrastructure$Country <- str_replace(Infrastructure$Country, "Republic", "R") ```
## *Valid Values* * In R, represented as NA * May appear in other forms + N/A (Excel) + Single dot (SPSS, SAS) + Empty string
* **Inf** - "Infinite value" (indicative of outliers?): ```{r} 1/0 1/0 + 1/0 33333^33333 ```
* **NaN** - "Not a number" (rethink a variable?): ```{r} 0/0 1/0 - 1/0 ```
* **character**: ```{r eval=FALSE} "treatment", "123", "A" ```
* **numeric**: ```{r eval=FALSE} 23.44, 120, NaN, Inf ```
* **integer**: ```{r eval=FALSE} 4L, 1123L ```
* **factor**: ```{r eval=FALSE} factor("Hello"), factor(8) ```
* **logical**: ```{r eval=FALSE} TRUE, FALSE, NA ```
## *Missing Values*
* **Data**: ```{r} name = c("Jerry", "Beth", "Rick", "Morty") n_friends = c(NaN, NA, Inf, 2) status = c("Listening to human music", "Happy Family", "Garage", "") social_df = data.frame(cbind(name, n_friends, status)) ```
* Call **is.na()** on the **full** *social_df* to **spot all NAs**: ```{r} is.na(social_df) ```
* Use the **any()** function to ask whether there are **any NAs** in the data: ```{r} any(is.na(social_df)) ```
* View a **summary()** of the dataset: ```{r} summary(social_df) ```
* Call **table()** on the *status* column: ```{r} table(social_df$status) ```
* **Replace all empty strings** in *status* with **NA**: ```{r} social_df$status[social_df$status == ""] <- NA ```
* **Print** *social_df* to the console: ```{r} social_df ```
* Use **complete.cases()** to see which **rows have no missing values**: ```{r} complete.cases(social_df) ```
* Use **na.omit()** to **remove all rows** with **any missing values**: ```{r} na.omit(social_df) ``` ## *Outliers*
* **Data**: ```{r} Infrastructure = read.csv2("Infrastructure.csv") ```
* **Histogram**: ```{r} hist(Infrastructure$Ports) ``` * **Boxplot**: ```{r} boxplot(Infrastructure$Airports) ``` * **Scatterplot**: ```{r} plot(Infrastructure$Railway_Coverage, Infrastructure$Roadway_Coverage) ``` ## Resources [DataCamp's Free Classroom Model Used in 180 Countries](https://www.datacamp.com/community/blog/datacamps-free-classroom-model-used-in-180-countries?fbclid=IwAR1Bo0B4-D2M7LDAGUg6LVbYKyeJn4O6FYxKs0wqEgA7u0EzM3cUOzHSGKY)
[This R Data Import Tutorial Is Everything You Need](https://www.datacamp.com/community/tutorials/r-data-import-tutorial)
[The Landscape of R Packages for Automated Exploratory Data Analysis ](https://journal.r-project.org/archive/2019/RJ-2019-033/RJ-2019-033.pdf)