Setting up the library


  • Packages:
     ggplot2        dplyr        psych summarytools       readxl    lubridate      stringr 
        TRUE         TRUE         TRUE         TRUE         TRUE         TRUE         TRUE 


Data Exploration

Exploration of the raw data - base and dplyr


  • Data entry:
bmi_1 = read_excel("bmi.xlsx", sheet = 2)


  • Check the class of bmi:
class(bmi_1)
[1] "tbl_df"     "tbl"        "data.frame"


  • Check the dimensions of bmi:
dim(bmi_1)
[1] 199   2


  • View the column names of bmi:
colnames(bmi_1)
[1] "Country"  "BMI_1980"


  • Structure of the data:
str(bmi_1)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame':   199 obs. of  2 variables:
 $ Country : chr  "Afghanistan" "Albania" "Algeria" "Andorra" ...
 $ BMI_1980: num  20.4 25.2 23.7 25.7 20.1 ...


  • Glimpse:
# library(dplyr), install.packages(“dplyr“)
glimpse(bmi_1)
Observations: 199
Variables: 2
$ Country  <chr> "Afghanistan", "Albania", "Algeria", "Andorra", "Angola", "Antigua and Barbuda", "Argentina", "Armenia"...
$ BMI_1980 <dbl> 20.44348, 25.17427, 23.67764, 25.67324, 20.06763, 24.22235, 23.84436, 25.77727, 23.63058, 23.90340, 26....


  • Summary:
summary(bmi_1)
   Country             BMI_1980    
 Length:199         Min.   :18.47  
 Class :character   1st Qu.:21.38  
 Mode  :character   Median :23.98  
                    Mean   :23.55  
                    3rd Qu.:25.40  
                    Max.   :28.28  


  • First 10 rows:
head(bmi_1, n = 10)


  • Last 10 rows:
tail(bmi_1, n = 10)


Exploration of the raw data - psych


  • Load psych:
# library(psych), install.packages("psych")


  • Check the structure of bmi, the psych way:
describe(bmi_1)


  • Check the structure of bmi, the psych way - by group:
bmi_2 <- bmi_1 %>%
         group_by(Country) %>%
         mutate(Health = if_else(BMI_1980 <= 18.5 | BMI_1980 >= 24.9, 
                                 "Unhealthy", "Normal"))
describe.by(bmi_2$BMI_1980, group = factor(bmi_2$Health))

 Descriptive statistics by group 
group: Normal
--------------------------------------------------------------------------------------------- 
group: Unhealthy


Exploration of the raw data - summarytools


  • Load summarytools:
# library(summarytools), install.packages("summarytools")


  • Data:
Manpower <- read.csv('Manpower.csv')


  • Check the structure of bmi, the summarytools way:
view(dfSummary(Manpower))

Introduction to Data Wrangling


  • Data:
Infrastructure <- read.csv2("Infrastructure.csv")


  • Preview Infrastructure with str():
str(Infrastructure)
'data.frame':   133 obs. of  7 variables:
 $ Country         : Factor w/ 133 levels "Afghanistan",..: 126 97 26 49 40 125 56 120 43 35 ...
 $ ISO3            : Factor w/ 133 levels "AFG","ALB","ALG",..: 127 97 25 49 40 124 57 121 43 35 ...
 $ Rank            : int  1 2 3 4 5 6 7 8 9 10 ...
 $ Ports           : int  24 7 15 7 14 14 10 9 13 7 ...
 $ Roadway_Coverage: int  6586610 982000 3860800 3320410 951200 394428 1210251 352046 644480 65050 ...
 $ Railway_Coverage: int  224792 87157 86000 63974 29640 16454 27182 8699 41981 5083 ...
 $ Airports        : int  13513 1218 507 346 464 460 175 98 539 83 ...


  • Coerce Country to character:
Infrastructure$Country <- as.character(Infrastructure$Country)


  • Coerce Rank to factor:
Infrastructure$Rank <- as.character(Infrastructure$Rank)


  • Look at Infrastructure once more with str():
str(Infrastructure)
'data.frame':   133 obs. of  7 variables:
 $ Country         : chr  "United States" "Russia" "China" "India" ...
 $ ISO3            : Factor w/ 133 levels "AFG","ALB","ALG",..: 127 97 25 49 40 124 57 121 43 35 ...
 $ Rank            : chr  "1" "2" "3" "4" ...
 $ Ports           : int  24 7 15 7 14 14 10 9 13 7 ...
 $ Roadway_Coverage: int  6586610 982000 3860800 3320410 951200 394428 1210251 352046 644480 65050 ...
 $ Railway_Coverage: int  224792 87157 86000 63974 29640 16454 27182 8699 41981 5083 ...
 $ Airports        : int  13513 1218 507 346 464 460 175 98 539 83 ...


Strings


  • Load the stringr package:
# library("stringr") install.packages("stringr")


  • Trim all leading and trailing whitespace:
name = c(" Filip ", "Nick ", " Jonathan")
str_trim(name)
[1] "Filip"    "Nick"     "Jonathan"


  • Pad these strings with leading zeros:
pad = c("23485W", "8823453Q", "994Z")
str_pad(pad, width = 9, side = "left", pad = "0")
[1] "00023485W" "08823453Q" "00000994Z"


  • Print state abbreviations:
head(Manpower$Country)
[1] United States  Russia         China          India          France         United Kingdom
133 Levels: Afghanistan Albania Algeria Angola Argentina Armenia Australia Austria Azerbaijan Bahrain Bangladesh ... Zimbabwe


  • Make states all uppercase and save result:
# to states_upper
states_upper <- toupper(Manpower$Country)
head(states_upper)
[1] "UNITED STATES"  "RUSSIA"         "CHINA"          "INDIA"          "FRANCE"         "UNITED KINGDOM"


  • Make states_upper all lowercase again:
states_lower <- tolower(Manpower$Country)
head(states_lower)
[1] "united states"  "russia"         "china"          "india"          "france"         "united kingdom"


  • Look at the head of Infrastructure:
head(Infrastructure)


  • Detect all “Republic” in Country:
str_detect(Infrastructure$Country, "Republic")
  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [21] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [41] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [61] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE
 [81] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[101] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE
[121] FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE


  • In the Country column, replace “Republic” with “R”…:
Infrastructure$Country <- str_replace(Infrastructure$Country, "Republic", "R")


Valid Values

  • In R, represented as NA
  • May appear in other forms
    • N/A (Excel)
    • Single dot (SPSS, SAS)
    • Empty string


  • Inf - “Infinite value” (indicative of outliers?):
1/0
[1] Inf
1/0 + 1/0
[1] Inf
33333^33333
[1] Inf


  • NaN - “Not a number” (rethink a variable?):
0/0
[1] NaN
1/0 - 1/0
[1] NaN


  • character:
"treatment", "123", "A"


  • numeric:
23.44, 120, NaN, Inf


  • integer:
4L, 1123L


  • factor:
factor("Hello"), factor(8)


  • logical:
TRUE, FALSE, NA


Missing Values


  • Data:
name = c("Jerry", "Beth", "Rick", "Morty")
n_friends = c(NaN, NA, Inf, 2)
status = c("Listening to human music", "Happy Family", "Garage", "")
social_df = data.frame(cbind(name, n_friends, status))


  • Call is.na() on the full social_df to spot all NAs:
is.na(social_df)
      name n_friends status
[1,] FALSE     FALSE  FALSE
[2,] FALSE      TRUE  FALSE
[3,] FALSE     FALSE  FALSE
[4,] FALSE     FALSE  FALSE


  • Use the any() function to ask whether there are any NAs in the data:
any(is.na(social_df))
[1] TRUE


  • View a summary() of the dataset:
summary(social_df)
    name   n_friends                      status 
 Beth :1   2   :1                            :1  
 Jerry:1   Inf :1    Garage                  :1  
 Morty:1   NaN :1    Happy Family            :1  
 Rick :1   NA's:1    Listening to human music:1  


  • Call table() on the status column:
table(social_df$status)

                                           Garage             Happy Family Listening to human music 
                       1                        1                        1                        1 


  • Replace all empty strings in status with NA:
social_df$status[social_df$status == ""] <- NA


  • Print social_df to the console:
social_df


  • Use complete.cases() to see which rows have no missing values:
complete.cases(social_df)
[1]  TRUE FALSE  TRUE FALSE


  • Use na.omit() to remove all rows with any missing values:
na.omit(social_df)

Outliers


  • Data:
Infrastructure = read.csv2("Infrastructure.csv")


  • Histogram:
hist(Infrastructure$Ports)

  • Boxplot:
boxplot(Infrastructure$Airports)

  • Scatterplot:
plot(Infrastructure$Railway_Coverage, Infrastructure$Roadway_Coverage)
---
title: "**04. Data Cleaning**"
subtitle: "R101"
author: "Vít Gabrhel"
output: 
  html_notebook:
    toc: true
    toc_float: true
    theme: yeti    
    code_folding: "show"
---
---
# Setting up the library
<br>

* **Packages**:
```{r echo = FALSE, message = FALSE, warning=FALSE, error=FALSE}
# Load packages via function
ipak <- function(pkg){
  new.pkg <- pkg[!(pkg %in% installed.packages()[, "Package"])]
  if (length(new.pkg)) 
    install.packages(new.pkg, dependencies = TRUE)
  sapply(pkg, require, character.only = TRUE)
}

packages <- c("ggplot2", "dplyr", "psych", 
              "summarytools", "readxl", "lubridate",
              "stringr")

ipak(packages)
```
<br>

# Data Exploration
## *Exploration of the raw data - [base](https://stat.ethz.ch/R-manual/R-devel/library/base/html/00Index.html) and [dplyr](https://cran.r-project.org/web/packages/dplyr/dplyr.pdf)*
<br>

* **Data** entry:
```{r}
bmi_1 = read_excel("bmi.xlsx", sheet = 2)
```
<br>

* Check the **class** of *bmi*:
```{r}
class(bmi_1)
```
<br>

* Check the **dimensions** of *bmi*:
```{r}
dim(bmi_1)
```
<br>

* View the **column names** of *bmi*:
```{r}
colnames(bmi_1)
```
<br>

* **Structure** of the data:
```{r}
str(bmi_1)
```
<br>

* **Glimpse**:
```{r}
# library(dplyr), install.packages(“dplyr“)
glimpse(bmi_1)
```
<br>

* **Summary**:
```{r}
summary(bmi_1)
```
<br>

* **First** 10 rows:
```{r}
head(bmi_1, n = 10)
```
<br>

* **Last** 10 rows:
```{r}
tail(bmi_1, n = 10)
```
<br>

## *Exploration of the raw data - [psych](https://cran.r-project.org/web/packages/psych/psych.pdf)*
<br>

* Load psych:
```{r}
# library(psych), install.packages("psych")
```
<br>

* Check the **structure** of *bmi*, the **psych** way:
```{r, warning=FALSE}
describe(bmi_1)
```
<br>

* Check the **structure** of *bmi*, the **psych** way - **by group**:
```{r, warning = FALSE}
bmi_2 <- bmi_1 %>%
         group_by(Country) %>%
         mutate(Health = if_else(BMI_1980 <= 18.5 | BMI_1980 >= 24.9, 
                                 "Unhealthy", "Normal"))


describe.by(bmi_2$BMI_1980, group = factor(bmi_2$Health))
```
<br>

## *Exploration of the raw data - [summarytools](https://cran.r-project.org/web/packages/summarytools/summarytools.pdf)*
<br>

* Load summarytools:
```{r}
# library(summarytools), install.packages("summarytools")
```
<br>

* Data:
```{r, warning=FALSE}
Manpower <- read.csv('Manpower.csv')
```
<br>

* Check the **structure** of *bmi*, the **summarytools** way:
```{r, warning=FALSE, eval=FALSE, message=FALSE}
view(dfSummary(Manpower))
```

# Introduction to Data Wrangling
<br>

* Data:
```{r}
Infrastructure <- read.csv2("Infrastructure.csv")
```
<br>

* Preview *Infrastructure* with **str()**:
```{r}
str(Infrastructure)
```
<br>

* Coerce *Country* to **character**:
```{r, warning=FALSE}
Infrastructure$Country <- as.character(Infrastructure$Country)
```
<br>

* Coerce *Rank* to **factor**:
```{r, warning=FALSE}
Infrastructure$Rank <- as.character(Infrastructure$Rank)
```
<br>

* Look at *Infrastructure* once more with **str()**:
```{r, warning=FALSE}
str(Infrastructure)
```
<br>

## *Strings*
<br>

* Load the stringr package:
```{r}
# library("stringr") install.packages("stringr")
```
<br>

* **Trim** all leading and trailing whitespace:
```{r}
name = c(" Filip ", "Nick ", " Jonathan")
str_trim(name)
```
<br>

* **Pad** these strings with leading zeros:
```{r}
pad = c("23485W", "8823453Q", "994Z")

str_pad(pad, width = 9, side = "left", pad = "0")
```
<br>

* **Print** state abbreviations:
```{r}
head(Manpower$Country)
```
<br>

* Make states all **uppercase** and save result:
```{r}
# to states_upper
states_upper <- toupper(Manpower$Country)
head(states_upper)
```
<br>

* Make states_upper all **lowercase** again:
```{r}
states_lower <- tolower(Manpower$Country)
head(states_lower)
```
<br>

* Look at the **head** of *Infrastructure*:
```{r}
head(Infrastructure)
```
<br>

* **Detect** all "Republic" in *Country*:
```{r}
str_detect(Infrastructure$Country, "Republic")
```
<br>

* In the Country column, **replace** "Republic" with "R"...:
```{r}
Infrastructure$Country <- str_replace(Infrastructure$Country, "Republic", "R")
```
<br>

## *Valid Values*

* In R, represented as NA
* May appear in other forms
  + N/A (Excel)
  + Single dot (SPSS, SAS)
  + Empty string
  
<br>

* **Inf** - "Infinite value" (indicative of outliers?):
```{r}
1/0

1/0 + 1/0

33333^33333
```
<br>

* **NaN** - "Not a number" (rethink a variable?):
```{r}
0/0

1/0 - 1/0
```
<br>

* **character**:
```{r eval=FALSE}
"treatment", "123", "A"
```
<br>

* **numeric**: 
```{r eval=FALSE}
23.44, 120, NaN, Inf
```
<br>

* **integer**: 
```{r eval=FALSE}
4L, 1123L
```
<br>

* **factor**:
```{r eval=FALSE}
factor("Hello"), factor(8)
```
<br>

* **logical**:
```{r eval=FALSE}
TRUE, FALSE, NA
```

<br>

## *Missing Values*

<br>

* **Data**:
```{r}
name = c("Jerry", "Beth", "Rick", "Morty")
n_friends = c(NaN, NA, Inf, 2)
status = c("Listening to human music", "Happy Family", "Garage", "")
social_df = data.frame(cbind(name, n_friends, status))
```

<br>

* Call **is.na()** on the **full** *social_df* to **spot all NAs**:
```{r}
is.na(social_df)
```

<br>

* Use the **any()** function to ask whether there are **any NAs** in the data:
```{r}
any(is.na(social_df))
```

<br>

* View a **summary()** of the dataset:
```{r}
summary(social_df)
```

<br>

* Call **table()** on the *status* column:
```{r}
table(social_df$status)
```

<br>

* **Replace all empty strings** in *status* with **NA**:
```{r}
social_df$status[social_df$status == ""] <- NA
```

<br>

* **Print** *social_df* to the console:
```{r}
social_df
```

<br>

* Use **complete.cases()** to see which **rows have no missing values**:
```{r}
complete.cases(social_df)
```

<br>

* Use **na.omit()** to **remove all rows** with **any missing values**:
```{r}
na.omit(social_df)
```

## *Outliers*

<br>

* **Data**:
```{r}
Infrastructure = read.csv2("Infrastructure.csv")
```

<br>

* **Histogram**:
```{r}
hist(Infrastructure$Ports)
```

* **Boxplot**:
```{r}
boxplot(Infrastructure$Airports)
```

* **Scatterplot**:
```{r}
plot(Infrastructure$Railway_Coverage, Infrastructure$Roadway_Coverage)
```

## Resources

[DataCamp's Free Classroom Model Used in 180 Countries](https://www.datacamp.com/community/blog/datacamps-free-classroom-model-used-in-180-countries?fbclid=IwAR1Bo0B4-D2M7LDAGUg6LVbYKyeJn4O6FYxKs0wqEgA7u0EzM3cUOzHSGKY)
<br>

[This R Data Import Tutorial Is Everything You Need](https://www.datacamp.com/community/tutorials/r-data-import-tutorial)
<br>

[The Landscape of R Packages for
Automated Exploratory Data Analysis
](https://journal.r-project.org/archive/2019/RJ-2019-033/RJ-2019-033.pdf)



