Data Exploration
Exploration of the raw data - base and dplyr
bmi_1 = read_excel("bmi.xlsx", sheet = 2)
class(bmi_1)
[1] "tbl_df" "tbl" "data.frame"
- Check the dimensions of bmi:
dim(bmi_1)
[1] 199 2
- View the column names of bmi:
colnames(bmi_1)
[1] "Country" "BMI_1980"
str(bmi_1)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 199 obs. of 2 variables:
$ Country : chr "Afghanistan" "Albania" "Algeria" "Andorra" ...
$ BMI_1980: num 20.4 25.2 23.7 25.7 20.1 ...
# library(dplyr), install.packages(“dplyr“)
glimpse(bmi_1)
Observations: 199
Variables: 2
$ Country [3m[38;5;246m<chr>[39m[23m "Afghanistan", "Albania", "Algeria", "Andorra", "Angola", "Antigua and Barbuda", "Argentina", "Armenia"...
$ BMI_1980 [3m[38;5;246m<dbl>[39m[23m 20.44348, 25.17427, 23.67764, 25.67324, 20.06763, 24.22235, 23.84436, 25.77727, 23.63058, 23.90340, 26....
summary(bmi_1)
Country BMI_1980
Length:199 Min. :18.47
Class :character 1st Qu.:21.38
Mode :character Median :23.98
Mean :23.55
3rd Qu.:25.40
Max. :28.28
head(bmi_1, n = 10)
tail(bmi_1, n = 10)
Exploration of the raw data - psych
# library(psych), install.packages("psych")
- Check the structure of bmi, the psych way:
describe(bmi_1)
- Check the structure of bmi, the psych way - by group:
bmi_2 <- bmi_1 %>%
group_by(Country) %>%
mutate(Health = if_else(BMI_1980 <= 18.5 | BMI_1980 >= 24.9,
"Unhealthy", "Normal"))
describe.by(bmi_2$BMI_1980, group = factor(bmi_2$Health))
Descriptive statistics by group
group: Normal
---------------------------------------------------------------------------------------------
group: Unhealthy
Introduction to Data Wrangling
Infrastructure <- read.csv2("Infrastructure.csv")
- Preview Infrastructure with str():
str(Infrastructure)
'data.frame': 133 obs. of 7 variables:
$ Country : Factor w/ 133 levels "Afghanistan",..: 126 97 26 49 40 125 56 120 43 35 ...
$ ISO3 : Factor w/ 133 levels "AFG","ALB","ALG",..: 127 97 25 49 40 124 57 121 43 35 ...
$ Rank : int 1 2 3 4 5 6 7 8 9 10 ...
$ Ports : int 24 7 15 7 14 14 10 9 13 7 ...
$ Roadway_Coverage: int 6586610 982000 3860800 3320410 951200 394428 1210251 352046 644480 65050 ...
$ Railway_Coverage: int 224792 87157 86000 63974 29640 16454 27182 8699 41981 5083 ...
$ Airports : int 13513 1218 507 346 464 460 175 98 539 83 ...
- Coerce Country to character:
Infrastructure$Country <- as.character(Infrastructure$Country)
Infrastructure$Rank <- as.character(Infrastructure$Rank)
- Look at Infrastructure once more with str():
str(Infrastructure)
'data.frame': 133 obs. of 7 variables:
$ Country : chr "United States" "Russia" "China" "India" ...
$ ISO3 : Factor w/ 133 levels "AFG","ALB","ALG",..: 127 97 25 49 40 124 57 121 43 35 ...
$ Rank : chr "1" "2" "3" "4" ...
$ Ports : int 24 7 15 7 14 14 10 9 13 7 ...
$ Roadway_Coverage: int 6586610 982000 3860800 3320410 951200 394428 1210251 352046 644480 65050 ...
$ Railway_Coverage: int 224792 87157 86000 63974 29640 16454 27182 8699 41981 5083 ...
$ Airports : int 13513 1218 507 346 464 460 175 98 539 83 ...
Strings
- Load the stringr package:
# library("stringr") install.packages("stringr")
- Trim all leading and trailing whitespace:
name = c(" Filip ", "Nick ", " Jonathan")
str_trim(name)
[1] "Filip" "Nick" "Jonathan"
- Pad these strings with leading zeros:
pad = c("23485W", "8823453Q", "994Z")
str_pad(pad, width = 9, side = "left", pad = "0")
[1] "00023485W" "08823453Q" "00000994Z"
- Print state abbreviations:
head(Manpower$Country)
[1] United States Russia China India France United Kingdom
133 Levels: Afghanistan Albania Algeria Angola Argentina Armenia Australia Austria Azerbaijan Bahrain Bangladesh ... Zimbabwe
- Make states all uppercase and save result:
# to states_upper
states_upper <- toupper(Manpower$Country)
head(states_upper)
[1] "UNITED STATES" "RUSSIA" "CHINA" "INDIA" "FRANCE" "UNITED KINGDOM"
- Make states_upper all lowercase again:
states_lower <- tolower(Manpower$Country)
head(states_lower)
[1] "united states" "russia" "china" "india" "france" "united kingdom"
- Look at the head of Infrastructure:
head(Infrastructure)
- Detect all “Republic” in Country:
str_detect(Infrastructure$Country, "Republic")
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[21] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[41] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[61] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE
[81] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[101] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE
[121] FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE
- In the Country column, replace “Republic” with “R”…:
Infrastructure$Country <- str_replace(Infrastructure$Country, "Republic", "R")
Valid Values
- In R, represented as NA
- May appear in other forms
- N/A (Excel)
- Single dot (SPSS, SAS)
- Empty string
- Inf - “Infinite value” (indicative of outliers?):
1/0
[1] Inf
1/0 + 1/0
[1] Inf
33333^33333
[1] Inf
- NaN - “Not a number” (rethink a variable?):
0/0
[1] NaN
1/0 - 1/0
[1] NaN
"treatment", "123", "A"
23.44, 120, NaN, Inf
4L, 1123L
factor("Hello"), factor(8)
TRUE, FALSE, NA
Missing Values
name = c("Jerry", "Beth", "Rick", "Morty")
n_friends = c(NaN, NA, Inf, 2)
status = c("Listening to human music", "Happy Family", "Garage", "")
social_df = data.frame(cbind(name, n_friends, status))
- Call is.na() on the full social_df to spot all NAs:
is.na(social_df)
name n_friends status
[1,] FALSE FALSE FALSE
[2,] FALSE TRUE FALSE
[3,] FALSE FALSE FALSE
[4,] FALSE FALSE FALSE
- Use the any() function to ask whether there are any NAs in the data:
any(is.na(social_df))
[1] TRUE
- View a summary() of the dataset:
summary(social_df)
name n_friends status
Beth :1 2 :1 :1
Jerry:1 Inf :1 Garage :1
Morty:1 NaN :1 Happy Family :1
Rick :1 NA's:1 Listening to human music:1
- Call table() on the status column:
table(social_df$status)
Garage Happy Family Listening to human music
1 1 1 1
- Replace all empty strings in status with NA:
social_df$status[social_df$status == ""] <- NA
- Print social_df to the console:
social_df
- Use complete.cases() to see which rows have no missing values:
complete.cases(social_df)
[1] TRUE FALSE TRUE FALSE
- Use na.omit() to remove all rows with any missing values:
na.omit(social_df)
Outliers
Infrastructure = read.csv2("Infrastructure.csv")
hist(Infrastructure$Ports)
boxplot(Infrastructure$Airports)
plot(Infrastructure$Railway_Coverage, Infrastructure$Roadway_Coverage)
---
title: "**04. Data Cleaning**"
subtitle: "R101"
author: "Vít Gabrhel"
output: 
  html_notebook:
    toc: true
    toc_float: true
    theme: yeti    
    code_folding: "show"
---
---
# Setting up the library
<br>

* **Packages**:
```{r echo = FALSE, message = FALSE, warning=FALSE, error=FALSE}
# Load packages via function
ipak <- function(pkg){
  new.pkg <- pkg[!(pkg %in% installed.packages()[, "Package"])]
  if (length(new.pkg)) 
    install.packages(new.pkg, dependencies = TRUE)
  sapply(pkg, require, character.only = TRUE)
}

packages <- c("ggplot2", "dplyr", "psych", 
              "summarytools", "readxl", "lubridate",
              "stringr")

ipak(packages)
```
<br>

# Data Exploration
## *Exploration of the raw data - [base](https://stat.ethz.ch/R-manual/R-devel/library/base/html/00Index.html) and [dplyr](https://cran.r-project.org/web/packages/dplyr/dplyr.pdf)*
<br>

* **Data** entry:
```{r}
bmi_1 = read_excel("bmi.xlsx", sheet = 2)
```
<br>

* Check the **class** of *bmi*:
```{r}
class(bmi_1)
```
<br>

* Check the **dimensions** of *bmi*:
```{r}
dim(bmi_1)
```
<br>

* View the **column names** of *bmi*:
```{r}
colnames(bmi_1)
```
<br>

* **Structure** of the data:
```{r}
str(bmi_1)
```
<br>

* **Glimpse**:
```{r}
# library(dplyr), install.packages(“dplyr“)
glimpse(bmi_1)
```
<br>

* **Summary**:
```{r}
summary(bmi_1)
```
<br>

* **First** 10 rows:
```{r}
head(bmi_1, n = 10)
```
<br>

* **Last** 10 rows:
```{r}
tail(bmi_1, n = 10)
```
<br>

## *Exploration of the raw data - [psych](https://cran.r-project.org/web/packages/psych/psych.pdf)*
<br>

* Load psych:
```{r}
# library(psych), install.packages("psych")
```
<br>

* Check the **structure** of *bmi*, the **psych** way:
```{r, warning=FALSE}
describe(bmi_1)
```
<br>

* Check the **structure** of *bmi*, the **psych** way - **by group**:
```{r, warning = FALSE}
bmi_2 <- bmi_1 %>%
         group_by(Country) %>%
         mutate(Health = if_else(BMI_1980 <= 18.5 | BMI_1980 >= 24.9, 
                                 "Unhealthy", "Normal"))


describe.by(bmi_2$BMI_1980, group = factor(bmi_2$Health))
```
<br>

## *Exploration of the raw data - [summarytools](https://cran.r-project.org/web/packages/summarytools/summarytools.pdf)*
<br>

* Load summarytools:
```{r}
# library(summarytools), install.packages("summarytools")
```
<br>

* Data:
```{r, warning=FALSE}
Manpower <- read.csv('Manpower.csv')
```
<br>

* Check the **structure** of *bmi*, the **summarytools** way:
```{r, warning=FALSE, eval=FALSE, message=FALSE}
view(dfSummary(Manpower))
```

# Introduction to Data Wrangling
<br>

* Data:
```{r}
Infrastructure <- read.csv2("Infrastructure.csv")
```
<br>

* Preview *Infrastructure* with **str()**:
```{r}
str(Infrastructure)
```
<br>

* Coerce *Country* to **character**:
```{r, warning=FALSE}
Infrastructure$Country <- as.character(Infrastructure$Country)
```
<br>

* Coerce *Rank* to **factor**:
```{r, warning=FALSE}
Infrastructure$Rank <- as.character(Infrastructure$Rank)
```
<br>

* Look at *Infrastructure* once more with **str()**:
```{r, warning=FALSE}
str(Infrastructure)
```
<br>

## *Strings*
<br>

* Load the stringr package:
```{r}
# library("stringr") install.packages("stringr")
```
<br>

* **Trim** all leading and trailing whitespace:
```{r}
name = c(" Filip ", "Nick ", " Jonathan")
str_trim(name)
```
<br>

* **Pad** these strings with leading zeros:
```{r}
pad = c("23485W", "8823453Q", "994Z")

str_pad(pad, width = 9, side = "left", pad = "0")
```
<br>

* **Print** state abbreviations:
```{r}
head(Manpower$Country)
```
<br>

* Make states all **uppercase** and save result:
```{r}
# to states_upper
states_upper <- toupper(Manpower$Country)
head(states_upper)
```
<br>

* Make states_upper all **lowercase** again:
```{r}
states_lower <- tolower(Manpower$Country)
head(states_lower)
```
<br>

* Look at the **head** of *Infrastructure*:
```{r}
head(Infrastructure)
```
<br>

* **Detect** all "Republic" in *Country*:
```{r}
str_detect(Infrastructure$Country, "Republic")
```
<br>

* In the Country column, **replace** "Republic" with "R"...:
```{r}
Infrastructure$Country <- str_replace(Infrastructure$Country, "Republic", "R")
```
<br>

## *Valid Values*

* In R, represented as NA
* May appear in other forms
  + N/A (Excel)
  + Single dot (SPSS, SAS)
  + Empty string
  
<br>

* **Inf** - "Infinite value" (indicative of outliers?):
```{r}
1/0

1/0 + 1/0

33333^33333
```
<br>

* **NaN** - "Not a number" (rethink a variable?):
```{r}
0/0

1/0 - 1/0
```
<br>

* **character**:
```{r eval=FALSE}
"treatment", "123", "A"
```
<br>

* **numeric**: 
```{r eval=FALSE}
23.44, 120, NaN, Inf
```
<br>

* **integer**: 
```{r eval=FALSE}
4L, 1123L
```
<br>

* **factor**:
```{r eval=FALSE}
factor("Hello"), factor(8)
```
<br>

* **logical**:
```{r eval=FALSE}
TRUE, FALSE, NA
```

<br>

## *Missing Values*

<br>

* **Data**:
```{r}
name = c("Jerry", "Beth", "Rick", "Morty")
n_friends = c(NaN, NA, Inf, 2)
status = c("Listening to human music", "Happy Family", "Garage", "")
social_df = data.frame(cbind(name, n_friends, status))
```

<br>

* Call **is.na()** on the **full** *social_df* to **spot all NAs**:
```{r}
is.na(social_df)
```

<br>

* Use the **any()** function to ask whether there are **any NAs** in the data:
```{r}
any(is.na(social_df))
```

<br>

* View a **summary()** of the dataset:
```{r}
summary(social_df)
```

<br>

* Call **table()** on the *status* column:
```{r}
table(social_df$status)
```

<br>

* **Replace all empty strings** in *status* with **NA**:
```{r}
social_df$status[social_df$status == ""] <- NA
```

<br>

* **Print** *social_df* to the console:
```{r}
social_df
```

<br>

* Use **complete.cases()** to see which **rows have no missing values**:
```{r}
complete.cases(social_df)
```

<br>

* Use **na.omit()** to **remove all rows** with **any missing values**:
```{r}
na.omit(social_df)
```

## *Outliers*

<br>

* **Data**:
```{r}
Infrastructure = read.csv2("Infrastructure.csv")
```

<br>

* **Histogram**:
```{r}
hist(Infrastructure$Ports)
```

* **Boxplot**:
```{r}
boxplot(Infrastructure$Airports)
```

* **Scatterplot**:
```{r}
plot(Infrastructure$Railway_Coverage, Infrastructure$Roadway_Coverage)
```

## Resources

[DataCamp's Free Classroom Model Used in 180 Countries](https://www.datacamp.com/community/blog/datacamps-free-classroom-model-used-in-180-countries?fbclid=IwAR1Bo0B4-D2M7LDAGUg6LVbYKyeJn4O6FYxKs0wqEgA7u0EzM3cUOzHSGKY)
<br>

[This R Data Import Tutorial Is Everything You Need](https://www.datacamp.com/community/tutorials/r-data-import-tutorial)
<br>

[The Landscape of R Packages for
Automated Exploratory Data Analysis
](https://journal.r-project.org/archive/2019/RJ-2019-033/RJ-2019-033.pdf)



