---
title: "Intro to R and RStudio 2"
output: html_notebook
---

Goal: Learn more about data frames, plots, models and statistical testing
Course: AI in Finance
Date: 23.9.2023
Author: Martina Halousková
License: GPL>=3

------------------------------------------------------

First, create a project and clean up your environment.

```{r}
# Either click on the broom icon top right in the environment tab, or use this command:
rm(list=ls())
```

## SOME MORE FUN STUFF WITH DATAFRAMES

Installing and loading packages.

```{r}
# install.packages("datasets")
library("datasets")
```

Load a built-in dataset from the library datasets.

```{r}
data(iris)   # load the iris dataset
data()       # list all available built-in datasets
DT=iris
DT
```

Let's explore this dataset.

```{r}
DT
head(DT)
tail(DT)
dim(DT)
names(DT)
class(DT)
```

Are there any missing values?

```{r}
is.na(DT)
```

How many missing values are there and, if there are any, where are they?

```{r}
sum(is.na(DT))
which(is.na(DT))
```

A simple way of dealing with missing values is to delete the rows that contain them.

```{r}
DT[complete.cases(DT),]
```

What are the sums of rows and columns?

```{r}
colSums(DT) # returns an error
```

The previous code returned an error. Do you know why?

```{r}
colSums(DT[,-c(5)])
```

```{r}
rowSums(DT[,-c(5)])
```

Summarize this dataset.

```{r}
summary(DT)
```

What are the 3 main species?

```{r}
unique(DT$Species)
```

How many flowers are in each species?

```{r}
table(DT$Species)
```

Or a summary in a graph: a boxplot!

```{r}
boxplot(DT[,-c(5)])
```

### Plots

Let's visualize the dataset with the simple function plot().

```{r}
plot(DT)
# or
pairs(DT)
```

Are the variables correlated?

```{r}
cor(DT) # returns an error, why?
```

```{r}
cor(DT[,-c(5)])
```

Let's focus on a scatterplot of one pair.

```{r}
plot(DT$Sepal.Length, DT$Petal.Length)
```

Can we test the correlation of this pair?

```{r}
cor.test(DT$Sepal.Length, DT$Petal.Length)
```

How about a linear model?
```{r}
model=lm(Petal.Length~Sepal.Length, data=DT)
summary(model)
```

Let's return to that scatterplot and overlay a regression line from the estimated model.

```{r}
plot(DT$Sepal.Length, DT$Petal.Length)
abline(model)
```

Let's look at Petal.Length. Notice anything suspicious?

```{r}
plot(DT$Petal.Length)
barplot(DT$Petal.Length)
hist(DT$Petal.Length)
```

```{r}
table(DT$Petal.Length)
```

```{r}
table(DT$Species,DT$Petal.Length)
plot(DT$Petal.Length, col=DT$Species)
boxplot(Petal.Length~Species,data=DT)
```

```{r}
boxplot(Petal.Length~Species, data=DT,
        main = "Petal Length in 3 species of iris",
        xlab = "Species of flowers",
        ylab = "Petal Length",
        col=c("red","green","blue"))
```

Let's subset Petal.Length into 3 different vectors to compare their histograms.

```{r}
# Column 3 of DT is Petal.Length
setosa = DT[DT$Species == "setosa",3]
versicolor = DT[DT$Species == "versicolor",3]
virginica = DT[DT$Species == "virginica",3]
```

Compare their histograms.

```{r}
par(mfrow=c(1,3))
hist(setosa)
hist(versicolor)
hist(virginica)
par(mfrow=c(1,1))
```

Are these differences significant?

```{r}
t.test(setosa,versicolor)
```

## Bonus: some nicer graphs

### Lattice graphics

Note: plots are created with a single function call (xyplot etc.). Most useful for conditioning types of plots.

```{r}
# install.packages("lattice")
library("lattice")
```

This package has a nicer histogram.

```{r}
histogram(setosa)
```

But it is perhaps most used for the xyplot graph.

```{r}
# Scatterplot of sepal length and petal length
xyplot(DT$Sepal.Length~DT$Petal.Length)
# Scatterplot of petal length and species
xyplot(DT$Petal.Length~DT$Species)
# You can split the first scatterplot based on the variable species
xyplot(DT$Sepal.Length~DT$Petal.Length|DT$Species)
```

### The GGally package

Note: always make sure to select only numeric variables

Produces a summary plot of all variables.

```{r}
# install.packages("GGally")
library(GGally)
ggpairs(iris, columns = 1:5, title="IRIS")
```

### ggplot2

ggplot is a large data visualization package.
It will let you create and customize just about any graphic or graph. We will start with something simple, but feel free to explore ggplot on your own: https://ggplot2.tidyverse.org/.

Load the package.

```{r}
# install.packages("ggplot2")
library(ggplot2)
```

Similar to the lattice package, ggplot also has a function for nicer scatterplots.

```{r}
# Note: qplot() is deprecated since ggplot2 3.4.0; it still works but gives a warning
qplot(Petal.Length, Petal.Width, data = iris)
```

Let's build up a nice scatterplot step by step.

```{r}
# ggplot
g1 = ggplot(data=DT, aes(x=Sepal.Length, y=Petal.Length)) +
  geom_point()
g1
```

Condition the color of the dots on the variable Species, and change the shape and size of the dots.

```{r}
g1 = ggplot(data=DT, aes(x=Sepal.Length, y=Petal.Length, color=Species)) +
  geom_point(shape=17, size=1.2)
g1
```

Add a fitted linear trend line for each species and try a different theme.

```{r}
g2 = g1 + geom_smooth(method="lm")
g2 + theme_minimal()
```

Let's add axis lines to the graph and make them blue and dashed. Make the axis titles bigger and bold, and increase the size of the numbers on the axes.

```{r}
g3 = g2 + theme(axis.line = element_line(colour = "blue", linewidth = .1, linetype = "dashed"),
                axis.text = element_text(size=12),
                axis.title=element_text(size=12,face="bold"))
g3
```

Let's add a title to the graph, change the axis labels and remove the legend title.
```{r}
g4 = g3 + ggtitle("Our first ggplot graph with a title") +
  labs(y = "Petal Length (cm)") +
  labs(x = "Sepal Length (cm)")+
  theme(legend.title = element_blank())
g4
```

And now everything all at once:

```{r}
g5 = ggplot(data=DT, aes(x=Sepal.Length, y=Petal.Length, color=Species)) +
  geom_point(shape=17, size=1.2) +
  geom_smooth(method="lm") +
  theme(axis.line = element_line(colour = "blue", linewidth = .1, linetype = "dashed"),
        axis.text = element_text(size=12),
        axis.title=element_text(size=12,face="bold"))+
  ggtitle("Our first ggplot graph with a title") +
  labs(y = "Petal Length (cm)") +
  labs(x = "Sepal Length (cm)")+
  theme(legend.title = element_blank())
g5
```

---------------------------------------------------------

## Try going through these tasks on your own:

1. Change the rownames of dataset DT.

```{r}

```

2. How many flowers of the species "setosa" have a Sepal wider than 3.5?

```{r}

```

3. Calculate the median and standard deviation of all 4 numeric variables in the dataset. Assign these results to two vectors. Store these two vectors in a list.

```{r}

```

4. Is there a relationship between the Petal length and Petal width?
    - Fit a linear model of Petal length modeled by Petal width
    - Create a scatter plot and add a regression line

```{r}

```

5. Add a new column with the log transformation of the variable Sepal length
    - Make a plot of this new variable. Connect the dots using the type="l" argument
    - Change the axis labels and plot title to explain your graph
    - Change the color of the line to green
    - Assign this plot to an object and then save this object as an RData file
    - Save this plot as an image

```{r}

```

6. Convert the first 4 columns of DT into a matrix. Save this matrix as a csv file.

```{r}

```

7. Find out which rows of DT correspond to Sepal width equal to 2.2.
    - Replace all values in these rows with empty values (NAs)
    - Use the function na.omit() or the function complete.cases() to remove these rows

```{r}

```
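
Hint for tasks 5 and 6: saving objects, images and csv files is not demonstrated in the notebook above, so here is a minimal sketch of the base R functions that could be used. It deliberately works with the built-in `cars` dataset instead of DT, and the object name `speed_log` and the temporary file paths are assumptions made only for this illustration; in your project you would choose your own file names.

```{r}
# Hint sketch only: uses the built-in cars dataset, not DT
data(cars)

# Temporary file paths (an assumption for this example; use your project folder instead)
rdata_file = tempfile(fileext = ".RData")
png_file = tempfile(fileext = ".png")
csv_file = tempfile(fileext = ".csv")

# Save an R object to an RData file; load() would restore it under the same name
speed_log = log(cars$speed)
save(speed_log, file = rdata_file)
load(rdata_file)

# Save a base R plot as an image: open a graphics device, draw, close the device
png(png_file)
plot(cars, main = "Speed and stopping distance")
dev.off()

# Convert a data frame to a matrix and save it as a csv file
M = as.matrix(cars)
write.csv(M, file = csv_file, row.names = FALSE)

# Check that all three files were created
file.exists(c(rdata_file, png_file, csv_file))
```

The same pattern (open a device, plot, `dev.off()`) also works with other base R devices such as `pdf()` or `jpeg()`.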