---
title: "Intro to R and RStudio 1"
output: html_notebook
---

Goal: Learn basic data types, functions and data structures
Course: AI in Finance
Date: 23.9.2023
Author: Martina Halousková
License: GPL>=3

------------------------------------------------------

### What is an R Notebook?

This is an [R Markdown](http://rmarkdown.rstudio.com) Notebook. When you execute code within the notebook, the results appear beneath the code.

Try executing this chunk by clicking the *Run* button within the chunk or by placing your cursor inside it and pressing *Ctrl+Shift+Enter*.

```{r}
plot(cars)
```

Add a new chunk by clicking the *Insert Chunk* button on the toolbar or by pressing *Ctrl+Alt+I*.

When you save the notebook, an HTML file containing the code and output will be saved alongside it (click the *Preview* button or press *Ctrl+Shift+K* to preview the HTML file).

The preview shows you a rendered HTML copy of the contents of the editor. Consequently, unlike *Knit*, *Preview* does not run any R code chunks. Instead, the output of the chunk when it was last run in the editor is displayed.

------------------------------------------------------

## Mathematical operations (using R as a calculator)

First, let's try using R as a calculator. The chunks below each contain a simple mathematical operation. Try running them one by one. The output should appear below the chunk, but also in the Console (bottom left window).

*Tip: You can also use the console directly!*

```{r}
2+2
```

```{r}
6/3
```

```{r}
2^3 # this is a comment, you can use it to write down any notes
```

```{r}
(5-2)*2
```

A chunk of code can include several lines of code. Try running the following one to see what kind of output it produces.

```{r}
1+9-7
5*4
2*(87-65)
```

Great! It seems that each line of code corresponds to a line of output below the chunk of code.

Now it is time to create your own chunk of code.
Click on the *Insert Chunk* button on the toolbar or press *Ctrl+Alt+I* and calculate $123 \times 456 \times 789$.

Next we would like to calculate the square root of sixteen, $\sqrt{16}$. To do so, we will use our first pre-built function, sqrt().

```{r}
sqrt(16)
```

Notice the syntax: a function name, "sqrt" in this case, is followed by brackets (). Inside these brackets, the function accepts input arguments, such as the number 16 in our case. This is very similar to Excel functions.

There are many useful pre-built functions. Let's use another one to calculate the natural logarithm of ten, $\ln(10)$.

```{r}
log(10) # log base e
```

------------------------------------------------------

## Numeric variables

So far we have only used R as a calculator. However, very often we would like to remember some values, or even the results of our calculations. To do so, we will create *objects*.

Let's start with a simple example. We have the number 3. We want to "write down" and remember this number. To do so, we will *assign* the value 3 to an object "x". You can think of it as a variable x.

```{r}
x=3
```

- The assignment operator used here is =. The traditional operator <- does the same job and is still widely used. At the top level of a script the two are interchangeable (inside a function call, = names an argument instead), so learn one and stick with it.
- Notice that the value was not printed out below the chunk of code. If we want to print it out, we have to type "x" on a new line like so:

```{r}
x=3
x
```

Now that we can assign values to objects, it will be much easier to do some calculations. Here is an example:

```{r}
a=6
b=8
a+b
a*b
(a+b)^2
```

Notice that x, a and b are now in our environment (top right window). This means that we can keep working with them.

The next step is assigning the result of our calculation to a new object c, such that $c=a+b+10$.

```{r}
c=a+b+10
c
# Notice that spaces are not important, and we can use = or <-.
# c and d should have the same value.
d <- a+b + 10
d
```

One more thing. R is case sensitive, which means that object E is not the same as object e.
```{r}
e=3
E # Error: object 'E' not found, because R is case sensitive (E is not e)
```

------------------------------------------------------

## Data types, vectors

Let's move on to some more complex objects that can store more than just one value: vectors!

Say there are 4 students whose heights are 165, 151, 182 and 177 centimeters. We can store these numbers in a data vector with the c() function. The function c() *combines* values into one object, one vector.

```{r}
heights=c(165,151,182,177)
```

By the way, any time you do not know what a function does, you can ask R. Simply type ? followed by the function, for example ?c(). It will open a help page (bottom right window).

```{r}
?c()
```

Alright, let's look at the data.

```{r}
heights
```

Now that we have defined the vector heights, it should be in our environment. We can also ask it some questions, such as: How tall is the first student?

```{r}
heights[1] # [1] refers to the first observation
```

Now it is your turn. Create a chunk of code that calculates the height of the second student plus the height of the third student and stores the result in an object called "res".

Good news! We can combine vectors. Simply use the same function c() on two vectors.

```{r}
# Notice that the original object x=3 is overwritten here.
x=c(1.5, 3.2, -4.7)
y=1:10
z=c(x,y)
z
```

You might have noticed that the vectors in the environment all have either "num" or "int" written next to them. These tell you the class of the values stored in the vector.

```{r}
class(x)
class(y)
```

*Note: A sequence such as 1:10 is stored as whole numbers of class "integer", while values created with c(), such as 1.5 or even 165, are stored as "numeric". For our purposes the difference does not really matter.*

Or, you can directly ask whether x has a specific class.

```{r}
is.numeric(x)
is.character(x)
is.integer(x)
```

Apart from numbers, vectors can store other types of values as well. Next up are character vectors, which store plain text. Let's continue with the example of 4 students and give them names: John, Sarah, Tim and Jane.
```{r}
students=c("John", "Sarah", "Tim", "Jane")
```

Can you find out the class of this new vector "students"?

```{r}

```

We can also convert data types back and forth.

```{r}
# Convert vector x to plain text.
x.char=as.character(x)
x.char # These are no longer numbers!
```

```{r}
# Convert back to numbers.
x.num=as.numeric(x.char)
x.num
```

Next up, logical vectors. Do these four students speak English?

```{r}
english=c(TRUE,FALSE,FALSE,TRUE)
class(english)
```

How many of them speak English?

```{r}
sum(english)
```

These were logical values. Now we move on to *logical operators*. These will help us ask more questions.

Say we want to know whether "a" equals zero. We can use the double equal sign (==), which tests whether a equals 0 and returns either TRUE or FALSE.

```{r}
a == 0
```

Interestingly, we can ask the same question about a vector of numeric values, such as x.

```{r}
x == 0
```

How would you interpret the output?

The same can be done for characters. For example, let's ask R which students are named John.

```{r}
students == "John"
```

Or the other way around: which students are NOT named John? To do so, we introduce a second logical operator, !=. Here ! stands for NOT, so != is supposed to look like a crossed-out equal sign.

```{r}
students != "John"
```

Being able to find out whether an object is equal to something or not is very useful. Often we will also want to know if x is smaller or bigger than a given number. For instance:

```{r}
# Which students are taller than 160 cm?
heights > 160
# Which students are shorter than 170 cm?
heights < 170
# Is someone 177 cm or taller?
heights >= 177
# Is someone 165 cm or shorter?
heights <= 165
```

Remember, these operators will be super helpful later on. The comparison operators used above (==, !=, <, >, <=, >=) are described on the help page ?Comparison, and the operators for negating and combining conditions (!, &, |) on the help page ?Logic. Both pages appear after running the following code:

```{r}
?Comparison
?Logic
```

Alright, and finally one bonus class of vectors: factors.
Start by creating a character vector.

```{r}
group = c("Group A", "Group B")
group
class(group)
```

Now we can convert the character vector into a factor vector.

```{r}
group = as.factor(group)
group # Notice that the vector is no longer as simple; instead it has "levels"
class(group)
```

------------------------------------------------------

## Working with vectors

Let's explore the vector with the heights of students some more. We will apply some basic functions and ask some interesting questions.

First up: How long is the vector with heights? (How many items are in the vector?)

```{r}
length(heights)
```

Imagine all students standing on top of each other. How high would this tower of students be?

```{r}
# Hint: sum up all the heights
sum(heights)
```

What is the average height?

```{r}
# Either
sum(heights)/length(heights)
# Or
mean(heights)
```

Who is the shortest?

```{r}
# What is the minimum height?
min(heights)
# Which student is the shortest? In other words, at which position in the vector is the smallest height?
which(heights == min(heights))
```

Which student is the tallest?

```{r}
max(heights)
which(heights == max(heights))
```

Bonus task: remember the position of the tallest student (assign this number to a new object called "ind") and use it to find out the name of this student.

```{r}
ind = students[ ]
```

Min and max heights at the same time.

```{r}
range(heights)
```

What is the standard deviation of the vector heights?

```{r}
sd(heights)
```

Instead of calling multiple functions, we can use the summary() function, which produces basic descriptive statistics.

```{r}
summary(heights)
```

What if we want the same information in a graph? We can use a boxplot, our first graph!

```{r}
boxplot(heights)
```

Imagine that we want to line up these students by their heights. To do so, let's use the function sort().
```{r}
sort(heights) # this is the default version
sort(heights, decreasing = TRUE) # setting the argument "decreasing" to TRUE sorts from the largest number
sort(heights, decreasing = FALSE) # setting the argument "decreasing" to FALSE sorts from the smallest number
```

*Notice: By using additional arguments in brackets (), you can modify what a function does. The available arguments are usually listed on the help page.*

What is the cumulative sum of the heights? We can use the function cumsum(), which returns the running total of the heights, one item at a time.

```{r}
cumsum(heights)
```

One year has gone by and our 4 students have grown. Let us define a new vector with their new heights.

```{r}
# Heights of these students the next year
heights.next = c(170,151,183,179)
```

Let's explore:

- How much did they grow?

```{r}
heights.next - heights
```

- How much did they grow on average?

```{r}
mean(heights.next - heights)
```

*Notice that arithmetic operators such as - are applied to each item in the vector, one pair of items at a time.*

------------------------------------------------------

## Conditional Statements and Loops

### Conditional Statements (Similar to IF in Excel)

Next up, we are going to learn something about decision making in R. The main building blocks are the if... else... statements.

The syntax of an if statement is:

```
if (test_expression) {
  statement
}
```

If test_expression is TRUE, the statement gets executed. If it is FALSE, nothing happens. If we want something to happen when the test_expression is FALSE, we need to add the "else" expression:

```
if (test_expression) {
  statement1
} else {
  statement2
}
```

Here is a simple example in which we test whether a variable is equal to 1.

```{r}
variable=1
if (variable == 1){
  print("Yes")
}
```

Now let's extend this example. If variable equals 1, print "Yes"; if not, print "No".
```{r}
variable=1
if (variable == 1){
  print("Yes")
} else {
  print("No")
}
```

Try setting variable equal to any value you want

```{r}
variable = 
```

And complete the following code so that: if "variable" is greater than 4, it is divided by 2 and printed. If not, print a message that tells us the variable is not greater than 4.

```{r}
if(){

}else{

}
```

### For loops

Quite often we will want to run a piece of code repeatedly. If we want a set of operations to be repeated several times, we use what’s known as a loop. When you create a loop, R will execute the instructions in the loop a specified number of times or until a specified condition is met.

There are 2 main types of loops we will be using: a **for loop** and a **while loop**.

A for loop is useful when you want to repeat a task numerous times. Let's look at the following example. The code cycles through a sequence of numbers, 1 to 10. In each step, an index i takes a value from this sequence and gets printed out. In the first iteration, i=1 and the loop prints out the number 1, and so on.

```{r}
for (i in 1:10){
  print(i)
}
```

The code that is performed inside the loop can be much more complicated and as long as you want.

```{r}
for (i in 1:10){
  j=i*2
  print(j)
}
```

And the numbers that we iterate (loop, cycle) over do not need to start with the number 1. In fact, we can define them in various ways.

```{r}
for (i in seq(5,15)){
  print(i)
}
```

Let's combine loops with if statements. Try running the following example. Can you describe what is happening inside the loop?

```{r}
for (i in 1:30){
  if((i %% 2) == 0) {
    print(paste(i,"is even"))
  } else {
    print(paste(i,"is odd"))
  }
}
# Hint: ‘%%’ indicates ‘x mod y’
```

### While loops

The second type of loop, the **while loop**, is useful when we want to run a piece of code until a specified condition is met. For instance, in the following chunk of code, we want to print i while it is smaller than 10. Once i=10, the loop stops on its own.
```{r}
i=1
while (i < 10) {
  print(i)
  i=i+1
}
```

------------------------------------------------------

## Functions

Previously, we introduced how to work with the functions that R has pre-built. These were for example: mean(), print(), sort(), boxplot(), summary() and so on. However, we will often want to define our own custom functions.

The following example defines a very simple function. Examine its structure, try to describe what the function does, and try to use this new function.

```{r}
# Define
my_function <- function(x){
  x^2/2
}
# Use
my_function(3)
```

Try defining a new function that returns:

- the sum of all values in a vector, if the vector has at least 5 values,
- or the mean of all values, if there are fewer than 5 values.

Store the result in a variable y and **return** it.

```{r}
# Define
my_func <- function(x){
  if (length(x) >= 5){
    y = sum(x)
  } else {
    y = mean(x)
  }
  return(y)
}
# Use
x=1:14
my_func(x)
```

------------------------------------------------------

## Data structures - dataframes

So far, we have only worked with vectors, but let's be honest, that won't be enough. So let's move on to some more complex data structures. First up are data frames: they look like tables and behave much like an Excel spreadsheet.

### Constructing data frames (something like an Excel spreadsheet)

Let's construct our first dataframe from three vectors.

```{r}
vec1 = rep(c("A","B","C","D","E","F"), 10)
vec2 = rnorm(n=60, mean=10, sd=5)
vec3 = rep(c("TRUE","FALSE"),30) # note: these are character strings, not logical values
DT=data.frame("grades"=vec1, "points"=vec2, "logicvec"=vec3)
```

Explore the dataframe. There are 2 ways of viewing the dataframe:

```{r}
# 1. Display here
DT
```

```{r}
# 2. Open in a second window
View(DT)
```

We can have a look at the first rows,

```{r}
head(DT)
```

or have a look at the bottom of the table, at the last rows.

```{r}
tail(DT)
```

What are the dimensions?

```{r}
dim(DT)
# Number of rows
nrow(DT)
# Number of columns
ncol(DT)
```

What are the column names and row names?
```{r}
names(DT)
colnames(DT)
rownames(DT)
```

If you do not like the row names, you can change them simply by assigning a new vector of names.

```{r}
rownames(DT)=paste("row", 1:60)
rownames(DT)
```

### Accessing elements

Apart from displaying the whole table and looking at the head() and tail(), there are a few ways of accessing items in the dataframe. You can do so:

1. By column names *DT$columnname*

```{r}
DT$grades
```

```{r}
DT$points
```

```{r}
DT$logicvec
```

2. By indices *DT[row,column]*

```{r}
# Look at a single row
DT[2,]
```

```{r}
# Look at a single column
DT[,2]
```

```{r}
# Look at the value in the first row and first column
DT[1,1]
DT["row 1","grades"]
```

```{r}
# Look at multiple rows and multiple columns
DT[c(3,5),1:2]
```

Or by combining the two:

```{r}
DT$grades[1]
```

### Setting new values

Instead of just accessing part of the dataframe, you can also set new values.

1. By names

```{r}
# Define a new column called empty and assign empty values (NAs)
DT$empty=NA
DT
```

```{r}
# Define a new column points2 by multiplying values from column points by 2.
DT$points2=DT$points*2
DT
```

```{r}
# Define a new column diff by subtracting one column from the other.
DT$diff=DT$points2 - DT$points
DT
View(DT)
```

2. By indices

```{r}
# Rewrite the 6th column (remember DT[rows,columns]!) and assign 60 random numbers.
DT[,6]=rnorm(n=60, mean=0, sd=1)
# Rewrite the full 60th row with the number 1.
DT[60,]=1
```

3. With logical operators

```{r}
# Select only rows where the column "grades" contains the letter "A". Store this section of the dataframe as a new dataframe.
nerds = DT[DT$grades == "A",]
# Select only rows where the column "points" contains the number 10 or more. Store this section of the dataframe as a new dataframe.
top = DT[DT$points >= 10,]
```

4. Subsetting

```{r}
# Select only rows where the column "points" contains the number 10 or more.
# This time use the function subset, which also allows you to select specific columns.
# Choose columns points and diff and store the result into a new dataframe.
topdiff = subset(DT, points>= 10, select = c(points,diff))
```

### Combine dataframes

We already know how to subset (select) parts of the dataframe. Now let's try to merge them back together.

The function rbind() takes 2 dataframes with an equal number of columns and stacks them on top of each other.

```{r}
rbind(nerds,top)
```

The function cbind() takes 2 dataframes with an equal number of rows and binds them next to each other.

```{r}
cbind(top,topdiff)
```

------------------------------------------------------

## Working directory, Save and load files

At the beginning of each lesson or analysis, we will need to load a dataset into the environment. At the end, we will want to save the modified dataset and the results of our analysis to a folder on our computer. R can work with various file formats; let's go over a few.

### CSV

Save this dataset as a csv file to a folder called "data". Reminder: always make sure you are in the correct folder (often referred to as the working directory).

```{r}
# getwd() will print out your current working directory
getwd()
# this is how you can "write it down"
my_wd=getwd()
# Create the folder "data" if it does not exist yet
dir.create("data", showWarnings = FALSE)
# Changing working directory
setwd("./data")
# Save a CSV file in the current directory
write.csv(x = DT, file = "DT.csv")
# set the original directory
setwd(my_wd)
```

Load this dataset back to a new object.

```{r}
setwd("./data")
newDT=read.csv("DT.csv")
setwd(my_wd)
```

### RData

Save this dataset as an RData file to a folder called "data".

```{r}
setwd("./data")
save(DT, file="DT.RData")
load("DT.RData")
setwd(my_wd)
```

### RDS

Save this dataset as an RDS file to the folder "data" and load it back.

```{r}
setwd("./data")
saveRDS(DT, file="DT.rds")
DT=readRDS("DT.rds")
setwd(my_wd)
```

------------------------------------------------------

## Data structures - LISTS

Vectors, as well as any other data types and data structures, can be combined into lists. Lists can be named and nested.
```{r}
mylist = list("grades"=vec1, "points"=vec2, "logicvec"=vec3, "nerds"=nerds, "gradelist"=list("A"=1,"B"=2,"C"=3, "D"=4, "E"=5, "F"=6))
mylist
class(mylist)
str(mylist)
```

### Accessing lists

Just as you can access elements of dataframes, you can access parts of lists. Here are a few ways you can do it:

```{r}
mylist[[1]] # by index
mylist$logicvec # by name
mylist[[4]][1,1] # first row & first column in the fourth list element
mylist$nerds[1,1]
mylist[[5]] # fifth element is a list within a list
mylist[[5]][[1]]
```

```{r}
length(mylist[[2]])
mean(mylist[[2]])
```

------------------------------------------------------

## Data structures - MATRICES

Matrices look very similar to dataframes. The main difference is that all values in a matrix must be of the same type (for example, all numeric), and that row and column names are optional. In addition, you can do basic matrix operations with them.

Let's start by defining two matrices, A and B.

```{r}
A=matrix(1:10,nrow=2)
B=matrix(11:20,ncol=5)
A
B
```

You can test whether they are in fact of the class matrix.

```{r}
is.matrix(A)
is.matrix(B)
```

And here are some examples of how you can calculate with matrices.

```{r}
A+B
```

```{r}
A-B
```

```{r}
A*B
```

```{r}
3*A
```

```{r}
A%*%B # Error: non-conformable arguments (A and B are both 2x5)
```

```{r}
A %*% t(B)
```

```{r}
C = t(A) %*% B
C
```

Dataframes can even be converted directly into matrices.

```{r}
topdiff.m=as.matrix(topdiff)
topdiff.m
```

Or you can create a matrix from two vectors. Remember the rbind() and cbind() functions?

```{r}
k=c(1,8,6,3,11)
l=c(9,7,6,2,2)
m=rbind(k,l)
m
n=cbind(k,l)
n
is.matrix(n)
```
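
The matrix product %*% only works when the number of columns of the left matrix equals the number of rows of the right matrix. The following is a minimal sketch of how you could check this before multiplying. The helper can_multiply() is a hypothetical name made up for this illustration (it is not a base R function), and A and B are redefined with the same values as above so that the chunk runs on its own.

```{r}
# Hypothetical helper (not part of base R):
# returns TRUE if X %*% Y is defined, i.e. ncol(X) equals nrow(Y)
can_multiply <- function(X, Y){
  ncol(X) == nrow(Y)
}

A=matrix(1:10,nrow=2)   # 2 rows, 5 columns
B=matrix(11:20,ncol=5)  # 2 rows, 5 columns

can_multiply(A, B)     # FALSE: 5 columns vs. 2 rows
can_multiply(A, t(B))  # TRUE: t(B) has 5 rows

dim(A %*% t(B))  # a 2 x 2 matrix
dim(t(A) %*% B)  # a 5 x 5 matrix
```

Transposing one of the matrices with t() is what makes the dimensions match in the two products above.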