# Title: Intro to R and statistical background - part I
# Environment: RStudio
# Goal: Learn basic data types, functions and data structures
# Course: AI in Finance
# Date: 23.09.2024
# Author: Martina Halousková, Štefan Lyócsa (small contribution)
# License: GPL>=3

#############################################################
##### Mathematical operations (using R as a calculator) #####
#############################################################

# First, let's try using R as a calculator.
# Each line of code below contains a simple mathematical operation. Try running them one by one.
# Place your cursor on the line you want to run and either (1) press CTRL+ENTER
# or (2) click the small "Run" icon at the top of the script window.
# The output should appear in the Console (bottom left window).
# *Tip: You can also type into the console directly!*
2+2
6/3
2^3 # this is a comment, you can use it to write down any notes.
# Whatever is written after the hashtag will not be executed. The hashtag has to be repeated on every line (note the text above).
# You can think of the script file as a list of commands that will be executed by RStudio.
# The language it uses is R, so we have to speak to it in this language.
(5-2)*2
1+9-7
5*4
2*(87-65)

# Now it is time to create your own line of code. Calculate how much is 123*456*789. You can type it into the console
# or you can highlight the numbers above and press CTRL+ENTER.

# Next we would like to calculate the square root of sixteen.
# To do so, we will need to use our first pre-built function sqrt().
sqrt(16)
# Notice the syntax: the name of the function, "sqrt" in this case, is followed by brackets ().
# Inside these brackets, the function accepts input arguments, such as the number 16 in our case.
# This is very similar to Excel functions. There are many useful pre-built functions.
# Let's use another one, to calculate the natural logarithm of ten, $\ln(10)$
log(10) # log at the base e
# How do we know that the base is 'e'?
# Functions have arguments, which are further options. This function has an
# argument 'base'.
log(10,base=2)
# How do I know what the arguments of a function are? We can type
?log
# or
args(log)
# In the first case the 'help' page shows you details about the function. If you do not specify an argument,
# its default value is used (for log() the default base is e).
# In the second case, we use another function 'args', which returns the arguments of the function 'log'.
# Later you will even create your own functions, or you will use functions that I will create.

#############################################################
##################### Numeric variables #####################
#############################################################

# So far we have only used R as a calculator. However, very often we would like to remember some values,
# or even the results of our calculations. To do so, we will create *objects*.
# Let's start with a simple example. We have a number 3. We want to "write down" and remember this number.
# To do so, we will *assign* the value 3 to an object "x". You can view this as a variable x.
x=3
# The assignment operator is =. The traditional operator (which can still be used) is <-.
# For simple assignments like these, the two can be used interchangeably. I use '=' but most of the people around me prefer '<-'. It is up to you.
# Notice that the value was not printed out.
# If we want to print it out, we have to type "x" on a new line like so:
x=3
x
# Now that we can assign values to objects, it will be much easier to do some calculations. Here is an example:
a=6
b=8
a+b
a*b
(a+b)^2
# Notice that x, a and b are now in our environment (top right window).
# This means that we can work with them further. The next step is assigning
# the result of our calculation to a new object c, such that $c=a+b+10$.
c=a+b+10
c
d <- a+b + 10 # Notice that spaces are not important, and we can use = or <-. c and d should have the same value.
d
# One more thing. R is case sensitive, which means that object E is not e.
e=3
E # R is case sensitive (E is not e)
# Running E gives an error message, because no object called E exists. Errors are common. The script I prepare for the class
# should not contain many errors, but while preparing it I certainly made a lot of them and fixed them along the way.
# Do not get discouraged by them. They are a necessary part of learning.

#############################################################
#################### Data types, vectors ####################
#############################################################

# Let's move on to some more complex objects that can store more than just one value - vectors!
# Say there are 4 students, and their heights are 165, 151, 182 and 177 centimeters.
# We can store these numbers in a data vector with the c() function. This function c() *combines* values into one object - one vector.
heights=c(165,151,182,177)
# By the way, as before, any time you do not know what a function does, you can ask R. Simply type ? followed by the name of the function.
# It will open a help page (bottom right window).
?c
# Alright, let's look at the data.
heights
# Now that we have defined the vector heights, it should be in our environment. Also, we can ask it some questions, such as:
# How tall is the first student?
heights[1] # [1] refers to the first observation
heights[5] # What happens if you make a [] reference beyond the length of the vector?
# By the way, the length of the vector, which tells you the number of observations, is intuitively given by:
length(heights)
# You could also combine both. If I want to see the last observation and I do not know how many I have:
heights[length(heights)]
# An alternative using the function tail()
tail(heights,n=1) # The function returns the last 'n' observations
tail(heights,n=2)

# Now it is your turn. Create a command (line) that will calculate the height of the second student
# plus the height of the third student and store the result into an object called "res".
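
# A related sketch (an illustration added for this task, not the only way to do it):
# square brackets also accept a vector of positions, so several elements can be
# picked at once. The object name 'first.last' is made up for this example.
heights=c(165,151,182,177)    # repeated here so that the snippet runs on its own
heights[c(1,4)]               # heights of the first and the fourth student
heights[-1]                   # a negative index drops that element: all but the first student
first.last = heights[1] + heights[4]
first.last                    # 165 + 177 = 342
sum(heights[c(1,4)])          # the same sum in one step
# The same pattern with positions 2 and 3 gives you "res".
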
# We can name the elements of a vector - the function is names()
names(heights) = c('Alex','Adam','Arnold','Pierce')
heights

# We can combine vectors. Simply use the same function c() on two vectors.
x=c(1.5, 3.2, -4.7)
y=1:10
z=c(x,y)
z
# Notice that the original object x=3 was overwritten.

# You might have noticed that the vectors in the environment all have either "num" or "int" written next to them
# - these tell you the class of the values stored in the vector.
class(x)
class(y)
# *Note: Whole-number sequences created with ':' (such as 1:10) are stored as "integer"; numbers that you type in,
# such as 3 or 1.5, are stored as "numeric". For our purposes the difference does not really matter.*
# Or, you can directly ask whether x has a specific class.
is.numeric(x)
is.character(x)
is.integer(x)
# This brings us closer to logical statements. But why would you use such functions?
# Imagine that you expect a certain outcome. You run an analysis and the outcome should be
# an integer, but the result is something like 0.234. You check the outcome at the end using the function is.integer()
# and if it returns FALSE, you can program what to do next.

# As suggested before, apart from numbers, vectors can store other kinds of values. Next up are character vectors, which store plain text.
# Let's continue with the example of 4 students and give these 4 students names: John, Sarah, Tim and Jane.
students=c("John", "Sarah", "Tim", "Jane")
# Can you find out what the class of this new vector "students" is?

# We can also convert data types back and forth.
# Convert vector x to plain text.
x.char=as.character(x)
x.char
# These are no longer numbers!
# Convert back to numbers.
x.num=as.numeric(x.char)
x.num

# Next up, logical vectors. Do these four students speak English?
english=c(TRUE,FALSE,FALSE,TRUE)
class(english)
# How many of them speak English?
sum(english)

# These were logical values. Now we move on to *logical operators*. These will help us ask more questions.
# Say we want to know whether "a" equals zero.
# We can use the double equal sign (==),
# which will test whether a equals 0 and return either TRUE or FALSE.
a == 0
# Interestingly, we can ask the same question about a vector of numeric values, such as x.
x == 0
# How would you interpret the output?
# The same can be done for characters. For example, let's ask R which students are named John.
students == "John"
# Or the other way around, which students are NOT named John. To do so, we introduce a second operator, !=.
# Here ! stands for NOT, and so != is supposed to look like a crossed-out equal sign.
students != "John"

# Being able to find out whether an object is equal to something or not is very useful!
# Often we will want to know if x is smaller or bigger than a given number. For instance:
# Which students are taller than 160 cm?
heights > 160
# Which students are shorter than 170 cm?
heights < 170
# Is someone 177 cm tall or taller?
heights >=177
# Is someone 165 cm tall or shorter?
heights <=165
# Remember, these operators will be super helpful later on. The logical operators (such as !, & and |) are described in the help page
# that will appear after running the following code (the comparison operators above are described in ?Comparison):
?Logic

# Alright, and finally one bonus class of vectors - factors. Start by creating a character vector.
group = c("Group A", "Group B")
group
class(group)
# Now we can convert the character vector into a factor vector.
group = as.factor(group)
group
# Notice that the vector is no longer as simple, instead it has "levels"
class(group)

#############################################################
#################### Working with vectors ###################
#############################################################

# Let's explore the vector with heights of students some more.
# We will apply some basic functions & ask some interesting questions.
# First up: How long is the vector with heights? (How many items are in the vector?)
length(heights)
# Imagine all students standing on top of each other.
# How high would this tower of students be?
# Hint: sum up all the heights
sum(heights)

# What happens if we multiply two vectors?
heights * heights # It is the element-wise multiplication
# What if you want matrix multiplication (recall linear algebra)?
heights %*% heights
# Which is the sum of squared elements
sum(heights * heights)

# What happens if we multiply by a vector of a different length?
age = c(22,24,29)
heights*age
# We get a warning (not an error). So it does something. But what? Let's explore
heights[1:3]*age
heights[4]*age[1]
# What it actually did was the following
heights*c(age,age[1])
# Because age was the shorter object, R reused (recycled) its values from the beginning to fill in the missing part.

# What is the average height?
# Either
sum(heights)/length(heights)
# Or
mean(heights)

# Another measure of the centrality of data - the median. It is much more robust to outliers
median(heights)
# Here it is similar to the mean. But let's try the following:
mean(c(1,3,4,5,6,1080))
median(c(1,3,4,5,6,1080))
# Quite the difference. Note that the goal here is to describe the data. A centrality measure should tell us
# around which value the numbers are centered. However, the mean would be a misleading result here. There are no
# values anywhere near the mean. Around the median we have at least 5 observations.

# Who is the shortest?
# What is the minimum height?
min(heights)
# Which student is the shortest? In other words, at which position in the vector of heights is the smallest number?
which(heights == min(heights))
# which() is a super useful function. I use it all the time when I work with data.

# Which student is the tallest?
max(heights)
which(heights == max(heights))
# Bonus task: remember the position of the tallest student (assign this number to a new object called "ind")
# and use it to find out the name of this student.
ind =
students[ ]

# Min and max heights at the same time.
range(heights)
# What is the standard deviation of vector heights?
sd(heights)

# Instead of calling multiple functions, we can use the summary() function,
# which produces basic descriptive statistics.
summary(heights)
# We have the minimum, median, mean, maximum and quartiles!
# The 1st Qu. is the lower quartile. It is an estimate of the upper boundary of the lowest 25% of values.
# To put it differently, 25% of all values should be below the lower quartile.
# Note that it is an estimate and we have only 4 observations. The estimate will be much more accurate with more
# observations in a sample.

# Quartiles are a special case of the so-called quantiles. We can ask for others, but let me first create a larger (random) set of
# numbers, just to have something with more observations
v = rnorm(200,mean=2,sd=1)
summary(v)
# In the authors' run, 25% of values were below 1.2445 and 25% were above 2.7024. Your numbers will differ,
# because rnorm() draws new random numbers every time. Let's try quantile()
quantile(v,probs=c(0,0.25,0.5,0.75,1))
# Up to some rounding, this is the same as summary(). In the argument 'probs=c()' we specify which quantiles we are interested in:
quantile(v,probs=c(0,0.1,0.25,0.5,0.75,0.9,1))

# What if we want the same information in a graph? We can use a boxplot - our first graph!
boxplot(heights)
# This is not a 'nice' graph but you can play with it a lot! We will do that later.

# Imagine that we want to line up these students by their heights.
# To do so, let's use the function sort().
sort(heights) # this is the default version
sort(heights, decreasing = TRUE) # setting the argument "decreasing" to TRUE will sort from the largest number
sort(heights, decreasing = FALSE) # setting the argument "decreasing" to FALSE will sort from the smallest number
# *Notice: By using additional arguments in brackets (), you can modify what a function does.
# All available arguments are usually listed in the help page.*

# What is the cumulative sum of the heights? We can use the function cumsum(), which adds up the heights one by one.
cumsum(heights)

# One year has gone by and our 4 students have grown.
# Let us define a new vector with their new heights.
# Heights of these students the next year
heights.next = c(170,151,183,179)
# Let's explore:
# How much did they grow?
heights.next - heights
# How much did they grow on average?
mean(heights.next - heights)
# *Notice that arithmetic on vectors is applied to each item in the vector.*

#############################################################
############# Conditional Statements and Loops ##############
#############################################################

#### Conditional Statements (Similar to IF in MS Excel) ####
# Next up we are going to learn something about decision making in R.
# The main building blocks are the if... else... statements.
# The syntax of an if statement is:
if (test_expression) {
  statement
} # This is an example, do not mind the error.
# Here, if test_expression is TRUE, the statement gets executed. But if it is FALSE, nothing happens.
# If we want something to happen when the test_expression is FALSE, we need to add the "else" expression:
if (test_expression) {
  statement1
} else {
  statement2
} # This is an example, do not mind the error.

# Here is a simple example, in which we want to test whether a variable is equal to 1.
variable=1
if (variable == 1){
  print("Yes")
}
# Now let's extend this example. If variable is greater than 0, print the expression "Yes", if not, print "No".
variable=1
if (variable > 0){
  print("Yes")
} else {
  print("No")
}

# Try setting variable equal to any value you want,
variable =
# and rewrite the following code, so that:
# If "variable" is greater than 4, divide it by 2 and print it.
# If not, print a message that tells us the variable is not greater than 4.
if(){

}else{

}

#### For loops ####
# Quite often we will want to run a piece of code repeatedly.
# If we want a set of operations to be repeated several times we use what's known as a loop.
# When you create a loop, R will execute the instructions in the loop a specified number of times or until a specified condition is met.
# There are 2 main types of loops we will be using: a **for loop** and a **while loop**.

# A for loop is useful when you want to repeat a task many times. Let's look at the following example.
# The code cycles through a sequence of numbers, 1 to 10. In each step, an index i takes a value from this sequence and gets printed out.
# In the first pass, i=1 and the loop will print out the number 1, and so on.
for (i in 1:10){
  print(i)
}
# If it is a one-liner you can write:
for (i in 1:10) print(i)
# Everything in the loop is done 'quietly', i.e. the output is suppressed. We have forced the output to be printed with the
# print() function
for (i in 1:10) i
# See, nothing was printed.

# A variation where we store values. We prepare an empty object:
v = c()
v
# Now we fill the object with squares of the numbers from 1 to 10
for (i in 1:10) v[i] <- i^2
v

# The code that is performed inside the loop can be much more complicated and as long as you want.
for (i in 1:10){
  j=i*2
  print(j)
}
# And the numbers that we iterate/loop/cycle over do not need to start with the number 1.
# In fact we can define them in various ways.
for (i in seq(5,15)){
  print(i)
}

# Let's combine loops with if statements. Try running the following example.
# Can you describe what is happening inside the loop?
for (i in 1:30){
  if((i %% 2) == 0) {
    print(paste(i,"is even"))
  } else {
    print(paste(i,"is odd"))
  }
}
# Hint: '%%' is the modulo operator: x %% y returns the remainder after dividing x by y.

#### While loops ####
# The second type of loops - **while loops** - are useful when we want to run a piece of code until a specified condition is met.
# For instance, in the following chunk of code, we want to print i while it is smaller than 10.
# Once i=10, the loop stops on its own
i=1
while (i < 10) {
  print(i)
  i=i+1
}

#### Functions ####
# Previously, we introduced how to work with the functions that R has pre-built.
# These were for example: mean(), print(), sort(), boxplot(), summary(), quantile(), which() and so on.
# However, we will often want to define our own custom functions.
# The following example defines a very simple function. Examine its structure and try to
# describe what the function does, then try to use this new function.
# Define
my_function <- function(x){
  x^2/2
}
# Use
my_function(3)
my_function(2)
my_function(5)
# What happens if I try
my_function(x=c(1,2,3)) # Interesting - it is applied to every element
# What happens if I try
my_function(x='dunno') # I cannot fool it... (this is an error, text cannot be squared)
my_function(y=5) # This is an error too: the function expects an argument called 'x'. If I do not name the argument, it assumes that the first argument is 'x', but I cannot use a different name.

# We can create a function with multiple arguments and pre-determined default values
my_function <- function(x=5,powerof=3){
  x^powerof/2
}
# This will overwrite the previous function
my_function(x=c(1,2,3)) # It took each 'x' to the power of 3 (and divided by 2).
my_function(x=c(1,2,3),powerof=c(1,3)) # Here I get a warning: the shorter vector 'powerof' is recycled, so the powers used are 1, 3 and 1.

# Try defining a new function that returns:
# - the sum of all values in a vector, if the vector has at least 5 values,
# - or the mean of all values if there are fewer than 5 values.
# Store the result into a variable y and **return** the result.
# Define
my_func <- function(x){
  if (length(x) >= 5){
    y = sum(x)
  } else {
    y = mean(x)
  }
  return(y)
}
# Use
x=1:14
my_func(x)

#############################################################
############### Data structures - dataframes ################
#############################################################

# So far, we have only worked with vectors, but let's be honest, that won't be enough.
# So let's move on to some more complex data structures.
# First up, data frames - they look like tables and act something like an Excel spreadsheet.

#### Constructing data frames (something like an excel spreadsheet) ####
# Let's construct our first dataframe from three vectors.
vec1 = rep(c("A","B","C","D","E","F"), 10)
vec2 = rnorm(n=60, mean=10, sd=5)
vec3 = rep(c("TRUE","FALSE"),30) # note: because of the quotes, these are character strings, not logical values
DT=data.frame("grades"=vec1, "points"=vec2, "logicvec"=vec3)

# Explore the dataframe. There are 2 ways of viewing the dataframe:
# 1. Display here
DT
# 2. Open in a second window
View(DT)
# We can have a look at the first rows,
head(DT)
# or have a look at the bottom of the table - at the last rows.
tail(DT)
# What are the dimensions?
dim(DT)
# Number of rows
nrow(DT)
# Number of columns
ncol(DT)
# What are the column names and row names?
names(DT)
colnames(DT)
rownames(DT)
# If you do not like the rownames, you can change them simply by assigning a new vector of names.
rownames(DT)=paste("row", 1:60)
rownames(DT)

#### Accessing elements ####
# Apart from displaying the whole table and looking at the head() and tail(),
# there are a few ways of accessing items in the dataframe. You can do so:
# 1. By column names
# *DT$columnname*
DT$grades
DT$points
DT$logicvec
# 2. By indices
# *DT[row,column]*
# Look at a single row
DT[2,]
# Look at a single column
DT[,2]
# Look at the value in the first row and first column
DT[1,1]
DT["row 1","grades"]
# Look at multiple rows and multiple columns
DT[c(3,5),1:2]
# Or combining the two:
DT$grades[1]

#### Setting new values ####
# Instead of just accessing part of the data frame, you can also set new values.
# 1. By names
# Define a new column called empty and assign empty values (NAs)
DT$empty=NA
DT
# Define a new column points2 by multiplying values from column points by 2.
DT$points2=DT$points*2
DT
# Define a new column diff by subtracting one column from the other
DT$diff=DT$points2 - DT$points
DT
View(DT)
# 2. By indices
# Rewrite the 6th column (remember DT[rows,columns]!) and assign 60 random numbers.
DT[,6]=rnorm(n=60, mean=0, sd=1) # Do not worry about the rnorm() function for now. You can access the details via ?rnorm, but we just needed some numbers.
# Rewrite the full 60th row with the number 1.
DT[60,]=1
# 3. With logical operators
# Select only rows where the column "grades" contains the letter "A". Store this section of the dataframe as a new dataframe.
nerds = DT[DT$grades == "A",]
# Select only rows where the column "points" contains the number 10 or more. Store this section of the dataframe as a new dataframe.
top = DT[DT$points >= 10,]
# 4. Subsetting
# Select only rows where the column "points" contains the number 10 or more. This time use the function subset(), which also allows you
# to select specific columns. Choose columns points and diff and store the result into a new dataframe.
topdiff = subset(DT, points>= 10, select = c(points,diff))
# I personally rarely use subset(), but others may prefer it. I would use conditions.
DT[DT$points>=10,c('points','diff')]
# It is the same. The fact that you can achieve the same result with different strategies means that there are different coding styles.

#### Combine dataframes ####
# We already know how to subset (select) parts of the dataframe, now let's try to merge them back together.
# Function rbind() takes 2 dataframes with an equal number of columns and stacks the 2 dataframes on top of each other.
rbind(nerds,top)
# Function cbind() takes 2 dataframes with an equal number of rows and binds the 2 dataframes next to each other.
cbind(top,topdiff)

#############################################################
######## Working directory, Save and load files ############
#############################################################

# At the beginning of each lesson/analysis, we will need to load a dataset into the environment.
# And at the end, we will want to save the modified dataset and the results of our analysis to a folder on our computer again.
# There are various formats R can work with, let's go over a few.

#### CSV ####
# Save this dataset as a csv to a folder called "data" (the folder has to exist inside your working directory).
# Reminder: always make sure you are in the correct folder (often referred to as the working directory).
# getwd() will print out your current working directory
getwd()
# this is how you can "write it down"
my_wd=getwd()
# Changing the working directory
setwd("./data")
# Save a CSV file (which can be opened in Excel) in the current directory
write.csv(x = DT, file = "DT.csv")
# set the original directory
setwd(my_wd)

# Load this dataset back to a new object.
setwd("./data")
newDT=read.csv("DT.csv")
setwd(my_wd)

#### RData ####
# Save this dataset as an RData file to a folder called "data".
setwd("./data")
save(DT, file="DT.RData")
load("DT.RData")
setwd(my_wd)

#### RDS ####
# A different data format
setwd("./data")
saveRDS(DT, file="DT")
DT=readRDS("DT")
setwd(my_wd)

#############################################################
################ Data structures - LISTS ###################
#############################################################

# Vectors and any other data types and data structures can be combined into lists.
# Lists can be named and nested.
mylist = list("grades"=vec1,
              "points"=vec2,
              "logicvec"=vec3,
              "nerds"=nerds,
              "gradelist"=list("A"=1,"B"=2,"C"=3, "D"=4, "E"=5, "F"=6))
mylist
class(mylist)
str(mylist)

#### Accessing lists ####
# Just as you can access elements of dataframes, you can access parts of lists.
# Here are a few ways you can do it:
mylist[[1]] # by index
mylist$logicvec # by name
mylist[[4]][1,1] # first row & first column in the fourth list element
mylist$nerds[1,1]
mylist[[5]] # fifth element is a list within a list
mylist[[5]][[1]]
length(mylist[[2]])
mean(mylist[[2]])

#############################################################
############## Data structures - MATRICES ##################
#############################################################

# Matrices are very similar to data frames. They are essentially data frames in which all columns/rows have the same class,
# preferably numeric.
# In addition, you can do basic matrix operations with them.
# I use matrices a lot!
# Let's start by defining two matrices, A & B.
A=matrix(1:10,nrow=2)
B=matrix(11:20,ncol=5)
A
B
# You can define rownames() and colnames()
rownames(A) = paste('row',1:dim(A)[1],sep='')
A
# What was that? Let's take a look, as paste() is a very useful function. It is essentially something like CONCATENATE in MS Excel,
# i.e. it combines multiple strings into one.
dim(A)[1]
# That was the number of rows. dim() is also a very useful function. It tells you the dimensions of the object
dim(A)
dim(A)[2]
1:dim(A)[1]
# So paste('row',1:dim(A)[1],sep='') will take 'row', attach the numbers 1 and 2 to it and create strings out of them. The option
# sep='' stands for separator, i.e. what should be placed between 'row' and the numbers 1 or 2. Here it is nothing. Change to:
paste('row',1:dim(A)[1],sep='_')
# To see the difference!

# You can test whether they are in fact of the class matrix.
is.matrix(A)
is.matrix(B)
# And here are some examples of how you can calculate with matrices.
A+B
A-B
A*B
3*A
A%*%B # This is an error: check the conditions for matrix multiplication (linear algebra). Do not worry, we will not do that.
A %*% t(B)
C = t(A) %*% B
C
# Dataframes can even be directly converted into matrices.
topdiff.m=as.matrix(topdiff)
topdiff.m
# Or you can create a matrix from two vectors. Remember the rbind() and cbind() functions?
k=c(1,8,6,3,11)
l=c(9,7,6,2,2)
m=rbind(k,l)
m
n=cbind(k,l)
n
is.matrix(n)

# Ok, if you survived this, you are quite ready for part 2.
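
# Optional recap (a sketch added as an illustration, not part of the original exercises):
# A %*% B in the matrix section fails because both matrices are 2 x 5. Matrix multiplication
# needs the number of columns of the left matrix to equal the number of rows of the right one.
# The helper name 'can_multiply' is made up for this example.
can_multiply <- function(X, Y){
  ncol(X) == nrow(Y)
}
A=matrix(1:10,nrow=2)    # repeated here so that the snippet runs on its own
B=matrix(11:20,ncol=5)
can_multiply(A, B)       # FALSE: 5 columns on the left, 2 rows on the right
can_multiply(A, t(B))    # TRUE: 5 columns on the left, 5 rows on the right
dim(A %*% t(B))          # a 2 x 2 matrix
dim(t(A) %*% B)          # a 5 x 5 matrix
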