---
title: "Homework"
output:
  html_notebook: default
  pdf_document: default
  html_document:
    df_print: paged
---

Goal: Practice & recap R basics

Course: AI in Finance

Date:

UCO:

Student name:

------------------------------------------------------

1. Use R to calculate $\sqrt{\ln(8)} / (4 \cdot (7 - 2.5))$.

```{r}

```

------------------------------

2. What is the absolute value of -453? (Hint: try using the abs() function)

```{r}

```

------------------------------

3. Assign the value 5 to an object x, then print out x and x^2.

```{r}

```

------------------------------

4. Assign the value 3 to an object y, multiply x and y, and store the result in a new object z. Print out the object z.

```{r}

```

------------------------------

5. Create a matrix with 2 rows and 3 columns that contains the following characters: 'c','b','a','a','a','b'. Simply modify the following code:

```{r}
M = matrix( c(), nrow = , ncol = , byrow = TRUE)
```

- Now print the matrix.
- Remove the matrix from the environment. (HINT: use function rm())

```{r}

```

------------------------------

6. Define and print out these 3 string (plain text) variables:

- clara = "clara is a very tall person"
- mary = "mary likes running"
- thomas = "thomas is a great coworker"

Combine these three variables into one vector called "friends". What is the class of the vector "friends"?

```{r}

```

------------------------------

7. Fill out the following list with your personal details, or some fictional ones. If something does not apply to you (for instance, you do not have any sisters), leave that element of the list empty (like so: sisters=c() ).

```{r}
my_family <- list(
  mother = "Rose",
  father = "John",
  sisters = c("Kate", "Monica"),
  sister_age = c(12, 22),
  brothers = c("Jake"),
  brother_age = c(5)
)
```

a) Print this list.
b) How long is the list?
c) Print the first element of the list. (Hint: either use my_family[[1]] or my_family[1]. Can you see the difference in the output?)
d) Print the second and the third element of the list.
e) Print out the elements "brothers" and "sisters" by name. Now store these two elements in a new list called "siblings".
f) What is the class of the object my_family? And what is its structure? (Hint: try using the function str())

```{r}

```

------------------------------

8. First, create some vectors:

- Create a sequence of whole numbers from 1 to 10 and store it in the vector "vec1".
- Create a second sequence of whole numbers that ranges from 2 to 11. Store it in the vector "vec2".
- Create a third vector "vec3" that contains ten zeros.
- Combine all 3 vectors into one long vector "longvec".
- Subtract "vec1" from "vec2", so that vecdiff = vec2 - vec1.

```{r}

```

a) What are the minimum and maximum of vec1?
b) What is the length of vec3?
c) What is the mean of vec2?
d) What is the standard deviation of longvec?
e) Sort vecdiff from the largest number to the smallest number.
f) Try printing out the cumulative sum of vecdiff.
g) Summarize all five vectors (vec1, vec2, vec3, longvec, vecdiff).
h) Create a dataframe "DF" from the vectors vec1, vec2 and vec3. Print and summarize this dataframe.

```{r}

```

------------------------------

9. Suppose you write down your daily expenses for lunch at the faculty cafeteria for a whole month (20 work days). You have written down these numbers:

120, 99, 88, 69, 110, 129, 86, 37, 116, 98, 77, 122, 114, 81, 102, 132, 73, 123, 95, 201 (all in CZK)

a) How much did you spend on lunches this month?
b) What was the most expensive lunch?
c) How many times did you pay more than 90 CZK?
d) How much money did you spend on Fridays (the 5th, 15th and 20th days)? (Hint: try using these [ ] brackets.)
e) Make a boxplot and a histogram from these numbers. What did you find out from the graphs? Is the dataset skewed? Are there any outliers? Is the dataset symmetric?

```{r}

```

------------------------------

10. Say that we have a vector of numbers x=c(5,1,7,9,10,2,4,5,6,6).
Use R to calculate:

a) $(X_1 + X_2 + \dots + X_{10})/10$
b) Find $(X_i - 5.2)/1.438$ for each i. (Apply to all elements at once.)
c) Find the difference between the largest and smallest value of x.
d) Are there any missing values in the vector?
e) Print out the fifth element of x.
f) Print out all elements except the fifth one.
g) Print out all values of x greater than or equal to 4.
h) Convert this vector to character and back to numeric.

```{r}

```

------------------------------

11. Your birthday is coming up and you want to find out who is coming to your birthday party. You ask your friends and write down their Yes/No answers. You have written down:

answ=c("Yes","No","No","Yes","No","No","Yes","Yes","No","Yes","Yes","Yes","No", "Yes", NA)

*NA means that the friend could not yet decide or did not respond to your message.*

a) How many people are coming to your party? (Try using the function table().)
b) Try converting the character vector into a factor vector.
c) What happened to the NA? Did it influence our results? Try removing the NA value.

```{r}

```

------------------------------

12. Generate 100 random numbers from a normal distribution and assign them to a vector. Generate the vector twice and create a histogram each time. Do you get the same histogram?

```{r}

```

------------------------------

13. Load the dataset "EuStockMarkets" from library "datasets". This dataset includes: Daily Closing Prices of Major European Stock Indices, 1991-1998.

```{r}
# "datasets" ships with base R, so it does not need to be installed
library("datasets")
data(EuStockMarkets)
DT=as.data.frame(EuStockMarkets)
```

a) Print a section of this dataset: rows 235 to 290 and columns 1 to 3.
b) Subset all rows in which the price of DAX went over 2700. (HINT: use function subset())

```{r}

```

c) Subset all rows in which the price of CAC is higher than 1870 and lower than or equal to 2228.

```{r}
# Complete this code:
# 1. with [ ] brackets
DT_CAC <- DT[(DT$CAC >  & DT$CAC <=  ),]
# 2. with function subset()
DT_CAC_S <- subset(DT, CAC>  & CAC <=  )
# 3. with function filter()
install.packages("dplyr")
library(dplyr)
DT_CAC_F <- filter(DT, DT$CAC >  & DT$CAC <=  )
```

d) Complete the following code to create a dummy variable. Assign 1 to a new column DT$DUM_C if the price of CAC is higher than 1870 and lower than or equal to 2228, and assign 0 otherwise. Repeat the same for the prices of FTSE: assign 1 to a new column DT$DUM_F if the price of FTSE is higher than 2843 and lower than 3500, and assign 0 otherwise.

```{r}
DT$DUM_C <- ifelse((DT$CAC >  & DT$CAC <=  ),1,0)
DT$DUM_F <- 
```

e) Filter only the rows in which the price of FTSE is equal to 2840 or lower.
f) Sort the prices of CAC from highest to lowest and print only the first 100 values.

```{r}

```

g) Take the first half of this dataset (subset based on the number of rows) and convert it into a matrix called A. Take the second half of the dataset and convert it into a matrix called B.
h) Multiply matrices A and B.
i) Transpose matrix B and assign the result to a new matrix C.
j) Perform a matrix multiplication (A %*% C) and assign the result to matrix D.

```{r}

```

------------------------------

14. Say that k=10 and l=45.

a) Complete the following code so that it prints the phrase "Good morning" if l is greater than k:

```{r}
if (){
  print(" ")
}
```

b) Complete the following code so that it prints the phrase "Good morning" if l is greater than k, and prints the phrase "Good evening" otherwise.

```{r}
if (){
  print(" ")
} else {

}
```

------------------------------

15. We know that i=11 and j=23.

a) Write your own code which will: print the result of i-j if i is greater than j, print j-i if j is greater than i, and print the message "i and j are the same" if i equals j.
b) Rewrite this code into a function whose input arguments are i and j.
c) Remove the variables i and j from the environment.

```{r}

```

------------------------------

16.
a) Complete the following code so that it prints i as long as i is less than 10.

```{r}
i <- 1
 (i < 10) {
  print(i)
  i <- i + 1
}
```

b) Take the same code and modify it so that the loop stops when i is equal to 7.
c) Take the same code and modify it to skip the iteration when i equals 8.

```{r}

```

------------------------------

17. This morning you went to the farmers market and bought a basket of apples. Some were green, some yellow, others red.

basket=c("green","green","red","yellow","red", "red", "green", "yellow", "red", "red", "red", "yellow")

a) Define this vector and, using the function table(), find out how many apples of each color you bought.
b) Loop through this "basket" vector and print each item in the vector one by one.

```{r}

```

------------------------------

18. Here is a dataset:

```{r}
DF=data.frame(U=rnorm(n=10, mean=1220, sd=150),
              V=rnorm(n=10, mean=1220, sd=100),
              W=rnorm(n=10, mean=1220, sd=200),
              X=rnorm(n=10, mean=1220, sd=200))
```

a) Add a column "Y" which is equal to 1 if O>C and 0 otherwise. (HINT: use function ifelse())
b) Finish the following code to loop over the rows of DF and print their average.

```{r}
for (i in 1:nrow( )){
  print(mean(DF[ , ]))
}
```

c) Change the code to loop over the columns instead and print their sum.

```{r}

```

------------------------------

19. Define a function that returns the median of all values in a vector if the vector has at least 6 values, or the mean of all values if there are fewer than 6 values.

```{r}

```

------------------------------

20. Define a function with vectors x and y as input arguments. First, check whether the lengths of these vectors are the same. If they are not, return a warning message "Vectors have differing length" and end the function. If they do have the same length, draw a scatter plot of x and y.

```{r}

```

------------------------------

21. Print the phrase "R is easy" 21 times.

```{r}

```

------------------------------

22. Load the dataset "UScereal" from library "MASS".
```{r}
install.packages("MASS")
library("MASS")
data(UScereal)
DT=as.data.frame(UScereal)
```

This is a dataset with Nutritional and Marketing Information on US Cereals. The data come from the 1993 ASA Statistical Graphics Exposition, and are taken from the mandatory F&DA food label. The data have been normalized here to a portion of one American cup.

a) Explore this dataset using functions we have used before.

- explore the first 6 and last 6 rows,
- summarize the dataset,
- use the boxplot function to display all variables at once (Hint: boxplot(DT)),
- make sure there are no missing values.

```{r}

```

b) Explore correlations between columns. Hint: omit columns that are not numeric.

- Calculate the correlations and store them in an object x.
- Try using the following code to display the correlation matrix in a graph:

```{r}
install.packages("DescTools")
# to visualize pairwise correlation matrix
library(DescTools)
PlotCorr(x)
```

- Try using ggpairs(DT[,-c(1,11)]) from library(GGally) to prepare a nice graph.

```{r}

```

c) Convert the variable shelf into a factor.

```{r}

```

d) Here is a loop that produces a plot() of each numeric variable in the dataset. It also changes the title of the graph to include the name of the variable. Change the code so that:

- the y-axis label is the name of the variable,
- the color of the points is red,
- the symbol of the points in the plot is different (choose any you like),
- the size of the points is increased,
- the title size is 2,
- the X-axis and Y-axis label size is 3,
- the axis tick label size is 0.5,
- the color of the title is blue.

```{r}
for (i in c(2:8,10)){
  plot(DT[,i],main=paste("Graph of",colnames(DT)[i]))
}
```

e) Use a loop similar to the one in exercise 22 d) to create histograms and boxplots for all numeric columns.

```{r}

```

f) Let's explore the relationship between calories and sugars in US cereals. First, let's use an xyplot scatterplot from the "lattice" library.
Can you modify the xyplot so that it is separated by the variable shelf? Did you find any differences? Look up the dataset UScereal in the help page. What does the variable shelf stand for?

```{r}

```

Great, now let's create the same scatterplot of calories against sugars, based on the shelf, using the function ggplot from library "ggplot2", and customize it a little bit. You can start with the following code:

```{r}
library(ggplot2)
g1 = ggplot(data=DT, aes(x=calories, y=sugars,color=shelf)) +
  geom_point()
```

Try changing the graph:

- change the shape of the points to 23 and increase the size of the points to 2,
- change the color of the points to purple,
- change the theme of the graph to minimal,
- add a title to the graph and change the axis description text,
- make the axis description text bigger and bold,
- increase the size of the numbers on the axes,
- use the function scale_color_manual(values = c()) to change the colors of the points to: "tomato3", "goldenrod", and "skyblue2".

------------------------------

23. Load the dataset "trees" from library "datasets".

```{r}
# "datasets" ships with base R, so it does not need to be installed
library("datasets")
data(trees)
DT=trees
```

This data set provides measurements of the girth, height and volume of timber in 31 felled black cherry trees. Note that girth is the diameter of the tree (in inches) measured at 4 ft 6 in above the ground.

a) Examine the first 10 rows of the dataset.
b) Print the column names.
c) What is in the first column?
d) Rename the first column from "Girth" to "Diameter".
e) What are the dimensions (number of rows and columns)?
f) Summarize this dataset.
g) Make sure there are no missing values.
h) Write a for loop in which you loop through the columns of this dataset and make a histogram of each column variable.
i) Are these columns correlated? Is there any relationship between the diameter, height and volume of timber in these trees?

- Try using the ggpairs() function from library(GGally) to prepare a nice graph.
- Choose the pair of variables with the highest correlation and examine their relationship. You can use any type of graph or statistical test that we used in class, or any other that you find.

j) Given what you have found out about the dataset, propose and fit a linear model that would explain the variable Volume. As a bonus, you can create a scatter plot and add a regression line.
k) Final challenge with this dataset: try improving the fit of the linear model. For example, you can log-transform your variables, multiply variables, or try anything else you can think of.
l) Save this dataset as a csv to a folder called "data".

```{r}

```
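
Hint for tasks 23 j) and l): the following is only a rough sketch of the general workflow (fit a linear model, add a regression line to a scatter plot, save a data frame as a csv). It deliberately uses the built-in `cars` dataset instead of `trees`, so the exercise itself stays open. The object names (`fit`, `r2`, `out_dir`, `out_file`) and the use of `tempdir()` are illustrative assumptions, not part of the assignment.

```r
# Illustrative sketch on the built-in `cars` dataset (NOT `trees`).
# All object names below are hypothetical and chosen for this example only.
data(cars)

# Fit a simple linear model: stopping distance explained by speed.
fit <- lm(dist ~ speed, data = cars)
print(coef(fit))                 # intercept and slope
r2 <- summary(fit)$r.squared     # share of variance explained by the model

# Scatter plot of the two variables with the fitted regression line.
plot(cars$speed, cars$dist, xlab = "speed", ylab = "dist")
abline(fit, col = "red")

# Save the data frame as a csv inside a folder called "data".
# tempdir() is an assumption made here to avoid writing into the working
# directory; in the homework you would create "data" in your project folder.
out_dir <- file.path(tempdir(), "data")
dir.create(out_dir, showWarnings = FALSE)
out_file <- file.path(out_dir, "cars.csv")
write.csv(cars, out_file, row.names = FALSE)
```

For task k), one possible way to compare models is to look at `summary(fit)$r.squared` before and after transforming the variables.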