An Introduction to the Fundamentals & Functionality of the R Programming Language
Part II: The Nuts & Bolts

Theresa A Scott, MS
Biostatistician III
Department of Biostatistics
Vanderbilt University
theresa.scott@vanderbilt.edu

Table of Contents

1 Working with Data Structures
    1.1 Vectors
        Creation
        Attributes
        Subsetting
        Manipulation
    1.2 Matrices & Arrays
        Creation
        Attributes
        Subsetting
        Manipulation
    1.3 Data frames & Lists
        Creation
        Attributes
        Subsetting
        Manipulation
    1.4 Data Export
2 Working with R
    Object Management
    Customizing R Sessions
    Conditional evaluation of expressions
    Repetitive evaluation of expressions
    The Family of apply() Functions
    Writing Your Own Functions
3 Catalog of Functions, Operators, & Constants
4 R Graphics Reference
    High-level plotting functions
    The par() function
    Points
    Lines
    Text
    Color
    Axes
    Plotting regions and Margins
    Multiple plots
    Overlaying output
    Other additions
    Graphical Output

Preface

We will be using the PBC data set that we introduced in the first document and some functions from the Hmisc add-on package. Therefore, make sure you load the Hmisc package using the library() function, change to (i.e., set) the correct working directory, read in our pbc data set, and make changes to the pbc data frame using the Hmisc package's upData() function. The code to do so is given in the Scott.IntroToR.II.R code file. The 'contents' of the resulting updated pbc data frame should be the same as the output of the contents() function invocation below.

> contents(pbc)

Data frame:pbc    100 observations and 14 variables    Maximum # NAs:32

         Labels                      Units Levels Storage NAs
id                                                integer   0
fudays   Follow Up                   days         integer   0
status   Original Survival Status         3       integer   0
drug     Treatment                        2       integer  25
age      Age                         days         integer   0
sex      Gender                           2       integer   0
ascites  Presence of Ascites              2       integer  25
bili     Serum Bilirubin             mg/dL        double    0
chol     Serum Cholesterol           mg/dL        integer  32
album    Serum Albumin               mg/dL        double    0
stage    Histologic stage of disease      4       integer   2
ageyrs   Age                         years        double    0
fuyrs    Follow Up                   years        double    0
censored Collapsed Survival Status        2       integer   0

+--------+---------------------------------------------+
|Variable|Levels                                       |
+--------+---------------------------------------------+
|status  |Censored,Censored due to liver treatment,Dead|
+--------+---------------------------------------------+
|drug    |D-penicillamine,Placebo                      |
+--------+---------------------------------------------+
|sex     |Female,Male                                  |
+--------+---------------------------------------------+
|ascites |No,Yes                                       |
+--------+---------------------------------------------+
|stage   |1,2,3,4                                      |
+--------+---------------------------------------------+
|censored|Censored,Dead                                |
+--------+---------------------------------------------+

Like the first document, all R code has been extracted from this document and saved in the Scott.IntroToR.II.R text file.

Chapter 1
Working with Data Structures

Learning objective
To understand how to generate, manipulate, and use R's various data structures.

1.1 Vectors

DEFINITION: As mentioned in the first document, the vector is the simplest data structure in R.
For example, a single value in R (e.g., the logical value TRUE or the numeric value 2) is actually just a vector of length 1. Vectors are one dimensional and consist of an ordered collection of elements. All elements of a vector must be of the same data type, or mode - numeric (an amalgam of integer and double precision mode), complex (numeric value followed by an i), logical, character, or raw (intended to hold raw bytes). However, vectors can include missing elements, designated with the NA value.

CREATION: We demonstrated in the first document how vectors can be created using the c() (concatenate) and seq() (sequence) functions. Recall, the seq() function constructs a numeric vector, while the c() function can be used to generate all kinds of vectors. There are many other functions that will construct a vector, including the rep() (replicate), sample(), and paste() functions. The rep() function replicates the specified values to generate a vector. The sample() function generates a vector by taking a random sample of the specified values. And the paste() function generates character vectors by concatenating the specified elements. The following are examples of these three functions:

> rep(1:4, times = c(2, 4, 3, 1))
[1] 1 1 2 2 2 2 3 3 3 4
> sample(c("M", "F"), size = 10, replace = TRUE)
[1] "F" "F" "M" "F" "F" "M" "M" "F" "M" "M"
> paste("Treatment", c("A", "B", "C"))
[1] "Treatment A" "Treatment B" "Treatment C"

Note, the set.seed() function is often used in conjunction with the sample() function in order to (re)generate the same random sample.

There are also several functions in addition to the factor() function that will generate factors. These are the gl(), cut(), and interaction() functions. The gl() function generates a factor using the specified pattern of its levels. The cut() function generates a factor by dividing the range of a numeric vector into intervals and coding the numeric values according to which interval they fall into. And, as its name implies, the interaction() function generates a factor that represents the interaction of the specified levels. The following are examples of these three functions. You'll notice that we use the levels() function in conjunction with the interaction() function to build a factor from the levels of two existing factors.

> gl(n = 2, k = 8, labels = c("Control", "Treatment"))
 [1] Control   Control   Control   Control   Control   Control   Control   Control
 [9] Treatment Treatment Treatment Treatment Treatment Treatment Treatment Treatment
Levels: Control Treatment
> head(cut(pbc$ageyrs, breaks = 4))
[1] (54.5,66.5] (30.5,42.5] (54.5,66.5] (42.5,54.5] (30.5,42.5] (54.5,66.5]
Levels: (30.5,42.5] (42.5,54.5] (54.5,66.5] (66.5,78.5]
> interaction(levels(pbc$drug), levels(pbc$censored))
[1] D-penicillamine.Censored Placebo.Dead
4 Levels: D-penicillamine.Censored Placebo.Censored ... Placebo.Dead

It is also worthwhile to mention character vectors in a little more detail. Specifically, single or double quotes can be embedded in character strings. For example,

> "Single 'quotes' can be embedded in double quotes"
[1] "Single 'quotes' can be embedded in double quotes"

Similarly, double quotes can be embedded in single-quoted character strings (e.g., 'Example of "double" inside single'), but when the string is printed the enclosing single quotes are converted to double quotes and the embedded double quotes are converted to \". For example, compare the code specified in the Scott.IntroToR.II.R file for the following expression and that shown as output.

> "Double \"quotes\" can be embedded in single quotes"
[1] "Double \"quotes\" can be embedded in single quotes"

In actuality, in order to embed double quotes within a character string specified using double quotes, you must escape them using a backslash, \.
For example,

> "Embedded double quotes, \", must be escaped"
[1] "Embedded double quotes, \", must be escaped"

This \" character specification will make more sense when we discuss functions like cat() and write.table() in the 'Data Export' section.

ATTRIBUTES: All vectors have a length, which is the number of elements in the vector and is calculated using the length() function. For example,

> length(pbc$ageyrs)
[1] 100

However, the length() function counts missing elements (NAs) along with the non-missing ones. For example, we know that the ascites column of our pbc data frame contains 25 missing values, yet

> length(pbc$ascites)
[1] 100

To calculate the number of non-missing elements of a vector, we take advantage of a quirk of the logical values, TRUE and FALSE. Specifically, TRUE and FALSE are coerced to 1 and 0, respectively, when used in arithmetic expressions. So, we can calculate the number of non-missing elements of a vector using

> sum(!is.na(pbc$ascites))
[1] 75

An alternative is to tabulate the result of the is.na() function using the table() function, as demonstrated in the first document:

> table(!is.na(pbc$ascites))

FALSE  TRUE
   25    75

These same tips can be used with factors.

If the elements of a vector are named, the names() function returns (prints) the associated names. For example,

> (x <- c(Dog = "Spot", Horse = "Penny", Cat = "Stripes"))
      Dog     Horse       Cat
   "Spot"   "Penny" "Stripes"
> names(x)
[1] "Dog"   "Horse" "Cat"

As mentioned in the first document, the levels() function can be used to return (print) the levels of a specified factor. Another useful function is the nlevels() function, which returns (prints) the number of levels of a specified factor. Some examples:

> levels(pbc$stage)
[1] "1" "2" "3" "4"
> nlevels(pbc$stage)
[1] 4

In addition, the mode() function returns (prints) the mode of the specified vector.
Recall, all factors are stored internally as integer vectors - more to come in 'Coercing the mode of a vector'. For example,

> mode(x)
[1] "character"
> mode(pbc$ageyrs)
[1] "numeric"
> mode(pbc$drug)
[1] "numeric"

SUBSETTING: For a vector, we can use single square brackets, [ ], to extract specific subsets of the vector's elements. Specifically, for a vector x, we use the general form x[index], where index is a vector that can take one of the following forms:

1. A vector of positive integers. The values in the index vector normally lie in the set {1, 2, ..., length(x)}. The corresponding elements of the vector are extracted, in that order, to form the result. The index vector can be of any length and the result is of the same length as the index vector. For example, x[6] is the sixth element of x, x[1:10] extracts the first ten elements of x (assuming length(x) >= 10), x[length(x)] extracts the last element of x, and x[c(1, 5, 20)] extracts the 1st, 5th, and 20th elements of x (assuming length(x) >= 20). We can use any of the functions that construct numeric vectors to define the index vector of positive integers, such as the c() (concatenate), rep() (replicate), sample(), and seq() (sequence) functions or the : (colon) operator. NA is returned for any index greater than length(x), and an empty vector (e.g., numeric(0)) is returned if the index vector consists only of 0. To extract the first or last elements, alternatives are the head() and tail() functions, which return (print) the first/last n elements of a vector (n = 6 by default).

2. A vector of negative integers. This specifies the values to be excluded rather than included in the extraction. For example, x[-c(1:5)] extracts all but the first five elements of x - we merely place a negative sign in front of the index vector. As seen, all elements of x except those that are specified in the index vector are extracted, in their original order, to form the result. The length of the result is length(x) minus the number of (distinct) elements in the index vector.

3. A logical vector. In this case, the index vector should be of the same length as x (a shorter index vector is recycled). Values corresponding to TRUE in the index vector are extracted and those corresponding to FALSE are not; an NA in the index vector produces an NA in the result. The logical index vector can be explicitly given (e.g., x[c(TRUE, FALSE, TRUE)]) or can be constructed as a conditional expression using any of the comparison or logic operators - see the vector 'Manipulation' section for more detail. For example,

> set.seed(1)
> (x <- sample(10, size = 10, replace = TRUE))
 [1]  3  4  6 10  3  9 10  7  7  1
> x[x == 4]
[1] 4
> x[x > 2 & x < 5]
[1] 3 4 3

As seen, the length of the result equals the number of TRUE values in the index vector. Recall, if the vector we are trying to subset contains missing values, we need to make sure our logical index vector excludes them using the is.na() function, if desired. For example,

> (x <- c(9, 5, 12, NA, 2, NA, NA, 1))
[1]  9  5 12 NA  2 NA NA  1
> x[x > 2]
[1]  9  5 12 NA NA NA
> x[x > 2 & !is.na(x)]
[1]  9  5 12

An alternative in this situation is the subset() function. You merely specify the vector as its first argument and a conditional expression as its second. An advantage of the subset() function is that missing values are automatically taken as FALSE, so they do not need to be explicitly removed using the is.na() function. For example,

> subset(x, x > 2)
[1]  9  5 12
> subset(x, x > 2 & x < 10)
[1] 9 5

4. A vector of character strings. This possibility applies only when the elements of a vector have names. In that case, a subvector of the names vector may be used in the same way as the positive integers in case 1. That is, the strings in the index vector are matched against the names of the elements of x and the resulting elements are extracted. Alphanumeric names are often easier to remember than numeric indices of elements.
For example,

> (fruit <- c(oranges = 5, bananas = 10, apples = 1, peaches = 30))
oranges bananas  apples peaches
      5      10       1      30
> names(fruit)
[1] "oranges" "bananas" "apples"  "peaches"
> fruit[c("apples", "oranges")]
 apples oranges
      1       5

IMPORTANT: As hinted at in the 'Assignment' section, extracting elements of a data structure, such as a vector, can be done on the left-hand side of an assignment expression (e.g., to select parts of a vector to replace) as well as on the right-hand side. For example, we can replace the missing values of the x vector assigned above with zeros:

> x
[1]  9  5 12 NA  2 NA NA  1
> x[is.na(x)] <- 0
> x
[1]  9  5 12  0  2  0  0  1

Note, the recycling rule, discussed in the vector 'Manipulation' section, was used in x[is.na(x)] <- 0. Specifically, the value of 0 was 'recycled' to fill in the three missing values of x.

MANIPULATION: Because there are several kinds of vectors, there are many ways in which vectors can be manipulated. In this document, we will discuss (1) how the vectorization of functions and the recycling rule affect vector manipulations; (2) the construction and use of conditional expressions; (3) how to coerce the mode of a vector; (4) how to construct and manipulate vectors of dates; and (5) how character vectors can be manipulated using regular expressions.

VECTORIZATION OF FUNCTIONS & THE RECYCLING RULE: Many R functions are vectorized, meaning that they operate on each of the elements of a vector (i.e., element-by-element). For example, the expression log(y) returns a vector that is the same length as y, where each element of the result is the natural logarithm of the corresponding element in y.

> (y <- sample(1:100, size = 10))
 [1] 21 18 68 38 74 48 98 93 35 71
> log(y)
 [1] 3.044522 2.890372 4.219508 3.637586 4.304065 3.871201 4.584967 4.532599 3.555348
[10] 4.262680

Other examples of vectorized functions include the exp() (exponentiation) function, the sqrt() (square root) function, the round() function, and the casefold() function. In addition, any of the arithmetic (mentioned in the first document), comparison, and logic operators (mentioned in the next section) will operate element-by-element. For example, adding two vectors of the same length:

> (x <- sample(1:20, size = 10))
 [1] 19  5 12  3 20  6  1 16 11  4
> x + y
 [1]  40  23  80  41  94  54  99 109  46  75

These operators and other specific functions incorporate what is known as the 'recycling rule'. The rule is that shorter vectors in the expression are replicated (i.e., recycled; re-used) to be the length of the longest vector. The evaluated value of the expression is a vector of the same length as the longest vector which occurs in the expression. The simplest illustration of the recycling rule is when the expression involves a 'constant' (e.g., a single numeric value):

> y + 2
 [1]  23  20  70  40  76  50 100  95  37  73
> y < 50
 [1]  TRUE  TRUE FALSE  TRUE FALSE  TRUE FALSE FALSE  TRUE FALSE

In each of these examples, the constant, which is technically a vector of length 1, is replicated to the length of y. When the length of the shorter vector is greater than 1, the elements of the shorter vector are replicated in order until the result is the proper length. For example,

> (x <- sample(1:20, size = 2))
[1] 10 12
> x + y
 [1]  31  30  78  50  84  60 108 105  45  83

Here, the vector x was replicated 5 times to match the length of y (10 elements). If the length of the longer vector is not a multiple of the length of the shorter one, the shorter vector is fractionally replicated and a warning is given (longer object length is not a multiple of shorter object length).
For example (the warning is given, but not shown in the output),

> x <- 1:9
> y <- 1:10
> x + y
 [1]  2  4  6  8 10 12 14 16 18 11

Here, to match the length of y, only the first element of x (1) was re-used.

CONDITIONAL EXPRESSIONS: When using R, vectors of logical values, TRUE and FALSE, are most often generated by conditional expressions. For example, the expression y <- x > 10 assigns y to be a vector of the same length as x, with the value FALSE corresponding to elements of x where the condition is not met and TRUE where it is. Conditional expressions are often constructed using comparison and logic operators, which are listed in the following table:

Type        Operator  Meaning
Comparison  <         Less than
            >         Greater than
            ==        Equal to
            !=        Not equal to
            >=        Greater than or equal to
            <=        Less than or equal to
Logic       &, &&     And (intersection)
            |, ||     Or (union)
            !         Not (negation)

As their name implies and has been demonstrated, the comparison operators are used to compare two vectors - e.g., x < y. In addition, as mentioned above, the comparison operators are vectorized. Therefore, they compare each element of the first vector with the corresponding element of the second vector, employing the recycling rule when necessary - e.g., comparing a 5 element numeric vector to the value 2. Comparison operators may be applied to vectors of any mode and they return a logical vector of the same length as the (longer of the) input vectors.

In contrast, the logic operators are applied to logical vectors, which are usually the result of comparing two vectors. As shown in the table above, the "And" and "Or" logical operators exist in two forms. The logic operators & and | compare each element of the two specified logical vectors (e.g., x & y) and return a logical vector of the same length as the input vectors. In contrast, && and || return only a single logical value for the outcome. In the version of R used for this document, they evaluate only the first element of each of the two input vectors; note that R 4.3.0 and later instead signal an error when either operand has a length greater than one.
The && and || forms are most often used in control-flow expressions, particularly in if statements - see the 'Conditional Evaluation of Expressions' section of the second chapter. The following example, as run under the version of R used for this document, shows the difference:

> (x <- 1:10)
 [1]  1  2  3  4  5  6  7  8  9 10
> x < 7 & x > 2
 [1] FALSE FALSE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE
> x < 7 && x > 2
[1] FALSE

Note, the expression 2 < x < 7 is an invalid alternative in the example above; the expression must be broken up into two comparison expressions combined with a logic operator.

When using logic operators, it helps to be aware of how various expressions will evaluate:

• TRUE & TRUE = TRUE
• TRUE & FALSE = FALSE
• FALSE & FALSE = FALSE
• TRUE | TRUE = TRUE
• TRUE | FALSE = TRUE
• FALSE | FALSE = FALSE

How various expressions will evaluate becomes a little more tricky when NAs are involved:

• NA & TRUE = NA
• NA | TRUE = TRUE
• NA & FALSE = FALSE
• NA | FALSE = NA

Other useful operators include the %in% and the (Hmisc package's) %nin% value matching operators. These binary operators return a logical vector indicating whether elements of the vector specified as the left operand match any of the values in the vector specified as the right operand. The %nin% operator is the 'negative' of the %in% operator. For example,

> (x <- sample(c("A", "B", "C", "D"), size = 10, replace = TRUE))
 [1] "B" "A" "D" "C" "D" "A" "C" "B" "D" "C"
> x %in% c("A", "C", "D")
 [1] FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE
> x %nin% c("A", "B")
 [1] FALSE FALSE  TRUE  TRUE  TRUE FALSE  TRUE FALSE  TRUE  TRUE

These operators are useful as alternatives to multiple 'Or' statements. For example, x %in% c("A", "C", "D") is equivalent to x == "A" | x == "C" | x == "D". Also, to use the %nin% operator, DON'T FORGET to load the Hmisc package with the library() function (i.e., library(Hmisc)) if you haven't already done so. If you haven't previously installed the package, you must do this first.
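
The equivalence between %in% and a chain of 'Or' comparisons can be checked directly. The following is only a small sketch using a fixed vector (not the random sample shown above) and base R alone; the object names via_in, via_or, and not_in are made up for illustration, and negating %in% with ! is offered as a base R stand-in for the Hmisc %nin% operator.

```r
# Fixed example vector (an assumption for illustration, not the sample above)
x <- c("B", "A", "D", "C", "D")

# %in% and a chain of == comparisons joined by | agree when x has no NAs
via_in <- x %in% c("A", "C", "D")
via_or <- x == "A" | x == "C" | x == "D"
identical(via_in, via_or)

# Base R negation of %in%; same idea as Hmisc's %nin%
not_in <- !(x %in% c("A", "B"))
not_in
```

One difference worth keeping in mind: for a missing element, %in% returns FALSE, whereas a chain of == comparisons returns NA, so the two forms only agree when the left operand has no NAs.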

To 'wholly' compare two vectors, two functions are available: the identical() and all.equal() functions. Some examples:

> x <- 1:3
> y <- 1:3
> x == y
[1] TRUE TRUE TRUE
> identical(x, y)
[1] TRUE
> all.equal(x, y)
[1] TRUE

The identical() function compares the internal representation of the data and returns TRUE if the objects are strictly identical, and FALSE otherwise. On the other hand, the all.equal() function compares the 'near equality' of two objects, and returns TRUE or displays a summary of the differences. The 'near equality' arises from the fact that the all.equal() function allows for the small rounding errors of floating-point arithmetic when comparing numeric values. The comparison of numeric values on a computer is sometimes surprising! For example,

> 0.9 == (1 - 0.1)
[1] TRUE
> identical(0.9, 1 - 0.1)
[1] TRUE
> all.equal(0.9, 1 - 0.1)
[1] TRUE
> 0.9 == (1.1 - 0.2)
[1] FALSE
> identical(0.9, 1.1 - 0.2)
[1] FALSE
> all.equal(0.9, 1.1 - 0.2)
[1] TRUE
> all.equal(0.9, 1.1 - 0.2, tolerance = 1e-16)
[1] "Mean relative difference: 1.233581e-16"

COERCING THE MODE OF A VECTOR: We've mentioned thus far that logical vectors may be used in arithmetic expressions, in which case they are coerced into numeric vectors - FALSE becoming 0 and TRUE becoming 1. For example,

> x <- c(34, 31, 80, 78, 64, 87)
> x > 35
[1] FALSE FALSE  TRUE  TRUE  TRUE  TRUE
> sum(x > 35)
[1] 4
> sum(x > 35)/length(x)
[1] 0.6666667

In fact, it is possible to coerce a vector of any mode to any other mode, even though the result may not be what you wish. These coercions are possible using the many as.X() vector functions in R - there are also many as.X() functions that convert other data structures, but they will only be mentioned in the 'Catalog' chapter. Even though there are many such as.X() functions for coercing vectors, I have found that I primarily use just three of them - the as.numeric(), as.character(), and as.logical() functions, which convert vectors to mode numeric, character, and logical, respectively. Even though the coercions follow intuitive rules, the results of the three functions depend on the mode of the input vector, as summarized in the following table:

Function        Coercion (Input => Output)
as.numeric()    FALSE => 0
                TRUE => 1
                "1", "2", ... => 1, 2, ...
                other characters => NA
as.character()  1, 2, ... => "1", "2", ...
                FALSE => "FALSE"
                TRUE => "TRUE"
as.logical()    0 => FALSE
                other numbers (e.g., 1) => TRUE
                "FALSE", "F" => FALSE
                "TRUE", "T" => TRUE
                other characters => NA

The as.numeric() and as.character() functions are often used when trying to coerce a factor back to a numeric or character vector. When dealing with a factor with character levels, the as.character() function will coerce it back to a character vector. For example,

> (fac <- factor(sample(c("M", "F"), size = 5, replace = TRUE)))
[1] F F F F M
Levels: F M
> as.character(fac)
[1] "F" "F" "F" "F" "M"

This tip is useful considering some functions will return the internal integer vector representation of a factor with character levels instead of the character levels. For example, compare the output of the following two ifelse() function expressions - we will discuss the ifelse() function in more detail in upcoming sections.

> ifelse(fac == "M", NA, fac)
[1]  1  1  1  1 NA
> ifelse(fac == "M", NA, as.character(fac))
[1] "F" "F" "F" "F" NA

Coercing a factor with character levels to a numeric vector with the as.numeric() function returns (prints) the internal integer vector representation of the factor. For example,

> as.numeric(fac)
[1] 1 1 1 1 2

In order to coerce a factor with numeric levels into a numeric vector, keeping the levels as they were originally specified, you have to first coerce the factor into a character vector and then into a numeric vector.
For example,

> (fac <- factor(sample(5:10, size = 10, replace = TRUE), levels = 5:10))
 [1] 7  9  9  7  10 7  6  5  5  6
Levels: 5 6 7 8 9 10
> as.numeric(fac)
 [1] 3 5 5 3 6 3 2 1 1 2
> as.numeric(as.character(fac))
 [1]  7  9  9  7 10  7  6  5  5  6

DATES: We often deal with data in the form of a date and/or time, such as a date of birth, procedure, or death. In turn, we often wish to manipulate these date/time fields to calculate values such as the time to death from baseline or the time between treatment visits. Luckily, there are two main 'datetime' classes in R's base package - Date and POSIX. The Date class supports dates without times. As mentioned in the 'R Help Desk' article of the June 2004 R Newsletter, 'eliminating times simplifies dates substantially since not only are times, themselves, eliminated but the potential complications of time zones and daylight savings time vs. standard time need not be considered either.'

The POSIX class actually refers to the two classes of POSIXct and POSIXlt. It also refers to a POSIXt class, which is the superclass of POSIXct and POSIXlt. Normally we will not access the POSIXt class directly when using R, but any POSIXlt and POSIXct class specific methods of generic functions will be given under POSIXt. The POSIXct and POSIXlt classes support times and dates, including time zones and standard vs. daylight savings time. POSIXct datetimes are represented as the number of seconds since January 1, 1970 GMT, while POSIXlt datetimes are represented by a list ('lt') of 9 components plus an optional timezone attribute.

The Date class and POSIXct/POSIXlt classes have a similar interface, making it easy to move between them. To construct a Date class datetime object, we use the as.Date() function; and to construct a POSIXlt datetime object, we use the strptime() function. To construct a POSIXct datetime object, we coerce the output of the strptime() function using the as.POSIXct() function.
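
As a rough sketch of the three constructors just described, the following builds one object of each class from strings in the default "year-month-day" layout and inspects how each is stored. The object names d, lt, and ct are made up for illustration, and the tz = "GMT" argument is an assumption added here so that the result does not depend on the local time zone.

```r
# Date: a date without a time
d <- as.Date("2007-09-24")

# POSIXlt: list-based datetime built by strptime()
# (tz = "GMT" is an assumption made here to keep the example reproducible)
lt <- strptime("2007-09-24 08:30:45",
               format = "%Y-%m-%d %H:%M:%S", tz = "GMT")

# POSIXct: obtained by coercing the strptime() output
ct <- as.POSIXct(lt)

class(d)   # Date
class(lt)  # POSIXlt, with superclass POSIXt
class(ct)  # POSIXct, with superclass POSIXt

# Underlying representations
as.numeric(d)    # days since January 1, 1970
as.numeric(ct)   # seconds since January 1, 1970 GMT
lt$year + 1900   # POSIXlt components are list elements (year is stored minus 1900)
lt$mon + 1       # months are stored as 0-11
```

Because a POSIXlt object is a list of components, individual pieces such as the year or month can be pulled out with $, whereas Date and POSIXct objects are single numbers with a class attached.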
The first argument to the as.Date() and strptime() functions is a character vector specifying the dates (and possibly times). Remember, by default, the read.table() and data.frame() functions coerce character columns, like date-time columns, to factors. Therefore, make sure you specify that these functions should treat the character datetime columns as character and/or coerce any factor datetime columns back to character columns using the as.character() function. The default format of the date is "year-month-day", such as "2007-09-24". If the variable includes time, then the default format is "year-month-day hour:minutes:seconds", where the hours are specified using a 24-hour clock (i.e., 00-23).

The second argument to these functions is a format= argument, which is used to specify how the dates (and times) are represented in the character string. The format consists of conversion specifications to represent the different portions of the date/time and characters that represent the delimiters between the different date/time portions, like dashes or forward slashes. The following table lists a subset of the conversion specifications the as.Date() and strptime() functions will recognize - see the strptime() function's help file for a complete list.

Portion of date/time field  Conversion specification & description
Year                        %Y  Numeric year with century (e.g., 2007)
                            %y  Numeric year without century (00-99); don't use!
Month                       %m  Numeric month (01-12)
                            %b  Abbreviated month name (first three letters; e.g., Mar)
                            %B  Full month name (e.g., March)
Day                         %d  Numeric day (01-31)
Hours                       %H  Hours from 24-hour clock (00-23)
                            %I  Hours from 12-hour clock (01-12)
                            %p  AM/PM indicator (used in conjunction with %I)
Minutes                     %M  Minutes (00-59)
Seconds                     %S  Seconds (00-61)

Note, leading zeros (as in months and days) are useful to include in the character strings - I often have problems when I don't.
Let's demonstrate these various conversion specifications with some examples - you'll notice the format of the output is consistent:

> as.Date("2007-10-18", format = "%Y-%m-%d")
[1] "2007-10-18"
> as.Date("2007OCT18", format = "%Y%b%d")
[1] "2007-10-18"
> as.Date("October 18, 2007", format = "%B %d, %Y")
[1] "2007-10-18"
> strptime("10/18/2007 08:30:45", format = "%m/%d/%Y %H:%M:%S")
[1] "2007-10-18 08:30:45"
> strptime("10/18/2007 12:30:45 AM", format = "%m/%d/%Y %I:%M:%S %p")
[1] "2007-10-18 00:30:45"

The seq() and cut() functions also have Date and POSIXt class specific methods for generating datetime vectors - see the seq.Date(), seq.POSIXt(), cut.Date(), and cut.POSIXt() help files for examples. The seq() functions are particularly useful when creating a date axis for a plot. In addition, once a datetime vector has been constructed, you can manipulate it in several ways - see the help files for the Date and POSIXt methods of the round(), trunc(), diff(), weekdays(), and format() functions. Also check out ?"+.Date" and ?"+.POSIXt", and the difftime() help file. The following are some examples:

> (x <- seq.Date(from = as.Date("2007-10-18"), to = as.Date("2007-10-30"),
+     by = "3 days"))
[1] "2007-10-18" "2007-10-21" "2007-10-24" "2007-10-27" "2007-10-30"
> x + 10
[1] "2007-10-28" "2007-10-31" "2007-11-03" "2007-11-06" "2007-11-09"
> x > as.Date("2007-10-21")
[1] FALSE FALSE  TRUE  TRUE  TRUE
> x - as.Date(c("2006-01-10", "2007-08-15", "2005-06-24", "2004-12-30",
+     "2005-04-05"))
Time differences in days
[1]  646   67  852 1031  938
> diff(x)
Time differences in days
[1] 3 3 3 3
> weekdays(x)
[1] "Thursday"  "Sunday"    "Wednesday" "Saturday"  "Tuesday"
> format(x, "%Y")
[1] "2007" "2007" "2007" "2007" "2007"
> format(x, "%m/%d/%Y")
[1] "10/18/2007" "10/21/2007" "10/24/2007" "10/27/2007" "10/30/2007"

It is important to mention that, in the version of R used for this document, you will receive an error if you try to define a new column of a data frame as the output of the strptime() function. For example, you would receive an error if you invoked an expression similar to df$visitdate <- with(df, strptime(visit, format = "%Y-%m-%d")), where df is a fictitious data frame with visit being a character column giving the visit dates. This error occurs because, as we mentioned, the strptime() function returns a list, even though it does not appear this way (even if we look at the structure of the output using the str() function). The remedy to this problem is the above-mentioned as.POSIXct() function, which will coerce the strptime() function output to a datetime vector. Therefore, we should modify the example code to: df$visitdate <- with(df, as.POSIXct(strptime(visit, format = "%Y-%m-%d"))).

In addition to the POSIXct and POSIXlt classes to represent dates and times, there is also the chron package, which is an add-on package available through the CRAN website. The chron package provides dates and times, but there are no time zones or notion of daylight vs. standard times. Datetimes in the chron package are represented internally as days since January 1, 1970, with times represented as fractions of days. Even though the chron package includes several functions that allow you to manipulate datetime variables, the format used to specify the dates and times is not as extensive as the conversion specifications of the POSIXct and POSIXlt classes.

MANIPULATING CHARACTER VECTORS WITH REGULAR EXPRESSIONS: Working with data in R often involves text data, which are often called character strings. These character strings may be the values of a column in a data frame or may be the output from a function like the names() function.
In any case, we often want to extract or manipulate elements of character vectors that match a specific pattern of characters. In R, this is possible using the grep() and sub() functions, respectively, and regular expressions. In general, regular expressions are used to match patterns against character strings. In other words, regular expressions are special strings of characters that you create in order to locate matching pieces of text. In order to understand the power of regular expressions, let's work through their use with the grep() function, which searches for matches to a pattern (pattern=, its first argument) within a character vector (x=, its second argument). More information regarding regular expressions is available in the regex help file (i.e., ?regex).

Patterns: As we have indicated, every regular expression contains a pattern, which is used to match the regular expression against a character string. Within a pattern, all characters except ., |, (, ), [, ], {, }, +, \, ^, $, *, and ? match themselves. For example, we could extract the column names of our pbc data frame that contain the character string "age" using the grep() function. Note, we also need to specify value = TRUE in our grep() function expression to return (print) the matching values - by default, the grep() function returns the indices of the matches (value = FALSE).

> names(pbc)
 [1] "id"       "fudays"   "status"   "drug"     "age"      "sex"      "ascites"
 [8] "bili"     "chol"     "album"    "stage"    "ageyrs"   "fuyrs"    "censored"
> grep(pattern = "age", x = names(pbc), value = TRUE)
[1] "age"    "stage"  "ageyrs"

If you want to match one of the special characters mentioned above literally, you have to precede it with two backslashes. For example, we could extract the elements of the following character vector that contain a period.

> (char <- c("id", "patient.age", "date", "baseline_bmi", "follow.up.visit"))
[1] "id"              "patient.age"     "date"            "baseline_bmi"
[5] "follow.up.visit"
> grep(pattern = "\\.", x = char, value = TRUE)
[1] "patient.age"     "follow.up.visit"

Anchors: By default, a regular expression will try to find the first match for the pattern in a string. But what if you want to force a pattern to match only at the start or end of a character string? In this case, the ^ and $ special characters are used to match an empty string at the beginning and end of a line, respectively. So, for example,

> char <- c("this is an option", "or perhaps this", "and don't forget about this one")
> grep(pattern = "this", x = char, value = TRUE)
[1] "this is an option"               "or perhaps this"
[3] "and don't forget about this one"
> grep(pattern = "^this", x = char, value = TRUE)
[1] "this is an option"
> grep(pattern = "this$", x = char, value = TRUE)
[1] "or perhaps this"

Character classes: A character class is a list of characters between brackets, [ ], which matches any single character between the brackets. On the other hand, if the first character of the list is the caret, ^, then the character class matches any character not in the list. So, for example, [aeiou] matches any vowel and [^abc] matches anything except the characters a, b, or c. A range of characters may be specified by giving the first and last characters, separated by a hyphen. So, [0-9] matches any digit, [a-z] matches any lower case letter, [A-Z] matches any upper case letter, [a-zA-Z] matches any alphabetic character, [^a-zA-Z] matches any non-alphabetic character, [a-zA-Z0-9] matches any alphanumeric character, [ \t\n\r\f\v] matches any white space character, and [,.:;!?] matches punctuation. Also, the significance of the special characters, . | ( ) [ ] { } + ^ $ * ?, is turned off inside the brackets. In addition, to include a literal [, place it anywhere in the list; to include a literal ], place it first in the list; to include a literal ^, place it anywhere but first; and to include a literal -, place it first or last.
To match any other special character, except \, inside a character class, place it anywhere. Here are some examples:

> char <- c(" ", "3 times a day")
> grep(pattern = "[a-zA-Z0-9]", x = char, value = TRUE)
[1] "3 times a day"
> grep(pattern = "[^a-zA-Z0-9]", x = char, value = TRUE)
[1] " "             "3 times a day"

Repetition: If r stands for the immediately preceding regular expression within a pattern, then r* matches zero or more occurrences of r; r+ matches one or more occurrences of r; and r? matches zero or one occurrence of r. Additionally, {n} matches the preceding item 'n' times; {n,} matches the preceding item 'n' or more times; and {n,m} matches the preceding item at least 'n' times, but not more than 'm' times. These repetition constructs have a high precedence - they bind only to the immediately preceding regular expression in the pattern. So, "ab+" matches an a followed by one or more b's, not a sequence of ab's. You have to be careful with the * construct too - the pattern "a*" will match any string (i.e., every string has zero or more a's). Here are some examples:

> char <- c("The", "moon is made", "of cheese")
> grep(pattern = " +", x = char, value = TRUE)
[1] "moon is made" "of cheese"
> grep(pattern = "o?o", x = char, value = TRUE)
[1] "moon is made" "of cheese"

Alternation: The vertical bar, |, is a special character because an unescaped vertical bar matches either the regular expression that precedes it or the regular expression that follows it. For example,

> char <- c("red", "ball", "blue", "sky")
> grep(pattern = "d|e", x = char, value = TRUE)
[1] "red"  "blue"
> grep(pattern = "al|lu", x = char, value = TRUE)
[1] "ball" "blue"

Grouping: You can use parentheses to group terms within a regular expression. Everything written within the group is treated as a single regular expression.
For example,

> char <- c("red ball", "blue ball", "red sky", "blue sky")
> grep(pattern = "red", x = char, value = TRUE)
[1] "red ball" "red sky"
> grep(pattern = "(red ball)", x = char, value = TRUE)
[1] "red ball"

We should also mention what the period, ., special character does. An unescaped period matches any single character. In addition, the grep() function is case-sensitive - its ignore.case= argument is FALSE by default. For example,

> char <- c("vit E", "vitamin e")
> grep(pattern = "vit.*E", x = char, value = TRUE)
[1] "vit E"
> grep(pattern = "vit.*E", x = char, value = TRUE, ignore.case = TRUE)
[1] "vit E"     "vitamin e"

As I mentioned above, the sub() (substitute) function also employs regular expressions. The sub() function needs a pattern= and an x= argument like the grep() function, but it also needs a replacement= argument, which specifies the substitution for the matched pattern. For example,

> char <- c("one.period", "two..periods", "three...periods")
> sub(pattern = "\\.+", replacement = ".", x = char)
[1] "one.period"    "two.periods"   "three.periods"

With the sub() function, the parentheses take on an additional ability. Specifically, the parentheses can be used to tag a portion of the match as a 'variable' to return; the first tagged portion is referred to as \\1 in the replacement, the second as \\2, and so on. For example, we can extract just the leading number from each element of the following character vector.

> char <- c("45: Received chemo", "1, Got too sick", "2; Moved to another hospital")
> sub(pattern = "^([0-9]+)[:,;].*$", "\\1", char)
[1] "45" "1"  "2"

Here's another example:

> char <- c("vit E", "vitamin E", "vitamin ESTER-C", "vit E ")
> sub(pattern = "^(vit).*([E]).*$", "\\1 \\2", char)
[1] "vit E" "vit E" "vit E" "vit E"

It is also worthwhile to mention the gsub() function. Unlike the sub() function, which replaces only the first occurrence of a 'pattern', the gsub() function replaces all occurrences.
For example,

> sub(" a", " A", "Capitalizing all words beginning with an a")
[1] "Capitalizing All words beginning with an a"
> gsub(" a", " A", "Capitalizing all words beginning with an a")
[1] "Capitalizing All words beginning with An A"

1.2 Matrices & Arrays

DEFINITION: A matrix is a two-dimensional data structure that consists of rows and columns - think of a vector with dimensions. Like vectors, all the elements of a matrix must be the same data type (all numeric, character, or logical), and can include missing elements designated with the NA value. Taking it one step further, an array is a generalization of a matrix, which allows more than two dimensions. In general, an array is k-dimensional.

CREATION: To create a matrix, we can use the matrix() function, which constructs an nrow x ncol matrix from a supplied vector (data=).

> args(matrix)
function (data = NA, nrow = 1, ncol = 1, byrow = FALSE, dimnames = NULL)
NULL

By default, the matrix is filled by columns (byrow = FALSE), but specifying byrow = TRUE fills the matrix by rows. For example,

> matrix(1:6, ncol = 3)
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6
> matrix(1:6, ncol = 3, byrow = TRUE)
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6

Row and column names can also be specified using the dimnames= argument. The dimnames= argument is specified as a list of two components giving the row and column names, respectively. For example,

> matrix(c(1, 2, 3, 11, 12, 13), nrow = 2, ncol = 3, byrow = TRUE, dimnames = list(c("row1",
+     "row2"), c("C.1", "C.2", "C.3")))
     C.1 C.2 C.3
row1   1   2   3
row2  11  12  13

The cbind() and rbind() functions can also be used to construct a matrix by binding together the data arguments horizontally (column-wise) or vertically (row-wise), respectively. The supplied data arguments can be vectors (of any length) and/or matrices with the same column size (i.e., the same number of rows) or row size (i.e., the same number of columns), respectively.
Any supplied vector is cyclically extended to match the 'length' of the other data arguments if necessary. For example,

> cbind(1:3, 7:9)
     [,1] [,2]
[1,]    1    7
[2,]    2    8
[3,]    3    9
> rbind(1:11, 5:15)
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11]
[1,]    1    2    3    4    5    6    7    8    9    10    11
[2,]    5    6    7    8    9   10   11   12   13    14    15

Row and column names are created by supplying the vectors as named vectors - i.e., VECTORname = vector. For example,

> cbind(col1 = 1:3, col2 = 7:9)
     col1 col2
[1,]    1    7
[2,]    2    8
[3,]    3    9
> rbind(row1 = 1:11, row2 = 5:15)
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11]
row1    1    2    3    4    5    6    7    8    9    10    11
row2    5    6    7    8    9   10   11   12   13    14    15

The array() function can be used to construct an array. With the array() function, two formal arguments must be specified: (1) a vector specifying the elements to fill the array; and (2) a vector specifying the dimensions of the array. For example,

> array(1:24, dim = c(3, 4, 2))
, , 1

     [,1] [,2] [,3] [,4]
[1,]    1    4    7   10
[2,]    2    5    8   11
[3,]    3    6    9   12

, , 2

     [,1] [,2] [,3] [,4]
[1,]   13   16   19   22
[2,]   14   17   20   23
[3,]   15   18   21   24

The array is filled column-wise, and if the data vector is shorter than the number of elements defined by the dimensions vector, the data vector is recycled from its beginning to make up the needed size. Like the matrix() function, dimension names can be added using the dimnames= argument.

ATTRIBUTES: The dim() function will return (print) the dimensions of the specified matrix or array - the number of rows, columns, etc. If the dimensions of a matrix or array are named (rows, columns, etc.), the dimnames() function will return (print) these assigned names. Lastly, because matrices/arrays are similar to vectors in the fact that all of their elements must be the same data type, the mode() function will return the mode of the matrix/array.
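The attribute functions just described can be sketched as follows. This is only an illustration: the matrix m, the array a, and their dimension names are made up here and are not part of the pbc data.

```r
# Hypothetical 2 x 3 matrix with named rows and columns
m <- matrix(1:6, nrow = 2, byrow = TRUE,
            dimnames = list(c("row1", "row2"), c("C.1", "C.2", "C.3")))

dim(m)       # number of rows, then number of columns
dimnames(m)  # list holding the row names and the column names
mode(m)      # single mode shared by every element of the matrix

# The same functions apply to a higher-dimensional array
a <- array(1:24, dim = c(3, 4, 2))
dim(a)
dimnames(a)  # NULL, because no dimension names were assigned
```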
SUBSETTING: Elements (rows and/or columns) of a matrix may be extracted using the single square bracket operators, [ ], by giving two index vectors in the form x[i, j], where i extracts rows and j extracts columns. The index vectors i and j can take any of the four forms shown for vectors. If character vectors are used as indices, they refer to row or column names, as appropriate. And, if either index vector is not specified (i.e., left empty), then the full range of that subscript is taken. That is, x[i, ] extracts the rows specified in i across all columns of x, and x[, j] extracts the columns specified in j across all rows of x. Some examples:

> (x <- matrix(1:12, ncol = 4, byrow = TRUE))
     [,1] [,2] [,3] [,4]
[1,]    1    2    3    4
[2,]    5    6    7    8
[3,]    9   10   11   12
> x[1, ]
[1] 1 2 3 4
> x[, 4]
[1]  4  8 12
> x[2, 3]
[1] 7
> x[-(1:2), 3:4]
[1] 11 12
> x[x[, 2] < 8 & x[, 4] < 10, ]
     [,1] [,2] [,3] [,4]
[1,]    1    2    3    4
[2,]    5    6    7    8

You can also use the head() and tail() functions to return (print) the first/last n= rows of a matrix (n = 6 by default).

Remember, arrays are k-dimensional generalizations of matrices. Therefore, for a k-dimensional array we must give k index vectors, each in one of the four forms - x[i, j, k, ...]. As with matrices, if any index position is given an empty index vector, then the full range of that subscript is extracted. Some examples:

> (x <- array(1:24, dim = c(3, 4, 2)))
, , 1

     [,1] [,2] [,3] [,4]
[1,]    1    4    7   10
[2,]    2    5    8   11
[3,]    3    6    9   12

, , 2

     [,1] [,2] [,3] [,4]
[1,]   13   16   19   22
[2,]   14   17   20   23
[3,]   15   18   21   24

> x[2, 3, 1]
[1] 8
> x[1:2, c(1, 4), 2]
     [,1] [,2]
[1,]   13   22
[2,]   14   23

It might not be apparent from the examples above, but sometimes an indexing operation causes one of the dimensions to be 'dropped' from the result. For example, when we return only the fourth column of a 3 x 4 matrix, R returns a three-element vector, not a one-column matrix.

> x <- matrix(1:12, ncol = 4, byrow = TRUE)
> x[, 4]
[1]  4  8 12
> is.matrix(x[, 4])
[1] FALSE
> is.vector(x[, 4])
[1] TRUE

The default behavior of R is to return an object of the lowest dimension possible. However, as you can imagine, this isn't always desirable and can cause a failure in general subroutines where an index occasionally, but not usually, has length one. Luckily, this habit can be turned off by adding drop = FALSE to the indexing operation. Note, the drop= argument does not add to the index count. For example,

> x <- matrix(1:12, ncol = 4, byrow = TRUE)
> x[, 4, drop = FALSE]
     [,1]
[1,]    4
[2,]    8
[3,]   12
> is.matrix(x[, 4, drop = FALSE])
[1] TRUE

MANIPULATION: The manipulation of matrices and arrays most often involves mathematical manipulation related to matrix algebra. There are several functions that can be used to manipulate matrices and arrays, including the t() (transpose), aperm(), diag(), lower.tri(), upper.tri(), rowsum(), rowSums()/colSums(), rowMeans()/colMeans(), crossprod(), det(), eigen(), max.col(), scale(), svd(), solve(), and backsolve() functions. There are additionally two operators that can be used to manipulate matrices and arrays: %*% (matrix multiplication) and %o% (outer product).

1.3 Data frames & Lists

DEFINITION: As defined in the first document, a data frame in R corresponds to what other statistical packages call a 'data matrix' or a 'data set' - the 2-dimensional data structure used to store a complete set of data, which consists of a set of variables (columns) observed on a number of cases (rows). In other words, a data frame is a generalization of a matrix. Specifically, the different columns of a data frame may be of different data types (numeric, character, or logical), but all the elements of any one column must be of the same data type. As with vectors, the elements of any one column can also include missing elements designated with the NA value. Most types of data you will want to read into R and analyze are best described by data frames.
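To see why a data frame is described as a generalization of a matrix, the following sketch binds the same three vectors first into a matrix and then into a data frame. The vectors id, sex, and dm are made-up values used only for illustration.

```r
# Made-up values: one numeric, one character, and one logical vector
id  <- c(101, 102, 103)
sex <- c("F", "M", "F")
dm  <- c(TRUE, FALSE, NA)

# A matrix holds a single data type, so every element is coerced to character
m <- cbind(id, sex, dm)
mode(m)

# A data frame keeps each column's own type; NA marks the missing element
df <- data.frame(id = id, sex = sex, dm = dm, stringsAsFactors = FALSE)
sapply(df, class)
```

Here stringsAsFactors = FALSE is given explicitly so that the sex column stays a character vector regardless of the R version in use.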
A list is the most general data structure in R and has the ability to combine a collection of objects into a larger composite object. Formally, a list in R is a data structure consisting of an ordered collection of objects known as its components. Each component can contain any type of R object and the components need not be of the same type. Specifically, a list may contain vectors of several different data types and lengths, matrices or more general arrays, data frames, functions, and/or other lists. Because of this, a list provides a convenient way to return the results of a computation - in fact, the results that many R functions return are structured as a list, such as the result of fitting a regression model.

It is also important to realize that a data frame is a special case of a list. In fact, a data frame is nothing more than a list whose components are vectors of the same length and are related in such a way that data in the same 'position' come from the same experimental unit (subject, animal, etc.).

CREATION: As demonstrated in the first document, the data.frame() function can be used to construct a data frame from scratch. Often, the data arguments you supply to the data.frame() function will be individual vectors, which will construct the columns of the data frame. These data arguments can be specified with or without a corresponding column name - either in the form value or the form COLname = value. For example, we can use the sample() function to generate a random data frame.

> ourdf <- data.frame(id = 101:110, sex = sample(c("M", "F"), size = 10,
+     replace = TRUE), age = sample(20:50, size = 10, replace = TRUE),
+     tx = sample(c("Drug", "Placebo"), size = 10, replace = TRUE), diabetes = sample(c(TRUE,
+     FALSE)))
> ourdf
    id sex age      tx diabetes
1  101   F  43    Drug     TRUE
2  102   F  22 Placebo    FALSE
3  103   M  47 Placebo     TRUE
4  104   F  30    Drug    FALSE
5  105   M  46 Placebo     TRUE
6  106   M  30    Drug    FALSE
7  107   M  30    Drug     TRUE
8  108   F  34 Placebo    FALSE
9  109   M  47    Drug     TRUE
10 110   M  46 Placebo    FALSE

With the data.frame() function, character vectors are automatically coerced into factors because of its stringsAsFactors= argument (this was the default before R 4.0.0; from R 4.0.0 onward the default is stringsAsFactors = FALSE, so specify stringsAsFactors = TRUE to reproduce the factor output shown in this document). Specifying stringsAsFactors = FALSE in your data.frame() function invocation will keep character vectors from being coerced. An alternative is to wrap the character vector with the I() function, which keeps it 'as is'. Also, all invalid characters in column names (e.g., spaces, dashes, or question marks) are converted to periods (.).

The list() function can be used to construct a list. Like the data.frame() function, the components of a list can be named using the COMPONENTname = component construct. For example,

> (ourlist <- list(comp1 = c(TRUE, FALSE), comp2 = 1:4, comp3 = matrix(1:20,
+     nrow = 2, byrow = TRUE)))
$comp1
[1]  TRUE FALSE

$comp2
[1] 1 2 3 4

$comp3
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,]    1    2    3    4    5    6    7    8    9    10
[2,]   11   12   13   14   15   16   17   18   19    20

ATTRIBUTES: Also as demonstrated in the first document, we can use the dim() function to check the dimensions (number of rows and number of columns, respectively) of a data frame. We can use the names() function to check the variable names of a data frame. We can also use the Hmisc package's contents() function to display both of these attributes and more.
Specifically, the contents() function displays the meta-data of your data frame, which includes the number of observations (rows) and columns, the variable names, the variable labels (if any), the variable units of measurement (if any), the number of levels for factor variables (if any), the storage mode of each variable, and the number of missing values (NAs) for each variable. The contents() function also displays the maximum number of NAs across all variables, and the level labels of each factor variable (if any).

For a list whose components are named, the names() function will return (print) the names of the components. However, the dim() function and the Hmisc package's contents() function do not work with lists. Instead, use the length() function to return (print) the number of components of a list. I also recommend using the str() (structure) function, which compactly displays the internal structure of an R object. The str() function is especially well suited to compactly display the (abbreviated) contents of lists, including nested lists, which are lists with list components. For example, we can return (print) the structure of the ourlist list we assigned above.

> str(ourlist)
List of 3
 $ comp1: logi [1:2] TRUE FALSE
 $ comp2: int [1:4] 1 2 3 4
 $ comp3: int [1:2, 1:10] 1 11 2 12 3 13 4 14 5 15 ...

The output of the str() function indicates that ourlist is a list of 3 named components - comp1, comp2, and comp3. It also tells us that comp1 is a logical vector of 2 values (logi [1:2]), comp2 is a numeric (integer) vector of 4 values (int [1:4]), and comp3 is a numeric (integer) 2 x 10 matrix (int [1:2, 1:10]). The usefulness of the str() function will become more apparent when you use it to determine the structure of the output of a function in order to select specific portions of the output - for example, examine the structure of the output from str(contents(pbc)).
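As a small sketch of these functions applied to a nested list, consider the following. The list nested and its component names are hypothetical and exist only for this illustration.

```r
# Hypothetical nested list: the third component is itself a list
nested <- list(flag   = c(TRUE, FALSE),
               counts = 1:4,
               inner  = list(label = "visit", m = matrix(1:6, nrow = 2)))

length(nested)  # number of top-level components
names(nested)   # names of the top-level components
dim(nested)     # NULL, since a list has no dimensions
str(nested)     # compact display, including the nested list
```

Because str() indents the components of inner beneath it, the nesting is visible at a glance, which is what makes it handy for exploring the output of modeling functions.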
SUBSETTING: To subset a list, we use either the single square brackets, [ ], or the double square brackets, [[ ]]. With the single square brackets, we indicate which component(s) of the list we would like to extract. When the components of a list are not named, we specify the desired component(s) using their number(s). For example, if we had a 9-component list Lst we could extract the second component using Lst[2]. We can also incorporate the colon operator (:) or the c() (concatenate) function to extract more than one component. For example, we could extract the first three components using Lst[1:3] and we could extract the third, fifth, and ninth components using Lst[c(3, 5, 9)]. When the components of a list are named, as with the list ourlist, we specify the desired component(s) using their name(s) - as quoted character strings. As before, we can incorporate the c() (concatenate) function to extract more than one named component. For example, we could extract the first component of ourlist using ourlist["comp1"] and we could extract the first and third components using ourlist[c("comp1", "comp3")].

IMPORTANT: Subsetting a list using the single square brackets, [ ], returns a list! Therefore, the result is a sublist of the original list, consisting of the specified components. If it was a named list, the names are transferred to the sublist. For example,

> str(ourlist["comp1"])
List of 1
 $ comp1: logi [1:2] TRUE FALSE

In contrast, subsetting a list using the double square brackets, [[ ]], extracts the object that was stored in the specified component. Because of this, we specify only a single component to be extracted with the double square brackets - we do not incorporate the colon operator, :, or the c() (concatenate) function. Also, the name of the object is not included when it is extracted, if the corresponding component was named in the list.
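The difference between the two bracket operators can be sketched on a small made-up list (lst and its components are hypothetical names chosen for this example):

```r
# Hypothetical three-component named list
lst <- list(a = 1:3, b = c("x", "y"), c = TRUE)

sub1 <- lst["a"]           # single brackets: a list of length 1, name kept
sub2 <- lst[c("a", "c")]   # several components: still a (sub)list
obj  <- lst[["a"]]         # double brackets: the stored integer vector itself

is.list(sub1)
is.list(obj)
names(sub2)
sum(obj)   # the extracted vector can be used directly in a computation
```

Calling sum(sub1) instead would fail, because sub1 is still a list rather than the numeric vector it contains.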
As with the single square brackets, [ ], the component can be specified using its number or its name, if appropriate. If the components of a list are named, an alternative to the double square brackets is the $ operator. Therefore, Lst[["ComponentName"]] is equivalent to Lst$ComponentName. For example,

> ourlist[["comp1"]]
[1]  TRUE FALSE
> str(ourlist[["comp1"]])
 logi [1:2] TRUE FALSE
> ourlist$comp3
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,]    1    2    3    4    5    6    7    8    9    10
[2,]   11   12   13   14   15   16   17   18   19    20
> str(ourlist$comp3)
 int [1:2, 1:10] 1 11 2 12 3 13 4 14 5 15 ...

Because the result of subsetting a list using the double square brackets is the stored object, the result can then itself be subsetted as demonstrated for vectors, matrices, etc. For example, we can extract the third column of the comp3 matrix from ourlist using ourlist$comp3[, 3].

Because data frames are a special case of lists, elements (rows and/or columns) of a data frame can be extracted using the [ ], [[ ]], and/or $ operators. In addition, because a data frame can be thought of as a generalization of a matrix, we can also use the [ , ] subsetting format. If df is our data frame and v1, v2, and v3 are its columns, then df[["v1"]], df$v1, and df[, "v1"] are equivalent and they all return a vector. However, df["v1"] returns a data frame. Similarly, specifying more than one column with the [ , ] subsetting format returns a data frame. Some examples using the ourdf data frame:

> ourdf[, 4]
 [1] Drug    Placebo Placebo Drug    Placebo Drug    Drug    Placebo Drug    Placebo
Levels: Drug Placebo
> ourdf[["tx"]]
 [1] Drug    Placebo Placebo Drug    Placebo Drug    Drug    Placebo Drug    Placebo
Levels: Drug Placebo
> ourdf$tx
 [1] Drug    Placebo Placebo Drug    Placebo Drug    Drug    Placebo Drug    Placebo
Levels: Drug Placebo
> ourdf[ourdf$sex == "M", c("id", "tx")]
    id      tx
3  103 Placebo
5  105 Placebo
6  106    Drug
7  107    Drug
9  109    Drug
10 110 Placebo

Notice how we had to use the $ operator in order to subset the rows of ourdf by the sex column.

REMEMBER: If the column(s) of the data frame you are using in the conditional selection of rows contain(s) missing values, you will need to explicitly remove the NAs; otherwise, each NA in the condition produces a row of NAs in the result. For example, if the sex column of ourdf contained NAs, in order to extract the rows for which sex == "M", we would have to specify ourdf[ourdf$sex == "M" & !is.na(ourdf$sex), ].

Even though the dfname$colname and dfname[["colname"]] constructs for extracting a column from a data frame are equivalent, the dfname$colname construct is more convenient when using R interactively and the dfname[["colname"]] construct is very useful when you are specifying the value of "colname" using another object. For example,

> x <- "ageyrs"
> head(pbc[[x]])
Age [years]
[1] 66.25873 42.50787 59.95346 52.02464 41.38535 61.72758

This dfname[["colname"]] construct will become very handy when using loops - see the 'Repetitive Evaluation of Expressions' section of the second chapter.

As you can imagine, using the [ ] operators with the dfname$colname construct and the is.na() function to properly remove any missing values can lead to a large amount of typing, depending on how you are trying to subset your data frame. For example, we would have to use the following code to correctly subset the age (in years) values of those female subjects who died and had received D-penicillamine:

pbc[pbc$sex == "Female" & !is.na(pbc$sex) &
    pbc$censored == "Dead" & !is.na(pbc$censored) &
    pbc$drug == "D-penicillamine" & !is.na(pbc$drug), "ageyrs"]

Luckily there are several functions that make subsetting data frames much easier. Let's first discuss the subset() function, which we've already introduced with vectors.
With data frames, the subset() function returns the specified subset of rows and/or columns of a data frame. The formal arguments of the data frame method of the subset() function and their defaults (if any) are:

> args(subset.data.frame)
function (x, subset, select, drop = FALSE, ...)
NULL

The x= argument specifies the data frame to be subsetted. The subset= argument specifies which rows of the data frame to keep by using a conditional expression. With the subset= argument, missing values are automatically taken as false, so they do not need to be explicitly removed using the is.na() function. In addition, the subset= argument is evaluated in the data frame, so columns can be simply referred to (by name; e.g., age) as variables in the expression. This saves us from having to use the dfname$colname construct throughout the specified logical expression.

The select= argument specifies which columns of the data frame to select. Like the subset= argument, for the select= argument, we specify the columns by their name. The select= argument works by first replacing the column names in the selection expression with the corresponding column numbers in the data frame and then using the resulting integer vector to index the columns. This allows the use of the standard indexing conventions so that, for example, a group of columns can be specified using the c() (concatenate) function, ranges of columns can be specified using the : operator, or single columns can be dropped using the (unary) - operator. Let's demonstrate the usefulness of the subset() function with some examples.

> subset(pbc, subset = sex == "Female" & censored == "Dead" & drug ==
+     "D-penicillamine" & ascites == "Yes", select = age)
     age
6  22546
18 28018
20 25023
> subset(pbc, subset = sex == "Female" & censored == "Dead" & drug ==
+     "D-penicillamine" & ascites == "Yes", select = c(id, bili, stage))
    id bili stage
6   37  7.1     4
18  92  1.4     4
20 106  2.1     2
> subset(pbc, subset = sex == "Female" & censored == "Dead" & drug ==
+     "D-penicillamine" & ascites == "Yes", select = bili:stage)
   bili chol album stage
6   7.1  334  3.01     4
18  1.4  206  3.13     4
20  2.1   NA  3.90     2
> head(subset(pbc, select = -c(drug, censored, ascites)))
  id fudays status   age    sex bili chol album stage   ageyrs     fuyrs
1  6   2503   Dead 24201 Female  0.8  248  3.98     3 66.25873 6.8528405
2  9   2400   Dead 15526 Female  3.2  562  3.08     2 42.50787 6.5708419
3 20   1356   Dead 21898 Female  5.1  374  3.51     4 59.95346 3.7125257
4 26   1444   Dead 19002 Female  5.2 1128  3.68     3 52.02464 3.9534565
5 30    321   Dead 15116 Female  3.6  260  2.54     4 41.38535 0.8788501
6 37    223   Dead 22546 Female  7.1  334  3.01     4 61.72758 0.6105407

Unlike the [ ] and [[ ]] operators, the subset() function always returns a data frame, even if the result has only one row or one column. To return a single vector of output, we can use the dfname$colname construct in conjunction with the subset() function. For example,

> subset(pbc, subset = sex == "Female" & censored == "Dead" & drug ==
+     "D-penicillamine" & ascites == "Yes")$age
Age [days]
[1] 22546 28018 25023

For a data frame, the complete.cases() function returns a logical vector indicating which cases (i.e., rows) of the data frame are 'complete' (i.e., have no missing values across all of its columns). The complete.cases() function can be used in conjunction with the [ ] operators, or the subset() function. For example, pbc[complete.cases(pbc), ] or subset(pbc, subset = complete.cases(pbc)), respectively.

The unique() function is also useful for subsetting a data frame.
In particular, it is very useful when you are trying to determine the combinations of values across several columns that exist in your data frame. For a data frame, the unique() function returns a subset of the data frame with all duplicate rows (across all columns of the data frame) removed. For example,

> unique(subset(pbc, select = c(drug, censored, ascites)))
              drug censored ascites
1          Placebo     Dead      No
2  D-penicillamine     Dead      No
6  D-penicillamine     Dead     Yes
8          Placebo Censored      No
11 D-penicillamine Censored      No
42         Placebo     Dead     Yes
54 D-penicillamine Censored     Yes
76            <NA> Censored    <NA>
80            <NA>     Dead    <NA>

As you noticed, missing values are considered unique values if they exist. And remember, you can use the head() and tail() functions to return (print) the first/last n= rows of your data frame (n = 6 by default).

MANIPULATION: In addition to modifying individual variables of a single data frame, you often need to manipulate one or more whole data frames. This data frame manipulation can include tasks such as combining two or more data frames together; sorting the rows of a data frame in ascending or descending order of a desired column or columns; and/or reshaping a data frame that has repeated measurements. We will discuss each of these data frame manipulations individually.

DEFINING ADDITIONAL INDIVIDUAL COLUMNS: Additional individual columns can be added to an existing data frame in several different ways. The first way is to use the dfname$colname construct to assign a value to a new column. For example, we could define follow-up in months using pbc$fumonths <- pbc$fudays/30. We can also add a new variable by using the data.frame() function and overwriting the existing data frame. For example, pbc <- data.frame(pbc, fumonths = pbc$fudays/30). Notice that we still needed to use the dfname$colname construct within the data.frame() function invocation to properly reference the needed columns.
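The two approaches described so far can be sketched on a small made-up data frame. The data frame d and its columns are hypothetical stand-ins for pbc, and the divisor 365.25 for years is an assumption chosen for this example.

```r
# Hypothetical data frame with follow-up time in days
d <- data.frame(id = 1:3, fudays = c(30, 90, 365))

# 1. Assign to a new column with the dfname$colname construct
d$fumonths <- d$fudays / 30

# 2. Rebuild the data frame with an extra column and overwrite the original
d <- data.frame(d, fuyrs = d$fudays / 365.25)

names(d)
```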
Yet another way we can add a new variable to an existing data frame is using an assignment expression involving the transform() function. By default, the transform() function only returns the updated data frame and does not 'permanently' add the new variable to the data frame, so we need to assign the output of the transform() function to our data frame. For example, pbc <- transform(pbc, fumonths = fudays/30). An advantage to using the transform() function to add a variable to a data frame is that you don't have to use the dfname$colname construct to reference other columns. Also, don't forget about the Hmisc package's upData() function, which we demonstrated in the first document.

At this point, you might also be asking yourself 'Can't we use the with() function to add a new variable to an existing data frame?' The answer is 'no' if we use an expression similar to with(pbc, newvar <- oldvar*2), but an expression of the form pbc$newvar <- with(pbc, oldvar*2) will work. A newer alternative is the within() function, which returns a modified copy of the data frame that must likewise be assigned back - pbc <- within(pbc, newvar <- oldvar*2).

When defining additional individual columns in an existing data frame, the columns can be derived from other existing columns in the data frame, or can be created 'from scratch.' In either case, the column can be constructed by virtually any combination of functions (and operators) in R that construct and/or manipulate vectors - see the 'Catalog' chapter.

REMOVING COLUMNS/ROWS: A single column can be removed from a data frame by setting it to NULL. For example, we could remove the fudays column from our pbc data frame using pbc$fudays <- NULL. In a similar sense, we can use the [ ] operators to remove several columns from a data frame. For example, pbc[, c("fudays", "album", "ascites")] <- NULL. You can check that the column(s) has/have been removed from the data frame by returning (printing) the output of the names() function.
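Column removal can be sketched as follows on a hypothetical data frame d, used here in place of pbc. For several columns this sketch assigns list(NULL) through the single-bracket list form, which is a slight variation on the expression shown above.

```r
# Hypothetical data frame standing in for pbc
d <- data.frame(id = 1:3, fudays = c(30, 90, 365),
                album = c(3.1, 2.9, 4.0), ascites = c("No", "Yes", "No"))

d$fudays <- NULL                         # remove a single column
d[c("album", "ascites")] <- list(NULL)   # remove several columns at once

names(d)   # confirm which columns remain
```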
Rows, by contrast, cannot be removed by assigning NULL; an expression such as subset(pbc, sex == "Female") <- NULL is not valid R. Instead, we use the [ , ] subsetting format or the subset() function to extract the rows we wish to keep, and assign the result back to the data frame. For example, we could remove the female subjects from our pbc data frame using pbc <- subset(pbc, sex != "Female"). Note that subset() also drops any rows for which the condition evaluates to NA.

MERGING: The cbind() and rbind() functions, which were mentioned in the 'Matrices & Arrays' section, can be used to combine multiple data frames. Recall, in general, the cbind() and rbind() functions bind together the data arguments horizontally (column-wise) or vertically (row-wise), respectively. In terms of data frames, the cbind() function adds columns of the same length to a data frame, and the rbind() function adds (concatenates) rows across the same columns. When combining data frames, the data frames must have the same number of rows in order to cbind() them, or the same number of columns with the same column names to rbind() them. Also, when you use the cbind() function to column bind two data frames, you need to make sure that the rows of the data frames are in the same order.

An alternative to the rbind() and cbind() functions is the merge() function, which allows more general combinations of data frames. With the merge() function, two data frames can be joined in a one-to-one, many-to-one, or many-to-many fashion using any number of matching variables. The formal arguments of the data frame method of the merge() function and their default values (if any) are:

> args(merge.data.frame)
function (x, y, by = intersect(names(x), names(y)), by.x = by,
    by.y = by, all = FALSE, all.x = all, all.y = all, sort = TRUE,
    suffixes = c(".x", ".y"), incomparables = NULL, ...)
NULL

The x= and y= arguments specify the names of the two data frames you wish to merge. The by=, by.x=, and by.y= arguments specify the column(s) you wish to merge on.
You can use the by= argument to specify the column(s) if the name(s) of the column(s) is/are the same in both data frames. For example, if we wanted to merge two data frames on a subject ID column that was named id in both data frames, then we would specify by = "id". If the name(s) of the column(s) you wish to merge on are different in the two data frames, then you need to specify the correct names for each data frame using the by.x= and by.y= arguments. For example, if we wanted to merge two data frames on a subject ID column that was named id in the first data frame and subject in the second, then we would specify by.x = "id", by.y = "subject". As I've been hinting, you can also merge on more than one column. In this case, use the c() (concatenate) function to specify the vector of column names you wish to merge on. For example, by = c("id", "visitdate").

The all=, all.x=, and all.y= arguments specify which subsets of rows should remain in the resulting merged data frame. By default, only those rows 'in common' (i.e., the intersection) between the two data frames are returned (all = FALSE). In other words, any nonmatching rows of either the first or second data frame are dropped. Use all = TRUE to keep all of the rows (matching or nonmatching) from both data frames (i.e., the union). And use all.x = TRUE to keep all the nonmatching rows of the first data frame, but drop any nonmatching rows of the second data frame. The use is analogous for all.y=. When all = TRUE, all.x = TRUE, or all.y = TRUE, missing values (NAs) are filled in the corresponding columns of all of the nonmatching rows.

Also, by default, if the remaining columns in the two data frames (i.e., those columns the data frames contain but were not merged on) have any common names, these have suffixes (by default, '.x' and '.y') appended to their names to make the names of the result unique. Use the suffixes= argument to change this default.
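Before turning to a fuller example, here is a minimal sketch of merging on more than one column while also changing the default suffixes. The labs and vitals data frames, their columns, and their values are all hypothetical and exist only to illustrate the by=, all.x=, and suffixes= arguments.

```r
# Hypothetical toy data: one row per subject per visit in each data frame
labs <- data.frame(id = c(1, 1, 2, 3), visit = c(1, 2, 1, 1),
                   value = c(9.4, 10.1, 7.7, 12.0))
vitals <- data.frame(id = c(1, 1, 2, 4), visit = c(1, 2, 1, 1),
                     value = c(120, 118, 135, 110))

# Merge on both key columns and keep every row of 'labs' (all.x = TRUE).
# The shared non-key column 'value' receives the custom suffixes.
both <- merge(labs, vitals, by = c("id", "visit"), all.x = TRUE,
              suffixes = c(".lab", ".vital"))
both
```

Subject 3 appears only in labs, so its value.vital entry is filled with NA; subject 4 appears only in vitals and is dropped because all.y= was left at its default.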
Let's work through some examples using the authors and books data frames that are constructed in the Examples section of the merge() function's help file.

> (authors <- data.frame(surname = c("Tukey", "Venables", "Tierney", "Ripley",
+     "McNeil"), nationality = c("US", "Australia", "US", "UK", "Australia"),
+     deceased = c("yes", rep("no", 4))))
   surname nationality deceased
1    Tukey          US      yes
2 Venables   Australia       no
3  Tierney          US       no
4   Ripley          UK       no
5   McNeil   Australia       no

> (books <- data.frame(name = c("Tukey", "Venables", "Tierney", "Ripley",
+     "Ripley", "McNeil", "R Core"), title = c("Exploratory Data Analysis",
+     "Modern Applied Statistics", "LISP-STAT", "Spatial Statistics",
+     "Stochastic Simulation", "Interactive Data Analysis", "An Introduction to R"),
+     other.author = c(NA, "Ripley", NA, NA, NA, NA, "Venables & Smith")))
      name                     title     other.author
1    Tukey Exploratory Data Analysis             <NA>
2 Venables Modern Applied Statistics           Ripley
3  Tierney                 LISP-STAT             <NA>
4   Ripley        Spatial Statistics             <NA>
5   Ripley     Stochastic Simulation             <NA>
6   McNeil Interactive Data Analysis             <NA>
7   R Core      An Introduction to R Venables & Smith

> merge(authors, books, by.x = "surname", by.y = "name")
   surname nationality deceased                     title other.author
1   McNeil   Australia       no Interactive Data Analysis         <NA>
2   Ripley          UK       no        Spatial Statistics         <NA>
3   Ripley          UK       no     Stochastic Simulation         <NA>
4  Tierney          US       no                 LISP-STAT         <NA>
5    Tukey          US      yes Exploratory Data Analysis         <NA>
6 Venables   Australia       no Modern Applied Statistics       Ripley

> merge(books, authors, by.x = "name", by.y = "surname", all.x = TRUE)
      name                     title     other.author nationality deceased
1   McNeil Interactive Data Analysis             <NA>   Australia       no
2   R Core      An Introduction to R Venables & Smith        <NA>     <NA>
3   Ripley        Spatial Statistics             <NA>          UK       no
4   Ripley     Stochastic Simulation             <NA>          UK       no
5  Tierney                 LISP-STAT             <NA>          US       no
6    Tukey Exploratory Data Analysis             <NA>          US      yes
7 Venables Modern Applied Statistics           Ripley   Australia       no

> merge(authors, books, by.x = "surname", by.y = "name", all = TRUE)
   surname nationality deceased                     title     other.author
1   McNeil   Australia       no Interactive Data Analysis             <NA>
2   Ripley          UK       no        Spatial Statistics             <NA>
3   Ripley          UK       no     Stochastic Simulation             <NA>
4  Tierney          US       no                 LISP-STAT             <NA>
5    Tukey          US      yes Exploratory Data Analysis             <NA>
6 Venables   Australia       no Modern Applied Statistics           Ripley
7   R Core        <NA>     <NA>      An Introduction to R Venables & Smith

In all three merge() function examples, we had to specify by.x= and by.y= because the column names were not the same in the two data frames (i.e., surname and name). In the first merge, all nonmatching rows of books were dropped from the result (i.e., the row corresponding to name equal to R Core). You'll also notice that a second row for surname equal to Ripley was added to the result to match the two rows for name equal to Ripley in the books data frame. In the second merge, we specified that the nonmatching rows of the books data frame should not be dropped (i.e., the name equal to R Core row), and missing values were filled into the remaining columns of the result (nationality and deceased), which came from the authors data frame. In the third merge, all rows from both data frames were kept in the result, and missing values were filled into the appropriate columns - the nationality and deceased columns for the surname equal to R Core row.

SORTING: Even though the sort() function is available to sort the elements of individual vectors, when dealing with data frames, sorting a single vector is not usually what is required. More often, you need to sort a series of columns according to the values of some other columns. For example, we may want to sort the pbc data frame by treatment (drug), gender (sex), and survival status (censored). With data frames, we also have to make sure that the whole records (single rows across all the columns) are correctly sorted, not just individual columns. To sort data frames, we will want to use the order() function instead of the sort() function. Unlike the sort() function, the order() function can operate on more than one vector simultaneously.
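The difference between the two functions is easiest to see on a tiny example. This is only a sketch using a made-up four-row data frame (the values are not taken from pbc): sort() returns the sorted values of one vector, whereas order() returns row positions that can be used to rearrange whole records.

```r
# Made-up toy data frame for illustration
toy <- data.frame(drug = c("Placebo", "D-penicillamine", "Placebo",
                           "D-penicillamine"),
                  age = c(58, 41, 70, 63))

# sort() returns the sorted values only; the link to the other columns is lost
sort(toy$age)

# order() returns the row positions that sort by drug, with ties broken by age
idx <- order(toy$drug, toy$age)
idx

# Using those positions inside [ , ] keeps each record intact
sorted <- toy[idx, ]
sorted
```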
For example, order(x, y) will return (print) the index vector which rearranges x (its first argument) into ascending or descending order, with ties broken by y (its second argument). If more than two arguments are given, ties are broken by subsequent arguments. By default, all vectors involved in the order() expression are sorted in ascending order. Because the order() function returns an index vector and not the original data (like the sort() function), we use the order() function in conjunction with the [ ] operators to properly sort the rows of a data frame. So, returning to our example, we can sort the pbc data frame by treatment (drug), gender (sex), and survival status (censored) using pbc[with(pbc, order(drug, sex, censored)), ]. We used the with() function to save some typing. When our order() expression involves numeric vectors, we can use the (unary) - operator to specify that a numeric vector should be sorted in descending order. For example, we can sort the pbc data frame by treatment (drug), descending age, and survival status (censored) using pbc[with(pbc, order(drug, -age, censored)), ].

RESHAPING: Often, your data frame contains repeated measures. For example, perhaps each subject in your data frame received various kinds of chemotherapy at different visit dates. The reshape() function was created to easily manipulate the 'shape' of such longitudinal data frames. For the reshape() function, the shape of a data frame can be either 'wide' or 'long.' A 'wide' longitudinal data frame will have one record for each individual, time-constant variables occupying single columns, and time-varying variables occupying a column for each time point. In a 'long' format, the data frame will have multiple records for each individual, with some variables being constant across these multiple records and others varying across these multiple records.
A 'long' format data frame also needs a 'time' variable identifying which time point each record comes from and an 'id' variable showing which records refer to the same subject. So, as its name implies, the reshape() function reshapes data frames between the 'wide' and 'long' formats. The formal arguments of the reshape() function and their default values (if any) are:

> args(reshape)
function (data, varying = NULL, v.names = NULL, timevar = "time",
    idvar = "id", ids = 1:NROW(data), times = seq_along(varying[[1]]),
    drop = NULL, direction, new.row.names = NULL, sep = ".",
    split = if (sep == "") {
        list(regexp = "[A-Za-z][0-9]", include = TRUE)
    } else {
        list(regexp = sep, include = FALSE, fixed = TRUE)
    })
NULL

Other than the data= argument, which specifies the name of the data frame you want to reshape, and the drop= argument, which specifies a vector of column names to drop before reshaping, the remaining arguments of the reshape() function are easier to describe depending on which way you want to reshape the data frame - from long to wide, or from wide to long. As you can guess, the direction= argument specifies the direction in which the data frame should be reshaped. Use direction = "wide" to reshape a long data frame to a wide data frame. And use direction = "long" to reshape a wide data frame to a long data frame.

Let's first discuss how to specify the arguments when direction = "wide". The v.names= argument specifies the names of the variables in the current long format that will correspond to multiple variables in the resulting wide format. The idvar= argument specifies the name(s) of one or more variables in the current long format that identify multiple records from the same group and/or individual. The timevar= argument specifies the variable in the long format that is time-varying (i.e., that differentiates multiple records from the same group or individual).
With this in mind, let's work through some direction = "wide" examples. For the first example, let's create a data frame with multiple records per individual (i.e., in long format) that has three columns: (1) subject ID; (2) the type of chemotherapy the subject received; and (3) the corresponding visit date. Therefore, each subject can receive different chemotherapy regimens on different visit dates.

> x <- data.frame(id = sample(1:100, size = 50, replace = TRUE), chemo = sample(Cs(Hormonal,
+     Antibody, Other), size = 50, replace = TRUE), visitdt = sample(paste(1:12,
+     1:30, 2004:2006, sep = "/"), size = 50, replace = TRUE))
> dim(x)
[1] 50  3
> head(x)
  id    chemo    visitdt
1 15 Antibody  5/17/2005
2 24    Other 10/22/2004
3  6 Antibody 12/12/2006
4 65 Hormonal   4/4/2004
5 88 Hormonal  4/28/2004
6 78    Other   9/9/2006
> table(x$id)

  2   4   6   8  11  13  15  18  21  22  23  24  28  36  42  44  45  46  48  49  50  52
  1   1   1   1   1   1   2   1   1   1   1   1   1   2   1   1   1   2   1   1   1   1
 53  57  58  60  61  64  65  66  72  74  76  78  80  82  88  93  98  99 100
  1   1   1   3   1   1   3   1   1   1   1   1   1   1   1   2   1   1   2
> length(unique(x$id))
[1] 41

We want to reshape the data frame from long format to wide format such that there are four columns in the resulting reshaped (wide) data frame: (1) subject ID; (2) 'Hormonal' chemotherapy; (3) 'Antibody' chemotherapy; and (4) 'Other' chemotherapy. In other words, we're interested in tallying whether each subject ever received each of the three types of chemotherapy.

> widechemo <- reshape(subset(x, select = Cs(id, chemo)), v.names = "chemo",
+     idvar = "id", timevar = "chemo", direction = "wide")
> head(widechemo)
  id chemo.Antibody chemo.Other chemo.Hormonal
1 15       Antibody       Other           <NA>
2 24           <NA>       Other           <NA>
3  6       Antibody        <NA>           <NA>
4 65       Antibody        <NA>       Hormonal
5 88           <NA>        <NA>       Hormonal
6 78           <NA>       Other           <NA>
> dim(widechemo)
[1] 41  4

The number of rows of our reshaped wide data frame should be equal to length(unique(x$id)). Note, we needed to drop the visitdt column from consideration in the reshape.
If we didn't, we would have gotten the following warning:

Warning message:
some constant variables (visitdt) are really varying in: reshapeWide(data, idvar = idvar, timevar = timevar, varying = varying,

Let's work with another long format data frame that we wish to reshape to a wide format one. This example data frame has three columns: (1) subject ID; (2) week (1 - 10); and (3) hemoglobin (HGB) value. In this long format data frame, each subject has a HGB value for each week.

> xx <- data.frame(id = rep(1:100, times = 10), week = rep(1:10, each = 100),
+     hgb = sample(seq(4, 16, by = 0.1), size = 1000, replace = TRUE))
> dim(xx)
[1] 1000    3
> length(unique(xx$id))
[1] 100
> head(xx)
  id week  hgb
1  1    1  9.4
2  2    1  7.7
3  3    1 10.9
4  4    1 15.0
5  5    1  5.7
6  6    1  9.0

We want to reshape the data frame from long format to wide format such that there are 11 columns: subject ID, and one column for each week's visit.

> wide.xx <- reshape(xx, v.names = "hgb", idvar = "id", timevar = "week",
+     direction = "wide")
> head(wide.xx)
  id hgb.1 hgb.2 hgb.3 hgb.4 hgb.5 hgb.6 hgb.7 hgb.8 hgb.9 hgb.10
1  1   9.4  15.7  15.1  11.5  12.0   4.5  10.1   7.7   5.8    9.0
2  2   7.7  15.9   4.5   8.8  14.9   8.2  15.1   6.5   6.7    5.9
3  3  10.9   6.1   7.5  15.5   7.6  10.5   5.6  15.1   5.9    7.4
4  4  15.0  10.5  10.0  11.9  15.2  11.3   6.8   4.6  11.4   13.7
5  5   5.7   8.6  11.3   7.9   6.4   7.2   9.9   8.9  10.0    5.9
6  6   9.0  12.1   7.1   6.3  13.5   6.4   9.7   9.5   6.2   13.5
> dim(wide.xx)
[1] 100  11

Like before, the number of rows of our reshaped wide data frame should be equal to length(unique(xx$id)). Let's take our xx long format data frame one step further by introducing some missing HGB values and missing records.
For instance, let's drop 100 random records:

> subxx <- xx[sample(1:1000, size = 900, replace = FALSE), ]

And, in the remaining 900 records, let's replace 100 random values of HGB with NA (i.e., a missing value):

> subxx$hgb[sample(1:900, size = 100, replace = FALSE)] <- NA

Once again, let's reshape the long format data frame to a wide format data frame and see what's different.

> wide.subxx <- reshape(subxx, v.names = "hgb", idvar = "id", timevar = "week",
+     direction = "wide")
> head(wide.subxx)
    id hgb.8 hgb.4 hgb.6 hgb.1 hgb.3 hgb.10 hgb.2 hgb.9 hgb.5 hgb.7
703  3  15.1  15.5    NA  10.9   7.5    7.4    NA   5.9   7.6   5.6
399 99  10.2   4.7  14.5   5.8   8.0    9.6   4.6    NA  14.0   9.5
501  1   7.7  11.5   4.5   9.4  15.1     NA  15.7   5.8  12.0  10.1
37  37   4.8  10.3   9.1   6.3   6.9    4.9   9.0  10.1   9.0    NA
297 97   6.3    NA  10.9    NA   4.5     NA  13.7   6.1   7.6    NA
959 59    NA   9.9  10.2   5.4  10.2   12.9   7.7  10.6    NA    NA
> dim(wide.subxx)
[1] 100  11

Like twice before, the number of rows of our reshaped wide data frame should be equal to the number of unique subject IDs, here length(unique(subxx$id)). But this time, you'll notice that the columns are no longer in order, because the week columns are created in the order in which the weeks first appear in the (randomly shuffled) subxx data frame. You'll also notice that missing values are appropriately inserted in the reshaped data frame.

As a fourth example of reshaping a long format data frame to a wide format one, let's use a created data frame with two different variables (x and y) that have each been measured at two different time points (Before and After).

> df <- data.frame(id = rep(1:4, rep(2, 4)), visit = factor(rep(c("Before",
+     "After"), 4)), x = rnorm(4), y = runif(4))
> df
  id  visit          x         y
1  1 Before  1.0067290 0.2998300
2  1  After -1.6988995 0.8935240
3  2 Before -0.3840194 0.7626606
4  2  After -0.8087731 0.1040294
5  3 Before  1.0067290 0.2998300
6  3  After -1.6988995 0.8935240
7  4 Before -0.3840194 0.7626606
8  4  After -0.8087731 0.1040294
> reshape(df, timevar = "visit", idvar = "id", direction = "wide")
  id   x.Before  y.Before    x.After   y.After
1  1  1.0067290 0.2998300 -1.6988995 0.8935240
3  2 -0.3840194 0.7626606 -0.8087731 0.1040294
5  3  1.0067290 0.2998300 -1.6988995 0.8935240
7  4 -0.3840194 0.7626606 -0.8087731 0.1040294

Unlike the three previous examples, in this one we did not specify the v.names= argument - here, we did not need to since, after specifying the timevar= and idvar= arguments, only two columns remained to be considered. We can also see what happens when you reshape an "unbalanced" subset of a long format data frame.

> df2 <- df[1:7, ]
> df2
  id  visit          x         y
1  1 Before  1.0067290 0.2998300
2  1  After -1.6988995 0.8935240
3  2 Before -0.3840194 0.7626606
4  2  After -0.8087731 0.1040294
5  3 Before  1.0067290 0.2998300
6  3  After -1.6988995 0.8935240
7  4 Before -0.3840194 0.7626606
> reshape(df2, timevar = "visit", idvar = "id", direction = "wide")
  id   x.Before  y.Before    x.After   y.After
1  1  1.0067290 0.2998300 -1.6988995 0.8935240
3  2 -0.3840194 0.7626606 -0.8087731 0.1040294
5  3  1.0067290 0.2998300 -1.6988995 0.8935240
7  4 -0.3840194 0.7626606         NA        NA

As you can see, NAs are used to 'fill in' the resulting wide format data frame.

Now let's discuss how to specify the arguments when direction = "long". The varying= argument specifies the names of sets of variables in the current wide format that will correspond to single (time-varying) variables in the resulting long format. The varying= argument must be specified as a list of vectors. The v.names= argument can optionally be used to specify the name(s) of the variable(s) in the resulting long format that correspond to the multiple variables in the original wide format. The ids= argument can also optionally be used to specify the values to use for a newly created 'idvar' variable in the resulting long format. And the times= argument can also optionally be used to specify the values to use for a newly created 'timevar' variable in the resulting long format.
With this in mind, let's work one direction = "long" example. For this example, we'll use the immer data frame from the (automatically installed) MASS package, which is a built-in data set that we can access in R and which contains the yield data from a barley field trial. Specifically, five varieties of barley were grown in six locations in each of 1931 and 1932. The immer data frame has 30 rows and 4 columns: (1) Loc the location; (2) Var the variety of barley (manchuria, svansota, velvet, trebi, and peatland); (3) Y1 the yield in 1931; and (4) Y2 the yield in 1932.

> library(MASS)
> data(immer)
> head(immer)
  Loc Var    Y1    Y2
1  UF   M  81.0  80.7
2  UF   S 105.4  82.3
3  UF   V 119.7  80.4
4  UF   T 109.7  87.2
5  UF   P  98.3  84.2
6   W   M 146.6 100.4
> dim(immer)
[1] 30  4

We want to reshape the wide format data frame to a long format one, such that the yield values from each year are placed in one column.

> immer.long <- reshape(immer, varying = list(c("Y1", "Y2")), direction = "long")
> head(immer.long)
    Loc Var time    Y1 id
1.1  UF   M    1  81.0  1
2.1  UF   S    1 105.4  2
3.1  UF   V    1 119.7  3
4.1  UF   T    1 109.7  4
5.1  UF   P    1  98.3  5
6.1   W   M    1 146.6  6

As you can see, just specifying the varying= argument doesn't generate the clearest labels, so let's specify some more arguments:

> immer.long <- reshape(immer, varying = list(c("Y1", "Y2")), timevar = "Year",
+     times = c(1931, 1932), v.names = "Yield", direction = "long")
> head(immer.long)
       Loc Var Year Yield id
1.1931  UF   M 1931  81.0  1
2.1931  UF   S 1931 105.4  2
3.1931  UF   V 1931 119.7  3
4.1931  UF   T 1931 109.7  4
5.1931  UF   P 1931  98.3  5
6.1931   W   M 1931 146.6  6

Now the resulting long data frame is much more descriptive.

In general, if a data frame (say x) resulted from a previous reshape() function invocation, then the operation can be reversed simply by using reshape(x). In this case, the direction= argument is optional and the other arguments are stored as attributes on the data frame.
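The reversal just described can be checked on a very small data frame. The following is a minimal sketch, assuming a made-up wide data frame with two hemoglobin columns (the names wide, long, and back are chosen here for illustration only): we reshape it to long format and then call reshape() again with no further arguments.

```r
# Made-up wide data: one row per subject, one hgb column per week
wide <- data.frame(id = 1:3,
                   hgb.1 = c(9.4, 7.7, 10.9),
                   hgb.2 = c(15.7, 15.9, 6.1))

# Wide to long: stack the two hgb columns into a single 'hgb' column
long <- reshape(wide, varying = list(c("hgb.1", "hgb.2")), v.names = "hgb",
                timevar = "week", times = 1:2, idvar = "id",
                direction = "long")
long

# Reverse the operation; the needed arguments are read from the
# attributes that the first reshape() call stored on 'long'
back <- reshape(long)
back
```

Note that this only works on a data frame that was itself produced by reshape(); subsetting or rebuilding the data frame in between may lose the stored attributes.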
1.4 Data Export

When using R, you will often want to write specific data objects to a file. These objects may be formatted character strings that contain specific calculated results, or they may be whole matrices or data frames. There are many functions that can be used for data export, but in this section we will discuss the cat() and write.table() functions - see the 'Catalog' chapter for the other functions.

The most basic function for producing customized output is the cat() function, which concatenates and prints objects and character strings. And, when used in conjunction with other functions like the round() and format() functions, it can print nicely formatted reports. The cat() function is also useful for producing output in user-defined functions. The syntax of the cat() function is

> args(cat)
function (..., file = "", sep = " ", fill = FALSE, labels = NULL,
    append = FALSE)
NULL

Like the paste() function, the cat() function converts its arguments (...) to character strings, concatenates them, separating them by the given sep= (separator) character string (by default, spaces), and then outputs them. For example,

> cat("The mean age of the subjects in our PBC data set is", round(mean(pbc$ageyrs),
+     1), "years.")
The mean age of the subjects in our PBC data set is 49.9 years.

Even though it is not shown here, by default, the cat() function does not go to a new line after being executed (fill = FALSE). In general, the fill= argument is a logical or (positive) numeric value controlling how the output is broken into successive lines. If fill = FALSE (default), only newlines created explicitly by "\n" are printed. Otherwise, the output is broken into lines based on the maximum printed line width that is set by the width option from the options() function (see the 'Customizing R' section of the second chapter for more details) if fill = TRUE, or the value of fill= if this is numeric.
Based on this, the previous example can be modified to cat("The mean age of the subjects in our PBC data set is", round(mean(pbc$ageyrs), 1), "years.", fill = TRUE), which is essentially equivalent to cat("The mean age of the subjects in our PBC data set is", round(mean(pbc$ageyrs), 1), "years.", "\n"). In addition to newlines ("\n"), quotes and other special characters can be specified within character strings using escape sequences:

Escape Sequence   Character
\'                Single quote
\"                Double quote
\n                Newline
\r                Carriage return
\t                Tab character
\b                Backspace
\a                Bell
\f                Form feed
\v                Vertical tab
\\                Backslash itself

Here's another example that incorporates more escape sequences and the round() function,

> cat("Mean", "\t", "SD", "\t", "N", "\n", round(mean(pbc$ageyrs), 2),
+     "\t", round(sd(pbc$ageyrs), 2), "\t", sum(!is.na(pbc$ageyrs)), "\n")
Mean     SD      N
 49.92   11.89   100

Notice that the quotation marks are removed when the character strings are printed with cat(). Also, as seen, by default, the result of a cat() function invocation is returned (printed) to the screen (file = ""), but it can also be written to an external file. Use the file= argument to specify a character string naming the file to print to. The file name can include a path like the file names specified in the read.table() function. Also, use append = TRUE to append the printed output to the file specified. By default, the cat() function overwrites the contents of the specified file (append = FALSE).

In addition to the paste() and round() functions demonstrated above, there are several other functions that can be incorporated into a cat() function expression to modify the output. Similar to the round() function, there are the ceiling(), floor(), trunc(), and signif() functions. Another useful function is the generic format() function, whose default method (format.default()) can format a numeric vector for 'pretty printing'. R does not always print output in the nicest, most consistent format.
For example, by default, trailing zeros are removed from numeric values when they are printed, very large or small numbers are printed using scientific notation, and no thousand separators are used when large numbers are printed. For example, type the following numeric values at the command line and press return - 2.002, 2.00, 0.0000345, and 34567901567. Luckily, the format.default() function has nsmall=, scientific=, and big.mark= arguments to help remedy these problems. Here are the same examples from above modified using the format() function.

> format(2, nsmall = 2)
[1] "2.00"
> format(3.45e-05, scientific = FALSE)
[1] "0.0000345"
> format(34567901567, big.mark = ",")
[1] "34,567,901,567"

There is also a format.pval() function that is very helpful to use in conjunction with the cat() function.

A drawback to the cat() function is that it only prints vectors (anything else is coerced to a vector before being output). Therefore, the cat() function is not very useful when you are trying to write non-vector output to a file, such as the output from a regression analysis. For this, I would suggest using the sink() function, which we discussed in the 'Diverting screen output to a file' section of the first document.

It is also useful to know how to write a data frame to a file in R, particularly in the instance where you have made several modifications to your data frame and would like to output it for use in some other program. Like reading in a data file, there are many functions in R that write a data frame to a file, and most are 'derivatives' of the write.table() function.
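Before looking at the write.table() function in detail, the cat() and format() pieces discussed above can be combined into a small report that is written to a file. This is just a sketch: the ages vector is made up, and the file is a temporary one created with tempfile() so that nothing in your working directory is touched.

```r
# Made-up ages, chosen so that the mean is a round number
ages <- c(49.2, 50.0, 61.7, 39.1)
outfile <- tempfile(fileext = ".txt")

# First line: overwrite the file; nsmall = 2 keeps the trailing zeros
cat("Mean age:", format(round(mean(ages), 2), nsmall = 2), "years\n",
    file = outfile)

# Second line: append to the same file; big.mark adds thousand separators
cat("N:", format(1234567, big.mark = ","), "\n",
    file = outfile, append = TRUE)

# Read the file back to check what was written
report <- readLines(outfile)
report
```

Without append = TRUE in the second call, the first line would have been overwritten.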
The formal arguments of the write.table() function and their default values (if any) are:

> args(write.table)
function (x, file = "", append = FALSE, quote = TRUE, sep = " ",
    eol = "\n", na = "NA", dec = ".", row.names = TRUE, col.names = TRUE,
    qmethod = c("escape", "double"))
NULL

By default, the specified data frame (x=) is written to the specified file (file=) as a space-delimited text file (sep = " "). In addition, both the column and row names are written (col.names = TRUE and row.names = TRUE, respectively), all character strings are quoted (quote = TRUE), and missing values are represented with NA (na = "NA").

For thoroughness, let's discuss some of the write.table() function's arguments and their useful modifications in more detail. The x= argument is used to specify the object to be written, which is preferably a data frame or a matrix. If x= is not a data frame or a matrix, the write.table() function attempts to coerce the object to a data frame. As you can imagine, the file= argument is used to specify the name of the file (case sensitive, in quotes, and including its extension) to which the data should be written. Like in the read.table() function, the name of the file can be specified by only its name, or including its path - if no path is specified, the file is created in the current working directory. By default, the file the data frame is written to is overwritten if it already exists (append = FALSE). Specify append = TRUE if the written data frame should be appended to the end of the specified file. By default, any character or factor columns of the data frame are surrounded by double quotes when the data frame is written to the specified file. In addition, if quote = TRUE, column and/or row names (if printed) are also quoted. Specify quote = FALSE to suppress the quoting of such columns. As mentioned, by default, the data frame is written as a space-delimited text file (sep = " ").
Like the read.table() function, the sep= argument specifies the field separator character, which is the character that will separate the values on each line of the written file. Use sep = "\t" to write tab-delimited files, and sep = "," for comma-separated value files ('.csv'). The na= argument is used to specify the character string to use to represent missing values in the written data. By default, missing values are represented with NA (na = "NA"). As mentioned, by default, the row names of the data frame are written to the specified file (row.names = TRUE). Specify row.names = FALSE to suppress this writing. The row.names= argument can also be specified as a character vector of row names to be written.

Lastly, the qmethod= argument determines how the write.table() function will deal with embedded double quote characters when quoting strings (e.g., "This is a "double quote" within a quote"). The qmethod= argument must be specified as either "escape" (default) or "double". Specifying qmethod = "escape" causes the quote character to be escaped in C style by a backslash (e.g., "This is a \"double quote\" within a quote"), while specifying qmethod = "double" causes the quote character to be doubled. When your data frame contains embedded double quotes, specifying quote = FALSE causes any embedded quotes to be written within an unquoted character column.

As an example, let's build a data frame with a column that contains embedded quotes and then 'write' the data frame to a file specifying the qmethod= argument various ways. Note, when constructing a character vector in R, embedded double quotes are specified using the escape sequence \" (other special characters within strings have their own escape sequences). Also, specifying file = "" in the write.table() function invocation causes the data frame to be written to the screen.
> x <- data.frame(id = 1:2, comment = c("Double \"quote\" example 1",
+     "Another double \"quote\" example"))
> x
  id                        comment
1  1       Double "quote" example 1
2  2 Another double "quote" example
> write.table(x, file = "", qmethod = "escape")
"id" "comment"
"1" 1 "Double \"quote\" example 1"
"2" 2 "Another double \"quote\" example"
> write.table(x, file = "", qmethod = "double")
"id" "comment"
"1" 1 "Double ""quote"" example 1"
"2" 2 "Another double ""quote"" example"
> write.table(x, file = "", quote = FALSE)
id comment
1 1 Double "quote" example 1
2 2 Another double "quote" example

IMPORTING TEXT FILES INTO MICROSOFT EXCEL: It is quite easy to import data that is saved as a text file into Microsoft Excel. Specifically, you can import data into Microsoft Excel from most data sources by pointing to Import External Data on the Data menu, clicking Import Data, and then choosing the data you want to import in the Select Data Source dialog box - it is helpful if you choose All files (*.*) in the Files of type box before locating and double-clicking the text file you want to import in the Look in list. Microsoft Excel will then open the Text Import Wizard, which will allow you to specify how you want to divide the data into columns using various field separators. You can also reach the Text Import Wizard in a similar fashion by using the Open command in the File drop-menu.

Chapter 2
Working with R

Learning objective
To understand how to manage your workspace, customize your R sessions, use and avoid looping, evaluate expressions conditionally, and write your own functions.

OBJECT MANAGEMENT: During an R session, every object you assign a name is stored in what is known as your workspace. The ls() function can be used to display the names of all the variables currently assigned.
For example, let's list the objects defined in our current workspace:

> ls()
 [1] "authors"    "books"      "char"       "df"         "df2"        "fac"
 [7] "fruit"      "immer"      "immer.long" "ourdf"      "ourlist"    "pbc"
[13] "subxx"      "widechemo"  "wide.subxx" "wide.xx"    "x"          "xx"
[19] "y"

As you can see, your workspace can quickly become cluttered, and it can become difficult to identify specific desired objects from among this clutter. Luckily, the ls() function has a pattern= argument that allows you to return only the objects whose names match a specific pattern. For example, we could list the objects whose names contain the letter 'x' using ls(pattern = "x"). The pattern= argument can be expanded to match based on regular expressions, which we discussed in the 'Vectors' manipulation section.

When you are assigning names to objects, it is important to know whether an object name is already being used (e.g., is an existing function) before assigning it to a new object. The exists() function can be used to determine whether an object with a proposed name already exists: it searches for the name as a function or as the name of another assigned object. For example, length is the name of a function, but lngth has not been assigned yet.

> exists("length")
[1] TRUE
> exists("lngth")
[1] FALSE

The rm() function can be used to remove any unnecessary and/or unwanted objects by specifying the names of these objects. The name(s) may or may not be quoted, and multiple names can be specified if separated by commas. For example, to remove the object x, we would type rm(x) at the command line. To remove both the x and y objects, we would type rm(x, y). Multiple names can also be given as a character vector to the list= argument. Therefore, the previous example could be typed as rm(list = c("x", "y")). We can use this list= argument in conjunction with the ls() function to easily remove all of the objects in your workspace by typing rm(list = ls()).
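The pattern-based listing and removal described above can be combined. The following is a minimal sketch; the object names tmp_a, tmp_b, and keep_me are hypothetical and exist only for this illustration, and the "^tmp_" regular expression is just one possible pattern.

```r
# Hypothetical objects created only for this illustration
tmp_a <- 1:3
tmp_b <- letters
keep_me <- "important"

# List only the objects whose names start with "tmp_"
ls(pattern = "^tmp_")

# Check whether a name is already in use before assigning to it
exists("tmp_a")      # TRUE, because we just created it

# Remove every object whose name matches the pattern; other objects are untouched
rm(list = ls(pattern = "^tmp_"))
exists("tmp_a")      # FALSE, assuming no other object named tmp_a is visible
exists("keep_me")    # TRUE
```

Passing the result of ls(pattern = ...) to the list= argument is a convenient middle ground between removing objects one at a time and clearing the whole workspace.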
In the Windows version of R, the entire current workspace can also be cleared by selecting Remove all objects from the Misc drop-menu. We can also take this example one step further by incorporating the setdiff() function, which returns the difference between two sets of elements, to remove all objects except specified ones from your current workspace: rm(list = setdiff(ls(), c("object1", "object2"))), where object1 and object2 are the objects you wish to keep.

Practice Exercise: Let's remove all the defined objects in our workspace except our pbc data frame.

> rm(list = setdiff(ls(), "pbc"))
> ls()
[1] "pbc"

There are also several ways to save desired objects. Recall, when you quit R, you are asked Save your workspace image? Accepting this offer will save all the objects in your current workspace to a directory-specific hidden file named .RData (footnote 1). If you saved your workspace before quitting and start R from the same directory at a later time, R loads this saved workspace and all of the objects are restored to the current workspace. However, I would highly recommend cleaning up your workspace using the rm() function before saving your workspace. The save.image() function can also be used explicitly to save the current workspace. By default, the objects are saved to the .RData hidden file, but a different file name can be specified using the file= argument.

Unfortunately, both of these methods are very inefficient and use a lot of memory to save the workspace. An alternative is to use the save() function, which allows you to save specific objects and has an option to compress the file the objects are saved to. This is extremely useful when dealing with large objects that take significant execution time to create and/or normally take up scarce memory.
Specify compress = TRUE to store the specified file very compactly: save(list = c("object1", "object2"), file = "objects.rda", compress = TRUE), where object1 and object2 are the objects you wish to save, and objects.rda is the file you wish to compactly save them to. You then use load("objects.rda") to reload the saved objects at a later time (i.e., in another R session).

CUSTOMIZING R SESSIONS: The (possible) loading of objects from a saved image of the previous workspace from the hidden .RData file (if the workspace was saved) is just one step in the 'startup' procedure that R goes through when an R session is started. This startup procedure also sources (1) environment files, which contain lists of environment variables to be set, and (2) profile files, which contain R code used to load specific packages, to define important functions, and to define specific environment options, which affect the way in which R computes and displays its results. The nice thing is that we can modify the profile files sourced in this procedure in order to customize our R sessions. However, even though we can access the environment files, it is not recommended to modify these files.

After searching for and sourcing the environment files, R first searches for and sources a 'site-wide' startup profile file named Rprofile.site, which is located in the R_HOME/etc directory (if it exists). Recall, on a Linux, Unix, or Mac machine, R_HOME is /usr/lib/R or /usr/local/lib/R. On a Windows machine, R_HOME is C:\Program Files\R\R-version, where version is the R version number (e.g., 2.5.1). The Rprofile.site file is termed 'site-wide' because this file is sourced every time R is started, no matter what directory R is started in, so think of it as a 'global' file. In turn, the Rprofile.site file contains the expressions that you want to execute every time R is started anywhere on your machine. A second, 'personal', hidden file named .Rprofile can also be placed in any directory.
In turn, R will source a hidden .Rprofile file if R is started in a directory that contains such a file. In contrast to the global Rprofile.site file, but like the hidden .RData file, the hidden .Rprofile file is 'directory specific': it allows for different startup procedures in different working directories. You can make the hidden .Rprofile file slightly more global by saving it in your 'home' directory. If no hidden .Rprofile file is found in the current directory where R is started, then R looks for a hidden .Rprofile file in your home directory and uses that (if it exists).

Footnote 1: By default, hidden files like .RData will not appear in a folder/directory. On a Unix/Linux machine, use ls -a at the shell command prompt to list all files, including hidden files. On a Windows machine, select Folder Options... from the View drop-menu. Then select Show all files from the View tab of the Folder Options dialog box.

Remember, the .Rprofile file is hidden, so you will need to take similar measures as you did with the hidden .RData file to see it when you list a directory (see the previous footnote). Also, the Rprofile.site and hidden .Rprofile files often do not exist. Therefore, you can create these files using any text editor and then save them accordingly.

As mentioned, the Rprofile.site and hidden .Rprofile files contain R code used to load specific packages, to define important functions, and to define specific environment options, which affect the way in which R computes and displays its results. Specific add-on packages, like the Hmisc package, can be loaded whenever R is started by adding library() function expressions. You can also save R code to these files that defines functions that you have written yourself and that you wish to be able to use in your R sessions.

In addition to any self-defined functions, you can also define two special functions, .First() and .Last(), in one or both of the profile files. The .First() function is automatically executed at the beginning of an R session and may be used to initialize the environment. Similarly, the .Last() function is executed at the very end of every session.

The .First() function is where the specific environment options are defined using the options() function. The options() function contains a large number of arguments that control different aspects of the base, grDevices, stats, and utils packages. Like the par() function, modifying an argument of the options() function has a permanent effect, and we may modify specific arguments using an options(optionname = value, ...) construct. Some environment options you may consider modifying are:

• defaultPackages= specifies the packages, in addition to base, that are loaded when R is started - c("datasets", "utils", "grDevices", "graphics", "stats", "methods") by default.
• prompt= specifies the non-empty string to be used for R's command-line prompt - "> " by default; it should usually end in a blank (" ").
• continue= specifies a non-empty string to be used for R's continuation prompt - "+ " by default.
• width= controls the maximum number of columns on a line used in printing data structures to the screen (and files) - 80 by default. This is useful when you re-size the window that R is running in; the Windows version of R automatically changes the value when the window is re-sized.
• digits= controls the number of digits to print when printing numeric values - 7 by default; valid values are from 1 to 22. Unfortunately, this is a suggestion only; not all functions will follow it. Also, there really is no way to specify that all numeric values should be rounded to so many decimal places by default. The alternative is to use specific print() or format() functions.
• scipen=, specified using an integer, controls the penalty to be applied when deciding to print numeric values in scientific notation - 0 by default.
  Usually, decimals that are less than 0.001 are printed using scientific notation. Similarly, large numeric values greater than 1,000,000 are often printed with scientific notation. Specifying scipen= with positive integers will bias printing towards not using scientific notation (i.e., to scipen= decimal places). As with digits=, alternatively use specific print() or format() functions.
• stringsAsFactors= specifies the default setting for the stringsAsFactors= argument of the data.frame() and read.table() functions - recall, stringsAsFactors = TRUE by default, coercing all character columns to factors (this is the default in R versions before 4.0.0; from R 4.0.0 onward the default is FALSE).

The help file of the options() function contains the full list of options, and the current settings can be returned (printed) by invoking the options() function with no arguments. You can use a getOption("optionname") construct to return (print) the value of a specific environment option. Environment options can also be set outside of the Rprofile.site and .Rprofile files.

Putting all of this information together, the following is an example of an Rprofile.site or .Rprofile file:

# Load the Hmisc package
library(Hmisc)
# Modify the command line and continuation prompt, and the width
options(prompt="$ ", continue="+\t", width = 100)
# Source a file that contains several self-defined functions
source(file.path(Sys.getenv("HOME"), "R", "mystuff.R"))
# Modify the q() function to automatically quit _without_ saving the workspace
# (and without prompting)
q <- function(save = "no", status = 0, runLast = TRUE) {
    .Internal(quit(save, status, runLast))
}
.Last <- function() {
    # close all graphics window and file devices; a small safety measure
    graphics.off()
}

In the Windows version of R, you can also customize the way the R console 'looks and feels' using the GUI preferences selection under the Edit drop-menu.

CONDITIONAL EVALUATION: Up to this point, we have evaluated single expressions or multiple expressions grouped with the curly braces, { }.
However, we will often want to conditionally evaluate one or several expressions. For example, perhaps we wish to calculate the mean of a continuous vector if the values are normally distributed, or the median if they are not. Conditional evaluation will also come in handy when you want to generalize your R code and perhaps incorporate it into a self-defined function. In R, an if statement allows us to evaluate expressions based on a condition, and takes one of two forms. The first is the 'if-then' form:

if(condition) {
    expression(s) if condition is TRUE
}

The condition is an expression that, when evaluated, returns a single TRUE or FALSE value; an error is returned if the condition does not evaluate to a logical value. If the condition evaluates to TRUE, then any expression(s) between the braces ({ and }) is/are evaluated. The second form is an 'if-then-else' form:

if(condition) {
    expression(s) if condition is TRUE
} else {
    expression(s) if condition is FALSE
}

In this form, the condition is evaluated, and any expression(s) between the first set of curly braces is/are evaluated if the condition evaluates to TRUE. If the condition evaluates to FALSE, then the expression(s) between the second set of curly braces is/are evaluated. The 'if-then-else' form of the if statement can also be nested.

if(condition1) {
    expression(s) if condition1 is TRUE
} else if(condition2) {
    expression(s) if condition1 is FALSE but condition2 is TRUE
} else if(condition3) {
    expression(s) if both condition1 and 2 are FALSE but condition3 is TRUE
} else {
    expression(s) if condition1, 2, and 3 are FALSE
}

In any form, the if statement returns the value of the expression evaluated, or NULL if no expression was evaluated, which may happen if there is no else. When the expression(s) is/are not specified in a block involving braces ({ and }), then the else, if present, must appear on the same line as the if(condition).
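The return value of an if statement described above can be sketched as follows. This is only an illustration: the vector x and the length-based rule for choosing between the median and the mean are assumptions made for the example, not part of the pbc analysis.

```r
# Hypothetical data for illustration
x <- c(2, 5, 7, 10)

# An if statement returns the value of the branch that was evaluated,
# so its result can be assigned directly to a name
centre_value <- if (length(x) > 3) {
    median(x)
} else {
    mean(x)
}
centre_value         # 6, the median of 2, 5, 7, 10

# With no else and a FALSE condition, nothing is evaluated and NULL is returned
nothing <- if (length(x) > 10) "long vector"
is.null(nothing)     # TRUE
```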
In general though, it is a good habit to always use braces to block the expressions with the appropriate part of the statement.

As mentioned, the condition of an if statement is expected to return a single TRUE or FALSE value when evaluated. If the condition returns a logical vector of more than one element, then a warning is given. For example,

> x <- c(2, 5, 7, NA, 10, NA, -1)
> if (x < 0) {
+     print("< 0")
+ } else {
+     print("> 0")
+ }
[1] "> 0"
Warning message:
the condition has length > 1 and only the first element will be used in: if (x < 0) {

In this example, the condition x < 0 actually returns a logical vector the same length as x:

> x < 0
[1] FALSE FALSE FALSE    NA FALSE    NA  TRUE

When the condition evaluates to a logical vector of more than one element, as the warning suggests, only the first element of the evaluated condition (either TRUE or FALSE) is used - in this example, FALSE. Note, because it's a nonstandard topic name, the if statement must be quoted to access its help file - help("if").

An alternative to the if statement is the ifelse() function, which is the vectorized version of the if statement. The ifelse() function has the form ifelse(condition, expression1, expression2) and returns a vector of the same length as condition, with elements expression1[i] if condition[i] is TRUE, or expression2[i] otherwise. Here's an example:

> with(pbc, table(ifelse(ageyrs < 40, "< 40", ">= 40")))

 < 40 >= 40
   20    80

The ifelse() function can also be nested:

> with(pbc, table(ifelse(ageyrs < 40, "< 40", ifelse(ageyrs >= 40 & ageyrs <=
+     60, "40-60", "> 60"))))

 < 40 40-60  > 60
   20    59    21

The ifelse() function is very useful when you want to create new columns in a data frame that are derived from existing ones. Remember the 'rules' of how TRUE and FALSE values evaluate when combined using logic operators - see the 'Conditional expression' portion of the 'Vectors' manipulation section. Also remember the use of the as.character() function when conditionally evaluating factors with the ifelse() function - see the 'Coercing the Mode of a Vector' portion of the 'Vectors' manipulation section.

An alternative to the if statement and ifelse() function is the switch() function, whose syntax is

> args(switch)
function (EXPR, ...)
NULL

The EXPR= argument is an expression evaluating to a number or a character string and ... is a 'list of alternatives' (not a formal list), which defines the output of the switch() function. The specific 'alternative' (i.e., element of the 'list') is chosen based on the resulting value of EXPR=. Specifically, if EXPR= returns a number between 1 and the length of the 'list' specified as ... (i.e., the number of elements), then the corresponding element of the list is evaluated and the result is returned. If EXPR= returns a number that is less than 1 or is greater than the number of elements of the list, then NULL is returned. For example, let's conditionally evaluate a numeric vector depending on its length.

> x <- 10
> switch(length(x), x, median(x), mean(x))
[1] 10
> x <- c(5, 2)
> switch(length(x), x, median(x), mean(x))
[1] 3.5
> x <- c(3, 4, 7)
> switch(length(x), x, median(x), mean(x))
[1] 4.666667
> x <- sample(1:10, size = 5)
> switch(length(x), x, median(x), mean(x))
NULL

When the EXPR= argument evaluates to a character string, we need to name the elements of our 'list of alternatives', .... In turn, the element of ... with a name that exactly matches the resulting value of EXPR= is returned. If there is no match, NULL is returned. For example,

> (x <- sample(c("horse", "cat", "dog"), size = 1))
[1] "horse"
> switch(x, horse = "Penny", cat = "Stripes", dog = "Spot")
[1] "Penny"

A common use of the switch() function is to branch according to the character value of one of the arguments to a function.
For example,

> centre <- function(x, type) {
+     switch(type, mean = mean(x), median = median(x), trimmed = mean(x,
+         trim = 0.1))
+ }
> x <- rcauchy(10)
> centre(x, "mean")
[1] -4.81051
> centre(x, "median")
[1] 0.6287843
> centre(x, "trimmed")
[1] 0.4996638

REPETITIVE EVALUATION: In addition to conditionally evaluating expressions, we often want to repeatedly evaluate an expression or block of expressions. This is often called 'looping'. There are three statements that will perform looping: (1) the for statement; (2) the while statement; and (3) the repeat statement. However, the while and repeat statements are rarely used in R. The standard for loop has the basic structure of

for(increment in sequence) {
    expression(s)
}

The increment is a variable name - often the name of an indexer, such as i. The sequence can be any vector or list, and the expression(s) is/are evaluated for each increment (i.e., value) of sequence. When sequence is a vector, increment loops over each element of the vector (i.e., sequence[increment]). When sequence is a list, increment refers to each successive component in the list (i.e., sequence[[increment]]). A simple example is a countdown program:

> for (i in 5:1) {
+     print(i)
+ }
[1] 5
[1] 4
[1] 3
[1] 2
[1] 1
> i
[1] 1

In this example, increment is the variable i, and sequence is the numeric vector of the numbers 5 through 1. As seen, a for loop returns the value of the last expression evaluated (or NULL if none was), and sets increment to the last used element (component) of sequence (or to NULL if it was of length zero). Therefore, at the end of the for loop above, i now has the value 1. for loops can also be nested - we merely have to use a different increment for each loop.

for loops are often used to evaluate an expression or block of expressions for specific columns or specific subsets of the rows of a data frame.
For example,

> for (i in c("ageyrs", "fuyrs", "bili", "chol", "album")) {
+     cat("Mean of", i, "=", round(mean(pbc[[i]], na.rm = TRUE), 2), "\n")
+ }
Mean of ageyrs = 49.92
Mean of fuyrs = 5.12
Mean of bili = 3.11
Mean of chol = 381.62
Mean of album = 3.52
> for (i in c("ageyrs", "fuyrs", "bili", "chol", "album")) {
+     for (j in levels(pbc$sex)) {
+         cat("Mean of", i, "in", j, "patients =", round(mean(pbc[pbc$sex ==
+             j & !is.na(pbc$sex), i], na.rm = TRUE), 2), "\n")
+     }
+ }
Mean of ageyrs in Female patients = 49.07
Mean of ageyrs in Male patients = 55.16
Mean of fuyrs in Female patients = 5.13
Mean of fuyrs in Male patients = 5.04
Mean of bili in Female patients = 3
Mean of bili in Male patients = 3.79
Mean of chol in Female patients = 374.74
Mean of chol in Male patients = 417.27
Mean of album in Female patients = 3.5
Mean of album in Male patients = 3.63

In these examples we incremented across the elements of a character vector and took advantage of the double square brackets, [[ ]], and [ , ] subsetting constructs to extract the rows and columns of our pbc data frame. As you can imagine, the dfname$colname construct does not work well with for loops.

IMPORTANT: Using a for loop is not always necessary in R. As we have already seen, many functions and operators are vectorized. In the 'Family of apply() functions' section, we will also encounter additional functions that perform implicit looping. In general, avoiding loops can make your R code more compact, easier to read, and often more efficient in execution.

Let's get back to repetitive evaluation... Obviously, the for loop is great to use when you know ahead of time what sequence of values you want to loop over. However, sometimes you don't know this. In this case, you may want to repeat something as long as a condition is met (e.g., as long as a number is positive, or a number is larger than some tolerance).
For this, use a while loop:

while(condition) {
    expression(s)
}

Like the if statement, the condition is an expression that, when evaluated, returns a single TRUE or FALSE value. If the condition evaluates to TRUE, then any expression(s) between the curly braces ({ and }) is/are evaluated. This process continues until the condition evaluates to FALSE. Like the for loop, the while loop returns the value of the last evaluation of the expression(s). For example,

> i <- 1
> while (i <= 5) {
+     cat("Iteration", i, "\n")
+     i <- i + 1
+ }
Iteration 1
Iteration 2
Iteration 3
Iteration 4
Iteration 5
> i
[1] 6

If the expression(s) is/are never evaluated, then the while loop returns NULL. And like the if statement, a warning is returned if the condition returns a logical vector of more than one element (and only the first logical element of the evaluated condition is used).

Lastly, a repeat loop causes repeated evaluation of expressions until a break is specifically requested. This means that you need to be careful when using repeat because of the danger of an infinite loop. The syntax of the repeat loop is

repeat {
    expression(s)
}

Unlike if, for, and while, when using repeat, the expression(s) must be a block of code denoted with braces ({ and }). This is because you need to both perform some computation and test whether or not to break from the loop, which usually requires at least two expressions.

The additional flow control statements next and break can be used to skip to the next value in a loop or to break out of a loop, respectively. More specifically, break breaks out of a for, while, or repeat loop, and control is transferred to the first statement outside the innermost loop. next halts the processing of the current iteration and advances the looping index. Both break and next apply only to the innermost of nested loops.
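The tolerance case mentioned earlier, where the number of iterations is not known in advance, can be sketched with both a while loop and an equivalent repeat loop. The names tol, value, n_iter, and n_rep are hypothetical, and halving a value is only a stand-in for a real computation.

```r
tol <- 0.01

# while: test the condition before each iteration
value <- 1
n_iter <- 0
while (value >= tol) {
    value <- value / 2
    n_iter <- n_iter + 1
}
n_iter    # 7, since 1/2^7 = 0.0078125 is the first value below 0.01

# repeat: always runs the body at least once, then tests whether to break
value <- 1
n_rep <- 0
repeat {
    value <- value / 2
    n_rep <- n_rep + 1
    if (value < tol) break
}
n_rep     # 7 as well
```

The while form is usually the more natural choice when the body might not need to run at all; the repeat form fits when the stopping test can only be made after some work has been done.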
The following example is a contrast and comparison of the repeat and for loops:

> i <- 0
> repeat {
+     if (i > 10)
+         break
+     if (i > 2 && i < 5) {
+         i <- i + 1
+         next
+     }
+     print(i)
+     i <- i + 1
+ }
[1] 0
[1] 1
[1] 2
[1] 5
[1] 6
[1] 7
[1] 8
[1] 9
[1] 10
> for (i in 0:10) {
+     if (i > 2 && i < 5)
+         next
+     print(i)
+ }
[1] 0
[1] 1
[1] 2
[1] 5
[1] 6
[1] 7
[1] 8
[1] 9
[1] 10

Like the if statement, all three looping statements must be quoted to access their help files - e.g., help("for").

'GROWING' DATA STRUCTURES IN A LOOP: We often generate data structures, such as vectors or data frames, in for loops. Unfortunately, if you're not careful, generating a data structure can be very memory intensive. Specifically, in each iteration of a for loop we often concatenate the next element of the vector onto the existing vector to generate the final vector. However, every time you do this, R implicitly copies the existing vector and then adds the additional element. Therefore, just to add a single element, you are using memory for roughly 2n + 1 elements (the existing n elements plus a new copy of n + 1 elements), where n is the length of the vector during a specific iteration of the for loop. Depending on the length n, this can be a huge amount of your memory. For example, we often do the following:

a <- NULL
for (i in 1:100) {
    a <- c(a, rnorm(1))
}

A much more efficient way of generating a vector in a for loop is to define an 'empty' vector of the final length, if you know what this final length will be. We can then overwrite the value of each element with the correct value using the single square brackets, [ ], subsetting construct. For example, suppose we want to generate a numeric vector of 100 elements. So, instead of doing the above, we can do this:

a <- numeric(100)
for(i in 1:length(a)) {
    a[i] <- rnorm(1)
}

In contrast to 'growing' a vector, overwriting each element in a vector requires just the copying of the replacement elements.
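As a quick check that the two approaches above differ only in how memory is handled, not in what they compute, the following sketch runs both with the same random seed. The names n, grown, and filled are hypothetical, and seq_len(n) is used in place of 1:n only because it also behaves sensibly when n is 0.

```r
n <- 1000

# 'Growing' the vector: each iteration copies the existing vector
set.seed(1)
grown <- NULL
for (i in seq_len(n)) {
    grown <- c(grown, rnorm(1))
}

# Preallocating the vector: each iteration overwrites one element in place
set.seed(1)
filled <- numeric(n)
for (i in seq_len(n)) {
    filled[i] <- rnorm(1)
}

# Same seed, same sequence of rnorm() calls, so the results match
identical(grown, filled)    # TRUE
```

Wrapping each loop in system.time() is one way to compare their run times on your own machine; the difference tends to become more noticeable as n grows.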
The numeric() function generates a numeric vector of specified length, where each element has a default value of 0. We could have also used the character() or logical() functions to generate a character or logical vector, respectively, of specified length. Each element of the character vector created with the character() function has a default value of "", while each element of the logical vector created with the logical() function has a default value of FALSE. For example,

> numeric(length = 10)
 [1] 0 0 0 0 0 0 0 0 0 0
> character(length = 10)
 [1] "" "" "" "" "" "" "" "" "" ""
> logical(length = 10)
 [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

There are similar functions for complex, double, integer, raw, and real vectors. In addition, there is the vector() function, which produces a vector of the specified length and mode.

It is not necessary to know the final length of the vector you want to generate in the for loop. When assigning an element of a vector, if the number indexing the element to be assigned is greater than the length of the vector, the vector will be extended to that number of elements. In general, NAs are used to fill any gaps. For example,

> (x <- sample(1:10))
 [1]  8  6 10  7  5  1  9  2  3  4
> x[15] <- 20
> x
 [1]  8  6 10  7  5  1  9  2  3  4 NA NA NA NA 20

Therefore, even if we don't know the final length of the vector we want to generate in a for loop, we can still define a vector of some length and then extend it. Similarly, we can always define a vector whose length may be too long and then shorten the vector once the for loop is complete.

We can use a similar process to efficiently generate a matrix (array) using the [ , ] ([ , , ...]) subsetting construct.
For example,

> (mat <- matrix(NA, nrow = 2, ncol = 4))
     [,1] [,2] [,3] [,4]
[1,]   NA   NA   NA   NA
[2,]   NA   NA   NA   NA
> for (i in 1:nrow(mat)) {
+     for (j in 1:ncol(mat)) {
+         mat[i, j] <- rnorm(1)
+     }
+ }
> mat
           [,1]       [,2]       [,3]        [,4]
[1,] 0.08808937  1.4210358  1.3017982 -2.03791607
[2,] 1.64086197 -0.1099261 -0.8379594  0.07467112

Notice how R employed the recycling rule to generate our initial matrix of NAs. Unfortunately, we need to know the final dimensions of our desired matrix; we cannot expand the dimensions of a matrix using the [ , ] subsetting construct.

Likewise, we need to know the final dimensions of a desired data frame. With a data frame, we can use the numeric(), character(), etc. functions to define each column of the initial data frame. For example,

c <- data.frame(a = numeric(100), b = character(100))

Or, if you desire a data frame of a single mode (e.g., completely numeric), you can initialize the data frame like this:

b <- numeric(100)
attr(b, "dim") <- c(10, 10)
c <- as.data.frame(b)

The attr() function is a way to define the attributes, like the dimensions, of a data structure. With either of these methods, you can then replace the various elements of the data frame using the [[ ]] or [ , ] subsetting constructs.

THE FAMILY OF apply() FUNCTIONS: As mentioned above, using a for loop is not always necessary in R. As we have already seen, many functions and operators are vectorized. In addition, there is a group of functions that allows us to perform implicit looping. This group of functions includes the apply() function and its 'relatives' lapply(), sapply(), and tapply().

A common looping application is to apply a function to each element of a set of values or vectors and collect the results in a single structure. In R this is possible using the lapply() and sapply() functions.
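The idea of replacing an explicit loop with implicit looping can be sketched on a small made-up list. The list vals and the result names means_loop and means_apply are hypothetical and used only for this comparison.

```r
# Hypothetical list of numeric vectors
vals <- list(a = 1:5, b = c(2, 4, 6), c = 10)

# Explicit loop: preallocate, fill each element, then attach the names by hand
means_loop <- numeric(length(vals))
for (i in seq_along(vals)) {
    means_loop[i] <- mean(vals[[i]])
}
names(means_loop) <- names(vals)

# Implicit loop: one call that also carries the names over
means_apply <- sapply(vals, mean)

means_apply                           # named vector with a = 3, b = 4, c = 10
all.equal(means_loop, means_apply)    # TRUE
```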
The lapply() function always returns a list (hence the 'l') whereas the sapply() function tries to simplify (hence the 's') the result to a vector or a matrix if possible. Both functions operate on the components of a list or the elements of a vector. In addition, both functions accept two main arguments: X= and FUN=, where X= specifies the data and FUN= specifies the function to be applied to each 'element' of X=. Additional arguments to the function specified using the FUN= argument can also be passed to the lapply()/sapply() functions via a ... argument. So, to compute the range of the continuous variables of pbc, we can do the following:

> lapply(X = subset(pbc, select = c("fuyrs", "ageyrs", "bili", "chol",
+     "album")), FUN = range, na.rm = TRUE)
$fuyrs
[1]  0.5913758 12.1916496

$ageyrs
[1] 30.57358 78.43943

$bili
[1]  0.4 25.5

$chol
[1]  120 1128

$album
[1] 1.96 4.24

> sapply(X = subset(pbc, select = c("fuyrs", "ageyrs", "bili", "chol",
+     "album")), FUN = range, na.rm = TRUE)
          fuyrs   ageyrs bili chol album
[1,]  0.5913758 30.57358  0.4  120  1.96
[2,] 12.1916496 78.43943 25.5 1128  4.24

Notice how both functions attach meaningful names to the result, which is another good reason to prefer using these functions over using explicit loops. Also, notice how we passed na.rm = TRUE to the range() function.

The lapply() and sapply() functions can be used in conjunction with the split() function to generate results 'indexed' by a second vector. The split() function divides the values of a vector into the groups defined by a 'factor'.
For example, we can calculate the mean ageyrs for each value of drug:

> with(pbc, split(ageyrs, drug))
$`D-penicillamine`
Age [years]
 [1] 42.50787 61.72758 33.63450 53.50856 32.61328 46.51608 67.31006 33.47570 76.70910
[10] 68.50924 41.15264 61.07050 48.85421 54.25599 35.15127 40.55305 42.96783 75.01164
[19] 69.94114 49.60438 41.60986 70.00411 62.62286 40.71732 61.29500 42.68583 59.76181
[28] 46.78987 44.82957 78.43943 31.44422 58.26420 51.48802 38.39836 50.10815 46.34908
[37] 67.57290 35.35113 35.53457 37.99863

$Placebo
Age [years]
 [1] 66.25873 59.95346 52.02464 41.38535 33.69473 49.13621 32.49281 55.83025 47.21697
[10] 34.03970 44.06297 42.63929 62.64476 34.98700 59.40862 42.74333 43.41410 35.49076
[19] 52.69268 56.77207 41.09240 49.76318 58.33539 41.37440 59.96988 74.52430 46.38193
[28] 45.08419 50.47228 56.15332 50.24504 31.38125 56.39151 30.57358 33.15264

> with(pbc, lapply(X = split(ageyrs, drug), FUN = mean, na.rm = TRUE))
$`D-penicillamine`
[1] 50.90849

$Placebo
[1] 47.76525

> with(pbc, sapply(X = split(ageyrs, drug), FUN = mean, na.rm = TRUE))
D-penicillamine         Placebo
       50.90849        47.76525

Another example would be to split a data frame with multiple records per subject on the subject IDs in order to calculate specific information on each subject (e.g., the mean hemoglobin value for each subject).

A similar function to the lapply() and sapply() functions is the apply() function, which allows you to apply a function to the rows or columns of a matrix (array) or data frame. Like the lapply()/sapply() functions, the apply() function expects an X= and a FUN= argument, and it accepts additional arguments to the function specified using the FUN= argument via a ... argument. The apply() function also has a MARGIN= argument that is used to specify the dimension of X= the function FUN= will be applied over - MARGIN = 1 indicates the rows, MARGIN = 2 indicates the columns, and MARGIN = c(1, 2) indicates the rows and columns.
For example, we could generate a table of each of the factor variables in our pbc data set:

> apply(X = subset(pbc, select = c("drug", "sex", "ascites", "stage",
+     "censored")), MARGIN = 2, FUN = table)
$drug

D-penicillamine         Placebo
             40              35

$sex

Female   Male
    86     14

$ascites

 No Yes
 70   5

$stage

 1  2  3  4
 6 24 39 29

$censored

Censored     Dead
      68       32

With the lapply(), sapply(), and apply() functions, the function specified via the FUN= argument can return more than one value. In addition, the FUN= argument can be a self-defined function. For example, putting these two concepts together:

> with(pbc, sapply(X = split(ageyrs, drug), FUN = function(x) {
+     c(Mean = mean(x), SD = sd(x), Median = median(x), Min = min(x),
+         Max = max(x))
+ }))
       D-penicillamine  Placebo
Mean          50.90849 47.76525
SD            13.64798 10.99152
Median        47.82204 47.21697
Min           31.44422 30.57358
Max           78.43943 74.52430

In this case, the sapply() function returns a matrix. The apply() function would similarly return a matrix. Use the lapply() function if FUN= returns a list, and make sure X= is specified as a list (e.g., the result of the subset() function).

The last 'apply family' function to discuss is the tapply() function. The tapply() function is an alternative to using the split() function in conjunction with the lapply() or sapply() function. Like all the previous 'apply family' functions, the tapply() function accepts an X= and a FUN= argument, and additional arguments to FUN= via a ... argument. For the tapply() function, however, X= must specify a vector, and the FUN= argument must specify a function that returns only a single value. The tapply() function also accepts an INDEX= argument that allows you to specify one or more grouping 'factors'. In turn, the tapply() function creates tables (hence the 't') of the output of FUN= on subgroups of X= defined by INDEX=, which is specified using a list. If more than one 'factor' is specified, a cross-classified table is generated.
In addition, if the 'factors' specified by INDEX= are not already defined as factors, they will be converted to factors internally. For example,

> with(pbc, tapply(X = ageyrs, INDEX = sex, FUN = median, na.rm = TRUE))
  Female     Male
47.10883 52.12320
> with(pbc, tapply(X = ageyrs, INDEX = list(sex, drug), FUN = median,
+     na.rm = TRUE))
       D-penicillamine  Placebo
Female        47.82204 44.57358
Male          51.46201 50.24504

An alternative to the tapply() function is the aggregate() function, which also splits the data (x=, one or more vectors) into specified subsets (using the by= argument, a list of grouping 'factors') and computes a value for each subset based on the value of the FUN= argument. Unfortunately, like the tapply() function, the FUN= argument is restricted to functions that return a single value. The advantage of the aggregate() function is that the result is reformatted into a data frame. This is often very useful, especially if you are grouping the output by subject ID - the output can be easily merged with the original data frame. For example, you may have two data frames - one representing the time-invariant information regarding each patient (e.g., age, race, gender, etc.), the other representing time-varying information (e.g., lab values for each visit). Using the aggregate() function, you can easily calculate the number of records in the time-varying data frame for each patient and then merge this additional column of information (by patient ID) with the time-invariant data frame. The following is a dummy example:

> x <- data.frame(id = 1:10, age = sample(20:40, size = 10, replace = TRUE),
+     race = factor(sample(c("W", "B", "O"), size = 10, replace = TRUE)),
+     gender = factor(sample(c("M", "F"), size = 10, replace = TRUE)))
> y <- data.frame(id = sample(1:10, size = 100, replace = TRUE), lab = sample(100:500,
+     size = 100, replace = TRUE))
> (aggout <- with(y, aggregate(x = id, by = list(id), FUN = length)))
   Group.1  x
1        1 14
2        2  7
3        3  4
4        4 15
5        5  8
6        6  7
7        7 10
8        8 13
9        9 12
10      10 10
> x <- merge(x, aggout, by.x = "id", by.y = "Group.1", all = TRUE)
> x
   id age race gender  x
1   1  31    B      M 14
2   2  27    O      M  7
3   3  30    W      M  4
4   4  36    W      M 15
5   5  29    O      M  8
6   6  26    B      M  7
7   7  39    O      F 10
8   8  20    W      M 13
9   9  37    B      F 12
10 10  29    B      F 10

'APPLY' FUNCTIONS VERSUS LOOPING: In general, avoiding loops can make your R code more compact, easier to read, and often more efficient in execution. We can test the performance of these functions using the system.time() function, which returns the CPU (and other) times that an expression used when it was being evaluated. For an example, let's compare two ways of subtracting the column mean from each element in a 25000 x 4 matrix - the first way uses three for loops, while the second way uses the apply() function. This example comes from Kuhnert & Venables' contributed document.

> mat <- matrix(rnorm(1e+05), ncol = 4)
> usingLoops <- function(mat) {
+     col.scale <- matrix(NA, nrow(mat), ncol(mat))
+     m <- NULL
+     for (j in 1:ncol(mat)) {
+         m[j] <- mean(mat[, j])
+     }
+     for (i in 1:nrow(mat)) {
+         for (j in 1:ncol(mat)) {
+             col.scale[i, j] <- mat[i, j] - m[j]
+         }
+     }
+     col.scale
+ }
> usingApply <- function(mat) {
+     apply(mat, 2, scale, scale = FALSE)
+ }
> system.time(usingLoops(mat))
   user  system elapsed
  0.804   0.000   0.803
> system.time(usingApply(mat))
   user  system elapsed
  0.012   0.004   0.016

Which approach would you use? The general rule of thumb is to avoid loops whenever it is possible and reasonable.
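Before comparing timings, it is worth confirming that the alternatives really compute the same thing. The following is a minimal sketch under stated assumptions: the function names centerLoop, centerApply, and centerSweep are hypothetical, a smaller matrix is used, and the sweep()/colMeans() variant is an additional vectorized option that is not part of the comparison above. Timings are machine dependent, so none are shown.

```r
set.seed(1)                        # only to make the sketch reproducible
mat <- matrix(rnorm(2000), ncol = 4)

# Loop version: subtract each column's mean, one column at a time
centerLoop <- function(mat) {
  out <- matrix(NA_real_, nrow(mat), ncol(mat))
  for (j in seq_len(ncol(mat))) {
    out[, j] <- mat[, j] - mean(mat[, j])
  }
  out
}

# apply() version: the anonymous function is evaluated once per column
centerApply <- function(mat) {
  apply(mat, 2, function(col) col - mean(col))
}

# Vectorized version: sweep() subtracts colMeans(mat) from each column
centerSweep <- function(mat) {
  sweep(mat, 2, colMeans(mat))
}

resLoop  <- centerLoop(mat)
resApply <- centerApply(mat)
resSweep <- centerSweep(mat)

# All three approaches should agree up to floating-point tolerance
print(all.equal(resLoop, resApply))
print(all.equal(resLoop, resSweep))

# Each expression can then be timed with system.time(), e.g.:
# system.time(centerLoop(mat))
```

Checking agreement first follows the advice given later in this chapter: make the code right before worrying about making it faster.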
WRITING YOUR OWN FUNCTIONS: Even though R is a very powerful data analysis and manipulation tool, there will come a point at which you find yourself wanting to write your own functions. For example, you may want to change a built-in function to more appropriately fit your needs. Or perhaps you would like to automate some of your code in order to be able to perform a specific task repeatedly and under various circumstances. The ability to write your own functions is one of the real advantages of R.

REVISITING FUNCTIONS: Recall, functions in R can do three things: (1) have values passed to them; (2) return a value; and/or (3) generate side effects, which are anything that is not the returning of a value (e.g., printing and plotting). Also, every function in R, whether intrinsic to the language or user-written, is defined using the same basic statement: FUNname <- function( arglist ) { body }, where FUNname is the name of the function; arglist is a comma separated list of zero or more arguments that can be passed to the function; and body contains the statements that perform the actions of the function. Except when the body of the function consists of a single expression, the body should always be enclosed between the curly braces ({ }). We can also define new binary operators, like %in%, in a similar fashion. To distinguish a binary operator from a function, the assigned name must be of the form %anything% and the name must be wrapped with quotation marks in the assignment - for example, "%!%" <- function(x, y) {...}.

Also keep in mind that an easy way to get started writing your own functions is to (copy/paste and) modify an existing function and assign it a new function name. In order to do this, we need to remember how to access the body of existing functions. As we know, typing the name of a function without parentheses at the command line and pressing return will return (print) the body of the function.
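As a minimal sketch of the basic statement and of a user-defined binary operator, consider the following. The names addfun and %+% are hypothetical and chosen only for illustration.

```r
# A hypothetical function defined with the basic statement
# FUNname <- function( arglist ) { body }
addfun <- function(x, y = 1) {
  x + y
}

# A hypothetical binary operator: the name has the form %anything%
# and is wrapped in quotation marks in the assignment
"%+%" <- function(x, y) {
  paste(x, y)
}

print(addfun(2))          # uses the default y = 1
print(addfun(2, y = 5))
print("a" %+% "b")        # the operator is used between its two arguments

# Typing the name without parentheses prints the function definition
addfun
```

The same approach works for inspecting an existing function before modifying it: print it, copy the definition, and assign the modified version to a new name.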
Also recall generic functions, class specific methods, and 'non-visible' functions - see the 'A Thought to End With: Object-Oriented Programming' section of the first document. Use the getAnywhere() function to print the body of a 'non-visible' function.

REVISITING FUNCTION ARGUMENTS: Recall, a function's arguments are treated like variables inside the function's body. Also recall, the arguments of a function are most often defined using an ARGname or an ARGname = VALUE construct. An ARGname argument is often the first argument in a function's argument list and often represents the main data object being passed to the function. You can also think of ARGname arguments as the ones that must always be specified in order for the function to evaluate. For example, in the Hmisc package's smean.sd() function, which returns (prints) the mean and standard deviation of a numeric vector, you must always specify the x= argument:

> smean.sd
function (x, na.rm = TRUE)
{
    if (na.rm)
        x <- x[!is.na(x)]
    n <- length(x)
    if (n == 0)
        return(c(Mean = NA, SD = NA))
    xbar <- sum(x)/n
    sd <- sqrt(sum((x - xbar)^2)/(n - 1))
    c(Mean = xbar, SD = sd)
}
<environment: namespace:Hmisc>

In contrast, ARGname = VALUE arguments are used to set a default value to an argument. In this case, the function will still evaluate if the ARGname = VALUE argument is not specified. However, we can modify the evaluation of the function if we modify VALUE to some other value. For example, in the Hmisc package's smean.sd() function, the na.rm= argument is set to a default value of TRUE, but we can easily modify this to na.rm = FALSE if we know the numeric vector we are evaluating contains no missing values.

As we have seen, the argument list can also have a special type of argument: ... ('dot dot dot'). This argument can hold a variable number of arguments and is mostly used for passing argument values to other functions that are invoked within the body of a function.
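A minimal sketch of how the ... argument forwards values is given below. The wrapper name rangefun and the vector vals are hypothetical; the only assumption about base R is that range() accepts an na.rm= argument, as used earlier in this chapter.

```r
# Hypothetical wrapper: anything captured by ... is forwarded to range()
rangefun <- function(x, ...) {
  range(x, ...)
}

vals <- c(3, NA, 10, 1)

# No extra arguments: range() uses its default na.rm = FALSE,
# so the missing value propagates to the result
print(rangefun(vals))

# na.rm = TRUE is 'absorbed' by ... and passed on to range()
print(rangefun(vals, na.rm = TRUE))
```

Note that rangefun() never names na.rm= in its own argument list; the argument simply travels through the ... argument to the function that understands it.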
For example, the argument lists of most high- and low-level plotting functions, like the plot() function, contain a ... argument that is used to pass par() function arguments, like pch= or lty=, to the par() function that is called within the body of the plotting function. You can see where the arguments being 'absorbed' by the ... argument are being passed internally by looking at the body of a function. For example, the default method of the generic lines() function:

> lines.default
function (x, y = NULL, type = "l", ...)
plot.xy(xy.coords(x, y), type = type, ...)

Without a ... argument, the argument list of our function would have to include each argument of each function called within the body of our function. Therefore, the ... argument allows us to save a lot of coding and headache.

SAVING AND SOURCING FUNCTIONS: Obviously, to execute a function you wrote yourself, the function must be a defined object in your workspace (i.e., the name of the function is one of the names returned by the ls() function). To define the function in your workspace, you can always type the function assignment at the command line, or you can copy and paste the function assignment from a text editor. If the function has been saved in a text file, it can be loaded with the source() function like other R code. Another possibility is to include the function assignment in your Rprofile.site or .Rprofile file. If you have several self-defined functions, you will want to consider whether you should save these in the same text file or save them to different text files.

LOCAL AND GLOBAL VARIABLES IN A FUNCTION: When writing a function, it is not necessary to declare the variables used within a function. When a function is executed, R uses a rule called lexical scoping to decide whether an object is local to the function's body or global (defined in your workspace).
To understand this mechanism, let's consider a very simple function:

> printfun <- function() {
+     print(x)
+ }
> x <- 1
> printfun()
[1] 1

In this example, the name x is not used to create an object within the printfun() function. Therefore, R looks for an object called x in the (global) workspace. In fact, if x is not defined in the workspace, executing printfun() will result in an error - Error in print(x) : object "x" not found. In contrast, if x is assigned as the name of an object within our function, the value of x in the (global) workspace is not used. For example,

> x <- 1
> printfun <- function() {
+     x <- 2
+     print(x)
+ }
> printfun()
[1] 2
> x
[1] 1

In this example, the printfun() function used the object x that is defined within its environment, that is, within its body. It is possible to create multiple nested environments (i.e., functions within functions) - the various environments enclose one another. In this case, the search for objects is made progressively from a given environment to the enclosing one, and so on, up to the global one.

It is also important to note that any 'ordinary' assignments (using <-) done within a function are local and temporary and are lost after exit from the function. For example, after executing the following function, an error would be returned if we tried to return (print) the value of y.

> x <- 1
> printfun <- function() {
+     y <- 2
+     print(x + y)
+ }

RETURNING VALUES FROM A FUNCTION: As we know, functions are designed to return values. The value to be returned by a function may be given in an explicit call to the return() function or (more typically) is the value of the last expression in the body of the function. For example, the two following functions are equivalent:

oneFUN <- function(x, y = 5) {
    x + y
}
secondFUN <- function(x, y = 5) {
    return(x + y)
}

A call to the return() function exits the function immediately, so the object enclosed in return() overrides the value of the last expression in the body of the function.
For example,

> oneFUN <- function(x, y = 5) {
+     return(20)
+     x + y
+ }
> oneFUN(x = 5)
[1] 20

USEFUL FUNCTIONS FOR WRITING FUNCTIONS: There are many functions that are designed to be used only or primarily in the body of other functions. The as._() and is._() families of coercion and testing functions, respectively, are very useful. As their names imply, they can be used to change the data type/structure of an object and to tell what kind of object has been passed, respectively.

The missing() function can be used to determine if the value of an argument of the form ARGname has been specified in the invocation of the function. The missing() function will return a single logical value of TRUE if the value of the argument was not specified. For example,

> examplefun <- function(x) {
+     if (missing(x)) {
+         return("x is missing")
+     }
+     else {
+         return("x is not missing")
+     }
+ }
> examplefun()
[1] "x is missing"
> examplefun(10)
[1] "x is not missing"

The on.exit() function records the expression given as its argument as needing to be executed when the current function exits (either naturally or as the result of an error). This is useful for resetting graphical parameters or performing other cleanup actions. For example,

parfun <- function(x, y) {
    oldpar <- par(no.readonly = TRUE)
    on.exit(par(oldpar))
    par(mar = c(2, 2, 2, 2) + 0.1, las = 1)
    plot(x, y)
}

Here the expression par(oldpar) is registered before the graphical parameters are changed, so the original settings are restored even if the plot() call fails.

The stop() and warning() functions signal errors and warnings, respectively. There are also several global options you can set using the options() function that pertain to errors and warnings: 'warn', 'warning.expression', and 'error'.

DEBUGGING FUNCTIONS: It is rare to write a function that works correctly the first time that it is tried. Often, all that is required to pinpoint the error is to add expressions to the body of the function that call the print() or cat() functions to print out partial results.
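A minimal sketch of this print-style debugging is shown below. The function scalefun is hypothetical; the cat() calls simply report the intermediate quantities so that a wrong partial result can be spotted before the final value is returned.

```r
# Hypothetical function with cat() calls that print partial results
scalefun <- function(x) {
  xbar <- mean(x, na.rm = TRUE)
  cat("mean of x:", xbar, "\n")   # partial result 1
  s <- sd(x, na.rm = TRUE)
  cat("sd of x:", s, "\n")        # partial result 2
  (x - xbar) / s                  # value returned by the function
}

z <- scalefun(c(2, 4, 6))
print(z)
```

Once the function behaves as expected, the cat() calls can be deleted or commented out.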
When a function dies from an error, you can use the traceback() function to return (print) the sequence of calls that led to the error, which is useful when an error occurs with an unidentifiable error message.

Another option is the debug() function. To debug a problem function fun(), we execute the expression debug(fun). We then execute the invocation of the function fun() that keeps causing an error - for example, fun(x = 5, y = 2). Executing the problem expression invokes a debugger, which suspends normal execution of expressions and allows you to execute the body of the problem function one expression at a time. Before each expression in the body of the problem function is executed, the expression is printed and a special prompt is invoked. You can enter R expressions or special commands at the debugger prompt. The commands are n (or just return), which advances to the next expression in the body of the problem function; c, which continues to the end of the current context (e.g., to the end of the loop if within a loop or to the end of the function); where, which prints a stack trace of all the active function calls; and Q, which exits the debugger and returns to the command-line prompt. Anything else entered at the debug prompt is interpreted as an R expression to be evaluated in the calling environment. In particular, typing an object name will cause the object to be printed, and ls() lists the objects in the calling frame. So, for example, if an object x is defined within the body of the problem function, typing x at the debug prompt will return (print) the value of x. If there is a local variable with the same name as one of the special commands listed above, then its value can be accessed by using the get() function - e.g., get("n"). The debugger provides access only to interpreted expressions. If a function calls a foreign language (such as C), then no access to the statements in that language is provided.
Debugging is turned off by a call to the undebug() function with the function as an argument - e.g., undebug(fun).

SOME ADVICE: Like everything else in R, writing your own functions is a learning process, especially when it comes to writing functions that run faster and consume less memory. In addition to all the advice I have tried to insert in this and the first document, the following is, as John Fox put it, some additional 'general, miscellaneous, and mostly unoriginal' advice about writing your own functions.

• Work from the bottom up. You will occasionally encounter a moderately large programming project. It is almost always helpful to break a large project into smaller parts, each of which can be programmed as an independent function. In a truly complex project, these functions may be organized hierarchically, with some calling others. If some of these small functions are of more general utility, then you can maintain them as independent functions and reuse them.

• Test your program. Before worrying about speed, memory usage, elegance, and so on, make sure that your function provides the right answer. Developing any function is an iterative process of refinement, and getting a function to work correctly is the key first step. In checking out your function, try to anticipate all of the circumstances that the function will encounter and test each of them. Furthermore, in 'quick-and-dirty' programming, the time that you spend writing and debugging your function will probably be vastly greater than the time the function spends executing. Remember the programmer's adage: 'Make it right before you make it faster.'

• Document your functions.
The best way to document your functions is to write them in a transparent and readable style - use descriptive function and argument names, even if this means that they are longer than what you may prefer; name the elements of the various data structures that are created in the body of a function (e.g., x[, "dead"] is more meaningful than x[, 2]); avoid clever but cryptic tricks in your code (trust me, you'll spend a fair amount of time trying to figure out the trick every time you read your code); break long expressions over multiple lines; indent lines (for example, in loops) to reveal the structure; always use parentheses to make groupings explicit; always use curly braces, { and }, in the body of your functions, including with for, if, else, and other statements; and always use TRUE and FALSE instead of T and F. In addition, if at all possible, give parameters sensible defaults. You can also add a few comments to the beginning of a function to explain what the function does and what its arguments mean. The key is that you want to understand your own functions when you return to them a month or a year later.

Chapter 3
Catalog of Functions, Operators, & Constants

This chapter is intended to act as a reference regarding the functions, operators, and constants we have discussed in both documents. This catalog also includes additional functions that we did not cover, but that I feel would be very useful to be aware of. NOTE: This catalog does not include any graphics functions; I have compiled the graphical material into its own reference, which you can find in the next chapter. The functions, operators, and constants given in this chapter come from the base, foreign, Hmisc, stats, and utils packages; each item is labeled with its corresponding package. ALSO NOTE: This chapter does not include all of the functions available from these packages, just the ones I have found I use the most or am most likely to use.
The functions, operators, and constants in the chapter have also been grouped by topics, such as data creation or data manipulation. It is also important to be aware of the set of manuals available on the R website (http://www.r-project.org/): • R Installation and Administration: Comprehensive overview of how to install R and its packages under different operating systems. • An Introduction to R: Provides an introduction to the language. • R Data Import/Export: Describes import and export facilities. • Writing R Extensions: Describes how you can create your own packages. • The R Reference Index: Contains printable versions of all the R help files for standard and recommended packages. The manuals can be accessed by clicking on the Manuals link listed under Documentation. R Session Entry Package Description Startup base Describes the initialization at the start of an R session only) (help topic R.homeO base Returns the R_HOME directory options() base Returns all options settings or sets specific options settings getOptionO base Returns the current value of specific options settings getwdO base Returns path of current working directory setwdO base Sets path to desired working directory list .f ilesO base Lists the files in the current working directory dateO base Returns system's current date and time Sys.DateO base Returns system's current date Sys. timeO base Returns system's current time 58 CHAPTER 3. CATALOG OF FUNCTIONS, OPERATORS, & CONSTANTS Sys. timezoneO base Returns system's current time zone R. 
VersionO base Returns detailed information about the version of R running sessionlnfo() utils Returns version information about R and loaded (attached) pack- history() utils Displays the command history of the current R session timestampO utils Writes a timestamp (or other message) into the command history of the current session and echos it to the console savehistoryO utils Saves the command history of the current R session to a specified file Loads the saved command history from a previous R session (from loadhistoryO utils a specified file) source() base Reads and executes R code from a file system() base Invokes a system command qO base Terminates an R session Finding Help Entry Package Description argsO base Returns the argument list of the specified function argsAnywhere() utils Returns the argument list of the specified 'non-visible' function methods() utils Returns the class specific methods of a generic function getAnywhereO utils Returns the body of the specified 'non-visible' function helpO utils Provides access to the help file of the specified topic; alternatively 7 help.search() utils Searches the help files for the specified character string RSiteSearchO utils Searches for key words or phrases in the R-help mailing list archives or documentation help.start() utils Starts the hypertext (currently HTML) version of R's online help files Returns a character vector giving the names of all objects (func- apropos() utils tions and assigned objects) in the search list or that match a spec- ified character string findO utils Similar to apropos(), but allows you to specify the mode of the objects to return example() utils Executes the 'Example' section from the specific help file demoO utils Executes some demonstration R code for a specific topic, or lists the available demos vignette() utils Similar to demoO RShowDocO utils Shows R manuals or other documentation Syntax base Describes the operator syntax and precedence (help topic only) Paren base Describe how 
R handles parentheses and braces (help topic only) Quotes base Describes the various uses of quoting in R (help topic only) Packages Entry Package Description R.homeO base Returns the R_H0ME directory installed.packages() utils Finds (or retrieves) details of all packages installed in the specified libraries install.packages() utils Installs packages from CRAN or a downloaded package source/binary file 59 CHAPTER 3. CATALOG OF FUNCTIONS, OPERATORS, & CONSTANTS update.packages() utils Compares current versions of installed packages, asks if you would like to update each package that has a newer version available, and updates the packages you specify to update remove.package s( ) utils Removes specified installed packages library() base Loads specified package search() base Returns a character vector of the currently loaded packages searchpathsO base Returns a character vector of the currently loaded packages with their directory paths detach() base Un-loads the specified package Data Creation Entry Package Description Assignment base <-, ->, << —, — >>, and the assignO function NA base 'Not available'; representation of missing values NULL base The null object LETTERS, letters base Generates a character vector of the specified subset of the 26 upper-and lower-case letters of the Roman alphabet month.abb, month.name base Generates a character vector of the specified subset of three-letter abbreviations or full names of the months of the year c() base Generates a vector or list by concatenating the specified values Cs() Hmisc Generates a character vector by concatenating the specified values character() base Generates a character vector of the specified number of elements complex() base Generates a complex vector of the specified number of elements double(), single() base Generates a double-/single-precision vector of the specified number of elements integer() base Generates an integer vector of the specified number of elements logical() base Generates a logical 
vector of the specified number of elements numeric() base Generates a numeric vector of the specified number of elements raw() base Generates a raw vector of the specified number of elements realO base Generates a real vector of the specified number of elements vector() base Generates a vector of the specified number of elements and mode paste() base Generates a character vector concatenating the specified vectors sample() base Generates a vector by taking a random sample of the specified elements (used in conjunction with the set.seedO function) rep() base Generates a vector by replicating the specified elements score.binary() Hmisc Generates a vector from a series of logical conditions seq(), : operator base Generates a regular sequence sequence() base Generates a vector of sequences combnO utils Generates a vector of all the combinations of 'n' elements, taken 'm' at a time cut() base Generates a factor by dividing the range of a numeric vector into intervals and coding the values of the numeric vector according to which interval they fall factor() base Encodes a vector CIS cL factor gl() base Generates a factor by specifying the pattern of its levels interaction() base Generates a factor which represents the interaction of the specified levels matrix() base Creates a matrix array() base Creates an array rbindO, cbindO base Combines vector, matrices, or arrays by rows or columns data.frame() base Creates a data frame 60 CHAPTER 3. 
CATALOG OF FUNCTIONS, OPERATORS, & CONSTANTS expand.grid() listO base base Creates a data frame from all the combinations of the specified vectors or factors Creates a list Data Import Entry Package Description dataO, getHdataO utils, Hmisc Loads the specified built-in data set, or lists the available data sets read.table() utils Reads in a white space delimited file read.delimO utils Reads in a tab-delimited file read.csvO, csv.getO utils, Hmisc Reads in a comma-delimited file read.dtaO, foreign, Reads in a Stata file stata.getO Hmisc read. spssO, foreign, Reads in a SPSS file spss.get() Hmisc read.xport(), foreign, Read in a SAS Transport file read.ssd(), Hmisc sasxport.get() mdb.get() Hmisc Reads in tables from a Microsoft Access Database file read.epiinfo() foreign Reads in an Epi Info data file read.ftable() stats Reads in a fiat contingency table read.fwf() utils Reads in a fixed-width format file read.mtpO foreign Reads in a Minitab Portable Worksheet file read.octave() foreign Reads in an Octave text data file read.systat() foreign Reads in a Systat file scanO base Reads data into a vector or list from the command line or file Data Attributes Entry Package Description attrO base Returns or sets the specific attributes of an object attributes() base Returns the attributes of an object classO, data.classO base Returns (or sets) the class of an object contents() Hmisc Returns the 'metadata' of the columns of a data frame, which includes the names, labels (if any), units (if any), number of factor levels (if any), factor levels, class, storage mode, and number of NAs dim(), nrowO, ncolO base Returns (or sets) the dimensions (number of rows and columns) of an object dimnamesO, base Returns or sets the dimension names (row and column names) of rownamesO, an object colnamesO row.names() base Returns or sets the row names for data frames is._() base Tests whether an object is something - execute apropos(""is\\. 
") to see all the available functions, including is.naO, is.nanO, and is.nullO label() Hmisc Returns or sets the label of an object length() base Returns or sets the length of vectors (including factors) and lists levels(), nlevelsO base Returns or sets the levels of a factor; returns the number of levels of a factor 61 CHAPTER 3. CATALOG OF FUNCTIONS, OPERATORS, & CONSTANTS modeO, base Returns or sets the mode and storage mode of an object storage .modeO names() base Returns or sets the names of the elements/components of an object str() utils Compactly displays the structure of an object typeof() base Returns the type of an object units() Hmisc Returns or sets the units of an object upDataO Hmisc Updates a dataframe and the attributes of its columns Data Manipulation: Math Entry Package Description Numeric constants base Inf, NaN, and pi Arithmetic operators base +, -, *, /, " (exponentiation), %% (modulo), and %/% (integer division) Calculates the lagged and iterated differences between the elements diffO base of a numeric vector sum() base Sums the elements of a vector prodO base Calculates the product of all the elements of a numeric vector cumsumO base Calculates the cumulative sums of the elements of a numeric vector cumprodO base Calculates the cumulative products of the elements of a numeric vector cummax() base Calculates the cumulative maximas of the elements of a numeric vector cummin() base Calculates the cumulative minimas of the elements of a numeric vector abs() base Calculates the absolute values of the elements of a numeric vector sqrtO base Calculates the square roots of the elements of a numeric vector Trigonometric & Hyper- base cos(), sin(), tan(), acosO ('a' = arc), asinO, atanO, bolic atan2(), coshO ('h' = hyperbolic), sinhO, tanhO, acoshO, asinhO, and atanhO Logarithms & Exponen- base log(), logbO, loglpO, loglOO, log2(), expO, and expmlO tials Combinatorics base choose(), IchooseO, factorial(), and lfactorialO Matrix operators base 7o*7o 
(matrix multiplication), 7oO°/o (outer product), and °/oX°/o (Kroe-necker product) Other matrix & array re- base colSumsO, rowSumsO, rowsumO, colMeansO, rowMeansO, lated crossprodO, tcrossprodO, det(), eigenO, diagO, lower.tri(), upper.tri(), qr(), svd(), scaleO, t(), svd(), apermO, and sweep() Data Manipulation: Descriptive Statistics Entry Package Description cor() stats Calculates the correlation between two numeric vectors or matrices cov() stats Calculates the covariance between two numeric vectors or matrices var() stats Calculates the variance between two numeric vectors or matrices cov2cor() stats Scales a covariance matrix into the corresponding correlation matrix Calculates the kernel density estimates of a numeric vector density() stats ecdf() stats Calculates the empirical cumulative distribution function (ECDF) of a numeric vector 62 CHAPTER 3. CATALOG OF FUNCTIONS, OPERATORS, & CONSTANTS min(), max(), range() base Calculates the minimum, maximum, and both values of a numeric vector pminO, pmaxO base Calculates the parallel minima and maxima of two or more numeric vectors or matrices quantileO stats Calculates various sample quantiles of a numeric vector IQRO stats Calculates the inter-quartile range of a numeric vector f ivenumO stats Calculates Tukey's five-number summary (minimum, lower-hinge, median, upper-hinge, and maximum) of a numeric vector median() stats Calculates the median of a numeric vector meanO base Calculates the arithmetic mean of a numeric vector weighted.mean() stats Calculates the weighted mean of a numeric vector mad() stats Calculates the mean absolute difference of a numeric vector sd() stats Calculates the standard deviation of a numeric vector rank() base Calculates the sample ranks of the values of a vector smean.sd() Hmisc Calculates the mean and standard deviation of a numeric vector smean.cl.normal() Hmisc Calculates the sample mean and lower and upper Gaussian confidence limits of a numeric vector, based on the 
t-distribution smean.sdl() Hmisc Calculates the mean plus or minus a constant times the standard deviation smean. cl .bootO Hmisc A very fast implementation of the basic nonparametric bootstrap for obtaining confidence limits for the population mean without assuming normality smedian.hilowO Hmisc Calculates the sample median and a selected pair of outer quantiles having equal tail wtd.mean Hmisc Calculates the weighted mean of a numeric vector wtd.var Hmisc Calculates the weighted variance of a numeric vector wtd.quantile Hmisc Calculates the weighted quantiles of a numeric vector wtd.Ecdf Hmisc Calculates the weighted ECDF of a numeric vector wtd.rank Hmisc Calculates the weighted ranks of a numeric vector, using mid-ranks for ties describe() Hmisc Provides a concise statistical description of a vector, matrix, or data frame bystatsO, Hmisc Generates descriptive statistics by categories summary, f ormulaO Data Manipulation: Dates Entry Package Description as .DateO strptimeO as.POSIXctO cutO repO seqO format() round0, truncO diff (), difftimeO base base base base base base base base base Converts a character vector to a Date class vector Converts a character vector to a POSIXlt class vector Converts a POSIXlt class vector to a POSIXct class vector Converts a datetime class object to a factor (see .Date and . POSIXt methods) Replicates the elements of datetime class vectors (see .Date and .POSIXt methods) Generates a regular sequence of dates (see .Date and .POSIXt methods) Formats how datetime class vectors are printed (see .Date and .POSIXt methods) Rounds/truncates datetime class vectors (see .Date and .POSIXt methods) Calculates the lagged and iterated differences between the elements of a datetime class vector (see .Date and .POSIXt methods) 63 CHAPTER 3. 

Data Manipulation: Tables

Entry                       Package  Description
table()                     base     Creates a one-, two-, or n-way table
xtabs()                     stats    Alternative to the table() function (has formula interface)
ftable()                    stats    Creates a 'flat' 3-way or greater table
prop.table()                base     Expresses table entries as proportions of margins
margin.table()              base     Computes the sum of the table entries for a given margin
addmargins()                stats    Expands an n-way table to include margin totals

Other Data Manipulation

Entry                       Package  Description
with()                      base     Evaluates an expression in a data environment
format()                    base     Formats an R object for pretty printing (generic)
as._()                      base     Coerces an object (generic; execute apropos("^as\\.") for complete list)
I()                         base     Inhibits coercion of objects
unlist()                    base     'Flattens' lists
levels(), relevel(), reorder()  base, stats  Modifies the levels of a defined factor
ifelse()                    base     Conditionally generates a vector from existing vectors
transform()                 base     Transforms an object, such as a data frame
mChoice()                   Hmisc    Generates an object representing non-mutually exclusive events
union(), intersect(), setdiff(), setequal()  base  Performs set operations
order(), sort(), rev()      base     Sorts/orders the elements of a vector
merge()                     base     Merges two data frames by common columns or row names
reshape()                   stats    Reshapes a data frame from 'long' to 'wide' and vice versa
round(), signif(), ceiling(), floor(), trunc()  base  Rounds numeric values in various ways
aggregate()                 stats    Splits the data into subsets, computes the value of a function for each, and returns the result in a data frame
by()                        base     Applies a function to a data frame split by factors
apply(), lapply(), sapply(), tapply()  base  Applies a function over the elements of an object
split()                     base     Divides a vector into groups based on a factor
complete.cases()            stats    Returns a logical vector indicating which cases (rows) of an object are complete (i.e., have no missing values)
unique()                    base     Returns the vector or data frame with duplicate elements removed
duplicated()                base     Determines which elements of a vector or data frame are duplicates of elements with smaller subscripts and returns a logical vector indicating which elements (rows) are duplicates
head(), tail()              utils    Returns the first/last parts of a vector, matrix, or data frame
subset()                    base     Returns subsets of vectors or data frames which meet conditions
Extract                     base     Subsetting operators: [ ], [[ ]], [ , ], and $
Logic                       base     ! (negation), & and && (AND; intersection), | and || (OR; union), and the xor() and isTRUE() functions
Matching                    base, Hmisc  %in% (base), %nin% (Hmisc), and is.element() (base)
all(), any()                base     Given a set of logical vectors, are all/any (at least one) of the values TRUE?
all.equal(), identical()    base     Tests if two objects are nearly/exactly equal
which()                     base     Returns the indices of a logical object that are TRUE
which.max(), which.min()    base     Returns the index of the (first) occurrence of the maximum or minimum of a numeric vector
tolower(), toupper()        base     Character translation from upper to lower case or vice versa
strsplit()                  base     Splits the elements of a character vector into substrings according to the presence of a substring within them
grep(), sub(), gsub()       base     Pattern matching and replacement (based on regular expressions)

Data Export

Entry                       Package  Description
cat()                       base     Outputs the specified information as a single concatenated character vector either to the screen or to a specified file
print()                     base     Prints (returns) the specified object to the screen
sink()                      base     Diverts R output from the screen to a specified file
write()                     base     Writes the specified data (usually a matrix) to the specified file
write.table()               utils    Writes the specified data frame to the specified file

Object Management

Entry                       Package  Description
exists()                    base     Returns TRUE/FALSE depending on whether object is already defined
ls()                        base     Lists the currently assigned objects
rm()                        base     Removes specified assigned objects
save()                      base     Saves specified assigned objects to specified file
save.image()                base     Shortcut for saving your current workspace to the hidden .RData file
load()                      base     Loads saved objects from specified file
attach()                    base     Attaches the specified list or data frame to the 'search path' (output of the search() function)
detach()                    base     Removes the specified list or data frame from the 'search path'
search()                    base     Returns a character vector of the currently loaded packages and any attached lists or data frames
conflicts()                 base     Searches for masked objects (objects that exist with the same name in two or more places on the 'search path')
mem.limits                  base     Controls the memory available for R
Memory-limits               base     Describes the memory limits of R (must specify in quotes to access help file - ?"Memory-limits")
object.size()               utils    Provides an estimate of the memory that is being used to store the specified R object

Writing Functions

Entry                       Package  Description
Assignment                  base     <-, ->, <<-, ->>, and the assign() function
alarm()                     utils    Gives an audible or visual signal to the user
missing()                   base     Tests whether a value was specified for a specific argument to a function
match.arg()                 base     Matches the argument specified against a table of candidate values
match.call()                base
do.call()                   base     Constructs and executes a function call from a name or a function and a list of arguments to be passed to it
Repetitive & Conditional evaluation  base  if, else, for, while, repeat, break, next (must specify in quotes to access help file - ?"for"), and the switch() function
stop()                      base     Stops the execution of the current expression and executes an error action
warning()                   base     Generates a warning message that corresponds to its argument(s)
require()                   base     Loads the specified package if not already done so
tempfile()                  base     Creates names for temporary files
return()                    base     Returns a value from a function
invisible()                 base     Returns a (temporarily) invisible copy of an object
on.exit()                   base     Records the expression given as its argument as needing to be executed when the current function exits (either naturally or as the result of an error)
traceback()                 base     Prints the sequence of calls that led to the error of a function
debug()                     base     Debugs a function
Rprof()                     utils    Profiles the execution of R expressions

Chapter 4
R Graphics Reference

HIGH-LEVEL PLOTTING FUNCTIONS: Recall, a traditional graphics plot is created by first calling a high-level plotting function that creates a 'complete' plot. This means that with high-level plotting functions, axes, labels, and titles are automatically generated, where appropriate, unless you request otherwise. In addition, high-level plotting functions always start a new plot, erasing the current plot if necessary. The following table lists the high-level plotting functions available in the graphics and grDevices packages.
The list also contains some additional high-level plotting functions from the stats and Hmisc packages - denoted with their corresponding package in parentheses. Examples of specific high-level plotting functions (denoted with an asterisk, *) are shown in Figures 4.1 to 4.5.

Function                    Description
curve()                     Generates a curve corresponding to the given function or expression. Alternative is the plot.function() method of the generic plot() function.
hist()                      Generates a histogram of the given data points, including POSIXt and Date class data. Alternatives include the Hmisc package's hist.data.frame() and histbackback() functions.
boxplot()                   Used to plot the 'five-number' summary of a continuous variable and can generate side-by-side boxplots. Can be used in conjunction with the bxp() function. Also see the bpplt() (Hmisc) function. Uses the boxplot.stats() function to gather the statistics necessary for producing the boxplot.
stripchart()                Generates '1D scatter plots' or dotplots of continuous data. Alternative is the stem() function.
interaction.plot() (stats)  Plots the mean (or other summary statistic) of a continuous (response) variable for a two-way combination of (two) categorical variables, thereby illustrating possible interactions.
barplot()                   Graphically displays the relative frequencies or proportions of the 'levels' of a categorical variable and can generate stacked or side-by-side barplots representing the 'relationship' between two categorical variables (i.e., a two-way contingency table).
pie()                       Generates a pie chart, displaying the relative frequencies or proportions of the 'levels' of a categorical variable.
dotchart()                  Generates a dotplot, an alternative to barplots.
mosaicplot()*               Generates a mosaic plot of a one-way or greater contingency table.
spineplot()*                Generates spine plots and spinograms, which are special cases of mosaic plots. Also seen as generalizations of stacked (or highlighted) bar plots.
fourfoldplot()*             Creates a 'four-fold display' for the special case of two dichotomous categorical variables grouped by a third categorical variable with at least two levels (i.e., data from a 2 by 2 by k three-way table).
assocplot()*                Produces a Cohen-Friendly association plot indicating deviations from independence of the rows and columns in a two-way contingency table.
cdplot()*                   Computes and plots the conditional densities describing how the conditional distribution of a categorical variable y changes over a numeric variable x.
plot()                      Generic. The default method (plot.default()) produces a basic scatter plot between two continuous variables. Can also handle variables of POSIXct, POSIXlt, and Date class. Invoke methods(plot) to list all class-specific methods.
scatter.smooth() (stats)    Generates an x-y scatterplot with smooth curves fitted by LOESS.
persp()*                    Useful to graphically display three continuous variables. Generates an x-y scatterplot between two of the continuous variables and represents the third continuous variable by generating a 3D surface over the x-y plane.
contour()*                  Useful to graphically display three continuous variables. Generates an x-y scatterplot between two of the continuous variables and represents the third continuous variable using contour lines. Alternative is the filled.contour() function.
symbols()*                  Useful to graphically display three continuous variables. Generates an x-y scatterplot between two of the continuous variables and represents the third continuous variable using a symbol (e.g., circles of varying radius).
image()*                    Useful to graphically display three continuous variables. Produces an x-y grid of rectangles and uses color to represent the value of a third variable z.
pairs()                     Generates a matrix of scatterplots, plotting each continuous variable by all other continuous variables specified. Can be used in conjunction with the panel.smooth() function.
stars()*                    Generates star (spider/radar) plots or segment diagrams of a multivariate data set.
coplot()*                   Produces two types of conditioning plots.
matplot()                   Plots the columns of one matrix against the columns of another.
heatmap()* (stats)          Draws a heat map, which is a false color image (basically image(t(x))), with a dendrogram added to the left side and top.
sunflowerplot()*            Useful when identical data values repeat a small number of times. Plots a 'flower' at each x-y location with a 'petal' for each replication of z.

'STANDARD' ARGUMENTS OF HIGH-LEVEL PLOTTING FUNCTIONS: Recall, in addition to function-specific arguments, there are several arguments that are 'standard' in the sense that many high-level plotting functions will accept them. Specifically, most high-level plotting functions will accept graphical parameter arguments that control such things as the appearance of axes and labels, and the range of the axes scales. It is usually possible to modify the default range of the axes scales on a plot by specifying the xlim= and/or ylim= arguments in the high-level function invocation, each of which is specified as a two-element vector containing the minimum and maximum values (e.g., xlim = c(0, 50)). There is also a set of arguments for modifying the default labels (if any) on a plot: main= for a title, sub= for a sub-title, xlab= for an x-axis label, and ylab= for a y-axis label. Each of these arguments is specified as a character string. The title specified with the main= argument (if any) is placed at the top of the plot in a large font, and the sub-title specified with the sub= argument is placed just below the x-axis in a smaller font. Some high-level plotting functions have an axes= argument (by default TRUE), which, if set to FALSE, suppresses the drawing of the axes so that you can produce customized axes instead.
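As a minimal sketch of these standard arguments (not taken from the original document), the following uses R's built-in mtcars data set in place of pbc; the object names x_range, y_range, and usr are illustrative assumptions. It sets the axis ranges and labels, suppresses the default axes with axes = FALSE, and then draws customized axes with the low-level axis() and box() functions.

```r
# Sketch only: mtcars is a built-in data set, used here in place of pbc.
if (!interactive()) pdf(NULL)   # draw to a null device when run as a script

x_range <- c(0, 50)             # xlim= and ylim= are c(minimum, maximum) vectors
y_range <- c(0, 400)

plot(mtcars$mpg, mtcars$hp,
     xlim = x_range, ylim = y_range,
     main = "Horsepower vs. fuel economy",    # title, placed at the top
     sub  = "Source: built-in mtcars data",   # sub-title, placed below the x-axis
     xlab = "Miles per gallon",
     ylab = "Gross horsepower",
     axes = FALSE)                            # suppress the default axes

# Because axes = FALSE, customized axes and a box are added by hand.
axis(side = 1, at = seq(0, 50, by = 10))
axis(side = 2, at = seq(0, 400, by = 100), las = 1)
box()

# par("usr") reports the axis ranges actually in use (the limits plus padding).
usr <- par("usr")

if (!interactive()) invisible(dev.off())
```

Querying par("usr") after the plot is one way to confirm that the requested xlim= and ylim= ranges were honored.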
Lastly, some high-level plotting functions have an add= argument (by default FALSE), which, if set to TRUE, forces the function to act as a low-level plotting function, superimposing the high-level plot onto the current plot.

Figure 4.1: Examples of high-level plotting functions - mosaicplot(), spineplot(), fourfoldplot(), and assocplot().

Figure 4.2: Examples of high-level plotting functions - cdplot(), sunflowerplot(), and stars().

Figure 4.3: Examples of high-level plotting functions - persp(), contour(), image(), and symbols().

Figure 4.4: Examples of high-level plotting functions - coplot().

Figure 4.5: Examples of high-level plotting functions - heatmap().

THE par() FUNCTION: Recall, the par() function is used to access and modify numerous graphical parameters.
Also recall that some of the par() function's arguments can be used as arguments to other high- and low-level plotting functions, while others can be queried and set only via the par() function. In addition, a small set of graphical parameters cannot be set at all and can only be queried using the par() function. The following table lists these subsets of the par() function's arguments. Note, each graphical parameter will be discussed in detail in the corresponding 'Section'.

Specification: Queried and set via the par() function and used as arguments to other high- and low-level plotting functions

Section                     Parameter   Description
Points                      pch=        data symbol type ('point character')
                            type=       type of plot (points, lines, both)
Lines                       lty=        line type (solid, dashed)
                            lwd=        line width
Text                        adj=        justification of text
                            ann=        draw plot labels and titles?
                            cex=        size of text (magnification multiplier)
                            cex.axis=   size of axis tick labels
                            cex.lab=    size of axis labels
                            cex.main=   size of plot title
                            cex.sub=    size of plot sub-title
                            font=       font face (bold, italic) for text
                            font.axis=  font face for axis tick labels
                            font.lab=   font face for axis labels
                            font.main=  font face for plot title
                            font.sub=   font face for plot sub-title
                            las=        rotation of text in margins
                            srt=        rotation of text in plot region
                            tmag=       size of plot title (relative to other labels)
Color                       bg=         'background' color
                            col=        color of lines and data symbols
                            col.axis=   color of axis tick labels
                            col.lab=    color of axis labels
                            col.main=   color of plot title
                            col.sub=    color of plot sub-title
                            fg=         'foreground' color
                            gamma=      gamma correction for colors
Axes                        bty=        type of box drawn by the box() function
                            lab=        number of ticks on axes
                            mgp=        placement of axis title, tick labels, and line
                            tck=        length of axis ticks (relative to plot size)
                            tcl=        length of axis ticks (relative to text size)
                            xaxp=       number of ticks on x-axis
                            xaxs=       calculation of scale range on x-axis
                            xaxt=       x-axis style (standard, none)
                            yaxp=       number of ticks on y-axis
                            yaxs=       calculation of scale range on y-axis
                            yaxt=       y-axis style (standard, none)
Plotting regions & Margins  xpd=        clipping region

Specification: Queried and set only via the par() function

Section                     Parameter   Description
Lines                       lend=       line end style
                            ljoin=      line join style
                            lmitre=     line mitre limit
Text                        family=     font family for text
                            lheight=    line spacing (multiplier)
                            ps=         size of text (points)
Axes                        usr=        range of scales on axes
                            xlog=       logarithmic scale on x-axis?
                            ylog=       logarithmic scale on y-axis?
Plotting regions & Margins  fig=        location of figure region (normalized)
                            fin=        size of figure region (inches)
                            omd=        location of inner region (normalized)
                            pin=        size of plot region (inches)
                            plt=        location of plot region (normalized)
                            pty=        aspect ratio of plot region
                            mai=        size of figure margins (inches)
                            mar=        size of figure margins (lines of text)
                            mex=        line spacing in margins
                            oma=        size of outer margins (lines of text)
                            omi=        size of outer margins (inches)
Multiple plots              ask=        prompt user before new page?
                            mfcol=      number of figures on a page
                            mfg=        which figure is used next
                            mfrow=      number of figures on a page
Overlaying output           new=        has a new plot been started?

Specification: Only queried using the par() function

Section                     Parameter   Description
Points                      cin=        size of character (inches)
                            cra=        size of character ('pixels')
                            cxy=        size of character (user coordinates)
Plotting regions & Margins  din=        size of graphics device (inches)

REMEMBER, invoking the par() function makes a persistent change to specific graphical parameter settings. Often, we want to modify some graphical parameters, do some plotting, and then restore the original graphics state. Invoking the par() function with no.readonly = TRUE, and no other arguments, returns only the parameters that can be set by a subsequent par() function call. This allows you to assign these returned parameters to an object and, in turn, to restore these initial parameters after making some modifications.
For example,

    op <- par(no.readonly = TRUE)
    par(oma = c(7, 7, 10, 2), cex.main = 5, cex.lab = 2)
    with(subset(pbc, drug == "Placebo"),
         plot(age.years ~ chol, main = "Age (years) vs. Serum Cholesterol",
              xlab = label(chol), ylab = label(age.years)))
    par(op)

POINTS: R provides a fixed set of 26 data symbols for plotting points, and the choice of data symbol is controlled by the pch= ('point character') graphical parameter, which is accepted as an argument to the par() function or as an argument to appropriate high- and low-level plotting functions. The pch= argument can be specified as either an integer value to select one of the fixed set of data symbols, or a single quoted character (e.g., either pch = 19 or pch = "+"). If the pch= parameter is a character, then that character is used as the plotting symbol. The character '.' is treated as a special case and the device attempts to draw a very small dot. Figure 4.6 lists the available data symbols and their relevant integer values. In the future, the Hmisc package's show.pch() function (invoked with no formal arguments), which plots the definitions of the pch= parameters, is also useful when you do not have Figure 4.6 handy. The color of the point character is controlled by the col= graphical parameter, and point characters 21 to 25 allow a fill color separate from the border color, which is controlled by the bg= (background) graphical parameter - see the 'Color' section.

Figure 4.6: The plotting symbols in R (pch = 1:25).

The size of the data symbol is linked to the size of the text and is affected by the cex= parameter. If the data symbol is a character, the size will also be affected by the ps= parameter - see the 'Text' section for more details. The type= parameter controls how the data is represented in a plot (i.e., the type of plot produced).
Unlike the pch= parameter, the type= parameter is most often specified within a call to a high-level function (e.g., the plot() function) rather than via the par() function. Specifying type = "p" plots individual points at each (x, y) location (the default). Specifying type = "l" plots lines between each (x, y) location, and individual points are not distinguished by a pch= data symbol. Specifying type = "b" plots both the data symbols distinguishing the points and the lines that connect them - the lines stop short of each data symbol. Like type = "b", specifying type = "o" plots both the data symbols and the lines, but the lines overlay (i.e., 'over-plot') the points. Specifying type = "h" plots vertical lines from the x-axis to the (x, y) locations, which causes the plot to appear like a barplot or histogram with very thin bars. Two further values, "s" and "S", plot 'step-functions.' Specifically, specifying type = "s" plots lines going horizontally then vertically, implying the top of the vertical line defines the point. On the other hand, specifying type = "S" plots lines going vertically then horizontally, which implies the bottom of the vertical line defines the point. Finally, the value "n" (null) suppresses the plotting of any points and/or lines. However, axes are still drawn (by default) and the coordinate system is set up according to the data. Using type = "n" is ideal for crafting plots with subsequent low-level graphics functions. Figure 4.7 shows simple examples of the different plot types. The main low-level plotting function used to add points to an existing plot is the points() function, which is a generic function that draws a sequence of points at the specified coordinates. If the data symbol is a character, the specified character(s) are centered at the coordinates when plotted. The matpoints() function can be used to add points to an existing matplot() function invocation.
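The plot types and the points() function described above can be sketched as follows. This is only an illustration: the x and y values are made up, and the object names types and restored are assumptions introduced here, not part of the original document.

```r
# Sketch only: x and y are made-up values for illustration.
if (!interactive()) pdf(NULL)   # draw to a null device when run as a script

x <- 1:10
y <- c(2, 4, 3, 6, 5, 8, 7, 9, 8, 10)

# One panel per value of type=; par() returns the previous settings in op.
op <- par(mfrow = c(2, 4), mar = c(2, 2, 3, 1))
types <- c("p", "l", "b", "o", "h", "s", "S", "n")
for (tp in types) {
  plot(x, y, type = tp, main = paste0('type = "', tp, '"'))
}

# The last panel used type = "n", so only the axes and coordinate system
# exist; the low-level points() function adds the data symbols afterwards.
# pch = 21 takes its border color from col= and its fill color from bg=.
points(x, y, pch = 21, col = "black", bg = "grey")

par(op)                         # restore the original settings
restored <- par("mfrow")

if (!interactive()) invisible(dev.off())
```

Setting up the plot with type = "n" and then adding points() is the usual pattern when different subsets of the data need different symbols or colors.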
There are also several functions for helping to plot data when symbols overlap in a standard scatterplot, including the jitter() function. The jitter() function does no drawing but adds a very small amount of random noise to data values in order to separate values that are originally identical. The Hmisc package also has the jitter2() function, which does not add random noise, but retains unique values and ranks, and randomly spreads duplicate values at equidistant positions within limits of enclosing values.

Figure 4.7: Available basic plot types that can be specified with the type= graphical parameter. The relevant value of type= is shown above each plot. Panels: type = "p", type = "l", type = "b", type = "o", type = "h", and type = "s".

LINES: There are five graphical parameters for controlling the appearance of lines. The lty= parameter describes the type of line to draw (solid, dashed, dotted, ...), which can be specified by name, such as "solid", or as an integer index. Figure 4.8 shows the available predefined line types. The lwd= parameter describes the width of the lines, which can be specified by a simple numeric value, e.g., lwd = 3. The interpretation of this value depends on what sort of device the line is being drawn on. In other words, the physical width of the line may be different when the line is drawn on a computer screen compared to when it is written to a file and then printed on a sheet of paper. The ljoin=, lend=, and lmitre= parameters control how the ends and corners in the lines are drawn. When drawing thick lines, including rectangles and polygons, it becomes important to select the style that is used to draw corners (joins) in the line (ljoin=) and the ends of the line (lend=). R provides three styles for both cases. Specifically, the ljoin= parameter, which is used to control line joins, can be specified as "mitre" (pointy), "round", or "bevel".
The lend= parameter, which is used to control the line ends, can be specified as "round", "square", or "butt". To avoid excessively pointy lines, the ljoin= parameter will be automatically converted from ljoin = "mitre" to ljoin = "bevel" if the angle at the join is too small. The point at which this automatic conversion occurs is controlled by the lmitre= parameter, which specifies the ratio of the length of the mitre divided by the line width. The default value is 10, which means that the conversion occurs for joins where the angle is less than 11 degrees. Other standard values are 2, which means that conversion occurs at angles less than 60 degrees, and 1.414, which means that conversion occurs for angles less than 90 degrees. The minimum mitre limit value is 1. The lty= and lwd= parameters are accepted as arguments to the par() function or as arguments to appropriate high- and low-level plotting functions. On the other hand, the ljoin=, lend=, and lmitre= parameters are accepted by the par() function only, and not all devices will respect them (especially lmitre=).

Figure 4.8: Available predefined line types that can be specified with the lty= graphics parameter.

Integer   String
0         "blank"
1         "solid"
2         "dashed"
3         "dotted"
4         "dotdash"
5         "longdash"
6         "twodash"

There are several low-level functions that add lines to an existing plot. The lines() function is a generic function that joins the specified coordinates with line segments. The matlines() function can be used to add lines to an existing matplot() function invocation. The abline() function adds one or more straight lines to the current plot, which can be specified using the intercept and slope (a= and b=, respectively, or as a two-element vector with coef=) of the line, the y-value(s) for horizontal line(s) (h=), or the x-value(s) for vertical line(s) (v=).
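The line parameters and the lines() and abline() functions described above can be sketched as follows. The sketch is illustrative only: the object names line_types and mitre_angle are assumptions introduced here, and the helper assumes the usual relation between a mitre limit and the join angle, limit = 1 / sin(angle / 2), which reproduces the three angles quoted in the text.

```r
# Sketch only: names and coordinates are made up for illustration.
if (!interactive()) pdf(NULL)   # draw to a null device when run as a script

line_types <- c("solid", "dashed", "dotted", "dotdash", "longdash", "twodash")

# Set up an empty plot, then add one horizontal segment per line type.
plot(c(0, 10), c(0, 7), type = "n", xlab = "x", ylab = "lty index",
     main = "Predefined line types")
for (i in seq_along(line_types)) {
  # lty= accepts either the name or the matching integer index (1 to 6 here)
  lines(c(1, 9), c(i, i), lty = line_types[i], lwd = 2)
}

abline(h = 0.5, lty = 3)                  # horizontal line at y = 0.5
abline(v = 5, lty = "dotted")             # vertical line at x = 5
abline(a = 0, b = 0.7, col = "grey")      # intercept a= and slope b=
abline(coef = c(7, -0.7), col = "grey")   # the same idea, via coef=

# Assumed relation: lmitre = 1 / sin(angle / 2), so angle = 2 * asin(1 / lmitre).
mitre_angle <- function(lmitre) 2 * asin(1 / lmitre) * 180 / pi
mitre_angle(c(10, 2, 1.414))              # roughly 11.5, 60, and 90 degrees

if (!interactive()) invisible(dev.off())
```

Under that assumed relation, a larger lmitre= value tolerates sharper joins before the conversion to a bevel join takes place.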
The arrows() function draws 'arrows' between pairs of points, which can easily be used to draw error bars. Similarly, the segments() function draws line segments between pairs of points. The grid() function adds an nx by ny rectangular grid to an existing plot. The rect() function draws a rectangle (or sequence of rectangles) with the given coordinates, fill, and border colors. The polygon() function draws the polygon whose vertices are given in the x= and y= arguments. The Hmisc package offers three additional useful low-level plotting functions: (1) the plsmo() function adds a smoothed curve of x vs. y to the existing plot; (2) the confbar() function draws multi-level confidence bars using small rectangles that may be of different colors; and (3) the errbar() function adds vertical error bars to an existing plot or makes a new plot with error bars. As an example of the usefulness of the arrows() function, let's look at age in years (age.years) across stage of disease (stage), where we plot the raw data as a stripchart and overlay an indication of the means and standard deviations. The output is shown in Figure 4.9.

TEXT: There are a large number of traditional graphical parameters for controlling the appearance of text. There is also an ann= parameter, which indicates whether titles and axis labels should be drawn on a plot. This is intended to apply to high-level plotting functions, but is not guaranteed to work with all such functions.

JUSTIFICATION OF TEXT: Usually, the adj= parameter is a value of 0, 0.5, or 1 indicating the horizontal justification of text strings (0 means left-justified, 1 means right-justified, and 0.5 centers text). However, the adj= parameter may also be specified as a two-element numeric vector of the form c(hjust, vjust), where hjust specifies the horizontal justification and vjust specifies the vertical justification.
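A small sketch of the adj= values described above, using made-up coordinates; the object name hjust is an assumption introduced for the illustration. A dotted reference line at x = 1 shows where each string is anchored.

```r
# Sketch only: coordinates and labels are made up for illustration.
if (!interactive()) pdf(NULL)   # draw to a null device when run as a script

plot(c(0, 2), c(0, 4), type = "n", xlab = "", ylab = "",
     main = "adj= justification")
abline(v = 1, lty = "dotted")   # every string below is anchored at x = 1

hjust <- c(0, 0.5, 1)           # left-justified, centered, right-justified
for (i in seq_along(hjust)) {
  text(1, i, paste("adj =", hjust[i]), adj = hjust[i])
}

# Two-element form c(hjust, vjust): centered horizontally, with the top of
# the text aligned at y = 3.7 (the cross marks the anchor point).
text(1, 3.7, "adj = c(0.5, 1)", adj = c(0.5, 1))
points(1, 3.7, pch = 3)

if (!interactive()) invisible(dev.off())
```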
The adj= parameter can be specified as an argument to the par() function or as an argument to appropriate high- and low-level plotting functions.

Figure 4.9: Example of using the low-level arrows() function to add lines to an existing plot.

    xbar <- with(pbc, tapply(X = ageyrs, INDEX = stage, FUN = mean, na.rm = TRUE))
    sdev <- with(pbc, tapply(X = ageyrs, INDEX = stage, FUN = sd, na.rm = TRUE))
    with(pbc, stripchart(ageyrs ~ stage, method = "jitter", jitter = 0.2,
        pch = 1, vertical = TRUE, col = "black", xlab = label(stage),
        ylab = paste(label(ageyrs), " (", units(ageyrs), ")", sep = ""),
        main = paste("Age Across Stage of Disease",
            "- Mean and SD denoted -", sep = "\n")))
    arrows(1:4, xbar + sdev, 1:4, xbar - sdev, angle = 90, code = 3, length = 0.1,
        lwd = 2)
    lines(1:4, xbar, pch = 4, type = "b", cex = 4, lwd = 2)
    box("figure")

ROTATING TEXT: The srt= parameter specifies a rotation angle counter-clockwise from the positive x-axis, in degrees. This will only affect text drawn in the plot region (i.e., text drawn by the text() function). Unlike within the plot region, in the figure and outer margins text may only be drawn at angles that are multiples of 90 degrees, and this angle is controlled by the las= parameter. A value of 0 means that text is always drawn parallel to the relevant axis (i.e., horizontal in margins 1 and 3, and vertical in margins 2 and 4). A value of 1 means text is always horizontal, 2 means text is always perpendicular to the relevant axis, and 3 means text is always vertical. This parameter interacts with or overrides the adj= and srt= parameters. Both the srt= and las= parameters can be specified as an argument to the par() function or as an argument to appropriate high- and low-level plotting functions.

TEXT SIZE: The size of text is ultimately a numerical value specifying the size of the font in 'points.'
The font size is controlled by two parameters: the ps= parameter specifies an absolute font size (e.g., ps = 9; can only be set via the par() function), and the cex= parameter specifies a multiplicative magnification modifier (e.g., cex = 1.5; a value < 1 shrinks the text and a value > 1 enlarges the text). As with specifying color, the scope of a cex= parameter can vary depending on where it is given. When cex= is specified via the par() function, it affects most text. However, when cex= is specified via the plot() function, it only affects the size of the data symbols. Luckily, there are special parameters for controlling the size of text that is drawn as axis tick labels (cex.axis=), text that is drawn as axis labels (cex.lab=), text in the title (cex.main=), and text in the sub-title (cex.sub=). These four arguments can also be used as arguments in most high-level plotting functions. There is also a tmag= parameter for controlling the amount to magnify title text relative to other plot labels, which can also be used as an argument in most high-level plotting functions. Finally, the strheight() and strwidth() functions can be used to compute the height or width, respectively, of character strings or mathematical expressions on the current plotting device in user coordinates, inches, or as a fraction of the figure width.

MULTI-LINE TEXT: It is possible to draw text that spans several lines by inserting a new line escape sequence, "\n", within a piece of text. For example, "first line\nsecond line". The spacing between lines is controlled by the lheight= parameter, which is a multiplier applied to the natural height of a line of text (e.g., lheight = 2 specifies double-spaced text), and can only be specified via the par() function.
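The text size, rotation, and multi-line parameters described above can be sketched together as follows. The data values and the object names w, h1, and h2 are assumptions introduced for the illustration.

```r
# Sketch only: data and labels are made up for illustration.
if (!interactive()) pdf(NULL)   # draw to a null device when run as a script

# ps= and lheight= are set via par(); par() returns the old values in op.
op <- par(ps = 12, lheight = 2, las = 1)

plot(1:10, 1:10,
     cex = 1.5,                          # here cex= scales only the data symbols
     cex.axis = 0.8, cex.lab = 1.2, cex.main = 1.5,
     main = "first line\nsecond line",   # "\n" starts a new line in the title
     xlab = "x", ylab = "y")

# srt= rotates text drawn in the plot region by the text() function.
text(5, 8, "rotated 45 degrees", srt = 45, cex = 0.9)

# strwidth() and strheight() measure strings on the current device.
w  <- strwidth("first line\nsecond line", units = "inches")
h1 <- strheight("first line", units = "inches")
h2 <- strheight("first line\nsecond line", units = "inches")

par(op)                                  # restore the original settings
if (!interactive()) invisible(dev.off())
```

Comparing h1 with h2 shows that the two-line string is measured as taller than the single-line string.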
SPECIFYING FONTS: The font face is specified via the font= parameter as an integer, and can be specified as an argument to the par() function or as an argument to appropriate high- and low-level plotting functions. font = 1 produces Roman or upright face; font = 2 produces bold face; font = 3 produces slanted or italic face; and font = 4 produces bold and slanted face. As with color and text size, the font= parameter applies only to text drawn in the plot region. There are also additional parameters specifically for axes (font.axis=), labels (font.lab=), and titles (font.main= and font.sub=). These four arguments can also be used as arguments in most high-level plotting functions. Also, every graphics device establishes a default font family, which is usually a sans serif font such as Helvetica or Arial. A new font family is specified using the family= parameter via the par() function.

In addition to all of these graphical parameters, there are a large number of low-level plotting functions that control the placement of text onto your plot. For instance, the text() function draws the supplied text strings at specified coordinates within the plot region. The mtext() function writes text in one of the four margins of the current figure region or one of the four outer margins of the device region - see the 'Plotting regions & Margins' section for more details. And the title() function can be used to add labels to a plot - the main plot title (the main= argument; placed at top), the plot sub-title (the sub= argument; placed at bottom), the x-axis label (the xlab= argument), and the y-axis label (the ylab= argument). These four arguments can also be used as arguments in most high-level plotting functions.

There is also the identify() function, which can be used to interactively add labels to data symbols on a plot. The identify() function performs no plotting itself, but simply allows the user to move the mouse pointer and click the left mouse button near a point.
If there is a point near the mouse pointer it will be marked with its index number (that is, its position in the x/y vectors) plotted nearby. Alternatively, you could use some informative string (such as a case name) as a highlight by using the identify() function's labels= argument, or disable marking altogether with the plot = FALSE argument. When the process is terminated, the identify() function returns the indices of the selected points; you can use these indices to extract the selected points from the original vectors x and y.

COLOR: There are three main color parameters: col=, fg=, and bg=. The col= parameter is the most commonly used. The primary use is to specify the color of points, lines, text, and so on that are drawn in the plot region. Unfortunately, when specified via a high- and/or low-level plotting function, the effect can vary. For example, a standard scatterplot produced by the high-level plot() function will use the col= parameter for coloring data symbols and lines, but the high-level barplot() function will use the col= parameter for filling the contents of its bars. In the low-level rect() function, the col= parameter provides the color to fill the rectangle and there is a border= argument specific to the low-level rect() function that gives the color to draw the border of the rectangle. The effect of the col= parameter on graphical output drawn in the margins also varies. It does not affect the color of axes and axis labels, but it does affect the output from the low-level mtext() function. There are also specific parameters for affecting axes, labels, titles, and subtitles called col.axis=, col.lab=, col.main=, and col.sub=, respectively. The fg= parameter is primarily intended for specifying the color of axes and borders on plots. There is some overlap between this and the specific col.axis=, col.main=, etc. parameters mentioned above.
The bg= parameter is primarily intended to specify the color of the background for graphical output. This color is used to fill the entire page. As with the col= parameter, when bg= is specified in a graphics function it can have a quite different meaning. For example, the high-level plot() function and the low-level points() function use the bg= parameter to specify the color for the interior of points whose border can have a different color (i.e., pch= parameter values 21 to 25; see the 'Points' section). There is also the gamma= parameter that controls the gamma correction for a device. On most devices this can only be set when the device is first opened.

SPECIFYING COLOR: The easiest way to specify a color in R is simply to use the color's name. For example, "red" can be used to specify that graphical output should be (a very bright) red. R understands a fairly large set of color names (657 to be exact), which can be returned by the colors() (or colours()) function (invoked with no arguments). The following function generates a matrix of rectangles that displays each of the 657 possible colors returned by the colors() function.

show.colors <- function(x = 22, y = 30) {
    par(mar = c(0.1, 0.1, 0.1, 0.1))
    plot(c(-1, x), c(-1, y), xlab = "", ylab = "", type = "n", xaxt = "n",
        yaxt = "n", bty = "n")
    for (i in 1:x) {
        for (j in 1:y) {
            k <- y * (i - 1) + j
            co <- colors()[k]
            rect(i - 1, j - 1, i, j, col = co, border = 1)
        }
    }
    text(rep(-0.5, y), (1:y) - 0.5, 1:y, cex = 1.2 - 0.016 * y)
    text((1:x) - 0.5, rep(-0.5, x), y * (0:(x - 1)), cex = 1.2 - 0.022 * x)
}
show.colors()

By default, R has a color palette of eight colors, which can be returned by the palette() function (invoked with no arguments):

> palette()
[1] "black"   "red"     "green3"  "blue"    "cyan"    "magenta" "yellow"  "gray"

In addition to being able to specify these eight colors by name, these eight colors can also be specified by integer values (i.e., 1 through 8).
The Hmisc package's show.col() function plots the definitions of these eight integer-value colors in case you forget. It is also possible to specify the colors using one of the standard color-space descriptions. For example, the rgb() function allows a color to be specified as a Red-Green-Blue (RGB) triplet of intensities. Using this function, the color red is specified as rgb(1, 0, 0) (i.e., as much red as possible, no green, and no blue). The col2rgb() function can be used to see the RGB values for a particular color name. An alternative way to provide an RGB color specification is to provide a string of the form "#RRGGBB", where each of the pairs RR, GG, and BB consists of two hexadecimal digits giving a value in the range zero (00) to 255 (FF). In this specification, the color red is given as "#FF0000". There is also an hsv() function for specifying a color as a Hue-Saturation-Value (HSV) triplet. The terminology of color spaces is fraught, but roughly speaking: hue corresponds to a position on the rainbow, from red (0), through orange, yellow, green, blue, indigo, to violet (1); saturation determines whether the color is dull or bright; and value determines whether the color is light or dark. The HSV specification for the (very bright) color red is hsv(0, 1, 1). The rgb2hsv() function converts a color specification from RGB to HSV.

COLOR SETS: More than one color is often required within a single plot and in such cases it can be difficult to select colors that are aesthetically pleasing or are related in some way (e.g., a set of colors in which the brightness decreases in regular steps). The following table lists some functions that R provides for generating sets of colors. Each of the functions listed selects a set of colors by taking regular steps along a path through the HSV color space.
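Returning briefly to the color specifications described under SPECIFYING COLOR, the short sketch below expresses the color red in each of the forms mentioned (name, RGB triplet, hexadecimal string, and HSV triplet) and converts between them. It uses only the functions named in the text; the object names are illustrative.

```r
red_rgb <- rgb(1, 0, 0)            # RGB triplet of intensities -> "#FF0000"
red_hsv <- hsv(0, 1, 1)            # the same color as hue, saturation, value
red_mat <- col2rgb("red")          # 3 x 1 matrix: rows red, green, blue (0 to 255)
red_as_hsv <- rgb2hsv(red_mat)     # 3 x 1 matrix: rows h, s, v (0 to 1)

red_rgb == red_hsv                 # both specifications give the same string
```

Any of these values can be supplied wherever a color is expected, for example as the col= argument of a plotting function.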
Also, the colorRamp() and colorRampPalette() functions can be used to interpolate a new color set from an existing set of colors.

Name               Description
rainbow()          Colors vary from red through orange, yellow, green, blue, and indigo, to violet.
heat.colors()      Colors vary from white, through orange, to red.
terrain.colors()   Colors vary from white, through brown, to green.
topo.colors()      Colors vary from white, through brown then green, to blue.
cm.colors()        Colors vary from light blue, through white, to light magenta.
gray()             A set of shades of gray.
gray.colors()      A set of gamma-corrected gray colors.

AXES: By default, the traditional graphics system produces axes with sensible labels and tick marks at sensible locations. However, if the axis does not look right, there are a number of graphical parameters specifically for controlling aspects such as the number of tick marks and the positioning of labels. Recall, most high-level plotting functions provide xlim= and ylim= arguments to control the range of the scale on the axes. If none of these gives the desired result, you may have to resort to drawing the axis explicitly using the low-level axis() plotting function - more to come.

The lab= parameter in the traditional graphics state is used to control the number of tick marks on the axes. The parameter is only used as a starting point for the algorithm R uses to determine sensible tick mark locations, so the final number of tick marks that are drawn could easily differ from this specification. The parameter takes two values: the first specifies the number of tick marks on the x-axis and the second specifies the number of tick marks on the y-axis.

The xaxp= and yaxp= parameters also relate to the number and location of the tick marks on the axes of a plot. These parameters are almost always calculated by R for each new plot, so user settings are usually overridden. In other words, it only makes sense to query these parameters for their current values.
Each parameter consists of three values: the first two specify the location of the left-most and right-most tick marks (bottom and top tick marks for the y-axis), and the third value specifies how many intervals there are between tick marks. However, when a log transformation is in effect for an axis, the three values have a different meaning altogether (see the par() function's help file).

The mgp= parameter controls the distance that the components of the axes are drawn away from the edge of the plot region. There are three values representing (in order) the positioning of the axis label, the tick mark labels, and the axis line. The values are in terms of lines of text away from the edges of the plot region. By default, mgp = c(3, 1, 0).

The tck= and tcl= parameters control the length of the tick marks. The tcl= parameter specifies the length of tick marks as a fraction of the height of a line of text. The sign dictates the direction of the tick marks - a negative value draws tick marks outside the plot region and a positive value draws tick marks inside the plot region. The tck= parameter specifies tick mark lengths as a fraction of the smaller of the physical width or height of the plotting region, but it is only used if its value is not NA (by default, tck = NA). An alternative to modifying the tck= and/or tcl= parameters is the Hmisc package's low-level minor.tick() function, which adds minor (shorter) tick marks to an existing plot.

Figure 4.10: Specifying the extremes of the coordinates of the plotting region with the par() function's usr= argument.

> plot(0, 0, type = "n", axes = FALSE, xlab = "x", ylab = "y")
> par(usr = c(1, 10, 1, 5))
> axis(side = 1, at = 1:10)
> axis(side = 2, at = 1:5)
> points(x = sample(seq(1.5, 9.5, by = 0.5), size = 10, replace = TRUE),
+     y = sample(seq(1.5, 4.5, by = 0.5), size = 10, replace = TRUE))
> box("figure")

The xaxs= and yaxs= parameters control the 'style' of the axes of a plot.
By default, the parameter is "r", which means that R calculates the range of values on the axis to be wider than the range of the data being plotted (so that data symbols do not collide with the boundaries of the plot region). It is possible to make the range of the values on the axis exactly match the range of values in the data by specifying the value "i". This can be useful if the range of values on the axes is being explicitly controlled via the xlim= or ylim= arguments to a high-level plotting function.

An alternative to the xaxs= and yaxs= parameters is the usr= parameter, which will allow us to specify that the two axes should intersect at an exact x-y coordinate. Specifically, the usr= parameter is specified as a vector of the form c(x1, x2, y1, y2) giving the extremes of the coordinates of the plotting region. For example, let's generate a plot such that the x- and y-axes intersect at (1, 1) - see Figure 4.10. Notice, with the usr= parameter, you have to start a new plot first and then set the usr= parameter. Also, notice that we specified type = "n" and axes = FALSE in the plot() function invocation and then added the axes back in with the axis() function.

The xaxt= and yaxt= parameters control the 'type' of axes. The default value, "s", means that the axis is drawn. Specifying a value of "n" means that the axis is not drawn.

The xlog= and ylog= parameters control the transformations of the values on the axes. The default value is FALSE, which means that the axes are linear and values are not transformed. If this value is TRUE then a logarithmic transformation is applied to any values on the relevant dimension in the plot region. This also affects the calculation of tick mark locations on the axes.

The bty= parameter is not strictly to do with axes, but it controls the type of 'box' that is drawn around a plot.
The value can be "n", which means that no box is drawn, or it can be one of "o", "l", "7", "c", "u", or "]", which means that the box drawn resembles the corresponding uppercase character. For example, bty = "c" means that the bottom, left, and top borders will be drawn, but the right border will not be drawn. An alternative to using the bty= parameter is to use the low-level box() function. With the box() function you can specify the box-type (bty=), color (col=), line type (lty=), and line width (lwd=) of the box drawn. You can also specify which (which=) box to draw. By default, the box() function will draw a box connecting all four axes (which = "plot"). However, which= can be changed to "figure", "inner", or "outer", which come in very handy when you want to outline the whole plot or specific plots when plotting multiple plots.

In some cases, it may be useful to draw tick marks at the locations that the default axis would use, but with different labels. The axTicks() function can be used to calculate these default locations. This function is also useful for enforcing an xaxp= (or yaxp=) graphical parameter. If these parameters are specified via the par() function, they usually have no effect because the traditional graphics system almost always calculates the parameter itself. You can choose these parameters by passing them as arguments to the axTicks() function and then passing the resulting locations via the at= argument to the axis() function. The pretty() function is another useful function for determining where to place the tick marks.

THE axis() FUNCTION: When using the axis() function to draw an axis 'from scratch,' the first step is to inhibit the default axes. Most high-level plotting functions should provide an axes= argument which, when set to FALSE, indicates that the axes should not be drawn. Specifying the graphical parameter xaxt = "n" (or yaxt = "n") also suppresses the corresponding axis, and ann = FALSE suppresses the default axis labels and titles.
Once this is done, the axis() function can draw axes on any side of the plot (specified by the side= argument), and you can specify the location along the axis of tick marks and the text to use for tick labels using the at= and labels= arguments, respectively. See Figure 4.11 for an example. The axis() function is not generic, but there are special alternative functions for plotting time related data. Specifically, the axis.Date() and axis.POSIXct() functions take an object containing dates and produce an axis with appropriate labels representing times, days, months, and years (e.g., 10:15, Jan 12, or 1995).

In addition, the axis() function is useful to superimpose the underlying numerical axes that are used by functions like dotchart(), stripchart(), boxplot(), or barplot() - use axis(side = 1) to superimpose the x-axis and axis(side = 2) for the y-axis. This allows you to determine the x-y location of specific things in the plot, such as where to place a legend or the middle of a bar in order to add text.

Lastly, the axis() function can be used to draw text in the margins - see Figure 4.12 for an example [footnote 1]. It is important to realize that the axis() function will not allow you to place any text in the margin areas where two axes intersect. An alternative is to use the mtext() function (discussed in the 'Text' section) with its line= and at= arguments.

PLOTTING REGIONS & MARGINS: In the base graphics system, every page is split up into three main regions: (1) the outer margins; (2) the current figure region; and (3) the current plot region. Figure 4.13 shows these regions when there is only one figure on the page and Figure 4.14 shows the regions when there are multiple figures on the page. The region obtained by removing the outer margins from the device is called the inner region. When there is only one figure, this usually corresponds to the figure region, but when there are multiple figures the inner region corresponds to the union of all figure regions.
The area outside the plot region, but inside the figure region is referred to as the figure margins. A typical high-level function draws data symbols and lines within the plot region and axes and labels in the figure or outer margins.

[Footnote 1] The rates.dat file contains a data set for a study of beta-blocker adherence post-AMI.

Figure 4.11: Example of customizing axes using the low-level axis() plotting function.

> op <- par(no.readonly = TRUE)
> x <- 1:2
> y <- runif(n = 2, min = 0, max = 100)
> par(cex = 0.8, mar = c(4, 5, 2, 5))
> plot(x, y, type = "n", xlim = c(0.5, 2.5), ylim = c(-10, 110), axes = FALSE,
+     ann = FALSE)
> axis(side = 2, at = seq(from = 0, to = 100, by = 20))
> mtext("Temperature (Centigrade)", side = 2, line = 3)
> axis(side = 1, at = 1:2, labels = c("Treatment 1", "Treatment 2"))
> axis(side = 4, at = seq(from = 0, to = 100, by = 20), labels = seq(from = 0,
+     to = 100, by = 20) * 9/5 + 32)
> mtext("Temperature (Fahrenheit)", side = 4, line = 3)
> box()
> segments(x, 0, x, 100, lwd = 20, col = "dark grey")
> segments(x, 0, x, 100, lwd = 16, col = "white")
> segments(x, 0, x, y, lwd = 16, col = "light grey")
> box("figure")
> par(op)

Figure 4.12: Example of customizing axes and margins using the low-level axis() plotting function and other low-level text plotting functions.

> rates <- read.table("rates.dat", header = TRUE)
> op <- par(no.readonly = TRUE)
> par(mar = c(10, 5, 4, 2), bg = "white")
> plot(rates$day, rates$rate1, type = "l", ylim = c(0, 85), axes = FALSE,
+     xlab = "Days since discharge", ylab = "Percent beta-blocker users (%)")
> axis(1, at = c(0, 30, 90, 180, 270, 365))
> axis(2, at = c(0, 20, 40, 60, 80))
> axis(1, at = rates$day, labels = rates$atrisk1, tick = FALSE, line = 4,
+     cex = 0.8)
> axis(1, at = rates$day, labels = rates$atrisk0, tick = FALSE, line = 6.5,
+     cex = 0.8)
> lines(rates$day, rates$rate0, type = "l")
> box()
> mtext("No. at-risk: patients discharged on beta-blockers", side = 1,
+     line = 4, adj = 0, cex = 0.8)
> mtext("No. at-risk: patients not discharged on beta-blockers", side = 1,
+     line = 6.5, adj = 0, cex = 0.8)
> mtext("Outpatient adherence to\nbeta-blocker therapy post-AMI", side = 3,
+     cex = 1.2, line = 1)
> text(180, 70, "Discharged on beta-blockers", cex = 0.8)
> text(180, 20, "Not discharged on beta-blockers", cex = 0.8)
> box("outer")
> par(op)

[Figure 4.12 output: two curves of percent beta-blocker users against days since discharge, with the numbers at risk printed in the bottom margin. Discharged on beta-blockers: 365 363 350 342 338 330 327 325 316 312 308. Not discharged on beta-blockers: 423 418 399 380 366 351 345 334 328 321 312.]

Figure 4.13: The outer margins, figure region, and plot region when there is a single plot on the page.

[Figure 4.13 output: diagram labeling the outer margins, the figure region, and the plot region.]

[Code listing for a later figure; its caption and parts of two calls are missing from the source, marked [...].]

> y1 <- rnorm(100)
> y2 <- rnorm(100)
> par(mfrow = c(2, 1), [...])
> plot(y1, type = "l", [...])
> box(col = "grey")
> mtext("Left end of margin", adj = 0, side = 3)
> lines(x = c(20, 20, 40, 40), y = c(-7, max(y1), max(y1), -7), lwd = 3,
+     col = "grey")
> plot(y2, type = "l", axes = FALSE, xlab = "", ylab = "")
> box(col = "grey")
> mtext("Right end of margin", adj = 1, side = 3)
> mtext("Label below x=30", at = 30, side = 1)
> lines(x = c(20, 20, 40, 40), y = c(7, min(y2), min(y2), 7), lwd = 3,
+     col = "grey")
> box("outer")

These three approaches are mutually incompatible. For example, a call to the layout() function will override any previous mfrow= or mfcol= parameters. Also, some high-level functions (e.g., the coplot() function) call the layout() and par() functions themselves to create a plot arrangement, which means that the output from such functions cannot be arranged with other plots on a page.

THE par() FUNCTION APPROACH: The number of figure regions on a page can be controlled via the mfrow= and mfcol= graphical parameters. Both of these consist of two values indicating a number of rows, nr, and a number of columns, nc (i.e., mfrow = c(nr, nc)); these parameters result in nr x nc figure regions of equal size. For example, the code par(mfrow = c(3, 2)) creates six figure regions on the page arranged in three rows and two columns. The top-left figure region is used first. If the setting is made via the mfrow= parameter then the figure regions along the top row are used next from left to right, until that row is full. After that, figure regions are used in the next row down, from left to right, and so on. When all rows are full, a new page is started. If the setting is made via the mfcol= parameter, figure regions are used by column instead of by row.
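The fill order described above can be checked with a small sketch. It draws two plots under mfrow= and two under mfcol=, and after each plot queries par("mfg"), which reports the row and column of the figure region just used followed by the grid dimensions. The pdf(file = NULL) call is an assumption added only so that the code runs without a screen device, and the object names are illustrative.

```r
pdf(file = NULL)                   # assumption: file-less device, no screen needed
op <- par(no.readonly = TRUE)

par(mfrow = c(3, 2))               # 3 rows x 2 columns, filled row by row
plot(1:10, main = "first")         # top-left region
first_by_row <- par("mfg")         # c(row, column, nr, nc)
plot(10:1, main = "second")        # next region along the top row
second_by_row <- par("mfg")

par(mfcol = c(3, 2))               # same grid, filled column by column
plot(1:10, main = "first")         # top-left region again (on a new page)
plot(10:1, main = "second")        # next region down the first column
second_by_col <- par("mfg")

par(op)
dev.off()
```

With mfrow= the second plot lands in row 1, column 2; with mfcol= it lands in row 2, column 1.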
We demonstrated the use of the mfrow= parameter in the first document.

The order in which figure regions are used can be controlled using the mfg= parameter to specify the next figure region. This parameter consists of two values that indicate the row and column of the next figure to use. In addition, the ask= parameter controls whether the user is prompted before the graphics system starts a new page of output - use ask = TRUE to be prompted. It is useful for viewing multiple pages of output (e.g., the output from example(boxplot)) that otherwise flick by too fast to view properly.

THE layout() FUNCTION APPROACH: The layout() function provides an alternative to the mfrow= and mfcol= parameters. The primary difference is that the layout() function allows the creation of multiple figure regions of unequal sizes. The simple idea underlying the layout() function is that it divides the inner region of the page into a number of rows and columns, but the heights of the rows and the widths of the columns can be independently controlled, and a figure can occupy more than one row or more than one column. In addition, one or more intersections of the rows and columns can be left blank.

The first argument (and the only required argument) to the layout() function is a matrix. The number of rows and columns in the matrix determines the number of rows and columns in the layout. The contents of the matrix are integer values that determine which rows and columns each figure will occupy. A value of 0 can also be used, which means that no figure will occupy the specified region. For example, the following layout specification is identical to par(mfrow = c(3, 2)): layout(matrix(c(1, 2, 3, 4, 5, 6), byrow = TRUE, ncol = 2)). We can visualize the partitions created using the layout.show() function - we merely specify the number of figures to plot.
For example,

> (m <- matrix(c(1, 0, 2, 0, 3, 0, 4, 0, 5), byrow = TRUE, ncol = 3))
     [,1] [,2] [,3]
[1,]    1    0    2
[2,]    0    3    0
[3,]    4    0    5
> layout(m)
> layout.show(5)

The contents of the layout matrix determine the order in which the resulting figure regions will be used. For example, the following code creates a layout with exactly the same rows and columns as the earlier three-row, two-column layout, but the figure regions will be used in the reverse order: layout(matrix(c(6, 5, 4, 3, 2, 1), byrow = TRUE, ncol = 2)).

A figure may also occupy more than one row or column in the layout. For example, in the layout specified by the code layout(matrix(c(1, 2, 3, 3), byrow = FALSE, ncol = 2)), the third plot will fill a figure region that occupies both rows of the second column.

> (m <- matrix(c(1, 2, 3, 3), byrow = FALSE, ncol = 2))
     [,1] [,2]
[1,]    1    3
[2,]    2    3
> layout(m)
> layout.show(3)

By default, all row heights are the same and all column widths are the same, and the available inner region is divided up equally. The heights= argument can be used to specify that certain rows are given a greater portion of the available height (for all of what follows, the widths= argument works analogously for column widths). When the available height is divided up, the proportion of the available height given to each row is determined by dividing the row heights by the sum of the row heights. For example, in the code layout(matrix(c(1, 2)), heights = c(2, 1)) the top row is given two-thirds of the available height (2/(2+1)) and the bottom row is given one-third (1/(2+1)).

> (m <- matrix(c(1, 2)))
     [,1]
[1,]    1
[2,]    2
> layout(m, heights = c(2, 1))
> layout.show(2)

By default, the division of row heights is completely independent of the division of column widths (respect = FALSE).
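The height arithmetic above can be sketched in a few lines. The example relies on the documented behavior that layout() returns the number of figures in the layout, so that value can be handed straight to layout.show(). The pdf(file = NULL) call is an assumption added only so that the code runs without a screen device, and the object names are illustrative.

```r
pdf(file = NULL)                          # assumption: file-less device, no screen needed

m <- matrix(c(1, 2))                      # two rows, one column
row_heights <- c(2, 1)

n_figures <- layout(m, heights = row_heights)   # returns the number of figures
share <- row_heights / sum(row_heights)         # fraction of the height per row

layout.show(n_figures)                    # outline every figure region, no more
dev.off()
```

Here share evaluates to two-thirds and one-third, matching the proportions worked out in the text.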
Use respect = TRUE to force the widths and heights to correspond as well so that, for example, a height of 1 corresponds to the same physical distance as a width of 1. The respect= argument can also be specified as a matrix. In this case, only certain rows and columns will respect each other's heights/widths. As an example, let's look at the relationship between age in years and serum albumin, but also include each variable's histogram - see Figure 4.16.

OVERLAYING OUTPUT: As we know, all high-level plotting functions always start a new plot, erasing the current plot if necessary. However, we may want to generate a plot that results from superimposing the output of two high-level plotting functions. Some high-level functions provide an argument called add=, which, if set to TRUE, will add the function output to the current plot, rather than starting a new plot. We have demonstrated this in the first document. Unfortunately, the add= argument is only available in some high-level plotting functions. Alternatives include using the new= parameter via the par() function, or the plot.new() function. When new = TRUE, the next high-level plotting function will simply overlay the existing plot. BE AWARE, unlike the add= argument, using the new= par() function parameter or the plot.new() function does not guarantee that the axes from your two plots will line up exactly. In actuality, the axes will often conflict with each other. To make sure they don't, explicitly define the ranges of both axes in both high-level function invocations using the xlim= and ylim= arguments.

OTHER ADDITIONS: As we demonstrated in the first document, the traditional graphics system provides the legend() function for adding a legend or key to a plot. The legend is usually drawn within the plot region, and is located relative to user coordinates. The function has many arguments, which allow for a great deal of flexibility in the specification of the contents and the layout of the legend.
A legend can also be drawn outside the plot region (i.e., in a margin) by specifying xpd = NA using the par() function - see the 'Plotting regions & Margins' section. In turn, the location coordinates of the legend specified in the subsequent legend() function invocation should be outside the plotting region. It should be noted that it is entirely your responsibility to ensure that the legend corresponds to the plot. There is no automatic checking that data symbols in the legend match those in the plot, or that labels in the legend have any correspondence with the data.

The locator() function is a very useful one to know about and allows you to interactively select the position for graphical elements such as legends or labels. Specifically, the locator() function waits for you to select locations on the current plot using the left mouse button. This continues until n (default 512) points have been selected, or another mouse button is pressed. As an example, we could use the following code to place some informative text near an outlying point: text(locator(1), "Outlier", adj = 0). The locator() function will be ignored if the current device, such as postscript, does not support interactive pointing.

Another useful function is the rug() function, which produces a 'rug' plot along one of the axes of your existing plot. A 'rug' plot consists of a series of tick marks that represent data locations. This can be useful to represent an additional one-dimensional plot of your data (e.g., in combination with a density curve). For example, let's add a rug plot to a histogram of a random normal variable - see Figure 4.17.
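Returning to the legend placement discussed above, the following sketch draws a legend in the right-hand margin. It widens that margin with mar=, sets xpd = NA so that output is not clipped to the plot region, and positions the legend just beyond the right edge reported by par("usr"). The data, the legend labels, and the pdf(file = NULL) call (used only so that the code runs without a screen device) are assumptions made for illustration.

```r
pdf(file = NULL)                          # assumption: file-less device, no screen needed
op <- par(no.readonly = TRUE)

par(mar = c(5, 4, 4, 8), xpd = NA)        # wide right margin; no clipping to the plot region
plot(1:10, 1:10, pch = 1, xlab = "x", ylab = "y")
points(1:10, 10:1, pch = 4)

usr <- par("usr")                         # c(x1, x2, y1, y2) of the plot region
leg <- legend(x = usr[2] + 0.5, y = usr[4],          # just right of the plot region
              legend = c("rising", "falling"), pch = c(1, 4))

par(op)
dev.off()
```

The symbols passed to pch= in legend() are chosen by hand to match those used in plot() and points(), since no automatic checking takes place.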
Alternatives to the rug() function are the Hmisc package's scat1d(), datadensity(), and histSpike() functions, which add tick marks corresponding to non-missing values of the data on any of the four sides of an existing plot, a graphical representation of data density, and a high-resolution data distribution that is particularly good for very large datasets (say N > 1000), respectively.

MATHEMATICAL ANNOTATION: Any high- or low-level graphics function or graphical function's argument that draws text should accept both a normal text string (e.g., "some text"), and an R expression, which is typically the result of a call to the expression() function. If an expression is specified as the text to draw, then it is interpreted as a mathematical annotation and is formatted appropriately. For a complete description of the available mathematical annotations, invoke demo(plotmath), which steps you through several tables. In these tables, the columns of gray text show sample R expressions, and the columns of black text show the resulting output. As an example (Figure 4.18), let's generate a blank plot and then add various mathematical annotations as text to the plot - notice the use of 'double equals' (==) to print a single equal sign. We can also incorporate the paste() function into our expression - see Figure 4.19. Often, we will want to replace some variable in a mathematical annotation with its value. To do this, we need to use the substitute() function instead of the expression() function. We must specify the substitution using a list as a second argument and the value to be substituted must have previously been computed and assigned a name. See Figure 4.20 for an example.

Figure 4.16: Creating multiple plots on one page using the layout() function.

> library(Hmisc)
> op <- par(no.readonly = TRUE)
> layout(matrix(c(1, 0, 3, 2), ncol = 2, nrow = 2, byrow = TRUE), widths = c(3,
+     1), heights = c(1, 3), respect = TRUE)
> agehist <- with(pbc, hist(ageyrs, plot = FALSE))
> albumhist <- with(pbc, hist(album, plot = FALSE))
> top <- max(agehist$counts, albumhist$counts)
> par(mar = c(0, 3, 1, 1))
> barplot(albumhist$counts, axes = FALSE, ylim = c(0, top), space = 0)
> par(mar = c(3, 0, 1, 1))
> barplot(agehist$counts, axes = FALSE, xlim = c(0, top), space = 0, horiz = TRUE)
> par(mar = c(4, 4, 1, 1))
> with(pbc, plot(ageyrs ~ album, xlab = paste(label(album), " (", units(album),
+     ")", sep = ""), ylab = paste(label(ageyrs), " (", units(ageyrs),
+     ")", sep = "")))
> par(op)

[Figure 4.16 output: scatterplot of age against serum albumin with a histogram of each variable in the margins; x-axis label "Serum Albumin (mg/dL)".]

Figure 4.17: Adding a 'rug' plot to an existing plot using the low-level rug() function.

> y <- rnorm(50)
> hist(y, main = "", xlab = "", ylab = "")
> box()
> rug(y)
> box("figure")

GRAPHICAL OUTPUT: As demonstrated in the first document, when using R interactively, the main persistent record of graphical output is a window on your screen, which is also known as a GUI ('graphical user interface') screen device. When R is installed, an appropriate screen format is selected as the default screen device and this default device is opened automatically the first time any graphical output occurs.
For example, on the various Unix systems, the default GUI screen device is an X11 window. These GUI screen devices are opened by internally calling the x11(), windows(), and quartz() functions in the Linux/Unix, Windows, and Mac versions of R, respectively. By default, an X11 graphics window and a Windows graphics window are both 7 inches by 7 inches. A Mac graphics window is, by default, 5 inches by 5 inches. The three functions mentioned all have height= and width= arguments that can be used to open new screen devices of the desired size.

It is possible to have more than one screen device open at the same time, but only one device is currently 'active' and all graphics output is sent to that device. If multiple devices are open, there are functions to control which device is active. Each open device is associated with a name (e.g., "X11" or "postscript") and a number - the "null device" is always device 1. The list of open devices can be obtained using the dev.list() function, which returns the name and number of all open devices, except device 1, the null device. The dev.cur() function returns the number and name of only the active device, or 1, the null device, if none is active. The dev.set() function can be used to make a device active by specifying the appropriate device number, and the dev.next() and dev.prev() functions can be used to make the next/previous device on the device list the active device. Be aware that the display list can consume a reasonable amount of memory if a plot is particularly complex or if there are very many devices open at the same time. The dev.off() function shuts down the specified (by default the current) device, and the graphics.off() function shuts down all open graphics devices.

It is also possible to partition the active screen device with the split.screen() function. For example, split.screen(c(1, 2)) divides the device into two parts, which can be selected with screen(1) or screen(2).
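As a minimal sketch of the partitioning just described, the following code splits a device into two side-by-side screens and draws one plot in each. Opening a PDF device with file = NULL (so that nothing is written to disk) and the simulated data are assumptions made only to keep the example self-contained; in an interactive session the default screen device would serve equally well.

```r
# Assumption: a file-less PDF device is used so the sketch runs without a GUI.
pdf(file = NULL)

# Divide the active device into 1 row and 2 columns of screens.
screens <- split.screen(c(1, 2))

screen(1)                           # make the left-hand screen active
plot(1:10, main = "Screen 1")

screen(2)                           # make the right-hand screen active
hist(rnorm(50), main = "Screen 2")

# Remove all screen partitions, then shut down the device.
close.screen(all.screens = TRUE)
dev.off()
```

The value returned by split.screen() holds the numbers of the newly created screens, which are the values passed to screen().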
A part of the partitioned screen device can itself be divided again with the split.screen() function, which makes complex partitions possible. The erase.screen() function clears a single screen, and the close.screen() function removes the specified screen partition. These functions are totally incompatible with the other mechanisms for arranging plots on a device (i.e., the par() function's mfrow= argument and the layout() function), and with some plotting functions, such as the coplot() function.

Also, in the Windows version of R, you can save a 'history' of your graphs by activating the Recording feature under the History drop-menu (seen when the graphics window is selected). You can then access old graphs by using the Page Up and Page Down keys.

As we mentioned in the first document, it is possible to produce a file that contains your plot. Similar to the screen devices, the graphical output can be directed to a particular file device, which dictates the output format that will be produced. And like the screen devices, the file devices are controlled by specific functions: the postscript() function produces an Adobe PostScript file; the pdf() function produces an Adobe PDF file; the pictex() function produces a LaTeX PicTeX file; the xfig() function produces an XFIG file; the bitmap() function produces a GhostScript conversion to a file; the png() function produces a PNG bitmap file; and the jpeg() function produces a JPEG bitmap file. On the Windows version of R, there are also the win.metafile() and bmp() file device functions, which produce a Windows Metafile file and a Windows BMP file, respectively. Like the screen device functions, these file device functions allow you to specify things such as the name of the file and the size of the plot.

Figure 4.18: Examples of mathematical annotation in a graph using the expression() function.

> par(mar = c(1, 1, 1, 1))
> plot(0:10, 0:10, type = "n", axes = FALSE)
> text(1, 10, expression(x %+-% y), cex = 1.5)
> text(1, 9, expression(x[i]), cex = 1.5)
> text(1, 8, expression(x^2), cex = 1.5)
> text(1, 7, expression(sqrt(x)), cex = 1.5)
> text(1, 6, expression(sqrt(x, 3)), cex = 1.5)
> text(1, 5, expression(x != y), cex = 1.5)
> text(1, 4, expression(x <= y), cex = 1.5)
> text(1, 3, expression(hat(x)), cex = 1.5)
> text(1, 2, expression(tilde(x)), cex = 1.5)
> text(1, 1, expression(bar(x)), cex = 1.5)
> text(1, 0, expression(x %<=>% y), cex = 1.5)
> text(4, 10, expression(Alpha + Omega), cex = 1.5)
> text(4, 9, expression(alpha + omega), cex = 1.5)
> text(4, 8, expression(45 * degree), cex = 1.5)
> text(4, 7, expression(frac(x, y)), cex = 1.5)
> text(4, 5.5, expression(sum(x[i], i = 1, n)), cex = 1.5)
> text(4, 4, expression(prod(plain(P)(X == x), x)), cex = 1.5)
> text(4, 2.5, expression(integral(f(x) * dx, a, b)), cex = 1.5)
> text(4, 0.5, expression(lim(f(x), x %->% 0)), cex = 1.5)
> text(8, 10, expression(x^y + z), cex = 1.5)
> text(8, 9, expression(x^(y + z)), cex = 1.5)
> text(8, 8, expression(x^{
+     y + z
+ }), cex = 1.5)
> text(8, 6, expression(hat(beta) == (X^t * X)^{
+     -1
+ } * X^t * y), cex = 1.5)
> text(8, 4, expression(bar(x) == sum(frac(x[i], n), i == 1, n)), cex = 1.5)
> text(8, 2, expression(paste(frac(1, sigma * sqrt(2 * pi)), " ", plain(e)^{
+     frac(-(x - mu)^2, 2 * sigma^2)
+ })), cex = 1.5)
> box("figure")

Figure 4.19: Examples of mathematical annotation in a graph using the paste() function in conjunction with the expression() function.

> x <- seq(-4, 4, length = 101)
> plot(x, sin(x), type = "l", xaxt = "n", xlab = expression(paste("Phase Angle ",
+     phi)), ylab = expression("sin " * phi))
> axis(side = 1, at = c(-pi, -pi/2, 0, pi/2, pi), label = expression(-pi,
+     -pi/2, 0, pi/2, pi))
> box("figure")

Figure 4.20: Examples of mathematical annotation in a graph using the substitute() function.

> x <- 1:10
> alpha <- 4
> plot(x, x^alpha, xlab = "x", ylab = expression(x^alpha), main = substitute(paste("Power plot of ",
+     x^alpha, " for ", alpha == ch.a), list(ch.a = alpha)))
> box("figure")

Unlike sending graphical output to a window on your screen, directing graphical output to a file takes a few more steps. A file device must be created, or 'opened for writing', in order to receive graphical output; you do this by invoking the desired file device function with at least the desired file name specified. Once you have opened the file for writing, you then run all of your desired graphical function invocations. In turn, the specific file device that you opened converts the graphical function invocations from R (e.g., 'draw a line') into commands that the particular device can understand (e.g., PostScript commands if you used the postscript() function). When you have finished writing the desired graphical output to the file, you then close the file for writing (and therefore close the file device) by invoking the dev.off() function. For example, let's write a simple scatterplot to a PDF file named myplot.pdf:

> pdf("myplot.pdf", height = 8.5, width = 11)
> with(pbc, plot(age.years ~ chol, main = "Age (years) vs. Serum Chol",
+     xlab = label(chol), ylab = label(age.years)))
> dev.off()

In the previous code, we also specified that the plot should be in a landscape orientation (i.e., height < width), with a height of 8.5" and a width of 11.0". Because all graphical output is directed to a file when using a file device function, it is best to use the appropriate screen device function to specify the size of the graphics window on your screen, and to direct all intermediate iterations of the graphical output to the window.
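One way to follow this advice, sketched below, is to wrap the plotting calls in a small function so that the same code can be sent first to a screen device and later to a file device. The helper draw_scatter(), its made-up data, and the temporary output path are hypothetical choices made for illustration; they are not part of the pbc example.

```r
# Hypothetical helper: holds every graphical call needed to draw the plot.
draw_scatter <- function() {
  x <- 1:20
  plot(x, x^2, main = "Draft scatterplot", xlab = "x", ylab = "x squared")
}

# While iterating interactively, open a screen device of the intended size
# and redraw as often as needed, e.g.:
#   dev.new(width = 11, height = 8.5)
#   draw_scatter()

# Afterwards, the same calls can be sent to a file device of the same size.
out_file <- file.path(tempdir(), "myplot.pdf")   # assumed output location
pdf(out_file, height = 8.5, width = 11)
draw_scatter()
dev.off()
```

Keeping the drawing code in one function means the on-screen drafts and the file output are produced by identical calls.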
Once the graph has been 'finalized,' you can then use one of the file device functions to write the graph to a file. Also, when using the Windows version of R, you do not necessarily need to save the graph as its own file using a file device function. Specifically, in the Windows version of R, right-clicking on the graphics window offers you three options for outputting any desired graph from R: (1) Copy your graph as either a metafile or a bitmap; (2) Save your graph as either a metafile or postscript; and/or (3) Print your graph. For the Copy and Save options, it is very easy to then Paste or Insert, respectively, your graph into a Microsoft Word and/or PowerPoint document.

For a screen device, starting a new page involves clearing the window before producing more output. On the other hand, only certain file devices allow multiple pages of output. For example, PostScript and PDF allow multiple pages, but PNG does not. It is usually possible, especially for devices that do not support multiple pages of output, to specify that each page of output produces a separate file. This is achieved by specifying the argument onefile = FALSE when opening a device and specifying a pattern for the file name like file = "myplot%03d". The %03d is replaced by a three-digit number (padded with zeros) indicating the 'page number' for each file that is created.
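The file-name pattern described above can be sketched as follows for the pdf() device. The temporary output directory and the two simulated plots are illustrative assumptions only; any sequence of high-level plotting calls would start new pages in the same way.

```r
# Assumption: output goes to a temporary sub-directory created for this sketch.
out_dir <- file.path(tempdir(), "multipage_demo")
dir.create(out_dir, showWarnings = FALSE)

# %03d is replaced by the zero-padded page number of each new page.
pdf(file = file.path(out_dir, "myplot%03d.pdf"), onefile = FALSE)
plot(1:10, main = "Page 1")         # written to myplot001.pdf
hist(rnorm(50), main = "Page 2")    # written to myplot002.pdf
dev.off()

# List the files that were produced, one per page.
created <- list.files(out_dir, pattern = "^myplot[0-9]{3}[.]pdf$")
print(created)
```

Each high-level plotting call starts a new page, so the two calls produce two separate files.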