Advanced indexing techniques
Lukáš Lehotský and Petr Ocelík
Selecting: operators
Selection type      Applicable to               Operator
integer/logical     vector, matrix, df, list    [ index ]
integer/logical     vector, matrix, df, list    [[ index ]]
variable name       vector, matrix, df, list    [ "name" ] or [[ "name" ]]
variable name       df, list                    $name
slot name           S4 objects                  @name
special operator    vector, matrix, df, list    %in%
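A minimal sketch of the operators in the table, using two small objects (person and scores) that are made up for illustration and do not come from the course data:

```r
# Hypothetical objects, for illustration only
person <- list(name = "Ada", langs = c("R", "SQL"))
scores <- data.frame(id = 1:3, points = c(10, 25, 17))

person["langs"]               # [ ] keeps the list wrapper (a list of length 1)
person[["langs"]]             # [[ ]] returns the character vector itself
person$langs                  # $ selects the same element by name
scores[2, "points"]           # row 2 of the column named "points"
scores$points %in% c(10, 17)  # is each value found in the set on the right?
```

The @ operator is left out of the sketch because it applies to slots of S4 objects, not to vectors, lists or data frames.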
Selecting: indexes
• [ ] accesses an object’s internal structure
  • Returns the selection together with its “data container”
  • For vectors, matrices and data frames, it returns the selected data elements
  • For lists, it returns the selected object inside its list wrapper
• [[ ]] accesses a single nested item in the internal data structure
  • Returns exactly one item of the object’s internal structure
  • Useful for accessing objects within lists
Selecting: indexes
vect <- c(2, 6, 9)
str(vect)
num [1:3] 2 6 9
vect[3]
[1] 9
vect[[3]]
[1] 9
vect[1:3]
[1] 2 6 9
vect[[1:3]]
Error in vect[[1:3]] ...
Selecting: indexes of nested objects – a list
ls <- list(c(0,1,2,3),
           c("car","bike"),
           "single object")
ls[2]
[[1]]                  # parent object wrapper
[1] "car" "bike"       # data elements
ls[[2]]
[1] "car" "bike"       # data elements only
ls[[2]][2]
[1] "bike"             # single element
Selecting: indexes of nested objects – a list
ls[2:3]
[[1]]
[1] "car" "bike"
[[2]]
[1] "single object"
ls[[2:3]]
Error in ls[[2:3]] : subscript out of bounds
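The difference between the two operators, and the reason for the error above, can be checked directly. The sketch below reuses the list from the slides. The use of class(), length() and tryCatch() is an addition for illustration.

```r
ls <- list(c(0, 1, 2, 3), c("car", "bike"), "single object")

class(ls[2])     # "list": the element is still wrapped in a list
class(ls[[2]])   # "character": the vector itself
length(ls[2:3])  # 2: [ ] with several indexes returns a shorter list

# With a vector index, [[ ]] descends level by level:
# ls[[c(2, 2)]] is the same as ls[[2]][[2]]
ls[[c(2, 2)]]    # "bike"

# ls[[2:3]] therefore asks for ls[[2]][[3]], which does not exist
res <- tryCatch(ls[[2:3]], error = function(e) conditionMessage(e))
res
```

So [[ ]] does not refuse a longer index on a list. It reads it as a path into the nested structure, and the path 2, 3 leads nowhere here.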
Advanced use of indexes: vectors
• Indexing accepts any result that provides either numeric
indexes or logical values
• Existing objects containing index information
• Vectors of TRUE/FALSE values from logical evaluations
• Functions generating index information/providing logical evaluations
• Useful to subset data
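The three sources of index information listed above can be sketched on the vector used earlier. The object names pos and keep are illustrative assumptions.

```r
vect <- c(2, 6, 9)

pos <- c(1, 3)                # an existing object holding positions
vect[pos]                     # 2 9

keep <- c(TRUE, FALSE, TRUE)  # a logical vector of the same length
vect[keep]                    # 2 9

vect[which(vect > 5)]         # which() turns TRUE/FALSE into positions: 6 9
vect[vect > 5]                # the logical test used directly: 6 9
```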
Advanced use of indexes: logical statement
• A logical test applied to a vector will create a vector of
logical values (TRUE/FALSE)
• Such a vector may serve as an index
vect <- c(2, 6, 9)
vect == 9
[1] FALSE FALSE TRUE
index <- vect == 9
vect[index]
[1] 9
Advanced use of indexes: logical operators
Operator   Description
<          Left is smaller than right
>          Left is larger than right
<=         Left is smaller than or equal to right
>=         Left is larger than or equal to right
==         Left is equal to right
!=         Left is not equal to right
!          Negation
&          AND – allows test combinations, all logical statements must be true
|          OR – allows test combinations, at least one statement must be true
Source: Adler, J. (2012). R in a Nutshell, p. 86.
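As a quick illustration, each operator from the table can be applied to the vector from the earlier slides. The values in the comments follow from vect <- c(2, 6, 9).

```r
vect <- c(2, 6, 9)

vect < 6                   # TRUE FALSE FALSE
vect <= 6                  # TRUE  TRUE FALSE
vect != 6                  # TRUE FALSE  TRUE
!(vect == 6)               # negation of ==, same result as vect != 6
vect > 2 & vect < 9        # FALSE  TRUE FALSE (both tests must hold)
vect < 6 | vect > 6        # TRUE FALSE  TRUE (at least one test must hold)
vect[vect > 2 & vect < 9]  # 6
```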
Advanced use of indexes: function results
• Most basic use case – locating missing values in variables
• We can either select the rows with missing data or, conversely, get rid
of them
• Function is.na()
• Logical test for the presence of an “NA” value
• Returns a vector of logical values TRUE/FALSE
vect.na <- c(1,0,1,2,2,NA,NA,2,1)
Advanced use of indexes: function results
vect.na <- c(1,0,1,2,2,NA,NA,2,1)
is.na(vect.na)
[1] FALSE FALSE FALSE FALSE FALSE TRUE TRUE FALSE FALSE
index <- is.na(vect.na)
vect.na[index]
[1] NA NA
index.o1 <- !is.na(vect.na) # option 1
index.o2 <- is.na(vect.na) == FALSE # option 2
vect.na[index.o1]
[1] 1 0 1 2 2 2 1
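A short sketch of why is.na() is needed and what the cleaned vector is good for. The helper object clean and the calls to sum(), which() and mean() are additions for illustration.

```r
vect.na <- c(1, 0, 1, 2, 2, NA, NA, 2, 1)

vect.na == NA            # all NA: comparing with NA never yields TRUE or FALSE
sum(is.na(vect.na))      # number of missing values: 2
which(is.na(vect.na))    # their positions: 6 7

clean <- vect.na[!is.na(vect.na)]  # keep only the observed values
mean(vect.na)            # NA, because the missing values propagate
mean(clean)              # mean of the seven observed values
```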
Advanced use of indexes: combining tests
index <- !is.na(vect.na) & vect.na >= 1.5
vect.na[index]
[1] 2 2 2
index <- is.na(vect.na) | vect.na >= 1.5
vect.na[index]
[1] 2 2 NA NA 2
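The results above depend on how R combines NA with TRUE and FALSE. The following sketch spells out those rules on single values and then shows what happens when the NA test is left out of the condition.

```r
vect.na <- c(1, 0, 1, 2, 2, NA, NA, 2, 1)

NA >= 1.5   # NA: a comparison with a missing value is unknown
NA & FALSE  # FALSE: the result is FALSE whatever the unknown value is
NA & TRUE   # NA
NA | TRUE   # TRUE: the result is TRUE whatever the unknown value is
NA | FALSE  # NA

# Without the is.na() test the index itself contains NA,
# and an NA index returns NA
vect.na[vect.na >= 1.5]  # 2 2 NA NA 2
```

This is why the AND version starts with !is.na(vect.na): it turns the unknown comparisons into FALSE, so the missing values are dropped.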
Advanced use of indexes: logical tests on data
frames
• Logical tests may be used to filter data frames
• TRUE/FALSE statements index row dimension (unless
specifically intended to subset columns)
cars type price consumption
1 BMW 3 1200000 6.2
2 Audi A4 1164000 5.9
3 VW Passat 950500 5.9
Advanced use of indexes: logical tests on data
frames
• Problem at hand – filter the data set by cars costing more
than 1 000 000 units
• Select column containing price
• Find values over 1 000 000
• Use the result of logical test to filter the data frame
cars type price consumption
1 BMW 3 1200000 6.2
2 Audi A4 1164000 5.9
3 VW Passat 950500 5.9
Advanced use of indexes: logical tests on data
frames
• Select column containing price
• We know it is the third column
• We may use index to extract the third column
• We get a vector with the values of the column price (by default, a single column is simplified to a vector, drop = TRUE)
df[ ,3]
[1] 1200000 1164000 950500
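The simplification to a vector can be inspected as follows. The slides show only the printed data frame, so the data.frame() call below is a reconstruction, and the names price.vec and price.df are illustrative.

```r
# Reconstruction of the example data (assumed from the printed output)
df <- data.frame(cars = c("BMW", "Audi", "VW"),
                 type = c("3", "A4", "Passat"),
                 price = c(1200000, 1164000, 950500),
                 consumption = c(6.2, 5.9, 5.9),
                 stringsAsFactors = FALSE)

price.vec <- df[ , 3]                # simplified to a numeric vector
price.df  <- df[ , 3, drop = FALSE]  # stays a one-column data frame

is.vector(price.vec)      # TRUE
is.data.frame(price.df)   # TRUE
dim(price.df)             # 3 1
```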
Advanced use of indexes: logical tests on data
frames
• Use logical function to evaluate the car price
• We use the indexed column and add logical evaluation
• The result of evaluation is a vector of logical TRUE/FALSE values
df[ ,3]
[1] 1200000 1164000 950500
df[ ,3] > 1000000
[1] TRUE TRUE FALSE
Advanced use of indexes: logical tests on data
frames
• The vector provides information over rows
df[ ,3] > 1000000
[1] TRUE TRUE FALSE
df
cars type price consumption
1 BMW 3 1200000 6.2
2 Audi A4 1164000 5.9
3 VW Passat 950500 5.9
df[ ,3,drop = FALSE] > 1000000 # just a demonstration
price
[1,] TRUE
[2,] TRUE
[3,] FALSE
Advanced use of indexes: logical tests on data
frames
• Use the result of logical test to create the condition for dataframe
filtering
• The condition thus applies to rows
df[ ,3] > 1000000
[1] TRUE TRUE FALSE
condition <- df[ ,3] > 1000000
df[ condition , ]
cars type price consumption
1 BMW 3 1200000 6.2
2 Audi A4 1164000 5.9
Advanced use of indexes: logical tests on data
frames
• The filtered data frame must be assigned to a new object to be kept in the environment
df[ ,3] > 1000000
[1] TRUE TRUE FALSE
condition <- df[ ,3] > 1000000
df[ condition , ]
cars type price consumption
1 BMW 3 1200000 6.2
2 Audi A4 1164000 5.9
df.sub <- df[ condition , ]
Advanced use of indexes: logical tests on data
frames
• Alternative – use $ to call a variable
• Works only when variables have names
df$price > 1000000
[1] TRUE TRUE FALSE
condition <- df$price > 1000000
df[ condition , ]
cars type price consumption
1 BMW 3 1200000 6.2
2 Audi A4 1164000 5.9
df.sub <- df[ condition , ]
Advanced use of indexes: logical tests on data
frames
• Alternative – use the filter directly in the square brackets
without a dedicated object
• Harder to read, so there is a higher risk of getting it wrong
df[ df$price > 1000000 , ]
cars type price consumption
1 BMW 3 1200000 6.2
2 Audi A4 1164000 5.9
df[ df[ ,3] > 1000000 , ]
cars type price consumption
1 BMW 3 1200000 6.2
2 Audi A4 1164000 5.9
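One way to lower that risk is to check that the different spellings return the same rows. This is a sketch on a reconstructed copy of the example data. The data.frame() call and the names by.name, by.pos and by.obj are assumptions.

```r
# Reconstruction of the example data (assumed from the printed output)
df <- data.frame(cars = c("BMW", "Audi", "VW"),
                 type = c("3", "A4", "Passat"),
                 price = c(1200000, 1164000, 950500),
                 consumption = c(6.2, 5.9, 5.9),
                 stringsAsFactors = FALSE)

by.name <- df[df$price > 1000000, ]    # condition written inline, by name
by.pos  <- df[df[ , 3] > 1000000, ]    # condition written inline, by position

condition <- df$price > 1000000        # condition stored in its own object
by.obj <- df[condition, ]

identical(by.name, by.pos)             # TRUE
identical(by.name, by.obj)             # TRUE
nrow(by.obj) == sum(condition)         # TRUE: one row per TRUE in the condition
```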
Advanced use of indexes: logical tests on data
frames
• Combinations of filters are possible
df[ ,3] > 1000000
[1] TRUE TRUE FALSE
condition <- df[ ,3] > 1000000
df[ condition , 1 ]
[1] BMW Audi
df[ condition , "consumption" ]
[1] 6.2 5.9
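A row condition can be combined with several columns or with a second test. The sketch below uses a reconstructed copy of the example data, so the data.frame() call is an assumption.

```r
# Reconstruction of the example data (assumed from the printed output)
df <- data.frame(cars = c("BMW", "Audi", "VW"),
                 type = c("3", "A4", "Passat"),
                 price = c(1200000, 1164000, 950500),
                 consumption = c(6.2, 5.9, 5.9),
                 stringsAsFactors = FALSE)

condition <- df$price > 1000000

df[condition, "consumption"]             # 6.2 5.9
df[condition, c("cars", "consumption")]  # two rows, two columns

# Two tests on rows plus a column selection
df[df$price > 1000000 & df$consumption < 6, "cars"]  # "Audi"
```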
Advanced use of indexes: logical tests on data
frames
• Problem at hand – select data with two or more conditions
• Select cars which are BMW or Audi
• The straightforward approach does not work: == compares element by element and recycles the shorter vector
df$cars == c( "BMW" , "Audi" )
Warning message:
In df$cars == c("BMW", "Audi") : longer object
length is not a multiple of shorter object length
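The warning points at recycling: the two-element vector is repeated to match the longer one and compared position by position. In the sketch below, the vector cars.alt with a different row order is a hypothetical example, and suppressWarnings() is used only to keep the output short.

```r
cars     <- c("BMW", "Audi", "VW")  # order used in the slides
cars.alt <- c("Audi", "BMW", "VW")  # hypothetical: same makes, other order

# The right-hand side is recycled to "BMW", "Audi", "BMW"
res     <- suppressWarnings(cars == c("BMW", "Audi"))
res.alt <- suppressWarnings(cars.alt == c("BMW", "Audi"))

res      # TRUE TRUE FALSE: looks right, but only because of the row order
res.alt  # FALSE FALSE FALSE: both makes are present, yet nothing matches
```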
Advanced use of indexes: logical tests on data
frames
• Problem at hand – select data with two or more conditions
• Select cars which are BMW or Audi
• Approach using logical operators
• Operator AND (“&”) – all conditions must be true at once
• Operator OR (“|”) – at least one condition must be true
df$cars == "BMW" | df$cars == "Audi"
[1] TRUE TRUE FALSE
condition <- df$cars == "BMW" | df$cars == "Audi"
df[condition, ]
cars type price consumption
1 BMW 3 1200000 6.2
2 Audi A4 1164000 5.9
Advanced use of indexes: logical tests on data
frames
• Problem at hand – select data with two or more conditions
• Select cars which are BMW or Audi
• Alternative – use special operator “%in%”
• left %in% right tests, for each element of left, whether it occurs in right
df$cars %in% c( "BMW" , "Audi" )
[1] TRUE TRUE FALSE
condition <- df$cars %in% c( "BMW" , "Audi" )
df[condition, ]
cars type price consumption
1 BMW 3 1200000 6.2
2 Audi A4 1164000 5.9
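The direction of %in% is easiest to see from the length of the result, which always follows the left-hand side. This is a small sketch on a plain character vector. The value "Skoda" is a made-up example.

```r
cars <- c("BMW", "Audi", "VW")

cars %in% c("BMW", "Audi")     # TRUE TRUE FALSE (one value per element of cars)
c("BMW", "Audi") %in% cars     # TRUE TRUE (one value per element on the left)
!(cars %in% c("BMW", "Audi"))  # FALSE FALSE TRUE (everything except BMW and Audi)
"Skoda" %in% cars              # FALSE
```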
Advanced use of indexes: ordering data frame
• Ordering a data frame is the most common use case of this
logic
• Function order() returns the row indexes in sorted order
• Problem at hand
• Order the data frame alphabetically by car make
cars type price consumption
1 BMW 3 1200000 6.2
2 Audi A4 1164000 5.9
3 VW Passat 950500 5.9
Advanced use of indexes: ordering data frame
order(df$cars)
[1] 2 1 3
condition <- order(df$cars)
df[ condition , ]
cars type price consumption
2 Audi A4 1164000 5.9
1 BMW 3 1200000 6.2
3 VW Passat 950500 5.9
df.ord <- df[ condition , ]
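The same pattern covers descending order and ordering by more than one column. This is a sketch on a reconstructed copy of the example data, so the data.frame() call is an assumption.

```r
# Reconstruction of the example data (assumed from the printed output)
df <- data.frame(cars = c("BMW", "Audi", "VW"),
                 type = c("3", "A4", "Passat"),
                 price = c(1200000, 1164000, 950500),
                 consumption = c(6.2, 5.9, 5.9),
                 stringsAsFactors = FALSE)

order(df$price)                       # 3 2 1: cheapest first
order(df$price, decreasing = TRUE)    # 1 2 3: most expensive first

# Consumption ascending; ties broken by price, highest first
order(df$consumption, -df$price)      # 2 3 1

df.by.price <- df[order(df$price, decreasing = TRUE), ]
df.by.price$cars                      # "BMW" "Audi" "VW"
```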
Practice 1
• Install and load the package “poliscidata” (or download it
from the IS)
• Create a new object containing the dataset “world”
• Extract countries into a separate vector
• Extract the Czech Republic row from the data frame
• Extract all V4 countries (CZ, SK, HU, PL) as a subset of the
world dataset
• Extract only freedom indicators for V4 countries (all column
names starting with “free_” – manual index numbers)
Practice 2
• Load dataset “states” from the “poliscidata” package into
your environment (or download it from the IS)
• Order the dataset according to Obama 2012 election results
(highest to lowest)
• Subset states on variable “gay_policy” – extract only states
which are deemed as “liberal”
• Subset states on variable “gay_policy” and “secularism3” –
extract only states which are deemed as “liberal”, but are
neither deemed “secular” nor “religious”
• Extract only names of the states from the previous step
• Find the state which has a missing (“NA”) value in the
“secularism3” variable