9  Indexing

An index is used to pick elements of a data structure (i.e. a vector, matrix, array, list, data frame, etc.). You can select (or exclude) one or multiple elements at a time.

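There are three types of index vectors you can use in R to identify elements of an object:

  • Integer vector: positions of the elements to include or, with a minus sign in front, to exclude
  • Logical vector: TRUE or FALSE for each element, marking whether to include it
  • Character vector: names of the elements to select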

Integer indexing in R is 1-based, meaning the first item of a vector is in position 1. In contrast, many programming languages use 0-based indexing where the first element is in the 0th position, the second in the 1st, and the nth in the n-1 position.
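
As a minimal sketch of the three index types and of 1-based positions, the following uses a small named vector. The vector `v` and its element names are made up for illustration:

```r
# Hypothetical named vector, for illustration only
v <- c(a = 10, b = 20, c = 30)

v[1]                      # integer index: the first element (positions start at 1)
v[c(TRUE, FALSE, TRUE)]   # logical index: keep positions 1 and 3
v["b"]                    # character index: select by name
v[0]                      # position 0 selects nothing and returns an empty vector
```

Note that `v[0]` does not raise an error in R; it silently returns a zero-length vector.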

To understand indexing, make sure you are very comfortable with the core R data structures: vectors, matrices, arrays, lists, and data.frames.

What is indexing used for?

Indexing can be used to get values from an object or to set values in an object.

The main indexing operator in R is the square bracket [.

To extract a single element of a list, use double square brackets [[.

9.1 Vectors

Start with a simple vector:

x <- 15:24
x
 [1] 15 16 17 18 19 20 21 22 23 24

9.1.1 Integer Index

Get the 5th element of a vector:

x[5]
[1] 19

Get elements 6 through 9 of the same vector:

x[6:9]
[1] 20 21 22 23

An integer index can be used to reverse the order of elements:

x[5:3]
[1] 19 18 17

Note that an integer index can be used to repeat elements. This is often done by accident, when someone passes the wrong vector as an index, so beware.

x[c(1, 1, 1, 4)]
[1] 15 15 15 18

9.1.2 Logical Index

Logical indexes are usually created as the output of a logical operation, i.e. an elementwise comparison.

Select elements with value greater than 19:

idl <- x > 19

The above comparison is vectorized (Chapter 15), meaning that the comparison is performed elementwise and the result is a logical vector of the same length as the original vector.

idl
 [1] FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE

You can pass the logical vector as an index to the original vector to get the elements that correspond to TRUE in the logical vector:

x[idl]
[1] 20 21 22 23 24

Logical vectors can be created directly in the brackets:

x[x > 19]
[1] 20 21 22 23 24
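
Because comparisons are elementwise, several conditions can be combined with `&` (and) or `|` (or) before indexing. A short sketch, using a separate vector `v` (a name chosen here for illustration) with the same values as above:

```r
v <- 15:24

v[v > 19 & v %% 2 == 0]   # greater than 19 AND even: 20 22 24
v[v < 17 | v > 22]        # less than 17 OR greater than 22: 15 16 23 24
```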

9.1.3 Extract vs. Replace i.e. Get vs. Set

x <- c(24, 32, 41, 37, 999, 999, 999)

Indexing allows you to access specific elements, for example to perform calculations on them.

Get the mean of elements 1 through 4:

mean(x[1:4])
[1] 33.5

You can combine indexing with assignment to replace elements of an object.

Replace values in elements 1:4 with their log:

x[1:4] <- log(x[1:4])
x
[1]   3.178054   3.465736   3.713572   3.610918 999.000000 999.000000 999.000000

Replace elements that are equal to 999 with NA:

x[x == 999] <- NA
x
[1] 3.178054 3.465736 3.713572 3.610918       NA       NA       NA
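
Once values have been replaced with NA, a comparison such as `v == NA` can no longer be used to find them, because any comparison with NA returns NA. The sketch below, which uses its own vector `v` for illustration, relies on `is.na()` instead:

```r
v <- c(24, 32, 41, 37, 999, 999, 999)
v[v == 999] <- NA       # replace the placeholder value with NA

v == NA                 # all NA: comparison with NA is never TRUE or FALSE
sum(is.na(v))           # number of missing values: 3
v[!is.na(v)]            # keep only non-missing values: 24 32 41 37
mean(v, na.rm = TRUE)   # 33.5
```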

9.2 Matrices

Reminder:

  • A matrix is a 2D vector and contains elements of the same type (numeric, integer, character, etc.).
  • A data frame is a 2D list and each column can contain a different data type.

To index a 2D structure, whether a matrix or data frame, we use the form: [row, column].

The following indexing operations are therefore the same whether applied on a matrix or a data frame:

mat <- matrix(21:60, nrow = 10)
colnames(mat) <- paste0("Feature_", seq(ncol(mat)))
rownames(mat) <- paste0("Row_", seq(nrow(mat)))
mat
       Feature_1 Feature_2 Feature_3 Feature_4
Row_1         21        31        41        51
Row_2         22        32        42        52
Row_3         23        33        43        53
Row_4         24        34        44        54
Row_5         25        35        45        55
Row_6         26        36        46        56
Row_7         27        37        47        57
Row_8         28        38        48        58
Row_9         29        39        49        59
Row_10        30        40        50        60
df <- as.data.frame(mat)
df
       Feature_1 Feature_2 Feature_3 Feature_4
Row_1         21        31        41        51
Row_2         22        32        42        52
Row_3         23        33        43        53
Row_4         24        34        44        54
Row_5         25        35        45        55
Row_6         26        36        46        56
Row_7         27        37        47        57
Row_8         28        38        48        58
Row_9         29        39        49        59
Row_10        30        40        50        60

To get the contents of the fifth row, second column:

mat[5, 2]
[1] 35
df[5, 2]
[1] 35

The following examples use matrices, but the same [row, column] syntax works on data.frames.

If you want to select an entire row or an entire column, you leave the row or column index blank, but you must include a comma:

Get the first row:

mat[1, ]
Feature_1 Feature_2 Feature_3 Feature_4 
       21        31        41        51 

Get the second column:

mat[, 2]
 Row_1  Row_2  Row_3  Row_4  Row_5  Row_6  Row_7  Row_8  Row_9 Row_10 
    31     32     33     34     35     36     37     38     39     40 

Note that colnames and rownames were added to the matrix above for convenience - if they are absent, there are no labels above each element.

You can define ranges for both rows and columns:

mat[6:7, 2:4]
      Feature_2 Feature_3 Feature_4
Row_6        36        46        56
Row_7        37        47        57

You can use vectors to specify any combination of rows and columns.

Get rows 2, 4, and 7 of columns 1, 4, and 3:

mat[c(2, 4, 7), c(1, 4, 3)]
      Feature_1 Feature_4 Feature_3
Row_2        22        52        42
Row_4        24        54        44
Row_7        27        57        47
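
When row and column names are present, character vectors can be used in the same [row, column] form. A minimal sketch that rebuilds the same matrix under the name `m` (chosen here so the example is self-contained):

```r
m <- matrix(21:60, nrow = 10)
colnames(m) <- paste0("Feature_", seq(ncol(m)))
rownames(m) <- paste0("Row_", seq(nrow(m)))

m["Row_5", "Feature_2"]                               # single element: 35
m[c("Row_2", "Row_4"), c("Feature_1", "Feature_3")]   # 2-by-2 sub-matrix
```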

Since a matrix is a vector with 2 dimensions, you can also index the underlying vector directly. Regardless of whether a matrix was created by row or by column (default), the data is stored and accessed by column. You can see that by converting the matrix to a one-dimensional vector:

as.vector(mat)
 [1] 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45
[26] 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

same as:

c(mat)
 [1] 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45
[26] 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

For example, mat has 10 rows and 4 columns, so its 11th element is in row 1, column 2. Indexing by a single position in this way works only with matrices, not data.frames:

mat[11]
[1] 31

is the same as:

mat[1, 2]
[1] 31
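
Because storage is by column, the position of element [i, j] in the underlying vector is (j - 1) * nrow + i. The sketch below checks this on a matrix `m` with the same values as above and uses base R's `arrayInd()` to go in the opposite direction; the names `m`, `i`, `j`, and `k` are chosen for illustration:

```r
m <- matrix(21:60, nrow = 10)

i <- 5
j <- 2
k <- (j - 1) * nrow(m) + i   # linear position of element [5, 2]: 15

m[k] == m[i, j]              # TRUE: both return 35
arrayInd(k, dim(m))          # recovers row 5, column 2 from position 15
```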

9.2.1 Matrix of indexes

This is less common, but potentially useful. It allows you to specify a series of individual [i, j] indexes, i.e. it is a way to select multiple individual, non-contiguous elements:

idm <- matrix(c(2, 4, 7, 4, 3, 1), nrow = 3)
idm
     [,1] [,2]
[1,]    2    4
[2,]    4    3
[3,]    7    1

An n-by-2 matrix can be used as an index: each of its n rows is read as one [row, column] pair. Therefore, the above matrix will return elements [2, 4], [4, 3], [7, 1]:

mat[idm]
[1] 52 44 27
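
An index matrix can also be built with `cbind()` and used for replacement, not only extraction. A small sketch, assuming a matrix `m` with the same values as above; the choice of the first four diagonal elements is arbitrary:

```r
m <- matrix(21:60, nrow = 10)

idx <- cbind(1:4, 1:4)   # pairs [1, 1], [2, 2], [3, 3], [4, 4]
m[idx]                   # 21 32 43 54

m[idx] <- 0L             # set those four elements to zero
m[1:4, ]                 # the zeros appear on the diagonal of the first four rows
```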

9.2.2 Logical index

Identify rows with a value greater than 36 in the second column. The logical index for this operation is:

mat[, 2] > 36
 Row_1  Row_2  Row_3  Row_4  Row_5  Row_6  Row_7  Row_8  Row_9 Row_10 
 FALSE  FALSE  FALSE  FALSE  FALSE  FALSE   TRUE   TRUE   TRUE   TRUE 

It can be used directly to index the matrix:

mat[mat[, 2] > 36, ]
       Feature_1 Feature_2 Feature_3 Feature_4
Row_7         27        37        47        57
Row_8         28        38        48        58
Row_9         29        39        49        59
Row_10        30        40        50        60
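
Conditions on different columns can be combined into a single logical row index. A minimal sketch on a matrix `m` with the same values as above; the thresholds are arbitrary:

```r
m <- matrix(21:60, nrow = 10)

# Rows where column 2 is above 36 AND column 4 is below 60
keep <- m[, 2] > 36 & m[, 4] < 60
which(keep)   # rows 7, 8, 9
m[keep, ]     # 3-by-4 sub-matrix
```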

Indexing a matrix or a data.frame can return either a smaller matrix/data.frame or a vector.

In general, many R functions return the simplest R object that can hold the output. As always, check function documentation to look for possible arguments that can change this and what the default behavior is. If you extract a column or a row, you get a vector:

Get the third column:

mat[, 3]
 Row_1  Row_2  Row_3  Row_4  Row_5  Row_6  Row_7  Row_8  Row_9 Row_10 
    41     42     43     44     45     46     47     48     49     50 
class(mat[, 3])
[1] "integer"

You can specify drop = FALSE to stop R from dropping the unused dimension and return a matrix or data.frame of a single column:

mat[, 3, drop = FALSE]
       Feature_3
Row_1         41
Row_2         42
Row_3         43
Row_4         44
Row_5         45
Row_6         46
Row_7         47
Row_8         48
Row_9         49
Row_10        50
df[, 3, drop = FALSE]
       Feature_3
Row_1         41
Row_2         42
Row_3         43
Row_4         44
Row_5         45
Row_6         46
Row_7         47
Row_8         48
Row_9         49
Row_10        50

Check it is still a matrix or data.frame:

class(mat[, 3, drop = FALSE])
[1] "matrix" "array" 
class(df[, 3, drop = FALSE])
[1] "data.frame"

9.3 Lists

Reminder: A list can contain elements of different classes and of different lengths:

x <- list(one = 1001:1004,
          two = sample(seq(0, 100, by = .1), size = 10),
          three = c("Neuro", "Cardio", "Radio"),
          four = median)
x
$one
[1] 1001 1002 1003 1004

$two
 [1] 93.9 44.4 90.1 70.3 79.7 18.6 49.0 82.4 75.9 41.3

$three
[1] "Neuro"  "Cardio" "Radio" 

$four
function (x, na.rm = FALSE, ...) 
UseMethod("median")
<bytecode: 0x149b46190>
<environment: namespace:stats>

9.3.1 Get single list element

You can access a single list element using:

  • double brackets [[ with either name or integer position
  • $ followed by name of the element (therefore only works if elements are named)

For example, to access the third element:

x$three
[1] "Neuro"  "Cardio" "Radio" 

same as:

x[[3]]
[1] "Neuro"  "Cardio" "Radio" 

same as:

x[["three"]]
[1] "Neuro"  "Cardio" "Radio" 

To access a list element programmatically, i.e. using a name or integer index stored in a variable, only the bracket notation works. Therefore, programmatically, you would always use double brackets to access different elements:

idi <- 3
idc <- "three"
x[[idi]]
[1] "Neuro"  "Cardio" "Radio" 
x[[idc]]
[1] "Neuro"  "Cardio" "Radio" 
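
To illustrate why `$` cannot be used programmatically, the sketch below loops over the names of a small list. The list `lst` is a made-up example:

```r
# Hypothetical list, for illustration only
lst <- list(one = 1001:1004,
            three = c("Neuro", "Cardio", "Radio"))

for (nm in names(lst)) {
  # [[nm]] uses the value stored in nm ("one", then "three")
  cat(nm, "has", length(lst[[nm]]), "elements\n")
}

lst$nm   # NULL: $ looks for an element literally named "nm"
```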

9.3.2 Get one or more list elements as a list

You can extract one or more list elements as a pruned list using single bracket [ notation. Similar to indexing of a vector, this can be either a logical, integer, or character vector:

x[3]
$three
[1] "Neuro"  "Cardio" "Radio" 
x["three"]
$three
[1] "Neuro"  "Cardio" "Radio" 
x[c(FALSE, FALSE, TRUE, FALSE)]
$three
[1] "Neuro"  "Cardio" "Radio" 

Get multiple elements:

x[2:3]
$two
 [1] 93.9 44.4 90.1 70.3 79.7 18.6 49.0 82.4 75.9 41.3

$three
[1] "Neuro"  "Cardio" "Radio" 
# same as
x[c("two", "three")]
$two
 [1] 93.9 44.4 90.1 70.3 79.7 18.6 49.0 82.4 75.9 41.3

$three
[1] "Neuro"  "Cardio" "Radio" 
# same as
x[c(FALSE, TRUE, TRUE, FALSE)]
$two
 [1] 93.9 44.4 90.1 70.3 79.7 18.6 49.0 82.4 75.9 41.3

$three
[1] "Neuro"  "Cardio" "Radio" 

9.3.3 Recursive indexing of list

Given the following list:

x <- list(PIDN = 2001:2020,
          Dept = c("Neuro", "Cardio", "Radio"),
          Age = rnorm(20, mean = 57, sd = 1.3))

We can access the 3rd element of the 2nd element:

x[[2]][3]
[1] "Radio"

or

x[[c(2, 3)]]
[1] "Radio"

This is called recursive indexing and is perhaps more often used by accident, when one instead wanted to extract the 2nd and 3rd elements:

x[c(2, 3)]
$Dept
[1] "Neuro"  "Cardio" "Radio" 

$Age
 [1] 58.31128 59.50883 58.63950 57.79283 56.48979 56.54883 57.70144 55.71086
 [9] 59.27153 58.25800 55.57688 56.71479 57.50166 59.21797 57.82313 57.24295
[17] 59.35482 58.34108 58.48001 56.11052
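
Recursive indexing is most useful with nested lists, where each value in the index vector goes one level deeper. A minimal sketch with a made-up nested list (all names and values here are hypothetical):

```r
# Hypothetical nested list, for illustration only
nested <- list(patient = list(id = 2001L,
                              labs = c(glucose = 5.4, sodium = 140)))

nested[[c("patient", "id")]]   # same as nested[["patient"]][["id"]]: 2001
nested[[c(1, 2)]]              # same as nested[[1]][[2]]: the labs vector
```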

9.3.4 Flatten list

You can convert a list to a single vector containing all individual components of the original list using unlist(). Notice how names are automatically created based on the original structure:

x <- list(alpha = sample(seq(100), size = 10),
          beta  = sample(seq(100), size = 10),
          gamma = sample(seq(100), size = 10))
x
$alpha
 [1] 97  1 75 60 82 66 61 62 43 15

$beta
 [1] 30  5 66 97 73 34  4 85 24 35

$gamma
 [1] 11 78 23 29 36 82  7 12 51 10
unlist(x)
 alpha1  alpha2  alpha3  alpha4  alpha5  alpha6  alpha7  alpha8  alpha9 alpha10 
     97       1      75      60      82      66      61      62      43      15 
  beta1   beta2   beta3   beta4   beta5   beta6   beta7   beta8   beta9  beta10 
     30       5      66      97      73      34       4      85      24      35 
 gamma1  gamma2  gamma3  gamma4  gamma5  gamma6  gamma7  gamma8  gamma9 gamma10 
     11      78      23      29      36      82       7      12      51      10 

If you want to drop the names, you can set the use.names argument to FALSE or wrap the above in unname():

unlist(x, use.names = FALSE)
 [1] 97  1 75 60 82 66 61 62 43 15 30  5 66 97 73 34  4 85 24 35 11 78 23 29 36
[26] 82  7 12 51 10
# same as
unname(unlist(x))
 [1] 97  1 75 60 82 66 61 62 43 15 30  5 66 97 73 34  4 85 24 35 11 78 23 29 36
[26] 82  7 12 51 10
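
Since a vector can hold only one type, `unlist()` coerces mixed elements to a common type. A short sketch with a made-up list that mixes integers and a character string:

```r
# Hypothetical list with elements of different types
mixed <- list(id = 1:2, dept = "Neuro")

flat <- unlist(mixed)
flat          # named character vector: id1 = "1", id2 = "2", dept = "Neuro"
class(flat)   # "character": the integers were converted to strings
```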

9.4 Data frames

Note

In data science and related fields the terms filter and select are commonly used:

  • Filter: identify cases i.e. rows
  • Select: identify variables a.k.a. features i.e. columns

We saw above that a data frame can be indexed in many of the same ways as a matrix, i.e. by defining rows and columns. At the same time, we know that a data frame is a rectangular list. Like a list, its elements are vectors of any type (integer, double, character, factor, and more) but, unlike a list, they have to be of the same length. A data frame can therefore also be indexed the same way as a list. As with list indexing, notice that some methods return a smaller data frame, while others return vectors.

Tip

You can index a data frame using all the ways you can index a list and all the ways you can index a matrix.

Let’s create a simple data frame:

x <- data.frame(Feat_1 = 21:25,
                Feat_2 = rnorm(5),
                Feat_3 = paste0("rnd_", sample(seq(100), size = 5)))
x
  Feat_1     Feat_2 Feat_3
1     21 -0.5602523 rnd_44
2     22 -1.3585561 rnd_46
3     23 -0.2466165 rnd_12
4     24 -0.7036978 rnd_87
5     25 -0.5399613 rnd_94

9.4.1 Get single column as a vector

Just like in a list, using double brackets [[ or the $ operator returns an element, i.e. a vector:

x$Feat_2
[1] -0.5602523 -1.3585561 -0.2466165 -0.7036978 -0.5399613
x[[2]]
[1] -0.5602523 -1.3585561 -0.2466165 -0.7036978 -0.5399613
x[, 2]
[1] -0.5602523 -1.3585561 -0.2466165 -0.7036978 -0.5399613

9.4.2 Get “one or more” columns as a data.frame

Accessing a column by name using single square brackets returns a single-column data.frame:

x["Feat_2"]
      Feat_2
1 -0.5602523
2 -1.3585561
3 -0.2466165
4 -0.7036978
5 -0.5399613

Accessing a column using the [row, column] form, either by position or by name, returns a vector by default:

x[, 2]
[1] -0.5602523 -1.3585561 -0.2466165 -0.7036978 -0.5399613
x[, "Feat_2"]
[1] -0.5602523 -1.3585561 -0.2466165 -0.7036978 -0.5399613

As we saw earlier, we can specify drop = FALSE to return a data.frame:

class(x[, 2, drop = FALSE])
[1] "data.frame"
class(x[, "Feat_2", drop = FALSE])
[1] "data.frame"

As in lists, all indexing and slicing operations, with the exception of the $ notation, work with a variable holding either a column name or an integer location:

idi <- 2
idc <- "Feat_2"
x[idi]
      Feat_2
1 -0.5602523
2 -1.3585561
3 -0.2466165
4 -0.7036978
5 -0.5399613
x[idc]
      Feat_2
1 -0.5602523
2 -1.3585561
3 -0.2466165
4 -0.7036978
5 -0.5399613
x[[idi]]
[1] -0.5602523 -1.3585561 -0.2466165 -0.7036978 -0.5399613
x[[idc]]
[1] -0.5602523 -1.3585561 -0.2466165 -0.7036978 -0.5399613
x[, idi]
[1] -0.5602523 -1.3585561 -0.2466165 -0.7036978 -0.5399613
x[, idc]
[1] -0.5602523 -1.3585561 -0.2466165 -0.7036978 -0.5399613
x[, idi, drop = FALSE]
      Feat_2
1 -0.5602523
2 -1.3585561
3 -0.2466165
4 -0.7036978
5 -0.5399613
x[, idc, drop = FALSE]
      Feat_2
1 -0.5602523
2 -1.3585561
3 -0.2466165
4 -0.7036978
5 -0.5399613

Extracting multiple columns returns a data frame:

x[, 2:3]
      Feat_2 Feat_3
1 -0.5602523 rnd_44
2 -1.3585561 rnd_46
3 -0.2466165 rnd_12
4 -0.7036978 rnd_87
5 -0.5399613 rnd_94
class(x[, 2:3])
[1] "data.frame"

9.4.3 Get rows

Unlike indexing a row of a matrix, indexing a row of a data.frame returns a single-row data.frame, since it contains multiple columns of potentially different types:

x[1, ]
  Feat_1     Feat_2 Feat_3
1     21 -0.5602523 rnd_44
class(x[1, ])
[1] "data.frame"

Convert into a list using c():

c(x[1, ])
$Feat_1
[1] 21

$Feat_2
[1] -0.5602523

$Feat_3
[1] "rnd_44"
class(c(x[1, ]))
[1] "list"

Convert into a (named) vector using unlist():

unlist(x[1, ])
             Feat_1              Feat_2              Feat_3 
               "21" "-0.56025226114309"            "rnd_44" 
class(unlist(x[1, ]))
[1] "character"

9.4.4 Logical index

x[x$Feat_1 > 22, ]
  Feat_1     Feat_2 Feat_3
3     23 -0.2466165 rnd_12
4     24 -0.7036978 rnd_87
5     25 -0.5399613 rnd_94
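
A logical row index can be combined with a column index to filter and select in one step. A minimal sketch using a made-up data frame `dat` with fixed values, so the result is reproducible:

```r
# Hypothetical data frame with fixed values, for illustration only
dat <- data.frame(Feat_1 = 21:25,
                  Feat_2 = c(0.5, -1.2, 0.3, -0.7, 1.1),
                  Feat_3 = c("a", "b", "c", "d", "e"))

# Filter: rows with Feat_1 > 22 and Feat_2 > 0 (rows 3 and 5)
# Select: columns Feat_1 and Feat_3
res <- dat[dat$Feat_1 > 22 & dat$Feat_2 > 0, c("Feat_1", "Feat_3")]
res
```

The original row names (3 and 5) are kept in the result.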

9.5 Logical <-> Integer indexing

In this chapter, we have learned how to use both integer and logical indexes.

Note

  • A logical index needs to be of the same dimensions as the object it is indexing (unless you really want to recycle values - see chapter on vectorization): you are specifying whether to include or exclude each element
  • An integer index is usually shorter than the object it is indexing: you are specifying which subset of elements to include (or, with a - in front, which elements to exclude)
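
To illustrate the recycling mentioned above: when a logical index is shorter than the object, R repeats it until it covers every element. A short sketch with a vector `v` chosen for illustration:

```r
v <- 21:30

# c(TRUE, FALSE) is recycled to TRUE, FALSE, TRUE, FALSE, ...
v[c(TRUE, FALSE)]   # every other element: 21 23 25 27 29
```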

It’s easy to convert between the two types.

For example, start with a sequence of integers:

x <- 21:30
x
 [1] 21 22 23 24 25 26 27 28 29 30

Let’s create a logical index based on two inequalities:

logical_index <- x > 23 & x < 28
logical_index
 [1] FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE

9.5.1 Logical to integer index with which()

Warning

A common mistake is to attempt to convert a logical index to an integer index using as.integer(). This results in a vector of 1’s and 0’s, NOT an integer index.
Use which() to convert a logical index to an integer index.

which() literally gives the position of all TRUE elements in a vector, thus converting a logical to an integer index:

integer_index <- which(logical_index)
integer_index
[1] 4 5 6 7

i.e. positions 4, 5, 6, 7 of the logical_index are TRUE
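
The difference between `as.integer()` and `which()` is easy to see when each result is used as an index. A sketch with its own variables `v` and `keep` (names chosen for illustration):

```r
v <- 21:30
keep <- v > 23 & v < 28

as.integer(keep)      # 0 0 0 1 1 1 1 0 0 0: not a set of positions
v[as.integer(keep)]   # 0s are ignored, 1 is repeated: 21 21 21 21
v[which(keep)]        # the intended result: 24 25 26 27
```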

Note

A logical and an integer index are equivalent if they select the exact same elements

Let’s check that when used to index x, they both return the same result:

x[logical_index]
[1] 24 25 26 27
x[integer_index]
[1] 24 25 26 27
all(x[logical_index] == x[integer_index])
[1] TRUE

9.5.2 Integer to logical index

On the other hand, if we want to convert an integer index to a logical index, we can begin with a logical vector of the same length or dimension as the object we want to index with all FALSE values:

logical_index_too <- vector(length = length(x))
logical_index_too
 [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

And use the integer index to set the corresponding elements to TRUE:

logical_index_too[integer_index] <- TRUE
logical_index_too
 [1] FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE

This, of course, is the same as the logical index we started with.

all(logical_index == logical_index_too)
[1] TRUE
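
One possible shortcut, shown here as an alternative sketch rather than the approach used above, is to test each position against the integer index with `%in%`. The names `v`, `int_idx`, and `lgl_idx` are chosen for illustration:

```r
v <- 21:30
int_idx <- c(4L, 5L, 6L, 7L)

# TRUE where a position of v appears in the integer index
lgl_idx <- seq_along(v) %in% int_idx
lgl_idx
```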

9.6 Exclude cases using an index

Very often, we want to use an index, whether logical or integer, to exclude cases instead of to select cases. To do that with a logical index, we simply use an exclamation point in front of the index to negate each element (convert each TRUE to FALSE and each FALSE to TRUE):

logical_index
 [1] FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE
!logical_index
 [1]  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE
x[!logical_index]
[1] 21 22 23 28 29 30

To exclude elements using an integer index, R allows you to use negative indexing:

x[-integer_index]
[1] 21 22 23 28 29 30

Note

To get the complement of an index, you negate a logical index (!logical_index) or take the negative of an integer index (-integer_index).
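
The two approaches differ in one edge case: when no element matches, `which()` returns an empty integer vector, and indexing with its negative drops everything instead of nothing. A minimal sketch with a vector `v` and a condition chosen so that nothing matches:

```r
v <- 21:30
none <- v > 100    # no element satisfies this condition

v[!none]           # all 10 elements are kept, as intended
v[-which(none)]    # integer(0): every element is dropped
```

For this reason, negating the logical index is the safer choice when the condition might match no elements.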

9.7 Resources

“Indexing vectors” in An Introduction to R