An index is used to pick elements of a data structure (e.g. a vector, matrix, array, list, or data frame). You can select (or exclude) one or multiple elements at a time.
There are three types of index vectors you can use in R to identify elements of an object: integer, logical, and character.
Integer indexing in R is 1-based, meaning the first item of a vector is in position 1. In contrast, many programming languages use 0-based indexing where the first element is in the 0th position, the second in the 1st, and the nth in the n-1 position.
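A minimal sketch of what 1-based indexing means in practice. The vector `v` is a made-up example, not part of this chapter's data. Position 1 is the first element, and position 0 does not refer to any element, so indexing with 0 returns an empty vector rather than raising an error.

```r
# Hypothetical example vector
v <- c(10, 20, 30)

first <- v[1]          # first element
last <- v[length(v)]   # last element
none <- v[0]           # 0 selects nothing: a zero-length numeric vector

first
last
length(none)
```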
To understand indexing, make sure you are very comfortable with the core R data structures: vectors, matrices, arrays, lists, and data.frames.
What is indexing used for?
Indexing can be used to get values from an object or to set values in an object.
The main indexing operator in R is the square bracket [].
Lists use double square brackets [[]].
Start with a simple vector:
x <- 15:24
x
[1] 15 16 17 18 19 20 21 22 23 24
Get the 5th element of a vector:
x[5]
[1] 19
Get elements 6 through 9 of the same vector:
x[6:9]
[1] 20 21 22 23
An integer index can be used to reverse the order of elements:
x[5:3]
[1] 19 18 17
Note that an integer index can be used to repeat elements. This is often done by accident, when someone passes the wrong vector as an index, so beware.
x[c(1, 1, 1, 4)]
[1] 15 15 15 18
Logical indexes are usually created as the output of a logical operation, i.e. an elementwise comparison.
Select elements with value greater than 19:
idl <- x > 19
The above comparison is vectorized (Chapter 15), meaning that the comparison is performed elementwise and the result is a logical vector of the same length as the original vector.
idl
[1] FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE
You can pass the logical vector as an index to the original vector to get the elements that correspond to TRUE in the logical vector:
x[idl]
[1] 20 21 22 23 24
Logical vectors can be created directly in the brackets:
x[x > 19]
[1] 20 21 22 23 24
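Besides integer and logical indexes, a character index selects elements by name. The following is a small sketch on a hypothetical named vector (`scores` and its names are illustrative only):

```r
# Hypothetical named vector (names are illustrative)
scores <- c(alice = 15, bob = 22, carol = 19)

# A single element, selected by name
scores["bob"]

# Multiple elements, returned in the order requested
picked <- scores[c("carol", "alice")]
picked
```

As with integer indexes, the order of the names in the index determines the order of the result.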
x <- c(24, 32, 41, 37, 999, 999, 999)
Indexing allows you to access specific elements, for example to perform calculations on them.
Get the mean of elements 1 through 4:
mean(x[1:4])
[1] 33.5
You can combine indexing with assignment to replace elements of an object.
Replace values in elements 1:4 with their log:
x[1:4] <- log(x[1:4])
x
[1] 3.178054 3.465736 3.713572 3.610918 999.000000 999.000000 999.000000
Replace elements that are equal to 999 with NA:
x[x == 999] <- NA
x
[1] 3.178054 3.465736 3.713572 3.610918 NA NA NA
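Once placeholder values have been replaced with NA, a logical index built with is.na() can pick out the remaining values. A minimal sketch on a made-up vector `y` (an assumption for illustration, mirroring the values used above):

```r
# Hypothetical vector where 999 stands for "missing"
y <- c(24, 32, 41, 37, 999, 999, 999)
y[y == 999] <- NA

# Keep only the non-missing elements
observed <- y[!is.na(y)]
observed

mean_observed <- mean(observed)
mean_observed
```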
Reminder: To index a 2D structure, whether a matrix or data frame, we use the form [row, column].
The following indexing operations are therefore the same whether applied on a matrix or a data frame:
mat <- matrix(21:60, nrow = 10)
colnames(mat) <- paste0("Feature_", seq(ncol(mat)))
rownames(mat) <- paste0("Row_", seq(nrow(mat)))
mat
Feature_1 Feature_2 Feature_3 Feature_4
Row_1 21 31 41 51
Row_2 22 32 42 52
Row_3 23 33 43 53
Row_4 24 34 44 54
Row_5 25 35 45 55
Row_6 26 36 46 56
Row_7 27 37 47 57
Row_8 28 38 48 58
Row_9 29 39 49 59
Row_10 30 40 50 60
df <- as.data.frame(mat)
df
Feature_1 Feature_2 Feature_3 Feature_4
Row_1 21 31 41 51
Row_2 22 32 42 52
Row_3 23 33 43 53
Row_4 24 34 44 54
Row_5 25 35 45 55
Row_6 26 36 46 56
Row_7 27 37 47 57
Row_8 28 38 48 58
Row_9 29 39 49 59
Row_10 30 40 50 60
To get the contents of the fifth row, second column:
mat[5, 2]
[1] 35
df[5, 2]
[1] 35
We show the following on matrices, but they work just the same on data.frames.
If you want to select an entire row or an entire column, you leave the row or column index blank, but you must include a comma:
Get the first row:
mat[1, ]
Feature_1 Feature_2 Feature_3 Feature_4
21 31 41 51
Get the second column:
mat[, 2]
Row_1 Row_2 Row_3 Row_4 Row_5 Row_6 Row_7 Row_8 Row_9 Row_10
31 32 33 34 35 36 37 38 39 40
Note that colnames and rownames were added to the matrix above for convenience - if they are absent, there are no labels above each element.
You can define ranges for both rows and columns:
mat[6:7, 2:4]
Feature_2 Feature_3 Feature_4
Row_6 36 46 56
Row_7 37 47 57
You can use vectors to specify any combination of rows and columns.
Get rows 2, 4, and 7 of columns 1, 4, and 3:
mat[c(2, 4, 7), c(1, 4, 3)]
Feature_1 Feature_4 Feature_3
Row_2 22 52 42
Row_4 24 54 44
Row_7 27 57 47
Since a matrix is a vector with 2 dimensions, you can also index the underlying vector directly. Regardless of whether a matrix was created by row or by column (default), the data is stored and accessed by column. You can see that by converting the matrix to a one-dimensional vector:
as.vector(mat)
[1] 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45
[26] 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
same as:
c(mat)
[1] 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45
[26] 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
For example, mat has 10 rows and 4 columns, therefore the 11th element is in row 1, column 2. This only works with matrices, not data.frames:
mat[11]
[1] 31
is the same as:
mat[1, 2]
[1] 31
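The relationship between a [row, column] pair and the position in the underlying vector can be sketched as follows. Because data is stored by column, element [i, j] sits at position (j - 1) * nrow + i. The matrix `m` below is a stand-in with the same values as the example above, and arrayInd() is the base R function that maps a position back to a [row, column] pair.

```r
m <- matrix(21:60, nrow = 10)

i <- 1
j <- 2
# Column-major position of element [i, j]
k <- (j - 1) * nrow(m) + i
k
m[k] == m[i, j]

# arrayInd() goes the other way: from a position to [row, column]
rc <- arrayInd(11, dim(m))
rc
```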
Indexing with a matrix is less common, but potentially useful. It allows you to specify a series of individual [i, j] indexes, i.e. it is a way to select multiple individual, non-contiguous elements.
An n-by-2 matrix is used as a length-n vector of [row, column] indexes. For example, define a 3-by-2 index matrix:
idm <- matrix(c(2, 4, 7, 4, 3, 1), ncol = 2)
This matrix will return elements [2, 4], [4, 3], and [7, 1]:
mat[idm]
[1] 52 44 27
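A self-contained sketch of the same idea, building the index matrix with cbind() so that each row reads as one [row, column] pair. The names `m`, `idx`, `vals`, and `same` are illustrative.

```r
m <- matrix(21:60, nrow = 10)

# Each row of the index matrix is one [row, column] pair
idx <- cbind(c(2, 4, 7), c(4, 3, 1))
vals <- m[idx]
vals

# Equivalent to extracting each element individually
same <- c(m[2, 4], m[4, 3], m[7, 1])
identical(vals, same)
```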
Identify rows with value greater than 36 on the second column:
The logical index for this operation is:
mat[, 2] > 36
Row_1 Row_2 Row_3 Row_4 Row_5 Row_6 Row_7 Row_8 Row_9 Row_10
FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE
It can be used directly to index the matrix:
mat[mat[, 2] > 36, ]
Feature_1 Feature_2 Feature_3 Feature_4
Row_7 27 37 47 57
Row_8 28 38 48 58
Row_9 29 39 49 59
Row_10 30 40 50 60
Indexing a matrix or a data.frame can return either a smaller matrix/data.frame or a vector.
In general, many R functions return the simplest R object that can hold the output. As always, check function documentation to look for possible arguments that can change this and what the default behavior is. If you extract a single column or row of a matrix, you get a vector:
Get the third column:
mat[, 3]
Row_1 Row_2 Row_3 Row_4 Row_5 Row_6 Row_7 Row_8 Row_9 Row_10
41 42 43 44 45 46 47 48 49 50
class(mat[, 3])
[1] "integer"
You can specify drop = FALSE to stop R from dropping the unused dimension and return a matrix or data.frame of a single column:
mat[, 3, drop = FALSE]
Feature_3
Row_1 41
Row_2 42
Row_3 43
Row_4 44
Row_5 45
Row_6 46
Row_7 47
Row_8 48
Row_9 49
Row_10 50
df[, 3, drop = FALSE]
Feature_3
Row_1 41
Row_2 42
Row_3 43
Row_4 44
Row_5 45
Row_6 46
Row_7 47
Row_8 48
Row_9 49
Row_10 50
Check it is still a matrix or data.frame:
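One way to run this check is with is.matrix(), is.data.frame(), and dim(). The sketch below uses stand-in objects `m` and `d` built the same way as the matrix and data frame above.

```r
m <- matrix(21:60, nrow = 10)
d <- as.data.frame(m)

is.matrix(m[, 3])                     # FALSE: the dimension was dropped
is.matrix(m[, 3, drop = FALSE])       # TRUE: still a 10-by-1 matrix
is.data.frame(d[, 3, drop = FALSE])   # TRUE: still a data.frame
dim(m[, 3, drop = FALSE])
```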
Reminder: A list can contain elements of different classes and of different lengths:
x <- list(one = 1001:1004,
two = sample(seq(0, 100, by = .1), size = 10),
three = c("Neuro", "Cardio", "Radio"),
four = median)
x
$one
[1] 1001 1002 1003 1004
$two
[1] 93.9 44.4 90.1 70.3 79.7 18.6 49.0 82.4 75.9 41.3
$three
[1] "Neuro" "Cardio" "Radio"
$four
function (x, na.rm = FALSE, ...)
UseMethod("median")
<bytecode: 0x149b46190>
<environment: namespace:stats>
You can access a single list element using either [[ with a name or integer position, or $ followed by the name of the element (which therefore only works if elements are named).
For example, to access the third element:
x$three
[1] "Neuro" "Cardio" "Radio"
same as:
x[[3]]
[1] "Neuro" "Cardio" "Radio"
same as:
x[["three"]]
[1] "Neuro" "Cardio" "Radio"
To access a list element programmatically, i.e. using a name or integer index stored in a variable, only the bracket notation works. Therefore, programmatically, you would always use double brackets to access different elements:
idi <- 3
idc <- "three"
x[[idi]]
[1] "Neuro" "Cardio" "Radio"
x[[idc]]
[1] "Neuro" "Cardio" "Radio"
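A short sketch of why this matters in a loop. The list `lst` is hypothetical. Inside the loop, `nm` holds a name, so `lst[[nm]]` looks up the element with that name, whereas `lst$nm` would look for an element literally called "nm".

```r
# Hypothetical list for illustration
lst <- list(one = 1:4, two = c(2.5, 3.5), three = c("a", "b", "c"))

lens <- integer(0)
for (nm in names(lst)) {
  # lst[[nm]] uses the value stored in nm
  lens[nm] <- length(lst[[nm]])
}
lens

# No element is literally named "nm", so this is NULL
is.null(lst$nm)
```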
You can extract one or more list elements as a pruned list using single bracket [ notation. Similar to indexing of a vector, this can be either a logical, integer, or character vector:
x[3]
$three
[1] "Neuro" "Cardio" "Radio"
x["three"]
$three
[1] "Neuro" "Cardio" "Radio"
x[c(FALSE, FALSE, TRUE, FALSE)]
$three
[1] "Neuro" "Cardio" "Radio"
Get multiple elements:
x[2:3]
$two
[1] 93.9 44.4 90.1 70.3 79.7 18.6 49.0 82.4 75.9 41.3
$three
[1] "Neuro" "Cardio" "Radio"
# same as
x[c("two", "three")]
$two
[1] 93.9 44.4 90.1 70.3 79.7 18.6 49.0 82.4 75.9 41.3
$three
[1] "Neuro" "Cardio" "Radio"
# same as
x[c(FALSE, TRUE, TRUE, FALSE)]
$two
[1] 93.9 44.4 90.1 70.3 79.7 18.6 49.0 82.4 75.9 41.3
$three
[1] "Neuro" "Cardio" "Radio"
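Given a list x whose 2nd element, Dept, is the character vector c("Neuro", "Cardio", "Radio") and whose 3rd element, Age, is a numeric vector: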
We can access the 3rd element of the 2nd element:
x[[2]][3]
[1] "Radio"
or
x[[c(2, 3)]]
[1] "Radio"
This is called recursive indexing and is perhaps more often used by accident, when one instead wanted to extract the 2nd and 3rd elements:
x[c(2, 3)]
$Dept
[1] "Neuro" "Cardio" "Radio"
$Age
[1] 58.31128 59.50883 58.63950 57.79283 56.48979 56.54883 57.70144 55.71086
[9] 59.27153 58.25800 55.57688 56.71479 57.50166 59.21797 57.82313 57.24295
[17] 59.35482 58.34108 58.48001 56.11052
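The difference between the two forms can be reproduced on a small, self-contained list. The list `lst` and its element names below are assumptions for illustration, not the list used above.

```r
# Hypothetical list for illustration
lst <- list(id = 1:3,
            dept = c("Neuro", "Cardio", "Radio"),
            age = c(58, 61, 47))

# Double brackets with a vector: recursive indexing,
# i.e. the 3rd element of the 2nd element
lst[[c(2, 3)]]

# Single brackets with a vector: a list holding the 2nd and 3rd elements
sub <- lst[c(2, 3)]
names(sub)
```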
You can convert a list to a single vector containing all individual components of the original list using unlist(). Notice how names are automatically created based on the original structure:
x <- list(alpha = sample(seq(100), size = 10),
beta = sample(seq(100), size = 10),
gamma = sample(seq(100), size = 10))
x
$alpha
[1] 97 1 75 60 82 66 61 62 43 15
$beta
[1] 30 5 66 97 73 34 4 85 24 35
$gamma
[1] 11 78 23 29 36 82 7 12 51 10
unlist(x)
alpha1 alpha2 alpha3 alpha4 alpha5 alpha6 alpha7 alpha8 alpha9 alpha10
97 1 75 60 82 66 61 62 43 15
beta1 beta2 beta3 beta4 beta5 beta6 beta7 beta8 beta9 beta10
30 5 66 97 73 34 4 85 24 35
gamma1 gamma2 gamma3 gamma4 gamma5 gamma6 gamma7 gamma8 gamma9 gamma10
11 78 23 29 36 82 7 12 51 10
If you want to drop the names, you can set the use.names argument to FALSE, i.e. unlist(x, use.names = FALSE), or wrap the above in unname(), i.e. unname(unlist(x)).
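Both ways of dropping names can be compared on a small deterministic list. This is a sketch, and the list `lst` is made up for illustration.

```r
# Hypothetical list for illustration
lst <- list(alpha = 1:3, beta = 4:5)

named <- unlist(lst)
names(named)

plain <- unlist(lst, use.names = FALSE)
names(plain)                      # NULL

# Both approaches give the same unnamed vector
identical(plain, unname(named))
```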
In data science and related fields the terms filter and select are commonly used: to filter is to subset cases, i.e. rows, and to select is to subset variables, i.e. columns.
We saw above that a data frame can be indexed in many ways similar to a matrix, i.e. by defining rows and columns. At the same time, we know that a data frame is a rectangular list. Like a list, its elements are vectors of any type (integer, double, character, factor, and more) but, unlike a list, they have to be of the same length. A data frame can also be indexed the same way as a list. As with list indexing, notice that some methods return a smaller data frame, while others return vectors.
You can index a data frame using all the ways you can index a list and all the ways you can index a matrix.
Let’s create a simple data frame:
x <- data.frame(Feat_1 = 21:25,
Feat_2 = rnorm(5),
Feat_3 = paste0("rnd_", sample(seq(100), size = 5)))
x
Feat_1 Feat_2 Feat_3
1 21 -0.5602523 rnd_44
2 22 -1.3585561 rnd_46
3 23 -0.2466165 rnd_12
4 24 -0.7036978 rnd_87
5 25 -0.5399613 rnd_94
Just like in a list, using double brackets [[ or the $ operator returns an element, i.e. a vector:
x$Feat_2
[1] -0.5602523 -1.3585561 -0.2466165 -0.7036978 -0.5399613
x[[2]]
[1] -0.5602523 -1.3585561 -0.2466165 -0.7036978 -0.5399613
x[, 2]
[1] -0.5602523 -1.3585561 -0.2466165 -0.7036978 -0.5399613
Accessing a column by name using single square brackets returns a single-column data.frame:
x["Feat_2"]
Feat_2
1 -0.5602523
2 -1.3585561
3 -0.2466165
4 -0.7036978
5 -0.5399613
Accessing a column by [row, column], either by position or name, returns a vector by default:
x[, 2]
[1] -0.5602523 -1.3585561 -0.2466165 -0.7036978 -0.5399613
x[, "Feat_2"]
[1] -0.5602523 -1.3585561 -0.2466165 -0.7036978 -0.5399613
As we saw earlier, we can specify drop = FALSE to return a data.frame, e.g. x[, 2, drop = FALSE].
As in lists, all indexing and slicing operations, with the exception of the $ notation, work with a variable holding either a column name or an integer position:
idi <- 2
idc <- "Feat_2"
x[idi]
Feat_2
1 -0.5602523
2 -1.3585561
3 -0.2466165
4 -0.7036978
5 -0.5399613
x[idc]
Feat_2
1 -0.5602523
2 -1.3585561
3 -0.2466165
4 -0.7036978
5 -0.5399613
x[[idi]]
[1] -0.5602523 -1.3585561 -0.2466165 -0.7036978 -0.5399613
x[[idc]]
[1] -0.5602523 -1.3585561 -0.2466165 -0.7036978 -0.5399613
x[, idi]
[1] -0.5602523 -1.3585561 -0.2466165 -0.7036978 -0.5399613
x[, idc]
[1] -0.5602523 -1.3585561 -0.2466165 -0.7036978 -0.5399613
x[, idi, drop = FALSE]
Feat_2
1 -0.5602523
2 -1.3585561
3 -0.2466165
4 -0.7036978
5 -0.5399613
x[, idc, drop = FALSE]
Feat_2
1 -0.5602523
2 -1.3585561
3 -0.2466165
4 -0.7036978
5 -0.5399613
Extracting multiple columns returns a data frame:
x[, 2:3]
Feat_2 Feat_3
1 -0.5602523 rnd_44
2 -1.3585561 rnd_46
3 -0.2466165 rnd_12
4 -0.7036978 rnd_87
5 -0.5399613 rnd_94
class(x[, 2:3])
[1] "data.frame"
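Unlike indexing a row of a matrix, indexing a row of a data.frame, e.g. x[1, ], returns a single-row data.frame, since it contains multiple columns of potentially different types.
You can convert such a row into a list using c(), e.g. c(x[1, ]), or into a (named) vector using unlist(), e.g. unlist(x[1, ]).
A logical index can be used to filter rows, here keeping cases where Feat_1 is greater than 22: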
x[x$Feat_1 > 22, ]
Feat_1 Feat_2 Feat_3
3 23 -0.2466165 rnd_12
4 24 -0.7036978 rnd_87
5 25 -0.5399613 rnd_94
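Extracting a single row of a data frame, and converting it to a list or a vector, can be sketched as follows. The data frame `d` is a small hypothetical stand-in, and stringsAsFactors = FALSE is set explicitly so that the character column behaves the same across R versions.

```r
# Hypothetical data frame for illustration
d <- data.frame(Feat_1 = 21:23,
                Feat_2 = c(0.5, 1.5, 2.5),
                Feat_3 = c("a", "b", "c"),
                stringsAsFactors = FALSE)

row1 <- d[1, ]
class(row1)              # still a data.frame, with one row

as_list <- c(row1)       # a plain list, one element per column
class(as_list)

as_vec <- unlist(row1)   # a named vector; mixed types are coerced to character
class(as_vec)
```

Because a vector can only hold one type, unlist() coerces the mixed columns to character here. Converting to a list preserves each column's type.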
In this chapter, we have learned how to use both integer and logical indexes.
A logical index needs to be of the same dimensions as the object it is indexing (unless you really want to recycle values; see the chapter on vectorization): you are specifying whether to include or exclude each element.
An integer index is usually shorter than the object it is indexing: you are specifying which subset of elements to include (or, with a - in front, which elements to exclude).
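As a brief illustration of the recycling mentioned above, a logical index shorter than the object is repeated until it covers every element. The vector `v` is a made-up example.

```r
v <- 21:30

# A length-2 logical index is recycled along v: TRUE, FALSE, TRUE, FALSE, ...
odd_positions <- v[c(TRUE, FALSE)]
odd_positions
```

This is occasionally convenient, but a logical index of the wrong length is more often a sign of a bug, which is why matching lengths is the safer default.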
It’s easy to convert between the two types.
For example, start with a sequence of integers:
x <- 21:30
x
[1] 21 22 23 24 25 26 27 28 29 30
Let’s create a logical index based on two inequalities:
logical_index <- x > 23 & x < 28
logical_index
[1] FALSE FALSE FALSE TRUE TRUE TRUE TRUE FALSE FALSE FALSE
A common mistake is to attempt to convert a logical index to an integer index using as.integer(). This results in a vector of 1’s and 0’s, NOT an integer index. which() converts a logical index to an integer index.
which() literally gives the position of all TRUE elements in a vector, thus converting a logical to an integer index:
integer_index <- which(logical_index)
integer_index
[1] 4 5 6 7
i.e. positions 4, 5, 6, 7 of logical_index are TRUE.
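The as.integer() mistake is easy to reproduce. In this sketch (with stand-in names `v` and `keep`), the 0s select nothing and each 1 selects the first element, so the result is silently wrong rather than an error.

```r
v <- 21:30
keep <- v > 23 & v < 28

# 0s and 1s, not positions
as.integer(keep)

# Zeros are dropped; each 1 selects the first element of v
wrong <- v[as.integer(keep)]
wrong

# which() gives the positions of the TRUE values
right <- v[which(keep)]
right
```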
A logical and an integer index are equivalent if they select the exact same elements.
Let’s check that, when used to index x, they both return the same result:
x[logical_index]
[1] 24 25 26 27
x[integer_index]
[1] 24 25 26 27
all(x[logical_index] == x[integer_index])
[1] TRUE
On the other hand, if we want to convert an integer index to a logical index, we can begin with a logical vector of the same length or dimension as the object we want to index, with all values set to FALSE:
logical_index_too <- rep(FALSE, length(x))
logical_index_too
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
And use the integer index to set the corresponding elements to TRUE:
logical_index_too[integer_index] <- TRUE
logical_index_too
[1] FALSE FALSE FALSE TRUE TRUE TRUE TRUE FALSE FALSE FALSE
This, of course, is the same as the logical index we started with.
all(logical_index == logical_index_too)
[1] TRUE
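An alternative, shown here only as a sketch, is to do the conversion in one step by testing which positions of the object appear in the integer index, using seq_along() and %in%. The names `v`, `int_idx`, and `lgl_idx` are illustrative.

```r
v <- 21:30
int_idx <- c(4, 5, 6, 7)

# TRUE wherever a position of v appears in the integer index
lgl_idx <- seq_along(v) %in% int_idx
lgl_idx

identical(v[lgl_idx], v[int_idx])
```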
Very often, we want to use an index, whether logical or integer, to exclude cases instead of to select cases. To do that with a logical index, we simply use an exclamation point in front of the index to negate each element (convert each TRUE to FALSE and each FALSE to TRUE):
logical_index
[1] FALSE FALSE FALSE TRUE TRUE TRUE TRUE FALSE FALSE FALSE
!logical_index
[1] TRUE TRUE TRUE FALSE FALSE FALSE FALSE TRUE TRUE TRUE
x[!logical_index]
[1] 21 22 23 28 29 30
To exclude elements using an integer index, R allows you to use negative indexing:
x[-integer_index]
[1] 21 22 23 28 29 30
To get the complement of an index, you negate a logical index (!logical_index) or you put a minus sign in front of an integer index (-integer_index).
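The two forms of exclusion differ in one edge case worth knowing: when no element matches. In the sketch below (stand-in names `v` and `none_match`), the negated logical index keeps everything, while the negative integer index is empty and therefore selects nothing.

```r
v <- 21:30
none_match <- v > 100    # all FALSE

# Negated logical index: every element is kept
by_logical <- v[!none_match]

# which() returns an empty index, and -integer(0) is still empty,
# so nothing is selected
by_integer <- v[-which(none_match)]

length(by_logical)
length(by_integer)
```

For this reason, negating a logical index is the more robust choice when the condition may match no elements.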