34  Joins

library(data.table)
options(datatable.print.class = TRUE)

data.table allows you to perform table joins with either:

a <- data.table(PID = 1:9,
                Hospital = c("UCSF", "HUP", "Stanford", 
                             "Stanford", "UCSF", "HUP", 
                             "HUP", "Stanford", "UCSF"),
                Age = c(22, 34, 41, 19, 53, 21, 63, 22, 19),
                Sex = c(1, 1, 0, 1, 0, 0, 1, 0, 0),
                key = "PID")
a
Key: <PID>
     PID Hospital   Age   Sex
   <int>   <char> <num> <num>
1:     1     UCSF    22     1
2:     2      HUP    34     1
3:     3 Stanford    41     0
4:     4 Stanford    19     1
5:     5     UCSF    53     0
6:     6      HUP    21     0
7:     7      HUP    63     1
8:     8 Stanford    22     0
9:     9     UCSF    19     0
b <- data.table(PID = 6:12,
                V1 = c(153, 89, 112, 228,  91, 190, 101),
                Department = c("Neurology", "Radiology", "Emergency",
                               "Cardiology", "Surgery", "Neurology",
                               "Psychiatry"),
                key = "PID")
b
Key: <PID>
     PID    V1 Department
   <int> <num>     <char>
1:     6   153  Neurology
2:     7    89  Radiology
3:     8   112  Emergency
4:     9   228 Cardiology
5:    10    91    Surgery
6:    11   190  Neurology
7:    12   101 Psychiatry

In the above command we use the key argument to set PID as key. This can be performed after the data.table has been created using the setkey() command:

setkey(a, PID)

Multiple keys can be set, in order, with the same setkey() command, separated by commas, e.g.

setkey(a, PID, Hospital)

Keys sort the data.table by the corresponding columns and can be used to perform left and right joins with bracket notation seen later.

34.1 Inner join

merge(a, b)
Key: <PID>
     PID Hospital   Age   Sex    V1 Department
   <int>   <char> <num> <num> <num>     <char>
1:     6      HUP    21     0   153  Neurology
2:     7      HUP    63     1    89  Radiology
3:     8 Stanford    22     0   112  Emergency
4:     9     UCSF    19     0   228 Cardiology

34.2 Outer join

merge(a, b, all = TRUE)
Key: <PID>
      PID Hospital   Age   Sex    V1 Department
    <int>   <char> <num> <num> <num>     <char>
 1:     1     UCSF    22     1    NA       <NA>
 2:     2      HUP    34     1    NA       <NA>
 3:     3 Stanford    41     0    NA       <NA>
 4:     4 Stanford    19     1    NA       <NA>
 5:     5     UCSF    53     0    NA       <NA>
 6:     6      HUP    21     0   153  Neurology
 7:     7      HUP    63     1    89  Radiology
 8:     8 Stanford    22     0   112  Emergency
 9:     9     UCSF    19     0   228 Cardiology
10:    10     <NA>    NA    NA    91    Surgery
11:    11     <NA>    NA    NA   190  Neurology
12:    12     <NA>    NA    NA   101 Psychiatry

34.3 Left outer join

Using merge():

merge(a, b, all.x = TRUE)
Key: <PID>
     PID Hospital   Age   Sex    V1 Department
   <int>   <char> <num> <num> <num>     <char>
1:     1     UCSF    22     1    NA       <NA>
2:     2      HUP    34     1    NA       <NA>
3:     3 Stanford    41     0    NA       <NA>
4:     4 Stanford    19     1    NA       <NA>
5:     5     UCSF    53     0    NA       <NA>
6:     6      HUP    21     0   153  Neurology
7:     7      HUP    63     1    89  Radiology
8:     8 Stanford    22     0   112  Emergency
9:     9     UCSF    19     0   228 Cardiology

Using bracket notation:

b[a, ]
Key: <PID>
     PID    V1 Department Hospital   Age   Sex
   <int> <num>     <char>   <char> <num> <num>
1:     1    NA       <NA>     UCSF    22     1
2:     2    NA       <NA>      HUP    34     1
3:     3    NA       <NA> Stanford    41     0
4:     4    NA       <NA> Stanford    19     1
5:     5    NA       <NA>     UCSF    53     0
6:     6   153  Neurology      HUP    21     0
7:     7    89  Radiology      HUP    63     1
8:     8   112  Emergency Stanford    22     0
9:     9   228 Cardiology     UCSF    19     0

If keys were not set for a and b, you could specify the column to match on using the on argument:

b[a, on = "PID"]
Key: <PID>
     PID    V1 Department Hospital   Age   Sex
   <int> <num>     <char>   <char> <num> <num>
1:     1    NA       <NA>     UCSF    22     1
2:     2    NA       <NA>      HUP    34     1
3:     3    NA       <NA> Stanford    41     0
4:     4    NA       <NA> Stanford    19     1
5:     5    NA       <NA>     UCSF    53     0
6:     6   153  Neurology      HUP    21     0
7:     7    89  Radiology      HUP    63     1
8:     8   112  Emergency Stanford    22     0
9:     9   228 Cardiology     UCSF    19     0
Note

The easy way to understand the bracket notation merges is to think that the data.table inside the bracket is used to index the data.table on the outside, therefore the resulting table will have rows dictated by the inside table’s key.

34.4 Right outer join

merge(a, b, all.y = TRUE)
Key: <PID>
     PID Hospital   Age   Sex    V1 Department
   <int>   <char> <num> <num> <num>     <char>
1:     6      HUP    21     0   153  Neurology
2:     7      HUP    63     1    89  Radiology
3:     8 Stanford    22     0   112  Emergency
4:     9     UCSF    19     0   228 Cardiology
5:    10     <NA>    NA    NA    91    Surgery
6:    11     <NA>    NA    NA   190  Neurology
7:    12     <NA>    NA    NA   101 Psychiatry

Using bracket notation:

a[b, ]
Key: <PID>
     PID Hospital   Age   Sex    V1 Department
   <int>   <char> <num> <num> <num>     <char>
1:     6      HUP    21     0   153  Neurology
2:     7      HUP    63     1    89  Radiology
3:     8 Stanford    22     0   112  Emergency
4:     9     UCSF    19     0   228 Cardiology
5:    10     <NA>    NA    NA    91    Surgery
6:    11     <NA>    NA    NA   190  Neurology
7:    12     <NA>    NA    NA   101 Psychiatry

34.5 See also