21 Working with Data Frames

See the data frame section for an introduction to data frames.

21.1 Indexing

21.2 Column and row names

Let’s start with a simple example data.frame:

df <- data.frame(PID = 111:119,
                 Hospital = c("UCSF", "HUP", "Stanford",
                              "Stanford", "UCSF", "HUP", 
                              "HUP", "Stanford", "UCSF"),
                 Age = c(22, 34, 41, 19, 53, 21, 63, 22, 19),
                 Sex = c(1, 1, 0, 1, 0, 0, 1, 0, 0))
df

  PID Hospital Age Sex
1 111     UCSF  22   1
2 112      HUP  34   1
3 113 Stanford  41   0
4 114 Stanford  19   1
5 115     UCSF  53   0
6 116      HUP  21   0
7 117      HUP  63   1
8 118 Stanford  22   0
9 119     UCSF  19   0

The optional row.names argument (see data.frame usage in the R documentation) can be used to define row names at the time of the data frame creation. It accepts either:

a single integer or a character specifying a column of the data.frame being created whose values should be used as row names, or
a vector of values (character or integer) of the row names to be used.

For example, we can use the “PID” column:

df <- data.frame(PID = 111:119,
                 Hospital = c("UCSF", "HUP", "Stanford",
                              "Stanford", "UCSF", "HUP", 
                              "HUP", "Stanford", "UCSF"),
                 Age = c(22, 34, 41, 19, 53, 21, 63, 22, 19),
                 Sex = c(1, 1, 0, 1, 0, 0, 1, 0, 0),
                 row.names = "PID")

Note

It is recommended to not use/depend on row names to identify or index data.frames, and instead include a column of case IDs.

We can get column names and row names with colnames() and rownames(), respectively:

colnames(df)

[1] "Hospital" "Age"      "Sex"

rownames(df)

[1] "111" "112" "113" "114" "115" "116" "117" "118" "119"

To set new column or row names use the form:

colnames(df) <- new.colnames

rownames(df) <- new.rownames

where new.colnames and new.rownames is a character vector.

You can rename all columns/rows or use indexing to replace specific names.

For example, to rename all rows, do:

rownames(df) <- paste0("Patient_", 1:9)
df

          Hospital Age Sex
Patient_1     UCSF  22   1
Patient_2      HUP  34   1
Patient_3 Stanford  41   0
Patient_4 Stanford  19   1
Patient_5     UCSF  53   0
Patient_6      HUP  21   0
Patient_7      HUP  63   1
Patient_8 Stanford  22   0
Patient_9     UCSF  19   0

To rename the first two columns, do:

colnames(df)[1:2] <- c("Center", "Age_at_Dx")
df

            Center Age_at_Dx Sex
Patient_1     UCSF        22   1
Patient_2      HUP        34   1
Patient_3 Stanford        41   0
Patient_4 Stanford        19   1
Patient_5     UCSF        53   0
Patient_6      HUP        21   0
Patient_7      HUP        63   1
Patient_8 Stanford        22   0
Patient_9     UCSF        19   0

21.3 Delete columns or rows

To delete a data.frame column, set it to NULL:

df$Sex <- NULL
df

            Center Age_at_Dx
Patient_1     UCSF        22
Patient_2      HUP        34
Patient_3 Stanford        41
Patient_4 Stanford        19
Patient_5     UCSF        53
Patient_6      HUP        21
Patient_7      HUP        63
Patient_8 Stanford        22
Patient_9     UCSF        19

To delete a data.frame row, you can “index it out”.

For example, to remove the third and fifths rows of the above data.frame using an integer index:

df <- df[-c(3, 5), ]
df

            Center Age_at_Dx
Patient_1     UCSF        22
Patient_2      HUP        34
Patient_4 Stanford        19
Patient_6      HUP        21
Patient_7      HUP        63
Patient_8 Stanford        22
Patient_9     UCSF        19

You can similarly exclude a row using a logical index. Logical indexing occurs usually following some filtering condition.

For example, to exclude patients under 20 years old, do:

df <- df[!df$Age < 20, ]
df

            Center Age_at_Dx
Patient_1     UCSF        22
Patient_2      HUP        34
Patient_6      HUP        21
Patient_7      HUP        63
Patient_8 Stanford        22

21.4 `subset()`

subset() allows you to:

filter cases that meet certain conditions using the subset argument
select columns using the select argument

head() returns the first few lines of a data frame. We use it to avoid printing too many lines, e.g.

head(iris)

  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

iris_sl.gt.med <- subset(iris, Sepal.Length > median(Sepal.Length))

Note: You can use the column name Sepal.Length directly, i.e. unquoted and you don’t need to use iris$Sepal.Length.

This is called Non-Standard Evaluation (NSE).

x <- data.frame(one = 1:10,
                two = rnorm(10),
                group = c(rep("alpha", 4),  rep("beta", 6)))
subset(x, subset = two > 0, select = two)

          two
1  0.54992537
2  2.55730552
5  0.15558755
7  0.13558505
8  0.08646245
9  0.08867163
10 0.73122091

subset(x, two > 0, -one)

          two group
1  0.54992537 alpha
2  2.55730552 alpha
5  0.15558755  beta
7  0.13558505  beta
8  0.08646245  beta
9  0.08867163  beta
10 0.73122091  beta

subset(x, two > 0, two:one)

          two one
1  0.54992537   1
2  2.55730552   2
5  0.15558755   5
7  0.13558505   7
8  0.08646245   8
9  0.08867163   9
10 0.73122091  10

subset(x, two > 0, two:group)

          two group
1  0.54992537 alpha
2  2.55730552 alpha
5  0.15558755  beta
7  0.13558505  beta
8  0.08646245  beta
9  0.08867163  beta
10 0.73122091  beta

21.5 `split()`

Split a data frame into multiple data frames by groups defined by a factor:

x_by_group <- split(x, x$group)
x_by_group

$alpha
  one        two group
1   1  0.5499254 alpha
2   2  2.5573055 alpha
3   3 -0.9177497 alpha
4   4 -0.6831720 alpha

$beta
   one         two group
5    5  0.15558755  beta
6    6 -0.78421387  beta
7    7  0.13558505  beta
8    8  0.08646245  beta
9    9  0.08867163  beta
10  10  0.73122091  beta

21.6 `with()`

Within a with() expression, you can access list elements or data.frame columns without quoting or using the $ operator:

with(x, one + two)

 [1]  1.549925  4.557306  2.082250  3.316828  5.155588  5.215786  7.135585
 [8]  8.086462  9.088672 10.731221

with(x, x[group == "alpha", ])

  one        two group
1   1  0.5499254 alpha
2   2  2.5573055 alpha
3   3 -0.9177497 alpha
4   4 -0.6831720 alpha

with(x, x[two > 0, ])

   one        two group
1    1 0.54992537 alpha
2    2 2.55730552 alpha
5    5 0.15558755  beta
7    7 0.13558505  beta
8    8 0.08646245  beta
9    9 0.08867163  beta
10  10 0.73122091  beta

21.7 Feature transformation with `transform()`

Make up some data:

dat <- data.frame(Sex = c(0, 0, 1, 1, 0),
                  Height = c(1.5, 1.6, 1.55, 1.73, 1.8),
                  Weight = c(55, 70, 69, 76, 91))

dat <- transform(dat, BMI = Weight/Height^2)
dat

  Sex Height Weight      BMI
1   0   1.50     55 24.44444
2   0   1.60     70 27.34375
3   1   1.55     69 28.72008
4   1   1.73     76 25.39343
5   0   1.80     91 28.08642

transform() is probably not used too often, because it is trivial to do the same with direct assignment:

dat$BMI <- dat$Weight/dat$Height^2

but can be useful when adding multiple variables and/or used in a pipe:

dat |> 
  subset(Sex == 0) |> 
  transform(DeltaWeightFromMean = Weight - mean(Weight),
            BMI = Weight/Height^2,
            CI = Weight/Height^3)

  Sex Height Weight      BMI DeltaWeightFromMean       CI
1   0    1.5     55 24.44444                 -17 16.29630
2   0    1.6     70 27.34375                  -2 17.08984
5   0    1.8     91 28.08642                  19 15.60357

21.8 Identify and remove duplicated row with `duplicated()` and `unique()`

The duplicated() function when applied on a data.frame returns a logical index specifying the location of duplicated rows - specifically, of row which are the duplicate of another row further up the data.frame. This means that if rows 20 and 23 are identical, duplicated() will return TRUE for row 23.

On the other hand, unique() will remove duplicate rows from a data.frame.

x <- data.frame(ID = c(203, 808, 909, 707, 808),
                Age = c(23, 44, 33, 42, 44))

duplicated(x)

[1] FALSE FALSE FALSE FALSE  TRUE

unique(x)

21.1 Indexing

21.2 Column and row names

21.3 Delete columns or rows

21.4 subset()

21.5 split()

21.6 with()

21.7 Feature transformation with transform()

21.8 Identify and remove duplicated row with duplicated() and unique()

21.4 `subset()`

21.5 `split()`

21.6 `with()`

21.7 Feature transformation with `transform()`

21.8 Identify and remove duplicated row with `duplicated()` and `unique()`