18  Aggregate

R includes a number of commands to apply functions on splits of your data. aggregate() is a powerful tool to perform such “group-by” operations.

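The function accepts either:

  • a formula as the first argument, together with a data.frame passed to the data argument, or
  • one or more R objects (a vector, data.frame, or list) as the first argument, together with a list of grouping elements passed to the by argument.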

We shall see how to perform each operation below with each approach.

The formula interface might be easier to work with interactively on the console. Note that while you can programmatically create a formula, it is easier to use vector inputs when calling aggregate() programmatically.
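
As a minimal sketch of the programmatic case, the following wraps both interfaces in small helper functions that take column names as character strings. The helper names agg_by_name() and agg_by_formula() are hypothetical, and the built-in iris dataset is used only so that the example runs without extra packages. reformulate() is the base R function used here to build a formula from strings.

```r
# Hypothetical helper: aggregate one column by one grouping column,
# both given as character strings, using the object interface.
agg_by_name <- function(data, value, group, fun = mean, ...) {
  aggregate(data[value], by = data[group], FUN = fun, ...)
}

# Hypothetical helper: the same operation through the formula interface.
# reformulate() builds the formula `value ~ group` from the strings.
agg_by_formula <- function(data, value, group, fun = mean, ...) {
  f <- reformulate(group, response = value)
  aggregate(f, data = data, FUN = fun, ...)
}

res_obj <- agg_by_name(iris, "Sepal.Length", "Species")
res_frm <- agg_by_formula(iris, "Sepal.Length", "Species")
print(res_obj)
print(res_frm)
```

Both helpers return the same group means here. The object version needs no formula construction, which is why it tends to be the simpler choice inside functions.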

For this example, we shall use the penguins data from the palmerpenguins package:

library(palmerpenguins)
str(penguins)
tibble [344 × 8] (S3: tbl_df/tbl/data.frame)
 $ species          : Factor w/ 3 levels "Adelie","Chinstrap",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ island           : Factor w/ 3 levels "Biscoe","Dream",..: 3 3 3 3 3 3 3 3 3 3 ...
 $ bill_length_mm   : num [1:344] 39.1 39.5 40.3 NA 36.7 39.3 38.9 39.2 34.1 42 ...
 $ bill_depth_mm    : num [1:344] 18.7 17.4 18 NA 19.3 20.6 17.8 19.6 18.1 20.2 ...
 $ flipper_length_mm: int [1:344] 181 186 195 NA 193 190 181 195 193 190 ...
 $ body_mass_g      : int [1:344] 3750 3800 3250 NA 3450 3650 3625 4675 3475 4250 ...
 $ sex              : Factor w/ 2 levels "female","male": 2 1 1 NA 1 2 1 2 NA NA ...
 $ year             : int [1:344] 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...

The examples below aggregate one or more variables by one or more groups, using either the formula interface or R objects directly, with $-indexing or with():

18.1 Single variable by single grouping

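Note that the formula method defaults to na.action = na.omit, so rows with a missing value in any variable of the formula are dropped before the function is applied.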

Using the formula interface:

aggregate(bill_length_mm ~ species,
          data = penguins,
          mean, na.rm = TRUE)
    species bill_length_mm
1    Adelie       38.79139
2 Chinstrap       48.83382
3    Gentoo       47.50488

Using R objects directly:

aggregate(penguins$bill_length_mm,
          by = list(penguins$species),
          mean, na.rm = TRUE)
    Group.1        x
1    Adelie 38.79139
2 Chinstrap 48.83382
3    Gentoo 47.50488

Note that, unlike with the formula notation, passing an unnamed vector and an unnamed by list loses the original variable names: the output columns get the default names Group.1 and x.

If instead of passing a vector, you pass a data.frame or list with one or more named elements, the output includes the names:

aggregate(penguins["bill_length_mm"],
          by = penguins["species"],
          mean, na.rm = TRUE)
    species bill_length_mm
1    Adelie       38.79139
2 Chinstrap       48.83382
3    Gentoo       47.50488

Creating a list instead of indexing the given data.frame also allows you to set custom names. Note that names are made syntactically valid, so `Bill length` appears as Bill.length in the output:

aggregate(list(`Bill length` = penguins$bill_length_mm),
          by = list(Species = penguins$species),
          mean, na.rm = TRUE)
    Species Bill.length
1    Adelie    38.79139
2 Chinstrap    48.83382
3    Gentoo    47.50488

18.2 Multiple variables by single grouping

Formula notation:

aggregate(cbind(bill_length_mm, flipper_length_mm) ~ species,
          data = penguins,
          mean)
    species bill_length_mm flipper_length_mm
1    Adelie       38.79139          189.9536
2 Chinstrap       48.83382          195.8235
3    Gentoo       47.50488          217.1870

Objects:

aggregate(penguins[, c("bill_length_mm", "flipper_length_mm")],
          by = list(Species = penguins$species),
          mean, na.rm = TRUE)
    Species bill_length_mm flipper_length_mm
1    Adelie       38.79139          189.9536
2 Chinstrap       48.83382          195.8235
3    Gentoo       47.50488          217.1870
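
The two outputs above are identical, but this need not hold in general. Because the formula method defaults to na.action = na.omit, it drops every row that has a missing value in any of the variables in the formula, whereas na.rm = TRUE in the object interface removes missing values one column at a time. The sketch below illustrates the difference on a small made-up data.frame (df and its columns are hypothetical, not part of the penguins data):

```r
# Toy data: x and y are missing in different rows of group "a"
df <- data.frame(
  g = c("a", "a", "a", "b", "b", "b"),
  x = c(1, 2, NA, 4, 5, 6),
  y = c(10, NA, 30, 40, 50, 60)
)

# Formula interface: na.omit removes rows 2 and 3 entirely,
# so group "a" is summarized from row 1 only
by_formula <- aggregate(cbind(x, y) ~ g, data = df, FUN = mean)

# Object interface: na.rm = TRUE drops NAs separately within each column
by_objects <- aggregate(df[c("x", "y")], by = df["g"],
                        FUN = mean, na.rm = TRUE)

# Formula interface with na.action = na.pass keeps all rows,
# leaving NA handling to FUN via na.rm = TRUE
by_pass <- aggregate(cbind(x, y) ~ g, data = df,
                     FUN = mean, na.rm = TRUE, na.action = na.pass)

print(by_formula)
print(by_objects)
print(by_pass)
```

In this sketch, by_objects and by_pass agree with each other, while by_formula differs for group "a". Which behavior is appropriate depends on whether you want each variable summarized over complete rows only.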

18.3 Single variable by multiple groups

Formula notation:

aggregate(bill_length_mm ~ species + island, data = penguins, mean)
    species    island bill_length_mm
1    Adelie    Biscoe       38.97500
2    Gentoo    Biscoe       47.50488
3    Adelie     Dream       38.50179
4 Chinstrap     Dream       48.83382
5    Adelie Torgersen       38.95098

Objects:

aggregate(penguins["bill_length_mm"],
          by = list(Species = penguins$species, 
                    Island = penguins$island),
          mean, na.rm = TRUE)
    Species    Island bill_length_mm
1    Adelie    Biscoe       38.97500
2    Gentoo    Biscoe       47.50488
3    Adelie     Dream       38.50179
4 Chinstrap     Dream       48.83382
5    Adelie Torgersen       38.95098

18.4 Multiple variables by multiple groupings

Formula notation:

aggregate(cbind(bill_length_mm, flipper_length_mm) ~ species + island,
          data = penguins, mean)
    species    island bill_length_mm flipper_length_mm
1    Adelie    Biscoe       38.97500          188.7955
2    Gentoo    Biscoe       47.50488          217.1870
3    Adelie     Dream       38.50179          189.7321
4 Chinstrap     Dream       48.83382          195.8235
5    Adelie Torgersen       38.95098          191.1961

Objects:

aggregate(penguins[, c("bill_length_mm", "flipper_length_mm")],
          by = list(Species = penguins$species, 
                    Island = penguins$island),
          mean, na.rm = TRUE)
    Species    Island bill_length_mm flipper_length_mm
1    Adelie    Biscoe       38.97500          188.7955
2    Gentoo    Biscoe       47.50488          217.1870
3    Adelie     Dream       38.50179          189.7321
4 Chinstrap     Dream       48.83382          195.8235
5    Adelie Torgersen       38.95098          191.1961

18.5 Using with()

R allows you to use with(data, expression), where data can be a data.frame, list or environment, and within the expression refer to any elements of data by their name, without the need to index data each time.

For example, with(df, expression) means you can use the data.frame’s column names directly within the expression without the need to use df$column:

with(penguins,
     aggregate(list(`Bill length` = bill_length_mm),
               by = list(Species = species),
               mean, na.rm = TRUE))
    Species Bill.length
1    Adelie    38.79139
2 Chinstrap    48.83382
3    Gentoo    47.50488
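
Since with() also accepts a list, the same pattern works when the values and the grouping variable are not stored in a data.frame. The following is a minimal sketch using a made-up list (the name measurements and its elements are hypothetical):

```r
# A plain list holding a numeric vector and a grouping vector of equal length
measurements <- list(
  weight = c(4.1, 5.3, 6.2, 5.8),
  diet = c("a", "a", "b", "b")
)

# Inside with(), `weight` and `diet` refer to the elements of the list
res <- with(measurements,
            aggregate(list(Weight = weight),
                      by = list(Diet = diet),
                      FUN = mean))
print(res)
```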

18.6 See also

  • tapply() for an alternative method of applying a function to subsets of a single variable (probably faster).
  • For large datasets, it is recommended to use data.table for fast group-by data summarization.
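
For comparison, here is a minimal sketch of the tapply() alternative mentioned above, shown next to the equivalent aggregate() call. The built-in iris dataset is used only so that the example is self-contained. Note the different return types: tapply() returns a named array, while aggregate() returns a data.frame.

```r
# Group means with tapply(): returns a named one-dimensional array
means_tapply <- tapply(iris$Sepal.Length, iris$Species, mean)

# The same summary with aggregate(): returns a data.frame
means_agg <- aggregate(Sepal.Length ~ Species, data = iris, FUN = mean)

print(means_tapply)
print(means_agg)
```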