17 Summarizing Data

Let’s read in a dataset on heart disease from OpenML:

heart <- read.csv("https://www.openml.org/data/get_csv/51/dataset_51_heart-h",
                  na.strings = "?",
                  stringsAsFactors = TRUE)

One of the first things you might want to know is the size of the dataset:

dim(heart)

[1] 294  14

Since the dataset does not have too many columns, you can use str() to get the type of each column and a preview of its values:

str(heart)

'data.frame':   294 obs. of  14 variables:
 $ age       : int  28 29 29 30 31 32 32 32 33 34 ...
 $ sex       : Factor w/ 2 levels "female","male": 2 2 2 1 1 1 2 2 2 1 ...
 $ chest_pain: Factor w/ 4 levels "asympt","atyp_angina",..: 2 2 2 4 2 2 2 2 3 2 ...
 $ trestbps  : int  130 120 140 170 100 105 110 125 120 130 ...
 $ chol      : int  132 243 NA 237 219 198 225 254 298 161 ...
 $ fbs       : Factor w/ 2 levels "f","t": 1 1 1 1 1 1 1 1 1 1 ...
 $ restecg   : Factor w/ 3 levels "left_vent_hyper",..: 1 2 2 3 3 2 2 2 2 2 ...
 $ thalach   : int  185 160 170 170 150 165 184 155 185 190 ...
 $ exang     : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
 $ oldpeak   : num  0 0 0 0 0 0 0 0 0 0 ...
 $ slope     : Factor w/ 3 levels "down","flat",..: NA NA NA NA NA NA NA NA NA NA ...
 $ ca        : int  NA NA NA NA NA NA NA NA NA NA ...
 $ thal      : Factor w/ 3 levels "fixed_defect",..: NA NA NA 1 NA NA NA NA NA NA ...
 $ num       : Factor w/ 2 levels "'<50'","'>50_1'": 1 1 1 1 1 1 1 1 1 1 ...
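The same information that dim() and str() print can also be retrieved programmatically. The following is a minimal sketch on a small made-up data.frame (`toy` and its values are hypothetical, loosely modeled on a few columns of the heart data), so that it runs without downloading anything:

```r
# Hypothetical toy data, not the actual heart dataset
toy <- data.frame(age = c(28L, 29L, 30L),
                  sex = factor(c("male", "male", "female")),
                  oldpeak = c(0, 0, 1.5))

n_rows <- nrow(toy)               # number of rows, same as dim(toy)[1]
n_cols <- ncol(toy)               # number of columns, same as dim(toy)[2]
col_names <- names(toy)           # column names
col_types <- sapply(toy, class)   # class of each column, as a named character vector
col_types
```

Unlike str(), which only prints, these calls return values you can store and reuse, for example to select all factor columns.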

Use head() to look at the first few rows (6 by default):

head(heart)

  age    sex  chest_pain trestbps chol fbs               restecg thalach exang
1  28   male atyp_angina      130  132   f       left_vent_hyper     185    no
2  29   male atyp_angina      120  243   f                normal     160    no
3  29   male atyp_angina      140   NA   f                normal     170    no
4  30 female  typ_angina      170  237   f st_t_wave_abnormality     170    no
5  31 female atyp_angina      100  219   f st_t_wave_abnormality     150    no
6  32 female atyp_angina      105  198   f                normal     165    no
  oldpeak slope ca         thal   num
1       0  <NA> NA         <NA> '<50'
2       0  <NA> NA         <NA> '<50'
3       0  <NA> NA         <NA> '<50'
4       0  <NA> NA fixed_defect '<50'
5       0  <NA> NA         <NA> '<50'
6       0  <NA> NA         <NA> '<50'

The equivalent tail() prints the last few rows:

tail(heart)

    age    sex  chest_pain trestbps chol fbs               restecg thalach
289  52   male      asympt      140  266   f                normal     134
290  52   male      asympt      160  331   f                normal      94
291  54 female non_anginal      130  294   f st_t_wave_abnormality     100
292  56   male      asympt      155  342   t                normal     150
293  58 female atyp_angina      180  393   f                normal     110
294  65   male      asympt      130  275   f st_t_wave_abnormality     115
    exang oldpeak slope ca              thal     num
289   yes     2.0  flat NA              <NA> '>50_1'
290   yes     2.5  <NA> NA              <NA> '>50_1'
291   yes     0.0  flat NA              <NA> '>50_1'
292   yes     3.0  flat NA              <NA> '>50_1'
293   yes     1.0  flat NA reversable_defect '>50_1'
294   yes     1.0  flat NA              <NA> '>50_1'
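Both head() and tail() accept an `n` argument to change the number of rows returned. A short sketch, using a hypothetical 10-row data.frame (`df`) in place of the heart data:

```r
# Hypothetical data for illustration
df <- data.frame(id = 1:10, value = (1:10)^2)

first3 <- head(df, n = 3)         # first 3 rows
last2 <- tail(df, n = 2)          # last 2 rows
all_but_last8 <- head(df, n = -8) # negative n: everything except the last 8 rows
```

A negative `n` drops rows from the opposite end instead of selecting a fixed count, which is handy when the total number of rows is not known in advance.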

17.1 Get summary of an R object with `summary()`

R includes a summary() method for a number of different objects, including (of course) data.frames:

summary(heart)

      age            sex            chest_pain     trestbps          chol      
 Min.   :28.00   female: 81   asympt     :123   Min.   : 92.0   Min.   : 85.0  
 1st Qu.:42.00   male  :213   atyp_angina:106   1st Qu.:120.0   1st Qu.:209.0  
 Median :49.00                non_anginal: 54   Median :130.0   Median :243.0  
 Mean   :47.83                typ_angina : 11   Mean   :132.6   Mean   :250.8  
 3rd Qu.:54.00                                  3rd Qu.:140.0   3rd Qu.:282.5  
 Max.   :66.00                                  Max.   :200.0   Max.   :603.0  
                                                NA's   :1       NA's   :23     
   fbs                       restecg       thalach       exang    
 f   :266   left_vent_hyper      :  6   Min.   : 82.0   no  :204  
 t   : 20   normal               :235   1st Qu.:122.0   yes : 89  
 NA's:  8   st_t_wave_abnormality: 52   Median :140.0   NA's:  1  
            NA's                 :  1   Mean   :139.1             
                                        3rd Qu.:155.0             
                                        Max.   :190.0             
                                        NA's   :1                 
    oldpeak        slope           ca                     thal    
 Min.   :0.0000   down:  1   Min.   :0     fixed_defect     : 10  
 1st Qu.:0.0000   flat: 91   1st Qu.:0     normal           :  7  
 Median :0.0000   up  : 12   Median :0     reversable_defect: 11  
 Mean   :0.5861   NA's:190   Mean   :0     NA's             :266  
 3rd Qu.:1.0000              3rd Qu.:0                            
 Max.   :5.0000              Max.   :0                            
                             NA's   :291                          
      num     
 '<50'  :188  
 '>50_1':106
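summary() also works on individual vectors and factors, which is what the data.frame method does column by column. A minimal sketch with made-up values (`x` and `f` are hypothetical, not taken from the dataset):

```r
# Numeric vector with one missing value
x <- c(132, 243, NA, 237, 219)
s <- summary(x)
s                          # Min., quartiles, Mean, Max., plus a count of NA's
x_median <- s[["Median"]]  # individual statistics can be extracted by name

# Factor: summary() returns the count of each level, plus NA's if present
f <- factor(c("f", "t", "f", NA))
f_counts <- summary(f)
f_counts
```

Note that the quantiles and the mean are computed after dropping missing values, which are reported separately under `NA's`.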

17.2 Fast builtin column and row operations

R has optimized builtin functions for some very common row and column operations. Their names are self-explanatory, and they can be applied to matrices and data.frames:

colSums(): column sums
rowSums(): row sums
colMeans(): column means
rowMeans(): row means

a <- matrix(1:20, nrow = 5)
a

     [,1] [,2] [,3] [,4]
[1,]    1    6   11   16
[2,]    2    7   12   17
[3,]    3    8   13   18
[4,]    4    9   14   19
[5,]    5   10   15   20

colSums(a)

[1] 15 40 65 90

rowSums(a)

[1] 34 38 42 46 50

colMeans(a)

[1]  3  8 13 18

rowMeans(a)

[1]  8.5  9.5 10.5 11.5 12.5
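These functions return NA for any row or column that contains a missing value unless you set `na.rm = TRUE`. The sketch below, using a small hypothetical matrix `m` and data.frame `toy`, shows this, along with a common idiom for counting missing values per column, and checks that colSums() matches the slower apply() equivalent:

```r
# Hypothetical matrix with one missing value in the first column
m <- matrix(c(1, NA, 3, 4, 5, 6), nrow = 2)

colMeans(m)                           # first column mean is NA
means_narm <- colMeans(m, na.rm = TRUE)
means_narm                            # 1.0 3.5 5.5

# Count missing values per column: is.na() gives a logical matrix,
# and colSums() counts the TRUEs
toy <- data.frame(chol = c(132, NA, 237), ca = c(NA, NA, 0))
na_counts <- colSums(is.na(toy))
na_counts

# colSums() gives the same result as apply() over columns
a <- matrix(1:20, nrow = 5)
same <- isTRUE(all.equal(colSums(a), apply(a, 2, sum)))
```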

17.3 Optimized matrix operations with matrixStats

While the builtin operations above are already optimized and faster than the equivalent apply() calls, the matrixStats package (Bengtsson 2019) offers a number of further optimized matrix operations, including drop-in replacements for the above. Prefer these when working with larger data:

library(matrixStats)
colSums2(a)

[1] 15 40 65 90

rowSums2(a)

[1] 34 38 42 46 50

colMeans2(a)

[1]  3  8 13 18

rowMeans2(a)

[1]  8.5  9.5 10.5 11.5 12.5

Note: matrixStats names its replacement functions almost identically to their base counterparts, so they are easy to find, but differently enough (note the trailing 2) that they do not mask the base functions.
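Beyond sums and means, matrixStats also includes row and column statistics that have no builtin counterpart in base R, such as colMedians() and rowSds(). A brief sketch, guarded so that it only runs if the package is installed, and calling the functions with `::` so that nothing is attached:

```r
a <- matrix(1:20, nrow = 5)

if (requireNamespace("matrixStats", quietly = TRUE)) {
  col_medians <- matrixStats::colMedians(a)  # median of each column
  row_sds <- matrixStats::rowSds(a)          # standard deviation of each row
  print(col_medians)
  print(row_sds)
}
```

In base R the same results would need apply(a, 2, median) and apply(a, 1, sd).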

17.4 See also

aggregate() for grouped summary statistics
Loop Functions for applying any function on subsets of data.
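As a quick taste of grouped summaries, here is a minimal aggregate() sketch using its formula interface. The data.frame `toy` and its values are hypothetical and only borrow column names from the heart data:

```r
# Hypothetical data: cholesterol by sex, with one missing value
toy <- data.frame(sex = factor(c("male", "male", "female", "female", "male")),
                  chol = c(132, 243, 237, 219, NA))

# Mean cholesterol per sex; the formula interface drops rows with NA by default
chol_by_sex <- aggregate(chol ~ sex, data = toy, FUN = mean)
chol_by_sex
```

The result is itself a data.frame with one row per group.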
