17  Summarizing Data

Let’s read in a dataset on heart disease from OpenML:

heart <- read.csv("https://www.openml.org/data/get_csv/51/dataset_51_heart-h",
                  na.strings = "?",
                  stringsAsFactors = TRUE)

One of the first things you might want to know is the size of the dataset:

dim(heart)
[1] 294  14
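
dim() returns the number of rows followed by the number of columns; nrow() and ncol() return each value on its own. A minimal sketch on a small made-up data frame (`df` and its columns are illustrative and not part of the heart data):

```r
# Toy data frame; names and values are illustrative only
df <- data.frame(id = 1:4, score = c(2.5, 3.1, NA, 4.8))

dim(df)   # rows first, then columns
nrow(df)  # number of rows only
ncol(df)  # number of columns only
```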

Since it does not contain too many columns, you can use str() to get the type of each and a preview of some of the data:

str(heart)
'data.frame':   294 obs. of  14 variables:
 $ age       : int  28 29 29 30 31 32 32 32 33 34 ...
 $ sex       : Factor w/ 2 levels "female","male": 2 2 2 1 1 1 2 2 2 1 ...
 $ chest_pain: Factor w/ 4 levels "asympt","atyp_angina",..: 2 2 2 4 2 2 2 2 3 2 ...
 $ trestbps  : int  130 120 140 170 100 105 110 125 120 130 ...
 $ chol      : int  132 243 NA 237 219 198 225 254 298 161 ...
 $ fbs       : Factor w/ 2 levels "f","t": 1 1 1 1 1 1 1 1 1 1 ...
 $ restecg   : Factor w/ 3 levels "left_vent_hyper",..: 1 2 2 3 3 2 2 2 2 2 ...
 $ thalach   : int  185 160 170 170 150 165 184 155 185 190 ...
 $ exang     : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
 $ oldpeak   : num  0 0 0 0 0 0 0 0 0 0 ...
 $ slope     : Factor w/ 3 levels "down","flat",..: NA NA NA NA NA NA NA NA NA NA ...
 $ ca        : int  NA NA NA NA NA NA NA NA NA NA ...
 $ thal      : Factor w/ 3 levels "fixed_defect",..: NA NA NA 1 NA NA NA NA NA NA ...
 $ num       : Factor w/ 2 levels "'<50'","'>50_1'": 1 1 1 1 1 1 1 1 1 1 ...
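
When a dataset has too many columns for str() to be readable, one possible alternative is to collect only the class of each column and tabulate the result. The sketch below assumes a small made-up data frame (`df` is illustrative) in which every column has a single class:

```r
# Toy data frame mimicking a few of the column types seen above
df <- data.frame(
  age     = c(28L, 29L, 30L),
  sex     = factor(c("male", "male", "female")),
  oldpeak = c(0, 0, 1.5)
)

# Named character vector with the class of each column
col_classes <- sapply(df, class)
col_classes

# How many columns there are of each class
table(col_classes)
```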

To look at the first few rows, use head(), which prints 6 rows by default:

head(heart)
  age    sex  chest_pain trestbps chol fbs               restecg thalach exang
1  28   male atyp_angina      130  132   f       left_vent_hyper     185    no
2  29   male atyp_angina      120  243   f                normal     160    no
3  29   male atyp_angina      140   NA   f                normal     170    no
4  30 female  typ_angina      170  237   f st_t_wave_abnormality     170    no
5  31 female atyp_angina      100  219   f st_t_wave_abnormality     150    no
6  32 female atyp_angina      105  198   f                normal     165    no
  oldpeak slope ca         thal   num
1       0  <NA> NA         <NA> '<50'
2       0  <NA> NA         <NA> '<50'
3       0  <NA> NA         <NA> '<50'
4       0  <NA> NA fixed_defect '<50'
5       0  <NA> NA         <NA> '<50'
6       0  <NA> NA         <NA> '<50'

The equivalent tail() prints the last few rows:

tail(heart)
    age    sex  chest_pain trestbps chol fbs               restecg thalach
289  52   male      asympt      140  266   f                normal     134
290  52   male      asympt      160  331   f                normal      94
291  54 female non_anginal      130  294   f st_t_wave_abnormality     100
292  56   male      asympt      155  342   t                normal     150
293  58 female atyp_angina      180  393   f                normal     110
294  65   male      asympt      130  275   f st_t_wave_abnormality     115
    exang oldpeak slope ca              thal     num
289   yes     2.0  flat NA              <NA> '>50_1'
290   yes     2.5  <NA> NA              <NA> '>50_1'
291   yes     0.0  flat NA              <NA> '>50_1'
292   yes     3.0  flat NA              <NA> '>50_1'
293   yes     1.0  flat NA reversable_defect '>50_1'
294   yes     1.0  flat NA              <NA> '>50_1'
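
Both head() and tail() accept a second argument, n, that sets how many rows to return; a negative n drops rows instead. A short sketch on a made-up data frame (`df` is illustrative):

```r
# Toy data frame with 10 rows
df <- data.frame(id = 1:10, value = (1:10)^2)

head(df, 3)   # first 3 rows
tail(df, 2)   # last 2 rows
head(df, -8)  # everything except the last 8 rows, i.e. the first 2
```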

17.1 Get summary of an R object with summary()

R includes a summary() method for a number of different objects, including (of course) data.frames:

summary(heart)
      age            sex            chest_pain     trestbps          chol      
 Min.   :28.00   female: 81   asympt     :123   Min.   : 92.0   Min.   : 85.0  
 1st Qu.:42.00   male  :213   atyp_angina:106   1st Qu.:120.0   1st Qu.:209.0  
 Median :49.00                non_anginal: 54   Median :130.0   Median :243.0  
 Mean   :47.83                typ_angina : 11   Mean   :132.6   Mean   :250.8  
 3rd Qu.:54.00                                  3rd Qu.:140.0   3rd Qu.:282.5  
 Max.   :66.00                                  Max.   :200.0   Max.   :603.0  
                                                NA's   :1       NA's   :23     
   fbs                       restecg       thalach       exang    
 f   :266   left_vent_hyper      :  6   Min.   : 82.0   no  :204  
 t   : 20   normal               :235   1st Qu.:122.0   yes : 89  
 NA's:  8   st_t_wave_abnormality: 52   Median :140.0   NA's:  1  
            NA's                 :  1   Mean   :139.1             
                                        3rd Qu.:155.0             
                                        Max.   :190.0             
                                        NA's   :1                 
    oldpeak        slope           ca                     thal    
 Min.   :0.0000   down:  1   Min.   :0     fixed_defect     : 10  
 1st Qu.:0.0000   flat: 91   1st Qu.:0     normal           :  7  
 Median :0.0000   up  : 12   Median :0     reversable_defect: 11  
 Mean   :0.5861   NA's:190   Mean   :0     NA's             :266  
 3rd Qu.:1.0000              3rd Qu.:0                            
 Max.   :5.0000              Max.   :0                            
                             NA's   :291                          
      num     
 '<50'  :188  
 '>50_1':106  
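
summary() also works on a single column: a numeric vector gets a five-number summary plus the mean, while a factor gets a count per level. If only the missing-value counts are of interest, combining is.na() with colSums() is one way to get them. The sketch below uses a made-up data frame (`df` and its values are illustrative, loosely modeled on two of the columns above):

```r
# Toy data frame with one numeric and one factor column, each with one NA
df <- data.frame(
  chol = c(132, 243, NA, 237, 219),
  fbs  = factor(c("f", "f", "t", NA, "f"))
)

summary(df$chol)  # quartiles, mean, and NA count
summary(df$fbs)   # counts per level, plus NA count

# Number of missing values in each column
na_counts <- colSums(is.na(df))
na_counts
```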
              
              
              
              
              

17.2 Fast builtin column and row operations

R has optimized builtin functions for some very common row and column operations. Their names are self-explanatory, and they can be applied to matrices and data.frames:

a <- matrix(1:20, nrow = 5)
a
     [,1] [,2] [,3] [,4]
[1,]    1    6   11   16
[2,]    2    7   12   17
[3,]    3    8   13   18
[4,]    4    9   14   19
[5,]    5   10   15   20
colSums(a)
[1] 15 40 65 90
rowSums(a)
[1] 34 38 42 46 50
colMeans(a)
[1]  3  8 13 18
rowMeans(a)
[1]  8.5  9.5 10.5 11.5 12.5
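
The builtin column and row functions give the same results as the corresponding apply() calls, and they take an na.rm argument for missing values. A small sketch that checks this on the matrix above (`b` is an illustrative copy with one value set to NA):

```r
a <- matrix(1:20, nrow = 5)

# Builtin column sums versus the generic apply() version
col_sums       <- colSums(a)
col_sums_apply <- apply(a, 2, sum)
all.equal(col_sums, col_sums_apply)

# Same comparison for row means
all.equal(rowMeans(a), apply(a, 1, mean))

# Missing values propagate unless na.rm = TRUE
b <- a
b[1, 1] <- NA
colSums(b)                # first column sum is NA
colSums(b, na.rm = TRUE)  # first column sum ignores the NA
```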

17.3 Optimized matrix operations with matrixStats

While the builtin operations above are already optimized and faster than the equivalent apply() calls, the matrixStats package (Bengtsson 2019) offers a number of further optimized matrix operations, including drop-in replacements of the above. These should be preferred when dealing with bigger data:

library(matrixStats)
colSums2(a)
[1] 15 40 65 90
rowSums2(a)
[1] 34 38 42 46 50
colMeans2(a)
[1]  3  8 13 18
rowMeans2(a)
[1]  8.5  9.5 10.5 11.5 12.5

Note: matrixStats provides replacement functions named almost identically to their base counterparts (e.g. colSums2() for colSums()), so they are easy to find. Because the names differ, they don’t mask the base functions.
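
Beyond the sum and mean replacements, matrixStats also covers statistics that have no dedicated column or row function in base R, such as medians and standard deviations. The sketch below assumes the package may not be installed, so it checks with requireNamespace() first and calls the functions through the matrixStats:: prefix:

```r
a <- matrix(1:20, nrow = 5)

if (requireNamespace("matrixStats", quietly = TRUE)) {
  # Drop-in counterparts of colSums() and rowMeans()
  print(matrixStats::colSums2(a))
  print(matrixStats::rowMeans2(a))

  # Statistics without a dedicated base R column/row function
  print(matrixStats::colMedians(a))
  print(matrixStats::rowSds(a))
} else {
  message("matrixStats is not installed; run install.packages(\"matrixStats\") to try this")
}
```

Using the matrixStats:: prefix avoids attaching the package, which can be convenient when only one or two of its functions are needed.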

17.4 See also