See the data frame section for an introduction to data frames.
Column and row names
Let’s start with a simple example data.frame:
df <- data.frame(PID = 111:119,
                 Hospital = c("UCSF", "HUP", "Stanford",
                              "Stanford", "UCSF", "HUP",
                              "HUP", "Stanford", "UCSF"),
                 Age = c(22, 34, 41, 19, 53, 21, 63, 22, 19),
                 Sex = c(1, 1, 0, 1, 0, 0, 1, 0, 0))
df
  PID Hospital Age Sex
1 111     UCSF  22   1
2 112      HUP  34   1
3 113 Stanford  41   0
4 114 Stanford  19   1
5 115     UCSF  53   0
6 116      HUP  21   0
7 117      HUP  63   1
8 118 Stanford  22   0
9 119     UCSF  19   0
The optional row.names argument (see the data.frame usage in the R documentation) can be used to define row names when the data frame is created. It accepts either:
a single integer or character string specifying a column of the data.frame being created whose values should be used as row names, or
a vector of values (character or integer) to be used as the row names.
For example, we can use the “PID” column:
df <- data.frame(PID = 111:119,
                 Hospital = c("UCSF", "HUP", "Stanford",
                              "Stanford", "UCSF", "HUP",
                              "HUP", "Stanford", "UCSF"),
                 Age = c(22, 34, 41, 19, 53, 21, 63, 22, 19),
                 Sex = c(1, 1, 0, 1, 0, 0, 1, 0, 0),
                 row.names = "PID")
It is recommended not to depend on row names to identify or index the rows of a data.frame; include a column of case IDs instead.
We can get column names and row names with colnames() and rownames(), respectively:
colnames(df)
[1] "Hospital" "Age"      "Sex"
rownames(df)
[1] "111" "112" "113" "114" "115" "116" "117" "118" "119"
To set new column or row names, use the form:
colnames(df) <- new.colnames
rownames(df) <- new.rownames
where new.colnames and new.rownames are character vectors.
You can rename all columns/rows or use indexing to replace specific names.
For example, to rename all rows, do:
          Hospital Age Sex
Patient_1     UCSF  22   1
Patient_2      HUP  34   1
Patient_3 Stanford  41   0
Patient_4 Stanford  19   1
Patient_5     UCSF  53   0
Patient_6      HUP  21   0
Patient_7      HUP  63   1
Patient_8 Stanford  22   0
Patient_9     UCSF  19   0
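The call that produced the table above is not shown. A minimal sketch of one way to get the same row names, assuming the data frame was created with PID as row names as in the earlier example, and assuming the names are built with paste0() (the original may have used a different construction):

```r
# Same data as above, with PID values used as row names
df <- data.frame(Hospital = c("UCSF", "HUP", "Stanford",
                              "Stanford", "UCSF", "HUP",
                              "HUP", "Stanford", "UCSF"),
                 Age = c(22, 34, 41, 19, 53, 21, 63, 22, 19),
                 Sex = c(1, 1, 0, 1, 0, 0, 1, 0, 0),
                 row.names = 111:119)

# Build "Patient_1", ..., "Patient_9" and assign them as the new row names
rownames(df) <- paste0("Patient_", seq_len(nrow(df)))
df
```

Using seq_len(nrow(df)) rather than a hard-coded 1:9 keeps the assignment valid if the number of rows changes.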
To rename the first two columns, do:
colnames(df)[1:2] <- c("Center", "Age_at_Dx")
df
            Center Age_at_Dx Sex
Patient_1     UCSF        22   1
Patient_2      HUP        34   1
Patient_3 Stanford        41   0
Patient_4 Stanford        19   1
Patient_5     UCSF        53   0
Patient_6      HUP        21   0
Patient_7      HUP        63   1
Patient_8 Stanford        22   0
Patient_9     UCSF        19   0
Delete columns or rows
To delete a data.frame column, set it to NULL:
            Center Age_at_Dx
Patient_1     UCSF        22
Patient_2      HUP        34
Patient_3 Stanford        41
Patient_4 Stanford        19
Patient_5     UCSF        53
Patient_6      HUP        21
Patient_7      HUP        63
Patient_8 Stanford        22
Patient_9     UCSF        19
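The assignment that removed the Sex column is not shown above. A minimal sketch of what it likely looks like, assuming the column is addressed by name with the $ operator (the data frame is rebuilt here so the example runs on its own):

```r
# Data frame in the state shown before the deletion
df <- data.frame(Center = c("UCSF", "HUP", "Stanford",
                            "Stanford", "UCSF", "HUP",
                            "HUP", "Stanford", "UCSF"),
                 Age_at_Dx = c(22, 34, 41, 19, 53, 21, 63, 22, 19),
                 Sex = c(1, 1, 0, 1, 0, 0, 1, 0, 0),
                 row.names = paste0("Patient_", 1:9))

# Assigning NULL to a column removes it from the data frame
df$Sex <- NULL

# The same deletion could also be written as: df[["Sex"]] <- NULL
colnames(df)
```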
To delete a data.frame row, you can “index it out”.
For example, to remove the third and fifth rows of the above data.frame using an integer index:
            Center Age_at_Dx
Patient_1     UCSF        22
Patient_2      HUP        34
Patient_4 Stanford        19
Patient_6      HUP        21
Patient_7      HUP        63
Patient_8 Stanford        22
Patient_9     UCSF        19
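The indexing call itself is not shown above. A minimal sketch, assuming a negative integer index is used to drop the third and fifth rows (the data frame is rebuilt here so the example is self-contained):

```r
# Data frame in the state shown before the row deletion
df <- data.frame(Center = c("UCSF", "HUP", "Stanford",
                            "Stanford", "UCSF", "HUP",
                            "HUP", "Stanford", "UCSF"),
                 Age_at_Dx = c(22, 34, 41, 19, 53, 21, 63, 22, 19),
                 row.names = paste0("Patient_", 1:9))

# Negative integers exclude those row positions;
# leaving the column index empty after the comma keeps all columns
df <- df[-c(3, 5), ]
rownames(df)
```

Note that the remaining rows keep their original row names, so Patient_3 and Patient_5 are simply absent rather than renumbered.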
You can similarly exclude rows using a logical index. A logical index usually results from a filtering condition.
For example, to exclude patients under 20 years old, do (note that the age column was renamed to Age_at_Dx above):
df <- df[!(df$Age_at_Dx < 20), ]
df
            Center Age_at_Dx
Patient_1     UCSF        22
Patient_2      HUP        34
Patient_6      HUP        21
Patient_7      HUP        63
Patient_8 Stanford        22
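To make the logical index explicit, the sketch below builds it as a separate vector before using it. It uses a small hypothetical data frame, ages, which is not part of the original example and includes a missing value to show one caveat: an NA in a logical index yields a row of NAs, which wrapping the index in which() avoids.

```r
# Hypothetical data, for illustration only
ages <- data.frame(ID = 1:5,
                   Age = c(22, 19, NA, 34, 18))

# The filtering condition returns one logical value per row (NA where Age is missing)
keep <- ages$Age >= 20
keep

# Indexing with the logical vector: the NA position produces a row filled with NA
ages[keep, ]

# which() converts the logical vector to the positions that are TRUE, dropping NA
ages[which(keep), ]
```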
subset()
subset() allows you to:
filter cases that meet certain conditions using the subset argument
select columns using the select argument
head() returns the first few rows of a data frame (six by default). We use it to avoid printing too many lines, e.g.
head(iris)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa
iris_sl.gt.med <- subset(iris, Sepal.Length > median(Sepal.Length))
Note: You can use the column name Sepal.Length directly, i.e. unquoted, and you don’t need to write iris$Sepal.Length.
This is called Non-Standard Evaluation (NSE).
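To illustrate what NSE saves you from typing, the sketch below writes the same filter twice: once with subset(), also using its select argument, and once with standard bracket indexing, where the data frame name and quoted column names must be spelled out. The object names a and b are placeholders chosen for this example.

```r
# NSE form: columns are referred to by bare name in both arguments
a <- subset(iris,
            subset = Sepal.Length > median(Sepal.Length),
            select = c(Sepal.Length, Species))

# Standard indexing: explicit iris$ prefix and quoted column names
b <- iris[iris$Sepal.Length > median(iris$Sepal.Length),
          c("Sepal.Length", "Species")]

# Check that both approaches return the same rows and columns
all.equal(a, b)
```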
The following outputs show the result of subsetting a data frame x (with columns one, two, and group) to the rows where two is positive, each time selecting different columns:
          two
1  0.54992537
2  2.55730552
5  0.15558755
7  0.13558505
8  0.08646245
9  0.08867163
10 0.73122091

          two group
1  0.54992537 alpha
2  2.55730552 alpha
5  0.15558755  beta
7  0.13558505  beta
8  0.08646245  beta
9  0.08867163  beta
10 0.73122091  beta

          two one
1  0.54992537   1
2  2.55730552   2
5  0.15558755   5
7  0.13558505   7
8  0.08646245   8
9  0.08867163   9
10 0.73122091  10

          two group
1  0.54992537 alpha
2  2.55730552 alpha
5  0.15558755  beta
7  0.13558505  beta
8  0.08646245  beta
9  0.08867163  beta
10 0.73122091  beta
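The definition of x and the subset() calls behind the outputs above are not shown. The sketch below is a reconstruction under stated assumptions: the values of x are copied from the printed output in the split() section (so they are rounded, not the original numbers), and the four calls are guesses that produce tables with the same rows and columns.

```r
# Reconstructed from printed output; values are rounded copies, not the original data
x <- data.frame(one = 1:10,
                two = c(0.54992537, 2.55730552, -0.9177497, -0.6831720,
                        0.15558755, -0.78421387, 0.13558505, 0.08646245,
                        0.08867163, 0.73122091),
                group = c(rep("alpha", 4), rep("beta", 6)))

# Keep rows where `two` is positive, selecting a single column
subset(x, subset = two > 0, select = two)

# Select two columns by bare name
subset(x, subset = two > 0, select = c(two, group))

# Columns are returned in the order requested
subset(x, subset = two > 0, select = c(two, one))

# A minus sign excludes a column, here giving `two` and `group` again
subset(x, subset = two > 0, select = -one)
```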
split()
Split a data frame into multiple data frames by groups defined by a factor:
x_by_group <- split(x, x$group)
x_by_group
$alpha
  one        two group
1   1  0.5499254 alpha
2   2  2.5573055 alpha
3   3 -0.9177497 alpha
4   4 -0.6831720 alpha

$beta
   one         two group
5    5  0.15558755  beta
6    6 -0.78421387  beta
7    7  0.13558505  beta
8    8  0.08646245  beta
9    9  0.08867163  beta
10  10  0.73122091  beta
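Because split() returns a named list of data frames, each group can be accessed by name or processed in turn. The sketch below shows both; x is rebuilt from the printed output above (rounded values, an assumption), and the per-group summaries are illustrative additions rather than part of the original example.

```r
# Reconstructed from printed output; values are rounded copies
x <- data.frame(one = 1:10,
                two = c(0.54992537, 2.55730552, -0.9177497, -0.6831720,
                        0.15558755, -0.78421387, 0.13558505, 0.08646245,
                        0.08867163, 0.73122091),
                group = c(rep("alpha", 4), rep("beta", 6)))

x_by_group <- split(x, x$group)

# One list element per group, named after the group
names(x_by_group)

# Extract a single group's data frame by name
x_by_group$beta

# Apply a function to each group: number of rows, then mean of `two`
sapply(x_by_group, nrow)
sapply(x_by_group, function(d) mean(d$two))
```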
with()
Within a with() expression, you can access list elements or data.frame columns without quoting or using the $ operator:
with(x, one + two)
 [1]  1.549925  4.557306  2.082250  3.316828  5.155588  5.215786  7.135585
 [8]  8.086462  9.088672 10.731221
with(x, x[group == "alpha", ])
  one        two group
1   1  0.5499254 alpha
2   2  2.5573055 alpha
3   3 -0.9177497 alpha
4   4 -0.6831720 alpha

   one        two group
1    1 0.54992537 alpha
2    2 2.55730552 alpha
5    5 0.15558755  beta
7    7 0.13558505  beta
8    8 0.08646245  beta
9    9 0.08867163  beta
10  10 0.73122091  beta
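The last table above contains exactly the rows where two is positive, but the call that produced it is not shown. A minimal sketch of a with() call that yields such a table, assuming x holds the values printed in the split() section (rounded copies) and that the condition was two > 0:

```r
# Reconstructed from printed output; values are rounded copies
x <- data.frame(one = 1:10,
                two = c(0.54992537, 2.55730552, -0.9177497, -0.6831720,
                        0.15558755, -0.78421387, 0.13558505, 0.08646245,
                        0.08867163, 0.73122091),
                group = c(rep("alpha", 4), rep("beta", 6)))

# Inside with(), `two` refers to the column x$two
pos <- with(x, x[two > 0, ])
pos

# Equivalent call without with(), spelling out x$two
identical(pos, x[x$two > 0, ])
```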
Identify and remove duplicated rows with duplicated() and unique()
The duplicated() function, when applied to a data.frame, returns a logical index marking duplicated rows: specifically, rows that duplicate another row further up the data.frame. This means that if rows 20 and 23 are identical, duplicated() will return FALSE for row 20 and TRUE for row 23.
On the other hand, unique() removes duplicate rows from a data.frame.
x <- data.frame(ID = c(203, 808, 909, 707, 808),
                Age = c(23, 44, 33, 42, 44))
duplicated(x)
[1] FALSE FALSE FALSE FALSE  TRUE
unique(x)
   ID Age
1 203  23
2 808  44
3 909  33
4 707  42
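The logical index from duplicated() can also be used directly for row removal, which gives a little more control than unique(). The sketch below reuses the same x and shows three variations: negating the index, flagging the earlier copy with fromLast = TRUE, and checking for duplicates on a single column. Which variation fits depends on your data; these are illustrations, not a recommendation from the original text.

```r
x <- data.frame(ID = c(203, 808, 909, 707, 808),
                Age = c(23, 44, 33, 42, 44))

# Dropping the rows flagged by duplicated() keeps the first copy of each row
x[!duplicated(x), ]

# fromLast = TRUE scans from the bottom, so the earlier copy (row 2) is flagged instead
duplicated(x, fromLast = TRUE)

# Duplicates can also be defined on one column only, here ID
x[!duplicated(x$ID), ]
```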