40  DuckDB Basics

Note

This section is a work in progress.

Loading required package: DBI

40.1 Installation

40.2 DuckDB API

DuckDB uses the R Database Interface (DBI). Learn more about using DBI here.

40.3 Setup Connection

By default, duckdb()’s dbdir argument defaults to DBDIR_MEMORY, which creates a non-persistent, in-memory object.

con <- dbConnect(duckdb::duckdb())

40.4 Settings

dbExecute(con, "PRAGMA enable_progress_bar;")
[1] 0

40.5 Read CSV

You can directly read into data.frame, which may be best for smaller datasets:

ir <- dbGetQuery(con, 
    "SELECT * FROM read_csv_auto('./iris.csv');"
)
head(ir)
  Sepal_Length Sepal_Width Petal_Length Petal_Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

For larger datasets, you can create a DuckDB table that can be queried without loading the dataset into memory:

dbExecute(con,
    "CREATE TABLE iris AS SELECT * FROM './iris.csv';"
)
[1] 150

40.5.1 List tables

dbGetQuery(con, "PRAGMA show_tables;")
  name
1 iris

Using the R API:

[1] "iris"

40.6 Filter

dbGetQuery(con,
    "SELECT * FROM iris WHERE Species in ('setosa', 'versicolor')") |> head()
  Sepal_Length Sepal_Width Petal_Length Petal_Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa
dbExecute(con,
    "
    CREATE TABLE ir AS SELECT * FROM 
    read_csv_auto('/Users/egenn/icloud/Data/iris.csv', 
    delim=',', header=True);
    "
)
[1] 150

40.6.1 Fetch data to data.frame

ir = dbGetQuery(con, "SELECT * from iris")
head(ir)
  Sepal_Length Sepal_Width Petal_Length Petal_Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa
class(ir)
[1] "data.frame"

40.6.2 Fetch data to data.table:

ir = dbGetQuery(con, "SELECT * from iris")
class(ir)
[1] "data.frame"
head(ir)
  Sepal_Length Sepal_Width Petal_Length Petal_Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

40.7 Close connection

When you are done working with the database, you should close the connection:

40.8 Resources