1. Overview
In this tutorial, we’ll learn about an analytical relational database called DuckDB. We’ll explore its advantages and see how it’s an efficient solution for analytical tasks. After that, we’ll walk through the installation and some fundamental operations.
2. What Is DuckDB?
DuckDB is an in-memory analytical relational database primarily designed for data analysis. It’s considered an analytical database due to its columnar storage nature, which stores data for each column separately. In contrast, a traditional relational database utilizes a row-based storage that stores data row by row.
Advantages of DuckDB include:
- Fast queries – DuckDB utilizes a columnar-vectorized query execution engine that optimizes the data query in large batches.
- SQL Compatibility – DuckDB supports standard SQL queries such as aggregation and window functions, which are well-suited for data analysts familiar with SQL.
- Quick Deployment – DuckDB has minimal external dependencies and operates within our application process without a separate database instance, making deployment and integration straightforward.
- Free – DuckDB is an open-source project which is freely available to all. Its entire source code is accessible on GitHub for exploration and contribution.
3. DuckDB Installation
DuckDB provides various installation options to suit our environment. We’ll demonstrate two common installation methods here.
3.1. Command Line
For Windows users, we can install DuckDB using the WinGet package manager. All we need is to open a command prompt with administrator privileges and execute the following:
winget install DuckDB.cli
On the Mac OS, we can install it using Homebrew:
brew install duckdb
After completing the installation of the DuckDB CLI, brew adds the binary path to the existing environment variable by itself. We can open a new shell session and run the DuckDB CLI by running the following command:
duckdb
3.2. Java
DuckDB can integrate with Java without installing a separate database instance. To get started, we include the following DuckDB JDBC dependency in our pom.xml:
<dependency>
<groupId>org.duckdb</groupId>
<artifactId>duckdb_jdbc</artifactId>
<version>0.10.0</version>
</dependency>
We can load the DuckDB JDBC driver and then create a JDBC connection via the following JDBC URL:
Class.forName("org.duckdb.DuckDBDriver");
Connection conn = DriverManager.getConnection("jdbc:duckdb:");
DuckDB automatically creates an in-memory database instance by default when we connect to it. However, all data persisted in the instance is lost once the DuckDB process finishes. To save our data into the disk, we can append a database name after the colon in the connection URL:
Connection conn = DriverManager.getConnection("jdbc:duckdb:/test_duckdb");
In this example, DuckDB creates a database file called test_duckdb in the root directory. As this is a JDBC library, we may query data by creating an SQL Statement and executing it to obtain the ResultSet. The following is a simple JDBC example to obtain the current date:
Statement stmt = conn.createStatement();
ResultSet rs = stmt.executeQuery("SELECT current_date");
Date currentDate = rs.next() ? rs.getDate(1) : null;
In the subsequent sections of this tutorial, we can execute SQL statements using the same JDBC approach in Java.
4. Data Import
Let’s proceed with importing some data into DuckDB. It can handle a variety of data formats that simplify importing from external data sources.
4.1. CSV Files
CSV is a common data format for storing tabular data. Let’s consider we have the following CSV file containing customer data:
CustomerId,FirstName,LastName,Gender
101,John,Smith,Male
102,Sarah,Jones,Female
...
We could use the SQL function read_csv to import data from a CSV file to the DuckDB table customer:
CREATE TABLE customer AS
SELECT * FROM read_csv('customer.csv')
DuckDB can figure out the schema from the header row of the CSV file. The header names are considered table column names, while the subsequent rows are treated as data rows.
4.2 JSON Files
Another popular way to store and share data is JSON. For instance, let’s take the following product.json JSON file:
[
{
"productId": 1,
"productName":"EZ Curl Bar",
"category": "Sports Equipment"
},
{
"productId": 2,
"productName": "7' Barbell",
"category": "Sports Equipment"
}
]
Similar to CSV import, we can execute an SQL statement to import data to the DuckDB table product:
CREATE TABLE product AS
SELECT * FROM read_json('product.json')
Just like the CSV, DuckDB automatically figures out the schema from the JSON file based on the JSON attribute names.
4.3. INSERT Statements
We may add data to the DuckDB table using insert statements, as it’s a SQL relational database system. The example below illustrates creating a purchase table that defines the relationship between customer and product and populating a few rows of data:
CREATE TABLE purchase(customerId BIGINT, productId BIGINT);
INSERT INTO purchase(customerId, productId) VALUES (101,1);
INSERT INTO purchase(customerId, productId) VALUES (102,1);
INSERT INTO purchase(customerId, productId) VALUES (102,2);
5. Data Query
With data loaded, we’ll now explore querying DuckDB and analysing our data.
5.1. Join Operations
In addition to importing external data to DuckDB, we can use the external data directly. Based on the examples earlier, we’ll utilize the three data sources from the previous section. Now, let’s join these data sources to gather information on customer products.
SELECT C.firstName, C.lastName, P.productName
FROM read_csv('customer.csv') AS C, read_json('product.json') AS P, purchase S
WHERE S.customerId = C.customerId
AND S.productId = P.productId
Upon the execution, we’ll see the following query result showing the customer names and their corresponding product purchases:
firstName | lastName | productName |
---|---|---|
John | Smith | EZ Curl Bar |
Sarah | Jones | 7′ Barbell |
Sarah | Jones | EZ Curl Bar |
5.2. Aggregation Functions
DuckDB provides a rich set of aggregation functions to perform calculations on groups of rows. Let’s explore an example with these functions:
SELECT P.productName, COUNT(*) AS purchaseCount
FROM customer C, product P, purchase S
WHERE S.customerId = C.customerId
AND S.productId = P.productId
GROUP BY P.productName
ORDER BY COUNT(*) DESC
The query counts the number of purchases of each product and order them by the purchase count in descending order:
productName | purchaseCount |
---|---|
EZ Curl Bar | 2 |
7′ Barbell | 1 |
6. Data Export
In data analysis tasks, we often require exporting our aggregated data to other applications for further analysis.
Let’s walk through the process of exporting our data from DuckDB in various formats. In our example, we first create a database view for easy illustration of the exporting later:
CREATE VIEW purchase_view AS
SELECT P.productName, COUNT(*) AS purchaseCount
FROM customer C, product P, purchase S
WHERE S.customerId = C.customerId
AND S.productId = P.productId
GROUP BY P.productName
ORDER BY COUNT(*) DESC;
6.1. CSV Files
It’s straightforward to export data to a CSV file in DuckDB. We can execute the following simple SQL to copy all data from our database view purchase_view to a CSV file located in the root directory:
COPY purchase_view TO '/output.csv'
6.2. JSON Files
To export data to a JSON file, we need to include an additional option array to specify writing the data as a JSON array. This ensures that our exported JSON file is in an appropriate structure:
COPY (SELECT * FROM purchase_view WHERE purchaseCount > 1) TO '/output.json' (array true);
Instead of exporting all data, we can copy partial results based on the condition of the select query.
7. Conclusion
In this article, we learned about the DuckDB database and its advantages. We also looked at some fundamental operations via illustrative examples.
As always, all the code is available over on GitHub.