Leveraging DuckDB to convert csv into xlsx

When reading the headline you probably think: “What the hell is he doing? Why doesn’t he just import the csv file into Excel and manage it there?”

Here are the reasons:

  • I don’t like Excel.
  • My source csv file needs to be converted before generating the Excel file.
  • I wanted to explore DuckDB.

TL;DR: Go to the source files at the very bottom of this blog post.

Business Use Case

Before I describe the technical details, let me briefly describe my business use case.

I record my working times with the really great tool prima:Time. But as with most tools, there are things you miss. One thing I miss in prima:Time is the ability to export my data in a format my customers expect.

My current customers request an Excel file in a format (template) they have defined.

Of course prima:Time can export to csv and even to xls, but those files, or to be exact the columns of those files, don’t have the format my customers expect.

Because I stumbled across DuckDB recently and thought it would be worthwhile to explore it, I decided to build a solution for my business case with it.

About DuckDB

I first heard of DuckDB from Golo Roden in his YouTube vlog. That made me curious.

“DuckDB is a fast in-process analytical database”, they say on their homepage. Sounds like a lightweight version of SAP HANA.

In-process means that it does not run in an independent database server but in the same process as your application. If you stop your application, the DuckDB database is no longer available either. In this respect DuckDB behaves like SQLite.
Nevertheless you can of course save the data to a single database file.
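
For example, the DuckDB CLI can be started purely in memory or against a database file (the file name below is just a placeholder, not one from my project):

duckdb :memory:           # in-memory only, everything is gone when the process exits
duckdb my_timesheets.db   # stores all tables in the single file my_timesheets.db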

It integrates with a lot of external data sources like PostgreSQL, MySQL, AWS, Azure, …
It is highly portable, which means that it runs on almost all operating systems. Furthermore it can handle up to terabytes of data and hundreds of CPUs.
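
Just as a small sketch of what such an integration looks like (connection string and table name are placeholders, not from my setup), the postgres extension lets you attach a PostgreSQL database and query it directly:

INSTALL postgres;
LOAD postgres;
ATTACH 'dbname=mydb user=me host=localhost' AS pg (TYPE POSTGRES);
SELECT * FROM pg.my_table;   -- my_table is just a placeholder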

It is feature rich. Besides extensive SQL support it also has a large function library.

It has also a lot of Client APIs. Just to name a few:

  • R
  • Java
  • C++
  • Python
  • Node.js
  • CLI

Yes, there is also a CLI and that’s what I have used for my use case.

As an analytical database it’s also supposed to be very fast.

Last but not least, DuckDB is available under the MIT license, which means it is free.

DuckDB and CSV files

One of the integrations that caught my attention was the one with csv files. You can easily import csv files into a database table.

Thanks to its large function library you can even convert the csv data while importing it. After the import you have a database table with field types like DATE, TIME, INTEGER, DOUBLE, …

Having such a table you can run all sorts of database queries.

I imported all my timesheet data into my DuckDB database and converted the data during the import. It’s not big, but it has more than 5000 records.

I’m now able to run additional evaluations that I miss in prima:Time.
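
Just to illustrate such an evaluation (using the ts_month table that I create further down in this post; the same works for a table holding the full history), a query like this sums the recorded hours per customer and project:

SELECT KUNDE, PROJEKTCODE, sum(DAUER_STD) AS STUNDEN
FROM ts_month
GROUP BY ALL
ORDER BY STUNDEN DESC;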

My business use case in practice

Let’s come to my use case.

Import my csv file into a database table

From prima:Time I exported a csv file like this:

DATUM,VON,BIS,DAUER,DAUER (STUNDEN),BENUTZER,KUNDE,PROJEKTCODE,PROJEKT,AUFGABE,TÄTIGKEIT,BESCHREIBUNG,INTERNE BEMERKUNG,ABRECHENBARE ZEIT
03.06.2024,09:00,09:30,00:30,"0,50",Tammen Helmut,Customer1,Projekt 1,Customer1 - Projekt 1,TPM (Techn. Projekt Management) ,Remote,"Abstimmung mit Peter Urlaubsübergabe",,00:30
10.06.2024,09:30,11:30,02:00,"2,00",Tammen Helmut,Customer1,Projekt 1,Customer1 - Projekt 1,TPM (Techn. Projekt Management) ,Remote,"Planning Sprint 8",PRJ_1-460,02:00

You can see that prima:Time exported a lot of columns. Let us have a deeper look at some of them:

  • VON (from). This is a time field, but it is delivered in the format hh:mm, e.g. 09:30.
  • BIS (until). This is also a time field.
  • “DAUER (STUNDEN)” (duration in hours). This is a double field, delivered in the format xx,yy.

The TIME fields are delivered in the format hh:mm, but DuckDB needs the format hh:mm:ss to automatically recognize them as TIME fields.

The duration field “DAUER (STUNDEN)” is a decimal field formatted in my local (German) format. In Germany we use a “,” as decimal separator, whereas databases and programming languages use a “.”.

DuckDB offers the possibility to convert those fields while importing a csv file. Here is my SQL that imports the file, converts some fields and creates a database table.

CREATE TEMP TABLE ts_month AS 
  (SELECT DATUM, 
          cast((von || ':00') as TIME) as VON, 
          cast((bis || ':00') as TIME) as BIS, 
          cast((DAUER || ':00') as TIME) as DAUER, 
          cast(replace("DAUER (STUNDEN)", ',', '.') as DOUBLE) as DAUER_STD,
          BENUTZER, KUNDE, PROJEKTCODE, AUFGABE, TÄTIGKEIT AS TAETIGKEIT, BESCHREIBUNG, "INTERNE BEMERKUNG" as BEMERKUNG, 
          cast(("ABRECHENBARE ZEIT" || ':00') as TIME) as ABR_ZEIT
          FROM read_csv('current_month.csv'));

Let me explain the details.

CREATE TEMP TABLE creates a temporary database table. For my use case I don’t need a permanent table. Temporary tables are deleted by DuckDB as soon as the process that runs DuckDB exits, even if you save the DuckDB data to a permanent file.

The following SELECT ... FROM read_csv('current_month.csv') statement imports my csv file into the created database table.

The interesting things are the casts that I call in the SELECT statement for some of the columns. These casts convert the data from datatype string, which is the default when importing csv data, into a datatype that can be evaluated in SQL queries.

The cast of the TIME field von, for example, is done by cast((von || ':00') as TIME) as VON. Because I get e.g. “09:00” from the csv file I first need to append :00 to it. Then I can cast the result to a real TIME field and finally name the field VON. Without the trailing as VON the field name would be cast((von || ':00') as TIME), which is of course an ugly field name.
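
You can try this conversion in isolation in the DuckDB CLI with a literal value:

SELECT cast(('09:00' || ':00') as TIME) as VON;   -- returns 09:00:00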

To cast the “DAUER (STUNDEN)” field to a numeric (DOUBLE) type I use the expression cast(replace("DAUER (STUNDEN)", ',', '.') as DOUBLE) as DAUER_STD.
Here I first replace the “,” with a “.” and then cast the result to DOUBLE. The name of the field is DAUER_STD.
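
Again, this can be checked with a literal value, and a DESCRIBE on the new table confirms that the columns got the intended types:

SELECT cast(replace('0,50', ',', '.') as DOUBLE) as DAUER_STD;   -- returns 0.5
DESCRIBE ts_month;   -- lists every column of ts_month together with its type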

Export the database table to Excel

Now that I have a database table with fields of type TIME and DOUBLE I can export this table to an Excel file.

There is a DuckDB extension, spatial, that helps me with this. It bundles the GDAL library, which is why the COPY statement below uses FORMAT GDAL. So at the top of my SQL script I first install and load this extension.

install spatial;
load spatial;

The creation of the Excel file is then done by this line:

COPY (select projektcode, datum, von, bis, dauer, beschreibung, bemerkung from ts_month order by projektcode, datum, von) TO 'month.xlsx' WITH (FORMAT GDAL, DRIVER 'xlsx');

I select only a few of my table fields with the select inside the parentheses and then export the result to a file with the name month.xlsx.
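
As a quick sanity check you can even read the generated file back into DuckDB with the spatial extension (this is just something I do occasionally, it is not part of the monthly script):

SELECT * FROM st_read('month.xlsx');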

Automatically create and open the xlsx file with a spreadsheet application

Because I’m a lazy guy I don’t want to open the created xlsx file manually in a spreadsheet application (LibreOffice in my case) to do the copy-paste from it to the Excel file of my customer.
Note: the copy-paste is necessary because my customer’s Excel file has some additional columns.

To automate the complete process of creating the xlsx file from the csv file and opening month.xlsx with my spreadsheet application I created this little bash script.

#! /bin/bash
duckdb :memory: -s ".read timesheet_month.sql"
libreoffice24.2 month.xlsx &
wait
rm month.xlsx

It runs my SQL script to create the database table and export it as an xlsx file. Then it starts LibreOffice with the created file in the background and waits for it.
When I close LibreOffice the script removes the created xlsx file because I no longer need it.
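
If you want the script to fail loudly when something goes wrong, a slightly more defensive variant could look like this (my own sketch, not the script I actually use):

#! /bin/bash
set -euo pipefail                               # abort on errors and on unset variables
duckdb :memory: -s ".read timesheet_month.sql"
[ -f month.xlsx ] || { echo "month.xlsx was not created" >&2; exit 1; }
libreoffice24.2 month.xlsx                      # blocks until LibreOffice is closed
rm month.xlsx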

Conclusion of my business case

Before I had this tiny script I had to, once a month, open my personal timesheet application, open the customer’s Excel file in parallel and copy-paste every single timesheet entry from my timesheet app into the customer’s Excel file. This was an annoying and error-prone process.

Now I simply run my script, open the customer’s Excel file in parallel and copy over the entire columns that are contained in both files.

The complete sources of my scripts

timesheet_month.sql SQL Script

install spatial;
load spatial;
CREATE TEMP TABLE ts_month AS 
  (SELECT DATUM, 
          cast((von || ':00') as TIME) as VON, 
          cast((bis || ':00') as TIME) as BIS, 
          cast((DAUER || ':00') as TIME) as DAUER, 
          cast(replace("DAUER (STUNDEN)", ',', '.') as DOUBLE) as DAUER_STD,
          BENUTZER, KUNDE, PROJEKTCODE, AUFGABE, TÄTIGKEIT AS TAETIGKEIT, BESCHREIBUNG, "INTERNE BEMERKUNG" as BEMERKUNG, 
          cast(("ABRECHENBARE ZEIT" || ':00') as TIME) as ABR_ZEIT
          FROM read_csv('current_month.csv'));
COPY (select projektcode, datum, von, bis, dauer, beschreibung, bemerkung from ts_month order by projektcode, datum, von) TO 'month.xlsx' WITH (FORMAT GDAL, DRIVER 'xlsx');

run-month bash script

#! /bin/bash
duckdb :memory: -s ".read timesheet_month.sql"
libreoffice24.2 month.xlsx &
wait
rm month.xlsx
