On a recent project, I was working on an ETL process where the transform step happened in Clojure. After all of the data had been shaped the way we wanted, we imported it into a PostgreSQL database. Following our typical pattern, we used the `clojure.java.jdbc/insert-multi!` function to insert rows in batches. But the performance of the database insertion wasn’t quite what we were hoping for.
Although the overhead from `clojure.java.jdbc` is negligible in most scenarios, and certainly in virtually all CRUD-like workflows, it can become significant with batch insertions at scale: all of the work of converting data types to String-y parameters adds up.
Having previously used the `COPY` syntax for importing textual data like CSVs, I found that PostgreSQL’s `COPY` also has a binary variant that allows direct insertion into tables. Because the binary on-the-wire representation of PostgreSQL’s data types is virtually the same as the storage representation, there is significantly less overhead when importing with `COPY ... FROM STDIN WITH BINARY`.
So, there are immediately two benefits to using the binary `COPY` approach:

- The binary on-the-wire representation weighs less than its textual, `INSERT`-based counterpart.
- The Postgres server has less work to do to extract data from a binary `COPY` than it does from an `INSERT` statement, because the binary representation so closely mirrors the internal storage representation.
There may be other benefits I’m unaware of, but those two things alone made me curious whether there was a way to get the `COPY` goodness with idiomatic Clojure.
Both of the main PostgreSQL JDBC drivers, https://jdbc.postgresql.org/ and https://impossibl.github.io/pgjdbc-ng/, support putting a connection into COPY mode, after which data can be streamed into or out of PostgreSQL. However, neither driver does anything to help you get the data into the raw format that PostgreSQL expects. For the text (default) format, that’s basically a tab-delimited or CSV payload.
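To make that concrete, here is a minimal sketch (not from the library) of driving the stock pgjdbc driver’s COPY mode from Clojure with a text-format payload; the `items` table and its columns are hypothetical:

```clojure
(import '(org.postgresql PGConnection))

;; `raw-conn` is assumed to be a plain java.sql.Connection from the stock
;; pgjdbc driver; unwrap it to reach the driver's COPY API.
(let [pg-conn  (.unwrap raw-conn PGConnection)
      copy-api (.getCopyAPI pg-conn)               ; org.postgresql.copy.CopyManager
      payload  "1\tfirst item\n2\tsecond item\n"]  ; tab-delimited rows, text format
  (.copyIn copy-api
           "COPY items (id, name) FROM STDIN"
           (java.io.StringReader. payload)))
```

Getting from Clojure data to that payload is exactly the part the drivers leave up to you.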
The binary COPY format, however, is a different beast entirely. Each PostgreSQL data type has a very specific binary representation. Again, while the PostgreSQL JDBC drivers provide a way to stream COPY data in or out, the data within those streams is left entirely up to you. The best library I found to do this kind of serialization was PgBulkInsert, which works with JDBC but has a very Java-centric API.
That’s where clj-pgcopy comes in. It maps common data types to their binary representation, with the hope that using the binary `COPY` style of importing data is as easy as using `clojure.java.jdbc`. The primary API of clj-pgcopy is `clj-pgcopy.core/copy-into!`, which aims to be drop-in compatible with most places where `clojure.java.jdbc/insert-multi!` is being used.
Example usage
For example, let’s say that we have a table of product listings, looking like this:
create table inventory (
guid uuid primary key,
created_at timestamptz not null,
active boolean not null default false,
price decimal(8,2),
average_rating float4
);
And some data we’d like to import that looks like this:
(def data
[{:guid #uuid "d44c2977-0a9f-4d12-88d2-7d85e07ce1e2",
:created_at #inst "2019-12-01T23:37:33.701-00:00",
:active true,
:price 998.49M,
:average_rating 3.3}
{:guid #uuid "220603d4-c1b9-4ea4-b5f4-c61a38e9f515",
:created_at #inst "2019-12-01T16:22:35.826-00:00",
:active false,
:price 847.90M,
:average_rating 2.1}])
A typical way to import data with `clojure.java.jdbc/insert-multi!` would look like this:
(require '[clojure.java.jdbc :as jdbc])

(let [cols [:guid :created_at :active :price :average_rating]
      ->tuple (apply juxt cols)]
  (jdbc/with-db-connection [conn conn-spec]
    (jdbc/insert-multi! conn :inventory cols (map ->tuple data))))
To use `clj-pgcopy`, the only thing that needs to change is adding a `require` and changing the call site:
(require '[clj-pgcopy.core :as copy])

(let [cols [:guid :created_at :active :price :average_rating]
      ->tuple (apply juxt cols)]
  (jdbc/with-db-connection [conn conn-spec]
    (copy/copy-into! (:connection conn) :inventory cols (map ->tuple data))))
Please note that `copy-into!` expects a “raw” JDBC connection, not a Clojure map wrapping one like `clojure.java.jdbc` uses.
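If you aren’t already inside `jdbc/with-db-connection`, one way to get such a connection is `clojure.java.jdbc/get-connection`; a small sketch, reusing `conn-spec`, `cols`, `->tuple`, and `data` from above:

```clojure
;; get-connection returns a plain java.sql.Connection, which is what
;; copy-into! expects; with-open closes it when the body finishes.
(with-open [raw-conn (jdbc/get-connection conn-spec)]
  (copy/copy-into! raw-conn :inventory cols (map ->tuple data)))
```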
Performance
Using the same table as in the example above, I did some non-definitive benchmarking. For most typical use cases, clj-pgcopy should be a little more than twice as fast as `insert-multi!`:
| tuples | batch size | insert-multi! | clj-pgcopy |
|---|---|---|---|
| 10000 | 100 | 218.4 ms | 107.2 ms |
| 10000 | 500 | 205.2 ms | 90.35 ms |
| 50000 | 100 | 1.030 sec | 422.3 ms |
| 50000 | 500 | 1.272 sec | 382.3 ms |
| 100000 | 100 | 2.051 sec | 1.005 sec |
For measurement methodology, or to run your own benchmarks, see the benchmark namespace of the clj-pgcopy repository.
Type Mappings
Out of the box, clj-pgcopy supports many data types, with what I consider reasonable default mappings.
Basic type mapping
| JVM type | Postgres type |
|---|---|
| Short | int2 (aka smallint) |
| Integer | int4 (aka integer) |
| Long | int8 (aka bigint) |
| Float | float4 (aka real) |
| Double | float8 (aka double precision) |
| BigDecimal | numeric/decimal |
| Boolean | boolean |
| String | text/varchar/char |
| java.util.UUID | uuid |
Date-related mappings
| JVM type | Postgres type |
|---|---|
| java.sql.Date | date |
| java.time.LocalDate | date |
| java.util.Date | timestamp[tz] |
| java.sql.Timestamp | timestamp[tz] |
| java.time.Instant | timestamp[tz] |
| java.time.ZonedDateTime | timestamp[tz] |
| java.time.OffsetDateTime | timestamp[tz] |
| org.postgresql.util.PGInterval | interval |
Geometric mappings
| JVM type | Postgres type |
|---|---|
| org.postgresql.geometric.PGpoint | point |
| org.postgresql.geometric.PGline | line |
| org.postgresql.geometric.PGpath | path |
| org.postgresql.geometric.PGbox | box |
| org.postgresql.geometric.PGcircle | circle |
| org.postgresql.geometric.PGpolygon | polygon |
Arrays
Implemented for the following JVM-typed arrays:
| JVM type | Postgres type |
|---|---|
| int[] | int4[] (aka integer[]) |
| long[] | int8[] (aka bigint[]) |
| float[] | float4[] (aka real[]) |
| double[] | float8[] (aka double precision[]) |
| byte[] | bytea |
| String[] | text[] (or varchar[]) |
| java.util.UUID[] | uuid[] |
Currently, only 1-dimensional Postgres arrays are supported.
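As an illustration (not taken from the library’s docs), array columns can be fed Java arrays built with the usual Clojure constructors. The `measurements` table and the open JDBC connection `raw-conn` here are hypothetical:

```clojure
;; hypothetical table:
;;   create table measurements (id int8, samples float8[], tags text[]);
(copy/copy-into! raw-conn :measurements
                 [:id :samples :tags]
                 [[1 (double-array [0.5 1.25 2.0]) (into-array String ["a" "b"])]
                  [2 (double-array [3.5])          (into-array String ["c"])]])
```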
jsonb
Things that are String-like, or serialized in string form, should work using the String -> text mapping. An exception is the `jsonb` Postgres type, because its binary format requires a version signifier. Wrapping a JSON string in a `clj-pgcopy.core/JsonB` handles that.
Note that this library does not serialize to JSON; it simply wraps a valid JSON string so that it can actually be used.
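A sketch of what that might look like, assuming `JsonB` is a record/type with the generated `->JsonB` constructor (check the library for the exact constructor) and a hypothetical `events` table:

```clojure
;; hypothetical table: create table events (id int8, payload jsonb);
;; ->JsonB is assumed here; it only wraps an already-serialized JSON string.
(copy/copy-into! raw-conn :events
                 [:id :payload]
                 [[1 (copy/->JsonB "{\"kind\":\"signup\",\"ok\":true}")]])
```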
What about my weird type?
These type mappings are implemented using a Clojure protocol, namely `clj-pgcopy.core/IPGBinaryWrite`. To add support for another type, just extend that protocol with an implementation. You can and should use the library’s implementations for the other types as guidance for doing so.
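For example, here is a sketch of extending the protocol for `java.time.LocalTime` mapped to the Postgres `time` type. The protocol name comes from the library, but the method name and signature below are hypothetical, so check the library’s built-in implementations for the real ones:

```clojure
;; NOTE: `write-to` and its argument list are placeholders; the real protocol
;; method may differ.
(extend-protocol copy/IPGBinaryWrite
  java.time.LocalTime
  (write-to [t ^java.io.DataOutputStream out]
    ;; Postgres encodes `time` as an 8-byte count of microseconds since midnight
    (.writeInt out 8)                               ; field length in bytes
    (.writeLong out (quot (.toNanoOfDay t) 1000)))) ; nanoseconds -> microseconds
```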
Conclusion
Go forth and import!