Introducting clj-pgcopy

On a recent project, I was working on an ETL process, where the transform step happened in Clojure. After all of the data had been shaped the way we wanted, we imported the data into a Postgresql database. As with our typical pattern, we used the clojure.java.jdbc/insert-multi! function to insert rows in batches. But the performance for the database insertion wasn't quite what we were hoping for.

Although the overhead from org.clojure.jdbc is negligible in most scenarios, and certainly in virtually all CRUD-like workflows, it can become significant with batch insertions at scale: all of the work of converting data types to String-y parameters adds up.

Having previously used the COPY syntax for importing textual data like CSVs, I found that Postgresql's COPY also has a binary variant that allows direct insertion into tables. Because the binary on-the-wire representation of Postgresql's data types is virtually the same as the storage representation, there is significantly less overhead for importing using COPY ... FROM STDIN WITH BINARY.

So, there are immediately two benefits to using the binary COPY approach:

  1. The binary on-the-wire representation weighs less than its textual, INSERT-based counterpart.
  2. The Postgres server has less work to do to extract data from a binary COPY than it does with INSERT statement because the binary representation so closely mirrors the internal storage representation.

There may be other benefits I'm unaware of, but those two things alone made me curious if there was a way to get the COPY goodness with idiomatic Clojure.

Both main postgresql jdbc drivers, https://jdbc.postgresql.org/ and https://impossibl.github.io/pgjdbc-ng/, support putting a connection into COPY mode, after which data can be streamed into or out of postgresql. However, both drivers don't do anything to help you get the data into the raw format that postgresql expects. For the text (default) format, that's basically a tab-delimited or CSV payload.

The binary COPY format, however, is a different beast entirely. Each postgresql data type has a very specific binary representation. Again, while the postgresql jdbc drivers provide a way to stream in or stream out COPY data, the data within those streams is left entirely up to you. The best library I found to do this kind of serialization was PgBulkInsert, which works with JDBC, but has a very Java-centric API.

That's where clj-pgcopy comes in. It maps common data types to their binary represenation, with the hope that using the binary COPY style of importing data is as easy to use as clojure.java.jdbc methods. The primary API of clj-pgcopy is clj-pgcopy.core/copy-into!, which aims to be drop-in compatible with most places where clojure.java.jdbc/insert-multi! is being used.

Example usage

For example, let's say that we have a table of product listings, looking like this:

create table inventory (
  guid uuid primary key,
  created_at timestamptz not null,
  active boolean not null default false,
  price decimal(8,2),
  average_rating float4
);

And some data we'd like to import that looks like this:

(def data
  [{:guid #uuid "d44c2977-0a9f-4d12-88d2-7d85e07ce1e2",
    :created_at #inst "2019-12-01T23:37:33.701-00:00",
    :active true,
    :price 998.49M,
    :average_rating 3.3}
   {:guid #uuid "220603d4-c1b9-4ea4-b5f4-c61a38e9f515",
    :created_at #inst "2019-12-01T16:22:35.826-00:00",
    :active false,
    :price 847.90M,
    :average_rating 2.1}])

A typical way to import data with clojure.java.jdbc/insert-multi! would look like this:

(let [cols [:guid :created_at :active :price :average_rating]
      ->tuple (apply juxt cols)]
  (jdbc/with-db-connection [conn conn-spec]
    (jdbc/insert-multi! conn :inventory cols (map ->tuple data))))

To use clj-pgcopy, the only thing that needs to change is adding a require, and changing the callsite:

(require '[clj-pgcopy.core :as copy])

(let [cols [:guid :created_at :active :price :average_rating]
      ->tuple (apply juxt cols)]
  (jdbc/with-db-connection [conn conn-spec]
    (copy/copy-into! (:connection conn) :inventory cols (map ->tuple data))))

Please note that copy-into! expects a “raw” JDBC connection, not a Clojure map wrapping one, like clojure.java.jdbc uses.

Performance

Using the same table as our above example, I did some non-definitive benchmarking. For most typical use-cases, clj-pgcopy should be a little more than twice as fast as insert-multi:

tuples batch size insert-multi clj-pgcopy
10000 100 218.4 ms 107.2 ms
10000 500 205.2 ms 90.35 ms
50000 100 1.030 sec 422.3 ms
50000 500 1.272 sec 382.3 ms
100000 100 2.051 sec 1.005 sec

For measurement methodology, or to run your own benchmarks, see the benchmark namespace of the clj-pgcopy repository.

Type Mappings

Out of the box, clj-pgcopy supports many data types, with what I consider reasonable default mappings.

Basic type mapping

JVM type Postgres type
Short int2 (aka smallint)
Integer int4 (aka integer)
Long int8 (aka bigint)
Float float4 (aka real)
Double float8 (aka double presicion)
BigDecimal numeric/decimal
Boolean boolean
String text/varchar/char
java.util.UUID uuid

Date-related mappings

JVM type Postgres type
java.sql.Date date
java.time.LocalDate date
java.util.Date timestamp[tz]
java.sql.Timestamp timestamp[tz]
java.time.Instant timestamp[tz]
java.time.ZonedDateTime timestamp[tz]
java.time.OffsetDatetime timestamp[tz]
org.postgres.util.PGInterval interval

Geometric mappings

JVM type Postgres type
org.postgres.geometric.PGpoint point
org.postgres.geometric.PGline line
org.postgres.geometric.PGpath path
org.postgres.geometric.PGbox box
org.postgres.geometric.PGcircle circle
org.postgres.geometric.PGpolygon polygon

Arrays

Impemented for the following JVM-typed arrays for:

JVM type Postgres type
int[] int4[] (aka integer[])
long[] int8[] (aka bigint[])
float[] float4[] (aka real[])
double[] float8[] (aka double precision[])
byte[] bytea
String[] text[] (or varchar)
java.util.UUID[] uuid[]

Currently, only 1-dimensional Postgres arrays are supported.

jsonb

Things that are String-like, or serialized in string form, should work using the String -> text mapping. An exception is the jsonb postgres type, because the binary format requires a version signifier. Wrapping a JSON string in a clj-pgcopy.core/JsonB handles that.

Note that this library does not serialize to JSON, it simply wraps a valid JSON string such that it can actually be used.

What about my weird type?

These type mappings are implented using a clojure protocol, namely clj-pgcopy.core/IPGBinaryWrite. In order to add support for another type, just extend that protocol with an implementation. You can and should use the implementation of other types in the library for guidance on doing so.

Conclusion

Go forth and import!