On a recent project, I was working on an ETL process where the transform step happened in Clojure. After all of the data had been shaped the way we wanted, we imported it into a PostgreSQL database. Following our typical pattern, we used the `clojure.java.jdbc/insert-multi!` function to insert rows in batches. But the performance of the database insertion wasn’t quite what we were hoping for.
Although the overhead from `clojure.java.jdbc` is negligible in most scenarios, and certainly in virtually all CRUD-like workflows, it can become significant with batch insertions at scale: all of the work of converting data types to String-y parameters adds up.
Having previously used the `COPY` syntax for importing textual data like CSVs, I found that PostgreSQL’s `COPY` also has a binary variant that allows direct insertion into tables. Because the binary on-the-wire representation of PostgreSQL’s data types is virtually the same as the storage representation, there is significantly less overhead when importing with `COPY ... FROM STDIN WITH BINARY`.
So, there are immediately two benefits to using the binary `COPY` approach:

- The binary on-the-wire representation weighs less than its textual, `INSERT`-based counterpart.
- The Postgres server has less work to do to extract data from a binary `COPY` than it does from an `INSERT` statement, because the binary representation so closely mirrors the internal storage representation.
There may be other benefits I’m unaware of, but those two things alone made me curious whether there was a way to get the `COPY` goodness with idiomatic Clojure.
Both of the main PostgreSQL JDBC drivers, https://jdbc.postgresql.org/ and https://impossibl.github.io/pgjdbc-ng/, support putting a connection into COPY mode, after which data can be streamed into or out of PostgreSQL. However, neither driver does anything to help you get the data into the raw format that PostgreSQL expects. For the text (default) format, that’s basically a tab-delimited or CSV payload.
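To make that concrete, here is a minimal sketch (not from the library) of driving the stock pgjdbc driver’s COPY mode from Clojure with a text-format payload; the `items` table and its columns are hypothetical:

```clojure
(import '(org.postgresql PGConnection))

;; `raw-conn` is assumed to be a plain java.sql.Connection from the stock
;; pgjdbc driver; unwrap it to reach the driver's COPY API.
(let [pg-conn  (.unwrap raw-conn PGConnection)
      copy-api (.getCopyAPI pg-conn)               ; org.postgresql.copy.CopyManager
      payload  "1\tfirst item\n2\tsecond item\n"]  ; tab-delimited rows, text format
  (.copyIn copy-api
           "COPY items (id, name) FROM STDIN"
           (java.io.StringReader. payload)))
```

Getting from Clojure data to that payload is exactly the part the drivers leave up to you.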
The binary COPY format, however, is a different beast entirely. Each PostgreSQL data type has a very specific binary representation. Again, while the PostgreSQL JDBC drivers provide a way to stream COPY data in or out, the data within those streams is left entirely up to you. The best library I found to do this kind of serialization was PgBulkInsert, which works with JDBC but has a very Java-centric API.
That’s where clj-pgcopy comes in. It maps common data types to their binary representation, with the hope that using the binary `COPY` style of importing data is as easy as using `clojure.java.jdbc`. The primary API of clj-pgcopy is `clj-pgcopy.core/copy-into!`, which aims to be drop-in compatible with most places where `clojure.java.jdbc/insert-multi!` is being used.
Example usage
For example, let’s say that we have a table of product listings, looking like this:
create table inventory (
guid uuid primary key,
created_at timestamptz not null,
active boolean not null default false,
price decimal(8,2),
average_rating float4
);
And some data we’d like to import that looks like this:
(def data
[{:guid #uuid "d44c2977-0a9f-4d12-88d2-7d85e07ce1e2",
:created_at #inst "2019-12-01T23:37:33.701-00:00",
:active true,
:price 998.49M,
:average_rating 3.3}
{:guid #uuid "220603d4-c1b9-4ea4-b5f4-c61a38e9f515",
:created_at #inst "2019-12-01T16:22:35.826-00:00",
:active false,
:price 847.90M,
:average_rating 2.1}])
A typical way to import data with `clojure.java.jdbc/insert-multi!` would look like this:
(require '[clojure.java.jdbc :as jdbc])

(let [cols [:guid :created_at :active :price :average_rating]
      ->tuple (apply juxt cols)]
  (jdbc/with-db-connection [conn conn-spec]
    (jdbc/insert-multi! conn :inventory cols (map ->tuple data))))
To use `clj-pgcopy`, the only thing that needs to change is adding a `require` and changing the call site:
(require '[clj-pgcopy.core :as copy])

(let [cols [:guid :created_at :active :price :average_rating]
      ->tuple (apply juxt cols)]
  (jdbc/with-db-connection [conn conn-spec]
    (copy/copy-into! (:connection conn) :inventory cols (map ->tuple data))))
Please note that `copy-into!` expects a “raw” JDBC connection, not a Clojure map wrapping one like `clojure.java.jdbc` uses.
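If you aren’t already inside `jdbc/with-db-connection`, one way to get such a connection is `clojure.java.jdbc/get-connection`; a small sketch, reusing `conn-spec`, `cols`, `->tuple`, and `data` from above:

```clojure
;; get-connection returns a plain java.sql.Connection, which is what
;; copy-into! expects; with-open closes it when the body finishes.
(with-open [raw-conn (jdbc/get-connection conn-spec)]
  (copy/copy-into! raw-conn :inventory cols (map ->tuple data)))
```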
Performance
Using the same table as in the example above, I did some non-definitive benchmarking. For most typical use cases, clj-pgcopy should be a little more than twice as fast as `insert-multi!`:
| tuples | batch size | insert-multi! | clj-pgcopy |
|---|---|---|---|
| 10000 | 100 | 218.4 ms | 107.2 ms |
| 10000 | 500 | 205.2 ms | 90.35 ms |
| 50000 | 100 | 1.030 sec | 422.3 ms |
| 50000 | 500 | 1.272 sec | 382.3 ms |
| 100000 | 100 | 2.051 sec | 1.005 sec |
For measurement methodology, or to run your own benchmarks, see the benchmark namespace of the clj-pgcopy repository.
Type Mappings
Out of the box, clj-pgcopy supports many data types, with what I consider reasonable default mappings.
Basic type mapping
| JVM type | Postgres type |
|---|---|
| Short | int2 (aka smallint) |
| Integer | int4 (aka integer) |
| Long | int8 (aka bigint) |
| Float | float4 (aka real) |
| Double | float8 (aka double precision) |
| BigDecimal | numeric/decimal |
| Boolean | boolean |
| String | text/varchar/char |
| java.util.UUID | uuid |
Date-related mappings
| JVM type | Postgres type |
|---|---|
| java.sql.Date | date |
| java.time.LocalDate | date |
| java.util.Date | timestamp[tz] |
| java.sql.Timestamp | timestamp[tz] |
| java.time.Instant | timestamp[tz] |
| java.time.ZonedDateTime | timestamp[tz] |
| java.time.OffsetDateTime | timestamp[tz] |
| org.postgresql.util.PGInterval | interval |
Geometric mappings
| JVM type | Postgres type |
|---|---|
| org.postgresql.geometric.PGpoint | point |
| org.postgresql.geometric.PGline | line |
| org.postgresql.geometric.PGpath | path |
| org.postgresql.geometric.PGbox | box |
| org.postgresql.geometric.PGcircle | circle |
| org.postgresql.geometric.PGpolygon | polygon |
Arrays
Implemented for the following JVM-typed arrays:
| JVM type | Postgres type |
|---|---|
| int[] | int4[] (aka integer[]) |
| long[] | int8[] (aka bigint[]) |
| float[] | float4[] (aka real[]) |
| double[] | float8[] (aka double precision[]) |
| byte[] | bytea |
| String[] | text[] (or varchar[]) |
| java.util.UUID[] | uuid[] |
Currently, only 1-dimensional Postgres arrays are supported.
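As an illustration (not taken from the library’s docs), array columns can be fed Java arrays built with the usual Clojure constructors. The `measurements` table and the open JDBC connection `raw-conn` here are hypothetical:

```clojure
;; hypothetical table:
;;   create table measurements (id int8, samples float8[], tags text[]);
(copy/copy-into! raw-conn :measurements
                 [:id :samples :tags]
                 [[1 (double-array [0.5 1.25 2.0]) (into-array String ["a" "b"])]
                  [2 (double-array [3.5])          (into-array String ["c"])]])
```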
jsonb
Things that are String-like, or serialized in string form, should work using the String -> text mapping. An exception is the `jsonb` Postgres type, because its binary format requires a version signifier. Wrapping a JSON string in a `clj-pgcopy.core/JsonB` handles that.
Note that this library does not serialize to JSON; it simply wraps a valid JSON string so that it can actually be used.
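A sketch of what that might look like, assuming `JsonB` is a record/type with the generated `->JsonB` constructor (check the library for the exact constructor) and a hypothetical `events` table:

```clojure
;; hypothetical table: create table events (id int8, payload jsonb);
;; ->JsonB is assumed here; it only wraps an already-serialized JSON string.
(copy/copy-into! raw-conn :events
                 [:id :payload]
                 [[1 (copy/->JsonB "{\"kind\":\"signup\",\"ok\":true}")]])
```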
What about my weird type?
These type mappings are implemented using a Clojure protocol, namely `clj-pgcopy.core/IPGBinaryWrite`. To add support for another type, just extend that protocol with an implementation. You can and should use the library’s implementations for the other types as guidance for doing so.
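For example, here is a sketch of extending the protocol for `java.time.LocalTime` mapped to the Postgres `time` type. The protocol name comes from the library, but the method name and signature below are hypothetical, so check the library’s built-in implementations for the real ones:

```clojure
;; NOTE: `write-to` and its argument list are placeholders; the real protocol
;; method may differ.
(extend-protocol copy/IPGBinaryWrite
  java.time.LocalTime
  (write-to [t ^java.io.DataOutputStream out]
    ;; Postgres encodes `time` as an 8-byte count of microseconds since midnight
    (.writeInt out 8)                               ; field length in bytes
    (.writeLong out (quot (.toNanoOfDay t) 1000)))) ; nanoseconds -> microseconds
```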
Conclusion
Go forth and import!