Bulk imports with Datomic

This is a repost. You can find the original here

I’ve been really happy with Datomic, but doing an initial bulk import wasn’t as familiar as SQL dump/restore. Here are some things that I’ve learned from doing several imports.

Use core.async

The Datomic transactor handles concurrency by transacting datoms serially, but that doesn’t mean it isn’t fast! In my experience, the bottleneck is actually in the reshaping of data and formatting transactions. I use core.async to parallelize just about everything in the import pipeline.

One example of how I’ve leveraged core.async for import jobs can be found in my Kevin Bacon project repository.
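To make the idea concrete, here is a minimal sketch of a parallel reshaping stage feeding a serial transaction stage. It is not the code from the repository mentioned above. The `row->tx` function, the `:person/*` attributes, and the option names are all hypothetical, and `transact!` is passed in so the sketch runs without a database.

```clojure
(require '[clojure.core.async :as a])

(defn row->tx
  "Hypothetical reshaping step: one raw row in, a seq of entity maps out."
  [{:keys [id name]}]
  [{:person/id id, :person/name name}])

(defn run-import!
  "Reshapes rows on `parallelism` threads, groups the results into batches of
  `batch-size`, and calls `transact!` on each batch from the calling thread.
  Returns the number of entity maps handed to `transact!`."
  [rows transact! {:keys [parallelism batch-size]}]
  (let [in  (a/chan 64)
        ;; The transducer on `out` groups reshaped entities into batches.
        out (a/chan 64 (partition-all batch-size))]
    ;; Feed rows from a separate thread so a large input never blocks the caller.
    (a/thread
      (doseq [row rows] (a/>!! in row))
      (a/close! in))
    ;; Parallel, order-preserving reshaping; closes `out` once `in` is drained.
    (a/pipeline parallelism out (mapcat row->tx) in)
    ;; Batches are handed over one at a time, in order.
    (loop [n 0]
      (if-let [batch (a/<!! out)]
        (do (transact! batch)
            (recur (+ n (count batch))))
        n))))

;; Try it with a stub that only records the batches it receives.
(def batches (atom []))
(def imported
  (run-import! (for [i (range 5)] {:id i :name (str "person-" i)})
               #(swap! batches conj %)
               {:parallelism 4 :batch-size 2}))
```

In a real import, `transact!` could be something like `#(deref (d/transact conn %))`, assuming `d` is `datomic.api` and `conn` is an open connection. The work is split this way because reshaping is CPU-bound and parallelizes well, while the transactor applies transactions serially anyway.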

Run the import locally

I use DynamoDB as my storage backend in production. I used to run my import tasks directly against the production transactor and storage. Lately, though, I’ve found it really helpful to run them against a locally running transactor with the dev storage backend.

Running an import locally means I don’t have to worry about networking, which speeds the whole process up quite a bit; it also gives me much more freedom to iterate on the database design itself. (I rarely get an import correct the first time.) And in the case of DynamoDB, I save some money, as I don’t have to keep my “write throughput” cranked way up for as long.

Clean up the local database

Bulk imports create some garbage, so manually reindexing before backing up is advantageous. Here’s what a REPL session looks like:

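(require '[datomic.api :as d])

(def conn (d/connect "datomic:dev://localhost:4334/database-name"))
;; Ask the transactor to schedule a re-index.
(d/request-index conn)
;; Block until the index has caught up with the current basis-t.
(->> conn d/db d/basis-t (d/sync-index conn) deref)
;; Let storage reclaim garbage older than the given instant (here: now).
(d/gc-storage conn (java.util.Date.))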

For more information on why this cleanup is important, see the relevant Datomic documentation.

Use backup/restore

Once everything looks good in the local database, I use Datomic’s built-in backup/restore facilities to send it up to production. Assuming you’ve already deployed a production transactor and provisioned DynamoDB storage, here’s the process I follow:

1. Run the datomic backup-db command against the local import.
2. Crank my “write throughput” on DynamoDB way up (on the order of 1000).
3. Run the datomic restore-db command from the backup folder to the remote database.
4. Turn the “write throughput” back down to whatever value I plan for ongoing use (see the Datomic documentation for more information).
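The two command-line steps can be scripted from the same REPL. This is only a sketch under assumptions: the install directory, the backup location, and the DynamoDB region and table name in the production URI are placeholders, and `datomic-cmd` and `run-datomic!` are hypothetical helpers. The point is the argument order: both `backup-db` and `restore-db` take the source first and the destination second.

```clojure
(require '[clojure.java.shell :as sh])

;; Placeholder values: adjust every one of these for your own setup.
(def datomic-dir "/path/to/datomic")                                  ; assumed install dir
(def local-uri   "datomic:dev://localhost:4334/database-name")
(def backup-uri  "file:/path/to/backups/database-name")               ; assumed backup location
(def prod-uri    "datomic:ddb://us-east-1/your-table/database-name")  ; assumed region and table

(defn datomic-cmd
  "Builds the argv for a bin/datomic subcommand: source first, destination second."
  [subcommand from to]
  ["bin/datomic" subcommand from to])

(defn run-datomic!
  "Runs the command from the Datomic install directory.
  Returns the map produced by clojure.java.shell/sh (:exit, :out, :err)."
  [argv]
  (apply sh/sh (concat argv [:dir datomic-dir])))

(comment
  ;; Step 1: back up the local import.
  (run-datomic! (datomic-cmd "backup-db" local-uri backup-uri))
  ;; Step 2: raise DynamoDB write throughput (done outside this script).
  ;; Step 3: restore the backup into production storage.
  (run-datomic! (datomic-cmd "restore-db" backup-uri prod-uri))
  ;; Step 4: lower write throughput again.
  )
```

The actual invocations sit inside `comment` so that loading the file does not start a backup by accident; evaluate them one at a time and check `:exit` before moving on.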

The heart of almost every business is its data. Datomic is a great choice for business data, in part because it treats all data as important: nothing is overwritten. New things are learned, but the old facts are not replaced. And knowing how to get your data into Datomic is half the battle.

Go forth and import!