If a team is a machine, then socializing is the grease that keeps it moving fluidly.
I’ve worked remotely for 8 years, for 5 different companies. The kind of social onboarding that I mention above has been a part of each position, even when it happens organically. Given how important socializing is to team health, it should not come as a surprise that the places that have emphasized social onboarding have also been the ones that I started contributing to most quickly.
Bringing a new person into a team changes that team fundamentally, and the healthiest teams I’ve been a part of seem to understand that, emphasizing this social onboarding in ways both subtle and overt.
When a person leaves a company, be it because of downsizing, accepting an offer at another company, or even just an internal reorganization, the social side is not always given the same priority. Indeed, even as projects are handed off and any HR offboarding procedure is executed carefully, few teams give adequate attention to the social offboarding.
One of the most helpful traditions that I’ve come to embrace when someone leaves a team has been a “social offboarding” call with the team. Given that I’ve been on remote teams, this has looked like an informal Zoom call and has been a great way to say goodbye, talk about “what’s next”, and gain some sense of closure. And when I’ve been the one leaving, it has served as a nice palate cleanser amid all of the HR paperwork and other formalized processes necessary when one leaves a company.
Depending on the circumstances under which a person is leaving, it might just be a normal Zoom (or whatever) meeting that you create and invite all the teammates to. But if access to internal tools has been revoked, this may involve treating the departing employee like an external guest and inviting them to the meeting as such. At the very least, ensuring that you have their personal contact information makes this a possibility.
This has usually involved anyone that I worked with every day. If you are involved in a daily standup meeting, for example, that might be a good place to start with the “guest list”. Not everyone feels comfortable being candid with higher-ups, so use your discretion on who the team and employee would feel comfortable with.
Of course, you can always ask the person leaving who they would like to have on the call.
Given that this is usually an informal conversation, the folks on the call can discuss whatever makes sense. However, as a guideline, I’d caution against discussing the terms of the end of the employment or other topics that could get you in hot water. If the person leaving is going to another position, they might give some details about what drew them to the new place, but steer clear of why they are leaving this team.
This might be a time to reminisce about projects people worked on together, accomplishments you are proud of, or just re-tell funny stories. Again, this should be an informal call.
If, however, you are hoping for more structure, here’s an approach I’ve taken at several places when I’ve left: I offer positive affirmations about the specific people on the call. I’ll address each person individually, naming the things I appreciate about them as coworkers or what I may miss about not being on their team anymore.
Here’s an illustration using completely made up people and positions:
Alice, I've learned so much from your deep knowledge of Bells, and
your work with Whistles has been inspiring. The team is so lucky
to have you, and I'll miss learning from you.
Bob, you were the first person that reached out to me when I was
new here, and you've never ceased encouraging me even while I
learned the ropes. You've been a great teammate, and I'll miss
having you in my corner.
Charlotte, thank you for trusting me enough to experiment on the
Widget project, even though it was clearly in your area of
expertise. You helped me when I got stuck, and I'm so appreciative.
People, in general, seem to like when you compliment them, and even more so when you’re able to call out specifics. Every time I’ve used this format, the people on the call have been quite receptive. It’s been a helpful, perhaps even therapeutic, way to spend the last few minutes of time with these people as teammates.
And when someone else has been the one leaving, this kind of call has helped me and the rest of the team find a sense of closure, and we leave the call with positive vibes.
We had been using the `clojure.java.jdbc/insert-multi!` function to insert rows in batches, but the performance of the database insertion wasn’t quite what we were hoping for.
Although the overhead from `clojure.java.jdbc` is negligible in most scenarios, and certainly in virtually all CRUD-like workflows, it can become significant with batch insertions at scale: all of the work of converting data types to String-y parameters adds up.
Having previously used the `COPY` syntax for importing textual data like CSVs, I found that PostgreSQL’s `COPY` also has a binary variant that allows direct insertion into tables. Because the binary on-the-wire representation of PostgreSQL’s data types is virtually the same as the storage representation, there is significantly less overhead for importing using `COPY ... FROM STDIN WITH BINARY`.
So, there are immediately two benefits to using the binary `COPY` approach:

- It is significantly faster than its `INSERT`-based counterpart.
- PostgreSQL has less parsing and conversion work to do with `COPY` than it does with an `INSERT` statement, because the binary representation so closely mirrors the internal storage representation.

There may be other benefits I’m unaware of, but those two things alone made me curious if there was a way to get the `COPY` goodness with idiomatic Clojure.
Both of the main PostgreSQL JDBC drivers, https://jdbc.postgresql.org/ and https://impossibl.github.io/pgjdbc-ng/, support putting a connection into COPY mode, after which data can be streamed into or out of PostgreSQL. However, neither driver does anything to help you get the data into the raw format that PostgreSQL expects. For the text (default) format, that’s basically a tab-delimited or CSV payload.

The binary COPY format, however, is a different beast entirely. Each PostgreSQL data type has a very specific binary representation. Again, while the PostgreSQL JDBC drivers provide a way to stream COPY data in or out, the data within those streams is left entirely up to you. The best library I found to do this kind of serialization was PgBulkInsert, which works with JDBC, but has a very Java-centric API.
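To make “very specific binary representation” concrete, here is a minimal hand-rolled sketch (my own illustration, not clj-pgcopy’s or PgBulkInsert’s actual code) of the framing for a payload containing one row with a single int4 column: a fixed header, then per tuple a 16-bit field count followed by length-prefixed, big-endian field values, and finally a -1 trailer.

```clojure
(import '(java.io ByteArrayOutputStream DataOutputStream))

(defn binary-copy-payload
  "Frames one row containing a single int4 column for COPY ... WITH BINARY."
  [^long n]
  (let [baos (ByteArrayOutputStream.)
        out  (DataOutputStream. baos)]
    ;; 11-byte signature, 32-bit flags word, 32-bit header-extension length
    (.write out (.getBytes "PGCOPY\n\u00ff\r\n\u0000" "ISO-8859-1"))
    (.writeInt out 0)
    (.writeInt out 0)
    ;; one tuple: 16-bit field count, then a 32-bit byte length and value per field
    (.writeShort out 1)
    (.writeInt out 4)          ; an int4 is 4 bytes
    (.writeInt out (int n))
    ;; trailer: a field count of -1 marks the end of the data
    (.writeShort out -1)
    (.toByteArray baos)))
```

Every supported type needs this kind of careful, type-specific serialization, which is exactly the tedium a library should absorb.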
That’s where clj-pgcopy comes in. It maps common data types to their binary representation, with the hope that the binary `COPY` style of importing data becomes as easy to use as the `clojure.java.jdbc` methods. The primary API of clj-pgcopy is `clj-pgcopy.core/copy-into!`, which aims to be drop-in compatible with most places where `clojure.java.jdbc/insert-multi!` is being used.
For example, let’s say that we have a table of product listings, looking like this:
create table inventory (
guid uuid primary key,
created_at timestamptz not null,
active boolean not null default false,
price decimal(8,2),
average_rating float4
);
And some data we’d like to import that looks like this:
(def data
[{:guid #uuid "d44c2977-0a9f-4d12-88d2-7d85e07ce1e2",
:created_at #inst "2019-12-01T23:37:33.701-00:00",
:active true,
:price 998.49M,
:average_rating 3.3}
{:guid #uuid "220603d4-c1b9-4ea4-b5f4-c61a38e9f515",
:created_at #inst "2019-12-01T16:22:35.826-00:00",
:active false,
:price 847.90M,
:average_rating 2.1}])
A typical way to import data with `clojure.java.jdbc/insert-multi!` would look like this:
(let [cols [:guid :created_at :active :price :average_rating]
->tuple (apply juxt cols)]
(jdbc/with-db-connection [conn conn-spec]
(jdbc/insert-multi! conn :inventory cols (map ->tuple data))))
To use `clj-pgcopy`, the only thing that needs to change is adding a `require` and changing the callsite:
(require '[clj-pgcopy.core :as copy])
(let [cols [:guid :created_at :active :price :average_rating]
->tuple (apply juxt cols)]
(jdbc/with-db-connection [conn conn-spec]
(copy/copy-into! (:connection conn) :inventory cols (map ->tuple data))))
Please note that `copy-into!` expects a “raw” JDBC connection, not a Clojure map wrapping one like `clojure.java.jdbc` uses.
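If you aren’t already inside `with-db-connection`, one hedged way to obtain such a raw connection is `clojure.java.jdbc/get-connection`; this sketch reuses the `conn-spec`, `cols`, `->tuple`, and `data` bindings from the example above.

```clojure
;; get-connection returns a plain java.sql.Connection, which is what copy-into! wants
(with-open [^java.sql.Connection raw-conn (jdbc/get-connection conn-spec)]
  (copy/copy-into! raw-conn :inventory cols (map ->tuple data)))
```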
Using the same table as the example above, I did some non-definitive benchmarking. For most typical use-cases, `clj-pgcopy` should be a little more than twice as fast as `insert-multi!`:
tuples | batch size | insert-multi | clj-pgcopy |
---|---|---|---|
10000 | 100 | 218.4 ms | 107.2 ms |
10000 | 500 | 205.2 ms | 90.35 ms |
50000 | 100 | 1.030 sec | 422.3 ms |
50000 | 500 | 1.272 sec | 382.3 ms |
100000 | 100 | 2.051 sec | 1.005 sec |
For measurement methodology, or to run your own benchmarks, see the benchmark namespace of the clj-pgcopy repository.
Out of the box, clj-pgcopy supports many data types, with what I consider reasonable default mappings.
JVM type | Postgres type |
---|---|
Short | int2 (aka smallint) |
Integer | int4 (aka integer) |
Long | int8 (aka bigint) |
Float | float4 (aka real) |
Double | float8 (aka double precision) |
BigDecimal | numeric/decimal |
Boolean | boolean |
String | text/varchar/char |
java.util.UUID | uuid |
JVM type | Postgres type |
---|---|
java.sql.Date | date |
java.time.LocalDate | date |
java.util.Date | timestamp[tz] |
java.sql.Timestamp | timestamp[tz] |
java.time.Instant | timestamp[tz] |
java.time.ZonedDateTime | timestamp[tz] |
java.time.OffsetDateTime | timestamp[tz] |
org.postgresql.util.PGInterval | interval |
JVM type | Postgres type |
---|---|
org.postgresql.geometric.PGpoint | point |
org.postgresql.geometric.PGline | line |
org.postgresql.geometric.PGpath | path |
org.postgresql.geometric.PGbox | box |
org.postgresql.geometric.PGcircle | circle |
org.postgresql.geometric.PGpolygon | polygon |
Implemented for the following JVM-typed arrays:
JVM type | Postgres type |
---|---|
int[] | int4[] (aka integer[]) |
long[] | int8[] (aka bigint[]) |
float[] | float4[] (aka real[]) |
double[] | float8[] (aka double precision[]) |
byte[] | bytea |
String[] | text[] (or varchar) |
java.util.UUID[] | uuid[] |
Currently, only 1-dimensional Postgres arrays are supported.
Things that are String-like, or serialized in string form, should work using the String -> text mapping. An exception is the `jsonb` PostgreSQL type, because the binary format requires a version signifier. Wrapping a JSON string in a `clj-pgcopy.core/JsonB` handles that.
Note that this library does not serialize to JSON, it simply wraps a valid JSON string such that it can actually be used.
These type mappings are implemented using a Clojure protocol, namely `clj-pgcopy.core/IPGBinaryWrite`. In order to add support for another type, just extend that protocol with an implementation. You can and should use the implementations for the other types in the library as guidance on doing so.
Go forth and import!
Doing things exactly once, atomically, is relatively straightforward in traditional ACID transactional databases: within a transaction, you find an entity (row) by the token, perform any updates to that entity, and finally invalidate the token (often by deleting or nullifying it). 1
But how can we accomplish something like this in Datomic?
Welcome to WWSN! We’re so excited you’re here! On WWSN, you can sign up, sign in, and reset your password. It’s so simple!
Let’s say we have a really simple schema. A user has an email address and a bcrypted password:
[{:db/id #db/id [:db.part/db]
:db/ident :user/email
:db/valueType :db.type/string
:db/unique :db.unique/identity
:db/cardinality :db.cardinality/one
:db.install/_attribute :db.part/db}
{:db/id #db/id [:db.part/db]
:db/ident :user/crypted-password
:db/valueType :db.type/string
:db/cardinality :db.cardinality/one
:db.install/_attribute :db.part/db}
{:db/id #db/id [:db.part/db]
:db/ident :user/single-use-token
:db/valueType :db.type/string
:db/unique :db.unique/value
:db/cardinality :db.cardinality/one
:db.install/_attribute :db.part/db}]
Some initial data might be added like this:
(d/transact conn [{:db/id (d/tempid :db.part/user)
:user/email "jim@example.com"
:user/crypted-password (crypt "jello4stapler")}
{:db/id (d/tempid :db.part/user)
:user/email "pam@example.com"
:user/crypted-password (crypt "art4evah")}])
Later, one of the users wants to reset their password, so we generate a password reset token and persist it:
(d/transact conn [{:db/id (d/tempid :db.part/user)
:user/email "jim@example.com"
:user/single-use-token (generate-secure-random)}])
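I’m hand-waving `generate-secure-random` here; a minimal sketch of one possible implementation (not necessarily what you’d want in production, and the 32-byte length is an arbitrary choice) is URL-safe Base64 over bytes from `SecureRandom`:

```clojure
(import '(java.security SecureRandom) '(java.util Base64))

(defn generate-secure-random
  "Returns a URL-safe random token string."
  []
  (let [buf (byte-array 32)]                       ; 32 bytes is an arbitrary choice
    (.nextBytes (SecureRandom.) buf)
    (.encodeToString (Base64/getUrlEncoder) buf)))
```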
After receiving an email, they follow the link that includes the single-use token in the URL. When they submit their new password, we look up the user by that token and update them accordingly:
(let [db (d/db conn)
token (:token params)
e (d/q '[:find ?e .
:in $ ?token
:where [?e :user/single-use-token ?token]]
db token)]
(if e
(d/transact conn [[:db/add e :user/crypted-password new-password]
[:db/retract e :user/single-use-token token]])))
But there’s a problem with this solution. Even though the new password is asserted in the same transaction that the token is invalidated, the opportunity for concurrency problems between threads and/or peers still exists.
Here’s why. In between the time that `e` is first found by its token and its new facts are transacted, somebody else could have already used (and deleted) the token. In other words, even though all writes are transactional, the reads are not. In practice this is rarely, if ever, a problem: `(d/db conn)` will return the most recent version of the database that the peer can get.
Let’s illustrate this possibility of a stale database introducing a bug:
(def stale-db (d/db conn))
(let [{:keys [token new-password]} params
e (d/q '[:find ?e .
:in $ ?token
:where [?e :user/single-use-token ?token]]
stale-db token)]
(if e
(d/transact conn [[:db/add e :user/crypted-password new-password]
[:db/retract e :user/single-use-token token]])))
(let [token (:token params)
;; this token still exists because we're using an "old" db value
e (d/q '[:find ?e .
:in $ ?token
:where [?e :user/single-use-token ?token]]
stale-db token)]
(if e
(d/transact conn [[:db/add e :user/crypted-password "somethingelse"]
[:db/retract e :user/single-use-token token]])))
The user entity `e` is found both times because the database value is immutable. So, the transactions will both succeed.
The astute reader may have noticed something. I said all writes are transactional, but in the second block of code, we’re retracting a value that’s already been retracted. Something must be broken!
Nothing’s broken. This tripped me up at first, but it turns out that retractions work just like assertions with regard to redundancy elimination.
From the Datomic documentation on transactions:
Redundancy Elimination
A datom is redundant with the current value of the database if there is a matching datom that differs only by transaction id. If a transaction would produce redundant datoms, those datoms are filtered out, and do not appear a second time in either the indexes or the transaction log.
In other words, Datomic is eliminating the redundant retraction: we’ve already retracted the token, so the effective datoms of the transaction only include the `[:db/add ...]` of the new password. In this particular use case, retractions cannot be used to safeguard us from using a token more than once.
Datomic’s got us covered. I mentioned before that all writes are transactional, and reads are not. That’s actually only true on the peers. The transactor itself is guaranteed to always have access to the most recent database value at any time. Among other things, this is what enables built-in database functions like `:db.fn/cas` to work.
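For reference, a `:db.fn/cas` (compare-and-swap) call takes an entity id, an attribute, the expected old value, and the new value; the transaction aborts if the current value no longer matches. The concrete names below are placeholders:

```clojure
(d/transact conn [[:db.fn/cas user-eid :user/crypted-password
                   old-crypted-password new-crypted-password]])
```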
Within a transaction, a database function is used in place of a `:db/add` or `:db/retract`. When the transactor sees a database function, it invokes it and splices the result into the rest of the transaction. Also, a database function always receives the most recent `db` value as its first argument. Because you have access to the whole of the Datomic API, you can leverage this `db` value to do all sorts of things.
Let’s transact the following new schema info into our database:
[{:db/id #db/id [:db.part/db]
  :db/ident :db.fn/set-with-token
  :db/doc "Look up entity by token, set attr and value, and retract token"
  :db/fn #db/fn {:lang "clojure"
                 :params [db token-attr token-value attr value]
                 :code (let [e (datomic.api/q '[:find ?e .
                                                :in $ ?ta ?tv
                                                :where
                                                [?e ?ta ?tv]]
                                              db token-attr token-value)]
                         (if e
                           [[:db/add e attr value]
                            [:db/retract e token-attr token-value]]
                           (throw (ex-info "No entity with that token exists"
                                           {token-attr token-value}))))}}]
This function is more generic than our immediate use-case, but I prefer to parameterize attributes as well as values in database functions. It allows us to re-use the database function for other token fields, and it won’t have to be updated in the schema if we ever change the name of our token attribute.
Here’s how we use this shiny new function:
(let [{:keys [token new-password]} params]
(d/transact conn [[:db.fn/set-with-token :user/single-use-token token
:user/crypted-password new-password]]))
When we transact this data, the transactor invokes our function using the most recent database value. In other words, we are making the lookup portion serializable with the rest of the operations.
If we run this transaction a second time, we’ll get the error message.
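On the peer, that failure surfaces when the transaction future is dereferenced, so a caller might guard it something like this (a sketch; `handle-invalid-token` is a hypothetical handler, and the exact exception wrapping can vary):

```clojure
(try
  @(d/transact conn [[:db.fn/set-with-token :user/single-use-token token
                      :user/crypted-password new-password]])
  (catch Exception e
    ;; the ex-info thrown inside the database function is the underlying cause;
    ;; handle-invalid-token is a hypothetical handler for this example
    (handle-invalid-token e)))
```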
This kind of transaction atomicity is made possible by Datomic’s single-writer design. Other database systems (e.g. SQL) have to employ very complicated isolation patterns like MVCC to allow multiple writers while keeping data integrity guarantees. Datomic side-steps those problems by using a single writer, paired with immutable history.
The catch, as we have seen, is that read-dependent writes will require the use of database functions to maintain atomicity. Of course, database functions have uses outside of concurrency contexts. And, as a bonus, they can be loaded and invoked on the client as well.
For more information on database functions, see the docs, watch the video, or see [the Day of Datomic examples](https://github.com/Datomic/day-of-datomic/blob/master/tutorial/data_functions.clj). You can also view my scratch.clj file that I used to build up the code examples here.
1 However, even in SQL setups there are potential pitfalls. Because of the potential for multiple writers, care must be taken to satisfy the “exactly once” requirement. Where possible, a SQL client should use a single statement to find, update, and nullify a token. When a single statement isn’t possible, row-level locks can be used. Or, better yet, wrap everything in a transaction with a serializable isolation level.
I’ve been really happy with Datomic, but doing an initial bulk import wasn’t as familiar as SQL dump/restore. Here are some things that I’ve learned from doing several imports.
The Datomic transactor handles concurrency by transacting datoms serially, but that doesn’t mean it isn’t fast! In my experience, the bottleneck is actually in the reshaping of data and the formatting of transactions. I use `core.async` to parallelize just about everything in the import pipeline.
One example of how I’ve leveraged `core.async` for import jobs can be found in my Kevin Bacon project repository.
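As a rough sketch of that shape (not the repository’s actual code; `row->tx-data`, the channel size, and the parallelism factor are all placeholders), the pipeline might look something like this:

```clojure
(require '[clojure.core.async :as async])

;; Assumes datomic.api is required as `d`, as in the REPL session below.
(defn import-all!
  "Reshape rows into transaction data on several threads, then transact serially."
  [conn rows row->tx-data]
  (let [raw-ch (async/to-chan rows)   ; source of raw rows
        tx-ch  (async/chan 100)]      ; formatted transaction data
    ;; the CPU-heavy reshaping happens on 8 worker threads
    (async/pipeline-blocking 8 tx-ch (map row->tx-data) raw-ch)
    ;; the transactor serializes writes anyway, so drain the channel in order
    (loop []
      (when-let [tx-data (async/<!! tx-ch)]
        @(d/transact conn tx-data)
        (recur)))))
```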
I use DynamoDB as my storage backend in production. I used to try to run my import tasks directly against the production transactor/storage. Lately, though, I’ve found it really helpful to run my import tasks against a locally-running transactor and the `dev` storage backend.
Running an import locally means I don’t have to worry about networking, which speeds the whole process up quite a bit; it also gives me much more freedom to iterate on the database design itself. (I rarely get an import correct the first time.) And in the case of DynamoDB, I save some money, as I don’t have to have my “write throughput” cranked way up for as long.
Bulk imports create some garbage, so manually reindexing before backing up is advantageous. Here’s what a REPL session looks like:
(def conn (d/connect "datomic:dev://localhost:4334/database-name"))
(d/request-index conn)
(->> conn d/db d/basis-t (d/sync-index conn) deref)
;; blocks until done indexing
(d/gc-storage conn (java.util.Date.))
For more information on why this cleanup is important, see the relevant Datomic documentation.
Once everything looks good in the local database, I use Datomic’s built-in backup/restore facilities to send the database up to production. Assuming you’ve already deployed a production transactor and provisioned DynamoDB storage, here’s the process I follow:

1. Run the `datomic backup-db` command against the local import.
2. Run the `datomic restore-db` command from the backup folder to the remote database.

The heart of almost every business is its data. Datomic is a great choice for business data, in part because it treats all data as important: nothing is overwritten. New things are learned, but the old facts are not replaced. And knowing how to get your data into Datomic is half the battle.
Go forth and import!
I’m getting closer to the frontend development environment of my dreams. The combination of editor integration, live browser reload, and not having to manually run commands over and over is time-saving and a pleasure to work with.
At Hashrocket, designers and developers work very closely together. Visual design and markup is handled by our designers, who create “stubbed out” templates in the UI directory. It’s a process that works very well for us, and allows us to iteratively add features to an application.
This process has served us very well in Rails using a UI controller, available only in development mode.
I’ve been using ClojureScript a lot lately, particularly with Om, and have missed that directory of collaboration. After all, the designers at Hashrocket have a proclivity for HAML and SASS.
In the past, I’ve set up a separate repository using Middleman to handle markup and styles, using `middleman build`, copying the generated CSS files, and eyeballing the generated markup to ensure it matched the Om component’s markup. Aside from being tedious, it’s really easy to get out of sync with a manual process like this. The static resource generation should be a part of our build process.
Enter boot.
If you’re new to the Clojure world, you may have heard of Leiningen, which is the de facto dependency management and build tool for Clojure/Script. Boot is similar to Leiningen, but adds the ability to compose tasks to create build pipelines. This composability, along with some really smart architectural decisions, is what makes boot a great choice for the problem at hand.
Adzerk’s example repo is a great way to get started with ClojureScript and boot. Of particular note is the `build.boot` file. It demonstrates how one can build up a `dev` task that watches a directory for changes, rebuilding ClojureScript sources, and notifying the browser to reload the code. It includes the setup necessary for source maps, a development server, and the browser-connected REPL. But what I want to add to that pot of awesome is the ability to compile HAML and SASS as a part of the pipeline.
I had an epiphany one night after working on this problem for a while: I can just use Middleman. After all, boot and the ClojureScript compiler run on the JVM, and JRuby is easily embeddable. After a short bit, I came up with boot-middleman, the glue I needed to build HAML/SASS as a part of my build process.
It assumes a subdirectory is a Middleman app (`assets` by default). This works nicely because my designer pals can collaborate with me without having to use the JVM at all. They just run `middleman` in the `assets` subdirectory and work as normal.
See the boot-middleman README for instructions on setting up.
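To give a sense of how it composes, a `dev` task in `build.boot` might look something like the sketch below. Treat it as an assumption-laden illustration: the `reload`, `cljs`, and `middleman` task names come from adzerk/boot-reload, adzerk/boot-cljs, and boot-middleman respectively, so check each README for the exact names and options.

```clojure
(deftask dev
  "Watch sources, rebuild Middleman assets and ClojureScript, and reload the browser."
  []
  (comp (watch)        ; re-run the pipeline whenever files change
        (middleman)    ; compile HAML/SASS from the assets subdirectory
        (reload)       ; push changes to the connected browser
        (cljs)))       ; recompile ClojureScript
```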
I used this workflow to create a minesweeper clone, the source of which is on GitHub. Just clone and run `boot dev`.
To see the workflow in action, check out the following video. It demonstrates how editing front-end files does not require a manual browser refresh to see the effects.
Vim is a powerful text editor. Clojure is a powerful programming language. While it’s been possible to edit Clojure code in vim for years, the toolchain has improved greatly over the past year. Today we’re going to see how we can integrate vim with our Clojure REPL environment.
In a shell session, let’s fire up a Clojure REPL. I’m going to use `lein repl` to do this. In another shell session, let’s start vim and edit a Clojure file.
As I edit my file, I can copy code from the editor, switch to the window with the REPL in it, and paste that code in. This works, but it’s an awkward, slow process. REPLs are supposed to be all about fast feedback. We can do better than copy and paste.
Before we get started, we should get some basic plugins for Clojure development. Using your preferred vim plugin manager, add these plugins:
guns/vim-clojure-static
tpope/fireplace.vim
After you’ve installed the necessary Vim plugins, enter a project directory. For example, if you have a Leiningen project, cd into the directory. In one shell session, fire up a REPL with `lein repl`. In another shell session, cd to that folder once again, and then open vim.
Fireplace is able to detect when you are in the same directory as an active REPL, and will attempt to automatically connect for you. This process is transparent, but should be obvious once we attempt to send a command to the connected REPL.
The most basic fireplace command is `:Eval`. `:Eval` takes an arbitrary Clojure expression, sends it off to the REPL, and prints the result for you. For example, we could run `:Eval (+ 1 1)`, and we would, as expected, see `2` printed out. This emulates typing at the REPL prompt directly, but there’s much more we can do with our REPL-connected vim session.
Let’s stay with `:Eval` for a bit longer. `:Eval` without any arguments will eval and print the outermost form on the current line. For example, let’s look at a simple expression.
(map inc [1 2 3])
When we have our cursor on this line and type `:Eval` with no arguments, we’ll see `(2 3 4)` printed back.
`:Eval`, as with many vim commands, can also take a range. So, `:1,3Eval` would evaluate all of lines 1 through 3. All of the normal special ranges work here, such as `%` for the entire file, and `'<,'>` for the current selection in visual mode.
`:Eval` works well, but there’s a quicker way to get feedback. `cp` is the normal mode mapping for doing a simple eval and print. By default, `cp` expects a motion. The form that I use most, though, is `cpp`, which will eval and print the innermost form from the cursor’s current position.
To demonstrate what this means, let’s look at that expression again.
(map inc [1 2 3])
When our cursor is on the `m` of `map`, and we type `cpp`, we’ll see `(2 3 4)`, just as when we did the plain `:Eval`. But if we move our cursor inside the vector and type `cpp` again, we’ll see that inner form evaluated.
Something unique to fireplace is its concept of a quasi-REPL. This is a cousin of the `cp` mappings, but with an intermediate editing window. To demonstrate this, let’s consider the following example.
(->> [1 2 3]
(map str)
reverse
(mapv dec))
In this trivial example, we want to reverse a sequence and decrement each number. There’s a bug in here, but it’s in the middle of the thread-through macro. We could just edit the line directly and eval/print using `cpp`, but there’s another way to do one-off iterative development like this.
Type `cqc` in normal mode. A command-line window will open. This is very much like a normal vim buffer, with a few notable exceptions:

- Pressing `Enter` in normal mode sends the current line to the REPL for eval-ing.

tpope calls this the “quasi-repl”, and indeed that is the mnemonic for the mapping itself: `cq` is the “Clojure Quasi-REPL”.
While we’re in this special window, let’s type the following, and hit enter:
(map str [1 2 3])
Immediately, we can see the issue. Converting each number to a string prevents `dec` from working later on.
Having to type the whole line again isn’t always convenient. For those cases, there’s `cqq`, which is like `cqc` except that it pre-populates the command window with the innermost form under the cursor. We can see this in action by putting our cursor near the beginning of the thread-through macro and typing `cqq`.
You can think of `cqq` as being very similar to `cpp`, but with a chance to edit the line or lines before sending it off to the REPL.
One of the great things about Clojure is that documentation is a first-class citizen, and built-in functions have documentation attached to them. With a standard REPL, we can use the `doc` function to get the signature and documentation for a given function.
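For example, asking for `map`’s documentation at a plain REPL prints something like this (output abbreviated; the exact arglists vary by Clojure version):

```clojure
user=> (doc map)
-------------------------
clojure.core/map
([f coll] [f c1 c2] [f c1 c2 c3] [f c1 c2 c3 & colls])
  Returns a lazy sequence consisting of the result of applying f to
  the set of first items of each coll, followed by applying f to the
  set of second items in each coll, until any one of the colls is
  exhausted. ...
```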
With fireplace, we get this with the `:Doc` command, and it works just like `doc`. To see the documentation for `map`, for example, type `:Doc map`. We immediately see the documentation for the `map` function printed.
There’s an even shorter way to look up documentation for a function. When your cursor is on a word, you can press `K`, that is, `Shift` and `k`. We can try this again with the `map` function by placing our cursor on the function itself and pressing `K`.
We can also use the `:Source` command to show the source for a function. When we do this with `map`, we see the source code for `map` from `clojure.core`.
Datomic is a database that changes the way that you think about databases. It also happens to be effective at modeling graph data and was a great fit for performing graph traversal in a recent project I built.
I started out building a six degrees of Kevin Bacon project using Neo4j, a popular open-source graph database. It worked very well for actors that were a few hops away, but finding paths between actors with more than 5 hops proved problematic. The Cypher query language gave me little visibility into the graph algorithms actually being executed. I wanted more.
Despite not being explicitly labeled as such, Datomic proved to be an effective graph database. Its ability to arbitrarily traverse datoms, when paired with the appropriate graph searching algorithm, solved my problem elegantly. This technique ended up being fast as well.
Quick aside: this post assumes a cursory understanding of Datomic. I won’t cover the basics, but the official tutorial will help you get started.
The problem domain should be fairly familiar: the 6 degrees of Kevin Bacon. I wanted to create an app where you could pick an actor and find out what their Bacon Number was. That is, given an actor, I wanted to answer the question “how many degrees of separation is there between that actor and Kevin Bacon?”
Using information freely available from IMDb, I developed the following schema:
[
;; movies
{:db/id #db/id[:db.part/db]
:db/ident :movie/title
:db/valueType :db.type/string
:db/cardinality :db.cardinality/one
:db/fulltext true
:db/unique :db.unique/identity
:db/doc "A movie's title (upsertable)"
:db.install/_attribute :db.part/db}
{:db/id #db/id[:db.part/db]
:db/ident :movie/year
:db/valueType :db.type/long
:db/cardinality :db.cardinality/one
:db/doc "A movie's release year"
:db.install/_attribute :db.part/db}
;; actors
{:db/id #db/id[:db.part/db]
:db/ident :actor/name
:db/valueType :db.type/string
:db/cardinality :db.cardinality/one
:db/fulltext true
:db/unique :db.unique/identity
:db/doc "A person's name (upsertable)"
:db.install/_attribute :db.part/db}
{:db/id #db/id[:db.part/db]
:db/ident :actor/movies
:db/valueType :db.type/ref
:db/cardinality :db.cardinality/many
:db/doc "An actor's ref to a movie"
:db.install/_attribute :db.part/db}
]
In a nutshell, movies have titles and years. Actors have names and movies.
The “relationship” of actors to movies is many-to-many, so I’ve declared the `:actor/movies` attribute as having a cardinality of many.
Using datalog and `datomic.api/q`, we can make graph-like queries fairly easily. Because the `:where` clauses of a datalog query form an implicit join, we can join from our starting point to our ending point with relative ease.
As an example, what if we wanted to know the shortest path or paths from Kevin Bacon to John Belushi? Let’s use a query to find out:
(require '[datomic.api :as d :refer [q db]])
(def conn (d/connect ...))
(q '[:find ?start ?title ?end
:in $ ?start ?end
:where
[?a1 :actor/name ?start]
[?a2 :actor/name ?end]
[?a1 :actor/movies ?m]
[?a2 :actor/movies ?m]
[?m :movie/title ?title]]
(db conn)
"Bacon, Kevin (I)"
"Belushi, John")
;=> #{["Bacon, Kevin (I)" "Animal House (1978)" "Belushi, John"]}
That is fine when actors have worked together in a movie (a Bacon Number of 1), but doesn’t help us solve Bacon numbers when there are 2 or more movies between the actors. We could add more where clauses to join over two movies, but that isn’t sustainable. The queries would quickly become too long to reason about. This is a prime opportunity to use Datomic’s rules.
(def acted-with-rules
  '[[(acted-with ?e1 ?e2 ?path)
     [?e1 :actor/movies ?m]
     [?e2 :actor/movies ?m]
     [(!= ?e1 ?e2)]
     [(vector ?e1 ?m ?e2) ?path]]
    [(acted-with-1 ?e1 ?e2 ?path)
     (acted-with ?e1 ?e2 ?path)]
    [(acted-with-2 ?e1 ?e2 ?path)
     (acted-with ?e1 ?x ?pp)
     (acted-with ?x ?e2 ?p2)
     [(butlast ?pp) ?p1]
     [(concat ?p1 ?p2) ?path]]])
(q '[:find ?path
     :in $ % ?start ?end
     :where
     [?a1 :actor/name ?start]
     [?a2 :actor/name ?end]
     (acted-with-2 ?a1 ?a2 ?path)]
   (db conn) acted-with-rules "Bieber, Justin" "Bacon, Kevin (I)")
;=> #{[(17592186887476 17592186434418 17592187362817 17592186339273 17592186838882)] [(17592186887476 17592186434418 17592188400376 17592186529535 17592186838882)] [(17592186887476 17592186434418 17592187854963 17592186529535 17592186838882)] [(17592186887476 17592186434418 17592186926035 17592186302397 17592186838882)]}
This time we get back a collection of paths with entity ids. We can easily transform these ids by mapping them into entities and getting the name or title, using a function like the following:
(defn actor-or-movie-name [db eid]
  (let [ent (d/entity db eid)]
    (or (:movie/title ent) (:actor/name ent))))
So, putting the query together with the above function, we get:
(let [d (db conn)
name (partial actor-or-movie-name d)]
(->> (q '[:find ?path
:in $ % ?start ?end
:where
[?a1 :actor/name ?start]
[?a2 :actor/name ?end]
(acted-with-2 ?a1 ?a2 ?path)]
d acted-with-rules "Bieber, Justin" "Bacon, Kevin (I)")
(map first)
(map (partial mapv name))))
;=> (["Bieber, Justin" "Men in Black 3 (2012)" "Jones, Tommy Lee" "JFK (1991)" "Bacon, Kevin (I)"] ["Bieber, Justin" "Men in Black 3 (2012)" "Howard, Rosemary (II)" "R.I.P.D. (2013)" "Bacon, Kevin (I)"] ["Bieber, Justin" "Men in Black 3 (2012)" "Segal, Tobias" "R.I.P.D. (2013)" "Bacon, Kevin (I)"] ["Bieber, Justin" "Men in Black 3 (2012)" "Brolin, Josh" "Hollow Man (2000)" "Bacon, Kevin (I)"])
The rules above are defined statically, but they are simply clojure data structures: it would be trivial to generate those rules to an arbitrary depth. For an example of doing just that, see the Datomic mbrainz sample.
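As a hedged sketch of that idea (my own illustration here, not the mbrainz sample’s code), the rules above could be generated up to a maximum depth like so:

```clojure
(defn acted-with-rules-to-depth
  "Generates acted-with plus acted-with-1 through acted-with-<max-depth> rules."
  [max-depth]
  (into
   '[[(acted-with ?e1 ?e2 ?path)
      [?e1 :actor/movies ?m]
      [?e2 :actor/movies ?m]
      [(!= ?e1 ?e2)]
      [(vector ?e1 ?m ?e2) ?path]]
     [(acted-with-1 ?e1 ?e2 ?path)
      (acted-with ?e1 ?e2 ?path)]]
   (for [n (range 2 (inc max-depth))]
     ;; acted-with-N chains acted-with-(N-1) with one more acted-with hop
     [(list (symbol (str "acted-with-" n)) '?e1 '?e2 '?path)
      (list (symbol (str "acted-with-" (dec n))) '?e1 '?x '?pp)
      '(acted-with ?x ?e2 ?p2)
      '[(butlast ?pp) ?p1]
      '[(concat ?p1 ?p2) ?path]])))
```

A query could then pass `(acted-with-rules-to-depth 4)` in the `%` position and reference `acted-with-4` in its `:where` clause.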
Having to know the depth at which to traverse the graph is cumbersome. Datomic has a distinct advantage of being able to treat your data as local, even if its permanent storage lives somewhere else. That means that we can bring our own functions to the problem and execute locally, rather than on a database server. We can leverage Datomic’s `datoms` function to search the graph using our own graph-searching algorithm, rather than relying on the query engine.
Our IMDb actor data is essentially a dense unweighted graph. Because of its density, a bidirectional breadth-first search is probably the most efficient algorithm for finding the shortest paths from one point to another. A generic bidirectional BFS returning all shortest paths might look like this.
(defn paths
  "Returns a lazy seq of all non-looping path vectors starting with
  [<start-node>]"
  [nodes-fn path]
  (let [this-node (peek path)]
    (->> (nodes-fn this-node)
         (filter #(not-any? (fn [edge] (= edge [this-node %]))
                            (partition 2 1 path)))
         (mapcat #(paths nodes-fn (conj path %)))
         (cons path))))

(defn trace-paths [m start]
  (remove #(m (peek %)) (paths m [start])))

(defn- find-paths [from-map to-map matches]
  (for [n matches
        from (map reverse (trace-paths from-map n))
        to (map rest (trace-paths to-map n))]
    (vec (concat from to))))

(defn- neighbor-pairs [neighbors q coll]
  (for [node q
        nbr (neighbors node)
        :when (not (contains? coll nbr))]
    [nbr node]))

(defn bidirectional-bfs [start end neighbors]
  (let [find-pairs (partial neighbor-pairs neighbors)
        overlaps (fn [coll q] (seq (filter #(contains? coll %) q)))
        map-set-pairs (fn [map pairs]
                        (persistent! (reduce (fn [map [key val]]
                                               (assoc! map key (conj (get map key #{}) val)))
                                             (transient map) pairs)))]
    (loop [preds {start nil}  ; map of outgoing nodes to where they came from
           succs {end nil}    ; map of incoming nodes to where they came from
           q1 (list start)    ; queue of outgoing things to check
           q2 (list end)]     ; queue of incoming things to check
      (when (and (seq q1) (seq q2))
        (if (<= (count q1) (count q2))
          (let [pairs (find-pairs q1 preds)
                preds (map-set-pairs preds pairs)
                q1 (map first pairs)]
            (if-let [all (overlaps succs q1)]
              (find-paths preds succs (set all))
              (recur preds succs q1 q2)))
          (let [pairs (find-pairs q2 succs)
                succs (map-set-pairs succs pairs)
                q2 (map first pairs)]
            (if-let [all (overlaps preds q2)]
              (find-paths preds succs (set all))
              (recur preds succs q1 q2))))))))
There’s a lot of code here, including some optimizations and helper functions. The important function here is `bidirectional-bfs`. I won’t explain the details of the algorithm, but at a high level, it takes in a start node, an end node, and a function to be called on any node to get its “neighbors”.
This is a generic, pure function, agnostic of Datomic or our data. In fact, I used a simple map as the “graph” while developing this:
(def graph
{:a [:b]
:b [:a :c :d]
:c [:b :e]
:d [:b :c :e]
:e [:c :d :f]
:f []})
(bidirectional-bfs :a :e graph)
;=> [[:a :b :c :e] [:a :b :d :e]]
To use this generic algorithm with our database, we need a `neighbors` function. Depending on whether a node is an “actor” or a “movie”, we need to return its appropriate counterpart. A naive “or” condition is actually good enough here:
(defn movie-actors
"Given a Datomic database value and a movie id,
returns ids for actors in that movie."
[db eid]
(map :e (d/datoms db :vaet eid :actor/movies)))
(defn actor-movies
"Given a Datomic database value and an actor id,
returns ids for movies that actor was in."
[db eid]
(map :v (d/datoms db :eavt eid :actor/movies)))
(defn neighbors
"db is database value
eid is an actor or movie eid"
[db eid]
(or (seq (actor-movies db eid))
(seq (movie-actors db eid))))
Gluing everything together is a simple matter of partial application:
(defn find-id-paths [db source target]
(bidirectional-bfs source target (partial neighbors db)))
Given a source entity id and a target entity id, this will return all shortest paths (ids), much like the query example above. From there, we could map them to Datomic entities, get their names, or sort the paths using a domain-specific heuristic. Plugging in the previous example, we might do something like the following:
(let [d (db conn)
d (d/filter d (without-documentaries d))
biebs (d/entid d [:actor/name "Bieber, Justin"])
bacon (d/entid d [:actor/name "Bacon, Kevin (I)"])
name (partial actor-or-movie-name d)]
(map (partial mapv name) (find-id-paths d biebs bacon)))
;=> (["Bieber, Justin" "Men in Black 3 (2012)" "Jones, Tommy Lee" "JFK (1991)" "Bacon, Kevin (I)"] ["Bieber, Justin" "Men in Black 3 (2012)" "Segal, Tobias" "R.I.P.D. (2013)" "Bacon, Kevin (I)"] ["Bieber, Justin" "Men in Black 3 (2012)" "Brolin, Josh" "Hollow Man (2000)" "Bacon, Kevin (I)"] ["Bieber, Justin" "Men in Black 3 (2012)" "Howard, Rosemary (II)" "R.I.P.D. (2013)" "Bacon, Kevin (I)"])
This returns the same set of paths as the query method did. However, this version has the advantage of going to an arbitrary depth.
This is just one example of graph searching with Datomic. Different kinds of problems and domains could use other algorithms. The idea, though, is that generic graph searching functions can be used directly, since the data is effectively local to the peer machine.
For more Clojure implementations of generic graph searching algorithms, loom’s alg_generic namespace is a great starting point.
I’m using the above ideas and functions on IMDB’s dataset to power the project. Once the peer’s index caches are warmed, the performance is quite good: most searches I’ve performed between well-known actors complete in under a second, and in many cases, under 100 ms. I never got results that good with Neo4j’s cypher query language.
The code in this post is based on the source.
Database constraints are essential to ensuring data integrity, and you should use them. Allowing them to be deferrable during transactions makes them even more convenient. A common scenario in which the database can help us is in a sortable list implementation. This post outlines the how and why of deferring database constraints, using a sortable list domain as an example.
Imagine that you have an application with multiple lists. Each list has items that can be reordered with a drag-and-drop interaction. This can be modelled in a fairly straightforward manner.
Each list `has_many` list items, which are ordered by the `position` column. Each list’s items have a position beginning with 1 and incrementing with each subsequent item.
# app/models/list.rb
class List < ActiveRecord::Base
has_many :items, -> { order :position }, class_name: "ListItem"
validates_presence_of :name
end
# app/models/list_item.rb
class ListItem < ActiveRecord::Base
belongs_to :list
validates_presence_of :name, :list, :position
before_validation :ensure_position
def self.update_positions(ids)
ids.each_with_index do |id, index|
where(id: id).update_all(position: index + 1)
end
end
private
def ensure_position
self.position ||= self.class.where(list_id: list_id).maximum(:position).to_i + 1
end
end
A couple things are worth noting about the `ListItem` class. Firstly, we have `update_positions`, a class method that accepts an array of ids and updates each. This method will be called in a `sort` controller action, as such:
class ItemsController < ApplicationController
expose(:list)
def sort
# list item ids is an ordered array of ids
list.items.update_positions(params[:list_item_ids])
head :ok
end
end
Secondly, new items don’t necessarily know what position they should have, so we put list items that don’t have a `position` at the end of their respective list, just before validating that the position is present.
Here are the migrations that we used to create the models’ database tables:
class CreateLists < ActiveRecord::Migration
def change
create_table :lists do |t|
t.string :name
t.timestamps
end
end
end
class CreateListItems < ActiveRecord::Migration
def change
create_table :list_items do |t|
t.belongs_to :list
t.integer :position
t.string :name
t.timestamps
end
end
end
Notice anything missing? If you said database constraints, you’re correct! Our application is enforcing presence for most attributes, but our corresponding columns are missing `NOT NULL` constraints. Also, the `list_id` column on `list_items` is missing a foreign key constraint.
But I’d like to focus on another missing constraint. Our domain model has an implicit requirement that we haven’t enforced with either validations or database constraints: each list item’s position should be unique per list. No two list items in a list should have the same position. That would make the ordering non-deterministic.
We could add a uniqueness validation for `position`, scoped to the `list_id`. However, as thoughtbot recently warned, application-level uniqueness validations are insufficient at best, and fail completely in concurrent environments. The `position` column needs a database-level constraint.
Adding the uniqueness constraint to `position` is fairly straightforward in PostgreSQL. We’ll just create a new migration with the following:
class AddUniquenessValidationOnListItems < ActiveRecord::Migration
def up
execute <<-SQL
alter table list_items
add constraint list_item_position unique (list_id, position);
SQL
end
def down
execute <<-SQL
alter table list_items
drop constraint if exists list_item_position;
SQL
end
end
Let’s wrap our `UPDATE` statements in a transaction so that any failed `UPDATE` of the position column will result in none of them being updated:
class ListItem < ActiveRecord::Base
# ...
def self.update_positions(ids)
transaction do
ids.each_with_index do |id, index|
where(id: id).update_all(position: index + 1)
end
end
end
end
This ensures at the database level that positions of items are unique per list; no two items in the same list can occupy the “1” position. With regard to data integrity, this is a huge improvement over our initial implementation. But it has one drawback: it doesn’t work.
To illustrate why, imagine a list with the following items:
id | position | name
13 | 1 | Eggs
18 | 2 | Milk
35 | 3 | Bread
To move Bread to the top of the list, we would pass an array of ids, `[35,13,18]`, to the `update_positions` method. This method does a series of `UPDATE` statements to the database. For the first id, the one for Bread, we end up sending an update statement that would look like the following:
UPDATE list_items SET position=1 WHERE id=35;
After this statement is executed in the database, but before we move on to the next id in the list, Postgres will fail its constraint checks. At the moment that the `UPDATE` happens, the data would be:
id | position | name
13 | 1 | Eggs
18 | 2 | Milk
35 | 1 | Bread
With both Eggs and Bread occupying the same position, the `UPDATE` fails. Of course, we know that we want to change the position of Eggs as well, so that its position would be “2”, and that collision would not happen. But at the time that the constraint check happens, the database doesn’t know this.
Even within a transaction, database uniqueness constraints are enforced immediately per row. It seems our dreams of data integrity are smashed. If only there were a way to enforce uniqueness constraints at the end of the transaction, rather than the end of each statement…
As mentioned before, constraints are immediately enforced. This behavior can be changed within a transaction by changing a constraint’s deferrable characteristics. In PostgreSQL, constraints are assumed to be `NOT DEFERRABLE` by default. However, constraints can also behave as deferrable in one of two ways: `DEFERRABLE INITIALLY IMMEDIATE` or `DEFERRABLE INITIALLY DEFERRED`. The first part, `DEFERRABLE`, is what allows the database constraint behavior to change within transactions. The second part describes what the default behavior will be within a transaction.
With a constraint that is deferrable, but initially immediate, the constraint will by default behave just like a non-deferrable constraint, checking every statement immediately. A constraint that is initially deferred will, by default, defer its checks until the transaction is committed. Both of these can change their behavior per-transaction with a call to `SET CONSTRAINTS` (documentation).
With that information, let’s change the definition of the constraint we defined before:
class AddUniquenessValidationOnListItems < ActiveRecord::Migration
def up
execute <<-SQL
alter table list_items
add constraint list_item_position unique (list_id, position)
DEFERRABLE INITIALLY IMMEDIATE;
SQL
end
def down
execute <<-SQL
alter table list_items
drop constraint if exists list_item_position;
SQL
end
end
The only thing we’ve changed from before is the `DEFERRABLE INITIALLY IMMEDIATE` bit. I think it is a good idea to use the `INITIALLY IMMEDIATE` option. This will ensure that other parts of our app, and other consumers of the database, will not be surprised by the behavior of the constraint; it will continue to act like a normal, non-deferrable constraint until we explicitly opt in to the deferral.
We now need to change our transaction block. In our case, the first database statement within the transaction must be the `SET CONSTRAINTS` statement:
class ListItem < ActiveRecord::Base
# ...
def self.update_positions(ids)
transaction do
connection.execute "SET CONSTRAINTS list_item_position DEFERRED"
ids.each_with_index do |id, index|
where(id: id).update_all(position: index + 1)
end
end
end
end
Having now opted in to deferring our uniqueness constraint, reordering the items now works as expected. The constraint still ensures that we don’t have two items that occupy the same position, but waits until the end of the transaction to do that check. We can have our data integrity cake and eat it too.
Having to name the constraint in two places is a bit of a bummer, and introduces a coupling that could bite us if the constraint name ever changed. Knowing that, we leverage PostgreSQL’s introspective abilities to query the constraint names instead.
For example, we can add the following module to our codebase:
# lib/deferrable.rb
module Deferrable
def deferrable_uniqueness_constraints_on(column_name)
usage = Arel::Table.new 'information_schema.constraint_column_usage'
constraint = Arel::Table.new 'pg_constraint'
arel = usage.project(usage[:constraint_name])
.join(constraint).on(usage[:constraint_name].eq(constraint[:conname]))
.where(
(constraint[:contype].eq('u'))
.and(constraint[:condeferrable])
.and(usage[:table_name].eq(table_name))
.and(usage[:column_name].eq(column_name))
)
connection.select_values arel
end
def transaction_with_deferred_constraints_on(column_name)
transaction do
constraints = deferrable_uniqueness_constraints_on(column_name).join ","
connection.execute("SET CONSTRAINTS %s DEFERRED" % constraints)
yield
end
end
end
And now change our model to use it:
class ListItem < ActiveRecord::Base
extend Deferrable
# ...
def self.update_positions(ids)
transaction_with_deferred_constraints_on(:position) do
ids.each_with_index do |id, index|
where(id: id).update_all(position: index + 1)
end
end
end
end
And, boom! Less coupling.
NOTE That’s a lot of Arel! Use at your own risk. ;-)
While writing this post, I created a sample Rails app to iterate quickly. I used TDD to write the initial approach, and reused the specs while I “refactored” the implementation to the subsequent approaches. Each commit on the master branch more or less follows the sections above.
Sometimes, aggregating data can become overly complex in a normal ActiveRecord model. Because Rails works well with SQL views, we can create associations to SQL views that aggregate data for us, simplifying our models and potentially speeding up queries.
I’ve got an inbox. A cat inbox. For real.
There are many possible implementations for modeling an inbox. I’ve gone with a relatively simple approach. Two users participate in a conversation, sending messages back and forth to each other. The Conversation model has a subject, but the body of the initial message is part of the Message object.
# app/models/conversation.rb
class Conversation < ActiveRecord::Base
# fields: to_id, from_id, subject
belongs_to :to, class_name: "User"
belongs_to :from, class_name: "User"
has_many :messages, dependent: :destroy, inverse_of: :conversation
end
# app/models/message.rb
class Message < ActiveRecord::Base
# fields: user_id, conversation_id, body
belongs_to :conversation, inverse_of: :messages
belongs_to :user
end
After the initial message, the two participants on the conversation send messages back and forth. A user may have any number of conversations with other users. As such, the main inbox view must list the conversations a user is a participant on, as well as some summary information about that conversation.
For our purposes, we’ve decided on an HTML table view with the following columns:

- From
- To
- Message
- Last post
- Replies
Although the subject is part of the conversation itself, everything else comes from its various associations. This is the view, which reveals the expected interface each conversation object should have:
%table#inbox
  %thead
    %tr
      %th From
      %th To
      %th Message
      %th Last post
      %th Replies
  %tbody
    - conversations.each do |conversation|
      %tr
        %td= conversation.from_name
        %td= conversation.to_name
        %td
          %p
            %strong= conversation.subject
            = conversation.most_recent_message_body
        %td
          = time_ago_in_words(conversation.most_recent_message_sent_at)
          ago
        %td= conversation.reply_count
Let’s explore a typical way to model this in our model directly.
# app/models/conversation.rb
class Conversation < ActiveRecord::Base
# associations, etc...
def most_recent_message_body
most_recent_message.body if most_recent_message
end
def most_recent_message_sent_at
most_recent_message.created_at if most_recent_message
end
def reply_count
messages.size - 1
end
def to_name
to.name
end
def from_name
from.name
end
private
def most_recent_message
@most_recent_message ||= messages.by_date.first
end
end
# app/models/message.rb
class Message < ActiveRecord::Base
# associations, etc...
def self.by_date
order("created_at DESC")
end
end
This approach is fairly straightforward. We obtain the `most_recent_message_body` and `most_recent_message_sent_at` from the most recent message, which is trivial after we’ve ordered the messages association by date. The `to_name` and `from_name` methods are delegated to their respective associations. And `reply_count` is simply the total number of messages, minus one (the initial message doesn’t count as a “reply”).
This approach offers a number of advantages. For one, it is familiar. I believe most Rails developers would be able to understand exactly what’s going on above. It also locates all of the domain logic within the `Conversation` model, making it easy to find.
Having everything in the `Conversation` model is actually a blessing and a curse. Although everything is easy to find, the model is also quickly becoming bloated. It may not seem like much right now, but as more information is added to the inbox, it will become unruly.
The other problem with the above is the multitude of N+1 queries that it has introduced. With only 3 conversations in play, loading the inbox outputs a log like this:
Started GET "/" for 127.0.0.1 at 2013-02-11 09:49:02 -0600
Connecting to database specified by database.yml
Processing by InboxesController#show as HTML
User Load (12.8ms) SELECT "users".* FROM "users" LIMIT 1
Conversation Load (0.6ms) SELECT "conversations".* FROM "conversations" WHERE (1 IN (from_id, to_id))
User Load (18.3ms) SELECT "users".* FROM "users" WHERE "users"."id" = 2 LIMIT 1
User Load (0.5ms) SELECT "users".* FROM "users" WHERE "users"."id" = 1 LIMIT 1
Message Load (12.4ms) SELECT "messages".* FROM "messages" WHERE "messages"."conversation_id" = 7 ORDER BY created_at DESC LIMIT 1
(0.6ms) SELECT COUNT(*) FROM "messages" WHERE "messages"."conversation_id" = 7
User Load (0.4ms) SELECT "users".* FROM "users" WHERE "users"."id" = 3 LIMIT 1
CACHE (0.0ms) SELECT "users".* FROM "users" WHERE "users"."id" = 1 LIMIT 1
Message Load (0.4ms) SELECT "messages".* FROM "messages" WHERE "messages"."conversation_id" = 8 ORDER BY created_at DESC LIMIT 1
(0.4ms) SELECT COUNT(*) FROM "messages" WHERE "messages"."conversation_id" = 8
CACHE (0.0ms) SELECT "users".* FROM "users" WHERE "users"."id" = 1 LIMIT 1
CACHE (0.0ms) SELECT "users".* FROM "users" WHERE "users"."id" = 2 LIMIT 1
Message Load (0.5ms) SELECT "messages".* FROM "messages" WHERE "messages"."conversation_id" = 9 ORDER BY created_at DESC LIMIT 1
(0.4ms) SELECT COUNT(*) FROM "messages" WHERE "messages"."conversation_id" = 9
Rendered inboxes/show.html.haml within layouts/application (683.9ms)
Completed 200 OK in 691ms (Views: 272.5ms | ActiveRecord: 418.1ms)
We can definitely cut down on the N+1 query problem by introducing eager loading. In our controller, the `conversations` exposure is currently defined thusly:
# app/controllers/inboxes_controller.rb
class InboxesController < ApplicationController
expose(:user) { User.first }
expose(:conversations) { user.conversations }
end
Let’s change that to eagerly load its associations:
expose(:conversations) { user.conversations.includes(:messages, :to, :from) }
With eager-loading in place, the log now looks slightly more reasonable:
Started GET "/" for 127.0.0.1 at 2013-02-11 09:55:24 -0600
Processing by InboxesController#show as HTML
User Load (0.3ms) SELECT "users".* FROM "users" LIMIT 1
Conversation Load (0.3ms) SELECT "conversations".* FROM "conversations" WHERE (1 IN (from_id, to_id))
Message Load (0.3ms) SELECT "messages".* FROM "messages" WHERE "messages"."conversation_id" IN (7, 8, 9)
User Load (0.2ms) SELECT "users".* FROM "users" WHERE "users"."id" IN (1, 2)
User Load (0.2ms) SELECT "users".* FROM "users" WHERE "users"."id" IN (2, 3, 1)
Message Load (0.4ms) SELECT "messages".* FROM "messages" WHERE "messages"."conversation_id" = 7 ORDER BY created_at DESC LIMIT 1
Message Load (0.4ms) SELECT "messages".* FROM "messages" WHERE "messages"."conversation_id" = 8 ORDER BY created_at DESC LIMIT 1
Message Load (0.3ms) SELECT "messages".* FROM "messages" WHERE "messages"."conversation_id" = 9 ORDER BY created_at DESC LIMIT 1
Rendered inboxes/show.html.haml within layouts/application (9.5ms)
Completed 200 OK in 13ms (Views: 10.0ms | ActiveRecord: 2.4ms)
There are more optimizations we could make here in Ruby land. But data transformation and aggregation is something that databases are good at. We can use a native feature of SQL to aggregate information for us: views.
A SQL view is essentially a virtual table. It can be queried just like a normal table, but does not physically store anything itself. Instead, a view has a query definition that it uses to represent its data.
In our case, SQL views can allow us to treat a complex SQL query as a table, abstracting the complexity away into the view itself. SQL views are also read-only, and are therefore usually used for querying, not for updating data directly.
ActiveRecord plays nicely with SQL views out of the box. It considers a SQL view a normal table, and all associations and querying methods work like they would with a normal table, with one exception: the records are read-only.
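For example, reads against a view-backed model work like any other model, but writes fail at the database level because a view built over a join has no single underlying table to update. A quick sketch of what that looks like in the console, assuming the Summary model we define shortly and the seed data shown below:

summary = Conversation::Summary.find(7)   # querying works as usual
summary.from_name                         # => "Felionel Richie"
summary.update_attribute(:from_name, "X")
# => raises ActiveRecord::StatementInvalid; the view cannot be written to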
Let’s create a view to handle the to_name
and from_name
methods on
conversation. We can do this in a normal migration, but it needs to be
created with raw SQL:
class CreateConversationSummaries < ActiveRecord::Migration
def up
execute <<-SQL
CREATE VIEW conversation_summaries AS
SELECT ...
SQL
end
def down
execute 'DROP VIEW conversation_summaries'
end
end
This is the basic syntax for adding a view with ActiveRecord migrations.
Our view needs to incorporate to_name
and from_name
, so let’s add
those fields:
CREATE VIEW conversation_summaries AS
SELECT c.id,
f.name as from_name,
t.name as to_name
FROM conversations c
inner join users t on t.id = c.to_id
inner join users f on f.id = c.from_id
After we migrate our database, we can use our database console to verify that we see what we expect:
mailbox_development=# select * from conversation_summaries;
id | from_name | to_name
----+-----------------+-----------------
7 | Felionel Richie | Cat Stevens
8 | Nelly Purrtado | Cat Stevens
9 | Cat Stevens | Felionel Richie
(3 rows)
Cool. The id corresponds to the conversation, and the to_name and from_name columns come from the users table, but it’s all displayed to us as one table.
Now that our view exists, we can integrate it into our application:
class Conversation < ActiveRecord::Base
class Summary < ActiveRecord::Base
self.table_name = "conversation_summaries"
self.primary_key = "id"
belongs_to :conversation, foreign_key: "id"
end
has_one :summary, foreign_key: "id"
end
Let’s break down what’s going on here.
I’ve chosen to nest the Summary model within the Conversation namespace, mostly to call out the fact that we’re doing something non-standard. Also, the Summary class only makes sense in the context of a Conversation. For that reason, we need to manually set the name of the table.
We must also choose a primary key, because Rails cannot infer it for SQL
views. The association itself should be familiar. It works like a normal
has_one
/belongs_to
relationship, except that we override the foreign
key.
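As a quick sanity check, the new association can be exercised from the console (a hypothetical session against the seed data shown earlier):

conversation = Conversation.find(7)
conversation.summary.class      # => Conversation::Summary
conversation.summary.from_name  # => "Felionel Richie"
conversation.summary.to_name    # => "Cat Stevens"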
Now that the relationships are set up, let’s actually take advantage
of our new view by changing the implementation of the to_name
and
from_name
methods.
class Conversation < ActiveRecord::Base
# ...
def to_name
# Used to be to.name
summary.to_name
end
def from_name
# Used to be from.name
summary.from_name
end
end
One of the biggest benefits of this approach is that we can eager-load a view association. We no longer need the to or from associations eager-loaded, since we are no longer using any of their attributes in the view. Let’s update our controller’s exposure to only eager-load the necessary parts:
expose(:conversations) { user.conversations.includes(:summary, :messages) }
And when we visit the inbox again, the log looks like this:
Started GET "/" for 127.0.0.1 at 2013-02-11 14:26:12 -0600
Processing by InboxesController#show as HTML
User Load (0.5ms) SELECT "users".* FROM "users" LIMIT 1
Conversation Load (0.4ms) SELECT "conversations".* FROM "conversations" WHERE (1 IN (from_id, to_id))
Conversation::Summary Load (0.6ms) SELECT "conversation_summaries".* FROM "conversation_summaries" WHERE "conversation_summaries"."id" IN (7, 8, 9)
Message Load (0.3ms) SELECT "messages".* FROM "messages" WHERE "messages"."conversation_id" IN (7, 8, 9)
Message Load (0.4ms) SELECT "messages".* FROM "messages" WHERE "messages"."conversation_id" = 7 ORDER BY created_at DESC LIMIT 1
Message Load (0.4ms) SELECT "messages".* FROM "messages" WHERE "messages"."conversation_id" = 8 ORDER BY created_at DESC LIMIT 1
Message Load (0.4ms) SELECT "messages".* FROM "messages" WHERE "messages"."conversation_id" = 9 ORDER BY created_at DESC LIMIT 1
Rendered inboxes/show.html.haml within layouts/application (10.4ms)
Completed 200 OK in 13ms (Views: 9.5ms | ActiveRecord: 2.9ms)
That’s definitely an improvement, albeit a small one. We’ve pushed data from the user model into our SQL view, but we don’t need to stop there.
Let’s update our view migration to include more aggregated information about each conversation.
class CreateConversationSummaries < ActiveRecord::Migration
def up
execute <<-SQL
CREATE VIEW conversation_summaries AS
SELECT c.id,
f.name as from_name,
t.name as to_name,
m.body as most_recent_message_body,
m.created_at as most_recent_message_sent_at,
(select count(*) from messages m2 where m2.conversation_id = c.id) - 1 as reply_count
FROM conversations c
inner join users t on t.id = c.to_id
inner join users f on f.id = c.from_id
left outer join (
select distinct on(conversation_id) conversation_id, body, created_at
from messages m1
order by conversation_id, created_at desc
) m ON m.conversation_id = c.id
SQL
end
def down
execute 'DROP VIEW conversation_summaries'
end
end
After running rake db:migrate:redo, we can verify that everything is still working as expected in the database console:
mailbox_development=# select * from conversation_summaries;
id | from_name | to_name | most_recent_message_body | most_recent_message_sent_at | reply_count
----+-----------------+-----------------+----------------------------------------+-----------------------------+-------------
7 | Felionel Richie | Cat Stevens | Say you. Say meow. | 2013-02-08 02:45:27.07712 | 2
8 | Nelly Purrtado | Cat Stevens | Except that I'm a cat | 2013-02-05 16:45:27.088292 | 0
9 | Cat Stevens | Felionel Richie | I'm sorry that you're feeling that way | 2013-01-30 16:45:27.092443 | 1
(3 rows)
That’s a lot of SQL! But actually, all I’ve added is a join to a subquery and a subselect. Let’s review both of these changes.
There are many ways to grab the most recent message for a conversation in SQL, including using window functions. The method I’ve opted for here is a subquery in the table expression. The subquery alone would return rows for only the most recent messages for each conversation:
conversation_id | body | created_at
----------------+----------------------------------------+----------------------------
7 | Say you. Say meow. | 2013-02-08 02:45:27.07712
8 | Except that I'm a cat | 2013-02-05 16:45:27.088292
9 | I'm sorry that you're feeling that way | 2013-01-30 16:45:27.092443
By joining with only the most recent message per conversation, we avoid
duplicate rows and only get the body
and created_at
columns from
the most recent message. Then, joining against this subquery, we can
add the body
and created_at
to the list of projections, naming
them most_recent_message_body
and most_recent_message_sent_at
,
respectively.
The other thing we’ve added to the view this iteration is the
reply_count
column, which is a subselect to get the count. We also
subtract 1, just as before.
Let’s take a look at our Conversation model now:
# before
class Conversation < ActiveRecord::Base
belongs_to :to, class_name: "User"
belongs_to :from, class_name: "User"
has_many :messages, dependent: :destroy, inverse_of: :conversation
def most_recent_message_body
most_recent_message.body if most_recent_message
end
def most_recent_message_sent_at
most_recent_message.created_at if most_recent_message
end
def reply_count
[messages.size - 1, 0].max
end
def to_name
to.name
end
def from_name
from.name
end
private
def most_recent_message
@most_recent_message ||= messages.by_date.first
end
end
# after
class Conversation < ActiveRecord::Base
class Summary < ActiveRecord::Base
self.table_name = "conversation_summaries"
self.primary_key = "id"
belongs_to :conversation, foreign_key: "id"
end
belongs_to :to, class_name: "User"
belongs_to :from, class_name: "User"
has_many :messages, dependent: :destroy, inverse_of: :conversation
has_one :summary, foreign_key: "id"
delegate :most_recent_message_sent_at, :most_recent_message_body,
:reply_count, :to_name, :from_name, to: :summary
end
With much of our data transformation and aggregation in our SQL view, our model has become trivially simple. It literally only contains associations and delegation now. We update our exposure to only eager-load the conversation summary:
expose(:conversations) { user.conversations.includes(:summary) }
Now, reloading the page yields the following log output:
Started GET "/" for 127.0.0.1 at 2013-02-11 15:37:49 -0600
Processing by InboxesController#show as HTML
User Load (1.0ms) SELECT "users".* FROM "users" LIMIT 1
Conversation Load (0.2ms) SELECT "conversations".* FROM "conversations" WHERE (1 IN (from_id, to_id))
Conversation::Summary Load (0.8ms) SELECT "conversation_summaries".* FROM "conversation_summaries" WHERE "conversation_summaries"."id" IN (7, 8, 9)
Rendered inboxes/show.html.haml within layouts/application (5.5ms)
Completed 200 OK in 8ms (Views: 6.0ms | ActiveRecord: 2.0ms)
Now we see some real improvement. All N+1 queries are gone, replaced instead with the eager-loading of the Conversation::Summary model.
I used this technique in a real-world application. It helped abstract some of the mundane details of the inbox and allowed us to think about each conversation at a higher level with a summary.
In fact, the app included even more business rules than I’ve shown here. Each conversation had to include a read/unread status that updated with each sent message. Although it was easily implemented in pure Ruby, it cluttered the model and created yet more N+1 queries in the app’s view.
The inbox also had to be sorted by the most recent message date, so that the conversation with the most recent activity would appear first in the list. This kind of sorting without SQL is both cumbersome and inefficient in Ruby; you have to load all messages for each conversation. With the SQL view, it was as simple as changing the scope from user.conversations.includes(:summary) to user.conversations.includes(:summary).order("conversation_summaries.most_recent_message_sent_at DESC").
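If that ordering is needed in more than one place, it can be wrapped up in a scope; a sketch, with a made-up scope name:

class Conversation < ActiveRecord::Base
  # ...
  # Hypothetical convenience scope: most recently active conversations first,
  # ordered by the aggregated column from the conversation_summaries view
  scope :by_recent_activity, lambda {
    includes(:summary).order("conversation_summaries.most_recent_message_sent_at DESC")
  }
end

The inbox exposure could then read user.conversations.by_recent_activity.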
Any time that we push stuff into the database, we make a tradeoff. In this case, when we move data transformation into the SQL view, we sacrifice the co-location of the conversation model and the definition of its summary. With the summary definition located in the database, there’s one extra layer of indirection.
The other tradeoff is that any time we’d like to make a non-trivial change to the view, we actually have to create an entirely new view, replacing the old one. If, for example, we knew that our inbox was likely to change or add fields, the SQL view approach might be too brittle.
On the other hand, we effectively removed N+1 queries from our application and simplified our model considerably. By abstracting the conversation’s summary into a model backed by a SQL view, we’re able to think of the Summary as an object in its own right. This provides a cognitive simplification, but also yields performance gains as the dataset grows.
It may not be right for every situation, but knowing and understanding how we can use SQL views in our Rails applications adds another tool to our toolbelt.
As before, while writing this post, I created a sample Rails app to iterate quickly. I used TDD to write the pure-ruby approach, and reused the specs while I “refactored” the implementation to the subsequent approaches. Of particular note is the [history of the Conversation model](https://github.com/jgdavey/tree-sql-example/commits/master/app/models/category.rb), which mirrors the code above.
tl;dr When you have an ActiveRecord tree structure, using the WITH syntax for recursive SQL can provide large performance boons, especially when a tree gets several levels deep.
In a previous post, I outlined a Cat Picture store application. As our store grows, more and more categories have to be created, and we end up with a tree of categories. How can we create a homepage that includes all cat pictures for a given category and all of its subcategories?
Pictorially, the category tree might look like this:
Cat Pictures
|-- Funny
| |-- LOLCats
| `-- Animated
`-- Classic
`-- Renaissance
On the landing page for the Cat Pictures category, we want to display all cat pictures for any category below Cat Pictures. Navigating to the Funny category would display all of its pictures, as well as the pictures for LOLCats and Animated. This is the kind of interaction seen on Amazon, for example. The store’s categories become like an ad-hoc filtering system.
Here’s what the Category class looks like:
class Category < ActiveRecord::Base
attr_accessible :name, :parent
has_many :cat_pictures
belongs_to :parent, :class_name => "Category"
has_many :children, :class_name => "Category", :foreign_key => 'parent_id'
scope :top_level, where(:parent_id => nil)
def descendents
# implement me!
end
end
Each category has a parent_id
column that points at its parent
category. In database speak, modeling a tree like this is known as
an Adjacency List; each node of the tree can only see the children
immediately adjacent to it. For this reason, crawling an Adjacency List
requires recursion. This is actually the database setup common for use
with the acts_as_tree
plugin. Let’s see how we can implement the
descendents
method to get all descendent categories.
As you’ve probably already guessed, we need to recursively collect children for each of our category’s children.
class Category < ActiveRecord::Base
# ...
def descendents
children.map do |child|
[child] + child.descendents
end.flatten
end
end
This does the job quite nicely. However, our requirements above state that we want all cat pictures for each descendent category and for the category itself. Right now, we’ve omitted the root category, self. Let’s add a new method to include it in the equation:
class Category < ActiveRecord::Base
# ...
def descendents
children.map do |child|
[child] + child.descendents
end.flatten
end
def self_and_descendents
[self] + descendents
end
end
Good deal. Now gathering all cat pictures is just a matter of collecting them for each category:
class Category < ActiveRecord::Base
# ...
def descendent_pictures
self_and_descendents.map(&:cat_pictures).flatten
end
end
For a tree like we have above, this is probably good enough. Our tree is only 3 levels deep. We’ve introduced plenty of N+1 queries, but given our small dataset, that shouldn’t be a huge concern.
That said, as our store grows, and the tree gets deeper and more
detailed, this kind of implementation could become a bottleneck. Also,
because we’re doing Array operations on the children
collection,
we lose the ability to take advantage of ActiveRelation outside of the
descendents
method itself. Among other things, this means that we
can’t eager-load cat pictures unless we always eager-load them within
the descendents
method.
Surely we can do better.
Since we’re using PostgreSQL, we can take advantage of its special features. In this case, we can use a WITH query. From the PostgreSQL documentation:
WITH provides a way to write auxiliary statements for use in a larger query. These statements, which are often referred to as Common Table Expressions or CTEs, can be thought of as defining temporary tables that exist just for one query.
On its own, this might not seem like a big deal, but when combined with the optional RECURSIVE modifier, WITH queries can become quite powerful:
The optional RECURSIVE modifier changes WITH from a mere syntactic convenience into a feature that accomplishes things not otherwise possible in standard SQL. Using RECURSIVE, a WITH query can refer to its own output. A very simple example is this query to sum the integers from 1 through 100:
WITH RECURSIVE t(n) AS (
    VALUES (1)
  UNION ALL
    SELECT n+1 FROM t WHERE n < 100
)
SELECT sum(n) FROM t;
The general form of a recursive WITH query is always a non-recursive term, then UNION (or UNION ALL), then a recursive term, where only the recursive term can contain a reference to the query’s own output.
In other words, the expression contained in the AS statement has two parts. The first part is executed just once. The second part, after the UNION ALL, is executed until it returns an empty result set.
Taking advantage of WITH RECURSIVE, we can reduce our tree crawling technique from n queries to just 1! Let’s see how we can use this to crawl our category tree.
As a reminder, here’s what our categories table looks like:
# SELECT id, name, parent_id FROM categories;
id | name | parent_id
----+--------------+-----------
1 | Cat Pictures |
2 | Funny | 1
3 | LOLCats | 2
4 | Animated | 2
5 | Classic | 1
6 | Renaissance | 5
And this is the query:
WITH RECURSIVE category_tree(id, name, path) AS (
SELECT id, name, ARRAY[id]
FROM categories
WHERE parent_id IS NULL
UNION ALL
SELECT categories.id, categories.name, path || categories.id
FROM category_tree
JOIN categories ON categories.parent_id=category_tree.id
WHERE NOT categories.id = ANY(path)
)
SELECT * FROM category_tree ORDER BY path;
Running the query above returns the following:
id | name | path
----+--------------+---------
1 | Cat Pictures | {1}
2 | Funny | {1,2}
3 | LOLCats | {1,2,3}
4 | Animated | {1,2,4}
5 | Classic | {1,5}
6 | Renaissance | {1,5,6}
Whoa! That’s a lot of SQL. Let’s break it down a bit.
First, we declare our “temporary table” using the WITH syntax. We’re
going to call it category_tree
. This “table” has 3 “columns”: id
,
name
, and path
. The id
and name
columns are fairly obvious; they
refer to corresponding columns on the categories table. The path
is an
array of ids that each row will have. More on this in a bit.
The non-recursive term is next:
SELECT id, name, ARRAY[id]
FROM categories
WHERE parent_id IS NULL
It grabs the id
and name
for each top-level category, that is, each
category that has no parent. It also initializes an array containing just
its id
. On its own, this isn’t very interesting, but this array will
become helpful during the recursive step of the query.
The recursive term is the juiciest bit of the query:
SELECT categories.id, categories.name, path || categories.id
FROM category_tree
JOIN categories ON categories.parent_id=category_tree.id
WHERE NOT categories.id = ANY(path)
Notice that we’re selecting from category_tree
. By doing this, we’re
able to use each result set in the subsequent iteration. The first time
we recurse, the result set will be what we selected in the non-recursive
term above.
Given that we have a root result set, we join with categories
to find
its children. From our new result set, we select id
and name
, as
before. But this time, we concatenate the child id onto the path
array
using SQL’s ||
operator. Having this materialized path allows us to
guard against infinite loops; the WHERE clause makes sure that the row
we’re selecting has not appeared in the path before.
This infinite loop check is important. If two categories pointed at each other as parents, the query would never return. Including this check prevents such a mistake from killing our server.
Finally, a WITH query is only useful if you select from it outside of its declaration, so we’ll do just that:
SELECT * FROM category_tree ORDER BY path;
In addition to the infinite loop guard, the path column answers the question “How did I get here?” Like a directory structure on a file system, the path demonstrates the ids necessary to get from grandparent to parent to child, etc.
You may have noticed that we’re also ordering by the path column. We do this because the default sort from a recursive query is nondeterministic. Normal array sorting works well for us here, and groups the categories just like we’d expect, with parents listed before their children. With the query worked out, let’s fold it into our Category model:
class Category < ActiveRecord::Base
# ...
def descendents
self_and_descendents - [self]
end
def self_and_descendents
self.class.tree_for(self)
end
def descendent_pictures
subtree = self.class.tree_sql_for(self)
CatPicture.where("category_id IN (#{subtree})")
end
def self.tree_for(instance)
where("#{table_name}.id IN (#{tree_sql_for(instance)})").order("#{table_name}.id")
end
def self.tree_sql_for(instance)
tree_sql = <<-SQL
WITH RECURSIVE search_tree(id, path) AS (
SELECT id, ARRAY[id]
FROM #{table_name}
WHERE id = #{instance.id}
UNION ALL
SELECT #{table_name}.id, path || #{table_name}.id
FROM search_tree
JOIN #{table_name} ON #{table_name}.parent_id = search_tree.id
WHERE NOT #{table_name}.id = ANY(path)
)
SELECT id FROM search_tree ORDER BY path
SQL
end
end
You should notice right away where our recursive query is. The
tree_sql_for
class method returns a SQL string that can be used with
other queries. Compared to the WITH query we looked at before, there are a
few differences worth mentioning.
First, and probably most importantly for our original problem, we’ve changed our starting place. The non-recursive term is our “start here” result set. Rather than starting with all top-level categories, we’re using the id of whichever instance is passed in to scope our tree.
Another change we’ve made is to remove the name
column from the query.
It isn’t necessary for what we’re doing, though it made the earlier example
easier to follow. We’re also interpolating the table name. This makes the
method much more reusable. In fact, we could extract the method to a
RecursiveTree
module to tidy up our class.
One big advantage of the SQL approach here is that we can create scopes
to further filter our results within just one database round-trip.
For example, the tree_for
class method is really just a named scope
that takes a category instance as a parameter.
Likewise, the descendent_pictures
method returns a CatPicture
relation that includes all pictures from this category and all
subcategories. In other words, what used to take 2 database round trips
for each category in the tree (one to grab children, one to get its
pictures) will now only take 1 for the entire set.
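Used from the console, the whole thing stays lazy and composable. A hypothetical session against the seed categories above:

funny = Category.find_by_name("Funny")

funny.self_and_descendents.map(&:name)
# => ["Funny", "LOLCats", "Animated"]

# descendent_pictures is still a relation, so it composes with other scopes
funny.descendent_pictures.order(:price).limit(10)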
Taking advantage of PostgreSQL’s advanced features can provide large performance boons, especially when a tree gets several levels deep.
Although using database recursion is an efficient way of improving performance with our existing schema, other methods of handling tree structures in SQL exist. The SQL Antipatterns book has a great breakdown of other tree solutions that would require schema changes.
As before, while writing this post, I created a sample Rails app to iterate quickly. I used TDD to write the pure-ruby approach, and reused the specs while I “refactored” the implementation to the subsequent approaches. Of particular note is the [history of the Category model](https://github.com/jgdavey/tree-sql-example/commits/master/app/models/category.rb), which mirrors the code above.
Suppose you have a storefront application that sells pictures of cats. These cat pictures are categorized in meaningful ways. For example, there are LOLcats pictures and “Classic” cat pictures. Now, on the landing page of the store, you’d like to feature one picture from each category. It can’t be a random picture from each. You need to feature the cheapest picture from each category, displaying its name and price.
Also, it turns out that some “low” prices are very common. For example, $9.99 is a common sale price for LOLcats pictures. However, we should only ever feature one picture per category. When there are multiple pictures with the same low price, we fallback to the name, and show the first one alphabetically. How can we solve this problem, while also remaining performant?
As an aside, adding a cat to a Renaissance painting amplifies its appeal ninefold.
Let’s look at some of the ways that we can approach this problem, displaying a list of cat pictures that are the cheapest for their respective category.
Implementing the solution in Ruby is fairly straightforward.
ActiveSupport Enumerable provides the group_by
and sort_by
methods on
collections, and we can use those to help us cut down on some typing.
class CatPicture < ActiveRecord::Base
attr_accessible :category_id, :description, :name, :price
belongs_to :category
def self.cheapest_per_category
all.group_by(&:category_id).map do |category_id, subset|
subset.sort_by { |pic| [pic.price, pic.name] }.first
end
end
end
First, we group all of the cat pictures by their category. Then, for each set of pictures, we sort them by their price and name, and take only the first one.
Perhaps you are wondering if inverting the responsibility would improve
the implementation, putting the mapping and reduction impetus in the
Category model instead. Although it would be possible to go through
the Category model to find its cheapest picture, that would lead to an
“n+1”, as each category would subsequently need to fetch its cat pictures.
Alternatively, eager-loading all categories with their cat pictures
would be expensive, and would essentially duplicate what we’ve done
above with the group_by
.
Either way, as you can probably imagine, the above method would become more expensive as the data set continued to grow. Additionally, we lose the ability to continue to chain ActiveRecord scopes to filter the set further: as soon as we fetch the collection from the database, all filtering has to be done in Ruby.
Pros:
Cons:
We can improve performance by doing the filtering at the database level, rather than loading all cat pictures into memory each time.
class CatPicture < ActiveRecord::Base
attr_accessible :category_id, :description, :name, :price
belongs_to :category
def self.cheapest_per_category
find_by_sql <<-SQL
SELECT DISTINCT ON(category_id) cat_pictures.*
FROM cat_pictures
WHERE ((category_id, price) IN (
SELECT category_id, min(price)
FROM cat_pictures
GROUP BY category_id
))
ORDER BY category_id ASC, cat_pictures.name ASC
SQL
end
end
Here, we use a subselect to filter the initial set down to only those
that have the cheapest price per category. In this inner query, each row
will contain a category_id
and its lowest price
. In the outer query,
we choose all cat pictures whose price
and category_id
match a row
from this inner query, using the IN
syntax.
We would be done here, except that there still exists the possibility
that more than one picture has that low price for a given
category. So, depending on the database vendor, we can find
“distinct” rows according to the columns of interest. In Postgresql,
the syntax for this is DISTINCT ON([column,...])
, which will omit
duplicates of the listed columns. For our purposes, we don’t want more
than one per category, so we distinct on category_id
.
It is worth noting that without an ORDER BY
clause, DISTINCT ON
is
nondeterministic: we are not guaranteed to get the same result each
time. Thus, we order by category_id
and name
, so that only the first
cat picture alphabetically will show up.
We can improve the implementation above by making it a true chainable
scope. Whereas find_by_sql
returns an array of objects, we can
refactor this to return an ActiveRelation instead.
class CatPicture < ActiveRecord::Base
attr_accessible :category_id, :description, :name, :price
belongs_to :category
def self.cheapest_per_category
where("(category_id, price) IN (#{category_id_and_lowest_price_sql})").select("DISTINCT ON(category_id) #{table_name}.*").order("category_id ASC, #{table_name}.name ASC")
end
private
def self.category_id_and_lowest_price_sql
scoped.select("category_id, min(price)").group(:category_id).to_sql
end
end
Functionally, this generates the exact same query as before, but allows
further chaining. Using ActiveRelation’s to_sql
method, we’re able
to build up our inner query without actually executing it. We then
interpolate that query into what was the outer query, which we’ve
reduced to calls to where
, select
and order
.
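Because the method now returns a relation rather than an array, further conditions can be chained onto it; a sketch with an arbitrary extra filter:

# Still a single round trip: the min-price subselect is interpolated into the
# WHERE clause, and the extra condition is simply ANDed on
CatPicture.cheapest_per_category.where("price < 15").to_a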
Pros:
Cons:
- DISTINCT ON: only some RDBMS’ have such functionality

But there is still another option. The SQL standard defines a concept called window functions, which act a lot like aggregates, but don’t change the result set. From the Postgresql documentation’s excellent introduction to window functions:
- only some RDBMS’ have such functionalityBut there is still another option. The SQL standard defines a concept called window functions, which act a lot like aggregates, but don’t change the result set. From the Postgresql documentation’s excellent introduction to window functions:
A window function performs a calculation across a set of table rows that are somehow related to the current row. This is comparable to the type of calculation that can be done with an aggregate function. But unlike regular aggregate functions, use of a window function does not cause rows to become grouped into a single output row - the rows retain their separate identities.
Let’s see how this would work with our dataset. First of all, let’s assume the following cat pictures:
# SELECT id, name, category_id, price FROM cat_pictures ORDER BY category_id, price;
id | name | category_id | price
----+----------------------+-------------+-------
7 | Triple LOL | 1 | 9.99
5 | Hugs not Drugs | 1 | 9.99
2 | Puss in Boots | 1 | 14.99
3 | Cats Gone By | 1 | 19.99
6 | Cats in it for me | 1 | 22.99
4 | Turkleton's Folly | 2 | 11.99
1 | Meowna Lisa | 2 | 19.99
8 | Lady Caterly's Lover | 2 | 22.99
Given this data, our goal is to select “Hugs not Drugs” and “Turkleton’s Folly”, which are the cheapest pictures from their categories.
Whereas a normal aggregate function with GROUP BY
would collapse the
results, a window function retains the original row. Let’s consider how
this would affect the inner query from the subselect approach above:
# SELECT category_id, min(price) FROM cat_pictures GROUP BY category_id;
category_id | min
-------------+-------
1 | 9.99
2 | 11.99
# SELECT category_id, min(price) OVER (PARTITION BY category_id) FROM cat_pictures;
category_id | min
-------------+-------
1 | 9.99
1 | 9.99
1 | 9.99
1 | 9.99
1 | 9.99
2 | 11.99
2 | 11.99
2 | 11.99
Above, we’ve replaced the GROUP BY
clause with an OVER
clause. We
have the original rows with an additional column for this aggregate
data. This is useful in its own right, but the real power of window
functions comes from this concept of window framing. The use of
PARTITION BY
creates a frame for each group. In our case, we have
two frames, one for each category_id
. Then, all aggregate and window
functions before the OVER
clause operate against this frame. Each
window frame effectively has its own result set, according to the
defined partition.
When a window frame is ordered, using an ORDER BY
clause, even more
options are possible. For example, consider the following:
# SELECT id, name, category_id, price, rank() OVER (PARTITION BY category_id ORDER BY price) FROM cat_pictures;
id | name | category_id | price | rank
----+----------------------+-------------+-------+------
7 | Triple LOL | 1 | 9.99 | 1
5 | Hugs not Drugs | 1 | 9.99 | 1
2 | Puss in Boots | 1 | 14.99 | 3
3 | Cats Gone By | 1 | 19.99 | 4
6 | Cats in it for me | 1 | 22.99 | 5
4 | Turkleton's Folly | 2 | 11.99 | 1
1 | Meowna Lisa | 2 | 19.99 | 2
8 | Lady Caterly's Lover | 2 | 22.99 | 3
Look familiar? This is essentially the original table, except we’ve added a
new column: its price rank within a window partitioned by category_id
.
It’s a mouthful to describe, but we’re very close to our original goal
of finding the cheapest cat picture per category. All we need to do now
is select rows that have a rank of 1.
Not so fast. Can you spot the issue with the above? The rank()
window
function assigns the same rank to ties, but we need the first one
alphabetically in the case of “ties”. We can remedy that by using a
different window function, row_number()
, which guarantees different
numbers.
# SELECT id, name, category_id, price, row_number() OVER (PARTITION BY category_id ORDER BY price, name) FROM cat_pictures;
id | name | category_id | price | row_number
----+----------------------+-------------+-------+------------
5 | Hugs not Drugs | 1 | 9.99 | 1
7 | Triple LOL | 1 | 9.99 | 2
2 | Puss in Boots | 1 | 14.99 | 3
3 | Cats Gone By | 1 | 19.99 | 4
6 | Cats in it for me | 1 | 22.99 | 5
4 | Turkleton's Folly | 2 | 11.99 | 1
1 | Meowna Lisa | 2 | 19.99 | 2
8 | Lady Caterly's Lover | 2 | 22.99 | 3
Perfect! Looking at the rows with “1” as their “row_number”, we see
what we expect, “Hugs not Drugs” and “Turkleton’s Folly”, which are the
cheapest pictures from their categories. We can use an IN
clause for
filtering, similar to the previous approach:
SELECT id, category_id, name, price
FROM cat_pictures
WHERE (id, 1) IN (
SELECT id, row_number() OVER (PARTITION BY category_id ORDER BY price, name)
FROM cat_pictures
);
id | category_id | name | price
----+-------------+----------------------+-------
5 | 1 | Hugs not Drugs | 9.99
4 | 2 | Turkleton's Folly | 11.99
The WHERE clause above keeps only the records whose id appears in the subquery alongside a row number of 1. Now that we have the SQL down, let’s convert our Ruby model to take advantage of this window function technique:
class CatPicture < ActiveRecord::Base
attr_accessible :category_id, :description, :name, :price
belongs_to :category
def self.cheapest_per_category
where("(#{table_name}.id, 1) IN (#{price_rank_sql})")
end
private
def self.price_rank_sql
scoped.select("id, row_number() OVER (PARTITION BY category_id ORDER BY price ASC, name ASC)").to_sql
end
end
Groovy. Just like before, we can use the power of ActiveRelation
to build up our subselect, which then gets interpolated into the
where
clause. I’ve also prepended id
in the where
clause with
table_name
, to avoid potential ambiguous column problems.
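One nice property of building the subselect with scoped is that any scope already in effect propagates into it as well. A hypothetical usage, assuming category 1 is the LOLcats category from the data above:

# Both the outer query and the interpolated row_number() subselect are
# restricted to category 1, so only "Hugs not Drugs" comes back
CatPicture.where(category_id: 1).cheapest_per_category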
There is one potential issue with using window functions: limited vendor support. While most of the big boys implement window functions (Oracle, Postgresql, and SQLServer, to name a few), MySQL and SQLite users are out of luck.
Pros:
Cons:
While they may not be appropriate for every situation, window functions are a great tool for your toolbelt. They excel at filtering down rows based on aggregate data, or adding aggregate data to the rows you’d already like to select.
For more information about window functions, the Postgres documentation is an excellent resource, both for its introduction, and its list of window functions.
While writing this post, I created a sample Rails app to iterate quickly. I used TDD to write the pure-ruby approach, and reused the specs while I “refactored” the implementation to the subsequent approaches. Of particular note is the history of the CatPicture model, which mirrors the code above.
How do you keep your commits atomic easily? Let’s explore one possible approach.
As a practitioner of good source control, you and your team have decided to make all of your git commits atomic within your projects. That is, every commit has a green test suite, and you prefer small, incremental commits to large, monolithic ones. Keeping commits small and atomic has tons of benefits, from more consistent continuous integration results, to better team cohesion (have you ever gotten upset with another team member for committing red?). But in practice, keeping all of your commits atomic can present some challenges.
After doing a bunch of work, making incremental, atomic commits along
the way, it’s time to push your work up. However, when you run git pull --rebase
, you find that another team member has made changes since you
last pushed. Your commits are now sitting on top of a different git
history. Are all of your commits still atomic? Short of checking out
every single commit and running the suite, how can you be certain that
every commit is atomic? What a pain! I don’t want to check out every
commit by hand.
Enter atomically
, a simple shell script designed to take the pain out
of checking every commit between your upstream and you. Before pushing,
you can ensure every commit is atomic by running the script.
To use it, just pass the command you want to run as arguments to atomically:
$ atomically rake
The above command will start at the current branch’s HEAD and run rake. After that, it will check out the previous commit and run the command again. It will do so for all commits between you and origin.
If you are confident that nothing in your spec suite changed, you can run only your cucumber features the same way:
$ atomically cucumber
Or just your spec suite:
$ atomically rspec
Regardless, keeping atomic commits is a vital part of good source control, and this tool makes it slightly easier to do so.
Here’s the source of atomically
:
#!/bin/bash
if [ -n "$(git status --porcelain)" ]; then
echo "ERROR: You have a dirty working copy. This command would remove any files not already checked in"
exit 1
fi
b="$(git symbolic-ref HEAD 2>/dev/null)"
branch="`basename $b`"
program=$*
reset_branch() {
git co $branch --quiet
}
git rev-list origin/${branch-master}..${branch-master} | while read rev; do
trap "exit 1" SIGINT SIGTERM
echo
echo "Running at revision $rev"
echo
git co $rev --quiet && git clean -fd && $program
echo
trap - SIGINT SIGTERM
done
reset_branch
To use, just drop that in a file in your $PATH
, and make sure it is executable.
Thanks to Gary Bernhardt for the script’s inspiration,
run-command-on-git-revisions
, which you can see in his
dotfiles.
ipfw
that, among other things, can do just this.
I first saw this technique from Joe Miller’s post on the subject. I packaged up the settings he mentioned into a little shell script:
You can drop that somewhere in your $PATH
and chmod +x
to make it executable. You can call it whatever you want, but I called mine “hinder”. After that, it’s simply a matter of using it:
$ hinder www.google.com
Now when you visit google.com, you should see some marked slowness. To reset, just run:
$ hinder reset
Google is now fast again.
The script works by adding 250ms delay to both directions of network traffic. It also adds a packet-loss percentage of 10%. You can play with these numbers to get even more latency simulation. Enjoy!
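For reference, here is a rough sketch of what such a script might look like. This is not the original gist (which is no longer embedded here); it is an assumption-laden example using the old OS X ipfw/dummynet interface and a single hostname argument:

#!/bin/bash
# hinder: add 250ms of delay and 10% packet loss to traffic to/from a host.
# Usage: hinder <host>  |  hinder reset
TARGET="$1"

if [ "$TARGET" = "reset" ]; then
  # Remove only the rules this script added
  sudo ipfw delete 100 200
  exit 0
fi

# One dummynet pipe per direction, each with 250ms delay and 10% packet loss
sudo ipfw pipe 1 config delay 250ms plr 0.1
sudo ipfw pipe 2 config delay 250ms plr 0.1

# Route traffic to and from the target through the pipes
sudo ipfw add 100 pipe 1 ip from any to "$TARGET"
sudo ipfw add 200 pipe 2 ip from "$TARGET" to any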
Just add it to your global git config file at ~/.gitconfig
and enjoy.
Taking a cue from Xavier Shay’s excellent intro to tmux, I’ve been using tmux lately as my primary workspace. There are excellent introductions to tmux elsewhere, but I’ve really enjoyed the switch from MacVim/Terminal to a single tmux session for development. But rather than sing tmux’s praises, I’d like to talk about how tmux and a vim plugin have changed my testing feedback loop for the better.
Autotesting gives you immediate feedback, but runs every time you save a file. Even though this is often the desired behavior, I can’t tell you how many times I’ve saved a feature file, only to immediately notice a typo. Especially with a Rails project, this can add up to an expensive amount of time. I end up feeling punished for saving my work.
I’ve also tried more editor-embedding techniques of running tests. Both
rails.vim and rake.vim provide facilities for running :Rake
. When
combined with a keyboard shortcut, this gets closer to the kind of
control I like to have, running my tests exactly when I want them. The
downside, though, is that I lose control of my editor and have to wait
for the command to finish before I can type, or even navigate again. And
I can’t look at a failure message and my code at the same time.
A practice that is quickly gaining popularity in the Ruby community is isolating your business logic from your persistance logic and framework. Rather than load Rails (or some other large library or framework), you sequester all business logic in its own class or module, and then test that class or module in isolation. This has a ton of benefits for the longevity of your code, but one of the side benefits is the speed increase for running individual specs or tests. This technique is being championed by Gary Bernhardt and Corey Haines, among others.
Because tmux is so scriptable, it isn’t hard to send commands to other panes in a tmux session programmatically. Leveraging the power of rails.vim and tslime.vim, I’ve created a vim plugin that shortens the feedback loop when practicing TDD in a tmux session. It’s called turbux.vim.
My typical workflow now involves setting up a tmux session for my
project, splitting vertically (<C-b> %
), and using layout 4 (<C-b> <Alt-4>
). In fullscreen, the result is about 30% for my shell on the
left, and 70% for vim on the right.
The first time you use it, tslime.vim will prompt you to input your tmux
session name, window number, and pane number. There is completion for
each of these prompts, so you can happily mash <Tab>
.
The plugin exposes a general-purpose function to send arbitrary text to the configured tmux pane. For example, you can use it in the following way:
:call Send_to_Tmux("rspec ".expand("%")."\n")
The above command would send rspec path/to/spec.rb
to the configured
pane. For me, this pattern of running the test file that is currently
open is so common that I’ve packaged up some useful defaults in
turbux.vim.
Turbux.vim tries to intelligently choose the right spec, test or feature
to run when you invoke it. If you’re in a spec, invoking the plugin
(by default with <leader>t
) will run rspec path/to/my_spec.rb
in
the corresponding pane. In a test-unit file, it will run ruby -Itest path/to/test.rb
. In a cucumber feature file, it will run cucumber path/to/my.feature
.
Thanks to rails.vim’s awesomeness, I’ve also provided some mappings for when the current file has a corresponding test or spec. For example, when I’m in a file that has a corresponding spec, such as a model or controller, the command will run that spec.
Finally, if the plugin is invoked outside the context of any feature or spec-related file, it will simply run the most recent test command again.
Also, I’ve added a mapping for <leader>T
to run a more focused spec or
cucumber scenario. It works by adding the vim cursor’s line number to
the rspec or cucumber command.
This setup has been really rewarding so far. There’s far less context switching, as I never have to leave my editor. There are also fewer surprises. As far as I’m concerned, the faster my feedback, the better.
Note: You will probably want to use my fork of tslime.vim, as the main repository has some outstanding bugs, and the fixes have yet to be merged in.
See the plugin in action. If the video is hard to see, visit vimeo. There is a link to download at full size.
Hitch works by setting the GIT_AUTHOR_NAME
and GIT_AUTHOR_EMAIL
environment variables. For the email itself, it joins the hitched author’s github usernames and a prefix with a “+”, creating email addresses of the form “dev+jgdavey+therubymug@hashrocket.com”. Using it is as simple as hitch <github_username> <github_username>
.
Recently, I was annoyed that I had to always remember the github username of the person I was pairing with. I was at the command line typing and found myself hitting <tab>
repeatedly, hoping it would complete with the authors I pair with most often.
So I quickly whipped up a zsh completion script, and boom, pairing nirvana.
To use the completion script, save the following script as _hitch
and add it to your fpath
:
If you see any way to improve the function, please fork the gist on github.
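If adding a custom completion directory to your fpath is new to you, that part might look something like this in your ~/.zshrc (the directory name is just an example):

# ~/.zshrc
fpath=(~/.zsh/completions $fpath)  # directory containing the _hitch file
autoload -U compinit && compinit   # (re)initialize completion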
From my experience, there are two solid options for QR code generation in Ruby: qrencoder and rQRCode.
The QR decoding landscape varies even more than its QR encoding counterpart. There are several options to choose from, and the best one for you will likely depend on your environment and stack. The top three libraries that I’ve found are qrdecoder, qrio, and Zxing.rb.
Wherein I offer “My Two Cents”. If you’re on a stack that can install libraries and you’re using an MRI ruby (1.8 or 1.9), use qrencoder and qrdecoder. They are both very fast and awesome. Additionally, their APIs complement each other. If you’re on Heroku, use rQRCode and qrio. They’re both pure ruby and play nice with an environment where you can’t install your own libraries. If you’re on JRuby, use Zxing.rb for decoding, and try both rQRCode and qrencoder. I haven’t tried qrencoder with JRuby–or any compiled gems for that matter–so your mileage may vary.
N.B. In the interest of full disclosure, I have contributed to both qrencoder and zxing.rb, and maintain qrdecoder, and tend to favor them. Nevertheless, the pure ruby options are solid and well worth a look.
Regardless of what your needs are, there are lots of options. Which one fits your needs best will largely depend on your stack.
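As a taste of the pure-ruby route, generating a code with rQRCode is only a couple of lines (a minimal sketch; the string being encoded is arbitrary):

require 'rqrcode'

qr = RQRCode::QRCode.new("http://example.com")
puts qr.to_s  # prints the QR matrix as ASCII art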
I’ve converted this to a full-blown VIM plugin: https://github.com/jgdavey/vim-blockle