Using Datomic as a Graph Database

This is a repost. You can find the original here

Datomic is a database that changes the way that you think about databases. It also happens to be effective at modeling graph data and was a great fit for performing graph traversal in a recent project I built.

I started out building a six degress of Kevin Bacon project using Neo4j, a popular open-source graph database. It worked very well for actors that were a few hops away, but finding paths between actors with more than 5 hops proved problematic. The cypher query language gave me little visibility into the graph algorithms actually being executed. I wanted more.

Despite not being explicitly labeled as such, Datomic proved to be an effective graph database. Its ability to arbitrarily traverse datoms, when paired with the appropriate graph searching algorithm, solved my problem elegantly. This technique ended up being fast as well.

Quick aside: this post assumes a cursory understanding of Datomic. I won’t cover the basics, but the official tutorial will help you get started.

6 Degrees Kevin == Cool; 6 Degrees Kelvin == Cold Link to heading

The problem domain should be fairly familiar: the 6 degrees of Kevin Bacon. I wanted to create an app where you could pick an actor and find out what their Bacon Number was. That is, given an actor, I wanted to answer the question “how many degrees of separation is there between that actor and Kevin Bacon?”

Using information freely available from IMDb, I developed the following schema:

[
 ;; movies
 {:db/id #db/id[:db.part/db]
  :db/ident :movie/title
  :db/valueType :db.type/string
  :db/cardinality :db.cardinality/one
  :db/fulltext true
  :db/unique :db.unique/identity
  :db/doc "A movie's title (upsertable)"
  :db.install/_attribute :db.part/db}

 {:db/id #db/id[:db.part/db]
  :db/ident :movie/year
  :db/valueType :db.type/long
  :db/cardinality :db.cardinality/one
  :db/doc "A movie's release year"
  :db.install/_attribute :db.part/db}

 ;; actors
 {:db/id #db/id[:db.part/db]
  :db/ident :person/name
  :db/valueType :db.type/string
  :db/cardinality :db.cardinality/one
  :db/fulltext true
  :db/unique :db.unique/identity
  :db/doc "A person's name (upsertable)"
  :db.install/_attribute :db.part/db}

 {:db/id #db/id[:db.part/db]
  :db/ident :actor/movies
  :db/valueType :db.type/ref
  :db/cardinality :db.cardinality/many
  :db/doc "An actor's ref to a movie"
  :db.install/_attribute :db.part/db}
 ]

In a nutshell, movies have titles and years. Actors have names and movies. The “relationship” of actors to movies is many-to-many, so I’ve declared the :actor/movies attribute as having a cardinality of many.

Using datalog queries Link to heading

Using datalog and datomic.api/q, we can make graph-like queries fairly easily. Because the :where clauses of a datalog query form an implicit join, we can join from our starting point to our ending point with relative ease.

As an example, what if we wanted to know the shortest path or paths from Kevin Bacon to Jon Belushi? Let’s use a query to find out:

(require '[datomic.api :as d :refer [q db]])
(def conn (d/connect ...))

(q '[:find ?start ?title ?end
      :in $ ?start ?end
      :where
     [?a1 :actor/name ?start]
     [?a2 :actor/name ?end]
     [?a1 :actor/movies ?m]
     [?a2 :actor/movies ?m]
     [?m :movie/title ?title]]
    (db conn)
    "Bacon, Kevin (I)"
    "Belushi, John")

;=> #{["Bacon, Kevin (I)" "Animal House (1978)" "Belushi, John"]}

That is fine when actors have worked together in a movie (a Bacon Number of 1), but doesn’t help us solve Bacon numbers when there are 2 or more movies between the actors. We could add more where clauses to join over two movies, but that isn’t sustainable. The queries would quickly become too long to reason about. This is a prime opportunity to use Datomic’s rules.

(def acted-with-rules
  '[[(acted-with ?e1 ?e2 ?path)
     [?e1 :actor/movies ?m]
     [?e2 :actor/movies ?m]
     [(!= ?e1 ?e2)]
     [(vector ?e1 ?m ?e2) ?path]]
    [(acted-with-1 ?e1 ?e2 ?path)
     (acted-with ?e1 ?e2 ?path)]
    [(acted-with-2 ?e1 ?e2 ?path)
     (acted-with ?e1 ?x ?pp)
     (acted-with ?x ?e2 ?p2)
     [(butlast ?pp) ?p1]
     [(concat ?p1 ?p2) ?path]]])

(q '[:find ?path
     :in $ % ?start ?end
     :where
     [?a1 :actor/name ?start]
     [?a2 :actor/name ?end]
     (acted-with-2 ?a1 ?a2 ?path)]
    (db conn) acted-with-rules "Bieber, Justin" "Bacon, Kevin (I)"))
;=> #{[(17592186887476 17592186434418 17592187362817 17592186339273 17592186838882)] [(17592186887476 17592186434418 17592188400376 17592186529535 17592186838882)] [(17592186887476 17592186434418 17592187854963 17592186529535 17592186838882)] [(17592186887476 17592186434418 17592186926035 17592186302397 17592186838882)]}

This time we get back a collection of paths with entity ids. We can easily transform these ids by mapping them into entities and getting the name or title, using a function like the following:

(defn actor-or-movie-name [db eid]
  (let [ent (d/entity db eid)]
    (or (:movie/title ent) (:person/name ent))))

So, putting the query together with the above function, we get:

(let [d (db conn)
      name (partial actor-or-movie-name d)]
  (->> (q '[:find ?path
            :in $ % ?start ?end
            :where
            [?a1 :actor/name ?start]
            [?a2 :actor/name ?end]
            (acted-with-2 ?a1 ?a2 ?path)]
          d acted-with-rules "Bieber, Justin" "Bacon, Kevin (I)")
       (map first)
       (map (partial mapv name))))
;=> (["Bieber, Justin" "Men in Black 3 (2012)" "Jones, Tommy Lee" "JFK (1991)" "Bacon, Kevin (I)"] ["Bieber, Justin" "Men in Black 3 (2012)" "Howard, Rosemary (II)" "R.I.P.D. (2013)" "Bacon, Kevin (I)"] ["Bieber, Justin" "Men in Black 3 (2012)" "Segal, Tobias" "R.I.P.D. (2013)" "Bacon, Kevin (I)"] ["Bieber, Justin" "Men in Black 3 (2012)" "Brolin, Josh" "Hollow Man (2000)" "Bacon, Kevin (I)"])

The rules above are defined statically, but they are simply clojure data structures: it would be trivial to generate those rules to an arbitrary depth. For an example of doing just that, see the Datomic mbrainz sample.

Low-level traversal for better performance Link to heading

Having to know the depth at which to traverse the graph is cumbersome. Datomic has a distinct advantage of being able to treat your data as local, even if its permanent storage lives somewhere else. That means that we can bring our own functions to the problem and execute locally, rather than on a database server. We can leverage Datomic’s datoms function to search the graph using our own graph-searching algorithm, rather than relying on the query engine.

Our IMDb actor data is essentially a dense unweighted graph. Because of its density, a bidirectional breadth-first search is probably the most efficient alogrithm for finding the shortest paths from one point to another. A generic bidirectional BFS returning all shortest paths might look like this.

(defn paths
  "Returns a lazy seq of all non-looping path vectors starting with
  [<start-node>]"
  [nodes-fn path]
  (let [this-node (peek path)]
    (->> (nodes-fn this-node)
         (filter #(not-any? (fn [edge] (= edge [this-node %]))
                            (partition 2 1 path)))
         (mapcat #(paths nodes-fn (conj path %)))
         (cons path))))

(defn trace-paths [m start]
  (remove #(m (peek %)) (paths m [start])))

(defn- find-paths [from-map to-map matches]
  (for [n matches
        from (map reverse (trace-paths from-map n))
        to (map rest (trace-paths to-map n))]
    (vec (concat from to))))

(defn- neighbor-pairs [neighbors q coll]
  (for [node q
        nbr (neighbors node)
        :when (not (contains? coll nbr))]
    [nbr node]))

(defn bidirectional-bfs [start end neighbors]
  (let [find-pairs (partial neighbor-pairs neighbors)
        overlaps (fn [coll q] (seq (filter #(contains? coll %) q)))
        map-set-pairs (fn [map pairs]
                        (persistent! (reduce (fn [map [key val]]
                                  (assoc! map key (conj (get map key #{}) val)))
                                (transient map) pairs)))]
    (loop [preds {start nil} ; map of outgoing nodes to where they came from
           succs {end nil}   ; map of incoming nodes to where they came from
           q1 (list start)   ; queue of outgoing things to check
           q2 (list end)]    ; queue of incoming things to check
      (when (and (seq q1) (seq q2))
        (if (<= (count q1) (count q2))
          (let [pairs (find-pairs q1 preds)
                preds (map-set-pairs preds pairs)
                q1 (map first pairs)]
            (if-let [all (overlaps succs q1)]
              (find-paths preds succs (set all))
              (recur preds succs q1 q2)))
          (let [pairs (find-pairs q2 succs)
                succs (map-set-pairs succs pairs)
                q2 (map first pairs)]
            (if-let [all (overlaps preds q2)]
              (find-paths preds succs (set all))
              (recur preds succs q1 q2))))))))

There’s a lot of code here, including some optimizations and helper functions. The important function here is bidirectional-bfs. I won’t explain the details of the algorithm, but at a high level, it takes in a start and end node and a function to be called on any node to get it’s “neighbors”.

This is a generic, pure function, agnostic of Datomic or our data. In fact, I used a simple map as the “graph” while developing this:

(def graph
  {:a [:b]
   :b [:a :c :d]
   :c [:b :e]
   :d [:b :c :e]
   :e [:c :d :f]
   :f []})

(bidirectional-bfs :a :e graph)
;=> [[:a :b :c :e] [:a :b :d :e]]

To use this generic algorithm with our database, we need a neighbors function. Depending on whether a node is an “actor” or a “movie”, we need to return its appropriate counterpart. A naive “or” condition is actually good enough here:

(defn movie-actors
  "Given a Datomic database value and a movie id,
  returns ids for actors in that movie."
  [db eid]
  (map :e (d/datoms db :vaet eid :actor/movies)))

(defn actor-movies
  "Given a Datomic database value and an actor id,
  returns ids for movies that actor was in."
  [db eid]
  (map :v (d/datoms db :eavt eid :actor/movies)))

(defn neighbors
  "db is database value
  eid is an actor or movie eid"
  [db eid]
  (or (seq (actor-movies db eid))
      (seq (movie-actors db eid))))

Gluing everything together is a simple matter of partial application:

(defn find-id-paths [db source target]
  (bidirectional-bfs source target (partial neighbors db)))

Given a source entity id and a target entity id, this will return all shortest paths (ids), much like the query example above. From there, we could map them to Datomic entities, get their names, or sort the paths using a domain-specific heuristic. Plugging in the previous example, we might do something like the following:

(let [d (db conn)
      d (d/filter d (without-documentaries d))
      biebs (d/entid d [:actor/name "Bieber, Justin"])
      bacon (d/entid d [:actor/name "Bacon, Kevin (I)"])
      name (partial actor-or-movie-name d)]
  (map (partial mapv name) (find-id-paths d biebs bacon)))
;=> (["Bieber, Justin" "Men in Black 3 (2012)" "Jones, Tommy Lee" "JFK (1991)" "Bacon, Kevin (I)"] ["Bieber, Justin" "Men in Black 3 (2012)" "Segal, Tobias" "R.I.P.D. (2013)" "Bacon, Kevin (I)"] ["Bieber, Justin" "Men in Black 3 (2012)" "Brolin, Josh" "Hollow Man (2000)" "Bacon, Kevin (I)"] ["Bieber, Justin" "Men in Black 3 (2012)" "Howard, Rosemary (II)" "R.I.P.D. (2013)" "Bacon, Kevin (I)"])

This returns the same set of paths as the query method did. However, this version has the advantage of going to an arbitrary depth.

This is just one example of graph searching with Datomic. Different kinds of problems and domains could use other algorithms. The idea, though, is that generic graph searching functions can be used directly, since the data is effectively local to the peer machine.

For more Clojure implementations of generic graph searching algorithms, loom’s alg_generic namespace is a great starting point.

Performance Link to heading

I’m using the above ideas and functions on IMDB’s dataset to power the project. Once the peer’s index caches are warmed, the performance is quite good: most searches I’ve performed between well-known actors complete in under a second, and in many cases, under 100 ms. I never got results that good with Neo4j’s cypher query language.

Source Link to heading

The code in this post is based on the source.