Go Julia!

Last week two new language bindings were added to the YAMLScript family: Go and Julia.

Go

The Go binding has been a long time coming. Several people have been working on it this year but it was Andrew Pam who finally got it over the finish line.

Go is a big user of the YAML data language, so we're happy to be able to provide this library and hope to see it used in many Go projects.

Julia

The Julia binding was a bit more of a recent surprise addition. A few weeks ago a Julia hacker dropped by the YAML Chat Room to ask some questions about YAML. I ended up asking him more about Julia and if he could help write a YAMLScript binding.

He invited Kenta Murata to the chat room and Kenta said he could do it for us. Then Kenta disappeared for a few weeks. Last week he came back with a fully working Julia binding for YAMLScript!

Fun fact: Julia is Clark Evans's favorite programming language! Clark is one of the original authors of the YAML data language.

YAMLScript Loader Libraries

These YAMLScript language bindings are intended to serve as alternative YAML loader libraries for their respective languages. They can load normal existing YAML files in a consistent way, with a common API across all languages. They can also load YAML files with embedded YAMLScript code to achieve data importing, transformation, interpolation; anything a programming language can do.

The current list of YAMLScript loader libraries is:

Join the Fun!

If your language is missing a YAMLScript binding or you want to help improve one, please drop by the YAMLScript Chat Room and we'll get you started.

All of the bindings are part of the YAMLScript Mono-Repo on GitHub. If you look at the existing bindings, you'll see that they are all quite small. You'll need to learn about basic FFI (Foreign Function Interface) for your language, to make calls to the YAMLScript shared library libyamlscript, but that's about it.

It's a great way to get started with a new language project.

Some Future Plans

There's a lot of upcoming work planned for YAMLScript. I've mapped some of it out in the YAMLScript Roadmap.

Currently YAMLScript (written in Clojure, which compiles to JVM bytecode, which…) compiles to a native binary interpreter using the GraalVM native-image compiler. This is great for performance and distribution, but it's not great for portability, limiting it to Linux, MacOS and Windows.

The JVM is a great platform for portability, so we're planning to make a JVM version of the ys YAMLScript interpreter. Of course, having YAMLScript available as a JVM language is also a good thing for Linux, MacOS and Windows users.

We also want to make WebAssembly, JavaScript and C++ versions of the YAMLScript interpreter.

And of course we still want to get to our goal of 42 language bindings!!!

Lots of fun stuff to explore!

Permalink

Clojure Deref (July 17, 2024)

Welcome to the Clojure Deref! This is a weekly link/news roundup for the Clojure ecosystem (feed: RSS). Thanks to Anton Fonarev for link aggregation.

Libraries and Tools

New releases and tools this week:

  • pg2 0.1.15 - A fast PostgreSQL driver for Clojure

  • yamlscript 0.1.66 - Programming in YAML

  • expose-api 0.3.0 - A Clojure library designed to simplify the process of creating public-facing API namespaces

  • datomic-gcp-tf - Terraform module to run Datomic on GCP

  • clong 1.4 - A wrapper for libclang and a generator that can turn C header files into Clojure APIs

  • tools.build 0.10.5 - Clojure builds as Clojure programs

  • deep-diamond 0.29.4 - A fast Clojure Tensor & Deep Learning library

  • adorn 0.1.131-alpha - Extensible conversion of Clojure code to Hiccup forms

  • calva 2.0.467 - Clojure & ClojureScript Interactive Programming for VS Code

  • http-server 0.1.13 - Serve static assets

  • squint 0.8.113 - Light-weight ClojureScript dialect

  • overarch 0.27.0 - A data driven description of software architecture based on UML and the C4 model

  • hanamicloth 1-alpha4-SNAPSHOT - Easy layered graphics with Hanami & Tablecloth

  • clay 2-beta12 - A tiny Clojure tool for dynamic workflow of data visualization and literate programming

  • polylith 0.2.20 - A tool used to develop Polylith based architectures in Clojure

Permalink

Clojure macros continue to surprise me

Clojure macros have two modes: avoid them at all costs/do very basic stuff, or go absolutely crazy.

Here’s the problem: I’m working on Humble UI’s component library, and I wanted to document it. While at it, I figured it could serve as an integration test as well—since I showcase every possible option, why not test it at the same time?

This is what I came up with: I write component code, and in the application, I show a table with the running code on the left and the source on the right:

It was important that the code I show is exactly the same code I run (otherwise it wouldn’t be a very good test). Like a quine: hey program! Show us your source code!

Simple with Clojure macros, right? Indeed:

(defmacro table [& examples]
  (list 'ui/grid {:cols 2}
    (for [[_ code] (partition 2 examples)]
      (list 'list
        code (pr-str code)))))

This macro takes the example code as an AST and, for each example, emits a pair: the AST itself (basically a no-op) and the string we get by serializing that AST.
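For illustration, a call might look roughly like this (the ui/label and ui/button names are placeholders, not from the post):

(table
  "Label"  (ui/label "Hello")
  "Button" (ui/button "Click me"))

;; Each example becomes a pair in the grid: the running component itself in one
;; cell, and its own source serialized with pr-str, e.g. "(ui/label \"Hello\")",
;; in the next cell.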

This is what I consider to be a “normal” macro usage. Nothing fancy, just another day at the office.

Unfortunately, this approach reformats code: while in the macro, all we have is an already parsed AST (data structures only, no whitespaces) and we have to pretty-print it from scratch, adding indents and newlines.

I tried a couple of existing formatters (clojure.pprint, zprint, cljfmt) but wasn’t happy with any of them. The problem is tricky—sometimes a vector is just a vector, but sometimes it’s a UI component and shows the structure of the UI.

And then I realized that I was thinking inside the box all the time. We already have the perfect formatting—it’s in the source file!

So what if... No, no, it’s too brittle. We shouldn’t even think about it... But what if...

What if our macro read the source file?

Like, actually went to the file system, opened a file, and read its content? We already have the file name conveniently stored in *file*, and luckily Clojure keeps sources around.

So this is what I ended up with:

(defn slurp-source [file key]
  (let [content      (slurp (io/resource file))
        key-str      (pr-str key)
        idx          (str/index-of content key-str)
        content-tail (subs content (+ idx (count key-str)))
        reader       (clojure.lang.LineNumberingPushbackReader.
                       (java.io.StringReader.
                         content-tail))
        indent       (re-find #"\s+" content-tail)
        [_ form-str] (read+string reader)]
    (->> form-str
      str/split-lines
      (map #(if (str/starts-with? % indent)
              (subs % (count indent))
              %)))))

Go to a file. Find the string we are interested in. Read the first form after it as a string. Remove common indentation. Render. As a string.
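A hypothetical call (the resource path and key are made up for illustration):

(slurp-source "site/components.clj" :example/buttons)
;; => a seq of the source lines of the first form that follows
;;    :example/buttons in the file, with the common indentation removed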

Voilà!

I know it’s bad. I know you shouldn’t do it. I know. I know.

But still. Clojure is the most fun I have ever had with any language. It lets you play with code like never before. Do the craziest, stupidest things. Read the source file of the code you are evaluating? Fetch code from the internet and splice it into the currently running program?

In any other language, this would’ve been a project. You’d need a parser, a build step... Here—just ten lines of code, in the vanilla language, no tooling or setup required.

Sometimes, a crazy thing is exactly what you need.

Permalink

Going to the cinema is a data visualization problem

Do you like going to the cinema? I do. But I also like to know where I am going and which movie I am going to see. But how do you choose?

You can’t go to the cinema’s website. There are just too many. Of course, you might have a favorite one and always go to it, but you won’t know what you are missing out on.

Then, there are aggregators. The idea is good: gather everything that’s playing in cinemas right now in one place. Flight aggregators, but for movies.

Implementation, unfortunately, is not that good. As with any other website, the aggregator’s goal is to make you go through as many web pages as possible, do as many clicks as possible, and show you as many ads as possible.

Please use an ad blocker, this is unbearable

They even play a freaking TV ad in place of a movie trailer!

Information architecture can be weird too:

kino.de, auto-translated from German

Should I go to “Movies” or “Cinema Programme”? Should I select “Currently in Cinema” or “New in Cinema”?

So I decided to take matters into my own hands and build a cinema selection website I always dreamed of.

Meet allekinos.de:

So what is it?

It’s a website that shows every movie screening in every cinema across all of Germany.

And when I say EVERY screening, I mean it:

Every screening, every cinema, every movie. All in one long HTML table.

What else can it do?

Just filter. You can filter:

  • by city,
  • by city district (don’t want to travel too far),
  • by a particular cinema (maybe you have a favorite one),
  • by genre (want to see something with your kid but don’t know what),
  • or by movie (which cities does it still play?).

That’s it. That’s the site.

Oh, we also have a list of premieres so you know what’s coming. But that’s it.

What about the interface?

There isn’t one. I mean, there is, of course, but I tried to make it as invisible as possible. There’s no logo. No menu. No footer. No pagination. No “See more”. No cookie banners (because no cookies). No ChatGPT/SEO generated bullshit. No ads, of course.

Why? Because people don’t care about that stuff. They care about function. And our UI is a pure function.

But how do I search?

Well, Ctrl+F, of course. We are too humble, too lazy, and too smart to try to compete with the in-browser implementation.

Wait, what about page size?

It’s totally fine. I mean, for Berlin, for example, we serve 1.4 MB of HTML. 3 MB with posters. It’s fine.

Slack loads 50 MB (yes, MEGA bytes) to show you a list of 10 chats. AirBnB loads 15 MB, including 500 KB HTML, just to show 20 images. LinkedIn loads 1.5 MB of just HTML (37 MB total) for a fraction of the data we’re showing. So we are fine.

It’s kind of refreshing, actually. What kind of speed do you get from a table with a thousand rows? It feels like a lot, but it still feels faster than anything on the modern web.

What about mobile?

That is a good question. I am still thinking about it.

The table trick won’t work on mobile. So the layout needs to be different, but I also want it to have the same information density as the desktop, which is tricky.

If you just make the table vertical, it’ll be too much to scroll even for people with the strongest fingers. Maybe I’ll figure something out one day.

What’s under the hood?

DataScript.

When I looked at the data, I realized it’s multidimensional: there are movies, which have genres, years, countries, and languages; there are cinemas, which are located in districts, which are located in cities; and then there are showings, which have a day and a time. And very possibly something else will come up later, too.

Now, I had no idea how that data would be accessed. Is the cinema part of the movie or is the movie part of the cinema? So I decided to make it all flat and put it into the database.

And it worked! It worked remarkably well. Now I can take advantage of the fact that DataScript queries are data and build them on the fly:

(defn search [{:keys [city cinema district movie genre]}]
  (let [inputs   
        (cond-> [['$ db]]
          city     (conj ['?city     city])
          cinema   (conj ['?cinema-title cinema])
          district (conj ['?district     district])
          movie    (conj ['?movie-title  movie])
          genre    (conj ['?genre    genre]))
      
        where
        (cond-> [:where]
          city     (conj '(or
                            [?cinema :cinema/city ?city]
                            [?cinema :cinema/area ?city]))
          cinema   (conj '[?cinema :cinema/title ?cinema-title])
          district (conj '[?cinema :cinema/district ?district])
          movie    (conj '[?movie :movie/title ?movie-title])
          genre    (conj '[?movie :movie/genre ?genre]))]

    (apply ds/q
      (concat
        '[:find ?show ?date ?time ?url ?cinema ?version ?movie
          :keys  id    date  time  url  cinema  version  movie
          :in]
        (map first inputs)
        where 
        '[[?show    :show/cinema         ?cinema]
          [?show    :show/date           ?date]
          [?show    :show/time           ?time]
          [?show    :show/url            ?url]
          [?show    :show/movie-version  ?version]
          [?version :movie-version/movie ?movie]])
      (map second inputs))))
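For example, a hypothetical call (the filter values and ids below are made up) returns maps shaped by the :keys clause, with cinema, version and movie as entity ids:

(search {:city "Berlin" :genre "Drama"})
;; => ({:id 536, :date "2024-07-17", :time "20:15", :url "https://…",
;;      :cinema 42, :version 77, :movie 13}
;;     …)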

The whole database is around 11 MB, basically nothing. I don’t even bother with proper storage; I just serialize the whole thing to a single JSON file every time it updates.

The hosting

I have been building websites for a while. I have two (Grumpy and this blog) running right now on my own server. I already spent my time, I have figured this all out. I have all the templates at my fingertips.

But for allekinos.de I decided to try something different: application.garden.

It’s a hosting service for small Clojure web apps (still in private beta) that’s supposed to take care of insignificant details for you and let you focus on your app first and foremost.

And it works! It’s refreshingly simple: you download a single binary that operates as a command-line tool, create a garden.edn file with your project’s name, and call garden deploy. That’s it! Your app is live!

No, seriously. You tend to forget how many annoying small details there are before other people can use your app. But when something like Garden takes them away, you remember and get blown away again! If that’s what Heroku used to feel like back in the day, I’m all in for it.

The beauty of Garden is that it helps you start fast, but it’s not a toy. It easily scales all the way up to production. Custom domain, HTTPS, auth, cron, logs, persistent storage: they take care of all of this for you.

And a cherry on top: they even provide nREPL to production! Again, no setup, just garden repl and you are in! Perfect for debugging weird performance issues or running one-off jobs.

An example: when I implemented premieres and committed the code, I still needed to run it for the first time. Instead of making a special flag or endpoint or adding and then immediately removing the startup code, I just connected to remote nREPL and invoked the function in the code. It doesn’t get easier than that!

Uncharacteristic of me, but I kind of enjoy building web apps again, when it’s that simple. Might build more in the future.

Conclusion

In the beginning, I wanted a simple website that solved my problem. I wanted a website that I’d enjoy using.

But I don’t want to make a product out of it. We have enough products already. It’s time someone took a user’s side. And I am one of the users.

Magic things happen when you trust your users and just show them everything you’ve got.

For example, I found some rare films playing that I had no idea about. Matrix in German (!), but once a week and only in one cinema. Or Mars Express, they play it in three cities only, excluding mine. How do you find out about stuff like this?

Here, I discovered it. You look at the data and you start seeing stuff that is otherwise completely invisible.

Anyway, enjoy. If this becomes a trend, I’m all in for it. Wouldn’t mind seeing more sites like this in the future.

Permalink

Clojure Corner with Johnny Stevenson

In this episode of our Clojure Corner, we are excited to present an insightful interview with Johnny Stevenson, also known as @Practicalli. Johnny is a renowned author, mentor, broadcaster, and engineer with a deep passion for Clojure.

Watch the full interview on our YouTube channel.

While you are waiting for the next Clojure Corner you can read our past Corner with Mauricio Szabo.

The post Clojure Corner with Johnny Stevenson appeared first on Flexiana.

Permalink

Call for Proposals. June 2024 Member Survey


Happy July! It’s that time again.

Clojurists Together is pleased to announce that we are opening our Q3 2024 funding round for Clojure Open Source Projects. Applications will be accepted through the 26th of July 2024 (midnight Pacific Time). We are looking forward to reviewing your proposals! More information and the application can be found here.

We will be awarding up to $44,000 USD for a total of 6-8 projects. The $2k funding tier is for experimental projects or smaller proposals, whereas the $9k tier is for those that are more established. Projects generally run 3 months, however, the $9K projects can run between 3 and 12 months as needed. We expect projects to start on Sept. 1, 2024.

We surveyed our members again in June to find out what types of initiatives they would like us to focus on for this round of funding. Their responses are summarized below. In particular, it was great to see members' feedback on how often they used or referred to the developers' work we have funded. We also noted that several of you plan to attend Heart of Clojure in Belgium in September and Clojure/Conj in the US in October. Check this out!

Usage by Members June 2024

If you are working on a Clojure open source project or have a new one in mind, especially one mentioned as a focus area for our members, consider applying. Or if you know someone that might be interested, please pass this information along. Let’s get the word out! If you have questions, please contact Kathy Davis at kdavis@clojuriststogether.org.

Our Members Speak: Feedback from the June 2024 Survey.

Platform June 2024

Clojure June 2024

ClojureScript June 2024


What other information would you like to know about Clojurists Together and the work we fund?

  • Your updates have been great. The current amount of information is good.
  • A roadmap of where Clojurists Together wants to go/grow to.
  • It’d be great to learn about your decision-making process (which is so fruitful)!
  • I’d like to know more about your approach to supporting new contributors and people from underrepresented groups.

What areas of the Clojure and ClojureScript ecosystem need support?

  • Although this doesn’t affect me personally, it is always good to make it easier for newcomers, so maybe some articles/documentation aimed at them, and maybe support for things like Kit Framework so they can get going quickly. (4 comments)
  • A better test framework than clojure.test but ideally mostly compatible with current tooling. The implementation of clojure.test is borderline unreadable.
  • AI copilots specific to clojure libraries
  • Cloud deployment patterns & tooling
  • The JVM/Clojure runtime has kept catching up with the JDK platform, but the ClojureScript runtime seems to be a bit “stuck”, which makes me worried.
  • Clojure Editor integrations
  • ClojureDart, auth libs for web apps
  • Fairly bare-to-the-metal visualization. I’m still using Oz (vega, vega-lite), but it is not maintained. I am not that much interested in hanami, which is its own metalanguage on top of vega-lite, and I think cloj-viz wants to build on top of hanami again. At least oz I can debug by looking at vega/vega-lite code.
  • 3rd-party libraries, some of which have merged fixes but no releases!
  • Structural editing
  • I’d love access to ClojureScript debugging within Emacs
  • Data-science
  • Making simple things easy, like Rails does.
  • Calva support for simultaneous clojure+clojurescript from the same workspace
  • Test tooling
  • I was recently looking at test.check and clojure.spec.alpha and was having some trouble figuring out how (if at all) the two are related? Maybe some unification of these two concepts?
  • The usual issues that crop up every year, like getting newcomers, error messages, and community engagement. Core team output feels stagnant; nothing in 1.12 feels worthy of a major release. Plus, it seems like Clojure is falling behind in just keeping up with changes to Java.
  • Marketing. Rich’s seminal talks were a fantastic advert for Clojure, but the tech market is fickle and nothing has really stepped in to replace these to attract companies in a position to create jobs.
  • Library maintenance
  • Documentation is still horrible for most libraries in the ecosystem.

What areas of the Clojure and ClojureScript ecosystem are strong? 

  • Core language itself
  • Stability has been great (4).
  • Expertise is strong - but especially on Slack, where it is quickly swept under the tide of time. Slack has some very patient people with great advice. The design work that went into the language is showing its benefit.
  • Good community, good tools and libraries developed by talented and experienced people, with a preference towards stability and backwards compatibility. Community is strong and welcoming. Documentation is there (though often in not always obvious places).
  • Babashka and its ecosystem
  • The libraries and general “can get things done with the language” abilities are super-strong.
  • Web development
  • Interactive developer tooling

Are there any particular libraries, tools, or projects that are important to you that you would like to see supported?

(Number of mentions in parentheses):

  • Cider (8)
  • Malli (6)
  • Shadow-cljs (5) 
  • Clj-kondo (5)
  • Calva (4)
  • Reitit (4)
  • Ring (3)
  • Babashka, Biff, buddy, cljfmt, clj-reload, Clojure-lsp, ClojureDart, Donut, Flowstorm, Re-frame (each 2)
  • Async, badspreadsheet, Carmine cluster support, Clay, Clojure-spec, Clojure-ts mode, Cloverage, Conjure, Data Analysis processing framework, Datalevin, DataScript, Datasplash, Dtype-next, eMac, Fastmath, Fulcro, Hanami, Hanamicloth, HoneySQL, HTMX projects, Humble UI, Jank, Kindly, Langohr, Leiningen, libpython-clj, Metamorph.mi, nbb, Neanderthal, Neovim, next.jdbc, Noj, Parinfer, Pedestal, Polylith, Portal, Reagent, Reframe, ring-jetty9-adapter, SciCloj, Scittle, Spandex, Tablecloth, Tech.ml.dataset, uix, VIM integrations, All things Borkdude, Taoensso & Weavejester’s projects (each 1)
  • I am very impressed by ClojureScript. With that said, I have been reading about unpoly and htmx of late. I’m not sure if it is a fad or not at this point, but it would be interesting to highlight if we have any special advantages or integrations between Clojure and these frameworks. I especially like the “hypermedia first” narrative that htmx is pushing.
  • The data-science thing. Datascript’s core insofar as it is reused in Datalevin. Datalevin might not be looking for funding, but it is(?) the obvious heir to the Datomic popular revolution (FOSS triple-store replacing SQL+Lucene without time-travel) and thus has the potential to bring lots of attention as the SQL ball-and-chain is finally clipped from developers' ankles.
  • Mac version (M1+ apple). It’s a great base for high-performance computing
  • Some level of ClojureScript debugging support that could also work when developing for nbb via nrepl
  • I would like the bi-quarterly updates to include a one line description of what the project is and a link to its website, before describing all that work that was done - to set the scene. Don’t assume we know (and remember) all the cool Clojure projects.

What would you like to be different in the Clojure community in the next 12 months?

  • Better AI / ML story
  • Maturing of the data-science stack
  • Coordinate more around certain standards like biff.
  • For 3rd-party libraries that have commits but no releases, or even open pull requests to fix issues, to actually get those fixes released!
  • kotlin interop, scala interop
  • Get rid of the perceived lack of adoption, bring companies before the curtain who use Clojure.
  • Helping innovation like squintjs/cherry, or something similar, to strengthen the ClojureScript runtime story.
  • CT has funded people to be present on Slack and share expertise. A live answer is very nice, but the goodness will diminish as soon as funding stops, and completely vanish if Slack ceases to renew the free enterprise status. In a word, when the music stops there will not even be records to replay. What is a potential solution?
  • I notice many of the questions recur, over and over and over again. They are literally FAQs. Could CT encourage contributing to a FAQ?
  • ClojureDart will be big: very big. As a toy, it is already impressive. My app came out super on mobile phones & tolerable as a website, totally bypassing the self-inflicted agonies of Javascript. CLJD is still missing a few key things - or maybe just multimethods - that a lot of Clojure software depends on. Soon as that obstacle is past, I predict a tsunami of interest in CLJD-adapted CLJC of many Clojure libraries. Clojurists Together may have a part in adapting long-stable libraries, in service of possibly really putting Clojure ‘on the map’, leapfrogging the labor quagmire of JS/CLJS while reusing tons of software already written for the JVM/JS hosts!
  • The Clojure community is already pretty great. I think we must just continue to be a friendly community and continue to help each other and newcomers on Slack and AskClojure. Efforts like Clojurists Together is a wonderful way to support developers that work on important open source tools and libraries. Keep up the good work!
  • Attracting newcomers & finding out what bumps are in onboarding. More advocacy and uptake of the language; I sense a lull not only due to the economy, but also due to Rich’s retirement. How can we grow the Clojure userbase?
  • Maybe more focus on getting started quickly for new Clojure projects? I am not sure exactly how, but maybe docker images or nix integrations that are just dead simple. I don’t honestly have the answers here, but I think that a lot of people just dismiss Clojure because it has somewhat of a learning and setup cost. I think if you can just get it in front of people quickly then you can quickly impress them with language features/design choices, but you have to get them to look at it. If getting started were a little easier, then we could slowly grow our ranks a little more.
  • I’m personally not a fan of the community going to deps. Kind of feels like if the Java community moved back to ant.
  • Better transparency over the current state of Clojure libraries. Are they being actively maintained? Do they have stale dependencies that might have security issues?

Permalink

System-wide user.clj with tools.deps

Updated on 2024-07-11: changed the approach from passing eval-command with :main-opts to adding a :local/root directory to the classpath. The old approach can be found at the end of the post.

Ever since I converted from Leiningen and Boot to tools.deps, I've been missing a place to define dev-time functions and helpers that would automatically be available in any REPL I start locally. Boot allows you to put any code into profile.boot; Leiningen has a system-wide profiles.clj that is a bit more awkward for defining functions, but it can still be done. I finally decided to recreate the same experience with tools.deps and got pretty close. The setup I came up with took a bit of effort to figure out, so I want to document all the steps and gotchas here and share this setup with you.
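The post has the full details; as a rough sketch of the general idea (the paths and alias names below are illustrative, not taken from the post), you keep a tiny local "project" whose only job is to put a directory containing user.clj on the classpath, and reference it from ~/.clojure/deps.edn:

;; ~/.clojure/deps.edn (illustrative)
{:aliases
 {:user {:extra-deps {local/user-clj {:local/root "/home/me/.clojure/user-clj"}}}}}

;; /home/me/.clojure/user-clj/deps.edn (illustrative)
{:paths ["src"]}

;; /home/me/.clojure/user-clj/src/user.clj holds the dev-time helpers;
;; Clojure loads user.clj automatically when it is on the classpath,
;; so starting a REPL with `clj -A:user` in any project picks it up.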

Permalink

Making Custom Datomic Datalog Datasources

Disclaimers

What follows is not meant for those new to Datomic. This is a very deep dive into Datomic on-prem internals–even deeper than usual! I don’t have access to any source code or any insider knowledge and none of the interfaces discussed here are public, so expect inaccuracies!

Caveat Lector out of the way, let’s get started.
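A note on setup: the snippets below assume a REPL where the Datomic peer library is on the classpath, a db value is bound from a connection, and the relevant namespaces are loaded, roughly like this (datomic.api is the public API; datomic.datalog is the internal namespace whose protocols we poke at):

(require '[datomic.api :as d]
         '[datomic.datalog :refer [ExtRel IJoin]])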

What are Datasources

Datomic datalog’s :where clause has “data-pattern” sub-clauses. For example:

[:find ?foo
 :where
 [(+ 1 1) ?foo]         ;; This is a function-expression clause
 [?foo :attribute ?bar] ;; This is a data-pattern clause--the one we care about.
 [(< ?bar 1)]           ;; This is a predicate-expression clause
 ]

Data pattern clauses match tuples in a “datasource”, which we can also call a relation. Syntactically, datasources are datalog symbols that start with $.

[:find ?foo
 :in $ $h ;; two datasources $ and $h
 :where
 [?foo :attribute ?bar] ;; Implicitly datasource $
 [$h ?bar :other-attribute ?baz] ;; Explicitly datasource $h
 ]

Usually a datasource is a Datomic database, but that’s not the only thing it can be!

My aim is to show you what “makes” a datasource, so you can understand the performance of datalog queries better and potentially make your own datasources.

(Spoiler alert: a datasource is a protocol.)

The ExtRel Protocol

A datasource is anything that has a useful implementation of the datomic.datalog/ExtRel protocol. (I’m not sure what this protocol name abbreviates. Perhaps “existential relation”?)

datomic.datalog/ExtRel
;; This is the map that defprotocol creates: 
=>
{:on datomic.datalog.ExtRel,
 :on-interface datomic.datalog.ExtRel,
 ;; Note the method signature
 :sigs {:extrel {:name extrel,
                 :arglists ([src consts starts whiles]),
                 :doc nil}},
 :var #'datomic.datalog/ExtRel,
 :method-map {:extrel :extrel},
 :method-builders {#'datomic.datalog/extrel #object[datomic.datalog$fn__17615 0x576cf258 "datomic.datalog$fn__17615@576cf258"]},
 ;; Note there are four implementations
 :impls {nil {:extrel #object[datomic.datalog$fn__17637 0x3c5ce098 "datomic.datalog$fn__17637@3c5ce098"]},
         java.lang.Object {:extrel #object[datomic.datalog$fn__17639 0x66250ab1 "datomic.datalog$fn__17639@66250ab1"]},
         datomic.db.Db {:extrel #object[datomic.datalog$fn__17641 0x2effa778 "datomic.datalog$fn__17641@2effa778"]},
         java.util.Map {:extrel #object[datomic.datalog$fn__17643 0x1e317ecd "datomic.datalog$fn__17643@1e317ecd"]},
         java.util.Collection {:extrel #object[datomic.datalog$fn__17645 0x45b215b3 "datomic.datalog$fn__17645@45b215b3"]}}}

From the protocol map we know that its definition looked something like this:

(defprotocol ExtRel
  (extrel [src consts starts whiles]))

And we know a few implementing objects to investigate.

Built-in Datasources

Collections

The nil and Object implementations exist just to throw error messages:

(d/q '[:find ?x ?y :in $ :where [?x ?y]]
     nil)
Execution error (Exceptions$IllegalArgumentExceptionInfo) at datomic.error/arg (error.clj:79).
:db.error/invalid-data-source Nil or missing data source. Did you forget to pass a database argument?
(d/q '[:find ?x ?y :in $ :where [?x ?y]]
     (Object.))
Execution error (Exceptions$IllegalArgumentExceptionInfo) at datomic.error/arg (error.clj:79).
:db.error/invalid-data-source class java.lang.Object is not a valid data source type.

Although nil and Object technically implement ExtRel, these implementations are not “useful”, so I don’t want to call these “datasources”.

By contrast, java.util.Collection allows you to use any collection of tuples as a datasource:

;; Using a vector of tuples
(d/q '[:find ?e ?attr ?v
       :in $
       :where
       [(ground 1) ?e]
       [?e ?attr ?v]]
     [[1 :int 2] [1 :int 3] [2 :int 4]])
=> #{[1 :int 3] [1 :int 2]}

Note that this implementation has very few constraints:

  • The outer collection can be anything that implements just size and iterator. This includes Clojure’s persistent vectors and sets.
  • The tuples can be anything that supports indexed access via nth.
  • Your tuples can be any length you want, but they should ideally be the same length to be a true relation. (Also you might get IndexOutOfBounds exceptions.)

There’s a special case for java.util.Map that just treats a map as a collection of two-element tuples. It seems to only need an entrySet method, and it probably just delegates the result to the j.u.Collection implementation.
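For example, a plain map behaves as a datasource of two-element tuples (a quick illustration, not from the original post):

(d/q '[:find ?v
       :in $
       :where [:b ?v]]
     {:a 1, :b 2})
;; => #{[2]}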

ExtRel Parameters

Let’s instrument the extrel method to see how it gets called. This function will take a datasource and return a wrapped datasource along with a trace atom that records every call to its extrel method.

(defn extrel-trace [base-ds]
  (let [trace (atom [])
        ds (reify
             ExtRel
             (extrel [_ consts starts whiles]
               (let [args [consts starts whiles]
                     ret (datomic.datalog/extrel base-ds consts starts whiles)]
                 (swap! trace conj {:args args :ret ret})
                 ret)))]
    [ds trace]))

And let’s try it on a simple query:

(let [[ds t] (extrel-trace [["e1" :int 1 "extra"]
                            ["e1" :int 2 "extra"]
                            ["e1" :int 3 "extra"]
                            ["e1" :int 4 "extra"]
                            ["e1" :int 5 "extra"]
                            ["e2" :int 2 "extra"]])]
  [(d/q '[:find ?v ?extra
          :in $
          :where
          [(ground ["e1" "e2"]) [?e ...]]
          [$ ?e :int ?v ?extra]
          [(= "extra" ?extra)]
          [(< ?v 4)]
          [(> ?v 1)]]
        ds)
   @t])
=>
[#{[2 "extra"] [3 "extra"]}
 [{:args [[nil :int nil nil]
          [nil nil 1 "extra"]
          [nil
           nil
           #object[datomic.datalog$ranges$fn__18296$fn__18300 0x35ac0c98 "datomic.datalog$ranges$fn__18296$fn__18300@35ac0c98"]
           #object[datomic.datalog$ranges$fn__18296$fn__18300 0x20fe78e3 "datomic.datalog$ranges$fn__18296$fn__18300@20fe78e3"]]],
   :ret (["e1" :int 1 "extra"]
         ["e1" :int 2 "extra"]
         ["e1" :int 3 "extra"]
         ["e1" :int 4 "extra"]
         ["e1" :int 5 "extra"]
         ["e2" :int 2 "extra"])}]]

This query had exactly one data-pattern clause in it: [?e :int ?v ?extra]. The extrel method invocation corresponds to this clause. Even though there are two possible values of ?e, extrel was only called once. This is because extrel’s responsibility isn’t to join against the results of previous binding clauses, but only to provide relations which can be determined from a static examination of the query and its :in arguments. (We’ll illustrate this point better later.)

This call illustrates the structure of the consts, starts, and whiles parameters. All three parameters will have the same length as the data-pattern clause, less the optional src-var (in this case $). The call may include information from other surrounding clauses.

consts contains all constant values, or nil if the value in that slot is not constant. Note that only :int is constant. ?e is not considered constant even though it has a ground value and could be known statically. Instead, the result of the ground will be joined against the result of this clause later–remember, extrel is not about joining.

Subrange optimizations

starts and whiles are an optimization available for datasources which are able to return a subset of values. Through knowledge of the primitive predicates < and = used with static arguments, datalog was able to determine that ?v must be >= 1 and ?extra must start with the string "extra". Therefore the corresponding slots of starts have non-nil values where a containing start value is known: [nil nil 1 "extra"].

whiles is the same information, but for the end of the range. Each item in the corresponding slot is a predicate which returns false if the value in that slot of a candidate tuple is outside the end of the range.

;; *1 is the previous result
(let [[_ _ while-v while-extra] (-> *1 peek first :args peek)]
  (mapv (juxt identity while-v) (range 6)))
=> [[0 true] [1 true] [2 true] [3 true] [4 false] [5 false]]

A datasource can use the starts and whiles information–especially if its data is available in a sorted order–to avoid returning tuples which could never possibly join with anything else in the query. All a sorted datasource has to do is start seeking at the starts slot of its choice and take-while with the corresponding whiles slot predicate.
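Here is a minimal sketch of that seek-and-take-while idea over a sorted collection of tuples keyed on slot 0 (illustrative only, not Datomic’s implementation):

(defn subrange
  "Tuples whose slot-0 value is >= start-v and still satisfies while-pred."
  [sorted-tuples start-v while-pred]
  (cond->> sorted-tuples
    start-v    (drop-while #(neg? (compare (nth % 0) start-v)))
    while-pred (take-while #(while-pred (nth % 0)))))

(subrange [[1 :a] [2 :b] [3 :c] [4 :d]] 2 #(< % 4))
;; => ([2 :b] [3 :c])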

However, applying starts and whiles is (as far as I can tell) completely optional for the correctness of the query. If a datasource understands them it can leverage them to reduce the size of relations it returns and thus the number of items involved in subsequent joins, but it is only required to return items which satisfy consts.

You’ll notice in the trace above that the extrel method of j.u.Collection included ["e1" :int 5 "extra"] even though this doesn’t satisfy whiles. From what I can tell, the j.u.Collection implementation only filters all items by consts and doesn’t use starts or whiles.

Datomic Database Datasources

However, a Datomic database does provide sorted items, and can leverage starts and whiles to reduce its result-set. Based on the non-nil consts, starts, and whiles slots it can choose an appropriate index to seek. For example, if the attribute is known and its values are indexed and the value start or “while” is known it can do a sub-seek of :avet.

Let’s trace the extrel call of an actual Datomic datasource. We’ll use a freshly-created dev connection instead of an in-memory connection so that io-stats will tell us what indexes we are using.

(let [[ds t] (extrel-trace db)]
  (-> (d/query {:query       '[:find ?v
                               :where
                               [?e :db/ident ?v]
                               [(<= :db/a ?v)]
                               [(< ?v :db/b)]]
                :args        [ds]
                :query-stats true
                :io-context  :user/query})
      (assoc :extrel-trace @t)))
=>
{:ret #{[:db/add]},
 :io-stats {:io-context :user/query,
            :api :query,
            :api-ms 5.18,
            :reads {:avet 1, :dev 1, :ocache 1, :dev-ms 1.57, :avet-load 1}},
 :query-stats {:query [:find ?v :where [?e :db/ident ?v] [(<= :db/a ?v)] [(< ?v :db/b)]],
               :phases [{:sched (([?e :db/ident ?v] [(<= :db/a ?v)] [(< ?v :db/b)])),
                         :clauses [{:clause [?e :db/ident ?v],
                                    :rows-in 0,
                                    :rows-out 1,
                                    :binds-in (),
                                    :binds-out [?v],
                                    :preds ([(<= :db/a ?v)] [(< ?v :db/b)]),
                                    :expansion 1,
                                    :warnings {:unbound-vars #{?v ?e}}}]}]},
 :extrel-trace [{:args [[nil :db/ident nil]
                        [nil nil :db/a]
                        [nil
                         nil
                         #object[datomic.datalog$ranges$fn__18296$fn__18300 0x3461f77c "datomic.datalog$ranges$fn__18296$fn__18300@3461f77c"]]],
                 :ret #object[datomic.datalog.DbRel 0x2e373360 "datomic.datalog.DbRel@2e373360"]}]}

There are three things to note here:

  1. Alias resolution is the responsibility of the datasource.
  2. Index choice is partially the responsibility of extrel.
  3. The return value of extrel is not necessarily a concrete collection but anything that implements datomic.datalog/IJoin.

First, the :db/ident attribute keyword constant was supplied to the query. Datoms don’t have attribute idents (keywords) in them; rather they have the attribute’s entity id. The Datomic database datasource must translate this to an entity id number. This means if the datasource has any aliasing mechanism that allows queries to refer to values in relations by anything other than their raw value, it’s the responsibility of the datasource to normalize those aliases into their canonical form.
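For example, the ident :db/doc is just a name for an attribute entity (id 62 in the datoms shown later in this post), and the datasource has to perform this translation before it can consult its indexes:

(d/entid db :db/doc)
;; => 62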

Second, notice from the :io-stats information that the query used the :avet index for its reads. This index choice is also the responsibility of the datasource. In this case, it used the pattern of consts, start, and while to choose the :avet index.

If we don’t supply something that the datalog engine can recognise as a start or while parameter, the datasource may choose a different index:

;; Using a custom predicate to hide the subrange selection from datalog
(defn db-starts-with-a [x]
  (and (= "db" (namespace x))
       (.startsWith (name x) "a")))

(let [[ds t] (extrel-trace db)]
  (-> (d/query {:query       '[:find ?v
                               :where
                               [?e :db/ident ?v]
                               [(user/db-starts-with-a ?v)]]
                :args        [ds]
                :query-stats true
                :io-context  ::query})
      (assoc :extrel-trace @t)))

=>
{:ret #{[:db/add]},
 :io-stats {:io-context :user/query,
            :api :query,
            :api-ms 5.56,
            :reads {:aevt 2, :dev 2, :aevt-load 2, :ocache 2, :dev-ms 2.28}},
 :query-stats {:query [:find ?v :where [?e :db/ident ?v] [(user/db-starts-with-a ?v)]],
               :phases [{:sched (([?e :db/ident ?v] [(user/db-starts-with-a ?v)])),
                         :clauses [{:clause [?e :db/ident ?v],
                                    :rows-in 0,
                                    :rows-out 1,
                                    :binds-in (),
                                    :binds-out [?v],
                                    :preds ([(user/db-starts-with-a ?v)]),
                                    :expansion 1,
                                    :warnings {:unbound-vars #{?v ?e}}}]}]},
 :extrel-trace [{:args [[nil :db/ident nil] [nil nil nil] [nil nil nil]],
                 :ret #object[datomic.datalog.DbRel 0x99687c4 "datomic.datalog.DbRel@99687c4"]}]}

Notice in this case the io-stats reports reading two :aevt index segments instead of one :avet segment; but the query stats look mostly the same except for the :preds clause. In this case the extrel returned something which would seek all idents instead of a subset of them, so more (potential) IO was performed.

Why didn’t :query-stats show this difference? It still reports “rows-out” as 1. This is because of the third thing to notice, which is that the extrel call didn’t return a collection but something called a DbRel. What is this?

The IJoin Protocol

Datasources actually have a pair of protocols which are needed to evaluate a query. The first one is ExtRel, which we have just covered in detail. But the second one is what extrel returns. Although the simple builtin extrel implementations simply return a collection, what extrel is actually expected to return is something which implements datomic.datalog/IJoin.

datomic.datalog/IJoin
=>
{:on datomic.datalog.IJoin,
 :on-interface datomic.datalog.IJoin,
 :sigs {:join-project {:name join-project, :arglists ([xs ys join-map project-map-x project-map-y predctor]), :doc nil},
        :join-project-with {:name join-project-with,
                            :arglists ([ys xs join-map project-map-x project-map-y predctor]),
                            :doc nil}},
 :var #'datomic.datalog/IJoin,
 :method-map {:join-project-with :join-project-with, :join-project :join-project},
 :method-builders {#'datomic.datalog/join-project #object[datomic.datalog$fn__17492 0x1a9d7f3c "datomic.datalog$fn__17492@1a9d7f3c"],
                   #'datomic.datalog/join-project-with #object[datomic.datalog$fn__17513 0x27dd69e9 "datomic.datalog$fn__17513@27dd69e9"]},
 :impls {java.util.Collection {:join-project #object[datomic.datalog$fn__17586 0x1d8e070c "datomic.datalog$fn__17586@1d8e070c"],
                               :join-project-with #object[datomic.datalog$fn__17588 0x32170047 "datomic.datalog$fn__17588@32170047"]},
         java.lang.Object {:join-project #object[datomic.datalog$fn__17590 0x4302eda6 "datomic.datalog$fn__17590@4302eda6"],
                           :join-project-with #object[datomic.datalog$fn__17592 0xfdc460e "datomic.datalog$fn__17592@fdc460e"]},
         datomic.datalog.DbRel {:join-project #object[datomic.datalog$fn__17661 0x6287869e "datomic.datalog$fn__17661@6287869e"],
                                :join-project-with #object[datomic.datalog$fn__17663 0x6898a7b1 "datomic.datalog$fn__17663@6898a7b1"]}}}

Note that j.u.Collection implements IJoin, which is why you can return a normal collection from extrel.

From this protocol map, we know the protocol definition looked something like this:

(defprotocol IJoin
  (join-project [xs ys join-map project-map-x project-map-y predctor])
  (join-project-with [xs ys join-map project-map-x project-map-y predctor]))

I must confess I have no idea what join-project is for. I’ve never observed it invoked.

However, join-project-with is the method that performs a join, projection, and filtering from two IJoin-ables xs and ys. (Read “project” as a verb, not a noun.)

Here’s an instrumented example of the join-project-with call. The code below reifies an ExtRel datasource which returns a reified IJoin.

(def trace (atom []))
(d/query {:query '[:find ?e ?a ?v
                   :in $
                   :where
                   [?e :int ?i]
                   [(< 1 ?i)]
                   [(<= ?i 2)]
                   [?e :str ?str]
                   [(clojure.string/starts-with? ?str "foo")]
                   [?e ?a ?v]
                   ]
          :args
          (let [ds [["e1" :int 1]
                    ["e1" :int 2]
                    ["e1" :str "foo"]
                    ["e1" :str "bar"]
                    ["e2" :int 1]
                    ["e2" :str "baz"]]]
            [(reify ExtRel
               (extrel [_ consts starts whiles]
                 (let [xs (datomic.datalog/extrel ds consts starts whiles)]
                   (swap! trace conj {:fn 'extrel :args [consts starts whiles] :ret xs})
                   (reify
                     IJoin
                     (join-project-with [_ ys join-map project-map-x project-map-y predctor]
                       (let [r (datomic.datalog/join-project-with
                                xs
                                ys
                                join-map
                                project-map-x
                                project-map-y
                                predctor)]
                         (swap! trace conj
                                {:fn   'join-project-with
                                 :args [xs ys join-map project-map-x project-map-y predctor]
                                 :ret  r})
                         r))))))])})

Now let’s look at the trace, which I have annotated inline:

@trace
[
 ;; This is for the clauses [?e :int ?i] [(< 1 ?i)] [(<= ?i 2)]
 {:fn extrel,
  :args [[nil :int nil]
         [nil nil 1]
         [nil
          nil
          #object[datomic.datalog$ranges$fn__18296$fn__18300 0x5474264c "datomic.datalog$ranges$fn__18296$fn__18300@5474264c"]]],
  :ret (["e1" :int 1] ["e1" :int 2] ["e2" :int 1])}
 ;; Now we join the result against an empty initial result set
 {:fn join-project-with,
  :args [(["e1" :int 1] ["e1" :int 2] ["e2" :int 1]) ;; previous extrel
         #{[]}                                       ;; initial result set
         {}                                          ;; no joins
         ;; Projection of xs:
         ;; Put slot 2 in xs into slot 0 in the result
         ;; Put slot 0 in xs into slot 1 in the result
         {2 0, 0 1}
         ;; Projection of ys: keep nothing
         {}
         ;; This is a predicate constructor.
         ;; When called, it will return predicates which should be called
         ;; to filter results.
         ;; This is why `extrel` doesn't need to honor `starts` and `whiles`--
         ;; this is what *really* does the filtering.
         #object[datomic.datalog$push_preds$fn__18015$fn__18027 0x33a1ac5e "datomic.datalog$push_preds$fn__18015$fn__18027@33a1ac5e"]],
  :ret #{[2 "e1"]}}
 ;; Now we get the extrel for [?e :str ?str]
 ;; Note the `starts-with?` predicate is not included.
 {:fn extrel,
  :args [[nil :str nil] [nil nil nil] [nil nil nil]],
  :ret (["e1" :str "foo"] ["e1" :str "bar"] ["e2" :str "baz"])}
 ;; Now join this extrel with the result of the previous IJoin
 {:fn join-project-with,
  :args [(["e1" :str "foo"] ["e1" :str "bar"] ["e2" :str "baz"])
         ;; This is the result of the previous IJoin
         ;; It is *always* a set.
         ;; Note the tuple slots correspond to the projection maps.
         #{[2 "e1"]}
         ;; Join slot 0 in xs to slot 1 in ys
         ;; Here, it means only include tuples with "e1"
         {0 1}
         ;; Project xs 2 to 0, 0 to 1
         {2 0, 0 1}
         ;; Project ys 1 to 1
         ;; Since 1 in the result is the join target of xs and ys,
         ;; this is ok--they will never conflict.
         {1 1}
         ;; This has the `starts-with?` predicate in it.
         #object[datomic.datalog$push_preds$fn__18015$fn__18027 0x2590becd "datomic.datalog$push_preds$fn__18015$fn__18027@2590becd"]],
  :ret #{["foo" "e1"]}}
 ;; This final extrel is for the clause [?e ?a ?v]
 ;; On datomic databases, this would eventually throw because it is a full scan.
 ;; That behavior is from the datasource implementation, not the datalog!
 {:fn extrel,
  :args [[nil nil nil] [nil nil nil] [nil nil nil]],
  :ret [["e1" :int 1] ["e1" :int 2] ["e1" :str "foo"] ["e1" :str "bar"] ["e2" :int 1] ["e2" :str "baz"]]}
;; The final projection is to extract what the `:find` clause wants.
 {:fn join-project-with,
  :args [[["e1" :int 1] ["e1" :int 2] ["e1" :str "foo"] ["e1" :str "bar"] ["e2" :int 1] ["e2" :str "baz"]]
         #{["foo" "e1"]}
         {0 1}
         {0 0, 1 1, 2 2}
         {1 0}
         #object[datomic.datalog$truep 0x4922463b "datomic.datalog$truep@4922463b"]],
  :ret #{["e1" :str "bar"] ["e1" :int 1] ["e1" :int 2] ["e1" :str "foo"]}}]

If you compare this trace with the :query-stats output of the query, you’ll notice that its data roughly corresponds to join-project-with invocations (including row counts) more than the extrel invocations. In general, :io-stats tells you more about the relations from extrel and :query-stats about join-project-with calls.

Realizing Relations in IJoin

The existence of IJoin as a protocol allows ExtRel to defer realization of the relation until it receives more information from join parameters. In this case, DbRel isn’t actually reading any datoms–this happens during its IJoin, where it can make better index choices.

This ExtRel vs IJoin split also explains a lot of seemingly inconsistent behavior in datalog queries around lookup-ref resolution. The inconsistency often comes down to whether the lookup could be resolved at extrel time or at join-project-with time.

Take the following query as an example.

(let [[ds t] (extrel-trace db)]
  (-> (d/query {:query       '[:find ?v
                               :in $ [?a ?v]
                               :where
                               [:db.part/db ?a ?v]]
                :args        [ds [:db.install/attribute :db/doc]]
                :io-context :user/query
                :query-stats true})
      (assoc :extrel-trace @t)))

=>
{:ret #{[:db/doc]},
 :io-stats {:io-context :user/query,
            :api :query,
            :api-ms 4.26,
            :reads {:aevt 1, :dev 1, :aevt-load 1, :ocache 1, :dev-ms 1.11}},
 :query-stats {:query [:find ?v :in $ [?a ?v] :where [:db.part/db ?a ?v]],
               :phases [{:sched (([(ground $__in__2) [?a ?v]] [:db.part/db ?a ?v])),
                         :clauses [{:clause [(ground $__in__2) [?a ?v]],
                                    :rows-in 0,
                                    :rows-out 1,
                                    :binds-in (),
                                    :binds-out [?a ?v],
                                    :expansion 1}
                                   {:clause [:db.part/db ?a ?v],
                                    :rows-in 1,
                                    :rows-out 1,
                                    :binds-in [?a ?v],
                                    :binds-out [?v]}]}]},
 :extrel-trace [{:args [[:db.part/db :db.install/attribute :db/doc] [nil nil nil] [nil nil nil]],
                 :ret #object[datomic.datalog.DbRel 0x750e1765 "datomic.datalog.DbRel@750e1765"]}]}

In this query the [?a ?v] is a single tuple and not a relation, so the value of ?a and ?v are effectively constant. Thus datalog knows it can supply them as consts to extrel, which can interpret them as lookups. And you can see in the extrel-trace that :db.install/attribute and :db/doc were both provided, so extrel knew the attribute was a ref and the value should be resolved to a ref.

But the following seemingly semantically identical query gives a different result:

(let [[ds t] (extrel-trace db)]
  (-> (d/query {:query       '[:find ?v
                               :in $ [[?a ?v]]
                               :where
                               [:db.part/db ?a ?v]]
                :args        [ds [[:db.install/attribute :db/doc]]]
                :io-context  :user/query
                :query-stats true})
      (assoc :extrel-trace @t)))

=>
{:ret #{},
 :io-stats {:io-context :user/query,
            :api :query,
            :api-ms 4.55,
            :reads {:aevt 6, :dev 4, :aevt-load 6, :ocache 6, :dev-ms 3.81}},
 :query-stats {:query [:find ?v :in $ [[?a ?v]] :where [:db.part/db ?a ?v]],
               :phases [{:sched (([(ground $__in__2) [[?a ?v]]] [:db.part/db ?a ?v])),
                         :clauses [{:clause [(ground $__in__2) [[?a ?v]]],
                                    :rows-in 0,
                                    :rows-out 1,
                                    :binds-in (),
                                    :binds-out [?a ?v],
                                    :expansion 1}
                                   {:clause [:db.part/db ?a ?v],
                                    :rows-in 1,
                                    :rows-out 0,
                                    :binds-in [?a ?v],
                                    :binds-out [?v]}]}]},
 :extrel-trace [{:args [[:db.part/db nil nil] [nil nil nil] [nil nil nil]],
                 :ret #object[datomic.datalog.DbRel 0x329effe3 "datomic.datalog.DbRel@329effe3"]}]}

In this case, the ?a and ?v were part of a relation and so were not provided to the extrel of the data-pattern clause for the Datomic datasource. For this to work correctly, the alias resolution would have to happen in the join-project-with, where it is more difficult to determine if a value should be resolved as a lookup ref.

Note that the :query-stats for both queries have nearly identical row counts because they would be the same from the perspective of the join-project-with clauses. Only the difference in :io-stats reveals that the extrel call for the second query was clearly seeking more datoms.

Implementing your own IJoin-able is more difficult than your own ExtRel because you need to rely on even more interfaces to actually perform the projection, joining, and filtering required. Probably anything that is reduce-able and whose elements are indexed will work, but I don’t know for sure and I won’t explore it here.

However, ExtRel is pretty easy to fulfil!

Custom Datasources

So far we have just looked at “built-in” datasources: collections and Datomic databases. But because its behavior is governed by the ExtRel protocol we can create our own datasources by implementing this protocol.

We just need to follow these rules:

  • ExtRel takes consts, starts, and take-while predicates and returns a relation implementing IJoin which only supplies tuples matching consts. Implementations may use starts and whiles to return a subset of things matching consts.
  • IJoin performs projection, unification, and filtering against another IJoin and returns the result as another IJoin-able, typically a set of tuples. The IJoin object may close over information from the ExtRel that supplied it to defer decisions to join-time if it wants.

Let’s look at some simple examples of useful custom datasources.

Transaction Data with Ident Syntax for Attributes

Tx-data from a transaction is a set of datoms.

(:tx-data (d/with db [{:db/id "new" :db/doc "My new entity"}]))
=>
[#datom[13194139534312 50 #inst"2024-06-18T21:07:05.141-00:00" 13194139534312 true]
 #datom[17592186045417 62 "My new entity" 13194139534312 true]]

Conceptually, this is the same as a Datomic database. However, as we have seen, the extrel of a Datomic database does ident resolution of attributes that normal collections don’t, so this query doesn’t work:

(let [{:keys [tx-data db-after]} (d/with db [{:db/id "new" :db/doc "My new entity"}])
      query '[:find ?e :where [?e :db/doc "My new entity"]]]
  [
   (d/q query db-after)
   (d/q query tx-data)
   ])
=> [#{[17592186045417]} ;; Works as expected with a normal db
    #{}]                ;; Fails with tx-data

But, what if it could work? All we need is an extrel that resolves attributes in its consts to their entity id:

(defn tx-data-extrel [db tx-data]
  (reify ExtRel
    (extrel [_ consts _starts _whiles]
      (let [resolve-ref #(d/entid db %)
            [c-e c-araw c-vraw c-tx c-op] consts
            c-e (resolve-ref c-e)
            attr (when (some? c-araw)
                   (or (d/attribute db c-araw)
                       (throw (ex-info "Unknown attribute reference" {:attr c-araw}))))
            c-a (:id attr)
            ref-v? (= :db.type/ref (:value-type attr))
            c-v (if ref-v?
                  (when (some? c-vraw)
                    (or (d/entid db c-vraw)
                        (throw (ex-info "Could not resolve entity reference" {:ref c-vraw}))))
                  c-vraw)
            c-tx (resolve-ref c-tx)
            xfs (cond-> []
                        (some? c-e) (conj (filter #(== ^long c-e ^long (:e %))))
                        (some? c-a) (conj (filter #(== ^long c-a ^long (:a %))))
                        (some? c-v) (conj (filter #(= c-v (:v %))))
                        (some? c-tx) (conj (filter #(== ^long c-tx ^long (:tx %))))
                        (some? c-op) (conj (filter #(= c-op (:op %)))))]
        (into [] (apply comp xfs) tx-data)))))

Now if we use this to wrap the tx-data, we can query using attribute idents:

(let [{:keys [tx-data db-after]} (d/with db [{:db/id "new" :db/doc "My new entity"}])
      query '[:find ?e :where [?e :db/doc "My new entity"]]]
  (d/q query (tx-data-extrel db-after tx-data)))
=> #{[17592186045417]}

Voilà!

Notice that we return a normal collection as the IJoin-able, which means that attributes that are supplied as relations still won’t resolve.

For example:

(let [{:keys [tx-data db-after]} (d/with db [{:db/id "new" :db/doc "My new entity"}])
      query '[:find ?e
              :where
              [(ground [:db/doc]) [?a ...]]
              ;; ?a is not recognized as a scalar constant
              [?e ?a "My new entity"]]]
  (d/q query (tx-data-extrel db-after tx-data)))
=> #{} ;; Does not match!

Compare with a normal Datomic database, where the DbRel or something in it is doing some resolution of attribute references:

(let [{:keys [tx-data db-after]} (d/with db [{:db/id "new" :db/doc "My new entity"}])
      query '[:find ?e
              :where
              [(ground [:db/doc]) [?a ...]]
              [?e ?a "My new entity"]]]
  (d/q query db-after))
=> #{[17592186045417]} ;; Works!

I’ll leave that enhancement as an exercise for the reader!

Subset Relations from Sorted Sets

Perhaps you have a datasource which is large and sorted. You could just use it as a normal j.u.Collection, but as we saw, the built-in implementation can’t take advantage of sorted-ness.

But what if you could use starts and whiles information to return a smaller relation and join across fewer rows?

Perhaps like this:

(defn sorted-set-extrel [s width]
  ;; width is how many slots per element tuple, which is needed because the
  ;; default comparator compares length before element content.
  ;; Assumes set `s` is sorted by its elements and does not contain nil anywhere.
  ;; Does not assume set has a nil-safe custom comparator, but it's a good enhancement!
  (reify ExtRel
    (extrel [_ consts starts whiles]
      ;; We always have to filter by constants
      (let [consts-pred (if-some [preds (not-empty
                                         (into []
                                               (comp
                                                (map-indexed (fn [i v] (when v #(= v (nth % i)))))
                                                (filter some?))
                                               consts))]
                          (apply every-pred preds)
                          (constantly true))
            ;; Lets see if "starts" has non-nil prefixes; if so we can use them!
            usable-starts (when-some [prefix (not-empty
                                              (take-while some? starts))]
                            ;; Padding the suffix for the builtin comparator
                            (vec (concat prefix
                                         (repeat (- width (count prefix)) nil))))
            ;; Use non-nil prefix predicates from whiles too
            usable-whiles (not-empty
                           (into []
                                 (comp
                                  (take-while some?)
                                  (map-indexed (fn [i f] #(f (nth % i)))))
                                 whiles))]
        (filterv consts-pred
                 (if usable-starts
                   (cond->> (subseq s >= usable-starts)
                            usable-whiles (take-while (apply every-pred usable-whiles)))
                   s))))))

Let’s use a 16000-row set as an example.

(def myset (apply sorted-set
                  (for [i (range 1000)
                        j ["a" "b" "c" "d"]
                        k ["a" "b" "c" "d"]]
                    [i j k])))

(count myset)
=> 16000

If we query a range of this set as a normal collection, many more items will be returned from the extrel call than the query needs. We won’t see any difference in the row-counts in :query-stats, but we should see a higher :api-ms in :io-stats.

(def query
  '[:find ?i ?j ?k
    :where
    [(ground "b") ?k]
    [?i ?j ?k]
    [(< 10 ?i)]
    [(< ?i 13)]])

Running this query a few times on a normal set, :api-ms stabilizes at about 8 ms on my machine.

(d/query {:query query :args [myset] :query-stats true :io-context ::query})
=>
{:ret #{[11 "d" "b"] [11 "c" "b"] [12 "d" "b"] [11 "b" "b"] [12 "c" "b"] [11 "a" "b"] [12 "b" "b"] [12 "a" "b"]},
 :io-stats {:io-context :user/query, :api :query, :api-ms 8.1, :reads {}},
 :query-stats {:query [:find ?i ?j ?k :where [(ground "b") ?k] [?i ?j ?k] [(< 10 ?i)] [(< ?i 13)]],
               :phases [{:sched (([(ground "b") ?k] [?i ?j ?k] [(< 10 ?i)] [(< ?i 13)])),
                         :clauses [{:clause [(ground "b") ?k],
                                    :rows-in 0,
                                    :rows-out 1,
                                    :binds-in (),
                                    :binds-out [?k],
                                    :expansion 1}
                                   {:clause [?i ?j ?k],
                                    :rows-in 1,
                                    :rows-out 8,
                                    :binds-in [?k],
                                    :binds-out [?k ?j ?i],
                                    :preds ([(< 10 ?i)] [(< ?i 13)]),
                                    :expansion 7}]}]}}

But if we use the custom ExtRel, it’s a bit under 1 ms!

(d/query {:query query :args [(sorted-set-extrel myset 3)] :query-stats true :io-context ::query})
=>
{:ret #{[11 "d" "b"] [11 "c" "b"] [12 "d" "b"] [11 "b" "b"] [12 "c" "b"] [12 "b" "b"] [11 "a" "b"] [12 "a" "b"]},
 :io-stats {:io-context :user/query, :api :query, :api-ms 0.82, :reads {}},
 :query-stats {:query [:find ?i ?j ?k :where [(ground "b") ?k] [?i ?j ?k] [(< 10 ?i)] [(< ?i 13)]],
               :phases [{:sched (([(ground "b") ?k] [?i ?j ?k] [(< 10 ?i)] [(< ?i 13)])),
                         :clauses [{:clause [(ground "b") ?k],
                                    :rows-in 0,
                                    :rows-out 1,
                                    :binds-in (),
                                    :binds-out [?k],
                                    :expansion 1}
                                   {:clause [?i ?j ?k],
                                    :rows-in 1,
                                    :rows-out 8,
                                    :binds-in [?k],
                                    :binds-out [?k ?j ?i],
                                    :preds ([(< 10 ?i)] [(< ?i 13)]),
                                    :expansion 7}]}]}}

Note that the :query-stats isn’t any different, only the :io-stats!

Time-traveling Datasources

One limitation of Datomic queries is that you cannot re-parameterize a database within a query. So for example, you can’t do something like this:

[:find ?e ?as-of-tx ?v
 :in $normal $history
 :where
 ;; At moments when ?e :my/attr changed ...
 [$history ?e :my/attr _ ?as-of-tx]
 ;; ... what was the corresponding value of :my/other-attr ?
 [$normal-but-as-of-tx ?e :my/other-attr ?v]]

But what would it take to make this possible? An approach could be to extend the tuple syntax to pretend that the first slot is an as-of parameter, like so:

[$ ?as-of-tx ?e :my/other-attr ?v]

The extrel implementation could do something like this:

(reify ExtRel
  (extrel [_ consts starts whiles]
    (datomic.datalog/extrel
     (d/as-of normal-db (first consts))
     (rest consts)
     (rest starts)
     (rest whiles))))

But this only works in the simplest possible case where the as-of value (first consts) is non-nil (i.e. a query-constant). Even in our example query this is not the case. To get the sample query working as expected, we need to return something IJoin-compatible which interprets the first tuple slot as an as-of parameter during the join-project-with, uses that to set the as-of of the database, then delegates the join to the DbRel of that database.

In other words, something like this:

(defn as-of-db [db]
  (reify ExtRel
    (extrel [_ consts starts whiles]
      (let [as-of (first consts)
            consts (vec (rest consts))
            starts (vec (rest starts))
            whiles (vec (rest whiles))]
        (if (some? as-of)
          ;; as-of is query scalar constant
          (datomic.datalog/extrel (d/as-of db as-of) consts starts whiles)
          ;; as-of will come from a later join
          (reify IJoin
            (join-project-with [_ ys join-map pmx pmy predctor]
             ;; Our column 0 is the as-of value.
             ;; What column in ys should join to it?
              (let [as-of-column-idx (get join-map 0)
                    _ (when (nil? as-of-column-idx) (throw (ex-info "as-of must be bound!" {})))
                    ;; Turn the single ys into a bunch of ys grouped by db-as-of
                    ys-by-as-of (group-by #(nth % as-of-column-idx) ys)
                    ;; We need to rewrite the join and project-x maps to reflect that the as-of column doesn't exist in datoms
                    join-map' (into {}
                                    (keep (fn [[^long x ^long y]]
                                            [(dec x) y]))
                                    (dissoc join-map 0))
                    pmx' (into {}
                               (keep (fn [[^long x ^long y]]
                                       [(dec x) y]))
                               (dissoc pmx 0))]
                (into #{}
                      (mapcat (fn [[as-of ys']]
                                ;; Use the as-of value to set the database ...
                                (let [dbrel (datomic.datalog/extrel (d/as-of db as-of) consts starts whiles)]
                                  ;; and perform a join as usual!
                                  ;; Because the as-of value *must* be supplied by the ys *and* joined,
                                  ;; we know that if it is wanted in the output it will be projected via pmy,
                                  ;; so we can rely on the normal dbrel to assemble the output correctly for us.
                                  (datomic.datalog/join-project-with dbrel ys' join-map' pmx' pmy predctor))))
                      ys-by-as-of)))))))))

And a demonstration of use:

(let [{db :db-after} (d/with db [{:db/ident       :my/attr
                                  :db/valueType   :db.type/string
                                  :db/cardinality :db.cardinality/one}
                                 {:db/ident       :my/other-attr
                                  :db/valueType   :db.type/string
                                  :db/cardinality :db.cardinality/one}])
      {db :db-after} (d/with db [{:db/ident :my/entity
                                  :my/other-attr "oldvalue"}])
      {db :db-after} (d/with db [{:db/ident :my/entity
                                  :my/attr "ignored"}])
      {db :db-after} (d/with db [{:db/ident :my/entity
                                  :my/other-attr "newvalue"}])]
  (d/q
   '[:find ?e ?as-of-tx ?v
     :in $normal $history
     :where
     [$history ?e :my/attr _ ?as-of-tx]
     [$normal ?as-of-tx ?e :my/other-attr ?v]]
   (as-of-db db) (d/history db)))
=> #{[17592186045418 13194139534315 "oldvalue"]}

Pretty cool!

Summary

Datomic’s Datalog engine uses the ExtRel protocol on datasources to get relations–sets of tuples–which correspond to the data-pattern clauses in the query. The extrel call includes query-wide constants and sometimes start values and take-while predicates inferred from surrounding predicate-expression clauses. The start and take-while predicates are advisory and the datasource may use them to return a smaller relation.

The returned value must be an IJoin-able: its join-project-with method takes another IJoin-able, applies joins, projection, and filtering, and returns the next result-set. An extrel can defer realization of its relations to join-project-with, and in fact this is what Datomic databases do via the DbRel “relation”. The advantage of this is that you can make better index choices with more information about the join; the disadvantage is that it’s a lot more complicated!

A datasource is responsible for any aliasing behavior (such as keywords for attribute ids) in its implementations of extrel and join-project-with.

:query-stats gives information mostly about join-project-with calls; deferred realization of relation members can often hide the true number of rows examined by a datasource. Clues to realized-relation sizes come from :io-stats.

Finally, we looked at three toy examples of custom datasources:

  • The tx-data example interprets aliases for a more ergonomic query syntax (attribute idents instead of numbers) over normal collections.
  • The sorted-set example takes advantage of subrange information from extrel to produce smaller relations and faster query runtimes.
  • Finally, we added extra within-query as-of parameterization of Datomic databases via an additional “virtual” column of the relation. This dipped a bit into IJoin and messing with joins and projections, but left the heavy-lifting to the default Datomic database implementation.

I hope you found this intriguing and enlightening!

Permalink

Datomic Entity Id and Datom Internals

This is an update of a post I wrote in 2019 for a talk given at a Shortcut engineering Lunch and Learn while I worked there.

Introduction

Datomic’s immutable index structures get most of the attention, but there are even more foundational structures underneath them: two counters, the entity id, and the Datom.

Disclaimers

What follows is not meant for those new to Datomic. This is a pretty deep dive into internals and not all of this is officially documented.

It’s also very focused on Datomic on-prem specifically, and doesn’t investigate cloud. I suspect cloud’s entity id and Datom internals are very similar though.

Because these are internal implementation details, they can change at any time. You shouldn’t rely on any behavior that isn’t in Datomic’s official documentation–or if you do, make sure you have regression tests!

Caveat Lector out of the way, let’s get started.

The Counters

Every Datomic database has two counters:

  1. the T counter and
  2. an attribute and partition entity counter which I will call the “element” counter.

The T Counter

The T counter is 42 bits. It advances whenever most kinds of entity ids are created. Entity ids are created indirectly, via a temporary id (tempid) failing to resolve to an existing entity during a transaction. The counter is never rewound, even if the entity id ends up unused, which can happen if the transaction that created it is aborted. The T counter is kept at the root of the database tree, where it is called “next-t”. You can see its value using (d/next-t db) in Datomic on-prem–this will be the T of the next transaction entity.

Assume db is a freshly-created Datomic on-prem database:

(d/basis-t db)
=> 66

The next-t of fresh databases starts at 1000. Note that the next-t is always greater than the basis-t!

(d/next-t db)
=> 1000

(clojure.repl/doc d/next-t)
-------------------------
datomic.api/next-t
([db])
Returns the t one beyond the highest reachable via this db value.

The Element Counter

The schema and partition entity (or “elements”) counter is 19 bits. It advances whenever an attribute or partition entity id is created, and it is also never rewound. Unlike the T counter, this doesn’t seem to be stored as a separate counter, but derived from the size of a special cache.

Every database object keeps a fast in-memory cache of every attribute and partition entity in a vector called :elements. Data about each entity is stored at the index corresponding to its entity id. The size of this vector is the next value of the element counter.

There is no public api to the elements cache, but you can retrieve it from a database object using associative lookup:

(def elements (:elements db))

The index in elements is the entity id of the cached item:

(nth elements 0)
=> #datomic.db.Partition{:id 0,                     ;; entity id
                         :kw :db.part/db}           ;; ident

(nth elements 10)
=> #datomic.db.Attribute{:id             10,        ;; entity id
                         :kw             :db/ident, ;; ident
                         :vtypeid        21,        ;; value type
                         :cardinality    35,        ;; ... etc
                         :isComponent    false,
                         :unique         38,
                         :index          false,
                         :storageHasAVET true,
                         :needsAVET      true,
                         :noHistory      false,
                         :fulltext       false}

The vector has nil in indexes that don’t correspond to schema or partitions. For example, the :db/add primitive “transaction function”:

(d/pull db ['*] 1)
=> #:db{:id    1,
        :ident :db/add,
        :doc   "Primitive assertion. All transactions eventually [...]"}
;; :db/add is still special, but it's not an attribute or partition entity!
(nth elements 1)
=> nil

This is the size of the cache, thus the value of the counter:

(count elements)
=> 72

Thus the next attribute I create will have entity id 72:

@(d/transact conn [{:db/ident :my/attr
                    :db/valueType :db.type/long
                    :db/cardinality :db.cardinality/many
                    :db.install/_attribute :db.part/db}])

And now it’s in the element cache at index 72:

(nth (:elements (d/db conn)) 72)
=> #datomic.db.Attribute{:id             72,
                         :kw             :my/attr,
                         :vtypeid        22,
                         :cardinality    36,
                         :isComponent    false,
                         :unique         nil,
                         :index          false,
                         :storageHasAVET false,
                         :needsAVET      false,
                         :noHistory      false,
                         :fulltext       false}

What’s the point of these counters though? They’re for stuffing into entity ids!

The Entity Id

The entity id is the foundational data structure of a Datomic database. It is a 64-bit signed long with the following structure in big-endian order:

  1. The sign bit. If set, this is a temporary id (TempId).
  2. A seemingly-unused bit that is always unset. You can manually construct an entity id which has this bit set, and Datomic seems to honor it as-is, but there’s no public api way to set it. I don’t know what this is for.
  3. 20 bits of partition. The highest bit (labeled “PType” in the diagram below) indicates the type of partition number, discussed later.
  4. 42 bits of counter value. This is a number issued by the T or element counter.

Entity Id Structure

To help visualize entity id bits at the repl, we can use the following function:

(defn print-eid
  "Print the bits of a datomic entity id in base 2.
  Separates out the sign, unused, partition, and counter bits visually."
  [^long n]
  (let [s (Long/toBinaryString n)
        s (.concat (.repeat "0" (- 64 (.length s))) s)
        [_ sign unused part counter] (re-matches #"(\d)(\d)(\d{20})(\d{42})" s)]
    (println sign unused part counter)))

A demonstration:

(print-eid (d/t->tx 1000))
0 0 00000000000000000011 000000000000000000000000000000001111101000
|   \____ _____________/ \__ _____________________________________/
|        |                  |
|        |                  \_ Counter bits, in this case the number 1000 
|        |                     from the T counter.
|        \_ Partition bits, in this case the entity id of :db.part/tx
\_ Sign bit

Datomic’s public api to construct an entity id “from scratch” is d/entid-at. It takes a partition entity id (or an ident that refers to one) and a counter number. (It can also take a date, but that isn’t interesting to us right now.)

This is usually how you use it:

(d/entid-at db :db.part/db 1)
=> 1

(d/entid-at db :db.part/tx 1)
=> 13194139533313

(d/entid-at db :db.part/user 1)
=> 17592186044417

But you can also use it with raw partition ids.

;; The :db.part/user partition is 4
(= (d/entid-at db :db.part/user 1)
   (d/entid-at db 4 1))
=> true

Note that this use case doesn’t actually need a database–it’s just bit manipulation–but the function still requires one because it’s a wrapper around a method invocation on the database object.
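To make that bit manipulation concrete, here is a rough sketch (mine, not the actual implementation) of what entid-at computes for an explicit partition id: shift the partition entity id up past the 42 counter bits and OR in the counter value.

;; Sketch only: what entid-at computes for an *explicit* partition id.
;; Implicit partitions (discussed below) are encoded differently.
(defn entid-at* [^long part-eid ^long t]
  (bit-or (bit-shift-left part-eid 42) t))

(= (entid-at* 4 1) (d/entid-at db :db.part/user 1))
=> true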

d/t->tx is a special case of entid-at for the transaction partition that doesn’t require a database argument. It doesn’t need one because the transaction partition entity id is hardcoded in every Datomic database.

(= (d/entid-at db :db.part/tx 1)
   (d/entid-at db 3 1)
   (d/t->tx 1))
=> true

Let’s start examining the parts of an Entity Id.

The Counter Field

The counter bits of an entity correspond to the value of the T or element counter at the moment the entity was created.

Entities are created when a tempid exists in transaction data but cannot be resolved to an existing entity id. There is always at least one of these in any transaction: the current transaction itself!

When the transaction-data expander determines it needs to “mint” a new entity, it constructs an entity id from a partition value and either the T counter or element counter, then advances the counter. Partition and attribute entities advance the element counter, and all other entities advance the T counter.

(Determining what partition value to use is complicated–I won’t discuss it here.)

The current transaction is always the first to receive the next-T. As a consequence, the Ts of transaction ids interleave with the Ts of entity ids created within the prior transaction. This allows you to perform tricks with d/entid-at and d/seek-datoms to find recently-created entities without using the transaction log.
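For example, here is a rough sketch of that trick (my own; it assumes the entities you are after were created in :db.part/user and that you noted a T value earlier):

;; Construct the smallest possible :db.part/user entity id whose counter is at
;; least some previously observed T, then walk the EAVT index forward from it.
(let [some-t 1000 ;; e.g. a basis-t you noted earlier
      start-eid (d/entid-at db :db.part/user some-t)]
  (->> (d/seek-datoms db :eavt start-eid)
       (map :e)
       (distinct)
       (take 10)))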

The public api to access the counter field value is d/tx->t. You’ll notice from its name that it’s meant for transaction ids and T values, but it actually works on any entity id–it just masks out any bits of the entity id that don’t belong to the counter field.

(d/tx->t (d/entid-at db :db.part/user 1))
=> 1
(d/tx->t (d/entid-at db :db.part/tx 1))
=> 1

Because the next-T is issued to new entities without considering partitions, adding partitions doesn’t let you have more entity ids in your Datomic database–the 42 bits of the counter field bound the theoretical maximum number of non-attribute, non-partition entities. Why have partitions at all then?

The Partition Field

Partitions are a mechanism to sort entity ids better, according to some criteria other than creation order. The partition bits are the 20 immediately more-significant bits above the 42 bits of T so that the natural sort order of longs will collate entities with the same partition next to one another. They are a crucial performance optimization because they allow you to sort Datoms into “runs” that are commonly read together and improve the chance that any given query will make use of already-cached index segments. Partitions can also reduce the number of index segments invalidated by new indexes if the writes exhibit some locality too.

Datomic itself uses this to keep transaction entities away from schema entities and user data. Schema entities have partition :db.part/db (always entity id 0) and transaction entities have partition :db.part/tx (always entity id 3), and the default partition for new data is :db.part/user (always entity id 4).

The value in the bits of the partition field has gotten complicated.

Prior to Datomic version 1.0.6711, this was simply the entity id of a partition entity. You can retrieve that entity id with d/part.

(def explicit-part-eid (d/entid-at db :db.part/user 1))

(d/part explicit-part-eid)
=> 4

(d/ident db 4)
=> :db.part/user

(print-eid explicit-part-eid)
0 0 00000000000000000100 000000000000000000000000000000000000000001

On version 1.0.6711 and afterwards, this can also be an implicit partition number if the highest bit of this field (labeled “PType” above) is 1.

d/implicit-part constructs an entity-id where the counter bits are zero, the PType bit is set, and the implicit-partition-number is shifted over into the partition bit fields.

(def mypart (d/implicit-part 1))
mypart
=> 2305847407260205056

(print-eid mypart)
0 0 10000000000000000001 000000000000000000000000000000000000000000
;;  |--- Note PType bit is set.

Unlike explicit partitions which are always in partition 0 (:db.part/db), the partition of an implicit partition entity id is always itself. The contract of d/part is that it gives you an entity-id that can be used as a partition id, not (or no longer) that it gives you the partition field bits. Implicit partitions are just encoded into the partition field differently than explicit ones.

Because d/part always returns an entity id, it returns implicit partition entity ids unchanged.

(= mypart (d/part mypart))
=> true

Note that implicit partitions are still real, valid entity ids, so you can still assert things about them:

(let [db (:db-after (d/with db [{:db/id (d/implicit-part 0)
                                 :db/doc "implicit partition 0"}]))]
  (d/pull db ['*] (d/implicit-part 0)))
=> #:db{:id 2305843009213693952, :doc "implicit partition 0"}

In a world with implicit partitions, there’s no public api to access the partition bits, but you can get them with this:

(defn partition-bits [^long eid]
  (let [p (d/part eid)]
    ;; implicit-part-id returns nil when given explicit partition ids.
    (if-some [ip (d/implicit-part-id p)]
      (bit-shift-right ^long (d/implicit-part ip) 42)
      p)))
(-> (d/entid-at db :db.part/user 1)
    (partition-bits))
=> 4
(-> (d/entid-at db (d/implicit-part 1) 1)
    (partition-bits)
    (Long/toBinaryString))
=> "10000000000000000001"

“Permanent” Entity Id Recap

We’ve discussed the structure of “permanent” (non-temporary) entity ids. Before we move on, let’s summarize:

  • Entity ids have 20 partition bits and 42 counter bits.
  • “Element” entities (attributes and explicit partitions) have this structure:
    • Partition bits zeroed out and d/part returns 0.
    • Counter bits correspond to the “element” counter value at the moment of entity creation.
  • Explicitly partitioned entities have this structure:
    • Top bit of partition bits is 0.
    • The remaining bits are the entity id of the explicit partition–which is itself also an “element” entity, and so representable with 19 bits anyway.
    • Counter bits correspond to the T counter value at the moment of entity creation.
  • Implicitly partitioned entities:
    • Top bit of partition bits is 1.
    • The remaining partition bits are the implicit-part-id (a number between 0 and 524287 inclusive–19 bits)
    • Counter bits correspond to the T counter value at the moment of entity creation.
  • Implicit partition entities themselves:
    • have the same partition bits as Implicitly Partitioned Entities
    • the counter bits are 0

You’ll notice we haven’t talked about the sign bit yet.

Temporary Entity Ids

A temporary entity id (tempid) is an entity id that is not meant to outlive a single transaction. Tempids are only valid in submitted tx-data and as keys of the :tempids map returned from d/transact; they exist only for the lifetime of transaction preparation and submission, and represent no cross-transaction identity.

They exist only to be replaced by either an existing or a new “permanent” entity id during a transaction.

In modern Datomic, there are three ways to represent a tempid: strings, tempid records, and negative entity ids.

(We won’t talk about the string method.)

Tempid Records

A tempid record is the thing returned by d/tempid. It’s just a record with two fields: a partition (which can be an entity id, implicit partition id, or an ident keyword that resolves to a partition entity) and a negative number called an idx.

When called with one argument, the idx value comes from a counter on the peer that starts at -1000001. This counter is unrelated to the T and element counters!

;; This is a fresh process to ensure the idx counter is at its starting value.
(require '[datomic.api :as d])
(def tempid (d/tempid :db.part/user))

;; Tempid records have a tagged-value printed form
tempid
=> #db/id[:db.part/user -1000001]

(:part tempid)
=> :db.part/user
(:idx tempid)
=> -1000001

;; idx is issued from a single per-peer counter.
(:idx (d/tempid :db.part/db))
=> -1000002
(:idx (d/tempid :db.part/tx))
=> -1000003

Because this counter is per-peer, there’s a chance of collision: you may call (d/tempid :db.part/user) within a peer preparing tx-data and within a transaction function of the same tx-data. To avoid collisions, transactors and peers use disjoint idx ranges as of version 0.9.5561.62 released in October 2017.

When d/tempid is called with two arguments you can set the idx value yourself, in the range from -1 to -1000000.

Tempid Longs

Tempid records can also be represented as a negative long using the entity id structure. When the sign bit of the entity id is set (i.e. the entity id is negative), the entity id represents a temporary id. The partition bits of that entity id correspond to the partition indicated by the partition field of the record, and the counter bits to the lower 42 bits of the negative number.

You see these tempid entity-ids returned from transactions:

(def tempid (d/tempid :db.part/user -100))

(:tempids (d/with db [{:db/id tempid :db/doc "foo"}]))
=> {-9223350046622220388 17592186045427}
(print-eid -9223350046622220388)
1 0 00000000000000000100 111111111111111111111111111111111110011100

The possibility of using idents to reference partitions is why you need d/resolve-tempid: it converts a tempid record to the equivalent tempid entity-id before looking it up in the :tempids map.
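For example, a quick sketch reusing the tempid and transaction from just above:

(let [tempid (d/tempid :db.part/user -100)
      {:keys [db-after tempids]} (d/with db [{:db/id tempid :db/doc "foo"}])]
  (d/resolve-tempid db-after tempids tempid))
=> 17592186045427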

There’s no public api to create tempid longs, but you can do it with a little bit-manipulation:

(defn tempid->eid [tempid]
  ;; This handles implicit partitions also.
  (let [part-eid ^long (d/entid db (:part tempid))
        part-bits (if (== 0 ^long (d/part part-eid))
                    (bit-shift-left part-eid 42)
                    part-eid)
        ;; Mask out partition bit and unused bit from idx
        ;; Keep the sign bit
        temp-and-counter-bits (bit-and-not
                               ^long (:idx tempid)
                               0x7ffffc0000000000)]
    ;; Combine sign, partition, and counter fields
    (bit-or temp-and-counter-bits part-bits)))

(tempid->eid (d/tempid :db.part/user -100))
=> -9223350046622220388

That’s all we can say about entity ids. Now we’ll compose entity ids together into Datoms.

Datoms

A Datom is–at the domain-model level–a tuple of the following elements:

  1. :e An entity id.
  2. :a An attribute entity id.
  3. :v An arbitrary value, sometimes an entity id.
  4. :tx A transaction entity id.
  5. :added A boolean representing the primitive datom operation. "True" means asserted, "false" means retracted.

Datoms are unique in a database by key [:e :a :v :tx]. Note that :added is not included because you can’t assert and retract the same [:e :a :v] in the same transaction.

Concretely–at the data-model level–a Datom is an instance of the datomic.db.Datum class. (Note Datum not Datom!) This class has the following properties:

(->> (#'clojure.reflect/declared-fields datomic.db.Datum)
     (remove #(-> % :flags (contains? :static)))
     (map (juxt :name :type)))
=> ([a int] [tOp long] [v java.lang.Object] [e long])

Clearly the e property holds the :e slot value and the v property the :v slot.

But note two anomalies:

  1. Attributes are entity ids, which are longs, but the a property is an int.
  2. There’s a weird tOp property and no :tx or :added field.

Let’s look at these.

The a Property

a is a Java int, which is a signed 32-bit number. But it’s supposed to be an entity id, which is a 64-bit long. How can it fit? First, the partition of all attributes is 0, so an attribute id has at most 42 bits of useful precision from the counter field. Second, an attribute entity’s counter field bits come from the element counter, which is limited to 19 bits and advances much more slowly than the T counter in a typical database. These two together ensure that the entity id of any attribute will be small enough to fit in 32 bits, even for very large, very old databases.
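A quick sanity check of that arithmetic:

;; Partition 0 contributes nothing to the high bits, so the largest possible
;; attribute entity id is just the largest 19-bit element-counter value:
(dec (bit-shift-left 1 19))
=> 524287

;; ... which fits in an int with plenty of room to spare:
Integer/MAX_VALUE
=> 2147483647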

This compression of attribute entity id range saves 4 bytes per Datum in memory.

The tOp Property

The tOp is a fusion of transaction T (not entity id) and operation that lets Datums avoid having an extra boolean field.

Let’s look at one:

;; Using a fresh database
(d/basis-t db)
=> 66
;; This will be the next transaction T
(d/next-t db)
=> 1000
;; Let's get a datom object, such as a new :db/txInstant assertion
(def datom (first (:tx-data (d/with db []))))
datom
;; Slots  :e             :a :v                                   :tx            :added
=> #datom[13194139534312 50 #inst"2024-05-16T15:27:56.377-00:00" 13194139534312 true]
(.tOp datom)
=> 2001

What is this mysterious value? It’s a fusion of the transaction entity id’s counter field (a T value) and a bit representing the operation. The T value is shifted left one bit to leave room for the operation bit. The operation bit is encoded into the lowest bit so that the natural sort order of longs will sort retractions before assertions within a transaction.

;; If we undo the left shift, we get the transaction T, which is 1000
(= 1000
   (d/tx->t (:tx datom))
   (bit-shift-right 2001 1))
=> true

;; The lowest bit is the operation.
;; Here it is an assert, thus boolean true, thus bit set
(bit-and 2001 1)
=> 1

;; To make a tOp value, we just do the opposite
(bit-or
 (bit-shift-left ^long (d/tx->t (:tx datom)) 1)
 (if (:added datom) 1 0))
=> 2001

This encoding has two benefits:

By using t instead of tx, we reduce the magnitude of the tOp slot. When encoding this value into Fressian (the on-disk format of Datomic index segments), numbers of smaller magnitude will encode to fewer bytes. In this case, the number 2001 requires only 2 bytes to encode. If it were a full transaction entity id, it would always require 7 bytes because of the position of the partition bits in the long. If it were an unpacked long, it would require a full 8 bytes.
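To see the difference, note that the transaction entity id always carries the :db.part/tx partition (entity id 3) in its high bits, so its magnitude dwarfs the bare T:

;; The tx entity id from the example above is just partition 3 shifted past
;; the 42 counter bits, plus the transaction T of 1000:
(+ (bit-shift-left 3 42) 1000)
=> 13194139534312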

By fusing the operation into the transaction T, we decrease the Fressian size on-disk by one byte, the object size by one field, and typically save 4 bytes in memory for the boolean value itself. The in-memory representation of boolean values is unspecified in Java, but OpenJDK uses 4 bytes (a full int) to represent boolean values. With tOp, this requires only one bit!

Summary

We covered a lot of ground! To recap:

  • There are two counters: the T and the element counter.
    • T counter is for normal entities.
    • Element counter is for explicit partition and attribute entities.
  • Counter values are encoded into entity ids when the entity is created:
    • The 42-bit counter field gets the current T or Element counter, depending on the entity type.
    • The 20-bit partition field encodes another entity id into it losslessly by exploiting range restrictions in explicit and implicit partitions.
    • The sign bit signals that the entity id is a temporary id.
  • The Datum class that represents datoms has two clever tricks to reduce its size on-disk and in-memory:
    • The attribute property is an int because no attribute entity id can have more than 19 significant bits.
    • The tOp property encodes the tx id and the operation boolean in a single field by exploiting the constant, fixed partition bits of transaction entity ids.

That’s a fair bit of impressive design even before you get to indexes!

Permalink

Soundcljoud, or a young man's Soundcloud clonejure

A stack of CDs. Photo by Brett Jordan on Unsplash.

😱 Warning!

This blog post is ostensibly about Clojure (of the Babashka variety), but not long after starting to write it, I found myself some 3100 words into some rambling exposition about the history of audio technology and how it intersected with my life, and had not typed the word "ClojureScript" even once (though it may appear that I've now typed it twice, I actually wrote this bit post scriptum, but decided to attach it before the post, which I suppose makes it a prelude and not a postscript, but I digress).

Whilst this won't surprise returning readers, I thought it worth warning first-timers, and offering all readers the chance to skip over all the stage-setting and other self-indulgent nonsense, simply by clicking the link that says "skip over".

If you'd like to delay your gratification, you are in luck! Read on, my friend!

Rambling exposition

Once upon a time there were vinyl platters into which the strategic etching of grooves could encode sound waves. If one placed said platter on a table and turned it say, 78 times a minute, and attached a needle to an arm and dropped it onto the rotating platter, one could use the vibrations in the arm caused by the needle moving left and right in the grooves to decode the sound waves, which you then turn into a fluctuating electric current and cause it to flow through a coil which vibrates some fabric and reproduces the sound waves. And this was good, for we could listen to music.

The only problem with these "records", as they were called, is that they were kinda big and you couldn't fit them in your backpack. So some Dutch people and some Japanese people teamed up and compacted the records, and renamed them discs because a record is required by law (in Japan) to be a certain diameter or you can't call it a record. They decided to make them out of plastic instead of vinyl, and then realised that they couldn't cut grooves into plastic because they kept breaking the discs—perhaps the choice of hammer and chisel to cut the grooves wasn't ideal, but who am I to judge? "だってさ、" said one of the Japanese engineers, "針でディスクにちっちゃい穴やったら、どうなるかな?" No one knew, so they just tried it, and lo! the disc didn't break! But also lo! poking at the disc with a needle made little bumps on the other side of the disc, because of the law of conservation of mass or something... I don't know, I had to drop out of physics in uni because it apparently takes me 20 hours to solve one simple orbital mechanics problem; I mean, come on, how hard is it to calculate the orbit of a planet around three stars? Jeez.

But anyway, they made some bumps, which was annoying at first but then turned out to be a very good thing indeed when someone had the realisation that if squinted at the disc with a binary way of thinking, you could consider a bump to be a 1, and a flat place on the disc to be a 0, and then if you were to build a digital analyser that sampled the position of a sound wave, say, 44,100 times a second and wrote down the results in binary, you could encode the resulting string of 1s and 0s onto the disc with a series of bumps.

But how to decode the bumps when trying to play the thing back? The solution was super obvious this time: a frickin' laser beam! (Frickin' laser beams were on everyone's mind back in the early 80s because of Star Wars—the movies; the missile defence system wouldn't show up until a few years later). If they just fired a frickin' laser beam continuously whilst rotating the disc and added a photodiode next to the laser, the light bouncing back off a bump would knock the wavelength of the light 1/2 out of phase, which would partially cancel the reflected light, lowering the intensity, which the photodiode would pick up and interpret as a 1. Obviously.

Except for one thing. Try as they might, the engineers couldn't make the frickin' laser beam bounce off the frickin' surface of the frickin' polycarbonate. If the plastic was too dark, it just absorbed the light, and if it was too light, it certainly reflected it, but not with high enough intensity for the photodiode to tell the difference between a 1 and a 0. 😢

This was a real head-scratcher, and they were well and truly stuck until one day one of the Dutch engineers was enjoying a beer from a frosty glass at a table at an outdoor cafe on Museumplein on a hot day and the condensation on the glass made the coaster stick to the bottom of the glass in the annoying way it does when one doesn't put a little table salt on the coaster first—amateur!—and the coaster fell into a nearby ashtray (people used to put these paper tubes stuffed with tobacco in their mouths, light them on fire, and suck the smoke deep into their lungs; ask your parents, kids) and got all coated in ash. The engineer wrinkled their nose in disgust before having an amazing insight. "What if," they thought to themselves, "we coated one side of the polycarbonate with something shiny that would reflect the frickin' laser?" Their train of thought then continued thusly: "And what is both reflective and cheap? Why, this selfsame aluminium of which this here ashtray is constructed!"

And thus the last engineering challenge was overcome, and there was much rejoicing!

The first test of the technology was a recording of Richard Strauss's "An Alpine Symphony" made in the beginning of December 1981, which was then presented to the world the following spring. It took a whole year before the first commercial compact disc was released, and by 1983, the technology had really taken off, thus introducing digital music to the world and ironically sowing the seeds of the format's demise. But I'm getting ahead of myself again.

Sometime around 1992, give or take, my parents got me a portable CD player (by this time, people, being ~lazy~ efficient by nature, had stopped saying "compact disc" and started abbreviating it to "CD") and one disc: Aerosmith's tour de force "Get a Grip". Thus began a period of intense musical accumulation by yours truly.

But remember when I said the CD format contained within it the seeds of its own demise? Quoth Wikipedia, and verily thus:

In 1894, the American physicist Alfred M. Mayer reported that a tone could be rendered inaudible by another tone of lower frequency. In 1959, Richard Ehmer described a complete set of auditory curves regarding this phenomenon. Between 1967 and 1974, Eberhard Zwicker did work in the areas of tuning and masking of critical frequency-bands, which in turn built on the fundamental research in the area from Harvey Fletcher and his collaborators at Bell Labs

You see where this is going, right? Good, because I wouldn't want to condescend to you by mentioning things like space-efficient compression with transforming Fouriers into Fast Fouriers modifying discrete cosines and other trivia that would bore any 3rd grade physics student.

So anyway, some Germans scribbled down an algorithm and convinced the Moving Picture Experts Group to standardise it as the MPEG-1 Audio Layer III format, and those Germans somehow patented this "innovation" that no one had even bothered to write down because it was so completely obvious to anyone who bothered to think about it for more than the time a CD takes to revolve once or twice. This patent enraged such people as Richard Stallman (who, to be fair, is easily enraged by such minor things as people objecting to his misogyny and opinions on the acceptability of romantic relationships with minors), leading some people to develop a technically superior and free as in beer audio coding format that they named after a Terry Pratchett character and a clone of a clone of Spacewar!. The name, if you haven't guessed it by now from the copious amount of clues I've dropped here (just call me Colonel Mustard) was Ogg Vorbis.

By early summer 2005, I had accumulated a large quantity of CDs, which weighed roughly a metric shit-tonne. In addition to the strain they placed on my poor second- or third-hand bookshelves, I was due to move to Japan in the fall, and suspected that the sheer mass of my collection would interfere with the ability of whatever plane I would be taking to Tokyo to become airborne, which would be a real bummer. However, a solution presented itself, courtesy of one of the technical shortcomings of the compact disc technology itself.

Remember how CDs have this metallic layer that reflects the laser back at the sensor? Turns out that this layer is quite vulnerable, and a scratch that removes even a tiny bit of the metal results in the laser not being reflected as the disc rotates past the missing metal, which causes that block of data to be discarded by the player as faulty. To recover from this, the player would do one of the following:

  1. Repeat the previous block of audio
  2. Skip the faulty block
  3. Try and retry to read it, causing a stopping and starting of the music

For the listener, this is a sub-optimal auditory experience, and most listeners don't like any sub mixed in with their optimal.

Luckily, consumer-grade CD recorders started appearing in the mid-90s, when HP released the first sub-$1000 model. As a teenager in the 90s, I certainly couldn't afford $1000, but in 1997, I started working as a PC repair technician, and we had a CD "burner" (as they were known back then, not to be confused with a "burner" phone, which didn't exist back then, at least not in the cultural zeitgeist of the time) for such uses as device drivers which were too big to fit on a 3.5 inch "floppy" disk (those disks weren't actually floppy, but their 5.25 inch predecessors certainly were). I sensed an opportunity to protect my investment in digital music by "ripping" my discs (transferring the data on a CD onto the computer) and then burning them back to recordable CDs, at which point I could leave the original CD in its protective case and only expose my copy to the harsh elements.

Of course, one could also leave the ripped audio on one's computer and listen to it at any time of one's choosing, which was really convenient since you didn't have to change CDs when the disc ended or you were just in the mood to listen to something different. The problem is that the raw audio from the CDs (encoded in the WAV format that even modern people are probably familiar with) was fairly large, with a single CD taking up as much as 700MB of space. That may not seem like much until you know that most personal computers in the late 90s had somewhere between 3 and 16 GB of storage, which was enough to store between 20 and 220 CDs, assuming you had nothing else on the drive, which was unlikely since you needed to have software for playing back the files which meant you needed an operating system such as Windows...

To move somewhat more rapidly to the point, one solution to the issue of space was rooted in an even older technology than the compact disc (though younger than the venerable phonograph record): the cassette tape! A cassette tape was... OK, given that I've written nigh upon 2000 words at this point without mentioning Soundcloud or ClojureScript, perhaps I'll just link you to the Wikipedia article on the cassette tape instead of attempting to explain how it works in an amusing (to me) fashion. Interesting (to me) sidenote, though: the cassette tape was also invented by our intrepid Dutch friends over at Philips! 🤯

And my point was... oh yeah, mixtapes! Cassette tapes were one of the first media that gave your average consumer access to a recorder at an affordable price (the earliest such media that I know of was the reel-to-reel tape, which was like a giant cassette tape without the plastic bit that protects the tape), and in addition to stuffing tissue in the top of a tape just to record Marley Marl that we borrowed from our friend down the street, we also made "mixtapes", an alchemical process whereby we boiled down our tape collection and extracted only the bangers (or tear-jerkers, or hopelessly optimistic love songs, or whatever mood we were trying to capture) and recorded those onto a tape, giving us 60 minutes of magic to blare in our cars or hand to that cutie in chemistry class to try and win their affection.

With the invention of the CD and the burner, we were back in the mixtape business, and this time we had up to 80 minutes to express ourselves. By the time I entered university back in 19*cough*, I had saved up enough from my job as a PC technician to buy my own burner, and at university, I gained somewhat of a reputation as a mixtape maestro. People would bring me a stack of CDs and ask me to produce a mixtape to light up the dancefloor or get heads nodding along to the dope-ass DJs of the time (I'm looking at you, Premo!), and also pick a cheeky title to scrawl onto the recordable CD in Sharpie. The one that sticks in my memory was called "The Wu Tang Clan ain't Nothin' to Fuck With"...

OK, but anyway, what if 80 minutes wasn't enough? Remember several minutes of rambling ago when I mentioned the MPEG-1 Audio Layer III format, and you may (or may not) have been like, "WTF is that?" What if I told you that MPEG-1 Audio Layer III is usually referred to by its initials (kinda): MP3? Now you see where I'm going, right? By taking raw CD audio and compressing it with the MP3 encoding algorithm, one could now fit something like 140 songs onto a recordable CD (assuming 5MB per song and 700MB of capacity on the CD), or roughly 10 albums.

So back to the summer of 2005, when I'm getting ready to move to Japan and I realise I can't realistically take all of my CDs with me. What do I do? I rip them onto my computer, encode them not as MP3s but as Ogg Vorbis files because, y'know, freedom and stuff, burn them onto a recordable CD along with ~9 of their compatriots, and pack them in a box, write their names on a bill of lading which I tape to the box once it gets full, and then store the box in my parents' basement. The freshly recorded backup CD goes into one of those big CD case thingies that we used to have:

A black Case Logic 200 CD capacity case

My CD ripping frenzy was concluded in time for my move to Japan, but did not end there, because I ended up getting a job at this bookstore that also sold CDs and other stuff, and publishers would send books and CDs to the buyers that worked at said bookstore, who would then decide if and how many copies of said books and CDs to buy for stock, and then usually put the books and CDs on a shelf in a printer room, where random bookstore employees such as myself were welcome to take them. So I got some cool books, and loads and loads of CDs, many of them from Japanese artists, which were promptly ripped, Ogg Vorbisified, and written to a 500GB USB hard drive that I had bought from the bookstore with my employee discount. Hurrah!

And thus when 2008 rolled around and I left Tokyo for Dublin, I did so with the vast majority of my music safely encoded into Ogg Vorbis format and written to spinning platters. Sadly, my sojourn on the shamrock shores of the Emerald Isle didn't last long, but happily, my next stop in Stockholm has been of the more permanent variety. By the time I moved here in 2010, Apple and Amazon's MP3 stores were starting to become passé, with streaming services replacing them as the Cool New Thing™, led by a brash young Swedish startup called Spotify. And lo! did my collection of Ogg Vorbis files seem unnecessary, since I could now play every song ever recorded whenever I wanted to without having to lug around a hard drive full of files.

Except, at some point, some artists decided that they didn't want their music on Spotify, some for admirable reasons and others for, um, other rea$on$, and now I couldn't listen to every song ever recorded whenever I wanted to without having to lug around a hard drive full of files. Plus Spotify never had a lot of the Japanese music that I had on file. This was suboptimal to be sure, but my laziness overwhelmed my desire to listen to all of my music, until one fateful day when I was sad about something and decided that I absolutely had to listen to some really sad country music, and the first song that came to mind was Garth Brooks's "Much too Young to Feel this Damn Old". Much to my dismay, Garth was one of those artists who had withheld their catalogue from Spotify, meaning I had to resort to a cover of the song instead.

My sadness was replaced by rage, and I turned to Clojure to exact my revenge on Spotify for not having reached terms to licence music from one of the greatest Country & Western recording artists of all time!

OMG finally stuff about Clojure

If you wisely clicked the link at the beginning to skip my rambling exposition, welcome to a discussion of how I solved a serious problem caused by a certain Country & Western super-duper star (much like VA Beach legend Magoo—RIP—on every CD, he spits 48 bars) wisely flicking the V at the odious Daniel Ek and the terrible Tim Cook but somehow being A-OK with an even more repulsive billionaire's streaming service.

To briefly recap, I really wanted to listen to some tear jerkin' country, and Spotify doesn't carry Garth, but I had purchased all of his albums on CD back in the day (all the ones recorded before 1998, anyway) and ripped them into Ogg Vorbis format. Which is great, because I can listen to Garth anytime I want, as long as that desire occurs whilst I happen to be sitting within reach of the laptop onto which I copied all of those files. However, I like to do such things as not sit within reach of my laptop all the time, so now I'm back to square almost one.

One day, as I was bemoaning my fate, I had a flash of inspiration! What if I put those files somewhere a web browser could reach them, and then I could listen to them anytime I happened to be sitting within reach of a web browser, which is basically always, since I have a web browser that fits in my pocket (I think it can also make "phone calls", whatever those are). For example, I could upload them to Soundcloud. The only problem with that is that Soundcloud would claim that I was infringing on Garth's copyright, and they'd kinda have a point, since not only could I listen to "The Beaches of Cheyenne" anytime I wanted to, having obtained a licence to do so by virtue of forking over $15 back in 1996 for a piece of plastic dipped in metal, but so could any random person with an internet connection.

This left me with only one option: clone Soundcloud! With Clojure! And call it Soundcljoud because I just can't help myself! And write a long and frankly absurdly self-indulgent blog post about it!

OK really Clojure now I promise

As I mentioned, I have a bunch of Ogg Vorbis files on my laptop:

: jmglov@alhana; ls -1 ~/Music/g/Garth\ Brooks/
'Beyond the Season'
'Double Live (Disc 1)'
'Double Live (Disc 2)'
'Fresh Horses'
'Garth Brooks'
'In Pieces'
'No Fences'
"Ropin' the Wind"
Sevens
'The Chase'
'The Hits'

I also have Babashka:

A logo of a face wearing a red hoodie with orange sunglasses featuring the Soundcloud logo

So let's get to cloning!

The basic idea is to turn these Ogg Vorbis files into MP3 files, which the standard <audio> element knows how to play, and then wrap a little ClojureScript around that element to stuff my sweet sweet country music into it and call it a day.

We'll accomplish the first part with Babashka and some command-line tools. I'll start by creating a new directory and dropping a bb.edn into it:

{:paths ["src" "resources"]}

Now I can create a src/soundcljoud/main.clj like this:

(ns soundcljoud.main
  (:require [babashka.fs :as fs]
            [babashka.process :as p]
            [clojure.string :as str]))

Firing up a REPL in my trusty Emacs with C-c M-j and then evaluating the buffer with C-c C-k, let me introduce Babashka to good ol' Garth:

(comment

  (def dir (fs/file (fs/home) "Music/g/Garth Brooks/Fresh Horses")) ; C-c C-v f c e
  ;; => #'soundcljoud.main/dir

)

If you're a returning reader, you'll of course have translated C-c C-k to Control + c Control + k in your head and C-c C-v f c e to Control + c Control + v f c e and understood that they mean cider-load-buffer and cider-pprint-eval-last-sexp-to-comment, respectively. If you're a first-timer, what's happening here is that I'm using a so-called Rich comment (which protects the code within the (comment) form from evaluation when the buffer is evaluated) to evaluate forms one at a time as I REPL-drive my way towards a working program, for this is The Lisp Way.

Let's take a look at the Ogg Vorbis files in this directory:

(comment

  (->> (fs/glob dir "*.ogg")
       (map str))
  ;; => ("~/Music/g/Garth Brooks/Fresh Horses/Garth Brooks - Ireland.ogg"
  ;;     "~/Music/g/Garth Brooks/Fresh Horses/Garth Brooks - The Fever.ogg"
  ;;     "~/Music/g/Garth Brooks/Fresh Horses/Garth Brooks - She's Every Woman.ogg"
  ;;     "~/Music/g/Garth Brooks/Fresh Horses/Garth Brooks - The Old Stuff.ogg"
  ;;     "~/Music/g/Garth Brooks/Fresh Horses/Garth Brooks - Rollin'.ogg"
  ;;     "~/Music/g/Garth Brooks/Fresh Horses/Garth Brooks - The Beaches of Cheyenne.ogg"
  ;;     "~/Music/g/Garth Brooks/Fresh Horses/Garth Brooks - That Ol' Wind.ogg"
  ;;     "~/Music/g/Garth Brooks/Fresh Horses/Garth Brooks - It's Midnight Cinderella.ogg"
  ;;     "~/Music/g/Garth Brooks/Fresh Horses/Garth Brooks - The Change.ogg"
  ;;     "~/Music/g/Garth Brooks/Fresh Horses/Garth Brooks - Cowboys and Angels.ogg")

)

Knowing my fastidious nature, I bet I wrote some useful tags into those Ogg files. Let's use vorbiscomment to check:

(comment

  (->> (fs/glob dir "*.ogg")
       (map str)
       first
       (p/shell {:out :string} "vorbiscomment")
       :out
       str/split-lines)
  ;; => ["title=Ireland" "artist=Garth Brooks" "album=Fresh Horses"]

)

Most excellent! With a tiny bit more work, we can turn these strings into a map:

(comment

  (->> (fs/glob dir "*.ogg")
       (map str)
       first
       (p/shell {:out :string} "vorbiscomment")
       :out
       str/split-lines
       (map #(let [[k v] (str/split % #"=")] [(keyword k) v]))
       (into {}))
  ;; => {:title "Ireland", :artist "Garth Brooks", :album "Fresh Horses"}

)

And now I think we're ready to write a function that takes a filename and returns this info:

(defn track-info [filename]
  (->> (p/shell {:out :string} "vorbiscomment" filename)
       :out
       str/split-lines
       (map #(let [[k v] (str/split % #"=")] [(keyword k) v]))
       (into {})
       (merge {:filename filename})))

Now that we've established that we have some Ogg Vorbis files with appropriate metadata, let's jump in the hammock for a second and think about how we want to proceed. What we're actually trying to accomplish is to make these tracks playable on the web. What if we create a podcast RSS feed per album, then we can use any podcast app to play the album?

Faking a podcast with Selmer

Let's go this route, since it seems like very little work! We'll start by creating a Selmer template in resources/album-feed.rss:

<?xml version='1.0' encoding='UTF-8'?>
<rss version="2.0"
     xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd"
     xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <atom:link
        href="{{base-url}}/{{album|urlescape}}/album.rss"
        rel="self"
        type="application/rss+xml"/>
    <title>{{artist}} - {{album}}</title>
    <link>{{link}}</link>
    <pubDate>{{date}}</pubDate>
    <lastBuildDate>{{date}}</lastBuildDate>
    <ttl>60</ttl>
    <language>en</language>
    <copyright>All rights reserved</copyright>
    <webMaster>{{owner-email}}</webMaster>
    <description>Album: {{artist}} - {{album}}</description>
    <itunes:subtitle>Album: {{artist}} - {{album}}</itunes:subtitle>
    <itunes:owner>
      <itunes:name>{{owner-name}}</itunes:name>
      <itunes:email>{{owner-email}}</itunes:email>
    </itunes:owner>
    <itunes:author>{{artist}}</itunes:author>
    <itunes:explicit>no</itunes:explicit>
    <itunes:image href="{{image}}"/>
    <image>
      <url>{{image}}</url>
      <title>{{artist}} - {{album}}</title>
      <link>{{link}}</link>
    </image>
    {% for track in tracks %}
    <item>
      <itunes:title>{{track.title}}</itunes:title>
      <title>{{track.title}}</title>
      <itunes:author>{{artist}}</itunes:author>
      <enclosure
          url="{{base-url}}/{{album|urlescape}}/{{track.mp3-filename|urlescape}}"
          length="{{track.mp3-size}}" type="audio/mpeg" />
      <pubDate>{{date}}</pubDate>
      <itunes:duration>{{track.duration}}</itunes:duration>
      <itunes:episode>{{track.number}}</itunes:episode>
      <itunes:episodeType>full</itunes:episodeType>
      <itunes:explicit>false</itunes:explicit>
    </item>
    {% endfor %}
  </channel>
</rss>

If you're not familiar with Selmer, the basic idea is that anything inside {{}} tags is a variable, and you also have some looping constructs like {% for %} and so on. So let's look at the variables that we slapped in that template:

General info:

  • base-url
  • owner-name
  • owner-email

Album-specific stuff:

  • album
  • artist
  • link
  • date
  • image

Track-specific stuff:

  • track.title
  • track.mp3-filename
  • track.mp3-size
  • track.duration
  • track.number
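
Before we go hunting for those values, here's a tiny Selmer sketch with made-up data, just to show how the {{...}} and {% for %} bits get filled in (the real template is the RSS file above):

(require '[selmer.parser :as selmer])

(selmer/render "{{artist}} - {{album}}: {% for t in tracks %}{{t.title}}; {% endfor %}"
               {:artist "Garth Brooks"
                :album "Fresh Horses"
                :tracks [{:title "The Old Stuff"} {:title "Ireland"}]})
;; => "Garth Brooks - Fresh Horses: The Old Stuff; Ireland; "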

OK, so where are we going to get all this? The general info is easy; we can just decide what we want it to be and slap it in a variable:

(comment

  (def opts {:base-url "http://localhost:1341"
             :owner-name "Josh Glover"
             :owner-email "jmglov@jmglov.net"})
  ;; => #'soundcljoud.main/opts

)

The album-specific stuff is a little more challenging. album and artist we can get from our track-info function, and link can be something like base-url + artist + album, but what about date (the date the album was released) and image (the cover image of the album)? Well, for this we can use a music database that offers API access, such as Discogs. Let's start by creating an account and then visiting the Developers settings page to generate a personal access token, which we'll save in resources/discogs-token.txt. With this in hand, let's try searching for an album. We'll need to add an HTTP client (luckily, Babashka ships with one), a JSON parser (luckily, Babashka ships with one) and a way to load the resources/discogs-token.txt to our namespace, then we can use the API.

(ns soundcljoud.main
  (:require [babashka.fs :as fs]
            [babashka.process :as p]
            [clojure.string :as str]
            ;; ⬇⬇⬇ New stuff ⬇⬇⬇
            [babashka.http-client :as http]
            [cheshire.core :as json]
            [clojure.java.io :as io]))

(comment

  (def discogs-token (-> (io/resource "discogs-token.txt")
                         slurp
                         str/trim-newline))
  ;; => #'soundcljoud.main/discogs-token

  (def album-info (->> (fs/glob dir "*.ogg")
                       (map str)
                       first
                       track-info))
  ;; => #'soundcljoud.main/album-info

  (-> (http/get "https://api.discogs.com/database/search"
                {:query-params {:artist (:artist album-info)
                                :release_title (:album album-info)
                                :token discogs-token}
                 :headers {:User-Agent "SoundCljoud/0.1 +https://jmglov.net"}})
      :body
      (json/parse-string keyword)
      :results
      first)
  ;;  {:format ["CD" "Album"],
  ;;   :master_url "https://api.discogs.com/masters/212114",
  ;;   :cover_image
  ;;   "https://i.discogs.com/0eLXmM1tK1grkH8cstgDT6eV2TlL0NvgWPZBoyScJ_8/rs:fit/g:sm/q:90/h:600/w:600/czM6Ly9kaXNjb2dz/LWRhdGFiYXNlLWlt/YWdlcy9SLTY4NDcx/Ny0xNzE3NDU5MDIy/LTMxNjguanBlZw.jpeg",
  ;;   :title "Garth Brooks - Fresh Horses",
  ;;   :style ["Country Rock" "Pop Rock"],
  ;;   :year "1995",
  ;;   :id 212114,
  ;;   ...
  ;;  }

)

This looks very promising indeed! We now have the release year, which we can put in our RSS feed as date, and the cover image, which we can put in image. Now let's grab info for the tracks:

(comment

  (def master-url (:master_url *1))
  ;; => #'soundcljoud.main/master-url

)

That (:master_url *1) thing might be new to you, so let me explain before we continue. The REPL keeps track of the result of the last three evaluations, and binds them to *1, *2, and *3. So (:master_url *1) says "give me the :master_url key of the result of the last evaluation, which I assume is a map or I'm SOL".
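
To see the mechanics at a bare REPL (toy values, just for illustration):

user=> (+ 1 2)
3
user=> (* *1 10)  ; *1 is the previous result, i.e. 3
30
user=> (+ *1 *2)  ; 30 + 3
33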

OK, back to fetching the track info:

(comment

  (def master-url (:master_url *1))
  ;; => #'soundcljoud.main/master-url

  (-> (http/get master-url
                {:query-params {:token discogs-token}
                 :headers {:User-Agent "SoundCljoud/0.1 +https://jmglov.net"}})
      :body
      (json/parse-string keyword)
      :tracklist)
  ;; => [{:position "1",
  ;;      :title "The Old Stuff",
  ;;      :duration "4:12"}
  ;;     {:position "2",
  ;;      :title "Cowboys And Angels",
  ;;      :duration "3:16"}
  ;;     ...
  ;;    ]

)

We now have all the pieces, so let's clean this up by turning it into a series of functions:

(def discogs-base-url "https://api.discogs.com")
(def user-agent "SoundCljoud/0.1 +https://jmglov.net")

(defn load-token []
  (-> (io/resource "discogs-token.txt")
      slurp
      str/trim-newline))

(defn api-get
  ([token path]
   (api-get token path {}))
  ([token path opts]
   (let [url (if (str/starts-with? path discogs-base-url)
               path
               (str discogs-base-url path))]
     (-> (http/get url
                   (merge {:headers {:User-Agent user-agent}}
                          opts))
         :body
         (json/parse-string keyword)))))

(defn search-album [token {:keys [artist album]}]
  (api-get token "/database/search"
           {:query-params {:artist artist
                           :release_title album
                           :token token}}))

(defn album-info [token {:keys [artist album] :as metadata}]
  (let [{:keys [cover_image master_url year]}
        (->> (search-album token metadata)
             :results
             first)
        {:keys [tracklist]} (api-get token master_url)]
    (merge metadata {:link master_url
                     :image cover_image
                     :year year
                     :tracks (map (fn [{:keys [title position]}]
                                    {:title title
                                     :artist artist
                                     :album album
                                     :number position
                                     :year year})
                                  tracklist)})))

Putting it all together, let's load all the album info in a format that's amenable to stuffing into our RSS template:

(comment

  (let [tracks (->> (fs/glob dir "*.ogg")
                    (map (comp track-info fs/file)))]
    (album-info (load-token) (first tracks)))
  ;; => {:title "Ireland",
  ;;     :artist "Garth Brooks",
  ;;     :album "Fresh Horses",
  ;;     :link "https://api.discogs.com/masters/212114",
  ;;     :image "https://i.discogs.com/0eLXmM1tK1grkH8cstgDT6eV2TlL0NvgWPZBoyScJ_8/rs:fit/g:sm/q:90/h:600/w:600/czM6Ly9kaXNjb2dz/LWRhdGFiYXNlLWlt/YWdlcy9SLTY4NDcx/Ny0xNzE3NDU5MDIy/LTMxNjguanBlZw.jpeg",
  ;;     :year "1995",
  ;;     :tracks
  ;;     ({:title "The Old Stuff",
  ;;       :artist "Garth Brooks",
  ;;       :album "Fresh Horses",
  ;;       :year "1995",
  ;;       :number 1}
  ;;      {:title "Cowboys and Angels",
  ;;       :artist "Garth Brooks",
  ;;       :album "Fresh Horses",
  ;;       :year "1995",
  ;;       :number 2}
  ;;      ...
  ;;      {:title "Ireland",
  ;;       :artist "Garth Brooks",
  ;;       :album "Fresh Horses",
  ;;       :year "1995",
  ;;       :number 10})}

)

Now that we have a big ol' map containing all the metadata an RSS feed could possibly desire, let's use Selmer to turn our template into some actual RSS! We'll need to add Selmer itself to our namespace, and also grab some java.time stuff in order to produce the RFC 2822 datetime required by the podcast RSS format, then we can get onto the templating itself.

(ns soundcljoud.main
  (:require ...
            [selmer.parser :as selmer])
  (:import (java.time ZonedDateTime)
           (java.time.format DateTimeFormatter)))

(def dt-formatter
  (DateTimeFormatter/ofPattern "EEE, dd MMM yyyy HH:mm:ss xxxx"))

(defn ->rfc-2822-date [date]
  (-> (Integer/parseInt date)
      (ZonedDateTime/of 1 1 0 0 0 0 java.time.ZoneOffset/UTC)
      (.format dt-formatter)))

(defn album-feed [opts album-info]
  (let [template (-> (io/resource "album-feed.rss") slurp)]
    (->> (update album-info :tracks
                 (partial map #(update % :mp3-filename fs/file-name)))
         (merge opts {:date (->rfc-2822-date (:year album-info))})
         (selmer/render template))))

(comment

  (let [tracks (->> (fs/glob dir "*.ogg")
                    (map (comp track-info fs/file)))]
    (->> (album-info (load-token) (first tracks))
         (album-feed opts)))
  ;; => java.lang.NullPointerException soundcljoud.main
  ;; {:type :sci/error, :line 3, :column 53, ...}
  ;;  at sci.impl.utils$rethrow_with_location_of_node.invokeStatic (utils.cljc:135)
  ;;  ...
  ;; Caused by: java.lang.NullPointerException: null
  ;;  at babashka.fs$file_name.invokeStatic (fs.cljc:182)
  ;;  ...

)

Oops! It appears that fs/file-name is angry at us. Searching for it, we identify the culprit:

(partial map #(update % :mp3-filename fs/file-name))

Nowhere in our album-info map have we mentioned :mp3-filename, which actually makes sense given that we only have an Ogg Vorbis file and not an MP3. Let's see what we can do about that, shall we? (Spoiler: we shall.)

Converting from Ogg to MP3

We'll honour Rich Hickey by decomplecting this problem into two problems:

  1. Converting an Ogg Vorbis file into a WAV
  2. Converting a WAV into an MP3

Let's start with problem #1 by taking a look at what we get back from album-info:

(comment

  (let [tracks (->> (fs/glob dir "*.ogg")
                    (map (comp track-info fs/file)))]
    (album-info (load-token) (first tracks)))
  ;; => {:title "Ireland",
  ;;     :artist "Garth Brooks",
  ;;     :album "Fresh Horses",
  ;;     :link "https://api.discogs.com/masters/212114",
  ;;     :image "https://i.discogs.com/0eLXmM1tK1grkH8cstgDT6eV2TlL0NvgWPZBoyScJ_8/rs:fit/g:sm/q:90/h:600/w:600/czM6Ly9kaXNjb2dz/LWRhdGFiYXNlLWlt/YWdlcy9SLTY4NDcx/Ny0xNzE3NDU5MDIy/LTMxNjguanBlZw.jpeg",
  ;;     :year "1995",
  ;;     :tracks
  ;;     ({:title "The Old Stuff",
  ;;       :artist "Garth Brooks",
  ;;       :album "Fresh Horses",
  ;;       :year "1995",
  ;;       :number 1}
  ;;      {:title "Cowboys and Angels",
  ;;       :artist "Garth Brooks",
  ;;       :album "Fresh Horses",
  ;;       :year "1995",
  ;;       :number 2}
  ;;      ...
  ;;      {:title "Ireland",
  ;;       :artist "Garth Brooks",
  ;;       :album "Fresh Horses",
  ;;       :year "1995",
  ;;       :number 10})}

)

The problem here is that we've lost the filenames that came from fs/glob, so we have no idea which files we need to convert. Let's fix this by tweaking album-info to take the token and the full list of track infos, rather than just the track info of the first file in the directory:

(defn normalise-title [title]
  (-> title
      str/lower-case
      (str/replace #"[^a-z]" "")))

(defn album-info [token tracks]
  (let [{:keys [artist album] :as track} (first tracks)
        track-filename (->> tracks
                            (map (fn [{:keys [filename title]}]
                                   [(normalise-title title) filename]))
                            (into {}))
        {:keys [cover_image master_url year]}
        (->> (search-album token track)
             :results
             first)
        {:keys [tracklist]} (api-get token master_url)]
    (merge track {:link master_url
                  :image cover_image
                  :year year
                  :tracks (map (fn [{:keys [title position]}]
                                 {:title title
                                  :artist artist
                                  :album album
                                  :number position
                                  :year year
                                  :filename (track-filename (normalise-title title))})
                               tracklist)})))

(comment

  (->> (fs/glob dir "*.ogg")
       (map (comp track-info fs/file))
       (album-info (load-token)))
  ;; => {:artist "Garth Brooks",
  ;;     :album "Fresh Horses",
  ;;     :link "https://api.discogs.com/masters/212114",
  ;;     :image
  ;;     "https://i.discogs.com/0eLXmM1tK1grkH8cstgDT6eV2TlL0NvgWPZBoyScJ_8/rs:fit/g:sm/q:90/h:600/w:600/czM6Ly9kaXNjb2dz/LWRhdGFiYXNlLWlt/YWdlcy9SLTY4NDcx/Ny0xNzE3NDU5MDIy/LTMxNjguanBlZw.jpeg",
  ;;     :year "1995",
  ;;     :tracks
  ;;     ({:title "The Old Stuff",
  ;;       :artist "Garth Brooks",
  ;;       :album "Fresh Horses",
  ;;       :number "1",
  ;;       :year "1995",
  ;;       :filename
  ;;       #object[java.io.File 0x96d79f0 "~/Music/g/Garth Brooks/Fresh Horses/Garth Brooks - The Old Stuff.ogg"]}
  ;; ...
  ;;      {:title "Ireland",
  ;;       :artist "Garth Brooks",
  ;;       :album "Fresh Horses",
  ;;       :number "10",
  ;;       :year "1995",
  ;;       :filename
  ;;       #object[java.io.File 0x13968577 "~/Music/g/Garth Brooks/Fresh Horses/Garth Brooks - Ireland.ogg"]})}

)

Much better! Given this, let's convert this file into a WAV:

(comment

  (def info (->> (fs/glob dir "*.ogg")
                 (map (comp track-info fs/file))
                 (album-info (load-token))))
  ;; => #'soundcljoud.main/info

  (def tmpdir (fs/create-dirs "/tmp/soundcljoud"))
  ;; => #'soundcljoud.main/tmpdir

  (let [{:keys [filename] :as track} (->> info :tracks first)
        out-filename (fs/file tmpdir (str/replace (fs/file-name filename)
                                                  ".ogg" ".wav"))]
    (p/shell "oggdec" "-o" out-filename filename)
    (assoc track :wav-filename out-filename))
  ;; => {:title "The Old Stuff",
  ;;     :artist "Garth Brooks",
  ;;     :album "Fresh Horses",
  ;;     :number "1",
  ;;     :year "1995",
  ;;     :filename
  ;;     #object[java.io.File 0x96d79f0 "~/Music/g/Garth Brooks/Fresh Horses/Garth Brooks - The Old Stuff.ogg"],
  ;;     :wav-filename
  ;;     #object[java.io.File 0x4221dcb2 "/tmp/soundcljoud/Garth Brooks - The Old Stuff.wav"]}

)

Lovely! Let's make a nice function out of this:

(defn ogg->wav [{:keys [filename] :as track} tmpdir]
  (let [out-filename (fs/file tmpdir (str/replace (fs/file-name filename)
                                                  ".ogg" ".wav"))]
    (println (format "Converting %s -> %s" filename out-filename))
    (p/shell "oggdec" "-o" out-filename filename)
    (assoc track :wav-filename out-filename)))

Now let's see if problem #2 is equally tractable.

(comment

  (let [{:keys [filename artist album title year number] :as track}
        (->> info :tracks first)
        wav-file (fs/file tmpdir
                          (-> (fs/file-name filename)
                              (str/replace #"[.][^.]+$" ".wav")))
        mp3-file (str/replace wav-file ".wav" ".mp3")
        ffmpeg-args ["ffmpeg" "-i" wav-file
                     "-vn"  ; no video
                     "-q:a" "2"  ; dynamic bitrate averaging 192 KB/s
                     "-y"  ; overwrite existing files without prompting
                     mp3-file]]
    (p/shell "ffmpeg" "-i" wav-file
             "-vn"       ; no video
             "-q:a" "2"  ; dynamic bitrate averaging 192 KB/s
             "-y"        ; overwrite existing files without prompting
             mp3-file))
  ;; => {:exit 0,
  ;;     ...
  ;;     }

  (fs/size "/tmp/soundcljoud/Garth Brooks - The Old Stuff.mp3")
  ;; => 5941943

)

Nice! There's one annoying thing about this, though. My Ogg Vorbis file had metadata tags telling me stuff and also things about the contents of the file, whereas my MP3 is inscrutable, save for the filename. Let's ameliorate this with our good friend id3v2:

(comment

  (let [{:keys [filename artist album title year number] :as track}
        (->> info :tracks first)
        wav-file (fs/file tmpdir
                          (-> (fs/file-name filename)
                              (str/replace #"[.][^.]+$" ".wav")))
        mp3-file (str/replace wav-file ".wav" ".mp3")
        ffmpeg-args ["ffmpeg" "-i" wav-file
                     "-vn"  ; no video
                     "-q:a" "2"  ; dynamic bitrate averaging 192 KB/s
                     "-y"  ; overwrite existing files without prompting
                     mp3-file]]
    (p/shell "id3v2"
             "-a" artist "-A" album "-t" title "-y" year "-T" number
             mp3-file))
  ;; => {:exit 0,
  ;;     ...
  ;;     }

  (->> (p/shell {:out :string}
                "id3v2" "--list"
                "/tmp/soundcljoud/Garth Brooks - The Old Stuff.mp3")
       :out
       str/split-lines)
  ;; => ["id3v1 tag info for /tmp/soundcljoud/Garth Brooks - The Old Stuff.mp3:"
  ;;     "Title  : The Old Stuff                   Artist: Garth Brooks"
  ;;     "Album  : Fresh Horses                    Year: 1995, Genre: Unknown (255)"
  ;;     "Comment:                                 Track: 1"
  ;;     "id3v2 tag info for /tmp/soundcljoud/Garth Brooks - The Old Stuff.mp3:"
  ;;     "TPE1 (Lead performer(s)/Soloist(s)): Garth Brooks"
  ;;     "TALB (Album/Movie/Show title): Fresh Horses"
  ;;     "TIT2 (Title/songname/content description): The Old Stuff"
  ;;     "TRCK (Track number/Position in set): 1"]

)

There's an awful lot of copy and paste code here, so let's consolidate MP3 conversion and tag writing into a single function. We should also make sure that function returns a track info map that contains all the good stuff that our RSS template needs. Casting our mind back to the track-specific stuff, we need:

  • track.title
  • track.number
  • track.mp3-filename
  • track.mp3-size
  • track.duration

mp3-filename we have, and mp3-size we can get with the same fs/size call that we previously used to check that the MP3 file existed. duration is a little more interesting. What the RSS feed standard is looking for is a duration in one of the following formats:

  • hours:minutes:seconds
  • minutes:seconds
  • seconds
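
We'll just go with plain seconds, but if you ever wanted the minutes:seconds form, a minimal sketch could look like this (seconds->duration is a name I just made up, not part of the code that follows):

(defn seconds->duration [secs]
  (let [s (Integer/parseInt secs)]
    (format "%d:%02d" (quot s 60) (rem s 60))))

(seconds->duration "252")
;; => "4:12"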

We can use the ffprobe tool that ships with FFmpeg to get some info about the MP3:

(comment

  (-> (p/shell {:out :string}
               "ffprobe -v quiet -print_format json -show_format -show_streams"
               "/tmp/soundcljoud/01 - Garth Brooks - The Old Stuff.mp3")
      :out
      (json/parse-string keyword)
      :streams
      first)
  ;; => {:tags {:encoder "Lavc60.3."},
  ;;     :r_frame_rate "0/0",
  ;;     :sample_rate "44100",
  ;;     :channel_layout "stereo",
  ;;     :channels 2,
  ;;     :duration "252.473469",
  ;;     :codec_name "mp3",
  ;;     :bit_rate "188278",
  ;;     ...
  ;;     :codec_tag "0x0000"}

)

Cool! ffprobe reports duration in seconds (with some extra nanoseconds that we don't need), so let's write a function that grabs the duration and chops off everything after the decimal place, then we can consolidate the WAV -> MP3 conversion and ID3 tag writing in another function:

(defn mp3-duration [filename]
  (-> (p/shell {:out :string}
               "ffprobe -v quiet -print_format json -show_format -show_streams"
               filename)
      :out
      (json/parse-string keyword)
      :streams
      first
      :duration
      (str/replace #"[.]\d+$" "")))

(defn wav->mp3 [{:keys [filename artist album title year number] :as track} tmpdir]
  (let [wav-file (fs/file tmpdir
                          (-> (fs/file-name filename)
                              (str/replace #"[.][^.]+$" ".wav")))
        mp3-file (str/replace wav-file ".wav" ".mp3")
        ffmpeg-args ["ffmpeg" "-i" wav-file
                     "-vn"  ; no video
                     "-q:a" "2"  ; dynamic bitrate averaging 192 KB/s
                     "-y"  ; overwrite existing files without prompting
                     mp3-file]
        id3v2-args ["id3v2"
                    "-a" artist "-A" album "-t" title "-y" year "-T" number
                    mp3-file]]
    (println (format "Converting %s -> %s" wav-file mp3-file))
    (apply println (map str ffmpeg-args))
    (apply p/shell ffmpeg-args)
    (println "Writing ID3 tag")
    (apply println id3v2-args)
    (apply p/shell (map str id3v2-args))
    (assoc track
           :mp3-filename mp3-file
           :mp3-size (fs/size mp3-file)
           :duration (mp3-duration mp3-file))))

(comment

  (-> info :tracks first (wav->mp3 tmpdir))
  ;; => {:number "1",
  ;;     :duration "252",
  ;;     :artist "Garth Brooks",
  ;;     :title "The Old Stuff",
  ;;     :year "1995",
  ;;     :filename
  ;;     #object[java.io.File 0x96d79f0 "~/Music/g/Garth Brooks/Fresh Horses/Garth Brooks - The Old Stuff.ogg"],
  ;;     :mp3-filename "/tmp/soundcljoud/Garth Brooks - The Old Stuff.mp3",
  ;;     :album "Fresh Horses",
  ;;     :mp3-size 5943424}

)

Looking good! Now we should have everything we need for the RSS feed, so let's try to put it all together:

(defn process-track [track tmpdir]
  (-> track
      (ogg->wav tmpdir)
      (wav->mp3 tmpdir)))

(defn process-album [opts dir]
  (let [tmpdir (fs/create-temp-dir {:prefix "soundcljoud."})
        info (->> (fs/glob dir "*.ogg")
                  (map (comp track-info fs/file))
                  (album-info (load-token)))
        ;; convert each track so the feed has mp3-filename, mp3-size, etc.
        info (update info :tracks
                     (partial map #(process-track % tmpdir)))]
    (spit (fs/file tmpdir "album.rss") (album-feed opts info))
    (assoc info :out-dir tmpdir)))

(comment

  (process-album opts dir)
  ;; => {:out-dir "/tmp/soundcljoud.12524185230907219576"
  ;;     :artist "Garth Brooks",
  ;;     :album "Fresh Horses",
  ;;     :link "https://api.discogs.com/masters/212114",
  ;;     :image
  ;;     "https://i.discogs.com/0eLXmM1tK1grkH8cstgDT6eV2TlL0NvgWPZBoyScJ_8/rs:fit/g:sm/q:90/h:600/w:600/czM6Ly9kaXNjb2dz/LWRhdGFiYXNlLWlt/YWdlcy9SLTY4NDcx/Ny0xNzE3NDU5MDIy/LTMxNjguanBlZw.jpeg",
  ;;     :year "1995",
  ;;     :tracks
  ;;     ({:number "1",
  ;;       :duration "252",
  ;;       :artist "Garth Brooks",
  ;;       :title "The Old Stuff",
  ;;       :year "1995",
  ;;       :filename
  ;;       #object[java.io.File 0x344bc92b "~/Music/g/Garth Brooks/Fresh Horses/Garth Brooks - The Old Stuff.ogg"],
  ;;       :mp3-filename
  ;;       "/tmp/soundcljoud.12524185230907219576/Garth Brooks - The Old Stuff.mp3",
  ;;       :album "Fresh Horses",
  ;;       :wav-filename
  ;;       #object[java.io.File 0x105830d2 "/tmp/soundcljoud.12524185230907219576/Garth Brooks - The Old Stuff.wav"],
  ;;       :mp3-size 5943424}
  ;;      ...
  ;;      {:number "10",
  ;;       :duration "301",
  ;;       :artist "Garth Brooks",
  ;;       :title "Ireland",
  ;;       :year "1995",
  ;;       :filename
  ;;       #object[java.io.File 0x59ba6e31 "~/Music/g/Garth Brooks/Fresh Horses/Garth Brooks - Ireland.ogg"],
  ;;       :mp3-filename
  ;;       "/tmp/soundcljoud.12524185230907219576/Garth Brooks - Ireland.mp3",
  ;;       :album "Fresh Horses",
  ;;       :wav-filename
  ;;       #object[java.io.File 0x4de1472 "/tmp/soundcljoud.12524185230907219576/Garth Brooks - Ireland.wav"],
  ;;       :mp3-size 6969472})}
)

We also have a /tmp/soundcljoud.12524185230907219576/album.rss file containing:

<?xml version='1.0' encoding='UTF-8'?>
<rss version="2.0"
     xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd"
     xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Garth Brooks - Fresh Horses</title>
    <link>https://api.discogs.com/masters/212114</link>
    <pubDate>Sun, 01 Jan 1995 00:00:00 +0000</pubDate>
    <itunes:subtitle>Album: Garth Brooks - Fresh Horses</itunes:subtitle>
    <itunes:author>Garth Brooks</itunes:author>
    <itunes:image href="https://i.discogs.com/0eLXmM1tK1grkH8cstgDT6eV2TlL0NvgWPZBoyScJ_8/rs:fit/g:sm/q:90/h:600/w:600/czM6Ly9kaXNjb2dz/LWRhdGFiYXNlLWlt/YWdlcy9SLTY4NDcx/Ny0xNzE3NDU5MDIy/LTMxNjguanBlZw.jpeg"/>
    
    <item>
      <itunes:title>The Old Stuff</itunes:title>
      <title>The Old Stuff</title>
      <itunes:author>Garth Brooks</itunes:author>
      <enclosure
          url="http://localhost:1341/albums/Fresh+Horses/01+-+Garth+Brooks+-+The+Old+Stuff.mp3"
          length="5943424" type="audio/mpeg" />
      <pubDate>Sun, 01 Jan 1995 00:00:00 +0000</pubDate>
      <itunes:duration>252</itunes:duration>
      <itunes:episode>1</itunes:episode>
      <itunes:episodeType>full</itunes:episodeType>
      <itunes:explicit>false</itunes:explicit>
    </item>

    ...
    
    <item>
      <itunes:title>Ireland</itunes:title>
      <title>Ireland</title>
      <itunes:author>Garth Brooks</itunes:author>
      <enclosure
          url="http://localhost:1341/albums/Fresh+Horses/Garth+Brooks+-+Ireland.mp3"
          length="6969472" type="audio/mpeg" />
      <pubDate>Sun, 01 Jan 1995 00:00:00 +0000</pubDate>
      <itunes:duration>301</itunes:duration>
      <itunes:episode>10</itunes:episode>
      <itunes:episodeType>full</itunes:episodeType>
      <itunes:explicit>false</itunes:explicit>
    </item>
    
  </channel>
</rss>

In theory, if we put this RSS file and our MP3 somewhere a podcast player can find them, we should be able to listen to some Garth Brooks! However, http://localhost:1341/ is not likely to be reachable by a podcast player, so perhaps we should put a webserver there and whilst we're at it, just write our own little Soundcloud clone webapp. Seems reasonable, right?

We'll get into that in the next instalment of "Soundcljoud, or a young man's Soundcloud clonejure."

Permalink

Infinite rest

Some function types in Clojure seamlessly handle infinite arguments, while others mishandle them and freeze our programs.

Clojure 1.11.1
user=> (apply (fn [& args]) (range))
nil
user=> (apply (fn []) (range))
^C ;; loops forever

Let’s level the playing field.

In the first case, the infinite arguments are preserved as a lazy sequence which is never realized by the function body, while the second case pours them into a Java array before the function body is called, looping “forever” (until the JVM runs out of memory or the size limitations of arrays are hit).

The compiler decides the behavior of fn in this respect based on the presence of a rest-argument. The second case of using a Java array is more fundamental in Clojure, so special support is needed to do something other than loop forever with infinite arguments.

So which other function types accept infinite arguments and why?

Vars simply apply their mapped functions, and thus inherit their ability to handle infinite arguments from the functions they hold. So, since fn's with rest arguments can handle infinite arguments, so can vars containing them:

user=> (defn yes-inf [& args] [:yes-inf (take 10 args)])
user=> (apply yes-inf (range))
[:yes-inf (0 1 2 3 4 5 6 7 8 9)]
user=> (apply #'yes-inf (range))
[:yes-inf (0 1 2 3 4 5 6 7 8 9)]

On the other hand, fn’s with fixed arguments (and vars containing them) cannot handle infinite arguments:

user=> (defn no-inf [])
user=> (apply no-inf (range))
^C
user=> (apply #'no-inf (range))
^C

Intuitively, we can do better: since fixed-arity functions in Clojure only support up to 20 parameters, we could decide to throw an error instead of walking past the 21st argument.

Collections are functions that usually accept 1 or 2 arguments, but they also loop forever when passed infinite arguments.

user=> (apply {} (range))
^C
user=> (apply [] (range))
^C

Again, intuition says we don’t need to walk past the 3rd argument, so this behavior can be improved. We can also add atomic values like symbols and keywords to this category of function.

Multimethods, like vars, are another kind of function that delegates to other functions. Unfortunately, they only support finite arguments.

Clojure 1.11.1
user=> (defmulti disp-first (fn [target & _] (class target)))
user=> (defmethod disp-first Long [target & _] (inc target))
user=> (apply disp-first 1 (range))
^C

Intuitively, if the dispatch function and the dispatch method both support infinite arguments, then so should the multimethod. This can be improved.

Let’s follow these breadcrumbs of intuition to enhance all Clojure functions with better handling of infinite arguments.

First, we noticed that vars inherit their capability for infinite args, and yet multimethods could potentially do the same. To bring them up to speed, we can use the same trick as Var, but twice: instead of the multimethod calling invoke on itself, first apply the dispatch function then apply the chosen method.

Now multimethods with rest parameters support infinite arguments!

user=> (defmulti disp-first (fn [target & _] (class target)))
user=> (defmethod disp-first Long [target & _] (inc target))
user=> (apply disp-first 1 (range))
2

But this does not handle fixed-arity multimethods, since they depend on the functions they contain for infinite argument handling:

user=> (defmulti mm (fn []))
user=> (apply mm (range))
^C

This brings us back to AFn, where fn's applyTo implementation lives. The basic algorithm is to realize 20 arguments to determine whether we can call invoke without building the final array, otherwise pour the 21st argument and beyond into an array to call invoke. We can't get around calling invoke, but let's add features to AFn so that it can throw an exception rather than diverging in cases where it's pointless to traverse possibly-infinite args.

One such case is a fn without rest arguments, which can only support up to 20 arguments. The compiler always extends AFunction in this case, so let’s hardcode the following case in AFn’s applyTo: if the function being applied is an AFunction, then throw an arity exception for 21 or more arguments. The risk: if anyone implements an AFunction that supports rest arguments, it won’t work for 21+ args. Let’s assume this never happens.

With this tweak, fixed-arg fn’s, multimethods, vars, and even vars containing multimethods now error instead of diverge:

user=> (apply identity (range))
Wrong number of args (21+) passed to: clojure.core/identity
user=> (defmulti mm identity)
user=> (apply mm (range))
Wrong number of args (21+) passed to: clojure.core/identity
user=> (apply #'mm (range))
Wrong number of args (21+) passed to: clojure.core/identity

That works for all 0-20 arg fn’s; now for everything else that extends AFn.

I settled on a new AFn method that returns this.class if the current class does not have a rest argument. If the runtime type of the collection is identical to that class, then we can short-circuit rest-argument support. If it's not equal, then it must be an extending class that we don't control, so continue as normal.

After implementing this method on all functional collections in Clojure, we now have the desirable semantics: error over divergence.

user=> (apply {} (range))
Wrong number of args (21+) passed to: clojure.lang.PersistentArrayMap

Libraries providing their own functional collections need to update their implementations to opt-in.

user=> (apply (proxy [clojure.lang.MapEntry] [0 1])
              (range))
^C

These ideas have been implemented in this pull request.

To summarize, we’ve made every function in Clojure (except those with rest-arguments) lazier either directly (fns, collections, keywords) or indirectly (multimethods, vars). This enables infinite arguments to error instead of diverge, unless the function itself walks its infinite arguments. Given that most functions only support fixed arguments, this enhancement has broad applicability. In the case of multimethods, we’ve also added support for legitimate uses of infinite arguments.

Interestingly, this post is essentially the opposite of my previous one: there for collections, we fixed arity exceptions to reflect the actual number of arguments passed; here we cut off the processing of arguments before all arguments can even be counted.


Thanks to Santiago Gepigon III for pairing on this and the post title.

Permalink

Stay ahead in web development: latest news, tools, and insights #40

weeklyfoo #40 is here: your weekly digest of all webdev news you need to know! This time you'll find 47 valuable links in 8 categories! Enjoy!

🚀 Read it!

📰 Good to know

🧰 Tools

  • Pikimov: Online motion design and video editor / video
  • Docmost: Docmost is an open-source collaborative documentation and wiki software. It is an open-source alternative to the likes of Confluence and Notion. / docs, collaboration
  • Solid Toast: Create beautiful, customizable toasts with Solid JS / notifications, toast
  • doggo: Command-line DNS Client for Humans. Written in Golang / cli
  • squirrelly: Semi-embedded JS template engine that supports helpers, filters, partials, and template inheritance. 4KB minzipped, written in TypeScript / templating
  • fast-json-stringify: 2x faster than JSON.stringify() / stringify, json
  • Mako: An extremely fast, production-grade web bundler based on Rust. / bundler
  • SmoothMQ: A drop-in replacement for SQS designed for great developer experience and efficiency. / queues, sqs
  • Changesets: A way to manage your versioning and changelogs with a focus on monorepos / changelog
  • Search HackerNews: A search engine for HackerNews / search
  • React Ace: A set of react components for Ace (Editor) / react
  • lets form: A JSON form generator for React with Material UI / AntDesign / Bootstrap / RSuite / Mantine / react, forms
  • H5Web: React components for data visualization and exploration / visualization
  • GridStack: Build interactive dashboards in minutes. / grids, dashboards
  • Flitter: Flitter is a powerful framework inspired by Flutter, supporting both SVG and Canvas to create high-performance graphics and user interfaces. / visualizations
  • MSHR: A collection of 208 vanilla CSS mesh gradients free for you to use in any of your projects. / css, gradients
  • json.bash: Command-line tool and bash library that creates JSON / json
  • bwip-js: Barcode Writer in Pure JavaScript / barcode
  • Superstruct: A simple and composable way to validate data in JavaScript (and TypeScript). / validation
  • Termino.js: Create a web based terminal on any website - great for games, animations and real world apps! / terminal
  • 0xtools: X-Ray vision for Linux systems / cli, performance

🎨 Design

🤪 Fun

🤣 Meme

  • yaml: this explains a lot of things / yaml / 0 min read

📚 Tutorials

📺 Videos

  • How Vercel Works: With Malte Ubl, we deep dived into how Vercel works as a team, what could developer experience look like, and the future of AI-enabled applications. / vercel

Want to read more? Check out the full article here.

To sign up for the weekly newsletter, visit weeklyfoo.com.

Permalink

The Complete Lineup + Late Bird Cliff

We've kept you all in suspense for long enough. Today we're announcing the final Heart of Clojure speakers, and with that the full programme is now available.

With that, the time has also come to make a decision: do you want to be at the hottest Clojure conference of the year, or not? Because at the end of next weekend (July 14) we end the sale of regular tickets. From then on only the more expensive Late Bird tickets will be available.

We'll explain a bit more in the next newsletter why exactly we're doing this, but the tl;dr is that as organizers we can do more with budget we have today, vs budget we only know we'll have in September. If we sell enough tickets there are a lot of cool things we can do, like add live performances in the evening, have more fringe activities, better lunch and drink options, and so forth. But if we want to do those things we need to start planning them soon, which is why we're creating a strong incentive for anyone who's still on the fence.

Eric Normand

[Speaker card: Eric Normand, "The Wonders of Abstraction" (Keynote)]

None other than Eric Normand will be delivering the closing keynote, putting the capstone on our event. An experienced speaker, writer, and teacher, he's been creating courses in Functional Programming and specifically Clojure for over a decade. His latest book "Grokking Simplicity" achieves what few have managed before, really making functional programming concepts accessible and understandable to a wide audience.

Eric is a deep and original thinker. For this talk he's diving into the concept of Abstraction. It's a term we throw around a lot as programmers, but clearly the last word on the topic hasn't been said yet, and we're really looking forward to this philosophical exploration.

Philippa Markovics, Martin Kavalar

[Speaker card: Philippa Markovics & Martin Kavalar, "Staring into the PLFZABYSS - From the IBM AS/400 to Clojure & Datomic" (Talk)]

When nextjournal isn't creating awesome notebook software like Clerk, or revolutionizing Clojure deployments with application.garden, they also do real work. Since 2021 they've taken on the unenviable task of bringing a large German automotive logistics company into the modern era. From AS/400 to Clojure and Datomic.

A transition like that is technically challenging, but also challenging on a human level, getting a large and old school organization to adopt new practices, and work in different ways. But they succeeded, and will share with us their fascinating story.

Thousands of globally unique 8-character column names, green-screen terminal UIs, skunk work projects and personal drama — this talk has it all!

Lovro Lugović, Sung-Shik Jongmans

[Speaker card: Lovro Lugović & Sung-Shik Jongmans, "Klor: Choreographic Programming in Clojure" (Talk)]

We will be honest, we had not heard of Choreographic programming before this talk popped up in our CFP. But once we looked into it we knew we wanted to have Lovro and Sung-Shik present their project. This is a talk with Strangeloop Vibes.

Their Clojure-based system "Klor" provides a whole new paradigm for writing distributed systems as choreographies, eliminating common problems like communication mismatches and deadlocks.

Felix

[Speaker card: Felix Alm, "Squint: a taste of Clojure for JavaScript devs" (Talk)]

As Clojure programmers we've been blessed for years with ClojureScript. Whenever we need to write for the frontend, or reach into other places where JavaScript is the default, we can count on the trusty ClojureScript compiler.

But ClojureScript makes some very particular tradeoffs. The reliance on the Google Closure compiler is a blessing but also a curse. It gives us a very advanced optimizing compiler, but it integrates poorly with the rest of the JS ecosystem.

Squint is a new take on a ClojureScript-like LISP-to-JS compiler, one that sticks much closer to contemporary JavaScript tooling and practices, making it much easier to integrate with existing code bases, or to adopt gradually.

Together with the talk about Jank (Clojure on LLVM) and the Babashka workshop (Clojure on Graal Native Image), this is the third panel in the alt-Clojure part of the programme.

Jordan Miller, Carmen Huidobro

[Speaker card: Jordan Miller & Carmen Huidobro, "Our Lovely Hosts"]

Last but not least, Jordan and Carmen will be your hosts for the two days of Heart of Clojure.

Heart of Clojure is made possible thanks to our lovely Gold Sponsors, Nubank, Clojurists Together, Latacora, and Exoscale.

Permalink

PG2 release 0.1.15

PG2 version 0.1.15 is out. This version mostly ships improvements to the connection pool and to folders (reducers) of a database result. There are two new sections in the documentation that describe each part; I reproduce them below.

Connection Pool

Problem: every time you connect to the database, it takes time to open a socket, pass the authentication pipeline, and receive initial data from the server. From the server's perspective, a new connection spawns a new process, which is also an expensive operation. If you open a connection per query, your application is about ten times slower than it could be.

Connection pools solve that problem. A pool holds a set of connections opened in advance, and you borrow them from the pool. While borrowed, a connection cannot be shared with anybody else. Once you're done with your work, you return the connection to the pool, and it becomes available to other consumers again.

PG2 ships a simple and robust connection pool out of the box. This section covers how to use it.

A Simple Example

Import both core and pool namespaces as follows:

(ns demo
  (:require
    [pg.core :as pg]
    [pg.pool :as pool]))

Here is how you use the pool:

(def config
  {:host "127.0.0.1"
   :port 5432
   :user "test"
   :password "test"
   :database "test"})

(pool/with-pool [pool config]
  (pool/with-connection [conn pool]
    (pg/execute conn "select 1 as one")))

The pool/with-pool macro creates a pool object from the config map and binds it to the pool symbol. Once you exit the macro, the pool gets closed.

The with-pool macro can easily be replaced with the with-open macro and the pool function that creates a pool instance. On exit, the with-open macro calls the .close method of the opened object, which closes the pool.

(with-open [pool (pool/pool config)]
  (pool/with-conn [conn pool]
    (pg/execute conn "select 1 as one")))

Having a pool object, use it with the pool/with-connection macro (there is a shorter version, pool/with-conn, as well). This macro borrows a connection from the pool and binds it to the conn symbol. Now you can pass the connection to pg/execute, pg/query and so on. On exiting the with-connection macro, the connection is returned to the pool.

And this is briefly everything you need to know about the pool! Sections below describe more about its inner state and behavior.

Configuration

The pool object accepts the same config the Connection object does (see the corresponding section for the table of parameters). In addition to these, the following options are accepted:

  • :pool-min-size (integer, default 2): minimum number of open connections when the pool is initialized.
  • :pool-max-size (integer, default 8): maximum number of open connections; cannot be exceeded.
  • :pool-expire-threshold-ms (integer, default 300,000, i.e. 5 minutes): how soon a connection is treated as expired and forcibly closed.
  • :pool-borrow-conn-timeout-ms (integer, default 15,000, i.e. 15 seconds): how long to wait when borrowing a connection while all the connections are busy; when the timeout elapses, an exception is thrown.

The first option, :pool-min-size, specifies how many connections are opened at the beginning. Setting too many is not necessary, because you never know whether your application will really use all of them. It's better to start with a small number and let the pool grow over time, if needed.

The next option, :pool-max-size, determines the maximum number of open connections; it cannot be exceeded. If all the connections are busy and the limit has not been reached yet, the pool spawns a new connection and adds it to the internal queue. Once the :pool-max-size value is reached, the pool cannot grow any further; a borrower has to wait for a connection to be freed and eventually gets an exception (see the borrow timeout below).

The option :pool-expire-threshold-ms specifies the number of milliseconds. When a certain amount of time has passed since the connection’s initialization, it is considered expired and will be closed by the pool. This is used to rotate connections and prevent them from living for too long.

The option :pool-borrow-conn-timeout-ms prescribes how long to wait when borrowing a connection from an exhausted pool: a pool where all the connections are busy and the :pool-max-size value has been reached. In this case, the only hope is that other clients complete their work and return their connections before the timeout expires. If there still aren't any free connections by the end of the :pool-borrow-conn-timeout-ms window, an exception is thrown.
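
Putting the options together, a config map for the pool might look like this (the pool values below just spell out the defaults):

(def config
  {:host "127.0.0.1"
   :port 5432
   :user "test"
   :password "test"
   :database "test"
   ;; pool-specific options
   :pool-min-size 2
   :pool-max-size 8
   :pool-expire-threshold-ms 300000
   :pool-borrow-conn-timeout-ms 15000})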

Pool Methods

The stats function returns info about free and used connections:

(pool/with-pool [pool config]

  (pool/stats pool)
  ;; {:free 1 :used 0}

  (pool/with-connection [conn pool]
    (pool/stats pool)
    ;; {:free 0 :used 1}
  ))

It might be used to send metrics to Grafana, CloudWatch, etc.

Manual Pool Management

The following functions help you manage a connection pool manually, for example when it’s wrapped into a component (see Component and Integrant libraries).

The pool function creates a pool:

(def POOL (pool/pool config))

The used-count and free-count functions return total numbers of busy and free connections, respectively:

(pool/free-count POOL)
;; 2

(pool/used-count POOL)
;; 0

The pool? predicate ensures it’s a Pool instance indeed:

(pool/pool? POOL)
;; true

Closing

The close method shuts down a pool instance. On shutdown, first, all the free connections get closed. Then the pool closes busy connections that were borrowed. This might lead to failures in other threads, so it’s worth waiting until the pool has zero busy connections.

(pool/close POOL)
;; nil

The closed? predicate ensures the pool has already been closed:

(pool/closed? POOL)
;; true
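
For instance, here is a minimal sketch of wiring the pool into an Integrant system using pool/pool and pool/close (assuming Integrant is on the classpath; the ::pool key is made up for illustration):

(require '[integrant.core :as ig])

;; create the pool when the system starts
(defmethod ig/init-key ::pool [_ config]
  (pool/pool config))

;; close it when the system halts
(defmethod ig/halt-key! ::pool [_ p]
  (pool/close p))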

Borrow Logic in Detail

When getting a connection from a pool, the following conditions are taken into account:

  • if the pool is closed, an exception is thrown;
  • if there are free connections available, the pool takes one of them;
  • if a connection is expired (it was created too long ago), it is closed and the pool makes another attempt;
  • if there aren't any free connections, but the maximum number of used connections has not been reached yet, the pool spawns a new connection;
  • if the maximum number of used connections has been reached, the pool waits for :pool-borrow-conn-timeout-ms milliseconds, hoping that someone releases a connection in the background;
  • if nobody has by the time the timeout elapses, the pool throws an exception.

Returning Logic in Detail

When you return a connection to a pool, the following cases might come into play:

  • if the connection is in an error state, the transaction is rolled back and the connection is closed;
  • if the connection is in transaction mode, the transaction is rolled back and the connection is marked as free again;
  • if the connection was already closed, the pool just removes it from the used connections; it won't be added to the free queue;
  • if the pool is closed, the connection is removed from the used connections;
  • when none of the above conditions is met, the connection is removed from the used connections and becomes available to other consumers again.

This was the Connection Pool section; now we proceed with Folders.

Folders (Reducers)

Folders (also known as reducers) are objects that transform rows arriving from the network into something else. A typical folder consists of an initial value (which might be mutable) and logic that adds the next row to that value. Before returning the value, a folder might post-process it somehow, for example by turning it into an immutable value.

The default folder (which you don’t need to specify) acts exactly like this: it spawns a new transient vector and conj!es all the incoming rows into it. Finally, it returns a persistent! version of this vector.

PG2 provides a great variety of folders: to build maps or sets, to index or group rows by a certain function. With folders, it’s possible to dump a database result into a JSON or EDN file.

It’s quite important that folders process rows on the fly. Like transducers, they don’t keep the whole dataset in memory. They only track the accumulator and the current row no matter how many of them have arrived from the database: one thousand or one million.

A Simple Folder

Technically a folder is a function (an instance of clojure.lang.IFn) with three bodies of arity 0, 1, and 2, as follows:

(defn a-folder
  ([]
   ...)
  ([acc]
   ...)
  ([acc row]
   ...))

  • The first 0-arity form produces an accumulator that might be mutable.

  • The third 2-arity form takes the accumulator and the current row and returns an updated version of the accumulator.

  • The second 1-arity form accepts the last version of the accumulator and transforms it somehow, for example seals a transient collection into its persistent view.

Here is the default folder:

(defn default
  ([]
   (transient []))
  ([acc!]
   (persistent! acc!))
  ([acc! row]
   (conj! acc! row)))

Some folders depend on initial settings and thus produce folding functions. Here is an example of the map folder that acts like the map function from clojure.core:

(defn map
  [f]
  (fn folder-map
    ([]
     (transient []))
    ([acc!]
     (persistent! acc!))
    ([acc! row]
     (conj! acc! (f row)))))

Passing A Folder

To pass a custom folder to process the result, specify the :as key as follows:

(require '[pg.fold :as fold])

(defn row-sum [{:keys [field_1 field_2]}]
  (+ field_1 field_2))

(pg/execute conn query {:as (fold/map row-sum)})

;; [10 53 14 32 ...]

Standard Folders and Aliases

PG2 provides a number of built-in folders. Some of them are used so often that you don't need to pass them explicitly: there are shortcuts that enable certain folders internally. Below is the full list of folders, their shortcuts, and examples.

Column

Takes a single column from each row returning a plain vector:

(pg/execute conn query {:as (fold/column :id)})

;; [1 2 3 4 ....]

There is an alias :column that accepts a name of the column:

(pg/execute conn query {:column :id})
;; [1 2 3 4 ....]

Map

Acts like the standard map function from clojure.core. Applies a function to each row and collects a vector of results.

Passing the folder explicitly:

(pg/execute conn query {:as (fold/map func)})

And with an alias:

(pg/execute conn query {:map func})

Default

Collects unmodified rows into a vector. It's unlikely you'll need this folder explicitly, as it gets applied internally when no other folder is specified.

Dummy

A folder that doesn’t accumulate the rows but just skips them and returns nil.

(pg/execute conn query {:as fold/dummy})

nil

First

Perhaps the most needed folder, first returns the first row only and skips the rest. Note that this folder doesn't have state and thus doesn't need to be initialized. It's useful when you query a single row by its primary key:

(pg/execute conn
            "select * from users where id = $1"
            {:params [42]
             :as fold/first})

{:id 42 :email "test@test.com"}

Or pass the :first (or :first?) option set to true:

(pg/execute conn
            "select * from users where id = $1"
            {:params [42]
             :first true})

{:id 42 :email "test@test.com"}

Index by

Often, you select rows as a vector and build a map like {id => row}, for example:

(let [rows (jdbc/execute! conn ["select ..."])]
  (reduce (fn [acc row]
            (assoc acc (:id row) row))
          {}
          rows))

{1 {:id 1 :name "test1" ...}
 2 {:id 2 :name "test2" ...}
 3 {:id 3 :name "test3" ...}
 ...
 }

This process is known as indexing because later on, the map is used as an index for quick lookups.

This approach, although quite common, has flaws. First, you traverse the rows twice: when fetching them from the database, and then again inside reduce. Second, it takes extra lines of code.

The index-by folder does exactly the same: it accepts a function which is applied to each row, and the result is used as the index key. Most often you pass a keyword:

(let [query
      "with foo (a, b) as (values (1, 2), (3, 4), (5, 6))
      select * from foo"]
  (pg/execute conn query {:as (fold/index-by :a)}))

{1 {:a 1 :b 2}
 3 {:a 3 :b 4}
 5 {:a 5 :b 6}}

The shortcut :index-by accepts a function as well:

(pg/execute conn query {:index-by :a})

Group by

The group-by folder is similar to index-by but collects multiple rows per grouping key. It produces a map like {(f row) => [row1, row2, ...]} where row1, row2 and the rest all return the same value for f.

Imagine each user in the database has a role:

{:id 1 :name "Test1" :role "user"}
{:id 2 :name "Test2" :role "user"}
{:id 3 :name "Test3" :role "admin"}
{:id 4 :name "Test4" :role "owner"}
{:id 5 :name "Test5" :role "admin"}

This is what group-by returns when grouping by the :role field:

(pg/execute conn query {:as (fold/group-by :role)})

{"user"
 [{:id 1, :name "Test1", :role "user"}
  {:id 2, :name "Test2", :role "user"}]

 "admin"
 [{:id 3, :name "Test3", :role "admin"}
  {:id 5, :name "Test5", :role "admin"}]

 "owner"
 [{:id 4, :name "Test4", :role "owner"}]}

The folder has its own alias which accepts a function:

(pg/execute conn query {:group-by :role})

KV (Key and Value)

The kv folder accepts two functions: the first one is for a key (fk), and the second is for a value (fv). Then it produces a map like {(fk row) => (fv row)}.

A typical example might be a narrower index map. Imagine you select just a couple of fields, id and email. Now you need a map of {id => email} for quick email lookup by id. This is where kv does the job for you.

(pg/execute conn
            "select id, email from users"
            {:as (fold/kv :id :email)})

{1 "ivan@test.com"
 2 "hello@gmail.com"
 3 "skotobaza@mail.ru"}

The :kv alias accepts a vector of two functions:

(pg/execute conn
            "select id, email from users"
            {:kv [:id :email]})

Run

The run folder is useful for processing rows with side effects, e.g. printing them, writing them to files, or sending them to an API. A one-argument function passed to run is applied to each row, and its result is ignored. The folder returns the total number of rows processed.

(defn func [row]
  (println "processing row" row)
  (send-to-api row))

(pg/execute conn query {:as (fold/run func)})

100 ;; the number of rows processed

An example with an alias:

(pg/execute conn query {:run func})

Table

The table folder returns a plain matrix (a vector of vectors) of database values. It resembles the column folder but also keeps the column names in the leading row. Thus, the resulting table always has at least one row (it's never empty because of the header). The table view is useful when saving the data into CSV (see the sketch after the examples below).

The folder has internal state and thus needs to be initialized, taking no parameters:

(pg/execute conn query {:as (fold/table)})

[[:id :email]
 [1 "ivan@test.com"]
 [2 "skotobaza@mail.ru"]]

The alias :table accepts any non-false value:

(pg/execute conn query {:table true})

[[:id :email]
 [1 "ivan@test.com"]
 [2 "skotobaza@mail.ru"]]

Java

This folder produces a java.util.ArrayList where each row is an instance of java.util.HashMap. It doesn't require initialization:

(pg/execute conn query {:as fold/java})

Alias:

(pg/execute conn query {:java true})
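
A quick way to confirm the returned types described above:

(let [rows (pg/execute conn query {:java true})]
  [(class rows) (class (first rows))])

;; [java.util.ArrayList java.util.HashMap]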

Reduce

The reduce folder acts like the function of the same name from clojure.core. It accepts a function and an initial value (the accumulator). The function takes the accumulator and the current row, and returns an updated version of the accumulator.

Here is how you collect unique pairs of size and color from the database result:

(defn ->pair [acc {:keys [size color]}]
  (conj acc [size color]))

(pg/execute conn query {:as (fold/reduce ->pair #{})})

#{[:xxl :green]
  [:xxl :red]
  [:x :red]
  [:x :blue]}

The folder ignores the reduced short-circuit logic: it iterates until all rows are consumed and doesn't check whether the accumulator is wrapped with reduced.

The :reduce alias accepts a vector of a function and an initial value:

(pg/execute conn query {:reduce [->pair #{}]})

Into (Transduce)

This folder mimics the into logic when it deals with an xform, also known as a transducer. Sometimes you need to pass the result through a chain of map/filter/keep functions. Each of them produces an intermediate collection, which is slower than it could be with a transducer. Transducers compose a stack of actions which, when run, does not produce extra collections.

The into folder accepts an xform produced by map, filter, comp, and so on. It also accepts a persistent collection which acts as an accumulator. The accumulator is turned into a transient internally for better performance. The folder uses conj! to push values into the accumulator, so maps are not acceptable; only vectors, lists, or sets. When no accumulator is passed, an empty vector is used.

Here is a quick example of into in action:

(let [tx
      (comp (map :a)
            (filter #{1 5})
            (map str))

      query
      "with foo (a, b) as (values (1, 2), (3, 4), (5, 6))
       select * from foo"]

  (pg/execute conn query {:as (fold/into tx)}))

;; ["1" "5"]

Another case where we pass a non-empty set to collect the values:

(pg/execute conn query {:as (fold/into tx #{:a :b :c})})

;; #{:a :b :c "1" "5"}

The :into alias is a vector where the first item is an xform and the second is an accumulator:

(pg/execute conn query {:into [tx []]})

To EDN

This folder writes rows into an EDN file. It accepts an instance of java.io.Writer which must be opened in advance. The folder neither opens nor closes the writer, as these actions are beyond its scope. A common pattern is to wrap pg/execute or pg/query invocations with the with-open macro, which closes the writer even if an exception occurs.

The folder writes rows to the writer using pr-str. Each row takes one line, and the lines are separated with \n. The leading line is [, and the trailing one is ].

The result is the number of rows processed. Here is an example of dumping rows into a file called “test.edn”:

(with-open [out (-> "test.edn" io/file io/writer)]
  (pg/execute conn query {:as (fold/to-edn out)}))

;; 199

Let’s check the content of the file:

[
  {:id 1 :email "test@test.com"}
  {:id 2 :email "hello@test.com"}
  ...
  {:id 199 :email "ivan@test.com"}
]

The alias :to-edn accepts a writer object:

(with-open [out (-> "test.edn" io/file io/writer)]
  (pg/execute conn query {:to-edn out}))
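
Since the dump is a single EDN vector, it can be read back with clojure.edn; a minimal sketch:

(require '[clojure.edn :as edn]
         '[clojure.java.io :as io])

;; read the dumped rows back into a vector of maps
(with-open [in (java.io.PushbackReader. (io/reader "test.edn"))]
  (edn/read in))

;; [{:id 1 :email "test@test.com"} ...]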

To JSON

Like to-edn but dumps rows into JSON. Accepts an instance of java.io.Writer. Writes rows line by line with no pretty printing. Lines are joined with a comma. The leading and trailing lines are square brackets. The result is the number of rows put into the writer.

(with-open [out (-> "test.json" io/file io/writer)]
  (pg/execute conn query {:as (fold/to-json out)}))

;; 123

The content of the file:

[
  {"b":2,"a":1},
  {"b":4,"a":3},
  // ...
  {"b":6,"a":5}
]

The :to-json alias accepts a writer object:

(with-open [out (-> "test.json" io/file io/writer)]
  (pg/execute conn query {:to-json out}))
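
To read the dump back, any JSON parser will do; here is a sketch assuming org.clojure/data.json is available:

(require '[clojure.data.json :as json])

;; keys come back as strings unless a :key-fn option is passed
(json/read-str (slurp "test.json"))

;; [{"b" 2, "a" 1} {"b" 4, "a" 3} ...]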

For more details, see the readme file of the repo.

Permalink
