Maps Implementation Secrets

by cgrand (X 🩣)

Today no Datalog nor interop: let's talk data structures!

In ClojureDart we took pride in writing our own persistent collections. A design goal for our collections was a canonical layout, so that we could leverage this canonical layout and structural sharing to optimize boolean operations on them (diff, merge, mass dissoc etc.). It's yet untapped potential.

(And it's tangentially related to my ongoing obsession with "Datalog As The Engine Of Application State" 🤔.)

While ClojureDart hash maps are still HAMTs and their differences with Clojure ones are marginal, ClojureDart sorted maps/sets belong to a rather novel family of search trees because the usual suspects (red-black trees like in Clojure or b-trees like in Datascript) are heavily history dependent and thus not amenable to having a canonical representation.

It turns out that this family of search trees is simple (no rotations or balancing) and efficient (wide trees are supported).

Their secret? They use the hash function!

Don't forget: when we're not busy writing custom data structures, we are available to help you on your Clojure projects, or to work on ClojureDart or on our app Paktol (the expense tracker focused on making both ends meet!).

How hashing relates to balancing

If you consider 32-bit numbers, half of them end (when written in binary) in 0. A fourth of them end in 00, an eighth in 000, a sixteenth in 0000, a thirty-second in 00000... We have a geometric progression (we multiply by a constant factor, 2 here, at each step).

In a complete binary tree, there's one node at depth 0, 2 at depth 1, 4 at depth 2, 8 at depth 3, 16 at depth 4. Another geometric progression.

The key idea is to put these two geometric progressions together: if the 32-bit hash of a key ends with N zeros (N is called the rank), then this key will appear at depth 32-N in the tree!

Let's see how the integers from 0 to 999 would be spread out using this idea:

user=> (->> (range 1000)
         (map #(Integer/numberOfTrailingZeros (hash %)))
         frequencies
         (into (sorted-map)))
{0 500, 1 262, 2 121, 3 58, 4 26, 5 17, 6 7, 7 4, 8 3, 9 1, 32 1}

Very smooth progression with all ranks being at their expected capacity from 0 to 9... except for this outlier at 32. It's there because 0 is special-cased in Clojure: (hash 0) is 0, and numberOfTrailingZeros returns 32 for an input of 0. Let's just flip all bits with a bit-xor -1 to drown zero amongst the first-level crowd.

user=> (->> (range 1000)
         (map #(Integer/numberOfTrailingZeros (hash (bit-xor % -1))))
         frequencies
         (into (sorted-map)))
{0 498, 1 254, 2 137, 3 60, 4 27, 5 15, 6 4, 7 2, 8 1, 9 2}

This time it's less perfect, as we have two items at rank 9 but only one at rank 8, but c'est la vie; in practice it's a rare occurrence.

It's a rare occurrence because we want our tree to be shallow enough, and we achieve that by having a higher average branching factor. (In the ClojureDart implementation it's 16: we consider hash bits 4 by 4, so the max depth is 8 (32/4).) The higher the branching factor, the less likely it is to have a lone higher-ranked node.

Look at what happens if we simulate a branching factor of 16:

user=> (->> (range 1000)
         (map #(quot (Integer/numberOfTrailingZeros (hash (bit-xor % -1))) 4)) ; 4 because we need 4 bits 
         frequencies
         (into (sorted-map)))
{0 949, 1 48, 2 3}

And {0 9347, 1 617, 2 33, 3 3} for 10,000 items, {0 93814, 1 5817, 2 337, 3 30, 4 2} for 100,000.

Values are properly spread over ranks.
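In code, a rank function along these lines could look like this (a sketch, not the actual ClojureDart implementation; flipping the hash bits generalizes the zero-drowning trick above to arbitrary keys):

(defn rank
  "Rank of key k: trailing zeros of its (flipped) hash, 4 bits per level."
  [k]
  (quot (Integer/numberOfTrailingZeros (unchecked-int (bit-xor (hash k) -1)))
        4))

With 4 bits per level, (rank k) is always between 0 and 8, matching the max depth above.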

Why is it simpler?

Since a value is deterministically assigned to a rank, it's not going to move in the tree: it will always be at this rank. It's the power of statistics (aka the quality of the hash function) which guarantees the tree is balanced. If that worries you, remember it's the same statistics which make hash maps work.

Given its simplicity, nodes can be reduced to just an array:

  • Inner node arrays have an odd length: they alternate references to child nodes and values, in order.
  • Leaf arrays consist only of values in order, since there are no children.

The tree object only has to store a reference to the root node and the rank of this root node.

The empty tree is thus a pointer to an empty array and a rank field set to 0.
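In Clojure terms, a minimal sketch of the tree object could be (the actual ClojureDart type is richer):

(defrecord SortedTree [root rank])

(def empty-tree (->SortedTree [] 0))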

Nota bene: here I use "key" and "value" interchangeably, as we are just concerned with the tree and not with the storage of the value part of a key-value pair. Conceptually we are explaining sorted sets; maps are left as an exercise to the reader. For the record, in ClojureDart sorted maps we chose to never store the "actual value" in inner nodes: values are all stored in leaves, and the associated key is also repeated in the leaves. It makes iteration simpler/faster.

Lookup

Lookup is as plain as the memory layout: start from the root node and test the lookup key against the node's values to figure out which branch to follow. Repeat until the lookup key is found or rank 0 is reached.
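To make this concrete, here is a lookup sketch in Clojure, representing nodes as vectors (again, not the actual ClojureDart code):

(defn lookup
  "Returns the stored value equal to k, or nil if absent.
  A leaf (rank 0) holds only values; an inner node alternates
  children (even indices) and values (odd indices)."
  [node rank k]
  (if (zero? rank)
    ;; leaf: plain scan (a real implementation would binary search)
    (some #(when (zero? (compare k %)) %) node)
    ;; inner node: test k against values at odd indices
    (loop [i 1]
      (if (< i (count node))
        (let [c (compare k (nth node i))]
          (cond
            (zero? c) (nth node i)
            (neg? c)  (lookup (nth node (dec i)) (dec rank) k)
            :else     (recur (+ i 2))))
        ;; k is greater than all values: follow the last child
        (lookup (peek node) (dec rank) k)))))

(lookup [[1 3] 5 [7 9]] 1 7) ;; => 7
(lookup [[1 3] 5 [7 9]] 1 4) ;; => nil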

Insertion

First, determine the rank of the inserted key. Then perform a lookup, but stop as soon as the target rank is reached.

When the key is present, you are done.

When the key is absent, a branch sits in its place. We call this branch the incumbent branch.

The incumbent branch has to be split in two: the half which contains values less than the key and the one with values greater than the key. Each half will be inserted on each side of the key.

Pseudo-code for this case is:

let i be the index of the incumbent branch
let new-node be a copy of the current node with two extra slots such that:
  new-node[j] := node[j] for 0 <= j < i
  new-node[i+1] := key
  new-node[j+2] := node[j] for i < j
call split(node[i], key, rank-1, new-node, i, new-node, i+2)

where split is:

split(array, key, rank, left, l, right, r):
  if rank > 0 then
    let i be the index of the incumbent branch
    left[l] := copy array from its start to i (excl.) with one extra slot at the end
    right[r] := copy array from i+1 (incl.) to its end with one extra slot at the start
    call split(array[i], key, rank-1, left[l], i, right[r], 0)
  else
    let i be the index such that array[i-1] < key < array[i] (with tests involving out of bounds indices omitted)
    left[l] := copy array from its start to i (excl.)
    right[r] := copy array from i (incl.) to its end

Special case: when the key's rank is greater than the root's rank, the branches resulting from the split must be padded to match the desired rank.
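The pseudo-code above fills the two halves in place, writing into the freshly allocated parent. For reference, here is a purely functional, pair-returning sketch over vector nodes (not the ClojureDart code, which mutates fresh arrays instead):

(defn split-branch
  "Splits node (of rank `rank`) around key k; returns [left right]."
  [node k rank]
  (if (pos? rank)
    ;; inner node: find the incumbent child by scanning values (odd indices)
    (let [i (loop [i 1]
              (if (and (< i (count node)) (neg? (compare (nth node i) k)))
                (recur (+ i 2))
                (dec i)))                    ; even index of the incumbent child
          [l r] (split-branch (nth node i) k (dec rank))]
      [(conj (subvec node 0 i) l)            ; values < k, plus the left half
       (into [r] (subvec node (inc i)))])    ; the right half, plus values > k
    ;; leaf: plain partition around k
    (let [i (count (take-while #(neg? (compare % k)) node))]
      [(subvec node 0 i) (subvec node i)])))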

Deletion

First, determine the rank of the deleted key. Then perform a lookup, but stop as soon as the target rank is reached.

When the key is present, take the two branches on its sides and join (zip) them together. Replace these two branches and the deleted key by the joined branch. This is the opposite operation to splitting.

Pseudo-code for this case is:

let i be the index of the key in the node
let new-node be a copy of the current node two slots shorter, such that:
  new-node[j] := node[j] for 0 <= j < i-1
  new-node[j-2] := node[j] for i+1 < j
  new-node[i-1] := zip(node[i-1], node[i+1], rank-1)

where zip is:

zip(left, right, rank):
  if rank > 0 then 
    let i = length(left)-1
    let new-node be a new array such that:
      new-node[j] := left[j] for 0 <= j < i
      new-node[i] := zip(left[i], right[0], rank-1)
      new-node[i+j] := right[j] for 0 < j < length(right)
    return new-node
  else
    return array-concat(left, right)

Special case: when the resulting root node has no value (is a single-child node), its child becomes the new root and this rule is applied to the new root.

When the key is absent, well, you are done!
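For concreteness, here is zip as a Clojure sketch over vector nodes (not the actual ClojureDart code, which works on arrays):

(defn zip-branches
  "Joins two adjacent branches of equal rank into one."
  [left right rank]
  (if (pos? rank)
    (let [i (dec (count left))]             ; index of left's last child
      (-> (subvec left 0 i)
          (conj (zip-branches (nth left i) (nth right 0) (dec rank)))
          (into (subvec right 1))))
    (into left right)))                     ; leaves: simple concatenation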

In practice it's even simpler

Splitting and joining branches are no big deal (except if you insist on writing split in pure FP where it becomes awkward — but since all mutations occur on fresh unshared arrays, it's safe).

As we have seen, the overwhelming majority of values (in ClojureDart's case, 93.75%, or 15/16) live at rank 0 and will thus never trigger a split or a zip, only mundane array copies.

Conclusion

We hope we have piqued your interest in these sorted trees as an alternative to the better-known red-black trees and b-trees.

If you want to read more, a good starting point may be the paper which introduced Zip Trees, as its key insight of deriving ranks from a geometric distribution (made deterministic here via the hash function) led to the tree explained above.

Permalink

Clojurists Together project - Scicloj community building - October 2024 update

The Clojurists Together organisation has decided to sponsor Scicloj community building for Q3 2024, as a project by Daniel Slutsky. This is the second time the project has been selected this year. Here is Daniel’s update for October. Comments and ideas would help. 🙏 Clojurists Together update - October 2024 - Daniel Slutsky Scicloj is a Clojure group developing a stack of tools and libraries for data science. Alongside the technical challenges, community building has been an essential part of its efforts since the beginning of 2019.

Permalink

Build and Deploy Web Apps With Clojure and Fly.io

This post walks through a small web development project using Clojure, covering everything from building the app to packaging and deploying it. It’s a collection of insights and tips I’ve learned from building my Clojure side projects but presented in a more structured format.

As the title suggests, we’ll be deploying the app to Fly.io. It’s a service that allows you to deploy apps packaged as Docker images on lightweight virtual machines.[1] My experience with it has been good, it’s easy to use and quick to set up. One downside of Fly is that it doesn’t have a free tier, but if you don’t plan on leaving the app deployed, it barely costs anything.

This isn’t a tutorial on Clojure, so I’ll assume you already have some familiarity with the language as well as some of its libraries.[2]

Project Setup

In this post, we’ll be building a barebones bookmarks manager for the demo app. Users can log in using basic authentication, view all bookmarks, and create a new bookmark. It’ll be a traditional multi-page web app and the data will be stored in a SQLite database.

Here’s an overview of the project’s starting directory structure:

.
├── dev
│   └── user.clj
├── resources
│   └── config.edn
├── src
│   └── acme
│       └── main.clj
└── deps.edn

And the libraries we’re going to use. If you have some Clojure experience or have used Kit, you’re probably already familiar with all the libraries listed below.[3]

;; deps.edn
{:paths ["src" "resources"]
 :deps {org.clojure/clojure               {:mvn/version "1.12.0"}
        aero/aero                         {:mvn/version "1.1.6"}
        integrant/integrant               {:mvn/version "0.11.0"}
        ring/ring-jetty-adapter           {:mvn/version "1.12.2"}
        metosin/reitit-ring               {:mvn/version "0.7.2"}
        com.github.seancorfield/next.jdbc {:mvn/version "1.3.939"}
        org.xerial/sqlite-jdbc            {:mvn/version "3.46.1.0"}
        hiccup/hiccup                     {:mvn/version "2.0.0-RC3"}}
 :aliases
 {:dev {:extra-paths ["dev"]
        :extra-deps  {nrepl/nrepl    {:mvn/version "1.3.0"}
                      integrant/repl {:mvn/version "0.3.3"}}
        :main-opts   ["-m" "nrepl.cmdline" "--interactive" "--color"]}}}

I use Aero and Integrant for my system configuration (more on this in the next section), Ring with the Jetty adaptor for the web server, Reitit for routing, next.jdbc for database interaction, and Hiccup for rendering HTML. From what I’ve seen, this is a popular “library combination” for building web apps in Clojure.[4]

The user namespace in dev/user.clj contains helper functions from Integrant-repl to start, stop, and restart the Integrant system.

;; dev/user.clj
(ns user
  (:require
   [acme.main :as main]
   [clojure.tools.namespace.repl :as repl]
   [integrant.core :as ig]
   [integrant.repl :refer [set-prep! go halt reset reset-all]]))

(set-prep!
 (fn []
   (ig/expand (main/read-config)))) ;; we'll implement this soon

(repl/set-refresh-dirs "src" "resources")

(comment
  (go)
  (halt)
  (reset)
  (reset-all))

Systems and Configuration

If you’re new to Integrant or other dependency injection libraries like Component, I’d suggest reading “How to Structure a Clojure Web”. It’s a great explanation about the reasoning behind these libraries. Like most Clojure apps that use Aero and Integrant, my system configuration lives in a .edn file. I usually name mine as resources/config.edn. Here’s what it looks like:

;; resources/config.edn
{:server
 {:port #long #or [#env PORT 8080]
  :host #or [#env HOST "0.0.0.0"]
  :auth {:username #or [#env AUTH_USER "john.doe@email.com"]
         :password #or [#env AUTH_PASSWORD "password"]}}

 :database
 {:dbtype "sqlite"
  :dbname #or [#env DB_DATABASE "database.db"]}}

In production, most of these values will be set using environment variables. During local development, the app will use the hard-coded default values. We don’t have any sensitive values in our config (e.g., API keys), so it’s fine to commit this file to version control. If there are such values, I usually put them in another file that’s not tracked by version control and include them in the config file using Aero’s #include reader tag.
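For illustration, that could look like this (secrets.edn here is a hypothetical, git-ignored file sitting next to config.edn):

;; resources/config.edn (sketch)
{:server  {:port #long #or [#env PORT 8080]}
 :secrets #include "secrets.edn"}

;; resources/secrets.edn (not tracked by version control)
{:stripe-api-key "sk_test_..."}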

This config file is then “expanded” into the Integrant system map using the expand-key method:

;; src/acme/main.clj
(ns acme.main
  (:require
   [aero.core :as aero]
   [clojure.java.io :as io]
   [integrant.core :as ig]))

(defn read-config
  []
  {:system/config (aero/read-config (io/resource "config.edn"))})

(defmethod ig/expand-key :system/config
  [_ opts]
  (let [{:keys [server database]} opts]
    {:server/jetty (assoc server :handler (ig/ref :handler/ring))
     :handler/ring {:database (ig/ref :database/sql)
                    :auth     (:auth server)}
     :database/sql database}))

The system map is created in code instead of being in the configuration file. This makes refactoring your system simpler as you only need to change this method while leaving the config file (mostly) untouched.[5]

My current approach to Integrant + Aero config files is mostly inspired by the blog post “Rethinking Config with Aero & Integrant” and Laravel’s configuration. The config file follows a similar structure to Laravel’s config files and contains the app configurations without describing the structure of the system. Previously I had a key for each Integrant component, which led to the config file being littered with #ig/ref and more difficult to refactor.

Also, if you haven’t already, start a REPL and connect to it from your editor. Run clj -M:dev if your editor doesn’t automatically start a REPL. Next, we’ll implement the init-key and halt-key! methods for each of the components:

;; src/acme/main.clj
(ns acme.main
  (:require
   ;; ...
   [acme.handler :as handler]
   [acme.util :as util]
   [next.jdbc :as jdbc]
   [ring.adapter.jetty :as jetty]))
;; ...

(defmethod ig/init-key :server/jetty
  [_ opts]
  (let [{:keys [handler port]} opts
        jetty-opts (-> opts (dissoc :handler :auth) (assoc :join? false))
        server     (jetty/run-jetty handler jetty-opts)]
    (println "Server started on port " port)
    server))

(defmethod ig/halt-key! :server/jetty
  [_ server]
  (.stop server))

(defmethod ig/init-key :handler/ring
  [_ opts]
  (handler/handler opts))

(defmethod ig/init-key :database/sql
  [_ opts]
  (let [datasource (jdbc/get-datasource opts)]
    (util/setup-db datasource)
    datasource))

The setup-db function creates the required tables in the database if they don’t exist yet. This works fine for database migrations in small projects like this demo app, but for larger projects, consider using libraries such as Migratus (my preferred library) or Ragtime.

;; src/acme/util.clj
(ns acme.util 
  (:require
   [next.jdbc :as jdbc]))

(defn setup-db
  [db]
  (jdbc/execute-one!
   db
   ["create table if not exists bookmarks (
       bookmark_id text primary key not null,
       url text not null,
       created_at datetime default (unixepoch()) not null
     )"]))

For the server handler, let’s start with a simple function that returns a “hi world” string.

;; src/acme/handler.clj
(ns acme.handler
  (:require
   [ring.util.response :as res]))

(defn handler
  [_opts]
  (fn [req]
    (res/response "hi world")))

Now all the components are implemented. We can check if the system is working properly by evaluating (reset) in the user namespace. This will reload your files and restart the system. You should see this message printed in your REPL:

:reloading (acme.util acme.handler acme.main)
Server started on port  8080
:resumed

If we send a request to http://localhost:8080/, we should get “hi world” as the response:

$ curl localhost:8080/
hi world

Nice! The system is working correctly. In the next section, we’ll implement routing and our business logic handlers.

Routing, Middleware, and Route Handlers

First, let’s set up a ring handler and router using Reitit. We only have one route, the index / route that’ll handle both GET and POST requests.

;; src/acme/handler.clj
(ns acme.handler
  (:require
   [reitit.ring :as ring]))

(declare index-page index-action) ; implemented later in this namespace

(def routes
  [["/" {:get  index-page
         :post index-action}]])

(defn handler
  [opts]
  (ring/ring-handler
   (ring/router routes)
   (ring/routes
    (ring/redirect-trailing-slash-handler)
    (ring/create-resource-handler {:path "/"})
    (ring/create-default-handler))))

We’re including some useful middleware:

  • redirect-trailing-slash-handler to resolve routes with trailing slashes,
  • create-resource-handler to serve static files, and
  • create-default-handler to handle common 40x responses.

Implementing the Middlewares

If you remember the :handler/ring from earlier, you’ll notice that it has two dependencies, database and auth. Currently, they’re inaccessible to our route handlers. To fix this, we can inject these components into the Ring request map using a middleware function.

;; src/acme/handler.clj
;; ...

(defn components-middleware
  [components]
  (let [{:keys [database auth]} components]
    (fn [handler]
      (fn [req]
        (handler (assoc req
                        :db database
                        :auth auth))))))
;; ...

The components-middleware function takes in a map of components and creates a middleware function that “assocs” each component into the request map.[6] If you have more components such as a Redis cache or a mail service, you can add them here.

We’ll also need a middleware to handle HTTP basic authentication.[7] This middleware will check if the username and password from the request map match the values in the auth map injected by components-middleware. If they match, then the request is authenticated and the user can view the site.

;; src/acme/handler.clj
(ns acme.handler
  (:require
   ;; ...
   [acme.util :as util]
   [ring.util.response :as res]))
;; ...

(defn wrap-basic-auth
  [handler]
  (fn [req]
    (let [{:keys [headers auth]} req
          {:keys [username password]} auth
          authorization (get headers "authorization")
          correct-creds (str "Basic " (util/base64-encode
                                       (format "%s:%s" username password)))]
      (if (and authorization (= correct-creds authorization))
        (handler req)
        (-> (res/response "Access Denied")
            (res/status 401)
            (res/header "WWW-Authenticate" "Basic realm=protected"))))))
;; ...

A nice feature of Clojure is that interop with the host language is easy. The base64-encode function is just a thin wrapper over Java’s Base64.Encoder:

;; src/acme/util.clj
(ns acme.util
   ;; ...
  (:import java.util.Base64))

(defn base64-encode
  [s]
  (.encodeToString (Base64/getEncoder) (.getBytes s)))

Finally, we need to add them to the router. Since we’ll be handling form requests later, we’ll also bring in Ring’s wrap-params middleware.

;; src/acme/handler.clj
(ns acme.handler
  (:require
   ;; ...
   [ring.middleware.params :refer [wrap-params]]))
;; ...

(defn handler
  [opts]
  (ring/ring-handler
   ;; ...
   {:middleware [(components-middleware opts)
                 wrap-basic-auth
                 wrap-params]}))

Implementing the Route Handlers

We now have everything we need to implement the route handlers or the business logic of the app. First, we’ll implement the index-page function which renders a page that:

  1. Shows all of the user’s bookmarks in the database, and
  2. Shows a form that allows the user to insert new bookmarks into the database
;; src/acme/handler.clj
(ns acme.handler
  (:require
   ;; ...
   [next.jdbc :as jdbc]
   [next.jdbc.sql :as sql]))
;; ...

(defn template
  [bookmarks]
  [:html
   [:head
    [:meta {:charset "utf-8"}]
    [:meta {:name    "viewport"
            :content "width=device-width, initial-scale=1.0"}]]
   [:body
    [:h1 "bookmarks"]
    [:form {:method "POST"}
     [:div
      [:label {:for "url"} "url "]
      [:input#url {:name "url"
                   :type "url"
                   :required true
                   :placeholder "https://en.wikipedia.org/"}]]
     [:button "submit"]]
    [:p "your bookmarks:"]
    [:ul
     (if (empty? bookmarks)
       [:li "you don't have any bookmarks"]
       (map
        (fn [{:keys [url]}]
          [:li
           [:a {:href url} url]])
        bookmarks))]]])

(defn index-page
  [req]
  (try
    (let [bookmarks (sql/query (:db req)
                               ["select * from bookmarks"]
                               jdbc/unqualified-snake-kebab-opts)]
      (util/render (template bookmarks)))
    (catch Exception e
      (util/server-error e))))
;; ...

Database queries can sometimes throw exceptions, so it’s good to wrap them in a try-catch block. I’ll also introduce some helper functions:

;; src/acme/util.clj
(ns acme.util
  (:require
   ;; ...
   [hiccup2.core :as h]
   [ring.util.response :as res])
  (:import java.util.Base64))
;; ...

(defn prepend-doctype
  [s]
  (str "<!doctype html>" s))

(defn render
  [hiccup]
  (-> hiccup h/html str prepend-doctype res/response (res/content-type "text/html")))

(defn server-error
  [e]
  (println "Caught exception: " e)
  (-> (res/response "Internal server error")
      (res/status 500)))

render takes a hiccup form and turns it into a ring response, while server-error takes an exception, logs it, and returns a 500 response.

Next, we’ll implement the index-action function:

;; src/acme/handler.clj
;; ...

(defn index-action
  [req]
  (try
    (let [{:keys [db form-params]} req
          value (get form-params "url")]
      (sql/insert! db :bookmarks {:bookmark_id (random-uuid) :url value})
      (res/redirect "/" 303))
    (catch Exception e
      (util/server-error e))))
;; ...

This is an implementation of a typical post/redirect/get pattern. We get the value from the URL form field, insert a new row in the database with that value, and redirect back to the index page. Again, we’re using a try-catch block to handle possible exceptions from the database query.

That should be all of the code for the controllers. If you reload your REPL and go to http://localhost:8080, you should see something that looks like this after logging in:

Screenshot of the app

The last thing we need to do is to update the main function to start the system:

;; src/acme/main.clj
;; ...

(defn -main [& _]
  (-> (read-config) ig/expand ig/init))

Now, you should be able to run the app using clj -M -m acme.main. That’s all the code needed for the app. In the next section, we’ll package the app into a Docker image to deploy to Fly.

Packaging the App

While there are many ways to package a Clojure app, Fly.io specifically requires a Docker image. There are two approaches to doing this:

  1. Build an uberjar and run it using Java in the container, or
  2. Load the source code and run it using Clojure in the container

Both are valid approaches. I prefer the first since its only dependency is the JVM. We’ll use the tools.build library to build the uberjar. Check out the official guide for more information on building Clojure programs. Since it’s a library, to use it we can add it to our deps.edn file with an alias:

;; deps.edn
{;; ...
 :aliases
 {;; ...
  :build {:extra-deps {io.github.clojure/tools.build 
                       {:git/tag "v0.10.5" :git/sha "2a21b7a"}}
          :ns-default build}}}

Tools.build expects a build.clj file in the root of the project directory, so we’ll need to create that file. This file contains the instructions to build artefacts, which in our case is a single uberjar. There are many great examples of build.clj files on the web, including from the official documentation. For now, you can copy+paste this file into your project.

;; build.clj
(ns build
  (:require
   [clojure.tools.build.api :as b]))

(def basis (delay (b/create-basis {:project "deps.edn"})))
(def src-dirs ["src" "resources"])
(def class-dir "target/classes")

(defn uber
  [_]
  (println "Cleaning build directory...")
  (b/delete {:path "target"})

  (println "Copying files...")
  (b/copy-dir {:src-dirs   src-dirs
               :target-dir class-dir})

  (println "Compiling Clojure...")
  (b/compile-clj {:basis      @basis
                  :ns-compile '[acme.main]
                  :class-dir  class-dir})

  (println "Building Uberjar...")
  (b/uber {:basis     @basis
           :class-dir class-dir
           :uber-file "target/standalone.jar"
           :main      'acme.main}))

To build the project, run clj -T:build uber. This will create the uberjar standalone.jar in the target directory. The uber in clj -T:build uber refers to the uber function from build.clj. Since the build system is a Clojure program, you can customise it however you like. If we try to run the uberjar now, we’ll get an error:

# build the uberjar
$ clj -T:build uber
Cleaning build directory...
Copying files...
Compiling Clojure...
Building Uberjar...

# run the uberjar
$ java -jar target/standalone.jar
Error: Could not find or load main class acme.main
Caused by: java.lang.ClassNotFoundException: acme.main

This error occurred because the Main class that is required by Java isn’t built. To fix this, we need to add the :gen-class directive in our main namespace. This will instruct Clojure to create the Main class from the -main function.

;; src/acme/main.clj
(ns acme.main
  ;; ...
  (:gen-class))
;; ...

If you rebuild the project and run java -jar target/standalone.jar again, it should work perfectly. Now that we have a working build script, we can write the Dockerfile:

# Dockerfile
# install additional dependencies here in the base layer
# separate base from build layer so any additional deps installed are cached
FROM clojure:temurin-21-tools-deps-bookworm-slim AS base

FROM base as build
WORKDIR /opt
COPY . .
RUN clj -T:build uber

FROM eclipse-temurin:21-alpine AS prod
COPY --from=build /opt/target/standalone.jar /
EXPOSE 8080
ENTRYPOINT ["java", "-jar", "standalone.jar"]

It’s a multi-stage Dockerfile. We use the official Clojure Docker image as the layer to build the uberjar. Once it’s built, we copy it to a smaller Docker image that only contains the Java runtime.[8] By doing this, we get a smaller container image as well as a faster Docker build time because the layers are better cached.

That should be all for packaging the app. We can move on to the deployment now.

Deploying with Fly.io

First things first, you’ll need to install flyctl, Fly’s CLI tool for interacting with their platform. Create a Fly.io account if you haven’t already. Then run fly auth login to authenticate flyctl with your account.

Next, we’ll need to create a new Fly App:

$ fly app create
? Choose an app name (leave blank to generate one): 
automatically selected personal organization: Ryan Martin
New app created: blue-water-6489

Another way to do this is with the fly launch command, which automates a lot of the app configuration for you. We have some steps to do that are not done by fly launch, so we’ll be configuring the app manually. I also have a fly.toml file ready that you can copy straight into your project.

# fly.toml
# replace these with your app and region name
# run `fly platform regions` to get a list of regions
app = 'blue-water-6489' 
primary_region = 'sin'

[env]
  DB_DATABASE = "/data/database.db"

[http_service]
  internal_port = 8080
  force_https = true
  auto_stop_machines = "stop"
  auto_start_machines = true
  min_machines_running = 0

[mounts]
  source = "data"
  destination = "/data"
  initial_size = 1

[[vm]]
  size = "shared-cpu-1x"
  memory = "512mb"
  cpus = 1
  cpu_kind = "shared"

These are mostly the default configuration values with some additions. Under the [env] section, we’re setting the SQLite database location to /data/database.db. The database.db file itself will be stored in a persistent Fly Volume mounted on the /data directory. This is specified under the [mounts] section. Fly Volumes are similar to regular Docker volumes but are designed for Fly’s micro VMs.

We’ll need to set the AUTH_USER and AUTH_PASSWORD environment variables too, but not through the fly.toml file as these are sensitive values. To securely set these credentials with Fly, we can set them as app secrets. They’re stored encrypted and will be automatically injected into the app at boot time.

$ fly secrets set AUTH_USER=hi@ryanmartin.me AUTH_PASSWORD=not-so-secure-password
Secrets are staged for the first deployment

With this, the configuration is done and we can deploy the app using fly deploy:

$ fly deploy
# ...
Checking DNS configuration for blue-water-6489.fly.dev

Visit your newly deployed app at https://blue-water-6489.fly.dev/

The first deployment will take longer since it’s building the Docker image for the first time. Subsequent deployments should be faster due to the cached image layers. You can click on the link to view the deployed app, or you can also run fly open which will do the same thing. Here’s the app in action:

The app in action

If you made additional changes to the app or fly.toml, you can redeploy the app using the same command, fly deploy. The app is configured to auto stop/start, which helps to cut costs when there’s not a lot of traffic to the site. If you want to take down the deployment, you’ll need to delete the app itself using fly app destroy <your app name>.

Adding a Production REPL

This is an interesting topic in the Clojure community, with varying opinions on whether or not it’s a good idea. Personally I find having a REPL connected to the live app helpful, and I often use it for debugging and running queries on the live database.[9] Since we’re using SQLite, we don’t have a database server we can directly connect to, unlike Postgres or MySQL.

If you’re brave, you can even restart the app directly from the REPL without redeploying. You can easily go wrong with it, which is why some prefer not to use it.

For this project, we’re gonna add a socket REPL. It’s very simple to add (you just need to add a JVM option) and it doesn’t require additional dependencies like nREPL. Let’s update the Dockerfile:

# Dockerfile
# ...
EXPOSE 7888
ENTRYPOINT ["java", "-Dclojure.server.repl={:port 7888 :accept clojure.core.server/repl}", "-jar", "standalone.jar"]

The socket REPL will be listening on port 7888. If we redeploy the app now, the REPL will be started but we won’t be able to connect to it. That’s because we haven’t exposed the service through Fly proxy. We can do this by adding the socket REPL as a service in the [services] section in fly.toml.

However, doing this will also expose the REPL port to the public. This means that anyone can connect to your REPL and possibly mess with your app. Instead, what we want to do is to configure the socket REPL as a private service.

By default, all Fly apps in your organisation live in the same private network. This private network, called 6PN, connects the apps in your organisation through Wireguard tunnels (a VPN) using IPv6. Fly private services aren’t exposed to the public internet but can be reached from this private network. We can then use Wireguard to connect to this private network to reach our socket REPL.

Fly VMs are also configured with the hostname fly-local-6pn, which maps to its 6PN address. This is analogous to localhost, which points to your loopback address 127.0.0.1. To expose a service to 6PN, all we have to do is bind or serve it to fly-local-6pn instead of the usual 0.0.0.0. We have to update the socket REPL options to:

# Dockerfile
# ...
ENTRYPOINT ["java", "-Dclojure.server.repl={:port 7888,:address \"fly-local-6pn\",:accept clojure.core.server/repl}", "-jar", "standalone.jar"]

After redeploying, we can use the fly proxy command to forward the port from the remote server to our local machine.[10]

$ fly proxy 7888:7888
Proxying local port 7888 to remote [blue-water-6489.internal]:7888

In another shell, run:

$ rlwrap nc localhost 7888
user=>

Now we have a REPL connected to the production app! rlwrap is used for readline functionality, e.g. up/down arrow keys, vi bindings. Of course you can also connect to it from your editor.

Deploy with GitHub Actions

If you’re using GitHub, we can also set up automatic deployments on pushes/PRs with GitHub Actions. All you need is to create the workflow file:

# .github/workflows/fly.yaml
name: Fly Deploy
on:
  push:
    branches:
      - main
  workflow_dispatch:

jobs:
  deploy:
    name: Deploy app
    runs-on: ubuntu-latest
    concurrency: deploy-group
    steps:
      - uses: actions/checkout@v4
      - uses: superfly/flyctl-actions/setup-flyctl@master
      - run: flyctl deploy --remote-only
        env:
          FLY_API_TOKEN: ${{ secrets.FLY_API_TOKEN }}

To get this to work, you’ll need to create a deploy token from your app’s dashboard. Then, in your GitHub repo, create a new repository secret called FLY_API_TOKEN with the value of your deploy token. Now, whenever you push to the main branch, this workflow will automatically run and deploy your app. You can also manually run the workflow from GitHub because of the workflow_dispatch option.

End

As always, all the code is available on GitHub. Originally, this post was just about deploying to Fly.io, but along the way I kept adding on more stuff until it essentially became my version of the user manager example app. Anyway, hope this article provided a good view into web development with Clojure. As a bonus, here are some additional resources on deploying Clojure apps:


  1. The way Fly.io works under the hood is pretty clever. Instead of running the container image with a runtime like Docker, the image is unpacked and “loaded” into a VM. See this video explanation for more details. ↩

  2. If you’re interested in learning Clojure, my recommendation is to follow the official getting started guide and join the Clojurians Slack. Also, read through this list of introductory resources. ↩

  3. Kit was a big influence on me when I first started learning web development in Clojure. I never used it directly, but I did use their library choices and project structure as a base for my own projects. ↩

  4. There’s no “Rails” for the Clojure ecosystem (yet?). The prevailing opinion is to build your own “framework” by composing different libraries together. Most of these libraries are stable and are already used in production by big companies, so don’t let this discourage you from doing web development in Clojure! ↩

  5. There might be some keys that you add or remove, but the structure of the config file stays the same. ↩

  6. “assoc” (associate) is a Clojure slang that means to add or update a key-value pair in a map. ↩

  7. For more details on how basic authentication works, check out the specification. ↩

  8. Here’s a cool resource I found when researching Java Dockerfiles: WhichJDK. It provides a comprehensive comparison on the different JDKs available and recommendations on which one you should use. ↩

  9. Another (non-technically important) argument for live/production REPLs is just because it’s cool. Ever since I read the story about NASA’s programmers debugging a spacecraft through a live REPL, I’ve always wanted to try it at least once. ↩

  10. If you encounter errors related to Wireguard when running fly proxy, you can run fly doctor which will hopefully detect issues with your local setup and also suggest fixes for them. ↩

Permalink

ShipClojure: The Clojure Boilerplate to ship startups FAST - complete stack presentation

Table of Contents

  1. The complete stack
    1. Backend
      1. Underlying server - Jetty 12
      2. Routing provider - Reitit
      3. Database - Postgres
      4. Dependency & Lifecycle management - Integrant
      5. Environment & Secret management - aero
      6. Transactional emails - Resend
      7. Authentication - Ring cookie sessions
      8. Deployment - fly.io
    2. Frontend
      1. UI Rendering Framework - UIx
      2. State management - re-frame
      3. Styling - tailwind-css
      4. UI Components - daisyUI
      5. SEO & Blog Generator - borkdude's quickblog
      6. UI component documentation - Portfolio
      7. ClojureScript build tool
    3. Misc
      1. Transport
      2. Email components
      3. Dashboard & User Management
  2. Conclusion

ShipClojure is the Clojure boilerplate I've been working on for almost a year. It helps clojure developers ship SaaS products fast.

It allows you to start building your product's core business logic from day one, without focusing on adjacent boilerplate code like authentication, database setup, deployment & UI base components.

The complete stack

In this post I will describe the entire list of tech choices I made for ShipClojure, so you can get an overview of the project, but also to show modern options for people who want to build a full SaaS product and don't know what to use.

Disclaimer: You may not agree with some of the choices I made or you may have different preferences. That is fine! You can change things if you need. I built the boilerplate to be modular to a certain extent - You're gonna have a hard time replacing tailwind ;)

Backend

1. Underlying server - Jetty 12

I chose Jetty 12, and specifically ring-jetty9-adapter (the naming is confusing, but it uses Jetty 12), because it runs the latest Jetty version and supports WebSocket connections, as opposed to Jetty 9. I might have to fact-check this, but I couldn't get WebSockets working with the default ring.adapter.jetty.

2. Routing provider - Reitit

Reitit has become the most popular router and based on the benchmarks, it's the fastest. I chose it because:

  • people are familiar with it
  • it is data oriented
  • it is the fastest
  • it supports frontend routing, which means you can express your application routes in a .cljc file (see the sketch below).
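For instance, a shared route file could look like this (a sketch with hypothetical route names, not ShipClojure's actual routes); reitit.ring on the JVM and reitit.frontend in the browser can both consume the same data:

;; src/app/routes.cljc (hypothetical path)
(def routes
  [["/"        {:name :route/home}]
   ["/pricing" {:name :route/pricing}]
   ["/app"     {:name :route/dashboard}]])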

3. Database - Postgres

PostgreSQL is a battle-tested open source DB. I chose it because I used it a lot in the past, it never failed me and the tools to interact with it are robust:

  • DB interaction: next.jdbc
  • Query creation: honeysql - write clojure datastructures that transform to SQL

These two choices make it very easy to swap out postgres for any other SQL flavor database you want, which is awesome!

In full honesty, I am looking at bitemporal DBs because I see a rising interest in them. I am considering providing plugins or guides to change the database to datomic or xtdb.

This is not my main focus at the moment but it is in the backlog.

4. Dependency & Lifecycle management - Integrant

I love Integrant. It is seamless to describe your entire system and the dependencies between components. It provides a way to orchestrate system startup and shutdown.

I designed ShipClojure to be a top-down tree of dependencies. You create all the dependencies at the beginning (db connection, server instance etc.) and each piece of functionality receives the specific system component it needs as a function call parameter.

Here's an example for a database query:

    (defn get-user-password
      "Get a user's password from the database.

      `db` - The database connection.
      `user-id` - The user's ID.

      Returns a map containing the user's password hash and updated_at:
      {`:hash` string
       `:updated_at` timestamp}"
      [db user-id]
      (execute-one! db (-> (select :hash :updated_at)
                           (from :password)
                           (where [:= :user_id user-id]))))

We pass the database connection (db) to the query function so this can be easily tested with a different system configuration specific for testing.

5. Environment & Secret management - aero

Aero is a library for intentful configuration which works really well with Integrant.

All of the secrets live in a file called .saas-secrets.edn (you can change the name). Aero reads them into the system, and you can then inject them into the Integrant components so each one accesses only the relevant secrets at runtime:

{:saas/secrets #include ".saas-secrets.edn" ;; reading secrets

 :db.sql/connection #ref [:saas/secrets :db] ;; only accesses the db secrets
}

Here's a secrets example:

{:db {:dbtype "postgresql"
      :port 5432
      :host "localhost"
      :user "postgres"
      :password "secretpassword"
      :dbname "shipclojure"}}

And when we instantiate the DB through integrant, it looks like this:

(defmethod ig/init-key :db.sql/connection
  [_ secrets] ;; secrets read by aero
  (log/info "Configuring db")
  (-> (config->jdbc-url secrets)
      (datasource)
      (jdbc/with-options jdbc/snake-kebab-opts)))

You can configure this setup to use env variables for DB like this:

:saas/db {:host #or [#env DB_HOST #ref [:saas/secrets :db :host]]
          :dbtype #or [#ref [:saas/secrets :db :dbtype] "postgresql"]
          :port #or [#env DB_PORT #ref [:saas/secrets :db :port]]
          :user #or [#env DB_USER #ref [:saas/secrets :db :user]]
          :password #or [#env DB_PASSWORD #ref [:saas/secrets :db :password]]
          :dbname #or [#env DB_NAME #ref [:saas/secrets :db :dbname]]}

ShipClojure has support for different lifecycle dependencies based on the environment.

A good example is building clojurescript. In development mode, we spawn a shadow-cljs process to watch the clojurescript source. You don't need this process in production, so it doesn't start when the system starts.

I took the environment system from Kit, where you have different directories for each environment, each containing different versions of code.

6. Transactional emails - Resend

I chose Resend because it is a reliable service and it has a generous free tier of 3,000 emails/month, which should be enough to get you started. Another provider to look at if you send a massive number of emails is Amazon SES; it's the most cost-effective provider at scale.

If you wonder why go with an email provider at all instead of just hosting your own SMTP service: because deliverability will suffer if you don't use a service. You need a guarantee that the emails will arrive to the end user.

7. Authentication - Ring cookie sessions

Initially I implemented JWT Tokens with Refresh Token Rotation so the access tokens are only stored in memory. I hit a blocker with this strategy because most clojure libraries we use like ring-oauth2 rely on cookie sessions by default.

I'm not saying that this is a complete blocker, it just means going the JWT Token route requires more work, so I ended up going with http-only cookie sessions. This is a good thing too because clojurians seem to prefer cookie sessions.

To make this secure, I added CSRF & XSS protection so attackers cannot abuse the system.
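As a rough idea of the wiring, here is a minimal sketch using ring's session middleware and the ring-anti-forgery library (not ShipClojure's actual code):

(require '[ring.middleware.session :refer [wrap-session]]
         '[ring.middleware.anti-forgery :refer [wrap-anti-forgery]])

(def app
  (-> handler            ; your ring handler, assumed to exist
      wrap-anti-forgery  ; rejects POSTs without a valid CSRF token
      wrap-session))     ; http-only session cookie by default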

ShipClojure also supports sign-in with Google for passwordless sign-in.

8. Deployment - fly.io

I like Fly because it is easy to configure, without a bunch of magic underneath. You wrap your app inside a Docker image and just ship it. Fly comes with easy configurations for:

  • auto scaling machines
  • healthchecks
  • logs + sentry integration

To make the Docker image, we bundle ShipClojure in a jar which runs inside the container. This handles both the frontend & the backend.

This setup has one limitation: you don't have access to a production REPL, which Clojurians prefer. Biff already has a guide on deploying Clojure projects to a server, and most of that guide applies to ShipClojure.

Frontend

Let's discuss the stack used for ShipClojure's frontend. ShipClojure has a Single Page Application that is responsible for the dynamic part. It server-renders static pages like the landing page, the blog etc.

1. UI Rendering Framework - UIx

UIx is an idiomatic Clojure(script) interface to modern React. I explained here why I chose UIx over Reagent for ShipClojure. Here are the greatest features of UIx:

  • It uses macro-based rendering, so there's no runtime performance cost, as opposed to Reagent
  • It builds on top of modern React, so you get access to the entire ecosystem seamlessly
  • It has server-side rendering capabilities, which I use for rendering performant, interactive landing pages inside ShipClojure
  • It has a built-in linter, so it is beginner friendly and fast to pick up
  • It interops with Reagent and, more importantly, re-frame, giving you enjoyable state management and application flows that are easy to test.

2. State management - re-frame

re-frame is a framework for building scalable Single-Page Applications. Its main focus is on high programmer productivity and seamless scaling.

I admit it wasn't my first choice because it is hard to pick up. It is a joy once you get it but it takes some time to reach this point. I tried to implement ShipClojure's own state management, and the more I worked on it, the more it felt like I was re-implementing re-frame so I accepted re-frame as the state management choice for the boilerplate.

State management is hard and this is why frameworks in javascript land move to server-based applications that don't keep frontend state. Because of this, I chose re-frame, so when it gets hard, you have a guiding hand, rich documentation and a rich community to help you in the journey.

3. Styling - tailwind-css

Tailwind is a utility-first CSS framework packed with classes like flex and pt-4 that you can compose directly in your markup.

I love tailwind because I can look at a component and understand fully what it does and it does away with the mental overhead of having to constantly name CSS classes. If you are not familiar with it, it is fast to pick up and you'll find a lot of documentation on it.

Why I chose tailwind:

  • Pure CSS so I can use all of the styling with server-side rendering
  • It is one of the most popular styling frameworks across languages, so you get infinite examples for pages
  • You reuse CSS classes so you serve less CSS over the wire

4. UI Components - daisyUI

Tailwind is great but it lacks a component system to move fast. DaisyUI is a comprehensive CSS component library built on top of tailwind.

Why I chose daisyUI:

  • Access to 35+ pre-made themes instantly
  • Pure CSS so we can reuse it for server-side rendering
  • It's just CSS, so you can change the behaviour easily
  • (Again) It's just CSS, so you don't have to do too much interop with javascript land
  • Integrates with the Tailwind JIT Compiler so you generate CSS only for the components you use.

ShipClojure takes all of the CSS components and comes with an in-house component library that is easy to use for Clojure developers. All of the UI components are stateless, documented and full stack (you can use them for static pages too).

5. SEO & Blog Generator - borkdude's quickblog

Quickblog is a lightweight static blog engine that supports blogs written in markdown. It generates static pages for posts, tags & authors.

For basic blogs it works well. It might require some changes for more advanced use-cases like:

  • popups for newsletters
  • analytics
  • next article recommendation

For now it suffices, and it was easy to set up and style. I'm thinking of using React server-side rendering to create the templates that quickblog uses, so I can add interactivity to the blog. This is not high on the list of priorities.

6. UI component documentation - Portfolio

I chose Portfolio because it is the Clojure version of Storybook and it is easier to set up for the average Clojurian. Portfolio helps with documenting all of the UI components and all of their possible states.

7. ClojureScript build tool

ShipClojure uses shadow-cljs as a build tool to enable users to tap into the NPM ecosystem. I reduced the bundle(s) shipped to the end user as much as possible using code splitting, so we ship only the required code. shadow-cljs is also responsible for compiling the code that hydrates server-side rendered static pages with built-in interactivity.

Misc

1. Transport

ShipClojure uses the transit data format for over-the-wire communication between client and server to reduce payload size to the minimum required.

In practice, you, as a user, rarely need to interact with the transport format, as it gets converted to and from EDN between the backend and the frontend.
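As an illustration, hand-rolling a transit payload with transit-clj looks like this (ShipClojure does this plumbing for you):

(require '[cognitect.transit :as transit])
(import '(java.io ByteArrayOutputStream))

(let [out (ByteArrayOutputStream. 4096)
      w   (transit/writer out :json)]
  (transit/write w {:user/name "Rich" :user/id 1})
  (str out))
;; => "[\"^ \",\"~:user/name\",\"Rich\",\"~:user/id\",1]"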

2. Email components

ShipClojure comes with most of the components from react-email ported to UIx, to create beautiful transactional emails the same way you would create UI pages. I wrote the email component library in cljc so you can use these components full stack: you can preview them in the UI but send the emails from the backend server.

3. Dashboard & User Management

ShipClojure provides a dashboard for purchase & user management through Metabase, as it is easy to set up, popular, and, best of all, created in Clojure!

Conclusion

This post is a full presentation of all the tech choices for ShipClojure. I hope you learned something from reading it that you can apply to your own projects. If you have any questions about the tech choices or about the boilerplate, please DM me or write me an email and we will talk more.

If you are interested in ShipClojure, visit shipclojure.com for more details.

Permalink

shadow-css-in-practice


Tiny overview of my personal usage of shadow-css, which I like.


The screencast is meant to showcase the intersection of emacs, meow, lispy, clojure, browser and repl driven development.


Conceptually tiny:

  • Write something like (css :p-4 :mt-2); css is a macro that just returns a class name, nothing else (see the sketch below)
  • shadow-css analyses your clj/cljs/cljc sources and outputs .css files where the class names look like your_namespace_L44, that's it.
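A minimal sketch of what that looks like in a hiccup-rendering namespace (hypothetical namespace name):

(ns my.page
  (:require [shadow.css :refer (css)]))

(defn button []
  ;; (css ...) expands to a class name string at compile time
  [:button {:class (css :p-4 :mt-2 :rounded)} "Click me"])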

This is the clj hiccup shadow-css scittle stack.

It builds using babashka only, generating html from hiccup source code.

For development, you run a clj repl, optionally with a scittle nrepl in the browser, too.

Your Scittle source code consists of cljs files that you add to your build artifact (a directory of files in my case), and you include them with a script tag on the pages that need them.

CSS is generated with the tailwind-like shadow-css.

This stack grew bottom up.

Resources:

All build steps run with babashka, which is fast!

The result is here.


(Currently, the build duration of my blog is dominated by the emacs org html gen).


This would go well together with a clojure server that "just" returns html, optionally with shadow-graft.

Clojurescript and/or reagent could be added if needed. (scittle supports reagent, so clojurescript might not be needed).


It's something I would consider for small-medium projects.

Date: 2024-11-01 Fri 18:07

Permalink

OSS updates September and October 2024

In this post I'll give updates about open source I worked on during September and October 2024.

To see previous OSS updates, go here.

Sponsors

I'd like to thank all the sponsors and contributors that make this work possible. Without you, the below projects would not be as mature or wouldn't exist or be maintained at all.

Current top tier sponsors:


Sponsor info

If you want to ensure that the projects I work on are sustainably maintained, you can sponsor this work in the following ways. Thank you!

Soon Clojurists Together will be opening their application for long term funding. If you are a member, don't forget to vote!

If you're used to sponsoring through some other means which isn't listed above, please get in touch.

On to the projects that I've been working on!

Updates

In September I visited Heart of Clojure where Christian, Teodor and I did a workshop on babashka. The first workshop was soon fully booked so we even did a second one and had a lot of fun doing so. It was so good to see familiar Clojure faces in real life again. Thanks Arne and Gaiwan team for organizing this amazing conference.

Although I didn't make it to the USA for the Clojure conj in October, Alex Miller did invite me to appear towards the end of his closing talk when he mentioned that 90% of survey respondents used babashka.

If you are interested in a full stack web framework with babashka and squint, check out borkweb.

Here are updates about the projects/libraries I've worked on in the last two months.

  • clj-kondo: static analyzer and linter for Clojure code that sparks joy.
    • Unreleased
    • #1784: detect :redundant-do in catch
    • #2410: add --report-level flag
    • 2024.09.27
    • #2404: fix regression with metadata on node in hook caused by :redundant-ignore linter
    • 2024.09.26
    • #2366: new linter: :redundant-ignore. See docs
    • #2386: fix regression introduced in #2364 in letfn
    • #2389: add new hooks-api/callstack function
    • #2392: don't skip jars that were analyzed with --skip-lint
    • #2395: enum constant call warnings
    • #2400: deftype and defrecord constructors can be used with Type/new
    • #2394: add :sort option to :unsorted-required-namespaces linter to enable case-sensitive sort to match other tools
    • #2384: recognize gen/fmap var in cljs.spec.gen.alpha
  • babashka: native, fast starting Clojure interpreter for scripting.
    • #1752: include java.lang.SecurityException for java.net.http.HttpClient support
    • #1748: add clojure.core/ensure
    • Upgrade to taoensso/timbre v6.6.0
    • Upgrade to GraalVM 23
    • #1743: fix new fully qualified instance method in call position with GraalVM 23
    • Clojure 1.12 interop: method thunks, FI coercion, array notation (see below)
    • Upgrade SCI reflector based on clojure 1.12 and remove specific workaround for Thread/sleep interop
    • Add tools.reader.edn/read
    • Fix #1741: (taoensso.timbre/spy) now relies on macros from taoensso.encore previously not available in bb
    • Upgrade Clojure to 1.12.0
    • #1722: add new clojure 1.12 vars
    • #1720: include new clojure 1.12's clojure.java.process
    • #1719: add new clojure 1.12 clojure.repl.deps namespace. Only calls with explicit versions are supported.
    • #1598: use Rosetta on CircleCI to build x64 images
    • #1716: expose babashka.http-client.interceptors namespace
    • #1707: support aset on primitive array
    • #1676: restore compatibility with newest at-at version (1.3.58)
    • Bump SCI
    • Bump fs
    • Bump process
    • Bump deps.clj
    • Bump http-client
    • Bump clj-yaml
    • Bump edamame
    • Bump rewrite-clj
    • Add java.io.LineNumberReader
  • SCI: Configurable Clojure/Script interpreter suitable for scripting and Clojure DSLs
    • Fix #942: improve error location of invalid destructuring
    • Fix #917: support new Clojure 1.12 Java interop: String/new, String/.length and Integer/parseInt as fns
    • Fix #925: support new Clojure 1.12 array notation: String/1, byte/2
    • Fix #926: Support add-watch on vars in CLJS
    • Support aset on primitive array using reflection
    • Fix #928: record constructor supports optional meta + ext map
    • Fix #934: :allow may contain namespaced symbols
    • Fix #937: throw when copying non-existent namespace
    • Update sci.impl.Reflector (used for implementing JVM interop) to match Clojure 1.12
  • squint: CLJS syntax to JS compiler
    • Fix watcher and compiler not overriding squint.edn configurations with command line options.
    • Allow passing --extension and --paths via CLI
    • Fix #563: prioritize refer over core built-in
    • Update chokidar to v4 which reduces the number of dependencies
    • BREAKING: Dynamic CSS in #html must now be explicitly passed as map literal: (let [m {:color :green}] #html [:div {:style {:& m}}]). Fixes issue when using lit-html in combination with classMap. See demo
    • #556: fix referring to var in other namespace via global object in REPL mode
    • Pass --repl opts to watch subcommand in CLI
    • #552: fix REPL output with hyphen in ns name
    • Ongoing work on browser REPL. Stay tuned.
  • cherry: Experimental ClojureScript to ES6 module compiler
    • Fix referring to vars in other namespaces globally
    • Allow defclass to be referenced through other macros, e.g. as cherry.core/defclass
    • Fix emitting keyword in HTML
    • #138: Support #html literals, ported from squint
  • http-client: babashka's http-client
    • #68 Fix accidental URI path decoding in uri-with-query (@hxtmdev)
    • #71: Link back to sources in release artifact (@lread)
    • #73: Allow implicit ports when specifying the URL as a map (@lvh)
  • http-server: serve static assets
    ‱ #16: support range requests (@jmglov)
    • #13: add an ending slash to the dir link, and don't encode the slashes (@KDr2)
    • #12: Add headers to index page (rather than just file responses)
  • bbin: Install any Babashka script or project with one command
    • Fix #88: bbin ls with 0-length files doesn't crash
  • scittle: Execute Clojure(Script) directly from browser script tags via SCI
    • Add cljs.pprint/code-dispatch and cljs.pprint/with-pprint-dispatch
  • clojurescript
  • neil: A CLI to add common aliases and features to deps.edn-based projects.
    • #241: ignore missing deps file (instead of throwing) in neil new (@bobisageek)
  ‱ sci.configs: A collection of ready-to-use SCI configs.
    • Added a configuration for cljs.spec.alpha and related namespaces
  • nbb: Scripting in Clojure on Node.js using SCI
    • Include cljs.spec.alpha, cljs.spec.gen.alpha, cljs.spec.test.alpha
  • qualify-methods
    ‱ Initial release of an experimental tool to rewrite instance calls to use fully qualified methods (Clojure 1.12 only)
  • clerk: Moldable Live Programming for Clojure
    • Add support for :require-cljs which allows you to use .cljs files for render functions
    • Add support for nREPL for developing render functions
  • deps.clj: A faithful port of the clojure CLI bash script to Clojure
    • Upgrade/sync with clojure CLI v1.12.0.1479
  • process: Clojure library for shelling out / spawning sub-processes
    ‱ Work has started on supporting prepending output (in support of babashka parallel tasks). Stay tuned.

Other projects

These are (some of the) other projects I'm involved with but little to no activity happened in the past month.


  • edamame: Configurable EDN/Clojure parser with location metadata
  • quickdoc: Quick and minimal API doc generation for Clojure
  • CLI: Turn Clojure functions into CLIs!
  • fs - File system utility library for Clojure
  • tools: a set of bbin installable scripts
  • sci.nrepl: nREPL server for SCI projects that run in the browser
  • html: Html generation library inspired by squint's html tag
  • rewrite-edn: Utility lib on top of rewrite-clj with common operations to update EDN while preserving whitespace and comments
  • instaparse-bb: Use instaparse from babashka
  • babashka.json: babashka JSON library/adapter
  • tools-deps-native and tools.bbuild: use tools.deps directly from babashka
  • squint-macros: a couple of macros that stand-in for applied-science/js-interop and promesa to make CLJS projects compatible with squint and/or cherry.
  • grasp: Grep Clojure code using clojure.spec regexes
  • lein-clj-kondo: a leiningen plugin for clj-kondo
  • http-kit: Simple, high-performance event-driven HTTP client+server for Clojure.
  • babashka.nrepl: The nREPL server from babashka as a library, so it can be used from other SCI-based CLIs
  • jet: CLI to transform between JSON, EDN, YAML and Transit using Clojure
  • pod-babashka-go-sqlite3: A babashka pod for interacting with sqlite3
  • pod-babashka-fswatcher: babashka filewatcher pod
  • lein2deps: leiningen to deps.edn converter
  • sql pods: babashka pods for SQL databases
  • cljs-showcase: Showcase CLJS libs using SCI
  • babashka.book: Babashka manual
  • rewrite-clj: Rewrite Clojure code and edn
  • pod-babashka-buddy: A pod around buddy core (Cryptographic Api for Clojure).
  • gh-release-artifact: Upload artifacts to Github releases idempotently
  • carve - Remove unused Clojure vars
  • 4ever-clojure - Pure CLJS version of 4clojure, meant to run forever!
  • pod-babashka-lanterna: Interact with clojure-lanterna from babashka
  • joyride: VSCode CLJS scripting and REPL (via SCI)
  • clj2el: transpile Clojure to elisp
  • deflet: make let-expressions REPL-friendly!
  • deps.add-lib: Clojure 1.12's add-lib feature for leiningen and/or other environments without a specific version of the clojure CLI

Permalink

Making LLMs Do More of What You Want

Once we can make LLMs do what we want, we might want to formalize this and scale it up. We've got

generate :: SystemPrompt -> Prompt -> Response
generateChecked :: (String -> Maybe a) -> SystemPrompt -> Prompt -> ?Int -> Maybe a

as a baseline. And, sure, we have a vague sketch of something called generateCSSSelector, which is interesting only in a very narrow sense.

Ok, so what's next?

Suppose that, in addition to being able to do String -> String prompts the way that generate does and String -> Maybe a prompts the way that generateChecked does, you want your model to be able to call some set of functions that you want to extend to it.

type Env :: Map Name Function

Compiler writers, are you with me here?

transform :: Env -> String -> Maybe (Function, Args)
transform (env, result) =
  let parsed = maybeJsonParse result
  in case parsed of
       Just json => if (json.functionName and json.args
                        and json.functionName in env
                        and (validArgsFor env[json.functionName] json.args))
                    then Just (env[json.functionName], json.args)
                    else Nothing
       Nothing => Nothing

I'd bet Schemers, Clojurers and Common Lispers know where this is going too.

define :: Env -> Name -> (Args -> ResMap) -> Env
define (env, name, fn) = assoc env name (fn ,args ,@body)

generateToolCall :: Env -> Prompt -> Maybe (Function, Args)
generateToolCall env prompt = 
   let sysprompt = """
     You are a computer specialist.
     Your job is translating client requests into tool calls.
     Your client has sent a request to use a tool; return the function
     call corresponding to the request and no other commentary. Return a value
     of type `{"functionName" :: string, "args" :: {arg_name: arg value}}`. You
     have access to the tools: {map #(%k, typeSig %v, docstring %v) env}.
     """
   in generateChecked transform sysprompt prompt

Looking at it from out here, this is almost too trivial to bother writing. But in effect, what we've got is a pluggable, fully generalizable toolkit that gives any sufficiently smart model access to tool capabilities. call really is too trivial to bother writing in the notional language we've got; if I had to, I'd say something like call = funcall. Which tells you everything you need to know if you've worked with enough languages, and exactly nothing if you haven't. The big point of flexibility that I'm insisting on here is that you can swap out different environments in order to keep your models restricted to a (hopefully, if you've done your job) known-safe set of function bindings.

Python

So let's step down from the realm of notional pseudocode and grab the snake by the tail and head simultaneously.

def generateToolCall(tools, llm, prompt):
    sysprompt = f'You are a computer specialist. Your job is translating client requests into tool calls. Your client has sent a request to use a tool; return the function call corresponding to the request and no other commentary. Return a value of type `{{"functionName" :: string, "args" :: {{arg_name: arg value}} }}`. You have access to the tools: {tools.list()}.'

    return llm.generate_checked(tools.transform, sysprompt, prompt)

There's the head. You take a tools environment, and an llm, and a prompt describing something that asks for a tool call, and you return the tool call. That definition tells us that tools is going to need the methods list and transform at minimum.

class Tools:
    def __init__(self):
        self._env = {}

You know what's up here. An environment is a dictionary. Duh.

    def define(self, toolName, toolFunction):
        assert (
            toolFunction.__annotations__ is not None
        ), "Interned functions must be annotated"
        assert toolFunction.__doc__, "Interned functions must have docstrings"
        if toolName in self._env:
            return False
        self._env[toolName] = {
            "type": {
                k: v
                for k, v in toolFunction.__annotations__.items()
                if not k == "return"
            },
            "function": toolFunction,
        }
        return True

define is clunkier than I'd like, but I mean, what am I supposed to do here? We take a name and a function (and use Python's introspection attributes to assert that it has type annotations and documentation, because those things make it easier to spit at a model). Realistically, I could give it optional type and description arguments so that you can override the given function's __annotations__ and __doc__, and I could give __name__ the same treatment so that you could pass in lambdas if you really wanted to, even though they're awful in Python. That's about it though.

Honestly, all this definition is doing is reminding me how much simpler this code would be over in Clojure-land. Where I might still put it eventually.
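For a taste of that, here's roughly what define and call might shrink to in Clojure. This is a hypothetical sketch, not a real library, with function metadata standing in for Python's __annotations__ and __doc__:

(defn define [env tool-name f]
  ;; the env is a plain immutable map; metadata plays the role of
  ;; __annotations__ and __doc__
  (assoc env tool-name {:fn f :meta (meta f)}))

(defn call [env {:strs [functionName args]}]
  ;; nil doubles as Nothing when the tool isn't defined
  (when-let [{f :fn} (get env functionName)]
    (f args)))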

    def list(self):
        return [
            {
                "name": k,
                "type": {
                    k: v
                    for k, v in v["function"].__annotations__.items()
                    if not k == "return"
                },
                "description": v["function"].__doc__,
            }
            for k, v in self._env.items()
        ]

One of the implied methods from earlier. This is why we need to ensure documentation and type annotations; it gives the target model more info to work with.

    def validate(self, tool_call):
        if (
            "functionName" in tool_call
            and "args" in tool_call
            and tool_call["functionName"] in self._env
        ):
            f = self._env[tool_call["functionName"]]
            if not set(tool_call["args"].keys()).difference(f["type"].keys()):
                return True
        return False

    def transform(self, resp):
        parsed, success = loadch(resp)
        if not success:
            return None, False
        if self.validate(parsed):
            return parsed, True
        return None, False

The original pseudocode transform got split up here, mostly because I'm going to be a little paranoid and use validate again inside of call. Still, you can see what's up here.

Transform takes a string, tries to JSON parse it using the loadch function we defined last time. If it fails, we bail. Otherwise, we validate the result. If that succeeds, then we have a valid tool_call that we can call with confidence, assuming we've safely defined the underlying function.

validate itself does exactly what the pseudo implied earlier; we check that it's a dict with a functionName and an args, check that the functionName references something in our env, and that the thing it references has the corresponding argument list. If any of that fails, False, otherwise True.

    def call(self, tool_call):
        if self.validate(tool_call):
            return self._env[tool_call["functionName"]]["function"](**tool_call["args"])
        return None

Bam, it's a one-liner. In a lisp-like, this would just be funcall, or possibly not even a function at all, just a pair of parens marking it as something to evaluate. Also, technically, this returns a Maybe <whatever type your function returns> (note that we return None in the case that the validate call fails).

Don't take the code too seriously in its current form. I don't think I'm going to keep it precisely the way it is now, but the interface is there and any changes are likely to be cosmetic or QoL-enabling. Check the docs before building anything out of it.

The Upshot

So what's the point of all this?

>>> from typing import Optional, List
>>> def _screenshot(url: str, selectors: Optional[List[str]] = None) -> None:
...     "Takes a url and an optional list of selectors. Takes a screenshot"
...     print(f"GOT {url}, {selectors}!")
... 
>>>
>>> from trivialai import tools, ollama
>>> tls = tools.Tools()
>>> tls.define("screenshot", _screenshot)
True
>>> tls.list()
[{'name': 'screenshot', 'type': {'url': <class 'str'>, 'selectors': typing.Optional[typing.List[str]]}, 'description': 'Takes a url and an optional list of selectors. Takes a screenshot'}]
>>> client = ollama.Ollama("gemma2:2b", "http://localhost:11434/")
>>> tools.generate_tool_call(tls, client, "Take a screenshot of the Google website and highlight the search box")
LLMResult(raw=<Response [200]>, content={'functionName': 'screenshot', 'args': {'url': 'https://www.google.com/', 'selectors': ['#search']}})
>>> res = _
>>> res.content
{'functionName': 'screenshot', 'args': {'url': 'https://www.google.com/', 'selectors': ['#search']}}
>>> tls.call(res.content)
GOT https://www.google.com/, ['#search']!
>>> 

There.

If you followed this far, I think you know exactly where I'm going.

As always, I'll let you know how it goes.

Permalink

PG2 release 0.1.18

PG2 version 0.1.18 is available (it’s a client for Postgres). This release brings two major features:

  • built-in pgvector extension support;
  • better type mapping between Postgres and Clojure.

PGVector Support

Pgvector is a well-known extension for PostgreSQL. It provides a fast and robust vector type, which is quite useful for heavy computations. Pgvector also provides a sparse version of a vector to save space.

This section covers how to use types provided by the extension with PG2.

Vector

First, install pgvector as the official readme file prescribes. Now that you have it installed, try a simple table with a vector column:

(def conn
  (jdbc/get-connection {...}))

(pg/query conn "create temp table test (id int, items vector)")

(pg/execute conn "insert into test values (1, '[1,2,3]')")
(pg/execute conn "insert into test values (2, '[1,2,3,4,5]')")

(pg/execute conn "select * from test order by id")

;; [{:id 1, :items "[1,2,3]"} {:id 2, :items "[1,2,3,4,5]"}]

It works, but we got the result unparsed: the :items field in each row is a string. This is because, to take a custom type into account when encoding and decoding data, you need to tell PG2 about it. Namely, pass the :with-pgvector? flag in the config map as follows:

(def config
  {:host "127.0.0.1"
   :port 5432
   :user "test"
   :password "test"
   :database "test"
   :with-pgvector? true})

(def conn
  (jdbc/get-connection config))

Now the strings are parsed into a Clojure vector of double values:

(pg/execute conn "select * from test order by id")

[{:id 1, :items [1.0 2.0 3.0]}
 {:id 2, :items [1.0 2.0 3.0 4.0 5.0]}]

To insert a vector, pass it as a Clojure vector as well:

(pg/execute conn "insert into test values ($1, $2)"
            {:params [3 [1 2 3 4 5]]})

It can also be a lazy collection of numbers, e.g. one produced by a map call:

(pg/execute conn "insert into test values ($1, $2)"
            {:params [4 (map inc [1 2 3 4 5])]})

The vector column above doesn’t have an explicit size. Thus, vectors of any size can be stored in that column. You can limit the size by providing it in parentheses:

(pg/query conn "create temp table test2 (id int, items vector(5))")

Now if you pass a vector of a different size, you’ll get an error response from the database:

(pg/execute conn "insert into test2 values (1, '[1,2,3]')")

;; Server error response: {severity=ERROR, code=22000, file=vector.c, line=77,
;; function=CheckExpectedDim, message=expected 5 dimensions, not 3,
;; verbosity=ERROR}

The vector type supports both the text and binary modes of the PostgreSQL wire protocol.

Sparse Vector

The pgvector extension provides a special sparsevec type to store vectors where only certain elements are filled; all the remaining elements are considered zero. For example, say you have a vector of 1000 items where the 3rd item is 42.001 and the 10th item is 99.123. Storing it as a native vector of 1000 double numbers is inefficient. It can be written as follows instead, which takes much less space:

{3:42.001,10:99.123}/1000

The sparsevec Postgres type acts exactly like this: internally, it’s a sort of map that stores the size (1000) and the {index -> value} mapping. An important note is that indexes are counted from one, not zero (see the README.md file of the extension for details).

PG2 provides a special wrapper for a sparse vector. A brief demo:

(pg/execute conn "create temp table test3 (id int, v sparsevec)")

(pg/execute conn "insert into test3 values (1, '{2:42.00001,7:99.00009}/9')")

(pg/execute conn "select * from test3")

;; [{:v <SparseVector {2:42.00001,7:99.00009}/9>, :id 1}]

The v field above is an instance of the org.pg.type.SparseVector class. Let’s look at it more closely:

;; put it into a separate variable
(def -sv
  (-> (pg/execute conn "select * from test3")
      first
      :v))

(type -sv)

org.pg.type.SparseVector

The -sv value has a number of interesting traits. To turn it into a native Clojure map, just deref it:

@-sv

{:nnz 2, :index {1 42.00001, 6 99.00009}, :dim 9}

It supports nth access just like the standard Clojure vector does:

(nth -sv 0) ;; 0.0
(nth -sv 1) ;; 42.00001
(nth -sv 2) ;; 0.0

To turn it into a native vector, just pass it to the vec function:

(vec -sv)

[0.0 42.00001 0.0 0.0 0.0 0.0 99.00009 0.0 0.0]

There are several ways you can insert a sparse vector into the database. First, pass an ordinary vector:

(pg/execute conn "insert into test3 values ($1, $2)"
            {:params [2 [5 2 6 0 2 5 0 0]]})

Internally, zero values get eliminated, and the vector is transformed into a SparseVector instance. Now read it back:

(pg/execute conn "select * from test3 where id = 2")

[{:v <SparseVector {1:5.0,2:2.0,3:6.0,5:2.0,6:5.0}/8>, :id 2}]

The second way is to pass a SparseVector instance produced by the pg.type/->sparse-vector function. It accepts the size of the vector and a mapping of {index => value}:

(require '[pg.type :as t])

(pg/execute conn "insert into test3 values ($1, $2)"
            {:params [3 (t/->sparse-vector 9 {0 523.23423
                                              7 623.52346})]})

Finally, you can pass a string representation of a sparse vector:

(pg/execute conn "insert into test3 values ($1, $2)"
            {:params [3 "{1:5.0,2:2.0,3:6.0,5:2.0,6:5.0}/8"]})

Like the vector type, sparsevec can also be limited to a certain size:

create table ... (id int, items sparsevec(5))

The sparsevec type supports both the binary and text modes of the Postgres wire protocol.

Custom Schemas

The text above assumes you have the pgvector extension installed globally, meaning it is hosted in the public schema. Sometimes, though, extensions are set up per schema: for example, only a schema named sales has access to the pgvector extension.

If that’s your case and you have installed pgvector into a certain schema, the standard :with-pgvector? flag won’t work. By default, PG2 scans the pg_type table for the public.vector and public.sparsevec types. Since the schema name is not public but sales, you need to specify it by passing a special option called :type-map. It’s a map where keys are fully qualified type names (either keywords or strings), and values are predefined instances of the IProcessor interface:

(def config
  {:host "127.0.0.1"
   :port 5432
   :user "test"
   :password "test"
   :database "test"
   :type-map {"sales.vector" t/vector
              "sales.sparsevec" t/sparsevec}})

You can rely on keywords as well:

(def config
  {:host "127.0.0.1"
   :port 5432
   :user "test"
   :password "test"
   :database "test"
   :type-map {:sales/vector t/vector
              :sales/sparsevec t/sparsevec}})

The t alias references the pg.type namespace.

Now if you install the extension into the statistics schema as well, add it into the map:

(def config
  {:host "127.0.0.1"
   :port 5432
   :user "test"
   :password "test"
   :database "test"
   :type-map {:sales/vector t/vector
              :sales/sparsevec t/sparsevec
              :statistics/vector t/vector
              :statistics/sparsevec t/sparsevec}})

Should you make a mistake in a fully qualified type name, it will be ignored, and you’ll get the value from the database unparsed. The actual value depends on the binary encoding and decoding options of the connection. By default, it uses the text protocol, so you’ll get a string like “[1, 2, 3]”. With binary encoding and decoding, you’ll get a byte array that holds the raw Postgres payload.

Custom Type Processors

PG2 version 0.1.18 has had its entire type system refactored. It introduces the concept of type processors, which makes it easy to connect Postgres types with Java/Clojure ones.

When reading data from Postgres, the client knows only the OID of each column’s type. This OID is just an integer number that points to a certain type. The default builtin types are hard-coded in Postgres, and thus their OIDs are known in advance.

Say, it’s for sure that the int4 type has OID 23 and text has OID 25. That’s true for any Postgres installation. Any Postgres client has a kind of hash map or an Enum class with these OIDs.

Things get worse when you define custom types. These might be either enums or complex types defined by extensions: pgvector, postgis and so on. You can no longer guess the OIDs of such types because they are generated at runtime. Their actual values depend on a specific machine: on prod, the public.vector type has OID 10541, on pre-prod it’s 9621, and in Docker you’ll get 1523.

Moreover, a type name is unique only within the schema that holds it. You can easily have two different enum types called status defined in different schemas. Thus, relying on a type name is not a good option unless it’s fully qualified.

To deal with all of the above, a new concept of type mapping was introduced.

First, if a certain OID is builtin (meaning it exists in the list of predefined OIDs), it gets processed as before.

When you connect to a database, you can pass a mapping like {schema.typename => Processor}. Once PG2 has established a connection, it executes an internal query to discover the type mapping. Namely, it reads the pg_type table to get the OIDs that match the provided schema and type names. The query looks like this:

select
    pg_type.oid, pg_namespace.nspname || '.' || pg_type.typname as type
from
    pg_type, pg_namespace
where
    pg_type.typnamespace = pg_namespace.oid
    and pg_namespace.nspname || '.' || pg_type.typname in (
        'schema1.type1',
        'schema2.type2',
        ...
    );

It returns pairs of OID and the full type name:

121512 | schema1.type1
 21234 | schema2.type2

Now PG2 knows that the OID 121512 specifies schema1.type1 but nothing else.

Finally, from the {schema.typename => Processor} map you submitted before, PG2 builds a {OID => Processor} map. If an OID is not a default one, PG2 checks this map, trying to find a processor object.
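Conceptually, the lookup then boils down to something like this (a sketch of the dispatch order described above, not PG2’s actual code):

(defn find-processor [builtin-processors custom-processors oid]
  ;; builtin OIDs are handled as before; runtime-discovered OIDs come second;
  ;; nil means no processor was found, so the value stays unparsed
  (or (get builtin-processors oid)
      (get custom-processors oid)))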

A processor object is an instance of the org.pg.processor.IProcessor interface or, more precisely, of the abstract AProcessor class, which partially implements it. It has four methods:

ByteBuffer encodeBin(Object value,  CodecParams codecParams);
    String encodeTxt(Object value,  CodecParams codecParams);
    Object decodeBin(ByteBuffer bb, CodecParams codecParams);
    Object decodeTxt(String text,   CodecParams codecParams);

Depending on whether you’re decoding (reading) the data or encoding it (e.g. passing parameters), and on the current format (text or binary), the corresponding method is called. By implementing all four methods, you can handle any type you want.
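For illustration, here is what a custom processor might look like from Clojure, using proxy over the abstract class. This is a hypothetical sketch: whether AProcessor is meant to be extended this way, and how such an instance is then registered in :type-map, should be checked against the PG2 sources.

(import 'org.pg.processor.AProcessor)

;; hypothetical: a processor that maps a Postgres enum to Clojure keywords,
;; overriding only the text-protocol methods
(def kw-enum-processor
  (proxy [AProcessor] []
    (encodeTxt [value codec-params]
      (name value))
    (decodeTxt [text codec-params]
      (keyword text))))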

At the moment, there are about 25 processors implementing standard types: int2, int4, text, float4, and so on. Find them in the pg-core/src/java/org/pg/processor directory. There are also a couple of processors for the pgvector extension in the pgvector subdirectory.

The next step is to implement processors for the postgis extension.

Permalink

OSS Updates September and October 2024

This is a summary of the open source work I spent my time on throughout September and October 2024. This was a very busy period in my personal life and I didn't make much progress on my projects, but I did have more time than usual to think about things, which prompted many further thoughts. Keep reading for details :)

Sponsors

I always start these posts with a sincere thank you to my sponsors, whose generous ongoing support makes this work possible. I can't say how much I appreciate all of the support the community has given to my work, and I would like to give a special thanks to Clojurists Together and Nubank for providing incredibly generous grants that allowed me to reduce my client work significantly and spend more time on projects for the Clojure ecosystem for nearly a year.

If you find my work valuable, please share it with others and consider supporting it financially. There are details about how to do that on my GitHub sponsors page. On to the updates!

Personal update

I'll save the long version for the end, but there is one important personal update that's worth mentioning up front: I go by Kira Howe now. I used to be known as Kira McLean, and all of my talks, writing, and commits up to this point use Kira McLean, but I'm still the same person! Just with a new name. I even updated my GitHub handle, which went remarkably smoothly.

Conj 2024

The main Clojure-related thing I did during this period was attend the Conj. It's always cool to meet people in person who you've only ever worked with online, and I finally got to meet so many of the wonderful people from Clojure Camp and Scicloj who I've had the pleasure of working with virtually. I also had the chance to meet some of my new co-workers, which was great. There were tons of amazing talks and as always insightful and inspiring conversations. I always leave conferences with tons of energy and ideas. Then get back to reality and realize there's no time to implement them all :) But still, below are some of the main ideas I'm working through after a wonderful conference.

SVGs for visualizing graphics

Tim Pratley and Chris Houser gave a fun talk about SVGs, among other things, that made me realize using SVGs might be the perfect way to implement the "graphics" side of a grammar of graphics.

Some of you may be following the development of tableplot (formerly hanamicloth), in which Daniel Slutsky has been implementing an elegant, layered, grammar-of-graphics-inspired way to describe graphics in Clojure. This library takes this description of a graphic and translates it into a specification for one of the supported underlying Javascript visualization libraries (currently vega-lite or plotly, via hanami). Another way to think about it is as the "grammar" part of a grammar of graphics; a way to declaratively transform an arbitrary dataset into a standardized set of instructions that a generic visualization library can turn into a graphic. This is the first half of what we need for a pure Clojure implementation of a grammar of graphics.

The second key piece we need is a Clojure implementation of the actual graphics rendering. Whether we adopt a similar underlying representation for the data as vega-lite, plotly, or whatever else is less consequential at this stage. Currently we just "translate" our Clojure code into vega-lite or plotly specs and call it a day. What I want to implement is a Clojure library that can take some data and turn it into a visualization. There are many ways to implement such a thing, all with different trade-offs, but Tim and Chouser's talk made me realize SVGs might be a great tool for the job. They're fast, efficient, simple to style and edit, plus they offer potentially the most promising avenues toward making graphics accessible and interactive since they're really just XML, which is semantic, supports ARIA labels, and is easy to work with in JS.

Humble UI also came up in a few conversations, which is a totally tangential concern, but it was interesting to start thinking about how all of this could come together into a really elegant, fully Clojure-based data visualization tool for people who don't write code.

A Clojurey way of working with data

I also had a super interesting conversation on my last night in Alexandria about Clojure's position in the broader data science ecosystem. It's fair to say that we have more or less achieved feature parity now for all the essential things a person working with data would need to do. Work is ongoing to organize these tools into a coherent and accessible stack (see noj), but the pieces are all there.

The main insight I left with, though, was that we shouldn't be aiming for mere feature parity. It's important, but if you're a working data scientist, doing everything you already do just with Clojure is only a very marginal improvement, and it presents a very high switching cost for potentially not enough payoff. In short, it's a tough sell to someone who doesn't already have some prior reason to prefer Clojure.

What we should do is leverage Clojure's strengths to build tools that could leapfrog the existing solutions, rather than just providing better implementations of them. I.e. think about new ways to solve the fundamental problems in data science, rather than just offering better tools to work within the current dominant paradigm.

For example, a fundamental problem in science is reproducibility. The current way data is prepared and managed in most data (and regular) science workflows is madness, and versioning is virtually non-existent. If you pick up any given scientific paper that does some sort of data analysis, the chances that you will be able to reproduce the results are near zero, let alone with the same tools the author used. If you do manage it, you will have had to use a different implementation than the authors', re-inventing wheels and reverse-engineering their thought process. The problem isn't that scientists are bad at working with data; it's the fundamental chaos of the underlying ecosystem that's impossible to fight.

If you've ever worked with Python code, you know that dependency management is a nightmare, never mind state management within a single program. Stateful objects are just a bad mental model for computing because they require us to hold more information in our heads in order to reason about a system than our brains can handle. And when your mental model for a small amount of local data is a stateful, mutable thing, the natural inclination is to scale that mental model to your entire system. Tracking data provenance, versions, and lineage at scale is impossible when you're thinking about your problem as one giant, mutable, interdependent pile of unorganized information.

Clojure allows for some really interesting ways of thinking about data that could offer novel solutions to problems like these, because we think of data as immutable and have the tools to make working with such data efficient. None of this is new. Somehow, though, at this Conj, between some really interesting talks focused on ways of working with immutable data and the subsequent conversations, it clicked for me. If we apply the same ways we think about data in the small, like in a given program, more broadly to an entire system or workflow, I think the benefits could be huge. It's basically applying the ideas from Rich Hickey's "Value of Values" talk from over 10 years ago to a modern data science workflow.

Other problems that Clojure is well-placed to support are:

  • Scalability – Current dominant data science tools are slow and inefficient. People try to work around it by implementing libraries in C, Rust, Java, etc. and using them from e.g. Python, but this can only get you so far and adds even more brittleness and dependency management problems to the mix.
  ‱ Tracking data and model drift – This problem has a very similar underlying cause to the reproducibility issue, also fundamentally caused by a faulty mental model of data models as mutation machines.
  • Testing and validation – Software engineering practices have not really permeated the data science community and as such most pipelines are fragile. Bringing a values-first and data-driven paradigm to pipeline development could make them much more robust and reliable.

Anyway, I'm not exactly sure what any of this will look like as software yet, but I know it will be written in Clojure and I know it will be super cool. It's what I'm thinking about and experimenting with now. And I think the key point is that thinking about higher-level problems, and how Clojure can be applied to them, is the right path toward introducing Clojure into the broader data science ecosystem.

Software engineers as designers

Alex Miller's keynote was all about designing software and how they applied a process similar to the one described in Rich Hickey's keynote from last year's conj to Clojure 1.12 (among other things). The main thing I took away from it was that the best use of an experienced software engineer's time is not programming. I've had the good fortune of working with a lot of really productive teams over the years, and this talk made me realize that one thing the best ones all had in common is that at least a couple of people with a lot of experience were not in the weeds writing code all the time. Conversely a common thread between all of the worst teams I've been a part of is that team leads and managers were way too in the weeds, worrying too much about implementation details and not enough about what was being implemented.

I've come to believe that it's not possible to reason about systems at both levels simultaneously. My brain at least just can't handle both the intense attention to detail and very concrete, specific steps required to write software that actually works and the abstract, general conceptual type of thinking that's required to build systems that work. The same person can do both things at different times, but not at the same time, and the cost of switching between both contexts is high.

Following the process described by Rich and then Alex is a really great way to add structure and coherence to what can otherwise come across as just "thinking", but it requires that we admit that writing code is not always the best use of our time, which is a hard sell. I think if we let experienced software engineers spend more time thinking and less time coding we'd end up with much better software, but this requires the industry to find better ways to measure productivity.

Long version of personal updates

As most of you know or will have inferred by now, I got married in September! It was the best day ever and the subsequent vacation was wonderful, but it did more or less cancel my productivity for over a month. If you're into weddings or just want a glimpse into my personal life, we had a reel made of our wedding day that's available here on instagram via our wedding coordinator.

Immediately after I got back from my honeymoon, I also started a new job at BroadPeak, which is going great so far, but it also means I have far less time than I used to for open source and community work. I'm back to strictly evening and weekend availability, and sadly (or happily, depending how you see it) I'm at a stage of my life where not all of that is free time I can spend programming anymore.

I appreciate everyone's patience and understanding as I took these last couple of months to focus on life priorities outside of this work. I'm working on figuring out what my involvement in the community will look like going forward, but there are definitely tons of interesting things I want to work on. I'm looking forward to rounding out this year with some progress on at least some of them, but no doubt the end of December will come before I know it and there will be an infinite list of things left to do.

Thanks for reading all of this. As always, feel free to reach out anytime, and hope to see you around the Clojureverse :)

Permalink

A deep dive into data roles at Nubank

The roles within software and data engineering are increasingly integral to driving innovation and operational excellence. Nubank, as a leader in the fintech industry, exemplifies this trend through its diverse team of Software Engineers, Analytics Engineers, Data Scientists, Machine Learning Engineers, and Business Analysts.

Each of these roles plays a crucial part in the development, optimization, and deployment of data-driven solutions. This blog post delves into the unique contributions of each role at Nubank, their synergistic collaboration, and a practical example of how they come together to develop a customer-centric product recommendation widget.

We aim to provide insights into the multifaceted world of data and software engineering, illustrating how each role contributes to the overarching goal of enhancing customer experience and business efficiency.

Software engineering: programming as a core

The role of Software Engineers is critical, encompassing the development and maintenance of client-facing applications and backend microservices.

These engineers work on everything from mobile app development using Flutter to microservices creation with Clojure, integrating technologies like Kafka and hosting on AWS.

Beyond programming, they are responsible for the quality, stability, and performance of their deliverables, ensuring robust testing and proactive monitoring for any production issues.

Analytics engineering: data handling and optimization

Analytics Engineers play a vital role in creating and maintaining high-quality, high-performance datasets. Their responsibilities extend to contributing to data pipelines, data visualization, and user support, especially in Scala. 

They focus on automating processes for scalability and efficiency, managing data from various sources, and promoting data accessibility and best practices across the organization.

Despite the fact that this definition applies to Analytics Engineers in all business units, it’s important to note that there are important variations in the work performed in each BU.

For example, in the Marketplace business unit, Analytics Engineers are very focused on creating and maintaining datasets. However, in the Data BU, they are more concerned with improving the data platform and data tools to help other Analytics Engineers.

Data science: business problem solving with data

Data Scientists at Nubank are tasked with solving complex business problems through data. They develop predictive models to support key business decisions, continuously innovate in feature development, and rigorously evaluate model performance.

Sharing insights and collaborating with different teams is a significant aspect of their role. They leverage tools like Jupyter Notebooks, Scikit-learn, Keras, and internally developed open-source libraries for their analytical work.

Synergistic collaboration among data roles

The collaborative effort between Software Engineers, Analytics Engineers, and Data Scientists highlights the multifaceted approach to data management and utilization at Nubank. Each role contributes uniquely: Software Engineers build the technical infrastructure, Analytics Engineers optimize data handling, and Data Scientists apply this data to tackle real-world business challenges. This cohesive interplay is essential in reinforcing our data-centric strategy.

Machine Learning Engineers: bridging models and infrastructure

Machine Learning Engineers at Nubank play a pivotal role in operationalizing the models created by our Data Scientists. They take these sophisticated models, initially developed in environments like Python notebooks, and adapt them to our infrastructure, which includes technologies like Scala and Clojure.

Their responsibilities extend beyond deployment to include ongoing monitoring and maintenance of the models, ensuring consistency, accuracy, and health of the system.

Business Analysts: decision-making and analysis

Business Analysts at Nubank are the strategic thinkers who aid in decision-making from start to finish, leveraging the entire data infrastructure.

Their process involves business analysis to identify opportunities, designing tests to validate hypotheses, implementing strategies, and monitoring outcomes. They analyze test results and develop business strategies based on data, considering both company and customer perspectives.

They are also involved in internal processes, using data to improve recruitment, internal surveys, and operational efficiencies.

Developing a widget

To illustrate how these roles interplay, let’s consider a project to develop a product recommendation widget for our app, tailored to individual customer behavior. The process involves:

  • Data collection: Analytics Engineers extract and model data necessary for the project, making it accessible for analysis.
  • Model development: Data Scientists analyze the data and develop the recommendation model.
  • Model operationalization: Machine Learning Engineers work on deploying the model, ensuring its inputs are in place and monitoring its performance. They also assist in determining the feasibility of certain features.
  • Business analysis: Business Analysts sit with Data Scientists to optimize functions based on the model outputs, assessing the business impacts of these recommendations.
  • Final dataset creation: depending on the complexity of business rules derived from the model, either Business Analysts or Analytics Engineers create a final dataset with customer-product decisions.
  • Widget development: Software Engineers develop the widget in the app, utilizing the final dataset to display personalized product recommendations to customers.

Key takeaways

  • Sequential yet overlapping processes: while outlined in a sequential manner, these steps often occur simultaneously or overlap, showcasing the dynamic and flexible nature of our project management.
  • Role versatility: these roles are not rigid; they often overlap and support each other. For instance, a Business Analyst might engage in data modeling, or an Analytics Engineer might develop datasets for model inputs.

Conclusion

The collective efforts of all the data roles in developing a product recommendation widget underscore the importance of an integrated approach to data and software engineering. By blending their unique skills and perspectives, these professionals at Nubank showcase the power of teamwork in creating solutions that are not only technologically advanced but also deeply attuned to customer needs.

This exploration serves as a testament to the importance of diverse expertise in the tech industry, proving that the sum of collaborative efforts is greater than its individual parts. It’s an inspiring example for businesses and professionals alike, emphasizing the value of multi-disciplinary collaboration in the ever-evolving world of technology.

Check out what we shared about this topic on Meetup below:

The post A deep dive into data roles at Nubank appeared first on Building Nubank.

Permalink

Applications Open for 2025 Long-Term Funding

6 Developers will each be awarded $1,500 USD per month in 2025.

This is the 4th year we are awarding annual funding. We’ve received an unmanageable number of nominations in the past few years, so based on member input, we’ve decided to try a new approach for 2025.

APPLY: Anyone interested in receiving annual funding submits an APPLICATION outlining what they intend to work on and how that work will benefit the Clojure community. The deadline for applications is Nov. 12th, 2024 midnight Pacific Time.
BOARD REVIEW: The Clojurists Together board will review the applications and select finalists to present to the members.
MEMBERS VOTE: Ballots will go out to members no later than Nov. 20th. Members will vote on the finalists using Ranked Voting. Deadline: Dec. 2nd, midnight Pacific Time.
AWARDS: 6 developers will receive $1,500 USD per month in 2025. Awards will be announced no later than Dec. 9th, 2024.
REPORTS: Developers submit bi-monthly reports to the membership.

In the past 3 years, we have seen that giving developers flexible, long-term funding gives them the space to do high-impact work. This might be continuing maintenance on existing projects, new feature development, or perhaps a brand-new project. We’ve been excited by what they came up with in the last few years and are looking forward to seeing more great work in 2025! Thanks all and good luck!

Permalink

Clojure Deref (Oct 31, 2024)

Welcome to the Clojure Deref! This is a weekly link/news roundup for the Clojure ecosystem (feed: RSS). Thanks to Anton Fonarev for link aggregation.

Libraries and Tools

New releases and tools this week:

  • uix 1.2.0-rc3 - Idiomatic ClojureScript interface to modern React.js

  • martian 0.1.28 - The HTTP abstraction library for Clojure/script, supporting OpenAPI, Swagger, Schema, re-frame and more

  • di 3.2.0 - DI is a dependency injection framework that allows you to define dependencies as cheaply as defining function arguments

  • zodiac 0.2.31 - A simple web framework for Clojure

  • calva 2.0.481 - Clojure & ClojureScript Interactive Programming for VS Code

  • bootstring-clj 1.0.13 - Use the bootstring algorithm for encoding and decoding unicode strings to a smaller subset of the unicode space

  • telemere 1.0.0-RC1 - Structured telemetry library for Clojure/Script

  • nippy 3.5.0-RC1 - The fastest serialization library for Clojure

  • sente 1.20.0-RC1 - Realtime web comms library for Clojure/Script

  • timbre 6.6.1 - Pure Clojure/Script logging library

  • kindly-advice 1-beta11 - A small library to advise Clojure data visualization and notebook tools how to display forms and values, following the kindly convention

  • clay 2-beta21 - A tiny Clojure tool for dynamic workflow of data visualization and literate programming

  • noj 2-alpha10 - A clojure framework for data science

Permalink

Holy Dev Newsletter October 2024

Welcome to the Holy Dev newsletter, which brings you gems I found on the web, updates from my blog, and a few scattered thoughts. You can get the next one into your mailbox if you subscribe.

What is happening

I have taken a break from coding and instead spent time reading, meditating, and exercising đŸ’Ș, so there are a lot of "gems" this month. Some coding did happen, though: I helped Tony kick-start a rewrite of Fulcro Inspect to support the new Chrome manifest and thus not be removed from the Chrome Web Store (Tony has also taken the opportunity to move it from Fulcro 2 to 3). You can help us out by trying it out. I have also extended my clj-tumblr-summarizer, which creates the gems summary below, to support overriding a single updated post.

Another update is that the Heart of Clojure talks have been released, and I have added links to my favourite ones to the September newsletter.

Permalink

Supercharging the REPL Workflow

In my previous post I outlined my REPL-based workflow for CLJ as well as CLJS. In this post I want to dive deeper into how you can extend/customize it to make it your own.

The gist of it all is having a dedicated src/dev/repl.clj file. This file serves as the entrypoint to start your development setup, so this is where everything related should go. It captures the essential start/stop cycle to quickly get your workflow going and restarting it if necessary.

Quick note: There is no rule that this needs to be in this file, or even that everyone working in a team needs to use the same file. Each team member could have their own file. This is all just Clojure code and there are basically no limits.
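If you’re starting from scratch, the skeleton of such a file is tiny. A bare-bones sketch, to be filled in with whatever your project needs:

(ns repl)

(defn start []
  ;; start servers, watchers, builds, etc. here
  ::started)

(defn stop []
  ;; shut down everything start created, so start can be called again
  ::stopped)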

There are a couple of common questions that come up in shadow-cljs discussions every so often, so I’ll give some examples of stuff I have done or recommendations I have made. I used to recommend npm-run-all, but I have pretty much eliminated all my own uses of it in favor of this approach.

Example 1: Building CSS

Probably the most common question is how to process CSS in a CLJ/CLJS environment. People coming from the JS side of the world especially often expect the build tool (i.e. shadow-cljs) to take care of it. shadow-cljs does not support processing CSS, and likely never will. IMHO it doesn’t need to: you can just as well build something yourself.

I’ll use the code that builds the CSS for the shadow-cljs UI itself as a reference. You can find it in its src/dev/repl.clj.

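;; assuming requires along these lines in the ns form (aliases guessed from usage):
;;   [clojure.java.io :as io]
;;   [shadow.cljs.devtools.server.fs-watch :as fs-watch]
;;   [build] ;; project-local namespace providing css-release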
(defonce css-watch-ref (atom nil))

(defn start []
  ;; ...

  (build/css-release)

  (reset! css-watch-ref
    (fs-watch/start
      {}
      [(io/file "src" "main")]
      ["cljs" "cljc" "clj"]
      (fn [updates]
        (try
          (build/css-release)
          (catch Exception e
            (prn [:css-failed e])))
        )))

  ::started)

(defn stop []
  (when-some [css-watch @css-watch-ref]
    (fs-watch/stop css-watch))
  
  ::stopped)

So, when I start my workflow, it calls the build/css-release function, which is just a defn in another namespace. This happens to be using shadow-css, but for the purposes of this post this isn’t relevant. It could be anything you can do in Clojure.

shadow-css does not have a built-in watcher, so in the next bit I’m using the fs-watch namespace from shadow-cljs. Basically it watches the src/main folder for changes in cljs,cljc,clj files, given that shadow-css generates CSS from Clojure (css ...) forms. When fs-watch detects changes it calls the provided function. In this case I don’t care about which file was updated and just rebuild the whole CSS. This takes about 50-100ms, so optimizing this further wasn’t necessary, although I did in the past and you absolutely can.

fs-watch/start returns the watcher instance, which I’m storing in the css-watch-ref atom. The stop function will properly shut this down, to avoid ending up with this watch running multiple times.

When it is time to make an actual release I will just call build/css-release as part of that process. That is the reason this is a dedicated function in a different namespace: I don’t want to be reaching into the repl ns for release-related things, although technically there is nothing wrong with doing that. Just personal preference I guess.

Example 2: Running other Tasks

Well, this is just a repeat of the above. The pattern is always the same. Maybe you are trying to run tailwindcss instead? This has its own watch mode, so you can skip the fs-watch entirely. Clojure and the JVM have many different ways of running external processes. java.lang.ProcessBuilder is always available and quite easy to use from CLJ. The latest Clojure 1.12 release added a new clojure.java.process namespace, which might just suit your needs too, and is also just a wrapper around ProcessBuilder.

;; with ns (:import [java.lang ProcessBuilder ProcessBuilder$Redirect Process]) added
(defn css-watch []
  (-> ["npx" "tailwindcss" "-i" "./src/css/index.css" "-o" "public/css/main.css" "--watch"]
      (ProcessBuilder.)
      (.redirectError ProcessBuilder$Redirect/INHERIT)
      (.redirectOutput ProcessBuilder$Redirect/INHERIT)
      (.start)))

So, this starts the tailwindcss command (example taken directly from tailwind docs). Those .redirectError/.redirectOutput calls make sure the output of the process isn’t lost and instead is written to the stderr/stdout of the current JVM process. We do not want to wait for this process to exit, since it just keeps running and watching the files. Integrating this into our start function then becomes

  (reset! css-watch-ref
    (build/css-watch))

The css-watch function returns a java.lang.Process handle, which we can later use to kill the process in our stop function.

(defn stop []
  (when-some [css-watch @css-watch-ref]
    (.destroyForcibly css-watch)
    (reset! css-watch-ref nil))
  
  ::stopped)

You could then build your own build/css-release function, that uses the same mechanism but skips the watch. Or just run it directly from your shell instead.
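A minimal sketch of what that could look like, using Clojure 1.12’s clojure.java.process instead of raw ProcessBuilder (the tailwind invocation is the same assumed one as above). Derefing exit-ref makes the call block until tailwind finishes:

(require '[clojure.java.process :as proc])

(defn css-release []
  ;; same command as the watch, minus --watch, plus --minify for a release build
  @(proc/exit-ref
     (proc/start {:out :inherit, :err :inherit}
       "npx" "tailwindcss" "-i" "./src/css/index.css" "-o" "public/css/main.css" "--minify")))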

Things to be aware of

Unlimited Power!

It is quite possible to break your entire JVM or leak a lot of resources all over the place. Make sure you actually always properly clean up after yourself and do not just ignore this. Shutting down the JVM entirely will usually clean up, but I always consider this the “nuclear” option. I rely on stop to clean up everything, since actually starting the JVM is quite expensive and I want to avoid doing it. I often have my dev process running for weeks, and pretty much the only reason to ever restart it is when I need to change my dependencies.

Also make sure you actually see the produced output, the above tailwind process being one example: I would want to see its output in case tailwind is showing me an error. Depending on how you start your setup, this may not be visible by default. If you’re running this over nREPL, for example, it won’t show up in the REPL. I actually prefer it that way, so I’ll always have this output visible in a dedicated Terminal window on my side monitor.

Those are the two main reasons I personally do not like the “jack-in” process of most editors. My JVM process lives longer than my editor, and, at least in the past, some of the output wasn’t always visible by default. Could be that is a non-issue these days, just make sure to check.

Permalink

Bthreads: A Simple and Easy Paradigm for Clojure

Asynchronous programs are hard to reason about. But is this intrinsic to asynchrony? Or might we be using the wrong paradigms?

Behavioral programming is a programming paradigm that aims to make asynchronous, event-driven systems both simple and easy by using a system centered on behavioral threads (bthreads). In my previous article, I introduced the idea of behavioral programming in Clojure.

In this article, we dive deeper. I hope to convince you that, compared to the alternatives:

Permalink

Copyright © 2009, Planet Clojure. No rights reserved.
Planet Clojure is maintained by Baishampayan Ghose.
Clojure and the Clojure logo are Copyright © 2008-2009, Rich Hickey.
Theme by Brajeshwar.