Why Re-frame instead of Om Next

When I first announced that I was producing a course about Re-frame, I inevitably got lots of questions about why I chose Re-frame and not Om Next. It's a good question and I'd like to address it, because in the end, the decision was hard but the choice was kind of made for me.

Re-frame is a front-end framework created by Mike Thompson. It takes care of DOM rendering and application state management, and it has building blocks useful for other effects like communicating with the backend. Om Next, created by David Nolen, is similar, except that it has a grander vision of client/server communication. Its ambition is to unify the model of data on the client and server to reduce the number of queries the client runs, which makes applications faster, especially on mobile.

Like I said before, the two frameworks are very similar. However, I had to choose one. Making courses about both would have diluted my attention. I wound up teaching Re-frame and I am very happy with the choice. So why did I do it?

People have been asking me to teach Om Next for two years. As soon as I announced my original Om course, Single Page Applications with ClojureScript and Om, Om Next was announced. People were already asking whether it would teach Om Next and when the Om Next version would be available. My answer at the time was simple: it wouldn't teach Om Next and I wouldn't update it. Om Next is different enough that it needs its own course. However, Om Next had only been released as Alpha at that time, and I've been burned before by making a course and having the material change out from under me. My answer to people asking when I would make an Om Next course was "not before it comes out of Alpha".

That was nearly two years ago, and when I started my Re-frame course Om Next was still in Alpha. As I write this, it went into Beta1 two months ago. I know a lot of companies use it in production already. But there is so much out there for me to teach that doesn't carry a label of volatility. I cannot afford to make a course that might not work in six months.

Meanwhile, I was very interested in Om Next and I wanted to learn it. I watched David Nolen's talks. I went through the tutorial a few times. I made a small application. I got on Slack. I read Tony Kay's excellent interactive tutorial. Even with all of that, and understanding the individual pieces, I could never say that I fully understood what was going on. Writing even a small application felt like threading a needle. There were so many things I had to get perfect or they wouldn't line up as neatly as promised. I felt like at any moment I might violate some implicit assumption about how my data should be shaped or what the query DSL could do.

I knew I would have to get a handle on this to make it teachable, and I didn't feel like I could with a reasonable amount of time and effort. Teachability is a huge part of designing a framework. It is probably as important as capability. I would have to invent an onramp to get people started with Om Next. In short, I'd have to create a mental framework to use the actual framework with. In the end, this was my biggest beef with it: it has too many degrees of freedom.

Meanwhile, during my struggles with Om Next, I discovered Re-frame. Re-frame felt aggressive. The README was long, dense, and opinionated, not to mention stylistically eccentric (it has been toned down since). There were all of these parts to it, it came from someone I had never heard of, and it was built on Reagent, which I thought was neat but not as robust as Om. One day I decided to record a Clojure in One Hour lesson about it. I hadn't done anything more than read the README when I attempted my first project. It was fun and I got something working. Plus, I understood it and could see how to teach it.

Re-frame does not do all that Om Next does. Om Next has a concept of database entities; it will de-duplicate them and build indexes. It has a query language that specifies what data each component needs. And much more.

But Re-frame embodies my #1 reason for using a framework: it is a process for turning ideas into code. It's a mental framework as much as a code framework. You get a handful of useful pieces so you can divide up the problem and fit the parts into the application. Those pieces map easily onto concepts you should already be familiar with as a functional programmer. Each part is there for a reason and makes the system better.

I'm going to write this process up more fully, but here's a small example of what I'm talking about. Let's say we want to make a button that increments a count. We also want to display that count.

What are the UI elements?

UI elements go in Re-frame Components.

We need a button and a display of the count. Let's call the Components increment-button and count-display.

What UI events do we need to capture?

UI events should be captured as Re-frame Events.

The button can be clicked. Let's call the event :increment-count.

What application state do we need to store and modify?

Application state changes go in Event handlers.

We need to store the count in application state, and increment it whenever the :increment-count Event is handled.

What application state do we need to read to build our UI?

Read-only access to application state is done using Subscriptions.

Our count-display needs to read the current count, so we make a subscription for that and subscribe in the component.

We mount those components into the UI and it works.
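
Here's what that maps to in code: a minimal sketch, assuming the standard re-frame and reagent APIs, with the names from the walkthrough above.

(ns counter.core
  (:require [re-frame.core :as rf]
            [reagent.core :as r]))

;; Event handler: the :increment-count Event updates application state.
(rf/reg-event-db
 :increment-count
 (fn [db _]
   (update db :count (fnil inc 0))))

;; Subscription: read-only access to the count.
(rf/reg-sub
 :count
 (fn [db _]
   (:count db 0)))

;; Components.
(defn increment-button []
  [:button {:on-click #(rf/dispatch [:increment-count])} "Increment"])

(defn count-display []
  [:div "Count: " @(rf/subscribe [:count])])

(defn app []
  [:div [count-display] [increment-button]])

;; Mount the components into the UI.
(defn init []
  (r/render [app] (js/document.getElementById "app")))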

Notice the mental framework: each question leads directly to a piece of code. Building applications is already complex enough. You want something to give you a place for the things you need. So here's the strong statement I'm going to make regarding Re-frame:

Re-frame gives you just the right amount of structure for building a frontend application, and the structure it gives you is easy to learn.

Conclusions

Om Next has a lot of cool capabilities. You should at least watch the talks and give the Quick Start a shot. It might be just what you need, or you might like the freedom and potential it gives you. If that's you, go for it. I think it's a cool system with a great philosophy. If you're starting a new frontend and backend from scratch, do check out Untangled, which provides much more ready-made structure so you can take advantage of Om Next's capabilities.

However, I think Re-frame is a great choice for most people. It gives you 90% of what you need plus a place to put the last 10%. It's powerful, teachable, and it has a large and growing community. I'm still happy with my choice and I invite you to learn Re-frame with me. I've already got one course called Building Re-frame Components which dives into the nitty-gritty of making custom UI components in Re-frame.


Dates in Clojure: Making Sense of the Mess

You can always count on human culture to make programming messy. To find out if a person is a programmer just have them say “encodings” or “timezones” and watch their face.

In Java, and hence Clojure, there are half a dozen different ways to represent dates and times, which can lead to confusion, needless type coercion, and inconsistent code. In this post I’ll give you a quick lay of the land, and some tips to make it all a bit easier to deal with.

Representing Time

Unix Time

The traditional way to represent time, the system that has been handed down to us by The Elders, is known as Unix Time. A Unix timestamp is a number representing the seconds that have passed since the start of time, which is conveniently defined as midnight, January 1, 1970.
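
For instance, on the JVM you can get the current Unix time like this:

(quot (System/currentTimeMillis) 1000)
;; => seconds since 1970-01-01T00:00:00Z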

[Image: A happy family shortly after the start of time]

java.util.Date

Some 822 million seconds later Java 1.0 came out. It was the year of the Spice Girls, the Macarena, and mutable date classes.

user> (def date (java.util.Date.))
#'user/date
user> date
#inst "2017-06-09T06:59:40.829-00:00"
user> (.setTime date 822391200000)
nil
user> date
#inst "1996-01-23T10:00:00.000-00:00"

A Date wraps a millisecond-based Unix timestamp, meaning it's not time-zone aware.

user>  (java.util.TimeZone/setDefault (java.util.TimeZone/getTimeZone "UTC"))
user> (.toString date)
"Tue Jan 23 10:00:00 UTC 1996"
user> (java.util.TimeZone/setDefault (java.util.TimeZone/getTimeZone "Europe/Berlin"))
user> (.toString date)
"Tue Jan 23 11:00:00 CET 1996"

That’s not all that’s wrong with java.util.Date, so in Java 1.1 they tried to rectify things by deprecating half of the Date API and introducing an (obviously mutable) GregorianCalendar instead.

JodaTime

It’s 2002, Brazil has beaten Germany in the FIFA World Cup, and someone decides the time is ripe for a proper date/time library. In no time at all JodaTime is born!

It doesn’t take long before Joda becomes the de facto time library on the JVM, and in 2010 it becomes the basis for clj-time.

Joda contains two principal classes for representing time: org.joda.time.Instant, which is essentially a Unix timestamp, and org.joda.time.DateTime, which explicitly stores year/month/day/etc.

Having both of these classes makes the distinction clear between “machine time” and “human time”. For machines things are relatively straightforward: time is an ever-increasing number. The same instant in time corresponds with the same number, no matter where you are or how you look at it. There are no jumps or gaps; time is strictly monotonic.

The human view of time is a lot more complex. There are time zones, daylight saving time, not to mention different calendar systems and historical time adjustments. For adding timestamps to your logs machine time will work fine, but when your system communicates dates and times with end users you need a richer model.
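
A quick illustration with clj-time (assuming clj-time is on your classpath; it wraps the Joda classes):

(require '[clj-time.core :as t])

(def now (t/now))
;; => an org.joda.time.DateTime, always in UTC

(t/to-time-zone now (t/time-zone-for-id "Europe/Berlin"))
;; => the same instant, expressed in a different human time zone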

JSR-310

For years this was the state of things: you used the built-in java.util.Date despite its shortcomings, or you relied on JodaTime as an external dependency. That you needed a library to do something as essential as working with dates and times seemed a bit odd, and so Oracle involved the JodaTime author, Stephen Colebourne, to work on a new Date/Time library for the JDK, affectionately known as JSR-310.

JSR-310 introduces a new java.time package, which includes java.time.Instant and java.time.ZonedDateTime. While clearly inspired by JodaTime, there are also some differences. The author explains in this article that there are some shortcomings in JodaTime that they sought to address.
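
If you're on Java 8 you can use these classes from Clojure through plain interop, no library required:

(java.time.Instant/now)
;; machine time: an instant on the global timeline

(java.time.ZonedDateTime/now (java.time.ZoneId/of "Europe/Berlin"))
;; human time: the same moment in an explicit time zone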

Development was completed in 2014, and java.time shipped with Java 8. That’s basically last week in Java years, so it will be a while before support for it becomes widespread.

On the clj-time issue tracker there’s an ongoing discussion about what this will mean for clj-time. While there’s a lot of interest in clj-time adopting JSR-310, the current consensus seems to be that if they do, it will be with a new artifact id (a new library name). The differences between JodaTime and JSR-310 are big enough to make breaking API changes inevitable. By using a different name it will be possible to switch to the new version, while libraries you depend on can keep using the old version.

clojure.java-time

Meanwhile someone has gone ahead and created an idiomatic Clojure library for the java.time.* API, aptly named clojure.java-time, so perhaps this will be what clj-time 2.0 could have been. If you are starting a new project today you might skip clj-time altogether, and instead use clojure.java-time for your date and time needs, assuming you’re at least on Java 8.

If you do go that route you should do your homework though. java.time contains a handful of different date/timestamp classes, and you should know which is which. The clojure.java-time README has several links for further reading.
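
A small taste, assuming the clojure.java-time dependency is on the classpath (it provides the java-time namespace):

(require '[java-time :as jt])

(jt/instant)              ;; the current java.time.Instant
(jt/local-date 2017 6 9)  ;; a java.time.LocalDate
(jt/plus (jt/local-date 2017 6 9) (jt/days 3))
;; => 2017-06-12, as a java.time.LocalDate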

clojure.instant

It’s also worth pointing out that Clojure ships with a built-in namespace called clojure.instant (since Clojure 1.4, alongside the #inst reader literal), which contains a few utility functions for parsing dates.
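
For example:

(require '[clojure.instant :as inst])

(inst/read-instant-date "2017-06-09T06:59:40.829-00:00")
;; => a java.util.Date, the same value the #inst literal would produce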

java.sql.Timestamp

You thought we were done? Guess again! Besides java.util.Date, Java has always included yet another timestamp class: java.sql.Timestamp. The reason is that timestamps in the SQL standard have nanosecond precision, while java.util.Date only stores milliseconds. So java.sql.Timestamp inherits from java.util.Date, but adds an extra nanosecond field.

If you’re using JDBC to read or store timestamps, then you’ll run into this class.

Making things bearable

Having all these different types for representing essentially the same thing is a recipe for massive frustration, but with a few small tweaks we can make it all a lot less painful.

The trick is to pick one of the above and use it across the board. For now org.joda.time.Instant is still a likely candidate: it has widespread library support, and you can combine it with clj-time to write idiomatic Clojure code. If instead you want to be future-proof you can opt for java.time.Instant, in combination with clojure.java-time. Several people have reported they’re already getting good use out of java.time. The snippets below assume you’re sticking with JodaTime.

Whenever time values enter your program, convert them to JodaTime. Maybe you’re reading JSON, EDN, Transit, you’re pulling values off an API or out of a database. Don’t let them propagate through your code as java.util.Dates (or worse, longs representing Unix timestamps), but convert them to org.joda.time.DateTime as soon as possible.

The same goes for outputting values. You don’t want to see an exception any time you encode some JSON with a timestamp, or save a date to the database.

Many libraries can be configured to do this automatically, so you only have to worry about it once.

JDBC

JDBC is Java’s standard for communicating with relational databases, and the clojure.java.jdbc library provides a convenient Clojure wrapper. By default JDBC expects, and outputs, java.sql.Timestamp. But by loading the clj-time.jdbc namespace it switches to JodaTime instead. Nice, that was easy!

;; load/save DB times as joda DateTime
(require 'clj-time.jdbc)

Cheshire

The go-to library for reading and writing JSON from Clojure is called Cheshire. JSON doesn’t have a standard way of encoding time values. If you do need to encode timestamps the common thing to do is to either use a string containing an ISO-8601 formatted timestamp, or to use a number representing a UNIX timestamp.

This snippet configures Cheshire to output org.joda.time.DateTime as a string.

(ns my.ns
  (:require [cheshire.factory]
            [cheshire.generate :as cheshire]
            [clj-time.coerce :as tc]
            [clj-time.core :as t])
  (:import [org.joda.time DateTime]
           [com.fasterxml.jackson.core JsonGenerator]
           [java.text SimpleDateFormat]
           [java.util SimpleTimeZone]))

(defn cheshire-add-jodatime-encoder! []
  (cheshire/add-encoder
   org.joda.time.DateTime
   (fn [^org.joda.time.DateTime d ^JsonGenerator jg]
     (let [sdf (SimpleDateFormat. cheshire.factory/default-date-format)]
       (.setTimeZone sdf (SimpleTimeZone. 0 "UTC"))
       (.writeString jg (.format sdf (tc/to-date d)))))))

(cheshire-add-jodatime-encoder!)
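
With the encoder registered, encoding a map containing a DateTime just works (output is illustrative; the exact format follows cheshire.factory/default-date-format):

(require '[cheshire.core :as json])

(json/generate-string {:created-at (t/now)})
;; => "{\"created-at\":\"2017-06-09T06:59:40Z\"}"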

EDN/Clojure

In EDN and Clojure source code you can use an #inst tagged literal to represent timestamps. By default these will be read as java.util.Date.

Having a literal syntax for timestamps can be very handy; one example is writing out test data containing timestamps.

In my apps I tend to configure Clojure’s reader and printer to represent org.joda.time.DateTime as #joda/inst "...timestamp...", providing me the same convenience. When doing REPL-driven development (directly or inside a source buffer) this is particularly useful, since results that are printed out can be read again: they are proper time literals.

Clojure’s printer can be extended through the print-method and print-dup multimethods.

;; Configure the printer
(defmethod print-method org.joda.time.DateTime
  [dt ^java.io.Writer out]
  (.write out (str "#joda/inst \"" (.toString dt) "\"")))

(defmethod print-dup org.joda.time.DateTime
  [dt ^java.io.Writer out]
  (.write out (str "#joda/inst \"" (.toString dt) "\"")))

;; Utility function for the reader
(defn ->joda-inst [time-str]
  (org.joda.time.DateTime/parse time-str))

The reader can be taught about new tagged literals through a data_readers.clj file placed on the classpath. I found it works best to put it under resources, that way Leiningen is smart enough to merge it with other data_readers.clj files provided by libraries when packing up an uberjar.

;; resources/data_readers.clj
{joda/inst lambdaisland.util.time/->joda-inst}
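
With both halves in place, values round-trip at the REPL (illustrative session):

user> (org.joda.time.DateTime/parse "2017-06-09T10:00:00.000Z")
#joda/inst "2017-06-09T10:00:00.000Z"
user> #joda/inst "2017-06-09T10:00:00.000Z"
#joda/inst "2017-06-09T10:00:00.000Z"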

Tu quoque, ClojureScript?

This article has focused on JVM based Clojure, but what about ClojureScript? Things aren’t quite as chaotic on the JavaScript side. It seems in general people are content using js/Date. There is a separate goog.date.DateTime included with the Google Closure library, which in turn is used by cljs-time, so you might run into it.

For a good time

Java’s time/date landscape can be difficult to navigate, but a map of the territory can help you along the way. And with the right tools in your belt it can be a pleasant journey after all.

Do you have your own snippets for making libraries or subsystems play well with either JodaTime or java.time? Send them in and I will add them to the article.

A special thanks to Ben Lovell, Tim Gilbert, and Sean Corfield for pointing me to clojure.java-time.


Delivering our full stack training course

The two of us - Malcolm Sparks and Jon Pither - have just spent a thoroughly enjoyable week in Lisbon providing full stack Clojure training. It was a pleasure to see the start of a promising Clojure scene in Portugal's capital, one that we hope will flourish in the years to come.

We eased into the course of eight attendees with a Clojure Kata - pulling back some interesting JSON from a public API and transforming it to answer some basic queries. This involved using the REPL and Clojure's sequence library to great effect.

For the duration of the course, we built upon our Edge project, which is a Clojure bootstrapped project containing the libraries we favour and a working web application example linked up to Datomic.

We had a range of different operating systems and IDEs to contend with. There were no users of Arch Linux on this course - which is what we use at JUXT - instead, we had some Macs and a Windows environment in play. We were able to get moving in Windows by installing Git Bash and from there boot-clj, which is the build system we use for Edge.

Once everyone had their IDEs and REPLs in full swing and the Kata was completed, we then started to move through the full-stack syllabus.

Over the two days, we worked through the syllabus using a mix of exercise-based learning and discovery via live coding.

The participants were excellent and swiftly got up to speed with the course material. We regularly solicited and received feedback throughout the course to balance the mix of live coding and exercises to ensure that we maintained traction as we moved through the advanced topics.

Whilst it was an intensive two days, we were able to take the participants through the syllabus so that hopefully, they should be able to start using the various technologies for their next Clojure project.

Aside from training, we both had a lovely time in Lisbon. We were told by various people (a taxi driver, hotel manager, and course attendees) to visit the restaurant Ramiro for exemplary Portuguese seafood. We can report that the shellfish was mind-blowingly good.

We wandered into the Old Town for a cocktail, and visited the Timeout market to try some more Portuguese national dishes.

Aside from the food - to which we hopefully did justice - Lisbon is a beautiful city. We spent hours walking around and enjoying the vibe, and we would love to return. If you're thinking of a trip to Lisbon - go there.

We have training courses available to book, please see our training page for details.


Clojure's most important lesson

This article originally appeared on my blog

Any programming language that you learn will hopefully teach you new techniques and ways of solving problems.

In the case of Clojure, it has taught me how great immutability is, the difference between simple and easy, the value of values, the power of the REPL and macros, the trade-off of dynamic typing, and the beauty behind all those parentheses.

Also, being so different, it has allowed me to reconsider a lot of common wisdom, resulting in heretical thoughts about global state, dependency injection, encapsulation, testing, types and such.

However, the single most surprising lesson has been how close-minded I had been at the mere idea of it, how biased I was, how I would immediately dismiss it and mock it, and how now, I will sing its praises.

If you had come and told me five years ago about a dynamically typed (what?? I am a proper engineer! Static typing is the only way to go!) Lisp (parentheses make your code unreadable!) where everything is immutable by default (surely very slow!!) and where you just work with maps of maps (unmaintainable and error-prone!), I would have laughed at you.

How blind was I? How many other great things have I missed because of my cognitive biases? How about you?

I encourage you to go and learn Prolog, look at Shen, play around with Bloom, Scala, Kotlin, Haskell, Frege, Go, Erlang, Elixir, Idris ... there are plenty to choose from.

But pick and learn something that is really different from what you are doing now or something that you have dismissed as silly before. That can even be Java or JavaScript.

Maybe you will not like everything that you see, but enjoy the journey, be open minded, and don't be afraid to admit that maybe some of the stuff on the other side of the fence is cool, interesting or even better.

I leave you with a quote:

A language that doesn't affect the way you think about programming, is not worth knowing.
Alan Perlis, Epigrams on Programming.

Clojure, thank you.


How to get the AST of a ClojureScript form

See this page for a live demo.

Code on Github: https://github.com/lincoln-b/rage

Basic function to get the full AST of a string of CLJS:

(require '[cljs.js :as c])

;; Analyze a string of ClojureScript source and return the full AST.
(defn ast [code-str]
  (c/analyze-str
    (c/empty-state)
    code-str
    nil
    {:eval c/js-eval :context :expr}
    identity))

The above code returns an EXTREMELY long data structure. For example, ((fn [a b] (+ a b)) 5 7) returns a map that prints to almost 3,000 lines.

I asked on the Clojurians #clojurescript Slack channel and got the following explanation:

chalcidfly [9:11 AM] Why is the AST for `((fn [a b] (+ a b)) 5 7)` almost 3000 lines long?

mccraigmccraig [9:17 AM] it's got the environment expanded at every node... most of the bulk is under `:env` keys

alexmiller [9:17 AM] because AST is very verbose

chrisbroome [9:17 AM] looks like there's a lot of `:js-globals` that are repeated in various nodes

alexmiller [9:17 AM] most of those are actually the same identical object referred to from multiple locations

They pointed me to this page for an abbreviated AST: https://github.com/clojure/clojurescript/blob/master/src/main/clojure/cljs/analyzer/utils.clj#L12-L22

As mfikes pointed out, that code is Clojure (not ClojureScript) and cannot be required; it has to be copied.
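
Alternatively, a quick and dirty workaround (my own sketch, not from the Slack discussion) is to strip the bulky :env data out of the AST yourself:

(require '[clojure.walk :as walk])

;; Recursively drop the :env key (where most of the bulk lives) from every
;; map in the AST.
(defn without-env [ast-node]
  (walk/postwalk
    (fn [x] (if (map? x) (dissoc x :env) x))
    ast-node))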


SymSpell vs. BK-tree: 100x faster fuzzy string search & spell checking

Conventional wisdom and textbooks say BK-trees are especially suited for spelling correction and fuzzy string search. But does this really hold true?

Also in the comments to my blog post on spelling correction the BK-tree has been mentioned as a superior data structure for fuzzy search.

So I decided to compare and benchmark the BK-tree to other options.

Approximate string search algorithms

Approximate string search allows you to look up a string in a list of strings and return those strings that are close according to a specific string metric.

There are many different string metrics, like Levenshtein, Damerau-Levenshtein, Hamming distance, Jaro-Winkler, and Strike a match.

I will compare four different algorithms for looking up a string in a list of strings within a maximum edit distance according to the Damerau-Levenshtein string metric.

For spelling correction an additional word frequency associated with every term can be used to sort and filter the results (suggestions) further.

One could also implement a Weighted Edit distance giving a higher priority to pairs which are close to each other on the keyboard layout or which sound similar (e.g. Soundex or other phonetic algorithms which identify different spellings of the same sound). While there is a SymSpell implementation with weighted Damerau-Levenshtein edit distance / keyboard-distance, the weighted edit distance is beyond the focus of this post. It can be added as a post-processing step with only small impact on the performance to most approximate string searching algorithms by just re-prioritizing/sorting the preliminary results filtered by the Damerau-Levenshtein edit distance according to your preferences.

All algorithms strive for the same goals to achieve short lookup times: reducing the number of lookups and comparisons (between words and/or artificially generated candidates), possibly reducing further the number of full edit distance calculations and finally reducing the computational complexity of the edit distance calculation itself, while not compromising its accuracy.

The four different algorithms I want to compare and benchmark are:

Levenshtein edit distance variations

All four algorithms are using derivatives of the Levenshtein edit distance.
There are three different Levenshtein distances:

  • Levenshtein distance: adjacent transposition (AC->CA) counted as 2 edits. The triangle inequality does hold.
  • Restricted Damerau-Levenshtein distance (optimal string alignment algorithm): adjacent transposition counted as 1 edit, but substrings can’t be edited more than once: ed(“CA”, “ABC”) = 3. The triangle inequality does not hold.
  • True Damerau-Levenshtein distance: adjacent transposition counted as 1 edit, substrings can be edited more than once: ed(“CA”, “ABC”) = 2. The triangle inequality does hold.

Norvig’s algorithm is using the true Damerau-Levenshtein edit distance. It could be modified to use the Levenshtein distance.

The BK-Tree implementation of Xenopax is using the Levenshtein edit distance. It could be modified to use the True Damerau-Levenshtein edit distance, but not the Restricted Damerau-Levenshtein edit distance where the triangle inequality required for a BK tree does not hold.

SymSpell is using the Restricted Damerau-Levenshtein edit distance. It could be modified to use the Levenshtein distance or the True Damerau-Levenshtein distance.

LinSpell is using the Restricted Damerau-Levenshtein edit distance. It could be modified to use the Levenshtein distance or the True Damerau-Levenshtein distance.

Verbosity of search results

In our tests we distinguish three levels of verbosity of search results, which will result in different lookup times:

Level 0: Return only the result with the smallest edit distance within the maximum edit distance. If multiple results with the same edit distance exist, then return the result with the highest word frequency. This allows for early termination of the search, e.g. if a result with edit distance=0 was found.

Level 1: Return all results with the smallest edit distance within the maximum edit distance. If multiple results with the same edit distance exist, then return them all sorted by word frequency. This allows for early termination of the search, e.g. if a result with edit distance=0 was found.

Level 2: Return all results within the maximum edit distance, sorted by word frequency. This allows for no early termination of the search.

Norvig’s Spelling Corrector

The idea is that if we artificially generate all terms within the maximum edit distance from the misspelled term, then the correct term must be among them. We have to look all of them up in the dictionary until we have a match. So all possible combinations of the 4 spelling error types (insert, delete, replace and adjacent switch) are generated. This is quite expensive, with e.g. 114,324 candidate terms generated for a word of length=9 and edit distance=2.

Originally it was written in Python. For the benchmark I used the faithful C# port from Lorenzo Stoakes of Peter Norvig’s algorithm, which has been extended to support edit distance 3.

BK-tree

The BK-tree utilizes the triangle inequality, a property of the Levenshtein edit distance: Levenshtein(A,B)+Levenshtein(A,C)≥Levenshtein(B,C) and Levenshtein(A,B)−Levenshtein(A,C)≤Levenshtein(B,C).

During indexing, the distances Levenshtein(root node, child node) are precalculated.

During lookup we calculate Levenshtein(input, root node). The triangle inequality is used as a filter, to only recursively follow those child nodes where the precalculated Levenshtein(root node, child node) is in the range [Levenshtein(input, root node)−dmax, Levenshtein(input, root node)+dmax].
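
To make the filter concrete, here is a hypothetical Clojure sketch of BK-tree lookup (the benchmarked implementations are C#; edit-distance stands for any Levenshtein function, and a node is assumed to look like {:word w :children {distance node}}):

(defn bk-search
  "All [word distance] pairs within dmax of input, found via the BK-tree."
  [node input dmax edit-distance]
  (when node
    (let [d (edit-distance input (:word node))]
      (concat
        (when (<= d dmax) [[(:word node) d]])
        ;; Only recurse into children whose precalculated distance to this
        ;; node falls within [d - dmax, d + dmax].
        (mapcat (fn [[child-d child]]
                  (when (<= (- d dmax) child-d (+ d dmax))
                    (bk-search child input dmax edit-distance)))
                (:children node))))))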

There are several interesting posts discussing the BK-tree in detail:

I compared three C# implementations of the BK-tree and decided to use the fastest one, from Xenopax (which is also linked from Wikipedia), for this benchmark.

SymSpell Algorithm

SymSpell is an algorithm to find all strings within a maximum edit distance from a huge list of strings in very short time. It can be used for spelling correction and fuzzy string search.

SymSpell derives its speed from the Symmetric Delete spelling correction algorithm and keeps its memory requirement in check by prefix indexing.

The Symmetric Delete spelling correction algorithm reduces the complexity of edit candidate generation and dictionary lookup for a given Damerau-Levenshtein distance. It is six orders of magnitude faster (than the standard approach with deletes + transposes + replaces + inserts) and language independent.

Unlike other algorithms, only deletes are required; no transposes + replaces + inserts. Transposes + replaces + inserts of the input term are transformed into deletes of the dictionary term. Replaces and inserts are expensive and language dependent: e.g. Chinese has 70,000 Unicode Han characters!

The speed comes from pre-calculation. An average 5 letter word has about 3 million possible spelling errors within a maximum edit distance of 3, but with SymSpell you need to pre-calculate & store only 25 deletes to cover them all. Magic!
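
To illustrate the idea, here is a hypothetical Clojure sketch (not the C# implementation): the delete candidates of a term are all strings reachable by removing up to max-dist characters.

(defn deletes
  "All strings obtained by deleting one character from w."
  [w]
  (for [i (range (count w))]
    (str (subs w 0 i) (subs w (inc i)))))

(defn delete-candidates
  "w plus every delete-variant of w, up to max-dist deletions."
  [w max-dist]
  (loop [frontier #{w} seen #{w} d 0]
    (if (= d max-dist)
      seen
      (let [nxt (set (mapcat deletes frontier))]
        (recur nxt (into seen nxt) (inc d))))))

;; (count (delete-candidates "magic" 3)) ;=> 26, i.e. the word + its 25 deletes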

The idea behind prefix indexing is that the discriminatory power of additional chars is decreasing with word length. Thus by restricting the delete candidate generation to the prefix, we can save space, without sacrificing filter efficiency too much. In the benchmark three different prefix lengths lp=5, lp=6 and lp=7 were used. They reflect different compromises between search speed and index size. Longer prefix length means higher search speed at the cost of higher index size.

The C# source code of SymSpell is released as Open Source on GitHub.

LinSpell

This is basically a linear scan through the word list, calculating the edit distance for every single word (with a few tweaks). It was intended as the baseline measure for the benchmark. Surprisingly enough, and despite its O(n) characteristics, it turned out to outperform both the BK-tree and Norvig’s algorithm.

There are several reasons for that:

  • Do not underestimate the constants in the Big O notation. Visiting only 20% of the nodes in a BK-tree is more expensive than a linear search where the atomic cost is only 10%.
  • As the Damerau-Levenshtein calculation is very expensive, it is not the number of processed words which matters, but the number of words where we need the full Damerau-Levenshtein calculation. We can speed up the search if we can prefilter words from the calculation or terminate the calculation once a certain edit distance is reached.
  • If we restrict the search to the best match we can utilize options for early termination of the search.
  • Words with no spelling errors are a frequent case. Then the lookup can be done with a hash table or trie in O(1)! If we restrict the search to the best match, we can instantly terminate the search.
  • We do not need to calculate the edit distance if Abs(word.Length−input.Length) > maximum edit distance (or the best edit distance found so far, if we restrict the search to the best match); see the sketch after this list.
  • If we restrict the search to the best match, we can terminate the edit distance calculation once the best edit distance found so far is reached.
  • If we restrict the search to the best match, and we have already found one match with edit distance=1, then we do not need to calculate the edit distance if the count of the term in question is smaller than the count of the match already found.
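
A hypothetical Clojure sketch of the core LinSpell loop with the cheap length pre-filter from the list above (edit-distance again stands for any Damerau-Levenshtein implementation):

(defn lin-search
  "Linear scan: all [word distance] pairs within max-dist of input."
  [dictionary input max-dist edit-distance]
  (->> dictionary
       ;; Cheap pre-filter: a length difference > max-dist already rules
       ;; the word out, so we skip the expensive distance calculation.
       (remove #(> (Math/abs (- (count %) (count input))) max-dist))
       (keep (fn [word]
               (let [d (edit-distance word input)]
                 (when (<= d max-dist)
                   [word d]))))
       (sort-by second)))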

The C# source code of LinSpell is released as Open Source on GitHub.

Test setup

In order to benchmark the four approximate string searching algorithms we perform 2*3*3+2*2 tests for each algorithm:

1000 words with random spelling errors are searched for two dictionary sizes (29,159 words, 500,000 words), three maximum edit distances (2, 3, 4) and three verbosity types (0, 1, 2). For each test the total search time is measured.

For two dictionary sizes (29,159 words, 500,000 words) the precalculation time necessary to create the dictionary and auxiliary data structures and their memory consumption are measured.

For each of the four algorithms the found suggestions have been compared to ensure completeness of results. Result differences caused by the different Levenshtein variations used by the algorithms have been taken into account.

Test focus & limitations

The benchmark has been limited to maximum edit distance up to 4, because for natural language search an edit distance of 4 is already on the border of what’s reasonable, as the number of false positives grows exponentially with the maximum edit distance, decreasing precision, while improving recall only slightly.

The benchmark has been limited to a dictionary size up to 500,000 words because even the 20-volume Oxford English Dictionary contains only 171,476 words in current use, and 47,156 obsolete words.

Fuzzy search beyond natural language search with longer strings or bit vectors (images, voice, audio, video, DNA sequences, fingerprints, …) may require higher edit distances and larger dictionary sizes and lead to different results.

Fuzzy search for multiple words (product/event databases) or larger text fragments (plagiarism check) will likely require compounding/decompounding and string metrics beyond edit distances like Strike a match to account for missing and switched words.

Both application fields are beyond the focus of this post.

Test data

noisy_query_en_1000.txt

For the query we use the first 1000 unique words from Norvig's text corpus big.txt.

For each word a random number of edits in the range 0..Min(word.length/2, 4) is chosen. For each edit, a random type of edit (delete, insert random char, replace with random char, switch adjacent chars) is applied at a random position within the word. After the edits, no duplicates and no words with length<2 are allowed.
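
A hypothetical Clojure sketch of that noise generation (the actual test data was generated in C#):

(defn random-edit
  "Apply one random edit (delete / insert / replace / adjacent switch) to w."
  [w]
  (let [i (rand-int (count w))]
    (case (rand-int 4)
      0 (str (subs w 0 i) (subs w (inc i)))                             ;; delete
      1 (str (subs w 0 i) (char (+ 97 (rand-int 26))) (subs w i))       ;; insert
      2 (str (subs w 0 i) (char (+ 97 (rand-int 26))) (subs w (inc i))) ;; replace
      3 (if (< (inc i) (count w))                                       ;; switch adjacent
          (str (subs w 0 i) (nth w (inc i)) (nth w i) (subs w (+ i 2)))
          w))))

(defn noisy-word
  "Apply 0..Min(length/2, 4) random edits to w."
  [w]
  (nth (iterate random-edit w)
       (rand-int (inc (min (quot (count w) 2) 4)))))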

frequency_dictionary_en_30_000.txt

These are the 29,159 unique words from Norvig’s text corpus big.txt, together with their frequency in that corpus.

frequency_dictionary_en_500_000.txt

These are the most frequent 500,000 words from the English One Million list from Google Books Ngram data, together with their frequency.

All three test data files are released on GitHub.

Benchmark results

In the current benchmark we compared words with a random edit distance (0...maximum edit distance). This gives a good understanding of the average lookup time.

In a previous benchmark we were comparing words with a fixed edit distance (= maximum edit distance). This gives a good understanding of the maximum lookup time. Sometimes averages can be misleading as measure for user experience. For edit distance=3 SymSpell was 1 million times faster than Norvig's algorithm.

Summary

The benchmark could not confirm any rationale behind the widespread use and recommendation of BK-trees and Norvig’s algorithm, other than historical reasons and habit perpetually passed on by textbooks and forums. With SymSpell and LinSpell there are two alternative algorithms available which always provide better results, at least within the scope of this test.

  • Use SymSpell if speed is important. It is 2–3 orders of magnitude faster than the BK-tree and 5–6 orders of magnitude faster than Norvig’s algorithm. SymSpell lookup time grows only moderately with dictionary size and maximum edit distance. It outperforms the other three algorithms in all cases, often by several orders of magnitude. This comes at the cost of higher memory consumption and precalculation times. Precalculation times incur only once during program/server start, or even only during program development if the precalculated data is serialized/deserialized.
  • Use LinSpell if memory usage is important. It is 10× faster than the BK-tree for the same memory usage. LinSpell lookup time grows linearly with the size of the dictionary, but is nearly independent of the maximum edit distance. It is almost always better than the BK-tree and Norvig’s algorithm. It incurs no additional memory consumption or precalculation time.
  • There is no obvious benefit to using a BK-tree. In all cases it is outperformed speed-wise by SymSpell, and both speed- and memory-wise by LinSpell. While a BK-tree is faster than Norvig’s algorithm for edit distance>2, it is much slower than LinSpell or SymSpell. BK-tree performance depends heavily on the size of the dictionary, but grows only moderately with maximum edit distance.
  • There is no obvious benefit to Norvig’s algorithm. In all cases it is outperformed speed-wise by SymSpell and LinSpell, and matched memory-wise by LinSpell. Lookup time of Norvig’s algorithm is independent of the size of the dictionary, but grows exponentially with maximum edit distance.
  • Always use verbose=2 carefully (enumerate all suggestions within the maximum edit distance, not only those with the smallest edit distance). It is much slower, as it prevents early termination of the search!

SymSpell vs. BK-tree: 100x faster fuzzy string search & spell checking was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.


PurelyFunctional.tv Newsletter 235: 10x Programmers, Types and Design, Deep Learning

Issue 235 – July 24, 2017

Hi Clojurers,

Thanks so much for all of the birthday wishes. I have the best readers.

I need to apologize for a bug a customer recently discovered on my site: the download links for individual videos and reference sheets sometimes didn’t work. I’ve fixed the bug. If you tried to download them before but couldn’t, please try again.

Also, I would like to let all of my customers who bought a long time ago know that if they need to reactivate their purchase, they can just send me the email receipts and I’ll take care of it. You paid for the courses and I want to make sure you can get them.

Please enjoy the issue.

Rock on!
Eric Normand <eric@purelyfunctional.tv>

PS Want to get this in your email? Subscribe!


Some History of Functional Programming Languages YouTube

David Turner is a legend in the academic Functional Programming world. He designed the Miranda language, which later greatly influenced Haskell. This talk is great. It talks about why he didn’t choose Lisp to teach functional programming. I never knew that Lisp wasn’t based on the Lambda Calculus. You’ll have to watch the video to learn more.


Hyperproductive development

Jessica Kerr talks about why some developers are ten times as productive as others. And it’s not what you think.


Types as Design Tools YouTube

Kris Jenkins discusses how types can help us design better abstractions. His points remind me a lot of the topic of my talk Building Composable Abstractions. It also reminds me of how much I learned as a Haskell programmer.


IEEE Spectrum Top Programming Languages

IEEE published a new ranking for programming languages. Theirs is based on weighted metrics from different sources. I always find these surprising (for instance I would have guessed JavaScript for the top of all the rankings), but then I realize the programming world is much bigger than the small part I stay in.


Repl Driven Development Vimeo

Stuart Halloway seems to be on a roll. He’s teaching about how best to use the Repl to develop in Clojure. His main point: we should use the Repl more. I have to agree with him. Use of the Repl distinguishes experienced Clojure programmers.


The Formulartic Spectrum YouTube

Suz Hinton shows how easy it is to use JavaScript to creatively analyze data and visualize it.


Functional programming for deep learning

Joyce Xu briefly introduces Clojure then goes right into Cortex to talk about Deep Learning.


Simple By Design

Drew Verlee’s perspective on Clojure’s strengths.


EuroClojure and ClojuTre Speaker Interviews

I often forget to link to these interviews I do at PurelyFunctional.tv. EuroClojure just passed and if you missed it, check out interviews with some of the speakers. Also, ClojuTre has already announced their speaker lineup, and there are plenty of interviews already available. I promise I’ll link to more of these in the future.


Weekly links #27

I’ve just returned from EuroClojure, and it was amazing! Thank you to everybody who brought it to us.

Here are some interesting links I found on Medium:


Slacker 0.15 released

After another year of development, I'm proud to announce a new major release of my Clojure RPC library, slacker. This release, 0.15.0, includes an update to the wire protocol, some performance improvements, and bugfixes.

The request extension

This release of the slacker client uses the v6 protocol by default. The server can still accept the v5 protocol for backward compatibility.

The v6 protocol adds an extension field to the request and response packets. Using this field, you can exchange metadata in an RPC call. A typical use case is distributed tracing, where the trace ID is passed as an extension.

At the API level, we kept the non-invasive principle: extensions are injected as var bindings.

(require '[slacker.common :refer [with-extension]])
(require '[slacker.client :as sc])

(def conn (sc/slackerc "127.0.0.1:2104"))
(sc/defn-remote conn slacker.example.api/timestamp)

(def extension-id 16)
(with-extension {extension-id "extension-value"}
  (timestamp))

A more typical usage of extensions is in interceptors. slacker-htrace uses an extension to pass the HTrace trace ID along with the RPC call.

(defn client-interceptor [^Tracer tracer span-id-extension-id]
  {:pre (fn [req]
          (let [scope-name (str (:fname req) ":outer")
                trace-scope (.newScope tracer scope-name)]
            (-> req
                (assoc ::trace-scope trace-scope)
                (update :extensions assoc
                        span-id-extension-id
                        (from-span-id (.getSpanId ^TraceScope trace-scope))))))
   :post (fn [resp]
           (when-let [scope (::trace-scope resp)]
             (.close ^TraceScope scope))
           (dissoc resp ::trace-scope))})

(defn server-interceptor [^Tracer tracer span-id-extension-id]
  {:before (fn [req]
             (let [scope-name (str (:fname req) ":inner")
                   span-id (-> req :extensions (get span-id-extension-id) to-span-id)
                   trace-scope (.newScope tracer scope-name span-id)]
               (-> req
                   (assoc ::trace-scope trace-scope))))
   :after (fn [resp]
            (when-let [scope (::trace-scope resp)]
              (.close ^TraceScope scope))
            (dissoc resp ::trace-scope))})

Flexible server thread pool

Previously, you could only configure one thread pool for all exposed namespaces. This added the risk that one backend issue could exhaust all your server threads, leaving the server unable to respond to requests for any namespace.

To isolate execution per namespace, slacker 0.15 adds support for finer-grained thread pool configuration.

(import '[java.util.concurrent ThreadPoolExecutor ThreadPoolExecutor$AbortPolicy
          TimeUnit LinkedBlockingQueue])

(start-slacker-server [(the-ns 'slacker.example.api) (the-ns 'slacker.example.api2)]
                      2104
                      :executors {"slacker.example.api"
                                   (ThreadPoolExecutor. 10 10
                                                        0 TimeUnit/MINUTES
                                                        (LinkedBlockingQueue. 100)
                                                        (ThreadPoolExecutor$AbortPolicy.))})

Even if all execution in slacker.example.api is blocked, requests to slacker.example.api2 are unaffected.

Bugfix and performance improvement

There are some more fixes in this release:

  • Fixed issue in stop-slacker-server
  • Use platform specific buffer allocator for (de)serialization
  • Improved client side error report. Full context can be accessed via ex-data.

Middleware

Also I created two example middleware for slacker:


Server Side Rendering at modnakasta.ua

I’ve decided to translate an article I wrote for the modnaKasta technical blog, since it should be fairly interesting to a wider audience than just Russian speakers.

modnakasta.ua is a single-page application right now, which uses React internally and is written in ClojureScript. But we care about Google and about user experience, which means we certainly care about server-side rendering.

Ancientry

Since the whole front-end is in React, which promises to make the app renderable without access to the DOM - in Node.js in particular - the first idea was to render everything using Node.js. Doesn’t feel that fuzzy, does it? Especially given that there is a new shiny super-fast JS engine in Java 8: Nashorn. And first experiments showed that it’s indeed very fast, faster than Node.js.

“That’s great,” I thought, and we started building an app (not SSR, the app itself). When the app became at least somewhat usable, the time came for a reality check. Loading our ~2-3k LOC application into Nashorn proved difficult: 8 GB of heap was barely enough to load the JS file, and after 30 seconds of hard work it would spit out a “this expression in your JS is invalid” error. Fixing a few of those errors (mostly by commenting out code) showed that there was no way we could develop with an environment like that.

Antiquity

So, Node.js at last. The architecture was simple: we put the compiled JS file inside the API server uberjar, write it somewhere on start, and run Node.js against it. After that we communicate with Node via HTTP: proxy user requests there, proxy rendered HTML back.

Because Node.js is single-threaded, it’s totally wrong to have just a single process - so our API server became a process pool manager and a load balancer.

Also, we request data when a component is mounted. That’s fine in the browser: you just wait for a bit, the response comes back, and the browser renders the data. But on the server the user request is already gone: the first, “empty” version of a component was already sent to the user, because the rendering process does not wait for the data to be received.

This was fixed by making it a two-pass renderer: first you render to kick off the data-gathering process, and when all XHR activity stops, the app is rendered once more to get the full content. That takes around 300 ms on a fairly simple page in the optimistic case.

I can’t forget to mention that Node.js also leaked memory. Profilers did not help, so I went the lazy uwsgi way: made each Node process restart every 1000 requests.

But what won’t you sacrifice for the sake of results? So we got this system in place (not in production yet, though) and I went to Philly for Clojure/Conj 2015.

Enlightenment

Oddly, I had not planned on attending Allen Rohner’s “Optimizing ClojureScript Apps For Speed”, which taught me that talk descriptions suck. There was nobody in the lobby though, so I got bored and went in.

The core idea is this: Clojure 1.7 (which came out five months before that, in June 2015) gained the ability to have a single .cljc file which can be compiled by both Clojure (which usually lives in .clj files) and ClojureScript (.cljs files). Let’s say this is a React component which renders an image:

(defc Image [image-source]
  [:img {:src image-source}])

Here defc is a macro which transforms the code into React calls. Here’s a result (roughly):

(def Image (js/React.createComponent {
  :render (fn []
            (js/React.createElement "img" 
              {:src (get js/this.args :image-source)}))})) 

But that’s what the macro generates when we compile to JavaScript. What if, when executed on the JVM (that is, in Clojure), it instead turned this code into a function returning the string <img src="...">?

(defn Image [image-source]
  (format "<img src=\"%s\">" image-source))

I wrote the first version of this rendering for Rum back in January 2016, which Nikita (the author of Rum) and other contributors developed into a properly working thing. We got server-side rendering without the React runtime, on the JVM, with synchronous (yay!) network reads and with multithreading: in other words, pure joy!

Fruits

Synchronous network reads are good because until the data for a top-level component is received from the server, the inner components (which depend on that data) sit there waiting. But right after the data is there, the inner components are rendered/initialized and start fetching their own data. So data dependency management here is really simple.

Also, if you render your HTML in the same process your API is running in, there is no need to go through the network and HTTP. Clojure’s Ring protocol is just a function which accepts a request as an argument, so you can call your whole app with a simple map (which is what a request in Ring is) and get a response back. XHR looks like this in ClojureScript:

(defn xhr [params callback]
  (let [req (js/XMLHttpRequest.)]
    (set-params req params)
    (set-callback req callback)))

And like this in Clojure (app is our HTTP API):

(defn xhr [params callback]
  (callback (app (params->http-request params))))

Of course, the real implementations are a bit longer: setting headers, parsing JSON, etc. By the way, there is an interesting story about JSON: right after Black Friday 2016, when our architecture proved to be a success, I was showing somebody how it works and saw in MiniProfiler that serializing JSON takes quite a bit of time. Obviously parsing is not free either.

54 lines of changes later, we now pass unserialized data to server rendering: it takes less time and less memory (minus the JSON string and minus the freshly parsed data). Pleasantly, the median time for rendering HTML went down from 110 to 80 ms. :)

It’s also interesting that we have a cheat (or an optimization) around caching and user-specific content: everything that depends on the current user is rendered only on the client. It’s a compromise: users initially see the “anonymous” version of the site for a few moments, but in exchange we can cache the rendered HTML for everyone. This lets us look like we’re not sweating a bit during sharp increases in traffic. :)

The End

And they lived happily ever after. You can go to modnakasta.ua yourself and click around to see how that stuff works. Also, look at the HTML source the server gives you: it is all rendered on the JVM, with absolutely no JavaScript.


Copyright © 2009, Planet Clojure. No rights reserved.
Planet Clojure is maintained by Baishampayan Ghose.
Clojure and the Clojure logo are Copyright © 2008-2009, Rich Hickey.
Theme by Brajeshwar.