When I first announced that I was producing a course about Re-frame, I
inevitably got lots of questions about why I chose Re-frame and not Om
Next. It's a good question and I'd like to address it, because in the
end, the decision was hard but the choice was kind of made for me.
Re-frame is a front end framework created by Mike Thompson. It takes
care of DOM rendering, application state management, and it has some
building blocks useful for other effects like communicating with the
backend. Om Next, created by David Nolen, is similar, except that it
has a grander vision about client/server communication. Its ambition
is to unify the model of data on the client and server to reduce the
number of queries the client runs, which makes applications faster,
especially on mobile.
Like I said before, the two frameworks are very similar. However, I
had to choose one. Making courses about both would have diluted my
attention. I wound up teaching Re-frame and I am very happy with the
choice. So why did I do it?
People have been asking me to teach Om Next for two years. As soon as
I announced my original Om course Single Page Applications with
ClojureScript and Om, Om Next was announced. People were
already asking whether it would teach Om Next and when the Om Next
version would be available. My answer at that time was simple: it
won't teach Om Next and I won't update it. Om Next is different enough
that it needs its own course. However, Om Next was released as Alpha
at that time. I've been burned before making a course and having the
material change out from under me. My answer to people asking when I
would make an Om Next course was "not before it comes out of Alpha".
That was nearly two years ago and Om Next, when I started my Re-frame
course, was still in Alpha. As I write this, it went into
Beta1 two months ago. I know a lot of companies use it in production
already. But there is so much out there for me to teach that doesn't
have a label of volatility. I cannot afford to make a course that
might not work in six months.
Meanwhile, I was very interested in Om Next and I wanted to learn
it. I watched David Nolen's talks. I went through
the tutorial a few times. I made a small application. I
got on Slack. I read Tony Kay's excellent interactive
tutorial. Even with all of that, and understanding the
individual pieces, I could never say that I fully understood what was
going on. I wrote a small application in it and it felt like threading
a needle. There were so many things I had to get perfect or they
wouldn't line up as neatly as promised. I felt like at any moment I
might violate some implicit assumption about how my data should be
shaped or what the query DSL could do.
I knew I would have to get a handle on this to be able to make it
teachable, and I didn't feel like I could in a reasonable amount of
time and effort. Teachability is a huge part of designing a
framework. It is probably as important as capability. I would have to
invent an onramp to get people started with Om Next. In short, I'd
have to create a mental framework to use the actual framework with. In
the end, this was my biggest beef with it. It has too many degrees of freedom.
Meanwhile, during my struggles with Om Next, I discovered
Re-frame. Re-frame felt aggressive. The README was long and dense and
opinionated, not to mention stylistically eccentric (it has been toned
down since). There were all of these parts to it, and it came from
someone I had never heard of, and plus it was built on Reagent, which
I thought was neat but not as robust as Om. One day I decided to
record a Clojure in One Hour lesson about it. I hadn't
done anything more than read the README when I attempted my first
project. It was fun and I got something working. Plus, I understood it
and could see how to teach it.
Re-frame does not do all that Om Next does. Om Next has a concept of
database entities, it will de-duplicate them and build indexes. It has
a query language that specifies what data each component needs.
Re-frame, on the other hand, embodies my #1 reason for using a framework: it is a
process for turning ideas into code. It's a mental framework as much
as a code framework. You get a handful of useful pieces so you can
divide up the problem and fit them into the application. Those pieces
easily map to concepts you should be familiar with as a functional
programmer. Each part is there for a reason and makes the system easier to understand.
I'm going to write this process up more fully, but here's a small
example of what I'm talking about. Let's say we want to make a button
that increments a count. We also want to display that count.
We need to store the count in application state, and increment it
whenever the :increment-count Event is handled.
What application state do we need to read to build our UI?
Read-only access to application state is done using Subscriptions.
Our count-display needs to read the current count, so we make a
subscription for that and subscribe in the component.
We mount those components into the UI and it works.
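Put together, the counter looks something like this minimal sketch. It assumes re-frame's standard registration API (reg-event-db, reg-sub) and Reagent-style components; names like counter.core are placeholders, not from the course material.

```clojure
(ns counter.core
  (:require [re-frame.core :as rf]))

;; Event handler: increment the count in application state.
(rf/reg-event-db
 :increment-count
 (fn [db _]
   (update db :count (fnil inc 0))))

;; Subscription: read-only access to the current count.
(rf/reg-sub
 :count
 (fn [db _]
   (:count db 0)))

;; Components: a display that subscribes, and a button that dispatches.
(defn count-display []
  [:p "Count: " @(rf/subscribe [:count])])

(defn increment-button []
  [:button {:on-click #(rf/dispatch [:increment-count])}
   "Increment"])
```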
Notice the mental framework: each question leads directly to a piece
of code. Building applications is already complex enough. You want
something to give you a place for the things you need. So here's the
strong statement I'm going to make regarding Re-frame:
Re-frame gives you just the right amount of structure you want when
building a frontend application. The structure it gives you is easy
to learn and easy to build on.
Om Next has a lot of cool capabilities. You should at least watch the
talks and give the Quick Start a shot. It might be just what you
need, or you might like the freedom and potential it gives you. If
that's you, go for it. I think it's a cool system with a great
philosophy. If you're starting a new frontend and backend from
scratch, do check out Untangled, which does provide much more
ready-made structure so you can take advantage of Om Next's capabilities.
However, I think Re-frame is a great choice for most people. It gives
you 90% of what you need plus a place to put the last 10%. It's
powerful, teachable, and it has a large and growing community. I'm
still happy with my choice and I invite you to learn Re-frame with
me. I've already got one course called Building Re-frame
Components which dives into the nitty-gritty of making custom
UI components in Re-frame.
You can always count on human culture to make programming messy. To find out if
a person is a programmer just have them say “encodings” or “timezones” and watch
their reaction.
In Java, and hence Clojure, there are half a dozen different ways to represent
dates and times, which can lead to confusion, needless type coercion, and
inconsistent code. In this post I’ll give you a quick lay of the land, and some
tips to make it all a bit easier to deal with.
The traditional way to represent time, the system that has been handed down to
us by The Elders, is known
as Unix Time. A Unix timestamp is a
number representing the number of seconds that have passed since
the start of time, which
is conveniently defined as midnight, January 1, 1970.
A happy family shortly after the start of time (credits)
Some 822 million seconds later Java 1.0 came out. It was the year of the Spice
Girls, the Macarena, and mutable date classes.
user> (def date (java.util.Date.))
user> (.setTime date 822391200000)
A Date wraps a millisecond-based Unix timestamp, meaning it’s not time zone aware:
user> (java.util.TimeZone/setDefault (java.util.TimeZone/getTimeZone "UTC"))
user> (.toString date)
"Tue Jan 23 10:00:00 UTC 1996"
user> (java.util.TimeZone/setDefault (java.util.TimeZone/getTimeZone "Europe/Berlin"))
user> (.toString date)
"Tue Jan 23 11:00:00 CET 1996"
That’s not all that’s wrong with java.util.Date, so in Java 1.1 they tried
to rectify things by deprecating half of the Date API and instead introducing
an (obviously mutable) GregorianCalendar.
It’s 2002, Brazil has beaten Germany in the FIFA World Cup, and someone decides
the time is ripe for a proper date/time library. In no time at
all JodaTime is born!
It doesn’t take long before Joda becomes the de facto time library on the JVM, and
in 2010 it becomes the basis for JSR-310.
Joda contains two principal classes for representing time,
org.joda.time.Instant, which is essentially a Unix timestamp, and
org.joda.time.DateTime, which explicitly stores year/month/day/etc.
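A quick REPL sketch of the distinction (assuming JodaTime is on the classpath):

```clojure
(import '[org.joda.time Instant DateTime DateTimeZone])

;; Machine time: a point on the timeline, nothing more.
(Instant. 822391200000)

;; Human time: the same instant viewed through a concrete time zone.
(DateTime. 822391200000 (DateTimeZone/forID "Europe/Berlin"))
```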
Having both these classes makes the distinction clear between a “machine time”
and “human time”. For machines things are relatively straightforward: time is an
ever increasing number. The same instant in time corresponds with the same
number, no matter where you are or how you look at it. There are no jumps or
gaps; time is strictly monotonic.
The human view of time is a lot more complex. There are time zones, daylight
saving time, not to mention different calendar systems and historical time
adjustments. For adding timestamps to your logs machine time will work fine, but
when your system communicates dates and times with end users you need a richer
representation.
For years this was the state of things: you used the built-in java.util.Date
despite its shortcomings, or you relied on JodaTime as an external dependency.
That you needed a library to do something as essential as working with dates and
times seemed a bit odd, and so Oracle involved the JodaTime author, Stephen
Colebourne, to work on a new Date/Time library for the JDK, affectionately known
as JSR-310.
JSR-310 introduces a new java.time package, which includes java.time.Instant
and java.time.ZonedDateTime. While clearly inspired by JodaTime, there are
also some differences. The author
explains in this article
that there are some shortcomings in JodaTime that they sought to address.
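For comparison, a small sketch of the java.time equivalents:

```clojure
(import '[java.time Instant ZonedDateTime ZoneId])

;; Machine time.
(def instant (Instant/ofEpochMilli 822391200000))

;; Human time: the same instant in a specific zone.
(ZonedDateTime/ofInstant instant (ZoneId/of "Europe/Berlin"))
```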
Development was completed in 2014, and java.time shipped with Java 8. That’s
basically last week in Java years, so it will be a while before support for it
becomes widespread.
On the clj-time issue tracker there’s
an ongoing discussion about
what this will mean for clj-time. While there’s a lot of interest for clj-time
to adopt JSR-310, it seems the current consensus is that if they do this will be
with a new artifact id (a new library name). The differences between JodaTime
and JSR-310 are big enough to make breaking API changes inevitable. By using a
different name it will be possible to switch to the new version, while libraries
you depend on can keep using the old version.
Meanwhile someone has gone ahead and created an idiomatic Clojure library for
the java.time.* API, aptly
named clojure.java-time, so perhaps
this will be what clj-time 2.0 could have been. If you are starting a new
project today you might skip clj-time altogether, and instead use
clojure.java-time for your date and time needs, assuming you’re at least on Java 8.
If you do go that route you should do your homework though. java.time contains
a handful of different date/timestamp classes, and you should know which is which.
The official java.time documentation has several links for further reading.
It’s also worth pointing out that since Clojure 1.8 there’s a built-in namespace,
clojure.instant, which contains a few utility functions for parsing dates.
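For example, clojure.instant (which ships with Clojure itself) can parse the timestamp format used by #inst literals:

```clojure
(require '[clojure.instant :as inst])

;; Parses an RFC-3339 / #inst style timestamp into a java.util.Date.
(inst/read-instant-date "1996-01-23T10:00:00.000-00:00")
```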
You thought we were done? Guess again! Besides java.util.Date, Java has always
included yet another timestamp class: java.sql.Timestamp. The reason is that
timestamps in the SQL standard have a nanosecond precision, while
java.util.Date only stores milliseconds. So a java.sql.Timestamp inherits
from java.util.Date, but adds an extra nanosecond field.
If you’re using JDBC to read or store timestamps, then you’ll run into this class.
Making things bearable
Having all these different types for representing essentially the same thing is
a recipe for massive frustration, but with a few small tweaks we can make it all
a lot less painful.
The trick is to pick one of the above, and use it across the board. For now
org.joda.time.Instant is still a likely candidate, it has widespread library
support, and you can combine it with clj-time for writing idiomatic Clojure
code. If instead you want to be future proof you can opt for
java.time.Instant, in combination with clojure.java-time. Several people
have reported they’re already getting good use out of java.time. The snippets
below assume you’re sticking with JodaTime.
Whenever time values enter your program, convert them to JodaTime. Maybe you’re
reading JSON, EDN, Transit, you’re pulling values off an API or out of a
database. Don’t let them propagate through your code as java.util.Dates (or
worse, longs representing Unix timestamps), but convert them to
org.joda.time.DateTime as soon as possible.
The same goes for outputting values. You don’t want to see an exception any time
you encode some JSON with a timestamp, or save a date to the database.
Many libraries can be configured to do this automatically, so you only have to
worry about it once.
JDBC is Java’s standard for communicating with relational databases, and the
clojure.java.jdbc library provides a convenient Clojure wrapper. By default
JDBC expects, and outputs, java.sql.Timestamp. But by loading the
clj-time.jdbc namespace it switches to JodaTime instead. Nice, that was easy!
;; load/save DB times as joda DateTime
(require 'clj-time.jdbc)
The go-to library for reading and writing JSON from Clojure is called Cheshire.
JSON doesn’t have a standard way of encoding time values. If you do need to
encode timestamps the common thing to do is to either use a string containing an
ISO-8601 formatted timestamp, or to use a number representing a UNIX timestamp.
This snippet configures Cheshire to output org.joda.time.DateTime as a string.
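A sketch of such a configuration, using Cheshire's cheshire.generate/add-encoder hook:

```clojure
(require '[cheshire.core :as json]
         '[cheshire.generate :as gen])

;; Teach Cheshire to write DateTime values as ISO-8601 strings.
(gen/add-encoder org.joda.time.DateTime
  (fn [dt json-generator]
    (.writeString json-generator (.toString dt))))

(json/generate-string {:created-at (org.joda.time.DateTime. 822391200000)})
```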
In EDN and Clojure source code you can use an #inst tagged literal to
represent timestamps. These by default will be read as java.util.Date.
Having a literal syntax for timestamps can be very handy, one example would be
to write out test data containing timestamps.
In my apps I tend to configure Clojure’s reader and printer to represent
org.joda.time.DateTime as #joda/inst "...timestamp...", providing me the
same convenience. When doing REPL-driven development (directly or inside a
source buffer) this is particularly useful, since results that are printed out
can be read again, they are proper time literals.
Clojure’s printer can be extended through the print-method and print-dup
multimethods:
;; Configure the printer
(defmethod print-method org.joda.time.DateTime
  [dt ^java.io.Writer out]
  (.write out (str "#joda/inst \"" (.toString dt) "\"")))

(defmethod print-dup org.joda.time.DateTime
  [dt ^java.io.Writer out]
  (.write out (str "#joda/inst \"" (.toString dt) "\"")))

;; Utility function for the reader
(defn ->joda-inst [time-str]
  (org.joda.time.DateTime. time-str))
The reader can be taught about new tagged literals through a
data_readers.clj file placed on the classpath. I found it works best to put it
under resources, that way Leiningen is smart enough to merge it with other
data_readers.clj files provided by libraries when packing up an uberjar.
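A sketch of such a file; the namespace my-app.time is a placeholder for wherever you define the reader function:

```clojure
;; resources/data_readers.clj
{joda/inst my-app.time/->joda-inst}
```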
This article has focused on JVM based Clojure, but what about ClojureScript?
Most people are content using js/Date. There is a separate goog.date API that ships
with the Google Closure library, which in turn is used
by cljs-time, so you might run into it.
For a good time
Java’s time/date landscape can be difficult to navigate, but a map of the
territory can help you along the way. And with the right tools in your belt it
can be a pleasant journey after all.
Do you have your own snippets for making libraries or subsystems play well with either JodaTime or java.time? Send them in and I will add them to the article.
A special thanks to Ben Lovell, Tim Gilbert, and Sean Corfield for pointing me to clojure.java-time.
The two of us - Malcolm Sparks and Jon Pither - have just spent a thoroughly enjoyable week in Lisbon providing full stack Clojure training. It was a pleasure to see the start of a promising Clojure scene in Portugal's capital, one that we hope will flourish in the years to come.
We eased into the course of eight attendees with a Clojure Kata - pulling back some interesting JSON from a public API and transforming it to answer some basic queries. This involved using the REPL and Clojure's sequence library to great effect.
For the duration of the course, we built upon our Edge project, which is a Clojure bootstrapped project containing the libraries we favour and a working web application example linked up to Datomic.
We had a range of different operating systems and IDEs to contend with. There were no users of Arch Linux on this course - which is what we use at JUXT - instead, we had some Macs and a Windows environment in play. We were able to get moving in Windows by installing Git Bash and from there boot-clj, which is the build system we use for Edge.
Once everyone had their IDEs and REPLs in full swing and the Kata was completed, we then started to move through the full-stack syllabus.
During the two days, we went through the following, using a mix of exercise-based learning and discovery via live coding.
The participants were excellent and swiftly got up to speed with the course material. We regularly solicited and received feedback throughout the course to balance the mix of live coding and exercises to ensure that we maintained traction as we moved through the advanced topics.
Whilst it was an intensive two days, we were able to take the participants through the syllabus so that hopefully, they should be able to start using the various technologies for their next Clojure project.
Aside from training, we both had a lovely time in Lisbon. We were told by various people (a taxi driver, hotel manager, and course attendees) to visit the restaurant Ramiro for exemplary Portuguese sea food. We can report that the shellfish was mind-blowingly good.
We wandered into the Old Town for a cocktail, and visited the Timeout market to try some more Portuguese national dishes.
Aside from the food - of which we hopefully did justice - Lisbon is a beautiful city. We spent hours walking around and enjoying the vibe, and we would love to return. If you're thinking of a trip to Lisbon - go there.
Any programming language that you learn will hopefully teach you new techniques and ways of solving problems.
In the case of Clojure, it has taught me how great immutability is, the difference between simple and easy, the value of values, the power of the REPL and macros, the trade-off of dynamic typing, and the beauty behind all those parentheses.
Also, being so different, it has allowed me to reconsider a lot of common wisdom, resulting in heretical thoughts about global state, dependency injection, encapsulation, testing, types and such.
However, the single most surprising lesson has been how close-minded I had been at the mere idea of it, how biased I was, how I would immediately dismiss it and mock it, and how now, I will sing its praises.
If you had come and told me five years ago about a dynamically typed (what?? I am a proper engineer! Static typing is the only way to go!) Lisp (parenthesis make your code unreadable!) where everything is immutable by default (surely very slow!!) and where you just work with maps of maps (unmaintainable and error prone!), I would have laughed at you.
How blind was I? How many other great things have I missed because of my cognitive biases? What about you?
Maybe you will not like everything that you see, but enjoy the journey, be open minded, and don't be afraid to admit that maybe some of the stuff on the other side of the fence is cool, interesting or even better.
The code above returns an extremely long data structure. For example, ((fn [a b] (+ a b)) 5 7) returns a map with almost 3,000 lines.
I asked on the Clojurian’s #clojurescript Slack channel and got the following explanation:
chalcidfly [9:11 AM] Why is the AST for `((fn [a b] (+ a b)) 5 7)` almost 3000 lines long?
mccraigmccraig [9:17 AM] it's got the environment expanded at every node... most of the bulk is under `:env` keys
alexmiller [9:17 AM] because AST is very verbose
chrisbroome [9:17 AM] looks like there's a lot of `:js-globals` that are repeated in various nodes
alexmiller [9:17 AM] most of those are actually the same identical object referred to from multiple locations
I will compare four different algorithms to look up a string in a list of strings within a maximum edit distance according to the Damerau-Levenshtein string metric.
For spelling correction an additional word frequency associated with every term can be used to sort and filter the results (suggestions) further.
One could also implement a Weighted Edit distance giving a higher priority to pairs which are close to each other on the keyboard layout or which sound similar (e.g. Soundex or other phonetic algorithms which identify different spellings of the same sound). While there is a SymSpell implementation with weighted Damerau-Levenshtein edit distance / keyboard-distance, the weighted edit distance is beyond the focus of this post. It can be added as a post-processing step with only small impact on the performance to most approximate string searching algorithms by just re-prioritizing/sorting the preliminary results filtered by the Damerau-Levenshtein edit distance according to your preferences.
All algorithms strive for the same goals to achieve short lookup times: reducing the number of lookups and comparisons (between words and/or artificially generated candidates), possibly reducing further the number of full edit distance calculations and finally reducing the computational complexity of the edit distance calculation itself, while not compromising its accuracy.
The four different algorithms I want to compare and benchmark are:
All four algorithms use derivatives of the Levenshtein edit distance. There are three relevant variants:
Levenshtein distance: adjacent transposition (AC->CA) counted as 2 edits. The triangle inequality does hold.
Restricted Damerau-Levenshtein distance (optimal string alignment algorithm): adjacent transposition counted as 1 edit, but substrings can’t be edited more than once: ed(“CA”, “ABC”) = 3. The triangle inequality does not hold.
True Damerau-Levenshtein distance: adjacent transposition counted as 1 edit, substrings can be edited more than once: ed(“CA”, “ABC”) = 2. The triangle inequality does hold.
Norvig’s algorithm is using the true Damerau-Levenshtein edit distance. It could be modified to use the Levenshtein distance.
SymSpell is using the Restricted Damerau-Levenshtein edit distance. It could be modified to use the Levenshtein distance or the True Damerau-Levenshtein distance.
LinSpell is using the Restricted Damerau-Levenshtein edit distance. It could be modified to use the Levenshtein distance or the True Damerau-Levenshtein distance.
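To make the variants concrete, here is a sketch of the restricted Damerau-Levenshtein (optimal string alignment) distance in Clojure. This is an illustrative translation, not the optimized implementation used in the benchmark.

```clojure
;; Restricted Damerau-Levenshtein (optimal string alignment), row by row.
(defn osa-distance [a b]
  (let [la (count a) lb (count b)]
    (loop [i 1
           prev2 nil                          ;; row i-2
           prev  (vec (range (inc lb)))]      ;; row i-1
      (if (> i la)
        (peek prev)
        (let [row (reduce
                   (fn [row j]
                     (let [cost (if (= (nth a (dec i)) (nth b (dec j))) 0 1)
                           d (min (inc (row (dec j)))        ;; insertion
                                  (inc (prev j))             ;; deletion
                                  (+ (prev (dec j)) cost))   ;; substitution
                           d (if (and (> i 1) (> j 1)
                                      (= (nth a (dec i)) (nth b (- j 2)))
                                      (= (nth a (- i 2)) (nth b (dec j))))
                               (min d (inc (prev2 (- j 2)))) ;; transposition
                               d)]
                       (conj row d)))
                   [i]
                   (range 1 (inc lb)))]
          (recur (inc i) prev row))))))

(osa-distance "CA" "ABC") ;; => 3
(osa-distance "CA" "AC")  ;; => 1
```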
Verbosity of search results
In our tests we distinguish three levels of verbosity of search results, which will result in different lookup times:
Level 0: Return only the result with the smallest edit distance within the maximum edit distance. If multiple results with the same edit distance exist, then return the result with the highest word frequency. This allows for early termination of the search, e.g. if a result with edit distance=0 was found.
Level 1: Return all results with the smallest edit distance within the maximum edit distance. If multiple results with the same edit distance exist, then return them all sorted by word frequency. This allows for early termination of the search, e.g. if a result with edit distance=0 was found.
Level 2: Return all results within the maximum edit distance, sorted by word frequency. This allows for no early termination of the search.
Norvig’s Spelling Corrector
The idea is that if we artificially generate all terms within maximum edit distance from the misspelled term, then the correct term must be among them. We have to look all of them up in the dictionary until we have a match. So all possible combinations of the 4 spelling error types (insert, delete, replace and adjacent switch) are generated. This is quite expensive, with e.g. 114,324 candidate terms generated for a word of length 9 and edit distance 2.
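The single-edit candidate generation can be sketched in Clojure like this (a translation of Norvig's edits1, assuming a lowercase a-z alphabet):

```clojure
;; All strings one edit away from `word`, assuming a lowercase a-z alphabet.
(defn edits1 [word]
  (let [alphabet "abcdefghijklmnopqrstuvwxyz"
        splits   (map #(vector (subs word 0 %) (subs word %))
                      (range (inc (count word))))]
    (set
     (concat
      ;; deletes
      (for [[l r] splits :when (seq r)] (str l (subs r 1)))
      ;; adjacent transposes
      (for [[l r] splits :when (> (count r) 1)]
        (str l (nth r 1) (nth r 0) (subs r 2)))
      ;; replaces
      (for [[l r] splits :when (seq r), c alphabet] (str l c (subs r 1)))
      ;; inserts
      (for [[l r] splits, c alphabet] (str l c r))))))
```

Edit distance 2 candidates are then (mapcat edits1 (edits1 word)), which is why the candidate count explodes so quickly.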
The BK-tree utilizes the triangle inequality, a property of the Levenshtein edit distance: Levenshtein(A,B) + Levenshtein(A,C) ≥ Levenshtein(B,C) and Levenshtein(A,B) − Levenshtein(A,C) ≤ Levenshtein(B,C).
During indexing, the Levenshtein(root node, child node) distances are precalculated.
During lookup we calculate Levenshtein(input, root node). The triangle inequality is used as a filter, to only recursively follow those child nodes where the precalculated Levenshtein(root node, child node) is in the range [Levenshtein(input, root node) − dmax, Levenshtein(input, root node) + dmax].
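A minimal BK-tree sketch in Clojure; `dist` is any edit-distance function you supply, and the names here are illustrative, not from a library:

```clojure
;; Index: each child edge is labeled with the precalculated distance
;; from the parent word.
(defn bk-insert [tree dist word]
  (if (nil? tree)
    {:word word :children {}}
    (let [d (dist word (:word tree))]
      (update-in tree [:children d] #(bk-insert % dist word)))))

;; Lookup: the triangle inequality prunes whole subtrees.
(defn bk-search [tree dist query dmax]
  (when tree
    (let [d (dist query (:word tree))]
      (concat
       (when (<= d dmax) [(:word tree)])
       (mapcat (fn [[child-d child]]
                 (when (<= (- d dmax) child-d (+ d dmax))
                   (bk-search child dist query dmax)))
               (:children tree))))))
```

Indexing is then (reduce #(bk-insert %1 dist %2) nil words), and lookup only recurses into children whose edge label falls inside the allowed range.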
There are several interesting posts discussing the BK-tree in detail:
SymSpell is an algorithm to find all strings within a maximum edit distance from a huge list of strings in very short time. It can be used for spelling correction and fuzzy string search.
SymSpell derives its speed from the Symmetric Delete spelling correction algorithm and keeps its memory requirement in check by prefix indexing.
The Symmetric Delete spelling correction algorithm reduces the complexity of edit candidate generation and dictionary lookup for a given Damerau-Levenshtein distance. It is six orders of magnitude faster (than the standard approach with deletes + transposes + replaces + inserts) and language independent.
Opposite to other algorithms only deletes are required, no transposes + replaces + inserts. Transposes + replaces + inserts of the input term are transformed into deletes of the dictionary term. Replaces and inserts are expensive and language dependent: e.g. Chinese has 70,000 Unicode Han characters!
The speed comes from pre-calculation. An average 5 letter word has about 3 million possible spelling errors within a maximum edit distance of 3, but with SymSpell you need to pre-calculate & store only 25 deletes to cover them all. Magic!
The idea behind prefix indexing is that the discriminatory power of additional chars is decreasing with word length. Thus by restricting the delete candidate generation to the prefix, we can save space, without sacrificing filter efficiency too much. In the benchmark three different prefix lengths lp=5, lp=6 and lp=7 were used. They reflect different compromises between search speed and index size. Longer prefix length means higher search speed at the cost of higher index size.
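The delete-only candidate generation at the heart of Symmetric Delete can be sketched as:

```clojure
;; All strings obtainable from `word` by up to `dmax` character deletions.
(defn deletes [word dmax]
  (loop [frontier #{word} acc #{word} d 0]
    (if (= d dmax)
      acc
      (let [next-words (set (for [w frontier
                                  i (range (count w))]
                              (str (subs w 0 i) (subs w (inc i)))))]
        (recur next-words (into acc next-words) (inc d))))))

(count (deletes "world" 3)) ;; => 26 (the word itself plus 25 deletes)
```

The same function is applied both to the dictionary terms at indexing time and to the input term at lookup time, so inserts, replaces and transposes on the input side all collide with deletes on the dictionary side.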
This is basically a linear scan through the word list, calculating the edit distance for every single word (with a few tweaks). It was intended as the baseline measure for the benchmark. Surprisingly enough, and despite its O(n) characteristics, it turned out to outperform both the BK-tree and Norvig’s algorithm.
There are several reasons for that:
Do not underestimate the constants in the Big O notation. Visiting only 20% of the nodes in a BK-tree is more expensive than a linear scan where the cost per word is only 10% as high.
As Damerau-Levenshtein calculation is very expensive it is not the number of processed words which matters, but the number of words where we need the full Damerau-Levenshtein calculation. We can speedup the search if we can prefilter words from the calculation or terminate the calculation once a certain edit distance is reached.
If we restrict the search to best match we can utilize options for early termination of the search.
Words with no spelling errors are a frequent case. Then the lookup can be done with a hash table or trie in O(1)! If we restrict the search to best match, we can instantly terminate the search.
We do not need to calculate the edit distance if Abs(word.Length-input.Length) > Maximum edit distance (or best edit distance found so far if we restrict the search to best match)
If we restrict the search to best match, we can terminate the edit distance calculation once the best edit distance found so far is reached.
If we restrict the search to best match, and we have found already one match with edit distance=1, then we do not need to calculate the edit distance if the count of the term in question is smaller than the count of the match already found.
In order to benchmark the four approximate string searching algorithms we perform 2*3*3+2*2 tests for each algorithm:
1000 words with random spelling errors are searched for two dictionary sizes (29,159 words, 500,000 words), three maximum edit distances (2, 3, 4) and three verbosity types (0, 1, 2). For each test the total search time is measured.
For two dictionary sizes (29,159 words, 500,000 words) the precalculation time necessary to create the dictionary and auxiliary data structures and their memory consumption are measured.
For each of the four algorithms the found suggestions have been compared to ensure completeness of results. Result differences caused by the different Levenshtein variations used by the algorithms have been taken into account.
Test focus & limitations
The benchmark has been limited to maximum edit distance up to 4, because for natural language search an edit distance of 4 is already on the border of what’s reasonable, as the number of false positives grows exponentially with the maximum edit distance, decreasing precision, while improving recall only slightly.
The benchmark has been limited to a dictionary size up to 500,000 words because even the 20-volume Oxford English Dictionary contains only 171,476 words in current use, and 47,156 obsolete words.
Fuzzy search beyond natural language search with longer strings or bit vectors (images, voice, audio, video, DNA sequences, fingerprints, …) may require higher edit distances and larger dictionary sizes and lead to different results.
Fuzzy search for multiple words (product/event databases) or larger text fragments (plagiarism check) will likely require compounding/decompounding and string metrics beyond edit distances like Strike a match to account for missing and switched words.
Both application fields are beyond the focus of this post.
For the query we use the first 1000 unique words from Norvig's text corpus big.txt.
For each word a random number of edits in the range 0..Min(word.length/2, 4) is chosen. For each edit a random type of edit (delete, insert random char, replace with random char, switch adjacent chars) is applied at a random position within the word. After the edits, duplicates and words with length < 2 are discarded.
These are the 29,159 unique words from Norvig’s text corpus big.txt, together with their frequency in that corpus.
These are the most frequent 500,000 words from the English One Million list from Google Books Ngram data, together with their frequency.
In the current benchmark we compared words with a random edit distance (0...maximum edit distance). This gives a good understanding of the average lookup time.
In a previous benchmark we were comparing words with a fixed edit distance (= maximum edit distance). This gives a good understanding of the maximum lookup time. Sometimes averages can be misleading as a measure of user experience. For edit distance=3 SymSpell was 1 million times faster than Norvig's algorithm.
The benchmark could not confirm any rationale behind the widespread use and recommendation of BK-trees and Norvig’s algorithm other than historical reasons and habit that is perpetually passed on by text books and forums. With SymSpell and LinSpell there are two alternative algorithms available which always provide better results, at least within the scope of this test.
Use SymSpell, if speed is important. It is 2–3 orders of magnitude faster than BK-tree and 5–6 orders of magnitude faster than Norvig’s algorithm. SymSpell lookup time grows only moderately with dictionary size and maximum edit distance. It outperforms the other three algorithms in all cases, often by several orders of magnitude. This comes at the cost of higher memory consumption and precalculation times. Precalculation times are incurred only once during program/server start, or even only during program development if the precalculated data is serialized/deserialized.
Use LinSpell, if memory usage is important. It is 10× faster than BK-tree for the same memory usage. LinSpell lookup time grows linearly with the size of the dictionary, but is nearly independent of the maximum edit distance. It is almost always better than BK-tree and Norvig’s algorithm. No additional memory consumption or precalculation times are incurred.
There is no obvious benefit to using a BK-tree. In all cases it is outperformed in speed by SymSpell, and in both speed and memory by LinSpell. While a BK-tree is faster than Norvig's algorithm for edit distance>2, it is much slower than LinSpell or SymSpell. BK-tree performance depends heavily on the size of the dictionary but grows only moderately with the maximum edit distance.
There is no obvious benefit to Norvig’s algorithm. In all cases it is outperformed in speed by SymSpell and LinSpell and matched in memory usage by LinSpell. The lookup time of Norvig’s algorithm is independent of the dictionary size but grows exponentially with the maximum edit distance.
Always use verbose=2 carefully (it enumerates all suggestions within the maximum edit distance, not only those with the smallest edit distance). It is much slower, as it prevents early termination of the search!
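For context on why SymSpell's lookups are so fast: its core trick is to precompute, for every dictionary word, all variants reachable by character deletions up to the maximum edit distance, and index them in a hash map, so a lookup only has to generate deletes of the input term and probe the map. A minimal sketch of the delete generation, in Clojure and purely illustrative (SymSpell's reference implementation is in C#):

```clojure
(defn deletes
  "All strings reachable from s by deleting up to n characters."
  [s n]
  (if (zero? n)
    #{s}
    (into #{s}
          (mapcat #(deletes % (dec n)))
          (for [i (range (count s))]
            (str (subs s 0 i) (subs s (inc i)))))))

(deletes "word" 1)
;; => #{"word" "ord" "wrd" "wod" "wor"}
```

Indexing these deletes for the whole dictionary is what costs the extra memory and precalculation time mentioned above; the payoff is that lookup never has to enumerate insertions or substitutions at all.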
Thanks so much for all of the birthday wishes. I have the best readers.
I need to apologize for a bug a customer recently discovered on my site: the download links for individual videos and reference sheets sometimes didn’t work. I’ve fixed the bug. If you tried to download them before but couldn’t, please try again.
Also, I would like to let all of my customers who bought a long time ago know that if they need to reactivate their purchase, just send me the email receipts and I’ll take care of it. You paid for the courses and I want to make sure you can get them.
David Turner is a legend in the academic Functional Programming world. He designed the Miranda language, which later greatly influenced Haskell. This talk is great. It talks about why he didn’t choose Lisp to teach functional programming. I never knew that Lisp wasn’t based on the Lambda Calculus. You’ll have to watch the video to learn more.
Kris Jenkins discusses how types can help us design better abstractions. His points remind me a lot of the topic of my talk Building Composable Abstractions. It also reminds me of how much I learned as a Haskell programmer.
Stuart Halloway seems to be on a roll. He’s teaching about how best to use the Repl to develop in Clojure. His main point: we should use the Repl more. I have to agree with him. Use of the Repl distinguishes experienced Clojure programmers.
I often forget to link to these interviews I do at PurelyFunctional.tv. Euroclojure just passed and if you missed it, check out interviews with some of the speakers. Also, ClojuTre has already announced their speaker lineup, and there are plenty of interviews already available. I promise I’ll link to more of these in the future.
After another year of development, I'm proud to announce a new major release of
my Clojure RPC library, slacker. This
release, 0.15.0, includes an update to the wire protocol, some performance
improvements, and bug fixes.
The request extension
In this release, the slacker client uses the v6 protocol by default. The server can still
accept the v5 protocol for backward compatibility.
The v6 protocol adds an extension field to the request and response
packets. Using this field, you can exchange metadata in an RPC call. A
typical use case is distributed tracing, where the trace ID is passed as an extension.
At the API level, we kept the non-invasive principle: extensions are injected
as var bindings.
Previously, you could only configure one thread pool for all exposed namespaces.
This adds the risk that an issue in one backend may exhaust all your server threads,
and your server stops responding to requests for any namespace.
To isolate execution between namespaces, slacker 0.15 adds support for finer-grained
thread pool configuration.
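To sketch what this isolation buys you (this is purely illustrative and not slacker's actual configuration API; the namespace names and pool sizes are made up), the idea is to give each exposed namespace its own bounded executor, so a slow backend behind one namespace cannot starve the others:

```clojure
(import '(java.util.concurrent Executors))

;; Hypothetical sketch: one bounded thread pool per exposed namespace.
(def thread-pools
  {"my.api.orders" (Executors/newFixedThreadPool 20)
   "my.api.search" (Executors/newFixedThreadPool 5)})

(defn dispatch
  "Run an RPC task on the pool belonging to its namespace."
  [namespace-name ^Runnable task]
  (.submit (thread-pools namespace-name) task))
```

If `my.api.search` hangs, only its 5 threads are tied up; `my.api.orders` keeps serving requests.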
I’ve decided to translate an article I wrote
for the modnaKasta technical blog, since it should be
fairly interesting to a wider audience than just Russian speakers.
modnakasta.ua is a single-page application right now, which uses React
internally and is written in ClojureScript. But we care about Google and about user
experience, which means we certainly care about server-side rendering.
Since the whole front end is in React, which promises to make the app renderable
without access to the DOM (in Node.js in particular), the first idea was to
render everything using Node.js. Doesn’t feel that fuzzy, does it? Especially
given that there is a shiny new super-fast JS engine in Java 8: Nashorn. And
first experiments showed that it is indeed very fast, faster than Node.js.
“That’s great,” I thought, and we started building the app (not SSR, the app
itself). When the app became at least somewhat usable, the time came for a
reality check. Loading our ~2–3k LOC application into Nashorn proved difficult:
8 GB of heap was barely enough to load the JS file, and after 30 seconds of hard work
it spat out a “this expression in your JS is invalid” error. Fixing a few of
those errors (mostly by commenting out code) showed that there was no way we could
develop in an environment like that.
So, Node.js after all. The architecture was simple: we put the compiled JS file inside
the API server uberjar, write it out somewhere on start, and run Node.js against
it. After that we communicate with Node over HTTP: user requests are proxied there,
and rendered HTML is proxied back.
Because Node.js is single-threaded, it’s totally wrong to have a single process,
so our API server became a process pool manager and a load balancer.
Also, we request data when a component is mounted. That is fine in
the browser: you wait a bit, the response comes back, and the browser
renders the data. But on the server, the user’s request is already gone: the first,
“empty” version of the component was already sent to the user, because the rendering
process does not wait for the data to arrive.
This was fixed by making it a two-pass renderer: the first render kicks off
the data-gathering process, and when all XHR activity stops, the app is rendered once
more with full content. That takes around 300 ms on a fairly simple page.
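The two-pass approach can be sketched like this (the function names are hypothetical, not our actual code):

```clojure
;; Hypothetical sketch of the two-pass renderer.
(defn render-page [app]
  (render-to-string app)     ; pass 1: mounts components, kicks off data fetches
  (wait-until-xhr-idle!)     ; block until all outstanding requests settle
  (render-to-string app))    ; pass 2: now the data is in place, emit full HTML
```

The first render's output is thrown away; it exists only to trigger the data fetches that the second, real render depends on.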
I can’t forget to mention that Node.js also leaked memory. Profilers did not help,
so I went the lazy uwsgi way: the Node process restarts every 1000 requests.
But what won’t you sacrifice for the sake of results? So we got this system in
place (though not in production yet), and I went to Philly for Clojure/Conj 2015.
Oddly, I had not planned on attending Allen Rohner’s “Optimizing ClojureScript
Apps For Speed”, which taught me that talk descriptions suck. There was nobody in the
lobby, though, so I got bored and went in.
The core idea is the following: Clojure 1.7 (released five months before that, in June 2015)
gained the ability to have a single .cljc file which can be compiled by both
Clojure (which usually lives in .clj files) and ClojureScript (.cljs
files). Let’s say this is a React component which renders an image:
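The original snippet is not preserved here; a minimal reconstruction in the same spirit, using Rum, would look something like this (treat it as illustrative, not the article's exact component):

```clojure
(ns app.components
  (:require [rum.core :as rum]))

;; One .cljc source: compiles under both Clojure and ClojureScript.
(rum/defc image [src alt]
  [:img {:src src :alt alt}])

;; On the JVM this can be turned into plain HTML, e.g.:
;; (rum/render-static-markup (image "/logo.png" "Logo"))
```

The same `image` component renders in the browser through React, and on the JVM through Rum's server-side HTML emitter, with no JavaScript runtime involved.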
I wrote the first version of this rendering for Rum back in January 2016, and
Nikita (the author of Rum) and other contributors developed it into a properly working
thing. We got server-side rendering, but without the React runtime, on the JVM,
with synchronous (yay!) network reads, and with multithreading: in other words, a pure win.
Synchronous network reads are good because until the data for a top-level component
has been received, the inner components (which depend on that data)
sit there waiting. But right after the data arrives, the inner components are
rendered/initialized and start fetching their own data. So the data dependency
management here is really simple.
Also, if you render your HTML in the same process your API runs in, there is
no need to go through the network and HTTP. Clojure’s Ring protocol is just a
function which accepts a request as an argument, so you can call your whole app
with a simple map (which is what a request in Ring is) and get a response back. An XHR
looks like this in ClojureScript:
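The original snippet is not preserved here; a hedged sketch of the idea follows (`app.server/handler` stands in for whatever Ring handler the app actually exposes):

```clojure
;; In ClojureScript, issue a real XHR; on the JVM, call the Ring handler
;; directly with a request map, skipping the network and HTTP stack.
(defn api-call [path callback]
  #?(:cljs (let [xhr (js/XMLHttpRequest.)]
             (.open xhr "GET" path)
             (set! (.-onload xhr) (fn [_] (callback (.-responseText xhr))))
             (.send xhr))
     :clj  (callback (:body (app.server/handler {:request-method :get
                                                 :uri path})))))
```

On the `:clj` branch the "request" is just a plain map and the "response" is the handler's return value, which is exactly what makes the next paragraph's optimization possible.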
Of course, real implementations are a bit longer: setting headers, parsing JSON,
etc. By the way, there is an interesting story about JSON: right after Black
Friday 2016, when our architecture had proved to be a success, I was showing
somebody how it works and saw in MiniProfiler that serializing JSON takes
quite a bit of time. Obviously, parsing is not free either.
54 lines of changes later, we now pass unserialized data to server rendering: it
takes less time and less memory (minus the JSON string and minus the newly parsed
data). Pleasantly, the median time for rendering HTML went down from 110 ms to 80 ms.
It is also interesting that we have a cheat (or an optimization) around caching and
user-specific content: everything that depends on the current user is rendered
only on the client. It’s a compromise: users initially see the “anonymous” version of the site
for a few moments, but in exchange we can cache the rendered HTML for
everyone. This lets us look like we’re not sweating a bit during sharp
traffic spikes. :)
And they lived happily ever after. You can go to modnakasta.ua yourself and click around to
see how this stuff works. Also, look at the source HTML the server gives you; this all