Martian

I've worked quite a lot with APIs and HTTP in particular, and I've come to some conclusions:

  • I feel like I keep writing the same code on every project
  • It's still surprisingly hard to get right
  • I like defining what
  • I don't want to have to care about how

When creating an API there will normally be two parts to consider: the interface and the application layer. The interface consists of the operations, parameters and data that gets returned, and is what makes your API unique and useful. The application layer is how two processes talk to each other, HTTP for example, which has its own nomenclature, idioms and behaviours which are abstract and generic. Layers in networking - physical, transport etc - should be opaque to each other. Your interface in turn should be opaque to HTTP, but too often we fall at the final hurdle and complect the definition of our interface with implementation details like verbs and distinctions between minutiae like route parameters and query parameters. It's not something you should necessarily care about - as an application developer you want that sort of thing to just work, so you can get on with using data and operations to do interesting and valuable things.

Fortunately, if you want to provide an API libraries like pedestal-api, yada and compojure-api can take care of HTTP and let you simply describe your interface using data structures. With interceptors or middleware you can easily add cross-cutting aspects like metrics, logging and security. You even get a flashy Swagger UI for free to show your friends how awesome your API is. Sweet!

(defhandler create-pet
  {:summary     "Create a pet"
   :parameters  {:body-params {:name s/Str
                               :type s/Str
                               :age s/Int}}
   :responses   {201 {:body {:id s/Int}}}}
  (fn [request] ... ))

But what about consuming APIs as a client? Client HTTP libraries like httpkit, clj-http and cljs-http let you make HTTP calls, but you still have to know which parameters go where in the URL, query string, headers or body, how to serialise and deserialise the body, what method to use for each request and so on; the concise description of your API operation has been lost, all complected with HTTP. Cross-cutting requirements are also hard to implement in a uniform manner - you'll have to pass your metrics registry or credentials configuration around to wrap any code that might need to make an HTTP call.
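To make the contrast concrete, here is roughly what the create-pet call looks like when written directly against clj-http - a sketch that assumes the /pets route from the Swagger example below, with the method, serialisation and parameter placement all spelled out by hand:

(require '[clj-http.client :as http])

;; the route, the verb, where each parameter goes and how the body is
;; serialised all live in your code, none of it part of your interface
(http/post "https://pedestal-api.herokuapp.com/pets"
           {:content-type :json
            :as :json
            :form-params {:name "Doggy McDogFace" :type "Dog" :age 3}})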

At this point you reach an interesting dichotomy - how come the Swagger UI makes using the API easy for a human, but all your code calling it is messy? Aren't machines meant to be better at talking to other machines than people?

Martian uses descriptions of APIs to hide the HTTP application layer and leave behind just the operations and parameters of the interface. You're back to calling operations with parameters and getting return values, just like normal code. Here's a minimal example of using Martian against an API described by Swagger:

(require '[martian.core :as martian]
         '[martian.clj-http :as martian-http])

(let [m (martian-http/bootstrap-swagger "https://pedestal-api.herokuapp.com/swagger.json")]
  (martian/response-for m :create-pet {:name "Doggy McDogFace" :type "Dog" :age 3})
  ;; => {:status 201 :body {:id 123}}

  (martian/response-for m :get-pet {:id 123}))
  ;; => {:status 200 :body {:name "Doggy McDogFace" :type "Dog" :age 3}}

Nice and simple, looks like the interface we described on the server side, and bootstrapping at runtime even allows the underlying HTTP implementation to be refactored without your code needing to change. Martian maps the parameters to the right place and chooses the most efficient serialisation the API supports without any of it leaking into your code. This is the separation of interface and application layer that we've been aiming for.

Martian offers a few other features to further speed the progress of writing your client code:

;; explore the endpoints
(explore m)
=> [[:get-pet "Loads a pet by id"]
    [:create-pet "Creates a pet"]]

;; explore the :get-pet endpoint
(explore m :get-pet)
=> {:summary "Loads a pet by id"
    :parameters {:id s/Int}}

;; build the url for a request
(url-for m :get-pet {:id 123})
=> https://pedestal-api.herokuapp.com/pets/123

;; build the request map for a request
(request-for m :get-pet {:id 123})
=> {:method :get
    :url "https://pedestal-api.herokuapp.com/pets/123"
    :headers {"Accept" "application/transit+msgpack"}
    :as :byte-array}

What about those cross-cutting, non-functional aspects like authentication or metrics? Martian uses interceptors, just like Pedestal, to allow you to customise behaviour of the call either before the request is made, after the response is received, or both. Let's add some timing to our requests:

(require '[martian.core :as martian]
         '[martian.clj-http :as martian-http])

(def request-timer
  {:name ::request-timer
   :enter (fn [ctx]
            (assoc ctx ::start-time (System/currentTimeMillis)))
   :leave (fn [ctx]
            (->> ctx ::start-time
                 (- (System/currentTimeMillis))
                 (format "Request to %s took %sms" (get-in ctx [:handler :route-name]))
                 (println))
            ctx)})

(let [m (martian-http/bootstrap-swagger
         "https://pedestal-api.herokuapp.com/swagger.json"
         {:interceptors (concat martian/default-interceptors
                                [martian-http/encode-body
                                 (martian-http/coerce-response)
                                 request-timer
                                 martian-http/perform-request])})]

  (martian/response-for m :all-pets {:id 123}))
  ;; Request to :all-pets took 38ms
  ;; => {:status 200 :body {:pets []}}

Martian uses interceptors for everything, including the making of the request, and allows you to write your own and configure them as you wish. As a result, you can use any HTTP library you please if you don't like the implementations provided for httpkit, clj-http and cljs-http. For testing, you may even want to mock out HTTP and use schema validation and generative responses - martian-test does just this, and will be covered in the next article, along with mapping APIs without Swagger and more advanced composition.
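As a sketch of what a hand-rolled interceptor might look like - assuming, as the request-for output above suggests, that the context carries the outgoing request map under :request - authentication could be added in one place for every call:

(def add-authentication-header
  {:name ::add-authentication-header
   :enter (fn [ctx]
            ;; API_TOKEN is a hypothetical credential source; the point is that
            ;; the request is enriched before the perform-request interceptor runs
            (assoc-in ctx [:request :headers "Authorization"]
                      (str "Bearer " (System/getenv "API_TOKEN"))))})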

Martian is on GitHub; feedback and contributions are very welcome.

Permalink

Kata: Variation on Lights Out, introducing component library

Recently at a Clojure Barcelona Developers event, we had a refactoring session in which we introduced Stuart Sierra's component library into code that was already finished: our previous solution to the Lights Out kata.

First, we put the code in charge of talking to the back end and updating the lights atom into a separate component, ApiLightsGateway:

Then we did the same for the lights code, which went into the Lights component:

This component provided a place to create and close the channel we used to communicate lights data between the Lights and the ApiLightsGateway components.
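The actual code is embedded in the original post and lives in the repositories linked below; purely as an illustrative sketch (the namespace and field names are invented), such a component could create the channel on start and close it on stop:

(ns lights-out.lights
  (:require [cljs.core.async :as async]
            [com.stuartsierra.component :as component]))

(defrecord Lights [lights-channel lights]
  component/Lifecycle
  (start [this]
    (assoc this
           :lights-channel (async/chan) ;; channel shared with the gateway
           :lights (atom [])))          ;; the lights atom the view renders
  (stop [this]
    (when lights-channel
      (async/close! lights-channel))
    (assoc this :lights-channel nil)))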

Next, we used the Lights component from the view code:

And, finally, we put everything together in the lights core namespace:
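Again, the real code is in the linked repositories; a minimal sketch of that composition, with illustrative constructor and key names, might look like:

(ns lights-out.core
  (:require [com.stuartsierra.component :as component]
            [lights-out.lights :as lights]
            [lights-out.lights-gateway :as gateway]))

(defn lights-system []
  (component/system-map
   :lights-gateway (gateway/map->ApiLightsGateway {})
   :lights (component/using (lights/map->Lights {})
                            [:lights-gateway])))

(defonce system (component/start (lights-system)))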

This was a nice practice to learn more about component and what it might take to introduce it in an existing code base.

You can find the code we produced in these two GitHub repositories: the server and the client (see the componentization branch).

You can check the changes we made to componentize the code here (see the commits made on Aug 30, 2016).

As usual it was a great pleasure to do mob programming and learn with the members of Clojure Developers Barcelona.

Permalink

Using Cognonto to Generate Domain Specific word2vec Models

word2vec is a two-layer artificial neural network that processes text to learn the relationships between the words within a text corpus, producing a model of those relationships. The text corpus that a word2vec process uses to learn the relationships between words is called the training corpus.

In this article I will show you how Cognonto’s knowledge base can be used to automatically create highly accurate domain-specific training corpuses that word2vec can use to generate word-relationship models. Note, however, that what is discussed here applies not only to word2vec but to any method that uses corpuses of text for training. For example, in another article, I will show how this can be done with another algorithm called ESA (Explicit Semantic Analysis).

It is said about word2vec that “given enough data, usage and contexts, word2vec can make highly accurate guesses about a word’s meaning based on past appearances.” What I will show in this article is how to determine the context and we will see how this impacts the results.

Training Corpus

A training corpus is really just a set of text used to train unsupervised machine learning algorithms. Any kind of text can be used by word2vec. The only thing it does is learn the relationships between the words that exist in the text. However, not all training corpuses are equal. Training corpuses are often dirty, biased and ambiguous. Depending on the task at hand, that may be exactly what is required, but more often than not their errors need to be fixed. Cognonto has the advantage of starting with clean text.

When we want to create a new training corpus, the first step is to find a source of text that could work to create that corpus. The second step is to select the text we want to add to it. The third step is to pre-process that corpus of text to perform different operations on the text, such as: removing HTML elements; removing punctuation; normalizing text; detecting named entities; etc. The final step is to train word2vec to generate the model.

word2vec is somewhat dumb. It only learns what exists in the training corpus. It does not do anything other than “read” the text and analyze the relationships between the words (which are really just groups of characters separated by spaces). The word2vec process is highly subject to the Garbage In, Garbage Out principle: if the training set is dirty, biased and ambiguous, then the learned relationships will end up being of little or no value.

Domain-specific Training Corpus

A domain-specific training corpus is a specialized training corpus where its text is related to a specific domain. Examples of domains are music, mathematics, cars, healthcare, etc. In contrast, a general training corpus is a corpus of text that may contain text that discusses totally different domains. By creating a corpus of text that covers a specific domain of interest, we limit the usage of words (that is, their co-occurrences) to texts that are meaningful to that domain.

As we will see in this article, a domain-specific training corpus can be much more useful and powerful than a general one when the task at hand relates to a specific domain of expertise. The major problem with domain-specific training corpuses is that they are really costly to create. We not only have to find the source of data to use, but we also have to select each document that we want to include in the training corpus. This can work if we want a corpus with 100 or 200 documents, but what if you want a training corpus of 100,000 or 200,000 documents? Then it becomes a problem.

It is the kind of problem that Cognonto helps to resolve. KBpedia, Cognonto’s knowledge base, is a set of ~39,000 reference concepts with ~138,000 links to the schemas of external data sources such as Wikipedia, Wikidata and USPTO. It is this structure and these links to external data sources that we use to create domain-specific training corpuses on the fly. We leverage the reference concept structure to select the concepts that should be part of the domain being defined. Then we use Cognonto’s inference capabilities to infer the hundreds or thousands of other concepts that define the full scope of the domain. We then analyze the concepts selected this way to get all of their links to external data sources. Finally, we use those references to create the training corpus. All of this is done automatically once the initial few concepts that define the domain are selected. The workflow looks like:

cognonto-workflow

The Process

To show you how this process works, I will create a domain-specific training set about musicians using Cognonto. Then I will use the Google News word2vec model, which Google trained on a corpus of about 100 billion words. The Google model contains 300-dimensional vectors for 3 million words and phrases. I will use the Google News model as the general model to compare the results and performance of a domain-specific model against a general one.

Determining the Domain

The first step is to define the scope of the domain we want to create. For this article, I want a domain that is somewhat constrained to create a training corpus that is not too large for demo purposes. The domain I have chosen is musicians. This domain is related to people and bands that play music. It is also related to musical genres, instruments, music industry, etc.

To create my domain, I select a single KBpedia reference concept: Musician. If I wanted to broaden the scope of the domain, I could have included other concepts such as: Music, Musical Group, Musical Instrument, etc.

Aggregating the Domain-specific Training Corpus

Once we have determined the scope of the domain, the next step is to query the KBpedia knowledge base to aggregate all of the text that will belong to that training corpus. The end result of this operation is to create a training corpus with text that is only related to the scope of the domain we defined.

(defn create-domain-specific-training-set
  [target-kbpedia-class corpus-file]
  (let [step 1000
        entities-dataset "http://kbpedia.org/knowledge-base/"
        kbpedia-dataset "http://kbpedia.org/kko/"
        nb-entities (get-nb-entities-for-class-ws target-kbpedia-class entities-dataset kbpedia-dataset)]
    (loop [nb 0
           nb-processed 1]
      (when (< nb nb-entities)
        (doseq [entity (get-entities-slice target-kbpedia-class entities-dataset kbpedia-dataset :limit step :offset @nb-processed)]          
          (spit corpus-file (str (get-entity-content entity) "\n") :append true)
          (println (str nb-processed "/" nb-entities)))
        (recur (+ nb step)
               (inc nb-processed))))))

(create-domain-specific-training-set "http://kbpedia.org/kko/rc/Musician" "resources/musicians-corpus.txt")

What this code does is to query the KBpedia knowledge base to get all the named entities that are linked to it, for the scope of the domain we defined. Then the text related to each entity is appended to a text file where each line is the text of a single entity.

Given the scope of the current demo, the musicians training corpus is composed of 47,263 documents. This is the crux of the demo. With a simple function, we are able to aggregate 47,263 text documents highly related to a conceptual domain we defined on the fly. All of the hard work has been delegated to the knowledge base and its conceptual structure (in fact, this simple function leverages 8 years of hard work).

Normalizing Text

The next step is a natural one in any NLP pipeline. Before learning from the training corpus, we should clean and normalize the raw text.

(defn normalize-proper-name
  [name]
  (-> name
      (string/replace #" " "_")      
      (string/lower-case)))

(defn pre-process-line
  [line]  
  (-> (let [line (-> line
                     ;; 1. remove all underscores
                     (string/replace "_" " "))]
        ;; 2. detect named entities and change them with their underscore form, like: Fred Giasson -> fred_giasson
        (loop [entities (into [] (re-seq #"[\p{Lu}]([\p{Ll}]+|\.)(?:\s+[\p{Lu}]([\p{Ll}]+|\.))*(?:\s+[\p{Ll}][\p{Ll}\-]{1,3}){0,1}\s+[\p{Lu}]([\p{Ll}]+|\.)" line))
               line line]
          (if (empty? entities)
            line
            (let [entity (first (first entities))]
              (recur (rest entities)                     
                     (string/replace line entity (normalize-proper-name entity)))))))
      ;; 3. remove the stop words defined in stop-list
      (string/replace (re-pattern stop-list) " ")
      ;; 4. remove everything between brackets like: [1] [edit] [show]
      (string/replace #"\[.*\]" " ")
      ;; 5. punctuation characters except the dot and the single quote, replace by nothing: (),[]-={}/\~!?%$@&*+:;<>
      (string/replace #"[\^\(\)\,\[\]\=\{\}\/\\\~\!\?\%\$\@\&\*\+:\;\<\>\"\p{Pd}]" " ")
      ;; 6. remove all numbers
      (string/replace #"[0-9]" " ")
      ;; 7. remove all words with 2 characters or less
      (string/replace #"\b[\p{L}]{0,2}\b" " ")
      ;; 8. normalize spaces
      (string/replace #"\s{2,}" " ")
      ;; 9. normalize dots with spaces
      (string/replace #"\s\." ".")
      ;; 10. normalize dots
      (string/replace #"\.{1,}" ".")
      ;; 11. normalize underscores
      (string/replace #"\_{1,}" "_")
      ;; 12. remove standalone single quotes
      (string/replace " ' " " ")
      ;; 13. re-normalize spaces
      (string/replace #"\s{2,}" " ")
      ;; 14. put everything lowercase
      (string/lower-case)

      (str "\n")))

(defn pre-process-corpus
  [in-file out-file]
  (spit out-file "" :append true)
  (with-open [file (clojure.java.io/reader in-file)]
    (doseq [line (line-seq file)]
      (spit out-file (pre-process-line line) :append true))))

(pre-process-corpus "resources/musicians-corpus.txt" "resources/musicians-corpus.clean.txt")

We remove all of the characters that may cause issues for the tokenizer used by the word2vec implementation. We also remove unnecessary words, words that appear too often, and words that add nothing to the model we want to generate (like the names of days and months). We also drop all numbers.

Training word2vec

The last step is to train word2vec on our clean domain-specific training corpus to generate the model we will use. For this demo, I will use the DL4J (Deep Learning for Java) library that is a Java implementation of the word2vec algorithm. Training word2vec is as simple as using the DL4J API like this:

(defn train
  [training-set-file model-file]
  (let [sentence-iterator (new LineSentenceIterator (clojure.java.io/file training-set-file))
        tokenizer (new DefaultTokenizerFactory)
        vec (.. (new Word2Vec$Builder)
                (minWordFrequency 1)
                (windowSize 5)
                (layerSize 100)
                (iterate sentence-iterator)
                (tokenizerFactory tokenizer)
                build)]
    (.fit vec)
    (SerializationUtils/saveObject vec (io/file model-file))
    vec))

(def musicians-model (train "resources/musicians-corpus.clean.txt" "resources/musicians-corpus.model"))

What is important to notice here is the number of parameters that can be set when training word2vec on a corpus; the algorithm can be quite sensitive to how it is parameterised.
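For illustration only, here are a few of the knobs the DL4J builder exposes, reusing the sentence-iterator and tokenizer bindings from the train function above; the values are arbitrary rather than recommendations, and the method names should be checked against the DL4J version in use:

(.. (new Word2Vec$Builder)
    (minWordFrequency 5)   ;; ignore tokens seen fewer than 5 times
    (windowSize 8)         ;; widen the co-occurrence window
    (layerSize 300)        ;; dimensionality of the word vectors
    (iterations 3)         ;; training iterations per batch
    (learningRate 0.025)   ;; initial learning rate
    (seed 42)              ;; fixed seed for reproducibility
    (iterate sentence-iterator)
    (tokenizerFactory tokenizer)
    build)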

Importing the General Model

The goal of this demo is to demonstrate the difference between a domain-specific model and a general model. Remember that the general model we chose is the Google News model, which was trained on billions of words but is highly general. DL4J can import that model without us having to generate it ourselves (in fact, only the model is distributed by Google, not the training corpus):

(defn import-google-news-model
  []
  (org.deeplearning4j.models.embeddings.loader.WordVectorSerializer/loadGoogleModel (clojure.java.io/file "GoogleNews-vectors-negative300.bin.gz") true))

(def google-news-model (import-google-news-model))

Playing With Models

Now that we have a domain-specific model related to musicians and a general model related to news processed by Google, let’s start playing with both to see how they perform on different tasks. In the following examples, we will always compare the domain-specific training corpus with the general one.

Ambiguous Words

A characteristic of words is that their surface form can be ambiguous; they can have multiple meanings. An ambiguous word can co-occur with multiple other words that may not share any meaning. But all of this depends on the context. In a general context, this situation happens more often than we think and impacts the similarity score of these ambiguous terms. However, as we will see, the phenomenon is greatly diminished when we use domain-specific models.

Similarity Between Piano, Organ and Violin

What we want to check is the relationship between three different musical instruments: piano, organ and violin.

(.similarity musicians-model "piano" "violin")
0.8422856330871582
(.similarity musicians-model "piano" "organ")
0.8573281764984131

As we can see, both tuples have a high likelihood of co-occurrence. This suggests that the terms in each tuple are highly related. In this case, it is probably because violins are often played along with a piano, and because an organ looks like a piano (at least it has a keyboard).

Now let’s take a look at what the general model has to say about that:

(.similarity google-news-model "piano" "violin")
0.8228187561035156
(.similarity google-news-model "piano" "organ")
0.46168726682662964

The surprising fact here is the apparent dissimilarity between piano and organ compared with the result we got with the musicians domain-specific model. If we think a bit about this use case, we will probably conclude that these results make sense. In fact, organ is an ambiguous word in a general context. An organ can be a musical instrument, but it can also be a body part. This means that the word organ will co-occur not only with piano but also with all kinds of other words related to human and animal biology. This is why the two words are less similar in the general model than in the domain-specific one: organ is ambiguous in a general context.

Similarity Between Album and Track

Now let’s see another similarity example between two other words, album and track, where track is ambiguous depending on the context.

(.similarity musicians-model "album" "track")
0.7943775653839111
(.similarity google-news-model "album" "track")
0.18461623787879944

As expected, because track is ambiguous, there is a big difference in co-occurrence probabilities depending on the context (domain-specific or general).

Similarity Between Pianist and Violinist

However, do domain-specific and general models always differ like this? Let’s take a look at two words that are domain-specific and unambiguous: pianist and violinist.

(.similarity musicians-model "pianist" "violinist")
0.8430571556091309
(.similarity google-news-model "pianist" "violinist")
0.8616064190864563

In this case, the similarity score between the two terms is almost the same. In both contexts (general and domain-specific), their co-occurrence is similar.

Nearest Words

Now let’s move from pairwise similarity to nearest words: we take a few words and see which other words occur most often alongside them in each model.

Music

(.wordsNearest musicians-model ["music"] [] 7)
music revol samoilovich bunin musical amalgamating assam. voice dance.
(.wordsNearest google-news-model ["music"] [] 8)
music classical music jazz Music Without Donny Kirshner songs musicians tunes

One observation we can make is that the terms from the musicians model are more general than the ones from the general model.

Track

(.wordsNearest musicians-model ["track"] [] 10)
track released. album latest entitled released debut year. titled positive
(.wordsNearest google-news-model ["track"] [] 5)
track tracks Track racetrack horseshoe shaped section

As we know, track is ambiguous. The difference between these two sets of nearest related words is striking. There is a clear conceptual correlation in the musicians’ domain-specific model. But in the general model, it is really going in all directions.

Year

Now let’s take a look at a really general word: year

(.wordsNearest musicians-model ["year"] [] 11)
year ghantous. he was grammy naacap grammy award for best luces del alma year. grammy award grammy for best sitorai sol nominated
(.wordsNearest google-news-model ["year"] [] 10)
year month week months decade years summer year.The September weeks

This one is quite interesting too. Both groups of words make sense, but only in their respective contexts. With the musicians’ model, year is mostly related to awards (like the Grammy Awards 2016), categories like “song of the year”, etc.

In the context of the general model, year is really related to time concepts: months, seasons, etc.

Playing With Co-Occurrences Vectors

Finally, we will play with the co-occurrence vectors themselves. A really popular word2vec equation is king - man + woman = queen. What happens under the hood with this equation is that we add and subtract the co-occurrence vectors for each of these words, and then check the nearest word to the resulting co-occurrence vector.
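The equation maps directly onto the wordsNearest API used below, with a list of positive words and a list of negative words; with the Google News model this famously returns queen at or near the top:

(.wordsNearest google-news-model ["king" "woman"] ["man"] 1)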

Now, let’s take a look at a few of these equations.

Pianist + Renowned = ?

(.wordsNearest musicians-model ["pianist" "renowned"] [] 9)
pianist renowned teacher. composer. prolific virtuoso teacher leading educator.
(.wordsNearest google-news-model ["pianist" "renowned"] [] 7)
renowned pianist pianist composer jazz pianist classical pianists composer pianist virtuoso pianist

These kinds of operations are interesting. If we add the two co-occurrence vectors for pianist and renowned, we find that a teacher, an educator, a composer or a virtuoso is a renowned pianist.

For unambiguous surface forms like pianist, both models score quite well. The difference between the two results comes from the way the general training corpus was created (and pre-processed) compared to the musicians corpus.

Metal + Death = ?

(.wordsNearest musicians-model ["metal" "death"] [] 10)
metal death thrash deathcore melodic doom grindcore metalcore mathcore heavy
(.wordsNearest google-news-model ["metal" "death"] [] 5)
death metal Tunstallbled steel Death

This example uses two quite general words with no apparent relationship between them. The results with the musicians’ model are all closely related genres of music like thrash metal, deathcore, etc.

However with the general model, it is a mix of multiple unrelated concepts.

Metal - Death + Smooth = ?

Let’s play some more with these equations. What if we want some kind of smooth metal?

(.wordsNearest musicians-model ["metal" "smooth"] ["death"] 5)
smooth fusion funk hard neo

This one is quite interesting. We subtracted the death co-occurrence vector from the metal one, and then added the smooth vector. What we end up with is a set of music genres that are much smoother than death metal.

(.wordsNearest google-news-model ["metal" "smooth"] ["death"] 5)
smooth metal Brushed aluminum durable polycarbonate chromed steel

In the case of the general model, we end up with “smooth metal”. The removal of the death vector has no effect on the results, probably because these are three ambiguous and really general terms.

What Is Next

The demo I presented in this article uses public datasets currently linked to KBpedia. You may wonder what the other possibilities are. One is to link your own private datasets to KBpedia; those private datasets would then become usable, in exactly the same way, to create domain-specific training corpuses on the fly. Another possibility would be to take totally unstructured text, like local text documents, or semi-structured text, like a set of HTML web pages, and tag each document with KBpedia reference concepts using the Cognonto topics analyzer. We could then use the KBpedia structure in exactly the same way to choose which of these documents to include in the domain-specific training corpus.

Conclusion

As we saw, creating domain-specific training corpuses to use with word2vec can have a dramatic impact on the results, which become much more meaningful within the scope of that domain. Another advantage of domain-specific training corpuses is that they produce much smaller models. This is quite an interesting characteristic, since smaller models are faster to generate, faster to download and upload, faster to query, consume less memory, etc.

Of the concepts in KBpedia, roughly 33,000 of them correspond to types (or classes) of various sorts. These pre-determined slices are available across all needs and domains to generate such domain-specific corpuses. Further, KBpedia is designed for rapid incorporation of your own domain information to add further to this discriminatory power.

Permalink

Kata: Variation on Lights Out, a bit of FRP using reagi library

To finish the series of exercises we've been doing lately around the Lights Out kata, I decided to use the reagi library to raise the level of abstraction of the communications between the Lights and the LightsGateway components.

reagi is an FRP library for Clojure and ClojureScript which is built on top of core.async. I discovered it while reading Leonardo Borges' wonderful Clojure Reactive Programming book.

I started from the code of the version using the component library that I recently posted about and introduced reagi. Let's see the code.

Let's start with the lights-gateway namespace:

The main change here is that the ApiLightsGateway component keeps a lights-stream, which is initialized in its start function and disposed of in its stop function using reagi.

I also used reagi's deliver function to feed the lights-stream with the response taken from the channel that cljs-http returns when we make a POST request to the server.
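The real code is embedded in the post and in the linked repository; purely as a sketch of its shape (the fetch-lights! name and the /lights route are made up):

(ns lights-out.lights-gateway
  (:require [cljs-http.client :as http]
            [cljs.core.async :refer [<!]]
            [com.stuartsierra.component :as component]
            [reagi.core :as r])
  (:require-macros [cljs.core.async.macros :refer [go]]))

(defrecord ApiLightsGateway [lights-stream]
  component/Lifecycle
  (start [this]
    (assoc this :lights-stream (r/events)))  ;; create the stream on start
  (stop [this]
    (when lights-stream
      (r/dispose lights-stream))             ;; dispose of it on stop
    (assoc this :lights-stream nil)))

(defn fetch-lights! [{:keys [lights-stream]}]
  ;; feed the stream with the response taken from cljs-http's channel
  (go (r/deliver lights-stream (:body (<! (http/post "/lights"))))))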

Next, the lights namespace:

Notice how the dependency on core.async disappears and the code to update the lights atom is now a subscription to the lights-stream (look inside the listen-to-lights-updates! function). This new code is much easier to read and is at a higher level of abstraction than the one using core.async in previous versions of the exercise.

Now the lights-view namespace:

Here I also used reagi to create a stream of clicked-light-positions. Again the use of FRP makes the handling of clicks much simpler (in previous versions a callback was being used to do the same).

Another change to notice is that we made the view a component (LightsView component) in order to properly create and dispose of the clicked-light-positions stream.

Finally, this is the core namespace where all the components are composed together:

This was a nice practice to learn more about doing FRP with reagi. I really like the separation of concerns and clarity that FRP brings with it.

You can find my code in these two GitHub repositories: the server and the client (see the master branch).

You can check the changes I made to use reagi here (see the commits made from the commit d51d2d4 (using a stream to update the lights) on).

Ok, that's all, next I'll start posting a bit about re-frame.

Permalink

Clojure, Cursive and Emacs

J. Pablo Fernández has recently posted a piece with the incendiary title of “Emacs is Hurting Clojure”.

I disagree with the idea behind the title, but then again, he seems to do so himself. He promptly clarifies:

The way Emacs is hurting Clojure is by Clojurians maintaining this myth that you need to use Emacs for Clojure. This is not done by simple statements but by a general culture of jokes saying things such as “you are wrong if you don’t use emacs”.

That’s one point I can agree with.

I can see how being pushed in the general direction of Emacs would turn people off. I’ve never been a fan myself. Some of this Emacs advocacy is just “follow the leader”, some of it is religious adherence to their own ecosystem, some of it is a stubborn refusal to pay for tools (sadly, an argument I have heard more than once).

Luckily there’s Cursive. Yes, it is a paid tool, but there is a free license for non-commercial or student work. Yes, it requires IntelliJ IDEA… but there’s also a free community edition of IDEA you can use. And yes, you are expected to pay for Cursive if you’re doing commercial work in Clojure. Considering how much it helps, why wouldn’t you want to?

If you're looking to get started with Clojure, aren't into Emacs and don't want to have to learn both a language and an environment at the same time, I'd strongly recommend it. And if you're already an experienced Clojure developer, I expect you'll almost immediately see how much of a difference Cursive makes.

Permalink

Community Update

Picture: Arachne demo of web handler

Approximate Reading Time: 5 minutes

Hello Arachne Enthusiasts!

Arachne is taking form. What does its future hold? After sharing some news and updating you on our progress, this post peers into Arachne's future. Our last post was in early July, so Luke has been feeling torn between writing a community update and writing Arachne itself. That's why I'm writing you instead.

My name is Jay Martin. I'm a fledgling web programmer with a sweet tooth for Clojure and Datomic. I'll be helping Luke focus on building Arachne by serving as a community liaison.

So, what does a community liaison do? Mostly listen.

Arachne's community liaisons will listen to feedback from the Community and relay that feedback to Luke in digestible form, striving to honor the feedback's spirit and substance.

Also, we want to learn as much as we can about Luke's mental model for building software systems with Arachne. We intend to share that mental model with you, clearly and concisely.

Luke is passionate about fostering a culture of learning and knowledge sharing, as is the Steering Group.

You can expect updates on our progress about once every two weeks.

Also, keep an eye out for our first technical screencasts and Wiki articles covering Arachne fundamentals. We'll post links to new resources here on this site as they become available.

Many people have stepped forward to offer Luke their help. Luke is deeply inspired by this outpouring of support. The Steering Group sees a risk to the project's success in opening up the development process too early, while Luke is simply getting ideas out of his head and into an initial project structure. The opposite risk for Arachne is that of isolating the development process from the Community, alienating the very people who will help Arachne reach its full potential – you, and every single supporter, code contributor, Kickstarter backer, developer, designer, author, blogger, speaker and tire kicker out there.

We aim to strike a pragmatic balance between these competing risks and we'll rely on you to let us know when we're off track. If you have ideas about how to improve Arachne, please open a GitHub issue. If you have ideas about how to make our Community Experience the very best in this quadrant of the galaxy, please send me a tweet @webappzero.

We'll open up more lines of communication, as needed to best support Arachne's community with the care and attention it deserves. We'll include Community Developers and Designers in the creative process once Arachne's fundamental pieces are more concrete. The idea being that it's easier to talk en masse about improving tangible code than intangible concepts. Presently, that's the role of the Steering Group. Luke shares his ideas and progress and then listens to our feedback. Gradually, the Community as a whole will replace the Steering Group in that role.

Speaking of Arachne's core pieces, here's a list of things that Luke has gotten done, adding to his original work, since officially starting work on Arachne on July 18th:

  • implemented continuous delivery via CircleCI
  • fully Speced the core module
  • implemented an ontology for config schemas
  • reified config & runtime entities in the config database
  • wrote utilities for error messages
  • factored apart the abstract HTTP & Pedestal modules
  • added provenance info to config & init scripts
  • designed Spec-based config & runtime validation
  • wrote 12 Architectural Decision Records (ADRs)

…and finally, saving the best for last:

  • created three working demos of Arachne's core functionality!

Yes! We've got demos. They reveal the inner-workings of the low-level Arachne system. These are largely undocumented so only the adventurous should expect to receive a learning outcome by attempting their execution. Also, keep in mind that these primitives are very low level and much friendlier DSLs (Domain Specific Languages) can and will be built on top of what you'll see there now.

My first impression of Arachne, based on seeing the working demos, is that of a system which vanquishes mystery and embraces accountability. Arachne's validated, acyclic runtime system graph, ignited atop a queryable, validated, schema-compliant config value, feels like a new level of control and awareness when initializing a software system. And while Arachne is marketed as a web framework, at its core it's a framework of frameworks, a novel way of organizing a software application with the potential to influence the way we write server, desktop and even mobile software.

This level of control comes with a price: learning a new way of doing things that is, as yet, unproven in the real world of shipping software. Existing Clojure Libraries and software will need to be wrapped in an interface to comply with the Arachne Module protocol in order to reap the benefits described in the ADR on abstract modules.

When I saw my first Arachne demo I felt a little overwhelmed by all the new words like 'ontology', 'config schema', 'component' and how they all fit together. At that time, I'd never even used a Stuart Sierra Component, so it was a lot to take in.

For some, this price will appear too steep. Fortunately for Arachne, our Industry is awash in change. Driving us toward this new future is the mere possibility that the inherent frustrations of building and managing software systems could be significantly mitigated. Many have already opted for a future that includes Arachne by voting for its existence with their time, talent and treasure. For that we cannot thank all of you enough!

Arachne's immediate future will see Luke writing the Database Abstraction Layer (DAL). The DAL will include the common set of database primitives and commands shared by today's prevalent database systems: entities, queries, etc. I'll be working on the Static Site Generator which will allow us to dogfood Arachne by using it to host this site in the near future.

Concluding with a bit of philosophical conjecture about Arachne's future:

I recently learned about solving problems using recursion from a web page by Carnegie Mellon University:

To solve a problem recursively means that you have to first redefine the problem in terms of a smaller subproblem of the same type as the original problem. […] This step is often called a recursive leap of faith. Before using a recursive call, you must be convinced that the recursive call will do what it is supposed to do. You do not need to think how recursive calls work, just assume that it returns the correct result.

Arachne isn't pulled from the ether. It benefits from a host of innovations, built by people who answered countless challenges and shared their life's work with the world. It has been refined by many generous and thought-provoking conversations within the Clojure and programming Communities. Something else will come after Arachne, and it will be better than it could have been because of what Arachne is and what it is not. Ultimately, Arachne's success will depend on our collective willingness to convince ourselves that the pleasure it promises will outstrip the pain it hides, and to move ourselves to that self-evident and demonstrable leap of faith we call – action!

Jay @webappzero

P.S. Tweet me with any ideas you have about Arachne or our Community. Or just to say hi!

Permalink

Update and Schedule

I am pleased to announce that we finally have a concrete schedule for starting real work on Arachne. By July 18 I will have wrapped up my current consulting engagements and turned my full attention to Arachne.

During this time, I will also be working closely with Apex Data Solutions. Apex is a healthcare technology company that has already invested substantially in Arachne, via a generous donation to the Kickstarter that significantly helped to ensure that the campaign would succeed.

The current plan is to consult with them for two days of every week, specifically around Arachne and how it can further their technological and business goals. During this time I can work either on Arachne's core (as it is relevant to Apex's needs), Apex-specific modules and functionality, or other software as the need arises.

The remaining three days will be dedicated exclusively to general open-source Arachne work, building it out according to the vision outlined in the Kickstarter pitch. The Kickstarter funds will only be used to pay me on these days when I'm focused purely on open-source. Task prioritization during these days will still be governed by the steering group, to maintain focus on the core general-purpose product.

By joining forces in this way, we get two significant benefits:

  • Arachne will have a real world, rigorous use case from inception, giving an opportunity to "dogfood" the product in realtime.
  • Presuming this arrangement continues, this gives at least 10 months of runway for working on Arachne, as opposed to the ~4 that the Kickstarter funds alone would have provided.

I think this is a great mutually beneficial opportunity, and I look forward to working with Apex as well as developing the core web framework.

Permalink

Senior Developer

Droit Financial Technologies, Inc. | New York
remote

Droit Financial Technologies builds systems to help customers make decisions and trades in compliance with market logic and regulatory rules.

We are looking for software developers who want to be part of a small but growing team as we improve and extend our architecture.

Basic qualifications

  • Take time to ask good questions and understand problems
  • Envision a variety of solutions and explain trade-offs
  • Design simple, elegant solutions
  • Use functional programming comfortably
  • Use data-driven designs comfortably
  • Possess excellent team and interpersonal skills
  • Communicate clearly and succinctly
  • Write code first for humans to read

Desired qualifications

  • Proficient in Clojure or another mostly-functional Lisp

Technologies used

Including but not limited to:

  • Clojure, ClojureScript, Java
  • React via Reagent and Om
  • SQL and Datomic databases
  • RESTful APIs

We welcome candidates who are willing and able to work effectively remotely, as well as on-site in NYC.

Permalink

Clojure in Europe: Akvo

It takes the right people with the right motivation and the right tools to make a project succeed.

A Multi-National Delivering Aid

Akvo is an international not-for-profit, not-for-loss organisation with headquarters in the Netherlands. They use government grants to develop and enhance tools that are used in 60+ countries around the world to help the distribution and governance of aid. They open-source their software every step of the way.

Akvo Flow is a mobile application that helps users quickly map out developing situations on the ground, for example collating the locations of water pumps in relation to villages. Before tools like this, gathering such data was traditionally done with a pen, paper and camera. Now the job can be done using a smartphone that has GPS, an in-built camera, and reliable data synchronisation to a remote platform.

All of this data is fed back into a centralised reporting system that allows overseers to make better decisions on where to co-ordinate aid, and to track the impact that existing aid is having.

I caught up with Iván Perdomo, a developer at Akvo who introduced Clojure. He resides in Pamplona, Spain, and our interview coincided with the city's annual bull festival. He was wisely staying off the streets.

The company

Jon: How many developers work at Akvo and how are they spread across Europe?

Iván: We have around 20 people living in Finland, Sweden, the UK, the Netherlands and Spain (Barcelona and Pamplona). We also have a Bangalore office running Akvo Caddisfly, a project to analyse photos of water in an attempt to diagnose what level of fluoride the water has.

The organisation started off by being distributed. One co-founder is in the Netherlands (the Hague) and the other is in Sweden. The mantra of the company is that it doesn't matter where you live if you're the right person. We've got small clumps of people spread all over.

Iván Perdomo

The history of Clojure at Akvo

Jon: When did you join and why?

Iván: Akvo is a non-profit organization that builds open source software; that is the first thing that attracted me. I joined in Autumn 2012.

Jon: How did Clojure get started at Akvo?

Iván: In May 2013 we decided that we needed to break up a Java-based monolith that was running on Google App Engine.

We identified some Java code we could prise away into a different service, and I wrapped this Java code in an HTTP layer built using Clojure. It was budgeted at two weeks, and the service has been running ever since. This is Akvo Flow Services, our first Clojure experience.

Jon: Why Clojure in particular?

EuroClojure

Iván: I started looking at Clojure in 2011 - I wanted to learn something different to Java and JavaScript. I bought the first book - Programming Clojure by Stuart Halloway - and liked what I saw. I watched and re-watched Rich's talks.

What impresses me about Clojure is the foundation of immutability - never changing the values. We do not need to worry that a data structure we pass around will get modified, something that has bitten me often pre-Clojure.

I've been to all the EuroClojures and now I'm trying to set up a functional programming meet-up in the small city of Pamplona where I live.

Jon: How has Akvo progressed with Clojure?

Iván: In 2014 we started to look at broadening our range of products and the underlying architecture. We decided that instead of building everything using loosely coupled APIs communicating with each other over HTTP, we would go down the event-sourced route instead.

Now we have systems that publish events to a central place - a central store of events. The usual approach here is to use Kafka, but as a small team we went for a slimmed-down, lightweight version built on PostgreSQL instead. We expose an HTTP layer on top so that any system can post events to a given topic. A consumer can then connect to the database and read events.
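As a rough sketch of that shape only (the endpoint, table and column names here are invented, not Akvo's actual schema): producers POST events to a topic over HTTP, and consumers read them back out of PostgreSQL in order.

(require '[clj-http.client :as http]
         '[clojure.java.jdbc :as jdbc])

(defn publish-event!
  "Post an event map to the given topic of the HTTP event service."
  [base-url topic event]
  (http/post (str base-url "/topics/" topic)
             {:content-type :json
              :form-params event}))

(defn read-events
  "Read all events for a topic with an id greater than offset, oldest first."
  [db topic offset]
  (jdbc/query db ["SELECT id, payload FROM events WHERE topic = ? AND id > ? ORDER BY id"
                  topic offset]))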

We have written our event consumers in Clojure. One consumer transforms the data and pushes it to a self-hosted CartoDB installation (a Geographical Information System), another takes the data and normalizes it for reporting purposes.

EuroClojure

Introducing Clojure

Jon: I think you're probably unique in introducing Clojure to a completely distributed organisation, what's that been like?

Iván: I tried to introduce Clojure to my last organisation and failed. My take now is that it takes a lot of effort to share your enthusiasm, and extra effort to explain the ideas behind Clojure and its underlying mechanics. Why is Clojure better than, or different from, Java? We have to focus on the advantages for this effort to be a success.

Instead of focusing on low level details focus on the big ideas, focus on the 'whys' and not the 'hows'.

Upskilling

Jon: How did you go about training people up?

Iván: On Fridays I ran some remote Clojure sessions - I opened up a REPL and shared my screen, explaining what the code did that powered our services. This got some traction. We now run more generic 'learning sessions' on Thursdays.

For senior people it's more of a challenge to onboard people into Clojure. As a senior dev you may have to change your mindset, and even change your editor in some cases.

To help with this I do lots of screen sharing - advising and helping developers to improve their code and to reduce verbosity (see this code here, let me explain what I did here that's similar using some core functions). I like to walk people through the code.

Jon: How many Clojure devs have you got now?

Iván: We have five Clojure devs working in total. After seeing the joy of working with Clojure there are now more people interested.

Hiring

Jon: Have you tried to hire for Clojure devs?

Iván: We've had developers seek out and join Akvo because of our open source work and Clojure. One example is Jonas, the developer behind the lint tool Eastwood.

The Clojure future?

Jon: What's the future for Clojure at Akvo?

Iván: Recently we've built a new data visualisation platform - Akvo Lumen. It's got Clojure on the back-end and React.js on the front-end.

With Lumen we will able to go faster and produce good quality software because of Clojure. This is a measure of the tech but also, a project succeeds or fail depending on people being happy. Projects are successful when developers enjoy what they do and put everything in to make it succeed.

In less than 3 months we got a new MVP up and running - we made good architectural decisions but our users also saw visible progress. The founders are really interested and happy.

It takes the right people with the right motivation and the right tools to make a project succeed.


State of Clojure

Jon: A standard question I always ask, what's the state of Clojure like in your area, in this case Pamplona in Spain?

Iván: In Spain it's still niche - although three weeks ago there was a software craftsmanship event in Pamplona with a session on Clojure and Scala. In northern Europe Clojure is more accepted.

Jon: Anything that excites you in the development of Clojure itself?

Iván: I'm quite excited about clojure.spec. I've written some spec-like functions and will move to spec when it's more readily available.

Technologies

Jon: Any particular technologies you'd like to give a shout out to?

Iván: OpenID Connect, and Keycloak as an implementation of it. I have written an example Ring web app served with Immutant and secured with Keycloak.

Permalink

Copyright © 2009, Planet Clojure. No rights reserved.
Planet Clojure is maintained by Baishampayan Ghose.
Clojure and the Clojure logo are Copyright © 2008-2009, Rich Hickey.
Theme by Brajeshwar.