core.logic & VPRI STEPS

I stumbled across this amazing blog series which is tackling computational linguistics by porting Prolog to core.logic. It reminds me a bit of my earlier attempt to implement Definite Clause Grammars. I got something experimental working but in order for core.logic to really scale for large parsing tasks we'll probably need to rethink how core.logic handles substitutions. That said, core.logic does have tabling so building packrat parsers shouldn't be too difficult.

All this brings me to the incredibly succinct program in Appendix II of the VPRI 2011 Report (Alan Kay et al). They start with a machine oriented lisp, define a grammar, implement Smalltalk, and finish with a non-trivial program.

Could we take similar approaches when writing software with Clojure?

Permalink

lx in core.logic #2: Jumps, Flexible Transitions and Parsing

This is a continuation of the post Finite State Machines in Clojure core.logic.

The current plan for this series is to follow the book Algorithms for Computational Linguistics using Clojure core.logic instead of Prolog.

Jumps, wildcard transitions and parsing are natural and useful ways to extend and leverage finite state machines for text analysis. This was an opportunity to introduce extensions of fact databases and non-deterministic matching. Here's the code:

Permalink

Where to Find Relevancers this Month!

Want to meet a Relevancer in person? Here's where you can find us in the next month:

Durham, NC 1/24/2012, Every Tuesday - 7pm @ Splat Space
Splat Space Open Meeting
Attending: Alan Dipert, Splat Space founder and Meetup organizer

Aruba 2/6-2/10
SpeakerConf
Speaking: Michael Nygard and Stuart Halloway

Durham, NC 2/6/2012, 7pm @ Splat Space
Monthly TriClojure Meeting
Speaking: Brenton Ashworth, ClojureScript One
Attending: Chris Redinger, TriClojure Founder and Meetup Organizer

Atlanta, GA 2/23-2/24
We're proud sponsors of Lessconf
Attending: A dozen Relevancers

Permalink

Today in the Intertweets (Jan 26th Ed)

Permalink

in which an interview is posted

The folks over at The Setup just posted an interview with me wherein I rant about hardware, interactivity, and Emacs.

Note: this was written up over a month ago; if I were interviewed today I couldn't help but mention the Nix package manager. I use it on my Debian Squeeze system to complement apt-get; I get all the system-level stuff that has to be stable from Debian and anything that needs to be fresh from Nix.

Permalink

Comparing JavaScript, CoffeeScript & ClojureScript

UPDATE: Jeremy Ashkenas (CoffeeScript creator) has pointed out on HN a somewhat intentional flaw in the final gist. Hopefully you can spot it and see that this post is about solving that very problem. I'm being a bit old school here - I don't like to give everything away :)

I’ve been spending a lot of time recently hacking on the ClojureScript language. I can say without qualification that I haven’t had this much fun programming since I first taught myself JavaScript nearly seven years ago. So let’s put aside logic programming for a moment and let’s talk about code complexity and code expressivity.

Recently on StackOverflow someone asked how to idiomatically construct a type in ClojureScript. Before we get into that let’s consider how this is done in JavaScript:

// 193 characters

var Foo = function(a, b, c){
  this.a = a;
  this.b = b;
  this.c = c;
}

Foo.prototype.bar = function(x){
  return this.a + this.b + this.c + x;
}

var afoo = new Foo(1,2,3);
afoo.bar(3);

CoffeeScript gets a lot of deserved attention for its brevity for common tasks. For example the same thing in CoffeeScript:

# 106 characters

class Foo
  constructor: (@a, @b, @c) ->
  bar: (x) -> @a + @b + @c + x

afoo = new Foo 1, 2, 3
afoo.bar 3

That requires nearly half as many characters. Of course on real code the compression isn't nearly that great - perhaps 10-20% in my experience. Still, I find that CoffeeScript tends to give the feeling of compression for many common tasks, and how a language feels day in and day out is important for programmer happiness.

Let's take a look at the same thing in ClojureScript:

;; 130 characters

(defprotocol IFoo
  (bar [this x])) ;; 93 characters w/o this!

(deftype Foo [a b c]
  IFoo
  (bar [_ x] (+ a b c x)))

(def afoo (Foo. 1 2 3))
(bar afoo 3)

The ClojureScript without the strange protocol form would give even better compression than CoffeeScript! So what does this protocol form do and why do we need that cluttering up our type definition?

ClojureScript, unlike JavaScript or CoffeeScript, promotes defining reusable abstractions. Imagine if all the types in your favorite library were swappable with your own implementations? Hmm ... perhaps that's an abstraction too far for many users of JavaScript or CoffeeScript.

Well, here's a use case I think more people will get - neither JavaScript nor CoffeeScript provides any kind of doesNotUnderstand: hook, which is fantastic for providing default implementations.

(defprotocol IFoo
  (bar [this x]))

(extend-type default
  IFoo
  (bar [_ x] :default))

(bar 1 2) ; >> :default

We've extended all objects, including numbers, to respond to the bar function. We can provide more specific implementations at any time, i.e. by using extend-type on string, array, Vector, even your custom types instead of default. It's important to note that this extension is safe and local to whatever namespace you defined your protocol in.
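To make "more specific implementations" concrete, here is a minimal sketch. It is written for Clojure on the JVM so it can be pasted into a plain REPL; that is my assumption, not the post's setup. On the JVM the catch-all type is spelled `Object` and the string type `String`, where ClojureScript uses `default` and `string`.

```clojure
;; JVM Clojure sketch (assumption): Object/String stand in for
;; ClojureScript's default/string.
(defprotocol IFoo
  (bar [this x]))

;; Catch-all implementation for every object.
(extend-type Object
  IFoo
  (bar [_ x] :default))

;; A more specific implementation takes precedence for strings.
(extend-type String
  IFoo
  (bar [s x] (str s x)))

(bar 1 2)      ; => :default  (falls back to the Object implementation)
(bar "ab" "c") ; => "abc"     (uses the String implementation)
```

Nothing about numbers or strings themselves was modified; the implementations hang off the protocol.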

Still not convinced? Let's demonstrate a very powerful form of extension that even Dart is getting behind.

In ClojureScript it's simple to construct types which act like functions. While this might sound esoteric consider very succinct operations like the following:

(def address {:street "1010 Foo Ave."
              :apt "11111111"
              :city "Bit City"
              :zip "00000000"})

(map address [:street :zip])
;; >> ("1010 Foo Ave." "00000000")

Wow. HashMaps in ClojureScript are functions! Now this may look like some special case provided by the language but that's not true. ClojureScript eats its own dog food - the language is defined on top of reusable abstractions.

How can we leverage this? An example - JavaScript and CoffeeScript both let you extract a range from strings and arrays. In JavaScript you have slice, and CoffeeScript provides sugar via the [i..j] syntax. Neither provides you with a way to succinctly construct and manipulate the idea of a slice. For example:

(defprotocol ISlice
  (-shift [this]))

(deftype Slice [start end]
  ISlice
  (-shift [_] (Slice. (inc start) (inc end)))
  IFn
  (-invoke [_ x]
    (cond
      (string? x) (.substring x start end)
      (vector? x) (subvec x start end))))

(def s (Slice. 0 5))
(def v ["List Processing" [0 1 2 3 4 5 6]])

(map s v)
;; >> ("List " [0 1 2 3 4])
(map (-shift s) v)
;; >> ("ist P" [1 2 3 4 5])

IFn is one of the many reusable abstractions that ship with the language. We define ISlice to illustrate that our type has dual functionality - as an object with fields that can be manipulated and as a function which can be applied to data!

Many people have the misconceived notion that Clojure/Script is only about functional programming - on the contrary Clojure/Script is very much "Object Oriented Programming: The Good Parts".

Permalink

Parser Combinators: How to Parse (nearly) Anything

This is the best video I've ever seen defining monadic parser combinators from the ground up. Nate Young explains his translation of Parsec into Clojure [1]. I'm surprised he could show the code and explain it well in less than forty-five minutes.

As you may know, I have my own PEG parser combinator library called squarepeg. I began developing it as monadic, but decided that I didn't like having to write my own version of Haskell's do notation (let->> in his version), which felt awkward in Clojure. Instead, squarepeg lets you bind variables and refer to them later within each parser itself (instead of in the code that builds the parser). There were a few other differences, but in general, I think the two libraries are more similar than different.

One thing I think I will borrow from his library is counting the character position of the current parser to help report parsing errors. It's going on my TODO list.

I have more plans for squarepeg when I get a chance. They include limiting the laziness of the parser and being able to report better error messages.
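For readers who have not seen the monadic style, here is a minimal sketch of the idea. It is neither Nate's library nor squarepeg; all the names (`return`, `bind`, `satisfy`, `many`, `digit`, `number`) are made up for illustration. A parser here is just a function from an input string to a pair of value and remaining input, or nil on failure.

```clojure
;; A parser is a function: input string -> [value remaining-input], or nil on failure.

(defn return
  "Parser that consumes nothing and yields v."
  [v]
  (fn [input] [v input]))

(defn bind
  "Run parser p, then pass its value to f, which returns the next parser."
  [p f]
  (fn [input]
    (when-let [[v more] (p input)]
      ((f v) more))))

(defn satisfy
  "Parser that consumes one character if it satisfies pred."
  [pred]
  (fn [input]
    (when-let [c (first input)]
      (when (pred c)
        [c (subs input 1)]))))

(defn many
  "Apply p zero or more times, collecting the values in a vector."
  [p]
  (fn [input]
    (loop [acc [] in input]
      (if-let [[v more] (p in)]
        (recur (conj acc v) more)
        [acc in]))))

(def digit (satisfy #(Character/isDigit (char %))))

;; One digit followed by any number of digits, turned into a number.
(def number
  (bind digit
        (fn [d]
          (bind (many digit)
                (fn [ds]
                  (return (Long/parseLong (apply str d ds))))))))

(number "42abc") ; => [42 "abc"]
(number "abc")   ; => nil
```

The nested `bind` calls in `number` are exactly the noise that a do-notation macro is meant to hide, which is the trade-off discussed above.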


  1. Nate said his parser was available on Clojars under the name "The Parsitron", but I could not find it.

LC

Permalink

Handle this! (views, const, state in Clojure, Java, C++ and Python)

Introduction

The original vision Alan Kay had of object oriented programming was about separate entities communicating through message passing. A logical consequence is that the global program state is the sum of the individual states of these entities (called objects). The state of such objects is naturally hidden from the outside, and state modifications occur only as a consequence of the exchanged messages.

I would like to mention that in this model the "privacy" of internal variables is not simply a matter of a keyword, but a consequence of a programming philosophy. This is not the kind of limitation you get in Java or C++ classes, where the field is there and you just cannot access it. It is more similar to calling a black box with a state that is its own business; there are no fields, and if there are, they are just an implementation detail. Or even more so, private variables are inaccessible in the same sense that the physical address of an object in Java is not part of the programming model.

Such objects do not necessarily have their own thread of execution (in the sense that they are concurrently in control). However, if they did, the logical model would not be overly different. But back to the objects…

I somewhat believe that objects are an overloaded metaphor. In fact, there are at least two types of objects. And while the message-only object oriented metaphor applies well to domain specific objects, I feel that it is not appropriate for some data structures. Sometimes it is a nice property that "similar structures" have a common interface so that, for example, switching from an array to a linked list is a painless transition, because it eases experimentation with different trade-offs regarding computational efficiency (although such problems are better solved with pen and paper).

However, in other situations, accessing the internals of some complex structure is plainly the "right thing to do". It is a walking-horror from the object oriented point of view, but it plainly makes sense for computational reasons. I often have to deal with graphs with billions of nodes, and more often than not I feel that usual OO laws are too restrictive.

Graph example

A clear example here is the design of networkx.Graph. I have nothing against the design, by the way; I believe they do the right thing. The idea is that they have implemented their Graph internals in some way (it does not matter how, right now). However, you may want to get a list of all the nodes in the graph. Now, how to do this? The first issue is that the nodes may not be stored in a form that is easy to return. This is actually the case: nodes are dictionary keys under the hood. So essentially there is no easy way to return them without calling some dict member which returns a new structure holding the nodes.

State and "Static" OOP

OOP is all about state change. Perhaps just local state change, if done correctly. And hopefully the effects of the state do not propagate too far from where the state is hidden. Apart from C++, I have found no other truly mainstream OO language that makes it clear what you shall change and what you shall not.

C++ practice is really precise about consting whatever you can const. And for the cases where a logically const object actually needs to mutate something inside, you can use the mutable modifier to express the idea that the object's real state did not change while some irrelevant parts of it did. Examples are forms of caching, counting stuff, logging to a logger we hold a reference to.

Another important aspect of C++ is that it clearly distinguishes between a const pointer (a pointer that cannot change) and a pointer to a const object (the pointed object cannot change through it). As always, all this leads to additional complexity. However, declaring stuff const is good: first, it is a rather strong safety guarantee; second, it can enable optimizations that are otherwise impossible. Still, it is tragically inadequate wrt. plainly immutable objects.

Moreover, although many other languages do not have pointer arithmetic, they do have references. In Java it is possible, for example, to mark such a reference final, which essentially means it will always refer to the same object. However, there is no way to state that the object cannot be mutated through that specific reference.

In Java, the only way to achieve that goal is not to provide methods that mutate the state. In fact this approach makes sense: in a way you make the language simpler without really losing much. And C++ newbies really do not get the whole constness thing very well.


Figure 1: mutable-immutable.png

Essentially, in Java you cannot have a mutable object that some clients cannot mutate. There are options, however. For example, in Figure 1 we have two interfaces, one mutable and one immutable. We have the mutable interface extend the immutable one, along with the appropriate base classes.

Immutability at class level can be obtained both (a) with a true immutable implementation implementing the immutable interface and a mutable implementation implementing the mutable one; or (b) with just a mutable implementation: clients that should not mutate the object will use the immutable interface. This is quite similar to the const in C++ in the sense that a const_cast is usually possible (and in this case we could just cast to the mutable interface). Such things somewhat break the whole immutability thing, but sometimes have their uses.

And what is the big deal with immutability? Basically, in this context immutable stuff can be shared with no fear. And copying huge datasets is too inefficient to be considered.

Dynamic OOP

The essential problem here is that the OO languages we have discussed so far are built around the idea that your co-workers will screw up the project if they are allowed to. So the objective is not letting them do it. Constness shall be enforced by the language (you had the impression that I was happy about C++ const, didn't you?) because otherwise someone will foobar the project.

On the other hand, in languages such as Python you may well do everything to every object, and consequently const-enforcement does not fare very well. A bit more could be done (formally) in Ruby. Still, even then you could always hack the objects to let you do whatever you want. And believe me, you could do that also in C++ and Java, provided you have sufficient control of the environment where the program is going to be run. It is just way harder.

In fact, I believe that good policies about code isolation can also be (easily) implemented in Python. Good API design is of paramount importance. A wise piece of C++ advice (from Meyers) was "Avoid returning handles to object internals" (Item 28, Effective C++, 3rd ed, Scott Meyers, Addison-Wesley).

Essentially the idea is never to leave your object's guts exposed and never ever let someone mess with them. This is not about trust. It is that such handles are a sure way to break your object's constraints (why am I talking like a static programmer anyway?). The point is that such handles change state independently of the core object, and this is probably going to be bad, because the corruption of the state will be revealed in a place and time extremely distant from when and where it actually happened.

So, we have to carefully design our APIs, even (shall I say, especially?) if we are dynamic programmers. For example, we can return views on our object internals. Since our languages are very dynamic, such views can be easily constructed: they just have to quack like the original objects. When it makes sense, it is probably better to place the functionality in the "large" object and delegate to the attribute (delegation is so trivial to implement in dynamic languages!). Notice that strongly interface based languages such as Java could make this approach even more natural, provided that formally specified interfaces make sense for the specific case.

Sometimes it also makes sense to return objects which can mutate and whose mutation influences the state of the object they come from. However, in these situations such objects shall be built in a way that does not break the behavior of the object from which they were obtained. Essentially, here we are just obeying principles like SRP (single responsibility principle) and designing things to work together. In fact, they are not handles to the object internals at all. We are not exposing the implementation of the object: we are just exposing an interface to a part of the object state (perhaps even state that cannot be changed through the main object interface).

What are the problems with this approach? As long as things are not modified, copying is fine. A view is a good thing, because it may be as efficient as possible for reading while being completely safe. The problem essentially arises when we want to mutate the object's state: handles to internals are bad, so we have to:

1. Carefully craft the object interface to allow modifications that are efficient and make sense for the problem at hand, without making it excessively general (because that clashes with efficiency) or excessively big (because that clashes with almost every good property OOP tries to give to programs).

2. Perhaps create special objects that are able to perform controlled modifications on the original object. This may give a lot of generality, in a sense, but also complicates the class hierarchy significantly.

Graph example

Back to our example… we may have many solutions. Suppose that this "get the list of nodes" operation is frequent enough. It may make sense to store such a list separately from the dictionary. If node removal and addition are not too frequent, the additional memory may well be worth it (well, perhaps not, if we really have lots of nodes). Even if such operations are frequent, we double the cost of addition and make deletion O(N)… but if instead of a list we use a set, both operations are O(1), simply with an increased multiplicative constant. Of course a language could offer a dict implementation which essentially offers an efficient view over the set of keys, so that separate storage is not needed.

We could use a mutable datatype to hold the list, but then we should make a copy before returning it (this is what actually happens with networkx). Not making a copy has the same problems as returning an internal handle. If we make a copy, then we could return something immutable or mutable. Essentially, returning something immutable does not make a lot of sense, as modifications would not affect the graph and modifications to the graph are not reflected in the node list. The simplest thing to do is plainly to return a list of nodes.

The true solution would be for the dictionary to support a "true" view object which stays in sync with the original dictionary. And actually Python 2.7 and Python 3 have it. At this point we could just return such a thing and have both efficiency and functionality… were it not for a simple issue: a networkx graph has more than one internal structure holding the nodes. Thus a higher level view would need to be created which could work across the different points where the same information is stored inside a Graph. And we are back to the "complicating the class hierarchy" thing.

Immutable by default

The thing is that actually having to specify things to be const is a bit of a pain. And perhaps it is just me... but consider the Java solutions (this applies to things which roughly work like Java): we are talking about having two classes and two interfaces (or just one class and two interfaces) for lots of objects. In my opinion, this is not practical. And if we want to create "well-behaved handles", things become even more complex.

In fact, this is probably why it is not done (most of the time). Probably it should be sufficient to limit such strategies to things where it really matters. Think about the collections framework.

On the other hand... think about a world where most things are just immutable. I think it is just a safer mental model of programming. It is not about preventing your colleagues (or yourself) from doing things which are legal in the model but which we want to restrict.

If we think immutable, things are just easier. But then we are definitely moving towards the functional side of things. I'm not claiming that functional languages have only immutable stuff, even though many functional languages (Clojure, Haskell) have mostly immutable stuff. However, reasoning in terms of flows of functions and immutable objects is just easier than thinking about mutable objects. At least, it should be, if we were trained to think functionally from the beginning.
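To tie this back to the graph example, here is a small Clojure sketch. The representation (a map from each node to the set of its neighbours) and the function names `add-edge` and `nodes` are my own assumptions for illustration; this is not how networkx stores things. The point is only that with immutable maps the node list can be handed out without a defensive copy.

```clojure
;; Hypothetical representation: a map from node to the set of its neighbours.
(def g {:a #{:b} :b #{:a}})

(defn add-edge
  "Return a new graph with an undirected edge between u and v.
  The graph passed in is left untouched."
  [graph u v]
  (-> graph
      (update u (fnil conj #{}) v)
      (update v (fnil conj #{}) u)))

(defn nodes
  "The nodes are simply the keys of the map; callers cannot corrupt the
  graph through the result, so no copy is made."
  [graph]
  (keys graph))

(def g2 (add-edge g :b :c))

(set (nodes g))  ; => #{:a :b}      the old graph is unchanged
(set (nodes g2)) ; => #{:a :b :c}   the new graph shares structure with the old one
```

Whoever holds `g` keeps seeing the graph they were given, and whoever needs the change works with `g2`.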

Here we are used to dealing with const objects. Sometimes we need to change the state. Two typical scenarios spring to mind. In the first, we aren't doing "Object Oriented Programming": we are just writing an algorithm, and the algorithm was conceived for imperative languages. Sometimes there is no clear conversion into the functional world. Not an efficient one, at least. In this case we may want to use some special mutable object (arrays?) to perform our computation efficiently. And this may even generally work.
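In Clojure, one way to get this kind of local, opt-in mutation is a transient collection rather than a raw array (my choice of technique for this sketch; the function name `running-sums` is made up). The mutable structure never leaves the function, so callers only ever see immutable values.

```clojure
(defn running-sums
  "Prefix sums of xs, accumulated in a transient vector that never escapes
  the function."
  [xs]
  (loop [acc (transient [])
         total 0
         xs (seq xs)]
    (if xs
      (let [total (+ total (first xs))]
        (recur (conj! acc total) total (next xs)))
      (persistent! acc))))

(running-sums [1 2 3 4]) ; => [1 3 6 10]
```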

In the second case, it is simply not practical to structure the state of the world as some function parameters. In fact, most of the time the global state is too big to be wisely represented as a huge set of parameters. In this case we probably want to express the computation as a set of transformations (functions, basically) that shall be executed one after the other on the world. Here I am mostly thinking about Haskell's monads. Though different from a syntactic and semantic point of view, we are not far from the realm of refs/agents.
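A rough sketch of what that looks like with a Clojure ref follows. The shape of the `world` map and the functions `advance` and `record` are invented for the example. The world is an immutable value; the only thing that changes is which value the ref points to, and every change is a plain function from world to world.

```clojure
;; The world is an immutable map held in a ref.
(def world (ref {:tick 0 :log []}))

;; Transformations are pure functions from an old world to a new one.
(defn advance [w] (update w :tick inc))
(defn record [w msg] (update w :log conj msg))

;; They are applied one after the other inside a transaction.
(dosync
  (alter world advance)
  (alter world record "started"))

@world ; => {:tick 1, :log ["started"]}
```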

The issue of efficiency, however, remains. We should still keep in mind that, well buried under layers of object orientation, there may be lots of hidden costs. Interfaces often get in the way of really efficient implementations, because costs are not part of the interface. The collections framework is beautiful… but sort is still implemented by copying everything to an array and sorting the array.

Welcome under the sign of the Lambda

It is better to have const objects by default; that is to say, object mutability shall be opt-in rather than opt-out. And, the famous koan about objects and closures aside, we still have to avoid returning handles to our objects' guts… but I do not often see closures that open up the enclosed state to the world.

The point is that avoiding all the copy costs may be simply the thing to do when we have to deal with huge datasets. Restrict mutability to where it is needed (e.g., implementing the algorithms) but mostly use immutable inputs and outputs for functions. Moreover, functional code is generally flatter, which can also, in the long run, improve efficiency.

Eventually, with languages such as Clojure, even the perceived drawbacks of lists can be avoided by using vectors, which efficiently support a different set of primitives. Laziness is also extremely helpful: actions that are not performed cost nothing.
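A couple of REPL-sized illustrations of both points (the sizes are arbitrary):

```clojure
;; Vectors: indexed access and appending at the end are cheap,
;; and the original vector is never modified.
(def v (vec (range 1000000)))

(nth v 999999)          ; => 999999
(peek (conj v :last))   ; => :last
(count v)               ; => 1000000, v itself is unchanged

;; Laziness: the source sequence is infinite, but only as many squares
;; as are actually asked for get computed.
(take 3 (map #(* % %) (range))) ; => (0 1 4)
```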

Permalink

World Singles at Clojure/West

I was very pleased today to get confirmation that my team are all going to Clojure/West in San Jose in March!

We've been a CFML house for a decade but we're using Clojure more and more on the back end to provide a high-performance, concurrency-safe foundation for our application. Back in December a couple of us attended Clojure/conj, with three days of Clojure training for one of our team. Now we're training up another team member on Clojure, and in March three of us will attend the Clojure conference, with training on Cascalog (big data analysis) for one of our team.

It's an exciting time to be a developer!

Permalink

Today in the Intertweets (Jan 25th Ed)

Permalink

Clojure and Seesaw

There’s a very nice desktop UI library for Clojure called Seesaw, built on top of Swing. You can see a lot of examples of how to use it inside the seesaw/test/examples folder of the distribution or you can browse online at GitHub.

I played with it – here’s a small LED-matrix digital clock implementation. On Ubuntu 11.10 it looks like this:

Here’s the code:

(ns com.icyrock.clojure.seesaw.led-matrix
  (:import [java.util Calendar])
  (:use seesaw.core
        seesaw.graphics))

(def lcd-dot-style-off
  (style
   :background "#181818"
   :stroke (stroke :width 3)))

(def lcd-dot-style-on
  (style
   :background "#00bc00"
   :stroke (stroke :width 3)))

(def lcd-dot-styles
  {false lcd-dot-style-off
   true lcd-dot-style-on})

(defn draw-lcd-dot [g width height is-on]
  (let [dot-style (lcd-dot-styles is-on)
        border (-> dot-style :stroke .getLineWidth)
        border2 (* 2 border)]
    (draw g
          (ellipse border border (- width border2) (- height border2)) dot-style)))

(def lcd-symbol-dots
  {\0
   [".***."
    "*...*"
    "*...*"
    "*...*"
    "*...*"
    "*...*"
    ".***."]
   \1
   ["..*.."
    ".**.."
    "..*.."
    "..*.."
    "..*.."
    "..*.."
    "*****"]
   \2
   [".***."
    "*...*"
    "....*"
    "...*."
    "..*.."
    ".*..."
    "*****"]
   \3
   [".***."
    "*...*"
    "....*"
    "..**."
    "....*"
    "*...*"
    ".***."]
   \4
   ["...*."
    "..**."
    ".*.*."
    "*..*."
    "*****"
    "...*."
    "...*."]
   \5
   ["*****"
    "*...."
    "****."
    "....*"
    "....*"
    "*...*"
    ".***."]
   \6
   [".***."
    "*...*"
    "*...."
    "****."
    "*...*"
    "*...*"
    ".***."]
   \7
   ["*****"
    "....*"
    "...*."
    "..*.."
    "..*.."
    "..*.."
    "..*.."]
   \8
   [".***."
    "*...*"
    "*...*"
    ".***."
    "*...*"
    "*...*"
    ".***."]
   \9
   [".***."
    "*...*"
    "*...*"
    ".****"
    "....*"
    "*...*"
    ".***."]
   \:
   ["....."
    "....."
    "..*.."
    "....."
    "..*.."
    "....."
    "....."]
   })

(defn draw-lcd-symbol [g width height symbol]
  (let [dots (lcd-symbol-dots symbol)
        dot-width (/ width (count (first dots)))
        dot-height (/ height (count dots))]
    (doseq [row dots]
      (doseq [cell row]
        (draw-lcd-dot g dot-width dot-height (= cell \*))
        (translate g dot-width 0))
      (translate g (- width) dot-height))))

(defn get-time-string []
  (let [c (Calendar/getInstance)
        h (.get c Calendar/HOUR_OF_DAY)
        m (.get c Calendar/MINUTE)
        s (.get c Calendar/SECOND)]
    (format "%02d:%02d:%02d" h m s)))

(defn paint-lcd-symbol [c g]
  (try
    (let [symbols (get-time-string)
          symbol-count (count symbols)
          width (.getWidth c)
          height (.getHeight c)
          symbol-width (/ width symbol-count)]
      (doseq [symbol symbols]
        (push g
              (draw-lcd-symbol g (- symbol-width 20) height symbol))
        (translate g symbol-width 0)))
    (catch Exception e
      (println e))))

(defn content-panel []
  (border-panel
   :center (canvas :id :clock
                   :background "#000000"
                   :paint paint-lcd-symbol)))

(defn make-frame []
  (let [f (frame :title "com.icyrock.clojure.seesaw.led-matrix"
                 :width 1200 :height 300
                 :on-close :dispose
                 :visible? true
                 :content (content-panel))]
    (.setLocation f (java.awt.Point. 100 300))
    (timer (fn [e] (repaint! (select f [:#clock])) 1000))))

(defn -main [& args]
  (native!)
  (make-frame))
(-main)

You can find the code at com.icyrock.clojure GitHub repository – just clone and fire up in your favorite IDE.

Permalink

ThinkRelevance: The Podcast - Episode 004 - Aaron Bedra's Valedictory

Aaron Bedra on ThinkRelevance: The Podcast

When I heard that Aaron Bedra was leaving Relevance, I was surprised and a bit saddened. But I also thought, "Hey, we should have him on the podcast." And that's just what we did. I think it's great to work at a place where it's cool to record an interview with someone who has decided to move on.

In this episode, we talk to Aaron about what brought him to Relevance, some of the things he's worked on while he was here and even a bit about what the future holds for him.

Download the episode here.

You may have noticed a sweet new feature on the podcast: cover art! Our crack design team offered to produce "album covers" for our shows, and I realized that I'd be an idiot not to take them up on the offer. It's a fun detail, and one I hope you'll enjoy.

As a reminder, you can subscribe to the podcast using our podcast feed. I'm still working on getting the show added to iTunes, but that should happen pretty soon.

You can send feedback about the show to podcast@thinkrelevance.com, or leave a comment here on the blog. Thanks for listening!

Show Notes

Permalink

Finite State Machines in core.logic

This is an implementation of Finite State Machines in Clojure using core.logic. They are a good starting point for computational linguistics and illustrate several features of core.logic, such as various ways of defining new relations, pattern matching and also the invertibility of relations.

It is not an introduction to core.logic. To learn the basics, I would recommend the Logic Starter.
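For readers who want a feel for the problem before opening the post, here is a rough sketch of a non-deterministic finite state machine in plain Clojure. It deliberately does not use core.logic, and the machine, the map layout and the names `fsm`, `step` and `accepts?` are all invented for this illustration. Non-determinism is handled by tracking a set of possible states, which is the bookkeeping that a relational version would leave to the search.

```clojure
;; Hypothetical machine accepting an a, then any number of b's, then a c:
;; "ac", "abc", "abbc", ...
(def fsm
  {:start       :q0
   :final       #{:q2}
   :transitions {[:q0 \a] #{:q1}
                 [:q1 \b] #{:q1}
                 [:q1 \c] #{:q2}}})

(defn step
  "All states reachable from any state in `states` by reading `sym`."
  [fsm states sym]
  (set (mapcat #(get-in fsm [:transitions [% sym]]) states)))

(defn accepts?
  "True if some run of the machine over `input` ends in a final state."
  [fsm input]
  (let [end-states (reduce (partial step fsm) #{(:start fsm)} input)]
    (boolean (some (:final fsm) end-states))))

(accepts? fsm "abbc") ; => true
(accepts? fsm "ab")   ; => false
```

Unlike a relation, this function only runs in one direction: it can check a string but cannot generate the strings the machine accepts.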

Permalink

Setting up clojure emacs on windows

Setting up a Clojure development environment proved to be quite a challenge, especially given my handicap of using Windows.

The most widely used editor for Clojure is Emacs. I don't have any idea how to use it, but I have seen videos of people who know what they are doing, and I really want that!

There are lots of options for setting up Clojure. If you are just starting out, there are too many. There are at least half a dozen editors that have Clojure plug-ins. Emacs itself has two. The instructions available also tend to present different ways of accomplishing the same thing.

I am not going to give you lots of options. I am going to give you the steps to go from a clean Windows 7 install to having the most popular plug-in for the most popular editor using the most popular build tool for Clojure. I am sure there is a lot of value to all of the flexibility that other people support, but it is lost on me and if you are reading this, it is probably lost on you too.

Also, I intend to cover each step in excruciating detail. I know from experience that any instructions on this subject, no matter how detailed, would include at least one or two steps that assumed I actually knew how to use Emacs. If you follow these instructions and there is a step that is not completely clear to you, please put it in the comments so that I can fix it. If the detail is too much for you, just read the bold print!

Our goal is to have a working installation of the Leiningen build tool, and the SLIME plug-in for the Emacs editor.

Before I start, let me say this works best if you do not already have clojure installed on your computer, or more specifically that you do not have the clojure.jar file in your class path. The Leiningen install will fail if you do.

Install the Java Development Kit (JDK)


Version 1.6 or better is required; I am running 1.7 Standard Edition. Download this from Oracle. There is also a link to installation instructions on that page.

After the JDK is installed, you need to make sure that the environment variables are set up correctly. (Control Panel >> System and Security >> System >> Advanced system settings >> Environment Variables...)

Under user variables I have JAVA_HOME with a path to the folder where the jdk is installed. On my system this is C:\Program Files\Java\jdk1.7.0_01

Under System Variables, the Java bin directory needs to be in your path. Click Edit on the Path variable so you can scroll through and check whether these settings are already there. If not, add them (path items are separated with ; on Windows): %JAVA_HOME%\bin\ and the physical path to the bin directory that holds the JDK executables. In my case this is C:\Program Files\Java\jdk1.7.0_01\bin\

Test the Java installation by going to a command prompt (click the Windows button in your taskbar and type cmd in the search box). At the command prompt type:
javac -version
If the computer responds with a version number, the JDK is set up properly. If it says that javac is not found, you have a problem you need to fix before you move on.

Install Leiningen

(build tool)
Create a folder that will hold the Leiningen batch file and also a helper utility it needs. Mine is in c:\lein. To create it I went to the command prompt and typed
cd \
md lein
After that type exit to close that command prompt window. Add the path to your new folder to your path setting.

Download curl from
http://www.paehl.com/open_source/?download=curl_723_1_ssl.zip
and place curl.exe into your new folder.

Download libssl from
http://www.paehl.com/open_source/?download=libssl.zip
and place libeay32.dll and ssleay32.dll into your new folder.

Download the lein.bat file from
https://raw.github.com/technomancy/leiningen/stable/bin/lein.bat
and put it into your new folder. (My browser displayed the text, so to save it I did File >> Save Page As and then navigated to my c:\lein folder.)

Open a new command prompt and type
lein self-install

After the installation completes you can test your leiningen install by typing
lein repl
at the command prompt. It should launch the clojure repl which you can test by typing
(+ 1 2)
at the user prompt.
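If you want a little more reassurance than (+ 1 2), a few extra expressions at the same prompt report which Clojure and Java versions the REPL is actually running on. This is only a sketch of a sanity check; the values you see depend on your own install.

```clojure
;; Evaluate these at the user=> prompt started by `lein repl`.

;; Version of Clojure that Leiningen launched.
(clojure-version)

;; Version of the JVM underneath it; this should match the JDK you installed.
(System/getProperty "java.version")

;; Folder of the Java runtime in use (often the jre folder inside the JDK).
(System/getProperty "java.home")
```

If java.version reports something older than the JDK you just installed, the path settings from the JDK step are the first thing to recheck.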

Click the x in the upper right to close the command prompt.

Install Emacs

(Editor)
Download emacs from
http://ftp.gnu.org/pub/gnu/emacs/windows/emacs-23.1-bin-i386.zip
Extract the emacs-23.1 folder and put it somewhere. I just put mine in c:\

Create a folder named .emacs.d to hold plugins for Emacs. I put mine in my emacs-23.1 folder.

Create a new user environment variable called HOME. In the value, put the path to the folder that contains .emacs.d. In my case this is C:\emacs-23.1

Add the path to the folder that holds emacs.exe to your path. Mine was C:\emacs-23.1\bin

Install Clojure mode

Create a file called init.el in your .emacs.d folder and enter this text:
(add-to-list 'load-path "~/.emacs.d/")
(require 'clojure-mode)
Add the file clojure-mode.el from
https://github.com/technomancy/clojure-mode/blob/master/clojure-mode.el
to your .emacs.d directory. I found the easiest way to do this was to copy the text in the code window at that url and paste it into a new text file that I called clojure-mode.el. If you just download the webpage, you will get lots of html commands that will cause errors in emacs.

Install Swank plugin

Open a new command prompt and type
lein plugin install swank-clojure 1.3.4
Note: initially this install hung for me. When I disabled AVG LinkScanner it worked right away.

Create a new Clojure project

From the command prompt, navigate to a folder where you would like to create Clojure files. From c:\Users\Rick I typed
md projects
then
cd projects
Create the new project by typing
lein new testproj
then type
cd testproj
emacs

After Emacs loads, type
alt-x clojure-jack-in
Emacs will spend a couple of moments processing the plugin. After this, you should have a running REPL that you can test the same way that you tested the REPL from Leiningen. Type
(+ 1 2)
If you get 3, it works.

Permalink

(take 4 arnoldo-jose-muller-molina)

I met Arnoldo Jose Muller-Molina for the first time at the second Clojure Conj where he presented a fascinating talk entitled "Hacking the Human Genome using Clojure and Similarity Search". As a speaker I enjoyed his engaging style and his grasp of and ability to clearly explain deep topics. In this installment of the (take ...) series we discuss the power of Lisp in general and Clojure specifically, starting a company built on Clojure, and the current barriers to entry for Clojure in data sciences.

How did you discover Clojure?

I learned about Clojure on Hacker News. HN is the first site I open in the morning when I wake up. Someone linked a post written by a long-time Lisper saying many good things about Clojure, so I decided to give it a try.

The following three characteristics were instrumental in my decision to adopt Clojure:

  • Lisp (macro system): You can write programs that write programs.

  • JVM ecosystem: Years of programs built for the JVM and also years of effort put into the JVM itself make a very robust and complete platform. You can fully harness the power of the JVM from Clojure.

  • Many data science tools (Hadoop, Cassandra, etc.) are already built in Java. If you are doing big data projects, then it makes a lot of sense to employ a JVM-based language.

Are you pushing for wider adoption of Clojure on your team?

I am in a transition period right now. I am starting a data science company, simMachines, that will focus on providing solutions that turn data into money/time by using similarity functions, as they are easier for non-technical people to understand. I will be using Clojure for data analysis, and people who join simMachines will be using Clojure too. In addition, I will be teaching a "Big Data" Clojure course starting in January using the book The Joy of Clojure. I will make my students handle very large biological datasets with Clojure!

What are your plans for using Clojure in the context of your work?

Widely. There is a strong preference towards Perl, Python, Matlab and R. The Java language is not that popular I would say. The fact that Java itself hasn't penetrated is already something that could hinder Clojure's adoption. Perhaps we need to further work on Incanter's plotting facilities so that we reach the customization capabilities of GGPlot.

People in life sciences are very visual, and complex ideas are communicated with very elaborate illustrations, graphs and plots. People could actually choose a language based on this alone because, in the end, it helps a lot in explaining complex ideas.

What are the problems that Clojure is suited to helping solve in your field?

Clojure is exceedingly good at handling data. You can parse complex raw files in a very declarative way. You can then extract useful bits of your stream, create records and then pass them on to an arbitrarily complex pipeline of transformations. In OO languages, you need to write Iterators to traverse large files, whereas in Clojure you declaratively concatenate functions that efficiently handle your data with lazy evaluation. With Clojure, you can focus on the problem and forget about error prone tasks like iterators, while loop counters, etc. After you have processed data with Clojure, you will not want to go back to your previous programming language.
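To make the pipeline idea concrete, here is a minimal sketch of a lazy parsing pipeline. It is not taken from the interview: the file contents, field names and the helpers parse-line and high-scoring are invented for illustration, and an in-memory string stands in for a large raw file.

```clojure
(require '[clojure.string :as str])
(import '(java.io BufferedReader StringReader))

;; Invented sample data standing in for a large raw file.
(def raw "id,gene,score\n1,BRCA1,0.91\n2,TP53,0.42\n3,EGFR,0.77\n")

;; Turn one delimited line into a map (hypothetical field names).
(defn parse-line [line]
  (let [[id gene score] (str/split line #",")]
    {:id    (Long/parseLong id)
     :gene  gene
     :score (Double/parseDouble score)}))

;; A lazy pipeline: nothing is read until the result is consumed,
;; and there are no explicit iterators or loop counters.
(defn high-scoring [rdr threshold]
  (->> (line-seq rdr)                       ; lazy seq of lines
       rest                                 ; skip the header row
       (map parse-line)                     ; lines -> records
       (filter #(> (:score %) threshold))   ; keep the interesting ones
       (map :gene)))                        ; extract the useful bit

;; doall forces the lazy seq while the reader is still open.
(def genes
  (with-open [rdr (BufferedReader. (StringReader. raw))]
    (doall (high-scoring rdr 0.5))))
;; genes => ("BRCA1" "EGFR")
```

For a real file, the StringReader would be replaced by a reader over the file itself; the pipeline stays the same.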

Permalink

Concurrency with Vars

When I started using Clojure, I thought I understood what vars were. They’re globals, and they live in a namespace! In fact, vars are a powerful tool for building concurrent, parallelizable systems.

So what is a var? A var is like a global variable, and the value it holds globally is called the “global root binding”. But a var can also be overridden within a dynamic scope, provided it is declared dynamic (all vars are ^:dynamic in Clojure 1.2 and before).
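A minimal sketch of a dynamic var and its override, assuming a made-up var named *depth* and a helper current-depth. The nested bindings show the root value being shadowed and then restored:

```clojure
;; *depth* is a hypothetical var used only for illustration.
(def ^:dynamic *depth* 0)                 ; the global root binding

(defn current-depth [] *depth*)

(def observed
  [(current-depth)                        ; root value
   (binding [*depth* 1]
     [(current-depth)                     ; the most recent binding wins
      (binding [*depth* 2]
        (current-depth))                  ; a nested binding shadows the outer one
      (current-depth)])                   ; leaving the inner binding restores 1
   (current-depth)])                      ; outside all bindings: root again
;; observed => [0 [1 2 1] 0]
```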

Dynamic scoping (also known as fluid-let in some Lisps and local in Perl) causes the var to have the value that it was bound to most recently on that thread. Another way to think of this is that the value of the var is the top of a stack: every time a binding is entered it pushes the new value on the stack, and every time the binding is exited it pops the top of the stack off.

This gives us a powerful technique: implicit parameters. We can use dynamic vars to pass parameters into functions from much higher in the call stack. We can use this to direct data flow from a central point, and have that change for each thread and each binding (e.g. *out* for the print family of functions). Strategy and configuration are the main usages of dynamic vars.
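As a small sketch of the implicit-parameter idea, the function below never receives a writer, yet its output can be redirected by rebinding *out* further up the call stack. The names report and capture-report are made up for this example.

```clojure
(import 'java.io.StringWriter)

;; report takes no writer argument; print consults *out* implicitly.
(defn report [n]
  (print "processed" n "records"))

;; The caller decides where the output goes by rebinding *out*.
(defn capture-report [n]
  (let [w (StringWriter.)]
    (binding [*out* w]
      (report n))
    (str w)))

(capture-report 3)
;; => "processed 3 records"
```

This is essentially what clojure.core's with-out-str does for you.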

What if I want to return a fn which executes some deferred work, but I want the dynamic vars when I invoke that fn to be the same as in the context where it was created? Enter bound-fn and bound-fn* (the former accepting an fntail, e.g. [param1 param2] (body); the latter, a function).
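A minimal sketch of the difference, assuming a hypothetical dynamic var *user* and a helper whoami. Each fn is created inside a binding but called after the binding has been exited:

```clojure
;; *user* and whoami are hypothetical names used only for illustration.
(def ^:dynamic *user* "nobody")

(defn whoami [] *user*)

;; All three are created while *user* is bound to "alice".
(def plain-fn (binding [*user* "alice"] (fn [] (whoami))))
(def kept-fn  (binding [*user* "alice"] (bound-fn [] (whoami))))
(def kept-fn* (binding [*user* "alice"] (bound-fn* whoami)))

(plain-fn)  ;; => "nobody"  the binding is gone by the time we call it
(kept-fn)   ;; => "alice"   bound-fn captured the bindings at creation
(kept-fn*)  ;; => "alice"   same, but wrapping an existing function

;; The captured bindings also travel to a different thread.
(def from-thread
  (let [result (promise)
        t      (Thread. ^Runnable (fn [] (deliver result (kept-fn))))]
    (.start t)
    (.join t)
    @result))
;; from-thread => "alice"
```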

The above example shows how bound-fn allows you to preserve bindings when you need to run code in another thread. As I mentioned in my previous post, bound-fn plays a key role in the implementation of Futures in Clojure.

The last question I’d like to look at is why we must now declare our vars ^:dynamic when before it wasn’t necessary. If a var is dynamic, then every access must check whether the thread-local stack of values has any elements and, if so, use the top value; only if not does it retrieve the global root binding. Nondynamic vars just need to retrieve the global root binding. In the name of performance!

Ok, I lied. I also want to mention defonce. This macro essentially expands to:
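A rough sketch of that expansion, written as a hypothetical my-defonce macro so that it does not clash with the real one. It approximates the shape of the macro in clojure.core rather than quoting it:

```clojure
;; my-defonce is a stand-in name; the real macro is clojure.core/defonce.
(defmacro my-defonce [name expr]
  `(let [v# (def ~name)]            ; intern the var without giving it a value
     (when-not (.hasRoot v#)        ; only if it has no root binding yet...
       (def ~name ~expr))))         ; ...evaluate expr and install it

(my-defonce answer 42)
;; The second form's expression is never evaluated, so nothing is thrown.
(my-defonce answer (throw (Exception. "never evaluated")))

answer ;; => 42
```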

Not that exciting.

Permalink

Leaving aside Java for Clojure

…or shall I say “Leaving aside object-oriented programming (in Java) for functional programming (with Clojure)”?

I seem to be getting into functional programming with Clojure steadily, and I’m serious about getting it under my belt. I find myself questioning all I have learnt so far about object-oriented programming with Java and am quite often treading on people’s toes, especially in the Java community in Poland (where I’m most active).

I don’t question Java features or syntax, or the way Java programmers see things and, moreover, use Java for everything they design. I don’t even compare Java to Clojure (I wouldn’t be able to and could cause more damage than anyone could afford to accept). What I’m doing is asking questions about the purpose of using Java and its tools and frameworks in a given context – an IDE, design patterns, code-compile-debug-run cycle and such.

I think the main reason is that I began noticing things which I hadn’t been able to before Clojure.

It’s also part of my learning process, in which I think I should leave aside the emotional baggage about Java I’ve been carrying around with me for years. I must admit I learnt how to design applications with and in Java, and it took me years to grasp all the concepts which ultimately turned me into a seasoned Java specialist (a mixture of a programmer, an application designer and an architect).

I’m way too far from claiming I know how to effectively develop Java applications, but I don’t think it would take me months to learn and get accustomed to it (unless I already have).

I’ve already gone through a couple of books about Clojure (see my take on The Joy of Clojure, Practical Clojure and Programming Clojure and vote for them should they please you) and it turns out the reading list is not getting shorter any time soon (see Clojure – Grundlagen, Concurrent Programming, Java, and Clojure in Action and, to be released soonish, Clojure Programming). It turns out that all the people who know Clojure well enough have already written a book about the language or are about to do so. Keeping up with all that reading is a hectic activity. Not an easy task after all, is it?

So, I’ve immersed myself in reading the books and in the meantime am trying to find a place for my new skill – programming functionally with Clojure. And, honestly, it’s not an easy task at all. Not after so many years with Java.

But I’m not giving up. Quite the contrary. I may have found a way out – I’ll be developing simple applications around Web development which I used to cover with various Java frameworks like Apache Wicket, Seam Framework, JavaServer Faces (JSF) or recently Grails and some others.

The idea is to follow the path many Java programmers take when they start developing their object-orientation with Java EE. It’s not only about Java Servlet, but a layer atop, be it the aforementioned JSF or Grails. I’m not going to build yet another framework for Web development (which I don’t understand so well and don’t have time for), but am going to have a bunch of very simple examples of what I used to cover with Java that should ultimately help me to present the goodies offered by Clojure.

I wish I could also be working with someone interested in learning Scala or JRuby this way. I believe it could help me have another view on a problem with the other language’s solution which would eventually lead me to find the right one in Clojure. Ping me if you’re interested.

Should you have an idea for a very short demo with Clojure, I’d be happy to hear it. Even if it’s already handled by a library/framework in Clojure, I’m up for doing it again on my own, hoping to learn Clojure better (and, when the sources of the existing solution are available, I will be able to have a look at them).

I’m thinking aloud and therefore what I wrote may not be useable at all and won’t ever be. You’ve been warned.

Permalink

ClojureDocs Android App

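I used the Spring Festival holiday to write an Android app that lets you search the Clojure API on ClojureDocs.org and browse documentation, source code and community-contributed code examples. ClojureDocs played a big part in my learning Clojure, so I figured the site should be useful to a lot of people.

I had no time to learn the tedious details of the Android platform, but fortunately there are frameworks like Phonegap that turn a web app into a native app and provide APIs for accessing the device. Programs developed with Phonegap can also be ported directly to the iPhone platform. ClojureDocs Android runs inside Phonegap.

Home page:

Search screen

API function screen

You can get the code and a signed apk from GitHub: https://github.com/sunng87/clojuredocs-android

Known issue: the Phonegap app crashes when the screen is rotated. This has been reproduced on 2.3 and 3.2, and the exact cause is not yet clear. (Edit 20120127: Fixed in 1.0.4)

Pull requests are welcome.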

Permalink

Clojure is one answer

My last post was a link to a video talking about the challenges of many-core computing. Today I am linking to another video from Channel 9. This one is a discussion with Rich Hickey about Clojure. The topics build on one another: introducing Clojure, why Clojure is a lisp, functional programming, lists and vectors, persistent data structures, identities and concurrent programming. I recommend the whole video, but if you just want to jump to the section on concurrency, it starts at 37:15.

Permalink

MetaWeblog API with Clojure

I've added a recipe for implementing a basic MetaWeblog server endpoint in Clojure to the necessary-evil github wiki. It's bare bones and uses a dummy store that does the bare minimum. Because so much of the detail of MetaWeblog is dependent on the backend and model you use, it's not much more than the public interface.

Permalink

Compiling and loading in ClojureCLR

Wherein I document environment variables and other factors influencing compiling and loading files in ClojureCLR and how ClojureCLR differs from Clojure in this regard.

Compiler variables

During AOT-compilation, the following vars are consulted to control aspects of the compilation process:

Var | doc says
*compile-path* | Specifies the directory where 'compile' will write out .class files. This directory must be in the classpath for 'compile' to work. Defaults to "classes"
*unchecked-math* | While bound to true, compilations of +, -, *, inc, dec and the coercions will be done without overflow checks. Default: false.
*warn-on-reflection* | When set to true, the compiler will emit warnings when reflection is needed to resolve Java method calls or field accesses. Defaults to false.

If you compile by invoking the compile function, such as from a REPL, you will have had a chance to set these vars to appropriate values. However, when compiling from the command line by running Clojure.Compile.exe, you do not have a chance to run Clojure code to initialize these vars. Instead, you can set environment variables to initialize these vars prior to compilation.
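For the REPL route, a minimal sketch of setting these vars around a call to compile might look like the following. The helper name compile-lib and the chosen values are assumptions for illustration, and the actual call is left commented out because it needs a real a/b/c.clj on the load path.

```clojure
;; compile-lib is a hypothetical helper, not part of Clojure or ClojureCLR.
(defn compile-lib
  "Compiles lib with explicit settings for the three compiler vars."
  [lib]
  (binding [*compile-path*       "classes"  ; where output is written
            *unchecked-math*     false      ; keep overflow checks
            *warn-on-reflection* true]      ; report reflective calls
    (compile lib)))

;; (compile-lib 'a.b.c)   ; requires a/b/c.clj to be findable, so not run here
```

From the command line there is no place to run a binding form like this, which is the gap the environment variables fill.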

The same is true for Clojure. In fact, ClojureCLR and Clojure used the same environment variables for these variables until just recently. Starting with the 1.4.0-alpha5 release (already in the master branch), ClojureCLR has changed the environment variable names to be strictly POSIX-compliant. This is due to problems with periods in environment variable names in Cygwin's bash -- see this thread for more information. Here are the names:

Clojure & older ClojureCLR | new in ClojureCLR
clojure.compile.path | CLOJURE_COMPILE_PATH
clojure.compile.unchecked-math | CLOJURE_COMPILE_UNCHECKED_MATH
clojure.compile.warn-on-reflection | CLOJURE_COMPILE_WARN_ON_REFLECTION

BTW, ClojureCLR defaults *compile-path* to ".".  "classes" didn't seem to make sense given that ClojureCLR creates assemblies.

Locating files

For identifying libraries for loading, Clojure relates the symbol naming the library to a Java package name and uses Java's mapping of package name to a classpath-relative path. For example, evaluating (compile 'a.b.c) causes Clojure to look for a file a/b/c.clj relative to some root listed on the classpath.  The result of the compilation will be a set of classfiles, written to classes/a/b/c.

ClojureCLR follows Clojure in mapping dotted symbol names to relative paths.  Not having classpaths, ClojureCLR instead uses the value of the environment variable CLOJURE_LOAD_PATH to supply roots for the file probes. In addition, it will look (first) in the current directory and directory of the entry assembly.

The same holds for load, use, require and other lib-loading functions.
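The mapping from lib symbol to file probe can be sketched in a few lines. The function names here are made up, the hyphen-to-underscore step is what JVM Clojure does and is assumed to carry over to ClojureCLR, and the entry-assembly directory is left out for brevity.

```clojure
(require '[clojure.string :as str])

;; Hypothetical helper: 'a.b.c -> "a/b/c.clj"
(defn lib->relative-path [lib]
  (-> (name lib)
      (str/replace "-" "_")   ; assumption: same munging as JVM Clojure
      (str/replace "." "/")
      (str ".clj")))

;; Hypothetical helper: the current directory is probed first, then each
;; root taken from CLOJURE_LOAD_PATH, in order.
(defn probe-candidates [lib load-path-roots]
  (for [root (concat ["."] load-path-roots)]
    (str root "/" (lib->relative-path lib))))

(probe-candidates 'a.b.c ["C:/libs" "C:/more"])
;; => ("./a/b/c.clj" "C:/libs/a/b/c.clj" "C:/more/a/b/c.clj")
```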

Assembly output

The Clojure compiler outputs (many) class files.  The ClojureCLR compiler outputs (not as many) assemblies.  All classes resulting from (compile 'a.b.c) will go into an assembly named a.b.c.clj.dll located in *compile-path*.

When evaluating (load "a/b/c"),  ClojureCLR will look for both <AppDomain.CurrentDomain.BaseDirectory>\a.b.c.clj.dll and <any_load_path_root>\a\b\c.clj, and load the assembly if it exists and has a timestamp newer than the .clj file (if it exists).  At the moment the same set of roots (as named above) is used for assemblies and source code.  

AppDomain.CurrentDomain.BaseDirectory is used as the root for ClojureCLR assembly probes as that is also the CLR's root for resolving assembly references.  
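The choice between assembly and source described above can be modelled as a small pure function. This is a sketch of the stated rule only, not ClojureCLR's actual code; the names are invented, and timestamps are passed in as plain numbers (nil meaning the file does not exist) so the logic does not depend on any file API.

```clojure
;; Hypothetical helper: 'a.b.c -> "a.b.c.clj.dll"
(defn lib->assembly-name [lib]
  (str (name lib) ".clj.dll"))

;; Hypothetical helper implementing the rule in the text: load the assembly
;; if it exists and is newer than the .clj file (if that exists).
(defn choose-load-target [assembly-time source-time]
  (cond
    (and assembly-time
         (or (nil? source-time)
             (> assembly-time source-time))) :assembly
    source-time                              :source
    :else                                    :not-found))

(choose-load-target 200 100) ;; => :assembly   compiled after the last edit
(choose-load-target 100 200) ;; => :source     source changed since compiling
(choose-load-target 100 nil) ;; => :assembly   no source shipped
(choose-load-target nil nil) ;; => :not-found
```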

Too many assemblies

Each file loaded during compilation will go into its own assembly.  I find this terribly inelegant.  The distribution for ClojureCLR itself needs Clojure.Main.exe, Clojure.Compile.exe, and the DLR support assemblies, of course, but also thirty-plus assemblies resulting from compiling the Clojure source that defines the initial environment. The pprint lib alone contributes eight assemblies.  They are not really independent.   Conceivably that code all could go into one assembly.  

I've not been able to think of a way to make this work. I know that the eight files making up pprint are related. They get compiled because the main pprint file loads each of them, and loading a file while compiling causes that file to be compiled also. I could very easily write the compiler to output the code into the same assembly as the parent. However, pprint could load support code that should not be part of its assembly, that should have its own assembly. In fact, it does; pprint loads clojure.walk. It happens to do this with a :use clause in its ns form, but it doesn't have to. Without a mechanism in Clojure that allows us to distinguish these uses of load, I'm afraid we're stuck with some inelegance.

Permalink

Porting effort for Clojure contrib libs

Looking for Clojure contrib lib projects to port to ClojureCLR?

I looked at the most popular libs on https://github.com/clojure, the official libs of the clojure project.  I defined popularity by the number of watchers, lacking a better criterion.  Here are the top projects sorted by number of watchers when I looked recently.  Ignoring those in single digits and all java.* projects, here they are:

Watchers | Project
129 | core.logic
69 | core.match
60 | tools.nrepl
37 | tools.cli
36 | data.finger-tree
35 | tools.logging
32 | core.unify
28 | data.json
23 | test.generative
20 | core.cache
19 | core.memoize
18 | algo.monads
15 | data.xml
11 | test.benchmark
10 | core.incubator
10 | data.csv
10 | tools.macro

There are some fairly trivial edits that are required in porting most libs.  These include:

  1. Substituting an appropriate CLR exception class.  For example, IllegalArgumentException becomes ArgumentException.  If a throw uses Exception, that will work as is.
  2. Substituting interop method names.  For example, toString becomes ToString, hashCode becomes GetHashCode, etc.  Most String methods and some I/O methods just need capitalization.  BTW, ClojureCLR preserves case on most clojure.lang class method names so they don't need to be changed.  (You're welcome.)  Also, method names on protocols won't need to be changed.

I'll refer to these kinds of changes below as the usual.
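As a small illustration of the usual, here is a made-up function before and after such edits. The JVM version is live code; the CLR version is shown as a comment since it will not load on the JVM, and it assumes ArgumentException resolves without an explicit import.

```clojure
;; JVM original (describe is a hypothetical function for illustration).
(defn describe [x]
  (when (nil? x)
    (throw (IllegalArgumentException. "x must not be nil")))
  (str (.toString x) " / " (.hashCode x)))

;; After the usual edits for ClojureCLR:
;;
;; (defn describe [x]
;;   (when (nil? x)
;;     (throw (ArgumentException. "x must not be nil")))   ; exception class swapped
;;   (str (.ToString x) " / " (.GetHashCode x)))           ; method names capitalized/renamed
```

Nothing about the logic changes; only the exception class and the interop method names do.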

I did a quick scan of the source of each project to estimate the effort required to port the project to ClojureCLR. In the order given above, here are some comments on each.

core.logic: This is one of the larger projects.  The usual, and not that much of it.  The only thing I saw that might take a little more investigation is that the deftype Pair implements java.util.Map$Entry.  (See below for more.)  Easy. (Unless it requires actual thought, in which case you'd have to understand the code, and that would make it a Challenge.)

core.match:  Another large project.  The usual, and not much of it.  The bean-match function will require adaptation to CLR classes and the regular expression matcher will need to be examined -- JVM vs CLR regexes always requires a look.  Of most concern is the deftype MapPattern that mentions java.util.Map.  The question is always dealing with IDictionary and IDictionary<K,V> -- support for arbitrary generics is always tricky.  Probably Easy, with the same caveat as core.logic.

tools.nrepl: This is likely to be tricky.  There are some Java classes that will have to be ported.  Of greater concern is the amount of low-level I/O on sockets.  At best, a Medium project, likely a Challenge.  Given that this project is being redesigned, it might be wise to wait for 2.0 and then put in the effort.

tools.cli: The usual, and not much of it.  There is a test that uses an Integer method.  Trivial.

data.finger-tree:  The usual.  The only concern is the mention of java.util.Set.  There is no System.Collections.ISet, only System.Collections.Generic.ISet<T>, so some thought will be required. At worst, Medium; more likely Easy.

tools.logging: This will take some work because adapters for .Net logging tools will have to be developed.  One might consider log4net, ELMAH, NLog.  The good news is that the code is designed to plug different adapters into its framework, so developing new adapters should be easy, requiring mostly a decent knowledge of the target logging framework.  Most of the tests will have to be rewritten.  Medium, probably fun.

core.unify: The usual.  The same concern about java.util.Map mentioned for core.match.  I'm guessing this is trivial here.  Easy.

data.json: We know exactly how much work this will take.  See Porting libs to ClojureCLR: an example.

test.generative: Needs tools.namespace.  That didn't make the popularity cut, but it should be barely Medium to port, mostly due to the need to think a little about the I/O interop.  In test.generative, there are some library calls, to Random, Math.* methods, system time, etc., that will take a little more work than just the usual.  Barely Medium.

core.cache: A moment's thought about replacing java.lang.Iterable in the definition of defcache.  Otherwise, just the usual.  Easy.

core.memoize: Needs core.cache.  Might work as-is!  Trivial.

algo.monads: Might work as-is!  Trivial.  Hey, when was the last time you saw 'trivial' and 'monads' in such proximity?

data.xml:  The README notes that it is not yet ready for use.  Really, this should be called java.xml because of its dependence on org.xml.sax, javax.xml.parsers, etc.  This will require a major rewrite.  Until this is complete, I can't say how hard it will be.

test.benchmark: Looks straightforward.  Easy.

core.incubator: The toughest thing is reference to java.util.Map (see above).  Trivial.

data.csv: The I/O will take some time, but at worst a Medium.  A very Easy Medium at that.

tools.macro: Appears to be Trivial.

So, what are you waiting for?  There are plenty of easy ones to get started with and a few more challenging ones.  Whatever you pick, you'll have a chance to read some good Clojure code, always a worthwhile exercise.

Where are the hard ones, you ask?  They certainly exist, just not among the official contrib libs.  There are plenty of other Clojure projects floating around that will require significant effort.

Port a lib today!

A note on java.util.Map$Entry:  clojure.lang.IMapEntry extends java.util.Map$Entry on the JVM. ClojureCLR could not do that because the equivalent to Map$Entry, System.Collections.DictionaryEntry, is a struct and can't be subclassed. Also, we have the problem with the generic System.Collections.Generic.KeyValuePair<TKey,TValue>. I shudder when I see Map$Entry; this is a sign that real thinking will be required.

Permalink

Next Clojure Steps

Now that the Intro to Clojure course is over, here are the next steps for me. First, I have two books I’m going to read, and then I need to keep coding.

One of the books is Peter Seibel’s Practical Common Lisp and the other is Doug Hoyte’s Let Over Lambda. My reason for wanting to read Peter Seibel’s book is that it is just a plain good read. My goal in reading it is to take in more of the philosophy of Lisp dialects in general, not just Clojure.

Let Over Lambda is important to me for understanding further the why of macros, not just the how. There are some good explanations in the Clojure texts I already have, but Let Over Lambda is all about macros and is boldly written, from the few excerpts I have seen.

And finally, I am going to start using Clojure as a scripting language, using its Java roots to create stand-alone “main” programs that perform singular tasks. The area will be water use reports, and we have tons of data to process. To me Clojure is just cleaner when processing delimited data (.csv for example), and remapping it to create new data.


Permalink

Cascalog Testing 2.0

22 Jan 2012 - San Francisco


A few months ago I announced Midje-Cascalog, my layer of Midje testing macros over the Cascalog MapReduce DSL. These allow you to write tests for your Cascalog jobs in a style that mimics Cascalog's own query execution syntax. In this post I discuss midje-cascalog's 0.4.0 release, which brings tighter Midje integration and a number of new ways to write tests. I'll start with a refresher on the old syntax before debuting the new. If you're eager, add the following to your project.clj:

[midje-cascalog "0.4.0"]

Midje-Cascalog Refresher

Take the following Cascalog query:

(use 'cascalog.api)

(let [src [["word"]]]
  (?<- (stdout)
       [?out-word]
       (src ?word)
       (str ?word " up!" :> ?out-word)))

Executing this code at the repl prints a single tuple with the string word up! to standard out.

How would you go about testing that this is true? With midje-cascalog, you would swap out the ?<- form for its testing equivalent: fact?<-. Here's the same Cascalog test alongside a typical Midje test:

(let [src [["word"]]]
  (fact?<- [["word up!"]]
           [?out-word]
           (src ?word)
           (str ?word " up!" :> ?out-word)))

(fact "+ should add two numbers."
  (+ 2 2) => 4)

I find that fact?<- and fact?- macros can be a bit confusing when you start mixing Cascalog and Midje tests, as they break the Midje pattern of <thing-to-test> => <expected-thing>. The syntax updates fix all of this with a set of checker functions that mimic Midje's excellent set of collection checkers.

The "produces" checker

Midje-cascalog 0.4.0 introduces the produces function, mirroring Midje's just. Let's define a source of tuples and a query to test.

(use 'cascalog.api)
(require '[cascalog.ops :as c])

(def src
  [[1 2] [1 3]
   [3 4] [3 6]
   [5 2] [5 9]])

;; adds the values in each input tuple, sorts the output and returns
;; 2-tuples of the first number and the sum. [1 2] becomes [1 3], for
;; example.
(def query
  (<- [?x ?sum]
      (src ?x ?y)
      (:sort ?x)
      (c/sum ?y :> ?sum)))  

You can think of a query as a set of tuples waiting to be generated (through query execution). With Midje, you test sets using the just checker:

(facts
  [1 2 3] => (just [1 2 3])    ;; true
  [1 2 3] => (just [1 2 3 4])) ;; false

The cascalog analog to just is the produces checker. produces works like just, but against queries instead of bare collections. Executing the following test shows that the query produces the expected set of pairs, in any order:

(facts
  query => (produces [[3 10] [1 5] [5 11]])  ;; true
  query => (produces [[1 5] [3 10] [5 11]])) ;; true  

You can read this test as saying "query, when executed, produces [3 10], [1 5] and [5 11]." You can also check that a query doesn't produce a set of tuples by swapping out => for =not=>:

(fact
  query =not=> (produces [["string!" 11] [1 5] [5 11]])) ;; true

Using the :in-order keyword after the expected tuple sequence forces the test to respect ordering:

(facts    
  query =not=> (produces [[3 10] [5 11] [1 5]] :in-order) ;; true
  query => (produces [[1 5] [3 10] [5 11]] :in-order))    ;; true

(:in-order is really only helpful in cases where output is sorted, like our query above.)

produces-some

The produces-some checker tests that a query's output contains a subset of tuples:

(fact
  query => (produces-some [[5 11] [1 5]])) ;; true

Note that the behavior of produces-some is similar to the behavior of Midje's contains collection checker.

As with produces, you can use the :in-order keyword to force produces-some to respect ordering. Gaps between tuples are okay.

(facts
  query =not=> (produces-some [[5 11] [1 5]] :in-order) ;; true
  query => (produces-some [[1 5] [5 11]] :in-order))    ;; true

Adding the :no-gaps keyword introduces the constraint that tuples must also be contiguous:

(facts    
  query =not=> (produces-some [[1 5] [5 11]] :in-order :no-gaps) ;; true
  query => (produces-some [[1 5] [3 10]] :in-order :no-gaps))    ;; true

produces-prefix and produces-suffix

produces-prefix mimics the has-prefix collection checker by checking that some set of tuples is produced at the beginning of the query's output. produces-prefix implicitly assumes that tuples will be produced in order with no gaps:

(facts    
  query => (produces-prefix [[1 5]])         ;; true
  query => (produces-prefix [[1 5] [3 10]])) ;; true

Similarly, produces-suffix mimics the has-suffix collection checker by checking that the supplied set of tuples is produced at the tail end of a query:

(facts
  query => (produces-suffix [[5 11]])) ;; true

log-level keywords

In addition to the keyword options supported above, every one of these checkers supports an optional logging-level keyword. For example, the following two facts are equivalent, but the second one produces :info level logging when it runs:

(facts
  query => (produces-suffix [[5 11]])        ;; true
  query => (produces-suffix [[5 11]] :info)) ;; true

Log level keywords can be useful when debugging tests, as errors will often only appear in the logging output. Currently supported keywords are :off (the default), :fatal, :warn, :info and :debug. The log level needs to be the first keyword argument if you supply multiple.

wrap-checker

The real power of the 0.4.0 update is the way in which the previous query checkers were defined. Each of the above checkers mimics the behavior of one of Midje's built-in collection checkers with slightly different keyword arguments. This makes sense if you think of a query as a collection of tuples waiting to be produced (by query execution). The above checkers will get you quite a ways, but what if you want to test a query against some other Midje collection checker?

The answer is wrap-checker. wrap-checker is a higher-order function that accepts a Midje collection checker and wraps it up, turning it into a Cascalog query checker. I'll demonstrate the power of this function by wrapping Midje's has checker.

has is a powerful way to run functions across every value in some sequence:

(fact
  [1 3 5 7 9] => (has every? odd?) ;; true
  [1 3 5 6] => (has some even?))   ;; true

If you try to use has against a query it will fail, as it expects to be tested against a sequence, not an unexecuted query. Here's how to get around this:

(defn odd-tuple? [tuple]
  (odd? (first tuple)))

(defn even-tuple? [tuple]
  (even? (first tuple)))

(def has-tuples
  (wrap-checker has))

(def new-query
  (let [src [[1] [3] [5]]]
    (<- [?x] (src ?x))))

(fact
  new-query     => (has-tuples every? odd-tuple?) ;; true
  new-query =not=> (has-tuples some even-tuple?)) ;; true  

has-tuples will support log-level keywords like any of the predefined query collection checkers.
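If the idea of wrapping a checker feels abstract, the following plain-Clojure sketch shows the shape of it. Everything here is a hypothetical model: a "query" is just a zero-argument function that returns tuples, and `has-sketch` and `wrap-checker-sketch` are illustrative names, not the Midje or midje-cascalog implementations.

```clojure
(defn has-sketch
  "Builds a collection checker from a quantifier (every?, some) and a predicate."
  [quantifier pred]
  (fn [coll] (boolean (quantifier pred coll))))

(defn wrap-checker-sketch
  "Lifts a collection-checker factory into one whose checkers execute
  the query first and then test the tuples it produced."
  [checker-factory]
  (fn [& args]
    (let [checker (apply checker-factory args)]
      (fn [query] (checker (query))))))

(defn odd-tuple? [tuple]
  (odd? (first tuple)))

(defn even-tuple? [tuple]
  (even? (first tuple)))

;; The wrapped checker and a pretend query that yields three tuples.
(def has-tuples-sketch (wrap-checker-sketch has-sketch))
(def new-query-sketch (fn [] [[1] [3] [5]]))
```

The point of the design is that the collection checker never needs to know about queries; only the wrapper knows how to turn a query into tuples.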

A few more examples:

(defn id-query [src]
  (<- [?x] (src ?x)))

(let [one-of-tuples (wrap-checker one-of)
      two-of-tuples (wrap-checker two-of)
      src [[1] [3] [4]]]
  (facts
    src            => (two-of odd-tuple?)           ;; true
    src            => (one-of even-tuple?)          ;; true
    (id-query src) => (two-of-tuples odd-tuple?)    ;; true
    (id-query src) => (one-of-tuples even-tuple?))) ;; true  

Backwards Compatibility

All of the collection checkers discussed above can be used with the fact?<- and fact?- macros:

(fact?<- (produces-some [[5 11] [1 5]] :in-order)
         [?x ?sum]
         (src ?x ?y)
         (:sort ?x)
         (c/sum ?y :> ?sum)) ;; true

fact?<- and fact?- are also compatible with all of Midje's unwrapped collection checkers, as discussed here.

Conclusion

Midje is an astonishingly good testing framework; I'm continually surprised by how well its idioms and conventions satisfy Cascalog's needs. In my next post here I'll go over some of the more subtle details of the wrap-checker function. For the curious, here's the code.

If you'd like more information or additional features, please add your thoughts to the midje-cascalog GitHub issues page, or let me know in the comments below (or on Twitter! I'm @sritchie09.)

Permalink

HTML5 and CSS3 seem to have caught my attention

I seem to be getting into the latest developments on the browser side, where Clojure sits with ClojureScript. During a recent walk with my 3.5-month-old son I was listening to ThinkRelevance: The Podcast – Episode 003 – Brenton Ashworth on ClojureScript One.

For me it was a two-fold experience. Firstly, it was a way to learn real English, how it’s used and pronounced. More importantly, I tuned in to listen to the people who designed and developed ClojureScript One to help others get up to speed with the problem Brenton and Craig had first been bothered with: developing highly interactive, JavaScript-rich web pages or, as they put it on the website, reducing “the complexity of web development by allowing you to write applications using one language to unify development across the client and the server.” I haven’t tried it out yet, but I’m into it wholeheartedly (I’m mentally ready to give it a go). I’d like to see what it is like to work in a REPL “to make changes to an application in real time”. It must be a breathtaking experience, and with my wife and kids away on holiday I’m not going to wait any longer. Have a go yourself and visit the ClojureScript One web site for more up-to-date information. I believe you’ll enjoy it. Drop me an email or comment on this blog entry if you fancy watching a screencast about it.

Along these lines, I’ve recently been noticing a lot of HTML5 and CSS3 features in the websites people suggest I visit, and it is having an effect on my self-development plans. I seem to be into HTML5/CSS3 and consider it a new toy I fancy playing with.

My eyes are wide open seeing all the goodies one can build with HTML5/CSS3 with not much time spent. Just a couple keystrokes of sort of declarative programming and a website looks truly overwhelming. They’re so powerful, they nearly blew my mind. I don’t think real Java programmers will face troubles trying it out themselves once they have grasped the basics (since they could understand Java, nothing should be harder :) )

The more HTML5/CSS3 I see, the more often I think a browser is no longer a mere runtime for very simple, rudimentary HTML, CSS and JavaScript web pages that attract attention with nice-looking graphics. It has de facto become a sort of application server that’s already on client devices, which is where I used to dispatch my views to (with Java EE and view technologies). I was so scared to enter the realm of client-side development that I was glad to have object-orientation with Java EE do it for me.

Over the years I’ve built up an understanding in which the only viable architecture is composed of a Java EE application server that generates views, which are in turn sent over HTTP to a rendering device, i.e. a client device (mobile or not) hosting a browser. One monolithic application that’s completely built with Java EE frameworks. HTTP was the way for the client runtime to communicate with the server one.

With AJAX the way clients and servers communicate changed so a page was only partially ready for a complete display and the rest was downloaded at request. I could live with it and it hasn’t changed a lot in the way I thought about enterprise architectures yet introduced a bit of dynamicity in my static object-orientation with Java EE.

With the HTML5/CSS3 combo I’m experiencing a twist in my thinking, where HTTP is there to deliver a complete (client-side) web application that’s hosted directly in a browser, a kind of application server, and is supposed to connect to the server (the place it was downloaded from) for more data. It could be, though, that it never does so and uses different data sources instead (to ultimately create a mashup). The benefits of having such powerful runtimes, the browsers with HTML5/CSS3 support, right on client devices are enormous.

I used to think that HTML, CSS and JavaScript are for people not able to manage to develop full-blown applications that are supposed to run on a server – a Java EE application server. I was considering HTML/CSS/JavaScript for young people who can only develop PHP applications. With HTML5 and CSS3 I no longer think so. And it makes my mind suffer from a great intellectual pain to grasp all the techniques to deliver highly interactive, feature-rich applications. I do however like it greatly and am sure by the end of the year I’ll have figured out how to use it in my architectures.

Have a look at Spectacular CSS3 Hover Effect Tutorials should you feel a need to experience it yourself.

Permalink

All Your HBase Are Belong to Clojure

I’m sure you’ve heard a variation on this story before…

So I have this web crawler and it generates these super-detailed log files, which is great ‘cause then we know what it's doing but it’s also kind of bad ‘cause when someone wants to know why the crawler did this thing but not that thing I have, like, literally gajigabytes of log files and I’m using grep and awk and, well, it’s not working out. Plus what we really want is a nice web application the client can use.

I’ve never really had a good solution for this. One time I crammed this data into a big Lucene index and slapped a web interface on it. One time I turned the data into JSON and pushed it into CouchDB and slapped a web interface on that. Neither solution left me with a great feeling although both worked okay at the time.

This time I already had a Hadoop cluster up and running. I didn’t have any experience with HBase but it looked interesting, and after hunting around the internet I thought this might be the solution I had been seeking. Indeed, loading the data into HBase was fairly straightforward and HBase has been very responsive. I mean, very responsive now that I’ve structured my data in such a way that HBase can be responsive.

And that’s the thing: if you are loading literally gajigabytes of data into HBase you need to be pretty sure that it’s going to be able to answer your questions in a reasonable amount of time. Simply cramming it in there probably won’t work (indeed, that approach probably won’t work great for anything). I loaded and re-loaded a test set of twenty thousand rows until I had something that worked.

Setting Up Your Environment

Both Hadoop and HBase have pretty decent documentation that goes over installing and configuring these tools. I won’t re-hash that here, especially given the variance in setup (single node for development versus multi-node for deployment). In my experience, getting a multi-node development setup with virtual machines (i.e. with Vagrant) was problematic; even with host names properly configured I had issues with connections timing out.

I’m using Clojure 1.3; the libraries that I needed were available on Clojars but were built for 1.2. I forked these libraries and moved them to 1.3 to keep things neat for myself. If you’re using Clojure 1.3, you can clone these repositories and build your own copies.

My fork of the HBase library requires a 1.3 compatible version of the monolithic Clojure Contrib library. You can download a copy if you need it.

What is HBase Again?

HBase provides a BigTable-like database on top of Hadoop. Uh-oh, I said database! Well, it’s not at all like a SQL database; it’s more like a really big sorted map. In the simplest scenario, given a particular key HBase can quickly give you all of the values associated with that key. Because it’s sorted, getting all of the rows between one key and another is pretty fast. You can also quickly get all of the rows that begin with a particular key.

While Hadoop will let you store huge amounts of data and run jobs to analyze that data, you need to do something with the results of those jobs. Often making those results available for people to look at is good enough. In my case the client wants to be able to search for specific information (for instance, the log records produced when a particular host was crawled) and this isn’t something I could do easily with Hadoop. My solution was to store the crawl logs on my Hadoop cluster and then run a Hadoop job that would load this data into HBase. HBase will let the customer query for data on demand.

This is important to note: it’s a good solution for myself and my peculiar scenario. There are other data stores that may be a better fit for someone else. I already have a Hadoop cluster up, running and in active use. If I was already using Riak, I most likely would have concentrated my efforts on making that work instead.

An HBase table is made up of rows; each row contains any number of columns, and each of those columns belongs to a column family. Within a family, each column is identified by a qualifier and holds a value. It took a bit of thinking to get my head around what a column family was, its purpose and how it should be used. Generally speaking, items in the same column family are stored together physically, and that makes fetching all of the data in one column family for a particular row or group of rows faster. There’s a good discussion about how HBase actually stores this data in the O’Reilly HBase book. It’s definitely worth reading.

Goals: Web Application for Log Viewing

Before we move further down this path and try to figure out what data should go where, we’re going to take a brief detour and look at the web application that will be presenting the log data. In my opinion, this will help us get a handle on what data we need and, at least in terms of idealized application usage, when we will need it.

The data set I’m working with looks like this: a set of files, where each line of each file contains one web crawler transaction. Each line has the date and time of the transaction, the response from the host serving the resource, the status the indexer assigned to the result and the URL that the crawler tried to crawl. Here are two sample rows:

Timestamp            Response  Status  URL
2011-01-12-12:44:33  200       NEW     http://twitch.nervestaple.com
2011-01-12-12:45:12  200       NEW     http://twitch.nervestaple.com/index

The client wants the ability to see a summary page for each crawled host; this page should display the number of URLs for the host, how many returned content and how many had problems. They’d like to have a report page for each crawled URL that provides some summary information and a list of transactions for that URL. They want to be able to bang a URL or host name into a search box and get dropped at the right page. They say that’s everything but I can’t help but have the feeling that this really only represents everything they’ve thought of so far.

Laying Out My Log Data

Given the way that column families work, it makes sense to store data that we want to access all at once in the same column family. While it’s hard to know what data people will want, we can make some guesses. In my case, I know what my web application is going to look like and I can use that information to inform my choices.

First up, we want the ability to fetch all of the transactions for a given URL. This means that we’ll want an HBase table where the key will at least start with the URL. Each URL could be crawled multiple times, so we’ll combine the URL with the date and time crawled. It’ll look unpleasant but something like this:

http://twitch.nervestaple.com/index2011-01-12-12:45:12

HBase will store our data sorted by key, this means that when we want all of the transactions for a given URL it can jump to the first row that starts with our URL and then return each row there-after in sequence.
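To see why a sorted key makes this cheap, here is a small sketch that uses a Clojure sorted map as a stand-in for the HBase table. The table contents, `row-key` and `prefix-scan` are all made up for illustration; a real scan would go through the HBase client rather than `subseq`.

```clojure
(defn row-key
  "Builds a row key by appending the timestamp to the URL."
  [url timestamp]
  (str url timestamp))

;; Hypothetical in-memory table: a sorted map of row key to row data.
(def urls-table
  (sorted-map
   (row-key "http://twitch.nervestaple.com/about" "2011-01-12-12:50:01") {:code 200}
   (row-key "http://twitch.nervestaple.com/index" "2011-01-12-12:45:12") {:code 200}
   (row-key "http://twitch.nervestaple.com/index" "2011-01-13-09:10:00") {:code 404}
   (row-key "http://twitch.nervestaple.com/zebra" "2011-01-12-12:46:00") {:code 200}))

(defn prefix-scan
  "Jumps to the first key >= prefix, then reads rows until the prefix
  no longer matches."
  [table prefix]
  (take-while (fn [[k _]] (.startsWith ^String k prefix))
              (subseq table >= prefix)))
```

Calling (prefix-scan urls-table "http://twitch.nervestaple.com/index") walks only the two matching rows. Note that a bare prefix match would also pick up longer URLs that happen to start with the same text, so in practice a delimiter between the URL and the timestamp may be worth considering.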

For our log data, we won’t have particularly interesting column families. I chose to name my column families “request”, “transaction”, “response” and “crawler”. To follow through with our example, the data in the HBase table would look something like this:

Row            Column Family  Qualifier  Value
http://twitch  request        url        http://twitch
               request        host       twitch…
               transaction    date-time  2011-01-12-12:45:12
               response       code       200
               crawler        status     New

This layout makes it easy to pull out the rows for a particular URL. When we want to provide summary information for a specific host, this isn’t as helpful. While we could scan through all of the records for the host in question, the client would have to crunch through the returned data and calculate the summary data. This will be problematic in practice; instead we’re going to leverage another HBase feature: counters.

HBase will let us treat a column as if it were a counter; this lets us atomically increment the counter in just one call to the server. As we load data into HBase, we’ll increment these counters; this will make it easy to provide the summary data the client demands. For instance, we’ll have a table called “host-stats” with the column family “transactions”. We can use a qualifier like “total” to represent all of the transactions for that host. Some example rows might look like this:

Row                     Column Family  Qualifier  Value
twitch.nervestaple.com  transactions   total      104
bakery.somewhere.com    transactions   total      83
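As a rough mental model of how a counter column behaves, here is a sketch that uses a Clojure atom in place of the “host-stats” table. The `increment-counter!` function is hypothetical; it only mimics the create-at-zero-then-add behaviour and says nothing about how HBase does this on the server.

```clojure
(defn increment-counter!
  "Atomically adds amount to the counter stored at row/family/qualifier,
  treating a missing counter as zero, and returns the new value."
  [table row family qualifier amount]
  (get-in (swap! table update-in [row family qualifier] (fnil + 0) amount)
          [row family qualifier]))

;; Hypothetical in-memory stand-in for the "host-stats" table.
(def host-stats (atom {}))
```

Each call such as (increment-counter! host-stats "twitch.nervestaple.com" "transactions" "total" 1) bumps the total by one without reading the old value first.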

In practice it’d make sense to break this down further. We could have a family called “transactions-yearly” and use the year portion of the crawl date as our qualifier. We could have a family called “transactions-monthly” and use a combination of the month and year portions of the crawl date as the qualifier. You get the idea.

HBase will let us have as many items of data for each column family as we want, in our last example we’re using the qualifier to distinguish between each year and month combination in the “transactions-monthly” family.
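Deriving those qualifiers from the crawl timestamp could look something like the sketch below. The function name and the exact qualifier formats ("2011" and "2011-01") are my assumptions for illustration; any format that sorts sensibly would do.

```clojure
(defn counter-qualifiers
  "Derives the yearly and monthly qualifiers from a timestamp shaped
  like 2011-01-12-12:45:12."
  [timestamp]
  {:transactions-yearly  (subs timestamp 0 4)
   :transactions-monthly (subs timestamp 0 7)})
```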

Loading Data into HBase with Hadoop

There are a couple different ways to load data into HBase. Given the amount of computation we have to do (incrementing the counters based on host and URL) it makes sense to write a Hadoop map/reduce job to load in this data.

Setting Up Our Project

We’ll be using both the Clojure Hadoop and the Clojure HBase libraries. I’m going to assume that you’re using Leiningen to manage your Clojure projects. If not, you’ll want to revisit that. Go ahead and create a new project with…

lein new crawl-log-loader

Next up you’ll want to add the dependencies to your “project.clj” file. If you built the libraries from my repositories, you’ll want to add the following:

(defproject crawl-log-loader "1.0-SNAPSHOT"
    :description "Load Crawler Log data"
    :dependencies [[org.clojure/clojure "1.3.0"]
                   [clojure-hbase "0.90.5-SNAPSHOT"]
                   [clojure-hadoop "1.3.3-SNAPSHOT"]
                   [org.clojure/tools.logging "0.2.3"]
                   [org.clojure/tools.cli "0.2.1"]]
    :dev-dependencies [[org.codehaus.jackson/jackson-mapper-asl "1.9.2"]
                       [org.slf4j/jcl104-over-slf4j "1.4.3"]
                       [org.slf4j/slf4j-log4j12 "1.4.3"]]
    :main crawl-log-loader.core)

We’re defining our dependencies on Clojure, the HBase library and the Hadoop library. We need the Clojure “tools.logging” library so that we can log messages while our Hadoop job is running. I like to include “tools.cli” so I can parse command line arguments without much fuss. Lastly, the Hadoop libraries depend on Jackson and SLF4J; they’ll be present on our Hadoop cluster but we’ll need them around in order to build our application.

Use, Require and Import

With the project setup, it’s time to add some code. Open up the core file (src/crawl_log_loader/core.clj), and add something like this…

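(ns crawl-log-loader.core
  (:use [clojure.repl]
        [clojure.tools.logging]
        [clojure.tools.cli])
  (:require [clojure.string :as string]
            [clojure-hadoop.gen :as gen]
            [clojure-hadoop.imports :as imp]
            [clojure-hbase.core :as hb])
  (:import [org.apache.hadoop.util Tool]
           ;; Hadoop classes referenced by the job setup and by the
           ;; map and reduce functions below
           [org.apache.hadoop.fs Path]
           [org.apache.hadoop.io Text LongWritable]
           [org.apache.hadoop.mapreduce Job MapContext ReduceContext]
           [org.apache.hadoop.mapreduce.lib.input FileInputFormat TextInputFormat]
           [org.apache.hadoop.mapreduce.lib.output FileOutputFormat TextOutputFormat]
           [org.apache.hadoop.hbase.client Increment]
           [org.apache.commons.logging LogFactory]
           [org.apache.commons.logging Log]
           [java.net URL]
           [org.apache.hadoop.hbase.util Bytes]))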

I’ve added the Clojure REPL, logging and CLI tools to make it easier to bootstrap your application and parse out command line arguments. Since we’re parsing log files, we’ll need the Clojure String library as well. After we pull in our libraries for Hadoop and HBase, we import some of the Java classes we’ll need to make those libraries work.

I haven’t mentioned it before, but HBase stores everything as an array of bytes. It has no notion of type, it just sees byte arrays. Our last import provides a utility class that makes it easier for us to convert objects like strings and numbers into byte arrays.
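To get a feel for what “everything is a byte array” means, here is a sketch of the round trips using only JDK classes. These four helper names are hypothetical and stand in for the kind of conversions the HBase utility class performs; they are not its API.

```clojure
(defn string->bytes
  "Encodes a string as UTF-8 bytes."
  [^String s]
  (.getBytes s "UTF-8"))

(defn bytes->string
  "Decodes UTF-8 bytes back into a string."
  [^bytes b]
  (String. b "UTF-8"))

(defn long->bytes
  "Encodes a long as eight big-endian bytes."
  [n]
  (.array (.putLong (java.nio.ByteBuffer/allocate 8) (long n))))

(defn bytes->long
  "Decodes eight big-endian bytes back into a long."
  [^bytes b]
  (.getLong (java.nio.ByteBuffer/wrap b)))
```

The store never knows whether eight bytes were a number or a short string; it is up to the reader to decode them the same way the writer encoded them.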

Define Our Hadoop Job

We’re going to do this a little backwards and add the function that sets up the Hadoop job next. This function will reference our map and reduce functions although we haven’t written those yet.

(defn tool-run
  "Provides the main function needed to bootstrap the Hadoop application."
  [^Tool this args-in]

  ;; define our command line flags and parse out the provided
  ;; arguments
  (let [[options args banner]
        (cli args-in
             ["-h" "--help"
              "Show usage information" :default false :flag true]
             ["-p" "--path" "HDFS path of data to consume"]
             ["-o" "--output" "HDFS path for the output report"])]

    ;; display the help message
    (if (:help options)
      (do (println banner) 0)

      ;; setup and run our job
      (do
        (doto (Job.)
          (.setJarByClass (.getClass this))
          (.setJobName "crawl-log-load")
          (.setOutputKeyClass Text)
          (.setOutputValueClass LongWritable)
          (.setMapperClass (Class/forName "crawl-log-loader.core_mapper"))
          (.setReducerClass (Class/forName "crawl-log-loader.core_reducer"))
          (.setInputFormatClass TextInputFormat)
          (.setOutputFormatClass TextOutputFormat)
          (FileInputFormat/setInputPaths (:path options))
          (FileOutputFormat/setOutputPath (Path. (:output options)))
          (.waitForCompletion true))
          0))))

This article isn’t about parsing command line arguments, but the above is a good habit to get into. We use the CLI library both to set up our arguments and to parse those arguments out into a hash-map. More information on how this library works can be found on the project’s page.

Hadoop wants our application to return a status code that indicates healthy completion of the job or exiting under an error condition. We return “0” to indicate that our job exited normally. In real life you may want to do something more clever.

If our app isn’t invoked with the “-h” or “--help” flag, we set up the Hadoop job. We create a new Job object and set a bunch of fields. Note that we set the output key and value class. The main purpose of this job is to load data into HBase but we’ll also output the number of transactions per host. This could be used for any number of things; perhaps we want to double-check the data stored in HBase.

We set the mapper and reducer classes; we’ll write those up next. We set the input and output formats; the TextInputFormat reads plain text files line-by-line, a good fit for our log input. The TextOutputFormat writes plain text files.

Mapping Function

We’ll now add our mapping function. Make sure that you add your own mapping function above the definition of your “tool-run” function.

(defn mapper-map
  "Provides the function for handling our map task. We parse the data,
  apply it to HBase and then write out the host and 1. This output is
  used to provide a summary report that details the number of URLs
  logged per host."
  [this key value ^MapContext context]

  ;; parse the data
  (let [parsed-data (parse-data value)]

    ;; apply the data to HBase
    (process-log-data parsed-data)

    ;; write our counter for our reduce task
    (.write context
            (Text. (:host parsed-data))
            (LongWritable. 1))))

This isn’t so tricky! We read in a key and a value; the key will be the byte offset of the line within the file being processed and the value will be the text of that line (a log entry). We don’t really care where in the file this entry came from, so we ignore the key. Then we parse the line of log data into a hash-map, apply that data to HBase with our “process-log-data” function (yet unwritten) and then write out data for our final report.

We’re writing out the host for the URL that was crawled in this log entry and the number 1. During the reduce phase we’ll sum the values for each host and output the total number of transactions. In fact, let’s do that right now.

Reduce Function

(defn reducer-reduce
  "Provides the function for our reduce task. We sum the values for
  each host yielding the number of URLs logged per host."
  [this key values ^ReduceContext context]

  ;; sum the values for each host
  (let [sum (reduce + (map (fn [^LongWritable value]
                             (.get value))
                           values))]

    ;; write out the total
    (.write context key (LongWritable. sum))))

Again, this function isn’t scary at all. We map over the incoming values (Hadoop wraps the number in a LongWritable instance), pull out the actual values and reduce those values into our final sum. We write out the key, which is the name of the host and the sum, the total number of transactions for this host.
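If you want to convince yourself of the logic without a cluster, the same map and reduce steps can be mimicked on plain Clojure data. This is only a local sketch: `map-phase` and `reduce-phase` are hypothetical names, and the grouping that Hadoop performs between the two phases is faked here with `group-by`.

```clojure
(defn map-phase
  "Emits a [host 1] pair for every parsed log record."
  [parsed-records]
  (map (fn [record] [(:host record) 1]) parsed-records))

(defn reduce-phase
  "Groups the pairs by host and sums the counts for each host."
  [pairs]
  (into {}
        (for [[host host-pairs] (group-by first pairs)]
          [host (reduce + (map second host-pairs))])))
```

Feeding three records, two for one host and one for another, through (reduce-phase (map-phase records)) gives a map of host to transaction count, which is the same shape as the report the real job writes out.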

We’re nearly through, the last function is the bit that applies our data to HBase.

Load Data into HBase

We need to parse out our line of log data into something easier to work with, a hash-map. My files are separated with spaces making this very easy.

(defn parse-data
  "Parses a String representing a row of data from an ESP crawler log
  into a hash-map of data."
  [text]

  ;; parse out the row of data
  (let [data (string/split (str text) #"\s+")]

    ;; return a map of data
    {:timestamp (nth data 0)
     :response (nth data 1)
     :crawler-response (nth data 2)
     :url (nth data 3)
     :host (.getHost (URL. (nth data 3)))}))

This is a bit simplistic, but you get the idea. In practice you’d probably want to be more careful and make sure the input data is valid. You’d likely want to split the date and time out or even parse it back into a real date instance.
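A slightly more defensive version might look like the sketch below. The name `parse-data-checked` and the choice to return nil for bad lines are my assumptions; you may prefer to log and count rejected lines instead.

```clojure
(require '[clojure.string :as string])
(import '[java.net URL MalformedURLException])

(defn parse-data-checked
  "Like parse-data, but returns nil when the line does not have exactly
  four fields or when the URL cannot be parsed."
  [text]
  (let [fields (string/split (string/trim (str text)) #"\s+")]
    (when (= 4 (count fields))
      (let [[timestamp response crawler-response url] fields]
        (try
          {:timestamp timestamp
           :response response
           :crawler-response crawler-response
           :url url
           :host (.getHost (URL. ^String url))}
          ;; a line with an unparseable URL is treated as invalid
          (catch MalformedURLException _ nil))))))
```

The caller can then skip nil results rather than letting one malformed line fail the whole map task.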

Lastly, we want to add this row to HBase and update some counters.

(defn process-log-data
  "Handles the processing of log data by applying the map of data to
  the proper counters and adding the correct rows to our HBase
  tables."
  [parsed-data]

  ;; add our row of data
  (hb/with-table [urls (hb/table "urls")]
    (hb/put urls
            (str (:url parsed-data)
                 (:timestamp parsed-data))
            :values [:request [:url (:url parsed-data)]]
            :values [:request [:host (:host parsed-data)]]
            :values [:transaction [:timestamp (:timestamp parsed-data)]]
            :values [:response [:http (:response parsed-data)]]
            :values [:response [:crawler (:crawler-response parsed-data)]]))

  ;; update our host stats table
  (hb/with-table [host-stats (hb/table "host-stats")]
    (.incrementColumnValue host-stats
                           (hb/to-bytes (:host parsed-data))
                           (hb/to-bytes "transactions")
                           (hb/to-bytes "total")
                           (.longValue 1))))

We’re keeping this simple so that you get an idea of how this works. We add the row of data to our “urls” HBase table, then we increment the total number of transactions for this host in the “host-stats” table. We don’t have to worry about whether there’s already a row in the “host-stats” table for this host; if there isn’t, HBase will create a new row with a value of zero and then increment it by our supplied value, 1.

Package and Deploy

Deployment is straightforward, create an “uberjar” with Leiningen and then copy that out to your Hadoop cluster. From there you can invoke the JAR with an input and output path.

lein uberjar
...watch Leiningen work...

scp crawl-log-loader-1.0-SNAPSHOT-standalone.jar hadoop1.local:/hadoop

ssh hadoop1.local
...log into your Hadoop node

And then run your job. It turns out this is trickier than you’d think. If you simply invoke the JAR with the “hadoop” command it will run on the local job runner (it won’t run distributed across the cluster) because it won’t be able to find your HBase install. To fix this, create a new folder in your project directory called “resources”. Files in this folder will be bundled up by Leiningen into your final JAR and they’ll be present on the class-path when the application is launched.

Next, create a file called “hbase-site.xml” in this folder. It should look something like this:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>hadoop1.local</value>
  </property>
</configuration>

The Zookeeper Quorum value should list all of your Zookeeper nodes; if you’re running in standalone development mode then there’s just the one. If you are running in production this should be a comma-separated list of host names. Lastly, you can use a script similar to the following to start your job.

export HBASE_HOME=/hbase
export HADOOP_HOME=/hadoop

HADOOP_CLASSPATH=`${HBASE_HOME}/bin/hbase classpath` \
  ${HADOOP_HOME}/bin/hadoop jar crawl-log-loader-1.0-SNAPSHOT....jar \
  -p hdfs://hadoop1.local:54310/data/in \
  -o hdfs://hadoop1.local:54310/data/out

This sets up environment variables that point to our HBase and Hadoop installations. We then load our application in the context of our Hadoop instance and add our HBase installation to the class path. The “-p” and “-o” flags at the end are interpreted by our application.

For Great Justice

While this isn’t production code it should be enough to get you started. We now have some data loaded and that will make it a lot easier to explore HBase and evaluate the product. Once again, I am amazed at how easy it is to work with these frameworks using Clojure! A big thanks goes out to the developers and maintainers of the Hadoop and HBase libraries that make this all so easy, as well as the developers and maintainers of Hadoop, HBase and Clojure themselves.

Permalink

Citrusleaf: an interesting (non open source) NoSQL data store

I have been using Citrusleaf for a customer (SiteScout) task. Interesting technology. Maybe because I am excessively frugal, but I almost always favor open source tools (Ruby, Clojure, Java, PostgreSQL, MongoDB, Emacs, Rails, GWT, etc., etc. that I base my businesses on). That said, I also rely on paid for software and services (IntelliJ, Rubymine, Heroku, AWS services, etc.) and it looks like Citrusleaf is a worthy tool because of its speed and scalability (which it gets from Paxos, using lots of memory, efficient multicast when possible for communication between nodes in a cluster, etc.)

Permalink

A naive Adler32 example in Clojure and Scala

Howdy,

Between administrative intricacies this week, among other things, I took the time to reproduce in both Clojure and Scala a small exercise found in Real World Haskell (RWH).
This blog entry will be very short, as I simply provide a way to implement the algorithm in each language.

The algorithm is the Adler32 checksum algorithm as presented in RWH. (You will be able to follow the link once the protest on the Wikipedia site is over.) Trying to decode the three code samples while the Wikipedia page is blacked out in protest can also be seen as an interesting exercise :).

The Adler32 algorithm was invented by Mark Adler in 1995 and is used in the zlib compression library. I see these katas as an interesting means of learning new things on a daily basis (isn't it our job to learn and understand rather than blindly use external frameworks?).
For copyright reasons I provide here my own Haskell version of the algorithm and not the one in the book:


import Data.Char (ord)

import Data.Bits (shiftL, (.&.), (.|.))

base = 65521

cumulate :: (Int, Int) -> Char -> (Int, Int)
cumulate (a, b) x =
  let a' = (a + (ord x .&. 0xff)) `mod` base
      b' = (a' + b) `mod` base
  in (a', b')


adler32 :: [Char] -> Int
adler32 xs =
  let (a, b) = foldl cumulate (1, 0) xs
  in (b `shiftL` 16) .|. a


The authors chose this algorithm on purpose to present an application of the higher-order fold function. Let's give it a try:

ghci>adler32 "Thumper is a cute rabbit"
1839204552
ghci>

That gives us meat for our tests in Scala and Clojure (I have not yet learned about QuickCheck in Haskell). Logically, in Clojure our test should look like:

(ns algorithms.test.adler32-spec
(:use algorithms.adler32)
(:use clojure.test))

(deftest checksum-with-favourit-sentence-should-produce-result
(is (= 1839204552 (checksum "Thumper is a cute rabbit"))))

that runs green for the following implementation:

(ns algorithms.adler32)

(def base 65521)

(defn cumulate [[a b] x]
  (let [a-prim (rem (+ a (bit-and x 255)) base)]
    ;; b must be reduced modulo base as well, or it outgrows 16 bits on long inputs
    [a-prim (rem (+ b a-prim) base)]))

(derive clojure.lang.LazySeq ::collection)

(defmulti checksum class)
(defmethod checksum String [data]
  (checksum (lazy-seq (.getBytes data))))
(defmethod checksum ::collection [data]
  (let [[a b] (reduce cumulate [1 0] data)]
    (bit-or (bit-shift-left b 16) a)))


where I naively used derive so that the multimethod, which dispatches on the class function, treats lazy sequences as a ::collection. My dispatching mechanism now resolves clojure.lang.LazySeq instances as children of ::collection:

algorithms.adler32=> (parents clojure.lang.LazySeq)
#{java.util.List clojure.lang.Obj clojure.lang.ISeq clojure.lang.IPending clojure.lang.Sequential :algorithms.adler32/collection}
algorithms.adler32=>

Test ok.
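As an independent cross-check of the value used in the tests, the JDK ships its own implementation in java.util.zip.Adler32. The sketch below is not part of the original kata: jdk-adler32 is a hypothetical helper name, and it assumes the text is encoded as UTF-8, which for this ASCII-only sentence gives the same bytes as the default encoding.

```clojure
(import 'java.util.zip.Adler32)

(defn jdk-adler32
  "Checksum of the UTF-8 bytes of s, computed by the JDK's Adler32 class."
  [^String s]
  (let [data  (.getBytes s "UTF-8")
        adler (Adler32.)]
    ;; update(byte[], int, int) feeds the whole array to the checksum
    (.update adler data 0 (count data))
    ;; getValue returns the checksum as a long
    (.getValue adler)))

(jdk-adler32 "Thumper is a cute rabbit")
;; => 1839204552
```

Getting the same number from a second, unrelated implementation is a cheap way to gain confidence in the hand-written one.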

Following the same reasoning in Scala, the test will be:


package com.promindis.algorithms.cheksum

import org.specs2.Specification

class Adler32Specification extends Specification { def is =
  "Adler32Specification" ^
  p^
  "checksum for input" ^
  "Should restore the expected checksum value" !e1

  def e1 =
    new DefaultAdler32()
      .checksumText("Thumper is a cute rabbit".toCharArray)
      .should(beEqualTo(1839204552))
}

leading to

package com.promindis.algorithms.cheksum

trait Adler32 {
  val base = 65521

  def rebased(value: Int) = value % base

  def cumulated(acc: (Int, Int), item: Byte): (Int, Int) = {
    val a = rebased(acc._1 + (item & 0xff))
    (a, (a + acc._2) % base)
  }

  def checksum(data: Traversable[Byte]): Int

  def checksumText(data: Traversable[Char]): Int
}

final class DefaultAdler32 extends Adler32 {

  override def checksum(data: Traversable[Byte]): Int = {
    val result = data.foldLeft((1, 0)) {cumulated(_, _)}
    (result._2 << 16) | result._1
  }

  def checksumText(data: Traversable[Char]) = {
    checksum(data.toSeq.map(_.toByte))
  }
}

Tests green :)
That's all folks (I promised it would not be long). And don't take for granted what comes from closed boxes!

Be seeing you !!! :)

Permalink

Clojure Talks at SkillsMatter

I was honoured enough to do another lightning talk at SkillsMatter.  I’m afraid I had easily the least interesting talk of the evening, as you can see from the hastily prepared slides on Google Docs.

Neale Swinnerton gave a great talk on Paredit.  I’m still trying to nail paredit, and this was a great help.  The slides are a thing of beauty, watch them in full screen mode. 

Nick Rothwell gave a lightning talk that I wish had been longer, on teaching Clojure to artists.  He’s done some amazing and thankless work on embedding Clojure in MaxMSP.  As always, he gave a great demo.

Malcolm Sparks has already written up the main talk, which was excellent.  I’m not convinced I agree with him about excluding executable code from configuration (I have form on this).  Ultimately, I believe that “configuration” is just a way we describe code properties that differ between environments.  There’s no need for it to be XML, or an INI file, or RDF.  The only requirement really is for it to be findable and editable at short notice.

Neale is doing another talk on 6th March on Clojurescript One.

Permalink

ClojureScript One

Simply amazing.

I tried to build a small project using ClojureScript and gave up, reverting back to Javascript. The workflow was awkward and information about the language was lacking. ClojureScript One aims to exemplify the power latent in ClojureScript by giving us a productive workflow from the very start. You clone the repo and use it as a starting point for your own project.

I haven't explored this new project, but it is already impressive. It sets a high bar for new projects. It includes a well-designed home page, getting started videos, and nice documentation. But, more importantly, it hints at a new age in web development in much the same way as the Blog in 15 Minutes Ruby on Rails video ushered in the rise of Ruby on the web. Watch the ClojureScript One Getting Started video to have your mind blown.

I also recommend a podcast about the goals of ClojureScript One.

LC

Permalink

Protocol Buffers with Clojure and Leiningen

This week I’ve been prototyping some data processing tools that will work across the platforms we use (Ruby, Clojure, .NET). Having not tried Protocol Buffers before I thought I’d spike it out and see how it might fit.

Protocol Buffers

The Google page obviously has a lot more detail but for anyone who’s not seen them: you define your messages in an intermediate language before compiling into your target language.

option java_package = "com.forward";
option java_outer_classname = "Data";

message Person {
  required int32 id = 1;
  required string name = 2;
  optional string email = 3;
  repeated string likes = 4;
}

There’s a Ruby library that makes it trivially easy to generate Ruby code so you can create messages as follows:

p = Person.new
p.id = 1234
p.name = "Paul"
p.email = "paul@mycompany.com"
p.likes << "Clojure"

Clojure and Leiningen

The next step was to see how these messages would interact with Clojure and Java. Fortunately, there are already a few options, and I tried out clojure-protobuf, which conveniently includes a Leiningen task for running both the Protocol Buffer compiler protoc and javac.

I added the dependency to my project.clj:

[protobuf "0.6.0-beta2"]

At the time, the protobuf library expected your .proto files to be placed in a ./proto directory under your project root. I forked to add a :proto-path so that I could pull in the files from a git submodule.

Assuming you have a proto file or two in your proto source directory, you should be able to invoke the compiler by running

$ lein protobuf compile
Compiling person.proto to /Users/paul/Work/forward/data-spike/protosrc
Compiling 1 source files to /Users/paul/Work/forward/data-spike/classes

You should now see some Java .class files in your ./classes directory.

Using clojure-protobuf to load an object from a byte array looks as follows:

(ns data-spike
  (:import [com.forward Data$Person])
  (:use [protobuf.core :only (protodef protobuf-load)]))

(def Person (protodef Data$Person))

(defn load-person [bytes]
  (protobuf-load Person bytes))

Uberjar Time

I ran into a little trouble when I came to build the command-line tool and deploy it. When building with lein uberjar, it seemed that the ./classes directory was being cleaned, making the protobuf-compiled Java classes unavailable to the application. That in turn caused the rest of the application to fail to build: I was using tools.cli with a main fn, which meant using :gen-class.

I always turn to Leiningen’s sample project.clj and saw :clean-non-project-classes. The comment mentioned it was set to false by default so that wasn’t it.

It turns out that Leiningen’s uberjar task checks a different option when determining whether to clean the project before executing: :disable-implicit-clean. I added :disable-implicit-clean true to our project.clj and all was good:

$ lein protobuf compile, uberjar
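
For reference, a project.clj carrying that option might look roughly like the sketch below. Only the protobuf dependency and :disable-implicit-clean come from the steps above; the project version, the Clojure and tools.cli versions, and the :main namespace are assumptions for illustration.

```clojure
;; project.clj (sketch; everything except the protobuf dependency and
;; :disable-implicit-clean is an assumption)
(defproject data-spike "0.1.0-SNAPSHOT"
  :dependencies [[org.clojure/clojure "1.3.0"]
                 [org.clojure/tools.cli "0.2.1"]
                 [protobuf "0.6.0-beta2"]]
  ;; namespace holding the -main fn and :gen-class (assumed name)
  :main data-spike
  ;; stop uberjar from cleaning ./classes, which holds the protobuf classes
  :disable-implicit-clean true)
```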

I wasn’t a registered user of the Leiningen mailing list (and am waiting for my question to be moderated), but it feels like uberjar should honour :clean-non-project-classes too. I’d love to submit a patch to earn myself a sticker :)

Permalink

Configuration Middleware

When you are developing a web application in Clojure, there are likely to be libraries (such as database access libs) that require configuration, such as connection information, to be provided. If the library provides a way to dynamically bind the configuration data, we can use Ring middleware to simplify our applications. I assume you already know about Ring. I’m going to use Moustache for my routes here, but the principle applies to Compojure – and presumably noir – as well. For a database library I’ll show an example using Clutch (the Clojure CouchDB API); again, this also applies to other libraries.

Let’s imagine a simple moustache app:

(def my-app
    (app
        [] (delegate home)
        [slug] (delegate page slug)))

In this example, home and page do some lookups into a CouchDB database, and return some html, for example:

(use '[ring.util.response :only [response]]
    '[com.ashafa.clutch :only [with-db get-view]])

(defn get-page-from-db
    [slug]
    (-> (with-db "example-db" (get-view "site" :page {:key slug}))
        first
        :content))

(defn get-page-from-db-2
    [slug]
    (-> (get-view "example-db" "site" :page {:key slug})
        first
        :content))

(defn page 
    [req slug]
    (response (get-page-from-db slug)))

This simplified handler and database access function (and its variant) highlight the problem: the connection information (in this case just the string "example-db") is coupled with the code that does the request. Newer Clojure programmers may try hoisting the (with-db …) above the defns, but this doesn’t work because a dynamic binding is only in effect while the body of with-db is executing, not when the functions defined inside it are called later.

Chas Emerick discusses the trade-offs behind both configuration patterns. This discussion is the precursor to the approach taken in Clutch to allow both approaches.

Ring middleware presents an answer to this problem. If we create a middleware that sets up the (with-db …) binding for us on each request, we can hoist the definition out of the data access code and specify it only once. Here is an example:

(defn clutch-with-db
  "Wraps the routes in a clutch with-db binding"
  [app database]
  (fn [req]
    (com.ashafa.clutch/with-db database (app req))))

(def my-app2
   (app 
     (clutch-with-db "example-db")

     [] (delegate home)
     [slug] (delegate page slug)))

Easy!

In addition to hoisting the configuration out of the data access code, we can now trivially serve two sub-apps that share the same app definition but access different databases, in this case two blogs: one with serious content, and another with humorous cats:

(def simple-blog (app …))

(def blogs 
    (app 
       ["funny-cats" &] (clutch-with-db simple-blog "cats-blog")
       [&] (clutch-with-db simple-blog "serious-blog")))

Finally, this also means you can write test harnesses that work off separate databases without fear.
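
The same idea can be sketched without Clutch or Moustache. In the snippet below, *db* is a hypothetical dynamic var standing in for whatever with-db binds, and wrap-db, page and the two blog handlers are made-up names; it only illustrates how a middleware establishes the binding for the duration of each request.

```clojure
;; Stand-in for the connection that a library would bind dynamically.
(def ^:dynamic *db* nil)

(defn get-page-from-db
  "Pretend lookup: reports which database would have been queried."
  [slug]
  (str *db* "/" slug))

(defn page
  "A Ring-style handler: takes a request map, returns a response map."
  [req]
  {:status 200 :headers {} :body (get-page-from-db (:uri req))})

(defn wrap-db
  "Middleware: binds *db* around every call to the wrapped handler."
  [handler database]
  (fn [req]
    (binding [*db* database]
      (handler req))))

;; One handler definition, two databases.
(def cats-blog (wrap-db page "cats-blog"))
(def serious-blog (wrap-db page "serious-blog"))

(:body (cats-blog {:uri "hello"}))    ;; => "cats-blog/hello"
(:body (serious-blog {:uri "hello"})) ;; => "serious-blog/hello"
```

The binding is only in effect while the handler runs, so anything lazy that reads from the database has to be realized before the handler returns.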

Permalink

All Your HBase Are Belong to Clojure

I’m sure you’ve heard a variation on this story before…

So I have this web crawler and it generates these super-detailed log files, which is great ‘cause then we know what it’s doing but it’s also kind of bad ‘cause when someone wants to know why the crawler did this thing but not that thing I have, like, literally gajigabytes of log files and I’m using grep and awk and, well, it’s not working out. Plus what we really want is a nice web application the client can use.

I’ve never really had a good solution for this. One time I crammed this data into a big Lucene index and slapped a web interface on it. One time I turned the data into JSON and pushed it into CouchDB and slapped a web interface on that. Neither solution left me with a great feeling although both worked okay at the time.

This time I already had a Hadoop cluster up and running, I didn’t have any experience with HBase but it looked interesting. After hunting around the internet, thought this might be the solution I had been seeking. Indeed, loading the data into HBase was fairly straightforward and HBase has been very responsive. I mean, very responsive now that I’ve structured my data in such a way that HBase can be responsive.

And that’s the thing: if you are loading literally gajigabytes of data into HBase you need to be pretty sure that it’s going to be able to answer your questions in a reasonable amount of time. Simply cramming it in there probably won’t work (indeed, that approach probably won’t work great for anything). I loaded and re-loaded a test set of twenty thousand rows until I had something that worked.

An excellent tutorial on Hadoop, HBase and Clojure!

First seen at myNoSQL, but the URL is no longer working in my Google Reader.

Permalink

My super cool workplace at Lunatech

In my 10-year programming career, Lunatech is probably the best company I have worked with.

  1. Herman Miller Aeron: Ultimate Programmer’s Chair
  2. Nerf Recon CS-6 with N-Strike Darts: For fun times.
  3. 17″ Mac Book Pro with 8GB RAM, SSD running Emacs in Full Screen: Ultimate Programmer’s Computer
  4. The Joy of Clojure: The book for Ultimate programming language
  5. mStand for MBP: For enhanced ergonomics and coolness factor

Not shown in the photo:

  • Super Awesome Colleagues
  • Projects using cutting-edge tech including: Scala, Play! Framework.
  • No Managers
  • No Managers, Seriously.
  • WORK-ON-YOUR-OWN-COOL-PROJECT-EVERY-FRIDAY – Yeah, EVERY Friday.

Like what you see and curious about what we do? Come and talk to us at Play!ground on February 3rd at Paddy Murphy’s Irish Pub, Rotterdam.


Permalink

Another Clojure Noir Site on Heroku

So I'm experimenting with migrating my dad's website to a Clojure-backed site based on noir-blog, an example blog built on the awesome Noir web framework. It was pretty easy to do once I went all CSS for formatting.  And, unless you are logged in as an admin, the site still looks exactly the same.

While I'm still a novice programmer, getting the site up and running on Noir/Heroku was a cakewalk.  It was surprising how few modifications I had to make to completely repurpose the example noir-blog engine into the site engine I needed. (How to get Noir up on Heroku: http://thecomputersarewinning.com/post/clojure-heroku-noir-mongo)

Here is what the front page looks like:

Front_page

Compare that with the actual live website: http://questforthekingdom.com

If you've ever played around with noir-blog (https://github.com/ibdknox/Noir-blog), you may notice some of its guts here in the admin page:

Admin_page

And here is the article-entry page, with the beefed-up, wysiwyg editor:

Post_page

By the way, the expected workflow is to open the editor in full screen.  Much easier to edit that way :)

As you can see, I added some additional drop downs to let you choose and edit where the article shows up in the left navigation panel.  Let me know what you think about the left navigation panel -- I tried to go with something that would be touch friendly.

Feel free to play around on it with guest & guest at: http://qftk.herokuapp.com/admin

Let me know what you think.  If anyone wants the code (my modifications were minimal), I'll try to clean the site up a bit and push it to github -- just let me know.

Update: The website has been uploaded to Github: https://github.com/johnmn3/qftk-site

Permalink

Promises and Futures in Clojure

At the moment, it’s pretty hard to find a readable explanation of what a promise or a future is. In this post, I’m going to try to explain them concisely.

Futures run some code on a newly spawned worker thread, and allow you to wait for the result of that code’s execution.

Promises are a kind of reference that can be set exactly once and allow you to wait for them to be set.

Futures take the expression that’s their argument and run it in another thread. This thread actually comes from the cached thread pool used by send-off for Agents. The IDeref interface (that’s what lets you say @future to get the value) is implemented by using the underlying Java Future’s get method.

Promises are essentially one-shot atoms that have a CountDownLatch (counting down from 1) to allow other threads to block when they try to @deref the promise until it’s been delivered.
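
A small REPL-style sketch of both, using only clojure.core functions (the var names and the sleep durations are arbitrary choices for illustration):

```clojure
;; A future runs its body on another thread; deref (@) blocks until it is done.
(def answer
  (future
    (Thread/sleep 100)
    (+ 40 2)))

@answer ;; => 42

;; A promise starts out empty; deref blocks until some thread delivers a value.
(def greeting (promise))

(future
  (Thread/sleep 100)
  (deliver greeting "hello"))

@greeting ;; => "hello"

;; realized? reports whether a value is available, without blocking.
(realized? greeting) ;; => true
```

In a standalone script, call (shutdown-agents) once all work is done; otherwise the thread pool behind future can keep the JVM alive for a while after the last form has run.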

Permalink

Reflections on a real-world Clojure application (take 2)

Last night I gave a talk at the London Clojure Users Group (LCUG) about a ‘real-world’ (16K lines-of-code) application we built in less than a year with Clojure at Deutsche Bank. I really enjoyed the event and thanks to SkillsMatter who were fantastic hosts.

There were a lot of questions during the Q&A at the end which I did my best to answer at the time. Now I’ve had some more thinking time I’d like to add a few extra comments.

If you couldn’t attend the talk you can catch it here.

Below is the original presentation in blog form (thank you Markdown!). My extra comments can be found in the epilogue – feel free to ask further questions in the comments area.

Reflections on a real-world Clojure application

Background

  • Java background, especially early J2EE circa 1999-2002
  • Test Driven Development – ran 20 courses
  • Mastering TDD helped me to write Java using values rather than objects
  • Began to write Java in a more functional way – but much more verbose!!
  • Started using Clojure at work for user web interfaces in November 2009
  • Began to attend Clojure Dojos in London
  • February 2011 – Clojure used extensively on a new application, now 16K LOCs!

The ‘main’ function

Developer bootstrap

For developers

$ mvn dependency:copy-dependencies
$ ./run

which does this :-

#!/bin/sh
echo "Starting Fandango run script..."

export PATH=$PATH:target/bin

# Set debug to nil to disable JVM debugging.
classpath='src/main/clojure:target/dependency/*'
main=src/main/clojure/com/db/mis4/fandango/main.clj

java -cp "${classpath}" clojure.main "${main}"

Then slime in with Emacs!

(Let’s look at configuration in more detail)

Configuration

Requirements of a configuration system

  • Flexibility – we should be able to add configuration where we need it
  • Distributed ownership – we shouldn’t have to know the live passwords
  • Source agnostic – we’d like to be able to use local files and centralised storage.

Candidates?

  • Java properties files
  • JSON/YAML
  • XML – tree based, schemas enforces structure rather than value
  • Databases – records for configuration are too diverse
  • RDF – graph based, queryable

Clojure as configuration?

“Protocols and file formats that are Turing-complete input languages are the worst offenders, because for them, recognizing valid or expected inputs is UNDECIDABLE: no amount of programming or testing will get it right… A Turing-complete input language destroys security for generations of users. Avoid Turing-complete input languages! ” — Corey Doctorow

So…

Be careful if you choose Clojure as your configuration format!!

‘Open Data’

All our data (application & environment configuration, report definitions, user details & entitlements, etc.) are stored as RDF statements

  • The cat sat on the mat
    • Subject: the cat
    • Predicate (also known as property): sat on
    • Object: the mat
  • Relations are at an individual level rather than at a set (ie. table) level.

  • More intro to RDF here:
    • http://www.bbc.co.uk/blogs/radiolabs/s5/linked-data/s5.html
    • http://linkeddatabook.com
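
To make the subject/predicate/object idea concrete, here is a toy sketch that models statements as plain Clojure vectors. It is not how the application stores RDF (that uses real RDF models and SPARQL); the data and the match helper are invented for illustration.

```clojure
;; Each statement is a [subject predicate object] triple.
(def statements
  #{[:the-cat :sat-on :the-mat]
    [:the-cat :likes  :fish]
    [:the-dog :sat-on :the-rug]})

(defn match
  "Returns the set of statements matching a [s p o] pattern.
  A nil in the pattern matches anything in that position."
  [pattern statements]
  (set (filter (fn [triple]
                 (every? identity
                         (map (fn [wanted actual]
                                (or (nil? wanted) (= wanted actual)))
                              pattern triple)))
               statements)))

(match [nil :sat-on nil] statements)
;; => #{[:the-cat :sat-on :the-mat] [:the-dog :sat-on :the-rug]}

(match [:the-cat nil nil] statements)
;; => #{[:the-cat :sat-on :the-mat] [:the-cat :likes :fish]}
```

Because every statement stands on its own, nothing about the cat has to be declared up front as a table or class; new predicates can simply be added as further triples.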

Our configuration system

  • RDF files (mostly Turtle format)
  • SPARQL queries
  • Uses a dynamic var: (with-config ...)
  • Delays to avoid unnecessary queries
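
A rough sketch of how a dynamic var and delays can work together follows. The names *config*, create-config and lookup, the with-config macro body, and the counter used to show when a "query" runs are all illustrative assumptions, not the application's actual code.

```clojure
(def ^:dynamic *config* nil)

(defmacro with-config
  "Evaluates body with *config* bound to config."
  [config & body]
  `(binding [*config* ~config]
     ~@body))

(defn create-config
  "Builds a map of delays. counter is an atom that is incremented
  whenever a (pretend) query actually runs."
  [counter]
  {:data-dir (delay (swap! counter inc) "/var/data")
   :log-dir  (delay (swap! counter inc) "/var/log")})

(defn lookup
  "Forces and returns the configuration value for k."
  [k]
  @(get *config* k))

(def queries-run (atom 0))

(with-config (create-config queries-run)
  (lookup :data-dir)   ; first use runs the query
  (lookup :data-dir)   ; second use reuses the cached value
  @queries-run)
;; => 1
```

The :log-dir delay is never dereferenced here, so its query never runs at all.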

Example

create-associations :-

(defn create-associations [model]
  {::directories
   (delay
    (sparql/select1-map
     model
     [:proc cmdb/host :host]
     [:proc cmdb/install-dir (as-uri (format "file://%s" (or (System/getenv "FANDANGO_INSTALL_DIR")
                                                             (System/getProperty "user.dir"))))]
     [:host a cmdb/Host]
     [:host cmdb/hostname (get-hostname)]
     [:proc cmdb/userid (System/getProperty "user.name")]
     [:proc ["http://mis4.gto.intranet.db.com/fandango/" "dataDirectory"] :data-dir]
     [:proc ["http://mis4.gto.intranet.db.com/fandango/" "logDirectory"] :log-dir]
     [:proc ["http://mis4.gto.intranet.db.com/fandango/" "workDirectory"] :work-dir]
     [:proc ["http://mis4.gto.intranet.db.com/fandango/" "pidDirectory"] :pid-dir]
     [:optional [:proc cmdb/source-dir :source-dir]]

Security

Entitlements

All users are given FOAF ‘profiles’, with added VCARD and other statements.

Given these prefixes

@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

This statement (in the configuration) gives all users a ‘Guest’ role.

foaf:Person rdfs:subClassOf <Guest> .

N-triples

Statements are then added to create users, request roles, approve or reject roles

Creating a user

<events/5afcf604-16c0-4cab-a6d1-656ed3f3420c> <time> "2011-12-25T12:00Z"^^<http://www.w3.org/2001/XMLSchema#dateTime> .
<events/5afcf604-16c0-4cab-a6d1-656ed3f3420c> rdfs:type <CreateUser> .
<events/5afcf604-16c0-4cab-a6d1-656ed3f3420c> <eventfor> <users/malcolm.sparks%40db.com> .

Request a role

<events/b5bed531-a324-4aec-9ace-2785c65a19b7> <time> "2011-12-25T14:00Z"^^<http://www.w3.org/2001/XMLSchema#dateTime> .
<events/b5bed531-a324-4aec-9ace-2785c65a19b7> rdfs:type <RequestRole> .
<events/b5bed531-a324-4aec-9ace-2785c65a19b7> <role> <Administrator> .
<events/b5bed531-a324-4aec-9ace-2785c65a19b7> <eventfor> <users/malcolm.sparks%40db.com> .

Language integrated query

Data can be queried directly from Clojure

(defn get-approved-roles-for-user [user]
  (sparql/select-map
   [(get-combined-model) (config/get-config-model)]
   [:approval a events-ns/RoleApproved]
   [:approval events-ns/time :approval-time]
   [:approval roles/approver :approver]
   [:approver foaf/name :approver-name]
   [:optional [:approver foaf/homepage :approver-homepage]]
   [:approval roles/cause :request]
   [:request roles/requester user]
   [:request events-ns/time :request-time]
   [:request roles/role :role]
   [:role rdfs/label :role-name]))

Deployment

Releasing to production

$ git clone http://github....db.com/.../fandango.git
$ git verify-tag 4.5.0
$ git checkout tags/4.5.0
$ make release

Derive the version from git!

GNU Make incantation …

describe := $(subst -, ,$(shell git describe --tags --long HEAD))
version := $(word 1,$(subst -, ,$(describe)))
release := $(shell expr 1 + $(word 2,$(describe)))
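
To see what those three lines compute, it helps to work through one value. git describe --tags --long prints the nearest tag, the number of commits since that tag, and an abbreviated commit hash, joined by hyphens. The sketch below mirrors the Make logic in Clojure; parse-describe and the sample string are hypothetical.

```clojure
(require '[clojure.string :as str])

(defn parse-describe
  "Splits `git describe --tags --long` output on hyphens, as the Make
  snippet does, and derives the version and the release number."
  [describe]
  (let [[tag commits-since] (str/split describe #"-")]
    {:version tag
     :release (inc (Integer/parseInt commits-since))}))

(parse-describe "4.5.0-0-g1a2b3c4")
;; => {:version "4.5.0", :release 1}
```

With zero commits since the 4.5.0 tag this gives release 1. Like the Make version, it assumes the tag itself contains no hyphens.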

And generate the pom.xml – ie. in Make :-

pom.xml:    pom.template.xml
            cat $< | sed -e "s/@VERSION@/$(version)/g" >$@

mvn dependency:copy-dependencies
cp -r src/ dest/

We use RPM but the principle of copying the source and dependency jars over is the same.

Installation

Installation is easy

$ rpm --dbpath /opt/privatedb -Uvh fandango-4.5.0-1-x86_64.rpm

Production bootstrap

$ fandango start

A lot more complex than the developer bootstrap.

  • Init script (from Java Service Wrapper – enhanced with roqet to read environment variables from configuration)
  • Init script generates the wrapper.conf, then calls Java Service Wrapper native executable
  • Native binary spawns JVM with 2 args clojure.main boot.clj
  • boot.clj sets up a classloader which pulls in the dependency jars
  • boot.clj hands off to main.clj, rest is as the developer bootstrap.

But source code is still copied onto the system as is.

Logging

Getting started

Logging is important because it’s what everyone expects to find.

These will get you started :-

(clojure.core/println)
(clojure.pprint/pprint)

However, as your application grows you will eventually need a more sophisticated logging system. We use Log4J and configure it with clj-logging-config.

You’ll need the following packages to do this :-

(use 'clojure.tools.logging)
(use 'clj-logging-config.log4j)

(with-logging-config)

(with-logging-config
  [:root {:level :debug
          :out (io/file workdir "job.log")}]
  ...

(with-logging-context)

For using the NDC and MDC of Log4J.

(with-logging-config
  [:root {:pattern "%d [%p] (for Customer %X{customer}) %m%n"}]
   ...

   (with-logging-context {"customer" "John Smith"}
     ...

Reflections

The Good

  • Retain the JVM
  • No class files, yippee!
  • Sliming in! EDD: Eval Driven Development!
  • Separation of value, identity, state: State is a timeline of changing values.
  • Learning time – even our DBA is now comfortable with Clojure.

The Bad

  • People are justifiably afraid of new things
  • Tooling (for those not comfortable with Emacs)
  • Java interop can bite you

The Ugly

  • Stack traces
  • Debugging

Quality versus value

“Value is what you are trying to produce, and quality is only one aspect of it, intermixed with cost, features, and other factors.” — John Carmack, http://altdevblogaday.com/2011/12/24/static-code-analysis/

cf. ‘Agile’ absolutes

  • Always write the tests first
  • Tests should always pass
  • Always fix the build before working on new features
  • Integrate continuously
  • Refactor prior to adding new features
  • Consistent code style

Our experience of Git + Clojure is prompting us to question certain assumptions.

More info

http://blog.malcolmsparks.com

Q & A

Over to you…

Epilogue

Many of the questions related to the RDF portion of my presentation. There were a lot of others; I can’t remember them all now.

How big is your team and how did it grow?

We started with 2 developers and grew to 4. Forcing Clojure on developers is unwise. I know that was tried somewhere else and most developers only used the Java interop!

Why do you use RDF for configuration rather than XML or JSON or even Clojure itself?

JSON is certainly more conventional as a configuration format (or XML in the Java world).
There isn’t a strong reason not to use Clojure itself (I had a slide warning of the dangers of Turing-complete input languages, but the point stands nevertheless). I don’t think my answer was very good last night, so here are some advantages of RDF :-

  • Meaning – RDF allows you to make logical, set-based statements about classes of what are otherwise plain name/value pairs.
  • Metadata – RDF allows you to make statements about statements. You can use metadata to label configuration values, add annotations (in multiple languages if you like), or constrain the values to some valid range or set, or say something about the nature of the property. You can do this in a very limited way with XML (perhaps with attributes) but with JSON there’s nothing built-in or idiomatic.
  • Mergeability – RDF allows you to source statements from a wide variety of sources and merge the models together, whereas there’s nothing built-in or idiomatic in XML or JSON. In tree formats config statements have to group inside each other in a single hierarchy – designing this hierarchy is a job in itself. Graphs are more flexible since nodes can exist in multiple hierarchies if needs be.
  • Inference – in RDF, having some data allows you to infer other data which you would otherwise have to state explicitly. This has the potential to reduce data discrepancies. For example, given a database name, listener host and port you can ‘infer’ a database connection string.
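
Mergeability and inference can be illustrated with a toy model in which statements are [subject predicate object] vectors held in sets. This is only a sketch, not real RDF tooling; the property names, the infer-connection-string rule and the jdbc:example URL scheme are all made up.

```clojure
(require '[clojure.set :as set])

;; Statements from two independent sources.
(def from-file    #{[:reports-db :db/name "reports"]
                    [:reports-db :db/host "dbhost01"]})
(def from-central #{[:reports-db :db/port 1521]})

;; Merging graphs is just set union; no hierarchy has to be agreed first.
(def model (set/union from-file from-central))

(defn property
  "Returns the object of the first statement about subject with predicate p."
  [model subject p]
  (some (fn [[s pred o]]
          (when (and (= s subject) (= pred p)) o))
        model))

(defn infer-connection-string
  "Derives a new statement from name, host and port if all three are known."
  [model subject]
  (let [db-name (property model subject :db/name)
        host    (property model subject :db/host)
        port    (property model subject :db/port)]
    (when (and db-name host port)
      [subject :db/connection-string
       (str "jdbc:example://" host ":" port "/" db-name)])))

(infer-connection-string model :reports-db)
;; => [:reports-db :db/connection-string "jdbc:example://dbhost01:1521/reports"]

(infer-connection-string from-file :reports-db)
;; => nil (the port is only known after merging)
```

The derived statement never has to be written down in any configuration file, so it cannot drift out of step with the host, port and name it was computed from.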

That said, I’m not really pushing RDF as a config format. We took a gamble on it and it paid off in our case. Other projects are different. JSON is a great format that enables fast and simple data exchange (when you control both ends).

I also suggested that a domain model is more valuable for persistent data than for transient data structures. Object oriented languages encourage you to design the domain model internal to a program. But in my view there is more value in a domain model you can communicate between systems, and keep for longer periods, than in a domain model that you can only use privately (ie. in a single memory address space) and only while your application is running. This is the exact opposite of designing domain models in Java/C# classes and serializing out to a database or JSON/XML files, hence the need to illustrate with a real-world example (in this case, configuration).

What other Clojure frameworks do you use?

  • Compojure/Ring/Hiccup/Clout for web pages.
  • Plugboard for REST but the intention is to move towards something like compojure-rest
  • Swank – couldn’t manage without it!

It’s a surprise to me how much we manage to do with just the standard Clojure libraries.

Do you think functionality rises linearly or exponentially with lines of code?

I thought this was a great question because it points to the huge amount of algorithmic re-use that we enjoy in Clojure.

Did you have a specific business problem that led you to Clojure?

Honestly, no. In my case it was a growing frustration with large Java systems. But since we’ve been using Clojure in our team there have been a number of business problems that have cropped up that are ideally suited to Clojure. Certainly in my industry (banking) the business is built on mathematical functions and data transformations for which functional languages like Clojure are ideal.

Do you think Clojure will be around in 5 years’ time?

This final question was asked by someone sitting in the front row. I don’t think they would have asked this if they’d seen how many people were in the room! Clojure is building momentum, at least in London, and as I said in my talk I think it’s beyond the point of critical mass now.

But on reflection I think it’s an important question. Why should anyone invest a lot of time in learning something that isn’t going to be around in a few years? However, technology is always about betting on certain horses (VHS or Betamax?) and you can never be 100% certain. LISP is a good bet though, it’s survived over 50 years and people keep rediscovering it. So even if Clojure doesn’t survive, I’m confident the knowledge you get from learning it will remain relevant.

Permalink

Copyright © 2009, Planet Clojure. No rights reserved.
Planet Clojure is maintained by Baishampayan Ghose.
Clojure and the Clojure logo are Copyright © 2008-2009, Rich Hickey.
Theme by Brajeshwar.