The REPL

Datafy, paren parties, and adopting a neanderthal (function)
Notes.

Hoo boy.

There was quite a commotion this weekend after Stu Halloway's appearance on the Apropos podcast. A few years ago there was a similar commotion and I wrote this on HN with my perspective. I think it is still accurate today.

The thing to realise about Clojure is that it isn't an open source language like Python. It is a language controlled very tightly by Rich Hickey and Cognitect. Major work is done in secret (transducers, reducers, spec), then announced to the world with a "Here it is!", and then suggestions are taken. This mostly goes well, though it took a lot of outside persuasion to show that Feature Expressions weren't the best idea and to come up with Reader Conditionals instead.

The underlying problem Clojure/Core has is communication. If they came out and explained their philosophy, people wouldn't get frustrated and confused when they expect the language's development to work like other open source projects. Clojure's development is functioning exactly as planned. It's not a mistake.

A better way to treat Clojure is as something closer to a project at Apple (except for Swift). You can submit bugs and make suggestions for future improvements, and if you really want to, provide a patch. But go into it realising that it's not a community project, and you're much less likely to get frustrated.

With all that said, the proof of the pudding is in the eating, and for the most part Clojure has developed pretty well under Rich's tight control. I'd love it if there were a Snow Leopard year for Clojure where the focus was on fixing longstanding bugs and feature requests (and more focus on those in general), but the language is developing well.

While I can't claim to speak for the community, in my experience with the Clojure folk I talk with, the concerns Zach Tellman, Tim Baldridge, Bruce Hauman, and others raised in the resulting Twitter threads are shared by more than a tiny minority of the Clojure community. On the whole, I think Clojure is developing well, but there are opportunities within the Clojure language, and for supporting tools, documentation, and libraries to continue to improve the Clojure experience.

-main

  • Snyk released some very curious results from a survey on the JVM ecosystem last week. The most surprising thing about it was the question about the principal JVM language that people used. Java was first with 90% share, but Clojure was second with 3%, Kotlin at 2.4%, Groovy at 2.4%, and Scala at 1.8%. I was very surprised to see Clojure ahead of Scala; in my experience, Scala has been 4-5x larger than Clojure in terms of jobs, adoption, etc. There were 10k respondents to the survey though, so it seems like a reasonable sample size, and the way the survey was distributed doesn't seem too biased. Other people also found that surprising.

    Another curious thing about the survey was that 60% of respondents used Java EE. I have no experience with Java EE or that ecosystem, but I was surprised to see it so highly represented. Maybe that really is the case, though.
  • Clojurists Together is funding Datascript and Kaocha this quarter
  • Datomic On-Prem has a new caching option to use SSDs
  • How Miika Koskinen uses tap>, and the libraries he recommends
  • I had a timely discussion with Ben Brinckerhoff about spec and error messages
  • It was mentioned above, but the Apropos discussion with Stu Halloway is a really good insight into how Clojure is developed, and how the Clojure team thinks about problems.

Libraries & Books.

  • Bunnicula is a new library for asynchronous messaging with RabbitMQ, with a great writeup on the motivations and context.
  • You can adopt a function from Neanderthal
  • Giving new life to existing OM legacy SPAs with re-om

People are worried about Types.

  • A beginner's FAQ for Clojure.Spec, though I think this is useful for everyone

Recent Developments.

  • Since my last email, Clojure went to a release candidate, then back to beta with a brand new datafy namespace.
  • CLJ-2373 is where work on improvements to exception messages and printing has gone


I'm Daniel Compton. I maintain public Maven repositories at Clojars, private ones at Deps, and help fund OSS Clojure projects (along with tons of generous members like Pitch, JUXT, Metosin, Adgoji, and Funding Circle) at Clojurists Together. If you've enjoyed reading this, tell your friends to sign up at therepl.net, or post a link in your company chatroom. If you've seen (or published) a blog post, library, or anything else Clojure/JVM related please reply to this to let me know about it.
Copyright © 2018 Daniel Compton, All rights reserved.



Permalink

Load testing with Gatling and Clojure

Yes, it works, but will it scale?

I’ve gotten so used to the reliability of Amazon autoscaling groups and load balancers that, the one time I left their comfort zone, I didn’t have a good tool to answer that question. Nor does there seem to be a simple, cheap, industry-standard solution.

The use case

You have a website or web API. You want to see how it behaves when a few thousand users hit it with simultaneous HTTP requests - specifically, how many of these requests fail or are severely delayed. You want those results displayed as pretty graphs, both for your own convenience and to impress your clients.

In addition, you don’t want to run that test from your local machine. Your data would hopelessly confuse website performance issues with your office’s faulty internet connection and the twelve tabs of YouTube videos you forgot you were streaming in the background. This is clearly a job for cloud computing.

Finally, you don’t want all these users to show up at the same time. AWS autoscaling groups are designed to handle a gradually increasing load, not a sudden 10,000-user spike. The same is true of the load balancer itself (which quietly autoscales under heavy load). Amazon recommends that you increase the load "at a rate of no more than 50 percent every five minutes". (The entire article is well worth reading.) You might also be interested in ramping down, to test scale-down behaviour.
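To get a feel for what that guideline implies, here is a back-of-the-envelope sketch (mine, not from the article; the 100-user starting point is an assumption): growing 50% every five minutes from 100 users, it takes about an hour to pass 10,000 users.

```java
public class RampMath {
    // Number of growth steps needed to go from `start` to at least `target`,
    // multiplying the load by `growthFactor` at each step.
    static int stepsToReach(double start, double target, double growthFactor) {
        int steps = 0;
        double load = start;
        while (load < target) {
            load *= growthFactor;
            steps++;
        }
        return steps;
    }

    public static void main(String[] args) {
        // Assumed starting load of 100 users, Amazon's 1.5x-per-5-minutes guideline.
        int steps = stepsToReach(100, 10_000, 1.5);
        System.out.println(steps + " steps = " + (steps * 5) + " minutes");
        // prints "12 steps = 60 minutes"
    }
}
```

That hour of ramp-up is exactly the kind of schedule a load-testing tool needs to support, which motivates the rest of the article.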

Existing options

For this basic use case, RedLine13's "simple test" proved almost sufficient.

It’s very easy to set up, and while the paid subscription is expensive, the free tier already provides almost everything you need. It doesn’t have ramp-down, but that can be achieved through its Gatling integration. I wasn’t familiar with Gatling at the time though, so I looked elsewhere.

Clojure runs through the soul of JUXT, so I experimented with clj-gatling, a Clojure wrapper around Gatling.

I quickly ran into its limitations. It is not a full wrapper; in fact it covers very few of Gatling’s many functionalities. In particular, clj-gatling currently does not support time-based ramp-up. The completion-based ramp-up it offers is a poor substitute and behaves unreliably. As for clojider, its AWS integration, I did not manage to get it to work at all.

Gatling

It ultimately convinced me to use Gatling directly.

Even with no knowledge of Scala, it isn’t too hard to figure out how to write simple simulation scripts. It also comes with a recorder that captures your browser activity and turns it into a simulation script - though this does not play nicely with HTTPS.

What it doesn’t have is a convenient way to run it remotely on AWS. RedLine13 would work, but since I didn’t realise that at the time, I ended up writing my own tool.

Introducing Ramp

Ramp is an open-source shell script + AWS CloudFormation template to simplify remote load testing. In one command, it spins up an EC2 instance (or reuses the existing one if it hasn’t been terminated), sets it up for Gatling use, runs the Gatling simulation script provided, and uploads the report to an S3 bucket.

Since that report is nicely formatted HTML, and since S3 buckets can be used as static websites, you can view the report in your browser, with all its glorious multicoloured graphs. Or share it with a colleague or client.

Ramp comes with a Scala simulation script that does a basic load test (spamming a URL with GET requests, with ramp-up and ramp-down). I have since used it to simulate more complex behaviours, including mass registration and log-in. Writing more sample scripts is on my to-do list, but Gatling has decent documentation on its own website.

(I didn’t realise before just how easy it is to make a website unusable via a DDoS attack. Poor caching, poor autoscaling, poor ephemeral port management: there are any number of potential weakest links. Implementing load testing in all JUXT projects as a matter of course is, well, a load off our minds.)

Future plans

Raffi Krikorian (of Twitter and Uber fame) pointed out to us that it’s quite hard to create a simulated user that’s as, shall we say, unpredictable as the real thing. Even if your simulation script is quite complex and makes full use of your website’s various subsystems, it’s still written by an engineer who’s making assumptions about how the website is supposed to be used.

Real users click a button five times in a row because they think the page is too slow to load. Or go back a page right in the middle of execution. Or refresh a resource-intensive page every second as they impatiently wait for news.

Gatling is a valuable tool to find out how your website handles load, but isn’t enough to be certain it’ll handle users. For that, you should ideally capture the behaviour of multiple real users, and replay that thousands of times on your staging site.

At JUXT, we’ve yet to tackle a project that has to scale as much as a Twitter or an Uber - but we’re slowly moving in that direction and are starting to feel a need for this sort of testing. We don’t have the tools for it but we’re investigating options.

Permalink

Making Minesweeper in Reagent

Minesweeper

I had previously written an ASCII Minesweeper game in Clojure that you play in the terminal, but wanted to make a version that runs in the browser.

Then I realized I could just put it in a KLIPSE snippet, and we'll be able to play it right here on this page! In the process we'll improve upon the code as well as make it easier to understand and even change to other games, like Sudoku or Othello.

KLIPSE makes it really easy to use Reagent. We just require it like usual:

(require '[reagent.core :as r])

First we'll define some constants for our game, like the board dimensions and number of mines:

(def board-width 9)
(def board-height 9)
(def num-mines 18)

We'll use the for macro to generate a sequence of x and y coordinates, and then use shuffle to scramble them - that way when we set our mines they will be in random positions.

(defn rand-positions []
  (shuffle
    (for [i (range board-width)
          j (range board-height)]
      [i j])))

(defn set-mines [] 
  (loop [squares (repeat num-mines 1)]
    (if (= (count squares) (* board-width board-height))
      squares
      (recur (conj squares 0)))))

(defn init-matrix []
  (into {}
    (map vector
      (rand-positions)
      (set-mines))))

(def app-state
  (r/atom
    {:matrix (init-matrix)
     :stepped []
     :game-status :in-progress
     :message "Tread lightly..."}))

Now we need to implement the mine-detector, and have it do the thing where it recursively clears the squares with no surrounding mines. We start with a simple predicate function to find out if a given square is mined:

(defn mine? [x y]
  (= 1 (get (:matrix @app-state) [x y])))

Now the fun part - we define a function for each of the 8 directions, each taking a square's coordinates and a running mine count, and incrementing the count if the neighbouring square in that direction has a mine in it. Each one returns the current coordinates as well as the mine count, so we can chain them together. You know what... let's do this right, by going clockwise from the top-left.

(defn top-left? [x y n]
  (if (mine? (dec x) (dec y))
    [x y (inc n)]
    [x y n]))
    
(defn top? [x y n]
  (if (mine? x (dec y))
    [x y (inc n)]
    [x y n]))
    
(defn top-right? [x y n]
  (if (mine? (inc x) (dec y))
    [x y (inc n)]
    [x y n]))
    
(defn right? [x y n]
  (if (mine? (inc x) y)
    [x y (inc n)]
    [x y n]))
    
(defn bottom-right? [x y n]
  (if (mine? (inc x) (inc y))
    [x y (inc n)]
    [x y n]))
    
(defn bottom? [x y n]
  (if (mine? x (inc y))
    [x y (inc n)]
    [x y n]))
    
(defn bottom-left? [x y n]
  (if (mine? (dec x) (inc y))
    [x y (inc n)]
    [x y n]))
    
(defn left? [x y n]
  (if (mine? (dec x) y)
    [x y (inc n)]
    [x y n]))

Now we take this neat little thing and spin it around like so:

(defn mine-detector [x y]
  (->> [x y 0]
       (apply top-left?)
       (apply top?)
       (apply top-right?)
       (apply right?)
       (apply bottom-right?)
       (apply bottom?)
       (apply bottom-left?)
       (apply left?)
       last))

In the case that mine-detector returns 0, we want to auto-step all around it:

(defn step [x y]
  (swap! app-state assoc-in [:stepped]
         (conj (:stepped @app-state)
               [(dec x) (dec y)]
               [x (dec y)]
               [x (inc y)]
               [(dec x) y]
               [(inc x) y] 
               [(inc x) (dec y)]
               [(inc x) (inc y)]
               [(dec x) (inc y)])))

Now here are our rendering functions:

(defn blank [i j]
  [:rect
   {:width 0.9
    :height 0.9
    :fill "grey"
    :x (+ 0.05 i)
    :y (+ 0.05 j)
    :on-click
    (fn blank-click [e]
      (when (= (:game-status @app-state) :in-progress)
        (swap! app-state assoc-in [:stepped]
          (conj (:stepped @app-state) [i j]))
        (when (= 1 (get (:matrix @app-state) [i j]))
          (swap! app-state assoc
                 :game-status :dead
                 :message "Fuck. You blew up."))))}])

(defn rect-cell
  [x y]
  [:rect.cell
   {:x (+ 0.05 x) :width 0.9
    :y (+ 0.05 y) :height 0.9
    :fill "white"
    :stroke-width 0.025
    :stroke "black"}])

(defn text-cell [x y]
  [:text
   {:x (+ 0.5 x) :width 1
    :y (+ 0.72 y) :height 1
    :text-anchor "middle"
    :font-size 0.6}
   (if (zero? (mine-detector x y))
     ""
     (str (mine-detector x y)))])

(defn cross [i j]
  [:g {:stroke "darkred"
       :stroke-width 0.4
       :stroke-linecap "round"
       :transform
       (str "translate(" (+ 0.5 i) "," (+ 0.5 j) ") "
            "scale(0.3)")}
   [:line {:x1 -1 :y1 -1 :x2 1 :y2 1}]
   [:line {:x1 1 :y1 -1 :x2 -1 :y2 1}]])

(defn clear-squares! []
  ;; doseq (rather than a lazy map) forces the side effects to run,
  ;; and destructuring passes each coordinate pair to step as two arguments.
  (doseq [[x y] (:stepped @app-state)]
    (step x y)))

(defn render-board []
  (into
    [:svg.board
     {:view-box (str "0 0 " board-width " " board-height)
      :shape-rendering "auto"
      :style {:max-height "500px"}}]
    (for [i (range board-width)
          j (range board-height)]
      [:g
       [rect-cell i j]
       (if (some #{[i j]} (:stepped @app-state))
         (if (= 1 (get (:matrix @app-state) [i j]))
           [cross i j]
           [text-cell i j])      
         [blank i j])])))

(defn mine []
  [:center
   [:h1 (:message @app-state)]
   [:button
    {:on-click
     (fn new-game-click [e]
       (swap! app-state assoc
              :matrix (init-matrix)
              :message "Welcome back"
              :game-status :in-progress
              :stepped []))}
    "Reset"]
   [:div [render-board]]])

[mine]

Permalink

Using JMH with Clojure - part 1

Hello, performance junkies, long time no see. Today we will learn to access the tool that every JVM gofaster should (mis)use at least once in their lives — Aleksey Shipilёv's Java Microbenchmark Harness (JMH).

Earlier, we reviewed Criterium which is an easy to use benchmarking tool for Clojure. Criterium is a library you include and run directly from the REPL. It calculates some statistics on the results and ensures that the function you run is warmed up properly, but beyond that, it's quite trivial.

JMH, on the other hand, is much more intricate. It provides a toolset to fight against common benchmarking enemies, such as dead code elimination, constant folding, coordinated omission, and many others. The goal of this post is to give you an idea of how to use JMH in your project and what benefits that can potentially bring.

What is the problem?

As long as you measure relatively slow operations (milliseconds or slower) with Criterium or plain time/dotimes, you are most likely fine. The problems start when you decide to measure something minuscule. For example, let's check how fast long multiplication is on the JVM:

(time
 (dotimes [_ 1e9]
   (unchecked-multiply 123 456)))

;; "Elapsed time: 337.849181 msecs"

338 milliseconds for 1 billion iterations makes it ~0.3 nanoseconds per multiplication. Modern CPUs are fast! We can check what the compiled code actually does with clj-java-decompiler:

(decompile
 (dotimes [_ 1e9]
   (unchecked-multiply 123 456)))
public static Object invokeStatic() {
    for (long n__5742__auto__15409 = RT.longCast(1.0E9), _ = 0L; _ < n__5742__auto__15409; ++_) {
        Numbers.unchecked_multiply(123L, 456L);
    }
    return null;
}

Looks reasonable. So, is it really that fast? Let's double-check with Criterium:

(crit/quick-bench (unchecked-multiply 123 456))

;; Execution time mean : 6.694309 ns

That's quite different! What if we ask JMH? (You will learn how to run such examples later.)

@Benchmark
public long mul() {
    return 123L * 456L;
}

@Benchmark
public void mulWrong() {
    long x = 123L * 456L;
}

// Multiply.mul       2.445 ± 0.126 ns/op
// Multiply.mulWrong  0.329 ± 0.021 ns/op

Multiply.mul appears to be faster than the Criterium result but still slower than the initial 0.3 nanoseconds. What's going on here? Turns out, in the case of time/dotimes benchmark, and in Multiply.mulWrong, the JVM is smart enough to remove the whole body of the loop since its result is not being used by anyone. This optimization is called dead code elimination, and it's quite easy to trigger when doing careless benchmarks. Criterium guards against it by consuming the result of each iteration, and so does JMH.
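The same hazard is easy to demonstrate outside any benchmarking framework. This is an illustrative sketch of my own (the names and loop counts are not from the post): the first loop discards its result, so the JIT is free to eliminate the body entirely, while the second folds every iteration into an accumulator that the caller must use.

```java
public class DceGuard {
    // Result is discarded, so the JIT may eliminate the whole loop body.
    static void naiveLoop(long n) {
        for (long i = 0; i < n; i++) {
            long unused = 123L * 456L;
        }
    }

    // Every iteration's result is folded into `sink`, which is returned to
    // the caller, so the JIT cannot prove the work is dead.
    static long guardedLoop(long n) {
        long sink = 0;
        for (long i = 0; i < n; i++) {
            sink ^= 123L * i;
        }
        return sink;
    }

    public static void main(String[] args) {
        naiveLoop(1_000_000);                        // may cost ~nothing once compiled
        System.out.println(guardedLoop(1_000_000));  // forces the multiplications to happen
    }
}
```

This is the same trick Criterium and JMH play internally: consume every result so the optimizer can't delete the work being measured.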

Why use JMH then if Criterium already does fine? Let's consider another example. We want to measure how long it takes to walk an array of 100000 elements sequentially:

(let [sz 100000
      ^ints arr (int-array sz)]
  (crit/quick-bench
   (dotimes [i sz]
     (aget arr i))))

;; Execution time mean : 57.017862 µs

That's really fast, just 60 microseconds for the whole array, caches and all be praised! But you are already feeling suspicious, aren't you?

@State(Scope.Thread)
public static class BenchState {
    int[] arr = new int[size];
}

@Benchmark
public void walk(BenchState state, Blackhole bh) {
    for (int i = 0; i < size; i++)
        bh.consume(state.arr[i]);
}

// WalkArray.walk  3019.442 ± 426.008 us/op

Now, three milliseconds look much more convincing. We can believe that it's the actual result, not the 60 microseconds we got earlier. Why did Criterium fail us here? Indeed, Criterium prevents DCE for values that are returned by each iteration, but it has no control over the internal loop — the one run by our code. And JMH gives us this Blackhole object that can be used to forcefully consume any intermediate value.

How to use JMH

The setup I'm going to suggest is not very REPL-friendly. JMH can be used as a library, but it is designed to act more like a framework. For example, JMH forks a process for each benchmark to isolate them and minimize the effects that one benchmark-related JIT and JVM behavior can have on other benchmarks.

Having said that, I'm far from sure that my setup is the most effective one. jmh-clojure is another effort to make JMH usage more similar to a standard Clojure workflow. Perhaps someday I will write about my experience with it too.

Leiningen

Feel free to create a new project with lein new, but for the example I'm giving, it is enough to make an empty directory.

The project.clj has to look like this:

(defproject jmh-lein "0.1.0-SNAPSHOT"
  :java-source-paths ["src/"]
  :dependencies [[org.clojure/clojure "1.9.0"]
                 [org.openjdk.jmh/jmh-core "1.21"]
                 [org.openjdk.jmh/jmh-generator-annprocess "1.21"]]
  :aliases {"jmh" ["exec" "-ep" "(bench.Multiply/run)"]})

Another file you need is src/bench/Multiply.java:

package bench;

import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.runner.*;
import org.openjdk.jmh.runner.options.*;
import java.util.*;
import java.util.concurrent.*;

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Warmup(iterations = 3, time = 5, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 5, time = 5, timeUnit = TimeUnit.SECONDS)
@Fork(value = 1)
public class Multiply {

    @Benchmark
    public long mul() {
        return 123L * 456L;
    }

    @Benchmark
    public void mulWrong() {
        long x = 123L * 456L;
    }

    public static void run() throws RunnerException {
        Options opt = new OptionsBuilder()
            .include(Multiply.class.getSimpleName())
            .build();

        new Runner(opt).run();
    }
}

This is our benchmarking class. The JMH annotations configure how long to spend warming up the benchmark, how long to run it, and which units to output the results in. Each method in the class that is marked with the @Benchmark annotation will be run many times in a special JMH-managed loop. The static method run is our entrypoint to the benchmark, where we can inject extra configuration through the OptionsBuilder object. This is the method that our jmh alias calls.

You can now run lein jmh in the terminal. Some debugging info will get printed, and then the benchmarking iterations will proceed. In the end, you should get something like this:

Benchmark          Mode  Cnt  Score   Error  Units
Multiply.mul       avgt    5  3.452 ± 0.574  ns/op
Multiply.mulWrong  avgt    5  0.583 ± 0.338  ns/op

Boot

Again, just make an empty directory and put this build.boot in it:

(set-env! :source-paths #{"src"}
          :dependencies '[[org.clojure/clojure "1.9.0"]
                          [org.openjdk.jmh/jmh-core "1.21"]
                          [org.openjdk.jmh/jmh-generator-annprocess "1.21"]])

(deftask jmh []
  (comp (javac)
        (with-pass-thru fs
          (let [cp (->> (conj (output-dirs fs)
                              (System/getProperty "fake.class.path")
                              *compile-path*)
                        (clojure.string/join java.io.File/pathSeparator))]
            (System/setProperty "java.class.path" cp)))
        (call :eval ['(bench.Multiply/run)])))

The main difference from Leiningen is that Boot doesn't automatically set the correct java.class.path property, and JMH expects that. The with-pass-thru step in the middle combines the fake.class.path property (this is where Boot keeps the list of all dependencies) with the output dirs from the (javac) step, and then sets the result as the java.class.path property.

You should also create the src/bench/Multiply.java file with the same content as in the Leiningen version. You can now run boot jmh from the terminal.

A more interesting example

Now that we've dealt with setting up the environment, let's make a benchmark that utilizes more JMH features. How about observing the effects of branch prediction? We are going to reproduce one very popular StackOverflow question — Why is it faster to process a sorted array than an unsorted array?

Branch prediction is a mechanism inside the CPU that can speculatively pick a more traveled branch of the condition before the actual check has been made. This article provides an in-depth look at the history of branch prediction.

Anyway, here's the code of the benchmark:

package bench;

import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.runner.*;
import org.openjdk.jmh.runner.options.*;
import java.util.*;
import java.util.concurrent.*;

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
@Warmup(iterations = 3, time = 5, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 5, time = 5, timeUnit = TimeUnit.SECONDS)
@State(Scope.Benchmark)
@Fork(value = 1)
public class BranchPrediction {

    @Param({"1000", "10000", "100000"})
    public static int size;

    @State(Scope.Thread)
    public static class BenchState {

        int[] unsorted, sorted;

        @Setup(Level.Iteration)
        public void prepare() {
            // Create an array and fill it with random numbers in range 0..size.
            unsorted = new int[size];
            for (int i = 0; i < size; i++)
                unsorted[i] = ThreadLocalRandom.current().nextInt(size);

            // Make a sorted array from the unsorted array.
            sorted = unsorted.clone();
            Arrays.sort(sorted);
        }
    }

    public long sumArray(int[] a) {
        long sum = 0;
        // Threshold is the median value in the array.
        int thresh = size / 2;
        for (int el : a)
            // Sum all array elements that are lower than the median.
            if (el < thresh)
                sum += el;
        return sum;
    }

    @Benchmark
    public long unsortedArray(BenchState state) {
        return sumArray(state.unsorted);
    }

    @Benchmark
    public long sortedArray(BenchState state) {
        return sumArray(state.sorted);
    }

    public static void run() throws RunnerException {
        Options opt = new OptionsBuilder()
            .include(BranchPrediction.class.getSimpleName())
            .build();

        new Runner(opt).run();
    }
}

A couple of new things are introduced in this benchmark. The @Param annotation above size will make the benchmark run separately for each provided value of size. The internal BenchState class is used as a holder of state that should be initialized once. We use it to create two arrays — one with random numbers in 0..size range and the other with the same numbers but sorted.

The two benchmarks are doing the same thing — walking over the array and adding the elements that are lower than size/2. Our hypothesis is that doing this on a sorted array should be faster because of the branch prediction. Let's run and see:

Benchmark                       (size)  Mode  Cnt    Score    Error  Units
BranchPrediction.sortedArray      1000  avgt    5    0.560 ±  0.185  us/op
BranchPrediction.sortedArray     10000  avgt    5    4.694 ±  0.526  us/op
BranchPrediction.sortedArray    100000  avgt    5   48.540 ± 23.480  us/op
BranchPrediction.unsortedArray    1000  avgt    5    0.760 ±  0.072  us/op
BranchPrediction.unsortedArray   10000  avgt    5   13.669 ±  2.409  us/op
BranchPrediction.unsortedArray  100000  avgt    5  370.335 ± 11.986  us/op

Except for size=1000, where the difference is small, the sorted array is consistently and significantly faster in this benchmark.

JMH profilers

JMH's killer feature is the ability to attach different profilers to a benchmark. The profiler we'll try today is perfnorm, which measures various CPU performance counters and normalizes the results per benchmark iteration.

First, you'll need to install perf (sorry, Linux only). Then, change the Options declaration to:

Options opt = new OptionsBuilder()
    .include(BranchPrediction.class.getSimpleName())
    .addProfiler("perfnorm")
    .build();

And just run the benchmark again. Here's what I got (redacted):

Benchmark                                     (size)  Mode  Cnt        Score   Error  Units
BranchPrediction.sortedArray                    1000  avgt    5        0.593 ± 0.041  us/op
BranchPrediction.sortedArray:branch-misses      1000  avgt             2.034           #/op
BranchPrediction.unsortedArray                  1000  avgt    5        0.696 ± 0.026  us/op
BranchPrediction.unsortedArray:branch-misses    1000  avgt             0.288           #/op

BranchPrediction.sortedArray                  100000  avgt    5       53.436 ± 5.329  us/op
BranchPrediction.sortedArray:branch-misses    100000  avgt            13.268           #/op
BranchPrediction.unsortedArray                100000  avgt    5      378.581 ± 0.818  us/op
BranchPrediction.unsortedArray:branch-misses  100000  avgt         49880.070           #/op

There are two observations to make here. For size=100000, the unsorted array benchmark indeed has far more branch misses per iteration than the sorted variant (49880 vs. 13). This explains why the sorted array benchmark is so much faster.

But for size=1000, the difference in performance is almost negligible. What's even more surprising is that the unsorted array has fewer branch misses than the sorted one (0.3 vs. 2.0)! I hypothesize that the branch prediction machinery was able to "learn" the whole unsorted array since it's not too big. At the same time, the prediction for the sorted array has not become as sophisticated and it consistently mispredicts the two pivots in the sorted array (at the beginning and in the middle).

Conclusions and future work

You've probably noticed that we haven't actually benchmarked any Clojure code this time. It is because JMH is quite complicated and I want to immerse the readers into it gradually. I will certainly get to benchmarking Clojure in the following posts.

JMH is too big for me to describe all its features. Thankfully, there are plenty of other materials (linked to in References) that will help you customize JMH and write your own sophisticated benchmarks.

References

  • Aleksey Shipilёv's slides and talk about JMH. He has many of those, check Google for more.
  • The JMH samples are full of nicely commented examples of what JMH is capable of.
  • Nitsan Wakart has a whole page dedicated to different JMH resources on the web.

Permalink

PurelyFunctional.tv Newsletter 297: Beginner Experience Edition

Issue 297 – October 22, 2018 · Archives · Subscribe

Hi Clojurists (beginner, intermediate, and/or advanced),

I wanted to write a long diatribe about how much Cognitect does to improve the error messages and about the generosity of releasing Clojure as open source. But I’m tired. I’m tired of all the complaining!

So I’m just going to ask for a favor. Clojure/conj is coming up. I can’t make it this year, unfortunately. But if you are going, when you’re there, and you see Rich, just walk up to him and say “Thank you for all of your work”. He works hard on Clojure and you don’t pay anything for it. He deserves at least a “thanks”.

I was this close to quitting programming about eight years ago. I was looking into becoming a school teacher. It was only the growth of functional programming, in no small part due to Rich, that kept me around. I know what it’s like when people online loudly disrespect my work. And Clojure is 100 times more popular than anything I’ve done. The disrespect is probably 100 times more.

You should go the extra mile. The Conj is a little more than a month away. Start planning now, and get Rich, or Stu, or Alex, or someone a gift. Something small but meaningful. Present the gift when you say thank you. It’s not about “paying them back”. It’s about showing your appreciation, because that’s about all the payment they’ll ever get.

Rock on!
Eric Normand <eric@purelyfunctional.tv>

PS Want to get this in your email? Subscribe!


The Father Of Mobile Computing Is Not Impressed

Alan Kay has a lot to say about making products easy to use and popular. I started reading this article again because of the ongoing discussion that we inadvertently reignited with Stu on the show. I hope it will elevate the conversation.


Apropos Clojure #20 YouTube

Stuart Halloway joined us on Apropos to talk about the error message work that’s gone into Clojure 1.10. It improves the out-of-the-box experience AND the power of tooling.


Sources of complexity in software Podcast

I talk about the difference between essential complexity and accidental complexity.


The Hard Parts of Open Source YouTube

Did you know that even Evan Czaplicki, probably the gentlest of language creators, gets yelled at for Elm? In this talk, he breaks down three tendencies we see in online communities and traces their origins. He mentions a blog post hating on Clojure, and Rich Hickey’s response. What are we doing to these people who share so much for free?


Running With Scissors: Live Coding With Data YouTube

Stuart Halloway has been on a mission to spread the techniques of highly interactive, REPL-driven development.


A Model of Interceptors From the archives

I developed a model of interceptors with an operation for combining two interceptors into a bigger interceptor. I hope this might be useful one day for a project I’ve been fantasizing about.


How Lisp Became God’s Own Programming Language

A nice, concise history of the development of Lisp.


Help Them Help You

Radford Smith’s perspective on the brouhaha over Clojure’s beginner experience.


Categories for the Working Hacker YouTube

Philip Wadler sharing mathematical ideas from Category Theory. I’ve been following Philip Wadler for a while (he invented a lot of stuff in use in Haskell) and he’s quite a smart guy. This may be the most accessible talk on Category Theory he’s ever given, and perhaps the most accessible explanation of the ideas behind Category Theory on the web.


Clojure Spec and Error Messages Podcast

Ben Brinckerhoff was on The REPL podcast. He’s the creator of Expound. He talks about Elm’s error messages, the purpose of Expound, and more. It couldn’t be more timely.


Clojure Collections Currently recording

More lessons in this course! It’s almost done recording!

Since last week, here are the new lessons:

The post PurelyFunctional.tv Newsletter 297: Beginner Experience Edition appeared first on PurelyFunctional.tv.

Permalink

Why do functional programmers model things as data?

Functional programmers tend to prefer pure, immutable data to represent their domain. We talk about many of the benefits of such an approach. But we focus on one in particular: that good data representations can reduce complexity by reducing the number of if statements you need.

Transcript

Eric Normand: Why do functional programmers like to represent things as data?

Hi, my name is Eric Normand. These are my thoughts on functional programming. This is one of those thoughts that I’m not quite sure about.

Mostly, I’m not sure about the explanation that I’m able to give. I am sure that from my experience this is true. I feel like I’m making leaps that are not justified when I’m trying to explain it more rationally.

If you have an idea about how I could explain this better or even just nitpick on my explanation so I can get better at it, ask questions, etc. Here it goes. I’m going to try it anyway and we’ll see.

One of the biggest problems in software is complexity. We have these really big complex systems and we can’t understand them, at least not all at once because they’re too complicated. We have to break it up into small pieces or we could try to reduce the complexity.

We’ve already talked about essential complexity versus accidental complexity. I’m just talking about complexity in general right now. One source of complexity is branches. Every conditional you have in your program means at least two branches and those two branches will multiply the number of code paths that you have in your program.

This is a traditional or classical way to measure complexity. It’s called cyclomatic complexity, where you measure the number of code paths that your program could go through. Again, it’s all about not knowing what’s the next thing that’s going to happen: because you have a conditional, it could be this branch, it could be that branch, you don’t know. It’s just hard to keep in your head.

Imagine a program with 100 lines but no conditionals. Just do this, do this, do this, do this, do this versus 100 lines of do this, but then IF this is true, then do that, but otherwise, do this. It just starts to get harder to understand just because the branch is through there.

What functional programming does, one way that it tries to reduce complexity, is by eliminating branches. We do that by modeling things as data. This is where the explanation is starting to make some leaps.
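To make the "fewer branches" point concrete, here is a small illustration of my own (not from the episode), sketched in Python: a chain of conditionals versus the same knowledge moved into a plain data structure.

```python
# Branchy version: every elif adds another code path to keep in your head.
def shipping_cost_branchy(region):
    if region == "us":
        return 5
    elif region == "eu":
        return 9
    elif region == "asia":
        return 12
    else:
        raise KeyError(region)

# Data version: the same knowledge as one constrained, inspectable table.
SHIPPING_COST = {"us": 5, "eu": 9, "asia": 12}

def shipping_cost(region):
    return SHIPPING_COST[region]  # unknown regions still raise KeyError
```

The behavior is identical, but the data version has no branches of its own, and you can also enumerate, serialize, or validate the table, which you can't do with the conditional chain.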

The reason we like to model things as data is because data has a well-understood, limited structure. Compare that to a pure function: inside, it could have branches, it could have a ton of stuff. It’s Turing Complete. It could calculate anything.

Whereas if you have a piece of data, there’s only certain cases that it can have. For example, an array is going to have an empty case. It’s going to have a singular case. It’s going to have a plural case if you want to divide it up that way. There’s a certain number of cases. The plural case is like two or more. You can see how it’s different from the singular case anyway.

You could have an infinite number of things in that array but we tend to think of it in those terms. There’s a lot of other structure to it. For instance, the zero index is the first thing. There’s no negatives. Then there’s a limit however big that array is. There’s a maximum index. They’re all integers. These are all structural constraints that the data structures that we use have.

It’s those constraints that drastically simplify our implementations. If your implementation uses an array, there’s only certain stuff you can do with it. This is in addition to all of data’s other benefits, like being serializable and deserializable, of course.

It also often has a literal representation. You could include it right in your program. You could store it in a database, if you had a database with a similar structure. You could do that.

This is something else. This is saying it’s much more constrained and so it reduces the complexity. Everything you could move from a Turing Complete function into a highly constrained data structure is going to simplify your code. There’s another benefit to data which is that it’s not opaque. A function is opaque.

When you have a function, sure, as a human you can read the code that defined that function and understand it to a certain extent. But let’s anthropomorphize it: when you are passed a function as an argument, the only thing you can do with that function is call it. There’s no way to understand what it’s going to do besides just calling it. That’s different from data.

Data typically has an API. Take a HashMap: you can ask, “What are all your keys? What are all your values? Give me all your keys and values in a sequence.” There’s all sorts of stuff that you can do: add stuff to the HashMap, remove stuff.

It has its own API, but it’s a well-known, well-understood API that is limited yet gives you properties you want to be able to take advantage of. Contrast that with a function: all you can do is run it, and the downside is you don’t know what it’s going to do until you run it.

That means that you can have a piece of data that is interpreted in different ways. This is sort of the value of a database. You store all these bits of data in there and then you can query them in different ways that you hadn’t planned for before. It gives you a huge amount of flexibility. I do want to focus mostly on this complexity argument.

The main reason we do it is because it’s less complex. Just thinking of the branches, that is my best explanation that I’ve managed of why the data representation is less complex. It has fewer branches.

It has known branches, so what you do is you start modeling your domain with data structures. You have to choose what kind of data will best represent this domain idea. What you can do is you can see, you can analyze that my domain needs zero or more items in it.

You think, “Well, that sounds perfect for an array.” You have this very close fit between your domain and your data representation or your domain model. The domain and its model in a close fit, there’s going to be complexity in your domain.

We talked about that. That’s called the essential complexity. That essential complexity is exactly mirrored in the array and it has nothing else. There’s no accidental complexity, because the domain had zero or more cases.

Now you’re modeling that with something with zero or more cases. Imagine you had a different scenario where you had one or more items. The array can handle that, but it also has this extra case where it’s empty, that it’s handling that, too. You have this danger that your data structure is a little bit more complex than your domain.

Your domain model is one case more complex than the domain. You’re adding in complexity. You’re adding in that accidental complexity. You’re going to have to deal with that somehow. You might deal with it with a conditional. When you create that array, you might say, “Hey, is this empty?”

Whatever you’re creating as a constructor, as some factory, you’re saying, “Is this array empty?” If it is, then throw in an exception or do something to indicate this failed. Can’t do it. Empty, one or more. There is no empty case. Or, you could do something else where your constructor requires a first item before it will make the thing, before it will make the array for you.

It will put that first item in for you and ensure that there’s at least one. There’s two ways to do it but notice they both add a little complexity that you have to add in yourself because they’re trying to move the complexity from the data structure which allows the zero case, the empty case.
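A tiny sketch of that second option in Python (illustrative, not from the episode): a list-like constructor that demands a first item, so the empty case simply cannot be represented.

```python
class NonEmptyList:
    """A collection that cannot be empty: the constructor requires a
    first item, making the zero case unrepresentable."""

    def __init__(self, first, *rest):
        self._items = [first, *rest]

    def items(self):
        return list(self._items)

    def first(self):
        return self._items[0]  # always safe: there is at least one item
```

With this shape, callers never need an "is it empty?" conditional; the accidental empty case has been moved out of the data structure entirely.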

They’re trying to move it away and make sure that that’s impossible. You’re shifting it around. That becomes the design question where this complexity goes, how to minimize its impact on the rest of the system. That’s my explanation. I’ll just summarize real briefly. Modeling stuff with data has a lot of benefits but the main benefit is that it reduces complexity.

It does so because our data, the ways we can represent data in our languages, is constrained. Much more constrained than a Turing Complete function, for sure. Those constraints are what give us the material, the grain, that lets us build domain models that have the same structure as the domain, or at least very close.

The closer you get the fewer workarounds you have to do in your code to avoid those corner cases, the cases where it’s a misfit. The more of those you avoid the less accidental complexity you have. You’re going to capture as close to just essential complexity as possible.

Please let me know what you think. I love getting emails from my listeners. You can email me at eric@lispcast.com. I’m also on Twitter. You can follow me. I love to tweet out little quotes and stuff from this podcast so you can learn about stuff that way.

Get your fix on Twitter if you want. Love to have discussions on Twitter. I’m @ericnormand. Follow me. Mention me. Let’s get talking. Awesome. Thanks so much. See you later.

The post Why do functional programmers model things as data? appeared first on LispCast.

Permalink

Extending Datomic Pull Queries

Datomic is a hosted database, made by the creators of Clojure. It provides a pleasant and powerful interface for developers and extends the principles of Clojure into the database layer. Read the introduction to Datomic for more information.

We use Datomic as the database of several of our products at AdStage. One of those products, Report, is a data visualization dashboarding tool for online advertisers which we built with Datomic, Fulcro, Clojure, and ClojureScript. Unfortunately, Datomic doesn’t support blobs (binary large objects) and ordered collections which are both needed by Report.

This post details the custom solution we built to work around Datomic’s lack of blob store and ordered collections for our Report product.

Building the Blob Store

When users make a new data visualization in Report, it requests the metrics and metadata of the relevant ads from our Data API. Report calls a single data visualization a widget and the data populating it a data-stream. Widgets collate data from our many sources of advertising metrics and metadata, and as a result each data-stream contains very specific and unique data. Here is one of our rendered charts next to the data-stream containing its data.

;; CTR by Month data-stream
{:data-stream/series
 {:plots
  {:0-metrics-ctr
   {:plot-name "Facebook",
    :tooltip-name "Facebook CTR",
    :data-points [1.59458465538327 2.00280548628429 1.33447208284403],
    :legend-name "Facebook CTR",
    :network "facebook",
    :metric :ctr},
   :1-metrics-ctr
   {:plot-name "AdWords",
    :tooltip-name "AdWords CTR",
    :data-points
    [0.556948274421325 0.487455307016367 0.398452503720537],
    :legend-name "AdWords CTR",
    :network "adwords",
    :metric :ctr}},
  :meta {:currency-code "USD", :currency-symbol "$"},
  :time-range
  ["2018-02-01T00:00:00Z"
   "2018-03-01T00:00:00Z"
   "2018-04-01T00:00:00Z"]},
 :data-stream/start-date "2018-02-01T00:00:00Z",
 :data-stream/end-date "2018-04-30T23:59:59Z"}

Data-streams can contain a lot of data. We don’t need to do advanced queries on them and we never mutate them, so it was a natural fit to store them as blobs. Datomic, currently, does not support blobs.

We chose Postgres for our blob store since it has native blob support and our Datomic instance is hosted on top of it. We created a table in Postgres with two columns, the id and blob data, and serialize/deserialize them from Report with the library nippy. Below is how we store our data-streams:

Datomic’s FAQ explicitly recommends against storing blobs, so we felt confident in our decision to use a different data store. Using a different data store, however, means we lose out on Datomic’s very expressive and concise query languages!

Pluck over Pull

Datomic Pull queries are our primary method of querying the data used to render the UI. They accept a tree-like data structure of attribute names and return data in the same shape. Here is a sample query for the data that our widgets need to render:

(require '[datomic.api :as d])

(d/pull db
        [:db/id
         :widget/title
         :widget/data-stream]
        entity-id)
;; => {:db/id              17592186088168
;;     :widget/title       "CTR by Month"
;;     :widget/data-stream 17592186088371}

When queried, the attribute :widget/data-stream returns a reference to the data-stream in the blob store but not the blob of the data-stream itself. We needed a simple mechanism to query for the data in Datomic as well as the blob store. From this need came Pluck, a layer that sits on top of Datomic Pull query.

Pluck consists of two parts: a depth-first tree traversal of the Pull request, and a multimethod that is extendable by adding callback functions for each attribute you want to override or add. When pluck reaches an individual attribute (a leaf node) of the query, it dispatches to the specific pluck implementation. Each pluck method implementation is passed the attribute name, the database, and the Pull result.
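The general shape of that idea, stripped of Datomic specifics, can be sketched in Python (hypothetical names throughout; this is not AdStage's implementation, just the dispatch-plus-traversal pattern):

```python
# Hypothetical sketch: walk a pull result and apply per-attribute overrides.
OVERRIDES = {}  # attribute name -> callback(env, pull_result) -> new value

def pluck_method(attr):
    """Register an override for one attribute (akin to a defmethod)."""
    def register(fn):
        OVERRIDES[attr] = fn
        return fn
    return register

def pluck(env, pull_result):
    out = {}
    for attr, value in pull_result.items():
        if attr in OVERRIDES:
            out[attr] = OVERRIDES[attr](env, pull_result)
        elif isinstance(value, dict):
            out[attr] = pluck(env, value)  # recurse into nested results
        else:
            out[attr] = value
    return out

@pluck_method(":widget/data-stream")
def _data_stream(env, pull_result):
    # resolve the blob reference through an (assumed) blob-store mapping
    return env["blob_store"][pull_result[":widget/data-stream"]]
```

Attributes without an override pass through untouched, which is what lets the layer sit transparently on top of the underlying query.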

To integrate our blob store with the rest of our data in Datomic, we implemented the :widget/data-stream pluck method:

(defmethod -pluck :widget/data-stream
  [k {:keys [db blob-store] :as env} pull-result]
  (let [data-stream (:widget/data-stream pull-result)
        blob-id     (:data-stream/id (d/entity db data-stream))]
    ;; fetch the data-stream from the blob store
    (data-stream/find-by-id blob-store blob-id)))

When this method is called, we grab the data-stream id from Datomic and use data-stream/find-by-id to extract and deserialize the blob data from Postgres. The deserialized blob is returned with the rest of the widget data joined to the :widget/data-stream attribute. With the Pluck layer included, querying the widget data looks like this:

Pluck is an effective way to extend Datomic Pull queries in a consistent manner. It also allows us to work around many of Datomic’s other limitations, including a lack of ordered collections.

Let’s Plucking Order those Vectors

Datomic lacks native ordering, so applications have to create an order attribute and do their own sorting. Our users create dashboards to organize their widgets. By default, a query for our dashboard’s widgets returns an unordered vector:

(let [conn    (d/connect (e/env :db-url))
      db      (d/db conn)
      dash-id 17592186072163]
  (d/pull db
          '[:db/id
            {:dashboard/widgets [:widget/order]}]
          dash-id))
;; => {:db/id 17592186072163,
;;     :dashboard/widgets
;;     [{:db/id 17592186072165, :widget/order 2}
;;      {:db/id 17592186072169, :widget/order 0}
;;      {:db/id 17592186072173, :widget/order 1}
;;      …]}

This fails to solve our user’s needs as most users don’t want their widgets in a seemingly nondeterministic order! We extended our original :dashboard/widgets query with Pluck to order them.

(defmethod -pluck :dashboard/widgets
  [k {:keys [db]} pull-result]
  (sort-by :widget/order (:dashboard/widgets pull-result)))

(let [conn    (d/connect (e/env :db-url))
      db      (d/db conn)
      dash-id 17592186072163]
  (p/pluck {:db db}
           '[:db/id
             {:dashboard/widgets [:widget/order]}]
           dash-id))
;; => {:db/id 17592186072163,
;;     :dashboard/widgets
;;     [{:db/id 17592186072169, :widget/order 0}
;;      {:db/id 17592186072173, :widget/order 1}
;;      {:db/id 17592186072165, :widget/order 2}
;;      …]}

With this Pluck method, our dashboard queries return an ordered list. By switching from Datomic Pull to our Pluck API, we’ve been able to keep Datomic’s powerful query syntax while splicing in new data types and adding properties to our data that Datomic doesn’t natively support.

Conclusion

Choosing Datomic has required investing resources to build custom solutions, but the custom solutions were straightforward to build, integrate, and maintain. Furthermore, since Datomic queries take reified data structures as input, we have been able to leverage Clojure effectively to easily extend them. This has enabled Report to be even more powerful than it would have been if we had just used Datomic as-is, and has helped us deliver value to our customers.

If you are interested in Datomic, Clojure/ClojureScript, Fulcro, or any combination thereof check out AdStage, we’re hiring!


Extending Datomic Pull Queries was originally published in AdStage Engineering on Medium, where people are continuing the conversation by highlighting and responding to this story.

Permalink

Alright, Break It Up! Using Partition/Chunk

Programming Should Be About Transforming Data [...] I don't want to hide data I want to transform it.

Whether it's calculating students' final grades, turning mouse clicks into t-shirt orders, or reading and publishing temperature data from a RPi, programming is basically distilling data from one form into another. Oftentimes it doesn't feel like pure data transformation, because we try to handle how the data is stored as well as its transformation in one step. This can make code awfully confusing, especially to anyone who has to maintain it.

Any fool can write code that a computer can understand. Good programmers write code that humans can understand.

  • Martin Fowler

I've got a couple of examples from past projects that I'd like to share, going from how I wrote algorithms when I was more concerned with where each datapoint sat, to how I wrote them after I discovered partitioning/chunking functions.

What is Partitioning/Chunking?

Partitioning and chunking, in the context of this article, is subdividing existing collections, streams, and enumerables/iterables into pieces more useful to your algorithm. This chunking can be done either by:

  • A fixed number of items per chunk

;; partition a list of 22 items into 5 lists of 4 items;
;; the last two items do not make a complete partition and are dropped.
(partition 4 (range 22))
;;=> ((0 1 2 3) (4 5 6 7) (8 9 10 11) (12 13 14 15) (16 17 18 19))
_.chunk(['a', 'b', 'c', 'd'], 2);
// => [['a', 'b'], ['c', 'd']]

_.chunk(['a', 'b', 'c', 'd'], 3);
// => [['a', 'b', 'c'], ['d']]
  • A mapping function to group contiguous items in the stream that map to the same value (from Elixir Docs)
Enum.chunk_by([1, 2, 2, 3, 4, 4, 6, 7, 7], &(rem(&1, 2) == 1))
# => [[1], [2, 2], [3], [4, 4, 6], [7, 7]]
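For comparison, both flavours of chunking can be sketched in a few lines of Python (my translation, assuming plain list inputs):

```python
from itertools import groupby

def partition(n, xs):
    """Fixed-size chunks; a trailing incomplete chunk is dropped,
    like Clojure's (partition n xs)."""
    return [xs[i:i + n] for i in range(0, len(xs) - n + 1, n)]

def chunk_by(pred, xs):
    """Group contiguous items whose pred result is equal,
    like Elixir's Enum.chunk_by/2."""
    return [list(group) for _, group in groupby(xs, key=pred)]
```

`itertools.groupby` only groups adjacent items, which is exactly the contiguous-runs behavior chunk_by needs.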

Divide and Conquer?

Basically, yes... divide and conquer.

Parsing Arrays from Binary Streams

If you've ever had to get raw data off of a wire or read in a proprietary/lesser-used binary format, then you've probably had to write your own parser for it. And more often than not, you've had to read an array whose length is defined by an earlier declared variable. Endianness aside, that in and of itself is not the most fun. The format is usually akin to:

Value                    Type     Size
Sync Word (0xC011FEFE)   int32    4
Timestamp (µsec)         int64    8
nElements                int32    4
Data elements            int32[]  4*nElements

Now, normally I would have built something to read in the packet (without validation) like this:

public class DataFrame {
    public readonly DateTime Timestamp;
    public readonly int[] Data;

    public DataFrame(DateTime timestamp, int[] data) {
        Timestamp = timestamp;
        Data = data;
    }

    public static DataFrame FromBytes(byte[] rawData) {
        var timeUsec = BitConverter.ToInt64(rawData, 4);    // Skip the sync word
        var timestamp = DateTime.FromFileTimeUtc(timeUsec);
        var count = BitConverter.ToInt32(rawData, 12);
        var data = new int[count];

        // Skip the first 16 bytes before we start reading the data
        for (int i = 0; i < count; i++) {
            data[i] = BitConverter.ToInt32(rawData, 16 + i * 4);
        }

        return new DataFrame(timestamp, data);
    }
}

Granted, I have to care a lot about where the header information is, but there has to be a way to make my intention clearer on reading the array of data into data. The same thing, written in F#, feels a little cleaner to me:

type DataFrame = { Timestamp : DateTime; Data : int array }

let fromBytes data = 
  let time = BitConverter.ToInt64(data, 4)
  let numItems = BitConverter.ToInt32(data, 12)
  {
    Timestamp = DateTime.FromFileTimeUtc time
    Data = data
           |> Array.skip 16
           |> Array.chunkBySize 4
           |> Array.take numItems
           |> Array.map (fun x -> BitConverter.ToInt32(x, 0))
  }

Chunking data into 4 byte increments allows me to very cleanly pass info to the BitConverter to make the conversion to 32-bit integers. I don't have to keep track of offsets, and have effectively removed the entire category of offset or "off by x" errors from my code!
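The same chunk-instead-of-offsets idea translates to other languages as well; here's a rough Python sketch using the struct module (assuming little-endian data and skipping validation, like the examples above):

```python
import struct

def from_bytes(raw):
    """Parse the frame format above: 4-byte sync word, int64 timestamp,
    int32 count, then count int32 data elements."""
    # skip the 4-byte sync word (pad bytes), then read timestamp and count
    time_usec, n_elements = struct.unpack_from("<4xqi", raw, 0)
    # chunk the payload into 4-byte integers; no manual offset bookkeeping
    data = [x for (x,) in struct.iter_unpack("<i", raw[16:16 + 4 * n_elements])]
    return time_usec, data
```

`struct.iter_unpack` walks the buffer in format-sized steps, which plays the same role as `Array.chunkBySize 4` in the F# version.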

No one expects off by one errors!

Only parts of the data matter

Let's face it: sometimes you just don't care about a lot of your data.


Let's say you're a delivery company and you want to make sure that your drivers are obeying speed limits. You don't really care about the times when they are, but when they go over the speed limit you might want to figure out why. Maybe they were going downhill and started to brake late. Perhaps they were texting and not paying attention. Who knows? But you're supposed to take the speed limit from your map API and the velocity from the GPS and figure out when, and for how long, a driver was going over the speed limit. Let's see what it might look like to find drivers that were speeding for more than one minute:

public class DriverData {
    public readonly DateTime Timestamp;
    public readonly double Speed;
    public readonly double SpeedLimit;

    public DriverData(DateTime time, double speed, double speedLimit) {
        Timestamp = time;
        Speed = speed;
        SpeedLimit = speedLimit;
    }
}

public class Driver {
    public readonly string Name;
    public IEnumerable<DriverData> DrivingHistory;

    public IEnumerable<(DateTime, TimeSpan)> SpeedingMoreThanAMinute() {

        // If the collection has no values or only one value, result is empty
        if (DrivingHistory == null || DrivingHistory.Count() <= 1) {
            yield break;
        }

        var isSpeeding = false;
        var speedingStart = new DateTime(0);
        var lastTime = new DateTime(0);
        var i = 0;

        foreach (var point in DrivingHistory) {
            var isSpeedingNow = point.Speed > point.SpeedLimit;

            if (isSpeedingNow) {
                if (!isSpeeding) {
                    speedingStart = point.Timestamp;
                }
                isSpeeding = true;
                lastTime = point.Timestamp;
            }
            else {
                if (isSpeeding) {     // We were speeding, and now we're not
                    var duration = lastTime - speedingStart;
                    if (duration.TotalSeconds >= 60) {
                        yield return (speedingStart, duration);
                    }
                }
                isSpeeding = false;
            }
        }
        yield break;
    }
}

Yuck. That's a lot of ugly logic! There are six possible conditions there, and we have to keep all of those internal state variables and their functions in our head as we write this out. We need all of this overhead to find out when our drivers are speeding? Surely you can't be serious!

Surely you can't be serious?

This where chunking by a function comes in handy. Trying this in Elixir (EDIT: slight fix thanks to Aleksei Matiushkin!) :

defmodule DriverData do
  defstruct [:timestamp, :speed, :speed_limit]

  def is_speeding_one_minute_plus(data) do
    data
    |> Enum.drop_while(&(&1.speed <= &1.speed_limit))
    |> Enum.chunk_by(&(&1.speed > &1.speed_limit))
    |> Enum.take_every(2)
    |> Enum.map(fn chunk ->
      %{start: hd(chunk).timestamp,
        duration: DateTime.diff(List.last(chunk).timestamp, hd(chunk).timestamp)}
    end)
    |> Enum.filter(&(&1.duration >= 60))
  end
end

Since this is one of five languages I've used in this post (six if you count English!), I'll walk you through what's going on. The Elixir |> operator takes the previously evaluated expression and uses it as the first argument in the next function call. This works the same way as Clojure's -> macro, but differently from F#'s |> operator (which passes the previous value as the last argument to the next function). I use Enum.drop_while/2 to drop all the points at the beginning where the driver isn't speeding, then Enum.chunk_by/2 to break the list into chunks based on whether or not the driver's speed exceeds the speed limit, and then Enum.take_every/2 to take every 2nd element (as specified by the 2 in the arguments). That leaves me with only the lists of times where the driver was speeding. I then use Enum.map/2 to turn those lists into maps with :start and :duration keys, and finally filter those based on a duration >= 60 seconds. We did the whole thing solely by operating on the chunks of data we cared about, filtered out what we didn't need, and turned the output into something we could use, all without using any conditional branches or any internal state.
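The same chunk-by-predicate pipeline translates readily; here is a rough Python equivalent of my own (using itertools.groupby and timestamps as plain seconds):

```python
from itertools import groupby

def speeding_one_minute_plus(history):
    """history: (timestamp_in_seconds, speed, speed_limit) tuples, in order.
    Returns (start, duration_in_seconds) for speeding stretches >= 60 s."""
    results = []
    # chunk contiguous points by the answer to "speeding right now?"
    for is_speeding, chunk in groupby(history, key=lambda p: p[1] > p[2]):
        chunk = list(chunk)
        if is_speeding:
            duration = chunk[-1][0] - chunk[0][0]
            if duration >= 60:
                results.append((chunk[0][0], duration))
    return results
```

As in the Elixir version, the grouping does all the work the flags and loop variables did in the C# version: no isSpeeding state, no start-time bookkeeping.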

Conclusion

Admittedly, it took me quite a while to wrap my head around chunking/partitioning data into useful bits. It's hard to make the transition from how I'm used to developing algorithms, after over a decade of procedural and OO programming, to thinking about everything as just data transformations. If the concept seems difficult for you, just understand that lots of other people, including me, have been there too! If you find yourself building a considerable amount of internal state just to parse some data out of an iterable/enumerable collection, use that as a code smell to see if chunking is right for your problem.

Happy coding!

Permalink

Notes on my first PHP job

This is kind of long and somewhat repetitive. That's sort of the idea, to show where a few bad decisions or structures will pop up everywhere, over and over.

Background

As a kid I dabbled in BASIC and Comal80 on a Commodore 64. The next computer was a 23 kilo 286 which managed Win 3.x over DOS and taught me batch scripting; then came others, and I picked up some C, HTML and Linux before high school. There I was taught some Java, C++ and more web development, including standard SQL and JavaScript. My impression was that the schooling was abysmally bad and that adults knew nothing about computers, especially since my 'thesis', a project consisting of a small OpenGL engine at the end of high school, was graded solely on the documentation because the teacher didn't manage to get hold of a C++ compiler and run it.

So I vowed to never make a living as a programmer and only use my interest in computers for evil. Instead I played around with a little cracking and a lot of malicious input into web apps over the next years, in a kind of grey hat not-very-legal manner.

Fast forward some eighteen years and I have changed my mind. At uni I studied some theology and law but never took a degree; in my spare time I passed through Python, Common Lisp, Clojure and then the more obscure post-Lisp language picolisp. This led to an offer I couldn't refuse: emigrate to the European south and work on modernising legacy PHP in an addiction business.

Having no ethical issues with profiting from semicompulsory behaviour, I wouldn't turn down Twitter either, and wanting the experience I jumped on the offer and relocated. At the time I had also found that my home country was turning weird and dangerous to leftists so it was attractive to leave and watch from a distance. It turned out to be great fun, absolutely terrifying and a very valuable experience.

Organisation

When I arrived I found myself in a group of roughly twenty developers including two bosses in one room where managers from elsewhere in the company considered it a virtue to take part in micromanagement of development. The result was an interesting, fragmented development process where specs and UX analysis seemed taboo, more or less replaced by vague instructions like 'port this' and 'add this feature' and a lot of silence following questions and suggestions.

After watching one boss integrating with a third party API for a couple of days I started getting assignments and worked away on them. Weeks passed without constructive feedback and it turned out that code review was not a part of the regular development work, and also that devs were explicitly forbidden review among peers.

Instead a kind of code review was implemented after one of the bosses moved to another country and started working remotely. However, it was done by skimming through commit diff snippets and commenting on string concatenation, or by recommending outright bad coding practices. Sometimes these practices were implemented in place of the existing, more adequate solution, just to avoid friction or conflict.

Since every developer chugged along alone, with little contact in between, the bosses also had to repeat themselves quite a bit, telling the same thing to different people, which appeared inconvenient and wasteful to me.

The lack of structured planning meant deadlines didn't exist; it was rather a culture of quick-fix-driven development and incremental design. No cohesion in the team was paralleled by little cohesion in the code output, also an interesting choice of managerial doctrine.

A result of this was that the code bases were quite irregular. One person liked warts on variables very much and no one else did. Another thought PHP lacked web templating and invented some in PHP, arguably the world's most widely adopted web templating language. One had failed at creating REST APIs for years and adopted GraphQL instead, and thus coerced others to do it as well. Some liked Eloquent while others preferred passing strings to the Fluent query builder for raw execution; yet others wrote a custom layer on a little of both.

When quitting or layoffs were ahead it was obvious but not spoken of; it seemed like the bosses thought the rest of us didn't notice anything. It made for quite weird moods among some colleagues and a lot of loose rumour when someone no longer worked there. Perhaps this was the idea; I found that hard to tell.

It took several months before I was given my first private talk about issues and performance, where I brought up concerns about this and some SQL injection vulnerabilities I'd found. I was told something about company policy, and that such vulnerabilities tend to get patched when they are found.

Leadership by fear, anxiety and alienation can be profitable, but only in sectors where quality is a non-issue, where machines are the producers and humans are used for pulling levers or something similar. It doesn't seem to translate very well to the art of programming.

Lessons learned

  • If your boss is more impressed by LoC output or obedience than by code quality and the ability to reason about it, start looking for another job, because when this becomes an issue you're going to get fired rather than promoted

  • If every branch in a repo has a style of its own you'll learn a lot, but it will also slow down bug hunting considerably, unless you wrote the code yourself

  • If there is no constructive feedback on your work you'll have trouble performing well

  • A manager has the power of the payslip while you have the power of trust; hence you're more likely to be fired if your colleagues start trusting you more than their managers

  • Don't work unpaid overtime, and if you did anyway, make sure to tell your superiors, so they know they owe you more than they thought

  • When confronted with incoherent code bases, consistently write in your own style to the best of your ability and try to make sure the result is good, rather than asking superiors or colleagues about it

  • Listen closely to what your colleagues say about past employees and what led to them no longer working there; this can help you avoid sudden changes in your own employment

Code

What I thought would be modernising legacy code eventually turned out to be conserving it and wrapping it in more and more projects. The core application was an elderly publishing platform, rewritten in its details over the years but never refactored. At that workplace, 'refactoring' happened to be used as a synonym for 'changing existing code' rather than in its common technical meaning, which at first was very confusing.

This core was written in a procedural style with some OOP on top, had ruminated for many years, and had a high degree of interconnectivity between files and functions. The resulting complexity showed up as big red circles in PhpMetrics and was quite challenging to anyone, regardless of previous experience. What had been described as a microservice architecture was rather a wrapping of bulky frameworks around new or old instances of this core.

At some point when developing new software, one needs to make strategic decisions about what kind of programmers one is writing code for. If you are the only one at the company who appreciates higher-order functions and currying as a way to break down problems, you should avoid them and write a series of classes and interfaces instead.
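To make the trade-off concrete, here is a minimal sketch (the discount rule and all names are hypothetical, invented for illustration) of the same logic expressed both ways in PHP:

```php
<?php
// Higher-order / curried style: a function that returns a function.
$discount = function (float $rate): callable {
    return fn (float $price): float => $price * (1 - $rate);
};
$tenPercentOff = $discount(0.10);
echo $tenPercentOff(200.0); // 180

// Class-and-interface style: the same rule, more verbose, but
// immediately recognisable to anyone raised on mainstream OOP.
interface PriceRule {
    public function apply(float $price): float;
}

class PercentageDiscount implements PriceRule {
    public function __construct(private float $rate) {}

    public function apply(float $price): float {
        return $price * (1 - $this->rate);
    }
}

$rule = new PercentageDiscount(0.10);
echo $rule->apply(200.0); // 180
```

Both versions do the same thing; the question is only which one the next reader of the code will recognise without help. (The sketch assumes PHP 8.0+ for constructor property promotion.)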

On this issue I found no leadership, and I struggled quite a bit before figuring out that this was my problem to solve and that I had to try to predict who might read my code later on. Not having a common idea of what kind of software shop the place should become seems bad for productivity. Instead, the environment was one where the old-timey for ($i = 0; $i < count($input); $i++) was still a viable pattern even though foreach was introduced in PHP 4, and a code review comment could consist of a demeaning remark about education and a link to the PHP manual on strings. Because apparently the problem with pasting unvalidated input into database queries is whether it is done with single or double quotes.
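For the record, here is a minimal sketch of both points: the idiomatic loop, and why the quotes debate misses the actual fix for injection, which is a parameterized query. The table and input values are invented for illustration, and an in-memory SQLite database stands in for the real one:

```php
<?php
$input = ['a', 'b', 'c'];

// The old-timey pattern, re-counting the array on every iteration:
for ($i = 0; $i < count($input); $i++) {
    echo $input[$i];
}

// The idiomatic replacement, available since PHP 4:
foreach ($input as $item) {
    echo $item;
}

// Quoting style is irrelevant to injection. What matters is that
// user input never becomes part of the SQL text at all:
$pdo = new PDO('sqlite::memory:');
$pdo->exec('CREATE TABLE users (name TEXT)');

$userInput = "Robert'); DROP TABLE users;--"; // hostile input

// Vulnerable, with single OR double quotes:
//   "SELECT * FROM users WHERE name = '" . $userInput . "'"
// Safe: the driver sends the data separately from the SQL.
$stmt = $pdo->prepare('SELECT * FROM users WHERE name = ?');
$stmt->execute([$userInput]);
$rows = $stmt->fetchAll();
```

With the prepared statement the hostile string is just a value to match against, and the users table survives intact.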

While the absence of strategic discussion and peer-based code review, together with quick-fix-driven development, makes it imperative for a junior to quickly learn new tools and code patterns, it also seems to carry a long-term cost: since successful debugging becomes reserved for the ruthless, one cannot just assign anyone to a ticket without risking that they'll spend a week going in circles.

I got increasingly better at quickly judging the code and found more and more serious issues to add to my private bug database; for some reason there was no common one. These findings alarmed me more and more, since I thought it too risky to have some of them in production. When I sounded the alarm, however, I was met with an indifference I had not anticipated, and later I learned that some of these issues had been known for years, and that those who made a fuss had consistently been laid off.

Lessons learned

  • When confronted with incoherent code, relax and follow the execution over and over in your mind until you understand it and why the output is what it is

  • Thinking in incoherent code is demanding; never skip lunch, because you'll get nowhere in the last hours of the day

  • If you aren't getting good criticism of your code you need to invent it yourself

  • The worse the code, the more power you need in your tooling, as well as a solid understanding of command-line scripting

  • Coding is an art so insensitive comments about it are sure to hurt, use this might wisely and with consideration

  • Don't give up bughunting until you have spent a fair bit of time breaking the application in creative ways
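The tooling bullet above can be made concrete. This is a sketch of the kind of command-line work it refers to, assuming the code lives in a git repository; the file path and search strings are hypothetical examples:

```shell
# Find every call site of a suspicious function across the repo:
git grep -n 'mysql_query' -- '*.php'

# Find the commits that introduced or removed a string (the "pickaxe"),
# useful when hunting for where a query changed:
git log -S 'SELECT * FROM users' --oneline -- '*.php'

# See who last touched a problematic region and when:
git blame -L 40,60 src/legacy/publish.php

# Rough complexity triage without PhpMetrics: the biggest files
# are usually where the big red circles live.
wc -l $(git ls-files '*.php') | sort -rn | head
```

None of this fixes bad code, but it shortens the path from symptom to suspect file dramatically.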

Conclusions

I strongly recommend anyone willing to do PHP but lacking work experience to take a job at a place that will hire you on just a few lines of crappy code and some talk about knowledge in areas that have nothing to do with the business they're in.

You'll learn fast, learn a lot, and see things you would otherwise never understand how badly you need to avoid until you've made the same mistakes yourself and messed up badly. This is very valuable, nearly impossible to learn on your own (arguably you could pick some of it up in OSS development), and will stay with you for the rest of your life as a programmer.

Next I'll be more useful and show the toolchain I got comfortable with.

Permalink

Copyright © 2009, Planet Clojure. No rights reserved.
Planet Clojure is maintained by Baishampayan Ghose.
Clojure and the Clojure logo are Copyright © 2008-2009, Rich Hickey.
Theme by Brajeshwar.