What is Reason?
The interesting thing about Reason is that it's not completely new. It can basically be seen as a new syntax for OCaml, so you can convert Reason to OCaml and back without any loss.
Now, people weren't too impressed by this because, as I mentioned above, many languages did this before, and OCaml didn't have a very familiar syntax either. But the creators of Reason found that the OCaml compiler is pretty good and wanted to harness this power for the Web, so they set out to create this new syntax while keeping full OCaml compatibility.
Why should anyone care?
OCaml is crazy powerful.
First, it has a sound static type system that gives really nice error messages, think Elm-level errors, with things like "you wrote X, did you mean Y?". But yes, some people don't care much about this, and those who do are already using Elm, Flow, or TypeScript. (By the way, Flow is built in OCaml.)
Second, the generated JavaScript is lean. Your Reason code is actually converted to (something like) a much smaller output: the compiler throws out modules you don't use (since you never import them yourself anyway, OCaml does it for you), tries to pre-calculate values, and eliminates dead code.
Finally, it has ESLint- and Prettier-like features already included (Prettier was in fact inspired by refmt, the Reason formatter): code hints and automatic formatting, so you don't have to think about indentation or semicolons anymore.
Also, there are some smaller things that help in everyday coding, like named function arguments, or using the type system instead of strings for decisions: things that compile down to a bool or a number in the end instead of string comparisons.
So I think it is worth a try, for my private projects at least.
This is a flash post, written after having a sudden thought while reading a book totally unrelated to Clojure.
How hard (or easy) is it to start with Neanderthal? This is only a rhetorical question, since I am not
interested in your answer. I'm interested in your opinion on another matter. Here it is.
There might be two camps of people when it comes to this:

1. It's easy. In addition to a Clojure environment, you need to set up MKL (and CUDA or OpenCL if you would like to do GPU computations), which was easy for our team. It would have been more convenient for deployment if that weren't necessary, but so what?

2. It's scary. What? I have to do some installation procedure? Huh, can I handle that?
The first type I understand (who'd have guessed), but the second type is a mystery to me.
I am still unsure whether a significant proportion of those developers really exists, or whether
they are just a product of the imagination of more experienced Clojurists. I'll skip the proverbial
total Clojure beginners for now (and I doubt they are the target audience for ML and HPC anyway), but pose this question:
Suppose a Clojure developer would like to program something related to machine learning or some
other kind of program that requires high performance computing. Suppose also that installing
the non-Clojure prerequisites for Neanderthal (Intel MKL / CUDA / OpenCL) is on the level
of the simplest installation of a game on a computer (GPU drivers + another trivial installer).
Then, if some aspiring user has a mental block or some other hesitation due to
the installation process (something that an average teenage gamer trivially does all the time), how
likely is it that that developer will be able to effectively use matrix computations and linear algebra
in their programs anyway?
I'm puzzled. Does that developer exist, or are they just a product of our imagination?
avoids DCE (so the entire runtime library remains available)
But these properties also hold for the whitespace-only and simple Closure Compiler optimization modes. (They only fail for advanced.) Because of this, self-hosted ClojureScript supports up to :simple.
Planck 2.7.0 and Replete 1.12 now ship with Closure Compiler optimizations applied ahead-of-time to their bundled namespaces. You can readily see this if you (set! *print-fn-bodies* true) and print some of the functions in cljs.core.
Using simple also makes some things in the bundled ClojureScript standard library run a bit faster: In Planck, (reduce + (range 1e7)) now completes in about 0.10 s on my machine, whereas previously it would take roughly 0.19 s.
The way Planck internally uses simple is not to concatenate all of the bundled scripts into a single monolithic script to be loaded at startup time. Instead, compiled scripts are maintained separately, per namespace, and lazily loaded. This helps keep startup latency low. With simple
time planck -e 1
now completes in about 0.19 s on my machine, whereas previously it would take 0.23 s (17% faster).
Optimizing Your Scripts with Closure Compiler
Planck exposes the capability to apply Closure Compiler optimizations to Planck scripts. A new command-line option -O / --optimizations is introduced which can take values of none, whitespace, or simple.
If you set this option to anything other than none (the default), then any ClojureScript code loaded from a namespace during Planck's execution will additionally have Closure Compiler applied to it using the specified level.
To illustrate the effects simple can have on the code generated in Planck, consider this function, which presents opportunities for variable name shortening and constant folding:
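The original function isn't reproduced here, but constant folding itself is easy to observe in any optimizing compiler. As a loose analogy (a Python sketch, not Closure Compiler output), CPython's own bytecode compiler folds constant arithmetic at compile time:

```python
# CPython folds constant expressions when compiling a function, the same
# kind of rewrite Closure Compiler's simple mode applies to JavaScript.
def seconds_per_day():
    return 60 * 60 * 24

# The folded result is stored as a single constant in the compiled code,
# so no multiplication happens at call time:
assert 86400 in seconds_per_day.__code__.co_consts
assert seconds_per_day() == 86400
```

Closure Compiler's simple mode performs the same kind of rewrite on JavaScript, on top of shortening local variable names.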
I hope you enjoy this new capability, either to make your own scripts run faster, or simply as another avenue for learning about what Closure Compiler has to offer for ClojureScript code!
It is a unique privilege of ours to look away from Charlottesville, to
turn a blind eye to clear hatred and bigotry, to pretend that oppression
historically and presently is a fantasy.
We can do that because, historically, we’re the ones who have
perpetrated it, or at the very least have not been targeted by it.
Because of that, your self-care plan today and tomorrow can include
turning off the news, shutting down social media, and living life as if
nothing happened.
Don’t do it.
Because for every one of us who can, there are thousands of other
people, no less made in the image of God and of no less value or worth
than you or I, who can’t. They don’t have that choice because they’ve
been and are being oppressed. They, every day, are forced to confront
these realities because of who they are, not because of what they
believe.
So don’t look away from Charlottesville. Don’t say, “Whoa! Those are
some bad apples in the alt-right. Glad they’re not normal.” and then
move on with your life. This is a perfect opportunity to look on this
horror and be shocked by it. To be hurt by it. To let the barest
reflection of the emotion felt by people groups who have been truly
oppressed every day enter into you. To feel righteous and holy anger at
the systems and the people that allow it to persist. And to get up and
do something about it.
If you’re confused about what the big deal is, find a non-WASP and ask them
how they’re feeling. Start by asking questions and just listening. Don’t
be so quick to defend yourself or your heritage. Just listen and try to
process it and try to realize that we can and must do better than this.
Don’t miss this opportunity to reflect and to love and to act.
Over the past few years, much of the progress in deep learning for computer vision can be boiled down to just a handful of neural network architectures. Setting aside all the math, the code, and the implementation details, I wanted to explore one simple question: how and why do these models work?
The VGG networks, along with the earlier AlexNet from 2012, follow the now archetypal layout of basic conv nets: a series of convolutional, max-pooling, and activation layers before some fully-connected classification layers at the end. MobileNet is essentially a streamlined version of the Xception architecture optimized for mobile applications. The remaining three, however, truly redefine the way we look at neural networks.
The rest of this post will focus on the intuition behind the ResNet, Inception, and Xception architectures, and why they have become building blocks for so many subsequent works in computer vision.
ResNet was born from a beautifully simple observation: why do very deep nets perform worse as you keep adding layers?
Intuitively, deeper nets should perform no worse than their shallower counterparts, at least at train time (when there is no risk of overfitting). As a thought experiment, let’s say we’ve built a net with n layers that achieves a certain accuracy. At minimum, a net with n+1 layers should be able to achieve the exact same accuracy, if only by copying over the same first n layers and performing an identity mapping for the last layer. Similarly, nets of n+2, n+3, and n+4 layers could all continue performing identity mappings and achieve the same accuracy. In practice, however, these deeper nets almost always degrade in performance.
The authors of ResNet boiled these problems down to a single hypothesis: direct mappings are hard to learn. And they proposed a fix: instead of trying to learn an underlying mapping from x to H(x), learn the difference between the two, or the “residual.” Then, to calculate H(x), we can just add the residual to the input.
Say the residual is F(x)=H(x)-x. Now, instead of trying to learn H(x) directly, our nets are trying to learn F(x)+x.
This gives rise to the famous ResNet (or “residual network”) block you’ve probably seen:
Each “block” in ResNet consists of a series of layers and a “shortcut” connection adding the input of the block to its output. The “add” operation is performed element-wise, and if the input and output are of different sizes, zero-padding or projections (via 1x1 convolutions) can be used to create matching dimensions.
If we go back to our thought experiment, this simplifies our construction of identity layers greatly. Intuitively, it’s much easier to learn to push F(x) to 0 and leave the output as x than to learn an identity transformation from scratch. In general, ResNet gives layers a “reference” point — x — to start learning from.
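As a minimal sketch (NumPy, with made-up layer sizes; real ResNet blocks use convolutions and batch normalization), a residual block simply adds its input back onto the output of its inner layers:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                                    # illustrative feature size
W1 = rng.normal(scale=0.1, size=(d, d))  # stand-ins for the block's layers
W2 = rng.normal(scale=0.1, size=(d, d))

def relu(x):
    return np.maximum(x, 0)

def residual_block(x):
    f = relu(x @ W1) @ W2    # F(x), the learned residual
    return relu(f + x)       # H(x) = F(x) + x via the shortcut connection

x = relu(rng.normal(size=d))             # a non-negative input vector
assert residual_block(x).shape == (d,)

# If training pushes the inner weights to zero, the block collapses to an
# identity mapping: learning "do nothing" is just learning F(x) = 0.
W1[:], W2[:] = 0, 0
assert np.allclose(residual_block(x), x)
```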
This idea works astoundingly well in practice. Previously, deep neural nets often suffered from the problem of vanishing gradients, in which gradient signals from the error function decreased exponentially as they backpropagated to earlier layers. In essence, by the time the error signals traveled all the way back to the early layers, they were so small that the net couldn’t learn. However, because the gradient signal in ResNets could travel back directly to early layers via shortcut connections, we could suddenly build 50-layer, 101-layer, 152-layer, and even (apparently) 1000+ layer nets that still performed well. At the time, this was a huge leap forward from the previous state-of-the-art, which won the ILSVRC 2014 challenge with 22 layers.
ResNet is one of my personal favorite developments in the neural network world. So many deep learning papers come out with minor improvements from hacking away at the math, the optimizations, and the training process without thought to the underlying task of the model. ResNet fundamentally changed the way we understand neural networks and how they learn.
The 1000+ layer net is open-source! I would not really recommend you try re-training it, but…
If you’re feeling functional and a little frisky, I recently ported ResNet50 to the open-source Clojure ML library Cortex. Try it out and see how it compares to Keras!
If ResNet was all about going deeper, the Inception Family™ is all about going wider. In particular, the authors of Inception were interested in the computational efficiency of training larger nets. In other words: how can we scale up neural nets without increasing computational cost?
The original paper focused on a new building block for deep nets, a block now known as the “Inception module.” At its core, this module is the product of two key insights.
The first insight relates to layer operations. In a traditional conv net, each layer extracts information from the previous layer in order to transform the input data into a more useful representation. However, each layer type extracts a different kind of information. The output of a 5x5 convolutional kernel tells us something different from the output of a 3x3 convolutional kernel, which tells us something different from the output of a max-pooling kernel, and so on and so on. At any given layer, how do we know what transformation provides the most “useful” information?
Insight #1: why not let the model choose?
An Inception module computes multiple different transformations over the same input map in parallel, concatenating their results into a single output. In other words, for each layer, Inception does a 5x5 convolutional transformation, and a 3x3, and a max-pool. And the next layer of the model gets to decide if (and how) to use each piece of information.
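Shape-wise, the module is just a channel-axis concatenation of its branches. A NumPy sketch with made-up branch widths (real branches would be actual convolutions and pooling):

```python
import numpy as np

rng = np.random.default_rng(0)
H, W, C = 28, 28, 64                     # illustrative input size
x = rng.normal(size=(H, W, C))

# Stand-ins for the outputs of each parallel branch over the same input x:
b_1x1  = rng.normal(size=(H, W, 32))     # as if from 1x1 convolutions
b_3x3  = rng.normal(size=(H, W, 64))     # as if from 3x3 convolutions
b_5x5  = rng.normal(size=(H, W, 16))     # as if from 5x5 convolutions
b_pool = x                               # as if from a same-size max-pool

# The module's output stacks every branch along the channel axis:
out = np.concatenate([b_1x1, b_3x3, b_5x5, b_pool], axis=-1)
assert out.shape == (H, W, 32 + 64 + 16 + C)
```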
The increased information density of this model architecture comes with one glaring problem: we’ve drastically increased computational costs. Not only are large (e.g. 5x5) convolutional filters inherently expensive to compute, stacking multiple different filters side by side greatly increases the number of feature maps per layer. And this increase becomes a deadly bottleneck in our model.
Think about it this way. For each additional filter added, we have to convolve over all the input maps to calculate a single output. See the image below: creating one output map from a single filter involves computing over every single map from the previous layer.
Let’s say there are M input maps. One additional filter means convolving over M more maps; N additional filters means convolving over N*M more maps. In other words, as the authors note, “any uniform increase in the number of [filters] results in a quadratic increase of computation.” Our naive Inception module just tripled or quadrupled the number of filters. Computationally speaking, this is a Big Bad Thing.
This leads to insight #2: using 1x1 convolutions to perform dimensionality reduction. In order to solve the computational bottleneck, the authors of Inception used 1x1 convolutions to “filter” the depth of the outputs. A 1x1 convolution looks at only one spatial position at a time, but across all of its channels, so it can combine cross-channel information and compress it down to fewer feature maps. For example, using 20 1x1 filters, an input of size 64x64x100 (with 100 feature maps) can be compressed down to 64x64x20. By reducing the number of input maps, the authors of Inception were able to stack different layer transformations in parallel, resulting in nets that were simultaneously deep (many layers) and “wide” (many parallel operations).
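Back-of-the-envelope multiply counts make the savings concrete (illustrative numbers: an H x W input with M feature maps, N output maps, R reduced maps; stride and padding ignored):

```python
H, W, M, N, R = 64, 64, 100, 128, 20

# Naive: a 5x5 convolution straight over all M input maps.
naive = H * W * 5 * 5 * M * N

# Inception-style: 1x1 convolutions reduce depth to R maps first,
# then the 5x5 convolution runs over the much thinner input.
reduced = H * W * 1 * 1 * M * R + H * W * 5 * 5 * R * N

assert reduced < naive               # the bottleneck really is cheaper
print(f"speedup: {naive / reduced:.1f}x")
```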
How well did this work? The first version of Inception, dubbed “GoogLeNet,” was the 22-layer winner of the ILSVRC 2014 competition I mentioned earlier. Inception v2 and v3 were developed in a second paper a year later, and improved on the original in several ways — most notably by refactoring larger convolutions into consecutive smaller ones that were easier to learn. In v3, for example, the 5x5 convolution was replaced with 2 consecutive 3x3 convolutions.
Inception rapidly became a defining model architecture. The latest version of Inception, v4, even threw in residual connections within each module, creating an Inception-ResNet hybrid. Most importantly, however, Inception demonstrated the power of well-designed “network-in-network” architectures, adding yet another step to the representational power of neural networks.
The original Inception paper literally cites the “we need to go deeper” internet meme as an inspiration for its name. This must be the first time knowyourmeme.com got listed as the first reference of a Google paper.
The second Inception paper (with v2 and v3) was released just one day after the original ResNet paper. December 2015 was a good time for deep learning.
Xception stands for “extreme inception.” Rather like our previous two architectures, it reframes the way we look at neural nets — conv nets in particular. And, as the name suggests, it takes the principles of Inception to an extreme.
Here’s the hypothesis: “cross-channel correlations and spatial correlations are sufficiently decoupled that it is preferable not to map them jointly.”
What does this mean? Well, in a traditional conv net, convolutional layers seek out correlations across both space and depth. Let’s take another look at our standard convolutional layer:
In the image above, the filter simultaneously considers a spatial dimension (each 2x2 colored square) and a cross-channel or “depth” dimension (the stack of four squares). At the input layer of an image, this is equivalent to a convolutional filter looking at a 2x2 patch of pixels across all three RGB channels. Here’s the question: is there any reason we need to consider both the image region and the channels at the same time?
In Inception, we began separating the two slightly. We used 1x1 convolutions to project the original input into several separate, smaller input spaces, and from each of those input spaces we used a different type of filter to transform those smaller 3D blocks of data. Xception takes this one step further. Instead of partitioning input data into several compressed chunks, it maps the spatial correlations for each output channel separately, and then performs a 1x1 (pointwise) convolution across channels to capture cross-channel correlations.
The author notes that this is essentially equivalent to an existing operation known as a “depthwise separable convolution,” which consists of a depthwise convolution (a spatial convolution performed independently for each channel) followed by a pointwise convolution (a 1x1 convolution across channels). We can think of this as looking for correlations across a 2D space first, followed by looking for correlations across a 1D space. Intuitively, this 2D + 1D mapping is easier to learn than a full 3D mapping.
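Counting weights shows why the factorization pays off (illustrative sizes: K x K kernels, M input channels, N output channels):

```python
K, M, N = 3, 128, 256

standard = K * K * M * N        # every filter spans all M input channels
depthwise = K * K * M           # one K x K spatial filter per input channel
pointwise = 1 * 1 * M * N       # a 1x1 convolution then mixes the channels
separable = depthwise + pointwise

assert separable < standard     # far fewer parameters for similar expressivity
print(f"standard: {standard:,}  separable: {separable:,}")
```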
And it works! Xception slightly outperforms Inception v3 on the ImageNet dataset, and vastly outperforms it on a larger image classification dataset with 17,000 classes. Most importantly, it has the same number of model parameters as Inception, implying a greater computational efficiency. Xception is much newer (it came out in April 2017), but as mentioned above, its architecture is already powering Google’s mobile vision applications through MobileNet.
The author of Xception is also the author of Keras. Francois Chollet is a living god.
That’s it for ResNet, Inception, and Xception! I firmly believe in having a strong intuitive understanding of these networks, because they are becoming ubiquitous in research and industry alike. We can even use them in our own applications with something called transfer learning.
Transfer learning is a technique in machine learning in which we apply knowledge from a source domain (e.g. ImageNet) to a target domain that may have significantly fewer data points. In practice, this generally involves initializing a model with pre-trained weights from ResNet, Inception, etc. and either using it as a feature extractor, or fine-tuning the last few layers on a new dataset. With transfer learning, these models can be re-purposed for any related task we want, from object detection for self-driving vehicles to generating captions for video clips.
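A minimal sketch of that recipe in plain NumPy (everything here is a stand-in: a fixed random projection plays the role of the frozen pre-trained layers, and only a tiny logistic-regression head is trained):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 32))                 # small target-domain dataset
y = (X[:, 0] + X[:, 1] > 0).astype(float)      # toy binary labels

W_frozen = rng.normal(size=(32, 16))           # "pre-trained" layers, never updated
feats = np.tanh(X @ W_frozen)                  # frozen feature extraction

w, b = np.zeros(16), 0.0                       # the trainable head

def loss():
    p = 1 / (1 + np.exp(-(feats @ w + b)))
    return -np.mean(y * np.log(p + 1e-9) + (1 - y) * np.log(1 - p + 1e-9))

before = loss()
for _ in range(300):                           # fine-tune only the head
    p = 1 / (1 + np.exp(-(feats @ w + b)))
    w -= 0.2 * feats.T @ (p - y) / len(y)
    b -= 0.2 * np.mean(p - y)

assert loss() < before                         # the head learned from frozen features
```

In practice you would load real pre-trained weights (e.g. via Keras) instead of a random projection, but the shape of the recipe is the same: freeze the extractor, train the head.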
To get started with transfer learning, Keras has a wonderful guide to fine-tuning models here. If it sounds interesting to you, check it out — and happy hacking!
I've said before that immutable data is easier to reason about because it's more like the real world. That seems counterintuitive because the world is mutable. I can move things around, erase a chalkboard, and even change my name. How, then, is immutable data more like the real world than mutable data?
Here are two stories that will explain my point. Imagine I take a clean sheet of looseleaf paper, the kind with blue lines. I take out a nice, fat marker, and write the word banana in big letters. Then I fold it up and hand it to Susan. "This is very important", I say, "so listen carefully. Put this in your pocket and don't let anyone touch it until you get home. Then you can take it out."
Susan doesn't understand why, but she puts it in her pocket. She makes sure nobody goes in her pocket all day. When she gets home, she finds the paper folded there like she remembers it. She opens it up and reads it.
What will be written on it? (This is not a trick question.)
Everyone will know the answer. It's the word banana. But how did you know? And how are you so sure? In fact, the world fulfills our expectations so many times, all day long, and we don't even think about it. We've learned these tendencies of the universe since we were babies. This kind of reasoning may even be baked into the structure of our nervous systems. Our understanding of the world is intuitive. You might say "we can reason about it".
Now, let me tell you another story. Let's say I construct an object P (for Paper), and set a property on it to the string "banana". I pass the object to a method on a different object S (for Susan), which stores that object in a private property. The object S keeps that property protected, never letting anything write to it, and it makes sure that no method is ever called that can write to it. After a while, object S prints out the contents of P.
What do you expect to be stored in P? (This also is not a trick question.)
The answer is "I don't know". It could be "banana". It could be something else if the object P was modified by something else that had a reference to it. We can't know what it will say because it's mutable and that lets in magical effects like "action at a distance". We can no longer reason about it.
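The two stories translate directly into code. A Python sketch of the difference (hypothetical names, of course):

```python
# Mutable "paper": Susan stores a reference, and anyone else holding the
# same reference can scribble on it behind her back.
paper = ["banana"]
susans_pocket = paper               # Susan guards the reference, not the value
paper[0] = "apple"                  # action at a distance, elsewhere in the program
assert susans_pocket == ["apple"]   # her "protected" paper changed anyway

# Immutable paper: the value Susan holds can never change underneath her.
note = "banana"
susans_note = note
note = "apple"                      # rebinding makes a new value; the old survives
assert susans_note == "banana"
```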
Many objects in our universe are constantly changing on their own. The locations of moving cars, the contents of a box of uranium (through decay), the brightness of a candle. Each time we look at it, we expect it to be different. But so, too, are many objects basically static unless acted upon by something else. A house, a glass of water, a piece of paper. The object persists virtually unchanged in any way I would care about. And this is the part of the world that immutable objects model better than mutable objects.
The thing is, once you share two references to the same mutable object, you're way outside the world we live in and know. Two references to the same object would be like two people having the same paper in their pockets at the same time. You could do shared-nothing like Erlang does (everything is copied as it passes between actors), but Erlang also chooses to make things immutable. You could also always copy objects before you store them. This practice requires tough discipline and also undoes any efficiency advantage you gain by using mutable data in the first place.
Mutable objects are important. They can model very common and useful things. For instance, you might model a filing cabinet as a mutable object because you can take stuff out and put stuff in. But you will want to model the papers inside as immutable because they can't change while they're inside. Having both is what makes it work.
Clojure makes it very easy to separate the two. Data structures like lists, vectors, hash maps, and sets are all immutable in Clojure. And Clojure has a toolbox of mutable things, too: atoms, refs, agents, and vars. These are good for modeling the current state of something shared and changing.
Well, I hope these stories explain why immutable objects do act like real world objects in many useful cases, and how this is a key part of why people claim that Clojure code is easier to reason about. If you like this idea and want to explore it more, check out The PurelyFunctional.tv Newsletter. It's a weekly newsletter for