Lazy Sequences and Java Streams

When Clojure 1.12.0 was released in September, the release notes listed a lot of cool features, such as virtual threads, but one caught my eye in particular: support for Java Streams and Java functional interfaces. That was just what I needed at work to port one of our internal libraries to Java, to make it easier to reuse from projects written in different JVM languages. Little did I know that Java Streams suck.

The problem

We have a small library that implements queues as lazy sequences. It has both producer and consumer parts, so several projects can reuse the same library, one as a producer and another as a consumer. Today we're interested in the consumer part.

One of the features this library has is that we can have several separate queues, and we can take items from all of them in a fixed order. Each queue is sorted, so when we have multiple queues, we need to get elements in order and still in a lazy way. Let’s look at some examples:

(def queue1
  (filter odd? (range)))

(def queue2
  (filter even? (range)))

Here, I created two infinite sequences of sorted numbers, just as an example:

user> (take 10 queue1)
(1 3 5 7 9 11 13 15 17 19)
user> (take 10 queue2)
(0 2 4 6 8 10 12 14 16 18)

It's not that different from the actual queues used in our library, as each sequence is potentially infinite and sorted. What we need as a result is a single queue with the items of both queues merged and ordering retained:

(0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 ...)

Lazy Seq solution

To achieve this we can implement a merge sort function that will produce a single queue from multiple queues:

(defn merge-sort-sequences [sequences]
  (lazy-seq
   (let [[sequence & sequences] (sort-by first sequences)]
     (cons (first sequence)
           (merge-sort-sequences (conj sequences (rest sequence)))))))

Note that I've omitted some checks, for things like empty sequences, for the sake of simplicity.
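
For reference, a version with those guards might look something like this (one possible way to handle exhausted sequences; not necessarily the library's actual code):

(defn merge-sort-sequences [sequences]
  (lazy-seq
   ;; drop queues that have run out
   (let [sequences (remove empty? sequences)]
     (when (seq sequences)
       (let [[sequence & sequences] (sort-by first sequences)]
         (cons (first sequence)
               (merge-sort-sequences (conj sequences (rest sequence)))))))))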

In short, it evaluates like this:

  1. Let’s say, we have [(1 3 5 7...) (0 2 4 6...)] as an input;
  2. We sort it by the first element of each sequence: [(0 2 4 6...) (1 3 5 7...)];
  3. We take the first element from the first queue, 0, and cons it onto a recursive call, passing back all of the sequences, with the first element removed from the first queue;
    • Since the function returns a lazy sequence, the recursive call is effectively trampolined.

And we can see that it works:

user> (take 10 (merge-sort-sequences [queue1 queue2]))
(0 1 2 3 4 5 6 7 8 9)
user> (take 20 (merge-sort-sequences
                [(take-nth 3 (range))          ; (0 3 6 9 12 ...)
                 (take-nth 3 (drop 1 (range))) ; (1 4 7 10 13 ...)
                 (take-nth 3 (drop 2 (range))) ; (2 5 8 11 14 ...)
                ]))
(0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19)
user> (take 20 (merge-sort-sequences
                [(filter odd? (range))
                 (filter even? (drop 20 (range)))]))
   (1 3 5 7 9 11 13 15 17 19 20 21 22 23 24 25 26 27 28 29)
;;  \_first_sequence_only_/  \___both_sequences_merged___/

Of course, it won’t work if sequence items are in arbitrary order, but that’s not the case for our implementation.

Java Stream solution

Now, as I mentioned, I had to rewrite this library in Java so that it could be reused from projects written in other JVM languages. It is possible to use a Clojure library from, say, Scala, but it is quite tedious: we would either have to AOT-compile the library to a JAR and expose a bunch of methods via gen-class, or load Clojure at runtime, compile the sources, and use them that way, which introduces so many hoops to jump through that hardly anyone on our team would want to do it. And the library in question is small, about 400 LOC including documentation strings and comments, so rewriting it in Java wouldn't be that hard.

Or so I thought.

I'm not a Java programmer, and I have very limited knowledge of Java. Thankfully, Java is a simple language, unless you wrap everything in an unnecessary number of classes, use abstract factory builders, and so on. This library, thankfully, required none of those cursed patterns - it's a single static class with no need for instantiation, just a bunch of pure methods.

So, knowing that Java has a Stream class and having looked at its interface, I thought I would be able to implement this library. And in truth, it wasn't a problem until I got to the lazy merge sort part. That's when it started looking cursed.

First of all, a Stream is not a data structure - you can't work with it as if it were data; you have to build pipelines, and then either consume the stream or pass it around. Moreover, most examples use streams as an intermediate transformation step and return a collection, or suggest passing in a transformer instead of returning a stream, so I wonder where this is coming from:

Java APIs increasingly return Streams and are hard to consume because they do not implement interfaces that Clojure already supports, and hard to interop with because Clojure doesn’t directly implement Java functional interfaces.

From Clojure 1.12.0 release notes.

Anyhow, I’ve created functions that return streams of items in the queue, much like the ones I showed above. So it was time to implement the merge sort. Let’s look at the skeleton of our function:

public static Stream<Object> mergeSortStreams(Stream<Object>[] streams) {
	return Stream.generate(() -> {
        // implement merge sort somehow...
	});
}

Stream.generate(Supplier) produces an infinite stream, generated by calling the supplier. That's basically what we need here: we can do our sorting inside the supplier. However, there's a problem - Streams are not data structures, and there's no way to take one element without consuming the stream. I mean, there's stream.findFirst(), but if we look at its documentation, we'll see that:

Optional<T> findFirst()

Returns an Optional describing the first element of this stream, or an empty Optional if the stream is empty. If the stream has no encounter order, then any element may be returned.

This is a short-circuiting terminal operation.

Returns: an Optional describing the first element of this stream, or an empty Optional if the stream is empty
Throws: NullPointerException - if the element selected is null

What is a short-circuiting terminal operation, you ask? A terminal operation may traverse the stream to produce a result or a side effect. A short-circuiting terminal operation does the same, but even when presented with an infinite stream, it can finish in finite time. And after a terminal operation is performed, the stream is considered consumed and can no longer be used.

But even if we could use findFirst without closing the stream, it wouldn't be useful to us, because remember - we need to take the first element from each stream and sort the streams themselves by it. And findFirst is a destructive operation: it consumes the element from the stream.

In Clojure, sequences are immutable - all we can do is construct a new sequence if we wish to add items, or take its tail if we need fewer items. Thus first does nothing to the sequence in question: we can freely call it on any sequence, obtain an element, do stuff with it, and be done. You can think of first as an iterator peek, where you look at the next element of the iterator without advancing it.
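
For example, calling first on the same sequence twice yields the same element, and the sequence itself is untouched:

user> (def s (map inc (range)))
#'user/s
user> (first s)
1
user> (first s)
1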

Thankfully, we can convert a Stream to an Iterator, with it staying lazy and potentially infinite. Only, there's no peek method in Java's base Iterator interface. Oh well.

Well, we can always implement our own wrapper for the Iterator class:

package org.example;

import java.util.Iterator;

public class PeekingIterator<T> implements Iterator<T> {
    private final Iterator<T> iterator;
    private boolean peeked = false;
    private T peekedItem;

    public PeekingIterator(Iterator<T> it) { iterator = it; }

    public T next() {
        // Hand out the peeked item first, if there is one.
        if (peeked) {
            peeked = false;
            T tmp = peekedItem;
            peekedItem = null;
            return tmp;
        } else
            return iterator.next();
    }

    public boolean hasNext() {
        // A peeked item still counts as a next element, even if the
        // underlying iterator is already exhausted.
        return peeked || iterator.hasNext();
    }

    public T peek() {
        // Look at the next element without (observably) advancing.
        if (!peeked && iterator.hasNext()) {
            peeked = true;
            peekedItem = iterator.next();
        }
        return peekedItem;
    }
}

Of course, we could use a dependency, but due to circumstances we have to keep the number of dependencies as low as possible. Now we can get back to implementing our merge sort:

public static Stream<Object> mergeSortStreams(Stream<Object>[] streams) {
	List<PeekingIterator<Object>> iterators = new ArrayList<>();
	for (Stream<Object> s : streams) {
		iterators.add(new PeekingIterator<>(s.iterator()));
	}
	return Stream.generate(() -> {
		iterators.sort(Comparator.comparingInt(a -> (Integer) a.peek()));
		return iterators.getFirst().next();
	});
}

Testing it with Clojure examples from above reveals that it works as expected:

System.out.println(
		mergeSortStreams(
				new Stream[]{
						Stream.iterate(1, i -> i + 2),
						Stream.iterate(0, i -> i + 2)
				})
				.limit(20)
				.collect(Collectors.toList())
);
// => [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
System.out.println(
		mergeSortStreams(
				new Stream[]{
						Stream.iterate(0, i -> i + 3),
						Stream.iterate(1, i -> i + 3),
						Stream.iterate(2, i -> i + 3)
				})
				.limit(20)
				.collect(Collectors.toList())
);
// => [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
System.out.println(
		mergeSortStreams(
				new Stream[]{
						Stream.iterate(1, i -> i + 2),
						Stream.iterate(20, i -> i + 2)
				})
				.limit(20)
				.collect(Collectors.toList())
);
// => [1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29]

Obviously, I’m omitting most of the things in the code here, but this should be enough to give you an idea of what I had to do.

Sequences vs Streams

Let’s compare our versions.

Clojure:

(defn merge-sort-sequences [sequences]
  (lazy-seq
    (let [[sequence & sequences] (sort-by first sequences)]
      (cons (first sequence)
            (merge-sort-sequences (conj sequences (rest sequence)))))))

A pretty straightforward way of doing this kind of operation, in my opinion.

Java:

public static Stream<Object> mergeSortStreams(Stream<Object>[] streams) {
	List<PeekingIterator<Object>> iterators = new ArrayList<>();
	for (Stream<Object> s : streams) {
		iterators.add(new PeekingIterator<>(s.iterator()));
	}
	return Stream.generate(() -> {
		iterators.sort(Comparator.comparingInt(a -> (Integer) a.peek()));
		return iterators.getFirst().next();
	});
}

The general idea is the same, but the fact that we had to create a peeking iterator and store it in a list is disturbing. In Clojure, we manipulate lazy sequences as if they were ordinary data. In Java, we don't have any real data, so we have to make our own way of accessing it. We have to keep an intermediate array list and sort it every time an item of the stream is generated. The same happens in Clojure, of course, when we call sort-by - and there it is arguably worse, as in Java we create the list once and sort it in place, while in Clojure we create a new list every time. The JVM is good at collecting garbage, though, and this sequence is consumed far faster than it takes to clean up the garbage, but it is a thing to consider. Java streams also don't specify anything about ordering and can be processed in parallel, so I'm not sure how my PeekingIterator would behave in that case.

And all of that is simply because Java streams are a half-baked interface made in a rush - or at least it feels like that. Yes, they support data pipelines with their own implementations of map, filter, etc., but it makes me appreciate Clojure even more, because Clojure has one implementation of map that works across everything (and it has transducers, too). In a more complete version of this Java library, we have to map over streams, transform them into iterators just to do things that are not implemented for streams, transform the iterators back into streams, and so forth. The Clojure version is much more straightforward and concise.

In hindsight, I wish it were easier to use Clojure from other JVM languages. It would certainly have saved me the time it took to re-implement everything in Java.

In the end, I hooked the old Clojure implementation up to the Java version as its core, and retained the interface by converting streams to sequences via stream-seq!. It passed all of the library's tests, so I moved on.
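
For illustration, the glue can be as thin as something like this, assuming the Java entry point is a static Queues/mergeSortStreams method (the class and method names here are placeholders, not the library's actual ones):

(defn merge-sort-queues [queues]
  (stream-seq!
   (Queues/mergeSortStreams (into-array java.util.stream.Stream queues))))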

Permalink

Half Dumb Datalog in 30 loc

by cgrand (X 🦣)

Today, a follow-up to Writing the Worst Datalog Ever in 26loc, maybe even the start of a series.🍿

Our 26-loc Datalog is naive. Nothing personal, it's a technical term: each iteration in saturate rederives all the facts derived plus hopefully some new ones. The last iteration is guaranteed to be 100% redundant since by definition it's the one which derived nothing new!

Let's engineer a bad case (not that it's difficult given how purposefully unsophisticated our code is):

user=> (q 
         (set (for [i (range 50)] [:edge (inc i) i]))
         '([i] [:edge+ 1 i])
         '[([:edge+ i j] [:edge i j])
           ([:edge+ i j] [:edge+ i k] [:edge k j])])
#{[0]}

It runs in 10 seconds. Replace 50 by 100 and it runs in 4 minutes! 🐌

The solution to all these useless rederivations is well known and called semi-naive evaluation!

Don't forget: when we're not busy writing silly Datalog implementations, we are available to help you on your Clojure projects or working on ClojureDart or on our app Paktol (The positive spending tracker where money goes up!).

Semi-naive evaluation

The idea behind semi-naive evaluation is to not keep rederiving from the same facts at each iteration. So the rule is to only consider facts which can be derived by using at least one fresh fact (derived during the previous iteration).

Changes to saturate

The first step is to split facts in two: fresh facts (dfacts—the d stands for diff or delta) and old news (facts).

In the saturate loop, we initialize dfacts with the initial set of facts because at the start of the computation everything is fresh. We keep looping while dfacts' is not empty.

We will modify match-rule to only return facts derived by using at least one fresh fact. However we'll still have to post-process its returned values with (remove facts') just in case it accidentally rederives old facts.

(defn saturate [facts rules]
  (loop [dfacts facts, facts #{}]
    (let [facts' (into facts dfacts)
          dfacts' (into #{} (comp (mapcat #(match-rule dfacts facts %)) (remove facts')) rules)]
      (cond->> facts' (seq dfacts') (recur dfacts')))))

Changes to match-rule

Not much changes, we just pass an extra dfacts argument through to match-patterns.

(defn match-rule [dfacts facts [head & patterns]]
  (for [env (second (match-patterns patterns dfacts facts))]
    (into [] (map #(env % %)) head)))

Oh yes, that's true: we call second on the value returned by match-patterns and we explain why right below👇.

Changes to match-patterns

Here is usually where semi-naive gets gory in textbooks, where a typical rule such as:

p ← q ∧ r

becomes a monster expression 🐲 such as:

Δp ← (Δq ∧ r) ∨ (q ∧ Δr) ∨ (Δq ∧ Δr)

👆And this is a simplified version since we lumped EDB and IDB together in the previous article.

It's tedious to implement, and match-patterns would end up very different from its naive version.

However, we can evolve the existing match-patterns rather than replace it, by looking at automatic differentiation with dual numbers for inspiration.

Automatic differentiation with dual numbers feels like magic: you compute the original function with special numbers and you get both the original result but also the value of the derivative and without knowing the expression of the derivative!

For example let's say you want to compute x^2+x at 7, then you compute as usual but by writing 7+ε instead of 7 and simplifying using the magic property that ε^2=0. (A non-null number whose square is zero isn't more bizarre than a number whose square is -1...)
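
(7 + ε)^2 + (7 + ε) = 49 + 14ε + ε^2 + 7 + ε = 56 + 15ε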

Here we have computed the value (56) of x^2+x for x=7, but we have also computed the value (15) of the derivative of x^2+x (2x+1), without knowing the formula of the derivative!

The monster expression 🐲 above is a kind of derivative and we'd like to compute it without implementing it.

In the same way x is replaced by x + ε with dual numbers, we are going to replace envs by [envs denvs], where envs are environments created strictly from matches over old facts, and denvs are environments where at least one fresh fact was matched against.

The original (for [fact facts env envs] ...) is thus going to be declined into 4 versions: facts×envs, facts×denvs, dfacts×denvs and dfacts×envs. Only the first one contributes to the envs component; all the others, by having a d on envs or facts, contribute to the denvs component.

Last, we have to be careful not to eagerly compute envs, since that's exactly what we want to avoid: rederiving the old facts. We do so by delaying its actual computation with delay.

(defn match-patterns [patterns dfacts facts]
  (reduce
    (fn [[envs denvs] pattern]
      [(-> #{} (into (for [fact facts env @envs] (match pattern fact env))) (disj nil) delay)
       (-> #{}
         (into (for [fact facts env denvs] (match pattern fact env)))
         (into (for [fact dfacts env denvs] (match pattern fact env)))
         (into (for [fact dfacts env @envs] (match pattern fact env)))
         (disj nil))])
    [(delay #{{}}) #{}] patterns))

which keeps the same structure as the original naive version:

(defn match-patterns [patterns facts]
  (reduce
    (fn [envs pattern]
      (-> (for [fact facts env envs] (match pattern fact env))
        set (disj nil)))
    #{{}} patterns))

The set has been replaced by into #{} to make envs and denvs computations more similar but otherwise the envs component of the semi-naive version is the original envs computation in the naive version.

Changes to q

Not much to say except we pass an empty set as the facts parameter to match-rule.

(defn q [facts query rules]   
  (-> facts (saturate rules) (match-rule #{} query) set))

🥳 Here it is: semi-naive datalog in 30 lines of code!

Is it less bad yet?

Remember the bad case which took 10s for 50 and 4 minutes for 100?

user=> (q 
         (set (for [i (range 50)] [:edge (inc i) i]))
         '([i] [:edge+ 1 i])
         '[([:edge+ i j] [:edge i j])
           ([:edge+ i j] [:edge+ i k] [:edge k j])])
#{[0]}

Now it takes 350ms for 50 and 5s for 100! It's respectively 30 and 50 times faster: progress! 🎉

Conclusion

This datalog is maybe vaguely better, but it's still as minimal/useless as in our previous article: we have made the engine more efficient, but we haven't made the language more powerful.

See, if we try to compute Bart's siblings (reusing edb and rules from 26-loc Datalog):

user=> (q edb '([c] [:parent :bart p] [:parent c p]) rules)
#{[:lisa] [:maggie] [:bart]}

Bart is returned as his own sibling! This is because we can't constrain a variable to have a different value than another.

This is what we'll fix in the next installment of this Datalog series: we'll add constraints, to be able to write:

([:sibling a b] [:parent a p] [:parent b p] (not= a b))

and have:

user=> (q edb '([c] [:sibling :bart c]) rules)
#{[:lisa] [:maggie]}

Share online if you'd like to see more open ended exploratory code (including keeping to grow this datalog)!

Appendix: 30 half-sophisticated locs at once

(defn match [pattern fact env]
   (when (= (count pattern) (count fact))
     (reduce (fn [env [p v]]
               (let [p-or-v (env p p)]
                 (cond
                   (= p '_) env
                   (= p-or-v v) env
                   (symbol? p-or-v) (assoc env p v)
                   :else (reduced nil))))
       env (map vector pattern fact))))

(defn match-patterns [patterns dfacts facts]
  (reduce
    (fn [[envs denvs] pattern]
      [(-> #{} (into (for [fact facts env @envs] (match pattern fact env))) (disj nil) delay)
       (-> #{}
         (into (for [fact facts env denvs] (match pattern fact env)))
         (into (for [fact dfacts env denvs] (match pattern fact env)))
         (into (for [fact dfacts env @envs] (match pattern fact env)))
         (disj nil))])
    [(delay #{{}}) #{}] patterns))

(defn match-rule [dfacts facts [head & patterns]]
  (for [env (second (match-patterns patterns dfacts facts))]
    (into [] (map #(env % %)) head)))

(defn saturate [facts rules]
  (loop [dfacts facts, facts #{}]
    (let [facts' (into facts dfacts)
          dfacts' (into #{} (comp (mapcat #(match-rule dfacts facts %)) (remove facts')) rules)]
      (cond->> facts' (seq dfacts') (recur dfacts')))))

(defn q [facts query rules]   
  (-> facts (saturate rules) (match-rule #{} query) set))

Permalink

Datomic as a Higher-Level Database

More than 20 years ago, when I first began learning programming languages, I read a line in a book:

C is a high-level language.

But it wasn't until years later, during a university class on assembly language, when I had to use jump commands just to write a loop, that I truly realized how high-level C was. Despite this, for much of my career, I found C to be quite low-level because I didn't want to deal with memory management via malloc and free, nor did I want to handle pointers. As my career progressed, I learned many programming languages. Java, for instance, was much higher-level than C because it provided garbage collection. Clojure, in turn, was even higher-level than Java because of its immutable collection types.

High-Level Doesn't Mean More Features — Sometimes It's the Opposite

High-level doesn't refer to having more features; it means higher-level semantics. As a result, high-level languages often restrict or even eliminate lower-level semantics. High-level semantics allow you to focus on specifying what you want to achieve without worrying about every single implementation detail—the machine handles those for you. In some cases, you're even restricted from accessing certain details because they are easy to mess up.

For example, when writing in C, you are strongly discouraged from using jump commands; when writing in Java, you cannot directly manipulate pointers; when writing in Clojure, you are advised against using mutable collection types.

High-level semantics often come with a trade-off in terms of machine performance. The JVM's garbage collection obviously uses extra memory, and Clojure's immutable collection types also consume more memory compared to Java's. Furthermore, Clojure's startup time far exceeds Java's, testing the limits of human patience and even spawning solutions like Babashka, specifically designed for shell usage.

High-level typically means trading machine efficiency for developer productivity, a trade-off that’s often worth it thanks to Moore’s Law, which ensures that machine performance will automatically improve over time.

In What Ways is Datomic High-Level?

From my experience using Datomic, I've observed at least four ways in which it is a higher-level database:

  1. DDL (Data Definition Language)
    • Primary Keys
    • Many-to-Many Relationships
  2. Isolation Level
  3. Time-travel Queries
  4. A Query Language that References Context

DDL - Primary Keys

When working with SQL databases, I often struggled with how to design the primary key for my tables. The first decision to make is:

  1. Should I use a natural key, which is one or more data attributes derived from the business domain? This can be convenient early on, but it can age poorly if the business meaning changes later.
  2. Should I use a surrogate key, which decouples the key from the business meaning, providing more flexibility for future modifications? Some tables, which represent parts of an entity that lack suitable natural keys, require surrogate keys.

If you decide to use a surrogate key, the next question is which data type to choose:

  • Auto-incrementing integers
  • UUIDs

Then, considerations about enumeration attacks and performance arise. Should you opt for UUIDs, which avoid enumeration attacks, or auto-incrementing integers, which perform better?

Datomic makes the primary key decision for you. In the world of Datomic, the primary key is the entity ID. It’s that simple. [1]

DDL - Many-to-Many Relationships

In SQL databases, when modeling many-to-many relationships, we usually need to design a bridge table.

In Datomic, there is no need for a bridge table because you can set the :db/cardinality attribute to :db.cardinality/many, which means the field supports one-to-many relationships. This feature not only simplifies the semantics by eliminating the need for a bridge table, but also makes the syntax for one-to-many and many-to-many relationships much more consistent.
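
For example, a schema attribute allowing an entity to point at many others can be declared like this (a hypothetical :person/friends attribute, just for illustration):

{:db/ident       :person/friends
 :db/valueType   :db.type/ref
 :db/cardinality :db.cardinality/many
 :db/doc         "People this person is friends with"}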

Isolation Level

SQL databases offer four isolation levels:

  • Read Uncommitted
  • Read Committed
  • Repeatable Read
  • Serializable

These various levels exist to allow for higher performance when dealing with transactions. In contrast, Datomic only provides one isolation level—Serializable. [2] With fewer options, there is less for us to worry about.

Time-travel Queries

Traditional databases are like regular files; once something is written, you can’t go back to a previous state. Datomic is different. Its state is like a git repository, allowing you to easily revert to a previous state using a time-travel query known as the as-of query.
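
Sketched with the Datomic peer API (conn and the date here are hypothetical):

(require '[datomic.api :as d])

(def db-now  (d/db conn))                          ; the current database value
(def db-2023 (d/as-of db-now #inst "2023-01-01"))  ; the database as of Jan 1, 2023

;; the same query can be run against either value
(d/q '[:find ?name :where [_ :person/name ?name]] db-2023)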

A Query Language that References Context

When writing SQL queries that involve multiple JOIN operations, the resulting query often becomes so long that it becomes hard to read. Human languages, in contrast, are typically composed of many short sentences. Short sentences can still convey complex ideas because they reference context. When we understand natural language, we don’t process each sentence in isolation; we use its context to fully grasp its meaning.

Datomic’s query language, Datalog, has a mechanism called Rules that can cleverly inject context into your queries.

Consider the following Datalog query, which retrieves the names of the actors in "The Terminator":

[:find ?name
 :where
 [?p :person/name ?name]
 [?m :movie/cast ?p]
 [?m :movie/title "The Terminator"]]

Now imagine that the part of the query responsible for "finding the actor's name from the movie title" is something you repeatedly write across different queries. Is there a way to avoid rewriting this section each time?

[?p :person/name ?name]
[?m :movie/cast ?p]
[?m :movie/title ?title]

Yes, and that mechanism is called Rules. We can rewrite the above query using Datomic Rules as follows:

;; The rules we defined
[(appear ?name ?title)
 [?p :person/name ?name]
 [?m :movie/cast ?p]
 [?m :movie/title ?title]]
 
;; the rewritten Datalog query
[:find ?name
 :in $ %
 :where (appear ?name "The Terminator")]

In this version, the appear rule abstracts the definition of an actor's "appearance" in a movie. Once this logical rule is applied by the query engine, the concept of "appearance" can be inferred as new knowledge.

This rule acts as a tool to reference context in a query. When a query language can reference context, it becomes more like human natural language, and thus more concise.

Envision a New Conversation

You recommend Datomic to your boss, and he/she asks, "What are its benefits? How can you prove it?"

You reply, "It improves productivity because it is a higher-level database."

I expect your boss will then ask, "What do you mean by higher-level?"

If the conversation is conducted thoughtfully and cleverly, this opening dialogue could lead to successfully advocating for the use of a higher-level database within your company. Savvy businesspeople may not remember or fully understand the various details of databases, but they understand that a higher level is equivalent to exchanging machine power for brain power. Exchanging something cheap for something expensive is a concept I believe businesspeople will understand.

Notes

  • [1] In practice, when using Datomic and needing to expose a unique identifier to the outside world, we typically design an additional UUID field. However, in this article, I won’t delve into all the design issues related to primary keys. My focus is: Datomic has already made the design decision for entity ID, which reduces the decisions we need to make, thus making this database more high-level.

  • [2] Most SQL database systems compose transactions from a series of updates, where each update changes the state of the database. However, Datomic transaction execution is not defined as a series of updates. It is defined as the addition of a set of datoms to the previous value of the database. Ref

Permalink

Beyond Traditional Testing: Addressing the Challenges of Non-Deterministic Software

Software systems with non-deterministic behavior have become increasingly common. From distributed systems with untrusted inputs to AI-powered solutions, there is a growing challenge in ensuring reliability and consistency in environments that are not fully predictable. The integration of Large Language Models (LLMs) and other AI technologies can in fact introduce outputs that change every time they are computed.

Non-deterministic software, by its very nature, can produce different outputs for the same input under seemingly identical conditions. This unpredictability presents significant challenges for testing.

This article explores some of the fundamental characteristics of non-deterministic software, discusses established best practices for testing such systems, examines recent innovations in the field with a focus on AI-driven techniques, and provides practical examples complete with Python code samples. It'll also investigate the unique challenges posed by LLMs in software testing and offer guidance on implementing a comprehensive testing strategy for these complex systems.

All code is available in this repository. All examples are in Python but the same ideas can be applied to any programming language. At the end, I recap a few testing frameworks that can be used with other programming languages, including C#, Java, JavaScript, and Rust.

Characteristics and Challenges of Non-Deterministic Software

Non-deterministic software can be seen as a reflection of the complex, often unpredictable world we live in. Unlike deterministic software systems, which produce the same output for a given input every time, non-deterministic systems introduce an element of variability.

Non-determinism in software can arise from various sources such as inherent randomness in the algorithms being used or the effect of an internal state that is not observable from the outside. It might also be the result of numerical computing errors. For example, when dealing with floating-point arithmetic, tiny rounding errors can accumulate and lead to divergent results.

A newer source of non-determinism is the integration of generative AI components, such as Large Language Models (LLMs), because at every invocation their outputs can vary significantly for the same input.

To demonstrate non-deterministic behavior, let's have a look at a simple Python example:

import random

def non_deterministic_function(x):
    if random.random() < 0.1:  # 10% chance of failure
        return None
    return x * 2

# Running this function multiple times with the same input
for _ in range(20):
    result = non_deterministic_function(5)
    print(result)

If you run this code, it will return 10 most of the time, because the function doubles the input 5, but about 10% of the time it returns None. This simple example illustrates the challenge of testing non-deterministic software: how do you write tests for a function that doesn't always behave the same way?

To tackle these challenges, we can adapt traditional testing methods and implement new approaches. From property-based testing to AI-driven test generation, the field of software testing is evolving to meet the demands of an increasingly non-deterministic digital world.

Effective Testing Strategies for Non-Deterministic Software

Testing non-deterministic software requires a shift in how we approach software quality assurance. One interesting approach we can use to test non-deterministic software is property-based testing.

With property-based testing, rather than writing tests for specific input-output pairs, you define properties that should hold true for all possible inputs. The testing framework then generates a large number of random inputs and checks if the defined properties hold for each of them.

Let's look at an example of property-based testing using the Hypothesis library in Python:

from hypothesis import given, strategies as st
import random

def non_deterministic_sort(lst):
    """A non-deterministic sorting function that occasionally makes mistakes."""
    if random.random() < 0.1:  # 10% chance of making a mistake
        return lst  # Return unsorted list
    return sorted(lst)

@given(st.lists(st.integers()))
def test_non_deterministic_sort(lst):
    result = non_deterministic_sort(lst)

    # Property 1: The result should have the same length as the input
    assert len(result) == len(lst), "Length of the result should match the input"

    # Property 2: The result should contain all elements from the input
    assert set(result) == set(lst), "Result should contain all input elements"

    # Property 3: The result should be sorted in most cases
    attempts = [non_deterministic_sort(lst) for _ in range(100)]

    # We allow for some failures due to the non-deterministic nature
    # Replace 'any' with 'all' to make the test fail if any attempt is not sorted
    assert any(attempt == sorted(lst) for attempt in attempts), "Function should produce a correct sort in multiple attempts"

# Run the test
if __name__ == "__main__":
    test_non_deterministic_sort()

In this example, we're testing a non-deterministic sorting function that occasionally makes mistakes. Instead of checking for a specific output, we can verify properties that should hold true regardless of the function’s non-deterministic behavior. For example, we can check that the output has the same length as the input, contains all the same elements, and is correctly sorted in at least some of multiple attempts.

While property-based testing is powerful, it can be slow and costly when LLMs are involved in the test cases. This is because each test run may require multiple invocations of the LLM, which can be computationally expensive and time-consuming. Therefore, it’s crucial to carefully design property-based tests when working with LLMs to balance thoroughness with efficiency.

Another crucial strategy for testing non-deterministic software is to check if it is feasible to create repeatable test environments. This involves controlling as many variables as possible to reduce the sources of non-determinism during testing. For example, you can use fixed random seeds, mock external dependencies, and use containerization to ensure consistent environments.

When dealing with AI, especially LLMs, you can use semantic similarity measures to evaluate outputs rather than expecting exact matches. For instance, when testing an LLM-based chatbot, you might check if the model’s responses are semantically similar to a set of acceptable answers, rather than looking for specific phrases.

Here’s an example of how to test an LLM’s output using semantic similarity:

import json
import boto3

from scipy.spatial.distance import cosine

AWS_REGION = "us-east-1"
EMBEDDING_MODEL_ID = "amazon.titan-embed-text-v2:0"

bedrock_runtime = boto3.client('bedrock-runtime', region_name=AWS_REGION)

def get_embedding(text):
    body = json.dumps({"inputText": text})
    response = bedrock_runtime.invoke_model(
        modelId=EMBEDDING_MODEL_ID,
        contentType="application/json",
        accept="application/json",
        body=body
    )
    response_body = json.loads(response['body'].read())
    return response_body['embedding']

def semantic_similarity(text1, text2):
    embedding1 = get_embedding(text1)
    embedding2 = get_embedding(text2)
    return 1 - cosine(embedding1, embedding2)

def test_llm_response(llm_function, input_text, acceptable_responses, similarity_threshold=0.8):
    llm_response = llm_function(input_text)
    print("llm_response:", llm_response)

    for acceptable_response in acceptable_responses:
        similarity = semantic_similarity(llm_response, acceptable_response)
        print("acceptable_response:", acceptable_response)
        if similarity >= similarity_threshold:
            print("similarity:", similarity)
            return True

    return False

# Example usage
def mock_llm(input_text):
    # This is a mock LLM function for demonstration purposes
    return "The capital of France is Paris, a city known for its iconic Eiffel Tower."

input_text = "What is the capital of France?"
acceptable_responses = [
    "The capital of France is Paris.",
    "Paris is the capital city of France.",
    "France's capital is Paris, known for its rich history and culture."
]

result = test_llm_response(mock_llm, input_text, acceptable_responses)
print(f"LLM response test passed: {result}")

In this example, we use Amazon Bedrock to compute semantic embeddings of a simulated LLM’s response and a set of acceptable responses. Then, we use cosine similarity to determine if the LLM’s output is semantically similar enough to any of the acceptable responses.

On another note, an interesting development not strictly related to non-deterministic software testing is the use of LLMs themselves to generate test data and check test outputs. This approach leverages the power of LLMs to understand context and generate diverse, realistic test cases.

Here’s an example generating structured test data in JSON format:

import json
import boto3

AWS_REGION = "us-east-1"

MODEL_ID = "us.anthropic.claude-3-5-sonnet-20240620-v1:0"   

bedrock_runtime = boto3.client('bedrock-runtime', region_name=AWS_REGION)

def generate_structured_test_data(prompt, num_samples=5):
    response = bedrock_runtime.converse(
        modelId=MODEL_ID,
        messages=[{
            'role': 'user',
            'content': [{ 'text': prompt }]
        }]
    )
    generated_data = response['output']['message']['content'][0]['text']
    try:
        json_data = json.loads(generated_data)
    except json.JSONDecodeError:
        print("Generated data is not valid JSON")

    return json_data

# Example usage
prompt = """Generate 5 JSON objects representing potential user inputs for a weather forecasting app.
Each object should have 'location' and 'query' fields.
Output the result as a valid JSON array.
Output JSON and nothing else.
Here's a sample to guide the format:
[
  {
    "location": "New York",
    "query": "What's the temperature tomorrow?"
  }
]"""

test_inputs = generate_structured_test_data(prompt)

print(json.dumps(test_inputs, indent=2))

In this example, we're using Amazon Bedrock and the Anthropic Claude 3.5 Sonnet model to generate structured JSON test inputs for a weather forecasting app. Using this approach, you can create a wide range of test cases, including edge cases that could be difficult to think of initially. These test cases can be stored and reused multiple times.

Similarly, LLMs can be used to check test outputs, especially for tasks where the correct answer might be subjective or context-dependent. This approach is more precise than just using semantic similarity but is slower and more costly. The two approaches can be used together. For example, if the semantic similarity test has passed, we then use an LLM for further checks.

import boto3

AWS_REGION = "us-east-1"

MODEL_ID = "us.anthropic.claude-3-5-sonnet-20240620-v1:0"   

bedrock_runtime = boto3.client('bedrock-runtime', region_name=AWS_REGION)

def check_output_with_llm(input_text, test_output, prompt_template):
    prompt = prompt_template.format(input=input_text, output=test_output)

    response = bedrock_runtime.converse(
        modelId=MODEL_ID,
        messages=[{
            'role': 'user',
            'content': [{ 'text': prompt }]
        }]
    )

    response_content = response['output']['message']['content'][0]['text']

    return response_content.strip().lower() == "yes"

# Example usage
input_text = "What's the weather like today?"
test_output = "It's sunny with a high of 75°F (24°C) and a low of 60°F (16°C)."
prompt_template = "Given the input question '{input}', is this a reasonable response: '{output}'? Answer yes or no and nothing else."

is_valid = check_output_with_llm(input_text, test_output, prompt_template)

print('input_text:', input_text)
print('test_output:', test_output)
print(f"Is the test output a reasonable response? {is_valid}")

In this example, we're again using an Anthropic Claude model to evaluate whether the system's response is reasonable given the input question. Depending on the difficulty of the test, we can use a more or less powerful model to optimize for speed and cost.

This approach can be used for testing chatbots, content generation systems, or any other application where the correct output isn’t easily defined by simple rules.

These strategies - property-based testing, repeatable environments, semantic similarity checking, and LLM-assisted test generation and validation - form the foundation for effective testing of non-deterministic software. They allow us to make meaningful assertions about system behavior even when exact outputs cannot be predicted.

Advanced Techniques for Testing Complex Non-Deterministic Systems

Using AI to generate test cases can go beyond generative AI and LLMs. For example, machine learning models can analyze historical test data and system behavior to identify patterns and generate test cases that are most likely to uncover bugs or edge cases that a human tester might miss.

Let's see an example of using a simple machine learning model to generate test cases for a non-deterministic function.

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Simulated historical test data
# Features: input_a, input_b, system_load
# Target: 0 (pass) or 1 (fail)
X = np.array([
    [1, 2, 0.5], [2, 3, 0.7], [3, 4, 0.3], [4, 5, 0.8], [5, 6, 0.4],
    [2, 2, 0.6], [3, 3, 0.5], [4, 4, 0.7], [5, 5, 0.2], [6, 6, 0.9]
])
y = np.array([0, 0, 0, 1, 0, 0, 0, 1, 0, 1])

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a random forest classifier
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)

# Function to generate new test cases
def generate_test_cases_from_historical_test_data(n_cases):
    # Generate random inputs
    new_cases = np.random.rand(n_cases, 3)
    new_cases[:, 0] *= 10  # Scale input_a to 0-10
    new_cases[:, 1] *= 10  # Scale input_b to 0-10

    # Predict failure probability
    failure_prob = clf.predict_proba(new_cases)[:, 1]

    # Sort cases by failure probability
    sorted_indices = np.argsort(failure_prob)[::-1]

    return new_cases[sorted_indices]

# Generate and print top 5 test cases most likely to fail
top_test_cases = generate_test_cases_from_historical_test_data(100)[:5]
print("Top 5 test cases most likely to fail:")
for i, case in enumerate(top_test_cases, 1):
    print(f"Case {i}: input_a = {case[0]:.2f}, input_b = {case[1]:.2f}, system_load = {case[2]:.2f}")

This example demonstrates the use of a random forest classifier to generate test cases that are more likely to uncover issues in a system. Models can be better than humans at learning from historical data to predict which combinations of inputs and system conditions are most likely to cause failures.

Another related technique is the use of chaos engineering for testing non-deterministic systems. For example, you can deliberately introduce failures and perturbations into a system to test its resilience and identify potential issues before they occur in production.

For instance, you can randomly terminate instances in a distributed system, simulate network latency, or inject errors into data streams. By systematically introducing chaos in a controlled environment, you are able to uncover weaknesses in a system that might not be apparent under normal testing conditions.

When it comes to testing AI-powered systems, especially those involving Large Language Models (LLMs), a similar approach is to use adversarial testing, where input prompts are designed to challenge the LLM’s understanding and generate edge cases.

Here's an example of how to implement a simple adversarial testing framework for an LLM:

import random
import string

def generate_adversarial_prompt(base_prompt, num_perturbations=3):
    perturbations = [
        lambda s: s.upper(),
        lambda s: s.lower(),
        lambda s: ''.join(random.choice([c.upper(), c.lower()]) for c in s),
        lambda s: s.replace(' ', '_'),
        lambda s: s + ' ' + ''.join(random.choices(string.ascii_letters, k=5)),
    ]

    adversarial_prompt = base_prompt
    for _ in range(num_perturbations):
        perturbation = random.choice(perturbations)
        adversarial_prompt = perturbation(adversarial_prompt)

    return adversarial_prompt

def test_llm_robustness(llm_function, base_prompt, expected_topic, num_tests=10):
    for _ in range(num_tests):
        adversarial_prompt = generate_adversarial_prompt(base_prompt)
        response = llm_function(adversarial_prompt)

        # Here I use my semantic similarity function to check if the response
        # is still on topic despite the adversarial prompt
        is_on_topic = semantic_similarity(response, expected_topic) > 0.7

        print(f"Prompt: {adversarial_prompt}")
        print(f"Response on topic: {is_on_topic}")
        print("---")

# Example usage (assuming I have my LLM function and semantic_similarity function from before)
base_prompt = "What is the capital of France?"
expected_topic = "Paris is the capital of France"

test_llm_robustness(mock_llm, base_prompt, expected_topic)

This example generates adversarial prompts by applying random perturbations to a base prompt, then tests whether the LLM can still produce on-topic responses despite these challenging inputs. Other approaches to generating adversarial prompts include the use of different human languages, asking to output in poetry or specific formats, and asking for internal information such as tool use syntax.

Because no single technique is a silver bullet, the most effective testing strategies often involve a combination of approaches, tailored to the specific characteristics and requirements of the system under test.

In the next section, let's explore how to implement a comprehensive testing strategy that incorporates advanced techniques alongside more traditional methods, creating a robust approach to testing even the most complex non-deterministic systems.

Comprehensive Strategy for Testing Non-Deterministic Software

To effectively test complex systems, we need a comprehensive strategy that combines multiple techniques and adapts to the specific challenges of each system.

Let's go through the process of implementing such a strategy, using a hypothetical AI-powered recommendation system as an example. This system uses machine learning models to predict user preferences, incorporates real-time data, and interfaces with a Large Language Model to generate personalized content descriptions. We can use it as an example of a non-deterministic system with multiple sources of unpredictability.

The first step in this strategy is to identify the critical components of the system and assess the potential impact of failures. In this sample recommendation system, we can find the following high-risk areas:

  • The core recommendation algorithm
  • The real-time data processing pipeline
  • The LLM-based content description generator

For each of these components, let's consider the potential impact of failures on user experience, data integrity, and system stability. This assessment can be used to guide testing efforts, ensuring that resources are focused where they’re most needed.

Then, with the previous risk assessment in hand, we can design a layered testing approach that combines multiple techniques.

Unit Testing with Property-Based Tests

For individual components, we can use property-based testing to ensure they behave correctly across a wide range of inputs. Here’s an example of how to test the recommendation algorithm.

from hypothesis import given, strategies as st
import numpy as np

def recommendation_algorithm(user_preferences, item_features):
    # Simplified recommendation algorithm
    return np.dot(user_preferences, item_features)

@given(
    st.lists(st.floats(min_value=-1, max_value=1), min_size=5, max_size=5),
    st.lists(st.lists(st.floats(min_value=-1, max_value=1), min_size=5, max_size=5), min_size=1, max_size=10)
)
def test_recommendation_algorithm(user_preferences, item_features_list):
    recommendations = [recommendation_algorithm(user_preferences, item) for item in item_features_list]

    # Property 1: Recommendations should be in the range [-5, 5] given our input ranges
    assert all(-5 <= r <= 5 for r in recommendations), "Recommendations out of expected range"

    # Property 2: Higher dot products should result in higher recommendations
    sorted_recommendations = sorted(zip(recommendations, item_features_list), reverse=True)
    for i in range(len(sorted_recommendations) - 1):
        assert np.dot(user_preferences, sorted_recommendations[i][1]) >= np.dot(user_preferences, sorted_recommendations[i+1][1]), "Recommendations not properly ordered"

# Run the test
test_recommendation_algorithm()

Integration Testing with Chaos Engineering

To test how our components work together under various conditions, we can use chaos engineering techniques. For example, we can randomly degrade the performance of the real-time data pipeline, simulate network issues, or introduce delays in API responses. This helps ensure the system remains stable even under suboptimal conditions.

System Testing with AI-Generated Test Cases

For end-to-end testing, we can use AI to generate diverse and challenging test scenarios. This might involve creating complex user profiles, simulating various usage patterns, and generating edge case inputs for our LLM.

Continuous Monitoring and Adaptation

A good testing strategy doesn’t end when we deploy to production. We need robust monitoring and observability tools to catch issues that might not have surfaced during testing.

This includes:

  • Real-time performance monitoring
  • Anomaly detection algorithms to identify unusual behavior
  • A/B testing for gradual rollout of changes
  • User feedback collection and analysis

Observability tools often include native anomaly detection capabilities to help find important information across a large amount of telemetry data. For example, Amazon CloudWatch implements anomaly detection for metrics and logs.

Here’s a simple example of how to implement an anomaly detection system using basic statistical methods:

import numpy as np
from scipy import stats

class AnomalyDetector:
    def __init__(self, window_size=100):
        self.window_size = window_size
        self.values = []

    def add_value(self, value):
        self.values.append(value)
        if len(self.values) > self.window_size:
            self.values.pop(0)

    def is_anomaly(self, new_value, z_threshold=3.0):
        if len(self.values) < self.window_size:
            return False  # Not enough data to detect anomalies yet

        mean = np.mean(self.values)
        std = np.std(self.values)
        z_score = (new_value - mean) / std

        return abs(z_score) > z_threshold

# Usage
detector = AnomalyDetector()

# Simulate some normal values
for _ in range(100):
    detector.add_value(np.random.normal(0, 1))

# Test with a normal value
print(detector.is_anomaly(1.5))  # Probably False

# Test with an anomaly
print(detector.is_anomaly(10))  # Probably True

Alternative property-based testing tools for other programming languages

In these examples, I used the Hypothesis Python module. Here are a few interesting alternatives for other programming languages.

Language | Recommended Library | Reasoning
C# | FsCheck | Widely used in the .NET ecosystem; supports both C# and F#.
Clojure | test.check | The standard property-based testing library for Clojure; used by clojure.spec for generative testing.
Haskell | QuickCheck | The original property-based testing library, still the standard in Haskell.
Java | jqwik | Modern design, good documentation, and seamless integration with JUnit 5.
JavaScript | fast-check | Actively maintained, well documented, and integrates well with popular JS testing frameworks.
Python | Hypothesis | Most mature, feature-rich, and widely adopted in the Python ecosystem.
Scala | ScalaCheck | The de facto standard for property-based testing in Scala.
Ruby | Rantly | More actively maintained than the alternatives; good integration with RSpec.
Rust | proptest | More actively developed than quickcheck for Rust, with helpful features like persistence of failing examples.

Continuous Improvement

A good testing strategy includes a process for continuous improvement. This involves:

  • Regular review of test results and production incidents
  • Updating the test suite based on new insights
  • Staying informed about new testing techniques and tools
  • Adapting the testing strategy as the system evolves

Implementing a comprehensive testing strategy for non-deterministic software is no small task. It requires a combination of technical skill, creativity, and a deep understanding of the system under test. However, by combining multiple testing techniques, leveraging AI and machine learning, and maintaining a commitment to continuous improvement, we can create robust, reliable systems even in the face of uncertainty.

When we look to the future, the only thing we can be certain of is that the field of software testing will continue to evolve. New challenges will arise as systems become more powerful and complex. But with the foundations explored in this article - from property-based testing to AI-driven test generation, from chaos engineering to semantic similarity checking - there is a solid base on which to build.

The strategies and techniques discussed here are not set in stone. They’re a starting point, a foundation upon which you can add your own approach tailored to your specific needs and challenges. The world of software is ever-changing, and so too must be our approaches to testing it. I encourage you to embrace the uncertainty, stay curious, and never stop learning. The future of software testing is definitely interesting, and it’s in our hands to learn and, when possible, shape it.

To continue learning, have a look at the repository that includes all code in this article.

Permalink

How Could Clojure Web Development Suck Less

In this episode, we dive into the intricacies of web development with Clojure, exploring how it can be improved and made less cumbersome. We also touch on Rama, as Ben brings more expertise in that area. Additionally, we explore Ben's career journey, f...

Permalink

From Micro to Macro: Scaling Front-Ends

Are you facing challenges in scaling your front-end to meet a growing number of users? With the increasing complexity of modern web applications, strategies like micro front-ends, monorepositories, global state management, and cache optimization are essential. 

In this article, we explore best practices for scaling front-end applications, discussing how to implement micro front-ends, manage versions in monorepositories, apply effective caching strategies, and efficiently maintain global states. 

Discover how Nubank is overcoming scalability challenges in the front-end and how you can apply these approaches to build agile, responsive, and easy-to-maintain user interfaces.

The Challenge of Scale

Companies like Nubank face unique challenges. With over 100 million customers in Brazil, Mexico and Colombia, handling large-scale distributed systems is not just a necessity but an obligation. Managing transactions like PIX, ensuring service stability, and providing a consistent user experience require innovative solutions.

Moreover, working with advanced technologies like Clojure and Datomic—whose development is influenced by engineers within Nubank itself—adds additional layers of complexity and opportunity. These technologies are not just tools; they are integral parts of our scalability strategy and continuous innovation.

Micro Front-Ends: Dividing to Conquer

The micro front-end architecture has emerged as a solution to many challenges faced by large development teams. But what exactly are micro front-ends?

What Micro Front-Ends Are (and What They Aren’t)

Micro front-ends are an extension of the microservices concept to the front-end. They allow different teams to develop, deploy, and maintain distinct parts of the user interface independently.

This means that each team can work at its own pace, choose its own technologies (to a certain extent), and deploy updates without impacting the system as a whole.

It’s important to highlight that micro front-ends are not:

  • NPM packages or monolithic modules: Simply splitting a monolith into packages doesn’t offer the benefits of independent deployment or team isolation.
  • Separate applications for the end-user: The user experience should be unified. We’re not talking about multiple distinct applications but a single application composed of several autonomous parts.

Benefits of Micro Front-Ends

  • Independent deployments: Teams can deploy updates without coordinating with the entire organization.
  • Team scalability: New developers can be onboarded more quickly, focusing on a specific part of the system.
  • Fault isolation: Issues in one micro front-end don’t necessarily affect the entire application.
  • Technological flexibility: The possibility of using different frameworks or libraries, although this should be done cautiously to avoid client-side overload.

Costs and Considerations

  • Complex initial setup: Implementing micro front-ends requires careful planning and a robust initial configuration.
  • Cohesion of user experience: Ensuring that the interface is consistent and cohesive is a challenge when multiple teams are involved.
  • Performance overhead: Using different frameworks can increase bundle size and affect performance.
  • Observability and debugging: Monitoring and debugging an application composed of multiple micro front-ends requires advanced tools and practices.

Implementation Strategies

There are several approaches to implementing micro front-ends:

Client-Side Composition

This is the most common approach, where the integration of micro front-ends occurs in the user’s browser. Technologies like Web Components, Module Federation (Webpack 5), and frameworks like Single SPA facilitate this composition.

Server-Side or CDN Composition

The assembly of micro front-ends occurs before reaching the client, either on the server or CDN. Tools and techniques like Edge Side Includes (ESI) can be utilized.

Communication Between Micro Front-Ends

Efficient communication is essential. It’s recommended to use:

  • Custom events: Allow micro front-ends to communicate without directly depending on each other (see the sketch after this list).
  • Shared states via browser APIs: Such as Local Storage or IndexedDB.
  • Avoid excessive global dependencies: Minimizes coupling between components.
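
To make the event-based option concrete, here is a minimal ClojureScript sketch; the event name "cart/item-added" and the payload are hypothetical examples, not an API this article prescribes:

;; Micro front-end A publishes a domain event on the window object.
(defn publish! [event-name detail]
  (.dispatchEvent js/window
    (js/CustomEvent. event-name #js {:detail (clj->js detail)})))

;; Micro front-end B subscribes without importing anything from A.
(defn subscribe! [event-name handler]
  (.addEventListener js/window event-name
    (fn [e] (handler (js->clj (.-detail e) :keywordize-keys true)))))

(subscribe! "cart/item-added" #(js/console.log "item added:" (pr-str %)))
(publish! "cart/item-added" {:sku "123" :qty 1})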

Version Control in Monorepositories

Managing versions in a monorepository can be challenging, especially when multiple teams are working on different parts of the system. Here are some practices to handle this:

Individual Package Versioning

Tools like Lerna or Nx allow you to manage individual package versions within a monorepository. This enables each team to control the versions of their own components or modules, maintaining independence and facilitating coordination.

Avoiding Git Submodules

While Git submodules might seem like a solution, they often introduce additional complexity. Instead, using NPM or Yarn workspaces can simplify the management of internal dependencies.

Benefits of the Monorepository

  • Code consistency: Facilitates code standardization and reuse.
  • Visibility: All teams have access to the complete source code, promoting collaboration.
  • Automation: Simplifies the setup of CI/CD pipelines that cover the entire system.

Caching Strategies for Bundle Loading

Efficiency in loading resources is crucial for application performance. Well-implemented caching strategies can significantly improve the user experience.

Caching Shared Resources

By using technologies like Module Federation, it’s possible to share common dependencies among different micro front-ends, avoiding redundant downloads. To achieve this:

  • Define shared modules: Configure which libraries or frameworks should be shared to prevent multiple versions on the client.
  • Compatible versions: Ensure that shared dependencies are compatible with each other to avoid conflicts.

CDN-Level Caching

Utilizing a Content Delivery Network (CDN) allows static resources to be delivered more quickly to users by leveraging distributed caching.

  • Cache-Control configurations: Adjust HTTP headers to control how and for how long resources should be cached (see the sketch after this list).
  • Cache invalidation: Have strategies to invalidate or update the cache when new versions of resources are deployed.
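
As a hedged illustration of the header side, a Ring-style handler in Clojure serving a content-hashed bundle (the file name app.3f2a1c.js is made up; hashed assets can be cached aggressively because a new deploy produces a new name):

(defn bundle-handler [_request]
  ;; Content-hashed assets never change in place, so the cache
  ;; can hold them for a year and the name change invalidates them.
  {:status  200
   :headers {"Content-Type"  "application/javascript"
             "Cache-Control" "public, max-age=31536000, immutable"}
   :body    (slurp "target/public/js/app.3f2a1c.js")})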

Browser Caching

  • Service Workers: Implement caching via Service Workers for more granular control over which resources are stored and when they are updated (sketched after this list).
  • Preloading and Prefetching: Anticipate which resources will be needed and load them in advance.
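
A minimal cache-first fetch handler, written as ClojureScript interop as one possible sketch (in practice the compiled script must be registered as a service worker):

;; Serve from any cache when possible, fall back to the network.
(.addEventListener js/self "fetch"
  (fn [event]
    (.respondWith event
      (-> (.match js/caches (.-request event))
          (.then (fn [cached] (or cached (js/fetch (.-request event)))))))))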

Managing Global States in Host Applications

Maintaining a consistent global state in an application composed of multiple micro front-ends is a challenge.

Recommended Strategies

  • Custom events: Use the browser’s event system for communication between micro front-ends without creating rigid dependencies.
  • Shared local storage: APIs like Local Storage or IndexedDB can serve as a means to share global state (see the sketch after this list).
  • Global contexts: In frameworks like React, you can use Context API, but be careful not to introduce unwanted coupling.
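
For the Local Storage route, a small ClojureScript sketch; the :session key and its contents are hypothetical, and JSON keeps the stored value framework-agnostic:

(defn write-shared! [k v]
  ;; Serialize to JSON so any micro front-end can read it back.
  (.setItem js/localStorage (name k) (js/JSON.stringify (clj->js v))))

(defn read-shared [k]
  (some-> (.getItem js/localStorage (name k))
          js/JSON.parse
          (js->clj :keywordize-keys true)))

(write-shared! :session {:user-id "42" :theme "dark"})
(read-shared :session) ; => {:user-id "42", :theme "dark"}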

Best Practices

  • Domain isolation: Each micro front-end should be responsible for its own local state and interact with the global state only when necessary.
  • Well-defined contracts: Establish clear interfaces for communication between components, facilitating maintenance and evolution.

Standardization and Platform Teams

While micro front-ends address technical scalability, code standardization and the existence of platform teams are crucial for the human scalability of development teams.

The Role of Platform Teams

  • Defining good standards: Create and maintain code standards that make teams’ work easier.
  • Tools and infrastructure: Develop tools that automate repetitive tasks and ensure code quality.
  • Facilitating collaboration: Ensure different teams can work together efficiently.

Importance of Standardization

  • Faster onboarding: New developers adapt more quickly to standardized code.
  • Consistent quality: Reduces the incidence of bugs and maintenance issues.
  • Easier code review: Code reviews are more effective when there’s a consistent style.

Avoiding Unnecessary Complexity

  • Simplicity as standard: Opt for simple solutions that solve problems without adding excessive complexity.
  • Value-based decisions: Implement technologies and standards that bring clear benefits to the business and the team.
  • Beware of technological “hype”: Not every new library or framework is suitable for your application’s context.

Organizational Laws Applied to Code

Conway’s Law states that a system’s structure mirrors the communication structure of the organization that develops it. Therefore, aligning technical architecture with team organization is not just beneficial but essential.

  • Aligned structures: Autonomous teams responsible for specific micro front-ends reflect a modular architecture.
  • Efficient communication: Fewer dependencies between teams reduce the need for constant communication and complex alignments.
  • Continuous evolution: A flexible organization allows the architecture to evolve with business needs.

How to Start

  • Pilot project: Implement a micro front-end in a non-critical part of the system to understand the challenges and benefits.
  • Define standards: Establish clear conventions from the outset for routes, communication, and styles.
  • Invest in observability: Monitoring tools are essential to quickly identify issues.
  • Documentation and communication: Keep documentation up to date and promote communication between teams to share learnings.

Conclusion

Scaling front-ends effectively requires a combination of technical and organizational solutions. Micro front-end architectures offer a path to handle technical complexity, while standardization and platform teams address the human challenges of large-scale collaboration.

At Nubank, we understand that continuous innovation and adaptability are essential to provide the best experience to our customers. Whether adopting advanced technologies or restructuring our teams, we are committed to evolving and facing the scalability challenges of the modern world.

Want to be part of this challenge? We’re always looking for people passionate about technology and innovation to build the purple future together!

For more insights like these, watch the recording of the Engineering meetup.

The post From Micro to Macro: Scaling Front-Ends appeared first on Building Nubank.

Permalink

jank development update - Moving to LLVM IR

Hi everyone! It's been a few months since the last update and I'm excited to outline what's been going on and what's upcoming for jank, the native Clojure dialect. Many thanks to Clojurists Together and my GitHub sponsors for the support. Let's get into it!

Permalink

Clojure macros continue to surprise me

Clojure macros have two modes: avoid them at all costs/do very basic stuff, or go absolutely crazy.

Here’s the problem: I’m working on Humble UI’s component library, and I wanted to document it. While at it, I figured it could serve as an integration test as well—since I showcase every possible option, why not test it at the same time?

This is what I came up with: I write component code, and in the application, I show a table with the running code on the left and the source on the right:

It was important that code that I show is exactly the same code that I run (otherwise it wouldn’t be a very good test). Like a quine: hey program! Show us your source code!

Simple with Clojure macros, right? Indeed:

(defmacro table [& examples]
  ;; list* splices each (component source-string) pair
  ;; directly into the ui/grid call
  (list* 'ui/grid {:cols 2}
    (for [[_ code] (partition 2 examples)]
      (list 'list
        code (pr-str code)))))

This macro accepts the code’s AST and emits back a pair: the AST itself (basically a no-op) and the string that the AST serializes to.
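
For instance, with a single label/code pair (ui/button here is a stand-in component name, and this expansion assumes the macro above):

user=> (macroexpand-1 '(table "Button" (ui/button "OK")))
(ui/grid {:cols 2} (list (ui/button "OK") "(ui/button \"OK\")"))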

This is what I consider to be a “normal” macro usage. Nothing fancy, just another day at the office.

Unfortunately, this approach reformats code: inside the macro, all we have is an already parsed AST (data structures only, no whitespace), and we have to pretty-print it from scratch, adding indents and newlines.

I tried a couple of existing formatters (clojure.pprint, zprint, cljfmt) but wasn’t happy with any of them. The problem is tricky—sometimes a vector is just a vector, but sometimes it’s a UI component and shows the structure of the UI.

And then I realized that I was thinking inside the box all the time. We already have the perfect formatting—it’s in the source file!

So what if... No, no, it’s too brittle. We shouldn’t even think about it... But what if...

What if our macro read the source file?

Like, actually went to the file system, opened a file, and read its content? We already have the file name conveniently stored in *file*, and luckily Clojure keeps sources around.

So this is what I ended up with:

(require '[clojure.java.io :as io]
         '[clojure.string :as str])

(defn slurp-source [file key]
  (let [content      (slurp (io/resource file))
        key-str      (pr-str key)
        idx          (str/index-of content key-str)
        content-tail (subs content (+ idx (count key-str)))
        reader       (clojure.lang.LineNumberingPushbackReader.
                       (java.io.StringReader.
                         content-tail))
        indent       (re-find #"\s+" content-tail)
        [_ form-str] (read+string reader)]
    (->> form-str
      str/split-lines
      (map #(if (str/starts-with? % indent)
              (subs % (count indent))
              %)))))

Go to a file. Find the string we are interested in. Read the first form after it as a string. Remove common indentation. Render. As a string.

Voilà!

I know it’s bad. I know you shouldn’t do it. I know. I know.

But still. Clojure is the most fun I have ever had with any language. It lets you play with code like never before. Do the craziest, stupidest things. Read the source file of the code you are evaluating? Fetch code from the internet and splice it into the currently running program?

In any other language, this would’ve been a project. You’d need a parser, a build step... Here—just ten lines of code, in the vanilla language, no tooling or setup required.

Sometimes, a crazy thing is exactly what you need.

Permalink

Leave the Baroque, Come into the Rococo: A New Look at Functional Programming

The other day, I was coding while listening to this song, and midway through it came the line: "Sai do Barroco, vai pro meio do Rococó" ("Leave the Baroque, go into the Rococo"). I finished the code and went off to some of the mundane meetings in a software engineer's life, whose topic, specifically, was: how can we make a set of processes within the company's ecosystem simpler and more resilient?

But for some reason the line kept playing in my head: "Sai do Barroco, vai pro meio do Rococó". Once the meetings were over, I went to Google to learn more about what these movements are about, and my mind simply exploded. I couldn't do anything else until I had produced the content below. I suggest you press play on the song and join me on this trip.

https://www.youtube.com/watch?v=otXx46_FA80&ab_channel=CercleRecords

The trip starts here

Picture yourself walking the streets of a historic European city. On one side rises a Baroque cathedral, with exuberant façades, imposing columns, and ornamental details that seem to compete for your attention. Every architectural element is loaded with symbolism and complexity, reflecting an era in which grandeur and splendor were synonymous with power and innovation.

Interior of a Baroque church

Now, turning the corner, you come upon a Rococo palace. The atmosphere changes immediately. Heavy lines give way to graceful curves; the ornamental details become lighter and more delicate. There is a sense of harmony and elegance that invites quiet contemplation. Complexity gives way to refined simplicity, without losing artistic richness.

Interior of a Rococo palace

This architectural transition from Baroque to Rococo serves as a powerful metaphor for the world of programming. Just as Baroque architecture reflected a heavy, complex approach, Object-Oriented Programming (OOP) has been the dominant paradigm for decades, structuring systems with deep hierarchies and intricate interdependencies. However, as we move into an era in which Moore's Law no longer holds, there is a growing need to rethink our approaches. It is time to "leave the Baroque and come into the Rococo", embracing functional programming as a more elegant and efficient way to solve problems.

The Splendor and Weight of the Baroque in Programming

In the seventeenth century, the Baroque arose in Europe in response to the pursuit of magnificence and emotional impact in art and architecture. Baroque buildings were known for their opulence, with ornate façades, luxurious interiors, and an abundance of detail. The intent was to impress, to inspire awe, and to display power.

Analogously, Object-Oriented Programming emerged in the 1960s and 1970s as an innovative way to handle the growing complexity of software. Languages such as Simula and Smalltalk introduced concepts like classes, objects, inheritance, and encapsulation. OOP let developers model complex systems by organizing code into structures that mirrored the real world.

For years, OOP was the answer to organizing code that kept growing in size and complexity. However, just like the Baroque, the approach began to reveal its limits. Object-oriented projects often suffer from overly complex code that is hard to maintain and scale. Deep class hierarchies can produce a tangle of dependencies, where changing one small component can have unpredictable effects across the whole system.

The parallels with Baroque architecture are evident. An abundance of detail and the pursuit of grandeur can result in a structural weight that hinders adaptation and evolution. In the same way, object-oriented systems can become rigid, resisting change and demanding significant effort to add new features or fix problems.

The End of Moore's Law and the New Challenges

For decades, the technology industry benefited from Moore's Law, which predicted exponential growth in the number of transistors on a chip and enabled continuous advances in processing power. In many cases this trend masked software inefficiency, since hardware kept advancing to compensate for heavy code.

We have now reached a point where physical limits make that progression unsustainable. The end of Moore's Law means we can no longer count on hardware to automatically improve our software's performance. Moreover, the demand for distributed systems, cloud computing, and parallel processing has highlighted the need for more efficient, scalable code.

In this new scenario, OOP's complexity becomes an obstacle. The mutable state management typical of objects makes concurrent programming difficult and increases the risk of errors that are hard to detect and fix. Maintaining large object-oriented systems can become unsustainable, with high costs and long development times.

Just as the Baroque faced criticism and gave way to the Rococo, which sought a lighter, more elegant approach, programming also needs to evolve. We need paradigms that let us build systems that are robust yet flexible, efficient, and easier to maintain.

The Elegance of the Rococo and Functional Programming

The Rococo emerged in France in the early eighteenth century as a natural evolution of the Baroque, but with a significant shift in aesthetic focus. Rococo architects valued lightness, asymmetry, and delicate ornamentation. Their structures sought to create more intimate, welcoming spaces, using light colors, natural motifs, and curved lines that conveyed movement and fluidity.

Transposing this philosophy to the world of programming, we find an interesting parallel in functional programming. Although its fundamental concepts have existed since the early days of computing, it is only in recent years that the paradigm has gained prominence. With languages such as Haskell, Erlang, Clojure, and Elixir, functional programming offers an approach that prioritizes simplicity, immutability, and the absence of side effects.

In functional programming, functions are first-class citizens. They can be passed as arguments, returned as results, and composed to build more complex functionality. The emphasis on immutability means data is never altered after creation, eliminating a whole class of problems related to shared state and concurrency.

This approach aligns with the principles of the Rococo: instead of building heavy, complex structures, elegance is sought through simplicity and harmony. Functional code tends to be more concise and expressive, making it easier to read and understand. And the absence of side effects makes the software's behavior more predictable and reliable.
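
A minimal Clojure sketch of these ideas, assuming nothing beyond the core language (first-class functions, immutability, composition):

;; Data is immutable: update returns a new map, the original is untouched.
(def account {:owner "Ana" :balance 100})
(update account :balance + 50) ; => {:owner "Ana", :balance 150}
account                        ; => {:owner "Ana", :balance 100}

;; Functions are values: comp builds a new function from existing ones.
(def deposit-then-fee (comp #(- % 2) #(+ % 50)))
(deposit-then-fee 100) ; => 148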

The Rediscovery of Functional Programming in the Modern Era

With the end of Moore's Law, the industry began looking for alternatives to keep moving forward. Functional programming emerged as an effective answer to the challenges of concurrent and parallel programming. In systems where scalability is crucial, immutability and pure functions offer significant advantages.

Companies like Microsoft, with the introduction of functional features in .NET through F#, and the growing adoption of Scala and Clojure in the Java ecosystem, show that functional programming is moving from academic curiosity to practical, powerful tool.

Functional programming is not just a new way to write code but a fundamental shift in how we think about problems. It requires developers to let go of concepts ingrained by OOP and adopt a mindset that favors transforming data through pure functions.

The Need for Change: Leaving the Baroque and Embracing the Rococo

Just as the transition from Baroque to Rococo represented a search for balance and refinement, the migration to functional programming is a response to the industry's current demands. Excessive complexity is no longer sustainable. We need approaches that let us build systems that are robust yet also flexible and easy to maintain.

Functional programming offers tools for handling concurrency more naturally, reducing the risks associated with mutable state and side effects. It also promotes cleaner, more readable code, which eases collaboration and long-term maintenance.

This transition does not mean abandoning OOP concepts entirely. Many modern languages allow paradigms to be combined, taking the best of each approach. Still, it is essential to recognize the limitations of traditional OOP and to be open to new ways of thinking and solving problems.

The Journey Doesn't End Here

History teaches us that evolution is essential to progress. In art, architecture, and technology, we must constantly revisit and adapt our approaches to meet new realities. The metaphor of Baroque and Rococo illustrates the need to abandon excessive complexity in favor of elegance and simplicity.

In programming this shift is even more pressing. Given today's challenges of performance, scalability, and maintenance, it is vital to adopt paradigms that let us build high-quality software efficiently. Functional programming offers that opportunity, inviting us to rethink our practices and embrace a new way of creating.

So "leaving the Baroque and coming into the Rococo" is more than a simple metaphor. It is a call to action for every developer to rethink their technical decisions and, day by day, simplify the systems and processes under their control.

By adopting functional programming, we can not only solve problems more effectively but also help build a future in which software is ever more reliable, efficient, and elegant.

Permalink


Writing the Worst Datalog Ever in 26loc

Today, for a change from heavy interop and frameworks, let's do a light coding exercise and implement the most amateur Datalog engine by taking any shortcut we see fit!

Don't forget: when we're not busy writing silly Datalog implementations, we are available to help you with Clojure, or working to get our app Paktol (the positive spending tracker where money goes up!) off the ground.

Data Representations

The first question is how to represent facts, rules and databases.

A fact will be represented by a vector of simple values (but no symbols, we reserve symbols for variables).

[:father :bart :homer]

A rule will be represented by a list of patterns, the first pattern being the head of the rule (its "conclusion") and the rest being its body. A pattern is a vector of simple values (this time, including symbols for variables).

;; parent(c, p) :- father(c, p)
([:parent c p] [:father c p])
;; parent(c, p) :- mother(c, p)
([:parent c p] [:mother c p])

The database in Datalog is conceptually split in two: the EDB (the extensional database, every recorded fact — tables in SQL) and the IDB (the intensional database, every deduced fact — views in SQL). Let's lump them together as a single set of facts! What could possibly go wrong? 🤷‍♂️

Matching Patterns Against Facts

The result of matching a pattern against a fact is going to be nil (no match) or a map of variables to their values (let's call these bindings an environment).

However, further matches in a rule must obey the environments created by earlier matches. That's why we also pass an upstream environment to match: (defn match "Returns updated env or nil." [pattern fact env] ...)

(defn match [pattern fact env]
   (when (= (count pattern) (count fact))
     (reduce (fn [env [p v]]
               (let [p-or-v (env p p)]
                 (cond
                   (= p '_) env
                   (= p-or-v v) env
                   (symbol? p-or-v) (assoc env p v)
                   :else (reduced nil))))
       env (map vector pattern fact))))
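
For instance, under an empty environment and under one that already binds c to a conflicting value:

user=> (match '[:father c p] [:father :bart :homer] {})
{c :bart, p :homer}
user=> (match '[:father c p] [:father :bart :homer] '{c :lisa})
nil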

You may have realized that unlike textbook Datalog, the predicate name isn't special and is treated like any other slot of a fact vector. This will prove useful when implementing q at the end.

Matching rules and producing new facts

Now we need to match a chain of patterns against all known facts:

(defn match-patterns [patterns facts]
  (reduce
    (fn [envs pattern]
      (-> (for [fact facts env envs] (match pattern fact env))
        set (disj nil)))
    #{{}} patterns))
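
For example, a single pattern matched against two facts yields one surviving environment:

user=> (match-patterns '[[:father c p]]
                       #{[:father :bart :homer] [:mother :bart :marge]})
#{{c :bart, p :homer}}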

From this we can infer new facts: turn each env into a fact by replacing the variables appearing in the head with their values from the environment.

(defn match-rule [facts [head & patterns]]
  (for [env (match-patterns patterns facts)]
    (into [] (map #(env % %)) head)))
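
For example:

user=> (match-rule #{[:father :bart :homer]}
                   '([:parent c p] [:father c p]))
([:parent :bart :homer])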

Running all rules until saturation

We repeatedly apply all rules until no new facts can be derived. This saturation is detected by comparing sizes because it's cheaper than an equality check.

(defn saturate [facts rules]
  (let [facts' (into facts (mapcat #(match-rule facts %)) rules)]
    (if (< (count facts) (count facts'))
      (recur facts' rules)
      facts)))
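
A tiny run (the printed order of set elements may differ):

user=> (saturate #{[:father :bart :homer]}
                 '[([:parent c p] [:father c p])])
#{[:father :bart :homer] [:parent :bart :homer]}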

Query

A query is just an anonymous rule that we run after the others so that we get only its results.

(defn q [facts query rules]
  (-> facts (saturate rules) (match-rule query) set))

Let's try it!

The Simpsons family:

(def edb
  #{[:father :bart :homer]
    [:mother :bart :marge]
    [:father :lisa :homer]
    [:mother :lisa :marge]
    [:father :maggie :homer]
    [:mother :maggie :marge]
    [:father :homer :abe]
    [:mother :homer :mona]})

Some sample rules:

(def rules
  '[([:parent c p] [:father c p])
    ([:parent c p] [:mother c p])
    ([:grand-parent c gp] [:parent p gp] [:parent c p])
    ([:ancestor c p] [:parent c p])
    ([:ancestor c ancp] [:ancestor c anc] [:parent anc ancp])])

And now the queries:

user=> (q edb
         '([gp] [:grand-parent :bart gp])
         rules)
#{[:mona] [:abe]}
user=> (q edb
         '([anc] [:ancestor :bart anc])
         rules)
#{[:homer] [:marge] [:mona] [:abe]}

🎉 Here we are: Datalog in 26 lines of code!

Conclusion

Datalog is minimal—almost to a fault. It’s like a game of Jenga: stack too many extensions on top, and the whole thing could come crashing down. That’s why there’s a rich literature on adding just enough to make it useful without tipping over.

So here we are with a seed implementation of a seed language. We have a starting point and can grow it in any direction we are interested in.

Share online if you'd like to see more open-ended exploratory code (including growing this Datalog)!

Appendix: 26 glorious locs at once

(defn match [pattern fact env]
   (when (= (count pattern) (count fact))
     (reduce (fn [env [p v]]
               (let [p-or-v (env p p)]
                 (cond
                   (= p '_) env
                   (= p-or-v v) env
                   (symbol? p-or-v) (assoc env p v)
                   :else (reduced nil))))
       env (map vector pattern fact))))

(defn match-patterns [patterns facts]
  (reduce
    (fn [envs pattern]
      (-> (for [fact facts env envs] (match pattern fact env))
        set (disj nil)))
    #{{}} patterns))

(defn match-rule [facts [head & patterns]]
  (for [env (match-patterns patterns facts)]
    (into [] (map #(env % %)) head)))

(defn saturate [facts rules]
  (let [facts' (into facts (mapcat #(match-rule facts %)) rules)]
    (if (< (count facts) (count facts'))
      (recur facts' rules)
      facts)))

(defn q [facts query rules]
  (-> facts (saturate rules) (match-rule query) set))

Permalink

Clojure Deref (Oct 11, 2024)

Welcome to the Clojure Deref! This is a weekly link/news roundup for the Clojure ecosystem (feed: RSS). Thanks to Anton Fonarev for link aggregation.

Libraries and Tools

New releases and tools this week:

Permalink

Copyright © 2009, Planet Clojure. No rights reserved.
Planet Clojure is maintained by Baishampayan Ghose.
Clojure and the Clojure logo are Copyright © 2008-2009, Rich Hickey.
Theme by Brajeshwar.