Datomic as a Higher-Level Database

More than 20 years ago, when I first began learning programming languages, I read a line in a book:

C is a high-level language.

But it wasn’t until years later, during a university class on assembly language, when I had to use jump commands just to write a loop, that I truly realized how high-level C was. Despite this, for much of my career, I found C to be quite low-level because I didn’t want to deal with memory management via malloc and free, nor did I want to handle pointers. As my career progressed, I learned many programming languages. Java, for instance, was much higher-level than C because it provided garbage collection. Clojure, in turn, was even higher-level than Java because of its immutable collection types.

High-Level Doesn’t Mean More Features — Sometimes It’s the Opposite

High-level doesn’t refer to having more features; it means higher-level semantics. As a result, high-level languages often restrict or even eliminate lower-level semantics. High-level semantics allow you to focus on specifying what you want to achieve without worrying about every single implementation detail—the machine handles those for you. In some cases, you’re even restricted from accessing certain details because they are easy to mess up.

For example, when writing in C, you are strongly discouraged from using jump commands; when writing in Java, you cannot directly manipulate pointers; when writing in Clojure, you are advised against using mutable collection types.

High-level semantics often come with a trade-off in terms of machine performance. The JVM’s garbage collection obviously uses extra memory, and Clojure’s immutable collection types also consume more memory compared to Java. Furthermore, Clojure’s startup time far exceeds Java’s, testing the limits of human patience and even spawning solutions like Babashka, specifically designed for shell usage.

High-level typically means trading machine efficiency for developer productivity, a trade-off that’s often worth it thanks to Moore’s Law, which ensures that machine performance will automatically improve over time.

In What Ways is Datomic High-Level?

From my experience using Datomic, I’ve observed at least four ways in which it is a higher-level database:

  1. DDL (Data Definition Language)
    • Primary Keys
    • Many-to-Many Relationships
  2. Isolation Level
  3. Time-travel Queries
  4. A Query Language that References Context

DDL - Primary Keys

When working with SQL databases, I often struggled with how to design the primary key for my tables. The first decision to make is:

  1. Should I use a natural key, which is one or more data attributes derived from the business domain? This can be convenient early on, but it can become a liability later if the business meaning of those attributes changes.
  2. Should I use a surrogate key, which decouples the key from the business meaning, providing more flexibility for future modifications? Some tables, which represent parts of an entity that lack suitable natural keys, require surrogate keys.

If you decide to use a surrogate key, the next question is which data type to choose:

  • Auto-incrementing integers
  • UUIDs

Then, considerations about enumeration attacks and performance arise. Should you opt for UUIDs, which avoid enumeration attacks, or auto-incrementing integers, which offer better index performance?

Datomic makes the primary key decision for you. In the world of Datomic, the primary key is the entity ID. It’s that simple. [1]

DDL - Many-to-Many Relationships

In SQL databases, when modeling many-to-many relationships, we usually need to design a bridge table.

In Datomic, there is no need for a bridge table because you can set the :db/cardinality attribute to :db.cardinality/many, which means the field supports one-to-many relationships. This feature not only simplifies the semantics by eliminating the need for a bridge table, but also makes the syntax for one-to-many and many-to-many relationships much more consistent.
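As a sketch, here is what such an attribute definition might look like, reusing the :movie/cast attribute that appears in the queries later in this article:

;; A minimal schema sketch: :movie/cast is a reference attribute
;; that can hold many values, so no bridge table is needed.
{:db/ident       :movie/cast
 :db/valueType   :db.valueType/ref
 :db/cardinality :db.cardinality/many
 :db/doc         "The people appearing in the movie."}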

Isolation Level

SQL databases offer four isolation levels:

  • Read Uncommitted
  • Read Committed
  • Repeatable Read
  • Serializable

These various levels exist to allow for higher performance when dealing with transactions. In contrast, Datomic only provides one isolation level—Serializable. [2] With fewer options, there is less for us to worry about.

Time-travel Queries

Traditional databases are like regular files; once something is written, you can’t go back to a previous state. Datomic is different. Its state is like a git repository, allowing you to easily revert to a previous state using a time-travel query known as the as-of query.
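For example, here is a minimal sketch of an as-of query with the Datomic API, assuming conn is an existing connection:

;; Take the current database value, rewind it to a past point in
;; time, and run an ordinary query against the historical value.
(require '[datomic.api :as d])

(let [db-past (d/as-of (d/db conn) #inst "2020-01-01")]
  (d/q '[:find ?title
         :where [_ :movie/title ?title]]
       db-past))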

A Query Language that References Context

When writing SQL queries that involve multiple JOIN operations, the resulting query often becomes so long that it becomes hard to read. Human languages, in contrast, are typically composed of many short sentences. Short sentences can still convey complex ideas because they reference context. When we understand natural language, we don’t process each sentence in isolation; we use its context to fully grasp its meaning.

Datomic’s query language, Datalog, has a mechanism called Rules that can cleverly inject context into your queries.

Consider the following Datalog query, which retrieves the names of the actors in "The Terminator":

[:find ?name
 :where
 [?p :person/name ?name]
 [?m :movie/cast ?p]
 [?m :movie/title "The Terminator"]]

Now imagine that the part of the query responsible for "finding the actor’s name from the movie title" is something you repeatedly write across different queries. Is there a way to avoid rewriting this section each time?

[?p :person/name ?name]
[?m :movie/cast ?p]
[?m :movie/title ?title]

Yes, and that mechanism is called Rules. We can rewrite the above query using Datomic Rules as follows:

;; The rules we define (note the outer brackets: `%` expects a collection of rules)
[[(appear ?name ?title)
  [?p :person/name ?name]
  [?m :movie/cast ?p]
  [?m :movie/title ?title]]]

;; The rewritten Datalog query
[:find ?name
 :in $ %
 :where (appear ?name "The Terminator")]

In this version, the appear rule abstracts the definition of an actor’s "appearance" in a movie. Once this logical rule is applied by the query engine, the concept of "appearance" can be inferred as new knowledge.

This rule acts as a tool to reference context in a query. When a query language can reference context, it becomes more like human natural language, and thus more concise.
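For completeness, here is a sketch of how such a query might be run with the Datomic API: the rule set shown above is passed as an extra input, and the % in the :in clause binds to it (assuming db is a database value and rules holds the rule set):

;; `%` in the :in clause binds to the rule set passed as input.
(d/q '[:find ?name
       :in $ %
       :where (appear ?name "The Terminator")]
     db
     rules)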

Envision a New Conversation

You recommend Datomic to your boss, and he/she asks, "What are its benefits? How can you prove it?"

You reply, "It improves productivity because it is a higher-level database."

I expect your boss will then ask, "What do you mean by higher-level?"

If the conversation is conducted thoughtfully and cleverly, this opening dialogue could lead to successfully advocating for a higher-level database within your company. Savvy businesspeople may not remember or fully understand the finer details of databases, but they do understand that higher-level means spending cheap machine power to save expensive brain power. Trading something cheap for something expensive is a concept I believe businesspeople will understand.

Notes

  • [1] In practice, when using Datomic and needing to expose a unique identifier to the outside world, we typically design an additional UUID field. However, in this article, I won’t delve into all the design issues related to primary keys. My focus is: Datomic has already made the design decision for entity ID, which reduces the decisions we need to make, thus making this database more high-level.

  • [2] Most SQL database systems compose transactions from a series of updates, where each update changes the state of the database. However, Datomic transaction execution is not defined as a series of updates. It is defined as the addition of a set of datoms to the previous value of the database. Ref

Permalink

Beyond Traditional Testing: Addressing the Challenges of Non-Deterministic Software

Developing non-deterministic systems has become increasingly common. From distributed systems with untrusted inputs to AI-powered solutions, there is a growing challenge in ensuring reliability and consistency in environments that are not fully predictable. The integration of Large Language Models (LLMs) and other AI technologies can introduce outputs that change every time they are computed.

Non-deterministic software, by its very nature, can produce different outputs for the same input under seemingly identical conditions. This unpredictability presents significant challenges for testing.

This article explores some of the fundamental characteristics of non-deterministic software, discusses established best practices for testing such systems, examines recent innovations in the field with a focus on AI-driven techniques, and provides practical examples complete with Python code samples. It also investigates the unique challenges posed by LLMs in software testing and offers guidance on implementing a comprehensive testing strategy for these complex systems.

All code is available in this repository. All examples are in Python but the same ideas can be applied to any programming language. At the end, I recap a few testing frameworks that can be used with other programming languages, including C#, Java, JavaScript, and Rust.

Characteristics and Challenges of Non-Deterministic Software

Non-deterministic software can be seen as a reflection of the complex, often unpredictable world we live in. Unlike deterministic software systems, which produce the same output for a given input every time, non-deterministic systems introduce an element of variability.

Non-determinism in software can arise from various sources such as inherent randomness in the algorithms being used or the effect of an internal state that is not observable from the outside. It might also be the result of numerical computing errors. For example, when dealing with floating-point arithmetic, tiny rounding errors can accumulate and lead to divergent results.
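As a tiny illustration of this in Python, floating-point addition is not even associative, so summing the same numbers in a different order (as can happen in parallel code) yields different results:

# Floating-point addition is not associative: grouping changes the result.
a = (0.1 + 0.2) + 0.3
b = 0.1 + (0.2 + 0.3)
print(a == b)  # False
print(a, b)    # 0.6000000000000001 0.6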

A newer source of non-determinism is the integration of generative AI components, such as Large Language Models (LLMs), because at every invocation their outputs can vary significantly for the same input.

To demonstrate non-deterministic behavior, let's have a look at a simple Python example:

import random

def non_deterministic_function(x):
    if random.random() < 0.1:  # 10% chance of failure
        return None
    return x * 2

# Running this function multiple times with the same input
for _ in range(20):
    result = non_deterministic_function(5)
    print(result)

If you run this code, it returns 10 most of the time, because the function doubles the input 5, but about 10% of the time it returns None. This simple example illustrates the challenge of testing non-deterministic software: how do you write tests for a function that doesn’t always behave the same way?

To tackle these challenges, we can adapt traditional testing methods and implement new approaches. From property-based testing to AI-driven test generation, the field of software testing is evolving to meet the demands of an increasingly non-deterministic digital world.

Effective Testing Strategies for Non-Deterministic Software

Testing non-deterministic software requires a shift in how we approach software quality assurance. One interesting approach we can use to test non-deterministic software is property-based testing.

With property-based testing, rather than writing tests for specific input-output pairs, you define properties that should hold true for all possible inputs. The testing framework then generates a large number of random inputs and checks if the defined properties hold for each of them.

Let’s look at an example of property-based testing using the Hypothesis library in Python:

from hypothesis import given, strategies as st
import random

def non_deterministic_sort(lst):
    """A non-deterministic sorting function that occasionally makes mistakes."""
    if random.random() < 0.1:  # 10% chance of making a mistake
        return lst  # Return unsorted list
    return sorted(lst)

@given(st.lists(st.integers()))
def test_non_deterministic_sort(lst):
    result = non_deterministic_sort(lst)

    # Property 1: The result should have the same length as the input
    assert len(result) == len(lst), "Length of the result should match the input"

    # Property 2: The result should contain all elements from the input
    assert set(result) == set(lst), "Result should contain all input elements"

    # Property 3: The result should be sorted in most cases
    attempts = [non_deterministic_sort(lst) for _ in range(100)]

    # We allow for some failures due to the non-deterministic nature
    # Replace 'any' with 'all' to make the test fail if any attempt is not sorted
    assert any(attempt == sorted(lst) for attempt in attempts), "Function should produce a correct sort in multiple attempts"

# Run the test
if __name__ == "__main__":
    test_non_deterministic_sort()

In this example, we're testing a non-deterministic sorting function that occasionally makes mistakes. Instead of checking for a specific output, we can verify properties that should hold true regardless of the function’s non-deterministic behavior. For example, we can check that the output has the same length as the input, contains all the same elements, and is correctly sorted in at least some of multiple attempts.

While property-based testing is powerful, it can be slow and costly when LLMs are involved in the test cases. This is because each test run may require multiple invocations of the LLM, which can be computationally expensive and time-consuming. Therefore, it’s crucial to carefully design property-based tests when working with LLMs to balance thoroughness with efficiency.

Another crucial strategy for testing non-deterministic software is to check if it is feasible to create repeatable test environments. This involves controlling as many variables as possible to reduce the sources of non-determinism during testing. For example, you can use fixed random seeds, mock external dependencies, and use containerization to ensure consistent environments.
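As a minimal sketch, reusing the non_deterministic_function from the first example, a fixed seed makes its behavior repeatable across runs, and mocking pins it to a single branch:

import random
from unittest.mock import patch

# Fixing the seed makes the whole sequence of random draws, and
# therefore the function's behavior, identical across runs.
random.seed(42)
first_run = [non_deterministic_function(5) for _ in range(20)]
random.seed(42)
second_run = [non_deterministic_function(5) for _ in range(20)]
assert first_run == second_run

# Mocking random.random removes the randomness entirely.
with patch("random.random", return_value=0.5):
    assert non_deterministic_function(5) == 10
with patch("random.random", return_value=0.05):
    assert non_deterministic_function(5) is None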

When dealing with AI, especially LLMs, you can use semantic similarity measures to evaluate outputs rather than expecting exact matches. For instance, when testing an LLM-based chatbot, you might check if the model’s responses are semantically similar to a set of acceptable answers, rather than looking for specific phrases.

Here’s an example of how to test an LLM’s output using semantic similarity:

import json
import boto3

from scipy.spatial.distance import cosine

AWS_REGION = "us-east-1"
EMBEDDING_MODEL_ID = "amazon.titan-embed-text-v2:0"

bedrock_runtime = boto3.client('bedrock-runtime', region_name=AWS_REGION)

def get_embedding(text):
    body = json.dumps({"inputText": text})
    response = bedrock_runtime.invoke_model(
        modelId=EMBEDDING_MODEL_ID,
        contentType="application/json",
        accept="application/json",
        body=body
    )
    response_body = json.loads(response['body'].read())
    return response_body['embedding']

def semantic_similarity(text1, text2):
    embedding1 = get_embedding(text1)
    embedding2 = get_embedding(text2)
    return 1 - cosine(embedding1, embedding2)

def test_llm_response(llm_function, input_text, acceptable_responses, similarity_threshold=0.8):
    llm_response = llm_function(input_text)
    print("llm_response:", llm_response)

    for acceptable_response in acceptable_responses:
        similarity = semantic_similarity(llm_response, acceptable_response)
        print("acceptable_response:", acceptable_response)
        if similarity >= similarity_threshold:
            print("similarity:", similarity)
            return True

    return False

# Example usage
def mock_llm(input_text):
    # This is a mock LLM function for demonstration purposes
    return "The capital of France is Paris, a city known for its iconic Eiffel Tower."

input_text = "What is the capital of France?"
acceptable_responses = [
    "The capital of France is Paris.",
    "Paris is the capital city of France.",
    "France's capital is Paris, known for its rich history and culture."
]

result = test_llm_response(mock_llm, input_text, acceptable_responses)
print(f"LLM response test passed: {result}")

In this example, we use Amazon Bedrock to compute semantic embeddings of a simulated LLM’s response and a set of acceptable responses. Then, we use cosine similarity to determine if the LLM’s output is semantically similar enough to any of the acceptable responses.

On another note, an interesting development not strictly related to non-deterministic software testing is the use of LLMs themselves to generate test data and check test outputs. This approach leverages the power of LLMs to understand context and generate diverse, realistic test cases.

Here’s an example generating structured test data in JSON format:

import json
import boto3

AWS_REGION = "us-east-1"

MODEL_ID = "us.anthropic.claude-3-5-sonnet-20240620-v1:0"   

bedrock_runtime = boto3.client('bedrock-runtime', region_name=AWS_REGION)

def generate_structured_test_data(prompt, num_samples=5):
    response = bedrock_runtime.converse(
        modelId=MODEL_ID,
        messages=[{
            'role': 'user',
            'content': [{ 'text': prompt }]
        }]
    )
    generated_data = response['output']['message']['content'][0]['text']
    try:
        json_data = json.loads(generated_data)
    except json.JSONDecodeError:
        print("Generated data is not valid JSON")

    return json_data

# Example usage
prompt = """Generate 5 JSON objects representing potential user inputs for a weather forecasting app.
Each object should have 'location' and 'query' fields.
Output the result as a valid JSON array.
Output JSON and nothing else.
Here's a sample to guide the format:
[
  {
    "location": "New York",
    "query": "What's the temperature tomorrow?"
  }
]"""

test_inputs = generate_structured_test_data(prompt)

print(json.dumps(test_inputs, indent=2))

In this example, we're using Amazon Bedrock and the Anthropic Claude 3.5 Sonnet model to generate structured JSON test inputs for a weather forecasting app. Using this approach, you can create a wide range of test cases, including edge cases that could be difficult to think of initially. These test cases can be stored and reused multiple times.

Similarly, LLMs can be used to check test outputs, especially for tasks where the correct answer might be subjective or context-dependent. This approach is more precise than just using semantic similarity but is slower and more costly. The two approaches can be used together. For example, if the semantic similarity test has passed, we then use an LLM for further checks.

import boto3

AWS_REGION = "us-east-1"

MODEL_ID = "us.anthropic.claude-3-5-sonnet-20240620-v1:0"   

bedrock_runtime = boto3.client('bedrock-runtime', region_name=AWS_REGION)

def check_output_with_llm(input_text, test_output, prompt_template):
    prompt = prompt_template.format(input=input_text, output=test_output)

    response = bedrock_runtime.converse(
        modelId=MODEL_ID,
        messages=[{
            'role': 'user',
            'content': [{ 'text': prompt }]
        }]
    )

    response_content = response['output']['message']['content'][0]['text']

    return response_content.strip().lower() == "yes"

# Example usage
input_text = "What's the weather like today?"
test_output = "It's sunny with a high of 75°F (24°C) and a low of 60°F (16°C)."
prompt_template = "Given the input question '{input}', is this a reasonable response: '{output}'? Answer yes or no and nothing else."

is_valid = check_output_with_llm(input_text, test_output, prompt_template)

print('input_text:', input_text)
print('test_output:', test_output)
print(f"Is the test output a reasonable response? {is_valid}")

In this example, we're again using an Anthropic Claude model to evaluate whether the system’s response is reasonable given the input question. Depending on the difficulty of the test, we can use a more or less powerful model to optimize for speed and cost.

This approach can be used for testing chatbots, content generation systems, or any other application where the correct output isn’t easily defined by simple rules.
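As a sketch of the combination mentioned above (running the cheap semantic screen first, then the LLM judge), using the semantic_similarity and check_output_with_llm functions defined earlier; the threshold is illustrative:

def two_stage_check(llm_function, input_text, acceptable_responses,
                    prompt_template, similarity_threshold=0.8):
    # Compute the response once so both stages judge the same output.
    response = llm_function(input_text)

    # Stage 1: cheap semantic screen against the acceptable answers.
    if not any(semantic_similarity(response, acceptable) >= similarity_threshold
               for acceptable in acceptable_responses):
        return False

    # Stage 2: slower but more precise LLM-based judgment.
    return check_output_with_llm(input_text, response, prompt_template)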

These strategies - property-based testing, repeatable environments, semantic similarity checking, and LLM-assisted test generation and validation - form the foundation of effective testing for non-deterministic software. They allow us to make meaningful assertions about system behavior even when exact outputs cannot be predicted.

Advanced Techniques for Testing Complex Non-Deterministic Systems

Using AI to generate test cases can go beyond generative AI and LLMs. For example, machine learning models can analyze historical test data and system behavior to identify patterns and generate test cases that are most likely to uncover bugs or edge cases that a human tester might miss.

Let's see an example of using a simple machine learning model to generate test cases for a non-deterministic function.

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Simulated historical test data
# Features: input_a, input_b, system_load
# Target: 0 (pass) or 1 (fail)
X = np.array([
    [1, 2, 0.5], [2, 3, 0.7], [3, 4, 0.3], [4, 5, 0.8], [5, 6, 0.4],
    [2, 2, 0.6], [3, 3, 0.5], [4, 4, 0.7], [5, 5, 0.2], [6, 6, 0.9]
])
y = np.array([0, 0, 0, 1, 0, 0, 0, 1, 0, 1])

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a random forest classifier
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)

# Function to generate new test cases
def generate_test_cases_from_historical_test_data(n_cases):
    # Generate random inputs
    new_cases = np.random.rand(n_cases, 3)
    new_cases[:, 0] *= 10  # Scale input_a to 0-10
    new_cases[:, 1] *= 10  # Scale input_b to 0-10

    # Predict failure probability
    failure_prob = clf.predict_proba(new_cases)[:, 1]

    # Sort cases by failure probability
    sorted_indices = np.argsort(failure_prob)[::-1]

    return new_cases[sorted_indices]

# Generate and print top 5 test cases most likely to fail
top_test_cases = generate_test_cases_from_historical_test_data(100)[:5]
print("Top 5 test cases most likely to fail:")
for i, case in enumerate(top_test_cases, 1):
    print(f"Case {i}: input_a = {case[0]:.2f}, input_b = {case[1]:.2f}, system_load = {case[2]:.2f}")

This example demonstrates the use of a random forest classifier to generate test cases that are more likely to uncover issues in a system. Models can be better than humans at learning from historical data to predict which combinations of inputs and system conditions are most likely to cause failures.

Another related technique is the use of chaos engineering for testing non-deterministic systems. For example, you can deliberately introduce failures and perturbations into a system to test its resilience and identify potential issues before they occur in production.

For instance, you can randomly terminate instances in a distributed system, simulate network latency, or inject errors into data streams. By systematically introducing chaos in a controlled environment, you can uncover weaknesses in a system that might not be apparent under normal testing conditions.
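Here is a minimal sketch of that idea: a wrapper that injects random failures and latency into any callable, so tests can observe how the rest of the system copes (the failure rate and delay values are illustrative):

import random
import time

def chaotic(func, failure_rate=0.05, max_delay_seconds=0.5):
    """Wrap a callable so it occasionally fails or stalls."""
    def wrapper(*args, **kwargs):
        if random.random() < failure_rate:
            raise ConnectionError("injected fault")       # simulated failure
        time.sleep(random.uniform(0, max_delay_seconds))  # simulated latency
        return func(*args, **kwargs)
    return wrapper

# Example: exercise a consumer against a flaky data source.
flaky_fetch = chaotic(lambda: {"price": 42.0})
for _ in range(10):
    try:
        print(flaky_fetch())
    except ConnectionError:
        print("recovered from injected fault")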

When it comes to testing AI-powered systems, especially those involving Large Language Models (LLMs), a similar approach is to use adversarial testing, where input prompts are designed to challenge the LLM’s understanding and generate edge cases.

Here’s an example of how to implement a simple adversarial testing framework for an LLM:

import random
import string

def generate_adversarial_prompt(base_prompt, num_perturbations=3):
    perturbations = [
        lambda s: s.upper(),
        lambda s: s.lower(),
        lambda s: ''.join(random.choice([c.upper(), c.lower()]) for c in s),
        lambda s: s.replace(' ', '_'),
        lambda s: s + ' ' + ''.join(random.choices(string.ascii_letters, k=5)),
    ]

    adversarial_prompt = base_prompt
    for _ in range(num_perturbations):
        perturbation = random.choice(perturbations)
        adversarial_prompt = perturbation(adversarial_prompt)

    return adversarial_prompt

def test_llm_robustness(llm_function, base_prompt, expected_topic, num_tests=10):
    for _ in range(num_tests):
        adversarial_prompt = generate_adversarial_prompt(base_prompt)
        response = llm_function(adversarial_prompt)

        # Here I use my semantic similarity function to check if the response
        # is still on topic despite the adversarial prompt
        is_on_topic = semantic_similarity(response, expected_topic) > 0.7

        print(f"Prompt: {adversarial_prompt}")
        print(f"Response on topic: {is_on_topic}")
        print("---")

# Example usage (assuming I have my LLM function and semantic_similarity function from before)
base_prompt = "What is the capital of France?"
expected_topic = "Paris is the capital of France"

test_llm_robustness(mock_llm, base_prompt, expected_topic)

This example generates adversarial prompts by applying random perturbations to a base prompt, then tests whether the LLM can still produce on-topic responses despite these challenging inputs. Other approaches to generating adversarial prompts include the use of different human languages, asking to output in poetry or specific formats, and asking for internal information such as tool use syntax.

Because no single technique is a silver bullet, the most effective testing strategies often involve a combination of approaches, tailored to the specific characteristics and requirements of the system under test.

In the next section, let's explore how to implement a comprehensive testing strategy that incorporates advanced techniques alongside more traditional methods, creating a robust approach to testing even the most complex non-deterministic systems.

Comprehensive Strategy for Testing Non-Deterministic Software

To effectively test complex systems, we need a comprehensive strategy that combines multiple techniques and adapts to the specific challenges of each system.

Let's go through the process of implementing such a strategy, using a hypothetical AI-powered recommendation system as an example. This system uses machine learning models to predict user preferences, incorporates real-time data, and interfaces with a Large Language Model to generate personalized content descriptions. We can use it as an example of a non-deterministic system with multiple sources of unpredictability.

The first step in this strategy is to identify the critical components of the system and assess the potential impact of failures. In this sample recommendation system, we can find the following high-risk areas:

  • The core recommendation algorithm
  • The real-time data processing pipeline
  • The LLM-based content description generator

For each of these components, let's consider the potential impact of failures on user experience, data integrity, and system stability. This assessment can be used to guide testing efforts, ensuring that resources are focused where they’re most needed.

Then, with the previous risk assessment in hand, we can design a layered testing approach that combines multiple techniques.

Unit Testing with Property-Based Tests

For individual components, we can use property-based testing to ensure they behave correctly across a wide range of inputs. Here’s an example of how to test the recommendation algorithm.

from hypothesis import given, strategies as st
import numpy as np

def recommendation_algorithm(user_preferences, item_features):
    # Simplified recommendation algorithm
    return np.dot(user_preferences, item_features)

@given(
    st.lists(st.floats(min_value=-1, max_value=1), min_size=5, max_size=5),
    st.lists(st.lists(st.floats(min_value=-1, max_value=1), min_size=5, max_size=5), min_size=1, max_size=10)
)
def test_recommendation_algorithm(user_preferences, item_features_list):
    recommendations = [recommendation_algorithm(user_preferences, item) for item in item_features_list]

    # Property 1: Recommendations should be in the range [-5, 5] given our input ranges
    assert all(-5 <= r <= 5 for r in recommendations), "Recommendations out of expected range"

    # Property 2: Higher dot products should result in higher recommendations
    sorted_recommendations = sorted(zip(recommendations, item_features_list), reverse=True)
    for i in range(len(sorted_recommendations) - 1):
        assert np.dot(user_preferences, sorted_recommendations[i][1]) >= np.dot(user_preferences, sorted_recommendations[i+1][1]), "Recommendations not properly ordered"

# Run the test
test_recommendation_algorithm()

Integration Testing with Chaos Engineering

To test how our components work together under various conditions, we can use chaos engineering techniques. For example, we can randomly degrade the performance of the real-time data pipeline, simulate network issues, or introduce delays in API responses. This helps ensure the system remains stable even under suboptimal conditions.
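As a sketch of such a test (fetch_realtime_features and recommend are hypothetical stand-ins for the system’s real components), we can degrade the pipeline and assert that the system still answers within its latency budget:

import random
import time

def with_injected_latency(call, max_delay_seconds=2.0):
    # Delay each call by a random amount to simulate a degraded pipeline.
    def delayed(*args, **kwargs):
        time.sleep(random.uniform(0, max_delay_seconds))
        return call(*args, **kwargs)
    return delayed

def test_recommendations_under_degraded_pipeline():
    slow_fetch = with_injected_latency(fetch_realtime_features)  # hypothetical
    start = time.monotonic()
    recommendations = recommend(user_id=42, fetch=slow_fetch)    # hypothetical
    elapsed = time.monotonic() - start

    assert recommendations, "should fall back rather than return nothing"
    assert elapsed < 5.0, "should meet the latency budget even when degraded"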

System Testing with AI-Generated Test Cases

For end-to-end testing, we can use AI to generate diverse and challenging test scenarios. This might involve creating complex user profiles, simulating various usage patterns, and generating edge case inputs for our LLM.

Continuous Monitoring and Adaptation

A good testing strategy doesn’t end when we deploy to production. We need robust monitoring and observability tools to catch issues that might not have surfaced during testing.

This includes:

  • Real-time performance monitoring
  • Anomaly detection algorithms to identify unusual behavior
  • A/B testing for gradual rollout of changes
  • User feedback collection and analysis

Observability tools often include native anomaly detection capabilities to help find important information across a large amount of telemetry data. For example, Amazon CloudWatch implements anomaly detection for metrics and logs.

Here’s a simple example of how to implement an anomaly detection system using basic statistical methods:

import numpy as np
from scipy import stats

class AnomalyDetector:
    def __init__(self, window_size=100):
        self.window_size = window_size
        self.values = []

    def add_value(self, value):
        self.values.append(value)
        if len(self.values) > self.window_size:
            self.values.pop(0)

    def is_anomaly(self, new_value, z_threshold=3.0):
        if len(self.values) < self.window_size:
            return False  # Not enough data to detect anomalies yet

        mean = np.mean(self.values)
        std = np.std(self.values)
        z_score = (new_value - mean) / std

        return abs(z_score) > z_threshold

# Usage
detector = AnomalyDetector()

# Simulate some normal values
for _ in range(100):
    detector.add_value(np.random.normal(0, 1))

# Test with a normal value
print(detector.is_anomaly(1.5))  # Probably False

# Test with an anomaly
print(detector.is_anomaly(10))  # Probably True

Alternative property-based testing tools for other programming languages

In these examples, I used the Hypothesis Python module. Here are a few interesting alternatives for other programming languages.

Language   | Recommended Library | Reasoning
C#         | FsCheck             | Widely used in the .NET ecosystem; supports both C# and F#.
Clojure    | test.check          | Used by clojure.spec for generative testing; well-integrated with the language.
Haskell    | QuickCheck          | The original property-based testing library, still the standard in Haskell.
Java       | jqwik               | Modern design, good documentation, and seamless integration with JUnit 5.
JavaScript | fast-check          | Actively maintained, well-documented, and integrates well with popular JS testing frameworks.
Python     | Hypothesis          | Most mature, feature-rich, and widely adopted in the Python ecosystem.
Scala      | ScalaCheck          | The de facto standard for property-based testing in Scala.
Ruby       | Rantly              | More actively maintained than alternatives; good integration with RSpec.
Rust       | proptest            | More actively developed than quickcheck for Rust, with helpful features like persistence of failing examples.

Continuous Improvement

A good testing strategy includes a process for continuous improvement. This involves:

  • Regular review of test results and production incidents
  • Updating the test suite based on new insights
  • Staying informed about new testing techniques and tools
  • Adapting the testing strategy as the system evolves

Implementing a comprehensive testing strategy for non-deterministic software is no small task. It requires a combination of technical skill, creativity, and a deep understanding of the system under test. However, by combining multiple testing techniques, leveraging AI and machine learning, and maintaining a commitment to continuous improvement, we can create robust, reliable systems even in the face of uncertainty.

As we look to the future, the only thing we can be certain of is that the field of software testing will continue to evolve. New challenges will arise as systems become more powerful and complex. But with the foundations explored in this article - from property-based testing to AI-driven test generation, from chaos engineering to semantic similarity checking - there is a solid base on which to build.

The strategies and techniques discussed here are not set in stone. They’re a starting point, a foundation upon which you can add your own approach tailored to your specific needs and challenges. The world of software is ever-changing, and so too must be our approaches to testing it. I encourage you to embrace the uncertainty, stay curious, and never stop learning. The future of software testing is definitely interesting, and it’s in our hands to learn and, when possible, shape it.

To continue learning, have a look at the repository that includes all code in this article.

Permalink

How Could Clojure Web Development Suck Less

In this episode, we dive into the intricacies of web development with Clojure, exploring how it can be improved and made less cumbersome. We also touch on Rama, as Ben brings more expertise in that area. Additionally, we explore Ben's career journey, f...

Permalink

From Micro to Macro: Scaling Front-Ends

Are you facing challenges in scaling your front-end to meet a growing number of users? With the increasing complexity of modern web applications, strategies like micro front-ends, monorepositories, global state management, and cache optimization are essential. 

In this article, we explore best practices for scaling front-end applications, discussing how to implement micro front-ends, manage versions in monorepositories, apply effective caching strategies, and efficiently maintain global states. 

Discover how Nubank is overcoming scalability challenges in the front-end and how you can apply these approaches to build agile, responsive, and easy-to-maintain user interfaces.

The Challenge of Scale

Companies like Nubank face unique challenges. With over 100 million customers in Brazil, Mexico and Colombia, handling large-scale distributed systems is not just a necessity but an obligation. Managing transactions like PIX, ensuring service stability, and providing a consistent user experience require innovative solutions.

Moreover, working with advanced technologies like Clojure and Datomic—whose development is influenced by engineers within Nubank itself—adds additional layers of complexity and opportunity. These technologies are not just tools; they are integral parts of our scalability strategy and continuous innovation.

Micro Front-Ends: Dividing to Conquer

The micro front-end architecture has emerged as a solution to many challenges faced by large development teams. But what exactly are micro front-ends?

What Micro Front-Ends Are (and What They Aren’t)

Micro front-ends are an extension of the microservices concept to the front-end. They allow different teams to develop, deploy, and maintain distinct parts of the user interface independently.

This means that each team can work at its own pace, choose its own technologies (to a certain extent), and deploy updates without impacting the system as a whole.

It’s important to highlight that micro front-ends are not:

  • NPM packages or monolithic modules: Simply splitting a monolith into packages doesn’t offer the benefits of independent deployment or team isolation.
  • Separate applications for the end-user: The user experience should be unified. We’re not talking about multiple distinct applications but a single application composed of several autonomous parts.

Benefits of Micro Front-Ends

  • Independent deployments: Teams can deploy updates without coordinating with the entire organization.
  • Team scalability: New developers can be onboarded more quickly, focusing on a specific part of the system.
  • Fault isolation: Issues in one micro front-end don’t necessarily affect the entire application.
  • Technological flexibility: The possibility of using different frameworks or libraries, although this should be done cautiously to avoid client-side overload.

Costs and Considerations

  • Complex initial setup: Implementing micro front-ends requires careful planning and a robust initial configuration.
  • Cohesion of user experience: Ensuring that the interface is consistent and cohesive is a challenge when multiple teams are involved.
  • Performance overhead: Using different frameworks can increase bundle size and affect performance.
  • Observability and debugging: Monitoring and debugging an application composed of multiple micro front-ends requires advanced tools and practices.

Implementation Strategies

There are several approaches to implementing micro front-ends:

Client-Side Composition

This is the most common approach, where the integration of micro front-ends occurs in the user’s browser. Technologies like Web Components, Module Federation (Webpack 5), and frameworks like Single SPA facilitate this composition.

Server-Side or CDN Composition

The assembly of micro front-ends occurs before reaching the client, either on the server or CDN. Tools and techniques like Edge Side Includes (ESI) can be utilized.

Communication Between Micro Front-Ends

Efficient communication is essential. It’s recommended to use:

  • Custom events: Allow micro front-ends to communicate without directly depending on each other.
  • Shared states via browser APIs: Such as Local Storage or IndexedDB.
  • Avoid excessive global dependencies: Minimizes coupling between components.

Version Control in Monorepositories

Managing versions in a monorepository can be challenging, especially when multiple teams are working on different parts of the system. Here are some practices to handle this:

Individual Package Versioning

Tools like Lerna or Nx allow you to manage individual package versions within a monorepository. This enables each team to control the versions of their own components or modules, maintaining independence and facilitating coordination.

Avoiding Git Submodules

While Git submodules might seem like a solution, they often introduce additional complexity. Instead, using NPM or Yarn workspaces can simplify the management of internal dependencies.

Benefits of the Monorepository

  • Code consistency: Facilitates code standardization and reuse.
  • Visibility: All teams have access to the complete source code, promoting collaboration.
  • Automation: Simplifies the setup of CI/CD pipelines that cover the entire system.

Caching Strategies for Bundle Loading

Efficiency in loading resources is crucial for application performance. Well-implemented caching strategies can significantly improve the user experience.

Caching Shared Resources

By using technologies like Module Federation, it’s possible to share common dependencies among different micro front-ends, avoiding redundant downloads. To achieve this:

  • Define shared modules: Configure which libraries or frameworks should be shared to prevent multiple versions on the client.
  • Compatible versions: Ensure that shared dependencies are compatible with each other to avoid conflicts.

CDN-Level Caching

Utilizing a Content Delivery Network (CDN) allows static resources to be delivered more quickly to users by leveraging distributed caching.

  • Cache-Control configurations: Adjust HTTP headers to control how and for how long resources should be cached.
  • Cache invalidation: Have strategies to invalidate or update the cache when new versions of resources are deployed.

Browser Caching

  • Service Workers: Implement caching via Service Workers for more granular control over which resources are stored and when they are updated.
  • Preloading and Prefetching: Anticipate which resources will be needed and load them in advance.

Managing Global States in Host Applications

Maintaining a consistent global state in an application composed of multiple micro front-ends is a challenge.

Recommended Strategies

  • Custom events: Use the browser’s event system for communication between micro front-ends without creating rigid dependencies.
  • Shared local storage: APIs like Local Storage or IndexedDB can serve as a means to share global state.
  • Global contexts: In frameworks like React, you can use Context API, but be careful not to introduce unwanted coupling.

Best Practices

  • Domain isolation: Each micro front-end should be responsible for its own local state and interact with the global state only when necessary.
  • Well-defined contracts: Establish clear interfaces for communication between components, facilitating maintenance and evolution.

Standardization and Platform Teams

While micro front-ends address technical scalability, code standardization and the existence of platform teams are crucial for the human scalability of development teams.

The Role of Platform Teams

  • Defining good standards: Create and maintain code standards that make teams’ work easier.
  • Tools and infrastructure: Develop tools that automate repetitive tasks and ensure code quality.
  • Facilitating collaboration: Ensure different teams can work together efficiently.

Importance of Standardization

  • Faster onboarding: New developers adapt more quickly to standardized code.
  • Consistent quality: Reduces the incidence of bugs and maintenance issues.
  • Easier code review: Code reviews are more effective when there’s a consistent style.

Avoiding Unnecessary Complexity

  • Simplicity as standard: Opt for simple solutions that solve problems without adding excessive complexity.
  • Value-based decisions: Implement technologies and standards that bring clear benefits to the business and the team.
  • Beware of technological “hype”: Not every new library or framework is suitable for your application’s context.

Organizational Laws Applied to Code

Conway’s Law states that a system’s structure mirrors the communication structure of the organization that develops it. Therefore, aligning technical architecture with team organization is not just beneficial but essential.

  • Aligned structures: Autonomous teams responsible for specific micro front-ends reflect a modular architecture.
  • Efficient communication: Fewer dependencies between teams reduce the need for constant communication and complex alignments.
  • Continuous evolution: A flexible organization allows the architecture to evolve with business needs.

How to Start

  • Pilot project: Implement a micro front-end in a non-critical part of the system to understand the challenges and benefits.
  • Define standards: Establish clear conventions from the outset for routes, communication, and styles.
  • Invest in observability: Monitoring tools are essential to quickly identify issues.
  • Documentation and communication: Keep documentation up to date and promote communication between teams to share learnings.

Conclusion

Scaling front-ends effectively requires a combination of technical and organizational solutions. Micro front-end architectures offer a path to handle technical complexity, while standardization and platform teams address the human challenges of large-scale collaboration.

At Nubank, we understand that continuous innovation and adaptability are essential to provide the best experience to our customers. Whether adopting advanced technologies or restructuring our teams, we are committed to evolving and facing the scalability challenges of the modern world.

Want to be part of this challenge? We’re always looking for people passionate about technology and innovation to build the purple future together!

For more insights like these, watch the recording of the Engineering meetup.

The post From Micro to Macro: Scaling Front-Ends appeared first on Building Nubank.

Permalink

jank development update - Moving to LLVM IR

Hi everyone! It’s been a few months since the last update and I’m excited to outline what’s been going on and what’s upcoming for jank, the native Clojure dialect. Many thanks to Clojurists Together and my Github sponsors for the support. Let’s get into it!

Permalink

Clojure macros continue to surprise me

Clojure macros have two modes: avoid them at all costs/do very basic stuff, or go absolutely crazy.

Here’s the problem: I’m working on Humble UI’s component library, and I wanted to document it. While at it, I figured it could serve as an integration test as well—since I showcase every possible option, why not test it at the same time?

This is what I came up with: I write component code, and in the application, I show a table with the running code on the left and the source on the right:

It was important that code that I show is exactly the same code that I run (otherwise it wouldn’t be a very good test). Like a quine: hey program! Show us your source code!

Simple with Clojure macros, right? Indeed:

(defmacro table [& examples]
  (list 'ui/grid {:cols 2}
    (for [[_ code] (partition 2 examples)]
      (list 'list
        code (pr-str code)))))
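For example, a call might look like this (ui/button and ui/label are hypothetical Humble UI components; in each pair, the first element is a label the macro ignores, and the second is the form to both run and show):

(table
  "Button" (ui/button {:on-click #(println "click!")} "Click me")
  "Label"  (ui/label "Hello, Humble UI"))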

This macro accepts the code as an already-parsed AST and, for each example, emits a pair back: the AST itself (basically a no-op) and the string that the AST serializes to.

This is what I consider to be a “normal” macro usage. Nothing fancy, just another day at the office.

Unfortunately, this approach reformats code: while in the macro, all we have is an already parsed AST (data structures only, no whitespaces) and we have to pretty-print it from scratch, adding indents and newlines.

I tried a couple of existing formatters (clojure.pprint, zprint, cljfmt) but wasn’t happy with any of them. The problem is tricky—sometimes a vector is just a vector, but sometimes it’s a UI component and shows the structure of the UI.

And then I realized that I was thinking inside the box all the time. We already have the perfect formatting—it’s in the source file!

So what if... No, no, it’s too brittle. We shouldn’t even think about it... But what if...

What if our macro read the source file?

Like, actually went to the file system, opened a file, and read its content? We already have the file name conveniently stored in *file*, and luckily Clojure keeps sources around.

So this is what I ended up with:

(defn slurp-source [file key]
  (let [content      (slurp (io/resource file))
        key-str      (pr-str key)
idx          (str/index-of content key-str)
        content-tail (subs content (+ idx (count key-str)))
        reader       (clojure.lang.LineNumberingPushbackReader.
                       (java.io.StringReader.
                         content-tail))
        indent       (re-find #"\s+" content-tail)
        [_ form-str] (read+string reader)]
    (->> form-str
      str/split-lines
      (map #(if (str/starts-with? % indent)
              (subs % (count indent))
              %)))))

Go to a file. Find the string we are interested in. Read the first form after it as a string. Remove common indentation. Render. As a string.

Voilà!

I know it’s bad. I know you shouldn’t do it. I know. I know.

But still. Clojure is the most fun I have ever had with any language. It lets you play with code like never before. Do the craziest, stupidest things. Read the source file of the code you are evaluating? Fetch code from the internet and splice it into the currently running program?

In any other language, this would’ve been a project. You’d need a parser, a build step... Here—just ten lines of code, in the vanilla language, no tooling or setup required.

Sometimes, a crazy thing is exactly what you need.

Permalink

Leave the Baroque, Come into the Rococo: A New Look at Functional Programming

The other day I was coding while listening to this song, and in the middle of it came the line: "Sai do Barroco, vai pro meio do Rococó" ("Leave the Baroque, go into the Rococo"). I finished the code and went off to some of the mundane meetings of a software engineer's life; the topic, specifically, was how to make a set of processes within the company's ecosystem simpler and more resilient.

But for some reason the line kept playing in my head: "Sai do Barroco, vai pro meio do Rococó". Once the meetings were over, I went to Google to understand more about what these movements are, and my mind was simply blown. I couldn't do anything else until I had produced the content below. I suggest you press play on the song and join me on this trip.

https://www.youtube.com/watch?v=otXx46_FA80&ab_channel=CercleRecords

The journey begins here

Imagine yourself walking through the streets of a historic European city. On one side rises a Baroque cathedral, its exuberant facades, imposing columns, and ornamental details seeming to compete for your attention. Every architectural element is loaded with symbolism and complexity, reflecting an era in which grandeur and splendor were synonymous with power and innovation.

Interior of a Baroque church

Now, turning the corner, you come across a Rococo palace. The atmosphere changes immediately. Heavy lines give way to graceful curves, and the ornamental details become lighter and more delicate. There is a sense of harmony and elegance that invites quiet contemplation. Complexity gives way to refined simplicity, without losing artistic richness.

Interior of a Rococo palace

This architectural transition from Baroque to Rococo serves as a powerful metaphor for the world of programming. Just as Baroque architecture reflected a heavy, complex approach, Object-Oriented Programming (OOP) has for decades been the dominant paradigm, structuring systems with deep hierarchies and complex interdependencies. However, as we move into an era where Moore's Law no longer holds, there is a growing need to rethink our approaches. It is time to "leave the Baroque and come into the Rococo," embracing functional programming as a more elegant and efficient way of solving problems.

The Splendor and Weight of the Baroque in Programming

In the seventeenth century, the Baroque emerged in Europe as a response to the pursuit of magnificence and emotional impact in art and architecture. Baroque buildings were known for their opulence, with ornate facades, luxurious interiors, and an abundance of detail. The intention was to impress, provoke admiration, and demonstrate power.

Analogously, Object-Oriented Programming emerged in the 1960s and 1970s as an innovative solution for dealing with the growing complexity of software. Languages such as Simula and Smalltalk introduced concepts like classes, objects, inheritance, and encapsulation. OOP let developers model complex systems, organizing code into structures that mirrored the real world.

For years, OOP was the answer to the need to organize code that kept growing in size and complexity. However, just like the Baroque, this approach began to reveal its limitations. Object-oriented projects often suffer from overly complex code that is difficult to maintain and scale. Deep class hierarchies can lead to a tangle of dependencies, where changing one small component can have unpredictable effects across the entire system.

The similarities with Baroque architecture are evident. The abundance of detail and the pursuit of grandeur can result in a structural weight that hinders adaptation and evolution. Likewise, object-oriented systems can become rigid, resisting change and demanding significant effort to incorporate new features or fix problems.

The End of Moore's Law and the New Challenges

For decades, the technology industry benefited from Moore's Law, which predicted the exponential growth in the number of transistors on a chip, enabling continuous advances in processing power. In many cases this trend masked software inefficiency, since hardware kept advancing to compensate for heavy code.

However, we have reached a point where physical limits make that progression unsustainable. The end of Moore's Law means we can no longer count on hardware to automatically improve the performance of our software. Moreover, the demand for distributed systems, cloud computing, and parallel processing has made the need for more efficient and scalable code evident.

In this new scenario, the complexity of OOP becomes an obstacle. The management of mutable state, typical of objects, makes concurrent programming difficult and increases the risk of errors that are hard to detect and fix. Maintaining large object-oriented systems can become unsustainable, with high costs and long development times.

Just as the Baroque faced criticism and gave way to the Rococo, which sought a lighter, more elegant approach, programming also needs to evolve. We must adopt paradigms that allow us to build systems that are robust, yet at the same time flexible, efficient, and easier to maintain.

The Elegance of the Rococo and Functional Programming

The Rococo emerged in France in the early eighteenth century as a natural evolution of the Baroque, but with a significant shift in aesthetic focus. Rococo architects valued lightness, asymmetry, and delicate ornamentation. Their structures sought to create more intimate and welcoming spaces, using light colors, natural motifs, and curved lines that conveyed movement and fluidity.

Carrying this philosophy over to the world of programming, we find an interesting parallel in functional programming. Although the fundamental concepts of functional programming have existed since the early days of computing, it is only in recent years that the paradigm has gained prominence. With languages such as Haskell, Erlang, Clojure, and Elixir, functional programming offers an approach that prioritizes simplicity, immutability, and the absence of side effects.

In functional programming, functions are first-class citizens. They can be passed as arguments, returned as results, and composed to create more complex functionality. The emphasis on immutability means that data is not changed after it is created, eliminating a whole class of problems related to shared state and concurrency.

This approach aligns with the principles of the Rococo: instead of building heavy, complex structures, elegance is sought through simplicity and harmony. Functional code tends to be more concise and expressive, making it easier to read and understand. Moreover, the absence of side effects makes software behavior more predictable and reliable.
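A tiny Clojure sketch of these ideas: pure functions over immutable data, composed into a pipeline:

;; The input vector is never modified, and the same input always
;; produces the same output: no hidden state, no side effects.
(def prices [100 250 40])

(defn with-discount [price]
  (* price 0.9))

(->> prices
     (map with-discount)
     (reduce +))
;; => 351.0

prices
;; => [100 250 40] (unchanged)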

The Rediscovery of Functional Programming in the Modern Era

With the end of Moore's Law, the industry began looking for alternatives to keep advancing. Functional programming emerged as an effective answer to the challenges of concurrent and parallel programming. In systems where scalability is crucial, immutability and pure functions offer significant advantages.

Companies like Microsoft, with the introduction of functional features in .NET through F#, and the growing adoption of Scala and Clojure in the Java ecosystem, show that functional programming is moving from academic curiosity to practical, powerful tool.

Functional programming is not just a new way of writing code; it is a fundamental shift in how we think about problems. It requires developers to let go of concepts deeply rooted in OOP and adopt a mindset that favors transforming data through pure functions.

A Necessidade de Mudar: Saindo do Barroco e Abraçando o Rococó

Assim como a transição do Barroco para o Rococó representou uma busca por equilíbrio e refinamento, a migração para a programação funcional é uma resposta às demandas atuais da indústria. A complexidade excessiva não é mais sustentável. É preciso adotar abordagens que permitam construir sistemas robustos, mas que sejam também flexíveis e fáceis de manter.

A programação funcional oferece ferramentas para lidar com a concorrência de forma mais natural, reduzindo os riscos associados a estados mutáveis e efeitos colaterais. Além disso, promove um código mais limpo e legível, facilitando a colaboração e a manutenção a longo prazo.

Essa transição não significa abandonar completamente os conceitos da POO. Muitas linguagens modernas permitem a combinação de paradigmas, aproveitando o melhor de cada abordagem. No entanto, é fundamental reconhecer as limitações da POO tradicional e estar aberto a novas formas de pensar e resolver problemas.

A Jornada não Acaba aqui

A história nos ensina que a evolução é essencial para o progresso. Na arte, na arquitetura e na tecnologia, precisamos constantemente revisar e adaptar nossas abordagens para atender às novas realidades. A metáfora entre o Barroco e o Rococó ilustra a necessidade de abandonar a complexidade excessiva em favor da elegância e da simplicidade.

Na programação, essa mudança é ainda mais premente. Com os desafios atuais de desempenho, escalabilidade e manutenção, é vital adotarmos paradigmas que nos permitam construir softwares de alta qualidade de forma eficiente. A programação funcional oferece essa oportunidade, convidando-nos a repensar nossas práticas e abraçar uma nova forma de criar.

Portanto, "sair do Barroco e vir para o Rococó" é mais do que uma simples metáfora. É um chamado à ação para que todos os desenvolvedores repensem suas decisões técnicas e simplifiquem, a cada dia, os sistemas e processos sob seu controle.

Ao adotar a programação funcional, podemos não apenas resolver problemas de forma mais eficaz, mas também contribuir para a construção de um futuro onde o software seja cada vez mais confiável, eficiente e elegante.

Permalink

From Micro to Macro: Scaling Front-Ends

Are you facing challenges in scaling your front-end to meet a growing number of users? With the increasing complexity of modern web applications, strategies like micro front-ends, monorepositories, global state management, and cache optimization are essential. 

In this article, we explore best practices for scaling front-end applications, discussing how to implement micro front-ends, manage versions in monorepositories, apply effective caching strategies, and efficiently maintain global states. 

Discover how Nubank is overcoming scalability challenges in the front-end and how you can apply these approaches to build agile, responsive, and easy-to-maintain user interfaces.

The Challenge of Scale

Companies like Nubank face unique challenges. With over 100 million customers in Brazil alone, handling large-scale distributed systems is not just a necessity but an obligation. Managing transactions like PIX, ensuring service stability, and providing a consistent user experience require innovative solutions.

Moreover, working with advanced technologies like Clojure and Datomic—whose development is influenced by engineers within Nubank itself—adds additional layers of complexity and opportunity. These technologies are not just tools; they are integral parts of our scalability strategy and continuous innovation.

Micro Front-Ends: Dividing to Conquer

The micro front-end architecture has emerged as a solution to many challenges faced by large development teams. But what exactly are micro front-ends?

What Micro Front-Ends Are (and What They Aren’t)

Micro front-ends are an extension of the microservices concept to the front-end. They allow different teams to develop, deploy, and maintain distinct parts of the user interface independently.

This means that each team can work at its own pace, choose its own technologies (to a certain extent), and deploy updates without impacting the system as a whole.

It’s important to highlight that micro front-ends are not:

  • NPM packages or monolithic modules: Simply splitting a monolith into packages doesn’t offer the benefits of independent deployment or team isolation.
  • Separate applications for the end-user: The user experience should be unified. We’re not talking about multiple distinct applications but a single application composed of several autonomous parts.

Benefits of Micro Front-Ends

  • Independent deployments: Teams can deploy updates without coordinating with the entire organization.
  • Team scalability: New developers can be onboarded more quickly, focusing on a specific part of the system.
  • Fault isolation: Issues in one micro front-end don’t necessarily affect the entire application.
  • Technological flexibility: The possibility of using different frameworks or libraries, although this should be done cautiously to avoid client-side overload.

Costs and Considerations

  • Complex initial setup: Implementing micro front-ends requires careful planning and a robust initial configuration.
  • Cohesion of user experience: Ensuring that the interface is consistent and cohesive is a challenge when multiple teams are involved.
  • Performance overhead: Using different frameworks can increase bundle size and affect performance.
  • Observability and debugging: Monitoring and debugging an application composed of multiple micro front-ends requires advanced tools and practices.

Implementation Strategies

There are several approaches to implementing micro front-ends:

Client-Side Composition

This is the most common approach, where the integration of micro front-ends occurs in the user’s browser. Technologies like Web Components, Module Federation (Webpack 5), and frameworks like Single SPA facilitate this composition.

Server-Side or CDN Composition

The assembly of micro front-ends occurs before reaching the client, either on the server or CDN. Tools and techniques like Edge Side Includes (ESI) can be utilized.

Communication Between Micro Front-Ends

Efficient communication is essential. It’s recommended to use:

  • Custom events: Allow micro front-ends to communicate without directly depending on each other (see the sketch after this list).
  • Shared states via browser APIs: Such as Local Storage or IndexedDB.
  • Avoid excessive global dependencies: Minimizes coupling between components.
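As a minimal ClojureScript sketch of the custom-events approach (the event name and payload here are hypothetical):

;; Publishing micro front-end: broadcast a DOM CustomEvent on js/window.
(defn notify-cart-updated! [item-count]
  (.dispatchEvent js/window
    (js/CustomEvent. "cart:updated"
                     #js {:detail #js {:count item-count}})))

;; Subscribing micro front-end: react without importing the publisher.
(.addEventListener js/window "cart:updated"
  (fn [e]
    (js/console.log "Cart now has" (.. e -detail -count) "items")))

Neither side holds a code-level reference to the other; the browser's event system is the only contract between them.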

Version Control in Monorepositories

Managing versions in a monorepository can be challenging, especially when multiple teams are working on different parts of the system. Here are some practices to handle this:

Individual Package Versioning

Tools like Lerna or Nx allow you to manage individual package versions within a monorepository. This enables each team to control the versions of their own components or modules, maintaining independence and facilitating coordination.

Avoiding Git Submodules

While Git submodules might seem like a solution, they often introduce additional complexity. Instead, using NPM or Yarn workspaces can simplify the management of internal dependencies.

Benefits of the Monorepository

  • Code consistency: Facilitates code standardization and reuse.
  • Visibility: All teams have access to the complete source code, promoting collaboration.
  • Automation: Simplifies the setup of CI/CD pipelines that cover the entire system.

Caching Strategies for Bundle Loading

Efficiency in loading resources is crucial for application performance. Well-implemented caching strategies can significantly improve the user experience.

Caching Shared Resources

By using technologies like Module Federation, it’s possible to share common dependencies among different micro front-ends, avoiding redundant downloads. To achieve this:

  • Define shared modules: Configure which libraries or frameworks should be shared to prevent multiple versions on the client.
  • Compatible versions: Ensure that shared dependencies are compatible with each other to avoid conflicts.

CDN-Level Caching

Utilizing a Content Delivery Network (CDN) allows static resources to be delivered more quickly to users by leveraging distributed caching.

  • Cache-Control configurations: Adjust HTTP headers to control how and for how long resources should be cached (sketched after this list).
  • Cache invalidation: Have strategies to invalidate or update the cache when new versions of resources are deployed.
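As a rough sketch of these two points, assuming a Clojure/Ring asset server (the route patterns and max-age values are illustrative, not any particular production setup): content-hashed bundles can be cached aggressively, while the HTML shell stays revalidated, so invalidation happens simply by deploying new bundle names.

(ns assets.caching
  (:require [clojure.string :as str]))

(defn wrap-asset-caching [handler]
  (fn [request]
    (let [response (handler request)
          uri      (:uri request)]
      (cond
        ;; Content-hashed bundle names never change: cache for a year.
        (re-find #"\.[0-9a-f]{8}\.(?:js|css)$" uri)
        (assoc-in response [:headers "Cache-Control"]
                  "public, max-age=31536000, immutable")

        ;; The HTML shell must revalidate so it picks up new bundle names.
        (str/ends-with? uri ".html")
        (assoc-in response [:headers "Cache-Control"] "no-cache")

        :else response))))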

Browser Caching

  • Service Workers: Implement caching via Service Workers for more granular control over which resources are stored and when they are updated.
  • Preloading and Prefetching: Anticipate which resources will be needed and load them in advance.
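For instance, preload and prefetch hints can be emitted from the host page's HTML shell. A Hiccup-style sketch (the bundle names are made up):

[:head
 ;; The shell bundle is needed for first render: fetch it immediately.
 [:link {:rel "preload" :href "/assets/shell.abc123.js" :as "script"}]
 ;; The checkout micro front-end is a likely next step: fetch it at idle priority.
 [:link {:rel "prefetch" :href "/assets/checkout.def456.js" :as "script"}]]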

Managing Global States in Host Applications

Maintaining a consistent global state in an application composed of multiple micro front-ends is a challenge.

Recommended Strategies

  • Custom events: Use the browser’s event system for communication between micro front-ends without creating rigid dependencies.
  • Shared local storage: APIs like Local Storage or IndexedDB can serve as a means to share global state.
  • Global contexts: In frameworks like React, you can use Context API, but be careful not to introduce unwanted coupling.

Best Practices

  • Domain isolation: Each micro front-end should be responsible for its own local state and interact with the global state only when necessary.
  • Well-defined contracts: Establish clear interfaces for communication between components, facilitating maintenance and evolution.

Standardization and Platform Teams

While micro front-ends address technical scalability, code standardization and the existence of platform teams are crucial for the human scalability of development teams.

The Role of Platform Teams

  • Defining good standards: Create and maintain code standards that make teams’ work easier.
  • Tools and infrastructure: Develop tools that automate repetitive tasks and ensure code quality.
  • Facilitating collaboration: Ensure different teams can work together efficiently.

Importance of Standardization

  • Faster onboarding: New developers adapt more quickly to standardized code.
  • Consistent quality: Reduces the incidence of bugs and maintenance issues.
  • Easier code review: Code reviews are more effective when there’s a consistent style.

Avoiding Unnecessary Complexity

  • Simplicity as standard: Opt for simple solutions that solve problems without adding excessive complexity.
  • Value-based decisions: Implement technologies and standards that bring clear benefits to the business and the team.
  • Beware of technological “hype”: Not every new library or framework is suitable for your application’s context.

Organizational Laws Applied to Code

Conway’s Law states that a system’s structure mirrors the communication structure of the organization that develops it. Therefore, aligning technical architecture with team organization is not just beneficial but essential.

  • Aligned structures: Autonomous teams responsible for specific micro front-ends reflect a modular architecture.
  • Efficient communication: Fewer dependencies between teams reduce the need for constant communication and complex alignments.
  • Continuous evolution: A flexible organization allows the architecture to evolve with business needs.

How to Start

  • Pilot project: Implement a micro front-end in a non-critical part of the system to understand the challenges and benefits.
  • Define standards: Establish clear conventions from the outset for routes, communication, and styles.
  • Invest in observability: Monitoring tools are essential to quickly identify issues.
  • Documentation and communication: Keep documentation up to date and promote communication between teams to share learnings.

Conclusion

Scaling front-ends effectively requires a combination of technical and organizational solutions. Micro front-end architectures offer a path to handle technical complexity, while standardization and platform teams address the human challenges of large-scale collaboration.

At Nubank, we understand that continuous innovation and adaptability are essential to provide the best experience to our customers. Whether adopting advanced technologies or restructuring our teams, we are committed to evolving and facing the scalability challenges of the modern world.

These and other insights were shared at Nubank’s August Engineering Meetup. Watch the recording below.

The post From Micro to Macro: Scaling Front-Ends appeared first on Building Nubank.

Permalink

Writing the Worst Datalog Ever in 26loc

Today, for a change from heavy interop and frameworks, let's do a light coding exercise and implement the most amateur Datalog engine, taking any shortcut we see fit!

Don't forget: when we're not busy writing silly Datalog implementations, we're available to help you with Clojure, or working to get our app Paktol (the positive spending tracker, where money goes up!) off the ground.

Data Representations

The first question is how to represent facts, rules and databases.

A fact will be represented by a vector of simple values (but no symbols; we reserve symbols for variables).

[:father :bart :homer]

A rule will be represented by a list of patterns, the first pattern being the head of the rule (its "conclusion") and the rest being its body. A pattern is a vector of simple values (this time, including symbols for variables).

;; parent(c, p) :- father(c, p)
([:parent c p] [:father c p])
;; parent(c, p) :- mother(c, p)
([:parent c p] [:mother c p])

The database in Datalog is conceptually split in two: the EDB (the extensional database, every recorded fact — tables in SQL) and the IDB (the intensional database, every deduced fact — views in SQL). Let's lump them together as a single set of facts! What could possibly go wrong? 🤷‍♂️

Matching Patterns Against Facts

The result of matching a pattern against a fact is going to be nil (no match) or a map of variables to their values (let's call these bindings an environment).

However, further matches in a rule will have to obey environments created by earlier matches. That's why we also pass an upstream environment to match: (defn match "Returns updated env or nil." [pattern fact env] ...)

(defn match [pattern fact env]
  (when (= (count pattern) (count fact))
    (reduce (fn [env [p v]]
              ;; If p is a var already bound in env, substitute its value.
              (let [p-or-v (env p p)]
                (cond
                  (= p '_) env                     ; wildcard matches anything
                  (= p-or-v v) env                 ; literal (or bound var) agrees with the fact
                  (symbol? p-or-v) (assoc env p v) ; unbound var: bind it
                  :else (reduced nil))))           ; mismatch: the whole match fails
      env (map vector pattern fact))))
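Trying it at the REPL (sample values just for illustration):

(match '[:father c p] [:father :bart :homer] {})
;=> {c :bart, p :homer}

(match '[:father c c] [:father :bart :homer] {})
;=> nil ; c can't be bound to both :bart and :homer

(match '[:mother c p] [:mother :bart :marge] '{c :bart})
;=> {c :bart, p :marge}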

You may have noticed that, unlike in textbook Datalog, the predicate name isn't special: it's treated like any other slot of the fact vectors. This will prove useful when implementing q at the end.

Matching rules and producing new facts

Now we need to match a chain of patterns against all known facts:

(defn match-patterns [patterns facts]
  (reduce
    (fn [envs pattern]
      (-> (for [fact facts env envs] (match pattern fact env))
        set (disj nil)))
    #{{}} patterns))

From this we can infer new facts: each env is turned into a fact by replacing the vars appearing in the head with their values in that environment.

(defn match-rule [facts [head & patterns]]
  (for [env (match-patterns patterns facts)]
    (into [] (map #(env % %)) head)))
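A quick REPL check with a tiny fact set:

(match-rule #{[:father :bart :homer]
              [:mother :bart :marge]}
            '([:parent c p] [:father c p]))
;=> ([:parent :bart :homer])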

Running all rules until saturation

We repeatedly apply all rules until no new facts can be derived. This saturation is detected by comparing sizes rather than testing equality: since the fact set only ever grows, an unchanged count means an unchanged set, and counting is cheaper.

(defn saturate [facts rules]
  (let [facts' (into facts (mapcat #(match-rule facts %)) rules)]
    (if (< (count facts) (count facts'))
      (recur facts' rules)
      facts)))

Query

A query is just an anonymous rule that we run after the others so that we get only its results.

(defn q [facts query rules]
  (-> facts (saturate rules) (match-rule query) set))

Let's try it!

The Simpsons family:

(def edb
  #{[:father :bart :homer]
    [:mother :bart :marge]
    [:father :lisa :homer]
    [:mother :lisa :marge]
    [:father :maggie :homer]
    [:mother :maggie :marge]
    [:father :homer :abe]
    [:mother :homer :mona]})

Some sample rules:

(def rules
  '[([:parent c p] [:father c p])
    ([:parent c p] [:mother c p])
    ([:grand-parent c gp] [:parent p gp] [:parent c p])
    ([:ancestor c p] [:parent c p])
    ([:ancestor c ancp] [:ancestor c anc] [:parent anc ancp])])

And now the queries:

user=> (q edb
         '([gp] [:grand-parent :bart gp])
         rules)
#{[:mona] [:abe]}
user=> (q edb
         '([anc] [:ancestor :bart anc])
         rules)
#{[:homer] [:marge] [:mona] [:abe]}

🎉 Here we are: datalog in 26 lines of code!

Conclusion

Datalog is minimal—almost to a fault. It’s like a game of Jenga: stack too many extensions on top, and the whole thing could come crashing down. That’s why there’s a rich literature on adding just enough to make it useful without tipping over.

So here we are with a seed implementation of a seed language. We have a starting point and can grow it in any direction we're interested in.

Share online if you'd like to see more open-ended exploratory code (including growing this Datalog)!

Appendix: 26 glorious locs at once

(defn match [pattern fact env]
   (when (= (count pattern) (count fact))
     (reduce (fn [env [p v]]
               (let [p-or-v (env p p)]
                 (cond
                   (= p '_) env
                   (= p-or-v v) env
                   (symbol? p-or-v) (assoc env p v)
                   :else (reduced nil))))
       env (map vector pattern fact))))

(defn match-patterns [patterns facts]
  (reduce
    (fn [envs pattern]
      (-> (for [fact facts env envs] (match pattern fact env))
        set (disj nil)))
    #{{}} patterns))

(defn match-rule [facts [head & patterns]]
  (for [env (match-patterns patterns facts)]
    (into [] (map #(env % %)) head)))

(defn saturate [facts rules]
  (let [facts' (into facts (mapcat #(match-rule facts %)) rules)]
    (if (< (count facts) (count facts'))
      (recur facts' rules)
      facts)))

(defn q [facts query rules]
  (-> facts (saturate rules) (match-rule query) set))

Permalink

Clojure Deref (Oct 11, 2024)

Welcome to the Clojure Deref! This is a weekly link/news roundup for the Clojure ecosystem (feed: RSS). Thanks to Anton Fonarev for link aggregation.

Libraries and Tools

New releases and tools this week:

Permalink

Rama on Clojure’s terms, and the magic of continuation-passing style

Rama is a platform with huge applicability, able to express all the computation and storage for a backend at any scale. Just like the UNIX philosophy of composing simple programs to do more complex tasks, Rama is based on simple building blocks that compose for any backend use case.

At the heart of Rama is its dataflow language, a Clojure library that’s also a full-fledged language. Rama’s dataflow language is based on continuation-passing style (CPS). Rama provides a clean and elegant way to express entire programs in CPS while producing bytecode that’s just as efficient as Clojure. In this post I’ll explore how Rama works in comparison to equivalent Clojure code written in a CPS style. You’ll see how CPS through Rama greatly generalizes the basic concept of a function, how that enables new ways of writing code in general, and how that is particularly liberating for writing parallel and asynchronous code.

You can follow along with the code in this post by cloning rama-demo-gallery and opening a REPL with lein repl. Run the following to set up your REPL:

(use 'com.rpl.rama)
(require '[com.rpl.rama.ops :as ops])

Basic example

Let’s start by defining the equivalent of Clojure’s identity function in Rama:

(deframaop identity-rama [*v]
  (:> *v))

Here we define a “Rama operation” called identity-rama that accepts one argument named *v. Variables in Rama code are symbols beginning with *. A “Rama operation” can do everything a regular Clojure function can – conditionals, loops, define anonymous operations with lexical closures, declare locals, etc. – plus it can do much more.

In this case, the body of identity-rama “emits” the value of *v to its caller using :> . This is equivalent to the following Clojure code:

(defn identity-rama-clj [v cont]
  (cont v))

In Rama code, the continuation is implicit and is invoked by calling :> like a function. A Rama operation does not return a value to its caller. It emits values to its continuation. This is a critical distinction, as part of what makes Rama operations more general than functions is how they can emit multiple times, not emit at all, or emit asynchronously.

Note that Rama does not compile to Clojure code like this. It compiles straight to bytecode, which is necessary to achieve high performance.

Now suppose you want to call identity-rama with the value “Hello world!” and print the result. In Rama you would write this like so:

(?<-
  (identity-rama "Hello world!" :> *str)
  (println "Emitted:" *str))

?<- is called the “execution operator” and just dynamically executes some Rama code. It’s not used in production and is just for playing at the REPL like this. Here the string “Hello world!” is passed as input to our identity-rama operation. The :> *str part binds the output of the operation to the variable *str. The :> keyword distinguishes the input from the output and is called the “default output stream” (you’ll see soon how you can have more than one output stream). The variable *str is then passed to println.

Here’s the equivalent Clojure code in CPS:

(identity-rama-clj
  "Hello world!"
  (fn [str]
    (println "Emitted:" str)))

So far, you can see from this example how Rama makes things more concise by eliminating nested callback functions from the code. Here’s a slightly more complicated example to show how unreadable CPS gets when done manually:

(?<-
  (+ 1 2 :> *a)
  (* *a 10 :> *b)
  (println *a *b))

In Clojure with CPS versions of + and *, this looks like:

(defn add [v1 v2 cont]
  (cont (+ v1 v2)))

(defn multiply [v1 v2 cont]
  (cont (* v1 v2)))

(add 1 2
  (fn [a]
    (multiply a 10
      (fn [b]
        (println a b)))))

In addition to how unreadable all the nesting makes this, it’s also extremely inefficient. Every single “emit” uses up another stack frame, so it would seem that entire programs compiled this way would quickly overflow the stack. If this were all Rama was doing, that would indeed be the case. You’ll see later some important optimizations Rama makes so that dataflow code is just as efficient as idiomatic code in any other language. The continuation being implicit in Rama rather than explicit like in the Clojure CPS examples gives Rama critical flexibility to make those optimizations.

Emitting zero or multiple times

As mentioned, you don’t have to call the continuation exactly one time. You can call it multiple times, or you can call it zero times. You can also call it asynchronously, on a different thread, or even on a different machine. This is where the expressive power of dataflow starts to show itself.

Here’s an example of a deframaop that emits multiple times along with some code that uses it:

(deframaop emit-many-times []
  (:> 1)
  (:> 3)
  (:> 2)
  (:> 5))

(?<-
  (emit-many-times :> *v)
  (println "Emitted:" *v))

This is equivalent to this CPS Clojure function:

(defn emit-many-times-clj [cont]
  (cont 1)
  (cont 3)
  (cont 2)
  (cont 5))

(emit-many-times-clj
  (fn [v]
    (println "Emitted:" v)))

Let’s now take a look at another deframaop that does filtering:

(deframaop my-filter> [*v]
  (<<if *v
    (:>)))

We’ll look more at conditionals later on in this post. my-filter> is the same as the built-in operation filter> and is equivalent to:

(defn my-filter>-clj [v cont]
  (if v
    (cont)))

As you can see here, my-filter> emits zero values to its continuation. Emits can be done with any number of values, and in the next section you’ll see examples of emitting multiple values.

You could combine my-filter> with emit-many-times to write code like this:

(?<-
  (emit-many-times :> *v)
  (my-filter> (odd? *v))
  (println *v))

You can nest expressions in Rama code just like you can in Clojure code, and the above is the same as writing:

(?<-
  (emit-many-times :> *v)
  (odd? *v :> *is-odd?)
  (my-filter> *is-odd?)
  (println *v))

This is equivalent to:

(defn odd?-cont [v cont]
  (cont (odd? v)))

(emit-many-times-clj
  (fn [v]
    (odd?-cont v
      (fn [is-odd?]
        (my-filter>-clj is-odd?
          (fn []
            (println v)))))))

Running any of these prints:

1
3
5

This code is kind of like doing a filter on a sequence followed by a doseq, except no sequences are materialized. It also reads kind of like a WHERE clause in SQL, in that the filter is expressed solely on the value in question with an arbitrary predicate, and computation only continues with values that match.
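For comparison, here is a sketch of that materialized-sequence version in plain Clojure, producing the same output while building an intermediate lazy sequence along the way:

(doseq [v (filter odd? [1 3 2 5])]
  (println v))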

Operations can also emit a dynamic number of times. For example, Rama has an operation in its standard library called explode that’s equivalent to this CPS Clojure function:

(defn explode-clj [aseq cont]
  (doseq [e aseq]
    (cont e)))

You could use explode to print every element of a sequence like this:

(?<-
  (ops/explode [1 2 3 4] :> *v)
  (println "Val:" *v))

This is the same as:

(explode-clj [1 2 3 4]
  (fn [e] (println "Val:" e)))

Emitting multiple values in one emit

Besides being able to emit multiple times, Rama operations can emit multiple values per emit. Here’s an example:

(deframaop emit-many [*v]
  (:> (inc *v) (dec *v))
  (:> (* *v 2) (/ *v 2)))

(?<-
  (emit-many 9 :> *v1 *v2)
  (println "Result:" *v1 *v2))

This is equivalent to the Clojure CPS code (for brevity, without doing inc, dec, *, or / in CPS):

(defn emit-many-clj [v cont]
  (cont (inc v) (dec v))
  (cont (* v 2) (/ v 2)))

(emit-many-clj 9
  (fn [v1 v2]
    (println "Result:" v1 v2)))

Something important here is that the caller needs to know how many fields will be given to the continuation. In Rama that’s specified by the number of variables bound to the :> output stream, and in the Clojure version that’s specified by the arity of the passed continuation function. In both cases, you’ll get a runtime error if you bind the incorrect number of continuation outputs. Whereas with a Clojure function you only have to know what arities are valid for inputs, with Rama operations you also must know the arity of the output.

Anonymous operations

Just like how you can declare anonymous functions in Clojure and pass them around as values, you can do the same in Rama with Rama operations. Like anonymous Clojure functions, anonymous Rama operations capture their lexical scope. Here’s a basic example of this:

(deframaop adder [*v1]
  (<<ramaop %ret [*v2]
    (:> (+ *v1 *v2)))
  (:> %ret))

(?<-
  (adder 10 :> %f)
  (%f 3 :> *val)
  (println "Result:" *val))

<<ramaop defines an anonymous Rama operation with the given name, arguments, and body. Vars for anonymous operations are prefixed with %. There’s no difference in functionality between an anonymous Rama op and a top-level one. The above code is equivalent to this Clojure CPS code:

(defn adder-clj [v1 cont]
  (let [ret (fn [v2 cont-inner]
              (cont-inner (+ v1 v2)))]
    (cont ret)))

(adder-clj 10
  (fn [f]
    (f 3
      (fn [val]
        (println "Result:" val)))))

As alluded to before, Rama operations can be passed around as values. Here’s an example of passing around top-level Rama operations, anonymous Rama operations, and regular Clojure functions as values in Rama:

(deframaop times2 [*v]
  (:> (* *v 2)))

(deframaop foo [%f1 %f2 %f3]
  (%f3 (%f2 (%f1 2)) :> *res)
  (:> *res))

(?<-
  (adder 20 :> %f1)
  (foo %f1 inc times2 :> *res)
  (println "Result:" *res))

This prints “Result: 46”.

Emitting asynchronously

Rama operations being able to emit asynchronously is what makes Rama so good for writing parallel and asynchronous code. To demonstrate this, I’ll briefly introduce you to Rama’s cluster programming environment which implements the underlying infrastructure powering the parallel programming primitives you’re about to see. A “Rama module” is what you deploy to a Rama cluster, and it uses dataflow to define all the data ingestion, processing, and indexing for a backend. A module is launched with a configurable number of partitions called “tasks”, and these tasks run across the cluster in processes launched for the module. Dataflow code runs across all tasks in parallel and defines how to react to incoming data.

For a basic example of distributed programming with dataflow, here’s code doing a bank transfer from *from-user-id to *to-user-id in the amount of *amt dollars. This code is a stripped down version of our open-source atomic bank transfer example, and the code for reading/writing that information to durable storage is mocked out to focus on the distributed programming aspects. $$funds here refers to a durable index, similar to a database.

(|hash *from-user-id)
(user-current-funds $$funds *from-user-id :> *funds)
(filter> (>= *funds *amt))
(deduct-funds-from-user! $$funds *from-user-id *amt)
(|hash *to-user-id)
(add-funds-to-user! $$funds *to-user-id *amt)

|hash is called a “partitioner”, and it relocates computation to a different thread/node. The only difference from the other Rama operations you’ve seen is that it emits to its continuation asynchronously and potentially on a different thread/node. |hash computes the target task by modding the hash of its argument by the total number of tasks in the module. Hashing ensures the same argument always goes to the same task, while different arguments get evenly distributed across all tasks.

Computation and storage are colocated in Rama. By using partitioners to control where code is executing, you’re able to control to which partitions of durable storage you read and write. This lets you control in a fine-grained way how data is partitioned across durable storage.

What makes partitioners powerful is they’re just like any other Rama operation, and that uniformity enables composition. You can use partitioners just like any other code, such as within conditionals, loops, or helper operations. Code is read linearly without any callback functions even though you’re jumping around the cluster with impunity.

Rama’s implementation of partitioners is similar to this Clojure CPS version:

(defn |hash-clj [k cont]
  (let [task-id (mod (hash k) (num-tasks-in-module))]
    (send-to-task! task-id cont)))

Internally, every task of a module has a queue that runs events in the order in which they arrive. The event sent in this case is the continuation, which when called continues computation where it left off. This is no different than if the emit was done synchronously.

The Rama operation definition is similar, though since we haven’t exposed manipulating continuations in Rama’s public API, the following code is only representative of what the definition looks like internally:

(deframaop |hash-pseudo [*k]
  (mod (hash *k) (num-tasks-in-module) :> *task-id)
  (<<continuation %cont []
    (:>))
  (send-to-task! *task-id %cont))

<<continuation defines an anonymous Rama operation just like any other, with the difference being that it emits to the caller of its parent rather than its own caller. This is just like the Clojure CPS version: when cont is eventually invoked on the other thread/node, it invokes the code following the call of |hash-clj.

Rama takes care of efficiently serializing the continuation, including any information in its closure. The Rama compiler analyzes what vars are used after every invoke of an operation, and it uses that information to include in the closure only vars that are referenced in downstream code. This minimizes the amount of information sent across the wire. This compiler analysis isn’t specific to partitioners, as it’s used for closure construction for all anonymous operations.

Partitioners don’t have to emit just one time at one location. Sometimes, for example, you want to run code like this:

(|all)
(fetch-information-from-storage $$p :> *v)
(|global)
(agregate-information *v :> *result)

Code like this is typical for queries that fetch and aggregate information stored across all partitions. |all partitions to all tasks in parallel, and |global always goes to the same task. |all is defined approximately like this:

(deframaop |all-pseudo [*k]
  (<<continuation %cont []
    (:>))
  (ops/explode (range 0 (num-tasks-in-module)) :> *task-id)
  (send-to-task! *task-id %cont))

And |global is defined approximately like this:

(deframaop |global-pseudo [*k]
  (<<continuation %cont []
    (:>))
  (send-to-task! 0 %cont))

CPS and the ability to emit asynchronously unifies general purpose programming with distributed programming, by enabling parallel code to be expressed no differently than any other logic. Partitioners enable Rama code to precisely control not just what is executing, but where.

Emitting to multiple output streams

Emitting zero or multiple times, emitting multiple values, and emitting asynchronously are three ways Rama operations are more general than functions. Another is that Rama operations can emit to output streams besides :> .

Here’s an example Rama operation doing this:

(deframaop emit-multiple-streams []
  (:a> 1)
  (:> 2)
  (:> 3)
  (:a> 4)
  (:b> 5 6))

This emits to three streams: :>, :a>, and :b>. Let’s take a look at an equivalent Clojure function in CPS. Instead of passing in one continuation function, we’ll now pass in a map from output stream to continuation function:

(defn emit-multiple-streams-clj [cont-map]
  ((:a> cont-map) 1)
  ((:> cont-map) 2)
  ((:> cont-map) 3)
  ((:a> cont-map) 4)
  ((:b> cont-map) 5 6))

This isn’t totally accurate, as Rama does not require a caller to provide a continuation for each output stream. If there’s no continuation, then emitting to that output stream is a no-op. So the Clojure code that matches what Rama does is this:

(defn emit-multiple-streams-clj [cont-map]
  (if-let [cont (:a> cont-map)]
    (cont 1))
  (if-let [cont (:> cont-map)]
    (cont 2))
  (if-let [cont (:> cont-map)]
    (cont 3))
  (if-let [cont (:a> cont-map)]
    (cont 4))
  (if-let [cont (:b> cont-map)]
    (cont 5 6)))

Invoking a Rama operation that emits multiple output streams is a little bit different, as each output stream is a different code path. Here’s an example:

(?<-
  (emit-multiple-streams
    :a> <emitted-a> *v
    :b> <b> *v1 *v2
    :> *v)
  (println "Default:" *v)
  (<<branch <emitted-a>
    (println "A:" *v))
  (<<branch <b>
    (println "B:" *v1 *v2)))

Rama code produces an “abstract syntax graph” (ASG), whereas Clojure (and most other languages) produce an “abstract syntax tree” (AST). <emitted-a> and <b> are called “anchors” and label part of the ASG. Those anchors are used by <<branch to specify where that code should attach. You can visualize this code like so:

Rama has other ways to specify how code should be attached. In this particular case, since each stream has only one line of code attached to it, the above code can be written more concisely as:

(?<-
  (emit-multiple-streams
    :a> *v :>> (println "A:" *v)
    :b> *v1 *v2 :>> (println "B:" *v1 *v2)
    :> *v)
  (println "Default:" *v))

:>> is called an inline hook and automatically handles setting up anchors and branching.

The above can be written with Clojure CPS like so:

(emit-multiple-streams-clj
  {:a> (fn [v] (println "A:" v))
   :b> (fn [v1 v2] (println "B:" v1 v2))
   :> (fn [v] (println "Default:" v))
  })

All of these print:

A: 1
Default: 2
Default: 3
A: 4
B: 5 6

As mentioned, you can also call emit-multiple-streams without providing continuations for every output stream. For example:

(?<-
  (emit-multiple-streams
    :b> *v1 *v2 :>> (println "B:" *v1 *v2)
    :> *v)
  (println "Default:" *v))

This is equivalent to:

(emit-multiple-streams-clj
  {:b> (fn [v1 v2] (println "B:" v1 v2))
   :> (fn [v] (println "Default:" v))
  })

Both of these print:

Default: 2
Default: 3
B: 5 6

Each continuation and the code attached to it is fully executed before the subsequent line of emit-multiple-streams is run. The behavior of Rama is exactly the same as the Clojure CPS version in this respect.

Rama provides an operation called if>, which is the basic primitive for specifying conditional behavior. if> takes in a value and emits to :then> or :else> depending on the truthiness of that value. The operation <<if mentioned before is a Rama macro (called a “segmacro”) implemented using if>. Here’s an example of using if>:

(?<-
  (if> (= 1 2)
    :then> :>> (println "True")
    :else> :>> (println "False")))

This prints “False”. Unlike if in Clojure (as well as the equivalent in pretty much every other programming language), if> is not a special form in Rama. It doesn’t have to be, since it’s no different than any other Rama operation. So it can be passed around just like any Rama operation, like so:

(deframaop exec-if-like-op [%f *v]
  (%f *v
    :then> :>> (:> "True branch")
    :else>)
  (:> "False branch"))

(?<-
  (exec-if-like-op if> true :> *res)
  (println "Result:" *res))

This prints “Result: True branch”.

So far, I’ve never found a reason to pass if> around dynamically like this. What this demonstrates is how Rama’s richer language primitives provide greater uniformity and fewer special cases.

Also important to note is that if> produces exactly the same bytecode as Clojure’s if when invoked directly (not as an anonymous operation). Rama accomplishes this with an “intrinsic” implementation for if> in its compiler. This doesn’t change anything about semantics and is purely an optimization.

Unification

Rama’s “unification” facility enables separate branches of computation to be merged together. It’s another way to share code that fits naturally into dataflow. Here’s an example:

(deframaop foo [*v]
  (:a> (inc *v))
  (:b> (dec *v)))

(?<-
  (foo 10
    :a> <a> *v
    :b> <b> *v)
  (unify> <a> <b>)
  (println "Emit:" *v))

You can visualize this code like so:

unify> is a compile-time directive on how to construct the abstract syntax graph. When either :a> or :b> emits, execution continues on the shared code after the unify>. unify> can merge any number of branches, not just two.

Since the code after a unify> is shared among all its parent branches, there are rules regarding what vars are in scope after a unify> . Only vars defined in all parent branches are in scope. So if you try to reference a var after a unify> that doesn’t exist in all of the parent branches, you’ll get a compile-time error.

The above code can be written in Clojure CPS like so:

(defn foo [v cont-map]
  (if-let [cont (:a> cont-map)]
    (cont (inc v)))
  (if-let [cont (:b> cont-map)]
    (cont (dec v))))

(let [shared (fn [v] (println "Emit:" v))]
  (foo 10
    {:a> shared
     :b> shared}))

As you can see, the only way to share code across the two branches is to factor out a helper function defined before the code that executes first. The dataflow version reads much nicer since the code is ordered the same way it executes.

Loops

Dataflow loops are similar to Clojure loops, but like Rama operations they can emit any number of times. Here’s an example of a dataflow loop:

(?<-
  (loop<- [*a 10 :> *v]
    (<<if (>= *a 0)
      (:> *a)
      (continue> (dec *a))))
  (println "Emitted:" *v))

Like a Clojure loop, a dataflow loop has bindings along with initial values. Here the variable *a is initialized to 10. The bindings vector also binds emits from this loop to output variables which will be in scope after the loop – *v in this case. A loop is recurred with continue>, and emits are done with :> just like emitting from a Rama operation.

This is equivalent to the following Clojure code:

(let [cont (fn [v] (println "Emitted:" v))]
  (loop [a 10]
    (when (>= a 0)
      (cont a)
      (recur (dec a)))))

Loops compose with everything else in dataflow, including partitioners. This makes it trivial to do distributed loops that hop around the cluster. Such loops are common with graph algorithms where you may be traversing from node to node fetching connections from each partition. For example:

(loop<- [*node-id *start-node-id :> *ancestor-node]
  (|hash *node-id)
  (fetch-node-parent $$parents *node-id :> *parent-id)
  (<<if *parent-id
    (:> *parent-id)
    (continue> *parent-id)
    ))

Like before, the details of storage are mocked out to focus on the computation aspects. Here $$parents represents a datastore mapping each node to its parent. This code fetches all ancestors of a node ID by traversing the graph across the cluster in a loop.

Optimizations in dataflow compiler

As mentioned earlier, passing a continuation on every invocation is inefficient and would likely cause a stack overflow if done for every invocation in a program. Rama has a number of optimizations to make the bytecode it produces just as efficient as Clojure. I’ll focus on one particularly important optimization.

When a Rama operation emits exactly one time, synchronously, and as the last thing it does, then it’s like a function. Rama provides two ways of defining operations: deframaop, as already shown, and deframafn. deframafn is just like deframaop except its implementation must synchronously emit exactly one time to the :> stream as the last thing it does.

For every invoke, Rama determines if it’s executing a deframaop or a deframafn. If it’s a deframafn, then it invokes it just like how functions are invoked in Clojure, by unrolling the stack frame with the return value. For example, consider this Rama code:

(deframafn double-value [*v]
  (:> (* 2 *v)))

(deframaop bar []
  (:> (ops/explode [1 2 3 4])))

(?<-
  (bar :> *v)
  (double-value *v :> *v2)
  (println *v2))

The Rama execution will be very similar to this Clojure CPS:

(defn bar-clj [cont]
  (explode-clj [1 2 3 4]
    (fn [v]
      (cont v))))

(bar-clj
  (fn [v]
    (let [v2 (double-value v)]
      (println v2))))

The calls to double-value and println do not pass a continuation and instead do optimized invokes as functions.

It’s also worth noting that since a deframafn works just like a Clojure function, it can be invoked directly from Clojure like any Clojure function.

deframafn only has restrictions on emits to the :> output stream – it can emit to all other streams any number of times and/or asynchronously. Internally, we refer to operations like that as “semi-functions”. Rama determines whether a deframafn is a semi-function or regular function by statically analyzing whether emits are done to any other output stream beside :>. If so, invokes of that operation will unroll the stack and pass a continuation for the other output streams.

Finally, the ramafn> annotation can be used to tell Rama that an anonymous operation is a ramafn.

The ramafn optimization is critical because the majority of code is still best written with function semantics. So most of a codebase will compile to stack-efficient invokes. Emitting multiple times, zero times, or asynchronously is powerful but less common.

The general term we use to refer to an object which is either a ramafn or ramaop is “fragment”. A ramafn is a fragment that has restrictions on the :> stream, while a ramaop has no restrictions.

Conclusion

Dataflow turns CPS into a full-fledged programming paradigm that’s elegant and efficient. This paradigm isn’t just for backend programming, like data processing, indexing, and querying. It’s a general purpose paradigm that we’ve used for building a huge amount of Rama itself. Emitting zero times, multiple times, asynchronously, or to multiple output streams are major generalizations of functions that open up huge new avenues to explore in the craft of programming. One of the joys of working on Rama has been the opportunity to explore and develop new techniques utilizing this new programming paradigm.

There’s a lot I didn’t cover in this post, like “segmacros” (macros that produce dataflow code) and “batch blocks” (a slightly declarative form of dataflow that has equivalent functionality as relational languages, like joins and aggregation). These additional capabilities are documented on this page.

Rama as a cluster platform adds durable storage into the mix, using dataflow to process distributed logs and produce indexes of any shape. It generalizes the ideas of event sourcing and materialized views into a unified system, providing strong fault-tolerance and ACID semantics. Dataflow is one of the keys to how it’s such a generally applicable platform, as it gives tight control over what, where, and how code executes.

Permalink

Last in Clojure

ChatGPT: No, the last function is not particularly expensive for vectors in Clojure. It runs in O(1) time because vectors in Clojure support efficient access to their last element.

Meanwhile, Clojure:
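In reality (a quick REPL sketch; exact timings will vary by machine), last walks a seq even over a vector, so it's O(n), while peek is the constant-time way to get a vector's last element:

(def v (vec (range 10000000)))

(time (last v)) ; linear: realizes a seq and walks all ten million elements
(time (peek v)) ; effectively constant: direct access to the vector's tail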

Permalink

Copyright © 2009, Planet Clojure. No rights reserved.
Planet Clojure is maintained by Baishampayan Ghose.
Clojure and the Clojure logo are Copyright © 2008-2009, Rich Hickey.
Theme by Brajeshwar.