Data Pipelines for Machine Learning: From Ingestion to Training (2026 Guide)

In 2026, successfully handling the machine learning data pipeline represents 80% of AI success – the model itself is just the final 20%. 

Artificial intelligence has reached a turning point. The debate is no longer about models, it’s about data. Are we feeding our systems the right inputs at the right time, and can we trust their source? The machine learning data pipeline is now the main act.

Where Things Go Wrong: Traditional Pipelines Drag AI Down 

Let’s be honest. Many companies are stuck with outdated pipelines. 

  • They put things together with manual scripts that break down whenever the data changes. 
  • Their ETL jobs? Too rigid – they cannot juggle videos, telemetry, and text at once.
  • And compliance? Gaps everywhere. Sensitive data gets exposed, putting companies at risk of penalties and negative publicity.

These problems do not just slow the business down. They gradually erode trust in AI. If a business's data is stale, messy, or noncompliant, its models will produce worthless results. In healthcare, finance, or autonomous vehicles, such a failure cannot be ignored.

How to Fix It: Automated Data Engineering for ML

The answer is clear: smarter, automated data engineering for ML. Modern pipelines don’t avoid complexity—they are engineered to embrace and manage it.

  • Event‑driven and real‑time: They are capable of real-time data ingestion from all kinds of sources.
  • Self‑healing: AI-powered “data quality bots” catch problems – like changing schemas or weird outliers – before the faulty data reaches the models.
  • Scalable: These pipelines scale up quickly, supporting feature engineering, so businesses can turn raw signals into features they actually use. 
  • Transparent: They embed automated data lineage to track every step, so teams can always explain how the data changed.
  • Model‑ready: They automate partitioning and validation, ensuring model data readiness so only the right data lands in training, with zero leakage.

With systems like these, teams move from fragile, hand-built workflows to reliable, event-powered setups that deliver clean, compliant, ready-to-go features every time.

Why This Matters in 2026

  • Speed is everything. If companies are still waiting hours for data to update, they are already losing. 
  • Compliance is not optional – regulators now want to see the entire journey of every data point.
  • And scale? That is just standard now. Enterprises generate terabytes every second. No one is taking care of that manually.

The pipeline gives an advantage. Master it, and a business will be ahead of the competition in AI. Ignore it, and a business is left watching from behind.

Comparison: Old vs. New ML Pipelines

| Aspect | Old Pipelines | New Pipelines |
| --- | --- | --- |
| Data Flow | Batch jobs, slow updates | Real‑time, event‑driven |
| Reliability | Breaks easily, manual fixes | Self‑monitoring, auto‑recovery |
| Features | Rebuilt each time | Shared through the feature store |
| Compliance | Manual checks | Automated lineage and PII protection |
| Scale | Limited, hard to grow | Handles massive data streams |
| Model Prep | Manual splits, risk of errors | Automated, leakage‑free |

The 2026 Machine Learning Data Pipeline Stages

Machine learning data pipelines are modular, event-driven, and built to handle whatever challenges come their way: more data, more rules, more complexity. Every stage matters, turning messy, raw data into clear, model-ready features.

❶ Multi‑Modal Ingestion

Data is not simple anymore. Teams manage SQL tables, video clips, and IoT signals all at once. The old pipelines? They just could not keep up. The new ones support real-time data ingestion, which means they collect everything – live and in sync. No more waiting hours for data to slowly arrive. Models get the latest info right away.

Example: Consider a retail chain. They are pulling transactions from cash registers, video from security cameras, and shelf sensor readings – all feeding into a single pipeline, live.
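To make the idea concrete, here is a minimal sketch of the retail example: three already-sorted, time-stamped sources merged into one live, ordered event stream. The source names and record shapes are illustrative assumptions, not a real ingestion API; production systems would use a streaming platform rather than in-memory lists.

```python
import heapq

# Hypothetical sources for the retail example, each sorted by timestamp.
transactions = [("2026-01-05T10:00:02", "txn", {"sku": "A1", "amount": 12.5})]
camera_events = [("2026-01-05T10:00:01", "camera", {"aisle": 3})]
shelf_sensors = [("2026-01-05T10:00:03", "shelf", {"sku": "A1", "stock": 7})]

def merged_stream(*sources):
    """Merge several time-sorted sources into one ordered event stream."""
    yield from heapq.merge(*sources)  # ISO timestamps compare lexicographically

# All three feeds interleave into a single, timestamp-ordered pipeline.
events = list(merged_stream(transactions, camera_events, shelf_sensors))
```

The camera event arrives first in the merged stream even though it sits in a different source, which is exactly the "live and in sync" behavior the section describes.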


❷ Automated Cleaning & Validation

Raw data is always a mess. Missing fields, odd formats, and random outliers cause problems. Data Quality Bots now take care of them. Before delivering the data to the model, they review it, fix any errors, add any missing information, and flag any issues. 

Example: A hospital can find records missing important details. The system either fills those gaps or marks them for human review. And this works – bad records drop by 80%.
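A toy version of such a "data quality bot" for the hospital example might look like the following. The field names, defaults, and validation rules are illustrative assumptions: fill gaps that are safe to fill, and flag everything else for human review.

```python
# Required fields and safe defaults are illustrative assumptions.
REQUIRED = {"patient_id", "age", "diagnosis"}
DEFAULTS = {"ward": "unassigned"}

def validate(record):
    """Return (clean_record, []) or (None, issues) for human review."""
    record = {**DEFAULTS, **record}        # fill safe gaps automatically
    missing = REQUIRED - record.keys()
    if missing:
        return None, sorted(missing)       # flag for human review
    if not (0 <= record["age"] <= 130):
        return None, ["age out of range"]  # catch weird outliers
    return record, []

clean, issues = validate({"patient_id": "p1", "age": 42, "diagnosis": "flu"})
flagged, why = validate({"patient_id": "p2", "age": 42})  # missing diagnosis
```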


❸ Feature Engineering at Scale

This is where the magic happens. Raw data turns into usable signals. Instead of providing raw GPS coordinates to a model, it calculates “distance from home.” In 2026, feature engineering pipelines do this automatically, across millions of records, so features stay consistent and ready to use.

Example: A ride‑sharing app uses GPS data to generate metrics such as “average trip distance” and “minutes stuck in traffic.” Those features enable every single ride.
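A minimal sketch of that transformation, assuming a raw GPS trace as a list of (latitude, longitude) points: the haversine formula turns coordinates into a single usable feature, total trip distance.

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two GPS points, in kilometres."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 \
        + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))  # Earth radius ~6371 km

def trip_distance_km(points):
    """Turn a raw GPS trace into one model-ready feature: total distance."""
    return sum(haversine_km(*a, *b) for a, b in zip(points, points[1:]))

# Illustrative trace (three points across central Paris):
trace = [(48.8566, 2.3522), (48.8606, 2.3376), (48.8738, 2.2950)]
dist = trip_distance_km(trace)
```

Running the same function over millions of traces is what keeps a feature like "average trip distance" consistent across every model that consumes it.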


❹ The Feature Store

Here’s where teams get smart about their work. A feature store is a shared library: a place to put all those carefully built features so you only build them once. No more wasting time recalculating. No more inconsistencies.

Example: Consider a bank. They store features like “average account balance” and “transaction frequency,” then use them everywhere: fraud detection, credit scoring, customer analysis. It is a central spot that provides data to many models.
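The idea can be sketched as a tiny in-memory store: compute a feature once, then let fraud detection, credit scoring, and analytics all read the same value. The class and feature names are illustrative, not a specific product's API.

```python
class FeatureStore:
    """Toy feature store: shared features keyed by (entity_id, name)."""

    def __init__(self):
        self._features = {}

    def put(self, entity_id, name, value):
        self._features[(entity_id, name)] = value

    def get(self, entity_id, *names):
        """Fetch a consistent feature vector for any downstream model."""
        return [self._features[(entity_id, n)] for n in names]

store = FeatureStore()
store.put("cust-42", "avg_balance", 1850.0)
store.put("cust-42", "txn_frequency", 17)

# Fraud detection and credit scoring read the exact same shared features:
fraud_vec = store.get("cust-42", "avg_balance", "txn_frequency")
score_vec = store.get("cust-42", "avg_balance", "txn_frequency")
```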


❺ Model Readiness

Before teams train a model, the data needs to be split into training, validation, and test sets. Doing that manually? Risky. Teams end up with bad splits or leaks. The new pipelines ensure model data readiness by handling splits automatically and tracking every step, so teams always know how their data is prepared.

Example: Consider an autonomous vehicle company. They ensure sensor data is cleanly split so the training set never overlaps with the test set. The whole process is tracked and transparent.
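One common way to guarantee that kind of leakage-free split is to assign whole groups (here, a driving session) to a split deterministically, for example by hashing the session ID. This is a sketch under that assumption, not the company's actual method.

```python
import hashlib

def split_of(group_id, val_pct=10, test_pct=10):
    """Assign an entire group (e.g. one driving session) to one split,
    deterministically, so frames from one session can never leak
    between training and test sets."""
    bucket = int(hashlib.sha256(group_id.encode()).hexdigest(), 16) % 100
    if bucket < test_pct:
        return "test"
    if bucket < test_pct + val_pct:
        return "val"
    return "train"

# Every frame from one session lands in the same split:
frames = [("session-7", i) for i in range(5)]
splits = {split_of(session_id) for session_id, _ in frames}
```

Because the assignment depends only on the group ID, it is also reproducible: rerunning the pipeline months later yields the identical split, which is part of what "tracked and transparent" means in practice.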


Putting It All Together

These stages are the backbone of modern machine learning pipelines. They turn raw, chaotic data into clean, model-ready features – fast, reliable, and clear. In 2026, there is no room for mistakes here. Nail this, and teams lead in AI. 

Strategic Choice: ETL vs. ELT in Machine Learning

| Feature | ETL (Extract, Transform, Load) | ELT (Extract, Load, Transform) |
| --- | --- | --- |
| Data Preservation | Discards raw data | Keeps raw data in a lakehouse |
| Speed to Training | Slower | Faster |
| Compliance | Masks PII early | Masks PII after loading |
| Flexibility | Rigid | High |
| Scalability | Limited | Scales with cloud compute/storage |
| Cost Efficiency | Higher upfront cost | Lower, pay‑as‑you‑go |
| Complexity | Complex pipelines, harder to adapt | Simpler, modular, easier to extend |
| Error Handling | Errors caught before load | Errors handled after load |
| Use Case Fit | Best for compliance | Best for experimentation/speed |
| Resource Usage | Heavy on ETL servers | Uses cloud resources |
| Maintenance | Manual updates | Easier with cloud tools |
| Data Freshness | Delayed | Near real-time |
| Integration | Works with legacy systems | Works best with a modern cloud stack |
| Analytics Readiness | Pre-shaped for BI tools | Raw data available for ML + BI |

➜ Quick Take: People use ETL when strict rules, old systems, or early data masking matter—like in banks or hospitals. But when speed and scaling up matter more, and keeping the original data is key, ELT wins out. That’s what you see with tech firms, online shops, and AI startups.

Looking ahead to 2026, most machine learning teams are moving to ELT. Cloud lakehouses make it much easier to store raw data and test new ideas quickly.

The Rise of Real‑Time & Event‑Driven Pipelines

Batch processing used to rule the world of data. Now? If teams are still waiting all night for numbers to refresh, they are already behind. The “death of batch” is not just a buzzword – it is happening. Companies that stick with slow, overnight updates miss out on real moments that actually matter.

Today’s machine learning data pipelines work in real time. Every click, every sensor reading, every transaction – captured instantly, as it happens. Tools like Kafka and AWS Glue make all this possible. Data does not wait around for a scheduled batch; it just keeps flowing. And if something shows up late or out of order, backfill tools jump in and fix it automatically. Teams do not lose hours fixing broken records, and models carry on without interruption, even when streams get messy.
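The late-arrival handling can be illustrated with a toy reorder buffer: events are held briefly and released in timestamp order once the stream has advanced past an allowed lag. Real systems (Kafka Streams, Flink) use watermarks for this; the version below is a simplified in-memory analogue.

```python
import heapq

def reorder(events, max_lag=2):
    """Emit (timestamp, payload) events in timestamp order, tolerating
    events that arrive up to max_lag time units late."""
    buf, out = [], []
    for ts, payload in events:
        heapq.heappush(buf, (ts, payload))
        # Release anything old enough that no earlier event can still arrive.
        while buf and ts - buf[0][0] >= max_lag:
            out.append(heapq.heappop(buf))
    while buf:  # flush the remainder at end of stream
        out.append(heapq.heappop(buf))
    return out

# The event at t=1 arrives late, after t=2 and t=3, but is emitted in order:
ordered = reorder([(2, "b"), (3, "c"), (1, "a"), (4, "d")])
```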

The difference across industries:

  • Retail: Recommendations shift as people shop, not hours later.
  • Healthcare: Patient monitors flag issues in seconds, before it is too late.
  • Finance: Fraud detection acts right when something’s off.
  • Logistics: Supply chains adjust in real time to address delays or sudden spikes in demand.

Data Engineering for ML: The Privacy‑First Era

→ Synthetic Data Generation

Machine learning runs on data, but using real customer info is risky. That is where synthetic data comes in. Generative AI lets teams build artificial datasets that replicate real-life patterns, without ever accessing anyone’s private info. Companies can train, test, and share their models while remaining compliant with privacy laws such as the GDPR. 

Consider hospitals: they can create fake patient records for research and keep real identities totally hidden.
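A toy generator for that hospital example might sample from aggregate statistics of the real dataset, never from individual records. The field names and distributions below are illustrative assumptions; real synthetic-data systems use far more sophisticated generative models.

```python
import random

def synthesize_patients(n, age_mean=52, age_sd=18,
                        diagnoses=("flu", "asthma", "diabetes"), seed=0):
    """Generate fake patient records that mimic aggregate patterns only."""
    rng = random.Random(seed)  # seeded for reproducibility
    return [
        {
            "patient_id": f"synthetic-{i}",  # no real identity involved
            "age": max(0, round(rng.gauss(age_mean, age_sd))),
            "diagnosis": rng.choice(diagnoses),
        }
        for i in range(n)
    ]

fake = synthesize_patients(100)
```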

→ Automated Data Lineage

The more complicated data pipelines get, the harder it is to keep track of what is happening. Teams do not just need to know what data they have – they need to know where it came from and how it changed along the way. Automated data lineage tools take care of this by tracking every step, from the moment data comes in through all the transformations it undergoes. That is a major development in the field of explainable AI. Regulators and auditors now expect clear evidence of how every decision is made.

Banks use lineage to show exactly how transaction data feeds into their fraud models. Tech companies use it to spot errors fast. Either way, lineage makes AI systems more transparent and much easier to trust.
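A minimal lineage sketch for the bank example: every transformation is recorded with its name and row counts, so any output can be traced back step by step. The class and step names are illustrative assumptions, not a real lineage product's API.

```python
class Lineage:
    """Record each transformation applied to a dataset."""

    def __init__(self):
        self.steps = []

    def apply(self, name, fn, data):
        result = fn(data)
        self.steps.append({"step": name,
                           "rows_in": len(data),
                           "rows_out": len(result)})
        return result

lineage = Lineage()
txns = [{"amount": 10}, {"amount": -3}, {"amount": 250}]
txns = lineage.apply("drop_negative",
                     lambda d: [t for t in d if t["amount"] > 0], txns)
txns = lineage.apply("flag_large",
                     lambda d: [{**t, "large": t["amount"] > 100} for t in d],
                     txns)
# lineage.steps is now an audit trail of how transaction data reached the model
```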

→ PII Redaction

Personal data needs protection at every stage. As data moves across the system, PII redaction techniques remove names, addresses, and IDs. They filter data in transit, so only the right content reaches the models. Because these safeguards operate automatically in the background, there are fewer leaks and less work for compliance teams.

In healthcare, patterns in patient data can be identified without compromising patient identities. In retail, teams can dig into shopping habits without ever seeing a customer’s name.
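A small in-transit redaction sketch: strip obvious identifiers before text reaches a model. Real systems use far richer detectors (named-entity recognition, dictionaries); the two regex patterns here are deliberately simple illustrations.

```python
import re

# Illustrative patterns only: an email address and a US-style SSN.
PATTERNS = [
    (re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"), "<EMAIL>"),
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "<SSN>"),
]

def redact(text):
    """Replace recognized PII with placeholder tokens, keep everything else."""
    for pattern, token in PATTERNS:
        text = pattern.sub(token, text)
    return text

clean = redact("Contact jane.doe@example.com, SSN 123-45-6789, bought 3 items.")
```

The model still sees the behavioral signal ("bought 3 items") while the identifiers never leave the pipeline.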

Closing Note

Synthetic data, automated lineage, and PII redaction are not just nice-to-haves– they are the backbone of privacy-first machine learning. With these tools, organizations can move fast and innovate, all while proving that strong privacy is not a roadblock. It is what makes trustworthy AI possible.

Why Flexiana’s Functional Approach Wins at Data Engineering

✔️ Clojure for Immutable Transformations

Flexiana builds its data engineering for ML pipelines with Clojure, a language that treats data as immutable. Once data is created, it is never changed in place. This means every transformation is predictable and easier to debug. When teams build feature engineering pipelines, reliability is essential: their models depend on consistent inputs to keep performing well.

✔️ Metrics: Data Freshness and Pipeline Uptime

Flexiana does not just talk about reliability—we track it. We watch two things closely: how quickly new data reaches models (data freshness) and how often the pipelines stay up and running (uptime). Together, these numbers show that the machine learning data pipelines are both fast and reliable. Models are trained on the latest data, and with real-time data ingestion, we ensure the models’ data is ready for production.

✔️ Tracking Data from Start to Finish 

Flexiana adds automated data lineage to its pipelines. Businesses get a clear trail from entry to transformation. It makes model results easy to explain, audits easier to pass, and problems easier to solve. For regulated industries, this kind of transparency ensures that data engineering for ML is more than just a luxury—it is critical.

Closing Note

Flexiana’s functional approach blends immutable design, proven reliability, huge scalability, and clear lineage tracking. If your organization depends on real-time data ingestion, feature-engineering pipelines, and model data readiness, Flexiana’s machine-learning data pipelines deliver.

FAQs on Machine Learning Data Pipelines

Q1: What is the most time‑consuming part of a pipeline?  

Data preparation, without a doubt. Teams spend most of their hours—sometimes 60 to 80 percent—just cleaning, labelling, and formatting data before even thinking about models. Skip this, and even the best machine learning data pipelines won’t give you good results.

Q2: What is a Feature Store?  

Consider it a library of features that you have already developed. Teams can save a ton of time and ensure consistency by reusing features across many models. A retailer, for example, can save “customer purchase frequency” and use it for both recommendations and churn predictions.

Q3: Should we build or buy pipeline tools?  

Most teams mix it up. They buy vendor tools for commodity needs like scale and reliability, and build custom solutions only for the parts that set them apart. It is a way to stay flexible while keeping costs under control.

Q4: What is real‑time data ingestion?  

It is about pulling in data immediately upon creation. Banks use it to catch fraud on the spot. Online stores update recommendations as you shop. In today’s machine learning world, real‑time data ingestion is a must.

Q5: Why is automated data lineage important?  

It shows you exactly how data moves through your pipeline, from start to finish. This makes it way easier to explain model results, meet compliance, or find out where something went wrong. In healthcare, for instance, automated data lineage proves patient data was handled correctly.

Q6: What does model data readiness mean?  

It means your data is clean, up to date, and formatted so that models can use it immediately. If you do not have this, expect mistakes or delays. In finance, model data readiness lets trading algorithms react to the latest market moves instantly.

Q7: How do feature engineering pipelines help ML?  

Feature engineering pipelines turn messy, raw data into something your models can actually work with. Let’s say you have a collection of “transaction history”—rather than inserting it unchanged, you break it down into something cleaner, like “average spend per month.” Stuff like that makes a real difference. These pipelines extract the right details, so your models perform better and are not confused by junk or bias embedded in the data.
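As a concrete sketch of that example, here is how raw transaction history could be reduced to "average spend per month". The record shape is an illustrative assumption.

```python
from collections import defaultdict
from statistics import mean

transactions = [
    {"month": "2026-01", "amount": 40.0},
    {"month": "2026-01", "amount": 60.0},
    {"month": "2026-02", "amount": 30.0},
]

def avg_spend_per_month(txns):
    """Collapse a raw transaction log into one clean, model-ready feature."""
    by_month = defaultdict(list)
    for t in txns:
        by_month[t["month"]].append(t["amount"])
    return mean(sum(amounts) for amounts in by_month.values())

feature = avg_spend_per_month(transactions)  # (100 + 30) / 2 months
```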

Q8: What metrics matter most in Machine Learning data pipelines?  

Two things really stand out: how fresh the data is, and how regularly the pipeline stays active. Freshness tells you whether your data is up to date, and uptime shows how reliable your pipeline is. Both matter if you want to make real-time decisions.

Q9: How do ML pipelines scale to enterprise workloads?  

They move serious data—consider 100GB each second. That is how retailers or hospitals crunch billions of records quickly, keeping real‑time analytics and large-scale feature-engineering pipelines running smoothly.

Summing Up 

Machine learning models get the spotlight, but the real impact lies upstream. The majority of success—often as much as 80%—comes from how well data is prepared, cleaned, and delivered. Strong data pipelines directly translate into more accurate, reliable, and efficient systems.

That is where Flexiana’s approach comes in. Our functional style means data stays predictable because transformations do not mess with the originals. Tracking things like data freshness and pipeline uptime shows if everything is on track. We design pipelines to handle big workloads without breaking down. And with automated data lineage, teams always know where their data has been. Put it all together, and teams get pipelines they can actually trust.

See the difference everywhere. Banks need real-time data to spot fraud as it happens. Hospitals have to track every data step to meet regulations. Retailers want to build smart recommendations, so they lean on feature engineering. In every case, strong pipelines lead to better models.

Whenever pipelines fail to deliver, everything stops moving. But when they work, teams can trust their data, handle whatever comes their way, and get insights right when they need them.

From ingestion to training, building robust data pipelines is complex. Reach out to Flexiana to get it right from day one.

MLOps in 2026: What Is It and Why Should You Care?

The development of AI software is progressing rapidly. Many organizations now work with a machine learning development company to turn early ML ideas into scalable software solutions.

Machine learning can be viewed as the core engine, while MLOps serves as the operational framework that ensures the engine is built, deployed, and runs efficiently at scale.

By 2026, building a machine learning model will no longer be the primary challenge. The real effort begins after development: ensuring the model remains accurate, operates reliably in production, and remains cost-effective over time. While many teams are accelerating AI adoption, a significant number of models fail to reach production, and others degrade in performance after deployment.

When there is no clear process, a few common problems show up:

  • Model drift. Predictions slowly become less accurate.
  • Data inconsistencies. Training data and live data don’t match.
  • Compliance risks. It’s possible to handle sensitive data improperly.
  • Poor monitoring. Teams are unaware of problems until users complain.

This is why MLOps now sits at the center of modern AI software development. It helps teams move from experiments to reliable systems that fit into Full-cycle software development.

With a well-established MLOps framework, teams are able to:

  • Deploy models faster
  • Keep models accurate over time
  • Build scalable software solutions for real products
  • Improve engineering productivity measurement with clear metrics and automation

And for organizations working with a Machine learning development company, MLOps turns machine learning into a real capability. 

In this guide, we will explore how MLOps works, why it matters in 2026, and how teams use it to build secure, privacy-first AI software development pipelines that run in production.

A lifecycle without vs with MLOPs

So, What is MLOps All About? Defining the Intersection of AI and DevOps

🔷 The Definition

Basically, it’s short for Machine Learning Operations (MLOps). 

Let’s think of it as the practical side of machine learning—the step where models leave a data scientist’s notebook and move into the real world. Teams can experiment with models endlessly, but they only matter when they’re running inside an actual product. Out there, they need to be accurate, manage up-to-date data, and keep working without falling apart.

MLOps provides the framework that enables this to happen by bringing together a range of interconnected functions, including:

  • DevOps, which takes care of automating deployment and keeping the infrastructure operational
  • Data engineering, which organizes and prepares data
  • Machine learning engineering, focused on building and improving the models themselves
  • Software operations that watch over all components after deployment.

🔷 The Three Pillars of MLOps

MLOps primarily consists of three components: data, models, and operations. They all depend on each other.

❶ Data Engineering

Machine learning depends entirely on its data. If the data pipeline breaks, the model fails. Data engineering handles everything: retrieving data from various sources, checking for missing or anomalous values, monitoring dataset changes, and turning raw numbers into something models can work with.

So, what enables all of this to run smoothly and reliably?

  • Apache Airflow to schedule 
  • DVC to track datasets 
  • Snowflake to store data in the cloud 
  • Apache Spark to process large datasets 

These tools make the whole process more reliable and way less messy.

❷ Model Engineering

The emphasis here is on developing and refining the models themselves.

  • Teams spend time training on past data 
  • Improve hyperparameters to achieve good results 
  • Log the experiments they have tried so results are never lost
  • Save models so they can roll back or reuse them as needed 

Teams rely on tools such as MLflow, Weights & Biases, and TensorFlow Extended (TFX) to manage this complexity and avoid endless trial and error.

❸ Operations

After teams finish training a model, the real challenge begins: putting it to work and keeping it on track.

  • Teams need a robust CI/CD pipeline to ensure updates and fixes are deployed without issues. 
  • Teams must monitor the situation—make sure the model’s doing its job.
  • Models can drift off course if conditions shift, so teams need to detect drift and retrain models automatically.
  • When more users show up, teams must scale up quickly. 

If any of these aspects are overlooked, models can quickly become unreliable, degrade in performance, or fall out of use altogether.

MLOps Lifecycle Architecture Diagram

Why MLOps Is Non-Negotiable for Modern Business

🔷 Eliminating “Model Drift”

Machine learning models do not remain accurate indefinitely. As markets evolve, user behavior shifts, and real-world conditions change, the data used to train models gradually becomes outdated. When this happens, model performance declines—a phenomenon known as model drift.

This drop starts quietly, hard to see at first.

  • Accuracy drops over time.
  • Predictions become increasingly inconsistent.

Everything looks normal from the outside, but inside, teams are working without reliable insights.

MLOps addresses this challenge by introducing continuous oversight and automation. With MLOps in place:

  • Model performance is monitored in real time.
  • Data drift is detected as soon as deviations occur.
  • Retraining starts without manual effort.

As a result, models keep adapting to current data and remain aligned with real-world conditions rather than outdated patterns.
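A toy version of that oversight loop: compare live feature values against the training distribution and trigger retraining when the mean shifts too far. The threshold and data are illustrative; production drift detectors use richer statistics (e.g. population stability index, KS tests).

```python
from statistics import mean, stdev

def needs_retraining(train_values, live_values, z_threshold=3.0):
    """Flag drift when the live mean moves far outside the training spread."""
    mu, sigma = mean(train_values), stdev(train_values)
    z = abs(mean(live_values) - mu) / sigma
    return z > z_threshold

train = [10.0, 11.0, 9.0, 10.5, 9.5]      # feature values at training time
stable_live = [10.2, 9.8, 10.1]           # live data, same distribution
drifted_live = [30.0, 31.0, 29.5]         # live data after conditions shifted
```

Wired into a pipeline, a `True` result would kick off the automated retraining the section describes, instead of waiting for users to complain.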

Model accuracy decline without retraining

🔷 Accelerating Time-to-Market

When teams follow the traditional machine learning process, progress is slow. 

  • Data scientists create and test models within their own environment. 
  • Then, engineers step in, rework all the code to fit the real world, and ultimately deploy it. 
  • Sometimes, that exchange goes on endlessly for weeks—sometimes months. 

It’s frustrating and delays any real impact.

MLOps flips that script. With clear, repeatable pipelines, models effortlessly move from testing to production. Automated tests keep things on track and consistent, and CI/CD tools let the team push out updates quickly—sometimes even daily. 

Suddenly, everyone can experiment, tweak, and deploy very quickly. Teams get results much faster, and the company notices the difference.

🔗 Google Cloud research shows that companies using MLOps deploy machine learning models about 30% faster because much of the process is automated.

🔷 Ensuring Regulatory Compliance

If a team is in finance, healthcare, insurance (or any regulated sector), they cannot overlook compliance requirements such as GDPR, HIPAA, or SOC 2. As new AI laws emerge, restrictions continue to increase. Without solid processes, staying transparent and accountable gets messy. It’s tough to explain why the model made a decision or to prove exactly where the data came from.

That’s where MLOps shines. Teams get audit trails for every model change—nothing is missed. It’s easy to see all the data sources and transformations, and the workflow stays completely traceable. That’s not just nice to have anymore—it’s essential if teams want to stay in business and out of trouble.

Comparison: Manual ML vs Automated MLOps

Consider this: A team is building ML systems the old way, solely through manual effort. Each team controls individual components, tools don’t work together, and no one’s really sure who did what—or when. Deployments are delayed, and each update causes problems. MLOps changes everything. 

Teams automate the entire pipeline, standardize best practices, track every version, and monitor the system from start to finish. Everything just works better—and a lot faster. 

Here’s how they compare.

| Feature | Manual ML Workflow | Automated MLOps |
| --- | --- | --- |
| Deployment Speed | It will take days, sometimes weeks. | Achieved within minutes. |
| Monitoring | Teams only find problems once something fails. | Stay ahead—automated alerts let teams know before issues arise. |
| Scalability | Pretty tough to replicate across teams. | Makes it easy to implement standardized solutions at scale. |
| Reproducibility | Depends on each person’s local setup, so results vary widely. | Versioned pipelines keep experiments and results consistent. |
| Security | Teams fix security flaws when they appear. | Privacy and compliance are part of the design from day one. |
| Collaboration | People work in silos, and it’s hard for others to see what’s going on. | Shared pipelines and experiment tracking let everyone stay in sync. |
| Model Versioning | Often gets skipped or done manually. | Automatically tracks every model version. |
| Data Versioning | It’s hard to know how your data is changing. | Every dataset version is logged and recorded. |
| Testing | Mostly manual and tedious. | Validation and testing run automatically. |
| Deployment Process | Engineers need to rewrite a lot of code just to deploy. | CI/CD pipelines handle deployment for the team. |
| Model Updates | Retraining happens when someone notices a problem. | Scheduled or trigger-based retraining—no need to wait for failure. |
| Failure Recovery | Teams only address issues after production failures. | Continuous monitoring catches problems early. |
| Experiment Tracking | Results are scattered or easily lost on local machines. | All experiments are tracked, easily compared, and fully reproducible. |

Manual workflows make it seem like teams have to start over every time. MLOps makes the process easy to repeat. Models move from training to production through a defined pipeline. Teams know what changed. They know what version is running. And they can update models without breaking the system.

Manual ML vs MLOps Workflow Comparison

The MLOps Tech Stack in 2026

🔷 Orchestration Platforms

Machine learning pipelines aren’t simple. Teams have data preparation, model training, validation, and deployment—all of it has to happen in the right order. Orchestration tools make sure that happens. Common tools include:

  • Kubeflow
  • MLflow
  • Apache Airflow

These manage training pipelines, automate tedious tasks, and handle deployments. So instead of running around, clicking buttons, and hoping no one skipped a step, teams let the pipelines do the work – and avoid those silly mistakes.

🔷 Data Versioning

Data never remains static. New items arrive, old items get tweaked, and sometimes things just disappear. If teams don’t keep track, their models turn into a mess. Data versioning tools keep everything organized. Common tools include:

  • DVC (Data Version Control)
  • LakeFS
  • Delta Lake

Teams can roll back to earlier datasets, repeat old experiments, and work with other teams without causing conflicts. When someone asks, “Which data did we use for this model?” teams actually have an answer.
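The core trick behind tools like DVC can be sketched in a few lines: identify each dataset snapshot by a content hash, so identical data always gets the identical version ID. This is a simplified illustration, not DVC's actual format.

```python
import hashlib
import json

def dataset_version(rows):
    """Derive a short, deterministic version ID from dataset contents."""
    payload = json.dumps(rows, sort_keys=True).encode()
    return hashlib.sha256(payload).hexdigest()[:12]

v1 = dataset_version([{"id": 1, "label": "cat"}])
v2 = dataset_version([{"id": 1, "label": "cat"}, {"id": 2, "label": "dog"}])
same_as_v1 = dataset_version([{"id": 1, "label": "cat"}])
```

Because the ID is derived from content, recording it alongside a trained model answers "which data did we use?" exactly, and any change to the data produces a new version.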

🔷 Containerization & Infrastructure

Teams can’t trust code to work the same everywhere, given all the unusual quirks in different systems. That’s where containers help. Most ML teams use:

  • Docker containers
  • Kubernetes clusters
  • Serverless ML setups

Containers bundle the model, code, and everything else it needs. So teams know it’ll work—locally, on the cloud, wherever. That means fewer surprises and smoother scaling. 

🔷 The Clojure Advantage

Some teams include Clojure in their AI systems. It runs on the JVM, so teams get access to the massive Java ecosystem. What’s cool about Clojure?

  • The platform is stable
  • Immutable data structures help kill off annoying bugs
  • It handles concurrency like a champ

For large systems that process large volumes of data or run multiple jobs concurrently, Clojure is a perfect fit. In some enterprise setups, it integrates seamlessly with those MLOps pipelines and keeps things running smoothly.

MLOps Architecture Stack

Measuring ROI: Engineering Productivity in AI

Many companies underestimate how expensive it is to maintain AI systems. Building the model is just the starting line. Most of the heavy lifting comes afterward: the updates, the fixes when data pipelines break, and the constant monitoring. The real costs become apparent once the system is operational under real-world conditions.

This condition is known as hidden technical debt in Machine Learning Systems. The model continues to work, but behind the scenes, things become more complex and harder to maintain. That’s where MLOps steps in. It provides teams with a framework for launching models, tracking their progress, and retraining as needed. Even better, it makes it way easier to see how AI is actually doing.

🔷 Key Metrics MLOps Enables

When teams use MLOps, they get real numbers to track. These metrics show how quickly engineers can upgrade models or fix issues. Some of the main things teams watch:

  • Time to retrain—how long does it take to refresh a model with new data
  • Model deployment frequency—how often new models go live
  • Prediction latency—how fast teams get answers from the system
  • Data pipeline reliability—how much of the time everything operates correctly

These engineering productivity metrics help teams identify bottlenecks that slow progress. They’re also a solid way to boost productivity across your AI projects.
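As a rough illustration, two of these metrics can be computed from a simple deployment log. The dates below are made up, and real MLOps platforms collect this automatically.

```python
from datetime import date

# Hypothetical logs: when models went live, and when retraining was
# requested vs. completed.
deploys = [date(2026, 1, 2), date(2026, 1, 9), date(2026, 1, 30)]
retrains = [(date(2026, 1, 5), date(2026, 1, 6))]  # (requested, completed)

# Model deployment frequency over the observed window:
window_days = (deploys[-1] - deploys[0]).days
deploys_per_week = len(deploys) / (window_days / 7)

# Average time to retrain:
avg_retrain_days = sum((done - asked).days for asked, done in retrains) / len(retrains)
```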

🔷 Business ROI Metrics

Sure, engineering metrics matter, but they don’t tell the whole story. Businesses ultimately want to see the results of their investment. Thus, they consider things such as:

  • How many machine learning features increase revenue
  • Whether more customers are converting
  • How much they’re saving by cutting operational costs

These are the numbers that connect the AI system to actual business value. If a model brings in more sales or saves people from boring manual work, that’s when the payoff gets real. That’s the true return companies look for when they invest in MLOps.

Engineering Productivity Metrics Before vs After MLOps

How to Choose a Partner for Your AI Infrastructure

🔷 Key Evaluation Criteria

Pay close attention to how they handle building and maintaining those systems. Lots of AI projects fail because someone builds a model and then abandons it. The best partners stick around for the entire journey.

Checklist

✔️ Automated model testing: Just like regular software, AI models need to be tested regularly. Automated tests make it way easier to spot accuracy issues or catch errors early. Plus, they let teams know if the model starts behaving strangely because new data is skewing its predictions.

✔️ CI/CD pipelines for ML: Models aren’t static. They change as new data arrives. Improvements are made. A good team builds CI/CD pipelines so models can be updated and deployed safely.

✔️ Continuous monitoring: The work is not finished once the models go live. Teams have to monitor performance once they’re running in production. Monitoring helps teams spot drops in accuracy or unusual behavior.

✔️ Data governance systems: AI systems depend on data. Teams also need solid rules for data—where they store it, who gets access, and how they keep it private. Good governance keeps both the company and users safe.

✔️ Documentation and reproducibility: If an AI system is unclear or cannot be reproduced, it creates problems for teams. When teams document things clearly, it’s way easier to rebuild models, solve problems, or continue developing later.

Common Questions (FAQs)❓

Q1: Is MLOps just DevOps for AI?

Not really. MLOps and DevOps share some practices. Both use automation and continuous deployment. But MLOps deals with more pieces.

DevOps mostly manages code. MLOps manages code, data, and models. And those pieces change over time. Data shifts. Models lose accuracy. Without updates, predictions decline—ML systems require ongoing checks and retraining.

Q2: When should we start thinking about MLOps?

Right from the start. MLOps should be part of the first version of your AI software. Not something you try to add later. Without structure, ML projects get messy fast. Models become hard to track. Data pipelines break. Updates become risky.

And fixing that later usually costs more than building it properly from the beginning.

Q3: Does MLOps help with AI security?

Yes. MLOps pipelines add layers like security checks and automated scans, not always found in standard DevOps. They also control how data moves through the system. This makes it easier to see where training data comes from and who has access to it.

And it helps prevent unsafe or unverified data from being used to train models.

Conclusion 

Business value cannot be created by machine learning alone. What matters is how it works in real systems.

Many teams build models that perform well in tests. But without the right setup around them, those models stay in notebooks or demo apps. They never become part of a real product. This is where MLOps matters.

MLOps helps move a model from prototype to production. It puts structure around how models are trained, deployed, and updated. Data changes. Models lose accuracy. Systems need updates. MLOps helps teams manage all of that. Companies that invest in a few key areas tend to move faster:

  • AI software development
  • Engineering productivity measurement
  • Privacy-first AI development

These practices make AI systems easier to build and maintain. And they help teams keep systems running as data and products change. Over time, that leads to a real advantage.

MLOps isn’t optional anymore, and if you want to get it right, the team at Flexiana is a great place to start the conversation.

The post MLOps in 2026: What Is It and Why Should You Care?  appeared first on Flexiana.


Build and Deploy Web Apps With Clojure and Fly.io

This post walks through a small web development project using Clojure, covering everything from building the app to packaging and deploying it. It’s a collection of insights and tips I’ve learned from building my Clojure side projects, but presented in a more structured format.

As the title suggests, we’ll be deploying the app to Fly.io. It’s a service that allows you to deploy apps packaged as Docker images on lightweight virtual machines. [1] My experience with it has been good; it’s easy to use and quick to set up. One downside of Fly is that it doesn’t have a free tier, but if you don’t plan on leaving the app deployed, it barely costs anything.

This isn’t a tutorial on Clojure, so I’ll assume you already have some familiarity with the language as well as some of its libraries. [2]

Project Setup

In this post, we’ll be building a barebones bookmarks manager for the demo app. Users can log in using basic authentication, view all bookmarks, and create a new bookmark. It’ll be a traditional multi-page web app and the data will be stored in a SQLite database.

Here’s an overview of the project’s starting directory structure:

.
├── dev
│   └── user.clj
├── resources
│   └── config.edn
├── src
│   └── acme
│       └── main.clj
└── deps.edn

And here are the libraries we’re going to use. If you have some Clojure experience or have used Kit, you’re probably already familiar with all the libraries listed below. [3]

deps.edn
{:paths ["src" "resources"]
 :deps {org.clojure/clojure               {:mvn/version "1.12.0"}
        aero/aero                         {:mvn/version "1.1.6"}
        integrant/integrant               {:mvn/version "0.11.0"}
        ring/ring-jetty-adapter           {:mvn/version "1.12.2"}
        metosin/reitit-ring               {:mvn/version "0.7.2"}
        com.github.seancorfield/next.jdbc {:mvn/version "1.3.939"}
        org.xerial/sqlite-jdbc            {:mvn/version "3.46.1.0"}
        hiccup/hiccup                     {:mvn/version "2.0.0-RC3"}}
 :aliases
 {:dev {:extra-paths ["dev"]
        :extra-deps  {nrepl/nrepl    {:mvn/version "1.3.0"}
                      integrant/repl {:mvn/version "0.3.3"}}
        :main-opts   ["-m" "nrepl.cmdline" "--interactive" "--color"]}}}

I use Aero and Integrant for my system configuration (more on this in the next section), Ring with the Jetty adaptor for the web server, Reitit for routing, next.jdbc for database interaction, and Hiccup for rendering HTML. From what I’ve seen, this is a popular “library combination” for building web apps in Clojure. [4]

The user namespace in dev/user.clj contains helper functions from Integrant-repl to start, stop, and restart the Integrant system.

dev/user.clj
(ns user
  (:require
   [acme.main :as main]
   [clojure.tools.namespace.repl :as repl]
   [integrant.core :as ig]
   [integrant.repl :refer [set-prep! go halt reset reset-all]]))

(set-prep!
 (fn []
   (ig/expand (main/read-config)))) ;; we'll implement this soon

(repl/set-refresh-dirs "src" "resources")

(comment
  (go)
  (halt)
  (reset)
  (reset-all))

Systems and Configuration

If you’re new to Integrant or other dependency injection libraries like Component, I’d suggest reading “How to Structure a Clojure Web”. It’s a great explanation of the reasoning behind these libraries. Like most Clojure apps that use Aero and Integrant, my system configuration lives in a .edn file. I usually name mine as resources/config.edn. Here’s what it looks like:

resources/config.edn
{:server
 {:port #long #or [#env PORT 8080]
  :host #or [#env HOST "0.0.0.0"]
  :auth {:username #or [#env AUTH_USER "john.doe@email.com"]
         :password #or [#env AUTH_PASSWORD "password"]}}

 :database
 {:dbtype "sqlite"
  :dbname #or [#env DB_DATABASE "database.db"]}}

In production, most of these values will be set using environment variables. During local development, the app will use the hard-coded default values. We don’t have any sensitive values in our config (e.g., API keys), so it’s fine to commit this file to version control. If there are such values, I usually put them in another file that’s not tracked by version control and include them in the config file using Aero’s #include reader tag.
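As a quick sketch of that pattern (the secrets.edn filename and keys here are hypothetical), the untracked file holds the sensitive values and the committed config pulls it in with #include, which Aero resolves relative to the including file:

```clojure
;; resources/secrets.edn -- git-ignored, never committed
{:api-key "real-secret-value"}

;; resources/config.edn -- committed to version control
{:server {:secrets #include "secrets.edn"}}
```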

This config file is then “expanded” into the Integrant system map using the expand-key method:

src/acme/main.clj
(ns acme.main
  (:require
   [aero.core :as aero]
   [clojure.java.io :as io]
   [integrant.core :as ig]))

(defn read-config
  []
  {:system/config (aero/read-config (io/resource "config.edn"))})

(defmethod ig/expand-key :system/config
  [_ opts]
  (let [{:keys [server database]} opts]
    {:server/jetty (assoc server :handler (ig/ref :handler/ring))
     :handler/ring {:database (ig/ref :database/sql)
                    :auth     (:auth server)}
     :database/sql database}))

The system map is created in code instead of in the configuration file. This makes refactoring your system simpler, as you only need to change this method while leaving the config file (mostly) untouched. [5]

My current approach to Integrant + Aero config files is mostly inspired by the blog post “Rethinking Config with Aero & Integrant” and Laravel’s configuration. The config file follows a similar structure to Laravel’s config files and contains the app configurations without describing the structure of the system. Previously, I had a key for each Integrant component, which led to the config file being littered with #ig/ref and more difficult to refactor.

Also, if you haven’t already, start a REPL and connect to it from your editor. Run clj -M:dev if your editor doesn’t automatically start a REPL. Next, we’ll implement the init-key and halt-key! methods for each of the components:

src/acme/main.clj
(ns acme.main
  (:require
   ;; ...
   [acme.handler :as handler]
   [acme.util :as util]
   [next.jdbc :as jdbc]
   [ring.adapter.jetty :as jetty]))
;; ...

(defmethod ig/init-key :server/jetty
  [_ opts]
  (let [{:keys [handler port]} opts
        jetty-opts (-> opts (dissoc :handler :auth) (assoc :join? false))
        server     (jetty/run-jetty handler jetty-opts)]
    (println "Server started on port " port)
    server))

(defmethod ig/halt-key! :server/jetty
  [_ server]
  (.stop server))

(defmethod ig/init-key :handler/ring
  [_ opts]
  (handler/handler opts))

(defmethod ig/init-key :database/sql
  [_ opts]
  (let [datasource (jdbc/get-datasource opts)]
    (util/setup-db datasource)
    datasource))

The setup-db function creates the required tables in the database if they don’t exist yet. This works fine for database migrations in small projects like this demo app, but for larger projects, consider using libraries such as Migratus (my preferred library) or Ragtime.

src/acme/util.clj
(ns acme.util 
  (:require
   [next.jdbc :as jdbc]))

(defn setup-db
  [db]
  (jdbc/execute-one!
   db
   ["create table if not exists bookmarks (
       bookmark_id text primary key not null,
       url text not null,
       created_at datetime default (unixepoch()) not null
     )"]))

For the server handler, let’s start with a simple function that returns a “hi world” string.

src/acme/handler.clj
(ns acme.handler
  (:require
   [ring.util.response :as res]))

(defn handler
  [_opts]
  (fn [req]
    (res/response "hi world")))

Now all the components are implemented. We can check if the system is working properly by evaluating (reset) in the user namespace. This will reload your files and restart the system. You should see this message printed in your REPL:

:reloading (acme.util acme.handler acme.main)
Server started on port  8080
:resumed

If we send a request to http://localhost:8080/, we should get “hi world” as the response:

$ curl localhost:8080/
# hi world

Nice! The system is working correctly. In the next section, we’ll implement routing and our business logic handlers.

Routing, Middleware, and Route Handlers

First, let’s set up a ring handler and router using Reitit. We only have one route, the index / route that’ll handle both GET and POST requests.

src/acme/handler.clj
(ns acme.handler
  (:require
   [reitit.ring :as ring]))

(def routes
  [["/" {:get  index-page
         :post index-action}]])

(defn handler
  [opts]
  (ring/ring-handler
   (ring/router routes)
   (ring/routes
    (ring/redirect-trailing-slash-handler)
    (ring/create-resource-handler {:path "/"})
    (ring/create-default-handler))))

We’re including some useful middleware:

  • redirect-trailing-slash-handler to resolve routes with trailing slashes,
  • create-resource-handler to serve static files, and
  • create-default-handler to handle common 40x responses.

Implementing the Middlewares

If you remember the :handler/ring from earlier, you’ll notice that it has two dependencies, database and auth. Currently, they’re inaccessible to our route handlers. To fix this, we can inject these components into the Ring request map using a middleware function.

src/acme/handler.clj
;; ...

(defn components-middleware
  [components]
  (let [{:keys [database auth]} components]
    (fn [handler]
      (fn [req]
        (handler (assoc req
                        :db database
                        :auth auth))))))
;; ...

The components-middleware function takes in a map of components and creates a middleware function that “assocs” each component into the request map. [6] If you have more components, such as a Redis cache or a mail service, you can add them here.

We’ll also need a middleware to handle HTTP basic authentication. [7] This middleware will check whether the username and password from the request map match the values in the auth map injected by components-middleware. If they match, the request is authenticated and the user can view the site.

src/acme/handler.clj
(ns acme.handler
  (:require
   ;; ...
   [acme.util :as util]
   [ring.util.response :as res]))
;; ...

(defn wrap-basic-auth
  [handler]
  (fn [req]
    (let [{:keys [headers auth]} req
          {:keys [username password]} auth
          authorization (get headers "authorization")
          correct-creds (str "Basic " (util/base64-encode
                                       (format "%s:%s" username password)))]
      (if (and authorization (= correct-creds authorization))
        (handler req)
        (-> (res/response "Access Denied")
            (res/status 401)
            (res/header "WWW-Authenticate" "Basic realm=protected"))))))
;; ...

A nice feature of Clojure is that interop with the host language is easy. The base64-encode function is just a thin wrapper over Java’s Base64.Encoder:

src/acme/util.clj
(ns acme.util
   ;; ...
  (:import java.util.Base64))

(defn base64-encode
  [s]
  (.encodeToString (Base64/getEncoder) (.getBytes s)))
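The resulting header is easy to reproduce in the shell, which is handy for testing wrap-basic-auth with curl once the server is up. A sketch, using the default dev credentials from resources/config.edn:

```shell
# wrap-basic-auth expects "Basic " followed by base64("username:password").
# Default dev credentials from resources/config.edn:
CREDS="john.doe@email.com:password"
AUTH_HEADER="Basic $(printf '%s' "$CREDS" | base64 | tr -d '\n')"
echo "$AUTH_HEADER"

# With the server running, either send the header directly...
#   curl -H "Authorization: $AUTH_HEADER" http://localhost:8080/
# ...or let curl build the same header for you with -u:
#   curl -u "$CREDS" http://localhost:8080/
```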

Finally, we need to add them to the router. Since we’ll be handling form requests later, we’ll also bring in Ring’s wrap-params middleware.

src/acme/handler.clj
(ns acme.handler
  (:require
   ;; ...
   [ring.middleware.params :refer [wrap-params]]))
;; ...

(defn handler
  [opts]
  (ring/ring-handler
   ;; ...
   {:middleware [(components-middleware opts)
                 wrap-basic-auth
                 wrap-params]}))

Implementing the Route Handlers

We now have everything we need to implement the route handlers or the business logic of the app. First, we’ll implement the index-page function, which renders a page that:

  1. Shows all of the user’s bookmarks in the database, and
  2. Shows a form that allows the user to insert new bookmarks into the database
src/acme/handler.clj
(ns acme.handler
  (:require
   ;; ...
   [next.jdbc :as jdbc]
   [next.jdbc.sql :as sql]))
;; ...

(defn template
  [bookmarks]
  [:html
   [:head
    [:meta {:charset "utf-8"
            :name    "viewport"
            :content "width=device-width, initial-scale=1.0"}]]
   [:body
    [:h1 "bookmarks"]
    [:form {:method "POST"}
     [:div
      [:label {:for "url"} "url "]
      [:input#url {:name "url"
                   :type "url"
                   :required true
                   :placeholder "https://en.wikipedia.org/"}]]
     [:button "submit"]]
    [:p "your bookmarks:"]
    [:ul
     (if (empty? bookmarks)
       [:li "you don't have any bookmarks"]
       (map
        (fn [{:keys [url]}]
          [:li
           [:a {:href url} url]])
        bookmarks))]]])

(defn index-page
  [req]
  (try
    (let [bookmarks (sql/query (:db req)
                               ["select * from bookmarks"]
                               jdbc/unqualified-snake-kebab-opts)]
      (util/render (template bookmarks)))
    (catch Exception e
      (util/server-error e))))
;; ...

Database queries can sometimes throw exceptions, so it’s good to wrap them in a try-catch block. I’ll also introduce some helper functions:

src/acme/util.clj
(ns acme.util
  (:require
   ;; ...
   [hiccup2.core :as h]
   [ring.util.response :as res])
  (:import java.util.Base64))
;; ...

(defn prepend-doctype
  [s]
  (str "<!doctype html>" s))

(defn render
  [hiccup]
  (-> hiccup h/html str prepend-doctype res/response (res/content-type "text/html")))

(defn server-error
  [e]
  (println "Caught exception: " e)
  (-> (res/response "Internal server error")
      (res/status 500)))

render takes a hiccup form and turns it into a ring response, while server-error takes an exception, logs it, and returns a 500 response.

Next, we’ll implement the index-action function:

src/acme/handler.clj
;; ...

(defn index-action
  [req]
  (try
    (let [{:keys [db form-params]} req
          value (get form-params "url")]
      (sql/insert! db :bookmarks {:bookmark_id (random-uuid) :url value})
      (res/redirect "/" 303))
    (catch Exception e
      (util/server-error e))))
;; ...

This is an implementation of a typical post/redirect/get pattern. We get the value from the URL form field, insert a new row in the database with that value, and redirect back to the index page. Again, we’re using a try-catch block to handle possible exceptions from the database query.

That should be all of the code for the controllers. If you reload your REPL and go to http://localhost:8080, you should see something that looks like this after logging in:

Screenshot of the app

The last thing we need to do is to update the main function to start the system:

src/acme/main.clj
;; ...

(defn -main [& _]
  (-> (read-config) ig/expand ig/init))

Now, you should be able to run the app using clj -M -m acme.main. That’s all the code needed for the app. In the next section, we’ll package the app into a Docker image to deploy to Fly.

Packaging the App

While there are many ways to package a Clojure app, Fly.io specifically requires a Docker image. There are two approaches to doing this:

  1. Build an uberjar and run it using Java in the container, or
  2. Load the source code and run it using Clojure in the container
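For contrast, the second approach skips the uberjar entirely and runs the app from source with the Clojure CLI. A Dockerfile for it might look like this (a sketch only; the rest of this post uses the first approach):

```dockerfile
# Run from source with the Clojure CLI inside the container
FROM clojure:temurin-21-tools-deps-bookworm-slim
WORKDIR /opt
COPY . .
# Pre-fetch dependencies so they're cached in an image layer
RUN clj -P
EXPOSE 8080
CMD ["clj", "-M", "-m", "acme.main"]
```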

Both are valid approaches. I prefer the first since its only dependency is the JVM. We’ll use the tools.build library to build the uberjar. Check out the official guide for more information on building Clojure programs. Since tools.build is a library, we can use it by adding it to our deps.edn file under an alias:

deps.edn
{;; ...
 :aliases
 {;; ...
  :build {:extra-deps {io.github.clojure/tools.build 
                       {:git/tag "v0.10.5" :git/sha "2a21b7a"}}
          :ns-default build}}}

Tools.build expects a build.clj file in the root of the project directory, so we’ll need to create that file. This file contains the instructions to build artefacts, which in our case is a single uberjar. There are many great examples of build.clj files on the web, including from the official documentation. For now, you can copy+paste this file into your project.

build.clj
(ns build
  (:require
   [clojure.tools.build.api :as b]))

(def basis (delay (b/create-basis {:project "deps.edn"})))
(def src-dirs ["src" "resources"])
(def class-dir "target/classes")

(defn uber
  [_]
  (println "Cleaning build directory...")
  (b/delete {:path "target"})

  (println "Copying files...")
  (b/copy-dir {:src-dirs   src-dirs
               :target-dir class-dir})

  (println "Compiling Clojure...")
  (b/compile-clj {:basis      @basis
                  :ns-compile '[acme.main]
                  :class-dir  class-dir})

  (println "Building Uberjar...")
  (b/uber {:basis     @basis
           :class-dir class-dir
           :uber-file "target/standalone.jar"
           :main      'acme.main}))

To build the project, run clj -T:build uber. This will create the uberjar standalone.jar in the target directory. The uber in clj -T:build uber refers to the uber function from build.clj. Since the build system is a Clojure program, you can customise it however you like. If we try to run the uberjar now, we’ll get an error:

# build the uberjar
$ clj -T:build uber
# Cleaning build directory...
# Copying files...
# Compiling Clojure...
# Building Uberjar...

# run the uberjar
$ java -jar target/standalone.jar
# Error: Could not find or load main class acme.main
# Caused by: java.lang.ClassNotFoundException: acme.main

This error occurred because the Main class that is required by Java isn’t built. To fix this, we need to add the :gen-class directive in our main namespace. This will instruct Clojure to create the Main class from the -main function.

src/acme/main.clj
(ns acme.main
  ;; ...
  (:gen-class))
;; ...

If you rebuild the project and run java -jar target/standalone.jar again, it should work perfectly. Now that we have a working build script, we can write the Dockerfile:

Dockerfile
# install additional dependencies here in the base layer
# separate base from build layer so any additional deps installed are cached
FROM clojure:temurin-21-tools-deps-bookworm-slim AS base

FROM base AS build
WORKDIR /opt
COPY . .
RUN clj -T:build uber

FROM eclipse-temurin:21-alpine AS prod
COPY --from=build /opt/target/standalone.jar /
EXPOSE 8080
ENTRYPOINT ["java", "-jar", "standalone.jar"]

It’s a multi-stage Dockerfile. We use the official Clojure Docker image as the layer to build the uberjar. Once it’s built, we copy it to a smaller Docker image that contains only the Java runtime. [8] By doing this, we get a smaller container image as well as faster Docker builds, because the layers are better cached.

That should be all for packaging the app. We can move on to the deployment now.

Deploying with Fly.io

First things first, you’ll need to install flyctl, Fly’s CLI tool for interacting with their platform. Create a Fly.io account if you haven’t already. Then run fly auth login to authenticate flyctl with your account.

Next, we’ll need to create a new Fly App:

$ fly app create
# ? Choose an app name (leave blank to generate one): 
# automatically selected personal organization: Ryan Martin
# New app created: blue-water-6489

Another way to do this is with the fly launch command, which automates a lot of the app configuration for you. We have some steps to do that fly launch doesn’t cover, so we’ll be configuring the app manually. I also have a fly.toml file ready that you can copy straight into your project.

fly.toml
# replace these with your app and region name
# run `fly platform regions` to get a list of regions
app = 'blue-water-6489' 
primary_region = 'sin'

[env]
  DB_DATABASE = "/data/database.db"

[http_service]
  internal_port = 8080
  force_https = true
  auto_stop_machines = "stop"
  auto_start_machines = true
  min_machines_running = 0

[mounts]
  source = "data"
  destination = "/data"
  initial_size = 1

[[vm]]
  size = "shared-cpu-1x"
  memory = "512mb"
  cpus = 1
  cpu_kind = "shared"

These are mostly the default configuration values with some additions. Under the [env] section, we’re setting the SQLite database location to /data/database.db. The database.db file itself will be stored in a persistent Fly Volume mounted on the /data directory. This is specified under the [mounts] section. Fly Volumes are similar to regular Docker volumes but are designed for Fly’s micro VMs.

We’ll need to set the AUTH_USER and AUTH_PASSWORD environment variables too, but not through the fly.toml file as these are sensitive values. To securely set these credentials with Fly, we can set them as app secrets. They’re stored encrypted and will be automatically injected into the app at boot time.

$ fly secrets set AUTH_USER=hi@ryanmartin.me AUTH_PASSWORD=not-so-secure-password
# Secrets are staged for the first deployment

With this, the configuration is done and we can deploy the app using fly deploy:

$ fly deploy
# ...
# Checking DNS configuration for blue-water-6489.fly.dev
# Visit your newly deployed app at https://blue-water-6489.fly.dev/

The first deployment will take longer since it’s building the Docker image for the first time. Subsequent deployments should be faster due to the cached image layers. You can click on the link to view the deployed app, or you can also run fly open, which will do the same thing. Here’s the app in action:

The app in action

If you made additional changes to the app or fly.toml, you can redeploy the app using the same command, fly deploy. The app is configured to auto stop/start, which helps to cut costs when there’s not a lot of traffic to the site. If you want to take down the deployment, you’ll need to delete the app itself using fly app destroy <your app name>.

Adding a Production REPL

This is an interesting topic in the Clojure community, with varying opinions on whether or not it’s a good idea. Personally, I find having a REPL connected to the live app helpful, and I often use it for debugging and running queries on the live database. [9] Since we’re using SQLite, we don’t have a database server we can connect to directly, unlike Postgres or MySQL.

If you’re brave, you can even restart the app directly without redeploying from the REPL. You can easily go wrong with it, which is why some prefer not to use it.

For this project, we’re going to add a socket REPL. It’s very simple to add (you just need to set a JVM option) and it doesn’t require additional dependencies like nREPL. Let’s update the Dockerfile:

Dockerfile
# ...
EXPOSE 7888
ENTRYPOINT ["java", "-Dclojure.server.repl={:port 7888 :accept clojure.core.server/repl}", "-jar", "standalone.jar"]

The socket REPL will be listening on port 7888. If we redeploy the app now, the REPL will be started, but we won’t be able to connect to it. That’s because we haven’t exposed the service through Fly proxy. We can do this by adding the socket REPL as a service in the [services] section in fly.toml.

However, doing this will also expose the REPL port to the public. This means that anyone can connect to your REPL and possibly mess with your app. Instead, what we want to do is to configure the socket REPL as a private service.

By default, all Fly apps in your organisation live in the same private network. This private network, called 6PN, connects the apps in your organisation through WireGuard tunnels (a VPN) using IPv6. Fly private services aren’t exposed to the public internet but can be reached from this private network. We can then use Wireguard to connect to this private network to reach our socket REPL.

Fly VMs are also configured with the hostname fly-local-6pn, which maps to its 6PN address. This is analogous to localhost, which points to your loopback address 127.0.0.1. To expose a service to 6PN, all we have to do is bind or serve it to fly-local-6pn instead of the usual 0.0.0.0. We have to update the socket REPL options to:

Dockerfile
# ...
ENTRYPOINT ["java", "-Dclojure.server.repl={:port 7888,:address \"fly-local-6pn\",:accept clojure.core.server/repl}", "-jar", "standalone.jar"]

After redeploying, we can use the fly proxy command to forward the port from the remote server to our local machine. [10]

$ fly proxy 7888:7888
# Proxying local port 7888 to remote [blue-water-6489.internal]:7888

In another shell, run:

$ rlwrap nc localhost 7888
# user=>

Now we have a REPL connected to the production app! rlwrap is used for readline functionality, e.g. up/down arrow keys, vi bindings. Of course, you can also connect to it from your editor.

Deploy with GitHub Actions

If you’re using GitHub, we can also set up automatic deployments on pushes/PRs with GitHub Actions. All you need is to create the workflow file:

.github/workflows/fly.yaml
name: Fly Deploy
on:
  push:
    branches:
      - main
  workflow_dispatch:

jobs:
  deploy:
    name: Deploy app
    runs-on: ubuntu-latest
    concurrency: deploy-group
    steps:
      - uses: actions/checkout@v4
      - uses: superfly/flyctl-actions/setup-flyctl@master
      - run: flyctl deploy --remote-only
        env:
          FLY_API_TOKEN: ${{ secrets.FLY_API_TOKEN }}

To get this to work, you’ll need to create a deploy token from your app’s dashboard. Then, in your GitHub repo, create a new repository secret called FLY_API_TOKEN with the value of your deploy token. Now, whenever you push to the main branch, this workflow will automatically run and deploy your app. You can also manually run the workflow from GitHub because of the workflow_dispatch option.

End

As always, all the code is available on GitHub. Originally, this post was just about deploying to Fly.io, but along the way, I kept adding on more stuff until it essentially became my version of the user manager example app. Anyway, hope this post provided a good view into web development with Clojure. As a bonus, here are some additional resources on deploying Clojure apps:


  1. The way Fly.io works under the hood is pretty clever. Instead of running the container image with a runtime like Docker, the image is unpacked and “loaded” into a VM. See this video explanation for more details. ↩︎

  2. If you’re interested in learning Clojure, my recommendation is to follow the official getting started guide and join the Clojurians Slack. Also, read through this list of introductory resources. ↩︎

  3. Kit was a big influence on me when I first started learning web development in Clojure. I never used it directly, but I did use their library choices and project structure as a base for my own projects. ↩︎

  4. There’s no “Rails” for the Clojure ecosystem (yet?). The prevailing opinion is to build your own “framework” by composing different libraries together. Most of these libraries are stable and are already used in production by big companies, so don’t let this discourage you from doing web development in Clojure! ↩︎

  5. There might be some keys that you add or remove, but the structure of the config file stays the same. ↩︎

  6. “assoc” (associate) is a Clojure slang that means to add or update a key-value pair in a map. ↩︎

  7. For more details on how basic authentication works, check out the specification. ↩︎

  8. Here’s a cool resource I found when researching Java Dockerfiles: WhichJDK. It provides a comprehensive comparison of the different JDKs available and recommendations on which one you should use. ↩︎

  9. Another (non-technically important) argument for live/production REPLs is just because it’s cool. Ever since I read the story about NASA’s programmers debugging a spacecraft through a live REPL, I’ve always wanted to try it at least once. ↩︎

  10. If you encounter errors related to WireGuard when running fly proxy, you can run fly doctor, which will hopefully detect issues with your local setup and also suggest fixes for them. ↩︎


Advent of Code 2024 in Zig

This post is about seven months late, but here are my takeaways from Advent of Code 2024. It was my second time participating, and this time I actually managed to complete it. [1] My goal was to learn a new language, Zig, and to improve my DSA and problem-solving skills.

If you’re not familiar, Advent of Code is an annual programming challenge that runs every December. A new puzzle is released each day from December 1st to the 25th. There’s also a global leaderboard where people (and AI) race to get the fastest solves, but I personally don’t compete in it, mostly because I want to do it at my own pace.

I went with Zig because I have been curious about it for a while, mainly because of its promise of being a better C and because TigerBeetle (one of the coolest databases now) is written in it. Learning Zig felt like a good way to get back into systems programming, something I’ve been wanting to do after a couple of chaotic years of web development.

This post is mostly about my setup, results, and the things I learned from solving the puzzles. If you’re more interested in my solutions, I’ve also uploaded my code and solution write-ups to my GitHub repository.

My Advent of Code results page

Project Setup

There were several Advent of Code templates in Zig that I looked at as a reference for my development setup, but none of them really clicked with me. I ended up just running my solutions directly using zig run for the whole event. It wasn’t until after the event ended that I properly learned Zig’s build system and reorganised my project.

Here’s what the project structure looks like now:

.
├── src
│   ├── days
│   │   ├── data
│   │   │   ├── day01.txt
│   │   │   ├── day02.txt
│   │   │   └── ...
│   │   ├── day01.zig
│   │   ├── day02.zig
│   │   └── ...
│   ├── bench.zig
│   └── run.zig
└── build.zig

The project is powered by build.zig, which defines several commands:

  1. Build
    • zig build - Builds all of the binaries for all optimisation modes.
  2. Run
    • zig build run - Runs all solutions sequentially.
    • zig build run -Day=XX - Runs the solution of the specified day only.
  3. Benchmark
    • zig build bench - Runs all benchmarks sequentially.
    • zig build bench -Day=XX - Runs the benchmark of the specified day only.
  4. Test
    • zig build test - Runs all tests sequentially.
    • zig build test -Day=XX - Runs the tests of the specified day only.

You can also pass the optimisation mode that you want to any of the commands above with the -Doptimize flag.

Under the hood, build.zig compiles src/run.zig when you call zig build run, and src/bench.zig when you call zig build bench. These files are templates that import the solution for a specific day from src/days/dayXX.zig. For example, here’s what src/run.zig looks like:

src/run.zig
const std = @import("std");
const puzzle = @import("day"); // Injected by build.zig

pub fn main() !void {
    var arena = std.heap.ArenaAllocator.init(std.heap.page_allocator);
    defer arena.deinit();
    const allocator = arena.allocator();

    std.debug.print("{s}\n", .{puzzle.title});
    _ = try puzzle.run(allocator, true);
    std.debug.print("\n", .{});
}

The day module is an anonymous import injected dynamically by build.zig during compilation. This lets a single run.zig or bench.zig be reused for every solution, avoiding boilerplate in the solution files. Here’s a simplified version of my build.zig that shows how this works:

build.zig
const std = @import("std");

pub fn build(b: *std.Build) void {
    const target = b.standardTargetOptions(.{});
    const optimize = b.standardOptimizeOption(.{});

    const run_all = b.step("run", "Run all days");
    const day_option = b.option(usize, "ay", ""); // The `-Day` option

    // Generate build targets for all 25 days.
    for (1..26) |day| {
        const day_zig_file = b.path(b.fmt("src/days/day{d:0>2}.zig", .{day}));

        // Create an executable for running this specific day.
        const run_exe = b.addExecutable(.{
            .name = b.fmt("run-day{d:0>2}", .{day}),
            .root_source_file = b.path("src/run.zig"),
            .target = target,
            .optimize = optimize,
        });

        // Inject the day-specific solution file as the anonymous module `day`.
        run_exe.root_module.addAnonymousImport("day", .{ .root_source_file = day_zig_file });

        // Install the executable so it can be run.
        b.installArtifact(run_exe);

        // ...
    }
}

My actual build.zig has some extra code that builds the binaries for all optimisation modes.

This setup is pretty barebones. I’ve seen other templates do cool things like scaffold files, download puzzle inputs, and even submit answers automatically. Since I wrote my build.zig after the event ended, I didn’t get to use it while solving the puzzles. I might add these features to it if I decide to do Advent of Code again this year with Zig.

Self-Imposed Constraints

While there are no rules to Advent of Code itself, to make things a little more interesting, I set a few constraints and rules for myself:

  1. The code must be readable. By “readable”, I mean the code should be straightforward and easy to follow. No unnecessary abstractions. I should be able to come back to the code months later and still understand (most of) it.
  2. Solutions must be a single file. No external dependencies. No shared utilities module. Everything needed to solve the puzzle should be visible in that one solution file.
  3. The total runtime must be under one second. [2] All solutions, when run sequentially, should finish in under one second. I want to improve my performance engineering skills.
  4. Parts should be solved separately. This means: (1) no solving both parts simultaneously, and (2) no doing extra work in part one that makes part two faster. The aim of this is to get a clear idea of how long each part takes on its own.
  5. No concurrency or parallelism. Solutions must run sequentially on a single thread. This keeps the focus on the efficiency of the algorithm. I can’t speed up slow solutions by using multiple CPU cores.
  6. No ChatGPT. No Claude. No AI help. I want to train myself, not the LLM. I can look at other people’s solutions, but only after I have given my best effort at solving the problem.
  7. Follow the constraints of the input file. The solution doesn’t have to work for all possible scenarios, but it should work for all valid inputs. If the input file only contains 8-bit unsigned integers, the solution doesn’t have to handle larger integer types.
  8. Hardcoding is allowed. For example: size of the input, number of rows and columns, etc. Since the input is known at compile-time, we can skip runtime parsing and just embed it into the program using Zig’s @embedFile.

Most of these constraints are designed to push me to write clearer, more performant code. I also wanted my code to look like it was taken straight from TigerBeetle’s codebase (minus the assertions). [3] Lastly, I just thought it would make the experience more fun.

Favourite Puzzles

From all of the puzzles, here are my top 3 favourites:

  1. Day 6: Guard Gallivant - This is my slowest day (in benchmarks), but also the one I learned the most from. Some of these learnings include: using vectors to represent directions, padding 2D grids, metadata packing, system endianness, etc.
  2. Day 17: Chronospatial Computer - I love reverse engineering puzzles. I used to do a lot of these in CTFs during my university days. The best thing I learned from this day is the realisation that we can use different integer bases to optimise data representation. This helped improve my runtimes in the later days 22 and 23.
  3. Day 21: Keypad Conundrum - This one was fun. My gut told me that it could be solved greedily by always choosing the best move. It was right. Though I did have to scroll Reddit for a bit to figure out the step I was missing, which was that you have to visit the farthest keypads first. This is also my longest solution file (almost 400 lines) because I hardcoded the best-moves table.

Honourable mention:

  1. Day 24: Crossed Wires - Another reverse engineering puzzle. Confession: I didn’t solve this myself during the event. After 23 brutal days, my brain was too tired, so I copied a random Python solution from Reddit. When I retried it later, it turned out to be pretty fun. I still couldn’t find a solution I was satisfied with though.

Programming Patterns and Zig Tricks

During the event, I learned a lot about Zig and performance, and also developed some personal coding conventions. Some of these are Zig-specific, but most are universal and can be applied across languages. This section covers general programming and Zig patterns I found useful. The next section will focus on performance-related tips.

Comptime

Zig’s flagship feature, comptime, is surprisingly useful. I knew Zig uses it for generics and that people do clever metaprogramming with it, but I didn’t expect to be using it so often myself.

My main use for comptime was to generate puzzle-specific types. All my solution files follow the same structure, with a DayXX function that takes some parameters (usually the input length) and returns a puzzle-specific type, e.g.:

src/days/day01.zig
fn Day01(comptime length: usize) type {
    return struct {
        const Self = @This();
        
        left: [length]u32 = undefined,
        right: [length]u32 = undefined,

        fn init(input: []const u8) !Self {}

        // ...
    };
}

This lets me instantiate the type with a size that matches my input:

src/days/day01.zig
// Here, `Day01` is called with the size of my actual input.
pub fn run(_: std.mem.Allocator, is_run: bool) ![3]u64 {
    // ...
    const input = @embedFile("./data/day01.txt");
    var puzzle = try Day01(1000).init(input);
    // ...
}

// Here, `Day01` is called with the size of my test input.
test "day 01 part 1 sample 1" {
    var puzzle = try Day01(6).init(sample_input);
    // ...
}

This allows me to reuse logic across different inputs while still hardcoding the array sizes. Without comptime, I’d have to either write a separate function for each input or dynamically allocate memory, since I couldn’t hardcode the array sizes.

I also used comptime to shift some computation to compile-time to reduce runtime overhead. For example, on day 4, I needed a function to check whether a string matches either "XMAS" or its reverse, "SAMX". A pretty simple function that you can write as a one-liner in Python:

example.py
def matches(pattern, target):
    return target == pattern or target == pattern[::-1]

Typically, a function like this requires some dynamic allocation to create the reversed string, since the length of the string is only known at runtime. [4] For this puzzle, since the words to reverse are known at compile-time, we can do something like this:

src/days/day04.zig
fn matches(comptime word: []const u8, slice: []const u8) bool {
    var reversed: [word.len]u8 = undefined;
    @memcpy(&reversed, word);
    std.mem.reverse(u8, &reversed);
    return std.mem.eql(u8, word, slice) or std.mem.eql(u8, &reversed, slice);
}

This creates a separate function for each word I want to reverse. [5] Each function has an array with the same size as the word to reverse. This removes the need for dynamic allocation and makes the code run faster. As a bonus, Zig also checks that the word is compile-time known, so you get an immediate compile error if you pass in a runtime value.

Optional Types

A common pattern in C is to return special sentinel values to denote missing values or errors, e.g. -1, 0, or NULL. In fact, I did this on day 13 of the challenge:

src/days/day13.zig
// We won't ever get 0 as a result, so we use it as a sentinel error value.
fn count_tokens(a: [2]u8, b: [2]u8, p: [2]i64) u64 {
    const numerator = @abs(p[0] * b[1] - p[1] * b[0]);
    const denominator = @abs(@as(i32, a[0]) * b[1] - @as(i32, a[1]) * b[0]);
    return if (numerator % denominator != 0) 0 else numerator / denominator;
}

// Then in the caller, skip if the return value is 0.
if (count_tokens(a, b, p) == 0) continue;

This works, but it’s easy to forget to check for those values, or worse, to accidentally treat them as valid results. Zig improves on this with optional types. If a function might not return a value, you can return ?T instead of T. This also forces the caller to handle the null case. Unlike C, null isn’t a pointer but a more general concept. Zig treats null as the absence of a value for any type, just like Rust’s Option<T>.

The count_tokens function can be refactored to:

src/days/day13.zig
// Return null instead if there's no valid result.
fn count_tokens(a: [2]u8, b: [2]u8, p: [2]i64) ?u64 {
    const numerator = @abs(p[0] * b[1] - p[1] * b[0]);
    const denominator = @abs(@as(i32, a[0]) * b[1] - @as(i32, a[1]) * b[0]);
    return if (numerator % denominator != 0) null else numerator / denominator;
}

// The caller is now forced to handle the null case.
if (count_tokens(a, b, p)) |n_tokens| {
    // logic only runs when n_tokens is not null.
}

Zig also has a concept of error unions, where a function can return either a value or an error; in Rust, this is Result<T, E>. You could also use error unions instead of optionals for count_tokens; Zig doesn’t force a single approach. I come from Clojure, where returning nil for an error or a missing value is common.
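The same move away from sentinel values can be sketched in Python, where None plays the role of Zig's null (simplified scalar arguments instead of the puzzle's vectors):

```python
def count_tokens(numerator, denominator):
    # Return None instead of a sentinel 0 when there's no valid result.
    if numerator % denominator != 0:
        return None
    return numerator // denominator

# The caller is forced to deal with the missing case explicitly.
assert count_tokens(10, 5) == 2
assert count_tokens(10, 3) is None
```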

Grid Padding

This year had a lot of 2D grid puzzles (arguably too many). A common feature of grid-based algorithms is the out-of-bounds check. Here’s what it usually looks like:

example.zig
const visited = 'X'; // Sentinel byte marking visited tiles

fn dfs(map: [][]u8, position: [2]i8) u32 {
    const x, const y = position;

    // Bounds check here.
    if (x < 0 or y < 0 or x >= map.len or y >= map[0].len) return 0;

    if (map[x][y] == visited) return 0;
    map[x][y] = visited;

    var result: u32 = 1;
    for (directions) |direction| {
        result += dfs(map, position + direction);
    }
    return result;
}

This is a typical recursive DFS function. After doing a lot of this, I discovered a nice trick that not only improves code readability, but also its performance. The trick here is to pad the grid with sentinel characters that mark out-of-bounds areas, i.e. add a border to the grid.

Here’s an example from day 6:

Original map:               With borders added:
                            ************
....#.....                  *....#.....*
.........#                  *.........#*
..........                  *..........*
..#.......                  *..#.......*
.......#..        ->        *.......#..*
..........                  *..........*
.#..^.....                  *.#..^.....*
........#.                  *........#.*
#.........                  *#.........*
......#...                  *......#...*
                            ************

You can use any value for the border, as long as it doesn’t conflict with valid values in the grid. With the border in place, the bounds check becomes a simple equality comparison:

example.zig
const border = '*';

fn dfs(map: [][]u8, position: [2]i8) u32 {
    const x, const y = position;
    if (map[x][y] == border) { // We are out of bounds
        return 0;
    }
    // ...
}

This is much more readable than the previous code. Plus, it’s also faster since we’re only doing one equality check instead of four range checks.

That said, this isn’t a one-size-fits-all solution. It only works for algorithms that traverse the grid one step at a time. If your logic jumps multiple tiles, it can still go out of bounds (unless you widen the border to account for this). This approach also uses a bit more memory than the regular approach, as you have to store the extra border characters.
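The trick itself is language-agnostic. Here's a minimal Python sketch of adding the border (toy grid, hypothetical pad_grid helper):

```python
def pad_grid(grid, border="*"):
    # Surround the grid with a one-tile border of sentinel characters.
    width = len(grid[0]) + 2
    padded = [border * width]
    for row in grid:
        padded.append(border + row + border)
    padded.append(border * width)
    return padded

grid = ["..#", ".^.", "..."]
padded = pad_grid(grid)
assert padded[0] == "*****"
assert padded[1] == "*..#*"
assert padded[2][2] == "^"  # original (1, 1) shifted by one
```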

SIMD Vectors

This could also go in the performance section, but I’m including it here because the biggest benefit I get from using SIMD in Zig is the improved code readability. Because Zig has first-class support for vector types, you can write elegant and readable code that also happens to be faster.

If you’re not familiar with vectors, they are a special collection type used for single instruction, multiple data (SIMD) operations. SIMD lets you perform a computation on multiple values in parallel using a single CPU instruction, which often leads to a nice performance boost. [6]

I mostly use vectors to represent positions and directions, e.g. for traversing a grid. Instead of writing code like this:

example.zig
next_position = .{ position[0] + direction[0], position[1] + direction[1] };

You can represent position and direction as 2-element vectors and write code like this:

example.zig
next_position = position + direction;

This is much nicer than the previous version!

Day 25 is another good example of a problem that can be solved elegantly using vectors:

src/days/day25.zig
var result: u64 = 0;
for (self.locks.items) |lock| { // lock is a vector
    for (self.keys.items) |key| { // key is also a vector
        const fitted = lock + key > @as(@Vector(5, u8), @splat(5));
        const is_overlap = @reduce(.Or, fitted);
        result += @intFromBool(!is_overlap);
    }
}

Expressing the logic as vector operations makes the code cleaner since you don’t have to write loops and conditionals as you typically would in a traditional approach.

Performance Tips

The tips below are general performance techniques that often help, but like most things in software engineering, “it depends”. These might work 80% of the time, but performance is often highly context-specific. You should benchmark your code instead of blindly following what other people say.

This section would’ve been more fun with concrete examples, step-by-step optimisations, and benchmarks, but that would’ve made the post way too long. Hopefully, I’ll get to write something like that in the future. [7]

Minimise Allocations

Whenever possible, prefer static allocation. Static allocation is cheaper since it just involves moving the stack pointer vs dynamic allocation which has more overhead from the allocator machinery. That said, it’s not always the right choice since it has some limitations, e.g. stack size is limited, memory size must be compile-time known, its lifetime is tied to the current stack frame, etc.

If you need to do dynamic allocations, try to reduce the number of times you call the allocator. The number of allocations you do matters more than the amount of memory you allocate. More allocations mean more bookkeeping, synchronisation, and sometimes syscalls.

A simple but effective way to reduce allocations is to reuse buffers, whether they’re statically or dynamically allocated. Here’s an example from day 10. For each trail head, we want to create a set of trail ends reachable from it. The naive approach is to allocate a new set every iteration:

src/days/day10.zig
for (self.trail_heads.items) |trail_head| {
    var trail_ends = std.AutoHashMap([2]u8, void).init(self.allocator);
    defer trail_ends.deinit();
    
    // Set building logic...
}

What you can do instead is to allocate the set once before the loop. Then, each iteration, you reuse the set by emptying it without freeing the memory. For Zig’s std.AutoHashMap, this can be done using the clearRetainingCapacity method:

src/days/day10.zig
var trail_ends = std.AutoHashMap([2]u8, void).init(self.allocator);
defer trail_ends.deinit();

for (self.trail_heads.items) |trail_head| {
    trail_ends.clearRetainingCapacity();
    
    // Set building logic...
}

If you use static arrays, you can also just overwrite existing data instead of clearing it.

A step up from this is to reuse multiple buffers. The simplest form of this is to reuse two buffers, i.e. double buffering. Here’s an example from day 11:

src/days/day11.zig
// Initialise two hash maps that we'll alternate between.
var frequencies: [2]std.AutoHashMap(u64, u64) = undefined;
for (0..2) |i| frequencies[i] = std.AutoHashMap(u64, u64).init(self.allocator);
defer for (0..2) |i| frequencies[i].deinit();

var id: usize = 0;
for (self.stones) |stone| try frequencies[id].put(stone, 1);

for (0..n_blinks) |_| {
    var old_frequencies = &frequencies[id % 2];
    var new_frequencies = &frequencies[(id + 1) % 2];
    id += 1;

    defer old_frequencies.clearRetainingCapacity();

    // Do stuff with both maps...
}

Here we have two maps to count the frequencies of stones across iterations. Each iteration will build up new_frequencies with the values from old_frequencies. Doing this reduces the number of allocations to just 2 (the number of buffers). The tradeoff here is that it makes the code slightly more complex.
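The double-buffering pattern translates directly to other languages. Here's a Python sketch, with a made-up doubling rule standing in for the actual stone transformations:

```python
# Two hash maps reused across iterations instead of allocating a new one each pass.
frequencies = [{1: 1, 10: 1}, {}]  # made-up initial counts

for step in range(3):
    old = frequencies[step % 2]
    new = frequencies[(step + 1) % 2]
    for value, count in old.items():
        # Made-up transformation standing in for the real stone rules.
        new[value * 2] = new.get(value * 2, 0) + count
    old.clear()  # empty the buffer for reuse; no reallocation

result = frequencies[3 % 2]
assert result == {8: 1, 80: 1}
```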

Make Your Data Smaller

A common piece of performance advice is to have “mechanical sympathy”: understand how your code is processed by your computer. An example of this is structuring your data so it works better with your CPU. For example, keep related data close in memory to take advantage of cache locality.

Reducing the size of your data helps with this. Smaller data means more of it can fit in cache. One way to shrink your data is through bit packing. This depends heavily on your specific data, so you’ll need to use your judgement to tell whether this would work for you. I’ll just share some examples that worked for me.

The first example is in day 6 part two, where you have to detect a loop, which happens when you revisit a tile from the same direction as before. To track this, you could use a map or a set to store the tiles and visited directions. A more efficient option is to store this direction metadata in the tile itself.

There are only four tile types, which means you only need two bits to represent the tile types as an enum. If the enum size is one byte, here’s what the tiles look like in memory:

.obstacle -> 00000000
.path     -> 00000001
.visited  -> 00000010
.exit     -> 00000011

As you can see, the upper six bits are unused. We can store the direction metadata in the upper four bits. One bit for each direction. If a bit is set, it means that we’ve already visited the tile in this direction. Here’s an illustration of the memory layout:

        direction metadata   tile type
           ┌─────┴─────┐   ┌─────┴─────┐
┌────────┬─┴─┬───┬───┬─┴─┬─┴─┬───┬───┬─┴─┐
│ Tile:  │ 1 │ 0 │ 0 │ 0 │ 0 │ 0 │ 1 │ 0 │
└────────┴─┬─┴─┬─┴─┬─┴─┬─┴───┴───┴───┴───┘
   up bit ─┘   │   │   └─ left bit
    right bit ─┘   └─ down bit

If your language supports struct packing, you can express this layout directly: [8]

src/days/day06.zig
const Tile = packed struct(u8) {
    const TileType = enum(u4) { obstacle, path, visited, exit };

    up: u1 = 0,
    right: u1 = 0,
    down: u1 = 0,
    left: u1 = 0,
    tile: TileType,

    // ...
};

Doing this avoids extra allocations and improves cache locality. Since the direction metadata is colocated with the tile type, all of it can fit together in cache. Accessing the directions just requires some bitwise operations instead of a fetch from another region of memory.
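In languages without packed structs, the same layout can be built by hand with shifts and masks. Here's a Python sketch assuming the struct's bit positions (direction flags in the low nibble, tile type in the upper nibble):

```python
# Assumed layout mirroring the packed struct above: direction flags in the
# low four bits, tile type in the upper four bits.
UP, RIGHT, DOWN, LEFT = 1 << 0, 1 << 1, 1 << 2, 1 << 3
PATH = 1 << 4  # made-up tile-type encoding in the upper nibble

tile = PATH   # a path tile, not yet visited
tile |= UP    # mark it as visited from above

def has_visited(tile, mask):
    return tile & mask == mask

assert has_visited(tile, UP)
assert not has_visited(tile, DOWN)
```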

Another way to do this is to represent your data using alternate number bases. Here’s an example from day 23. Computers are represented as two-character strings made up of only lowercase letters, e.g. "bc", "xy", etc. Instead of storing this as a [2]u8 array, you can convert it into a base-26 number and store it as a u16. [9]

Here’s the idea: map 'a' to 0, 'b' to 1, up to 'z' as 25. Each character in the string becomes a digit in a base-26 number. For example, "bc" ([2]u8{ 'b', 'c' }) has the base-26 digits 1 and 2 ('b' = 1, 'c' = 2), which is the base-10 number 28 (1×26 + 2 = 28).
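As a sanity check, here's the encoding in Python (hypothetical encode helper):

```python
def encode(name):
    # Map 'a'..'z' to 0..25 and treat the two characters as base-26 digits.
    hi = ord(name[0]) - ord("a")
    lo = ord(name[1]) - ord("a")
    return hi * 26 + lo  # always < 676, so it fits in a u16

assert encode("bc") == 28   # 1 * 26 + 2
assert encode("aa") == 0
assert encode("zz") == 675
```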

While they take the same amount of space (2 bytes), a u16 has some benefits over a [2]u8:

  1. It fits in a single register, whereas you need two for the array.
  2. Comparison is faster as there is only a single value to compare.

Reduce Branching

I won’t explain branchless programming here; Algorithmica explains it way better than I can. While modern compilers are often smart enough to compile away branches, they don’t catch everything. I still recommend writing branchless code whenever it makes sense. It also has the added benefit of reducing the number of codepaths in your program.

Again, since performance is very context-dependent, I’ll just show you some patterns I use. Here’s one that comes up often:

src/days/day02.zig
if (is_valid_report(report)) {
    result += 1;
}

Instead of the branch, cast the bool into an integer directly:

src/days/day02.zig
result += @intFromBool(is_valid_report(report));
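This bool-to-int cast exists in most languages. In Python, bool is already a subtype of int, so the same pattern looks like this (made-up data and validity rule):

```python
reports = [[1, 2, 3], [3, 1, 2], [5, 6, 7]]  # made-up data

def is_valid_report(report):
    # Made-up validity rule for illustration: strictly increasing levels.
    return all(a < b for a, b in zip(report, report[1:]))

result = 0
for report in reports:
    result += is_valid_report(report)  # True adds 1, False adds 0

assert result == 2
```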

Another example is from day 6 (again!). Recall that to know if a tile has been visited from a certain direction, we have to check its direction bit. Here’s one way to do it:

src/days/day06.zig
fn has_visited(tile: Tile, direction: Direction) bool {
    switch (direction) {
        .up => return tile.up == 1,
        .right => return tile.right == 1,
        .down => return tile.down == 1,
        .left => return tile.left == 1,
    }
}

This works, but it introduces a few branches. We can make it branchless using bitwise operations:

src/days/day06.zig
fn has_visited(tile: Tile, direction: Direction) bool {
    const int_tile = std.mem.nativeToBig(u8, @bitCast(tile));
    const mask = direction.mask();
    const bits = int_tile & 0x0f; // Keep only the four direction bits
    return bits & mask == mask;
}

While this is arguably cryptic and less readable, it does perform better than the switch version.

Avoid Recursion

The final performance tip is to prefer iterative code over recursion. Recursive functions bring the overhead of allocating stack frames. While recursive code is often more elegant, it’s also often slower unless the compiler can optimise it away, e.g. via tail-call optimisation. Zig doesn’t do this automatically, though you can request a guaranteed tail call explicitly with @call(.always_tail, ...).

Recursion also has the risk of causing a stack overflow if the execution isn’t bounded. This is why code that is mission- or safety-critical avoids recursion entirely. It’s in TigerBeetle’s TIGERSTYLE and also NASA’s Power of Ten.

Iterative code can be harder to write in some cases, e.g. DFS maps naturally to recursion, but most of the time it is significantly faster, more predictable, and safer than the recursive alternative.
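To make the rewrite concrete, here's a Python sketch of a flood-fill DFS converted to an explicit stack (toy bordered grid, in the spirit of the padding trick from earlier):

```python
def count_reachable(grid, start, border="*"):
    # An explicit stack replaces the call stack, so memory use is
    # bounded by the data structure rather than by recursion depth.
    stack = [start]
    visited = set()
    count = 0
    while stack:
        x, y = stack.pop()
        if (x, y) in visited or grid[x][y] == border:
            continue
        visited.add((x, y))
        count += 1
        for dx, dy in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            stack.append((x + dx, y + dy))
    return count

grid = ["****", "*..*", "*.**", "****"]
assert count_reachable(grid, (1, 1)) == 3
```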

Benchmarks

I ran benchmarks for all 25 solutions in each of Zig’s optimisation modes. You can find the full results and the benchmark script in my GitHub repository. All benchmarks were done on an Apple M3 Pro.

As expected, ReleaseFast produced the best result with a total runtime of 85.1 ms. I’m quite happy with this, considering the two constraints that limited the number of optimisations I can do to the code:

  • Parts should be solved separately - Some days can be solved in a single go, e.g. day 10 and day 13, which could’ve saved a few milliseconds.
  • No concurrency or parallelism - My slowest days are the compute-heavy days that are very easily parallelisable, e.g. day 6, day 19, and day 22. Without this constraint, I can probably reach sub-20 milliseconds total(?), but that’s for another time.

You can see the full benchmarks for ReleaseFast in the table below:

Day | Title                  | Parsing (µs) | Part 1 (µs) | Part 2 (µs) | Total (µs)
----|------------------------|--------------|-------------|-------------|-----------
  1 | Historian Hysteria     |         23.5 |        15.5 |         2.8 |       41.8
  2 | Red-Nosed Reports      |         42.9 |         0.0 |        11.5 |       54.4
  3 | Mull it Over           |          0.0 |         7.2 |        16.0 |       23.2
  4 | Ceres Search           |          5.9 |         0.0 |         0.0 |        5.9
  5 | Print Queue            |         22.3 |         0.0 |         4.6 |       26.9
  6 | Guard Gallivant        |         14.0 |        25.2 |    24,331.5 |   24,370.7
  7 | Bridge Repair          |         72.6 |       321.4 |     9,620.7 |   10,014.7
  8 | Resonant Collinearity  |          2.7 |         3.3 |        13.4 |       19.4
  9 | Disk Fragmenter        |          0.8 |        12.9 |       137.9 |      151.7
 10 | Hoof It                |          2.2 |        29.9 |        27.8 |       59.9
 11 | Plutonian Pebbles      |          0.1 |        43.8 |     2,115.2 |    2,159.1
 12 | Garden Groups          |          6.8 |       164.4 |       249.0 |      420.3
 13 | Claw Contraption       |         14.7 |         0.0 |         0.0 |       14.7
 14 | Restroom Redoubt       |         13.7 |         0.0 |         0.0 |       13.7
 15 | Warehouse Woes         |         14.6 |       228.5 |       458.3 |      701.5
 16 | Reindeer Maze          |         12.6 |     2,480.8 |     9,010.7 |   11,504.1
 17 | Chronospatial Computer |          0.1 |         0.2 |        44.5 |       44.8
 18 | RAM Run                |         35.6 |        15.8 |        33.8 |       85.2
 19 | Linen Layout           |         10.7 |    11,890.8 |    11,908.7 |   23,810.2
 20 | Race Condition         |         48.7 |        54.5 |        54.2 |      157.4
 21 | Keypad Conundrum       |          0.0 |         1.7 |        22.4 |       24.2
 22 | Monkey Market          |         20.7 |         0.0 |    11,227.7 |   11,248.4
 23 | LAN Party              |         13.6 |        22.0 |         2.5 |       38.2
 24 | Crossed Wires          |          5.0 |        41.3 |        14.3 |       60.7
 25 | Code Chronicle         |         24.9 |         0.0 |         0.0 |       24.9

A weird thing I found when benchmarking is that for day 6 part two, ReleaseSafe actually ran faster than ReleaseFast (13,189.0 µs vs 24,370.7 µs). Their outputs are the same, but for some reason, ReleaseSafe is faster even with the safety checks still intact.

The Zig compiler is still very much a moving target, so I don’t want to dig too deep into this, as I’m guessing this might be a bug in the compiler. This weird behaviour might just disappear after a few compiler version updates.

Reflections

Looking back, I’m really glad I decided to do Advent of Code and followed through to the end. I learned a lot of things. Some are useful in my professional work, some are more like random bits of trivia. Going with Zig was a good choice too. The language is small, simple, and gets out of your way. I learned more about algorithms and concepts than the language itself.

Besides what I’ve already mentioned earlier, here are some examples of the things I learned:

Some of my self-imposed constraints and rules ended up being helpful. I can still (mostly) understand the code I wrote a few months ago. Putting all of the code in a single file made it easier to read since I don’t have to context switch to other files all the time.

However, some of them did backfire a bit, e.g. the two constraints that limit how I can optimise my code. Another one is the “hardcoding allowed” rule. I used a lot of magic numbers, which helped improve performance, but I didn’t document them, so after a while, I didn’t even remember how I got them. I’ve since gone back and added explanations in my write-ups, but next time I’ll remember to at least leave comments.

One constraint I’ll probably remove next time is the no concurrency rule. It’s the biggest contributor to the total runtime of my solutions. I don’t do a lot of concurrent programming, even though my main language at work is Go, so next time it might be a good idea to use Advent of Code to level up my concurrency skills.

I also spent way more time on these puzzles than I originally expected. I optimised and rewrote my code multiple times. I also rewrote my write-ups a few times to make them easier to read. This is by far my longest side project yet. It’s a lot of fun, but it also takes a lot of time and effort. I almost gave up on the write-ups (and this blog post) because I don’t want to explain my awful day 15 and day 16 code. I ended up taking a break for a few months before finishing it, which is why this post is published in August lol.

Just for fun, here’s a photo of some of my notebook sketches that helped me visualise my solutions. See if you can guess which days these are from:

Photos of my notebook sketches

What’s Next?

So… would I do it again? Probably, though I’m not making any promises. If I do join this year, I’ll probably stick with Zig. I had my eyes on Zig since the start of 2024, so Advent of Code was the perfect excuse to learn it. This year, there aren’t any languages in particular that caught my eye, so I’ll just keep using Zig, especially since I have a proper setup ready.

If you haven’t tried Advent of Code, I highly recommend checking it out this year. It’s a great excuse to learn a new language, improve your problem-solving skills, or just learn something new. If you’re eager, you can also do the previous years’ puzzles as they’re still available.

One of the best aspects of Advent of Code is the community. The Advent of Code subreddit is a great place for discussion. You can ask questions and also see other people’s solutions. Some people also post really cool visualisations like this one. They also have memes!


  1. I failed my first attempt horribly with Clojure during Advent of Code 2023. Once I reached the later half of the event, I just couldn’t solve the problems with a purely functional style. I could’ve pushed through using imperative code, but I stubbornly chose not to and gave up… ↩︎

  2. The original constraint was that each solution must run in under one second. As it turned out, the code was faster than I expected, so I increased the difficulty. ↩︎

  3. TigerBeetle’s code quality and engineering principles are just wonderful. ↩︎

  4. You can implement this function without any allocation by mutating the string in place or by iterating over it twice, which is probably faster than my current implementation. I kept it as-is as a reminder of what comptime can do. ↩︎

  5. As a bonus, I was curious as to what this looks like compiled, so I listed all the functions in this binary in GDB and found:

    72: static bool day04.Day04(140).matches__anon_19741;
    72: static bool day04.Day04(140).matches__anon_19750;

    It does generate separate functions! ↩︎

  6. Well, not always. The number of SIMD instructions depends on the machine’s native SIMD size. If the length of the vector exceeds it, Zig will compile it into multiple SIMD instructions. ↩︎

  7. Here’s a nice post on optimising day 9’s solution with Rust. It’s a good read if you’re into performance engineering or Rust techniques. ↩︎

  8. One thing about packed structs is that their layout is dependent on the system endianness. Most modern systems are little-endian, so the memory layout I showed is actually reversed. Thankfully, Zig has some useful functions to convert between endianness like std.mem.nativeToBig, which makes working with packed structs easier. ↩︎

  9. Technically, you can store 2-digit base-26 numbers in a u10, as there are only 26² = 676 possible values. Most systems pad values to byte size, so a u10 will still be stored as a u16, which is why I just went straight for it. ↩︎

Permalink

Swish - Clojure-like Lisp for Swift Video Series

Since February of 2026, I’ve been publishing a series of videos on implementing Swish, a Clojure-like Lisp in Swift using Claude Code. You can find it here.

Swish Logo

I used Common Lisp for a solid year in college, and really enjoyed it. I always wanted to use it professionally, but never really had the chance.

Permalink

Giving LLMs a Formal Reasoning Engine for Code Analysis

LLM coding assistants continue to become more capable at writing code, but they have an inherent weakness when it comes to reasoning about code structure. Worse, they assemble their picture of the code by grepping through source files and reconstructing call chains in an ad hoc fashion. This works for simple questions, but it quickly falls apart for transitive ones such as "Can user input reach this SQL query through any chain of calls?" or "What's all the dead code in this module?" Such questions require exhaustive structural analysis that simply can't be accomplished with pattern matching.

Chiasmus is an MCP server that aims to address this problem by giving LLMs access to formal reasoning engines, bundling Z3 for constraint solving and Tau Prolog for logic programming. Source files are parsed with tree-sitter and turned into formal facts, providing the LLM with a structured representation of the code along with a logic engine that can answer questions about it with certainty while using a fraction of the tokens.

The project is grounded in the neurosymbolic AI paradigm described by Sheth, Roy, and Gaur. The core idea is that AI systems benefit from combining neural networks (perception, language understanding) with symbolic knowledge-based approaches (reasoning, verification). LLMs are excellent at understanding what you're asking and generating plausible code, but they lack the ability to prove properties about that code. Symbolic solvers have that ability but can't understand natural language or navigate a codebase. Chiasmus bridges the two: the LLM handles perception (parsing your question, understanding context, filling templates), while the solvers handle cognition (exhaustive graph traversal, constraint satisfaction, logical inference).

The Problem with Grepping Through Source

When an LLM assistant needs to answer "what's the blast radius of changing lintSpec?", here's what typically happens:

Step 1: grep lintSpec src/**/*.ts
  → found in engine.ts (lintLoop) and mcp-server.ts (handleLint)

Step 2: grep lintLoop src/**/*.ts
  → called from solve() at lines 75 and 87

Step 3: grep handleSolve src/**/*.ts
  → called from createChiasmusServer switch...

Three rounds of tool calls, each consuming tokens for both the query and the response. At each step, the LLM has to reason about what it found and decide what to grep next. And after all that, it has still only traced part of the chain, missing paths through correctionLoop, runAnalysis, and several other transitive callers.

This isn't a failure of the LLM. It's a fundamental limitation of the approach. Grep finds string matches. Structural questions about code, such as reachability, dead code, cycles, and impact analysis, require graph traversal, which grep cannot do.
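To make the contrast concrete, here is a minimal Python sketch of the transitive reachability grep cannot compute. The call graph is a made-up example, not Chiasmus's actual representation:

```python
# Minimal sketch: cycle-safe transitive reachability over a call graph.
# The graph below is a made-up example, not Chiasmus's real data.
CALLS = {
    "handleRequest": ["validate"],
    "validate": ["processData"],
    "processData": ["dbQuery"],
    "eventHandler": ["processQueue"],
    "processQueue": ["eventHandler"],  # mutual recursion: a cycle
}

def reaches(a, b, visited=None):
    """True if `a` can reach `b` through any chain of calls."""
    if visited is None:
        visited = set()
    for callee in CALLS.get(a, []):
        if callee == b:
            return True
        if callee not in visited:
            visited.add(callee)  # prevents infinite loops on cycles
            if reaches(callee, b, visited):
                return True
    return False

print(reaches("handleRequest", "dbQuery"))  # True
print(reaches("eventHandler", "dbQuery"))   # False, even through the cycle
```

Grep can find each edge of this graph one string match at a time, but only the traversal itself answers the question exhaustively.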

How Chiasmus Works: Tree-sitter → Prolog → Formal Queries

Chiasmus takes a different approach. Instead of searching through text, it:

  1. Parses source files with tree-sitter into typed ASTs
  2. Walks the ASTs to extract structural facts: method definitions, call relationships, imports, exports
  3. Serializes these as Prolog facts, a declarative representation of the call graph
  4. Runs formal queries via the Prolog solver to answer structural questions

Step 1: Tree-sitter Parsing

Tree-sitter is an incremental parsing library that produces concrete syntax trees. Unlike regex-based tools, it understands the language grammar: it knows that foo() in const bar = () => { foo(); } is a call from bar to foo, not just a string that contains "foo". That lets it answer semantic questions about the symbol.

Chiasmus supports Python, Go, TypeScript, JavaScript, and Clojure out of the box, and provides adapters for other languages. When you pass source files to chiasmus_graph, the parser identifies method declarations (arrow_function and method_definition in TS/JS; defn and defn- in Clojure). Next, it resolves call expressions (call_expression → callee name), handling obj.method() → method, this.bar() → bar, db/query → query. It tracks scope, so it knows which routine is the caller at each call site, and it extracts imports and exports for cross-file resolution.

Step 2: Prolog Fact Generation

The extracted relationships become Prolog facts:

defines('src/formalize/validate.ts', lintSpec, routine, 16).
defines('src/formalize/engine.ts', lintLoop, routine, 208).
defines('src/formalize/engine.ts', solve, routine, 64).
defines('src/mcp-server.ts', handleLint, routine, 527).

calls(lintLoop, lintSpec).
calls(solve, lintLoop).
calls(handleLint, lintSpec).
calls(handleSolve, solve).
calls(correctionLoop, solve).

exports('src/formalize/validate.ts', lintSpec).

We now have a complete representation of the call graph: all subroutine definitions, call edges, and import relationships are encoded as ground facts that a Prolog engine can reason about.

Step 3: Built-in Rules for Structural Analysis

Alongside the facts, Chiasmus appends rules that enable the kinds of queries LLMs actually need. The most important of these is cycle-safe transitive reachability:

reaches(A, B) :- reaches(A, B, [A]).
reaches(A, B, _) :- calls(A, B).
reaches(A, B, Visited) :-
    calls(A, Mid),
    \+ member(Mid, Visited),
    reaches(Mid, B, [Mid|Visited]).

This rule says that A reaches B if A calls B directly, or if A calls some intermediate routine, not yet visited, that reaches B. The visited list prevents infinite loops on cyclic call graphs, which are a real concern in any codebase with mutual recursion or event loops. The solver can use this rule to answer transitive reachability over the entire call graph without any iterative grepping.

Step 4: Query Execution

Now the same "blast radius" question becomes a single tool call:

chiasmus_graph analysis="impact" target="lintSpec"
→ ["lintLoop", "handleLint", "solve", "correctionLoop",
   "handleVerify", "handleSolve", "handleGraph",
   "createChiasmusServer", "runAnalysis", "runAnalysisFromGraph"]

Above is the result of the Prolog solver having traversed every path in the call graph to collect all the methods that transitively call lintSpec. The LLM didn't need to know anything about the graph structure at all here.

What This Makes Possible That Grep Cannot

While efficiency is certainly nice, the real value lies in correctness. There are questions that grep fundamentally cannot answer, regardless of how many rounds you run:

Transitive Reachability

"Can user input reach the database query?" Answering this requires proving whether a path exists through potentially dozens of intermediate routines across multiple files. Grep can find direct callers, but tracing the full transitive closure requires the LLM to make decisions at each step about which paths to follow. It will miss branches, run out of context, and give you a best guess. That's why an agent can end up giving different answers when asked the same question repeatedly.

With Chiasmus:

chiasmus_graph analysis="reachability" from="handleRequest" to="dbQuery"
→ { reachable: true }

chiasmus_graph analysis="path" from="handleRequest" to="dbQuery"
→ { paths: ["[handleRequest,validate,processData,dbQuery]"] }

The solver explores every possible path. If it says "not reachable", that's a proof by exhaustion showing that there is no chain of calls from A to B in the entire graph.

Dead Code Detection

"Which routines are never called?" This is another question where answering with grep would necessitate checking every method definition against every call site in the codebase. Even for a project with around 100 subroutines, that's 100 grep calls at a minimum, and you'd still miss methods that are only called by other dead methods.

With Chiasmus:

chiasmus_graph analysis="dead-code"
→ ["unusedHelper", "legacyParser", "deprecatedValidator"]

One call. The Prolog rule is simple:

dead(Name) :-
    defines(_, Name, routine, _),
    \+ calls(_, Name),
    \+ entry_point(Name).

A routine is dead if it's defined, nobody calls it, and it's not an entry point. The solver trivially checks this against every node in the graph, exhaustively.
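The same rule is easy to picture in set terms. Here is a Python sketch over made-up fact tuples mirroring the Prolog rule above:

```python
# Sketch of the dead-code rule over Prolog-style facts (made-up data).
# A routine is dead if it's defined, never called, and not an entry point.
defines = {"main", "validate", "unusedHelper", "legacyParser"}  # routine names
calls = {("main", "validate")}                                  # (caller, callee) edges
entry_points = {"main"}

called = {callee for _caller, callee in calls}
dead = sorted(defines - called - entry_points)
print(dead)  # ['legacyParser', 'unusedHelper']
```

Like the Prolog rule, this flags only routines with no callers at all; routines reachable solely from other dead routines would need the reachability analysis on top.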

Cycle Detection

"Are there circular call dependencies?" is a question grep cannot answer at all: detecting cycles requires traversal.

chiasmus_graph analysis="cycles"
→ ["eventHandler", "processQueue", "dispatchEvent"]

The solver finds all nodes that can reach themselves through some chain of calls.

Impact Analysis

"What breaks if I change this method?" This is reverse transitive reachability: we need everything that transitively depends on the target. Grep can give you the direct callers, and then you'd have to iterate on each one exhaustively. Chiasmus gives you the full blast radius in one call.

chiasmus_graph analysis="impact" target="validate"
→ ["handleRequest", "batchProcessor", "main", "testHarness"]

Token Economics

Each grep call consumes tokens for the query, the response (which includes matching lines plus context), and the LLM's reasoning about what to do next. For a transitive question requiring N hops through the call graph, you end up with ~N tool calls × (query tokens + response tokens + reasoning tokens). For a 5-hop chain, that might be 5 calls × ~500 tokens = ~2,500 tokens, and that's assuming the LLM doesn't go down wrong paths. With Chiasmus, it's a single tool call × ~200 tokens plus a small JSON response. The heavy lifting happens in the Prolog solver, which runs locally and doesn't consume API tokens at all.

The savings compound with codebase size. In a 500-routine project, dead code detection via grep would require hundreds of calls. Via Chiasmus, it's still just one call.
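The estimates above work out as follows (the token counts are the article's rough figures, not measurements):

```python
# The rough token arithmetic from the paragraphs above (estimates, not measurements).
hops = 5
tokens_per_grep_round = 500       # query + response + reasoning, per round
grep_cost = hops * tokens_per_grep_round
chiasmus_cost = 200               # one tool call plus a small JSON response

print(grep_cost, chiasmus_cost)   # 2500 200
```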

Beyond Code: Mermaid Diagrams and Formal Verification

The same architecture handles more than source code. For example, Chiasmus can parse Mermaid diagrams directly into Prolog facts:

chiasmus_verify solver="prolog" format="mermaid"
  input="stateDiagram-v2
    [*] --> Idle
    Idle --> Processing : submit
    Processing --> Review : complete
    Review --> Approved : approve
    Review --> Processing : revise
    Approved --> [*]"
  query="can_reach(idle, approved)."
→ { status: "success", answers: [{}] }

If it can be expressed as a Mermaid graph, you can formally verify properties of it, be it an architecture diagram, a state machine from a design doc, or a workflow from a ticket. Can every state reach the terminal state? Are there dead-end states? Is there a cycle between review and processing? These all become one-line queries against a solver.

Of course, not all constraint problems can be usefully expressed as graphs. Chiasmus also provides Z3, an SMT solver that can prove properties over combinatorial spaces, for cases such as access control conflicts, configuration equivalence, or dependency resolution. "Can these RBAC rules ever produce contradictory allow/deny decisions?" isn't a question you can even begin to grep for. It requires exploring every possible combination of roles, actions, and resources. Z3 does this exhaustively and yields either a proof of consistency or a concrete counterexample.

The Neurosymbolic Advantage

The Neurosymbolic AI paper classifies systems by how tightly they couple neural and symbolic components. Chiasmus largely operates in Category 2(a), where the LLM identifies what formal analysis is needed and delegates execution to symbolic solvers. But it pushes toward Category 2(b) in several ways:

It provides enriched feedback loops: when the solver produces UNSAT, the specific conflicting assertions are fed back to the LLM as structured guidance. It tracks derivation traces, so that when Prolog proves a query, the trace of which rules fired gives the LLM an explanation of why the answer holds. Finally, Chiasmus supports template learning, extracting verification patterns that prove useful into reusable templates. The symbolic structure (a skeleton with typed slots) is learned organically from successful neural-symbolic interactions, creating a feedback loop where the system improves with use.

The practical consequence is that Chiasmus provides logically derived answers rather than probabilistic guesses based on pattern matching over training data. An answer here is a logical proof by exhaustion, derived from a formal representation of the call graph. The neural component understands the question, and the symbolic component provides the answer.

The Architecture

Chiasmus runs as an MCP server, and setup for Claude Code is one command:

claude mcp add chiasmus -- npx -y chiasmus

The server exposes the following tools:

  • chiasmus_graph: tree-sitter call graph analysis (callers, callees, reachability, dead-code, cycles, path, impact)
  • chiasmus_verify: submit formal logic to Z3 or Prolog solvers directly
  • chiasmus_craft: create reusable verification templates
  • chiasmus_formalize: find the right template for a problem
  • chiasmus_skills: search the template library
  • chiasmus_solve: end-to-end autonomous verification
  • chiasmus_learn: extract templates from verified solutions
  • chiasmus_lint: structural validation of formal specs

What Changes for the Developer

From the developer's perspective, the experience is subtle but significant. You ask your coding assistant a structural question, and instead of watching it grep through files for 30 seconds, it answers immediately with a complete, provably correct result. "What calls this method?" comes back with every transitive caller in the graph. "Is there dead code?" comes back with a definitive list, not "I checked a few files and didn't find any callers."

The LLM spends fewer tokens on exploration and more on the work you actually asked for. And when it tells you something about your code's structure, you can trust it because the answer comes from a solver, not a guess.

The project is open source at github.com/yogthos/chiasmus.

Permalink

Simple System + Rich Feedback

I saw someone posted in the Clojurians Slack about something Clojure has taught them:

I’ve come to see programming as:

1. building simple systems
2. and building nice feedback loops for interacting with those systems.

There is no three. I am so happy Clojure helps me with both.

—teodorlu

This is beautiful. It’s wonderful. And it’s a complete list, as far as I can tell.

I want to unpack these three points a little bit in the context of Clojure, partly to remind myself of them, but also to better understand why we feel that Clojure is such a good teacher. And, yes, there are three points in the quote, which I’ll get to.

Hickey Simplicity

Clojure’s creator, Rich Hickey, has influenced the industry’s understanding of what simplicity means. After Simple Made Easy, we think of simplicity as a function of how well decomposed a thing is. We pull apart a problem until we understand how it is made of subproblems. This is the simplicity of making a thing to solve a problem at the appropriate level of granularity and generality. General solutions tend to be simpler.

My favorite example of this kind of simplicity from Clojure is the atom. It found a common problem—sharing mutable state between threads—and it solved it in a very straightforward way: You get one mutable reference with a very constrained interface. In return, you get a strong but limited guarantee. Hickey found the problem, separated it from the rest of the problems, and solved it in a minimal and useful way. His achievement of simplicity is inspiring.

Incidentally, I believe the careful decomposition of problems is why Clojure’s parts seem to work so well together. For example, atoms love to work with immutable data structures and update. They’re each solving small, related problems—safely sharing values (immutable data), working with nested structures (update), and changing the value over time (atom). I believe they compose well together because they were decomposed by the designer himself.

Simple systems

The Slack message mentions “simple systems”. In systems theory, according to Donella Meadows, we get a different definition of simplicity. There, simplicity is about the number and nature of feedback loops, delays, and nonlinear causal graphs—in short, interactions within the system between the parts. If you have fewer loops and branches, the system’s behavior is easier to understand.

Again, I’ll use the example of the atom. In many languages, we would use an object with mutable state with methods meant to read and modify that state. We might make a whole new class just to represent the count of the number of words in a directory of text files. And if multiple files are being counted in parallel, we’d need thread-safe coding practices, probably locks, to make sure we counted correctly. But in Clojure, it’s just an atom. It feels to me that the causal chain is much shorter, perhaps because the atom itself is so reliable. Locks are reliable, too, if you get them right. But you have to take the lock, do your work, then release. You need a try/finally so you release reliably, even after a failure. There’s a lot to get right. With an atom, you just:

(swap! count inc)

Simplicity as clarity

Dieter Rams is lauded as a master of simplicity. Many people conflate simplicity with minimalism. But Rams insists it is about clarity, not minimalism. The volume knob changes the volume. The on-off switch is clearly a toggle. The extreme focus on clarity can breed the aesthetic of minimalism.

Clojure too focuses on clarity. The clarity of purpose of each special form—if, let, etc.—is part of this form of simplicity. So too are the functions in the core library. Though there are some outliers, most functions reveal their purpose plainly. And they are so plain that even though there are many functions, their docs are shown on a simple page. When I look at the Javadocs for the standard libraries, I see staggering obscurity. Each class seems like a world of its own, ready to be studied. Methods return yet other classes—more worlds to understand.

Feedback

Now let’s talk about feedback. Clojure excels at feedback. The obvious mechanism of feedback in Clojure is the REPL. The read-eval-print-loop is an interface between you and your code’s execution. A skilled programmer at the REPL will evaluate lots of expressions and subexpressions. You can recompile functions and run tests just as they would be executed in a production system.

But there are more subtle things that are easy to overlook. All of the literal data structures and atomic values like numbers, strings, keywords, and symbols, can be printed with no extra work. You can put a prn right in your program and print out the arguments to see them.

You can navigate the namespaces and perform reflection. The reflection works on Clojure (list all the vars in this namespace) and for JVM stuff (class, supers). These tools are a source of information about your code.

Clojure added tap> a few years ago. It’s a built-in pub/sub system used during development. Tools like Portal use it to get values from your running system and visualize them.

There is no three

The third point should be unpacked, too. “There is no three.” It’s a point stated in the negative. It implicitly excludes all of the things that aren’t listed above. It’s an invitation to abandon the bad habits you’ve picked up over the years. My bad habits include adding too many layers of indirection, trying to anticipate the future, and overmodeling. The rules to combat these have names like YAGNI (You Ain’t Gonna Need It) and DRY (Don’t Repeat Yourself). These are good practices that help you build simpler systems. But they’re all subsumed in the refocusing on the first two positive statements.

I think there’s value in enumerating the little rules of thumb this list leaves out, especially for beginners. As someone becomes an expert at something, the way they talk about their skill often sounds more and more abstract. “It’s just simplicity,” hides how hard simplicity is to achieve and even to understand. What sounds wise glosses over the thousands of details that you learned on the way up. That’s not to take away from the beauty of the expression. Just saying that these abstract expressions of what’s important leave a lot out.

That said, the reason the third item is so refreshing is that we’ve been taught at school and at work to code in a certain way. I was taught in Java to seek out the isA hierarchy inherent in a domain and to express it with a class hierarchy. It’s where we get the classic class Dog extends Animal. But it’s putting the cart before the horse. Yes, a dog is an animal. But is that relevant to my software? Saying “There is no three” gives me permission to stop and refocus on simplicity and feedback.

So thanks, Teodor, for sharing this. It’s a wonderful view into your progress as a programmer. You should be proud of all you’ve accomplished. I really like how you’ve boiled it down to these three ideas. It reminds me of how much I’ve learned from Clojure and how far I still have to go.

Permalink

Clojars Update for Q1 2026

Clojars Q1 2026 Update: Toby Crawley

This is an update on the work I’ve done maintaining Clojars in January through March 2026 with the ongoing support of Clojurists Together.

Most of my work on Clojars is reactive, based on issues reported through the community or noticed through monitoring. If you have any issues or questions about Clojars, you can find me in the #clojars channel on the Clojurians Slack, or you can file an issue on the main Clojars GitHub repository.

You can see the CHANGELOG for notable changes, and see all commits in the clojars-web and infrastructure repositories for this period. I also track my work over the years for Clojurists Together (and, before that, the Software Freedom Conservancy).

Below are some highlights for work done in January through March:

A note on 11 years of Clojars maintenance

I became the lead maintainer of Clojars a little over 11 years ago. I’ve done quite a bit of work on Clojars during that period, and have thoroughly enjoyed working on it & supporting the community! I greatly appreciate the support I’ve gotten from GitHub sponsors, the Software Freedom Conservancy, and Clojurists Together over the years. After all that, it’s time for a little break! I’m taking a few months away from Clojars (and computers in general) to go backpacking. I’m handing off lead maintenance to Daniel Compton, and it is in good hands!

Many thanks Toby for all your work - your contributions have made an immense difference to all of us! Have a great adventure - we’re looking forward to hearing all about it when you return.

Permalink

Clojure Deref (Apr 7, 2026)

Welcome to the Clojure Deref! This is a weekly link/news roundup for the Clojure ecosystem (feed: RSS).

Clojure/Conj 2026

September 30 – October 2, 2026
Charlotte Convention Center, Charlotte, NC

Join us for the largest gathering of Clojure developers in the world! Meet new people and reconnect with old friends. Enjoy two full days of talks, a day of workshops, social events, and more.

Early bird and group tickets are now on sale.

Is your company interested in sponsoring? Email us at clojure_conj@nubank.com.br to discuss opportunities.

Upcoming Events

Blogs, articles, and news

Libraries and Tools

Debut release

  • clj-android - A modernization of the clojure-android project.

  • plorer - cljfx/plorer helps you (or your coding agent) explore JavaFX application state in the REPL

  • xitdb-tsclj - Clojure flavored javascript using xitdb database

  • clj-mdns - Clojure wrapper around jmdns for mDNS service discovery

  • clj-oa3 - Clojure client library for OpenADR 3 (Martian HTTP, entity coercion, Malli schemas)

  • clj-oa3-client - Component lifecycle wrapper for clj-oa3 (MQTT, VEN registration, API delegation)

  • clj-gridx - Clojure client library for the GridX Pricing API

  • clj-midas - Clojure client library for the California Energy Commission’s MIDAS API

  • flux - Clojure wrapper for Netflix concurrency-limits — adaptive concurrency control based on TCP congestion algorithms.

  • ClojureProtegeIDE - GitHub - rururu/ClojureProtegeIDE

  • re-frame-query - Declarative data fetching and caching for re-frame inspired by tanstack query and redux toolkit query

  • codox-md - Codox writer that generates Markdown documentation for embedding in Clojure JARs

  • clj-doc-browse - Runtime classpath-based Markdown documentation browser for Clojure libraries

  • clj-doc-browse-el - Emacs package for browsing Clojure library docs from classpath JARs via CIDER

  • llx - Unified LLM API and agent runtime for Clojure, ClojureScript (and soon Clojure Dart)

  • baredom - BareDOM: Lightweight CLJS UI components built on web standards (Custom Elements, Shadow DOM, ES modules). No framework, just the DOM

  • ty-pocketledger - Demo app for ty web components over datastar that can be installed on mobile device

  • noumenon - Queryable knowledge graph for codebases — turns git history and LLM-analyzed source into a Datomic database that AI agents can query with Datalog.

  • lasagna-pattern - Match data with your pattern

  • rama-sail-graph - Demonstration of Rama and RDF4J SAIL API integration

  • clua - Sandboxed Lua 5.5 interpreter for Clojure/JVM

  • awesome-backseat-driver - Plugin marketplace for Clojure AI context in GitHub Copilot: agents, skills, and workflows for REPL-first interactive programming with Calva Backseat Driver

  • dexter - Dexter - Graphical Dependency Explorer

  • meme-clj - meme-clj — M-Expressions with Macro Expansion

  • xor-clj - Train neural network to imitate XOR operator using Clojure libpython-clj and Pytorch

  • mdq - A faithful port of Rust mdq, jq for markdown to Babashka.

  • once - BigConfig and ONCE

  • clj-format - A Clojure DSL for cl-format inspired by Hiccup. No dependencies. Drop-in compatibility. The power of FORMAT made easy.

  • infix - Readable Math and Data Processing for Clojure

  • ansatz - Dependently typed Clojure DSL with a Lean4 compatible kernel.

  • k7 - A high-performance disk-backed queue for Clojure

  • eido - Data-driven 2D & 3D graphics for Clojure — shapes, animation, lighting, and compositing from pure data

  • html2helix - Convert raw HTML to ClojureScript Helix syntax

Updates

Permalink

Clojure on Fennel part one: Persistent Data Structures

Somewhere in 2019 I started a project that aimed to bring some of Clojure’s features to the Lua runtime: fennel-cljlib. It was a library for Fennel that implemented a basic subset of the clojure.core namespace functions and macros. My goal was simple - I enjoy working with Clojure, but I don’t use it for hobby projects, so I wanted Fennel to feel more Clojure-like, beyond what it already provides in that regard.

This library grew over the years: I implemented lazy sequences, added immutability, made a testing library inspired by clojure.test and kaocha, and even made a port of clojure.core.async. It was a passion project; I almost never used it to write actual software. One notable exception is fenneldoc - a tool for generating documentation for Fennel libraries. And I haven’t seen anyone else use it for a serious project.

The reason for that is simple - it was an experiment. Corners were cut, and Fennel, being a Clojure-inspired lisp, is not associated with functional programming the same way Clojure is. As a matter of fact, I wouldn’t recommend using this library for anything serious… yet.

Recently, however, I started a new project: ClojureFnl. This is a Clojure-to-Fennel compiler that uses fennel-cljlib as a foundation. It’s still early in development - I had been working on it in private for a few months until I found a suitable way to make things work in March. As of this moment, it is capable of compiling most of the .cljc files I threw at it, but running the compiled code is a different matter. I mean, it works to some degree, but support for the standard library is far from done.

;; Welcome to ClojureFnl REPL
;; ClojureFnl v0.0.1
;; Fennel 1.6.1 on PUC Lua 5.5
user=> (defn prime? [n]
         (not (some zero? (map #(rem n %) (range 2 n)))))
#<function: 0x89ba7c550>
user=> (for [x (range 3 33 2)
             :when (prime? x)]
         x)
(3 5 7 11 13 17 19 23 29 31)
user=>

However, there was a problem.

My initial implementation of immutable data structures in the itable library had a serious flaw. The whole library was a simple hack based on the copy-on-write approach and a bunch of Lua metatables to enforce immutability. As a result, all operations were extremely slow. It was fine as an experiment, but if I wanted to go further with ClojureFnl, I had to replace it. The same problem plagued immutableredblacktree.lua, an implementation of a copy-on-write red-black tree I made for sorted maps. It did a full copy of the tree each time it was modified.

For associative tables it wasn’t that big of a deal - maps usually contain a small number of keys, and itable only copied the levels that needed to change. So, if you had a map with, say, ten keys, and each of those keys contained another map with ten keys, adding, removing, or updating a key in the outer map meant copying only those ten keys - not the whole nested map. I could do that reliably, because the inner maps were immutable too.

But for arrays the story is usually quite different. Arrays often store a lot of indices and are rarely nested (or at least not as often as maps). Copying arrays on every change quickly becomes expensive. I mitigated some of the performance problems by implementing my own version of transients; however, the beauty of Clojure’s data structures is that they’re quite fast even without this optimization.
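To see why copy-on-write arrays get expensive, here is a toy Python sketch (not itable’s actual code) that counts the cells copied while building an array one append at a time:

```python
# Toy illustration (not itable's code): with copy-on-write, every append
# copies the whole array, so building n elements touches O(n^2) cells.
copied = 0

def cow_append(arr, x):
    global copied
    copied += len(arr)   # cells copied to preserve the previous version
    return arr + [x]     # list concatenation allocates a fresh copy

arr = []
for i in range(1000):
    arr = cow_append(arr, i)

print(copied)  # 499500 cells copied for just 1000 appends
```

Structural sharing avoids almost all of this copying: a trie-based vector only copies the handful of nodes on the path to the change.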

Proper persistent data structures

Clojure uses Persistent HAMT as a base for its hash maps and sets, and a bit-partitioned trie for vectors. For sorted maps and sets, Clojure uses an immutable red-black tree implementation, but as far as I know it’s not doing a full copy of the tree, and it also has structural sharing properties.

I started looking into existing implementations of HAMT for Lua:

  1. hamt.lua (based on mattbierner/hamt)
    • seemed incomplete
  2. ltrie
    • no transients
    • no hashset
    • no ordered map (expectable, different algorithm)
    • no compound vector/hash
  3. Michael-Keith Bernard’s gist
    • no custom hashing

I could use one of those - ltrie seemed the most appropriate - but given that I’m working on a Fennel library that I later want to embed into my Clojure compiler, I needed a library implemented in Fennel.

So I made my own library: immutable.fnl. This library features HAMT-hash maps, hash-sets, and vectors, as well as a better implementation of a persistent red-black tree, and lazy linked lists.

Persistent Hash Map

I started the implementation with a persistent HAMT using native Lua hashing. The data structure itself is a Hash Array Mapped Trie (HAMT) with a branching factor of 16. Thus all operations are O(log16 N), which is effectively O(1) for any practical number of keys.

As far as I know, Clojure uses a branching factor of 32, but on a Lua runtime this would make popcount more expensive, and despite a shallower tree, each mutation would need to copy a larger sparse array. With a branching factor of 16, a map with 50K entries is ~4 levels deep, versus ~3 with a branching factor of 32. My reasoning was that 16 is a reasonable compromise, especially since Lua is not the JVM when it comes to performance.
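The depth trade-off above is just logarithm arithmetic; a quick sketch (in Python rather than Fennel, purely for illustration):

```python
import math

def hamt_depth(n_keys, branching):
    # Approximate HAMT depth: each level consumes log2(branching) bits
    # of the hash, so the depth grows as log_branching(n_keys).
    return round(math.log(n_keys, branching))

print(hamt_depth(50_000, 16))  # ~4 levels with 16-way branching
print(hamt_depth(50_000, 32))  # ~3 levels with 32-way branching
```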

Of course, it's not as fast as a plain Lua table, which is to be expected. Lua tables are implemented in C, use efficient hashing, and are dynamically re-allocated based on key count. So most operations in my implementation are a lot slower, but the absolute time per operation is still usable.

Here are some benchmarks:

Median time over 7 rounds (1 warmup discarded), N = 50000 elements. GC stopped during measurement. Clock: os.clock (CPU). Runtime: Fennel 1.7.0-dev on PUC Lua 5.5

Regular operations are notably slower when compared to plain Lua tables:

| Operation | Lua table | Persistent HashMap | Ratio | Per op |
|-----------|-----------|--------------------|-------|--------|
| insert 50000 random keys | 2.05 ms | 164.80 ms | 80.3x slower | 3.3 us |
| lookup 50000 random keys | 0.83 ms | 92.51 ms | 110.8x slower | 1.9 us |
| delete all | 0.78 ms | 170.78 ms | 219.8x slower | 3.4 us |
| delete 10% | 0.14 ms | 19.50 ms | 136.4x slower | 3.9 us |
| iterate 50000 entries | 1.74 ms | 6.64 ms | 3.8x slower | 0.133 us |

For transients the situation is a bit better, but not by much:

| Operation | Lua table | Transient HashMap | Ratio | Per op |
|-----------|-----------|-------------------|-------|--------|
| insert 50000 random keys | 2.05 ms | 89.17 ms | 43.5x slower | 1.8 us |
| delete all | 0.76 ms | 104.31 ms | 138.0x slower | 2.1 us |
| delete 10% | 0.16 ms | 12.71 ms | 82.0x slower | 2.5 us |

On LuaJIT the ratios may look worse, but the per-operation cost is much lower; native table operations are simply that much faster:

Median time over 7 rounds (1 warmup discarded), N = 50000 elements. GC stopped during measurement. Clock: os.clock (CPU). Runtime: Fennel 1.7.0-dev on LuaJIT 2.1.1774896198 macOS/arm64

| Operation | Lua table | Persistent HashMap | Ratio | Per op |
|-----------|-----------|--------------------|-------|--------|
| insert 50000 random keys | 0.86 ms | 49.05 ms | 56.8x slower | 0.981 us |
| lookup 50000 random keys | 0.27 ms | 14.21 ms | 53.4x slower | 0.284 us |
| delete all | 0.13 ms | 48.63 ms | 374.1x slower | 0.973 us |
| delete 10% | 0.05 ms | 6.49 ms | 138.1x slower | 1.3 us |
| iterate 50000 entries | 0.07 ms | 1.80 ms | 27.7x slower | 0.036 us |

| Operation | Lua table | Transient HashMap | Ratio | Per op |
|-----------|-----------|-------------------|-------|--------|
| insert 50000 random keys | 0.76 ms | 22.43 ms | 29.6x slower | 0.449 us |
| delete all | 0.15 ms | 34.16 ms | 232.4x slower | 0.683 us |
| delete 10% | 0.04 ms | 5.02 ms | 132.1x slower | 1.0 us |

With a branching factor of 32 the situation gets worse on PUC Lua, but is slightly better on LuaJIT. So there’s still space for fine-tuning.

For hashing strings and objects I decided to use the djb2 algorithm. I'm almost as old as this hash function, so it seemed like a good fit. JK. The main reason to use it is that it can be implemented without any bit-wise operators, which not all Lua versions provide. It only needs the +, *, and % arithmetic operators, so it works on any Lua version. It's prone to collisions, which I try to mitigate by randomizing the seed when the library is loaded.
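A sketch of djb2 under that constraint (illustrative Python mimicking the arithmetic-only requirement; the `% 2**32` stands in for 32-bit overflow, and the seed randomization mirrors the post's description rather than the library's actual code):

```python
# djb2 using only +, *, and % - the same operator set available
# in every Lua version, no bitwise operators required.

import random

def djb2(s, seed=5381):
    h = seed
    for ch in s:
        # classic djb2 step: h = h * 33 + byte, truncated to 32 bits
        h = (h * 33 + ord(ch)) % 2**32
    return h

# Randomizing the seed once at load time makes collision patterns
# differ between runs, hampering adversarial collision construction.
RANDOM_SEED = 5381 + random.randrange(2**20)

def salted_hash(s):
    return djb2(s, RANDOM_SEED)
```

The `* 33` is traditionally written as a shift-and-add (`h << 5 + h`), but plain multiplication gives the same result and needs no bit operators.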

Still, collisions do happen, but the HAMT core resolves them correctly by falling back to a deep equality function for most objects.

However, when first working on this, I noticed this:

>> (local hash-map (require :io.gitlab.andreyorst.immutable.PersistentHashMap))
nil
>> (local {: hash} (require :io.gitlab.andreyorst.immutable.impl.hash))
nil
>> (hash (hash-map :foo 1 :bar 2))
161272824
>> (hash {:foo 1 :bar 2})
161272824
>> (hash-map (hash-map :foo 1 :bar 2) 1 {:foo 1 :bar 2} 2)
{{:foo 1 :bar 2} 2}

This is an interesting loophole. Which object ended up in our hash map as a key - the persistent map or the plain Lua table? Well, that depends on insertion order:

>> (each [_ k (pairs (hash-map (hash-map :foo 1 :bar 2) 1 {:foo 1 :bar 2} 2))]
     (print (getmetatable k)))
IPersistentHashMap: 0x824d9b570
nil
>> (each [_ k (pairs (hash-map {:foo 1 :bar 2} 2 (hash-map :foo 1 :bar 2) 1))]
     (print (getmetatable k)))
nil
nil

To reiterate: I create a hash map with a key set to another persistent hash map, and then insert a plain Lua table with the same contents. The Lua table hashes to exactly the same value and goes into the same bucket, yet there's no collision, because the objects are equal by value. But equality of mutable collections is very loosely defined - they may be equal right now, and different the next time you look. So persistent collections needed different hashing to avoid this kind of collision. I ended up salting persistent collections with their prototype's address in memory.
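The fix can be sketched like this (illustrative Python, with `id(PersistentMap)` standing in for the Lua prototype address; `content_hash`, `salted_hash`, and the class are invented for the example):

```python
# Salt the hash of a persistent collection with the identity of its
# prototype, so a value-equal mutable table and persistent map land
# in different buckets and can coexist as distinct keys.

def content_hash(pairs):
    # order-independent hash of key/value pairs
    return sum(hash(k) ^ hash(v) for k, v in pairs) % 2**32

class PersistentMap:
    def __init__(self, pairs):
        self._pairs = tuple(sorted(pairs))

    def salted_hash(self):
        # id(PersistentMap) plays the role of the prototype's address
        # in memory: the same for every PersistentMap, different from
        # anything a plain table could produce.
        return (content_hash(self._pairs) ^ id(PersistentMap)) % 2**32

pm = PersistentMap([("foo", 1), ("bar", 2)])
# A plain collection of the same pairs can share content_hash with
# pm's pairs, but pm.salted_hash() mixes in the prototype identity.
```

The salt changes only which bucket persistent collections land in; equality between two persistent maps still works, since both carry the same salt.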

Other than that, the HAMT implementation is by the book, and the rest is the interface for interacting with maps.

Main operations:

  • new - construct a new map of key value pairs
  • assoc - associate a key with a value
  • dissoc - remove key from the map
  • conj - universal method for association, much like in Clojure
  • contains - check if key is in the map
  • count - map size, constant time
  • get - get a key value from a map
  • keys - get a lazy list of keys
  • vals - get a lazy list of values
  • transient - convert a map to a transient

Coercion/conversion:

  • from - create a map from another object
  • to-table - convert a map to a Lua table
  • iterator - get an iterator to use in Lua loops

Transient operations:

  • assoc! - mutable assoc
  • dissoc! - mutable dissoc
  • persistent - convert back to persistent variant, and mark transient as completed

This covers most of the needs of my fennel-cljlib library; anything beyond that I can implement myself or adapt from existing implementations.

A Persistent Hash Set is also available as a thin wrapper around PersistentHashMap with a few method changes.

A note on PersistentArrayMap.

In Clojure there is a second kind of map that is ordered, not sorted, called a Persistent Array Map. It is used by default when defining a map with eight keys or fewer, like {:foo 1 :bar 2}. The idea is simple - for such a small map, a linear search through all keys is faster than a HAMT lookup.

However, in my testing on the Lua runtime there's no benefit to this kind of data structure, apart from it being an ordered variant. Lookup is slower because of the custom equality function, which does a deep comparison.

Persistent Vector

Persistent Vectors came next. While the trie structure is similar to hash maps, vectors use direct index-based navigation instead of hashing, with a branching factor of 32. Unlike in maps, the arrays in a vector's trie are densely packed, so a higher branching factor pays off. Lookup, update, and pop are O(log32 N), and append can be considered amortized O(1).
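The index math can be sketched as follows (illustrative Python, not the library's code; `path_to` is an invented helper) - each trie level simply consumes 5 bits of the index:

```python
# Index navigation in a 32-way bit-partitioned vector trie:
# no hashing, just bit-shifting on the index itself.

BITS = 5
MASK = (1 << BITS) - 1   # 31

def path_to(index, depth):
    """Slots to follow from the root (depth levels) down to the leaf."""
    return [(index >> (BITS * level)) & MASK
            for level in range(depth - 1, -1, -1)]

# In a depth-2 trie (up to 32*32 = 1024 elements), element 70 lives
# at root slot 2, leaf slot 6, because 70 = 2*32 + 6.
```

Because indices are dense, nodes need no bitmap or popcount, which is one reason the vector tolerates the larger branching factor that hurt the hash map.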

Still, compared to plain Lua sequential tables the performance is not as good:

Median time over 7 rounds (1 warmup discarded), N = 50000 elements. GC stopped during measurement. Clock: os.clock (CPU). Runtime: Fennel 1.7.0-dev on PUC Lua 5.5

| Operation | Lua table | Persistent Vector | Ratio | Per op |
|-----------|-----------|-------------------|-------|--------|
| insert 50000 elements | 0.19 ms | 21.07 ms | 109.7x slower | 0.421 us |
| lookup 50000 random indices | 0.47 ms | 14.05 ms | 29.7x slower | 0.281 us |
| update 50000 random indices | 0.32 ms | 70.04 ms | 221.6x slower | 1.4 us |
| pop all 50000 elements | 0.25 ms | 24.34 ms | 96.2x slower | 0.487 us |
| iterate 50000 elements | 0.63 ms | 10.16 ms | 16.2x slower | 0.203 us |

| Operation | Lua table | Transient Vector | Ratio | Per op |
|-----------|-----------|------------------|-------|--------|
| insert 50000 elements | 0.19 ms | 7.81 ms | 40.3x slower | 0.156 us |
| update 50000 random indices | 0.33 ms | 20.76 ms | 62.4x slower | 0.415 us |
| pop all 50000 elements | 0.25 ms | 11.14 ms | 44.4x slower | 0.223 us |

On LuaJIT:

Median time over 7 rounds (1 warmup discarded), N = 50000 elements. GC stopped during measurement. Clock: os.clock (CPU). Runtime: Fennel 1.7.0-dev on LuaJIT 2.1.1774896198 macOS/arm64

| Operation | Lua table | Persistent Vector | Ratio | Per op |
|-----------|-----------|-------------------|-------|--------|
| insert 50000 elements | 0.10 ms | 7.62 ms | 74.0x slower | 0.152 us |
| lookup 50000 random indices | 0.06 ms | 0.67 ms | 11.8x slower | 0.013 us |
| update 50000 random indices | 0.04 ms | 29.13 ms | 710.4x slower | 0.583 us |
| pop all 50000 elements | 0.02 ms | 8.62 ms | 410.4x slower | 0.172 us |
| iterate 50000 elements | 0.02 ms | 0.57 ms | 28.7x slower | 0.011 us |

| Operation | Lua table | Transient Vector | Ratio | Per op |
|-----------|-----------|------------------|-------|--------|
| insert 50000 elements | 0.05 ms | 0.59 ms | 11.6x slower | 0.012 us |
| update 50000 random indices | 0.04 ms | 2.06 ms | 51.6x slower | 0.041 us |
| pop all 50000 elements | 0.02 ms | 0.84 ms | 46.7x slower | 0.017 us |

I think this is still OK performance. Vectors don't use hashing - lookup is a direct index traversal via bit-shifting, just index math.

Operations on vectors include:

  • new - constructor
  • conj - append to the tail
  • assoc - change a value at given index
  • count - element count (constant time)
  • get - get value at given index
  • pop - remove last
  • transient - convert to a transient
  • subvec - create a slice of the vector in constant time

Transient operations:

  • assoc! - mutable assoc
  • conj! - mutable conj
  • pop! - mutable pop
  • persistent - convert back to persistent and finalize

Interop:

  • from - creates a vector from any other collection
  • iterator - returns an iterator for use in Lua loops
  • to-table - converts to a sequential Lua table

One notable difference of both the vector and the hash-map is that they allow nil to be used as a value (and, in the case of the hash-map, as a key). Vectors don't have the problem Lua sequential tables have, where the length is not well-defined if the table has holes in it.

Whether allowing nil as a value (and especially as a key) is a good decision is a debate for another time, but Clojure already made it for me. So for this project I decided to support it.

Persistent Red-Black Tree

For sorted maps and sorted sets I chose Okasaki’s insertion and Germane & Might’s deletion algorithms. Most of the knowledge I got from this amazing blog post by Matt Might.

I believe the operations are O(log N), as for any balanced binary tree, but given that the deletion algorithm is tricky, I'm not entirely sure:

Median time over 7 rounds (1 warmup discarded), N = 50000 elements. GC stopped during measurement. Clock: os.clock (CPU). Runtime: Fennel 1.7.0-dev on PUC Lua 5.5

| Operation | Lua table | PersistentTreeMap | Ratio | Per op |
|-----------|-----------|-------------------|-------|--------|
| insert 50000 random keys | 2.10 ms | 209.23 ms | 99.8x slower | 4.2 us |
| lookup 50000 random keys | 0.88 ms | 82.97 ms | 94.2x slower | 1.7 us |
| delete all | 0.74 ms | 173.76 ms | 234.8x slower | 3.5 us |

On LuaJIT:

Median time over 7 rounds (1 warmup discarded), N = 50000 elements. GC stopped during measurement. Clock: os.clock (CPU). Runtime: Fennel 1.7.0-dev on LuaJIT 2.1.1774896198 macOS/arm64

| Operation | Lua table | PersistentTreeMap | Ratio | Per op |
|-----------|-----------|-------------------|-------|--------|
| insert 50000 random keys | 0.72 ms | 101.08 ms | 140.4x slower | 2.0 us |
| lookup 50000 random keys | 0.25 ms | 12.67 ms | 49.9x slower | 0.253 us |
| delete all | 0.14 ms | 56.14 ms | 403.9x slower | 1.1 us |

The API for sorted maps and sets is the same as for their hash counterparts, with one small difference - no transients. Clojure doesn't provide them, and neither do I.

That's all for benchmarks. I know there are many problems with this kind of benchmarking, so take the numbers with a grain of salt. Still, the results are far, far better than what I had with itable.

But there are two more data structures to talk about.

Persistent List

As I mentioned, I made a lazy persistent list implementation a while ago, but it had its problems, and I couldn't integrate that library with the current one well enough.

The main problem was that this library uses a single shared metatable per data structure, and the old implementation of lazy lists didn't. That difference makes it hard to check whether an object is a table, hash-map, list, vector, set, etc. So I reimplemented them.

The old implementation used different metatables because I decided to try the approach described in the Reversing the technical interview post by Kyle Kingsbury (Aphyr). I know that post is more of a fun joke, but it actually makes sense to define linked lists like that in Lua.

See, tables are mutable, and you can't do much about it. Closures, on the other hand, are much harder to mutate - you can still do it via the debug module, but it's hard, and the module isn't always present. So storing head and tail in function closures was a deliberate choice.

However, that meant I needed to somehow attach metadata to a function to make it act like a data structure, and you can't just call setmetatable on a function. You can do debug.setmetatable, but then all function objects share the same metatable. So, while you can do fancy things like this:

>> (fn comp [f g] (fn [...] (f (g ...))))
#<function: 0x7bdb320a0>
>> (debug.setmetatable (fn []) {:__add comp})
#<function: 0x7bd17f040>
>> ((+ string.reverse string.upper) "foo")
"OOF"

Notice that our + overload also applied to the functions in the string module.

So instead, we use a table and wrap it with a metatable that has a __call metamethod, essentially making our table act like a function. This, in turn, means we have to create two tables per list node - one to give to the user, and another to hold our __call and serve as the metatable.

Convoluted, I know. It's all in the past now - the current implementation is a simple {:head 42 :tail {...}} table. Not sure which is worse.

But that meant I had to rework how lazy lists worked, because previously laziness was just a metatable swap. Now a lazy list stores a "thunk" that, when called, replaces itself in the node with the :head and :tail keys. Unless it's an empty list, of course - in that case we swap the metatable for the empty-list one.
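The thunk mechanism can be sketched like this (illustrative Python rather than Fennel metatable swaps; `LazyNode` and `count_up` are invented names):

```python
# A lazy list node stores a thunk; forcing it computes head/tail
# once and caches the result in place, so each node is evaluated
# at most one time.

class LazyNode:
    def __init__(self, thunk):
        self._thunk = thunk      # () -> (head, tail)
        self._forced = None

    def _force(self):
        if self._thunk is not None:
            self._forced = self._thunk()   # replace thunk with real node
            self._thunk = None
        return self._forced

    @property
    def head(self):
        return self._force()[0]

    @property
    def tail(self):
        return self._force()[1]

def count_up(n):
    # infinite lazy list n, n+1, n+2, ...
    return LazyNode(lambda: (n, count_up(n + 1)))

nums = count_up(1)
```

In the Fennel version the "replacement" happens in the node table itself (or via a metatable swap for the empty case); the caching idea is the same.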

So Lists have three metatables now:

  • IPersistentList
  • IPersistentList$Empty
  • IPersistentList$Lazy

Instead of god knows how many in the old implementation.

The list interface is also better now. Previously the ways to construct a list from a data structure were hardcoded. The current implementation still hardcodes them, but it also allows building a list lazily from an iterator.

This is better because even a custom data structure with an unusual iteration scheme (like the maps and sets in this library) can still be converted to a list. The general case is just:

(PersistentList.from-iterator #(pairs data) (fn [_ v] v))

Meaning that we pass a function that produces the iterator, and a function that captures values from that iterator. It reminds me of Clojure transducers in a way.
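The same shape translates to a sketch like this (illustrative Python; `from_iterator` is an invented stand-in for `PersistentList.from-iterator`, with a generator playing the role of the lazy list):

```python
# Build a lazy sequence from (1) a function that produces an iterator
# and (2) a function that extracts the desired value from each item -
# the same two arguments as in the Fennel snippet above.

def from_iterator(make_iter, extract):
    """make_iter: () -> iterable of raw items (e.g. (k, v) pairs);
    extract: raw item -> list element."""
    def gen():
        for item in make_iter():
            yield extract(item)
    return gen()

data = {"a": 1, "b": 2}
# Take just the values, like the (fn [_ v] v) extractor in the post:
vals = from_iterator(lambda: data.items(), lambda kv: kv[1])
```

The indirection matters: passing a function that *produces* the iterator (rather than an iterator itself) lets the list restart or defer iteration, which a consumed iterator could not.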

Persistent Queue

And the final data structure - a persistent queue. Fast append at the end, and also fast remove from the front.

It’s done by holding two collections - a linked list at the front, and a persistent vector for the rear. So removing from the list is O(1), and appending to the vector is also pretty much O(1).

Interesting things start to happen when we exhaust the list part - we need to move the vector's contents into the list. This is done by calling PersistentList.from on the rear. And building a list out of a persistent vector is an O(1) operation as well! Well, because nothing happens upfront: we simply create an iterator and build the list lazily. And since indexing the vector is essentially ~O(1), we can say we still retain this property.

Or at least that’s how I reasoned about this - I’m not that good with time-complexity stuff.
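A minimal sketch of the two-part queue (illustrative Python using tuples, so it shows the shape of the algorithm but not the structural sharing or laziness of the real implementation):

```python
# Persistent FIFO queue built from two parts: a front to pop from
# and a rear to push onto. When the front runs out, the rear becomes
# the new front (the library does this lazily via PersistentList.from).

class Queue:
    def __init__(self, front=(), rear=()):
        self._front = tuple(front)   # pop end
        self._rear = tuple(rear)     # push end

    def push(self, x):
        return Queue(self._front, self._rear + (x,))

    def pop(self):
        # assumes a non-empty queue
        if not self._front:
            front, rear = self._rear, ()   # rotate rear into front
        else:
            front, rear = self._front, self._rear
        return front[0], Queue(front[1:], rear)

q = Queue()
for i in (1, 2, 3):
    q = q.push(i)
first, q = q.pop()   # FIFO: elements come out in insertion order
```

Here the rotation copies the rear tuple; in the real implementation the rotation builds a lazy list over the vector, which is what keeps the amortized cost low.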

ClojureFnl

That concludes part one about ClojureFnl.

I know this post was not about ClojureFnl at all, but I had to fix the underlying implementation first. Now that I have better data structures to build on, I can get back to working on the compiler itself. So the next post will hopefully be about the compiler.

Unless I get distracted again.


Versioned Analytics for Regulated Industries


Financial regulation — Basel III, MiFID II, Solvency II, SOX — requires that risk calculations, credit decisions, and compliance reports be reproducible. Not just the code, but the exact data state that produced them. When an auditor asks “show me the data behind this risk number from six months ago,” the answer can’t be “we’ll try to reconstruct it.”

Version control solved this problem for source code decades ago. But analytical data infrastructure never caught up. Data warehouses don’t version tables. Temporal tables track row-level changes but don’t compose across tables or systems. Manual snapshots are expensive, fragile, and don’t support branching for scenario analysis.

Stratum brings the git model to analytical data: every write creates an immutable, content-addressed snapshot. Old states remain accessible by commit UUID. Branches are O(1). And via Yggdrasil, you can tie entity databases, analytical datasets, and search indices into a single consistent, auditable snapshot.

The problem

A typical analytical pipeline at a regulated institution:

  1. Transactional data flows into a warehouse (nightly ETL or streaming)
  2. Analysts run GROUP BY / SUM / STDDEV queries for risk models and reports
  3. Results feed regulatory submissions — capital adequacy, liquidity coverage, market risk
  4. Months later, an auditor asks: “What data produced risk report X on date Y?”

Step 4 is where things break. The warehouse has been mutated since then. Maybe there’s a backup, maybe not. Reconstructing the exact state requires replaying ETL from source systems — if those logs still exist.

Even if you can reconstruct the data, you can’t prove it’s the same data. There’s no cryptographic link between the report and the state that produced it. The best you can offer is procedural trust: “our backup process is reliable, and we believe this is what the data looked like.” That’s a weak foundation for regulatory compliance.

Immutable snapshots as audit anchors

With Stratum, every table is a copy-on-write value. Writes create new snapshots; old snapshots remain addressable by commit UUID or branch name. The underlying storage is a content-addressed Merkle tree — each snapshot’s identity is derived from a hash of its data, providing a cryptographic chain of custody from report to source.

(require '[stratum.api :as st])

;; Load the current production state
(def trades (st/load store "trades" {:branch "production"}))

;; Run today's risk calculation
(def risk-report
  (st/q {:from trades
         :group [:desk :currency]
         :agg [[:sum :notional] [:stddev :pnl] [:count]]}))

;; The commit UUID is your audit anchor — store it alongside the report
;; Six months later, reproduce exactly:
(def historical-trades
  (st/load store "trades" {:as-of #uuid "a1b2c3d4-..."}))

(def historical-report
  (st/q {:from historical-trades
         :group [:desk :currency]
         :agg [[:sum :notional] [:stddev :pnl] [:count]]}))
;; Identical results, guaranteed by content addressing

Or via SQL — connect any PostgreSQL client:

-- Today's report
SELECT desk, currency, SUM(notional), STDDEV(pnl), COUNT(*)
FROM trades GROUP BY desk, currency;

-- Historical report: same query, different snapshot
-- resolved server-side via branch/commit configuration

Once committed, data cannot be modified — every state is a value, addressable by its content hash. Historical snapshots load lazily from storage on demand, so keeping years of history doesn’t mean paying for it in memory. And because snapshots are immutable values, multiple analysts can query the same or different points in time concurrently without coordination or locks.

Scenario analysis with branching

Beyond audit compliance, regulated institutions need scenario analysis. Basel III stress testing requires banks to evaluate capital adequacy under hypothetical adverse conditions — equity drawdowns, interest rate shocks, credit spread widening. Traditional approaches involve copying production data into staging environments, running scenarios, comparing results, and cleaning up. That process is slow, expensive, and error-prone.

With copy-on-write branching, forking a dataset is O(1) regardless of size. A 100-million-row table branches in microseconds because the fork is just a new root pointer into the shared tree. Only chunks that are actually modified get copied.

;; Fork production data for stress testing — O(1) regardless of table size
(def stress-scenario (st/fork trades))

;; Apply adverse conditions — only modified chunks are copied
;; e.g. via SQL: UPDATE trades SET price = price * 0.7
;;               WHERE asset_class = 'equity'

;; Compare risk metrics: production vs stressed
(def baseline-risk
  (st/q {:from trades
         :group [:desk]
         :agg [[:stddev :pnl] [:sum :notional]]}))

(def stressed-risk
  (st/q {:from stress-scenario
         :group [:desk]
         :agg [[:stddev :pnl] [:sum :notional]]}))

;; Run as many scenarios as needed — each is an independent branch
;; Baseline, adverse, severely adverse, custom scenarios
;; all sharing unmodified data via structural sharing

Each branch is fully isolated: modifications to the stress scenario can’t touch production data. You can maintain dozens of concurrent scenarios without multiplying storage costs — they share all unmodified data. When you stop referencing a branch, mark-and-sweep GC reclaims the storage. No staging environments, no cleanup scripts.

This also applies to model validation. When a risk model is updated, you can run the new model against historical snapshots and compare its outputs to the original model’s results — same data, different code, verifiable divergence.

Cross-system consistency

A real regulatory pipeline isn’t just one analytical table. Entity data (customers, counterparties, legal entities) lives in a transactional database. Analytical views (positions, P&L, exposures) live in a columnar engine. Compliance documents and communications live in a search index. For an audit to be meaningful, all of these need to be at the same point in time.

Yggdrasil provides a shared branching protocol across these heterogeneous systems. You can compose a Datahike entity database, a Stratum analytical dataset, and a Scriptum search index into a single composite system — branching, snapshotting, and time-traveling all of them together.

(require '[yggdrasil.core :as ygg])

;; Compose entity database + analytics + search into one system
(def system
  (ygg/composite-system
    {:entities datahike-conn    ;; customer records, counterparties
     :analytics stratum-store   ;; trade data, positions, P&L
     :search scriptum-index}))  ;; compliance documents, communications

;; Branch the entire system for an investigation
(ygg/branch! system "investigation-2026-Q1")

;; Every component is now at the same logical point in time
;; Query across all three with a single consistent snapshot

When an auditor needs the full picture — the trade data, the customer entity that placed the trade, and the compliance documents reviewed at the time — they get a single consistent view across all systems, tied to one branch identifier. No manual coordination, no hoping the timestamps line up.

Compliance lifecycle

Immutable systems raise an obvious question: what about GDPR right-to-erasure, or data retention policies that require deletion?

Immutability doesn’t mean data can never be removed — it means deletion is explicit and verifiable rather than implicit and unauditable. The Datahike ecosystem supports purge operations that remove specific data from all indices and all historical snapshots. Mark-and-sweep garbage collection, coordinated across systems via Yggdrasil, reclaims storage from unreachable snapshots.

This is actually a stronger compliance story than mutable databases offer. In a mutable system, you DELETE a row and trust that the storage layer eventually overwrites it — but you can’t prove it’s gone from backups, replicas, or caches. With explicit purge on content-addressed storage, you can verify that the data no longer exists in any reachable snapshot.

Production-ready performance

Versioning and immutability don’t come at the cost of query speed. Stratum uses SIMD-accelerated execution via the Java Vector API, fused filter-aggregate pipelines, and zone-map pruning to skip entire data chunks. It runs standard OLAP benchmarks competitively with engines like DuckDB — while also providing branching, time travel, and content addressing that pure analytical engines don’t.

Full SQL is supported via the PostgreSQL wire protocol: aggregates, window functions, joins, CTEs, subqueries. Connect with psql, JDBC, DBeaver, or any PostgreSQL-compatible client. See the Stratum technical deep-dive for architecture details and benchmark methodology.

Getting started

Stratum runs as an in-process Clojure library or a standalone SQL server. Requires JDK 21+.

{:deps {org.replikativ/stratum {:mvn/version "RELEASE"}}}

If you’re building analytical infrastructure in a regulated environment — or exploring how versioned data can simplify your compliance story — get in touch. We work with teams in finance, insurance, and healthcare to design data architectures where auditability is built in, not bolted on.


Learn Ring - 8. Hiccup

Code

  • project.clj
(defproject little_ring_things "0.1.0-SNAPSHOT"
  :description "FIXME: write description"
  :url "http://example.com/FIXME"
  :min-lein-version "2.0.0"
  :dependencies [[org.clojure/clojure "1.12.4"]
                 [compojure "1.6.1"]
                 [ring/ring-defaults "0.3.2"]
                 [hiccup "2.0.0"]]
  :plugins [[lein-ring "0.12.5"]]
  :ring {:handler little-ring-things.handler/app}
  :profiles
  {:dev {:dependencies [[javax.servlet/servlet-api "2.5"]
                        [ring/ring-mock "0.3.2"]]}})

  • template.clj
(ns little-ring-things.template
  (:require [hiccup2.core :as h]))

(defn template [title body]
  (str
    "<!DOCTYPE html>\n"
    (h/html
      [:html
       [:head
        [:meta {:charset "UTF-8"}]
        [:title title]]
       [:body
        body]])))

Notes


Copyright © 2009, Planet Clojure. No rights reserved.
Planet Clojure is maintained by Baishampayan Ghose.
Clojure and the Clojure logo are Copyright © 2008-2009, Rich Hickey.
Theme by Brajeshwar.