The Substrate
Notes from a Decade Beneath the Application
I've been quiet these past couple of months: heads down, busy, but tinkering. I don't have a destination for this post, but I'm happily trotting along a path to somewhere. So come with me as I meander.
I think when I started actually knowing what I was doing in this industry, I wanted cloud infrastructure to feel honest. Not magical, not abstract, not "elastic" in the brochure sense. Honest. If a request hit a server and the server was sick, I wanted to know. If a workload was waiting in line, I wanted to understand why and behind whom. If a model returned a token, I wanted the path that token took to be something a person could draw on a whiteboard without leaving anything out.
Most of the last decade has been a long argument with myself about what that means in practice. The arenas keep changing (bare-metal Linux, OpenStack at scale, container orchestration, GPU pools that misbehave the same week the model ships), but the question hasn't. What does the smallest, most legible substrate underneath the thing the user actually wanted look like? Getting that wrong makes every clever feature above it leak at the seams. Getting it closer to right makes the clever features feel almost easy.
These notes have been kicking around in my notebooks for a while. I wanted to write them down before they got sanded smooth (like my brain), especially since I effectively use this blog as a way to remember things I'm likely to forget in the very near future.
Operations as a design discipline
Operations isn't the thing that happens after engineering. It's a design discipline of its own. The shape of how something gets operated (who pages whom, what knob exists in production but not staging, which view the on-call uses to answer "is it actually down") pre-shapes what a healthy codebase ends up looking like. When the architecture can't answer the operator's questions cheaply, the architecture is working against itself, even when the diagram looks clean.
I've been on the wrong side of that line enough times to recognize the symptoms. You ship a feature, it works, customers like it, and a few months later three teams are independently shelling into a box to tail a log file because the system has no real opinion about what "healthy" means for it. The fix isn't more logs. The fix is that healthy should be a first-class concept the system can describe in its own vocabulary, and features that can't extend that vocabulary deserve a second look before they ship.
This shows up everywhere I work. In one current effort, the boundaries inside a request path each emit a typed event drawn from a closed, versioned set of names, not a free-text log line. When a UI surfaces a small status pill that says "waiting for first chunk," it's reading a discrete event the path emitted on purpose, with a name that survived review. That small piece of discipline pays back its cost fast.
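To make that concrete, here is a minimal Python sketch of a closed event set feeding a status pill. The event names, the `PathEvent` enum, and the `pill_for` helper are all hypothetical; the post doesn't show the real vocabulary, so treat this as one plausible shape rather than the actual implementation.

```python
from enum import Enum


class PathEvent(Enum):
    """Closed set of request-path events (hypothetical names)."""
    REQUEST_ACCEPTED = "request.accepted"
    WORKER_SELECTED = "worker.selected"
    WAITING_FIRST_CHUNK = "stream.waiting_first_chunk"
    STREAM_COMPLETED = "stream.completed"


# The UI maps discrete events to labels; it never parses free-text log lines.
PILL_LABELS = {
    PathEvent.REQUEST_ACCEPTED: "queued",
    PathEvent.WORKER_SELECTED: "routing",
    PathEvent.WAITING_FIRST_CHUNK: "waiting for first chunk",
    PathEvent.STREAM_COMPLETED: "done",
}


def pill_for(event_name: str) -> str:
    """Translate a wire-level event name into a status-pill label.

    Raises ValueError for names outside the closed set, so a typo fails
    loudly instead of rendering a blank pill.
    """
    return PILL_LABELS[PathEvent(event_name)]
```

Because the set is closed, a check that every member has a label is a one-line assertion (`set(PILL_LABELS) == set(PathEvent)`), which is hard to get from free-text logs.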
Swift, what?
People ask, sometimes with a bit of incredulity, why I keep writing operations tooling and infrastructure services in Swift. Fair question. The cloud world is mostly Python, Go, Rust, with a smattering of Ruby, all empowered by ominous C. Swift still has a reputation as the language of iPhone apps.
My answer isn't religious. It's working. Swift 6 combines three things I've come to lean on: a strict, compile-time concurrency model that surfaces whole categories of race condition as build errors; a clean separation between value and reference types that maps onto the difference between a snapshot of state and a handle to something that changes underneath you; and an actor model that lets me sketch a router or scheduler in code the way I'd sketch it on a whiteboard. The parts of a system that are hardest to reason about end up being the parts the compiler is hardest on, which is where I want the friction.
I built Substation partly to test that intuition. It's a terminal UI for OpenStack, written in Swift 6, with no external dependencies and a custom ncurses layer underneath. Substation started because I wanted to manage thousands of instances without waiting on a web dashboard that thinks twelve seconds is a reasonable response time. But the deeper reason it exists is that I wanted to know whether a Swift-only systems stack could carry the operational weight of a real cloud. With multi-level caches, actor-based concurrency, and a strict Swift 6 build that compiles cleanly on macOS and Linux, I'm confident the answer is yes. The language hasn't been the bottleneck. What mattered more was whether I was willing to be honest about my state machines.
Embeddings and scoring as connective tissue
It's fashionable to talk about generative models as the interesting part of an AI system. They are interesting. They're not usually the part that determines whether the system feels any good. The part that matters more is the connective tissue: embeddings, rerankers, the scoring functions that quietly decide which worker, which cache shard, which retrieval path, which user gets served first.
In production, a generative model behaves like a coarse-grained, expensive, occasionally unreliable RPC. You feed it a context, you get a stream of tokens, you hope the stream finishes cleanly; it's like a slot machine. Most of the engineering effort lives upstream and downstream of that call: an embedding pipeline turning unstructured text into something you can index and route on, a reranker shaping what the model actually sees, a scoring function deciding which of several plausible workers takes the request, a parser turning the stream into structured events, a validator catching malformed tool calls before they reach a customer, and a budget that decides when to give up gracefully.
When I started experimenting with vLLM, embedding models, and small purpose-built scorers in earnest, what struck me was how much of the discipline carried over from years of OpenStack work, almost unchanged. A GPU worker looks a lot like a hypervisor. A model looks a lot like a flavor. A KV cache behaves like a block device. Prefix caching has a lot in common with page caching. The control plane deciding which request lands on which worker is, fundamentally, a scheduler, with most of the same fairness, starvation, and anti-thrash problems schedulers have always had. Operating it well has required the same instincts I developed running compute at scale, and a smaller share of the new vocabulary invented by a self-important industry.
That recognition slowly pulled me away from treating model serving as a brand new domain and toward thinking of it as the latest layer of a substrate I'd spent a long time getting comfortable with.
Lessons I keep coming back to
The database is usually the only thing that remembers. Every time I've let in-memory state become authoritative (a sync session, a worker health score, a quota counter), there's eventually been an outage that reminded me of the same thing. After a restart, the truth is gone. Anything load-bearing eventually needs to flow back to a durable store, even if you have to be clever about how the writes get batched. The schema is the most expensive interface in the system, and migrations deserve roughly the same care as breaking API changes.
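A rough sketch of the batching idea, using Python's standard `sqlite3` module. The `HealthStore` class, the `worker_health` table, and the batch size are illustrative assumptions, not anything from a real system. The point is only that the buffer is a write optimization and the database remains the thing that remembers.

```python
import sqlite3


class HealthStore:
    """Buffers worker health scores in memory and flushes them in batches.

    The database stays authoritative: load() reads from SQLite only,
    never from the in-memory buffer.
    """

    def __init__(self, conn: sqlite3.Connection, batch_size: int = 100):
        self.conn = conn
        self.batch_size = batch_size
        self.pending = {}  # latest unflushed score per worker id
        conn.execute(
            "CREATE TABLE IF NOT EXISTS worker_health ("
            "worker_id TEXT PRIMARY KEY, score REAL NOT NULL)"
        )
        conn.commit()

    def record(self, worker_id: str, score: float) -> None:
        """Buffer a score; flush once the buffer reaches batch_size."""
        self.pending[worker_id] = score
        if len(self.pending) >= self.batch_size:
            self.flush()

    def flush(self) -> None:
        """Write all buffered scores in a single transaction."""
        if not self.pending:
            return
        with self.conn:  # commits on success, rolls back on error
            self.conn.executemany(
                "INSERT OR REPLACE INTO worker_health (worker_id, score) "
                "VALUES (?, ?)",
                list(self.pending.items()),
            )
        self.pending.clear()

    def load(self) -> dict:
        """Rebuild state from the durable store, e.g. after a restart."""
        rows = self.conn.execute("SELECT worker_id, score FROM worker_health")
        return dict(rows.fetchall())
```

The trade-off is visible in the code: anything still sitting in `pending` when the process dies is lost, so the batch size (or a flush timer, not shown) bounds how much truth a restart can cost.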
Fairness is easier to design in than to retrofit. Round-robin is rarely as fair as it looks, and weighted random has its own surprises. What's worked for me is something closer to aged priority: a soft anti-starvation bump for any tenant that's been waiting past a fairness window, expressed in one place rather than scattered across several queues. Once a scheduler behaves this way, complaints about "the slow times" tend to fade, mostly because the worst case becomes bounded and easier to explain.
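As a minimal sketch of aged priority, assuming a linear bump and a single in-memory queue: the `Pending` type, the 30-second window, and the boost rate below are made-up values for illustration, not tuned numbers from a real scheduler.

```python
from dataclasses import dataclass


@dataclass
class Pending:
    tenant: str
    base_priority: float  # higher runs sooner
    enqueued_at: float    # seconds, on the same clock as `now`


def effective_priority(item: Pending, now: float,
                       fairness_window: float = 30.0,
                       boost_per_second: float = 0.1) -> float:
    """Base priority plus a soft bump once the wait exceeds the window."""
    overdue = max(0.0, (now - item.enqueued_at) - fairness_window)
    return item.base_priority + boost_per_second * overdue


def pick_next(queue: list, now: float) -> Pending:
    """Remove and return the item with the highest effective priority.

    Ties go to the item that has waited longest.
    """
    best = max(queue, key=lambda p: (effective_priority(p, now), -p.enqueued_at))
    queue.remove(best)
    return best
```

The anti-starvation rule lives in one function, which is what makes the worst case explainable. With these example defaults, a priority-1 request overtakes a freshly arrived priority-5 request once it has waited a little over 70 seconds (30 seconds of window plus 4 points of gap at 0.1 per second).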
Honest signals outperform clever heuristics. I've written my share of predictors: latency models, EWMA-smoothed (exponentially weighted moving average) health scores, adaptive exploration policies. They have their place. What I keep coming back to is how much value a single well-named event emitted at the right boundary can carry. A heartbeat that says "I am still working, the first byte is on its way" has done more for perceived reliability than most of the predictors I've built around it. The signal beneath the prediction has to be honest before the prediction is worth much.
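For reference, an EWMA health score is only a few lines. This sketch assumes samples in the range 0 to 1 and a hypothetical `EwmaHealth` class. It shows why the input matters so much: the smoother does nothing but average whatever signal it is fed.

```python
from typing import Optional


class EwmaHealth:
    """Exponentially weighted moving average of a health signal."""

    def __init__(self, alpha: float = 0.2):
        if not 0.0 < alpha <= 1.0:
            raise ValueError("alpha must be in (0, 1]")
        self.alpha = alpha
        self.score: Optional[float] = None  # None until the first sample

    def observe(self, sample: float) -> float:
        """Fold in one sample: score = alpha*sample + (1 - alpha)*score."""
        if self.score is None:
            self.score = sample  # seed with the first observation
        else:
            self.score = self.alpha * sample + (1.0 - self.alpha) * self.score
        return self.score
```

A larger `alpha` reacts faster and forgets faster. Neither setting rescues a sample that was wrong at the boundary where it was measured.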
Closed vocabularies age better than open ones. The protocols I've enjoyed working with had a finite, documented event vocabulary and a habit of versioning carefully when it grew. The ones I struggled with started as "we'll just put whatever we need in a JSON blob." A closed vocabulary makes parsers safer, UIs more renderable, downstream consumers easier to compose. It also forces engineers to argue about what an event means before it ships, which is usually a healthy argument to have.
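One way to enforce this at the parser, sketched in Python: the envelope fields (`v`, `event`, `data`) and the event names are assumptions for illustration only. Each version lists every name it allows, and growing the vocabulary means adding a version rather than editing an old one.

```python
import json

# Hypothetical vocabulary: every event name each version permits.
VOCABULARY = {
    1: frozenset({"stream.started", "stream.chunk", "stream.completed"}),
    2: frozenset({"stream.started", "stream.chunk", "stream.completed",
                  "stream.heartbeat"}),
}


def parse_event(raw: str) -> tuple:
    """Parse one JSON envelope and reject anything outside the vocabulary.

    Returns (version, event_name, data).
    """
    envelope = json.loads(raw)
    version = envelope["v"]
    name = envelope["event"]
    if version not in VOCABULARY:
        raise ValueError(f"unknown vocabulary version: {version}")
    if name not in VOCABULARY[version]:
        raise ValueError(f"event {name!r} is not defined in v{version}")
    return version, name, envelope.get("data", {})
```

For example, `parse_event('{"v": 2, "event": "stream.heartbeat"}')` is accepted, while the same event under `"v": 1` is rejected, so a consumer pinned to v1 never sees a name it wasn't built to render.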
Compatibility is quiet leverage. When a wire format matches something a thousand client libraries already speak, a lot of integration work comes along for free. The inverse is also true, and the hours I've spent guarding an existing surface against accidental drift have often turned out to be some of the cheaper hours in a project.
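Guarding a surface can be as cheap as a frozen description of its shape, checked in tests. The field names in `GOLDEN_SHAPE` below are hypothetical stand-ins, and flagging additive fields as drift is a deliberately strict choice that not every surface needs.

```python
# Hypothetical frozen shape of a response that existing clients depend on.
GOLDEN_SHAPE = {"id": str, "object": str, "created": int, "choices": list}


def shape_drift(payload: dict) -> list:
    """Describe every way `payload` departs from the golden shape."""
    problems = []
    for field, expected in GOLDEN_SHAPE.items():
        if field not in payload:
            problems.append(f"missing field: {field}")
        elif not isinstance(payload[field], expected):
            problems.append(
                f"{field}: expected {expected.__name__}, "
                f"got {type(payload[field]).__name__}"
            )
    # Sorted so the report is stable from run to run.
    for field in sorted(payload.keys() - GOLDEN_SHAPE.keys()):
        problems.append(f"unexpected field: {field}")
    return problems
```

An empty list means the payload still matches what clients expect. Anything else is a conversation to have before the change ships rather than after.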
Where this is going
I have no idea. This entire post is an exercise in avant-garde thinking masking itself as a creative outlet. I think I've really just been pondering the past decade of operating infrastructure at scale, with a stubborn affection for a language most of the industry doesn't associate with systems work, a terminal UI written to prove a point to myself, and a stretch of pulling on the threads of accelerated serving, embeddings, scoring, and generative routing. Everything is circular, and it all feels like pieces of the same larger thing.
The substrate matters more than the surface. The connective tissue matters more than the celebrity layer. The discipline of operations (closed vocabularies, durable state, honest signals, bounded fairness) is a natural fit for the kind of platform the AI side of the industry is still figuring out how to build.
I'm going to keep poking at the substrate. The interesting part has always been what gets built on top of one that's been put together with some care.