Compose Once, Deploy Twice

This afternoon I had two fresh Fedora CoreOS (FCOS) hosts to set up, which I was very excited about. FCOS is the OS I keep coming back to: atomic, declarative, sovereignty-respecting in a way most distros aren’t. I knew what I wanted on the new hosts: my favorites. Helix, zsh, tmux, ripgrep, fd, bat, lsd, fzf, zoxide, git with delta and gh, claude, codex, lazygit, lf. I also knew how I wanted to set them up, which mattered to me more than what.

The lazy path was to SSH into each host and install what I needed. The path I wanted was to encode the work in a single script that applies identically to both hosts and to any sibling I spin up next, idempotent at every step. That’s a discipline, not a shortcut. I picked the discipline.

The wrong turn

The first draft of the script had a podman-compose layer in it alongside everything else. It caught my eye before I’d finished reading the file. Both hosts are pure Quadlet, and Quadlet is declarative all the way down. Slipping podman-compose into that picture doesn’t just add a redundant tool. It ruptures the declarative shape that’s the whole reason to run Quadlet. Compose would earn a place only if Quadlet couldn’t do the job, and Quadlet can. I asked the question that made it obvious: aren’t both hosts pure Quadlet? They are. Drop it.

But asking that question changed the work I thought I was doing. I’d been thinking “what tools should I install.” I was actually deciding what substrate layer each tool belongs on and what its update story is. That’s a smaller, sharper problem. Each tool, the same question.

The click

Three layers fell out, each one defined by what the tools in it need from the system:

flowchart TB
    A[Layer 1: rpm-ostree<br/>System citizens<br/>zsh, git, helix, ripgrep, fd, bat, lsd, fzf, zoxide, tmux...]
    B[Layer 2: ~/.local/bin<br/>Static binaries with their own update story<br/>claude, codex, lazygit, lf]
    C[Layer 3: toolbox<br/>Ephemeral compile environments<br/>created on demand, removed after]
    A --> B --> C

Tools that need to behave like system citizens (visible at standard paths, callable from any login shell) earn rpm-ostree’s atomic layer. Tools that ship as static binaries with their own update flow earn ~/.local/bin, where they only need to be visible to me. Compile environments don’t earn either; they live in toolbox containers, ephemeral, gone when the build’s done.

The discipline I want to keep from that partitioning: toolboxes are for compiling. Anything I leave persistent in a toolbox is a maintenance debt I’ll forget about, and FCOS hosts accumulate ghost containers fast if nobody is watching.
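The toolbox rule (created on demand, removed after) can be sketched as a small wrapper. This is a minimal illustration, not the script's actual code: the function name build_in_toolbox and the container name are hypothetical, and the flag spellings follow toolbox(1) as commonly documented, so check them against the installed version.

```sh
#!/bin/sh
set -eu

# TOOLBOX is overridable so the flow can be dry-run with a stub command.
TOOLBOX="${TOOLBOX:-toolbox}"

# Run one build command in a throwaway toolbox, then remove the container
# whether or not the build succeeded.
# Usage: build_in_toolbox NAME COMMAND [ARGS...]   (hypothetical helper)
build_in_toolbox() {
  tb_name="$1"; shift
  tb_status=0
  "$TOOLBOX" --assumeyes create "$tb_name" || return $?
  # Capture the build's exit code instead of aborting, so cleanup still runs.
  "$TOOLBOX" run --container "$tb_name" "$@" || tb_status=$?
  "$TOOLBOX" rm --force "$tb_name" || echo "warning: could not remove $tb_name" >&2
  return "$tb_status"
}
```

Called as `build_in_toolbox build-env make all`, the wrapper returns the build's own exit code, and the container is removed either way, so nothing persistent is left behind to forget about.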

Once each tool had its layer, the script was almost trivial. Phase 1 layers the Fedora packages. Reboot, because FCOS’s atomic layer requires it. Phase 2 drops the static binaries and sets up zsh. I read it back through twice. I felt good about it.

The first host

The first host ran Phase 1 fine. I rebooted, ran Phase 2, my shell flipped to zsh. Done.

I re-ran the script anyway, because that’s the test I actually care about. Does the artifact tell me “nothing to do” when there’s nothing to do? It should.

It didn’t. It told me a new deployment was staged and that I should reboot. But rpm-ostree’s own status had printed “No change” one line earlier. The artifact I’d just been proud of was lying to me about the state of the host.

I went looking for why. The check I’d written counted unmarked deployment lines in rpm-ostree status. What I hadn’t accounted for was that rpm-ostree keeps the previous deployment around as a rollback target after every install-and-reboot. From that point on, the unmarked count is always at least one. I’d treated the count as a signal; it was noise.

I rewrote the check to look at the first deployment line for the active marker. It worked on this host. I committed the fix and moved on. But something was starting to form. I couldn’t have caught that bug in review. The thing the check was wrong about, rpm-ostree’s rollback retention, was outside the script’s assumption space, and outside mine.

A few minutes later, in a fresh zsh session, I typed claude. Nothing. The binary was right there in ~/.local/bin. The shell I’d just made the default couldn’t find it. Neither could it find codex, lazygit, or lf.

I went looking for why again. For bash, Fedora’s /etc/skel/.bashrc adds ~/.local/bin to PATH. For zsh login shells, the equivalent would be /etc/zprofile, which Fedora doesn’t ship. There’s a _src_etc_profile_d function in /etc/zshrc (it sources the scripts in /etc/profile.d/ that would normally extend PATH), but it only runs in non-login shells, so an SSH login skips it.

That was a fact I couldn’t have thought to check for. I dropped a minimal ~/.zshenv and the shell could see its binaries. The shape of the day’s work was getting clearer: each bug I was fixing required knowing something about the substrate I hadn’t thought to verify, and the substrate had to tell me by failing.
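A minimal sketch of what dropping that file could look like inside an idempotent script. The exact contents of the real ~/.zshenv aren't shown here, so the stanza, the marker comment, and the function name ensure_zshenv_path are all assumptions for illustration.

```sh
#!/bin/sh
set -eu

# Append a PATH stanza to ~/.zshenv once; re-runs leave the file untouched.
# The marker comment is a hypothetical convention used to detect a prior run.
ensure_zshenv_path() {
  zshenv_file="${HOME}/.zshenv"
  zshenv_marker='# local-bin-path: managed by setup script (hypothetical marker)'
  if [ -f "$zshenv_file" ] && grep -qxF "$zshenv_marker" "$zshenv_file"; then
    return 0
  fi
  # \$HOME and \$path are escaped so zsh expands them at shell startup,
  # not this script at write time.
  cat >>"$zshenv_file" <<EOF
$zshenv_marker
typeset -U path
path=("\$HOME/.local/bin" \$path)
EOF
}
```

zsh reads ~/.zshenv for every invocation, login or not, which is why it covers the SSH login case that /etc/zshrc skips. `typeset -U path` keeps PATH entries unique, so nested shells don't stack duplicates.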

The second host

The second host was supposed to be the easy run. The bugs were fixed. Five PRs in.

I ran setup.sh. Phase 1 staged the packages and exited with the reboot prompt. I rebooted, reconnected, re-ran, expecting Phase 2.

Phase 2 ran immediately. Before the reboot, on the unrebooted host. chsh -s /usr/bin/zsh aborted because zsh wasn’t there yet. It was sitting in the staged deployment I hadn’t booted into.

That was the third pending-deployment check I’d written today. The first was a canary on whether hx existed; that one assumed my script was the only thing on the host that could produce hx, which was wrong. The second was the unmarked-line count I’d fixed an hour earlier, which had been wrong about rollback retention. The third (look at the first deployment line for the active marker) assumed pending deployments are listed first. That had been true on the first host, because it had a rollback target. It wasn’t true on a fresh install with no rollback, where rpm-ostree lists the booted deployment first and the staged one second.

Three reasonable assumptions about rpm-ostree. Three failures, each the first time the substrate did something the assumption hadn’t predicted.

I stopped trying to be clever about parsing the text output. I used the data structure rpm-ostree itself uses:

rpm-ostree status --json | jq -e 'any(.deployments[]; .staged == true)'

The .staged boolean is canonical. It’s what rpm-ostree marks deployments with when they’re queued for next boot. Stable across rpm-ostree versions and host states. I’d been reading the wrong artifact the whole time. The text output was for humans; the JSON output was for programs. I had let myself forget.
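Wrapped as a gate, the same check might look like the sketch below. The jq filter is the one above; the function names has_staged_deployment and phase2_gate, and the return codes, are hypothetical. Reading the JSON from stdin is a choice made here so the predicate can be exercised with canned input instead of a live host.

```sh
#!/bin/sh
set -eu

# Succeeds when any deployment in the status JSON on stdin is staged.
# The position of the staged entry in the array does not matter.
has_staged_deployment() {
  jq -e 'any(.deployments[]; .staged == true)' >/dev/null
}

# Hypothetical gate for the top of Phase 2: refuse to continue while a
# deployment is still waiting for a reboot.
phase2_gate() {
  status_json="$(rpm-ostree status --json)" || return 2
  if printf '%s\n' "$status_json" | has_staged_deployment; then
    echo "staged deployment pending: reboot, then re-run" >&2
    return 1
  fi
}
```

Because the predicate asks about any deployment rather than the first one, it gives the same answer on a host with a rollback target and on a fresh install without one.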

The pattern

By the end of the day, six substrate findings had come out of this work. One was caught at inspection: the canary, because the wrong assumption was right there in the code as I wrote it. The other five only surfaced when the script ran against a host that violated some assumption I hadn’t thought to state:

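That the count of unmarked deployment lines signals a pending deployment. rpm-ostree keeps a rollback target after every install-and-reboot, so the count is always at least one.
That Fedora ships /etc/zprofile. It doesn’t.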
That rpm-ostree text output lists staged deployments first. Only when a rollback target exists.
That jq requires layering. The FCOS base provides it; six Phase 1 runs printed “Inactive requests: jq” before I noticed it was telling me something.
That my workstation’s ~/.ssh/config reflected the real network state. It didn’t; a chezmoi divergence sent ssh to a retired Tailscale IP and knocked out my access mid-session.

Each of those was “a fact that doesn’t make it into a code review because nobody knows to ask the question.” The substrate knows. The substrate doesn’t have opinions about what should be true; it only reports what is.

I had been treating review as the place where bugs get caught. Today I watched review catch one bug and the substrate catch five. That ratio is the lesson. Review checks my model of the system. Deployment checks the system. When my model and the system agree, both find the same bugs. When my model has a gap I don’t know about, only the substrate can reveal it, and it reveals it by failing.

Compose once, deploy twice

The shape of the discipline I want to keep from today: encode the artifact in one canonical place, then deploy it against at least two hosts that differ in dimensions I suspect matter, and at least one host that differs in dimensions I don’t.

The two hosts I worked on differed in obvious ways (workload, hostname, network position). They also differed in one I hadn’t thought to catalogue: whether the host carried a rollback target. That uncatalogued difference produced two of the six bugs. I needed both hosts to find them, and I would have needed both hosts no matter how careful my review was, because the dimension that mattered wasn’t in my model.

One host is a hypothesis. Two hosts are a test. By three, the substrate is a regression suite. Every bug that surfaces from one host gets fixed in the canonical artifact. The next host inherits the fix. The artifact compounds across deployments, because the substrate has been answering questions I didn’t know to ask.
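As a rough sketch of treating hosts as a test suite: re-run the canonical script on every host that should already be converged, and require each one to report a no-op. Everything here is illustrative. The run_setup transport, the remote ./setup.sh path, the check_converged name, and the "nothing to do" marker string are all assumptions.

```sh
#!/bin/sh
set -eu

# Hypothetical transport: run the canonical script on one host.
# Kept as a function so it can be replaced with a stub.
run_setup() {
  ssh "$1" ./setup.sh
}

# Re-run the script on each already-converged host and require a no-op report.
# Prints one line per host; the return value is the number of failing hosts.
check_converged() {
  cc_failures=0
  for cc_host in "$@"; do
    if cc_output="$(run_setup "$cc_host" 2>&1)" &&
       printf '%s\n' "$cc_output" | grep -qi 'nothing to do'; then
      echo "ok   $cc_host"
    else
      echo "FAIL $cc_host"
      cc_failures=$((cc_failures + 1))
    fi
  done
  return "$cc_failures"
}
```

A host counts as failing if the script errors out or if it reports anything other than a no-op, which is the same standard the first host failed on its second run.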

Call it test-driven infrastructure. That’s what it is.

The substrate doesn’t lie. The substrate doesn’t perform. Only deployment asks it the question.

fcos-host-setup on GitHub