Skip to main content

Don't Migrate Logs. Outlive Them.

· 10 min read
Elisha Sterngold

Dual-write machinery on one side, a single line in time on the other

Don't Migrate Logs. Outlive Them.

For two months, every log written into Shipbook was written twice.

Once into the old Elasticsearch cluster, the one we had been running for years. And once, in parallel, into a new one — a different Elasticsearch version, with a completely different way of organizing the data. Dual-write. The standard playbook for moving a database you cannot afford to lose: keep both copies, keep them in sync, prove the new one is right, then flip.

It worked. We are fully on the new cluster now; the old one is gone. But somewhere in the middle of those two months, a much smaller change — one I was forced to handle a completely different way, because dual-write was physically impossible for it — quietly showed me that I had built the entire rest of the migration the hard way.

That smaller change took one line of code and a calendar reminder. The big one took two months of machinery and a steady drip of bugs. They were solving nearly the same problem.

One Dual-Write for Two Changes

I was changing two things at once, and I folded them into a single migration.

The first was the engine: a jump from Elasticsearch 7 to Elasticsearch 9. Years of accumulated reasons — features, a saner storage-tiering model, the end of a version aging out from under us.

The second was the schema. For most of the life of Shipbook — our logging platform — we created one index per customer per day. At our scale that meant thousands of tiny indexes, most nearly empty, all of them taxing the cluster in shard count and metadata churn — the structural problem I wrote about in Driving Our Error Log to Zero. The new shape shares one index across every customer on the same plan, routes writes by account so each customer's data stays together, and gives dedicated indexes only to whale accounts — the handful pushing more than twenty million logs a day.

Both changes went into one dual-write. New cluster, new schema, all at once. Every log landed in the old per-customer index on the old cluster and in the new shared per-plan index on the new one. A backfill job scrolled the entire history across. A flag — readFromNew — kept reads coming from the old, trusted side while the new side filled, ready to flip when I believed it.

That is the textbook approach. The textbook is also where the textbook pain lives.

The Tax on Keeping Two Truths

Dual-write means every write now has two outcomes, and they can disagree.

The secondary write was fire-and-forget — I did not want a hiccup on the new cluster to block a real request on the old one. So when the new cluster pushed back under load, writes to it failed silently. Nothing surfaced, because reads were still served from the healthy old side. The gap only existed on the cluster nobody was reading from yet.

That exact failure became its own story: 1.3 million silent errors a day, accumulating on the secondary side of this very migration, invisible to every customer. It was future damage — fuel loading quietly behind a flag, waiting for the day reads flipped over to detonate as missing sessions and stale fields. We caught it only because Claude could read the production logs and a number on a histogram did not fit the story.

That is the dual-write tax. You are not running one system; you are running two systems and a promise that they match. Every write path carries a branch. Every error makes you ask which cluster it came from. The whole apparatus exists to keep two copies of the truth in sync — and keeping two copies of the truth in sync is one of the genuinely hard problems in computing. We paid the tax for two months. The commit that finally deleted the old cluster's code path is dated almost two months after the one that added the dual-write.

The One Change I Couldn't Dual-Write

In the middle of all that, I needed to add routing to the new per-plan indexes — to tell Elasticsearch to place every account's logs together on the same shard, so a query for one customer hits one shard instead of fanning out across all of them.

And I could not dual-write it. Not because of effort — because it is impossible. Routing is decided at the instant a document is written; it determines which shard the document physically lands on. The logs already written to the new indexes were already placed, scattered across shards by their id. There is no second write that fixes where the first one went. My only options were to reindex everything — expensive, and exactly the kind of bulk data movement I was trying to avoid — or to draw a line.

So I drew a line. One constant:

private static readonly RET_ROUTING_TEMPLATE_CUTOVER = '2026-04-28';

Indexes created before that date are read the old way, without routing assumptions. Indexes created on or after it carry routing. The reader checks the date in the index name and does the right thing for each. And then — because every log has a retention and a nightly job deletes the ones past it — the pre-cutover indexes simply age out. A thirty-day index is gone thirty days later. Once the last one expires, the special case is dead code.

// TODO ~2026-05-28: drop the 30d guard once pre-cutover indexes age out.

That TODO is the entire decommission plan. Not a migration. A reminder. No second write, no backfill, no flag to flip, no held breath. The only cost was patience: for a few weeks, some reads spanned both the old and new index shapes before the old ones evaporated.

It was so much calmer than the thing happening all around it that I almost did not notice it was the same kind of problem.

Where Claude Helped, and Where It Led Me Astray

I did this work in a tight loop with Claude Code, and the honest accounting cuts both ways.

It earned its keep on the hard, fiddly parts. The hot/warm storage tiering — recent data on fast NVMe, older indexes rolled down to cheaper disk — it designed cleanly. When the new per-plan indexes started rejecting writes every midnight, it worked out why: routing by account funneled a burst of high-volume accounts onto one shard the instant each fresh daily index appeared. And it caught a missing shards-per-node cap that had piled a whale's ten shards onto two machines.

It also walked me into walls with total confidence.

It opened by reasoning about Elasticsearch 7 — because it saw 7.17 in our package.json. That was the client library; the cluster was already on 9. Half its early advice applied to a system we no longer ran.

It recommended an index setting to read each index's age from its name. Reasonable, except Elasticsearch's date parser wants hyphens and our names use underscores. It did not fail in review. It failed in production, on all 325 existing indexes at once, every one stuck in an error state until I tore the setting back out by hand.

And for a stretch it argued against the hot/warm tiering I wanted — insisting it would not come out ahead, that there was nothing to gain. The argument rested on a cost calculation, and the calculation was simply wrong: my configuration cost the same either way, and the tiered shape was the right one for the workload. Not a defensible judgment built on a real number — a wrong number, stated with exactly the confidence of a right one.

The lesson is the one from every post I write about this: a confident wrong number reads identically to a confident right one. The model will reason beautifully and calculate badly in the same breath, and my job is to make it show the arithmetic before I act on the conclusion.

And the decision itself stays mine. The model can fetch, calculate, and propose — but the moment you let Claude, or any AI, do the deciding, you have handed the wheel to something that is wrong and right with the same face, and it will walk you off a cliff with perfect composure. For the moment, at least, that is the one job you do not delegate.

What I'd Do Differently

The routing change was forced into the cheap approach. I did not choose the cutoff date because it was elegant; I chose it because dual-write was off the table. And the forced move turned out to be the one I should have made everywhere.

If I ran the cluster migration again, I would not dual-write it. I would give it a cutoff date too. Stand up the new cluster, point new writes at it, leave the old cluster running and serving the logs it already holds. Read from both during the overlap and merge. Then wait — and the day the oldest log on the old cluster passes its retention, the old cluster holds nothing anyone needs, and you switch it off.

No second copy of the truth to keep in sync. No backfill dragging the whole history across the wire. No silent secondary-write failures loading future damage behind a flag. The price is real and worth naming: you keep two clusters running for as long as your longest retention tier — up to a hundred and eighty days for us — instead of cutting over in a weekend. For a dataset that deletes itself, that is almost always the cheaper trade.

I reached for dual-write because that is what you reach for when you migrate a database. But logs are not most databases. They expire. The hard part of a migration is usually moving the history — and with logs there is no history to move, only a present that, left alone, becomes the past and then becomes nothing.

The Dull Work Customers Could Feel

I have spent this whole post second-guessing how I ran the migration. I want to be just as clear about the thing I do not second-guess: running it at all.

After the new infrastructure settled, the feedback started arriving on its own — from several customers, unprompted. The server feels faster. Searches come back quicker. Nobody files a support ticket asking you to reorganize your Elasticsearch indexes — but the speed-up was real enough that they noticed it on their own and told us.

That is the part worth holding onto. Infrastructure work is the least glamorous work there is. It ships no feature, closes no requested ticket, and shows up on no roadmap anyone cheers for. It is dull — right up until you discover that the dull work was what everything visible had been quietly resting on the whole time. The plumbing is invisible until someone notices the water is suddenly faster out of the tap. The schema change I almost talked myself out of as not-worth-it turned out to be something paying customers could feel.

So both things are true at once. I would change how I did it. I would not, for a second, change that I did it.

The one change I could not force into the dual-write was the one that told me the truth. Next time I will draw the line first.