5 posts tagged with "ai"

Elisha Sterngold · 9 min read

Claude querying production logs through the Shipbook MCP

The Bug Hiding in Our Logs That AI Almost Helped Us Miss

A unit test was timing out on my laptop. The kind of small annoyance you push to next week. I asked Claude to look at it.

A couple of hours later we had a fix. We also had something I did not expect: confirmation that the same bug, in a slightly stricter form, had been firing 1.3 million times a day in production for the past day and a half — silently, in a fire-and-forget error path nobody was watching.

The fix is uninteresting. The way we got there is the part worth writing about.

The Hypothesis That Sounded Small

While Claude was poking at the local failure, it offered a guess about the production impact. The buggy code path was reached only by a couple of background jobs, it said. Bounded scope. The kind of thing you fix in the next deploy window, not at midnight.

It was a confident answer. Two callers. Job tier. Done. Claude did add a passing line at the end — "worth grepping production logs to confirm" — but it landed the way most footnotes land: as a thing you nod at and move on from. The whole framing pointed away from urgency.

I almost moved on. The fix was ready, the tests were green, and there was nothing in the model's analysis that suggested otherwise. What I did instead — and this is the only thing I want to take credit for in this whole story — was not accept the confident answer. I told Claude to actually check. Pull the production logs. See if the bug was, in fact, contained.

That single sentence is the difference between this story and the much more boring version of it. Claude had given me a tidy diagnosis. The diagnosis was wrong. And the only reason I knew to push back is that I have learned, slowly and at some cost, that "bounded scope" answers from a confident model are exactly where I should be most suspicious. Not because the model is reckless. Because confident reasoning in the absence of data is just well-formed guessing.

So Claude called the Shipbook MCP and pulled a seven-day histogram of the relevant error.

The histogram looked like a wall. Six errors a day. Six. Four. Three. Six. Then a cliff. Then 1.3 million.

A number that high cannot come from a job that runs every five minutes. The original "background jobs only" story was wrong, and it was wrong by an order of magnitude that none of us — engineer or model — would have inferred from the code alone. The cliff also lined up exactly with a constant in our own code that had switched on stricter behavior in our new infrastructure. The story was no longer "small bounded bug." It was "we have been silently breaking writes to a whole cluster since the moment we deployed that constant, and we did not know."

The MCP did not just confirm a hypothesis. It overturned a benign one and replaced it with a real one.

What the Tight Loop Actually Buys You

I want to be careful about what is new here, because plenty of tools claim some version of "AI debugs production." Most of them are wrong.

What is new is not that an AI can look at logs. It is that the model can stay inside its hypothesis-and-test loop without leaving the conversation. The engineer's job, in that loop, is no longer to fetch data. It is to decide which questions are worth asking. The model handles the rest.

The volume on that histogram did not just upgrade the priority of the bug. It corrected the model. With the number on the table, Claude re-read the code, found a caller it had missed — on the request path of every active session — and revised its picture of the situation. The interesting thing is not that the model corrected itself, but what corrected it. It was not better reasoning. It was a number that did not fit the story. A model that cannot see production will keep telling stories that do not survive contact with it. A model that can see production gets caught when it is wrong, and updates.

This is the loop we are betting on: the model proposes, the data disposes, and both stay in the same conversation while it happens.

Two Bugs for the Price of One

While we were already inside the data, we noticed something we had not gone looking for. The first bug had been masking a second one. A customer we had recently promoted to a higher tier was supposed to have their traffic distributed across several pieces of dedicated infrastructure. Because of a related routing decision, all of their traffic had been piling onto one of those pieces, with the others sitting empty.

We would have caught this eventually. Probably the next time someone looked at a latency dashboard and noticed an asymmetry. Probably weeks from now. We caught it today, because we were already there. It cost us one extra question.

This is, I think, an underrated effect of having a low-friction loop. When the cost of looking is low, you look more. When you look more, you find things you would not have specifically gone hunting for. Cheap curiosity compounds.

Logs Nobody Queries Are Not Insight

The bug that bled 1.3 million times a day was logged faithfully. Every single error was caught, formatted, written out — and ignored. It was not in a dashboard. It did not page anyone. The error path was deliberately fire-and-forget, because the alternative would have been a request-blocking failure mode that we did not want.

A log no one queries is closer to a tree falling in an empty forest than to information. It exists, but it does not act on the world. Most production systems are full of these. We have built a generation of observability tools that are excellent at producing logs and mediocre at producing attention.

The interesting question is not "how do we collect more logs." It is "how do we make the logs that already exist legible to the people, or the agents, who can do something about them."

Ground Truth and the Model

There is a thing AI systems are bad at, which is making confident claims about specific facts they cannot verify. There is a thing they are good at, which is reasoning over data once that data is in front of them.

The trick is to put the data in front of them. Not as a paragraph in a prompt — as a tool call. Not as yesterday's snapshot — as live state. The model should not be guessing what production looks like; it should be looking.

That is what an MCP server does. It turns whatever capability you wire up — logs, in this case — into a function the model can call when it decides it needs the answer. The model's reasoning is no longer floating; it is anchored to whatever ground truth you have given it access to.
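
To make that concrete, here is a minimal sketch of such a server in TypeScript using the official MCP SDK. The error_histogram tool and the queryLogStore helper are illustrative inventions, not Shipbook's actual MCP API; the point is the shape: a capability becomes a typed function the model can call mid-conversation.

// A minimal MCP server exposing production logs as a callable tool.
// Illustrative only: "error_histogram" and queryLogStore are hypothetical,
// not Shipbook's actual MCP API.
import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";
import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";
import { z } from "zod";

const server = new McpServer({ name: "logs", version: "0.1.0" });

// One capability: a daily error histogram the model can request whenever
// it decides it needs ground truth instead of a guess.
server.tool(
  "error_histogram",
  "Daily counts of a given error message over the last N days",
  { message: z.string(), days: z.number().int().positive().default(7) },
  async ({ message, days }) => {
    const counts = await queryLogStore(message, days);
    return { content: [{ type: "text", text: JSON.stringify(counts) }] };
  }
);

async function queryLogStore(message: string, days: number): Promise<number[]> {
  // Placeholder: wire this to whatever store actually holds your logs.
  return new Array(days).fill(0);
}

await server.connect(new StdioServerTransport());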

We have spent the last few months wiring our own product into our own conversations. Our logs sit a tool call away from Claude. So do our deployments, our error analytics, our session traces. The result is not magic. It is something quieter and more useful: an engineer and an agent, looking at the same reality at the same time, and reasoning about it together.

The Footnote That Matters

There is one more thing worth saying, because it is the most honest part of the story. No end user was affected by any of this.

The cluster where the writes were failing was the new one — the secondary side of an in-progress migration. Reads were still being served from the original cluster, which was healthy and complete. The 1.3 million errors a day were silently accumulating inside our own infrastructure, on a side of the system no customer was reading from. To anyone using Shipbook today, everything looked fine. It was fine.

The damage was strictly future damage. The day we cut over reads to the new cluster, every one of those failed updates would have surfaced as a missing field, a stale session, a piece of context that should have been there and was not. The bug was a quiet, slow accumulation of data drift that would have detonated the moment we trusted the new cluster.

I almost did not include this section because it felt like deflating the drama. But it is actually the point. The most valuable bugs to catch are the ones that are not yet on fire — the ones that are loading their fuel quietly, behind a flag, on a path nobody is yet looking at. Those are the ones that turn into postmortems six months later, when the team has forgotten the context and the data is unrecoverable.

We did not catch a customer-facing crisis today. We caught the seed of one, while it was still small enough to fix in an afternoon. That is the kind of catch a tight loop with production data makes possible. Not heroic firefighting. Just earlier noticing.

The Quiet Lesson

I keep coming back to one thing. The bug we found today was not a hard bug. It was a small, bounded, fixable mistake — the kind of thing any engineer would have spotted in five minutes if they happened to be looking at the right log line.

Nobody was looking. Nobody was going to look. It would have hidden until the migration finished and the damage became visible.

The thing that broke this pattern was not a smarter model or a better dashboard. It was that the model could ask the question, and the answer was right there, in the same conversation. That is the loop we are betting on. It feels small from the inside. From the outside, I think it is going to look like a quiet shift in how production systems are kept healthy.

A flaky test on a Wednesday afternoon is not where I expected to start writing about that. But here we are.

Elisha Sterngold · 9 min read

The old way vs. the new way of shipping features with AI agents

AI Agents Don't Just Change How You Code — They Demand a New Kind of Company

The Quiet Disappointment of "AI Adoption"

Almost every company I speak with has "adopted AI." Engineers are using Claude Code, Cursor, Copilot, Codex. Designers are in Figma with AI plugins. Marketing teams have entire workflows built around generative tools. On paper, the revolution has arrived.

And yet, something is off. Recently I asked a senior executive at a large software company whether their development was now going several times faster. I expected at least a careful "two or three times." The answer was no. Not two times. Not even close. He then added the sentence that has stayed with me since: "Development is only about 20% of the time it takes to ship a feature. Even if you make that part infinitely fast, the whole system isn't much faster."

That is the story of AI adoption right now. Companies plugged AI agents into individual seats, but they did not touch the scaffolding around them — the stages, the approvals, the handoffs, the committees. The work still moves through the same pipes. The pipes themselves are the bottleneck.

Faster Tools in a Slower System

Imagine a factory where every worker suddenly gets a tool that is ten times more productive. If the conveyor belts still run at the old speed, if every part still needs six signatures to move forward, if QA still happens at the end in a single batch, the factory does not produce ten times more. It produces roughly the same, with more people waiting between stations.

That is modern software development in 2026. The coding step has genuinely collapsed. A feature that once took a week can often be drafted in an afternoon. But the surrounding choreography — the kickoff meetings, the design reviews, the stakeholder approvals, the pixel‑perfect handoff, the QA cycle, the release train — is untouched. So the feature still takes a month. The AI did its job; the company did not.

The industry keeps measuring the wrong thing. We compare lines of code per hour, or pull requests per engineer. The real question is how long it takes for an idea, a customer insight, or a competitive threat to become a shipped feature in the hands of users. That number, for most companies, has barely moved.

Rethinking the Design‑Then‑Development Handoff

Consider one of the most sacred sequences in product work: design first, then development. Designers create a spec. Developers implement it pixel‑perfect. Back‑and‑forth for edge cases. Another round of review. Eventually it ships.

This sequence made sense when implementation was expensive. You did not want engineers guessing. You wanted them cutting once, after measuring twice. Design was the cheap stage; development was the costly one, so you resolved ambiguity before writing code.

AI agents invert these economics. Implementation is no longer the expensive stage. Generation is nearly free. And yet we continue to hand agents a frozen design and ask them to faithfully translate it. We are still optimizing for a scarcity that no longer exists.

A better approach: describe the intent, the constraints, and the brand guardrails, then ask the agent to generate three full, working, interactive pages. Not mockups. Not prototypes. Real pages, wired to the backend, navigable in a browser. Then, as a team, you look at the three options and pick the one that actually feels right. You iterate from inside the working product, not from a static file.

Notice what just happened. You skipped a whole stage. The separate "design approval, then build" loop collapsed into a single "explore, pick, refine" loop. The code is already written — not because you rushed it, but because producing it was cheap enough that generating three variants cost less than arguing over one. Interaction is no longer something you negotiate through screenshots; it is something you experience directly.

Letting the Agent Surprise You

There is a second, subtler shift. The traditional handoff assumes the human knows the right answer and the implementer is there to reproduce it. But AI agents do not merely reproduce — they propose. Sometimes what they produce is worse than what you had in mind. Sometimes it is roughly equivalent. And sometimes, honestly, it is better.

The discipline we need is the ability to look at what an agent built and ask two separate questions: Is this actually worse than what I wanted, or is it just not what I imagined? Those are not the same thing. The first is a real problem. The second is ego dressed up as taste.

Companies that treat every deviation from the brief as a defect will grind to a halt correcting things that were never actually broken. The ones that treat deviations as candidate improvements — evaluated on their own merit, not on their fidelity to the original mental image — will move dramatically faster and, more often than not, arrive somewhere better than they originally planned.

Flatten the Company or Lose the Race

None of this works if the process wrapping the product is still hierarchical. If every change still needs to climb a ladder of approvals — product manager, design lead, engineering lead, VP, exec sponsor — the agent's speed is wasted waiting in queues. The agent can refactor your entire system over lunch. Your Slack thread to get sign‑off will still take three days.

Dynamic, flat companies are about to run circles around large, permission‑heavy ones. Not because the engineers are smarter, but because the distance between a good idea and a shipped feature is short. When you combine genuine AI superpowers with a small, empowered team that does not need committee approval to try something, you get cycle times that used to belong to demo‑day prototypes — but with production quality.

This is uncomfortable for established organizations, because the approval chains were not accidents. They were scar tissue from real incidents. Someone shipped something dangerous, so we added a review. Someone broke production, so we added an approval. Every gate has a story behind it. Removing them feels reckless.

But those gates were designed for a world where mistakes were hard to detect and expensive to reverse. That is not our world anymore.

The New Checks and Balances

Flattening a company does not mean removing discipline. It means moving the discipline to where it actually protects you.

The new center of gravity is simple: nothing should be breaking. Not "nothing should be unfamiliar." Not "nothing should deviate from last year's patterns." Nothing should be breaking. That is the invariant worth defending.

How do you defend it when code is flowing in from AI agents ten times faster than before? Not with more human reviewers in the loop — that is exactly the pipe that is already clogged. You defend it with systems that run at the same speed as the agents:

  • Unit tests that actually assert behavior, not placeholder shells. When an agent changes a function, the tests catch regressions before the PR is even reviewed (a minimal sketch follows this list).
  • Integration tests that exercise real flows — real database, real services, real boundaries. Mocked integrations pass while production burns. Real ones tell the truth.
  • CI/CD that runs automatically and refuses to merge anything that breaks the invariants.
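
To make the first of these concrete, here is a minimal sketch of a test that asserts behavior, using vitest. The retryWithBackoff helper is hypothetical and inlined so the file is self-contained; what matters is that the assertions encode invariants an agent cannot silently break.

// A test that asserts behavior, not a placeholder shell. Uses vitest;
// retryWithBackoff is a hypothetical helper, inlined for self-containment.
import { describe, expect, it, vi } from "vitest";

async function retryWithBackoff<T>(
  call: () => Promise<T>,
  opts: { attempts: number }
): Promise<T> {
  let lastError: unknown;
  for (let i = 0; i < opts.attempts; i++) {
    try {
      return await call();
    } catch (err) {
      lastError = err;
    }
  }
  throw lastError;
}

describe("retryWithBackoff", () => {
  it("retries a transient failure and returns the eventual success", async () => {
    const call = vi.fn()
      .mockRejectedValueOnce(new Error("transient"))
      .mockResolvedValueOnce("ok");

    await expect(retryWithBackoff(call, { attempts: 3 })).resolves.toBe("ok");
    expect(call).toHaveBeenCalledTimes(2); // the invariant: exactly one retry
  });

  it("gives up after the configured number of attempts", async () => {
    const call = vi.fn().mockRejectedValue(new Error("down"));

    await expect(retryWithBackoff(call, { attempts: 3 })).rejects.toThrow("down");
    expect(call).toHaveBeenCalledTimes(3);
  });
});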

But tests and CI/CD only tell you whether the code behaves the way its author expected. They do not tell you what is actually happening in production — under real load, with real users, with real data no test fixture would think to invent. That is where logs come in, and it is where most teams underinvest most badly.

Logs are the ground truth of a running system. They record what the code actually did, not what it was supposed to do. In the old world, logs were mostly for humans — a developer grepping through files after something had already gone wrong. In the AI‑agent world, they become something far more powerful: the substrate agents reason over. Feed production logs into an agent and it can tell you which errors share a root cause, which users are hitting which failure paths, which feature you shipped yesterday quietly broke something you did not notice. The loop closes: the system reports on itself, the agent reasons over that report, and problems surface before a human has to spot them. This is exactly the gap we are building Shipbook to close — turning logs from an afterthought into the ground‑truth layer that both humans and AI agents depend on.

When all of this is in place — tests, CI/CD, and a genuine ground‑truth layer from production — something remarkable happens. The speed at which you can try new features, validate them under real usage, and ship them becomes enormous. Not because you removed the checks, but because the checks now run at machine speed instead of meeting speed.

Staying Ahead of the Curve

The companies that will lead the next decade are not the ones with the most AI licenses. They are the ones that used AI adoption as a reason to rebuild how they work.

That means fewer stages and more loops. Fewer approvals and more guardrails. Less "pixel perfect from spec" and more "pick the best of three working options." Less deference to what the brief said, more willingness to recognize that the agent's suggestion might actually be better. A flatter structure. A deeper investment in tests, observability, and logs as the real safety net.

AI agents are an enormous gift. But a gift you do not unwrap is just clutter. If your company still looks and moves the way it did in 2022, you have not really adopted AI — you have just bolted it onto a machine that was not built for it. The teams that redesign the machine itself are the ones who will still be here, shaping the market, when the rest are wondering why their impressive tools never quite delivered.

Elisha Sterngold · 7 min read

Dogfooding Shipbook

Why We Started Dogfooding Shipbook — and What We Found

There is an old principle in software: use what you build. The industry calls it dogfooding — short for "eating your own dog food," meaning if you build a product, you should use it yourself. The logic is simple: if it is not good enough for you, it is not good enough for your customers. If you are not experiencing your own product the way your users do, you are flying blind. You can read every support ticket, study every metric, and still miss what it actually feels like when something goes wrong.

For years, Shipbook was a logging platform built for mobile apps. Our SDKs covered iOS, Android, Flutter, and React Native. Our users were mobile developers — and so was I. Before Shipbook, I was VP R&D at a mobile app services company, which is where the idea for Shipbook was born. But time passes. As the product grew, I spent less time inside mobile apps and more time building the platform itself — the web console, the backend, the infrastructure. Even though Shipbook was born from the mobile world, our own day-to-day work had moved beyond it. Our stack — a web console, a Node.js backend — lived outside the reach of our own tools. We could see our users' logs, but we could not see our own.

That had to change.

Building the SDKs We Needed

We added a browser SDK and a Node.js SDK to our JavaScript SDK family. The main motivation was to close the gap between us and our users. We wanted to instrument our own web console and backend services with Shipbook, to see our own errors, our own warnings, our own performance issues flow through the same pipeline our customers rely on.

The moment we deployed them, something shifted. Issues that had previously been abstract — a user reporting a vague problem, a spike in an error metric — became concrete. When the console threw an error, we saw it in our own logs. When the backend hit an edge case, we felt it. There is a fundamental difference between reading a report about a problem and encountering it yourself while doing your own work.

The Difference Between Hearing and Feeling

When a user files a bug report, you investigate. You look at the evidence, reproduce the issue if you can, and fix it. It is a rational process. But when you yourself hit that same issue — when you are in the middle of debugging something else and the console behaves unexpectedly — you feel it differently. The urgency is not simulated. The frustration is not secondhand. You do not need a priority score to know this needs fixing.

This is the real argument for dogfooding. It is not just about finding bugs. It is about changing your entire relationship to your product. When you use it daily, you do not only discover what is broken — you discover what is missing. You feel the features that should exist but do not. You notice the workflows that are clunky, the information that is hard to find, the capabilities you keep wishing you had. No feature request from a user can replicate that intuition. Dogfooding drives not just quality, but product direction.

We started noticing things our users had probably noticed for months. Small annoyances, missing capabilities, rough edges that individually seemed minor but collectively degraded the experience. The kind of things that rarely make it into a support ticket because no single one is worth reporting — but that erode trust over time.

When Claude Met Our Logs

The most surprising chapter came when we connected Claude to our Shipbook data through an MCP server. We gave it access to our Loglytics — the aggregated error analytics that Shipbook computes across sessions — and asked it to analyze the issues.

Loglytics was already doing a good job — it had identified around twenty distinct error patterns, grouped and ranked by frequency and impact. That alone is valuable. But what happened next is where the real power showed up.

Claude processed the full Loglytics output and consolidated the twenty issues into four root causes. Not four categories — four actual underlying issues. And then it did not stop at the analysis. It went ahead and fixed them directly in the code.

This is where the combination of Shipbook Loglytics, MCP, and Claude Code becomes something greater than its parts. Loglytics surfaces the problems. MCP gives Claude access to that data. And Claude Code has the ability to reason about the issues and write the fixes. The entire loop — from detection to diagnosis to resolution — happened without manual triage, without prioritization meetings, without context-switching between tools.

Ground Truth Changes Everything

This experience crystallized something we had been thinking about for a while: AI is most powerful when it has ground truth to work with. Give a model a vague question and you get a vague answer. Give it structured, real-world data — actual logs, actual error patterns, actual stack traces — and it becomes remarkably precise.

Our logs were not just records of what happened. They were the ground truth that made Claude's analysis trustworthy. Every conclusion it drew was anchored in real data from real sessions. There was no hallucination risk because the model was not speculating — it was reasoning over facts.

This is where logging and AI intersect in a way that feels genuinely new. Logs have always been valuable for debugging. But when you feed them to a model that can reason about patterns at scale, they become something more: a foundation for proactive quality improvement.

Fixing Issues Before Users Complain

The most valuable outcome was not just finding the four root causes. It was fixing them before our users had to ask. We knew from the data that these issues were affecting real sessions. We could see the frequency and the impact. But no one had filed a ticket yet — or if they had, it was buried in vague descriptions that did not point to the root cause.

By dogfooding our own system and letting AI analyze the results, we moved from reactive support to proactive improvement. We did not wait for complaints. We saw the problems ourselves, understood them deeply, and resolved them.

This is the cycle we now believe every product team should aim for: use your own product, instrument it thoroughly, feed the data to AI, and fix what it finds — before your users have to tell you something is wrong. It is a higher standard of quality, and it is only possible when you are willing to be your own harshest critic.

The Lesson

Building a product you do not use yourself is like writing a book you never read. You might get the structure right. You might catch the obvious errors. But you will miss the experience — the pacing, the friction, the moments where things just do not feel right.

We should have dogfooded Shipbook sooner. Now that we have, we cannot imagine going back. Every issue we find in our own usage is an issue we fix for everyone. And with AI turning our logs into actionable insights, the gap between "something is wrong" and "here is exactly what to fix" has never been shorter.

Elisha Sterngold · 8 min read

Programming in the Age of AI Agents

When Experience Becomes a Liability: Programming in the Age of AI Agents

The Comfortable Lie About AI and Seniority

At the beginning of the AI‑agent era, the industry converged on a reassuring belief: AI would transform software development, but not its hierarchy. Juniors would be displaced first. Seniors would become more valuable than ever. Someone would need to supervise the AI agents, to judge their output, to understand when a confident answer was actually dangerous. Only engineers with real scars—production outages, failed rewrites, architectural dead ends—could possibly do that job.

This belief made sense at the time. Early agents were impressive but shallow. They wrote code fluently but reasoned poorly at the system level. Supervision meant knowing where they hallucinated, where abstractions leaked, where reality diverged from elegance. Architecture was still heavy, expensive to change, and deeply shaped by historical accidents. Experience mattered because memory mattered.

But that belief quietly depends on an assumption that is no longer holding: that AI agents will remain weaker than humans at global reasoning, and that supervision will always mean catching the model when it is wrong. Once agents become strong enough to ingest entire codebases, replay years of architectural evolution, simulate migrations, and explore alternative designs, supervision itself changes meaning. With foundation models at the level of Opus‑4.5, GPT-5.2-Codex, and Gemini 3 Pro as raw reasoning engines—and with agent layers built on top of them such as Claude Code, GitHub Copilot, Cursor, Antigravity, and high‑autonomy agent frameworks—we are no longer dealing with assistants. We are dealing with machines that can explore architectural space directly. And this direction is only accelerating as models improve and agent layers grow more autonomous, more stateful, and more deeply embedded in real development workflows.

The Collapse of Architectural Permanence

Architecture, in this world, stops being sacred. For decades it was heavy because changing it was dangerous. Decisions hardened not because they were optimal, but because revisiting them was too costly. Seniority emerged as authority because it carried memory: memory of why something was split, why it failed, why it was glued back together, and why certain areas were never to be touched again. Architecture was history embodied in code.

AI agents dissolve this monopoly on memory. When the past becomes machine‑readable—every commit, every incident, every failed experiment—history stops living only in human heads. It becomes something you can query, simulate, and challenge. Architecture becomes a version, not a monument. It can be branched, stress‑tested, rewritten, and rolled back. Once the cost of change collapses, experience as stored trauma loses its central power.

Experience as Emotional Debt

This is where the uncomfortable reversal begins. Senior engineers carry not only knowledge, but standards—deeply internalized ideas about how code should look, how systems should be structured, and what "good engineering" means. These standards were earned through hard experience: migrations that nearly killed the company, rewrites that went nowhere, abstractions that promised clarity and delivered outages. But the need to ensure that new code conforms to these standards slows everything down. Every change must be carefully shaped, reviewed, and aligned with an existing mental model of correctness.

That caution is wisdom, but it is also friction. It turns supervision into enforcement and progress into negotiation. Seniors are not only guarding the system from breaking; they are guarding it from deviating. And in a world where AI agents can rewrite, test, and validate entire systems quickly, this insistence on conformity becomes a form of inertia. It encodes a sense of what is too dangerous—or simply too unfamiliar—to try, even when the tools that once made deviation risky no longer exist.

The Junior Advantage

Juniors carry none of this. They have no architectural nostalgia, no sunk cost, no identity tied to the current shape of the system. They look at a codebase not as a legacy to be preserved, but as something provisional. In the pre‑agent world, this made them naïve. In the agent world, it can make them powerful.

Because now the agent carries the memory. Claude Code can reconstruct why a refactor failed five years ago. Codex can explore alternative implementations without touching production. Cursor can let you navigate, rewrite, and validate entire systems in hours instead of months. The junior no longer needs to remember the past—the system can replay it. What the junior brings instead is a willingness to ask whether the past should continue to define the future.

Vibe‑Coding in a Serious World

This is also where vibe‑coding stops being a joke. In the early days, vibe‑coding—iterating quickly with AI, trusting intuition, caring more about flow than formal design—looked irresponsible. But once you combine it with strong AI agents, deep test generation, and fast rollback, vibe‑coding becomes a way to explore reality rather than speculate about it. It is not anti‑discipline; it is discipline shifted from up‑front certainty to rapid validation.

Judgment Over Experience

The old claim was that only seniors could supervise AI agents. But as agents become capable of self‑critique, simulation, and multi‑path reasoning, supervision stops being about catching mistakes and starts being about choosing directions. The problem is no longer that the AI cannot reason deeply enough. It is that it can reason in too many directions at once. The scarce resource becomes judgment, not experience.

Judgment is not the same as experience. Experience is backward‑looking; it encodes what failed. Judgment is forward‑looking; it decides what is worth risking now. Experience says, “This was a disaster once.” Judgment asks, “Are the conditions still the same?” And judgment does not scale linearly with years of writing code. It scales with clarity of thought.

High‑Leverage Agent Tools and the End of the Apprenticeship

This is where high‑leverage agent tools matter. Anything that removes the weight of change—instant refactors, cheap rewrites, aggressive simulation, reversible decisions—changes who gets to participate in architectural decisions. When change becomes cheap, the right to propose change expands. The apprenticeship model, where you slowly earn permission to question the system, begins to crack.

The unsettling possibility is that in a mature AI‑agent world, some of the most effective system designers will be people with very little attachment to how things have always been done. Not because they know more, but because they are willing to dismantle and rebuild more. Not because they are reckless, but because the environment has shifted from one where mistakes were catastrophic to one where they are increasingly simulated, contained, and reversible.

The New Role of Seniors

This does not make seniors obsolete. It changes their role. Their value moves away from being living archives of architectural pain and toward defining invariants: what must never break, what constraints are non‑negotiable, what risks are existential regardless of tooling. But the monopoly on exploration dissolves.

What this looks like in practice: Seniors become guardians of boundaries rather than gatekeepers of change. They define the security model that cannot be compromised. They identify the data consistency guarantees that must hold across any refactor. They articulate the performance thresholds below which the product fails its users. They specify the regulatory and compliance rails that no amount of clever architecture can bypass. Crucially, they enforce the ground truth mechanisms that make rapid iteration safe: comprehensive unit tests, meaningful logging, observability pipelines, and monitoring that catches what code review cannot. These are the things AI agents cannot infer from code alone—they require understanding of the business, the users, and the consequences of failure that extend beyond the codebase.

How seniors must adapt: The shift requires letting go of ownership over how things are built and holding tightly to what must remain true. This means resisting the instinct to enforce stylistic preferences, to mandate familiar patterns, or to reject approaches simply because they feel foreign. It means learning to trust validation over intuition—if the tests pass, the system holds, and the invariants are preserved, the unfamiliar path may be the better one. It means becoming comfortable with code that looks nothing like what you would have written, because you did not write it. The senior who thrives in this world is not the one who insists on reviewing every line, but the one who defines the constraints so clearly that review becomes verification rather than negotiation.

The belief that only seniors can supervise AI agents belongs to a world where agents were weak and architecture was rigid. As agents grow stronger and architecture becomes fluid, that belief starts to look like a historical artifact. The hierarchy built on the cost of change erodes as that cost approaches zero.

Who Shapes the Future

The programmer who will shape the next decade of software may not be the one who remembers the most failures, but the one most willing to ask whether those failures should still define what is possible. Armed with Cursor, Claude Code, Codex, and a willingness to iterate fast, they treat architecture as something to be questioned rather than preserved. They are not reckless—they simply operate in a world where the cost of exploration has collapsed.

And that person may not be senior at all.

Elisha Sterngold · 10 min read

Logs in the Age of AI Agents

Software developers have always relied on logs as a fundamental tool for understanding what happens inside running systems. Logs capture reality: the sequence of events, the state of the system at a given moment, errors that occurred, and the context in which everything happened.

As AI agents increasingly participate in writing, modifying, and maintaining code, it may be tempting to think that logs will become less important — or even obsolete. In practice, the opposite is true. Logs are becoming more critical than ever. The difference is who will primarily consume them, and how they need to be structured.

This post explores why logs remain essential in the age of AI agents, how the nature of logging is likely to change, and what this means for modern development platforms.

AI Makes Mistakes — Just Like Humans

To understand the future role of logs, we need to start with a realistic understanding of how large language models (LLMs) work.

LLMs do not reason about code in the same way compilers, interpreters, or formal verification systems do. They generate output by predicting the most likely next token based on vast amounts of training data. This makes them extremely powerful pattern generators — but not infallible problem solvers.

As a result:

  • LLMs make mistakes, sometimes obvious and sometimes subtle.
  • They can produce code that looks correct but fails under real-world conditions.
  • They are prone to hallucinations — confidently generating incorrect logic, APIs that don’t exist, or assumptions that are not grounded in reality.
  • They often lack awareness of runtime behavior, concurrency issues, environmental differences, or system-specific edge cases.

Let’s look at a few concrete examples.

Example 1: Android Logging Gone Wrong

Imagine an AI agent generating Android code to log network responses:

Log.d("Network", "Response: " + response.body().string())

At first glance, this looks fine. But in practice, calling response.body().string() consumes the response stream. If the same response is later needed for parsing JSON, the app will crash or behave unpredictably. Both human developers and AI models can overlook this subtle side effect during implementation or testing.

Proper logging would look like this:

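// peekBody copies up to the given byte budget without consuming the underlying stream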
val bodyString = response.peekBody(Long.MAX_VALUE).string()
Log.d("Network", "Response: $bodyString")

However, even this fixed version has a caveat: peekBody copies the body into memory, so extremely large responses can still cause memory pressure.

This is exactly the kind of subtlety that makes real code hard. Without logs showing the crash or the missing data, the AI agent would have no feedback that its generated code caused a runtime issue.

Example 2: iOS Threading Issues

Consider an AI model generating Swift code for updating the UI after a background network call:

URLSession.shared.dataTask(with: url) { data, response, error in
    if let data = data {
        self.statusLabel.text = "Loaded \(data.count) bytes"
    }
}.resume()

This code compiles and may even work sometimes, but it violates UIKit’s rule that UI updates must occur on the main thread. The result could be random crashes or UI glitches.

A correct version would wrap the UI update in a DispatchQueue.main.async block:

DispatchQueue.main.async {
    self.statusLabel.text = "Loaded \(data.count) bytes"
}

Logs capturing the crash or warning from the runtime would be the only reliable signal for the AI agent to detect and correct this mistake.

Example 3: Hallucinated APIs

LLMs sometimes invent REST APIs that don’t exist — something that might only become apparent once the product is running in production. For example, an AI might generate code that calls a fictional endpoint:

val response = Http.post("https://api.myapp.com/v2/user/trackEvent", event.toJson())
if (response.isSuccessful) {
    Log.i("Analytics", "Event sent successfully")
}

If that /v2/user/trackEvent endpoint was never implemented, the code will compile and even deploy, but the system will start logging 404 errors or timeouts once it’s live. Those logs are the only signal — for both humans and AI agents — that the generated API was imaginary and needs correction.

These limitations are not bugs; they are intrinsic to how current models operate. Even as models improve, it is unrealistic to expect near-future AI-generated code to be consistently perfect in production environments.

This is precisely where logs remain indispensable.

Logs as the Source of Truth

When something goes wrong in production, developers don’t rely on intentions, comments, or assumptions — they rely on evidence. Logs provide that evidence.

The same applies to AI agents.

Regardless of whether code is written by a human or generated by an AI, runtime behavior is the ultimate arbiter of correctness. Logs record what actually happened, not what was expected to happen.

They answer questions such as:

  • What sequence of events led to this state?
  • What inputs did the system receive?
  • Which branch of logic was executed?
  • What errors occurred, and in what context?
  • How did external dependencies respond?

Without logs, both humans and AI agents are left guessing.

Why AI Agents Need Logs

As AI agents increasingly participate in development workflows — generating code, refactoring systems, fixing bugs, and even deploying changes — logs become a critical feedback mechanism.

Logs Close the Feedback Loop

AI agents operate on predictions. Logs provide feedback from reality.

By analyzing logs, an AI agent can:

  • Validate whether generated code behaved as intended
  • Detect mismatches between expected and actual outcomes
  • Identify patterns that indicate bugs or regressions
  • Learn from failures in real production conditions

Without logs, AI agents have no reliable way to distinguish correct behavior from silent failure.

Logs Enable Root Cause Analysis

When failures occur, understanding why they happened requires context. Logs provide structured breadcrumbs that allow both humans and AI to trace causality across components, services, and time.

As AI systems take on more responsibility, automated root cause analysis will increasingly depend on rich, well-structured logs.

How Code Will Be Built in the Future

The rise of AI agents is not just changing how code is written — it is changing how software systems evolve over time.

We are moving toward a world where:

  • AI agents generate and modify large portions of code
  • Humans supervise, review, and guide rather than author every line
  • Systems are continuously adjusted based on runtime feedback
  • Debugging and remediation are increasingly automated

In such an environment, logs are no longer just a debugging tool. They become a primary interface between running systems and intelligent agents.

Think of logs as a bidirectional communication channel: they allow AI agents to observe system behavior in real-time, understand what's happening across distributed components, and make informed decisions about modifications and fixes. Just as APIs define how different services communicate, logs define how AI agents perceive and interact with running software. An AI agent monitoring logs can detect anomalies, correlate events across services, identify patterns that indicate potential issues, and even trigger automated responses — all without direct human intervention. This transforms logs from passive historical records into an active, queryable representation of system state that enables autonomous decision-making.

Logs May No Longer Be Written for Humans

One of the most significant shifts ahead is that logs may no longer need to be primarily human-readable.

Historically, logs were formatted for developers reading them line by line: timestamps, severity levels, free-form text messages, and stack traces.

But if the primary consumer of logs is an AI agent, the requirements shift fundamentally. Instead of human-readable prose, logs must become structured data streams that machines can parse, analyze, and reason about. Rather than a developer scanning through text messages, an AI agent needs programmatic access to event data with clear schemas, explicit context, and traceable relationships between events. The format matters less than the ability for algorithms to extract meaning, detect patterns, and make decisions based on what actually happened — not what a human thought was worth writing down.
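
As a sketch of what that shift could look like, here is one possible shape for a machine-first log event in TypeScript. The field names are illustrative, not a Shipbook schema; the essentials are a stable event name, explicit context, and one parseable object per line.

// An illustrative machine-first log event. Field names are hypothetical,
// not a Shipbook schema; the essentials are a stable event name, explicit
// context, and one parseable JSON object per line.
interface LogEvent {
  timestamp: string; // ISO 8601
  severity: "debug" | "info" | "warn" | "error";
  event: string; // stable, machine-matchable name, not free-form prose
  sessionId: string; // groups events into a user session
  traceId?: string; // links events across services
  attributes: Record<string, string | number | boolean>;
}

function emit(e: LogEvent): void {
  // One JSON object per line: trivial for agents and pipelines to parse.
  process.stdout.write(JSON.stringify(e) + "\n");
}

emit({
  timestamp: new Date().toISOString(),
  severity: "error",
  event: "cluster.write_failed",
  sessionId: "s_1234",
  traceId: "t_5678",
  attributes: { cluster: "secondary", retryable: false },
});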

Human readability becomes secondary. Humans may still access logs — but often through AI-generated summaries, explanations, and insights rather than raw log lines.

Logs as Active Participants in Autonomous Systems

As observability evolves, logs will increasingly move from passive storage to active participation in system behavior. Today, logs are primarily archives — repositories of what happened, consulted after the fact. Tomorrow, they will be real-time inputs that directly influence system actions.

We can already see early signs of this transformation:

  • Logs triggering alerts and automated workflows — when an error pattern appears, systems can automatically scale resources, restart services, or notify teams
  • Logs feeding anomaly detection systems — machine learning models analyze log streams to identify deviations from normal behavior before they escalate
  • Logs being correlated with metrics and traces — combining different signals to build comprehensive views of system health
  • Logs used to gate deployments or rollbacks — automated systems evaluate log patterns to decide whether a new release should proceed or be reverted (a sketch of this gate follows the list)
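
Here is a minimal sketch of that last pattern, a log-based deployment gate. fetchErrorRate and the tolerance are hypothetical; the point is that a release decision can be computed directly from log-derived numbers.

// A log-based deployment gate, assuming error rates can be aggregated
// from log events per release. fetchErrorRate and the tolerance are
// hypothetical; the decision itself is computed from log-derived numbers.
async function shouldPromote(candidateId: string, baselineId: string): Promise<boolean> {
  const candidate = await fetchErrorRate(candidateId); // errors per 1k requests
  const baseline = await fetchErrorRate(baselineId);

  const tolerance = 1.1; // allow 10% noise before calling it a regression
  return candidate <= baseline * tolerance;
}

async function fetchErrorRate(releaseId: string): Promise<number> {
  // Placeholder: count error-level events tagged with this release id
  // in your log store, divided by request volume.
  return 0;
}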

In AI-driven systems, this trend accelerates dramatically. Logs become the substrate on which autonomous decision-making is built. Instead of simply reacting to alerts, AI agents can proactively analyze log patterns, predict issues before they manifest, and autonomously implement fixes. An AI agent might notice subtle degradation patterns in logs, correlate them with recent code changes, generate a targeted fix, test it against historical log data, and deploy it — all without human intervention. In this model, logs aren't just records of the past; they're the sensory input that enables systems to observe, reason, and act autonomously.

Shipbook and the AI Age of Logging

At Shipbook, we believe logs are not going away — they are evolving.

Shipbook was built to give developers deep visibility into real-world application behavior, with features like:

  • Powerful search and filtering
  • Session-based log grouping
  • Proactive classification in Loglytics

But we take this even further with our Shipbook MCP Server. By implementing the Model Context Protocol, we allow AI agents to directly connect to your Shipbook account. This means your AI assistant can now search, filter, and analyze real-time production logs to help you debug issues faster and more accurately.

But we are also looking ahead.

We are actively developing capabilities that prepare logs for the AI age: logs that are easier for machines to interpret, analyze, and reason about — while still remaining useful for human developers.

As AI agents become first-class participants in software development, logs won't just be a debugging tool — they'll be the trusted interface that enables intelligent systems to understand, learn from, and improve the code they generate. That's the future we're building at Shipbook: logs that power both human insight and AI autonomy.


Ready to prepare your logging infrastructure for the AI age? Shipbook gives you the power to remotely gather, search, and analyze your user logs and exceptions in the cloud, on a per-user & session basis. Start building logs that work for both your team today and the AI agents of tomorrow.