Vibe Coding and Agentic Engineering Are Converging. Good - Now Change What You Review.

Simon Willison, who drew the line between vibe coding and professional agentic engineering, now admits the two are blurring in his own work because he no longer reviews every line his AI agents write. The honest fix is not more guilt or harder line-by-line review - it is changing what you review: the specification, the tests behind it, and the trace from every line back to a requirement.

Date: May 19, 2026

I read something from Simon Willison this month that I have not been able to put down. Willison is one of the most careful, least hype-prone voices on AI-assisted coding. For the past year he has insisted on a hard distinction between two ways of working - "vibe coding" and "agentic engineering" - and in his latest piece, he admits the wall between them is crumbling in his own work.

His words: "Weirdly though, those things have started to blur for me already, which is quite upsetting."

When the person who drew the line says the line is fading, it is worth paying attention.

Two terms, one fading boundary

A quick recap, because the words get used loosely. Both terms come from the same person: Andrej Karpathy.

Karpathy coined "vibe coding" in February 2025. It describes building software with a language model where you do not look at the code at all. You describe what you want, you get something, and if it works, great. If it does not, you tell the model and cross your fingers. His own phrase was to "fully give in to the vibes ... and forget that the code even exists."

A year later, in February 2026, Karpathy named the other half: "agentic engineering." Same tools, opposite discipline. As he framed it: agentic, because you are no longer writing the code by hand but orchestrating agents and acting as oversight; engineering, because doing that well takes real craft and expertise. It is the mode of the professional who still owns security, maintainability, performance and operations. Willison adopted the term and has done as much as anyone to sharpen the distinction - his golden rule was that he would not commit code he could not explain to someone else.

The distinction held up well. Vibe coding for a personal tool, where a bug only hurts you: go ahead. Vibe coding for software other people depend on is, in Willison's words, "grossly irresponsible because it's other people's information. Other people get hurt by your stupid bugs."

So what changed?

The honest part: he stopped reviewing the code

Here is the admission that makes the article worth reading:

"The problem is that as the coding agents get more reliable, I'm not reviewing every line of code that they write anymore, even for my production level stuff."

He is not being careless. His point is that when you ask a good agent to build a JSON API endpoint that runs a SQL query and returns the results, it just does it right. Tests, documentation, the lot. So he stops reading every line. And then the guilt arrives: "if I haven't reviewed the code, is it really responsible for me to use this in production?"

He names the real risk precisely. He calls it the normalization of deviance: "every time a model turns out to have written the right code without me monitoring it closely there's a risk that I'll trust it at the wrong moment in the future and get burned."

That phrase comes from accident research: sociologist Diane Vaughan coined it in her study of the Challenger disaster to describe the slow process by which a team's safety margins erode, one "it worked last time" at a time, until the day it does not. It is exactly the right phrase. It is why I think the convergence Willison describes is real, and also why I think the way out is not the one most people reach for.

Guilt is the wrong lens

I have spent twenty years building engineering teams, much of that time in regulated industries: banking, insurance, a FINMA-approved digital asset exchange. My instinct from that experience is this: "I trust it because it has been right before" was never an acceptable control. Not for a human, and not for an agent.

In a regulated environment, you do not get to tell an auditor that the payment logic is fine because the developer has a good track record. A track record is a reason to hire someone. It is not evidence that this specific change is correct. The evidence is the requirement, the test that proves the requirement is met, and the trace between them.

So when Willison feels guilty about not reading every line, I would gently reframe it. The guilt is pointing at the wrong thing. The problem is not that he stopped reading lines. The problem is what he has, or has not, put in place of reading them.

Reading every line was always a proxy. It was the cheapest verification we had when a human produced a few hundred lines a day. It was never the actual goal. The actual goal is confidence that the software does what it is supposed to do, and nothing it is not supposed to do.

The black box still needs an audit trail

Willison reaches for a good analogy. He compares trusting an agent to trusting another team inside a large company. If a different team hands you an image resize service, you do not read every line they wrote. You read their documentation, you use it, and you only dig into their repository when something breaks.

I agree with the analogy. I want to push on the one place it leaks.

When another team hands you a service, you are not only trusting their code. You are trusting a chain: there was a requirement, someone reviewed it, decisions were recorded, and there is a person who will answer a question next Tuesday. Willison makes this point himself: "Claude Code does not have a professional reputation! It can't take accountability for what it's done."

Exactly. So the black box only works if the box itself carries the trail that the human team would have carried. The agent leaves you code. It does not, by default, leave you the reasoning, the requirement it was satisfying, or the evidence that it satisfied it. If you want to treat agent output as a semi-black box you do not inspect, you have to make the box produce that trail. That is not optional housekeeping. It is the thing that replaces the line-by-line review.
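To make "the box produces the trail" concrete, here is a minimal sketch of a per-change record and a completeness check. Everything in it is an assumption for illustration: the `ChangeRecord` type, its field names, and the `missing_trail` helper are hypothetical and are not taken from Claude Code, Shipwright, or Willison's article.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChangeRecord:
    """Hypothetical trail an agent attaches to each change it proposes."""
    change_id: str
    requirement_id: str              # the requirement the change claims to satisfy
    rationale: str                   # why the agent chose this approach
    evidence: tuple[str, ...] = ()   # e.g. names of tests that passed


def missing_trail(record: ChangeRecord) -> list[str]:
    """Return the trail fields that are empty; an empty list means the trail is complete."""
    gaps = []
    if not record.requirement_id.strip():
        gaps.append("requirement_id")
    if not record.rationale.strip():
        gaps.append("rationale")
    if not record.evidence:
        gaps.append("evidence")
    return gaps


# A change that carries its trail, and one that is code only.
complete = ChangeRecord(
    "c1", "REQ-12", "Reuse the existing query helper", ("test_req_12_returns_rows",)
)
code_only = ChangeRecord("c2", "", "")

complete_gaps = missing_trail(complete)
code_only_gaps = missing_trail(code_only)
```

The point of the sketch is the gate, not the data model: a change with a non-empty `missing_trail` result would be sent back to the agent rather than to a human reader.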

Change what you review, not how much

This is the shift I would make explicit, because Willison circles it without quite landing on it.

Stop reviewing lines. Start reviewing three things.

The specification. Is the requirement itself correct and complete? This is now the highest-leverage place to spend human attention, because everything downstream inherits its mistakes.

The tests against that specification. Not "are there tests," but "do the tests actually check the requirement, or do they just check that the code does what the code does?" An agent that writes both the code and the tests can produce a beautifully green suite that proves very little.

The trace. Can every meaningful piece of code be linked back to a requirement it exists to satisfy? If you can answer that, "I did not read line 240" stops being a confession and becomes a reasonable engineering decision.
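The second of those three, tests that check the requirement rather than the code, is easiest to see side by side. The sketch below assumes a made-up requirement (REQ-7: orders of 100.00 or more get 10% off) and a deliberately buggy implementation; the function names are hypothetical.

```python
from decimal import Decimal

# Hypothetical requirement REQ-7: orders of 100.00 or more get 10% off.


def discounted_total(total: Decimal) -> Decimal:
    """Agent-written code with a boundary bug: uses > where REQ-7 says 'or more'."""
    if total > Decimal("100.00"):
        return total * Decimal("0.90")
    return total


def tautological_check() -> bool:
    """Derives the expected value from the code itself, so it cannot fail."""
    expected = discounted_total(Decimal("100.00"))
    return discounted_total(Decimal("100.00")) == expected


def spec_check() -> bool:
    """Takes the expected value from REQ-7, independently of the implementation."""
    return discounted_total(Decimal("100.00")) == Decimal("90.00")


tautological_passes = tautological_check()
spec_passes = spec_check()
```

Here the tautological check stays green while the spec-derived check fails at the boundary. Reviewing the tests means asking which of the two kinds you are looking at.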

None of this has to be slower. It is a move up the stack. You spend your scarce, expensive human judgement where it compounds, and you let the agent own the layer where it is genuinely reliable.

This is the problem I am building Shipwright to solve. Shipwright is an open-source framework that runs the full software lifecycle as a spec-driven pipeline on top of Claude Code - every line of code traces back to a requirement, the compliance documentation updates with every build, and the agent reads its own failing test logs and fixes them. It exists precisely because "the agent is reliable, so I stopped checking" is not a safe place to rest. The trace is what makes it safe.

You can no longer tell good work from fast work

There is a second observation in Willison's piece that I think is underrated, and it is about evaluation.

He points out that a GitHub repository with a hundred commits, a polished readme and comprehensive tests used to be a signal. It told you someone cared. Now: "I can knock out a git repository with a hundred commits and a beautiful readme and comprehensive tests of every line of code in half an hour! It looks identical to those projects that have had a great deal of care and attention."

The artefacts we used to read as proof of diligence are now free. They no longer separate careful work from fast work.

Willison's answer is that he now values one thing above tests and docs: "I want somebody to have used the thing." A tool someone has used every day for two weeks beats a polished one that was barely exercised.

I think that is correct, but for software you are shipping to other people it is still not enough. Usage by the builder is a great signal for a side project. For production software, "I have used it" and "it is correct" remain different statements. The enterprise version of Willison's instinct - which he also names - is that you want a solution other organizations have run successfully for months before you bet on it. Proven beats polished. But proven is itself a claim, and it needs evidence behind it. That evidence is, again, the trace from requirement to verified behaviour.

The bottleneck moved, and most processes have not

The deepest point in the article is almost a throwaway line: "If you can go from producing 200 lines of code a day to 2,000 lines of code a day, what else breaks? The entire software development lifecycle was, it turns out, designed around the idea that it takes a day to produce a few hundred lines of code."

This is the part worth sitting with. Code review, sprint planning, design sign-off, QA cycles - none of these were handed down on stone tablets. They were calibrated to a specific throughput. Change that throughput by a factor of ten and the calibration is wrong everywhere at once. Some processes become bottlenecks. Others, as Willison notes via a talk by Anthropic design leader Jenny Wen, can suddenly afford to be riskier, because getting a design wrong no longer costs three months.
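A back-of-the-envelope calculation shows why the old calibration breaks. The 200 and 2,000 lines per day come from Willison's quote; the review rate of 200 lines per hour is purely an illustrative assumption, not a measured figure.

```python
def review_hours_per_day(lines_per_day: int, lines_reviewed_per_hour: int) -> float:
    """Hours of line-by-line review needed to keep up with one day's output."""
    return lines_per_day / lines_reviewed_per_hour


ASSUMED_REVIEW_RATE = 200  # hypothetical lines a careful reviewer reads per hour

hours_before = review_hours_per_day(200, ASSUMED_REVIEW_RATE)    # 1.0
hours_after = review_hours_per_day(2000, ASSUMED_REVIEW_RATE)    # 10.0
```

Whatever rate you plug in, the ratio is the same: review time grows tenfold with output, so under this assumption one developer's daily output would consume more than a full working day of someone else's reading.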

This is why I do not think the answer to the convergence is to clamp down and review harder. That is fighting a tenfold change with the old calibration. The answer is to rebuild the lifecycle around the new throughput, and to move verification from "a human read it" to "the system can prove it traces."

Where this leaves us

Willison is right that the two modes are converging, and I do not think that is a tragedy. Vibe coding for your own tools, where the only person a bug hurts is you, is still wonderful. Do more of it. The convergence only becomes dangerous at the exact boundary he identified at the start: the moment other people depend on the software.

At that boundary, the question was never "did you read every line." A human team could not honestly answer yes to that either. The question is: can every line trace back to a requirement, and is that requirement verified?
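That question can be asked mechanically. The following is a minimal sketch under stated assumptions: tracing is done per code unit (a function or module) rather than literally per line, and the `CodeUnit` type, the requirement ids, and the `trace_gaps` helper are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CodeUnit:
    name: str
    requirement_id: Optional[str]  # None means nobody recorded why this code exists


def trace_gaps(units: list[CodeUnit], verified: set[str]) -> dict[str, list[str]]:
    """Split code units into untraced ones and ones whose requirement is unverified."""
    untraced = [u.name for u in units if u.requirement_id is None]
    unverified = [
        u.name
        for u in units
        if u.requirement_id is not None and u.requirement_id not in verified
    ]
    return {"untraced": untraced, "unverified": unverified}


units = [
    CodeUnit("resize_image", "REQ-1"),
    CodeUnit("parse_header", "REQ-2"),
    CodeUnit("debug_dump", None),
]
verified_requirements = {"REQ-1"}  # requirements with a passing spec-derived test

gaps = trace_gaps(units, verified_requirements)
```

Both lists being empty is the condition a build could enforce. Anything left in `untraced` or `unverified` is where human attention should go first.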

If you can answer that, you have not slid into vibe coding. You have done the more honest thing - admitted that line-by-line review was always a proxy, and replaced it with something an auditor, a teammate, or a future version of you can actually rely on.

Ship right, not just fast.

Sources

  • Vibe coding and agentic engineering are getting closer than I'd like - Simon Willison - May 6, 2026 - https://simonwillison.net/2026/May/6/vibe-coding-and-agentic-engineering/
  • Not all AI-assisted programming is vibe coding (but vibe coding rocks) - Simon Willison - March 19, 2025 - https://simonwillison.net/2025/Mar/19/vibe-coding/
  • The AI Coding Paradigm Shift with Simon Willison, High Leverage Podcast Ep. #9 - Heavybit - 2026 - https://www.heavybit.com/library/podcasts/high-leverage/ep-9-the-ai-coding-paradigm-shift-with-simon-willison
  • What is Agentic Engineering? - IBM - 2026 - https://www.ibm.com/think/topics/agentic-engineering
  • Vibe coding - Wikipedia - 2026 - https://en.wikipedia.org/wiki/Vibe_coding