The Code Nobody Reads - Why the AI Inflection Point Changes Everything for Leaders

Simon Willison describes the November inflection point where AI-generated code suddenly became reliable. Three developers at StrongDM have produced 32,000 lines of production code without reading a single line themselves. What does this mean for leaders - and why is the bottleneck now shifting from development to testing?

Date: April 9, 2026

The Code Nobody Reads

Three developers. 32,000 lines of production code. Since July 2025. Without writing or reading a single line themselves.

This is not science fiction. This is StrongDM, a security company building its own security product this way. Simon Willison - Django co-creator and one of the sharpest observers of AI development - calls it the "Dark Factory": If your factory is so automated that no humans stand on the shop floor, you can turn the lights off. The machines don't need them.

When I first heard this, I thought: That's irresponsible. Then I looked at how they actually do it. And it confirmed what I've been seeing in my own work - and why frameworks like the one I'm building matter more than ever.

The November Inflection Point

In two recent talks - on Lenny Rachitsky's podcast and at the Pragmatic Summit in San Francisco - Willison describes a turning point that happened in November 2025. GPT 5.1 and Claude Opus 4.5 came out, both only incrementally better than their predecessors. But together they crossed a threshold.

Before, AI-generated code worked well enough - but you had to look carefully, test, correct. After, it worked almost always. And that shift - from "requires constant supervision" to "almost always does what you told it to do" - changes everything.

Suddenly you can tell a coding agent: Build me a Mac application that does this thing. And you get something back that actually works. Not a half-finished prototype full of bugs. Willison himself says he now produces 10,000 lines of code in a day. Most of it works.

His conclusion: "My ability to estimate software is broken." Old timeframes no longer apply.

The Stages of Adoption

Willison describes AI adoption as a stage model - and most companies are still at the very beginning:

  1. Ask ChatGPT and occasionally get a useful answer
  2. A coding agent writes parts of the code for you
  3. The agent writes more code than you do
  4. You don't read the code anymore

Stage 4 sounds like losing control. But Willison draws an interesting comparison: In any larger company, teams use services from other teams without reading their code. You trust that professionals do their job. When something breaks, you look at the code. But not before.

Trusting an AI agent the same way feels uncomfortable. But Claude Opus 4.5, Willison says, was the first model he could give that trust to - at least for problem classes he already knows.

The Bottleneck Shifts

This is where it gets relevant for leaders. When features are built in hours instead of weeks, the bottleneck shifts. Away from development. Toward testing and validation.

This sounds trivial, but it's fundamental. Because most organisations are optimised to manage development - sprints, velocity, story points. When the bottleneck is somewhere else, these metrics suddenly become irrelevant.
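A quick back-of-the-envelope calculation shows why. The numbers below are purely hypothetical, chosen only to illustrate the effect: assume a feature used to take 80 hours to build and 20 hours to validate, and that an agent cuts the build time to 4 hours while validation stays the same.

```python
def validation_share(build_hours: float, validation_hours: float) -> float:
    """Fraction of total lead time spent on testing and validation."""
    return validation_hours / (build_hours + validation_hours)


# Hypothetical numbers, not measurements from any real team.
before = validation_share(build_hours=80, validation_hours=20)
after = validation_share(build_hours=4, validation_hours=20)

print(f"before: {before:.0%}, after: {after:.0%}")
# prints "before: 20%, after: 83%"
```

In this made-up scenario, nothing about validation got slower. It simply became most of the lead time, which is why a metric that only tracks build effort stops telling you much.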

Willison puts it well: "Anyone who's done product work knows that your initial ideas are always wrong. What matters is proving them, and testing them. We can test things so much faster now because we can build workable prototypes so much quicker."

A UI prototype is free now. Claude Code, Codex or Replit will build you a convincing interface for anything you describe. Anyone doing product design who isn't vibe coding little prototypes is missing out on the biggest productivity boost we get at this stage.

Tests Are No Longer Optional

Perhaps the most important insight from Willison's Pragmatic Summit talk: Test-Driven Development with AI agents is not optional. It's mandatory.

His approach is surprisingly simple. Every coding session starts with a short instruction to the agent: "Use red-green TDD." All good coding agents know what that means. They write the test first, watch it fail, then write the code that makes it pass, and the probability of getting working results goes up massively.
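For readers who have never seen the cycle, here is a minimal sketch of red-green TDD in Python. The `slugify` function and its test are hypothetical examples of mine, not code from Willison's talk. The same test is run twice: first against an unimplemented stub (red), then against the simplest implementation that satisfies it (green).

```python
import io
import unittest


def slugify_stub(title: str) -> str:
    """Red phase: the function exists but is not implemented yet."""
    raise NotImplementedError


def slugify(title: str) -> str:
    """Green phase: the simplest implementation that satisfies the test."""
    return "-".join(title.lower().split())


def make_test_case(func):
    """Bind the same test to whichever implementation is passed in."""

    class SlugifyTest(unittest.TestCase):
        def test_spaces_become_hyphens(self):
            self.assertEqual(func("The Code Nobody Reads"), "the-code-nobody-reads")

    return SlugifyTest


def run(func) -> bool:
    """Run the test against func and report whether it passed."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(make_test_case(func))
    # Send the runner's report to a buffer so only our own output is shown.
    result = unittest.TextTestRunner(stream=io.StringIO()).run(suite)
    return result.wasSuccessful()


red = run(slugify_stub)   # test written first, fails as expected
green = run(slugify)      # implementation added, same test now passes
print(f"red phase passed: {red}, green phase passed: {green}")
```

Seeing the test fail first matters: it proves the test actually exercises the code, rather than passing by accident.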

The ironic part: Willison hated TDD throughout his entire career. Too slow, too tedious. But when an agent does it? "I don't care if the agent spins around for a few minutes on a test that doesn't work."

His statement is unambiguous: "The reason not to write tests in the past has been that it's extra work. They're free now. They're effectively free. I think tests are no longer even remotely optional."

This applies to manual testing too. Willison has his agents start the server in the background and test the API with curl commands. He built Showboat for this - a tool that documents manual tests as a Markdown file.
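Willison's agents do this with curl. The sketch below shows the same idea in plain Python, as an illustration rather than his actual setup: start a server in a background thread, send one request, check the response. The `/health` endpoint and the `HealthHandler` class are hypothetical.

```python
import http.client
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class HealthHandler(BaseHTTPRequestHandler):
    """Hypothetical API with a single /health endpoint."""

    def do_GET(self):
        if self.path == "/health":
            body = json.dumps({"status": "ok"}).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

    def log_message(self, format, *args):
        pass  # silence per-request logging


def check_health() -> dict:
    # Port 0 lets the operating system pick a free port.
    server = HTTPServer(("127.0.0.1", 0), HealthHandler)
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()  # the server now runs in the background
    try:
        port = server.server_address[1]
        conn = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
        conn.request("GET", "/health")
        resp = conn.getresponse()
        result = {"code": resp.status, "body": json.loads(resp.read())}
        conn.close()
        return result
    finally:
        server.shutdown()
        server.server_close()


result = check_health()
print(result)
```

The point is less the code than the habit: the running system gets exercised from the outside, not just through unit tests.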

Code Quality Is a Choice

A point that's often overlooked: Poor code quality from an AI agent is not a technology problem. It's a choice.

Willison says it directly: "If the agent spits out 2,000 lines of bad code and you choose to ignore it - that's on you. If you then say: refactor this part, use this design pattern - you can end up with code that is way better than the code I would have written by hand. Because I'm a little bit lazy."

This is the decisive point for leaders: Output quality doesn't depend on the tool. It depends on the processes and standards you build around it.

What This Means for Me

I've been watching this development for months from the trenches. And Willison's observations match exactly what I'm seeing - the bottleneck is shifting, testing is becoming central, and without a structured framework, vibe coding ends in chaos.

This is exactly why I'm building Shipwright. Not as another coding tool, but as a framework for the entire development process - from requirements to deployed, tested, secured code. With built-in test requirements, self-healing CI, and compliance documentation that stays current with every build.

StrongDM's Dark Factory works because they built a rigid framework around it. Three people, 32,000 lines of code - but with digital twins for every external service, scenario-based tests, and an architecture that enforces quality instead of hoping for it.
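To make "digital twin" and "scenario-based test" concrete, here is a deliberately small sketch. It is my own illustration, not StrongDM's implementation, which the sources describe only at a high level. `DirectoryTwin` is a hypothetical in-memory stand-in for an external user directory, and the scenario walks through an offboarding flow end to end.

```python
from dataclasses import dataclass, field


@dataclass
class DirectoryTwin:
    """In-memory stand-in for a hypothetical external user directory."""

    users: dict[str, bool] = field(default_factory=dict)  # email -> active?

    def create_user(self, email: str) -> None:
        if email in self.users:
            raise ValueError(f"user already exists: {email}")
        self.users[email] = True

    def deactivate_user(self, email: str) -> None:
        if email not in self.users:
            raise KeyError(email)
        self.users[email] = False

    def is_active(self, email: str) -> bool:
        return self.users.get(email, False)


def can_access(directory: DirectoryTwin, email: str) -> bool:
    """Code under test: grant access only to active directory users."""
    return directory.is_active(email)


def scenario_offboarding(directory: DirectoryTwin) -> list[bool]:
    """Scenario: an employee is onboarded, then leaves the company."""
    directory.create_user("ana@example.com")
    before = can_access(directory, "ana@example.com")
    directory.deactivate_user("ana@example.com")
    after = can_access(directory, "ana@example.com")
    return [before, after]


outcome = scenario_offboarding(DirectoryTwin())
print(outcome)  # prints "[True, False]"
```

Because the twin lives in memory, an agent can run a scenario like this over and over without touching a real external service.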

That's not magic. That's engineering.

What Leaders Should Do Now

Willison calls software engineers "bellwethers for other information workers". Code is easier to verify than essays, contracts, or strategy papers - that's why it hits us first. But it won't stop with us.

Three practical recommendations:

Ask the metrics question: If features are built in hours instead of weeks, are your sprint metrics still measuring the right thing? Probably not. The focus should be on validation speed, not development speed.

Treat testing as a strategic investment: It is not a cost centre. Tests are effectively free to create today - there's no excuse not to have them. Organisations that treat test infrastructure as nice-to-have will never realise the Dark Factory advantages.

Build the process, don't just buy the tool: StrongDM didn't just unleash AI agents. They built a framework - digital twins, scenario tests, clear architectural principles. The tool alone does nothing. The process makes the difference.

The inflection point is here. Code works. The question is no longer whether AI will change your development process. The question is whether you're ready for what comes next.

 


Sources:

  • Highlights from my conversation about agentic engineering on Lenny's Podcast - Simon Willison - April 2026 - simonwillison.net
  • An AI state of the union: We've passed the inflection point - Lenny's Newsletter - April 2026 - lennysnewsletter.com
  • My fireside chat about agentic engineering at the Pragmatic Summit - Simon Willison - March 2026 - simonwillison.net
  • How StrongDM's AI team build serious software without even looking at the code - Simon Willison - February 2026 - simonwillison.net
  • Built by Agents, Tested by Agents, Trusted by Whom? - Stanford Law School CodeX - February 2026 - law.stanford.edu