I spent an hour last week reviewing a pull request on Dofek, one of my open source projects. I had generated most of the change with an AI coding assistant the night before. The code worked. The tests passed. CI was green. And yet, when I sat down to actually read it, I found a credential being logged in plaintext, an unsanitized input flowing into a query, and a transitive dependency two major versions behind the fix for a known CVE. The AI hadn't broken anything visible. It had just shipped a small pile of risk that nobody, including me, had explicitly accepted.
So I went back to the same AI, told it what I had found, and asked it to fix all three. It did, in about two minutes. The fix was clean. The tests still passed.
How many PRs in how many repositories are sitting one prompt away from being safe, except nobody is running that prompt?
I caught those three issues because I happen to be the engineer, the reviewer, and the security function on this project. That arrangement does not scale, and at any serious organization it should not exist. The rest of this piece is about what does.
For two decades, the constraint on shipping software was engineering bandwidth. How fast could humans type, review, merge? Every agile process, every roadmap rhythm, every PM-to-engineering ratio was built around the assumption that code production was the scarce resource.
That assumption no longer holds. AI coding assistants are pushing code generation toward effective infinity. Engineers are routinely opening five branches in parallel, having Claude or Cursor generate three implementations of the same refactor, shipping more lines on a Tuesday afternoon than they used to ship in a sprint. This is operational reality, not a hypothetical future.
The bottleneck didn't disappear. It moved. The new bottleneck is verification, the human judgment loop that decides whether what was generated is actually trustworthy enough to deploy.
I've started calling the accumulated gap between what AI generates and what humans actually verify "Validation Debt." It is the agentic era's version of technical debt, with a faster compounding rate and a higher blast radius.
Validation Debt doesn't show up in your sprint burndown. It doesn't show up in PR cycle time. It shows up in your CVE backlog three quarters from now, or in the post-mortem after a credential leak, or in the regression that breaks something a six-line generated change touched in a way nobody noticed.
Most product leaders I talk to are tracking AI adoption metrics: percentage of commits AI-assisted, lines of code generated, time saved per developer. These metrics measure throughput. They don't measure trust. And in the agentic era, trust is the actual product.
The inverse metric is the one that matters. Of everything generated, how much was actually validated? Not "tests passed." Not "linter happy." Validated in the deeper sense: reasoned about by a human, checked against stated intent, audited for the things automated tools cannot catch.
There is a structural principle worth being explicit about: the model that generated the code cannot reliably audit it. This is not a moral claim about AI. LLMs do not have incentives. They are not biased in any human sense. The limit is architectural.
An LLM is a probabilistic engine. If the model lacked the pattern to catch a plaintext credential on the way out, asking the same model to inspect its own output rarely surfaces what it missed the first time. What you get instead is confirmation bias dressed up as review: the same statistical lens that produced the code is now reading it. Push harder for findings and you get the opposite failure mode: hallucinated issues, confidently described, plausibly worded, and wrong.
Using a different LLM as the validator helps somewhat. Different training data, different blind spots, less correlated failure. But two probabilistic engines pointed at the same artifact still agree on the most common patterns and miss the same rare ones. The base rate of failure modes shifts. It does not approach zero.
Genuine verification in the agentic era requires pairing probabilistic generation with deterministic validation. Static analysis that traces taint flows and proves data path properties. Formal methods that verify invariants. Automated test execution against specifications defined independently of the generation step. SBOM analysis that reads dependency graphs, not source code. Policy engines that evaluate against compliance rules expressed in a language no LLM can hallucinate its way around.
The pattern is generation-by-language-model, verification-by-deterministic-system. The validator does not need to be smarter than the generator. It needs to be a different kind of thing, one whose outputs are reproducible, auditable, and not contingent on what the model happened to weight that week.
This is also why the regulatory frameworks emerging around AI (the EU AI Act, NIST's AI RMF, the evolving criteria under SOC 2 and ISO 42001) all assume independent, mechanical verification rather than model self-attestation. The architecture of verification matters as much as the existence of verification.
Four shifts start to look different when you take Validation Debt seriously.
The PM owns the definition of verified, not the verification itself. This is the distinction most organizations collapse, and the collapse is dangerous. If the PM is the one catching a transitive dependency CVE or a plaintext credential in a code review, the system has already failed. PMs do not scale by becoming part-time application security engineers. They scale by defining what verified means for their product (the risk classes that matter, the gates that must exist, the SLAs that govern each gate) and then holding the organization accountable for building and operating those gates. The right question for a PM in the agentic era is not "did I read the diff?" but "would the diff have been blocked before it ever reached a human if it were unsafe?"
Intent has to become a first-class artifact, in a form a machine can check. This is where most "intent matters" arguments go vague. Concretely, intent that survives generation looks like: acceptance criteria written as Gherkin-style BDD scenarios that a test runner executes; API contracts defined as schemas that validators evaluate before code merges; security and privacy requirements encoded as policy-as-code, for example in Rego, the policy language of OPA; threat models expressed as machine-readable rules that gate the build; property-based test invariants that capture what must always be true regardless of implementation. Version-controlled. Reviewed alongside the code. Ingested by the validation layer as inputs, not as documentation that lives in a wiki nobody opens. The PM who writes "the system should be secure" has shipped no intent at all. The PM who writes "no PII may exit the service boundary without encryption per policy X, enforced by gate Y, tested by suite Z" has shipped intent that an automated validator can actually use.
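As a rough illustration of the difference, the PII statement above could be encoded as a check that a gate runs on every outbound payload. This is a sketch in plain Python rather than a real policy engine. The field names in `PII_FIELDS`, the `OutboundField` type, and `policy_violations` are all hypothetical, standing in for whatever "policy X" actually defines.

```python
from dataclasses import dataclass

# Hypothetical policy data: which field names count as PII for this product.
PII_FIELDS = frozenset({"email", "ssn", "phone"})


@dataclass(frozen=True)
class OutboundField:
    """One field about to cross the service boundary."""
    name: str
    encrypted: bool


def policy_violations(payload: list[OutboundField]) -> list[str]:
    """Return the names of PII fields that would leave the boundary unencrypted."""
    return [
        field.name
        for field in payload
        if field.name in PII_FIELDS and not field.encrypted
    ]


# A gate would block the change whenever this list is non-empty.
payload = [
    OutboundField("email", encrypted=False),
    OutboundField("ssn", encrypted=True),
    OutboundField("order_id", encrypted=False),
]
violations = policy_violations(payload)  # ["email"]
```

Because the policy is data plus a pure function, it can be versioned and reviewed alongside the code it governs, which is the property the paragraph above is asking for.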
Detection without remediation is anxiety, not validation. Finding a vulnerability does not reduce Validation Debt. Fixing it does. If generation is approaching infinity while remediation still depends on a human reading a Jira ticket three sprints later, the math does not work. The closed-loop pattern looks like this: AI generates, deterministic validators check against encoded intent and known risk patterns, findings flow back as actionable fixes that an AI agent can apply under human approval, the deterministic validators re-run, the loop closes. Without that loop, you have built a faster way to produce a longer to-do list.
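The shape of that loop can be sketched in a few lines. This is an assumption-laden outline, not a description of any existing product: `validate`, `propose_fix`, and `approve` are hypothetical callables standing in for the deterministic validators, the AI agent, and the human approval step, and `max_rounds` is an arbitrary safety limit I added.

```python
from typing import Callable

Validator = Callable[[str], list[str]]    # deterministic: change -> findings
Fixer = Callable[[str, list[str]], str]   # agent: change + findings -> revised change
Approver = Callable[[str, str], bool]     # human: (old, new) -> accept?


def closed_loop(
    change: str,
    validate: Validator,
    propose_fix: Fixer,
    approve: Approver,
    max_rounds: int = 3,
) -> tuple[str, list[str]]:
    """Validate, remediate under approval, re-validate.

    Returns the final change and any findings still open.
    """
    findings = validate(change)
    for _ in range(max_rounds):
        if not findings:
            break  # the loop has closed
        candidate = propose_fix(change, findings)
        if not approve(change, candidate):
            break  # human declined the fix; findings stay open
        change = candidate
        findings = validate(change)  # deterministic validators re-run
    return change, findings
```

The loop only counts as closed when the second element of the result is empty. Anything else is still debt, just with a name attached.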
The metric that matters is economic, not operational. Validation coverage, mean time from finding to fix, regression rate on AI-generated changes are all useful, none sufficient. They roll up to a single number that most organizations are not yet tracking: cost of verification per generated unit.
The math is worth doing concretely. Suppose a modern coding assistant generates ten thousand lines of code in roughly ten minutes for about 25 dollars in API spend. If verifying those ten thousand lines takes ten hours of combined human review and machine analysis, at a fully loaded senior engineering rate of around 150 dollars an hour, you have spent 1,500 dollars verifying 25 dollars of generation. That is a 60-to-1 cost asymmetry, and it widens every quarter as generation throughput climbs while verification capacity stays linear. At that ratio, the AI tool has negative ROI before you count a single bug it shipped that you missed.
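The same arithmetic, written so you can substitute your own numbers. The inputs are the illustrative figures from the paragraph above, not measurements, and the function names are mine.

```python
def verification_cost(review_hours: float, hourly_rate: float) -> float:
    """Total cost of verifying a batch of generated code."""
    return review_hours * hourly_rate


def cost_per_unit(total_cost: float, units: int) -> float:
    """Cost per generated unit (here, per line of code)."""
    return total_cost / units


# Illustrative inputs taken from the text, not measured data.
lines_generated = 10_000
generation_cost = 25.0   # dollars of API spend
review_hours = 10.0
hourly_rate = 150.0      # fully loaded dollars per hour

verify_total = verification_cost(review_hours, hourly_rate)   # 1500.0
asymmetry = verify_total / generation_cost                    # 60.0
verify_per_line = cost_per_unit(verify_total, lines_generated)
generate_per_line = cost_per_unit(generation_cost, lines_generated)
```

With these inputs, verification costs about 15 cents per line against a quarter of a cent per line for generation. Tracking those two per-unit numbers over time shows whether the asymmetry is widening or closing.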
The real question for any product organization investing in agentic generation is whether automated verification capacity is scaling at the same rate as generation capacity. If it is not, every AI-assisted commit is a small loan against a future audit. The interest compounds.
The "Full Stack PM" piece earlier in this series argued that PMs who can build will fundamentally change their relationship with engineering. I still believe that. The corollary is becoming clearer: PMs who can build also have to define what trustworthy means for what they ship, not because they are doing the verification themselves, but because nobody else in the org is positioned to decide what the standard should be. Engineering builds the gates. Security operates them. Product defines what passes through.
This is uncomfortable for most product organizations. The default arrangement has product owning feature delivery, security owning compliance, and nobody owning the question of what "verified enough to ship" actually means for this product, this risk class, this customer. That ambiguity was tolerable when humans were the rate limit on code production. It is not tolerable when the rate limit has moved to verification and nobody is responsible for setting the bar.
Open your production environment, the one your customers are using right now. Count the features shipped in the last quarter where AI generated the majority of the code. For each one, ask: can I point to the intent it was validated against? The deterministic checks it passed? The owner of the remediation loop if a finding comes back tomorrow?
Most product leaders cannot answer those questions for more than a small fraction of their production code. The gap between what AI generated and what was actually verified is not theoretical. It is in your environment right now, serving your customers, compounding every day nobody measures it.
So the real questions are these. How many undocumented AI-generated features are currently running in production? What is your cost of verification per generated unit, and how does it compare to your cost of generation? And who in your organization is responsible for that ratio improving rather than degrading?
If the honest answer to the last question is "nobody yet," the debt is already compounding. That is the only honest read.
If you have started measuring this, I want to know how. If you have not, I would ask why.