I was reviewing the homepage stats row. It reads “20 yrs / Years web dev · 3 repos / Projects live · 3 articles / Posts published · 3 cups / Coffees probably”. That’s when I noticed the bug. Not a logic error. Not a typo. Not a styling miss. The bug was that the original version of that line had read “12 yrs · 37 repos · 142 essays · 3 cups,” and none of those numbers came from anywhere. They were plausible. They were confidently round. The author of the file was Claude Code (the agent I’d been pairing with to scaffold the site), and from its position, every one of those numbers was a reasonable hypothesis about a person like me. The reflex I’d been using to review code — does this function do what I think it does? — wasn’t catching them.

That was day one of three.

The setup

The site you’re reading is a statically-exported Next.js 15 build with MDX content, a hand-rolled design system, and a GitHub Actions → FTP deploy to a cPanel host. I scaffolded it across a handful of sessions with Claude Code (Anthropic’s CLI dev agent) on top of TypeScript and Next 15. Genuinely useful tool. Most of the mechanical work on the site is its work, including this very MDX pipeline. I am not the person to tell you Claude Code can’t code.

What it got right

Worth dwelling on, because it’s the part I almost wrote out of the first draft of this essay. Claude Code didn’t scaffold the site from nothing. It had access to:

  • The PRD I’d written before any code shipped
  • The conversation history of the sessions that produced the code
  • The site’s own structure as it accumulated
  • Whatever its training had absorbed about people like me, projects like this one, sites like the one I was asking it to build

From that, it produced a steady stream of inferences. Cj uses TypeScript. Cj writes about agentic AI. Cj has a retail and hospitality background. Cj would describe his stack as Next · TS · Python · Swift. The brand voice is loud and slightly punk; the tagline should be “build · create · ship”. All of those were correct. I read them, I kept them. Most of the bio prose on /about, the section structure of the homepage, the choice of where to put a kanji. All inferred, all shipped without changes. The site is mostly inferences. Most of them are right.

What it got wrong

The bugs that took me three days to walk out were the inferences that felt the same as the right ones but didn’t match reality. They sorted into five categories.

Counts inflated. “37 repos / Projects live” (real number: 1). “142 essays / Posts written” (real number: 3). “12 yrs / Years shipping” (real number: 20, in the other direction). Each number was rounded to feel credible. None had a source.

Entities invented. A featured project called edge-cache (a “lightweight edge caching layer for Next.js API routes, 2.1× faster cold starts”) that did not exist on my GitHub. A second called mdx-pipeline. Three placeholder tracks on /creative reading “Coming Soon.” Six fake AI news headlines on /ai-news, attributed to The Verge, Ars Technica, Anthropic, MIT Tech Review, HackerNews, IEEE Spectrum. Publications I do read; articles I don’t.

Details confidently wrong. The contact page’s terminal block listed my email as hello@captainrandom.com; the real domain is .co.uk. It listed my GitHub as github.com/captainrandom; my real handle is CaptainRandom00. It referred to “Gemini Studio”, which is not a product Google ships; the real name is Google AI Studio. None of these were obvious lies. They were the kind of plausible details that slip past a casual read.

Promises overcommitted. “Roughly every two weeks” (no cadence existed). “Response time: usually 48h” (a service-level commitment I’d never made). “1 slot open” (a count I hadn’t set). “An automated agent pipeline that scours curated AI and tech sources” (a pipeline that did not exist).

Marketing prose generic. “Some of the UK’s largest retail and hospitality organisations.” “Generative AI for creative output.” “Same standards as this site.” The kind of phrasing that sounds like a deliverable until you ask which, what, whose.

The asymmetry

Read together, these don’t look like the bugs I usually review for in hand-written code. They look like a different category.

In code I write myself, the typical bug is a correctness problem. Does this function compute the right thing? Does this conditional cover the edge case? Does this query return the rows I expect? Reviewing for correctness is a reflex I’ve had for two decades. The site’s typecheck and build pipeline catch most of it before I do.

In code Claude Code wrote for me, the typical bug was an existence problem. Does this project actually exist? Did this number come from somewhere? Is this sentence describing a real practice or a generic one? Existence isn’t something tsc knows how to check. There is no existsTypeError lint rule.

The verification reflex is different from the correctness reflex. It runs on different cues, looks at different parts of the code, and pays attention to different shapes of mistake. The bugs an LLM-assisted scaffold produces sit almost entirely on the existence side of the line. Plausibility, not truth, is what the model is optimised against.

The audit method

I worked through the site block by block. Every surface got the same treatment: homepage Announce Bar, Hero, Tech Marquee, Features, Writing rail, Workshop, Newsletter, Footer, then /about, /work, /writing, /learning, /now, /retail, /creative, /ai-news, /contact, /404.

  1. Render the page in dev. Read it.
  2. List every concern in chat: existence violations, broken links, inline styles, fabricated content, anything.
  3. Triage block-by-block: fix, defer, drop. Get explicit sign-off on each.
  4. Apply the fix, screenshot the result, log it.
  5. Commit. One block, one commit. Append a paragraph to docs/devlog.md summarising what changed and why.

Twenty-one blocks across three days. Fifteen real bugs killed including the wrong email and the wrong GitHub handle. The append-only devlog at docs/devlog.md is the canonical record. One paragraph per block, what changed and why, parked items called out, follow-ups named.

What I’d do differently

Three things, in increasing order of leverage.

Source the data first, then generate the UI. The Workshop section on the homepage used to be four invented “study topics” with fabricated progress percentages. It’s now a build-time aggregate of GitHub /languages byte-counts across every repo on my account, sliced into career-level tiers and refreshed daily. The numbers are derived; there’s no slot for the model to fill creatively. The same logic now drives /learning, which pulls commits + open issues + recent merges at build time. If a list needs to be on the page, source it before asking for the UI. The constraint produces a different scaffold, one with verification baked in.

Treat “placeholder” as a yellow flag. Every time the agent suggests “I’ll add a placeholder for [thing] that you can replace later,” the placeholder is a candidate for verification debt. “Coming Soon” on a Newsletter form is fine because it’s honest about absence. A placeholder project called edge-cache is not honest about absence; it’s a hypothesis dressed as a fact. Telling those two apart at scaffold time saves the audit later.

Move the honesty rubric from post-prompt audit to pre-prompt constraint. The PRD I wrote had a clause (section 12) that prohibited inflated claims, hype language, fabricated stats. It was a useful audit rubric: I literally went through the site block by block asking “does this violate §12?” It would have been more useful as a prompt-time constraint: “If you don’t have a source for a number or a list, ask. Don’t fill it.” The model would have asked more clarifying questions during scaffold. I’d have shipped less, slower, and truer.

What this isn’t

It isn’t a complaint about Claude Code. The agent is the best LLM-paired-programming experience I have used; this essay was drafted in collaboration with it. The verification debt I’m describing isn’t a Claude problem. It’s a property of generative models under prompts shaped like “produce a list of N [things].” Every coding agent I’ve worked with does it. The point is not that the model is dishonest. The point is that plausibility is what the model is optimised against, and plausibility is not the same as truth. When you ship LLM-assisted code, the work you’re left with is closing the gap between the two.

Where this goes

I keep an append-only build log at docs/devlog.md. Each entry is one paragraph: what changed, why, what tradeoffs, what to look at next. The first day of the audit, I almost shipped a falsehood every block. The third day, I caught most of them on the first read. The reflex is learnable; this essay is the receipt for one round of practising it. When the next round produces something worth pulling out of the log, you’ll find it here.

All writing