[{"data":1,"prerenderedAt":1232},["ShallowReactive",2],{"blog-thinking-in-code":3,"blog-posts-nav":106},{"id":4,"title":5,"article":6,"body":7,"date":92,"description":93,"extension":94,"meta":95,"navigation":96,"path":97,"promptVersion":98,"readingTime":99,"seo":100,"stem":101,"tags":102,"__hash__":105},"content\u002Fblog\u002Fthinking-in-code.md","Thinking in code",7,{"type":8,"value":9,"toc":84},"minimark",[10,14,19,22,25,28,31,35,38,41,44,48,51,60,63,67,70,73,81],[11,12,13],"p",{},"There is a story about software development that has become dominant in the last two years. You decide what you want, you write a plan, and a system implements it. The plan is the thinking; the code is the artifact. For a class of problems this is true and useful. For another class — the one I want to talk about — the arrow points the other way. The code is the thinking. The plan is the residue.",[15,16,18],"h2",{"id":17},"what-the-spec-leaves-out","What the spec leaves out",[11,20,21],{},"A spec can stay internally consistent while being externally incoherent because prose tolerates contradictions that types do not. You can write \"the system processes inputs and routes them to the right handler\" and the sentence reads correctly. It is also empty. It does not say what an input is, which handlers exist, what \"right\" means, or how the routing is decided. The sentence is well-formed; the design is not.",[11,23,24],{},"Code does not allow that. A function signature is the cheapest form of commitment a software design can make — it has to name a thing, list its parts, and say what comes back. A class is a claim about which fields belong together and which behaviors live in the same room. A folder layout is a claim about which concepts are siblings. None of these are visible in prose, and none of them survive being skipped.",[11,26,27],{},"The first time I usually notice this is when I sit down to write a function I thought I understood and find I cannot name its arguments. 
I do not know whether the second parameter is a string or a record. I do not know whether the response includes the original request or only the new fields. The spec did not tell me, because the spec never had to. Prose let me keep both interpretations open at once. Code does not.",[11,29,30],{},"That is what specs are quietly leaving out: the commitments that only exist once a syntax forces them. You can spend a week refining a spec without making any of them. The first hour of typing usually makes a dozen.",[15,32,34],{"id":33},"why-typing-is-doing-work","Why typing is doing work",[11,36,37],{},"Typing the code surfaces decisions the prose was skipping because constrained syntax forces a choice the unconstrained syntax did not. Names are the most obvious version of this. When I name a function, I am claiming what it does. When I name a parameter, I am claiming what kind of thing flows through it. When I name a type, I am claiming where the boundaries of the concept are. Every name is a small commitment, and the commitments compose into a model.",[11,39,40],{},"I spent an afternoon last year writing what I thought was a clean spec for a workflow before I had touched the data. Three pages of prose. The audience, the inputs, the outputs, the steps. It read well. It also passed three reviewers. When I started typing it, two things happened in the first hour. Three of the entities I had described separately turned out to be the same thing under different names — the spec had introduced them in different paragraphs and never noticed. One of them, which I had described as a single concept, turned out to be two — a request and an event, identical-looking at the document level, but with different lifecycles that the types refused to share.",[11,42,43],{},"None of that was a typing mistake. The contradictions were already in the spec. They survived because prose does not check for them, and the reviewers were reading for clarity, not for closure. 
The compiler is not just a syntax checker. It is the first reader I have ever met that refuses to fill in gaps.",[15,45,47],{"id":46},"the-ambiguity-asymmetry","The ambiguity asymmetry",[11,49,50],{},"Spec-first workflows feel less ambiguous than code-first workflows because prose hides ambiguity inside well-formed sentences while code exposes ambiguity as a missing branch or a name that does not resolve. The feeling is backwards from the substance. The spec feels finished because it has no errors; it has no errors because it has no compiler. The code feels rough because everything not yet decided shows up as something that does not compile. The rough version is the honest one.",[11,52,53,54,59],{},"This holds at the human end. It holds harder at the model end. A model reading a spec fills the ambiguity in with the most plausible reading and hands back something that looks correct, because the most plausible reading is exactly what its training optimizes for. A model writing code against a typed contract cannot fill in the same way — the contract refuses some readings outright, and the wrong output is visible in a way the spec version was not. The constraint travels into the inference. I have written about a related shape of this in ",[55,56,58],"a",{"href":57},"\u002Fblog\u002Fprompts-as-pipelines","prompts as pipelines",": constraints decay with proximity, but they decay much less when the constraint is mechanical instead of textual. A type signature is not a request that the model behave a certain way. It is a shape the output has to fit into, or it does not fit at all.",[11,61,62],{},"That is the asymmetry the spec-first framing misses. It treats prose and code as two notations for the same thinking and picks the one that feels lighter. They are not two notations. 
They are two pressures, and only one of them pushes back.",[15,64,66],{"id":65},"where-the-orchestrator-model-breaks","Where the orchestrator model breaks",[11,68,69],{},"The orchestrator framing breaks on ambiguous problems because it assumes the thinking has already happened upstream of the typing. The implementation step is then a transcription job that can be handed off — to a junior, to an agent, to anything that can produce code from instructions. The framing works when the assumption holds. For a known input mapped to a known output by a routine I have written ten times before, the spec really does close, and the agent really is doing transcription. I delegate those happily, and I will keep delegating them.",[11,71,72],{},"The framing breaks when the typing was the medium of the thinking. The pattern I watch for is small and specific. A feature seems clear in conversation. It gets vaguer when I write it down in a planning doc. It only becomes specific when I sit down to write the function and find I have to name the second argument. The naming is what made it specific. If I had handed the feature off at the planning-doc stage, the second argument would have been named anyway — plausibly, smoothly, and wrong, because the right name was not in the document. It was downstream of work that had not happened yet, and the only place that work could happen was in the typing.",[11,74,75,76,80],{},"The honest move is to notice when typing is doing the thinking and stay in the medium that is doing the work. I covered the inverse case in ",[55,77,79],{"href":78},"\u002Fblog\u002Fjudgement-shaped-problems","judgement-shaped problems"," — for inputs an integration can close around, the spec is enough and the orchestrator framing is fine. This post names the other side. The two shapes need different stacks. Sending one through the other's pipeline wastes both.",[11,82,83],{},"Not every problem is like this. 
Some inputs really do close at the spec; for those, delegating implementation is a clean win and I take it. The mistake is using one framing for both. Before I let the typing be done somewhere else, I want to know which medium the thinking is in. If it is in the code, the code stays with me.",{"title":85,"searchDepth":86,"depth":86,"links":87},"",2,[88,89,90,91],{"id":17,"depth":86,"text":18},{"id":33,"depth":86,"text":34},{"id":46,"depth":86,"text":47},{"id":65,"depth":86,"text":66},"2026-05-23","For ambiguous problems, writing the code is where the thinking happens — not a transcription step after the thinking has finished. The medium and the thought go together.","md",{},true,"\u002Fblog\u002Fthinking-in-code",6,"5 min read",{"title":5,"description":93},"blog\u002Fthinking-in-code",[103,104],"engineering","llm","kbF8qw8evQYGT8X96huui2-EyoLOCz6VJGWA39no30A",[107,244,364,431,557,624,758,867,953,1082,1178],{"id":108,"title":109,"article":110,"body":111,"date":235,"description":236,"extension":94,"meta":237,"navigation":96,"path":238,"promptVersion":110,"readingTime":239,"seo":240,"stem":241,"tags":242,"__hash__":243},"content\u002Fblog\u002Fasking-is-not-enough.md","Asking is not enough",4,{"type":8,"value":112,"toc":229},[113,121,124,128,131,134,137,140,144,155,158,164,167,171,174,177,180,193,196,199,203,206,209,220,223,226],[11,114,115,116,120],{},"The usual workflow goes like this: write a prompt, read the output, decide it looks reasonable, move on. Repeat for the next step. By the end, something is broken, and it's not obvious where — because every individual answer ",[117,118,119],"em",{},"sounded"," right.",[11,122,123],{},"This is the query reflex. It treats an LLM call like a search query: ask, receive, accept. It works fine for one-off questions with no downstream consequences. 
It fails, quietly and consistently, everywhere else.",[15,125,127],{"id":126},"plausible-is-not-correct","Plausible is not correct",[11,129,130],{},"Language models are trained to produce coherent output. Coherence and correctness are different things. A model will confidently describe a codebase it hasn't seen, summarize a document with subtle inversions of meaning, or extract fields from text and miss edge cases that only matter in production.",[11,132,133],{},"None of this looks wrong on first read. That's the problem.",[11,135,136],{},"Plausibility bias — the tendency to accept output that reads well — is why unvalidated LLM output breaks workflows at the worst moment. The failure doesn't surface at the prompt; it surfaces three steps later, in a place that seems unrelated. By then, the original output is already treated as ground truth.",[11,138,139],{},"Validation isn't a nice-to-have attached to the end of the process. It belongs at the point of output, as a condition of continuing.",[15,141,143],{"id":142},"the-prompt-that-describes-nothing-useful","The prompt that describes nothing useful",[11,145,146,147,150,151,154],{},"Weak prompts fail for a specific reason: they describe a ",[117,148,149],{},"topic"," rather than a ",[117,152,153],{},"task",".",[11,156,157],{},"\"Summarize this document\" is a topic. The model can do something coherent with it. What it can't know is: what does the summary need to contain for the next step to work? What format does downstream code expect? What's the maximum length before another process breaks? What happens if a field is missing?",[11,159,160,161],{},"A task-shaped prompt defines the output contract. 
Not through over-engineering — not temperature settings and system prompt tuning — but through a simple prior question: ",[117,162,163],{},"what does success look like, and how would I know?",[11,165,166],{},"Prompt technicalities (model selection, token budgets, formatting tricks) matter, but they're downstream of that question. Getting the technical settings right while leaving the task undefined produces well-formatted nonsense.",[15,168,170],{"id":169},"outputs-are-inputs","Outputs are inputs",[11,172,173],{},"The thing that changes how you write prompts is thinking of each LLM call as a transformation node rather than a question.",[11,175,176],{},"A node takes input, does something to it, and produces output. That output is the input to the next node. Which means the output needs to satisfy a contract — a shape, a schema, a set of conditions — that the next step depends on.",[11,178,179],{},"When you design prompts this way, several things become obvious that weren't before:",[181,182,183,187,190],"ul",{},[184,185,186],"li",{},"What structured data does the next step actually need?",[184,188,189],{},"What happens if a field is absent or ambiguous?",[184,191,192],{},"Where does the chain assume the previous step was correct?",[11,194,195],{},"The last question is the most important one. Silent assumptions propagate. A workflow that assumes each step succeeded — without checking — doesn't just have a bug. It has a bug that compounds.",[11,197,198],{},"I've seen this in agentic systems where an early classification step returns a plausible-but-wrong category, and every subsequent step proceeds as if the category were verified. The end state is coherent and completely wrong. No single call was obviously bad. The problem was the absence of checks between them.",[15,200,202],{"id":201},"the-check-before-the-call","The check before the call",[11,204,205],{},"The practical change isn't about prompting technique. 
It's about what you define before you write the prompt.",[11,207,208],{},"Before calling the model, answer three things:",[181,210,211,214,217],{},[184,212,213],{},"What specific data does this step need to produce?",[184,215,216],{},"What are the conditions under which that data is good enough to pass forward?",[184,218,219],{},"What does the next step do if this one returns something malformed?",[11,221,222],{},"These questions force you to think about the call as a step in a flow rather than an isolated question. They make the validation obvious — because you've already decided what the output is supposed to be. And they make weak prompts visible, because a prompt that can't answer \"what does success look like\" hasn't been thought through yet.",[11,224,225],{},"The output of an LLM call is only as useful as the step that uses it. Designing backwards from there — from consumer to producer — is the difference between a pipeline that holds and one that fails somewhere you're not looking.",[11,227,228],{},"Asking is the easy part. 
Knowing what you needed to hear is the work.",{"title":85,"searchDepth":86,"depth":86,"links":230},[231,232,233,234],{"id":126,"depth":86,"text":127},{"id":142,"depth":86,"text":143},{"id":169,"depth":86,"text":170},{"id":201,"depth":86,"text":202},"2026-03-18","Most LLM workflows fail not because the model gets it wrong, but because nobody defined what right looks like before calling it.",{},"\u002Fblog\u002Fasking-is-not-enough","6 min read",{"title":109,"description":236},"blog\u002Fasking-is-not-enough",[103,104],"4HwYt5JTqwgpJ9S-w7sagYHx62aqDPszhWuM2WJ39Fg",{"id":245,"title":246,"article":86,"body":247,"date":356,"description":357,"extension":94,"meta":358,"navigation":96,"path":359,"promptVersion":86,"readingTime":99,"seo":360,"stem":361,"tags":362,"__hash__":363},"content\u002Fblog\u002Fboring.md","Boring software",{"type":8,"value":248,"toc":349},[249,252,255,259,262,269,272,276,282,289,292,295,299,302,305,308,315,319,330,333,337,340,343,346],[11,250,251],{},"The highest compliment I can give a system is that it is boring. Not boring as in unambitious. Boring as in: nothing unexpected happens. Deploys go out on Tuesday and nobody holds their breath. The on-call phone doesn't ring. Customers don't notice, because there is nothing to notice. Everything just works.",[11,253,254],{},"This is extraordinarily hard to achieve.",[15,256,258],{"id":257},"the-glamour-problem","The glamour problem",[11,260,261],{},"Our industry has a glamour problem. We celebrate the heroic fix, the all-night debugging session, the engineer who saved production at 3 AM. We write blog posts about surviving scale, war stories about incidents, post-mortems that read like thrillers. These make for great conference talks. They make for terrible engineering cultures.",[11,263,264,265,268],{},"If your team regularly needs heroes, your system is telling you something. 
It's telling you that the work that ",[117,266,267],{},"should"," have been done — the planning, the testing, the documentation, the careful thinking about failure modes — was skipped in favor of speed.",[11,270,271],{},"Speed is not velocity. Shipping fast and fixing later is just borrowing time from your future self at a brutal interest rate.",[15,273,275],{"id":274},"what-boring-looks-like","What boring looks like",[11,277,278,279,154],{},"Boring software is written clearly. Not cleverly — clearly. Every function does what its name says. The code reads like prose, not puzzles. A new engineer can open any file and understand what it does and why. Not because the code is simple, but because someone took the time to make it ",[117,280,281],{},"legible",[11,283,284,285,288],{},"Boring software is documented. Not as an afterthought, not as a ticket that never leaves the backlog — documented as a first-class part of the work. The architecture decisions are recorded. The runbooks exist ",[117,286,287],{},"before"," the incident. The README actually tells you how to run the project.",[11,290,291],{},"Boring software is tested. Not with a handful of optimistic happy-path tests, but with the kind of testing that comes from genuinely asking: what could go wrong? Edge cases are not surprises — they were anticipated, discussed, and handled. The test suite is a living document of every assumption the system makes.",[11,293,294],{},"Boring software is deployed predictably. There is a pipeline. It runs the same way every time. Rollbacks are a button, not a prayer. Feature flags gate new behavior. Migrations are backward-compatible. Nobody is SSHing into production.",[15,296,298],{"id":297},"the-cost-of-excitement","The cost of excitement",[11,300,301],{},"I've worked on exciting systems. Systems where every deploy was an event, where monitoring dashboards were watched like live sports, where Slack channels lit up at odd hours. It felt important. 
It felt like we were doing real work.",[11,303,304],{},"We were. But most of that work was a loan we didn't remember taking out.",[11,306,307],{},"A heroic fix is not a one-time cost. It's a loan whose collateral is the next hero hour. The artifacts of last night's save — the untested patch, the undocumented workaround, the deploy nobody quite understands — become the substrate of next month's incident. This is why exciting teams stay exciting. Each rescue makes the system slightly less legible, which makes the next failure slightly harder to diagnose, which requires a slightly bigger hero. Boring is not the absence of heroism. It is the refusal to take out the loan in the first place.",[11,309,310,311,314],{},"There's a way to tell whether your team has stopped paying interest. Count your last ten incidents. Not by severity — by root cause. If most of them are variations of a cause you've already seen, the team isn't encountering the frontier of its system. It's failing to learn from it. A boring team is not the team with the fewest incidents. It's the team whose incidents are all ",[117,312,313],{},"new"," — each one a genuinely novel discovery, never the same race condition twice.",[15,316,318],{"id":317},"the-visibility-problem","The visibility problem",[11,320,321,322,325,326,329],{},"Boring has a visibility problem. You cannot demo a thing that didn't happen. You cannot put \"prevented fourteen outages\" on a performance review, because the counterfactual doesn't exist — there is no parallel universe to compare against. Organizations can only see what ",[117,323,324],{},"fails",", never what was ",[117,327,328],{},"prevented",", which means the incentive gradient inside almost every company points away from boring and toward visible heroism.",[11,331,332],{},"This is not a culture problem you fix with a values poster. It is a measurement problem. 
The teams that stay boring are the ones whose managers have learned to read negative space — to ask \"what didn't happen this quarter that should have?\" instead of only \"what did you ship?\" Until someone is rewarded for the absence of incidents, the system will keep paying its best engineers to create them.",[15,334,336],{"id":335},"the-quiet-pride","The quiet pride",[11,338,339],{},"There's a particular kind of satisfaction that comes from running a system that just works. No war stories. No heroics. No dramatic saves. Just steady, reliable, unremarkable service — day after day.",[11,341,342],{},"It won't get you on stage at a conference. Nobody writes threads about the deploy that went exactly as planned. But your team sleeps well. Your users trust you without thinking about it. And when you do need to change something, you change it calmly, because the system was built to be changed.",[11,344,345],{},"That is not the absence of craft. That is craft at its highest form.",[11,347,348],{},"The best software is boring. I intend to keep it that way.",{"title":85,"searchDepth":86,"depth":86,"links":350},[351,352,353,354,355],{"id":257,"depth":86,"text":258},{"id":274,"depth":86,"text":275},{"id":297,"depth":86,"text":298},{"id":317,"depth":86,"text":318},{"id":335,"depth":86,"text":336},"2026-02-05","The best software is the kind where nothing happens. 
Everything was planned, written, tested, and deployed — and it just works.",{},"\u002Fblog\u002Fboring",{"title":246,"description":357},"blog\u002Fboring",[103],"BiFHE_SWqa40PEU61MBlocjD_eVu5f6X9U2aHbDgSGM",{"id":365,"title":366,"article":367,"body":368,"date":422,"description":423,"extension":94,"meta":424,"navigation":96,"path":425,"promptVersion":426,"readingTime":99,"seo":427,"stem":428,"tags":429,"__hash__":430},"content\u002Fblog\u002Fdisposable.md","Disposable by default",10,{"type":8,"value":369,"toc":416},[370,373,377,380,383,387,390,393,397,400,403,407,410,413],[11,371,372],{},"Andrej Karpathy's line is that vibe coding raises the floor and agentic engineering raises the ceiling. Two recent pieces run with it — one from MindStudio walking through the framework, one from Fan Wu on Design Bootcamp mapping it onto product work — and both tell it as a story about people. Amateurs vibe their way to a working demo; professionals orchestrate agents and check the output; the arc from one to the other is a ladder you climb. The metaphor is clean, and I think it points at the wrong variable. What decides which mode you should be in isn't who you are. It's whether the code you're about to generate has to survive.",[15,374,376],{"id":375},"the-ladder-that-isnt-one","The ladder that isn't one",[11,378,379],{},"Fan Wu's piece stacks the work in three layers: strategic product thinking on top, agentic systems in the middle, vibe practice at the bottom — the why, the what, the how. It reads as a climb, bottom to top. Karpathy's floor-and-ceiling image does the same work: the floor is where beginners stand, the ceiling is what experts reach for. Both describe a person moving up and staying up.",[11,381,382],{},"When these two pieces first crossed my feed I read them the same way. What changed my mind was a smaller observation: the same engineer drops from the ceiling to the floor and back inside a single afternoon. 
The ladder framing breaks because the mode you pick changes far faster than your skill does. Skill only moves one direction, and slowly — you are not less experienced after lunch than before it. But the choice between vibing something out and verifying it line by line flips a dozen times a day, which means it tracks something that changes minute to minute. Skill isn't it.",[15,384,386],{"id":385},"what-actually-changes-is-the-artifact","What actually changes is the artifact",[11,388,389],{},"The MindStudio piece points to Peter Steinberger running dozens of agents in parallel and checking every output before it lands. The easy reading is that this is just what professionals do — verification as a mark of seniority. But look at what he's actually checking: outputs headed into a codebase that teammates, and his own later self, will build on top of. He isn't verifying because he's senior. He's verifying because the output crosses a line where someone downstream has to trust it without re-deriving it.",[11,391,392],{},"Verification is only worth its cost once an output has to be trusted by a second reader, because that is the only time being wrong is expensive. Below that line, a wrong output costs you the thirty seconds it takes to throw it away. Above it, a wrong output costs whoever inherits it — and they inherit it without the context that would let them catch the error cheaply. Call the line the survival boundary. Under it sit the disposable things: the throwaway script, the one-off data transform, the prototype you demo once and close. Over it, the output has a second reader — a teammate, production, the next agent that loads your code as context. Verification is the toll for crossing.",[15,394,396],{"id":395},"why-the-label-hides-the-real-skill","Why the label hides the real skill",[11,398,399],{},"Karpathy's own late-2025 note is that the models got reliable enough that he mostly stopped correcting them. 
In nearly the same breath comes the example everyone repeats: a frontier model that refactors a sprawling codebase cleanly and then, asked something about the physical world, recommends walking to a car wash. The capability is spiky. Reliability is high on average and unpredictable in any single spot, so the decision to verify can't be set once and worn like a title — it has to be made fresh against this output and whatever depends on it.",[11,401,402],{},"Treating the mode as identity fails in two directions, because the label answers \"who am I\" when the artifact is asking \"will anyone trust this later.\" The engineer who has decided he's an agentic engineer now reviews a five-line script he'll run twice and delete, paying a toll at a boundary he never crossed. The engineer who has decided vibe coding is fine ships the demo that quietly becomes the backend. Both got the question wrong from opposite ends, and in each case the miss wasn't about skill — it was a misread of how long the thing they made was going to live.",[15,404,406],{"id":405},"the-question-to-ask-instead","The question to ask instead",[11,408,409],{},"Fan Wu assigns each model a role — ChatGPT as thinking partner, Claude as product partner, Gemini as the executor — and frames it as how he orchestrates. MindStudio leans the other way, folding database, auth, and payments behind the agent so the orchestration disappears, while a tool like Remy generates a full app from an annotated markdown spec. Between them, the floor and the ceiling stop being separate rooms. When the platform handles the plumbing and the model writes the app, \"what kind of engineer am I\" gets harder to even answer. The one question that survives the tooling is whether the thing you just generated has to survive too.",[11,411,412],{},"So ask it directly, per artifact, before you accept or check an output: who is the second reader? 
If you can't name one — no teammate, no production path, no future agent that will read this as context — the output is disposable, and verifying it is wasted motion. If you can name one, it crossed the boundary, and you owe it the check. The test is fast enough to run in your head on every generation, which is the whole point: the check has to cost less than what it's checking, or it isn't worth running. This only bites if you've watched a prototype get promoted to production without a rewrite; if you haven't yet, the tell is the demo nobody on the team is willing to delete.",[11,414,415],{},"The vibe-versus-agentic question gets sold as a fork in your career — pick a side, grow up, become the engineer who verifies. It's smaller and more constant than that: a call you make dozens of times a day about a single artifact, answered \"disposable\" most of the time and \"this one has to survive\" the rest. The case I haven't worked out is the artifact that changes its answer after you've decided — the script you correctly called disposable on Monday that someone quietly builds on by Friday, the boundary sliding under code you've long stopped checking. Nobody re-runs the decision, because nothing tells them it moved.",{"title":85,"searchDepth":86,"depth":86,"links":417},[418,419,420,421],{"id":375,"depth":86,"text":376},{"id":385,"depth":86,"text":386},{"id":395,"depth":86,"text":396},{"id":405,"depth":86,"text":406},"2026-06-02","Vibe coding versus agentic engineering is sold as a skill tier. 
It is really a per-artifact call: most agent output is disposable, and verification only pays once an output crosses the boundary where a second reader has to trust it.",{},"\u002Fblog\u002Fdisposable",9,{"title":366,"description":423},"blog\u002Fdisposable",[103,104],"Ju1HGhJDj6v6UJSGLbqzBA2msX3pPiAc-1xvI6M7YMQ",{"id":432,"title":433,"article":98,"body":434,"date":550,"description":551,"extension":94,"meta":552,"navigation":96,"path":78,"promptVersion":98,"readingTime":239,"seo":553,"stem":554,"tags":555,"__hash__":556},"content\u002Fblog\u002Fjudgement-shaped-problems.md","Judgement-shaped problems",{"type":8,"value":435,"toc":543},[436,439,443,446,449,452,456,459,462,465,468,475,479,482,489,492,496,499,502,505,508,511,515,518,521,524,534,540],[11,437,438],{},"\"Agent\" has become a label for anything with an LLM in it. Some of the systems wearing the label do work that integrations could not reach. Most do not. The difference is not what is in the box — it is whether the problem itself is judgement-shaped, and whether the if\u002Felse has actually moved out of the code into the model's inference. If neither is true, the system is an integration in costume.",[15,440,442],{"id":441},"what-integrations-were-built-for","What integrations were built for",[11,444,445],{},"Take the simplest possible automation: an invoice arrives, a notification posts to Slack, the sheet updates. Most automation looks like that. There is a known input, a known output, and the work is wiring the two together with as little surprise as possible. Payment integrations, ETL pipelines, scheduled syncs, webhooks routing events between systems — the shape integrations were built for is \"I have a known input and a known output, and I want them connected reliably.\"",[11,447,448],{},"Integrations encode that input space at design time because that is the shape of problem they were built for. Every input class is a branch I write. Every branch is code I maintain. 
As long as my branches cover the inputs that arrive, the system runs cleanly.",[11,450,451],{},"The trouble starts when the real input space is open. Free-form text, mixed-mode requests, exceptions I did not enumerate when I shipped. Each new shape is a new branch, and the branches start interacting. Surface area grows quadratically with input variance, and the integration architecture — which scaled gracefully under enumerable inputs — becomes a maintenance liability under non-enumerable ones. Not because the architecture is wrong. Because the problem changed shape underneath it.",[15,453,455],{"id":454},"what-agent-means-when-the-word-means-something","What \"agent\" means when the word means something",[11,457,458],{},"A useful definition of \"agent\" should make it clear which systems are doing something integrations could not. Mine is narrow on purpose: an agent is a system that has moved the input → output decision from my code to the model's inference. The if\u002Felse does not vanish. It gets relocated. Where the integration's branches live in source I write, the agent's branches live in the prompt I send and the reasoning the model does at runtime.",[11,460,461],{},"This is an architectural change, not a marketing one. Relocating the if\u002Felse changes who maintains the input space — me, ahead of time, or the model, in the moment. The cost shifts from code I must write to inference I must trust.",[11,463,464],{},"Take support-email triage. The integration version classifies by keywords or by a per-intent classifier. It works for the cases I anticipated, and fails on the ones I did not — the customer who is angry but polite, the bug report disguised as a feature request, the urgent thread buried in pleasantries. Each new failure mode is a new branch. The branches multiply. The system grows brittle.",[11,466,467],{},"The agent version reads the email and decides what to do with it. 
The decision logic is no longer in my code; it is in the model's interpretation of the prompt and the email. I have not removed the if\u002Felse. I have moved it to a place where new shapes do not require new code.",[11,469,470,471,474],{},"That gives the reader a diagnostic. ",[117,472,473],{},"Can I write the flowchart?"," If yes, the input is integration-shaped, and an agent is overkill. If every attempt hits \"and 200 edge cases,\" the input is judgement-shaped, and the relocation is what lets the problem be answered at all.",[15,476,478],{"id":477},"the-label-is-uncalibrated","The label is uncalibrated",[11,480,481],{},"Most systems calling themselves agents are not. They are integrations with a model call somewhere in the middle. The shape is familiar: read input, ask an LLM to extract structured fields, run a deterministic flow on what the LLM returned. The if\u002Felse still lives in my code. The model is doing fuzzy parsing, not judgement.",[11,483,484,485,488],{},"I have a name for this pattern: ",[117,486,487],{},"model-assisted integration",". It is not bad architecture. For many problems it is the right architecture — fuzzy parsing is a real capability when the input is structured-but-messy. The error is not in building one. The error is in calling it an agent and inheriting the runtime properties of one in marketing copy without inheriting them in code.",[11,490,491],{},"Most production \"agents\" are this shape because the label gets applied for marketing rather than for architecture, and there is no standard yet to push back. The presence of an LLM call is not the test. The test is where the decision lives. If the flow is fixed and the LLM is filling in fields, the system is a model-assisted integration. If the flow is decided by the model at every step, the system is doing what the word \"agent\" should mean. 
Most things in production are the first one wearing the second one's clothes.",[15,493,495],{"id":494},"the-failure-mode-trade-off","The failure mode trade-off",[11,497,498],{},"Even when the input is genuinely judgement-shaped, an agent is not always the right answer. The relocation comes with a permanent cost: how the system fails.",[11,500,501],{},"Integrations fail loud. A schema does not match. A field is missing. A request times out. The error is visible at the boundary between systems, easy to log, easy to alert on. Whatever the bug is, I see it, and I can fix it.",[11,503,504],{},"Agents fail soft. The output is plausibly wrong — confident, well-formed, in the right shape — and it passes through the rest of the system as if it were correct. There may be no exception, no log line, no alert. The error becomes visible only downstream, when something acts on the wrong output. Sometimes I never see it.",[11,506,507],{},"This is not a bug in any particular agent. It is a runtime property. Agents fail soft because inference is non-deterministic by construction; the output is sampled from a distribution that includes plausibly-wrong answers. Better models reduce the distribution's tail. They do not eliminate it.",[11,509,510],{},"That makes the choice between integration and agent partly a choice of which failure mode I can afford. A financial transaction routing system cannot tolerate plausibly-wrong outputs; integration is the only honest answer there, regardless of how judgement-shaped the input feels. A content-tagging system can tolerate the occasional miscategorisation; soft failure is a cost the system can absorb. 
The shape of the input matters; the cost of soft failure matters more.",[15,512,514],{"id":513},"where-agents-earn-their-cost","Where agents earn their cost",[11,516,517],{},"Three shapes of problem reliably reward the relocation, because each satisfies all three conditions at once: the input space is open, soft failure is acceptable, and inference is cheaper than the alternative.",[11,519,520],{},"Under-determined intent extraction. Support-email triage, customer-feedback classification, free-form ticket routing. Input cannot be enumerated; output is one of a manageable number of buckets; soft failure is annoying but recoverable.",[11,522,523],{},"Cross-domain reasoning. Legal documents to action items. Procurement notices to compliance checks. Call transcripts to CRM updates. Input format varies; target schema varies; the mapping requires interpretation no fixed code path can carry.",[11,525,526,527,530,531,154],{},"Generative work. CMS content drafts, marketing copy, product descriptions. The output does not exist before the system runs; there is no input to map ",[117,528,529],{},"to"," — only an input to think ",[117,532,533],{},"about",[11,535,536,537,539],{},"I have shipped systems in each of these shapes, and the diagnostic from earlier is what I run before I pick a stack. ",[117,538,473],{}," If yes, ship the flowchart. If every attempt hits \"and then it depends,\" check whether the failure mode I would inherit is one I can afford. If both answers point to an agent, the relocation pays. Tedium of maintaining the integration is recoverable. Plausibly-wrong output that nobody catches is not.",[11,541,542],{},"Most production systems wearing the agent label sit between those two — integrations in costume, where the if\u002Felse still lives in the code. The label will keep moving until something forces it not to. 
Until then, the question to ask is shape, not name.",{"title":85,"searchDepth":86,"depth":86,"links":544},[545,546,547,548,549],{"id":441,"depth":86,"text":442},{"id":454,"depth":86,"text":455},{"id":477,"depth":86,"text":478},{"id":494,"depth":86,"text":495},{"id":513,"depth":86,"text":514},"2026-04-30","Most production 'agents' are integrations in costume. The test is where the if\u002Felse lives — in code, or in inference.",{},{"title":433,"description":551},"blog\u002Fjudgement-shaped-problems",[103],"YLdjxintja3yteRXbS0wwiROLaAIKgJHpy5T6fMxC6g",{"id":558,"title":559,"article":560,"body":561,"date":614,"description":615,"extension":94,"meta":616,"navigation":96,"path":617,"promptVersion":560,"readingTime":618,"seo":619,"stem":620,"tags":621,"__hash__":623},"content\u002Fblog\u002Fon-simplicity.md","On simplicity",1,{"type":8,"value":562,"toc":609},[563,566,572,576,583,586,590,593,596,599,603,606],[11,564,565],{},"There is a common misconception that simplicity means doing less, or removing features, or leaving things out. This couldn't be further from the truth.",[11,567,568,569,571],{},"True simplicity — the kind that feels inevitable when you encounter it — is the result of deeply understanding a problem. It requires you to go through complexity, not around it. You have to hold the full weight of what something could be, and then make careful, sometimes painful decisions about what it ",[117,570,267],{}," be.",[15,573,575],{"id":574},"the-cost-of-simplicity","The cost of simplicity",[11,577,578,579,582],{},"Dieter Rams understood this. His ten principles of good design are not a checklist for making things minimal. They are a framework for making things ",[117,580,581],{},"honest",". \"Good design is as little design as possible\" does not mean the designer did little work. 
It means the designer did so much work that the result appears effortless.",[11,584,585],{},"This is the paradox at the center of every meaningful design decision: the simpler the outcome, the harder the process. The reason is mechanical, not mystical. You cannot decide what to throw away from a room you have not entered. Reduction is impossible without first carrying everything — every option, every edge case, every user you imagined. The minimal answer is the one that survives after you have held all the others in your hands and put them down on purpose.",[15,587,589],{"id":588},"reduction-as-a-practice","Reduction as a practice",[11,591,592],{},"I've found that the most useful question in any design process is not \"what should we add?\" but \"what can we remove?\" Not as a cost-cutting exercise, but as a form of respect for the person who will use what you make.",[11,594,595],{},"Every element on a screen is a demand on someone's attention. Every feature is a promise you have to keep. Every option is a decision someone has to make. When you remove something, you're not just making the interface cleaner — you're giving someone back a small piece of their cognitive freedom.",[11,597,598],{},"Call this the cost of options. The user always pays it — in attention, in hesitation, in trust spent on choices that should never have been theirs to make.",[15,600,602],{"id":601},"the-discipline-of-restraint","The discipline of restraint",[11,604,605],{},"Restraint is not a natural instinct. We want to show our work. We want to demonstrate capability. But the moment a design starts to feel like it's trying to impress you, it has already failed.",[11,607,608],{},"The goal is not invisibility. It is inevitability. A simple thing is one the user could not imagine being any other way — not because it disappeared, but because it answered the question they came with so completely that no other shape was left to consider. 
Here is a test: if you cannot draw your own interface from memory, neither can the people who use it. The fix is not to label things better. It is to make fewer things.",{"title":85,"searchDepth":86,"depth":86,"links":610},[611,612,613],{"id":574,"depth":86,"text":575},{"id":588,"depth":86,"text":589},{"id":601,"depth":86,"text":602},"2026-01-10","Simplicity is not the absence of complexity — it is the resolution of it.",{},"\u002Fblog\u002Fon-simplicity","4 min read",{"title":559,"description":615},"blog\u002Fon-simplicity",[622],"design","SfBKecGmVFkYfsM8Ty4v77FWDez784Q4ANjsYl7yk-4",{"id":625,"title":626,"article":426,"body":627,"date":748,"description":749,"extension":94,"meta":750,"navigation":96,"path":751,"promptVersion":752,"readingTime":753,"seo":754,"stem":755,"tags":756,"__hash__":757},"content\u002Fblog\u002Fpattern-shopping.md","Pattern shopping",{"type":8,"value":628,"toc":740},[629,632,636,639,642,646,649,657,660,663,666,669,672,675,678,682,685,688,691,694,698,701,715,718,721,725,728,731,734,737],[11,630,631],{},"There is a genre of post that lays out a roadmap for mastering agentic design patterns. ReAct first, then Reflection, then Planning, then Tool Use, then Multi-agent. Each pattern gets a short paragraph on when to reach for it and a short paragraph on what it costs. The pieces are useful as references and I have nothing against them as references. The problem is that they read as curricula, and a pattern catalog read as a curriculum produces decisions made by vocabulary rather than by need.",[15,633,635],{"id":634},"the-catalog-reads-forward","The catalog reads forward",[11,637,638],{},"ReAct is a loop — think, act, observe, repeat. Reflection is generation, self-critique, revision. Planning is decomposing the task into ordered steps before any of them run. Tool use is calling out to a fixed catalog of external functions. Multi-agent is splitting the work across specialists under a coordinator. 
Each pattern is presented the same way: here is the name, here is the loop, here is when to use it, here is what it costs.",[11,640,641],{},"The catalog reads in that order — name, then move, then constraint — because that is the only order an enumeration can read in. The name is the index. You cannot look up \"the move that fixes outputs you cannot validate inline\" — you can only look up Reflection and discover it might be. The sequence is forced by the format.",[15,643,645],{"id":644},"engineering-runs-backward","Engineering runs backward",[11,647,648],{},"The pipeline I wrote this post in has three components. A planning phase that runs before any prose. An audit phase that reads the draft against a written critique list. A vocabulary file the draft gets swept against. The catalog has names for all three: the planning phase is Planning, the audit phase is Reflection, the vocabulary file is closer to Tool Use.",[11,650,651,652,656],{},"None of those pieces got chosen because they appeared in a catalog. The planning phase exists because drafts kept opening with abstraction instead of a concrete anchor. The audit phase exists because AI-tell words kept slipping through the drafting pass and I could not catch them in the same pass that wrote them. The vocabulary file exists because the same rules were drifting across two files — I named that one already, in ",[55,653,655],{"href":654},"\u002Fblog\u002Frule-drift","rule drift",". Each piece is the response to a specific failure, and the catalog names them only after the fact.",[11,658,659],{},"Engineering runs backward from the catalog because the design decision is which constraint binds, not which name applies. The name is a label on the answer, not the answer.",[15,661,626],{"id":662},"pattern-shopping",[11,664,665],{},"Three teams over the last year. 
The criticism is the catalog, not the engineers.",[11,667,668],{},"Team A wrapped a Reflection loop around a model output that was supposed to be JSON in a specific schema. The Reflection loop generated the JSON, the critic checked it, the reviser fixed it if not. The loop ran in two to four seconds per call. A JSON-schema validator on the same output would have run in milliseconds and either passed or failed with a precise reason. The team could name the pattern but not the constraint Reflection was supposed to answer.",[11,670,671],{},"Team B reached for Planning on a task that already had a fixed sequence of steps. The plan was generated, the executor walked through it, the plan generator re-planned whenever something unexpected came back. The system worked. A deterministic pipeline of the same steps would have been faster, cheaper, and easier to debug. Planning was the right name for the move, but the move was answering a constraint that was not present.",[11,673,674],{},"Team C decomposed a single agent into five specialists because the prompt was getting long. The bug count went up — coordination is its own surface — and the latency went up because the coordinator now had to make a call before any specialist ran. The bottleneck had been prompt length, which responds to retrieval or summarisation, not to specialisation.",[11,676,677],{},"Pattern shopping happens because the reader picks the most recently learned move rather than the one their failure mode demands. The move and the failure detach. The system still gets built; it just gets built around a vocabulary, and the vocabulary's weight is wrong for the constraint at hand.",[15,679,681],{"id":680},"the-ancestor-test","The ancestor test",[11,683,684],{},"The diagnostic is one sentence per pattern: name the pre-LLM engineering move it descends from.",[11,686,687],{},"ReAct is a debug loop with logging — produce an output, observe its effect, decide what to do next, repeat. 
Any engineer who has shipped a long-running process has written that loop, usually with print statements as the observation step. The loop body is now a language model; the loop itself is decades old. Reflection is code review with a linter — the model as both author and reviewer, the linter as the deterministic check the reviewer reaches for. Planning is task decomposition — the move every engineer makes the first time they write a one-paragraph spec before they write the code. Tool use is API integration with a fixed catalog of endpoints. Multi-agent is service decomposition; the trade-offs (coordination cost, ownership of state, routing logic) are the same ones distributed systems have always carried.",[11,689,690],{},"If the ancestor is unfamiliar, the pattern probably is too, and the catalog is doing the work of scaffolding rather than reference. Scaffolding is not a bad thing — most learning needs it — but it is a different thing, and the genre does not label it that way.",[11,692,693],{},"The catalog reads correctly when the reader brings the constraint to it, because then the catalog only has to supply the name. The expensive part — recognising which move applies — is already done.",[15,695,697],{"id":696},"the-replacement","The replacement",[11,699,700],{},"A design review template that fits on one screen. For each pattern in a proposed design, the author has to answer three questions. What failure mode does this pattern answer? What is the cheapest alternative we considered? What breaks if we remove this pattern? If the author cannot answer all three, the design goes back. The template took ten minutes to write and has caught more over-architecture than any technical-design book on my shelf.",[11,702,703,704,707,708,707,711,714],{},"Intake runs the same direction. The first document for any new system is a failure-mode list, not a pattern list. 
Each failure mode is one sentence — ",[117,705,706],{},"the model produces malformed JSON",", ",[117,709,710],{},"the prompt grows beyond context limits",[117,712,713],{},"the user-facing latency exceeds two seconds",". Patterns enter the doc second, attached to specific failure modes. A pattern with no failure attached gets cut.",[11,716,717],{},"Hiring runs the same logic. A standard question I now ask in technical interviews: walk me through a system you have shipped, and for each architectural choice — not just the obvious patterns, all of it — name the constraint that drove it. A candidate who pattern-shops will lead with the names they used; one who has thought backwards will lead with the constraints those names answered, and that difference is audible inside twenty minutes. It is a more reliable hiring signal than any algorithms round I have run.",[11,719,720],{},"A constraint-first design review surfaces over-architecture before it ships because each pattern has to defend its place rather than appear by default. The same logic carries through intake and hiring. The CTO seat acts on three levers — approval, intake, hiring — and pulling them in the same direction is what stops the team pattern-shopping.",[15,722,724],{"id":723},"when-the-shelves-are-empty","When the shelves are empty",[11,726,727],{},"An engineer on my team joined six months ago. Three years into their career, all of it building on top of LLMs and managed APIs — no service decomposition shipped, no debug loop written in print statements, no integration tests against an unreliable third-party API. Smart, ships, fast. The ancestor test points at empty shelves for them.",[11,729,730],{},"The scaffolding is not the catalog and not the pre-LLM history. It is a curated wiki of the failures the team has had, each entry organised by the constraint it taught — JSON output versus Reflection, fixed pipelines versus Planning, prompt length versus specialisation, and the rest as they accrete. 
Each entry carries the constraint, the move the team almost made (or made and recovered from), and the cheapest right move once the constraint was named. New engineers read the wiki first, the pattern catalog second.",[11,732,733],{},"A wiki of past failures works as scaffolding because it gives the engineer the constraint side of every pattern before they have earned it through experience, so the names in the catalog have somewhere to attach. The catalog comes out only after the wiki has primed the constraint half. The reading order matches engineering's order.",[11,735,736],{},"The wiki has a limit. It is reactive. It only covers failures the team has already named, which means the engineer reading it will still pattern-shop on any constraint the wiki does not anticipate. The wiki is a stopgap that fills in until the team grows an engineer who has the constraint side already, and stops being load-bearing the moment that engineer is in the room.",[11,738,739],{},"The catalog itself is fine. The genre that presents it as a curriculum is the move I am pushing back on. The intake template and the wiki of past failures between them give a team a way to read the catalog backward — pattern shopping gets harder when the design review will not approve a pattern without a failure attached, and newer engineers get the constraint side they have not lived through. The harder problem is the failure modes the wiki does not yet name. The wiki catches up by waiting for the loss, and the next pattern-shopping incident on a constraint nobody has spoken about will look indistinguishable from the rest until the loss has a name.",{"title":85,"searchDepth":86,"depth":86,"links":741},[742,743,744,745,746,747],{"id":634,"depth":86,"text":635},{"id":644,"depth":86,"text":645},{"id":662,"depth":86,"text":626},{"id":680,"depth":86,"text":681},{"id":696,"depth":86,"text":697},{"id":723,"depth":86,"text":724},"2026-05-26","Pattern catalogs read forward — name, then move, then constraint. 
Engineering runs the other way. Reading them as curriculum produces moves imported from a vocabulary rather than earned from a failure.",{},"\u002Fblog\u002Fpattern-shopping",8,"8 min read",{"title":626,"description":749},"blog\u002Fpattern-shopping",[103,104],"PUiemC6ETk_A9UPOQBfLVNqR_XenzSbTrpFmnfumCpw",{"id":759,"title":760,"article":761,"body":762,"date":860,"description":861,"extension":94,"meta":862,"navigation":96,"path":57,"promptVersion":761,"readingTime":99,"seo":863,"stem":864,"tags":865,"__hash__":866},"content\u002Fblog\u002Fprompts-as-pipelines.md","Prompts as pipelines",3,{"type":8,"value":763,"toc":853},[764,767,771,774,777,781,784,787,791,794,797,800,804,811,814,817,821,824,827,830],[11,765,766],{},"There is a common misconception that prompt engineering is the craft of writing better prompts. Most of the advice online — the tactics, the lists of techniques, the templates — takes the single prompt as the unit of work. After two years of writing, rewriting, and living with the prompts I actually use, I have come to disagree. The unit of work is not the prompt. It is the pipeline.",[15,768,770],{"id":769},"the-single-prompt-trap","The single-prompt trap",[11,772,773],{},"The first prompt I ever kept in a file was a thousand words long. It told a model how to turn a rough topic into a finished article: define the audience, pick the angle, draft a headline, write the piece, check the facts, tighten the prose. It was exhaustive. It was also unreliable. The model would ignore the fact-check step when the draft got interesting. It would forget the audience by the time it reached the conclusion. It would quietly invent statistics it had been explicitly told not to invent.",[11,775,776],{},"This happens because instructions in a long prompt do not keep their weight. Each rule you add dilutes the ones before it. The model is not reading your prompt as a checklist; it is reading it as evidence of what kind of reply you want. 
The twelfth constraint lands softer than the first, and the first lands softer than it did before the twelfth arrived. You can keep adding rules, but past a certain length, you are no longer teaching — you are wishing.",[15,778,780],{"id":779},"what-a-pipeline-gives-you","What a pipeline gives you",[11,782,783],{},"My current setup is five separate prompts. Research. Strategy. Writing. Audit. Distribution. Each one has its own inputs, its own output, and its own narrow job. Research runs first and feeds verified facts forward so the later phases never have to invent what they should already know. The strategy prompt never sees the body text; the writing prompt never sees the distribution plan. This is not an aesthetic choice. It is a reliability choice.",[11,785,786],{},"A pipeline works because the context of each phase only contains what that phase needs. Strategy thinks about the reader and the angle. Writing thinks about the outline and the voice. Audit thinks about weak sentences and unverified claims. When a phase fails, it fails on its own terms, in a bounded way, and I can see exactly what went wrong — because the only thing in the room is the phase that broke.",[15,788,790],{"id":789},"the-checkpoint-is-the-artifact","The checkpoint is the artifact",[11,792,793],{},"Between each phase I stop and read what came out. I approve it, rerun it, or rewrite it by hand. Nothing advances until I say so. This is the part most agent frameworks want to remove. Auto-chain the phases, they say. Let the model decide when it is done. I have tried it. It produces output faster and worse.",[11,795,796],{},"Errors in phase N compound into phase N+1 because each phase trusts its inputs. If the strategy phase picks a weak angle and nothing stops it, the writing phase faithfully drafts a thousand words in service of a bad idea. The audit that follows then reads those thousand words against criteria that assume the angle was chosen well, and misses the root problem entirely. 
The checkpoint is not a courtesy. It is the only thing between a fixable drift and a finished piece that has to be thrown away.",[11,798,799],{},"The prompts in my pipeline are valuable. The checkpoints between them are what make the prompts valuable. If someone copied the five prompt files without the four stops between them, they would have something that looks like my system and behaves like a worse one.",[15,801,803],{"id":802},"constraints-that-travel","Constraints that travel",[11,805,806,807,810],{},"Here is something I did not expect. The rule ",[117,808,809],{},"Do not fabricate data"," sits at the top of the writing prompt, in its own section, at the start of a short file. It used to sit in my mega-prompt, also at the top, also at the start. Same words. Different behavior. In the pipeline, the model does not fabricate. In the mega-prompt, it did.",[11,812,813],{},"I think this is because constraints decay with distance. A rule stated once, at the start of a long document, competes with everything that follows it for the model's attention. The further the model reads, the more the document as a whole becomes the signal, and the opening constraint becomes one of many voices in a noisy room. In a short, focused prompt there is no noisy room. The constraint is still there when the model stops reading, because the model never left its neighborhood.",[11,815,816],{},"This is the part I did not know until I had run it both ways. You can write the same instruction in the same words, and have it enforced in one version and ignored in another, purely because of where it sits. Short prompts are not just easier to write. They are the substrate that makes constraints hold.",[15,818,820],{"id":819},"what-this-changes-on-monday","What this changes on Monday",[11,822,823],{},"If you have been rewriting the same prompt for the fourth time this week, stop. You have probably confused two problems. 
One is that the prompt is unclear; the other is that the prompt is trying to do too much. The fix for the first is more words. The fix for the second is fewer words, and more prompts — because a single prompt cannot hold two responsibilities without leaking one into the other. Try splitting it at the first natural seam, often between deciding and doing, and see whether each half behaves better on its own. My guess is that it will.",[11,825,826],{},"The right question to ask about a prompt is not how good it is. It is what it is responsible for, and what happens between it and the next one. Prompts compose. They do not concatenate.",[828,829],"hr",{},[11,831,832],{},[117,833,834,835,841,842,847,848,154],{},"Inspired by the disciplined approach to prompt systems? See how ",[55,836,840],{"href":837,"rel":838},"https:\u002F\u002Fdunking-devils.com\u002F",[839],"nofollow","Dunking Devils"," applies structure and craft to high-performance systems. Explore structured approaches from ",[55,843,846],{"href":844,"rel":845},"https:\u002F\u002Fopenai.com\u002Fsl-SI\u002F",[839],"OpenAI"," and build orchestrated AI systems with ",[55,849,852],{"href":850,"rel":851},"https:\u002F\u002Fmastra.ai\u002F",[839],"Mastra",{"title":85,"searchDepth":86,"depth":86,"links":854},[855,856,857,858,859],{"id":769,"depth":86,"text":770},{"id":779,"depth":86,"text":780},{"id":789,"depth":86,"text":790},{"id":802,"depth":86,"text":803},{"id":819,"depth":86,"text":820},"2026-03-01","Prompt engineering is not the art of writing one better prompt — it is the discipline of breaking the work into phases that can fail 
independently.",{},{"title":760,"description":861},"blog\u002Fprompts-as-pipelines",[103,104],"pX0hYQ7_rlsuEp0TvOQKZSJoe4fY_gjtRPkiylHj92s",{"id":868,"title":869,"article":870,"body":871,"date":944,"description":945,"extension":94,"meta":946,"navigation":96,"path":947,"promptVersion":98,"readingTime":618,"seo":948,"stem":949,"tags":950,"__hash__":952},"content\u002Fblog\u002Freturn-value.md","Return value",0,{"type":8,"value":872,"toc":938},[873,876,880,893,896,900,903,906,909,913,916,919,922,926,932,935],[11,874,875],{},"Most blog posts get published because someone wrote them. That is not the same as earning a place in someone's attention, and most never do — they are read once, if that, and forgotten. Every post on this blog is held to the second standard, not the first: would the reader keep it?",[15,877,879],{"id":878},"a-short-bio","A short bio",[11,881,882,883,887,888,892],{},"Seventeen years of building software, currently full-stack and AI-powered tools at ",[55,884,886],{"href":885},"\u002Fcv#progmbh","PROGMBH d.o.o."," — the rest of the bio is on ",[55,889,891],{"href":890},"\u002Fcv","my CV"," for anyone who needs it.",[11,894,895],{},"I bring it up only to flag a parallel. Most software ships because the work happened, not because the result earned its place in the system it joins. Most blog posts ship for the same reason. The standard worth keeping is the same in both domains: a thing that exists is not the same as a thing worth keeping. This blog applies the second test to writing.",[15,897,899],{"id":898},"what-most-posts-miss","What most posts miss",[11,901,902],{},"Most blog posts fail return-value because the test that gates their publication is writer-side, not reader-side. Did I write something today? Did I hit the cadence? Did I cover the keyword? Each of those is a question only the author can answer, and a yes from the author lets the post ship. None of them ask anything of the reader.",[11,904,905],{},"The genres prove it. 
SEO listicles exist to rank, not to inform — the reader is incidental, a vehicle for impressions. Weekly-cadence posts get written to the calendar; the topic is whatever was due. Consensus restatements take an idea everyone already nodded at and put cleaner sentences around it. The reader closes the tab, learns nothing, and the blog still counts the page view.",[11,907,908],{},"The default state of a published post is that no one needed it. That is the bar most posts clear, because most posts only have to clear the bar the writer set.",[15,910,912],{"id":911},"the-return-value-test","The return-value test",[11,914,915],{},"A post passes the return-value test because the reader takes away more than they spent. The currency on the reader's side is attention; the currency on the writer's side is words. The trade is asymmetric — the writer pays once, the reader pays each time — so the burden of proof sits on the post, not on the reader's patience.",[11,917,918],{},"What does return value look like, concretely? A reframe instead of an exhortation. A named pattern the reader can carry into a meeting next week. A mechanism that explains something the reader had felt but not articulated. One applicable claim per post, promoted to its sharpest sentence and given a handle. Bookmarks, returns, and shares are evidence that the trade landed — they are not the goal of writing, they are the receipt.",[11,920,921],{},"If a post does not name something the reader did not already have words for, it did not pay back. That is the bar the rest of this blog wants to be measured against.",[15,923,925],{"id":924},"the-experiment","The experiment",[11,927,928,929,931],{},"I can publish LLM-assisted writing under this bar because the unit of work is not the prompt. It is the pipeline. Constraints survive across phases — grounding, plan, draft, audit — that a single mega-prompt would dilute long before the closing paragraph. 
The fuller argument is at ",[55,930,760],{"href":57},"; this section only names the pipeline as the reason the return-value bar is enforceable rather than aspirational.",[11,933,934],{},"Right now, every post on this blog is manually checked and corrected before it ships. The pipeline does most of the work; my hand is on the brakes for the rest. That is the honest state of the experiment as I write this. The day a post clears the bar without my correction is the day the experiment worked.",[11,936,937],{},"Until then, every post here is held to the same question, by hand if it has to be: would you keep this? If the answer is no, the post does not ship. That is the only test I want this blog to be measured by.",{"title":85,"searchDepth":86,"depth":86,"links":939},[940,941,942,943],{"id":878,"depth":86,"text":879},{"id":898,"depth":86,"text":899},{"id":911,"depth":86,"text":912},{"id":924,"depth":86,"text":925},"2026-01-05","Most blog posts get published because the writer wrote them. This blog is built around a different test — would the reader keep it?",{},"\u002Fblog\u002Freturn-value",{"title":869,"description":945},"blog\u002Freturn-value",[951],"philosophy","dUxIEAULucAm7E8qBH3vZa1GSZcZ0N9MQUkLxnPIlbM",{"id":954,"title":955,"article":752,"body":956,"date":1075,"description":1076,"extension":94,"meta":1077,"navigation":96,"path":654,"promptVersion":6,"readingTime":99,"seo":1078,"stem":1079,"tags":1080,"__hash__":1081},"content\u002Fblog\u002Frule-drift.md","Rule drift",{"type":8,"value":957,"toc":1069},[958,971,975,987,990,993,997,1012,1015,1018,1022,1041,1044,1053,1057,1060,1063,1066],[11,959,960,961,964,965,970],{},"Addy Osmani has a clean line about agent harnesses. ",[117,962,963],{},"Every mistake becomes a rule."," The harness ratchets toward the behaviour you want, one failure at a time, and the rulebook only grows. 
The framing is right and his ",[55,966,969],{"href":967,"rel":968},"https:\u002F\u002Fwww.oreilly.com\u002Fradar\u002Fagent-harness-engineering\u002F",[839],"post"," is worth reading. It is also half the move. The other half — the one that keeps the rulebook from collapsing under its own weight — is consolidation, and most write-ups about harness engineering skip it.",[15,972,974],{"id":973},"what-the-ratchet-gives-you","What the ratchet gives you",[11,976,977,978,982,983,986],{},"Take a code-review harness with a smells list — the patterns the reviewer agent should flag on any PR. Say it has nine entries. Bare ",[979,980,981],"code",{},"except:"," is on it because one past review approved a bare except and the bug it hid surfaced in prod two weeks later. ",[979,984,985],{},"SELECT *"," is on it because a query that looked harmless against a twenty-row table table-scanned a million-row one in staging. Each entry came from a specific review that should not have shipped, and the post-mortem on that review is the reason the rule exists.",[11,988,989],{},"A rule earned through a real failure cannot be argued out of, because the cost of the failure is the only argument the rule ever needs to make. The list does not grow by taste. It grows by evidence. The bar for adding a pattern is one review the harness produced that should not have shipped — anything weaker is opinion, and opinion is the wrong currency for a rulebook the system reads on every run.",[11,991,992],{},"That is what Osmani is naming when he calls it a ratchet. The constraint only moves one direction. Each new entry is a small step the harness will never have to take again, because the rule that prevents it is in the rulebook now, and every future review is written with it in scope.",[15,994,996],{"id":995},"where-the-ratchet-drifts","Where the ratchet drifts",[11,998,999,1000,1003,1004,1007,1008,1011],{},"Imagine the same harness has two smells lists. 
One lives in ",[979,1001,1002],{},"reviewer.md",", under the Rules section that the reviewer step reads while drafting comments on a PR. The other lives in ",[979,1005,1006],{},"final-pass.md",", as the sweep the audit step runs against the finished review before it posts. The lists are almost identical. They are edited as separate documents. By the time anyone notices, one has picked up ",[117,1009,1010],{},"TODO without a ticket reference"," and the other has not.",[11,1013,1014],{},"A rule that lives in two harness files drifts because each file is edited independently and no compiler checks prose for consistency. When a new smell shows up in a missed review, the next edit lands in whichever file is open. The other file goes one more cycle without the rule. The reviewer step now knows to flag a pattern the audit step will not catch. The harness is silently inconsistent with itself, and the silence is the part that matters — there is no error, no log line, no exception. Just two pieces of prose that have started disagreeing.",[11,1016,1017],{},"Markdown has no type checker. Nothing in the file system tells you that the smells list in one file is a superset of the smells list in another, or that the blocker-class list in one file has acquired an entry the other has not. The contract between the two files exists only in the operator's memory of having written them, and memory is the wrong layer for a contract to live in.",[15,1019,1021],{"id":1020},"the-second-move","The second move",[11,1023,1024,1025,1028,1029,1031,1032,1034,1035,1037,1038,1040],{},"The fix is a structural refactor of the harness, not an additive one. Extract every review rule — tone, smells to flag, blocker-class issues, what the reviewer skips, sign-off — into a new file called ",[979,1026,1027],{},"review-rules.md",". The Rules section disappears from ",[979,1030,1002],{},". The smells and blocker-class sweeps disappear from ",[979,1033,1006],{},". 
Each is replaced with a single line pointing at ",[979,1036,1027],{},". A grep for ",[979,1039,985],{}," across the harness now returns one file.",[11,1042,1043],{},"Rules that live in one canonical file cannot drift, because there is only one place to edit and every consumer reads the same version. Consolidation removes the surface area drift needs to occur on. The next time a review surfaces a new smell, there is exactly one place to add it. The reviewer step picks up the new line on its next run. So does the audit step. The lists cannot disagree, because there is only one list.",[11,1045,1046,1047,1049,1050,1052],{},"The cost is one indirection on the reading side — ",[979,1048,1002],{}," and ",[979,1051,1006],{}," now follow a pointer instead of inlining the rule. The benefit is that the rule has one home, and the next edit cannot accidentally fork it into two slightly different rules. That is the move the ratchet framing leaves out. Accrete, then put the accretion somewhere it cannot fork.",[15,1054,1056],{"id":1055},"what-this-means-for-the-rulebook-you-already-have","What this means for the rulebook you already have",[11,1058,1059],{},"Osmani cites HumanLayer's discipline of keeping AGENTS.md to about sixty lines. The reasoning is that long rulebooks dilute the individual rules — the model treats line 41 with less weight than line 4, and an over-long list trains the system to skim past most of it. The advice is good. It is also the accretion half.",[11,1061,1062],{},"The length signal is real, but length is not the underlying problem — distribution is. A 60-line AGENTS.md that is also repeating four of its rules inside a hook script, three more inside a subagent system prompt, and one more inside a tool description is already drifting, because each copy was edited on a different day and each is one paragraph off from the others by now. 
The line count looks healthy and the rulebook is still incoherent.",[11,1064,1065],{},"The consolidation move is a single grep. For each rule in AGENTS.md, search the rest of the harness — hooks, subagent prompts, tool descriptions, audit checklists, any markdown file the pipeline reads — and look for the same idea phrased differently. If the rule already exists somewhere else, neither location is canonical, and the next edit will pick one of them by accident. Pick a file, leave the rule there, and replace every other copy with a pointer. The cost is one minute of grep per rule. The benefit is that the next edit cannot fork the rulebook.",[11,1067,1068],{},"Osmani's ratchet is right; every mistake should become a rule. The post just stops one beat early. A harness that only accretes is a harness that drifts, and the drift is invisible until two files disagree about the same word. The full move is two beats — accrete, then canonicalise. The first beat is where the rules come from. The second is what keeps them meaning the same thing.",{"title":85,"searchDepth":86,"depth":86,"links":1070},[1071,1072,1073,1074],{"id":973,"depth":86,"text":974},{"id":995,"depth":86,"text":996},{"id":1020,"depth":86,"text":1021},{"id":1055,"depth":86,"text":1056},"2026-05-25","Osmani's ratchet — every mistake becomes a rule — is the accretion move. 
The second move is consolidation: rules that live in one canonical file cannot drift, and most harness write-ups skip it.",{},{"title":955,"description":1076},"blog\u002Frule-drift",[103,104],"NaWKKk_0ShUeTDRRQclzzMm8VpepdJNCWMDZ9mWbmtc",{"id":1083,"title":1084,"article":1085,"body":1086,"date":1170,"description":1171,"extension":94,"meta":1172,"navigation":96,"path":1173,"promptVersion":1085,"readingTime":618,"seo":1174,"stem":1175,"tags":1176,"__hash__":1177},"content\u002Fblog\u002Fthe-catch.md","LLM catch",5,{"type":8,"value":1087,"toc":1164},[1088,1091,1094,1097,1101,1104,1107,1110,1113,1117,1120,1123,1126,1129,1133,1136,1139,1142,1145,1149,1152,1155,1158,1161],[11,1089,1090],{},"You know the feeling. You have an idea. You open ChatGPT. You type: \"Design me a complete architecture for this feature. Include database schema, API endpoints, error handling, deployment strategy.\"",[11,1092,1093],{},"Ten seconds later you have twelve pages of perfect text. It looks right. It reads like something a senior engineer would write. It has tradeoffs mentioned. It has numbered lists and bullet points and everything you asked for.",[11,1095,1096],{},"You save it to a file. You close the tab. You never look at it again.",[15,1098,1100],{"id":1099},"the-production-gap","The production gap",[11,1102,1103],{},"This is the quiet failure of LLM-aided work that nobody talks about. We have become incredibly good at generating specifications, plans, architectures, roadmaps, designs, and proposals. We have become no better at actually building any of them.",[11,1105,1106],{},"The ratio is something like 100:1. For every one thing that gets built, there are a hundred complete, perfectly reasonable plans sitting in markdown files, chat histories, and Notion pages, never to be opened again.",[11,1108,1109],{},"The LLM doesn't care. It will happily write you the full specification for a distributed message broker in the time it takes you to blink. It will explain all the edge cases. 
It will argue with you about consistency models. It will do everything except type the first line of actual code.",[11,1111,1112],{},"Nobody reads these documents. Not really. We scan the first page. We nod. We think \"yes that makes sense\". And then we move on, because the actual work of building was never the part we were stuck on in the first place.",[15,1114,1116],{"id":1115},"planning-as-procrastination","Planning as procrastination",[11,1118,1119],{},"The hardest part of building something was never figuring out what to build. It was building it.",[11,1121,1122],{},"Before LLMs, you had to think through the plan. You had to write it down. You had to argue about it. That process forced you to confront the hard parts early. Now you can skip all that. You can have a complete, internally consistent plan in less time than it would take you to explain the problem to a colleague.",[11,1124,1125],{},"But the plan doesn't remove the work. It just hides it. All the messy, boring, tedious parts that make something actually work are still there, waiting for you after the impressive document ends.",[11,1127,1128],{},"This happens because the plan looks so complete, so thorough, so done, that you get the psychological feeling of having accomplished something without having done anything at all. It is productive procrastination at industrial scale.",[15,1130,1132],{"id":1131},"the-test","The test",[11,1134,1135],{},"Here is a simple test for any LLM output: will this document make me type code tomorrow?",[11,1137,1138],{},"If the answer is no, it doesn't matter how good it is. It doesn't matter how clever the architecture is. It doesn't matter how well it explains the tradeoffs. It is dead weight. It is a simulation of work, not work itself.",[11,1140,1141],{},"The best LLM outputs are not the longest ones. They are not the most detailed ones. 
They are the ones that end with \"and then you write these seven lines of code, and that's the whole thing\".",[11,1143,1144],{},"Everything else is just reading material.",[15,1146,1148],{"id":1147},"what-gets-built","What gets built",[11,1150,1151],{},"The things that actually get built are almost never the ones with the perfect twelve-page specification. They are the ones where someone got frustrated, stopped planning, and just typed the first ten lines of code.",[11,1153,1154],{},"They are messy. They have missing features. They cut corners. They don't handle all the edge cases. But they exist. They run. They do something.",[11,1156,1157],{},"The difference between a plan and a product is not quality of thinking. It is tolerance for imperfection. It is willingness to start before you have all the answers. It is accepting that the first version will be bad, and building it anyway.",[11,1159,1160],{},"We are living through the greatest supply of plans, specifications, and designs the world has ever seen. And we are living through the greatest shortage of things that actually work.",[11,1162,1163],{},"Anyone can ask for a plan. The hard part is stopping at the point where you have just enough information to start, and then closing the tab.",{"title":85,"searchDepth":86,"depth":86,"links":1165},[1166,1167,1168,1169],{"id":1099,"depth":86,"text":1100},{"id":1115,"depth":86,"text":1116},{"id":1131,"depth":86,"text":1132},{"id":1147,"depth":86,"text":1148},"2026-04-21","LLMs write perfect plans, detailed specifications, and complete architectures. 
Nobody ever builds any of it.",{},"\u002Fblog\u002Fthe-catch",{"title":1084,"description":1171},"blog\u002Fthe-catch",[103,104],"V636evV6OWoiB2z9PSj4pXcJ4X3iB4SIcckqL0OeY0A",{"id":4,"title":5,"article":6,"body":1179,"date":92,"description":93,"extension":94,"meta":1229,"navigation":96,"path":97,"promptVersion":98,"readingTime":99,"seo":1230,"stem":101,"tags":1231,"__hash__":105},{"type":8,"value":1180,"toc":1223},[1181,1183,1185,1187,1189,1191,1193,1195,1197,1199,1201,1203,1205,1209,1211,1213,1215,1217,1221],[11,1182,13],{},[15,1184,18],{"id":17},[11,1186,21],{},[11,1188,24],{},[11,1190,27],{},[11,1192,30],{},[15,1194,34],{"id":33},[11,1196,37],{},[11,1198,40],{},[11,1200,43],{},[15,1202,47],{"id":46},[11,1204,50],{},[11,1206,53,1207,59],{},[55,1208,58],{"href":57},[11,1210,62],{},[15,1212,66],{"id":65},[11,1214,69],{},[11,1216,72],{},[11,1218,75,1219,80],{},[55,1220,79],{"href":78},[11,1222,83],{},{"title":85,"searchDepth":86,"depth":86,"links":1224},[1225,1226,1227,1228],{"id":17,"depth":86,"text":18},{"id":33,"depth":86,"text":34},{"id":46,"depth":86,"text":47},{"id":65,"depth":86,"text":66},{},{"title":5,"description":93},[103,104],1780548380675]