A Sponge Left Behind
Imagine a patient is wheeled out of a six-hour abdominal surgery. The surgical team is exhausted. The final instrument count was called "correct." The incision is closed. Everyone moves on.
Three weeks later, the patient returns with a fever that won't break, abdominal pain that keeps escalating, and imaging that reveals the cause: a surgical sponge — roughly the size of a fist — left inside their body.
This is not a hypothetical. In the United States, retained surgical items occur at a reported rate of roughly 1 in every 5,500 operations — an estimated 1,500 cases every year. In over 80% of these cases, the final count was reported as "correct." The humans counting believed they had accounted for everything. They hadn't. These are classified as "never events" — errors so serious they should never occur. Some result in sepsis, reoperation, or death.
The statistics can be measured; the human cost is incalculable. And that is exactly why I do what I do.
I build AI systems for surgical safety where the cost of a wrong answer isn't a bug report — it's a patient outcome. That single fact shapes everything I think about artificial intelligence, software engineering, and what it means to build systems that matter.
Every system I design has a line: the point where a probabilistic prediction becomes a consequential decision. On one side of that line, AI does what AI does well: it perceives, classifies, estimates confidence. On the other side, something must decide whether to act — to alert a surgical team, to halt a procedure, to let a patient go home. That line is the most important thing I build. And it is the one line AI cannot be allowed to draw, because it cannot be accountable for the consequences.
So while the technology industry debates whether AI will replace software engineers, I'm asking a different question: what happens when AI is wrong, and who bears the consequences? That is the question the industry should be asking. Instead, it's consumed by a different one.
The Question Everyone Is Asking
Some of the strongest engineers I know are scared right now. Not publicly — but in private conversations, the anxiety is unmistakable. These aren't fresh graduates. These are senior and staff engineers who've architected systems serving millions, watching AI coding agents produce functional software in seconds and asking: Is what I do about to become worthless?
The most powerful voices in technology have been fuelling this fear. "There will be no code writing in the future." "The new programming language is English." One major enterprise CEO publicly questioned whether his company needed to hire engineers at all.
These declarations contain a grain of truth. AI is getting remarkably good at writing code. But declaring that this makes engineers obsolete fundamentally misunderstands what the craft is about. To understand why, we need a shared vocabulary.
A Shared Language
Artificial Intelligence (AI) is a broad term for computer systems that perform tasks typically requiring human intelligence — recognising images, understanding language, making predictions. Machine Learning (ML) is a subset where systems learn patterns from data rather than being explicitly programmed. Instead of writing "if the instrument is longer than 20cm, classify it as a retractor," an ML system is shown thousands of examples and learns to recognise retractors on its own. Deep Learning takes this further, using artificial neural networks with many layers that excel at finding complex patterns in images, text, and audio.
Large Language Models (LLMs) are deep learning systems trained on vast text data that predict what comes next in a sequence of words — enabling them to generate text, write code, and carry on conversations. GPT, Claude, Gemini, and Llama are all LLMs, and they power AI coding assistants. Computer Vision (CV) is the field enabling machines to interpret visual information — what allows a system to look at a surgical tray and identify which instruments are present. This is the domain I work in.
AI Coding Agents are the most advanced AI coding tools — autonomous systems that take a high-level objective and independently plan and execute the steps needed to build software. Early assistants suggested the next line of code. Current agents can analyse entire codebases, fix bugs, and build complete features from a plain language description.
And here is the distinction that matters most:
Deterministic systems always produce the same output for the same input — a calculator, a database query, an instrument count algorithm. Probabilistic systems produce outputs based on statistical likelihood — an instrument recognition model saying "93% confident this is a haemostat." That's fundamentally different from a deterministic system saying "instrument count: 47."
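The contrast fits in a few lines of Python. This is an illustrative sketch, not a real system; both functions and their outputs are invented for the example:

```python
def deterministic_count(tray: list[str]) -> int:
    """Same tray in, same number out -- every single time."""
    return len(tray)

def probabilistic_classify(image) -> tuple[str, float]:
    """A model's best guess plus a confidence -- never a guarantee.

    Stand-in for a trained model; a real classifier could return a
    different answer for two near-identical images.
    """
    return ("haemostat", 0.93)

tray = ["haemostat", "retractor", "forceps"]
print(deterministic_count(tray))  # always prints 3

label, confidence = probabilistic_classify(None)
print(f"{confidence:.0%} confident this is a {label}")  # prints: 93% confident this is a haemostat
```

The first function can be audited by reading it. The second can only be characterised statistically, which is exactly why it cannot be the final authority.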
A prediction is a model's best guess. A decision is a commitment the world will feel. The job of engineering is deciding where guessing ends.
I call that boundary the line. It matters for one reason: accountability — the ability to assign responsibility for a decision to a person or organisation that can be audited, regulated, and held liable. Any AI system that touches the real world has one. This entire article is about the line: who draws it, what it means, and why the engineers who understand it are the ones who will matter most.
The State of Play: February 2026
As of early 2026, approximately 92% of developers use AI tools in their workflow. Some estimates put AI-authored production code around 27%. On benchmarks, AI agents have jumped from solving 33% of real-world coding issues (August 2024) to consistently above 70%. Developers report saving 30-60% of their time on boilerplate, tests, and documentation.
But here's what the headline numbers don't tell you. Developer trust in AI output is declining — only 29-46% trust AI-generated code. Two-thirds report quality issues. The most common frustration is AI's "almost correct" solutions that take longer to debug than writing from scratch. One rigorous METR study found AI assistance actually slowed experienced developers down in complex codebases.
And the number that reframes the entire debate: developers spend only 20-40% of their time writing code. The rest is analysing problems, communicating with stakeholders, and making design decisions. Even a dramatic speedup in code generation barely moves overall delivery — because delivery is constrained by decisions, not keystrokes.
Where AI Breaks: Lessons from the Operating Room
I run the technology at Scalpel, building computer vision systems for surgical instrument tracking and safety. Our systems use cameras and deep learning to identify instruments in real time — preventing exactly the scenario that opened this article.
This work has taught me something the AI discourse ignores: the difference between getting something right and getting something right enough.
In consumer software, 95% accuracy is fine — an irrelevant recommendation is a momentary annoyance. In surgical safety, 95% means 5 failures per 100 procedures. That's a patient safety crisis. And accuracy alone means nothing without understanding the asymmetry of errors. A false positive — incorrectly alerting that an instrument is missing — costs minutes. A false negative — failing to detect a genuinely missing instrument — costs a patient's health, potentially their life.
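The asymmetry can be made concrete with a back-of-the-envelope cost model. The numbers below are hypothetical, chosen only to show how weighting errors changes which operating point an engineer should pick:

```python
# Hypothetical cost units for illustration only.
COST_FALSE_POSITIVE = 1      # an unnecessary recount: minutes lost
COST_FALSE_NEGATIVE = 1000   # a missed retained item: patient harm

def expected_cost(false_pos_rate: float, false_neg_rate: float) -> float:
    """Expected cost per procedure under asymmetric error weights."""
    return (false_pos_rate * COST_FALSE_POSITIVE
            + false_neg_rate * COST_FALSE_NEGATIVE)

# Two candidate operating points for the same model.
strict = expected_cost(false_pos_rate=0.10, false_neg_rate=0.001)   # alert-happy
lenient = expected_cost(false_pos_rate=0.01, false_neg_rate=0.05)   # quieter

# The noisier threshold is far cheaper once error costs are weighted:
# 0.10 * 1 + 0.001 * 1000 = 1.1, versus 0.01 * 1 + 0.05 * 1000 = 50.01.
assert strict < lenient
```

A model tuned for raw accuracy would prefer the quieter threshold; a model tuned for patient safety must not. The weights themselves are the engineering judgment.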
No AI model encodes this asymmetry on its own, and it cannot be held responsible for getting it wrong. An AI can be told to optimise for recall over precision, but the decision to do so — and the understanding of why — is a human engineering judgment. This is why our architecture follows a principle I keep returning to:
LLMs should speak, classical systems should decide.
AI handles perception — identifying objects, estimating confidence. But the final decision — alert, escalate, halt — is made by deterministic logic designed by engineers who understand the clinical context and the consequences of error. Probabilistic systems inform. Deterministic systems act.
In our architecture, the line sits at a specific confidence threshold: below it, the system flags uncertainty and defers to human verification. Above it, deterministic logic acts. In practice, the line is a gate: a threshold, a rule, and an escalation path. That gate isn't computed — it's chosen. Engineers implement the line, but domain experts, regulators, and organisations co-author where it sits — because accountability for where it falls is shared. The model doesn't know any of this. The threshold encodes human judgment about what is acceptable. It is the line between a prediction and a patient.
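A minimal sketch of such a gate, assuming a single threshold and three actions. The threshold value and action names here are illustrative, not our production values; the point is that the decision logic is deterministic and readable:

```python
CONFIDENCE_THRESHOLD = 0.98  # chosen by clinicians and engineers, not computed

def gate(all_instruments_accounted_for: bool, confidence: float) -> str:
    """Turn a probabilistic perception into a deterministic decision."""
    if confidence < CONFIDENCE_THRESHOLD:
        # Uncertainty escalates to a human -- the system never guesses alone.
        return "defer: request human verification"
    if not all_instruments_accounted_for:
        # High-confidence detection of a missing instrument halts the workflow.
        return "alert: possible retained item"
    return "proceed"

# Identical inputs always yield identical decisions -- auditable by design.
assert gate(True, 0.99) == "proceed"
assert gate(True, 0.90) == "defer: request human verification"
assert gate(False, 0.99) == "alert: possible retained item"
```

Everything probabilistic happens before this function is called; everything after it is a commitment. The constant at the top is the line.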
That principle extends far beyond surgical safety — it applies everywhere consequences are real, which is far more systems than most engineers realise.
In our domain, this means confronting sensor drift, where cameras degrade subtly over time. Lighting variation, where surgical lights create shadows that confuse models. Occlusion, where hands and drapes temporarily hide instruments. Instrument similarity, where forceps vary by millimetres. No AI coding agent has lived through any of this. An engineer who has debugged a false negative caused by a light shifting two degrees — that engineer reasons about it instinctively. That knowledge is tacit, built from experience, and much of what matters still isn't recorded in a form that's usable for training. When it is recorded, it's often missing labels, context, or the why behind the human decision.
So if the real challenges were never about code, and AI can only automate the code, what does the future actually look like?
When Implementation Becomes Free, Judgment Becomes Everything
When an AI agent generates a thousand lines of code in seconds, the scarce resource isn't prompt-writing skill. It's judgment — specifically, the judgment to draw the line between what AI should decide and what humans must. The question shifts from "can we build this?" to "should we build this, and where does the line sit?"
Every line of code is a liability — it must be maintained, secured, tested, and eventually deprecated. AI makes it trivially easy to generate liabilities at scale. The greatest underappreciated risk of the AI coding revolution isn't that AI will write bad code. It's that it will write mediocre code at unprecedented volume, and organisations without sufficient engineering judgment will mistake speed for progress.
That risk makes the following shifts urgent. Each one describes a different aspect of what it means to be the person who draws the line — and who is accountable for where it falls.
Five Shifts Defining the Next Era
1. Implementation is cheap. Choosing what to build is expensive. AI has collapsed the cost of implementation. The person who can decide what should be built — understanding behaviour, context, regulation, and second-order effects — is now exponentially more valuable than the person who can turn a spec into working code.
2. Code fluency is table stakes. Failure fluency is leverage. The engineers who matter most can tell you exactly how a system will break. In surgical AI, that means knowing a model trained under fluorescent lighting will behave differently under LED surgical lights, or that an instrument recognition system tested on clean trays will degrade when instruments overlap. This knowledge is built from experience, not data — and it becomes more valuable as AI generates more code.
3. Output is abundant. Taste is scarce. When code is cheap, the bottleneck is architecture: how systems compose, where boundaries sit, what breaks when components interact. Taste is the ability to sense that an AI-generated solution is wrong before you can articulate why — that a microservice boundary will create a distributed transaction nightmare, or that a data model will make the next feature request impossible.
4. Writing is automated. Reviewing is the work. Most engineering work is maintaining and evolving existing systems, not building greenfield. As AI generates more code, critically assessing its output — spotting subtle incorrectness, security vulnerabilities, architectural misalignment — becomes a primary skill. Studies consistently show that a large fraction of AI-generated code suggestions are rejected by developers — human judgment filtering output that is syntactically correct but semantically wrong. This is the autopilot transition: the job shifts from "fly the plane" to "monitor the systems and take over when automation fails."
5. Engineering is no longer translation from spec to code. It's translation between worlds. Engineers who spend less time implementing and more time deciding what to build are occupying product territory. The most valuable will translate between domains — explaining to a clinician why "just alert me when something is missing" requires rethinking the alert hierarchy, or helping a regulator understand why "99% accurate" might be unacceptable for clinical deployment. In safety-critical AI, this translation isn't optional. It's the entire job.
These five shifts point to a single uncomfortable truth.
The Uncomfortable Truth
If your entire professional identity is "I write clean Python" — you should be worried. AI already does that, and it will only get better. But if you define systems under ambiguity, make trade-offs, anticipate failure modes, and explain to non-technical stakeholders why the "simple" solution is dangerous — you're about to become significantly more valuable.
This isn't without precedent. Every platform shift — mobile, cloud, deep learning — made certain skills less valuable and others more. In 1992, graphics engineers hand-coded polygon rendering; two years later that work was in hardware and the job became animation and lighting. Code is the new polygon rendering. But before I state what I've learned, I owe you an honest reckoning with the strongest objections to my own argument.
The Honest Objections
Is "AI can't handle meaning" structurally true, or just currently true? LLMs already encode vast semantic structure. Multimodal models reason across text, vision, and planning. Reinforcement learning optimises over asymmetric reward structures — the very thing I said AI can't handle. History is not on the side of "machines will never do X." Radiology, chess, Go, protein folding — each was once believed to require irreducibly human expertise.
I accept this. My argument is not that AI is incapable of processing meaning. It's that AI cannot bear accountability, because it is not a legal subject we can sanction or compel. People and institutions are. When my system misclassifies a retractor and a patient is harmed, no model faces a regulator, no algorithm is named in litigation, no system bears liability. You can encode risk thresholds, build audit trails, and optimise for asymmetric loss — but accountability is not a technical property. It is a legal and structural one. Accountability requires a responsible party that can meaningfully respond: disclose what happened, remediate the harm, compensate those affected, and change behaviour to prevent recurrence. Models don't do that. Institutions — led by humans who drew the line and signed off on where it fell — do. Even if AI can model the consequences of a decision, it does not own the liability for making it.
Aren't we underestimating what AI will eventually automate? Almost certainly — and this is the most technically serious objection to my thesis. Increasingly, everything is recorded: surgical video, system telemetry, operational logs. If surgical AI systems collect enough deployment data over enough years, models may learn real-world degradation patterns from massive multimodal datasets — at a scale beyond any individual engineer's experience. Failure fluency may become partially learnable. I'd rather be honest about this than defensive. The argument is not that humans will always be necessary for these judgments — it's that they are necessary now, and the decisions we make in this window will determine whether future AI systems are built on solid foundations or on a decade of unchecked mediocre code.
Is this really about engineers surviving — or about engineers being promoted? This cuts closest. My thesis shifts from "engineers won't be replaced" to "engineers must become decision-makers." That's not a survival argument. It's a promotion argument. And I should be explicit: not every engineer wants to become a systems thinker or stakeholder translator. Some love writing code — the craft, the elegance, the satisfaction of a clean implementation. Telling them "you'll just do higher-level work" is redefining their job out from under them. The honest version: the role is being redefined whether we like it or not. AI is absorbing the implementation layer. What remains is the judgment layer. Engineers who move into it will thrive. Engineers who cannot or choose not to will face genuine displacement — not because they lack talent, but because the specific form their talent takes is being automated.
Key Learnings
1. Code was always the medium, never the craft. Engineering value lies in understanding problems, making trade-offs, and designing for real-world conditions.
2. Accountability is structural, not computational. AI can process semantic structure and optimise over constraints. It cannot be held liable. Even if AI models consequences, it does not own them — and ownership is what gives engineering decisions their weight.
3. The asymmetry of errors is invisible to AI. In any system where different errors carry different consequences — which is nearly every system that matters — human judgment defines what "acceptable" means.
4. Probabilistic systems should inform, deterministic systems should decide. In safety-critical applications, the final authority must be predictable and auditable.
5. Implementation is becoming free; judgment is becoming priceless. When code is cheap, the ability to decide what should be built, how it should fail, and what risks are acceptable is the only differentiator.
6. Failure fluency is the new literacy. Knowing how systems break — and designing for graceful degradation — is built from experience, not data, and cannot be automated.
The Questions That Remain
If you're an engineer: what would you do differently tomorrow if writing code took zero effort? The answer reveals where your real value lies.
If you're a leader: when your team can build ten times faster, do you have ten times the judgment capacity to match?
If you're building AI systems: where in your architecture does a probabilistic prediction become a consequential decision? Who controls that boundary?
If you're in healthcare: we still rely on manual counting — humans tracking sponges under time pressure, with the count reported "correct" in over 80% of retained-item cases. Computer vision can transform surgical safety, but only if we build these systems with the same rigour we demand from the instruments themselves.
And here are the questions I cannot stop thinking about — the ones I believe define not just this moment, but whatever comes after it:
If AI can simulate judgment, who signs the consent form?
When responsibility becomes programmable, who is accountable?
And if machines begin to handle meaning — if they learn to perceive context, model consequences, and optimise for outcomes we care about — what remains that is distinctly, irreducibly human?
I don't have a final answer. But I know this: today, in the systems I build, the line between what AI decides and what humans decide is the line between a patient who goes home safely and one who doesn't.
And this isn't unique to healthcare. That same line exists in every system where a prediction becomes a consequence. It's the boundary between a credit risk model's score and whether a family gets a mortgage. It's the boundary between an autonomous vehicle's object detection and whether it brakes or accelerates. It's the boundary between an infrastructure control system's anomaly forecast and whether a city keeps its power. In every domain where AI touches real lives, someone must draw that line — and someone must be accountable for where it falls.
That line is not drawn by algorithms. It is drawn by engineers who understand what's at stake.
Where we draw it will decide whether this era of AI becomes progress, or a machine that scales harm faster than we can stop it.
Further Reading and References
On AI in Software Engineering
- METR Study (2025): "Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity"
- MIT Technology Review (2025): "AI Coding Is Now Everywhere. But Not Everyone Is Convinced"
- Stack Overflow 2025 Developer Survey: 49,000+ developers; 84% AI tool adoption, declining trust
- Addyo (2025): "The Reality of AI-Assisted Software Engineering Productivity"
On Retained Surgical Items and Patient Safety
- Retained Surgical Item Incidence in the United States (2025): 198M surgeries (2016-2023), RSI incidence 1.34 per 10,000
- AHRQ PSNet (2025): Retained Surgical Items: Causation and Prevention
- Rigamonti et al. (2025): "Retained Foreign Object Signals a Dangerous Atmosphere in the Operating Room"
- Weprin et al. (2021): "Risk Factors and Preventive Strategies for Unintentionally Retained Surgical Sharps"
On Computer Vision in Surgery
- Paracchini et al. (2025): "AI Models for Surgical Phase, Instruments and Anatomical Structure Identification"
- Yangi et al. (2025): "AI Integration in Surgery Through Hand and Instrument Tracking"
- Zachem et al. (2024): "Computer Vision for Identification of Instruments in the Neurosurgical Operating Room"
On the Future of Engineering
- Ahmad Al-Dahle (2026): "The Future of Software Engineering Isn't What You Think"
- IEEE Spectrum (2026): "Was 2025 Really the Year of AI Agents?"
- ICSE 2025 Panel: "The Future of Software Engineering Beyond the Hype of AI"
- ThoughtWorks (2025): "The Future of Software Engineering: Retreat Findings"