Connected Surgical Intelligence

The Architecture of Trust

Why Hybrid Intelligence Will Define Clinical AI


In healthcare, LLMs should speak. Classical systems should decide.

That single distinction will define whether AI transforms medicine—or becomes its next malpractice crisis.

The Promise and the Peril

We've made remarkable progress with Large Language Models. They can reason over messy clinical notes, generate coherent summaries, and surface relevant context buried across years of patient history. LLMs are the best language interface we've ever built.

But let's be clear about what they are: probabilistic sequence models. Not clinical decision engines.

They optimise for plausibility, not correctness. They produce likely answers, not validated ones. They hallucinate with confidence and present uncertainty as fact.

In most domains, that's a minor inconvenience. In healthcare, it's existential.

The Evidence Is In

A 2025 study in npj Digital Medicine analysed 12,999 LLM-generated sentences from clinical documentation, each annotated by clinicians. The results: a 1.47% hallucination rate and a 3.45% omission rate. That might sound acceptable—until you scale it across millions of clinical interactions.

More concerning: research published in Communications Medicine tested six leading LLMs with 300 physician-designed clinical vignettes, each containing a single fabricated detail—a fake lab value, a non-existent sign, an invented condition. The models repeated or elaborated on the planted error in 50% to 82% of cases. Even with mitigation prompts designed to reduce hallucinations, the best-performing model still hallucinated 23% of the time.

These aren't edge cases. These are systematic vulnerabilities in systems being marketed for clinical decision support.

The Questions That Matter

You do not want a stochastic model improvising when the question is:

  • Is this drug contraindicated with their current medications?
  • What's the correct paediatric dose for this weight?
  • Does this symptom cluster require immediate escalation?

These aren't language problems. They're constraint satisfaction problems. They demand traceability, auditability, hard boundaries, and worst-case safety guarantees.

Classical Systems Aren't Legacy—They're Essential

Consider drug-drug interaction checking. Rule-based clinical decision support systems have been deployed for decades, encoding pharmacological knowledge into deterministic alerts. They're transparent: you can trace exactly why an alert fired. They're predictable: the same inputs produce the same outputs. They're auditable: every decision has a documented reasoning chain.
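The properties described above—transparency, determinism, auditability—can be illustrated with a minimal sketch of a rule-based interaction checker. The interaction table here is a two-entry illustration, not a clinical reference, and the function names are my own:

```python
# Minimal sketch of a deterministic drug-drug interaction checker.
# The INTERACTIONS table is illustrative only, not pharmacological advice.
from itertools import combinations

INTERACTIONS = {
    frozenset({"warfarin", "aspirin"}): "increased bleeding risk",
    frozenset({"sildenafil", "nitroglycerin"}): "severe hypotension",
}

def check_interactions(medications):
    """Return every alert that fires, each with its documented reason."""
    alerts = []
    for pair in combinations(sorted(m.lower() for m in medications), 2):
        reason = INTERACTIONS.get(frozenset(pair))
        if reason:
            # The alert is traceable: it points back to one table entry.
            alerts.append({"pair": pair, "reason": reason})
    return alerts
```

The same inputs always produce the same alerts, and every alert traces to exactly one table entry—this is the audit trail that a probabilistic model cannot offer.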

Yes, they suffer from alert fatigue—override rates as high as 96% in some implementations. But that's a tuning problem, not a fundamental architecture flaw. The issue isn't that rule-based systems fire alerts; it's that they fire too many low-priority alerts. That's solvable with better prioritisation, patient-specific context, and smarter filtering.

What's not solvable is an LLM that confidently recommends a drug combination that's contraindicated, with no audit trail explaining why.

Paediatric dosing exemplifies the power of deterministic systems. Weight-based calculations follow precise formulae. Clark's Rule, Young's Rule, body surface area calculations—these are mathematical relationships, not probabilistic inferences. A clinical decision support tool performing these calculations provides guaranteed correctness within its defined parameters. An LLM? It might get it right. It might hallucinate a dose that's ten times too high.
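The formulae named above are simple enough to state directly. A minimal sketch—the function names are mine, and real dosing software would add bounds checks and clinical review:

```python
import math

def clarks_rule(adult_dose_mg, weight_lb):
    # Clark's Rule: child dose = adult dose x (weight in lb / 150)
    return adult_dose_mg * weight_lb / 150

def youngs_rule(adult_dose_mg, age_years):
    # Young's Rule: child dose = adult dose x (age / (age + 12))
    return adult_dose_mg * age_years / (age_years + 12)

def bsa_mosteller(height_cm, weight_kg):
    # Mosteller formula for body surface area in square metres
    return math.sqrt(height_cm * weight_kg / 3600)
```

Each function is a pure mathematical relationship: given the same inputs, it cannot return anything but the correct value for its formula.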

The Co-Pilot Breakthrough

The most compelling evidence for hybrid architecture comes from a 2025 study in Cell Reports Medicine. Researchers evaluated LLM-based clinical decision support across 16 medical specialties, comparing three implementation strategies: LLM alone, pharmacist alone, and pharmacist with LLM as co-pilot.

The results were definitive. The co-pilot configuration achieved the highest accuracy: 61% compared to 46% for pharmacists alone. More critically, the co-pilot mode improved detection of errors posing serious harm by a factor of 1.5.

This is the architecture that works: human expertise augmented by AI capability, not replaced by it. The LLM handles unstructured complexity—parsing clinical notes, identifying relevant context, generating summaries. The human applies clinical judgment, validates recommendations against their expertise, and makes the final decision.

A parallel study from Amazon Pharmacy demonstrated similar patterns. Their MEDIC system—a medication direction co-pilot—reduced near-miss events by 33% in production. The key? Domain-specific fine-tuning combined with pharmacy logic and safety guardrails. The LLM extracts core components; deterministic rules assemble and validate.

The Hybrid Architecture

The right architecture isn't "LLM-first medicine." It's hybrid intelligence:

1. LLMs as Natural Language Co-Pilots

Use LLMs for what they're genuinely excellent at: understanding and generating natural language. This includes parsing unstructured clinical notes, summarising patient histories, translating between medical jargon and patient-friendly explanations, and surfacing relevant information from vast documentation. Let them handle communication and context extraction—tasks where hallucinations can be caught by downstream validation.

2. Deterministic Systems as Safety Governors

Route all safety-critical decisions through classical systems: guideline-based rule engines, pharmacological safety databases, validated risk calculators, deterministic validation layers. These systems enforce constraints, maintain audit trails, and provide guaranteed behaviour within their defined scope. When an LLM suggests a medication, a drug interaction database validates it. When an LLM extracts a diagnosis code, a coding ontology confirms it exists.
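The validation step described above can be sketched as a deterministic gate between the LLM and the clinician. The formulary and contraindication tables below are hypothetical placeholders, as are the function and field names:

```python
# Sketch of a deterministic safety governor gating an LLM suggestion.
# FORMULARY and CONTRAINDICATED are illustrative stand-ins for real databases.
from datetime import datetime, timezone

FORMULARY = {"amoxicillin", "warfarin", "aspirin"}
CONTRAINDICATED = {frozenset({"warfarin", "aspirin"})}

def validate_suggestion(suggested, current_meds, audit_log):
    """Accept or reject an LLM medication suggestion, with an audit record."""
    drug = suggested.lower()
    reasons = []
    if drug not in FORMULARY:
        reasons.append("not in formulary")
    for med in current_meds:
        if frozenset({drug, med.lower()}) in CONTRAINDICATED:
            reasons.append(f"contraindicated with {med}")
    verdict = "accept" if not reasons else "reject"
    # Every decision is logged with its full reasoning chain.
    audit_log.append({
        "time": datetime.now(timezone.utc).isoformat(),
        "suggestion": suggested,
        "verdict": verdict,
        "reasons": reasons,
    })
    return verdict, reasons
```

The LLM may propose anything; only suggestions that pass every deterministic check reach the clinician, and rejections carry an explicit, auditable reason.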

3. Uncertainty-Aware Routing

Build systems that know their limits. When confidence is high and the task is within defined parameters, proceed automatically. When uncertainty exceeds thresholds, escalate to human review. This isn't a limitation—it's a feature. A system that confidently handles 80% of cases while flagging 20% for expert review is more valuable than one that confidently handles 100% with a 5% error rate.
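The routing logic above reduces to a single comparison against a threshold. A minimal sketch, with a hypothetical case format of `(case_id, confidence)` pairs:

```python
# Sketch of uncertainty-aware routing: high-confidence cases proceed
# automatically; everything else is queued for human review.
def triage(cases, threshold=0.9):
    """Split (case_id, confidence) pairs into auto and human-review queues."""
    auto, review = [], []
    for case_id, confidence in cases:
        (auto if confidence >= threshold else review).append(case_id)
    return auto, review
```

The threshold is a policy knob, not a model property: lowering it trades human workload for coverage, and that trade-off stays visible and tunable.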

4. Clinicians Always in the Loop

For high-stakes decisions, human judgment remains essential. The goal isn't to remove clinicians from decision-making; it's to augment their capabilities while reducing cognitive burden. AI handles the tedious, error-prone tasks of information extraction and synthesis. Humans apply the nuanced clinical judgment that comes from years of training and experience.

The Regulatory Reality

The FDA has cleared over 1,000 AI/ML-enabled medical devices as of late 2024, with 295 clearances in 2025 alone. But 97% go through the 510(k) pathway—cleared based on substantial equivalence to existing devices, not rigorous clinical trials. A 2025 JAMA study found that 46.7% of FDA summaries didn't even describe the study design; 53.3% omitted sample size; 95.5% reported no demographic information.

This isn't a criticism of the FDA—they're adapting frameworks designed for physical devices to software that learns and changes. But it means regulatory clearance alone doesn't guarantee clinical reliability. Systems need architectural safeguards that go beyond what regulations require.

The Guardrails Imperative

A 2025 study on Retrieval-Augmented Generation architectures for postoperative care demonstrated what thoughtful guardrail design can achieve. The system combined the deterministic framework of traditional NLP with the probabilistic capabilities of LLMs. Results: 98.4% classification accuracy. Safety guardrails successfully identified 100% of out-of-scope queries and escalation scenarios.

The architecture matters more than the model. A well-constrained GPT-4 with proper guardrails will outperform an unconstrained frontier model on clinical tasks—not because it's smarter, but because it can't fail in ways that harm patients.

This requires deliberate architectural choices: input validation that detects anomalous or adversarial content, output constraints that prevent hallucinated medical terms, confidence thresholds that route uncertain cases to human review, audit logging that captures every decision for retrospective analysis.
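Those four safeguards compose naturally into a pipeline around the model. A sketch under stated assumptions: `llm_fn` and `confidence_fn` stand in for a real model call and a real calibration method, and the ontology is a tiny illustrative subset:

```python
# Sketch of a guardrail pipeline: input validation, ontology-based output
# constraints, a confidence threshold, and audit logging around an LLM call.
ONTOLOGY = {"hypertension", "type 2 diabetes", "asthma"}  # illustrative subset

def guarded_answer(query, llm_fn, confidence_fn, audit, threshold=0.8):
    # 1. Input validation: reject empty or oversized queries outright.
    if not query.strip() or len(query) > 2000:
        audit.append({"query": query, "outcome": "rejected_input"})
        return {"status": "rejected", "reason": "invalid input"}
    answer, terms = llm_fn(query)  # model output plus extracted medical terms
    # 2. Output constraint: every extracted term must exist in the ontology.
    unknown = [t for t in terms if t.lower() not in ONTOLOGY]
    # 3. Confidence threshold: uncertain answers go to human review.
    conf = confidence_fn(answer)
    if unknown or conf < threshold:
        audit.append({"query": query, "outcome": "escalated",
                      "unknown_terms": unknown, "confidence": conf})
        return {"status": "escalated", "unknown_terms": unknown}
    # 4. Audit logging: every served answer is recorded for retrospection.
    audit.append({"query": query, "outcome": "answered", "confidence": conf})
    return {"status": "answered", "answer": answer}
```

Note what the structure guarantees: a hallucinated term can never reach the user, because any term outside the ontology forces escalation regardless of how confident the model sounds.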

Being Honest About What We're Building

This isn't about slowing down innovation. It's not about being conservative. It's about being honest.

LLMs don't know what they don't know. They can't distinguish between confident knowledge and plausible confabulation. They can't trace their reasoning to verifiable sources. They can't guarantee worst-case behaviour.

Classical systems do. A drug interaction database knows exactly which combinations are contraindicated. A dosing calculator knows the precise formula for weight-based dosing. A risk calculator knows the validated thresholds for escalation.

The future of clinical AI won't be built by replacing medical expertise with sophisticated autocomplete. It will be built by embedding probabilistic intelligence inside deterministic safety rails.

The Real Breakthrough

The real breakthrough won't be the loudest model or the biggest parameter count. It will be the safest architecture.

We're at an inflection point. The technology exists to build clinical AI that genuinely helps—that reduces physician burnout, catches errors humans miss, and surfaces insights buried in data. But only if we architect it correctly.

Less magic. More engineering. Deep respect for domain complexity.

The question isn't whether AI will transform healthcare. It's whether we're building AI that supports clinicians… or AI that pretends to be one.