Connected Surgical Intelligence

Verification, Not Classification

Why embedding-based anomaly detection outperforms traditional classification for industrial validation.


Our first attempt at automated validation used classification. It seemed obvious—train a model to distinguish correct from incorrect, positive from negative, pass from fail. The model achieved 97% accuracy in the lab. In production, it failed catastrophically on the first day.

The problem wasn't accuracy. The problem was that a classifier must produce a label.

The Classifier That Couldn't Say "I Don't Know"

When an instrument it had never seen appeared—a new variant, a damaged tool, something from a completely different surgical kit—the model confidently assigned it to the nearest class. Wrong instruments passed validation. The system was useless precisely when it mattered most.

This failure taught me something fundamental: classification answers "which category does this belong to?" but validation asks a different question entirely: "does this belong here at all?"

The solution required a complete paradigm shift. Instead of teaching a model what things are, I needed to teach it what things should look like in a specific context. That's verification—and it demands an entirely different architecture.

The Fundamental Distinction

Classification and verification solve different problems, and conflating them is one of the most common mistakes in applied machine learning.

Classification operates on a closed-world assumption. You define a finite set of categories during training, and every input must map to one of them. The decision boundary partitions the entire feature space—no region is left unassigned. This works when you genuinely know all possible inputs ahead of time.

Verification operates on an open-world assumption. You model what "normal" or "expected" looks like, and anything sufficiently different is flagged. The decision boundary encircles the known distribution rather than partitioning all space. This is essential when novel inputs are not just possible but inevitable.

In regulated environments—medical devices, manufacturing, aerospace—verification isn't optional. You cannot deploy a system that confidently misclassifies unknown anomalies as acceptable. The cost of a false negative (accepting something wrong) vastly exceeds the cost of a false positive (rejecting something correct for manual review).

Why Classification Fails for Validation

The research literature has spent the last decade documenting this problem under various names: out-of-distribution detection, open-set recognition, and anomaly detection. The core insight is consistent across all of them: neural networks trained with softmax classification are fundamentally incapable of reliable uncertainty estimation on inputs far from the training distribution.

Consider what happens geometrically. A softmax classifier learns hyperplanes that separate classes in feature space. Every point in that space—no matter how far from any training example—gets mapped to some class with high confidence. The classifier has no mechanism for saying "this is unlike anything I've learned."

Research from IJCV 2025 (Dissecting Out-of-Distribution Detection and Open-Set Recognition) confirms this systematically: scoring rules sensitive to feature magnitude—like Maximum Logit Score and Energy scoring—consistently outperform raw softmax probabilities for detecting unknowns. But even these are post-hoc patches on a fundamentally wrong architecture for the verification problem.
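The saturation is easy to demonstrate. Below is a minimal NumPy sketch (illustrative logits, my own function names) comparing maximum softmax probability against the magnitude-sensitive energy score: scaling the logits up tenfold leaves softmax confidence pinned near 1.0, while the log-sum-exp energy score shifts sharply.

```python
import numpy as np

def msp_score(logits):
    """Maximum softmax probability: saturates near 1.0 for any
    input with a dominant logit, however far from training data."""
    z = logits - logits.max()          # shift for numerical stability
    p = np.exp(z) / np.exp(z).sum()
    return p.max()

def energy_score(logits):
    """Log-sum-exp of the logits (the negative free energy).
    Sensitive to feature magnitude; lower values suggest OOD."""
    return np.log(np.exp(logits).sum())

# An in-distribution input and a scaled-up far-away input
in_dist = np.array([5.0, 0.5, 0.2])
far_ood = np.array([50.0, 5.0, 2.0])   # 10x the magnitude, same direction

print(msp_score(in_dist), msp_score(far_ood))        # both near 1.0
print(energy_score(in_dist), energy_score(far_ood))  # clearly separated
```

Softmax only sees the relative shape of the logits; magnitude-sensitive scores like this one retain the information softmax throws away.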

The practical failures are predictable:

Domain shift: New lighting conditions, camera changes, or environmental variations cause the model to misclassify inputs it would have handled correctly under training conditions.

Novel anomalies: Defects the model has never seen—new failure modes, contamination, damage patterns—get assigned to the nearest known class.

Subtle variations: Instruments that look similar but aren't interchangeable (wrong size, wrong variant, different manufacturer) pass validation because the classifier can't distinguish fine-grained differences it wasn't explicitly trained on.

Inter-class similarity: Different instruments that share visual features confuse the classifier, leading to cross-category misidentification.

The Verification Architecture

Verification requires a fundamentally different approach: learn an embedding space where similar things cluster together, then make acceptance decisions based on distance rather than classification.

This idea isn't new. Siamese networks, introduced by Bromley et al. in 1994 and popularized by Koch et al. in 2015 for one-shot learning, established the paradigm: train twin networks to produce embeddings where similar inputs have small distances and dissimilar inputs have large distances. The key insight was that you could generalize to new classes without retraining—exactly what verification needs.

Modern metric learning has refined this considerably. Contrastive loss, triplet loss, ArcFace, and ProtoNCE all optimize for the same goal: embeddings that capture semantic similarity in a way that transfers to unseen examples. The 2024-2025 literature shows these approaches consistently outperforming classification on few-shot and open-set tasks.

But embedding space alone isn't enough. You also need a principled way to set acceptance thresholds—and this is where most implementations fall apart. Ad-hoc percentile thresholds work in the lab but fail in production when the distribution of distances shifts or when tail events matter.

Statistical Decision Boundaries: Extreme Value Theory

The breakthrough comes from Extreme Value Theory (EVT)—a branch of statistics specifically designed to model the tails of distributions where data is sparse and stakes are high.

The Fisher-Tippett-Gnedenko theorem proves that, regardless of the underlying distribution, block maxima converge to one of three limiting forms; its peaks-over-threshold counterpart, the Pickands-Balkema-de Haan theorem, shows that exceedances over a high threshold converge to the Generalized Pareto Distribution (GPD). This means we can characterize what "extreme" looks like without assuming anything about the parent distribution.

In verification terms: collect embeddings from known-good examples, fit a GPD to the tail of the distance distribution, and use it to compute acceptance thresholds that correspond to specific false positive rates. The threshold isn't a magic number someone picked—it's derived from the statistical properties of your actual data with quantified uncertainty.
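That procedure can be sketched with SciPy's `genpareto`, using synthetic gamma-distributed distances as a stand-in for real known-good data:

```python
import numpy as np
from scipy.stats import genpareto

rng = np.random.default_rng(0)
# Stand-in for distances between known-good embeddings and their references
distances = rng.gamma(shape=2.0, scale=0.1, size=5000)

# Peaks-over-threshold: model exceedances above a high empirical quantile
u = np.quantile(distances, 0.90)
excess = distances[distances > u] - u
c, loc, scale = genpareto.fit(excess, floc=0.0)

# Threshold for a target false positive rate among known-good items.
# P(D > t) = P(D > u) * P(D - u > t - u | D > u), so solve the GPD quantile.
target_fpr = 1e-3
p_exceed_u = (distances > u).mean()
q = 1.0 - target_fpr / p_exceed_u
threshold = u + genpareto.ppf(q, c, loc=0.0, scale=scale)
print(f"accept if distance <= {threshold:.3f}")
```

The threshold now corresponds to an explicit false positive rate rather than a hand-picked percentile, and refitting on fresh data updates it in a principled way.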

The SPOT algorithm (Streaming Peaks Over Threshold) from Siffer et al. extends this to streaming data, updating thresholds as new data arrives without storing the entire history. This is critical for production systems where the distribution may drift over time.

Recent work (2024-2025) combines EVT with deep learning embeddings for anomaly detection in domains from time series to graphs to images. The pattern is consistent: EVT-based thresholds outperform Gaussian assumptions and empirical percentiles, especially in the tails where anomalies live.

Multi-Modal Evidence Fusion

Single-modality embeddings have blind spots. A texture-based model might confuse instruments with similar surface patterns. A shape-based model might fail when orientation varies. Production verification needs multiple, complementary evidence streams.

The industrial anomaly detection literature (particularly work on MVTec AD and VisA benchmarks) shows this clearly. PatchCore, achieving near-perfect AUROC on MVTec, uses patch-level features from pretrained networks to capture both local texture and global structure. SimpleNet, DiffAD, and other state-of-the-art methods all leverage multi-scale or multi-modal features.

For surgical instrument validation, the natural decomposition is geometry (shape) and appearance (texture). Shape captures the physical form—silhouette, contours, structural features. Texture captures surface properties—material finish, markings, wear patterns. Neither alone is sufficient; together they provide robust discrimination.

Fusion can happen at multiple levels: early fusion (concatenate raw features), late fusion (combine scores), or learned fusion (train a network to weight modalities). Gated fusion architectures—where the network learns to weight shape versus texture evidence based on input characteristics—often outperform fixed weighting schemes.
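A minimal sketch of gated late fusion in NumPy (the gate parameters `w` and `b` are illustrative stand-ins for trained weights):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_fusion(shape_score, texture_score, gate_features, w, b):
    """Weight shape vs. texture evidence per input. The gate sees
    input characteristics (e.g. glare or occlusion indicators) and
    learns when each modality is trustworthy."""
    g = sigmoid(gate_features @ w + b)   # g in (0, 1)
    return g * shape_score + (1.0 - g) * texture_score
```

When the gate features indicate the texture channel is unreliable (say, heavy specular glare), `g` approaches 1 and the shape evidence dominates, rather than averaging in a corrupted signal.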

The key insight is that modalities should be complementary, not redundant. Shape and texture succeed and fail on different inputs. Fusing them doesn't average performance—it captures the union of their capabilities.

Context-Conditioned Verification

Global anomaly detection—"is this strange compared to everything I've seen?"—isn't enough for structured validation tasks. The question is context-specific: "is this acceptable for this position?"

Consider a surgical tray with 40 slots. Each slot expects specific instruments—often just one, sometimes a small set of acceptable variants. An instrument that's perfectly correct in slot 12 might be catastrophically wrong in slot 15. Global verification can't capture this; you need per-context acceptance manifolds.

There are two approaches to context conditioning. Representation-level conditioning (FiLM layers, prototype modulation) bakes context awareness into the embedding itself—the network produces different embeddings for the same input depending on context. Verification-level conditioning uses a shared embedding space but applies different acceptance criteria per context—separate indices, thresholds, and margins for each slot.

Verification-level conditioning has practical advantages: the embedding network remains universal and reusable, adding new contexts requires only new indices (not retraining), and the decision process is fully transparent. You can audit exactly which references were compared, what distances were computed, and which thresholds applied.
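Verification-level conditioning amounts to a mapping from slot to references and threshold. The class and names below are an illustrative sketch, not a production API:

```python
import numpy as np

class SlotVerifier:
    """One shared embedding space; per-slot references and thresholds.
    Adding a new slot requires only registering references, no retraining."""

    def __init__(self):
        self.slots = {}   # slot_id -> (reference_matrix, threshold)

    def register(self, slot_id, references, threshold):
        refs = np.asarray(references, dtype=float)
        refs /= np.linalg.norm(refs, axis=1, keepdims=True)
        self.slots[slot_id] = (refs, threshold)

    def verify(self, slot_id, embedding):
        refs, threshold = self.slots[slot_id]
        e = embedding / np.linalg.norm(embedding)
        # cosine distance to the nearest acceptable reference for this slot
        d = 1.0 - float((refs @ e).max())
        return d <= threshold, d
```

The same embedding can pass in one slot and fail in another, which is exactly the per-context behavior global verification cannot express.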

State of the Art (2024-2025)

The industrial anomaly detection field has matured rapidly. Key developments:

Memory-based methods: PatchCore and its variants (FR-PatchCore, Sequential PatchCore) achieve 99%+ AUROC on MVTec by storing patch-level features from pretrained networks and detecting anomalies via nearest-neighbor distance. The coreset subsampling strategy keeps memory bounded while maintaining coverage.

Few-shot adaptation: Recent work shows PatchCore can be optimized for one-shot scenarios with architectural choices (anti-aliased backbones, Gaussian random projection) and augmentation strategies. Production deployment with minimal examples is increasingly viable.

Foundation model features: DINOv2 and CLIP embeddings provide strong zero-shot baselines. Fine-tuning these on domain data yields state-of-the-art results with less training than task-specific architectures.

Diffusion-based detection: DiffAD and similar methods use diffusion models for reconstruction-based anomaly scoring, leveraging their ability to model complex data distributions.

Multi-prototype learning: CIDER, diversified multi-prototype contrastive learning, and similar approaches address intra-class variation by learning multiple prototypes per class—critical for instruments with pose variability.

Continual learning: PatchCoreCL and related work enable memory-bounded incremental learning, maintaining sub-banks per task with fixed total capacity. Essential for production systems where the inventory evolves.

What Most Implementations Get Wrong

Having deployed verification systems in production, I've seen consistent failure patterns:

Threshold selection: Most systems use fixed percentile thresholds (95th, 99th) that work in validation but fail when the distribution shifts in production. EVT-based thresholds with explicit false positive rate control are essential but rarely implemented.

Margin calibration: The gap between "definitely accept" and "definitely reject" matters. Systems that only compute distance without calibrating rejection margins end up with either too many false positives (conservative) or dangerous false negatives (aggressive).

Negative mining: Embedding spaces trained only on positives develop blind spots. You need negative examples—real and synthetic—to learn the boundary between acceptable and unacceptable. Most academic work focuses on one-class settings; production needs semi-supervised approaches.

Artifact management: A verification system isn't just a model—it's indices, thresholds, margins, manifests, and configuration. Without rigorous versioning and validation, production deployments drift from tested states.

Explainability: Regulators don't accept "the neural network said no." Every rejection needs a traceable reason: which reference was compared, what distance was computed, which threshold failed. This requires architecture decisions that prioritize transparency.

Registration sensitivity: PatchCore and similar methods assume approximate alignment. Real-world images have pose variation, partial occlusions, and background clutter. Feature-level registration (FR-PatchCore) or robust pooling (GeM) are essential but often overlooked.
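GeM pooling itself is compact enough to show directly. A NumPy sketch, assuming a non-negative feature map of shape (H, W, C):

```python
import numpy as np

def gem_pool(feature_map, p=3.0, eps=1e-6):
    """Generalized-mean (GeM) pooling over spatial positions.
    p=1 recovers average pooling; large p approaches max pooling."""
    x = np.clip(feature_map, eps, None)            # guard against zeros
    return (x ** p).mean(axis=(0, 1)) ** (1.0 / p)
```

Raising `p` emphasizes the strongest activations per channel, which makes the pooled descriptor less sensitive to exactly where in the frame the instrument sits, a cheap partial remedy for the registration problem above.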

A Production Verification Pipeline

Based on the lessons above, here's what a robust verification system looks like:

  1. Multi-modal embedding: Separate networks for shape (geometric features) and texture (appearance features), followed by learned fusion. Each modality uses metric learning objectives (ArcFace, contrastive loss) rather than classification.

  2. Context-specific indices: Per-position FAISS indices storing embeddings of acceptable items. Cosine similarity for retrieval, with configurable k-nearest-neighbor queries.

  3. Statistical acceptance boundaries: Per-class centroids with quantile radii from training distribution, plus EVT-fitted tails for extrapolation. Acceptance requires falling within the radius and passing the impostor margin test.

  4. Margin calibration: Global and per-class impostor margins derived from negative examples. The gap between best-match distance and rejection threshold is explicitly controlled.

  5. Structured decision output: Every decision includes: matched reference ID, computed distance, applied threshold, and rejection reason code (radius_fail, margin_fail, low_similarity, no_candidate). Full audit trail for regulatory compliance.

  6. Configuration-driven scaling: Adding new items requires only new embeddings and index updates—no model retraining. Manifest files track all artifacts with version control integration.
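The structured decision output of step 5 might be sketched like this. The `Decision` type and the margin handling are illustrative (the `low_similarity` code is omitted for brevity); the reason codes follow the list above:

```python
import numpy as np
from dataclasses import dataclass
from typing import Optional, Dict

@dataclass
class Decision:
    accepted: bool
    reference_id: Optional[str]
    distance: float
    threshold: float
    reason: str   # "ok", "radius_fail", "margin_fail", "no_candidate"

def decide(embedding: np.ndarray, references: Dict[str, np.ndarray],
           threshold: float, margin: float) -> Decision:
    """Every decision carries its evidence: the matched reference,
    the computed distance, the applied threshold, and a reason code."""
    if not references:
        return Decision(False, None, float("inf"), threshold, "no_candidate")
    e = embedding / np.linalg.norm(embedding)
    best_id, best = None, float("inf")
    for rid, ref in references.items():
        d = 1.0 - float(ref @ e) / float(np.linalg.norm(ref))  # cosine distance
        if d < best:
            best_id, best = rid, d
    if best > threshold:
        return Decision(False, best_id, best, threshold, "radius_fail")
    if best > threshold - margin:   # inside the radius but too near the boundary
        return Decision(False, best_id, best, threshold, "margin_fail")
    return Decision(True, best_id, best, threshold, "ok")
```

Because every field is recorded, a rejected tray can be audited reference by reference, which is the transparency regulators actually ask for.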

Why This Matters Beyond the Technical

The shift from classification to verification isn't just a technical choice—it reflects a different relationship with uncertainty.

Classification systems are confident. They always have an answer, even when they shouldn't. This brittleness is hidden in accuracy metrics that assume test distributions match training distributions—an assumption that never holds in production.

Verification systems are humble. They explicitly model what they know and flag what they don't. The price is additional complexity—you need indices, thresholds, margins, calibration. The benefit is a system that fails gracefully, routing uncertain cases to human review rather than confidently making mistakes.

For safety-critical applications, there is no alternative. A surgical tray validation system that occasionally passes wrong instruments is worse than no system at all—it provides false assurance. A system that correctly routes 98% automatically and flags 2% for manual review is genuinely useful.

The verification paradigm also scales better. Adding new instrument variants requires new reference embeddings, not model retraining. Adjusting sensitivity requires threshold tuning, not architecture changes. The system adapts to operational needs without the ML engineering overhead that classification systems demand.

Conclusion

The core insight is simple: match the architecture to the problem. Classification answers "what is this?" Verification answers "does this belong?" These are different questions requiring different solutions.

Embedding-based verification with statistical decision boundaries provides what classification cannot: principled handling of unknown inputs, calibrated uncertainty, auditable decisions, and graceful degradation. The technical complexity is higher, but for regulated industrial applications, there is no shortcut.

The field is mature enough that production deployment is practical. PatchCore-family methods, metric learning objectives, EVT-based thresholding, and multi-modal fusion are all well-understood. The challenge isn't inventing new methods—it's engineering robust systems that combine these components correctly.

For anyone building validation systems, the path is clear: abandon classification, embrace verification, and invest in the infrastructure—indices, thresholds, margins, manifests—that makes it production-ready.

The classifier that couldn't say "I don't know" nearly cost us a deployment. The verification system that knows its limits has run reliably ever since.

Topics

Verification Systems · Anomaly Detection · Metric Learning · Production ML · Surgical Intelligence