13
Mins
Verified AI
The Architecture of Clinical Truth: Why Medical AI Needs Proof, Not Agents
Why medical AI needs proof-backed clinical reasoning — not just autonomous agents — to prevent hallucinations, missed exclusions, and unsafe diagnostic shortcuts.
Author
Author
Kaushik
Kaushik
Christine
Christine

The Illusion of the Reasoning Trace
In the world of enterprise software, an AI hallucination costs dollars. In medicine, a hallucination could cost a life.
As artificial intelligence moves deeper into clinical care, the industry is undergoing a massive shift toward "Agentic AI." Major tech institutions and health networks are actively co-creating autonomous clinical agents designed to parse longitudinal electronic health records, draft multi-step care pathways, and simulate medical reasoning. These systems are powerful, and it may be natural to think that by wrapping Large Language Models (LLMs) in autonomous loops, giving them access to medical research and guidelines, and letting them generate long "reasoning traces," we can build digital diagnostic assistants capable of solving complex, rare diseases [1][2].
But there is danger in mistaking a fluent reasoning trace for a guarantee.
An LLM agent is a remarkably capable reasoner, but on its own it is not a clinical auditor; it is a text-prediction engine operating over statistical weights. Without an external check, the confidence associated with any hypothesis comes from a black box. When an agentic workflow fumbles a multi-layered diagnostic problem, the error can hide behind a beautifully formatted, highly confident chain of thought. Richness is not trustworthiness. A confident-sounding clinical narrative, on its own, can mask an underlying statistical gamble.
Because token sampling is inherently probabilistic, running the same complex clinical profile through an agentic loop multiple times means the reasoning can eventually drift, negative constraints can be softened, or a rare historical exclusion can be missed. If it can be wrong even once, it cannot be trusted on its own.
However powerful our agents become, we still need an unbribable sanity-checker to back them up.
The Human Limits of Scale
Great clinicians do not simply "pattern-match" symptoms to diseases and call it a day.
Pattern matching is merely the first, primitive step - a way to generate a list of candidate solutions. The true work of a diagnostician is rigorous validation. A clinician takes those candidate conditions and actively cross-checks them against their internal world model: If the patient has the condition, would the patient have pain radiating out from the back? Would it really leave the inflammatory markers sitting in the normal range like this? Is there an old surgery or exposure in the chart that quietly disqualifies it?
Since medicine is deeply complex, a single clinician will rarely make a high-stakes choice in isolation. They consult senior colleagues. They run complex cases past a tumor board or a differential diagnosis review panel. It is worth being precise about what that second set of eyes provides. Part of its value is generative—a colleague brings different priors and may surface a hypothesis nobody else considered. But a distinct and equally critical part is purely mechanical consistency-checking: confirming that the proposed diagnosis is compatible with every observation, that no lab value contradicts it, and that no exclusion criterion buried in the chart quietly rules it out. This second function is the unglamorous, error-prone bookkeeping of medicine. It is the part that fails silently when a clinician is rushed, tired, or anchored on a compelling first impression.
If a clinician were to sit down and manually and mechanically map out every single possible rare disease, writing an exhaustive, logical argument for why each condition was either applicable or structurally eliminated for a specific patient, that clinical reasoning would likely be airtight. But in a stressed, fast-paced healthcare ecosystem, doing this manually for even one patient, nevermind every patient, is completely unscalable and practically impossible.
A diagnosis is only as trustworthy as the weakest link in the reasoning chain behind it, and that chain depends on two properties holding at once. The first is correctness: no step of the diagnostic argument contradicts the patient's data. The second is completeness: no candidate condition goes unconsidered, and no exclusion criterion goes unchecked.
This exact challenge was solved in the world of pure mathematics. Mathematicians no longer have to rely solely on the fallible, slow process of human peer review to catch subtle flaws in complex theorems or ensure that their solution covers all the cases. Instead, they use Lean 4 and the Mathlib library - a system that automatically typechecks proofs with 100% mathematical correctness.
At Pramaana Labs and BioCognitive Engines, we built clinicians a typechecker of their own, to offer the same guarantees.
Building the Foundation: The Medical Typechecker
When a patient sits in a clinic, they place 100% of their trust in the doctor's hands. Currently, beyond the individual clinician's exhausting efforts and human cognitive capacity, there is no structural safety net backing that trust. An overlooked sentence buried in a five-year-old chart can trigger a wrong diagnosis, leading to incorrect medication, leaving the original life-threatening condition completely untreated.
To prove we can build this safety net, we formalized the multi-layered diagnostic criteria for a targeted set of highly complex rare conditions—including Sjögren’s syndrome and Whipple disease—directly into native Lean 4 code. By treating these intricate nosologies as our initial validation benchmarks, we proved that even the most nuanced clinical knowledge can be made mathematically rigorous.
[Messy Patient Record] ──> (LLM as Translator) ──> [Lean 4 Mathematical Terms] (deformalized Lean4 text will be verified by user) │ ▼ ┌───────────────────────┐ │ Prover + Lean Kernel │ └───────────┬───────────┘ │ ▼ [Human Narrative Explanation] <── (Translation) <── [Verified Proof]
[Messy Patient Record] ──> (LLM as Translator) ──> [Lean 4 Mathematical Terms] (deformalized Lean4 text will be verified by user) │ ▼ ┌───────────────────────┐ │ Prover + Lean Kernel │ └───────────┬───────────┘ │ ▼ [Human Narrative Explanation] <── (Translation) <── [Verified Proof]
[Messy Patient Record] ──> (LLM as Translator) ──> [Lean 4 Mathematical Terms] (deformalized Lean4 text will be verified by user) │ ▼ ┌───────────────────────┐ │ Prover + Lean Kernel │ └───────────┬───────────┘ │ ▼ [Human Narrative Explanation] <── (Translation) <── [Verified Proof]
By formalizing diagnostic criteria (such as the ACR/EULAR or Ghent nosologies), we transform the role of the generative AI. In this architecture, the LLM no longer derives a diagnosis directly from model weights and context. Instead, they act as a Translator, converting unstructured patient records into crisp Lean 4 mathematical terms, which are verified by human users.
Once the data is formalized, the Prover generates a proof and Lean Typechecker Kernel checks the proof. The kernel is a tiny, highly trusted program that verifies mathematical proofs and program correctness. If the kernel accepts the statement, then it is mathematically guaranteed that the diagnosis and observations are completely consistent with the known rules.
The reasoning chain presented to the doctor is no longer unverified model output. It is a careful natural language translation of an ironclad, machine-checked mathematical proof.
How Medical Knowledge Is Encoded
For the kernel to check a diagnosis, medicine's knowledge was written down in two forms that run in opposite directions.
One direction is generative: given a cause, what effects does it produce? A disease produces phenotypes (dry eyes, joint pain, high blood sugar) with characteristic frequencies, and a phenotype shows up in an observation (a lab, an exam) with characteristic reliability. Roughly, this is our causal model of how the body works: it encodes both cause-and-effect and P(effect | cause), and it is what lets us rank how well a candidate explains everything we see.
The other direction is inferential: given the effects, what was the cause? This is what diagnostic criteria capture: published rules like the ACR/EULAR criteria for Sjögren's are an expert committee's compressed, standardized shortcut for reasoning from findings back to a disease. They are not a definition of the disease, as meeting them sharply raises the probability rather than proving it. But they are a rigorous, widely accepted heuristic, disciplined enough to gate clinical trials. That rigor is exactly what makes them worth typechecking against.
We model a diagnostic rule as three parts: an entry gate, a set of weighted scoring items above a threshold, and a list of disqualifying exclusions. Sjögren's encodes precisely this shape:
def sjogren_criteriaSatisfied (p : Patient) (asOf : Date) : Prop := sjogren_inclusionGate p asOf -- sicca symptom for ≥ 90 days ∧ sjogren_acrEularPoints p asOf ≥ 4 -- weighted items clear threshold ∧ ¬sjogren_exclusionFires p asOf -- no disqualifying history
def sjogren_criteriaSatisfied (p : Patient) (asOf : Date) : Prop := sjogren_inclusionGate p asOf -- sicca symptom for ≥ 90 days ∧ sjogren_acrEularPoints p asOf ≥ 4 -- weighted items clear threshold ∧ ¬sjogren_exclusionFires p asOf -- no disqualifying history
def sjogren_criteriaSatisfied (p : Patient) (asOf : Date) : Prop := sjogren_inclusionGate p asOf -- sicca symptom for ≥ 90 days ∧ sjogren_acrEularPoints p asOf ≥ 4 -- weighted items clear threshold ∧ ¬sjogren_exclusionFires p asOf -- no disqualifying history
The exclusion term is the one to watch. A single fired exclusion makes the whole expression false—no matter how many points the symptoms scored. Here is what fires it:
def sjogren_exclusionFires (p : Patient) (asOf : Date) : Prop := p.events.any fun e => match e.eventType with | .medicalProcedure .headNeckRadiation => decide (e.startDate ≤ asOf) | .disease .hepatitisC => e.isActive asOf | .disease .sarcoidosis => e.isActive asOf | ... => false
def sjogren_exclusionFires (p : Patient) (asOf : Date) : Prop := p.events.any fun e => match e.eventType with | .medicalProcedure .headNeckRadiation => decide (e.startDate ≤ asOf) | .disease .hepatitisC => e.isActive asOf | .disease .sarcoidosis => e.isActive asOf | ... => false
def sjogren_exclusionFires (p : Patient) (asOf : Date) : Prop := p.events.any fun e => match e.eventType with | .medicalProcedure .headNeckRadiation => decide (e.startDate ≤ asOf) | .disease .hepatitisC => e.isActive asOf | .disease .sarcoidosis => e.isActive asOf | ... => false
Why does a single exclusion outrank everything else? Active hepatitis C, one of the entries above, can drive the same dry eyes, dry mouth, and antibody findings that score points toward Sjögren's—so a patient can clear the gate and exceed the threshold and still not have Sjögren's, because the infection explains it better. The criteria encode exactly this: if the exclusion fires, the diagnosis doesn't stand, however compelling the positive evidence.
This is the scope of what we prove. The kernel does not claim "this patient does not have Sjögren's," as no rule known to humans can prove that. It proves the exact, checkable thing: this patient does not satisfy the Sjögren classification criteria, and cannot under any additional evidence. We verify diagnostic decisions against the rules medicine has already written down and agreed on, and we add the one guarantee a tired human reviewer cannot: that not a single encoded rule was skipped.
Navigating the Odyssey with Stepwise Verification
This formal architecture turns an exhausting clinical search into an elegant, entropy-reducing loop:
Isolating the Uneliminated: When a patient history is entered, the engine doesn't guess a disease. It instantly calculates a comprehensive list of every single possible root cause that explains at least one observed symptom and cannot yet be logically eliminated by the patient's data.
The Next Best Move: The system identifies the clinical inquiry or laboratory test expected to reduce diagnostic entropy the most. For the elimination it offers a hard guarantee: it can prove that specific outcomes of that test will logically eliminate named candidates from the differential.
The Unbreakable Workflow: The user inputs the new test result, and the engine repeats the step—narrowing the circle of possibilities with mathematical precision until a single, provably correct diagnosis is isolated.
If a diagnosis cannot yet be proven due to missing data, the engine refuses to guess or hallucinate a vague percentage. It either admits that it is not able to come up with a provably correct answer or it constructs a Witness, explicitly telling the doctor something like: "I cannot verify Sjögren's today because the 90-day symptom gate is not satisfied; however, if an anti-SSA lab is confirmed positive today, a complete mathematical proof chain will be unlocked."
The System in Action: Auditing the Mixed Presentation
The loop above is abstract until you watch it run. Here it is in a real, deliberately messy case.
Consider a 45-year-old woman who comes in with more than three months of dry, gritty eyes, a persistently dry mouth, and aching joints. To most clinicians, the first two symptoms light up immediately: dry eyes plus dry mouth is the textbook signature of primary Sjögren's disease, a common autoimmune condition. A reasonably experienced doctor anchors here, orders the autoimmune workup, and starts building the case for Sjögren's.
Two things can quietly go wrong with that entirely sensible path.
The first is a missed exclusion. Buried in this patient's chart is a course of head and neck radiation from December 2017. Radiation damages the salivary and lacrimal glands and produces the very same dry-eyes, dry-mouth picture—which is precisely why it sits as a hard exclusion in the Sjögren classification criteria. A clinician anchored on the obvious autoimmune story, reading a five-year-old line in a dense chart, can miss it. The diagnosis that felt most natural is, under the agreed-upon rules, not permissible.
The second is a missed candidate. Whipple disease—a rare systemic infection—can mimic this exact presentation, joint pain included. But it is rare enough that it simply never reaches the top of most clinicians' minds, and a candidate you never consider is a candidate you never rule out.
When our Lean 4 backend evaluated the global candidate set and typechecked the patient's data, our backtranslation layer generated the following live diagnostic output directly within the clinical user interface:
Summary After applying every exclusion gate, Whipple disease is the only candidate not yet logically eliminated. It is not yet confirmed. The system has narrowed a confusing multi-system presentation to a single candidate and now hands off to that disease's standard confirmatory workup. Proof Explanation The proof reasons over three initially relevant conditions—primary Sjögren disease, a clinical-error hypothesis (the possibility that an observation was simply mis-recorded, which carries no diagnostic criteria and is dismissed trivially), and Whipple disease—and establishes that only Whipple disease can satisfy its diagnostic criteria given the patient's presentation. The key steps are: Relevant events identified: The proof enumerates all conditions that could theoretically explain the observed symptoms (keratoconjunctivitis sicca, xerostomia, and arthralgia). Sjögren disease excluded: Although the patient presents with classic Sjögren symptoms (dry eyes and dry mouth), the history of head/neck radiation in December 2017 triggers an explicit exclusion criterion in the Sjögren classification rules. The proof shows that this exclusion fires regardless of any additional observations that might be added. Whipple disease remains viable: The proof demonstrates that Whipple disease criteria could potentially be satisfied by adding compatible laboratory findings (specifically, PAS-positive T. whipplei staining and positive CSF PCR for T. whipplei) that don't temporally conflict with existing observations. Search terminated—handoff, not diagnosis: With only one candidate remaining, there is no discriminating move left to compute—no test would distinguish between rivals, because there are none. The system stops here and hands off to the standard confirmatory workup for the surviving candidate
Summary After applying every exclusion gate, Whipple disease is the only candidate not yet logically eliminated. It is not yet confirmed. The system has narrowed a confusing multi-system presentation to a single candidate and now hands off to that disease's standard confirmatory workup. Proof Explanation The proof reasons over three initially relevant conditions—primary Sjögren disease, a clinical-error hypothesis (the possibility that an observation was simply mis-recorded, which carries no diagnostic criteria and is dismissed trivially), and Whipple disease—and establishes that only Whipple disease can satisfy its diagnostic criteria given the patient's presentation. The key steps are: Relevant events identified: The proof enumerates all conditions that could theoretically explain the observed symptoms (keratoconjunctivitis sicca, xerostomia, and arthralgia). Sjögren disease excluded: Although the patient presents with classic Sjögren symptoms (dry eyes and dry mouth), the history of head/neck radiation in December 2017 triggers an explicit exclusion criterion in the Sjögren classification rules. The proof shows that this exclusion fires regardless of any additional observations that might be added. Whipple disease remains viable: The proof demonstrates that Whipple disease criteria could potentially be satisfied by adding compatible laboratory findings (specifically, PAS-positive T. whipplei staining and positive CSF PCR for T. whipplei) that don't temporally conflict with existing observations. Search terminated—handoff, not diagnosis: With only one candidate remaining, there is no discriminating move left to compute—no test would distinguish between rivals, because there are none. The system stops here and hands off to the standard confirmatory workup for the surviving candidate
Summary After applying every exclusion gate, Whipple disease is the only candidate not yet logically eliminated. It is not yet confirmed. The system has narrowed a confusing multi-system presentation to a single candidate and now hands off to that disease's standard confirmatory workup. Proof Explanation The proof reasons over three initially relevant conditions—primary Sjögren disease, a clinical-error hypothesis (the possibility that an observation was simply mis-recorded, which carries no diagnostic criteria and is dismissed trivially), and Whipple disease—and establishes that only Whipple disease can satisfy its diagnostic criteria given the patient's presentation. The key steps are: Relevant events identified: The proof enumerates all conditions that could theoretically explain the observed symptoms (keratoconjunctivitis sicca, xerostomia, and arthralgia). Sjögren disease excluded: Although the patient presents with classic Sjögren symptoms (dry eyes and dry mouth), the history of head/neck radiation in December 2017 triggers an explicit exclusion criterion in the Sjögren classification rules. The proof shows that this exclusion fires regardless of any additional observations that might be added. Whipple disease remains viable: The proof demonstrates that Whipple disease criteria could potentially be satisfied by adding compatible laboratory findings (specifically, PAS-positive T. whipplei staining and positive CSF PCR for T. whipplei) that don't temporally conflict with existing observations. Search terminated—handoff, not diagnosis: With only one candidate remaining, there is no discriminating move left to compute—no test would distinguish between rivals, because there are none. The system stops here and hands off to the standard confirmatory workup for the surviving candidate
Why This Matters to a Clinician
The trace above is doing two distinct jobs at once, and they are the two things human review cannot reliably guarantee.
The first is correctness. The system did not let the compelling autoimmune picture override the rules. It proved that the patient cannot satisfy the Sjögren classification criteria—no matter what further evidence arrives—because of the documented radiation history, protecting the clinician from a dangerous anchoring bias.
The second is completeness. The system surfaced Whipple disease not because it is likely, but because it was possible and had not been ruled out. It considered every condition in the knowledge base that could explain even one of the patient's symptoms, so the rare mimic that a busy clinician would never think to consider does not slip through unexamined.
Every single line of this English summary is explicitly backed by an underlying mathematical proof object, machine-checked to the soundness of the Lean kernel. There is no room for the system to fabricate clinical logic, hallucinate a connection, or ignore the temporal constraints of the nosology. It provides the exact automated peer-review safety net that complex clinical workflows demand.
True Continuous Learning via Lemma Caching
There is also a question of how these systems learn. When memory is stored as raw text logs or semantic vector embeddings, it can decay , bloat context windows, and offer no logical guarantee that the right memory will surface when applied to a new patient.
Because our framework is built on formal mathematics, it unlocks true Continuous Learning via Lemma Caching.
As our formal prover searches for a proof path across the causal graph, it discovers and verifies atomic sub-truths—Lemmas. Once the Lean compiler typechecks a lemma, that sub-rule is true within the knowledge system as long as the encoded rules and axioms are unchanged. We cache these verified lemmas. The engine literally grows faster and structurally smarter with every proof it completes, with zero risk of memory corruption, behavioral drift, or logic decay.
A Vision for Verified Healthcare
The current AI healthcare discourse treats higher evaluation scores on static benchmarks as a surrogate for clinical trust. But an evaluation score is merely a sample average—it cannot tell a doctor if the vulnerable patient sitting in front of them right now falls into the system's failure margin.
Verification is different. It delivers an unshakeable, deterministic guarantee for an individual human life.
Today, the kernel verifies that a diagnosis stays consistent with the encoded diagnostic criteria and exclusion rules for each disease—the sanity-checks medicine has already standardized. That starts to deliver the minimum that trust requires: that every cause in the knowledge base was considered, and that no part of the patient's record was silently overlooked.
The deeper version encodes not just the rules, but the reasons behind them. As a tangential example, consider two drugs that must not be taken together. The shallow encoding is a forbidden-pair lookup: "A and B—don't combine." The deeper encoding states the mechanism: both strongly inhibit the same enzyme, and combined inhibition drives it to dangerous levels. Hence any two strong inhibitors of that enzyme are unsafe together. Now the system doesn't just block the pair you listed; it flags drug C, which you never paired with anything, the moment C turns out to inhibit the same enzyme—and it can explain why, in terms of the body, not a table.
This is the trajectory: from consistent with the rules to consistent with how the world works. Each step keeps the same hard guarantee, while widening the body of knowledge that guarantee ranges over.
In medical AI, many companies talk about 'AI Alignment', writing better prompts or reinforcement learning to make models more accurate and appropriately cautious. We think alignment is not enough when a human life is on the line. We don't try to align our model to text instructions. We typecheck our diagnoses against what medicine knows to be true, and we intend to keep deepening what "known to be true" can mean.
Formalizing the entirety of human medicine is a monumental task. But the minimum guarantee, that a diagnosis was checked against every rule we have and none were skipped, is one no amount of fluent text can provide, and one every patient placing their trust in a diagnosis deserves. And the goal is not to replace the agentic systems being built across healthcare—it is to give them a layer of proof they can stand on, so the trust patients place in them is earned.
Appendix
[1] https://lifespan.io/ai-tool-sets-new-standard-in-diagnosing-rare-diseases/

The Illusion of the Reasoning Trace
In the world of enterprise software, an AI hallucination costs dollars. In medicine, a hallucination could cost a life.
As artificial intelligence moves deeper into clinical care, the industry is undergoing a massive shift toward "Agentic AI." Major tech institutions and health networks are actively co-creating autonomous clinical agents designed to parse longitudinal electronic health records, draft multi-step care pathways, and simulate medical reasoning. These systems are powerful, and it may be natural to think that by wrapping Large Language Models (LLMs) in autonomous loops, giving them access to medical research and guidelines, and letting them generate long "reasoning traces," we can build digital diagnostic assistants capable of solving complex, rare diseases [1][2].
But there is danger in mistaking a fluent reasoning trace for a guarantee.
An LLM agent is a remarkably capable reasoner, but on its own it is not a clinical auditor; it is a text-prediction engine operating over statistical weights. Without an external check, the confidence associated with any hypothesis comes from a black box. When an agentic workflow fumbles a multi-layered diagnostic problem, the error can hide behind a beautifully formatted, highly confident chain of thought. Richness is not trustworthiness. A confident-sounding clinical narrative, on its own, can mask an underlying statistical gamble.
Because token sampling is inherently probabilistic, running the same complex clinical profile through an agentic loop multiple times means the reasoning can eventually drift, negative constraints can be softened, or a rare historical exclusion can be missed. If it can be wrong even once, it cannot be trusted on its own.
However powerful our agents become, we still need an unbribable sanity-checker to back them up.
The Human Limits of Scale
Great clinicians do not simply "pattern-match" symptoms to diseases and call it a day.
Pattern matching is merely the first, primitive step - a way to generate a list of candidate solutions. The true work of a diagnostician is rigorous validation. A clinician takes those candidate conditions and actively cross-checks them against their internal world model: If the patient has the condition, would the patient have pain radiating out from the back? Would it really leave the inflammatory markers sitting in the normal range like this? Is there an old surgery or exposure in the chart that quietly disqualifies it?
Since medicine is deeply complex, a single clinician will rarely make a high-stakes choice in isolation. They consult senior colleagues. They run complex cases past a tumor board or a differential diagnosis review panel. It is worth being precise about what that second set of eyes provides. Part of its value is generative—a colleague brings different priors and may surface a hypothesis nobody else considered. But a distinct and equally critical part is purely mechanical consistency-checking: confirming that the proposed diagnosis is compatible with every observation, that no lab value contradicts it, and that no exclusion criterion buried in the chart quietly rules it out. This second function is the unglamorous, error-prone bookkeeping of medicine. It is the part that fails silently when a clinician is rushed, tired, or anchored on a compelling first impression.
If a clinician were to sit down and manually and mechanically map out every single possible rare disease, writing an exhaustive, logical argument for why each condition was either applicable or structurally eliminated for a specific patient, that clinical reasoning would likely be airtight. But in a stressed, fast-paced healthcare ecosystem, doing this manually for even one patient, nevermind every patient, is completely unscalable and practically impossible.
A diagnosis is only as trustworthy as the weakest link in the reasoning chain behind it, and that chain depends on two properties holding at once. The first is correctness: no step of the diagnostic argument contradicts the patient's data. The second is completeness: no candidate condition goes unconsidered, and no exclusion criterion goes unchecked.
This exact challenge was solved in the world of pure mathematics. Mathematicians no longer have to rely solely on the fallible, slow process of human peer review to catch subtle flaws in complex theorems or ensure that their solution covers all the cases. Instead, they use Lean 4 and the Mathlib library - a system that automatically typechecks proofs with 100% mathematical correctness.
At Pramaana Labs and BioCognitive Engines, we built clinicians a typechecker of their own, to offer the same guarantees.
Building the Foundation: The Medical Typechecker
When a patient sits in a clinic, they place 100% of their trust in the doctor's hands. Currently, beyond the individual clinician's exhausting efforts and human cognitive capacity, there is no structural safety net backing that trust. An overlooked sentence buried in a five-year-old chart can trigger a wrong diagnosis, leading to incorrect medication, leaving the original life-threatening condition completely untreated.
To prove we can build this safety net, we formalized the multi-layered diagnostic criteria for a targeted set of highly complex rare conditions—including Sjögren’s syndrome and Whipple disease—directly into native Lean 4 code. By treating these intricate nosologies as our initial validation benchmarks, we proved that even the most nuanced clinical knowledge can be made mathematically rigorous.
[Messy Patient Record] ──> (LLM as Translator) ──> [Lean 4 Mathematical Terms] (deformalized Lean4 text will be verified by user) │ ▼ ┌───────────────────────┐ │ Prover + Lean Kernel │ └───────────┬───────────┘ │ ▼ [Human Narrative Explanation] <── (Translation) <── [Verified Proof]
By formalizing diagnostic criteria (such as the ACR/EULAR or Ghent nosologies), we transform the role of the generative AI. In this architecture, the LLM no longer derives a diagnosis directly from model weights and context. Instead, they act as a Translator, converting unstructured patient records into crisp Lean 4 mathematical terms, which are verified by human users.
Once the data is formalized, the Prover generates a proof and Lean Typechecker Kernel checks the proof. The kernel is a tiny, highly trusted program that verifies mathematical proofs and program correctness. If the kernel accepts the statement, then it is mathematically guaranteed that the diagnosis and observations are completely consistent with the known rules.
The reasoning chain presented to the doctor is no longer unverified model output. It is a careful natural language translation of an ironclad, machine-checked mathematical proof.
How Medical Knowledge Is Encoded
For the kernel to check a diagnosis, medicine's knowledge was written down in two forms that run in opposite directions.
One direction is generative: given a cause, what effects does it produce? A disease produces phenotypes (dry eyes, joint pain, high blood sugar) with characteristic frequencies, and a phenotype shows up in an observation (a lab, an exam) with characteristic reliability. Roughly, this is our causal model of how the body works: it encodes both cause-and-effect and P(effect | cause), and it is what lets us rank how well a candidate explains everything we see.
The other direction is inferential: given the effects, what was the cause? This is what diagnostic criteria capture: published rules like the ACR/EULAR criteria for Sjögren's are an expert committee's compressed, standardized shortcut for reasoning from findings back to a disease. They are not a definition of the disease, as meeting them sharply raises the probability rather than proving it. But they are a rigorous, widely accepted heuristic, disciplined enough to gate clinical trials. That rigor is exactly what makes them worth typechecking against.
We model a diagnostic rule as three parts: an entry gate, a set of weighted scoring items above a threshold, and a list of disqualifying exclusions. Sjögren's encodes precisely this shape:
def sjogren_criteriaSatisfied (p : Patient) (asOf : Date) : Prop := sjogren_inclusionGate p asOf -- sicca symptom for ≥ 90 days ∧ sjogren_acrEularPoints p asOf ≥ 4 -- weighted items clear threshold ∧ ¬sjogren_exclusionFires p asOf -- no disqualifying history
The exclusion term is the one to watch. A single fired exclusion makes the whole expression false—no matter how many points the symptoms scored. Here is what fires it:
def sjogren_exclusionFires (p : Patient) (asOf : Date) : Prop := p.events.any fun e => match e.eventType with | .medicalProcedure .headNeckRadiation => decide (e.startDate ≤ asOf) | .disease .hepatitisC => e.isActive asOf | .disease .sarcoidosis => e.isActive asOf | ... => false
Why does a single exclusion outrank everything else? Active hepatitis C, one of the entries above, can drive the same dry eyes, dry mouth, and antibody findings that score points toward Sjögren's—so a patient can clear the gate and exceed the threshold and still not have Sjögren's, because the infection explains it better. The criteria encode exactly this: if the exclusion fires, the diagnosis doesn't stand, however compelling the positive evidence.
This is the scope of what we prove. The kernel does not claim "this patient does not have Sjögren's," as no rule known to humans can prove that. It proves the exact, checkable thing: this patient does not satisfy the Sjögren classification criteria, and cannot under any additional evidence. We verify diagnostic decisions against the rules medicine has already written down and agreed on, and we add the one guarantee a tired human reviewer cannot: that not a single encoded rule was skipped.
Navigating the Odyssey with Stepwise Verification
This formal architecture turns an exhausting clinical search into an elegant, entropy-reducing loop:
Isolating the Uneliminated: When a patient history is entered, the engine doesn't guess a disease. It instantly calculates a comprehensive list of every single possible root cause that explains at least one observed symptom and cannot yet be logically eliminated by the patient's data.
The Next Best Move: The system identifies the clinical inquiry or laboratory test expected to reduce diagnostic entropy the most. For the elimination it offers a hard guarantee: it can prove that specific outcomes of that test will logically eliminate named candidates from the differential.
The Unbreakable Workflow: The user inputs the new test result, and the engine repeats the step—narrowing the circle of possibilities with mathematical precision until a single, provably correct diagnosis is isolated.
If a diagnosis cannot yet be proven due to missing data, the engine refuses to guess or hallucinate a vague percentage. It either admits that it is not able to come up with a provably correct answer or it constructs a Witness, explicitly telling the doctor something like: "I cannot verify Sjögren's today because the 90-day symptom gate is not satisfied; however, if an anti-SSA lab is confirmed positive today, a complete mathematical proof chain will be unlocked."
The System in Action: Auditing the Mixed Presentation
The loop above is abstract until you watch it run. Here it is in a real, deliberately messy case.
Consider a 45-year-old woman who comes in with more than three months of dry, gritty eyes, a persistently dry mouth, and aching joints. To most clinicians, the first two symptoms light up immediately: dry eyes plus dry mouth is the textbook signature of primary Sjögren's disease, a common autoimmune condition. A reasonably experienced doctor anchors here, orders the autoimmune workup, and starts building the case for Sjögren's.
Two things can quietly go wrong with that entirely sensible path.
The first is a missed exclusion. Buried in this patient's chart is a course of head and neck radiation from December 2017. Radiation damages the salivary and lacrimal glands and produces the very same dry-eyes, dry-mouth picture—which is precisely why it sits as a hard exclusion in the Sjögren classification criteria. A clinician anchored on the obvious autoimmune story, reading a five-year-old line in a dense chart, can miss it. The diagnosis that felt most natural is, under the agreed-upon rules, not permissible.
The second is a missed candidate. Whipple disease—a rare systemic infection—can mimic this exact presentation, joint pain included. But it is rare enough that it simply never reaches the top of most clinicians' minds, and a candidate you never consider is a candidate you never rule out.
When our Lean 4 backend evaluated the global candidate set and typechecked the patient's data, our backtranslation layer generated the following live diagnostic output directly within the clinical user interface:
Summary After applying every exclusion gate, Whipple disease is the only candidate not yet logically eliminated. It is not yet confirmed. The system has narrowed a confusing multi-system presentation to a single candidate and now hands off to that disease's standard confirmatory workup. Proof Explanation The proof reasons over three initially relevant conditions—primary Sjögren disease, a clinical-error hypothesis (the possibility that an observation was simply mis-recorded, which carries no diagnostic criteria and is dismissed trivially), and Whipple disease—and establishes that only Whipple disease can satisfy its diagnostic criteria given the patient's presentation. The key steps are: Relevant events identified: The proof enumerates all conditions that could theoretically explain the observed symptoms (keratoconjunctivitis sicca, xerostomia, and arthralgia). Sjögren disease excluded: Although the patient presents with classic Sjögren symptoms (dry eyes and dry mouth), the history of head/neck radiation in December 2017 triggers an explicit exclusion criterion in the Sjögren classification rules. The proof shows that this exclusion fires regardless of any additional observations that might be added. Whipple disease remains viable: The proof demonstrates that Whipple disease criteria could potentially be satisfied by adding compatible laboratory findings (specifically, PAS-positive T. whipplei staining and positive CSF PCR for T. whipplei) that don't temporally conflict with existing observations. Search terminated—handoff, not diagnosis: With only one candidate remaining, there is no discriminating move left to compute—no test would distinguish between rivals, because there are none. The system stops here and hands off to the standard confirmatory workup for the surviving candidate
Why This Matters to a Clinician
The trace above is doing two distinct jobs at once, and they are the two things human review cannot reliably guarantee.
The first is correctness. The system did not let the compelling autoimmune picture override the rules. It proved that the patient cannot satisfy the Sjögren classification criteria—no matter what further evidence arrives—because of the documented radiation history, protecting the clinician from a dangerous anchoring bias.
The second is completeness. The system surfaced Whipple disease not because it is likely, but because it was possible and had not been ruled out. It considered every condition in the knowledge base that could explain even one of the patient's symptoms, so the rare mimic that a busy clinician would never think to consider does not slip through unexamined.
Every single line of this English summary is explicitly backed by an underlying mathematical proof object, machine-checked to the soundness of the Lean kernel. There is no room for the system to fabricate clinical logic, hallucinate a connection, or ignore the temporal constraints of the nosology. It provides the exact automated peer-review safety net that complex clinical workflows demand.
True Continuous Learning via Lemma Caching
There is also a question of how these systems learn. When memory is stored as raw text logs or semantic vector embeddings, it can decay , bloat context windows, and offer no logical guarantee that the right memory will surface when applied to a new patient.
Because our framework is built on formal mathematics, it unlocks true Continuous Learning via Lemma Caching.
As our formal prover searches for a proof path across the causal graph, it discovers and verifies atomic sub-truths—Lemmas. Once the Lean compiler typechecks a lemma, that sub-rule is true within the knowledge system as long as the encoded rules and axioms are unchanged. We cache these verified lemmas. The engine literally grows faster and structurally smarter with every proof it completes, with zero risk of memory corruption, behavioral drift, or logic decay.
A Vision for Verified Healthcare
The current AI healthcare discourse treats higher evaluation scores on static benchmarks as a surrogate for clinical trust. But an evaluation score is merely a sample average—it cannot tell a doctor if the vulnerable patient sitting in front of them right now falls into the system's failure margin.
Verification is different. It delivers an unshakeable, deterministic guarantee for an individual human life.
Today, the kernel verifies that a diagnosis stays consistent with the encoded diagnostic criteria and exclusion rules for each disease—the sanity-checks medicine has already standardized. That starts to deliver the minimum that trust requires: that every cause in the knowledge base was considered, and that no part of the patient's record was silently overlooked.
The deeper version encodes not just the rules, but the reasons behind them. As a tangential example, consider two drugs that must not be taken together. The shallow encoding is a forbidden-pair lookup: "A and B—don't combine." The deeper encoding states the mechanism: both strongly inhibit the same enzyme, and combined inhibition drives it to dangerous levels. Hence any two strong inhibitors of that enzyme are unsafe together. Now the system doesn't just block the pair you listed; it flags drug C, which you never paired with anything, the moment C turns out to inhibit the same enzyme—and it can explain why, in terms of the body, not a table.
This is the trajectory: from consistent with the rules to consistent with how the world works. Each step keeps the same hard guarantee, while widening the body of knowledge that guarantee ranges over.
In medical AI, many companies talk about 'AI Alignment', writing better prompts or reinforcement learning to make models more accurate and appropriately cautious. We think alignment is not enough when a human life is on the line. We don't try to align our model to text instructions. We typecheck our diagnoses against what medicine knows to be true, and we intend to keep deepening what "known to be true" can mean.
Formalizing the entirety of human medicine is a monumental task. But the minimum guarantee, that a diagnosis was checked against every rule we have and none were skipped, is one no amount of fluent text can provide, and one every patient placing their trust in a diagnosis deserves. And the goal is not to replace the agentic systems being built across healthcare—it is to give them a layer of proof they can stand on, so the trust patients place in them is earned.
Appendix
[1] https://lifespan.io/ai-tool-sets-new-standard-in-diagnosing-rare-diseases/