Imandra CodeLogician: A Deep Dive into Neuro-Symbolic Code Verification

I spent the last week pulling apart the Imandra CodeLogician paper (Lin et al., February 2026), and it changed how I think about the whole "vericoding" problem. If you work at the intersection of LLMs and formal verification, read this paper. Here's what's in it.

The problem it's solving

The formal verification community has been wrestling with a question for a while: what role should LLMs play in producing verified code? Most existing approaches treat formal methods as a filter. The LLM generates code or a proof, and then a symbolic engine checks whether the output is correct. Binary pass/fail.

CodeLogician flips this. Instead of using formal methods to police LLM outputs after the fact, it uses LLMs to translate source code into formal models, and then hands everything off to an automated reasoning engine called ImandraX. The LLM becomes a translator, not a reasoner. The engine does the math.

This distinction matters because the benchmark results are brutal. When frontier LLMs (Claude Opus 4.5, GPT-5.2, Gemini 3 Pro, and others) were asked to reason about code on their own, they scored 0.186 out of 1.0 on state space estimation accuracy. They identified fewer than 60% of edge cases. They hallucinated safety guarantees 36.5% of the time — confidently claiming a system was safe when formal verification proved a critical vulnerability existed.

When the same LLMs were augmented with CodeLogician's formal feedback, their reasoning accuracy closed the gap by 41 to 47 percentage points. The formal engine didn't make the LLMs smarter. It told them the answers and let them format the output.

The Architecture

CodeLogician has several layers.

The Modeling Language: IML

Everything flows through the Imandra Modeling Language (IML), a pure functional subset of OCaml with extensions for specification and verification. Coming from Lean 4, the key differences are worth understanding.

IML uses Hindley-Milner static typing rather than dependent types. The reasoning engine behind it (ImandraX) is built around automated decision procedures and SMT-based bounded/unbounded verification, not interactive tactic-based proof construction. Where Lean 4 requires you (or an LLM) to write tactic scripts to prove theorems, ImandraX tries to discharge proofs automatically using a typed, higher-order lifting of the Boyer-Moore inductive waterfall.

The automation level is the big differentiator. In Lean 4, you often need manual guidance for complex state machines. ImandraX throws deep automation at the problem, and when it works, it works without human (or AI) intervention.

State handling is different too. Lean 4 uses monads (StateM, IO) to model mutable state and side effects. IML models state as explicit infinite-state machines where state is passed through function inputs and outputs. Loops get rewritten as structurally recursive definitions. Side effects get eliminated by threading state explicitly.
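To make the state-threading idea concrete, here's a minimal Python sketch of the transformation the paper describes: an imperative loop with a mutable accumulator, rewritten as a structurally recursive definition where the state travels through function inputs and outputs. (This is an illustration of the shape an IML model takes, not actual agent output.)

```python
# Imperative version: mutable accumulator, the kind of construct
# that must be refactored away before formalization.
def total_imperative(xs):
    acc = 0
    for x in xs:
        acc += x
    return acc

# Pure functional version: the state (the accumulator) is threaded
# explicitly through inputs and outputs, and the loop becomes
# structural recursion over the list.
def total_recursive(xs, acc=0):
    if not xs:
        return acc
    return total_recursive(xs[1:], acc + xs[0])
```

Both compute the same function, but only the second has a shape an inductive reasoning engine can work with directly.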

Autoformalization: Python to IML

The "autoformalization" step is where the LLM does its work. Given a Python program, the CodeLogician agent incrementally builds an executable IML model. Not a one-shot translation, but a structured refinement process:

  1. Extract structural, control-flow, and type information from the source program.
  2. Refactor imperative constructs (for loops, while loops) into pure functional recursive definitions.
  3. Make implicit assumptions explicit. If the code calls an external library, the agent introduces "opaque functions" (functions with a type signature but no implementation) or approximation functions (like Taylor-series expansions for math functions).
  4. Synthesize an IML model admissible to the reasoning engine.

The paper distinguishes three abstraction levels for modeling. High-level models come from domain-specific languages that compile directly to IML. Mid-level models are the primary target for CodeLogician: application-level business logic, data-processing pipelines, control-heavy code. Low-level models (instruction sequences, hardware-near implementations) are outside CodeLogician's scope because they need explicit formalization of the execution environment.

The handling of unknowns is worth noting. When the code depends on external libraries or services, CodeLogician has three strategies: opaque functions (declare the type, skip the implementation), axioms (constrain opaque function behavior with explicit assumptions), and approximation functions (replace the unknown with a sound approximation). All of these are explicit and auditable, so you always know what assumptions your reasoning depends on.
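A rough Python sketch of the first and third strategies, using hypothetical names (`credit_score` as an external service, a truncated Taylor series standing in for `math.exp`):

```python
import math

# Opaque function: a type signature with no implementation. Its behavior
# would be constrained by an explicit axiom (e.g. 300 <= result <= 850)
# rather than by code. Hypothetical example, not from the paper.
def credit_score(customer_id: int) -> int:
    """Opaque: external service; behavior fixed only by stated axioms."""
    raise NotImplementedError

# Approximation function: replace a transcendental with a truncated
# Taylor series so the reasoning engine sees only polynomial arithmetic.
def exp_approx(x: float, terms: int = 10) -> float:
    return sum(x**n / math.factorial(n) for n in range(terms))
```

The point of both is auditability: the assumption ("this external call returns a bounded int", "exp is approximated to 10 terms") is written down where the verifier and the human can see it.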

Verification Goals

Once the IML model exists, CodeLogician formulates verification goals (VGs): boolean-valued IML functions encoding properties of the system. VGs are universally quantified over all inputs. When ImandraX proves a VG, it establishes the property holds for every input. When it fails, it produces a concrete counterexample witness.

The paper gives a good example from trading venue verification. Consider the transitivity of an order ranking function:

lemma rank_transitivity side order1 order2 order3 mkt =
  order_higher_ranked(side,order1,order2,mkt) &&
  order_higher_ranked(side,order2,order3,mkt)
  ==>
  order_higher_ranked(side,order1,order3,mkt)

In practice, verification is iterative. Counterexamples expose missing preconditions, modeling gaps, or specification errors. The paper emphasizes that counterexamples are "first-class" objects in ImandraX: they're reflected in the runtime and you compute with them directly. This tight feedback loop between conjectures and counterexamples is what makes the model refinement process work.
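A toy illustration of why this matters: a ranking function with a tolerance bug (prices within 0.5 count as equal and fall through to time priority) silently breaks transitivity, and a brute-force counterexample search, standing in for ImandraX's symbolic one, produces a concrete witness you can compute with. All names and values here are invented for the sketch.

```python
from itertools import product

def higher_ranked(o1, o2):
    # Orders are (price, time) pairs. Bug: prices within 0.5 are treated
    # as equal and the comparison falls through to time priority.
    if abs(o1[0] - o2[0]) > 0.5:
        return o1[0] > o2[0]
    return o1[1] < o2[1]

orders = [(1.0, 1), (1.4, 2), (1.8, 3)]

# Search for a transitivity violation: a > b and b > c but not a > c.
cex = None
for a, b, c in product(orders, repeat=3):
    if higher_ranked(a, b) and higher_ranked(b, c) and not higher_ranked(a, c):
        cex = (a, b, c)
        break
```

Each adjacent pair falls inside the tolerance band (so time decides), but the endpoints differ by 0.8 (so price decides, the other way), which is exactly the kind of subtle interaction a counterexample makes undeniable.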

Region Decomposition

This is what makes CodeLogician qualitatively different from standard formal verification tooling.

Standard verification asks a binary question: does this property hold? Region decomposition asks something richer: what are all the distinct behaviors this function exhibits, and what are the exact boundaries between them?

Given a function, ImandraX partitions its entire input space into a finite set of symbolic regions. Each region has three components:

Constraints. The exact boolean logical conditions inputs must satisfy to enter that region. Precise mathematical boundaries, not approximations.

Invariant result. A mathematical guarantee of what the function returns for any input satisfying those constraints.

Sample points. Concrete witness values satisfying the constraints, which you extract for test generation.

Concrete example from the paper. Take this discount function:

let discount = fun o ->
  match o.customer with
  | Premium -> if o.amount > 100 then 20 else 10
  | Standard -> if o.amount > 100 then 10 else 0

ImandraX proves there are exactly 4 behavioral regions: Premium with amount > 100 (result 20), Premium with amount <= 100 (result 10), Standard with amount > 100 (result 10), and Standard with amount <= 100 (result 0).

This looks obvious for a four-line function. But real-world code produces dozens or hundreds of regions, and the combinatorial interactions between conditions create boundaries that humans and LLMs consistently miss.
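You can approximate what the decomposition reports by brute force, grouping sampled inputs by their branch-condition signature. This sketch is a stand-in for the exact symbolic partition ImandraX computes; the sample grid is mine, not the paper's:

```python
def discount(customer, amount):
    if customer == "Premium":
        return 20 if amount > 100 else 10
    return 10 if amount > 100 else 0

# Group sampled inputs by branch signature (customer, amount > 100).
# Each group plays the role of a region: constraints -> invariant result.
regions = {}
for customer in ("Premium", "Standard"):
    for amount in (0, 50, 100, 101, 1000):
        key = (customer, amount > 100)
        regions.setdefault(key, set()).add(discount(customer, amount))
```

The grouping yields exactly 4 regions, and each region's result set is a singleton, which is the "invariant result" guarantee. The difference is that ImandraX proves this for all inputs, not a sample grid.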

The transportation risk manager example in the paper makes this concrete. When asked how many distinct routing scenarios exist for Sea transport under geopolitical tensions, LLMs estimated between 3 and 60. Region decomposition found 117. The gap comes from numeric threshold boundaries (cargo value, time criticality, route cost, route time) that splinter the state space far beyond what categorical reasoning predicts.

You get two complementary mechanisms for focusing decomposition. Side conditions restrict analysis to scenarios matching a boolean predicate (useful for regulatory or operational constraints). Basis functions tell ImandraX to treat certain functions as atomic during decomposition, reducing the number of regions and improving interpretability. Together, these let you zoom in on specific behaviors without drowning in the full state space.
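A side condition is easy to picture with the same toy discount function: restricting the decomposition to a boolean predicate collapses the region count to just the slice you care about. (Sketch only; the real mechanism operates symbolically, not over samples.)

```python
def discount(customer, amount):
    if customer == "Premium":
        return 20 if amount > 100 else 10
    return 10 if amount > 100 else 0

# Side condition: analyze only Premium orders (e.g. a regulatory slice).
def side_condition(customer, amount):
    return customer == "Premium"

regions = {}
for customer in ("Premium", "Standard"):
    for amount in (0, 50, 100, 101, 1000):
        if not side_condition(customer, amount):
            continue  # inputs outside the side condition are ignored
        key = (customer, amount > 100)
        regions.setdefault(key, set()).add(discount(customer, amount))
```

The full function has 4 regions; under the side condition, only the 2 Premium regions remain.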

The CodeLogician Server

Real software isn't a single function. The CodeLogician Server handles project-wide formalization through a metamodel architecture.

For Python projects, the PyIML strategy parses the codebase, infers dependency structure from imports, and formalizes code bottom-up in topological order. If module A depends on module B, module B gets formalized first. The dependency context (aggregated IML from all dependencies) is injected alongside each module's source code during formalization.

The minimal re-formalization strategy is good practical engineering. When a developer changes a file, the server computes a diff and re-formalizes only the changed module and its downstream dependents. It caches everything else. This makes continuous verification feasible on evolving codebases.
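Both ideas, bottom-up ordering and minimal re-formalization, reduce to standard dependency-graph operations. A minimal sketch with a hypothetical module graph (the module names are invented):

```python
from graphlib import TopologicalSorter

# Hypothetical dependency graph: each module maps to the modules it imports.
deps = {"app": {"orders", "fees"}, "orders": {"core"}, "fees": {"core"}, "core": set()}

# Bottom-up formalization order: dependencies come before dependents.
order = list(TopologicalSorter(deps).static_order())

# Minimal re-formalization: invert the graph, then walk downstream from
# the changed module to find everything that must be re-formalized.
reverse = {m: {d for d, ds in deps.items() if m in ds} for m in deps}

def dirty(changed):
    out, stack = set(), [changed]
    while stack:
        m = stack.pop()
        if m not in out:
            out.add(m)
            stack.extend(reverse[m])
    return out
```

Here `core` is formalized first and `app` last; a change to `fees` dirties only `fees` and `app`, so `core` and `orders` come straight from the cache.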

The server tracks four formalization levels for each module: Unknown (never formalized), Error during validation (admission failed), Admitted with opaqueness (admitted but contains opaque symbols), and Admitted transparent (fully formalized with no opaque elements). The goal is to push every module as high as possible on this ladder.

The Benchmark: code-logic-bench

The paper introduces code-logic-bench, a dataset of 50 state-machine models representing complex systems: BGP routing, NPC pathfinding, smart warehouse controllers, options market makers, nuclear reactor safety systems, TLS handshake protocols.

Each model has three questions probing different aspects of state machine understanding. Five frontier LLMs were evaluated across seven metrics:

State Space Estimation Accuracy (mean: 0.186). The worst result. LLMs are terrible at combinatorial explosions. They apply simple categorical reasoning and miss how numeric thresholds multiply the state space. In the transportation risk manager example, LLMs guessed 3 to 60 scenarios where 117 existed.

Outcome Precision (mean: 0.613). LLMs frequently rely on qualitative descriptions ("temperature must not be too high") instead of exact constraints ("temperature < 620°C").

Direction Accuracy (mean: 0.635). LLMs confidently reach wrong conclusions about safety properties. In one data protection system, formal verification proved a privacy property always holds, but the LLM incorrectly concluded it was violable.

Coverage Completeness (mean: 0.490). LLMs identified fewer than half of actual decision scenarios. They reason at categorical levels while missing how constraint combinations multiply scenarios.

Control Flow Understanding (mean: 0.746). The strongest LLM metric, and the only one above 0.7. LLMs are decent at interpreting branching logic, condition precedence, and short-circuit behavior.

Edge Case Detection (mean: 0.597). LLMs found fewer than 60% of edge cases. They focus on obvious scenarios and miss counterintuitive interactions.

Decision Boundary Clarity (mean: 0.695). LLMs partially succeed at identifying relevant constraints but struggle with exact boundary conditions (the difference between >= and >, specific threshold values, logical combinations).

Claude Opus 4.5 scored highest overall (0.601), followed by GPT-5.2 (0.589). All models sit in the 0.53-0.60 range, well below the 1.0 achieved with formal augmentation. The 41-47 percentage point gap is consistent across models, which suggests architectural limitations rather than model-specific weaknesses.

Case Studies

The paper presents three industrial case studies showing what formal verification catches in practice.

London Stock Exchange GTT Orders

The LSE specification (MIT201) states that Good Till Time orders expiring during an auction call phase must not expire until after uncrossing completes. CodeLogician modeled this as a state machine and expressed the critical property as a conflict detection predicate: auction uncross and order expiry should never happen simultaneously.

ImandraX refuted the property and produced a concrete counterexample. The bug: the auction protection logic extends order expiry to uncross_at + 1, but a market order extension feature delays the actual uncross to uncross_at + extension_period. When extension_period = 1, both events fire on the same tick.

This is a feature interaction bug. Both features work correctly in isolation. Their interaction was never properly considered. The specific timing conditions required to trigger it make it nearly impossible to catch with traditional testing.

LSE Fee Schedule

CodeLogician proved two invariants about the fee calculation logic and then used region decomposition to reveal 6 distinct behavioral regions in the execution fee function.

The analysis found "crossover notionals," trade sizes below which the minimum charge floor dominates the calculated fee. A £100 trade at the standard 0.45bp rate yields a calculated fee of £0.0045, but the minimum floor of £0.11 applies, giving an effective rate of 11bp. That's roughly 24 times the stated rate. These crossover points vary by pricing scheme and are the kind of boundary condition that humans miss because the interaction between tiered pricing, minimum floors, rebates, and surcharges creates unexpected dead zones.
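The arithmetic is easy to check directly. A minimal sketch using the quoted rate and floor (the fee schedule has more structure than this; only these two parameters are from the text):

```python
RATE_BP = 0.45   # stated execution fee rate, in basis points
MIN_FEE = 0.11   # minimum charge floor, in GBP

def execution_fee(notional):
    # Calculated fee with the minimum floor applied.
    return max(notional * RATE_BP / 10_000, MIN_FEE)

def effective_rate_bp(notional):
    return execution_fee(notional) / notional * 10_000

# Crossover notional: the trade size at which the floor stops dominating.
crossover = MIN_FEE / (RATE_BP / 10_000)   # about £2,444
```

Below roughly £2,444 the floor dominates, and the effective rate climbs as the trade shrinks: at £100 it is 11bp, about 24 times the stated 0.45bp.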

Multilateral Netting Engine

CodeLogician found two bugs in a CCP-style netting engine. First, an input validation gap where negative trade amounts violated the netting efficiency requirement. Second, IEEE 754 floating-point errors that caused the zero-sum conservation invariant to fail. Summing 0.1 ten times in floating-point gives 0.9999999999999999, not 1.0. In high-frequency clearing with thousands of trades per second, these errors accumulate to material discrepancies.

The fix required migrating from Python's float to Decimal for arbitrary-precision arithmetic. After the fix, CodeLogician re-verified the implementation and proved all three verification goals: exact zero-sum conservation, tolerance-based zero-sum, and netting efficiency.
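The drift and the fix are both two lines of Python:

```python
from decimal import Decimal

# IEEE 754 binary floats cannot represent 0.1 exactly, so the sum drifts.
float_sum = sum([0.1] * 10)        # 0.9999999999999999, not 1.0

# Decimal carries out the arithmetic exactly in base 10.
exact_sum = sum(Decimal("0.1") for _ in range(10))
```

In a netting engine, that last fraction of a penny per operation is exactly the error that breaks a zero-sum conservation invariant at scale.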

Connecting this to the Rust/Lean 4 pipeline

Now I want to connect the CodeLogician ideas to the vericoding workflow I care about most: verified Rust code through the Lean 4 pipeline.

The state-of-the-art for verifying Rust in Lean 4 is the Aeneas toolchain. Charon hooks into the Rust compiler, extracts MIR (Mid-level Intermediate Representation), simplifies it into LLBC (Low-Level Borrow Calculus), and Aeneas translates that into pure functional Lean 4 definitions. This translation is deterministic and compiler-driven. No LLM guessing. The Rust borrow checker eliminates memory reasoning for safe code, so Aeneas translates state mutations into pure functional value-passing. You verify the math. Rust guarantees the memory.

Aeneas solves the modeling problem (turning a Rust fn into a Lean def). It does nothing about the specification problem (writing the theorem or lemma that defines what the function should do).

Writing specs is the hard part. An LLM asked to write a spec for a Lean 4 function falls into one of three traps: tautological specs (proving f(x) = f(x)), missing edge cases (recall the 0.186 state space estimation score), or hallucinated boundaries (guessing x > 0 when the code does x >= 0).

CodeLogician's region decomposition gives us a way out.

Step 1: Extract the model deterministically. Run Rust through Charon/Aeneas to get perfect Lean 4 definitions. No LLM in this step.

Step 2: Run an automated reasoning tool over the code to compute the region decomposition. This is ImandraX, Z3, or another SMT solver. The output is a finite set of symbolic regions with exact constraints and invariant outputs.

Step 3: Feed the regions to an LLM for spec generation. Instead of asking the LLM to figure out the state space from scratch, you give it the mathematical truth: "The solver found 14 regions. Region 1 is when customer = Premium and amount > 100, and the invariant is 20. Region 2 is... Write a Lean 4 theorem for each of these 14 regions."
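For the discount example, one region translates into one theorem. A hypothetical Lean 4 sketch, assuming an `Order` structure and `discount` definition produced by the extraction step (all names here are assumptions, and the closing tactic may need adjustment):

```lean
-- One theorem per solver-reported region. Region 1: Premium customer,
-- amount > 100, invariant result 20.
theorem discount_premium_large (o : Order)
    (h₁ : o.customer = Customer.premium) (h₂ : o.amount > 100) :
    discount o = 20 := by
  simp [discount, h₁, h₂]
```

The LLM's job shrinks to transcription: the constraints and the invariant come from the solver, and only the Lean syntax is generated.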

This eliminates the LLM's weakness (state space estimation, boundary precision) and plays to its strength (translating structured logical concepts into Lean 4 syntax).

Step 4: Use the LLM to write tactic proofs. The LLM writes Lean 4 tactics to prove the generated theorems. Hallucinations in the proof step don't matter because Lean's type checker is an absolute judge. Either the proof checks or it doesn't.

You'd also want a counterexample-driven feedback loop. The LLM drafts a candidate spec. Before spending compute on tactic proof, you run a bounded model checker to search for counterexamples. If it finds one (say, balance = -1 when the spec claims the balance never goes below zero), you feed that exact state back to the LLM. It adjusts the spec (adds a precondition like requires initial_balance > 0). This stops the LLM from wasting time proving false specifications, which is a massive compute sink in current vericoding agent setups.
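The loop's cheap first stage can be sketched in a few lines: exhaustively explore all operation sequences up to a small depth and return the first spec violation as a concrete counterexample. This toy models an account with no withdrawal floor, so the candidate spec "the balance never goes negative" is falsifiable; everything here is invented for illustration.

```python
from itertools import product

def apply_ops(balance, ops):
    # Toy state machine: signed deposits/withdrawals, no floor check.
    for op in ops:
        balance += op
    return balance

def bounded_check(spec, depth=3, values=(-1, 0, 1)):
    """Search all op sequences up to `depth` for a spec violation."""
    for n in range(1, depth + 1):
        for ops in product(values, repeat=n):
            if not spec(apply_ops(0, ops)):
                return ops  # concrete counterexample to feed back to the LLM
    return None

cex = bounded_check(lambda b: b >= 0)
```

A single withdrawal drives the balance to -1, so the checker returns that trace immediately, long before any compute is spent attempting a tactic proof of a false theorem.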

CodeLogician vs Aeneas

CodeLogician and the Aeneas pipeline make different tradeoffs.

CodeLogician (Python to ImandraX) uses LLM-driven heuristic translation. The LLM translates Python to IML, and it can make mistakes, hallucinate bounds, or oversimplify control flow. The reasoning engine is automated: SMT-based, excellent at finding concrete edge cases and computing state-space boundaries. It works well on high-level business logic but struggles with low-level memory and pointers.

Aeneas (Rust to Lean 4) uses compiler-driven deterministic translation. No hallucinated code semantics. The reasoning engine is interactive: requires tactic proofs, harder to automate, but allows infinitely deep mathematical proofs. It only works on a subset of safe Rust. Unsafe blocks, interior mutability, and complex concurrency are currently unsupported.

For business logic, state machines, and routing rules, ImandraX wins because it automatically enumerates all edge cases. For systems software, cryptography, or critical infrastructure, the Rust + Aeneas + Lean 4 stack is stronger. It removes the LLM from the sensitive translation step and relegates it to writing proofs, where hallucinations get caught by the type checker.

The best pipeline would combine both: deterministic model extraction via Aeneas, automated state exploration via SMT/region decomposition, LLM-driven spec and proof generation with Lean's type checker as the final arbiter.

SpecLogician

The paper's future work section mentions SpecLogician, a complementary project focused on synthesizing specifications from source code, execution traces, unit tests, and system logs. The idea is to shift from analyzing individual code fragments to building collections of specifications from real-world artifacts.

In a Rust/Lean 4 context, you'd have the LLM look at Rust unit tests and execution traces to infer programmer intent, write candidate Lean 4 specs based on that intent, and then check whether the Aeneas-extracted model satisfies them. This is where the field is heading: closed-loop workflows where specifications, implementations, and empirical artifacts co-evolve under formal constraints.

So what

No amount of prompt engineering will give you a proof. The CodeLogician paper makes this point with data: a 41-47 percentage point gap between LLM-only reasoning and formally augmented reasoning, consistent across all tested frontier models. LLMs are good at translation and bad at reasoning. Automated reasoning engines are the opposite: good at reasoning, terrible interfaces.

The neuro-symbolic split, where LLMs handle the interface and formal engines handle the reasoning, gets the division of labor right. For anyone working with Lean 4 on the vericoding problem, the specific lesson is: don't ask LLMs to write specs from scratch. Run the solver first to get the full picture, then let the LLM fill in the syntax.

The tools aren't integrated yet. Nobody has a single button that runs Rust through Charon/Aeneas, computes region decomposition via an SMT solver, generates Lean 4 specs from the regions, and writes tactic proofs in a feedback loop. But every piece of this pipeline exists today. Assembling it is an engineering problem, not a research problem.

That's the most exciting part for me!