
The Missing Layer in AI-Resolved Prediction Markets

AI judges could fix prediction markets, but only if their decisions are reproducible.

Last week, a16z published a proposal for using LLMs as prediction market judges.

The pitch: lock a specific model and prompt into the blockchain at contract creation, let traders inspect the full resolution mechanism before betting, then run it at resolution. The goal is to eliminate human bias and the problems that arise from token-based dispute resolution.

There's just one problem the proposal glosses over: LLMs aren't designed to give the same answer twice.

The Resolution Bottleneck

Resolution has become the chokepoint for prediction markets at scale.

In their article, a16z cites multiple markets where resolution devolved into scandal:

  • The Venezuela election market, which saw over $6M in volume before devolving into accusations of biased resolution when observers alleged fraud and the government declared the opposite result.
  • The Zelensky suit market, which attracted $200M in bets on whether Ukraine's president would wear a suit to a NATO summit. At resolution, UMA token holders flipped the outcome from "Yes" to "No" despite news coverage describing his attire as a suit, leading traders to cry foul and sparking heated debate over what counts as a "suit".
  • A Ukraine territorial control contract that specified resolution based on a particular online map; someone allegedly edited the map to influence the outcome.

Human committees have conflicts of interest. Token-based voting systems like UMA have whale problems and credibility issues when large holders vote on contracts they've bet on — even if they vote fairly, the optics undermine trust.

Thus, as any good VC would, a16z proposes to bring AI in. As mentioned, their idea is, at contract creation, a specific LLM and prompt would be locked into the blockchain. Traders could inspect the full resolution mechanism before betting — the model, the prompt, the information sources. If they don't like the setup, they don't trade. At resolution, the committed model runs with the committed prompt and produces a judgment. No rule changes mid-flight, no discretionary calls.
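As a minimal sketch of what "locking in" a resolution mechanism could look like, here's a hypothetical Python example. The model identifier, prompt, and source list are all illustrative, and this is one plausible commitment scheme, not a16z's actual design:

```python
import hashlib
import json

def commit_resolution(model_id: str, prompt: str, sources: list) -> str:
    """Hash the full resolution spec so it can be stored onchain and
    verified later. Canonical JSON keeps the hash order-independent."""
    spec = json.dumps(
        {"model": model_id, "prompt": prompt, "sources": sources},
        sort_keys=True,
    )
    return hashlib.sha256(spec.encode()).hexdigest()

# At contract creation, this commitment is recorded onchain.
commitment = commit_resolution(
    "example-llm-v1",                          # hypothetical model ID
    "Did the incumbent win the election?",     # hypothetical prompt
    ["https://example.com/official-results"],  # hypothetical source
)

# At resolution, anyone can recompute the hash and confirm that the
# model, prompt, and sources being used are the committed ones.
assert commitment == commit_resolution(
    "example-llm-v1",
    "Did the incumbent win the election?",
    ["https://example.com/official-results"],
)
```

The commitment guarantees that the resolution *spec* can't change after traders bet; as the next section argues, it does nothing to guarantee that running the spec twice produces the same answer.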

The benefits are real. LLMs resist manipulation better than human committees — you can't easily bribe a model or edit its weights after commitment. They're transparent in a way governance can't match. And they have no financial stake in outcomes, eliminating the conflict-of-interest problem that plagues token voting. To be clear, a16z isn't proposing to remove humans entirely — they acknowledge the need for ongoing governance around which models to trust, how to handle obvious errors, and when to update defaults.

But here's where the proposal runs into trouble.

The Reproducibility Gap

Run the same prompt through any major model with identical settings, even at temperature zero, and you can get different outputs. That's simply how modern inference works.

Why? It comes down to how GPUs process information. When you run a model, thousands of calculations happen in parallel, and floating-point arithmetic isn't associative: the order in which partial results get combined can vary from run to run, and those tiny variations compound into different final outputs. We've all witnessed this, and for chatbots it's irrelevant. It doesn't matter if your article summary reads slightly differently each time; if anything, it adds variety. But for determining who gets paid on a $200M market, that's obviously a different story. In theory, the losing party could re-run the exact same prompt and get the opposite answer.
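The non-associativity at the root of this is easy to demonstrate without a GPU. In pure Python:

```python
# Floating-point addition is not associative: grouping the same three
# numbers differently gives different answers, because intermediate
# results get rounded to the nearest representable double.
a, b, c = 1e16, -1e16, 1.0

left = (a + b) + c   # a and b cancel exactly first, then add 1.0
right = a + (b + c)  # the 1.0 is lost: 1e16 can't represent a +1 change

print(left, right)   # prints: 1.0 0.0
```

A GPU summing thousands of partial products in whatever order its threads happen to finish is doing exactly this at scale. In an LLM, a difference in the last bit of a logit can flip which token wins, and from that point the two outputs diverge entirely.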

Now what?

The a16z proposal assumes that locking a model and prompt produces verifiable, auditable resolution. But if someone disputes the outcome and re-runs the same model with the same inputs, they might get a different result. And if the markets above tell us anything, it's that slight nuances can have a significant impact.

As a result, the "transparency" benefit of adding AI evaporates because there's no canonical answer to audit against.

EigenAI's Deterministic Inference

This week, EigenAI published a whitepaper claiming bit-exact reproducibility on production GPUs: 100% match rate across 10K test runs, with minimal slowdown to inference speed.

How they achieve it comes down to controlling every layer of the stack — locking down all the places where variability creeps in.

At the hardware layer, anyone running or verifying inference must use identical GPU models. Since different chip architectures produce different results for the same calculations, even when running the same code, standardizing hardware becomes the first requirement.

At the software layer, Eigen replaces the default math libraries that GPUs use to run calculations with custom versions that enforce strict ordering. The default libraries prioritize speed over consistency; EigenAI's versions sacrifice a small amount of performance to guarantee identical outputs every time.

The result: given identical inputs, the output is a pure function. Run it a thousand times, get identical results.

To make this useful for prediction markets or any disputed AI output, EigenAI pairs deterministic inference with a verification system whose design borrows from optimistic rollups. The party running inference publishes encrypted results. Results are accepted by default but can be challenged during a dispute window. If challenged, independent verifiers re-execute the inference inside secure hardware enclaves. Because execution is deterministic, verification becomes simple: do the results match?

If they don’t, the mismatch will trigger slashing — economic penalties drawn from bonded stake. The original party loses money; the challenger and verifiers get paid. Privacy stays intact throughout: prompts remain encrypted, with decryption only happening inside verified secure environments during disputes.
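Once execution is deterministic, the whole dispute flow collapses to a hash comparison plus an economic rule. Here's a toy Python sketch of that logic; all names are hypothetical and this is not EigenAI's actual API:

```python
from dataclasses import dataclass

@dataclass
class Claim:
    """A published inference result awaiting its dispute window."""
    output_hash: str  # hash of the (encrypted) published output
    bond: float       # stake forfeited if the claim proves wrong

def resolve_challenge(claim: Claim, reexecuted_hash: str) -> dict:
    """Deterministic re-execution makes verification a pure comparison:
    matching hashes uphold the claim; a mismatch slashes the bond."""
    if reexecuted_hash == claim.output_hash:
        return {"upheld": True, "slashed": 0.0}
    # Mismatch: slash the claimant; challenger and verifiers get paid.
    return {"upheld": False, "slashed": claim.bond}

claim = Claim(output_hash="abc123", bond=100.0)
print(resolve_challenge(claim, "abc123"))  # claim upheld, nothing slashed
print(resolve_challenge(claim, "def456"))  # mismatch, full bond slashed
```

The key point is that none of this works without determinism: if honest re-execution could legitimately produce a different hash, a mismatch wouldn't prove anything, and slashing would punish noise rather than fraud.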

Where Else This Matters

Prediction markets are the clearest use case, but they're not the only one.

ERC-8004 launched Thursday, bringing its first two registries, Identity and Reputation, online. The third, the Validation Registry that will coordinate third-party verification of agent work, is still under development but coming soon.

The Validation Registry is designed to be flexible. It will support multiple verification methods: ZK proofs, TEE attestation, human judges, or stake-secured re-execution where validators reproduce a computation and compare outputs. The registry itself is just a coordination layer — it records that a validator checked something and what they concluded, without mandating how they reached that conclusion.


For most of these methods, reproducibility is irrelevant. ZK proofs verify that a computation was performed correctly without re-running it. TEE attestation proves that specific code ran in a secure environment. Neither requires the underlying inference to be deterministic.

That said, for high-stakes operations — an agent managing significant capital, for instance — re-execution-based validation could add an extra layer of assurance. In those cases, builders would hit the same wall as prediction markets: without deterministic inference, you can't distinguish between an agent that “cheated” and one that simply got a different result from non-deterministic execution.

Solutions like EigenAI's would slot in here, enabling re-execution-based validation as one option among many. It's not a requirement for ERC-8004 to function, but for certain use cases, it could matter.

The Emerging Pattern

Overall, a16z’s idea of LLM judges is sound — transparent, neutral, resistant to manipulation. But without reproducibility, the proposal lacks the verification layer that would make it trustworthy at scale.

EigenAI's whitepaper suggests this gap is solvable. Deterministic inference is achievable with the right constraints: standardized hardware, custom libraries, controlled execution environments. The tradeoffs are manageable — a small performance hit for the ability to actually audit what an AI did.

For prediction markets specifically, this could solve one of the sector's core issues: lock in not just the model and prompt, but the infrastructure guaranteeing that anyone can re-run the resolution and get the same answer. Until that layer exists, though, it's worth thinking twice before handing resolution over to the machines.



Written by David Christopher


David is a writer/analyst at Bankless. Prior to joining Bankless, he worked for a series of early-stage crypto startups and on grants from the Ethereum, Solana, and Urbit Foundations. He graduated from Skidmore College in New York. He currently lives in the Midwest and enjoys NFTs, but no longer participates in them.
