Verifiable Distributed LLM Work: The State of the Art and a Solution
The Problem
You want to:
- Publish work = (prompt, required model)
- Workers execute = run prompt with specified model
- Submit results = output + cryptographic proof that the specified model produced it
- Verification = anyone can confirm without re-running
The Three Families of Approaches
1. Zero-Knowledge Proofs (zkML) — Perfect but Slow
The dream: a mathematical proof that a specific neural network produced a specific output.

| System | Max Model | Proof Time | Overhead |
|---|---|---|---|
| ZKTorch | GPT-J 6B | 23 min (64 threads) | 3,500x |
| zkLLM | LLaMA-2 13B | 15 min (A100) | 500,000x |
| zkPyTorch | Llama-3 8B | 150s/token (1 CPU) | 10,000x |
| DeepProve | GPT-2 124M | 54-158x faster than EZKL | Unknown |
| EZKL | Small models | Minutes | 100-1000x |
2. Trusted Execution Environments (TEEs) — Fast but Breakable
NVIDIA H100 confidential computing runs LLMs with <7% overhead (approaching 0% for 70B+ models). Hardware attestation proves firmware integrity via a fuse-burned ECC-384 key.

But: the TEE.Fail attack (October 2025) broke Intel TDX, AMD SEV-SNP, and NVIDIA CC attestation with a <$1,000 DDR5 bus interposer. Researchers forged attestation quotes indistinguishable from legitimate ones. Intel and AMD consider physical interposer attacks “out of scope” and have no planned fixes.

Worse: GPU attestation measures firmware, not model weights. Application-layer extensions can hash weights into measurement registers, but this requires trusting the inference framework code — turtles all the way down.

TEEs reduce to physical security, not cryptographic security. Fine for cloud providers with locked cages. Not “unstoppable.”
3. Optimistic + Economic Verification — Practical and Scalable
The breakthrough insight: don’t prove every computation. Make fraud unprofitable. Key developments:
- Deterministic inference is now solved. Thinking Machines Lab showed that batch-invariant CUDA kernels produce 1000/1000 identical outputs for Qwen-3-235B at temperature 0 (1.6x slowdown). SGLang (LMSYS, Sep 2025) ships this. Verification becomes a byte-equality check.
- EigenAI (Jan 2026): Workers stake capital. Results are tentative for a challenge window. Challengers re-execute deterministically inside TEEs. Disagreement = slashing. 100% bit-identical across 10,000 runs on the same hardware.
- Hyperbolic PoSP: Game-theoretic Nash equilibrium. If challenge probability > (fraud gain / slash amount), honest computation is the dominant strategy. <1% overhead. Adaptive sampling per node reputation.
- VeriLLM: Workers commit Merkle root of hidden-state tensors. VRF-based random sampling of intermediate states. Statistical tests distinguish hardware rounding from model substitution. ~1% verification cost.
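With deterministic kernels, verification really is just a digest comparison. A minimal sketch of that byte-equality check; `run_inference` and `fake_model` are hypothetical stand-ins for a batch-invariant, temperature-0 inference call:

```python
import hashlib

def output_digest(tokens: list[int]) -> str:
    """Commit to an output as a SHA-384 digest of its token IDs."""
    data = b"".join(t.to_bytes(4, "little") for t in tokens)
    return hashlib.sha384(data).hexdigest()

def verify_by_reexecution(claimed_tokens, run_inference, prompt, seed=0):
    """Byte-equality check: re-run deterministically, compare digests."""
    reproduced = run_inference(prompt, seed=seed)  # batch-invariant, temp=0
    return output_digest(reproduced) == output_digest(claimed_tokens)

# Toy stand-in for a deterministic model: any fixed function of (prompt, seed).
fake_model = lambda prompt, seed=0: [len(prompt), seed, 42]
assert verify_by_reexecution([5, 0, 42], fake_model, "hello")
assert not verify_by_reexecution([5, 0, 41], fake_model, "hello")
```

Any honest re-execution reproduces the exact token sequence, so a single digest mismatch is sufficient evidence for a challenge.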
The Solution: Layered Verification Protocol (LVP)
No single approach works alone. The “unstoppable” solution is a layered escalation protocol where verification cost is proportional to distrust.
How It Works
Work Publication:
- Worker stakes `stake_req`
- Downloads model weights, verifies `SHA384(weights) == model_hash`
- Runs inference with batch-invariant kernels + fixed seed (deterministic)
- Computes Merkle tree over hidden states at each transformer layer
- Submits:
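The Merkle commitment over per-layer hidden states can be sketched as follows. The hashing scheme here is illustrative, not VeriLLM's exact construction, and the "hidden states" are toy byte strings standing in for serialized activation tensors:

```python
import hashlib

def _h(data: bytes) -> bytes:
    return hashlib.sha384(data).digest()

def merkle_root(leaves: list[bytes]) -> bytes:
    """Merkle root over per-layer hidden-state serializations."""
    level = [_h(leaf) for leaf in leaves]
    while len(level) > 1:
        if len(level) % 2:  # duplicate the last node on odd-sized levels
            level.append(level[-1])
        level = [_h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

# One leaf per transformer layer's hidden-state tensor (toy bytes here).
hidden_states = [f"layer-{i}-activations".encode() for i in range(32)]
root = merkle_root(hidden_states)
assert len(root) == 48  # SHA-384 digest length

# Substituting any single layer's activations changes the root:
tampered = hidden_states[:]
tampered[7] = b"substituted-model-activations"
assert merkle_root(tampered) != root
```

The worker publishes only the 48-byte root; a challenger can later demand a logarithmic-size inclusion proof for any sampled layer.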
Model Identity: The Key Innovation
For open-weight models (Llama, Mistral, Qwen, etc.):

`model_hash = SHA384(canonical_weights_file)`

- A public registry maps model names to weight hashes (think: Hugging Face + content-addressable storage)
- Worker must demonstrate they loaded the exact weights
- Deterministic execution proves the committed model produced the output
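Concretely, the weight check is streaming SHA-384 over the canonical weights file and comparing against the registry entry. The `REGISTRY` dict below is a stand-in for a real content-addressed index:

```python
import hashlib
import io

def model_hash(weights_stream, chunk_size=1 << 20) -> str:
    """SHA-384 of the canonical weights file, streamed to bound memory."""
    h = hashlib.sha384()
    for chunk in iter(lambda: weights_stream.read(chunk_size), b""):
        h.update(chunk)
    return h.hexdigest()

# Hypothetical registry mapping model names to canonical weight hashes.
weights = b"\x00" * (3 << 20)  # toy 3 MiB "weights file"
REGISTRY = {"toy-model-3mb": hashlib.sha384(weights).hexdigest()}

assert model_hash(io.BytesIO(weights)) == REGISTRY["toy-model-3mb"]
# A single flipped or appended byte fails the check:
assert model_hash(io.BytesIO(weights + b"\x01")) != REGISTRY["toy-model-3mb"]
```

Canonicalization matters in practice: the registry must pin one serialization (file order, dtype, metadata) so every honest worker hashes identical bytes.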
For closed-weight API models:
- The API provider signs responses: `sign(provider_key, prompt_hash || response || model_version || timestamp)`
- Token-DiFR fingerprinting: regenerate with the same seed; >98% token match confirms the claimed model
- Provider reputation + legal accountability replaces cryptographic proof
- Future: providers run inside TEEs with attestation (Phala already does this with DeepSeek on OpenRouter)
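The provider-signature scheme above can be sketched with a symmetric HMAC as a stand-in; a real deployment would use an asymmetric scheme such as Ed25519 so anyone can verify with the provider's public key. All names and the field delimiter here are illustrative:

```python
import hashlib
import hmac

def sign_response(provider_key: bytes, prompt: str, response: str,
                  model_version: str, timestamp: int) -> bytes:
    """HMAC stand-in for sign(provider_key, prompt_hash || response || ...)."""
    prompt_hash = hashlib.sha384(prompt.encode()).hexdigest()
    msg = "|".join([prompt_hash, response, model_version, str(timestamp)])
    return hmac.new(provider_key, msg.encode(), hashlib.sha384).digest()

def verify_response(provider_key, prompt, response, model_version,
                    timestamp, sig) -> bool:
    expected = sign_response(provider_key, prompt, response,
                             model_version, timestamp)
    return hmac.compare_digest(expected, sig)

key = b"provider-secret"
ts = 1_700_000_000
sig = sign_response(key, "2+2?", "4", "toy-model-v1", ts)
assert verify_response(key, "2+2?", "4", "toy-model-v1", ts, sig)
assert not verify_response(key, "2+2?", "5", "toy-model-v1", ts, sig)  # tampered
```

Binding `model_version` and `timestamp` into the signed message is what lets a third party later prove which model the provider claimed to have run, and when.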
Why This Is “Unstoppable”
- No single point of trust. Hardware can be compromised (TEE.Fail). Software can be buggy. But the combination of cryptographic commitments + economic stakes + deterministic re-execution + zkML escalation has no single attack vector that defeats all layers.
- Economically rational honesty. At Layer 1, if `challenge_probability * slash_amount > fraud_gain`, the Nash equilibrium is honesty. No cryptography needed — just game theory.
- Cryptographic fallback exists. If you truly need mathematical certainty, Layer 4 (zkML) is available. ZKTorch can prove GPT-J 6B in 23 minutes today. GPU acceleration should bring this down further. For individual disputed operations, it’s seconds.
- Deterministic inference is production-ready. SGLang ships it. Thinking Machines proved it at 235B scale. The “outputs are non-deterministic” objection is no longer valid.
- Works for any model size. The default path (Layers 1-2) has <1% overhead regardless of model size. You only pay the zkML cost if someone actually disputes AND you can’t do byte-equality re-execution.
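The Layer 1 economics reduce to a single inequality per job, plus PoSP-style adaptive sampling by reputation. A sketch of the expected-value check; the parameter names and the reputation formula are illustrative, not any protocol's exact rule:

```python
def honesty_is_dominant(challenge_prob: float, slash_amount: float,
                        fraud_gain: float) -> bool:
    """Fraud has negative expected value when p_challenge * slash > gain."""
    return challenge_prob * slash_amount > fraud_gain

def adaptive_challenge_prob(base_prob: float, reputation: float) -> float:
    """Sample low-reputation nodes more often (reputation in [0, 1])."""
    return min(1.0, base_prob * (2.0 - reputation))

# A worker who could skim $10 per job while staking $5,000,
# facing a 1% random challenge rate:
assert honesty_is_dominant(challenge_prob=0.01, slash_amount=5_000, fraud_gain=10)
# With a tiny $500 stake the inequality flips and fraud pays:
assert not honesty_is_dominant(challenge_prob=0.01, slash_amount=500, fraud_gain=10)
# A brand-new node (reputation 0.0) is sampled at double the base rate:
assert adaptive_challenge_prob(0.01, 0.0) == 0.02
```

The design lever is clear from the inequality: raising the stake lets the network lower the challenge rate (and thus verification cost) without changing the equilibrium.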
What Exists Today vs. What Needs Building
| Component | Status | Who |
|---|---|---|
| Deterministic inference kernels | Production | SGLang, Thinking Machines |
| Weight commitment registry | Exists (Hugging Face hashes) | Needs formalization |
| Economic staking/slashing | Production | EigenLayer, Hyperbolic PoSP |
| Merkle tree over hidden states | Research prototype | VeriLLM |
| zkML for LLMs | Research (ZKTorch, zkLLM) | 6-13B proven |
| Commit-reveal protocol | Production | VeriLLM, Atoma Network |
| TEE attestation for inference | Production | Phala, Chutes |
Sources
- ZKTorch (arXiv) — 23-min proof for GPT-J 6B
- zkLLM (CCS 2024) — LLaMA-2 13B proving
- Definitive Guide to ZKML 2025
- NVIDIA H100 Confidential Computing
- TEE.Fail Attack — broke TEE attestation with $1K hardware
- EigenAI (arXiv) — deterministic optimistic verification
- Hyperbolic PoSP — game-theoretic verification
- VeriLLM — commit-reveal with Merkle proofs
- SGLang Deterministic Inference
- Thinking Machines: Defeating Nondeterminism
- Gensyn RepOps — bitwise reproducible GPU ops
- SPEX Statistical Proofs — handles non-determinism via LSH
- Phala Network GPU TEE — TEE inference in production
- Token-DiFR Fingerprinting — 98% token match for model ID
- Inference Labs on EigenLayer
- Chutes Confidential Compute
- Tolerance-Aware Verification — 0.3% overhead, no TEE needed