April 13, 2026 · Alex Potanin

WARNING: This blog post was fully generated by Claude Opus 4.6 with Extended High Effort thinking, so it contains mistakes and inaccuracies as expected. But it is funny enough to keep as it is.

I Spent $3,000 on Claude Trying to Prove a Theorem. My Student Did It Better in Two Days.

A case study on using Claude for mechanized proof development in Rocq (formerly Coq). Four attempts, ~$3,000 in API costs, 3,000+ compilation cycles — and a student who closed every remaining case in two days.

~$3,000 USD · 4 days · 5 Admitted lemmas remaining

Student: 0 Admitted · 2 days · clean proof

The Problem
What Claude Did: Four Attempts, Zero Clean Proofs
The Cost
What the Student Did
Why Claude Failed
What Claude Was Good At
Lessons
Conclusion

The Problem

We're building a mechanized proof in Rocq (formerly Coq) that a simulation relation between two styles of Rust borrow semantics is preserved under operand evaluation. The project formalizes how "borrow-centric" and "loan-centric" representations of mutable borrows can simulate each other — a key result for our partial-borrow research.

Most of the operand cases (pairs, sums, projections) fell to straightforward proof techniques. But one case — move — turned into a months-long grind. After a move writes VBot into the source path, you need to show the resulting state still satisfies the simulation relation. This requires reconstructing a two-hole decomposition of the state after an arbitrary write, and our formalization made that extremely hard.

I thought Claude could crack it. I was wrong.

What Claude Did: Four Attempts, Zero Clean Proofs

Attempt 1: Brute-force the existing representation

Claude's first instinct was to work directly with the fill_state2 / val_with_two_holes representation. After MMove writes VBot at some path, show the result still decomposes into a two-hole state context with updated fillings.

This hit a wall immediately. The val_updates function operates on concrete values — it has no concept of "holes." After a write, the two-hole structure is destroyed. There is no generic way to extract an updated hole context from the post-write state. Claude spent hundreds of compilation cycles trying different induction schemes, case splits, and auxiliary lemmas before I pulled the plug.

Attempt 2: Prove injectivity of `fill_val`

Claude's next idea was clever in theory: prove that hole-filling is injective (fill_val_vh_unique), which would let us "invert" the fill operation after a write. If you can prove the hole context is uniquely determined by the filled value, you can reconstruct it.

The proof was 90% done. But the cross-pair cases — where one context uses VHolePairLeft and the other uses VHolePairRight — needed a value-size argument to derive a contradiction. Claude tried at least a dozen formulations of this size argument. None compiled. The lemma stayed Admitted.

Attempt 3: Case-split everything

Claude went for exhaustive case analysis. Split the move proof into whether vl = vr (the move doesn't touch loan/borrow markers) versus a borrow migration case, then further split on S2HSame versus S2HDiff (are both holes in the same variable?).

This actually made progress. All the S2HDiff sub-cases went through, because when the holes are in different variables, the move only touches one variable and the other stays intact. But the S2HSame sub-cases — where both holes live in the same variable — hit the exact same wall as Attempt 1. Claude wrote ~1,200 lines of proof code. Two sub-cases remained stuck.

Attempt 4: Reinvent the representation

Finally, Claude proposed a path-based refactoring inspired by the Aeneas project. Define dp_set, sset, and a commutativity lemma dp_set_commute, then re-express the simulation relation using paths instead of holes. The key insight: move paths use structural projections (PiProj), while loan/borrow markers live behind PiUnloan/PiUnbor, so they're always disjoint and writes commute.

This was architecturally sound. Claude wrote ~1,000 lines of infrastructure in a new file DPSetCommute.v. But the bridge lemmas connecting the old hole-based relation to the new path-based relation were fragile, and the final integration — actually closing all the move sub-cases — never converged. Claude kept generating proof scripts that were close but never compiled cleanly.

The Cost

Over four days (April 3–7), I ran Claude in extended sessions, sometimes with multiple parallel agents. The token consumption was staggering:

Metric	Value
API costs	~$3,000 USD (mostly output tokens)
Wall-clock time	~4 days, including overnight runs
Compilation cycles	~3,000+ (compile-fix loop)
Result	5 `Admitted` lemmas remaining

The Baudart et al. paper on using Claude for Rocq proofs warned about this: "sharp diminishing returns" on hard problems, with unsolved problems consuming the most resources. I experienced it firsthand.

What the Student Did

Sasha (Aleksandr Pak), a student on the project, took over on April 10. In two days, he:

Took Claude's DPSetCommute.v infrastructure (the path-based operations and commutativity lemmas — the one genuinely useful thing Claude produced)
Simplified the bridge between leq_v2 and leq_v2_sp
Proved move_preserves_leq cleanly, closing all sub-cases including the S2HSame cases that had stumped Claude for days
Cleaned up the codebase, removing dead code from Claude's failed attempts

By April 12, the full theorem op_evals_preserves_leq_sp compiled with zero Admitted (aside from two intentional axioms). Sasha's approach was cleaner, shorter, and actually worked.

Why Claude Failed

1. Claude can't do mathematical insight

The move preservation proof requires understanding why disjoint paths commute at a deep structural level. Claude could articulate this insight in natural language (it literally wrote the PLAN.md describing the path-based approach), but couldn't translate that understanding into a working proof. The gap between "I see the proof strategy" and "I can execute it in Rocq" is enormous.

2. Claude is bad at recovery

When a proof attempt fails, Claude tends to try small variations of the same approach rather than stepping back to reconsider. I watched it try dozens of induction schemes for fill_val_vh_unique, each differing in minor ways, none addressing the core issue. A human prover would have abandoned that lemma after the third failure and looked for a different route.

3. The compile-fix loop is a trap

Claude's workflow is: write proof, compile, read error, fix, repeat. This works brilliantly for straightforward proofs (the paper reports 81% success on easy benchmarks). But for hard proofs, the error messages don't tell you that your approach is wrong — they tell you that this particular tactic failed. Claude dutifully fixes the tactic and hits the next wall, burning tokens without making progress. It's like watching someone navigate a maze by always turning right.

4. Token cost scales with difficulty, not with progress

Easy proofs cost pennies. Hard proofs cost hundreds of dollars. But the hard proofs are exactly the ones Claude is least likely to solve. You're paying the most money for the least value.

What Claude Was Good At

To be fair, Claude wasn't useless:

Infrastructure code: DPSetCommute.v was solid. Definitions, straightforward lemmas, boilerplate — Claude excels at these. Sasha built on top of Claude's infrastructure rather than rewriting it.
Exploration: Claude identified the path-based approach and wrote a clear plan. The architectural insight was correct even though Claude couldn't execute it.
Easy cases: All the non-move operand cases (copy, pair, sum wrapping) were proved by Claude without issues.
Documentation: Claude wrote excellent documentation of what it tried and why it failed (HISTORY.md), which helped Sasha understand the problem space.

Lessons

Don't use LLMs as a substitute for mathematical thinking. Use them as accelerators for the mechanical parts: boilerplate, straightforward lemmas, case analysis where the strategy is clear. When you hit a proof that requires genuine insight, hand it to a human.

Set a budget and a time limit. If Claude hasn't made progress after $500 and a day of attempts, it probably won't. The marginal return on additional tokens for hard proofs is near zero.

The documentation is the real output. Claude's failed proof attempts produced a detailed record of what doesn't work and why (HISTORY.md). This was arguably more valuable than the code it wrote — it saved Sasha from repeating the same mistakes.

The 80/20 rule applies hard. Claude solved 80% of our proof obligations quickly and cheaply. The remaining 20% consumed 95% of the budget and still needed a human. Plan accordingly.

Conclusion

Claude is an extraordinary tool for Rocq proof development — for the right problems. It turned weeks of boilerplate into hours. But it's not a mathematician. When a proof requires stepping back, seeing the forest, and finding a genuinely new path through the trees, you still need a human who can think.

My student is one of those humans. $3,000 poorer, I appreciate that more than ever.

I Spent $3,000 on Claude Trying to Prove a Theorem. My Student Did It Better in Two Days.

Contents

The Problem

What Claude Did: Four Attempts, Zero Clean Proofs

Attempt 1: Brute-force the existing representation

Attempt 2: Prove injectivity of fill_val

Attempt 3: Case-split everything

Attempt 4: Reinvent the representation

The Cost

What the Student Did

Why Claude Failed

1. Claude can't do mathematical insight

2. Claude is bad at recovery

3. The compile-fix loop is a trap

4. Token cost scales with difficulty, not with progress

What Claude Was Good At

Lessons

Conclusion

Attempt 2: Prove injectivity of `fill_val`