AI Smart Contract Auditing

By May 2026, AI-driven smart-contract security has split into parallel attack and defense tracks that run alongside (not in place of) human auditing. The audit cycle as it existed pre-2025 (pay $125K, wait 2–3 months, ship) is structurally obsolete. The Ethereum security stack must now be continuous because the attackers’ loop is continuous: state-of-the-art models find PhD-level multi-domain bugs, and the cost of running them halves roughly every 1.5 months. April 2026 was the highest-loss hacking month on record, and the losses came almost entirely from AI-augmented attacks.

Key Ideas

Audit-cycle-is-dead thesis

Riptide (Greg AI) (→ ETHPrague 2026 — Overview) crystallized the framing: traditional audits offer no guarantee of security, take months, cost six figures, and are obsolete the moment a contract is upgraded. The economically viable replacement is continuous AI audit — heavyweight scans before mainnet pushes, lightweight scans after every commit. Greg AI’s public bug-bounty findings against Lido, Chainlink, Aave, Uniswap, Reserve, ENS, Polygon, Oiler establish empirical validation: 100% AI-found, paid out on live deployments.

The remaining role of human auditors: business-logic and architectural review at the top of the stack.

Cost-of-intellect halving every 1.5 months

Eigor Gulamov / 7chat (Beyond Human Review, ETHPrague) produced the strongest quantitative claim: along each iso-intellectual line (e.g., o1-equivalent capability), unit price halves every ~1.5 months. Implication: nine months is six halvings, so an o1-equivalent model that cost X to find a vulnerability in February 2025 costs about X/64 to do the same nine months later. State-of-the-art capability becomes a cheap commodity within a year, then gets repurposed for cheap pre-processing (chunking, marking, slicing) rather than reasoning.
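The halving claim can be checked with a few lines of arithmetic. This is a minimal sketch that assumes a strict exponential with a 1.5-month half-life; the function name and default are illustrative, not from the talk.

```python
def relative_cost(months_elapsed: float, halving_period_months: float = 1.5) -> float:
    """Fraction of the original price left after `months_elapsed`.

    Assumes the price halves exactly once per `halving_period_months`.
    """
    return 0.5 ** (months_elapsed / halving_period_months)


if __name__ == "__main__":
    for months in (1.5, 3, 6, 9, 12):
        factor = relative_cost(months)
        # Print the price as a fraction of the starting price X.
        print(f"{months:>4} months: X/{1 / factor:.0f}")
```

Under this assumption, nine months corresponds to six halvings (a factor of 64) and twelve months to eight halvings (a factor of 256).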

This is the structural reason black-hat economics now work. Pre-2025, finding a bug worth a $1M bounty required a Lazarus-level adversary, because the inference cost was prohibitive. Now the incentives favor the attacker: a black hat can spend $100K of inference to drain $1M, while a white hat is paid $30K for the same bug. Until that cost asymmetry resolves, attackers dominate the pure-discovery layer.
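The asymmetry can be put in numbers. The sketch below uses the figures from the paragraph above and adds one assumption that is not in the source: that the white hat must spend the same $100K of inference to find the bug. It also estimates how long the 1.5-month halving would take to bring that cost under the bounty, again assuming the halving rate holds exactly.

```python
import math


def net_payoff(reward_usd: float, inference_cost_usd: float) -> float:
    """Reward minus the inference spend needed to find the bug."""
    return reward_usd - inference_cost_usd


def months_until_affordable(
    current_cost_usd: float,
    budget_usd: float,
    halving_period_months: float = 1.5,
) -> float:
    """Months until a cost that halves every period fits within the budget."""
    if current_cost_usd <= budget_usd:
        return 0.0
    return halving_period_months * math.log2(current_cost_usd / budget_usd)


if __name__ == "__main__":
    inference = 100_000  # assumed equal for both sides (not stated in the source)
    print("black hat net:", net_payoff(1_000_000, inference))
    print("white hat net:", net_payoff(30_000, inference))
    print("months to break even:", round(months_until_affordable(inference, 30_000), 1))
```

On these assumptions the white hat loses money on the bug today, and the gap closes only as inference gets cheaper.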

Cross-domain bugs are the new frontier

Gulamov’s observation: hacks since late 2025 increasingly require multi-domain expertise, combining PhD-level math, low-level assembly, and cryptoeconomic understanding in a single exploit. Examples: the Balancer hack, the Bunni hack, cross-domain console-log exploits. Before AI, this combination was structurally rare (“Lazarus doesn’t have PhD-level mathematicians at scale”). AI removed the bottleneck.

Benchmarking AI auditors

CTF Bench (7chat, February 2025): measures both an AI auditor’s ability to find smart-contract vulnerabilities and its noise level (the ratio of false positives to real findings). Without noise control, finding 7/7 vulnerabilities is worthless if each signal report is buried under 1,000 slop reports. CTF Bench was first fully solved in June 2025 with Gemini 2.5 Pro, and later with GPT-5.5.
EVM Bench (Paradigm): open-source and fast (minutes), but found only 4/7 of the CTF Bench issues.
Agent Lisa (Pessimistic): research auditor; misses cases like the transfer-function-with-vulnerability example.
Important methodology: each auditor must be evaluated on signal-to-noise, not raw recall. A scanner that flags every line as a bug “finds 100% of bugs” while being worthless.
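The signal-to-noise point can be made concrete with a small scoring helper. This is a sketch, not CTF Bench's actual scoring code: the function and field names are hypothetical, and findings are reduced to string IDs so that set arithmetic is enough.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class AuditScore:
    recall: float            # share of known bugs that were reported
    precision: float         # share of reports that are real bugs
    noise_per_signal: float  # false reports per true report


def score_auditor(reported: set[str], known_bugs: set[str]) -> AuditScore:
    """Score one auditor run against a ground-truth set of bug IDs."""
    true_pos = reported & known_bugs
    false_pos = reported - known_bugs
    recall = len(true_pos) / len(known_bugs) if known_bugs else 0.0
    precision = len(true_pos) / len(reported) if reported else 0.0
    noise = len(false_pos) / len(true_pos) if true_pos else float("inf")
    return AuditScore(recall, precision, noise)


if __name__ == "__main__":
    known = {f"bug-{i}" for i in range(7)}
    # A scanner that finds everything but drowns it in slop.
    noisy = known | {f"slop-{i}" for i in range(7000)}
    # A quieter scanner that misses some bugs.
    quiet = {"bug-0", "bug-1", "bug-2", "bug-3", "slop-0"}
    print("noisy:", score_auditor(noisy, known))
    print("quiet:", score_auditor(quiet, known))
```

The noisy scanner reaches full recall with 1,000 false reports per real one, which is the failure mode described above; ranking on recall alone would put it first.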

The 7chat architecture

Gulamov’s stack: source code + docs → RAG chunking → wide research per chunk (one angle per pass, then aggregate) → cross-reference against a curated database of 6,500 distilled vulnerabilities (mined from 20,000 historical) → critic sub-agent filters >95% of hypotheses → de-duplication → report. The vulnerability database mimics how senior auditors work: pattern-matching new contracts against historically exploited ones. Cross-referencing against it increases recall by 20–50%. Reported noise level: about one report per 200 lines of code (ultra-quiet).
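The stages above can be sketched as a plain function pipeline. This is an illustrative skeleton, not 7chat's implementation: `research` and `critic` are hypothetical callables standing in for the model calls (the research step is assumed to include the database cross-reference), and the toy versions below use string matching only so that the example runs.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Hypothesis:
    chunk_id: int
    pattern: str   # name of the matched historical vulnerability pattern
    evidence: str


def chunk_source(source: str, lines_per_chunk: int = 50) -> list[str]:
    """Split source text into fixed-size line chunks (stand-in for RAG chunking)."""
    lines = source.splitlines()
    return ["\n".join(lines[i:i + lines_per_chunk])
            for i in range(0, len(lines), lines_per_chunk)]


def run_pipeline(
    source: str,
    angles: list[str],
    research: Callable[[int, str, str], list[Hypothesis]],
    critic: Callable[[Hypothesis], float],
    threshold: float = 0.9,
) -> list[Hypothesis]:
    # Wide research: one pass per (chunk, angle), then aggregate.
    hypotheses: list[Hypothesis] = []
    for chunk_id, chunk in enumerate(chunk_source(source)):
        for angle in angles:
            hypotheses.extend(research(chunk_id, chunk, angle))
    # Critic pass: keep only hypotheses the critic scores above the threshold.
    kept = [h for h in hypotheses if critic(h) >= threshold]
    # De-duplicate: one report per (chunk, pattern) pair.
    unique: dict[tuple[int, str], Hypothesis] = {}
    for h in kept:
        unique.setdefault((h.chunk_id, h.pattern), h)
    return list(unique.values())


def toy_research(chunk_id: int, chunk: str, angle: str) -> list[Hypothesis]:
    """Toy stand-in for a model call; flags an external call on every pass."""
    found = []
    if "call{value" in chunk:
        found.append(Hypothesis(chunk_id, "reentrancy", f"angle: {angle}"))
    if angle == "arithmetic":
        found.append(Hypothesis(chunk_id, "overflow", "speculative"))
    return found


def toy_critic(h: Hypothesis) -> float:
    """Toy stand-in for the critic sub-agent."""
    return 0.95 if h.pattern == "reentrancy" else 0.1


if __name__ == "__main__":
    contract = (
        "function withdraw() {\n"
        '  msg.sender.call{value: bal}("");\n'
        "  balances[msg.sender] = 0;\n"
        "}"
    )
    angles = ["access-control", "arithmetic", "reentrancy"]
    for report in run_pipeline(contract, angles, toy_research, toy_critic):
        print(report)
```

Keeping the critic and the de-duplication as separate stages is what lets the research step stay deliberately wide without flooding the final report.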

Coming developments

Agentic loop — Models writing experiments, executing them, getting feedback. Pre-2026 this was theoretically possible but cost millions in inference. Now feasible with Minimax M2.7 / similar tier.
Formal verification integration — AI agents writing formal proofs in Certora, Halmos. Reduces inference cost and hallucinations dramatically.
Model orchestration tiering — Expensive PhD-level model for hard reasoning; cheap commodity model for marking/slicing/chunking. The gap between SOTA and commodity is now wide enough that orchestration is the bottleneck.
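Model tiering can be sketched as a simple router. The task names follow the text above; the tier names and per-token prices are hypothetical placeholders chosen only to show the cost gap, not real pricing.

```python
from __future__ import annotations

# Pre-processing work the text assigns to the cheap tier.
CHEAP_TASKS = {"chunking", "marking", "slicing"}


def pick_tier(task_kind: str) -> str:
    """Route pre-processing to the commodity tier, everything else to frontier."""
    return "commodity" if task_kind in CHEAP_TASKS else "frontier"


def estimate_cost(
    tasks: list[tuple[str, int]],
    price_per_1k_tokens: dict[str, float],
) -> float:
    """Total cost for (task_kind, token_count) pairs under tiered routing."""
    return sum(
        tokens / 1000 * price_per_1k_tokens[pick_tier(kind)]
        for kind, tokens in tasks
    )


if __name__ == "__main__":
    prices = {"commodity": 0.001, "frontier": 0.05}  # hypothetical USD per 1K tokens
    workload = [("chunking", 200_000), ("marking", 100_000), ("reasoning", 20_000)]
    tiered = estimate_cost(workload, prices)
    all_frontier = sum(tokens for _, tokens in workload) / 1000 * prices["frontier"]
    print(f"tiered: ${tiered:.2f}  all-frontier: ${all_frontier:.2f}")
```

With these placeholder prices most of the token volume lands on the cheap tier, so the routing decision, rather than the frontier model itself, drives the bill.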

Defense-side priorities

The economic equilibrium: attackers find bugs expensively but profitably until invariants are restored. The defender’s only viable strategy is broken-invariant detection — auditors highlight that an invariant is violated, regardless of which specific exploit path realizes it. Invariant-discovery is structurally easier (it’s a property of the code, not a path) and is already 100% automated in mature AI auditors.
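Broken-invariant detection can be illustrated on a state snapshot. This sketch assumes a simplified token-and-vault state held in a plain dictionary; the two invariants and all field names are hypothetical examples, not taken from any specific auditor.

```python
from __future__ import annotations

from typing import Callable


def supply_matches_balances(state: dict) -> bool:
    """Token accounting: the sum of all balances equals total supply."""
    return sum(state["balances"].values()) == state["total_supply"]


def vault_is_solvent(state: dict) -> bool:
    """Solvency: assets held cover liabilities owed."""
    return state["vault_assets"] >= state["vault_liabilities"]


INVARIANTS: dict[str, Callable[[dict], bool]] = {
    "supply_matches_balances": supply_matches_balances,
    "vault_is_solvent": vault_is_solvent,
}


def broken_invariants(state: dict) -> list[str]:
    """Return names of violated invariants, regardless of how they broke."""
    return [name for name, check in INVARIANTS.items() if not check(state)]


if __name__ == "__main__":
    healthy = {
        "balances": {"alice": 60, "bob": 40},
        "total_supply": 100,
        "vault_assets": 500,
        "vault_liabilities": 450,
    }
    # Same state after some unknown exploit path minted 25 tokens to bob.
    exploited = {**healthy, "balances": {"alice": 60, "bob": 65}}
    print(broken_invariants(healthy))
    print(broken_invariants(exploited))
```

The check never needs to know which transaction sequence produced the bad state, which is the sense in which an invariant is a property of the code rather than of a path.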

Details / Subtopics

The Hardware Wallets in the Age of AI track

Tomas Martykan (ETHPrague): AI agents change the hardware-wallet threat model. When an agent signs on a user’s behalf, blind-signing risk multiplies, because the agent will accept any human-formatted transaction prompt. NatSpec (Viktor Trón’s example from the decentralisation panel) and similar natural-language-attestation systems become urgent: without them, agentic transaction signing is structurally unsafe.

The Agentic Risk

Fernando Rabasco (ETHPrague, The Agentic Risk) raised the parallel concern. Agents acting on user authority introduce whole categories of risk: sycophantic agents that please users by buying unbounded ads (real example: a LinkedIn-ad agent drained a credit card); agents that pay multiple times for the same outcome; agents that respond to prompt injection in web content. The defense is bounded mandates (AP2-style intent → cart → purchase), not bigger models.
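A bounded mandate can be sketched as a small authorization object. This is a hypothetical illustration of the idea, not the AP2 API: the class, field names, and merchant strings are all assumptions. It covers the failure modes listed above with a hard budget cap, a merchant allow-list, and a one-payment-per-intent rule.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Mandate:
    max_total_cents: int
    allowed_merchants: frozenset[str]
    spent_cents: int = 0
    settled_intents: set[str] = field(default_factory=set)

    def authorize(self, intent_id: str, merchant: str, amount_cents: int) -> bool:
        """Approve a purchase only if it stays inside the user's mandate."""
        if merchant not in self.allowed_merchants:
            return False  # e.g. a merchant introduced by injected web content
        if intent_id in self.settled_intents:
            return False  # never pay twice for the same outcome
        if amount_cents <= 0:
            return False
        if self.spent_cents + amount_cents > self.max_total_cents:
            return False  # hard cap, however persuasive the agent's reasoning
        self.spent_cents += amount_cents
        self.settled_intents.add(intent_id)
        return True


if __name__ == "__main__":
    mandate = Mandate(max_total_cents=5_000,
                      allowed_merchants=frozenset({"ads.example"}))
    print(mandate.authorize("intent-1", "ads.example", 3_000))   # within budget
    print(mandate.authorize("intent-1", "ads.example", 1_000))   # duplicate intent
    print(mandate.authorize("intent-2", "ads.example", 3_000))   # would exceed cap
    print(mandate.authorize("intent-3", "evil.example", 100))    # merchant not allowed
```

The check sits outside the model, so it holds even when the agent itself has been manipulated.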

Continuous security as a development practice

The new operational model articulated by Riptide: lightweight scan on every commit, heavyweight scan before mainnet, plus continuous monitoring. The audit-as-event becomes the audit-as-process. Audit firms either adopt this or get out-competed by AI-first audit shops within 12 months.

Connection to the supply-chain-attack wave

Set against the supply-chain finding in Smart Contract Security (2026 State) (50%+ of 2025 losses), the audit-cycle replacement addresses only the smart-contract layer. AI auditing does not catch dependency hijacks, frontend compromises, or wallet-side blind-signing. The full defensive stack is continuous software composition analysis (SCA) + continuous AI smart-contract audit + behavior-based runtime detection.

Funding mechanisms

The ETH security quadratic-funding round (referenced by Grieco and Prevratil) is now sustaining several of these tools. Wake (Ackee Blockchain Security), Echidna, and Riptide-class scanners are all funded partly through round-based public-goods donations, a noteworthy operationalization of Ethereum Public Goods Funding for security infrastructure.

Connections

Smart Contract Security (2026 State) — The broader security posture; this page is the AI subset.
Smart Contract Fuzzing — Companion tooling layer; AI is increasingly being used to write fuzzing harnesses and direct fuzzing campaigns.
On-Chain Agents — The Agentic Risk talk, hardware-wallets-in-the-age-of-AI, blind-signing problems.
Ethereum Public Goods Funding — Funding the audit-tool layer.
ETHPrague 2026 — Overview — Audit-cycle-is-dead as a conference centerpiece.

Open Questions

The cost-of-intellect halving rate — does it hold or flatten? If frontier model inference costs hit hardware-limit walls, the attacker advantage may plateau; if they keep falling, every smart contract becomes a permanent honeypot.
Will model orchestration become standardized, or does each shop run a bespoke stack? Standardization would accelerate the defense side; lack of it favors attackers (who can iterate orchestration secretly).
Formal-verification-in-the-loop: how soon does AI write production-quality Certora proofs? If it ships before 2027, the equilibrium tilts back to defenders. If it ships in 2028, much existing TVL is at risk in the gap.
The 7chat / Greg AI / Ackee distinction: will these consolidate into a single dominant tool, or does pluralism (multiple agents with different tool biases finding different bugs) win? Grieco’s argument about fuzzing tools applies: tool bias means parallel coverage matters more than single-tool quality.