Besu Parallel State Root Computation

Summary

Besu 26.1.0 (April 2026) ships parallelized state root computation as the default, reducing block processing time by up to 40% on 8-core non-NVMe nodes. The technique exploits the 16-way branching structure of the Merkle Patricia Trie (MPT): state updates are sorted by their first nibble, dispatched to parallel workers, and committed to RocksDB in a single bulk write. An “extension node puffing” mechanism handles the edge case where extension nodes must be temporarily expanded to branch nodes for concurrent processing.

Background: State Root Computation Bottleneck

After executing all transactions in a block, Ethereum clients must:

1. Apply all state changes (account balances, storage slots) to the world state
2. Compute the new Merkle Patricia Trie root incorporating all changes
3. Commit the new state to the on-disk database (RocksDB)

Step 2 (state root computation) is a serial bottleneck in many clients: each node's hash depends on the hashes of its children, so a straightforward implementation applies the changes and rehashes the affected paths one after another. For a busy block with thousands of state changes, this takes tens to hundreds of milliseconds.

The 16-Nibble Parallelism Insight

The Merkle Patricia Trie has 16-way branching at every branch node: each branch node has up to 16 children, one per hexadecimal nibble (0-9, A-F).

Key observation: state updates whose keys begin with different first nibbles are completely independent. Updating the subtree under nibble 0 does not affect the subtree under nibble F. These subtrees can be updated in parallel.

Implementation:

1. Collect all state changes for the block
2. Sort and group changes by their first nibble (16 groups)
3. Dispatch each group to a parallel worker thread
4. Merge the 16 updated subtree roots into the root branch node
5. Commit all changes to RocksDB in a single bulk write (commit cache)
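
The grouping, dispatch, and merge steps above can be sketched as follows. This is a minimal illustration, not Besu's code: SHA-256 stands in for Keccak-256 and RLP node encoding, `subtree_digest` is a hypothetical placeholder for a real subtree update, and Python's `ThreadPoolExecutor` plays the role of a bounded worker pool. All function names are assumptions made for the example.

```python
import hashlib
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

EMPTY = b"\x00" * 32  # placeholder digest for a nibble group with no changes


def first_nibble(key: bytes) -> int:
    """Return the top 4 bits of the first key byte (0..15)."""
    return key[0] >> 4


def group_by_first_nibble(changes):
    """Sort changes by key and bucket them by first nibble."""
    groups = defaultdict(list)
    for key in sorted(changes):
        groups[first_nibble(key)].append((key, changes[key]))
    return dict(groups)


def subtree_digest(items) -> bytes:
    """Hypothetical stand-in for updating one subtree: hash its sorted entries."""
    h = hashlib.sha256()
    for key, value in items:
        h.update(len(key).to_bytes(4, "big"))
        h.update(key)
        h.update(len(value).to_bytes(4, "big"))
        h.update(value)
    return h.digest()


def merge_root(children) -> bytes:
    """Combine the 16 child digests into a single root digest."""
    return hashlib.sha256(b"".join(children)).digest()


def parallel_root(changes, max_workers: int = 8) -> bytes:
    groups = group_by_first_nibble(changes)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # One task per active nibble group; the groups share no data.
        futures = {n: pool.submit(subtree_digest, items) for n, items in groups.items()}
        # Join point: wait for every dispatched group before merging.
        children = [futures[n].result() if n in futures else EMPTY for n in range(16)]
    return merge_root(children)


def serial_root(changes) -> bytes:
    """Same computation without threads, used as a reference."""
    groups = group_by_first_nibble(changes)
    children = [subtree_digest(groups[n]) if n in groups else EMPTY for n in range(16)]
    return merge_root(children)
```

Because each group touches a disjoint subtree, the parallel and serial versions produce the same digest; only the scheduling differs.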

Two Dispatch Criteria

Not every block benefits from parallelization. Besu uses adaptive parallelization based on two criteria:

Criterion 1: Multiple destinations. There must be changes in more than one nibble group. If every change in the block falls under nibble A, parallelization provides no benefit. Note that the group is determined by the trie key, which is the Keccak-256 hash of the address or storage slot, not by the leading hex digit of the contract address itself.

Criterion 2: Sufficient work volume per subtree. Each active group must have enough state changes to justify the overhead of spawning a thread and synchronizing results. A group with a single state change is faster to process serially. The threshold is calibrated empirically.

If either criterion is not met (small block, or highly concentrated state access), Besu falls back to serial processing.
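
A minimal sketch of such a dispatch check, reading the two criteria as both being required. The function name and the `min_changes_per_group` default are hypothetical; the source only says the threshold is calibrated empirically and gives no value.

```python
def should_parallelize(group_sizes, min_changes_per_group: int = 4) -> bool:
    """Decide whether to dispatch nibble groups to parallel workers.

    group_sizes maps a first nibble (0..15) to its number of state changes.
    min_changes_per_group is an illustrative value, not Besu's threshold.
    """
    active = [size for size in group_sizes.values() if size > 0]
    # Criterion 1: changes must land in more than one nibble group.
    if len(active) < 2:
        return False
    # Criterion 2: every active group must carry enough work.
    return all(size >= min_changes_per_group for size in active)
```

For example, `should_parallelize({10: 50})` is false because only one group is active, and `should_parallelize({0: 10, 15: 1})` is false because one group is too small to be worth a thread.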

Extension Node Puffing

Besides leaf nodes, the MPT contains two internal node types that matter here:

Branch nodes: 16-child nodes (the natural parallelism point)
Extension nodes: compressed paths that skip multiple nibbles (optimization for sparse tries)

Problem: an extension node compresses a common path prefix. When a parallel worker needs to modify the subtree at a particular path, an extension node in that path might “block” other workers who share the same compressed prefix.

Puffing: temporarily expand an extension node into a branch node for the duration of parallel processing. This expands the trie slightly but allows full 16-way parallelism across all subtrees. After parallel processing completes, the branch node can be re-compressed back to an extension node if appropriate.

The puffing mechanism is the key engineering challenge that makes the parallel approach work correctly on real Ethereum state (which has many extension nodes due to Ethereum address distribution).
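
To make the idea concrete, here is a toy model of puffing and re-compression. It is an assumption about what the transformation looks like structurally, not Besu's implementation: the `Leaf`, `Extension`, and `Branch` classes and the `puff` and `compress` functions are hypothetical, and hashing, branch values, and leaf paths are omitted.

```python
from dataclasses import dataclass, field
from typing import Optional, Union


@dataclass
class Leaf:
    value: bytes


@dataclass
class Extension:
    path: tuple  # shared nibbles, length >= 1
    child: "Node"


@dataclass
class Branch:
    children: list = field(default_factory=lambda: [None] * 16)


Node = Union[Leaf, Extension, Branch]


def puff(ext: Extension) -> Branch:
    """Expand an extension into a branch with a single occupied slot."""
    branch = Branch()
    head, rest = ext.path[0], ext.path[1:]
    # The remaining nibbles (if any) stay compressed one level down.
    branch.children[head] = Extension(rest, ext.child) if rest else ext.child
    return branch


def compress(branch: Branch) -> Node:
    """Collapse a branch back into an extension if only one slot is in use."""
    occupied = [(i, c) for i, c in enumerate(branch.children) if c is not None]
    if len(occupied) != 1:
        return branch  # workers added siblings, so it must stay a branch
    nibble, child = occupied[0]
    if isinstance(child, Extension):
        # Merge with the compressed remainder to restore the original path.
        return Extension((nibble,) + child.path, child.child)
    return Extension((nibble,), child)
```

In this model, `compress(puff(ext))` returns a node equal to `ext` when no worker added a sibling, which matches the "re-compressed if appropriate" behaviour described above.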

Commit Cache

After parallel workers update their subtrees, all changes are accumulated in an in-memory commit cache. The cache is then flushed to RocksDB in a single bulk write operation.

Why this matters: RocksDB write performance degrades significantly with many small random writes but scales well with large sequential batch writes. The commit cache converts N×(small writes) → 1×(large batch write), improving disk I/O efficiency.

Trade-off: the commit cache requires holding all pending state changes in memory during processing. For very large blocks (high gas limit), this can be hundreds of MB. The implementation must handle this gracefully.
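
The accumulate-then-flush pattern can be sketched as below. `InMemoryStore` is a hypothetical stand-in for the database that only counts batch writes, and `CommitCache` with its method names is an assumption for illustration; neither reflects Besu's or RocksDB's actual API.

```python
import threading


class InMemoryStore:
    """Hypothetical stand-in for the on-disk database; counts batch writes."""

    def __init__(self):
        self.data = {}
        self.write_calls = 0

    def write_batch(self, entries):
        self.write_calls += 1
        self.data.update(entries)


class CommitCache:
    """Collects writes from parallel workers and flushes them in one batch."""

    def __init__(self, store):
        self._store = store
        self._pending = {}
        self._lock = threading.Lock()

    def put(self, key: bytes, value: bytes) -> None:
        with self._lock:  # workers may call put() concurrently
            self._pending[key] = value

    def pending_bytes(self) -> int:
        """Rough size of buffered data, useful for watching memory use."""
        with self._lock:
            return sum(len(k) + len(v) for k, v in self._pending.items())

    def flush(self) -> int:
        """Write all buffered entries in a single batch; return the entry count."""
        with self._lock:
            batch, self._pending = self._pending, {}
        if batch:
            self._store.write_batch(batch)
        return len(batch)
```

However many times `put` is called during a block, the store sees exactly one `write_batch` call per `flush`. The `pending_bytes` helper hints at how an implementation might monitor the memory cost noted in the trade-off above.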

Implementation Details

Concurrency model: Java CompletableFuture for spawning parallel workers and synchronizing on their results
Thread pool: bounded thread pool to prevent spawning more workers than CPU cores
Synchronization point: all dispatched subtree workers (up to 16) must complete before the root branch node is updated (join point)

Performance Results

Benchmark conditions: 8-core node, non-NVMe storage (SATA SSD)

Block processing time reduction: up to 40%
Conditions: benefits most pronounced on high-activity blocks with state changes spread across multiple nibble groups
Storage: non-NVMe benefits most; NVMe nodes see smaller improvements (RocksDB write is already fast; CPU becomes the bottleneck rather than I/O)

Scaling: the benefit scales with core count up to 16 cores (matching the 16-nibble branching factor). Beyond 16 cores, additional cores cannot be leveraged by this specific parallelization.
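
The 16-core ceiling can be illustrated with a small back-of-the-envelope model. It assumes all 16 nibble groups are active and carry equal work, and it ignores thread overhead and I/O, so it gives an upper bound on the speedup of the subtree-update stage only, not a prediction of end-to-end block processing time. The function name is hypothetical.

```python
import math


def ideal_speedup(cores: int, groups: int = 16) -> float:
    """Upper bound on speedup when `groups` equal tasks run on `cores` workers."""
    usable = min(cores, groups)          # extra cores beyond 16 sit idle
    rounds = math.ceil(groups / usable)  # batches of tasks run back to back
    return groups / rounds
```

Under these assumptions the bound is 8.0 for 8 cores and 16.0 for both 16 and 32 cores. It also suggests the gain comes in steps: 12 cores still need two rounds, giving the same bound as 8 cores.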

Deployment

Default in Besu 26.1.0+ (shipped April 2026)
No user configuration required; adaptive criteria enable/disable automatically per block
Compatible with existing state databases (no migration needed)

Relationship to Block-Level Access Lists (BALs)

EIP-7928 (Block-Level Access Lists) proposes explicit per-block access lists to enable parallel transaction execution. Besu’s parallel state root computation is complementary:

BALs parallelize transaction execution (parallel simulation of independent transactions)
Besu’s approach parallelizes state root computation (parallel Merkle hashing after execution)

Both can be active simultaneously: transactions execute in parallel (BALs), then state root is computed in parallel (Besu 26.1.0). The combined improvement could reach 50-60% total block processing time reduction on well-equipped nodes.

Impact on Block Builders

Block builders run execution clients to simulate transactions. Faster state root computation means:

More simulations per second: builders can evaluate more transaction orderings in the same wall-clock time
More slack before the deadline: less time lost to state root computation → more time for transaction selection
Higher block value: more simulations → better opportunity to find the optimal ordering

This asymmetrically benefits sophisticated builders with multi-core infrastructure, consistent with existing builder centralization trends.

Open Questions

❓ What is the performance on NVMe nodes where I/O is not the bottleneck? Does CPU-bound performance benefit from additional parallelism?

❓ How does extension node puffing interact with a Verkle tree migration? Verkle trees have a different structure (256-way branching at each internal node) that may enable different parallelization strategies.

❓ Does the commit cache’s memory overhead create problems at very high gas limits (e.g., post-Fusaka)?

Timeline

2026-04-20 — Blog post published by Besu maintainers describing the parallelization approach
2026-04-20 — Besu 26.1.0 released with parallel state root computation as default