Skip to main content

Ethereum's Merkle Patricia Tree

· 8 min read

The state of Ethereum (the totality of all accounts, balances, and smart contracts), is encoded into a special version of data structure known generally in computer science as Merkle Tree. This structure is useful for many applications in cryptography because it creates a relationship between all individual pieces of data entangled in the tree, resulting in a single root value that can be used to prove things about the data.

Ethereum's data structure is a 'modified Merkle-Patricia Trie', named so because it borrows some features of PATRICA ( the Practical Algorithm to Retrieve Information Coded in Alphanumeric), and because it is designed for efficient data retrieval of items that comprise the Ethereum state.

A Merkle-Patricia trie is deterministic and cryptographically verifiable: The only way to generate a state root is by computing it from each individual piece of state, and two states that are identical can be easily proven by comparing the root hash and the hashes that led to it (a Merkle proof). Conversely, there is no way to create two different states with the same root hash, and any attempt to modify state with different values will result in different state root hash. Theoretically, this structure provides 'holy grail' of Olog(n)O\log(n) efficiency for inserts, lookups, and deletes.

Basic Radix Tries

In a basic radix trie, every node looks as follows

[i0,i1,...in,value][i_0, i_1, ... i_n, \text{value}]

Where i0...ini_0 ... i_n represent the symbols of the alphabet, value\text{value} is the terminal value at the node, and the values in the i0,i1,...ini_0, i_1, ... i_n slots are either NULL or pointers to (in our case, hashes of) other nodes. This forms a basic (key, value)\text{(key, value)} store.

Say you wanted to use a radix tree data structure for persisting an order over a set of key value pairs. To find the value currently mapped to the key dog in the trie, you would first convert dog into letters of the alphabet (giving 64 6f 67), and then descend the trie following the path until you find the value. That is, you start by looking up the root hash in a flat key/value DB to find the root node of the trie. It is represented as an array of keys pointing to other nodes. You could use the value at index 6 as a key and look it up in that flat key/value DB to get the node one level down. Then pick index 4 to look up the next value, then pick index 6 and so on, until once you followed the path:

Dog KeyDog Key

There is a difference between looking something up in the 'trie' and the underlying key/value DB. They both define key/value arrangements, but the underlying DB can do a traditional 1 step lookup of a key. Looking up a key in the trie requires multiple underlying DB lookups to get to the final value described above.


def update(node, path, value):

if path == '':
currentNode = db.get(node) if node else [ NULL ] * 17
newNode = currentNode.copy()
newNode[-1] = value
else:
currentNode = db.get(node) if node else [ NULL ] * 17
newNode = currentNode.copy()
newHash = update(currentNode[path[0]], path[1:], value)
newNode[path[0]] = newHash

db.put(hash(newNode), newNode)
return hash(newNode)

def delete(node, path):

if node is NULL:
return NULL
else:
currentNode = db.get(node)
newNode = currentNode.copy()

if path == '':
newNode[-1] = NULL
else:
newIndex = delete(currentNode[path[0]], path[1:])
newNode[path[0]] = newIndex

if all(x is NULL for x in newNode):
return NULL
else:
db.put(hash(newNode), newNode)
return hash(newNode)

A Merkle Radix tree is built by linking nodes using deterministically-generated cryptographic hash digests. This content-addressing (in key/value DB key=keccak256(rlp(value))\text{key} = \text{keccak256}(\text{rlp}(\text{value}))) provides cryptographic integrity guarantee of the stored data. If the root hash of a given trie is publicly known, then anyone with access to the underlying leaf data can construct a proof that a trie includes a given value at a specific path by providing the hashes of each node joining a specific value of the tree root.

It is impossible for an attacker to provide a proof of a (path,value)(\text{path}, \text{value}) pair that does not exist since the root hash is ultimately based on all hashes below it. Any underlying modification would change the root hash. You can think of the hash as a compressed modification of structural information about the data, secured by the pre-image protection of the hashing function.

We will refer to an atomic unit of radix tree (e.g. a single hex character, or a 4 bit binary number) as "nibble". While traversing a path one nibble at at time, as described above, nodes can maximally refer to 16 children but include a value element. We, hence represent them as an array length 17. We call these 17-element arrays "branch nodes".

Merkle Patricia Trie

Radix tries have one major limitation: they are inefficient. If you want to store (path,value)(\text{path}, \text{value}) binding where the path, like in key Ethereum, is 64 characters long (number of nibbles, 8 x 8), we will need over a kilobyte of extra space to store one level per character, and each lookup or delete will take the full 64 steps.

Optimization

A node in a Merkle Patricia trie is one of the following:

  1. NULL\text{NULL} (represented as the empty string)
  2. branch\text{branch} 17-item node [v0...v15,vt][v_0 ... v_{15}, v_t]
  3. leaf\text{leaf} 2-item node [encodedPath,key][\text{encodedPath}, \text{key}]
  4. extension\text{extension} 2-item node [encodedPath,key][\text{encodedPath}, \text{key}]

With 64 character paths it is inevitable that after traversing the first few layers of the trie, you will reach a node where no divergent path exists for at least part of the way down. To avoid have to create up to 15 sparse NULL nodes along the path, we shortcut the descent by setting up an extension node of the form [encodedPath,key][\text{encodedPath}, \text{key}], where encodedPath\text{encodedPath}, contains the "partial path" to skip ahead, and the key is for the next DB lookup.

For a leaf node, which can be marked by a flag in the first nibble of the encodedPath, the path encodes all prior node's path fragments and we can look up the value directly. This optimization, however, introduces ambiguity. There is no difference between a leaf and extension node. In order, to make distinction, a prefix is added to the path.

Specification: Compact Encoding of Hex Seuqnece with Optional Terminator

The flagging of both odd vs even remaining partial path length and leaf vs extension node as described above reside in the first nibble of the partial path of any 2-item node. They result in the following

Hex CharacterBitsNode Type PartialPath Length
00000ExtensionEven
10001Extensionodd
20010Leafeven
30011leftodd

For even remaining path length ( 0 or 2), another 0 "padding" nibble will always follow.


def compact_encode(hexarray):

terminator = 1 if hexarrya[-1] == 16 else 0

if terminator:
hexarray = hexarray[:-1]

oddlength = len(hexarray) % 2
flags = 2 * terminator + oddlength

if oddlength:
hexarray = [flags] + hexarray
else:
hexarry = [flags] + [0] + hexarray


output = ''
for i in range(0, len(hexarray, 2), 2):
output = chr(16 * hexarray[i] + hexarray[i+1])
return output

Examples:

    > [ 1, 2, 3, 4, 5, ...]
'11 23 45'
> [ 0, 1, 2, 3, 4, 5, ...]
'00 01 23 45'
> [ 0, f, 1, c, b, 8, 10]
'20 0f 1c b8'
> [ f, 1, c, b, 8, 10]
'3f 1c b8'

Example Trie

Suppose we want a trie containing four path/value pairs (do, verb), (dog, puppy), (doge, coin), (horse, stallion).

First we convert both paths and values to bytes.

    <64 6f> : 'verb'
<64 6f 67> : 'puppy'
<64 6f 67 65> : 'coin'
<68 6f 72 73 65> : 'stallion'

Now we build such a trie with following key/value pairs in the underlying DB.

    rootHash: [ <16>, hashA ]
hashA: [ <>, <>, <>, <>, hashB, <>, <>, <>, [ <20 6f 72 73 65>, 'stallion' ], <>, <>, <>, <>, <>, <>, <>, <> ]
hashB: [ <00 6f>, hashD ]
hashD: [ <>, <>, <>, <>, <>, <>, hashE, <>, <>, <>, <>, <>, <>, <>, <>, <>, 'verb' ]
hashE: [ <17>, [ <>, <>, <>, <>, <>, <>, [ <35>, 'coin' ], <>, <>, <>, <>, <>, <>, <>, <>, <>, 'puppy' ] ]
Example TrieExample Trie