Skip to main content

An Introduction To Merkle Tree

· 5 min read

A Merkle tree is nothing but a hash-based data structure that is a generalization of hash list. It is a tree structure in which leaf node is a hash of a block of data, and each non-leaf node is a hash of its children. Typically Merkle trees have a branching factor of 2, meaning that each node has up to two children.01

Merkle trees are used in distributed systems for efficient data verification. They are efficient because they use hashes instead of full files. Hashes are ways of encoding files that are much smaller than the actual file itself. Currently their meain uses are in peer-to-peer networks.

Merkle trees are typically implemented as binary trees. However, a Merkle tree can be created as an n-ary tree, with n children per node.

Benefits and Protocol

In various distributed peer-to-peer systems, data verification is important. This is because same data exists in multiple locations. So if a piece of data is changed in one location, it is important that data is changed everywhere. Data verification is used to make sure that data is same everywhere.

However, it is time-consuming and computationally expensive to check the entirety of each file whenever a system wants to verify data. So, this is why Merkle trees are used. Basically, we want to limit the amount of data being sent over a network as much as possible. So instead we just send a hash of the file to see if it matches. The protocol goes like this:

  1. Compute A sends a hash of the file to computer B.
  2. Computer B checks that hash against the root of the Merkle tree.
  3. If there is no difference, we are done! Otherwise go to step 4.
  4. If there is a difference in a single hash, computer B will request the roots of the two subtrees of that hash.
  5. Computer A creates the necessary hashes and sends them back to copmuter B.
  6. Repeat steps 4 and 5 until you have found the data block(s) that are inconsistent. It is possilbe to find more than one data blcok that is wrong because there might be more than one error in the data.

Note that each time a hash is found to match, we need n more comparisons at the next level, where n is the branching factor of the tree.

This algorithm is predicted on the assumption that network I/O takes longer than local I/O to perform hashes.it

Because the computers are only sending hashes over the network (not the entire file) this process can go very quickly. Plus if an inconsistent piece of data is found, it is much easier to insert a small chunk of fixed data than to completely rewrite the entire file to fix the issue.

The way that Merkle trees can be helpful in peer-to-peer system has to do with trust. Before you download a file from peer-to-peer source- like Tor - the root hash is obtained from a trusted source. After that, you can obtain lower nodes of the Merkle tree from untrusted peers. All of these nodes exist in the same tree-like structure and they all are partial representations of the same data. The nodes from untrusted sources are checked again trusted hash. If they match the trusted source, they are accepted and the process continues. If they are no good, they are discarded and searched for again.

Use Cases

As stated above, Merkle trees are especially useful in distributed peer-to-peer systems where the same data should exist in multiple places. These systems use Merkle tree or variants on the Merkle tree in their implementation.

Git is a popular version control system mainly used by programmers. All of the saved files are saved on every user's computer at all times. So, it is very important to check that these changes are consistent across everyone's computer.

Bitcoin is a popular online, anonymous currency. All transactions in bitcoin are stored in blocks on what is called the blockchain. This blockchain exists on every bitcoin user's computer. Leaves of Merkle tree used in bitcoin are typically hashes of single blocks. Every time someone wants to alter the blockchain, this change will be reflected everywhere.

Merkle trees can be used to check for inconsistencies in more than just files and basic data structures like blockchain. Apache Cassandra and NoSQL systems use Merkle trees to detect inconsistencies between replicas of entire databases. Imagine a website that people use all over the world. That website probably needs databases and servers all over the world so that load times are good. If one of those databases get altered, then every single other database gets altered in the same way. Hashes can be made of chunks of the databases, and Merkle trees can detect inconsistencies.

Complexity

Merkle trees have a very little overhead when compared with hash lists. Binary Merkle trees, operate similarly to binary search trees in that their depth is bounded by their branching factor 2. Included below is worst because analysis for a Merkle tree with a branching factor of k.

AverageWorst
SpaceO(n)O(n)O(n)O(n)
SearchO(log2n)O(\log_{2}n)O(logkn)O(\log_{k}n)
TraversalO(n)O(n)O(n)O(n)
InsertO(log2n)O(\log_{2}n)O(logkn)O(\log_{k}n)
DeleteO(log2nO(\log_{2}n)O(logkn)O(\log_{k}n)

However, a very important aspect of Merkle trees is in the synchornoization of data. In the typical case, this synchronization will be a log2n\log_{2}n operation because it is based on traversal and search. However, there is a worst case situation where there are no nodes in common. In this scenario, synchronization is akin to making a complete copy of the file, which will be O(n)O(n) operation.

AverageWorst
SynchronizationO(log2n)O(\log_{2}n)O(n)O(n)