To Home page

Merkle-patricia Dac

1 Definition

1.1 Merkle-patricia Trees

A Merkle-patricia tree is a way of hashing a map, an associative array, such that the map can have stuff added to it, or removed from it, without having to rehash the entire map, and such that one can prove a subset of the map, such as a single mapping, is part of the whole map, without needing to have to have the whole map present to construct the hash.

Its practical application is constructing a global consensus on what public keys have the right to control what digital assets (such as crypto currencies, and globally defined human readable names) and proving that everyone who matters agrees on ownership.

If a large group of peers, each peer acting on behalf of a large group of clients each of whom have rights to a large number of digital assets, agree on what public keys are entitled to control what digitial assets, then presumably their clients also agree, or they would not use that peer.

Thus, for example, we don’t want the Certificate Authority to be able to tell Bob that his public key is a public key whose corresponding secret key is on his server, while at the same telling Carol that Bob’s public key is a public key whose corresponding secret key is in fact controlled by the secret police.

The Merkle-patricia tree not only allows peers to form a consensus on an enormous body of data, it allows clients to efficiently verify that any quite small piece of data, any datum, is in accord with that consensus.

1.2 Patricia trees

A patricia tree is a way of structuring a potentially very large list of bitstrings sorted by bitstring, with a binary point implied somewhere in the bistring, such that bitstrings can be added or deleted without resorting or shifting the whole list of bitstrings.

A patricia tree defines a set of keys, the keys of its leaf nodes. Or perhaps the keys are its leaf nodes, all the information may well be in the strings defined. The keys may be bounded on the left, but unbounded on the right, for example strings, unbounded on the left but bounded on the right, for example arbitrary precision integers, bounded both left and right, for example sixty four bit, integers, or unbounded on either size, for example binary arbitrary precision floating point.

If unbounded left, then the edge pointing at the root node had to give its height above the binary point. The bitstring of the root node is always the empty string, but where is that empty string positioned in height with respect to the binary point?

We cannot reference nodes by bitstring in the canonical form, because the number of leading zeroes in bitstring will change over time as the tree gets deeper – we have to represent nodes by their height plus the bitstring starting at the first non zero bit, or by the key, which is the bitstring with a one bit and several zero bits appended, to align the significance of bits in different bitstrings, in which case again only the first non zero bit matters.

When we give a chain of vertexes, starting at the root vertexes, the compact representation of the location of the vertex in the tree is its vertical position in the tree, plus the bitstring starting at the first nonzero bit. If it its a tree of items with fixed right bound, items identified by their integer sequence, then we give the height above the leaves, since this will not change as the tree grows. If fixed left bound, for example names as utf8 strings, the depth from the root. If neither bound is fixed, a case we are unlikely to have to deal with, the signed height or depth from some arbitrary starting point which will never change as the tree grows. Thus we always need to implicitly or explicitly define the bit alignment in the bitstring.

If a leaf in a patricia tree representing values with fixed right boud, for example oids, the usual case, then the bitstring of a leaf is its oid, or its oid minus one, which does not need leading zeroes.

We need to be able to represent a bitstring containing all zeroes, thus if a bitstring contains any ones, we have to represent that one. The height, therefore, is the alignment of the right edge of the bitstring, hence we can leave out leading zeroes in the bitstring, and indeed must leave them out of the canonical form of a tree representing values with no fixed left bound, such as indefinite precision integers, so that the canonical form does not change when a parent node is placed on top of a former root node

The key to a node, whether a vertex or a leaf, is the bitstring aligned by padding it with a one bit, followed by as many zero bits as needed.

The total number of vertexes equals the twice the number of leaves minus one. Each parent node has as its key the bit string, a sequence of bits not necessarily aligned on bit boundaries, that both its children have in common. This creates the substring problem for patricia trees mapping keys that have no right bound, mapping variable length keys. We cannot permit one key of the map to be the prefix of another key. If, however, the key is is self delimiting, as with null terminated strings, no key can be the prefix of another key, and this tends to be the usual way that variable length values are used as map keys. There are wide variety of too clever by half ways of dealing with prefix keys, but they all involve messing up a rather elegant algorithm with considerable complexity and surprising code paths in special cases and fencepost cases. It is better just to not allow prefix keys, as for example by having strings null terminated. If we really wanted to define arbitrary bit strings as leaf keys of a patricia tree, which I doubt we will, would be better to encode them in self delimiting format. No string in self delimiting format can be the prefix of another string.

A Merkle-patricia dac is a patricia tree with binary radix (which is the usual way patricia trees are implemented) where the hash of each node depends on the hash and the skip of its two children; Which means that each node contains proof of the entire state of all its descendant nodes.

The skip of a branch is the bit string that differentiates its bit string from its parent, with the first such bit excluded as it is implied by being a left or right branch. This is often the empty bitstring, which when mapped to a byte string for hashing purposes, maps to the empty byte string.

It would often be considerably faster and more efficient to hash the full bitstring, rather than the skip, and that may sometimes be not merely OK, but required, but often we want the hash to depend only on the data, and be independent of the metadata, as when the leaf index is an arbitrary precision integer representing the global order of a transaction, that is going to be constructed at some later time and determined by a different authority.

Most of the time we will be using the tree to synchronize two ’t of pending transactions, so though a count of the number of children of a vertex or an edge is not logically part of a Merkle-patricia tree, it will make synchronization considerably more efficient, since the peer that has the node with fewer children wants information from the peer that has the node with more children.

2 Representation

The canonical form will not directly reflect the disk organization.

The canonical form of a sparse tree is that each vertex is represented by the hash of its two children, and the bitstring of the offset of each child from its parent, minus the leading bit of that bitstring. The root node, of course, has an empty bitstring.

Often it will be more compact to transmit the child itself rather than the hash, in reverse polish notation order, from which the hash can be generated.

To form the hash of a node, we need the hashes and relative bitstrings of its children, but if we already have the children, identified by reverse polish position in the stream or by their bitstrings relative to a common ancestor, we don’t need and should not represent the hashes, giving them implicitly rather than explicitly.

Since we cannot hash a bitstring, only a bytestring, the bitstring will be hashed in its representation as a bytecount represented by a variable precision integer, followed by that many bytes, with the bitstring being padded if needed to an integer number of bytes by adding a one bit followed by as many zero bits as needed. Thus a twofiftysix bit byte string requires a count of thirty three, plus thirty three bytes, the last byte being 0x80.

In memory as a ceeplusplus object, the bitstring may conveniently be represented by an integer of at least sixty four bits, with the bitstring bit aligned so that the significan bits in one bitstring line up with bits of the same significance in another bitstring, and padded right with a one bit followed by as many zero bits as needed. In the canonical form, however, the left edge of the bitstring vertex identifier is the left edge of the bitstring, and the length of the bitstring is the depth of the vertex from the root. The left edge of a relative bitstring, identifying a child is one bit to the right of the bitstring identifier of its parent. The child’s bitstring vertex identifier is the parent bitstring vertex identifier, plus a zero bit for the left child and a one bit for the right child, plus the bits of the relative bitstring. The canonical hash of the parent is the hash of its left child, plus the relative bitstring of its left child, and similarly for its right child.

Conceptually and canonically, it is equivalent to a patricia tree where the children of a node are identified by hashes rather than pointers. The hashes are taken over the canonical form, and are unaffected by location in memory and the representation, which is not necessarily canonical. If the actual representation is in a database, it is likely to be represented in a way that makes recursive sql statements work.

Since we in practice cannot find the thing referred to by its hash, any actual representation of the canonical form must contain additional information telling us where to find the data referred to, but this additional information is likely to vary from one situation to the next, and is not canonical.

For this canonical form to work as a direct representation, we would need a universal way of finding the pre-image of any hash, which would be costly, and would deny us some useful cryptographic capabilities where a party reveals a pre-image and a hidden part of the Merkle-patricia tree – typically when a transaction goes bad, he would then make public the convesation leading to it. Also, the tree will eventually grow enormous, and have numerous side chains attached to it, in which case only the party or parties operating the side chain can reverse their pre-images. 

But the forms actually used should be a representation of a Merkle patricia tree with hashes and skip fields in place of pointers.

2.1 Balanced binary trees of fixed height.

We will represent an immutable and ever growing data structure as a collection of balanced binary trees, and a balanced binary tree of fixed height makes much of the information in this representation redundant, which suggests that it may be desirable to use a more efficient and direct canonical form – to ensure that the immutable append only data structure is canonically immutable and append only.

The schere pointing at a balanced binary tree will say that it is a balanced binary tree whose leaves are objects of a certain type (have a certain schema) and give the height of the root, assuming that they all have the same schema. If they have different schemas, the then the leaves will be of type schere.

The patricia bit string for each vertex of the balanced binary tree is implicitly given by its position within the tree, so we do not represent it in the canonical form, though we may well represent it in the actual representation.

2.2 Hashing

Hashing depends on the schema – to hash the bitstream, has to parse it into fields and records by the schema, and distinguish between index nodes and record nodes, which are hashed and represented differently, a record node being self contained, and index node depending on its relationships.

In a dac, rather than a tree, an index node might be referenced by multiple different entities, so in that case we want the hash to only depend on the part of the key field that it governs, independent of the parent part of its key field.

Further, a transaction is a group of records, and we want to represent a transaction locally, so that its records are physically close together in storage.

Which implies a transformation, that the canonical form, which knows nothing about storage location, can have portions represented as a position relative form, in which a group of records is kept in depth first tree order, with the boundaries of the group having hashes linking them to the outside, but internally, when converting back into canonical form, we recalculate the hashes. For a given schema we might do this one way in one context, or another way in another context, or have subschemas.

Of course, tree order assumes we have a tree. In general, we have a dac, not a tree, the most important case here being the tree of names, where we are continually issuing new roots for the tree, but we don’t want to continually issue new leaves.

In the canonical form of the Merkle-patricia tree we act as if hashes were reversible. Of course they are not, nor do we have a general universal look up table for reversing them. Rather, you have to hit up a server that can reverse the hashes you care about, which it may do by looking up a ginormous hash table, or more likely do by having the Merkle-patricia tree on disk or in memory in the ordinary patricial form of links pointing at file relative or absolute locations on disk or in memory.

3 Blockchain

Of course we want more than this – a Merkle-patricia block chain, meaning an immutable past and a constantly changing present.

Which represents an immutable and ever growing sequence of transactions, and also a large and mutable present state of the present database that is the result of those transactions, the database of unspent transaction outputs.

When we are assembling a new block, the records live in memory as native format C++ objects. Upon a new block being finalized, they get written to disk in key order, with implementation dependent offsets between records and implementation dependent compression, which compression likely reflects canonical form. Once written to disk, they are accessed by native format records in memory, which access by bringing disk records into memory in native format, but the least recently loaded entry, or least recetly used entry, gets discarded. Even when we are operating at larger scale than visa, a block representing five minutes of transactions fits easily in memory.

Further, a patricia tree is a tree. But we want, when we have the Merkle patricia tree representing registered names organized by names or the Merkle-patricia tree represenging as yet unspent transaction outputs its Merkle characteristic to represent a directed acyclic graph. If two branches have the same hash, despite being at different positions and depths in the tree, all their children will be identical. And we want to take advantage of this in that block chain will be directed acyclic graph, each block being a tree representing the state of the system at that block commitment, but that tree points back into previous block commitments for those parts of the state of the system that have not changed. So the hash of the node in such a tree will identify, probably through an OID, a record of the block it was a originally constructed for, and its index in that tree.

A Merkle-patricia directed acyclic graph, Merkle-patricia dac, is a Merkle dac, like a git repository or the block chain, with the patricia key representing the path of hashes, and acting as index through that chain of hashes to find the data that you want. 

The key will thread through different computers under the control of different people, thus providing a system of witness that the current global consensus hash accurately reflects past global consensus hashes, and that each entities version of the past agree with the version it previously espoused.

This introduces some complications when a portion of the tree represents a database table with more than one index. 

Ethereum has a discussion and definition of this data structure.

Suppose, when the system is at scale, we have thousand trillion entries in the public, readily accessible, and massively replicated part of the blockchain. (I intend that every man and his dog will also have a sidechain, every individual, every business. The individual will normally not have his side chain publicly available, but in the event of a dispute, may make a portion of it visible, so that certain of his payments, an the invoice they were payments for, become visible to others.)

In that case, a new transaction output is typically going to require forty thirty two byte hashes, taking up about two kilobytes in total on any one peer. And a single person to person payment is typicaly going to take ten transaction outputs or so, taking twenty kilobytes in total on any one peer. And this is going to be massively replicated by a few hundred peers, taking about four megabytes in total.

(A single transaction will typically be much larger than this, because it will mingle several person to person payments.

Right now you can get system with sixty four terabytes of hard disk, thirty two gigabytes of ram, under six thousand, for south of a hundred dollars per terabyte, so storing everything forever is going to cost about a twentieth of a cent per person to person payment.  And a single such machine will be good to hold the whole blockchain for the first few trillion person to person payments, good enough to handle paypal volumes for a year.

“OK”, I hear you say. “And after the first few trillion transactions?”.

Well then, if we have a few trillion transactions a year, and only a few hundred peers, then the clients of any one peer will be doing about ten billion transactions a year. If he profits half a cent per transaction, he is making about fifty million a year. He can buy a few more sixty four terabyte computers every year.

The target peer machine we will write for will have thirty two gigabytes of ram and sixty four terabytes of hard disk, but our software should run fine on a small peer machine, four gigabytes of ram and two terabytes of hard disk, until the crypto currency surpasses bitcoin.

Because we will employ fixed size transaction units – larger currency amounts will be broken into tens, twenties, fifties, hundreds, two hundreds, five hundreds, thousands, two thousands and so forth, and because we will be using a blockchain in the form of a Merkle-patricia dac, our transactions will tak up several times as much space a similar bitcoin transaction, and currently bitcoin transactions take up several hundred megabytes. But this is OK, because the Merkle-patricia dac gives client wallets far more power than on the bitcoin system, so we can get by with far fewer peer wallets and far more client wallets.


4 Packed Form

Assume we have an ordered sequence of records, as if the result of a database query with the index fields first.

Suppose we have a potentially very large sequence of records, and we want to generate the Merkle-patricia hash, so that we can generate an efficient proof that one and only one record with a certain prefix appears in this pile.

We want to convert from sequence of records form to patricia form.

If we look at the difference between each pair of records, we get an index node, which is the position of the first bit at which they differ. Which is a patricia tree expressed in infix order. We want to convert infix order to postfix order, a sequence of records interleaved with a sequence of bit positions at which two records differ.

Or equivalently, hash them as if we already had them in postfix order. 

Suppose we want to output self delimiting records, interleaved with postfix indicators of the difference position. We want the least index output last, so that when we do a second sequential pass, hashing each record when we encounter a record, and putting that has on the stack, and when we encounter and index, hashing two hashes from the stack with the index, and put the resulting hash on the stack.

So, we find the difference position between the current record and the next, then we output the current record, and make the next record the current record. If the difference position is less than the difference position on the stack, we output difference positions from the stack until the difference position on the stack is greater that the current difference position, meaning the node represented by the difference position on the stack is a child node of the current node. We then put the current difference position on the stack, and repeat for the next record.

A similar algorithm simply generates the hash as if we had the literal nodes.

A full path to a leaf node, proving that the leaf node is represented in the tree, contains not the hashes of the things in the path, but the off path hashes and off path keys.

An incomplete tree has the same data structure as full tree, but with missing nodes. A full path is a more compact representation of an incomplete tree, and is treated as a compressed form of an incomplete tree with a single leaf node present. You can regenerate an incomplete tree from a full path, and a full path can be generated for any non missing leaf node in an incomplete tree.

Rather than a chain of blocks, we have a Merkle-patricia dac of blocks, where the index is the block number. This means that the state of any block can be proved to be part of the global consensus with a proof of length logarithmic in the total block number. Thus peers can provide clients with short proofs, so that clients do not have to take assertions by peers on trust.

We also have a small number of Merkle-patricia dacs representing the state of the system at any given block time, for example a Merkle patricia dac of unspent transaction outputs, and a trie linking globally unique human readable names to probabilistically unique public keys. These trees change each new block, though their state in past blocks is immutable.  Each new block contains not the entire new Merkle-patricia dac, which is apt to be enormous, but only those nodes that have changed.  A new block contains the new roots of the new Merkle-patricia dacs and their new descendants, which link to unchanged descendants in past blocks.

Peers synchronize their state by sharing new information to form a new block. They efficiently discover what they have in common, and what is new, by sharing the root of the Merkle-patricia dac describing the new block, and then give each other the new information, after the fashion of usenet news. 

A new block is not just a list of new events, but it is generated from a list of new events, which are themselves listed in a Merkle-patricia dac. To compare nodes in the tree of new events, a peer sends its neighbours an offset into the hash, the leading part of the key of the node, a one bit flag to indicate if the node has children, a hundred and twenty eight bits of the hash of the node, and for each of its two children, the leading parts of their keys,a one bit flag to indicate if the node has children, and sixty bits of their hash, and for each of the four grandchilden, the leading parts of their keys, a one bit flag to indicate if the node has children, twenty eight bits of hash, for each of the eight great grandchildren, the leading parts of their keys, a one bit flag to indicate if the node has children, and thirteen bits of hash, for each of the sixteen great great grandchildren, the leading parts of their keys, a one bit flag to indicate if the node has children, and six bits of hash, for each of the thirty two great great great grandchildren, the leading parts of their keys, a one bit flag to indicate if the node has children, and three bits of hash, and for each of the sixty four great great great great grandchildren, the leading parts of their keys, a one bit flag to indicate if the node has children, and two bits of hash.

This tells it which subtrees are definitely different, and which are definitely new, and for each subtree definitely different, it sends more comparison data for that node, and for each subtree definitely missing, it sends that subtree.  Once there are no more subtrees to be sent, repeats the process starting at the root once again.

When it has some information about which subtrees are definitely missing, which are probably the same, and which are definitely different, it then sends much the same, but for each one missing, sends the full subtree, for each probably the same, skips, for each definitely different, doubles the level of detail – repeats with a different part of the hash, and twice as many bits.

So now we need more than a one bit flag. Need to distinguish between the cases:

  1. no children
  2. Sending a full item – a leaf node.
  3. skipping a child subtree because we are likely in agreement
  4. dropping down to a lower level of detail, for example from thirteen bits of hash to six bits of hash. If down to two bits of hash, always going to leave out the children.
  5. not dropping down to a lower level of detail, keeping the same number of bits in the hash.

The intent is to discover what parts of the tree we have agreement on, and send an image of the tree with those parts skipped over.  Rinse and repeat. We annotate our model of the tree with the probability that a subtree is identical. If the other guy sent us a leaf node, we know the node is identical, and every hash fragment that agrees creates exponential probability that a subtree is identical.  If we recently got a leaf node from source, not from sharing, we know for sure that the other guy does not have it, and send it to him.

Ok, that covers information gathering, but what about the final stages, when we are going to throw out data that was late in coming, for the sake of consensus?

A proposed final hash of all items to be in a block for a certain period is announced. And now, the job is to get those items that are missing, and tag those items that the are not in the proposed final hash, and exclude them from a version of the tree. So instead of “definitely not present in the other guy’s hash” means you send the guy your item, it now means you exclude it, and see if that gets your root hash to agree.

A peer in good standing endorses the proposed final hash, the root of the block if it can get its trees to agree

When building a block, the peers share these new events. When coming to a consensus, the peers attempt go get agreement on the new events in the block. But the block will also contain diffs on the Merkle-patricia dac of unspent transaction outputs, and the Merkle-patricia dac of spent transaction inputs. The peers need to maintain these trees so that clients can see proof of consensus on the tries, so that a peer cannot mislead a client, and a peer should only vote for a consensus if it can generate the same root hash for the the block thus has the same tries describing the entire block chain and providing access to a block chain for clients.

After agreement on Merkle-patricia dac of all new items for the new block, a peer generates the other revisions of the other Merkle-patricia dacs, and add them to the block, the generated items that go into the block, and you should get final agreement. If a peer gets final agreement – all block items are valid, then it votes for the consensus. 

The block contains the root of a Merkle-patricia dac that contains all previous blocks (but not the current block) – thus not so much a block chain, as a block trie, which means that the proof of any fact about the state of the block trie is reasonably short, of order log N hashes, where N is the number of items in the block trie. 

5 Node identifiers

We will call the thing that identifies a node a node infix order, and the member of the subset that the partricia tree identifies a key - because we are generally using it as a map key. Part of the map key is part of the node infix order.

But the canonical form of the map key is a bitstring, which is what we are going to hash.

The node infix order is the representation of the bitstring with significant bits in the different bitstrings aligned, padded on the right with a one bit and as many zero bits as needed, and for a tree of quantities unbounded on the left, padded with a one bit. and as many zero bits as needed on the left.

In practice all integers have some finite bound, but one does not want the computer word length to affect the data on the wire or the result of the hash, so it usually preferable to structure the data and algorithms so that the actual left bound has no effect provided it is sufficiently large, as if the integers had unbounded precision. We will not actually need numbers larger than 264 until Anno Domini ten thousand or so, but the when we do, it will have no effect on blockchain format or previous hash results, hence for integers we will use a tree with right bound but no left bound.

But a patricia tree is bit string oriented. So for integer indexes, in order that we can ascertain which nodes correspond to the leaf nodes, need to have, associated with the root node, the the length of the bit string for the leaf node. The length needs to be run time value, rather than a compile time value. But for hashes, can be a compile time value.

To map from bit strings to byte strings, and to have a bit string index for leaf nodes and tree nodes, we either append 10000... or 011111... to bitstring, to get a one to one map from byte strings to bit strings, and from bit strings to integers.

Compiler intrinsics are generally ffs. For Microsoft compilers use _BitScanForward & _BitScanReverse. For GCC use __builtin_ffs, __builtin_clz, __builtin_ctz.(find first set) or ctz (count trailing zeros) so is probably faster to represent a bit string as a word string by appending 100000... than 0111111.… See Microsoft __lzcnt() and gcc __builtin_clz(). . BitScanReverse is portable between processers, and lzcnt is not, so you need a runtime check at the start of program to see if your code can run.

If the leaf nodes correspond to the integers 1 to N, where N is at most m bits long, then the bit strings of the leaf nodes are m+1 bits, becaust we have 2N-1 vertexes in the tree, and 2N edges.

6 Sparse or sequential, complete or partial, Merkle-patricia trees.

6.1 Sparse and complete

As, for example, a Merkle-patricia tree of a map that maps globally unique human readable and writeable names to public keys.

A patricia tree with no right bound on its keys is necessarily sparse. Thus, for example, in a patricia tree of strings terminated by a null byte, the null byte ensures a gap between the address spaces governned by two successive keys.

If a node in the tree has direct children (no skips) or the leaf nodes are sequential and contiguous its has its hash is the hash of the hashes of children. If the tree is sparse part of what we know is the gaps, hence the parent node has to hash the skip links. If the tree is sequential and contiguous, then the root node has to directly hash its direct descendants, and also the size of the tree.

A block is a sparse tree, while a blockchain is a sequential tree of sparse trees. The sparse tree in a block will itself contain sequential trees, and the root hashes of numerous sequential trees.

In a sequential Merkle-patricia tree, the hash of the parent node is simply the hash of the hashes of its two child nodes, but in a sparse tree, we want two maps from keys to objects with the same sequence of the same objects, but different key values, to have different hashes, so hash of a parent node has to be hash of the hashes of its two child nodes, plus the hash of the portions of each child’s key that govern the skip links to those two children, that portion of each child’s key. The portion of the childs key that govern the skip link is (Level difference -1) bits long.

But computers do not handle bit fields easily, and databases do not handle them at all, plus, how do you hash a bit string, such as the leaf indicator?

Any variable length field creates ambiguity in the hash, so that two values could be hashed as the same stream of bytes. To avoid this outcome, the canonical format for a bit string will be an integer in stream format specifying the number of bytes, followed by the bit string, followed by a zero bit, followed by enough one bits to fill to the next byte boundary. If the representation of a bit string as an integral number of bytes has some 0x00 bytes at the end, it is not in canonical format and gets truncated before hashing till it no longer has any 0xff trailing bytes.

If a bit string has a start position that is not necessarily byte aligned, and is known from context, we left pad it with zeroes. If its start position is not known from context, we provide the starting bit position as an integer in stream format. At this point in the code we are deep in the bitbashing weeds, and are no longer worried about passing the bit string around as a regular byte string.

Assuming no prefix problem, one way or another way, then the index of a node can be two fields, the bit string, and the number of bits in the bit string. But since we already have to represent the number of bytes in the field representing the bit string, we might use the 01111.. canonical format trick, so that we can use the more familiar, standard, and convenient infix order of byte string, in which case we will have to pad the map key to form the node infix order.

However we do this, it is an implementation detail that should not affect the canonical form or the root hash, and the appended 0111... form simplifies fencepost problems on interpreting what is on the wire. Otherwise we are always going to be bothered by distinguishing the bit string 0101000 from the bit string 0101. One less if to screw up.

Since a sequential Merkle-patricia tree always maps the integers from zero to n-1 to the objects, hashing link information is redundant, we do not actually need any link information. It is sufficient that the root hash defines the objects and the sequence.

6.2 Sequential and complete

The tree bears a simple and natural relationship to a linear vector of leaves and a linear vector, one smaller, of nodes.

Each node appears with a position in the linear vector one less than the position of the leaf that required it to be added to the tree.

Size eight Merkle-patricia tree by:

“”
1000

“0”
0100

“00”
0010

“000”
0001

“001”
0001

“01”
0110

“010”
0101

“011”
0111

“1”
1100

“10”
1010

“100”
1001

“101”
1011

“11”
1110

“110”
1101

“111”
1111











We want to collapse the tree into a linear list, so that we can find the correct node without walking one bit at a time through a binary tree, both in order to represent relatively small blocks of hashes, and also to represent the top of very large trees.

We extend each bit string to a fixed and uniform size by appending a one bit, followed by as many one zero bits as are necessary fill to the standard size, so that we can use uniform length bit strings, instead of variable sized bit strings, so that we can use them directly to access an array in memory, or OIDs in a database.

The resulting padded bit strings (keys) are in infix order. But for an immutable append only file or database, we want postfix order, which is harder. And we don’t want to be restricted to only having a power of two objects. We want to be have an arbitrary number of objects, and add an arbitrary number of objects without changing the existing tree.

A binary postfix tree, power of two, is going to look like this:

For an append only structure, the position of a leaf node is number of prior leaf vertexes, plus number of prior internal vertexes, and for an sql append only database, the oid of a leaf node is number or prior leaf nodes, plus one.

So a leaf oid is: \displaystyle\frac{key}{2}+1 where key is the patricia bitstring right padded to a fixed size by a one bit, followed by as many zero bits as needed.

Let \displaystyle{C_{key}} be the count of bits in the key. (which in C++ is bitset.count(), But C++ provides no access to cool intrinsic assembly instructions.)

The number of internal vertices prior to a leaf is \displaystyle{\frac{key}{2}+1-C_{key}}

So the leaf position in postfix order is: \displaystyle{size_{leaf}*\frac{key}{2}+size_{vertex}*({\frac{key}{2}+1-C_{key}})}

Let the height of a vertex be h_{key}, the number of trailing zeroes in the key.

So the postfix vertex position, supposing vertexes are in a separate data structure, is \displaystyle{size_{vertex}*\frac{key}{2}+2^{h_{key}}-C_{key}}
Is that right?
Need to check it.

I think we could also get the postfix vertex oid by \displaystyle{\frac{{(key-1)} | {key}}{2}+1-C_{key}}, but again, needs checking.

Given the vertex oid and the leaf oid, the absolute position in the direct immutable append only file is easy to calculate.

Everyone seems to wind up using regular C bit twiddling hacks, because hardware intrinsics are erratically available, and because the efficiency improvement of hardware intrinsics is seldom worth the thought.

uint64_t bitcount(uint64_t c)

To do the reverse operation, finding the key (the padded patricia index) from the postfix position make the starting guess that the C_{key} adjustment was zero, find the corresponding patricia key, and then walk the tree from you where guessed that you were are, to where you should be. You find the predicted postfix position of your guess, find the order of the highest order bit where they differ, and walk the postfix position and padded patricia key (infix position) in parallel.

6.3 Adding to a sequential and complete Merkle-patricia tree

Well that solves the problem of a postfix tree, but how do we apply this to solve the problem of the number of items not being a power of two?

6.3.1 A sequential append only collection of postfix binary trees

id Immutable append only file as a collection of balanced binary Merkle trees in postfix order Immutable append only file as a Merkle chain

This is a Merkle binary dag, not a Merkle-patricia tree. I found that generalizing Merkle patricia trees to handle the case of immutable append only files led to more complicated data structures and more complicated descriptions. It has a great deal in common with a Merkle-patricia tree, but a Merkle vertex is its own type, even though it convenient for it to have much in common with a Merkle patricia vertex. Putting it into the patricia straightjacket generated a lot of mess. A Merkle patricia vertex is best handled as a special case of a Merkle vertex, and a Merkle patricia tree as special vertices within a Merkle dag.

The superstructure of balanced binary Merkle trees allows us to verify any part of it with only O(log) hashes, and thus to verify that one version of this data structure that one party is using is a later version of the same data structure that another party is using.

This reduces the amount of trust that clients have to place in peers. When the blockchain gets very large there will be rather few peers and a great many peers, thus there will be a risk that the peers will plot together to bugger the clients. This structure enables a client to verify that any part of the blockchain is what his peer say it is, and thus avoids the risk that peer may tell different clients different accounts of the consensus. Two clients can quickly verify that they are on the same total order and total set of transactions.

Edges of the graph are represented by hashes, and thus can only travelled from right to left, Vertices are represented by their hash, and in their canonical form contain child hashes and their full padded Merkle patricia key within a tree as an arbitrary precision integer, and thus their implicit postfix position within a tree, which identities provide implicit edges that can be traversed in any direction with respect to a given total order, while hash edges are agnostic of total order, and are what we construct the consensus about total order from.

The bottom most part of the structure consists of data structures that do not have a canonical order. But when we are figuring out how to order them, we have to construct vertices on top of them that do have a canonical order, where each vertex contains a hash commitment to a total past in total order and a patricia key representing its position in the total order.

When the chain becomes very big, sectors and disks will be failing all the time, and we don’t want such failures to bring everything to a screaming halt.

And, when the chain becomes very big, most people will be operating clients, not peers, and they need to be able to ensure that the peers are not lying to them.

If our initial tree has has a size of zero, this is the same as creating a sequential and complete Merkle-patricia tree.

Since the whole point of a Merkle tree is immutable entities, we seldom want to insert, delete, or update anything but the right hand edge of a sequential Merkle-patricia tree, and normally only want to insert on the right hand edge.

Thus a sequential Merkle-patricia tree is not exactly a block chain, since each block does not contain the hash of the previous block. If it did, you would potentially have to receive and calculate a lot of hashes of hashes. to ascertain that block one thousand did indeed chain to block five hundred. This structure means that clients can calculate the validity of the block chain for those parts of it that contain transactions that concern them, and know that everyone else doing similar calculations are getting results that show the same consensus as they are getting, thus calculating the validity of block chain is distributed to all clients without all clients needing to deal with the entire block chain.

If all the peers get together to screw over one client or few clients, that client is going to have cryptographic proof of misconduct. On publishing that proof, large numbers of people are likely to blacklist those peers, resulting in a fork in the blockchain. We could automate this process, with everyone automatically disregarding the signatures of peers for which a proof existed that they had changed the rules in a way inconsistent with the rules implemented in a client, so that if nine tenths of the peers change their software, and nine tenths of the clients do not, we automatically get a fork with nine tenths of the clients on one block chain with a tenth of the peers, and one tenth of the clients on the blockchain with nine tenths of the peers.

This architecture allows a client peer arrangement where to pervert the blockchain, you have to synchronously pervert everyone’s software everywhere, or most people’s software, whereas with bitcoin, a few big miners can pervert the blockchain.

Bitcoin will fail because power over the blockchain lies with a few big miners, and governments will eventually twist their arms, or themselves get into the business of mining. This is inherent in the scaling problems of bitcoin. Back when everyone was miner, and everyone had the whole blockchain on their machine, power was distributed in a way that ensured that the blockchain was conducted according to consensus rules, but as the blockchain gets bigger and bigger, fewer and fewer people host the complete blockchain, and fewer and fewer people mine. So we are back to the situation we wanted to escape, where a rather small number of people have power over other people’s money.

A client has partial trees for all his transactions in the blockchain, and if all clients check their own particular part of the block chain, the entire blockchain is checked.

Well this is a great layout if we have data structure that fits entirely in memory, and it is a great layout if we have an enormous mutable file and are appending some bits to it, and have log n chunks mapped into memory leading to the part where we are appending stuff.

But if we are frequently constructing partial hash trees here and there, the trouble is we log n non localbits things to look up. If we are putting a big Merkle-patricia tree in a database, better to infix order by the size of the bit string, then the content of the bit string filled out to the long word size with infix order bits, or if it does not fit in a long word, as a blob. (sqlite sorts nulls first, then integers, then strings, then blobs, and if you write something as an integer and read it as a blob, gets converted a string representing a decimal)

Suppose we want a literally immutable append only file, representing a sequential patricia Mekle tree, where we append each completed binary tree?

Well in that case the high nodes have to follow each completed subtree, and our postfix order would be:

000 001 010 011 100 101 110 111
00 01 10 11
0 1
“”

Thus, to construct the postfix order for the leaf position in the tree using right side nodes, leaf n has postfix order (n<<1)-std::bitset<64>(n).count(). Which is considerably more complicated than the infix order for leaves with nodes uniformly intermixed with them.

Further, the not a power of two case is more complicated. To figure out which nodes at the right side are skipped, have to have the infix order of the node. Then we pack all the non skipped right hand nodes following the last leaf node. There is no simple way to identify a skipped node from the postfix order.

But the postfix order has the huge advantage for enormous data structures that when you are constructing, access is sequential and forwards only, and if you are making an ever larger file, the additions are append only.

But this is essentially a plan to build our own database, presumably with a second file containing a copy of only higher nodes, and a third file containing a copy of higher nodes still, until one gets down to copy of the high nodes that fits entirely within memory. X64 has 64 byte blocks in its top level cache, and can virtual seventy terabyte files into its copious address space. 4kilobyte blocks for disk access tend to be fastest but 64 kilobyte blocks are only marginally slower, though in some algorithms they waste more memory. Microsoft recommends 64 kilobyte disk blocks for servers. This suggests a structure in which every time we have a new two byte word in the bitstring, we have a different offset corresponding to a different area on disk and memory, a hierarchy of files each 256 or 65536 times smaller than the other.

A radix 256 Merkle-patricia tree on top of a radix 2 Merkle-patricia tree would be referencing sixteen kilobyte blocks, which sounds like it is near the sweet spot, in which case the cache file for the equivalent of the bitcoin blockchain will fit into a gigabyte of ram.

It is premature to think of designing this. After two years at bitcoin volumes, our blockchain will be two hundred gigabytes, at which point we might want to think of a custom format for sequential patricia trees as a collection of immutable files, append only files which grow into immutable files, and wal files that get idempotently accumulated into the append only files.

For the first level nodes, the ones directly above the leaves, the postfix order is 000🡒00010, 001🡒00101, 010🡒01001, 011🡒01100, 100🡒01001, 101🡒10100, 110🡒11000, 111🡒11011. We append two zeroes, subtract the count of the bits of the bitstring, and add two.

Or equivalently, we take the infix order of first level node, subtract the count of the bits of the infix order, and add two.

For the second level nodes 00🡒00110, 01🡒01101, 10🡒10101, 11🡒11100, we append three zeroes, subtract the count of the bits, and add six to get the postfix order.

For the third level nodes 0🡒01110,1🡒11101. We append four zeroes, add fourteen, and substract the count of the bits to get the postfix order.

For the fourth level node, null🡒11110, we append five zeroes, add thirty, and presumably subtract the number of bits to get the postfix order.

The general formula appears to be add 2^(level+1)-2-count)

Reversing the postfix order to get the bitstring and the level seems rather hard. You have to truncate, then find x: (a+x).count()=x. And since there is no clean and elegant way of finding count()) it is not likely that there is a clean and elegant way of finding x. But it is very easy, given the level, to find the parent, the children, and the sibling, since these are at fixed offsets. Looks like when iterating through structures in postfix order, you have to keep the level and the bitstring implicitly or explicitly around, whereas with infix order there is a clean and simple relationship between the infix order and level and bitstring.

To reconstruct the bitstring from the postfix order, the fastest way is probably to construct it one bit at a time by conceptually walking the tree from the root until we match the postfix order, not recalculating count every time, but incrementing it every time we move right in the tree, and representing the level by the power of two that we will add to the bitstring when we move right.

6.4 incomplete.

We take a list of hashes and their node infix orders, the offset for sequential trees, and the bit strings for left trees, and stuff them into map mapping node ids to hashes, using a map that allows random and sequential access. The representation of sparse and incomplete tree is similar to the representation of a sequential and incomplete vector. We provide the nodes that allow the construction of the chain of hashes from the object to be authenticated, but we do not provide their children.

6.5 Sparse.

In a sparse Merkle-patricia tree, we are not going to do bit bashing on the key, because it is likely inconveniently large, and because we are likely to be counting the height from from the left. But we might do it for code re-use reasons, where the keys are integers of moderate size.

It is customary to define the least significant bit as bit zero, which is the convention I have followed in the description of sequential patricia Markle Trees. So in a patricia tree with node height measured from the right hand side, nodes have a height, which is zero for leaf nodes, and variable for the root node. In a sparse tree, one is apt to measure the from the left, so nodes have a depth, not a height. The depth of the root node is always zero, and leaves have a variable depth. To make the code templatable, will need a patricia tree type to have a member static constexpr bool left_edge = true; for trees with node height measured from the left edge, and false for trees measured from the right edge. The depth of bit three of a byte in position seven in an array is 8*7 − 3 The node depth is equal to the position of the bit that selected that node. If a node is the parent of all keys with the most signficant bit set in the start byte of the key it has depth one.

The standard algorithm for entering a new item assumes you are using the patricia tree as the infix order, but databases and map files do not do this. When we are big and successful, will write a version of sqlite that supports sharding and Merkle-patricia infix orderes.

And because the tree is sparse, it is probably coming in somewhat non sequentially, with updates in random locations, though block updates are likely to be sequential.

Nonetheless, the analysis of sequential blocks suggests we will get more locality of reference on disk and memory, if we put all the patricia nodes, leaf and vertex, in one big table, with the key of a vertex node locating it close to the leaf node that necessitated its creation, though we will need the bit offset to be part of this table. The height means we don’t need keys of different levels be guaranteed unique, because when we are after a particular level, we use the height, likely measured from the right hand side, to select the correct candidate in the improbable event of a collision.

The bitstring of a node infix order is going, in practice to sorted on whole bytes or whole words, rather than its exact bit length, resulting in collisions, therefore we have sort on the bitstring and height, or, as in a sequential patricia tree, pad the bitstring with a representation of the height. Which for long keys is going to be a bit verbose, though hardly a show stopper. We could pad the bitstring with the difference between the height, and height rounded to the nearest number of whole words, but this would have the irritating result that 256 bit hashes become 257 bit hashes, unless we truncated to 255 bits, which is harmless but nonstandard and apt to result in complications all over the place. And we would still be looking at the height, if only to know the number of bytes or words.

The analysis of sequential hash trees suggests that it is going to be more efficient to have all nodes of all levels in the same table, possibly including leaf nodes. We could store the complete keys of parent and its children in a node, but this is rather redundant. If we take the bit field offset from the left, rather than from the right as in sequential hash trees, then we truncate the key, conceptually at the bit indicated by the bit offset, and only have the parts of the keys of the children that are beyond the bit offset in the node.

In a sparse tree, we are just looking up the key to find the thing referred to, which is probably referenced by an oid, so it may well be a pointless indirection to have leaf nodes in the same table as their parent node, which already has both the key to the child and an oid to the child. Maybe we just have a leaf flag that says “this oid is not an oid to yet another internal node, but an oid to something else, somewhere else”.

Two maps, one of them being the map that you are making a Merkle patricia hash of, and one containing the nodes. The nodes The nodes point to a other nodes, or they themselves contain the leaf value, the leaf value being the hash of the map key and object being hashed.

On the wire, we don’t have a representation of the sparse tree. The nodes get generated on the receiving end when we sent the objects themselves. But we do have representations of incomplete sparse trees, which are just a list of nodes with their hash values and keys, but not their child links, that you will need to construct the root node from a subset of the objects.

In rare and exceptional cases, the key of the root node by not be empty bit string, so when specifying the root node of a sparse tree, we have to specify its bit string. But when we send a partial tree, we assume the recipient does not have, and does not intend to build, the complete tree, and that he receives the root node by some special path that authenticates it.

A node with children will be inherently different to a node without, because it is going to need the data to construct the infix orderes of its children, which is a good argument for putting them in distinct tables. If we put leaf hashes and vertex hashes in the same table, we are massively violating normal form. On the other hand, Sqlite3 supports this by having variant typing. For sparse trees, not usually a good idea to put the children in the same table. For sequential trees, usually the way to go.

6.6 Merkle-patricia Tree of 256 bit values.

Typically these are going to be a tree of public keys and or hashes.

Of course a database cannot handle infix order values that are not a multiple of eight bits, and is likely to prefer a multiple of sixteen bits, so we will make the key field of the infix order value fields the prefix bits that all the leaves of a node have in common, with the bit string 0 appended, rounded up to whole number of bytes with additional one bits.

The depth of a node may be determined by taking the number of bytes times eight, and substracting the bits from the last zero leftwards. leaf nodes will be special cased.

If leaf nodes are part of the node map, we need 257 bits – but we are also going to need special case handing of leaf nodes, which is effectively the 257th bit.

Since our infix order has two fields, it is efficient to reference nodes by OID, rather than by key, though you frequently have to find a node by its key field and its depth field. So each internal node will have its depth, (or rather a value reversibly constructed from its depth) and its key (or rather a value reversibly constructed from its key and its depth), and the OIDs of its two children. For a leaf node, the child OID fields have a different meaning.

To reconstruct the hash of a node, we look up the two children by OID, and hash their hashes and the bit strings that differentiate them from their parent and each other, or rather the the bytes containing the bit strings that differentiate them from their parent and each other. We hash the hash and bit string of one node first, then the hash and bit string of the next node, to avoid the possibility that one concatenated pair of bit strings might equal another pair of concatenated bit strings.

If we are inserting a bunch of new map entries with the intent of recalculating the new value of the root node once the insertions are done, we sort them into a temporary table in order of their map key, and after each insertion calculate the new values of the parent nodes up to but excluding the parent node that is the parent of both the item recently inserted, and the next item to be inserted. If no next item to be inserted, up to and including the top.

Pretty sure that this guarantees all nodes get recalculated as necessary and only as necessary, but we will need debug check to make sure we have not missed a fence post problem. Have a forced re-evaluation of the entire tree to check for internal consistency.

When we are generating blocks on the block chain on an ordinary computer with an ordinary internet connection. (16mbps down, 3mbps up), and blocks are typically 300 seconds per block, that implies that blocks are smaller than a gigabyte, so we can prepare blocks entirely in memory, either as an in memory sqlite database, or in a custom format using links, rather than OIDs, and key values of the infix order in sixty four bit integers, rather than bytes. Once a block is committed to global blockchain, it however has to acquire global oids, hence goes in a different database, probably in a slightly different database format, since all the oids for a given kind item are sequential. But the hash of a block is calculated in a way entirely independent of the oids that will be assigned to it. Perhaps when it is committed, it gets capped by additional data telling us what oids were assigned, but including the oids when calculating the root hash of a block under construction, and therefore suffering frequent inserts in random positions, would make it too costly to insert new data.

Such information about the type and number of objects in the block, and the block’s position on the block chain, should not, on the dry principle, be part of the block or the block root hash, though it might well be useful to include such information in the authentication block (authentication blocks alternate with blocks containing actual updates on the block chain) When combining two proposed blocks, it is good to start by merging the larger block into the smaller block, and as a check that everything is working as it should, make sure that the counts are consistent, this being a low cost check against glitches in calculating the hash. We don’t need to incorporate such a consistency check into the blockchain itself until the authentication block, which attests to the previous root of the block chain, which attests to the previous root and previous sequence number of the block chain, thus necessarily contains its own sequence number, unlike a data block.

The hash of a block should be independent of its oids, but we do not want a glitch to result in the oids of a peer getting out of sync with its peers and that peer attempting to sail on unawares that it is out of consensus, therefore the hash of an authentication block should contain a hash that depends on the hash of the final authoritative oids of all blocks in the preceding block chain, or all oids affected by the final block.

On the dry principle, objects inside a block should not contain information as to what block they are in, and a block should not contain information as to what blockchain it is in nor what block number it is. But a block should be accompanied by context information as to what block number it is and what block chain it is in, and the root of the blockchain should be accompanied by context information saying what block chain it is and h1ow many blocks are in it.

An authentication block attests to the recent previous block of the block chain in isolation from context, and therefore links to its own context, but in general, the meaning of a hash is given by the object that hashes to it. And we somehow have to find or construct the object that hashes to it. The object that hashes to it therefore needs to contain information that implies its size and how it is to be hashed, what it means. With a hash, we ad hoc provide context information to hint where to find the data that generates the hash. But the chain to more and more global meanings is on the inside of the information being hashed not on the outside. Every transaction provids the oids of the unspent transaction outputs that it refers to and the signatures authenticaded by their public keys. When an object contains an unhashed reference to data outside itself, it must hash the broadest context in which that data occurs. Thus a transaction must contain or imply the oid of a previous root hash of the blockchain containing all these transaction outputs, and the hash of the transaction must hash that root, thus the each transaction authenticates and is authenticated by previous blockchain.

Repeating. Where an object contains a reference to an object outside itself which reference is not itself a hash, which is to say, when the reference is an oid, then its hash has to chain to that object, preferably by chaining to the largest object in which that object occurs. The oid is merely a hint to finding data that chains into the hash of the object, and the derivation of the correct hash depends on correctly identifying the thing that the oid refers to, the data structure into which the oid is a pointer. The oid could merely be the eleventh hash in a sequential map of seventeen hashes, and to know what the object means, you have to know, or correctly guess from context, what it is. So for the hash of the object to uniquely determine the meaning of the object, it would have to hash the root of the Merkle patricia tree of that map that defines the oid. Objects containing oids have to chain to the context that gives meaning to their oids, which context chains to the larger context, the largest context of them all being a recent past root of the global blockchain, which provides unique oids global to the blockchain, for each type of oid that is directly in the blockchain. And to the extent that the context providing oids provides separate oid sequences for each type of object within it, and thus an oid can refer to multiple things of multiple types within it, then the type of the object or other data within the object has to imply the type of the oid.

In short, if the object mapped by the map a Merkle-patricia tree references another object, it should it should either reference it by hash, by the hash of a patricia tree containing that object and the key within the referenced patricia Mekle tree containing the referenced object, or the hash of the object should depend on the hash of the objects that it references, or the hash of the Merkle-patricia tree containing that object, in order that the hash of the referencing object, and thus the hash of the patricia Markle tree containing the object, uniquely identifies everything that the object references.

7 Implementation of the specific Merkle-patricia trees we will need.

We have a sequential tree of attestation blocks containing proof of consensus, each of which chains to the previous block, and each of which contains the root hash of the tree as it was before this block was added.

We have a sequential tree of transactions, with each attestation block attesting to the state of the sequential tree of transactions as it was at that time, so that each participant knows that they agree with the consensus on transactions, and they agree on the specific sequence number of each transactions. They also agree on the current state after all transactions were applied, they agree on the sparse tree of unspent transaction outputs.

In order to generate that order, each full peer have to construct a sparse tree ordered by hash code. When they reach agreement on the root of that hash tree, they then proceed to agreement on the sequential list of transactions, and the current state of unspent transaction outputs.

7.1 Sparse tree of hashes

Because the bit string is going to be generally quite short, or else exactly two hundred and thirty six bits, we treat this as a list of bitstrings, with each list composed of lists of bitstrings, with each list and sublist prefigured by a bitstring corresponding to their common prefix. We don’t try to save space by omitting the (typically short) common part of the prefix, and we pad the bit string to a whole word with the padding sequence 1000000.…

Each node contains a bit count and a count of its descendants. We special case bit strings corresponding to whole words, and we special case that special case for the case that bitstring is exactly two fifty six bits.

We don’t have a representation for the empty list, but we do have a representation for a list consisting of a single item. So we build the tree by starting with a single item, and then adding items to the tree. The case where we have no items is handled separately and specially. We never have an empty tree. A single item is itself a tree, and you can add a tree to a tree. A tree is, of course composed of trees.

Because we are not prefix compressing the items in the tree of hashes, our skip links do not contain the skip data, and the hash of a tree with more than one item it does not hash the common prefix, but simply hashes the hash of the subtrees. The hash of a leaf node, is, of course, itself.

The root of this tree is recorded in the blockchain, but we do not need to store the tree itself in the blockchain. It lives only in memory. The sequential tree of transactions that it gives rise to lives on disk.

Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 International License.