mimblewimble team mailing list archive

Thread
Date
Re: Sum tree storage

To: Ignotus Peverell <igno.peverell@xxxxxxxxxxxxxx>
From: Myrtle Warren <MoaningMyrtle@xxxxxxxxxxxxxx>
Date: Tue, 25 Jul 2017 16:20:07 -0400
Cc: "mimblewimble@xxxxxxxxxxxxxxxxxxx" <mimblewimble@xxxxxxxxxxxxxxxxxxx>
Feedback-id: 6ZHuTse7Ql9gFGA-925dUmsO8F3XzbhjzJZNpMbtqP-0IFRXxlk0cGxoH_bSOnx23eb1hYOQfUD-yWJBZ4N-gg==:Ext:ProtonMail
In-reply-to: <mst4TEs7NmwQuWiogjjTVrtbmCGIN4bhWYHTsm9bUYZd0Nknr4pUqYYA9nLO-1zRM4wJrGY9qt4aJbhe2OGKZNXW9gW0gOzmsJ8GT8BGusE=@protonmail.com>
Reply-to: Myrtle Warren <MoaningMyrtle@xxxxxxxxxxxxxx>
Hi Igno,
Thanks for the quick response!

>> If not, doesn't this imply that the root chunk needs to be mutated all
>> the way to the root itself on any addition?
> Mostly, yes. The whole tree is pre-computed (to give credit, that's Merope's good work) up to the root. You're might be overestimating how much adding one more node requires though. In terms of tree structure it only changes the "extreme right" (which is short in a MMR). When a new block comes in, the new data gets bulk-added, the right side of the tree is adjusted and the whole thing gets re-saved. There are likely optimizations to limit the size of the re-saving so it's not the whole root chunk, my plan was to see if they're overkill once I have a first implementation.

Definitely agree that it's not a ton of work, it just seems somewhat undesirable to mix something that looks like an optimization (saving the nodes involved in "bagging the peaks", to borrow Todd's terminology) with the core structure if it can be avoided. The nodes involved in that summing process, including (but not limited to) the root, do not share the append-only nature of the core MMR structure. The core MMR structure is well-optimized for sequential writes, and does not require any seeking on update- it seems desirable to maintain that. If the nodes involved in summing the tree are maintained in memory only it seems that you get most of the performance advantages without having to swallow the disk access cons, and you pay the cost associated with rebuilding that set only rarely- on complete node restart.
That was also the motivation behind my "incomplete chunk" note, to minimize the times you have to random access and/or rewrite any component of the MMR structure, so you can leverage its layout advantages as much as possible.
Regardless, definitely agree with the idea of trying something easy and then iterating if necessary. These are just thoughts that came up while looking at the specification.

> Correct. I think this is okay. On sync you'll get both the UTXO set and the full tree. Then you can validate that a) all UTXOs are in the tree b) there are no other UTXOs c) the root matches the root in the header d) the sums match (remember it's a sum tree). Between the root hash and the sum, I can't see how this can be falsified (from the header root hash) without breaking a few strong cryptographic primitives. As a matter of fact I think d) alone would be enough (although I'd love to hear it if someone thinks I'm wrong).

I definitely don't mean to imply any security repercussions, but this does break with some intuition I had around the MW design (for example, that a commitment to the inputs and outputs in a cut-through block consisting of the entirety of the history _is_ a commitment to the UTXO set, and is equivalent to the current state of the blockchain). That the cut-through transaction covering the history of the chain does not provide enough information to verify the state of the blockchain as committed to by the header (and covered by the proof-of-work) is something that I found surprising. I just wanted to note that making this explicit would have saved me some surprise- whether this has any repercussions for soundness is a coversation for people much smarter than I am :)
Thanks again for the clarification,
--MW

> And I agree with you, we should do another pass on that whole document once all of this is a bit more advanced.
> - Igno
> P.S. I've been made aware (thanks kanzure!) of some tangential and interesting research from Bram Cohen regarding proof of position and bitfields (TXO bitfields section in [1]).
> [1] http://diyhpl.us/wiki/transcripts/sf-bitcoin-meetup/2017-07-08-bram-cohen-merkle-sets/
>
>> -------- Original Message --------
>> Subject: Re: [Mimblewimble] Sum tree storage
>> Local Time: July 25, 2017 4:52 PM
>> UTC Time: July 25, 2017 4:52 PM
>> From: MoaningMyrtle@xxxxxxxxxxxxxx
>> To: Ignotus Peverell <igno.peverell@xxxxxxxxxxxxxx>
>> mimblewimble@xxxxxxxxxxxxxxxxxxx <mimblewimble@xxxxxxxxxxxxxxxxxxx>
>> Hi Igno,
>> Thanks for the documentation update! Comments and questions follow.
>>> In the example above, we have 2 chunks X[A,B] and Y[C,D] and a root chunk G[M,E]. The cutoff height H=1 and the root height R=3.
>> I'm a bit confused as to how this maps between implementation and storage. From the MMR document linked (https://github.com/opentimestamps/opentimestamps-server/blob/master/doc/merkle-mountain-range.md) it appears to me that the node labeled G would not be considered part of the stored set, but merely computed from the constituent direct children M and E. This appears necessary to ensure the append-only nature of the MMR, as addition of another leaf F would require G to change to become a set of [M, Z] where Z is the root of [E,F].
>> Is G then not intended to be stored, but rather merely symbolic of the root? And if this is the case, I think the documentation could use some clarity around what is stored and what is not (given that X and Y, being full trees, could be stored without issue). If not, doesn't this imply that the root chunk needs to be mutated all the way to the root itself on any addition?
>> ---
>> Given that (correct me if I'm mistaken) the storage format of the MMR is not a consensus-critical parameter but rather a detail of the implementation, I'm inclined to avoid any unnecessary bike-shedding. However, is it worth considering an incomplete chunk, rather than adding incomplete subtrees of height less than H to the root chunk? In the insert case, there is always at most a single incomplete chunk, meaning that there is no difficult seeking involved in an addition. The downside to this approach is that the incomplete chunk may contain multiple peaks, but due to the design of the MMR these peaks should be easy to identify. On the other hand, the use of an incomplete chunk removes the need for a painful compaction process where a portion of the root chunk must be cleaved and streamed to a new file. As doing so lazily may result in either an undesirable sparse root chunk file or a rewrite of a significant portion of the root file (up to the 3.3MB you calculated for the size of a chunk), it seems you would want to do this as soon as the subtree reaches the target height, causing potentially significant pauses during the block commit process. In exchange for a bit of complexity on calculating the MMR root (once per block proposed), you gain strict append-only behavior for additions. This incomplete chunk could be replicated in memory, alleviating most of the concerns around identifying peaks and calculating the new root.
>> ---
>>> Each object is one of two things: a commitment indicating an unspent output or a NULL marker indicating a spent one. It is a sum-tree over all unspent outputs (spent ones contribute nothing to the sum).
>> This is straying a bit from your topic of storage, but I could use a bit of clarification here: my understanding is that inserting is strictly append-only, but deletion touches every node above the spent leaf all the way to the root. Is this accurate? Eg. in the example tree in your document, if A is spent, X must change to commit only to B, changing M and in turn G. I'm basing this, in part, off of Peter Todd's TXO MMR proposal for bitcoin.
>> If this is the case, I think it should be made explicit, since it has repercussions for bootstrapping and validation. Consider if, on bootstrap, a new node receives cut-through block A covering the entirety of block history minus the cut-through horizon. If the UTXO commitment was a merkle-tree covering _only_ the unspent outputs, the root in A could be verified against the outputs contained in A, and the new node could be ready to append block A+1 and validate its changes to the unspent set; the root of A+1 could be validated with only blocks A and A+1. The chain state committed to by the header of A consists solely of the explicit inputs (coinbase inputs) and outputs of A. (Of course this design has severe repercussions for efficiently adding or spending outputs...).
>> On the other hand, in (my understanding of) the MMR-based design, block A does not contain enough information itself to recreate and verify its root; it requires knowledge of the structure of the merkle tree, less any fully pruned nodes. This must be provided by bootstrapping nodes. In other words, the chain state committed to by the block header and covered by the proof-of-work no longer consists solely of the explicit inputs and outputs of the cut-through block, but in a way also depends on the cut-through outputs and the order in which they were added to the blockchain.
>> This is not intended as a criticism of the design but merely something that I think bears clarification.
>> Thanks for any clarification!
>> --MW
>>
>>> -------- Original Message --------
>>> Subject: [Mimblewimble] Sum tree storage
>>> Local Time: July 24, 2017 11:03 AM
>>> UTC Time: July 24, 2017 6:03 PM
>>> From: igno.peverell@xxxxxxxxxxxxxx
>>> To: mimblewimble@xxxxxxxxxxxxxxxxxxx <mimblewimble@xxxxxxxxxxxxxxxxxxx>
>>> Hi all,
>>> Now that Merope has the sum tree design mostly figured out and implemented, I've started thinking about how to handle and store it as it grows in size. I've just pushed a draft of the design I'll be going for in the Merkle tree doc:
>>> https://github.com/ignopeverell/grin/blob/master/doc/merkle.md#storage
>>> Reviews and comments are welcome.
>>> As a side note mostly for Merope, externalizing the positions index means we'll need methods (prune, replace, etc.) on the tree that take the positions instead of the element itself. This is actually simpler. For unit tests we can have a simple wrapper that replicates the HashMap based index.
>>> - Igno
References

Sum tree storage
From: Ignotus Peverell, 2017-07-24
Re: Sum tree storage
From: Myrtle Warren, 2017-07-25
Re: Sum tree storage
From: Ignotus Peverell, 2017-07-25