Submitted this CHIP to the list at bch.info:
Great work! Having worked on the Group CHIP and studying the costs of proposed genesis setup that requires hashing operations I found myself wondering why aren’t we counting hash operations and ended up reading this CHIP a few times to give me an idea of the costs involved.
This proposal is well thought out and allows much more utility without significantly increasing node operating costs, and it would be great if it could make it into 2023 upgrade window.
Btw, I also discovered this analysis of digest and signature verification costs, just for reference: Benchmarking Hash and Signature Algorithms | by Michael Zochowski | Logos Network | Medium
The VM limits have come up again in recent discussion now that the VM is much more powerful after the CashTokens upgrade. Both myself, Pat from mainnet-js & @bitcoincashautist have ran into those limits.
For now there has not really been an effort to document these case, but going forward it would be good if we did. Copying from the Jedex topic:
It would also be relevant that we discuss the complexity of splitting contract code into multiple utxos compared to just raising the limit, for example with regards to Jedex. Is 8 different contracts to optimal design or just the necessary design?
Thanks for bringing this back up!
For the most part, it’s actually the optimal design too: the size of each transaction at each particular interaction is minimized if only the necessary code makes it into that transaction.
So for example the Redeem Order transaction only includes the Payout Covenant
, a ~110 byte contract that essentially hard codes all the information needed to accept order receipts and payout what users are owed after a settlement period. Importantly, it can exclude all the hundreds of bytes of code in the DEX Covenant
, Market Maker Covenant
, Thread Covenant
, etc. as each of those need only be included in the specific types of transactions requiring them.
For the v1 demo of Jedex, increased VM limits would likely only offer some minor efficiency improvements in the Market Maker Covenant
, depending on what’s ultimately implemented there (e.g. if limit orders were also implemented). As it stands, that covenant would probably have to be broken into a few coupled-UTXOs in which each UTXO performs part of the computation and stores its result in a commitment which is then read by the next UTXO, etc. Depending on the complexity ultimately required there, that might save a few hundred bytes per settlement period. (But note, the market maker covenant is already excluded from most end-user interactions, it only gets pulled in to perform the “joint execution” every e.g. 2 hours.)
I think a particularly important use case for increased VM limits is to enable introspection of slightly larger parent transactions. This becomes really useful for contract constructions with escrows/bounties for unusual misbehavior like Treeless ZCEs.
The only real downside of Treeless ZCE right now (but not a showstopper) is that transactions are limited to 3 inputs. With the 10,000 byte stack limit in this proposal, treeless ZCE transactions could easily fit 20+ inputs. (Well-designed wallets should consolidate with things like CashFusion anyways, so just getting that limit to ~10 or so should handle even the most extreme edge cases.)
Likewise, there are tons of other ZCE-like constructions that rely on pre-commitment to not attempting double spends. A lot of those constructions are made possible and/or much more efficient with a little more headroom on stack element size and operation count.
Overall, better-targeted VM limits aren’t personally a top priority of mine, but I still think getting this right is long-term important. I’d love to see it happen in 2024 or 2025.
My biggest uncertainty with the current proposal is the stack memory usage limit – it’s relatively expensive and inelegant to measure/enforce, so I’d especially love to see other interested developers research and review that part.
If we had stream hash function opcode(s), then you could split the TX across multiple inputs, but then parsing it would become complicated.
I’d like to put something on the brainstorming table. It’s so cumbersome to slice up the pushed TX to extract the fields, you have to implement a TX parser from Script. Why not have an opcode to “load” a TX from stack, and then use introspection opcodes on it, something like:
<push raw tx> OP_1 OP_TXFROMSTACK OP_INPUTBYTECODE
The OP_FROMSTACK
would be like a prefix to introspection code-points. If used, then an introspection opcode MUST follow (except UTXO codes which should fail eval). In the example, the VM would parse the stack item as a TX and extract bytecode of input with index 1.
Another benefit of focusing on this issue before other VM-related upgrade proposals: we’ll get much better real-world usage information to guide to future improvements like bounded loops or any changes related to higher-precision math emulation.
Loops are essentially a transaction size optimization: anything that can be done with loops can also be done by automatically unrolling the loop in the transaction compiler (already done today by some BCH VM compilers).
Better targeted VM limits would allow the development ecosystem to experiment more with various constructions and help to validate specific proposed APIs for “native” bounded loops. (Whereas right now, developers can already achieve the same behavior as using loops, but only by breaking computations up across multiple inputs such that each remains within today’s poorly-targeted limits.)
Similarly, with better limits, we’ll also be able to see if/how developers actually use higher-precision math emulation in the wild, giving us much better data with which to judge the actual impact of any proposed optimizations.
Probably best to break any further discussion on this out into a new thread, but I’ll just add a few thoughts. I wrote a little about this sort of “Serialized Transaction Introspection” here too.
- After CashTokens, I think the use case for parent TX introspection is very, very limited: essentially just “punishment transactions” in anti-double-spending schemes that regulate the spending of P2PKH contracts (e.g. Treeless ZCE). Other kinds of contracts can allow participants to commit to a more-efficient-to-enforce message structure, so the same sort of punishment transactions don’t require inspection of parent transactions. (The downside is that these schemes require setup, whereas treeless ZCE requires only some confirmed P2PKH outputs without any further setup.)
- I’m likewise very uncertain about streaming hash function opcodes – I don’t know of any use cases beyond working around absurdly-small stack item size limits, and they’re a pretty inefficient solution at that (more assigned VM opcodes, longer contract bytecode when used, more cycles during execution, messier to optimize with caching/memoization).
- Currently, the VM is agnostic to the actual structure of encoded transactions; an operation parsing serialized transactions would necessarily mix a lot of complexity back down into underlying VM implementations. Depending on design, it could also prevent future improvements to transaction format, as we can’t rule out that some P2SH contracts might be broken by any observable change to the “VM API”.
- I don’t think this would actually reduce the number of bytes used, even for these very unusual parent transaction inspection:
<part> <part> OP_DUP [analyze] OP_CAT
has ~equivalent overhead as<raw_tx> OP_DUP <some_item_number> <some_item_type> OP_TXFROMSTACK [analyze]
. I think developers would get the same value from a user-space library that simply makes working with the transaction preimage a little more comprehensible. (Though again, I really doubt this will be common outside the treeless ZCE contract itself.)
Yeah I think this philosophy makes the most sense. Main motivation for these new opcodes is to squeeze the functionality into the current per-input limits so people can keep their sanity when designing contracts Also, with current limits, splitting a contract to multiple inputs results in lots of overhead since they need to require/verify each other’s execution - which spends a lot of redeem script size & opcode count budget.
But if limits were increased, then functionality could be achieved by unrolling the operations, so the only motivation for specialized opcodes would be to reduce TX sizes, which is an OK goal but would be premature optimization given the current context.
With this in mind, I believe this limits CHIP should be top priority for the next VM upgrade.
Interestingly enough, that overhead is much smaller than you might expect. E.g. for Jedex’s process tick transaction, the “market maker” can be split across any number of inputs – each input has a mutable token with a 1-byte commitment indicating it’s index in the sequence of contracts. Each must verify:
- That it’s at the expected input index (e.g.
OP_INPUTINDEX <4> OP_EQUALVERIFY
) and has the expected commitment at that index (e.g.<0x01> OP_EQUALVERIFY
) - That it’s covenant is reproduced (e.g.
<4> OP_UTXOBYTECODE <4> OP_OUTPUTBYTECODE OP_EQUALVERIFY
) - That the UTXO token category is equal to the input token category (e.g.
<4> OP_UTXOBYTECODE <4> OP_OUTPUTBYTECODE OP_EQUALVERIFY
) - And that it’s portion of the computation is fulfilled, (e.g.
<3> OP_INPUTBYTECODE <8> OP_SPLIT <8> OP_SPLIT [...]
to split off the result of the last computation and do something with it.)
So there’s significant overhead in terms of transaction size due to other components (~60 bytes per output, ~40 bytes per input), but from the perspective of the contract code, it’s only ~12 operations/~20 bytes of overhead in each “code split” (of 201 operations and 1650 bytes, respectively).
Of course, I’d prefer bring that overhead to 0 (in most cases) by targeting VM limits to be less wasteful. Just wanted to offer more background here and note that this is partially a tooling problem; our tools need to get better at combining multiple contracts into atomic transactions for other reasons, regardless of VM limits, as there are lots of cases which require this coupled-contracts primitive.
Just want to highlight what I think is next here:
I’d love to see node implementation developers experiment with how the VM can most simply and efficiently limit stack memory usage.
The current proposal is essentially: sum_all_stack_items + stack_depth <= 130,000
. That limit is rounded up from a max of 125,320 + 242
in what we believe is the most memory-expensive script currently possible.
The issue here is that we need to check memory usage after every operation in which memory usage might have increased.
Instead of the simple, incrementing counter that is currently used (++maxOperationsPerEvaluation < 201
), we have to accumulate the sizes of all stack memory items, then add the length of the stack.
The calculation is not terribly expensive, and it can be optimized in many ways – e.g. many operations can skip the check, stack memory usage at >6 depth can be cached across operations, etc., but the current formulation still strikes me as less elegant than past improvements to the VM.
It’s possible that this is the best we can do, but I’d like to see some other people review the issue. Some questions:
-
Stack depth: Do we actually need to limit stack depth?
MAX_SCRIPT_SIZE
is 10,000 bytes (and for standardness,MAX_TX_IN_SCRIPT_SIG_SIZE
is only 1,650 bytes), so withOP_3DUP
, we’re looking at a maximum depth of ~30k for transactions first heard in blocks (and only ~5k for transactions first heard over the network). Can we save cycles and complexity by just skipping that? (Are we ok with the protocol requiring stack +OP_ROLL
implementations to be reasonably optimized?) And if not, do we need to multiply depth by some number >1 to represent the realistic overhead of a stack item? - Memory usage: Is there a more elegant way to limit stack memory usage? Rather than checking total usage after every operation in which it may have increased, is there a more efficient, less direct strategy?
If other developers with an interest in low-level optimization want to take a look at this, I’d love to hear your thoughts!
CC: maybe @cculianu, @tom, @fpelliccioni, @groot-verde, @Andrew-128, @markblundeberg? If you have thoughts on how to efficiently measure stack memory usage, comments are much appreciated!
There are only a limited number of places that the stack and alt-stack are changed. For those I’d simply add a couple of lines that subtract or add the item to be put on the stack.
The stack-depth itself is already being tracked, so you can ignore that.
The actual data on the stack can simply be counted with an extra integer which gets changed on every stack-change with the size of the thing being popped or pushed.
I don’t believe that to be an issue. A little abstraction will make that readable (and maintainable) and the actual CPU load is not the issue. IO is very much the problem for scaling, not doing this extra house-keeping.
I want to tie this effort into another long standing one.
Measuring how much a script takes to executes should in my opinion have an effect on the cost/reward balance of a transaction. There is only a very crude ‘reward’ currenty which is the fee, but to benefit an open market that measurement should improve.
So, what I’m thinking is that a transaction spending a UTXO which is very expensive to spend should have its ‘cost’ increased.
Things like Coin-Days-Destroyed and indeed fees paid could be used to stand on the other side of the ledger and supply the miner reward for this specific transaction.
This means that the black/white proposal for VM limits becomes a little more gray-scale where the limits are still hard limits, but getting closer to them will increase the cost of usage and hopefully avoid overloaded full nodes.
I definitely am a very enthusiastic supporter of this idea and any practical initiative to move such a thing forward. Ideal designs would involve “cheap-to-measure” approaches when evaluating scripts to decide what the required fee / fee discount would be.
As for this proposal itself. I have no clear idea of what the impacts would be on perf. – the only way to know is to construct the most pathological txn imaginable with these limits in place – and perhaps a whole 32MB block of such txns, and measure the execution/validation speed of that versus status quo.
Anybody want to volunteer to construct such a pool of 32MB of txns?
If you hand me a big stash of pathological txns that abuse these limits to the extreme, I can modify a test version of BCHN to use these limits and we can benchmark how bad/not bad it all would be.
I wanted to add some thoughts about OP_RETURN limits and was about to post under the old thread, but I think it’s better to post here.
Redeem script reveal before first spend, or applications like https://memo.cash. Btw, the other day I had a nice chat about scaling with Memo’s founder Jason Chavannes, and I learned he’d like a bigger OP_RETURN, too, and said he’d support a CHIP for it so I got reminded of this CHIP.
That has the inconvenience of having to create a dust UTXO ahead of every data push. Also, if you use the data by contracts then you have the added complication of having to prevent others from stealing the input, e.g. if it a plain commit-reveal with no authentication.
Here’s something new to consider about the where: bulk of the data would ideally be pushed before the 1st non-opreturn output. Why? Because this allows keeping more compact proofs of TX’s outputs: hash the serialization until start of outputs and store the sha256 & length of hashed data instead of the raw TX part. Then later, when you want to prove that outputs are part of the TX, you can do the sha256 length-extension with the outputs & locktime, and apply one more sha256 to get the TXID.
How about we allow a larger limit only if OP_RETURN or an uninterrupted sequence of them is placed before the 1st non-OP_RETURN output, while those placed in any place after the 1st non-OP_RETURN output would have tighter limits?
Ok, the above was quoting the old thread, now some fresh thoughs about stuff from this one.
You could do a sha256 length-extension from within ScriptVM, and have smaller proofs of some TXs outputs Don’t know whether it’d be useful, just pointing it out for info.
- Contract-based refund transactions, too. Contract could inspect the parent’s input and pay refund back to the originating pubkey.
- Non-interactive tokenization of AnyHedge contracts (as opposed to interactive: needing to make tokenizer contract be part of funding TX)
- Combined with something like OP_BLOCKHASHMATURED, we’d get an easy way to “introspect” any historical TX - but the spender would have to provide it.
Btw, did anyone wonder why code-point 0x50
was skipped (left undefined), and it belongs into data push group? What if we made the opcode special - 0x50
to be: OP_PUSHRAWTX <raw TX>
. It would be valid only if the raw TX
would be of valid TX format. Then, user could slice it up to extract what he needs, knowing that the format is OK and he can skip doing some checks on the blob.
Let’s get moving with this CHIP! Many builders (@MathieuG & @mainnet_pat, and I imagine a few more) have already expressed their hopes about this CHIP moving forward. Let’s define actions that need to be taken if it is going to be ready for 2025 activation.
Would the best course of action be to just implement the limits and then do some benchmarks - and if the timings are all below the current worst case then we could safely implement it?
IMO we should aim higher, make these test cases be part of future-proofing tests with 256 MB.
The thing about hashing is … really the amount of data you hash is more costly than the number of hashing operations. Consider hashing 80 bytes 660 times versus hashing 10kb 660 times…
I have a feeling that the stack item 10kb thing combined with hash abuse can be a problem… maybe?
EDIT: What size objects do you “typically” anticipate hashing? 32-byte things? 64-byte things? Perhaps an additional constraint would be: you get up to 660 hash ops, or maybe ~20kb of data you can hash, cumulatively, per script evaluation. Whichever limit is hit first. That’s one way to ensure the amount of data hashed doesn’t get large?
There’s some quantization due to padding to blocks, I think hashing 1 byte costs the same as hashing 32 bytes, but hashing 33-64 bytes costs 2x the hashing of 1-32 bytes, and so on. So if we limit total data then someone could be more abusive by doing a big number of smaller messages.
Correction on hash block size (source):
With most algorithms (including MD4, MD5, RIPEMD-160, SHA-0, SHA-1, and SHA-256), the string is padded until its length is congruent to 56 bytes (mod 64). Or, to put it another way, it’s padded until the length is 8 bytes less than a full (64-byte) block (the 8 bytes being size of the encoded length field). There are two hashes implemented in hash_extender that don’t use these values: SHA-512 uses a 128-byte blocksize and reserves 16 bytes for the length field, and WHIRLPOOL uses a 64-byte blocksize and reserves 32 bytes for the length field.
Sounds good, yeah.
Note: I chose numbers out my buttocks. Ideally one benches this on say BCHN and gets an idea of what the tradeoffs are… I just raised this issue as a potential thing to keep in mind…
EDIT:
There’s some quantization due to padding to blocks, I think hashing 1 byte costs the same as hashing 32 bytes, but hashing 33-64 bytes costs 2x the hashing of 1-32 bytes, and so on. So if we limit total data then someone could be more abusive by doing a big number of smaller messages.
Yes. Good point. Perhaps hashing 1 byte should be “costed” as if it were 32-bytes. That is, each chunk of bytes you hash is rounded up to the nearest 32-bytes in terms of costing… or something.
Something like that…
Just feel the need to point out that the blocksize of sha256 is 64 bytes, not 32.
Oh yeah good point so if one were to round-up the cost def. should be multiples of the block size 64 bytes… sorry about that; thanks for pointing that out…