CHIP 2021-05 Targeted Virtual Machine Limits

If we had stream hash function opcode(s), then you could split the TX across multiple inputs, but then parsing it would become complicated.

I’d like to put something on the brainstorming table. It’s so cumbersome to slice up the pushed TX to extract the fields, you have to implement a TX parser from Script. Why not have an opcode to “load” a TX from stack, and then use introspection opcodes on it, something like:

<push raw tx> OP_1 OP_TXFROMSTACK OP_INPUTBYTECODE

The OP_FROMSTACK would be like a prefix to introspection code-points. If used, then an introspection opcode MUST follow (except UTXO codes which should fail eval). In the example, the VM would parse the stack item as a TX and extract bytecode of input with index 1.

1 Like

Another benefit of focusing on this issue before other VM-related upgrade proposals: we’ll get much better real-world usage information to guide to future improvements like bounded loops or any changes related to higher-precision math emulation.

Loops are essentially a transaction size optimization: anything that can be done with loops can also be done by automatically unrolling the loop in the transaction compiler (already done today by some BCH VM compilers).

Better targeted VM limits would allow the development ecosystem to experiment more with various constructions and help to validate specific proposed APIs for “native” bounded loops. (Whereas right now, developers can already achieve the same behavior as using loops, but only by breaking computations up across multiple inputs such that each remains within today’s poorly-targeted limits.)

Similarly, with better limits, we’ll also be able to see if/how developers actually use higher-precision math emulation in the wild, giving us much better data with which to judge the actual impact of any proposed optimizations.

2 Likes

Probably best to break any further discussion on this out into a new thread, but I’ll just add a few thoughts. I wrote a little about this sort of “Serialized Transaction Introspection” here too.

  • After CashTokens, I think the use case for parent TX introspection is very, very limited: essentially just “punishment transactions” in anti-double-spending schemes that regulate the spending of P2PKH contracts (e.g. Treeless ZCE). Other kinds of contracts can allow participants to commit to a more-efficient-to-enforce message structure, so the same sort of punishment transactions don’t require inspection of parent transactions. (The downside is that these schemes require setup, whereas treeless ZCE requires only some confirmed P2PKH outputs without any further setup.)
  • I’m likewise very uncertain about streaming hash function opcodes – I don’t know of any use cases beyond working around absurdly-small stack item size limits, and they’re a pretty inefficient solution at that (more assigned VM opcodes, longer contract bytecode when used, more cycles during execution, messier to optimize with caching/memoization).
  • Currently, the VM is agnostic to the actual structure of encoded transactions; an operation parsing serialized transactions would necessarily mix a lot of complexity back down into underlying VM implementations. Depending on design, it could also prevent future improvements to transaction format, as we can’t rule out that some P2SH contracts might be broken by any observable change to the “VM API”.
  • I don’t think this would actually reduce the number of bytes used, even for these very unusual parent transaction inspection: <part> <part> OP_DUP [analyze] OP_CAT has ~equivalent overhead as <raw_tx> OP_DUP <some_item_number> <some_item_type> OP_TXFROMSTACK [analyze]. I think developers would get the same value from a user-space library that simply makes working with the transaction preimage a little more comprehensible. (Though again, I really doubt this will be common outside the treeless ZCE contract itself.)
1 Like

Yeah I think this philosophy makes the most sense. Main motivation for these new opcodes is to squeeze the functionality into the current per-input limits so people can keep their sanity when designing contracts :slight_smile: Also, with current limits, splitting a contract to multiple inputs results in lots of overhead since they need to require/verify each other’s execution - which spends a lot of redeem script size & opcode count budget.

But if limits were increased, then functionality could be achieved by unrolling the operations, so the only motivation for specialized opcodes would be to reduce TX sizes, which is an OK goal but would be premature optimization given the current context.

With this in mind, I believe this limits CHIP should be top priority for the next VM upgrade.

4 Likes

Interestingly enough, that overhead is much smaller than you might expect. E.g. for Jedex’s process tick transaction, the “market maker” can be split across any number of inputs – each input has a mutable token with a 1-byte commitment indicating it’s index in the sequence of contracts. Each must verify:

  • That it’s at the expected input index (e.g. OP_INPUTINDEX <4> OP_EQUALVERIFY) and has the expected commitment at that index (e.g. <0x01> OP_EQUALVERIFY)
  • That it’s covenant is reproduced (e.g. <4> OP_UTXOBYTECODE <4> OP_OUTPUTBYTECODE OP_EQUALVERIFY)
  • That the UTXO token category is equal to the input token category (e.g. <4> OP_UTXOBYTECODE <4> OP_OUTPUTBYTECODE OP_EQUALVERIFY)
  • And that it’s portion of the computation is fulfilled, (e.g. <3> OP_INPUTBYTECODE <8> OP_SPLIT <8> OP_SPLIT [...] to split off the result of the last computation and do something with it.)

So there’s significant overhead in terms of transaction size due to other components (~60 bytes per output, ~40 bytes per input), but from the perspective of the contract code, it’s only ~12 operations/~20 bytes of overhead in each “code split” (of 201 operations and 1650 bytes, respectively).

Of course, I’d prefer bring that overhead to 0 (in most cases) by targeting VM limits to be less wasteful. Just wanted to offer more background here and note that this is partially a tooling problem; our tools need to get better at combining multiple contracts into atomic transactions for other reasons, regardless of VM limits, as there are lots of cases which require this coupled-contracts primitive.

3 Likes

Just want to highlight what I think is next here:

I’d love to see node implementation developers experiment with how the VM can most simply and efficiently limit stack memory usage.

The current proposal is essentially: sum_all_stack_items + stack_depth <= 130,000. That limit is rounded up from a max of 125,320 + 242 in what we believe is the most memory-expensive script currently possible.

The issue here is that we need to check memory usage after every operation in which memory usage might have increased.

Instead of the simple, incrementing counter that is currently used (++maxOperationsPerEvaluation < 201), we have to accumulate the sizes of all stack memory items, then add the length of the stack.

The calculation is not terribly expensive, and it can be optimized in many ways – e.g. many operations can skip the check, stack memory usage at >6 depth can be cached across operations, etc., but the current formulation still strikes me as less elegant than past improvements to the VM.

It’s possible that this is the best we can do, but I’d like to see some other people review the issue. Some questions:

  • Stack depth: Do we actually need to limit stack depth? MAX_SCRIPT_SIZE is 10,000 bytes (and for standardness, MAX_TX_IN_SCRIPT_SIG_SIZE is only 1,650 bytes), so with OP_3DUP, we’re looking at a maximum depth of ~30k for transactions first heard in blocks (and only ~5k for transactions first heard over the network). Can we save cycles and complexity by just skipping that? (Are we ok with the protocol requiring stack + OP_ROLL implementations to be reasonably optimized?) And if not, do we need to multiply depth by some number >1 to represent the realistic overhead of a stack item?
  • Memory usage: Is there a more elegant way to limit stack memory usage? Rather than checking total usage after every operation in which it may have increased, is there a more efficient, less direct strategy?

If other developers with an interest in low-level optimization want to take a look at this, I’d love to hear your thoughts!

3 Likes

CC: maybe @cculianu, @tom, @fpelliccioni, @groot-verde, @Andrew-128, @markblundeberg? If you have thoughts on how to efficiently measure stack memory usage, comments are much appreciated!

3 Likes

There are only a limited number of places that the stack and alt-stack are changed. For those I’d simply add a couple of lines that subtract or add the item to be put on the stack.
The stack-depth itself is already being tracked, so you can ignore that.
The actual data on the stack can simply be counted with an extra integer which gets changed on every stack-change with the size of the thing being popped or pushed.

I don’t believe that to be an issue. A little abstraction will make that readable (and maintainable) and the actual CPU load is not the issue. IO is very much the problem for scaling, not doing this extra house-keeping.

I want to tie this effort into another long standing one.

Measuring how much a script takes to executes should in my opinion have an effect on the cost/reward balance of a transaction. There is only a very crude ‘reward’ currenty which is the fee, but to benefit an open market that measurement should improve.

So, what I’m thinking is that a transaction spending a UTXO which is very expensive to spend should have its ‘cost’ increased.

Things like Coin-Days-Destroyed and indeed fees paid could be used to stand on the other side of the ledger and supply the miner reward for this specific transaction.

This means that the black/white proposal for VM limits becomes a little more gray-scale where the limits are still hard limits, but getting closer to them will increase the cost of usage and hopefully avoid overloaded full nodes.

2 Likes

I definitely am a very enthusiastic supporter of this idea and any practical initiative to move such a thing forward. Ideal designs would involve “cheap-to-measure” approaches when evaluating scripts to decide what the required fee / fee discount would be.


As for this proposal itself. I have no clear idea of what the impacts would be on perf. – the only way to know is to construct the most pathological txn imaginable with these limits in place – and perhaps a whole 32MB block of such txns, and measure the execution/validation speed of that versus status quo.

Anybody want to volunteer to construct such a pool of 32MB of txns?

If you hand me a big stash of pathological txns that abuse these limits to the extreme, I can modify a test version of BCHN to use these limits and we can benchmark how bad/not bad it all would be.

4 Likes

I wanted to add some thoughts about OP_RETURN limits and was about to post under the old thread, but I think it’s better to post here.

Redeem script reveal before first spend, or applications like https://memo.cash. Btw, the other day I had a nice chat about scaling with Memo’s founder Jason Chavannes, and I learned he’d like a bigger OP_RETURN, too, and said he’d support a CHIP for it so I got reminded of this CHIP.

That has the inconvenience of having to create a dust UTXO ahead of every data push. Also, if you use the data by contracts then you have the added complication of having to prevent others from stealing the input, e.g. if it a plain commit-reveal with no authentication.

Here’s something new to consider about the where: bulk of the data would ideally be pushed before the 1st non-opreturn output. Why? Because this allows keeping more compact proofs of TX’s outputs: hash the serialization until start of outputs and store the sha256 & length of hashed data instead of the raw TX part. Then later, when you want to prove that outputs are part of the TX, you can do the sha256 length-extension with the outputs & locktime, and apply one more sha256 to get the TXID.

How about we allow a larger limit only if OP_RETURN or an uninterrupted sequence of them is placed before the 1st non-OP_RETURN output, while those placed in any place after the 1st non-OP_RETURN output would have tighter limits?

Ok, the above was quoting the old thread, now some fresh thoughs about stuff from this one.

You could do a sha256 length-extension from within ScriptVM, and have smaller proofs of some TXs outputs :slight_smile: Don’t know whether it’d be useful, just pointing it out for info.

  1. Contract-based refund transactions, too. Contract could inspect the parent’s input and pay refund back to the originating pubkey.
  2. Non-interactive tokenization of AnyHedge contracts (as opposed to interactive: needing to make tokenizer contract be part of funding TX)
  3. Combined with something like OP_BLOCKHASHMATURED, we’d get an easy way to “introspect” any historical TX - but the spender would have to provide it.

Btw, did anyone wonder why code-point 0x50 was skipped (left undefined), and it belongs into data push group? What if we made the opcode special - 0x50 to be: OP_PUSHRAWTX <raw TX>. It would be valid only if the raw TX would be of valid TX format. Then, user could slice it up to extract what he needs, knowing that the format is OK and he can skip doing some checks on the blob.

Let’s get moving with this CHIP! Many builders (@MathieuG & @mainnet_pat, and I imagine a few more) have already expressed their hopes about this CHIP moving forward. Let’s define actions that need to be taken if it is going to be ready for 2025 activation.

Would the best course of action be to just implement the limits and then do some benchmarks - and if the timings are all below the current worst case then we could safely implement it?

IMO we should aim higher, make these test cases be part of future-proofing tests with 256 MB.

2 Likes

The thing about hashing is … really the amount of data you hash is more costly than the number of hashing operations. Consider hashing 80 bytes 660 times versus hashing 10kb 660 times…

I have a feeling that the stack item 10kb thing combined with hash abuse can be a problem… maybe?

EDIT: What size objects do you “typically” anticipate hashing? 32-byte things? 64-byte things? Perhaps an additional constraint would be: you get up to 660 hash ops, or maybe ~20kb of data you can hash, cumulatively, per script evaluation. Whichever limit is hit first. That’s one way to ensure the amount of data hashed doesn’t get large?

2 Likes

There’s some quantization due to padding to blocks, I think hashing 1 byte costs the same as hashing 32 bytes, but hashing 33-64 bytes costs 2x the hashing of 1-32 bytes, and so on. So if we limit total data then someone could be more abusive by doing a big number of smaller messages.

Correction on hash block size (source):

With most algorithms (including MD4, MD5, RIPEMD-160, SHA-0, SHA-1, and SHA-256), the string is padded until its length is congruent to 56 bytes (mod 64). Or, to put it another way, it’s padded until the length is 8 bytes less than a full (64-byte) block (the 8 bytes being size of the encoded length field). There are two hashes implemented in hash_extender that don’t use these values: SHA-512 uses a 128-byte blocksize and reserves 16 bytes for the length field, and WHIRLPOOL uses a 64-byte blocksize and reserves 32 bytes for the length field.

Sounds good, yeah.

1 Like

:+1:

Note: I chose numbers out my buttocks. Ideally one benches this on say BCHN and gets an idea of what the tradeoffs are… I just raised this issue as a potential thing to keep in mind…

EDIT:

There’s some quantization due to padding to blocks, I think hashing 1 byte costs the same as hashing 32 bytes, but hashing 33-64 bytes costs 2x the hashing of 1-32 bytes, and so on. So if we limit total data then someone could be more abusive by doing a big number of smaller messages.

Yes. Good point. Perhaps hashing 1 byte should be “costed” as if it were 32-bytes. That is, each chunk of bytes you hash is rounded up to the nearest 32-bytes in terms of costing… or something.

Something like that…

2 Likes

Just feel the need to point out that the blocksize of sha256 is 64 bytes, not 32.

3 Likes

Oh yeah good point so if one were to round-up the cost def. should be multiples of the block size 64 bytes… sorry about that; thanks for pointing that out…

1 Like

From an architectural point of view, I would advice to think about this in the vain of there being two mempools.
First the one you know, inside of the full node. Its task is to make the network function (detect double spends etc) and for normies as well as miners this mempool is important to make sure that block transfer is cheap. Which means, it should have all the transactions that might be in the next block so we can avoid fetching them afterwards.

The second mempool would be one explicitly for mining software.
Ideally it can be connected to a full node and slave to is so it is simply a subset from the full node’s mempool in order to avoid any and all validation. Keep it simple, and all that.

But adding new transactions to this mempool can be done much more leniently and take more CPU time (probably multi-threaded) in order to give weights and scores to transactions. Include the highest scoring ones and bumb less happy ones.

So, I imagine this second mempool to be snapshotted for a block template regularly and various algorithms can be used to make the most profitable block within the parameters given. Because we’ll inevitably end up with more transactions than will fit in blocks and selection then should be done smarter than just on fees.

1 Like

Thank you all for digging into this! I’m mostly AFK until December, so I’m sorry I’ll have to come back to thread later. Just wanted to link to recent related discussion on limiting memory usage:

https://lists.linuxfoundation.org/pipermail/bitcoin-dev/2023-October/022082.html

The patch bundles the limiting of total stack item memory usage into the stack depth limit.

Not sure if that ends up being simpler in most implementations than this CHIP’s current approach, but I wonder if that strategy opens up more opportunities for skipping the validation in common cases (so implementations can check memory usage less often).

Anyways, this question is still the main item that I personally think needs more research for this CHIP to be ready:

2 Likes

Flowee the Hub is probably the only one that has a validator which is compltely multi-threading. My thinking there has always been that we simply allocate the number of threads that the hardware supports (including hyperthreading) and queue jobs for the first thread to take.

As a result there will be at most N virtual machines (stack etc) in memory. Where N equals the number of virtual cores. Typically less as those VMs are cheap to start (though that may be an optimization in future).

My thumb-in-air feeling is that if we use 1MB per VM for things like stack, that is not even going to register. I’d be OK with 10MB, to stretching that to 25MB per VM.

Because in my 64cores machine, which can validate 64 transactions at a time, that would mean we allocate 1.6GB of memory max for the validation part. Which in 2024 is perfectly acceptable IMO.

2 Likes

@cculianu has gotten acceptance from the other BCHN members to start work on implementing the ‘Targeted VM Limits’ CHIP in a branch of BCHN and do testing of those limits. :smiley:

This seems a great way to move the proposal forward, as a way to investigate and resolve the open questions about the stack memory usage limit - how expensive it is to measure and enforce compared to alternative solutions - and questions around the need for a limit stack depth.

2 Likes