CHIP 2021-05 Targeted Virtual Machine Limits

Just want to highlight what I think is next here:

I’d love to see node implementation developers experiment with how the VM can most simply and efficiently limit stack memory usage.

The current proposal is essentially: sum_all_stack_items + stack_depth <= 130,000. That limit is rounded up from a max of 125,320 + 242 in what we believe is the most memory-expensive script currently possible.

The issue here is that we need to check memory usage after every operation in which memory usage might have increased.

Instead of the simple, incrementing counter that is currently used (++maxOperationsPerEvaluation < 201), we have to accumulate the sizes of all stack memory items, then add the length of the stack.

The calculation is not terribly expensive, and it can be optimized in many ways – e.g. many operations can skip the check, stack memory usage at >6 depth can be cached across operations, etc., but the current formulation still strikes me as less elegant than past improvements to the VM.
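For concreteness, here is a minimal sketch of the check as currently formulated (hypothetical types and names; whether the alt stack is included in the sum is an assumption on my part):

```cpp
#include <cstddef>
#include <vector>

using StackItem = std::vector<unsigned char>;
using Stack = std::vector<StackItem>;

constexpr size_t MAX_STACK_MEMORY_USAGE = 130'000;

// sum_all_stack_items + stack_depth, recomputed from scratch.
size_t StackMemoryUsage(const Stack& stack, const Stack& altStack) {
    size_t total = stack.size() + altStack.size(); // the depth term
    for (const auto& item : stack) total += item.size();
    for (const auto& item : altStack) total += item.size();
    return total;
}

// Called after every operation that may have increased memory usage.
bool CheckStackMemory(const Stack& stack, const Stack& altStack) {
    return StackMemoryUsage(stack, altStack) <= MAX_STACK_MEMORY_USAGE;
}
```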

It’s possible that this is the best we can do, but I’d like to see some other people review the issue. Some questions:

  • Stack depth: Do we actually need to limit stack depth? MAX_SCRIPT_SIZE is 10,000 bytes (and for standardness, MAX_TX_IN_SCRIPT_SIG_SIZE is only 1,650 bytes), so with OP_3DUP, we’re looking at a maximum depth of ~30k for transactions first heard in blocks (and only ~5k for transactions first heard over the network). Can we save cycles and complexity by just skipping that? (Are we ok with the protocol requiring stack + OP_ROLL implementations to be reasonably optimized?) And if not, do we need to multiply depth by some number >1 to represent the realistic overhead of a stack item?
  • Memory usage: Is there a more elegant way to limit stack memory usage? Rather than checking total usage after every operation in which it may have increased, is there a more efficient, less direct strategy?

If other developers with an interest in low-level optimization want to take a look at this, I’d love to hear your thoughts!


CC: maybe @cculianu, @tom, @fpelliccioni, @groot-verde, @Andrew-128, @markblundeberg? If you have thoughts on how to efficiently measure stack memory usage, comments are much appreciated!


There are only a limited number of places where the stack and alt-stack are changed. For those I’d simply add a couple of lines that add or subtract the size of the item being pushed or popped.
The stack depth itself is already being tracked, so you can ignore that.
The actual data on the stack can simply be counted with an extra integer that is adjusted on every stack change by the size of the item being pushed or popped.

I don’t believe that to be an issue. A little abstraction will keep it readable (and maintainable), and the extra CPU load is not the problem. IO is very much the bottleneck for scaling, not this extra housekeeping.
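To make the incremental bookkeeping described above concrete, here is a minimal sketch (hypothetical wrapper names, not BCHN’s or Flowee’s actual stack types): one running byte counter, adjusted only at the push/pop sites.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

using StackItem = std::vector<unsigned char>;

// Sketch of a stack wrapper that keeps a running total of item sizes, so the
// memory check never has to iterate over the whole stack.
class TrackedStack {
public:
    void Push(StackItem item) {
        m_bytesInUse += item.size();
        m_items.push_back(std::move(item));
    }
    StackItem Pop() {
        StackItem item = std::move(m_items.back());
        m_items.pop_back();
        m_bytesInUse -= item.size();
        return item;
    }
    size_t Depth() const { return m_items.size(); }
    size_t BytesInUse() const { return m_bytesInUse; }

private:
    std::vector<StackItem> m_items;
    size_t m_bytesInUse = 0; // running total of all item sizes
};
```

The limit check then reduces to comparing `BytesInUse() + Depth()` (across both stacks) against the proposed cap after a push.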

I want to tie this effort into another long standing one.

Measuring how much a script costs to execute should, in my opinion, have an effect on the cost/reward balance of a transaction. There is currently only a very crude ‘reward’, which is the fee, but to benefit an open market that measurement should improve.

So, what I’m thinking is that a transaction spending a UTXO which is very expensive to spend should have its ‘cost’ increased.

Things like Coin-Days-Destroyed and indeed fees paid could be used to stand on the other side of the ledger and supply the miner reward for this specific transaction.

This means that the black/white proposal for VM limits becomes a little more gray-scale where the limits are still hard limits, but getting closer to them will increase the cost of usage and hopefully avoid overloaded full nodes.


I definitely am a very enthusiastic supporter of this idea and any practical initiative to move such a thing forward. Ideal designs would involve “cheap-to-measure” approaches when evaluating scripts to decide what the required fee / fee discount would be.


As for this proposal itself: I have no clear idea of what the impact on performance would be. The only way to know is to construct the most pathological txn imaginable with these limits in place – perhaps a whole 32MB block of such txns – and measure the execution/validation speed of that versus the status quo.

Anybody want to volunteer to construct such a pool of 32MB of txns?

If you hand me a big stash of pathological txns that abuse these limits to the extreme, I can modify a test version of BCHN to use these limits and we can benchmark how bad/not bad it all would be.


I wanted to add some thoughts about OP_RETURN limits and was about to post under the old thread, but I think it’s better to post here.

Redeem script reveal before first spend, or applications like https://memo.cash. Btw, the other day I had a nice chat about scaling with Memo’s founder Jason Chavannes, and I learned he’d like a bigger OP_RETURN too, and said he’d support a CHIP for it, which reminded me of this CHIP.

That has the inconvenience of having to create a dust UTXO ahead of every data push. Also, if the data is used by contracts, then you have the added complication of having to prevent others from stealing the input, e.g. if it is a plain commit-reveal with no authentication.

Here’s something new to consider about the where: the bulk of the data would ideally be pushed before the 1st non-OP_RETURN output. Why? Because this allows keeping more compact proofs of a TX’s outputs: hash the serialization up to the start of the outputs and store the sha256 state (midstate) & length of hashed data instead of the raw TX part. Then later, when you want to prove that the outputs are part of the TX, you can do the sha256 length-extension with the outputs & locktime, and apply one more sha256 to get the TXID.
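To make that concrete, here is a sketch assuming a CSHA256-style streaming hasher as found in Bitcoin-derived codebases (the function names and the prefix/suffix split are mine; nothing here is part of the CHIP, it only illustrates the compact-proof idea):

```cpp
#include <crypto/sha256.h> // CSHA256 streaming hasher (Bitcoin-derived codebases)
#include <cstdint>
#include <vector>

// Phase 1: hash the serialization up to the start of the outputs and keep the
// hasher object; it holds the midstate plus the count of bytes written.
CSHA256 CommitToPrefix(const std::vector<uint8_t>& prefix) {
    CSHA256 hasher;
    hasher.Write(prefix.data(), prefix.size());
    return hasher; // copying the object preserves the internal state
}

// Phase 2: resume from the midstate with outputs + locktime, finalize the
// first SHA-256 pass, then apply the second pass to obtain the TXID.
std::vector<uint8_t> TxidFromMidstate(CSHA256 midstate,
                                      const std::vector<uint8_t>& outputsAndLocktime) {
    uint8_t first[CSHA256::OUTPUT_SIZE];
    midstate.Write(outputsAndLocktime.data(), outputsAndLocktime.size());
    midstate.Finalize(first);

    uint8_t txid[CSHA256::OUTPUT_SIZE];
    CSHA256().Write(first, sizeof(first)).Finalize(txid);
    return std::vector<uint8_t>(txid, txid + sizeof(txid));
}
```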

How about we allow a larger limit only if an OP_RETURN, or an uninterrupted sequence of them, is placed before the 1st non-OP_RETURN output, while those placed anywhere after the 1st non-OP_RETURN output would have tighter limits?

Ok, the above was quoting the old thread; now some fresh thoughts about stuff from this one.

You could do a sha256 length-extension from within ScriptVM, and have smaller proofs of some TXs’ outputs :slight_smile: Don’t know whether it’d be useful, just pointing it out for info.

  1. Contract-based refund transactions, too. The contract could inspect the parent’s input and pay the refund back to the originating pubkey.
  2. Non-interactive tokenization of AnyHedge contracts (as opposed to interactive: needing to make tokenizer contract be part of funding TX)
  3. Combined with something like OP_BLOCKHASHMATURED, we’d get an easy way to “introspect” any historical TX - but the spender would have to provide it.

Btw, did anyone wonder why code-point 0x50 was skipped (left undefined), even though it belongs to the data push group? What if we made that code point a special opcode: 0x50 would be OP_PUSHRAWTX <raw TX>. It would be valid only if the raw TX were of valid TX format. Then the user could slice it up to extract what they need, knowing that the format is OK and some checks on the blob can be skipped.

Let’s get moving with this CHIP! Many builders (@MathieuG & @mainnet_pat, and I imagine a few more) have already expressed their hopes about this CHIP moving forward. Let’s define actions that need to be taken if it is going to be ready for 2025 activation.

Would the best course of action be to just implement the limits and then do some benchmarks - and if the timings are all below the current worst case then we could safely implement it?

IMO we should aim higher, make these test cases be part of future-proofing tests with 256 MB.


The thing about hashing is… really, the amount of data you hash matters more for cost than the number of hashing operations. Consider hashing 80 bytes 660 times versus hashing 10kb 660 times…

I have a feeling that the stack item 10kb thing combined with hash abuse can be a problem… maybe?

EDIT: What size objects do you “typically” anticipate hashing? 32-byte things? 64-byte things? Perhaps an additional constraint would be: you get up to 660 hash ops, or maybe ~20kb of data you can hash, cumulatively, per script evaluation. Whichever limit is hit first. That’s one way to ensure the amount of data hashed doesn’t get large?
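A sketch of that dual budget (both numbers are only the illustrative figures from above, and the names are hypothetical):

```cpp
#include <cstddef>

// Per-evaluation hashing budget: fail as soon as either the operation count
// or the cumulative-bytes allowance is exhausted.
struct HashBudget {
    size_t opsRemaining = 660;      // max hashing operations per evaluation
    size_t bytesRemaining = 20'000; // max cumulative bytes hashed (~20kb)

    // Call once per hashing opcode with the length of the message hashed.
    bool Consume(size_t messageLen) {
        if (opsRemaining == 0 || messageLen > bytesRemaining) return false;
        --opsRemaining;
        bytesRemaining -= messageLen;
        return true;
    }
};
```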


There’s some quantization due to padding to blocks: I think hashing 1 byte costs the same as hashing 32 bytes, but hashing 33-64 bytes costs 2x the hashing of 1-32 bytes, and so on. So if we limit total data, then someone could be more abusive by doing a big number of smaller messages.

Correction on hash block size (source):

With most algorithms (including MD4, MD5, RIPEMD-160, SHA-0, SHA-1, and SHA-256), the string is padded until its length is congruent to 56 bytes (mod 64). Or, to put it another way, it’s padded until the length is 8 bytes less than a full (64-byte) block (the 8 bytes being size of the encoded length field). There are two hashes implemented in hash_extender that don’t use these values: SHA-512 uses a 128-byte blocksize and reserves 16 bytes for the length field, and WHIRLPOOL uses a 64-byte blocksize and reserves 32 bytes for the length field.

Sounds good, yeah.


:+1:

Note: I chose those numbers out of my buttocks. Ideally one benches this on, say, BCHN and gets an idea of what the tradeoffs are… I just raised this issue as a potential thing to keep in mind…

EDIT:

There’s some quantization due to padding to blocks: I think hashing 1 byte costs the same as hashing 32 bytes, but hashing 33-64 bytes costs 2x the hashing of 1-32 bytes, and so on. So if we limit total data, then someone could be more abusive by doing a big number of smaller messages.

Yes. Good point. Perhaps hashing 1 byte should be “costed” as if it were 32 bytes. That is, each chunk of bytes you hash is rounded up to the nearest 32 bytes in terms of costing… or something.

Something like that…


Just feel the need to point out that the blocksize of sha256 is 64 bytes, not 32.


Oh yeah, good point – so if one were to round up the cost, it should definitely be in multiples of the 64-byte block size… sorry about that; thanks for pointing that out…
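So the per-operation cost could be derived roughly like this (a sketch based on the padding rule quoted earlier; the helper name is mine):

```cpp
#include <cstdint>

// Number of 64-byte SHA-256 compression blocks processed for a message of
// `messageLen` bytes: padding adds at least 1 byte plus an 8-byte length
// field, and the total is rounded up to whole 64-byte blocks.
constexpr uint64_t Sha256Blocks(uint64_t messageLen) {
    return (messageLen + 1 + 8 + 63) / 64;
}
// Sha256Blocks(1) == 1, Sha256Blocks(55) == 1, Sha256Blocks(56) == 2.
```

Costing by block count rather than raw byte count removes the incentive to split the same data into many tiny messages.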


From an architectural point of view, I would advise thinking about this in the vein of there being two mempools.
First, the one you know, inside the full node. Its task is to make the network function (detect double spends, etc.), and for normies as well as miners this mempool is important to make sure that block transfer is cheap. Which means it should hold all the transactions that might be in the next block, so we can avoid fetching them afterwards.

The second mempool would be one explicitly for mining software.
Ideally it can be connected to a full node and slaved to it, so it is simply a subset of the full node’s mempool, in order to avoid any and all validation. Keep it simple, and all that.

But adding new transactions to this mempool can be done much more leniently and take more CPU time (probably multi-threaded) in order to give weights and scores to transactions. Include the highest-scoring ones and bump the lower-scoring ones.

So, I imagine this second mempool being snapshotted regularly for a block template, and various algorithms can be used to build the most profitable block within the given parameters. Because we’ll inevitably end up with more transactions than will fit in blocks, selection should then be done smarter than just on fees.


Thank you all for digging into this! I’m mostly AFK until December, so I’m sorry I’ll have to come back to this thread later. Just wanted to link to a recent related discussion on limiting memory usage:

https://lists.linuxfoundation.org/pipermail/bitcoin-dev/2023-October/022082.html

The patch bundles the limiting of total stack item memory usage into the stack depth limit.

Not sure if that ends up being simpler in most implementations than this CHIP’s current approach, but I wonder if that strategy opens up more opportunities for skipping the validation in common cases (so implementations can check memory usage less often).
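Purely as an illustration of what a bundled limit could look like (this is an assumption on my part, not a summary of the linked patch; the numbers are placeholders):

```cpp
#include <cstddef>

constexpr size_t SLOT_BYTES = 128;   // placeholder chunk size
constexpr size_t MAX_SLOTS  = 1'000; // placeholder combined limit

// Each stack item occupies one "slot" plus one more per SLOT_BYTES of data,
// so a single counter bounds both element count and total bytes.
constexpr size_t SlotsForItem(size_t itemSize) {
    return 1 + itemSize / SLOT_BYTES;
}

// A push is allowed only while the combined slot count stays within the limit.
bool CanPush(size_t slotsInUse, size_t newItemSize) {
    return slotsInUse + SlotsForItem(newItemSize) <= MAX_SLOTS;
}
```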

Anyways, this question is still the main item that I personally think needs more research for this CHIP to be ready:


Flowee the Hub is probably the only one that has a completely multi-threaded validator. My thinking there has always been that we simply allocate the number of threads that the hardware supports (including hyperthreading) and queue jobs for the first free thread to take.

As a result there will be at most N virtual machines (stack etc.) in memory, where N equals the number of virtual cores. Typically fewer, as those VMs are cheap to start (though that may be an optimization in the future).

My thumb-in-the-air feeling is that if we use 1MB per VM for things like the stack, that is not even going to register. I’d be OK with 10MB, or stretching that to 25MB per VM.

Because on my 64-core machine, which can validate 64 transactions at a time, that would mean we allocate at most 1.6GB of memory for the validation part. Which in 2024 is perfectly acceptable, IMO.


@cculianu has gotten acceptance from the other BCHN members to start work on implementing the ‘Targeted VM Limits’ CHIP in a branch of BCHN and do testing of those limits. :smiley:

This seems like a great way to move the proposal forward: it is a way to investigate and resolve the open questions about the stack memory usage limit – how expensive it is to measure and enforce compared to alternative solutions – and the questions around the need for a stack depth limit.


Copy pasting my own message from telegram, seems relevant:

Re: “Does this make inscriptions (data storage) easier, that sounds bad”:

The data storage question is pretty vague on BCH since putting things on the chain is quite cheap already regardless of method. imho we mainly have to address three things:

  1. Jpegs are not somehow made cheaper than other things (see also btc)
  2. Jpegs don’t somehow gain easy entrance into the UTXO
  3. Jpegs don’t somehow gain an excessively easy route, like a limitless OP_RETURN. This one in particular is ill-defined and open to debate.

Otherwise there’s nothing preventing people from just dumping a bunch of 200-byte OP_RETURNs on the chain today and assembling them into a jpeg offchain. ^ Just to demonstrate that any such proposal doesn’t make things worse than the status quo.


Smart contract developers run into the opcode limit every so often. For example, @dagurval ran into the limit implementing his ‘Stable asset token - incentivized floating peg’ contract, then optimized it, then later ran into the limit again and tweeted “The BCH opcode limit is too damn low!”

I also ran into the opcode limit when designing the improved sha-gate contract.

Sometimes it is possible to split the contract in two with relatively low overhead; other times, like when trying to emulate math, the added complexity simply becomes practically prohibitive.

Agreed! After spending time thinking through ‘Composite Arithmetic Opcodes’ and ‘The need for additional Math opcodes’, it became clear that it is best to allow this functionality to be emulated first, before it is added natively.

CashScript library functionality, similar to that of Solidity, is not very useful currently because of the 201 opcode count limit. The newly proposed limit of 10,000 bytes would allow contract authors to emulate math/opcode functionality and easily abstract it into separate libraries.


I see one major problem here: How do we name the new upgrade? :wink:

Targeted Virtual Machine Limits makes “TVML”. Hm, no good, need to work on this.

TaVLiMa

MiLViTa

How about ViLMa T.? As in Vilma Tlinstone?


:joy:


Fantastic! I started cleaning up the CHIP a bit in this PR: Propose for 2025 upgrade by bitjson · Pull Request #6 · bitjson/bch-vm-limits · GitHub


Just an update here: I plan to push this proposal for November 2024 lock in, May 2025 deployment. The latest revision is available at the repo:

I’ll continue my research on the open issues:

I’d appreciate any feedback here on those issues and on the latest draft proposal. You can also open new issues directly at the repo.

I plan to finalize those items and produce a complete set of VMB test vectors prior to the May 2024 upgrade so that node/VM implementation developers will have plenty of time to review before November.
