Mitra VM: Thoughts on the architecture + instruction set

I couldn't sleep well the last two days, so I wrote down my ideas on what a VM for Mitra could look like: https://docs.google.com/spreadsheets/d/10PVUGoOB-otUXvFLbDr3z2R6bPzjVhFqIhdlJlO-Fak/edit?usp=sharing.

The goal is to make a hyper-efficient VM, so we can keep Bitcoin Cash as scalable as possible while allowing smart contracts that are as complex as possible. The VM has a couple of weird constraints, which are hard to uphold all at once:

  • It should be very fast to interpret.
  • It should be possible to compile the code to efficient machine code, but the compilation itself shouldn't take too long.
  • Ideally, it should at some point be possible to build ASICs which can run these opcodes. This doesn't seem out of the question, given that ASICs have been absolutely standard for miners for years now.
  • It should be relatively easy to implement.

For this, I did some research into WebAssembly; among other things, I analyzed existing wasm files (one for Mono (a C# runtime), one for d3force and one for some animation framework), but found a few issues:

  • It's not very compact, e.g. the most common operation (local.get) always takes at least two bytes.
  • The most common operation by far is local.get, which means most programs spend the majority of instructions on putting values back on the stack. This is solved by using a belt, as used by the Mill CPU architecture.
  • It has two integer types, i32 and i64, which is not sufficient once we integrate fractional satoshis. I added i8, i16 and i128.
  • Each integer operation encodes the type it uses. This bloats the opcode space, and we can encode more ops if we have just one opcode for add, mul, etc.
  • It has many ops for floating-point numbers, which we can ignore and use the opcode space for other stuff.
  • The stack can grow unbounded, which means we cannot pre-allocate the required memory before executing.

WebAssembly instructions are already variable-sized. I made an instruction set which is also variable-sized but which, I think, utilizes the opcode space more efficiently. Rosco quoted Donald Knuth ("premature optimization is the root of all evil") and I agree, so we shouldn't spend too much time on crafting the right instruction set and should rather get to playing around with the thing as early as possible.

I also added a few other things, which are inspired by the Mill CPU architecture (see video 1, video 2, video 3 and video 4 for an intro to that architecture):

  • A belt, instead of a stack, which limits the required memory to a fixed value (currently 256 bytes) and allows for a small opcode size (see the sketch after this list).
  • NaNs, which allow for, let's say, interesting error handling. I haven't explored this idea too thoroughly yet, though.
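To make the belt idea concrete, here's a rough Python sketch of the core mechanism (the belt size, names and API are mine, not from the spreadsheet): every result is dropped onto the front of the belt, operands are addressed by their position behind the front, and the oldest value falls off once the belt is full, so memory stays bounded by construction.

    from collections import deque

    class Belt:
        """Minimal belt sketch (illustrative, not the actual spec)."""

        def __init__(self, size=16):
            # Bounded by construction: when full, appendleft() silently
            # drops the oldest value off the far end.
            self.slots = deque(maxlen=size)

        def drop(self, value):
            """Drop a new result onto the front of the belt (b0)."""
            self.slots.appendleft(value)

        def get(self, pos):
            """Read the value `pos` slots behind the front (b0 = newest)."""
            return self.slots[pos]

    # An "add b0 b1" style instruction reads two positions and drops the sum:
    belt = Belt()
    belt.drop(2)                          # belt: [2]
    belt.drop(3)                          # belt: [3, 2]
    belt.drop(belt.get(0) + belt.get(1))  # belt: [5, 3, 2]

Note that there is no pop: values only ever fall off the end, which is what keeps the memory bound fixed.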

I've also added slices, so no allocations at all are required when executing the VM. Slices point to a region of memory and can be handed around in the VM like any other value.
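Roughly, a slice is just a bounds-checked (start, length) view into the VM's pre-allocated memory. A minimal sketch of the idea (the names and layout here are mine, not from the spec):

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Slice:
        """A (start, length) view into VM memory; copying a Slice copies
        two integers, never the underlying bytes."""
        start: int
        length: int

    def read_slice(memory: bytearray, s: Slice) -> bytes:
        # Bounds check, so a slice can never reach outside VM memory.
        if s.start < 0 or s.length < 0 or s.start + s.length > len(memory):
            raise IndexError("slice out of bounds")
        return bytes(memory[s.start:s.start + s.length])

    memory = bytearray(b"hello, mitra")
    print(read_slice(memory, Slice(start=7, length=5)))  # b'mitra'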

If I cannot sleep again at some later date, I'll implement an interpreter for this in Python or so, so we can experiment with it. Maybe we'll get a compiler for it, too.

What do you guys think?


The belt concept introduces potential undefined behavior, is more complex to model than a stack, and has less functionality (given that it can't pop a value). Moreover, it seems the value you want from it can be had simply by implementing a rotating stack ( tos = stack[pos mod stack_limit] ), which many embedded Forth engines have done in the past.
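For reference, a rotating stack is just a fixed-size array indexed modulo its limit, so pushes wrap around and overwrite the oldest entries instead of overflowing. A quick Python illustration (the details are my own, not from either proposal):

    class RotatingStack:
        """Fixed-size stack where indices wrap: tos = cells[pos mod limit]."""

        def __init__(self, limit=16):
            self.limit = limit
            self.cells = [0] * limit
            self.pos = 0  # index of the next free cell

        def push(self, value):
            self.cells[self.pos % self.limit] = value
            self.pos += 1  # deep entries get overwritten; no overflow

        def pop(self):
            self.pos -= 1
            return self.cells[self.pos % self.limit]

        def tos(self):
            # tos = stack[pos mod stack_limit], as above
            return self.cells[(self.pos - 1) % self.limit]

This gives you the bounded memory of a belt while keeping ordinary push/pop semantics.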

Being stack based keeps the language and the compiler extremely simple. Your belt is basically a register file with complications, and it invites a lot of complexity for mapping local variables which, although not necessarily super complex (see Lisp & Scheme, for example), is still an order of magnitude or two more complex than simple stack-based operations - especially when you consider the possibility of multiple stacks as a classic Forth engine provides.

I understand the attraction to webasm, but its primary focus of executing Javascript as fast as possible has led to some architectural decisions that make it a poor general VM, which is partially evidenced by the criticisms you've already noted in your own suggestions.

Agree completely that a VM's ISA should be implementable as an FPGA/ASIC. This is eminently doable and desirable. The key to such things, however, is being absolutely brutal in eliminating all complexity that isn't absolutely required. A classic stack machine is really an ideal model if done right. The BTC & ETH VMs were clearly designed by people who hadn't a lot of experience in such things before. I'd recommend reading Phil Koopman's "Stack Computers: The New Wave", which shows what's possible with even primitive technology.

Thanks for the reply, @proteusguy.

The belt concept introduces potential undefined behavior

Why?

is more complex to model than a stack

For compilers, yes; for interpreters and hardware, absolutely not.

I understand the attraction to webasm, but its primary focus of executing Javascript as fast as possible has led to some architectural decisions that make it a poor general VM, which is partially evidenced by the criticisms you've already noted in your own suggestions.

WASM is stack based.


I think you’ll find keeping the belt “clean” is actually easier to do with a compiler than an interpreter. Check out the efforts the Mill team has gone through to get their LLVM compiler front end going, and imagine maintaining that at run time versus having a powerful compiler, such as the LLVM compiler they’ve made, that can eliminate most of the runtime overhead. As your design stands, I suspect almost every opcode has to add both an extra level of indirection and bookkeeping for belt context offset modifications at runtime. This would be the opposite of “hyper-efficient”.

You call this a stack processor, but the majority of your instructions are really referencing belt offsets as if they were registers in a normal register-based CPU, except with the extra trick that your offsets may shift right at any moment without warning. Scary undefined behavior, unless you’re doing tons of bookkeeping which would need to be proved correct (more appropriate for a compiler than an interpreter).

Additionally - you claim strong types but, in reality, you have one type (int) with different implementation sizes based on their domain capacity. Again, lots of bookkeeping to track which is which, which is costly for an interpreter and complex for a compiler. Why not just have a variable-length int (octet) type that indicates continuation by setting the first bit of each octet? Then you can support any sized int domain at the cost of 1 bit per 2^7 values, and you've just cleaned up your opcode space as well.
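In other words, an LEB128-style variable-length integer. A Python sketch of what I mean (the exact encoding here is illustrative, not a settled design):

    def encode_varint(n: int) -> bytes:
        """Encode a non-negative int as octets; the high bit of each octet
        signals that another octet follows (LEB128-style)."""
        out = bytearray()
        while True:
            byte = n & 0x7F
            n >>= 7
            if n:
                out.append(byte | 0x80)  # continuation bit set: more follows
            else:
                out.append(byte)         # high bit clear: final octet
                return bytes(out)

    def decode_varint(data: bytes) -> int:
        result, shift = 0, 0
        for byte in data:
            result |= (byte & 0x7F) << shift
            shift += 7
            if not byte & 0x80:          # final octet reached
                return result
        raise ValueError("truncated varint")

    assert encode_varint(300) == b"\xac\x02"
    assert decode_varint(b"\xac\x02") == 300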

I’m curious why core built-in concepts of BCH or any crypto system, like addresses and hashes, aren’t native types with type-safety protection. This would be a minimal requirement to even consider the claim of being a type-safe engine.

WASM is only slightly more stack-based than the C ABI, meaning it has a stack that’s very popular but isn’t the key architectural driver of the resulting code base. Making a stack-based CPU/VM that outperforms WASM and is cleaner is not difficult: just add another stack for return addresses and perform free/cheap function calls and returns, as described in the Koopman book I referred you to in my prior post.
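To sketch what I mean (toy Python with made-up opcodes; see Koopman for the real treatment): operands live on a data stack, return addresses on their own stack, so CALL and RET are each a single push or pop with no frame bookkeeping.

    class TwoStackVM:
        """Toy two-stack machine: data stack + return-address stack."""

        def __init__(self, code):
            self.code = code    # list of (op, arg) tuples
            self.data = []      # operand stack
            self.rstack = []    # return-address stack
            self.pc = 0

        def run(self):
            while True:
                op, arg = self.code[self.pc]
                self.pc += 1
                if op == "push":
                    self.data.append(arg)
                elif op == "add":
                    self.data.append(self.data.pop() + self.data.pop())
                elif op == "call":
                    self.rstack.append(self.pc)   # cheap call: one push
                    self.pc = arg
                elif op == "ret":
                    self.pc = self.rstack.pop()   # cheap return: one pop
                elif op == "halt":
                    return self.data

    # Push 2 and 3, call the add routine at index 4, then halt.
    prog = [("push", 2), ("push", 3), ("call", 4), ("halt", None),
            ("add", None), ("ret", None)]
    print(TwoStackVM(prog).run())  # [5]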

Despite these critical observations, I’m 100% on board for an improved execution environment IF IT’S POSSIBLE. I’m not convinced BCH can be rebooted onto such a thing for quite some time, if ever, due to some technical challenges but mainly stakeholder-consensus challenges. Designing languages/VMs is something I personally have a lot of interest in, and I would collaborate on something that looks like a promising way forward, so I believe I share much of your goals. That said, this Mitra thing predates my awareness of and involvement with BCH, and the key reference I found (from a tweet where you announced a flipstarter) returns a 502 error for me. Do you have a reference I could review that describes its goals and architectural drivers? I’m interested in reviewing it, and I understand that you’ve been a core driver of the concept.

Thanks.

The belt is well-defined behavior (as you said). At no point does it do anything undefined.

Explain how it is undefined behavior.

I explained that the effort to keep the behavior type-safe and well-defined in the interpreter is far more expensive (orders of magnitude) at runtime than doing it with a compiler. And I still haven’t seen a proof that demonstrates it is well defined, even from the Mill guys. I strongly suspect there are additional constraints they impose to make it so, which you haven’t covered here.

Obvious Areas of Concern -

  1. Loops that get altered and push good data off the end. What’s preventing that?
  2. How do you guarantee the offset context of the belt isn’t altered between operations that assumed the same reference parameter position when the code was generated?
  3. What is the behavior of performing a cast downgrade that overflows the target?
  etc…
  1. Nothing. It’s well-defined.

  2. It’s well-defined. All operations are deterministic, there’s no possibility that it is not well-defined.

  3. Casts are well-defined, see spec.

Sorry, but as I’ve been directly responsive to your questions, I don’t find “take my word for it” to be a satisfactory or informative response. I’m completely willing to be wrong, but in doing so I’d expect to get an improved understanding. You’re not providing that.

Beyond this, I asked several clear questions and made suggestions a few comments back that you’ve simply ignored, which I think are quite relevant. It’s OK to take time to answer them, but being completely dismissive isn’t a way to encourage further evaluation of your proposal.

Let me know if you think my expectations are unreasonable and I’ll withdraw.

What I’m saying is that what you consider “undefined behavior” is actually well-defined, because it’s in the spec, or it will be once this progresses. The belt is very simple and easy to implement for compilers and interpreters. Anyone who doesn’t implement it is not following the spec.

Your observation that it’s similar to a register-based solution is correct. The advantage of the belt is that it doesn’t need an “output” register, so we save bytes in the bytecode.

If you have any examples of where there’s actually a gap in the spec (or a danger that I’m overlooking), lmk. So far, you’re not providing any useful criticism that isn’t already covered in the draft I provided.

Devil’s in the details, my friend. It’s undefined because I can put together operands and data that are ambiguous under the current spec, per the issues I described before. That means your spec is, at minimum, incomplete. By your definition, you could say: here’s a Turing machine that is specified to not have the halting problem. Until you’ve defined how it avoids the halting problem, your spec is incomplete/undefined.

Your concerns are not valid, and it doesn’t help anyone to address them here. Please do the necessary research on the belt and refrain from sending any more replies in this thread until then, as they’ll make it harder for other people to follow.