Unpopular opinion: (most) indexers don't scale

I may be missing something, but I believe what you are describing is not exponential growth, but quadratic growth. Assuming that the number of addresses increases linearly with time (for a fixed number of users) and linearly with users (over a fixed timeframe), then when both are increasing, the total grows as their product, which is quadratic. This is the difference between 18 quintillion and 4,096; quadratic growth is nothing like exponential growth.
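A minimal sketch of that arithmetic (illustrative numbers only; the growth rates are assumptions, not measurements):

```python
# Illustrative only: quadratic vs exponential growth at the same scale (t = 64).
# Assumption: addresses per user grow linearly with time, and the user count
# grows linearly with time, so the total grows as their product -> quadratic.

def total_addresses(t, users_per_step=1, addrs_per_user_per_step=1):
    users = users_per_step * t                     # linear in time
    addrs_per_user = addrs_per_user_per_step * t   # linear in time
    return users * addrs_per_user                  # product of two linear terms

t = 64
print(total_addresses(t))  # 4096 (quadratic: 64 * 64)
print(2 ** t)              # 18446744073709551616 (~18 quintillion, exponential)
```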


Solid rebuttal.

You are reading me wrong. I do think SPV is absolutely fundamental to the usability of bitcoin, I fully expect the ratio of SPV to full nodes to increase a lot over time, and I currently believe that address-to-transaction indexes (hereafter just ‘index’, since other types of blockchain indexes such as SLP, CashAccounts etc. are not relevant here) are the best option.

Re-reading this thread I would like to summarize it as:

  • In protocols with an indexed backend, each user will, on each connection, request X lookups of full history that return transactions and merkle proofs. These lookups have a total cost of O(X · log N), where N is the size of the full blockchain and X is the number of addresses in a user’s wallet.
  • In protocols without an indexed backend, each user will request one scan of Y blocks to get all transactions and merkle proofs, where Y is the number of blocks since the last sync. This has the cost O(Y). (A rough sketch of both cost models follows this list.)
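To make the two cost models concrete, here is a rough sketch. The cost unit is an abstract “item touched”, and every constant below (transactions per block, chain size) is a made-up assumption purely for illustration:

```python
import math

def indexed_cost(x_addresses, n_items_in_chain):
    # Indexed backend: roughly one O(log N) lookup per wallet address.
    return x_addresses * math.log2(n_items_in_chain)

def scan_cost(y_blocks, txs_per_block):
    # No index: one linear pass over the Y blocks since the last sync.
    return y_blocks * txs_per_block

# Hypothetical numbers, only to show the shape of the two curves:
N = 800_000 * 2_000           # blocks * txs/block ~ items in the chain
print(indexed_cost(100, N))   # 100 addresses     -> ~3,000 items touched
print(scan_cost(144, 2_000))  # one day of blocks -> 288,000 items touched
```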

One of your concerns seems to be that the size of the index will blow up. Hopefully you have been convinced that this isn’t a problem, since the size of the index will always be less than the size of the blockchain and is hence not a scaling concern in this scenario.
Your more important concern, as I see it, is that X might be big while Y will be small.

The answer to this is that X can be bounded by the node, as a DoS protection, to avoid pathological users that request an unreasonable number of lookups. How big is unreasonable? 100? 1,000? 10 million? Who knows; the important thing is that X does not increase with the number of users or with the size of the blockchain. If X is bounded to some constant, it can be removed from the equation above and the cost becomes O(log N) for all of a user’s addresses.

What about Y in the other equation, can that be bounded somehow? Clearly the worst-case scenario is that Y is the length from genesis to chaintip, causing a linear scan of the entire blockchain, which we can hopefully all agree is a scaling disaster. But unless a node imposes limits on users’ ability to fetch old history, this is the reality to deal with.

Am I pessimistic when I set O(Y) = O(N)? Perhaps, but it is the worst case a node has to deal with, and I’m pretty sure that the crossover point where these two algorithms perform the same will show that X can be very big while Y isn’t. These would be really interesting experiments to perform.
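For example, a hypothetical crossover calculation (all constants are made up, and it deliberately ignores the very different per-item costs of random vs linear reads that come up later in this thread):

```python
import math

# Find the X (addresses per wallet) at which X * log2(N) indexed lookups
# touch as many items as a linear scan of Y blocks.
def crossover_x(y_blocks, txs_per_block, n_items_in_chain):
    return (y_blocks * txs_per_block) / math.log2(n_items_in_chain)

N = 800_000 * 2_000                   # hypothetical items in the chain
print(crossover_x(144, 2_000, N))     # 1 day behind:  X ~ 9,400
print(crossover_x(52_560, 2_000, N))  # 1 year behind: X ~ 3.4 million
```

Below the printed X the index touches fewer items; above it the linear scan does, at least in this naive per-item model.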


If I may be frank here – I find it unfathomable that anybody can argue that linearly scanning each block for each user could ever be considered sufficient, such that address indexing is never needed.

The whole thing is like arguing that you’re better off using a linked list to look up word → definition associations rather than a hash table…
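That analogy in plain Python (a list standing in for the linked list, a dict for the hash table; sizes are arbitrary):

```python
import time

words = [f"word{i}" for i in range(1_000_000)]
as_list = [(w, f"definition of {w}") for w in words]  # sequential structure
as_dict = dict(as_list)                               # hash table

target = "word999999"  # worst case for the sequential scan

t0 = time.perf_counter()
definition = next(d for w, d in as_list if w == target)  # linear scan
t1 = time.perf_counter()
definition = as_dict[target]                             # hash lookup
t2 = time.perf_counter()

print(f"linear scan: {t1 - t0:.4f}s, hash lookup: {t2 - t1:.7f}s")
```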


EDIT: Let’s do a thought experiment – let’s say you are running a wallet server. Let’s say blocks are super big and super full at 256MB/block. Let’s say each txn, on average, creates about 1 new address. You will have a constant rate of on the order of 1 million addresses generated every 10 mins. To build the address index, you need to scan each block exactly once – that’s 256MB of data scanned every 10 mins. Let’s say the address index stores less data than the blockchain (it doesn’t store things like signatures or full scripts) – maybe 10x less data. Right now Fulcrum stores 22G for its address index of BCH mainnet for a blockchain of ~200G. Let’s pretend I can optimize that down to 10x if I wanted to optimize for storage more.

So for a 256MB block you are storing about 25MB every 10 mins.

To build this index of 1 million new addresses, perhaps you need O(n log m) read/write operations to maintain an ever-growing database that is a key-value store implemented as a b-tree or an rb-tree, where n is the number of new addresses and m is the total cumulative number of addresses from the beginning of your history (these are worst-case numbers, since one could also build the on-disk key/value store as basically a giant hash table, which would have O(n) amortized time for appends).

So at worst, if you have m = 2^50 (which is a huge number) and n = 1 million, you only have to process 50 million data records to append 1 million addresses to the DB. Assuming each data record is about 20 bytes, that’s 1GB worth of data reads and writes to maintain the address index for 256MB blocks. So 1GB of total reads and writes, plus say 25MB of actual writes, every 10 mins to maintain the index = ~1.025 GB every 10 mins. Not too bad! And this is with a blockchain that already has 2^50 addresses, which is a huge number of addresses – about 1.1 quadrillion!
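The same back-of-the-envelope arithmetic written out (all constants are the hypothetical ones from this thought experiment):

```python
import math

block_size = 256 * 1024**2   # 256 MB block
new_addresses = 1_000_000    # n: ~1 new address per txn, per the assumption above
total_addresses = 2**50      # m: cumulative addresses so far (hypothetical)
record_size = 20             # bytes per index record (hypothetical)

ops = new_addresses * math.log2(total_addresses)  # O(n log m) record touches
io_bytes = ops * record_size                      # reads + writes to maintain the index
index_growth = block_size / 10                    # index assumed ~10x smaller than the block

print(f"record touches per block: {ops:,.0f}")                   # 50,000,000
print(f"index maintenance I/O:    {io_bytes / 1e9:.2f} GB")      # ~1 GB
print(f"new index data stored:    {index_growth / 1e6:.1f} MB")  # ~27 MB
```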

And now you can support queries from tens of thousands of clients with fast O(log N) lookups for each user. (Can even do O(1) if your key/value store uses hash tables rather than trees to store records).

Compare this to what your server can serve up if bloom filters + merkleblock were its only two primitives. In that case you will be scanning every new 256MB block once per user – as many times as you have users!

If you have 200 users you will be scanning 51GB of data every 10 mins just to figure out which of your users had txns of interest.

I fail to see how merkle block is a solution for scaling. If you have 1000 users you now have to scan 256GB. If you have 10,000, you now have to scan 2.56 TB, etc.

Whereas if you have 10k users, each of them having 10k addresses, that’s 100 million addresses, requiring O(n log n) lookups → roughly 3 billion record touches, and at 20B per address record, only about 60GB of data to churn through to do lookups for 10,000 users!

The same server, if it only used merkleblock, would have to scan through 10k * 256MB → ~2.56 terabytes of data!!
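Written out with the same hypothetical constants (the exact figures depend on the log base and constant factors, but the orders of magnitude match the comparison above):

```python
import math

block_size_gb = 0.256        # 256 MB block
users = 10_000
addresses_per_user = 10_000
record_size = 20             # bytes per index record (hypothetical)

# Bloom-filter/merkleblock style: every user triggers a fresh scan of the block.
merkleblock_gb = users * block_size_gb

# Indexed style: one O(log n) lookup per address, across all users.
n = users * addresses_per_user          # 100 million addresses
record_touches = n * math.log2(n)       # ~2.7 billion
indexed_gb = record_touches * record_size / 1e9

print(f"merkleblock: {merkleblock_gb:,.0f} GB scanned per block")   # ~2,560 GB
print(f"indexed:     {indexed_gb:,.0f} GB of record reads/writes")  # ~53 GB
```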


There is a pretty strong difference, though. Yes, if the SPV-with-merkleblocks wallet rescanned the entire blockchain every week, it would be the worst solution. But unlike the wallet using an address index, the merkleblock-using wallet’s local state is used to minimize the work the server does.

Appending to a datastore is nearly free (just disk space), and people reading it is equally cheap for the server (just reading from disk). On the other side, inserting into an index that (by definition) needs to update its lookup tables will become more and more expensive over time, CPU-wise as well as disk-seek-wise (data locality).

Yes, love that idea. Thank you!
The statement that you are storing 25MB every 10 minutes is ignoring the difference between a database and a datastore. You are doing your calculations based on your address database being a datastore that appends new data every block. This is not what happens.

You indeed may add 25MB to that address index, but since it’s an indexed lookup table, you also have to change the lookup table and insert all the updates in the middle somewhere. Not only is this massively more expensive compared to an append-only blockchain, it also locks the lookup table for reads while this reshuffling is happening.

It is, like you said, an apples to oranges comparison. An index gets more expensive to add to as it grows bigger. A datastore has constant cost.
An index also gets more expensive to read from as it gets bigger, while a datastore allows one to just read the parts you have not read yet.

So, your nice math is assuming a static state. In reality you need to rebalance that b-tree often. In reality that b-tree is not accessible while inserts are taking place. In reality you end up waiting, with that lookup table locked, for an actual disk write to hit disk.

So, yes, your math for doing a lookup with a b-tree is correct. But you also assume that the entire b-tree is static and in memory, or not fragmented on disk so it can be read fast. This is also a utopia, and you’ll end up seeing your CPU wait many, many cycles for data.
Personal experience: I’ve seen a massive speedup in lookups by caching an extra layer of the b-tree in memory in my address database (going from 20 to 15 disk reads).
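A rough sketch of the effect of caching tree levels in memory (the fanout, key count and “one disk read per uncached level” model are all simplifying assumptions; real gains depend on node layout and fragmentation, as described above):

```python
import math

def disk_reads_per_lookup(total_keys, fanout, levels_cached_in_ram):
    depth = math.ceil(math.log(total_keys, fanout))   # height of the tree
    return max(depth - levels_cached_in_ram, 0)       # one read per uncached level

keys = 10**9   # a billion index entries (hypothetical)
fanout = 3     # deliberately small fanout to mimic a deep on-disk tree
print(disk_reads_per_lookup(keys, fanout, 0))  # 19 disk reads, nothing cached
print(disk_reads_per_lookup(keys, fanout, 4))  # 15 disk reads with the top 4 levels in RAM
```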

I am more than willing to think about alternative solutions; I just feel compelled to point out that the topic is about certain indexers not scaling. Whether merkle blocks scale or not is really orthogonal to that discussion. Maybe they don’t, but that doesn’t make address indexers scale.

But, yeah, the main alternative is indeed to just read that data. The trick here is that linear reads are massively optimized in modern SSD firmware and indeed in OS kernels. You can’t compare random reads and linear reads by just stating how many bytes you read. More to the point, the price of SSDs becomes even more important, as cheap SSDs are massively worse at random reads.

To further make clear that apples-and-oranges comparison of data reads: linearly scanning a day’s worth of read-only blocks also benefits from the fact that a lot of people are doing the same thing. Those blocks have been read before and are likely still in the disk cache. Flowee the Hub benefits from this by memory-mapping the blockchain, so 10 clients all reading the last bunch of blocks mostly amounts to a bunch of CPU threads linearly scanning the memory image of those blocks.
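A minimal sketch of that linear-scan-over-a-memory-mapped-file pattern (the file name, and scanning for a raw byte pattern, are hypothetical; this is not Flowee’s actual block format or API):

```python
import mmap

def scan_block_file(path, needle):
    """Linearly scan a raw block file for a byte pattern. Multiple readers
    mapping the same file share the OS page cache, so the data only has to
    be pulled from disk once."""
    hits = []
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            pos = mm.find(needle)
            while pos != -1:
                hits.append(pos)
                pos = mm.find(needle, pos + 1)
    return hits

# Usage (hypothetical path and pattern):
# print(scan_block_file("blocks/blk00042.dat", bytes.fromhex("76a914")))
```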


I know I wrote that it is an unpopular opinion; the replies here show I was right. I wrote it because this opinion is well supported by evidence and experience.

I don’t expect people to suddenly be happy when so much of the Bitcoin Cash ecosystem is using it. I also don’t expect people to be happy with me personally for being the one writing it down and destroying some illusions. I apologize for that.

Point is, you don’t have to convince me. Convincing me would not change reality and it would not cause the bottleneck to go away.

All this post is meant to do is help people starting new projects avoid going in a direction just because many others have gone that way, and instead build new tech on a solid basis.

I absolutely don’t suggest or expect that existing wallets get redone, either.