Unpopular opinion: (most) indexers don't scale

I like data-structures and thinking about data-lookup techniques. Over the last decade I’ve progressed from simply using databases to actually writing various types of them; it is an odd hobby of mine. The UTXO one is part of Flowee today.

A database that is meant to be used and to grow over time is best designed with the specific use-case in mind. For instance, the UTXO database stores all not-yet-spent money on a chain; its use-case is thus to store unspent coins.
It grows with the number of users. An individual will hold a relatively stable number of actual coins (outputs), because the database can delete old ones that are no longer used. Spend one, create a new one: no change in database size.
So the number of UTXO entries will grow at the same rate as the number of people using the coin.
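The spend-one-create-one behaviour can be sketched in a few lines. This is a toy model only (real nodes key entries by outpoint and store amounts and scripts), but it shows why steady per-user activity leaves the set size unchanged:

```python
# Toy UTXO set: spending an output deletes its entry, creating an
# output inserts one, so a user churning coins doesn't grow the set.
utxo = {}

def create(outpoint, coin):
    utxo[outpoint] = coin          # new unspent output

def spend(outpoint):
    del utxo[outpoint]             # spent outputs are removed for good

# One user holding a stable number of coins:
create(("tx1", 0), 50)
size_before = len(utxo)

# Spend the coin and receive a new output in the same transaction:
spend(("tx1", 0))
create(("tx2", 0), 49)

assert len(utxo) == size_before    # database size unchanged
```

The set only grows when new users (or users holding more outputs) arrive, which is the linear-growth property discussed below.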

This is super useful information for keeping your database scalable and for predicting how many users it will be capable of supporting.

The UTXO-database example is linear. If the number of users goes from 10 million to 100 million, the amount of data stored and processed by the database is also roughly going to be 10x. Linear scaling is OK.

Linear databases grow based on the number of users. Notice that user numbers will likely grow much faster over time. This is normal and how adoption works; Moore’s law is the one taking care of the accelerated growth of users.

Non-linear databases

A good example of this is an address-lookup database.

The main difference between the UTXO and an address-database is deletion of old records. In the UTXO database we remove items that are used-up. In an address database we don’t. All addresses since the beginning of the chain will forever be searchable.

We are happy that privacy is a focus point on our chain: people use HD wallets and we advocate a new address for every single receiving transaction. This is great for privacy. Not so great for the growth of an address indexer.

Now we have two growth dimensions. More users create more addresses, but since we don’t delete old entries we also have per-month usage as a growth dimension. We went from linear growth to exponential growth. From only caring about how many users we add to also caring how many addresses each user creates every month.
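The two growth profiles can be put side by side in a sketch. The data shapes here are illustrative only, but the contrast is the point: the UTXO set deletes what is spent, while the address index only ever accumulates:

```python
# Toy contrast: UTXO set (spent entries deleted) vs. address index
# (append-only, nothing ever leaves).
utxo = {}           # outpoint -> address, entries get deleted
addr_index = {}     # address -> list of txids, entries never leave

def tx(spend_outpoint, new_outpoint, address, txid):
    if spend_outpoint in utxo:
        del utxo[spend_outpoint]           # UTXO set forgets spent coins
    utxo[new_outpoint] = address
    addr_index.setdefault(address, []).append(txid)  # index only grows

# A single user moving to a fresh address every month:
prev = None
for month in range(12):
    cur = ("tx%d" % month, 0)
    tx(prev, cur, "addr%d" % month, "tx%d" % month)
    prev = cur

print(len(utxo))        # stays at 1: old outputs were spent
print(len(addr_index))  # grew to 12: one entry per fresh address
```

One user, constant holdings, yet the index gained twelve entries in a year. Multiply by a growing user base and you get the second growth dimension.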

Remember the traditional exponential growth explanation from the Indian story about the brahmin Sissa ibn Dahir, who reportedly invented chess.

Sissa invents chess for an Indian king for educational purposes. In gratitude, the King asks Sissa how he wants to be rewarded. Sissa wishes to receive an amount of grain which is the sum of one grain on the first square of the chess board, and which is then doubled on every following square.

It looks cheap. The first is one grain. Then 2, 4, 8, 16, 32 grains. Etc.
But the total number of grains can be shown to be 2⁶⁴ - 1, or 18,446,744,073,709,551,615 (eighteen quintillion, four hundred forty-six quadrillion, seven hundred forty-four trillion, seventy-three billion, seven hundred nine million, five hundred fifty-one thousand, six hundred and fifteen) grains: over 1.4 trillion metric tons, which is over 2,000 times the annual world production of wheat. (wikipedia)
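The chessboard number is easy to check directly, since the sum of a doubling series over 64 squares is 2⁶⁴ - 1:

```python
# Sum one grain on square one, doubled on each of the 64 squares.
total = sum(2**i for i in range(64))
print(total)                 # 18446744073709551615
assert total == 2**64 - 1    # the classic closed form
```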

That gives a small introduction to exponential growth. Its numbers go up much faster than most people realize. Point is: any system that has to support such growth will ultimately reach its limits and fail.

So, in a world with total stagnation and no new users, the UTXO database will similarly not grow. But in this non-growth world the address database will continue to grow. Now back to the real world, where we want massive growth of users, and your address database will grow exponentially in size and usage.

Why does that matter?

In the short term we don’t have that big a user base and it doesn’t matter much.

What is important is to realize the underlying scaling issue and avoid building infrastructure that depends on such indexers, because those low-level decisions will cause centralization over time, when the only people that can run such databases are the ones investing lots of money in hardware.

The blockchain itself is a data-store, not a database. It, too, will have exponential growth. But storage is cheap, and the blockchain being a data-store means it won’t actually be used for lookups or inserts.
Indeed, the blockchain won’t work without databases like the UTXO one. It would thus be wise to only use and depend on databases that have linear growth.

Why is this an unpopular opinion?

Because in Bitcoin Cash the vast majority of projects / wallets depend on such databases.


Yes, address indexers have theoretical trouble at scale. Of course, a very fast key-value store can theoretically do lookups quickly even on a HUUUGE dataset (RocksDB in particular can cope well with billions of keys). But ultimately the disk cost can get very large – almost so large as to be prohibitive on common hardware.

I guess if we get to a point where indexing addresses since the beginning of time is untenable disk-cost-wise, we will have to just set a cutoff date for things like Fulcrum or other address indexers. That is, you won’t see your entire tx history since 2009. If you have addresses with txns before the cutoff, you just won’t see those txns at all. Basically, prune old data.
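The cutoff idea is mechanically simple. A sketch, assuming a hypothetical index layout where each entry carries the block height it confirmed at (Fulcrum's actual storage differs):

```python
# Hypothetical address index: address -> [(txid, confirmation_height)].
index = {
    "addr1": [("txA", 100_000), ("txB", 700_000)],
    "addr2": [("txC", 150_000)],
}

def prune(index, cutoff_height):
    """Drop entries confirmed before the cutoff; evict empty addresses."""
    for addr in list(index):
        kept = [(txid, h) for txid, h in index[addr] if h >= cutoff_height]
        if kept:
            index[addr] = kept
        else:
            del index[addr]      # address has no post-cutoff history at all

prune(index, 600_000)
print(index)   # only addr1's post-cutoff entry survives
```

Pre-cutoff history simply disappears from the server's answers, which is exactly the trade-off described above.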

In some ways one might argue that if we get to huuuge blocks and lots of on-chain activity, even having the entire blockchain on-disk from the beginning of time also will not scale particularly well either … thus the need for some sort of utxo-commitments and fast-sync, with miner validation and PoW backing it (thereby effectively enabling secure utxo commitments)…


What is today’s size of the address DB for mainnet?

If you have a billion people each creating 10 addresses every single day, this may be viable for a while, but 10 billion new entries every single day…

Exponential growth, baby. It will eventually roll over all those scaling plans.

What is the alternative for serving a SPV wallet? A user verifying historical transactions is a real use-case and without an address index it would be very costly for a full node to scan the blockchain for certain transactions which definitely doesn’t scale well, primarily in terms of disk I/O. Are you proposing that SPV wallets should have the UTXO set locally (if we had miner enforced UTXO commitments deployed)?

Address-to-transactions indexers are still bound in storage size to be smaller than the blockchain.

Those wallets can use the ‘merkleblock’ p2p message. (spec) / (bitcoin.org spec).
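The mechanism behind ‘merkleblock’ (BIP 37) is that the wallet hands the node a bloom filter of its addresses and outpoints; the node tests each transaction against it and only relays matches plus a merkle proof. A minimal bloom-filter sketch, with toy parameters and a stand-in hash scheme rather than BIP 37's murmur3 wire format:

```python
import hashlib

SIZE = 1024        # filter size in bits (toy value)
HASHES = 3         # number of hash functions (toy value)

def _positions(item: bytes):
    # Derive HASHES bit positions from the item (illustrative stand-in
    # for BIP 37's seeded murmur3 hashes).
    for i in range(HASHES):
        digest = hashlib.sha256(bytes([i]) + item).digest()
        yield int.from_bytes(digest[:4], "little") % SIZE

def insert(bits: bytearray, item: bytes) -> None:
    for p in _positions(item):
        bits[p // 8] |= 1 << (p % 8)

def maybe_contains(bits: bytearray, item: bytes) -> bool:
    # False: definitely not inserted. True: possibly inserted, and the
    # false positives are what give SPV wallets plausible deniability.
    return all(bits[p // 8] & (1 << (p % 8)) for p in _positions(item))

wallet_filter = bytearray(SIZE // 8)
insert(wallet_filter, b"my-address-1")
insert(wallet_filter, b"my-outpoint-1")
print(maybe_contains(wallet_filter, b"my-address-1"))   # True
```

The filter itself is tiny and fixed-size; the node-side cost is scanning blocks against it, which is the trade-off debated in the rest of this thread.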

Which is inherently expensive in disk I/O for the node… I have yet to see a good response to this 5 year old article from a prominent small-blocker.


The comparison is a good one. Which is more scalable? (Spoiler: merkleblock wins.)

On one hand you have a single user that only ever requests a scan of a block a single time. There are a lot of blocks, but one wallet will not repeatedly ask for any block it has already scanned. Moreover, most wallets won’t be restoring a backup seed, so they only check new blocks, not historical ones. And, again, if they do restore a backup seed they only ask for historical blocks once.

Basically this means that a single user will only request the last day or two of blocks to be scanned by a full node. Just like all the other users connecting to a full node do. Same blocks, already in disk-cache.

On the other hand we have an address database that grows exponentially in size, making the server take longer and longer to give replies, and at the same time making the hardware needed to run it more costly, thus limiting the number of servers handling the entire ecosystem.

If it’s not obvious which one is processing more data, look at it from this point of view: the address database will forever grow at an exponential speed.
If I ask that address-lookup server today which transactions are for me, it will be cheaper than asking tomorrow. It gets more expensive every single day that the database grows (even when blocks stay the same size).

On the other hand, a full node scanning a day’s set of blocks has a linear cost. The data created in a day stays simple and stays the same size.

I did that once; it’s very much full of lies and impossible assumptions.

Did you notice that he assumed that the number of connections per peer could not be increased? Or notice that he assumes that the hardware of 5 years ago would serve an ecosystem of 1 billion SPV users? Maybe you noticed that he uses the number of full nodes running when the article was written, while a good coin will definitely grow not just SPV users but also the number of merchants and others that run a full node.

Now, more important on this topic is thinking how fast your address database will grow when you have 3.5GB blocks, and how many users can still run that indexing server :slight_smile:


What if we also managed to modify the full node software to support 1,000 connected SPV clients?

He does some math related to that hypothesis, so I would not dismiss him out of hand like that.

Yes, it might be useful to run his calculations against updated hardware costs from 2022 instead of 2017.

His argument, that growth of these requirements will reduce the number of users willing to run them, should be examined critically in the wider ecosystem context, in my opinion. It should definitely not be accepted that the number of full nodes would only go down, because his argument ignores the potential value increase (and hence price increase) that comes with a more widely used network (coin), and what that does to people’s (and businesses’) ability to run nodes.

All in all, he makes a shallow argument that centralization would increase if these hardware requirements were to go up. Unfortunately, that first-order logic appeals to many who don’t give much thought to the economic ramifications of Bitcoin serving an increasing number of users.

To be fair I think Lopp does try to point out that his concerns with SPV lie with the scaling of its servers and how many people will be able to run them.

I agree with @Jonas that his article deserves a much more extensive rebuttal from “our side”. We may not even be able to disarm all his points of concern at this time, but at least we can recognize where some of the arguments he makes have weakness.


Can you spot the fallacy? In your argument you point out that a full node needs to be able to serve old historical data, although very seldom. This would mean that the node would need to store that data somewhere and since we know that an address-to-txids-index is smaller in size than the data itself you can’t blame the index for exponential growth. You’d still have the same growth complexity without the index!

I don’t really get what you are trying to say in this post TBH. The UTXO set scales fine and this has been known for years but that isn’t directly related to the problem that is solved with indexers and related protocols, namely serving historical data.

There is no fallacy in my argument.

I can’t really follow your problem with it; it looks like you are mixing up a database and a data-store. I’ll just say that inserting into a database is an operation that gets more expensive the bigger the database gets. On the other hand, appending to a data-store costs the same regardless of existing size.
I explained this already in various ways in the actual posts above…
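The database-vs-data-store distinction above can be made concrete. A sketch using a sorted in-memory list as a stand-in for an index (a real index would be a B-tree or LSM-tree, but the asymmetry is the same):

```python
import bisect

# Contrast: appending to a log is position-independent, while a sorted
# index must find the right slot for every insert, work that grows
# with the number of existing entries.
log = []                 # data-store: append-only
sorted_index = []        # database: kept ordered for fast lookups

def append_to_log(entry):
    log.append(entry)                  # O(1), however big the log is

def insert_into_index(key):
    bisect.insort(sorted_index, key)   # O(log n) search plus an O(n)
                                       # shift of existing entries

for k in [5, 1, 9, 3]:
    append_to_log(k)
    insert_into_index(k)

print(log)            # [5, 1, 9, 3] -- arrival order, no sorting work
print(sorted_index)   # [1, 3, 5, 9] -- order was paid for at insert time
```

Appending never touches old data; maintaining an index does, which is why insert cost tracks database size.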

That most indexers don’t scale. That people building stuff in this ecosystem should be made aware of this fact and try to find a solution that doesn’t depend on them.

And it is a fair worry, one that is only sane to have.

There is a mix of us “just try it and see when it fails” on one side and indeed ensuring we don’t innovate in directions that will definitely fail on the other.
And that is the main reason for this topic.
This topic is here to make clear that one approach does not and will not scale. Exponential growth has been known to not be sustainable since that chess story (13th century).

We can, and probably should, try SPV to its extreme. Simply because it’s the best idea we have seen so far. There is more than one way to skin that cat. So knowing that address indexers will perform much, much worse over time than SPV based on merkleblock is something that is good to know early on, right?

History doesn’t scale, period. This holds for all history, not just blockchain. What then happens naturally is that people who care about some parts of history keep it around while the rest of it fades away, something like organic UTXO. I think it will inevitably happen with blockchain too. Many exchanges won’t let you download your full history. Many banks don’t give you access to your old transaction logs. If you want them - you need to think ahead and store them yourself.

What does scale is an address-to-UTXOs index :slight_smile: If you want history, you are responsible for it: keep your part of it.

You are conveniently ignoring the fact that you need to do linear lookups in this ever-growing data-store, and that those linear lookups will increase linearly with the number of SPV users.

I advise you to actually read the article. You are exactly wrong.

I’m not ignoring that at all. What makes you say that? Please read the reply to freetrader.

You did it in the very first post:

Dude, you clearly are (again) just arguing because you like arguing. What value are you adding to this forum?

Do you actually understand the difference between depending on a not-scaling, exponentially growing indexer and depending on one that has a real chance of supporting the growth?

Now, you can continue arguing that neither backend will support SPV, but then maybe you should not participate in these discussions because everyone using a full node is definitely not gonna scale.

When I say address->UTXO scales I mean that you have an index that evicts an address from the index when its last UTXO is spent (it’s 1 address to many UTXOs relationship), so number of addresses will always be less than number of UTXOs and the index ought to have same scaling properties as UTXO set, no? On the other hand Address->Transactions doesn’t scale long-term and suffers from the exponential history problem. Question is, at what point will it become a problem where we can’t take fetching entire wallet history for granted? At what point do users need to start being responsible for their history because someone^TM won’t forever be available to provide it for free.
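The eviction behaviour described here is worth sketching, since it is what keeps the address→UTXO index bounded by the UTXO set. Illustrative structure only, not any node's actual one:

```python
# address -> set of unspent outpoints; the address itself is evicted
# the moment its last UTXO is spent, so the index can never hold more
# addresses than there are UTXOs.
addr_utxos = {}

def add_utxo(address, outpoint):
    addr_utxos.setdefault(address, set()).add(outpoint)

def spend_utxo(address, outpoint):
    utxos = addr_utxos[address]
    utxos.discard(outpoint)
    if not utxos:
        del addr_utxos[address]   # last UTXO gone -> address evicted

add_utxo("addr1", ("tx1", 0))
add_utxo("addr1", ("tx2", 0))
spend_utxo("addr1", ("tx1", 0))
print("addr1" in addr_utxos)   # True: one UTXO remains
spend_utxo("addr1", ("tx2", 0))
print("addr1" in addr_utxos)   # False: evicted with its last UTXO
```

Because every entry is tied to a live UTXO, the index inherits the UTXO set's linear growth profile rather than the append-only history problem.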


Using the merkleblock setup for the majority of wallets I believe we’ll scale the best way we can without needing to worry about some drastic measures like cutting off history.


Ok, so these things are generally true:

  • an address index indexes the data of the blockchain such that certain types of queries can quickly be made.
  • an address index eats less data than the thing it is indexing (the blockchain).
  • bloom filter + merkle block, while very clever and elegant, isn’t really an index per se because you cannot arbitrarily ask questions about an address in the blockchain given a particular filter – at least not in a performant manner – without scanning the whole blockchain!

Therefore in a way, comparing merkle block to an address index is a bit of an apples-to-oranges comparison. They are not exactly the same thing and have different tradeoffs.

Yes, a bloom filter + merkleblock eats 0 bytes on-disk above and beyond the size of the blockchain. This is because it’s not an index at all, but rather a database-linear-scanning-trick. If you don’t have an index, of course your lack of an index is smaller than an actual index!

But, at the same time, if you don’t have an index, you can’t grab a history without asking the node to scan everything you missed “since last time” linearly…

A linear scan of a big blob of data is always slower than an indexed query. A good index DB will use an O(log N) or even O(1) lookup.
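The gap between the two is worth putting in numbers. A back-of-envelope sketch counting entries touched per query (ignoring constants, caches, and disk layout), assuming a binary-search-style index that touches roughly log2(N) entries:

```python
import math

# Entries touched per query: full linear scan vs. O(log N) index.
for n in (10**6, 10**9, 10**12):
    indexed = math.ceil(math.log2(n))
    print(f"{n:>15,} entries: scan touches {n:,}, index ~{indexed}")
```

At a trillion entries the indexed lookup still touches only about 40 entries, which is why "no index" and "index" are not interchangeable answers for interactive queries, whatever their relative disk footprints.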

Yes, an indexer eats more disk space than no index. But this is not going to be exponential growth as compared to the blockchain’s own growth. It can only be at most as bad or less bad than the blockchain’s growth itself.

So claiming that an address index is unscalable is the same as claiming that a large blockchain is unscalable… for the same (valid or invalid) reasons (depending on your philosophy).
