Assessing the scaling performance of several categories of BCH network software

This is perfect, thank you.

As suggested earlier, I think the tests should generally be run with either a very large -dbcache (to stress block verification performance) or a very small one (to stress database performance).
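
For reference, a minimal sketch of how such a run could be launched, assuming bitcoind (BCHN) is on the PATH; the data directory is a placeholder and only -dbcache (in MiB) changes between the two configurations:

```python
import subprocess

def start_bchn(datadir: str, dbcache_mib: int) -> subprocess.Popen:
    """Launch a BCHN node with a chosen -dbcache size (MiB).

    A small value (e.g. 16) keeps the UTXO cache tiny so the run stresses the
    database; a large value (e.g. 16384) keeps most of the UTXO set in memory
    so the run stresses block verification instead.
    """
    return subprocess.Popen([
        "bitcoind",
        f"-datadir={datadir}",      # placeholder path
        f"-dbcache={dbcache_mib}",
    ])

if __name__ == "__main__":
    node = start_bchn("/data/bchn-dbcache-test", dbcache_mib=16384)
    node.wait()
```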

After investigation we discovered that the Blockbook 0p issues were apparently due to it not broadcasting blocks (which is the log statement we have been using to determine that it finished processing them) until it started receiving transactions. To address this we have started a "1p" run (1% of transactions broadcast before the block) to stand in for the 0p run in this case. We believe the difference in performance relative to a true 0p run should be minimal, since 99% of the transactions in the block must still be processed upon receipt of the block.

After analyzing this 1p data, we found that the BCHN numbers were essentially unchanged, consistent with prior tests. It is worth calling out that this test contained the fix for the BIP34 issue. Both Blockbook tests were run with the default Blockbook settings (rpcworkqueue=1100, maxmempool=2000, dbcache=1000). As for Blockbook performance, the 1p configuration produced semi-random results. The Misc 2 and Steady State blocks were processed much more quickly, while the Fan Out blocks were nearly 3 minutes slower on average. Comparing to the 90p data, it seems that blocks randomly took much longer in the 90p test, while the 0p tests were very consistent within each category. The standard deviation for the 1p Steady State blocks (including reorg) was 16 seconds; for 90p it was 6:27. For Misc 2 blocks, all processed in 0 seconds for 1p, while they averaged 7:00 for 90p (note that we generally remove zero-second blocks, which skewed the listed average in this case to 11:40).
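
As an illustration of how these per-category numbers are summarized, here is a minimal sketch (the input format and values are hypothetical, not our actual extracted data) that computes the mean and standard deviation per block category, optionally dropping zero-second blocks as noted above:

```python
import statistics
from collections import defaultdict

def summarize(block_times, drop_zero=True):
    """block_times: iterable of (category, seconds) pairs extracted from the logs."""
    by_cat = defaultdict(list)
    for category, seconds in block_times:
        if drop_zero and seconds == 0:
            continue  # zero-second blocks are normally excluded, as noted above
        by_cat[category].append(seconds)
    return {
        cat: {"mean": statistics.mean(t), "stdev": statistics.pstdev(t)}
        for cat, t in by_cat.items() if t
    }

# Hypothetical example values, in seconds:
print(summarize([("steady_state", 12), ("steady_state", 40), ("fan_out", 310), ("fan_out", 0)]))
```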

The report for this data can be found in the same Blockbook folder, here: Blockbook - Google Drive

One additional call-out is that we noticed periodic discrepancies between the block hashes in our extracted data and the emitter block hashes, but only for blocks 1-244. These blocks are always processed quickly, and as a result the logging is more sporadic, which our data extraction script does not account for. However, the number of blocks identified is still the same and the times for all of these blocks are the same, so there is no impact to the data that we actually care about.

Following Blockbook, we have selected the Bitcore node and wallet as our next applications of interest. We plan to set up a similar test to what we did for Blockbook by providing the Bitcore node with the intranet configuration used by the block emitter and running the three applications together.

As a follow-up from the last post, we ran one additional test with Blockbook to verify that the 0p scenario failed for the reasons we suspected. In particular, we wanted to ensure that sending any transactions would allow Blockbook to continue processing blocks successfully, even ones from which it had not yet seen any transactions. If such blocks were not processed, it could mean increased susceptibility to certain types of attacks.

In order to test Blockbook's behavior, we ran a custom modification of the 0p test in which only the first transaction of block 245 is broadcast before the block. No additional transactions are sent, so it is effectively the same as a 0p run, aside from one transaction being sent just as we reach the larger blocks.
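
For clarity, a rough sketch of how that single transaction can be broadcast over the node's JSON-RPC interface (the URL, credentials, and raw transaction hex are placeholders, not our actual test harness):

```python
import json
import requests

def send_raw_tx(raw_tx_hex: str,
                url: str = "http://127.0.0.1:8332",     # placeholder RPC endpoint
                auth: tuple = ("rpcuser", "rpcpass")):   # placeholder credentials
    """Broadcast one raw transaction via sendrawtransaction and return its txid."""
    payload = {"jsonrpc": "1.0", "id": "0p-probe",
               "method": "sendrawtransaction", "params": [raw_tx_hex]}
    resp = requests.post(url, data=json.dumps(payload), auth=auth, timeout=30)
    resp.raise_for_status()
    body = resp.json()
    if body.get("error"):
        raise RuntimeError(body["error"])
    return body["result"]
```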

The result of this test was, as expected/hoped, that Blockbook continued processing blocks normally. The inclusion of this single transaction broadcast appears to shift Blockbook into a different mode, allowing it to process the blocks as expected from that point on.

Given these positive results we are now considering this phase of Blockbook testing complete. As mentioned, we are now moving on to testing Bitcore applications.

After working with Bitcore Node for some time, we were unable to successfully get it to sync the intranet blocks.

Documentation of the process and configuration used can be found here: Bitcore - Google Drive

We were able to get Bitcore Node to sync the first 244 blocks of the test case (i.e. both mainnet blocks and test-specific blocks) but it consistently failed to process block 245, the first large block (185 MB). Unfortunately, when this happened there were no immediately apparent errors. It seemed to silently stop syncing and then later print some recurring socket-related errors. Restarting Bitcore Node would cause it to identify that there were new blocks to sync, but then never give any indication that they were processed (e.g. they were not added to the database). Ultimately, our conclusion is that either 1) there is a hidden limit that causes Bitcore Node to reject large blocks in a way that we weren't able to see, or 2) a technical failure occurred, such as being unable to allocate a sufficiently large array, that caused syncing to fail indefinitely.

Reviewing the configuration and source code led us to conclude that we were likely not encountering a block size limit. The only limit we could find in the code was a 1 MB "block size" limit, which appeared to be used only for checking transaction sizes. As a sanity check, we set up BCHN/Bitcore Node to sync testnet. Our test didn't complete for unrelated reasons (the server storage was too small and filled up) but before that failure, it was able to get to block 1,337,979. In doing so, it successfully processed many 32 MB blocks, all apparently fairly quickly (spot-checking suggested that all blocks were processed in less than one second). Our conclusion was that there is either a 32+ MB block size limit that we were unable to find, or no limit is enforced and there was some other failure leading to the stalling.

In the future, we recommend further investigation into why Bitcore Node was unable to sync the larger blocks. Ideally this would be done by someone familiar with the Bitcore codebase who can evaluate the installation steps and configuration that we used for Bitcore Node. That review may suggest improvements that would either fix the problem or provide more information about what went wrong.

Additionally, whether as a part of follow-up testing for Bitcore, or generally for the sake of additional scaling testing, it may be beneficial to create an alternate test chain that more slowly ramps up the block size. In the current test chain, blocks go from 217 bytes to 185 megabytes. For applications that are unable to process large blocks, it would be helpful to see which sizes they could process on a chain whose blocks start at 217 bytes, then jump to 16 MB, 32 MB, 48 MB, 64 MB, etc. In this case (Bitcore) it may have provided helpful information about where to start further investigation. For now, we are leaving this as a future task, to be revisited for the purpose of triaging applications that fail on block 245 of the current test chain, should others be found to have the same behavior.

Next, we will be testing mining pool software. Specifics are still to be determined, following additional evaluation of the available options.

The mining pool software we decided to go with next was ASICseer-pool.

Documentation on the process and configuration used can be found here: ASICseer Pool - Google Drive

Since mining operates a bit differently than the other tools we've evaluated so far, we were initially unsure of what metrics would make sense to look at. Unsurprisingly, as new blocks were emitted, the pool and connected peer were updated less than a second after BCHN finished processing the block. It makes sense that there would be no variation based on block size here, since mining off of the new block can begin immediately, without any transactions. Given that, we turned our attention to getblocktemplate calls, since those would inherently be more dependent on the transaction data.
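
To make the comparison concrete, a minimal sketch of how getblocktemplate call durations can be measured over JSON-RPC (the endpoint and credentials are placeholders):

```python
import json
import time
import requests

def time_getblocktemplate(url: str = "http://127.0.0.1:8332",      # placeholder endpoint
                          auth: tuple = ("rpcuser", "rpcpass")):    # placeholder credentials
    """Time one getblocktemplate call and report how many transactions the template holds."""
    payload = {"jsonrpc": "1.0", "id": "gbt-timing",
               "method": "getblocktemplate", "params": []}
    start = time.monotonic()
    resp = requests.post(url, data=json.dumps(payload), auth=auth, timeout=120)
    elapsed = time.monotonic() - start
    resp.raise_for_status()
    template = resp.json()["result"]
    return elapsed, len(template.get("transactions", []))
```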

Interestingly, even the getblocktemplate call duration was not well correlated with block size, nor with the transaction types (e.g. number of outputs per transaction). The closest correlation we saw was between the number of outputs per transaction and the number of calls made to getblocktemplate (as opposed to the duration of those calls). This is surprising since fees are more directly tied to transaction size than to the number of outputs, so we would have expected the clearer correlation to be with block size.

We are also unsure of how to interpret the rate of calls to getblocktemplate, particularly with respect to the large gap we see in the data before roughly the 6-minute mark following a new block, and the lack of getblocktemplate calls for small blocks. Using the dummy stratum peer we were able to see that mining.notify stratum calls were being emitted on a very regular cycle, roughly every 30 seconds, varying slightly when new blocks were received. Broadly, though, this seems to happen independent of the getblocktemplate calls. We recommend investigation into what triggers getblocktemplate calls and how this pattern affects fee collection in miners using this pool.
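
For reference, the dummy stratum peer can be approximated with a short script like the one below; the host, port, and the assumption that a bare mining.subscribe is enough to start receiving notifications are all placeholders/assumptions rather than the exact setup we used:

```python
import json
import socket
import time

def watch_notify(host: str = "127.0.0.1", port: int = 3333):  # placeholder pool address
    """Subscribe to the pool's stratum port and timestamp incoming mining.notify messages."""
    with socket.create_connection((host, port)) as sock:
        sock.sendall(b'{"id": 1, "method": "mining.subscribe", "params": []}\n')
        buf = b""
        while True:
            chunk = sock.recv(4096)
            if not chunk:
                break  # pool closed the connection
            buf += chunk
            while b"\n" in buf:
                line, buf = buf.split(b"\n", 1)
                msg = json.loads(line)
                if msg.get("method") == "mining.notify":
                    # params[0] is the job id in stratum v1
                    print(time.strftime("%H:%M:%S"), "mining.notify", msg["params"][0])
```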

One additional result is that after charting the number of notify calls per block, we noticed an idiosyncrasy in the block processing data: block 262 is emitted right after block 261, causing only one notify call to be made. This block emission pattern exists in all prior tests, however, so it is a "feature" of the dataset, not an indication of unusual behavior on the part of the pool.

Upon evaluation of the work that has been completed so far, it has been decided that this is a good place to put additional application testing on hold. A summary of the research completed so far can be found here:

PDF file in case Google goes to strange places: BCH Scaling Research Summary.pdf (60.9 KB)

I just wanted to leave a note in here – I fixed the observed performance issues in Fulcrum several months ago. It wasn't a bottleneck in Fulcrum per se; it was basically some bugs in the HTTP client implementation in Fulcrum that would time out and retry over and over again on big blocks, which caused an observed slowness in getting huge blocks from bitcoind. This has been fixed, and I believe a new measurement would reveal much faster performance from Fulcrum.
