Assessing the scaling performance of several categories of BCH network software

We’ve compiled reruns of the existing test framework to explore the hypothesis that the RPC and P2P code paths behave differently (and, more specifically, that P2P would perform better). In short, the results appear to be within run-to-run variance (in particular, the large spikes, likely caused by DB flushing, inflate the averages). The formatted results for each finding are below:

RPC Results: RPC Results - Google Sheets

P2P Results (No Tx Indexing): P2P Results - No TxIndex - Google Sheets

P2P Results (Tx Indexing):

2 Likes

Let me know if you want help with determining sample size and constructing confidence intervals.

1 Like

Can you please? Thanks in advance!

2 Likes

Thanks Josh

> you should have been able to run ./gradlew makeJar copyDependencies and have it “just work”

I’m not familiar with Gradle, so I didn’t know that, but eventually I got it to work. I wonder why running the script failed for me initially (I remember it couldn’t find gradlew, which is why I figured I had to create one myself).

I calculated some confidence intervals and did some statistical power analysis using the data on the Google Sheet and an R script I wrote here.

My conclusion: each phase that you want to measure should be run for 100 blocks or more. I know that’s running each phase for more than half a day, but if you want reliable results then you have to increase the sample size substantially above what you have now.

I estimated confidence intervals using a nonparametric percentile bootstrap. Bootstrapped confidence intervals work well in cases of low sample size and data that is not normally distributed, like in our case here. I chose to display the 90% confidence interval since that seemed appropriate for our purposes.
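For anyone who wants to follow along without the R script, here is a minimal Python sketch of the percentile bootstrap idea. The sample values are placeholders, not the spreadsheet data.

```python
import numpy as np

def bootstrap_ci(samples, n_resamples=10_000, level=0.90, seed=42):
    """Nonparametric percentile bootstrap CI for the mean per-block time."""
    rng = np.random.default_rng(seed)
    samples = np.asarray(samples, dtype=float)
    # Resample with replacement; record the mean of each resample.
    means = np.array([
        rng.choice(samples, size=len(samples), replace=True).mean()
        for _ in range(n_resamples)
    ])
    lower_pct = (1 - level) / 2 * 100   # 5th percentile for a 90% CI
    upper_pct = 100 - lower_pct         # 95th percentile
    return np.percentile(means, [lower_pct, upper_pct])

# Placeholder per-block processing times in seconds (not the real data).
fan_out_seconds = [35, 48, 120, 61, 44, 95, 52, 70, 39, 58]
print(bootstrap_ci(fan_out_seconds))    # prints [lower, upper]
```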

The units of the confidence intervals are seconds to process each block. The “Transactions per second” unit cannot be used directly since there is no measurement of how long each transaction verification takes and therefore there is no way to calculate the variability. I only had enough data to measure the fan-out and steady state 1 phases. Fan-in had only two observations, which is too few. Steady state 2 was missing data in the sheet. The Fulcrum data sheet has its units as “msec”, but from the discussion above it seems that it is actually just seconds.

Here are the confidence intervals:

| Processing Type | Block Type | Lower 90% C.I. (seconds) | Upper 90% C.I. (seconds) |
|---|---|---|---|
| bchn.0p | fan-out | 31 | 106 |
| bchn.0p | steady state 1 | 42 | 62 |
| bchn.90p | fan-out | 28 | 135 |
| bchn.90p | steady state 1 | 12 | 14 |
| fulcrum.0p | fan-out | 1785 | 2169 |
| fulcrum.0p | steady state 1 | NA | NA |
| fulcrum.90p | fan-out | 1579 | 1805 |
| fulcrum.90p | steady state 1 | 574 | 698 |

The largest confidence intervals are for the fan-out phases for BCHN (both 90p and 0p). They are very large and therefore need to be shrunk by increasing the sample size.

Through statistical power analysis we can get a sense of how many observations are needed to shrink the confidence intervals to a certain size. To standardize and make the numbers comparable across different block processing procedures, we can express the width of these confidence intervals as a percentage of the mean of the quantity being measured.

Below are the estimated sample sizes needed to achieve a target confidence-interval width. I chose 10%, 25%, and 50% of the mean for comparison:

| Processing Type | Block Type | N for C.I. width < 10% of mean | < 25% of mean | < 50% of mean |
|---|---|---|---|---|
| bchn.0p | fan-out | 1447 | 234 | 60 |
| bchn.0p | steady state 1 | 93 | 17 | 6 |
| bchn.90p | fan-out | 2036 | 328 | 84 |
| bchn.90p | steady state 1 | 18 | 5 | 3 |
| fulcrum.0p | fan-out | 45 | 9 | 4 |
| fulcrum.0p | steady state 1 | NA | NA | NA |
| fulcrum.90p | fan-out | 22 | 6 | 3 |
| fulcrum.90p | steady state 1 | 25 | 6 | 3 |

The results show that we ought to be able to shrink the confidence interval to less than 50% of the mean for all block processing procedures if we use 100 blocks for each phase.
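For reference, here is a rough sketch of the kind of calculation behind numbers like these. It uses a simple normal approximation rather than the bootstrap-based power analysis in my R script, so treat it as an approximation; the sample values are placeholders.

```python
from math import ceil
from statistics import mean, stdev

def required_n(samples, width_fraction, z=1.645):
    """Sample size so that a 90% CI's *full* width is at most `width_fraction`
    of the sample mean, using the normal approximation width ~ 2*z*s/sqrt(n)."""
    m, s = mean(samples), stdev(samples)
    target_width = width_fraction * m
    return ceil((2 * z * s / target_width) ** 2)

# Placeholder per-block times in seconds (not the real data).
fan_out_seconds = [35, 48, 120, 61, 44, 95, 52, 70, 39, 58]
for f in (0.10, 0.25, 0.50):
    print(f"CI width < {f:.0%} of mean -> n ~ {required_n(fan_out_seconds, f)}")
```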

Let me know if I have misunderstood anything about the data.

2 Likes

Wow, that’s far more in-depth than I expected.

Thank you.

I wouldn’t advise jumping to 100 blocks just yet at this stage of testing, but if in the future several tests turn out to be close, it could make sense then.

1 Like

The latest report is available here: P2P Results - Steady-State - Flawed 90p / 0p - Google Sheets

The current test case now has 10 additional 256 MB blocks. Each new block contains transactions with 2 inputs and 2 outputs. We encountered some difficulty getting these test blocks created; most of the problems involved managing the UTXO set for the test emitter so that the generated blocks did not consist entirely of massively chained transactions within the same block (since that style of block likely takes a different code path within the node and does not necessarily represent a typical real-life situation).
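For illustration, a heavily simplified sketch of the bookkeeping involved. This is not the emitter’s actual code; transaction construction and signing are omitted entirely.

```python
import random

def build_block_txs(spendable_utxos, target_tx_count):
    """Build 2-in/2-out transactions for one block, spending only UTXOs that
    were confirmed in *previous* blocks, so no transaction chains off another
    transaction created in the same block."""
    block_txs, new_outputs = [], []
    random.shuffle(spendable_utxos)
    while len(block_txs) < target_tx_count and len(spendable_utxos) >= 2:
        inputs = [spendable_utxos.pop(), spendable_utxos.pop()]    # 2 inputs
        outputs = [f"utxo_{len(new_outputs)}", f"utxo_{len(new_outputs) + 1}"]  # 2 outputs
        block_txs.append({"inputs": inputs, "outputs": outputs})
        new_outputs.extend(outputs)
    # Outputs only become spendable once the block containing them is mined.
    spendable_utxos.extend(new_outputs)
    return block_txs

# Example: a starting pool of 1,000 fan-out UTXOs yields up to 500 transactions.
print(len(build_block_txs([f"seed_{i}" for i in range(1000)], target_tx_count=500)))
```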

There are currently two problems with the new data set:

  1. The test blockchain is now long enough that the later block timestamps end up too far ahead of the node’s clock unless the node’s clock time is newer than it was for the original tests. Fixing this isn’t as simple as moving the system clock forward, since services (like Fulcrum) will not exit the “initial block download” state if the clock is too far forward, and if the clock is too far behind then the blocks are considered invalid (see the sketch after this list). Fortunately, this only affects the 0p tests, since those blocks are broadcast in rapid succession, whereas the 90p tests are broadcast on a rough 10-minute interval. We will have to restructure the emitter so that the 0p tests are broadcast on a 10-minute interval as well, which will unfortunately make their execution take longer.
  2. The fees used to generate the 2-in/2-out transactions (~160 per ~374 bytes) were below the default relay threshold. This was accidental. However, instead of re-mining the blocks (which takes about a day), we decided that configuring the node with minrelaytxfee set to 1 is more pragmatic (at least for now).
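For context on the timestamp issue in item 1: BCHN, like other Bitcoin-derived nodes, rejects blocks whose timestamp is more than about two hours ahead of the node’s (network-adjusted) clock. Below is a rough sketch of why a rapidly broadcast chain of 10-minute-spaced blocks eventually crosses that boundary; the constants reflect our understanding of the rule, not the framework’s code.

```python
import time

MAX_FUTURE_BLOCK_TIME = 2 * 60 * 60   # ~2 hours; future-timestamp limit
BLOCK_INTERVAL = 10 * 60              # test blocks are timestamped 10 minutes apart

def first_rejected_block(chain_start_time, block_count, node_clock=None):
    """Return the index of the first block whose timestamp would be more than
    2 hours ahead of the node's clock if all blocks were broadcast at once."""
    node_clock = node_clock if node_clock is not None else time.time()
    for height in range(block_count):
        block_time = chain_start_time + height * BLOCK_INTERVAL
        if block_time > node_clock + MAX_FUTURE_BLOCK_TIME:
            return height
    return None  # all timestamps acceptable

# Example: with timestamps starting "now", the chain outruns the 2-hour window
# after 12 blocks' worth of 10-minute spacing, so block 13 would be rejected.
print(first_rejected_block(chain_start_time=time.time(), block_count=40))
```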

Unfortunately, we did not notice the 2nd issue until we were compiling the test results today. It resulted in the test framework broadcasting 90 percent of the transactions up until block 266 and then broadcasting 0 percent (due to the min relay fee) for blocks 267-277. We plan on rerunning the tests with the new configuration and posting new results later this week.

Additionally, we mined a 256 MB reorg block and ran it (manually) against the BCHN node. This one-off test showed that the node took over a minute and a half to finish a reorg of a single 256 MB block. If we can reproduce this result, it could indicate a problem for the network should a 256 MB reorg happen at the current level of node performance. Since this could be indicative of a large problem for the current state of long-term scaling, we decided our next step would be to put more time into supporting this test properly within the framework. The intent is to run the test scenario as 90p with P2P broadcasting enabled and to intentionally reorg as part of the standard test. Once properly integrated, we can observe how Fulcrum (and other endpoints) respond to this situation.
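As a rough illustration of how this kind of measurement can be taken from outside the node, here is a hypothetical sketch that polls getbestblockhash over JSON-RPC and times how long the tip takes to change after the competing block is broadcast. The endpoint and credentials are placeholders, and this is not the framework’s actual instrumentation.

```python
import base64, json, time, urllib.request

RPC_URL = "http://127.0.0.1:8332"      # assumed BCHN JSON-RPC endpoint
RPC_AUTH = "user:password"             # assumed rpcuser:rpcpassword

def rpc(method, params=None):
    """Minimal JSON-RPC call against the node."""
    payload = json.dumps({"id": 0, "method": method, "params": params or []}).encode()
    req = urllib.request.Request(RPC_URL, data=payload)
    req.add_header("Authorization",
                   "Basic " + base64.b64encode(RPC_AUTH.encode()).decode())
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["result"]

def seconds_until_tip_changes(poll_interval=0.25):
    """Call this right after broadcasting the competing (reorg) block(s); it
    returns how long the node takes to switch its best block hash."""
    start_tip = rpc("getbestblockhash")
    start = time.monotonic()
    while rpc("getbestblockhash") == start_tip:
        time.sleep(poll_interval)
    return time.monotonic() - start
```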

5 Likes

The latest report with the 10 additional 256 MB blocks can be found here: P2P Results - Steady-State - Google Sheets

As mentioned above, this report was run using minrelaytxfee=0 in BCHN’s bitcoin.conf, and the 0p blocks were broadcast 10 minutes apart to avoid the timestamps causing invalid/rejected blocks (this also brings them more in line with the 90p tests, although they take longer to run now).

It’s worth calling out that Fulcrum failed to finish for this 0p result, so the averages are a little misleading. Also, what was formerly called “Steady State 1” is now named “Misc 1”, and the new blocks are labeled “Steady State”. I briefly compared this report against the P2P results from a couple of weeks ago, and they matched up consistently, which indicates relatively low deviation between tests.

4 Likes

This week’s test was intended to evaluate two things:

  1. Increasing the DB cache (from 100 MB to 256 MB) to see whether the 5-block lag would be eliminated
  2. Evaluating the updated version of Fulcrum (v1.7), which is intended to fix the downloading issues

From what we’ve seen here (P2P Results - DBCache - Fulcrum v1.7 - Google Sheets), it would appear that neither of these two tweaks resolved either issue.

There was one unintended quirk in the testing framework for the above results that caused the reorg block to be broadcast 10 minutes later than intended. That shouldn’t affect the processing time of the blocks, but it does create an unexpected timestamp in the data. The latest version of the BCH Test Emitter should resolve this quirk for future data collection.

I think it’s important that someone else attempts to replicate the 5-block lag to ensure the issue is not something specific to our running environment. I believe this has already been replicated on other machines, but confirming that here (on BCR) would be good documentation for the future.

This week we will be moving on to evaluating Trezor’s indexer: https://github.com/trezor/blockbook to ensure hardware wallets will be able to perform with 256MB blocks.

3 Likes

Thanks Josh, we’ll be doing that at BCHN.
I’ll be trying it with the same parameters but also with larger dbcache sizes.
Not sure about others, but I routinely allocate more memory to the DB cache on my servers - basically as much as I can spare because disk I/O is so expensive.
Default is 450MB btw.
My gut feeling is one wants that cache to equal or exceed multiple full blocks.
But whether that’s related to the observed lag effect is something we need to investigate.

2 Likes

Thanks, FT!

Setting the dbcache to be lower than the default was definitely not intended. I remember reading online that the default was 100MB, but that must have been for BTC and/or an old version and/or an outdated Stack Overflow post. Fortunately, all of the other tests were run with the actual default value, so all this new test did was show what happens when we go lower (which was the opposite of the actual intent: to go higher). I’ll just rerun the tests with a 1 GB dbcache and republish the results. Thanks for pointing this out.

2 Likes

We’ve been exploring testing Blockbook (the backend for Trezor wallets) for the past week+. We’ve hit plenty of snags along the way, but things seem to be on a roll now. We’re running the 90p tests today and plan to run the 0p tests on the weekend (and/or Monday). On Monday we’ll write the script to parse the logs from Blockbook and then should have a report available mid-week. The memory requirements for Blockbook + BCHN + Emitter are quite large, so we still may have some snags to resolve, but we’re optimistic. Additionally, this means that the hardware used to run these tests will be different from the hardware used for the results we’ve published earlier.

On another note, @matricz from BCHN has been replicating (and exploring) the results we’ve published before. It is my understanding that he’s confirmed that dbcache is the source of the 5-block BCHN lag, which is an awesome discovery. I’ll poke him privately to see if he can post his findings here so it’s not hearsay.

4 Likes

I have indeed reproduced and reliably worked around the slowdowns, which are entirely attributable to dbcache.

I did runs with the following setup:

  • Verde’s modified version of BCHN
  • No Fulcrum
  • May’s version of code (around the 15th, RPC block submission)
  • May’s version of data (v1)
  • 0p propagation

I measured the wall time with $ time scripts/run.sh for all runs. Runs with the default dbcache (which is 400M, pretty small for 256 MB blocks) took 24m18.130s and 23m23.271s.
A run with a sizeable -dbcache=16000 took 18m5.147s, roughly 3/4 of the default-dbcache runtime. The difference is still bigger than the sum of the db flushes, but it also includes additional db reads, which makes sense.

Unless there is a special need to test db performance, I advise running these tests with a dbcache setting large enough that its limit is never hit.

2 Likes

This week’s test was the first to evaluate Blockbook, the back-end for the Trezor wallet.

Reports for Blockbook are being compiled here: Blockbook - Google Drive

At this time we have only successfully run the 90p test, as we ran into an unexpected issue with Blockbook logging the 0p test results. The data for the 90p test is populated in the “P2P Result - Blockbook 90p” spreadsheet in the folder referenced above. We intend to run the 0p test soon, but following information from @mtrycz about the BIP34 (non-)activation potentially causing unintended delays in BCHN’s block processing, we have decided to test a fix for that issue in a second 90p run. After evaluating the second 90p run, we expect to run the 0p test with the BIP34 fix.

One additional update to the data processing: we’ve split out the “Steady State” block that now undergoes a reorg into a separate category from the rest of the Steady State blocks, to avoid it throwing off the average and to generally highlight its uniqueness.

The Blockbook results show, as one might expect, a pretty heavy preference for blocks that reduce the number of UTXOs. The ten “Fan-Out” blocks performed the worst, with an average of over 15 minutes to process each. Overall, Blockbook averaged 9.5 minutes per block across the non-trivial block types.

1 Like

This is perfect, thank you.

As suggested earlier, I think that generally the tests should run with a very big (or a very small) -dbcache to test the block verification performance (or database performance).

After investigation, we discovered that the Blockbook 0p issues were apparently due to it not broadcasting blocks (which is the log statement we have been using to determine that it has finished processing them) until it starts receiving transactions. To address this, we have started a “1p” run (1% of transactions broadcast before the block) to stand in for the 0p run in this case. We believe the difference in performance relative to a true 0p run should be minimal, since 99% of the transactions in the block must still be processed upon receipt of the block.

After analyzing this 1p data, we found that the BCHN numbers were essentially unchanged, as in prior tests. It is worth calling out that this test contained the fix for the BIP34 issue. Both Blockbook tests were run with the default Blockbook settings (rpcworkqueue=1100, maxmempool=2000, dbcache=1000). As for Blockbook performance, the 1p configuration produced semi-random results: the Misc 2 and Steady State blocks were processed much more quickly, while the Fan Out blocks were nearly 3 minutes slower on average. Comparing to the 90p data, it would seem that blocks randomly took much longer in the 90p test, while the 1p results were very consistent within each category. The standard deviation for the 1p Steady State blocks (including the reorg) was 16 seconds; for 90p it was 6:27. For Misc 2 blocks, all were processed in 0 seconds for 1p, while they averaged 7:00 for 90p (note that we generally remove zero-second blocks, which skewed the listed average in this case to 11:40).
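For clarity on how the zero-second blocks affect the listed numbers, here is a small sketch of the kind of post-processing involved. The values are placeholders, not the report data.

```python
from statistics import mean, pstdev

def summarize(times_seconds):
    """Report the average with and without zero-second blocks, plus std dev."""
    nonzero = [t for t in times_seconds if t > 0]
    return {
        "avg_all": mean(times_seconds),
        "avg_nonzero": mean(nonzero) if nonzero else 0.0,
        "stdev_all": pstdev(times_seconds),
    }

# Placeholder per-block times (seconds) for a single category, not real data.
misc2_90p = [0, 0, 700, 640, 760]
print(summarize(misc2_90p))   # dropping the zeros raises the listed average
```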

The report for this data can be found in the same Blockbook folder, here: Blockbook - Google Drive

One additional call-out is that we noticed periodic discrepancies in the block hashes listed by the emitter, but only for blocks 1-244. These blocks are always processed quickly, and as a result the logging is more sporadic, which our data extraction script does not account for. However, the number of blocks identified is still the same and the times for all of these blocks are the same, so there is no impact to the data that we actually care about.

Following Blockbook, we have selected the Bitcore node and wallet as our next applications of interest. We plan to set up a test similar to what we did for Blockbook by providing the Bitcore node with the intranet configuration used by the block emitter and running the three applications together.

3 Likes

As a follow-up from the last post, we ran one additional test with Blockbook to verify that the 0p scenario failed for the reasons we suspected. In particular, we wanted to ensure that sending any transaction would allow Blockbook to continue processing blocks successfully, even ones it had not yet seen any transactions from. If such blocks were not processed, it could mean increased susceptibility to certain types of attacks.

In order to test Blockbook’s behavior, we ran a custom modification of the 0p test in which only the first transaction of block 245 is broadcast before the block. No additional transactions are sent, so it is effectively the same as a 0p run, aside from one transaction being sent just as we reach the larger blocks.

The result of this test was, as expected/hoped, that Blockbook continued processing blocks normally. The inclusion of this single transaction broadcast appears to shift Blockbook into a different mode, allowing it to process the blocks as expected from that point on.

Given these positive results we are now considering this phase of Blockbook testing complete. As mentioned, we are now moving on to testing Bitcore applications.

1 Like

After working with Bitcore Node for some time, we were unable to successfully get it to sync the intranet blocks.

Documentation of the process and configuration used can be found here: Bitcore - Google Drive

We were able to get Bitcore Node to sync the first 244 blocks of the test case (i.e. both the mainnet blocks and the test-specific blocks), but it consistently failed to process block 245, the first large block (185 MB). Unfortunately, when this happened there were no immediately apparent errors. It seemed to silently stop syncing and then later print some recurring socket-related errors. Restarting Bitcore Node would cause it to identify that there were new blocks to sync, but it never gave any indication that they were processed (e.g. they were not added to the database). Ultimately, our conclusion is that either 1) there is a hidden limit that causes Bitcore Node to reject large blocks in a way we weren’t able to see, or 2) a technical failure occurred, such as being unable to allocate a sufficiently large array, that caused syncing to fail indefinitely.

Reviewing the configuration and source code led us to conclude that we were likely not encountering a block size limit. The only limit we could find in the code was a 1 MB “block size” limit, which appeared to be used only for checking transaction sizes. As a sanity check, we set up BCHN/Bitcore Node to sync testnet. Our test didn’t complete for unrelated reasons (the server storage was too small and filled up), but before that failure it was able to get to block 1,337,979. In doing so, it successfully processed many 32 MB blocks, all apparently fairly quickly (spot-checking suggested that all blocks were processed in less than one second). Our conclusion was that either there is a 32+ MB block size limit that we were unable to find, or no limit is enforced and some other failure led to the stalling.

In the future, we recommend further investigation into why Bitcore Node was unable to sync the larger blocks. Ideally this would be done by someone familiar with the Bitcore codebase who can evaluate the installation steps or configuration that we used for Bitcore Node. That may reveal improvements that would either fix the problem or reveal more information about what went wrong.

Additionally, whether as part of follow-up testing for Bitcore or generally for the sake of additional scaling testing, it may be beneficial to create an alternate test chain that ramps up the block size more slowly. In the current test chain, blocks jump from 217 bytes to 185 megabytes. For applications that are unable to process large blocks, it would be helpful to see which sizes they can process on a chain whose blocks start at 217 bytes and then jump to 16 MB, 32 MB, 48 MB, 64 MB, etc. In this case (Bitcore) it may have provided helpful information on where to start further investigation. For now, we are leaving this as a future task, to be revisited for triaging applications that fail on block 245 of the current test chain, should others be found to have the same behavior.
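As a rough illustration of what such a ramp might look like, here is a trivial sketch using the ~374-byte 2-in/2-out transaction size mentioned earlier in the thread. The sizes and step count are illustrative only, not a proposed specification.

```python
TX_SIZE_BYTES = 374          # approx. size of the 2-in/2-out test transactions
MB = 1_000_000

# Illustrative ramp: one tiny block, then 16 MB steps up to 256 MB.
target_sizes = [217] + [step * 16 * MB for step in range(1, 17)]

for size in target_sizes:
    print(f"{size / MB:8.3f} MB  ~{size // TX_SIZE_BYTES:>7,} transactions")
```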

Next, we will be testing mining pool software. Specifics are still to be determined, following additional evaluation of the available options.

3 Likes

The mining pool software we decided to go with next was ASICseer-pool.

Documentation on the process and configuration used can be found here: ASICseer Pool - Google Drive

Since mining operates a bit differently than the other tools we’ve evaluated so far, we were a bit unsure of what metrics would make sense to look at. Unsurprisingly, as new blocks were emitted, the pool and connected peer were updated less than a second after BCHN finished processing the block. It makes sense that there would be no variation based on block size here, since mining off of the new block can begin immediately, without any transactions. Given that, we turned our attention to getblocktemplate calls, since those would inherently be more dependent on the transaction data.

Interestingly, even the getblocktemplate call duration was not well correlated with block size, nor with the transaction types (e.g. number of outputs per transaction). The closest correlation we saw was between the number of outputs per transaction and the number of calls made to getblocktemplate (as opposed to the duration of those calls). This is surprising since fees are more directly tied to transaction size, not the number of outputs, so we would have expected the clearer correlation to be with block size.
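For anyone who wants to reproduce the call-duration measurement independently, here is a hypothetical sketch that times getblocktemplate via bitcoin-cli. It assumes a locally running BCHN node with bitcoin-cli on the PATH; this is not the instrumentation we used.

```python
import subprocess, time

def time_getblocktemplate(runs=5, cli="bitcoin-cli"):
    """Time repeated getblocktemplate calls (add -datadir/-conf flags to the
    command if the node uses a non-default configuration)."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run([cli, "getblocktemplate"], check=True,
                       stdout=subprocess.DEVNULL)
        durations.append(time.perf_counter() - start)
    return durations

print(time_getblocktemplate())
```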

We are also unsure how to interpret the rate of calls to getblocktemplate, particularly with respect to the large gap we see in the data until around the 6-minute mark following a new block, and the lack of getblocktemplate calls for small blocks. Using the dummy stratum peer, we were able to see that mining.notify stratum calls were being emitted very cyclically, roughly every 30 seconds, varying slightly when new blocks were received. Broadly, though, this seems to happen independently of the getblocktemplate calls. We recommend investigating what triggers getblocktemplate calls and how this pattern affects fee collection for miners using this pool.
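For reference, a minimal sketch of the kind of dummy stratum peer used here: it subscribes to the pool and prints the interval between successive mining.notify messages. The host, port, and worker credentials are placeholders, and this is a simplified stand-in rather than our actual test peer.

```python
import json, socket, time

POOL_HOST, POOL_PORT = "127.0.0.1", 3333   # assumed pool stratum endpoint

def watch_notify():
    """Subscribe over Stratum and report gaps between mining.notify messages."""
    sock = socket.create_connection((POOL_HOST, POOL_PORT))
    f = sock.makefile("rw")
    f.write(json.dumps({"id": 1, "method": "mining.subscribe",
                        "params": ["notify-watcher"]}) + "\n")
    f.write(json.dumps({"id": 2, "method": "mining.authorize",
                        "params": ["worker", "x"]}) + "\n")   # placeholder worker
    f.flush()
    last = None
    for line in f:
        msg = json.loads(line)
        if msg.get("method") == "mining.notify":
            now = time.monotonic()
            if last is not None:
                print(f"mining.notify after {now - last:.1f}s "
                      f"(clean_jobs={msg['params'][-1]})")
            last = now

watch_notify()
```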

One additional result is that after charting the number of notify calls per block, we noticed an idiosyncrasy in the block processing data: block 262 is emitted right after block 261, causing only one notify call to be made. This block emission pattern exists in all prior tests, however, so it is a “feature” of the dataset, not an indication of unusual behavior on the part of the pool.

4 Likes

Upon evaluation of the work that has been completed so far, it has been decided that this is a good place to put additional application testing on hold. A summary of the research completed so far can be found here:

6 Likes