Assessing the scaling performance of several categories of BCH network software

Thank you for working on this assessment. One note, though, on SLPDB and BitDB: I believe those have proven to be abandonware or unmaintained, with really poor performance and very serious issues. I would suggest replacing them with:

There is a comment in Fulcrum’s docs about passing a ZMQ endpoint config parameter to the (BCHN) node to speed things up, but it seems optional and I have not yet tried to measure the difference.

Thank you very much Josh & Verde team.

I’ve got the basic test built and running based only on the data you supplied (plus a download of the latest Fulcrum binary release).

Since I ran on Debian 11, I hit a couple of minor issues. Having resolved them, I’ll put some notes here in case others face similar bumps:

  1. The stock Debian gradle seems too old, so definitely download a recent Gradle package from Gradle’s site; otherwise it will fail to parse the build.gradle and error on archiveVersion and archiveBaseName (possibly on others too, but I only got that far before deciding the cause must be an inadequate Gradle on my box).

  2. The run script tries to call gradlew, so one needs to run gradle wrapper in the bch-scaling base folder in order to generate that wrapper there.

Further, some points are still unclear to me, but let me note how I proceeded:

  1. Fulcrum’s docs say it requires indexing to be enabled on the node side, I think? (I still need to verify whether it works without it, but I added txindex=1 to the bitcoind.conf config.)
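For reference, this is the sort of bitcoind.conf sketch I mean; the ZMQ endpoint and port here are just conventional example values, not ones taken from the framework’s documentation:

```ini
# Sketch of the relevant bitcoind.conf options -- the ZMQ endpoint/port
# are example values, not taken from the framework's documentation.
txindex=1                               # transaction index; Fulcrum wants this
zmqpubhashblock=tcp://127.0.0.1:28332   # optional ZMQ new-block notifications
```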

@freetrader : Hey, thanks for trying to get it running for yourself! Debian is my goto OS, so I’m pretty familiar with the problems you can encounter, which is good. The gradle wrapper was committed to the repo, so you should have been able to run ./gradlew makeJar copyDependencies and have it “just work” (the wrapper should solve all of the problems you encountered with Debian since it will download the version of gradle it needs).

Ideally, the build intent was to run the ./scripts/make.sh script (from the project directory), since it’ll take care of structuring the out directory for you. Did either of these steps not work? I ran this just now on a Debian 11 VM with openjdk 11.0.15 and it worked without anything special, so hopefully that’s the same for you.

This week we implemented a change to the test block emitter to enable the transfer of blocks and transactions via the P2P protocol. We’ve also re-run the tests via the P2P protocol (instead of via the RPC protocol). We have the raw data uploaded but haven’t finished compiling the results; I expect we’ll have this done before the end of the day on Monday. These tests were run with the node configured with txindex=1 and ZMQ enabled; it will be interesting to see whether this has any significant performance effect on the node and Fulcrum.

Additionally, we’ve started adding new blocks to the test framework to model cash transactions: 2 inputs -> 2 outputs. These blocks are appended to the current framework and should be made available this coming week.

The one benefit (from a testing perspective) of using RPC was that it was easier to measure when BCHN finished (since the RPC call hung until the block was accepted). We can still measure how long BCHN took, we just have to do it slightly differently, which is not a problem but is something that took us longer than we expected.

A preliminary look at the results is …interesting. It looks like it takes twice as much time for BCHN to process a block compared to last time. I suspect this has little to do with P2P and more to do with the index being enabled. I’m going to run the tests again tonight with P2P and the index disabled so we can better compare apples-to-apples.

I tried to run the suite to reproduce the results, and these are my personal findings:

  1. Java is my first language, so I have taken a (medium-depth) look at the code. It is solid and clean, even if a little verbose, and easily extensible after an hour of getting comfortable. Still, Python would probably be more palatable to more people.
  2. The reproduction instructions are good as far as they go, but incomplete. I’m happy to build my own software, but there was guesswork involved in following the instructions.
  3. I could not reproduce the Fulcrum processing results, as I don’t know where to get the logs; I can confirm there is a noticeable slowdown after the fan-out phase.
  4. The results need to be composed by hand. Four different .csv files are generated and must be assembled by hand to get the results, which is lengthy and error-prone. A Python script would do wonders here.
  5. Bug: the bchn_start_csv.sh script gives me the start times for blocks 245-260, instead of 245-266
  6. Biggest concern: the test sample size is too small. The variance is all over the place with so few samples. A bigger sample size would easily offset the db flushes, or the flushes could be removed from the timing.
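Point 4 above could be addressed with a small script. A sketch in Python, assuming the four .csv files share a common block-height key column; the filenames and column names below are hypothetical, not the framework’s actual ones:

```python
# Hypothetical merge script for the per-component result CSVs.
# Assumes each file has a shared "block_height" column; the column and
# file names are illustrative, not the framework's actual ones.
import csv
from collections import defaultdict


def merge_csvs(paths, key="block_height"):
    """Join rows from several CSVs on a shared key column."""
    merged = defaultdict(dict)
    columns = [key]
    for path in paths:
        with open(path, newline="") as f:
            reader = csv.DictReader(f)
            for col in reader.fieldnames:
                if col not in columns:      # preserve column order, no dupes
                    columns.append(col)
            for row in reader:
                merged[row[key]].update(row)
    rows = [merged[k] for k in sorted(merged, key=int)]
    return columns, rows


def write_merged(paths, out_path, key="block_height"):
    """Write the joined rows to a single CSV, blank-filling missing cells."""
    columns, rows = merge_csvs(paths, key)
    with open(out_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=columns, restval="")
        writer.writeheader()
        writer.writerows(rows)
```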

We’ve compiled the reruns of the existing test framework to explore the hypothesis that the RPC and P2P code paths differ (and, more specifically, that P2P would be faster). In short, the results look to be within run-to-run variance; in particular, the large spikes (likely caused by DB flushing) inflate the averages. The formatted results for each finding are below:

RPC Results: RPC Results - Google Sheets

P2P Results (No Tx Indexing): P2P Results - No TxIndex - Google Sheets

P2P Results (Tx Indexing):


Let me know if you want help with determining sample size and constructing confidence intervals.

Can you please? Thanks in advance!


Thanks Josh

you should have been able to run ./gradlew makeJar copyDependencies and have it “just work”

Not familiar with gradle so I didn’t know that, but eventually I got it to work. I wonder why running the script failed for me initially (I remember it couldn’t find gradlew which is why I figured I had to create one myself).

I calculated some confidence intervals and did some statistical power analysis using the data on the Google Sheet and an R script I wrote here.

My conclusion: each phase that you want to measure should be run for 100 blocks or more. I know that means running each phase for more than half a day, but if you want reliable results then you have to increase the sample size substantially above what you have now.

I estimated confidence intervals using a nonparametric percentile bootstrap. Bootstrapped confidence intervals work well in cases of low sample size and data that is not normally distributed, like in our case here. I chose to display the 90% confidence interval since that seemed appropriate for our purposes.
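The R script itself isn’t reproduced here, but a percentile bootstrap of the mean is short enough to sketch in Python; this is my own illustration of the method described, not the actual analysis code:

```python
# Illustration of a nonparametric percentile bootstrap CI for the mean --
# a sketch of the method, not the actual R analysis script.
import random


def bootstrap_ci_mean(samples, level=0.90, n_boot=10_000, seed=42):
    """Resample with replacement, take the mean of each resample, and read
    the confidence interval off the empirical percentiles."""
    rng = random.Random(seed)
    n = len(samples)
    means = sorted(
        sum(rng.choice(samples) for _ in range(n)) / n
        for _ in range(n_boot)
    )
    alpha = (1 - level) / 2
    return means[int(alpha * n_boot)], means[int((1 - alpha) * n_boot) - 1]
```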

The units of the confidence intervals are seconds to process each block. The “Transactions per second” unit cannot be used directly since there is no measurement of how long each transaction verification takes and therefore there is no way to calculate the variability. I only had enough data to measure the fan-out and steady state 1 phases. Fan-in had only two observations, which is too few. Steady state 2 was missing data in the sheet. The Fulcrum data sheet has its units as “msec”, but from the discussion above it seems that it is actually just seconds.

Here are the confidence intervals:

| Processing Type | Block Type | Lower 90% C.I. | Upper 90% C.I. |
| --- | --- | --- | --- |
| bchn.0p | fan-out | 31 | 106 |
| bchn.0p | steady state 1 | 42 | 62 |
| bchn.90p | fan-out | 28 | 135 |
| bchn.90p | steady state 1 | 12 | 14 |
| fulcrum.0p | fan-out | 1785 | 2169 |
| fulcrum.0p | steady state 1 | NA | NA |
| fulcrum.90p | fan-out | 1579 | 1805 |
| fulcrum.90p | steady state 1 | 574 | 698 |

The largest confidence intervals are for the fan-out phases for BCHN (both 90p and 0p). They are very large and therefore need to be shrunk by increasing the sample size.

Through statistical power analysis we can get a sense of how many observations are needed to shrink the confidence intervals to a certain size. To standardize and make the numbers comparable across different block processing procedures, we can express the width of these confidence intervals as a percentage of the mean of the quantity being measured.

Below is the estimated sample size to achieve a target width of confidence interval. I chose 10%, 25%, and 50% of the mean for comparison:

| Processing Type | Block Type | N for C.I. width < 10% of mean | < 25% | < 50% |
| --- | --- | --- | --- | --- |
| bchn.0p | fan-out | 1447 | 234 | 60 |
| bchn.0p | steady state 1 | 93 | 17 | 6 |
| bchn.90p | fan-out | 2036 | 328 | 84 |
| bchn.90p | steady state 1 | 18 | 5 | 3 |
| fulcrum.0p | fan-out | 45 | 9 | 4 |
| fulcrum.0p | steady state 1 | NA | NA | NA |
| fulcrum.90p | fan-out | 22 | 6 | 3 |
| fulcrum.90p | steady state 1 | 25 | 6 | 3 |

The results show that we ought to be able to shrink the confidence interval to less than 50% of the mean for all block processing procedures if we use 100 blocks for each phase.
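As a sanity check on these numbers, CI width shrinks roughly as 1/√n, so the sample size needed for a target width can be ballparked from a pilot run. A rule-of-thumb sketch (my own illustration, not the power analysis itself):

```python
# Rule of thumb: CI width scales roughly as 1/sqrt(n), so quadrupling the
# sample size halves the width. This is a ballpark, not the power analysis.
import math


def required_n(n_pilot, width_pilot, width_target):
    """Estimate the samples needed to shrink a CI from width_pilot
    (observed with n_pilot samples) down to width_target."""
    return math.ceil(n_pilot * (width_pilot / width_target) ** 2)
```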

Let me know if I have misunderstood anything about the data.


Wow, that’s far more in-depth than I expected.

Thank you.

I wouldn’t advise jumping to 100 blocks just yet at this stage of testing, but if in the future several tests produce results that are close together, it could be worth it then.


The latest report is available here: P2P Results - Steady-State - Flawed 90p / 0p - Google Sheets

The current test case now has 10 additional 256 MB blocks. Each new block contains transactions that are 2-inputs and 2-outputs. We encountered some difficulty getting these test blocks created; most of the problems involved managing the UTXO set for the test emitter so that the blocks created did not consist entirely of massively chained transactions within the same block (since that style of block likely takes a different code path within the node and does not necessarily represent a typical real-life situation).

There are currently two problems with the new data set:

  1. The test blockchain is now long enough that the block timestamps can get too far ahead unless the node’s clock is set later than it was for the original tests. Fixing this isn’t quite as simple as moving the system clock forward, since services (like Fulcrum) will not exit the “initial block download” state if the clock is too far forward, and if the clock is too far behind then the blocks are considered invalid. Fortunately, this only affects the 0p tests (since those blocks are broadcast in rapid succession, whereas the 90p tests are broadcast on a rough 10-minute interval). We will have to restructure the emitter so that the 0p tests are broadcast on a 10-minute interval as well, which will unfortunately make their execution take longer.
  2. The fees used to generate the 2-in, 2-out style transactions (~160 per ~374 bytes) were below the default relay threshold. This was accidental. However, instead of re-mining the blocks (which takes about a day), we decided that configuring the node with minrelaytxfee set to 1 is more pragmatic (at least for now).
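For context on point 1, the two timestamp rules in play are, to my understanding, the standard ones: a block is rejected if its timestamp is more than two hours ahead of the node’s clock, and invalid if it is not later than the median of the previous 11 block times. A sketch of that check:

```python
# Sketch of the two standard consensus timestamp rules a new block must
# satisfy (my understanding of the rules, not code from the framework).
import statistics

MAX_FUTURE_BLOCK_TIME = 2 * 60 * 60  # two hours, the standard future-drift limit


def timestamp_acceptable(block_time, node_time, prev_11_times):
    """A block's timestamp must not be too far in the future, and must be
    strictly later than the median of the previous 11 block timestamps."""
    not_too_far_ahead = block_time <= node_time + MAX_FUTURE_BLOCK_TIME
    after_median_past = block_time > statistics.median(prev_11_times)
    return not_too_far_ahead and after_median_past
```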

Unfortunately, we did not notice the 2nd error until we were compiling the test results today. This error resulted in the test framework broadcasting 90 percent of the transactions up until block 266, then broadcasting 0 percent (due to the min relay fee) for blocks 267-277. We plan on rerunning the tests with the new configuration and posting new results later this week.

Additionally, we mined a 256 MB reorg block and ran it (manually) against the BCHN node. The results of this one-off test showed that it took the node over a minute and a half to finish a reorg of a single 256MB block. If we can reproduce this result, then it could indicate a problem for the network if a 256 MB reorg were to happen with the current level of node performance. Since this could be indicative of a potentially large problem for the current state of long-term scaling, we decided our next step would be to put more time into supporting this test properly within the framework. The intent will be to run the test scenario as 90p with P2P broadcasting enabled, and intentionally reorg as a part of the standard test. Once properly integrated, we can observe how Fulcrum (and other endpoints) respond to this situation.


The latest report with the 10 additional 256 MB blocks can be found here: P2P Results - Steady-State - Google Sheets

As mentioned above, this report was run using minrelaytxfee=0 within BCHN’s bitcoin.conf, and the 0p blocks were broadcast 10-minutes apart to avoid the timestamps causing invalid/rejected blocks (this also brings them more in-line with the 90p tests, although they take longer to run now).

It’s worth calling out that Fulcrum failed to finish for this 0p result, so the averages are a little misleading. Also, what was formerly called “Steady State 1” is now named “Misc 1”, and the new blocks are now labeled “Steady State”. I briefly compared the results between this report and the P2P Results from a couple of weeks ago, and they seemed to match up consistently, which indicates relatively low deviation between tests.


This week’s test intended to evaluate two things:

  1. increasing the DB cache (from 100MB to 256MB) to see if the 5-block lag would be eliminated
  2. evaluating the updated version of Fulcrum (v1.7), intended to fix the downloading issues

From what we’ve seen here: P2P Results - DBCache - Fulcrum v1.7 - Google Sheets, it would appear that neither of these two tweaks resolved either issue.

There was one unintended quirk in the testing framework for the above results that caused the reorg block to be broadcast 10 minutes later than intended. That shouldn’t affect the processing time of the blocks, but it does create an unexpected timestamp in the data. The latest version of the BCH Test Emitter should resolve this quirk for future data collection.

I think it’s important that someone else attempts to replicate the 5-block lag to ensure the issue is not something specific to our running environment. I believe this has already been replicated on other machines, but confirming that here (on BCR) would be good documentation for the future.

This week we will be moving on to evaluating Trezor’s indexer: https://github.com/trezor/blockbook to ensure hardware wallets will be able to perform with 256MB blocks.


Thanks Josh, we’ll be doing that at BCHN.
I’ll be trying it with same parameters but also with larger dbcache sizes.
Not sure about others, but I routinely allocate more memory to the DB cache on my servers - basically as much as I can spare because disk I/O is so expensive.
Default is 450MB btw.
My gut feeling is one wants that cache to equal or exceed multiple full blocks.
But whether that’s related to the observed lag effect is something we need to investigate.


Thanks, FT!

Setting the dbcache lower than the default was definitely not intended. I remember reading online that the default was 100MB, but that must have been for BTC and/or an old version and/or an outdated Stack Overflow post. Fortunately, all of the other tests were run with the actual default value, so all this new test did was show what happens when we go lower. (Which was the opposite of the actual intent: to go higher.) I’ll just rerun the tests with a gig of dbcache and republish the results. Thanks for pointing this out.


We’ve been exploring testing Blockbook (the backend for Trezor wallets) for the past week+. We’ve hit plenty of snags along the way, but things seem to be on a roll now. We’re running the 90p tests today and plan to run the 0p tests over the weekend (and/or Monday). On Monday we’ll write the script to parse the logs from Blockbook, and then we should have a report available mid-week. The memory requirements for Blockbook + BCHN + Emitter are quite large, so we may still have some snags to resolve, but we’re optimistic. Additionally, this means that the hardware used to run these tests will be different from that used for the results we’ve published earlier.

On another note, @matricz from BCHN has been replicating (and exploring) the results we’ve published before. It is my understanding that he’s confirmed dbcache as the source of the 5-block BCHN lag, which is an awesome discovery. I’ll poke him privately to see if he can post his findings here so it’s not hearsay.


I have indeed reproduced, and reliably worked around, the slowdowns, which are entirely attributable to dbcache.

I did runs with the following setup:

  • Verde’s modified version of BCHN
  • No Fulcrum
  • May’s version of code (around the 15th, RPC block submission)
  • May’s version of data (v1)
  • 0p propagation

I measured the wall time with $ time scripts/run.sh for all runs. Runs with the default dbcache (which is 400M, pretty small for 256MB blocks) took 24m18.130s and 23m23.271s.
A run with a sizeable -dbcache=16000 yielded 18m5.147s, which is ~3/4 of the default-run time. The difference is still bigger than the sum of the db flushes, but it also includes additional db reads, which makes sense.
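For anyone sanity-checking the ~3/4 figure, here is a throwaway conversion of the `time` outputs above:

```python
# Throwaway helper to convert `time`-style durations ("24m18.130s")
# into seconds and compare the runs reported above.
def wall_seconds(t):
    """Parse a duration like '24m18.130s' into seconds."""
    minutes, seconds = t.rstrip("s").split("m")
    return int(minutes) * 60 + float(seconds)


default_runs = [wall_seconds("24m18.130s"), wall_seconds("23m23.271s")]
big_cache = wall_seconds("18m5.147s")
ratio = big_cache / (sum(default_runs) / len(default_runs))  # roughly 0.76
```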

Unless there is a specific need to test db performance, I advise running these tests with a dbcache setting large enough that its limit is never hit.
