Assessing the scaling performance of several categories of BCH network software

We’ve compiled our first report! We’ve learned a lot during this process and we hope to have better (and faster) reports coming in the future. Please review it and give us your thoughts on our findings, and let us know if you see any flaws in our methods that we can improve upon next time.

Our raw data may be found here: BCHN Research - Google Drive

4 Likes

Thanks a lot for the numbers!

While BCHN performance seemed within expectations, Fulcrum looks like it’s struggling with steady-state and fan-ins. I wonder if that has anything to do with how Fulcrum handles its DB…

1 Like

Also: in “Steady-state 2”, where blocks are only hundreds of kB, Fulcrum was taking excessive time to process them as well, which seems particularly odd, as we know Fulcrum can handle blocks that size in the wild. The numbers are perhaps worth double-checking?

1 Like

Reading the report I (happily) assume the numbers above are inverted. It should be 5 MB per second, right?

Was Fulcrum/BCHN communicating via ZeroMQ?

I’m not sure how Fulcrum communicates with BCHN, to be honest. I think it polls for new data via RPC, although I’m not sure. This is definitely a question for @cculianu .

Calin is looking into it; in fact, we gave him instructions for how to replicate the build/results (the project is located in the Google Drive directory). His theory is that there is a bug in Fulcrum related to message sizes getting too large. That seems like it makes sense to me, although it’s odd that we didn’t see the same behavior when running the 90p tests.

My original theory was that BCHN was holding the global lock and starving Fulcrum of RPC responses, but that too is a flawed theory, because even after BCHN finished processing its blocks it still took hours for Fulcrum to complete.

The only thing I know for sure is that we were consistently reproducing the behavior with the 0p tests. We ran it at least 4 times.

1 Like

I’m pretty sure there is something dumb happening in Fulcrum’s HTTP-RPC client causing a slowdown/bottleneck here. I saw something similar happen with ScaleNet in some cases. I will investigate and fix it. I don’t think this problem is fundamental to Fulcrum’s design. (But even if it were, any bottlenecks can be addressed and fixed.)

@Jonas To answer your question, Fulcrum doesn’t use ZMQ for anything more than brief notifications (such as to wake up and download a new block when it is available). It uses bitcoind’s HTTP-RPC server to download blocks, etc.
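For readers unfamiliar with that flow, here is a minimal sketch (in Python, not Fulcrum’s actual code) of the notify-then-fetch pattern described above: ZMQ is only the wake-up signal, and the block data itself is downloaded over the node’s HTTP-RPC. The endpoint, port, and credentials are placeholder values, and they assume the node was started with zmqpubhashblock and its RPC server enabled.

```python
# Minimal sketch of the "ZMQ notify, then fetch over HTTP-RPC" pattern.
# All endpoints and credentials below are example values, not project config.
import json
import requests
import zmq

RPC_URL = "http://127.0.0.1:8332/"      # node HTTP-RPC endpoint (example)
RPC_AUTH = ("rpcuser", "rpcpass")        # example credentials
ZMQ_ENDPOINT = "tcp://127.0.0.1:28332"   # matches zmqpubhashblock in the node config

def rpc(method, *params):
    """Call a node JSON-RPC method over HTTP."""
    payload = {"jsonrpc": "1.0", "id": "demo", "method": method, "params": list(params)}
    response = requests.post(RPC_URL, auth=RPC_AUTH, data=json.dumps(payload))
    response.raise_for_status()
    return response.json()["result"]

# Subscribe to block-hash notifications; this is only the "new block available" signal.
ctx = zmq.Context()
sub = ctx.socket(zmq.SUB)
sub.connect(ZMQ_ENDPOINT)
sub.setsockopt_string(zmq.SUBSCRIBE, "hashblock")

while True:
    sub.recv_multipart()                 # wake up on a new-block notification
    tip_hash = rpc("getbestblockhash")   # then download the data over HTTP-RPC
    block = rpc("getblock", tip_hash, 1)
    print(f"new block {tip_hash} with {len(block['tx'])} transactions")
```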

2 Likes

Thank you for working on this assessment. One note, though, on SLPDB and BitDB: I think those have proven to be abandonware or unmaintained, with really bad performance and very serious issues. I would suggest replacing them with:

There is a comment in Fulcrum’s docs about passing a ZMQ endpoint config parameter to the (BCHN) node to speed things up, but it seems optional and I have not yet tried to measure the difference.

Thank you very much Josh & Verde team.

I’ve got the basic test built and running based only on the data you supplied (plus a download of the latest Fulcrum binary release).

Since I ran on Debian 11, I ran into a couple of minor issues. As I resolve them, I’ll put some notes here in case others face similar bumps:

  1. Stock Debian Gradle seems too old, so definitely download a recent Gradle package from Gradle’s site; otherwise it will fail to parse the build.gradle and error on archiveVersion and archiveBaseName (possibly others too, but I only got that far before deciding it must be due to an inadequate Gradle on my box).

  2. The run script tries to call gradlew, so one needs to run gradle wrapper in the bch-scaling base folder in order to generate that wrapper there.

Further, there are some points that are still unclear to me, but let me note how I proceeded:

  1. Fulcrum’s docs say it requires indexing to be enabled on the node side, I think? (I still need to verify whether it works without it, but I added txindex=1 to the bitcoind.conf config.)

@freetrader : Hey, thanks for trying to get it running yourself! Debian is my go-to OS, so I’m pretty familiar with the problems you can encounter, which is good. The gradle wrapper was committed to the repo, so you should have been able to run ./gradlew makeJar copyDependencies and have it “just work” (the wrapper should solve all of the problems you encountered with Debian, since it will download the version of Gradle it needs).

Ideally, though, the intended build process is to run the ./scripts/make.sh script (from the project directory), since it’ll take care of structuring the out directory for you. Do either of these steps not work? I ran this just now on a Debian 11 VM with OpenJDK 11.0.15 and it worked without anything special, so hopefully that’s the same for you.

This week we implemented a change to the test block emitter to enable the transfer of blocks and transactions via the P2P protocol. We’ve also re-run the tests via the P2P protocol (instead of via the RPC protocol). We have the raw data uploaded but haven’t finished compiling the results; I expect we’ll have this done before the end of the day on Monday. These tests were run with the node configured with txindex=1 and ZMQ enabled; it will be interesting to see if this has any significant performance effect on the node and Fulcrum.

Additionally, we’ve started adding new blocks to the test framework to model cash transactions (2 inputs -> 2 outputs). These blocks are appended to the current framework and should be made available this coming week.

The one benefit (from a testing perspective) of using RPC was that it was easier to measure when BCHN finished, since the RPC call hung until the block was accepted. We can still measure how long BCHN took; we just have to do it slightly differently, which is not a problem but is something that took us longer than we expected.
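For illustration only (this is not necessarily how the Verde harness measures it), one way to time block acceptance without a blocking RPC call is to poll the node’s best-block hash and record how long it takes to change; the endpoint and credentials below are placeholders.

```python
# Time block acceptance by polling the node's chain tip over JSON-RPC.
# RPC_URL and RPC_AUTH are example values, not the project's configuration.
import json
import time
import requests

RPC_URL = "http://127.0.0.1:8332/"
RPC_AUTH = ("rpcuser", "rpcpass")

def best_block_hash():
    payload = {"jsonrpc": "1.0", "id": "timer", "method": "getbestblockhash", "params": []}
    return requests.post(RPC_URL, auth=RPC_AUTH, data=json.dumps(payload)).json()["result"]

def time_until_tip(expected_hash, poll_interval=0.05):
    """Seconds from now until the node reports expected_hash as its chain tip."""
    start = time.monotonic()
    while best_block_hash() != expected_hash:
        time.sleep(poll_interval)
    return time.monotonic() - start
```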

A preliminary look at the results is… interesting. It looks like it takes twice as much time for BCHN to process a block compared to last time. I suspect this has little to do with P2P and more to do with txindex being enabled. I’m going to run the tests again tonight with P2P and txindex disabled so we can better compare apples to apples.

I tried to run the suite to reproduce the results, and these are my personal findings:

  1. Java is my first language, so I have taken a (medium-depth) look at the code. It is solid and clean, even if a little verbose, and easily extensible after an hour of getting comfortable. Still, Python would probably be more palatable to more people.
  2. The reproduction instructions are good as far as they go, but incomplete. I’m happy building my own software, but there was guesswork involved in following the instructions.
  3. I could not reproduce the Fulcrum processing results and don’t know where to get the logs; I can confirm there is a noticeable slowdown after the fan-out phase.
  4. The results need to be composed by hand. Four different .csv files are generated and need to be assembled manually to get the results, which is lengthy and error-prone. A Python script would do wonders here (see the sketch after this list).
  5. Bug: the bchn_start_csv.sh script gives me the start times for blocks 245-260, instead of 245-266
  6. Biggest concern: the test sample size is too small. The variance is all over the place with so few samples. A bigger sample size would easily offset the DB flushes, or the flushes could be removed from the timing.
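To make point 4 concrete, here is a rough sketch of the kind of merge script that could help. The file names and column names are hypothetical, since the actual CSV layout isn’t shown here, and would need to be adjusted to whatever the harness emits.

```python
# Join the per-run CSVs into one table keyed on block height.
# File and column names below are hypothetical placeholders.
import functools
import pandas as pd

CSV_FILES = [
    "bchn_block_times.csv",
    "fulcrum_block_times.csv",
    "block_sizes.csv",
    "block_tx_counts.csv",
]

frames = [pd.read_csv(path) for path in CSV_FILES]

# Outer-join on the shared block-height column so missing rows are easy to spot.
combined = functools.reduce(
    lambda left, right: pd.merge(left, right, on="block_height", how="outer"),
    frames,
)
combined.sort_values("block_height").to_csv("combined_results.csv", index=False)
```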
2 Likes

We’ve compiled the reruns of the existing test framework to explore the hypothesis that the RPC and P2P code paths behave differently (and, more specifically, that P2P would be better). In short, it looks like the results are within run-to-run variance (in particular, the large spikes, likely caused by DB flushing, inflate the averages). The formatted results for each finding are below:

RPC Results: RPC Results - Google Sheets

P2P Results (No Tx Indexing): P2P Results - No TxIndex - Google Sheets

P2P Results (Tx Indexing):

1 Like

Let me know if you want help with determining sample size and constructing confidence intervals.

Can you please? Thanks in advance!

1 Like

Thanks Josh

you should have been able to run ./gradlew makeJar copyDependencies and have it “just work”

I’m not familiar with Gradle, so I didn’t know that, but eventually I got it to work. I wonder why running the script failed for me initially (I remember it couldn’t find gradlew, which is why I figured I had to create one myself).

I calculated some confidence intervals and did some statistical power analysis using the data on the Google Sheet and an R script I wrote here.

My conclusion: each phase that you want to measure should be run for 100 blocks or more. I know that means running each phase for more than half a day, but if you want reliable results then you have to increase the sample size substantially above what you have now.

I estimated confidence intervals using a nonparametric percentile bootstrap. Bootstrapped confidence intervals work well in cases of low sample size and data that is not normally distributed, like in our case here. I chose to display the 90% confidence interval since that seemed appropriate for our purposes.
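For anyone who wants to reproduce this without the R script, here is a minimal Python sketch of the same percentile-bootstrap idea; the numbers in the example call are made up and not taken from the report.

```python
# Nonparametric percentile-bootstrap confidence interval for the mean
# block-processing time of one phase. `times` is a list of seconds per block.
import numpy as np

def bootstrap_ci(times, level=0.90, n_resamples=10_000, seed=0):
    """Percentile-bootstrap CI for the mean of `times`."""
    rng = np.random.default_rng(seed)
    times = np.asarray(times, dtype=float)
    means = np.array([
        rng.choice(times, size=len(times), replace=True).mean()
        for _ in range(n_resamples)
    ])
    alpha = (1.0 - level) / 2.0
    return np.quantile(means, [alpha, 1.0 - alpha])

# Example with made-up numbers, not data from the report:
print(bootstrap_ci([42.0, 55.0, 61.0, 48.0, 39.0, 70.0]))
```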

The units of the confidence intervals are seconds to process each block. The “Transactions per second” unit cannot be used directly since there is no measurement of how long each transaction verification takes and therefore there is no way to calculate the variability. I only had enough data to measure the fan-out and steady state 1 phases. Fan-in had only two observations, which is too few. Steady state 2 was missing data in the sheet. The Fulcrum data sheet has its units as “msec”, but from the discussion above it seems that it is actually just seconds.

Here are the confidence intervals:

Processing type   Block type        Lower 90% C.I. (s/block)   Upper 90% C.I. (s/block)
bchn.0p           fan-out             31                         106
bchn.0p           steady state 1      42                          62
bchn.90p          fan-out             28                         135
bchn.90p          steady state 1      12                          14
fulcrum.0p        fan-out           1785                        2169
fulcrum.0p        steady state 1      NA                          NA
fulcrum.90p       fan-out           1579                        1805
fulcrum.90p       steady state 1     574                         698

The largest confidence intervals are for the fan-out phases for BCHN (both 90p and 0p). They are very large and therefore need to be shrunk by increasing the sample size.

Through statistical power analysis we can get a sense of how many observations are needed to shrink the confidence intervals to a certain size. To standardize and make the numbers comparable across different block processing procedures, we can express the width of these confidence intervals as a percentage of the mean of the quantity being measured.

Below is the estimated sample size to achieve a target width of confidence interval. I chose 10%, 25%, and 50% of the mean for comparison:

Processing type   Block type        N for C.I. width < 10% of mean   < 25% of mean   < 50% of mean
bchn.0p           fan-out           1447                              234              60
bchn.0p           steady state 1      93                               17               6
bchn.90p          fan-out           2036                              328              84
bchn.90p          steady state 1      18                                5               3
fulcrum.0p        fan-out             45                                9               4
fulcrum.0p        steady state 1      NA                               NA              NA
fulcrum.90p       fan-out             22                                6               3
fulcrum.90p       steady state 1      25                                6               3

The results show that we ought to be able to shrink the confidence interval to less than 50% of the mean for all block processing procedures if we use 100 blocks for each phase.
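As a rough illustration of how such a sample-size target can be estimated (the R script may well do this differently, e.g. via bootstrap simulation rather than a normal approximation), here is a simple normal-approximation sketch with made-up example numbers.

```python
# Rough normal-approximation estimate of how many blocks are needed so the
# confidence-interval width stays under a target fraction of the mean.
# Illustrative only; not the analysis actually used above.
import math

def blocks_needed(mean, std_dev, target_fraction, level=0.90):
    """Smallest n with CI width (2*z*sd/sqrt(n)) below target_fraction * mean."""
    z = 1.645 if level == 0.90 else 1.96   # normal quantiles for 90% / 95%
    target_width = target_fraction * mean
    return math.ceil((2.0 * z * std_dev / target_width) ** 2)

# Example with made-up numbers (mean 60 s/block, sd 25 s), targeting a CI
# narrower than 25% of the mean:
print(blocks_needed(mean=60.0, std_dev=25.0, target_fraction=0.25))
```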

Let me know if I have misunderstood anything about the data.

1 Like

Wow, that’s far more in-depth than I expected.

Thank you.

I wouldn’t advise jumping to 100 blocks just yet at this stage of testing, but if in the future there are several tests with results that are close, it could be warranted then.

1 Like