Thank you for working on this assessment. One note, though, on slpDB and bitDB: I think those have proven to be abandonware/unmaintained, with really bad performance and very serious issues. I would suggest replacing them with:
- BCHD
- PSF slp indexer
There is a comment in Fulcrum's docs about passing a ZMQ endpoint config parameter to the (BCHN) node to speed things up, but it seems optional and I have not tried to measure the difference yet.
Thank you very much Josh & Verde team.
I've got the basic test built and running based only on the data you supplied (plus a download of the latest Fulcrum binary release).
Since I ran this on Debian 11, I hit a couple of minor issues. As I resolved them, I took some notes, which I'll put here in case others face similar bumps:
Stock Debian gradle seems too old, so definitely download a recent gradle package from Gradle's site; otherwise it will fail to parse the build.gradle and error on archiveVersion and archiveBaseName (possibly others too, but I only got that far before deciding it must be due to an inadequate Gradle on my box).
The run script tries to call gradlew, so one needs to run gradle wrapper in the bch-scaling base folder in order to generate that wrapper there.
Further, there are some points that are still unclear to me, but let me note how I proceeded:
index=1 to the bitcoind.conf config

@freetrader : Hey, thanks for trying to get it running for yourself! Debian is my go-to OS, so I'm pretty familiar with the problems you can encounter, which is good. The gradle wrapper was committed to the repo, so you should have been able to run ./gradlew makeJar copyDependencies and have it "just work" (the wrapper should solve all of the problems you encountered with Debian, since it will download the version of gradle it needs).
More ideally, the intended build step was to run the ./scripts/make.sh script (from the project directory), since it'll take care of structuring the out directory for you. Do either of these steps not work? I ran this just now on a Debian 11 VM with openjdk 11.0.15 and it worked without anything special, so hopefully that's the same for you.
This week we implemented a change to the test block emitter to enable the transfer of blocks and transactions via the P2P protocol. We've also re-run the tests via the P2P protocol (instead of via the RPC protocol). We have the raw data uploaded but haven't finished compiling the results; I expect we'll have this done before the end of the day on Monday. These tests were run with the node configured with txindex=1 and ZMQ enabled; it will be interesting to see if this has any significant performance effect on the node and Fulcrum.
Additionally, we've started adding new blocks to the test framework to model cash transactions: 2 inputs -> 2 outputs. These blocks are appended to the current framework and should be made available this coming week.
The one benefit (from a testing perspective) of using RPC was that it was easier to measure when BCHN finished (since the RPC call hung until the block was accepted). We can still measure how long BCHN takes; we just have to do it slightly differently, which is not a problem but is something that took us longer than we expected.
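For reference, here is a minimal sketch of one way to take that measurement when blocks arrive via P2P (this is illustrative, not necessarily what our framework does): poll the node over JSON-RPC until the new block becomes the best tip and record the elapsed time. The RPC endpoint, credentials, and block hash below are placeholders.

```python
import time
import requests  # third-party HTTP client, assumed available

RPC_URL = "http://127.0.0.1:8332"      # placeholder BCHN RPC endpoint
RPC_AUTH = ("rpcuser", "rpcpassword")  # placeholder credentials

def rpc(method, *params):
    payload = {"jsonrpc": "1.0", "id": "timing", "method": method, "params": list(params)}
    response = requests.post(RPC_URL, json=payload, auth=RPC_AUTH)
    response.raise_for_status()
    return response.json()["result"]

def time_block_acceptance(expected_hash, poll_interval=0.1):
    """Seconds elapsed until the node reports expected_hash as its best tip."""
    start = time.monotonic()
    while rpc("getbestblockhash") != expected_hash:
        time.sleep(poll_interval)
    return time.monotonic() - start

# Hypothetical usage: broadcast the block via P2P, then immediately call
# time_block_acceptance("<hash of the block that was just sent>").
```

The measurement resolution is bounded by the poll interval, which is fine at the multi-second block-processing times being discussed here.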
A preliminary look at the results is ...interesting. It looks like it takes twice as much time for BCHN to process a block compared to last time. I suspect this has little to do with P2P and more to do with the transaction index being enabled. I'm going to run the tests again tonight with P2P but with the index disabled so we can better compare apples-to-apples.
I tried to run the suite to reproduce the results, and these are my personal findings:
We've compiled the reruns of the existing test framework to explore the hypothesis that the RPC and P2P code paths were different (and more specifically that P2P would be better). In short, the results appear to be within run-to-run variance; in particular, the large spikes (likely caused by DB flushing) inflate the averages. The formatted results for each finding are below:
RPC Results: RPC Results - Google Sheets
P2P Results (No Tx Indexing): P2P Results - No TxIndex - Google Sheets
P2P Results (Tx Indexing):
Let me know if you want help with determining sample size and constructing confidence intervals.
Can you please? Thanks in advance!
Thanks Josh
> you should have been able to run ./gradlew makeJar copyDependencies and have it "just work"
I'm not familiar with gradle, so I didn't know that, but eventually I got it to work. I wonder why running the script failed for me initially (I remember it couldn't find gradlew, which is why I figured I had to create one myself).
I calculated some confidence intervals and did some statistical power analysis using the data on the Google Sheet and an R script I wrote here.
My conclusion: each phase that you want to measure should be run for 100 blocks or more. I know that means running each phase for more than half a day, but if you want reliable results then you have to increase the sample size substantially above what you have now.
I estimated confidence intervals using a nonparametric percentile bootstrap. Bootstrapped confidence intervals work well in cases of low sample size and data that is not normally distributed, like in our case here. I chose to display the 90% confidence interval since that seemed appropriate for our purposes.
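For reference, a rough Python equivalent of the percentile bootstrap looks like the sketch below; it mirrors the general approach rather than the exact code in the R script, and the per-block times at the end are made-up placeholders, not values from the sheet.

```python
import numpy as np

def percentile_bootstrap_ci(samples, confidence=0.90, n_resamples=10_000, seed=0):
    """Nonparametric percentile-bootstrap confidence interval for the mean of `samples`."""
    rng = np.random.default_rng(seed)
    samples = np.asarray(samples, dtype=float)
    # Resample with replacement many times and record the mean of each resample.
    means = np.array([
        rng.choice(samples, size=samples.size, replace=True).mean()
        for _ in range(n_resamples)
    ])
    alpha = (1.0 - confidence) / 2.0
    return np.quantile(means, alpha), np.quantile(means, 1.0 - alpha)

# Placeholder per-block processing times (seconds):
block_times = [48, 51, 39, 62, 44, 58, 41, 55]
low, high = percentile_bootstrap_ci(block_times)
print(f"90% CI for the mean block time: {low:.1f}s to {high:.1f}s")
```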
The units of the confidence intervals are seconds to process each block. The "Transactions per second" unit cannot be used directly since there is no measurement of how long each transaction verification takes and therefore there is no way to calculate the variability. I only had enough data to measure the fan-out and steady state 1 phases. Fan-in had only two observations, which is too few. Steady state 2 was missing data in the sheet. The Fulcrum data sheet has its units as "msec", but from the discussion above it seems that it is actually just seconds.
Here are the confidence intervals:
| Processing Type | Block Type | Lower 90% C.I. (seconds) | Upper 90% C.I. (seconds) |
|---|---|---|---|
| bchn.0p | fan-out | 31 | 106 |
| bchn.0p | steady state 1 | 42 | 62 |
| bchn.90p | fan-out | 28 | 135 |
| bchn.90p | steady state 1 | 12 | 14 |
| fulcrum.0p | fan-out | 1785 | 2169 |
| fulcrum.0p | steady state 1 | NA | NA |
| fulcrum.90p | fan-out | 1579 | 1805 |
| fulcrum.90p | steady state 1 | 574 | 698 |
The largest confidence intervals are for the fan-out phases for BCHN (both 90p and 0p). They are very large and therefore need to be shrunk by increasing the sample size.
Through statistical power analysis we can get a sense of how many observations are needed to shrink the confidence intervals to a certain size. To standardize and make the numbers comparable across different block processing procedures, we can express the width of these confidence intervals as a percentage of the mean of the quantity being measured.
Below is the estimated sample size to achieve a target width of confidence interval. I chose 10%, 25%, and 50% of the mean for comparison:
| Processing Type | Block Type | N for C.I. width < 10% of mean | N for < 25% | N for < 50% |
|---|---|---|---|---|
| bchn.0p | fan-out | 1447 | 234 | 60 |
| bchn.0p | steady state 1 | 93 | 17 | 6 |
| bchn.90p | fan-out | 2036 | 328 | 84 |
| bchn.90p | steady state 1 | 18 | 5 | 3 |
| fulcrum.0p | fan-out | 45 | 9 | 4 |
| fulcrum.0p | steady state 1 | NA | NA | NA |
| fulcrum.90p | fan-out | 22 | 6 | 3 |
| fulcrum.90p | steady state 1 | 25 | 6 | 3 |
The results show that we ought to be able to shrink the confidence interval to less than 50% of the mean for all block processing procedures if we use 100 blocks for each phase.
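For completeness, the general idea behind those sample-size estimates can be sketched in Python as follows (again, this mirrors the approach rather than the exact R code): for a candidate sample size n, draw hypothetical samples of size n from the observed data, bootstrap a confidence interval on each, and pick the smallest n whose average interval width falls below the target fraction of the mean.

```python
import numpy as np

def expected_ci_width(samples, n, confidence=0.90, n_trials=200, n_resamples=2_000, seed=0):
    """Average percentile-bootstrap CI width for hypothetical samples of size n,
    drawn with replacement from the observed data."""
    rng = np.random.default_rng(seed)
    samples = np.asarray(samples, dtype=float)
    widths = []
    for _ in range(n_trials):
        hypothetical = rng.choice(samples, size=n, replace=True)
        means = np.array([
            rng.choice(hypothetical, size=n, replace=True).mean()
            for _ in range(n_resamples)
        ])
        alpha = (1.0 - confidence) / 2.0
        widths.append(np.quantile(means, 1.0 - alpha) - np.quantile(means, alpha))
    return float(np.mean(widths))

def required_sample_size(samples, target_fraction_of_mean, candidates=range(5, 301, 5)):
    """Smallest candidate n whose expected CI width is below the target fraction of the mean."""
    target = target_fraction_of_mean * float(np.mean(samples))
    for n in candidates:
        if expected_ci_width(samples, n) < target:
            return n
    return None  # none of the candidate sizes was large enough

# Hypothetical usage: required_sample_size(block_times, 0.50)
```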
Let me know if I have misunderstood anything about the data.
Wow, that's far more in-depth than I expected.
Thank you.
I wouldn't advise jumping to 100 blocks just yet at this stage of testing, but if in the future several tests produce results that are close together, it could be worth it then.
The latest report is available here: P2P Results - Steady-State - Flawed 90p / 0p - Google Sheets
The current test case now has 10 additional 256 MB blocks. Each new block contains transactions that are 2-inputs and 2-outputs. We encountered some difficulty getting these test blocks created; most of the problems involved managing the UTXO set for the test emitter so that the created blocks did not consist entirely of massively chained transactions within the same block (as that style of block likely takes a different code path within the node and does not necessarily represent a typical real-life situation).
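To illustrate the constraint we were working around, here is a simplified Python sketch (not the emitter's actual code) of building a 2-in/2-out block that only spends outputs confirmed in earlier blocks, so no transaction chains off another transaction in the same block:

```python
import random

def build_two_in_two_out_block(spendable_utxos, target_tx_count):
    """Build a block of 2-input/2-output transactions that only spend previously
    confirmed outputs, avoiding intra-block transaction chains.
    `spendable_utxos` holds outputs confirmed in earlier blocks; fees are ignored."""
    random.shuffle(spendable_utxos)
    block_txs = []
    new_utxos = []
    while len(block_txs) < target_tx_count and len(spendable_utxos) >= 2:
        inputs = [spendable_utxos.pop(), spendable_utxos.pop()]
        value = sum(utxo["value"] for utxo in inputs)
        outputs = [{"value": value // 2}, {"value": value - value // 2}]
        block_txs.append({"inputs": inputs, "outputs": outputs})
        # The new outputs only become spendable after this block is mined, so they
        # are collected separately instead of being pushed back into the spendable
        # pool (which would create intra-block chains).
        new_utxos.extend(outputs)
    return block_txs, new_utxos
```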
There are currently two problems with the new data set:
minrelaytxfee set to 1 is more pragmatic (at least for now).

Unfortunately, we had not noticed the 2nd error until we were compiling the test results today. This error resulted in the test framework broadcasting 90 percent of the transactions up until block 266, then broadcasting 0 percent (due to the min relay fee) for blocks 267-277. We plan on rerunning the tests with the new configuration and posting new results later this week.
Additionally, we mined a 256 MB reorg block and ran it (manually) against the BCHN node. The results of this one-off test showed that it took the node over a minute and a half to finish a reorg of a single 256MB block. If we can reproduce this result, then it could indicate a problem for the network if a 256 MB reorg were to happen with the current level of node performance. Since this could be indicative of a potentially large problem for the current state of long-term scaling, we decided our next step would be to put more time into supporting this test properly within the framework. The intent will be to run the test scenario as 90p with P2P broadcasting enabled, and intentionally reorg as a part of the standard test. Once properly integrated, we can observe how Fulcrum (and other endpoints) respond to this situation.
The latest report with the 10 additional 256 MB blocks can be found here: P2P Results - Steady-State - Google Sheets
As mentioned above, this report was run using minrelaytxfee=0 within BCHN's bitcoin.conf, and the 0p blocks were broadcast 10 minutes apart to avoid their timestamps causing invalid/rejected blocks (this also brings them more in line with the 90p tests, although they take longer to run now).
It's worth calling out that Fulcrum failed to finish for this 0p result, so the averages are a little misleading. Also, what was formerly called "Steady State 1" is now named "Misc 1", and the new blocks are now labeled "Steady State". I briefly compared the results between this report and the P2P Results from a couple of weeks ago, and they matched up consistently, which indicates relatively low deviation between tests.
This week's test was intended to evaluate two things:
From what we've seen here: P2P Results - DBCache - Fulcrum v1.7 - Google Sheets, it would appear that neither of these two tweaks resolved either issue.
There was one unintended quirk in the testing framework for the above results that caused the reorg block to be broadcast 10 minutes later than intended. That shouldn't affect the processing time of the blocks, but it does create an unexpected timestamp in the data. The latest version of the BCH Test Emitter should resolve this quirk for future data collection.
I think it's important that someone else attempts to replicate the 5-block lag to ensure the issue is not something specific to our running environment. I believe this has already been replicated on other machines, but confirming that here (on BCR) would be good documentation for the future.
This week we will be moving on to evaluating Trezor's indexer: https://github.com/trezor/blockbook to ensure hardware wallets will be able to perform with 256MB blocks.
Thanks Josh, we'll be doing that at BCHN.
I'll be trying it with the same parameters but also with larger dbcache sizes.
Not sure about others, but I routinely allocate more memory to the DB cache on my servers - basically as much as I can spare because disk I/O is so expensive.
Default is 450MB btw.
My gut feeling is one wants that cache to equal or exceed multiple full blocks.
But whether that's related to the observed lag effect is something we need to investigate.
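To put rough numbers on that gut feeling (back-of-envelope only, and treating block size as a crude proxy for the new state a block adds to the UTXO cache, which it is not exactly):

```python
# How many 256 MB blocks' worth of new data would fit in a given dbcache?
BLOCK_SIZE_MB = 256

# 450 MB is the stated default; 4096 MB is a hypothetical "generous" setting.
for dbcache_mb in (450, 4096):
    print(f"dbcache={dbcache_mb} MB ~= {dbcache_mb / BLOCK_SIZE_MB:.1f} blocks")
# The default covers less than two full blocks, so it falls well short of the
# "equal or exceed multiple full blocks" rule of thumb above.
```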
Thanks, FT!
Setting the dbcache to be lower than the default was definitely not intended. I remember reading online that the default was 100MB, but that must have been for BTC and/or an old version and/or an outdated stackoverflow post. Fortunately all of the other tests were run with the actual default value, so all this new test did was show what happens when we go lower. (Which was the opposite of the actual intent: to go higher.) I'll just rerun the tests with a gig of dbcache and republish the results. Thanks for pointing this out.
We've been exploring testing Blockbook (the backend for Trezor wallets) for the past week-plus. We've hit plenty of snags along the way, but things seem to be on a roll now. We're running the 90p tests today and plan to run the 0p tests on the weekend (and/or Monday). On Monday we'll write the script to parse the logs from Blockbook, and we should then have a report available mid-week. The memory requirements for Blockbook + BCHN + Emitter are quite large, so we may still have some snags to resolve, but we're optimistic. Additionally, this means that the hardware used to run these tests will be different from the hardware behind the results we've published earlier.
On another note, @matricz from BCHN has been replicating (and exploring) the results we've published before. It is my understanding that he's confirmed dbcache as the source of the 5-block BCHN lag, which is an awesome discovery. I'll poke him privately to see if he can post his findings here so it's not hearsay.
I have indeed reproduced, and reliably worked around, the slowdowns, which are entirely attributable to dbcache.
I did runs with the following setup:
Measured the wall time with $ time scripts/run.sh for all runs. Runs with the default dbcache (which is 400M, pretty small for 256MB blocks) were 24m18.130s and 23m23.271s.
A run with a sizeable -dbcache=16000 yielded 18m5.147s, which is roughly three quarters of the default-dbcache runtime (18m5s against an average of about 23m51s). The difference is still bigger than the sum of the db flushes, but it also includes additional db reads, which makes sense.
Unless there is a special need to test db performance, I advise running these tests with a dbcache setting large enough that its limit is never hit.
This weekās test was the first to evaluate Blockbook, the back-end for the Trezor wallet.
Reports for Blockbook are being compiled here: Blockbook - Google Drive
At this time we have only successfully run the 90p test, as we ran into an unexpected issue with Blockbook logging the 0p test results. The data for the 90p test is populated in the "P2P Result - Blockbook 90p" spreadsheet in the folder referenced above. We intend to run the 0p test soon, but following information from @mtrycz about the BIP34 (non-)activation potentially causing unintended delays in BCHN's block processing, we have decided to test a fix for that issue in a second 90p run. After evaluating the second 90p run we expect to run the 0p test with the BIP34 fix.
One additional update to the data processing is that we've singled out the "Steady State" block which now undergoes a reorg into a separate category from the rest of the Steady State blocks, to avoid it throwing off the average and to generally highlight its uniqueness.
The Blockbook results show, as one might expect, a pretty heavy preference for blocks that reduce the number of UTXOs. The ten "Fan-Out" blocks performed the worst, with an average of over 15 minutes to process each one. Overall, Blockbook averaged 9.5 minutes per block across the non-trivial block types.