NSPCC continues tracking Neo 3 development and benchmarking node implementations. We previously benchmarked preview3 nodes and showcased some post-preview3 improvements made to NeoGo. Now that both implementations have proper preview4 releases, it’s interesting to see what has changed.
neo-bench updates
As usual, we’re using neo-bench to test nodes, and as usual, it has improved since our last post. There were some internal improvements, like configuration templating and the ability to build the C# node from source code, and a number of minor bug fixes, but one change stands out: the transaction-pushing code was reworked to handle congested mempools correctly.
Previously, any error the node returned on a transaction push was treated as final for that transaction. But when the node’s mempool reached its capacity, the node inevitably started returning errors. Lacking a retransmission mechanism, the code kept pushing more and more transactions to the node and receiving more and more errors until it ran out of transactions. This was the reason for using enlarged non-standard mempools to test single nodes (500K instead of 50K).
Now neo-bench handles these mempool-capacity (OOM) errors specifically and retries sending after a 0.1 s timeout. When the node being tested produces (or receives, depending on the setup) a new block containing transactions, it frees up some of its mempool capacity, and new transactions can be accepted again. This allowed us to return to standard 50K mempools for all tests.
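The retry behavior described above can be sketched roughly as follows. This is a minimal illustration, not the actual neo-bench code: `MempoolFullError`, `push_with_retry`, and the `send` callback are hypothetical names, and only the 0.1 s delay comes from the text.

```python
import time


class MempoolFullError(Exception):
    """Hypothetical error: the node rejected a transaction because its mempool is full."""


def push_with_retry(send, tx, retry_delay=0.1, max_attempts=None, sleep=time.sleep):
    """Push tx via send(), retrying after retry_delay seconds on mempool-full errors.

    Returns the number of attempts made. Any other exception propagates
    immediately and is treated as final for this transaction.
    """
    attempts = 0
    while True:
        attempts += 1
        try:
            send(tx)
            return attempts
        except MempoolFullError:
            # Give up only if the caller set an attempt limit and it is reached.
            if max_attempts is not None and attempts >= max_attempts:
                raise
            # Wait for the node to free mempool capacity with a new block.
            sleep(retry_delay)
```

The `sleep` parameter is injectable only so the loop can be exercised without real delays; by default the sketch retries indefinitely, and the optional `max_attempts` limit is an addition of this example.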
Testing setup
For easier relative comparisons, we open this test with our regular Core i7-8565U CPU and 16 GB memory setup. We’re using neo-bench commit c5b065f770356689852d641b43e986b4b2623991, preview4 release binaries for the C# node, and commit 25435bbb5251010388128ffe814a7b229859d983 of NeoGo (version 0.92.0).
We’re now concentrating on the setup with 10 worker threads, which works fine for measuring the maximum TPS value (more worker threads don’t add much). The fixed-rate mode is mostly interesting for other aspects like block interval stability, and both nodes handle it well now, so there is not a lot to see there. We measure NeoGo and C# nodes in single-node and four-node scenarios. The only special protocol-level setting left is block time, which is 1 second for a single node and 5 seconds for the four-node setup. That’s simply because a 15-second block time, with a 50K mempool supplying at most 50K transactions per block, limits even the theoretical TPS value to 3333, which is too low for our champions. The C# node also has the MaxConcurrentConnections setting of its RPC plugin set to 500 (the default is too low for some scenarios). Both nodes use LevelDB unless noted otherwise.
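The block time reasoning is simple arithmetic, sketched below under the assumption that a block can carry at most one mempool’s worth of transactions (50K). The function name is made up for this example.

```python
def max_theoretical_tps(mempool_capacity, block_time_seconds):
    """Upper bound on TPS if every block drains a completely full mempool."""
    return mempool_capacity / block_time_seconds


# 15 s block time, plus the 5 s and 1 s used in these tests.
for block_time in (15, 5, 1):
    print(block_time, int(max_theoretical_tps(50_000, block_time)))
```

With these inputs the ceiling is 3333 TPS at 15 seconds, 10000 TPS at 5 seconds, and 50000 TPS at 1 second.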
Single-node performance
The average TPS result for the NeoGo node is about 10300, and the C# node shows around 3100 TPS. Both nodes have improved substantially since preview3, with C# gaining more than 60% and NeoGo improving more than threefold. You may notice that NeoGo’s result is pretty much the same as the one we demonstrated in September, but if you look at it more carefully, you’ll see that while the TPS value is about the same, the curve shape and the node behavior are radically different.
Back in September, NeoGo managed to complete the test (which sends 1M transactions overall) in under 20 blocks because of a huge mempool that allowed it to steadily push the maximum possible number of transactions (64K) into each block. But that made block times jump into the 5–10 second range with the node configured for 1 second. It all worked, but with a 1-second block interval configured, we’d really like to see the node stay closer to it and produce smaller blocks on time. This is what we have now: the node packs the same overall number of transactions into more than 60 blocks, with each block containing around 10–20 thousand transactions.
Even though there are still spikes on the block interval plot, most intervals stay close to the target time. It should be noted, though, that the C# node tracks this interval almost perfectly under load.
As we can see from the CPU utilization rate, even though the C# node has almost doubled its CPU usage since preview3 (and that’s the reason for better performance), it still has some obvious room for improvement. NeoGo keeps around 6 out of 8 cores busy most of the time. The C# node also started to use more memory than in preview3 because it now processes more transactions within the same time frame.
Four nodes consensus test
The NeoGo privnet was able to reach an average of 1750 TPS, while C# nodes reach about 600 TPS, almost three times lower. Both implementations have doubled their results since preview3, which is quite remarkable.
We see the same pattern here as with a single node: Go nodes in general pack more transactions into blocks, but doing this requires more time, so they tend to deviate from 5-second block intervals a bit more than C# nodes.
Both implementations keep the CPU busy, but it’s not surprising given that we have five nodes running on four real and four virtual cores. Memory consumption has increased somewhat for both implementations, which again is explained by increased TPS values and the need to keep more transactions in RAM.
As we have noted several times previously, our i7-8565U testing machine is a little overloaded with four Neo nodes, and these results tend to fluctuate from one run to another. So while it was necessary to make measurements with it to compare with previous results, we felt that we needed something more (besides a miracle).
Cores, lots of cores
Sixteen real and sixteen virtual cores make the Ryzen 9 5950X a good CPU for any kind of benchmarking. Paired with 64 GB of RAM (and a nice SSD, of course), this system allows us to unleash more of the real node potential (for any implementation). It also has enough room to see how nodes scale. We’ve run the same set of tests on this machine, and the results are quite interesting.
Single-node
When we first saw data for NeoGo in single-node mode on this machine, we just couldn’t resist trying BadgerDB, a well-known, performant, but non-default DB backend. Using LevelDB, NeoGo reaches an average of 19600 TPS (a 90% improvement over the i7-8565U), which is just too close to 20K, and sure enough, using BadgerDB, it easily crosses this line with 21900 TPS. C# node performance also improves on this machine, reaching a 5350 TPS average (73% more than on the i7-8565U).
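The percentages quoted here follow directly from the average TPS values reported for the two machines. A small sketch of that arithmetic, where `improvement_percent` is just a helper name for this example:

```python
def improvement_percent(new, old):
    """Relative improvement of new over old, in percent."""
    return (new - old) / old * 100


# Average TPS on the Ryzen 9 5950X vs. the i7-8565U, both with LevelDB.
print(round(improvement_percent(19600, 10300)))  # NeoGo
print(round(improvement_percent(5350, 3100)))    # C# node
```

This prints 90 for NeoGo and 73 for the C# node, matching the figures in the text.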
But you may notice that the line on this plot is quite shaky for NeoGo, and there is a simple reason for it, which the following two plots explain:
The node tries to stay close to 1 second in inter-block timings, but when it creates a big block (with up to 50K transactions inside), it needs to process it afterward, which takes time. It takes long enough that, when processing is done, it is already time to make another block, so the node takes whatever transactions it has in the mempool (not a lot yet, because the mempool only has 50K capacity, most of which was just flushed into the previous block) and creates another, much smaller block. It does so in time (or pretty close to it), but the TPS value for this particular block is lower than for the previous one. Now the node has a full second to accept new transactions into an almost empty mempool, and it accepts almost 50K of them before they’re taken for the next block, and then the cycle repeats. Still, it all averages out during the run, and pushing 1M transactions in 44.5 seconds via RPC is exactly that number of transactions in exactly that time.
The C# node, meanwhile, tends to pack around 8000–9000 transactions into each of its blocks, but it also experiences occasional transaction count drops and spikes in block times that affect the average value.
As for resource utilization, neither node can really stress the 5950X because there are a lot of inherently single-threaded operations. Still, given that this CPU exposes 32 logical cores to the system, both nodes tend to keep 5–8 of them spinning.
Memory-wise, the results are quite predictable: the NeoGo node has to keep more transactions in RAM to produce ~50K blocks and in general processes them faster, so it uses more memory.
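To illustrate how alternating big and small blocks produce a shaky per-block TPS line that still averages out, here is a toy calculation. The block sizes and intervals are made-up illustrative values, not measurements from this benchmark, and the function names are hypothetical.

```python
def per_block_tps(blocks):
    """TPS of each block: its transaction count divided by the time it took."""
    return [txs / seconds for txs, seconds in blocks]


def overall_tps(blocks):
    """Average TPS over the whole run: total transactions over total time."""
    total_txs = sum(txs for txs, _ in blocks)
    total_seconds = sum(seconds for _, seconds in blocks)
    return total_txs / total_seconds


# Made-up pattern: a big block that arrives late, then a small block on time.
blocks = [(45_000, 1.5), (6_000, 1.0)] * 3

print(per_block_tps(blocks))  # alternates between a high and a low value
print(overall_tps(blocks))    # single averaged figure for the run
```

With these invented inputs the per-block values swing between 30000 and 6000 TPS, while the overall figure is 20400 TPS.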
Four nodes
Now that this private network’s nodes finally have enough CPU power to drive all of them, NeoGo setup averages out at 6100 TPS and C# at 1100 TPS.
This is where we can see an explanation for the C# setup’s TPS values: while it starts nicely with 3K+ TPS, the performance drops after a few blocks because of higher-than-expected block intervals. These intervals have some connection to block size, which reaches the mempool capacity (50K transactions). We think that solving this problem would give the C# node a nice boost in average TPS values.
NeoGo doesn’t show this pattern. Its block times are stable here, even though the per-block transaction count can fluctuate a little.
It’s a bit surprising that in this setup C# nodes are more resource-intensive than NeoGo nodes, keeping CPU busier on average and using more RAM even though the number of transactions processed is lower than for NeoGo. We can even say that NeoGo is still underutilizing this system, although it shows quite a good throughput.
Conclusions
As Neo 3 development continues, both nodes keep improving their performance and reaching new heights. The C# node now easily reaches 5K TPS in single-node mode and 1K in a four-node privnet, while NeoGo is closing the year 2020 with 20K TPS reached in a single-node setup and 6K TPS in a four-node privnet. We’ll see what 2021 brings to the table, but we hope that these metrics will allow us to drive a DeFi-enabled future.