On May 22, 2023, the Dash network faced an unexpected disruption due to a glitch during the activation of version 19.0.0 at block 1,874,880. The network stalled for roughly 16 hours, marking its first outage since approximately 2015. Many have speculated about the origin of the issue. This post aims to lay that speculation to rest as we report our findings.
TL;DR: The issue stemmed from the upgrade of our BLS scheme from a non-standard scheme to the new standard scheme used by the industry. It is worth noting that when we introduced BLS threshold cryptography to our network in 2018, there was no defined standard. This upgrade is vital on our path to inter-blockchain communication.
More in-depth explanation below.
Immediate response and investigation
On May 22, 2023, at precisely 04:13 UTC, the Dash network encountered an unexpected interruption during the v19 activation. The issue caused a chain halt, preventing the production of new blocks following the main consensus rules. Within minutes, at 04:18 UTC, we detected the anomaly and began our investigations.
Within an hour, at approximately 05:04 UTC, we successfully identified the problematic block. As the team delved deeper into the root cause, we initially suspected by 06:45 UTC that the disruption was related to an unusual occurrence while processing a special transaction; however, this turned out to be incorrect. Our theory at that point also centered on the BLS signature scheme update meant to transition our system to the new IETF specification. This part of our theory was true.
By 07:24 UTC, our engineers had started to home in on how the BLS update might be interacting with the masternode list to cause the issue. Over the following two hours, we developed and tested numerous builds as analysis of the BLS implementation continued. One notable clue was the system’s failed attempt to remove certain masternodes: it could not locate them in the masternode list using their BLS operator key.
At 10:05 UTC, we briefly discussed whether to release a new version that would delay v19 activation or continue attempting to create a patch to resolve the issue. A decision was made to continue troubleshooting in hopes of identifying the solution quickly. The team then tested several patches that provided additional details but not a solution.
Around 11:00 UTC, we further traced the problem to a failure to apply masternode diffs properly. Having validated that some earlier theories were on the right track but not knowing the exact root cause, the team attempted a fix that briefly showed some promise. At 11:37 UTC, we noticed that some nodes were seeing blocks and decided to figure out how to get all nodes to follow.
Divide and conquer
After observing at 12:34 UTC that not all nodes followed the new chain, even after re-indexing, we concluded that we were likely facing a non-determinism issue and that the advancing chain was non-viable. Our team made multiple attempts to work around this unidentified source of non-determinism. Ultimately, however, the majority of the network did not follow this chain because its blocks contravened consensus rules, leading many nodes to reject them.
Faced with that challenge, the team divided efforts in case a patch to resolve the original problem wasn’t practical for time or complexity reasons. One group focused on fully identifying and fixing the root issue. A second team worked on creating a new release that would postpone v19 activation which would allow the network to recover more quickly. As the team working on a complete fix proceeded, team members continued getting divergent and inconsistent results.
By 13:54 UTC, we had a patch to delay v19 activation, and v19.1 was ready to start the building process about an hour later. At 16:40 UTC, only 12 hours after the initial interruption, we decided to release the v19.1.0 hotfix, and our CTO, Samuel Westrich, announced this shortly thereafter. This patch temporarily deactivated the v19 hard fork rules, allowing the network to continue the chain and buying us more time for a permanent fix.
Our parallel efforts also yielded results at 18:00 UTC as we identified the primary root cause. A section of code that periodically creates masternode list diffs (ApplyDiff) was searching the masternode list by operator key based on the active BLS scheme. Although the v19 hard fork activation changed the active scheme from legacy to basic, some masternode updates or deletions captured by ApplyDiff happened under the legacy scheme. Therefore, the system could not locate the masternodes properly, resulting in this issue.
Although this issue only became evident in the first block when v19 activated, it occurred due to activity in the preceding 576 blocks. Any updated or deleted masternode in that period would have triggered this issue. It is worth noting that a BLS scheme migration is done at the activation block, so any future updates or deletions would not have experienced this issue.
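To make the failure mode above more concrete, here is a minimal sketch in Python. The names (`serialize_operator_key`, `MasternodeList`) and the one-byte scheme prefix are hypothetical stand-ins; the real logic lives in Dash Core’s C++ code. The point it illustrates is the mismatch: the list is keyed by the serialized operator key, entries were recorded under the legacy serialization, and after activation the lookup serializes the same key under the basic scheme and therefore misses.

```python
# Hypothetical sketch of the ApplyDiff lookup mismatch; names and
# serialization details are illustrative, not Dash Core's actual code.

LEGACY, BASIC = "legacy", "basic"

def serialize_operator_key(pubkey: bytes, scheme: str) -> bytes:
    """Stand-in for BLS public key serialization: the same key yields
    different wire bytes under the legacy and IETF (basic) schemes."""
    prefix = b"\x00" if scheme == LEGACY else b"\x01"
    return prefix + pubkey

class MasternodeList:
    def __init__(self):
        # serialized operator key -> masternode id
        self.by_operator_key = {}

    def add(self, mn_id: str, pubkey: bytes, scheme: str):
        self.by_operator_key[serialize_operator_key(pubkey, scheme)] = mn_id

    def remove(self, pubkey: bytes, scheme: str):
        key = serialize_operator_key(pubkey, scheme)
        if key not in self.by_operator_key:
            raise KeyError("masternode not found by operator key")
        del self.by_operator_key[key]

mn_list = MasternodeList()
# Before the fork: a masternode recorded under the legacy scheme.
mn_list.add("mn1", b"\xab" * 48, scheme=LEGACY)

# At activation the active scheme flips to basic; replaying a diff that
# deletes mn1 now serializes the key under basic and the lookup fails.
try:
    mn_list.remove(b"\xab" * 48, scheme=BASIC)
except KeyError as e:
    print("lookup failed:", e)
```

In this toy model, as in the description above, the bug only bites for entries written under one scheme and looked up under the other; entries created after the scheme switch are found without issue.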
We would also like to highlight the complexity of the issue for less technically inclined readers. Figuring it out in such a short time frame was a testament to the dedication of the team, many of whom had been operating under very stressful conditions for an extended period. By the time the underlying issue was discovered, the team was not in a position to safely write and deploy a complete fix on a timeline that would have been acceptable to the network. At the same time, v19.1.0 had already been released, and we all agreed that getting our network running again was paramount.
Back in business
At 19:53 UTC, less than 16 hours from the initial incident, the Dash network resumed producing blocks from the last block with a ChainLock (height: 1874879, hash: 0000000000000013cdc708723f111c6d34effd9ea663e62eaaa5c9ff299800cb). Community members and team members alike jumped in quickly to get nodes updated and provide support to anyone encountering issues during the transition. By May 24, 2023, stability had been restored, with around 80% of the network updated to v19.1.0.
Amid these stressful circumstances, there was some internal miscommunication regarding the new activation date for the v19 hard fork. Some earlier communication mentioned June 14 as the v19 activation date; however, the patch provided in Dash Core v19.1.0 actually set June 14 as the start time of the v19 activation process. As a result, the earliest possible activation is July 6, after completion of the typical lock-in period. The v19.1.0 release notes communicated a June 14 start date with the earliest activation two weeks later.
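The distinction between a start time and an activation date can be sketched with a rough BIP9-style timeline. The window length and boundary date below are assumptions for illustration (roughly weekly windows), not Dash Core’s exact deployment parameters: signaling can only begin at the first window boundary after the start time, a fully signaled window moves the deployment to locked-in, and the rules become active one further window later.

```python
# Rough, hypothetical BIP9-style timeline; WINDOW and the boundary date
# are illustrative assumptions, not Dash Core's exact parameters.
from datetime import date, timedelta

WINDOW = timedelta(days=7)  # assumed signaling-window length

def earliest_activation(start: date, first_window_boundary: date) -> date:
    """Earliest ACTIVE date: signal through the first full window after
    the start time, then wait out one LOCKED_IN window."""
    assert first_window_boundary >= start  # signaling begins after the start time
    locked_in_start = first_window_boundary + WINDOW  # window fully signaled
    return locked_in_start + WINDOW                   # ACTIVE after lock-in

start = date(2023, 6, 14)
# If the next window boundary after June 14 falls about a week later,
# the earliest activation lands in early July:
print(earliest_activation(start, date(2023, 6, 21)))  # → 2023-07-05
```

With block-height-aligned windows rather than this calendar approximation, the same arithmetic yields the early-July earliest activation mentioned above.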
In further news, we are pleased to announce that testing of a fix for the uncovered issue begins tomorrow. This fix, to be included in v19.2.0 prior to activation, will require a mandatory upgrade. Once we have established confidence in the robustness and effectiveness of this solution, we plan to consult the community about potentially expediting the activation date.
This step will enable us to ensure your voice is heard and that we move forward together. As we continue working tirelessly to serve the Dash project, we remain committed to transparency, reliability, and the growth of the Dash network.
CTO, Dash Core Group
Head of Documentation, Dash Core Group
On behalf of the Dash Core Team