Cardano January Post-Mortem
On January 23rd, the Cardano blockchain underwent its first real-world test of large-scale resilience. Approximately 60% of all Cardano nodes crashed and rebooted. As part of the triage team, I was asked to write a blog post providing a community-centric post-mortem analysis of the incident, the quick response from IOG and the community, and the robustness of the Cardano network.
On January 23rd at 7:09:01 PM EST, the Cardano network faced an unexpected challenge when a significant number of nodes crashed and rebooted. Community member Smaug promptly reported the issue on GitHub; the report can be found here. IOG’s official incident report, released today, can be found here.
A Swift Response
In response to the reported issue, IOG quickly formed a task force, bringing together members from the community, including myself, to identify the cause of the crash and develop a solution. Within a week, the task force had identified likely candidates for the crash and released a fix for block-producing nodes to adopt.
In accordance with a policy of responsible disclosure, to minimize the risk of an attack while block producers were upgrading, the exact details were kept internal. Recently, the details of the fix were published as an open pull request that can be found here.
A Resilient Network
The key point to emphasize is the resilience of the Cardano network. Despite the significant number of nodes crashing, the network recovered within minutes, as designed. This rapid recovery showcases the robustness of the Cardano blockchain and its ability to handle unexpected challenges.
Reproducing the Issue and Fix Validation
Engineers on the IOG Ledger team identified the likely cause of the bug. The task force played a supporting role by brainstorming ideas and validating the hypotheses presented by the engineers. Using QuickCheck, a property-based testing framework, the team wrote unit tests that reproduced the issue on a small scale, which made it possible to validate the proposed fix in a controlled environment.
Although the QuickCheck tests reproduced the issue, we were unable to reproduce the bug on a full private testnet. This limited our ability to confirm the fix on a larger scale, but the team remained confident that the root cause had been addressed and that the fix would prevent similar crashes in the future.
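To give a feel for this kind of test, here is a minimal sketch of a QuickCheck property that checks whether an insert function preserves Data.Map's invariants. It is not the Ledger team's actual test: naiveInsert and prop_insertKeepsValid are hypothetical names invented for illustration, and naiveInsert is a deliberately broken stand-in that keeps keys ordered but never rebalances. It assumes the containers and QuickCheck packages are available.

```haskell
module Main where

import Data.Map.Internal (Map (..))
import qualified Data.Map.Strict as M
import Test.QuickCheck (quickCheck)

-- Hypothetical insert (an assumption for this sketch, not CanonicalMap's code):
-- it keeps keys ordered and sizes correct, but never rebalances the tree.
naiveInsert :: Ord k => k -> v -> Map k v -> Map k v
naiveInsert k v Tip = Bin 1 k v Tip Tip
naiveInsert k v (Bin sz k' v' l r) =
  case compare k k' of
    LT -> let l' = naiveInsert k v l in Bin (1 + M.size l' + M.size r) k' v' l' r
    GT -> let r' = naiveInsert k v r in Bin (1 + M.size l + M.size r') k' v' l r'
    EQ -> Bin sz k v l r

-- Property: after inserting every key with the given insert function,
-- the map still satisfies Data.Map's ordering, balance, and size invariants.
prop_insertKeepsValid :: (Int -> () -> Map Int () -> Map Int ()) -> [Int] -> Bool
prop_insertKeepsValid ins ks = M.valid (foldr (\k m -> ins k () m) M.empty ks)

main :: IO ()
main = do
  -- The library insert should pass the property.
  quickCheck (prop_insertKeepsValid M.insert)
  -- The non-rebalancing insert should fail, and QuickCheck should
  -- shrink the failure to a short list of keys.
  quickCheck (prop_insertKeepsValid naiveInsert)
```

The value of a property like this is that QuickCheck generates and shrinks the inputs itself, so a small failing case can surface without anyone having to guess the exact sequence of inserts.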
In keeping with responsible disclosure practices, IOG waited for an appropriate period before publishing the details of the fix along with this blog post. This decision was made to minimize the risk of malicious actors reproducing the bug and causing repeated crashes, ensuring the safety and stability of the Cardano network.
The root cause of the vulnerability was a violation of the invariants that Data.Map relies on. The violation was introduced by a semi-custom map called “CanonicalMap”, used for tracking MultiAsset data. The custom insert function in CanonicalMap can produce unbalanced tree structures, which in turn can crash the nodes.
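As a rough illustration of what a violated invariant looks like, the sketch below builds one map through the public API and one by hand with the constructors exported from Data.Map.Internal, then checks both with Data.Map's valid function. The hand-built skewed tree is an assumption made purely for illustration; it is not the structure CanonicalMap produced, only an example of a tree whose keys and sizes are correct but whose balance is not.

```haskell
module Main where

import Data.Map.Internal (Map (..))
import qualified Data.Map.Strict as M

-- Built through the public API, so it is ordered and balanced.
good :: Map Int ()
good = M.fromList [(k, ()) | k <- [1 .. 4]]

-- Hand-built right spine (illustrative assumption): keys are in order and
-- the size fields are correct, but nothing has rebalanced the tree.
skewed :: Map Int ()
skewed =
  Bin 4 1 () Tip
    (Bin 3 2 () Tip
      (Bin 2 3 () Tip
        (Bin 1 4 () Tip Tip)))

main :: IO ()
main = do
  print (M.valid good)   -- True
  print (M.valid skewed) -- False: the balance invariant does not hold
```

Data.Map's operations are written on the assumption that every input is a valid map, so their behaviour on a tree like skewed is not something the library guarantees.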
To trigger this rare bug, a specific sequence of events needed to occur:
- A sequence of transactions created the right internal state in the UTXO to prime the system for the vulnerability.
- A node restarted, resetting its UTXO to a different internal structure.
- The node received a transaction into its mempool, consuming UTXOs with several different policy IDs or native tokens. This transaction didn’t cause a crash on the node it was submitted to but triggered crashes on other nodes with the “dangerous” internal structure.
- The transaction propagated to other nodes’ mempools without causing immediate crashes.
- The node minted a block containing the transaction and propagated it to peers.
- As the block reached other nodes, they crashed while trying to adopt the block.
The bug was difficult to reproduce due to the complex interplay of factors and the precise sequence of events required. It is remarkable that this rare bug was discovered relatively early in the history of multiassets and collateral returns, considering the unique set of conditions needed for it to occur.
One thing to come out of this incident was a personal project of mine, cardano-slurp. Part of what made this incident difficult to triage is that any node that saw the data that caused the crash… crashed. Cardano-slurp is a program that acts like a cardano-node to the network, gossiping about blocks and transactions, but simply archives them without processing them in any way. This will allow us to capture any toxic waste in the future, as well as write interesting analytics about real-world network traffic patterns.
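The "archive without processing" idea can be sketched in a few lines. This is not cardano-slurp's code, and it leaves out the networking side entirely: frame and archiveRaw are hypothetical helpers, and the length-prefixed file format is an assumption chosen for simplicity. What it shows is that the payload is written exactly as received and never decoded, so a malformed or dangerous message cannot trip any validation logic in the archiver. It assumes the bytestring package.

```haskell
module Main where

import qualified Data.ByteString as BS
import qualified Data.ByteString.Builder as B
import qualified Data.ByteString.Lazy as BL

-- Prefix the raw payload with its length as a 4-byte big-endian integer.
-- The payload itself is treated as opaque bytes and never parsed.
frame :: BS.ByteString -> BL.ByteString
frame payload =
  B.toLazyByteString
    (B.word32BE (fromIntegral (BS.length payload)) <> B.byteString payload)

-- Append one framed message to an archive file (hypothetical format).
archiveRaw :: FilePath -> BS.ByteString -> IO ()
archiveRaw path payload = BL.appendFile path (frame payload)

main :: IO ()
main = do
  -- Stand-in bytes; a real archiver would receive these from peers.
  archiveRaw "archive.bin" (BS.pack [0xde, 0xad, 0xbe, 0xef])
```

A separate, offline tool can then read the frames back and try to decode them, so any crash happens in the analysis step rather than in the process doing the capturing.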
It is important to highlight the swift response from IOG and the community, as well as the resilience of the network. The task force’s efforts led to a timely fix, and the incident serves as a reminder of the importance of collaboration and transparent communication in the world of blockchain technology.