#DTX2021: A Beginner’s Guide to Chaos
What is chaos engineering, and how do you get started? What are the different types of tests, and how do they compare to other options? These were the questions that Holly Grace Williams, founder of Akimbo Core, aimed to tackle during a technical session at the Digital Transformation EXPO Europe 2021.
The ‘A Chaos Podcast Presents: A Beginner’s Guide to Chaos’ session began by highlighting Facebook’s recent global outage, which lasted almost six hours. “Facebook runs ‘storm’ drills to ready itself to cope with outages,” Williams affirmed, and this is a form of chaos engineering.
“But what is chaos engineering?” Williams asked. Simply put, chaos engineering is the idea that “we experiment on production systems in order to build confidence in how those systems will perform under duress.”
Yet there are a lot of “incorrect” ideas circulating about what chaos engineering is and the experiments involved. “Chaos engineers are not breaking things in production to prove production systems can handle it,” she emphasized. For Williams, chaos engineers are not bringing the chaos; “the chaos is the production system.”
Examples of chaos engineering experiments include “taking something down”: what happens when you cause a failure in some part of a system? Others include “slowing something down”: what happens if a certain system element performs slowly?
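As a rough illustration of those two experiment types, the sketch below wraps a function so that calls can be made to fail ("taking something down") or be delayed ("slowing something down"). It was not shown in the session; the names `inject_faults`, `InjectedFailure`, and `lookup_price` are hypothetical, and a real experiment would more likely target infrastructure than a single function.

```python
import random
import time
from functools import wraps


class InjectedFailure(Exception):
    """Raised when the experiment deliberately fails a call."""


def inject_faults(failure_rate=0.0, delay_seconds=0.0, rng=None, sleep=time.sleep):
    """Wrap a function so calls can be slowed down or made to fail.

    failure_rate: probability (0.0 to 1.0) that a call raises InjectedFailure.
    delay_seconds: extra latency added before every call.
    rng and sleep are injectable so the behaviour can be tested deterministically.
    """
    rng = rng or random.Random()

    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            # "Slowing something down": add latency before the real call.
            if delay_seconds > 0:
                sleep(delay_seconds)
            # "Taking something down": fail a share of calls outright.
            if rng.random() < failure_rate:
                raise InjectedFailure(f"injected failure in {func.__name__}")
            return func(*args, **kwargs)

        return wrapper

    return decorator


def lookup_price(item):
    """Hypothetical dependency that the experiment targets."""
    return {"apple": 1.25}.get(item, 0.0)


# Example: a dependency that is 200 ms slower and fails about 10% of the time.
flaky_lookup = inject_faults(failure_rate=0.1, delay_seconds=0.2)(lookup_price)
```

The point of such a wrapper is to observe how the callers of `flaky_lookup` behave, for example whether they retry, time out, or fall back gracefully.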
A central benefit of chaos engineering, according to Williams, is that organizations can use it to identify vulnerabilities before a hacker does or before a system failure occurs. In addition, changes made as a result of chaos engineering testing bolster confidence in an organization’s systems. With the rise in cyber threats, businesses must ensure their physical resilience and the resilience of their IT systems, stressed Williams.
Chaos engineering is significant in complex computing environments since these systems can break when unexpected situations occur.
Williams mentioned that business people are not always open to the idea of “experimenting on production,” a crucial part of chaos engineering. Yet, there is “vast potential” for organizations if they leverage chaos engineering.
Shifting the focus to practical steps organizations can take to implement forms of chaos engineering, Williams said the first is to “start small.” Additionally, “start in test, start on a schedule,” she said. Organizations should also “build up to production.”
Despite the benefits that chaos engineering introduces, there are challenges. Williams told the audience to beware of the blast radius and cascading failures. Additionally, organizations that want to experiment less frequently “are more likely to slip back into bad practices.”
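One way to picture keeping the blast radius in check is to cap how much of a fleet an experiment may touch and to define an abort condition up front. The sketch below is an assumption-laden illustration rather than anything presented in the session: `pick_targets`, `should_abort`, the 10% cap, and the 5-point tolerance are all hypothetical choices.

```python
import math
import random


def pick_targets(hosts, max_fraction=0.1, rng=None):
    """Randomly choose experiment targets, limited to a fraction of the fleet."""
    if not 0 < max_fraction <= 1:
        raise ValueError("max_fraction must be in (0, 1]")
    rng = rng or random.Random()
    hosts = list(hosts)
    if not hosts:
        return []
    # Round down to stay within the cap, but pick at least one host so the
    # experiment still runs on small fleets (which can exceed the fraction).
    limit = max(1, math.floor(len(hosts) * max_fraction))
    return rng.sample(hosts, limit)


def should_abort(error_rate, baseline, tolerance=0.05):
    """Stop the experiment if errors rise more than `tolerance` above baseline."""
    return error_rate > baseline + tolerance


# Example: at most 10% of a 20-host fleet is included in the experiment.
fleet = [f"host-{i}" for i in range(20)]
targets = pick_targets(fleet, max_fraction=0.1)
```

Leaving the choice of hosts to a random number generator, rather than to a person, also avoids the habit of always testing the same familiar machines.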
Wrapping up the session, an audience member asked what role AI and automation can play in chaos engineering. Williams pointed to the role AI can play in helping track the experiments organizations are performing. “Humans aren’t good at randomness,” she stressed. “Machine learning can help chaos engineers track and operate in different ways.” Crucially, ML can also help analyze systems to find problems.