Chaos engineering (CE) is the discipline of experimenting on a system in order to build confidence in the system’s capability to withstand turbulent conditions in production. This approach is becoming commonplace in DevOps practices; but how would its application extend to IT operations?
In truth, CE for IT operations offers a similar framework for stress-testing a technology platform to understand its weak points and performance pitfalls under heavy pressure.
CE tends to be used primarily in DevOps during bug testing: setting up experiments to run software under different conditions, such as peak traffic, and monitoring how it functions and performs. This becomes increasingly necessary in cloud-based systems where failure to understand extreme load responses could result in runaway cascade failures or, worse yet, spinning up thousands of extra nodes handling error conditions while not doing any actual work.
These same principles, applied to IT operations management (ITOM), help define a functional baseline and tolerances for infrastructure, policies, and processes by clarifying both steady-state and chaotic outputs when extremes are reached.
Applications in IT
The theory of CE in DevOps gained early traction at Netflix as they moved from physical to virtual infrastructure, with the team that implemented it on AWS breaking off to form Gremlin. However, chaos engineering is not typically used in IT operations, because ITOM has historically been separated from development (generally, IT monitors system dynamics, and when a problem occurs, engineering change management or ITSM is brought in to remediate the issue).
With the growth of containerisation in cloud applications today, IT infrastructure looks more like development environments than classical multi-tier architectures. But the limitless scale of the cloud means failures can also be limitless: microservices are well-served by testing elasticity and scalability, data flows, and resiliency through stressing the system to the edge of its tolerances and fixing their shortcomings before a public crash.
1, 2, 3… chaos
Implementing chaos engineering for IT operations management provides a systematic approach to identifying weaknesses in a microservices-world. In a monolithic environment, you have visibility into performance and event metrics that may be lost with microservices designs. As a result, the need for operational insights becomes even more critical when scaling to unknown workloads.
Netflix’s Chaos Monkey grew out of CE principles from their own cloud-native community, meant to address the gaps in common dev tools’ abilities to manage extreme complexities. This methodology is extendable to infrastructure and helps to set guardrails on platform behavior as a whole. So how should a team bring this thinking into their IT operations management? Follow these five fundamental steps:
- Define the current steady state: Performing baseline analysis is a standard concept in capacity planning, upgrade strategies, and other high-impact functions. Start with something relatively simple (and small) so that you don’t get overwhelmed by the data, or risk interfering with the business if something goes wrong (such as security Red Teaming). For example, monitoring CPU and network utilisation, which are common bottlenecks in any IT shop
- Define optimal conditions. There’s how your system generally operates, and then there’s how it should operate; these typically aren’t the same thing. CPU utilisation and network latency are always affected by application efficiencies, hardware conditions, and a host of other factors. Create a standard that outlines what engineers should expect on a normal day, on an easy day, and on a very hard day. These are the control groups, and the extreme day will be the stress test
- Form a hypothesis. Where will the system break? If you’re running an application scenario such as doubling the peak traffic that even your worst day so far has seen, will your CPU maintain optimum utilisation (or will the container provisioning engine smoothly deploy additional nodes) as in the variable control groups, or will it spike so severely that processes grind to a halt because there isn’t enough memory or network bandwidth left to manage the load?
- Execute a real-world event (but contain the blast radius). Do something extreme, like taking down a firewall that severs connectivity to one internet service provider. This will confuse the application as it tries to respond to requests with repeated failures, ramping up CPU processes as errors return from a dead network endpoint. Log events will mount, filling the database and saturating the backbone
- Validate the hypothesis. What happened? Monitor utilisation and network throughput during the test and see where the system fell over. Is it what you expected, or did something never previously considered take place? Did new chaos erupt from the fissures in your infrastructure? Stabilise, document, and remediate
Never stop not being afraid
Stressing a system to its absolute max—and a little bit further—to see where things go wrong allows you to understand steady-state behavior and error-handling, so you can fix it before something breaks in new and unexpected ways. What do traffic spikes look like? What are real-world events and their impacts on your organisation?
Chaos engineering is not just for DevOps. It should be a systemic practice for load-testing (out of your comfort zone) to the point of failure. It’s a responsibility for more than microservices deployments and applies to all sorts of disciplines within the IT organisation.
Interested in hearing industry leaders discuss subjects like this and sharing their experiences and use-cases? Attend the Cyber Security & Cloud Expo World Series with upcoming events in Silicon Valley, London and Amsterdam to learn more.