We modeled the Cybersecurity Canon after the Baseball Hall of Fame and the Rock & Roll Hall of Fame, except that it honors cybersecurity books. We have more than 25 books on the initial candidate list, but we are soliciting help from the cybersecurity community to increase the number. Please write a review, and nominate your favorite.
The Cybersecurity Canon is a real thing for our community. We have designed it so that you can directly participate in the process. Please do so!
I would recommend Chaos Engineering for the Cybersecurity Canon.
The concept of distributed computing is not new; it dates back at least to ARPANET (the Advanced Research Projects Agency Network) in the 1960s. What is different today is the shift toward microservices-based software development, auto scaling, highly decoupled architectures, agile teams, and service-oriented architectures. Under this distributed workload, your system needs to be resilient to service failures and network latency spikes. Additionally, distributed systems at scale become too complex for any one architect to understand, and relying on basic testing principles is not sufficient.
The authors of Chaos Engineering make a compelling case that once you have figured out service failures and network latency, the next step is to design and implement carefully planned experiments in production environments, either to gain new knowledge about the underlying state of the system or to open new avenues of exploration for your teams. In that respect, chaos engineering diverges from testing, since testing is often conducted earlier in the development life cycle and answers binary questions about the existing state. Chaos engineering generates new answers about how the system reacts to a wide range of experiments, such as a region-wide outage, latency between services causing system-wide outages, and function-based chaos, among others.
My overall sense after reading the book is that chaos engineering is a nascent field, and for enterprises struggling with basics like automated monitoring and building resilient systems, chaos engineering will not help. It is ideal for organizations that have conquered the common use cases of distributed architecture complexity. The other thought I have is that, in some ways, chaos engineering is similar to deep penetration testing or red team testing by security teams because previously unknown threat vectors are being discovered, and experiments in production are based on the hypothesis of exposing vulnerabilities within the state of the security system.
The difference between testing and chaos engineering:
Who should undertake it and conditions for success:
Optimization in distributed systems: performance, availability, and fault tolerance
Velocity of feature development - describes the speed with which engineers can provide new, innovative features to customers
Operating under a microservices architecture results in higher feature velocity at the expense of coordination. Chaos engineering comes into play here by supporting high velocity, experimentation, and confidence in teams and systems through resiliency verification.
A distributed system at sufficient scale becomes too complex for any one human to understand, which reduces the need for architects who hold the master plan. Ignore comprehensibility as a design principle: the system as a whole should make sense, but subsections don't have to.
Request / response chaos - spaghetti call graph and the chaos inherent - classical testing is insufficient since it can only tell us when an assertion is true or false. We need to discover new properties
For example, consider the "bullwhip effect" from systems theory, in which a small perturbation in input starts a self-reinforcing cycle that causes a dramatic swing in output. In this case, the swing in output ends up taking down the app. Each microservice may behave rationally on its own, yet taken together under specific circumstances they can produce undesirable system behavior.
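To make the self-reinforcing cycle concrete, here is a toy model of my own, not taken from the book. It assumes a single service with a fixed per-tick capacity, where every request that fails is retried a fixed number of times on the next tick. The function name and all parameters are hypothetical, and the numbers are illustrative only.

```python
from typing import List


def simulate_retry_feedback(
    base_load: int,
    capacity: int,
    spike: int,
    retries_per_failure: int = 2,
    ticks: int = 6,
) -> List[int]:
    """Return the offered load per tick for a toy retry-feedback loop.

    Assumptions (illustrative only): the service serves at most `capacity`
    requests per tick, and each failed request comes back as
    `retries_per_failure` new requests on the next tick.
    """
    backlog = 0  # retries carried over from the previous tick
    history = []
    for tick in range(ticks):
        # The one-off perturbation only hits the first tick.
        offered = base_load + backlog + (spike if tick == 0 else 0)
        failed = max(0, offered - capacity)
        backlog = failed * retries_per_failure
        history.append(offered)
    return history


# A spike of 8 requests is absorbed and the load settles back to baseline.
print(simulate_retry_feedback(base_load=95, capacity=100, spike=8))
# A spike of 11 requests tips the loop into runaway growth.
print(simulate_retry_feedback(base_load=95, capacity=100, spike=11))
```

Each component here follows a locally sensible rule (retry on failure), yet a slightly larger spike is enough to move the whole system from recovery to runaway load. That is the kind of property an assertion-style test on any single service would not reveal.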
Principles of Chaos: Chaos engineering as an experimental discipline
How would your system react if we injected chaos into it? We need an empirical approach to system behavior since a theoretical approach doesn’t exist. For example, failure injection testing (FIT) adds a failure scenario to the request header of a class of requests at the edge of service. As these requests propagate, injection points between microservices check for the failure scenario and take action based on it.
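The header-propagation idea can be sketched in a few lines. This is a minimal illustration under my own assumptions, not the implementation the book describes: the header name `x-fit-scenario`, the `target:action:delay_ms` encoding, and the service names are all hypothetical, and the delay is only reported rather than actually slept.

```python
from dataclasses import dataclass
from typing import Dict, List

FIT_HEADER = "x-fit-scenario"  # hypothetical header name


@dataclass(frozen=True)
class FailureScenario:
    target: str        # injection point (service) where the failure applies
    action: str        # "fail" or "delay"
    delay_ms: int = 0  # only meaningful for "delay"


def encode(scenario: FailureScenario) -> str:
    """Serialize a scenario into a header value (hypothetical format)."""
    return f"{scenario.target}:{scenario.action}:{scenario.delay_ms}"


def decode(value: str) -> FailureScenario:
    target, action, delay_ms = value.split(":")
    return FailureScenario(target, action, int(delay_ms))


class InjectedFailure(Exception):
    """Raised when an injection point is told to fail the request."""


def injection_point(service: str, headers: Dict[str, str]) -> int:
    """Check the request for a scenario aimed at `service`.

    Returns the simulated extra latency in ms, or raises InjectedFailure.
    """
    raw = headers.get(FIT_HEADER)
    if raw is None:
        return 0  # ordinary request, no chaos requested
    scenario = decode(raw)
    if scenario.target != service:
        return 0  # scenario is meant for a different service
    if scenario.action == "fail":
        raise InjectedFailure(service)
    if scenario.action == "delay":
        return scenario.delay_ms
    return 0


def call_chain(services: List[str], headers: Dict[str, str]) -> List[str]:
    """Propagate one request through the services and record what happened."""
    trace = []
    for service in services:
        try:
            delay = injection_point(service, headers)
        except InjectedFailure:
            trace.append(f"{service}:failed")
            break  # downstream services never see the request
        trace.append(f"{service}:ok+{delay}ms")
    return trace


headers = {FIT_HEADER: encode(FailureScenario("ratings", "fail"))}
print(call_chain(["edge", "ratings", "billing"], headers))
```

Because the scenario travels with the request, only the tagged class of requests is affected, which is one way to keep the blast radius of an experiment small.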
Chaos in practice
There are limited implementation examples of Chaos Engineering, though some great ones are shared, namely from financial, business to commerce, and several large B2B organizations. Use the disciplined approach of picking a hypothesis, choosing the scope, identifying the metrics, informing the organization, running the experiment, analyzing the results, increasing the scope, and then repeating the cycle.
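As a rough sketch of that cycle, the loop below runs one experiment at progressively wider scopes and stops as soon as the hypothesis fails. Everything in it is an assumption of mine for illustration: the hypothesis is phrased as "the error rate stays at or below a threshold", scope is a percentage of traffic, and `measure` is a hypothetical stand-in for whatever actually injects the failure and reads the metric. Informing the organization happens outside the code.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class ExperimentResult:
    scope_percent: int     # share of traffic exposed to the experiment
    metric_value: float    # observed value of the chosen metric
    hypothesis_held: bool  # True if the metric stayed within the threshold


def run_with_growing_scope(
    measure: Callable[[int], float],
    threshold: float,
    scopes: Sequence[int] = (1, 5, 25),
) -> List[ExperimentResult]:
    """Run the experiment at each scope, smallest first.

    `measure(scope)` is a hypothetical callable that runs the experiment
    on `scope` percent of traffic and returns the observed error rate.
    """
    results = []
    for scope in scopes:
        value = measure(scope)
        held = value <= threshold
        results.append(ExperimentResult(scope, value, held))
        if not held:
            # Stop and analyze before widening the blast radius further.
            break
    return results


# Fake measurements standing in for a real experiment run.
fake_error_rates = {1: 0.001, 5: 0.004, 25: 0.03}
for result in run_with_growing_scope(fake_error_rates.__getitem__, threshold=0.01):
    print(result)
```

Starting small and widening only after the hypothesis holds mirrors the "increase scope, then repeat" step, and the early exit keeps a failed experiment from spreading further than necessary.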
Finally, if you become sophisticated in the practice of chaos engineering, you can start measuring the maturity of your chaos experiments by using sophistication and adoption metrics. Sophistication measures the validity and safety of chaos experiments and can be measured as elementary, simple, sophisticated or advanced. Adoption measures the depth and breadth of chaos experimentation coverage and can be measured on a maturity model of: in the shadows; investment; adoption and cultural expectation.
Chaos engineering is still a very young field; its adoption is nascent, and it has its share of critics. It is ultimately a means to an end, with the aim of hardening the production environment against real-world issues. While doing it in combination with proactive failure testing and post-incident reviews is beneficial, chaos engineering can wreak havoc on state systems in production if experiments are not carefully constructed, and the damage is very difficult to roll back.
Chaos Engineering: Building Confidence in System Behavior through Experiments is an easy read, and the parallels to penetration testing conducted by red teams are striking. It is somewhat light on the details of how to build carefully crafted experiments; therefore, I would recommend further reading on Chaos Monkey, failure injection testing, etc. In the crawl, walk, run phases of enterprises moving from monolithic applications to a microservices-based architecture, chaos engineering represents activity conducted in the run phase. If possible, enterprises should identify the experiments with the lowest effort and the smallest blast radius as a set of training wheels for learning the practice of chaos engineering.