Would Your Team Work With the Chaos Monkey?

7. August 2017
Kategorien
Newsletter abonnieren

Would your team work with the chaos monkey?
Advances in large-scale, distributed software systems are changing the game for software engineering. As an industry, we are quick to adopt practices that increase flexibility of development and velocity of deployment. An urgent question follows on the heels of these benefits: How much confidence we can have in the complex systems that we put into production?

Chaos Engineering

Even when all of the individual services in a distributed system are functioning properly, the interactions between those services can cause unpredictable outcomes. Unpredictable outcomes, compounded by rare but disruptive real-world events that affect production environments, make these distributed systems inherently chaotic.
We need to identify weaknesses before they manifest in system-wide, aberrant behaviors. Systemic weaknesses could take the form of: improper fallback settings when a service is unavailable; retry storms from improperly tuned timeouts; outages when a downstream dependency receives too much traffic; cascading failures when a single point of failure crashes; etc.  We must address the most significant weaknesses proactively, before they affect our customers in production.  We need a way to manage the chaos inherent in these systems, take advantage of increasing flexibility and velocity, and have confidence in our production deployments despite the complexity that they represent.
An empirical, systems-based approach addresses the chaos in distributed systems at scale and builds confidence in the ability of those systems to withstand realistic conditions. We learn about the behavior of a distributed system by observing it during a controlled experiment. This is called Chaos Engineering

Chaos Monkey

Chaos Engineering was the philosophy when Netflix built Chaos Monkey, a tool that randomly disables Amazon Web Services (AWS) production instances to make sure you can survive this common type of failure without any customer impact. The name comes from the idea of unleashing a wild monkey with a weapon in your data center (or cloud region) to randomly shoot down instances and chew through cables — all the while you continue serving your customers without interruption. 
By running Chaos Monkey in the middle of a business day, in a carefully monitored environment with engineers standing by to address any problems, you can still learn the lessons about the weaknesses of your system, and build automatic recovery mechanisms to deal with them. So next time an instance fails at 3 am on a Sunday, you won’t even notice.
Chaos Monkey has a configurable schedule that allows simulated failures to occur at times when they can be closely monitored.  In this way, it’s possible to prepare for major unexpected errors rather than just waiting for catastrophe to strike and seeing how well you can manage.
Chaos Monkey was the original member of Netflix’s Simian Army, a collection of software tools designed to test the AWS infrastructure. The software is open source (GitHub) to allow other cloud services users to adapt it for their use. Other Simian Army members have been added to create failures and check for abnormal conditions, configurations and security issues. 
An Agile and DevOps engineering culture doesn’t have a mechanism to force engineers to architect their code in any specific way. Instead, you can build strong alignment around resiliency by taking the pain of disappearing servers and bringing that pain forward. 
Most people think this is a crazy idea, but you can’t depend on the infrequent occurrence of outages to impact behavior. Knowing that this will happen on a frequent basis creates strong alignment among engineers to build in the redundancy and automation to survive this type of incident without any impact on your customers.

Would your team be willing to implement their own Chaos Monkey?

Tags

Das könnte Sie auch interessieren

The Five Elements of a Strong Governance Structure for Critical Projects

16. Januar 2025

Every executive has nightmares about that project—the one that spirals into an unmitigated disaster.  In general there are four ways a project can end up in a boardroom-shaking failure that can destroy value, reputations, and trust in one fell swoop. 1. The Titanic Failure: The project chugs along, oblivious to the iceberg ahead, burning millions

Weiterlesen

Why Every Critical Project Needs Independent Reviews

14. Januar 2025

«Trust, but verify.» That timeless adage applies as much to critical projects as it does to diplomacy. Without an independent review, even the best-run projects can veer off course, leaving organizations blindsided by delays, cost overruns, or outright failures. Here’s the uncomfortable truth: internal stakeholders are often too close to the project to see the

Weiterlesen

Why Every Critical Project Needs an Executive Sponsor

13. Januar 2025

Launching a critical project without an executive sponsor is like sending a ship to sea without a captain—good luck steering through the storm. Projects don’t fail because of bad intentions. They fail because of a lack of alignment, authority, and support.  That’s where the executive sponsor steps in—not just as a figurehead but as the

Weiterlesen

Why Every Critical Project Needs a Dedicated Project Manager

12. Januar 2025

Far too often, organizations assign critical projects to people who already have full-time roles or, worse, delegate management to a loosely organized team with no single point of accountability. The results? Missed deadlines, blown budgets, and a whole lot of finger-pointing. Here’s the hard truth: if the project is important, it deserves a dedicated project

Weiterlesen

Case Study 21: The Australian Securities Exchange (ASX) $250 Million CHESS Blunder

6. Januar 2025

The Australian Securities Exchange (ASX) embarked on an ambitious journey to replace its 25-year-old Clearing House Electronic Subregister System (CHESS) with a state-of-the-art, blockchain-based platform.  Initially envisioned as a groundbreaking project to enhance efficiency, security, and scalability, the CHESS replacement project quickly turned into a cautionary tale.  The initiative faced repeated delays and escalating costs

Weiterlesen

Project Recovery

2. Januar 2025

  Projects fail for a variety of reasons. Especially technology projects have a low success rate. Typically more than half of them are considered a failure. If your current in-house or outsourced software or web development project is off track, chances are I can bring the necessary input and expertise to get the job done. Troubled projects

Weiterlesen

When $100 Million Technology Projects Fail, It’s the Board’s Fault—Every Single Time

2. Januar 2025

In Switzerland, rumors suggest that both Bank Julius Bär and Raiffeisen Schweiz are grappling with failed technology projects, each costing over $100 million so far. Bank Julius Bär is reportedly trying to replace its existing core banking system for the Swiss booking center with Temenos, while Raiffeisen Schweiz is attempting to build a modern e-banking

Weiterlesen

10 Essential Questions Every Board Should Ask About Technology

16. Dezember 2024

Board members play an important role in steering organizations through the complexities of technology initiatives.  To fulfil this role effectively, it’s essential to ask the right questions that probe the strategic, operational, and risk aspects of technology projects.  Here are ten critical questions every board should consider: 1) How does this technology initiative align with

Weiterlesen

Independent Board Advisory

16. Dezember 2024

Effective boards provide clarity, governance, and oversight to steer organizations toward success. However, large technology initiatives, digital transformations, and innovation efforts often challenge even the most seasoned boards.  My Board Advisory service empowers boards and board members to navigate the complexities of modern technology decisions with confidence and precision. As a trusted advisor and experienced

Weiterlesen

Case Study 20: The $4 Billion AI Failure of IBM Watson for Oncology

7. Dezember 2024

In 2011, IBM’s Watson took the world by storm when it won the television game show Jeopardy!, showcasing the power of artificial intelligence (AI). Emboldened by this success, IBM sought to extend Watson’s capabilities beyond trivia to address real-world challenges.    Healthcare, with its complex data and critical decision-making needs, became a primary focus. Among

Weiterlesen
Next