CHAOS DRIVEN DEVELOPMENT
Future Insights Live 2015, Las Vegas
Bruce Wong
A LITTLE ABOUT ME
• Founder of Chaos
Engineering @ Netflix
• Computer Science
Background
• Multiple roles scaling Netflix
from 8M to 60M+ subscribers
• Currently Taking a Break
@bruce_m_wong
Most enterprises hire people to fix things. Netflix hires
people to break things….
…we should embrace Netflix's culture of "chaos engineering"
throughout organizations of all shapes and sizes.
http://readwrite.com/2014/09/17/netflix-chaos-engineering-for-everyone
http://www.techrepublic.com/article/serious-about-cloud-it-might-be-time-to-look-into-chaos-engineering/
https://gigaom.com/2014/09/11/netflixs-new-chaos-engineering-push-aims-to-hire-staff-to-help-break-its-cloud-based-system/
http://www.cnbc.com/id/102394893
CHAOS DEFINED
“If it ain’t broke don’t fix it”
-Bert Lance, Nation’s Business 1977
If it ain’t broke, try harder
-chaos philosophy
CHAOS DEFINED
Intentionally introducing failure into a system
with the purpose of validating resilience design.
WHY CHAOS?
Failure happens.
WHY CHAOS?
•Hardware fails
•Power outages
•Software has bugs
•Human error
•Natural disasters
http://money.cnn.com/2012/10/30/technology/netflix-hurricane-sandy/
http://www.pcworld.com/article/2691772/how-netflix-survived-the-amazon-ec2-reboot.html
https://gigaom.com/2014/10/03/netflix-lost-218-database-servers-during-aws-reboot-and-stayed-online/
BLUE MOONS
"Once in a blue moon" will eventually happen
FAULT-TOLERANT DESIGN PRINCIPLES
• Eliminate Single Points of Failure
• Allow parts of the system to fail independently
(Failure Isolation)
• Prevent propagation (Failure Containment)
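A minimal sketch of failure isolation and containment, assuming a page that depends on a non-critical service. The names `call_with_fallback`, `fetch_recommendations`, and `default_recommendations` are hypothetical and chosen only for illustration; they do not come from the talk or from any Netflix library.

```python
from typing import Callable, List, TypeVar

T = TypeVar("T")


def call_with_fallback(primary: Callable[[], T], fallback: Callable[[], T]) -> T:
    """Run primary; if it raises, contain the failure and serve the fallback."""
    try:
        return primary()
    except Exception:
        # Failure containment: the error stops here instead of
        # propagating to the caller and taking the whole page down.
        return fallback()


def fetch_recommendations() -> List[str]:
    # Hypothetical non-critical dependency that is currently failing.
    raise ConnectionError("recommendation service unavailable")


def default_recommendations() -> List[str]:
    # Degraded but still useful response served while the dependency is down.
    return ["popular-title-1", "popular-title-2"]


page = call_with_fallback(fetch_recommendations, default_recommendations)
```

The dependency fails on its own, and the rest of the request still gets an answer. A chaos exercise would then check that this fallback path really works in production.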
START WITH
CONSEQUENCES
Chaos Driven Development
MINIMUM VIABLE PRODUCT
• Understand your users
• Understand your value proposition
• Understand your business
PRIORITIZE
• Many aspects and features are important
• Each has different consequences for not working
• A product’s value proposition is what drives your
business
DESIGN FOR
FAILURE
What failure isolation might
look like
APPLYING
CHAOS
Validation of fault-tolerant
design
BREAKING THE CONNECTION
How confident are you?
- Next week?
- Next month?
- After that "quick patch"?
WHAT DOES CHAOS LOOK
LIKE?
• Types - errors, latency
• Duration - how long?
• Intensity - how much?
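The three dimensions above (type, duration, intensity) could be captured in a small experiment description. This is a sketch under stated assumptions: `FaultType` and `ChaosExperiment` are hypothetical names, and intensity is assumed here to mean the fraction of requests affected, which is only one possible reading of "how much".

```python
from dataclasses import dataclass
from enum import Enum


class FaultType(Enum):
    ERROR = "error"
    LATENCY = "latency"


@dataclass(frozen=True)
class ChaosExperiment:
    fault_type: FaultType  # type: errors or latency
    duration_s: float      # duration: how long the fault is injected
    intensity: float       # intensity: assumed fraction of requests, 0.0 to 1.0

    def __post_init__(self) -> None:
        if self.duration_s <= 0:
            raise ValueError("duration_s must be positive")
        if not 0.0 <= self.intensity <= 1.0:
            raise ValueError("intensity must be between 0.0 and 1.0")

    def is_active(self, started_at: float, now: float) -> bool:
        """True while the experiment window is still open."""
        return 0 <= now - started_at < self.duration_s


# HTTP 500 for 1% of requests for 1 minute, expressed as data.
experiment = ChaosExperiment(FaultType.ERROR, duration_s=60.0, intensity=0.01)
```

Describing an experiment as data makes it easy to review before it runs and to repeat later with the same parameters.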
WHAT DOES CHAOS LOOK
LIKE?
• Return errors for a percentage of requests
• e.g. return HTTP 500 for 1% of requests for 1 minute
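One way the error example above might be wired in is sketched below. It assumes request handlers are plain callables that return a `(status, body)` tuple; `make_error_injector` is a hypothetical helper, not an API from any real chaos tool.

```python
import random
import time


def make_error_injector(error_rate, duration_s, rng=None, clock=time.monotonic):
    """Build a wrapper that fails a fraction of calls until the window closes."""
    if rng is None:
        rng = random.Random()
    deadline = clock() + duration_s

    def maybe_fail(handler):
        # Inject only while the experiment window is open, and only for
        # roughly error_rate of the calls.
        if clock() < deadline and rng.random() < error_rate:
            return 500, "injected failure"
        return handler()

    return maybe_fail


# HTTP 500 for about 1% of requests for 1 minute.
inject = make_error_injector(error_rate=0.01, duration_s=60.0)
status, body = inject(lambda: (200, "ok"))
```

The deadline is what keeps the experiment bounded: once it passes, every request goes straight to the real handler again.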
WHAT DOES CHAOS LOOK
LIKE?
• Make it slow(er) - introduce latency
• e.g. sleep for 10 ms on every request for 1 minute
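The latency example above could be sketched the same way. `make_latency_injector` is a hypothetical helper; the `clock` and `sleep` parameters are assumptions added so the behavior can be checked without actually waiting.

```python
import time


def make_latency_injector(delay_s, duration_s, clock=time.monotonic, sleep=time.sleep):
    """Build a wrapper that delays every call until the window closes."""
    deadline = clock() + duration_s

    def with_latency(handler):
        # Every request is slowed down while the experiment window is open.
        if clock() < deadline:
            sleep(delay_s)
        return handler()

    return with_latency


# Sleep for 10 ms on every request for 1 minute.
slow = make_latency_injector(delay_s=0.010, duration_s=60.0)
result = slow(lambda: "ok")
```

Unlike the error case, the request still succeeds, so this exercises timeouts and slow-dependency handling in the callers.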
WHAT DOES CHAOS LOOK
LIKE?
Gradually increase
• e.g. sleep for 10 ms on every request for 1
minute
• then sleep for 100 ms on every request for 3
minutes
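The gradual increase above can be expressed as a list of stages. In this sketch, `delay_at` is a hypothetical helper that looks up which delay applies at a given point in the experiment; it is an illustration, not the speaker's tooling.

```python
from typing import List, Optional, Tuple

# Each stage is (delay in seconds, stage length in seconds).
Stage = Tuple[float, float]


def delay_at(stages: List[Stage], elapsed_s: float) -> Optional[float]:
    """Return the delay for the stage active at elapsed_s, or None when done."""
    if elapsed_s < 0:
        return None
    stage_start = 0.0
    for delay_s, length_s in stages:
        if elapsed_s < stage_start + length_s:
            return delay_s
        stage_start += length_s
    return None  # all stages finished, no more injection


# 10 ms for 1 minute, then 100 ms for 3 minutes.
ramp = [(0.010, 60.0), (0.100, 180.0)]
first_stage_delay = delay_at(ramp, 30.0)
second_stage_delay = delay_at(ramp, 120.0)
```

Starting with the mild stage means a broken fallback shows up while the impact is still small, before the stronger stage begins.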
WHAT DOES CHAOS LOOK
LIKE?
The design/implementation worked!
• microscopic impact, high confidence
What if it didn’t work?
• smaller impact than an outage
• proactively fix it and try again
WHAT DOES AN OUTAGE LOOK
LIKE?
• Detection takes time (TTD)
• Analysis takes time
• Resolution takes time (TTR)
• Inconvenient times
CHAOS VS OUTAGE
Chaos
• Controlled
• Planned
• Intentional
• Microscopic user impact
Outages
• Uncontrolled
• Unpredictable
• Unintended
• Large impact
WHAT ABOUT TESTING?
• Testing is good - do it, automate it
• Great testing disciplines can find most
functional bugs, but they tend to miss problems with:
• Scale, traffic and capacity
• System misconfiguration and design limitations
LESSONS LEARNED
• Learn more from chaos exercises than outages
• Fixing a failure mode will uncover new ones
• Configuration is often overlooked
• Tools can break
WHY IS THIS
HARD?
WHAT MAKES RESILIENCE
DESIGN HARD?
• Product and Engineering Decision
• Tradeoffs are difficult
• Organizational Silos
ORGANIZATIONAL SILOS
• Services by Domain
• Dev/Ops/Product
• Incomplete context
WHAT MAKES CHAOS HARD?
In addition to the technical challenges
• Organizations rarely incentivize people to try to
break production
• Misconceptions about complex systems and scale
TAKEAWAYS
• What are the consequences?
• Start small, start early
• Work together - share context
• Validate, don’t assume
QUESTIONS?
