#QConSF @ana_m_medina
Chaos Engineering
with Containers
Ana Medina

Chaos Engineer at Gremlin
Watch the video with slide synchronization on InfoQ.com!
https://www.infoq.com/presentations/chaos-engineering-gamedays
Presented at QCon San Francisco
www.qconsf.com
Ana Medina
@ana_m_medina


Chaos Engineer @ Gremlin
Previously Software Engineer / SRE @ Uber.
Also worked/interned @ SFEFCU, Google,
Quicken Loans, Stanford University and
Miami Dade College.
College dropout.
Self-taught engineer.
How many of you have
heard of Chaos
Engineering?
How many of you have run
a Chaos Engineering
experiment?
Chaos Engineering
Thoughtful, planned
experiments designed to reveal
the weaknesses in our systems.
Chaos Engineering
“Inject something harmful to
build an immunity.”
-@KoltonAndrus
Gremlin Founder and CEO
Why?
● Microservices
● Systems are scaling fast
● Downtime is really expensive
● Our dependencies will fail
● Pager fatigue and burnout really hurt
“Chaos Engineering
Without Observability ...
Is Just Chaos”

-@mipsytipsy
Charity Majors
CEO of Honeycomb

Prerequisites for Chaos Engineering
● Monitoring/Observability
● On-Call and Incident Management
● Cost of Downtime Per Hour
Use Cases for Chaos Engineering
● Outage reproduction
● On-call training
● Strengthen new products
● Battle test new infrastructure and
services
Use Cases for Chaos Engineering - Containers
● Testing Provider-Specific Reliability
(e.g., EKS vs AKS vs GKE)
● Auto Scaling
● Logs, Disk failure
Minimize the
Blast radius
Monitoring /
Observability
What to measure and monitor?
● System Metrics: CPU, Disk, I/O
● Availability
● Service specific KPIs
● Customer complaints
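
As a rough illustration of the "System Metrics" bullet, the sketch below takes a point-in-time snapshot of load average and disk usage using only the Python standard library. The function names and the 0.9 disk threshold are assumptions made for this example; a real setup would normally rely on a monitoring agent rather than a hand-rolled script.

```python
import os
import shutil
import time


def snapshot_system_metrics(path="/"):
    """Return a point-in-time snapshot of a few host-level metrics."""
    usage = shutil.disk_usage(path)
    try:
        load_1m, _load_5m, _load_15m = os.getloadavg()
    except (AttributeError, OSError):
        # Load average is not available on every platform.
        load_1m = None
    return {
        "timestamp": time.time(),
        "cpu_count": os.cpu_count(),
        "load_1m": load_1m,
        "disk_used_fraction": usage.used / usage.total,
    }


def disk_is_healthy(snapshot, max_used_fraction=0.9):
    """Hypothetical threshold check; 0.9 is illustrative, not a recommendation."""
    return snapshot["disk_used_fraction"] <= max_used_fraction
```

Taking one snapshot before an experiment and another during it gives a simple baseline to compare against.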
Demo
#1 - Battle Test Cloud Infrastructure
Real World Scenario: a company or user is evaluating cloud-provider-managed
Kubernetes. Which one is more reliable?
The Hypothesis: shutting down a container (1/1) should cause only
a small delay before the app is reachable again
The Experiment: shut down the Kubernetes dashboard
container
Abort Conditions: app is unreachable after 60 seconds
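
One way to make the abort condition measurable is to poll the app and record how long it takes to become reachable again. The sketch below is an assumption about how that check could be scripted, not the tooling used in the demo; `http_probe` and `seconds_until_reachable` are hypothetical names, and the URL passed in would be whatever endpoint the app exposes.

```python
import time
import urllib.error
import urllib.request


def http_probe(url, timeout_s=2.0):
    """Return True if the URL answers with an HTTP status below 500."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status < 500
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status.
        return err.code < 500
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout, and similar.
        return False


def seconds_until_reachable(probe, abort_after_s=60.0, interval_s=1.0,
                            clock=time.monotonic, sleep=time.sleep):
    """Poll probe() until it returns True.

    Returns the elapsed seconds, or None when the abort condition
    (still unreachable after abort_after_s) is hit.
    """
    start = clock()
    while True:
        if probe():
            return clock() - start
        if clock() - start >= abort_after_s:
            return None
        sleep(interval_s)
```

A call such as `seconds_until_reachable(lambda: http_probe(url))` returns the recovery delay to compare across providers, and `None` signals that the experiment should be halted.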
#2 - Shutdown of a Container
Real World Scenario: a company or user is evaluating
containers. Are they as reliable as promised?
The Hypothesis: yes, they will come back up
The Experiment: shut down the container, wait a few
seconds, and check whether it is up
Abort Conditions: app is unreachable after 60 seconds
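
The experiment above can be approximated by hand with the Docker CLI. The sketch below uses `docker kill` as a stand-in for the shutdown attack and `docker inspect` to check the container state; this is an assumption for illustration, not how the demo tooling works. The helper names and the injectable `run` parameter are hypothetical, and the script restarts nothing itself: whether the container comes back depends on whatever supervises it.

```python
import subprocess


def kill_container(name, run=subprocess.run):
    """Stop a container abruptly with `docker kill` (sends SIGKILL by default)."""
    run(["docker", "kill", name], check=True, capture_output=True, text=True)


def container_is_running(name, run=subprocess.run):
    """Ask Docker whether the named container is currently running."""
    result = run(
        ["docker", "inspect", "--format", "{{.State.Running}}", name],
        check=False, capture_output=True, text=True,
    )
    # A non-zero exit code means the container does not exist.
    return result.returncode == 0 and result.stdout.strip() == "true"
```

Passing a fake `run` function makes it possible to rehearse the experiment logic without touching a real Docker daemon.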
#3 - Blackholing traffic to Catalog
Real World Scenario: a company or user is working with their UI
team to provide a good user experience when there are API/DB
issues
The Hypothesis: images will not load, but the product listing will
The Experiment: blackhole all traffic from the front end to
the REST API and DB ports
Abort Conditions: app is unreachable after 60 seconds
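
For readers who want to try a blackhole by hand, dropping outbound TCP traffic to specific ports with `iptables` has a similar effect. The sketch below only builds the commands and defaults to a dry run; it is an assumption about one possible manual approach, not the mechanism used in the demo. The ports 8080 and 5432 in the usage note are hypothetical, since the slides do not name them, and the `remove` action exists so the experiment can be rolled back quickly.

```python
import subprocess


def blackhole_commands(ports, action="add"):
    """Build iptables commands that drop outbound TCP traffic to the given ports.

    action="add" appends the DROP rules; action="remove" deletes them again.
    """
    flag = {"add": "-A", "remove": "-D"}[action]
    return [
        ["iptables", flag, "OUTPUT", "-p", "tcp", "--dport", str(port), "-j", "DROP"]
        for port in ports
    ]


def apply_commands(commands, dry_run=True, run=subprocess.run):
    """Return the command lines; execute them only when dry_run is False.

    Running iptables for real requires root privileges on the front-end host.
    """
    lines = [" ".join(cmd) for cmd in commands]
    if not dry_run:
        for cmd in commands:
            run(cmd, check=True)
    return lines
```

For example, `apply_commands(blackhole_commands([8080, 5432]))` lists the rules that would be added, and the same call with `action="remove"` lists the rollback.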
Case Study
Companies doing
Chaos Engineering
Tools You Can Use
Gremlin

Chaos Toolkit

Litmus

PowerfulSeal
Break Things Together
bit.ly/chaos-eng-slack

2,000+ members across the world
THANKS!
@ana_m_medina
ana@gremlin.com