SRE
Site Reliability Engineering
Paradigm Shift in Service Management
[Diagram: Dev, Ops, DevOps, SRE, Architecture]
Pallab Sarkar
Why SRE?
Does any of this sound familiar?
Things in Production fail, fail again… and again.
Firefighting over root cause and responsibility when
things break in Production and impact customers
Too many production issues, many of them repetitive;
things break too often; overloaded Support / Ops team
Everyone is busy, but things don't get any
better over time. No single owner for
Reliability / Availability / Allowed Error Rate
Too many escalations and handovers before the
right person gets engaged and starts fixing the problem
Silo culture. Rigid processes and
boundaries delay cross-team resolution
No time to fix Ops issues permanently. Dev is
always busy rolling out new features and
wants fast delivery of features.
SRE Principles & Objectives
SRE is a model where Software Engineers are engaged to run Operations and write
code / use tools to measure, monitor, improve, and automate
operational tasks.
Improve Reliability
SREs are always focused on Reliability. They try to
pinpoint and mitigate points of failure in the
system wherever possible.
Accept Failure
Accept failure as normal, always be ready with
possible remediation, and learn from failures.
Reduce Toil
Toil is tedious, manual, repetitive, boring work. SREs spend around 50% of their time improving the
systems they manage.
Measure & Monitor
Various tools are used for monitoring / alerting. Always keep the customer first in mind:
the alerting mechanism should primarily aim for minimal or zero customer impact.
Set SLI, SLO, Error Budget
Quantitative measures of Availability, Latency, and Error Rate, each with an expected target. Any
SLO breach should have consequences.
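To make the relationship between SLI, SLO, and Error Budget concrete, here is a minimal Python sketch. It assumes a request-based availability SLI (successful requests divided by total requests) over a single evaluation window. The class name `SloWindow`, its fields, and the figures in the usage note are hypothetical and chosen only for illustration; real setups often measure latency or other indicators and use rolling windows.

```python
from dataclasses import dataclass


@dataclass
class SloWindow:
    """Request counts for one SLO evaluation window (hypothetical structure)."""

    total_requests: int
    failed_requests: int
    slo_target: float  # e.g. 0.999 means 99.9% of requests should succeed

    def sli(self) -> float:
        """Measured availability: fraction of requests that succeeded."""
        if self.total_requests == 0:
            return 1.0  # assumption: no traffic counts as fully available
        return 1 - self.failed_requests / self.total_requests

    def error_budget(self) -> float:
        """Number of failed requests the SLO tolerates in this window."""
        return (1 - self.slo_target) * self.total_requests

    def budget_remaining(self) -> float:
        """Failures still allowed before the SLO is breached (negative if overspent)."""
        return self.error_budget() - self.failed_requests

    def slo_breached(self) -> bool:
        """True once failures exceed the error budget."""
        return self.failed_requests > self.error_budget()
```

For example, `SloWindow(total_requests=1_000_000, failed_requests=400, slo_target=0.999)` has a budget of about 1,000 failed requests, of which roughly 600 remain. A team could treat a breach as the "consequence" trigger, for instance pausing feature releases until reliability work restores the budget.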
Automation / Make tomorrow better than today
Automating infrastructure and Ops activities is important for consistency, time savings, and
faster/automatic repair. SREs always set aside time to make tomorrow better than today through
automation, self-service, and toil fixes.