SRE: Site Reliability Engineering

The document discusses site reliability engineering (SRE) as a paradigm shift in service management. SRE aims to improve reliability, accept failure as normal, reduce tedious operational tasks, implement monitoring and metrics, set service level indicators and objectives with error budgets, and focus on automation to continuously improve systems over time.

Uploaded by

Pallab Sarkar

SRE

Site Reliability Engineering


Paradigm Shift in Service Management

[Diagram labels: Dev, Ops, DevOps, SRE, Architecture]

Pallab Sarkar
Why SRE?
Does all of this sound familiar?
Things in production fail, fail again... and again.
Firefighting over root cause and responsibility when
things break in production and impact customers.

Too many production issues, many of them repetitive;
things break too often, and the Support / Ops team is overloaded.

Everyone is busy, but things don't get any
better over time. No single owner for
reliability, availability, or allowed error rate.

Too many escalations and handovers before the
right person is engaged and starts fixing the problem.

Silo culture. Defined processes and
boundaries delay cross-team resolution.

No time to fix Ops issues permanently. Dev is
always busy rolling out new features and
wants fast feature delivery.

SRE Principles & Objectives
SRE is a model where software engineers are engaged to run operations and write
code or use tools to measure, monitor, improve, and automate
operational tasks.

Improve Reliability
SREs are always focused on reliability. They try to
pinpoint and mitigate points of failure in the
system wherever possible.

Accept Failure
Accept failure as normal, always be ready with
possible remediation, and learn from failures.

Reduce Toil
Toil is tedious, manual, repetitive, boring work. SREs spend around 50% of their time
improving the systems they manage.
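
As a rough illustration of the 50% guideline above, the sketch below computes the share of a team's tracked hours that went to improvement work rather than toil. The category names and the function `engineering_share` are hypothetical, chosen only for this example; how a team actually classifies its time is an assumption left to the reader.

```python
def engineering_share(hours_by_category: dict[str, float],
                      engineering_categories: set[str]) -> float:
    """Return the fraction of tracked hours spent on improvement work.

    Any category not listed in engineering_categories is treated as toil.
    """
    total = sum(hours_by_category.values())
    if total <= 0:
        raise ValueError("no hours were tracked")
    engineering = sum(hours for category, hours in hours_by_category.items()
                      if category in engineering_categories)
    return engineering / total


# Hypothetical week for one SRE (category names are illustrative only).
week = {"tickets": 12, "manual_deploys": 8, "automation": 14, "reliability_fixes": 6}
share = engineering_share(week, {"automation", "reliability_fixes"})
print(f"engineering share: {share:.0%}")
```

A share that stays well below one half over several weeks suggests that toil is crowding out improvement work.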

Measure & Monitor


Various tools are used for monitoring and alerting. Always keep the customer in mind first:
alerting should primarily aim for minimal or zero customer impact.
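
One way to keep alerting tied to customer impact is to alert on a customer-facing symptom, such as the failure rate of recent requests, rather than on internal machine metrics. The following is a minimal sketch under that assumption; the class `ErrorRateAlert`, its window size, and its threshold are hypothetical and not taken from any particular monitoring tool.

```python
from collections import deque


class ErrorRateAlert:
    """Fire when the failure rate over the last N requests exceeds a threshold."""

    def __init__(self, window_size: int, threshold: float):
        if window_size <= 0:
            raise ValueError("window_size must be positive")
        self._outcomes = deque(maxlen=window_size)  # True = success, False = failure
        self.threshold = threshold

    def record(self, success: bool) -> bool:
        """Record one request outcome and return True if the alert should fire."""
        self._outcomes.append(success)
        if len(self._outcomes) < self._outcomes.maxlen:
            return False  # not enough data yet to judge the window
        failures = self._outcomes.count(False)
        return failures / len(self._outcomes) > self.threshold


# Illustrative use: alert if more than 20% of the last 10 requests failed.
alert = ErrorRateAlert(window_size=10, threshold=0.2)
for outcome in [True, False] * 5:
    firing = alert.record(outcome)
print("alert firing:", firing)
```

The alert stays silent until the window is full, which avoids paging on the first failed request after a restart.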

Set SLI, SLO, Error Budget


Define quantitative measures (SLIs) of availability, latency, and error rate, and set an expected target (SLO) for each. Any
SLO breach should have consequences.
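
The error budget is the amount of unreliability the SLO still allows: with a target of 99.9% successful requests, 0.1% of requests may fail before the SLO is breached. The sketch below works through that arithmetic for a request-based availability SLI. The function name `error_budget` and the returned field names are assumptions made for this example.

```python
def error_budget(slo_target: float, total_requests: int, failed_requests: int) -> dict:
    """Summarise the SLI and error budget for one measurement window.

    slo_target is the required fraction of successful requests, e.g. 0.999.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    if not 0 <= failed_requests <= total_requests:
        raise ValueError("failed_requests must be between 0 and total_requests")

    sli = (total_requests - failed_requests) / total_requests  # measured availability
    allowed_failures = (1.0 - slo_target) * total_requests     # the error budget
    return {
        "sli": sli,
        "allowed_failures": allowed_failures,
        "remaining_failures": allowed_failures - failed_requests,
        "budget_consumed": failed_requests / allowed_failures,
        "slo_met": sli >= slo_target,
    }


report = error_budget(slo_target=0.999, total_requests=1_000_000, failed_requests=400)
print(f"SLI {report['sli']:.4%}, budget consumed {report['budget_consumed']:.0%}")
```

With a 99.9% target over 1,000,000 requests, the budget is about 1,000 failed requests, so 400 failures consume roughly 40% of it. A negative remaining budget means the SLO is breached and the agreed consequence applies.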

Automation / Make tomorrow better than today


Automating infrastructure and Ops activities is important for consistency, time savings, and
faster or automatic repair. SREs always set aside time to make tomorrow better than today by
automating, implementing self-service, and fixing toil.
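
Automatic repair can be sketched as a small loop: check health, apply a known remediation a bounded number of times, and escalate to a human only if that fails. The function `auto_repair` and the callables passed to it are hypothetical placeholders; a real health check and repair action depend on the system being managed.

```python
from typing import Callable


def auto_repair(is_healthy: Callable[[], bool],
                repair: Callable[[], None],
                max_attempts: int = 3) -> str:
    """Try a known remediation up to max_attempts times before escalating."""
    if is_healthy():
        return "healthy"
    for attempt in range(1, max_attempts + 1):
        repair()
        if is_healthy():
            return f"repaired after {attempt} attempt(s)"
    return "escalate"  # automation could not fix it; page a human


# Illustrative stand-ins for a real health check and restart action.
state = {"up": False, "restarts": 0}


def check() -> bool:
    return state["up"]


def restart() -> None:
    state["restarts"] += 1
    state["up"] = state["restarts"] >= 2  # pretend the second restart succeeds


print(auto_repair(check, restart))
```

Bounding the number of attempts keeps the automation from retrying indefinitely and hiding a failure that needs human attention.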
