Haystack Distributed Tracing

Distributed Systems
A collection of services
and components that
appear as a single
system to its users

Distributed Systems
Failures
Monitoring and
alerting individual
services may result
in cascading alerts
[Figure: HTTP 500 responses repeated across services]
Some result in
incorrect results
and others are
complete failures

Distributed Systems
Analyzing (distributed) Logs
[03/Mar/2019 21:49:50 +0000] "GET /search/hotels?city=London HTTP/1.1" 200 …
[03/Mar/2019 21:49:50 +0000] "GET /geography/polygon?city=Seattle HTTP/1.1" 200 …
[03/Mar/2019 21:49:50 +0000] "GET /inventory/hotels?latlong=... HTTP/1.1" 200 …
[03/Mar/2019 21:49:50 +0000] "GET /price/hotel/1234 HTTP/1.1" 200 …
[03/Mar/2019 21:49:50 +0000] "GET /geography/polygon?city=London HTTP/1.1" 200 …
[03/Mar/2019 21:49:50 +0000] "GET /inventory/hotels?latlong= HTTP/1.1" 200 …
[03/Mar/2019 21:49:50 +0000] "GET /price/hotel/4567 HTTP/1.1" 500 …
[03/Mar/2019 21:49:50 +0000] "GET /price/hotel/1379 HTTP/1.1" 200 …
[03/Mar/2019 21:49:50 +0000] "POST /sort HTTP/1.1" 500 …
[03/Mar/2019 21:49:50 +0000] "POST /sort HTTP/1.1" 200 …
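
The log lines on this slide share a timestamp but carry no common request identifier, so a failing call cannot be tied back to the search that triggered it. A minimal Python sketch of what plain log analysis can recover, assuming the simplified line format shown here (real access logs carry more fields); `parse`, `errors_by_path` and `SAMPLE` are illustrative names, not part of Haystack:

```python
import re
from collections import Counter

# Matches the simplified access-log shape from the slide (an assumption):
# [timestamp] "METHOD path HTTP/x.y" status
LOG_LINE = re.compile(
    r'\[(?P<ts>[^\]]+)\]\s+"(?P<method>[A-Z]+) (?P<path>\S+) HTTP/[\d.]+"\s+(?P<status>\d{3})'
)

def parse(line):
    """Return a dict for a recognised log line, or None if it does not match."""
    m = LOG_LINE.search(line)
    if m is None:
        return None
    return {
        "ts": m.group("ts"),
        "method": m.group("method"),
        "path": m.group("path").split("?", 1)[0],  # drop the query string
        "status": int(m.group("status")),
    }

def errors_by_path(lines):
    """Count 5xx responses per path. Says nothing about which request caused them."""
    counts = Counter()
    for line in lines:
        rec = parse(line)
        if rec is not None and rec["status"] >= 500:
            counts[rec["path"]] += 1
    return counts

SAMPLE = [
    '[03/Mar/2019 21:49:50 +0000] "GET /search/hotels?city=London HTTP/1.1" 200',
    '[03/Mar/2019 21:49:50 +0000] "GET /price/hotel/4567 HTTP/1.1" 500',
    '[03/Mar/2019 21:49:50 +0000] "POST /sort HTTP/1.1" 500',
    '[03/Mar/2019 21:49:50 +0000] "POST /sort HTTP/1.1" 200',
]

if __name__ == "__main__":
    print(errors_by_path(SAMPLE))
```

The counts show where errors occurred, but not whether the failed /sort call belongs to the same user request as the failed /price call. That missing link is what a trace identifier provides.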

Distributed Systems
Distributed Tracing & Tracing Taxonomy

Distributed Tracing
Why?
It answers:
＊ Services involved in processing a request
＊ Service duration & number of invocations
＊ Network latency between services
＊ Call context for each service
＊ What happened in each service
＊ Bottlenecks in the system
＊ ...
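
A few of the questions above can be answered directly from span data. The sketch below is a minimal illustration in Python, not Haystack's data model: the `Span` fields, the sample `TRACE`, and the "self time" heuristic (a span's duration minus its direct children's durations, which assumes children run sequentially) are all assumptions made for the example.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Span:
    trace_id: str
    span_id: str
    parent_id: Optional[str]  # None for the root span
    service: str
    operation: str
    start_us: int             # start time, microseconds
    duration_us: int          # duration, microseconds

def summarize(spans):
    """Per-service invocation count and total duration for one trace."""
    summary = defaultdict(lambda: {"calls": 0, "total_us": 0})
    for s in spans:
        summary[s.service]["calls"] += 1
        summary[s.service]["total_us"] += s.duration_us
    return dict(summary)

def self_times(spans):
    """Duration of each span minus its direct children (assumes sequential children)."""
    child_total = defaultdict(int)
    for s in spans:
        if s.parent_id is not None:
            child_total[s.parent_id] += s.duration_us
    return {s.span_id: s.duration_us - child_total[s.span_id] for s in spans}

def bottleneck(spans):
    """Span with the largest self time: a rough pointer to where time is spent."""
    st = self_times(spans)
    return max(spans, key=lambda s: st[s.span_id])

# Hypothetical trace for one hotel search request.
TRACE = [
    Span("t1", "a", None, "search", "GET /search/hotels", 0, 900),
    Span("t1", "b", "a", "geography", "GET /geography/polygon", 50, 100),
    Span("t1", "c", "a", "inventory", "GET /inventory/hotels", 160, 200),
    Span("t1", "d", "a", "price", "GET /price/hotel", 370, 300),
    Span("t1", "e", "a", "price", "GET /price/hotel", 680, 100),
]

if __name__ == "__main__":
    print(summarize(TRACE))
    print(bottleneck(TRACE).operation)
```

Because every span carries the same `trace_id` and a `parent_id`, the services involved, the number of invocations, and the call hierarchy fall out of simple grouping.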

Distributed Tracing @ Expedia
Benefits
＊ Reduced Time to Know (root cause) by an
average of 26 mins
＊ Time to fix difficult defects down to a
week from seven weeks
＊ 4x cheaper than a well-known log indexing
product for the same volume
＊ Better benefits & visualization
＊ Failure prediction

Haystack
Distributed tracing at scale
Demo

Distributed Tracing @ Expedia
Haystack
＊ Inspired by Google Dapper (2010) and
Twitter’s Zipkin (2012)
＊ Developed by Expedia as Project
Blackbox (2015) and revised as Haystack
(2017)
＊ OpenTracing API compliant. Accepts Zipkin V2
format and OpenCensus
＊ Tracing, service & operation level trends,
service graph, anomaly detection ...
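
For readers unfamiliar with the Zipkin V2 format mentioned above, here is a minimal Python sketch that builds one span as JSON. The field names (`traceId`, `id`, `parentId`, `name`, `kind`, `timestamp`, `duration`, `localEndpoint`, `tags`) follow the Zipkin V2 span model as commonly documented; the helper `zipkin_v2_span`, the IDs, and the timestamp are made up for illustration. How Haystack ingests such a payload is not shown on the slides, so no endpoint is assumed here.

```python
import json

def zipkin_v2_span(trace_id, span_id, name, service, start_us, duration_us,
                   parent_id=None, kind="SERVER", tags=None):
    """Build a dict shaped like a Zipkin V2 span (illustrative helper)."""
    span = {
        "traceId": trace_id,                      # lower-hex trace identifier
        "id": span_id,                            # lower-hex span identifier
        "name": name,
        "kind": kind,                             # CLIENT, SERVER, PRODUCER or CONSUMER
        "timestamp": start_us,                    # epoch microseconds
        "duration": duration_us,                  # microseconds
        "localEndpoint": {"serviceName": service},
        "tags": {k: str(v) for k, v in (tags or {}).items()},  # tag values are strings
    }
    if parent_id is not None:
        span["parentId"] = parent_id              # omitted for the root span
    return span

# Hypothetical span for the failing price call; all values are invented.
PAYLOAD = json.dumps([
    zipkin_v2_span(
        trace_id="5af7183fb1d4cf5f",
        span_id="6b221d5bc9e6496c",
        name="get /price/hotel",
        service="price",
        start_us=1551649790000000,
        duration_us=42000,
        tags={"http.status_code": 500},
    )
])

if __name__ == "__main__":
    print(PAYLOAD)
```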

Haystack Service Graph - Architecture

Adaptive Alerting
Time series anomaly detection for streaming
metric data
＊ Supports arbitrary algorithms, both
classic and ML
＊ Handles millions of metrics
＊ Automated model selection, hyperparameter
optimization, model builds
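
To make "classic" streaming anomaly detection concrete, the sketch below flags points that fall more than a few standard deviations from an exponentially weighted moving average (EWMA). It is a generic textbook technique written in Python for illustration and is not taken from the Adaptive Alerting code base; the class name `EwmaDetector` and the parameters `alpha`, `threshold_sigmas` and `warmup` are assumptions.

```python
import math

class EwmaDetector:
    """Streaming anomaly detector using an EWMA mean and variance (illustrative)."""

    def __init__(self, alpha=0.3, threshold_sigmas=3.0, warmup=5):
        self.alpha = alpha                  # weight given to the newest observation
        self.threshold = threshold_sigmas   # width of the normal band, in std deviations
        self.warmup = warmup                # observations to see before flagging anything
        self.mean = None
        self.var = 0.0
        self.n = 0

    def observe(self, x):
        """Return True if x lies outside the band built from earlier observations."""
        self.n += 1
        if self.mean is None:
            self.mean = float(x)
            return False
        deviation = x - self.mean
        std = math.sqrt(self.var)
        is_anomaly = (
            self.n > self.warmup
            and std > 0
            and abs(deviation) > self.threshold * std
        )
        # Always update, so the model adapts to level shifts; a production
        # detector might instead exclude anomalous points from the update.
        self.mean += self.alpha * deviation
        self.var = (1 - self.alpha) * (self.var + self.alpha * deviation ** 2)
        return is_anomaly

def detect(values, **kwargs):
    """Run a fresh detector over a sequence and return one flag per value."""
    detector = EwmaDetector(**kwargs)
    return [detector.observe(v) for v in values]

if __name__ == "__main__":
    latencies_ms = [100, 102, 98, 101, 99, 100, 101, 300, 100]
    print(detect(latencies_ms))
```

The detector keeps only three numbers per metric, which is one reason this family of algorithms suits very large numbers of streaming time series.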

Lessons learned
＊ Make it simpler for developers
＊ Use standard client libraries like Spring Cloud Sleuth,
OpenCensus or OpenTracing implementations
＊ Provide examples for standard frameworks like
Spring Boot, Dropwizard and Node.js
＊ Use an agent to forward data. This helps with the
operational aspects of the platform
＊ Such initiatives need bottom-up adoption and
top-down support

Key Takeaways
＊ Observability supersedes Monitoring
＊ Distributed tracing is a key aspect of
Observability

Haystack
＊ Haystack & Adaptive Alerting have been open
sourced by Expedia
＊ http://expediadotcom.github.io/haystack/
＊ http://github.com/ExpediaDotCom/Adaptive-Alerting
＊ Questions?
https://gitter.im/expedia-haystack/Lobby
