Haystack Distributed Tracing
Jason Bulicek
Distributed tracing? Why?
Distributed Systems
A collection of services
and components that
appear as a single
system to its users
Distributed Systems
Distributed Systems
Failures
Monitoring and
alerting individual
services may result
in cascading alerts
HTTP 500
HTTP 500
HTTP 500
Some result in
incorrect results
and others are
complete failures
Distributed Systems
Analyzing (distributed) Logs
[03/Mar/2019 21:49:50 +0000] "GET /search/hotels?city=London HTTP/1.1" 200 …
[03/Mar/2019 21:49:50 +0000] "GET /geography/polygon?city=Seattle HTTP/1.1" 200 …
[03/Mar/2019 21:49:50 +0000] "GET /inventory/hotels?latlong=... HTTP/1.1" 200 …
[03/Mar/2019 21:49:50 +0000] "GET /price/hotel/1234 HTTP/1.1" 200 …
[03/Mar/2019 21:49:50 +0000] "GET /geography/polygon?city=London HTTP/1.1" 200 …
[03/Mar/2019 21:49:50 +0000] "GET /inventory/hotels?latlong= HTTP/1.1" 200 …
[03/Mar/2019 21:49:50 +0000] "GET /price/hotel/4567 HTTP/1.1" 500 …
[03/Mar/2019 21:49:50 +0000] "GET /price/hotel/1379 HTTP/1.1" 200 …
[03/Mar/2019 21:49:50 +0000] "POST /sort HTTP/1.1" 500 …
[03/Mar/2019 21:49:50 +0000] "GET /price/hotel/2468 HTTP/1.1" 200 …
[03/Mar/2019 21:49:50 +0000] "POST /sort HTTP/1.1" 200 …
[03/Mar/2019 21:49:50 +0000] "GET /price/hotel/2580 HTTP/1.1" 200 …
Distributed Systems
Distributed Tracing & Tracing Taxonomy
It answers:
* Services involved in processing a request
* Service duration & number of invocations
* Network latency between services
* Call context for each service
* What happened in each service
* Bottlenecks in the system
* ...
Distributed Tracing
Why?
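As a rough illustration of how a trace can answer the questions listed on this slide, the sketch below models one trace as a list of spans and derives per-service invocation counts, durations, and a likely bottleneck. The `Span` fields, the sample timings, and the helper names are assumptions made for this example, not Haystack's actual data model, and the self-time calculation assumes child spans run one after another.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    # Hypothetical minimal span; real tracers carry more fields (tags, logs, ...).
    trace_id: str
    span_id: str
    parent_id: Optional[str]
    service: str
    operation: str
    start_ms: int
    duration_ms: int
    error: bool = False


def summarize(spans):
    """Per-service invocation count, total duration, and error count."""
    summary = defaultdict(lambda: {"calls": 0, "total_ms": 0, "errors": 0})
    for span in spans:
        entry = summary[span.service]
        entry["calls"] += 1
        entry["total_ms"] += span.duration_ms
        entry["errors"] += int(span.error)
    return dict(summary)


def self_times(spans):
    """Time spent in each span itself, excluding its direct children.

    Assumes children run sequentially; overlapping children would need
    interval merging instead of a plain sum.
    """
    child_total = defaultdict(int)
    for span in spans:
        if span.parent_id is not None:
            child_total[span.parent_id] += span.duration_ms
    return {s.span_id: s.duration_ms - child_total[s.span_id] for s in spans}


def bottleneck(spans):
    """Span with the largest self time."""
    times = self_times(spans)
    return max(spans, key=lambda s: times[s.span_id])


# Made-up trace loosely following the hotel search example in the logs slide.
TRACE = [
    Span("t1", "a", None, "search", "GET /search/hotels", 0, 120),
    Span("t1", "b", "a", "geography", "GET /geography/polygon", 5, 20),
    Span("t1", "c", "a", "inventory", "GET /inventory/hotels", 30, 25),
    Span("t1", "d", "a", "price", "GET /price/hotel", 60, 10),
    Span("t1", "e", "a", "price", "GET /price/hotel", 72, 30, error=True),
    Span("t1", "f", "a", "sort", "POST /sort", 105, 10),
]

if __name__ == "__main__":
    print(summarize(TRACE)["price"])
    print(bottleneck(TRACE).span_id)
```

Because every span carries the same trace ID and a parent ID, the failing price call can be tied back to the one search request that triggered it, which the timestamp-only log lines cannot do.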
Distributed Tracing @ Expedia
Benefits
* Reduced Time to Know (root cause) by an
average of 26 mins
* Time to fix difficult defects down to a
week from seven weeks
* 4x cheaper than a well-known log-indexing
product for the same volume
* Better benefits & visualization
* Failure prediction
Haystack
Distributed tracing at scale
Demo
Distributed Tracing @ Expedia
Haystack
* Inspired by Google Dapper (2010) and
Twitter’s Zipkin (2012)
* Developed by Expedia as Project
Blackbox (2015) and revised as Haystack
(2017)
* OpenTracing API compliant. Accepts Zipkin v2
format and OpenCensus
* Tracing, Service & operation level trends,
service graph, anomaly detection ...
Haystack - Architecture
Haystack Traces - Architecture
Haystack Traces - UI
Haystack Traces - UI
Haystack Latency Cost - UI
Haystack Trends - Architecture
Haystack Trends - UI
Haystack Service Graph - Architecture
Haystack Service Graph - UI
Adaptive Alerting
Time series anomaly detection for streaming
metric data
* Supports arbitrary algorithms, both
classic and ML
* Handles millions of metrics
* Automated model selection, hyperparameter
optimization, model builds
Haystack Alerts - UI
Lessons learned
* Make it simpler for developers
  * Use standard client libraries like Spring Cloud Sleuth,
    OpenCensus or OpenTracing implementations
  * Provide examples for standard frameworks like
    Spring Boot, Dropwizard and Node.js
* Use an agent to forward data; this helps with the
  operational aspects of the platform
* Needs bottom-up adoption and top-down
  support for such initiatives
Key Takeaways
* Observability supersedes Monitoring
* Distributed tracing is a key aspect of
Observability
Haystack
* Haystack & Adaptive Alerting have been open
sourced by Expedia
* http://expediadotcom.github.io/haystack/
* http://github.com/ExpediaDotCom/Adaptive-Alerting
* Questions?
https://gitter.im/expedia-haystack/Lobby

Editor's Notes

  • #15 The Reader runs as a gRPC server and serves the Haystack UI, fetching traces directly from Cassandra for a given TraceId or using ElasticSearch for contextual searches. For a search, the Reader queries ElasticSearch, which returns a set of TraceIds; the Reader then pulls all the spans for each unique TraceId from Cassandra. The Indexer reads spans from Kafka and groups them by their unique TraceId. It writes the resulting grouped data structure to Cassandra and ElasticSearch. Cassandra is used as a raw data store where all the spans are inserted with TraceId as the primary key. ElasticSearch is used to build an index on the metadata, serviceName and operationName associated with every span. This lets the Reader query ElasticSearch for contextual searches such as "fetch all TraceIds (requests) that have any span with serviceName=xyz and a metadata tag success=false". The Indexer also controls which attributes are indexed, through a whitelist configuration. It reads the whitelist from an external store and applies the list dynamically. The goal is to help organizations control and scale their infrastructure spend (especially on ElasticSearch) more proactively.
  • #19 span-timeseries-transformer: this app reads spans, converts them to metric points, and pushes the raw metric points to Kafka, partitioned by metric key. timeseries-aggregator: this app reads metric points, aggregates them based on rules, and pushes the aggregated metric points to Kafka.
  • #21 The node finder finds the relationships between services; the graph builder creates the “nodes/edges” model that the UI uses to display the visualization.
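
The Indexer and Reader flow described in note #15 can be sketched in memory as follows. This is only an illustration under stated assumptions: plain dictionaries stand in for Cassandra and ElasticSearch, Kafka and gRPC are omitted, and the class name `TraceStore`, its methods, and the span field names are hypothetical rather than Haystack's real API.

```python
from collections import defaultdict


class TraceStore:
    """In-memory stand-in for the Cassandra + ElasticSearch pair (assumption)."""

    def __init__(self, whitelist):
        # Tag keys that are allowed into the search index.
        self.whitelist = set(whitelist)
        # "Cassandra": raw spans keyed by traceId.
        self.spans_by_trace = defaultdict(list)
        # "ElasticSearch": (field, value) -> set of (traceId, spanId).
        self.index = defaultdict(set)

    def index_spans(self, spans):
        """Indexer role: group spans by traceId and index selected fields."""
        for span in spans:
            ref = (span["traceId"], span["spanId"])
            self.spans_by_trace[span["traceId"]].append(span)
            self.index[("serviceName", span["serviceName"])].add(ref)
            self.index[("operationName", span["operationName"])].add(ref)
            for key, value in span.get("tags", {}).items():
                if key in self.whitelist:  # non-whitelisted tags stay unindexed
                    self.index[(key, value)].add(ref)

    def get_trace(self, trace_id):
        """Reader role: direct lookup of all spans for one traceId."""
        return list(self.spans_by_trace.get(trace_id, []))

    def search(self, **criteria):
        """Reader role: find traces where a single span matches every criterion."""
        matches = None
        for field, value in criteria.items():
            hits = self.index.get((field, value), set())
            matches = hits if matches is None else matches & hits
        trace_ids = sorted({trace_id for trace_id, _ in (matches or set())})
        # Second step: pull the full traces from the raw store.
        return {trace_id: self.get_trace(trace_id) for trace_id in trace_ids}


# Made-up spans for illustration only.
SPANS = [
    {"traceId": "t1", "spanId": "s1", "serviceName": "search",
     "operationName": "GET /search/hotels", "tags": {"success": "true"}},
    {"traceId": "t1", "spanId": "s2", "serviceName": "price",
     "operationName": "GET /price/hotel",
     "tags": {"success": "false", "hotelId": "4567"}},
    {"traceId": "t2", "spanId": "s3", "serviceName": "price",
     "operationName": "GET /price/hotel", "tags": {"success": "true"}},
    {"traceId": "t2", "spanId": "s4", "serviceName": "sort",
     "operationName": "POST /sort", "tags": {"success": "false"}},
]

store = TraceStore(whitelist=["success"])
store.index_spans(SPANS)

if __name__ == "__main__":
    failed = store.search(serviceName="price", success="false")
    print(list(failed))
```

In this sketch the search for `serviceName="price"` with `success="false"` matches only trace `t1`: trace `t2` has a price span and a failed span, but they are different spans. A search on `hotelId` finds nothing because that tag is not whitelisted, which mirrors how the whitelist keeps the index, and its cost, small.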