Scaling Uber
SCALING UBER
MATT RANNEY
Watch the video with slide synchronization on InfoQ.com!
http://www.infoq.com/presentations/uber-scalability-arch
Presented at QCon San Francisco
www.qconsf.com
As of November 2015:
Uber Cities Worldwide: 356
Countries: 65
Employees: 4,800
Engineers: 1,500
US Driver Payments in 2015: $3.5B
UBER ENGINEERING
HISTORY
2009-2010 Outsourced PHP + MySQL
Jan 2011 "dispatch" - Node.JS/MongoDB
Jan 2011 "API" - Python/SQLAlchemy/MySQL
Feb 2012 Dispatch swaps MongoDB for Redis
May 2012 Dispatch adds ON fallback
Jan 2013 First non-API Python services
Feb 2013 API switched to Postgres
Mar 2014 New Python services use MySQL
Mar 2014 Schemaless begins, must finish before pg collapse
Sep 2014 First Schemaless - trips out of Postgres
Aug 2015 Dispatch X.0 / Ringpop / Riak
TECHNICAL DEBT
MICROSERVICES
Immutable?
Append Only?
Node.JS
Python
Go
Java
SCALING NODE
Getting out of the HTTP+JSON business
HTTP is slow, complex, and inconsistent
JSON is hard to validate, awkward in non-Node languages
Thrift is OK, but generated code is bad
SERVICE DISCOVERY
Lots of services, lots of instances
Mostly Node.JS and Python
Call graph unknowable
Self-inflicted DoS
Cascading failures
[Diagrams: service A, load balancer, and two instances of service B]
horizontal scalability
zipkin tracing
circuit breaking
rate limiting
failure testable
almost no configuration
as available as possible
overall latency ≥ latency of slowest component
per-component latency: 1ms avg, 1000ms p99
use 1 component: 1% of requests take at least 1000ms
use 100 components: 63% of requests take at least 1000ms
1.0 - 0.99^100 = 0.634 = 63.4%
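The arithmetic above can be sketched in a few lines of Python. This is a minimal illustration, not code from the talk: the function name `fraction_slow` and its parameters are hypothetical, and the formula assumes each component's latency is independent of the others, which the slide's calculation implies but does not state.

```python
def fraction_slow(fan_out: int, p_slow: float = 0.01) -> float:
    """Probability that at least one of `fan_out` calls lands in the slow tail.

    Assumes calls are independent and that each one is slow with
    probability `p_slow` (0.01 corresponds to the p99 figure on the slide).
    """
    return 1.0 - (1.0 - p_slow) ** fan_out


if __name__ == "__main__":
    # Reproduce the two cases from the slide: 1 component and 100 components.
    for n in (1, 100):
        print(f"fan-out {n}: {fraction_slow(n):.1%} of requests hit a slow call")
```

With a fan-out of 100 the result matches the 63.4% figure on the slide, which is why a rare per-component tail becomes the common case for a request that touches many components.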
LATENCY
[Chart: requests that are slow (0% to 100%) versus processes used (1 to 1024), with curves for p95, p99, and p99.9]
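A rough way to regenerate data of the kind this chart plots is shown below. It is a sketch under assumptions: the names `PROCESS_COUNTS`, `SLOW_TAILS`, and `slow_request_curve` are hypothetical, each curve label is read as "a single process is slow with probability 5%, 1%, or 0.1%", and processes are treated as independent. The original chart may have been produced differently.

```python
# Process counts on the chart's x-axis: 1, 2, 4, ..., 1024.
PROCESS_COUNTS = [2 ** k for k in range(11)]

# Assumed per-process probability of a slow response for each curve label.
SLOW_TAILS = {"p95": 0.05, "p99": 0.01, "p99.9": 0.001}


def slow_request_curve(p_slow, counts=PROCESS_COUNTS):
    """Fraction of requests that touch at least one slow process, per count."""
    return [1.0 - (1.0 - p_slow) ** n for n in counts]


curves = {label: slow_request_curve(p) for label, p in SLOW_TAILS.items()}

if __name__ == "__main__":
    # Print one row per curve, one column per process count.
    print("processes " + " ".join(f"{n:>6}" for n in PROCESS_COUNTS))
    for label, series in curves.items():
        print(f"{label:<9} " + " ".join(f"{v:>6.1%}" for v in series))
```

Under these assumptions even the p99.9 curve climbs past half of all requests by 1024 processes, so tightening the tail of each process delays the problem but does not remove it.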
CULTURAL CHANGES
FAILURE TESTING
RETRIES
[Diagram: the partner app exchanges Location Updates and a State Digest with dispatch DC1, and Location Updates and a State Request with dispatch DC2]
http://www.principlesofchaos.org/
EMBRACE THE CHAOS
THANKS