FACILITATING THE SPREAD OF KNOWLEDGE AND INNOVATION IN PROFESSIONAL SOFTWARE DEVELOPMENT
microservices
Patterns and Practices eMag Issue 59 - Mar 2018
ARTICLE: Polyglot Persistence Powering Microservices
ARTICLE: Microservices Patterns and Practices Panel
ARTICLE: Managing Data in Microservices
IN THIS ISSUE
Polyglot Persistence Powering Microservices
Patterns for Microservice Developer Workflows and Deployment: Q&A with Rafael Schloming
Debugging Distributed Systems: Idit Levine Discusses the Squash Microservices Debugger
Microservices Patterns and Practices Panel
Managing Data in Microservices
CONTACT US
GENERAL FEEDBACK feedback@infoq.com
ADVERTISING sales@infoq.com
EDITORIAL editors@infoq.com
FOLLOW US
facebook.com/InfoQ
@InfoQ
google.com/+InfoQ
linkedin.com/company/infoq
A LETTER FROM THE EDITOR
Thomas Betts
While the underlying technology and patterns are certainly interesting, microservices have always been about helping development teams be more productive. Whether used as a technique for architects to manage complexity or to make small teams more independent and responsible for supporting the software they create, the human aspect of microservices cannot be ignored.

Many of the experts who spoke about microservices patterns and practices at QCon San Francisco 2017 did not simply talk about the technical details of microservices. They included a focus on the business side and more human-oriented aspects of developing distributed software systems.

At Netflix, the cloud database engineering team is responsible for providing several flavors of data persistence as a service to microservice development teams. Roopa Tangirala explained how her team has created self-service tools that help developers easily implement the appropriate data store for each project’s needs.

Drawing on his experience with developing a microservices application at Datawire in 2013, Rafael Schloming argued that one of the most important, although often ignored, questions a development lead should ask is “How do I break up my monolithic process?” as the development process is critical to establishing and maintaining velocity.

With microservices distributed across containers, how is a developer able to step into the code and debug what is happening? Idit Levine discussed the problem and introduced Squash, an open-source platform for debugging microservices applications.

Randy Shoup provided practical examples of how to manage data in microservices, with an emphasis on migrating from a monolithic database. He also strongly advocated for building a monolith first, and only migrating to microservices after you actually require the scaling and other benefits they provide.

The microservices track also included a panel discussion where several experts shared their experiences and advice for being successful with microservices. Questions from the audience highlighted common themes, such as dealing with deployments, communication between microservices, and looking at what future trends might follow microservices.
CONTRIBUTORS
Thomas Betts
is a principal software engineer at IHS Markit, with two
decades of professional software development experience.
His focus has always been on providing software solutions
that delight his customers. He has worked in a variety of
industries, including retail, finance, health care, defense and
travel. Thomas lives in Denver with his wife and son, and they
love hiking and otherwise exploring beautiful Colorado.
Rafael Schloming
is co-founder and chief architect of Datawire. He is a globally recognized expert on messaging and distributed systems and author of the AMQP specification. Previously, Schloming was a principal software engineer at Red Hat. Rafael has a B.S. in computer science from MIT.
Chris Richardson
is a developer and architect. He is a Java Champion and the author of POJOs in Action, which describes how to build enterprise Java applications with frameworks such as Spring and Hibernate. Richardson was also the founder of the original Cloud Foundry, an early Java PaaS for Amazon EC2. He consults with organizations to improve how they develop and deploy applications and is working on his third startup. He’s on Twitter as @crichardson.
Idit Levine
is Founder/Leader/Contributor on a variety of Cloud open source Projects. Expert in cluster management like: Kubernetes, Mesos & Docker Swarm. Hybrid cloud: AWS, Google Cloud, OpenStack, Xen & vSphere. Comfortable with Cloud Foundry and a laundry list of other frameworks and tools.
Roopa Tangirala
is an experienced engineering leader with extensive background in databases, be they distributed or relational. She leads the Cloud Database Engineering team at Netflix, responsible for cloud persistent run-time stores for Netflix, ensuring data availability, durability, and scalability to meet growing business needs. The team specializes in providing polyglot persistence as a service with Cassandra, Elasticsearch, Dynomite, MySQL, etc.
Louis Ryan
is a core contributor to Istio and gRPC and is a principal engineer at Google.
Randy Shoup
is a 25-year veteran of Silicon Valley, and has worked as a senior technology leader and executive at companies ranging from small startups to mid-sized places to eBay and Google. He is currently VP Engineering at Stitch Fix in San Francisco. He is particularly passionate about the nexus of culture, technology, and organization.
Daniel Bryant
is leading change within organisations and technology.
His current work includes enabling agility within
organisations by introducing better requirement gathering
and planning techniques, focusing on the relevance of
architecture within agile development, and facilitating
continuous integration/delivery.
Watch presentation online on InfoQ
Polyglot Persistence Powering Microservices

Adapted from a presentation at QCon San Francisco 2017, by Roopa Tangirala, engineering manager at Netflix

KEY TAKEAWAYS
Choose the appropriate persistence store for your microservices.
By providing polyglot persistence as a service, developers can focus on building great applications and not worry about tuning, tweaking, and capacity of various back ends.
Operating various persistence stores at scale involves unique challenges, but common components can simplify the process.
Netflix’s common platform drives operational excellence in managing, maintaining, and scaling persistence infrastructures (including building reliable systems on unreliable infrastructure).

We have all worked in companies that started small, and have a monolithic app built as a single unit. That app generates a lot of data for which we pick a data store. Very quickly, the database becomes the lifeline of the company.

Since we are doing such an amazing job, growth picks up and we need to scale the monolithic app. It starts to fail under high load and runs into scaling issues. Now, we must do the right thing. We break our monolithic app into multiple microservices that have better fallback and can scale well horizontally. But we don’t worry about the back-end data store; we continue to fit the microservices to the originally chosen back end.
6 2018 Microservices // eMag Issue 59 - Mar 2018
Soon, things become complicated at our back-end tier. Our data team feels overwhelmed because they’re the ones who have to manage the uptime of our data store. They are trying to support all kinds of antipatterns of which the database might not be capable.

Imagine that instead of trying to make all of our microservices fit one persistence store, we leverage the strengths and features of our back-end data tier to fit our application needs. No longer do we worry about fitting our graph usage into an RDBMS or trying to fit ad hoc search queries into Cassandra. Our data team can work peacefully, in a state of Zen.

Polyglot persistence powering microservices
I manage the cloud database engineering team at Netflix. I have been with Netflix for almost a decade and I have seen the company transition from being monolithic in the data center to microservices and polyglot persistence in the cloud. Netflix has embraced polyglot persistence. I will cover five use cases for it, and discuss the reasons for choosing different back-end data stores.

Being a central platform team, my team faces many challenges in providing different flavors of database as a service across all of Netflix’s microservice platforms.

About Netflix
Netflix has been leading the way for digital content since 1997. We have over 109 million subscribers in 190 countries and we are a global leader in streaming. Netflix delivers an amazing viewing experience across a wide variety of devices, and brings you great original content in the form of Stranger Things, Narcos, and many more titles.

All your interactions as a Netflix customer with the Netflix UI, all your data such as membership information or viewing history, all of the metadata that a title needs to move from script to screen, and so much more are stored in some form in one of the data stores we manage.

The Cloud Database Engineering (CDE) team at Netflix runs on the Amazon cloud, and we support a wide variety of polyglot persistence. We have Cassandra, Dynomite, EVCache, Elasticsearch, Titan, ZooKeeper, MySQL, Amazon S3 for some datasets, and RDS.

Elasticsearch provides great search, analysis, and visualization of any dataset in any format in near real time. EVCache is a distributed in-memory caching solution based on Memcached that was open-sourced by Netflix in 2011. Cassandra is a distributed NoSQL data store that can handle large datasets and can provide high availability, multi-region replication, and high scalability. Dynomite is a distributed Dynamo layer, again open-sourced by Netflix, that provides support for different storage engines. Currently, it supports Redis, Memcached, and RocksDB. Inspired by Cassandra, it adds sharding and replication to non-distributed datasets. Lastly, Titan is a scalable graph database that’s optimized for storing and querying graph datasets.

Let’s look at the architecture, the cloud deployment, and how the datasets are persisted in Amazon Web Services (AWS). We are running in three AWS regions, which take all of the traffic. User traffic is routed to the closest region: primarily, US West 2, US East 1, and EU West 1. If there’s a problem with one region, our traffic team can shift the traffic in less than seven minutes to the other two regions with minimal or no downtime. So all of our data stores need to be distributed and highly scalable.

Use case 1: CDN URL
If, like me, you’re a fan of Netflix (and love to binge-watch Stranger Things and other titles), you know you have to click the play button. From the moment you click to the time you see the video on the screen, many things happen in the background. Netflix has to look at the user authorization and licensing for the content. Netflix has a network of Open Connect Appliances (OCAs) spread all over the world. These OCAs are where Netflix stores the video bits, and the sole purpose of these appliances is to deliver the bits as quickly and efficiently as possible to your devices, while an Amazon plane handles the microservices and data-persistence stores. The service in this use case is the one responsible for generating the URL, and from there, we can stream the movie to you.

The very first requirement for this service is to be highly available. We don’t want any user experience to be compromised when you are trying to watch a movie, say, so high availability was priority number one. Next, we want tiny read and write latencies, less than one millisecond, because this service lies in the middle of the path of streaming, and we want the movie to play for you the moment you click play.

We also want high throughput per node. Although the files are pre-positioned in all of these caches, they can change based on the cache health or when Netflix introduces new movies: there are
multiple dimensions along which these movie files can change. So this service receives high read as well as write throughput. We want something where per-node throughput can be high so we can optimize.

For this particular service we used EVCache. It is a distributed caching solution that provides low latency because it is all in memory. The data model for this use case was simple: a plain key-value pair, which you can easily get from the cache. EVCache is distributed, and we have multiple copies in different AWS Availability Zones, so we get better fault tolerance as well.

Use case 2: Playback error
Imagine that you click play to watch a movie but you get a playback error. The playback error happens whenever you click the title; it’s just not playable.

A title has multiple characteristics and metadata. It has ratings, the genre, and the description. It has the audio languages and the subtitle languages it supports. It has the Netflix Open Connect CDN URL, discussed in the first use case, which is the location from where the movie streams to you. We call all of this metadata the “playback manifest”. And we need it to play the title for you.

There are hundreds of dimensions that can lead to a playback metadata error, and there are hundreds of dimensions that can alter the user’s playback experience. For example, some content is licensed only in specific countries and we cannot play that to you if you cross a border. Maybe a user wants to watch Narcos in Spanish. We might have to change the bit rate at which we are streaming the movie depending on your use of Wi-Fi or a fixed network. Some devices do not support 4K or HD and we have to change the stream based on the device. Beyond these few examples, there are hundreds of dimensions on which your playback experience depends.

For this service, we wanted the ability to quickly resolve incidents. We want to have someplace where we can quickly look for the cause of an issue: which dimension is not in sync and is causing your playback error. If we have rolled out a push, we want to see if we need to roll back, or roll forward, based on the scope of the error: is the error happening in all three regions, in only specific regions, or on only a particular device? There are multiple dimensions along which we need to examine the dataset.

Another requirement was interactive dashboards. We wanted the ability to slice and dice the dataset to see the root cause of that error. Near-real-time search is important because we want to figure out whether or not a recent push has caused the problem at hand. We need ad hoc queries because there are so many dimensions; we don’t know our query patterns. There may be multiple ways for us to query the dataset to arrive at what is causing the error.

We used Elasticsearch for this service. It provides great search and analysis for data in any form, and it has interactive dashboards through Kibana. We use Elasticsearch a lot at Netflix, especially for debugging and logging use cases.

Kibana provides a great UI for interactive exploration that allows us to examine the dataset to find the error. We can determine that the error is in a specific region across multiple devices, in a specific device, or confined to a particular title. Elasticsearch also supports queries such as “What are the top 10 devices across Netflix?”

Before Elasticsearch, the incident-to-resolution time was more than two hours. The process involved looking at the logs, grepping the logs, and looking at the cause of error and where there’s a mismatch between the manifest and what is being streamed to you. With Elasticsearch, the resolution time decreased to under 10 minutes. That has been a great thing.

Use case 3: Viewing history
As you watch Netflix, you build what we call a “viewing history”, which is basically the titles you have been watching over the past few days. It keeps a bookmark of where you were, and you can click to resume from where you stopped. If you look at your account activity, you can see the date that you watched a particular title and you can report if there’s a problem viewing a title.

For viewing history, we needed a data store that could store time series in a dataset. We needed to support a high number of writes. A lot of people are watching Netflix, which is great, so the viewing history service receives a lot of writes. Because we are deployed in three regions, we wanted cross-region replication so that if there’s a problem within one region, we can shift the traffic and have the user’s viewing history available in the other regions as well. Support of large datasets was important, since viewing history has been growing exponentially.
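The viewing-history requirements (small time-series writes, copies in every region, reads that survive a traffic shift) can be pictured with a toy in-memory model. This is only a sketch under stated assumptions: the class `ViewingHistoryStore`, its method names, and the lowercase region identifiers are hypothetical, and a real distributed store handles replication, consistency, and repair in ways this model ignores.

```python
from dataclasses import dataclass, field

# Region identifiers are illustrative spellings of the three regions named in the talk.
REGIONS = ("us-west-2", "us-east-1", "eu-west-1")


@dataclass
class ViewingHistoryStore:
    """Toy stand-in for a cross-region replicated time-series store (hypothetical)."""

    # region -> user_id -> list of (timestamp, title) events
    replicas: dict = field(default_factory=lambda: {r: {} for r in REGIONS})
    # regions that currently serve no traffic
    down: set = field(default_factory=set)

    def record_view(self, user_id, timestamp, title):
        # Each write is a tiny append; the toy copies it to every region at once.
        for region in REGIONS:
            self.replicas[region].setdefault(user_id, []).append((timestamp, title))

    def history(self, user_id, preferred_region):
        # Serve from the preferred region, or fail over to the next healthy one.
        order = [preferred_region] + [r for r in REGIONS if r != preferred_region]
        for region in order:
            if region not in self.down:
                events = self.replicas[region].get(user_id, [])
                return region, sorted(events)  # time order
        raise RuntimeError("no region available")


store = ViewingHistoryStore()
store.record_view("u1", 2, "Narcos")
store.record_view("u1", 1, "Stranger Things")
store.down.add("us-east-1")  # simulate a regional problem
print(store.history("u1", "us-east-1"))
```

Because every region already holds the user's events, shifting traffic away from one region changes only where the read is served, not what it returns.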
We used Cassandra for this. Cassandra is a great NoSQL distributed data store that offers multi-data-center, multi-directional replication. This works out great because Cassandra is doing the replication for us. It is highly available and highly scalable. It has great fault detection and multiple replicas, so that a node going down doesn’t cause website downtime. We can define different consistency levels so that we never experience downtime, even though there are nodes that will always go down in our regions.

Data model
The data model for viewing history started simple. We have a row key, which is the customer or user ID. Each title a user watches is a column in that particular column family. When you watch, you are writing to the viewing history, and we just write a tiny payload: the latest title you watched. Viewing history grows over time, and Cassandra capably handles wide rows, so there is no problem. You can read your whole viewing history, and when you do so, you are paginating through your rows.

We quickly ran into issues with this model. The viewing history is quite popular, so the dataset is growing rapidly. A few customers have a huge viewing history, so the row becomes very wide. Even though Cassandra is great for wide rows, trying to read all of that data in memory causes heap pressure and compromises the 99th-percentile latencies.

New data model
So we have a new model, which we split into two column families. One is the live viewing history, with a similar pattern of each column being a title, so we can continue to write small payloads. And then we have a roll-up column family, which is a combination of all historical datasets that is rolled up into another, compressed column family. This means we have to do two reads, once from the compressed family and once from the live column family. This definitely helps with the size. We drastically reduced the size of the dataset because half of the data was compressed.

The roll-up happens in the path of read. When the user is trying to read from viewing history, the service knows how many columns they have read. And if the number of columns is more than whatever we think it should be, then we compress the historical data and move it to the other column family. This happens all the time based on your reads, which works out very nicely.

Use case 4: Digital-asset management
Our content platform engineering team at Netflix deals with tons of digital assets, and needed a tool to store the assets as well as the connections and relationships among these assets.

For example, we have lots of artwork, which is what you see on the website. The art can come in different formats, including JPEG, PNG, etc. We also have various categories of artwork: a movie can have art, a character can have art, and a person can have art, etc.

And each title is a combination of different things in a package. The package can include video elements, such as trailers and montages, and the video, audio, and subtitle combination. For example, we can have French in the video format with subtitles in French and Spanish. And then you have relationships, like a montage is a type of video.

We wanted a data store where we could store all of these entities as well as the relationships.

Our requirements for the digital-asset management service were one back-end plane to store the asset metadata, the relationships, and the connected datasets, plus the ability to quickly search that. We used Titan, which is a distributed graph database. It’s great for storing graph datasets, and it supports various storage back ends. Since we already support Cassandra and Elasticsearch, it was easy to integrate into our service.

Use case 5: Distributed delayed queues
The Netflix content platform engineering team runs a number of business processes. Rolling out a new movie, content ingestion and encoding, or uploading to the CDN are all business processes that require asynchronous orchestration between multiple microservices. Delayed queues form an integral part of this orchestration.

We want delayed queues that are distributed and highly concurrent because multiple microservices are accessing them. And we wanted at-least-once delivery semantics for the queue, and a delayed queue, because there are relationships between all these microservices and we don’t know when the queue will be consumed. A critical requirement was having priorities within the shard, so that we can pick up the queue with the highest priority.

For this particular service, we used Dynomite. Netflix open-sourced Dynomite some time ago. It is a pluggable data store that works with Redis, Memcached, and RocksDB. It works for this use case because Redis has data structures
that support queues very well. Early on, we tried to make queues work with Cassandra and failed miserably, running into all kinds of edge cases. Dynomite worked superbly for us in this case. And it provides multiple-data-center replication and sharding so we, as application owners, need not worry about data being replicated across regions or data centers.

Netflix maintains three sets of Redis structures for each queue. One is a sorted set that contains queue elements by score. The second is a hash that contains the payload, keyed by the message ID. The third is a sorted set that contains messages consumed by the client, but which have yet to be acknowledged. So the third is the unacknowledged set.

“Netflix’s Cloud Database Engineering team provides data stores as a service, with self-provisioning capabilities that allow application users to create clusters on their own.”

Identifying the challenges
I love this quote, but I don’t think my on-call team feels like this: “I expected times like this, but I never felt that they’d be so bad, so long, and so frequent.”

The first challenge my team faces is the wide variety and the scale. We have so many different flavors of data store, and we have to manage and monitor all these different technologies. We need to build a team that is capable of doing all this while making sure the team has the skills to cater to all of these different technologies. Handling that variety, especially with a small team, becomes a challenge to manage.

The next challenge is predicting the future. With a combination of all of these technologies, we have thousands of clusters, tens of thousands of nodes, petabytes of data. We need to predict when our cluster risks running out of capacity. My central-platform team should know each cluster’s headroom so that if the application team says they are increasing capacity or throughput or adding a new feature that causes an increase in the back-end IOPS, we should be able to tell them that their cluster is sufficient or needs to scale up.

For maintenance and upgrades across all clusters, software or hardware, we need to know whether we can perform maintenance without impacting production services. Can we build our own solution or should we buy something that’s out there?

Another challenge is monitoring. We have tens of thousands of instances, and all of these instances are sending metrics. When there’s a problem, we should know which metrics make the most sense and which we should be looking at. We must maintain a high signal-to-noise ratio.

Overcoming challenges
The very first step in meeting these challenges is to have experts. We have two or three core people in our Cassandra cloud database engineering team that we call subject-matter experts. These people provide best practices and work closely with the microservice teams to understand their requirements and suggest a back-end data store. They are the ones who drive the features and best practices, as well as the product future and vision.

Everybody in the team goes on call for all of these technologies, so it’s useful to have a core set of people that understand what’s happening and how we can really fix the back end. Instead of building automation that applies patches on top of what is broken, we can contribute to the open
source or to the back-end data tier, and produce a feature.

Next, we build intelligent systems to work for us. These systems take on all automation and remediation. They accept the alerts, look at the config, and use the latency thresholds we have for each application to make decisions, saving people from getting paged for each and every alert.

CDE Service
CDE Service helps the CDE team provide data stores as a service. Its first component captures the thresholds and SLAs. We have thousands of microservices; how do we know which service requires what 99th-percentile latency? We need a way to look at the clusters and see both the requirements and what we have promised so that we can tell if a cluster is sized effectively or needs to scale up.

Cluster metadata helps provide a global view of all the clusters: the software and kernel version each runs, its size, and the cost of managing it. The metadata helps the application team understand the cost associated with a particular back end and the data they are trying to store, and whether or not their approach makes sense.

The self-service capability of CDE Service allows application users to create clusters on their own, without the CDE team getting in the way. The users don’t need to understand all the nitty-gritty details of the back-end YAML; they only need to provide minimal information. We create the cluster and make sure that it is using the right settings, it has the right version, and it has the best practices built in.

Before CDE Service, contact information only sat outside the system. For each application, we’d need to know who to contact and which team to page. It becomes tricky when you’re managing so many clusters, and having some central place to capture this metadata is crucial.

Lastly, we track maintenance windows. Some clusters can have maintenance windows at night, while others receive high traffic at the same time. We decide on an appropriate maintenance window for a cluster’s use case and traffic pattern.

Architecture
Figure 1 shows the architecture, with the datastore in the center. For the scheduler on the left, we use Jenkins, which is based on cron and which allows us to click a button to do upgrades or node replacements. Under that is CDE Service, which captures the cluster metadata and is the source of all information like SLAs, PagerDuty information, and much more. On the top is the monitoring system. At Netflix, we use Atlas, an open-source telemetry system, to capture all of the metrics. Whenever there’s a problem and we cannot meet the 99th-percentile latency, the alert will go off. On the very right is the remediation system, an execution framework that runs on containers and that can execute automation.

Figure 1: CDE architecture

Anytime an alert fires, the monitoring system will send the alert to the remediation system. That system will perform automated remediation on the data store and won’t even let the alert go to the CDE team. Only in situations for which we have not yet built automation will alerts come directly to us. It is in our team’s best interest to build as much automation as possible, to limit the number of on-call pages we need to respond to.

SLA
Figure 2 shows the cluster view where I can look at all of my clusters. I can see what version they are running, which environment they are in, which region they are in, and how many nodes they have. This view also shows the customer email, the Cassandra version, the software version, the hardware version, the average node count, and various costs. I can also look at my oldest node, so I can see if the cluster has a very old node we need to replace, and then we will just run remediations. There’s a job that scans for old nodes and runs terminations. In the interest of space, I have not shown many columns, but you can pick what information you want to see.

Figure 2: CDE Self Service UI

We have another UI for creating new clusters, specific to each data store. An application user needs to provide only a cluster name, email address, the amount of data they are planning to store, and the regions in which to create the cluster; then the automation kicks off the cluster creation in the background. This process makes it easy for a user to create clusters whenever they want, and since we own the infrastructure, we make sure that the cluster creation is using the right version of the data store with all of the best practices built in.

When an upgrade is running, it can be tricky to figure out what percentage of the test clusters and prod clusters have been upgraded across a fleet that numbers in the thousands. We have a self-service UI to which application teams can log in to see how far along we are in the upgrade process.

Machine learning
Earlier, I mentioned having to predict the future. Our telemetry system stores two weeks of metrics, and previous historical data is pushed to S3. We analyze this data using Kibana dashboards to predict when the cluster will run out of capacity.

We have a system called predictive analysis, which runs models to predict when a cluster will run out of capacity. The system runs in the background and pages us or notifies us on a Slack channel when it expects a cluster to exceed capacity in 90 days. With Cassandra, we only want to use a
third of the storage allocation for the dataset, a third for the backups, and the last third for compactions. It is important to have monitoring in place and to have a system that warns us beforehand, not at the cusp of the problem because that leads to all kinds of issues.

Since we are dealing with stateful persistence stores, it is not easy to scale up. It’s easier with stateless services; you can do red/black or scale up the clusters with auto-scaling groups and the clusters can increase in size. But it’s tricky for persistence stores because it’s all data on nodes, and the stores have to stream to multiple nodes. That’s why we use predictive analysis.

Proactive maintenance
Things go down in the cloud and hardware is bound to fail. We registered to receive Amazon’s notifications and we terminate the nodes in advance instead of waiting for Amazon to terminate them for us. Because we are proactive, we can do the maintenance in the window we like, as well as hardware replacements, terminations, or whatever we want to do.

For example, we don’t rely on Cassandra’s bootstrap ability to bring up nodes because that takes a lot of time. It takes hours and sometimes even days for clusters, like some of ours, with more than one terabyte of data per node. In those cases, we have built a process that copies the data from the node, puts it into a new node, then terminates the first node.

Upgrades
Software and hardware upgrades across all these different instances of polyglot persistence are an effort because any change to the back end can have a big impact. A problem, like a buggy version, can compromise all of your uptime. We have built a lot of confidence into our upgrades with Netflix Data Bench (NDBench), an open-sourced benchmarking tool. It is extensible so we can use it for Cassandra, Elasticsearch, or any store that we want. In the NDBench client, we specify the number of operations we want to throw at our cluster, the payload, and the data model we want. This allows application teams to test their own applications using NDBench.

When we upgrade, we look at four or five popular use cases. For example, we may try to capture 80 percent reads and 20 percent writes or 50 percent reads and 50 percent writes. We are trying, with only a few use cases, to capture the more common payloads people are using in the clusters. We run the benchmark before the upgrade, capturing the 99th-percentile and average latencies. We perform the upgrade and run the benchmark again. We compare the before and after benchmarks to see if the upgrade has introduced any regression or has caused problems that increased the latencies. This helps debug a lot of issues before they happen in production. We never upgrade when this particular comparison reveals a problem. That’s the reason we are able to roll out all these upgrades behind the scenes without our application teams even realizing that we are upgrading their cluster.

Real-time health checks
We also handle health checks at the node level and cluster level. Node level is whether or not a data store is running and if we have any hardware failures. Cluster level is what one node thinks about the other nodes in the cluster.

The common approach is to use cron to poll all the nodes, then use that input to figure out whether or not the cluster is healthy. This is noisy, and will produce false positives if there are network problems from the cron system to the node or if the cron system goes down.

We moved from that poll-based system to continual, streaming health checks. We have a continual stream of fine-grained snapshots being pushed from all the instances to a central service we call Mantis, which aggregates all the data and creates a health score. If the score exceeds a certain threshold, the cluster is determined to be not healthy.

We have a few dashboards where we can see the real-time health. The macro view shows the relative sizes of the clusters with color coding to indicate if a cluster is healthy or not. Clicking on an unhealthy node will show a detailed view of the cluster and that node. Clicking on the bad instance shows details about what is causing trouble, which helps us easily debug and troubleshoot the problem.

Takeaway
The takeaway from all of this is that balance is the key to life. You cannot have all your microservices using one persistent store. At the same time, you don’t want each and every microservice to use a distinct persistent store. There’s always a balance, and I’m hoping with what I’ve covered you will find your own balance and build your own data store as a service.
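
The capacity rule described earlier in this article (page or notify when a cluster is expected to exceed capacity within 90 days, with only a third of the Cassandra storage allocation available to the dataset) can be illustrated with a small sketch. The article does not say how the predictive analysis is actually done, so the linear extrapolation below is an assumption, and the names `days_until_limit` and `should_warn` are hypothetical.

```python
from typing import Optional, Sequence


def days_until_limit(daily_usage_gb: Sequence[float],
                     allocated_gb: float,
                     dataset_fraction: float = 1 / 3) -> Optional[float]:
    """Project how many days remain before the dataset fills its share of disk.

    Fits a least-squares line through one usage sample per day and
    extrapolates from the latest sample. Returns None when usage is flat
    or shrinking, and 0.0 when the limit has already been reached.
    """
    n = len(daily_usage_gb)
    if n < 2:
        raise ValueError("need at least two daily samples")
    limit = allocated_gb * dataset_fraction  # assumed: one third for the dataset
    current = daily_usage_gb[-1]
    if current >= limit:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(daily_usage_gb) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(daily_usage_gb))
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    slope = sxy / sxx  # growth in GB per day
    if slope <= 0:
        return None
    return (limit - current) / slope


def should_warn(daily_usage_gb: Sequence[float],
                allocated_gb: float,
                horizon_days: float = 90) -> bool:
    """True when the projected limit falls inside the warning horizon."""
    remaining = days_until_limit(daily_usage_gb, allocated_gb)
    return remaining is not None and remaining <= horizon_days
```

A real system would likely need something more robust than a straight line (seasonality, compaction spikes), but the shape of the check is the same: project forward, compare against the horizon, and alert well before the limit is reached.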
Patterns for Microservice Developer Workflows and Deployment
Q&A with Rafael Schloming

KEY TAKEAWAYS

People don’t really care about moving to microservices per se. What they really care about is increasing feature velocity. In order to apply many people to a problem, you need to divide them up into teams, because people simply can’t communicate effectively within very large groups.

You can organize your people as independent, cross-functional, and self-sufficient feature teams that own an entire feature from beginning to end. When you do this, you end up breaking up that monolithic process that was the gating factor for feature velocity.

A microservice system of any complexity cannot be instantiated fully locally, and therefore a hosted development platform must provide developer isolation and developer-driven real-time deployments.

A service (mesh) proxy like Envoy is a good way to implement developer isolation through smart routing, and it can also provide developer-driven deployments using techniques like canary releasing.

InfoQ recently sat down with Rafael Schloming, CTO and chief architect at Datawire, and discussed the challenges that face modern software-driven organizations.
Although the implementation of microservices is often simply a side effect of the desire to increase velocity through application decomposition and decoupling, there are inherent developer workflow and deployment requirements that must be met. Schloming here elaborates further on this and discusses how Kubernetes and the Envoy service proxy (with control planes like Istio and Ambassador) can meet this need.

InfoQ: A key premise of your recent QCon San Francisco presentation appeared to be that organizations that are moving from a monolithic application to a microservice-based architecture also need to break up their monolithic process. Can you explain a little more about this?

Rafael Schloming: This is actually based on the premise that people don’t really care about moving to microservices per se; what they really care about is increasing feature velocity. Microservices simply happen to be a side effect of making the changes necessary to increase feature velocity.

It’s pretty typical for organizations as they grow to get to a point where adding more people doesn’t increase feature velocity. When this happens, it is often because the structure and/or process the organization uses to produce features have become the bottleneck, rather than the headcount.

When an organization hits this barrier and starts investigating why features seem to be taking much longer than seems reasonable given the resources available, the answer is often that every feature requires the coordination of too many different teams.

This can happen across two different dimensions. Your people can be divided into teams by function: product versus development versus QA versus operations. Your people can also be divided up by component: e.g., front end versus domain model versus search index versus notifications. When a single feature requires coordinating efforts across too many different teams, the gating factor for delivering the feature is how quickly and effectively those different teams can communicate. Organizations structured like these are effectively bottlenecked by a single monolithic process that requires each feature to be understood (at some level) by far too much of the organization.

InfoQ: So how do you fix this?

Schloming: In order to apply many people to a problem, you need to divide them up into teams somehow, because people simply can’t communicate effectively in very large groups. When you do this you are making a set of tradeoffs. You are creating regions of high-fidelity communication and coordination within each team, and creating low-fidelity communication and relatively poorer coordination between teams.

To improve feature velocity in an organization, you can organize your people as independent, self-sufficient feature teams that own an entire feature from beginning to end. This will improve feature velocity in two ways. First, since the different functions (product, development, QA, and operations) are scoped to a single feature, you can customize the process to that feature area: e.g., your process doesn’t need to prioritize stability for a new feature that nobody is using. Second, since all the components needed for that feature are owned by the same team, the communication and coordination necessary to get a feature out the door can happen much more quickly and effectively.

When you do this, you end up breaking up that monolithic process that was the gating factor for feature velocity, and you create many smaller processes owned by your independent feature teams. The side effect of this is that these independent teams deliver their features as microservices. The fact that this is a side effect is really important to understand. Organizations that look to gain benefit directly from microservices without understanding these principles can end up exacerbating their problems by creating many small component teams and worsening their communication problems.

InfoQ: Could you explain how this relates to the three development phases that, as you mentioned, applications progress through: prototyping, production, and mission-critical?

Schloming: Each phase represents a different tradeoff between stability and velocity. This in turn impacts how you optimally go about the different kinds of activities necessary to deliver a feature: product, development, QA, and operations.

In the prototyping phase, there is a lot of emphasis on putting features in front of users quickly, and because there are no existing users, there is relatively little need for stability. In the production stage, you are generally trying to balance stability and velocity. You
want to add enough features to grow your user base, but you also need things to be stable enough to keep your existing users happy. In the mission-critical phase, stability is your primary objective.

If the people in your organization are divided along these lines (product, development, QA, and operations), it becomes very difficult to adjust how many resources you apply to each activity for a single feature. This can show up as new features moving really slowly because they follow the same process as mission-critical features or it can show up as mission-critical features breaking too frequently in order to accommodate the faster release of new features.

By organizing your people into independent feature teams, you can enable each team to find the ideal stability versus velocity tradeoff to achieve its objective, without forcing a single global tradeoff for your whole organization.

InfoQ: Another key premise from the talk was that teams building microservices must be cross-functional and able to get self-service access to the deployment mechanisms and the corresponding platform properties like monitoring, logging, etc. Could you expand on this?

Schloming: There are really two different factors here. First, if your team owns an entire feature, then it needs expertise in all the components that go into that feature, from front end to back end and anything between. Second, if your team owns the entire lifecycle of a feature from product to operations, your team needs expertise in all these different engineering-related activities; it can’t just be a dev team.

Of course, this can require a lot of expertise, so how do you keep the team small? You need to find a way for your feature teams to leverage the work of other teams in the organization without the communication pathways between teams getting in the critical path of feature development. This is where self-service infrastructure comes into play. By providing a self-service platform, a feature team can benefit from the work that a platform team does without having to file a ticket and wait for a human to act upon it.

InfoQ: What kind of tooling can help with self-service access for deployment, and also to the platform?

Schloming: Kubernetes provides some great primitives for this sort of thing: e.g., you can use namespaces and quotas to allow independent teams to safely coexist within a single cluster. However, one of the bigger challenges here comes with maintaining a productive development workflow as your system increases in complexity. As a developer, your productivity depends heavily on how quickly you can get feedback from running code.

A monolithic application will typically have few enough components that you can wire them all together by hand and run enough of the system locally that you have rapid feedback from running code as you develop. With microservices, you quickly get to the point where this is no longer feasible. This means that your platform, in addition to being able to run all your services in production, also needs to provide a productive development environment for your developers. This really boils down to two problems:

1. Developer isolation: With many services under active development, you can’t have all your developers share a single dev cluster, or everything is broken all the time. Your platform needs to be able to provision isolated copies of some or all of your system purely for the purpose of development.

2. Developer/real-time deployments: Once you have access to an isolated copy of the system, you need a way to get the code from your fingertips running against the rest of the system as quickly as possible. Mechanically, this is a deployment because you are taking source code and running it on a copy of prod.

This is pretty different though in some other important respects. When you deploy to production there is a big emphasis on strict policies and careful procedures: e.g., passing tests, canary deploys, etc. For these developer deployments, there is a huge productivity win from being able to dispense with the safety and procedure and focus on speed: e.g., running just the one failing test instead of the whole suite, not having to wait for a git commit and webhook, etc.

InfoQ: Could you explain these problems and how to solve them in a little more depth?

Schloming: For developer isolation, there are two basic strategies:

• Copy the whole Kubernetes cluster.

• Use a shared Kubernetes cluster, but copy individual resources (such as Kubernetes
services, deployments, etc.) for isolation, and then use request routing to access the desired code.

Almost any system will grow to the point of requiring both strategies.

To implement developer isolation, you need to ensure all your services are capable of multiversion deployments, and you need a layer-7 router, plus a fair amount of glue to wire it all into a safe and productive workflow on top of git. For multiversion deployments, I’ve seen people use everything from sed to envsubst to fancier tools like Helm, ksonnet, and Forge for templating their manifests. For a layer-7 router, Envoy is a great choice and super easy to use, and is available within projects like Istio and the Ambassador API gateway that add a more user-friendly control plane.

For developer/real-time deployments, there are two basic strategies:

• Run your code in the Kubernetes cluster, and optimize the build/deploy times.

• Compile and run your code locally and then route traffic from the remote Kubernetes cluster to your laptop, and from the code running on your laptop back to your remote cluster.

Both these strategies can significantly improve developer productivity. Tools like Draft and Forge are both geared towards the first strategy, and there are tools like kube-openvpn and Telepresence for the second.

One thing is for sure, there is still a lot of DIY required to wire together a workable solution.

InfoQ: You mentioned the benefit that service-mesh technology, like Envoy, can provide for interservice communication (“east-west” traffic) in regard to observability and fault tolerance. What about ingress (“north-south” traffic)? Are there benefits to using similar technology here?

Schloming: Yes. In fact, in regard to bang for buck, this is the place I would look to deploy something like Envoy first. By placing Envoy at the edge of your network, you have a powerful tool to measure the quality of service that your users are seeing, and this is a key building block for adding canary releases into your dev workflow, something that is critical for any production or mission-critical services you have.

InfoQ: How do you think the Kubernetes ecosystem will evolve over the next year? Will some of the tools you mention become integrated within this platform?

Schloming: I certainly wouldn’t be surprised to see deeper integration between Envoy and Kubernetes. One thing I certainly hope to see is some stabilization. Kubernetes and Envoy are both foundational pieces of technology. Together they provide the core parts of an extremely flexible and powerful platform, but you really need to spend a while becoming an expert in order to leverage them.

I think in regard to the larger ecosystem, we’ll see more projects geared at allowing non-experts to leverage some of the benefits these tools can offer.

InfoQ: Is there anything else you would like to share with InfoQ readers?

Schloming: The Datawire team is working on a range of open-source tooling for improving the Kubernetes developer experience, and so we are always keen to get feedback from the community. Readers can contact us through our website, Twitter, or Gitter, and you can often find us speaking at tech conferences.

The video from Schloming’s QCon San Francisco 2017 talk “Microservices: Service-Oriented Development” can be found on InfoQ alongside a summary of the talk.
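
As a rough illustration of the request routing Schloming describes for developer isolation on a shared cluster, and of the canary releasing mentioned in the key takeaways, the decision a layer-7 router makes per request can be sketched in a few lines. This is not Envoy's, Istio's, or Ambassador's API: the function `choose_backend`, the `x-dev-version` header, and the backend addresses are hypothetical names chosen for the example, and header names are assumed to be already lowercased.

```python
import random
from typing import Mapping, Optional


def choose_backend(headers: Mapping[str, str],
                   dev_backends: Mapping[str, str],
                   stable: str,
                   canary: Optional[str] = None,
                   canary_weight: float = 0.0,
                   rng: Optional[random.Random] = None) -> str:
    """Pick the backend that should serve one request.

    1. A request carrying the (hypothetical) x-dev-version header goes to
       that developer's isolated copy of the service, if one is registered.
    2. Otherwise, roughly canary_weight of the traffic goes to the canary.
    3. Everything else goes to the stable version.
    """
    wanted = headers.get("x-dev-version")
    if wanted is not None and wanted in dev_backends:
        return dev_backends[wanted]
    source = rng if rng is not None else random
    if canary is not None and source.random() < canary_weight:
        return canary
    return stable
```

With a routing table such as `{"alice": "orders-alice:8080"}`, only requests that explicitly ask for Alice's version reach her copy, so other developers sharing the cluster keep seeing the stable service. In practice this logic would live in the proxy's route configuration rather than in application code.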
Debugging Distributed Systems
Idit Levine Discusses the Squash Microservices Debugger

KEY TAKEAWAYS

Debugging a microservice-based application is more challenging than debugging a monolithic application as it is difficult to attach a native debugger to multiple processes that communicate across a network.

Currently, the best approach to debugging microservices relies on obtaining a trace of all transactions and dependencies using tools that, for example, implement the OpenTracing API standard. OpenTracing tools are powerful, but they have limitations and gaps, especially for ad hoc observation.

Squash is an open-source microservice debugging tool that orchestrates run-time debuggers attached to microservices (running within containers deployed onto IaaS or CIaaS), and provides familiar features like setting breakpoints, stepping through the code, viewing and modifying variables, etc.

A service mesh may be the future best point of integration for such observation and debugging, and Squash currently includes early integration work with Istio and the Envoy service proxy.

At QCon San Francisco 2017, Idit Levine, founder and CEO of solo.io, presented “Debugging Containerized Microservices” in which she outlined the challenges of debugging a distributed microservice-based system.

Levine began by comparing the debugging of monolithic and microservice-based applications. A monolithic application typically consists of a single process, and attaching a debugger to this process reveals the complete state and the flow of execution. Because a microservice-based application is
inherently a distributed system consisting of multiple processes communicating over a network, this adds significant complexity to the challenges of effective debugging.

The remainder of the talk presented three approaches to debugging microservices: distributed tracing, using the open-source Squash microservices debugger that Levine has created, and exploiting the underlying capabilities of a service mesh.

Distributed tracing tools, such as Open Zipkin (which implements the OpenTracing API specification hosted by the Cloud Native Computing Foundation), can be used to monitor and understand the flow of execution through a microservices-based application. This approach has several advantages: it easily sends data to any logging tool, even from OSS components; it enables critical-path analysis; developers can drill down into request latency and other associated trace context metadata in very high fidelity; and operators can conduct system topology analysis and identify bottlenecks due to shared or contended resources.

Disadvantages to distributed tracing include: the inability to perform run-time debugging or modification of application state; the approach often requires wrapping/decorating and changing the code, which can incur a performance penalty at run time; and there is no holistic view of the application state, as developers can only see what was output as part of the trace and associated baggage.

For the second approach, Levine presented her company’s open-source Squash microservices debugger. Squash currently supports debugging within Visual Studio Code (VS Code), or in IntelliJ for Java and Kubernetes only, of microservice applications that are written in a language that can be debugged by Delve (the Go language), GDB (C++, Objective C, Java, etc.), and its own language-specific debugging protocols for Java, Node.js, and Python. The services must be deployed to the Kubernetes container-orchestration platform or a platform that can use Istio, which itself is currently
Kubernetes-focused. (Istio does offer limited support for Docker and HashiCorp’s Nomad, but it is worth noting that all of Squash’s Istio examples use Kubernetes as the underlying platform.) Squash would like to add support for more IDEs, languages, and runtime platforms, and encourages community contributions.

The Squash architecture consists of a Squash server that is deployed and runs on the target platform (for example, as a DaemonSet on a Kubernetes node). The server holds the information about the breakpoints for each application, and orchestrates the Squash clients. The Squash clients also deploy on the target platform. Squash uses an IDE as its user interface: as mentioned, currently only VS Code and IntelliJ (for Java and Kubernetes). Squash commands are available in the IDE command palette after installing the Squash extension.

Levine’s third approach to debugging microservices is to use the capabilities of a service mesh such as Istio or Envoy. A service-mesh data plane, such as Envoy, touches every packet/request in the system, and is responsible for service discovery, health checking, routing, load balancing, authentication/authorization, and observability. A service-mesh control plane, such as Istio, provides policy and configuration for all running data planes in the mesh. These properties provide ideal points of introspection and execution flow control. Istio currently integrates with the Open Zipkin and Jaeger distributed tracing systems and, as mentioned, with Squash.

She concluded by suggesting that the ultimate solution would be to integrate all three of these approaches to debugging, and encouraged the audience to get involved via the solo.io Slack channel and contribute to Squash.

InfoQ recently sat down with Levine to elaborate on the challenges of observing and debugging distributed systems and applications.

InfoQ: How has operational and infrastructure monitoring evolved over the last five years? How have cloud, containers, and new architectural styles like microservices impacted monitoring and debugging?

Idit Levine: Monitoring the state of an application is important during development and in production. With a monolithic application, this is rather straightforward, since one can attach a native debugger to the process and have the ability to get a complete picture of the state of the application and its evolution.

Monitoring a microservice-based application poses a greater challenge, particularly when the application is composed of tens or hundreds of microservices. Due to the fact that any request may involve being processed by many microservices running multiple
times, potentially on different servers, it is exceptionally difficult to follow the “story” of the application and identify the causes of problems when they arise.

Currently, the main methodology relies on obtaining a trace of all transactions and dependencies using tools that, for example, implement the OpenTracing standard. These tools capture timing, events, and tags, and collect this data out of band (asynchronously). OpenTracing allows users to perform critical-path analysis, monitor request latency, perform topological analysis, and identify bottlenecks due to shared resources. Users can also log what they think could be useful data, like the values of different variables, error messages, etc.

InfoQ: We’ve been eagerly watching the evolution of Squash and would be keen to hear the goals of the project and the rationale for creating this.

Levine: OpenTracing tools are very powerful, but they have limitations and gaps. Since logging the state of the application during run time can be expensive and result in performance overhead, one needs to limit the amount of collected information. One way to do this is to follow only a subset of the transactions, and not all of them. Tuning the size of this sample represents a tradeoff between the amount of information collected on one hand and the price in performance and costs on the other.

One consequence is that once a problem is identified, it is possible that some needed information is missing. Obtaining this information requires running the application again, and waiting for the data to be collected.

Moreover, OpenTracing is not a run-time debugger and does not allow changing variables during run time to explore potential solutions to a problem. Any attempt to fix a problem requires wrapping the code, running the application, and waiting for the data again. Solving a problem may necessitate several such iterations, which can be both daunting and expensive.

Our vision for Squash is to complement the OpenTracing tools and close these gaps. The main goal of Squash is to provide an efficient tool for debugging microservices applications. Squash orchestrates run-time debuggers attached to microservices, providing familiar features like setting breakpoints, stepping through the code, viewing and modifying variables, etc. Importantly, Squash allows the developer to seamlessly follow the application and skip between microservices. Squash takes care of all the necessary piping, allowing developers to focus on their own code and solve the issues they actually care about. To make Squash accessible and easy to adopt, it integrates with existing popular IDEs.

Squash is designed to provide essential capabilities for monitoring the lifecycle of an application both in the development phase, allowing development of robust code, and during production, allowing fast adaptation of the code when new difficulties arise.

InfoQ: What other tools do you think future developers will need to understand and debug large-scale, rapidly evolving container-based applications?

Levine: As a community, we should aspire to provide distributed applications the same level of observability and control that is available for monolithic applications. A combination of existing tools already points us in the right direction. Log collection can be done by OpenTracing tools, metrics collected by Prometheus, and debugging by Squash. All of these methods should plug into a service mesh to achieve full efficiency.

InfoQ: What role do you think QA/testers have in relation to observability and debuggability of a system?

Levine: In one possible mode of action, I would expect the QA and testers to focus on the logs and provide context. With container-based applications, this should be done using OpenTracing. The developer will then be able to reproduce the bug and use Squash to attach a debugger, step through the code, and resolve the issue.

InfoQ: Is there anything else you would like to share with the InfoQ readers?

Levine: We at solo.io are working hard at building more open-source tools to facilitate microservices development and operation. In particular, we are focused on innovative and helpful tools to accelerate adoption of microservices in the enterprise. We are super excited about our plans for 2018. Please stay tuned!
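
The sampling tradeoff Levine describes, following only a subset of transactions to limit tracing overhead, can be sketched as follows. This is a generic illustration, not part of the OpenTracing API or any particular tracer: the function `keep_trace` and the idea of hashing the trace ID are assumptions made for the example.

```python
import hashlib


def keep_trace(trace_id: str, sample_rate: float) -> bool:
    """Decide whether to record a transaction, based only on its trace ID.

    Hashing the ID makes the decision deterministic, so every service that
    sees the same trace ID reaches the same answer without coordination.
    sample_rate is the fraction of transactions to keep, from 0.0 to 1.0.
    """
    if not 0.0 <= sample_rate <= 1.0:
        raise ValueError("sample_rate must be between 0 and 1")
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # uniform in [0, 2**64)
    return bucket < sample_rate * 2**64
```

Raising `sample_rate` collects more information at a higher performance cost, and lowering it does the opposite. Whatever the rate, a transaction that was not sampled leaves no trace behind, which is the gap Levine points to: the application has to be run again to capture the missing data.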
Microservices Patterns and Practices Panel

Microservices almost seem to be the de facto way to build systems today, but are they always the answer?

KEY TAKEAWAYS

Defining the boundary contexts for microservices is a particular challenge. In general, you need to really know your problem domain before you can get this right. In a field like banking, where the boundary cases are already well known and have been for many years, you can operate as a startup. That’s less true in other problem domains, but you can start by organizing around business objects.

Decomposition applies at many levels. In a sense, you decompose methods, classes, packages, and modules, and so microservices is just another level in that kind of hierarchy. However, they also have a strong relationship to team structure.

Scale is one strong reason to consider microservices, and the most often cited perhaps along with team velocity, but another is security: if you have two things that shouldn’t share a trust domain, for example.

If you do choose microservices, you’ll face challenges at scale at both a technical and organizational level. What strategies should you use now that you are effectively building a distributed system? What’s the one thing you wish you’d known before you got here?

This panel session brought together many of the most popular session speakers at QCon San Francisco 2017 for a frank discussion on microservices with the Microservices track host.
The panelists were:

• Chris Richardson, the author of POJOs in Action and founder of the original Cloud Foundry, an early Java PaaS for Amazon EC2

• Randy Shoup, a 25-year veteran of Silicon Valley, currently VP Engineering at Stitch Fix

• Louis Ryan, a core contributor to Istio and gRPC and a principal engineer at Google

• Roopa Tangirala, leader of the Cloud Database Engineering team at Netflix

• Rafael Schloming, co-founder and chief architect of Datawire

The session was recorded live as the panelists took questions from the audience. We’ve lightly edited the transcript.

Question: How do you manage your data when you are doing red/black deployments? For instance, you might have a version that writes new records, which the old version doesn’t understand. How does the old version know it is not an error but actually real data?

Roopa Tangirala: In most red/black deployments at Netflix, whenever there are data changes, you can do it. It is stateless; it is not a problem. But when you have changes for Cassandra, it is schema-less. So when you are adding a new column or changing schema, you don’t necessarily need to do a DDL to change the schema. You can keep directly inserting into the new data set with the new column. That is one way.

And the other way, if they help the migration from one column family to another, we help them build tools; we own the client libraries so we can help them write to the old and to the new. We have tools like Forklift, which helps move from the source to the destination. But not all red/black deployments need changes where we are moving data around, at least at Netflix.

Chris Richardson: For zero-downtime deployments, constrain the kinds of changes you can make at the database level. So you could add a nullable column, for instance, but you cannot just drop a column. So carefully make changes, and decouple database schema changes from your zero-downtime deployment.

Randy Shoup: Yeah, don’t do what you just said. Not even kidding. What you did is you broke the interface. You made a non-backward-compatible data change and you exposed it to other people, and you did it, in a way, in between a minor release. To people familiar with semantic versioning, what you just described was a non-backward-compatible major version change: “We used to produce data in this form, and now we produce it in this non-backward-compatible other form.”

And so, don’t do that. What you can do is what Chris said. There are lots of ways to make backward-compatible changes. You can add an optional field; we’d need to talk about it in a little more detail, sadly. But the idea is that, as a service owner, your primary job is never to break the people that use your service. So you are never allowed to break clients, which are consumers of your events.

Q: Deciding boundary contexts for microservices could be as easy as having orders, and then there could be five types of orders, and then the microservice becomes a monolith after a while. How do you decide on a boundary context so that it is still good enough after a couple of years?

Louis Ryan: Mostly, it is probably going to be informed by your development practice in your development divisions, rather than any strict semantic thing you would try to guess from the get-go. I tend to think of microservices as emergent patterns that come out of the need for decoupling. Usually, the decoupling works at development-team boundaries pretty well, or at functional responsibilities within development teams. That’s where I would start.

Richardson: I would sort of say this is one of the hardest problems, and it is really not specific to microservices. Another way of rephrasing it is “What are the boundaries of my module?” And I think picking module boundaries is difficult.

Unfortunately, there is no mechanical process that, if you apply it, will come up with a perfect set of module boundaries. In the case of services, most of them are organized around particular business objects, like order management and customer management and so on. But that is your first guess, and you go with that, and if later on, you find out that some services got too big, then hopefully at that point there is
2018 Microservices // eMag Issue 59 - Mar 2018 23
a clear-enough boundary between the two
internal parts of that module to let you split
it in a meaningful way.
The point of a service is to enable a small
team of developers to deliver rapidly and
safely. And so if a service gets too big, that
really means the team that is developing it
has gotten too big, and they are weighed
down by communication overhead. And so
you kind of want to split the team, and you
want to split the code, so they can go back
to being small, nimble teams again.

Shoup: Probably, if you are a five-person startup, you might not want to start with microservices. Part of the why is that it is still a little bit complicated, maybe a lot complicated, to build a distributed system and everyone’s questions are around things that are complicated. When you are small, you want to start off doing something different.

And another part of it is you want to understand your domain and be able to decompose it in a reasonable way before you do microservices, because microservices are just a physical manifestation of a decomposition of your domain. So I have found, because I have tried and failed to do it many times, that my first cut at a new problem and figuring out the decomposition of the domain is messed up all the time. And I have gray hair/no hair; I have been doing this for a while.

There are two rare exceptions to the rule of maybe don’t start with microservices when you are tiny. The first is if your MVP requires scale. So, if you are building the Heroku competitor, for example, you are building internet infrastructure, so you’d better scale from the beginning. That’s a requirement.

And the second is if you know your domain super well. One great example is people building new banks. Nubank from Brazil, who gave the first talk yesterday in the Architectures You Always Wondered About track, started with microservices. Why? Because the decomposition of the banking domain we have known for the last 50 years, the components of the bank, are really well understood. But the rest of us, we don’t know the domain well enough, and that is why this is such an important problem.

Q: We know that we’re a monolithic application, and we know that we want to get to business-context-type services. But where does that line get drawn? Is it a product level, an API level, a microservice level? Is it just what feels right?

Rafael Schloming: That is a hard question, but I think one of the ways to, sort of, think about it is actually something Randy said earlier, which is don’t think about the size of a microservice in terms of its lines of code, think about the scope. And how do you define scope? Well, you need to understand what it is you are trying to achieve at a high level, in one or two sentences.

It is really a negotiation between the user of a service and the team that delivers that service. You need to track the usage; if your users are happy, then you’re done. It really helps to think in terms of that framing, understanding who the user of the service is, and going from there. And, from that perspective, you can just try a lot of different kinds of APIs that will sort of serve the same mission and figure out what you need. And, again, you can track how successful you’re making your users in order to measure your progress as you iterate through the difficult design space.

Shoup: So this is a little bit of a flip response, but I don’t mean it in any aggressive way. Do you guys build one class, like one language class in Java or whatever? How do you know what the scope of the classes are? That’s a design thing. The class is a single responsibility; we try to make the interface minimal and try not to be chatty. The reason we ask it that way is not to put you on the spot, and the people that are working for me are laughing right now: this is a thing that I have done many times with my team, where this is a legit thing to say.

That’s the answer. You know more than you think about how to design services. If you know how to design classes, for the most part you know how to design services. The only part is recognizing that you cannot be as chatty with services as you can be with something that is in process.

Richardson: Decomposition applies at many levels. In a sense, you decompose methods, classes, packages, and modules, and so the microservices is just yet another level in that kind of hierarchy. One comment I would add is that I think microservices kind of have this important relationship with team structure as well. I think there are two models for microservices. There is this super fine-grain model, which is one service per developer, that seems to be happening at some companies. Or when you have thousands of services, that order of magnitude. Another way of thinking about services is as a small enough “application” that its team can remain nimble and agile. That is a much coarser-grain model of microservices. And so that impacts decomposition.

Ryan: I think it is probably a common problem for a lot of people in the room, that they have a monolith yak that they want to shave, and that is totally fine. Start shaving where you think shaving adds value, and stop shaving where you are not getting any more value. It is okay to have a monolith if it is doing what it is supposed to do. I know that might be heresy here, but if it is doing what it is supposed to do, why touch it? If it is not, shave it, and iterate.

Shoup: A related, excellent question is, more or less, are microservices worth it? And the answer, for most of us, is maybe not. As I tried to say in my talk, it is the 0.1 percent, or 0.01 percent that get really large, where you absolutely need them; there is no way Google, Amazon, Netflix, Stitch Fix work without microservices. But if you don’t have a huge load, it is fine to stick with a monolith. When should you go with microservices? Well, when are you unable to scale things independently, when does it slow down, when do things evolve at different rates? That’s the wall that you have to scale with microservices.

Richardson: And I want to add to that. If your development velocity is not where it needs to be, I would actually start to review your development practices before switching to microservices. So, for instance, if you are not doing automated testing thoroughly, and I think probably 70-plus percent of organizations, according to a SourceLab report, have not completely embraced automated testing. So if you are one of those, work on that first. And then, you know, once you have the hang of that and you really are able to automate as much as possible, then think about the microservice architecture. It is kind of like try walking before you run.

Schloming: That’s a great point, and a great thing to do is just (it doesn’t need to be super heavyweight) to track where you spend your time. If you are doing lots of manual testing and that is slowing you down, you don’t necessarily think about that on a day-to-day basis. And, you know, if you are spending a lot of time wrestling with particular areas of your monolith, maybe that’s the time you should start shaving that particular patch of yak.

Ryan: So I think Randy gave a couple examples of why you might want to do that, scale being one of the more obvious ones that is quoted in the industry. I think there are other reasons; security is a big reason why you might want to shave your monolith, because you have two things that should not be stuck together in the same trust domain. That’s a big reason. The development experience is clearly one; release velocity is a big deal. So there’s a variety of reasons out there. You know your domain, you know what is going on in your domain, you should be able to reason about those types of decisions.

Richardson: From my perspective, I think that microservices are primarily a way to tackle complexity rather than scale. Obviously, it is a way to scale, but complexity is first.

Q: Can you guys comment on what patterns teams are using to get to microservices? Do I start in the middle where it really matters with an important object or do I do it on the side where it doesn’t make a big difference? Can I just slap a REST API on an existing app and call it a microservice?

Richardson: Well, you know, if your yearly bonus depends on having a microservice…. This term “microservice” really does get heavily abused, right? “Can we use a microservice for that” is just kind of the wrong notion, from my perspective. Microservice is shorthand for the microservice architecture, which is an architecture style for an application; it is all about having a system.

Say you have this massive monolith, and there is one part of it that is under very, very active development and another part that you never touch, and you want to extract them out. If you want to build a microservice or a service, then extract out the parts of your application that are frequently changing. Because that will give you the biggest bang for the buck.

Think about your monolith that is on the slow track of development, and everything that you extracted out of the microservice is on the fast track, the rapid deploys and all of that. So you want to invest the effort in those areas that really, really make a difference.

Shoup: I’m going to make some architectural change from the monolith to the microservices. So I want to prove that this fancy millennial way of doing a microservice is actually a thing and will work in our environment. So step zero is to do a pilot. And the way I would like to think about that is to take a vertical, end-to-end actual experience that matters to our business.

Let’s imagine that you have something that actually matters to users and you want to build that in a new way. It could legitimately be a new thing you will build a new way or an old thing that exists that you will rebuild in a new way. Either way, take a vertical end-to-end thing and build it in a new way.

Why? We are building a pilot, we want to de-risk it and we want to learn all the things we don’t know about the microservice thing. We are doing it as a pilot rather than building the entire infrastructure. We do a vertical end-to-end user experience because we want to be able to be focused on some particular thing and that tells us what we need to do and don’t need to do. If we choose something that doesn’t matter, we don’t know what is in or out. If we choose something that is actually useful, then that will help us to focus on the minimal thing we need to do to get our job done. And the other reason we choose something useful is if that doesn’t work, at the worst we have provided some value to our customers.

So that’s the step zero, that pilot. And now that that pilot is successful, and we have learned all of these things about how to do things in a new way, then we will call it “microservices”. The steps 1 to N are to take the things that have the highest return on investment: not the easiest things, not necessarily the hardest things, but the things that have the highest return on investment and we convert those to the new way first.

So think about the areas that are really fast-changing, maybe that have the highest ROI, or the part of the site with the highest revenue; that would be a place where it would be valuable to move faster to make more revenue. You did the pilot, you de-risked it, and then you do the highest ROI, and then the second and third highest, and you keep going until you run out of patience or resources. And if you run out before you are done, that is cool, because the monolith that still exists is something that you don’t care too much about. There wasn’t the ROI, it didn’t go above the bar of what it would take for you to, you know, get motivated to convert it to microservices.

That is exactly what eBay did. eBay had this monster C++ monolith and they broke it into many applications written in Java. So it wasn’t microservices, but the principle is the same. Once they did the pilot and they convinced themselves that Java could work in the eBay infrastructure with the skill set and people and all that kind of stuff, they basically
reverse-sorted the site: they took the pages on the site and reverse-ordered them by revenue contribution. So they converted first the top-revenue pages, not because they desperately wanted to have the greatest risk but if and when they ran out of patience, money, or resources, they had the most valuable things done.

They had started the re-architecture in 2000 or 2002 or something like that, and they had mostly finished by, I want to say, 2007 or 2006. It took a while, and even after I left in 2011, there were still things that were on that v2 C++ monolith architecture, but they were things that nobody used; they were simple, they didn’t change. So there was no ROI to convert them to the new way.

Q: My question is about the communication between microservices. We talk about having events, so service A talks to service B. For a business-critical service like credit-card processing, we see a lot of patterns by Kafka or other brokers once the message is in the broker, and there are ways to recover or retry. But what’s the recommendation to ensure that the credit-processing service does issue the event? Kafka now has Kafka Connect, which can publish every database commit or every database transaction straight to Kafka. What if the business object is not the same as what you have in the database?

Tangirala: In terms of services, each application service is the source of truth for the data it is serving. So, for payments processing, in Netflix’s case, they don’t use Kafka, they have different payments stores. They are using transactional data stores for that payment processing. But basically the idea is that each service is owning the piece of data it is responsible for, and it is the source of truth for that. That is how the interactions happen: other services will ask the service instead of directly either copying the dataset or having multiple copies in their back end.

Richardson: There are several parts to it. One is atomically publishing a message when the data changes. So, conceptually, there’s a transaction involved in updating the database and publishing a message. There’s a whole thing around transactional messaging, which is kind of a super interesting topic. And so, it ends up reliably being published to the message broker. That’s step one. In step two, your message broker has to be reliable. That’s what Randy was talking about with at-least-once delivery. And at the consumer end, you need idempotent event or message handling to ensure the correct semantics, and that includes keeping track of all of the message IDs you have seen. It is a whole complex topic, some of which I cover in my book, Microservice Patterns (shameless plug).

Schloming: This area of ownership is like designing classes: ownership and the whole area of communication and this whole event thing. That is where you are transferring responsibility for ownership of some data. And that is where microservices get the most different from designing classes, or one of the areas they get the most different from designing traditional class APIs, because you don’t have this same locality of data in the context of a class hierarchy. So it is just something to keep an eye out for.

Q: Related to event-driven architecture, can you share your thoughts from the panel on the use of either pass by value or pass by reference on those messages, how the consumers work with that message, and maybe your thoughts on how to handle ordering those?

Ryan: I can give my opinion, which might also be slightly heretical here, but this is influenced by Google scale. We mostly don’t do it. Most service-to-service communication is not reconciled through a broker. We use things like retries and network-level things to get scale by not hitting storage. So again, it is one of the scale questions. If persistent queues in storage give you reliability that you need at your application level at scale, then you should use it. And, I think, at certain scales, some of the patterns might become a little bit more limiting, particularly depending on the amount of work waiting for that. So it is not that we’re anti that pattern, per se; we do use that pattern, and we use that pattern encapsulated behind an API with a clear segmentation of responsibility. But, for the most part, we don’t do it. We don’t do rendezvous or that type of thing.

Richardson: I can’t believe you don’t use Kafka.

Ryan: We have things that look like it.

Richardson: But Kafka seems fashionable.

Ryan: So I hear.

Richardson: Rightly or wrongly. So when we have been talking about events, in my brain I have translated that into domain events, which are a concept from domain-driven design (DDD). One of the DDD books, Implementing DDD by Vaughn Vernon, has a chapter on domain events that includes a discussion of how much data you should have in an event. You have a choice. If an order is created, you can publish the order ID, but that is of no use to the consumer because they have to get the order. So there’s the concept of event enrichment, which says to put data that is useful to the consumer in the event. And when you publish an order-created event, stuff the order details in there. And when you are using event sourcing, where your events are your storage mechanism for your domain objects, you have no choice except to put the necessary data in there.

And your other point was ordering. I think ordered, at least once, delivery of domain events is really, really important, because if they arrive out of order, then you are going to have pretty weird behavior. And I mean, there might be other situations where you don’t care about delivery and you can just pub/sub an event, but ordering is usually quite important.

Shoup: You asked THE question, which is how I deal with event delivery when I might get the thing multiple times, and how do I deal with event ordering? So both of those things you don’t have in-process but with messaging across a distributed system you have those problems. I keep threatening to do an event master class.

So, again, on delivery, you can have at most once or at least once. If you care about your event, you want at least once. So that is on failure; I deliver it two times, three times, N times. At most once is basically for logging data, things that on failure you want to drop. Domain events do not fall into that category, but logging things are. That’s the first thing, and then you have this multiple times, and then you have to be idempotent; the consumer has to be able to correctly process the same event multiple times. CRDTs is something you should look into if you are kept up at night by these problems.

And there are several ways to deal with event ordering. You can deal with ordering in the bus; blecch, that is not so great. The other way to deal with it is to have events be the notification of a thing happening and then you go and read back to the service that produced the event for the current state of things. These are all legit ways of handling a problem, and think about these answers: there’s Randy’s way to do it versus Chris’s way to do it versus Louis’s. Think about that: there’s a space of solutions to this problem, and don’t take away from it that it’s solved by reasoning with first principles.

Martin Fowler of refactoring fame gave a wonderful keynote at GOTO Chicago 2017 on event-driven architecture. And he does, in his wonderful way, very clearly, discuss the pros and cons of events as notification, events that carry the data with them, event sourcing, etc.

Ryan: I want to throw in a cautionary tale, not necessarily a parable, I gave a talk earlier on superpowers: beware of superpowers. Event brokers are superpowers. Be careful when you put things into queues when you don’t know where or how they are going to come out, or who they will come out to. If you don’t know the answers to those questions, you shouldn’t put those things into a queue until you can answer the questions. If you have data that you care about or your users care about, you need to reason about those things to some degree. And event brokers have been pitched as a way to give operators control so they can answer those questions or validate that.

Q: When the Web started, everyone was writing interesting apps. Then came Rails and MVC and Rust and people started writing those, and then we had monolithic scales slow us down. Now microservice is the buzzword. You cannot walk into a company and not hear the word “microservices”. What are some things that you foresee after the microservice trend saturates? What is next? Microfunctions?

Ryan: Didn’t that already happen?

Shoup: The meta answer is look at what Google, Amazon, and Netflix are doing now. Meaning no shame, I will be flip: if you are asking that question, you are years behind these people. And that’s a good thing. You can look at what these larger architectures are sharing.

Richardson: At some level, there’s a limit beyond which it doesn’t meaningfully make sense to decompose a module. Go back to some of the classic work in object-oriented design like the common-closure principle: the things that change for the same reason should be packaged together. And that means, if you decompose a package into two packages (and, really, you have split this business concept across those two packages), then whenever that business concept changes, you are changing both of those packages. So you are going to see this lockstep.

So certainly, to me, there’s an anti-pattern in the microservice architecture, the distributed monolith, where you are really releasing multiple services simultaneously because of that. So that’s one part which is, sort of, from a logical perspective.

And then, from just sort of nuts-and-bolts technical thing, you can certainly say that when it comes to deployment, our unit of deployment has been getting increasingly lighter and more ephemeral. So, 10 or 15 years ago, if you wanted to deploy something, you had to get a physical machine. And now you just deploy a lambda on AWS, and in such a short amount of time, that’s been a radical transformation in how we deploy things. And so that, to me, is one kind of huge trend. And even from a design point of view, there’s this common-closure principle that you have to keep in mind.
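
To make the "just deploy a lambda" remark concrete, here is a rough sketch of how small such a unit of deployment can be. It assumes the `handler(event, context)` function shape that AWS Lambda uses for Python; the `order_id` field and the response body are hypothetical and exist only for illustration.

```python
import json


def handler(event, context):
    """Entry point the platform would call once per invocation."""
    # 'order_id' is a hypothetical field, used only for this illustration.
    order_id = event.get("order_id", "unknown")
    return {
        "statusCode": 200,
        "body": json.dumps({"order_id": order_id, "status": "received"}),
    }


if __name__ == "__main__":
    # Local smoke test; no AWS account or SDK is needed to run this file.
    print(handler({"order_id": "42"}, None))
```

The whole deployable artifact is a single function, which is the contrast with the physical machine of 10 or 15 years ago.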

Schloming: There’s a way I like to think about this question that is very complementary with this but from a different perspective, and that is thinking about the trends in terms of how many people you need to accomplish something. If you look at the transition from monolith to microservices through the organizational lens, it is a shift in the division of labor. You are taking the output of a team, an engineering team of thousands of people, and you are fundamentally assembling the output of that work in a different way into a single, coherent whole. Look at 10 years ago, the size of a team it took to deliver a given service. Today, a teenager could do the same thing out of his parents’ basement in a weekend, at least close to that.

And so I think that the limit of this really comes down to the point where that team size can effectively stop shrinking. It is how much a single developer can absorb and accomplish, until you throw in something like AI, which I’m sure people are doing now.

Richardson: Can I just respond? One thing that is interesting is I don’t know whether the productivity of an individual developer writing code has improved. Like, writing and creating brand-new code. So I look back and, some things have changed. Like machines have gotten faster and bigger, and if we are stuck, there is Google or Stack Overflow. And then there’s all of this open-source stuff, so we can quickly assemble a bunch of libraries together, and if we get stuck, we Google the answer. But, in terms of writing code from scratch, I feel like it is an individual developer muddling along somehow, scratching their head. And, if that hasn’t changed, we have not had a Moore’s law for software development in that regard.

Ryan: So if we are in the realm of predictions, I think some of the answers are sitting outside in the vendor booths. More and more of your code is running on the same network, and I’m not meaning only yours, I mean all of you at the same time. You are all putting your code into big cloud vendors; it is much more local with everyone else in this room than it was before. So we have this interesting networking effect. Microservices are not just a way for you to build services. It is also a way for you to consume services that other people have built for you.

When I look out there and I see vendors selling certain types of services, the thing that strikes me is that they’re smaller versions of things that bigger vendors used to sell. I look at the APM space when I see that. And you will see the trend continue when there are more micro-vendors; there will be more marketplaces that help you acquire services that can do interesting things. Somebody asked about geolocation. You can buy that as a service. It is a tiny little service; it does very little in terms of an API and a huge amount in terms of the back end. So that is one thing that we might expect to see going forward.

Schloming: I think that those two answers spark a lot. I don’t think a developer writes more lines of code, but they are way more productive because they figure out how to assemble a lot of things, and the other things or what Louis just mentioned are examples of that marketplace of other things to assemble.

Q: When I log into an application like Netflix, it is a pretty frictionless user experience. I log in once and I don’t get a sense that I’m logging into the microservice for my user profile, customer history, etc. How do you maintain this frictionless UI in microservices architecture? Most of us are writing applications that span multiple services but it is really just one application users are trying to go to. How do I maintain the advantages of a share-nothing architecture where I can deploy independently without dependencies between services yet maintain a user experience that is frictionless, unified, and with a consistent look and feel?

Tangirala: So, there are different tiers in the microservice layer. There is a front-end tier, which takes all the user traffic, and then we have a middle tier and back-end tier, which are your membership and all the core services that give that data set to you. And so, in terms of the UI integration, there is a lot of interaction between these services, but at any given time the source of truth is just one service.

I don’t have a lot of insights into the UI layer. But our UI team does a great job in making sure all the interactions between these microservices and the results that they are getting in the UI are seamless. There’s a lot of work that goes on behind the scenes, but each microservice is not related to the other. That way, you know which service to call.

Though there’s a lot of interaction, you have fallbacks as well. From the UI point of view, you don’t see that you are having a degraded experience. If you are not able to get your personalized list of movies to watch from that service, if you cannot go to that service, then they may fall you back to a fallback page. So you might not experience degraded service; you do not think you are not seeing your active list of movies as the service is giving you the fallback experience.
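
The fallback behavior Tangirala describes can be sketched roughly as follows. This is not Netflix's implementation: the function names, the default title list, and the injected `fetch_personalized` callable are all hypothetical, chosen only to show the shape of the idea, which is that a failed call to the personalization service degrades to a generic list instead of surfacing an error to the user.

```python
from typing import Callable, List

# Hypothetical generic list served when personalization is unavailable.
DEFAULT_TITLES: List[str] = ["Popular title A", "Popular title B"]


def titles_for_user(
    user_id: str,
    fetch_personalized: Callable[[str], List[str]],
) -> List[str]:
    """Return the personalized list, or the default list if the call fails."""
    try:
        titles = fetch_personalized(user_id)
    except Exception:
        # Any downstream failure degrades to the fallback instead of an error page.
        return list(DEFAULT_TITLES)
    # An empty answer is treated like a failure for display purposes.
    return titles if titles else list(DEFAULT_TITLES)


def healthy_service(user_id: str) -> List[str]:
    """Stand-in for a working personalization service."""
    return [f"Pick for {user_id}"]


def failing_service(user_id: str) -> List[str]:
    """Stand-in for a service that is down or timing out."""
    raise TimeoutError("personalization service unavailable")


if __name__ == "__main__":
    print(titles_for_user("u1", healthy_service))
    print(titles_for_user("u1", failing_service))
```

Passing the service call in as a parameter keeps the sketch self-contained; in a real system the call would go over the network and would usually carry a timeout as well.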
Watch presentation online on InfoQ
Managing Data in Microservices

Adapted from a presentation at QCon San Francisco 2017, by Randy Shoup, VP of engineering at Stitch Fix

KEY TAKEAWAYS
• Stitch Fix, a clothing retailer, employs nearly as many data scientists as engineers. The data scientists work on algorithms critical to the company’s success, and require a substantial amount of data to succeed.
• Although microservices may be necessary for achieving a highly scalable solution, do not start with the complexity of a highly distributed system until the company is successful enough that microservices become justified and necessary.
• All major companies that are now using microservices, including eBay, Twitter, and Amazon.com, have gone through a migration that started with a monolithic system.
• A true microservices platform requires each microservice to be responsible for its own data. Creating separate data stores can be the most challenging aspect of a microservices migration.
• The process for separating out a monolithic database involves a repeatable process of isolating each service’s data and preventing direct data access from other services.

I’m Randy Shoup, VP of engineering at Stitch Fix, and my background informs the following lessons about managing data in microservices.

Stitch Fix is a clothing retailer in the United States, and we use technology and data science to help customers find the clothing they like. Before Stitch Fix, I was a roving “CTO as a service”, and I helped companies discuss technologies and these situations.
Earlier in my career, I was director of engineering at Google for Google App Engine. That is Google’s platform as a service, like Heroku, or Cloud Foundry, or something like that. Before that, I was chief engineer for about six-and-a-half years at eBay, where I helped our teams build multiple generations of search infrastructure. If you have ever gone to eBay and found something that you liked, then great, my team did a good job. And if you didn’t find it, well, you know where to put the blame.

Let me start with a little bit about Stitch Fix, because that informs the lessons and the techniques of our breaking monoliths into microservices. Stitch Fix is the reverse of the standard clothing retailer. Rather than shop online or go to a store yourself, what if you had an expert do it for you?

We ask you to fill out a really detailed style profile about yourself, consisting of 60 to 70 questions, which might take you 20 to 30 minutes. We ask your size, height, weight, what styles you like, if you want to flaunt your arms, if you want to hide your hips… We ask very detailed and personal things. Why? Anybody in your life who knows how to choose clothes for you must know about you. We explicitly ask those things, and use data science to make it happen. As a client, you get five items delivered to your doorstep, hand-picked for you by one of 3,500 stylists around the country. You keep the things that you like, pay us for those, and return the rest for free.

A couple of things go on behind the scenes among both humans and machines. On the machine side, we look every night at every piece of inventory, reference that against every one of our clients, and compute a predicted probability of purchase. That is, what is the conditional probability that Randy will keep this shirt that we send him? Imagine that there’s a 72 percent chance that Randy will keep this shirt, 54 percent chance for these pants, and 47 percent chance for the shoes. For each of you in the room, the percentages are going to be different. We have machine-learned models that we layer in an ensemble to compute those percentages, which compose a set of personalized algorithmic recommendations for each customer that go to the stylists.

As the stylist is essentially shopping for you, choosing those five items on your behalf, he or she is looking at those algorithmic recommendations and figuring out what to put in the box.

We need the humans to put together an outfit, which the machines are currently not able to do. Sometimes, the human will answer a request such as “I’m going to Manhattan for an evening wedding, so send me something appropriate.” The machine doesn’t know what to do with that, but the humans know things that the machines don’t.

All of this requires a ton of data. Interestingly and, I believe, uniquely, Stitch Fix has a one-to-one ratio between data science and engineering. We have more than a hundred software engineers in the team that I work on and roughly 80 data scientists and algorithm developers that are doing all the data science. To my knowledge, this is a unique ratio in the industry. I don’t know any other company on the planet that has this kind of near one-to-one ratio.

What do we do with all of those data scientists? It turns out, if you are smart, it pays off.

We apply the same techniques to what clothes we’re going to buy. We make algorithmic recommendations to the buyers and they figure out that, okay, next season, we’re going to buy more white denim, or cold shoulders are out, or Capri pants are in next.

We use data analysis for inventory management: what do we keep in what warehouses and so on. We use it to optimize logistics and selection of shipping carriers so that the goods arrive on your doorstep on the date you want, at minimal cost to us. And we do some standard things, like demand prediction.

We are a physical business: we physically buy the clothes, put them in warehouses, and ship them to you. Unlike eBay and Google and a bunch of virtual businesses, if we guess wrong about demand, if demand is double what we expect, that is not a wonderful thing that we celebrate. That’s a disaster because it means that we can only serve half of the people well. If we have double the number of clients, we should have double the number of warehouses, stylists, employees, and that kind of stuff. It is very important for us to get these things right.

Again, the general model here is that we use humans for what the humans do best and machines for what the machines do best.

When you design a system at this scale, as I hope you do, you have a bunch of goals. You want to make sure that the development teams can continue to move forward independently and at a quick pace, which is what I call “feature velocity”. We want scalability, so that as our business grows, the infrastructure grows with it. We want the components to scale to load, to scale to the demands that
we put on them. Also, we want those components to be resilient, so we want the failures to be isolated and not cascade through the infrastructure.

High-performing organizations with these kinds of requirements have some things to do. The DevOps Handbook features research from Gene Kim, Nicole Forsgren, and others into the difference between high-performing organizations and lower-performing ones. Higher-performing organizations both move faster and are more stable. You don’t have to make a choice between speed and stability: you can have both.

The higher-performing organizations are doing multiple deploys a day, versus maybe one per month, and have a latency of less than an hour between committing code to source control and deploying it, while in other organizations that might take a week. That’s the speed side.

On the stability side, high-performing organizations recover from failure in an hour, versus maybe a day in a lower-performing organization. And the rate of failures is lower. The frequency of a high-performing organization deploying, having it not go well, and having to roll back the deployment approaches zero, but slower organizations might have to do this half the time. This is a big difference.

It is not just the speed and the stability. It is not just the technical metrics. The higher-performing organizations are two-and-a-half times more likely to exceed business goals like profitability, market share, and productivity. So this stuff doesn’t just matter to engineers, it matters to business people.

Evolving to microservices
One of the things that I got asked a lot when I was doing my roving CTO-as-a-service gig was “Hey, Randy, you worked at Google and eBay. Tell us how you did it.”

I would answer, “I promise to tell you, and you have to promise not to do those things, yet.” I said that not because I wanted to hold onto the secrets of Google and eBay, but because a 15,000-person engineering team like Google’s has a different set of problems than five people in a startup that sit around a conference table. That is three orders of magnitude different, and there will be different solutions at different scales for different companies.

That said, I love to tell how the companies we have heard of have evolved to microservices: not started with microservices, but evolved there over time.

eBay
eBay is now on its fifth complete rewrite of its infrastructure. It started out as a monolithic Perl application in 1995, when the founder wanted to play with this thing called the Web and so spent the three-day Labor Day weekend building this thing that ultimately became eBay.

The next generation was a monolithic C++ application that, at its worst, was 3.4 million lines of code in a single DLL. They were hitting compiler limits on the number of methods per class, which is 16,000. I’m sure many people think that they have a monolith, but few have one worse than that.

The third generation was a rewrite in Java, but we cannot call that microservices; it was mini-applications. They turned the site into 220 different Java applications. One was for the search part, one for the buying part… 220 applications. The current instance of eBay is fairly characterized as a polyglot set of microservices.

Twitter
Twitter has gone through a similar evolution, and is on roughly its third generation. It started as a Rails application, nicknamed the Monorail. The second generation pulled the front end out into
JavaScript and the back end into services written in Scala, because Twitter was an early adopter. We can currently characterize Twitter as a polyglot set of microservices.

Amazon.com
Amazon.com has gone through a similar evolution, although not as clean in the generations. It began as a monolithic C++ and Perl application, of which we can still see evidence in product pages. The “obidos” we sometimes see in an Amazon.com URL was the code name of the original Amazon.com application. Obidos is a place in Brazil, on the Amazon, which is why it was named that way.

Amazon.com rewrote everything from 2000 to 2005 in a service-oriented architecture. The services were mostly written in Java and Scala. During this period, Amazon.com was not doing particularly well as a business. But Jeff Bezos kept the faith and forced (or strongly encouraged) everyone in the company to rebuild everything in a service-oriented architecture. And now it’s fair to categorize Amazon.com as a polyglot set of microservices.

These stories all follow a common pattern. No one starts with microservices. But, past a certain scale (a scale that maybe only 0.1 percent of us are going to get to), everybody ends up with something we can call microservices.

I like to say that if you don’t end up regretting your early technology decisions, you probably over-engineered.

Why do I say that?

Imagine an eBay competitor or Amazon.com competitor in 1995. This company, instead of finding a product-market fit, a business model, and things that people are going to pay for, has built a distributed system they are going to need in five years. There is a reason we have not heard of that company.

Again, think about where you are in your business, where you are in your team size. The solutions for Amazon.com, Google, and Netflix are not necessarily the solutions for you when you are a small startup.

Microservices
I like to define the micro in microservices as not about the number of lines of code but about the scope of the interface.

A microservice has a single purpose and a simple, well-defined interface, and it is modular and independent. The critical thing to focus on and explore the implications of is that effective microservices, as Amazon.com found, have isolated persistence. In other words, a microservice should not be sharing data with other services.

For a microservice to reliably execute business logic and to guarantee invariants, we cannot have people reading and writing the data behind its back. eBay discovered this the other way. eBay spent a lot of effort with some very smart people to build a service layer in 2008, but it was not successful. Although the services were extremely well built and the interfaces were quite good and orthogonal (they spent a lot of time thinking about it), underneath them was a sea of shared databases that were also directly available to the applications. Nobody had to use the service layer in order to do their job, so they didn’t.
At Stitch Fix, we are on our own journey. We did not build a monolithic application, but our version of the monolith problem is the monolithic database we built.

We are breaking up our monolithic database and extracting services from it, but there are some great things that we would like to retain.

Figure 1 shows a simplified view of our situation. We have way more than this number of apps, but there are only so many things that fit in one image.

Figure 1: Stitch Fix’s monolithic, shared database.

We essentially have a shared database that includes everything that is interesting about Stitch Fix. This includes clients, the boxes that we ship, the items that we put into the boxes, metadata about the inventory like styles and SKUs, and information about the warehouses, spread across about 175 different tables. We have on the order of 70 or 80 different applications and services that use the same database for their work. That is the problem. That shared database is a coupling point for the teams, causing them to be interdependent as opposed to independent. It is a single point of failure and a performance bottleneck.

Our plan is to decouple applications and services from the shared database. There is a lot of work here.

Figure 2 shows the steps taken to break up the shared database. Image A is the starting point. The real diagram would be too full of boxes and lines, so let’s imagine that there are only three tables and two applications. The first thing that we’re going to do is build a service that represents, in this example, client information (B). This will be one of the microservices, with a well-defined interface. We negotiated that interface with the consumers of that service before we created the service.

Next, we point the applications to read from the service instead of using the shared database to read from the table (C). The hardest part is moving the lines. I do not mean to trivialize, but an image simply cannot show how hard it is to do that. After we do that, callers no longer connect directly to the database but will instead go through the service. Then we move the table from the shared database and put it in an isolated private database that is only associated with the microservice (D). There’s a lot of hard work involved, and this is the pattern.

Figure 2: Breaking up the shared database.
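The extraction steps A through D can be sketched in a few lines. This is a minimal illustration under stated assumptions, not Stitch Fix’s implementation: the `ClientService` class, the `clients` table and its columns, and the use of in-memory SQLite databases are all hypothetical stand-ins chosen to keep the example self-contained.

```python
import sqlite3
from typing import Optional

CLIENTS_DDL = "CREATE TABLE clients (id INTEGER PRIMARY KEY, name TEXT, address TEXT)"


class ClientService:
    """Hypothetical client service: the only code that reads client data."""

    def __init__(self, db: sqlite3.Connection) -> None:
        # During steps B and C the service still sits on the shared database.
        self._db = db

    def get_client(self, client_id: int) -> Optional[dict]:
        row = self._db.execute(
            "SELECT id, name, address FROM clients WHERE id = ?", (client_id,)
        ).fetchone()
        if row is None:
            return None
        return {"id": row[0], "name": row[1], "address": row[2]}

    def move_to_private_db(self, private_db: sqlite3.Connection) -> None:
        """Step D: copy the table into a private database, then drop the shared copy."""
        private_db.execute(CLIENTS_DDL)
        rows = self._db.execute("SELECT id, name, address FROM clients").fetchall()
        private_db.executemany("INSERT INTO clients VALUES (?, ?, ?)", rows)
        private_db.commit()
        self._db.execute("DROP TABLE clients")
        self._db.commit()
        self._db = private_db


# Step A: a shared database that every application can reach directly.
shared = sqlite3.connect(":memory:")
shared.execute(CLIENTS_DDL)
shared.execute("INSERT INTO clients VALUES (1, 'Randy', '1 Main St')")
shared.commit()

# Steps B and C: applications call the service instead of querying the table.
service = ClientService(shared)
before = service.get_client(1)

# Step D: the table moves; callers are unaffected because they never saw it.
service.move_to_private_db(sqlite3.connect(":memory:"))
after = service.get_client(1)
```

The point of the sketch is the ordering: callers are moved behind the service interface first, so that relocating the table afterwards is invisible to them.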
The next task is to do the same thing for item information. We create an item service, and have the applications use the service instead of the table (E). Then we extract the table and make it a private database of the service. We then do the same thing for SKUs or styles, and we keep rinsing and repeating (F). By the end, the boundary of each microservice surrounds both its application box and its database, such as the paired client-service and “core client” database (F).

We have divided the monolithic database with everything in there so that each microservice has its own persistence. But there are a lot of things that we like about the monolithic database, and I don’t want to give them up. These include easily sharing data between different services and applications, being able to easily do joins across different tables, and transactions. I want to be able to do operations that span multiple entities as a single atomic unit. These are all common aspects of monolithic databases.

Events
There are various database features that we can and cannot keep through the next part of the migration, but there are workarounds for those we can’t have. Before going into that, I need to point out an architectural building block that perhaps you know about but don’t appreciate as much as you should: events. Wikipedia defines an event as a significant change in state or a statement that something interesting has occurred.

In a traditional three-tier system, there’s the presentation tier that the users or clients use, the application tier that represents stateless business logic, and the persistence tier that is backed by a relational database. But, as architects, we are missing a fundamental building block that represents a state change, and that is what I will call an event. Because events are typically asynchronous, maybe I will produce an event to which nobody is yet listening, maybe only one other consumer within the system is listening to it, or maybe many consumers are going to subscribe to it.

Having promoted events to a first-class construct in our architecture, we will now apply events to microservices.

A microservice’s interface includes the front door, right? It obviously includes the synchronous request and response. This is typically HTTP, maybe JSON, maybe gRPC or something like that, but it clearly includes an access point. What is less obvious (and I hope I can convince you that this is true) is that it includes all of the events that the service produces, all of the events that the service consumes, and any other way to get data into and out of that service. Doing bulk reads out of the service for analytic purposes or bulk writes into the service for uploads are all part of the interface of the service. Simply put, I assert that the interface of a service includes any mechanism that gets data into or out of it.

Now that we have events in our toolbox, we will start to use events as a tool in solving those problems of shared data, of joins, and of transactions. That brings us to the problem of shared data. In a monolithic database, it is easy to leverage shared data. We point the applications at this shared table and we are all good. But where does shared data go in a microservices world?

Well, we have a couple of different options, but I will first give you a tool or a phrase to use when you discuss this. The principle, or that phrase, is “single system of record”. If there’s data for a customer, an item, or a box that is of interest in your system, there should be one and only one service that is the canonical system of record for that data. There should be only one place in the system where that service owns the customer, owns the item, or owns the box. There are going to be many representations of customer/item/etc. around (there certainly are at Stitch Fix), but every other copy in the system must be a read-only, non-authoritative cache of that system of record.

Let that sink in: read only and non-authoritative. Don’t modify the customer record anywhere and expect it to stick around in some other system. If we want to modify that data, we need to go to the system of record, which is the only place that can currently tell us, to the millisecond, what the customer is doing.

That’s the idea of the system of record, and there are a couple of different techniques to use in this approach to sharing data. The first is the most obvious and most simple: synchronously look it up from that system of record.

Consider a fulfillment service at Stitch Fix. We are going to ship a thing to a customer’s physical address. There’s a customer service that owns the customer data, one piece of which is the customer’s address. One solution is for the fulfillment service to call the customer service and look up the address. There’s nothing wrong with this approach; this is a perfectly legitimate way to do it. But sometimes this isn’t right. Maybe we do not want everything to be coupled on the customer service. Maybe the fulfillment service, or its equivalent, is pounding the
customer service so often that it impedes performance.

Another solution involves the combination of an asynchronous event with a local cache. The customer service is still going to own that representation of the customer, but when the customer data changes (the customer address, say), the customer service is going to send an update event. When the address changes, the fulfillment service will listen to that event and locally cache the address, then the fulfillment center will send the box on its merry way.

The caching within the fulfillment service has other nice properties. If the customer service does not retain a history of address changes, we can remember that in the fulfillment service. This happens at scale: customers may change addresses between the time that they start an order and the time that we ship it. We want to make sure that we send it to the right place.

Joins
It is really easy to join tables in a monolithic database. We simply add another table to the FROM clause in a SQL statement and we’re all good. This works great when everything sits in one big, monolithic database, but it does not work in a SQL statement if A and B are two separate services. Once we split the data across microservices, the joins, conceptually, are a lot harder to do.

We always have architecture choices, and there is more than one way to handle joins. The first option is to join in the client. Have whatever is interested in the A and the B do the join. In this particular example, let’s imagine that we are producing an order history. When a customer comes to Stitch Fix to see the history of the boxes that we’ve sent them, we might be able to provide that page in this way. We might have the order-history page call the customer service to get the current version of the customer’s information: maybe her name, her address, and how many things we have sent her. Then, it can go to the order service to get details about all of her orders. It gets a single customer from the customer service and then queries the order service for the orders that match that customer.

This is a pattern used on basically every webpage that does not get all of its data from one service. Again, this is a totally legitimate solution to this problem. We use it all the time at Stitch Fix, and I’m sure you use it all over the place in your applications as well.

But let’s imagine that this doesn’t work, whether for reasons of performance or reliability or maybe we’re querying the order service too much.

For approach number two, we create a service that does what I like to call, in database terminology, “materializing the view”. Imagine we are trying to produce an item-feedback service. At Stitch Fix, we send boxes out, and people keep some of the things that we send and return some. We want to know why, and we want to remember which things are returned and which are kept. This is something that we want to remember using an item-feedback service. Maybe we have 1,000 or 10,000 units of a particular shirt and we want to remember all customer feedback about that shirt every time we sent it. Multiply that effort by the tens of thousands of pieces of inventory that we might have.

To do this, we are going to have an item service, which is going to represent the metadata about this shirt. The item-feedback service is going to listen to events from the item service, such as new items, items that are gone, and changes to the metadata if that is interesting. It will also listen to events from the order service. Every piece of feedback about an order should generate an event or, since we send five items in a box, possibly five events. The item-feedback service is listening to those events and then materializing the join. In other words, it’s remembering all the feedback that we get for every item in one cached place. A fancier way to say that is that it maintains a denormalized join of items and orders together in its own local storage.

Many common systems do this all the time, and we don’t even think that they are doing it. For example, any sort of enterprise-grade (i.e., we pay for it) database system has a concept of a materialized view. Oracle has it, SQL Server has it, and a bunch of enterprise-class databases have a concept of materializing a view.

Most NoSQL systems work in this way. Any of the Dynamo-inspired data stores, like DynamoDB from Amazon, Cassandra, Riak, or Voldemort, all of which come from a NoSQL tradition, force us to do it up front. Relational databases are optimized for easy writes: we write to individual records or to individual tables. On the read side, we put it all together. Most NoSQL systems are the other way around. The tables that we store are already the queries that we wanted to ask. Instead of writing to an individual sub-table at write time, we are writing five times to all of the stored queries that we want to read from. Every NoSQL system is forcing us up front to do this sort of materialized join.
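The two event-driven techniques described here, a local cache fed by update events and a materialized join, can be sketched together. This is a minimal illustration, not Stitch Fix’s code: the `EventBus` class is a hypothetical in-process stand-in for a real message broker, and the event names and payload fields are assumptions made for the example.

```python
from collections import defaultdict
from typing import Callable, Dict, List


class EventBus:
    """Hypothetical in-process stand-in for a real message broker."""

    def __init__(self) -> None:
        self._handlers: Dict[str, List[Callable[[dict], None]]] = defaultdict(list)

    def subscribe(self, event_type: str, handler: Callable[[dict], None]) -> None:
        self._handlers[event_type].append(handler)

    def publish(self, event_type: str, payload: dict) -> None:
        # Delivered synchronously to keep the sketch short; real events
        # are typically asynchronous.
        for handler in self._handlers[event_type]:
            handler(payload)


class FulfillmentService:
    """Keeps a read-only, non-authoritative cache of customer addresses."""

    def __init__(self, bus: EventBus) -> None:
        self.address_cache: Dict[int, str] = {}
        bus.subscribe("customer.address_changed", self._on_address_changed)

    def _on_address_changed(self, event: dict) -> None:
        self.address_cache[event["customer_id"]] = event["address"]


class ItemFeedbackService:
    """Materializes a join of item metadata and order feedback locally."""

    def __init__(self, bus: EventBus) -> None:
        self.view: Dict[str, dict] = {}
        bus.subscribe("item.created", self._on_item_created)
        bus.subscribe("order.feedback_received", self._on_feedback)

    def _row(self, item_id: str) -> dict:
        # Create the row on first sight so event ordering does not matter.
        return self.view.setdefault(item_id, {"name": None, "feedback": []})

    def _on_item_created(self, event: dict) -> None:
        self._row(event["item_id"])["name"] = event["name"]

    def _on_feedback(self, event: dict) -> None:
        self._row(event["item_id"])["feedback"].append(
            {"kept": event["kept"], "comment": event["comment"]}
        )


bus = EventBus()
fulfillment = FulfillmentService(bus)
item_feedback = ItemFeedbackService(bus)

# The customer service (system of record) announces an address change.
bus.publish("customer.address_changed", {"customer_id": 7, "address": "2 Oak Ave"})

# The item and order services announce their own state changes.
bus.publish("item.created", {"item_id": "shirt-1", "name": "Blue shirt"})
bus.publish("order.feedback_received",
            {"item_id": "shirt-1", "kept": True, "comment": "Great fit"})
bus.publish("order.feedback_received",
            {"item_id": "shirt-1", "kept": False, "comment": "Too long"})
```

Neither consumer ever writes back to the data it caches; changes still go through the owning service, which keeps each copy non-authoritative.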
Every search engine that we use is almost certainly doing some form of joining one particular entity with another. Every analytical system on the planet is joining lots of different pieces of data, because that is what analytical systems are about.

I hope this technique now sounds a little bit more familiar.

Transactions
The wonderful thing about relational databases is this concept of a transaction. In a relational database, a single transaction embodies the ACID properties: it is atomic, consistent, isolated, and durable. We can do that in a monolithic database. That’s one of the wonderful things about having THE database in our system. It is easy to have a transaction cross multiple entities. In our SQL statement, we begin the transaction, do our inserts and updates, then commit, and that either all happens or it doesn’t happen at all.

Splitting data across services makes transactions hard. I will even replace “hard” with “impossible”. How do I know it’s impossible? There are techniques known in the database community for doing distributed transactions, like two-phase commit, but nobody does them in practice. As evidence of that fact, consider that no cloud service on the planet implements a distributed transaction. Why? Because it is a scalability killer.

So, we can’t have a transaction, but here is what we can do. We turn a transaction where we want to update A, B, and C, all together as a unit or not at all, into a saga. To create a saga, we model the transaction as a state machine of individual atomic events. Figure 3 may help clarify this. We re-implement that idea of updating A, updating B, and updating C as a workflow. Updating the A side produces an event that is consumed by the B service. The B service does its thing and produces an event that is consumed by the C service. At the end of all of this, at the end of the state machine, we are in a terminal state where A and B and C are all updated.

Figure 3: Workflows and sagas.
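A saga of this shape can be sketched as a list of steps, each paired with an operation that undoes it. This is a minimal illustration under stated assumptions: the `run_saga` function and the step helpers are hypothetical, the steps run in one process rather than being triggered by events between services, and undoing completed steps in reverse order is one common way to handle a failure partway through.

```python
from typing import Callable, List, Tuple

# A step is (name, action, compensation). All names here are illustrative.
Step = Tuple[str, Callable[[dict], None], Callable[[dict], None]]


def run_saga(steps: List[Step], state: dict) -> bool:
    """Run each step in order; on failure, undo completed steps in reverse."""
    completed: List[Step] = []
    for step in steps:
        name, action, _ = step
        try:
            action(state)
            completed.append(step)
        except Exception as error:
            state["log"].append(f"{name} failed: {error}")
            for done_name, _, compensate in reversed(completed):
                compensate(state)
                state["log"].append(f"compensated {done_name}")
            return False
    return True


def update(key: str) -> Callable[[dict], None]:
    def action(state: dict) -> None:
        state[key] = "updated"
        state["log"].append(f"updated {key}")
    return action


def undo(key: str) -> Callable[[dict], None]:
    def compensate(state: dict) -> None:
        state[key] = "original"
    return compensate


def failing_update(state: dict) -> None:
    raise RuntimeError("C is unavailable")


# Happy path: A, B, and C all reach the terminal "updated" state.
happy = {"A": "original", "B": "original", "C": "original", "log": []}
happy_ok = run_saga(
    [("A", update("A"), undo("A")),
     ("B", update("B"), undo("B")),
     ("C", update("C"), undo("C"))],
    happy,
)

# Failure path: C fails, so B and then A are compensated.
failed = {"A": "original", "B": "original", "C": "original", "log": []}
failed_ok = run_saga(
    [("A", update("A"), undo("A")),
     ("B", update("B"), undo("B")),
     ("C", failing_update, undo("C"))],
    failed,
)
```

Unlike a database transaction, the intermediate states are visible while the saga runs; the guarantee is only that the workflow ends either fully applied or fully compensated.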
Now, let’s imagine something goes wrong. We roll back by applying compensating operations in the reverse order. We undo the things we were doing in C, which produces one or several events, and then we undo the set of things that we did in the B service, which produces one or several events, and then we undo the things that we did in A. This is the core concept of sagas, and there’s a lot of great detail behind it. If you want to know more about sagas, I highly recommend Chris Richardson’s QCon presentation, Data Consistency in Microservices Using Sagas.

As with materializing the view, many systems that we use every day work in exactly this way. Consider a payment-processing system. If you want to pay me with a credit card, I would like to see the money get sucked out of your account and magically end up in my wallet in one unit of work. But that is not what actually happens. There are tons of things behind the scenes that involve payment processors and talking to the different banks and all of this financial magic.

In the canonical example of when we would use transactions, we would debit something from Justin’s account and add it to Randy’s account. No financial system on the planet actually works like that. Instead, every financial system implements it as a workflow. First, money gets taken out of Justin’s account, and it lives in the bank for several days. It lives in the bank longer than I would like, but it ultimately does end up in my account.

As another example, consider expense approvals. Probably everybody has to get expenses approved after a conference. And that does not happen immediately. You submit your expenses to your manager, and she approves it, and it goes to her boss, and she approves it... all the way up. And then your reimbursement follows a payment-processing workflow, where ultimately the money goes into your account or pocket. You would prefer this to be a single unit, but it actually happens as a workflow. Any multi-step workflow is like this.

If you write code for a living, consider as a final example what would happen if your code were deployed to production as soon as you hit return on your IDE. Nobody does that. That is not an atomic transaction, nor should it be. In a continuous-delivery pipeline, when I say commit, it does a bunch of stuff, the end result of which is, hopefully, deployed to production. That’s what the high-performing organizations are doing. But it does not happen atomically. Again, it’s a state machine: this step happens, then this happens, then this happens, and if something goes wrong along the way, we back it out. This should sound familiar. Stuff we use every day behaves like this, which means there is nothing wrong with using this technique in the services we build.

To wrap up, we have explored how to use events as tools in our architectural toolbox. We’ve shown how we can use events to share data between different components in our system. We have figured out how to use events to help us implement joins. And we have figured out how to use events to help us do transactions.
PREVIOUS ISSUES

58 Observability
This eMag explores the topic of observability in-depth, covering the role of the “three pillars of observability” -- monitoring, logging, and distributed tracing -- and relates these topics to designing and operating software systems based around modern architectural styles like microservices and serverless.

57 Streaming Architecture
This InfoQ emag aims to introduce you to core stream processing concepts like the log, the dataflow model, and implementing fault-tolerant streaming systems.

56 Faster, Smarter DevOps
This DevOps eMag has a broader setting than previous editions. You might, rightfully, ask “what does faster, smarter DevOps mean?”. Put simply, any and all approaches to DevOps adoption that uncover important mechanisms or thought processes that might otherwise get submerged by the more straightforward (but equally important) automation and tooling aspects.

55 Cloud Native
In this eMag, the InfoQ team pulled together stories that best help you understand this cloud-native revolution, and what it takes to jump in. It features interviews with industry experts, and articles on key topics like migration, data, and security.