How to debug slow Lambda response times
Yan Cui @theburningmonk
Developer Advocate, Lumigo
AWS Serverless Hero
Author of Production-Ready Serverless
Lambda autoscales by traffic
multi-AZ by default
MyApiFunction: Worker, Worker, …
overloaded servers are a thing of the past
observation: the majority of performance problems originate from a function’s integration points
macro
how well is this service performing in general?
micro
why did this transaction perform poorly?
macro: identify systemic issues
micro: why did this user get a bad experience?
In control theory, observability is a measure of how well
internal states of a system can be inferred from knowledge
of its external outputs.
what do we need
to collect?
Yan Cui
http://theburningmonk.com
@theburningmonk
Developer Advocate @ Lumigo
AWS user since 2009
Independent Consultant: advise, training, delivery
API Gateway -> Lambda -> DynamoDB
how long did this request take?
what is the state of the world?
what are the most
important outputs to
collect?
Service A (API Gateway, Lambda) calls Service B (API Gateway, Lambda, DynamoDB)
how long did service B take to respond?
was DynamoDB slow?
was it a cold start?
could it be API Gateway?
API Gateway
Lambda
Duration
time to create and initialize the worker instance
bit.ly/2QXNVwc
bit.ly/2WL1uj0
for API functions, use API Gateway’s IntegrationLatency as a proxy for “total response time from Lambda”
DynamoDB
SuccessfulRequestLatency
“I'm facing this problem now with a lambda that usually takes 25 ms
but once a week or so takes > 6000 ms and times out.  The lambda's
first step is to load a DynamoDB table that only has 8 items.  I'm at a
loss to understand how such a simple query could take so long.”
START
1st attempt
exponential backoff (1)
2nd attempt
exponential backoff (2)
3rd attempt
exponential backoff (3)
4th attempt
success!
SuccessfulRequestLatency
JavaScript AWS SDK
10 retries
Initial exponential backoff of 50ms
delay = Math.random() * (Math.pow(2, retryCount) * base)
this is Marc Brooker’s fav formula!
danger zone!
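To get a feel for why this is a danger zone, the backoff formula can be evaluated for the numbers on the slide (10 retries, 50ms base). This is a rough sketch, not the SDK's own code: the function names are made up for illustration, and it assumes retryCount starts at 0 for the first retry.

```javascript
// Upper bound of the jittered delay, in ms, before a given retry.
// Assumption: retryCount is 0 for the first retry.
function maxDelayMs(retryCount, baseMs) {
  return Math.pow(2, retryCount) * baseMs;
}

// The formula from the slide: a random delay between 0 and the upper bound.
// `random` is injectable so the function can be exercised deterministically.
function jitteredDelayMs(retryCount, baseMs, random = Math.random) {
  return random() * maxDelayMs(retryCount, baseMs);
}

// Worst case: every retry waits for its full upper bound.
function worstCaseTotalBackoffMs(retries, baseMs) {
  let total = 0;
  for (let retryCount = 0; retryCount < retries; retryCount++) {
    total += maxDelayMs(retryCount, baseMs);
  }
  return total;
}

console.log(worstCaseTotalBackoffMs(10, 50)); // 51150
console.log(worstCaseTotalBackoffMs(10, 50) / 2); // 25575, the expected total with uniform jitter
```

Under these assumptions, backoff alone could add tens of seconds if every retry were used, well beyond the 6000 ms timeout in the quote above, and that waiting time would not be reflected in DynamoDB's SuccessfulRequestLatency. In version 2 of the JavaScript SDK the retry behaviour can be tuned through the maxRetries and retryDelayOptions client settings.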
Record client-side latency metrics for IO operations
www.youtube.com/watch?v=adtCwnKApWI
Embedded Metric Format (EMF)
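One way to record client-side latency with EMF is to time the IO call and print a JSON log line that CloudWatch can turn into a metric. The sketch below builds that line by hand; the namespace `MyApp`, the metric name, the dimension and the helper names are all hypothetical, and in practice a library may do this for you.

```javascript
// Builds a log record in CloudWatch Embedded Metric Format.
// Namespace, dimension names and metric name are illustrative assumptions.
function buildEmfRecord(namespace, dimensions, metricName, valueMs, timestamp = Date.now()) {
  return {
    _aws: {
      Timestamp: timestamp,
      CloudWatchMetrics: [
        {
          Namespace: namespace,
          Dimensions: [Object.keys(dimensions)],
          Metrics: [{ Name: metricName, Unit: 'Milliseconds' }],
        },
      ],
    },
    ...dimensions,
    [metricName]: valueMs,
  };
}

// Times an async IO operation as the caller sees it (retries and backoff
// included) and logs the result as an EMF record, even when the call fails.
async function recordLatency(metricName, dimensions, operation, log = console.log) {
  const start = process.hrtime.bigint();
  try {
    return await operation();
  } finally {
    const elapsedMs = Number(process.hrtime.bigint() - start) / 1e6;
    log(JSON.stringify(buildEmfRecord('MyApp', dimensions, metricName, elapsedMs)));
  }
}
```

A call site might look like `await recordLatency('DynamoDBLatency', { Table: 'orders' }, () => loadItems())`, where `loadItems` stands in for whatever IO call the function makes.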
Latency [API Gateway]
  IntegrationLatency [API Gateway]
    Duration [Lambda]
      Caller-side DynamoDB latency [custom metric]
        SuccessfulRequestLatency [DynamoDB]
Latency - IntegrationLatency = API Gateway’s latency overhead
IntegrationLatency - Duration = Lambda’s allocation time
Caller-side DynamoDB latency - SuccessfulRequestLatency = caller-side retries (mostly)
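The gaps between these layers can be worked out by simple subtraction. The sketch below assumes you have already fetched a comparable statistic (for example the same percentile over the same period) for each metric; the function and property names are hypothetical.

```javascript
// Splits the end-to-end latency into the gaps between the metric layers.
// All inputs are in milliseconds; property names are illustrative only.
function latencyBreakdown(m) {
  return {
    apiGatewayOverheadMs: m.latency - m.integrationLatency,
    lambdaAllocationMs: m.integrationLatency - m.duration,
    callerSideRetriesMs: m.callerSideDynamoDbLatency - m.successfulRequestLatency,
  };
}

const breakdown = latencyBreakdown({
  latency: 120,                  // [API Gateway] Latency
  integrationLatency: 100,       // [API Gateway] IntegrationLatency
  duration: 80,                  // [Lambda] Duration
  callerSideDynamoDbLatency: 60, // [custom metric]
  successfulRequestLatency: 10,  // [DynamoDB] SuccessfulRequestLatency
});
console.log(breakdown);
```

Because these are aggregated statistics rather than per-request values, the differences are approximate and best used to spot which layer is growing, not to account for every millisecond.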
[Chart: Latency (ms) over Time, with series Latency, IntegrationLatency, Duration, Caller-side DynamoDB latency and SuccessfulRequestLatency]
how well is this service performing in general?
macro
why did this transaction perform poorly?
micro
X-Ray
can be encapsulated in custom modules
doesn’t add latency
can see “system” overhead (e.g. allocation time)
built-in sampling
X-Ray SDK adds significant overhead
doesn’t trace TCP traffic (RDS/ElastiCache)
poor support for async event sources (only SNS)
doesn’t capture request & response data
logs and traces are separate
difficult to search
X-Ray
good enough for simple workloads
when you outgrow X-Ray, look for a 3rd-party tool
answer both macro and micro level questions
in just a few clicks!
Support async event sources such as Kinesis, DynamoDB streams and SNS
Support TCP traffic - e.g. RDS and ElastiCache
platform.lumigo.io/signup
trace 500K invocations
per month for FREE with
promo code Yan500
How to mitigate slow dependencies?
it depends…
can you use another service?
if not, a good caching strategy often helps
bit.ly/3h7Bo41
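As one illustration of a caching strategy, a function can keep the result of a slow call in memory for a short time so that most invocations skip the dependency entirely. This is a minimal sketch with made-up names; it assumes the data can be slightly stale and it does not de-duplicate concurrent loads.

```javascript
// Wraps an async loader so its result is reused until the TTL expires.
// `now` is injectable to make the expiry logic easy to exercise.
function cachedWithTtl(loader, ttlMs, now = Date.now) {
  let value;
  let expiresAt = -Infinity; // nothing cached yet

  return async function get() {
    if (now() < expiresAt) {
      return value; // still fresh, no call to the slow dependency
    }
    value = await loader();
    expiresAt = now() + ttlMs;
    return value;
  };
}
```

Declared outside the handler, the wrapped loader can survive across invocations that reuse the same worker instance, for example `const getConfig = cachedWithTtl(loadConfigFromTable, 60000)`, where `loadConfigFromTable` is a placeholder for the real call.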
[Diagram: the Client sends a request to Server 1; 50ms later, it sends the same request to Server 2]
tuning required for each service
helps in some cases
but can exacerbate the problem in other cases
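The Client / Server 1 / Server 2 slides appear to describe sending a second copy of a request when the first has not answered after a short wait, a pattern often called request hedging. That reading is an assumption, and the sketch below uses hypothetical names; the 50ms delay is the value from the slide and is exactly the kind of number that needs tuning per service.

```javascript
// Sends the request to the primary; if it has not answered after
// hedgeDelayMs, sends the same request to the backup and takes
// whichever succeeds first.
function hedgedRequest(sendToPrimary, sendToBackup, hedgeDelayMs) {
  let timer;
  const primary = sendToPrimary();
  const backup = new Promise((resolve, reject) => {
    timer = setTimeout(() => sendToBackup().then(resolve, reject), hedgeDelayMs);
  });
  // If the primary answers in time, the backup request is never sent.
  primary.then(() => clearTimeout(timer), () => {});
  return Promise.any([primary, backup]);
}
```

The trade-off on the slide shows up directly here: every hedge is extra load on the dependency, so if the dependency is slow because it is already overloaded, a delay that is too short makes things worse.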
@theburningmonk
theburningmonk.com
github.com/theburningmonk
yan@lumigo.io
