How to debug slow Lambda response times
Yan Cui @theburningmonk
Developer Advocate, Lumigo
AWS Serverless Hero
Author of Production-Ready Serverless

MyApiFunction
Worker
Worker
…

overloaded servers are a thing of the past

observation
the majority of performance problems originate from a function’s integration points

macro
how well is this service performing in general?
micro
why did this transaction perform poorly?

macro
micro
identify systemic
issues

macro
micro
why did this user
get a bad exp?

In control theory, observability is a measure of how well
internal states of a system can be inferred from knowledge
of its external outputs.

what do we need
to collect?

Yan Cui
http://theburningmonk.com
@theburningmonk
Developer Advocate @ Lumigo
AWS user since 2009

Yan Cui
http://theburningmonk.com
@theburningmonk
Independent Consultant
advise, training, delivery

API Gateway → Lambda → DynamoDB
how long did this req take?

what is the state
of the world?

what are the most
important outputs to
collect?

macro
micro

Service A: API Gateway → Lambda
Service B: API Gateway → Lambda → DynamoDB

Service A Service B
how long did service B take to respond?
was DynamoDB slow?
was it a cold start?
could it be API Gateway?

Lambda
time to create and initialize the
worker instance
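
To answer "was it a cold start?" for a single invocation, one common trick is a flag in module scope, since module scope is only evaluated when a new worker instance is initialized. A minimal sketch; `wrapHandler` and the shape of the log line are hypothetical, not part of any AWS API:

```javascript
// Module scope runs once per worker instance, so this flag is true
// only for the first invocation handled by that instance.
let isColdStart = true;

// Hypothetical helper: wraps a handler and logs whether the invocation
// was served by a freshly initialized worker instance.
function wrapHandler(handler, log = console.log) {
  return async (event, context) => {
    const coldStart = isColdStart;
    isColdStart = false;
    log(JSON.stringify({ coldStart }));
    return handler(event, context);
  };
}
```

This only flags cold starts. It doesn't tell you how long the worker instance took to create and initialize.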

for API functions, use API Gateway’s IntegrationLatency
as a proxy for “total response time from Lambda”

DynamoDB
SuccessfulRequestLatency

“I'm facing this problem now with a lambda that usually takes 25 ms
but once a week or so takes > 6000 ms and times out. The lambda's
first step is to load a DynamoDB table that only has 8 items. I'm at a
loss to understand how such a simple query could take so long.”

START
1st attempt → exponential backoff (1)
2nd attempt → exponential backoff (2)
3rd attempt → exponential backoff (3)
4th attempt → success!

JavaScript AWS SDK
10 retries
Initial exponential backoff of 50ms
delay = Math.random() * (Math.pow(2, retryCount) * base)
this is Marc Brooker’s
fav formula!
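
To get a feel for how much time the formula above can add, here is a small sketch that computes the upper bound of the total backoff delay. It assumes `retryCount` starts at 0 and that `base` is the 50ms from the slide; `retryDelay` and `worstCaseTotalDelay` are illustrative names, not SDK functions:

```javascript
// The slide's formula, with the random source injectable for testing.
function retryDelay(retryCount, base = 50, random = Math.random) {
  return random() * (Math.pow(2, retryCount) * base);
}

// Upper bound of the total delay across all retries, i.e. what you would
// approach if every random draw came out just under 1.
// Assumption: retryCount runs from 0 to maxRetries - 1.
function worstCaseTotalDelay(maxRetries = 10, base = 50) {
  let total = 0;
  for (let retryCount = 0; retryCount < maxRetries; retryCount++) {
    total += retryDelay(retryCount, base, () => 1);
  }
  return total;
}
```

Under these assumptions `worstCaseTotalDelay()` returns 51150, so the backoff delays alone could approach 51 seconds before the final attempt, on top of the time spent in each failed request.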

10 retries
JavaScript AWS SDK
danger zone!

Record client-side latency metrics for IO operations
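
A minimal sketch of recording caller-side latency, assuming a hypothetical `record(metricName, millis)` callback that forwards the value to wherever your metrics go. Because the timer wraps the whole call, the number includes any retries and backoff the SDK performs internally:

```javascript
// Times an async IO operation as seen by the caller.
// `record` is a hypothetical callback; `now` is injectable for testing.
async function timed(metricName, operation, record, now = Date.now) {
  const start = now();
  try {
    return await operation();
  } finally {
    // Runs on success and on failure, so slow errors are recorded too.
    record(metricName, now() - start);
  }
}
```

Usage might look like `await timed('dynamodb.scan.latency', () => loadTable(), record)`, where `loadTable` stands in for your own IO call.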

www.youtube.com/watch?v=adtCwnKApWI

Latency breakdown, outermost to innermost:
[API Gateway] Latency
    API Gateway’s latency overhead
    [API Gateway] IntegrationLatency
        Lambda’s allocation time
        [Lambda] Duration
            [custom metric] Caller-side DynamoDB latency
                Caller-side retries (mostly)
                [DynamoDB] SuccessfulRequestLatency

[Chart: latency (ms) over time, plotting Latency, IntegrationLatency and Duration]
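
One way to read these metrics together is to subtract each layer from the one that wraps it. The sketch below is my reading of the slides rather than an official formula; `latencyBreakdown` and its output field names are hypothetical, and all inputs are in milliseconds for the same time window and statistic:

```javascript
// Rough attribution of response time to each layer.
// Assumption: the metrics nest as Latency > IntegrationLatency > Duration,
// and the caller-side DynamoDB latency wraps SuccessfulRequestLatency.
function latencyBreakdown(metrics) {
  const {
    latency,                   // [API Gateway] Latency
    integrationLatency,        // [API Gateway] IntegrationLatency
    duration,                  // [Lambda] Duration
    callerSideDynamoDbLatency, // [custom metric]
    successfulRequestLatency,  // [DynamoDB] SuccessfulRequestLatency
  } = metrics;
  return {
    apiGatewayOverhead: latency - integrationLatency,
    lambdaAllocationTime: integrationLatency - duration,
    callerSideRetries: callerSideDynamoDbLatency - successfulRequestLatency,
  };
}
```

Since these are aggregated metrics, the differences are only approximate, but a large gap at one layer points at where to look next.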

macro

micro

X-Ray
can be encapsulated
in custom modules

X-Ray
doesn’t add latency
can see “system” overhead (e.g. allocation time)

X-Ray
built-in sampling
X-Ray SDK adds significant overhead
doesn’t trace TCP traffic (RDS/ElastiCache)
poor support for async event sources (only SNS)
doesn’t capture request & response data
logs and traces are separate
difficult to search

X-Ray
good enough for simple workloads
when you outgrow X-Ray, look for a 3rd-party tool

answer both macro and micro level questions
in just a few clicks!

Support async event sources such as Kinesis, DynamoDB streams and SNS

Support TCP traffic - e.g. RDS and ElastiCache

trace 500K invocations
per month for FREE with
promo code Yan500
platform.lumigo.io/signup

How to mitigate slow dependencies?

if not, a good caching strategy often helps
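
As one illustration of a caching strategy, here is a minimal in-memory cache with a time-to-live, suitable for small, rarely changing data. `createTtlCache` and its parameters are hypothetical names; in a Lambda function the cache would live in module scope, so it is per worker instance and lost when that instance is recycled:

```javascript
// Minimal TTL cache sketch. `now` is injectable for testing.
function createTtlCache(ttlMs, now = Date.now) {
  const entries = new Map();
  return {
    // Returns the cached value if still fresh, otherwise calls `load`
    // (e.g. the slow dependency) and caches its result.
    async get(key, load) {
      const hit = entries.get(key);
      if (hit && hit.expiresAt > now()) {
        return hit.value;
      }
      const value = await load(key);
      entries.set(key, { value, expiresAt: now() + ttlMs });
      return value;
    },
  };
}
```

The trade-off is staleness: callers may see data up to `ttlMs` old, so the TTL needs to match how fresh the data has to be.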

Client
Server 1
Server 2
50ms later
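
The diagram above reads like a hedged (backup) request: the client asks Server 1, and if no response has arrived 50ms later, it also asks Server 2 and takes whichever answers first. That is my interpretation of the slide, and `hedgedRequest` below is a hypothetical sketch of the idea, not code from the talk:

```javascript
// Sends the primary request, and a backup after `hedgeDelayMs` if the
// primary has not settled yet. Resolves with the first success.
function hedgedRequest(sendToPrimary, sendToBackup, hedgeDelayMs) {
  return new Promise((resolve, reject) => {
    let inFlight = 0;
    let done = false;
    let timer = null;

    const finish = (settle, value) => {
      if (done) return;
      done = true;
      clearTimeout(timer); // never send the backup once we have an outcome
      settle(value);
    };

    const launch = (send) => {
      inFlight += 1;
      send().then(
        (value) => finish(resolve, value),
        (err) => {
          inFlight -= 1;
          // Fail only when nothing else is in flight. This hedges against
          // slowness, not errors: a fast failure is not retried.
          if (inFlight === 0) finish(reject, err);
        }
      );
    };

    timer = setTimeout(() => {
      if (!done) launch(sendToBackup);
    }, hedgeDelayMs);
    launch(sendToPrimary);
  });
}
```

Every backup request is extra load on the dependency, so a delay that is too short can make a struggling service slower still.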

tuning required for
each service

helps in some cases
but can exacerbate the problem in other cases

@theburningmonk
theburningmonk.com
github.com/theburningmonk
yan@lumigo.io
