CloudWatch Metrics and Alarms can be used to add circuit breaker functionality to AWS Lambda functions that are triggered by SQS messages in a non-intrusive and cost-effective way.
You can protect overwhelmed downstream services without making code changes, replaying messages from dead-letter queues or significantly increasing operating costs.
A serverless architecture frees you from the responsibility of ensuring that your application scales rapidly with increasing demand and stays available even when underlying infrastructure components fail. But as soon as your application calls external APIs (either third-party services hosted elsewhere or managed, non-serverless AWS services), this ideal world crumbles. You are confronted with increasing latency, long-running calls and rising error rates.
Several well-known stability patterns exist, such as Use Timeouts, Bulkheads, Decoupling Middleware and Circuit Breaker (published in Michael T. Nygard’s book Release It!). In the context of serverless on AWS, you can configure timeouts for Lambda functions and decouple your application from external APIs by putting a message queue like SQS in front of your single-purpose Lambda functions (Bulkheads). But there is currently no straightforward way to apply a circuit breaker to AWS Lambda functions. If your message-processing Lambda functions start to fail repeatedly due to an incident in the downstream service, AWS Lambda will keep redelivering SQS messages to your function (respecting an optionally configured dead-letter queue and maximum receive count). Your function might get even more load, since AWS Lambda scales concurrent invocations based on the number of available messages. If you restrict the function concurrency, AWS Lambda might throttle invocations and fail to process messages.
When a downstream service is in trouble, for instance due to very high load or failing underlying infrastructure components, the idea of the Circuit Breaker pattern is to stop the upstream system from making further calls (open state). The downstream service gets the chance to recover, and the upstream system does not waste time or operating resources on calls that will probably fail anyway. After some time, the circuit breaker allows a small number of calls to find out whether the downstream service is operating normally again (half-open state). Once a threshold of successful calls is reached, the circuit breaker enables all calls to the downstream service again (closed state).
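The three states can be sketched as a small in-process state machine. This is only an illustration of the generic pattern, not the CloudWatch-based solution described in this article; the class name `CircuitBreaker`, its parameters and the injected `clock` are assumptions made for the example. As a simplification, it lets every call through while half open, instead of only a small number of them.

```python
import time
from enum import Enum


class State(Enum):
    CLOSED = "closed"        # calls pass through
    OPEN = "open"            # calls are rejected immediately
    HALF_OPEN = "half_open"  # trial calls probe the downstream service


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, success_threshold=2,
                 recovery_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.success_threshold = success_threshold
        self.recovery_timeout = recovery_timeout
        self._clock = clock
        self.state = State.CLOSED
        self._failures = 0
        self._successes = 0
        self._opened_at = 0.0

    def call(self, func, *args, **kwargs):
        if self.state is State.OPEN:
            if self._clock() - self._opened_at < self.recovery_timeout:
                raise CircuitOpenError("downstream calls are suspended")
            # Recovery timeout elapsed: start probing with trial calls.
            self.state = State.HALF_OPEN
            self._successes = 0
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_failure(self):
        if self.state is State.HALF_OPEN:
            self._open()  # a failed trial call re-opens the circuit
            return
        self._failures += 1
        if self._failures >= self.failure_threshold:
            self._open()

    def _on_success(self):
        if self.state is State.HALF_OPEN:
            self._successes += 1
            if self._successes >= self.success_threshold:
                self.state = State.CLOSED
                self._failures = 0
        else:
            self._failures = 0  # only consecutive failures count

    def _open(self):
        self.state = State.OPEN
        self._opened_at = self._clock()
        self._failures = 0
```

Wrapping a downstream call as `breaker.call(fetch_order, order_id)` (with a hypothetical `fetch_order`) then raises `CircuitOpenError` without touching the downstream service while the circuit is open.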
Three key aspects are important to implement a circuit breaker:
A common approach is to implement the circuit breaker inside your function and use DynamoDB to store the circuit breaker state (as in Gunnar Grosch’s failure lambda Node.js implementation and as Jeremy Daly outlines in his AWS Reference Architecture Pattern). The Lambda function fails before calling the third-party API once a failure threshold has been exceeded. This protects the downstream service, but it does not stop AWS Lambda from polling the upstream queue and invoking your function. You also have to change the Lambda function code in a way that is specific to the particular Lambda runtime and programming language. The approach also introduces a number of DynamoDB requests, which could significantly increase costs.
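To make the trade-off concrete, here is a minimal sketch of the in-function approach. It is not taken from either of the implementations mentioned above: `InMemoryStateStore` is a hypothetical stand-in for a DynamoDB table, and `make_handler` and `call_api` are names invented for the example. The request counter shows that every invocation costs at least one state lookup, even while the circuit is open.

```python
class InMemoryStateStore:
    """Hypothetical stand-in for a DynamoDB table holding the breaker state."""

    def __init__(self):
        self.state = "closed"
        self.requests = 0  # counts lookups to show the extra traffic

    def get_state(self):
        self.requests += 1
        return self.state


def make_handler(store, call_api):
    """Build an SQS-triggered handler that fails fast when the circuit is open."""

    def handler(event, context=None):
        # One state lookup per invocation, whether or not the API is called.
        if store.get_state() == "open":
            # The invocation still happened; only the downstream call is skipped.
            raise RuntimeError("circuit open: skipping downstream call")
        return [call_api(record["body"]) for record in event["Records"]]

    return handler
```

Because the handler raises while the circuit is open, the messages go back to the queue and are delivered again, so the function keeps being invoked throughout the incident.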
This solution relies on CloudWatch metrics and alarms to detect message processing issues caused by the downstream service.
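The implementation itself lives in the GitHub repository; the following is only a rough sketch of how such a setup could look, under several assumptions that the text here does not confirm: Lambda's built-in `Errors` metric serves as the failure signal, the alarm state change arrives as an EventBridge event, and the circuit is opened by disabling the SQS event source mapping. The function names, thresholds and alarm name are hypothetical, and the sketch leaves out the Step Functions part of the solution. The AWS client is passed in, so the code runs without AWS credentials.

```python
def build_error_alarm(function_name, threshold=5, period_seconds=60):
    """Keyword arguments for CloudWatch's PutMetricAlarm API call.

    Assumption: the function's built-in "Errors" metric is the signal.
    """
    return {
        "AlarmName": f"{function_name}-circuit-breaker",  # hypothetical name
        "Namespace": "AWS/Lambda",
        "MetricName": "Errors",
        "Dimensions": [{"Name": "FunctionName", "Value": function_name}],
        "Statistic": "Sum",
        "Period": period_seconds,
        "EvaluationPeriods": 1,
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "TreatMissingData": "notBreaching",  # no invocations means no errors
    }


def set_polling(lambda_client, mapping_uuid, enabled):
    """Enable or disable the event source mapping that polls the queue."""
    return lambda_client.update_event_source_mapping(
        UUID=mapping_uuid, Enabled=enabled
    )


def on_alarm_state_change(event, lambda_client, mapping_uuid):
    """React to a CloudWatch alarm state change delivered via EventBridge."""
    if event["detail"]["state"]["value"] == "ALARM":
        # Open the circuit: AWS Lambda stops polling and invoking the function.
        set_polling(lambda_client, mapping_uuid, enabled=False)
        return "open"
    return "unchanged"
```

With boto3 installed, the alarm could be created with `boto3.client("cloudwatch").put_metric_alarm(**build_error_alarm("my-function"))`, and `boto3.client("lambda")` would be passed as `lambda_client`.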
This solution supports any Lambda runtime. No changes to your function code are required. Fixed monthly costs are incurred for CloudWatch alarms and metrics (the AWS Free Tier applies, except for high-resolution alarms). Costs for Step Functions state transitions and Lambda function invocations are incurred only in the failure state. On the other hand, you save the cost of unnecessary queue service requests and Lambda invocations.
I designed the solution for SQS as the function trigger, but other services, such as Amazon MQ, that AWS Lambda integrates via event source mappings should work too.
You can find an implementation of this Circuit Breaker solution on GitHub.
Jeremy Daly’s Lambda Orchestrator pattern goes a step further. It does not rely on AWS Lambda event source mappings at all to receive messages and invoke Lambda functions. Instead, a long-running Lambda function polls the queue and invokes the processing Lambda function, similar to the “half open” state of the solution described above. The Lambda Orchestrator pattern enables sophisticated ways to throttle third-party API calls, such as respecting API quotas.
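As an illustration of quota-aware throttling in such an orchestrator, here is a minimal token bucket sketch. It is not Jeremy Daly's implementation; `TokenBucket`, `orchestrate` and the injected `receive` and `process` callables are hypothetical names, with `receive` standing in for an SQS poll and `process` for invoking the processing function.

```python
import time


class TokenBucket:
    """Permits bursts of up to `capacity` calls and refills at
    capacity / refill_seconds tokens per second."""

    def __init__(self, capacity, refill_seconds, clock=time.monotonic):
        self.capacity = capacity
        self.rate = capacity / refill_seconds
        self._clock = clock
        self._tokens = float(capacity)
        self._last = clock()

    def try_acquire(self):
        now = self._clock()
        # Add the tokens earned since the last check, capped at capacity.
        self._tokens = min(self.capacity,
                           self._tokens + (now - self._last) * self.rate)
        self._last = now
        if self._tokens >= 1.0:
            self._tokens -= 1.0
            return True
        return False


def orchestrate(receive, process, bucket, max_polls):
    """Poll for messages and process only as many as the quota allows."""
    handled, deferred = [], []
    for _ in range(max_polls):
        for message in receive():
            if bucket.try_acquire():
                process(message)
                handled.append(message)
            else:
                # Not processed now; with SQS, an undeleted message becomes
                # visible again after its visibility timeout.
                deferred.append(message)
    return handled, deferred
```

Injecting the clock keeps the quota logic testable without waiting for real time to pass.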
This article was originally posted on Medium.