Self-optimize Lambda memory configuration(s) at scale with Flink
What’s this and why do it?
Let’s say you want to optimize the memory allocated to a Lambda function because you over-provisioned it earlier, not knowing how much memory the code actually needs to run properly. Logically, you would look at the CloudWatch logs (or Lambda Insights) to see how much memory the function uses per invocation and then sample those numbers over time to find the “sweet spot”.
This works well when you have a handful of Lambdas to deal with, but imagine needing to optimize a thousand of them! The manual approach is obviously time-consuming. The setup I explain here simply automates that process at scale.
How it works:
Access to Lambda metrics:
The first step of the flow is to capture the memory utilization of a function over time in order to arrive at a value that works well. The Telemetry API that’s part of Lambda Extensions helps with this: an extension ships the platform logs so that the memory used across all invocations can be sampled over time.
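For reference, the record of interest here is the platform.report event emitted once per invocation. Below is a minimal sketch of pulling the memory figure out of one of those log lines, assuming Jackson is on the classpath; only the type and record.metrics.maxMemoryUsedMB fields come from the Telemetry API’s schema, and the surrounding class is illustrative rather than code from the repo.

```java
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

public class ReportParser {

    private static final ObjectMapper MAPPER = new ObjectMapper();

    // Returns the memory high-water mark for one invocation, or null if the
    // line is not a platform.report event.
    public static Integer maxMemoryUsedMb(String logLine) throws Exception {
        JsonNode root = MAPPER.readTree(logLine);
        if (!"platform.report".equals(root.path("type").asText())) {
            return null;
        }
        return root.path("record").path("metrics").path("maxMemoryUsedMB").asInt();
    }
}
```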
Sampling the numbers:
Now that you have access to the metrics (the maxMemoryUsedMB values), how do you go about using them to arrive at an optimal value? This is where the CEP layer (Flink) comes into the picture. Apache Flink is an open-source stream-processing framework that also ships CEP capabilities.
If you haven’t heard of Flink or don’t have a clear idea of what it does, I encourage you to read some of my previous blogs, starting with this one.
We gather a Lambda function’s metrics over a two-week stint, and at the end of that period Flink pushes out a payload with the value to set. Because of how Flink operates internally, it can collect a function’s logs and track them individually over a defined period, at scale, which is why I’ve employed it as the compute layer.
The process of collecting the metrics and determining the final value to push as output is all programmed within this Flink layer.
Updating the memory value:
When Flink pushes the payload, another Lambda picks it up and makes a call via the AWS SDK to update the memory configuration of the function named in that payload.
The nitty-gritty details of the setup:
First, the entire project is available at this GitHub repo.
You will only need to run three commands from the terminal to deploy the whole thing.
Note: You will need Node, Maven, and the Serverless Framework installed to run them.
```sh
npm install
npm run flink-build
serverless deploy --max-concurrency 2
```
IaC:
The deployment of the entire infrastructure is done via the Serverless Framework. Since there are multiple items to deploy, the setup uses Serverless Compose so that everything is deployed with a single command.
These are the resources provisioned as part of the deployment.
- S3 bucket — holds the Lambda layer and the Flink application .jar file
- Lambda(s) — the worker function that ships logs to Flink and the updater function that sets the new memory value
- Kinesis Data Analytics — the Flink instance
- Lambda layer — the extension code that plugs into the worker function to deliver the logs
- Kinesis streams — the source stream for pushing logs to Flink and the destination stream that receives the outputs from Flink
Shipping logs:
As mentioned, the code that uses the Telemetry API is injected via a Lambda layer, which is also present in the repository. The logs are pushed to a Kinesis stream, but in theory they could be shipped almost anywhere; Kinesis was used because it works well with Flink.
Any Lambda that needs to be part of this automated optimization setup has to include the Lambda layer as part of its deployment.
Note: The code for the extension setup was forked from AWS’s samples GitHub repo.
Flink:
At a high level, this is what Flink does internally (a code sketch follows the list):
- All logs contain the function name (added manually within the extension’s dispatcher code). That function name is used to perform a keyBy operation on every log event that comes through.
- The resulting keyed stream lets Flink maintain state for each function individually.
- When the first log for a given function arrives, a timer is registered to fire exactly 14 days later. Within this period, all Flink does is determine the highest maxMemoryUsedMB value for each day, based on all the logs pushed that day, and keep those daily maxima in state.
- When the timer fires, the average of all the memory values held in state is calculated. As an added buffer, the average is incremented by 100 and then pushed out from Flink to the destination Kinesis stream.
- Of course, all of the above can be altered to what works best for you.
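Here is a minimal sketch of that keyed-state-plus-timer pattern as a Flink KeyedProcessFunction. It is not the code from the repo: the MemoryReport and Recommendation classes, the field names, and the exact placement of the 100 MB buffer are illustrative, and signatures may differ slightly across Flink versions.

```java
import java.time.Duration;

import org.apache.flink.api.common.state.MapState;
import org.apache.flink.api.common.state.MapStateDescriptor;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

/** Illustrative shape of a parsed platform.report log line. */
class MemoryReport {
    public String functionName;
    public int maxMemoryUsedMB;
    public long timestampMillis;
}

/** Illustrative payload pushed to the destination stream for the updater Lambda. */
class Recommendation {
    public String functionName;
    public int memorySizeMB;

    public Recommendation() {}

    public Recommendation(String functionName, int memorySizeMB) {
        this.functionName = functionName;
        this.memorySizeMB = memorySizeMB;
    }
}

public class MemoryRecommendation
        extends KeyedProcessFunction<String, MemoryReport, Recommendation> {

    private static final long WINDOW_MS = Duration.ofDays(14).toMillis();
    private static final long DAY_MS = Duration.ofDays(1).toMillis();

    // Highest maxMemoryUsedMB observed per day, scoped to the current key (function name).
    private transient MapState<Long, Integer> dailyMax;
    // Non-null once the 14-day timer has been registered for this function.
    private transient ValueState<Long> timerAt;

    @Override
    public void open(Configuration parameters) {
        dailyMax = getRuntimeContext().getMapState(
                new MapStateDescriptor<>("dailyMax", Long.class, Integer.class));
        timerAt = getRuntimeContext().getState(
                new ValueStateDescriptor<>("timerAt", Long.class));
    }

    @Override
    public void processElement(MemoryReport report, Context ctx, Collector<Recommendation> out)
            throws Exception {
        // Register a single timer 14 days after the very first report for this function.
        if (timerAt.value() == null) {
            long fireAt = ctx.timerService().currentProcessingTime() + WINDOW_MS;
            ctx.timerService().registerProcessingTimeTimer(fireAt);
            timerAt.update(fireAt);
        }
        // Keep only the highest memory figure seen on each day.
        long day = report.timestampMillis / DAY_MS;
        Integer current = dailyMax.get(day);
        if (current == null || report.maxMemoryUsedMB > current) {
            dailyMax.put(day, report.maxMemoryUsedMB);
        }
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext ctx, Collector<Recommendation> out)
            throws Exception {
        // Average the per-day maxima and add a buffer of 100 before emitting.
        long sum = 0;
        int days = 0;
        for (int max : dailyMax.values()) {
            sum += max;
            days++;
        }
        if (days > 0) {
            out.collect(new Recommendation(ctx.getCurrentKey(), (int) (sum / days) + 100));
        }
        dailyMax.clear();
        timerAt.clear();
    }
}
```

The key point is that both the state and the timer are scoped to the key (the function name), so every function gets its own independent 14-day window and daily maxima without any extra bookkeeping.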
When it comes to Flink, we could even use Kafka topics as the source and destination. I went with Kinesis for a nimble setup.
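For completeness, here is a sketch of how such a job could be wired to the two Kinesis streams. It assumes the older flink-connector-kinesis classes (newer Flink versions expose different source and sink classes), the stream names and region are placeholders, and it reuses the illustrative MemoryReport, Recommendation and MemoryRecommendation classes from the sketch above.

```java
import java.util.Properties;

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kinesis.FlinkKinesisConsumer;
import org.apache.flink.streaming.connectors.kinesis.FlinkKinesisProducer;
import org.apache.flink.streaming.connectors.kinesis.config.AWSConfigConstants;

public class MemoryOptimizerJob {

    private static final ObjectMapper MAPPER = new ObjectMapper();

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        Properties props = new Properties();
        props.setProperty(AWSConfigConstants.AWS_REGION, "us-east-1"); // placeholder region

        // Source: raw platform log lines shipped by the extension.
        FlinkKinesisConsumer<String> source =
                new FlinkKinesisConsumer<>("lambda-logs-source", new SimpleStringSchema(), props);

        // Sink: recommendation payloads consumed by the updater Lambda.
        FlinkKinesisProducer<String> sink =
                new FlinkKinesisProducer<>(new SimpleStringSchema(), props);
        sink.setDefaultStream("lambda-memory-destination");
        sink.setDefaultPartition("0");

        env.addSource(source)
                .filter(line -> line.contains("platform.report")) // only per-invocation reports
                .map(MemoryOptimizerJob::toReport)                // JSON line -> MemoryReport
                .keyBy(report -> report.functionName)             // one state/timer per function
                .process(new MemoryRecommendation())
                .map(rec -> String.format(                        // Recommendation -> JSON payload
                        "{\"functionName\":\"%s\",\"memorySizeMB\":%d}",
                        rec.functionName, rec.memorySizeMB))
                .addSink(sink);

        env.execute("lambda-memory-optimizer");
    }

    // Illustrative: pull out just the fields the job needs from a platform.report line.
    private static MemoryReport toReport(String line) throws Exception {
        JsonNode root = MAPPER.readTree(line);
        MemoryReport report = new MemoryReport();
        report.functionName = root.path("functionName").asText(); // assumed to be added by the extension's dispatcher
        report.maxMemoryUsedMB = root.path("record").path("metrics").path("maxMemoryUsedMB").asInt();
        report.timestampMillis = System.currentTimeMillis();      // good enough for daily bucketing
        return report;
    }
}
```

Swapping Kinesis for Kafka would only change the source and sink classes; the keyed processing in the middle stays the same.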
Memory updater:
The consumer of the destination Kinesis stream is just a Lambda function whose role is to read the payload containing the function name and memory value and apply the memory configuration update using the AWS SDK.
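That update boils down to a single API call. Below is a sketch of it with the AWS SDK for Java v2, kept in Java for consistency with the other snippets (the updater in the repo may well be Node-based); the class and method names around the call are illustrative.

```java
import software.amazon.awssdk.services.lambda.LambdaClient;
import software.amazon.awssdk.services.lambda.model.UpdateFunctionConfigurationRequest;

public class MemoryUpdater {

    private static final LambdaClient LAMBDA = LambdaClient.create();

    // Called once per recommendation read off the destination Kinesis stream.
    public static void applyRecommendation(String functionName, int memorySizeMb) {
        LAMBDA.updateFunctionConfiguration(UpdateFunctionConfigurationRequest.builder()
                .functionName(functionName)
                .memorySize(memorySizeMb)
                .build());
    }
}
```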
Conclusion:
Cost optimization with serverless architectures requires adequate monitoring and observability to make good decisions about resource configuration. The setup above merely handles one aspect of that for Lambda functions.