Serverless Complex Event Processing with Apache Flink
What’s Apache Flink?
Flink is a distributed processing engine that is capable of performing in-memory computations at scale on data streams. A data stream is a series of events, such as transactions, user interactions on a website, or application logs, coming from one or more sources. Streams in general can be of two types: bounded or unbounded. Bounded streams have a defined start and end, whereas unbounded streams, once started, have no defined end. Flink can handle both stream types at high scale while managing state for you.
Why use Flink?
Common Flink use cases can be grouped into three broad categories: event-driven applications, data analytics, and data pipelines (ETL workloads).
Event-driven applications: Reactive applications that use state to respond when certain events occur are classified as event-driven applications. Stateful processing is one of Flink's core features. Say you want to generate notifications or alerts based on events arriving from more than one source; Flink lets you do this by maintaining internal state to correlate the events and decide whether an alert needs to be sent out.
Data analytics: Flink makes it possible to analyze patterns and gather insights from raw data in real time. Consider a network monitoring system that detects outliers by consuming geographically distributed events as they occur. Reacting quickly to such anomalies can mitigate downtime that would otherwise prove catastrophic.
Flink also ships a robust CEP (Complex Event Processing) library that can be used for pattern matching over event streams. With production deployments reported to handle trillions of events per day, there is very little you cannot achieve within the Flink layer.
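The sample job in this post does not use the CEP library, but to give a flavour of the Pattern API, a match for two consecutive high readings might look roughly like this (the SensorEvent model, the threshold, and all names are purely hypothetical):

```scala
import org.apache.flink.cep.scala.CEP
import org.apache.flink.cep.scala.pattern.Pattern
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.time.Time

// Hypothetical event model, used for illustration only.
case class SensorEvent(deviceId: String, temperature: Double)

object CepSketch {
  // Alerts when the same device reports two consecutive readings above 80
  // degrees within ten seconds.
  def alerts(events: DataStream[SensorEvent]): DataStream[String] = {
    val pattern = Pattern
      .begin[SensorEvent]("first").where(_.temperature > 80.0)
      .next("second").where(_.temperature > 80.0)
      .within(Time.seconds(10))

    CEP.pattern(events.keyBy(_.deviceId), pattern)
      .select(m => s"High temperature on device ${m("first").head.deviceId}")
  }
}
```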
Flink Concepts:
Before we get a Flink job running on the Amazon Kinesis Data Analytics platform, there are some base concepts worth understanding about how the framework works in general. Let's look at a sample job written in Scala. (Note that Flink jobs can be written in Java, Scala or Python; for Python, only the Table API is supported as part of AWS KDA.)
The goal of the job is to correlate events from two sources and identify certain values within those events in order to produce an alert. The job has two input Kinesis streams, and the output (sink) is a Kinesis stream as well.
Notice the use of the Kinesis connector for the consumer and producer. The parameters for the consumer and producer are as follows:
new FlinkKinesisConsumer(stream name, the schema (with which the stream events should be parsed), configuration properties)
new FlinkKinesisProducer(schema, configuration properties)
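A rough sketch of building these with the Kinesis connector (the region, stream names, and use of SimpleStringSchema are assumptions for illustration; the actual job parses the raw JSON further downstream):

```scala
import java.util.Properties

import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.streaming.connectors.kinesis.config.{AWSConfigConstants, ConsumerConfigConstants}
import org.apache.flink.streaming.connectors.kinesis.{FlinkKinesisConsumer, FlinkKinesisProducer}

object KinesisIo {
  // Consumer configuration: which region to read from and where to start.
  def consumer(streamName: String): FlinkKinesisConsumer[String] = {
    val config = new Properties()
    config.setProperty(AWSConfigConstants.AWS_REGION, "us-east-1")
    config.setProperty(ConsumerConfigConstants.STREAM_INITIAL_POSITION, "LATEST")
    new FlinkKinesisConsumer[String](streamName, new SimpleStringSchema(), config)
  }

  // Producer configuration: the destination stream is set on the producer itself.
  def producer(streamName: String): FlinkKinesisProducer[String] = {
    val config = new Properties()
    config.setProperty(AWSConfigConstants.AWS_REGION, "us-east-1")
    val sink = new FlinkKinesisProducer[String](new SimpleStringSchema(), config)
    sink.setDefaultStream(streamName)
    sink.setDefaultPartition("0")
    sink
  }
}
```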
Operators in Flink play an important role in datastream transformations, especially when they are chained together. This job uses three operators to transform the data streams:
- flatMap — A map function (which can be of the Rich type if need be) that extracts data from the input event, a stringified JSON object. Of the three flatMaps, two extract and parse the relevant data into a defined model using JsonPath (the data modelling is based on POJO, plain old Java object, classes); the third is a RichCoFlatMapFunction that correlates events from the two sources and determines whether a result should be forwarded to the sink. A sketch of a parsing flatMap chained with keyBy and connect follows this list.
- keyBy — This produces a keyed datastream by logically partitioning a stream into disjoint partitions. All records with the same key are assigned to the same partition. Internally, keyBy() is implemented with hash partitioning. A keyed stream is required for setting up state management. We will discuss more on this shortly.
- connect — Connects two data streams while retaining their types, producing a single connected stream on which the correlation logic can be applied with the help of state.
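As a rough sketch of how a parsing flatMap, keyBy and connect fit together (the BookingCreated and BookingEvent models, field names, and JSON paths are assumptions for illustration, not the job's actual schema):

```scala
import com.jayway.jsonpath.JsonPath
import org.apache.flink.api.common.functions.FlatMapFunction
import org.apache.flink.streaming.api.scala._
import org.apache.flink.util.Collector

// Hypothetical models; the real job maps JSON into POJO-style classes.
case class BookingCreated(bookingId: String, channel: String)
case class BookingEvent(bookingId: String, status: String)

// A parsing flatMap: malformed records are silently dropped instead of
// failing the job, since nothing is emitted for them.
class BookingCreatedParser extends FlatMapFunction[String, BookingCreated] {
  override def flatMap(raw: String, out: Collector[BookingCreated]): Unit =
    try {
      out.collect(BookingCreated(
        JsonPath.read[String](raw, "$.bookingId"),
        JsonPath.read[String](raw, "$.channel")))
    } catch { case _: Exception => () }
}

class BookingEventParser extends FlatMapFunction[String, BookingEvent] {
  override def flatMap(raw: String, out: Collector[BookingEvent]): Unit =
    try {
      out.collect(BookingEvent(
        JsonPath.read[String](raw, "$.bookingId"),
        JsonPath.read[String](raw, "$.status")))
    } catch { case _: Exception => () }
}

object Operators {
  // Parse each raw stream, key both by bookingId so matching events land in
  // the same partition, then connect them for correlation.
  def correlate(createdRaw: DataStream[String], eventsRaw: DataStream[String]) = {
    val created = createdRaw.flatMap(new BookingCreatedParser).keyBy(_.bookingId)
    val events  = eventsRaw.flatMap(new BookingEventParser).keyBy(_.bookingId)
    created.connect(events) // the correlation CoFlatMap is shown in the state section below
  }
}
```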
The sketch above illustrates a parsing flatMap applied to an event stream. By chaining the keyBy operator after the flatMap, we end up with a keyed data stream.
The final flatMap that connects two streams together is where we make use of state for processing the incoming events from both streams.
Flink supports various types of state. We use ValueState in this case to keep track of the events and their count. A BookingCreated event may arrive before or after a BookingEvent, and this is where state plays an important role in keeping track of the events coming through. All events from both streams are keyed by the same parameter and are therefore logically partitioned by the same hash, which allows events sharing that key to be correlated. Once the condition is fulfilled, we clear the state and output an event carrying the two values.
When it comes to working with state, the stream needs to be keyed, and access to state is only available in Rich functions.
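Reusing the hypothetical models from the sketch above, a correlation function built on ValueState might look roughly like this (the matching rule and the output format are assumptions rather than the job's actual logic):

```scala
import org.apache.flink.api.common.state.{ValueState, ValueStateDescriptor}
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.functions.co.RichCoFlatMapFunction
import org.apache.flink.util.Collector

// Buffers whichever event arrives first for a given bookingId (the streams are
// keyed, so state is scoped per key) and emits a combined record once both
// sides have been seen, clearing the state afterwards.
class BookingCorrelator
    extends RichCoFlatMapFunction[BookingCreated, BookingEvent, String] {

  @transient private var createdState: ValueState[BookingCreated] = _
  @transient private var eventState: ValueState[BookingEvent] = _

  override def open(parameters: Configuration): Unit = {
    createdState = getRuntimeContext.getState(
      new ValueStateDescriptor("created", classOf[BookingCreated]))
    eventState = getRuntimeContext.getState(
      new ValueStateDescriptor("event", classOf[BookingEvent]))
  }

  override def flatMap1(created: BookingCreated, out: Collector[String]): Unit = {
    val pending = eventState.value()
    if (pending != null) {          // the matching BookingEvent already arrived
      eventState.clear()
      out.collect(s"${created.bookingId}:${pending.status}")
    } else {
      createdState.update(created)  // wait for the matching BookingEvent
    }
  }

  override def flatMap2(event: BookingEvent, out: Collector[String]): Unit = {
    val pending = createdState.value()
    if (pending != null) {          // the matching BookingCreated already arrived
      createdState.clear()
      out.collect(s"${event.bookingId}:${event.status}")
    } else {
      eventState.update(event)      // wait for the matching BookingCreated
    }
  }
}

// Applied to the connected, keyed streams from the earlier sketch, e.g.
// Operators.correlate(createdRaw, eventsRaw).flatMap(new BookingCorrelator).addSink(...)
```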
An important aspect to keep track of is the parallelism of your application. A Flink program consists of multiple tasks (transformations/operators, data sources, and sinks). A task is split into several parallel instances for execution and each parallel instance processes a subset of the task’s input data. The number of parallel instances of a task is called its parallelism. Efficient setting of the parallelism is imperative to the scalability of your application.
In this example, we have set the parallelism at the execution environment level to 2. However, since we will be using Kinesis Data Analytics, this setting can be dynamic in nature, as we will see later.
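For reference, a minimal sketch of setting the parallelism at the environment level (with a per-operator override shown as a comment; the operator and factor are illustrative):

```scala
import org.apache.flink.streaming.api.scala._

object ParallelismSketch {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // Default parallelism for every operator in this job.
    env.setParallelism(2)

    // Individual operators can still override it where needed, e.g.
    // stream.flatMap(new BookingCreatedParser).setParallelism(4)
  }
}
```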
Building & compiling the job:
There are 3 ways to build the packaged .jar file for deploying your application.
- Gradle
- Maven
- SBT
Each of these has a documented setup that can be explored from here.
Note: When working with Kinesis Data Analytics, you want to ensure that the relevant dependencies match the Flink version you will be working with. At the time of writing, KDA supports Flink 1.11.1.
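As an illustration only, an sbt build targeting that runtime might declare dependencies along these lines (the Scala version and the choice of provided versus bundled scopes are assumptions to verify against the KDA documentation):

```scala
// build.sbt -- illustrative only; verify versions against the KDA documentation
scalaVersion := "2.11.12"          // assumed to match the KDA Flink 1.11 runtime

val flinkVersion = "1.11.1"

libraryDependencies ++= Seq(
  // Provided by the Flink runtime, so it does not need to be bundled.
  "org.apache.flink" %% "flink-streaming-scala"   % flinkVersion % Provided,
  // Connectors and libraries are not provided and must be packaged into the jar.
  "org.apache.flink" %% "flink-connector-kinesis" % flinkVersion,
  "org.apache.flink" %% "flink-cep-scala"         % flinkVersion
)

// The application is deployed as a single fat jar (for example via sbt-assembly)
// so that the bundled dependencies above ship with the job.
```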
Getting started on Amazon Kinesis Data Analytics (KDA):
Amazon Kinesis Data Analytics is a managed serverless offering that allows you to set up the Flink engine for your streaming applications. There are no servers to manage, no minimum fee or setup cost, and you only pay for the resources your streaming applications consume. KDA integrates with Amazon Managed Streaming for Apache Kafka (Amazon MSK), Amazon Kinesis Data Streams, Amazon Elasticsearch Service, Amazon DynamoDB Streams, Amazon S3, custom integrations, and more using built-in connectors.
With the application jar built, upload it to an S3 bucket and configure the KDA application to point to that object's location in S3. You will probably want to keep logging turned on, as it helps narrow down problems when the application logic misbehaves or when the job itself isn't running.
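A common pattern on KDA is to supply application configuration (stream names, regions, and so on) as runtime properties instead of hard-coding it, and to read them with the aws-kinesisanalytics-runtime library. A small sketch, assuming a hypothetical property group called ConsumerConfig:

```scala
import java.util.Properties

import com.amazonaws.services.kinesisanalytics.runtime.KinesisAnalyticsRuntime

object KdaConfig {
  // The property group ("ConsumerConfig") and key ("input.stream.name") are
  // hypothetical; they must match what is configured on the KDA application.
  def inputStreamName(): String = {
    val groups = KinesisAnalyticsRuntime.getApplicationProperties()
    val props: Properties = Option(groups.get("ConsumerConfig")).getOrElse(new Properties())
    props.getProperty("input.stream.name", "booking-events")
  }
}
```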
KDA can automatically scale an application up when throughput exceeds a certain limit and, conversely, scale it down when usage stays low for a sustained period. AWS scales your application up or down depending on these conditions:
- Your application scales up (increases parallelism) when your CPU usage remains at 75 percent or above for 15 minutes.
- Your application scales down (decreases parallelism) when your CPU usage remains below 10 percent for six hours.
- AWS will not reduce your application’s current Parallelism value to less than your application’s Parallelism setting.
With a running job, the KDA console will look similar to this, where you can explore various attributes related to the resource utilization of your application.
With every successful record flowing through the input stream, the record counter updates accordingly.
Conclusion:
We have looked at some basic concepts of the Flink framework and how powerful it is for event aggregation at hyper scale with ultra-low latency. Kinesis Data Analytics for Flink does most of the heavy lifting in terms of configuration and scalability, so the focus can be directed towards application development. AWS also offers the Flink engine as part of its EMR service if you'd like precise control over how the cluster is set up and which libraries are used under the hood.
Another core concept of Flink that is certainly worth exploring is windowing. Windowing splits your streams into buckets of finite size over which computations can be applied. Windows can be driven by time or by triggers. For a deep dive, visit this link.
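As a small taste, a tumbling one-minute count per key might look like this (reusing the hypothetical BookingEvent model from the earlier sketch):

```scala
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows
import org.apache.flink.streaming.api.windowing.time.Time

object WindowSketch {
  // Counts BookingEvents per bookingId over fixed one-minute buckets.
  def countsPerMinute(events: DataStream[BookingEvent]): DataStream[(String, Int)] =
    events
      .map(e => (e.bookingId, 1))
      .keyBy(_._1)
      .window(TumblingProcessingTimeWindows.of(Time.minutes(1)))
      .sum(1)
}
```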
The code samples above are only illustrative and show how Flink works at a high level; feel free to drop a comment or DM me on Twitter if you would like to know more about the setup.