The Snowplow Apache Spark Streaming Example Project can help you jumpstart your own real-time event processing pipeline. We will take you through the steps to get this simple analytics-on-write job setup and processing your Kinesis event stream.
Read on after the fold for:
Apache Spark Streaming enables scalable, high-throughput, fault-tolerant stream processing of live data streams, using a “micro-batch” architecture. Our event stream will be ingested from Kinesis by our Scala application written for and deployed onto Spark Streaming.
Amazon Kinesis is a fully managed service for real-time processing of streaming data at massive scale. In this project we leverage the new Kinesis receiver that has been recently developed for Spark Streaming, leveraging the Kinesis Client Library.
Our Spark Streaming job reads a Kinesis stream containing events in a JSON format:
Our job counts the events by
type and aggregates these counts into 1 minute buckets. The job then takes these aggregates and saves them into a table in DynamoDB:
The most complete open-source example of an analytics-on-write implementation is Ian Meyers’ amazon-kinesis-aggregators project; our example project is in turn heavily influenced by the concepts in Ian’s work. Two important concepts to understand in analytics-on-write are:
In this tutorial, we’ll walk through the process of getting up and running with Amazon Kinesis and Apache Spark Streaming. You will need git, Vagrant and VirtualBox installed locally. This project is specifically configured to run in AWS region “us-east-1” to ensure all AWS services are available. Building Spark on a Vagrant box requires at least 8GB of RAM and 64 bit OS hosting vagrant.
In your local terminal:
Let’s now build the project. This should take around 10 minutes with these commands:
You’re going to need IAM-based credentials for AWS. Get your keys and type in “aws configure” in the Vagrant box (the guest). In the below, I’m also setting the region to “us-east-1” and output format to “json”:
We’re going to set up the Kinesis stream. Your first step is to create a stream and verify that it was successful. Use the following command to create a stream named “my-stream”:
If you check the stream and it returns with status CREATING, it means that the Kinesis stream is not quite ready to use. Check again in a few moments, and you should see output similar to the below:
I’m using “my-table” as the table name. Invoke the creation of the table with:
Once the Kinesis’ stream’s “StreamStatus” is
ACTIVE, you can start sending events to the stream by:
Now we need to build a version of Spark with Amazon Kinesis support.
Spark now comes packaged with a self-contained Maven installation to ease building and deployment of Spark from source located under the build/ directory. This script will automatically download and setup all necessary build requirements (Maven, Scala, and Zinc) locally within the build/ directory itself. It honors any mvn binary if present already, however, will pull down its own copy of Scala and Zinc regardless to ensure proper version requirements are met.
We can issue the invoke command to build Spark with Kinesis support; be aware that this could take over an hour:
Open a new terminal window and log into the Vagrant box with:
Now start Apache Spark Streaming system with this command:
If you have updated any of the configuration options above (e.g. stream name or region), then you will have to update the config.hocon.sample file accordingly.
Under the covers, we’re submitting the compiled spark-streaming-example-project jar to run on Spark using the
First review the spooling output of the
run_project command above - it’s very verbose, but if you don’t see any Java stack traces in there, then Spark Streaming should be running okay.
Now head over to your host machine’s localhost:4040 and you should see something like this:
Success! You can now see data being written to the table in DynamoDB. Make sure you are in the correct AWS region, then click on
my-table and hit the
Explore Table button:
For each BucketStart and EventType pair, we see a Count, plus some CreatedAt and UpdatedAt metadata for debugging purposes. Our bucket size is 1 minute, and we have 5 discrete event types, hence the matrix of rows that we see.
Remember to shut off:
StreamingCountingAppDynamoDB table (created automatically by the Kinesis Client Library)
This is a short list of our most frequently asked questions.
I got an out of memory error when trying to build Apache Spark:
I found an issue with the project:
Spark is an increasing focus for us at Snowplow. Recently, we detailed our First experiments with Apache Spark. Also, catch up on our newly released version 0.3.0 of our spark-example-project.
Separately, we are also now starting a port of this example project to AWS Lambda - you can follow our progress in the aws-lambda-example-project repo.
This example project is a very simple example of an event processing technique which is called analytics-on-write. We are planning on exploring these techniques further in a new project, called Icebucket. Stay tuned for more on this!