This project was built by the Data Engineering team at Snowplow Analytics as a proof-of-concept for porting the Snowplow Enrichment process (which is written in Scala) to Samza.
Read on after the fold for:
Apache Kafka is a unified log technology, which is “designed to allow a single cluster to serve as the central data backbone for a large organization”. Regular readers of this blog may be more familiar with Amazon Kinesis - you can think of Kafka broadly as like a self-hosted version of Kinesis.
Apache Samza is a stateful stream processing framework from the team at LinkedIn. It has tight integration with Apache Kafka, and is designed to operate inside a resource-management/scheduler platform such as Apache YARN. Samza is similar to the more well-known Apache Storm framework, but Samza is in our view easier to operate than Storm and its stream processing topologies are easier to reason about.
We have implemented a simple analytics-on-write stream processing job using Apache Samza. Our Samza job reads a Kafka topic,
example-project-inbound, containing “inbound” events in a JSON format:
Our job counts the events by
type and aggregates these counts into 1 minute buckets based on the
Every 30 seconds, our job emits a “window summary” event to a second Kafka topic,
example-project-window-summary, where this event reports the new counts for any type-bucket pairs which have been updated in the past 30 seconds.
In this tutorial, we’ll walk through the process of getting up and running with Amazon Kinesis and AWS Lambda Service. You will need git, Vagrant and VirtualBox installed locally.
First clone the repo and bring up Vagrant:
vagrant up will install everything we need to compile and run our Samza job - including Java, Scala, ZooKeeper, Kafka and YARN.
The rest of the detailed setup assumes you are still inside the Vagrant guest.
Using Tim Harper’s custom SBT tasks, packaging our job is straightforward:
You should now have a package job artifact available as
Now we are ready to submit our Samza job to Apache YARN, the resource-manager and scheduler. Once submitted, YARN will take on responsibility for running our new Samza job and allocating it the resources it needs (even spinning up multiple copies of our job).
First, extract our packaged job artifact to a deployment folder:
Now we can deploy our job to YARN:
On your host machine, browse to the Samza web UI at http://localhost:8088 to watch your job starting up:
Our Samza job will automatically create the Kafka topics for us if they don’t already exist. Confirm this with this command:
Let’s run a tail on the last topic,
Good - you can see that our job is emitting a window summary event every 30 seconds.
So far the “counts” property is empty because our Samza job hasn’t received any inbound events yet. Let’s change that.
Now let’s send in some “inbound” events into our
example-project-inbound topic. In a new terminal, run this command:
This producer will sit waiting for input. Let’s feed it some events, making sure to hit enter after every line:
Now switch back to your consumer terminal and wait a few seconds:
Great! The “window summary” event in the middle of this output now reports the totals for each event type, by 1 minute bucket.
To prove that we are using Samza’s key-value store to track all-time counts, go back to your producer terminal and add in another event:
Now back in your consumer terminal:
Excellent! Our count for Green events for the 2015-06-05T12:54:00.000 minute has now risen to 2.
The tutorial materials in this README for getting started with Samza, YARN and Kafka are adapted from Chapters 2-4 of Alex Dean’s Unified Log Processing book.
This project draws heavily on various analytics-on-write example projects from Snowplow:
All of these four example projects are based on an event processing technique called analytics-on-write. We are planning on exploring these techniques further in a new project, called Icebucket. Stay tuned for more on this!