We are pleased to announce the release of our first analytics SDK for Snowplow, created for data engineers and data scientists working with Snowplow in Scala.
The Snowplow Analytics SDK for Scala lets you work with Snowplow enriched events in your Scala event processing, data modeling and machine-learning jobs. You can use this SDK with Apache Spark, AWS Lambda, Apache Flink, Scalding, Apache Samza and other Scala-compatible data processing frameworks.
Some good use cases for the SDK include:
- Performing event data modeling in Apache Spark as part of our Hadoop batch pipeline
- Developing machine learning models on your event data using Apache Spark (e.g. using Databricks or Zeppelin on EMR)
- Performing analytics-on-write in AWS Lambda as part of our Kinesis real-time pipeline
1. Overview
The Scala Analytics SDK makes it significantly easier to build applications that consume Snowplow enriched data directly from Kinesis or S3.
The Snowplow enriched event is a relatively complex TSV string containing self-describing JSONs. Rather than work with this structure directly, Snowplow analytics SDKs ship with event transformers, which translate the Snowplow enriched event format into something more convenient for engineers and analysts.
As the Snowplow enriched event format evolves towards a cleaner Apache Avro-based structure, we will be updating this Analytics SDK to maintain compatibility across different enriched event versions.
Working with the Snowplow Scala Analytics SDK therefore has two major advantages over working with Snowplow enriched events directly:
- The SDK reduces your development time by providing analyst- and developer-friendly transformations of the Snowplow enriched event format
- The SDK future-proofs your code against new Snowplow releases that change our enriched event format
Currently the Analytics SDK for Scala ships with one event transformer: the JSON Event Transformer. Let’s check this out next.
2. The JSON Event Transformer
The JSON Event Transformer takes a Snowplow enriched event and converts it into a JSON ready for further processing. This transformer was adapted from the code used to load Snowplow events into Elasticsearch in the Kinesis real-time pipeline.
The JSON Event Transformer converts a Snowplow enriched event into a single JSON like so:
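To make the idea of a TSV-to-JSON conversion concrete, here is a deliberately tiny sketch. It is not the SDK's implementation: `ToyEventTransformer` is a hypothetical name, only the first three enriched event columns are modeled, and every value is rendered as a string, whereas the real format has many more fields and richer types.

```scala
object ToyEventTransformer {
  // Assumption: only the first three enriched event columns, for illustration.
  val FieldNames: List[String] = List("app_id", "platform", "etl_tstamp")

  // Wraps a string in double quotes, escaping backslashes and quotes.
  private def quote(s: String): String =
    "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\""

  /** Pairs each tab-separated value with its field name and renders a flat
    * JSON object. Empty values become null; a wrong column count is a Left. */
  def transform(line: String): Either[String, String] = {
    val values = line.split("\t", -1).toList
    if (values.length != FieldNames.length)
      Left(s"Expected ${FieldNames.length} fields, got ${values.length}")
    else {
      val pairs = FieldNames.zip(values).map { case (k, v) =>
        if (v.isEmpty) s"${quote(k)}:null" else s"${quote(k)}:${quote(v)}"
      }
      Right(pairs.mkString("{", ",", "}"))
    }
  }
}
```

Calling `ToyEventTransformer.transform("demo\tweb\t2015-12-01 08:32:35.048")` yields a `Right` holding a single flat JSON object, while a line with the wrong number of columns yields a `Left` with an error message.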
The most complex piece of processing is the handling of the self-describing JSONs found in the enriched event’s unstruct_event, contexts and derived_contexts fields. All self-describing JSONs found in the event are flattened into top-level plain (i.e. not self-describing) objects within the enriched event JSON.
For example, if an enriched event contained a com.snowplowanalytics.snowplow/link_click/jsonschema/1-0-1 self-describing JSON, then the final JSON would contain a corresponding plain top-level object.
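One way to picture the flattening is through the name given to the resulting top-level field. The sketch below assumes the convention used when loading Snowplow events into Elasticsearch, where the originating field, schema vendor, schema name and major version are joined with underscores. `SchemaFieldName` and `fieldName` are hypothetical names, and the SDK's actual logic may differ in detail.

```scala
object SchemaFieldName {
  // Matches vendor/name/format/MODEL-REVISION-ADDITION, with an optional "iglu:" prefix.
  private val SchemaUri =
    """^(?:iglu:)?([a-zA-Z0-9\-_.]+)/([a-zA-Z0-9\-_]+)/[a-zA-Z0-9\-_]+/([0-9]+)-[0-9]+-[0-9]+$""".r

  /** Builds a flat field name of the form <prefix>_<vendor>_<name>_<model>.
    * `prefix` is the originating field, e.g. "unstruct_event" or "contexts" (assumed). */
  def fieldName(prefix: String, schema: String): Option[String] = schema match {
    case SchemaUri(vendor, name, model) =>
      // Dots and hyphens in the vendor become underscores.
      val snakeVendor = vendor.replaceAll("""[.\-]""", "_").toLowerCase
      // CamelCase names are converted to snake_case.
      val snakeName = name
        .replaceAll("([a-z0-9])([A-Z])", "$1_$2")
        .replaceAll("-", "_")
        .toLowerCase
      Some(s"${prefix}_${snakeVendor}_${snakeName}_$model")
    case _ => None
  }
}
```

Under this assumed convention, the link click schema above maps to the field name `unstruct_event_com_snowplowanalytics_snowplow_link_click_1`, and the object stored under that name holds only the plain data, without the schema wrapper.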
For more information, check out the Scala Analytics SDK wiki page.
3. Using the SDK
3.1 Installation
The latest version of the Snowplow Scala Analytics SDK is 0.1.0, which is cross-built against Scala 2.10.x and 2.11.x.
If you’re using SBT, add the following lines to your build file:
Note the double percent (%%) between the group and artifactId. This will ensure that you get the right package for your Scala version.
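As a rough illustration of the kind of build-file lines described above, the fragment below shows a resolver and a cross-built dependency in sbt syntax. The repository URL, group ID and artifact name are assumptions that should be checked against the SDK's own documentation; only the version number, 0.1.0, comes from this post.

```scala
// build.sbt (sketch)

// Assumed location of the Snowplow Maven repository.
resolvers += "Snowplow Analytics" at "http://maven.snplow.com/releases/"

// Assumed coordinates; %% appends the Scala binary version (e.g. _2.11) to the artifact name.
libraryDependencies += "com.snowplowanalytics" %% "snowplow-scala-analytics-sdk" % "0.1.0"
```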
3.2 Using from Apache Spark
The Scala Analytics SDK is a great fit for performing Snowplow event data modeling in Apache Spark and Spark Streaming.
Here’s the code we use internally for our own data modeling jobs:
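The following is a minimal sketch of this pattern rather than the exact internal code. It assumes that the transformer lives at `com.snowplowanalytics.snowplow.analytics.scalasdk.json.EventTransformer`, that `transform` returns a validation-style result exposing `toOption` with a JSON string on success, and that Spark 1.x's `SQLContext` is available. The object name and the input path argument are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Assumed package path for the SDK's JSON Event Transformer.
import com.snowplowanalytics.snowplow.analytics.scalasdk.json.EventTransformer

object EventModelingSketch {
  def main(args: Array[String]): Unit = {
    // Hypothetical argument: location of the enriched event TSV files, e.g. on S3.
    val inputPath = args(0)

    val sc = new SparkContext(new SparkConf().setAppName("snowplow-event-modeling-sketch"))
    val sqlContext = new SQLContext(sc)

    // Transform each enriched event line to JSON, silently dropping failures.
    val events = sc.textFile(inputPath)
      .flatMap(line => EventTransformer.transform(line).toOption.toList)

    // Let Spark infer a schema from the JSON strings.
    val dataframe = sqlContext.read.json(events)
    dataframe.printSchema()

    sc.stop()
  }
}
```

Dropping failed transformations keeps the sketch short; a production job would more likely collect the failure messages and write them somewhere for inspection.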
3.3 Using from AWS Lambda
The Scala Analytics SDK is a great fit for performing analytics-on-write, monitoring or alerting on Snowplow event streams using AWS Lambda.
Here’s some sample code for transforming enriched events into JSON inside a Scala Lambda:
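A minimal sketch of such a handler is shown below, under the same assumptions about the SDK's package path and the return type of `transform`. `EnrichedEventHandler` and `recordHandler` are hypothetical names; `KinesisEvent` comes from the AWS Lambda Java events library.

```scala
import java.nio.charset.StandardCharsets
import scala.collection.JavaConverters._

import com.amazonaws.services.lambda.runtime.events.KinesisEvent
// Assumed package path for the SDK's JSON Event Transformer.
import com.snowplowanalytics.snowplow.analytics.scalasdk.json.EventTransformer

class EnrichedEventHandler {

  /** Hypothetical Lambda entry point: converts each Kinesis record
    * (one enriched event TSV line) to JSON and prints it. */
  def recordHandler(event: KinesisEvent): Unit = {
    val jsons = event.getRecords.asScala
      // Decode the record payload as a UTF-8 string.
      .map(rec => new String(rec.getKinesis.getData.array(), StandardCharsets.UTF_8))
      // Keep only the events that transformed successfully.
      .flatMap(line => EventTransformer.transform(line).toOption.toList)

    jsons.foreach(println)
  }
}
```

In a real function the `println` call would be replaced by whatever analytics-on-write, monitoring or alerting logic the stream calls for.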
4. Roadmap
We are hugely excited about developing our analytics SDK initiative in four directions:
- Adding more SDKs for other languages popular for data analytics and engineering, including Python, Node.js (for AWS Lambda) and Java
- Adding additional event transformers to the Scala Analytics SDK – please let us know any suggestions!
- “Dogfooding” the Scala Analytics SDK by starting to use it in standard Snowplow components, such as our Kinesis Elasticsearch Sink (#2553)
- Adding further functions that are particularly useful for processing event data (and sequences of event data)
If you would like to help out, please get in touch! In particular, we’d love to get contributions to the official Python or Node.js Analytics SDKs.
5. Getting help
We are working on a new section of the Snowplow wiki dedicated to our Analytics SDKs.