Snowplow Scala Analytics SDK 0.2.0 released
We are pleased to announce the 0.2.0 release of the Snowplow Scala Analytics SDK, a library providing tools to process and analyze Snowplow enriched events in Scala-compatible data processing frameworks such as Apache Spark, AWS Lambda, Apache Flink and Scalding, as wells other JVM-compatible data processing frameworks.
This release adds run manifest functionality, removes the Scalaz dependency and adds SDK artifacts to Maven Central, along with many other internal changes.
In the rest of this post we will cover:
- Run manifests
- Using the run manifest
- Important bug fixes
- Removing the Scalaz dependency
- Getting help
1. Run manifests
This release provides tooling for populating and inspecting a Snowplow run manifest. A Snowplow run manifest is a lightweight and robust way to track your data-modeling step progress. It lets you check if a particular “run folder” of enriched events was already processed and can be safely skipped.
Run manifests were previously introduced in Snowplow Python Analytics SDK 0.2.0; the Scala API closely resembles the Python SDK’s version.
Historically, Snowplow’s batch pipeline apps have moved whole folders of data around different locations in Amazon S3 in order to track progress through a pipeline run, and to avoid accidentally reprocessing that data. But file moves have their own disadvantages:
- They are time-consuming
- They are network-intensive
- They are error-prone - a failure to move a file will cause the job to fail and require manual intervention
- They only support one use-case at a time - you can’t have two distinct jobs moving the same files at the same time
- They are inflexible - it’s impossible to add metadata about processing status
Although Snowplow continues to move files, we recommend that you to use a run manifest for your own data processing jobs on Snowplow data.
In this case, we store our manifest in a AWS DynamoDB table, and we use it to keep track of which Snowplow runs our job has already processed.
2. Using the run manifest
The run manifest functionality resides in the new
Here’s a short usage example:
In above example, we create two AWS service clients, one for S3 (to list job runs) and for DynamoDB (to access our manifest).
Then we list all Snowplow runs in a particular S3 path, filtering only those which were not processed yet, and also not archived to AWS Glacier (helping to to avoid increasing AWS costs for restoring data). Then we process the data with the user-provided
process function. Note that
runId is just a simple string with the S3 key of particular job run.
RunManifests class, then, is a simple API wrapper to DynamoDB, which lets you:
createa DynamoDB table for manifests
adda Snowplow run to the table
- check if table
containsa given run ID
3. Important bug fixes
Version 0.2.0 also includes some important bug fixes:
- Schema names with hyphens were considered invalid, now fixed (#22)
- An event with an empty array of custom contexts is no longer considered invalid (#27)
- We no longer add unnecessary
nulls to the generated JSON in the case of an empty
contextsis now omitted (#11)
4. Removing the Scalaz dependency
In our initial release, we used the
Validation datatype from Scalaz library to represent event transformation failures.
This approach is taken inside Snowplow pipeline and serves us quite well there, but in this SDK it caused problems by transitively bringing in dependencies, breaking jobs compiled against Scala 2.11 and confusing analysts unfamiliar in the ways of Functional Programming.
To solve this we replaced
Either - a similar data type from the Scala standard library. This is the only breaking change in this release, and you still can filter RDD to use only successfully transformed JSONs in a familiar way:
As we adding more features to SDK it becomes harder to keep up-to-date documentation in the project’s README. In this release we have split out the README into several wiki pages, each dedicated to a particular feature.
Check out the Scala Analytics SDK in the main Snowplow wiki.
As of this release, the Scala Analytics SDK is available at Maven Central. If you’re using SBT you can add it as follows:
7. Getting help
And if there’s another Snowplow Analytics SDK that you’d like us to prioritize creating, please let us know in the forums!