Snowplow Scala Analytics SDK 0.2.0 released

We are pleased to announce the 0.2.0 release of the Snowplow Scala Analytics SDK, a library providing tools to process and analyze Snowplow enriched events in Scala-compatible data processing frameworks such as Apache Spark, AWS Lambda, Apache Flink and Scalding, as well as other JVM-compatible data processing frameworks.

This release adds run manifest functionality, removes the Scalaz dependency and adds SDK artifacts to Maven Central, along with many other internal changes.

In the rest of this post we will cover:

  1. Run manifests
  2. Using the run manifest
  3. Important bug fixes
  4. Removing the Scalaz dependency
  5. Documentation
  6. Upgrading
  7. Getting help

1. Run manifests

This release provides tooling for populating and inspecting a Snowplow run manifest. A Snowplow run manifest is a lightweight and robust way to track your data-modeling step progress. It lets you check if a particular “run folder” of enriched events was already processed and can be safely skipped.

Run manifests were previously introduced in Snowplow Python Analytics SDK 0.2.0; the Scala API closely resembles the Python SDK’s version.

Historically, Snowplow’s batch pipeline apps have moved whole folders of data around different locations in Amazon S3 in order to track progress through a pipeline run, and to avoid accidentally reprocessing that data. But file moves have their own disadvantages:

  1. They are time-consuming
  2. They are network-intensive
  3. They are error-prone - a failure to move a file will cause the job to fail and require manual intervention
  4. They only support one use-case at a time - you can’t have two distinct jobs moving the same files at the same time
  5. They are inflexible - it’s impossible to add metadata about processing status

Although Snowplow itself continues to move files, we recommend that you use a run manifest for your own data processing jobs on Snowplow data.

In this case, we store our manifest in an AWS DynamoDB table, and we use it to keep track of which Snowplow runs our job has already processed.

2. Using the run manifest

The run manifest functionality resides in the new RunManifests module.

Here’s a short usage example:

import com.amazonaws.services.s3.AmazonS3ClientBuilder
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBAsyncClientBuilder
import com.amazonaws.auth.DefaultAWSCredentialsProviderChain
import com.snowplowanalytics.snowplow.analytics.scalasdk.RunManifests

val DynamodbRunManifestsTable = "snowplow-run-manifests"
val EnrichedEventsArchive = "s3://acme-snowplow-data/storage/enriched-archive/"

val s3Client = AmazonS3ClientBuilder.standard()
  .withCredentials(DefaultAWSCredentialsProviderChain.getInstance)
  .build()

val dynamodbClient = AmazonDynamoDBAsyncClientBuilder.standard()
  .withCredentials(DefaultAWSCredentialsProviderChain.getInstance)
  .build()

val runManifestsTable = RunManifests(dynamodbClient, DynamodbRunManifestsTable)
runManifestsTable.create()

val unprocessed = RunManifests.listRunIds(s3Client, EnrichedEventsArchive)
  .filterNot(runManifestsTable.contains)

unprocessed.foreach { runId =>
  process(runId)
  runManifestsTable.add(runId)
}

In the above example, we create two AWS service clients: one for S3 (to list job runs) and one for DynamoDB (to access our manifest).

Then we list all Snowplow runs in a particular S3 path, keeping only those that have not yet been processed and have not been archived to AWS Glacier (which helps avoid the AWS costs of restoring archived data). Finally, we process each run with the user-provided process function. Note that runId is just a simple string containing the S3 key of a particular job run.
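Because each runId is a plain string, you can work with it using ordinary string operations. A minimal illustration, where the run folder name and the runTimestamp helper are hypothetical examples of ours, not part of the SDK:

```scala
// Hypothetical run ID: a plain string holding the S3 key of a run folder
val runId = "storage/enriched-archive/run=2017-05-26-02-40-30/"

// A trivial helper of our own (not part of the SDK) that pulls the
// timestamp out of a run folder name
def runTimestamp(id: String): Option[String] =
  "run=([0-9-]+)".r.findFirstMatchIn(id).map(_.group(1))

println(runTimestamp(runId)) // prints Some(2017-05-26-02-40-30)
```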

The RunManifests class, then, is a simple API wrapper for DynamoDB, which lets you:

  • create a DynamoDB table for manifests
  • add a Snowplow run to the table
  • check whether the table contains a given run ID

3. Important bug fixes

Version 0.2.0 also includes some important bug fixes:

  • Schema names containing hyphens were incorrectly considered invalid; this is now fixed (#22)
  • An event with an empty array of custom contexts is no longer considered invalid (#27)
  • An empty unstruct_event or contexts property is now omitted from the generated JSON instead of producing an unnecessary null (#11)

4. Removing the Scalaz dependency

In our initial release, we used the Validation datatype from the Scalaz library to represent event transformation failures.

This approach is taken inside the Snowplow pipeline and serves us quite well there, but in this SDK it caused problems: transitively bringing in dependencies, breaking jobs compiled against Scala 2.11, and confusing analysts unfamiliar with the ways of Functional Programming.

To solve this, we replaced Validation with Either, a similar data type from the Scala standard library. This is the only breaking change in this release, and you can still filter an RDD down to the successfully transformed JSONs in a familiar way:

val events = input
  .map(line => EventTransformer.transform(line))
  .filter(_.isRight)
  .flatMap(_.right.toOption)
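If you also want to keep hold of the failures, for example to log them, Either makes that straightforward too. Here is a minimal, self-contained sketch that uses a toy transform function of our own in place of the real EventTransformer:

```scala
// Toy stand-in for EventTransformer.transform: fails on empty lines.
// The real transformer returns the transformed JSON on the Right side.
def transform(line: String): Either[List[String], String] =
  if (line.nonEmpty) Right(line.toUpperCase) else Left(List("empty line"))

val input = List("page_view", "", "link_click")

// Split results into failures and successes in a single pass
val (failures, successes) = input.map(transform).partition(_.isLeft)

val events = successes.flatMap(_.right.toOption)
println(events)          // prints List(PAGE_VIEW, LINK_CLICK)
println(failures.length) // prints 1
```

The same partition call works on a Spark RDD, so failures can be written to a side output rather than silently dropped.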

5. Documentation

As we add more features to the SDK, it becomes harder to keep up-to-date documentation in the project’s README. In this release we have split the README out into several wiki pages, each dedicated to a particular feature.

Check out the Scala Analytics SDK in the main Snowplow wiki.

6. Upgrading

As of this release, the Scala Analytics SDK is available on Maven Central. If you’re using SBT, you can add it as follows:

libraryDependencies += "com.snowplowanalytics" %% "scala-analytics-sdk" % "0.2.0"

7. Getting help

If you have any questions or run into any problems, please raise an issue or get in touch with us through the usual channels.

And if there’s another Snowplow Analytics SDK that you’d like us to prioritize creating, please let us know in the forums!

Thoughts or questions? Come join us in our Discourse forum!

Anton Parkhomenko

Anton is a data engineer at Snowplow. You can find him on GitHub, Twitter and on his personal blog.