We are pleased to announce the 0.2.0 release of the Snowplow Scala Analytics SDK, a library providing tools to process and analyze Snowplow enriched events in Scala-compatible data processing frameworks such as Apache Spark, AWS Lambda, Apache Flink and Scalding, as well as other JVM-compatible data processing frameworks.
This release adds run manifest functionality, removes the Scalaz dependency and publishes the SDK artifacts to Maven Central, along with many other internal changes.
In the rest of this post we will cover:
- Run manifests
- Using the run manifest
- Important bug fixes
- Removing the Scalaz dependency
- Documentation
- Upgrading
- Getting help
1. Run manifests
This release provides tooling for populating and inspecting a Snowplow run manifest. A Snowplow run manifest is a lightweight and robust way to track the progress of your data modeling steps: it lets you check whether a particular “run folder” of enriched events has already been processed and can be safely skipped.
Run manifests were previously introduced in Snowplow Python Analytics SDK 0.2.0; the Scala API closely resembles the Python SDK’s version. Historically, the Snowplow pipeline has marked enriched data as processed by moving files between locations. File moves have several drawbacks:
- They are time-consuming
- They are network-intensive
- They are error-prone – a failure to move a file will cause the job to fail and require manual intervention
- They only support one use-case at a time – you can’t have two distinct jobs moving the same files at the same time
- They are inflexible – it’s impossible to add metadata about processing status
Although Snowplow itself continues to move files, we recommend that you use a run manifest for your own data processing jobs on Snowplow data.
In this case, we store our manifest in an AWS DynamoDB table, and we use it to keep track of which Snowplow runs our job has already processed.
2. Using the run manifest
The run manifest functionality resides in the new com.snowplowanalytics.snowplow.analytics.scalasdk.RunManifests module.
Here’s a short usage example:
import com.amazonaws.services.s3.AmazonS3ClientBuilder
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBAsyncClientBuilder
import com.amazonaws.auth.DefaultAWSCredentialsProviderChain
import com.snowplowanalytics.snowplow.analytics.scalasdk.RunManifests

val DynamodbRunManifestsTable = "snowplow-run-manifests"
val EnrichedEventsArchive = "s3://acme-snowplow-data/storage/enriched-archive/"

val s3Client = AmazonS3ClientBuilder.standard()
  .withCredentials(DefaultAWSCredentialsProviderChain.getInstance)
  .build()

val dynamodbClient = AmazonDynamoDBAsyncClientBuilder.standard()
  .withCredentials(DefaultAWSCredentialsProviderChain.getInstance)
  .build()

val runManifestsTable = RunManifests(dynamodbClient, DynamodbRunManifestsTable)

runManifestsTable.create()

val unprocessed = RunManifests.listRunIds(s3Client, EnrichedEventsArchive)
  .filterNot(runManifestsTable.contains)

unprocessed.foreach { runId =>
  process(runId)
  runManifestsTable.add(runId)
}
In the above example, we create two AWS service clients: one for S3 (to list job runs) and one for DynamoDB (to access our manifest). We then list all Snowplow runs under a particular S3 path, keeping only those which have not yet been processed and have not been archived to AWS Glacier (helping to avoid incurring AWS costs for restoring data). Finally, we process each remaining run with the user-provided process function. Note that runId is just a simple string containing the S3 key of a particular job run.
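For illustration, here is a minimal sketch of what such a process function could look like in a Spark job. The SparkSession setup, the bucket layout and the output path below are assumptions for the example, not part of the SDK:

import org.apache.spark.sql.SparkSession
import com.snowplowanalytics.snowplow.analytics.scalasdk.json.EventTransformer

val spark = SparkSession.builder().appName("acme-data-model").getOrCreate()

def process(runId: String): Unit = {
  // runId is the S3 key of the run folder; how it maps to a full path depends
  // on your bucket layout, here we simply append it to the bucket name
  val runPath = "s3://acme-snowplow-data/" + runId

  spark.sparkContext
    .textFile(runPath)                               // enriched TSV lines for this run
    .map(line => EventTransformer.transform(line))   // TSV line => Either of failure / JSON string
    .flatMap(_.right.toOption)                       // keep only successfully transformed events
    .saveAsTextFile("s3://acme-snowplow-data/model-output/" + runId)
}

Because runManifestsTable.add(runId) in the example above is called only after process returns, a run that fails part-way through will simply be picked up again on the next execution.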
The RunManifests class, then, is a simple API wrapper around DynamoDB, which lets you:
- create a DynamoDB table for manifests
- add a Snowplow run to the table
- check if the table contains a given run ID
3. Important bug fixes
Version 0.2.0 also includes some important bug fixes:
- Schema names containing hyphens are no longer considered invalid (#22)
- An event with an empty array of custom contexts is no longer considered invalid (#27)
- We no longer add unnecessary nulls to the generated JSON: an empty unstruct_event or contexts is now omitted (#11)
4. Removing the Scalaz dependency
In our initial release, we used the Validation datatype from the Scalaz library to represent event transformation failures.
This approach is used inside the Snowplow pipeline and serves us quite well there, but in this SDK it caused problems: it transitively brought in extra dependencies, broke jobs compiled against Scala 2.11, and confused analysts unfamiliar with the ways of Functional Programming.
To solve this we have replaced Validation with Either – a similar data type from the Scala standard library. This is the only breaking change in this release, and you can still filter an RDD down to only the successfully transformed JSONs in a familiar way:
val events = input
  .map(line => EventTransformer.transform(line))
  .filter(_.isRight)
  .flatMap(_.right.toOption)
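If you also want to keep the rows that failed to transform, for example to debug them later, you can split the results into successes and failures. A minimal sketch, reusing the input RDD from above and assuming the failure side of the Either carries the transformer’s error messages:

val results = input.map(line => EventTransformer.transform(line))

// Successfully transformed events, as JSON strings
val events = results.flatMap(_.right.toOption)

// Failures: whatever error information the transformer reported for each bad line,
// kept so it can be inspected or written to a bad-rows location
val failures = results.flatMap(_.left.toOption)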
5. Documentation
As we add more features to the SDK, it becomes harder to keep up-to-date documentation in the project’s README. In this release we have split the README into several wiki pages, each dedicated to a particular feature.
Check out the Scala Analytics SDK in the main Snowplow wiki.
6. Upgrading
As of this release, the Scala Analytics SDK is available on Maven Central. If you’re using SBT, you can add it to your project as follows:
libraryDependencies += "com.snowplowanalytics" %% "snowplow-scala-analytics-sdk" % "0.2.0"
7. Getting help
If you have any questions or run into any problems, please raise an issue or get in touch with us through the usual channels.
And if there’s another Snowplow Analytics SDK that you’d like us to prioritize creating, please let us know in the forums!