Snowplow Scala Analytics SDK 0.4.0 released

We are excited to announce the 0.4.0 release of the Snowplow Scala Analytics SDK, a library that provides tools to process and analyze Snowplow enriched events in Apache Spark, AWS Lambda, Apache Flink, Scalding, and other JVM-compatible data processing frameworks. This release reworks the JSON Event Transformer to use a new type-safe API, and introduces several other internal changes.

Read on below the fold for:

  1. Event API
  2. Using the type-safe API
  3. Additional changes
  4. Upgrading
  5. Getting help

1. Event API

Previously, the JSON Event Transformer – a module that takes a Snowplow enriched event and converts it into a JSON ready for further processing – returned Strings representing enriched events turned into JSON objects. While this was a non-opinionated and minimalistic approach, it involved a lot of extra post-processing: downstream code had to re-parse the JSON string to access individual fields, and every value arrived as a plain String that had to be converted to its proper type by hand.
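To illustrate the overhead, here is a hypothetical sketch of the pre-0.4.0 workflow (the re-parsing uses circe, and the variable names are ours):

import io.circe.parser.parse

// Hypothetical sketch: the old transformer returned a JSON String, so every
// field access required re-parsing and manual type conversion.
val jsonString: String = "{...}" // output of the old JSON Event Transformer
val eventId: Option[String] = for {
  json <- parse(jsonString).toOption // parse the String back into a JSON AST
  id <- json.hcursor.get[String]("event_id").toOption // extract one field
} yield id // still a String; turning it into a UUID is another manual step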

In 0.4.0, the EventTransformer API has been replaced by Event – a single type-safe case class containing all 132 fields of a canonical Snowplow event. All fields are automatically converted to appropriate non-String types where possible; for instance, the event_id column is represented as a UUID instance, while timestamps are converted into optional Instant values, eliminating the need for common string conversions. Contexts and self-describing events are also wrapped in self-describing data container types, allowing for advanced operations such as Iglu URI lookups.
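As a minimal sketch of that last point (assuming the contexts field exposes a list of iglu-core SelfDescribingData values, each carrying its SchemaKey), the Iglu URIs of an event's attached contexts can be listed like so:

import com.snowplowanalytics.snowplow.analytics.scalasdk.Event

// Sketch: list the Iglu URIs of an event's attached contexts.
def contextUris(event: Event): List[String] =
  event.contexts.data.map(_.schema.toSchemaUri)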

The case class comes with two primary functions, both demonstrated below: Event.parse, which reads a single enriched-event TSV line into an Event, and toJson, which converts an Event into a Circe JSON AST.
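As a quick sketch of parsing a single line (assuming Event.parse returns a cats Validated value, consistent with the .toOption call in the next section):

import cats.data.Validated
import com.snowplowanalytics.snowplow.analytics.scalasdk.Event

// Sketch: parse one enriched event; failures are captured in the
// Validated result rather than thrown.
val line: String = "..." // one tab-separated enriched event
Event.parse(line) match {
  case Validated.Valid(event) => println(event.event_id)
  case Validated.Invalid(errors) => System.err.println(s"Not an enriched event: $errors")
}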

2. Using the type-safe API

Since results of the Scala Analytics SDK are now instances of the Event case class, they need to be converted to JSON strings before being handed to JSON-oriented tooling. For instance, the following code can be used to load a series of events into a Spark DataFrame:

import com.snowplowanalytics.snowplow.analytics.scalasdk.Event

val events = input
  .map(line => Event.parse(line))            // parse each TSV line into an Event
  .flatMap(_.toOption)                       // drop lines that failed to parse
  .map(event => event.toJson(true).noSpaces) // render each Event as a compact JSON string

val dataframe = spark.read.json(events)

Here, event.toJson(true).noSpaces first converts the Event instance to a member of Circe’s Json AST using the toJson function with its lossy parameter set to true (meaning that contexts and self-describing event fields will be “flattened”), then converts the Json into a String using the noSpaces method, which renders the JSON as a compact string with no whitespace. (Alternatively, the spaces2 and spaces4 methods, or even a custom Circe printer, can be used for more human-readable output.)
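For illustration, a minimal sketch contrasting the two renderings (assuming an already-parsed event value):

import io.circe.Json

// Sketch: the same JSON rendered compactly and with indentation
val json: Json = event.toJson(true)
val compact: String = json.noSpaces // single line, no whitespace
val readable: String = json.spaces2 // indented, two spaces per level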

Working with individual members of an Event is now as simple as accessing a specific field of a case class. For example, the following code can be used to safely access the ID, fingerprint and ETL timestamp of an event, replacing the fingerprint with a random UUID if it doesn’t exist and throwing an exception if the timestamp is not set:

import java.util.UUID

val eventId = event.event_id.toString
val eventFingerprint = event.event_fingerprint.getOrElse(UUID.randomUUID().toString)
val etlTstamp = event.etl_tstamp.getOrElse(throw new RuntimeException(s"etl_tstamp in event $eventId is empty or missing"))

3. Additional changes

Version 0.4.0 also includes several changes to the SDK’s dependencies, most notably the move to Circe for JSON handling that underpins the toJson API shown above.

4. Upgrading

The Scala Analytics SDK is available on Maven Central. If you’re using SBT, you can add it to your project as follows:

libraryDependencies += "com.snowplowanalytics" %% "scala-analytics-sdk" % "0.4.0"

5. Getting help

For the most up-to-date documentation on the SDK, check out the Scala Analytics SDK page on the main Snowplow wiki. If you have any questions or run into any problems, please raise an issue or get in touch with us through the usual channels.

And if there’s another Snowplow Analytics SDK that you’d like us to prioritize creating, please let us know on Discourse!
