Snowplow Scala Analytics SDK 0.4.0 released

13 February 2019  •  Rostyslav Zatserkovnyi

We are excited to announce the 0.4.0 release of the Snowplow Scala Analytics SDK, a library that provides tools to process and analyze Snowplow enriched events in Apache Spark, AWS Lambda, Apache Flink, Scalding, and other JVM-compatible data processing frameworks. This release reworks the JSON Event Transformer to use a new type-safe API, and introduces several other internal changes.

Read on below the fold for:

  1. Event API
  2. Using the typesafe API
  3. Additional changes
  4. Upgrading
  5. Getting help

1. Event API

Previously, the JSON Event Transformer - a module that takes a Snowplow enriched event and converts it into JSON ready for further processing - returned Strings representing enriched events as JSON objects. While this was an unopinionated and minimalistic approach, it required a lot of extra post-processing, namely:

  • Accessing individual JSON fields required parsing the result into the json4s AST via unsafe functions such as parse(result).
  • Accessing fields that are always present still required redundant error-handling logic, e.g. parsedJson.map("event_id").getOrElse(throw new RuntimeException("event_id is not present in the enriched event")).
  • Getting a list of shredded types required using an additional, separate function, jsonifyWithInventory.

In 0.4.0, the EventTransformer API has been replaced by Event - a single typesafe container that contains all 132 members of a canonical Snowplow event. All fields are automatically converted to appropriate non-String types where possible; for instance, the event_id column is represented as a UUID instance, while timestamps are converted into optional Instant values, eliminating the need for common string conversions. Contexts and self-describing events are also wrapped in self-describing data container types, allowing for advanced operations such as Iglu URI lookups.
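
To illustrate what typed fields make possible, here is a minimal sketch built around a hypothetical, heavily reduced stand-in for Event. MiniEvent and etlDelay are not part of the SDK; only the field names and types are meant to mirror the ones described above. Once the values are already a UUID and optional Instants, downstream code can do date arithmetic without any string parsing:

```scala
import java.time.{Duration, Instant}
import java.util.UUID

// Hypothetical three-field stand-in for the SDK's Event case class.
final case class MiniEvent(
  event_id: UUID,
  dvce_created_tstamp: Option[Instant],
  etl_tstamp: Option[Instant]
)

object TypedFieldsSketch {
  // Time between creation on the device and enrichment.
  // Returns None if either timestamp is missing.
  def etlDelay(event: MiniEvent): Option[Duration] =
    for {
      created  <- event.dvce_created_tstamp
      enriched <- event.etl_tstamp
    } yield Duration.between(created, enriched)

  // Made-up sample values, for illustration only.
  val sample: MiniEvent = MiniEvent(
    event_id = UUID.fromString("c6ef3124-b53a-4b13-a233-0088f79dcbcb"),
    dvce_created_tstamp = Some(Instant.parse("2019-02-13T10:00:00Z")),
    etl_tstamp = Some(Instant.parse("2019-02-13T10:00:05Z"))
  )
}
```

With String fields, the same calculation would need two parse steps and their error handling at every call site.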

The case class has the following primary functions:

  • Event.parse(line) - similar to the old transform function, this method accepts an enriched Snowplow event in a canonical TSV+JSON format as a string and returns an Event instance as a result.
  • event.toJson(lossy) - similar to the old getValidatedJsonEvent function, it transforms an Event into a validated JSON whose keys are the field names corresponding to the EnrichedEvent POJO of the Scala Common Enrich project. If the lossy argument is true, any self-describing events in the fields (unstruct_event, contexts, and derived_contexts) are returned in a “shredded” format, e.g. "unstruct_event_com_acme_1_myField": "value". If it is set to false, they use a standard self-describing format instead of being flattened into underscore-separated top-level fields.
  • event.inventory - extracts metadata from the event containing information about the types and Iglu URIs of its shred properties (unstruct_event, contexts and derived_contexts). Unlike version 0.3.0, it no longer requires a transformWithInventory call and can be obtained from any Event instance.
  • event.atomic - returns the event as a map of keys to Circe JSON values, while dropping inventory fields. This method can be used to modify an event’s JSON AST before converting it into a final result.
  • event.ordered - returns the event as a list of key/Circe JSON value pairs. Unlike atomic, whose key ordering is not guaranteed, this method returns the keys in the order of the canonical event model, and is particularly useful for working with relational databases.
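
The difference between atomic and ordered matters whenever a consumer depends on column position. The sketch below deliberately uses plain Scala collections rather than the SDK: the field values are made up, plain strings stand in for Circe JSON values, and toTsvRow is a hypothetical helper. It only aims to show why a list of pairs, unlike a Map, is safe to turn into a positional row:

```scala
object OrderedFieldsSketch {
  // Stand-in for the output of `ordered`: pairs in canonical column order.
  val orderedFields: List[(String, String)] = List(
    "app_id"   -> "shop",
    "platform" -> "web",
    "event"    -> "page_view"
  )

  // Stand-in for the output of `atomic`: a Map gives lookup by key,
  // but its iteration order should not be relied upon.
  val byKey: Map[String, String] = orderedFields.toMap

  // Hypothetical helper: renders the values as one tab-separated row,
  // preserving the order of the input list.
  def toTsvRow(fields: List[(String, String)]): String =
    fields.map { case (_, value) => value }.mkString("\t")
}
```

A map is the natural choice for looking up or replacing a single field by name, while the list form is the one to use when the output has to line up with a fixed table schema.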

2. Using the typesafe API

Since the Scala Analytics SDK now returns instances of the Event case class rather than strings, they need to be converted to JSON strings explicitly wherever a string is required. For instance, the following code can be used in an AWS Lambda to load a series of events into a Spark dataframe:

import com.snowplowanalytics.snowplow.analytics.scalasdk.Event

val events = input
  .map(line => Event.parse(line))
  .flatMap(_.toOption)
  .map(event => event.toJson(true).noSpaces)

val dataframe = spark.read.json(events)

Here, event.toJson(true).noSpaces first converts each Event instance into Circe’s Json AST using the toJson function with its lossy parameter set to true (meaning that contexts and self-describing event fields will be “flattened”), then renders the Json as a compact string with no whitespace using the noSpaces method. (Alternatively, the spaces2 and spaces4 methods, or even a custom Circe printer, can be used for more human-readable output.)
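
As a small illustration of the printing step on its own, the sketch below builds a Json value by hand instead of obtaining it from an Event, so it only needs circe-core on the classpath. The field values are made up, and PrinterSketch is just a name chosen for this example:

```scala
import io.circe.Json

object PrinterSketch {
  // A tiny hand-built Json value standing in for the result of event.toJson(true).
  val json: Json = Json.obj(
    "app_id" -> Json.fromString("shop"),
    "event"  -> Json.fromString("page_view")
  )

  // Single-line rendering with no whitespace, one JSON document per string.
  val compact: String = json.noSpaces

  // Multi-line rendering indented with two spaces, easier to read in logs.
  val pretty: String = json.spaces2
}
```

The compact form is the one to feed into a line-oriented JSON reader, since each record then occupies exactly one line; the indented form is better kept for debugging.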

Working with individual members of an Event is now as simple as accessing a specific field of a case class. For example, the following code can be used to safely access the ID, fingerprint and ETL timestamp of an event, replacing the fingerprint with a random UUID if it doesn’t exist and throwing an exception if the timestamp is not set:

import java.util.UUID

val eventId = event.event_id.toString
val eventFingerprint = event.event_fingerprint.getOrElse(UUID.randomUUID().toString)
val etlTstamp = event.etl_tstamp.getOrElse(throw new RuntimeException(s"etl_tstamp in event $eventId is empty or missing"))

3. Additional changes

Version 0.4.0 also includes several changes to the SDK’s dependencies:

  • The json4s AST has been removed in favor of circe, a JSON library based on Cats.
  • Scala 2.12 has been updated to 2.12.8.
  • The AWS SDK has been updated to 1.11.490.

4. Upgrading

The Scala Analytics SDK is available for download at Maven Central. If you’re using SBT, you can add it to your project as follows:

libraryDependencies += "com.snowplowanalytics" %% "scala-analytics-sdk" % "0.4.0"

5. Getting help

For up-to-date documentation about the SDK, check out the Scala Analytics SDK page on the main Snowplow wiki. If you have any questions or run into any problems, please raise an issue or get in touch with us through the usual channels.

And if there’s another Snowplow Analytics SDK that you’d like us to prioritize creating, please let us know on Discourse!