We are excited to announce the 0.4.0 release of the Snowplow Scala Analytics SDK, a library that provides tools to process and analyze Snowplow enriched events in Apache Spark, AWS Lambda, Apache Flink, Scalding, and other JVM-compatible data processing frameworks. This release reworks the JSON Event Transformer to use a new type-safe API, and introduces several other internal changes.
1. Event API
Previously, the JSON Event Transformer, a module that takes a Snowplow enriched event and converts it into JSON ready for further processing, returned Strings representing enriched events as JSON objects. While this was a non-opinionated and minimalistic approach, it involved a lot of extra post-processing, namely:
- Accessing individual JSON fields required casting the result to an instance of the json4s AST class via unsafe functions.
- Accessing fields that always exist still required redundant error-handling logic, e.g. parsedJson.map("event_id").getOrElse(throw new RuntimeException("event_id is not present in the enriched event")).
- Getting a list of shredded types required an additional, separate function call.
In 0.4.0, the EventTransformer API has been replaced by Event – a single typesafe container that contains all 132 members of a canonical Snowplow event. All fields are automatically converted to appropriate non-String types where possible; for instance, the event_id column is represented as a UUID instance, while timestamps are converted into optional Instant values, eliminating the need for common string conversions. Contexts and self-describing events are also wrapped in self-describing data container types, allowing for advanced operations such as Iglu URI lookups.
The case class has the following primary functions:
- Event.parse(line) – similar to the old transform function, this method accepts an enriched Snowplow event in the canonical TSV+JSON format as a string and returns an Event instance as a result.
- event.toJson(lossy) – similar to the old getValidatedJsonEvent function, it transforms an Event into a validated JSON whose keys are the field names corresponding to the EnrichedEvent POJO of the Scala Common Enrich project. If the lossy argument is true, any self-describing events in the fields (unstruct_event, contexts, and derived_contexts) are returned in a “shredded” format, e.g. "unstruct_event_com_acme_1_myField": "value". If it is set to false, they use the standard self-describing format instead of being flattened into underscore-separated top-level fields.
- event.inventory – extracts metadata from the event containing information about the types and Iglu URIs of its shred properties (unstruct_event, contexts and derived_contexts). Unlike version 0.3.0, it no longer requires a transformWithInventory call and can be obtained from any Event instance.
- atomic – returns the event as a map of keys to Circe JSON values, while dropping inventory fields. This method can be used to modify an event’s JSON AST before converting it into a final result.
- ordered – returns the event as a list of key/Circe JSON value pairs. Unlike atomic, which has randomized key ordering, this method returns the keys in the order of the canonical event model, and is particularly useful for working with relational databases.
2. Using the typesafe API
Because the Scala Analytics SDK now returns instances of the Event case class rather than JSON strings, its output needs to be converted to JSON strings explicitly. For instance, the following code can be used in an AWS Lambda to load a series of events into a Spark dataframe:
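A minimal sketch of such a job is shown here, under a few assumptions that are not confirmed by this post: that Event lives in the com.snowplowanalytics.snowplow.analytics.scalasdk package, that Event.parse returns a Cats Validated value (so that toOption discards lines that fail to parse), and that the Spark version in use supports reading JSON from a Dataset[String]. The LoadEnrichedEvents object and its method names are hypothetical.

```scala
import com.snowplowanalytics.snowplow.analytics.scalasdk.Event
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

// Hypothetical helper object; the names are illustrative, not part of the SDK.
object LoadEnrichedEvents {

  // Parse one enriched TSV line and render it as compact, "lossy" (flattened) JSON.
  // Assumption: Event.parse returns a Validated, so toOption yields None on failure.
  def toJsonLine(line: String): Option[String] =
    Event.parse(line).toOption.map(event => event.toJson(true).noSpaces)

  // Turn a Dataset of enriched TSV lines into a DataFrame, silently dropping bad lines.
  // Assumption: Spark 2.2 or later, where DataFrameReader.json accepts a Dataset[String].
  def load(spark: SparkSession, input: Dataset[String]): DataFrame = {
    import spark.implicits._
    val jsons: Dataset[String] = input.flatMap(line => toJsonLine(line).toList)
    spark.read.json(jsons)
  }
}
```

Dropping unparseable lines keeps the sketch short; a production job would more likely route the parse errors to a separate output instead of discarding them.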
event.toJson(true).noSpaces first converts the Event instance to a member of Circe’s Json AST class using the toJson function with its lossy parameter set to true (meaning that contexts and self-describing event fields will be “flattened”), then converts the Json into a string using the noSpaces method, which prints the JSON as a compact string with no spaces. (Alternatively, Circe’s spaces2 or spaces4 printers, or even a custom Circe printer, can be used for a more human-readable output.)
Working with individual members of an Event is now as simple as accessing a specific field of a case class. For example, the following code can be used to safely access the ID, fingerprint and ETL timestamp of an event, replacing the fingerprint with a random UUID if it doesn’t exist and throwing an exception if the timestamp is not set:
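The access pattern described above can be sketched without the SDK on the classpath. SketchEvent below is a hypothetical stand-in for the real Event class, reduced to three fields; the field names and types (a UUID for event_id, optional values for the fingerprint and ETL timestamp) are assumptions based on the description in this post.

```scala
import java.time.Instant
import java.util.UUID

// Hypothetical stand-in for the SDK's Event, limited to the three fields discussed here.
final case class SketchEvent(
  event_id: UUID,
  event_fingerprint: Option[String],
  etl_tstamp: Option[Instant]
)

object FieldAccess {
  // Returns (event ID, fingerprint, ETL timestamp) for a single event.
  def summarize(event: SketchEvent): (String, String, Instant) = {
    // event_id always exists, so no error handling is needed.
    val eventId = event.event_id.toString
    // Fall back to a random UUID when the fingerprint is missing.
    val fingerprint = event.event_fingerprint.getOrElse(UUID.randomUUID().toString)
    // Treat a missing ETL timestamp as a hard error.
    val etlTstamp = event.etl_tstamp.getOrElse(
      throw new RuntimeException(s"etl_tstamp in event $eventId is empty or missing")
    )
    (eventId, fingerprint, etlTstamp)
  }
}
```

With the real SDK, the same three expressions would be applied to a parsed Event instead of the stand-in.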
3. Additional changes
Version 0.4.0 also includes several changes to the SDK’s dependencies:
- The json4s AST has been removed in favor of circe, a JSON library based on Cats.
- Scala 2.12 has been updated to 2.12.8.
- The AWS SDK has been updated to 1.11.490.
The Scala Analytics SDK is available for download at Maven Central. If you’re using SBT, you can add it to your project as follows:
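A minimal build.sbt line for this, assuming the SDK is published under the com.snowplowanalytics group with the artifact name snowplow-scala-analytics-sdk (worth confirming against Maven Central before use):

```scala
// build.sbt: the %% operator appends the project's Scala binary version to the artifact name.
libraryDependencies += "com.snowplowanalytics" %% "snowplow-scala-analytics-sdk" % "0.4.0"
```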
5. Getting help
For more up-to-date documentation about the SDK, check out the Scala Analytics SDK on the main Snowplow wiki. If you have any questions or run into any problems, please raise an issue or get in touch with us through the usual channels.
And if there’s another Snowplow Analytics SDK that you’d like us to prioritize creating, please let us know on Discourse!