Snowplow Snowflake Loader 0.4.0 released

17 January 2019  •  Rostyslav Zatserkovnyi

We are pleased to announce version 0.4.0 of the Snowplow Snowflake Loader! This release introduces optional event deduplication, brings significant performance improvements to the Snowflake Transformer, and includes several other updates and bug fixes.

Read on below the fold for:

  1. Deduplication
  2. S3 optimizations
  3. New configuration options
  4. Other changes
  5. Upgrading
  6. Getting help

1. Deduplication

It’s possible for two or more Snowplow events to have the same event ID, for example because a duplicate has been introduced at one of the stages in the data processing upstream of the data landing in Snowflake DB. Event duplicates can prove a challenge in any event pipeline - we have previously discussed this issue in detail in this blog post and on our Discourse forum.

To mitigate this issue, version 0.4.0 introduces both in-batch deduplication and DynamoDB-powered cross-batch deduplication:

  • In-batch deduplication groups events with the same event_id and event_fingerprint in a single batch.

  • Cross-batch deduplication works by extracting the ID and fingerprint of an event, as well as etl_tstamp which identifies a single batch, then storing these properties in a DynamoDB table. Duplicate events with the same ID and fingerprint that were seen in previous batches are silently dropped from the Snowflake Transformer output. (Note that this feature is experimental as of version 0.4.0 and may not be 100% foolproof.)

More details on setting up deduplication in the Snowflake Transformer can be found in the project’s wiki.

Alongside this Snowflake Loader release, we have also released the first version of Snowplow Events Manifest. This standalone Scala library contains logic used for cross-batch natural deduplication of Snowplow events, and will be responsible for deduplication in our AWS-based pipelines.

2. S3 optimizations

This update introduces a significant performance improvement eliminating an S3-based bottleneck in the Snowflake Transformer. In previous versions, when the Transformer wrote files from Spark, the output was initially stored to a temporary destination and then renamed when the job has succeeded. However, because S3 is an object store, renaming files is a very expensive operation that requires a complete rewrite. This process was a notable bottleneck in the Transformer, making it incapable of processing large volumes of data (over 1TB/day) even with a large EMR cluster.

Version 0.4.0 solves this problem by using a custom staging committer, which accelerates writing to S3 from Spark by writing task outputs to a temporary directory on the local filesystem rather than S3. This represents a significant performance improvement by avoiding expensive S3 renaming operations. While usage of this committer is optional, it is highly recommended - an example of a Dataflow Runner config with the optimization enabled can be found in the Setup Guide.

3. New configuration options

Version 0.4.0 introduces two new configuration options to the Snowflake Loader config:

  • maxError - An optional setting used when writing to Snowflake - a table copy statement will skip an input file when the number of errors in it exceeds the specified number. This can be used to process runs with a certain number of expected bad rows without immediately failing, but should usually be set to 0.
  • jdbcHost - An optional host for the JDBC driver that has priority over automatically derived hosts. This is useful for targets such as Snowflake on Azure with different conventions for deriving hosts.

4. Other changes

This release introduces plenty of other updates:

  • Newly created Snowflake warehouses are now created with AUTO_SUSPEND set to 5 mins and AUTO_RESUME set to TRUE, and will be suspended after inactivity to prevent incurring unnecessary costs. (These settings can be manually modified in the Snowflake console if necessary.)
  • Text strings are now automatically truncated to their target lengths instead of producing an error if a loaded string exceeds the target length.
  • Sensitive columns in atomic.events have been widened to support the PII Pseudonymization Enrichment, and the geo_region column has been bumped to 3 characters due to changes in MaxMind regional codes. Thanks to community member miike for contributing the latter feature!
  • A DynamoDB client will now attempt recovery if its temporary session credentials have expired, preventing critical run failures when a job takes more than 6 hours.
  • The Snowflake Loader will now correctly fail if no data is located in the specified staging folder, preventing “blank loads” for complex recovery/historical load scenarios.

5. Upgrading

To make use of the new versions of the Snowflake Transformer and Loader, you will need to update your Dataflow Runner configurations to use the following jar files:

s3://snowplow-hosted-assets/4-storage/snowflake-loader/snowplow-snowflake-transformer-0.4.0.jar
s3://snowplow-hosted-assets/4-storage/snowflake-loader/snowplow-snowflake-loader-0.4.0.jar

Due to several columns in atomic.events being widened to support pseudonymization and MaxMind changes, the table schema on Snowflake will need to be migrated. In order to automatically update the relevant column definitions in Snowflake, use the new migrate command and specify the loader’s version:

java -jar snowplow-snowflake-loader-0.4.0.jar migrate --loader-version 0.4.0

6. Getting help

For more details on this release, check out the 0.4.0 release notes on GitHub.

If you have any questions or run into any problems, please visit our Discourse forum.