Snowplow Snowflake Loader 0.4.0 released

We are pleased to announce version 0.4.0 of the Snowplow Snowflake Loader! This release introduces optional event deduplication, brings significant performance improvements to the Snowflake Transformer, and includes several other updates and bug fixes.

Read on below the fold for:

  1. Deduplication
  2. S3 optimizations
  3. New configuration options
  4. Other changes
  5. Upgrading
  6. Getting help

1. Deduplication

It’s possible for two or more Snowplow events to share the same event ID, for example because a duplicate was introduced at one of the processing stages upstream of the data landing in Snowflake. Event duplicates can prove a challenge in any event pipeline – we have previously discussed this issue in detail in this blog post and on our Discourse forum.

To mitigate this issue, version 0.4.0 introduces two complementary mechanisms: in-batch deduplication, which removes duplicates that share an event ID and event fingerprint within a single transformed batch, and DynamoDB-powered cross-batch deduplication, which records previously seen events in a DynamoDB manifest table so that duplicates can be detected across separate runs.

More details on setting up deduplication in the Snowflake Transformer can be found in the project’s wiki.

Alongside this Snowflake Loader release, we have also released the first version of Snowplow Events Manifest. This standalone Scala library contains logic used for cross-batch natural deduplication of Snowplow events, and will be responsible for deduplication in our AWS-based pipelines.
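
To make the cross-batch approach concrete, below is a minimal sketch of the underlying idea, assuming a DynamoDB manifest table keyed by event ID and fingerprint: a conditional write either records the event or fails, and a failed condition means the event was already seen by an earlier run. This is an illustration only, not the actual API of the Events Manifest library – the table layout, attribute names and the seenBefore helper are assumptions made for the example.

import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder
import com.amazonaws.services.dynamodbv2.model.{AttributeValue, ConditionalCheckFailedException, PutItemRequest}
import scala.collection.JavaConverters._

object DedupSketch {
  private val client = AmazonDynamoDBClientBuilder.defaultClient()

  // Hypothetical helper: returns true if this (eventId, fingerprint) pair
  // was already recorded by a previous run, false if the event is new.
  def seenBefore(table: String, eventId: String, fingerprint: String, etlTstamp: Long): Boolean = {
    val request = new PutItemRequest()
      .withTableName(table)
      .withItem(Map(
        "eventId"     -> new AttributeValue().withS(eventId),
        "fingerprint" -> new AttributeValue().withS(fingerprint),
        "etlTstamp"   -> new AttributeValue().withN(etlTstamp.toString)
      ).asJava)
      // The write succeeds only if no such event exists yet, or if it was
      // written by this same ETL run (which keeps retried runs idempotent).
      .withConditionExpression("attribute_not_exists(eventId) OR etlTstamp = :etl")
      .withExpressionAttributeValues(
        Map(":etl" -> new AttributeValue().withN(etlTstamp.toString)).asJava)
    try { client.putItem(request); false }
    catch { case _: ConditionalCheckFailedException => true }
  }
}

In a pipeline built on this idea, an event failing such a check would simply be dropped from the batch before it is written out.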

2. S3 optimizations

This update introduces a significant performance improvement by eliminating an S3-based bottleneck in the Snowflake Transformer. In previous versions, when the Transformer wrote files from Spark, the output was first written to a temporary destination and then renamed once the job succeeded. However, because S3 is an object store with no native rename operation, "renaming" a file really means copying the data and deleting the original – a very expensive process. This was a notable bottleneck in the Transformer, making it incapable of processing large volumes of data (over 1TB per day) even with a large EMR cluster.

Version 0.4.0 solves this problem by using a custom staging committer, which accelerates writing to S3 from Spark by writing task outputs to a temporary directory on the local filesystem rather than S3, so the expensive rename step is avoided entirely. Use of this committer is optional but highly recommended – an example of a Dataflow Runner config with the optimization enabled can be found in the Setup Guide, and the sketch below illustrates the general approach.
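
For illustration, the snippet below shows how a Spark application can opt into the stock Hadoop S3A staging ("directory") committer, which uses the same write-locally-then-upload strategy. This is a rough sketch rather than the Transformer's exact configuration – the property values here are assumptions, and the authoritative, tested settings live in the Setup Guide's Dataflow Runner example.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("snowflake-transformer")
  // Route s3a:// output through the S3A committer factory instead of the
  // default rename-based FileOutputCommitter.
  .config("spark.hadoop.mapreduce.outputcommitter.factory.scheme.s3a",
    "org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory")
  // "directory" selects the staging committer: task output is buffered on
  // the local filesystem and uploaded to S3 only when the task commits.
  .config("spark.hadoop.fs.s3a.committer.name", "directory")
  // Local filesystem directory used to buffer the staged task output.
  .config("spark.hadoop.fs.s3a.committer.staging.tmp.path", "/tmp/staging")
  .getOrCreate()

The key design point is that all intermediate state stays off S3, so the only S3 operations are the final uploads.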

3. New configuration options

Version 0.4.0 also adds two new configuration options to the Snowflake Loader config; both are documented in the project's wiki.

4. Other changes

This release also brings a number of smaller updates and bug fixes; the full list can be found in the release notes linked below.

5. Upgrading

To make use of the new versions of the Snowflake Transformer and Loader, you will need to update your Dataflow Runner configurations to use the following jar files:

s3://snowplow-hosted-assets/4-storage/snowflake-loader/snowplow-snowflake-transformer-0.4.0.jar
s3://snowplow-hosted-assets/4-storage/snowflake-loader/snowplow-snowflake-loader-0.4.0.jar

Because several columns in atomic.events have been widened to support pseudonymization and MaxMind changes, the table schema in Snowflake needs to be migrated. To update the relevant column definitions automatically, use the new migrate command, specifying the loader version you are migrating to:

java -jar snowplow-snowflake-loader-0.4.0.jar migrate --loader-version 0.4.0

6. Getting help

For more details on this release, check out the 0.4.0 release notes on GitHub.

If you have any questions or run into any problems, please visit our Discourse forum.
