We are pleased to announce version 0.4.0 of the Snowplow Snowflake Loader! This release introduces optional event deduplication, brings significant performance improvements to the Snowflake Transformer, and includes several other updates and bug fixes.
Read on below the fold for:
It’s possible for two or more Snowplow events to have the same event ID, for example because a duplicate has been introduced at one of the stages in the data processing upstream of the data landing in Snowflake DB. Event duplicates can prove a challenge in any event pipeline – we have previously discussed this issue in detail in this blog post and on our Discourse forum.
To mitigate this issue, version 0.4.0 introduces both in-batch deduplication and DynamoDB-powered cross-batch deduplication:
- In-batch deduplication groups events with the same
event_fingerprintin a single batch.
- Cross-batch deduplication works by extracting the ID and fingerprint of an event, as well as
etl_tstampwhich identifies a single batch, then storing these properties in a DynamoDB table. Duplicate events with the same ID and fingerprint that were seen in previous batches are silently dropped from the Snowflake Transformer output. (Note that this feature is experimental as of version 0.4.0 and may not be 100% foolproof.)
More details on setting up deduplication in the Snowflake Transformer can be found in the project’s wiki.
Alongside this Snowflake Loader release, we have also released the first version of Snowplow Events Manifest. This standalone Scala library contains logic used for cross-batch natural deduplication of Snowplow events, and will be responsible for deduplication in our AWS-based pipelines.
2. S3 optimizations
This update introduces a significant performance improvement eliminating an S3-based bottleneck in the Snowflake Transformer. In previous versions, when the Transformer wrote files from Spark, the output was initially stored to a temporary destination and then renamed when the job has succeeded. However, because S3 is an object store, renaming files is a very expensive operation that requires a complete rewrite. This process was a notable bottleneck in the Transformer, making it incapable of processing large volumes of data (over 1TB/day) even with a large EMR cluster.
Version 0.4.0 solves this problem by using a custom staging committer, which accelerates writing to S3 from Spark by writing task outputs to a temporary directory on the local filesystem rather than S3. This represents a significant performance improvement by avoiding expensive S3 renaming operations. While usage of this committer is optional, it is highly recommended – an example of a Dataflow Runner config with the optimization enabled can be found in the Setup Guide.
3. New configuration options
Version 0.4.0 introduces two new configuration options to the Snowflake Loader config:
maxError– An optional setting used when writing to Snowflake – a table copy statement will skip an input file when the number of errors in it exceeds the specified number. This can be used to process runs with a certain number of expected bad rows without immediately failing, but should usually be set to 0.
jdbcHost– An optional host for the JDBC driver that has priority over automatically derived hosts. This is useful for targets such as Snowflake on Azure with different conventions for deriving hosts.
4. Other changes
This release introduces plenty of other updates:
- Newly created Snowflake warehouses are now created with
AUTO_SUSPENDset to 5 mins and
TRUE, and will be suspended after inactivity to prevent incurring unnecessary costs. (These settings can be manually modified in the Snowflake console if necessary.)
- Text strings are now automatically truncated to their target lengths instead of producing an error if a loaded string exceeds the target length.
- Sensitive columns in
atomic.eventshave been widened to support the PII Pseudonymization Enrichment, and the
geo_regioncolumn has been bumped to 3 characters due to changes in MaxMind regional codes. Thanks to community member miike for contributing the latter feature!
- A DynamoDB client will now attempt recovery if its temporary session credentials have expired, preventing critical run failures when a job takes more than 6 hours.
- The Snowflake Loader will now correctly fail if no data is located in the specified staging folder, preventing “blank loads” for complex recovery/historical load scenarios.
To make use of the new versions of the Snowflake Transformer and Loader, you will need to update your Dataflow Runner configurations to use the following jar files:
Due to several columns in atomic.events being widened to support pseudonymization and MaxMind changes, the table schema on Snowflake will need to be migrated. In order to automatically update the relevant column definitions in Snowflake, use the new
migrate command and specify the loader’s version:
java -jar snowplow-snowflake-loader-0.4.0.jar migrate --loader-version 0.4.0
6. Getting help
For more details on this release, check out the 0.4.0 release notes on GitHub.
If you have any questions or run into any problems, please visit our Discourse forum.