Snowplow RDB Loader R29 released
Please read on after the fold for:
1. PII Enrichment-related bug
The PII Enrichment can be configured to hash specific fields and properties in your events so that user-identifiable information is replaced with hashed values (which can still be used in data modeling).
In order to store these new hashed values in Redshift, we also modified our atomic events table definition by widening the maximum length of the CHAR columns that are commonly used to store personally identifiable information, specifically:
All these columns were widened to 128 characters so that they can store values produced by the most commonly used hash algorithms.
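To see why 128 characters is a comfortable width, the following minimal sketch measures the hex digest length of several widely used hash algorithms. The list of algorithms and the sample value are assumptions chosen for illustration, not a statement of what the PII Enrichment supports.

```python
import hashlib

# Width of the widened columns, as stated in this post.
COLUMN_WIDTH = 128

# Assumption: an illustrative set of common algorithms, not the enrichment's official list.
ALGORITHMS = ["md5", "sha1", "sha256", "sha384", "sha512"]


def hex_digest_length(algorithm: str) -> int:
    """Return the number of hex characters a hashlib algorithm produces."""
    return len(hashlib.new(algorithm, b"user@example.com").hexdigest())


lengths = {name: hex_digest_length(name) for name in ALGORITHMS}

for name, length in lengths.items():
    print(f"{name}: {length} chars, fits in column: {length <= COLUMN_WIDTH}")
```

The longest digest here, SHA-512, is exactly 128 hex characters, so it fills a widened column completely.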
Unfortunately, during the R100 release we missed the fact that RDB Shredder, which prepares events for loading into Redshift and Postgres, also performs a truncation on various atomic event columns; without the required corresponding update, RDB Shredder continued to truncate the above columns to their previous lengths.
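The effect of this mismatch can be sketched as follows. This is not RDB Shredder's actual code: `truncate` is a hypothetical helper, and `OLD_LIMIT` is a made-up previous column length used only to show how a stale limit silently cuts a hashed value short.

```python
import hashlib

OLD_LIMIT = 50    # hypothetical pre-widening length, for illustration only
NEW_LIMIT = 128   # widened length stated in this post


def truncate(value: str, max_length: int) -> str:
    """Keep at most max_length characters, as a loader-side truncation step would."""
    return value[:max_length]


hashed = hashlib.sha256(b"user@example.com").hexdigest()  # 64 hex characters

stale = truncate(hashed, OLD_LIMIT)    # what a stale limit would produce
current = truncate(hashed, NEW_LIMIT)  # what the widened limit produces

print(len(hashed), len(stale), len(current))
print("hash preserved with stale limit:", stale == hashed)
print("hash preserved with new limit:", current == hashed)
```

A truncated hash is not an error at load time, which is why this kind of bug can go unnoticed: the column simply holds a shorter, different value.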
This bug mainly affected users of the PII Enrichment, but while investigating it we also noticed that two further Redshift columns were being excessively truncated:
However, we think that these two columns, even when excessively truncated, had sufficient capacity for all real-world use cases, so a negative impact is unlikely.
This bug resides only in RDB Shredder, so all enriched data remains valid and can therefore be re-processed by RDB Shredder and Loader.
In order to do that you need to:
- Identify all affected runs, that is, every run since the day the PII Enrichment was enabled
- Delete all affected runs from Redshift
- Delete all affected runs from
- Re-stage enriched data from
- Run EmrEtlRunner with
Important note about steps 5 and 6: archived folders cannot be staged all at once. They need to be staged and processed one by one, or their contents should be merged into one new folder.
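The run selection and the one-by-one processing described above can be sketched as follows. This is only an illustration under stated assumptions: it assumes run folders are named like `run=2018-06-01-03-00-00`, and `affected_runs`, `reprocess_one_by_one` and the `stage_and_process` callback are hypothetical names, standing in for whatever you use to stage a folder and run the pipeline.

```python
from datetime import datetime
from typing import Callable, Iterable, List

# Assumption: run folders follow the run=YYYY-MM-DD-hh-mm-ss naming pattern.
RUN_FORMAT = "run=%Y-%m-%d-%H-%M-%S"


def affected_runs(run_folders: Iterable[str], pii_enabled_at: datetime) -> List[str]:
    """Return run folders at or after the PII enablement time, oldest first."""
    dated = [(datetime.strptime(name, RUN_FORMAT), name) for name in run_folders]
    return [name for timestamp, name in sorted(dated) if timestamp >= pii_enabled_at]


def reprocess_one_by_one(runs: Iterable[str], stage_and_process: Callable[[str], None]) -> None:
    """Stage and process each archived run separately, never all at once."""
    for run in runs:
        stage_and_process(run)


# Example with made-up folder names and a placeholder callback.
folders = [
    "run=2018-06-02-03-00-00",
    "run=2018-05-01-03-00-00",
    "run=2018-06-01-03-00-00",
]
to_reprocess = affected_runs(folders, datetime(2018, 6, 1))
reprocess_one_by_one(to_reprocess, lambda run: print("would stage and process", run))
```

Processing the oldest run first keeps the reload order the same as the original load order, which makes it easier to verify each run in Redshift before moving on.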
If you are using RDB Loader to load events into Redshift or Postgres, you’ll need to update your EmrEtlRunner configuration to the following:
Upcoming Snowplow releases are unchanged:
- R106 Acropolis, further enhancing our recently-released GDPR-focused PII Enrichment for the realtime pipeline and fixing another related bug
- R107 [STR] New webhooks and enrichment, featuring Marketo and Vero webhook adapters from our partners at Snowflake Analytics, plus a new enrichment for detecting bots and spiders using data from the IAB
- R10x Vallei dei Templi, porting our streaming enrichment process to Google Cloud Dataflow, leveraging the Apache Beam APIs
5. Getting help
For more details on this release, please check out the release notes on GitHub.
If you have any questions or run into any problems, please visit our Discourse forum.