Snowplow RDB Loader R29 released
We are pleased to announce the release of Snowplow RDB Loader R29, fixing an important bug relating to the PII Enrichment introduced in R100 Epidaurus.
Please read on after the fold for:
1. PII Enrichment-related bug
In R100 we introduced a new enrichment for pseudonymizing personally identifiable information to help our users to comply with GDPR.
The PII Enrichment can be configured to hash specific fields and properties in your events so that user-identifiable information is replaced with hashed values (which can still be used in data modeling).
In order to store these new hashed values in Redshift, we also modified our atomic events table definition by increasing the maximum length of the VARCHAR and CHAR columns that are commonly used to store personally identifiable information, specifically:
user_ipaddress
user_fingerprint
domain_userid
network_userid
ip_organization
ip_domain
refr_domain_userid
domain_sessionid
All these columns were widened to 128 characters so that they can store values produced by the most commonly used hash algorithms.
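As a rough check on the 128-character figure, the sketch below prints the length of hex-encoded digests for several algorithms available in Python's standard `hashlib`. It assumes the hashed values are stored as hex strings; the sample value is made up for illustration.

```python
import hashlib

# Hypothetical sample value; any input yields a digest of the same length.
sample = b"203.0.113.7"

# Length in characters of the hex-encoded digest for each algorithm.
digest_lengths = {
    name: len(hashlib.new(name, sample).hexdigest())
    for name in ("md5", "sha1", "sha256", "sha384", "sha512")
}

for name, length in digest_lengths.items():
    print(f"{name}: {length} characters")
```

Under that assumption, the longest of these (SHA-512) needs exactly 128 characters, so all five fit in the widened columns.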
Unfortunately, during the R100 release we missed the fact that RDB Shredder, which prepares events for loading into Redshift and Postgres, also performs a truncation on various atomic event columns; without the required corresponding update, RDB Shredder continued to truncate the above columns to their previous lengths.
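The effect of a stale truncation step can be sketched as follows. This is not RDB Shredder's actual code: the `truncate_columns` helper and the width of 36 are hypothetical, chosen only to show what happens when a hashed value meets a limit that is shorter than the column it is loaded into.

```python
import hashlib

# Hypothetical widths for illustration only; not the real table definition.
STALE_WIDTHS = {"domain_userid": 36}
WIDENED_WIDTHS = {"domain_userid": 128}


def truncate_columns(event, widths):
    """Cut each string value down to the width configured for its column."""
    return {
        column: value[: widths[column]] if column in widths else value
        for column, value in event.items()
    }


hashed = hashlib.sha256(b"example-user-id").hexdigest()  # 64 hex characters
event = {"domain_userid": hashed}

stale = truncate_columns(event, STALE_WIDTHS)
widened = truncate_columns(event, WIDENED_WIDTHS)

print(len(stale["domain_userid"]))         # 36: the hash is cut short
print(widened["domain_userid"] == hashed)  # True: the hash survives intact
```

A truncated hash no longer matches the full hash of the same input, which is why the affected runs need to be re-processed rather than patched in place.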
This bug mainly affected users of the PII Enrichment, but while investigating it we also noticed that two further Redshift columns were being excessively truncated:
os_timezone
se_label
However, we think that these two columns, even when excessively truncated, most likely had sufficient capacity for all real-world use cases so there was likely no negative impact on these two columns.
2. Recovery
This bug resides only in RDB Shredder, so all enriched data remains valid and can therefore be re-processed by RDB Shredder and Loader.
In order to do that you need to:
- Identify all affected runs, that is, every run since the day the PII Enrichment was enabled
- Delete all affected runs from Redshift
- Upgrade rdb_shredder in your config.yml
- Delete all affected runs from shredded.archive
- Re-stage enriched data from enriched.archive to enriched.good
- Run EmrEtlRunner with the --resume-from shred option
Important note about the last two steps (steps 5 and 6): archived folders cannot be staged all at once. They need to be staged and processed one by one, or their contents should be merged into one new folder.
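Identifying the affected runs and ordering them for one-by-one processing can be sketched as below. This is only an illustration: it assumes archive folders are named `run=YYYY-MM-DD-hh-mm-ss`, and the `affected_runs` helper, the folder names and the enablement date are all hypothetical. It does not touch S3 or Redshift.

```python
from datetime import datetime

RUN_PREFIX = "run="
RUN_FORMAT = "%Y-%m-%d-%H-%M-%S"  # assumed folder naming convention


def affected_runs(run_folders, pii_enabled_at):
    """Return run folders at or after pii_enabled_at, oldest first."""
    selected = []
    for folder in run_folders:
        name = folder.strip("/")
        if not name.startswith(RUN_PREFIX):
            continue  # skip anything that is not a run folder
        started = datetime.strptime(name[len(RUN_PREFIX):], RUN_FORMAT)
        if started >= pii_enabled_at:
            selected.append((started, name))
    return [name for _, name in sorted(selected)]


# Hypothetical listing of an archive location.
folders = [
    "run=2018-05-03-02-00-11/",
    "run=2018-04-28-02-00-09/",
    "run=2018-05-01-02-00-10/",
]
runs = affected_runs(folders, datetime(2018, 5, 1))
for run in runs:
    # Each run would be re-staged and processed on its own, one at a time.
    print(run)
```

Sorting oldest first keeps the re-processing order the same as the original load order, which makes it easier to track progress through the affected runs.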
3. Upgrading
If you are using RDB Loader to load events into Redshift or Postgres, you’ll need to update your EmrEtlRunner configuration to the following:
storage:
  versions:
    rdb_shredder: 0.13.1 # WAS 0.13.0
4. Roadmap
Upcoming Snowplow releases are unchanged:
- R106 Acropolis, further enhancing our recently-released GDPR-focused PII Enrichment for the realtime pipeline and fixing another related bug
- R107 [STR] New webhooks and enrichment, featuring Marketo and Vero webhook adapters from our partners at Snowflake Analytics, plus a new enrichment for detecting bots and spiders using data from the IAB
- R10x Valle dei Templi, porting our streaming enrichment process to Google Cloud Dataflow, leveraging the Apache Beam APIs
5. Getting help
For more details on this release, please check out the release notes on GitHub.
If you have any questions or run into any problems, please visit our Discourse forum.