Snowplow RDB Loader R29 released

12 June 2018  •  Anton Parkhomenko

We are pleased to announce the release of Snowplow RDB Loader R29, fixing an important bug relating to the PII Enrichment introduced in R100 Epidaurus.

Please read on after the fold for:

  1. PII Enrichment-related bug
  2. Recovery
  3. Upgrading
  4. Roadmap
  5. Getting help

1. PII Enrichment-related bug

In R100 we introduced a new enrichment for pseudonymizing personally identifiable information, to help our users comply with GDPR.

The PII Enrichment can be configured to hash specific fields and properties in your events so that user-identifiable information is replaced with hashed values (which can still be used in data modeling).

In order to store these new hashed values in Redshift, we also modified our atomic events table definition by widening the maximum characters for the VARCHAR and CHAR columns that are commonly used to store personally identifiable information, specifically:

  • user_ipaddress
  • user_fingerprint
  • domain_userid
  • network_userid
  • ip_organization
  • ip_domain
  • refr_domain_userid
  • domain_sessionid

All these columns were widened to 128 characters so that they can store the values produced by the most commonly used hash algorithms.
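To see why 128 characters is enough, here is a minimal Python sketch that measures the hex digest length of a few common hash algorithms. The algorithm list and the sample IP address are chosen for illustration only, and we assume the hashed values are stored as hex strings; the set of algorithms supported by the PII Enrichment may differ.

```python
import hashlib

# Algorithms picked for illustration; not necessarily the enrichment's exact set.
ALGORITHMS = ["md5", "sha1", "sha256", "sha384", "sha512"]
COLUMN_WIDTH = 128  # widened column width described above


def hex_digest_length(algorithm: str, value: str = "192.168.0.1") -> int:
    """Return the number of hex characters a given algorithm produces."""
    return len(hashlib.new(algorithm, value.encode("utf-8")).hexdigest())


# Map each algorithm to the length of its hex-encoded digest.
digest_lengths = {name: hex_digest_length(name) for name in ALGORITHMS}

for name, length in digest_lengths.items():
    print(f"{name}: {length} hex characters, fits: {length <= COLUMN_WIDTH}")
```

The longest of these, SHA-512, produces exactly 128 hex characters, which matches the new column width.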

Unfortunately, during the R100 release we missed the fact that RDB Shredder, which prepares events for loading into Redshift and Postgres, also performs a truncation on various atomic event columns; without the required corresponding update, RDB Shredder continued to truncate the above columns to their previous lengths.
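The effect can be illustrated with a short Python sketch. This is not RDB Shredder's actual code: the `truncate` helper and the `OLD_WIDTH` value of 45 are hypothetical and stand in for whatever the pre-R100 limit was for a given column.

```python
import hashlib

OLD_WIDTH = 45   # hypothetical pre-R100 limit, for illustration only
NEW_WIDTH = 128  # widened limit from the updated table definition


def truncate(value: str, max_length: int) -> str:
    """Cut a value to at most max_length characters."""
    return value[:max_length]


# A SHA-256 hex digest is 64 characters long.
hashed_ip = hashlib.sha256(b"192.168.0.1").hexdigest()

stale = truncate(hashed_ip, OLD_WIDTH)  # what a stale limit would load
fixed = truncate(hashed_ip, NEW_WIDTH)  # what the widened limit loads

print(len(hashed_ip), len(stale), len(fixed))
```

With the stale limit the stored value is only a prefix of the hash, so it no longer matches the full hashed value produced by the enrichment.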

This bug mainly affected users of the PII Enrichment, but while investigating it we also noticed that two further Redshift columns were being excessively truncated:

  • os_timezone
  • se_label

However, we think that these two columns, even when excessively truncated, had sufficient capacity for all real-world use cases, so there was most likely no negative impact on them.

2. Recovery

This bug resides only in RDB Shredder, so all enriched data remains valid and can therefore be re-processed by RDB Shredder and Loader.

In order to do that you need to:

  1. Identify all affected runs, i.e. every run since the day the PII Enrichment was enabled
  2. Delete all affected runs from Redshift
  3. Upgrade rdb_shredder in your config.yml
  4. Delete all affected runs from shredded.archive
  5. Re-stage enriched data from enriched.archive to enriched.good
  6. Run EmrEtlRunner with the --resume-from shred option

Important note about steps 5 and 6: archived folders cannot be staged all at once. They need to be staged and processed one by one, or their contents should be merged into one new folder.
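As a rough aid for step 1 and for the one-by-one processing noted above, the following Python sketch picks the affected run folders from a listing of enriched.archive and orders them oldest first. It assumes run folders are named like run=YYYY-MM-DD-hh-mm-ss; the `affected_runs` helper, the sample folder names and the enablement date are all hypothetical, and the actual staging and EmrEtlRunner invocation are left to your own tooling.

```python
from datetime import datetime
from typing import List

# Assumed run folder naming convention.
RUN_FORMAT = "run=%Y-%m-%d-%H-%M-%S"


def affected_runs(run_folders: List[str], pii_enabled_at: datetime) -> List[str]:
    """Return run folders created at or after pii_enabled_at, oldest first."""
    dated = [(datetime.strptime(name, RUN_FORMAT), name) for name in run_folders]
    return [name for timestamp, name in sorted(dated) if timestamp >= pii_enabled_at]


# Hypothetical listing of enriched.archive.
archive = [
    "run=2018-06-03-04-00-12",
    "run=2018-05-20-04-00-09",
    "run=2018-06-01-04-00-10",
]

runs = affected_runs(archive, datetime(2018, 6, 1))

for run in runs:
    # Each run is staged and re-shredded on its own, never together with others.
    print(f"stage and re-shred {run}")
```

Processing the runs in chronological order keeps the reloaded data in the same sequence as the original loads.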

3. Upgrading

If you are using RDB Loader to load events into Redshift or Postgres, you’ll need to update your EmrEtlRunner configuration to the following:

storage:
  versions:
    rdb_shredder: 0.13.1 # WAS 0.13.0

4. Roadmap

Upcoming Snowplow releases are unchanged:

5. Getting help

For more details on this release, please check out the release notes on GitHub.

If you have any questions or run into any problems, please visit our Discourse forum.