We are pleased to announce the release of Snowplow R100 Epidaurus. This streaming pipeline release adds support for pseudonymizing user PII (Personally Identifiable Information) through a new Snowplow enrichment.
We are initially adding this new PII Enrichment to the Snowplow streaming pipeline; extending this support to the batch pipeline will follow in due course.
Read on for more information on R100 Epidaurus, named after an ancient city in Argolida, Greece that features a magnificent theater with excellent acoustics:
- What are PII, GDPR, and pseudonymization, and why are they important?
- PII Enrichment
- Pseudonymizing your Snowplow events
- Other changes
1. What are PII, GDPR, and pseudonymization, and why are they important?
The term “personally identifiable information” originated in the context of healthcare record keeping. It soon became evident that the accumulation of healthcare records had huge potential to inform population-level measures, and very often there was public and research interest in such data. At the same time, researchers had to be careful to avoid releasing data that could uniquely identify an individual, hence the emergence of various strategies for PII anonymization.
Nowadays, collecting and processing large amounts of PII is increasingly within reach of even small organizations across every sector, thanks to powerful platforms such as Snowplow. Naturally, citizens and in turn governments have become concerned that this information can be misused, and opted to give back some control to the people whose records were being kept (“data subjects” in EU law terms).
Just as the healthcare records were useful for population-level health studies, so is tracking user behavior down to the level of the individual event useful for any data-driven organization. And just as health records need to be used and disseminated responsibly, so data scientists and analysts need to use event- and customer-level data in a way that protects the rights and identities of data subjects.
The European Union, often a pioneer in the field of human rights protection, has decided to enact a far-reaching regulation to replace previous digital privacy directives. In terms of EU law, a regulation is much more specific and prescriptive than a directive, and does not leave the implementation of that law up to the member states.
The official name is the General Data Protection Regulation (GDPR), and you’ll find plenty of information about it on a dedicated EU website. It is noteworthy that the regulation also applies to entities operating outside the EU, if the data collected concerns the activities of an EU citizen or resident which have taken place in the EU. The regulation also provides for hefty fines, and becomes enforceable in the EU on the 25th of May 2018.
To help you meet your obligations under GDPR, in this release we are providing a pseudonymization facility, implemented as a Snowplow pipeline enrichment. This is only the first of many features planned to help Snowplow users meet their obligations under GDPR. Pseudonymization essentially means that a datum which can uniquely identify an individual, or betray sensitive information about that individual, is substituted by an alias.
Concretely, the Snowplow operator is able to configure any and all of the fields whose values they wish to have hashed by Snowplow. Through hashing all the PII fields found within Snowplow events, you can minimize the risk of identification of a data subject - an important step towards meeting your obligations as data handlers.
2. PII Enrichment
This Snowplow release introduces the PII Enrichment, which provides capabilities for Snowplow operators to better protect the privacy rights of data subjects. The obligations of handlers of Personally Identifiable Information (PII) data under GDPR have been outlined on the EU GDPR website.
This initial release of the PII Enrichment provides a way to pseudonymize fields within Snowplow enriched events. You can configure the enrichment to pseudonymize any of the following datapoints:
- Any of the “first-class” fields which are part of the Canonical event model, are scalar fields containing a single string and have been identified as being potentially sensitive
- Any of the properties within the JSON instance of a Snowplow self-describing event or context (wherever that context originated). You simply specify the Iglu schema to target and a JSON Path to identify the property or properties within to pseudonymize
In addition, you must specify the “strategy” that will be used in the pseudonymization. Currently the available strategies involve hashing the PII, using one of the following algorithms:
- MD2, the 128-bit algorithm MD2 (not recommended due to performance reasons; see RFC 6149)
- MD5, the 128-bit algorithm MD5
- SHA-1, the 160-bit algorithm SHA-1
- SHA-256, the 256-bit variant of the SHA-2 algorithm
- SHA-384, the 384-bit variant of the SHA-2 algorithm
- SHA-512, the 512-bit variant of the SHA-2 algorithm
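To get a feel for what these strategies produce, here is a minimal Python sketch that hashes a sample value with each algorithm and reports the length of the result. It is not the enrichment's own code: the function names are ours, and we assume the hash is rendered as a hex string. MD2 is left out because Python's hashlib only offers it when the underlying OpenSSL build does.

```python
import hashlib

# Algorithms from the list above that hashlib always provides.
ALGORITHMS = ["md5", "sha1", "sha256", "sha384", "sha512"]


def pseudonymize(value: str, algorithm: str = "sha256") -> str:
    """Return the hex digest of value under the named hash algorithm.

    Illustrative helper only; hex encoding is an assumption of this sketch.
    """
    return hashlib.new(algorithm, value.encode("utf-8")).hexdigest()


def hex_lengths(sample: str = "jane@example.com") -> dict:
    """Map each algorithm name to the length of its hex digest."""
    return {name: len(pseudonymize(sample, name)) for name in ALGORITHMS}


if __name__ == "__main__":
    for name, length in hex_lengths().items():
        print(name, length)
```

Each hex character encodes four bits, so a 256-bit SHA-256 digest comes out as 64 characters and a 512-bit SHA-512 digest as 128. The same input always yields the same alias, which is what lets you keep joining and counting on a pseudonymized field.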
There is a new Iglu schema that specifies the configuration format for the PII Enrichment.
Further capabilities for the PII Enrichment, including the ability to reverse pseudonymization in a controlled way, are planned for the second phase of this PII Enrichment.
3. Pseudonymizing your Snowplow events
Before you start
Configuring the PII Enrichment
Like all Snowplow enrichments, the PII Enrichment is configured using a JSON document which conforms to a JSON Schema, the pii_enrichment_config, available in Iglu Central.
Here is an example configuration:
You should add that configuration to a directory with the other enrichment configurations. In this example it was added to se/enrichments and it was called pii_enrichment_config.json. The above example and other enrichment configurations can be found as always in the example configurations on GitHub.
- The Mailchimp webhook (available since release 0.9.11) is emitting subscribe events (among other events, ignored for the purpose of this example)
With the above PII Enrichment configuration, then, you are specifying that:
- You wish for the user_fingerprint from the Snowplow Canonical event model fields to be hashed (the full list of supported fields for pseudonymization is viewable in the enrichment configuration schema)
- You wish for the data.ip_opt fields from the Mailchimp subscribe event to be hashed, but only if the schema version begins with
- You wish to use the SHA-256 variant of the algorithm for the pseudonymization
As usual, you would run Stream Enrich like so:
pii_enrichment_config.json configuration JSON is found in the
The enriched events emitted by Stream Enrich will then have the values corresponding to the above PII pseudonymized using SHA-256.
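Conceptually, the effect on a single event can be sketched in a few lines of Python. This is a simplified illustration, not Stream Enrich's implementation: we model the enriched event as an already-parsed dictionary, the function names are hypothetical, and the schema prefix is a placeholder of our own rather than the value from the configuration.

```python
import copy
import hashlib

# Placeholder criterion for this sketch only; not taken from the real configuration.
EXAMPLE_SCHEMA_PREFIX = "iglu:com.mailchimp/subscribe/jsonschema/1-"


def sha256_hex(value: str) -> str:
    """Hex-encoded SHA-256 digest of a string."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


def pseudonymize_event(event: dict, schema_prefix: str) -> dict:
    """Return a copy of event with user_fingerprint hashed, and with
    data.ip_opt hashed when the event's schema URI starts with schema_prefix."""
    out = copy.deepcopy(event)  # leave the caller's event untouched

    if out.get("user_fingerprint") is not None:
        out["user_fingerprint"] = sha256_hex(str(out["user_fingerprint"]))

    payload = out.get("unstruct_event")
    if isinstance(payload, dict) and str(payload.get("schema", "")).startswith(schema_prefix):
        data = payload.get("data")
        if isinstance(data, dict) and isinstance(data.get("ip_opt"), str):
            data["ip_opt"] = sha256_hex(data["ip_opt"])

    return out


if __name__ == "__main__":
    sample = {
        "user_fingerprint": "1234567890",
        "unstruct_event": {
            "schema": "iglu:com.mailchimp/subscribe/jsonschema/1-0-0",
            "data": {"email": "jane@example.com", "ip_opt": "203.0.113.7"},
        },
    }
    print(pseudonymize_event(sample, EXAMPLE_SCHEMA_PREFIX))
```

Fields that are not targeted, such as the email property in the sample, pass through unchanged, and an event whose schema does not match the criterion keeps its original ip_opt value.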
A warning about JSON Schema validation of pseudonymized values
One note of caution: always check the underlying JSON Schema to avoid accidentally invalidating an entire event using the PII Enrichment. Specifically, you should check the field definitions of the fields for any constraints that hold under plaintext but not when the field is hashed, such as field length and format.
The scenario to avoid is as follows:
- You have a customerEmail property in a JSON Schema which must validate with
- You apply the PII Enrichment to hash that field
- The enriched event is successfully emitted from Stream Enrich…
- However, a downstream process (e.g. RDB Shredder) which validates the now-pseudonymized event will reject the event, as the hashed value is no longer in an email format
The same issue can happen with properties with enforced string lengths - note that all of the currently supported pseudonymization functions will generate hashes of up to 128 characters (in the case of SHA-512); be careful if the JSON Schema enforces a shorter length, as again the event will fail downstream validation.
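The following Python sketch shows why a hashed value trips these constraints. It uses a deliberately loose stand-in for an email format check and a hypothetical maximum length of 50, rather than a real JSON Schema validator; all names here are illustrative.

```python
import hashlib
import re

# Loose stand-in for an email format check; real validators are stricter.
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def hashed(value: str, algorithm: str = "sha256") -> str:
    """Hex digest of value, as assumed throughout this sketch."""
    return hashlib.new(algorithm, value.encode("utf-8")).hexdigest()


def violations(value: str, max_length=None, require_email: bool = False) -> list:
    """List the constraints that value breaks, in a fixed order."""
    problems = []
    if max_length is not None and len(value) > max_length:
        problems.append("maxLength")
    if require_email and not EMAIL_PATTERN.match(value):
        problems.append("format")
    return problems


if __name__ == "__main__":
    plaintext = "jane@example.com"
    print(violations(plaintext, max_length=50, require_email=True))
    print(violations(hashed(plaintext), max_length=50, require_email=True))
```

The plaintext address passes both checks, while its SHA-256 hash is 64 hex characters long and contains no @ sign, so it fails both. Running a check along these lines against your own schemas before enabling the enrichment is a cheap way to spot fields that would fail downstream.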
We are exploring ways of avoiding this issue, potentially via a dedicated “pii” annotation within JSON Schema (see issue #860 for more details).
4. Other changes
In order to support the replacing of original field values with pseudonymization hashes, we have had to widen various columns in the Redshift atomic.events table (issue #3528). At the same time, we also widened the “se_label” field in Redshift to support URLs.
Finally, we continue to improve the quality of our codebase by using scalafmt automated code formatting, which will greatly help new contributors to the project meet our high quality standards. You can see the standards we applied in issue #3496.
The updated Stream Enrich artifact for R100 Epidaurus is available at the following location:
Docker images for this new artifact will follow shortly; the instructions for using them can be found here.
If you were already using Snowplow with Redshift as a storage target, the existing columns need to be widened as discussed above. We have created a migration script for this purpose.
To use it you simply run it with psql like so:
Upcoming Snowplow releases will include:
- R10x [BAT] Priority fixes, part 1, various stability, security and data quality improvements for the batch pipeline
- R10x [STR] GCP support, part 1, our first release towards letting you run the Snowplow realtime pipeline on Google Cloud Platform per our GCP RFC
- R10x [BAT] Priority fixes, part 2, further stability, security and data quality improvements for the batch pipeline
We are also hard at work on a second phase of this PII Enrichment, which will allow you to safely capture the original PII values which have been pseudonymized, ready for secure and audited de-anonymization on a case-by-case basis.
7. Getting help
For more details on this release, please check out the release notes on GitHub.
If you have any questions or run into any problems, please visit our Discourse forum.