The term personally identifiable information (PII) originated in the context of healthcare record keeping. It soon became evident that the accumulation of healthcare records had huge potential to promote population-level measures, and very often there was public and research interest in such data. At the same time, researchers had to be careful to avoid releasing data that could uniquely identify an individual, hence the emergence of various strategies for PII anonymization.
Nowadays, collecting and processing large amounts of PII is increasingly within reach of even small organizations across every sector, thanks to powerful platforms such as Snowplow. Naturally, citizens and in turn governments have become concerned that this information can be misused, and opted to give back some control to the people whose records were being kept (“data subjects” in EU law terms).
Just as the healthcare records were useful for population-level health studies, so is tracking user behavior down to the level of the individual event useful for any data-driven organization. And just as health records need to be used and disseminated responsibly, so data scientists and analysts need to use event- and customer-level data in a way that protects the rights and identities of data subjects.
The European Union, often a pioneer in the field of human rights protection, has decided to enact a far-reaching regulation to replace previous digital privacy directives. In terms of EU law, a regulation is much more specific and prescriptive than a directive, and does not leave the implementation of that law up to the member states.
The official name is the General Data Protection Regulation (GDPR), and you’ll find plenty of information about it on a dedicated EU website. It is noteworthy that the regulation applies to entities operating outside the EU, if the data collected concerns the activities of an EU citizen or resident which have taken place in the EU. The regulation also provides for hefty fines, and will apply in the EU from the 25th of May 2018.
To help you meet your obligations under GDPR, in this release we are providing a pseudonymization facility, implemented as a Snowplow pipeline enrichment. This is only the first of many features planned to help Snowplow users meet their obligations under GDPR. Pseudonymization essentially means that a datum which can uniquely identify an individual, or betray sensitive information about that individual, is substituted by an alias.
Concretely, the Snowplow operator is able to configure any and all of the fields whose values they wish to have hashed by Snowplow. By hashing all the PII fields found within Snowplow events, you can minimize the risk of identification of a data subject, which is an important step towards meeting your obligations as data handlers.
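To make the idea of field-level hashing concrete, here is a minimal Python sketch. It is not the enrichment's actual implementation: the function name `pseudonymize_event`, the flat dictionary event and the sample values are all assumptions made for illustration. It only shows that each configured field is replaced by a deterministic alias while every other field passes through unchanged.

```python
import hashlib


def pseudonymize_event(event, pii_fields, algorithm="sha256"):
    """Return a copy of `event` with each listed field replaced by its hex digest.

    Hypothetical helper for illustration only; the real enrichment runs
    inside the Snowplow pipeline and is driven by its JSON configuration.
    """
    hashed = dict(event)  # shallow copy so the input event is not modified
    for field in pii_fields:
        value = hashed.get(field)
        if value is None:
            continue  # leave absent or null fields untouched
        digest = hashlib.new(algorithm, str(value).encode("utf-8"))
        hashed[field] = digest.hexdigest()
    return hashed


# Illustrative event: the values are made up.
event = {
    "user_id": "jane@example.com",
    "user_fingerprint": "2161814971",
    "page_url": "https://example.com/",
}
result = pseudonymize_event(event, ["user_id", "user_fingerprint"])
```

Because hashing is deterministic, the same input always yields the same alias, so events belonging to one data subject can still be joined on the hashed value.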
This Snowplow release introduces the PII Enrichment, which provides capabilities for Snowplow operators to better protect the privacy rights of data subjects. The obligations of handlers of Personally Identifiable Information (PII) data under GDPR have been outlined on the EU GDPR website.
This initial release of the PII Enrichment provides a way to pseudonymize fields within Snowplow enriched events. You can configure the enrichment to pseudonymize any of the following datapoints:
In addition, you must specify the “strategy” that will be used in the pseudonymization. Currently the available strategies involve hashing the PII, using one of the following algorithms:
MD2, the 128-bit algorithm MD2 (not recommended for performance reasons; see RFC 6149)
MD5, the 128-bit algorithm MD5
SHA-1, the 160-bit algorithm SHA-1
SHA-256, 256-bit variant of the SHA-2 algorithm
SHA-384, 384-bit variant of the SHA-2 algorithm
SHA-512, 512-bit variant of the SHA-2 algorithm
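As a rough illustration of how these strategies differ in output size, the following Python sketch maps the strategy names above onto the standard library's `hashlib` and measures the hex digest length of each. The `STRATEGIES` mapping and the `hash_value` helper are hypothetical names, and hex encoding is an assumption, albeit one consistent with the 128-character figure quoted for SHA-512 later in this post. MD2 is left out because `hashlib` does not guarantee its availability.

```python
import hashlib

# Strategy names from the list above mapped to hashlib algorithm names.
# MD2 is omitted: hashlib does not guarantee it on every platform.
STRATEGIES = {
    "MD5": "md5",
    "SHA-1": "sha1",
    "SHA-256": "sha256",
    "SHA-384": "sha384",
    "SHA-512": "sha512",
}


def hash_value(value, strategy):
    """Hash a string with the named strategy and return the hex digest."""
    hasher = hashlib.new(STRATEGIES[strategy])
    hasher.update(value.encode("utf-8"))
    return hasher.hexdigest()


# Length in characters of the hex-encoded hash for each strategy.
hex_lengths = {name: len(hash_value("192.0.2.1", name)) for name in STRATEGIES}

for name, length in hex_lengths.items():
    print(f"{name}: {length} hex characters")
```

The hex length is always twice the digest size in bytes, so it depends only on the chosen algorithm and not on the length of the original value.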
There is a new Iglu schema that specifies the configuration format for the PII Enrichment.
Further capabilities for the PII Enrichment, including the ability to reverse pseudonymization in a controlled way, are planned for the second phase of this PII Enrichment.
This brief tutorial assumes that you have gone through the upgrading section below, deploying the latest version of Stream Enrich and upgrading your Redshift events table definition.
Like all Snowplow enrichments, the PII Enrichment is configured using a JSON document which conforms to a JSON Schema, the pii_enrichment_config, available in Iglu Central.
Here is an example configuration:
You should add that configuration to a directory with the other enrichment configurations. In this example it was added to se/enrichments and named pii_enrichment_config.json. The above example and other enrichment configurations can be found, as always, in the example configurations on GitHub.
subscribe events (among other events, ignored for the purpose of this example)
With the above PII Enrichment configuration, then, you are specifying that:
user_fingerprint from the Snowplow Canonical event model fields to be hashed (the full list of supported fields for pseudonymization is viewable in the enrichment configuration schema)
data.ip_opt fields from the Mailchimp subscribe event to be hashed, but only if the schema version begins with
SHA-256 variant of the algorithm for the pseudonymization
You can easily check whether your own configuration instance conforms to the schema by using this tool alongside the schema.
As usual, you would run Stream Enrich like so:
pii_enrichment_config.json configuration JSON is found in the
The enriched events emitted by Stream Enrich will then have the values corresponding to the above PII pseudonymized using SHA-256.
One note of caution: always check the underlying JSON Schema to avoid accidentally invalidating an entire event using the PII Enrichment. Specifically, you should check the field definitions of the fields for any constraints that hold under plaintext but not when the field is hashed, such as field length and format.
The scenario to avoid is as follows:
customerEmail property in a JSON Schema which must validate with
The same issue can happen with properties with enforced string lengths - note that all of the currently supported pseudonymization functions will generate hashes of up to 128 characters (in the case of SHA-512); be careful if the JSON Schema enforces a shorter length, as again the event will fail downstream validation.
We are exploring ways of avoiding this issue, potentially via a dedicated “pii” annotation within JSON Schema (see issue #860 for more details).
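Until such an annotation exists, the check described above can be approximated by hand. The Python sketch below is only an illustration under stated assumptions: `hashing_conflicts` is a hypothetical helper, the sample schema and its property names are invented, and hashes are assumed to be hex encoded. It flags properties whose `maxLength`, `format`, `pattern`, `enum` or `type` constraints could reject a hashed value.

```python
import hashlib


def hashing_conflicts(schema, prop, algorithm="sha512"):
    """List constraints on `prop` that a hex-encoded hash could violate.

    Hypothetical helper: it inspects only a few common JSON Schema
    keywords and is not a substitute for full schema validation.
    """
    hex_length = hashlib.new(algorithm).digest_size * 2
    definition = schema.get("properties", {}).get(prop, {})
    problems = []

    max_length = definition.get("maxLength")
    if max_length is not None and max_length < hex_length:
        problems.append(f"maxLength {max_length} < hash length {hex_length}")

    # These keywords restrict the plaintext shape and may not accept a hash.
    for keyword in ("format", "pattern", "enum"):
        if keyword in definition:
            problems.append(f"{keyword} constraint may reject a hash")

    declared = definition.get("type")
    if declared is not None:
        allowed = declared if isinstance(declared, list) else [declared]
        if "string" not in allowed:
            problems.append(f"type {declared!r} does not allow strings")

    return problems


# Invented schema for illustration; property names are assumptions.
schema = {
    "type": "object",
    "properties": {
        "contactEmail": {"type": "string", "format": "email", "maxLength": 255},
        "promoCode": {"type": "string", "maxLength": 32},
        "comment": {"type": "string"},
    },
}
report = {name: hashing_conflicts(schema, name) for name in schema["properties"]}
```

Running a check like this over every field you plan to pseudonymize, before enabling the enrichment, gives you a chance to relax the schema or pick a different field rather than lose events at validation time.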
In order to support the replacing of original field values with pseudonymization hashes, we have had to widen various columns in the Redshift atomic.events table (issue #3528). At the same time, we also widened the “se_label” field in Redshift to support URLs.
Finally, we continue to improve the quality of our codebase by using scalafmt automated code formatting, which will greatly help new contributors to the project meet our high quality standards. You can see the standards we applied in issue #3496.
The updated Stream Enrich artifact for R100 Epidaurus is available at the following location:
Docker images for this new artifact will follow shortly; the instructions for using them can be found here.
If you were already using Snowplow with Redshift as a storage target, the existing columns need to be widened as discussed above. We have created a migration script for this purpose.
To use it you simply run it with psql like so:
Upcoming Snowplow releases will include:
We are also hard at work on a second phase of this PII Enrichment, which will allow you to safely capture the original PII values which have been pseudonymized, ready for secure and audited de-anonymization on a case-by-case basis.
For more details on this release, please check out the release notes on GitHub.
If you have any questions or run into any problems, please visit our Discourse forum.