Snowplow R100 Epidaurus released with PII pseudonymization support

27 February 2018  •  Konstantinos Servis

We are pleased to announce the release of Snowplow R100 Epidaurus. This streaming pipeline release adds support for pseudonymizing user PII (Personally Identifiable Information) through a new Snowplow enrichment.

We are initially adding this new PII Enrichment to the Snowplow streaming pipeline; extending this support to the batch pipeline will follow in due course.

Read on for more information on R100 Epidaurus, named after an ancient city in Argolida, Greece that features a magnificent theater with excellent acoustics:

  1. What are PII, GDPR, and pseudonymization, and why are they important?
  2. PII Enrichment
  3. Pseudonymizing your Snowplow events
  4. Other changes
  5. Upgrading
  6. Roadmap
  7. Getting help


1. What are PII, GDPR, and pseudonymization, and why are they important?

PII

The term "personally identifiable information" originated in the context of healthcare record keeping. It soon became evident that the accumulation of healthcare records had huge potential to promote population-level measures, and very often there was public and research interest in such data. At the same time, researchers had to be careful to avoid releasing data that could uniquely identify an individual, hence the emergence of various strategies for PII anonymization.

Nowadays, collecting and processing large amounts of PII is increasingly within reach of even small organizations across every sector, thanks to powerful platforms such as Snowplow. Naturally, citizens and in turn governments have become concerned that this information can be misused, and opted to give back some control to the people whose records were being kept (“data subjects” in EU law terms).

Just as the healthcare records were useful for population-level health studies, so is tracking user behavior down to the level of the individual event useful for any data-driven organization. And just as health records need to be used and disseminated responsibly, so data scientists and analysts need to use event- and customer-level data in a way that protects the rights and identities of data subjects.

GDPR

The European Union, often a pioneer in the field of human rights protection, has decided to enact a far-reaching regulation to replace previous digital privacy directives. In terms of EU law, a regulation is much more specific and prescriptive than a directive, and does not leave the implementation of that law up to the member states.

The official name is the General Data Protection Regulation (GDPR), and you’ll find plenty of information about it on a dedicated EU website. It is noteworthy that the regulation applies to entities operating outside the EU, if the data collected concerns the activities of an EU citizen or resident which have taken place in the EU. The regulation also provides for hefty fines, and will become enforceable in the EU on the 25th of May 2018.

Pseudonymization

To help you meet your obligations under GDPR, in this release we are providing a pseudonymization facility, implemented as a Snowplow pipeline enrichment. This is only the first of many features planned to help Snowplow users meet their obligations under GDPR. Pseudonymization essentially means that a datum which can uniquely identify an individual, or betray sensitive information about that individual, is substituted by an alias.

Concretely, the Snowplow operator is able to configure any and all of the fields whose values they wish to have hashed by Snowplow. Through hashing all the PII fields found within Snowplow events, you can minimize the risk of identification of a data subject - an important step towards meeting your obligations as data handlers.

2. PII Enrichment

This Snowplow release introduces the PII Enrichment, which provides capabilities for Snowplow operators to better protect the privacy rights of data subjects. The obligations of handlers of Personally Identifiable Information (PII) data under GDPR have been outlined on the EU GDPR website.

This initial release of the PII Enrichment provides a way to pseudonymize fields within Snowplow enriched events. You can configure the enrichment to pseudonymize any of the following datapoints:

  1. Any of the “first-class” fields which are part of the Canonical event model, are scalar fields containing a single string and have been identified as being potentially sensitive
  2. Any of the properties within the JSON instance of a Snowplow self-describing event or context (wherever that context originated). You simply specify the Iglu schema to target and a JSON Path to identify the property or properties within it to pseudonymize

In addition, you must specify the “strategy” that will be used in the pseudonymization. Currently the available strategies involve hashing the PII, using one of the following algorithms:

  • MD2, the 128-bit algorithm MD2 (not recommended for performance reasons; see RFC 6149)
  • MD5, the 128-bit algorithm MD5
  • SHA-1, the 160-bit algorithm SHA-1
  • SHA-256, 256-bit variant of the SHA-2 algorithm
  • SHA-384, 384-bit variant of the SHA-2 algorithm
  • SHA-512, 512-bit variant of the SHA-2 algorithm
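To give a feel for what each strategy produces, here is a minimal Python sketch that hashes a sample value with most of the algorithms listed above and reports the length of the result. It is an illustration only, not the enrichment's code: it assumes hex-encoded output (consistent with the 128-character figure quoted for SHA-512 later in this post), the helper name hex_digest_length and the sample email address are made up for the example, and MD2 is left out because Python's hashlib does not guarantee it is available.

```python
import hashlib

# Algorithm names as Python's hashlib spells them. MD2 is omitted because
# hashlib does not guarantee that it is available on every platform.
ALGORITHMS = ["md5", "sha1", "sha256", "sha384", "sha512"]


def hex_digest_length(algorithm: str, value: str = "jane@example.com") -> int:
    """Return the number of hex characters in the digest of `value`.

    Hypothetical helper for illustration; hex encoding is an assumption.
    """
    digest = hashlib.new(algorithm, value.encode("utf-8")).hexdigest()
    return len(digest)


if __name__ == "__main__":
    for name in ALGORITHMS:
        # Each hex character encodes 4 bits, so length = bit size / 4.
        print(name, hex_digest_length(name))
```

The digest length depends only on the algorithm, never on the input, which is worth keeping in mind when sizing columns or checking length constraints in your schemas.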

There is a new Iglu schema that specifies the configuration format for the PII Enrichment.

Further capabilities for the PII Enrichment, including the ability to reverse pseudonymization in a controlled way, are planned for the second phase of this PII Enrichment.

3. Pseudonymizing your Snowplow events

Before you start

This brief tutorial assumes that you have gone through the upgrading section below, deploying the latest version of Stream Enrich and upgrading your Redshift events table definition.

Configuring the PII Enrichment

Like all Snowplow enrichments, the PII Enrichment is configured using a JSON document which conforms to a JSON Schema, the pii_enrichment_config, available in Iglu Central.

Here is an example configuration:

{
  "schema": "iglu:com.snowplowanalytics.snowplow.enrichments/pii_enrichment_config/jsonschema/1-0-0",
  "data": {
    "vendor": "com.snowplowanalytics.snowplow.enrichments",
    "name": "pii_enrichment_config",
    "enabled": true,
    "parameters": {
      "pii": [
        {
          "pojo": {
            "field": "user_id"
          }
        },
        {
          "pojo": {
            "field": "user_fingerprint"
          }
        },
        {
          "json": {
            "field": "unstruct_event",
            "schemaCriterion": "iglu:com.mailchimp/subscribe/jsonschema/1-0-*",
            "jsonPath": "$.data.['email', 'ip_opt']"
          }
        }
      ],
      "strategy": {
        "pseudonymize": {
          "hashFunction": "SHA-256"
        }
      }
    }
  }
}

You should add that configuration to a directory with the other enrichment configurations. In this example it was added to se/enrichments and named pii_enrichment_config.json. The above example and other enrichment configurations can be found, as always, in the example configurations on GitHub.

The configuration above is for a Snowplow pipeline that is receiving events from the Snowplow JavaScript Tracker, plus a Mailchimp webhook integration:

  • The Snowplow JavaScript Tracker has been configured to emit events which include the user_id and user_fingerprint fields
  • The Mailchimp webhook (available since release 0.9.11) is emitting subscribe events (among other events, ignored for the purpose of this example)

With the above PII Enrichment configuration, then, you are specifying that:

  • You wish for the user_id and user_fingerprint from the Snowplow Canonical event model fields to be hashed (the full list of supported fields for pseudonymization is viewable in the enrichment configuration schema)
  • You wish for the data.email and data.ip_opt fields from the Mailchimp subscribe event to be hashed, but only if the schema version begins with 1-0-
  • You wish to use the SHA-256 variant of the SHA-2 algorithm for the pseudonymization
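The effect of these three rules can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the enrichment's actual implementation: the event is modelled as a plain dict, the self-describing event is assumed to be already parsed into a dict with schema and data keys, the JSON Path is replaced by an explicit list of property names, the wildcard in the schema criterion is treated as a simple prefix match, and output is assumed to be hex-encoded. The function names hash_value, schema_matches and pseudonymize are hypothetical.

```python
import copy
import hashlib


def hash_value(value: str, hash_function: str = "SHA-256") -> str:
    """Hash a string and return the hex digest (hex output is an assumption)."""
    # hashlib spells "SHA-256" as "sha256", so drop the hyphen and lowercase.
    name = hash_function.replace("-", "").lower()
    return hashlib.new(name, value.encode("utf-8")).hexdigest()


def schema_matches(schema_uri: str, criterion: str) -> bool:
    """Simplified criterion check: a trailing '*' acts as a prefix wildcard."""
    if criterion.endswith("*"):
        return schema_uri.startswith(criterion[:-1])
    return schema_uri == criterion


def pseudonymize(event, pojo_fields, json_rules, hash_function="SHA-256"):
    """Return a copy of `event` with the configured fields hashed."""
    out = copy.deepcopy(event)  # leave the caller's event untouched

    # Rule 1: first-class scalar string fields such as user_id.
    for field in pojo_fields:
        if isinstance(out.get(field), str):
            out[field] = hash_value(out[field], hash_function)

    # Rule 2: properties inside a self-describing JSON, only when the
    # schema of that JSON matches the criterion.
    for rule in json_rules:
        payload = out.get(rule["field"])
        if not isinstance(payload, dict):
            continue
        if not schema_matches(payload.get("schema", ""), rule["schemaCriterion"]):
            continue
        data = payload.get("data", {})
        for prop in rule["properties"]:  # stand-in for the JSON Path
            if isinstance(data.get(prop), str):
                data[prop] = hash_value(data[prop], hash_function)
    return out


if __name__ == "__main__":
    event = {
        "user_id": "jane@example.com",
        "user_fingerprint": "2161814971",
        "unstruct_event": {
            "schema": "iglu:com.mailchimp/subscribe/jsonschema/1-0-0",
            "data": {"email": "jane@example.com", "ip_opt": "203.0.113.7"},
        },
    }
    rules = [{
        "field": "unstruct_event",
        "schemaCriterion": "iglu:com.mailchimp/subscribe/jsonschema/1-0-*",
        "properties": ["email", "ip_opt"],
    }]
    print(pseudonymize(event, ["user_id", "user_fingerprint"], rules))
```

Note that in this sketch a subscribe event with schema version 1-1-0 would pass through with its email and ip_opt values untouched, which mirrors the "only if the schema version begins with 1-0-" behaviour described above.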

You can easily check whether your own configuration instance conforms to the schema by using this tool alongside the schema.

Execution

As usual, you would run Stream Enrich like so:

java -jar se/snowplow-stream-enrich-0.14.0.jar --config se/config.hocon --resolver file:se/resolver.json --enrichments file:se/enrichments

Where your pii_enrichment_config.json configuration JSON is found in the se/enrichments folder.

The enriched events emitted by Stream Enrich will then have the values corresponding to the above PII pseudonymized using SHA-256.

A warning about JSON Schema validation of pseudonymized values

One note of caution: always check the underlying JSON Schema to avoid accidentally invalidating an entire event using the PII Enrichment. Specifically, you should check the definitions of the fields you are targeting for any constraints that hold for the plaintext value but not for its hash, such as field length and format.

The scenario to avoid is as follows:

  • You have a customerEmail property in a JSON Schema which must validate with format: email
  • You apply the PII Enrichment to hash that field
  • The enriched event is successfully emitted from Stream Enrich…
  • However, a downstream process (e.g. RDB Shredder) which validates the now-pseudonymized event will reject the event, as the hashed value is no longer in an email format

The same issue can happen with properties with enforced string lengths - note that all of the currently supported pseudonymization functions will generate hashes of up to 128 characters (in the case of SHA-512); be careful if the JSON Schema enforces a shorter length, as again the event will fail downstream validation.
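The following Python sketch illustrates both failure modes on a single value. It is not a JSON Schema validator: the violations helper, the crude email regular expression standing in for format: email, and the 32-character limit are all assumptions made up for the example, and the hash is assumed to be hex-encoded.

```python
import hashlib
import re
from typing import List, Optional

# Deliberately crude stand-in for JSON Schema's "format: email" check.
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def violations(value: str, require_email: bool = False,
               max_length: Optional[int] = None) -> List[str]:
    """Return the constraints that `value` breaks (hypothetical helper)."""
    problems = []
    if require_email and not EMAIL_PATTERN.match(value):
        problems.append("format: email")
    if max_length is not None and len(value) > max_length:
        problems.append(f"maxLength: {max_length}")
    return problems


plaintext = "jane@example.com"
# Hex-encoded SHA-256 digest: 64 characters, none of which is "@".
hashed = hashlib.sha256(plaintext.encode("utf-8")).hexdigest()

if __name__ == "__main__":
    # The plaintext satisfies both constraints; the hash breaks both.
    print("plaintext:", violations(plaintext, require_email=True, max_length=32))
    print("hashed:   ", violations(hashed, require_email=True, max_length=32))
```

Running a check along these lines against the schemas you intend to target, before enabling the enrichment, is one way to spot properties whose constraints the hashed value would no longer satisfy.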

We are exploring ways of avoiding this issue, potentially via a dedicated “pii” annotation within JSON Schema (see issue #860 for more details).

4. Other changes

In order to support the replacing of original field values with pseudonymization hashes, we have had to widen various columns in the Redshift atomic.events table (issue #3528). At the same time, we also widened the “se_label” field in Redshift to support URLs.

Finally, we continue to improve the quality of our codebase by using scalafmt automated code formatting, which will greatly help new contributors to the project meet our high quality standards. You can see the standards we applied in issue #3496.

5. Upgrading

Stream Enrich

The updated Stream Enrich artifact for R100 Epidaurus is available at the following location:

http://dl.bintray.com/snowplow/snowplow-generic/snowplow_stream_enrich_0.14.0.zip

Docker images for this new artifact will follow shortly; the instructions for using them can be found here.

Redshift

If you were already using Snowplow with Redshift as a storage target, the existing columns need to be widened as discussed above. We have created a migration script for this purpose.

To use it you simply run it with psql like so:

psql -h <host_endpoint> -p 5439 -d <name_of_the_database> -U <username> -f migrate_0.9.0_to_0.10.0.sql

6. Roadmap

Upcoming Snowplow releases will include:

We are also hard at work on a second phase of this PII Enrichment, which will allow you to safely capture the original PII values which have been pseudonymized, ready for secure and audited de-anonymization on a case-by-case basis.

7. Getting help

For more details on this release, please check out the release notes on GitHub.

If you have any questions or run into any problems, please visit our Discourse forum.