Snowplow R106 Acropolis released with PII Enrichment upgrade

14 June 2018  •  Konstantinos Servis

We are pleased to announce the release of Snowplow R106 Acropolis.

This release brings some important improvements to the PII Enrichment first released in R100 Epidaurus.

Read on for more information on R106 Acropolis, named after the Acropolis of Athens:

  1. Overview of the new PII-related capabilities
  2. Emitting a stream of PII transformation events
  3. Adding a salt for hashing
  4. Fixing an important bug
  5. Other changes
  6. Upgrading
  7. Roadmap
  8. Getting help

1. Overview of the new PII-related capabilities

In our recent R100 Epidaurus release, we introduced the capability to pseudonymize Snowplow PII data to help our users meet the requirements of the GDPR.

In brief, that release let you configure Snowplow to hash any PII-containing fields, be they Canonical event model fields or properties within a self-describing event or context.

With this new release, users of Snowplow real-time now have the option to configure a stream of events which contain the hashed values alongside their original values. You can think of these pairs as similar to:

"d4bd092ce3df26df6f492296ef8e4daf71be4ac9" -> "10.0.2.1"

Although the new PII Transformation event stream is only available for Snowplow real-time pipeline users, this release also brings two other PII-related updates which are available for both batch and real-time users: the option to add a salt for hashing, and a fix for an important bug in the PII Enrichment.

Let’s discuss each of the new PII-related capabilities in turn, starting with the new emitted stream.

2. Emitting a stream of PII transformation events

This release adds a configurable, optional stream of events from the PII Enrichment that contains the hashed and original values.

When enabled and configured, Stream Enrich will emit into this new stream a “PII Transformation” event for each event that was pseudonymized in the PII enrichment, containing the original and hashed values.

Anatomy of a PII Transformation event

The emitted event is a standard Snowplow enriched event as described in the Canonical event model, and as such it can be easily consumed via our analytics SDKs.

The event follows the new PII Transformation event JSON schema. An instance of that event could look like this (depending on the fields configured):

{
  "schema": "iglu:com.snowplowanalytics.snowplow/pii_transformation/jsonschema/1-0-0",
  "data": {
    "pii": {
      "pojo": [
        {
          "fieldName": "user_fingerprint",
          "originalValue": "its_you_again!",
          "modifiedValue": "27abac60dff12792c6088b8d00ce7f25c86b396b8c3740480cd18e21068ecff4"
        },
        {
          "fieldName": "user_ipaddress",
          "originalValue": "70.46.123.145",
          "modifiedValue": "dd9720903c89ae891ed5c74bb7a9f2f90f6487927ac99afe73b096ad0287f3f5"
        },
        {
          "fieldName": "user_id",
          "originalValue": "john@acme.com",
          "modifiedValue": "7d8a4beae5bc9d314600667d2f410918f9af265017a6ade99f60a9c8f3aac6e9"
        }
      ],
      "json": [
        {
          "fieldName": "unstruct_event",
          "originalValue": "50.56.129.169",
          "modifiedValue": "269c433d0cc00395e3bc5fe7f06c5ad822096a38bec2d8a005367b52c0dfb428",
          "jsonPath": "$.ip",
          "schema": "iglu:com.mailgun/message_clicked/jsonschema/1-0-0"
        },
        {
          "fieldName": "contexts",
          "originalValue": "bob@acme.com",
          "modifiedValue": "1c6660411341411d5431669699149283d10e070224be4339d52bbc4b007e78c5",
          "jsonPath": "$.data.emailAddress2",
          "schema": "iglu:com.acme/email_sent/jsonschema/1-1-0"
        },
        {
          "fieldName": "contexts",
          "originalValue": "jim@acme.com",
          "modifiedValue": "72f323d5359eabefc69836369e4cabc6257c43ab6419b05dfb2211d0e44284c6",
          "jsonPath": "$.emailAddress",
          "schema": "iglu:com.acme/email_sent/jsonschema/1-0-0"
        }
      ]
    },
    "strategy": {
      "pseudonymize": {
        "hashFunction": "SHA-256"
      }
    }
  }
}

There are a few notable things about this example. The PII Enrichment was configured to pseudonymize the canonical fields user_fingerprint, user_ipaddress, and user_id, and as such the emitted event contains their original and modified values.

In addition, the enrichment was configured to pseudonymize properties from the unstruct_event and contexts fields. As before, the event contains the original and modified values, but it also contains:

  1. The schema property, identifying the Iglu URI for the related events
  2. The jsonPath property, identifying the path of the value that was substituted. This is needed because, in the case of contexts, there could be any number of substitutions depending on the path and schema matches

Finally, the PII Transformation event strategy, in this case the hash function used, is also given. What is not emitted is the salt that was used in the hashing (see Adding a salt for hashing below).
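
As a rough illustration of how a downstream consumer might use this payload, the following Python sketch builds a lookup from each hashed value back to its original value. The helper name build_reverse_lookup and the trimmed sample payload are assumptions made for illustration and are not part of any Snowplow SDK; only the property names are taken from the example above.

```python
import json

# Trimmed copy of the example PII Transformation event shown above.
SAMPLE_EVENT = """
{
  "schema": "iglu:com.snowplowanalytics.snowplow/pii_transformation/jsonschema/1-0-0",
  "data": {
    "pii": {
      "pojo": [
        {
          "fieldName": "user_id",
          "originalValue": "john@acme.com",
          "modifiedValue": "7d8a4beae5bc9d314600667d2f410918f9af265017a6ade99f60a9c8f3aac6e9"
        }
      ],
      "json": [
        {
          "fieldName": "contexts",
          "originalValue": "jim@acme.com",
          "modifiedValue": "72f323d5359eabefc69836369e4cabc6257c43ab6419b05dfb2211d0e44284c6",
          "jsonPath": "$.emailAddress",
          "schema": "iglu:com.acme/email_sent/jsonschema/1-0-0"
        }
      ]
    },
    "strategy": {"pseudonymize": {"hashFunction": "SHA-256"}}
  }
}
"""


def build_reverse_lookup(event: dict) -> dict:
    """Map each modifiedValue to its originalValue (hypothetical helper)."""
    pii = event["data"]["pii"]
    lookup = {}
    # Both canonical ("pojo") and JSON ("json") substitutions carry the same
    # originalValue/modifiedValue pair; either list may be absent.
    for kind in ("pojo", "json"):
        for entry in pii.get(kind, []):
            lookup[entry["modifiedValue"]] = entry["originalValue"]
    return lookup


lookup = build_reverse_lookup(json.loads(SAMPLE_EVENT))
```

A store built this way holds the original PII, so it would need the same access controls as any other sensitive data.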

Let’s look at a couple of fields of particular interest, namely the contexts and unstruct_event:

The PII Transformation event's parent event

The contexts field in the new PII Transformation event contains a new context called parent_event with a new schema. Here is an example of such a context:

{
  "schema": "iglu:com.snowplowanalytics.snowplow/parent_event/jsonschema/1-0-0",
  "data": {
    "parentEventId": "a0f0213e-d514-44e5-8c3d-b1fba8c54f0f"
  }
}

This context simply contains the Event ID, a UUID, of the parent event for which the PII Enrichment was applied. This can be useful for reconciling the emitted PII Transformation events back to the events which caused them to be generated.
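
For reconciliation, a consumer could pull the parent event ID out of the contexts field along these lines. This is only a sketch: the helper name parent_event_id is hypothetical, and the sample assumes the context arrives wrapped in the standard Snowplow contexts self-describing JSON, with the list of contexts under data.

```python
from typing import Optional

# Match on the vendor/name part of the schema URI so any version is accepted.
PARENT_EVENT_KEY = "com.snowplowanalytics.snowplow/parent_event/jsonschema/"


def parent_event_id(contexts: dict) -> Optional[str]:
    """Return the parentEventId from a contexts JSON, or None if absent."""
    for context in contexts.get("data", []):
        if PARENT_EVENT_KEY in context.get("schema", ""):
            return context["data"]["parentEventId"]
    return None


# Assumed shape of the contexts field of a PII Transformation event.
sample_contexts = {
    "schema": "iglu:com.snowplowanalytics.snowplow/contexts/jsonschema/1-0-0",
    "data": [
        {
            "schema": "iglu:com.snowplowanalytics.snowplow/parent_event/jsonschema/1-0-0",
            "data": {"parentEventId": "a0f0213e-d514-44e5-8c3d-b1fba8c54f0f"},
        }
    ],
}

found_id = parent_event_id(sample_contexts)
missing_id = parent_event_id({"schema": "", "data": []})
```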

Enabling the new event stream

To emit a stream of PII Transformation events, the target stream must already exist in some setups (e.g. Kinesis), and you will need to configure the stream in two separate places.

This is all covered in detail in the upgrading section below.

Using the new event stream

This new event stream is intended to be used by downstream processes which want to track the pseudonymization process and make it possible for Snowplow operators to recover the original PII values, if and only if the operator has the appropriate authorization under the conditions required for one of the lawful bases for processing.

We are working on a new open-source project to leverage this event stream, called Piinguin. Expect more information on this project soon.

3. Adding a salt for hashing

In order to make it harder for the hashed PII data to be identified, we have responded to community feedback and added the option of a salt to the hashing pseudonymization. Many thanks to falschparker82 of JustWatch for advocating for this approach in issue #3648.

The salt is simply a string that is appended to the end of the string that is going to be hashed; this makes it a lot harder, if not impossible, for someone to “brute force” the pseudonymized data by hashing all the possible values of a field and trying to match the hash.

The new setting is simply a new field in the configuration for the enrichment - see our Upgrading section for further details. The salt should remain secret in order to ensure that protection against brute-forcing the hashed values is achieved.

Important: note that changing the salt will change the hash of the same value, which will make working with values pre- and post-salt change much more complicated.
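
The effect of the salt can be sketched in a few lines of Python. This is an illustration rather than the enrichment's actual implementation: the function name pseudonymize, the UTF-8 encoding, and the sample salts are assumptions; the only detail taken from the text above is that the salt is appended to the value before hashing.

```python
import hashlib


def pseudonymize(value: str, salt: str = "", hash_function: str = "sha256") -> str:
    """Hash the value with the salt appended and return the hex digest."""
    digest = hashlib.new(hash_function)
    digest.update((value + salt).encode("utf-8"))
    return digest.hexdigest()


unsalted = pseudonymize("10.0.2.1")
salted = pseudonymize("10.0.2.1", salt="pepper123")
# Changing the salt changes the hash of the same input value.
resalted = pseudonymize("10.0.2.1", salt="pepper456")
```

Because salted and resalted differ for the same input, values hashed before and after a salt change can no longer be joined on the hash alone.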

4. Fixing an important bug

With our R100 introduction of the PII Enrichment, there was a known issue in one of the underlying libraries that we believed to be harmless; unfortunately we have since identified that it can cause problems downstream in the pipeline.

The problem can cause good events to end up in the bad bucket under certain conditions explained below.

As described in issue #3636, the bug occurs when the user has configured the PII Enrichment to hash a JSON type field with a JSON Path containing an array of fields like so:

{
  "json": {
    "field": "unstruct_event",
    "schemaCriterion": "iglu:com.acme/event/jsonschema/1-0-0",
    "jsonPath": "$.['email', 'username']"
  }
}

In events that did not contain both fields, the hashed output would correctly hash the existent one, but it would also create the one that did not exist as an empty object, so the enriched event would contain an output like so:

{
  "schema": "iglu:com.acme/event/jsonschema/1-0-0",
  "data": {
    "email": "764e2b5c4da5267efd84ab24a86539dfc85031c4",
    "username": {}
  }
}

The problem with that event is that it can fail validation downstream depending on the schema iglu:com.acme/event/jsonschema/1-0-0. For example, if the field username in the schema is only allowed to be a string, then the event will fail validation and end up in the bad bucket during shredding (not during enrichment).
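
To make the failure mode concrete, here is a small Python sketch that stands in for schema validation. It is a hand-rolled type check, not a real JSON Schema validator, and it assumes that the hypothetical com.acme event schema declares both email and username as strings. The expected output shown is our reading of the intended behavior, namely that a field absent from the input stays absent.

```python
# Field names assumed to be typed as "string" in the hypothetical
# iglu:com.acme/event/jsonschema/1-0-0 schema.
STRING_FIELDS = ("email", "username")


def type_violations(data: dict) -> list:
    """Return the fields that are present but whose value is not a string."""
    return [
        field
        for field in STRING_FIELDS
        if field in data and not isinstance(data[field], str)
    ]


# Output produced by the bug: the absent "username" was created as {}.
buggy_output = {
    "email": "764e2b5c4da5267efd84ab24a86539dfc85031c4",
    "username": {},
}
# Intended output: only the field that actually existed is hashed.
expected_output = {"email": "764e2b5c4da5267efd84ab24a86539dfc85031c4"}

buggy_violations = type_violations(buggy_output)
expected_violations = type_violations(expected_output)
```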

5. Other changes

Two other improvements included in this release are:

  1. Automated code formatting for Stream Enrich
  2. An integration test for Stream Enrich’s Apache Kafka support

Automated code formatting further improves the code quality of the snowplow/snowplow repo and makes it easier for new contributors to meet the expected quality standards for Snowplow code.

The Kafka integration test uses the excellent Kafka Testkit to bring up a Kafka broker for Stream Enrich to interact with, thus extending test coverage and further improving the maintainability of the codebase.

6. Upgrading

R106 Acropolis is slightly unusual in being a simultaneous release for the Snowplow batch and real-time pipelines.

This upgrading section is broken down as follows:

  1. Batch pipeline upgrade instructions
  2. Real-time pipeline upgrade instructions
  3. Full example for the new PII Enrichment configuration

Please make sure to read section 3 alongside either section 1 or 2.

Batch pipeline upgrade instructions

To upgrade, update your EmrEtlRunner configuration to the following:

enrich:
  version:
    spark_enrich: 1.14.0 # WAS 1.13.0

Now review the Full example for the new PII Enrichment configuration below.

Real-time pipeline upgrade instructions

The latest version of Stream Enrich is available from our Bintray here.

There are a few steps to using the new capabilities:

  1. Create your Kinesis stream or equivalent for the PII Transformation events stream
  2. Update your PII Enrichment configuration (using version 2-0-0)
  3. Update your Stream Enrich app configuration

Create your Kinesis stream or equivalent

Make sure to create a dedicated Kinesis stream, Apache Kafka topic, or equivalent to hold the PII Transformation events - otherwise Stream Enrich will fail.

Do not attempt to re-use your enriched event stream, as then you will be co-mingling sensitive PII data with safely pseudonymized enriched events.

Update your PII Enrichment configuration

In the PII Enrichment configuration version 2-0-0 you will need to add:

...
"emitEvent": true
...

The complete configuration file, including salt configuration, can be found in the Full example for the new PII Enrichment configuration below.

Update your Stream Enrich app configuration

In the Stream Enrich configuration you will need to add a new property, pii, and set it to the stream or topic which should hold the PII Transformation events:

enrich {
  streams {
    ...

    out {
      enriched = my-enriched-events-stream
      bad = my-events-that-failed-validation-during-enrichment
      pii = my-pii-transformation-events-stream
      partitionKey = ""
    }

    ...
  }
}

Most of the above configuration should be familiar for Stream Enrich users - if not, you can find more information on the Stream Enrich configuration wiki page.

Full example for the new PII Enrichment configuration

Here is a full example PII enrichment configuration:

{
  "schema": "iglu:com.snowplowanalytics.snowplow.enrichments/pii_enrichment_config/jsonschema/2-0-0",
  "data": {
    "vendor": "com.snowplowanalytics.snowplow.enrichments",
    "name": "pii_enrichment_config",
    "emitEvent": true,
    "enabled": true,
    "parameters": {
      "pii": [
        {
          "pojo": {
            "field": "user_id"
          }
        },
        {
          "pojo": {
            "field": "user_ipaddress"
          }
        },
        {
          "json": {
            "field": "unstruct_event",
            "schemaCriterion": "iglu:com.mailchimp/subscribe/jsonschema/1-*-*",
            "jsonPath": "$.data.['email', 'ip_opt']"
          }
        }
      ],
      "strategy": {
        "pseudonymize": {
          "hashFunction": "SHA-1",
          "salt": "pepper123"
        }
      }
    }
  }
}

Most properties will be familiar from the R100 Epidaurus configuration, which used the 1-0-0 version of the configuration schema, per the relevant wiki page.

The new items are:

  1. emitEvent, which configures whether a PII Transformation event will be emitted or not
  2. salt, which, as explained above, sets the salt that will be used in the hashing

Setting emitEvent to true is only relevant for the real-time pipeline; salt is applicable to both pipelines.
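
If you keep your enrichment configuration under version control, the change from 1-0-0 to 2-0-0 could be scripted roughly as follows. This Python sketch is an assumption-laden illustration: the helper name upgrade_pii_config is hypothetical, and it assumes that a 1-0-0 configuration has the same layout as the full example above apart from the two new properties.

```python
import copy

NEW_SCHEMA = (
    "iglu:com.snowplowanalytics.snowplow.enrichments/"
    "pii_enrichment_config/jsonschema/2-0-0"
)


def upgrade_pii_config(old_config: dict, salt: str, emit_event: bool) -> dict:
    """Return a 2-0-0 style copy of a 1-0-0 config (hypothetical helper)."""
    new_config = copy.deepcopy(old_config)  # leave the caller's dict untouched
    new_config["schema"] = NEW_SCHEMA
    new_config["data"]["emitEvent"] = emit_event
    new_config["data"]["parameters"]["strategy"]["pseudonymize"]["salt"] = salt
    return new_config


# Assumed shape of an existing R100-era (1-0-0) configuration.
old_config = {
    "schema": "iglu:com.snowplowanalytics.snowplow.enrichments/pii_enrichment_config/jsonschema/1-0-0",
    "data": {
        "vendor": "com.snowplowanalytics.snowplow.enrichments",
        "name": "pii_enrichment_config",
        "enabled": True,
        "parameters": {
            "pii": [{"pojo": {"field": "user_id"}}],
            "strategy": {"pseudonymize": {"hashFunction": "SHA-1"}},
        },
    },
}

new_config = upgrade_pii_config(old_config, salt="pepper123", emit_event=True)
```

In practice the salt would be supplied from a secret store rather than written inline as it is in this sample.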

7. Roadmap

Upcoming Snowplow releases will include:

8. Getting help

For more details on this release, please check out the release notes on GitHub.

If you have any questions or run into any problems, please visit our Discourse forum.