In our recent R100 Epidaurus release, we introduced the capability to pseudonymize Snowplow PII data to help our users meet the GDPR regulations.
In brief, that release let you configure Snowplow to hash any PII-containing fields, be they a Canonical event model field, or a property within a self-describing event or context.
With this new release, users of Snowplow real-time now have the option to configure a stream of events which contain the hashed values alongside their original values. You can think of these pairs as similar to:
Although the new PII Transformation event stream is only available for Snowplow real-time pipeline users, this release also brings two other PII-related updates which are available for both batch and real-time users:
Let’s discuss each of the new PII-related capabilities in turn, starting with the new emitted stream.
This release adds a configurable, optional stream of events from the PII Enrichment that contains the hashed and original values.
When enabled and configured, Stream Enrich will emit into this new stream a “PII Transformation” event for each event that was pseudonymized in the PII enrichment, containing the original and hashed values.
The emitted event is a standard Snowplow enriched event as described in the Canonical event model, and as such it can be easily consumed via our analytics SDKs.
The event follows the new PII Transformation event JSON schema. An instance of that event could look like this (depending on the fields configured):
There are a few notable things about this example. The PII Enrichment was configured to pseudonymize canonical fields, including user_id, and as such the emitted event contains their original and modified values.
In addition, the enrichment was configured to pseudonymize properties from the contexts fields. As before, the event contains the original and modified values, but it also contains:
A schema property, identifying the Iglu URI for the related events
A jsonPath property, identifying the path that was substituted, since in the case of contexts there could be any number of substitutions depending on the path and schema matches
Finally, the PII Transformation event strategy, in this case the hashing algorithm version, is also given. What is not emitted is the salt that was used in the hashing (see salt below).
Let’s look at a couple of fields of particular interest. The contexts field in the new PII Transformation event contains a new context called parent_event with a new schema. Here is an example of such an event:
This context simply contains the Event ID, a UUID, of the parent event for which the PII Enrichment was applied. This can be useful for reconciling the emitted PII Transformation events back to the events which caused them to be generated.
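As a rough illustration of that reconciliation, the sketch below joins PII Transformation events back to their parent events by event ID. The dictionary layout and the key names (event_id, parent_event_id, pii) are assumptions made for this example; they are simplified stand-ins, not the actual enriched event format or the parent_event schema.

```python
from collections import defaultdict


def index_by_parent(pii_events):
    """Group PII Transformation events by the ID of the event that caused them."""
    by_parent = defaultdict(list)
    for pii_event in pii_events:
        # 'parent_event_id' is a hypothetical, flattened stand-in for the
        # event ID carried in the parent_event context.
        by_parent[pii_event["parent_event_id"]].append(pii_event)
    return by_parent


def reconcile(enriched_events, pii_events):
    """Pair each enriched event with the PII Transformation events it produced."""
    by_parent = index_by_parent(pii_events)
    return [(event, by_parent.get(event["event_id"], [])) for event in enriched_events]


# Hypothetical sample data: one event was pseudonymized, one was not.
enriched = [
    {"event_id": "a1", "user_id": "hashed-1"},
    {"event_id": "b2", "user_id": None},
]
pii = [{"event_id": "p9", "parent_event_id": "a1", "pii": [{"field": "user_id"}]}]

pairs = reconcile(enriched, pii)
```

Events that were never pseudonymized simply pair with an empty list, so the join does not assume that every enriched event has a matching PII Transformation event.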
In order to emit a stream of PII events the stream needs to exist for some configurations (e.g. Kinesis), and you will need to configure the stream in two separate places.
This is all covered in detail in the upgrading section below.
This new event stream is intended to be used by downstream processes which want to track the pseudonymization process and make it possible for Snowplow operators to recover the original PII values, if and only if the operator has the appropriate authorization under the conditions required for one of the lawful bases for processing.
We are working on a new open-source project to leverage this event stream, called Piinguin. Expect more information on this project soon.
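A downstream consumer of this kind might, for instance, keep a mapping from hashed values back to their originals and release an original only when the caller is authorized. The following is a minimal sketch of that idea only; it does not describe how Piinguin works, and the record keys (modified, original) and the authorized flag are hypothetical.

```python
class PiiLookup:
    """In-memory mapping from pseudonymized values back to original values."""

    def __init__(self):
        self._originals = {}

    def record(self, transformation):
        # 'modified' and 'original' are hypothetical key names for the hashed
        # and original value carried by a PII Transformation event.
        self._originals[transformation["modified"]] = transformation["original"]

    def recover(self, hashed_value, authorized):
        # Refuse outright unless the caller has established authorization.
        if not authorized:
            raise PermissionError("recovery of original PII requires authorization")
        return self._originals.get(hashed_value)


lookup = PiiLookup()
lookup.record({"modified": "5f4dcc3b", "original": "jane@example.com"})
recovered = lookup.recover("5f4dcc3b", authorized=True)
```

A real store would also need secure storage, access control and auditing; the boolean flag here only marks where that check would sit.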
In order to make it harder for the hashed PII data to be identified, we have responded to community feedback and added the option of a salt to the hashing pseudonymization. Many thanks to falschparker82 of JustWatch for advocating for this approach in issue #3648.
The salt is simply a string that is appended to the end of the string that is going to be hashed; this makes it a lot harder, if not impossible, for someone to “brute force” the pseudonymized data by hashing all the possible values of a field and trying to match the hash.
The new setting is simply a new field in the configuration for the enrichment - see our Upgrading section for further details. The salt should remain secret in order to ensure that protection against brute-forcing the hashed values is achieved.
Important: note that changing the salt will change the hash of the same value, which will make working with values pre- and post-salt change much more complicated.
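The effect of the salt can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the enrichment's own code: SHA-256, the function name pseudonymize and the sample salts are all chosen for the example.

```python
import hashlib


def pseudonymize(value: str, salt: str = "") -> str:
    """Hash value with the salt appended to its end, as described above."""
    # SHA-256 is an assumption for this sketch; use whichever hash function
    # your enrichment is configured with.
    return hashlib.sha256((value + salt).encode("utf-8")).hexdigest()


unsalted = pseudonymize("jane@example.com")
salted = pseudonymize("jane@example.com", salt="example-salt")
resalted = pseudonymize("jane@example.com", salt="another-salt")
```

The same value hashed with the same salt always yields the same output, so joins on hashed fields keep working. Changing the salt yields a different hash for the same value, which is why values hashed before and after a salt change no longer match each other.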
With our R100 introduction of the PII Enrichment, there was a known issue in one of the underlying libraries that we believed to be harmless; unfortunately we have since identified that it can cause problems downstream in the pipeline.
The problem can cause good events to end up in the bad bucket under certain conditions explained below.
As described in issue #3636, the bug occurs when the user has configured the PII Enrichment to hash a JSON type field with a JSON Path containing an array of fields like so:
In events that did not contain both fields, the hashed output would correctly hash the existent one, but it would also create the one that did not exist as an empty object, so the enriched event would contain an output like so:
The problem with that event is that it can fail validation downstream depending on the schema iglu:com.acme/event/jsonschema/1-0-0. For example, if the field username in the schema is only allowed to be a string, then the event will fail validation and end up in the bad bucket during shredding (not during enrichment).
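To make the failure mode concrete, here is a simplified Python model of the behaviour described above. It is not the enrichment's code or the underlying library: the email field, the placeholder hash function and the string-only check are assumptions for illustration (only username comes from the example above).

```python
import hashlib


def hash_value(value):
    # Placeholder hash; the algorithm is an assumption for this sketch.
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


def hash_fields_buggy(data, fields):
    """Models the bug: a targeted field that is absent is created as {}."""
    out = dict(data)
    for field in fields:
        out[field] = hash_value(data[field]) if field in data else {}
    return out


def hash_fields_fixed(data, fields):
    """Models the intended behaviour: absent fields are left absent."""
    out = dict(data)
    for field in fields:
        if field in data:
            out[field] = hash_value(data[field])
    return out


def strings_only(data, fields):
    """Stand-in for schema validation: every present field must be a string."""
    return all(isinstance(data[f], str) for f in fields if f in data)


event = {"email": "jane@example.com"}  # 'username' is not present
targets = ["email", "username"]

buggy = hash_fields_buggy(event, targets)
fixed = hash_fields_fixed(event, targets)
```

In the buggy model, username appears as an empty object and fails the string-only check, mirroring how a good event could end up in the bad bucket. In the fixed model the missing field stays missing and the check passes.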
Two other improvements included in this release are:
Automated code formatting further improves the code quality of the snowplow/snowplow repo and makes it easier for new contributors to meet the expected quality standards for Snowplow code.
The Kafka integration test uses the excellent Kafka Testkit to bring up a Kafka broker for Stream Enrich to interact with, thus extending test coverage and further improving the maintainability of the codebase.
R106 Acropolis is slightly unusual in being a simultaneous release for the Snowplow batch and real-time pipelines.
This upgrading section is broken down as follows:
Please make sure to read section 3 alongside either section 1 or 2.
To upgrade, update your EmrEtlRunner configuration to the following:
Now review the Full example for the new PII Enrichment configuration below.
The latest version of Stream Enrich is available from our Bintray here.
There are a few steps to using the new capabilities:
Make sure to create a dedicated Kinesis stream, Apache Kafka topic, or equivalent to hold the PII Transformation events - otherwise Stream Enrich will fail.
Do not attempt to re-use your enriched event stream, as then you will be co-mingling sensitive PII data with safely pseudonymized enriched events.
In the PII Enrichment configuration version 2-0-0 you will need to add:
The complete configuration file, including salt configuration, can be found in the Full example for the new PII Enrichment configuration below.
In the Stream Enrich configuration you will need to add a new property, pii, and set it to the stream or topic which should hold the PII Transformation events:
Most of the above configuration should be familiar for Stream Enrich users - if not, you can find more information on the Stream Enrich configuration wiki page.
Here is a full example PII enrichment configuration:
Most properties will be familiar from the R100 Epidaurus configuration, which used the 1-0-0 version of the configuration schema, per the relevant wiki page.
The new items are:
emitEvent, which configures whether an event will be emitted or not
salt, which, as explained above, sets the salt that will be used
Setting emitEvent to true is only relevant for the real-time pipeline; salt is applicable to both pipelines.
Upcoming Snowplow releases will include:
For more details on this release, please check out the release notes on GitHub.
If you have any questions or run into any problems, please visit our Discourse forum.