We are pleased to announce the release of our R2F (Right to be Forgotten) Spark job.
This is a stand-alone Spark job that removes rows from your Snowplow enriched events archive in Amazon S3, based on specific PII identifiers. It lets Snowplow users remove data about a specific user when that data subject has requested it in exercising their “right to be forgotten” under Article 17 of the GDPR.
For those deploying Snowplow, the R2F job falls under the new category of “housekeeping” jobs: background tasks meant to optimize or, as in this case, clean up data.
1. The GDPR's Right to be Forgotten
Even before GDPR, many users of information services were concerned that their actions and behavior would be recorded and remain available to the data controller long after the consent or interaction that justified recording and processing the data had ceased.
With the GDPR, the EU created a clear regulation describing the obligations of data controllers, with specific attention paid to a data subject’s right to be forgotten.
While the regulation has special provisions for freedom of expression, in general, maintaining and processing personally identifiable information is now conditional and time-limited, thus addressing a significant policy gap between individual rights and commercial interests.
To help Snowplow users along the path of responsible data processing, we have included a number of features to help pseudonymize data in the main Snowplow pipeline under releases R100 and R106. Alongside these efforts, we also released Piinguin in order to better manage re-identification, should that be necessary.
Under GDPR’s Article 17 - Right to erasure (‘right to be forgotten’), the data subject can request erasure, and the data controller is usually obliged to act (see also this example from the EU Commission).
As the operator of a Snowplow pipeline, you will want to remove data from a Snowplow event archive in Amazon S3 following an R2F request in a reliable and timely (“without undue delay”) fashion. To address this need, we created the Right to be Forgotten Spark job; this complements our existing tutorials for removing R2F data from Redshift and Snowflake.
2. Running the R2F Spark job
Running the R2F Spark job requires a “removal criteria” file in order to match the events to be erased.
Each row of the file is a single self-describing JSON datum that conforms to the JSON Schema here. As can be seen from the schema, it expects a single criterion of either
Take special care that the value uniquely identifies a single individual. It may not (an IP address, for example, can be shared by several people), in which case more events than intended would be erased.
As a safeguard against that, provide the Spark job with an argument specifying the maximum proportion of rows from the archive that you expect to be matched in that execution (e.g. 0.01 for 1%). The job will fail if that proportion is exceeded.
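The safeguard can be pictured as a simple ratio check. The sketch below is not the job's actual code: the function name `check_matching_proportion` and the exception type are hypothetical, and it assumes the proportion is computed as matched rows divided by total input rows.

```python
class MatchingProportionExceeded(Exception):
    """Raised when more rows match than the caller allowed (hypothetical type)."""


def check_matching_proportion(matched_rows: int,
                              total_rows: int,
                              maximum_proportion: float) -> float:
    """Return the matched proportion, or raise if it exceeds the maximum.

    Hypothetical helper that only illustrates the idea of the safeguard.
    """
    if total_rows <= 0:
        raise ValueError("total_rows must be positive")
    proportion = matched_rows / total_rows
    # Fail only when the limit is exceeded; hitting it exactly is allowed.
    if proportion > maximum_proportion:
        raise MatchingProportionExceeded(
            f"{proportion:.4f} of rows matched, "
            f"which exceeds the allowed {maximum_proportion:.4f}"
        )
    return proportion
```

With a limit of 0.01, 5 matches out of 1,000 rows would pass, while 20 matches out of 1,000 would abort the run before anything is written.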
Here is an example of running the R2F job against Elastic MapReduce:
The R2F arguments are:
--removal-criteria (in this example s3://snowplow-data-<mycompany>/config/to_be_forgotten.json): the URL of the removal criteria file, which contains the criteria for which events will be removed from the archive
--input-directory (in this example s3://snowplow-data-<mycompany>/enriched/archive/): the directory that contains the Snowplow event archive
--non-matching-output-directory (in this example s3://snowplow-data-<mycompany>/r2f-test/non-matching/runid=<yyyy-mm-dd-HH-MM-SS>): the directory to write out all events that do not match the criteria
--matching-output-directory (in this example s3://snowplow-data-<mycompany>/r2f-test/matching/runid=<yyyy-mm-dd-HH-MM-SS>): the directory that contains the matching output (optional)
--maximum-matching-proportion (in this example 0.01): the maximum proportion of the input events that are allowed to match. If the actual proportion is higher, the job will fail
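When the job is launched from a script, it can help to assemble these arguments in one place and validate them before submission. The following is a minimal sketch under stated assumptions: the helper name `build_r2f_args` is hypothetical, and it assumes each flag and its value are passed as separate tokens. It does not cover how the step is submitted to EMR.

```python
from typing import List, Optional


def build_r2f_args(removal_criteria: str,
                   input_directory: str,
                   non_matching_output_directory: str,
                   maximum_matching_proportion: float,
                   matching_output_directory: Optional[str] = None) -> List[str]:
    """Assemble the R2F job arguments listed above (hypothetical helper)."""
    # A proportion only makes sense in the interval (0, 1].
    if not 0.0 < maximum_matching_proportion <= 1.0:
        raise ValueError("maximum_matching_proportion must be in (0, 1]")
    args = [
        "--removal-criteria", removal_criteria,
        "--input-directory", input_directory,
        "--non-matching-output-directory", non_matching_output_directory,
        "--maximum-matching-proportion", str(maximum_matching_proportion),
    ]
    # The matching output directory is optional, so add it only when given.
    if matching_output_directory is not None:
        args += ["--matching-output-directory", matching_output_directory]
    return args


if __name__ == "__main__":
    # Placeholder locations taken from the example values above.
    print(build_r2f_args(
        "s3://snowplow-data-<mycompany>/config/to_be_forgotten.json",
        "s3://snowplow-data-<mycompany>/enriched/archive/",
        "s3://snowplow-data-<mycompany>/r2f-test/non-matching/runid=<yyyy-mm-dd-HH-MM-SS>",
        0.01,
    ))
```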
Note: when writing out the filtered output, the R2F Spark job does not preserve the directory structure found within the enriched archive, specifically the
3. Further considerations
Overzealous deletion of data
As described in the section on running the job, the --maximum-matching-proportion argument is a safeguard in case the value you have provided as a removal criterion corresponds to many events across many users.
This is a very coarse filter that will only catch the worst cases of excessive deletion; we have yet to identify a generic enough solution to reliably catch all cases where the user has mistakenly selected an overly-wide removal criterion. However, we continue to explore alternative safeguards - and of course new ideas are always welcome, so please submit a new issue on GitHub if you have one.
Until other measures are implemented in the R2F Spark job, it is sensible to have some other measures in place to catch issues downstream, for instance a weekly or monthly sanity check in the target database.
Of course, in order to recover from such an issue, you need a backup of the data, which is hard to reconcile with the requirement to erase all such data. One solution could be to keep the old archive in another bucket or prefix (in the case of S3), with its objects expiring automatically after a set period through an object lifecycle policy and/or versioning.
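As one possible shape for such an expiring backup, the sketch below builds an S3 lifecycle configuration that deletes objects under a given prefix after a fixed number of days. The function name, the prefix and the 30-day retention in the usage example are all hypothetical placeholders; the dictionary follows the structure accepted as `LifecycleConfiguration` by boto3's `put_bucket_lifecycle_configuration`, but the code only builds the rule and does not call AWS.

```python
def expiring_backup_rule(prefix: str, days: int) -> dict:
    """Build a lifecycle configuration that expires objects under `prefix`.

    Hypothetical helper: prefix and retention period are up to the operator.
    """
    if days <= 0:
        raise ValueError("days must be a positive number of days")
    return {
        "Rules": [
            {
                "ID": f"expire-r2f-backup-after-{days}-days",
                # Apply the rule only to the backup copy of the old archive.
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                # Objects are removed this many days after creation.
                "Expiration": {"Days": days},
            }
        ]
    }


if __name__ == "__main__":
    # Placeholder prefix and retention period, for illustration only.
    print(expiring_backup_rule("r2f-backup/enriched/archive/", 30))
```

The retention period should be chosen so that the backup itself does not outlive the deadline for honoring the erasure request.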
Whichever solution to this problem you choose, we would like to hear about your experiences on Discourse.
Size of removal criteria file
It is assumed that the file is small enough to fit in memory in each executor. Back-of-the-envelope calculations show that this is a reasonable assumption; this approach simplifies the code and also makes execution very fast.
If you find that your removal criteria file size breaks the Spark job, please submit a new issue on GitHub or, even better, a PR.
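To illustrate why a small criteria file keeps things simple: when the values to erase fit in memory, each worker can hold them in a set and test every row with a constant-time lookup. The sketch below shows that idea in plain Python rather than Spark. The row shape (dictionaries with a `user_id` field) and the function name `partition_rows` are assumptions for illustration, not the job's actual data model.

```python
from typing import Dict, Iterable, List, Set, Tuple

Row = Dict[str, str]


def partition_rows(rows: Iterable[Row],
                   field: str,
                   values_to_erase: Set[str]) -> Tuple[List[Row], List[Row]]:
    """Split rows into (matching, non_matching) using an in-memory set.

    Hypothetical helper: a row matches when its `field` value is in the set.
    """
    matching: List[Row] = []
    non_matching: List[Row] = []
    for row in rows:
        # Set membership is O(1) on average, so cost grows with the number
        # of rows, not with the number of removal criteria.
        if row.get(field) in values_to_erase:
            matching.append(row)
        else:
            non_matching.append(row)
    return matching, non_matching


if __name__ == "__main__":
    events = [
        {"event_id": "1", "user_id": "alice"},
        {"event_id": "2", "user_id": "bob"},
        {"event_id": "3", "user_id": "alice"},
    ]
    erased, kept = partition_rows(events, "user_id", {"alice"})
    print(len(erased), len(kept))
```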
If you have any questions or run into any problems, please raise a question in our Discourse forum.