If you visit the European Union homepage for GDPR, one of the first things you’ll notice is a timer (assuming you read this before enforcement begins). It is displayed down to the second, so at any given time you can check how much time you have left. Considering all of the complexities that come with compliance, problems that must be solved at the technological, procedural, and governance levels, many of us will need to use as many of the remaining days and hours as possible to prepare our organizations for this new set of data protection regulations.
The General Data Protection Regulation gives new rights to data subjects, the individual users whom the data describes (a list can be found in this post). These new rights are meant to shift control over how personal data is used into the hands of the individuals, meaning data controllers, like Snowplow users who collect personal data, are facing new obligations for how their data must be handled from end to end. Because these regulations are onerous, companies need to challenge themselves whenever they’re collecting personal data to be clear on why they’re doing it. In situations where the same impact can be achieved with data that is not personal, for instance, these obligations do not apply.
However, GDPR massively expands the scope of what constitutes personal data. To quote the Information Commissioner’s Office, “GDPR applies to ‘personal data’ meaning any information related to an identifiable person who can be directly or indirectly identified in particular by reference to an identifier. This definition provides for a wide range of personal identifiers to constitute personal data including name, identification number, location data, or online identifier, reflecting changes in technology and the way organisations collect information about people.” While this clearly categorizes IP addresses, cookie IDs, and other device identifiers like IDFVs and IDFAs as personal data, the deliberately vague nature of the definition presents a moving target for companies as more data is collected on individual users and additional legal precedent is set.
Why anonymization matters in a GDPR world
“Anonymization” is a technique for taking personal data and rendering it non-personal by making it impossible for a user with the data to identify which individual the data describes. With digital data, we can distinguish two distinct uses:
- We use data to build insight in an attempt to better understand how users are engaging with our products and websites. This insight can be used to optimize marketing spend or support the product development process (as we explored in our product analytics series).
- We use data to better understand individual users and better communicate with and engage those users to, hopefully, improve their experience.
We can use anonymized data for the first of these uses but not for the second. When we’re collecting data for the purpose of insight, it doesn’t matter who the individuals are, just what they do. This means that we can still collect and use digital data to do things like optimize marketing campaigns and support product development without collecting personally identifiable information (PII) and incurring the obligations associated with doing so.
Anonymization and Pseudonymization
When data is fully anonymized, the link between the data and the personal identifiers is completely severed: all data collected is decoupled from any sort of personal information. Pseudonymized data, conversely, retains a slight tether between the data and the PII: the data is anonymous on the surface, but when there is a legitimate need, special measures can be taken to restore the connection, allowing a company to act on it. In practice, fully anonymizing data is very difficult. Even if we anonymize a data subject’s name, knowing other pieces of information about the subject, such as date of birth or location, can allow us to narrow down the potential identity of an “anonymous” user or even pinpoint the specific individual if the data set is small enough.
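To make the re-identification risk concrete, here is a minimal Python sketch. The records, field names, and helper functions are all made up for illustration; they are not part of Snowplow or any real data set. It counts how many records share each combination of remaining attributes: any combination that matches exactly one record effectively points at a single person, even though names have been removed.

```python
from collections import Counter

# Toy, made-up records with names already removed; only
# quasi-identifiers (birth year and city) remain.
records = [
    {"birth_year": 1984, "city": "London"},
    {"birth_year": 1984, "city": "London"},
    {"birth_year": 1984, "city": "Leeds"},
    {"birth_year": 1991, "city": "London"},
    {"birth_year": 1991, "city": "London"},
    {"birth_year": 1991, "city": "London"},
]


def group_sizes(rows, keys):
    """Count how many rows share each combination of the given fields."""
    return Counter(tuple(row[k] for k in keys) for row in rows)


def unique_combinations(rows, keys):
    """Return the combinations that match exactly one row."""
    return [combo for combo, n in group_sizes(rows, keys).items() if n == 1]


# Birth year alone never isolates a single record in this toy set...
by_year = unique_combinations(records, ["birth_year"])

# ...but birth year plus city singles out one "anonymous" individual.
by_year_city = unique_combinations(records, ["birth_year", "city"])
```

In this toy set every birth year is shared by several records, but adding the city isolates one of them. The smaller the data set and the more attributes that are retained, the more often this happens.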
Though pseudonymized data seems to present the opportunity to retain as much user data as possible while circumventing the obligations GDPR imposes, the regulation explicitly states that pseudonymized data can fall within its scope, depending on how difficult it is to attribute the pseudonym to a particular individual. Pseudonymized data remains very valuable from a GDPR perspective, however. If, for example, a company collects digital data to support product development but does not engage in targeted marketing, pseudonymizing that data means a marketer cannot accidentally use it to target an individual user. In light of what companies must do in order to demonstrate compliance, making sure only authorized activities are carried out is a powerful control.
GDPR tools and the Snowplow pipeline
Because of the high value of anonymization as a tool for data controllers to help them comply with GDPR, we recently released R100 Epidaurus, which builds this functionality into our data processing pipeline. Snowplow is a data collection platform, used by companies to collect data across multiple platforms and channels. On each platform, we expect there to be at least one user-level or device-level identifier to enable analysts to stitch together user journeys on those channels, and then join them together to create a single-customer view, an essential element in many types of analysis.
Because data controllers can pseudonymize all of those identifiers while data consumers can still understand an individual’s complete user journey across platforms and channels (simply removing the fact of who went through the journey), Snowplow users are better able to use data to power insight and better respect the rights of data subjects whose journey that data describes.
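The idea of pseudonymizing identifiers while keeping journeys joinable can be sketched in a few lines of Python. This is an illustration only, not the Snowplow implementation: the key, the event structure, and the function names are hypothetical, and a keyed hash (HMAC-SHA256 from the Python standard library) is just one possible technique. Because the same input always maps to the same pseudonym, events from different platforms still group by user, but the raw identifier no longer appears in the data.

```python
import hashlib
import hmac

# Hypothetical secret key; in practice it would be stored securely
# and kept away from the people analysing the data.
SECRET_KEY = b"replace-with-a-securely-stored-key"


def pseudonymize(identifier: str, key: bytes = SECRET_KEY) -> str:
    """Return a deterministic keyed hash (hex string) of an identifier."""
    return hmac.new(key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()


def pseudonymize_events(events, field="user_id"):
    """Return copies of the events with the given field pseudonymized."""
    return [{**event, field: pseudonymize(event[field])} for event in events]


# Made-up events from two platforms, keyed on the same raw identifier.
web_events = [
    {"user_id": "alice@example.com", "event": "page_view"},
    {"user_id": "bob@example.com", "event": "page_view"},
]
mobile_events = [
    {"user_id": "alice@example.com", "event": "screen_view"},
]

# Stitch the journey together using only the pseudonyms.
journey = {}
for event in pseudonymize_events(web_events) + pseudonymize_events(mobile_events):
    journey.setdefault(event["user_id"], []).append(event["event"])
```

Here `journey` groups the web and mobile events of the same user under a single pseudonym. Anyone holding the key can recompute a pseudonym from a known identifier, which is why data treated this way is pseudonymized rather than anonymized and can still fall within the scope of GDPR.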