How to approach identity stitching with Snowplow data


What is identity stitching and why it matters

Identity stitching, also known as user stitching, is the process of tying together user identifiers to build a more complete view of user behavior. Companies might have different reasons for stitching user identifiers together.

This complete view, known also as a 360-degree customer view or a single customer view, can then allow you to personalize the experience users have with your brand or product.

Typically, user stitching involves three key ingredients:

  1. The user identifiers themselves.
  2. The logic for mapping those identifiers onto each other.
  3. The application of that mapping to a specific use case.

The final ingredient can be highly specific to a business, depending on its industry, use cases, tracking strategy and platforms, among many other factors.

In this post, we’ll take a look at the first two ingredients, and how you would approach these two steps using Snowplow events. We’ll pay special attention to the most common questions that come up in the process, and a few potential answers.

The difference between batch and real-time user stitching, and how to select the right approach

On the surface, your identity stitching implementation may look different depending on whether you have a “batch” workflow, where you process big batches of data at a time, or an “in-stream” one, which is more suitable for real-time applications. However, the difference is mostly in the logistics: which tools and technologies you use, how you connect them to each other and how often your process runs. The underlying logic will largely be the same, built on top of the ingredients mentioned above.

Usually, a real-time approach would suit any use case, though it might not be the most cost-effective option in all situations. On the other hand, a batch approach cannot be applied in use cases where timing is essential, such as real-time recommendation engines. If you are already building an in-stream solution for a time-sensitive use case, it might make sense to scale it so it covers other use cases as well, even if you do not need low latency and provided that it is economical.

User stitching with Snowplow: which properties of a Snowplow event are user identifiers and what are the main differences between them?

Let’s take a look at the user identifiers first.

Out of the box, with Snowplow’s standard web and mobile tracking, you’ll get the following properties:

  1. domain_userid – a first-party cookie ID set client-side by the JavaScript tracker.
  2. network_userid – a cookie ID set server-side by the Snowplow collector.
  3. user_id – a custom identifier set by your own tracking code, typically when a user logs in.
  4. Device identifiers on mobile, such as android_idfa and its Apple equivalents.

There are other properties in a Snowplow event that can be used to various degrees to identify users, such as refr_domain_userid, user_fingerprint and user_ipaddress. In addition, you might also have other custom identifiers alongside user_id, such as email_address, client_id, etc. The above are just the most popular ones that you can expect to see (under different names) in many tracking implementations, regardless of whether they are based on Snowplow or not.

How to map your user identifiers together

Let’s take the example of a company that has a website and an Android mobile app, and wants to identify users across both of these platforms, and across different devices, both when they are logged in and not. In this section we’ll take a look at how that company might approach the problem. What questions does it need to answer? What decisions does it have to make?

We’ll limit the example to the identifiers mentioned above. On web, network_userid is set as a first-party cookie, making it a reliable long-term identifier. (If you are using network_userid in a third-party context, you can swap in domain_userid but keep in mind that client-side cookies’ lifetime has recently been dramatically cut, most notably in Safari.) 

The android_idfa property uniquely identifies a user’s Android mobile device; for iOS, the Apple identifiers play the same role. Let’s also assume that the user_id property is being tracked on both web and mobile, and that its value is the email address used for logging in.

How do these properties map onto each other? It helps to consider what types of relationship (one-to-one or one-to-many) can exist between the different properties. The sections below walk through the most common combinations.

All queries below assume that data is loaded in Redshift. This implies a batch workflow. However, most of the approaches can also be implemented in a real-time setup. Sticking to AWS as the cloud provider, the mapping tables described below can be DynamoDB tables, which will allow very low latency in obtaining the correct mapping.

For simplicity, some values are represented as integers where you would normally find a UUID in real-world Snowplow data.

Multiple network_userids per user_id

What does the data say?

SELECT
  user_id,
  COUNT(DISTINCT network_userid) AS network_userid_count
FROM atomic.events
WHERE user_id IS NOT NULL
GROUP BY 1
ORDER BY 2;

This query will often reveal that many user_id values have more than one network_userid associated with them.

What does it mean?

This suggests one of three things:

  1. The person associated with a specific user_id is visiting the website from different devices (for example a home and work computer).
  2. The person associated with a specific user_id is visiting the website in different browsers (for example Chrome and Safari), potentially on multiple devices.
  3. The person associated with a specific user_id is always visiting the website from the same device and browser, but the cookies in that browser have been reset, forcing a new network_userid to be assigned.

How can you use this information? 

If your goal is to identify this user across all their devices / browsers, when they are not logged in, then you can create a mapping table, in which each of the network_userid values is associated with the same user_id. Then, whenever you see any one of these network_userids, you’d know it’s the user with that specific user_id.

network_userid | user_id
CA9C57DE-EEC1-4A47-9B75-3074C8E778B8 | user@email.com
97D44551-58AF-4B8A-8D2D-CA46F728AD41 | user@email.com
25CA22BC-B3D8-4736-833B-809F3CDEC688 | user@email.com
Users table

event_id | network_userid | user_id
1 | CA9C57DE-EEC1-4A47-9B75-3074C8E778B8 | null
2 | 97D44551-58AF-4B8A-8D2D-CA46F728AD41 | null
3 | 25CA22BC-B3D8-4736-833B-809F3CDEC688 | null
Events table
SELECT
  a.event_id,
  b.user_id
FROM events AS a
JOIN users AS b
  ON a.network_userid = b.network_userid
event_id | user_id
1 | user@email.com
2 | user@email.com
3 | user@email.com

You may have many events where you only have the network_userid for a user but not their user_id, because those events happened before the user was logged in. But the opposite should not be true. Events generated by a logged-in user should always have both fields populated. For this reason, the reverse lookup (user_id to network_userid) is unnecessary for identity stitching.
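The same lookup logic can also be sketched outside the warehouse. Below is a minimal Python sketch, not part of Snowplow itself; the identifiers are shortened for readability and a plain dict stands in for the mapping table:

```python
# Build a network_userid -> user_id mapping from logged-in events,
# then use it to attribute logged-out events to the same users.
logged_in_events = [
    {"event_id": 1, "network_userid": "CA9C57DE", "user_id": "user@email.com"},
    {"event_id": 2, "network_userid": "97D44551", "user_id": "user@email.com"},
]
anonymous_events = [
    {"event_id": 3, "network_userid": "CA9C57DE", "user_id": None},
    {"event_id": 4, "network_userid": "97D44551", "user_id": None},
]

# Mapping table: each network_userid points at the user_id it was seen with.
users = {e["network_userid"]: e["user_id"] for e in logged_in_events}

# Stitch: fill in user_id on events where only network_userid is known.
stitched = [
    {"event_id": e["event_id"], "user_id": users.get(e["network_userid"])}
    for e in anonymous_events
]
print(stitched)
```

This mirrors the SQL join above: the dict lookup plays the role of the `ON a.network_userid = b.network_userid` condition.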

Multiple user_ids per network_userid

What does the data say?

SELECT
  network_userid,
  COUNT(DISTINCT user_id) AS user_id_count
FROM atomic.events
GROUP BY 1
ORDER BY 2;

This query might reveal cases where more than one user_id is associated with a specific network_userid.

What does it mean?

The possible reasons include:

  1. A user can have two accounts, under different email addresses.
  2. More than one user is visiting the website from the same device and browser.

How can you use this information?

To establish if these are different users or the same user with multiple accounts, you might need to join the event data to other information that you have about your registered users, such as full name or postal address.

user_id | first_name | last_name | address
steve@email.com | Steve | Harris | 22, Acacia Avenue
s.harris@email.com | Steve | Harris | 22, Acacia Avenue
User_registry table

network_userid | user_id
CA9C57DE-EEC1-4A47-9B75-3074C8E778B8 | steve@email.com
CA9C57DE-EEC1-4A47-9B75-3074C8E778B8 | s.harris@email.com
Users table

event_id | network_userid | user_id
1 | CA9C57DE-EEC1-4A47-9B75-3074C8E778B8 | null
2 | CA9C57DE-EEC1-4A47-9B75-3074C8E778B8 | null
3 | CA9C57DE-EEC1-4A47-9B75-3074C8E778B8 | null
Events table
SELECT
  a.event_id,
  c.first_name || ' ' || c.last_name || ' of ' || c.address AS user
FROM events AS a
JOIN users AS b
  ON a.network_userid = b.network_userid
JOIN user_registry AS c
  ON b.user_id = c.user_id
GROUP BY 1, 2
event_id | user
1 | Steve Harris of 22, Acacia Avenue
2 | Steve Harris of 22, Acacia Avenue
3 | Steve Harris of 22, Acacia Avenue

But what if they are different users, who maybe live in the same household and use the same device?

user_id | first_name | last_name | address
steve@email.com | Steve | Harris | 22, Acacia Avenue
nicko@email.com | Nicko | McBrain | 22, Acacia Avenue
User_registry table

network_userid | user_id
CA9C57DE-EEC1-4A47-9B75-3074C8E778B8 | steve@email.com
CA9C57DE-EEC1-4A47-9B75-3074C8E778B8 | nicko@email.com
Users table

You probably won’t be able to determine with certainty which one is visiting the website at any particular moment. You can deal with this in several different ways:

1. Ignore such users.

If they are a small proportion of your overall user base, this might be the simplest and most efficient solution. It’s also the only 100% accurate one, since every user who is not excluded can have their identity stitched without ambiguity. The downside is that you will be leaving out some portion of your user base.
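If you take this route, excluding the ambiguous cookies is straightforward. A minimal Python sketch of the filtering logic (shortened identifiers, in-memory rows standing in for the warehouse tables):

```python
# Keep only network_userids that have been seen with exactly one user_id;
# ambiguous cookies (several accounts behind one cookie) are dropped.
rows = [
    ("CA9C57DE", "steve@email.com"),
    ("CA9C57DE", "nicko@email.com"),  # same cookie, a second account
    ("97D44551", "user@email.com"),
]

seen = {}
for network_userid, user_id in rows:
    seen.setdefault(network_userid, set()).add(user_id)

unambiguous = {nid: ids.pop() for nid, ids in seen.items() if len(ids) == 1}
print(unambiguous)  # {'97D44551': 'user@email.com'}
```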

2. Associate the network_userid with the latest seen user_id.

network_userid | user_id | tstamp
CA9C57DE-EEC1-4A47-9B75-3074C8E778B8 | nicko@email.com | 2020-10-22 14:27:00
CA9C57DE-EEC1-4A47-9B75-3074C8E778B8 | steve@email.com | 2020-10-21 11:45:33
Users table

event_id | network_userid | user_id
1 | CA9C57DE-EEC1-4A47-9B75-3074C8E778B8 | null
2 | CA9C57DE-EEC1-4A47-9B75-3074C8E778B8 | null
3 | CA9C57DE-EEC1-4A47-9B75-3074C8E778B8 | null
Events table
WITH users AS (
  SELECT DISTINCT
    network_userid,
    LAST_VALUE(user_id IGNORE NULLS) OVER
    (
      PARTITION BY network_userid
      ORDER BY tstamp
      ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
    ) AS user_id
  FROM users
)

SELECT
  a.event_id,
  b.user_id
FROM events AS a
JOIN users AS b
  ON a.network_userid = b.network_userid
event_id | user_id
1 | nicko@email.com
2 | nicko@email.com
3 | nicko@email.com

Of course, in this case you are making an assumption about the real identity of the user, so your data will not be 100% accurate.

3. Associate the network_userid with the most frequently appearing user_id.

Consider whether you want to use the user_id with most sessions or the one with most events. The example below uses sessions.

network_userid | user_id | session_id
CA9C57DE-EEC1-4A47-9B75-3074C8E778B8 | steve@email.com | 1
CA9C57DE-EEC1-4A47-9B75-3074C8E778B8 | nicko@email.com | 2
CA9C57DE-EEC1-4A47-9B75-3074C8E778B8 | steve@email.com | 3
Users table

event_id | network_userid | user_id
1 | CA9C57DE-EEC1-4A47-9B75-3074C8E778B8 | null
2 | CA9C57DE-EEC1-4A47-9B75-3074C8E778B8 | null
3 | CA9C57DE-EEC1-4A47-9B75-3074C8E778B8 | null
Events table
WITH prep_users AS (
  SELECT
    network_userid,
    user_id,
    COUNT(DISTINCT session_id) AS sessions
  FROM users
  GROUP BY 1, 2
),

users AS (
  SELECT DISTINCT
    network_userid,
    FIRST_VALUE(user_id) OVER
    (
      PARTITION BY network_userid
      ORDER BY sessions DESC
      ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
    ) AS user_id
  FROM prep_users
)

SELECT
  a.event_id,
  b.user_id
FROM events AS a
JOIN users AS b
  ON a.network_userid = b.network_userid
event_id | user_id
1 | steve@email.com
2 | steve@email.com
3 | steve@email.com

Similar to the second option above, the results in the final derived table will not be 100% accurate, so some events might be attributed to the wrong user_id.
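The same "most sessions wins" logic can be sketched in Python for illustration, with in-memory rows standing in for the warehouse tables:

```python
# For each network_userid, count distinct sessions per user_id and keep
# the user_id with the most sessions (ties resolved by first encounter).
sessions = [
    ("CA9C57DE", "steve@email.com", 1),
    ("CA9C57DE", "nicko@email.com", 2),
    ("CA9C57DE", "steve@email.com", 3),
]

session_sets = {}
for network_userid, user_id, session_id in sessions:
    session_sets.setdefault((network_userid, user_id), set()).add(session_id)

best = {}  # network_userid -> (user_id, session_count)
for (network_userid, user_id), ids in session_sets.items():
    current = best.get(network_userid)
    if current is None or len(ids) > current[1]:
        best[network_userid] = (user_id, len(ids))

mapping = {nid: uid for nid, (uid, _) in best.items()}
print(mapping)  # {'CA9C57DE': 'steve@email.com'}
```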

Multiple android_idfa values per user_id and multiple user_id values per android_idfa

Sticking to our example above, on mobile devices the android_idfa property plays the same role as network_userid does on web: it identifies a device. The same considerations apply in these scenarios as well, and the query examples can easily be adapted.

As a final note, you may want to add a lookback window to the WHERE clause of some of the queries above, to ensure you’re only using the latest data and not your entire historical records dating back months or even years.

Identity stitching as a real-time workflow

All the examples so far have been about batch processing of tables in a data warehouse. With streaming data, you would normally deal with each event as it comes in. The same principles as above will apply, but the “logistics” will be a little bit different. 

The most notable difference is that in the warehouse, creating a mapping table and then using it in conjunction with the atomic tables (and other tables in the warehouse) can be separate steps. They can even happen at different frequencies; for example, you might update the mapping tables daily via a scheduled process but run queries that use them throughout the day. In-stream, as you handle each event, you will need to accomplish two things at once: look up your DynamoDB mapping table to identify the user in the current event *and* update the mapping table with information from the current event.
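A minimal Python sketch of that two-in-one step, with a plain dict standing in for the DynamoDB mapping table (the function name and event shape here are illustrative, not part of Snowplow):

```python
# In-memory stand-in for a low-latency key-value store such as DynamoDB.
mapping_table = {}

def stitch_event(event):
    """Look up the current mapping for the event's network_userid and,
    if the event carries a user_id, update the mapping in the same step."""
    network_userid = event.get("network_userid")
    user_id = event.get("user_id")
    if user_id is not None:
        mapping_table[network_userid] = user_id  # latest-seen wins
    # Attribute the event to whatever user_id we currently know about.
    event["stitched_user_id"] = mapping_table.get(network_userid)
    return event

stream = [
    {"event_id": 1, "network_userid": "CA9C57DE", "user_id": None},
    {"event_id": 2, "network_userid": "CA9C57DE", "user_id": "steve@email.com"},
    {"event_id": 3, "network_userid": "CA9C57DE", "user_id": None},
]
results = [stitch_event(e) for e in stream]
print([e["stitched_user_id"] for e in results])
```

Note that the first event cannot be stitched, because at that point no user_id has been seen for the cookie yet; every event after the login is attributed correctly.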

Take ownership of your data

There is no right or wrong approach to user identity stitching. The choice between a batch and a real-time approach depends on your use case: what you want to achieve through identity stitching and how it will benefit your product, user experience or business. Since Snowplow gives you full ownership and control of both the real-time data stream and the data in the warehouse, you can use Snowplow for batch and real-time workflows alike.

However, regardless of which approach you take, identity stitching involves using personal data, which means businesses should always make sure they comply with data privacy regulations. At Snowplow, our aim is to give our users more control over how personal data is handled, so they can better manage how it is used at every stage.

To find out more about how to gain a complete customer view with Snowplow, you can get in touch with us here.

Learn more about our unique approach to data delivery with a Snowplow demo.

