Building first and last touch attribution models in Redshift SQL
In order to calculate the return on marketing spend on individual campaigns, digital marketers need to connect revenue events, downstream in a user journey, with marketing touch events, upstream in a user journey. This connection is necessary so that the cost of those associated with the marketing campaign that drove those marketing touches can be connected to profit associated with the conversion events later on.
Different attribution models involve applying different logic to connecting those marketing touch events with subsequent revenue events. In this blog post, we will document the analytic approach and the corresponding SQL statements to perform the most basic types of attribution: first and last touch. Our example SQL will focus on the web use case. The same undelying analytic approach can, however, be applied to any channel.
Note that all the SQL given below is Redshift compatible. It can be easier to do attribution analysis with other dialects of SQL that support complicated data types (arrays and objects) in particular.
Identifying the different marketing touches
First we need to identify all our marketing touch events. We can generate a table with all of these as follows:
The above table includes a line of data per marketing touch, ordered by user (as identified via the first party cookie id
domain_userid) and time (as identified by the
Identifying the different revenue events
Now lets create a table with all our different revenue events. What these look like will depend on your own particular event schema - for this example we’ll assume that revenue events are standard Snowplow transaction events. It should be straightforward to modify / update the below SQL with your own set of revenue events.
Connecting marketing touches with revenue events: first touch attribution model
Now that we have our marketing touches and our revenue events, we need to join them together.
How we do the join is firstly a question of business logic: what type of attribution model do we want to apply? The simplest model is a first touch model - this credits all the value associated with the revenue event to the first marketing touch for each user.
To do this, we create a new
derived.first_marketing_touch table, that records only the first marketing touch for each user. This is a subset of the marketing touches recorded in the
Now it is trivial to join our
derived.first_marketing_touch table with our
Bingo! We have a table with a line of data for each revenue event, and all the marketing data associated with the corresponding first touch event for that user.
Note that the above is especially straightforward because:
- It is easy to identify the first marketing touch for each user. (It’s simply the one wiht the earlierst timestamp.)
- By the time we do the join we have a table with a maximum of one marketing touch event per user ID, so we do not have to worry about generating a cartesian product when performing the join.
Because neither of the above is true when we are applying a last click attribution model, the SQL gets a bit more complicated.
Connecting marketing touches with revenue events: last touch attribution model
In a last touch attribution model, we want to credit all the value associated with each revenue event to the most recent marketing touch that occurred prior to that event.
There are a number of ways to do this in SQL - I think the following is the most straightforward, but welcome any suggestions at alternatives that are clearer / more performant.
First, we need to identify for each revenue event what is the corresponding marketing touch event that we wish to connect. To do this, we first union our marketing touches and revenue event tables into a single table that contains both the marketing touches and the revenue events. For performance reasons, we only include a subset of the columns in our marketing touches table.
The above table includes a line for every marketing touch event and every revenue events. Marketing touch events have a
marketing_event_id set and revenue events have a
We need to aggregate over this table so that we set the
marketing_event_id for each revenue event. This will be the event_id of the most recent marketing touch event prior to the revenue event. We’ll then be able to use the event ID to join back with our marketing touches table, to pull all the metadata associated with that marketing touch to the revenue event.
To do that, we use a window function to identify the most recent marketing touch event prior to the revenue event:
The window function is doing a lot of work for us, so it is worth explaining what’s going on before we use the above query to generate our final result set:
- First, it partitions our marketing touch and revenue events by user ID
- Then it orders the event stream by time
- Then for each event, it fetches the most recent not null
marketing_event_idvalue. Note that this will be applied to every row in the table i.e. marketing touch events and transaction events. Where it is applied to marketing events, the most recent marketing event ID will be the marketing event ID for the current event. That doens’t matter (we’re going to filter these events out of the event stream in the next step). The important thing is that for revenue events, it will correctly fetch the most recent marketing event ID. (Because the marketing event ID for the current row will be null, so will be ignored.)
Now we apply the above window function to generate our final result set:
The above query will generate a result set with one line of data per revenue event, and each line including the marketing data associated with the last channel that each user engaged with prior before the revenue event occurred.