Snowplow 0.7.4 released for better eventstream analytics
Another week, another release! We’re excited to announce Snowplow version 0.7.4. The primary purpose of this release is to clean up and rationalise our event data model, in particular around user IDs and event timestamps. This release should lay the foundations for more sophisticated eventstream analytics (such as funnel analysis), by:
- Enabling companies to assign custom user IDs (e.g. when a customer logs on)
- Distinguishing between IDs set at a domain level (via first-party cookies) and at a network level (via third-party cookies)
- Enabling precise ordering of events in a user’s click stream, accurate to the millisecond
Read on below the fold to find out more!
Historically, Snowplow has supported a single `user_id` field. Unfortunately, there were three issues with this:
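- Snowplow was overloading the field with two different meanings - if a user was running the CloudFront collector, the `user_id` was set in a first-party cookie; if a user was running the Clojure collector, it was set by the collector in a third-party cookie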
- Both meanings of `user_id` were web-specific - neither made sense for user tracking in a mobile app or any other platform which does not support cookies
- No support for a custom user ID - Snowplow did not allow you to track a custom `user_id` specific to your business, such as your users’ account numbers in your ecommerce package
In this release, we aim to solve these issues by separating out user IDs into three separate fields:
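|`user_id`||A custom user ID which you can set. Will be supported by all trackers (except the Pixel tracker)|
|`domain_userid`||A user ID stored in a first-party cookie; specific to a single domain|
|`network_userid`||A user ID set by the Clojure collector in a third-party cookie; shared across a network of different domains|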
Please note that you must call `setUserId()` on every page where you know the user ID - in other words, the setting does not survive a pageload.
Whether or not each type of user ID is available for your analysis depends on the combination of your tracker and collector:
|Tracker||Collector||->||`user_id`||`domain_userid`||`network_userid`|
* Assuming you have added a call to `setUserId()` - which isn’t possible in the Pixel tracker.
Previously our data model included two fields, `dt` and `tm`, to track the date and time at which each event occurred. This timestamp was based on when the Snowplow event collector received the event, not when the tracker sent the event.
There are a couple of limitations to using a collector-based timestamp for eventstream analysis:
- If two events occur almost simultaneously in the client, there is no guarantee which will be received by the collector first (because of the unpredictability of the HTTP connection)
- If a tracker batches events and then sends them together (e.g. a cellphone out of cell coverage), then all of the events in that batch will end up with the same collector timestamp, despite occurring at different times
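To make the batching limitation concrete, here is a minimal Python sketch. It is not Snowplow code: the `Event` class and the `flush_batch` helper are hypothetical, and they simply model a collector that stamps every event in one batch with the moment the batch arrived.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Event:
    name: str
    device_ms: int          # when the event happened on the device (ms since epoch)
    collector_ms: int = 0   # when the collector received it (ms since epoch)


def flush_batch(buffered: List[Event], received_at_ms: int) -> List[Event]:
    """Hypothetical collector: every event in one batch gets the same receipt time."""
    return [Event(e.name, e.device_ms, received_at_ms) for e in buffered]


# Three events recorded minutes apart while the phone was out of coverage.
buffered = [
    Event("page_view", 1_000_000),
    Event("add_to_basket", 1_120_000),
    Event("checkout", 1_300_000),
]
received = flush_batch(buffered, received_at_ms=2_000_000)

# One distinct collector time, three distinct device times.
distinct_collector_times = {e.collector_ms for e in received}
distinct_device_times = {e.device_ms for e in received}
```

With only the collector time to go on, the three events are indistinguishable in time, even though the device recorded them minutes apart.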
For this reason, in this release we are introducing a tracker-based timestamp, which is set by the tracker when the event occurs, and is stored in our data model alongside the collector timestamp. This means that we now have five timestamp fields:
|`collector_dt`||string||Date when the collector received the event|
|`collector_tm`||string||Time when the collector received the event|
|`dvce_dt`||string||Date on the client device when the event occurred|
|`dvce_tm`||string||Time on the client device when the event occurred|
|`dvce_epoch`||bigint||Milliseconds since the epoch (1/1/1970) on the client device when the tracker sent the event|
Note that we include a super-precise `dvce_epoch` field because our `dvce_tm` field is not accurate to milliseconds; when querying within a given user session, simply order by `dvce_epoch` to get the user’s eventstream accurately ordered to the millisecond.
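As a rough illustration of why the millisecond field matters, the Python sketch below sorts a handful of made-up rows for one session. Only the `dvce_tm` and `dvce_epoch` field names come from this post; the row layout, event names and values are assumptions for the example.

```python
from operator import itemgetter

# Hypothetical rows for a single user session (values invented for illustration).
events = [
    {"event": "add_to_basket", "dvce_tm": "10:15:02", "dvce_epoch": 1357035302870},
    {"event": "page_view",     "dvce_tm": "10:15:02", "dvce_epoch": 1357035302120},
    {"event": "checkout",      "dvce_tm": "10:15:03", "dvce_epoch": 1357035303010},
]

# dvce_tm only resolves to the second, so the first two rows tie and keep
# whatever order they arrived in (Python's sort is stable).
by_tm = [e["event"] for e in sorted(events, key=itemgetter("dvce_tm"))]

# dvce_epoch resolves to the millisecond, so the tie is broken correctly.
by_epoch = [e["event"] for e in sorted(events, key=itemgetter("dvce_epoch"))]
```

Here `by_tm` leaves `add_to_basket` ahead of `page_view`, while `by_epoch` recovers the order in which the device recorded them.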
A word of warning: tracker timestamps are great for understanding the correct order of, and elapsed time between, events from a specific user session. However, they are not a safe way of understanding when a given event actually occurred, because you cannot trust the clocks on users’ devices. So, stick to the collector timestamp if you need to understand when in the real world events occurred across multiple users.
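The following Python sketch shows the reasoning behind this warning under one simplifying assumption: that a device’s clock is wrong by a constant offset for the duration of a session. The `elapsed_ms` helper and all values are hypothetical.

```python
def elapsed_ms(dvce_epochs):
    """Gaps between consecutive events, in milliseconds."""
    return [later - earlier for earlier, later in zip(dvce_epochs, dvce_epochs[1:])]


# When the events really happened (ms since epoch), invented for illustration.
true_times = [1357035302120, 1357035302870, 1357035303010]

# Assumption: the device clock runs a constant three hours fast.
skew_ms = 3 * 60 * 60 * 1000
reported = [t + skew_ms for t in true_times]

gaps_true = elapsed_ms(true_times)
gaps_reported = elapsed_ms(reported)
```

The gaps between events are identical in both lists, so ordering and elapsed time within the session survive the bad clock. The absolute values in `reported` are three hours out, which is why they should not be compared across users.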
domain_userid. Many thanks to Angus Mark at Simply Business for alerting us to this.
Previously, the site/app ID as set by `setSiteId()` was used as an input into naming the first-party cookie which stores the `domain_userid`. This had the unfortunate side effect that, if you used multiple site IDs for different parts of your site, your visitors would end up with different `domain_userid`s for the different parts of your site.
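The side effect can be modelled with a short Python sketch. The tracker’s actual cookie-naming scheme is not described in this post, so `old_cookie_name`, the `uid.` prefix and the hashing are all assumptions; the only point carried over is that the site ID fed into the cookie name.

```python
import hashlib
import uuid


def old_cookie_name(site_id: str, domain: str) -> str:
    """Hypothetical naming scheme in which the site ID is part of the hashed input."""
    digest = hashlib.sha1(f"{site_id}|{domain}".encode("utf-8")).hexdigest()
    return f"uid.{digest}"


def domain_userid(cookie_jar: dict, cookie_name: str) -> str:
    """Return the ID stored under cookie_name, creating one on the first visit."""
    if cookie_name not in cookie_jar:
        cookie_jar[cookie_name] = uuid.uuid4().hex
    return cookie_jar[cookie_name]


# One visitor, one domain, two site IDs for different parts of the site.
jar = {}
shop_id = domain_userid(jar, old_cookie_name("shop", "example.com"))
blog_id = domain_userid(jar, old_cookie_name("blog", "example.com"))
```

Because the two site IDs produce two cookie names, the same browser ends up holding two cookies and therefore two different IDs.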
This release fixes this problem - and it does so in a way that should not corrupt or reset any of your existing `domain_userid`s. Going forwards, you can set different parts of your site to different app IDs without “fragmenting” your `domain_userid`s.
|Type of change||Component||Change||Comment|
|Data change||S3 & Infobright storage|| ||Now called |
The first change is because we are no longer overloading the `user_id` field.
The final change is to rename the `visit_id` field to `domain_sessionidx`. The field’s contents are unchanged, but we have updated the name to reflect that:
- The field holds the current count (aka index) of visits by this user, not a random ID
- Going forwards we will be tracking different types of sessions (mobile, desktop etc), not just website visits
- `domain_sessionidx` makes the limited scope of this field clearer
Because we are making some significant changes to the event data model, such as “unpacking” the overloaded `user_id` field, this upgrade is relatively complex. Please read this upgrade guide in full before starting your upgrade.
The upgrade process has multiple steps - we will discuss each step in turn, and then suggest a way of scheduling this upgrade to prevent any data corruption.
Don’t forget to update your Snowplow tags as per the updates in Deprecations above.
4.2 Clojure collector
If you are using the CloudFront collector, you can skip this step.
If you are using the Clojure collector, you will need to upgrade it to the latest version, 0.3.0. You can find the new version packaged as a complete WAR file on our Hosted assets page. If you have forgotten how to deploy the Clojure-based collector, you will find full instructions on our Wiki, Setting up the Clojure collector (you can skip most of the setup steps).
If you are using EmrEtlRunner, you need to update your configuration file, `config.yml`, to use the latest versions of the Hive serde and HiveQL scripts:
    :snowplow:
      :serde_version: 0.5.5
      :hive_hiveql_version: 0.5.6
      :non_hive_hiveql_version: 0.0.7
If you are using Infobright Community Edition for analysis, you will need to update your table definition. To make this easier for you, we have created two scripts:
Choose the appropriate script depending on which collector you are using: “cf” means the CloudFront collector, “clj” the Clojure collector.
Running this script will create a new table, `events_007` (version 0.0.7 of the Infobright table definition), in your `snowplow` database, copying across all your data from your existing `events_006` table, which will not be modified in any way.
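The copy-into-a-new-table pattern can be sketched in Python, with SQLite standing in for Infobright. This is not the migration script itself: the `migrate` function, the four-column schema, and the mapping of the old `user_id` to a domain-level ID for “cf” and a network-level ID for “clj” are assumptions made for illustration.

```python
import sqlite3


def migrate(conn: sqlite3.Connection, collector: str) -> None:
    """Copy events_006 into a new events_007 table, leaving events_006 untouched.

    collector is "cf" (CloudFront) or "clj" (Clojure), mirroring the two scripts.
    """
    if collector not in ("cf", "clj"):
        raise ValueError("collector must be 'cf' or 'clj'")
    conn.execute(
        "CREATE TABLE events_007 (event TEXT, user_id TEXT, "
        "domain_userid TEXT, network_userid TEXT)"
    )
    # Assumed mapping: which new field receives the old user_id values.
    target = "domain_userid" if collector == "cf" else "network_userid"
    conn.execute(
        f"INSERT INTO events_007 (event, {target}) "
        "SELECT event, user_id FROM events_006"
    )
    conn.commit()


# Toy stand-in for the existing table (schema invented for the example).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events_006 (event TEXT, user_id TEXT)")
conn.executemany(
    "INSERT INTO events_006 VALUES (?, ?)",
    [("page_view", "abc123"), ("checkout", "abc123")],
)
migrate(conn, "clj")

rows_007 = conn.execute(
    "SELECT event, user_id, domain_userid, network_userid "
    "FROM events_007 ORDER BY rowid"
).fetchall()
rows_006_count = conn.execute("SELECT COUNT(*) FROM events_006").fetchone()[0]
```

In this sketch the new `user_id` column starts out empty, since no custom user IDs existed before the upgrade, and the old table keeps all of its rows so you can fall back to it if needed.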
Once you have run this, don’t forget to update your StorageLoader’s `config.yml` to load into the new `events_007` table, not your old `events_006` table:
    :storage:
      :type: infobright
      :database: snowplow
      :table: events_007 # NOT "events_006" any more
4.5 Scheduling the upgrade
This upgrade has to be carefully scheduled because we are changing the meaning of the `user_id` field and moving its previous contents into the new `domain_userid` and `network_userid` fields.
Our suggested approach is as follows:
- (If you are using the Clojure collector) Get the Clojure collector version 0.3.0 ready in Elastic Beanstalk as per section 4.2 above, but do not deploy it live yet
- Start a manual run of the EmrEtlRunner for your site…
- Wait for the EmrEtlRunner operation to complete
- If you are using Infobright, run the StorageLoader and wait for it to finish
- Now upgrade the ETL as per section 4.3 above
- Now upgrade Infobright (if you are using it) as per section 4.4 above
This upgrade approach should prevent any user ID data from ending up in the wrong fields in your Snowplow event store.