Snowplow 0.7.0 released, with new Clojure-based collector
Today we are hugely excited to announce the release of Snowplow version 0.7.0, which includes an experimental new Clojure-based collector designed to run on Amazon Elastic Beanstalk. This release allows you to use Snowplow to uniquely identify and track users across multiple domains - even across a whole content or advertising network.
To date, the primary collector for Snowplow events has been our CloudFront-based collector. The CloudFront-based collector has been easy to set up and very reliable, but has one main drawback: it does not support user tracking across multiple domains.
And the other good news is that our Clojure collector automatically logs the raw Snowplow events to Amazon S3 - and it logs in the exact same format as the CloudFront-based collector, so we can use the same ETL process for both collectors!
Read on below the fold for installation instructions and some additional information on this release.
You will find full instructions on setting up the new Clojure-based collector on our Wiki, Setting up the Clojure collector.
If you are using EmrEtlRunner, you need to update to the latest version, which is 0.0.7 - this is available by checking out the master branch of the Snowplow repository.
You will also need to update your configuration file, config.yml, to use the latest versions of the HiveQL scripts:

    :snowplow:
      # ...
      :hive_hiveql_version: 0.5.4
      :non_hive_hiveql_version: 0.0.5
If you are using StorageLoader, you need to update to the latest version, which is 0.0.3 - this is available by checking out the master branch of the Snowplow repository.
If you are using Infobright Community Edition, you will need to update your table definition. This is because the user_id field was not wide enough to store the new user IDs (UUIDs) set by the Clojure collector. To make this easier for you, we have created a script:
Running this script will create a new table, events_005 (version 0.0.5 of the table definition), in your snowplow database, copying across all your data from your existing events_004 table, which will not be modified in any way.
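To see why the column width matters: a UUID in its canonical textual form is 36 characters long (32 hexadecimal digits plus 4 hyphens). The short Python sketch below checks a freshly generated UUID against a column width. It assumes the collector's user IDs use that canonical form, and the widths 16 and 38 are hypothetical placeholders for illustration, not the actual widths in either table definition.

```python
import uuid


def fits(value: str, column_width: int) -> bool:
    """Return True if value can be stored in a column of the given width."""
    return len(value) <= column_width


# A randomly generated UUID in canonical 8-4-4-4-12 form
user_id = str(uuid.uuid4())

# Hypothetical column widths, for illustration only
OLD_WIDTH = 16
NEW_WIDTH = 38

print(len(user_id))              # canonical UUID strings are 36 characters
print(fits(user_id, OLD_WIDTH))  # False: the ID would not fit
print(fits(user_id, NEW_WIDTH))  # True
```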
Once you have run the script, don’t forget to update your StorageLoader’s config.yml to load into the new events_005 table, not your old events_004 table:

    :storage:
      # ...
      :table: events_005 # NOT "events_004" any more
That’s it! Your Clojure collector should be ready to run now. However, please read on for an important note about its experimental nature.
We want to stress that the new Clojure-based collector is a piece of experimental technology - we are looking to the community to try it out and give us feedback on how it is working for you, especially at scale.
In particular, we would recommend running the Clojure-based collector alongside the CloudFront collector to be confident that it is performing under load and that no events are being dropped. We have run both collectors alongside each other for the Snowplow Analytics website for four complete days, and total event counts are as follows:
Across the whole result set, the absolute percentage difference between the event counts for the CloudFront and Clojure collectors is 1.9%. Possible reasons for this discrepancy include:
- Differences in datestamps - possibly an event fell on either side of a date boundary for each collector
- Duplicate rows - the two collectors may be occasionally duplicating different rows (see issue 24)
- Browsing behavior - it may be that the user navigates away from the page before one or the other collector can register the event
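The date boundary effect in the first bullet can be illustrated with a minimal Python sketch. The timestamps are hypothetical: the same event is assumed to be logged by the two collectors one second apart, just either side of midnight.

```python
from datetime import datetime, timedelta

# Hypothetical timestamps for one event as logged by each collector
logged_by_collector_a = datetime(2013, 1, 1, 23, 59, 59)
logged_by_collector_b = logged_by_collector_a + timedelta(seconds=1)

# Grouping by date puts the same event on different days
print(logged_by_collector_a.date())  # 2013-01-01
print(logged_by_collector_b.date())  # 2013-01-02
```

An event like this shifts the per-day counts for the two collectors, and changes the totals only if one of the two dates falls outside the period being compared.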
We plan on testing all of this further with larger datasets; we also intend to explore the Clojure collector’s duplicate rows to check there are no particular issues there.
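If you run both collectors side by side yourself, a comparison along these lines may help. This is a rough Python sketch only: the per-day counts are made-up placeholders rather than the figures from our test, and using the CloudFront count as the baseline for the percentage is an assumption.

```python
def abs_pct_difference(baseline: int, other: int) -> float:
    """Absolute difference between two counts, as a percentage of baseline."""
    if baseline == 0:
        raise ValueError("baseline count must be non-zero")
    return abs(baseline - other) / baseline * 100


# Hypothetical per-day event counts, for illustration only
cloudfront_counts = {"day 1": 1000, "day 2": 1200, "day 3": 900, "day 4": 1100}
clojure_counts = {"day 1": 990, "day 2": 1215, "day 3": 880, "day 4": 1095}

# Per-day comparison
for day, cloudfront_count in cloudfront_counts.items():
    diff = abs_pct_difference(cloudfront_count, clojure_counts[day])
    print(f"{day}: {diff:.1f}%")

# Comparison across the whole result set
overall = abs_pct_difference(
    sum(cloudfront_counts.values()), sum(clojure_counts.values())
)
print(f"overall: {overall:.1f}%")
```

Note that per-day differences in opposite directions partly cancel out in the overall figure, so it is worth checking both.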
Other features in this release
There are two minor changes in this release not related to the Clojure-based collector:
Both EmrEtlRunner and StorageLoader now print “Completed successfully” to stdout on completion. This should help to make it clearer (e.g. in logs) that these Ruby programs have completed successfully.
StorageLoader has been updated so that its --skip argument works the same way as it does in EmrEtlRunner:

    Specific options:
        ...
        -s, --skip download,load,archive  skip work step(s)