Under the hood

Snowplow consists of five loosely-coupled subsystems.

[Snowplow architecture diagram]

1. Trackers

  • Trackers integrate with your application(s) and/or website(s).
  • Trackers generate event data: when an event occurs, they put together a packet of data and send it to a Snowplow collector.
  • Currently we have a JavaScript tracker for tracking user interactions on websites and web apps, and a No-JS (also called ‘pixel’) tracker for tracking user behavior in web environments that do not support JavaScript, e.g. emails.
  • We also offer a Python tracker and a Lua tracker for logging events directly from Python and Lua applications, e.g. games and Django web apps, and an Arduino tracker for sensor event analytics for the Internet of Things.
  • Other server-side trackers (Java and Ruby) and mobile trackers (iOS and Android) are on the product roadmap.

The Snowplow Tracker Protocol provides a standard way for any tracker to feed data into Snowplow. It is documented here.
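
To make this concrete, here is a minimal sketch (not one of the official trackers) of a Tracker Protocol request: the event is assembled as name/value pairs and sent to a collector endpoint as a GET request. The collector hostname is a placeholder and the parameter set is trimmed down for illustration.

```python
# Minimal sketch of a tracker request (not one of the official trackers):
# assemble a Snowplow Tracker Protocol payload and send it to a collector
# as a GET request. Parameter names (e, p, tv, aid, url, eid, dtm) follow
# the Tracker Protocol; the collector hostname is a placeholder.
import time
import uuid
from urllib.parse import urlencode
from urllib.request import urlopen

COLLECTOR = "http://collector.example.com/i"  # hypothetical collector endpoint

def track_page_view(page_url, app_id="my-web-app"):
    payload = {
        "e": "pv",                            # event type: page view
        "p": "web",                           # platform
        "tv": "example-tracker-0.1",          # tracker name and version
        "aid": app_id,                        # application ID
        "url": page_url,                      # page URL being tracked
        "eid": str(uuid.uuid4()),             # unique event ID
        "dtm": str(int(time.time() * 1000)),  # device timestamp in milliseconds
    }
    request_url = COLLECTOR + "?" + urlencode(payload)
    urlopen(request_url)  # a real tracker would also batch, retry, etc.

track_page_view("http://www.example.com/checkout")
```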

2. Collectors

  • Collectors receive the event data sent by trackers (typically as HTTP requests) and log the raw events for downstream processing.
  • Raw events can be logged to Amazon S3 for batch processing, or streamed (e.g. via Amazon Kinesis) for processing in real time.

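As a toy illustration of the collector's role (and not Snowplow's actual collector code), the sketch below answers tracker requests on /i with a transparent pixel and appends the raw querystring to a local log file, which is roughly the raw input the enrichment step consumes.

```python
# Toy collector sketch (not Snowplow's actual collectors): serve a 1x1
# transparent GIF on /i and append each raw event (the querystring sent
# by a tracker) to a local log file for later enrichment.
import base64
from datetime import datetime, timezone
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import urlparse

# Smallest transparent GIF, so pixel-style trackers receive a valid image.
PIXEL = base64.b64decode("R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7")

class CollectorHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        parsed = urlparse(self.path)
        if parsed.path != "/i":
            self.send_response(404)
            self.end_headers()
            return
        # Log the raw event: timestamp, client IP and the tracker payload.
        with open("raw_events.log", "a") as log:
            log.write("{}\t{}\t{}\n".format(
                datetime.now(timezone.utc).isoformat(),
                self.client_address[0],
                parsed.query,
            ))
        self.send_response(200)
        self.send_header("Content-Type", "image/gif")
        self.end_headers()
        self.wfile.write(PIXEL)

if __name__ == "__main__":
    HTTPServer(("", 8080), CollectorHandler).serve_forever()
```
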
3. Enrichment

  • The enrichment process takes the raw data generated by the collector, validates it, cleans it up and enriches it (e.g. inferring geographical location from IP addresses, and referer data from referer URLs).
  • The enrichment process is written in Scala. It can be run on top of Scalding / Cascading and Amazon EMR, as a batch-based process. It can also be run on Amazon Kinesis, to process incoming data in real time.
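
The sketch below is a toy, Python-flavoured illustration of this step, not the Scala codebase: it validates a raw tracker payload and derives the referring host from the referer URL.

```python
# Toy illustration of the enrichment step (the real implementation is in
# Scala): validate a raw tracker payload and derive extra fields from it.
from urllib.parse import parse_qsl, urlparse

def enrich(raw_querystring):
    event = dict(parse_qsl(raw_querystring))

    # Validation: reject events missing required Tracker Protocol fields.
    for field in ("e", "p", "tv"):
        if field not in event:
            raise ValueError("missing required field: " + field)

    # Enrichment: derive the referring host from the raw referer URL.
    if "refr" in event:
        event["refr_host"] = urlparse(event["refr"]).netloc

    # A real pipeline would also geo-locate the client IP address, parse
    # the user agent, attribute marketing campaigns, and so on.
    return event

print(enrich("e=pv&p=web&tv=example-tracker-0.1&refr=http%3A%2F%2Fwww.google.com%2Fsearch%3Fq%3Dsnowplow"))
```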

4. Storage

  • Snowplow can be set up to load your event-level and customer-level data into one or more data stores, to enable analytics.
  • Snowplow data is delivered into Amazon S3 (for processing by Hive / Pig / Mahout on EMR).
  • In addition, Snowplow supports loading the data into Amazon Redshift and PostgreSQL for analysis in more traditional tools (e.g. R, Looker and Excel). Amazon Redshift enables Snowplow users to query petabytes of Snowplow data quickly and conveniently via its Postgres API.
  • Going forward, we plan to support more storage targets to enable a broader set of analyses, including Neo4j, Elasticsearch and Google BigQuery.

Snowplow data is stored in each storage option above as close to the Snowplow Canonical Event Model as possible. The data model is described here.
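
Because the event-level data lands in Redshift / PostgreSQL in a largely canonical shape, it can be queried with plain SQL. The snippet below is a sketch assuming an atomic.events table and placeholder connection details; column names follow the canonical event model but should be checked against your own deployment.

```python
# Querying Snowplow event-level data in PostgreSQL / Amazon Redshift.
# Connection details are placeholders and column names follow the
# canonical event model; check both against your own deployment.
import psycopg2

conn = psycopg2.connect(
    host="redshift.example.com", port=5439,
    dbname="snowplow", user="analyst", password="secret",
)

with conn.cursor() as cur:
    cur.execute("""
        SELECT DATE(collector_tstamp) AS day, COUNT(*) AS page_views
        FROM atomic.events
        WHERE event = 'page_view'
        GROUP BY 1
        ORDER BY 1
    """)
    for day, page_views in cur.fetchall():
        print(day, page_views)
```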

5. Analytics

Once your Snowplow data is available in storage, you can plug it into many different tools for analysis. Examples include:

  • Exploring and mining your data using Looker.
  • Creating dashboards and scorecards with the data using ChartIO.
  • Performing OLAP analysis (i.e. slicing and dicing different metrics against different dimensions) using PivotTables in Excel or Tableau.
  • Mining and modelling the data, to perform marketing, catalog or platform analytics, using R or Python (a minimal pandas sketch follows this list).
  • Developing and running machine learning algorithms, using Mahout, Python or Weka, to build recommendation engines or cluster audiences by behaviour and interest.
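
As a small example of mining the data with Python, the sketch below uses pandas to compute daily unique visitors from an export of event-level data; the CSV path and column names are assumptions for the example.

```python
# Small example of mining Snowplow data with pandas: daily unique visitors
# from an export of event-level data. The CSV path and column names
# (canonical event model fields) are assumptions for the example.
import pandas as pd

events = pd.read_csv("snowplow_events.csv", parse_dates=["collector_tstamp"])

page_views = events[events["event"] == "page_view"]
daily_uniques = (
    page_views
    .groupby(page_views["collector_tstamp"].dt.date)["domain_userid"]
    .nunique()
)
print(daily_uniques)
```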

Learn more

  • View the GitHub repo to see the source code for each subsystem listed above.
  • View the technical documentation to learn more about each subsystem.
  • View the setup guide for step-by-step instructions on installing individual subsystems, and Snowplow as a whole.

Built on AWS

Snowplow is built on top of AWS, and makes extensive use of CloudFront, Elastic Beanstalk, Elastic MapReduce and Amazon Redshift.

We are proud to be an Amazon Web Services Technology Partner.