Under the hood
Snowplow consists of five loosely-coupled subsystems.
- Trackers generate event data.
- We also offer a Lua tracker for logging events directly from Lua applications (e.g. games), and an Arduino tracker for sensor event analytics for the Internet of Things.
- Server-side trackers (Java, Ruby and Python) and mobile trackers (iOS and Android) are on the product roadmap.
- Collectors receive Snowplow event data from trackers and log it to S3.
- Currently we have a CloudFront collector for tracking user activity across a single domain, and a Clojure collector for tracking activity across multiple domains. The Clojure collector runs on Amazon Elastic Beanstalk.
- The enrichment process takes the raw logs generated by the collector, cleans them up, validates them and enriches them (e.g. inferring geographical location from IP addresses, and referer data from referer URLs).
- The enrichment process is written on top of Scalding, a Scala API for Cascading, which is a framework for building robust data pipelines on Hadoop. The enrichment process runs on Amazon EMR.
- Snowplow can be set up to load your event-level and customer-level data into one or more data stores, to enable analytics.
- Snowplow data is delivered into Amazon S3 (for processing by Hive / Pig / Mahout on EMR).
- In addition, Snowplow supports loading the data into Amazon Redshift and PostgreSQL for analysis in more traditional tools (e.g. R, Tableau and Excel). Amazon Redshift enables Snowplow users to query petabytes of Snowplow data quickly and conveniently via its Postgres API.
- Going forward, we plan to support more storage targets, including Neo4j, to enable a broader set of analyses.
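To make the validate-then-enrich step described above more concrete, here is a minimal sketch in Python. It is not Snowplow's actual enrichment code (which is written in Scalding). The raw line format (a URL-encoded query string), the field names `event`, `timestamp` and `referer`, and the helper names are all assumptions made for illustration. Geographical lookup is left out because it needs an external IP database.

```python
from urllib.parse import parse_qs, urlsplit

# Hypothetical required fields; the real Snowplow event model is not shown here.
REQUIRED_FIELDS = ("event", "timestamp")


def parse_raw_line(line):
    """Parse a raw collector line, modelled here as a URL-encoded query string."""
    pairs = parse_qs(line, keep_blank_values=True)
    # Keep only the first value for each key.
    return {key: values[0] for key, values in pairs.items()}


def validate(event):
    """Return the list of required fields that are missing or empty."""
    return [field for field in REQUIRED_FIELDS if not event.get(field)]


def enrich(event):
    """Return a copy of the event with the referer host derived from the referer URL."""
    enriched = dict(event)
    referer = event.get("referer", "")
    enriched["referer_host"] = urlsplit(referer).hostname if referer else None
    return enriched


def process(lines):
    """Split raw lines into enriched good events and rejected bad lines."""
    good, bad = [], []
    for line in lines:
        event = parse_raw_line(line)
        missing = validate(event)
        if missing:
            bad.append((line, missing))
        else:
            good.append(enrich(event))
    return good, bad
```

Calling `process` on a list of raw lines returns the enriched events that passed validation, plus the rejected lines together with the fields they were missing. The good events are the ones that would go on to be loaded into storage.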
Once your Snowplow data is available in storage, you can plug it into many different tools to crunch that data. Examples include:
- Create dashboards and scorecards with the data using ChartIO.
- Perform OLAP analysis (i.e. slice and dice different metrics against different dimensions) using PivotTables in Excel or Tableau.
- Mine and model the data, to perform marketing, catalog or platform analytics, using R.
- Develop and run machine learning algorithms, using Mahout, Python or Weka, to build recommendation engines or to cluster audiences by behaviour and interest.
- View the GitHub repo to see the source code for each subsystem listed above.
- View the technical documentation to learn more about each subsystem.
- View the setup guide for step-by-step instructions on installing individual subsystems, and Snowplow as a whole.
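The OLAP-style analysis mentioned among the examples above comes down to aggregating a metric across combinations of dimensions. The sketch below shows the idea in plain Python, counting events by two dimensions. The rows and column names (`country`, `page`, `user`) are hypothetical and do not reflect the actual Snowplow event table.

```python
from collections import defaultdict

# Hypothetical event rows; column names are illustrative only.
events = [
    {"country": "GB", "page": "/home", "user": "u1"},
    {"country": "GB", "page": "/home", "user": "u2"},
    {"country": "GB", "page": "/pricing", "user": "u1"},
    {"country": "US", "page": "/home", "user": "u3"},
]


def pivot_count(rows, row_dim, col_dim):
    """Count rows for every (row_dim, col_dim) combination, like a simple pivot table."""
    table = defaultdict(lambda: defaultdict(int))
    for row in rows:
        table[row[row_dim]][row[col_dim]] += 1
    # Convert nested defaultdicts to plain dicts for easier inspection.
    return {key: dict(cols) for key, cols in table.items()}


page_views_by_country = pivot_count(events, "country", "page")
```

Swapping the dimension arguments (for example `pivot_count(events, "user", "page")`) slices the same data a different way, which is what a pivot table in Excel or Tableau does interactively.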
Built on AWS
Snowplow is built on top of AWS, and makes extensive use of CloudFront, Elastic Beanstalk, Elastic MapReduce and Amazon Redshift.
We are proud to be an Amazon Web Services Technology Partner.