Under the hood
Snowplow consists of five loosely coupled subsystems: trackers, collectors, enrichment, storage and analytics.
- Trackers integrate with your application(s) and/or website(s).
- Trackers generate event data: when an event occurs, they put together a packet of data and send it to a Snowplow collector.
- We also offer a Python tracker and a Lua tracker for logging events directly from Python and Lua applications (e.g. games and Django webapps), and an Arduino tracker for sensor event analytics in the Internet of Things.
- Other server-side trackers (Java and Ruby) and mobile trackers (iOS and Android) are on the product roadmap.
- Collectors receive Snowplow event data from trackers and push it to a queue to be processed.
- Currently there are three collectors: a CloudFront collector for tracking user activity across a single domain, a Clojure collector for tracking activity across multiple domains, and a Scala Stream collector for tracking users across multiple domains in real time.
- The Clojure collector runs on AWS Elastic Beanstalk.
- The Scala Stream collector is built to work with Amazon Kinesis.
- The enrichment process takes the raw data generated by the collector, validates it, cleans it up and enriches it (e.g. infers geographical location from IP addresses, and referer data from referer URLs).
- The enrichment process is written in Scala. It can be run on top of Scalding / Cascading and Amazon EMR, as a batch-based process. It can also be run on Amazon Kinesis, to process incoming data in real time.
- Snowplow can be set up to load your event-level and customer-level data into one or more data stores, to enable analytics.
- Snowplow data is delivered into Amazon S3 (for processing by Hive / Pig / Mahout on EMR).
- In addition, Snowplow supports loading the data into Amazon Redshift and PostgreSQL for analysis in more traditional tools (e.g. R, Looker and Excel). Amazon Redshift enables Snowplow users to query petabytes of Snowplow data quickly and conveniently via its Postgres API.
- Going forwards, we plan to support more storage targets to enable a broader set of analyses, including Neo4j, Elasticsearch and Google BigQuery.
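To make the tracker, collector and enrichment flow described above more concrete, here is a minimal Python sketch using only the standard library. It is an illustration under stated assumptions, not Snowplow's actual code or wire protocol: the field names (`event`, `url`, `referer`), the `Collector` class and the `build_event` and `enrich` functions are all hypothetical, and the geographical lookup is left out because it would need an IP database.

```python
from collections import deque
from urllib.parse import parse_qs, urlencode, urlparse


def build_event(event_type, page_url, referer):
    """Tracker side (hypothetical): put together a packet of data for one event."""
    return urlencode({"event": event_type, "url": page_url, "referer": referer})


class Collector:
    """Collector side (hypothetical): receive raw payloads and queue them."""

    def __init__(self):
        self.queue = deque()

    def receive(self, payload, ip_address):
        # Store the payload untouched, plus the sender's IP for later enrichment.
        self.queue.append({"payload": payload, "ip": ip_address})


def enrich(raw):
    """Enrichment side (hypothetical): validate, clean up and enrich one record.

    Returns None when the record fails validation.
    """
    fields = {key: values[0] for key, values in parse_qs(raw["payload"]).items()}
    if "event" not in fields or "url" not in fields:
        return None
    referer = fields.get("referer", "")
    # Derive referer data from the referer URL.
    fields["referer_host"] = urlparse(referer).hostname if referer else None
    # A real pipeline would infer a location from the IP; here we only keep it.
    fields["ip"] = raw["ip"]
    return fields


collector = Collector()
collector.receive(
    build_event("page_view", "http://example.com/a", "http://www.google.com/search?q=x"),
    "203.0.113.7",
)
collector.receive("url=http://example.com/b", "203.0.113.8")  # no event type, so invalid

enriched = [e for e in (enrich(r) for r in collector.queue) if e is not None]
```

The three pieces only share the queued record format, which mirrors the loose coupling between the subsystems: any one of them could be swapped out without touching the others.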
Once your Snowplow data is available in storage, you can plug it into multiple different tools to crunch that data. Examples include:
- Explore and mine your data using Looker.
- Create dashboards and scorecards with the data using ChartIO.
- Perform OLAP analysis (i.e. slice and dice different metrics against different dimensions) using PivotTables in Excel or Tableau.
- Mine and model the data, to perform marketing, catalog or platform analytics, using R or Python.
- Develop and run machine learning algorithms, using Mahout, Python or Weka, to build recommendation engines or cluster audiences by behaviour and interest.
- View the GitHub repo to see the source code for each subsystem listed above.
- View the technical documentation to learn more about each subsystem.
- View the setup guide for step-by-step instructions on installing individual subsystems, and Snowplow as a whole.
Built on AWS
Snowplow is built on top of AWS, and makes extensive use of CloudFront, Elastic Beanstalk, Elastic MapReduce and Amazon Redshift.
We are proud to be an Amazon Web Services Technology Partner.