Snowplow consists of five loosely coupled subsystems.
Snowplow data is generated by trackers and passed to a collector.
iOS and Android trackers are on the product roadmap.
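To make the tracker-to-collector handoff concrete, here is a minimal sketch (in Python, purely for illustration) of a tracker encoding a page view as querystring parameters and firing it at a collector over HTTP GET. The collector URL, endpoint path and parameter names are assumptions, not the exact Snowplow tracker protocol.

```python
# Minimal sketch of a tracker: encode an event as querystring parameters and
# fire it at the collector as an HTTP GET. The endpoint path and parameter
# names below are illustrative, not the exact Snowplow tracker protocol.
import urllib.parse
import urllib.request

COLLECTOR_URL = "http://collector.example.com/i"  # hypothetical collector endpoint

def track_page_view(page_url, page_title, user_id):
    params = {
        "e": "pv",           # event type: page view (illustrative name)
        "url": page_url,     # URL of the page being viewed
        "page": page_title,  # title of the page
        "uid": user_id,      # identifier for the user
        "p": "web",          # platform the event came from
    }
    urllib.request.urlopen(COLLECTOR_URL + "?" + urllib.parse.urlencode(params))

track_page_view("http://www.example.com/", "Homepage", "user-123")
```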
Collectors receive Snowplow event data from trackers and log it to S3.
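As an illustration of the collector's job, the sketch below accepts tracker requests and appends one log line per event; in the production pipeline those logs end up in S3 rather than in a local file. The file name and log layout are assumptions.

```python
# Minimal, illustrative collector: accept tracker GETs and append one log
# line per event (timestamp, client IP, raw request path). In the production
# pipeline these logs are delivered to S3 rather than kept on local disk.
import datetime
import http.server

class CollectorHandler(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        with open("events.log", "a") as log:
            log.write(f"{datetime.datetime.utcnow().isoformat()}\t"
                      f"{self.client_address[0]}\t{self.path}\n")
        # A real collector would return a 1x1 transparent pixel here
        self.send_response(200)
        self.end_headers()

if __name__ == "__main__":
    http.server.HTTPServer(("", 8080), CollectorHandler).serve_forever()
```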
Once raw data has been logged to S3, an ETL step processes that data, cleaning it (e.g. extracting data from querystrings) and enriching it (e.g. inferring user locations from IP addresses).
Our ETL step currently uses Apache Hive on EMR to process the raw logs via a custom SerDe. We are partway through developing a more robust subsystem using Scalding / Cascading.
The ETL step finishes by loading the data into one or more data storage options.
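The snippet below is a toy sketch of the cleaning and enriching described above, assuming collector logs shaped like the tab-separated lines in the earlier collector sketch. The output field names and the geo lookup are illustrative, and the real ETL runs on EMR (Hive today, Scalding / Cascading in future) rather than in plain Python.

```python
# Toy sketch of the ETL step: clean a raw log line by unpacking its
# querystring, then enrich it by inferring a location from the IP address.
# Field names and the geo lookup are illustrative, not Snowplow's real schema.
from urllib.parse import parse_qs, urlparse

def lookup_geo(ip_address):
    """Hypothetical IP-to-location lookup, e.g. backed by a GeoIP database."""
    return {"country": "unknown", "city": "unknown"}

def enrich(raw_line):
    timestamp, ip, path = raw_line.rstrip("\n").split("\t")
    qs = parse_qs(urlparse(path).query)   # cleaning: extract querystring fields
    geo = lookup_geo(ip)                  # enriching: infer location from IP
    return {
        "collector_tstamp": timestamp,
        "event": qs.get("e", [None])[0],
        "page_url": qs.get("url", [None])[0],
        "user_id": qs.get("uid", [None])[0],
        "geo_country": geo["country"],
        "geo_city": geo["city"],
    }

print(enrich("2013-01-01T12:00:00\t1.2.3.4\t/i?e=pv&url=http%3A%2F%2Fexample.com&uid=user-123"))
```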
Snowplow can be set up to load your event-level and customer-level data into one or more data stores, to enable analytics.
Currently we support loading Snowplow data into S3 (for processing with Hive / Pig / Hadoop / Mahout on EMR), and into Redshift and Infobright Community Edition for more traditional analysis (e.g. using BI tools like ChartIO or sophisticated analytics tools like R).
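As one example of this loading step, the hedged sketch below issues a Redshift COPY from an S3 path. The connection details, bucket path and table name (atomic.events) are assumptions made for illustration, not a definitive loading procedure.

```python
# Hedged sketch: load enriched event files from S3 into Redshift with a COPY
# statement. Connection details, S3 path and table name are assumptions.
import psycopg2  # assumes the psycopg2 driver is installed

conn = psycopg2.connect(host="redshift.example.com", port=5439,
                        dbname="snowplow", user="loader", password="secret")

copy_sql = """
    COPY atomic.events
    FROM 's3://my-snowplow-bucket/enriched/run-2013-01-01/'
    CREDENTIALS 'aws_access_key_id=<key>;aws_secret_access_key=<secret>'
    DELIMITER '\\t';
"""

with conn, conn.cursor() as cur:
    cur.execute(copy_sql)  # Redshift pulls the files directly from S3
```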
Once your Snowplow data is available in storage, you can plug it into a range of analytics tools to mine that data.
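For example, any SQL-speaking tool (or a few lines of scripting) can mine the events once they are in a store like Redshift or Infobright. The table and column names below are assumptions about the schema, kept consistent with the earlier sketches.

```python
# Illustrative analytics query: page views per day from the event store.
# Table and column names are assumptions, not Snowplow's actual schema.
import psycopg2

conn = psycopg2.connect(host="redshift.example.com", port=5439,
                        dbname="snowplow", user="analyst", password="secret")

with conn, conn.cursor() as cur:
    cur.execute("""
        SELECT collector_tstamp::date AS day, COUNT(*) AS page_views
        FROM atomic.events
        WHERE event = 'pv'  -- matches the illustrative event code used above
        GROUP BY 1
        ORDER BY 1;
    """)
    for day, page_views in cur.fetchall():
        print(day, page_views)
```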