Introducing Snowplow Mini


We’ve built Snowplow for robustness, scalability and flexibility — but not for ease of use or ease of setup. Nor has the Snowplow Batch Pipeline been built for speed: you might have to wait several hours from sending an event before you can view and analyze that event data in Redshift.


There are occasions when you might want to work with Snowplow in an easier, faster way. Two common examples are:

  1. New users who want to understand what Snowplow does and how it works. For these users, is it really necessary to set up a distributed, auto-scaling collector cluster, Hadoop job and Redshift cluster? And for users who want to experiment with the Real-Time Pipeline, is it really necessary to set up three different Kinesis Client Library applications and three different Kinesis streams, just to get started?
  2. Existing users who are extending their tracking and want to test it prior to pushing updates to production. Many of these users are running our batch pipeline: is there a way they can get faster (near-instant) feedback on whether their tracker updates are working properly?

Today we’re delighted to announce Snowplow Mini to meet these two use cases: the complete Snowplow Real-Time Pipeline on a single AMI / EC2 instance. Download and set up in minutes…

  1. Overview
  2. Under the hood
  3. Software stack
  4. Roadmap
  5. Getting help

1. Overview

Snowplow Mini is the complete Snowplow Real-Time Pipeline running on a single instance, available for easy install as a pre-built AMI. Set it up in minutes by following the quickstart guide.

Once deployed, you fetch the public IP for your Snowplow Mini instance from the EC2 console. You can then:

1.1 Log into Snowplow Mini
1.2 Record events
1.3 Explore your data in Elasticsearch and Kibana
1.4 Debug bad data in Elasticsearch and Kibana

1.1 Log into Snowplow Mini

Once Snowplow Mini is up and running, you should be able to fetch the IP address of the instance it is running on from the EC2 console:

Get the IP address for your Snowplow Mini instance from the EC2 console

Navigate to that IP address in the browser. There you’ll find Snowplow Mini:

Snowplow Mini homescreen

1.2 Record events

Send in some events! You can do this directly from the Snowplow Mini UI by selecting the Example Events tab and clicking the different buttons. Each button click will be recorded as an event:

Send sample events

More usefully, you can send events using any of our Snowplow trackers. Simply configure the tracker to use the Snowplow Mini collector endpoint at <<your-snowplow-mini-public-ip>>:8080.
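To make the endpoint concrete, here is a minimal Python sketch that builds a tracker-protocol GET request against the Mini collector’s pixel endpoint. The IP address and tracker-version label are placeholders, and in practice you would configure one of the real trackers rather than hand-rolling requests:

```python
from urllib.parse import urlencode

MINI_HOST = "203.0.113.10"  # placeholder: your Snowplow Mini public IP


def page_view_url(page_url: str, page_title: str) -> str:
    """Build a pixel-request URL for a page-view event (e=pv)."""
    params = {
        "e": "pv",           # event type: page view
        "url": page_url,     # URL of the page viewed
        "page": page_title,  # page title
        "p": "web",          # platform
        "tv": "manual-0.1",  # hypothetical tracker-version label
    }
    return f"http://{MINI_HOST}:8080/i?" + urlencode(params)


print(page_view_url("http://example.com/", "Example page"))
```

Requesting that URL (for example with curl) should register a page-view event in the Snowplow Mini UI.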

1.3 Explore your data in Elasticsearch and Kibana

Sending in events is great, but now we want to look at the data.

The simplest way to get started is to view the data in Kibana. This will require a quick initial setup.

Navigate in the browser to http://<<your-snowplow-mini-public-ip>>:5601. Kibana will invite you to set up an index pattern. Let’s first set up an index for ‘good’ data (i.e. data that is successfully processed) by entering the following values:

Kibana setup good index

Hit the create button. Now we have our good index setup:

Kibana good index

Now let’s create a second index for our bad data. Click the Add New button in the top right of the screen and then enter the following values to configure the index for bad data:

Kibana setup bad index

Now let’s look at our data. Hit the “Discover” menu item:

Viewing good event data in Kibana

We can build graphs in the Visualize section and assemble them together in the Dashboards section. You can also use other tools for visualizing the data in Elasticsearch: Elasticsearch can be queried directly on http://<<your-snowplow-mini-public-ip>>:9200.
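As a sketch of querying Elasticsearch directly, the snippet below builds a search request for the most recent good events. The `good` index name and the `collector_tstamp` sort field are assumptions based on the default Mini setup described above:

```python
import json

MINI_HOST = "203.0.113.10"  # placeholder: your Snowplow Mini public IP
SEARCH_URL = f"http://{MINI_HOST}:9200/good/_search"  # 'good' index name is an assumption

# Query body: the ten most recently collected good events
query = {
    "size": 10,
    "sort": [{"collector_tstamp": {"order": "desc"}}],  # enriched-event timestamp field
    "query": {"match_all": {}},
}

body = json.dumps(query)
# POST `body` to SEARCH_URL with Content-Type: application/json,
# e.g.:  curl -XPOST "$SEARCH_URL" -d "$body"
print(SEARCH_URL)
print(body)
```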

1.4 Debug bad data in Elasticsearch and Kibana

One of the primary uses of Snowplow Mini is to enable Snowplow users to debug updates to their tracker instrumentation in real-time, significantly shortening the feedback loop on tracker deployments.

If you have defined your own event and entity (context) schemas, you will need to push these schemas to the Iglu repository that is bundled with Snowplow Mini. There is a simple script you can run to copy those schemas from your local machine to Snowplow Mini: instructions can be found here.
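For reference, the schemas you push are self-describing JSON Schemas. A minimal, hypothetical example is shown below — the vendor (`com.example`) and event name (`button_press`) are made up for illustration:

```json
{
  "$schema": "http://iglucentral.com/schemas/com.snowplowanalytics.self-desc/schema/jsonschema/1-0-0#",
  "description": "Schema for a hypothetical button-press event",
  "self": {
    "vendor": "com.example",
    "name": "button_press",
    "format": "jsonschema",
    "version": "1-0-0"
  },
  "type": "object",
  "properties": {
    "buttonId": {
      "type": "string"
    }
  },
  "required": ["buttonId"],
  "additionalProperties": false
}
```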

Once you’ve done that, you can start sending data into Snowplow Mini to see if it is processed successfully. Each event you send should either land in the good index or bad index. To switch from one to the other in Kibana, select the cog icon in the top right of the screen and then select the index you want to view from the dropdown:

kibana switch from viewing good to bad data

In the example below you can see that one bad event has landed. It is straightforward to drill in and identify the issue with processing the event (it has an invalid type of nonsense):

kibana example bad data

More information on debugging your data in Elasticsearch / Kibana can be found here.
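The same direct-query approach works for debugging. Below is a sketch that asks Elasticsearch for recent bad rows; the `bad` index name and the `errors` field are assumptions about the default bad-row format:

```python
import json

MINI_HOST = "203.0.113.10"  # placeholder: your Snowplow Mini public IP
SEARCH_URL = f"http://{MINI_HOST}:9200/bad/_search"  # 'bad' index name is an assumption

# Query body: the five most recent rows that carry validation errors
query = {
    "size": 5,
    "query": {"exists": {"field": "errors"}},  # bad rows carry an 'errors' array (assumption)
}

body = json.dumps(query)
# POST `body` to SEARCH_URL, then inspect each hit's `errors` field for the
# validation messages explaining why the event failed enrichment.
print(body)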




2. Under the hood

The pipeline running on Snowplow Mini is essentially the Snowplow Real-Time Pipeline:

  1. The collector receives events and writes them to a raw events stream or a bad stream (if the event is oversized)
  2. Stream Enrich consumes each event from the raw stream and attempts to validate and enrich it. Events that are successfully processed are written to an enriched stream and bad events are written to a bad stream
  3. Elasticsearch Sinks are then configured to consume events from both the enriched and bad event streams and to then load them into distinct Elasticsearch indexes for viewing and analysis

The key difference is that on Snowplow Mini, all of these components run together on a single instance, rather than as separate distributed services:

This diagram illustrates the mini data pipeline:


3. Software stack

The current Snowplow Mini stack consists of the applications described above — the collector, Stream Enrich, the Elasticsearch Sinks, a bundled Iglu schema repository, Elasticsearch and Kibana.

With so many services running on the box, we recommend a t2.medium or larger for a smooth experience during testing and use. The right size is dependent on a number of factors, such as the number of users and the volume of events being sent into the instance.

4. Roadmap

We have big plans for Snowplow Mini.

We also want to make it easy to set up and run Snowplow Mini outside of EC2.

If you have an idea of something you would like to see or need from Snowplow Mini please raise an issue!

5. Getting help

For more details on this release, please check out the release notes on GitHub.

If you have any questions or run into any problems, please raise an issue.


