We are thrilled to announce the first batch of official Docker images for Snowplow. This first release focuses on laying the foundations for running a Snowplow real-time pipeline in a Docker-containerized environment. As a result, this release includes images for:
- Scala Stream Collector
- Stream Enrich
- Snowplow Elasticsearch Loader
- Snowplow S3 Loader
In this post, we will cover:
- Why provide Docker images?
- A foundation common to all images
- Real-time pipeline images
- Docker Compose example
- Future work
1. Why provide Docker images?
Snowplow community members have been experimenting with building their own Docker images for Snowplow for some time. Our decision to bring this “in house” and start publishing and maintaining our own official images is based on a few factors.
One important reason is the ease of distributing and scheduling Snowplow through container orchestrators such as Kubernetes, Nomad, OpenShift or Docker Swarm. Providing officially supported images should help to reduce the friction in adopting these platforms for Snowplow real-time pipeline users.
Another argument can be made for resource efficiency. For example, running two instances of Stream Enrich typically requires two separate boxes, each incurring its own OS overhead. Moving to containers should allow you to run those two instances on the same box, giving you higher resource utilization.
But most fundamentally, providing Docker images for the Snowplow real-time pipeline is part of a broader move on our side to formalize the Snowplow real-time pipeline as an asynchronous micro-services-based architecture.
Micro-services architectures are growing in popularity, and the Snowplow real-time pipeline is an example of a platform built out of a set of asynchronously connected micro-services. Asynchronous means that none of our apps have any direct coupling with each other – instead they all rely on an overarching streaming abstraction such as Kinesis or Apache Kafka to communicate. These kinds of architectures are very often containerized using Docker to ease deployment and scheduling.
Official Docker images have been a long-requested feature – we’re excited to finally be providing these to the community!
2. A foundation common to all Snowplow images
In this section, we’ll detail a few technical aspects we’ve taken care of to ensure reliable and performant images.
Thanks to this base image, every component runs under dumb-init, which handles reaping zombie processes and forwarding signals to all processes running in the container. The images also use su-exec as a sudo replacement, to run any component as a non-root user.
Each container exposes the /snowplow/config volume to store the component’s configuration. If this folder is bind-mounted then its ownership will be changed to the non-root user that runs the component.
JVM options such as -XX:+UseCGroupMemoryLimitForHeap are automatically provided when launching any component, so that the JVM adheres to the memory limits imposed by Docker; for more information, see this article.
Finally, if you want to manually tune certain aspects of the JVM, additional options can be set through the SP_JAVA_OPTS environment variable when launching a container.
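As a rough sketch of how the configuration volume and SP_JAVA_OPTS fit together on the command line, the helper below wraps docker run. The function name run_snowplow_component, the image reference and the heap sizes are illustrative assumptions rather than values taken from this release; only the /snowplow/config mount point and the SP_JAVA_OPTS variable come from the description above.

```shell
# Hypothetical helper: start a Snowplow component container in the background.
#   $1 - image reference (assumption: substitute the image and tag you want to run)
#   $2 - host folder holding the component's configuration
#   $3 - extra JVM options, passed through the SP_JAVA_OPTS environment variable
#   any remaining arguments are handed to the component itself
run_snowplow_component() {
  image="$1"
  config_dir="$2"
  java_opts="$3"
  shift 3
  # -v bind-mounts the host folder onto the /snowplow/config volume,
  # -e sets SP_JAVA_OPTS inside the container.
  docker run -d \
    -v "${config_dir}:/snowplow/config" \
    -e "SP_JAVA_OPTS=${java_opts}" \
    "${image}" "$@"
}

# Example call (the image reference is a placeholder, not a published image):
#   run_snowplow_component my-registry/my-component:tag "$PWD/config" "-Xms512m -Xmx512m"
```

Keeping the image reference as a parameter means the same helper can be reused for any of the components, with only the configuration folder and JVM options changing.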
3. Real-time pipeline images
As mentioned above, this release includes images for the Snowplow real-time pipeline. In this section, we’ll cover each of these in turn.
Note that all of these images are hosted in our snowplow-docker-registry.bintray.io Docker registry.
3.1 Scala Stream Collector
You can pull and run the image with:
In the above, we’re assuming that there is a valid Scala Stream Collector configuration located in the config folder in the current directory.
Alternatively, you can build the image yourself:
The above assumes that you’ve cloned the repository.
This image was contributed by Joshua Cox, huge thanks Josh!
3.2 Stream Enrich
We can pull the image and launch a container with:
The Stream Enrich image was written by Daniel Zohar. Big thanks to Daniel for this image and all the advice that he’s given us on our Docker journey!
3.3 Snowplow Elasticsearch Loader
As before, we can pull and run the image with the following:
Refer to the Elasticsearch Loader configuration example as required.
3.4 Snowplow S3 Loader
Check out the S3 Loader config example to remind yourself of the format.
4. Docker Compose example
To help you get started, there is also a Docker Compose example which incorporates one container for the Scala Stream Collector and another for Stream Enrich.
As is, the provided configurations make the following assumptions:
- A snowplow-raw Kinesis stream exists and is used to store the collected events
- A snowplow-enriched Kinesis stream exists and is used to store the enriched events
- A snowplow-bad Kinesis stream exists and is used to store the events which failed validation
- All those streams are located in the
Feel free to modify the given configuration files to suit your needs. This Docker Compose example is provided to illustrate how you can start to compose our Snowplow containers together; it is not intended to be a reference or production-ready deployment.
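If the three streams listed above do not exist yet, one way to create them is with the AWS CLI. This is a minimal sketch, not part of the provided example: the function name create_snowplow_streams is hypothetical, the region is left as a parameter, and a shard count of 1 is an assumption suitable only for a quick test.

```shell
# Hypothetical helper: create the three Kinesis streams the example configurations expect.
#   $1 - AWS region (assumption: use the region your configuration files point at)
#   $2 - shard count per stream (assumption: defaults to 1, enough for a quick test)
create_snowplow_streams() {
  region="$1"
  shards="${2:-1}"
  for stream in snowplow-raw snowplow-enriched snowplow-bad; do
    # Stop at the first failure so a partial setup is noticed immediately.
    aws kinesis create-stream \
      --region "${region}" \
      --stream-name "${stream}" \
      --shard-count "${shards}" || return 1
  done
}

# Example call (the region is a placeholder):
#   create_snowplow_streams <your-region>
```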
The containers can be launched with:
A Scala Stream Collector and a Stream Enrich container are now running!
If you want to stop them:
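The launch and stop steps can be sketched with the standard docker-compose CLI. The helper names start_pipeline and stop_pipeline and the directory argument are hypothetical; they assume that the given directory contains a docker-compose.yml such as the one from the example.

```shell
# Hypothetical helpers around the standard docker-compose CLI.
#   $1 - directory containing docker-compose.yml (assumption: defaults to the current directory)
start_pipeline() {
  # Create and start the containers in the background (detached mode).
  (cd "${1:-.}" && docker-compose up -d)
}

stop_pipeline() {
  # Stop and remove the containers created by "up".
  (cd "${1:-.}" && docker-compose down)
}

# Example calls:
#   start_pipeline path/to/compose/example
#   stop_pipeline path/to/compose/example
```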
The Docker Compose example was contributed by Tamas Szuromi, thanks Tamas!
5. Future work
This release is just the beginning of a huge amount of experimentation around Docker, containerized environments and container scheduling that we are embarking on here at Snowplow.
Within Snowplow Mini, we have firm plans to swap out our current architecture for a Docker Compose-based composition of the various services that make up Snowplow Mini. See the Docker milestone for more details.
And of course if there are other aspects of containerization that you would like us to explore, please let us know!
If you have any questions or run into any problems, please visit our Discourse forum.