We are pleased to announce the release of Snowplow 67, Bohemian Waxwing. This release brings a host of upgrades to our real-time Amazon Kinesis pipeline as well as the embedding of Snowplow tracking into this pipeline.
Table of contents:
- Embedded Snowplow tracking
- Handling outsized event payloads
- More informative bad rows
- Improved Vagrant VM
- New Kinesis S3 repository
- Other changes
- Upgrading
- Getting help
1. Embedded Snowplow tracking
Both Scala Kinesis Enrich and Kinesis Elasticsearch Sink now have the ability to record Snowplow events from within the applications themselves. These events include:
- A heartbeat, which is sent every 5 minutes so we know that the app is still alive-and-kicking
- warning events, e.g. if no enrichment configurations were found by Scala Kinesis Enrich
- Events for each failure in pushing events to the Kinesis streams or Elasticsearch
Adding Snowplow tracking to our Kinesis applications is exciting for two reasons:
- It is the first step towards Snowplow becoming “self-hosting”, meaning that we can use one instance of Snowplow to monitor a second instance of Snowplow. “Dog-fooding” Snowplow in this way is essential to finding and fixing bugs in Snowplow faster
- It is an opportunity to start exploring how Snowplow can be used for effective systems-level monitoring, alongside our existing application-level use cases
Note that Snowplow tracking has not yet been added to the Scala Stream Collector; this will follow in a subsequent release.
2. Handling outsized event payloads
Previously the Scala Stream Collector was unable to handle events that exceeded the maximum byte limit of a Kinesis stream: large POST payloads, for example, were simply discarded due to the inability to write them to Kinesis. With this release, the collector can now “break apart” outsized payloads of multiple events into smaller payloads which will fit into a Kinesis stream.
Combine this with the recent increase in allowed record PUT size for Kinesis from 50kB to 1MB and there should be very few scenarios now when an event payload has to be discarded for being outsized.
This said, at Snowplow we strongly believe that any event processing component which could encounter processing failures (however rare) should have a “stderr” output to record those failures. To accomplish this, Bohemian Waxwing adds a bad output stream to the collector. As a first use case for this new stream, outsized payloads which cannot be written to Kinesis (essentially single POST events which are larger than 1MB) will be written to the bad stream with the error and the total size in bytes.
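To make the splitting and bad-stream routing in this section more concrete, here is a rough Python sketch. It is an illustration only, not the collector's actual Scala implementation: the names split_payload and encoded_size, the greedy batching strategy, measuring size as the UTF-8 length of a JSON-serialized batch, and the exact byte value used for "1MB" are all assumptions.

```python
import json

# Assumed limit: a conservative reading of the 1MB Kinesis record size.
MAX_RECORD_BYTES = 1_000_000


def encoded_size(events):
    """Size in bytes of a list of events serialized as one JSON payload."""
    return len(json.dumps(events).encode("utf-8"))


def split_payload(events, max_bytes=MAX_RECORD_BYTES):
    """Greedily split events into batches that each fit within max_bytes.

    Returns (batches, oversized), where oversized holds single events that
    cannot fit on their own and would be routed to the bad stream instead.
    """
    batches, oversized, current = [], [], []
    for event in events:
        # A single event over the limit can never be written to the stream.
        if encoded_size([event]) > max_bytes:
            oversized.append(event)
            continue
        # Close the current batch if adding this event would overflow it.
        if current and encoded_size(current + [event]) > max_bytes:
            batches.append(current)
            current = []
        current.append(event)
    if current:
        batches.append(current)
    return batches, oversized
```

Calling split_payload(events) returns the batches that are safe to write plus any single events too large to send. A real implementation would likely track the running size incrementally rather than re-serializing each candidate batch; the sketch favors clarity over speed.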
3. More informative bad rows
All the Kinesis apps are capable of emitting bad rows corresponding to failed events. Previously these bad rows only had a line field, containing the body of the failed event, plus an errors field, containing a non-empty list of problems with the event. In Bohemian Waxwing we add a timestamp field containing the time at which the event failed.
This makes it easier to monitor the progress of applications which consume failed events; it also makes it easier to analyze these bad rows in Elasticsearch/Kibana.
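As an illustration, a bad row with these three fields might be built and later filtered by failure time as in the Python sketch below. The field names line, errors and timestamp come from this release; the helper names make_bad_row and failed_since, the ISO 8601 UTC timestamp format and the sample values are assumptions, since the exact serialization is not spelled out here.

```python
import json
from datetime import datetime, timezone


def make_bad_row(line, errors, failed_at=None):
    """Build a bad row as a JSON string with line, errors and timestamp."""
    if not errors:
        raise ValueError("a bad row needs at least one error")
    failed_at = failed_at or datetime.now(timezone.utc)
    return json.dumps({
        "line": line,
        "errors": list(errors),
        # Assumed format: ISO 8601 in UTC.
        "timestamp": failed_at.isoformat(),
    })


def failed_since(bad_rows, cutoff):
    """Return parsed bad rows whose failure time is at or after cutoff."""
    parsed = (json.loads(row) for row in bad_rows)
    return [
        row for row in parsed
        if datetime.fromisoformat(row["timestamp"]) >= cutoff
    ]
```

A consumer of the bad stream could use something like failed_since to pick up only the rows that failed after its last checkpoint, which is the kind of progress monitoring the new field is meant to simplify.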
4. Improved Vagrant VM
Building the Snowplow apps using sbt assembly in the Vagrant virtual machine is a very I/O intensive operation. To speed up this process, we have added comments to the project’s Vagrantfile indicating how to use NFS and how to allow the VM to use multiple cores.
5. New Kinesis S3 repository
Since the Kinesis S3 Sink is not Snowplow-specific but can be used to mirror arbitrary data from Kinesis to S3, we have moved it from the main Snowplow repo into a repository of its own. There have been two releases of Kinesis S3 since extracting it into its own repo: 0.2.1 and 0.3.0.
6. Other changes
We have also:
- Increased the maximum size of a Kinesis record put to 1MB from 50kB (#1753, #1736)
- Fixed a bug where the Kinesis Elasticsearch Sink could hang without ever shutting down (#1743)
- Fixed a bug which prevented Scala Kinesis Enrich from downloading from URIs using the
- Fixed a bug in Scala Kinesis Enrich where the etl_tstamp was not correctly formatted (#1842)
- Fixed a nasty race condition in Scala Kinesis Enrich which caused the app to attempt to send too many records at once (#1756)
- Ensured that if Scala Kinesis Enrich fails to download the MaxMind database, it will exit immediately rather than attempting to look up IP addresses from a non-existent file (#1711)
- Made the Kinesis Elasticsearch Sink exit immediately if the bad stream does not exist, rather than waiting until the first bad event occurs, as before (#1677)
- Started logging all bad rows in Scala Kinesis Enrich to simplify debugging (#1722)
7. Upgrading
The Kinesis apps for r67 Bohemian Waxwing are now all available in a single zip file here:
Upgrading will require various configuration changes to each of the three applications’ HOCON configuration files:
Scala Stream Collector
collector.sink.kinesis.stream.good in the HOCON
collector.sink.kinesis.stream.bad to the HOCON
Scala Kinesis Enrich
If you want to include Snowplow tracking for this application please append the following:
Note that this is a wholly optional section; if you do not want to send application events to a second Snowplow instance, simply do not add this to your configuration file.
For a complete example, see our
Kinesis Elasticsearch Sink
location fields into the
- If you want to include Snowplow tracking for this application please append the following:
Again, note that Snowplow tracking is a wholly optional section.
For a complete example, see our
And that’s it – you should now be fully upgraded!
8. Getting help
For more details on this release, please check out the r67 Bohemian Waxwing release on GitHub.