Snowplow 80 Southern Cassowary released

30 May 2016  •  Fred Blundun

Snowplow 80 Southern Cassowary is now available! This is a real-time pipeline release which improves stability and brings the real-time pipeline up to date with our Hadoop pipeline.

  1. The latest Common Enrich
  2. Exiting on error
  3. Configurable maxRecords
  4. Changes to logging
  5. Continuous integration and deployment
  6. Other improvements
  7. Upgrading
  8. Getting help


The latest Common Enrich

This version of Stream Enrich uses the latest version of Scala Common Enrich, the library containing Snowplow’s core enrichment logic. Among other things, this means that you can now use R79 Black Swan’s API Request Enrichment and the HTTP Header Extractor Enrichment in your real-time pipeline.

Exiting on error

There are certain error conditions under which our Kinesis apps would previously hang rather than crash outright:

  • The Scala Stream Collector could hang when unable to bind to the supplied port
  • All of the apps could hang if they were unable to find the Kinesis stream to write to on startup. This could be caused by a race condition when starting the app and creating the stream at the same time

In this release, we have modified the apps so that they exit with status code 1 whenever these errors are encountered, rather than hanging. This means that you can run the apps with a background script which restarts them whenever they die. This prevents transient error conditions from requiring human intervention.
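As a sketch of such a wrapper (the paths and app invocation below are hypothetical; adapt them to your own deployment), a minimal supervision function could look like this:

```shell
#!/bin/bash
# run_supervised: re-run a command until it exits with status 0.
# A non-zero exit (e.g. the new status code 1 on a failed port bind or a
# missing Kinesis stream) triggers a restart after a short delay.
run_supervised() {
  local delay="$1"; shift
  until "$@"; do
    echo "Process exited with status $?; restarting in ${delay}s..." >&2
    sleep "$delay"
  done
}

# Example (hypothetical jarfile and config names):
# run_supervised 5 java -jar snowplow-stream-enrich.jar --config enrich.conf
```

A transient failure, such as the stream not yet existing at startup, then simply causes a restart loop until the condition clears.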

Configurable maxRecords

You can now configure the number of records that the Kinesis Client Library should retrieve with each call to GetRecords. The default is 10,000, which is also the maximum. If you frequently see "Unable to execute HTTP request: Connection reset" in your error logs, then you should try reducing maxRecords to make each request smaller and more likely to succeed.

You can set maxRecords to any positive integer up to 10,000 in the configuration file for Stream Enrich (by setting enrich.streams.in.raw.maxRecords = n) and Kinesis Elasticsearch Sink (by setting sink.kinesis.in.maxRecords = n).

Changes to logging

Stream Enrich’s logging for failed records is now less verbose: instead of logging an error message for every failed record in a batch, it buckets the failed records by error code, then logs the size of each bucket together with a representative error message.
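The effect is similar to the following illustrative sketch (not the actual Stream Enrich code, which does this internally in Scala): given one line per failed record in the form "error code, tab, message", it prints one summary line per bucket.

```shell
#!/bin/bash
# summarise_failures: read lines of "ERROR_CODE<TAB>message" on stdin and
# print, per error code, the bucket size plus one representative message.
# Illustrative only -- the error-code format here is made up for the example.
summarise_failures() {
  awk -F'\t' '
    { count[$1]++ }                  # grow the bucket for this error code
    !($1 in sample) { sample[$1] = $2 }  # keep the first message as representative
    END {
      for (code in count)
        printf "%d records failed with %s: %s\n", count[code], code, sample[code]
    }
  '
}
```

A batch with hundreds of identically-failing records thus produces a single summary line rather than hundreds of near-duplicate ones.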

Additionally, both the Scala Stream Collector and Stream Enrich now log a missing stream, which is a showstopping issue, at the error level rather than the info level.

Continuous integration and deployment

The Kinesis apps are now continuously integrated and deployed using Travis CI. This speeds up our development cycle, making it easier to automatically publish new versions.

Other improvements

Other improvements across the Kinesis applications include:

  • For each app, the example configuration files have been moved out of the src directory into a new example directory. This prevents them from being needlessly added to the jarfile
  • The Scala Stream Collector now sends the string “ok” as the content of its response to POST requests. This is a workaround for a Firefox bug which causes unsightly errors to appear in the JavaScript console when the response to a POST request has no content
  • We have upgraded the Scala Stream Collector to use Akka 2.3.9, Spray 1.3.3, and Scala 2.10.5

Upgrading

The Kinesis apps for R80 Southern Cassowary are all available in a single zip file here:

http://dl.bintray.com/snowplow/snowplow-generic/snowplow_kinesis_r80_southern_cassowary.zip

There are no breaking changes in this release: you can upgrade the individual Kinesis apps without having to update the configuration files or the Kinesis streams.

However, if you want to configure how many records Stream Enrich should read from Kinesis at a time, update its configuration file to add a maxRecords property like so:

enrich {
  ...
  streams {
    in: {
      ...
      maxRecords: 5000 # Default is 10000
      ...
    }
  }
}

If you want to configure how many records Kinesis Elasticsearch Sink should read from Kinesis at a time, again update its configuration file to add a maxRecords property:

sink {
  ...
  kinesis {
    in: {
      ...
      maxRecords: 5000 # Default is 10000
      ...
    }
  }
}

Getting help

For more details on this release, please check out the release notes on GitHub.

If you have any questions or run into any problems, please raise an issue or get in touch with us through the usual channels.