Kinesis S3 0.4.0 released with gzip support

26 August 2015  •  Joshua Beemster

We are pleased to announce the release of Kinesis S3 version 0.4.0. Many thanks to Kacper Bielecki from Avari for his contribution to this release!

Table of contents:

  1. gzip support
  2. Infinite loops
  3. Safer record batching
  4. Bug fixes
  5. Upgrading
  6. Getting help

1. gzip support

Kinesis S3 now supports gzip as a second storage/compression option for the files it writes out to S3. Using this format, each record is treated as a byte array containing a UTF-8 encoded string (whether CSV, JSON or TSV). The records are written out as strings, one record per line, and the resulting file is gzipped.
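To make the file layout concrete, here is a minimal Python sketch of the format described above: UTF-8 records joined one per line, then gzipped. It is an illustration only, not the Kinesis S3 implementation; the function names are hypothetical, and it assumes that no record contains an embedded newline.

```python
import gzip
import io


def write_gzip_records(records):
    """Write UTF-8 byte-array records one per line and gzip the result."""
    buffer = io.BytesIO()
    with gzip.GzipFile(fileobj=buffer, mode="wb") as gz:
        for record in records:
            # Assumption: each record is UTF-8 bytes with no embedded newline.
            gz.write(record + b"\n")
    return buffer.getvalue()


def read_gzip_records(payload):
    """Decompress a payload and return one string per line."""
    text = gzip.decompress(payload).decode("utf-8")
    # Every record ends with a newline, so the final split element is empty.
    return text.split("\n")[:-1]


records = [b'{"id": 1}', "caf\u00e9\t42".encode("utf-8")]
payload = write_gzip_records(records)
print(read_gzip_records(payload))
```

Files in this layout can be read by any tool that understands gzipped, newline-delimited text.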

Big thanks go to Kacper Bielecki for contributing this storage option! For more information please see Kacper’s pull request.

Snowplow users please note: you must continue to use the LZO format for storing raw Snowplow events.

2. Infinite loops

With the recent Amazon S3 outage in us-east-1, an issue was discovered where Kinesis S3 was unable to recover the connection to S3 even after the service was restored. This resulted in an infinite loop of failures to PUT any records into S3. To fix this, we had to manually restart all Kinesis S3 instances.

To prevent this from recurring, Kinesis S3 now supports a failure timeout: if failures continue beyond this timeout, Kinesis S3 will self-terminate. You can specify the timeout, in milliseconds, in the configuration file:

// Failures tolerated for one minute (value in ms)
sink.s3.max-timeout: 60000
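The idea behind a failure timeout can be sketched in a few lines of Python. This is not how Kinesis S3 implements it internally, which this post does not describe; the class name is hypothetical, and measuring the window from the first failure in an unbroken run of failures is an assumption.

```python
import time


class FailureTimeout:
    """Tracks how long a run of consecutive failures has lasted."""

    def __init__(self, max_timeout_ms, clock=time.monotonic):
        self.max_timeout_ms = max_timeout_ms
        self.clock = clock  # injectable so the logic can be exercised without waiting
        self.first_failure_at = None  # None while the last attempt succeeded

    def record_success(self):
        # Any success ends the current run of failures.
        self.first_failure_at = None

    def record_failure(self):
        """Return True once failures have lasted longer than the timeout."""
        now = self.clock()
        if self.first_failure_at is None:
            self.first_failure_at = now
        elapsed_ms = (now - self.first_failure_at) * 1000
        return elapsed_ms > self.max_timeout_ms


# Simulate 61 seconds of continuous failures with a fake clock.
fake_now = [0.0]
tracker = FailureTimeout(60000, clock=lambda: fake_now[0])
print(tracker.record_failure())
fake_now[0] = 61.0
print(tracker.record_failure())
```

A caller would exit the process when `record_failure()` returns True.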

This feature can be neatly coupled with an automated restart wrapper to ensure that the application will recover without human intervention.
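A restart wrapper of this kind might look like the following Python sketch. The function name and its parameters are hypothetical, and treating a zero exit code as a deliberate shutdown is an assumption. The demonstration command is a stand-in that always fails; substitute whatever command you use to launch Kinesis S3.

```python
import subprocess
import sys
import time


def run_with_restarts(command, max_restarts=None, delay_seconds=5.0):
    """Re-run `command` after each failed exit; return the number of launches."""
    launches = 0
    while True:
        launches += 1
        exit_code = subprocess.call(command)
        print(f"process exited with code {exit_code}", file=sys.stderr)
        if exit_code == 0:
            # Assumption: a clean exit is intentional, so do not restart.
            return launches
        if max_restarts is not None and launches > max_restarts:
            return launches
        time.sleep(delay_seconds)


# Stand-in for the real application: a process that always exits with code 1.
failing_command = [sys.executable, "-c", "import sys; sys.exit(1)"]
print(run_with_restarts(failing_command, max_restarts=2, delay_seconds=0))
```

With `max_restarts=None` the wrapper keeps restarting indefinitely. A general-purpose process supervisor can fill the same role.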

3. Safer record batching

In the previous release post we discussed potential out-of-memory problems for this application. To improve things further, we have added a new configuration option, max-records, which specifies how many records the application is allowed to read per GetRecords call. This helps prevent the application from exceeding its heap during sudden traffic spikes.

// Number of records per GetRecords call
sink.kinesis.in.max-records: 10000

Unless you are experiencing out-of-memory issues, we recommend using the default of 10000. Please note that 10000 is, for the moment, also the maximum setting. If it is set any higher, an InvalidArgumentException will be thrown.
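If you generate or template your configuration, it can help to catch an out-of-range value before starting the application. The Python sketch below is a hypothetical pre-flight check, not part of Kinesis S3; the upper bound comes from this post, while the lower bound of 1 is an assumption.

```python
MAX_RECORDS_LIMIT = 10000  # maximum setting stated in this post


def validate_max_records(value):
    """Return `value` if it is usable as max-records, else raise ValueError."""
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(value, bool) or not isinstance(value, int):
        raise ValueError(f"max-records must be an integer, got {value!r}")
    # Assumption: at least one record must be requested per call.
    if not 1 <= value <= MAX_RECORDS_LIMIT:
        raise ValueError(
            f"max-records must be between 1 and {MAX_RECORDS_LIMIT}, got {value}"
        )
    return value


print(validate_max_records(10000))
try:
    validate_max_records(10001)
except ValueError as error:
    print(error)
```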

4. Bug fixes

We have also:

  • Fixed a bug where the Snowplow Tracker was using the wrong event type for write_failures (#45)
  • Added logging for OutOfMemoryErrors so it is easier to debug in the future (#29)

5. Upgrading

The Kinesis S3 application is available in a single zip file here:

http://bintray.com/artifact/download/snowplow/snowplow-generic/kinesis_s3_0.4.0.zip

Upgrading requires the following changes to the application’s HOCON configuration file:

  • Add max-records to the sink.kinesis.in section and configure how many records you want the application to get at any one time
  • Add format to the sink.s3 section and select either lzo or gzip to control what format files are written in
  • Add max-timeout to the sink.s3 section and set the failure timeout in ms after which the application will self-terminate

And that’s it - you should now be fully upgraded!
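The three configuration changes listed above can be checked mechanically. The Python sketch below is a hypothetical helper, not something shipped with Kinesis S3. It assumes the HOCON file has already been parsed and flattened into a dict keyed by dotted paths; the parsing step is not shown.

```python
REQUIRED_KEYS = (
    "sink.kinesis.in.max-records",
    "sink.s3.format",
    "sink.s3.max-timeout",
)
ALLOWED_FORMATS = ("lzo", "gzip")


def check_upgraded_config(flat_config):
    """Return a list of problems found; an empty list means none were found."""
    problems = [f"missing {key}" for key in REQUIRED_KEYS if key not in flat_config]
    fmt = flat_config.get("sink.s3.format")
    if fmt is not None and fmt not in ALLOWED_FORMATS:
        problems.append(f"sink.s3.format must be lzo or gzip, got {fmt!r}")
    return problems


# A configuration that has not been updated for the new max-timeout setting.
old_config = {"sink.kinesis.in.max-records": 10000, "sink.s3.format": "gzip"}
print(check_upgraded_config(old_config))
```

The check only confirms that the keys are present and that the format is one of the two supported values; it does not validate the numeric settings.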

6. Getting help

For more details on this release, please check out the Kinesis S3 0.4.0 release on GitHub.

If you have any questions or run into any problems, please raise an issue or get in touch with us through the usual channels.