Snowplow 65 Scarlet Rosefinch released

We are pleased to announce the release of Snowplow 65, Scarlet Rosefinch. This release greatly improves the speed, efficiency, and reliability of Snowplow’s real-time Kinesis pipeline.

Table of contents:

  1. Enhanced performance
  2. CORS support
  3. Increased reliability
  4. Loading configuration from DynamoDB
  5. Randomized partition keys for bad streams
  6. Removal of automatic stream creation
  7. Improved Elasticsearch index initialization
  8. Other changes
  9. Upgrading
  10. Getting help

scarlet-rosefinch

1. Enhanced performance

Kinesis’ new PutRecords API enabled the biggest performance improvement: rather than sending events to Kinesis one at a time, we can now send batches of up to 500. This greatly increases the event volumes which the Scala Stream Collector and Scala Kinesis Enrich can handle.
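For illustration, here is a minimal sketch of such a batched write using the AWS Java SDK (PutRecordsRequest and PutRecordsRequestEntry are the SDK's own classes; the putBatch helper is hypothetical):

import java.nio.ByteBuffer
import scala.collection.JavaConverters._
import com.amazonaws.services.kinesis.AmazonKinesisClient
import com.amazonaws.services.kinesis.model.{PutRecordsRequest, PutRecordsRequestEntry}

// Hypothetical helper: write up to 500 serialized events in a single API call
def putBatch(client: AmazonKinesisClient, stream: String,
             events: Seq[(Array[Byte], String)]): Unit = {
  require(events.size <= 500, "PutRecords accepts at most 500 records per call")
  val entries = events.map { case (payload, partitionKey) =>
    new PutRecordsRequestEntry()
      .withData(ByteBuffer.wrap(payload))
      .withPartitionKey(partitionKey)
  }
  client.putRecords(
    new PutRecordsRequest().withStreamName(stream).withRecords(entries.asJava))
}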

You may not always want to wait for a full 500 events before sending the stored records to Kinesis, so the configuration for both applications now includes a buffer section which gives greater control over when stored records are sent.

It has three fields:

  1. byte-limit: if the stored records total at least this many bytes, flush the buffer
  2. record-limit: if at least this many records are in the buffer, flush the buffer
  3. time-limit: if at least this many milliseconds have passed since the buffer was last flushed, flush the buffer

An example with sensible defaults:

buffer: {
  byte-limit: 4500000 # 4.5 MB
  record-limit: 500 # 500 records
  time-limit: 60000 # 1 minute
}
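To make the interplay of the three limits concrete, here is a hypothetical sketch of a buffer that flushes when any one of them is reached (the apps' actual implementation may differ):

// Hypothetical buffer that flushes when any one of the three limits is hit
class RecordBuffer(byteLimit: Long, recordLimit: Long, timeLimit: Long) {
  private var records = List.empty[Array[Byte]]
  private var storedBytes = 0L
  private var lastFlushed = System.currentTimeMillis()

  def store(record: Array[Byte]): Unit = {
    records ::= record
    storedBytes += record.length
    val overdue = System.currentTimeMillis() - lastFlushed >= timeLimit
    if (storedBytes >= byteLimit || records.size >= recordLimit || overdue) flush()
  }

  def flush(): Unit = {
    // send `records` to Kinesis via PutRecords here, then reset the state
    records = Nil
    storedBytes = 0L
    lastFlushed = System.currentTimeMillis()
  }
}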

Additionally, the Scala Stream Collector has a ShutdownHook which sends all stored records. This prevents stored events from being lost when the collector is shut down cleanly.
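In Scala this kind of hook is a one-liner; a sketch, assuming a buffer with a flush() method like the one above:

// Register a hook that flushes the buffer when the JVM shuts down cleanly
sys.addShutdownHook {
  buffer.flush()
}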

2. CORS support

The Scala Stream Collector now supports CORS requests, so you can send events to it from Snowplow’s client-side JavaScript Tracker using POST rather than GET. This means that your requests are no longer subject to Internet Explorer’s querystring size limit.

The Scala Stream Collector also now supports cross-origin requests from the Snowplow ActionScript 3.0 Tracker.
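For reference, a browser only delivers the response of a cross-origin POST if the collector answers with the appropriate CORS headers; a framework-agnostic sketch (the header names are standard, the corsHeaders helper itself is hypothetical):

// Standard CORS response headers a collector must return before a browser
// will accept a cross-origin POST; the helper itself is hypothetical
def corsHeaders(requestOrigin: String): Map[String, String] = Map(
  "Access-Control-Allow-Origin"      -> requestOrigin,
  "Access-Control-Allow-Credentials" -> "true", // required for cookie support
  "Access-Control-Allow-Headers"     -> "Content-Type"
)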

3. Increased reliability

If an attempt to write records to Kinesis failed, previous versions of the Kinesis apps would simply log an error message, and the records would be lost. Scarlet Rosefinch prevents this loss when Kinesis is temporarily unreachable: failed PutRecords requests are now retried using an exponential backoff strategy with jitter.

The minimum and maximum backoffs to use are configurable in the backoffPolicy for the Scala Stream Collector and Scala Kinesis Enrich:

backoffPolicy: {
  minBackoff: 3000 # 3 seconds
  maxBackoff: 600000 # 10 minutes
}
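As an illustration, here is a minimal sketch of one common variant of this strategy, exponential backoff with "full" jitter, using the two configured bounds (the apps' exact jitter scheme may differ):

import scala.util.Random

// Exponential backoff with "full" jitter: double the ceiling on each
// attempt, cap it at maxBackoff, then wait a random duration below it
def backoffMillis(attempt: Int, minBackoff: Long, maxBackoff: Long): Long = {
  val ceiling = math.min(maxBackoff.toDouble, minBackoff * math.pow(2.0, attempt))
  (Random.nextDouble() * ceiling).toLong
}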

4. Loading configuration from DynamoDB

The command-line arguments to Scala Kinesis Enrich have also changed. It used to be the case that you provided a --config argument, pointing to a HOCON file with configuration for the app, together with an optional --enrichments argument, pointing to a directory containing the JSON configurations for the enrichments you wanted to make use of:

./scala-kinesis-enrich-0.4.0 --config my.conf --enrichments path/to/enrichments

Scarlet Rosefinch makes three changes:

  1. The resolver section moves out of the configuration HOCON into its own JSON
  2. A new mandatory --resolver argument tells the app where to find that resolver JSON
  3. Both the --resolver and --enrichments arguments take a "file:" or "dynamodb:" prefix, so the JSONs can be loaded either from the local filesystem or from DynamoDB

To recreate the pre-r65 behavior, convert the resolver section of your configuration HOCON to JSON and put it in its own file, “iglu_resolver.json”. Then start the enricher like this:

./scala-kinesis-enrich-0.5.0 --config my.conf --resolver file:iglu_resolver.json --enrichments file:path/to/enrichments

To get the resolver from DynamoDB, create a table named “snowplow_config” with hashkey “id” and add an item to the table of the following form:

{ "id": "iglu_resolver", "json": "{The resolver as a JSON string}" }

Then provide the resolver argument as follows:

--resolver dynamodb:us-east-1/snowplow_config/iglu_resolver

To get the enrichments from DynamoDB, the enrichment JSONs must all be stored in a table – you can reuse the “snowplow_config” table. The enrichments’ hash keys should have a common prefix, for example “enrich_”:

{ "id": "enrich_anon_ip", "json": "{anon_ip enrichment configuration as a JSON string}" } { "id": "enrich_ip_lookups", "json": "{ip_lookups enrichment configuration as a JSON string}" }

Then provide the enrichments argument as follows:

--enrichments dynamodb:us-east-1/snowplow_config/enrich_

If you are using a different AWS region, replace “us-east-1” accordingly.

The full command:

./scala-kinesis-enrich-0.5.0 --config my.conf \
  --resolver dynamodb:us-east-1/snowplow_config/iglu_resolver \
  --enrichments dynamodb:us-east-1/snowplow_config/enrich_
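Under the hood, resolving a dynamodb:region/table/key reference like the ones above amounts to a single DynamoDB key lookup. A minimal sketch using the AWS Java SDK's document API (the fetchJson helper is hypothetical):

import com.amazonaws.regions.{Region, Regions}
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClient
import com.amazonaws.services.dynamodbv2.document.DynamoDB

// Hypothetical helper: fetch the "json" attribute of the item whose
// "id" hash key equals `key`, e.g. from the "snowplow_config" table
def fetchJson(region: Regions, table: String, key: String): String = {
  val client = new AmazonDynamoDBClient()
  client.setRegion(Region.getRegion(region))
  new DynamoDB(client).getTable(table).getItem("id", key).getString("json")
}

// e.g. fetchJson(Regions.US_EAST_1, "snowplow_config", "iglu_resolver")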

5. Randomized partition keys for bad streams

When sending a record to Kinesis, you provide a partition key. The hash of this key determines which shard will process the record. The good events generated by Scala Kinesis Enrich use the user_ipaddress field of the event as the key, so that events from a single user will all be in the right order in the same shard.

Previously, the bad events generated by Scala Kinesis Enrich and the Kinesis Elasticsearch Sink all had the same partition key. This meant that no matter how many shards the stream for bad records contained, all bad records would be processed by the same shard. Scarlet Rosefinch fixes this by using random partition keys for bad events.
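A sketch of the two key choices (the helpers are hypothetical; the random key format is illustrative):

import scala.util.Random

// Good events: key on the user's IP address so that one user's events
// hash to the same shard and keep their ordering
def goodPartitionKey(userIpAddress: String): String = userIpAddress

// Bad events: a random key spreads records evenly across all shards
def badPartitionKey(): String = Random.nextInt(Int.MaxValue).toString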

6. Removal of automatic stream creation

The Kinesis apps will no longer automatically create a stream if they detect that the configured stream does not exist.

This means that if you make a typo when configuring the stream name, the mistake will immediately cause an error rather than creating an incorrectly named new stream which no other app knows about.
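Conceptually, each app now verifies its configured stream up front and aborts if the stream is missing; a sketch using the AWS Java SDK (error handling simplified):

import com.amazonaws.services.kinesis.AmazonKinesisClient
import com.amazonaws.services.kinesis.model.ResourceNotFoundException

// Fail fast with a clear error instead of silently creating a new stream
def ensureStreamExists(client: AmazonKinesisClient, streamName: String): Unit =
  try {
    client.describeStream(streamName)
  } catch {
    case _: ResourceNotFoundException =>
      sys.error(s"Kinesis stream '$streamName' does not exist - check your configuration")
  }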

7. Improved Elasticsearch index initialization

We now recommend that when setting up your Elasticsearch index, you turn off tokenization of string fields. You can do this by choosing “keyword” as the default analyzer:

curl -XPUT 'http://localhost:9200/snowplow' -d '{
  "settings": {
    "analysis": {
      "analyzer": {
        "default": {
          "type": "keyword"
        }
      }
    }
  },
  "mappings": {
    "enriched": {
      "_timestamp": {
        "enabled": "yes",
        "path": "collector_tstamp"
      },
      "_ttl": {
        "enabled": true,
        "default": "604800000"
      },
      "properties": {
        "geo_location": {
          "type": "geo_point"
        }
      }
    }
  }
}'

With the keyword analyzer, each string field is indexed as a single term rather than being broken up into tokens, so searches and aggregations operate on whole values such as complete page URLs.

8. Other changes

We have also made a number of smaller improvements and fixes; see the release notes on GitHub for the full list.

9. Upgrading

The Kinesis apps for r65 Scarlet Rosefinch are now all available in a single zip file here:

http://dl.bintray.com/snowplow/snowplow-generic/snowplow_kinesis_r65_scarlet_rosefinch.zip 

Upgrading will require various configuration changes to each of the four applications:

  1. Scala Stream Collector: add the new buffer and backoffPolicy sections to your configuration file (see sections 1 and 3 above)
  2. Scala Kinesis Enrich: add the buffer and backoffPolicy sections, and update your command-line arguments to use the new --resolver option (see section 4 above)
  3. Kinesis LZO S3 Sink
  4. Kinesis Elasticsearch Sink

Full upgrade details for each application are available on the wiki.

And that’s it – you should now be fully upgraded!

10. Getting help

For more details on this release, please check out the r65 Scarlet Rosefinch release notes on GitHub.

Documentation for all the Kinesis apps is available on the wiki.

If you have any questions or run into any problems, please raise an issue or get in touch with us through the usual channels.
