Snowplow 65 Scarlet Rosefinch released

08 May 2015  •  Fred Blundun

We are pleased to announce the release of Snowplow 65, Scarlet Rosefinch. This release greatly improves the speed, efficiency, and reliability of Snowplow’s real-time Kinesis pipeline.

Table of contents:

  1. Enhanced performance
  2. CORS support
  3. Increased reliability
  4. Loading configuration from DynamoDB
  5. Randomized partition keys for bad streams
  6. Removal of automatic stream creation
  7. Improved Elasticsearch index initialization
  8. Other changes
  9. Upgrading
  10. Getting help


1. Enhanced performance

Kinesis’ new PutRecords API enabled the biggest performance improvement: rather than sending events to Kinesis one at a time, we can now send batches of up to 500. This greatly increases the event volumes which the Scala Stream Collector and Scala Kinesis Enrich can handle.
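To make the batching idea concrete, here is a minimal Python sketch of splitting a list of pending events into groups of at most 500, the PutRecords limit mentioned above. The function name `batches` is hypothetical and this is not the code used by the Snowplow apps, only an illustration of the technique.

```python
def batches(records, max_batch_size=500):
    """Yield successive slices of `records`, each holding at most
    `max_batch_size` items (500 is the PutRecords limit cited above)."""
    for start in range(0, len(records), max_batch_size):
        yield records[start:start + max_batch_size]


if __name__ == "__main__":
    pending = [f"event-{i}".encode("utf-8") for i in range(1200)]
    # Each batch would become one PutRecords request instead of
    # one request per event.
    print([len(batch) for batch in batches(pending)])
```

With 1,200 pending events this produces three requests rather than 1,200.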

You might not want to always wait for a full 500 events before sending the stored records to Kinesis, so the configuration for both applications now has a buffer section which provides greater control over when the stored records get sent.

It has three fields:

  1. byte-limit: if the stored records total at least this many bytes, flush the buffer
  2. record-limit: if at least this many records are in the buffer, flush the buffer
  3. time-limit: if at least this many milliseconds have passed since the buffer was last flushed, flush the buffer

An example with sensible defaults:

buffer: {
    byte-limit: 4500000 # 4.5MB
    record-limit: 500 # 500 records
    time-limit: 60000 # 1 minute
}
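The interaction of the three limits can be sketched as follows. This is an illustrative Python model under stated assumptions, not the apps' actual implementation: the class name `EventBuffer`, the `sink` callback and the injectable `clock` are all hypothetical, and in this simplified version the time limit is only checked when a new record arrives.

```python
import time


class EventBuffer:
    """Hypothetical buffer that flushes when any one of the three limits is hit."""

    def __init__(self, byte_limit, record_limit, time_limit_ms, sink,
                 clock=time.monotonic):
        self.byte_limit = byte_limit
        self.record_limit = record_limit
        self.time_limit_ms = time_limit_ms
        self.sink = sink            # callable that receives a list of records
        self.clock = clock          # returns seconds; injectable for testing
        self.records = []
        self.byte_count = 0
        self.last_flush = clock()

    def should_flush(self):
        elapsed_ms = (self.clock() - self.last_flush) * 1000
        return (self.byte_count >= self.byte_limit
                or len(self.records) >= self.record_limit
                or elapsed_ms >= self.time_limit_ms)

    def add(self, record):
        """Store one record (bytes) and flush if a limit has been reached."""
        self.records.append(record)
        self.byte_count += len(record)
        if self.should_flush():
            self.flush()

    def flush(self):
        """Hand all stored records to the sink and reset the counters."""
        if self.records:
            self.sink(self.records)
        self.records = []
        self.byte_count = 0
        self.last_flush = self.clock()


if __name__ == "__main__":
    sent = []
    buf = EventBuffer(byte_limit=4500000, record_limit=500,
                      time_limit_ms=60000, sink=sent.append)
    for i in range(1000):
        buf.add(b"event")
    print(len(sent))  # number of batches handed to the sink
```

Lower limits reduce latency at the cost of more, smaller requests. Higher limits do the opposite.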

Additionally, the Scala Stream Collector has a ShutdownHook which sends all stored records. This prevents stored events from being lost when the collector is shut down cleanly.

2. CORS support

The Scala Stream Collector now supports CORS requests, so you can send events to it from Snowplow’s client-side JavaScript Tracker using POST rather than GET. This means that your requests are no longer subject to Internet Explorer’s querystring size limit.

The Scala Stream Collector also now supports cross-origin requests from the Snowplow ActionScript 3.0 Tracker.

3. Increased reliability

If an attempt to write records to Kinesis failed, previous versions of the Kinesis apps would just log an error message and the events were lost. The new release retries failed PutRecords requests using an exponential backoff strategy with jitter, so events are no longer lost when Kinesis is temporarily unreachable.

The minimum and maximum backoffs to use are configurable in the backoffPolicy for the Scala Stream Collector and Scala Kinesis Enrich:

backoffPolicy: {
    minBackoff: 3000 # 3 seconds
    maxBackoff: 600000 # 10 minutes
}
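For readers unfamiliar with the technique, the following Python sketch shows one common form of exponential backoff with jitter. The exact formula the apps use is not given in this post, so treat the doubling rule and the uniform jitter here as assumptions. The function name `backoff_delays` is hypothetical.

```python
import random


def backoff_delays(min_backoff_ms, max_backoff_ms, attempts, rng=random.random):
    """Yield one jittered delay in milliseconds per retry attempt.

    The ceiling starts at min_backoff_ms and doubles after every attempt,
    capped at max_backoff_ms. Each delay is drawn uniformly between the
    minimum and the current ceiling, which spreads out retries from
    several clients that failed at the same moment.
    """
    ceiling = min_backoff_ms
    for _ in range(attempts):
        yield min_backoff_ms + rng() * (ceiling - min_backoff_ms)
        ceiling = min(max_backoff_ms, ceiling * 2)


if __name__ == "__main__":
    for attempt, delay in enumerate(backoff_delays(3000, 600000, 10), start=1):
        print(f"retry {attempt}: wait {delay:.0f} ms")
```

In a real retry loop, the caller would sleep for each delay and stop as soon as the request succeeds.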

4. Loading configuration from DynamoDB

The command-line arguments to Scala Kinesis Enrich have also changed. It used to be the case that you provided a --config argument, pointing to a HOCON file with configuration for the app, together with an optional --enrichments argument, pointing to a directory containing the JSON configurations for the enrichments you wanted to make use of:

./scala-kinesis-enrich-0.4.0 --config my.conf --enrichments path/to/enrichments

Scarlet Rosefinch makes three changes:

  • The “resolver” section of the configuration HOCON has been split into a separate JSON file which should be specified using the command line argument --resolver.
  • If you want to get the resolver JSON and the enrichment JSONs from the local filesystem, you need to preface their filepaths with “file:”.
  • You can now also load the resolver JSON and enrichment JSONs from DynamoDB.

To recreate the pre-r65 behavior, convert the resolver section of your configuration HOCON to JSON and put it in its own file, “iglu_resolver.json”. Then start the enricher like this:

./scala-kinesis-enrich-0.5.0 --config my.conf --resolver file:iglu_resolver.json --enrichments file:path/to/enrichments

To get the resolver from DynamoDB, create a table named “snowplow_config” with hashkey “id” and add an item to the table of the following form:

{
    "id": "iglu_resolver",
    "json": "{The resolver as a JSON string}"
}

Then provide the resolver argument as follows:

--resolver dynamodb:us-east-1/snowplow_config/iglu_resolver

To get the enrichments from DynamoDB, the enrichment JSONs must all be stored in a table - you can reuse the “snowplow_config” table. The enrichments’ hash keys should have a common prefix, for example “enrich_”:

{
    "id": "enrich_anon_ip",
    "json": "{anon_ip enrichment configuration as a JSON string}"
}

{
    "id": "enrich_ip_lookups",
    "json": "{ip_lookups enrichment configuration as a JSON string}"
}

Then provide the enrichments argument as follows:

--enrichments dynamodb:us-east-1/snowplow_config/enrich_

If you are using a different AWS region, replace “us-east-1” accordingly.

The full command:

./scala-kinesis-enrich-0.5.0 --config my.conf \
  --resolver dynamodb:us-east-1/snowplow_config/iglu_resolver \
  --enrichments dynamodb:us-east-1/snowplow_config/enrich_
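The two argument forms described in this section ("file:" followed by a path, and "dynamodb:" followed by region/table/key) can be told apart with a small parser. The Python sketch below only illustrates the format. The names `ConfigLocation` and `parse_location` are hypothetical, and Scala Kinesis Enrich's own parsing may differ in its details.

```python
from typing import NamedTuple, Optional


class ConfigLocation(NamedTuple):
    scheme: str                   # "file" or "dynamodb"
    path: Optional[str] = None    # set for file: locations
    region: Optional[str] = None  # the remaining fields are set for dynamodb:
    table: Optional[str] = None
    key: Optional[str] = None     # full hash key, or a shared key prefix


def parse_location(arg: str) -> ConfigLocation:
    """Split a --resolver or --enrichments value into its components."""
    scheme, sep, rest = arg.partition(":")
    if not sep or not rest:
        raise ValueError(f"expected 'file:' or 'dynamodb:' prefix in {arg!r}")
    if scheme == "file":
        return ConfigLocation("file", path=rest)
    if scheme == "dynamodb":
        parts = rest.split("/")
        if len(parts) != 3 or not all(parts):
            raise ValueError(f"expected dynamodb:region/table/key in {arg!r}")
        region, table, key = parts
        return ConfigLocation("dynamodb", region=region, table=table, key=key)
    raise ValueError(f"unsupported scheme {scheme!r} in {arg!r}")


if __name__ == "__main__":
    print(parse_location("file:iglu_resolver.json"))
    print(parse_location("dynamodb:us-east-1/snowplow_config/enrich_"))
```

For the enrichments argument, the last component is the common prefix ("enrich_") rather than a single item's key.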

5. Randomized partition keys for bad streams

When sending a record to Kinesis, you provide a partition key. The hash of this key determines which shard will process the record. The good events generated by Scala Kinesis Enrich use the user_ipaddress field of the event as the key, so that events from a single user will all be in the right order in the same shard.

Previously, the bad events generated by Scala Kinesis Enrich and the Kinesis Elasticsearch Sink all had the same partition key. This meant that no matter how many shards the stream for bad records contained, all bad records would be processed by the same shard. Scarlet Rosefinch fixes this by using random partition keys for bad events.
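The effect is easy to see in a small simulation. Kinesis maps a partition key to a shard by taking the MD5 hash of the key as a 128-bit integer and finding the shard whose hash key range contains it. The Python sketch below assumes the shards split that range evenly. The function `shard_for` and the constant key "bad" are hypothetical stand-ins, not values taken from the apps.

```python
import hashlib
import uuid


def shard_for(partition_key: str, shard_count: int) -> int:
    """Return the index of the shard that would receive this key,
    assuming the 128-bit hash key space is divided evenly."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    hash_value = int.from_bytes(digest, "big")
    return hash_value * shard_count // 2**128


if __name__ == "__main__":
    shards = 4
    # A constant key always lands on the same shard...
    print({shard_for("bad", shards) for _ in range(1000)})
    # ...while random keys spread across all of them.
    print({shard_for(str(uuid.uuid4()), shards) for _ in range(1000)})
```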

6. Removal of automatic stream creation

The Kinesis apps will no longer automatically create a stream if they detect that the configured stream does not exist.

This means that if you make a typo when configuring the stream name, the mistake will immediately cause an error rather than creating an incorrectly named new stream which no other app knows about.

7. Improved Elasticsearch index initialization

We now recommend that when setting up your Elasticsearch index, you turn off tokenization of string fields. You can do this by choosing “keyword” as the default analyzer:

curl -XPUT 'http://localhost:9200/snowplow' -d '{
    "settings": {
        "analysis": {
            "analyzer": {
                "default": {
                    "type": "keyword"
                }
            }
        }
    },
    "mappings": {
        "enriched": {
            "_timestamp" : {
                "enabled" : "yes",
                "path" : "collector_tstamp"
            },
            "_ttl": {
              "enabled":true,
              "default": "604800000"
            },
            "properties": {
                "geo_location": {
                    "type": "geo_point"
                }
            }
        }
    }
}'

This has two positive effects:

  • URLs are no longer incorrectly tokenized
  • Not having to tokenize every string field improves the performance of the Elasticsearch cluster

8. Other changes

We have also:

  • Started logging the names of the streams to which the Scala Stream Collector and Scala Kinesis Enrich write events (#1503, #1493)
  • Added macros to the “config.hocon.sample” sample configuration files (#1471, #1472, #1513, #1515)
  • Fixed a bug which caused the Kinesis Elasticsearch Sink to silently drop inputs containing fewer than 24 tab-separated fields (#1584)
  • Fixed a bug which prevented the applications from using a DynamoDB table in the configured region (#1576, #1582, #1583)
  • Added the ability to prevent the Scala Stream Collector from setting 3rd-party cookies by setting the cookie expiration field to 0 (#1363)
  • Bumped the version of Scala Common Enrich used by Scala Kinesis Enrich to 0.13.1 (#1618)
  • Bumped the version of Scalazon we use to 0.11 to access PutRecords (#1492, #1504)
  • Stopped Scala Kinesis Enrich outputting records of over 50kB because they exceed Kinesis’ size limit (#1649)

9. Upgrading

The Kinesis apps for r65 Scarlet Rosefinch are now all available in a single zip file here:

http://dl.bintray.com/snowplow/snowplow-generic/snowplow_kinesis_r65_scarlet_rosefinch.zip

Upgrading will require various configuration changes to each of the four applications:

Scala Stream Collector

  • Add backoffPolicy and buffer fields to the configuration HOCON

Scala Kinesis Enrich

  • Add backoffPolicy and buffer fields to the configuration HOCON
  • Extract the resolver from the configuration HOCON into its own JSON file, which can be stored locally or in DynamoDB
  • Update the command line arguments as detailed above

Kinesis LZO S3 Sink

  • Rename the outermost key in the configuration HOCON from “connector” to “sink”
  • Replace the “s3/endpoint” field with an “s3/region” field (such as “us-east-1”)

Kinesis Elasticsearch Sink

  • Rename the outermost key in the configuration HOCON from “connector” to “sink”

And that’s it - you should now be fully upgraded!

10. Getting help

For more details on this release, please check out the r65 Scarlet Rosefinch release on GitHub.

Documentation for all the Kinesis apps is available on the wiki.

If you have any questions or run into any problems, please raise an issue or get in touch with us through the usual channels.