08 October 2016  •  Releases  •  Joshua Beemster

Snowplow 84 Steller's Sea Eagle released with Elasticsearch 2.x support

We are pleased to announce the release of Snowplow 84 Steller’s Sea Eagle. This release brings support for Elasticsearch 2.x to the Kinesis Elasticsearch Sink for both Transport and HTTP clients.

  1. Elasticsearch 2.x support
  2. Elasticsearch Sink buffer
  3. Override the network_id cookie with nuid param
  4. Hardcoded cookie path
  5. Migrating Redshift assets to Iglu Central
  6. Other changes
  7. Upgrading
  8. Roadmap
  9. Getting help

stellers-sea-eagle

1. Elasticsearch 2.x support

This release brings full support for Elasticsearch 2.x for both the HTTP and Transport clients. This lets you use the AWS Elasticsearch Service running ES 2.3, or indeed upgrade your self-hosted Elasticsearch cluster to version 2.x.

We delivered this by dynamically building a Kinesis Elasticsearch Sink binary for both Elasticsearch versions - you’ll choose the binary appended by _1x or _2x as appropriate.

One important thing to flag is that Elasticsearch 2.x no longer allows field names to contain periods (.). While we never used periods within Snowplow or Iglu Central property names, your team may have created some, like so:

{
  "schema": "iglu:com.acme/event/jsonschema/1-0-0",
  "data": {
    "field.with.dots": "value"
  }
}

Previously, this would have loaded into Elasticsearch like so:

> com_acme_event_1: [{"field.with.dots": value}]

From this release on, we are automatically converting the field name’s periods to underscores, whether you are loading Elasticsearch 1.x or 2.x:

> com_acme_event_1: [{"field_with_dots": value}]

For more information please see issue #2894.

2. Elasticsearch Sink buffer

Community member Stephane Maarek flagged to us that our Kinesis Elasticsearch Sink’s buffer settings did not seem to be working correctly.

We investigated and found an underlying issue in the Kinesis Connectors library, where every record in a GetRecords call is added to the buffer for sinking, without checking between additions whether or not the buffer should be flushed.

In the case that your Elasticsearch Sink has to catch up with a very large number of records, and your maxRecords setting is set to 10,000, this can leave the sink struggling to emit to Elasticsearch, because the buffer will be too large to send reliably.

To work around this issue, we updated our Elasticsearch Emitter code to also respect the buffer settings. The new approach works as follows:

  • The Emitter receives a buffer of events from the Kinesis stream (up to 10,000)
  • This buffer is split based on your record and byte buffer settings
  • Each slice of the buffer is sent in succession
  • Once all slices are processed the application checkpoints

It is important that you tune your record and byte limits to match the cluster you are pushing events to. If the limits are set too high you might not be able to emit successfully very often; if your limits are too low then your event sinking to Elasticsearch will be inefficient.

For more information on this issue and the corresponding commit please see issue #2895.

This release adds the ability to update your Scala Stream Collector’s cookie’s value with the nuid a.k.a. network_userid parameter. If a nuid value is available within the querystring of your request, this value will then be used to update the cookie’s value.

This feature is only available through a querystring parameter lookup, so only works for GET requests at the present.

Many thanks to Christoph Buente from Snowplow user LiveIntent for this contribution!

For more information and the reasoning behind this update please see issue #2512.

4. Hardcoded cookie path

To ensure that the cookie path is always valid we have updated the Scala Stream Collector to statically set the cookie path to “/”. This is to avoid situations where a path resource such as “/r/tp2” results in the cookie path ending up at “/r”. Endpoints such as “/i” do not suffer from this issue.

Thanks again to Christoph Buente for this contribution!

For more information on this please see issue #2524.

5. Migrating Redshift assets to Iglu Central

An Iglu schema registry typically consists of schemas, Redshift table definitions and JSON Paths files - see our Iglu example schema registry for an example.

When we originally started building out Iglu Central, we diverted from this approach and stored the Redshift table definitions and JSON Paths files for Iglu Central schemas in the main Snowplow repository, snowplow/snowplow. You’ll see from the Snowplow CHANGELOG that many Snowplow releases have included Redshift and JSON Paths files, to complement specific schemas in new Iglu Central releases.

In hindsight, splitting the Iglu Central resources across two separate code repositories was a mistake:

  • It made it harder to cross-check a JSON Schema with its associated Redshift artifacts
  • It meant that the full release of a new schema in Iglu Central has to wait on a (often unrelated) Snowplow release
  • It was a blocker on our plans to integrate table definitions and data mapping files (like JSON Paths) more closely into Iglu schema registries

Therefore, in this release we have moved all Redshift table definitions and JSON Paths files from the main Snowplow repository into Iglu Central, specifically in these paths:

Our hosting of these files for the correct operation of Snowplow is unchanged, and the Snowpow repository continues to hold current and previous definitions of the atomic.events table, plus corresponding migration scripts.

6. Other changes

This release also contains further changes, notably:

  • Allowing the HTTP client timeouts for the Kinesis Elasticsearch Sink to be configured from the HOCON configuration file (issue #2897)
  • Adding environment variable resolution to the Scala Stream Collector, Stream Enrich and the Kinesis Elasticsearch Sink through Config.resolver(). Many thanks to community member Shin for contributing this
  • Upgrading Stream Enrich to the latest version of Scala Common Enrich to make available all of the new features and latest bug fixes

7. Upgrading

The Kinesis apps for R84 Stellers Sea Eagle are available in the following zipfiles:

http://dl.bintray.com/snowplow/snowplow-generic/snowplow_scala_stream_collector_0.8.0.zip
http://dl.bintray.com/snowplow/snowplow-generic/snowplow_stream_enrich_0.9.0.zip
http://dl.bintray.com/snowplow/snowplow-generic/snowplow_kinesis_elasticsearch_sink_0.8.0_1x.zip
http://dl.bintray.com/snowplow/snowplow-generic/snowplow_kinesis_elasticsearch_sink_0.8.0_2x.zip

Or you can download all of the apps together in this zipfile:

https://dl.bintray.com/snowplow/snowplow-generic/snowplow_kinesis_r84_stellers_sea_eagle.zip

Only the Elasticsearch Sink app config has changed. The change does not include breaking config changes. To upgrade the Elasticsearch Sink:

  • Install the new Elasticsearch Sink app on each server in your Elasticsearch Sink auto-scaling group
  • Update your Elasticsearch Sink config with the new elasticsearch.client.http section:
    • elasticsearch.client.http.conn-timeout
    • elasticsearch.client.http.read-timeout

NOTE: These timeouts are optional and will default to 300000 if they cannot be found in your Config.

Here is the updated config file template:

sink {
  ...
  elasticsearch {

    # Events are indexed using an Elasticsearch Client
    # - type: http or transport (will default to transport)
    # - endpoint: the cluster endpoint
    # - port: the port the cluster can be accessed on
    #   - for http this is usually 9200
    #   - for transport this is usually 9300
    # - max-timeout: the maximum attempt time before a client restart
    client {
      type: ""
      endpoint: ""
      port: 
      max-timeout: ""

      # Section for configuring the HTTP client
      http {
        conn-timeout: ""
        read-timeout: ""
      }
    }
  ...
}

Then:

  • Update your supervisor process to point to the new Kinesis Elasticsearch Sink app
  • Restart the supervisor process on each server running the sink

8. Roadmap

We have renamed the upcoming milestones for Snowplow to be more flexible around the ultimate sequencing of releases. Upcoming Snowplow releases, in no particular order, include:

  • R8x [HAD] 4 webhooks, which will add support for 4 new webhooks: Mailgun, Olark, Unbounce, StatusGator
  • R8x [HAD] Synthetic dedupe, which will deduplicate events with the same event_id but different event_fingerprints (synthetic duplicates) in Hadoop Shred

Note that these releases are always subject to change between now and the actual release date.

9. Getting help

For more details on this release, please check out the release notes on GitHub.

If you have any questions or run into any problems, please raise an issue or get in touch with us through the usual channels.

Thoughts or questions? Come join us in our Discourse forum!
Joshua Beemster
Josh is a data engineer at Snowplow. You can find him on and LinkedIn.