Snowplow 84 Steller's Sea Eagle released with Elasticsearch 2.x support

Share

We are pleased to announce the release of Snowplow 84 Steller’s Sea Eagle. This release brings support for Elasticsearch 2.x to the Kinesis Elasticsearch Sink for both Transport and HTTP clients.

  1. Elasticsearch 2.x support
  2. Elasticsearch Sink buffer
  3. Override the network_id cookie with nuid param
  4. Hardcoded cookie path
  5. Migrating Redshift assets to Iglu Central
  6. Other changes
  7. Upgrading
  8. Roadmap
  9. Getting help

stellers-sea-eagle

1. Elasticsearch 2.x support

This release brings full support for Elasticsearch 2.x for both the HTTP and Transport clients. This lets you use the AWS Elasticsearch Service running ES 2.3, or indeed upgrade your self-hosted Elasticsearch cluster to version 2.x.

We delivered this by dynamically building a Kinesis Elasticsearch Sink binary for both Elasticsearch versions – you’ll choose the binary appended by _1x or _2x as appropriate.

One important thing to flag is that Elasticsearch 2.x no longer allows field names to contain periods (.). While we never used periods within Snowplow or Iglu Central property names, your team may have created some, like so:

{ "schema": "iglu:com.acme/event/jsonschema/1-0-0", "data": { "field.with.dots": "value" } }

Previously, this would have loaded into Elasticsearch like so:

> com_acme_event_1: [{"field.with.dots": value}]

From this release on, we are automatically converting the field name’s periods to underscores, whether you are loading Elasticsearch 1.x or 2.x:

> com_acme_event_1: [{"field_with_dots": value}]

For more information please see issue #2894.

2. Elasticsearch Sink buffer

Community member Stephane Maarek flagged to us that our Kinesis Elasticsearch Sink’s buffer settings did not seem to be working correctly.

We investigated and found an underlying issue in the Kinesis Connectors library, where every record in a GetRecords call is added to the buffer for sinking, without checking between additions whether or not the buffer should be flushed.

In the case that your Elasticsearch Sink has to catch up with a very large number of records, and your maxRecords setting is set to 10,000, this can leave the sink struggling to emit to Elasticsearch, because the buffer will be too large to send reliably.

To work around this issue, we updated our Elasticsearch Emitter code to also respect the buffer settings. The new approach works as follows:

It is important that you tune your record and byte limits to match the cluster you are pushing events to. If the limits are set too high you might not be able to emit successfully very often; if your limits are too low then your event sinking to Elasticsearch will be inefficient.

For more information on this issue and the corresponding commit please see issue #2895.

This release adds the ability to update your Scala Stream Collector’s cookie’s value with the nuid a.k.a. network_userid parameter. If a nuid value is available within the querystring of your request, this value will then be used to update the cookie’s value.

This feature is only available through a querystring parameter lookup, so only works for GET requests at the present.

Many thanks to Christoph Buente from Snowplow user LiveIntent for this contribution!

For more information and the reasoning behind this update please see issue #2512.

4. Hardcoded cookie path

To ensure that the cookie path is always valid we have updated the Scala Stream Collector to statically set the cookie path to “/”. This is to avoid situations where a path resource such as “/r/tp2” results in the cookie path ending up at “/r”. Endpoints such as “/i” do not suffer from this issue.

Thanks again to Christoph Buente for this contribution!

For more information on this please see issue #2524.

5. Migrating Redshift assets to Iglu Central

An Iglu schema registry typically consists of schemas, Redshift table definitions and JSON Paths files – see our Iglu example schema registry for an example.

When we originally started building out Iglu Central, we diverted from this approach and stored the Redshift table definitions and JSON Paths files for Iglu Central schemas in the main Snowplow repository, snowplow/snowplow. You’ll see from the Snowplow CHANGELOG that many Snowplow releases have included Redshift and JSON Paths files, to complement specific schemas in new Iglu Central releases.

In hindsight, splitting the Iglu Central resources across two separate code repositories was a mistake:

Therefore, in this release we have moved all Redshift table definitions and JSON Paths files from the main Snowplow repository into Iglu Central, specifically in these paths:

Our hosting of these files for the correct operation of Snowplow is unchanged, and the Snowpow repository continues to hold current and previous definitions of the atomic.events table, plus corresponding migration scripts.

6. Other changes

This release also contains further changes, notably:

7. Upgrading

The Kinesis apps for R84 Stellers Sea Eagle are available in the following zipfiles:

http://dl.bintray.com/snowplow/snowplow-generic/snowplow_scala_stream_collector_0.8.0.zip http://dl.bintray.com/snowplow/snowplow-generic/snowplow_stream_enrich_0.9.0.zip http://dl.bintray.com/snowplow/snowplow-generic/snowplow_kinesis_elasticsearch_sink_0.8.0_1x.zip http://dl.bintray.com/snowplow/snowplow-generic/snowplow_kinesis_elasticsearch_sink_0.8.0_2x.zip 

Or you can download all of the apps together in this zipfile:

https://dl.bintray.com/snowplow/snowplow-generic/snowplow_kinesis_r84_stellers_sea_eagle.zip 

Only the Elasticsearch Sink app config has changed. The change does not include breaking config changes. To upgrade the Elasticsearch Sink:

NOTE: These timeouts are optional and will default to 300000 if they cannot be found in your Config.

Here is the updated config file template:

sink { ... elasticsearch { # Events are indexed using an Elasticsearch Client # - type: http or transport (will default to transport) # - endpoint: the cluster endpoint # - port: the port the cluster can be accessed on # - for http this is usually 9200 # - for transport this is usually 9300 # - max-timeout: the maximum attempt time before a client restart client { type: "" endpoint: "" port: max-timeout: "" # Section for configuring the HTTP client http { conn-timeout: "" read-timeout: "" } } ... }

Then:

8. Roadmap

We have renamed the upcoming milestones for Snowplow to be more flexible around the ultimate sequencing of releases. Upcoming Snowplow releases, in no particular order, include:

Note that these releases are always subject to change between now and the actual release date.

9. Getting help

For more details on this release, please check out the release notes on GitHub.

If you have any questions or run into any problems, please raise an issue or get in touch with us through the usual channels.

Share

Related articles