Both Scala Kinesis Enrich and Kinesis Elasticsearch Sink now have the ability to record Snowplow events from within the applications themselves. These events include:
- `heartbeat`, which is sent every 5 minutes so we know that the app is still alive-and-kicking
- `warning` events, e.g. if no enrichment configurations were found by Scala Kinesis Enrich
- `failure` events, e.g. a failure in pushing events to the Kinesis streams or Elasticsearch
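A minimal sketch of how such a periodic heartbeat could be scheduled (this is an illustrative stand-in, not the apps' actual tracker code; `Heartbeat` and `onBeat` are hypothetical names):

```scala
import java.util.concurrent.{Executors, TimeUnit}

// Hypothetical heartbeat scheduler: fires a callback at a fixed interval,
// standing in for the tracker call the apps make every 5 minutes.
class Heartbeat(intervalMs: Long, onBeat: () => Unit) {
  private val scheduler = Executors.newSingleThreadScheduledExecutor()

  def start(): Unit = {
    // Fire immediately, then every intervalMs thereafter
    scheduler.scheduleAtFixedRate(() => onBeat(), 0L, intervalMs, TimeUnit.MILLISECONDS)
    ()
  }

  def stop(): Unit = scheduler.shutdown()
}
```

In the real apps the interval is 5 minutes and the callback sends a Snowplow event via the tracker rather than a plain function call.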
Adding Snowplow tracking to our Kinesis applications is exciting for two reasons:
Note that so far Snowplow tracking has not yet been added to the Scala Stream Collector; this will be added in a subsequent release.
Previously the Scala Stream Collector was unable to handle events that exceeded the maximum byte limit of a Kinesis record: large `POST` payloads, for example, were simply discarded because they could not be written to Kinesis. With this release, the collector can now "break apart" outsized payloads containing multiple events into smaller payloads which will fit into a Kinesis record. Combine this with the recent increase in the allowed Kinesis record `PUT` size from 50kB to 1MB, and there should now be very few scenarios in which an event payload has to be discarded for being outsized.
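The splitting idea can be sketched as follows (a simplified illustration, not the collector's actual implementation; `splitBatch` is a hypothetical name). Serialized events are greedily grouped into sub-batches, each kept under the record byte limit:

```scala
// Break an outsized batch of serialized events into sub-batches that each
// fit under a maximum byte size (e.g. the Kinesis record limit).
// A single event larger than maxBytes still ends up alone in its own
// sub-batch -- in the collector such events go to the bad stream instead.
def splitBatch(events: List[Array[Byte]], maxBytes: Long): List[List[Array[Byte]]] =
  events.foldLeft(List(List.empty[Array[Byte]])) { (acc, ev) =>
    acc match {
      case current :: rest =>
        val currentSize = current.map(_.length.toLong).sum
        if (current.nonEmpty && currentSize + ev.length > maxBytes)
          List(ev) :: current :: rest // start a new sub-batch
        else
          (ev :: current) :: rest     // add to the current sub-batch
      case Nil => List(List(ev))
    }
  }.map(_.reverse).reverse
```

For example, three 400-byte events with a 1000-byte limit would be split into two sub-batches of two and one events respectively.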
This said, at Snowplow we strongly believe that any event processing component which could encounter processing failures (however rare) should have a "stderr" output to record those failures. To accomplish this, Bohemian Waxwing adds a `bad` output stream to the collector. As a first use case for this new stream, outsized payloads which cannot be written to Kinesis (essentially single `POST` events which are larger than 1MB) will be written to the `bad` stream along with the error and the total size in bytes.
All the Kinesis apps are capable of emitting bad rows corresponding to failed events. Previously these bad rows had only a `line` field, containing the body of the failed event, plus an `errors` field, containing a non-empty list of problems with the event. In Bohemian Waxwing we add a `timestamp` field containing the time at which the event failed.
This makes it easier to monitor the progress of applications which consume failed events; it also makes it easier to analyze these bad rows in Elasticsearch/Kibana.
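Putting the three fields together, a bad row would look something like this (all values below are purely illustrative):

```json
{
  "line": "raw-body-of-the-failed-event",
  "errors": [
    "Record is too large to fit into a Kinesis record"
  ],
  "timestamp": "2015-07-13T12:00:00.000Z"
}
```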
Building the Snowplow apps using `sbt assembly` in the Vagrant virtual machine is a very I/O-intensive operation. To speed up this process, we have added comments to the project's Vagrantfile indicating how to use NFS and how to allow the VM to use multiple cores.
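The relevant Vagrantfile settings look something like the following (a sketch assuming the VirtualBox provider; exact values will depend on your host machine):

```ruby
Vagrant.configure("2") do |config|
  # Share the project folder over NFS rather than the default VirtualBox share
  config.vm.synced_folder ".", "/vagrant", type: "nfs"
  config.vm.network "private_network", type: "dhcp" # NFS requires a private network

  config.vm.provider "virtualbox" do |vb|
    vb.cpus = 4 # let sbt assembly use multiple cores
  end
end
```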
Since the Kinesis S3 Sink is not Snowplow-specific but can be used to mirror arbitrary data from Kinesis to S3, we have moved it from the main Snowplow repo into a repository of its own. There have been two releases of Kinesis S3 since extracting it into its own repo: 0.2.1 and 0.3.0.
We have also:

- Fixed a bug whereby `etl_tstamp` was not correctly formatted (#1842)
The Kinesis apps for r67 Bohemian Waxwing are now all available in a single zip file here:
Upgrading will require various configuration changes to each of the three applications’ HOCON configuration files:
- Rename `collector.sink.kinesis.stream` to `collector.sink.kinesis.stream.good` in the HOCON
- Add `collector.sink.kinesis.stream.bad` to the HOCON
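After the change, the stream section of the collector's HOCON configuration would look roughly like this (the stream names are illustrative):

```hocon
collector {
  sink {
    kinesis {
      stream {
        good = "snowplow-collector-good" # stream for successfully collected payloads
        bad  = "snowplow-collector-bad"  # stream for payloads which could not be written
      }
    }
  }
}
```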
If you want to include Snowplow tracking for this application please append the following:
Note that this is a wholly optional section; if you do not want to send application events to a second Snowplow instance, simply do not add this to your configuration file.
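The precise keys are given in the example configuration file bundled with each app; purely as an illustration of the shape such a section takes, it tells the application where to send its own Snowplow events (all key names below are hypothetical):

```hocon
# Hypothetical sketch only -- consult the bundled example config for the real keys
monitoring {
  snowplow {
    collector-uri = "snplow.example.com" # endpoint receiving the app's own events
    app-id = "kinesis-enrich"            # identifies which application sent them
  }
}
```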
For a complete example, see our
`location` fields into the
Again, note that Snowplow tracking is a wholly optional section.
For a complete example, see our
And that’s it - you should now be fully upgraded!
For more details on this release, please check out the r67 Bohemian Waxwing release notes on GitHub.
If you have any questions or run into any problems, please raise an issue or get in touch with us through the usual channels.