We are excited to announce the release of Snowplow R97 Knossos. The release is named after Knossos (Greek: Κνωσός), the palace at the nexus of the Minoan civilization, which was one of the earliest major Grecian civilizations (17th century BCE).
This release is primarily about supporting four new webhook sources, but it also contains other small improvements for the Snowplow batch pipeline. We are initially adding these webhooks to the batch pipeline; support in the real-time pipeline will follow shortly.
Specifically, the four new services from which Snowplow can now receive events are:
- Mailgun – for tracking email and email-related events delivered by Mailgun
- Olark – for chat transcript events from Olark
- StatusGator – for availability events covering hundreds of cloud services, as reported by StatusGator
- Unbounce – for lead generation events from Unbounce
Many thanks to previous Snowplow intern Ronny Yabar for breaking ground on these webhooks for us!
Read on for more information:
- Mailgun webhook support
- Olark webhook support
- StatusGator webhook support
- Unbounce webhook support
- EmrEtlRunner improvements
- Other changes
1. Mailgun webhook support
The Mailgun webhook adapter lets you track email and email-related events delivered by Mailgun. Using this functionality, you can warehouse all email-related events alongside your existing Snowplow events.
For help setting up the Mailgun webhook, check out the Mailgun webhook setup page.
All the currently documented Mailgun events are supported by this release: bounce, deliver, drop, spam, unsubscribe, click, and open events.
For technical details, see the Mailgun webhook documentation page.
2. Olark webhook support
The Olark webhook adapter lets you receive the transcripts of Olark chats on your website, including messages left while no support representative was online. Using this functionality, you can track and analyse chat activity alongside your other Snowplow data.
For help setting up the Olark webhook, see the Olark webhook setup page.
3. StatusGator webhook support
StatusGator lets you track the availability of hundreds of SaaS and other cloud services that you may be relying on. Using the webhook integration with StatusGator, you can collect availability events and use them to find correlations with other activity in your Snowplow data (e.g. elevated error rates in your website).
You could also use this webhook to provide alerts to your operations team, writing an AWS Lambda function or similar to emit alerts if specific cloud services experience outages.
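As a rough illustration of the alerting idea, here is a minimal Python sketch of a Lambda-style handler. Everything in it is an assumption made for the example: the field names (`service_name`, `last_status`, `current_status`), the `records` wrapper, the status value `up` and the watch list are hypothetical, so check them against the actual StatusGator event schema before relying on them.

```python
# Hypothetical watch list and healthy status value; adjust to your own setup.
WATCHED_SERVICES = {"Amazon Web Services", "GitHub"}
HEALTHY_STATUS = "up"


def should_alert(record: dict) -> bool:
    """Return True when a watched service has just left the healthy status."""
    return (
        record.get("service_name") in WATCHED_SERVICES
        and record.get("last_status") == HEALTHY_STATUS
        and record.get("current_status") != HEALTHY_STATUS
    )


def handler(event: dict, context: object = None) -> dict:
    """Lambda-style entry point: build alert messages for a batch of records.

    The shape of `event` (a dict with a "records" list) is assumed here.
    """
    alerts = [
        f"{r['service_name']} changed from {r['last_status']} to {r['current_status']}"
        for r in event.get("records", [])
        if should_alert(r)
    ]
    return {"alerts": alerts}
```

In a real deployment the returned messages would be forwarded to whatever channel your operations team uses; that part is deliberately left out of the sketch.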
For help setting up the StatusGator webhook, refer to the StatusGator webhook guide.
4. Unbounce webhook support
Using the Unbounce service you can experiment with different landing pages and variants thereof; Unbounce is a popular tool for lead generation and conversion rate optimization (CRO). Using the Unbounce webhook you can now integrate your lead generation data with the rest of the Snowplow data.
For help setting up the Unbounce webhook, refer to the Unbounce webhook guide.
5. EmrEtlRunner improvements
5.1 Uncompressing raw gzipped files
We have modified the S3DistCp EMR step which copies the raw gzipped log files produced by the Clojure Collector from S3 to HDFS – this step will now uncompress the files in transit. This modification greatly improves performance of the Spark Enrich job: gzipped files are not splittable, so each one was previously processed in its entirety on a single core.
The speedup is most noticeable when working with the large gzipped files emitted by the Clojure Collector. This optimization is only enabled for the specific pairing of Spark Enrich (not Hadoop Enrich) and the Clojure Collector (not our other collectors).
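To see why a gzipped file has to be handled as one unit, here is a small Python sketch. It is not Snowplow code: the sample data and helper names are invented for illustration. It shows that a gzip stream cannot be decoded starting from an arbitrary offset, whereas uncompressed newline-delimited data can be cut into independent chunks.

```python
import gzip
import zlib


def can_decompress_from(data: bytes, offset: int) -> bool:
    """Try to decode a gzip byte stream starting at the given offset."""
    try:
        # 16 + MAX_WBITS tells zlib to expect a gzip header at the start.
        zlib.decompressobj(16 + zlib.MAX_WBITS).decompress(data[offset:])
        return True
    except zlib.error:
        return False


def split_lines(data: bytes, parts: int) -> list[bytes]:
    """Cut uncompressed, newline-delimited data into chunks ending on line boundaries."""
    chunks, start = [], 0
    target = max(1, len(data) // parts)
    while start < len(data):
        # Look for the first newline at or after the target chunk size.
        end = data.find(b"\n", min(start + target, len(data)) - 1)
        end = len(data) if end == -1 else end + 1
        chunks.append(data[start:end])
        start = end
    return chunks


# Invented sample data standing in for a raw collector log.
lines = "".join(f"event {i}\tpage_view\n" for i in range(10_000)).encode()
compressed = gzip.compress(lines)

print(can_decompress_from(compressed, 0))                     # start of the stream
print(can_decompress_from(compressed, len(compressed) // 2))  # middle of the stream
print(len(split_lines(lines, 4)))                             # independent text chunks
```

Each chunk returned by `split_lines` could be handed to a different worker, which is the property the uncompressed files gain.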
5.2 Skipping RDB Loader consistency checks
By default, RDB Loader performs S3-level consistency checks, checking the files for atomic events and shredded types over time, to ensure that Amazon S3’s infamous eventual consistency issue is not going to confound the load.
The problem is that the time these checks take grows linearly with the number of shredded types; as a result, pipelines with a wide array of shredded types are disproportionately slowed down by them.
To reduce friction for such pipelines, it is now possible to skip the S3 consistency checks performed by RDB Loader, using a new EmrEtlRunner option.
Be aware that this option requires an RDB Loader version greater than or equal to 0.13.0.
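To make the cost of such checks concrete, here is a hypothetical Python sketch of a listing-based consistency check. It is not RDB Loader's implementation: the function name, parameters and the idea of one prefix per shredded type are assumptions made for illustration. The point is only that one listing per prefix per pass means the work grows linearly with the number of shredded types.

```python
import time
from typing import Callable, Iterable


def is_consistent(
    prefixes: Iterable[str],
    list_keys: Callable[[str], frozenset],
    checks: int = 3,
    delay_seconds: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True if every prefix lists the same keys on `checks` passes.

    `list_keys` is a caller-supplied function (hypothetical here) that
    returns the set of keys currently visible under a prefix.
    """
    prefixes = list(prefixes)
    # First pass: record what each prefix currently contains.
    baseline = {p: list_keys(p) for p in prefixes}
    # Remaining passes: wait, list again, and compare with the first pass.
    for _ in range(checks - 1):
        sleep(delay_seconds)
        for p in prefixes:
            if list_keys(p) != baseline[p]:
                return False
    return True
```

With `checks` passes over `n` prefixes, the sketch performs `checks * n` listings, so doubling the number of shredded types doubles the time spent checking.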
6. Other changes
In addition to the above we have made the following changes:
- Defaulting the port to 443 when reading a log line with an HTTPS scheme. Many thanks to Mike Robins for this contribution (#3483)
- Tolerating a content-type being set for GET requests sent to Clojure Collector (where previously the content-type had to be empty) (#2743)
- Upgrading the dependency on user-agent-utils to version 1.20 (#2930)
- Plus a host of updates to our Spark Enrich and Scala Common Enrich test suites to make running these tests easier and more predictable
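The port-defaulting change in the first item above can be pictured with a short Python sketch. This is an illustration only, not the code from the Snowplow change: the function name is hypothetical, and the HTTP default of 80 is added here for completeness rather than taken from this release.

```python
from typing import Optional
from urllib.parse import urlsplit

# Conventional default ports per scheme (the http entry is an addition for this example).
DEFAULT_PORTS = {"http": 80, "https": 443}


def effective_port(url: str) -> Optional[int]:
    """Return the explicit port if the URL has one, else the scheme's default."""
    parts = urlsplit(url)
    if parts.port is not None:
        return parts.port
    return DEFAULT_PORTS.get(parts.scheme)


print(effective_port("https://example.com/i"))       # no explicit port
print(effective_port("https://example.com:8443/i"))  # explicit port wins
```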
7. Upgrading
The latest version of EmrEtlRunner is available from our Bintray here.
To benefit from the new webhook integrations, you’ll need to bump the Spark Enrich version used in the EmrEtlRunner configuration file:
For a complete example, see our sample
8. Roadmap
Upcoming Snowplow releases will include:
- R98 Argentomagus, improving security and data resilience for the real-time pipeline. This release will also add R97’s new webhooks to the RT pipeline
- R9x [BAT] Priority fixes, which will include resilience, security and data-quality fixes for the AWS batch pipeline
- GDPR support part 1, which will include data privacy features as mandated by the new EU General Data Protection Regulation.
9. Getting Help
For more details on this release, as always do check out the release notes on GitHub.
If you have any questions or run into any problems, please visit our Discourse forum.