We are pleased to announce the urgent release of Snowplow 105 Pompeii, named after [the famous but ill-fated ancient Roman city][pompeii].
Shortly after the Snowplow 103 Paestum release, open-source user Asger Bachmann noticed an increase in the number of duplicated events emitted by Stream Enrich. To be clear: our real-time pipeline on Kinesis does have at-least-once processing semantics, but the levels of duplication that Asger observed were far in excess of anything seen in normal operation.
Upon reproducing the issue, we immediately prioritised an urgent Snowplow release to fix it, pushing back the other Snowplow releases currently in progress.
Please read on after the fold for:
1. Fixing the Stream Enrich event duplication issue
As part of our refactor to support GCP, Snowplow 101 Neapolis accidentally introduced the sharing of a single Kinesis sink across multiple Amazon Kinesis Client Library RecordProcessors. As a result, the same Kinesis sink was flushed as many times as there were RecordProcessors, leading to duplicated events whenever more than one RecordProcessor was running on the same Stream Enrich instance.
This behavior has been corrected in this release by re-implementing one Kinesis sink per RecordProcessor.
There is a comprehensive guide to this issue on Discourse, detailing who may be affected and the steps to mitigate it. The same thread is the place to go if you would like to discuss the issue further.
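To make the failure mode in this section concrete, here is a toy Python model. It is not Stream Enrich code: `ToySink`, `begin_flush`, `finish_flush` and `run` are hypothetical names invented for this sketch, and the two-phase flush is an assumption about how overlapping flushes of one shared buffer could write the same events more than once. The release notes only state that the shared sink was flushed once per RecordProcessor.

```python
from collections import Counter


class ToySink:
    """Hypothetical stand-in for a buffering sink; not the real Kinesis sink."""

    def __init__(self, stream):
        self.stream = stream  # plain list standing in for the enriched stream
        self.buffer = []

    def store(self, event):
        self.buffer.append(event)

    def begin_flush(self):
        # Phase 1 (assumed): read whatever is currently buffered.
        return list(self.buffer)

    def finish_flush(self, batch):
        # Phase 2 (assumed): write the batch, then drop it from the buffer.
        self.stream.extend(batch)
        self.buffer = [e for e in self.buffer if e not in batch]


def run(num_processors, shared_sink):
    """Count how often each event reaches the stream after one flush round."""
    stream = []
    shared = ToySink(stream)
    # Either every processor points at the same sink, or each gets its own.
    sinks = [shared if shared_sink else ToySink(stream)
             for _ in range(num_processors)]

    # Each simulated processor enriches and stores exactly one event.
    for i, sink in enumerate(sinks):
        sink.store(f"event-{i}")

    # Overlapping flushes: every processor starts its flush before any finishes.
    batches = [sink.begin_flush() for sink in sinks]
    for sink, batch in zip(sinks, batches):
        sink.finish_flush(batch)

    return Counter(stream)
```

In this model, `run(3, shared_sink=True)` writes each event three times, once per simulated processor, while `run(3, shared_sink=False)` writes each event exactly once. With a single processor the shared sink produces no duplicates, which matches the observation that only instances running more than one RecordProcessor were affected.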
2. A word on quality
The event duplication issue introduced in R101 was a major bug, and does not reflect the code quality and operational standards that we aim for at Snowplow.
As our team grows and we strive for an ever-faster release cadence across our major projects, it is crucial that our software quality actually improves - we cannot achieve flow and deliver high throughput without high-grade quality-supporting processes.
On our side, we are prioritising two areas of improvement:
- Extending and enhancing our internal QA processes and tools, to make sure that issues such as this are identified at an early stage
- Improving our internal collaboration and communication around upcoming releases (from design through to publication), to give our wider team the ability to detect issues like this much earlier
Another idea we are starting to consider is less frequent “LTS” (Long-Term Support) releases of Snowplow, similar for example to the Ubuntu release process.
Above all we want the community’s ideas on how we can improve software quality at Snowplow. Do please share your thoughts in our Discourse forum.
The latest version of Stream Enrich is available from our Bintray here.
If you are currently on R101, please note that you will need to follow the R103 Stream Enrich upgrade steps, relating to the IP Lookups Enrichment. Check out the R103 Upgrading guide.
Upcoming Snowplow releases are unchanged:
- R106 Acropolis, enhancing our recently-released GDPR-focused PII Enrichment for the realtime pipeline
- R10x [STR] New webhooks and enrichment, featuring Marketo and Vero webhook adapters from our partners at Snowflake Analytics, plus a new enrichment for detecting bots and spiders using data from the IAB
- R10x Vallei dei Templi, porting our streaming enrichment process to Google Cloud Dataflow, leveraging the Apache Beam APIs
5. Getting help
For more details on this release, please check out the release notes on GitHub.
If you have any questions or run into any problem, please visit our Discourse forum.