We are thrilled to release Snowplow R119 Tycho Magnetic Anomaly Two.
This release is about as big as they come and marks an important milestone for the Snowplow pipeline: the production-ready release of the new bad rows (failed events) format. It is also the last umbrella release in the history of this project. We’ve also significantly increased test coverage across the different services.
Read on to learn more about Snowplow R119 Tycho Magnetic Anomaly Two, named after the powerful Space Odyssey monolithic artefact that breaks itself into smaller objects.
credit: 2001: A Space Odyssey
In this post:
- Production release of the new failed events format
- No more monorepo
- Other changes
- Upgrading
- Getting help
1. Production release of the new failed events format
In Snowplow R118 Morgantina we introduced a new format for failed events (bad rows), bringing a much more structured approach for events that fail at any step of the pipeline. This significantly improves the experience of diagnosing what is causing the failures. For more information on understanding failed events see our documentation here.
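To give a flavour of the structured approach, a failed event in the new format is a self-describing JSON: the `schema` field is an Iglu schema URI identifying the failure type, and the `data` field carries the failure details, the offending payload and the processor that produced it. The exact schema URI and field layout below are illustrative assumptions rather than the definitive schemas; consult the failed events documentation for the real ones. A minimal Python sketch:

```python
import json

# An illustrative failed event in the new self-describing JSON format.
# NOTE: the schema URI and field layout here are assumptions for
# illustration only; see the Snowplow documentation for the real schemas.
failed_event = json.dumps({
    "schema": "iglu:com.snowplowanalytics.snowplow.badrows/schema_violations/jsonschema/2-0-0",
    "data": {
        "failure": {
            "timestamp": "2020-01-01T00:00:00Z",
            "messages": [{"error": "missing required property 'id'"}],
        },
        "payload": {"raw": "..."},
        "processor": {"artifact": "stream-enrich", "version": "1.0.0"},
    },
})

def failure_type(bad_row_json: str) -> str:
    """Extract the failure type from a bad row's Iglu schema URI.

    A URI looks like iglu:<vendor>/<name>/<format>/<version>, so the
    schema name (e.g. 'schema_violations') is the second path segment.
    """
    row = json.loads(bad_row_json)
    return row["schema"].split(":", 1)[1].split("/")[1]

print(failure_type(failed_event))  # schema_violations
```

Because every bad row declares its own schema, a consumer can route or aggregate failures by type without guessing at the payload's shape.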
We announced R118 as a public beta, given the big changes we made in that release.
We, together with our open source community and customers, extensively tested all assets from R118, and we’re now excited to announce that R119 marks general availability of the new format and all associated functionality.
We identified and fixed several bugs in R118 related to the changes introduced in that release:
- Events from a POST payload could get lost if the payload contained at least one corrupted event (#4320)
- The enrich process could crash with a NullPointerException in case of an empty query parameter in the IgluAdapter (#4330)
- The enrich process could crash with a NullPointerException in case of an empty query parameter in the Snowplow Adapter (#4324, thanks Rob Kingston for spotting it!)
To make R119 production-ready, we also wanted to ensure that the new bad rows functionality reaches feature parity (or better) with the legacy format.
To that end, Event Recovery 0.2.0, which leverages all the benefits of the new format, will be announced soon.
2. No more monorepo
Historically, most of Snowplow’s open source estate was hosted in a single GitHub monorepository, which holds the whole history of changes back to the project’s inception in 2012. Roughly every month we would then make a big bang release of all assets, with an associated blog post. One of the goals of these big bang releases was a compatibility guarantee: all assets within a single release were guaranteed to be compatible with each other.
However, as Snowplow’s OSS estate grew, we realised that this guarantee was getting harder to maintain, and that the monorepository approach made our development process very inflexible: we sometimes had to produce an umbrella release and a blog post just to ship an urgent hotfix. To solve this, we have decided to split the Snowplow monorepository into individual repositories, each containing the code for a single application.
We expect this change will make development even more OSS-friendly and easier to contribute to.
We would also like to use this as the next step in our batch pipeline deprecation process and archive the following components:
We’re creating the following new repositories for our subprojects:
- Scala Stream Collector
- Scala Common Enrich
- Stream Enrich
- Beam Enrich (which has already existed as an independent asset for a couple of months)
- Spark Enrich
We will be establishing an alternative, better way to communicate which versions are compatible and our recommendation for the stack you should be running – while increasing our ability to get changes to you quickly.
3. Other changes
Apart from ground-breaking changes such as the new failed events format and change in the project structure, we also made a few additional tweaks:
- An event with an invalid attached context (e.g. one added by the API Request Enrichment) now results in a bad row, and enrichment is guaranteed to always produce exactly one good or one bad row per event (#3795)
- EmrEtlRunner now retries acquiring a connection to EMR if the current one is lost (#4290)
- EmrEtlRunner now properly sets the number of core instances; previously this could lead to an under-provisioned cluster (#4285)
- Extensive unit test coverage added across the enrichment workflow
- Allow Stream Enrich to download data from private S3 or GCS buckets (#4269)
4. Upgrading
For Snowplow BDP customers, there is nothing you need to do. We will be in touch with an upgrade message with details of when we will be upgrading your production pipeline.
The upgrade guide for open source users can be found on our wiki page.
5. Getting help
For more details on this release, please check out the release notes on GitHub.
If you have any questions or run into any problems, please visit our Discourse forum. Open source users will receive high-priority support for components of this release.