This release has been a real community effort and so we’d like to start off by thanking some people that were key to this port:
This has been one of the most inclusive and collaborative Snowplow releases in our history - an exciting outcome of our burgeoning RFC process, and one which bodes well for the future as we roadmap exciting new features and major refactorings. Thank you all!
We take a conservative approach to technology adoption at Snowplow - your event data pipeline is far too important for us to take chances with speculative technologies or techniques. But technology does not stand still, and we must always be proactively extending and re-architecting Snowplow to ensure that it stays relevant over the next decade.
You may be wondering why we went to the trouble of rewriting the core components of our batch pipeline into Spark, and why now. The definitive explanation for this port can be found in our RFC - but in a nutshell, we wanted to address some particular pain points with Hadoop:
Although the core of the Snowplow batch pipeline had been written in Scalding since early 2013, we had had multiple positive experiences working with Apache Spark on ancillary Snowplow projects, and were confident that Spark could address these pain points.
The RFC proposed moving to Spark in three phases:
Snowplow R89 Plain of Jars represents the entirety of Phase 1, and the core deliverable of Phase 2 - namely porting our Hadoop Shred job to run on Spark.
This release ports the two core components of the Snowplow batch pipeline from Scalding to Spark:
Spark Enrich effectively replaces Scala Hadoop Enrich. It is a “lift and shift” port: it has exactly the same functionality and acts as a drop-in replacement.
For its part, RDB Shredder is the successor to Scala Hadoop Shred. Again, the feature set of Scala Hadoop Shred, including DynamoDB-based de-duplication, has been preserved; only minor Spark-related changes have been made to the folder structure of the job’s shredded output.
Also note that as part of this release the RDB Shredder has been moved to the correct 4-storage folder within Snowplow, from the 3-enrich folder that Scala Hadoop Shred was erroneously stored in.
This release also includes a set of other updates, preparing the ground for the Spark port and contributing to our ongoing modernization of the Snowplow batch pipeline:
As always, the latest versions of EmrEtlRunner and StorageLoader are now available from our Bintray.
In order to leverage Spark Enrich and RDB Shredder, we’ve made some changes to our configuration YAML:
Don’t forget to update the ami_version to 5.5.0 - the new Spark jobs will not run successfully on 4.5.0.
Note that the job_name is now part of the emr:jobflow section, reflecting that the EMR job now covers both the enrichment and storage phases of the batch pipeline; for clarity, the RDB Shredder and Hadoop Elasticsearch job versions have accordingly been moved to the storage section.
For a complete example, see our sample config.yml.
The performance characteristics of Apache Spark are quite different from those of Apache Hadoop, and we strongly recommend that you make time for some thorough performance profiling and tuning as part of this upgrade.
Our experience to date, comparing Spark-based R89 to its Hadoop-based antecedents, is that R89 is more demanding in terms of memory, but much faster if those memory requirements are met.
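As a purely illustrative starting point - and assuming the standard instance settings in the emr:jobflow section of the EmrEtlRunner configuration - switching the core instances to a memory-optimized type before re-profiling is a reasonable first move. The instance types and counts below are assumptions to adapt to your own event volumes, not recommendations:

```yaml
aws:
  emr:
    jobflow:
      master_instance_type: m1.medium   # the master does little heavy lifting
      core_instance_count: 3            # illustrative only - size to your event volumes
      core_instance_type: r3.xlarge     # memory-optimized cores suit Spark's memory appetite
      task_instance_count: 0
      task_instance_type: m1.medium
      task_instance_bid: 0.015
```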
Given that this is a hugely significant change to the Snowplow batch pipeline, we would appreciate any feedback regarding the performance of this release, be it improvement or degradation; we also want to hear as soon as possible about any regressions that might be Spark-related.
For any concrete bugs or feature requests, please open a ticket on our GitHub. For anything more discursive or subjective, please start a thread in our forums.
Upcoming Snowplow releases include:
For more details on this release, please check out the release notes on GitHub.
If you have any questions or run into any problems, please raise an issue or get in touch with us through the usual channels.