We are tremendously excited to announce the release of Snowplow 89 Plain of Jars. This release centers around the port of our batch pipeline from Twitter Scalding to Apache Spark, a direct implementation of our most popular RFC, Migrating the Snowplow batch jobs from Scalding to Spark.
Read on for more information on R89 Plain of Jars, named after [an archeological site in Laos][plain-of-jars]:
- Why Spark?
- Spark Enrich and RDB Shredder
- Under the hood
- Upgrading
- Request for feedback
- Roadmap
- Getting help
This release has been a real community effort, and so we’d like to start off by thanking some of the people who were key to this port:
- Phil Kallos, formerly of Popsugar, for breaking ground on this port and his support throughout the coding process
- David White of Nordstrom for his guidance and encouragement
- Gabor Ratky of Secret Sauce Partners for his great help during QA
- Everyone else who followed along on our journey
This has been one of the most inclusive and collaborative Snowplow releases in our history - an exciting outcome of our burgeoning RFC process, and one which bodes well for the future as we roadmap exciting new features and major refactorings. Thank you all!
2. Why Spark?
We take a conservative approach to technology adoption at Snowplow - your event data pipeline is far too important for us to take chances with speculative technologies or techniques. But technology does not stand still, and we must always be proactively extending and re-architecting Snowplow to ensure that it stays relevant over the next decade.
You may be wondering why we went to the trouble of rewriting the core components of our batch pipeline into Spark, and why now. The definitive explanation for this port can be found in our RFC - but in a nutshell, we wanted to address some particular pain points with Hadoop:
- Hadoop being excessively disk-bound, especially for intermediate results, which was very costly in terms of round trips to disk across both the Enrich and Shred jobs
- The impossibility of us reusing code between our batch and real-time pipelines. In particular, we want to reuse our database loading apps across our batch and real-time pipelines, not least so that our users can drip-feed Redshift in near-real-time
- The slow pace of innovation in Hadoop - although our Hadoop platform and frameworks are stable, the rapid innovation in our space has moved on to the Spark (and Flink) ecosystems
Although the core of the Snowplow batch pipeline had stayed in Scalding since early 2013, we had had multiple positive experiences working with Apache Spark on ancillary Snowplow projects, and were confident that Spark could address these pain points.
The RFC proposed moving to Spark in three phases:
- Phase 1: porting our batch enrichment process to Spark
- Phase 2: porting our Redshift load process to Spark/Spark Streaming
- Phase 3: porting our remaining apps
Snowplow 89 Plain of Jars represents the entirety of Phase 1, and the core deliverable of Phase 2 - namely porting our Hadoop Shred job to run on Spark.
3. Spark Enrich and Relational Database Shredder
This release ports the two core components of the Snowplow batch pipeline from Scalding to Spark:
- Scala Hadoop Enrich, renamed Spark Enrich
- Scala Hadoop Shred, renamed RDB (for Relational Database) Shredder
Spark Enrich effectively replaces Scala Hadoop Enrich. It is a “lift and shift” port: it has exactly the same feature set and acts as a drop-in replacement.
For its part, RDB Shredder is the successor to Scala Hadoop Shred. Again, the featureset of Scala Hadoop Shred, including DynamoDB-based de-duplication, has been preserved; minor Spark-related changes have been made to the folder structure of the job’s shredded output.
Also note that, as part of this release, RDB Shredder has been moved to the correct 4-storage folder within the Snowplow repository, from the 3-enrich folder in which Scala Hadoop Shred was erroneously stored.
4. Under the hood
This release also includes a set of other updates, preparing the ground for the Spark port and contributing to our ongoing modernization of the Snowplow batch pipeline:
- Scala Common Enrich, Spark Enrich and RDB Shredder now run on Scala 2.11 (#3061, #3070, #3071)
- Scala Common Enrich, Spark Enrich and RDB Shredder now use Java 8 (#2381, #3212, #3213)
- EmrEtlRunner is now able to run Spark jobs (#641)
- StorageLoader has been updated to read Spark’s output directory structure (#3044)
5. Upgrading
5.1 Upgrading EmrEtlRunner and StorageLoader
As always, the latest versions of EmrEtlRunner and StorageLoader are now available from our Bintray.
5.2 Updating config.yml
In order to leverage Spark Enrich and RDB Shredder, we’ve made some changes to our configuration YAML:
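To make the changes concrete, here is an illustrative fragment of the updated config.yml. Apart from the ami_version bump and the relocation of job_name, every value below (instance types, counts, the job name itself) is a placeholder assumption standing in for your own settings:

```yaml
# Illustrative config.yml fragment - not a complete template
emr:
  ami_version: 5.5.0        # must be 5.5.0; the Spark jobs will not run on 4.5.0
  # ...
  jobflow:
    job_name: Snowplow ETL  # job_name now lives under the emr:jobflow section
    master_instance_type: m1.medium
    core_instance_count: 2
    core_instance_type: m1.medium
```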
Don’t forget to update the ami_version to 5.5.0 - the new Spark jobs will not run successfully on 4.5.0.
Note that the job_name is now part of the emr:jobflow section, reflecting the fact that the EMR job covers both the enrichment and storage phases of the batch pipeline; for clarity, the RDB Shredder and Hadoop Elasticsearch job versions have accordingly been moved to the storage section.
For a complete example, see our sample config.yml template.
5.3 Performance profiling and tuning
The performance characteristics of Apache Spark are quite different from those of Apache Hadoop, and we strongly recommend that you make time for some thorough performance profiling and tuning as part of this upgrade.
Our experience to date, comparing the Spark-based R89 to its Hadoop-based antecedents, is that R89 is more demanding in terms of memory, but much faster when those memory requirements are met.
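The right settings will depend heavily on your event volumes and cluster size, but as a starting point for tuning, Spark memory behavior can be adjusted at cluster launch through EMR’s standard configuration classifications. The spark and spark-defaults classifications below are standard EMR features; the specific values are illustrative assumptions only, not recommendations:

```json
[
  {
    "Classification": "spark",
    "Properties": { "maximizeResourceAllocation": "true" }
  },
  {
    "Classification": "spark-defaults",
    "Properties": {
      "spark.executor.memory": "8G",
      "spark.driver.memory": "8G",
      "spark.default.parallelism": "24"
    }
  }
]
```

Profiling a representative run in the Spark UI before and after each change is the safest way to validate any tuning.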
6. Request for feedback
Given that this is a hugely significant change to the Snowplow batch pipeline, we would appreciate any feedback regarding the performance of this release, be it improvement or degradation; we also want to hear as soon as possible about any regressions that might be Spark-related.
7. Roadmap
Upcoming Snowplow releases include:
- R9x [HAD] StorageLoader reboot, which will port our JRuby-based StorageLoader app to Scala, renaming it in the process to the RDB Loader
- R9x [STR] Stream refresh, a general upgrade of the apps constituting our Kinesis-based stream processing pipeline
- R9x [HAD] 4 webhooks, which will add support for 4 new webhooks (Mailgun, Olark, Unbounce, StatusGator)
- R9x [HAD] EmrEtlRunner robustness, continuing our work making EmrEtlRunner more reliable and modular
8. Getting help
For more details on this release, please check out the release notes on GitHub.