Snowplow 89 Plain of Jars released, porting Snowplow to Spark

We are tremendously excited to announce the release of Snowplow 89 Plain of Jars. This release centers around the port of our batch pipeline from Twitter Scalding to Apache Spark, a direct implementation of our most popular RFC, Migrating the Snowplow batch jobs from Scalding to Spark.

Read on for more information on R89 Plain of Jars, named after an archeological site in Laos:

  1. Thanks
  2. Why Spark?
  3. Spark Enrich and RDB Shredder
  4. Under the hood
  5. Upgrading
  6. Request for feedback
  7. Roadmap
  8. Getting help


1. Thanks

This release has been a real community effort and so we’d like to start off by thanking some people that were key to this port:

  • Phil Kallos, formerly of Popsugar, for breaking ground on this port and his support throughout the coding process
  • David White of Nordstrom for his guidance and encouragement
  • Gabor Ratky of Secret Sauce Partners for his great help during QA
  • Everyone else who followed along on our journey

This has been one of the most inclusive and collaborative Snowplow releases in our history - an exciting outcome of our burgeoning RFC process, and one which bodes well for the future as we roadmap exciting new features and major refactorings. Thank you all!

2. Why Spark?

We take a conservative approach to technology adoption at Snowplow - your event data pipeline is far too important for us to take chances with speculative technologies or techniques. But technology does not stand still, and we must always be proactively extending and re-architecting Snowplow to ensure that it stays relevant over the next decade.

You may be wondering why we went to the trouble of rewriting the core components of our batch pipeline into Spark, and why now. The definitive explanation for this port can be found in our RFC - but in a nutshell, we wanted to address some particular pain points with Hadoop:

  • Hadoop's heavy reliance on disk, especially for intermediate results, which made the round trips to disk across the Enrich and Shred jobs very costly
  • The impossibility of reusing code between our batch and real-time pipelines. In particular, we want to reuse our database loading apps across both pipelines, not least so that our users can drip-feed Redshift in near-real-time
  • The slow pace of innovation in Hadoop - although our Hadoop platform and frameworks are stable, the rapid innovation in our space has moved on to the Spark (and Flink) ecosystems
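The first pain point above can be sketched in miniature: under Hadoop, Enrich and Shred are separate jobs that materialize their output to disk between stages, whereas Spark lets the same transformations be chained in memory. The sketch below uses plain Scala collections as a stand-in for RDDs; all names and the trivial stage logic are hypothetical, not the actual pipeline code.

```scala
// Illustrative sketch only: plain Scala collections stand in for RDDs.
// Under Hadoop, each stage is a separate job whose output is written to
// disk and re-read by the next; under Spark, the same two stages can be
// chained as in-memory transformations with no intermediate write.
object PipelineSketch {
  type RawEvent = String
  final case class EnrichedEvent(appId: String, payload: String)

  // Stage 1: enrichment (a trivial stand-in for the real enrichment logic)
  def enrich(raw: RawEvent): EnrichedEvent = {
    val Array(appId, payload) = raw.split(":", 2)
    EnrichedEvent(appId, payload)
  }

  // Stage 2: shredding (a stand-in for splitting out loadable records)
  def shred(e: EnrichedEvent): (String, String) = (e.appId, e.payload)

  def run(raw: Seq[RawEvent]): Seq[(String, String)] =
    // With Spark this would read as rawRdd.map(enrich).map(shred) --
    // the enriched events never hit disk between the two stages.
    raw.map(enrich).map(shred)
}
```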

Although the core of the Snowplow batch pipeline had stayed in Scalding since early 2013, we had had multiple positive experiences working with Apache Spark on ancillary Snowplow projects, and were confident that Spark could address these pain points.

The RFC proposed moving to Spark in three phases:

  1. Phase 1: porting our batch enrichment process to Spark
  2. Phase 2: porting our Redshift load process to Spark/Spark Streaming
  3. Phase 3: porting our remaining apps

Snowplow 89 Plain of Jars represents the entirety of Phase 1, and the core deliverable of Phase 2 - namely porting our Hadoop Shred job to run on Spark.

3. Spark Enrich and Relational Database Shredder

This release ports the two core components of the Snowplow batch pipeline from Scalding to Spark:

  1. Scala Hadoop Enrich, renamed to Spark Enrich
  2. Scala Hadoop Shred, renamed to RDB (for Relational Database) Shredder

Spark Enrich effectively replaces Scala Hadoop Enrich. It is a “lift and shift” port: it has exactly the same functionality and acts as a drop-in replacement.

For its part, RDB Shredder is the successor to Scala Hadoop Shred. Again, the feature set of Scala Hadoop Shred, including DynamoDB-based de-duplication, has been preserved; only minor Spark-related changes have been made to the folder structure of the job’s shredded output.
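To illustrate the idea behind shredding (with hypothetical names, not the actual RDB Shredder code): each self-describing entity carries an Iglu schema URI, and the job buckets entities by their schema coordinates so that each type can be written to its own output folder and loaded into its own table. A minimal plain-Scala sketch:

```scala
// Hypothetical sketch of the shredding idea. The schema URI format
// ("iglu:vendor/name/format/version") follows Iglu's convention; the
// parsing and bucketing code here is illustrative only.
object ShredSketch {
  final case class SchemaKey(vendor: String, name: String, format: String, version: String)

  // e.g. "iglu:com.acme/link_click/jsonschema/1-0-0"
  def parseKey(uri: String): Option[SchemaKey] =
    uri.stripPrefix("iglu:").split("/") match {
      case Array(v, n, f, ver) => Some(SchemaKey(v, n, f, ver))
      case _                   => None
    }

  // Bucket (schemaUri, json) pairs by schema coordinates, as the shred
  // job does before writing each bucket to its own folder. Entities with
  // unparseable URIs are dropped here for simplicity.
  def shred(entities: Seq[(String, String)]): Map[SchemaKey, Seq[String]] =
    entities
      .flatMap { case (uri, json) => parseKey(uri).map(_ -> json) }
      .groupBy(_._1)
      .map { case (key, pairs) => key -> pairs.map(_._2) }
}
```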

Also note that as part of this release RDB Shredder has been moved to the correct 4-storage folder within the Snowplow repository, from the 3-enrich folder in which Scala Hadoop Shred was erroneously stored.

4. Under the hood

This release also includes a set of other updates, preparing the ground for the Spark port and contributing to our ongoing modernization of the Snowplow batch pipeline:

  • Scala Common Enrich, Spark Enrich and RDB Shredder now run on Scala 2.11 (#3061, #3070, #3071)
  • Scala Common Enrich, Spark Enrich and Relational Database Shredder now use Java 8 (#2381, #3212, #3213)
  • EmrEtlRunner is now able to run Spark jobs (#641)
  • StorageLoader has been updated to read Spark’s output directory structure (#3044)

5. Upgrading

5.1 Upgrading EmrEtlRunner and StorageLoader

As always, the latest versions of EmrEtlRunner and StorageLoader are now available from our Bintray.

5.2 Updating config.yml

In order to leverage Spark Enrich and RDB Shredder, we’ve made some changes to our configuration YAML:

aws:
  emr:
    ami_version: 5.5.0                # WAS 4.5.0
    ...
    jobflow:
      job_name: Snowplow ETL          # MOVED
      master_instance_type: m1.medium # AS BEFORE
      ...
enrich:
  versions:
    spark_enrich: 1.9.0               # RENAMED
    ...
storage:
  versions:
    rdb_shredder: 0.12.0              # MOVED AND RENAMED
    hadoop_elasticsearch: 0.1.0       # MOVED

Don’t forget to update the ami_version to 5.5.0 - the new Spark jobs will not run successfully on 4.5.0.

Note that the job_name is now part of the emr:jobflow section, reflecting that the EMR job covers the enrichment and storage phases of the batch pipeline; for clarity the RDB Shredder and Hadoop Elasticsearch job versions have accordingly been moved to the storage section.

For a complete example, see our sample config.yml template.

5.3 Performance profiling and tuning

The performance characteristics of Apache Spark are quite different from those of Apache Hadoop, and we strongly recommend that you make time for some thorough performance profiling and tuning as part of this upgrade.

Our experience to date, comparing the Spark-based R89 to its Hadoop-based antecedents, is that R89 is more demanding in terms of memory, but much faster when those memory requirements are met.
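As a starting point for that tuning, these are the kinds of memory knobs involved if you were to submit a Spark job by hand with spark-submit. The flags are standard Spark options, but the values and jar name below are purely illustrative, not recommendations for your workload:

```shell
# Illustrative only: typical Spark memory knobs. The jar name and all
# values here are examples, not the settings Snowplow ships with.
spark-submit \
  --deploy-mode cluster \
  --driver-memory 4G \
  --executor-memory 6G \
  --executor-cores 2 \
  --conf spark.yarn.executor.memoryOverhead=1024 \
  your-spark-job.jar
```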

6. Request for feedback

Given that this is a hugely significant change to the Snowplow batch pipeline, we would appreciate any feedback on the performance of this release, whether improvement or degradation; we also want to hear as soon as possible about any regressions that might be Spark-related.

For any concrete bugs or feature requests, please open a ticket on our GitHub. For anything more discursive or subjective, please start a thread in our forums.

7. Roadmap

Upcoming Snowplow releases include:

8. Getting help

For more details on this release, please check out the release notes on GitHub.

If you have any questions or run into any problems, please raise an issue or get in touch with us through the usual channels.

Thoughts or questions? Come join us in our Discourse forum!

Ben Fradet

Ben is a data engineer at Snowplow. You can find him on GitHub, Twitter and LinkedIn.