Spark Example Project 0.3.0 released for getting started with Apache Spark on EMR

10 May 2015  •  Alex Dean

We are pleased to announce the release of our Spark Example Project 0.3.0, building on the original release of the project last year.

This release is part of a renewed focus on the Apache Spark stack at Snowplow. In particular, we are exploring Spark’s applicability to two Snowplow-specific problem domains:

  1. Using Spark and Spark Streaming to implement r64 Palila-style data modeling outside of Redshift SQL
  2. Using Spark Streaming to deliver “analytics-on-write” realtime dashboards as part of our Kinesis pipeline

Expect to see further releases, blog posts and tutorials from the Snowplow team on Apache Spark and Spark Streaming soon!

In the rest of this blog post we’ll talk about:

  1. Spark 1.3.0 support
  2. Simplified Elastic MapReduce support
  3. Automated job upload and running on EMR
  4. Getting help

The project has been updated to Spark 1.3.0, the most recent version of Spark supported on Amazon Elastic MapReduce. Many thanks to community member Vincent Ohprecio for contributing this upgrade!

When we worked on the Spark Example Project last year, getting the job to run in a non-interactive fashion was challenging: it required a custom Bash script, and even then it was restricted to a relatively old version of Spark (0.8.x).

Since then, AWS’s support for running Spark on Elastic MapReduce has evolved significantly, as part of the excellent open source emr-bootstrap-actions initiative. This has enabled us to remove our custom Bash script, and bump our Spark support to 1.3.0 as above.

At Snowplow we have been experimenting with a combination of Invoke plus Boto to automate tasks around Amazon Web Services. Invoke is a Python task runner, and Boto is the official AWS library for Python (also underpinning the AWS CLI tools).

To make it easier to upload and run Spark jobs on Elastic MapReduce, we have created an Invoke tasks.py file with two commands. The first is upload, which uploads the assembled fatjar and input data to Amazon S3:

    inv upload aws-profile spark-example-project-bucket
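Under the hood, an upload task like this just needs to copy two local files to well-known S3 locations. As a rough sketch of the shape of such a task (the file paths, helper names and the use of the AWS CLI here are illustrative assumptions, not the project's actual tasks.py, which drives AWS through Boto):

```python
# Hypothetical sketch of the "upload" step. In the real tasks.py these would
# be Invoke @task functions; here we only show the command construction,
# using the AWS CLI for brevity. Paths and filenames are placeholders.

JAR_PATH = "target/scala-2.10/spark-example-project-assembly.jar"  # assumed fatjar location
DATA_PATH = "data/hello.txt"                                       # assumed input data

def s3_uri(bucket, local_path):
    """s3:// destination for a local file uploaded to the bucket root."""
    return "s3://%s/%s" % (bucket, local_path.split("/")[-1])

def upload_commands(profile, bucket):
    """AWS CLI commands roughly equivalent to `inv upload <profile> <bucket>`."""
    return [
        "aws --profile %s s3 cp %s %s" % (profile, path, s3_uri(bucket, path))
        for path in (JAR_PATH, DATA_PATH)
    ]
```

Wrapping each of these in an Invoke `@task` function is what lets them be run as `inv upload ...` from the command line.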

The second command is run_emr, which executes the Spark job on Elastic MapReduce:

    inv run_emr aws-profile spark-example-project-bucket ec2-keypair subnet-3dc2bd2a
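A task like run_emr essentially boils down to launching a cluster and adding a step that invokes spark-submit against the fatjar and data uploaded earlier. As a rough, hypothetical sketch of what such a step's arguments might look like (the main class, jar name and input/output paths below are placeholders, not the project's actual values):

```python
# Illustrative spark-submit arguments for an EMR step. The real tasks.py
# submits the job flow via Boto; only the argument list is sketched here,
# and all names are assumptions for the sake of the example.

def spark_step_args(bucket,
                    jar="spark-example-project-assembly.jar",   # assumed jar name
                    main_class="com.example.WordCountJob"):     # placeholder class
    """Build the argument list an EMR step would pass to spark-submit."""
    return [
        "/home/hadoop/spark/bin/spark-submit",  # Spark location per emr-bootstrap-actions
        "--class", main_class,
        "s3://%s/%s" % (bucket, jar),           # the fatjar uploaded by `inv upload`
        "s3://%s/hello.txt" % bucket,           # assumed input path
        "s3://%s/out" % bucket,                 # assumed output path
    ]
```

These arguments would then be attached to the cluster as a step, with the EC2 keypair and subnet from the command line passed into the job flow configuration.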

As well as helping users get started with the Spark Example Project, the new tasks.py file should be a good starting point for automating your own non-interactive Spark jobs on EMR.

We hope you find the Spark Example Project useful. As always with releases from the Snowplow team, if you run into any issues or don’t understand any of the above changes, please raise an issue or get in touch with us via the usual channels.

Stay tuned for more announcements from Snowplow about Spark and Spark Streaming in the future!