Spark Example Project 0.3.0 released for getting started with Apache Spark on EMR
This release is part of a renewed focus on the Apache Spark stack at Snowplow. In particular, we are exploring Spark’s applicability to two Snowplow-specific problem domains:
- Using Spark and Spark Streaming to implement r64 Palila-style data modeling outside of Redshift SQL
- Using Spark Streaming to deliver “analytics-on-write” realtime dashboards as part of our Kinesis pipeline
Expect to see further releases, blog posts and tutorials from the Snowplow team on Apache Spark and Spark Streaming soon!
In the rest of this blog post we’ll talk about:
- Spark 1.3.0 support
- Simplified Elastic MapReduce support
- Automated job upload and running on EMR
- Getting help
When we first worked on the Spark Example Project last year, getting it to run non-interactively was challenging: it involved a custom Bash script, and even then it was restricted to a relatively old version of Spark (0.8.x).
Since then, AWS’s support for running Spark on Elastic MapReduce has evolved significantly, as part of the excellent open source emr-bootstrap-actions initiative. This has enabled us to remove our custom Bash script, and bump our Spark support to 1.3.0 as above.
At Snowplow we have been experimenting with a combination of Invoke plus Boto to automate tasks around Amazon Web Services. Invoke is a Python task runner, and Boto is the official AWS library for Python (also underpinning the AWS CLI tools).
To make it easier to upload and run Spark jobs on Elastic MapReduce, we have created an Invoke `tasks.py` file with two commands. The first is `upload`, which uploads the assembled fatjar and input data to Amazon S3.
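The project's actual `tasks.py` is in the repository; the sketch below is illustrative only. It shows what a minimal Invoke-style upload task might look like, with hypothetical paths for the fatjar and input data, and with the boto call deferred inside the function body so the path-building helper works on its own:

```python
# Illustrative sketch of an upload task; the JAR and DATA_DIR paths
# are assumptions, not the project's actual layout.
import os

JAR = "target/scala-2.10/spark-example-project-0.3.0.jar"  # hypothetical fatjar path
DATA_DIR = "data"                                          # hypothetical local input data

def s3_key(prefix, local_path):
    """Build the S3 key under which a local file will be stored."""
    return "%s/%s" % (prefix.strip("/"), os.path.basename(local_path))

def upload(bucket, prefix="spark-example-project"):
    """Upload the assembled fatjar and input data to Amazon S3.

    In a real tasks.py this function would carry Invoke's @task
    decorator so it can be run from the command line.
    """
    import boto  # deferred so the helper above runs without boto installed

    conn = boto.connect_s3()
    bkt = conn.get_bucket(bucket)
    files = [JAR] + [os.path.join(DATA_DIR, f) for f in os.listdir(DATA_DIR)]
    for path in files:
        key = bkt.new_key(s3_key(prefix, path))
        key.set_contents_from_filename(path)
```

With Invoke installed and the `@task` decorator applied, this would be runnable as something like `inv upload --bucket=my-bucket`.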
The second command is `run_emr`, which executes the Spark job on Elastic MapReduce.
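Again as a sketch only: a `run_emr`-style task might use boto's EMR module to spin up a cluster and submit the job as a single step that calls spark-submit. The main class, AMI version, instance types, and bucket layout below are placeholders, not the project's actual settings:

```python
# Illustrative sketch of a run_emr task; all cluster parameters and
# the main class name are assumptions.

def spark_step_args(bucket, prefix, main_class):
    """Build the spark-submit argument list for an EMR step."""
    jar = "s3://%s/%s/spark-example-project-0.3.0.jar" % (bucket, prefix)
    return [
        "/home/hadoop/spark/bin/spark-submit",  # path installed by emr-bootstrap-actions
        "--class", main_class,
        jar,
        "s3://%s/%s/in" % (bucket, prefix),   # input path
        "s3://%s/%s/out" % (bucket, prefix),  # output path
    ]

def run_emr(bucket, prefix="spark-example-project"):
    """Spin up an EMR cluster and run the Spark job as a single step.

    In a real tasks.py this would be an Invoke @task, runnable as
    `inv run_emr --bucket=...`.
    """
    import boto.emr  # deferred third-party import
    from boto.emr.step import JarStep

    conn = boto.emr.connect_to_region("us-east-1")
    step = JarStep(
        name="Run Spark Example Project",
        # script-runner lets an EMR step execute an arbitrary command:
        jar="s3://elasticmapreduce/libs/script-runner/script-runner.jar",
        step_args=spark_step_args(bucket, prefix, "com.example.WordCountJob"),
    )
    return conn.run_jobflow(
        name="Spark Example Project",
        ami_version="3.6.0",
        num_instances=3,
        master_instance_type="m3.xlarge",
        slave_instance_type="m3.xlarge",
        steps=[step],
    )
```

Keeping the step-argument construction in a separate pure function makes it easy to inspect or test the exact spark-submit invocation before paying for a cluster.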
As well as helping users get started with the Spark Example Project, the new `tasks.py` file should be a good starting point for automating your own non-interactive Spark jobs on EMR.
We hope you find the Spark Example Project useful. As always with releases from the Snowplow team, if you run into any issues or don't understand any of the above changes, please raise an issue or get in touch with us via the usual channels.
Stay tuned for more announcements from Snowplow about Spark and Spark Streaming in the future!