Snowplow 90 Lascaux released, moving loading step onto EMR
We are tremendously excited to announce the release of Snowplow 90 Lascaux. This release introduces RDB Loader, a new EMR-run application replacing our trusty StorageLoader, as proposed in our Splitting EmrEtlRunner RFC. This release also brings various enhancements and alterations in EmrEtlRunner.
Read on for more information on R90 Lascaux, named after the Upper Paleolithic cave complex in southwestern France:
1. RDB Loader
1.1 The rationale for replacing StorageLoader
StorageLoader was a standalone JRuby app, typically running after EmrEtlRunner on the same orchestration server and ingesting shredded Snowplow event data into relational databases, such as AWS Redshift or PostgreSQL. This approach served us well over the years, but has started to show its age. As we’re moving towards supporting new cloud providers and simplifying our existing orchestration tools, we want to modularize and simplify our batch pipeline, making StorageLoader part of the existing EMR jobflow, and rewriting it in Scala to maximize opportunities for code reuse.
Loading storage targets like Redshift from within the EMR jobflow has many advantages:
- Better for security - the server running EmrEtlRunner no longer needs access to your Redshift cluster
- Simpler to set up - users no longer have to set up StorageLoader: it will be automatically fetched and run for you by EMR in the same manner as the existing Spark Shred and Enrich jobs
- Modularity - we’re removing features that are either better implemented in Dataflow Runner or SQL Runner, or better performed with specialized tools such as S3DistCp or EMR itself
1.2 Other improvements
Although we entirely re-implemented StorageLoader and changed its execution model, RDB Loader is a strict port: it has all the functionality that its predecessor had.
Along with shifting from a standalone app to an EMR step, we also made several important improvements to the loading process:
- The enriched and shredded events archiving logic now runs in EMR, using S3DistCp and orchestrated by EmrEtlRunner, increasing stability and performance (issue #1777)
- RDB Loader loads all existing run folders from the shredded good folder, eliminating the possibility of data going missing in Redshift due to “blind” archiving (issue #2962)
- RDB Loader uses an IAM role instead of credentials to access Redshift (issue #3281)
- We fixed the long-standing eventual consistency problem (issue #3113)
Finally, to reiterate: the whole codebase has been written in Scala, which allows us to share many components across codebases and add features in a more consistent and confident manner (issue #3023).
1.3 Limitations and plans for RDB Loader
With the initial release of RDB Loader we’ve achieved feature parity with StorageLoader. However, executing the load as an EMR step imposes several new restrictions, which we are actively looking to fix. All of these limitations are addressed by a dedicated milestone on GitHub.
The most important known limitations are:
- The impossibility of loading Redshift or Postgres via an SSH tunnel or similar non-standard setups
- The visibility of Base64-encoded Redshift credentials in the EMR console
Finally, we should flag that you will have to check the EMR logs for certain types of RDB Loader failure, such as invalid configuration or fatal OutOfMemory errors. All other success or failure messages should be printed to stdout by EmrEtlRunner.
2. Other improvements
We received some tremendous community feedback on Snowplow R89 Plain of Jars; one recurrent theme was the challenges of getting Spark to fully leverage the provided EMR cluster.
Based on this feedback, we’re introducing a way to specify arbitrary EMR configuration options through the EmrEtlRunner configuration file:
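As a rough sketch, such options might look like the fragment below. It assumes the options live under a configuration key in the aws.emr section of config.yml, grouped by EMR configuration classification (here yarn-site and spark). The specific properties and values are illustrative rather than recommended settings.

```yaml
aws:
  emr:
    configuration:                                  # assumed key holding arbitrary EMR options
      yarn-site:                                    # EMR classification for yarn-site.xml
        yarn.resourcemanager.am.max-attempts: "1"   # illustrative: do not retry a failed application master
      spark:                                        # EMR classification for Spark-on-EMR settings
        maximizeResourceAllocation: "true"          # illustrative: let EMR size executors to the cluster
```

Each second-level key names an EMR classification, and the entries beneath it are the properties passed to that classification, so other classifications can be added in the same shape.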
In addition to giving you these tuning tools for Spark, the Snowplow community is busy sharing guides on how best to optimize Spark on our Discourse. Rick Bolkey from OneSpot has already released a guide, Learnings from using the new Spark EMR Jobs. Thanks a lot, Rick!
3. Upgrading
3.1 Upgrading EmrEtlRunner
The latest version of EmrEtlRunner is available from our Bintray.
3.2 Updating config.yml
In order to use RDB Loader you need to make the following addition to your configuration YAML:
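A minimal sketch of that addition, assuming the loader is pinned alongside the other job versions under storage.versions. The rdb_loader key and all version numbers shown here are assumptions to check against the sample configuration shipped with this release.

```yaml
storage:
  versions:
    rdb_shredder: 0.12.0          # existing entry (version illustrative)
    rdb_loader: 0.12.0            # NEW (assumed key and version)
    hadoop_elasticsearch: 0.1.0   # existing entry (version illustrative)
```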
The following settings no longer make sense, as Postgres loading now also happens on an EMR node, and can therefore be deleted:
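For illustration, assuming these are the Postgres-only download settings under storage (an assumption based on the R89 configuration layout), the block to delete would look like this:

```yaml
storage:
  download:   # REMOVE (assumed key): local staging used for Postgres loads
    folder:   # REMOVE (assumed key): where StorageLoader stored downloaded files
```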
To gradually configure your EMR application you can add optional
For a complete example, see our sample
3.3 Updating EmrEtlRunner scripts
EmrEtlRunner now accepts a new --include option with a single possible vacuum argument, which will be passed to RDB Loader.
--skip now accepts new analyze arguments. Skipping archive_enriched steps is identical to running R89 EmrEtlRunner without StorageLoader.
Finally, note that StorageLoader is no longer part of the batch pipeline apps archive.
3.4 Creating IAM Role for Redshift
As RDB Loader is now an EMR step, we wanted to make sure that users’ AWS credentials are not exposed anywhere. To load Redshift we’re using IAM Roles, which allow Redshift to load data from S3.
To create an IAM Role you need to go to AWS Console -> IAM -> Roles -> Create new role.
Then you need to choose Amazon Redshift -> AmazonS3ReadOnlyAccess and pick a role name, for example RedshiftLoadRole. Once created, copy the Role ARN as you will need it in the next section.
Now you need to attach the new role to your running Redshift cluster. Go to AWS Console -> Redshift -> Clusters -> Manage IAM Roles -> Attach the role you just created.
3.5 Whitelisting EMR in Redshift
Your EMR cluster’s master node will need to be whitelisted in Redshift in order to perform the load.
If you are using an “EC2 Classic” environment, from the Redshift UI you will need to create a Cluster Security Group and add the relevant EC2 Security Group, most likely called ElasticMapReduce-master. Make sure to enable this Cluster Security Group against your Redshift cluster.
If you are using a modern VPC-based environment, you will need to modify the Redshift cluster and add the VPC security group attached to your EMR master node.
In both cases, you only need to whitelist access from the EMR master node, because RDB Loader runs exclusively from the master node.
3.6 Updating Storage configs
We have updated the Redshift storage target config - the new version requires the Role ARN that you noted down above:
4. Roadmap
Upcoming Snowplow releases include:
- R9x [STR] Stream refresh, a general upgrade of the apps constituting our Kinesis-based stream processing pipeline
- R9x [HAD] 4 webhooks, which will add support for 4 new webhooks (Mailgun, Olark, Unbounce, StatusGator)
- R9x [HAD] EmrEtlRunner robustness, continuing our work making EmrEtlRunner more reliable and modular
This release is also an important staging post in our mission of loading Snowplow event data into more databases, and in near-real-time. Watch this space!
5. Getting help
For more details on this release, please check out the release notes on GitHub.