StorageLoader was a standalone JRuby app, typically running after EmrEtlRunner on the same orchestration server and ingesting shredded Snowplow event data into relational databases such as Amazon Redshift or PostgreSQL. This approach served us well over the years, but it has started to show its age. As we move towards supporting new cloud providers and simplifying our existing orchestration tools, we want to modularize and simplify our batch pipeline, making StorageLoader part of the existing EMR jobflow and rewriting it in Scala to maximize opportunities for code reuse.
Loading storage targets like Redshift from within the EMR jobflow has many advantages:
- Data movement can be handled by S3DistCp or EMR itself
Although we entirely re-implemented StorageLoader and changed its execution model, RDB Loader is a strict port: it has all the functionality that its predecessor had.
Along with shifting from a standalone app to an EMR step, we also made several important improvements to the loading process:
- File moves are now performed by S3DistCp and orchestrated by EmrEtlRunner, increasing stability and performance (issue #1777)
Finally, the whole codebase is now written in Scala, which allows us to share many components across codebases and add features in a more consistent and confident manner (issue #3023).
With the initial release of RDB Loader we have achieved feature parity with StorageLoader; however, executing the load as an EMR step imposes several new restrictions, which we are actively looking to fix. All of these limitations are tracked in a dedicated milestone on GitHub.
The most important known limitations are:
Finally, we should flag that you will have to check the EMR logs for certain types of RDB Loader failure, such as invalid configuration or fatal OutOfMemory errors. All other success or failure messages should be printed to stdout by EmrEtlRunner.
We received some tremendous community feedback on Snowplow R89 Plain of Jars; one recurrent theme was the challenges of getting Spark to fully leverage the provided EMR cluster.
Based on this feedback, we are introducing a way to specify arbitrary EMR configuration options through the EmrEtlRunner configuration file:
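As a rough sketch, such a section could look like the following. The application names and settings here are illustrative only, not recommendations; consult the upgrade guide for the exact keys supported by your EmrEtlRunner version:

```yaml
emr:
  # ... your existing EMR settings ...
  configuration:             # hypothetical section name for arbitrary EMR options
    yarn-site:
      yarn.resourcemanager.am.max-attempts: "1"
    spark:
      maximizeResourceAllocation: "true"
    spark-defaults:
      spark.executor.memory: "4G"   # example Spark tuning knob
```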
In addition to giving you these tuning tools for Spark, the Snowplow community is busy sharing guides on how best to optimize Spark on our Discourse. Rick Bolkey from OneSpot has already released a guide, Learnings from using the new Spark EMR Jobs, thanks a lot Rick!
Lastly, the Event Manifest Populator from R88 Angkor Wat was also updated in this release. It now supports enriched archives created with pre-R83 versions of Snowplow (issue #3293).
The latest version of EmrEtlRunner is available from our Bintray.
In order to use RDB Loader, you need to make the following addition to your configuration YAML:
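As an illustrative sketch, the addition pins the RDB Loader version under the storage settings. The exact key name and version number are assumptions; check the upgrade guide for the values that apply to your release:

```yaml
storage:
  versions:
    rdb_loader: 0.12.0   # hypothetical version number; use the one from the release notes
```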
The following settings no longer make sense, as Postgres loading now also happens on the EMR node, and can therefore be deleted:
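For example, settings along these lines, which previously controlled downloading events to the orchestration server for local Postgres loading, are the kind that can go (key names and the path are illustrative):

```yaml
storage:
  download:
    folder: /var/storage   # local download folder for Postgres loading - no longer needed
```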
To fine-tune your EMR applications, you can add an optional configuration section to the EmrEtlRunner configuration file. For a complete example, see our sample configuration file.
EmrEtlRunner now accepts a new `--include` option with a single possible `vacuum` argument, which will be passed to RDB Loader. `--skip` now accepts new arguments, including `analyze`. Skipping the `archive_enriched` step is identical to running R89 EmrEtlRunner without StorageLoader.
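For illustration, invocations could look like the following. The binary name and file paths are placeholders; substitute your own:

```shell
# Ask RDB Loader to VACUUM tables after the load
./snowplow-emr-etl-runner run --config config.yml --resolver resolver.json --include vacuum

# Skip the post-load ANALYZE step
./snowplow-emr-etl-runner run --config config.yml --resolver resolver.json --skip analyze
```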
Finally, note that StorageLoader is no longer part of the batch pipeline apps archive.
Because RDB Loader now runs as an EMR step, we wanted to make sure that users' AWS credentials are not exposed anywhere. To load Redshift, we use IAM roles, which allow Redshift to load data from S3.
To create an IAM Role you need to go to AWS Console -> IAM -> Roles -> Create new role.
Then choose Amazon Redshift as the role type and attach the
AmazonS3ReadOnlyAccess policy. Choose a role name, for example
RedshiftLoadRole. Once created, copy the Role ARN as you will need it in the next section.
Now you need to attach the new role to your running Redshift cluster. Go to AWS Console -> Redshift -> Clusters -> Manage IAM Roles and attach the role you just created.
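Alternatively, if you prefer the AWS CLI, attaching the role can be done with something like the following. The cluster identifier, account ID, and role name are placeholders:

```shell
# Attach the IAM role to an existing Redshift cluster
aws redshift modify-cluster-iam-roles \
  --cluster-identifier my-redshift-cluster \
  --add-iam-roles arn:aws:iam::123456789012:role/RedshiftLoadRole
```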
Your EMR cluster’s master node will need to be whitelisted in Redshift in order to perform the load.
If you are using an “EC2 Classic” environment, from the Redshift UI you will need to create a Cluster Security Group and add the relevant EC2 Security Group, most likely called
ElasticMapReduce-master. Make sure to enable this Cluster Security Group against your Redshift cluster.
If you are using a modern VPC-based environment, you will need to modify the Redshift cluster and add the VPC security group associated with your EMR master node, most likely called
ElasticMapReduce-master.
In both cases, you only need to whitelist access from the EMR master node, because RDB Loader runs exclusively from the master node.
We have updated the Redshift storage target config - the new version requires the Role ARN that you noted down above:
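As a rough sketch, the updated target configuration gains a `roleArn` field alongside the existing connection settings. All values below are placeholders, and the exact field names and schema version should be taken from the upgrade guide rather than from this sketch:

```json
{
  "schema": "iglu:com.snowplowanalytics.snowplow.storage/redshift_config/jsonschema/2-0-0",
  "data": {
    "name": "AWS Redshift enriched events storage",
    "host": "example.abc123.us-east-1.redshift.amazonaws.com",
    "database": "snowplow",
    "port": 5439,
    "username": "storageloader",
    "password": "secret",
    "roleArn": "arn:aws:iam::123456789012:role/RedshiftLoadRole",
    "schema": "atomic"
  }
}
```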
Upcoming Snowplow releases include:
This release is also an important staging post in our mission of loading Snowplow event data into more databases, and in near-real-time. Watch this space!
For more details on this release, please check out the release notes on GitHub.
If you have any questions or run into any problems, please raise an issue or get in touch with us through the usual channels.