Snowplow 90 Lascaux released, moving loading step onto EMR

26 July 2017  •  Anton Parkhomenko

We are tremendously excited to announce the release of Snowplow 90 Lascaux. This release introduces RDB Loader, a new EMR-run application replacing our trusty StorageLoader, as proposed in our Splitting EmrEtlRunner RFC. This release also brings various enhancements and alterations in EmrEtlRunner.

Read on for more information on R90 Lascaux, named after the Upper Paleolithic cave complex in southwestern France:

  1. RDB Loader
  2. Other improvements
  3. Upgrading
  4. Roadmap
  5. Getting help

lascaux

1. RDB Loader

1.1 The rationale for replacing StorageLoader

StorageLoader was a standalone JRuby app, typically running after EmrEtlRunner on the same orchestration server and ingesting shredded Snowplow event data into relational databases, such as AWS Redshift or PostgreSQL. This approach served us well over the years, but has started to show its age. As we’re moving towards supporting new cloud providers and simplifying our existing orchestration tools, we want to modularize and simplify our batch pipeline, making StorageLoader part of the existing EMR jobflow, and rewriting it in Scala to maximize opportunities for code reuse.

Loading storage targets like Redshift from within EMR jobflow has many advantages:

  • Better for security - the server running EmrEtlRunner no longer needs access to your Redshift cluster
  • Simpler to setup - user no longer have to setup StorageLoader: it will be automatically fetched and run for you by EMR in same manner as the existing Spark Shred and Enrich jobs
  • Modularity - we’re removing features that are either better implemented in Dataflow Runner or SQL Runner, or better performed with specialized tools such as S3DistCp or EMR itself

1.2 Other improvements

Although we entirely re-implemented and changed the execution model of StorageLoader, RDB Loader is a strict port: it has all functionality that it predecessor had.

Along with shifting from standalone app to EMR step we also made several important improvements in loading process:

  • The enriched and shredded events archiving logic now runs in EMR, using S3DistCp and orchestrated by EmrEtlRunner, increasing stability and performance (issue #1777)
  • RDB Loader loads all existing run folders from shredded good folder, eliminating possibility of data missing in Redshift due to “blind” archiving (issue #2962)
  • RDB Loader uses an IAM role instead of credentials to access Redshift (issue #3281)
  • We fixed the long-standing eventual consistency problem (issue #3113)

And finally, to reiterate that the whole codebase has been written in Scala, which allows us to share many components across codebases and add features in more consistent and confident manner (issue #3023).

1.3 Limitations and plans for RDB Loader

With the initial release of RDB Loader we’ve achieved feature-parity with StorageLoader, however executing the load as an EMR step imposes several new restrictions, which we’re currently actively looking to fix. All of these limitations are addressed by a dedicated milestone on Github.

The most important known limitations are:

  1. The impossibility of loading Redshift or Postgres via an SSH tunnel or similar non-standard setups
  2. The visibility of Base64-encoded Redshift credentials in the EMR console

Finally, we should flag that you will have to check the EMR logs for certain types of RDB Loader failure, such as invalid configuration or fatal OutOfMemory errors. All other success or failure messages should be printed to stdout by EmrEtlRunner.

2. Other improvements

We received some tremendous community feedback on Snowplow R89 Plain of Jars; one recurrent theme was the challenges of getting Spark to fully leverage the provided EMR cluster.

As a result, based on this feedback, we’re introducing a way to specify arbitrary EMR configuration options through the EmrEtlRunner configuration file:

aws:
  emr:
    // ...
    configuration:
      yarn-site:
         yarn.resourcemanager.am.max-attempts: "1"
       spark:
         maximizeResourceAllocation: "true"
       spark-defaults:
         spark.executor.instances: "17"
         spark.yarn.executor.memoryOverhead: "4096"
         spark.executor.memory: "35G"
         spark.yarn.driver.memoryOverhead: "4096"
        // etc

In addition to giving you these tuning tools for Spark, the Snowplow community is busy sharing guides on how best to optimize Spark on our Discourse. Rick Bolkey from OneSpot has already released a guide, Learnings from using the new Spark EMR Jobs, thanks a lot Rick!

Lastly, the Event Manifest Populator from R88 Angkor Wat was also updated in this release. It now supports enriched archives created with pre-R83 versions of Snowplow (issue #3293).

3. Upgrading

3.1 Upgrading EmrEtlRunner

The latest version of EmrEtlRunner is available from our Bintray.

3.2 Updating config.yml

In order to use RDB Loader you need to make following addition in your configuration YAML:

storage:
  versions:
    rdb_loader: 0.12.0        # NEW

The following settings no longer make sense, as Postgres loading also happens on EMR node, therefore can be deleted:

storage:
  download:                   # REMOVE
    folder:                   # REMOVE

To gradually configure your EMR application you can add optional emr.configuration property:

emr:
  configuration:                                  # NEW
    yarn-site:
      yarn.resourcemanager.am.max-attempts: "1"
    spark:
      maximizeResourceAllocation: "true"

For a complete example, see our sample config.yml template

3.3 Updating EmrEtlRunner scripts

EmrEtlRunner now accepts a new --include option with a single possible vacuum argument, which will be passed to RDB Loader.

Also, --skip now accepts new rdb_load, archive_enriched and analyze arguments. Skipping rdb_load and archive_enriched steps is identical to running R89 EmrEtlRunner without StorageLoader.

Finally, note that the StorageLoader is no more part of batch pipeline apps archive.

3.4 Creating IAM Role for Redshift

As RDB Loader is EMR step now, we wanted to make sure that user’s AWS credentials are not exposed anywhere. To load Redshift we’re using IAM Roles, which allow Redshift to load data from S3.

To create an IAM Role you need to go to AWS Console -> IAM -> Roles -> Create new role.

Then you need chose Amazon Redshift -> AmazonS3ReadOnlyAccess, choose a role name, for example RedshiftLoadRole. Once created, copy the Role ARN as you will need it in the next section.

Now you need to attach new role to running Redshift cluster. Go to AWS Console -> Redshift -> Clusters -> Manage IAM Roles -> Attach just created role.

3.5 Whitelisting EMR in Redshift

Your EMR cluster’s master node will need to be whitelisted in Redshift in order to perform the load.

If you are using an “EC2 Classic” environment, from the Redshift UI you will need to create a Cluster Security Group and add the relevant EC2 Security Group, most likely called ElasticMapReduce-master. Make sure to enable this Cluster Security Group against your Redshift cluster.

If you are using modern VPC-based environment, you will need to modify the Redshift cluster, and add a VPC security group, most likely called ElasticMapReduce-Master-Private.

In both cases, you only need to whitelist access from the EMR master node, because RDB Loader runs exclusively from the master node.

3.6 Updating Storage configs

We have updated the Redshift storage target config - the new version requires the Role ARN that you noted down above:

{
    "schema": "iglu:com.snowplowanalytics.snowplow.storage/redshift_config/jsonschema/2-0-0",       // WAS 1-0-0
    "data": {
        "name": "AWS Redshift enriched events storage",
        ...
        "roleArn": "arn:aws:iam::719197435995:role/RedshiftLoadRole",                               // NEW
        ...
    }
}

4. Roadmap

Upcoming Snowplow releases include:

This release is also an important staging post in our mission of loading Snowplow event data into more databases, and in near-real-time. Watch this space!

5. Getting help

For more details on this release, please check out the release notes on GitHub.

If you have any questions or run into any problems, please raise an issue or get in touch with us through the usual channels.