(Image: The Camunian Rose - Luca Giarelli / CC-BY-SA 3.0)
It is possible to set up end-to-end encryption for the batch pipeline running in Elastic MapReduce. For context, we recommend checking out Amazon’s dedicated guide to EMR data encryption.
In order to set up end-to-end encryption, you will need a couple of things:
For at-rest encryption on S3, the buckets with which EmrEtlRunner will interact must have SSE-S3 encryption enabled - this is the only mode we currently support. For reference, you can look at Amazon’s dedicated guide to S3 encryption.
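As an illustration, default SSE-S3 encryption can be switched on at the bucket level by applying a server-side encryption configuration document like the following (the bucket it targets is yours to choose), for example through the aws s3api put-bucket-encryption command:

```json
{
  "Rules": [
    {
      "ApplyServerSideEncryptionByDefault": {
        "SSEAlgorithm": "AES256"
      }
    }
  ]
}
```

AES256 is the algorithm identifier corresponding to SSE-S3; once applied, any newly written object inherits this encryption by default.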
Keep in mind that switching on this setting is not retroactive: it only applies to objects written after it is enabled. If you want only encrypted data in your bucket, you will need to go through the existing data and copy each object in place, so that it gets rewritten with encryption.
Also, if you are using the Clojure Collector, SSE-S3 encryption needs to be set up at the bucket level, not the folder level, in order to take effect.
Once this is done, you will need to tell EmrEtlRunner that it will have to interact with encrypted buckets through the aws:s3:buckets:encrypted: true configuration setting.
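In the EmrEtlRunner config.yml, a minimal sketch of where that setting sits:

```yaml
aws:
  s3:
    buckets:
      encrypted: true   # the buckets EmrEtlRunner interacts with use SSE-S3
```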
Elastic MapReduce offers EMR security configurations, which let you enforce encryption for various aspects of your job. The options are:
- at-rest encryption for S3 with EMRFS
- at-rest encryption for the local disks of your cluster
- in-transit encryption
For a complete guide on setting up an EMR security configuration, you can refer to Amazon’s dedicated guide to EMR security.
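For illustration, a security configuration enabling all three options could look like the following JSON document (the KMS key ARN and the certificates location are placeholders to substitute with your own), which you would then register under a name of your choosing, e.g. with the aws emr create-security-configuration command:

```json
{
  "EncryptionConfiguration": {
    "EnableInTransitEncryption": true,
    "EnableAtRestEncryption": true,
    "AtRestEncryptionConfiguration": {
      "S3EncryptionConfiguration": {
        "EncryptionMode": "SSE-S3"
      },
      "LocalDiskEncryptionConfiguration": {
        "EncryptionKeyProviderType": "AwsKms",
        "AwsKmsKey": "arn:aws:kms:eu-west-1:123456789012:key/your-key-id"
      }
    },
    "InTransitEncryptionConfiguration": {
      "TLSCertificateConfiguration": {
        "CertificateProviderType": "PEM",
        "S3Object": "s3://your-bucket/path/to/certs.zip"
      }
    }
  }
}
```

Each of the three blocks is optional, so you can enable only the subset of encryption options that makes sense for your deployment.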
Once you’ve performed this setup, you can specify which security configuration EmrEtlRunner should use through the aws:emr:security_configuration EmrEtlRunner configuration option, which we will cover in the Upgrading section below.
Let’s review each of these three EMR encryption options to understand their impact on our Snowplow batch pipeline.
This option specifies the strategy used to encrypt data when EMR interacts with S3 through EMRFS. By default, even without any encryption configured, data is encrypted while in transit from EMR to S3.
Note that, currently, the batch pipeline does not make use of EMRFS; instead, it copies data from S3 to the HDFS cluster on the EMR nodes, and from HDFS back to S3, through S3DistCp steps (more on that in the next section).
When running the Snowplow pipeline in EMR, an HDFS cluster is set up across the different nodes of your cluster. Enabling encryption for the local disks on those nodes will have the following effects:
When enabling this option, please keep the following drawbacks in mind:
To set up this type of encryption, you will need to create an appropriate KMS key (refer to Amazon’s KMS guide for more information). This key needs to be in the same region as the EMR cluster.
It is important to note that the role used in aws:emr:jobflow_role in the EmrEtlRunner configuration needs to have the kms:GenerateDataKey permission for this setting to work.
This policy will be used to generate the necessary data keys using the “master” key created above. Those data keys are, in turn, used to encrypt pieces of data on your local disks.
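As a sketch, the IAM policy statement to attach to the jobflow role could look like the following (the key ARN is a placeholder for the key you created above):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "kms:GenerateDataKey",
      "Resource": "arn:aws:kms:eu-west-1:123456789012:key/your-key-id"
    }
  ]
}
```

Scoping the statement to the specific key ARN, rather than "*", keeps the role from generating data keys with any other KMS key in the account.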
When running the Spark jobs of the Snowplow pipeline (Enrich and Shred), and running some S3DistCp steps (e.g. using --targetSize), data is shuffled around the different nodes in your EMR cluster. Enabling encryption for those data movements will have the following effects:
Be aware that this type of encryption also has a performance impact as data needs to be encrypted when sent over the network (e.g. when running deduplication in the Shred job).
To set up this type of encryption, you will need to create certificates per Amazon’s PEM certificates for EMR guidance.
Please note: for this type of encryption to work, your cluster needs to run in a VPC, and the domain name specified in the certificates needs to be *.ec2.internal if in us-east-1, or *.region.compute.internal for any other region.
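As a sketch, assuming a cluster in us-east-1, a self-signed certificate matching that domain name could be generated with openssl (the file names follow the ones EMR expects to find in the certificates archive):

```shell
# Generate a private key and a self-signed certificate
# for the internal EC2 domain of us-east-1
openssl req -x509 -newkey rsa:2048 -nodes -days 365 \
  -keyout privateKey.pem -out certificateChain.pem \
  -subj "/CN=*.ec2.internal"

# EMR also expects a trustedCertificates.pem; for a
# self-signed setup it can be the certificate itself
cp certificateChain.pem trustedCertificates.pem
```

You would then zip the three .pem files together and upload the archive to S3 so that it can be referenced from your security configuration.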
This release also brings some ergonomic improvements to EmrEtlRunner:
an --ignore-lock-on-start option which lets you ignore an already-in-place lock, should one exist. Note that the lock will still be cleaned up if the run ends successfully
Up until this release, the Clojure Collector defaulted to using the parent path of the requested collector endpoint as the path for the network_userid cookie being set. For example, hitting my-collector.com/i would result in a cookie path of /, whereas hitting my-collector.com/com.snowplowanalytics.iglu/v1 would result in a cookie path of /com.snowplowanalytics.iglu/. This would lead to the network_userid being unintentionally different for the same user across the different event collection paths.
With R108, the cookie path will always default to /, no matter the endpoint hit. This can be overridden through the SP_PATH Elastic Beanstalk environment property.
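For example, the property can be set through an option-settings document like the following (the path value is a placeholder for your own), passed to the aws elasticbeanstalk update-environment command:

```json
[
  {
    "Namespace": "aws:elasticbeanstalk:application:environment",
    "OptionName": "SP_PATH",
    "Value": "/custom-cookie-path"
  }
]
```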
Finally, we’ve updated a good number of dependencies in the Clojure Collector.
This release applies only to our AWS batch pipeline - if you are running any other flavor of Snowplow, there is no upgrade necessary.
The latest version of EmrEtlRunner is available from our Bintray.
To use the latest EmrEtlRunner features, you will need to make the following changes to your EmrEtlRunner configuration:
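Concretely, assuming you registered a security configuration named, say, snowplow-emr-security-config, the relevant additions to config.yml would look like:

```yaml
aws:
  s3:
    buckets:
      encrypted: true              # only if your buckets use SSE-S3
  emr:
    security_configuration: snowplow-emr-security-config  # optional
```

Both settings are optional: leave them out if you are not using encrypted buckets or an EMR security configuration.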
For a complete example, see our sample configuration file.
The new Clojure Collector is available in S3 at:
To customize your cookie path so that it does not default to /, make sure to specify the SP_PATH Elastic Beanstalk environment property as described above.
Upcoming Snowplow releases are:
For more details on this release, please check out the release notes on GitHub.
If you have any questions or run into any problem, please visit our Discourse forum.