Thanks Alex! Quick bit about me: I’ve been in software development since the mid-90s, working on everything from Java Swing (design automation tools) to embedded Jetty (email encryption) and now a mixture of Ruby and Scala. Since 2010, I’ve been responsible for engineering at Sharethrough, an ad tech company based out of San Francisco. We’re building a native advertising platform based on the belief that advertising is no longer sustainable as banners and punch-the-monkey ads and has begun the transition to engaging, non-interruptive, choice-based experiences. One thing that a lot of people outside of ad tech don’t realize is that online advertising is synonymous with scale, and some of the most interesting technology problems are driven by those demands. This is where Elasticity comes in.
Our ads report a significant amount of information around user behavior, which we then use in decisioning, pricing and insight derivation (e.g. “Do people share videos before watching them?”). In the early days, we were handling what we now consider a small volume of logs (1GB/day) with a correspondingly quick and dirty ETL: a log parser that updated the MySQL instance backing our reporting dashboards. Fast forward to 2013 and our log intake is north of 30GB/day. With this volume of data and the insights we wanted to derive, that process didn’t cut it, and we determined that the quickest way for us to begin deriving value from our data was via Amazon Elastic MapReduce (from here on referred to as EMR).
If you’re unfamiliar with AWS service interaction and evolution, it often follows this pattern (using EMR as an example):
Amazon’s tools are developer services, not meant for absolutely streamlined consumption; some legwork is required. The AWS CLI is a thin wrapper around the EMR REST API, meaning there are numerous and frequently mutually exclusive options. If you choose to use the CLI, you’ll spend a significant amount of time learning how to use the command line tools by reading the developer API guide. Why isn’t there a programmatic way to work with EMR that follows the same mental model as the one exposed via the UI and doesn’t require you to understand the EMR REST API?
That’s where Elasticity comes in.
As an API author, you can choose to represent the EMR model directly or layer your own model on top of it. As a point of reference, this is a partial list of EMR REST API calls: AddInstanceGroups, AddJobFlowSteps, DescribeJobFlows, etc.
Elasticity v1 split (2) and (3) above, encapsulating an entire “job” as your unit of interaction with the API. You’d create and configure a “HiveJob” and start it. This assumed that most interactions with EMR are single-step.
Elasticity v2 was a major rewrite focusing wholly on option (3) above. You create and configure “JobFlows” and add steps to them, just as you do in the UI; a much more comfortable model for those familiar with the EMR UI (which we all were at some point when we learned how to use EMR).
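To make the "layer your own model on top" idea concrete, here is a minimal Ruby sketch of a job flow object that collects steps and translates them into EMR API actions. Everything in the `Sketch` module is hypothetical and written for illustration: it is not the Elasticity gem's actual API, the client is a fake that only records calls, and the job flow ID and S3 paths are made up. The only real names are the EMR actions `RunJobFlow` and `AddJobFlowSteps`.

```ruby
# Illustrative only: these classes are hypothetical, NOT the Elasticity gem's API.
module Sketch
  # One unit of work in a job flow, e.g. a Hive script stored in S3.
  HiveStep = Struct.new(:script) do
    def to_api_hash
      { name: 'Hive step', script: script }
    end
  end

  # Stand-in for an EMR client. It records the actions it would have sent
  # instead of talking to AWS.
  class FakeClient
    attr_reader :calls

    def initialize
      @calls = []
    end

    def request(action, params)
      @calls << [action, params]
      'j-FAKE123' # made-up job flow ID
    end
  end

  # The UI-style model: configure a job flow, add steps, run it.
  class JobFlow
    attr_reader :id

    def initialize(client, name:)
      @client = client
      @name = name
      @steps = []
      @id = nil
    end

    # Before the flow starts, steps are queued locally; afterwards they are
    # sent to the running flow via AddJobFlowSteps.
    def add_step(step)
      if @id
        @client.request('AddJobFlowSteps', job_flow_id: @id, steps: [step.to_api_hash])
      else
        @steps << step
      end
      self
    end

    # Starts the flow with all queued steps in a single RunJobFlow call.
    def run
      raise 'job flow already started' if @id

      @id = @client.request('RunJobFlow', name: @name, steps: @steps.map(&:to_api_hash))
    end
  end
end

client = Sketch::FakeClient.new
flow = Sketch::JobFlow.new(client, name: 'Nightly report')
flow.add_step(Sketch::HiveStep.new('s3n://example-bucket/report.q'))
flow.run
flow.add_step(Sketch::HiveStep.new('s3n://example-bucket/cleanup.q'))
puts client.calls.map(&:first).inspect # prints ["RunJobFlow", "AddJobFlowSteps"]
```

The point of the sketch is that the caller only thinks in terms of job flows and steps; which REST action gets issued, and when, is a detail the wrapper decides.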
Elasticity v3… who knows? First and foremost, I work on features that Sharethrough requires. We’re in a steady state with EMR at the moment and now I’m hoping the community has some suggestions :)
Thanks for making it this far! And if anything I touched on sounds interesting, Sharethrough is hiring and we’re relo-friendly! Check us out at Sharethrough Engineering.