Snowplow’s own Alexander Dean was recently asked to write an article for the Software Developer’s Journal edition on Hadoop The kind folks at the Software Developer’s Journal have allowed us to reprint his article in full below.
Alex started writing Hive UDFs as part of the process to write the Snowplow log deserializer - the custom SerDe used to parse Snowplow logs generated by the Cloudfront and Clojure collectors so they can be processed in the Snowplow ETL step.
In this article you will learn how to write a user-defined function (“UDF”) to work with the Apache Hive platform. We will start gently with an introduction to Hive, then move on to developing the UDF and writing tests for it. We will write our UDF in Java, but use Scala’s SBT as our build tool and write our tests in Scala with Specs2.
In order to get the most out of this article, you should be comfortable programming in Java. You do not need to have any experience with Apache Hive, HiveQL (the Hive query language) or indeed Hive UDFs - I will introduce all of these concepts from first principles. Experience with Scala is advantageous, but not necessary.
Before we start: my name is Alex Dean, and I am the co-founder of Snowplow (http://snowplowanalytics.com/), an open-source web analytics platform built on top of Apache Hadoop and Apache Hive. My experience writing Java code to extend the Hive platform comes from Snowplow, where we built a core piece of our launch platform using a Hive deserializer (https://github.com/snowplow/snowplow/tree/master/3-etl/hive-etl/snowplow-log-deserializers).
So, what is Apache Hive, and what would you want a Hive UDF for? Hive is a data warehouse system built on top of Hadoop for ad-hoc queries and processing of large datasets. Now an Apache Software Foundation project, Hive was originally developed at Facebook, where analysts and data scientists wanted a SQL-like abstraction over traditional Hadoop MapReduce. As such, the key distinguishing feature of Hive is the SQL-like query language HiveQL. An example HiveQL query might look like this:
Listing 1: An example HiveQL query
This is actually a standard Snowplow query to calculate the number of unique visitors to a website by day. So what happens when an analyst runs this query in Hive? Simply this:
- Hive converts the query into the simplest possible set of MapReduce jobs
- The MapReduce job or jobs is run on the Hadoop platform
- The generated result set is then returned to the user’s console
Certainly this is a powerful abstraction over MapReduce jobs, which can be tedious and difficult to write by hand. And HiveQL has a lot of power - there is very little in the ANSI SQL standard which is not available in HiveQL. Nonetheless, sometimes the Hive user will need more power, and for these occasions Hive has three main extension points:
- User-defined functions (“UDFs”), which provide a way of extending the functionality of Hive with a function (written in Java) that can be evaluated in HiveQL statements
- Custom serializers and/or deserializers (“serdes”), which provide a way of either deserializing a custom file format stored on HDFS to a POJO (plain old Java object), or serializing a POJO to a custom file format (or both)
- Custom mappers/reducers, which allow you to add custom map or reduce steps into your Hive query. These map/reduce steps can be written in any programming language - not just Java
We will not consider serdes or custom mappers/reducers further in this article - we hope to write further articles on each of these in the future.
Now that we understand why you might write a UDF for Hive, let’s crack on and start writing one!
Setting up our project
We will be writing a relatively simple UDF - one which generates a converts a string in Hive to upper-case. Note that a version of this function is actually built into Hive as the UPPER function - for a full list of built-in UDFs in Hive, please see: https://cwiki.apache.org/Hive/languagemanual-udf.html
As mentioned previously, we will write our UDF in Java - but we will wrap our Java core in a Scala project (with Scala tests), because at Snowplow we much prefer writing Scala to Java. We will use SBT, the Scala build tool, to configure our project - this is an alternative to Maven or similar; SBT handles mixed Java and Scala projects perfectly well.
First, let’s create a directory for our project, and add a file, project.sbt into the project root, which contains:
Listing 2: Our project.sbt build file
This is a simple project configuration which names and versions our project, and also adds Hadoop, Hive and Specs2 (our testing library) as dependencies. If you do not have SBT already installed, you can find instructions here http://www.scala-sbt.org/release/docs/Getting-Started/Setup.html
Writing our UDF
Done? Onto the code. First let’s create a folder for it:
Now let’s add a file into our udf folder called ToUpper.java, containing:
Listing 3: Our ToUpper.java UDF definition
This file defines our UDF, ToUpper. The package definition and imports should be self-explanatory; the
@Description annotation is a useful Hive-specific annotation to provide usage information for our UDF in the Hive console.
All user-defined functions extend the Hive UDF class; a UDF sub-class must then implement one or more methods named “evaluate” which will be called by Hive. We implement an evaluate method which takes one Hadoop Text (which stores text using UTF8) and returns the same Hadoop Text, but now in upper-case. The only complexity is some exception handling, which we include for safety’s sake.
Now let’s check that this compiles. In the root folder, run SBT like so:
Listing 4: Compiling in SBT
Master building data-driven products
Get the latest eBook from Snowplow Analytics and use data to build killer products
Testing our UDF
Okay great, now time to write a test to make sure this is doing what we expect! First we create a folder for it:
Now let’s add a file into our udf test folder called ToUpperTest.scala, containing:
Listing 5: Our ToUpperTest.scala Specs2 test
This is a Specs2 unit test (http://etorreborre.github.com/specs2/), written in Scala, which checks that ToUpper is performing correctly: we test that an empty string is handled correctly, and then we test that a mixed-case string (“Stephen King”) is successfully converted to “STEPHEN KING”.
So let’s run this next from SBT:
Listing 6: Testing in SBT
Building and using our UDF
Our tests passed! Now we’re ready to use our function “in anger” from Hive. First, still from SBT, let’s build our jarfile:
Listing 7: Packaging our jar
Now take the jarfile (hive-example-udf_2.9.2-0.0.1.jar) and upload it to our Hive cluster - on Amazon’s Elastic MapReduce, for example, you could upload it to S3.
From your Hive console, you can now add our new UDF like so:
And then finally you can use our new UDF in your HiveQL queries, something like this:
That completes our article. If you would like to download the example code above as a working project, you can find it on GitHub here: https://github.com/snowplow/hive-example-udf
I hope to return with further articles about Hive and Hadoop in the future - potentially one on writing a custom serde - an area where we have a lot of experience at Snowplow Analytics.