08 February 2013
Alex Dean

Snowplow’s own Alexander Dean was recently asked to write an article for the Software Developer’s Journal edition on Hadoop. The kind folks at the Software Developer’s Journal have allowed us to reprint his article in full below.

Alex started writing Hive UDFs as part of the process to write the Snowplow log deserializer - the custom SerDe used to parse Snowplow logs generated by the Cloudfront and Clojure collectors so they can be processed in the Snowplow ETL step.

Article Synopsis

In this article you will learn how to write a user-defined function (“UDF”) to work with the Apache Hive platform. We will start gently with an introduction to Hive, then move on to developing the UDF and writing tests for it. We will write our UDF in Java, but use Scala’s SBT as our build tool and write our tests in Scala with Specs2.

In order to get the most out of this article, you should be comfortable programming in Java. You do not need to have any experience with Apache Hive, HiveQL (the Hive query language) or indeed Hive UDFs - I will introduce all of these concepts from first principles. Experience with Scala is advantageous, but not necessary.

Introduction

Before we start: my name is Alex Dean, and I am the co-founder of Snowplow (http://snowplowanalytics.com/), an open-source web analytics platform built on top of Apache Hadoop and Apache Hive. My experience writing Java code to extend the Hive platform comes from Snowplow, where we built a core piece of our launch platform using a Hive deserializer (https://github.com/snowplow/snowplow/tree/master/3-etl/hive-etl/snowplow-log-deserializers).

So, what is Apache Hive, and what would you want a Hive UDF for? Hive is a data warehouse system built on top of Hadoop for ad-hoc queries and processing of large datasets. Now an Apache Software Foundation project, Hive was originally developed at Facebook, where analysts and data scientists wanted a SQL-like abstraction over traditional Hadoop MapReduce. As such, the key distinguishing feature of Hive is the SQL-like query language HiveQL. An example HiveQL query might look like this:

Listing 1: An example HiveQL query

SELECT 
dt,
COUNT(DISTINCT (user_id))
FROM events
GROUP BY dt ;

This is actually a standard Snowplow query to calculate the number of unique visitors to a website by day. So what happens when an analyst runs this query in Hive? Simply this:

  1. Hive converts the query into the simplest possible set of MapReduce jobs
  2. The resulting MapReduce job (or jobs) is run on the Hadoop platform
  3. The generated result set is then returned to the user’s console
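
To make that concrete, here is a rough sketch in plain Java of what the query in Listing 1 computes, laid out as a map step, a shuffle and a reduce step. This is only an illustration: the names UniquesByDaySketch, Event and uniquesByDay are made up for this example, no Hadoop classes are involved, and the plan that Hive actually generates may well look different.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical, in-memory illustration only - not how Hive executes the query.
final class UniquesByDaySketch {

    // One row of the events table, reduced to the two columns the query uses.
    static final class Event {
        final String dt;
        final String userId;

        Event(String dt, String userId) {
            this.dt = dt;
            this.userId = userId;
        }
    }

    static Map<String, Integer> uniquesByDay(List<Event> events) {
        // "Map" step: emit one (dt, user_id) pair per row.
        List<Map.Entry<String, String>> pairs = new ArrayList<>();
        for (Event e : events) {
            pairs.add(new SimpleEntry<>(e.dt, e.userId));
        }

        // "Shuffle" step: bring together all user ids that share a dt key.
        Map<String, Set<String>> usersByDay = new TreeMap<>();
        for (Map.Entry<String, String> pair : pairs) {
            usersByDay.computeIfAbsent(pair.getKey(), k -> new HashSet<>()).add(pair.getValue());
        }

        // "Reduce" step: count the distinct user ids for each dt.
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, Set<String>> entry : usersByDay.entrySet()) {
            counts.put(entry.getKey(), entry.getValue().size());
        }
        return counts;
    }

    public static void main(String[] args) {
        List<Event> events = Arrays.asList(
            new Event("2013-01-27", "u1"),
            new Event("2013-01-27", "u2"),
            new Event("2013-01-27", "u1"),
            new Event("2013-01-28", "u3"));
        System.out.println(uniquesByDay(events)); // prints {2013-01-27=2, 2013-01-28=1}
    }
}
```

The repeated visit by user u1 on 27 January is counted once, which is exactly what COUNT(DISTINCT ...) asks for.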

Certainly this is a powerful abstraction over MapReduce jobs, which can be tedious and difficult to write by hand. And HiveQL has a lot of power - there is very little in the ANSI SQL standard which is not available in HiveQL. Nonetheless, sometimes the Hive user will need more power, and for these occasions Hive has three main extension points:

  1. User-defined functions (“UDFs”), which provide a way of extending the functionality of Hive with a function (written in Java) that can be evaluated in HiveQL statements
  2. Custom serializers and/or deserializers (“serdes”), which provide a way of either deserializing a custom file format stored on HDFS to a POJO (plain old Java object), or serializing a POJO to a custom file format (or both)
  3. Custom mappers/reducers, which allow you to add custom map or reduce steps into your Hive query. These map/reduce steps can be written in any programming language - not just Java

We will not consider serdes or custom mappers/reducers further in this article - we hope to write further articles on each of these in the future.

Now that we understand why you might write a UDF for Hive, let’s crack on and start writing one!

Setting up our project

We will be writing a relatively simple UDF - one which converts a string in Hive to upper-case. Note that a version of this function is actually built into Hive as the UPPER function - for a full list of built-in UDFs in Hive, please see: https://cwiki.apache.org/Hive/languagemanual-udf.html

As mentioned previously, we will write our UDF in Java - but we will wrap our Java core in a Scala project (with Scala tests), because at Snowplow we much prefer writing Scala to Java. We will use SBT, the Scala build tool, to configure our project - this is an alternative to Maven or similar; SBT handles mixed Java and Scala projects perfectly well.

First, let’s create a directory for our project and add a file, project.sbt, to the project root, containing:

Listing 2: Our project.sbt build file

name := "hive-example-udf"
version := "0.0.1"
organization := "com.snowplowanalytics"
scalaVersion := "2.9.2"
scalacOptions ++= Seq("-unchecked", "-deprecation")
resolvers += "CDH4" at "https://repository.cloudera.com/artifactory/cloudera-repos/"
libraryDependencies += "org.apache.hadoop" %  "hadoop-core"        % "0.20.2"      % "provided"
libraryDependencies += "org.apache.hive"   %  "hive-exec"          % "0.8.1"       % "provided"
libraryDependencies += "org.specs2"        %% "specs2"             % "1.12.1"      % "test"

This is a simple project configuration which names and versions our project, and also adds Hadoop, Hive and Specs2 (our testing library) as dependencies. If you do not have SBT already installed, you can find instructions here: http://www.scala-sbt.org/release/docs/Getting-Started/Setup.html

Writing our UDF

Done? On to the code. First let’s create a folder for it:

$ mkdir -p src/main/java/com/snowplowanalytics/hive/udf

Now let’s add a file into our udf folder called ToUpper.java, containing:

Listing 3: Our ToUpper.java UDF definition

package com.snowplowanalytics.hive.udf;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.hive.ql.exec.Description;
import org.apache.hadoop.io.Text;

@Description(
    name = "toupper",
    value = "_FUNC_(str) - Converts a string to uppercase",
    extended = "Example:\n" +
    "  > SELECT toupper(author_name) FROM authors a;\n" +
    "  STEPHEN KING"
    )
public class ToUpper extends UDF {

    public Text evaluate(Text s) {
        Text to_value = new Text("");
        if (s != null) {
            try {
                to_value.set(s.toString().toUpperCase());
            } catch (Exception e) { // Should never happen
                to_value = new Text(s);
            }
        }
        return to_value;
    }
}

This file defines our UDF, ToUpper. The package definition and imports should be self-explanatory; the @Description annotation is a useful Hive-specific annotation to provide usage information for our UDF in the Hive console.

All user-defined functions extend the Hive UDF class; a UDF sub-class must then implement one or more methods named “evaluate” which will be called by Hive. We implement an evaluate method which takes one Hadoop Text (which stores text using UTF-8) and returns a new Hadoop Text containing the same string in upper-case, or an empty Text if the input is null. The only complexity is some exception handling, which we include for safety’s sake.
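
One way to picture how a framework can call “evaluate” methods that it knows nothing about at compile time is reflection. The sketch below is a deliberately simplified stand-in written in plain Java: UpperSketch, its two-argument overload and the call helper are all hypothetical names invented for this example, it uses String rather than Hadoop’s Text, and Hive’s real method resolution is more involved than this.

```java
import java.lang.reflect.Method;

final class EvaluateDispatchSketch {

    // Hypothetical stand-in for a UDF class with two "evaluate" overloads.
    static final class UpperSketch {
        public String evaluate(String s) {
            return s == null ? "" : s.toUpperCase();
        }

        // Illustrative second overload: upper-case, then truncate to maxLength.
        public String evaluate(String s, Integer maxLength) {
            String upper = evaluate(s);
            return upper.length() <= maxLength ? upper : upper.substring(0, maxLength);
        }
    }

    // Picks the "evaluate" overload whose parameter types match the runtime
    // types of the arguments exactly. Null arguments are not supported here.
    static Object call(Object udf, Object... args) {
        Class<?>[] types = new Class<?>[args.length];
        for (int i = 0; i < args.length; i++) {
            types[i] = args[i].getClass();
        }
        try {
            Method m = udf.getClass().getMethod("evaluate", types);
            return m.invoke(udf, args);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Could not call evaluate", e);
        }
    }

    public static void main(String[] argv) {
        UpperSketch udf = new UpperSketch();
        System.out.println(call(udf, "Stephen King"));    // prints STEPHEN KING
        System.out.println(call(udf, "Stephen King", 7)); // prints STEPHEN
    }
}
```

The point to take away is simply that each overload of evaluate is a separate entry point, selected by the types of the arguments supplied in the query.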

Now let’s check that this compiles. In the root folder, run SBT like so:

Listing 4: Compiling in SBT

$ sbt
> compile
[success] Total time: 0 s, completed 28-Jan-2013 16:41:53

Testing our UDF

Okay great, now time to write a test to make sure this is doing what we expect! First we create a folder for it:

$ mkdir -p src/test/scala/com/snowplowanalytics/hive/udf

Now let’s add a file into our udf test folder called ToUpperTest.scala, containing:

Listing 5: Our ToUpperTest.scala Specs2 test

package com.snowplowanalytics.hive.udf

import org.apache.hadoop.io.Text

import org.specs2._

class ToUpperSpec extends mutable.Specification {
  val toUpper = new ToUpper

  "ToUpper#evaluate" should {
    "return an empty string if passed a null value" in {
      toUpper.evaluate(null).toString mustEqual ""
    }

    "return a capitalised string if passed a mixed-case string" in {
      toUpper.evaluate(new Text("Stephen King")).toString mustEqual "STEPHEN KING"
    }
  }
}

This is a Specs2 unit test (http://etorreborre.github.com/specs2/), written in Scala, which checks that ToUpper is performing correctly: we test that a null value is handled correctly (by returning an empty string), and then we test that a mixed-case string (“Stephen King”) is successfully converted to “STEPHEN KING”.
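
One edge case that the two tests above do not cover is locale. Java’s String.toUpperCase() with no arguments uses the JVM’s default locale, so the result can differ from machine to machine. The sketch below is an aside in plain Java rather than part of the UDF: the class name LocaleUpperSketch and the method upperRoot are assumptions made for this example, and whether you want locale-independent behaviour in your own UDF is a design decision for you.

```java
import java.util.Locale;

final class LocaleUpperSketch {

    // Upper-cases with a fixed, locale-neutral rule set.
    static String upperRoot(String s) {
        return s.toUpperCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        Locale turkish = Locale.forLanguageTag("tr");
        // In a Turkish locale, lower-case "i" maps to dotted capital I (U+0130).
        System.out.println("title".toUpperCase(turkish).equals("T\u0130TLE")); // prints true
        System.out.println(upperRoot("title")); // prints TITLE
    }
}
```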

So let’s run this next from SBT:

Listing 6: Testing in SBT

> test
[info] Compiling 1 Scala source to /home/alex/Development/Snowplow/hive-example-udf/target/scala-2.9.2/test-classes...
[info] ToUpperSpec
[info] 
[info] ToUpper#evaluate should
[info] + return an empty string if passed a null value
[info] + return a capitalised string if passed a mixed-case string
[info]  
[info]  
[info] Total for specification ToUpperSpec
[info] Finished in 742 ms
[info] 2 examples, 0 failure, 0 error
[info] 
[info] Passed: : Total 2, Failed 0, Errors 0, Passed 2, Skipped 0
[success] Total time: 8 s, completed 28-Jan-2013 17:11:45

Building and using our UDF

Our tests passed! Now we’re ready to use our function “in anger” from Hive. First, still from SBT, let’s build our jarfile:

Listing 7: Packaging our jar

> package
[info] Packaging /home/alex/Development/Snowplow/hive-example-udf/target/scala-2.9.2/hive-example-udf_2.9.2-0.0.1.jar ...
[info] Done packaging.
[success] Total time: 1 s, completed 28-Jan-2013 17:21:02

Now take the jarfile (hive-example-udf_2.9.2-0.0.1.jar) and upload it to our Hive cluster - on Amazon’s Elastic MapReduce, for example, you could upload it to S3.

From your Hive console, you can now add our new UDF like so:

> add jar /path/to/hive-example-udf_2.9.2-0.0.1.jar;
> create temporary function toupper as 'com.snowplowanalytics.hive.udf.ToUpper';

And then finally you can use our new UDF in your HiveQL queries, something like this:

> SELECT toupper(author_name) FROM authors a;
  STEPHEN KING

That completes our article. If you would like to download the example code above as a working project, you can find it on GitHub here: https://github.com/snowplow/hive-example-udf

I hope to return with further articles about Hive and Hadoop in the future - potentially one on writing a custom serde - an area where we have a lot of experience at Snowplow Analytics.

Looking for help performing analytics or developing data pipelines using Hive and other Hadoop-powered tools?

Learn more about services offered by the Snowplow Professional Services team.

Alex is co-founder and technical lead at Snowplow Analytics. You can find him on Twitter and LinkedIn.

