Writing User Defined Functions For Pig

If you are processing a bunch of data, grouping it, joining it, filtering it, then you should probably be using pig.

So go download that, and get it all set up. You need:

Java 1.6 (with JAVA_HOME set)
Hadoop (with HADOOP_HOME set)
pig (of course)

Put all the relevant stuff in your PATH too.

pig 101

So here's a simple pig script.

	REGISTER com.darkhax.blog.pig.jar;
	DEFINE Parser com.darkhax.blog.pig.LogParser();

	logs = LOAD 'apache.log.bz2' USING TextLoader AS (line: chararray);
	log_events = FOREACH logs GENERATE FLATTEN(Parser(line));

	by_action = GROUP log_events BY action;
	counts = FOREACH by_action GENERATE group, COUNT(log_events);
	STORE counts INTO 'count_summary';


This registers a jar file and defines a custom UDF (User Defined Function). In this case, it happens to be a log line parser function.

We load a bzipped log file from apache (it can just read the bzipped files! Wee!) and by using the TextLoader, each line comes in as a chararray (in pig terms, a string).

Now, FOREACH line, run it through the parser function we defined earlier. We'll look at this shortly. We can now do some fun stuff, like GROUP on the action, and generate the counts for each action.
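If the GROUP and COUNT steps feel abstract, here is a rough plain-Java analogy (Java 8 or later assumed) of what they compute. This is not how pig runs it; pig turns the script into Hadoop jobs. The class name `GroupCountSketch` and the method `countByAction` are made up for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;
import java.util.stream.Collectors;

// Plain-Java analogy only; the names here are hypothetical and not part of pig.
class GroupCountSketch {

    // Roughly: by_action = GROUP log_events BY action;
    //          counts = FOREACH by_action GENERATE group, COUNT(log_events);
    static Map<String, Long> countByAction(List<String> actions) {
        // A TreeMap keeps the keys sorted so the printed output is predictable.
        return actions.stream().collect(
            Collectors.groupingBy(Function.identity(), TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        List<String> actions = Arrays.asList("GET", "POST", "GET");
        System.out.println(countByAction(actions)); // prints {GET=2, POST=1}
    }
}
```

Each key in the result plays the role of `group` in the pig script, and each value is the count for that group.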

Okay so that might look a little weird, but if you read it, it makes perfect sense. Let's cover a few things before we get to the UDF fun.

`FOREACH` and `GENERATE`

In pig, the FOREACH and GENERATE combination does sort of what it says. It's essentially the map function (and you should be familiar with map functions from the previous posts). For every thing in the bag (a bag is a pig datatype), generate something. In this case, we are telling pig to use our custom class to take the line, and generate some stuff (a tuple, actually).
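To make the map comparison concrete, here is a small plain-Java sketch (Java 8 or later assumed) of the same idea: apply a function to every record in a collection and collect the results. It is only an analogy. `ForeachSketch` and `foreachGenerate` are hypothetical names, and the lambda is a stand-in for the real parser.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

// Plain-Java analogy only; pig does not expose anything called foreachGenerate.
class ForeachSketch {

    // Applies fn to every record in the "bag", like FOREACH bag GENERATE fn(record).
    static <A, B> List<B> foreachGenerate(List<A> bag, Function<A, B> fn) {
        return bag.stream().map(fn).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> logs = Arrays.asList("GET /index.html", "POST /login");
        // Stand-in for Parser(line): keep only the first word of each line.
        List<String> actions = foreachGenerate(logs, line -> line.split(" ")[0]);
        System.out.println(actions); // prints [GET, POST]
    }
}
```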

Tuples and Schemas

Tuples are ordered groups of things, and in pig, the fields can be named. You see tuples in Haskell, lisp (I think), and other programming languages. In the script, the logs variable represents a bunch of tuples, where each tuple is a single item, and that single item is named line. We got this because when we said:

	logs = LOAD 'apache.log.bz2' USING TextLoader AS (line: chararray);

It's telling pig:

Load the file and treat it as a text file, splitting on newlines, and give me a bunch of tuples, where each tuple has a single item that is a chararray named line.

You could load the file the same way, omitting the AS (line: chararray) part, but then the resulting tuples would have no schema.

The schema is essentially type information about the tuple. You can have a tuple without a schema, but it's much more useful to have one, since you can refer to fields by name instead of by field number (like indexing an array).

User Defined Functions

A User Defined Function is exactly that; it's something you write that pig loads and uses. In this case, we are writing a Java class (at the time of writing, Java is the only language you can use for this). For this example, we are going to write a function to parse a line in a log file and return a tuple so pig can then work its magic with the tuples. So normally this takes 30 minutes to bake, but I've got one already in the oven!

	// The package must match the class name used in the DEFINE statement
	package com.darkhax.blog.pig;

	import java.io.IOException;

	import org.apache.pig.EvalFunc;
	import org.apache.pig.data.DataType;
	import org.apache.pig.data.Tuple;
	import org.apache.pig.data.TupleFactory;
	import org.apache.pig.impl.logicalLayer.schema.Schema;

	// Inherit from EvalFunc<Tuple> to implement an EvalFunc that returns a Tuple
	public class LogParser extends EvalFunc<Tuple> {

	    // The main method in question. Gets run for every 'thing' that gets sent to
	    // this UDF
	    public Tuple exec(Tuple input) throws IOException {
	        if (null == input || input.size() != 1) {
	            return null;
	        }

	        String line = (String) input.get(0);
	        try {
	            // In Soviet Russia, factory builds you!
	            TupleFactory tf = TupleFactory.getInstance();
	            Tuple t = tf.newTuple();

	            t.append(getHttpMethod());
	            t.append(getIP());
	            t.append(getDate());

	            // The tuple we are returning now has 3 elements, all strings.
	            // In order, they are the HTTP method, the IP address, and the date.

	            return t;
	        } catch (Exception e) {
	            // Any problems? Just return null and this one doesn't get
	            // 'generated' by pig
	            return null;
	        }
	    }

	    public Schema outputSchema(Schema input) {
	        try {
	            Schema s = new Schema();

	            s.add(new Schema.FieldSchema("action", DataType.CHARARRAY));
	            s.add(new Schema.FieldSchema("ip", DataType.CHARARRAY));
	            s.add(new Schema.FieldSchema("date", DataType.CHARARRAY));

	            return s;
	        } catch (Exception e) {
	            // Any problems? Just return null...there probably won't be any
	            // problems though.
	            return null;
	        }
	    }

	    // The helpers below are stubs; a real parser would pull these values
	    // out of the log line.
	    public String getHttpMethod() {
	        return "";
	    }

	    public String getIP() {
	        return "";
	    }

	    public String getDate() {
	        return "";
	    }
	}


Play by play

Okay, first of all remember to add to your classpath the pig jar file. It's the pig-VERSION-core.jar in the pig directory. Add it in Eclipse, or whatever, so when you compile it has access to everything.

Inherit your class from EvalFunc<Tuple> since that's exactly what we are making: an EvalFunc (as opposed to a filter function or something else) that returns a tuple.

The exec method is your main method that has to return the proper type (tuple in our case) and takes a tuple. We check to ensure the input tuple is nice, in that it exists and has only one item (the line of text). We can then get the first item and cast it to a String so we can work with it.

We use a try/catch block to handle errors and make sure we just return null if there are any problems. If you return null, everything in the tuple is null so you can filter that out using standard pig stuff.

We use the TupleFactory singleton to get a tuple, append our values in the order we want them to appear (in this case, just the HTTP method, IP address, and date), and return it. Yay!
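The helper methods in the class above are stubs that return empty strings. As a rough idea of what the real parsing could look like, here is a standalone sketch that pulls the three values out of a line with a regular expression. It assumes the log uses Apache's Common Log Format, which the post does not actually state. The class name `LogLineSketch` and the `parse` method are hypothetical, and it deliberately avoids any pig classes so it runs on its own.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Standalone sketch; not part of the LogParser class above.
class LogLineSketch {

    // Assumed layout (Common Log Format):
    // ip ident user [date] "METHOD path protocol" status size
    private static final Pattern CLF = Pattern.compile(
        "^(\\S+) \\S+ \\S+ \\[([^\\]]+)\\] \"(\\S+) [^\"]*\" \\d{3} \\S+.*$");

    // Returns {method, ip, date} in the same order the UDF appends them,
    // or null when the line does not match the assumed format.
    static String[] parse(String line) {
        if (line == null) {
            return null;
        }
        Matcher m = CLF.matcher(line);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(3), m.group(1), m.group(2) };
    }

    public static void main(String[] args) {
        String line = "127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "
            + "\"GET /apache_pb.gif HTTP/1.0\" 200 2326";
        String[] fields = parse(line);
        // prints GET | 127.0.0.1 | 10/Oct/2000:13:55:36 -0700
        System.out.println(String.join(" | ", fields));
    }
}
```

To use something like this inside the UDF, the helpers would need to receive the line (or the matched groups) as an argument, and a null result would map onto the existing "return null" path in exec.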

I can haz schema?

Yes you can. You could write the schema in the pig script.

	log_events = FOREACH logs GENERATE FLATTEN(Parser(line)) AS (action: chararray, ip: chararray, date: chararray);

This does have benefits, the main one being the schema is right there and you can see it. This makes writing the rest of the script a little easier, since you don't have to remember exactly what's in the tuples. The downside is you have to change code in two separate spots.

We decided to put the schema in the java class, so you can do some more programmatic things with it, and when you have to change it, it's right there next to the exec method you are also changing.

Building a schema is a little epic in Java (verbose much?) but it's not terrible. We create a new schema and add the name and type of each field, in the same order we appended the values in the exec method.

HTTP method: String/chararray
IP address: String/chararray
Date: String/chararray

You add to the new schema a Schema.FieldSchema object where you specify the name (what you want to reference the field as in pig) and the type (a byte, but just use the constants defined on the DataType class, like DataType.CHARARRAY).

Now, if you DESCRIBE log_events; in the pig shell, it will tell you the schema. You can also now use named indexes into the tuple, as with GROUP log_events BY action to make your code more readable.