Verbose Logging

software development with some really amazing hair

Super MongoDB MapReduce Max Out!

Posted in Programming

I've been playing with MongoDB lately, and I must say, it's the shit. In case you haven't heard of MongoDB, let's drop some buzzwords:

  • Document oriented
  • Dynamic queries
  • Index support
  • Replication support
  • Query profiling
  • MapReduce
  • Auto sharding

There are some more things, so check out their website for the full meal deal. I'm going to talk about the MapReduce part of things.

MapReduce

The idea behind MapReduce has been around for a while, since the Lisp days. Here's the basic idea:

  • Gather a list of items (list 1).
  • Apply the map function to each item in list 1, generating a new list (list 2).
  • Apply the reduce function to the resultant list (list 2) as a whole.
  • Return the value returned by reduce.
  • Profit!
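The steps above can be sketched in a few lines of plain JavaScript, no database involved. The numbers and the square and add functions are made up for illustration; any map and reduce functions would do.

```javascript
// List 1: the items we start with (made-up numbers).
const list1 = [1, 2, 3, 4];

// Map: apply a function to each item, producing list 2.
const square = (n) => n * n;
const list2 = list1.map(square);

// Reduce: collapse list 2 as a whole into a single value.
const add = (total, n) => total + n;
const result = list2.reduce(add, 0);

console.log(list2);  // [ 1, 4, 9, 16 ]
console.log(result); // 30
```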

In the MongoDB world, you run the mapReduce command, and it takes a few arguments:

  • mapFunction
    • A function that runs once per document (say, { "value": 1 }, available as this inside the function) and emits zero or more values. Each emitted value can be a new document or a single value (like a number).
    • The emit function takes a key, and a value.
  • reduceFunction
    • A function that takes a key and the list of values emitted for that key by the map function, and produces a single value.
  • optional options
    • query
      • A MongoDB style query. Like any database query, this selects which documents you are going to apply your map function to.
    • out collection
      • The name of a collection to output into.
    • finalize function
      • A function to further apply to the reduced value.

Here's an example from the mongo shell.

So at the bottom there, you can see the result is 60.
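If you don't have a database handy, the same shape can be imitated in plain JavaScript. This is only a sketch: the documents, the threshold in the if statement, and the in-memory mapReduce helper are all assumptions made up for illustration, not MongoDB's implementation. In the real shell, emit is a global and you call mapReduce on a collection; here emit is passed in as an argument so the snippet stands alone.

```javascript
// Hypothetical stand-in for a collection; the documents are made up.
const items = [{ value: 5 }, { value: 10 }, { value: 20 }, { value: 30 }];

// Tiny in-memory imitation of mapReduce, NOT the MongoDB implementation.
function mapReduce(docs, mapFunction, reduceFunction) {
  const groups = new Map(); // key -> list of emitted values
  const emit = (key, value) => {
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(value);
  };
  // Like MongoDB, expose the current document as `this` inside map.
  for (const doc of docs) mapFunction.call(doc, emit);
  const results = [];
  for (const [key, values] of groups) {
    results.push({ _id: key, value: reduceFunction(key, values) });
  }
  return results;
}

// Map: only emit documents that pass the check (threshold is an assumption).
function map(emit) {
  if (this.value >= 10) emit("sum", this.value);
}

// Reduce: add up every value emitted for the key.
function reduce(key, values) {
  return values.reduce((total, v) => total + v, 0);
}

const results = mapReduce(items, map, reduce);
console.log(results); // [ { _id: 'sum', value: 60 } ]
```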

We could rewrite this to move the if statement out of the map function and into a query. Then the map function runs over fewer items, and it no longer has to do the check itself.

It returns the same result as above.
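Here's a rough plain JavaScript imitation of the query version. Again, the documents, the threshold, the matches helper (which only understands $gte), and the in-memory mapReduce are assumptions for illustration, not MongoDB's code. The mapCalls counter is there just to show that map sees fewer documents.

```javascript
// Hypothetical stand-in for a collection; the documents are made up.
const items = [{ value: 5 }, { value: 10 }, { value: 20 }, { value: 30 }];

// Minimal matcher: understands only { field: { $gte: number } }.
function matches(doc, query) {
  return Object.entries(query).every(
    ([field, cond]) => doc[field] >= cond.$gte
  );
}

// Tiny in-memory imitation of mapReduce with a query option.
function mapReduce(docs, mapFunction, reduceFunction, options = {}) {
  // The query picks the documents before map ever runs.
  const selected = options.query
    ? docs.filter((doc) => matches(doc, options.query))
    : docs;
  const groups = new Map(); // key -> list of emitted values
  const emit = (key, value) => {
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(value);
  };
  for (const doc of selected) mapFunction.call(doc, emit);
  const results = [];
  for (const [key, values] of groups) {
    results.push({ _id: key, value: reduceFunction(key, values) });
  }
  return results;
}

let mapCalls = 0;
function map(emit) {
  mapCalls += 1;           // count how many documents map actually sees
  emit("sum", this.value); // no if statement needed any more
}

function reduce(key, values) {
  return values.reduce((total, v) => total + v, 0);
}

const results = mapReduce(items, map, reduce, {
  query: { value: { $gte: 10 } },
});
console.log(results);  // [ { _id: 'sum', value: 60 } ]
console.log(mapCalls); // 3 (the document with value 5 was never mapped)
```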

With me so far? MapReduce is interesting if you've never seen it before or never done any functional programming, but once you get it, you understand its power.

Caveats

In the MongoDB environment, it's incredibly important that your reduce function is idempotent. Stealing their example straight from the MongoDB website, it means:

for all k,vals : reduce( k, [reduce(k,vals)] ) == reduce(k,vals)

This is because the reduce function might be executed several times, with results from earlier stages fed back in. Since MapReduce can be done across multiple servers, each server runs map and then reduce on its own data. The master server then has to reduce those results further: it takes the return values from all of those reduce calls, puts them into a list, and passes that list to the reduce function again.

Basically, make sure the structure of what you return from reduce is the same as the structure of whatever you emit in the map function. If you emit an integer, reduce should return an integer as well. In the sum example, it's really straightforward in that we just add stuff up. In other situations it can get more complicated.
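To make the caveat concrete, here's a small plain JavaScript sketch of counting documents, where map emits 1 per document. The two reduce functions, the twoStage helper, and the split across two "servers" are made up for illustration; the point is only what happens when reduce gets fed its own output.

```javascript
// Broken: returns how many values it was handed, so re-reducing
// partial results counts the partials instead of adding them.
function badReduce(key, values) {
  return values.length;
}

// Safe: the return value has the same shape and meaning as an emitted
// value, so it can be fed back into reduce.
function goodReduce(key, values) {
  return values.reduce((total, v) => total + v, 0);
}

const emitted = [1, 1, 1, 1, 1]; // five documents, each emitting 1

// Simulate two servers reducing their own share, then a final re-reduce.
function twoStage(reduceFn) {
  const partA = reduceFn("count", emitted.slice(0, 3));
  const partB = reduceFn("count", emitted.slice(3));
  return reduceFn("count", [partA, partB]);
}

console.log(twoStage(badReduce));  // 2 (wrong, there are 5 documents)
console.log(twoStage(goodReduce)); // 5

// The property from the text: reduce(k, [reduce(k, vals)]) == reduce(k, vals)
const once = goodReduce("count", emitted);
console.log(goodReduce("count", [once]) === once); // true
```

The fix is not cleverness in reduce, it's emitting something that reduce can also return: a 1 per document, summed, instead of a length.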

Next, I'll talk about getting MapReduce to do, you know, useful things. Stay tuned!