Here at Screen6 we have been using Cascalog to write our Hadoop jobs in Clojure. Even though we could have gone with plain Clojure, Cascalog lets us write MapReduce jobs quickly and concisely. Another big bonus of using Cascalog is the ease of testing. This blog post covers a simple use case for Cascalog and shows a few simple test cases that we can write against Cascalog queries.
Testing: the Hadoop way
Before we dive into testing Cascalog jobs, a short detour into testing approaches in Hadoop is in order.
The accepted way of testing MapReduce jobs in Hadoop is to combine unit tests with occasional runs on a local cluster. There are multiple issues with this approach. First of all, you double the work required to mark the build as “locally tested”. Second, it’s not easy to debug failures in a Hadoop cluster. Sure, you can go through stack traces, but getting your debugger to work in such an environment is quite cumbersome.
There is an alternative to this: Apache MRUnit. It’s a superior approach to running a local cluster, but it still suffers from being extremely low-level when compared to Midje-Cascalog or even Cascading’s built-in testing utilities. The usability of MRUnit is also undermined by the fact that most Hadoop jobs are rarely split into composable parts and often repeat a lot of code across similar tasks.
Overall, it feels like the Hadoop community often treats testing as a second-class citizen, and this attitude truly hurts the adoption of good testing practices.
A common use case for writing MapReduce jobs is to parse billions of log entries to produce a sensible view of data access patterns, load distribution, and so on. It’s also a task we have to solve at Screen6. In fact, it’s such a common problem that a large number of MapReduce frameworks have been created in the past years. The most prominent of them is Hadoop, which grew out of the Nutch project in 2005, was developed further at Yahoo, and has since been gradually adopted as the go-to framework for data analysis.
However, it is worth noting that despite being the de facto data science tool, Hadoop was never considered “easy” to develop for. It had and still has a steep learning curve, which led to the creation of frameworks that try to provide a higher level of abstraction when working with huge amounts of data. One of the most popular of these frameworks is Cascading, which provides a more declarative way of describing your data processing pipeline. Cascalog stands even higher on the abstraction ladder, implementing Datalog, a truly declarative language, on top of the existing Cascading library. Unlike many other libraries and frameworks that raise the abstraction level while dropping the ability to go one level deeper, Cascalog lets you fall back to writing Cascading when required. In our experience that’s rarely needed, and in fact our code doesn’t contain any such “abstraction breaking”.
MapReduce with Cascalog
Before we can start with testing, we’ll lay out a problem that we can solve with Cascalog. A common task for us at Screen6 is extracting useful information out of log files. It’s neither a unique technical challenge nor inherently exciting, but it’s such a common thing to do that it’s always nice to have a way to minimize the amount of code you need to write to accomplish it.
Since our data pipeline is quite convoluted and just explaining it would take up the space of a few blog posts, we’ll devise a simpler example. Let’s say you’ve created a map service that has a URI scheme similar to OpenStreetMap’s.
Let’s say your tile server has Nginx serving the tiles and is running with a slightly customized access log format. Here is a made-up example of a line in this log:
2013-08-01T07:51:23+0100 22.214.171.124 GET a.tiles.example.com /16/1213/721.png 200 1013 4512 "http://.example.com/mapview" "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_4; en-US) AppleWebKit/534.3 (KHTML, like Gecko) Chrome/6.0.472.25 Safari/534.3"
So, we have the following information in our log files: timestamp of the request, client’s IP address, HTTP method used, host name, URI, status code, response time in microseconds, size of the response in bytes, referer and user agent. Let’s write our test for a line like this one, with the user agent shortened for readability. For now we just want to use plain Clojure. We will also be using the Midje library, which integrates with Cascalog easily. We can just jump into a Clojure REPL and play around with writing our test:
(def line "2013-08-01T07:51:23+0100 126.96.36.199 GET a.tiles.example.com /16/1213/721.png 200 1013 4512 \"http://map.example.com\" \"Mozilla/5.0\"")
(fact
  (parse-line line) => [1375339883 "126.96.36.199" "GET"
                        "a.tiles.example.com" "/16/1213/721.png" 200 1013
                        4512 "http://map.example.com" "Mozilla/5.0"])
This is plain Midje and plain Clojure; we don’t actually do anything with Cascalog yet. However, this example will later evolve into a Midje-Cascalog-based test.
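It may help to see where the expected timestamp 1375339883 comes from. The following is only a sanity-check sketch, assuming a JVM with java.time available; log-timestamp->epoch is a hypothetical helper name and is not part of the code discussed in this post.

```clojure
(import '(java.time OffsetDateTime)
        '(java.time.format DateTimeFormatter))

;; Hypothetical helper (not from the original code): parse a log timestamp
;; such as "2013-08-01T07:51:23+0100" and return Unix epoch seconds.
(defn log-timestamp->epoch [s]
  (let [fmt (DateTimeFormatter/ofPattern "yyyy-MM-dd'T'HH:mm:ssZ")]
    (.toEpochSecond (OffsetDateTime/parse s fmt))))

;; 07:51:23 at +01:00 is 06:51:23 UTC on 2013-08-01.
(log-timestamp->epoch "2013-08-01T07:51:23+0100") ; => 1375339883
```

The offset matters: the value is expressed in UTC, so the request lands in hour 6 of the day, not hour 7.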
Here’s a short excerpt of the code that would be responsible for parsing the line in a way that satisfies the test:
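(ns example.logs ; placeholder namespace name
  (:require [clojure.data.csv :as csv]
            [clj-time.format :as time.format]
            [clj-time.core :as time.core]
            [clj-time.coerce :as time.coerce]))

(defn set-timestamp [entry]
  (let [time-format (time.format/formatter "yyyy-MM-dd'T'HH:mm:ssZ")]
    (->> (entry :timestamp)
         (time.format/parse time-format)           ; assumed step: string -> DateTime
         (#(quot (time.coerce/to-long %) 1000))    ; assumed step: DateTime -> Unix seconds
         (assoc entry :timestamp))))

(defn parse-line [line]
  (let [fields [:timestamp :ip :http-method :host :uri :response-code
                :response-time :response-size :referer :user-agent]
        entry (first (csv/read-csv line :separator \space))
        to-long (fn [m k] (update-in m [k] #(Long/parseLong %)))]
    (-> (zipmap fields entry)
        set-timestamp
        ;; Assumed completion: the excerpt breaks off after zipmap in the source.
        (to-long :response-code)
        (to-long :response-time)
        (to-long :response-size)
        (mapv fields))))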
This seems simple enough: we treat each line as a map, run a few transformations on this map, and then return a vector of fields in the correct order. Now we can actually write the test against Cascalog code:
(ns example.logs-test ; placeholder namespace name
  (:require [midje.sweet :refer :all]
            [midje.cascalog :refer [produces]]))
(fact
  (parse-logs [[line]]) => (produces [[1375339883 "126.96.36.199" "GET"
                                       "a.tiles.example.com" 16 1213 721 200 1013
                                       4512 "http://map.example.com" "Mozilla/5.0"]]))
As we can see, not much has changed. In fact, we had to change very little to accommodate Cascalog: we pass a vector of vectors to the query (parse-logs), the right-hand side of the fact is wrapped in a produces call, and the response is also a vector of vectors. We can also see that the expected output changed a bit: instead of getting the URI /16/1213/721.png we are actually getting three integers, 16, 1213 and 721. Let’s take a look at the corresponding Cascalog query:
(defn parse-logs [source]
  (<- [?timestamp ?ip ?http-method ?host ?zoom ?x ?y ?response-code
       ?response-time ?response-size ?referer ?user-agent]
      (source ?line)
      (parse-line ?line :> ?timestamp ?ip ?http-method ?host ?uri
                           ?response-code ?response-time ?response-size
                           ?referer ?user-agent)
      (parse-uri ?uri :> ?zoom ?x ?y)))
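The query relies on parse-uri, which is not shown in this post. A minimal sketch of what such a function could look like, assuming tile URIs always have the form /zoom/x/y.png; the regular expression and the nil return value for non-matching input are assumptions, not the original implementation.

```clojure
;; Hypothetical sketch of parse-uri; the real implementation is not shown.
;; Assumes tile URIs of the form /<zoom>/<x>/<y>.png.
(defn parse-uri [uri]
  (when-let [[_ zoom x y] (re-matches #"/(\d+)/(\d+)/(\d+)\.png" uri)]
    ;; re-matches returns strings, so convert each group to a long
    (mapv #(Long/parseLong %) [zoom x y])))

(parse-uri "/16/1213/721.png") ; => [16 1213 721]
```

Anything that does not look like a tile URI yields nil here; how the query should treat such lines is left open in this sketch.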
Simple query composition
Of course, in the real world you’d have some checks interspersed, but for the purpose of this blog post we’re omitting those. This query can be our building block for future queries. For example, if we want to know during which hours our servers get the most requests, we could write the following query:
(defn most-popular-hours [source]
  (<- [?hour-of-day ?hits]
      ((select-fields source ["?timestamp"]) ?timestamp)
      (get-hour-of-day ?timestamp :> ?hour-of-day)
      (cascalog.ops/count :> ?hits)))
And now our test “fact” becomes quite simply this:
(fact
  (most-popular-hours (parse-logs [[line]])) => (produces [[6 1]]))
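The expected tuple [6 1] means one hit during hour 6 of the day, which matches the sample timestamp when hours are counted in UTC. The get-hour-of-day function is not shown in this post; here is a minimal sketch, assuming the timestamp is in Unix epoch seconds and that UTC hours are what we want.

```clojure
;; Hypothetical sketch of get-hour-of-day; assumes Unix epoch seconds
;; as input and counts hours in UTC.
(defn get-hour-of-day [timestamp]
  ;; seconds elapsed since midnight UTC, divided into whole hours
  (quot (mod timestamp 86400) 3600))

(get-hour-of-day 1375339883) ; => 6
```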
As can be seen from these examples, writing tests with Midje-Cascalog is quite an enjoyable experience.
The test cases we have shown so far are quite short. In practice, it makes little sense to run your tests against only a handful of log entries. More often you’ll extract a few thousand lines of relevant data, make sure they include a few broken entries, and test against that. Writing Cascalog tests doesn’t become harder even then: the only extra part you need to take care of is loading the resources, while your tests retain their structure.
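Loading such a fixture can stay small. The sketch below assumes the sample lines live in something clojure.java.io/reader can open (a file path, a URL, a Reader); lines->tuples is a hypothetical helper name and not part of Midje-Cascalog.

```clojure
(require '[clojure.java.io :as io])

;; Hypothetical helper: read every line from `readable` and wrap each one
;; in a 1-tuple, the same shape as the [[line]] input used in the facts.
(defn lines->tuples [readable]
  (with-open [rdr (io/reader readable)]
    ;; mapv is eager, so all lines are read before the reader is closed
    (mapv vector (line-seq rdr))))

;; A StringReader stands in for a real fixture file here.
(lines->tuples (java.io.StringReader. "first line\nsecond line"))
;; => [["first line"] ["second line"]]
```

The resulting vector can be passed to a query in place of [[line]], so the facts themselves keep the same structure.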
Midje-Cascalog is a fantastic tool and is certainly a great selling point when considering Cascalog as an option for handling your MapReduce tasks. And since it’s still Clojure, it’s extremely easy to compose your queries and your facts. In our experience, using Cascalog allows us to spend more time on tackling the problem at hand and trying different approaches than writing pure MapReduce jobs with Hadoop does. And since Cascalog still uses Hadoop to actually execute the jobs, we get the best of both worlds: the interactivity and ease of development of Clojure and the stability and scalability of the Hadoop platform.
Here is a wonderful introduction into the world of Cascalog testing. If you’d like to familiarize yourself with Cascalog a bit, I suggest you go through the "Cascalog for the Impatient" tutorial.