Democratized Distributed Computing
Last week, we discussed how to spawn instances of Continuum’s Anaconda AMI for Amazon’s EC2 with StarCluster. This enables interested parties — businesses, individuals, and academics alike — to harness distributed computing with very little overhead. This week, I wanted to build on Davin Potts’s Disco Tutorial. (I highly recommend reading it!) In it, he provides a great example of using Disco’s derived-class method of running Disco jobs. Davin also suggests that the reader explore chaining MapReduce jobs. We will do just that!
A core part of any MapReduce framework is allowing the user to feed the results of one MapReduce step back into another MapReduce step, and again, and again, and again… This feedback mechanism can be helpful when jobs become more complex and need to be broken down into smaller subtasks. Additionally, creating smaller chained jobs naturally pushes programmers to write code which is cleaner, more composable, and more maintainable.
It can be difficult at first, but getting used to the kinds of workflow above — where Job 1 feeds into Job 2, which feeds into Job 3 — is where we can really see the benefits of MapReduce. To motivate chaining jobs, let’s go back to the Counting Words post.
It’s well documented that Ernest Hemingway and William Faulkner were great rivals. This rivalry extended to quite a few literary analyses of one another’s work. Hemingway and Faulkner, both in personae and in writing style, are famously antithetical:
Ernest Hemingway argued that writing should
strip language clean, to lay it bare down to the bone.
And Faulkner, in this literary feud, is known for his loquacious style — writing that tests the upper limit of density of thought in a single sentence. Faulkner was asked in an interview
Interviewer: Some people say they can’t understand your writing, even after they read it two or three times. What approach would you suggest for them?

Faulkner: Read it four times.
So, we know that these famous authors had very different goals in their writing, and that the style of their language and sentence structure differed accordingly. Can we then categorize Hemingway and Faulkner based on the parts of speech (POS) they most frequently use?
To help answer this question we will use the Natural Language Toolkit (NLTK) in our MapReduce Jobs. NLTK is great for easy natural language processing in Python — good user support, lots of features, and high activity on the mailing list.
We are going to use the same Map and Reduce functions from the Counting Words post, with a few modifications. In this example, I will compare Hemingway’s A Farewell to Arms with Faulkner’s Collected Works.
This is largely the same as before, but this time we create a class WordCount and inherit from the Disco class Job. Within the class, we define the same Map and Reduce functions and we also define the number of partitions used, as well as the data input.
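Since the code listing isn’t reproduced here, a framework-free sketch may help. In the real version, the two functions below would be staticmethods on a class inheriting from Disco’s Job (with partitions and the input set as class attributes); the function names and sample lines are mine, purely for illustration:

```python
# Framework-free sketch of the WordCount map and reduce steps.
# In Disco these would be staticmethods on a subclass of disco.core.Job.

def wordcount_map(line, params=None):
    # Emit (word, 1) for every whitespace-separated token.
    for word in line.lower().split():
        yield word, 1

def wordcount_reduce(pairs, params=None):
    # Sum the counts for each word. Disco's reduce typically groups a
    # sorted stream with disco.util.kvgroup; a dict imitates that here.
    totals = {}
    for word, count in pairs:
        totals[word] = totals.get(word, 0) + count
    for item in sorted(totals.items()):
        yield item

lines = ["the rain fell", "the leaves fell"]          # illustrative input
pairs = [kv for line in lines for kv in wordcount_map(line)]
counts = dict(wordcount_reduce(pairs))
# counts == {'fell': 2, 'leaves': 1, 'rain': 1, 'the': 2}
```

On a cluster, the grouping and sorting between the two steps is Disco’s job, not ours; the dict above only stands in for that machinery.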
From Job 1, we have key-value pairs of each unique word along with its total count: [(‘he’, 152009), (‘flower’, 203), …]. What we are looking for here is the distribution of the parts of speech (POS). NLTK has a built-in POS tagger, which by default is trained on the Penn Treebank tag set. (NLTK also provides other taggers; see the documentation for more details.)
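To see how Job 2’s map re-keys Job 1’s output by part of speech, here is a sketch that swaps NLTK’s pos_tag for a tiny hand-rolled lookup, so it runs without NLTK or its trained model. TOY_TAGS, toy_pos_tag, and pos_map are all hypothetical stand-ins; only the tag names come from the Penn Treebank set:

```python
# Tiny stand-in for nltk.pos_tag, so this sketch needs no NLTK install
# or model download. The tags follow the Penn Treebank tag set.
TOY_TAGS = {"he": "PRP", "flower": "NN", "wrote": "VBD", "the": "DT"}

def toy_pos_tag(words):
    # nltk.pos_tag returns a list of (word, tag) pairs; so do we,
    # defaulting unknown words to NN.
    return [(w, TOY_TAGS.get(w, "NN")) for w in words]

def pos_map(word_count, params=None):
    # Job 2's map: take a (word, total) pair from Job 1 and re-key it
    # by the word's part of speech.
    word, total = word_count
    (_, tag), = toy_pos_tag([word])
    yield tag, total

job1_output = [("he", 152009), ("flower", 203), ("wrote", 180)]
pos_pairs = [kv for pair in job1_output for kv in pos_map(pair)]
# pos_pairs == [('PRP', 152009), ('NN', 203), ('VBD', 180)]
```

A reduce identical in shape to Job 1’s then sums the counts per tag, giving the POS distribution.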
Taking the results from Job 1, we write a similar MapReduce class for Job 2. This time, however, we include the code snippet from the NLTK POS-Tagger example:
The main line to focus on is:
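From memory of Disco’s API, it looks roughly like this — treat the import path and class name as assumptions, since the path has moved between Disco versions:

```python
from disco.worker.classic.func import chain_reader

class POSCount(Job):                         # illustrative class name
    map_reader = staticmethod(chain_reader)  # read Job 1's results
    ...
```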
The map_reader we define uses chain_reader, which can read Disco’s internal compressed format. Notice that there is no input defined: this line allows us to feed the results from the reduce step of Job 1 into the map step of Job 2.
To execute our MapReduce job we call the run function. Fairly easy, no?
How does Job 2 get the data from Job 1? We pass the output of Job 1: wordcount as an input parameter to the run function of job 2. And that’s it! We can keep building more and more classes, passing the results of each job into the next.
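The whole pipeline can be imitated in memory. Below, run_job plays the role of Disco’s run/wait cycle, and Job 2 consumes Job 1’s output directly, the way chain_reader lets the real framework do. To keep the sketch dependency-free, Job 2 re-keys by first letter as a stand-in for POS tagging; every name here is illustrative, not Disco’s API:

```python
# Minimal in-memory stand-in for chained Disco jobs.

def run_job(map_fn, reduce_fn, inputs):
    # Map every input, group values by key, then reduce each group.
    pairs = [kv for item in inputs for kv in map_fn(item)]
    grouped = {}
    for key, value in pairs:
        grouped.setdefault(key, []).append(value)
    return sorted(reduce_fn(grouped.items()))

# Job 1: count words in lines of text.
def map1(line):
    for word in line.split():
        yield word, 1

def reduce1(groups):
    for word, ones in groups:
        yield word, sum(ones)

# Job 2: re-key Job 1's (word, total) pairs by first letter
# (a stand-in for POS tagging, so the sketch needs no NLTK).
def map2(word_total):
    word, total = word_total
    yield word[0], total

def reduce2(groups):
    for letter, totals in groups:
        yield letter, sum(totals)

# Job 1's output becomes Job 2's input, just as in Disco.
job1_out = run_job(map1, reduce1, ["the tall tree", "a tree"])
job2_out = run_job(map2, reduce2, job1_out)
# job1_out == [('a', 1), ('tall', 1), ('the', 1), ('tree', 2)]
# job2_out == [('a', 1), ('t', 4)]
```

The handoff on the last two lines is the whole trick: the value returned by the first job is passed, unchanged, as the input of the second.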
Back to Hemingway vs. Faulkner…
Hemingway being the macho man that he is, I expected a huge spike in PRP (personal pronouns) — lots of he, him, himself. But from the graph below we can see that Faulkner uses a fair amount of PRP in his writing as well.
|Tag|Description|Examples|
|---|---|---|
|NN|noun, singular or mass|python, ball, cat|
|PRP|personal pronoun|she, he, I|
|DT|determiner|the, some, both|
|IN|preposition/subordinating conjunction|in, of, with|
|VBD|verb, past tense|held, wrote, went|
|RB|adverb|carefully, about, very|
Sadly, it seems like we can’t use POS to easily fan the flames of this particular literary feud. But what about other writers? Let’s compare Hemingway’s distribution of POS to the distribution of POS in another author…any author… How about Stephenie Meyer, author of the Twilight Series?
Round Two: Hemingway vs. Stephenie Meyer…
What jumps out at me in this graph are the DT (Determiner) and PRP$ (Possessive Pronouns) columns, where the two writers’ POS distributions deviate most drastically. What do we make of these differences?
Let’s take determiners to start with. Determiners are non-adjectival noun-modifiers, ones that contextualize nouns rather than describe them in the abstract: a, the, those, my, some, etc. Here is an example:
Do you have to take me to that regiment? — A Farewell to Arms
Here we can see that Hemingway uses far more determiners than Meyer does. Hemingway, in other words, is more declarative than descriptive. He gestures at things instead of describing them.
The above distribution of words within DT shows that Hemingway is quite similar to Meyer for the most part. The outlier here is the word the. The definite article the is used to identify a particular member of a group. It being the most common word in the English language, it’s perhaps unsurprising that Hemingway’s famously direct and simple writing style would make ample use of it.
Possessive Pronouns are pronouns which demonstrate ownership. Here’s an example from Meyer:
Charlie wasn’t comfortable with expressing his emotions out loud — Twilight
Meyer, evidently, is much more fond of possessive pronouns than Hemingway — she uses almost twice as many as he does. Again, this surprised me somewhat, since I tend to think of Hemingway as a possessive writer.
Now this is where it gets especially interesting! Looking at the distribution of PRP$ words we immediately notice polarizing differences. Between Meyer and Hemingway, the words which demonstrate the greatest difference are my, her, and his — and, much to my surprise, the prize for the most his-s does not go to the most masculine author in history. Rather, Stephenie Meyer’s teen vampire fiction uses almost twice as many. Furthermore, she also uses almost twice as many my-s compared to Hemingway. Lastly, it’s interesting to note that while Meyer out his-s Hemingway, Hemingway uses the word her almost twice as often as Meyer! In fact, Meyer’s use of her occurs only half as often as her use of his. This is rather surprising given that the main character of Twilight is a woman. Bechdel test, anyone?
The Twilight quote also reminds me of the fact that, according to our data, Meyer’s gerund usage (VBG) is significantly higher than Hemingway’s: she likes to use -ing verbs as nouns (in the sentence above, “expressing” is used as a noun). Why might this be? Any thoughts?
Return to Chaining
You may have noticed that we didn’t really need to chain our jobs to produce a distribution of parts of speech. We could have mapped each word’s part of speech instead of the word itself. However, separating word counting and POS tagging into two different jobs made generating a distribution of words within a particular part of speech trivially easy. In the future, we will come across MapReduce jobs which can’t be executed unless we chain several MapReduce steps together.