
Literary Chains

Democratized Distributed Computing

Last week, we discussed how to spawn instances of Continuum’s Anaconda AMI on Amazon’s EC2 with StarCluster. This enables interested parties — businesses, individuals, and academics alike — to harness distributed computing with very little overhead. This week, I wanted to build on Davin Potts’s Disco Tutorial. (I highly recommend reading it!) In it, he provides a great example of using Disco’s derived-class method of running Disco jobs. Davin also suggests that the reader explore chaining MapReduce jobs. We will do just that!

Disco Chains

A core part of any MapReduce framework is allowing the user to feed the results of one MapReduce step back into another, and again, and again, and again… This feedback mechanism can be helpful when jobs become more complex and need to be broken down into smaller subtasks. Additionally, creating smaller chained jobs naturally pushes programmers to write code which is cleaner, more composable, and more maintainable.

It can be difficult at first, but getting used to the kind of workflow above — where Job 1 feeds into Job 2, which feeds into Job 3 — is where we can really see the benefits of MapReduce. To motivate chaining jobs, let’s go back to the Counting Words post.

Rivalry, Rivalry!

It’s well documented that Ernest Hemingway and William Faulkner were great rivals. This rivalry extended to quite a few literary analyses of one another’s work. Hemingway and Faulkner, both in personae and in writing style, are famously antithetical:

Ernest Hemingway argued that writing should

strip language clean, to lay it bare down to the bone.

A man can be destroyed but not defeated.

And Faulkner, in this literary feud, is known for his loquacious style — writing that tests the upper limit of density of thought in a single sentence. Faulkner was asked in an interview:

Interviewer: Some people say they can’t understand your writing, even after they read it two or three times. What approach would you suggest for them?

Faulkner: Read it four times.

There was a wisteria vine blooming for the second time that summer on a wooden trellis before one window, into which sparrows came now and then in random gusts, making a dry vivid dusty sound before going away: and opposite Quentin, Miss Coldfield in the eternal black which she had worn for forty-three years now, whether for sister, father, or nothusband none knew, sitting so bolt upright in the straight hard chair that was so tall for her that her legs hung straight and rigid as if she had iron shinbones and ankles, clear of the floor with that air of impotent and static rage like children’s feet, and talking in that grim haggard amazed voice until at last listening would renege and hearing-sense self-confound and the long-dead object of her impotent yet indomitable frustration would appear, as though by outraged recapitulation evoked, quiet inattentive and harmless, out of the biding and dreamy and victorious dust. — Absalom, Absalom!

Questions

So, we know that these famous authors had very different goals in their writing, and that the style of their language and sentence structure differed accordingly. Can we then categorize Hemingway and Faulkner based on the parts of speech (POS) they most frequently use?

To help answer this question we will use the Natural Language Toolkit (NLTK) in our MapReduce jobs. NLTK is great for easy natural language processing in Python — good user support, lots of features, and an active mailing list.
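One hedged setup note: NLTK’s trained tagger models ship separately from the package itself, so a first run may need a download step. A minimal sketch, assuming a recent NLTK release where the default tagger resource is named averaged_perceptron_tagger (older releases used maxent_treebank_pos_tagger):

import nltk

# Fetch the trained model behind nltk.pos_tag; the resource name
# depends on your NLTK version (assumption: a recent release).
nltk.download('averaged_perceptron_tagger')
# nltk.download('maxent_treebank_pos_tagger')  # older NLTK releases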

MapReduce 1

We are going to use the same Map and Reduce functions from the Counting Words post, with a few modifications. In this example, I will compare Hemingway’s A Farewell to Arms with Faulkner’s Collected Works.

from disco.job import Job
from disco.worker.classic.func import chain_reader
from disco.core import result_iterator

class WordCount(Job):
    partitions = 4
    input = ["faulkner.txt"]

    @staticmethod
    def map(line, params):
        # Strip punctuation from each word and emit (word, 1)
        import string
        for word in line.split():
            strippedWord = word.translate(string.maketrans("", ""), string.punctuation)
            yield strippedWord, 1

    @staticmethod
    def reduce(iter, params):
        # Group the sorted (word, 1) pairs by word and sum the counts
        from disco.util import kvgroup
        for word, counts in kvgroup(sorted(iter)):
            yield word, sum(counts)

This is largely the same as before, but this time we create a class WordCount that inherits from Disco’s Job class. Within the class, we define the same map and reduce functions, and we also set the number of partitions used, as well as the input data.

From Job 1, we have key-value pairs of each unique word, along with the total count: [('he', 152009), ('flower', 203), …]. What we are looking for here is the distribution of parts of speech (POS). NLTK has a built-in POS tagger, which by default uses a tagger trained on the Penn Treebank tag set. (NLTK also provides other taggers; see the documentation for more details.)

>>> import nltk
>>> nltk.pos_tag(['he'])
[('he', 'PRP')]
>>> nltk.pos_tag(['apple'])
[('apple', 'NN')]

Taking the results from Job 1, we write a similar MapReduce class for Job 2. This time, however, we include the code snippet from the NLTK POS-Tagger example:

class PosCount(Job):
    # chain_reader lets this job consume the results of a previous job
    map_reader = staticmethod(chain_reader)

    @staticmethod
    def map((word, count), params):
        # Tag each word and emit a (pos, count) pair
        import nltk
        pos_tag = nltk.pos_tag([word])  # [(word, pos)]
        yield pos_tag[0][1], count

    @staticmethod
    def reduce(pos_iter, out, params):
        # Sum the counts for each POS tag and write to the output stream
        from disco.util import kvgroup
        for pos, counts in kvgroup(sorted(pos_iter)):
            out.add(pos, sum(counts))

The main line to focus on is:

map_reader = staticmethod(chain_reader)

The map_reader we define uses chain_reader, which can read Disco’s internal compression format. Notice that there is no input defined: this line allows us to feed the results from the reduce of Job 1 into the map step of Job 2.

Running

To execute our MapReduce job we call the run function. Fairly easy, no?

from MapReduce_CountWords_Chain import WordCount
from MapReduce_CountWords_Chain import PosCount

# Run Job 1, then feed its results into Job 2
wordcount = WordCount().run()
posCount = PosCount().run(input=wordcount.wait())

# Iterate over the final (pos, count) results
for (pos, counts) in result_iterator(posCount.wait(show=False)):
    print pos, counts

How does Job 2 get the data from Job 1? We pass the output of Job 1, wordcount, as the input parameter to the run function of Job 2. And that’s it! We can keep building more and more classes, passing the results of each job into the next.
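As a hedged illustration of extending the chain, here is a sketch with a hypothetical third job, ThirdJob, which would set map_reader = staticmethod(chain_reader) just like PosCount; the class itself is not part of the example code:

# ThirdJob is hypothetical; like PosCount, it would define
# map_reader = staticmethod(chain_reader) to consume the prior job's output
wordcount = WordCount().run()
posCount = PosCount().run(input=wordcount.wait())
thirdJob = ThirdJob().run(input=posCount.wait())

for key, value in result_iterator(thirdJob.wait(show=False)):
    print key, value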

Back to Hemingway vs. Faulkner…

Hemingway being the macho man that he was, I expected a huge spike in PRP (personal pronouns) — lots of he, him, himself. But from the graph below we can see that Faulkner uses a fair amount of PRP in his writing as well.

ID    Definition                               Example
NN    noun, singular or mass                   python, ball, cat
PRP   personal pronoun                         she, he, I
DT    determiner                               the, some, both
IN    preposition/subordinating conjunction    in, of, with
VBD   verb, past tense                         held, wrote, went
RB    adverb                                   carefully, about, very
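For reference, the graphs compare relative POS frequencies rather than raw counts, since the two corpora differ in size. A minimal sketch of how such a distribution might be computed from the chained job’s output; the helper name and the normalization step are my own, not part of the original example:

from collections import defaultdict

def pos_distribution(results):
    # results: iterable of (pos, count) pairs, e.g. from result_iterator
    counts = defaultdict(int)
    for pos, count in results:
        counts[pos] += count
    # Normalize raw counts to relative frequencies so differently sized
    # corpora (A Farewell to Arms vs. the Faulkner text) can be compared
    total = float(sum(counts.values()))
    return dict((pos, count / total) for pos, count in counts.items())

# e.g. hemingway_dist = pos_distribution(result_iterator(posCount.wait()))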


Sadly, it seems we can’t use POS to easily fan the flames of this particular literary feud. But what about other writers? Let’s compare Hemingway’s distribution of POS to that of another author… any author… How about Stephenie Meyer, author of the Twilight series?

Round Two: Hemingway vs. Stephenie Meyer…

What jumps out at me in this graph are the DT (Determiner) and PRP$ (Possessive Pronouns) columns, where the two writers’ POS distributions deviate most drastically. What do we make of these differences?

• Let’s take determiners to start with. Determiners are non-adjectival noun-modifiers, ones that contextualize nouns rather than describe them in the abstract: a, the, those, my, some, etc. Here is an example:

Do you have to take me to that regiment? — A Farewell to Arms

Here we can see that Hemingway uses far more determiners than Meyer does. Hemingway, in other words, is more declarative than descriptive. He gestures at things instead of describing them.

The above distribution of words within DT shows that Hemingway is quite similar to Meyer for the most part. The outlier here is the word the. The definite article the is used to identify a particular member of a group. It being the most common word in the English language, it’s perhaps unsurprising that Hemingway’s famously direct and simple writing style would make ample use of it.

• Possessive pronouns are pronouns which demonstrate ownership. Here’s an example from Meyer:

    Charlie wasn’t comfortable with expressing his emotions out loud — Twilight

    Meyer, evidently, is much more fond of possessive pronouns than was Hemingway — she uses almost twice as many as he does. Again, this surprised me somewhat, since I tend to think of Hemingway as a possessive writer.

    Now this is where it gets especially interesting! Looking at the distribution of PRP$ words we immediately notice polarizing differences. Between Meyer and Hemingway, the words which demonstrate the greatest difference are my, her, and his — and, much to my surprise, the prize for the most his-s does not go to the most masculine author in history. Rather, Stephenie Meyer’s teen vampire fiction uses almost twice as many. Furthermore, she also uses almost twice as many my-s compared to Hemingway. Lastly, it’s interesting to note that while Meyer out his-s Hemingway, Hemingway uses the word her almost twice as often as Meyer! In fact, Meyer’s use of her occurs only half as often as her use of his. This is rather surprising given that the main character of Twilight is a woman. Bechdel test, anyone?

The Twilight quote also reminds me of the fact that, according to our data, Meyer’s gerund usage (VBG) seems to be significantly higher than Hemingway’s: she likes to use -ing verbs as nouns (in the sentence above, “expressing” is used as a noun). Why might this be? Any thoughts?

Return to Chaining

You may have noticed that we didn’t really need to chain our jobs to produce a distribution of parts of speech: we could have mapped each word to its POS directly. However, separating word counting and POS tagging into two different jobs makes generating a distribution of words within a particular part of speech trivially easy, as the sketch below shows. In the future, we will come across MapReduce jobs which can’t be executed unless we chain several MapReduce steps together.
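To make that concrete, here is a hedged sketch of computing the distribution of words within one tag (DT here) from Job 1’s word counts; words_within_pos is an illustrative helper of my own, not part of the post’s example code:

import nltk

def words_within_pos(word_counts, target_pos='DT'):
    # word_counts: iterable of (word, count) pairs from the WordCount job
    dist = {}
    for word, count in word_counts:
        # Tag each word in isolation, as the PosCount job does
        tag = nltk.pos_tag([word])[0][1]
        if tag == target_pos:
            dist[word] = count
    return dist

# e.g. determiner_counts = words_within_pos(result_iterator(wordcount.wait()))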

Download Example Code

Data Sources


About the Author

Ben Zaitlen

Data Scientist

Ben Zaitlen has been with the Anaconda Global Inc. team for over 5 years.

