Sherlock Holmes vs Hercule Poirot

How do literary detectives think? Let’s look at two of the most famous sleuths in literature, Sherlock Holmes and Hercule Poirot. Here’s Holmes’s key edict:

When you have eliminated the impossible, whatever remains, however improbable, must be the truth.

And here is Poirot’s philosophy:

It is the brain, the little gray cells on which one must rely. One must seek the truth within—not without.

With these famous quotes, we can begin to think about “Detective-ness.” Poirot and Holmes both have the ability to solve the most challenging of murder mysteries. How? How does Holmes decide what is impossible vs. improbable? How do Poirot’s “little gray cells” decipher the clues, while for Poirot’s sidekick Arthur Hastings the clues remain a series of unconnected objects and events? What kinds of questions do these detectives ask? And might their questions give us a clue about the inner workings of the mind of the literary detective?

Indeed, word counting may provide a good starting point to identify a “fingerprint” of “Detective-ness.” Again, we return to counting words with Disco’s MapReduce Framework. But this time we get to play with some new free data sets.

Free Data Sets

Project Gutenberg hosts a collection of over 40,000 free ebooks (~20GB). The collection is searchable and browsable by genre, language, author, etc. The books are free to download in a variety of formats, and each is generally a few hundred KB. Today, most personal machines ship with 2-4GB of RAM and can be extended to 8GB; it is also not uncommon for personal machines to have 16GB of RAM. On machines with that much RAM, this particular data set might fit in memory; however, the data set will grow faster than RAM sizes increase, so we will eventually need an out-of-core solution to analyze the data. This data set provides a great training ground for Data Scientists and Explorers, Hackers, Linguists, and those new to the Disco/MapReduce framework. Project Gutenberg is also a wonderful resource, and we encourage everyone to pick up an eBook and read.


Questions and Early Results:

Can we define Holmes and Poirot by looking at the questions in the text? That is, can we classify the person/character based on the types of questions they ask? Before classifying, I want to know how many Whos, Whats, Wheres, Whens, Whys, and Hows are in the text. Additionally, looking at the distribution of modal verbs (verbs which indicate ability, obligation, permission, certainty, probability, possibility) may also give us some insight into these characters. The first pair of tables below covers Conan Doyle’s Holmes texts.

Question Count
what 271
when 266
who 257
where 103
how 100
why 39


Modal Verb Count
could 282
will 263
may 204
can 173
must 170
shall 170
might 121
need 22


What kinds of modal verbs and questions are seen in Agatha Christie’s Poirot?

Question Count
what 171
who 121
when 85
how 59
where 48
why 29


Modal Verb Count
will 223
could 134
must 102
can 84
might 67
may 54
shall 31
need 20

How do we load the data from Project Gutenberg into the Disco cluster?

Input (Streaming)

Disco has a few different input data types. Often data is stored off-site from where the computation is being performed; we can stream data directly to DDFS with a regular HTTP address. DDFS, the Disco Distributed File System, is a storage system designed by the Disco team to help support Big Data projects. Often, a MapReduce Framework will be associated with an internal filesystem designed for Big Data stores. HDFS (Hadoop Distributed File System) is the data store solution for the Hadoop Framework. Later, we will go into more specifics of DDFS and HDFS; for now, we can think of DDFS and HDFS as filesystems which can scale horizontally (adding more and more nodes) without much difficulty.

Again, streaming the data avoids direct interaction with DDFS. We can use one or multiple URLs:

input=[""], #Adventures of Sherlock Holmes Collected Works
input=["", "",], #Study in Scarlet and Sign Of Four



Remember, the map function is given a line of text and produces an iterable of key-value pairs.

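def map(line, params):
    import string
    # Split the line on whitespace and emit (word, 1) for every token.
    for word in line.split():
        # Python 2 str.translate: identity table, plus the characters to delete (punctuation).
        strippedWord = word.translate(string.maketrans("", ""), string.punctuation)
        yield strippedWord, 1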
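
The map function above uses the Python 2 form of str.translate. As a minimal sketch for trying the same logic locally without a Disco cluster, here is a Python 3 version; the name map_words and the sample sentence are illustrative assumptions, not part of the original job.

```python
import string

# Translation table that deletes every ASCII punctuation character.
# (Python 3 form; the Disco job above uses the Python 2 equivalent.)
_STRIP_PUNCT = str.maketrans("", "", string.punctuation)

def map_words(line, params=None):
    """Yield (word, 1) for each whitespace-separated token in the line."""
    for word in line.split():
        yield word.translate(_STRIP_PUNCT), 1

pairs = list(map_words("Elementary, my dear Watson!"))
print(pairs)  # [('Elementary', 1), ('my', 1), ('dear', 1), ('Watson', 1)]
```

Two properties of this logic are worth keeping in mind when reading the counts: case is preserved, so "What" and "what" are tallied as different keys, and a token made up only of punctuation (such as "--") becomes an empty-string key.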



We use Disco’s default partition function, which buckets keys on hash(str(key)) % number_of_partitions, and the same reduce function as in part 1:

def reduce(iter, params):
     from disco.util import kvgroup
     for word, counts in kvgroup(sorted(iter)):
          yield word, sum(counts)



This produces an iterable of (word, counts): [(“at”, 385), (“ate”, 3), (“atmosphere”, 4), (“attach”, 1)…]
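
To see how partitioning and reducing fit together, the following is a small local simulation, not Disco code. It assumes the bucketing rule quoted above and uses itertools.groupby as a stand-in for disco.util.kvgroup; the names partition and reduce_counts and the sample pairs are hypothetical.

```python
from itertools import groupby
from operator import itemgetter

def partition(key, number_of_partitions):
    """Pick a bucket for a key: hash(str(key)) % number_of_partitions."""
    return hash(str(key)) % number_of_partitions

def reduce_counts(pairs):
    """Sum the counts for each word; groupby needs its input sorted by key."""
    for word, group in groupby(sorted(pairs), key=itemgetter(0)):
        yield word, sum(count for _, count in group)

# Pretend output of the map phase.
mapped = [("ate", 1), ("at", 1), ("atmosphere", 1), ("at", 1), ("at", 1)]

# Route every pair to a bucket; equal keys always land in the same bucket.
number_of_partitions = 2
buckets = [[] for _ in range(number_of_partitions)]
for word, count in mapped:
    buckets[partition(word, number_of_partitions)].append((word, count))

# Reduce each bucket independently, then merge the per-bucket results.
totals = {}
for bucket in buckets:
    totals.update(reduce_counts(bucket))

print(sorted(totals.items()))  # [('at', 3), ('ate', 1), ('atmosphere', 1)]
```

Python 3 randomizes string hashes per process, so which bucket a word lands in can change between runs, but the final totals do not: all occurrences of a word share one bucket, so each reducer sees every count for its keys.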


In part 1, the results were printed to the screen. This time let’s save the results to a file so we can do any post-processing necessary.

from disco.core import result_iterator

filePath = '/tmp/' #FILL IN
out_numerical = open(filePath+'Words-SortNumerically.txt', 'w')
out_abc = open(filePath+'Words-SortAlphabetically.txt', 'w')
wordCount = []
# job is the Disco job started earlier. Write each (word, count) pair as it
# arrives and keep a copy so the list can be re-sorted by count afterwards.
for word, count in result_iterator(job.wait(show=True)):
    out_abc.write('%s \t %d\n' % (str(word), int(count)))
    wordCount.append((word, count))
out_abc.close()
# Sorted list from an iterable. The lambda returns the count -- position 1 of each (word, count) tuple.
sortedWordCount = sorted(wordCount, key=lambda pair: pair[1], reverse=True)
for word, count in sortedWordCount:
    out_numerical.write('%s \t %d\n' % (str(word), int(count)))
out_numerical.close()
question_words = ['who', 'where','what', 'when', 'why', 'how']
modal_verbs = ['can', 'could', 'need', 'may', 'might', 'must', 'shall', 'will']
# print(...) with a single argument behaves the same in Python 2 and Python 3.
print([i for i in sortedWordCount if i[0] in question_words])
print([i for i in sortedWordCount if i[0] in modal_verbs])



In the code above, we create two files in ‘/tmp’ and store lists of words sorted alphabetically (Words-SortAlphabetically.txt) and sorted numerically (Words-SortNumerically.txt).
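
For later post-processing, the saved file can be read back into a list of tuples. The sketch below assumes each line holds a word, a tab, and an integer count; the helper read_counts is hypothetical, and the demo writes its own small sample to a temporary directory rather than touching ‘/tmp’.

```python
import os
import tempfile

def read_counts(path):
    # Each line is assumed to look like: word, a tab, then an integer count.
    rows = []
    with open(path) as handle:
        for line in handle:
            word, _, count = line.rpartition("\t")
            rows.append((word.strip(), int(count)))
    return rows

# Write a tiny sample in the assumed format, then load it again.
sample = [("what", 271), ("when", 266), ("who", 257)]
path = os.path.join(tempfile.mkdtemp(), "Words-SortNumerically.txt")
with open(path, "w") as out:
    for word, count in sample:
        out.write("%s \t %d\n" % (word, count))

loaded = read_counts(path)
print(loaded)  # [('what', 271), ('when', 266), ('who', 257)]
```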

Analysis and Thoughts

In both Conan Doyle’s and Christie’s detective novels, ‘What’ is the top ‘WH’ question word. This makes a certain amount of sense. Detectives should be more interested in process than in motivation. To fully make this argument, of course, we would need to look at the distribution of question words over many texts and determine a Control distribution. We leave this as an exercise to the readers. 🙂

As for the modal verbs, ‘could’ and ‘will’ appear almost equally frequently for Conan Doyle, whereas ‘will’ is the top modal verb in Christie’s texts. Typically, ‘will’ suggests a focus on certainty; ‘could,’ on the other hand, suggests a focus on possibilities. Perhaps Holmes, whose style of sleuthery depends on ruling out the impossible, is more concerned about possibility than Poirot and his little gray cells.
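
Because the two sets of texts differ in size, raw counts are hard to compare directly. One possible refinement, offered here only as a sketch, is to convert each author’s modal verb counts into shares of that author’s modal verb total. The numbers are the ones from the tables above; the helper name shares is an assumption.

```python
# Modal verb counts copied from the tables above.
doyle = {"could": 282, "will": 263, "may": 204, "can": 173,
         "must": 170, "shall": 170, "might": 121, "need": 22}
christie = {"will": 223, "could": 134, "must": 102, "can": 84,
            "might": 67, "may": 54, "shall": 31, "need": 20}

def shares(counts):
    """Return each word's fraction of the total count."""
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

doyle_share = shares(doyle)
christie_share = shares(christie)

# Print one row per verb, ordered by Conan Doyle's counts.
print("%-6s %8s %8s" % ("verb", "Doyle", "Christie"))
for word in sorted(doyle, key=doyle.get, reverse=True):
    print("%-6s %7.1f%% %7.1f%%" % (
        word, 100 * doyle_share[word], 100 * christie_share[word]))
```

This only normalizes within the modal verbs themselves; normalizing by the total word count of each corpus would need the full word lists produced by the job.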

In doing this kind of analysis, more questions arise — not just about texts but about the importance of questions in a person’s public speech.

  • Can we classify a person based on their questions?
  • What kinds of questions are asked on Facebook?
  • What kinds of questions are asked on Twitter?
  • What kinds of questions are asked in the Senate?
  • Do the Belgians and the English have superior detectives?
