I came across this image on Facebook:
I was struck by the unusual surnames (“Bok” and “Lexier”), and doubly suspicious about the “x” in the surname. The premise of this image is that the left text was created independently, and then the text on the right was created from it, using awesome anagram skills. This image would be a lot less impressive if the two paragraphs were crafted at the same time, so that they were anagrams of one another. So I wondered: is there a way to figure out if this is real or fake?
A simple way to determine if a corpus of text is really “natural” is to run a letter frequency analysis. Wikipedia has an article about letter frequency, which includes a data table of the relative frequencies of letters in the English language. As the Wikipedia article states, many different studies have been done on this and their tables all differ slightly, but the general structure is the same. So, let’s analyze this image with a little Python. First, we’ll define the strings and check that they really do have the same number of letters, etc.
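As a minimal sketch of that first step: the text from the image isn't reproduced here, so the two strings below are a stand-in (a well-known short anagram pair), and the names `left`, `right`, and `letters_only` are just illustrative choices.

```python
# Stand-in strings: NOT the text from the image, just a classic anagram pair.
left = "slot machines"
right = "cash lost in me"


def letters_only(text):
    """Return the alphabetic characters of `text`, lower-cased."""
    return [ch.lower() for ch in text if ch.isalpha()]


left_letters = letters_only(left)
right_letters = letters_only(right)

# Raw lengths (spaces and punctuation included) may differ...
print(len(left), len(right))
# ...but the letter-only lengths should match for a true anagram.
print(len(left_letters), len(right_letters))
# Sorting both letter lists gives a direct letter-for-letter check.
print(sorted(left_letters) == sorted(right_letters))
```

Swap in the two paragraphs from the image to run the same check on the real thing.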
OK, let’s look for differences. By using Python’s built-in set(), it’s a one-liner to see if we’re missing any letters:
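Something along these lines would do it, again using a stand-in anagram pair rather than the actual text from the image. The symmetric difference (`^`) of the two sets holds every character that appears in one string but not the other.

```python
# Stand-in strings: NOT the text from the image.
left = "slot machines"
right = "cash lost in me"

# Characters present in exactly one of the two strings.
missing = set(left) ^ set(right)
print(missing)
```

An empty set means both strings draw on exactly the same characters. It says nothing yet about how many times each one is used.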
OK, looks good so far: true to their claim, the two texts use the same set of characters (including punctuation and spaces). Now let’s look for any letter counts that differ.
Interesting! The two texts have a different number of spaces. But the claim of the text still holds true, since a space is not a punctuation mark. Now let’s dig into the letter counts. We’ll first list the characters (stripping out punctuation and spaces) in decreasing order of frequency, just for a quick sanity check:
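One way to sketch the count comparison is with `collections.Counter`, which returns 0 for characters it hasn't seen. The strings are still a stand-in pair, not the text from the image, and `differing` is a name I made up for this example.

```python
from collections import Counter

# Stand-in strings: NOT the text from the image.
left = "slot machines"
right = "cash lost in me"

left_counts = Counter(left)
right_counts = Counter(right)

# Map each character whose count differs to its (left, right) counts.
differing = {
    ch: (left_counts[ch], right_counts[ch])
    for ch in sorted(set(left) | set(right))
    if left_counts[ch] != right_counts[ch]
}
print(differing)
```

For this stand-in pair the only entry is the space character, with one space on the left and three on the right.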
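A quick sketch of that ranking, assuming we only care about alphabetic characters and ignore case. `Counter.most_common()` returns (letter, count) pairs sorted from most to least frequent. The string is a stand-in, not the text from the image.

```python
from collections import Counter

# Stand-in string: NOT the text from the image.
text = "cash lost in me"

# Count letters only, dropping spaces and punctuation.
letter_counts = Counter(ch for ch in text.lower() if ch.isalpha())

# (letter, count) pairs, most frequent first.
ranked = letter_counts.most_common()
for letter, count in ranked:
    print(letter, count)
```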
OK, what’s the natural letter frequency in English? Let’s define a dict based on the table in the Wikipedia entry.
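Here is a sketch of such a dict. The percentages are the commonly cited values from the Wikipedia letter frequency table as I have them; the article has been revised over time, so treat them as approximate and check them against the current table. The names `english_freq` and `english_order` are my own.

```python
# Relative frequency of each letter in English text, in percent.
# Commonly cited values from Wikipedia's "Letter frequency" table;
# verify against the current article before relying on them.
english_freq = {
    "a": 8.167, "b": 1.492, "c": 2.782, "d": 4.253, "e": 12.702,
    "f": 2.228, "g": 2.015, "h": 6.094, "i": 6.966, "j": 0.153,
    "k": 0.772, "l": 4.025, "m": 2.406, "n": 6.749, "o": 7.507,
    "p": 1.929, "q": 0.095, "r": 5.987, "s": 6.327, "t": 9.056,
    "u": 2.758, "v": 0.978, "w": 2.360, "x": 0.150, "y": 1.974,
    "z": 0.074,
}

# Letters from most to least common, for eyeballing against the text's ranking.
english_order = "".join(sorted(english_freq, key=english_freq.get, reverse=True))
print(english_order)
```

With these values the ordering starts with the familiar "etaoin".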
Well, at first glance, this kind of lines up. A few things look out of whack, though. X has moved way up in the ordering, and O and L both have shifted significantly down.
Enough Guessing, Plot All The Things!
But let’s get real with this – let’s just plot the frequencies side by side. Time to bust out some Python data tools!
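A rough sketch of a side-by-side bar chart with Matplotlib follows. It assumes Matplotlib is installed, uses a stand-in string rather than the text from the image, and reuses the commonly cited Wikipedia percentages (worth double-checking against the current table).

```python
from collections import Counter

import matplotlib
matplotlib.use("Agg")  # render off-screen; drop this line inside a notebook
import matplotlib.pyplot as plt

# Stand-in string: NOT the text from the image.
text = "slot machines"

# Reference English letter frequencies (percent), in alphabetical order.
english = dict(zip("abcdefghijklmnopqrstuvwxyz", [
    8.167, 1.492, 2.782, 4.253, 12.702, 2.228, 2.015, 6.094, 6.966,
    0.153, 0.772, 4.025, 2.406, 6.749, 7.507, 1.929, 0.095, 5.987,
    6.327, 9.056, 2.758, 0.978, 2.360, 0.150, 1.974, 0.074,
]))

# Observed frequencies in the sample, as percentages of all letters.
counts = Counter(ch for ch in text.lower() if ch in english)
total = sum(counts.values())

# Put the x axis in decreasing order of the reference frequency.
letters = sorted(english, key=english.get, reverse=True)
expected = [english[ch] for ch in letters]
observed = [100.0 * counts[ch] / total for ch in letters]

# Two bars per letter: reference on the left, sample on the right.
positions = list(range(len(letters)))
width = 0.4
fig, ax = plt.subplots(figsize=(10, 4))
ax.bar([p - width / 2 for p in positions], expected, width, label="English (reference)")
ax.bar([p + width / 2 for p in positions], observed, width, label="Sample text")
ax.set_xticks(positions)
ax.set_xticklabels(letters)
ax.set_ylabel("Relative frequency (%)")
ax.legend()
```

In a notebook the figure displays inline; from a script, call `fig.savefig(...)` with a filename of your choice.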
Very interesting: we can immediately see that O, X, and L are way out of whack with the natural distribution of English letters. It looks like C, P, and maybe T are also not quite right. Let’s compute the differences and plot them, so we can easily find the most curious letters. To do this concisely, we’ll use NumPy and a little fancy indexing.
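A sketch of the difference computation, assuming NumPy is installed. The idea is to subtract the reference percentages from the observed ones, use `np.argsort` to get the ordering, and then index both arrays with that index array (the fancy indexing part). The string is a stand-in, not the text from the image, and the reference values are the commonly cited Wikipedia ones.

```python
from collections import Counter

import numpy as np

# Stand-in string: NOT the text from the image.
text = "slot machines"

alphabet = "abcdefghijklmnopqrstuvwxyz"
letters = np.array(list(alphabet))

# Reference English letter frequencies (percent), in alphabetical order.
english = np.array([
    8.167, 1.492, 2.782, 4.253, 12.702, 2.228, 2.015, 6.094, 6.966,
    0.153, 0.772, 4.025, 2.406, 6.749, 7.507, 1.929, 0.095, 5.987,
    6.327, 9.056, 2.758, 0.978, 2.360, 0.150, 1.974, 0.074,
])

# Observed frequencies (percent) in the same alphabetical order.
counts = Counter(text.lower())
raw = np.array([counts[ch] for ch in alphabet], dtype=float)
observed = 100.0 * raw / raw.sum()

# Positive = over-represented in the sample, negative = under-represented.
diff = observed - english

# Indices that sort the differences from largest to smallest.
order = np.argsort(diff)[::-1]

# Fancy indexing: reorder both arrays with the same index array.
for letter, delta in zip(letters[order], diff[order]):
    print(f"{letter}: {delta:+.2f}")
```

Passing `letters[order]` and `diff[order]` to a bar plot gives the sorted difference chart. With a sample this tiny the numbers are dominated by noise, so it only illustrates the mechanics.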
So we can clearly see where the text in the image deviates most from the typical distribution of English letters: it has quite a few more occurrences of T, I, and E, and noticeably fewer of O and L.
The text corpus has only 187 letters, so it’s a pretty small sample. It’s very possible for a text that short to deviate from the typical English distribution by chance. However, the significant differences here, especially in some of the most common letters (T, L, E), are enough to raise doubt in my mind.
Disagree? Want to look at it a different way? This blog post is also available as an IPython Notebook shared via our cloud-based Python-in-the-Browser app, Wakari. Wakari lets you easily run Python 2.6 – 3.3, with NumPy, SciPy, Matplotlib, pandas, and IPython Notebook, all right from your browser. Sign up for the free beta today!
By publishing and sharing this IPython Notebook with Wakari, Trent Oliphant, another Continuum developer, was able to perform his own analysis to see if my conclusions on “natural” English distribution would hold true for other texts. His analysis is at the bottom of this shared IPython Notebook.