Sherlock Holmes and machine learning

Some people would claim there is an uncanny parallel between the methods used by Sherlock Holmes and machine learning. In both cases an apparently insoluble problem is suddenly resolved using nothing but careful assessment of the available evidence. In both cases we are startled that it was possible to find a solution when we, the readers, had no idea of it. So I was intrigued when Phil Gooch presented an analysis of a Sherlock Holmes story to discover the most important details. Could machine learning, like Holmes, solve a crime mystery? Phil’s presentation, at the excellent London Text Analytics Meetup Group, lived up to expectations, even if he didn’t quite demonstrate a machine solving the mystery. Instead, and quite an achievement in its own right, he coded the analysis in front of us while he was presenting. I’ve been to presentations where there is a live demo, but I’ve not seen coding on the screen (and taking suggestions for alterations) as the main part of the presentation. All credit to Phil, then, for coolness!

Ostensibly, Phil Gooch’s talk was about graph databases and text analytics. He outlined two major types of graph database for modelling text: the more traditional RDF graph, and the property graph, in this case Neo4j. Rather than comparing the two approaches in detail, he went on to use Neo4j to analyse “A Scandal in Bohemia”, the first Sherlock Holmes short story that Arthur Conan Doyle published (the novels “A Study in Scarlet” and “The Sign of the Four” came earlier).

Using the TextRank algorithm, he was able to extract the most significant words from the story, and the most significant sentences. Given that this was a live demo with the laptop balanced precariously on a cabinet, he could be forgiven for simplifying things somewhat, so the initial demonstration used just the first few sentences of the story. So did TextRank do its stuff? It certainly looked promising. The three most significant words it found were “Holmes”, “case” and “woman”, which are plausible key terms for a story about Holmes solving a case involving a woman. As for the most significant sentence in the text, it was “To Sherlock Holmes she is always the woman”: pretty good, although not quite the solution to the mystery.

How did it work? Phil’s approach was derived from a paper by Josh Bohde, and it rests on TextRank, a derivative of Google’s PageRank. As you probably know, PageRank is what Google uses to determine the relative importance of websites: not by counting the number of links a website has to other sites, but by the links that point to it from other sites, with links from important sites counting for more. TextRank applies the same algorithm to the words in a text. Here the links are co-occurrences: two words are connected if they appear in the same sentence, and words that are well connected tend to be more important. The same idea can be applied to whole sentences: a sentence that is linked to many other sentences in the text (for example, by sharing words with them) is likely to be a significant sentence.
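To make the idea concrete, here is a minimal sketch in plain Python of word-level ranking along the lines described above. It is not Phil’s code (he used Neo4j), and it simplifies the published TextRank method: words are linked whenever they share a sentence, there is no part-of-speech filtering, and the tiny stopword list, the function names and the three-sentence sample text are all invented for illustration.

```python
import re
from collections import defaultdict
from itertools import combinations

# Assumption: a tiny hand-picked stopword list, purely for illustration.
STOPWORDS = {"the", "a", "an", "to", "of", "and", "was", "is"}


def build_cooccurrence_graph(text):
    """Link every pair of distinct non-stopwords that share a sentence."""
    graph = defaultdict(set)
    for sentence in re.split(r"[.!?]+", text.lower()):
        words = set(re.findall(r"[a-z']+", sentence)) - STOPWORDS
        for a, b in combinations(sorted(words), 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph


def textrank(graph, damping=0.85, iterations=50):
    """PageRank by power iteration on an undirected, unweighted graph."""
    n = len(graph)
    scores = {node: 1.0 / n for node in graph}
    for _ in range(iterations):
        # Each word receives an equal share of every neighbour's score.
        scores = {
            node: (1 - damping) / n
            + damping * sum(scores[nb] / len(graph[nb]) for nb in graph[node])
            for node in graph
        }
    return scores


# Invented sample text, not a quotation from the story.
text = "Holmes took the case. The case puzzled Watson. Holmes smiled."
scores = textrank(build_cooccurrence_graph(text))
top = sorted(scores, key=scores.get, reverse=True)[:3]
print(top)
```

The better connected a word is, and the better connected its neighbours are, the higher it scores. A real implementation would probably use a sliding window rather than whole sentences, and a proper tokenizer.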

We were all very impressed by these results, and we asked to see how the system performed on the full text. Unfortunately the results here were not so good. The most important sentence found was “Said Holmes”, and this, explained Phil, is because the full story is around 80% dialogue, which made it difficult for the algorithm to do its work. Nonetheless, you could imagine a tool such as this being used (and I’m told it already is) to extract, say, the top 10 sentences from a news bulletin.

Phil also showed a couple of other tools that Neo4j can provide, including a “shortest path” tool, showing the shortest path between any two words – potentially interesting, but not very revealing for the examples he showed.
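For readers who want to see what a shortest path between two words means in practice, here is a small sketch in plain Python using breadth-first search over a word co-occurrence graph. It is an illustration only, not the Neo4j tool Phil demonstrated; the function names and the three sample sentences are made up for the example.

```python
import re
from collections import deque
from itertools import combinations


def build_graph(sentences):
    """Connect every pair of distinct words that appear in the same sentence."""
    graph = {}
    for sentence in sentences:
        words = set(re.findall(r"[a-z']+", sentence.lower()))
        for a, b in combinations(words, 2):
            graph.setdefault(a, set()).add(b)
            graph.setdefault(b, set()).add(a)
    return graph


def shortest_path(graph, start, goal):
    """Breadth-first search; returns a list of words, or None if unconnected."""
    if start not in graph or goal not in graph:
        return None
    previous = {start: None}
    queue = deque([start])
    while queue:
        word = queue.popleft()
        if word == goal:
            # Walk back through the recorded predecessors to rebuild the path.
            path = []
            while word is not None:
                path.append(word)
                word = previous[word]
            return path[::-1]
        for neighbour in sorted(graph[word]):  # sorted for repeatable results
            if neighbour not in previous:
                previous[neighbour] = word
                queue.append(neighbour)
    return None


# Invented sample sentences, not quotations from the story.
sentences = [
    "Holmes studied the photograph",
    "The photograph belonged to Irene",
    "Watson admired Holmes",
]
graph = build_graph(sentences)
path = shortest_path(graph, "watson", "irene")
print(path)
```

Because every word in a sentence is linked to every other, paths in a graph like this tend to be short, which may be one reason the results in the demo were not especially revealing.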

Limitations of the code: of course there were several, largely to get the presentation to work quickly and simply. The demo did no lemmatization, so singular and plural forms were not merged (“woman” and “women” would be treated as separate items). I realised afterwards that there was probably no code to deal with references in the story to “the detective”, which could be a synonym for “Holmes” or could mean the police detective, a potentially ambiguous reference. Nonetheless, this was a startling demo, revealing just how simple it is to put machine learning into use. Was it as good as Holmes? Not quite. Did the machine solve the mystery? Not just yet.