How difficult is text analytics?

A recent ISKO meeting ("Taming the News Beast", 1 April, London) presented the current state of the art for text analytics relating to news publishing. Although the presentations were excellent, and the organisations represented were leading edge, I found the day valuable not for what was described in the presentations, but for some issues that were revealed during the talks. Perhaps text analytics isn't as advanced as some people might think.

For example, there was a presentation by the BBC News Labs about “the newsroom of things” – the “things” being the entities that comprise the news. Matt Shearer explained that the News Labs team was not operational, but worked closely with the journalists doing their normal job. He stated that “text is well-known, but media is less well understood”. Yet from the presentation, it became clear that the problems of automatically extracting tags from articles have not yet been solved. His colleague Jeremy Tarling gave a fascinating insight into “storylines” – a news story comprising several smaller articles appearing in sequence, over many days or weeks. Undoubtedly this topic is a fundamental aspect of how news is created and consumed, and the research team’s approach, combining information architects with in-house and external developers, was very impressive.

Yet it was only revealed in passing during the presentations that subject metadata is not being created automatically for stories as they are written and uploaded. You would think that creating subject metadata for news stories as they are added to the content repository is the first goal for automatic metadata, and you could ask why the BBC has not cracked this more fundamental problem before considering more advanced issues such as storylines. The alternative, for journalists to create their own tags, is time-consuming and runs the risk of not following standard terminology ("sport" or "sports"? "football" or "soccer"?).
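
To make the terminology problem concrete, here is a minimal sketch of how free-text tags might be checked against a controlled vocabulary. The vocabulary, the function name `normalise_tags` and the choice of preferred terms are all my own assumptions for illustration; nothing here describes how the BBC or any other publisher actually handles tags.

```python
# Hypothetical controlled vocabulary: each known variant maps to one preferred term.
# The entries are illustrative assumptions, not a real publisher's term list.
PREFERRED_TERMS = {
    "sport": "sport",
    "sports": "sport",
    "football": "football",
    "soccer": "football",
}


def normalise_tags(raw_tags):
    """Map free-text tags to preferred terms and collect unknown tags for review."""
    accepted, unknown = [], []
    for tag in raw_tags:
        key = tag.strip().lower()
        preferred = PREFERRED_TERMS.get(key)
        if preferred is None:
            # Not in the vocabulary: a human has to decide what to do with it.
            unknown.append(tag)
        elif preferred not in accepted:
            accepted.append(preferred)
    return accepted, unknown


accepted, unknown = normalise_tags(["Sports", "soccer", " football ", "cricket"])
```

Even in this toy form, somebody has to build and maintain the mapping, and somebody has to review the tags that fall outside it.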

During the Q&A session, this topic was explored a little further. Jonathan Engel pointed out from his experience at Reuters that, although it is possible to create rules for the automatic tagging of content, writing those rules so that they work effectively turned out to be a very slow process for humans. At Reuters it took around half a day to create one effective tagging rule.
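
To give a feel for what such a rule involves, here is a minimal sketch of a rule-based tagger. It is purely illustrative: the rule format, the patterns, the thresholds and the function name `tag_article` are assumptions of mine, not a description of the Reuters system or of any vendor's product.

```python
import re

# Hypothetical rules: a tag is assigned when at least `min_hits` of its
# patterns are found in the article text. Patterns and thresholds are invented.
RULES = {
    "football": {
        "patterns": [r"\bpremier league\b", r"\bgoalkeeper\b", r"\bfa cup\b"],
        "min_hits": 1,
    },
    "economy": {
        "patterns": [r"\binflation\b", r"\binterest rates?\b", r"\bgdp\b"],
        "min_hits": 2,
    },
}


def tag_article(text, rules=RULES):
    """Return the tags whose rules fire for the given article text."""
    lowered = text.lower()
    tags = []
    for tag, rule in rules.items():
        # Count how many distinct patterns for this tag appear in the text.
        hits = sum(1 for pattern in rule["patterns"] if re.search(pattern, lowered))
        if hits >= rule["min_hits"]:
            tags.append(tag)
    return tags


tags = tag_article("The goalkeeper saved a penalty in the FA Cup tie.")
```

Writing the code is the easy part. The slow part is deciding which patterns and thresholds separate a story about the economy from one that merely mentions inflation in passing, and then testing that choice against real articles, which is where a figure like half a day per rule becomes plausible.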

Hence, if you are a pessimist, or perhaps a realist, there are two conclusions that could be drawn from this presentation. Firstly, research teams sometimes choose to investigate advanced topics before the simpler ones have been cracked. Secondly, if you are planning to implement an automatic entity extraction system, you should not overlook the substantial in-house, human costs involved in creating the automation – something that the vendors of text analytics software often don't mention. Text analytics software is often (always?) not a one-stop solution that can be implemented instantly. It requires detailed configuration by humans with a good understanding of the subject domain in which it is to be implemented. That may be a skill that the in-house team may well not have.