What can machines discover from scholarly content?

Just when you thought that everything was known about the academic user journey, a workshop comes along (the WSDM Workshop on Scholarly Web Mining, SWM 2017, held in Cambridge, February 10 2017) that presents a whole new set of tools and investigations to consider.

It was a rather frantic event, squeezing no fewer than 11 presentations into a half-day session, even if the event took place in the sumptuous and rather grand surroundings of the Council Chamber in the Cambridge Guildhall. Trying to summarise all 11 presentations would be a challenge; were there any common areas of inquiry?

World domination through machine learning: a review of The Master Algorithm

Pedro Domingos likes big ideas. He sets out to describe to us how computers can write their own programs. For example, there is the well-established case of handwriting recognition. This is a form of machine learning in which the computer is provided with sufficient examples (and a training set) to enable the machine to learn to do something. If you show the machine the number “9” written enough ways, the machine eventually becomes as good as or even better than a human at recognising a handwritten “9”.

Unfortunately, he alternates between very sensible and clear descriptions like this, and sweeping optimistic generalisations. Mr Domingos is in no doubt who the new masters of the world are going to be. In his potted description of commerce, he describes how “the progression from computers to the Internet to machine learning was inevitable ... once the inevitable happens and learning algorithms become the middlemen, power becomes concentrated in them.” In fact, there is no future for any company without using machine learning: “a company without machine learning can’t keep up with one that uses it ... businesses embrace it because they have no choice.” That’s a very stern conclusion!

Thema book subject codes: a “huge leap forward”?

"Huge leap forward" were the words of The Bookseller's BookBrunch newsletter, reporting (21 June 2016) a new version (1.2) of the Thema ebook classification scheme (released in May 2016). Normally a point release of a standard isn't a huge leap forward, so I was curious to know more. Because BookBrunch is open only to subscribers, I can’t see who is responsible for that "huge leap forward" comment. But the Thema classification scheme itself is open and accessible to anyone who wants to find out more details of this new initiative for book classification.

Can you charge for adding metadata to free content?

Clearly you can. A recent post in Book Business Magazine describes how McGraw-Hill Education will include free teaching resources within its paid education platform, Engrade, alongside its own paid resources, on the basis that these free resources are better curated and hence easier to find and to make use of. However, users will be charged for access to the free resources within the platform, since McGraw-Hill state they have had to pay to have them tagged. McGraw-Hill is open about what they are doing: they say they pay to have the material selected for quality and then tagged, so it is only reasonable to charge end users for the selection and better navigation.



Getting a feel for sentiment analysis

An excellent session of the London Text Analytics Group (March 14) contrasted two approaches to sentiment analysis: one proudly (and publicly) ditches grammar, while the other uses grammar to disambiguate content. Both approaches made ambitious claims for their software; which is the better approach?

Stephen Pulman of TheySay, a start-up from the University of Oxford, had the more traditional approach. He pointed out that taking individual words by themselves can lead to great confusion. Just assessing whether something is positive or negative is not so simple: “bacteria” is negative, and “kill” is negative, but “kills bacteria” is positive. More complex still, the phrase “never fails to kill bacteria” is highly positive. A bag-of-words approach is unlikely to pick up all these distinctions.
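
To make the “kills bacteria” problem concrete, here is a minimal sketch in Python. It is not TheySay's method: the tiny lexicon, the list of polarity-reversing words and the right-to-left flipping rule are all assumptions invented for illustration. The first function scores a phrase as a bag of words; the second lets certain words reverse the polarity of whatever follows them, as a crude stand-in for grammar. It captures only the sign of the sentiment, not the “highly positive” strength of the longer phrase.

```python
# Toy lexicon (an assumption for illustration, not a real sentiment resource).
POLARITY = {"bacteria": -1, "kill": -1, "kills": -1, "fails": -1}

# Words assumed to reverse the polarity of the phrase that follows them.
REVERSERS = {"kill", "kills", "fails", "never"}


def bag_of_words_score(phrase):
    """Sum the polarity of each word, ignoring word order entirely."""
    return sum(POLARITY.get(word, 0) for word in phrase.lower().split())


def compositional_score(phrase):
    """Read the phrase right to left; a reverser flips the score built so far."""
    score = 0
    for word in reversed(phrase.lower().split()):
        if word in REVERSERS and score != 0:
            # The word has something to act on, so it reverses that polarity.
            score = -score
        else:
            # Otherwise the word contributes its own polarity, if it has one.
            score += POLARITY.get(word, 0)
    return score


if __name__ == "__main__":
    for phrase in ("bacteria", "kills bacteria", "never fails to kill bacteria"):
        print(phrase, bag_of_words_score(phrase), compositional_score(phrase))
```

With this toy lexicon, the bag-of-words score for “kills bacteria” is -2 and for “never fails to kill bacteria” is -3, while the compositional score is +1 for both. Even a very crude notion of structure recovers the positive reading that word counting misses.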

How many answers would you prefer?

“We find users prefer one answer.” This was the comment of Google’s Behshad Behzadi when presenting Google’s new Ultimate Assistant. In case you don’t already know, Google’s Ultimate Assistant will answer your questions, whether you key them in or (in Google’s opinion, the more likely case) speak them to the device. Most of Behzadi's presentation was based around his smartphone, not using the desktop at all. What kind of questions?