“Huge leap forward” were the words of The Bookseller’s BookBrunch newsletter, reporting (21 June 2016) on a new version (1.2) of the Thema ebook classification scheme (released in May 2016). Normally a point release of a standard isn’t a huge leap forward, so I was curious to know more. Because BookBrunch is open only to subscribers, I can’t see who is responsible for that “huge leap forward” comment. But the Thema classification scheme itself is open and accessible to anyone who wants to find out more details of this new initiative for book classification. So is it a huge leap forward?
First, a bit of background. Thema is an international classification scheme for books. It was launched in 2013, and was designed to harmonize the separate book classification systems in place for individual countries, such as BISAC for the US and BIC for the UK. From the Editeur website (Editeur is the UK-based not-for-profit group that created Thema - they also manage ONIX for books): “The goals of Thema are to reduce the duplication of effort required by the many distinct national subject schemes.” Why doesn’t Thema use existing subject classifications, such as Dewey or Library of Congress? These were rejected because “they are used primarily in the library world”. In other words, the specific requirements of general trade and educational publishing were such that the creators of Thema believed an entirely new subject classification was required, although it would overlap with existing schemes. The result is a classification that assumes a bookshop audience rather than an academic audience: for example, the Thema classification has only one code for books about dinosaurs, YNNA “children and teenage interest: dinosaurs and prehistoric world”. What happens when an adult book about dinosaurs goes on sale in a bookshop? There is no code for it.
The way Thema works is in principle quite simple. Each book is coded to one or more of around 2500 subject headings. First, a book is placed in one or more of 20 top-level categories, such as A for the arts, and then given a more detailed code at a lower level. Any book with a detailed code is also included in every broader code above it, so that, for example, “TDCT”, food and beverage technology, is included in the broader code “TD”, industrial chemistry and manufacturing technologies. This seems sensible (but you will see later on how it causes problems).
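The inclusion rule described above can be sketched in a few lines of Python. This is only an illustration, assuming that a broader code is always a leading substring of the more detailed one, as the TDCT and TD example suggests. The function names and the sample catalogue are hypothetical and are not part of any official Thema tooling; the prefixes are generated mechanically, so not every one is guaranteed to exist as a real heading.

```python
def broader_codes(code: str) -> list[str]:
    """Return every shorter prefix of `code`, nearest first.

    Assumption: a broader code is simply a leading substring
    of the detailed code (as with TDCT and TD in the text).
    """
    return [code[:i] for i in range(len(code) - 1, 0, -1)]


def is_included_in(code: str, broader: str) -> bool:
    """True if `code` equals `broader` or sits beneath it."""
    return code.startswith(broader)


def books_under(catalogue: dict[str, list[str]], broader: str) -> list[str]:
    """Titles that carry at least one code at or below `broader`."""
    return sorted(
        title
        for title, codes in catalogue.items()
        if any(is_included_in(c, broader) for c in codes)
    )


# Hypothetical catalogue: the titles are invented for illustration only.
CATALOGUE = {
    "Brewing at Scale": ["TDCT"],
    "Plastics Processing": ["TD"],
    "A History of Jazz": ["A"],
}

if __name__ == "__main__":
    print(broader_codes("TDCT"))         # ['TDC', 'TD', 'T']
    print(books_under(CATALOGUE, "TD"))  # ['Brewing at Scale', 'Plastics Processing']
```

A search for “TD” therefore picks up the book coded “TDCT” without anyone having to assign the broader code by hand, which is the sensible part of the design.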
What’s wrong with Thema?
- The codes are not meaningful, in any language. The attempt to create an international system has produced codes that are impossible to guess, even in English, the most widely used language. “W” means lifestyle. “V” stands for health.
- There are too many codes, and too many rules, for a human to apply reliably, whichever language is being used. Nobody can choose from over 2500 codes for each book without making mistakes.
- Hierarchical classification is not the way that most humans think. Specialist taxonomists are able to think in those terms; mere mortals will struggle. Am I the only one to think that the codes for US cities are loopy? How about “1KBB-US-SWLN” for New Orleans? It is a subset of “1KBB”, which means USA to you and me. And if you know the code for New Orleans, you will have no difficulty working out the code for Miami, which is “1KBB-US-SEFM” (of course).
- If humans make mistakes adding the original codes in one language, then all the language variants will have errors also. So the stated goal of Thema consists of fixing a consequent problem (language variation) without changing the underlying fault in the system (a complex system that requires large numbers of highly-trained specialist indexers to use it).
- The taxonomy has been created by a committee. Making changes to the taxonomy will be slow and will inevitably follow the development of new terms by several months, more likely several years, meaning many books will not have the correct code until years after publication. A machine-based ontology would create subject codes as and when they are used.
- The taxonomy attempts far too precise a classification to be useful. How should you distinguish, for example, KND (manufacturing industries ... “including arms, vehicles, etc”) from KNG (transport industries ... “all transport industries, including road, shipping, railway, aerospace”)? Where would you classify a book about railway engine manufacturing? Is that transport or manufacturing?
- The indexers have included buzzwords, terms that are fashionable but which may be transient. Why is “disruptive innovation” given its own code as a subset of “business innovation”? Why not add other buzzwords such as (to take just recent manufacturing and IT terms) “lean development” or “kanban”? No doubt these will be added in future editions, making the taxonomy even more unwieldy.
- The system does not provide any synonyms, as most taxonomies would do. For example, there is a code for “agile programming” – is this the same as “agile development”, “agile manufacturing”, or “agile project management”, or are all these different? Any full-scale taxonomy would have lists of alternative terms.
- It is alarming to see what was left out of version 1.1 and had to be added in version 1.2, for example knitting (WFBS1) and sewing (WFBS2). How many knitting books are published each year? It seems to demonstrate that the attempt to create a universal taxonomy is doomed to endless approximation and incompleteness, if managed by hand.
- The classification system is already growing, and from the terms being added there seems no reason why it should not grow like, say, Dewey, which is now in its 23rd edition and which extends to four volumes. There must be humans who know how to catalogue using Dewey, but you would probably want to avoid them at parties.
In other words, this is an unwieldy attempt to force untrained staff at publishing companies to create an elaborate machine-readable taxonomy. The whole exercise seems guaranteed to produce inaccurate codings; it is a system created by taxonomists for taxonomists. If you aren’t yet convinced, try reading some of the explanations about individual codes.
It seems truly perverse to be attempting to create elaborate rules for humans to classify a set of items (in this case, books) when other industries are busy replacing human classification with automatic indexing tools. We should have learned from library subject classification that creating detailed subject classifications by hand requires training, patience, and lots of time. Publishers have little of any of these. I’d love to think that the developers of Thema might try to simplify their classification, but it seems to be an occupational hazard with taxonomists to make their classification schemes steadily more complex. That’s how you end up with such actual codes in the library world as Dewey “813.54 M37 2007”, or Library of Congress “PR9199.3.M3855”. Anyone who tries to find a book using these codes will have forgotten the number by the time they arrive at the shelf in the library. At a time when the book is under threat, publishers are making it more difficult for themselves to operate. There must be a simpler system than this!