CHEMISTRY
OSCAR understands the language of chemistry, naturally
Like any other language, the language of chemistry lacks uniformity.
New words are invented, old words fade out of use, styles of writing change and some writers suffer from less-than-perfect grammar.
What’s more, there is no single way of referring to a chemical: one person’s salt is another’s sodium chloride (and yet another’s NaCl). To search for a specific word in a chemistry text, a researcher must take into account every permutation of that word and every possible mistake in representing it.
This is highly inefficient, and with more sources of chemistry information becoming available every day, it’s not getting any easier to find relevant information.
But now there’s OSCAR to help. Short for “Open Source Chemistry Analysis Routines,” it is an open-source software package developed at Cambridge University for the semantic annotation of chemistry papers. It is closely integrated with OMII-UK, an organization that seeks software solutions for e-research.
“I see chemistry as a language,” explained Peter Murray-Rust, leader of the OSCAR research group at the university, “one that is communicated in natural language, graphics, formulae and equations.” This insight led to the development of OSCAR for reviewing literature to identify information relevant to chemistry research. The software has gained some prestigious advocates, including the Royal Society of Chemistry.
The software’s primary purpose is to recognize concepts in text that have a precise meaning. Murray-Rust says that it not only recognizes chemical names, adjectives and processes, but can also link them to their meaning using an ontology: a rigorous, usually hierarchical organization of a knowledge domain that captures all the relevant relationships.
By using such a system, the researcher is freed from having to hunt for every permutation of a specific word, because OSCAR automatically links the word with its alternatives.
It can also enrich text searches by providing further information about the terms it identifies, such as chemical properties and molecular structure.
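The idea of linking a word to its alternatives can be illustrated with a small sketch. This is not OSCAR's code or data: the `ONTOLOGY` table, its entries and the `canonical_concept` and `search` helpers are hypothetical names invented for illustration, and a real ontology would be far larger and richer.

```python
import re

# Hypothetical toy ontology: each canonical concept lists its known surface
# forms plus a little extra information. Illustrative only, not OSCAR's data.
ONTOLOGY = {
    "sodium chloride": {
        "synonyms": {"salt", "table salt", "nacl", "sodium chloride"},
        "formula": "NaCl",
        "parent": "inorganic salt",
    },
    "ethanol": {
        "synonyms": {"ethanol", "ethyl alcohol", "etoh"},
        "formula": "C2H6O",
        "parent": "alcohol",
    },
}

# Invert the ontology so that any surface form maps to its canonical concept.
SYNONYM_INDEX = {
    synonym: concept
    for concept, entry in ONTOLOGY.items()
    for synonym in entry["synonyms"]
}


def canonical_concept(term):
    """Return the canonical concept for a term, or None if it is unknown."""
    return SYNONYM_INDEX.get(term.strip().lower())


def search(documents, query):
    """Return the documents that mention any synonym of the query's concept."""
    concept = canonical_concept(query)
    if concept is None:
        return []
    patterns = [
        re.compile(r"\b" + re.escape(synonym) + r"\b")
        for synonym in ONTOLOGY[concept]["synonyms"]
    ]
    return [
        doc for doc in documents
        if any(pattern.search(doc.lower()) for pattern in patterns)
    ]


if __name__ == "__main__":
    docs = [
        "Add salt to the flask.",
        "NaCl was dissolved in water.",
        "The cat sat on the mat.",
    ]
    # One query finds both the "salt" and the "NaCl" document.
    print(search(docs, "sodium chloride"))
    # The ontology entry can also enrich the hit with extra information.
    print(ONTOLOGY[canonical_concept("salt")]["formula"])
```

A single query for "sodium chloride" matches documents that say "salt" or "NaCl", which is the behavior described above. The sketch ignores the hard part, namely deciding from context whether "salt" is meant in the chemical sense at all.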
A typical map of an ontology, showing relationships. Image courtesy Wikimedia Commons
‘Natural’ language
The key to the program’s ability is natural-language processing, which is done by drawing upon a carefully chosen body of documents in which researchers had identified the words related to chemistry.
This allows it to learn how to identify the context in which chemistry-related words are used.
In addition, OSCAR has been programmed to look for clues that show a link to chemistry, for example the prefix “methyl.” The result is software that can process a text and determine whether the author used the word “cat” to refer to chloramphenicol acetyltransferase, or something else.
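A rough sketch of how such clues might be combined follows. It is a hand-written simplification, not OSCAR's trained model: the fragment list, the context vocabulary, the window size and the function names `looks_chemical` and `cat_sense` are all assumptions made for illustration.

```python
import re

# Hypothetical word fragments that often signal a chemical name.
CHEMICAL_FRAGMENTS = ("methyl", "ethyl", "chlor", "acetyl")

# Hypothetical context words suggesting that "cat" is used in a
# biochemical sense rather than for the animal.
CHEMISTRY_CONTEXT = {"enzyme", "assay", "gene", "plasmid", "expression"}


def looks_chemical(token):
    """Return True if the token contains a chemistry-related fragment."""
    token = token.lower()
    return any(fragment in token for fragment in CHEMICAL_FRAGMENTS)


def cat_sense(sentence, window=5):
    """Guess the sense of the first "cat" in a sentence from nearby words.

    Returns None when the sentence does not contain the word at all.
    """
    tokens = re.findall(r"[a-z]+", sentence.lower())
    if "cat" not in tokens:
        return None
    position = tokens.index("cat")
    # Look at up to `window` words on each side of the first occurrence.
    context = tokens[max(0, position - window):position]
    context += tokens[position + 1:position + 1 + window]
    if any(word in CHEMISTRY_CONTEXT or looks_chemical(word) for word in context):
        return "chloramphenicol acetyltransferase"
    return "other"


if __name__ == "__main__":
    print(looks_chemical("methylbenzene"))
    print(cat_sense("The CAT enzyme assay showed high activity."))
    print(cat_sense("The cat sat on the mat."))
```

Fixed word lists like these are brittle. The article's point is that OSCAR learns such context from annotated documents instead of relying only on rules of this kind.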
Murray-Rust claims that it does the job very well: in a test of precision and recall, OSCAR achieved 83%. Humans manage a slightly higher 90%, but they do the job many millions of times more slowly.
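For readers unfamiliar with the terms, precision is the share of the items a system flags that are correct, and recall is the share of the correct items that it finds. The sketch below computes both, plus their harmonic mean (the F1 score), on invented toy data. The article does not say which of these measures the 83% figure refers to, and the term sets here are made up for illustration.

```python
def precision_recall(predicted, gold):
    """Return (precision, recall) for predicted items against a gold standard."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


def f1_score(precision, recall):
    """Return the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Invented example: terms a human annotator marked as chemical ...
    gold = {"methanol", "NaCl", "benzene", "ethanol"}
    # ... and terms a hypothetical system flagged as chemical.
    predicted = {"methanol", "NaCl", "cat"}

    precision, recall = precision_recall(predicted, gold)
    print(f"precision={precision:.2f} recall={recall:.2f} "
          f"F1={f1_score(precision, recall):.2f}")
```

In this toy case the system flags three terms, two of them correctly, and misses two of the four chemical terms in the gold standard.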
Three major European organizations are already using the software. The Royal Society of Chemistry is using it to make searches of its online journal papers more accurate. Meanwhile, Christoph Steinbeck is investigating a similar system for the European Bioinformatics Institute that he says will “provide the community with the tools for large-scale harvesting of chemical data hidden in the past 100 years of printed literature.”
And the European Patent Office is investigating the possibility of OSCAR-assisted searches to provide a higher degree of certainty that all of the documents relevant to a patent application have been identified.
OSCAR can be downloaded, installed and tested for free, and its developers say that they encourage the open-source community to tinker with the software. “Anyone can take the software and do roughly what they like with it, they can extend it,” said Murray-Rust.
The software’s open-source background had previously put it in the right position for a collaboration with OMII-UK, whose developers improved OSCAR’s performance and structure. The code was modularized to make future debugging and upgrading more straightforward, and an automatic test regime was developed and implemented.
With the aid of such a program, researchers can spend less time searching for information and more time performing their research.