GOVERNMENT
Software to bring order to information chaos
A new software system that enables faster and more comprehensive analysis of vast quantities of information is so effective that it not only creates order out of chaos and allows computers to perform tasks that previously only people could perform, it also derives new information from old data.

Three major case studies were used to evaluate and demonstrate the system. A market-consulting firm for the biotech industry used it to query situations-vacant advertising and business information newswires to predict which new drugs and compounds individual companies wanted to develop, creating highly value-added analyses from public-domain information.

The Greek Ministry of Defence (MoD) used the tools to compile dossiers on terrorist groups automatically from a combination of its own files and newspaper reports. Meanwhile, the Anglo-Dutch multinational Unilever is using them to analyse journal articles, newspaper reports and even anecdotal data to build up a picture of the relationship between weight, health and food. It is a bit like automatically producing a super-study of diseases and how they spread.

A framework for uniting information

"Our greatest contribution was to create a framework for integrating structured and unstructured information," says Dr Babis Theodoulidis, Senior Lecturer at the University of Manchester's Institute of Science and Technology (UK) and coordinator of the IST-funded PARMENIDES project behind these tools.

Currently, the vast majority of information is unstructured text: reports, newspaper articles, letters, memos, essentially any information that is not part of a database. "Analysing text requires human intervention and, when you are trying to analyse perhaps thousands of documents in many different languages, really large-scale text analysis becomes very expensive, or even impossible," says Theodoulidis.

Structured information is found in databases, such as customer management software, personnel files and library catalogues: any information organised by specific fields of data, such as name, address and so on. "Analysing structured data is not new. Analysing unstructured information using computers is only a recent development, but integrating and analysing the combined data has never been done before. Our framework makes that possible," says Theodoulidis.

Practical applications

It means that, once the appropriate priming and tuning are completed, a computer can analyse a given text and put it into context. "For example, a company might get a letter of complaint and then an employee needs to read it and forward it to the right person," says Theodoulidis. "But in our system the letter is 'read' by a computer, which then links the letter to the company's personnel database and forwards the letter to the right person."

It may sound like a ho-hum example, but the potential applications are huge and anything but humdrum. The Greek MoD used the PARMENIDES system to analyse large quantities of unstructured data, such as newspaper reports about terrorist attacks, and then combine that with military intelligence. This type of analysis could reveal that one group is changing its methods from car bombs to suicide bombs or chemical attacks, or that one group is beginning to work with another.

"We got our greatest result with the MoD. Before PARMENIDES, they analysed all their unstructured data manually, essentially people reading articles. Now that's almost entirely automatic," says Theodoulidis.
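The letter-routing scenario gives a feel for how free text and a structured database meet. The Python sketch below is purely illustrative and is not PARMENIDES code: the keyword extractor, the toy personnel table and all names in it are assumptions standing in for real linguistic analysis and a real database, but the general pattern is the same, pull terms out of unstructured text and match them against structured records.

```python
# Illustrative sketch only -- not the PARMENIDES implementation.
# A naive keyword extractor stands in for real text analysis, and a
# small list of dicts stands in for a company's personnel database.

PERSONNEL = [
    {"name": "A. Jones",  "department": "billing",  "topics": {"invoice", "refund", "payment"}},
    {"name": "B. Smith",  "department": "shipping", "topics": {"delivery", "courier", "parcel"}},
    {"name": "C. Nguyen", "department": "quality",  "topics": {"defect", "damaged", "faulty"}},
]

def extract_terms(letter: str) -> set[str]:
    """Very naive stand-in for linguistic analysis: lower-cased word split."""
    return {word.strip(".,!?").lower() for word in letter.split()}

def route_letter(letter: str) -> dict:
    """Link the 'read' letter to the personnel table and return the
    recipient whose topics overlap most with the letter's terms."""
    terms = extract_terms(letter)
    return max(PERSONNEL, key=lambda person: len(person["topics"] & terms))

if __name__ == "__main__":
    complaint = "The parcel arrived late and the courier left it in the rain."
    recipient = route_letter(complaint)
    print(f"Forward to {recipient['name']} ({recipient['department']})")
```

In the real system the 'reading' step involves far richer language analysis than keyword overlap, but the linking of extracted content to fields in an existing database is the essential move.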
But the PARMENIDES framework does not just provide a snapshot analysis; it can analyse data over time, too, enabling the system to spot new trends or developments that would otherwise remain hidden. The healthcare consultancy BioVista, for example, combined recruitment and business information to track the shifting research priorities of biotech companies over time. Furthermore, its method of analysis uncovers new information hidden in old data. The work was so successful that BioVista hired two software developers and created its own IT department to develop the technology. "Before that they simply outsourced their IT, but they see a value in this type of system and want to pursue it," says Theodoulidis.

Helping computers understand

The key to the framework is the use of ontologies, a fundamental driver of many of the hottest topics in computer science, such as the Semantic Grid. An ontology is simply a vocabulary detailing all the significant words for a particular domain, such as healthcare, tourism or military intelligence, and the relationships between those words. Computers can then recognise these terms in their particular context; it is a method for getting computers to mimic 'understanding'.

PARMENIDES used one ontology to analyse unstructured text, another to analyse databases and a third to unify the two data sets. So while a newspaper might talk of a 'terrorist' or 'bomber', a military database might use the terms 'hostile' or 'enemy agent', or specific names. Each data type has its own ontology for the context, in this case terrorism, and a third ontology harmonises the two. That is what enables PARMENIDES to create the framework (a simplified sketch of the idea appears below).

But while creating a framework is the project's greatest contribution, it is far from the only one. The group also developed tools to enable the semi-automatic creation of those ontologies. "For example, if you give the system many, many samples of the type of information you want to analyse, it will produce a provisional ontology, which users can adjust to create a definitive ontology," says Theodoulidis.

This has huge potential. Ontologies are a key area of computer science, and developing them is an extremely complex and time-consuming task, so the tool could be used to create new ontologies automatically, or to allow two different ontologies in the same domain, say healthcare, to exchange information. "The software could be used in that way, though this was not a part of our original brief and we didn't explore that aspect of the project," says Theodoulidis.

But the project did produce three demonstrators for the three case studies. "The software needs to be tuned for each particular task," says Theodoulidis, though that tuning is semi-automatic. What's more, each time the system performs an analysis it becomes more refined, and so it develops over time.

Conceivably, regular surfers looking for information on the Web could even use the system. "But it's the old problem of garbage in, garbage out! You don't know the quality of the data. Right now our system is about 75 to 95 per cent accurate, depending on the case. But if your original data is bad, then you won't get accurate results," he says.

That is one area Theodoulidis would like to pursue in the future. "We plan to submit a proposal for an EU project to use some of the tools we developed in PARMENIDES to analyse the quality of particular data on the Web," says Theodoulidis. In the meantime, the group is pursuing a joint venture with BioVista to develop aspects of the framework further.
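To make the three-ontology idea concrete, here is a minimal, hypothetical sketch. It is not project code and the term lists are invented for illustration: one small vocabulary covers newspaper wording, another covers military-database wording, and a third, shared set of concepts lets items from both sources be pooled and queried together.

```python
# Hypothetical sketch of harmonising two vocabularies via a shared third one.
# All term lists here are invented for illustration, not taken from the project.

# Ontology 1: terms as they appear in unstructured text (newspaper reports)
TEXT_ONTOLOGY = {
    "terrorist": "HostileActor",
    "bomber":    "HostileActor",
    "car bomb":  "AttackMethod",
}

# Ontology 2: terms as they appear in structured records (military database)
DB_ONTOLOGY = {
    "hostile":     "HostileActor",
    "enemy agent": "HostileActor",
    "ied":         "AttackMethod",
}

# Ontology 3: the unifying layer -- shared concepts both vocabularies map onto
SHARED_CONCEPTS = {"HostileActor", "AttackMethod"}

def normalise(term, ontology):
    """Map a source-specific term to a shared concept, if one is defined."""
    concept = ontology.get(term.lower())
    return concept if concept in SHARED_CONCEPTS else None

def unify(text_terms, db_terms):
    """Pool terms from both sources under their shared concepts."""
    pooled = {concept: set() for concept in SHARED_CONCEPTS}
    for term in text_terms:
        concept = normalise(term, TEXT_ONTOLOGY)
        if concept:
            pooled[concept].add(term)
    for term in db_terms:
        concept = normalise(term, DB_ONTOLOGY)
        if concept:
            pooled[concept].add(term)
    return pooled

if __name__ == "__main__":
    report_terms = ["bomber", "car bomb"]       # from a newspaper article
    record_terms = ["enemy agent", "IED"]       # from a military database
    print(unify(report_terms, record_terms))
    # e.g. {'HostileActor': {'bomber', 'enemy agent'}, 'AttackMethod': {'car bomb', 'IED'}}
```

The point of the third layer is that neither source has to adopt the other's vocabulary: each keeps its own ontology, and only the mapping to the shared concepts needs to be maintained.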
Separately, the group is working with IBM, BioVista and the Greek MoD to make the system more robust and refined. "I'd also like to develop this technology to work on a Grid-based architecture," says Theodoulidis. "That would, in many ways, be its ideal environment." It would also create the opportunity to develop further novel tools for analysing data, bringing order and clarity to chaos and confusion.