Decentralized Search Finds Results

This is the scenario that led to the launch of the IST-funded GRACE project. Its objective was to allow organisations to integrate multiple, varied content sources and to provide access to them all through a single, standardised interface. The result of the project, completed in April 2005, was the GRACE decentralised search and categorisation engine.

The GRACE prototype operates without a central database. Instead it uses the Grid, a network of many geographically separate computers that work together to solve complex problems, to index documents locally within each Grid node, using whichever processing resources are available. The resulting index is also stored locally, and allows queries to be processed on demand from any other node on the Grid.

The GRACE system is specifically designed to retrieve information from such distributed content sources. Retrieval is based on accessing unstructured, textual information, usually stored in a variety of document formats rather than within structured databases. Moreover, GRACE allows organisations to integrate internal content sources with external resources such as Web-based document repositories, databases and search engines.

“GRACE represents at least in part a move towards the semantic Grid,” says Professor Jawed Siddiqi of Sheffield Hallam University in the UK, one of the project partners. “As you know the Grid started out as an academic resource – GRACE aimed to see how the resources of the Grid could be applied to less scientific areas. The system could be used by academics, by librarians, in fact by any knowledge worker who wants to compile a reader on a topic.”

Based on natural-language processing

Siddiqi stresses that the great innovation of GRACE was the way it used very strong natural-language processing methods to harvest textual content from documents.
These natural-language processing methods supply the fundamental, unstructured content, which is then re-indexed into ‘knowledge domains’. These knowledge domains not only represent a complete virtualisation of multiple relevant content sources, but also incorporate the underlying semantics in related ontologies: the meanings and hierarchical relationships among terms and concepts in a domain.

“Our prototype is capable of accessing the data in typical file types such as Word and Excel for example,” says project coordinator Maurizio Cecchi of Telecom Italia in Rome. “The real limitation is not on file types, but in the fact that you need the lexical database for the processing. Also you need specific ontologies for particular industry or research areas.”

Using natural-language processing methods for such indexing is a computationally intensive task that requires significant resources, which is why GRACE is based on the distributed processing resources of the Grid. The Grid’s ability to execute such tasks with ease makes it ideal for the purpose. GRACE employs the Grid mainly for text normalisation and categorisation, the steps that demand the most processing resources.

GRACE has been designed to complement the Grid’s existing ‘database federation’ systems, which also index multiple information sources and provide a single point of access. However, these database federation systems are designed to deal with structured data. GRACE, by contrast, focuses on integrating unstructured textual information.

Multilingual and up to date

GRACE is a multilingual system: the prototype supports text processing in English, German and Italian, and partially in Swedish. Other languages can be added by integrating suitable lexical databases. It is also designed to keep information resources up to date.
Because GRACE checks the relevant content sources on a regular basis, its index is constantly monitored and kept current. Users have no need to query content sources repeatedly to ensure they have the latest information: it is integrated into the system’s knowledge domains automatically.

The GRACE test beds were located at the sites of two project partners, the Telecom Italia laboratories in Turin and the GL2006 Europe site in Milan. User testing was carried out by Stockholm University library and Stuttgart University library, as well as by the library staff at CERN. A demo of the GRACE toolkit is available online.

“We are now working on further improvements to the GRACE engine to take it closer to market,” says Cecchi. “We have in mind developing it to manage data on the corporate networks of large multinational companies, however there is a long way to go before we can use it for this purpose.”

Sheffield Hallam University is also taking the project results forward. It is using the results of GRACE as input to MATCH, a new IST project that commenced in January 2006. MATCH is focusing on how data-mining techniques can help clinicians combine data from clinical and genetic database resources to help with the diagnosis of patient symptoms.
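
For readers who want a concrete picture of the decentralised design described earlier, in which each node indexes its own documents locally and answers queries from any other node, the following is a minimal sketch in Python. It is not the GRACE implementation: the `Node` class, the `federated_search` function and the sample documents are hypothetical names invented for illustration, and plain whitespace tokenisation stands in for the natural-language processing and ontologies that GRACE actually relies on.

```python
from collections import defaultdict


class Node:
    """A hypothetical Grid node that holds its own documents and index."""

    def __init__(self, name, documents):
        self.name = name
        self.documents = documents      # doc_id -> text, stored on this node
        self.index = defaultdict(set)   # term -> ids of local documents
        self.build_index()

    def build_index(self):
        # Index only this node's documents; nothing goes to a central store.
        # Assumption: lower-casing and splitting on whitespace replaces the
        # real linguistic processing.
        for doc_id, text in self.documents.items():
            for term in text.lower().split():
                self.index[term].add(doc_id)

    def search(self, term):
        # Answer a query from the locally stored index.
        return sorted(self.index.get(term.lower(), set()))


def federated_search(nodes, term):
    """Query every node and merge the hits, tagging each with its node name."""
    hits = []
    for node in nodes:
        hits.extend((node.name, doc_id) for doc_id in node.search(term))
    return hits


# Illustrative data only.
nodes = [
    Node("node_a", {"a1": "Grid computing overview", "a2": "Library catalogue"}),
    Node("node_b", {"b1": "Semantic Grid ontologies"}),
]

print(federated_search(nodes, "grid"))
```

In this sketch a query reaches each node in turn and the caller merges the answers, so no single machine ever holds the whole index. A real deployment would also need to handle node discovery, ranking and failures, none of which is modelled here.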