Caltech Bioinformatics Experts Develop New Literature Search Engine

When it comes to finding a used book on the Internet, one merely needs to Google the title, and a few suitable items for sale will soon be just a click away. But for the biologist or medical researcher looking for information on how two nematode genes interrelate, in hopes of better understanding human disease, there is a clear need for a more focused search engine.
Bioinformatics experts from the California Institute of Technology are formally announcing today the Textpresso search engine, which they hope will revolutionize the way that genetic information is retrieved by researchers worldwide. The Textpresso search engine is specifically built to serve researchers who work on the small worm known as C. elegans, but the basic design should lead to the creation of new search engines for researchers who specialize in other intensively studied organisms.
In the current issue of the journal PLOS, published by the Public Library of Science, Caltech biology professor Paul Sternberg and his colleagues--research associate Hans-Michael Muller and bioinformatics specialist Eimear Kenny--write that the new "text-mining system for scientific literature" will be invaluable to specialists trying to cope with the vast amount of information now available on C. elegans. This information has grown rapidly in recent years due to the large-scale gene-sequencing initiative as well as the more traditional small-scale projects by individual researchers. As a result, the need for a way to scan the literature has become much more pressing.
"Textpresso gives me, as a researcher, more time to actually read papers because I don't have to skim papers anymore," says Sternberg, who leads the federally funded WormBase project, which has already put online the entire genome sequences of C. elegans and the closely related organism C. briggsae, as well as genes for some 20 other nematode species.
The four-year-old WormBase project, a linchpin in the worldwide effort to better understand how genes interrelate, also makes a host of other information freely available in addition to the 100.2 million base-pairs that make up the millimeter-long worm's genome. There are now 28,000 gene-disruption experiments in WormBase, along with 2 million DNA expression ("chip") microarray observations, as well as detailed information on the expression of more than 1,700 of the worm's 20,000 genes.
The Textpresso search engine is a logical product for the WormBase team to develop in the ongoing quest to put genetic information to work in curing and preventing human disease. Lest anyone assume that the genes of a millimeter-long nematode have little to do with humans, it should be pointed out that the two organisms are similar in about 40 percent of their genes. A very realistic motivation for funding the genome sequencing of the fruit fly, the small mustardlike plant known as Arabidopsis, the chimp, and various other species has been the expectation of finding underlying common mechanisms.
Thus, a cancer researcher who discovers that a certain gene is expressed in cancer cells can use WormBase to see if the gene exists in nematodes, and if so, what is known about the gene's function. And now that Textpresso is available, the researcher can do so much more efficiently.
"The idea is distilling down the information so it can be extracted easier," says Muller, the lead author of the paper and codeveloper of Textpresso with Kenny. The name of the search engine, in fact, comes from its resemblance to "espresso," which is a process used to get the caffeine and flavor out of coffee in a minimal volume.
According to Kenny, the search engine is designed with a special kind of search in mind, which establishes categories of terms and organizes them as an ontology--that is, a catalog of types of objects and concepts and their relationships.
For example, if a researcher wants to find out whether anyone else has worked on the relationship between the nematode gene called "lin-12" and the anchor cell, then typing the two terms into a conventional search engine like Google results in more than 400 hits. And if the researcher wants to know which genes are important in the anchor cell, the task is even more arduous. Textpresso is designed to get at that information in a simpler, more efficient way.
Textpresso is a text-processing system that splits research papers into sentences, and sentences into words or phrases. All words and phrases are labeled so that they are searchable, and the labels are then condensed into 33 ontological categories. So far, the database includes 4,420 scientific papers on C. elegans, as well as bibliographic information from WormBase, information on various scientific meetings, the "Worm Breeder's Gazette," and various other links and WormBase information. As a result, the engine already searches through millions of sentences, allowing researchers to find a paper or a piece of information of interest with great efficiency.
Finally, the Textpresso search engine should be a useful prototype for search engines to serve other biological databases--some of which have even larger piles of data for the specialist to cope with. "Yeast currently has 25,000 papers," Kenny says.
Textpresso can be accessed at www.textpresso.org or via WormBase at www.wormbase.org.
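For readers curious about how a sentence-level, category-aware search of the kind described above might work, the following is a minimal Python sketch. It is not Textpresso's actual code: the tiny lexicon, the category names ("gene," "cell," "regulation"), the sample sentences, and every function name are hypothetical stand-ins, and the real system uses 33 categories that are not listed here. The sketch splits each paper into sentences, labels each sentence with the categories of the terms it contains, and then returns only the sentences that match both a keyword and a category.

```python
import re

# Hypothetical mini-lexicon mapping terms to categories. The category names
# and terms are invented for illustration only.
LEXICON = {
    "lin-12": "gene",
    "lin-3": "gene",
    "anchor cell": "cell",
    "regulates": "regulation",
}


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def label_sentence(sentence):
    """Return the set of categories whose lexicon terms occur in the sentence.

    Terms must stand alone: 'lin-3' does not match inside 'lin-31'.
    """
    lowered = sentence.lower()
    labels = set()
    for term, category in LEXICON.items():
        pattern = r"(?<![\w-])" + re.escape(term) + r"(?![\w-])"
        if re.search(pattern, lowered):
            labels.add(category)
    return labels


def build_index(papers):
    """Turn {paper_id: full_text} into a list of (paper_id, sentence, labels)."""
    index = []
    for paper_id, text in papers.items():
        for sentence in split_sentences(text):
            index.append((paper_id, sentence, label_sentence(sentence)))
    return index


def search(index, keywords=(), categories=()):
    """Return sentences containing every keyword and every requested category."""
    wanted = set(categories)
    hits = []
    for paper_id, sentence, labels in index:
        lowered = sentence.lower()
        if all(k.lower() in lowered for k in keywords) and wanted <= labels:
            hits.append((paper_id, sentence))
    return hits


# Invented sample text, not taken from any real paper.
papers = {
    "paper-1": "The lin-12 gene is expressed in the anchor cell. "
               "Worms were grown at 20 degrees.",
    "paper-2": "We found that lin-3 regulates vulval induction. "
               "The anchor cell produces the signal.",
}

index = build_index(papers)
# "Which sentences mention the anchor cell together with any gene?"
hits = search(index, keywords=["anchor cell"], categories=["gene"])
for paper_id, sentence in hits:
    print(paper_id, "->", sentence)
```

Searching by category rather than by a specific gene name is what lets a query like "any gene mentioned alongside the anchor cell" return a handful of relevant sentences instead of hundreds of whole-document hits.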