Five research teams across Columbia use data science to solve societal problems

The Data Science Institute (DSI) at Columbia awarded Seeds Fund Grants to five research teams whose proposed projects will use state-of-the-art data science to solve seemingly intractable societal problems in the fields of cancer research, medical science, transportation and technology.

Each team will receive up to $100,000 for one year and be eligible for a second year of funding.

"In awarding these grants, the DSI review committee selected projects that brought together teams of scholars who aspire to push the state of the art in data science by conducting novel, ethical and socially beneficial research," said Jeannette M. Wing, Avanessians Director of the Data Science Institute. "The five winning teams combine data-science experts with domain experts who'll work together to transform several fields throughout the university."

The inaugural DSI Seeds Fund Program supports collaborations between faculty members and researchers from various disciplines, departments and schools throughout Columbia University. DSI received several excellent proposals from faculty, which shows a growing and enthusiastic interest in data-science research at Columbia, added Wing.

The seed program is just one of many initiatives that Wing has spearheaded since the summer of 2017, when she was named director of DSI, a world-leading institution in the field of data science. The other initiatives include founding a Post-doctoral Fellows Program, a Faculty Recruiting Program and an Undergraduate Research Program.

What follows are brief descriptions of the winning projects, along with the names and affiliations of the researchers on each team.

p(true): Distilling Truth by Community Rating of Claims on the Web:

The team: Nikolaus Kriegeskorte, Professor, Psychology and Director of Cognitive Imaging, Zuckerman Institute; Chris Wiggins, Associate Professor, Department of Applied Physics and Applied Mathematics, Columbia Engineering; Nima Mesgarani, Assistant Professor, Electrical Engineering Department, Columbia Engineering; Trenton Jerde, Lecturer, Applied Analytics Program, School of Professional Studies.

The social web is driven by feedback mechanisms, or "likes," which emotionalize the sharing culture and may contribute to the formation of echo chambers and political polarization, according to this team. In their p(true) project, the team will thus build a complementary mechanism for web-based sharing of reasoned judgments, so as to bring rationality to the social web.

Websites such as Wikipedia and Stack Overflow are surprisingly successful at providing a reliable representation of uncontroversial textbook knowledge, the team says. But the web doesn't work well in distilling the probability of contentious claims. The question the team seeks to answer is this: How can the web best be used to enable people to share their judgments and work together to find the truth?

The web gives users amazing power to communicate and collaborate, but users have yet to learn how to use that power to distill the truth, the team says. Web users can access content, share it with others, and give instant feedback on claims. But those actions end up boosting certain claims while blocking others, which amounts to a powerful mechanism of amplification and filtering. If the web is to help people think well together, then the mechanism that determines what information is amplified, the team maintains, should be based on rational judgment, rather than emotional responses as communicated by "likes" and other emoticons.

The team's goal is to build a website, called p(true), that enables people to rate and discuss claims (e.g., a New York Times headline) on the web. The team's method will enable the community to debunk falsehoods and lend support to solid claims. It's also a way to share reasoned judgments, pose questions to the community and start conversations.

Planetary Linguistics:

The team: David Kipping, Assistant Professor, Department of Astronomy; Michael Collins, Professor, Computer Science Department, Columbia Engineering.

In the last few years, the catalog of known exoplanets - planets orbiting other stars - has swelled from a few hundred to several thousand. With 200 billion stars in our galaxy alone and with the rise of online planet-hunting missions, an explosion of discovery of exoplanets seems imminent. Indeed, it has been argued that the cumulative number of known exoplanets doubles every 27 months, analogous to Moore's law, implying a million exoplanets by 2034, and a billion by 2057.
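
The two dates quoted above can be checked against the 27-month doubling time. The short sketch below is only an illustration (the function name is hypothetical, not the researchers'): it computes how long a count takes to grow by a given factor at a fixed doubling time, and uses only the figures quoted in the paragraph.

```python
import math

def years_to_grow(factor: float, doubling_months: float = 27.0) -> float:
    """Years needed for a count to grow by `factor` at a fixed doubling time."""
    doublings = math.log2(factor)           # how many doublings the factor requires
    return doublings * doubling_months / 12.0

# Going from a million to a billion exoplanets is a factor of 1,000.
span = years_to_grow(1_000)
print(f"{span:.1f} years")  # prints "22.4 years"
```

A factor of 1,000 takes just under ten doublings, or about 22.4 years, so a million planets in 2034 is consistent with a billion in roughly 2057.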

By studying the thousands of extrasolar planets that have been discovered in recent years, can researchers infer the rules by which planetary systems emerge and evolve? This team aims to answer that question by drawing upon an unusually interesting source: computational linguistics.

Understanding how planetary systems emerge must be done from afar and with limited information, according to the team. In that way, it's similar to attempting to learn an unknown language from snippets of conversation, say Kipping and Collins. Drawing upon this analogy, the two will explore whether the mathematical tools of computational linguistics can reveal the "grammatical" rules of planet formation. Their goals include building predictive models, much like the predictive text on a smartphone, that can help optimize the use of telescope resources. They also aim to uncover the rules and regularities in planetary systems, specifically through the application of grammar-induction methods used in computational linguistics.

The team maintains that the pre-existing tools of linguistics and information theory are ideally suited for unlocking the rules of planet formation. Using this framework, they'll consider each planet to be analogous to a word; each planetary system to be a sentence; and the galaxy to be a corpus of text from which they might blindly infer the grammatical rules. The basic thesis of their project is to see whether they can use the well-established mathematical tools of linguistics and information theory to improve our understanding of exoplanetary systems.
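
To make the planet-as-word analogy concrete, here is a minimal sketch of a bigram language model applied to planetary systems. It is a toy illustration, not the team's method: the planet labels ("rocky", "giant"), the three-system corpus and the function names are all assumptions invented for the example, and a bigram model is a much simpler technique than the grammar induction the team describes.

```python
from collections import Counter, defaultdict

START, END = "<s>", "</s>"  # sentence-boundary markers, as in text models

def train_bigram(systems):
    """Count planet-type transitions, reading each system outward from its star."""
    counts = defaultdict(Counter)
    for system in systems:
        tokens = [START, *system, END]
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts

def next_planet_probs(counts, prev):
    """Maximum-likelihood distribution over what follows a planet of type `prev`."""
    total = sum(counts[prev].values())
    return {tok: n / total for tok, n in counts[prev].items()}

# Hypothetical corpus: each "sentence" is one planetary system.
corpus = [
    ["rocky", "rocky", "giant"],
    ["rocky", "giant", "giant"],
    ["rocky", "rocky", "rocky"],
]
model = train_bigram(corpus)
print(next_planet_probs(model, "rocky"))
```

In this toy corpus, a rocky planet is followed by another rocky planet half the time, which is the kind of regularity a predictive model could use to suggest where an undetected planet might sit.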

The three central questions they hope to answer are: Is it possible to build predictive models that assign a probability distribution over different planetary-system architectures? Can those models predict the presence and properties of missing planets in planetary systems? And can they uncover the rules and regularities in planetary systems, specifically through the application of grammar-induction methods used in computational linguistics?

In short, the team says it intends to "speak planet."

A Game-Theoretical Framework for Modeling Strategic Interactions Between Autonomous and Human-Driven Vehicles:

The team: Xuan Sharon Di, Assistant Professor, Department of Civil Engineering and Engineering Mechanics, Columbia Engineering; Qiang Du, Professor, Applied Physics and Applied Mathematics, Columbia Engineering and Data Science Institute; Xi Chen, Associate Professor, Computer Science Department, Columbia Engineering; Eric Talley, Professor and Co-Director, Millstein Center for Global Markets and Corporate Ownership, Columbia Law School.

Autonomous vehicles, expected to arrive on roads over the course of the next decade, will connect with each other and with the surrounding environment by sending messages regarding timestamp, current location, speed and more.

Yet no one knows exactly how autonomous vehicles (AVs) and conventional human-driven vehicles (HVs) will co-exist and interact on roads. The vast majority of research has considered the engineering challenges from the perspective either of a single AV immersed in a world of human drivers or of a world in which only AVs are on the road. Much less research has focused on the transition path between these two scenarios. To fill that gap, this team will develop a game-theoretic framework to model the strategic interactions likely to arise between HVs and AVs.
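
As a rough illustration of what a game-theoretic model of such an interaction looks like, the sketch below finds the pure-strategy Nash equilibria of a two-player game between one AV and one HV at a merge. The payoff numbers, the action names and the function are all hypothetical, chosen only to show the idea; the article does not describe the team's actual formulation.

```python
from itertools import product

def pure_nash_equilibria(payoffs):
    """Return action pairs where neither player gains by deviating alone.

    `payoffs[(av_action, hv_action)]` is a tuple (AV payoff, HV payoff).
    """
    av_actions = sorted({a for a, _ in payoffs})
    hv_actions = sorted({h for _, h in payoffs})
    equilibria = []
    for a, h in product(av_actions, hv_actions):
        av_pay, hv_pay = payoffs[(a, h)]
        # The AV cannot do better by switching while the HV keeps its action.
        av_best = all(payoffs[(a2, h)][0] <= av_pay for a2 in av_actions)
        # The HV cannot do better by switching while the AV keeps its action.
        hv_best = all(payoffs[(a, h2)][1] <= hv_pay for h2 in hv_actions)
        if av_best and hv_best:
            equilibria.append((a, h))
    return equilibria

# Hypothetical merge game: both going risks a collision, both yielding wastes time.
merge_game = {
    ("go", "go"): (-10, -10),
    ("go", "yield"): (2, 0),
    ("yield", "go"): (0, 2),
    ("yield", "yield"): (-1, -1),
}
print(pure_nash_equilibria(merge_game))  # [('go', 'yield'), ('yield', 'go')]
```

With these made-up payoffs there are two stable outcomes, each with one vehicle going and the other yielding, which is why predicting how the two kinds of drivers coordinate is a modeling question in its own right.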

Along with exploring these technical matters, the team will address the ethical issues associated with autonomous vehicles. What decision should an AV make when confronted with an obstacle? If it swerves to miss the obstacle, it will hit five people, including an old woman and a child, whereas if it goes straight it will hit a car and injure five passengers. Algorithms are designed to solve such problems, but bias can creep in depending on how one frames the problem. The team will investigate how to design algorithms that account for such ethical dilemmas while eliminating bias.

Predicting Personalized Cancer Therapies Using Deep Probabilistic Modeling:

The team: David Blei, Professor, Statistics and Computer Science Department, Columbia Engineering; Raul Rabadan, Professor, Systems Biology and Biomedical Informatics, CUMC; Anna Lasorella, Associate Professor, Pathology and Cell Biology and Pediatrics, CUMC; Wesley Tansey, Postdoc Research Scientist, Department of Systems Biology, CUMC.

Precision medicine aims to find the right drug for the right patient at the right moment and with the proper dosage. Such accuracy is especially relevant in cancer treatments, where standard therapies elicit different responses from different patients.

Cancers are caused by the accumulation of alterations in the genome. Exact causes vary between different tumors, and no two tumors have the same set of alterations. One can envision that by mapping a comprehensive set of causes of specific tumors, one can provide patient-specific drug recommendations. And drugs targeting specific alterations in the genome provide some of the most successful therapies.

This team has identified new alterations targeted by drugs that are either currently approved or in clinical trials. Most tumors, however, lack specific targets and patient-specific therapies. A recent study of 2,600 patients at the M.D. Anderson Center, for instance, showed that genetic analysis permits only 6.4 percent of patients to be paired with a drug aimed specifically at the mutation deemed responsible. This team believes the low percentage highlights the need for new approaches that will match drugs to genomic profiles.

The team's goal, therefore, is to model, predict, and target therapeutic sensitivity and resistance of cancer. The predictions will be validated in cancer models designed in Dr. Lasorella's lab, which will enhance the accuracy of the effort. They will also integrate Bayesian modeling with variational inference and deep-learning methods, leveraging the expertise of two leading teams in computational genomics (Rabadan's group) and machine learning (Blei's group) across the Medical and Morningside Heights campuses.

They have sequenced RNA and DNA from large collections of tumors and have access to databases of genomic, transcriptomic and drug response profiles of more than 1,000 cancer-cell lines. They will use their expertise in molecular characterization of tumors and machine-learning techniques to model and explore these databases.

Enabling Small-Data Medical Research with Private Transferable Knowledge from Big Data:

The team: Roxana Geambasu, Associate Professor, Computer Science Department, Columbia Engineering; Nicholas Tatonetti, Herbert Irving Assistant Professor, Biomedical Informatics, CUMC; Daniel Hsu, Assistant Professor, Computer Science Department, Columbia Engineering.

Clinical data hold enormous promise to advance medical science. A major roadblock to more research in this area is the lack of infrastructural support for sharing large-scale clinical datasets across institutions. Virtually every hospital and clinic collects detailed medical records about its patients. For example, New York Presbyterian has developed a large-scale, longitudinal dataset, called the Clinical Data Warehouse, that contains clinical information from more than five million patients.

Such datasets are made available to researchers within their home institutions through the Institutional Review Board (IRB), but hospitals are usually wary of sharing data with other institutions. Their main concerns are privacy and the legal and public-relations implications of a potential data breach. At times, hospitals will permit the sharing of statistics from their data, but cross-institution IRB approvals generally take years to finalize and often end in rejection. Siloing these large-scale clinical datasets severely constrains the quality and pace of innovation in data-driven medical science and limits the scope and rigor of the research that can be conducted on them.

To overcome this challenge, the team is building an infrastructure system for sharing machine-learning models of large-scale, clinical datasets that will also preserve patients' privacy. The new system will enable medical researchers in small clinics or pharmaceutical companies to incorporate multitask feature models that are learned from big clinical datasets. The researchers, for example, could call upon New York Presbyterian's Clinical Data Warehouse and bootstrap their own machine-learning models on top of smaller clinical datasets. The multitask feature models, moreover, will protect the privacy of patient records in the large datasets through a rigorous method called differential privacy. The team anticipates that the system will vastly improve the pace of innovation in clinical-data research while alleviating privacy concerns.
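
For readers unfamiliar with differential privacy, the sketch below shows its most basic building block, the Laplace mechanism, applied to a simple patient count. This is a textbook illustration under stated assumptions, not the team's system: the article does not say which mechanism the team uses, and the function names, the count and the epsilon value here are hypothetical.

```python
import random

def laplace_noise(scale: float, rng: random.Random) -> float:
    """Zero-mean Laplace sample, drawn as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def private_count(true_count: int, epsilon: float, rng: random.Random) -> float:
    """Release a count with epsilon-differential privacy.

    Adding or removing one patient changes a count by at most 1, so the
    sensitivity is 1 and the Laplace noise scale is 1 / epsilon.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    return true_count + laplace_noise(1.0 / epsilon, rng)

# Hypothetical query: how many patients in a cohort have a given diagnosis.
rng = random.Random(0)  # seeded only so the example is repeatable
print(private_count(1200, epsilon=0.5, rng=rng))
```

A smaller epsilon adds more noise and gives stronger privacy; the released number stays close to the true count on average while masking the contribution of any single patient record.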