Data Science Institute at Columbia develops statistical method that makes better predictions

 DSI Professors Tian Zheng and Shaw-Hwa Lo with DOT officials.

A team of statisticians from the Data Science Institute (DSI) has received a $900,000 National Science Foundation grant to develop a statistical method that will help researchers who work with big data make better predictions.

The team's method establishes statistical foundations for measuring "predictivity," the ability of a researcher to make predictions based on big data. The novel approach allows researchers to compare their predictions to a theoretical baseline, which should help them make more accurate predictions. The method will also help statisticians and policy experts contend with complex social problems, for which big data sets are often difficult to assess.

The DSI team, led by DSI professors Shaw-Hwa Lo and Tian Zheng, is collaborating with the New York City Department of Transportation (DOT) on Vision Zero, an initiative to end traffic deaths in the city. DOT collects big data from collisions to analyze the multiple factors that relate to traffic crashes. The potential interactions between the variables and datasets are extremely complex, which led to DOT's interest in working with the DSI team and using its statistical approach.

Lo, a professor of statistics and an affiliate of DSI, said, "We are developing a statistical way to evaluate performance of prediction methods that will be of immense help to DOT. Our method will help DOT identify key combinations of factors and intervention measures to predict where and when crashes are likely to occur."

Statistics can be difficult for the common reader to understand, but in general terms the new method can identify the variable with the highest "predictivity" in large data sets, explained Lo. Current statistical models consider a large number of X variables for predicting a Y variable, and the goal is to select the small number of X variables that are most helpful for predicting Y. But that goal is difficult to reach if the X variables interact in complicated ways. The new method, however, identifies groups of X variables that, once combined, have a stronger ability to predict. Statisticians thus no longer need to apply techniques such as cross-validation with the Y variable to evaluate the predictive ability of X variables.
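To see why interacting X variables are hard to select one at a time, consider the minimal Python sketch below. It is not the DSI team's method or its predictivity measure. The made-up data set, the function name `partition_accuracy`, and the simple majority-vote score are all assumptions chosen for illustration. In the toy data, Y depends only on the combination of two X variables, and a third X variable is unrelated to Y.

```python
from collections import Counter, defaultdict
from itertools import combinations, product

# Toy data (an assumption for illustration): every combination of three
# binary X variables, repeated five times. Y is the exclusive-or of the
# first two X variables; the third X variable is unrelated to Y.
rows = list(product([0, 1], repeat=3)) * 5
labels = [x0 ^ x1 for x0, x1, _ in rows]


def partition_accuracy(rows, labels, subset):
    """Hypothetical score: in-sample accuracy of predicting the majority
    Y value within each cell defined by the X variables in `subset`."""
    cells = defaultdict(Counter)
    for row, y in zip(rows, labels):
        key = tuple(row[i] for i in subset)
        cells[key][y] += 1
    correct = sum(max(counts.values()) for counts in cells.values())
    return correct / len(rows)


# Score every single X variable and every pair of X variables.
scores = {
    subset: partition_accuracy(rows, labels, subset)
    for size in (1, 2)
    for subset in combinations(range(3), size)
}

for subset, score in scores.items():
    print(subset, score)
```

On this toy data, each X variable scores 0.5 on its own, no better than guessing, while the pair (0, 1) scores 1.0. A search that ranks variables one at a time would miss the pair entirely, which is the difficulty the paragraph above describes.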

The DSI team will use its new method to help DOT identify risk factors for dangerous roads. It is often difficult to identify the potential risk factors and interactions that lead to the specific crash characteristics of high-crash roadways. The new statistical method, however, will allow DOT to account for all traffic variables, leading to better traffic assessments and enhanced public safety.

"We are excited to collaborate with Professors Lo and Zheng and the Data Science Institute to explore new, innovative research in statistical learning through the analysis of large and diverse transportation and safety datasets," said Seth Hostetter, Director, Safety Analytics and Mapping for DOT. "This is an excellent opportunity to explore the complex interactions between the various risk factors associated with traffic safety that may provide insights that will help us accelerate our progress in achieving the traffic safety goals of Vision Zero."

Zheng, a professor of statistics at Columbia and associate director of education at DSI, said the statistics team is happy to support the work of DOT.

"We are thrilled to be collaborating with DOT on this important project," said Zheng. "Vision Zero aims to end traffic fatalities and we are delighted that DOT is using our new statistical method to further that noble goal."