THE LATEST
Chinese-built divide-and-conquer algorithm offers a promising route for big data analysis
- Written by: Tyler O'Neal, Staff Editor
We live in the era of big data. The huge volume of information we generate daily has major applications across science and technology, economics, and management. For example, more and more companies now collect, store, and analyze large-scale data sets from multiple sources to gain business insights or measure risk.
However, as Prof. Yong Zhou, one of the authors of a new study, notes: “Typically, these large or massive data sets cannot be processed with independent computers, which poses new challenges for traditional data analysis in terms of computational methods and statistical theory.”
Together with colleagues at the Chinese University of Hong Kong, Zhou, a professor at China’s East China Normal University, has developed a new algorithm that promises to address these computational problems.
He explains: “State-of-the-art numerical algorithms already exist, such as optimal subsampling algorithms and divide-and-conquer algorithms. In contrast to the optimal subsampling algorithm, which samples small-scale, informative data points, the divide-and-conquer algorithm divides large data sets randomly into sub-datasets and processes them separately on multiple machines. While the divide-and-conquer method is effective in using computational resources to provide a big data analysis, a robust and efficient meta-method is usually required when integrating the results.”
In this study, the researchers have focused on large-scale inference for a linear expectile regression model, which has wide applications in risk management. They propose a communication-efficient divide-and-conquer algorithm, in which the summary statistics from the sub-datasets are combined via a confidence distribution. Zhou explains: “This is a robust and efficient meta-method for integrating the results. More importantly, we studied the relationship between the number of machines and the sample size. We found that the requirement for the number of machines is a trade-off between statistical accuracy and computational efficiency.”
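To make the divide-and-combine idea concrete, here is a minimal, illustrative Python sketch: each "machine" fits a linear expectile regression on its own randomly assigned sub-dataset using iteratively reweighted least squares, and the sub-estimates are then simply averaged. This plain average is only a stand-in for the confidence-distribution combination that Zhou and colleagues actually propose, and the function names, expectile level, and simulated data below are assumptions made purely for illustration.

```python
# Illustrative toy, NOT the authors' algorithm: fit an expectile regression on
# each random sub-dataset via iteratively reweighted least squares (IRLS), then
# combine the sub-estimates with a naive average. The paper's method instead
# combines summary statistics through a confidence distribution.
import numpy as np

def fit_expectile(X, y, tau=0.5, n_iter=50, tol=1e-8):
    """Fit a linear expectile regression at level tau with IRLS."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]        # OLS starting point
    for _ in range(n_iter):
        resid = y - X @ beta
        w = np.where(resid >= 0, tau, 1.0 - tau)       # asymmetric squared-loss weights
        WX = X * w[:, None]
        beta_new = np.linalg.solve(X.T @ WX, WX.T @ y) # weighted least-squares step
        if np.max(np.abs(beta_new - beta)) < tol:
            beta = beta_new
            break
        beta = beta_new
    return beta

def divide_and_conquer_expectile(X, y, tau=0.5, n_machines=10, seed=0):
    """Randomly split the data, fit each block separately, average the results."""
    rng = np.random.default_rng(seed)
    blocks = np.array_split(rng.permutation(len(y)), n_machines)
    betas = [fit_expectile(X[b], y[b], tau) for b in blocks]
    return np.mean(betas, axis=0)                      # naive combiner (see note above)

# Tiny usage example on simulated data
rng = np.random.default_rng(1)
n, p = 100_000, 5
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.arange(1, p + 1, dtype=float) + rng.normal(size=n)
print(divide_and_conquer_expectile(X, y, tau=0.9, n_machines=20))
```

Because each block is fitted independently, the expensive step parallelizes across machines, and only the small per-block estimates need to be communicated; the trade-off Zhou describes is that splitting into too many blocks leaves each machine with too little data for an accurate sub-estimate.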
Zhou adds: “We believe the algorithm we have developed can significantly help to address the computational challenges arising from large-scale data.”