Coresets: big data with less data
Increasing digitalisation in society and science is producing massive amounts of data. The aim of this project was to develop efficient algorithms that compress data in such a way that the compressed data sets can still be used to train machine learning models with satisfactory accuracy.
Portrait / project description (completed research project)
The approach enhances existing mathematical optimisation techniques so that they can be applied to the complex tasks and models arising in modern machine learning. Connections with statistical learning theory were also examined in order to evaluate the predictive accuracy of the results. The work is partly theoretical: developing novel algorithms and mathematically analysing their properties and the accuracy of their results. At the same time, the algorithms were implemented and released as open-source software compatible with modern data science tools.
Science and society are generating huge amounts of data in a wide range of areas. Machine learning provides numerous techniques for identifying useful patterns as well as supporting and automating data-based decisions. However, the larger the data volume, the more difficult it is to efficiently solve the resulting computational tasks.
The project set out to develop novel algorithms for the efficient analysis of large data volumes. The objective is to summarise or compress the data such that machine learning models can be trained on the compressed data with minimal loss in accuracy. Since they are considerably smaller than the original data, the so-called coresets created during compression can be processed efficiently while preserving robustness and accuracy.
The findings will also allow research groups and companies that do not have large computer and data centres to keep pace more effectively with the rapid growth of data. Potential applications range from online recommendation services and robotics to the Internet of Things.
A key result of the project is a set of novel coreset constructions that are compatible with modern deep neural network models.
The central idea is to optimise over the weights associated with the different data points such that a model trained on the weighted data maximises prediction accuracy on the full data set. Rather than simply uniformly subsampling the data, which fails to properly capture edge cases and rare events, the optimised coresets of this project systematically summarise and adaptively sample the data set.
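The contrast between uniform subsampling and adaptive, weighted summarisation can be illustrated with a toy sketch (this is an illustration of the general principle, not the project's actual construction). Points are sampled with probability proportional to a simple "sensitivity" proxy, so that rare, outlying points are more likely to be kept, and inverse-probability weights make the weighted cost on the coreset an unbiased estimate of the cost on the full data set. All names and the sensitivity proxy here are assumptions chosen for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: one dense cluster plus a few rare "edge case" points
# that uniform subsampling would likely miss.
X = np.concatenate([rng.normal(0, 1, size=(990, 2)),
                    rng.normal(8, 0.5, size=(10, 2))])

def build_coreset(X, m, rng):
    # Sensitivity proxy: distance to the overall mean, so outlying
    # points receive a higher sampling probability.
    s = np.linalg.norm(X - X.mean(axis=0), axis=1) + 1e-9
    p = s / s.sum()
    idx = rng.choice(len(X), size=m, replace=True, p=p)
    # Inverse-probability weights keep the weighted cost unbiased.
    w = 1.0 / (m * p[idx])
    return X[idx], w

def cost(X, w, center):
    # Weighted sum of squared distances to a candidate centre.
    return float(np.sum(w * np.sum((X - center) ** 2, axis=1)))

C, w = build_coreset(X, 100, rng)
center = X.mean(axis=0)
full = cost(X, np.ones(len(X)), center)
approx = cost(C, w, center)
rel_err = abs(full - approx) / full  # small relative error
```

Here the coreset of 100 weighted points approximates the full-data cost of 1,000 points; the project's constructions go further by directly optimising the weights rather than sampling against a fixed proxy.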
The approaches enable training complex models online, even on nonstationary data streams (i.e., where the distribution of arriving examples changes over time, for example due to seasonal trends). They also provide highly effective means for active semi-supervised learning: given a large unlabelled data set, these methods can determine a small subset of points to label such that predictive accuracy is maximised when the labelling information is propagated using modern semi-supervised deep learning techniques.
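One classic way to select a small, representative subset of unlabelled points is greedy k-centre selection (farthest-first traversal), sketched below as a minimal illustration; this is a standard technique shown for intuition, not necessarily the selection rule used in the project. It repeatedly picks the point farthest from everything selected so far, which guarantees coverage of distinct regions of the data, unlike uniform sampling.

```python
import numpy as np

def farthest_first(X, k, start=0):
    # Greedy k-centre selection: repeatedly pick the point that is
    # farthest from the set of points selected so far.
    selected = [start]
    d = np.linalg.norm(X - X[start], axis=1)
    for _ in range(k - 1):
        nxt = int(np.argmax(d))
        selected.append(nxt)
        # Update each point's distance to its nearest selected point.
        d = np.minimum(d, np.linalg.norm(X - X[nxt], axis=1))
    return selected

# Three well-separated clusters; selecting k=3 points covers all of
# them, so labelling just these points is far more informative than
# labelling a uniform random triple.
rng = np.random.default_rng(1)
X = np.concatenate([rng.normal(c, 0.3, size=(50, 2)) for c in (0, 10, 20)])
chosen = farthest_first(X, 3)
clusters_covered = {i // 50 for i in chosen}
```

The chosen points can then be labelled and their labels propagated to the rest of the data by a semi-supervised learner.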
Scaling Up by Scaling Down: Big ML via Small Coresets