Efficient machine learning via summarizing large data sets


Increasing digitalisation in society and science is resulting in massive amounts of data. In the present project, we are developing efficient algorithms that allow data to be compressed in such a way that the compressed data sets can still be analysed with satisfactory accuracy.

Portrait / project description (ongoing research project)

Our approach enhances existing mathematical optimisation techniques so that they can be applied to the complex tasks and models arising in machine learning. We will also be examining connections with statistical learning theory in order to evaluate the predictive accuracy of the results. Our work is partly theoretical: we are developing novel algorithms and providing mathematical proofs of their properties and of the accuracy of their results. At the same time, we will implement the algorithms and release them as open-source software compatible both with the architectures of modern data centres and with mobile platforms.


Science and society are generating huge amounts of data in a wide range of areas. Machine learning provides numerous techniques for identifying useful patterns as well as supporting and automating data-based decisions. However, the larger the data volume, the more difficult it is to efficiently solve the resulting computational tasks.


We are developing novel algorithms for the efficient analysis of large data volumes. The objective is to summarise or compress the data in such a way that the accuracy of key statistical analyses and learning processes is only minimally reduced. The so-called coresets created during compression are considerably smaller than the original data, so they can be processed efficiently while preserving robustness and accuracy.
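To make the idea concrete, here is a minimal sketch of one classical coreset construction: a sensitivity-based importance-sampling coreset for the 1-mean (squared-distance) cost. This is a textbook illustration of the general principle, not the project's own algorithm; the function and variable names are illustrative assumptions.

```python
import numpy as np

def mean_coreset(X, m, rng=None):
    """Importance-sampling coreset for the 1-mean cost (illustrative sketch).

    Returns m weighted points (S, w) such that the weighted cost
    sum(w_i * ||s_i - q||^2) is an unbiased estimate of the full cost
    sum(||x - q||^2) for any query point q.
    """
    rng = np.random.default_rng(rng)
    n = len(X)
    mu = X.mean(axis=0)
    d2 = ((X - mu) ** 2).sum(axis=1)      # squared distance to the data mean
    # Standard upper bound on each point's sensitivity for the 1-mean cost:
    # outliers far from the mean get sampled more often.
    s = 1.0 / n + d2 / max(d2.sum(), 1e-12)
    p = s / s.sum()                        # sampling distribution over points
    idx = rng.choice(n, size=m, replace=True, p=p)
    w = 1.0 / (m * p[idx])                 # inverse-probability weights
    return X[idx], w

# Usage: a 500-point coreset stands in for 100,000 points.
X = np.random.default_rng(0).normal(size=(100_000, 5))
S, w = mean_coreset(X, m=500, rng=1)
q = np.ones(5)
full_cost = ((X - q) ** 2).sum()
coreset_cost = (w * ((S - q) ** 2).sum(axis=1)).sum()
```

The weighted coreset cost closely approximates the full cost at a small fraction of the memory and compute, which is the trade-off the project aims to formalise and exploit for far more complex learning tasks.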


Our findings will also allow research groups and companies that do not have giant computer and data centres to keep pace more effectively with the rapid growth of data. Potential applications range from online recommendation services and robotics to the Internet of Things.

Original title

Scaling Up by Scaling Down: Big ML via Small Coresets

Project leader

Prof. Andreas Krause, Departement Informatik, ETH Zürich



Further information on this content


Prof. Andreas Krause
Departement Informatik
ETH Zürich
Gebäude CAB
Universitätstrasse 6
8092 Zürich
