Programming language support for Big Data


Scala is one of the leading languages for data science platforms and tools. In this project, we will work on new programming language concepts to improve the clarity and ease of use of the language in this domain.

Portrait / project description (ongoing research project)

The project consists of several parts. One part deals with the fundamental data structures needed for database access, where we have to bridge a gap in scale: data structures in programming languages typically have only a few fields, whereas database records can have many hundreds of columns. We will try to solve this problem by extending the programming language so that more flexible data structures can be defined. Another part deals with optimisation: how can we generate efficient code for typical Big Data workloads? All work packages will flow into an application that demonstrates our approach to distributed data processing.
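The gap in scale between program data structures and database records can be illustrated with a minimal sketch in plain Scala. It is not the project's design: the names `Customer`, `Row`, `wideRow` and `toCustomer`, and the column names, are hypothetical and chosen only for illustration. The sketch contrasts a small, statically typed data structure with a wide record that has to fall back on an untyped map.

```scala
// Hypothetical illustration only; none of these names come from the project.

// A typical program-side data structure: a handful of statically typed fields.
final case class Customer(id: Long, name: String, country: String)

object RecordGap {
  // A wide database row modelled as a map from column name to value.
  // This scales to hundreds of columns, but column names and value types
  // are no longer checked by the compiler.
  type Row = Map[String, Any]

  // Build an example row with 300 filler columns plus the three we care about.
  def wideRow: Row =
    (1 to 300).map(i => s"col$i" -> (i: Any)).toMap ++
      Map[String, Any]("id" -> 42L, "name" -> "Ada", "country" -> "CH")

  // Project the few columns the program needs out of a wide row.
  // Returns None if a column is missing or has an unexpected type;
  // such mistakes only surface at run time.
  def toCustomer(row: Row): Option[Customer] =
    for {
      id      <- row.get("id").collect { case v: Long => v }
      name    <- row.get("name").collect { case v: String => v }
      country <- row.get("country").collect { case v: String => v }
    } yield Customer(id, name, country)

  def main(args: Array[String]): Unit =
    println(toCustomer(wideRow))
}
```

In this sketch the compiler checks the three fields of `Customer`, while a misspelled column name or a wrong value type in the map-based row is detected only when the program runs. More flexible, statically checked data structures of the kind the project aims for would close that gap.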


The Scala programming language has been under development at EPFL since 2003. Thanks to a variety of favourable attributes, Scala is the implementation language of a new generation of Big Data “frameworks” (software libraries) used by hundreds of thousands of developers worldwide. Spark, Flink, Scalding, Summingbird and Kafka are among the more popular frameworks written in Scala. Scala is also a popular query and programming language for working with these frameworks.


We want to improve the way programming languages and databases work together. The aim is not to integrate specific database features into a programming language, which would be infeasible anyway. Instead, following Scala’s philosophy of being a versatile language, we want to research ways to better express and export the fundamental programming abstractions (ways of formulating essential tasks) that are used in the interfaces between databases and programming languages.


If successful, this project will advance the state of the art in frameworks and tools for data science. Better embedding of query languages in general-purpose programming languages will provide stronger foundations on which to write complex Big Data applications. We also expect this work to provide better abstractions for structuring and building the next generation of complex distributed data engines. This in turn will lead to better tools for data science, making data scientists more productive and enabling better integration of data science into other information systems.

Original title

Programming Language Abstractions for Big Data

Project leader

Prof. Martin Odersky, Laboratoire de méthodes de programmation, EPFL



Further information on this content


Prof. Martin Odersky
Laboratoire de méthodes de programmation 1
Bâtiment INR
Station 14
1015 Lausanne
