Scala programming language: enabling big data analytics
Scala is one of the leading languages for data science platforms and tools. In this project, we worked on new programming language concepts to improve the clarity and ease of use of the language in this domain.
Portrait / project description (completed research project)
The project consisted of several parts. One part dealt with the fundamental data structures needed for database access. Here we had to bridge a gap in scale: data structures in programming languages typically have only a few fields, whereas database records can have many hundreds of columns. We addressed the problem by extending the programming language so that more flexible data structures can be defined. Another part of the work dealt with optimisation: how can we generate efficient code for typical Big Data workloads? All work packages flowed into an application that demonstrates our approach to distributed data processing.
The Scala programming language has been under development at EPFL since 2003. Thanks to a variety of favourable attributes, Scala is the implementation language of a new generation of Big Data “frameworks” (software libraries) used by hundreds of thousands of developers worldwide. Spark, Flink, Scalding, Summingbird and Kafka are the names of some of the more popular frameworks written in Scala. Scala is also a popular query and programming language for working with these frameworks.
We wanted to improve combinations of programming languages and databases. The aim was not to integrate specific database features in a programming language, which would be infeasible anyway. Instead, following Scala’s philosophy of being a versatile language, we wanted to research ways to better express and export fundamental programming abstractions (ways of formulating essential tasks) that are used in the interfaces between databases and programming languages.
This project will advance the state of the art in frameworks and tools for data science. Better embedding of query languages in general-purpose programming languages will provide stronger foundations on which to write complex Big Data applications. We also expect this work to provide better abstractions to structure and build the next generation of complex distributed data engines. This in turn will lead to better tools for data science, making data scientists more productive and enabling a better integration of data science in other information systems.
The project has achieved its main goal: integrate several new technologies into a coherent set of abstractions for interfacing with data, and validate their usefulness in open-source projects. The implementations of these abstractions are of sufficiently high quality to have been integrated in Scala 3, the next major version of Scala, which was released in July 2021. Like Scala 2, Scala 3 is meant to be a production-ready platform for major applications, not just a research language.
In particular, the project developed the following new concepts and techniques that were embedded in Scala 3:
- Records are supported by a new abstraction for programmatic structural types. In the previous version of Scala, structural types were always implemented using Java reflection, which made them unusable as representations of externally managed database rows. The new design and implementation allow developers to provide their own implementations of structural types. This is particularly relevant in a database context, where records at the programming-language level are represented as low-level byte blocks.
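As a hedged sketch of what programmatic structural types enable (the `Record` class and the field names here are illustrative, not the project's actual API): in Scala 3, a class extending `Selectable` can resolve structurally typed field accesses itself, for example by looking them up in a map rather than via Java reflection:

```scala
// Illustrative sketch: a record backed by a Map of column values.
class Record(fields: Map[String, Any]) extends Selectable:
  // Structural field accesses are routed through this method.
  def selectDynamic(name: String): Any = fields(name)

// The structural refinement lists the expected columns; the compiler
// checks accesses against it, while Record supplies the storage.
type Person = Record { val name: String; val age: Int }

val person = Record(Map("name" -> "Ada", "age" -> 36)).asInstanceOf[Person]

// person.name type-checks as String but is resolved via selectDynamic
println(person.name)
println(person.age)
```

The same mechanism could back records with low-level byte blocks instead of a `Map`, which is the database use case described above.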
- Serialisation is supported in a generic way through Scala 3's new typeclass derivation mechanism. Typeclasses are simply interfaces (in Scala: traits) with at least one type parameter. Typeclass derivation means that instances of such traits can be generated automatically by the compiler, based on the structure of the type argument.
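A sketch of how such derivation looks in practice, assuming a hypothetical `Show` typeclass (the serialisation traits actually used in the project may differ): the compiler generates an instance for any case class that declares `derives Show`, using the `Mirror` machinery to inspect the type's structure.

```scala
import scala.deriving.Mirror
import scala.compiletime.{erasedValue, summonInline}

// Hypothetical typeclass: a trait with one type parameter.
trait Show[T]:
  def show(t: T): String

object Show:
  given Show[Int] with
    def show(t: Int): String = t.toString
  given Show[String] with
    def show(t: String): String = s""""$t""""

  // Summon a Show instance for every element type of a tuple.
  inline def summonAll[T <: Tuple]: List[Show[?]] =
    inline erasedValue[T] match
      case _: EmptyTuple => Nil
      case _: (t *: ts)  => summonInline[Show[t]] :: summonAll[ts]

  // Called by the compiler for types declared with `derives Show`.
  inline given derived[T](using m: Mirror.ProductOf[T]): Show[T] =
    val instances = summonAll[m.MirroredElemTypes]
    new Show[T]:
      def show(t: T): String =
        t.asInstanceOf[Product].productIterator.toList
          .zip(instances)
          .map { case (f, s) => s.asInstanceOf[Show[Any]].show(f) }
          .mkString("(", ", ", ")")

case class User(name: String, age: Int) derives Show

val rendered = summon[Show[User]].show(User("Ada", 36))
```

A generic serialiser for database rows can be built the same way, replacing the string output with a binary encoding.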
- Scala 3’s metaprogramming facilities allow computation to be performed safely at compile time. Such computation can generate new types as well as expressions. Two principal elements provide useful foundations for representing database queries. First, match types allow types to be computed by pattern matching on the structure of scrutinee types; combined with recursive types, they provide powerful modelling capabilities via typelevel computation. Second, inline functions provide a common framework for evaluating expressions at compile time.
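Both elements can be sketched briefly (`Elem`, `Last`, and `power` are illustrative names, not standard library definitions): a match type computes a type by pattern matching on its argument, possibly recursively, and an inline function unfolds fully at compile time when its arguments are constants.

```scala
// Match type: compute the element type of a collection-like type.
type Elem[X] = X match
  case String      => Char
  case Array[t]    => t
  case Iterable[t] => t

// Recursive match type: the type of a tuple's last element.
type Last[T <: Tuple] = T match
  case x *: EmptyTuple => x
  case _ *: xs         => Last[xs]

// Inline function: power(x, 3) unfolds at compile time to x * x * x * 1.0.
inline def power(x: Double, inline n: Int): Double =
  inline if n == 0 then 1.0 else x * power(x, n - 1)

val c: Elem[String] = 'a'                  // Elem[String] reduces to Char
val i: Elem[List[Int]] = 1                 // Elem[List[Int]] reduces to Int
val b: Last[(String, Int, Boolean)] = true // reduces to Boolean
val p: Double = power(2.0, 3)
```

In a query setting, the same techniques can compute the result type of a projection or join from the types of its operands.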
- Inline functions alone already provide some metaprogramming functionality, and they do so in a very safe way. At other times, however, one needs more fine-grained control over code generation. This is provided in Scala 3 by a staging system based on quotes and splices. Quotes treat code as data; the data can have holes that are filled in by splices. A novelty of the system developed in Scala 3 is that it allows splices for both expressions and types.
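The shape of the mechanism can be sketched as follows (`assertPositive` is a hypothetical macro, shown as a definition only, since Scala 3 macros must be compiled before they are used): `'{ ... }` quotes code as an `Expr` value, and `${ ... }` splices an `Expr` back into quoted code.

```scala
import scala.quoted.*

// Hypothetical macro: checks its argument at run time,
// but assembles the checking code at compile time.
inline def assertPositive(inline x: Int): Int =
  ${ assertPositiveImpl('x) }  // splice the macro implementation's result

def assertPositiveImpl(x: Expr[Int])(using Quotes): Expr[Int] =
  '{                            // quote: build code as data
    val v: Int = ${ x }         // splice: fill the hole with the argument
    if v <= 0 then throw IllegalArgumentException("not positive")
    v
  }
```

A query compiler can use the same quote/splice discipline to assemble specialised data-processing code at compile time.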
- Lower-level, more detailed access to code trees is supported by a standard format for trees exposed to metaprogramming. The format is called TASTy, an acronym for Typed Abstract Syntax Trees. TASTy is technically a serialisation format, but it gives rise to internal structures of trees, symbols, and types in a natural way.
Programming Language Abstractions for Big Data