A learning storage structure for genetic information


The future of biomedical research is closely linked to the decoding of the genome. In particular, success depends on being able to store, analyse and logically link the genetic information contained in hundreds of thousands of samples.

Portrait / project description (ongoing research project)

Improved methods in biomedical research have made it possible to sequence the entire genome of an individual at low cost. We are developing new technical concepts for a software system that stores tens of thousands of such data sets in one place and makes them accessible for research and clinical applications. The system is based on what are known as genome graphs, a data structure that combines genetic sequence information with other relevant clinical or experimental data. New information can be added to genome graphs efficiently, and they combine low storage requirements with fast searching. Our research focuses primarily on reducing the amount of storage space required while retaining all the information and keeping access to it efficient.
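To make the idea of a genome graph more concrete, the following is a minimal sketch and not the project's actual implementation. It assumes a k-mer based graph (each node is a short DNA substring of length k, and two nodes are connected when they overlap by k-1 letters) in which every node carries the set of samples it occurs in. The class name `AnnotatedGenomeGraph`, its methods, the sample identifiers and the metadata fields are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


class AnnotatedGenomeGraph:
    """Toy k-mer graph (illustrative assumption, not the project's data structure)."""

    def __init__(self, k):
        self.k = k
        self.labels = defaultdict(set)  # k-mer -> ids of samples containing it
        self.metadata = {}              # sample id -> clinical/experimental data

    def _kmers(self, sequence):
        # Yield every substring of length k, in order.
        return (sequence[i:i + self.k]
                for i in range(len(sequence) - self.k + 1))

    def add_sample(self, sample_id, sequence, **metadata):
        # Adding a sample only touches the k-mers of that sample;
        # k-mers already in the graph are stored once and just gain a label.
        self.metadata[sample_id] = metadata
        for kmer in self._kmers(sequence):
            self.labels[kmer].add(sample_id)

    def successors(self, kmer):
        # Neighbouring nodes: stored k-mers that overlap this one by k-1 letters.
        return [kmer[1:] + base for base in "ACGT"
                if kmer[1:] + base in self.labels]

    def samples_containing(self, query):
        # Samples that contain every k-mer of the query sequence.
        hits = None
        for kmer in self._kmers(query):
            found = self.labels.get(kmer, set())
            hits = set(found) if hits is None else hits & found
            if not hits:
                return set()
        return hits or set()


graph = AnnotatedGenomeGraph(k=4)
graph.add_sample("s1", "ACGTACGA", tissue="blood")
graph.add_sample("s2", "ACGTTCGA", tissue="tumour")

shared = graph.samples_containing("ACGT")    # k-mer present in both samples
only_s1 = graph.samples_containing("CGTAC")  # k-mers present only in s1
```

In this toy version the k-mer shared by both samples is stored only once, which hints at why such graphs can save space when many similar genomes are combined. A production system would need compressed representations of both the graph and the labels, which this sketch does not attempt.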


Living organisms carry their construction plan in their cells in the form of DNA. To understand life processes and the causes of disease, it is necessary to read out this information, store it and compare it. The methods used for this are statistical in nature and only become informative if they incorporate genetic information from many thousands of samples. This in turn creates a need for low-cost storage and rapid data comparison.


The aim of this project is to develop a software system based on new technical concepts that can record the genetic information in tens of thousands of biological or medical samples and represent it efficiently. It should be possible to add new samples quickly and compare them with existing information. The storage system also has room for information about the origin of each sample and for other data relevant to research. Because it grows constantly, it can be described as a learning system.


Understanding the relationship between genetic information and biological characteristics requires comparing a broad spectrum of this information. One major challenge in this context is the correct and user-friendly storage of the enormous volumes of data required. To gain a better understanding of genetic diseases or cancer, for example, the DNA of as many patients as possible needs to be analysed. By providing a technical basis for this work, the software system we are developing is intended to make biomedical research more efficient.

Original title

Scalable Genome Graph Data Structures for Metagenomics and Genome Annotation

Project leaders

  • Prof. Gunnar Rätsch, Institut für Informationssysteme, ETH Zürich
  • Prof. Torsten Hoefler, Departement Informatik, ETH Zürich
  • Prof. Mario Stanke, Institut für Mathematik und Informatik, Universität Greifswald



Further information on this content


Prof. Gunnar Rätsch
Institut für Informationssysteme
ETH Zürich
Gebäude CAB
Universitätsstrasse 6
8092 Zürich
