Efficient analysis of genomic data
Technological advances in DNA sequencing are making it easier to decode the genomes of numerous organisms. The challenge for biologists is how to analyse this mass of variable-quality data efficiently and consistently.
Portrait / project description (completed research project)
First, the project will focus on developing tools capable of organising genomic data and deducing comparable biological elements from it, such as genes that are similar between different species. By drawing on different types of genomic data, these tools will make it possible to analyse more species, which is important for gaining a better understanding of the processes involved in the evolution of species. The second area of focus consists of developing new machine learning algorithms capable of identifying which of the tens of thousands of genes present in the genomes show the most interesting characteristics. Studying these genes in depth with the help of modelling methods will make it possible to understand their interactions and evolution.
Identifying the genes that are key to an organism’s development enables scientists to determine which genes relate to functions that are essential to the organism’s survival. In medicine, for example, it is vital to know whether a gene identified in a model organism such as a mouse has the same function in human beings. Answering questions of this kind requires complex computing methods and high-quality data. Such analyses are therefore restricted to a small number of organisms that have been studied in great depth, and they ignore the enormous quantity of poorer-quality data that is currently being generated.
This project aims to develop new computational approaches capable of processing genomic data of variable quality in order to compare the genomes of different organisms. Modelling the interactions between genes with the help of machine learning methods will make it possible to understand, for example, the evolution of groups of genes involved in metabolic processes.
The project falls squarely within the scope of Big Data, since it addresses the size, heterogeneity and quality of genomic data in biology. Its implications also extend beyond this single discipline, since approaches for managing and comparing data are essential in other fields, such as language analysis. Moreover, machine learning is a key component of computational sciences.
The project achieved two main results:
First, to efficiently and robustly process protein sequences derived from newly sequenced genomes, we developed OMAmer, a novel alignment-free protein family and subfamily classification method suited to phylogenomic databases with thousands of genomes. We also demonstrated the applicability of this approach to real-life problems in comparative genomics. Datasets of this scale are becoming ever more abundant thanks to new large-scale comparative genomics efforts, and we expect OMAmer and derived tools to play an important role in making them useful for answering biological questions.
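To give a feel for what "alignment-free" classification means, the following is a minimal sketch of the general idea: a query protein is assigned to the family with which it shares the most short subsequences (k-mers), without computing any alignment. This is an illustration only, not OMAmer's actual algorithm or API. The function names, the toy sequences, the choice of k = 3 and the simple vote count are all assumptions made for the example.

```python
from collections import Counter, defaultdict


def kmers(seq, k=3):
    """Return the set of overlapping k-mers in a protein sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_index(families, k=3):
    """Map each k-mer to the set of families whose reference sequences contain it."""
    index = defaultdict(set)
    for family, seqs in families.items():
        for seq in seqs:
            for kmer in kmers(seq, k):
                index[kmer].add(family)
    return index


def classify(query, index, k=3):
    """Return (best_family, shared_kmer_count), or (None, 0) if no k-mer matches."""
    votes = Counter()
    for kmer in kmers(query, k):
        # .get avoids adding unseen k-mers to the defaultdict
        for family in index.get(kmer, ()):
            votes[family] += 1
    if not votes:
        return None, 0
    return votes.most_common(1)[0]


# Hypothetical toy reference families (made-up sequences, for illustration only)
families = {
    "famA": ["MKTAYIAKQR", "MKTAYLAKQR"],
    "famB": ["GSHMLEDPVW", "GSHMLEDPAW"],
}
index = build_index(families)
print(classify("MKTAYIAKQK", index))  # query differs from famA by one residue
```

Because the lookup is a dictionary access per k-mer, the cost of classifying a query grows with the query's length rather than with the number of reference sequences, which is the property that makes approaches of this kind attractive for databases with thousands of genomes.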
Second, we made progress in the use of Big Data to identify subtle signals of co-evolution in biological sequences using cutting-edge artificial intelligence methods. Although this project focused on co-evolution, the machine learning approach that we developed can be applied to other molecular processes, such as the detection of selection. The next step is to move in this direction using the Selectome database (a resource for positive selection in vertebrate genomes), which we updated within the scope of this project.
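As a point of reference for what a co-evolution signal looks like, the sketch below computes the mutual information between two columns of a sequence alignment, a classical baseline statistic for covariation. It is not the machine learning approach developed in the project. The function names and the four-sequence toy alignment are assumptions chosen for illustration.

```python
from collections import Counter
from math import log2


def column(alignment, i):
    """Extract column i from a list of equal-length aligned sequences."""
    return [seq[i] for seq in alignment]


def mutual_information(col_a, col_b):
    """Mutual information (in bits) between two alignment columns."""
    n = len(col_a)
    count_a = Counter(col_a)
    count_b = Counter(col_b)
    count_ab = Counter(zip(col_a, col_b))
    mi = 0.0
    for (a, b), c in count_ab.items():
        # p(a,b) * log2( p(a,b) / (p(a) * p(b)) ), with probabilities as counts / n
        mi += (c / n) * log2(c * n / (count_a[a] * count_b[b]))
    return mi


# Hypothetical toy alignment: positions 0 and 1 always change together,
# while position 2 is fully conserved.
alignment = ["AKL", "AKL", "GEL", "GEL"]
print(mutual_information(column(alignment, 0), column(alignment, 1)))  # covarying pair
print(mutual_information(column(alignment, 0), column(alignment, 2)))  # conserved column
```

A high value indicates that two positions tend to change together, whereas a conserved column carries no covariation signal. On real data, simple statistics like this are confounded by shared ancestry and small sample sizes, which is one reason more sophisticated methods are needed to detect subtle signals.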
Efficient and accurate comparative genomics to make sense of high volume low quality data in biology