Distributed processing of DNA sequencing data on Azure HDS

Optimizing time and computing resource consumption across a petabyte of processed data


May 31, 2019 by Veranika Tsiareshchanka


Founded in 1946, the Institut National de la Recherche Agronomique (INRA) is the leading agricultural research institute in Europe and the second largest in the world in terms of the number of projects carried out by its researchers and the number of scientific publications. Its teams work in research areas that range from food quality and agricultural sustainability to the preservation of the environment, biodiversity and ecosystems. To carry out its missions, INRA uses state-of-the-art technologies.

Sequencing the intestinal microbiota

MetaGenoPolis brings together researchers, engineers, laboratory technicians, bioinformaticians, bio-analysts, statisticians, mathematicians, microbiologists and a physician. Using advanced metagenomic technologies, this INRA platform aims to understand how the intestinal microbiota, i.e. all the microorganisms (bacteria, archaea, viruses, fungi) found in the intestine, affects human and animal health.

MetaGenoPolis extracts microbial DNA from human stool samples and sequences it. Each sample yields 20 million short sequences that must then be assembled like a puzzle to reconstruct genes and genomes and, finally, to establish the sample's microbial profile (i.e. the microbial species present and their abundances).

“Our database now includes the sequencing results of almost 20,000 samples,” explains Nicolas Pons, research engineer at INRA and head of the MetaGenoPolis bioinformatics platform. “This represents a total of 1 petabyte of data that we must store and process locally, i.e. 1 million billion bytes. That’s considerable!”
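To put these figures in perspective, here is a rough back-of-the-envelope calculation. It uses only the rounded numbers quoted above (about 20,000 samples, 20 million sequences per sample, 1 petabyte in total); the function name and the resulting averages are illustrative assumptions, not figures published by MetaGenoPolis. The petabyte presumably also covers quality data and intermediate results, so the per-sequence value is an average storage footprint, not a sequence length.

```python
# Back-of-the-envelope estimates from the rounded figures quoted in the article.
SAMPLES = 20_000                # "almost 20,000 samples"
READS_PER_SAMPLE = 20_000_000   # "20 million short sequences" per sample
TOTAL_BYTES = 10**15            # 1 petabyte = 1 million billion bytes


def dataset_estimates(samples=SAMPLES,
                      reads_per_sample=READS_PER_SAMPLE,
                      total_bytes=TOTAL_BYTES):
    """Return order-of-magnitude averages for the whole collection."""
    total_reads = samples * reads_per_sample
    return {
        "total_reads": total_reads,
        "gigabytes_per_sample": total_bytes / samples / 10**9,
        "bytes_per_read": total_bytes / total_reads,
    }


if __name__ == "__main__":
    for name, value in dataset_estimates().items():
        print(f"{name}: {value:,.0f}")
```

Under these assumptions the collection holds on the order of 400 billion sequences, with an average footprint of about 50 GB per sample.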



From genes to microbial profile characterization: a very large amount of data to be processed

“To build microbial profiles for each individual, we rely on catalogues of genes and microbial species representative of ecosystems,” he continues. “In the human intestine alone, there are nearly 10 million genes.”

Depending on the study entrusted to MetaGenoPolis, the number of samples to be processed at the same time can reach several hundred or even several thousand.

Ensure reliability and optimize data processing

MetaGenoPolis therefore needs a digital infrastructure and storage solutions that are particularly reliable and suited to these complex operations. To meet this need, the platform relies on Microsoft Azure’s computing capabilities and on the ProActive solution developed by ActiveEon to orchestrate the processing of data from the analysis of microbiota samples. ProActive distributes the processing jobs in a way that optimizes not only execution time but also computing resource consumption.

“ProActive and Azure allow us to organize the different computing tasks on a cluster, our group of servers on the network, while providing us with a workflow engine that makes it easier for bio-analysts to implement certain processes by optimizing the workflow and the access to specific resources. It is completely adapted to our bioinformatics and biostatistics processing needs.”
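The general idea of distributing one task per sample over a bounded pool of workers can be sketched as follows. This is a minimal illustration using only the Python standard library; it does not use the ProActive API, and `profile_sample`, `run_batch`, the toy catalogue and the exact-match counting are all hypothetical stand-ins for a real pipeline, which would align sequences against the gene catalogue on cluster nodes.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def profile_sample(sample_id, reads, catalogue):
    """Hypothetical per-sample task: count reads that exactly match a catalogue entry."""
    counts = Counter(catalogue[read] for read in reads if read in catalogue)
    return sample_id, dict(counts)


def run_batch(samples, catalogue, max_workers=4):
    """Run one task per sample; max_workers caps how many run at the same time."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [
            pool.submit(profile_sample, sample_id, reads, catalogue)
            for sample_id, reads in samples.items()
        ]
        # Collect results once every submitted task has finished.
        return dict(future.result() for future in futures)


if __name__ == "__main__":
    # Toy data: sequence -> gene name (assumed for illustration only).
    catalogue = {"ACGT": "gene_A", "TTGA": "gene_B"}
    samples = {"s1": ["ACGT", "ACGT", "GGGG"], "s2": ["TTGA"]}
    print(run_batch(samples, catalogue, max_workers=2))
```

The `max_workers` parameter is where the trade-off between elapsed time and resource consumption appears: more workers shorten a batch of several hundred samples but occupy more machines. A production scheduler makes that decision across a whole cluster rather than inside a single process.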

Secure sensitive health data

In addition to the complexity of the tasks to be performed and the large amount of data to be processed, there is also the need for security. Here again, Microsoft Azure has a decisive advantage.

“We are working on very critical and sensitive data, treated as personal health data. This requires us to apply a high level of security to our servers. Azure guarantees us the confidentiality and security of the data thanks to the HDS (Health Data Hosting) certification.”

This certification has been mandatory since 2018 for the hosting of personal health data and covers the administration, use and backup of this data. Data security is all the more crucial for MetaGenoPolis because the platform is also a pre-industrial demonstrator that aims to establish partnerships with large companies. It is therefore essential to provide them with as many guarantees as possible.

“Our mission is to show that we are able to set up a platform at a quasi-industrial level for the study of the microbiota. We work with industrial partners, pharmaceutical companies and companies in the agri-food sector.”

Azure, a vector for new opportunities

Thanks to the Microsoft Azure cloud, ActiveEon allows MetaGenoPolis to store and secure its health data and to benefit from elastic computing capacity. These capabilities open up new horizons for the platform, giving it the opportunity to expand its activities and diversify the services it offers. For example, MetaGenoPolis could, in the near future, offer healthcare professionals a tailor-made analysis service to assist in diagnosis.

“With Azure, we could deploy an on-demand service to analyze the microbiota in the cloud. Tests are becoming more widely accessible and more and more start-ups are offering them. It is conceivable that a clinician could prescribe the microbiota analysis. We could give clinicians the benefit of our processing platform without their having to invest in a processing infrastructure of their own.”

In addition, the large number of Microsoft Azure data centers around the world will allow us to reach new geographical areas (Germany, Ireland, etc.) in the future.

The aggregation of data from different European research centres is now a major challenge. Making data sets available on the Azure cloud to a larger number of actors in each ecosystem helps accelerate collaborative research, the delivery of results and, above all, patient diagnosis.

To learn more about INRA’s research or Microsoft Azure cloud features, visit the following websites:

INRA MetaGenoPolis

Microsoft Azure

Find original version in French on Microsoft blog

Download INRA case study

