Machine learning approaches to protein generation

Posted in PhD proposals 2019 on March 4, 2019

Project: Machine learning approaches to protein generation

Laboratory: Synthetic Biology group (Institut Pasteur) and Statistical Physics and Inference for Biology group (ENS)

Affiliation: Institut Pasteur / Ecole Normale Superieure (ENS)
Address: 28 rue du Dr Roux 75015 Paris / 24 rue Lhomond, 75005 Paris
E-mail: /

Lab directors
Name: Rémi Monasson
Phone number: +33 144322589

Name: David Bikard
Phone number: +33 140613924

Subject Keywords: statistical inference, protein modelling and design, deep learning

Summary of lab’s interests: The Statistical Physics and Inference for Biology group in the Physics Department at Ecole Normale Superieure develops and applies analytical and computational approaches to model and analyze biological systems, in fields such as immunology, neuroscience, genomics, and ecology.

The Synthetic Biology Lab at Institut Pasteur works on engineering bacterial cells, with a focus on CRISPR-Cas systems, their biology as an immune system of bacteria, and their applications to modify genomes and control gene expression.

Project summary: Recombinant proteins are used in many medical and industrial applications, where it is often desirable to identify protein variants with modified properties such as improved stability, higher catalytic activity, or altered substrate preferences. The systematic exploration of protein variants is challenging because the space of all possible chains built from the 20 amino acids is extremely large, and because it is difficult to accurately predict a protein's fold and function from its sequence.
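To give a sense of scale, the number of distinct chains of length L built from 20 amino acids is 20^L. The short sketch below computes this count; the chain lengths used (10, 100, 300) are illustrative assumptions, not figures from the project.

```python
import math

NUM_AMINO_ACIDS = 20  # the 20 standard amino acids


def sequence_space_size(length: int) -> int:
    """Exact number of distinct amino-acid chains of the given length."""
    return NUM_AMINO_ACIDS ** length


def log10_sequence_space_size(length: int) -> float:
    """Base-10 logarithm of the count, easier to read for long chains."""
    return length * math.log10(NUM_AMINO_ACIDS)


if __name__ == "__main__":
    # Illustrative chain lengths (assumed for this example only).
    for length in (10, 100, 300):
        exponent = log10_sequence_space_size(length)
        print(f"length {length}: about 10^{exponent:.0f} possible sequences")
```

Even for a modest chain of 100 residues, the count is far beyond what exhaustive experimental or computational screening could cover, which is why models that propose promising candidates are needed.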

In this project we propose to explore novel modeling strategies to generate proteins never seen in nature, with functional properties that will be measured in the lab. Neural generative models, such as restricted Boltzmann machines (RBMs), generative adversarial networks (GANs), variational autoencoders (VAEs), and autoregressive models, have recently been shown to successfully model diverse kinds of high-dimensional data, including text, sound, and images, and to generate novel, realistic samples. Some of these models learn useful representations of the data in their latent space and can be conditioned to enable the controlled generation of samples with desired properties.
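As a rough illustration of one of these model families, the sketch below implements a toy restricted Boltzmann machine over protein sequences, with one categorical visible unit per position and binary hidden units, and performs Gibbs sampling steps. It is a minimal sketch under stated assumptions, not the model of the cited reference: the class name `ToyRBM`, the binary hidden units, the random untrained weights, and the example sequence are all hypothetical, and training is omitted entirely.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # one-letter codes, 20 standard residues
Q = len(AMINO_ACIDS)


def one_hot(seq: str) -> np.ndarray:
    """Encode a sequence as a (length, 20) one-hot matrix."""
    x = np.zeros((len(seq), Q))
    x[np.arange(len(seq)), [AMINO_ACIDS.index(a) for a in seq]] = 1.0
    return x


def decode(x: np.ndarray) -> str:
    """Map a (length, 20) one-hot matrix back to a sequence string."""
    return "".join(AMINO_ACIDS[i] for i in x.argmax(axis=1))


class ToyRBM:
    """Hypothetical toy RBM: categorical visible units, binary hidden units."""

    def __init__(self, length: int, n_hidden: int, rng: np.random.Generator):
        self.rng = rng
        # Untrained parameters (assumption): small random couplings, zero fields.
        self.W = 0.01 * rng.standard_normal((n_hidden, length, Q))
        self.b = np.zeros((length, Q))  # visible fields
        self.c = np.zeros(n_hidden)     # hidden biases

    def sample_hidden(self, v: np.ndarray) -> np.ndarray:
        """Sample each hidden unit from P(h = 1 | v) = sigmoid(input)."""
        act = np.einsum("hla,la->h", self.W, v) + self.c
        p = 1.0 / (1.0 + np.exp(-act))
        return (self.rng.random(p.shape) < p).astype(float)

    def sample_visible(self, h: np.ndarray) -> np.ndarray:
        """Sample each position from a softmax over the 20 residues given h."""
        logits = np.einsum("h,hla->la", h, self.W) + self.b
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        p = np.exp(logits)
        p /= p.sum(axis=1, keepdims=True)
        idx = [self.rng.choice(Q, p=row) for row in p]
        v = np.zeros_like(p)
        v[np.arange(len(idx)), idx] = 1.0
        return v

    def gibbs_step(self, v: np.ndarray) -> np.ndarray:
        """One alternating Gibbs update: visible -> hidden -> visible."""
        return self.sample_visible(self.sample_hidden(v))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    rbm = ToyRBM(length=8, n_hidden=4, rng=rng)
    v = one_hot("MKTAYIAK")  # arbitrary example sequence
    for _ in range(5):
        v = rbm.gibbs_step(v)
    print(decode(v))
```

Because the weights here are random, the sampled sequences are essentially noise. In a trained model, the same alternating sampling would draw sequences from the distribution learned from the data.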

We will extend and apply these approaches to generate new proteins with desired functionalities, based on homologous sequence data available in public databases. The functional properties of the generated proteins will be tested experimentally at the Institut Pasteur. The experimental work will be done either by the PhD student or in collaboration with experimental biologists, depending on the student's training and motivation.

Ref.: J. Tubiana, S. Cocco, R. Monasson, Learning protein constitutive motifs from sequence data, to appear in eLife (March 2019).

Interdisciplinary aspect of the project: The project combines modeling approaches to protein design from statistical physics and machine learning with experimental measurements of protein function. It is a joint PhD between Institut Pasteur (D.B., Synthetic Biology Lab) and Ecole Normale Superieure (R.M., Physics Laboratory).
Funding: none