Research Forum | Episode 4

Research Forum Brief | September 2024

A generative model of biology for in-silico experimentation and discovery



“EvoDiff is a discrete diffusion model trained on evolutionary-scale protein sequence data. By evolutionary scale, we mean that we train on sequences taken from across many different organisms and that perform many different functions.”

Kevin Yang, Senior Researcher, Microsoft Research New England

Transcript: Lightning Talk

A generative model of biology for in-silico experimentation and discovery

Kevin Yang, Senior Researcher, Microsoft Research New England

This talk discusses how deep learning is enabling us to generate novel and useful biomolecules, allowing researchers and practitioners to better understand biology.

Microsoft Research Forum, September 3, 2024

KEVIN YANG: Hi. I’m Kevin K. Yang, senior researcher at Microsoft Research, and I’ll be presenting on generative models of biology for in-silico experimentation and discovery.

Our mission in the Biomedical ML Group at MSR [Microsoft Research] is to develop AI systems that contribute to biomedical knowledge via generative design and interactive discovery across length scales from molecules to patients.

At the smallest scale, we model biomolecules such as nucleic acids and proteins. These molecules function within cells, which we model specifically in the context of understanding and treating cancer. Cells form tissues. We build generative models of histopathology images in order to improve diagnostics. Finally, we study genetics and biomarkers at the whole patient level to better understand health and disease.

Today, we’ll focus on the molecular level with our protein engineering work.

Proteins are the actuators of biology. Each of our cells contains 1 to 3 billion protein molecules at any given time. Proteins catalyze metabolic reactions, replicate DNA, respond to stimuli such as light and scent, provide structure to cells and organisms, and transport molecules within and between cells.

All of this functional diversity is encoded in just 20 chemical building blocks called amino acids. Proteins are sequences of dozens to thousands of amino acid residues. In nature, these sequences often fold into a three-dimensional structure, which then performs a cellular function.

A protein’s structure and function are completely determined by its amino acid sequence. Protein design seeks to generate the amino acid sequences of new proteins that perform useful and novel functions. For example, engineered proteins in laundry detergent help remove stains while other proteins are of great interest as gene editors.

My research focuses on training neural networks on the natural diversity of proteins in order to generate new protein sequences that hopefully encode new functions. Today, I’ll focus on the model called EvoDiff. This work was done in collaboration with Sarah Alamdari, Nitya Thakkar, Rianne van den Berg, Alex Lu, Nicolo Fusi, and Ava Amini.

EvoDiff is a discrete diffusion model trained on evolutionary-scale protein sequence data. By evolutionary scale, we mean that we train on sequences taken from across many different organisms and that perform many different functions.

Diffusion models were first popularized for generating images. During training, a diffusion model learns to remove noise added to a data point. In this case, we randomly mask some amino acid residues from a protein and train the model to predict the identities of the masked residues. After training, EvoDiff can generate new protein sequences, beginning from a sequence of all masks and decoding one position at a time.
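The mask-then-denoise training step and the all-masks-to-full-sequence decoding loop described above can be sketched in a few lines. This is a toy illustration only: the `toy_model` stand-in here samples residues uniformly at random, whereas EvoDiff would use its trained neural network, and all function names are hypothetical.

```python
# Toy sketch of masked discrete diffusion over protein sequences.
# All names are illustrative; the "model" is a random stand-in for a
# trained network.
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids
MASK = "#"                            # mask token

def corrupt(seq, n_mask, rng):
    """Training-time corruption: mask n_mask random positions.
    The model is trained to predict the original residue at each one."""
    positions = rng.sample(range(len(seq)), n_mask)
    noisy = list(seq)
    for p in positions:
        noisy[p] = MASK
    return "".join(noisy), positions

def toy_model(noisy_seq, position, rng):
    """Stand-in for the trained network: predicts one masked residue."""
    return rng.choice(AMINO_ACIDS)

def generate(length, rng):
    """Generation: start from all masks and decode one position at a time."""
    seq = [MASK] * length
    for pos in rng.sample(range(length), length):  # random decoding order
        seq[pos] = toy_model("".join(seq), pos, rng)
    return "".join(seq)

rng = random.Random(0)
protein = generate(12, rng)
print(protein)  # a 12-residue sequence with no remaining mask tokens
```

Each decoding step conditions on everything decoded so far, which is what lets a real model produce coherent sequences rather than independent draws per position.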

Here, we show an example and also show the predicted structure of the generated protein after each decoding step. We see that EvoDiff generates plausible and diverse proteins across a variety of lengths. We visualize the predictions using their predicted 3D structures. The structural prediction model also outputs a confidence metric called pLDDT [predicted local distance difference test], which ranges from 0-100. EvoDiff is able to generate sequences that are likely to fold into stable structures by this metric.

These sequences are also distinct from anything seen in nature.

Often, protein engineers want proteins that perform a similar function to a natural protein, or they want to produce a protein that performs the same function but has other desirable properties, such as stability. By conditioning EvoDiff with a family of related sequences, we can generate new proteins that are very different in sequence space to the natural proteins but are predicted to fold into similar three-dimensional structures. These may be good starting points for finding new functions or for discovering versions of a protein with desirable properties. Finally, EvoDiff can also generate a complete protein sequence conditioned on a desired functional motif.
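The family-conditioning idea above can be sketched as decoding a fully masked query while looking at an alignment of related sequences. Again, this is a hypothetical stand-in: the real model attends to the family with a trained network, whereas `column_model` here simply samples from the residues observed at each alignment column.

```python
# Sketch of family-conditioned generation: given aligned related sequences,
# decode a new family member one position at a time. The "model" is a
# random stand-in; all names are illustrative.
import random

MASK = "#"

def column_model(family, column, rng):
    """Stand-in prediction: sample from residues seen at this column."""
    observed = [seq[column] for seq in family]
    return rng.choice(observed)

def generate_from_family(family, rng):
    length = len(family[0])
    query = [MASK] * length
    for pos in rng.sample(range(length), length):  # random decoding order
        query[pos] = column_model(family, pos, rng)
    return "".join(query)

rng = random.Random(0)
family = ["MKTAY", "MKSAY", "MRTAF"]  # toy aligned family
new_seq = generate_from_family(family, rng)
print(new_seq)
```

A trained model generalizes far beyond this column-sampling baseline, which is how the generated sequences can be distant in sequence space while still folding into similar structures.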

Biological functions, including binding and catalysis, are often mediated by a small structural motif held in the correct orientation by a scaffold. One way to design new functional proteins is to hand design these functional motifs, and then to generate a scaffold that will position the residues of the motif in the desired orientation. Traditionally, this is done by designing the protein structure and then finding a protein sequence that will fold to the desired structure. Here, we specified a desired functional motif from a natural protein in green, then resampled the rest of the protein around it using EvoDiff.
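Motif scaffolding in sequence space, as described above, amounts to fixing the motif residues and resampling only the masked scaffold positions around them. A minimal sketch, again with a random stand-in for the trained model and hypothetical function names:

```python
# Sketch of motif scaffolding in sequence space: keep a functional motif
# fixed and decode only the surrounding scaffold positions. The "model"
# is a random stand-in for a trained network.
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
MASK = "#"

def scaffold_motif(motif, motif_start, total_length, rng):
    """Place the motif, mask everything else, then decode the scaffold."""
    assert motif_start + len(motif) <= total_length
    seq = [MASK] * total_length
    seq[motif_start:motif_start + len(motif)] = list(motif)  # fixed residues
    masked = [i for i, aa in enumerate(seq) if aa == MASK]
    for pos in rng.sample(masked, len(masked)):  # random decoding order
        seq[pos] = rng.choice(AMINO_ACIDS)  # stand-in for model prediction
    return "".join(seq)

rng = random.Random(0)
designed = scaffold_motif("HDELH", motif_start=4, total_length=16, rng=rng)
print(designed)
```

The motif residues never get masked or resampled, so they are guaranteed to appear verbatim in the output; the model's job is only to generate a scaffold consistent with them.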

The new protein sequence is predicted to maintain the functional orientation of the motif with high resolution, demonstrating that we can perform motif scaffolding entirely in sequence space. By training on an evolutionary-scale dataset of 40 million proteins with many different natural functions from many different organisms, EvoDiff is able to generate plausible and diverse sequences. In addition, we have demonstrated the ability to condition on evolutionarily related sequences or on desired functional motifs within a sequence.

Looking ahead, our next goal is to train generative models that allow finer-grained control of the desired function in the form of text or a chemical reaction. This sort of conditional protein design will expand the scope of applications for designed proteins in chemistry, biology, and medicine. Finally, generative models of proteins can be a building block for models of cells, tissues, and patients, as we seek to design and understand biology.

If you enjoyed this talk, please go read our preprint, or you can use the code in our GitHub to generate your own proteins. Thank you.