Intrinsically disordered regions play critical roles in protein function
Intrinsically disordered regions (IDRs) are widespread in proteins, making up around 40% of the human proteome, and carry out critical functions in signaling, protein-protein interactions, phase separation, and more. Despite this prevalence and functional importance, a systematic understanding of IDRs remains elusive. IDRs do not fold into a stable secondary or tertiary structure, enabling them to mediate functions distinct from those of structured regions. For example, some IDRs are essential to “hub” proteins, because the lack of a fixed structure lets them adapt their conformation to different interaction partners. As research on IDRs grows, biologists have come to increasingly appreciate their role in human disease: for example, mutations in IDRs have been implicated in neurological diseases and cancers.
IDRs are underserved by current bioinformatics resources
IDRs typically evolve rapidly and may have no detectable sequence homology even between closely related species. This creates methodological challenges for understanding IDR function computationally. Highly specific predictions of function can be produced for structured regions using universal resources that assume sequence conservation (like BLAST), but these methods are not applicable to IDRs, prompting calls for machine learning methods that can work for IDRs. One viable alternative strategy is to predict IDR function by identifying higher-order features of the sequences, which may be conserved over evolution even when the sequence itself is not.
For example, charge and hydrophobicity are critical to mitochondrial-targeting IDRs, as these features enable recognition by import mechanisms. However, such features are generally identified one experiment at a time, so our knowledge of the features important to IDRs is likely not comprehensive. Computational methods that produce hypotheses about novel features (e.g. those in uncharacterized IDRs) are therefore valuable for building a more universal understanding of IDRs.
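To make the idea of a higher-order feature concrete, here is a minimal Python sketch that summarizes a sequence by two bulk properties instead of its exact residue order. It is an illustration only, under stated assumptions: the function names are hypothetical, net charge is simplified to +1 for K/R and -1 for D/E (histidine and pH effects are ignored), and hydrophobicity is taken as the mean Kyte-Doolittle hydropathy value.

```python
# Kyte-Doolittle hydropathy scale (higher value = more hydrophobic).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

# Simplifying assumption: only K/R count as positive and D/E as negative.
POSITIVE = set("KR")
NEGATIVE = set("DE")


def net_charge_per_residue(seq: str) -> float:
    """Return (positive residues - negative residues) / sequence length."""
    if not seq:
        raise ValueError("sequence must be non-empty")
    seq = seq.upper()
    charge = sum((aa in POSITIVE) - (aa in NEGATIVE) for aa in seq)
    return charge / len(seq)


def mean_hydropathy(seq: str) -> float:
    """Return the mean Kyte-Doolittle value over the standard residues in seq."""
    values = [KYTE_DOOLITTLE[aa] for aa in seq.upper() if aa in KYTE_DOOLITTLE]
    if not values:
        raise ValueError("sequence has no standard amino acids")
    return sum(values) / len(values)


if __name__ == "__main__":
    # Made-up sequences for illustration; they are not real IDRs.
    for name, seq in [("example_a", "MLSRRALRLAKRSG"), ("example_b", "MLARKSLRAARLGS")]:
        print(name, net_charge_per_residue(seq), mean_hydropathy(seq))
```

Two sequences that align poorly residue by residue can still score similarly on summaries like these, which is the sense in which a feature may be conserved when the sequence is not.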
The first systematic hypothesis discovery resource for IDRs unbiased by prior knowledge of function
To help researchers understand which features in an IDR sequence are relevant to its function, we introduce the first systematic feature discovery method for IDRs, capable of uncovering features even without any prior knowledge or hypotheses of function. Our method, which we call “reverse homology”, is powered by a self-supervised neural network trained to predict whether IDRs evolved from the same common ancestor (i.e. whether they are homologues). This training task makes the neural network sensitive to evolutionarily conserved features, which tend to be important to function. By applying interpretation techniques, we can visualize the features in an input sequence that our neural network believes are likely to be conserved over evolution, directly generating hypotheses about residues and features that may be important to an IDR’s function.
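One common way to pose a "same ancestor or not" task is a contrastive objective: score a query against one true homologue and several non-homologues, then penalize the model when the homologue is not ranked first. The Python sketch below shows only that loss, and it is an assumption about how such a task could be set up, not a description of the method's actual implementation. The function name is hypothetical, and the embeddings are plain vectors standing in for the output of a neural network encoder that is not shown.

```python
import math
from typing import Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot-product similarity between two embedding vectors."""
    return sum(x * y for x, y in zip(a, b))


def homolog_contrastive_loss(
    query: Sequence[float],
    candidates: Sequence[Sequence[float]],
    positive_index: int,
) -> float:
    """Softmax cross-entropy over candidates.

    candidates[positive_index] is the true homologue of the query; all other
    candidates are treated as non-homologous negatives. The loss is low when
    the homologue has the highest similarity to the query.
    """
    scores = [dot(query, c) for c in candidates]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    m = max(scores)
    log_norm = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_norm - scores[positive_index]


if __name__ == "__main__":
    # Toy 2-D embeddings, invented for illustration.
    query = [1.0, 0.0]
    candidates = [[1.0, 0.0], [0.0, 1.0]]
    print(homolog_contrastive_loss(query, candidates, positive_index=0))
    print(homolog_contrastive_loss(query, candidates, positive_index=1))
```

Minimizing a loss of this kind rewards an encoder for keeping whatever information distinguishes homologues from unrelated sequences, which is why the learned features tend to track evolutionary conservation.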