{"id":995577,"date":"2024-01-05T08:07:40","date_gmt":"2024-01-05T16:07:40","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?post_type=msr-project&p=995577"},"modified":"2024-05-03T11:51:59","modified_gmt":"2024-05-03T18:51:59","slug":"afmr-scientific-discovery-and-innovation","status":"publish","type":"msr-project","link":"https:\/\/www.microsoft.com\/en-us\/research\/project\/afmr-scientific-discovery-and-innovation\/","title":{"rendered":"AFMR: Scientific Discovery and Innovation"},"content":{"rendered":"
\n\t
\n\t\t
\n\t\t\t<\/div>\n\t\t\n\t\t
\n\t\t\t\n\t\t\t
\n\t\t\t\t\n\t\t\t\t
\n\t\t\t\t\t\n\t\t\t\t\t
\n\t\t\t\t\t\t
\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t<\/span>\n\t\t\t\t\t\t\t\t\tAccelerating Foundation Models Research\t\t\t\t\t\t\t\t<\/a>\n\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\n\n

Scientific Discovery and Innovation<\/h1>\n\n\t\t\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t<\/div>\n\t\t<\/div>\n\t<\/div>\n<\/section>\n\n\n\n\n\n
\n

Academic research plays such an important role in advancing science, technology, culture, and society. This grant program helps ensure this community has access to the latest and leading AI models.<\/em><\/strong><\/p>\nBrad Smith, Vice Chair and President<\/cite><\/blockquote>\n\n\n\n

\n
<\/div>\n\n\n\n
\n
\"dark<\/figure>\n\n\n\n

AFMR Goal: Accelerate scientific discovery in natural sciences<\/h2>\n\n\n\n

via proactive knowledge discovery, hypothesis generation, and multiscale multimodal data generation<\/p>\n<\/div>\n\n\n\n

<\/div>\n<\/div>\n\n\n\n
\n\t\n\t
\n\t\t
\n\t\t\t
<\/div>\t\t<\/div>\n\t<\/div>\n\n\t<\/div>\n\n\n\n

These projects focus on using foundation models to enhance knowledge discovery and hypothesis generation across many different areas. They particularly leverage the ability of general models to make sense of the exponentially growing volume of scientific literature in astronomy, materials science, and neuroscience. These efforts include exploring domain-specific prompt engineering and specializing foundation models through fine-tuning using techniques such as Low-Rank Adaptation (LoRA). A series of proposals is dedicated to biomedical and life sciences research and innovation, including specialized models for drug discovery, genomics, protein engineering, and rare diseases. These proposals underscore the potential of foundation models to accelerate scientific discovery and innovation across many fields and disciplines.<\/p>\n\n\n\n

<\/div>\n\n\n\n\n\n

University of Texas at Arlington<\/strong>: Miao Yin (PI)<\/p>\n\n\n\n

Ion chromatography (IC) is a powerful analytical chemistry technique for selective, sensitive quantification of aqueous ions, with applications ranging from environmental monitoring to biopharma pipelines. However, intrinsically slow analysis times severely limit sample throughput. This project intends to develop an AI-based platform that accelerates IC by combining large datasets from historical runs with large foundation models. The models will be tailored to encode how system parameters (columns, eluents, and detectors) interact to influence separation performance, yielding predictive modeling engines on Microsoft Azure. Additionally, a tuning algorithm that incorporates feedback from analytical chemistry specialists will be developed to help ensure that the large IC foundation model predicts correctly. The anticipated broader impacts include bringing AI into ion chromatography practice across academic, manufacturing, and innovation areas, while providing students at MSI with interdisciplinary research opportunities that combine computer science and analytical chemistry.<\/p>\n\n\n\n\n\n

Georgia Institute of Technology<\/strong>: Yunan Luo (PI)<\/p>\n\n\n\n

This proposal aims to leverage foundation models, including large language models trained on natural language and protein sequences, to advance protein function prediction and optimization. Two key areas of focus are 1) protein function prediction – predicting the biological roles of natural proteins and 2) protein function optimization – predicting which sequence mutations are beneficial for enhancing the function of natural proteins.<\/p>\n\n\n\n\n\n

University College London<\/strong>: Bradley Love (PI)<\/p>\n\n\n\n

The project intends to use large language models (LLMs) to aid in accumulating and assimilating the vast scientific literature, especially in the field of neuroscience. The proposal aims to create BrainGPT, an AI tool for navigating and understanding large pools of data. The model will generate data patterns based on the scientific literature, assist in identifying anomalous findings, and offer insights for novel study designs. Additionally, the team intends to open source the models and training data for scientific scrutiny and improvement, fostering participation from the scientific community.<\/p>\n\n\n\n\n\n

New Mexico State University<\/strong>: Huiping Cao (PI)<\/p>\n\n\n\n

Data-driven machine learning (ML) models built on large amounts of data have achieved great success in many applications. However, they have been less successful in scientific domains. Scientific discovery and hypothesis generation depend largely on knowledge (commonsense knowledge and expert domain knowledge). Most of this knowledge is scattered across different sources and is rarely utilized in data-driven ML models. The development of ML models that can take both data and knowledge as input in the learning process is still in its infancy.<\/p>\n\n\n\n

Many scientific domains collect multi-modality data. However, there are no good benchmark multi-modal datasets for evaluating foundation models.<\/p>\n\n\n\n

This project will design and develop novel neural network models that extract domain knowledge, incorporate it into the learning framework, and account for multi-modality data, in order to improve learning accuracy and efficiency. The proposed methods will be applied to one scientific domain, animal sciences, to validate their usefulness and to generate a knowledge base and a multi-modality benchmark dataset.<\/p>\n\n\n\n\n\n

University of California, Los Angeles<\/strong>: Aditya Grover (PI)<\/p>\n\n\n\n

The project proposes to develop a few-shot machine learning model to learn and optimize multi-task deep learning surrogates across various scientific and engineering domains. The plan includes unsupervised pretraining on large unlabelled datasets, followed by fine-tuning and evaluation across multiple disciplines, including bioengineering, materials science, and mechanical design.<\/p>\n\n\n\n\n\n

Harvard University<\/strong>: Alyssa Goodman (PI)<\/p>\n\n\n\n

We aim to enhance human interaction with astronomy literature by utilizing the capabilities of Large Language Models (LLMs), particularly GPT-4. We employ in-context prompting techniques to expose the model to astronomy papers and build an astronomy-focused chat application that engages the broader community. On the research track, we want to explore the potential of foundation models to generate novel scientific hypotheses. Specifically, we use GPT-4 to construct an instruction set of scientific ideas for fine-tuning smaller models on this astronomy-specific downstream task. To assess their output\u2019s accuracy, feasibility, and creativity, we employ a hybrid evaluation strategy that combines human experts with GPT-4 instances acting as judges. Our research will illuminate a novel way of applying LLMs in the scientific arena.<\/p>\n\n\n\n\n\n

University of Toronto Scarborough<\/strong>: Oleksandr Voznyy (PI)<\/p>\n\n\n\n

The proposal aims to establish Large Language Model (LLM) agents for inorganic materials discovery by augmenting GPT-3.5 with external tools and databases. The team will develop new text representations for the 3D structures of inorganic materials in order to enable discovery of materials for applications like catalysts, batteries, and photovoltaics.<\/p>\n\n\n\n\n\n

Imperial College London<\/strong>: Aaron Zhao (PI)<\/p>\n\n\n\n

The proposal aims to enhance the understanding of complex genomic data by developing a novel machine learning framework. It suggests that a new mathematical formulation termed ‘hybrid graphs’ can improve gene expression prediction beyond the capabilities of current sequence-based approaches. The proposal also plans to construct new databases and theoretical frameworks geared towards genomic data, addressing a current gap in the field.<\/p>\n\n\n\n

Related paper:<\/strong><\/p>\n\n\n\n