Research Forum Brief | June 2024

AutoGen Update: Complex Tasks and Agents

Presented by Adam Fourney at Microsoft Research Forum, June 2024

Adam Fourney

“Agents are a very, very powerful abstraction over things like task decomposition, specialization, tool use, etc. Really, you think about which roles you need on your team, and you put together your team of agents, and you get them to talk to one another, and then you start making progress on your task.”

Adam Fourney, Principal Researcher, Microsoft Research AI Frontiers

Transcript: Lightning Talk

AutoGen Update: Complex Tasks and Agents

Adam Fourney, Principal Researcher, Microsoft Research AI Frontiers

Adam Fourney discusses the effectiveness of using multiple agents, working together, to complete complex multi-step tasks. He showcases their capability to outperform previous single-agent solutions on benchmarks like GAIA, using customizable arrangements of agents that collaborate, reason, and use tools to achieve complex outcomes.

Microsoft Research Forum, June 4, 2024

ADAM FOURNEY: Hello, my name is Adam Fourney, and today, I’ll be presenting our work on completing complex tasks with agents. And though I’m presenting, I’m sharing the contributions of many individuals as listed below. All right, so let’s just dive in.

So in this presentation, I’ll share our goal, which is to reliably accomplish long-running complex tasks using large foundational models. I’ll explain the bet that we’re taking on using multi-agent workflows as the platform or the vehicle to get us there, and I’ll share a little bit about our progress in using a four-agent workflow to achieve state-of-the-art performance on a recent benchmark.

So what exactly is a complex task? Well, if we take a look at the following example from the GAIA benchmark for General AI Assistants, it reads, “How many nonindigenous crocodiles were found in Florida from the years 2000 through 2020?” Well, to solve this task, we might begin by performing a search and discovering that the U.S. Geological Survey maintains an online database for nonindigenous aquatic species. If we access that resource, we can form an appropriate query, and we’ll get back results for two separate species. If we open the collection reports for each of those species, we’ll find that in one instance, five crocodiles were encountered, and in the other, just a single crocodile was encountered, giving a total of six separate encounters during those years. So this is an example of a complex task, and it has certain characteristics of tasks of this nature, which is that it benefits strongly from planning, acting, observing, and reflecting over multiple steps, where those steps are doing more than just generating tokens. Maybe they’re executing code. Maybe they’re using tools or interacting with the environment. And the observations they’re doing … they’re adding information that was previously unavailable. So these are the types of tasks that we’re interested in here. And as I mentioned before, we’re betting on using multi-agent workflows as the vehicle to get us there.

So why multi-agents? Well, first of all, the whole setup feels very agentic from, sort of, a first-principles point of view. The agents are reasoning, they’re acting, and then they’re observing the outcomes of their actions. So this is very natural. But more generally, agents are a very, very powerful abstraction over things like task decomposition, specialization, tool use, etc. Really, you think about which roles you need on your team, and you put together your team of agents, and you get them to talk to one another, and then you start making progress on your task. So to do all this, to build all this, we are producing a platform called AutoGen (opens in new tab), which is open source and available on GitHub. And I encourage you to check this out at the link below.
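As a rough illustration of this programming model, here is a minimal two-agent sketch in Python using the pyautogen package. It is a sketch only: the model name, API-key handling, and configuration keys are illustrative and follow the 0.2-era API, which may differ in newer releases.

```python
# Minimal AutoGen sketch: an assistant agent paired with a user-proxy
# agent that executes the code the assistant writes. Model name and
# config values below are illustrative placeholders.
import autogen

llm_config = {"config_list": [{"model": "gpt-4", "api_key": "YOUR_API_KEY"}]}

# The assistant plans, reasons, and writes code.
assistant = autogen.AssistantAgent(name="assistant", llm_config=llm_config)

# The user proxy runs the assistant's code locally and reports results back.
user_proxy = autogen.UserProxyAgent(
    name="user_proxy",
    human_input_mode="NEVER",  # fully automated conversation
    code_execution_config={"work_dir": "workspace", "use_docker": False},
)

# The two agents converse until the task is complete.
user_proxy.initiate_chat(
    assistant,
    message="How many nonindigenous crocodiles were found in Florida "
            "from 2000 through 2020?",
)
```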

All right, so now let’s talk about the progress we’ve been making using this approach. So if you recall that question about crocodiles from the beginning, that’s from the GAIA benchmark for General AI Assistants. And we put together four agents to work on these types of problems. It consists of a general assistant, a computer terminal that can run code or execute programs, a web surfer that can browse the internet, and an orchestrator to, sort of, organize and oversee their work. Now with that team of four agents, we were actually able to, in March, achieve the top results on the GAIA leaderboard for that benchmark by about 8 points. But what’s perhaps more exciting to us is that we were able to more than double the performance on the hardest set of questions, the Level 3 questions, which the authors of that work describe as questions for a perfect general assistant, requiring arbitrarily long sequences of actions, the use of any number of tools, and access to the world in general. So this is all very exciting, and I want to share a little bit more about what those agents are actually doing.

So this is the loop or the plan that they are following. So it begins with the question or the prompt, and then we produce a ledger, which is like a working memory that consists of given or verified facts; facts that we need to look up, for example, on the internet; facts that we need to derive, perhaps through computation; and educated guesses. Now these educated guesses turn out to be really important because they give the language models space to speculate in a constrained environment without some of the downstream negative effects of hallucination. So once we have that ledger, we assign the tasks to the independent agents, and then we go into this inner loop, where we ask first, are we done? If not, well, are we still making progress? As long as we’re making progress, we’ll go ahead and we’ll delegate the next step to the next agent. But if we’re not making progress, we’ll note that down. We might still delegate one other step, but if that stall occurs for three rounds, then we will actually go back, update the ledger, come up with a new set of assignments for the agents, and then start over.
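In rough pseudocode, that loop looks like the following. Only the control flow is taken from the description above; every callable is a placeholder for logic the real system implements with language models, and the stall threshold of three matches the talk.

```python
# Paraphrased sketch of the orchestrator loop; not the actual AutoGen code.
def orchestrate(task, agents, update_ledger, assign_tasks,
                is_done, making_progress, pick_next, max_stalls=3):
    # Ledger: given/verified facts, facts to look up, facts to derive,
    # and educated guesses.
    ledger = update_ledger(task, None)
    plan = assign_tasks(ledger, agents)
    stalls = 0
    while not is_done(ledger):
        if making_progress(ledger):
            stalls = 0
        else:
            stalls += 1
        if stalls >= max_stalls:
            # Stalled for three rounds: revise the ledger and re-plan.
            ledger = update_ledger(task, ledger)
            plan = assign_tasks(ledger, agents)
            stalls = 0
            continue
        agent = pick_next(plan, ledger)
        ledger = agent.step(ledger)  # act, observe, record the outcome
    return ledger
```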

All right, so this is the configuration that’s been working well for us, and it’s all I have time to share with you today. But I mentioned our goal, our bet, and our progress, and I want to conclude by sharing our plans for the future. So already we’re starting to tackle increasingly more complex benchmarks and real-world scenarios with this configuration. And we’re really excited about opportunities to introduce new agents that, for example, learn and self-improve with experience; that understand images and screenshots a little better for maybe more effective web surfing or use of interfaces; and that are maybe a bit more systematic about exploring that solution space. So rather than just updating that ledger and then restarting when they get stuck, they can be a bit more pragmatic about the strategies that they’re employing.
All right, well, thank you for your attention, and thank you for attending the Microsoft Research Forum, and we look forward to you joining us next time.

MatterGen: A Generative Model for Materials Design

Presented by Tian Xie at Microsoft Research Forum, June 2024

Tian Xie

“Materials design is the cornerstone of modern technology. Many of the challenges our society is facing today are bottlenecked by finding a good material. … If we can find a novel material that conducts lithium very well, it will be a key component for our next-generation battery technology. The same applies to many other domains.”

Tian Xie, Principal Research Manager, Microsoft Research AI for Science

Transcript: Lightning Talk

MatterGen: A Generative Model for Materials Design

Tian Xie, Principal Research Manager, Microsoft Research AI for Science

Tian Xie introduces MatterGen, a generative model that creates new inorganic materials based on a broad range of property conditions required by the application, aiming to shift the traditional paradigm of materials design with generative AI.

Microsoft Research Forum, June 4, 2024

TIAN XIE: Hello, everyone. My name is Tian, and I’m from Microsoft Research AI for Science. I’m excited to be here to share with you MatterGen, our latest model that brings generative AI to materials design.

Materials design is the cornerstone of modern technology. Many of the challenges our society is facing today are bottlenecked by finding a good material. For example, if we can find a novel material that conducts lithium very well, it will be a key component for our next-generation battery technology. The same applies to many other domains, like finding a novel material for solar cells, carbon capture, and quantum computers. Traditionally, materials design is conducted by search-based methods. We search through a list of candidates and gradually filter them using a list of design criteria for the application. Like for batteries, we need the materials to contain lithium, to be stable, to have a high lithium-ion conductivity, and each filtering step can be conducted using simulation-based methods or AI emulators. At the end, we get five to 10 candidates that we’re sending to the lab for experimental synthesis.

In MatterGen, we hope to rethink this process with generative AI. We’re aiming to directly generate materials given the design requirements for the target application, bypassing the process of searching through candidates. You can think of it as using text-to-image generative models like DALL-E to generate the images given a prompt rather than needing to search through the entire internet for images via a search engine. The core of MatterGen is a diffusion model specifically designed for materials. A material can be represented by its unit cell, the smallest repeating unit of the infinite periodic structure. It has three components: atom types, atom positions, and periodic lattice. We designed the forward process to corrupt all three components towards a random structure and then have a model to reverse this process to generate a novel material. Conceptually, it is similar to using a diffusion model for images, but we build a lot of inductive bias like equivariance and periodicity into the model because we’re operating in a sparse data regime, as in most scientific domains.

Given this diffusion architecture, we train the base model of MatterGen using the structure of all known stable materials. Once trained, we can generate novel, stable materials by sampling from the base model unconditionally. To generate the material given desired conditions, we further fine-tune this base model by adding conditions to each layer of the network using a ControlNet-style parameter-efficient fine-tuning approach. The condition can be anything like a specific chemistry, symmetry, or any target property. Once fine-tuned, the model can directly generate the materials given desired conditions. Since we use fine-tuning, we only need a small labeled dataset to generate the materials given the corresponding condition, which is actually very useful for the users because it’s usually computationally expensive to generate a property-labeled dataset for materials.

Here’s an example of how MatterGen generates novel materials in the strontium-vanadium-oxygen chemical system. It generates candidates with lower energy than two other competing methods: random structure search and substitution. The resulting structure looks very reasonable and is proven to be stable using computational methods. MatterGen also generates materials given desired magnetic, electronic, and mechanical properties. The most impressive result here is that we can shift the distribution of generated materials towards extreme values compared with the training property distribution. This is very significant because most materials design problems involve finding materials with extreme properties, like finding superhard materials or magnets with high magnetism, which is difficult to do with traditional search-based methods and is the key advantage of generative models.

Our major next step is to bring these generative AI–designed materials into real life, making real-world impact in a variety of domains like battery design, solar cell design, and carbon capture. One limitation is that we have only validated these AI-generated materials using computation. We’re working with experimental partners to synthesize them in the wet lab. It is a nontrivial process, but we keep improving our model, getting feedback from the experimentalists, and we are looking forward to a future where generative AI–designed materials can make real-world impact in a broad range of domains. Here’s a link to our paper in case you want to learn more about the details. We look forward to any comments and feedback that you might have. Thank you very much.


MatterGen: Designing materials with generative AI 

By Tian Xie

MatterGen, a model developed by Microsoft Research AI for Science, applies generative AI to materials design.

Why is this important?

Materials design is the cornerstone of modern technology. Many of the challenges our society is facing today are bottlenecked by scientists’ inability to find good materials that can unlock solutions. If we can find a novel material that conducts lithium ion extremely well, for example, it will be a key component of next-generation battery technology. The same applies to many other domains, like finding novel materials for solar cells, carbon capture, and quantum computers. 

Traditionally, materials design is conducted by search-based methods. We search through a list of candidates and gradually filter them down with a list of design requirements for the application. With batteries, for example, we need the material to contain lithium, to be stable, to have high lithium-ion conductivity, and so on. Each filtering step can be conducted using quantum mechanical simulations or AI emulators. Finally, we end up with 5-10 candidates that can be sent to the lab for experimental synthesis.
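As a toy sketch, that funnel might look like the following, where the three property checks stand in for quantum mechanical simulations or AI emulators (all three functions, and the cutoff value, are placeholders):

```python
# Toy sketch of the search-based screening funnel. The property
# functions are placeholders for simulations or AI emulators.
def screen(candidates, contains_lithium, is_stable, conductivity, top_k=10):
    shortlist = [m for m in candidates if contains_lithium(m)]
    shortlist = [m for m in shortlist if is_stable(m)]
    shortlist = [m for m in shortlist if conductivity(m) > 1e-3]  # S/cm, illustrative cutoff
    # Keep the handful of best candidates for experimental synthesis.
    return sorted(shortlist, key=conductivity, reverse=True)[:top_k]
```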

In MatterGen, we hope to rethink this process using generative AI. We aim to directly generate materials, given the design requirements for the target application, bypassing the tedious process of searching through a large number of candidates. You can think of it as using text-to-image generative models like DALL-E to generate images given a detailed prompt, rather than using a search engine to scour the entire Internet for specific images.

The core of MatterGen is a diffusion model specifically designed for materials. A material can be represented by its unit cell, the smallest repeating unit of the infinite periodic structure. It has three components: atom types, atom positions, and the periodic lattice. We design the forward process to corrupt all three components toward a random material, and then train a model to reverse the corruption process to generate novel materials. Conceptually, it is similar to a diffusion model for images, but we build a lot of inductive bias, like equivariance and periodicity, into the model because we operate in a sparse data regime, as in most scientific domains.
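A schematic PyTorch sketch of one forward (corruption) step is shown below. The specific noise processes are simplified assumptions for illustration, not MatterGen's actual formulation:

```python
import torch

def corrupt(atom_types, positions, lattice, t, num_types, beta_schedule):
    """Schematic forward-diffusion step for one crystal at noise level t.

    atom_types: (N,) long tensor of species; positions: (N, 3) fractional
    coordinates; lattice: (3, 3) cell matrix. All update rules are
    simplified stand-ins for the model's actual noise processes.
    """
    beta = beta_schedule(t)  # scalar in (0, 1)
    # Discrete corruption: resample each atom type uniformly with probability beta.
    resample = torch.rand(atom_types.shape) < beta
    random_types = torch.randint(0, num_types, atom_types.shape)
    noisy_types = torch.where(resample, random_types, atom_types)
    # Wrapped Gaussian noise on fractional coordinates preserves periodicity.
    noisy_positions = (positions + beta ** 0.5 * torch.randn_like(positions)) % 1.0
    # The lattice is interpolated toward random noise.
    noisy_lattice = (1 - beta) * lattice + beta ** 0.5 * torch.randn_like(lattice)
    return noisy_types, noisy_positions, noisy_lattice
```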

Given this diffusion architecture, we train the base model of MatterGen using the structure of all known stable materials. Once the model is trained, we can generate novel, stable materials by sampling from the base model unconditionally. 

To generate materials given the desired conditions, we further fine-tune this base model by adding conditions to each layer of the network, using a ControlNet-style parameter-efficient fine-tuning approach. The conditions can be anything, like a specific chemistry, symmetry, or any target property. Once fine-tuned, the model can directly generate materials given desired conditions. Since we use fine-tuning, we only need a small labeled material dataset to generate materials with the corresponding condition, which is very useful for users, because it is often computationally expensive to generate property labels for materials.
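The ControlNet-style idea can be sketched as follows: the base layer stays frozen, and a small, zero-initialized adapter injects the condition at each layer, so fine-tuning starts from the base model's behavior. Module names and shapes are assumptions for illustration:

```python
import torch.nn as nn

class ConditionedLayer(nn.Module):
    """Wrap a frozen base layer with a zero-initialized condition adapter.
    Schematic only; not MatterGen's actual implementation."""

    def __init__(self, base_layer, hidden_dim, cond_dim):
        super().__init__()
        self.base = base_layer
        for p in self.base.parameters():
            p.requires_grad = False  # base model stays frozen
        self.adapter = nn.Linear(cond_dim, hidden_dim)
        nn.init.zeros_(self.adapter.weight)  # starts as a no-op, ControlNet-style
        nn.init.zeros_(self.adapter.bias)

    def forward(self, h, cond):
        # The condition (e.g., a target property value) nudges each layer's output.
        return self.base(h) + self.adapter(cond)
```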

Here is an example of how MatterGen generates novel materials in the Sr-V-O (strontium-vanadium-oxygen) chemical system. It generates candidates with lower energy than two other competing methods: random structure search and substitution. The resulting structures look quite reasonable and are proven to be stable using computational methods. 

MatterGen can also generate materials given desired magnetic, electronic, and mechanical properties. The most impressive result here is that we can shift the distribution of generated materials toward extreme values compared with the training property distribution. This is very significant, because most materials design problems involve finding materials with extreme properties, such as superhard materials or magnets with high magnetism, which is difficult with traditional search-based methods. 

Our next major step is to use these generative AI–designed materials to make real-world impact in a variety of domains, such as battery design, solar-cell design, and carbon capture. One limitation is that we have only validated these AI-generated materials with computation. We are working with experimental partners to synthesize them in the lab. This is not a trivial process, but we will keep improving our models with feedback from the experimentalists. We look forward to a future where generative AI can disrupt the current materials design process and find revolutionary materials that can positively change everyone’s life.


Driving Industry Evolution: Exploring the Impact of Generative AI on Sector Transformation

Presented by Jiang Bian at Microsoft Research Forum, June 2024

Jiang Bian

There is “a substantial demand for advanced generative AI tailored to enhance core business operations. However, in our dialogues with strategic partners, we have identified crucial gaps in current generative AI capabilities versus the specific needs of industry applications. … Our research is crucial in addressing these limitations and amplifying the underappreciated potential of generative AI.”

Jiang Bian, Senior Principal Research Manager, Microsoft Research Asia

Transcript: Lightning Talk

Driving Industry Evolution: Exploring the Impact of Generative AI on Sector Transformation

Jiang Bian, Senior Principal Research Manager, Microsoft Research Asia

Jiang Bian discusses how generative AI transforms industries by bridging gaps between AI capabilities and sector needs. He showcases domain-specific foundation models and versatile AI agents, setting new industry standards.

Microsoft Research Forum, June 4, 2024

JIANG BIAN: Hello, everyone. My name is Jiang. Today, I’m excited to discuss the work we are undertaking at Microsoft Research Asia focusing on leveraging generative AI to drive transformation and evolution across various industries.

Our efforts are inspired by our unique co-innovation initiative with world-renowned partners from a few core sectors, including finance, manufacturing, energy, and so on. These collaborations have highlighted a substantial demand for advanced generative AI tailored to enhance core business operations. However, in our dialogues with strategic partners, we have identified crucial gaps in current generative AI capabilities versus the specific needs of industry applications. These include a too-narrow focus on human-like AI rather than critical industry applications, limitations in processing complex and noisy data, and concerns about reliability in complex decision-making scenarios. Our research is crucial in addressing these limitations and amplifying the underappreciated potential of generative AI in high-value sectors. We are focusing on two main approaches: developing domain-specific foundation models that enhance analytical and predictive capabilities or enable interactive and controllable simulations, and creating a versatile foundation-model-as-agent system for diverse industry decision-making tasks.

Our first project is transforming the way industrial data is analyzed and utilized. Facing diverse data formats like tabular, time series, and graph data from various sectors, we are employing Generative Data Learning to equip the large language model with a strong ability to interpret and process diverse data formats by transforming them into a unified, instruction-oriented language. With training over this diverse sector data for [numerous] tasks, this approach enables more intuitive data analytics and predictions across various industries. Initial experiments on typical classification and regression tasks over tabular data have shown that even a relatively small-scale model enhanced by our Generative Data Learning approach can outperform both general large language models and traditional models like tree ensembles, particularly in few-shot scenarios. This suggests significant potential for a single-model solution, with no extensive model training or fine-tuning, to explore industrial data intelligence, maybe with only a few examples.
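As an illustration of the idea, a tabular record might be serialized into an instruction-style training example along the following lines. The template wording and field names are assumptions, not the project's actual format:

```python
# Illustrative conversion of a tabular row into an instruction-style
# example. The template is an assumption; the project's real
# serialization format is not shown here.
def row_to_instruction(row, target_column):
    features = "; ".join(f"{k} = {v}" for k, v in row.items() if k != target_column)
    prompt = f"Given a record with {features}, predict the value of '{target_column}'."
    return {"instruction": prompt, "output": str(row[target_column])}

example = row_to_instruction(
    {"age": 41, "tenure_months": 18, "plan": "premium", "churned": "no"},
    target_column="churned",
)
# -> {'instruction': "Given a record with age = 41; tenure_months = 18;
#     plan = premium, predict the value of 'churned'.", 'output': 'no'}
```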

Our second project is exploring building foundation models over domain-specific data, and we focus on financial markets, given that their fundamental data is orders. We have developed a dual-level foundation model called the Large Market Model that uses transformers on both the order sequence, to model market dynamics, and the order-batch sequence, to align the market trend with control signals. The performance of financial market simulations based on this Large Market Model has been very promising. They have excelled in forecasting market trends, simulating extreme scenarios for stress tests, and detecting market manipulations efficiently.

Our third project focuses on creating a decision-making agent through knowledge-augmented generation and adaptive retrieval. This agent is essentially a trainable model that generates and extracts domain-specific knowledge, dynamically updating itself and retrieving the most appropriate knowledge to handle changing environments. This adaptive approach is particularly useful in many industry control applications, such as HVAC control with the goal of optimizing energy use while maintaining comfort. Deploying this agent in this scenario has shown it can outperform traditional reinforcement learning methods, saving significantly more energy, especially in unknown environments or when facing perturbations.
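A schematic version of such an agent loop, with every component a placeholder, might look like this:

```python
# Schematic loop for a knowledge-augmented decision agent (e.g., HVAC
# control). The agent, knowledge base, and environment objects are
# placeholders sketching the described behavior, not a real API.
def control_loop(agent, knowledge_base, env, horizon):
    obs = env.reset()
    for _ in range(horizon):
        # Adaptive retrieval: fetch the knowledge most relevant to the
        # current state (weather, occupancy, equipment status, ...).
        context = knowledge_base.retrieve(obs)
        action = agent.decide(obs, context)  # e.g., setpoints, fan speed
        next_obs, feedback = env.step(action)
        # Self-update: distill new domain knowledge from the outcome.
        knowledge_base.add(agent.extract_knowledge(obs, action, feedback))
        obs = next_obs
```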

In summary, at MSR Asia, we are committed to advancing the development of generative AI to catalyze industry evolution through innovative research and partnership. We will soon be sharing more details about these projects through upcoming papers and open-source initiatives. We invite you, especially our industry partners, to stay tuned and join us in driving these transformative efforts forward. Thank you.

“Foundation models, also known as large language models, possess immense potential across a variety of industries. Yet, some companies and organizations limit their use of these expansive AI models to niche areas, including intelligent customer service, chatbots, or text and image generation. In reality, these foundation models demonstrate robust abilities in reasoning, content creation, and generalization, making them exceptionally fit for high-stakes business tasks. These tasks range from making accurate predictions and forecasts to optimizing industrial control and complex decision-making and conducting intelligent and interactive industrial simulations.”

— Jiang Bian, Senior Principal Research Manager, Microsoft Research Asia

Maximizing the business value of foundation models in industry applications

By Jiang Bian

As the development of large AI models, also known as foundation models, progresses, companies and organizations are becoming increasingly excited about their potential for enhancing productivity. However, a significant trend has been observed: many industry practitioners focus heavily on the human-like qualities of AI, such as conversational abilities, writing skills, creativity, and perceptual capabilities. In deploying these large AI models, there is a tendency to prioritize applications in intelligent customer service, chatbots, and other so-called “human-like” functions. Unfortunately, this emphasis may restrict our comprehension and use of these potent models, hindering our ability to fully unleash their capabilities within various industries.

This limitation is not without reason. Incorporating foundation models into practical, production-oriented scenarios is still in its infancy, with few mature and widespread examples to follow. Viewing AI as a “production tool” is akin to possessing a tool before fully understanding its potential applications. Furthermore, humanity has rarely, if ever, encountered such a versatile yet uncertain tool that is not designed for specific tasks.

Additionally, the complexity and variety inherent in different industries require foundation models that move beyond traditional perceptions. This necessitates synchronized innovation in models at the industry level, enabling them to fully exploit the capabilities of foundation models across diverse industrial landscapes and to better align with AI applications. Instead of limiting AI to a “chat robot” role, we should broaden our perspective. Transforming industries in the AI era involves rethinking current business processes and frameworks, leading to collaborative models that can effortlessly integrate humans and foundation models.

Unlocking the boundless potential of foundation models in industry

Foundation models are endowed with broad capabilities in data representation, knowledge comprehension, and reasoning, allowing them to adjust seamlessly across various domains and scenarios, and swiftly adapt to new environments. Concurrently, digital platforms across industries have evolved, amassing substantial amounts of industry-specific data. This rich repository of knowledge and information positions foundation models to integrate effortlessly into industrial settings.

In practical terms, the advanced reasoning abilities of foundation models provide users with a deeper understanding of data. By extracting valuable insights from large datasets and identifying patterns and correlations, these models deliver more effective recommendations and deeper insights. This benefit is especially vital in industrial contexts, where prediction, decision-making, and simulation play crucial roles.

One of the standout features of foundation models is their exceptional ability to generalize. Before their advent, each industry scenario required specific data to train bespoke AI models, limiting scalability and hindering the full commercial exploitation of AI. Foundation models, with their access to a global pool of knowledge, markedly improve generalization. As a result, industries are freed from the necessity of developing unique models for every situation, overcoming a major limitation of traditional AI solutions.

Moreover, foundation models can work in tandem with generative AI to increase the accuracy, realism, and interactivity of industrial simulations and intelligent modeling, facilitating the creation of digital twins. These simulations and models aim to mimic and test real-world scenarios, which often involve complex roles and intricate environments. Traditional AI models may simplify real-world complexities or miss crucial extreme events, compromising the fidelity and authenticity of simulations. In contrast, generative large AI models, steeped in domain-specific knowledge, establish accurate mappings between specific data dimensions and real-world occurrences. This method allows for simulations that closely mirror reality, significantly aiding industrial forecasting and decision-making processes while maintaining adherence to industry standards.

In the industrial sector, tasks of paramount importance and commercial value include precise forecasting and control, efficient optimization of decisions, and complex duties associated with intelligent and interactive industrial simulations. These areas should be the primary focus for traditional industrial enterprises. Yet, when assessing existing foundation models like GPT and the actual needs within industrial domains, we uncover significant mismatches between the capabilities of these models and the real demands of industry. To bridge this gap and fully leverage their potential, several challenges must be addressed.

First, there is a notable absence of a universal framework capable of effectively extracting complex domain knowledge from diverse field data and using this knowledge to construct intelligent agents. Various domains contain rich and complex data, such as logistics companies dealing with customs information and cross-national policies, pharmaceutical industries with FDA drug review documents, and the legal industry with numerous regulations. Developing intelligent agents that are deeply rooted in domain knowledge calls for a more generalized framework. This framework should be proficient in extracting crucial domain knowledge, identifying hidden connections between data and knowledge, and managing this information efficiently.

Second, while foundation models are adept at generating textual content, their proficiency in processing and understanding structured data, like numerical or tabular information, is lacking. Industrial scenarios often involve structured data, such as health monitoring indicators, battery charge-discharge cycles, and financial transactions. Current large models are not specifically designed or optimized for processing such data, which complicates accurate prediction and classification tasks based on structured inputs.

Third, in practical applications, foundation models currently fall short in stability and reliability for decision-making. Critical industries like energy, logistics, finance, and healthcare require dependable decision-making for tasks such as optimizing logistics routes, controlling energy equipment, formulating investment strategies, and allocating medical resources. These tasks often involve numerous variables and constraints, especially under dynamic environmental changes. Foundation models have yet to fully adapt to these complex industrial tasks, making direct application challenging.

Lastly, there is a lack of insight into domain-specific foundational data, as well as methodologies and experience for developing domain-specific foundation models. Essential information in many specialized fields extends beyond mere text, incorporating unique data structures and semantic relationships. For example, transaction order information in the financial investment field or molecular structure details in the biopharmaceutical industry contain critical knowledge often embedded in such foundational data. A deeper, more nuanced analysis is required. Creating domain-specific foundation models grounded in this detailed understanding is crucial for effectively leveraging and unlocking the potential of data in these fields.

Constructing industry foundation models: harmonizing general knowledge and domain expertise

To expedite the adoption and application of foundation models in industry, we can concentrate on several pivotal areas.

First, we can harness rich and complex industrial domain data to construct a more versatile, efficient, and practical retrieval-augmented generation (RAG) framework. This framework is designed to adapt seamlessly to various vertical domains, extracting essential domain knowledge, uncovering hidden associations between data and knowledge, and effectively organizing and managing this wealth of information.

Figure 1. A more universal, efficient, and practical retrieval-augmented generation (RAG) framework based on foundation models.
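A minimal sketch of the retrieval-augmented generation pattern appears below. The embedding function, language model, and prompt template are illustrative placeholders:

```python
# Minimal RAG sketch: embed domain documents, retrieve the chunks most
# relevant to a query, and ground the model's answer in them. The embed
# and llm callables are placeholders.
import numpy as np

def build_index(chunks, embed):
    return np.stack([embed(c) for c in chunks])

def answer(query, index, chunks, embed, llm, k=3):
    q = embed(query)
    # Cosine similarity between the query and every chunk embedding.
    sims = index @ q / (np.linalg.norm(index, axis=1) * np.linalg.norm(q))
    top = [chunks[i] for i in np.argsort(-sims)[:k]]
    prompt = ("Answer using only the context below.\n\nContext:\n"
              + "\n---\n".join(top) + f"\n\nQuestion: {query}")
    return llm(prompt)
```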

Second, by carefully considering critical numerical data and the corresponding structured dependencies prevalent in industrial scenarios, we can design foundation models specifically optimized for industrial applications. These models effectively integrate general knowledge with domain-specific expertise derived from temporal or tabular data, thereby enabling more effective solutions for tasks such as prediction and classification within the industry.

Figure 2. From traditional industry AI solutions to industry foundation models integrating general and domain knowledge.

Another avenue we are actively exploring involves harnessing the potent generation, generalization, and transfer capabilities inherent in foundation models to elevate the quality and efficiency of industrial decision-making. We are pursuing two distinct paths: first, treating foundation models as intelligent agents, and second, leveraging foundation models to assist reinforcement-learning agents.

Treating foundation models as intelligent agents: By leveraging the pre-existing knowledge encoded in foundation models and integrating offline reinforcement learning, we can continuously acquire new domain-specific insights and fine-tune the models. This evolutionary process enhances the optimization and decision-making capabilities of foundation models, enabling them to prioritize industry-specific tasks.

Foundation models optimized for specific tasks can play a pivotal role across various industrial contexts. In formula racing, for example, these foundation models can optimize tire-maintenance strategies. By considering tire wear and repair costs, they determine the optimal pit stop timing, thereby shortening race duration and improving car rankings. In chemical manufacturing, leveraging these foundation models can significantly enhance efficiency in product storage and pipeline coordination during production processes, ultimately boosting overall production-execution efficiency. Furthermore, due to their generalization capabilities and robustness, foundation models can be swiftly adapted to optimize air conditioning control, ensuring comfortable temperatures while minimizing energy consumption.

Figure 3. Foundation models and offline reinforcement learning are being synergized to construct decision-making agents.

Assisting reinforcement learning agents with foundation models: We can empower models to acquire universal representations that rapidly adapt to diverse environments and tasks, thereby enhancing their generalization capabilities. In this approach, we introduce a pre-trained world model that emulates human learning and decision-making processes, ultimately bolstering industrial decision-making. By harnessing a pre-trained world model with extensive knowledge and adopting a two-stage pre-training framework, developers can comprehensively and flexibly train foundation models for industrial decision-making, extending their applicability to any specific decision scenario.

We partnered with the Microsoft Xbox team to rigorously validate the effectiveness of our framework in game-testing scenarios. By harnessing this framework, we pre-trained a specialized world model tailored for game maps. This model directly tackles the challenge of long-term spatial reasoning and navigation, leveraging landmark observations within novel game environments. The results were remarkable: our pre-trained model significantly outperformed counterparts that lacked a world model or relied on traditional learning methods. As a result, game exploration efficiency was greatly enhanced.

Moreover, we can harness domain-specific foundational data and the precise semantic information it encapsulates to develop foundation models within the domain, thereby unlocking novel opportunities for intelligent, interactive decision-making, and simulation. For example, by analyzing transactional data from financial markets, we can construct robust investment models. These foundational datasets extend beyond mere textual characters; they embody intricate semantic structures and valuable information. Leveraging this financial foundation model, we can generate customized order flows for various market styles, simulate large-scale order transactions across diverse market environments, and conduct controlled experiments in the financial investment landscape. This approach empowers us to gain deeper insights into market fluctuations and devise strategies for extreme scenarios.

Figure 4. Leveraging financial foundation models to implement order flow generation for different market styles, thereby simulating diverse market environments.

Foundation models propel the next industrial digital transformation

Microsoft Research Asia has long recognized that the widespread adoption of AI in industry necessitates continuous technological exploration, experimentation, and breakthroughs. Through collaborative efforts with partners across various industries, we have developed open-source models, including the Qlib AI quantitative investment platform, the MARO multi-agent resource optimization platform, the FOST spatial-temporal prediction tool, and the BatteryML battery performance analysis and prediction platform. These industry-oriented AI platforms, tools, and models not only play a pivotal role in industry but also serve as critical data and foundational components for implementing cutting-edge foundation models.

Building upon successful experiences in industrializing AI, we have embarked on the exploration of domain-specific foundation models tailored for industry, drawing from the dimensions previously discussed. Our findings reveal that these foundation models possess significant potential to diverge from conventional large-scale model paradigms and profoundly impact industrial transformation.

Envision a future where foundation models empower knowledge management, extraction, and iterative processes across industries. Furthermore, we are actively investigating how foundation models can support companies in achieving automated research and development (R&D). This encompasses tasks such as automatically identifying R&D directions, generating algorithmic research proposals, automating R&D processes and scientific experiments, and iteratively refining research approaches. In essence, AI will autonomously propel data-centric industrial R&D, fundamentally revolutionizing industry operations.

Figure 5. R&D agent: Automatically evolve the R&D cycle centered on industrial data.

Foundation models are poised to become the driving force behind industrial digital transformation, mirroring the transformative impact of the internet and cloud computing. These models are set to unleash a new wave of industrial innovation. We eagerly anticipate collaborating with additional industry partners, immersing ourselves in real-world scenarios, and exploring diverse applications for foundation models within the industrial landscape, thereby fully unlocking their commercial potential.


Author

Dr. Jiang Bian currently serves as a senior principal research manager at Microsoft Research Asia. He leads the Machine Learning Group and the Industry Innovation Center at Microsoft Research Asia.

His team’s research spans deep learning, reinforcement learning, and privacy computing, with a focus on cutting-edge applications of AI in vertical domains such as finance, energy, logistics, manufacturing, healthcare, and sustainable development.

Dr. Jiang Bian has authored over a hundred research papers published in top-tier international conferences and journals. Also, he holds several U.S. patents. Dr. Jiang actively contributes to the academic community by serving on program committees for various prestigious international conferences and acting as a reviewer for leading international journals. In recent years, Dr. Jiang’s team has made significant strides in applying AI-based prediction and optimization techniques to critical scenarios across diverse fields, such as finance, logistics, and healthcare. Furthermore, they have generously shared relevant technologies and frameworks with the open-source community.

Dr. Jiang Bian completed his undergraduate studies at Peking University, earning a bachelor’s degree in computer science. He then pursued further studies at the Georgia Institute of Technology in the United States, where he obtained his Ph.D. in computer science.

Insights into the Challenges and Opportunities of Large Multi-Modal Models for Blind and Low Vision Users: A Case Study on CLIP

Presented by Daniela Massiceti at Microsoft Research Forum, June 2024

Daniela Massiceti

“Today’s AI models hold incredible potential for assisting the blind community—from text recognition to object identification to question answering. Apps like Seeing AI are already deploying some of these AI features. But there is potential for much more.”

Daniela Massiceti, Senior Researcher, Microsoft Research Cambridge

Transcript: Lightning Talk

Insights into the Challenges and Opportunities of Large Multi-Modal Models for Blind and Low Vision Users: A Case Study on CLIP

Daniela Massiceti, Senior Researcher, Microsoft Research Cambridge

Daniela Massiceti delves into the transformative potential of multimodal models such as CLIP for assistive technologies. Specifically focusing on the blind/low-vision community, the talk explores the current distance from realizing this potential and the advancements needed to bridge this gap.

Microsoft Research Forum, June 4, 2024

DANIELA MASSICETI: Hi there. My name is Daniela Massiceti, and I’m a senior researcher at Microsoft Research Cambridge. Today, I will be sharing our recent CVPR paper, which examines the challenges and opportunities of large multi-modal models for blind and low-vision users.

Today’s AI models hold incredible potential for assisting the blind community—from text recognition to object identification to question answering. Apps like Seeing AI are already deploying some of these AI features. But there is potential for much more. And I think this is hinted at by the recent partnership between OpenAI and Be My Eyes, with the promise that one day, human assistance could be replaced by AI agents that provide instantaneous assistance to blind users around the world. But despite their potential, no works have really looked at, well, how well do these models actually work on image and text data captured by blind users? And we know from the literature that this data is likely to be out of distribution or different in a number of ways. For example, blind users use a range of quite specialized assistive objects. They also are more likely to capture images with quality variation, things like camera blur and occlusion. And they’re also more likely to make use of non-visual vocabulary, for example, describing their objects by their physical rather than their visual properties.

Our work, therefore, set out to remedy this. Specifically, we systematically evaluated 25 variants of the CLIP model on data from blind and low-vision users. CLIP is one of today’s most widely used multi-modal models. It has over 15,000 citations and 75 million downloads. We used the ORBIT and the VizWiz-Classification datasets. Both of these are collected by blind users through real-world assistive applications. And we inspected CLIP’s performance on both a zero-shot image classification task directly as well as through examining the performance of models that use CLIP as a component, which is very widely done in the community. I unfortunately don’t have time to go into all the details of our work, but I will share our top three findings with you. First, we confirmed that CLIP does indeed underperform on data that is captured by blind and low-vision users. Second, these disparities trickle down to models that use CLIP as a component. And then third, these disparities stem from the fact that disability content is significantly underrepresented and sometimes missing completely from the datasets that are used to pretrain these large models. And I’ll dive into our three findings in a bit more detail.
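For readers unfamiliar with this style of evaluation, a zero-shot CLIP classification probe using the Hugging Face transformers API looks roughly like the following. The checkpoint, image, and label set are examples, not the paper's exact setup:

```python
# Zero-shot object classification with CLIP, the style of evaluation
# described above. Checkpoint, image path, and labels are illustrative.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

labels = ["a photo of a Braille keyboard", "a photo of a guide cane",
          "a photo of a TV remote"]
image = Image.open("user_photo.jpg")  # e.g., an image from the ORBIT dataset

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
probs = model(**inputs).logits_per_image.softmax(dim=-1)
print(dict(zip(labels, probs[0].tolist())))
```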

So for the first finding, we found that CLIP underperforms on objects, image quality, and language that is typically used by blind users. On object type, CLIP recognizes disability objects like a Braille keyboard, for example, up to 28 percentage points less accurately than common objects like a TV remote. On image quality, CLIP is up to 23 percentage points more sensitive to images with things like camera blur and lighting compared to images that don’t have these quality issues. And on language, CLIP recognizes objects that are described by their material—so, for example, a leather boot—up to 12 percentage points less accurately than objects described by their color—for example, a brown boot. And we know that blind users rely heavily on this tactile rather than visual language.

Towards our second finding, we examined three models that use CLIP under the hood—an object detection model, an image segmentation model, and an image generation model—and found that all three struggle with disability content. For example, DALL-E 2, which relies on a CLIP vision encoder, cannot generate common disability objects like guide canes and Braille keyboards. Instead, as you can see here, it gives us very strange-looking walking canes and lots and lots of randomly placed white dots. In comparison, DALL-E 2 generated really high-quality and realistic images for almost all of the non-disability objects that we tested.

And then towards our third and final finding, we really wanted to understand where these performance disparities were stemming from. And so we quantified just how prevalent disability content is in three popular datasets that are commonly used to pretrain these large models: LAION-400M, LAION-2B, and the DataComp-1B dataset. Specifically, we counted how many times objects are mentioned in these datasets’ captions and found that disability objects appear up to 16 to 17 times less frequently than non-disability objects across all three of the datasets.
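The counting methodology can be sketched as follows; the object lists and the simple substring match are stand-ins for the paper's actual analysis:

```python
# Sketch of the caption-frequency analysis: count how often each object
# is mentioned across a dataset's captions. Object lists and matching
# rule are simplified assumptions.
from collections import Counter

def mention_counts(captions, objects):
    counts = Counter({obj: 0 for obj in objects})
    for caption in captions:
        text = caption.lower()
        for obj in objects:
            if obj in text:
                counts[obj] += 1
    return counts

disability_objects = ["braille keyboard", "guide cane", "dog lead"]
common_objects = ["tv remote", "laptop", "mug"]
# counts = mention_counts(dataset_captions, disability_objects + common_objects)
```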

So as you can see, our work has identified a clear gap in current models’ capabilities for blind users, and this could have very real consequences if these models are then integrated into assistive technologies for the blind and low-vision community. So what should we, as a research community, be doing about it? First, I think more work is needed to understand how models come to learn or adapt to long-tailed data. Some of our early results show that few-shot learning approaches hold some promise, but they don’t always work, especially in more challenging settings, for example, when objects appear in highly cluttered scenes. And second, I think it’s important for us to really focus on including more disability content in these large-scale pretraining datasets. And our team [is] currently working on developing equitable and fair practices alongside disabled communities to source data that is truly representative of their needs. And so with that, I will wrap up.

Thank you to all the people behind this work and thank you for listening.

Panel Discussion: Generative AI for Global Impact: Challenges and Opportunities

Hosted by Jacki O’Neill, with Sunayana Sitaram, Daniela Massiceti, and Tanuja Ganu at Microsoft Research Forum, June 2024

Sunayana Sitaram

“One of the solutions that we’ve been using is to actually design with ‘human in the loop’ in mind because we know that these technologies are not perfect. And so, we really want to figure out ways in which humans and AI systems can work together in order to create the most effective outcome.”

Sunayana Sitaram, Principal Researcher, Microsoft Research India

Transcript: Panel Discussion

Generative AI for Global Impact: Challenges and Opportunities

Jacki O’Neill, Lab Director, Microsoft Research Africa, Nairobi (host)
Sunayana Sitaram, Principal Researcher, Microsoft Research India
Daniela Massiceti, Senior Researcher, Microsoft Research Cambridge
Tanuja Ganu, Principal Research SDE Manager, Microsoft Research India

Microsoft researchers discuss the challenges and opportunities of making AI more inclusive and impactful for everyone—from data that represents a broader range of communities and cultures to novel use cases for AI that are globally relevant.

Microsoft Research Forum, June 4, 2024

JACKI O’NEILL: I’m delighted to be hosting what promises to be a really engaging panel today with three fabulous panelists. In my talk, I talked about the importance of building globally equitable generative AI systems for diverse communities and application areas, and I hope that I’ve convinced you all of the importance of doing this if generative AI is not going to compound existing systemic inequalities. In this panel, we’re going to dive much deeper into the application areas, the user populations, the problems, and the solutions of doing this with our three expert panelists: Sunayana Sitaram, Tanuja Ganu, and Daniela Massiceti. So without further ado, I’d like to ask each of the panelists to introduce themselves.

TANUJA GANU: Thank you, Jacki, and hello, everyone. My name is Tanuja Ganu, and I’m a principal research engineering manager at Microsoft Research in India. My background is in applied AI, and my work is focused on developing and validating technologies that would drive positive change in society. I have been leading an incubation center in MSR India called SCAI—Societal Impact through Cloud and AI—and in the last 1½ years, I have been spending a lot of time on how we can take the potential of generative AI to empower every individual across the globe and catalyze change in some domains, like education. Thank you.

SUNAYANA SITARAM: Hi, everyone. I’m Sunayana Sitaram. I’m principal researcher at Microsoft Research India, and my background is in natural language processing. My research involves trying to make sure that large language models, or generative AI as they’re also known, work well for all languages and cultures. And over the last couple of years, my research group has really looked into how to evaluate how well these large language models are doing for different languages across the world, including languages that have smaller amounts of data compared to English but are still spoken by millions of people worldwide. Thank you.

DANIELA MASSICETI: Hi, everyone. My name is Daniela Massiceti, and I’m a senior researcher at Microsoft Research based in Australia. My background is in machine learning, but nowadays, I work much more at the intersection of machine learning and human-computer interaction, particularly looking at multi-modal models. So these are models that work with both image and text input. And my main focus is, how do we ensure that these AI models or AI systems work well for the users who are in the tails of the user distribution? The research that I’ve done along with my team particularly looks at people with disabilities, who will, of course, be major beneficiaries of these multi-modal models.

O’NEILL: Thank you so much. I’d like to start by asking you what you see as the core problems we face building equitable generative AI that works well for diverse communities and user groups. Tanuja, would you like to start us off?

GANU: Let me start off by saying that I feel that this is an exciting time to be in technology, and I’m really thrilled with the remarkable progress and the vast potential of generative AI. And we are already seeing successful deployments of generative AI in enterprise applications like GitHub Copilot for programmers or Office 365 Copilot for enterprise users, which is showing improved efficiency and quality as well as giving users the ability to focus more on their creative work. So the natural next question is, how can we take this power of generative AI and empower every individual across the globe—people who are coming from different nationalities, different ethnicities, and cultures, as well as with varied technology access and financial affordability? So when we are looking at this technological evolution, I think it’s crucial that we prioritize and address the digital divide and really, kind of, actively work to reduce this particular gap. So taking these points into account, there are [a] few sociotechnical challenges that we need to address when we want to make sure that generative AI technology truly works for every individual. So firstly, the first important challenge is making sure that these technologies are able to provide seamless interaction across thousands of world languages. And it’s not only about language, but it’s also about incorporating and preserving cultural nuances across these different kinds of communities and user groups. The second important challenge is about designing for existing infrastructural constraints—like needing to support low-end mobile phones as the primary interface in some cases, dealing with low or intermittent network connectivity, and overall low affordability, especially when we are looking at the vast majority of populations from the Global South. The third important problem is the varied access levels depending upon literacy levels as well as the access needs arising from disabilities. And the fourth important challenge is really overarching: how can we expand and revisit the responsible AI and safe deployment principles, taking into account these culturally and linguistically varied user groups and expanding to include the dimensions of equity, access, and inclusion? So I think these are really some of the important challenges.

O’NEILL: Thank you so much, Tanuja. I think you’ve really given us a great overview there. Daniela, I wonder if you could deep dive a bit on the accessibility questions that Tanuja raised.

MASSICETI: Yeah, sure thing, Jacki. So I can definitely bring some perspectives here from the work that my team and I have done in the accessibility space. We know, as I said earlier, that these multi-modal models really hold the potential to transform assistive technologies for communities with disabilities. But up until now, very few works have actually quantified how well these models will work for these communities. And so a piece of work that we recently did, which was published in CVPR, aimed to do exactly this. Specifically, we looked at images and text captured by users who are blind and then evaluated how well CLIP, which is a very popular multi-modal model, actually works on their data. And I wanted to share three insights that came from this work, which speak to the core challenges that I think lie ahead of us in realizing truly equitable AI systems.

So the first is that the datasets typically used to train these AI models do not include data from communities with disabilities. In our work, we analyzed three large-scale datasets that are typically used to pretrain these large multi-modal models, and we found that disability content—things like guide canes, Braille displays—is significantly underrepresented or just not present at all in these datasets. This means that any model trained on these datasets will perform poorly on tasks that involve identifying, locating, or answering questions about these particular objects. And I don’t think this problem of data inclusion applies just to the blind and low-vision community but to many, many marginalized communities who may not be included in these datasets. The second core problem is that we’re moving toward a paradigm where we have a very small number of enormous models—these so-called foundation models—which are widely used by many downstream models and applications. But if these foundation models don’t work well in the first instance for marginalized communities, then we have the potential to see those failures compound in any downstream application that uses these foundation models. And this is exactly what we saw in our CVPR work.

We identified that CLIP, as a base model, significantly underperforms on data from blind and low-vision users. But then when CLIP is embedded as a component in other models, these failures persist and in some cases are even amplified. So, for example, we looked at DALL-E 2, which uses a CLIP vision encoder under the hood, and we basically saw that it couldn’t generate any decent images of any of the disability objects we tested. You know, when we asked it for a guide cane, it gave us very funky-looking walking sticks. And when we asked it for Braille keyboards, it again gave us these random arrangements of white dots on a page.
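To make this kind of audit concrete, here is a minimal sketch of zero-shot CLIP evaluation on disability-related objects using the Hugging Face transformers API. The checkpoint, photo, and label set are illustrative assumptions, not the actual CVPR study setup.

```python
# Minimal sketch: zero-shot CLIP recognition of disability-related objects.
# The checkpoint, image path, and labels are illustrative assumptions.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

labels = ["a guide cane", "a Braille display", "a walking stick", "a keyboard"]
image = Image.open("photo.jpg")  # hypothetical user-captured photo

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
probs = model(**inputs).logits_per_image.softmax(dim=1)[0]

# Low probability on the correct disability object would surface exactly
# the kind of underperformance described above.
for label, p in zip(labels, probs.tolist()):
    print(f"{label}: {p:.3f}")
```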

And the final core problem I’ll reflect on is that we don’t often embed ourselves deeply enough in marginalized communities to really understand the ways that AI models need to work for these communities. So, for example, one of the findings in our CVPR paper was that CLIP has trouble recognizing objects if users describe them by their material rather than their color. So, for example, a user might say find my leather bag rather than my brown bag. And we only knew to test for this because our team collectively has over 20 years of experience working with the blind and low-vision community and knows that users often use these material-based descriptions when talking about their objects. Without this insight, we would never have uncovered this particular failure mode. So I think that to achieve truly equitable AI models, we really need to deeply embed ourselves in the communities that we’re working with.

O’NEILL: Thank you, Daniela. So Sunayana, Daniela’s given us a really good overview of the challenges with the multi-modal models and the image models. I know that your research is primarily thinking about how different language communities can interact with these language models. I’m wondering, what do you see as the problems for making these models work well for anyone, anywhere, whatever language they speak?

SITARAM: Right. So as Daniela mentioned, there is a data divide even when it comes to languages, because most language models today are trained predominantly on data that comes from the web. And we know that not all languages and cultures are equally represented on the web. So at the very first step of the pipeline, you have this inequity because of the different representation of different languages and cultures. But that’s not the only problem. There are a lot of other decisions taken during the model-building process which can also influence downstream performance. So, for example, in some of our research last year, which was published in EMNLP, we found that the tokenizer, the component that breaks words down into smaller pieces, doesn’t work equally well for all languages, and that actually has a significant impact on downstream performance. So decisions taken during the model-building process can also really influence performance. And finally, one of the biggest challenges I see—and I may be a little biased because this is my area of research—is that we are not able to evaluate these models well across all languages and cultures. This is for a variety of reasons, including the fact that not many benchmarks exist with sufficient linguistic and cultural diversity. But because we are not doing a good job of evaluation, we don’t even know how well these models work for different languages and cultures. So I think, beyond data, there are many other challenges that need to be addressed in order to make these models actually work for all languages and cultures.
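The tokenizer effect Sunayana describes can be checked directly. Below is a minimal sketch comparing tokenizer fertility (tokens per word) across languages with a Hugging Face tokenizer; the checkpoint and sample sentences are illustrative assumptions.

```python
# Minimal sketch: compare tokenizer "fertility" (tokens per word) across
# languages. Higher fertility means longer, costlier sequences and is one
# way tokenization disadvantages some languages. The checkpoint and
# sentences are illustrative.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

samples = {
    "English": "The weather is very pleasant today.",
    "Swahili": "Hali ya hewa ni nzuri sana leo.",
    "Hindi": "आज मौसम बहुत सुहावना है।",
}

for lang, text in samples.items():
    n_tokens = len(tokenizer.tokenize(text))
    n_words = len(text.split())
    print(f"{lang}: {n_tokens} tokens over {n_words} words "
          f"(fertility {n_tokens / n_words:.2f})")
```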

O’NEILL: Yeah, thank you so much. I think your answers make really clear what the biggest challenges are for making these technologies work well, at both the societal level and the level of the actual models themselves, whether they’re vision or multi-modal models or language models, and we know that this has a direct impact on various user populations. As Tanuja mentioned in the beginning, we’re seeing a lot of enterprise applications and enterprise technologies being developed, whether that’s for helping you code or ideate or answer emails. But are there other user populations who could really benefit from applications of generative AI that work well? Tanuja?

GANU: Yeah, so I think there are a lot of interesting and impactful applications emerging for generative AI in domains like education, health care, and agriculture. So let me give you an example from our work in education, where we are developing an AI assistant called Shiksha copilot that gives teachers in public schools in India the agency to generate personalized and engaging learning experiences, like activities, assessments, and teaching material, for their students. What is important here is that the generated content is completely grounded in the local curriculum and the interaction is entirely in the local language, which is Kannada in this particular case. It’s also important that the content preserves cultural and local norms. So let’s take the example of a teacher teaching components of food or a balanced diet as the topic. It should include examples that come from the local diet and cuisine, maybe giving an example of biryani or of ragi mudde, which is made from finger millet. It’s additionally important that the teacher is able to generate and use the lesson plans on a mobile phone or desktop, whichever resources are available to them, and that they can use Shiksha copilot in classrooms where AV systems might not be available. So they can generate the lesson plan on the phone, take it to the classroom, and use it completely offline. So these are all the challenges we discussed earlier; they become really important when we are doing these kinds of real-world deployments. With Shiksha copilot, we have completed a successful small pilot with 50 teachers in India, and now we are gearing up for a scaled pilot with a thousand teachers. And I feel that applications like these can have a really transformative effect on the education system and create a positive impact for students and teachers across the globe.

O’NEILL: Thank you. Daniela, for the accessibility populations, what type of applications and populations are important in this space?

MASSICETI: Yeah, sure thing. So an estimated 1.3 billion people—around 16 percent of the global population—live with some level of disability today. So I think it’s really exciting to see these generative AI applications coming online for these communities, and our team has done, as you may already have gathered, a lot of work with the blind and low-vision community. So I wanted to call out a couple of promising generative AI applications for this particular community. The first is Microsoft’s own, actually: Seeing AI. Seeing AI is a mobile app for users who are blind and low vision, and the team is really leading the charge in innovating new assistive user experiences using models like GPT-4. So, for example, they’ve built in features which allow users to ask really detailed questions about a document they’ve scanned, as well as get beautifully detailed captions or descriptions of photos they’ve taken. And you can really see the impact of these. For example, when you’re visiting a museum, you can snap a picture and get these beautiful descriptions of the artworks that are around you. I’ll also call out the partnership announced last year between Be My Eyes and OpenAI. Be My Eyes is a video-calling app which connects blind users with sighted volunteers when they need help on a particular task. So, for example, they snap a picture of a packet of potatoes or a packet of tomatoes and then ask the sighted volunteer if they’re out of date. And the promise of the OpenAI partnership is that at some point in the future, these sighted volunteers may be replaced by a model like GPT-4 with vision, enabling pretty much instantaneous and fully automated assistance for blind users anywhere in the world. So I think that’s really exciting. And in fact, I—along with some other colleagues at Microsoft Research—worked very closely with OpenAI and teams across Microsoft to red team the GPT-4 with vision model and really ensure that it met Microsoft’s high bar before it was publicly released. And I think this is a really tangible demonstration of Microsoft’s commitment to delivering safe and responsible AI technologies to its customers.

O’NEILL: Thank you so much. So, given these large populations who could really benefit, how do we go about building solutions for them that actually work?

GANU: So maybe I will take this. Given that we are working with really diverse populations, I think it’s extremely useful to follow a user-centered or participatory design approach and to collect the voices of the users, especially marginalized and underserved communities, right from the start, at design time. It’s also important, while we are dealing with this nascent or emerging technology, that we have the right safeguards when deploying the system and that we collect feedback at every stage of deployment, such as using an expert-in-the-loop deployment, where the expert has the ability to verify as well as override the responses as and when required. To give an example, one of the conscious decisions when we started working on Shiksha copilot was to start with the teachers and not with the students first. The teacher is the expert in the loop, and we can extend the benefits of the technology to the students through teachers to start with and eventually reach the students directly.

Also, as we work on applications at population scale across domains like agriculture, education, and health care, what we are seeing is that there are common problems, or universal challenges, repeated across all these domains. As Sunayana said earlier, multilingual interaction is a huge problem across all domains. Another important problem is that most of the knowledge base required for grounding and generating these AI experiences is non-digitally native and multi-modal. How we extract information from this multi-modal, non-digitally-native content is a challenge across these different domains. So what we are doing as part of our project, which is called Project VeLLM, which stands for “uniVersal Empowerment with Large Language Models,” is building a versatile platform, which you can think of as a set of building blocks or a tool set providing all these different functionalities that are common across different applications. Other developers then do not have to start from scratch: they can use these building blocks and rapidly create their own equitable AI experiences across different domains.

SITARAM: Generalizing a little bit from what Tanuja just said about expert in the loop, one of the approaches we’ve been using is to actually design with “human in the loop” in mind, because we know that these technologies are not perfect. So we really want to figure out ways in which humans and AI systems can work together to create the most effective outcome. In our research, we’ve been doing this for evaluation in multilingual scenarios. So, for example, we know that large language models can do a good job of evaluation, but we also know that they don’t do a very good job on some languages and along some dimensions. So those languages and those dimensions should ideally be left to a human, whereas for the ones where we are very confident that the LLM is doing a good job, we can rely on it more, with some human oversight, in order to scale up the process of evaluation. This idea of using humans and AI together and designing for this kind of hybrid system is really crucial. And, of course, we need to keep revisiting this design as these AI systems become more and more capable.
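As an illustration of this hybrid design, here is a minimal sketch of the routing idea (our illustration, not the group’s actual pipeline): send cases to the LLM evaluator only where it is known to agree well with human judges.

```python
# Minimal sketch of hybrid human/LLM evaluation routing: trust the LLM
# evaluator only for language/dimension pairs where it is known to agree
# well with human judges; route everything else to human annotators.
# The agreement table and threshold are illustrative assumptions.
LLM_HUMAN_AGREEMENT = {
    ("english", "fluency"): 0.92,
    ("english", "factuality"): 0.85,
    ("swahili", "fluency"): 0.63,
    ("swahili", "factuality"): 0.55,
}
TRUST_THRESHOLD = 0.80

def route_evaluation(language: str, dimension: str) -> str:
    """Return 'llm' or 'human' for a given evaluation case."""
    score = LLM_HUMAN_AGREEMENT.get((language.lower(), dimension.lower()), 0.0)
    return "llm" if score >= TRUST_THRESHOLD else "human"

print(route_evaluation("English", "fluency"))    # -> llm
print(route_evaluation("Swahili", "factuality")) # -> human
```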

MASSICETI: Yeah, so many points I can agree with there and build on. I think what’s common to both Tanuja’s and Sunayana’s answers is really this need to bring models and humans together. And I think one real limitation we’ve seen in our work across many of the models we’ve worked with is that they often generate quite generic responses. So if you prompt an LLM to write you an email, the tone and style don’t quite feel like yours. So as we look to this next decade of generative AI solutions, I really hope we’re going to see more personalized AI models and solutions come through much more strongly, solutions where you as the user have much more control, much more agency, around how your model works for you. And I think that’s another example of how human users and the AI model need to come together in order to create something even more powerful. And I think this is going to be even more important for marginalized communities, whose needs often differ a lot from the average or generic needs.

And to, kind of, just bring one concrete example to the table, our team has been building a personalizable object recognizer over the last year. So here, a blind user can pretty much teach the object recognizer their personal objects, things like their sunglasses, their partner’s sunglasses, maybe their favorite T-shirt. And they do this by taking short videos of these objects, and then the personalized recognizer can then help them locate these things at any point in the future. And so in this sense, the user is really given the agency. It’s really this example of a human-in-the-loop paradigm, where a user is given the agency to personalize their AI system to meet their exact needs. So, yeah, it’s really exciting. This feature has actually just been released in Seeing AI, and so we’re really keen to begin imagining how we might see more personalizable generative AI experiences for users in the near future.
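One common way to build such a teachable recognizer is nearest-centroid classification over image embeddings. The sketch below uses CLIP embeddings as a stand-in; it is our illustration of the general pattern, not the Seeing AI implementation, and the file names are hypothetical.

```python
# Minimal sketch of a "teachable" object recognizer: average image
# embeddings of a few user-provided frames per object, then classify new
# frames by nearest centroid. An illustration of the general pattern,
# not Seeing AI's actual implementation.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed(paths):
    images = [Image.open(p) for p in paths]
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    return torch.nn.functional.normalize(feats, dim=-1)

# Hypothetical frames extracted from the user's short teaching videos.
centroids = {
    "my sunglasses": embed(["sg1.jpg", "sg2.jpg", "sg3.jpg"]).mean(0),
    "favorite t-shirt": embed(["ts1.jpg", "ts2.jpg", "ts3.jpg"]).mean(0),
}

query = embed(["new_frame.jpg"])[0]
best = max(centroids, key=lambda name: float(query @ centroids[name]))
print(f"Looks like: {best}")
```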

O’NEILL: Yeah, I really love that idea. I think we would all benefit from more personalized AI, even when you’re just trying to craft an email or something like that. The challenge people often face is it doesn’t really sound like them.

MASSICETI: Exactly.

O’NEILL: And then if you have to edit it too much, then, you know, you reduce the benefit. So I think there’s so many areas right across the board where personalization could help. So finally, as we’re coming to a close, I really would love to finish by asking each of you what you think the biggest research questions that are still open are, what the biggest gaps are, and how you would advise the research community to go about solving them.

MASSICETI: Yeah, it’s a big, big question. I’ll maybe take a stab first. So I think a couple of us have already touched on this point before, but the data divide, I think, is really a big, big challenge: the fact that data is widely available for some communities but totally absent or very sparse for others. And I think this is one of the biggest hurdles we need to address as a research community in order to really move the needle on equitable AI, because it’s impacting everything from the way that we can train models to, as Sunayana said, how we can evaluate these models as well. But I want to call out that even though we’ve identified the problem—we know we need to include data from these communities—I think there are just so many open questions around how we actually do this well and how we actually do this right. And so I want to bring up two specific challenges or open questions that I feel are very prevalent.

The first is, what do equitable paradigms actually look like when we’re collecting data from or about a marginalized community? These communities, as we know, have often historically been exploited. And so we really need to find fair ways of not only involving these communities in these data collection efforts, but also compensating them for their efforts as these models are then trained on this data and then deployed and used more broadly. But the second open question, I think, is that we really need deep technical innovation in adapting models to new data. We’ve obviously seen a lot of adaptation methods coming online—fine-tuning, LoRA—and they do really well at adapting these models to new datasets and tasks. But what we’re seeing in our current experiments is that these approaches don’t work so well when the new data coming in is very different from the pretraining dataset. So in one particular example, we gave Stable Diffusion 10 training images of a funky-looking cat statue, and it learned it really well and could generate really realistic images of this statue. But when we did the same for a guide cane, Stable Diffusion still could not generate realistic-looking images of guide canes. And so I think, as a research community, we really need to build a deeper understanding of how we get models to learn new concepts, even when they aren’t well represented in the pretraining datasets.
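For readers unfamiliar with the adaptation methods Daniela mentions, here is a minimal sketch of LoRA using the Hugging Face peft library, with a small causal language model standing in for the much larger models under discussion; the model choice and hyperparameters are illustrative assumptions.

```python
# Minimal sketch of LoRA adaptation with the Hugging Face peft library.
# GPT-2 stands in for the much larger models discussed; hyperparameters
# are illustrative.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("gpt2")

config = LoraConfig(
    r=8,                        # rank of the low-rank update matrices
    lora_alpha=16,              # scaling factor for the updates
    target_modules=["c_attn"],  # GPT-2's fused attention projection
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, config)
model.print_trainable_parameters()  # only a small fraction of weights train

# Fine-tuning on new-domain data would proceed with a standard training
# loop over `model`; the open question raised above is why this works far
# better for concepts already near the pretraining distribution.
```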

O’NEILL: Thanks so much, Daniela. Tanuja, is there anything you want to add?

GANU: So for me, it feels like we are just beginning to scratch the surface, and there is a lot more work underway across the dimensions of culture, cost, human values, cognition, universal access, and many other dimensions. So while the journey is long and we are trying to solve some of these hard and important problems, it is important that we continue to make progress systematically and iteratively and that we continue to collect critical feedback at each of these stages. We definitely need to do a lot more work looking at different types of models: large language models for more complex tasks, but also smaller language models, especially given the infrastructural challenges I discussed earlier, and combinations of these models. We also need to work out how to generate and collect data from different cultures by involving these communities, because cultural knowledge is often implicit and undocumented, so how we learn from it is also an important question. And I think collaboration is the key here. It’s important that we involve experts from multiple disciplines, user communities, researchers, and policymakers and accelerate the progress in the right direction. We are already doing some of these collaborations with academia and NGOs, through programs like Microsoft Research AI & Society Fellows and existing collaborations with our community and partners in India and Africa. But I think we need to continue doing more and continue making steady progress on this important problem.

SITARAM: I completely agree with what both Daniela and Tanuja said. And talking more about the language and culture aspect, I think we need to figure out a way to involve these local communities in the design, training, and evaluation phases of model building. And we need to do this at scale if we really want to reach all languages and all cultures. So that is the thing we really need to figure out how to do. There are a couple of projects we’ve been working on that have attempted to do this. One of them is called DOSA, where we collected a dataset of cultural artifacts from different users in India. This was meant to be a participatory design approach where people would tell us what cultural artifacts were really important to them, and then we would collect this data from the ground up and try to evaluate whether LLMs did a good job or not. That’s one example. The other project we’ve been working on is called Pariksha, where we employ workers from an ethical data company called Karya to evaluate Indian language models. So here we’re really asking the users, who speak multiple languages, to tell us whether these models work for them or not. And so I feel we need to figure out more ways to involve these local communities, but at scale, so that we can really impact the model-building process and actually make these models work well for everybody.

O’NEILL: I couldn’t agree with you more, Sunayana. I think involving user communities in technology design in general is one of the most important things that we can do, and this is even more so with underserved communities. I would just like to add something to that, though, which is that we really need multidisciplinary research that goes beyond anything we’ve done before, involving researchers and practitioners and community members. And it’s important to remember that machine learning engineers and researchers on their own can’t solve the problem of building globally equitable generative AI. This is something we really need to do at a large scale. We need to transcend disciplinary boundaries if we’re going to build technology that really works for everyone, everywhere. And on that note, I’d like to say thank you to the panelists. It’s been a great discussion, and thank you to the audience.

MASSICETI: Thanks very much.

GANU: Thank you so much.

SITARAM: Thank you.

The post Panel Discussion: Generative AI for Global Impact: Challenges and Opportunities appeared first on Microsoft Research.

Keynote: Building Globally Equitable AI http://approjects.co.za/?big=en-us/research/articles/keynote-building-globally-equitable-ai/ Tue, 04 Jun 2024 18:01:06 +0000 http://approjects.co.za/?big=en-us/research/?post_type=msr-blog-post&p=1034994 Jacki O'Neill discusses the importance of creating globally equitable generative AI. She addresses the technical and sociotechnical challenges that must be tackled to positively transform work futures worldwide.

The post Keynote: Building Globally Equitable AI appeared first on Microsoft Research.

]]>
Presented by Jacki O’Neill at Microsoft Research Forum, June 2024

Jacki O'Neill

“It’s only by working together can we solve the challenges we face with generative AI at scale and, in doing so, capitalize on the opportunities these new technologies offer. … Now is the time to change the dialogue and change the direction of these new powerful technologies to ensure that they are globally equitable by design.”

Jacki O’Neill, Lab Director, Microsoft Research Africa, Nairobi

Transcript: Keynote

Building Globally Equitable AI

Jacki O’Neill, Lab Director, Microsoft Research Africa, Nairobi

Jacki O’Neill discusses the importance of creating globally equitable generative AI. She addresses the technical and sociotechnical challenges that must be tackled to positively transform work futures worldwide.

Microsoft Research Forum, June 4, 2024

JACKI O’NEILL: Hi, I’m Jacki, and I head up Microsoft Research Africa, Nairobi. Welcome to the Microsoft Research Forum. I’m going to talk about the importance of building globally equitable generative AI.

Given its ability to generate and process human-like natural language, generative AI is set to transform the way we interact with technology. Like the graphical user interface before it, generative AI promises to make computing and AI more accessible to a wider range of people. This promise encompasses several features. Their natural language interfaces mean users can interact with these models conversationally. They can ask questions, give commands, and get tasks done. This can lead to a reduction of complexity across applications and devices as one can imagine navigating through and creating content using natural language without having to open different applications to find and extract information or even know which tool that content was created in. Given this, LLMs could reduce the burden of repetitive and nonessential tasks—from helping us craft email to summarizing documents and supporting report writing—giving us more time to focus on the work we love. Finally, multi-modal interactions with image, speech, and video processing and generation further enhance the transformational power of these tools, all of which could make both the power of AI specifically and that of computing more generally much more widely accessible, including to a mobile-first or mobile-only audience, thus reaching the billions of people who don’t work at desks.

As a result, generative AI is likely to transform the future of work across the globe in ways as yet unimagined and has sparked excitement about its potential impact on the Sustainable Development Goals. However, generative AI may not be equally useful for everyone, and its impact will not necessarily be evenly distributed globally, across regions, communities, or demographics; as a consequence, there’s a risk of compounding existing systemic inequalities. For example, those who can most benefit from the promise of generative AI include populations in the Global South, who’ve been previously excluded due to both the traditional digital divides and the AI divides. The traditional digital divide has three levels: access to digital technology; the skills and knowledge required for its effective use; and the ability to translate use into desired outcomes. Generative AI then brings additional elements to the digital divide. The AI divide encompasses the socioeconomic conditions around who gets to create, deploy, and benefit from AI. The data divide refers to the consequences of not having good representation and equivalence in the training data and data-related processes, such as labeling and reinforcement learning. And the third divide is compute, given the GPU requirements to build, train, and deploy these immense and immensely powerful models.

So how does generative AI impact these divides? Well, it reduces the traditional divide in some ways because natural language interfaces mean AI is more accessible to more people than ever before. For example, this latest generation of general-purpose off-the-shelf tools can and are being deployed to improve productivity by businesses around the globe, including many small businesses who were previously excluded from the AI revolution because they just didn’t have access to machine learning professionals in their companies. In terms of devices, many of the current AI tools can be used on smartphones, although they do require data. But there’s a plethora of specific feature-phone services which are being created in areas such as health and agriculture which don’t require the end user to have data. Whilst it’s too early to definitively talk about the ability to translate use into outcomes, research on small and medium businesses’ adoption of generative AI in Kenya and Nigeria suggests that it provides value for those who start using it in a number of use cases, such as writing emails in English.

The AI divides however remain, and there’s much work to be done to arrive at globally equitable generative AI. Today, I want to focus on the data divide, which stems from the fact that the vast majority of training data comes from the English-speaking Global North. This has an impact on the representations of both language and of knowledge in AI systems and consequently on their ability to process and produce appropriate output. But what does this mean in practice?

Let’s start with a look at language. Research last year showed that, on standard natural language processing (NLP) tasks and benchmarks, state-of-the-art (SOTA) non-autoregressive models outperform large language models, including GPT-4. Large language models tended to work well on high-resource language families with Latin scripts but less well on low-resource languages with limited training data or non-Latin scripts. However, generative models introduced new challenges for NLP benchmarking, many of them due to prompt sensitivity. That is, even small changes in prompt construction can impact performance, making consistent benchmarking difficult. For example, even asking the LLM to provide explanations for its output can change performance, as does the choice of examples used in the prompt. Nonetheless, currently, African language performance on traditional metrics isn’t yet on a par with English performance. But this doesn’t tell the whole story.
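Prompt sensitivity of this kind can be probed with a simple harness that scores the same items under several templates. A minimal sketch follows; the openai client usage is standard, but the model name and tiny dataset are illustrative assumptions.

```python
# Minimal sketch of a prompt-sensitivity check: score the same items under
# several prompt templates and compare accuracy across templates. The model
# name and two-item dataset are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

TEMPLATES = [
    "Classify the sentiment of: '{text}'. Answer 'positive' or 'negative'.",
    "Is the following message positive or negative? '{text}'",
    "Text: '{text}'\nSentiment (positive/negative):",
]
DATASET = [("I love this!", "positive"), ("This is awful.", "negative")]

def predict(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content.strip().lower()

for template in TEMPLATES:
    correct = sum(gold in predict(template.format(text=text))
                  for text, gold in DATASET)
    print(f"{template[:35]}... accuracy: {correct / len(DATASET):.2f}")
```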

In naturalistic interactions, GPT-4’s performance seems pretty amazing. For example, in a collaborative project with the University of Washington Global Health Department, we’ve been looking at building natural language processing (NLP) tools to support medical facilitators. These facilitators manage peer support groups on WhatsApp for young people living with HIV in informal settlements in Nairobi. The data consists of chat messages in English, Swahili, and Sheng, a local dialect, and includes code-mixing, emojis, and “chat speak.” You can see an example of the data here. This message contains a mixture of English and Swahili code-mixing, with its translation by human annotators. We found that even the best multilingual SOTA models performed so badly, even after fine-tuning, that we stopped working on this project. Then along came GPT-4, and suddenly these tools seemed possible again. What’s going on? Why are NLP benchmarks telling us one thing about African language performance and application-based practice telling us another?

Well, one part of the explanation is that previous models typically just couldn’t handle code-mixing, whereas generative models are much better equipped to handle natural language. Therefore, they’re not only able to handle naturally produced code-mixed language, but they can also handle chat speak with its abbreviations, colloquialisms, emojis, and so on. Now, we found that whilst both GPT-4 and LLaMA showed impressive results in sentiment analysis on this dataset, GPT-4 appears to use more of the whole context of the sentence to produce slightly more robust predictions. Returning to our example, if we assume some correlation between explanations and prediction, we can see that GPT-4 gave more nuanced predictions, whereas LLaMA did not seem to pick up on the more positive, although conditional, sentiment in the second part of the sentence.
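The kind of probe described above can be reproduced in a few lines: ask the model for a label plus a short explanation, so its use of the whole sentence context can be inspected. A minimal sketch follows; the message is an invented code-mixed example, not data from the study, and the model name is an illustrative choice.

```python
# Minimal sketch of a code-mixed sentiment probe: request a label plus a
# short explanation so the model's use of the full sentence context can be
# inspected. The message is invented (not study data); the model name is
# an illustrative choice.
from openai import OpenAI

client = OpenAI()

# Invented Swahili/English code-mixed message, loosely: "I'm okay, but if
# I get the medication I'll be better."
message = "Niko poa lakini nikipata dawa nitakuwa better."

prompt = (
    "The following WhatsApp message mixes English and Swahili.\n"
    f"Message: {message}\n"
    "Classify its sentiment as positive, negative, or mixed, and explain "
    "which parts of the message support your label."
)

resp = client.chat.completions.create(
    model="gpt-4o",  # illustrative model choice
    messages=[{"role": "user", "content": prompt}],
)
print(resp.choices[0].message.content)
```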

Despite these impressive advances, there’s still plenty of work to be done. There are still many under-resourced African languages where performance lags far behind. For example, the models make more mistakes on Sheng, which is not included in the training data, and speech models lag behind, often failing at the code-mixing hurdle. This is important because voice interfaces are likely to be essential to enabling even broader access to AI. So this is an area of continued research for African and other under-resourced languages. But language is not the only concern. Whilst language is most researched, the widespread deployment globally of the latest generation of foundation models reveals another equally pressing problem. Models hallucinate, fail, or reproduce stereotypes in African and other Global South contexts.

On the positive side, we’ve seen adoption of text and text-to-image generation tools, generative AI search, AI-augmented design tools, speech generation tools by small businesses in Nigeria and Kenya from a range of sectors, including law, design, outdoor recreation, and retail. These businesses successfully use generative AI to support communication, especially being professional and polite and to the point in emails. A common example that we saw across sectors is illustrated here: how do I tell my client he’s four months late now to pay his fees, and I don’t want to sound rude? And we saw this across pretty much all of the small businesses where they needed customers to pay. They also used AI to support creative work, such as content creation and ideation and so on. They described how it helped save time. For example, it reduced the time for ideation. As an architectural designer said, “We would manually, kind of, like, bounce ideas off each other. … Arriving at 10 strong ideas would take two or three sessions, whereas now we get the same results in one.” Even the lawyers who charged by the hour wanted to reduce their mundane work so they could spend more time on creative work. They would have liked to deploy generative AI to reduce the time spent on small-case document review. As a senior lawyer said, “We could have spent that 15 hours on important things. Once the machine, the AI, had given us the report, we’d be thinking creatively now.” So far so good.

Problems often arise, though, when they want to use generative AI in work involving the African context, which—as SMBs in Africa—is quite often. Whilst generative AI can sometimes help to navigate cultural and contextual boundaries, it’s more likely to hallucinate when the proportion of training data is low—i.e., in most African contexts—and a whole host of problems starts arising, from accent recognition in meeting transcription to speech production. For example, the CEO of an IT company used voice cloning for training videos but found it gave her a British accent. As she said, it “takes away from my originality, which is I’m not British; I’m Kenyan.” Or the poor context and consistency we’ve seen in image production systems, creating unusable and sometimes stereotypical images of Africans and African landscapes, not to mention the tendency to produce answers which neglect or misrepresent African people, history, culture, and knowledge. And even where information about Africa is generated, it often portrays a Western perspective. This was perhaps most clearly encapsulated by one of the lawyers, who explained, “Even if you put into the particular AI that you’re asking from a Kenyan perspective—while in Kenya, does this law apply?—they’ll reference the people in the US, which is insane because we have Kenyan authors; we’ve done the actual work.” Overall, then, it can leave the feeling that generative AI is really Americanized.

This regional bias goes way beyond demographic biases like race, although it would be compounded by them. Whole continents and their knowledge are severely underrepresented, and this comes through clearly in use, both in usability and use cases that are directly impacted by this. Indeed, AI has a language problem, but just as importantly, it has a knowledge problem, and this is likely to compound existing systemic inequalities. But we’re at the very early stage of generative AI and the impacts it will have on work. This is a fast-moving field, and there’s an immense opportunity to take control of the agenda and build truly globally equitable AI systems. This requires ensuring that diverse contexts and applications, with their diverse datasets, drive the development of generative AI. We need to be intentional and embrace these approaches. Machine learning methods like fine-tuning and retrieval-augmented generation, or RAG, are unlikely to work well if we don’t design and build for these diverse contexts from the beginning.
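To illustrate the RAG pattern mentioned above, here is a minimal sketch that grounds a query in locally sourced passages before generation; the embedding model and passages are illustrative assumptions, not a deployed system.

```python
# Minimal sketch of retrieval-augmented generation (RAG): embed a local
# knowledge base, retrieve the passages closest to the query, and prepend
# them to the prompt so answers are grounded in local sources. Model name
# and passages are illustrative assumptions.
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")

passages = [
    "Commentary on Kenyan employment case law by Kenyan authors ...",
    "Summary of the Kenyan Employment Act ...",
    "US federal guidance on employment law ...",
]
passage_emb = embedder.encode(passages, normalize_embeddings=True)

query = "While in Kenya, does this employment law apply?"
query_emb = embedder.encode([query], normalize_embeddings=True)[0]

scores = passage_emb @ query_emb   # cosine similarity (vectors normalized)
top = np.argsort(-scores)[:2]      # the two most relevant passages

context = "\n".join(passages[i] for i in top)
prompt = (f"Answer using only the context below.\n\n"
          f"Context:\n{context}\n\nQuestion: {query}")
print(prompt)  # this grounded prompt is then sent to the generative model
```

As the keynote argues, retrieval of this kind only helps if locally authored sources exist in the index in the first place, which is why diverse datasets must drive the design from the beginning.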

This is not something that any one group or company, nor any one discipline, can or should do on their own. This needs to be a collaborative effort incorporating different voices, different perspectives, and different disciplines working more closely together than ever before. And so just before I finish, I want to highlight one initiative that’s attempting to address some of the concerns raised: the African Health Stories Project.

This is a multi-institution, multi-country, multidisciplinary project. Microsoft Research is working with public health researchers at Stellenbosch University, human-computer interaction researchers at the University of Swansea, and machine learning and data science researchers at the University of Pretoria to create culturally appropriate and sensitive stories supporting good health behaviors. We will use generative AI to create interactive visual, oral, and text stories which enable patients to better understand how to apply health advice to their local circumstances. Together, as a multidisciplinary team, we will use this specific real-world application area to probe, evaluate, and extend the ability of generative AI to create situated and culturally appropriate content at the same time as addressing a real health need. Because it’s only by working together can we solve the challenges we face with generative AI at scale and, in doing so, capitalize on the opportunities these new technologies offer.

We have plenty of work to do, but now is the time to change the dialogue and change the direction of these new powerful technologies to ensure that they are globally equitable by design. Thank you.

The post Keynote: Building Globally Equitable AI appeared first on Microsoft Research.
