Agent AI

Agent-based multimodal AI systems are becoming a ubiquitous presence in our everyday lives. A promising direction for making these systems more interactive is to embody them as agents within specific environments. Grounding large foundation models to act as agents within specific environments provides a way of incorporating visual and contextual information into an embodied system. For example, a system that can perceive user actions, human behavior, environmental objects, audio expressions, and the collective sentiment of a scene can use that information to inform and direct agent responses within the given environment. We define Agent AI as an interactive system that can perceive visual stimuli, language inputs, or other environmentally grounded data and can produce meaningful actions: manipulation, navigation, gesture, and more. In particular, we focus on improving agents based on next-action prediction by incorporating external knowledge, multimodality, and human feedback obtained by the interactive agent. We argue that developing agentic AI systems in grounded environments will also reduce the hallucinations of large foundation models and their tendency to generate incorrect outputs. To accelerate research on embodied agent intelligence, we propose a new project on General Embodied Agent AI, which focuses on the broader embodied and agentic aspects of multimodal interactions.
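To make this interaction loop concrete, the sketch below shows one way a next-action-prediction agent could combine a multimodal observation, retrieved external knowledge, and logged human feedback. It is a minimal illustration only: the names (`Observation`, `InteractiveAgent`, `predict_action`, `retrieve`) are hypothetical placeholders and do not correspond to an API from any released Agent AI system.

```python
from dataclasses import dataclass
from typing import Any, Dict, List

# Hypothetical multimodal observation bundle; fields are illustrative.
@dataclass
class Observation:
    image: Any            # e.g. a camera frame or game screenshot
    text: str             # user utterance or instruction
    audio: Any = None     # optional audio / expression signal

class InteractiveAgent:
    """Minimal next-action-prediction loop with external knowledge
    and human feedback. A sketch, not a reference implementation."""

    def __init__(self, foundation_model, knowledge_base):
        self.model = foundation_model       # any multimodal model wrapper (assumed)
        self.kb = knowledge_base            # external knowledge retriever (assumed)
        self.feedback_log: List[Dict] = []  # human feedback kept for later refinement

    def act(self, obs: Observation) -> str:
        # 1. Ground the observation with retrieved external knowledge.
        context = self.kb.retrieve(obs.text)
        # 2. Predict the next action from the multimodal state.
        action = self.model.predict_action(
            image=obs.image, text=obs.text, audio=obs.audio, context=context
        )
        return action

    def record_feedback(self, obs: Observation, action: str, rating: int) -> None:
        # 3. Store human feedback so future action prediction can be improved.
        self.feedback_log.append({"obs": obs, "action": action, "rating": rating})
```

In this sketch the foundation model and knowledge base are injected as dependencies, so the same loop could sit on top of a game simulator, a robot stack, or a healthcare assistant backend.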


Related publications:

1) Agent AI Towards a Holistic Intelligence;

2) Agent Foundation Model for embodied interaction in Robotics, Gaming, and Healthcare;

3) Multi-agent for Gaming (GPT-4) in simulation and on real infrastructure;

4) Navigation Agent for Robotics;

5) GPT-4V Agent for Robotics;

6) Agent AI Survey and GPT-4V for Robotics, Gaming, and Healthcare.

Community building:

In addition, we will organize a CVPR 2024 Tutorial on Generalist Agent AI and release two new embodied datasets, CuisineWorld and VideoAnalytica, along with a set of baseline models, encouraging researchers across the world to develop new models and systems and to explore ways to evaluate and improve performance on our agent-based multimodal leaderboard.

To push the frontier of this important area, this project aims to bring together researchers and practitioners working on embodied agent AI to share ideas and insights. This is an emerging research area that poses new challenges for embodied AI systems, and there is still significant room for improvement. A deeper joint understanding of audio, vision, and language has also begun to play a key role in human-machine interaction systems. Our project will advance large foundation model technologies, including cross-modal understanding, reality-agnostic integration, generic agent information, and human-aesthetic evaluation.

Agent AI
Agent AI is emerging as a promising route toward Artificial General Intelligence (AGI). The Agent AI training process has demonstrated multimodal understanding in the physical world, and it provides a framework for reality-agnostic training by leveraging generative AI alongside multiple independent sources of data. Large foundation models trained for agent and action-related tasks can be applied to physical and virtual/simulated worlds when trained with cross-reality training data. We present a general overview of an Agent AI system that can perceive and act in many different domains and applications, possibly serving as a route toward AGI using an agent paradigm.
Paper: arXiv:2401.03568 (arxiv.org)
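One way to read the "reality-agnostic" framing is that the agent is written against a single perceive/act interface that can be backed either by a simulator or by physical hardware. The sketch below illustrates this under stated assumptions: the class names, the `SimulatedKitchen`/`PhysicalRobot` backends, and the expectation that an agent exposes an `act()` method are all illustrative choices, not code from the paper.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict

class Environment(ABC):
    """Common perceive/act interface shared by simulated and physical backends,
    so the same agent policy code can run in either world."""

    @abstractmethod
    def observe(self) -> Dict[str, Any]:
        """Return the current multimodal observation."""

    @abstractmethod
    def step(self, action: str) -> Dict[str, Any]:
        """Apply an action and return the next observation."""

class SimulatedKitchen(Environment):
    # Stand-in for a game/simulation backend (e.g. a CuisineWorld-style task).
    def observe(self) -> Dict[str, Any]:
        return {"image": None, "text": "simulated kitchen state"}

    def step(self, action: str) -> Dict[str, Any]:
        return {"image": None, "text": f"result of '{action}' in simulation"}

class PhysicalRobot(Environment):
    # Stand-in for a real robot backend; hardware calls are omitted.
    def observe(self) -> Dict[str, Any]:
        return {"image": None, "text": "camera and sensor readings"}

    def step(self, action: str) -> Dict[str, Any]:
        return {"image": None, "text": f"robot executed '{action}'"}

def rollout(agent, env: Environment, steps: int = 3) -> Dict[str, Any]:
    # The rollout loop is identical regardless of which reality backs `env`.
    obs = env.observe()
    for _ in range(steps):
        action = agent.act(obs)
        obs = env.step(action)
    return obs
```

Under this view, cross-reality training amounts to collecting trajectories from several `Environment` backends and training one model on all of them.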