Grounded Visual Generation

Multi-modal data provides an exciting opportunity to train grounded generative models that synthesize images consistent with real-world phenomena. In this talk, I will share several of our recent efforts towards creating grounded visual generation models: (1) introducing user attention grounding for text-to-image synthesis, (2) improving text-to-image generation results with stronger language grounding, and (3) taking steps towards creating spatially grounded world models for embodied vision-and-language tasks.

Speaker Details

Jing Yu Koh is a Research Engineer at Google Research, where he works on machine learning for computer vision and natural language processing. He was previously an AI Resident at Google. His research interests include multi-modal learning, vision-and-language models, and generative models. Prior to joining Google, he completed his undergraduate studies at the Singapore University of Technology and Design in 2019.

Date:
Speakers: Jing Yu Koh
Affiliation: Google

Series: Microsoft Vision+Language Summer Talk Series