{"id":1159209,"date":"2025-12-19T00:01:03","date_gmt":"2025-12-19T08:01:03","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?post_type=msr-blog-post&p=1159209"},"modified":"2025-12-19T01:16:48","modified_gmt":"2025-12-19T09:16:48","slug":"deep-video-discovery-using-agentic-search-to-analyze-long-form-video","status":"publish","type":"msr-blog-post","link":"https:\/\/www.microsoft.com\/en-us\/research\/articles\/deep-video-discovery-using-agentic-search-to-analyze-long-form-video\/","title":{"rendered":"Deep Video Discovery: Using agentic search to analyze long-form video"},"content":{"rendered":"\n

Extracting useful information from long videos, whether meeting recordings, experimental data, or lecture content, requires painstaking manual review. AI tools offer some help: language-vision models can summarize short clips or answer questions when videos are divided into clear scenes or chapters. But for hours\u2011long recordings packed with information and lacking obvious structure, current models are limited. They process videos slowly, are unable to connect information across long stretches of content, and often provide limited or unhelpful answers.<\/p>\n\n\n\n

To address these limitations, researchers at Microsoft Research Asia developed Deep Video Discovery (DVD)<\/a>, an agentic AI framework for long-video analysis. DVD divides long videos into shorter clips for individual analysis, then uses LLM-based reasoning to plan next steps and select appropriate tools. The agent retrieves needed information and uses it to answer complex questions about the video.<\/p>\n\n\n\n

How DVD works<\/h2>\n\n\n\n

DVD operates through a simple cycle: observe the video content, analyze what it means, and choose the next action. Current video-analysis systems follow rigid, predesigned steps that have difficulty adapting to different tasks. In contrast, DVD adjusts its approach based on information it has gathered so far. To support this flexibility, the system operates in two stages:<\/p>\n\n\n\n

Stage 1: Building a searchable video database<\/strong><\/p>\n\n\n\n

The system converts long videos into a structured database, dividing them into five-second clips and extracting information at three levels:<\/p>\n\n\n\n