Stanford’s VideoAgent Achieves New SOTA in Long-Form Video Understanding via an Agent-Based System


Understanding long-form videos is a formidable challenge in computer vision. The task requires a model adept at processing multi-modal data, managing extensive sequences, and reasoning effectively over those sequences.

In response to this challenge, in a new paper VideoAgent: Long-form Video Understanding with Large Language Model as Agent, a Stanford University research team introduces VideoAgent, an innovative approach that simulates human comprehension of long-form videos through an agent-based system, showing superior effectiveness and efficiency compared with current state-of-the-art methods. This underscores the potential of agent-based approaches for advancing long-form video understanding.

VideoAgent employs a large language model (LLM) as a central agent that iteratively identifies and compiles the information needed to answer a given question, while vision-language foundation models serve as tools to translate and retrieve visual information.

The process is formulated as a sequence of states, actions, and observations, with the LLM orchestrating this progression. Initially, the LLM acquaints itself with the video context by reviewing a set of uniformly sampled frames. During each iteration, it evaluates whether the current information is sufficient to answer the question; if not, it determines what additional information is necessary. It then uses Contrastive Language-Image Pre-training (CLIP) to retrieve new frames containing that information and vision-language models (VLMs) to caption them, updating the current state.
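To make the iterative procedure concrete, here is a minimal Python sketch of the loop described above. The `llm`, `clip_retriever`, and `vlm` objects are hypothetical interfaces standing in for the paper's components (the LLM agent, CLIP-based frame retrieval, and the VLM captioner); the authors' actual implementation may differ.

```python
# A minimal sketch of VideoAgent's iterative loop, under assumed interfaces;
# llm, clip_retriever, and vlm are hypothetical stand-ins, not the paper's API.

def uniform_sample(frames, n):
    """Pick n frames spread evenly across the video, with their indices."""
    step = max(len(frames) // n, 1)
    return [(i, frames[i]) for i in range(0, len(frames), step)][:n]

def video_agent(frames, question, llm, clip_retriever, vlm, max_rounds=5):
    # State: captions of the frames inspected so far, keyed by frame index.
    state = {i: vlm.caption(f) for i, f in uniform_sample(frames, n=5)}

    for _ in range(max_rounds):
        # The LLM judges whether the gathered captions suffice to answer.
        decision = llm.assess(question, state)
        if decision["sufficient"]:
            break
        # If not, the LLM phrases the missing information as a text query,
        # and CLIP retrieves the unseen frames most similar to that query.
        retrieved = clip_retriever.search(decision["query"], frames,
                                          exclude=set(state), k=3)
        for i, frame in retrieved:
            state[i] = vlm.caption(frame)  # VLM turns pixels into text

    return llm.answer(question, state)
```

Note the key design choice this sketch illustrates: the LLM never sees raw pixels. It reasons entirely over text captions and requests new frames only when its current evidence is insufficient.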

This design emphasizes reasoning capabilities and iterative processing over the direct processing of lengthy visual inputs. The VLM and CLIP act as instrumental tools, equipping the LLM with visual understanding and long-context retrieval capabilities.

The efficacy of VideoAgent was assessed on two established long-form video understanding benchmarks, EgoSchema and NExT-QA, where it achieved 54.1% and 71.3% accuracy, respectively, surpassing the concurrent state-of-the-art method LLoVi by 3.8% and 3.6%.

In summary, VideoAgent marks a significant advance in long-form video understanding, embracing an agent-based system to emulate human cognitive processes and emphasizing reasoning over the modeling of long-context visual information. The researchers anticipate that their work not only establishes a new benchmark in long-form video understanding but also offers valuable insights for future research in this area.

The paper VideoAgent: Long-form Video Understanding with Large Language Model as Agent is on arXiv.

Author: Hecate He | Editor: Chain Zhang

We know you don’t want to miss any news or research breakthroughs. Subscribe to our popular newsletter Synced Global AI Weekly to get weekly AI updates.
