How Important are Videos for Training Video LLMs?

RWTH Aachen University · ByteDance Seed · Eindhoven University of Technology
Teaser figure: A comparison of the standard video-based training scheme for video LLMs (top) and our proposed pseudo video training scheme (bottom). We use captioned image datasets to automatically generate short pseudo videos and questions for training.

Abstract

Research in Video Large Language Models (LLMs) has progressed rapidly, with multiple models and benchmarks released in the span of a few years. Typically, these models are initialized with a pretrained text LLM and are frequently finetuned on both image- and video-caption datasets. In this paper, we present findings indicating that Video LLMs are more capable of temporal reasoning after image-only training than one would assume, and that improvements from video training are surprisingly small. Specifically, we show that image-trained versions of two LLMs trained with the recently released LongVU algorithm perform significantly above chance level on TVBench, a temporal reasoning benchmark. Moreover, we introduce a simple finetuning scheme involving sequences of annotated images and questions targeting temporal capabilities. This baseline achieves accuracies close to, and occasionally higher than, those of video-trained LLMs, indicating that current models make suboptimal use of the rich temporal features present in real video. Our analysis motivates further research into the mechanisms that allow image-trained LLMs to perform temporal reasoning, as well as into the bottlenecks that render current video training schemes inefficient.
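
To make the baseline concrete, below is a minimal sketch of how pseudo-video samples could be assembled from a generic (image, caption) dataset. The function name, question template, and frame-repetition factor are illustrative assumptions, not the exact recipe used in the paper.

```python
import random

def make_pseudo_video_sample(captioned_images, num_images=4, repeats_per_image=2):
    """Build one training sample from (image, caption) pairs of an image-caption dataset.

    Hypothetical helper for illustration only; names and the question
    template are assumptions, not the paper's exact pipeline.
    """
    # Pick distinct captioned images; the sampled order defines the "temporal" order.
    chosen = random.sample(captioned_images, num_images)

    # Repeat each image so the clip has a video-like number of frames.
    frames = [img for img, _ in chosen for _ in range(repeats_per_image)]

    # A question targeting temporal capabilities: it can only be answered
    # by attending to the order of the frames.
    options = [caption for _, caption in chosen]
    question = "Which of the following is shown first in the video? " + " / ".join(options)
    answer = chosen[0][1]

    return {"frames": frames, "question": question, "answer": answer}
```

Samples built this way can, in principle, be fed through the same video finetuning pipeline as real clips, since they look like short videos paired with order-sensitive questions.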

Quantitative Results
