How Important are Videos for Training Video LLMs?

¹RWTH Aachen University · ²ByteDance Seed · ³Eindhoven University of Technology
Teaser figure: A comparison of the standard video-based training scheme for video LLMs (top) and our proposed pseudo video training scheme (bottom). We utilize captioned image datasets to automatically generate short pseudo videos and questions for training.

Abstract

Research in Video Large Language Models (LLMs) has progressed rapidly, with multiple models and benchmarks released in the span of a few years. Typically, these models are initialized with a pretrained text LLM and are frequently finetuned on both image- and video-caption datasets. In this paper, we present findings indicating that Video LLMs are more capable of temporal reasoning after image-only training than one would assume, and that improvements from video training are surprisingly small. Specifically, we show that image-trained versions of two LLMs trained with the recently released LongVU algorithm perform significantly above chance level on TVBench, a temporal reasoning benchmark. Moreover, we introduce a simple finetuning scheme involving sequences of annotated images and questions targeting temporal capabilities. This baseline results in accuracies close to, and occasionally higher than, those achieved by video-trained LLMs. This indicates suboptimal utilization of rich temporal features found in real video by current models. Our analysis motivates further research into the mechanisms that allow image-trained LLMs to perform temporal reasoning, as well as into the bottlenecks that render current video training schemes inefficient.
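To make the pseudo video idea from the abstract concrete, below is a minimal sketch of how a training sample could be built from an image-caption dataset: a few captioned images are concatenated into a short frame sequence and paired with a temporal-ordering question. The dataset layout, question template, and helper names (`make_pseudo_video`, `PseudoVideoSample`) are illustrative assumptions, not the paper's exact implementation.

```python
# Sketch of pseudo-video sample construction (assumed layout, not the paper's code).
import random
from dataclasses import dataclass
from typing import List

from PIL import Image


@dataclass
class PseudoVideoSample:
    frames: List[Image.Image]   # ordered frames forming the pseudo video
    question: str               # temporal question about the frame order
    answer: str                 # ground-truth answer derived from the order


def make_pseudo_video(captioned_images, num_images=4, frames_per_image=2):
    """Build one pseudo-video training sample from (image, caption) pairs.

    `captioned_images` is assumed to be a list of (PIL.Image, str) tuples,
    e.g. drawn from an existing image-caption dataset.
    """
    chosen = random.sample(captioned_images, num_images)

    # Repeat each still image a few times so the clip has a plausible length.
    frames = [img for img, _ in chosen for _ in range(frames_per_image)]

    # Ask which of two captions appears earlier in the clip; the answer
    # follows directly from the sampled order, so no manual labeling is needed.
    first_idx, second_idx = sorted(random.sample(range(num_images), 2))
    question = (
        "Which of these is shown earlier in the video? "
        f"(A) {chosen[first_idx][1]} (B) {chosen[second_idx][1]}"
    )
    return PseudoVideoSample(frames=frames, question=question, answer="A")
```

Because both the frame order and the captions are known, questions and answers can be generated automatically at scale, which is what allows this baseline to be trained without any real video annotations.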

Quantitative Results

BibTeX

@article{lydakis2025importantvideostrainingvideo,
  title   = {How Important are Videos for Training Video LLMs?},
  author  = {George Lydakis and Alexander Hermans and Ali Athar and Daan de Geus and Bastian Leibe},
  journal = {arXiv preprint arXiv:2506.06928},
  year    = {2025}
}