Research in Video Large Language Models (LLMs) has progressed rapidly, with multiple models and benchmarks released in the span of a few years. These models are typically initialized from a pretrained text-only LLM and then finetuned on both image- and video-caption datasets. In this paper, we present findings indicating that Video LLMs are more capable of temporal reasoning after image-only training than one might expect, and that improvements from video training are surprisingly small. Specifically, we show that image-trained versions of two LLMs trained with the recently released LongVU algorithm perform significantly above chance level on TVBench, a temporal reasoning benchmark. Moreover, we introduce a simple finetuning scheme involving sequences of annotated images and questions targeting temporal capabilities. This baseline achieves accuracies close to, and occasionally higher than, those of video-trained LLMs. This suggests that current models make suboptimal use of the rich temporal information available in real video. Our analysis motivates further research into the mechanisms that allow image-trained LLMs to perform temporal reasoning, as well as into the bottlenecks that render current video training schemes inefficient.
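The abstract does not spell out how the image-sequence finetuning baseline is constructed; as a rough illustration only, the sketch below shows one way a multi-image training sample with a temporal-ordering question could be assembled from independently captioned frames. The `AnnotatedImage` container and `build_temporal_sample` helper are hypothetical names introduced here for illustration, not part of the paper's released code.

```python
import random
from dataclasses import dataclass

@dataclass
class AnnotatedImage:
    """A single image with a per-frame caption (e.g. from an image-caption dataset)."""
    path: str
    caption: str

def build_temporal_sample(images: list[AnnotatedImage], seed: int = 0) -> dict:
    """Hypothetical sketch: turn a short sequence of independently annotated
    images into one multi-image sample with a temporal-ordering question.

    The frames are shuffled, then the question asks which of two captions
    appears earlier in the presented sequence, so answering requires
    attending to frame positions rather than to any single image.
    """
    rng = random.Random(seed)
    shuffled = images[:]
    rng.shuffle(shuffled)

    # Pick two distinct positions in the presented sequence.
    a, b = rng.sample(range(len(shuffled)), 2)
    earlier = min(a, b)

    question = (
        "In this sequence of frames, which is shown earlier: "
        f"'{shuffled[a].caption}' or '{shuffled[b].caption}'?"
    )
    answer = shuffled[earlier].caption

    return {
        "image_paths": [img.path for img in shuffled],
        "question": question,
        "answer": answer,
    }

if __name__ == "__main__":
    frames = [
        AnnotatedImage("frame_0.jpg", "a hand reaches for a cup"),
        AnnotatedImage("frame_1.jpg", "the cup is lifted off the table"),
        AnnotatedImage("frame_2.jpg", "the cup is tilted toward the mouth"),
    ]
    sample = build_temporal_sample(frames, seed=42)
    print(sample["question"])
    print("Answer:", sample["answer"])
```

The point of such a construction, under the assumptions above, is that the answer depends only on the order in which frames are presented, so it probes temporal reasoning without requiring any real video data.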
@article{lydakis2025importantvideostrainingvideo,
  title   = {How Important are Videos for Training Video LLMs?},
  author  = {George Lydakis and Alexander Hermans and Ali Athar and Daan de Geus and Bastian Leibe},
  journal = {arXiv preprint arXiv:2506.06928},
  year    = {2025}
}