Research in Video Large Language Models (LLMs) has progressed rapidly, with multiple models and benchmarks released within the span of a few years. Typically, these models are initialized from a pretrained text LLM and finetuned on both image- and video-caption datasets. In this paper, we present findings indicating that Video LLMs are more capable of temporal reasoning after image-only training than one might assume, and that the improvements from video training are surprisingly small. Specifically, we show that image-trained versions of two LLMs trained with the recently released LongVU algorithm perform significantly above chance level on TVBench, a temporal reasoning benchmark. Moreover, we introduce a simple finetuning scheme involving sequences of annotated images and questions targeting temporal capabilities. This baseline achieves accuracies close to, and occasionally higher than, those of video-trained LLMs, indicating that current models make suboptimal use of the rich temporal features present in real video. Our analysis motivates further research into the mechanisms that allow image-trained LLMs to perform temporal reasoning, as well as into the bottlenecks that render current video training schemes inefficient.
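To make the image-sequence baseline concrete, the sketch below shows one way such a finetuning sample could be assembled from individually captioned frames. It is only an illustration under stated assumptions: the class and function names (`FrameAnnotation`, `build_temporal_sample`), the example captions, and the exact prompt format are hypothetical and not taken from the paper or the LongVU codebase.

```python
# Minimal sketch, assuming a simple frame-level annotation format; the names
# below are hypothetical and not from the paper or the LongVU codebase.
import random
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class FrameAnnotation:
    """One annotated image: a path and a short caption of the event it shows."""
    image_path: str
    caption: str


def build_temporal_sample(frames: List[FrameAnnotation]) -> Dict[str, object]:
    """Turn an ordered list of annotated images into one multi-image
    question-answer pair that targets temporal order.

    The images are presented in their true order; the question asks which of
    two captioned events occurred first, so the label follows directly from
    the frame indices and no real video is required.
    """
    assert len(frames) >= 2, "need at least two frames to ask about order"

    # One <image> placeholder per frame, in temporal order.
    image_tokens = " ".join("<image>" for _ in frames)

    # Pick two distinct events and ask which came first.
    i, j = sorted(random.sample(range(len(frames)), 2))
    question = (
        f"{image_tokens}\nThese images are shown in temporal order. "
        f'Which event happened first: "{frames[i].caption}" '
        f'or "{frames[j].caption}"?'
    )
    answer = frames[i].caption  # the lower index is the earlier event

    return {
        "images": [f.image_path for f in frames],
        "question": question,
        "answer": answer,
    }


if __name__ == "__main__":
    frames = [
        FrameAnnotation("frame_000.jpg", "a person opens the fridge"),
        FrameAnnotation("frame_001.jpg", "the person pours a glass of milk"),
        FrameAnnotation("frame_002.jpg", "the person drinks the milk"),
    ]
    sample = build_temporal_sample(frames)
    print(sample["question"])
    print("Answer:", sample["answer"])
```

Because every label is derived purely from the index order of still images, samples of this kind can probe whether gains attributed to video training actually require video input at all.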