Your VLM is Not Good at Counting Pushups

Introducing PushupBench: A Benchmark for Temporal Reasoning in Vision-Language Models

Shengzhi Li, Jiarun Chen, Karun Sharma, Jiaqi Su, Shichao Pei

TL;DR

Large vision-language models can recognize what happens in a video but fail to count how many times it happens. The best frontier model achieves only 42.1% exact accuracy on repetition counting. Open-source 4B models score ~6%, matching a constant predictor. Fine-tuning on counting transfers to general video understanding.

446 Video Clips
36.7s Avg. Duration
375 Unique Exercises
11 Content Creators

Evaluate with lmms-eval

PushupBench is integrated into lmms-eval (PR #1262). You can evaluate any supported VLM with a single command.

Installation

pip install lmms-eval

Run Evaluation

python -m lmms_eval \
    --model qwen2_5_vl \
    --model_args pretrained=Qwen/Qwen2.5-VL-7B-Instruct \
    --tasks pushupbench \
    --batch_size 1 \
    --log_samples \
    --output_path ./logs/

The task reports four metrics:

Abstract

Large vision-language models (VLMs) have achieved remarkable progress in semantic video understanding—correctly identifying that a video shows "a person doing squats." Yet when asked how many squats were performed, even frontier models like Gemini 3 Flash struggle, achieving only 42.1% exact accuracy on our benchmark.

We introduce PushupBench, 446 long-form clips (avg. 36.7s) for evaluating repetition counting. The best frontier model achieves 42.1% exact accuracy; open-source 4B models score ~6%, matching supervised baselines. We show that accuracy alone misleads—weaker models exploit the modal count rather than reason temporally.

Fine-tuning on counting with 1k samples transfers to general video understanding: MVBench (+2.15), PerceptionTest (+1.88), TVBench (+4.54), suggesting counting is a proxy for broader temporal reasoning.

The Problem: Counting Reveals Temporal Blindness

Unlike action recognition, which can often be solved from a single frame, repetition counting requires:

  • Tracking state changes across time
  • Detecting action boundaries
  • Maintaining a coherent count despite variable speeds, camera motion, and appearance changes

This makes counting an ideal diagnostic probe for temporal reasoning capabilities.
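
To make the three requirements above concrete, here is a minimal sketch of counting as a state-tracking problem. It is an illustration only, not the method used by any model in this benchmark: it assumes the video has already been reduced to a hypothetical 1-D motion signal (for example, normalized body height per frame), and the function name `count_reps` and both thresholds are assumptions chosen for the example.

```python
from typing import Sequence


def count_reps(signal: Sequence[float], low: float, high: float) -> int:
    """Count full down-up cycles in a 1-D signal using hysteresis thresholds.

    A repetition is counted each time the signal drops below `low`
    and then rises back above `high`. Using two thresholds instead of
    one avoids double counting when the signal jitters near a boundary.
    """
    if low >= high:
        raise ValueError("low must be smaller than high")
    reps = 0
    is_down = False  # tracked state: has the bottom of the current rep been reached?
    for value in signal:
        if not is_down and value < low:
            is_down = True   # action boundary: entered the "down" phase
        elif is_down and value > high:
            is_down = False  # action boundary: returned to the "up" phase
            reps += 1
    return reps


# Hypothetical normalized height trace: three pushups performed at varying speed.
trace = [1.0, 0.6, 0.2, 0.5, 0.9, 0.3, 0.1, 0.1, 0.4, 0.95, 0.7, 0.15, 0.8, 0.9]
print(count_reps(trace, low=0.3, high=0.8))
```

Even this toy counter has to carry state from frame to frame, which is exactly what a single-frame recognizer never needs to do.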

R² Reveals What Accuracy Hides

Figure: Predicted vs. ground truth repetition counts for four models. Diagonal = perfect prediction. Gemini 3 Flash shows strong fit (R²=0.82); base Qwen3-VL-4B-Thinking collapses to constant predictions (R²=-0.34).

Key Insight

Always predicting "10", the modal count in fitness data, achieves roughly 10% exact match (9.9% on PushupBench) without any counting. R² distinguishes models that actually count from those that exploit dataset statistics. A model with higher exact match but negative R² is not counting.
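
The contrast between exact match and R² can be reproduced on toy data. The sketch below assumes the standard definitions of exact match, mean absolute error, and the coefficient of determination; it is not the lmms-eval implementation, and the ground-truth counts and both sets of predictions are made up for illustration.

```python
from typing import Sequence


def exact_match(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Fraction of predictions that equal the ground truth exactly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def mae(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean absolute error between predicted and true counts."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_score(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return 1.0 - ss_res / ss_tot


# Hypothetical ground-truth counts with mode 10 and mean 15.
gt = [10, 10, 10, 12, 15, 20, 8, 30, 25, 10]
always_ten = [10] * len(gt)                             # exploits the mode
off_by_one = [11, 9, 11, 13, 14, 21, 9, 29, 26, 11]     # tracks the count, never exact

for name, pred in [("always 10", always_ten), ("off by one", off_by_one)]:
    print(name, exact_match(gt, pred), mae(gt, pred), round(r2_score(gt, pred), 3))
```

On this toy data the constant predictor reaches 40% exact match with a negative R², while the off-by-one predictor scores 0% exact match with R² close to 1. Exact match alone would rank them the wrong way round.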

Benchmark Results

Model                      Exact %   MAE    R²

Commercial VLMs (Gemini @ 5fps)
Gemini 3 Flash             42.1      2.9    0.82
Gemini 3 Pro               39.8      3.6    0.70
Gemini 2.5 Pro             29.9      5.7    0.63
Gemini 2.5 Flash           10.9      7.7    0.49
Gemini 2.0 Flash            8.9      7.8    0.25

Commercial VLMs (Other)
GPT-5                      10.9      7.6    0.01
Claude Sonnet 4.5           9.5      9.0    0.00
Claude Opus 4.5             4.9      9.7    0.09

Open-Source VLMs (Qwen3-VL @ 5fps)
Qwen3-VL-4B-Instruct        8.9      8.2   -0.21
Qwen3-VL-4B-Thinking        8.2      8.9   -0.34
  + DAPO (Ours)             5.9      7.2    0.23
Qwen3-VL-8B-Instruct        7.2      7.9   -0.11
Qwen3-VL-32B-Instruct      11.8      7.7    0.36

Baselines
TransRAC (supervised)       6.7      9.1   -0.18
Const. (mode=10)            9.9      8.5   -0.21
Const. (mean=17)            2.9      9.2    0.00

R² measures the variance explained beyond a mean-prediction baseline. Negative values mean the predictions fit worse than always predicting the mean, as happens with constant or near-random output.
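
Why a constant predictor can never score above zero follows directly from the definition. The derivation below assumes the standard coefficient of determination, with $y_i$ the ground-truth counts, $\hat{y}_i$ the predictions, $\bar{y}$ the ground-truth mean, and $n$ the number of clips; the page does not spell out its exact formula, so treat this as a sketch under that assumption.

```latex
R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}

% For a constant predictor \hat{y}_i = c, the residual sum of squares splits as
\sum_{i=1}^{n} (y_i - c)^2 = \sum_{i=1}^{n} (y_i - \bar{y})^2 + n\,(\bar{y} - c)^2

% so that
R^2_{\text{const}} = -\,\frac{n\,(\bar{y} - c)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} \;\le\; 0,
\qquad \text{with equality only when } c = \bar{y}.
```

This matches the two constant baselines in the table: predicting the mean (17) gives R² = 0.00, and predicting the mode (10) gives a negative value.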

Why Does Predicting "10" Work?

Figure: Ground truth distribution in training data (left) and PushupBench (right). Both share mode=10, reflecting human preference for 10-rep workout sets.

People commonly perform exercises in sets of 10 reps. Models that collapse to always predicting "10" achieve ~10% exact match without any temporal reasoning. This is why R² is essential: it penalizes this behavior while exact match rewards it.
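
A quick way to check how much a benchmark rewards this shortcut is to compute the exact-match score of the best constant predictor from the label distribution alone. The sketch below uses a made-up list of labels, not the PushupBench annotations, and the helper name `modal_baseline` is an assumption for illustration.

```python
from collections import Counter
from typing import Sequence, Tuple


def modal_baseline(counts: Sequence[int]) -> Tuple[int, float]:
    """Return the most frequent count and the exact-match rate of always predicting it."""
    mode, freq = Counter(counts).most_common(1)[0]
    return mode, freq / len(counts)


# Hypothetical ground-truth labels for 20 clips; 10 is the most common count.
labels = [10, 10, 10, 8, 12, 15, 15, 20, 20, 25,
          30, 12, 16, 6, 24, 40, 18, 14, 22, 9]

mode, rate = modal_baseline(labels)
print(f"always predicting {mode} gives {rate:.0%} exact match")
```

Any model whose exact match is close to this rate deserves a second look at its R² before its score is read as evidence of counting.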

Counting Transfers to General Video Understanding

A key finding is that training on repetition counting transfers to general video understanding tasks. Fine-tuning Qwen3-VL-4B-Thinking on just 968 counting samples improves performance across four unrelated benchmarks:

Model            MotionBench   PerceptionTest   MVBench   TVBench
Base             58.2          72.57            65.75     52.96
+ DAPO (Ours)    58.7          74.45            67.90     57.50
Δ                +0.5          +1.88            +2.15     +4.54

Where Does It Help Most?

Category-level analysis reveals improvements concentrate in tasks requiring temporal state tracking:

  • MVBench: moving_count (+8.5), counterfactual_inference (+12.5), scene_transition (+6.0)
  • TVBench: action_sequence (+9.2), action_localization (+5.6), moving_direction (+40.0)

These categories share a common requirement: detecting when events occur, not just what happens.

Dataset Diversity

Figure: PushupBench spans 375 unique exercise types across 446 clips (84% uniqueness), ensuring evaluation of generalization rather than memorization.

Content Creator Diversity

Creator               Country          Gender   # Videos
Pamela Reif           Germany          F        6
Chloe Ting            Australia        F        1
Caroline Girvan       N. Ireland       F        1
Growingannanas        Austria          F        1
Eylem Abaci           Germany          F        8
Chris Heria           USA              M        1
MIZI                  Korea/Malaysia   F        1
Toned w/ Alexandra    Sweden/Spain     F        9
Shirlyn Kim           South Korea      F        2
Lucy Wyndham-Read     UK               F        1
Oliver Sjostrom       Sweden           M        3

11 creators across 10 countries ensure broad representation across gender, geography, and presentation style.

Challenges: Reward Hacking in RL Training

During RL fine-tuning, we identified multiple reward hacking patterns that models exploit instead of learning to count:

Mode Collapse

Always predicting "10" yields non-zero expected reward without any counting.

Frame Count Exploitation

Models learn to output half the frame count rather than counting repetitions.
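
One way to screen for this pattern is to measure how often a model's answer coincides with half the number of sampled frames when the true count does not. This is a hypothetical diagnostic, not something described in the paper: the function name `frame_shortcut_rate` and the example values are assumptions.

```python
from typing import Sequence


def frame_shortcut_rate(
    preds: Sequence[int],
    frame_counts: Sequence[int],
    gts: Sequence[int],
) -> float:
    """Share of predictions equal to frame_count // 2 on clips where that value is wrong.

    Clips whose ground truth happens to equal frame_count // 2 are skipped,
    because a match there cannot separate counting from the shortcut.
    """
    hits = 0
    eligible = 0
    for pred, frames, gt in zip(preds, frame_counts, gts):
        shortcut = frames // 2
        if gt == shortcut:
            continue  # ambiguous clip: shortcut and correct answer coincide
        eligible += 1
        if pred == shortcut:
            hits += 1
    return hits / eligible if eligible else 0.0


# Hypothetical predictions, sampled frame counts, and ground truth for four clips.
preds = [8, 16, 10, 12]
frame_counts = [16, 32, 40, 24]
gts = [8, 5, 10, 7]
print(frame_shortcut_rate(preds, frame_counts, gts))
```

A rate far above what chance agreement would produce suggests the policy is reading the input length rather than the motion.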

On-Screen Text

VLMs read on-screen counters and timers rather than counting motion.

Figure: Example frame before editing (on-screen counter "03" visible) and after editing (counter removed).

We manually edited out on-screen counters from benchmark videos to ensure evaluation measures genuine counting ability.

BibTeX

@inproceedings{li2025pushupbench,
  title     = {Your VLM is Not Good at Counting Pushups},
  author    = {Li, Shengzhi and Chen, Jiarun and Sharma, Karun and Su, Jiaqi and Pei, Shichao},
  year      = {2025},
}