PushupBench is integrated into lmms-eval (PR #1262). You can evaluate any supported VLM with a single command.
    pip install lmms-eval
    python -m lmms_eval \
        --model qwen2_5_vl \
        --model_args pretrained=Qwen/Qwen2.5-VL-7B-Instruct \
        --tasks pushupbench \
        --batch_size 1 \
        --log_samples \
        --output_path ./logs/
The task reports four metrics:
Large vision-language models (VLMs) have achieved remarkable progress in semantic video understanding—correctly identifying that a video shows "a person doing squats." Yet when asked how many squats were performed, even frontier models like Gemini 3 Flash struggle, achieving only 42.1% exact accuracy on our benchmark.
We introduce PushupBench, 446 long-form clips (avg. 36.7s) for evaluating repetition counting. The best frontier model achieves 42.1% exact accuracy; open-source 4B models score ~6%, matching supervised baselines. We show that accuracy alone misleads—weaker models exploit the modal count rather than reason temporally.
Fine-tuning on counting with 1k samples transfers to general video understanding: MVBench (+2.15), PerceptionTest (+1.88), TVBench (+4.54), suggesting counting is a proxy for broader temporal reasoning.
Unlike action recognition, which can often be solved from a single frame, repetition counting requires:
This makes counting an ideal diagnostic probe for temporal reasoning capabilities.
Figure: Predicted vs. ground truth repetition counts. Diagonal = perfect prediction. Gemini 3 Flash shows strong fit (R² = 0.82); base Qwen3-VL-4B collapses to constant predictions (R² = -0.34).
Always predicting "10", the modal count in fitness data, already yields about 10% exact match (9.9% for the constant baseline in the table below). R² distinguishes models that actually count from those that exploit dataset statistics: a model with higher exact match but negative R² is not counting.
| Group | Model | Exact % | MAE | R² |
|---|---|---|---|---|
| Commercial VLMs (Gemini @ 5fps) | Gemini 3 Flash | 42.1 | 2.9 | 0.82 |
| Commercial VLMs (Gemini @ 5fps) | Gemini 3 Pro | 39.8 | 3.6 | 0.70 |
| Commercial VLMs (Gemini @ 5fps) | Gemini 2.5 Pro | 29.9 | 5.7 | 0.63 |
| Commercial VLMs (Gemini @ 5fps) | Gemini 2.5 Flash | 10.9 | 7.7 | 0.49 |
| Commercial VLMs (Gemini @ 5fps) | Gemini 2.0 Flash | 8.9 | 7.8 | 0.25 |
| Commercial VLMs (Other) | GPT-5 | 10.9 | 7.6 | 0.01 |
| Commercial VLMs (Other) | Claude Sonnet 4.5 | 9.5 | 9.0 | 0.00 |
| Commercial VLMs (Other) | Claude Opus 4.5 | 4.9 | 9.7 | 0.09 |
| Open-Source VLMs (Qwen3-VL @ 5fps) | Qwen3-VL-4B-Instruct | 8.9 | 8.2 | -0.21 |
| Open-Source VLMs (Qwen3-VL @ 5fps) | Qwen3-VL-4B-Thinking | 8.2 | 8.9 | -0.34 |
| Open-Source VLMs (Qwen3-VL @ 5fps) | + DAPO (Ours) | 5.9 | 7.2 | 0.23 |
| Open-Source VLMs (Qwen3-VL @ 5fps) | Qwen3-VL-8B-Instruct | 7.2 | 7.9 | -0.11 |
| Open-Source VLMs (Qwen3-VL @ 5fps) | Qwen3-VL-32B-Instruct | 11.8 | 7.7 | 0.36 |
| Baselines | TransRAC (supervised) | 6.7 | 9.1 | -0.18 |
| Baselines | Const. (mode=10) | 9.9 | 8.5 | -0.21 |
| Baselines | Const. (mean=17) | 2.9 | 9.2 | 0.00 |
R² measures the variance explained beyond the mean baseline. A negative value means the model does worse than always predicting the mean, which is typical of constant (off-mean) or random output.
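To make the sign of R² concrete, here is a short derivation. It assumes R² is the standard coefficient of determination (the page does not spell out the formula), with $y_i$ the ground-truth counts, $\hat{y}_i$ the predictions, $\bar{y}$ the mean ground-truth count, and $\sigma_y^2$ the variance of the ground-truth counts.

$$
R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}
$$

For a constant predictor $\hat{y}_i = c$, the residual sum splits as $\sum_i (y_i - c)^2 = \sum_i (y_i - \bar{y})^2 + n(\bar{y} - c)^2$, so

$$
R^2_{\text{const}} = -\frac{(c - \bar{y})^2}{\sigma_y^2}, \qquad \sigma_y^2 = \frac{1}{n}\sum_i (y_i - \bar{y})^2 .
$$

Under this assumption a constant prediction can never score above zero: it reaches exactly 0 at $c = \bar{y}$ and becomes more negative the further $c$ sits from the mean. This is consistent with the two constant baselines in the table above (0.00 for the mean, negative for the mode).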
Figure: Ground truth distribution in training data (left) and PushupBench (right). Both share mode=10, reflecting human preference for 10-rep workout sets.
People commonly perform exercises in sets of 10 reps. Models that collapse to always predicting "10" achieve ~10% exact match without any temporal reasoning. This is why R² is essential: it penalizes this behavior while exact match rewards it.
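The effect can be reproduced on toy numbers. The sketch below is illustrative only: the counts are made up (not PushupBench data), the function names are hypothetical, and R² is assumed to be the standard coefficient of determination. It compares a predictor that always outputs the mode with one that tracks the true count but is always off by one.

```python
from collections import Counter

def exact_match(y_true, y_pred):
    """Fraction of predictions that equal the ground truth exactly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def mae(y_true, y_pred):
    """Mean absolute error between predicted and true counts."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Toy ground truth with a mode at 10 (made-up values, not benchmark data).
y_true = [10, 10, 10, 5, 8, 12, 15, 20, 30, 40]

# Predictor A: always output the modal count, no temporal reasoning at all.
mode = Counter(y_true).most_common(1)[0][0]
always_mode = [mode] * len(y_true)

# Predictor B: follows the true count but is off by one on every clip.
offsets = [1, -1, 1, -1, 1, -1, 1, -1, 1, -1]
off_by_one = [t + d for t, d in zip(y_true, offsets)]

for name, pred in [("always mode", always_mode), ("off by one", off_by_one)]:
    print(f"{name:12s} exact={exact_match(y_true, pred):.2f} "
          f"MAE={mae(y_true, pred):.2f} R2={r2(y_true, pred):.2f}")
```

On this toy set the mode predictor wins on exact match while its R² is negative, and the off-by-one predictor scores zero exact match with R² close to 1, which is the pattern the two metrics are meant to separate.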
A key finding is that training on repetition counting transfers to general video understanding tasks. Fine-tuning Qwen3-VL-4B-Thinking on just 968 counting samples improves performance across four unrelated benchmarks:
| Model | MotionBench | PerceptionTest | MVBench | TVBench |
|---|---|---|---|---|
| Base | 58.2 | 72.57 | 65.75 | 52.96 |
| + DAPO (Ours) | 58.7 | 74.45 | 67.90 | 57.50 |
| Δ | +0.5 | +1.88 | +2.15 | +4.54 |
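As a quick arithmetic check, the Δ row can be recomputed from the two score rows. The snippet uses only the numbers in the table above; the unweighted average across benchmarks is an added convenience and is not a figure reported by the authors.

```python
# Scores copied from the table above.
benchmarks = ["MotionBench", "PerceptionTest", "MVBench", "TVBench"]
base = [58.2, 72.57, 65.75, 52.96]
dapo = [58.7, 74.45, 67.90, 57.50]

# Per-benchmark gain, rounded to two decimals to absorb float noise.
deltas = [round(d - b, 2) for b, d in zip(base, dapo)]

# Unweighted mean gain across the four benchmarks (not reported in the source).
mean_gain = round(sum(deltas) / len(deltas), 2)

for name, delta in zip(benchmarks, deltas):
    print(f"{name:15s} +{delta:.2f}")
print(f"{'mean':15s} +{mean_gain:.2f}")
```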
Category-level analysis reveals improvements concentrate in tasks requiring temporal state tracking:
These categories share a common requirement: detecting when events occur, not just what happens.
Figure: PushupBench spans 375 unique exercise types across 446 clips (84% uniqueness), ensuring evaluation of generalization rather than memorization.
| Creator | Country | Gender | # Videos |
|---|---|---|---|
| Pamela Reif | Germany | F | 6 |
| Chloe Ting | Australia | F | 1 |
| Caroline Girvan | N. Ireland | F | 1 |
| Growingannanas | Austria | F | 1 |
| Eylem Abaci | Germany | F | 8 |
| Chris Heria | USA | M | 1 |
| MIZI | Korea/Malaysia | F | 1 |
| Toned w/ Alexandra | Sweden/Spain | F | 9 |
| Shirlyn Kim | South Korea | F | 2 |
| Lucy Wyndham-Read | UK | F | 1 |
| Oliver Sjostrom | Sweden | M | 3 |
11 creators across 10 countries ensure broad representation across gender, geography, and presentation style.
During RL fine-tuning, we identified multiple reward hacking patterns that models exploit instead of learning to count:
Always predicting "10" yields non-zero expected reward without any counting.
Models learn to output half the frame count rather than counting repetitions.
VLMs read on-screen counters and timers rather than counting motion.
Figure: Before (counter "03" visible) and after (counter removed).
We manually edited out on-screen counters from benchmark videos to ensure evaluation measures genuine counting ability.
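The first two patterns can be spotted from model outputs alone. The sketch below is a hypothetical diagnostic, not the authors' tooling: the function names, the toy values, and the assumption of a plain exact-match reward are all illustrative. On-screen counter reading is not covered, since detecting it requires inspecting the video itself.

```python
from collections import Counter

def expected_exact_reward(constant, ground_truth):
    """Average exact-match reward earned by always answering `constant`."""
    return sum(t == constant for t in ground_truth) / len(ground_truth)

def mode_collapse_rate(preds):
    """Fraction of predictions equal to the single most common prediction."""
    _, freq = Counter(preds).most_common(1)[0]
    return freq / len(preds)

def frame_shortcut_rate(preds, frame_counts):
    """Fraction of predictions equal to half the number of input frames."""
    hits = sum(p == n // 2 for p, n in zip(preds, frame_counts))
    return hits / len(preds)

# Toy values for illustration only (not taken from the benchmark).
ground_truth = [10, 10, 12, 8, 15, 10, 20, 6]
frame_counts = [40, 60, 30, 50, 80, 44, 90, 24]

collapsed_preds = [10] * len(ground_truth)      # constant-output policy
shortcut_preds = [n // 2 for n in frame_counts]  # half-the-frames policy

print(expected_exact_reward(10, ground_truth))             # reward for always "10"
print(mode_collapse_rate(collapsed_preds))                 # 1.0 means full collapse
print(frame_shortcut_rate(shortcut_preds, frame_counts))   # 1.0 means pure shortcut
```

A high collapse or shortcut rate on a held-out batch is a hint, under these assumptions, that the policy is optimizing the reward without counting.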
@inproceedings{li2025pushupbench,
title = {Your VLM is Not Good at Counting Pushups},
author = {Li, Shengzhi and Chen, Jiarun and Sharma, Karun and Su, Jiaqi and Pei, Shichao},
year = {2025},
}