Self-correction is essential for solving complex reasoning problems in vision–language models (VLMs). However, existing reinforcement learning (RL) methods struggle to learn it, as effective self-correction behaviors emerge only rarely, making learning signals extremely sparse.
To address this challenge, we propose correction-specific rollouts (Octopus), a rollout augmentation framework that synthesizes dense self-correction examples by recombining existing rollouts. This augmentation simultaneously improves sample efficiency due to rollout reuse and stabilizes RL optimization through balanced supervision.
Furthermore, we introduce a response-masking strategy that decouples self-correction from direct reasoning, avoiding signal conflicts and enabling both behaviors to be learned effectively. Building on this, we develop Octopus-8B, a reasoning VLM with controllable self-correction capability. Across 7 benchmarks, it achieves state-of-the-art performance among open-source VLMs.
Why existing RL methods struggle to learn effective self-correction
Self-correction is not explicitly taught to the model; only the outcome reward serves as supervision.
Even with reflective words appended, the model struggles to revise incorrect answers.
A common remedy is to teach self-correction through reflection-based prompts and shaped rewards.
Why it is hard to learn self-correction:
When signals are sparse, we synthesize them.
A key observation motivating our approach is that effective self-correction signals already exist in standard RL rollouts: for a given input, incorrect and correct responses often coexist within the same rollout group. By pairing them, we explicitly construct samples that demonstrate effective correction behavior.
Produces dense, explicit self-correction examples through rollout recombination
Balances positive and negative samples, stabilizing RL optimization
Substantially improves efficiency by reusing existing rollouts
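A minimal sketch of this recombination step is shown below, assuming each rollout stores its prompt, sampled response, and a binary correctness label from the verifier; the Rollout class, build_correction_samples function, and reflection trigger are illustrative placeholders rather than the paper's released code.

from dataclasses import dataclass
from itertools import product

@dataclass
class Rollout:
    prompt: str        # question text (plus image reference for a VLM)
    response: str      # sampled reasoning trace and final answer
    is_correct: bool   # binary outcome reward from the verifier

# Assumed reflection trigger inserted between the flawed attempt and its correction.
REFLECTION_TRIGGER = "\nWait, let me re-examine my previous answer.\n"

def build_correction_samples(group):
    """Pair incorrect and correct responses from the same rollout group into
    dense self-correction examples: the wrong attempt (o_1) becomes fixed
    context and the correct response (o_2) becomes the correction target."""
    wrong = [r for r in group if not r.is_correct]
    right = [r for r in group if r.is_correct]
    return [
        {
            "context": bad.prompt + bad.response + REFLECTION_TRIGGER,  # o_1 as context
            "target": good.response,                                    # o_2 supervises correction
        }
        for bad, good in product(wrong, right)
    ]

Pairing every incorrect response with every correct one in the group turns a handful of ordinary rollouts into many explicit correction examples, which is where the density and reuse benefits above come from.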
Decoupling self-correction from direct reasoning
Two Reward Designs
Jointly optimizing self-correction and direct reasoning limits self-correction learning. Reward shaping alone fails to decouple these signals and instead induces reward hacking and mode collapse.
Stage I: Learning Self-Correction Only
Stage II: Co-evolving Both Capabilities
Stage I isolates self-correction learning by treating $o_1$ as fixed context and updating the policy only from $o_2$. Stage II selectively unmasks $o_1$ when reward signals are non-conflicting, co-evolving both direct reasoning and self-correction.
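A rough illustration of the response masking is sketched below; the plain REINFORCE-style objective and the reward_consistent flag stand in for the actual RL objective (e.g., a clipped policy-gradient loss) and the paper's unmasking criterion, which we do not reproduce here.

import torch

def masked_policy_loss(logprobs, advantage, o1_len, stage=1, reward_consistent=False):
    """Per-token policy loss over the concatenated o_1 + o_2 tokens.

    logprobs:  tensor of shape [T] with per-token log-probabilities
    advantage: scalar (e.g., group-relative) advantage for the sequence
    o1_len:    number of tokens belonging to the first response o_1
    """
    mask = torch.ones_like(logprobs)
    # Stage I: o_1 is fixed context, so its tokens never receive gradient.
    # Stage II: o_1 is unmasked only when its reward signal does not conflict
    # with the self-correction objective.
    if stage == 1 or not reward_consistent:
        mask[:o1_len] = 0.0
    return -(mask * advantage * logprobs).sum() / mask.sum().clamp(min=1.0)

Masking in Stage I means gradients flow only through the correction segment o_2, which is what isolates self-correction from direct reasoning; Stage II relaxes the mask so both capabilities can improve together.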
If you find our work useful, please consider citing:
@article{ding2025octopus,
  title={Learning Self-Correction in Vision–Language Models via Rollout Augmentation},
  author={Ding, Yi and Qiu, Ziliang and Li, Bolian and Zhang, Ruqi},
  journal={arXiv preprint},
  year={2025}
}