Learning Self-Correction in Vision–Language Models via Rollout Augmentation

1Purdue University · 2University of Illinois Urbana-Champaign
Example Question

Given an image of a football field and the question "What's the lowest number yard line that you can see?", Octopus produces an initial answer and then revises it in its correction trace.
Introduction

Self-correction is essential for solving complex reasoning problems in vision–language models (VLMs). However, existing reinforcement learning (RL) methods struggle to learn it, as effective self-correction behaviors emerge only rarely, making learning signals extremely sparse.

To address this challenge, we propose correction-specific rollouts (Octopus), a rollout augmentation framework that synthesizes dense self-correction examples by recombining existing rollouts. This augmentation simultaneously improves sample efficiency due to rollout reuse and stabilizes RL optimization through balanced supervision.

Furthermore, we introduce a response-masking strategy that decouples self-correction from direct reasoning, avoiding signal conflicts and enabling both behaviors to be learned effectively. Building on these components, we develop Octopus-8B, a reasoning VLM with controllable self-correction capability. Across seven benchmarks, it achieves state-of-the-art performance among open-source VLMs.

Sparsity of Self-Correction Signals

Why existing RL methods struggle to learn effective self-correction

Limitation of Traditional RL

Self-correction is not explicitly taught to the model; only the outcome reward serves as supervision.

  • Training: No explicit signal to improve self-correction capability
  • Inference: Unable to reliably control self-correction behavior
Traditional RL: even with reflective words appended, the model struggles to revise incorrect answers.

Limitation of Prompt-Encouraged Self-Correction RL

These methods (e.g., SRPO) teach self-correction through reflection-based prompts and shaped rewards.

Why it is still hard to learn self-correction:

  • Sparse Signals: Effective self-correction trajectories are extremely rare
  • Reward Hacking: The model maximizes reward through direct generation instead of learning to self-correct

Octopus Framework

When signals are sparse, we synthesize them.

Octopus: Rollout Augmentation for Dense Self-Correction Signals

A key observation motivating our approach is that effective self-correction signals already exist in standard RL rollouts: for a given input, incorrect and correct responses often coexist within the same rollout group. By pairing them, we explicitly construct samples that demonstrate effective correction behavior.
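A minimal sketch of how correction-specific rollouts could be assembled from a standard rollout group, assuming a simple container for rollouts and an illustrative reflection cue; the names, the pairing heuristic, and the trace format are assumptions for exposition rather than the paper's exact procedure.

```python
import random
from dataclasses import dataclass

@dataclass
class Rollout:
    response: str   # full reasoning trace sampled from the policy
    correct: bool   # outcome reward: does the final answer match the ground truth?

# Hypothetical cue separating the pre-correction and post-correction responses.
REFLECTION_CUE = "Wait, let me re-examine the image and my reasoning."

def synthesize_correction_rollouts(group: list[Rollout], max_pairs: int = 4) -> list[dict]:
    """Pair incorrect and correct responses from the same rollout group into
    explicit (pre-correction o1, post-correction o2) self-correction samples."""
    wrong = [r for r in group if not r.correct]
    right = [r for r in group if r.correct]
    pairs = []
    for _ in range(min(max_pairs, len(wrong) * len(right))):
        o1 = random.choice(wrong)   # pre-correction response, used as context
        o2 = random.choice(right)   # post-correction response, the supervised target
        pairs.append({
            "o1": o1.response + "\n" + REFLECTION_CUE,
            "o2": o2.response,
        })
    return pairs
```

Because the pairs reuse responses that were already sampled, any rollout group containing at least one correct and one incorrect response yields dense correction examples at no extra generation cost.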


Dense Self-correction Signals

Produces dense, explicit self-correction examples through rollout recombination

Balanced Training Signals

Balances positive and negative samples, stabilizing RL optimization

Sample Efficiency

Substantially improves efficiency by reusing existing rollouts

Training via Response-Masking

Decoupling self-correction from direct reasoning

The Problem: Conflicts in RL Training

Two Reward Designs

  • Binary Reward: Only assigns reward based on final answer correctness
  • Shaped Reward: Encourages positive self-correction (see the sketch after this list):
    • Incorrect → Correct: 1.0
    • Correct → Correct: 0.75
    • Incorrect → Incorrect: 0.0
    • Correct → Incorrect: -0.25
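The shaped reward above can be written as a small function; a minimal sketch, with argument names chosen for illustration.

```python
def shaped_reward(pre_correct: bool, post_correct: bool) -> float:
    """Shaped reward over the (pre-correction, post-correction) correctness pair."""
    if post_correct:
        return 1.0 if not pre_correct else 0.75   # incorrect -> correct is rewarded most
    return 0.0 if not pre_correct else -0.25      # correct -> incorrect is penalized
```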
Takeaway

Jointly optimizing self-correction and direct reasoning limits self-correction learning. Reward shaping alone fails to decouple these signals and instead induces reward hacking and mode collapse.

The Solution: Two-Stage Training

Stage I: Learning Self-Correction Only

  • Treat pre-correction response $o_1$ as input
  • Mask loss for all tokens in $o_1$
  • Apply KL loss on $o_1$ to constrain distribution

Stage II: Co-evolving Both Capabilities

  • Remove KL constraint
  • Unmask $o_1$ only for samples with consistent correctness
  • Prevent gradient conflicts for inconsistent samples
Takeaway

Stage I isolates self-correction learning by treating $o_1$ as fixed context and updating the policy only from $o_2$. Stage II selectively unmasks $o_1$ when reward signals are non-conflicting, co-evolving both direct reasoning and self-correction.
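A minimal sketch of the response-masking idea, assuming a token-level policy-gradient loss over the concatenated sequence $[o_1; o_2]$; the helper name and tensor layout are illustrative assumptions, not the paper's exact implementation.

```python
import torch

def build_loss_mask(o1_len: int, o2_len: int, stage: int,
                    pre_correct: bool = False, post_correct: bool = False) -> torch.Tensor:
    """Return a 0/1 mask over the concatenated [o1; o2] tokens.

    Stage I : o1 is fixed context, so its tokens receive no policy gradient
              (a KL term on o1, not shown, keeps its distribution close to the reference).
    Stage II: o1 is unmasked only when pre- and post-correction correctness agree,
              so conflicting reward signals never reach the direct-reasoning tokens.
    """
    mask_o2 = torch.ones(o2_len)                       # o2 is always optimized
    if stage == 1:
        mask_o1 = torch.zeros(o1_len)                  # mask every o1 token
    else:
        consistent = (pre_correct == post_correct)     # e.g. correct -> correct
        mask_o1 = torch.ones(o1_len) if consistent else torch.zeros(o1_len)
    return torch.cat([mask_o1, mask_o2])

# The masked objective is then, e.g., (mask * per_token_pg_loss).sum() / mask.sum().
```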

Experiments

01 Main Results
02 Ablation Study
03 Test-Time Scaling

Case Studies

Reference

If you find our work useful, please consider citing:

@article{ding2025octopus,
  title={Learning Self-Correction in Vision–Language Models via Rollout Augmentation},
  author={Ding, Yi and Qiu, Ziliang and Li, Bolian and Zhang, Ruqi},
  journal={arXiv preprint},
  year={2025}
}