Self-correction is essential for solving complex reasoning problems in vision–language models (VLMs). However, existing reinforcement learning (RL) methods struggle to learn it, as effective self-correction behaviors emerge only rarely, making learning signals extremely sparse.
To address this challenge, we propose correction-specific rollouts (Octopus), a rollout augmentation framework that synthesizes dense self-correction examples by recombining existing rollouts. This augmentation simultaneously improves sample efficiency due to rollout reuse and stabilizes RL optimization through balanced supervision.
Furthermore, we introduce a response-masking strategy that decouples self-correction from direct reasoning, avoiding signal conflicts and enabling both behaviors to be learned effectively. Building on this, we develop Octopus-8B, a reasoning VLM with controllable self-correction capability. Across 7 benchmarks, it achieves state-of-the-art performance among open-source VLMs.
Why existing RL methods struggle to learn effective self-correction
Self-correction is not explicitly taught to the model; only the outcome reward serves as supervision.
Even with reflective words appended, the model struggles to revise incorrect answers.
A common remedy is to teach self-correction through reflection-based prompts and shaped rewards.
Why it is hard to learn self-correction:
When signals are sparse, we synthesize them.
A key observation motivating our approach is that effective self-correction signals already exist in standard RL rollouts: for a given input, incorrect and correct responses often coexist within the same rollout group. By pairing them, we explicitly construct samples that demonstrate effective correction behavior.
Produces dense, explicit self-correction examples through rollout recombination
Balances positive and negative samples, stabilizing RL optimization
Substantially improves efficiency by reusing existing rollouts
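A minimal sketch of this recombination step is shown below, assuming each rollout stores its prompt, sampled response, and a binary correctness label from the verifier; the Rollout class, build_correction_samples function, and reflection trigger are illustrative placeholders rather than the paper's released code.

from dataclasses import dataclass
from itertools import product

@dataclass
class Rollout:
    prompt: str        # question text (plus image reference for a VLM)
    response: str      # sampled reasoning trace and final answer
    is_correct: bool   # binary outcome reward from the verifier

# Assumed reflection trigger inserted between the flawed attempt and its correction.
REFLECTION_TRIGGER = "\nWait, let me re-examine my previous answer.\n"

def build_correction_samples(group):
    """Pair incorrect and correct responses from the same rollout group into
    dense self-correction examples: the wrong attempt (o_1) becomes fixed
    context and the correct response (o_2) becomes the correction target."""
    wrong = [r for r in group if not r.is_correct]
    right = [r for r in group if r.is_correct]
    return [
        {
            "context": bad.prompt + bad.response + REFLECTION_TRIGGER,  # o_1 as context
            "target": good.response,                                    # o_2 supervises correction
        }
        for bad, good in product(wrong, right)
    ]

Pairing every incorrect response with every correct one in the group turns a handful of ordinary rollouts into many explicit correction examples, which is where the density and reuse benefits above come from.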
Decoupling self-correction from direct reasoning
Two Reward Designs
Jointly optimizing self-correction and direct reasoning limits self-correction learning. Reward shaping alone fails to decouple these signals and instead induces reward hacking and mode collapse.
Stage I: Learning Self-Correction Only
Stage II: Co-evolving Both Capabilities
Stage I isolates self-correction learning by treating $o_1$ as fixed context and updating the policy only from $o_2$. Stage II selectively unmasks $o_1$ when reward signals are non-conflicting, co-evolving both direct reasoning and self-correction.
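A rough illustration of the response masking is sketched below; the plain REINFORCE-style objective and the reward_consistent flag stand in for the actual RL objective (e.g., a clipped policy-gradient loss) and the paper's unmasking criterion, which we do not reproduce here.

import torch

def masked_policy_loss(logprobs, advantage, o1_len, stage=1, reward_consistent=False):
    """Per-token policy loss over the concatenated o_1 + o_2 tokens.

    logprobs:  tensor of shape [T] with per-token log-probabilities
    advantage: scalar (e.g., group-relative) advantage for the sequence
    o1_len:    number of tokens belonging to the first response o_1
    """
    mask = torch.ones_like(logprobs)
    # Stage I: o_1 is fixed context, so its tokens never receive gradient.
    # Stage II: o_1 is unmasked only when its reward signal does not conflict
    # with the self-correction objective.
    if stage == 1 or not reward_consistent:
        mask[:o1_len] = 0.0
    return -(mask * advantage * logprobs).sum() / mask.sum().clamp(min=1.0)

Masking in Stage I means gradients flow only through the correction segment o_2, which is what isolates self-correction from direct reasoning; Stage II relaxes the mask so both capabilities can improve together.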
If you find our work useful, please consider citing:
@article{ding2025octopus,
  title={Learning Self-Correction in Vision–Language Models via Rollout Augmentation},
  author={Ding, Yi and Qiu, Ziliang and Li, Bolian and Zhang, Ruqi},
  journal={arXiv preprint},
  year={2025}
}