🔥 [2025-05-29] Our Sherlock paper is out now. Paper 🚀.
🔥 [2025-05-28] Our Sherlock training and evaluation code is out now. Code 🚀.
🔥 [2025-05-27] Our Sherlock model weights are out now. Models 🚀.
Reasoning Vision-Language Models (VLMs) have shown promising performance on complex multimodal tasks. However, they still face significant challenges: they are highly sensitive to reasoning errors, require large volumes of annotated data or accurate verifiers, and struggle to generalize beyond specific domains. To address these limitations, we explore self-correction as a strategy to enhance reasoning VLMs. We first conduct an in-depth analysis of reasoning VLMs' self-correction abilities and identify key gaps. Based on our findings, we introduce Sherlock, a self-correction and self-improvement training framework.
We investigate the self-correction capabilities of reasoning VLMs through systematic analysis of step-wise and response-wise correction behaviors.
Figure 1: Left: Overview of the experimental settings for the self-correction analysis. The blue block illustrates the Modified One Step process using Qwen2.5-7B-Instruct, while the green block shows the two correction strategies applied to direct generations: external critique-based correction and an internal correction prompt. Right: Reasoning performance of LLaVA-CoT and VL-Rethinker under different settings, evaluated on MMStar and MathVista.
Reasoning VLMs explicitly generate step-by-step thoughts along with the final answer during inference. This process can be denoted as $(y_1, \cdots, y_n; a) \sim \pi(\cdot \vert x_{I\&T})$, where $y_i$ represents the $i$-th reasoning step, $a$ is the final answer, $\pi$ is the reasoning VLM, and $x_{I\&T}$ denotes the input image and text. For reasoning models, self-correction behavior can be implemented in two ways: Step-wise correction and Response-wise correction.
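As a concrete illustration of this notation, below is a minimal Python sketch that parses a generated reasoning trace into steps $y_1, \cdots, y_n$ and a final answer $a$. The `Step i:` / `Answer:` markers and the `parse_trace` helper are assumptions for illustration only; the actual output format depends on the reasoning VLM's prompt template.

```python
import re
from dataclasses import dataclass


@dataclass
class ReasoningTrace:
    steps: list[str]   # reasoning steps y_1, ..., y_n
    answer: str        # final answer a


def parse_trace(generation: str) -> ReasoningTrace:
    """Split a reasoning VLM generation into steps and a final answer.

    Assumes steps appear on lines starting with "Step i:" and the final
    answer follows an "Answer:" marker; adjust to the model's template.
    """
    # Separate the final answer from the chain of reasoning steps.
    answer_match = re.search(r"Answer:\s*(.+)\s*$", generation, flags=re.S)
    answer = answer_match.group(1).strip() if answer_match else ""
    body = generation[: answer_match.start()] if answer_match else generation

    # Collect the individual reasoning steps y_i.
    steps = [
        line.strip()
        for line in body.splitlines()
        if re.match(r"Step\s*\d+:", line.strip())
    ]
    return ReasoningTrace(steps=steps, answer=answer)


if __name__ == "__main__":
    demo = "Step 1: Read the chart axes.\nStep 2: Compare the two bars.\nAnswer: 42"
    print(parse_trace(demo))
```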
Step-wise Correction: The model reflects on its previous incorrect $i$-th step within a single thinking process and revises it to arrive at the final answer:

$$(y_1, \cdots, y^*_i, r, y_{i+1}, \cdots, y_n; a) \sim \pi(\cdot \vert x_{I\&T}).$$
Here, $y^*$ is the erroneous reasoning step, and $r$ is a reflection token indicating the model's intention to correct its previous reasoning.
Our experimental results reveal that once an error occurs in the reasoning trajectory, reasoning VLMs struggle to trigger self-reflection behaviors, such as producing expressions typically associated with "aha moments." Even among the examples where such moments do appear, only about half ultimately lead to a correct final answer.
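For reference, here is a minimal sketch of how such statistics could be computed over a set of parsed traces. The `REFLECTION_PHRASES` list and the `reflection_stats` helper are illustrative placeholders, not the exact expressions or scripts used in the paper's analysis.

```python
# Assumed reflection phrases; stand-ins for the "aha moment" expressions counted above.
# Each record is (reasoning_steps, final_answer, gold_answer).
REFLECTION_PHRASES = ("wait,", "let me re-check", "on second thought", "i made a mistake")


def has_reflection(steps: list[str]) -> bool:
    """True if any reasoning step contains a reflection expression."""
    text = " ".join(steps).lower()
    return any(phrase in text for phrase in REFLECTION_PHRASES)


def reflection_stats(records: list[tuple[list[str], str, str]]) -> dict:
    """How often self-reflection is triggered, and how often it ends in a correct answer."""
    reflected = [(ans, gold) for steps, ans, gold in records if has_reflection(steps)]
    corrected = sum(1 for ans, gold in reflected if ans.strip() == gold.strip())
    return {
        "reflection_rate": len(reflected) / max(len(records), 1),
        "success_given_reflection": corrected / max(len(reflected), 1),
    }
```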
Response-wise Correction: The model is prompted to revise and improve its entire generated response:

$$(y^{j+1}_1, \cdots, y^{j+1}_n; a^{j+1}) \sim \pi(\cdot \vert x_{I\&T}, \{y^j_i, a^j\}, t),$$
where $\{y^j_i, a^j\}$ denotes the $j$-th attempt and $t$ is an additional instruction guiding the model to perform correction.
The results in the right part of Figure 1 show that neither the correction prompt nor the external critiques effectively improve the reasoning trajectory. Moreover, regardless of which critic model provides the feedback, the accuracy of the corrected responses converges to a level similar to that of direct generation, with little sensitivity to the quality of the critique.
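Below is a minimal sketch of the two response-wise correction settings compared here, assuming a generic `generate` callable that wraps the reasoning VLM's inference and a `critic` callable for the external critic model; the names and prompt wording are illustrative assumptions, not the released evaluation code.

```python
from typing import Callable

# `generate` and `critic` are placeholders for the reasoning VLM's and the
# critic model's inference calls (prompt -> text); they are not part of the release.
Generate = Callable[[str], str]

CORRECTION_INSTRUCTION = (
    "Review your previous answer carefully. If you find any mistakes, "
    "correct them and give an improved step-by-step solution."
)


def internal_correction(generate: Generate, question: str) -> str:
    """Second attempt guided only by an internal correction prompt t."""
    first = generate(question)
    prompt = f"{question}\n\nPrevious attempt:\n{first}\n\n{CORRECTION_INSTRUCTION}"
    return generate(prompt)


def critique_based_correction(generate: Generate, critic: Generate, question: str) -> str:
    """Second attempt guided by an external critic's feedback on the first attempt."""
    first = generate(question)
    critique = critic(f"Point out any errors in this solution.\n{question}\n{first}")
    prompt = (
        f"{question}\n\nPrevious attempt:\n{first}\n\n"
        f"Critique:\n{critique}\n\n{CORRECTION_INSTRUCTION}"
    )
    return generate(prompt)
```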
A self-correction and self-improvement training framework for reasoning VLMs.
Figure 2: Training pipeline of Sherlock, including: Left: SFT cold-start stage, Middle: offline preference training, and Right: online iterative self-improvement. In the SFT and offline stages, we randomly sample 10k examples with ground truth from the 100k LLaVA-CoT dataset as supervision. During the online stage, each iteration samples only 5k unlabeled inputs, from which a self-constructed and self-labeled dataset is built using the selection rule illustrated in the right panel.
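To make the data flow of the online stage concrete, here is a hedged sketch of one self-improvement iteration. The callables `generate_trace`, `self_correct`, and `selection_rule` are placeholder assumptions standing in for the components in Figure 2; the actual selection rule and training interfaces are defined in the paper and the released code.

```python
import random
from typing import Callable, Optional, Tuple

# Placeholder callables for the pieces sketched in Figure 2; none of these
# names are part of the released code. `generate_trace` produces a reasoning
# trajectory, `self_correct` produces a revised trajectory, and
# `selection_rule` stands in for the selection rule in the right part of Figure 2.
GenerateFn = Callable[[dict], str]
CorrectFn = Callable[[dict, str], str]
SelectFn = Callable[[str, str], Optional[Tuple[str, str]]]  # -> (chosen, rejected) or None


def online_iteration(
    unlabeled_pool: list[dict],
    generate_trace: GenerateFn,
    self_correct: CorrectFn,
    selection_rule: SelectFn,
    sample_size: int = 5_000,
) -> list[dict]:
    """One online round: sample unlabeled inputs and self-construct preference pairs."""
    batch = random.sample(unlabeled_pool, min(sample_size, len(unlabeled_pool)))
    preference_data = []
    for example in batch:
        original = generate_trace(example)          # initial reasoning trajectory
        revised = self_correct(example, original)   # self-corrected trajectory
        pair = selection_rule(original, revised)    # self-labeled chosen/rejected pair
        if pair is not None:
            chosen, rejected = pair
            preference_data.append(
                {"input": example, "chosen": chosen, "rejected": rejected}
            )
    return preference_data  # used as preference data for the next training round
```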
@article{ding2025sherlock,
title={Sherlock: Self-Correcting Reasoning in Vision-Language Models},
author={Ding, Yi and Zhang, Ruqi},
journal={arXiv preprint arXiv:2505.22651},
year={2025}
}