🔥 [2025-05-29] Our Sherlock paper is out now. Paper 🚀.
🔥 [2025-05-28] Our Sherlock training and evaluation code is out now. Code 🚀.
🔥 [2025-05-27] Our Sherlock model weights are out now. Models 🚀.
Reasoning Vision-Language Models (VLMs) have shown promising performance on complex multimodal tasks. However, they still face significant challenges: they are highly sensitive to reasoning errors, require large volumes of annotated data or accurate verifiers, and struggle to generalize beyond specific domains. To address these limitations, we explore self-correction as a strategy to enhance reasoning VLMs. We first conduct an in-depth analysis of reasoning VLMs' self-correction abilities and identify key gaps. Based on our findings, we introduce Sherlock, a self-correction and self-improvement training framework.
We investigate the self-correction capabilities of reasoning VLMs through systematic analysis of step-wise and response-wise correction behaviors.
Figure 1: Left: Overview of the experimental settings for the self-correction analysis. The blue block illustrates the Modified One Step process using Qwen2.5-7B-Instruct, while the green block shows the two correction strategies applied to direct generations: external critique-based correction and an internal correction prompt. Right: Reasoning performance of LLaVA-CoT and VL-Rethinker under different settings, evaluated on MMStar and MathVista.
Reasoning VLMs explicitly generate step-by-step thoughts along with the final answer during inference. This process can be denoted as $(y_1, \cdots, y_n; a) \sim \pi(\cdot \vert x_{I\&T})$, where $y_i$ represents the $i$-th reasoning step, $a$ is the final answer, $\pi$ is the reasoning VLM, and $x_{I\&T}$ denotes the input image and text. For reasoning models, self-correction behavior can be implemented in two ways: Step-wise correction and Response-wise correction.
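As a concrete illustration of this notation, below is a minimal Python sketch that parses a generated reasoning trace into steps $y_1, \cdots, y_n$ and a final answer $a$. The `Step i:` / `Answer:` markers and the `parse_trace` helper are assumptions for illustration only; the actual output format depends on the reasoning VLM's prompt template.

```python
import re
from dataclasses import dataclass


@dataclass
class ReasoningTrace:
    steps: list[str]   # reasoning steps y_1, ..., y_n
    answer: str        # final answer a


def parse_trace(generation: str) -> ReasoningTrace:
    """Split a reasoning VLM generation into steps and a final answer.

    Assumes steps appear on lines starting with "Step i:" and the final
    answer follows an "Answer:" marker; adjust to the model's template.
    """
    # Separate the final answer from the chain of reasoning steps.
    answer_match = re.search(r"Answer:\s*(.+)\s*$", generation, flags=re.S)
    answer = answer_match.group(1).strip() if answer_match else ""
    body = generation[: answer_match.start()] if answer_match else generation

    # Collect the individual reasoning steps y_i.
    steps = [
        line.strip()
        for line in body.splitlines()
        if re.match(r"Step\s*\d+:", line.strip())
    ]
    return ReasoningTrace(steps=steps, answer=answer)


if __name__ == "__main__":
    demo = "Step 1: Read the chart axes.\nStep 2: Compare the two bars.\nAnswer: 42"
    print(parse_trace(demo))
```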
Step-wise Correction: The model reflects on its previous incorrect $i$-th step within a single thinking process and revises it to arrive at the final answer:

$$(y_1, \cdots, y^*_i, r, y_{i+1}, \cdots, y_n; a) \sim \pi(\cdot \vert x_{I\&T}).$$
Here, $y^*$ is the erroneous reasoning step, and $r$ is a reflection token indicating the model's intention to correct its previous reasoning.
Our experimental results reveal that once an error occurs in the reasoning trajectory, reasoning VLMs struggle to trigger self-reflection behaviors, such as producing expressions typically associated with "aha moments." Even among the examples where such moments do appear, only about half ultimately lead to a correct final answer.
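For reference, here is a minimal sketch of how such statistics could be computed over a set of parsed traces. The `REFLECTION_PHRASES` list and the `reflection_stats` helper are illustrative placeholders, not the exact expressions or scripts used in the paper's analysis.

```python
# Assumed reflection phrases; stand-ins for the "aha moment" expressions counted above.
# Each record is (reasoning_steps, final_answer, gold_answer).
REFLECTION_PHRASES = ("wait,", "let me re-check", "on second thought", "i made a mistake")


def has_reflection(steps: list[str]) -> bool:
    """True if any reasoning step contains a reflection expression."""
    text = " ".join(steps).lower()
    return any(phrase in text for phrase in REFLECTION_PHRASES)


def reflection_stats(records: list[tuple[list[str], str, str]]) -> dict:
    """How often self-reflection is triggered, and how often it ends in a correct answer."""
    reflected = [(ans, gold) for steps, ans, gold in records if has_reflection(steps)]
    corrected = sum(1 for ans, gold in reflected if ans.strip() == gold.strip())
    return {
        "reflection_rate": len(reflected) / max(len(records), 1),
        "success_given_reflection": corrected / max(len(reflected), 1),
    }
```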
Response-wise Correction: The model is prompted to revise and improve its entire generated response:

$$(y^{j+1}_1, \cdots, y^{j+1}_n; a^{j+1}) \sim \pi(\cdot \vert x_{I\&T}, \{y^j_i, a^j\}, t),$$
where $\{y^j_i, a^j\}$ denotes the $j$-th attempt and $t$ is an additional instruction guiding the model to perform correction.
The results in the right part of Figure 1 show that neither the correction prompt nor the external critiques effectively improve the reasoning trajectory. Moreover, regardless of which critic model provides the feedback, the accuracy of the corrected responses converges to a level similar to that of direct generation, with little sensitivity to the quality of the critique.
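Below is a minimal sketch of the two response-wise correction settings compared here, assuming a generic `generate` callable that wraps the reasoning VLM's inference and a `critic` callable for the external critic model; the names and prompt wording are illustrative assumptions, not the released evaluation code.

```python
from typing import Callable

# `generate` and `critic` are placeholders for the reasoning VLM's and the
# critic model's inference calls (prompt -> text); they are not part of the release.
Generate = Callable[[str], str]

CORRECTION_INSTRUCTION = (
    "Review your previous answer carefully. If you find any mistakes, "
    "correct them and give an improved step-by-step solution."
)


def internal_correction(generate: Generate, question: str) -> str:
    """Second attempt guided only by an internal correction prompt t."""
    first = generate(question)
    prompt = f"{question}\n\nPrevious attempt:\n{first}\n\n{CORRECTION_INSTRUCTION}"
    return generate(prompt)


def critique_based_correction(generate: Generate, critic: Generate, question: str) -> str:
    """Second attempt guided by an external critic's feedback on the first attempt."""
    first = generate(question)
    critique = critic(f"Point out any errors in this solution.\n{question}\n{first}")
    prompt = (
        f"{question}\n\nPrevious attempt:\n{first}\n\n"
        f"Critique:\n{critique}\n\n{CORRECTION_INSTRUCTION}"
    )
    return generate(prompt)
```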
A self-correction and self-improvement training framework for reasoning VLMs.
Figure 2: Training pipeline of Sherlock, including: Left: SFT cold-start stage, Middle: offline preference training, and Right: online iterative self-improvement. In the SFT and offline stages, we randomly sample 10k examples with ground truth from the 100k LLaVA-CoT dataset as supervision. During the online stage, each iteration samples only 5k unlabeled inputs, from which a self-constructed and self-labeled dataset is built using the selection rule illustrated in the right panel.
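To make the data flow of the online stage concrete, here is a hedged sketch of one self-improvement iteration. The callables `generate_trace`, `self_correct`, and `selection_rule` are placeholder assumptions standing in for the components in Figure 2; the actual selection rule and training interfaces are defined in the paper and the released code.

```python
import random
from typing import Callable, Optional, Tuple

# Placeholder callables for the pieces sketched in Figure 2; none of these
# names are part of the released code. `generate_trace` produces a reasoning
# trajectory, `self_correct` produces a revised trajectory, and
# `selection_rule` stands in for the selection rule in the right part of Figure 2.
GenerateFn = Callable[[dict], str]
CorrectFn = Callable[[dict, str], str]
SelectFn = Callable[[str, str], Optional[Tuple[str, str]]]  # -> (chosen, rejected) or None


def online_iteration(
    unlabeled_pool: list[dict],
    generate_trace: GenerateFn,
    self_correct: CorrectFn,
    selection_rule: SelectFn,
    sample_size: int = 5_000,
) -> list[dict]:
    """One online round: sample unlabeled inputs and self-construct preference pairs."""
    batch = random.sample(unlabeled_pool, min(sample_size, len(unlabeled_pool)))
    preference_data = []
    for example in batch:
        original = generate_trace(example)          # initial reasoning trajectory
        revised = self_correct(example, original)   # self-corrected trajectory
        pair = selection_rule(original, revised)    # self-labeled chosen/rejected pair
        if pair is not None:
            chosen, rejected = pair
            preference_data.append(
                {"input": example, "chosen": chosen, "rejected": rejected}
            )
    return preference_data  # used as preference data for the next training round
```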
@article{ding2025sherlock,
title={Sherlock: Self-Correcting Reasoning in Vision-Language Models},
author={Ding, Yi and Zhang, Ruqi},
journal={arXiv preprint arXiv:2505.22651},
year={2025}
}