Reflective Instruction Tuning: Mitigating Hallucinations in Large Vision-Language Models

Jinrui Zhang1, Teng Wang1,2, Haigang Zhang3, Ping Lu4, Feng Zheng1,5
1Southern University of Science and Technology,
2The University of Hong Kong,
3Shenzhen Polytechnic University,
4The Cloud Computing and IT Institute of ZTE Corporation,
5Research Institute of Multiple Agents and Embodied Intelligence,
Peng Cheng Laboratory, Shenzhen, China
†Corresponding Author
To appear at ECCV 2024

Reflective Instruction Tuning. Vanilla instruction tuning trains LVLMs solely for response generation, providing no supervision over fine-grained reasoning details. Reflective instruction tuning additionally trains the model to reflect on the rationale underlying the response, providing more fine-grained supervision (e.g., the key visual evidence and facts needed to reach the response, highlighted in red) and helping the model learn to capture more critical information.

Abstract

Large vision-language models (LVLMs) have shown promising performance on a variety of vision-language tasks. However, they remain susceptible to hallucinations, generating outputs misaligned with visual content or instructions. While various mitigation strategies have been proposed, they often neglect a key contributor to hallucinations: the lack of fine-grained reasoning supervision during training. Without intermediate reasoning steps, models may establish superficial shortcuts between instructions and responses, failing to internalize the inherent reasoning logic. To address this challenge, we propose reflective instruction tuning, which integrates rationale learning into visual instruction tuning. Unlike previous methods that learn from responses only, our approach trains the model to predict rationales justifying why responses are correct or incorrect. This fosters a deeper engagement with the fine-grained reasoning underlying each response, thus enhancing the model's reasoning proficiency. To facilitate this approach, we propose REVERIE, the first large-scale instruction-tuning dataset with ReflEctiVE RatIonalE annotations. REVERIE comprises 115k machine-generated reasoning instructions, each meticulously annotated with a corresponding pair of correct and confusing responses, alongside comprehensive rationales elucidating why each response is correct or erroneous. Experimental results on multiple LVLM benchmarks show that reflective instruction tuning with the REVERIE dataset yields substantial performance gains over the baseline model, demonstrating the effectiveness of learning from rationales.
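To make the extra supervision concrete, the sketch below (not the authors' released code) shows one way an instruction, a response, and its reflective rationale could be packed into supervised conversation turns; the reflection prompt wording and the function name are our own illustrative assumptions.

        # Minimal sketch (not the authors' released code) of how reflective
        # instruction tuning can supervise the rationale as an extra turn.
        # Prompt wording and the function name are illustrative assumptions.
        def build_reflective_turns(instruction, response, rationale, is_positive):
            reflect_prompt = ("Explain why the answer above is correct."
                              if is_positive else
                              "Explain why the answer above is incorrect.")
            return [
                {"role": "user", "content": instruction},
                {"role": "assistant", "content": response},   # response supervision
                {"role": "user", "content": reflect_prompt},
                {"role": "assistant", "content": rationale},  # rationale supervision
            ]

Under this layout, both assistant turns contribute to the training loss, so the model is optimized to produce the rationale as well as the answer.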

The REVERIE Dataset

REVERIE is the first large-scale visual instruction-tuning dataset with ReflEctiVE RatIonalE annotations. It comprises 115k machine-generated reasoning instructions, each meticulously annotated with a corresponding pair of correct and confusing responses, alongside comprehensive rationales elucidating why each response is correct or erroneous. The dataset contains 71,558 natural images: 50,938 sourced from Visual Genome, 15,706 from COCO, and 4,914 from ScienceQA. REVERIE provides 115,280 instructions paired with corresponding positive responses and 138,897 negative responses, where each response is supplemented with a reflective rationale, yielding a total of 254,177 training instances. REVERIE covers four types of vision-language tasks: multiple-choice QA, short-answer QA, open-ended QA, and Yes/No questions.
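For illustration, a single REVERIE training instance might be laid out as follows; the field names and values are assumptions for this sketch, and the released files may use a different schema.

        # Illustrative layout of one REVERIE instance (field names are assumed,
        # not taken from the released dataset files).
        example_record = {
            "image_id": "vg_2318203",                # hypothetical Visual Genome image
            "task_type": "multiple-choice QA",       # one of the four task types above
            "instruction": "Which object is on the table? (A) a laptop (B) a cat",
            "positive_response": "(A) a laptop",
            "positive_rationale": "A silver laptop sits on the wooden table ...",
            "negative_response": "(B) a cat",
            "negative_rationale": "The cat is on the sofa, not on the table ...",
        }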


Data collection pipeline

Overview of the REVERIE data collection pipeline. We first employ Gemini-Vision-Pro to annotate the instructions, responses, and rationales for each image. Gemini-Pro is then used to check the consistency between positive and negative rationales. Inconsistent samples are filtered out to maintain dataset quality.
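The consistency-filtering step can be summarized with the sketch below; annotate_image (standing in for the Gemini-Vision-Pro annotation call) and rationales_consistent (standing in for the Gemini-Pro consistency check) are hypothetical helper names, not real API functions.

        # Sketch of the consistency-filtering loop; annotate_image() and
        # rationales_consistent() are hypothetical wrappers around the
        # Gemini-Vision-Pro / Gemini-Pro calls, not real API names.
        def collect_reverie_samples(images):
            kept = []
            for image in images:
                sample = annotate_image(image)  # instruction, responses, rationales
                if rationales_consistent(sample["positive_rationale"],
                                         sample["negative_rationale"]):
                    kept.append(sample)         # drop mutually inconsistent rationales
            return kept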


Examples from REVERIE

The rationales contain rich visual information, outside knowledge, and underlying logic, providing fine-grained reasoning supervision that helps address hallucinations.

BibTeX


        @article{zhang2024reflective,
          title={Reflective Instruction Tuning: Mitigating Hallucinations in Large Vision-Language Models},
          author={Zhang, Jinrui and Wang, Teng and Zhang, Haigang and Lu, Ping and Zheng, Feng},
          journal={arXiv preprint arXiv:2407.11422},
          year={2024}
        }