Enhancing Descriptive Image Quality Assessment with A Large-scale Multi-modal Dataset

Zhiyuan You1,2, Jinjin Gu3, Xin Cai1, Zheyuan Li2, Kaiwen Zhu4,5, Chao Dong2,4,6✝, Tianfan Xue1,4,7✝
1Multimedia Laboratory, The Chinese University of Hong Kong
2Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences
3INSAIT, Sofia University
4Shanghai AI Laboratory
5Shanghai Jiao Tong University
6Shenzhen University of Advanced Technology
7CPII under InnoHK
✝Corresponding Author
Abstract

With the rapid advancement of Vision Language Models (VLMs), VLM-based Image Quality Assessment (IQA) seeks to describe image quality linguistically, aligning with human expression and capturing the multifaceted nature of IQA tasks. However, current methods remain far from practical use. First, prior works focus narrowly on specific sub-tasks or settings, which do not match diverse real-world applications. Second, their performance is sub-optimal due to limitations in dataset coverage, scale, and quality. To overcome these challenges, we introduce Enhanced Depicted image Quality Assessment (DepictQA-Wild, or EDQA). Our method includes a multi-functional IQA task paradigm that encompasses assessment and comparison tasks, brief and detailed responses, and full-reference and non-reference scenarios. We introduce a ground-truth-informed dataset construction approach to enhance data quality, and we scale the dataset up to 495K samples under the brief-detail joint framework. The result is a comprehensive, large-scale, high-quality dataset, named DQ-495K (or EDQA-495K). We also retain image resolution during training to better handle resolution-related quality issues, and we estimate a confidence score that helps filter out low-quality responses. Experimental results demonstrate that DepictQA-Wild significantly outperforms traditional score-based methods, prior VLM-based IQA models, and the proprietary GPT-4V in distortion identification, instant rating, and reasoning tasks. These advantages are further confirmed in real-world applications, including assessing web-downloaded images and ranking model-processed images.
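As a concrete sketch of the confidence-based filtering mentioned above, the snippet below scores a generated response by the geometric-mean probability of its tokens and discards low-confidence responses. This is a minimal Python illustration under our own assumptions: the geometric-mean formulation, the function names, and the 0.5 threshold are illustrative choices, not the exact procedure used by DepictQA-Wild.

    import math
    from typing import List

    def response_confidence(token_logprobs: List[float]) -> float:
        # Geometric-mean token probability: exp of the mean log-probability.
        if not token_logprobs:
            return 0.0
        return math.exp(sum(token_logprobs) / len(token_logprobs))

    def keep_response(token_logprobs: List[float], threshold: float = 0.5) -> bool:
        # Retain a response only if its confidence clears the threshold.
        return response_confidence(token_logprobs) >= threshold

For example, keep_response([-0.1, -0.2, -0.05]) returns True (confidence ≈ 0.89), while a response dominated by uncertain tokens would be filtered out.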

Multi-functional Task Paradigm of DepictQA-Wild

DepictQA-Wild covers two tasks, single-image assessment and paired-image comparison, each in both full-reference and non-reference settings. Every task contains a brief sub-task that targets fundamental IQA ability and a detailed sub-task that fosters reasoning capacity.
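To make this task space concrete, the sketch below encodes the paradigm as a typed query: (assessment | comparison) × (brief | detailed) × (full-reference | non-reference). The names IQAQuery, Task, and Detail are hypothetical, chosen only to illustrate how the task combinations compose; they are not the model's actual interface.

    from dataclasses import dataclass
    from enum import Enum
    from typing import Optional

    class Task(Enum):
        ASSESSMENT = "single-image assessment"
        COMPARISON = "paired-image comparison"

    class Detail(Enum):
        BRIEF = "brief"        # fundamental IQA ability
        DETAILED = "detailed"  # reasoning capacity

    @dataclass
    class IQAQuery:
        task: Task
        detail: Detail
        image_a: str                     # (first) distorted image
        image_b: Optional[str] = None    # second image, required for comparison
        reference: Optional[str] = None  # set only in the full-reference setting

    # Example: a detailed, non-reference comparison of two images.
    query = IQAQuery(Task.COMPARISON, Detail.DETAILED, "a.png", image_b="b.png")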

Dataset Construction

We construct our DQ-495K dataset through (a) templated responses and (b) ground-truth-informed GPT-4V generation.
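The sketch below illustrates the two routes, assuming each sample carries ground-truth distortion labels (type and severity) from the synthesis pipeline. BRIEF_TEMPLATE, the prompt wording, and call_gpt4v are hypothetical stand-ins, not the actual templates or prompts behind DQ-495K.

    BRIEF_TEMPLATE = "Image A suffers from {distortion} at {severity} severity."

    def templated_response(distortion: str, severity: str) -> str:
        # Route (a): brief responses are filled from fixed templates
        # using the ground-truth labels directly.
        return BRIEF_TEMPLATE.format(distortion=distortion, severity=severity)

    def call_gpt4v(image_path: str, prompt: str) -> str:
        # Hypothetical placeholder for a GPT-4V client call.
        raise NotImplementedError

    def detailed_response(image_path: str, distortion: str, severity: str) -> str:
        # Route (b): GPT-4V writes the detailed description, but the prompt
        # embeds the ground truth so the text stays consistent with the labels.
        prompt = (
            f"The image contains {distortion} at {severity} severity. "
            "Explain how it affects texture, detail, and overall quality."
        )
        return call_gpt4v(image_path, prompt)

Grounding the generation in known labels is what keeps the detailed descriptions factually consistent with the applied distortions.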

Results of Assessment Reasoning
Results of Comparison Reasoning
Results on Web-downloaded Images
BibTeX
              
    @article{depictqa_v2,
        title={Enhancing Descriptive Image Quality Assessment with A Large-scale Multi-modal Dataset},
        author={You, Zhiyuan and Gu, Jinjin and Cai, Xin and Li, Zheyuan and Zhu, Kaiwen and Dong, Chao and Xue, Tianfan},
        journal={IEEE Transactions on Image Processing},
        year={2025}
    }

    @inproceedings{depictqa_v1,
        title={Depicting Beyond Scores: Advancing Image Quality Assessment through Multi-modal Language Models},
        author={You, Zhiyuan and Li, Zheyuan and Gu, Jinjin and Yin, Zhenfei and Xue, Tianfan and Dong, Chao},
        booktitle={European Conference on Computer Vision},
        pages={259--276},
        year={2024}
    }