SED++: A Simple Encoder-Decoder for Improved Open-Vocabulary Semantic Segmentation

Zhu, Wenqi
Xie, Bin
Cao, Jiale
Xie, Jin
Khan, Fahad Shahbaz
Pang, Yanwei
Department
Computer Vision
Type
Journal article
Date
2025
Language
English
Abstract
Open-vocabulary semantic segmentation aims to partition an image into distinct semantic regions based on an open set of categories. Existing approaches primarily rely on image-level pre-trained vision-language models to perform this pixel-level task. In this paper, we propose SED, a simple yet effective encoder-decoder architecture for open-vocabulary semantic segmentation that leverages pre-trained vision-language models. SED consists of a hierarchical image encoder, a text encoder, and a gradual fusion decoder. The hierarchical image encoder and the text encoder jointly generate a cost volume, which the gradual fusion decoder progressively decodes to produce segmentation results. In contrast to a plain encoder, the hierarchical encoder better captures fine image detail while maintaining linear computational complexity with respect to input size. The gradual fusion decoder adopts a top-down structure to progressively integrate high-resolution features with the cost volume. Furthermore, a category early rejection strategy is introduced in the gradual fusion decoder to filter out non-existent categories at different layers, significantly improving inference efficiency. Building on SED, we further introduce two modules: non-label text embedding and an additional category early rejection step in the encoder. Moreover, we extend our method to open-vocabulary video semantic segmentation with minimal modification to the decoder. Extensive experiments on multiple datasets validate the effectiveness and efficiency of the proposed method. With ConvNeXt-B, our method achieves an mIoU of 34.9% on ADE20K with 150 classes (i.e., A-150) at an inference speed of 69 ms per image on a single A6000 GPU, and an mIoU of 40.2% on the video segmentation dataset VSPW. The implementation will be publicly available at https://github.com/xb534/SED.git.
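The cost volume and category early rejection described in the abstract can be illustrated with a small, hedged sketch. It is not taken from the authors' implementation. It assumes the cost volume is a per-pixel cosine similarity between image and text embeddings, and that rejection keeps the union of each pixel's top-k categories. The abstract does not state the rejection criterion, so that rule, the function names (`cost_volume`, `early_reject`) and the toy vectors are all hypothetical.

```python
import math

def cosine(u, v):
    # Cosine similarity between two equal-length vectors (0.0 if either is all zeros).
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def cost_volume(pixel_feats, text_feats):
    # Assumed form of the cost volume: a P x C table of similarities between
    # P pixel embeddings and C category text embeddings.
    return [[cosine(p, t) for t in text_feats] for p in pixel_feats]

def early_reject(cost, k):
    # Hypothetical rejection rule: keep every category that ranks in the
    # top-k for at least one pixel, and drop the rest.
    kept = set()
    for row in cost:
        top = sorted(range(len(row)), key=lambda c: row[c], reverse=True)[:k]
        kept.update(top)
    return sorted(kept)

# Toy data: 3 pixel embeddings and 4 candidate category embeddings.
pixels = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0]]
texts = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [-1.0, 0.0, 0.0]]

cost = cost_volume(pixels, texts)
kept = early_reject(cost, k=1)

# Later decoder layers would then operate only on the surviving categories.
reduced = [[row[c] for c in kept] for row in cost]
print(kept, len(reduced[0]))
```

In this toy case only two of the four categories survive, so any subsequent per-category computation shrinks accordingly. That is the kind of saving the abstract attributes to early rejection.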
Citation
W. Zhu, B. Xie, J. Cao, J. Xie, F. S. Khan and Y. Pang, "SED++: A Simple Encoder-Decoder for Improved Open-Vocabulary Semantic Segmentation," in IEEE Transactions on Pattern Analysis and Machine Intelligence, doi: 10.1109/TPAMI.2025.3626757.
Source
IEEE Transactions on Pattern Analysis and Machine Intelligence
Keywords
category early rejection, gradual fusion decoder, hierarchical encoder, open-vocabulary semantic segmentation, vision-language model
Publisher
IEEE