DisCo: Discovering Common Affordance from Large Models for Actionable Part Perception

Wen, Youpeng
Zhu, Yi
Zhan, Zhihao
Ren, Pengzhen
Han, Jianhua
Xu, Hang
Zhao, Shen
Liang, Xiaodan
Department
Computer Vision
Type
Conference proceeding
Date
2025
Language
English
Abstract
Actionable part perception for robotic object manipulation requires perceiving parts across open-world object categories in 3D space, which is challenging because the appearance of the same part varies greatly across objects. Despite this large intra-class difference in appearance, parts frequently share common interactive functions across different objects, i.e., common affordance. Based on this observation, we propose DisCo, a novel technique that Discovers Common affordance information from powerful large models to guide actionable part perception across open-world objects. Specifically, we first use a large language model to identify the object names that each part potentially belongs to and a text-to-image generative model to generate image examples for the queried objects, constructing image-text paired data that capture the visual and semantic information of common affordance. Then, our model encodes the common affordance information by learning to pair the object-part images with their text descriptions. Subsequently, the 2D pixel features are distilled into 3D space, so that the 3D point features are enriched with not only the semantic information of open-set objects but also the highly generalizable common affordance information. Finally, a segmentation head and a pose regression network are developed to predict more accurate part segmentation and pose estimation results, improving the success rate of robotic object manipulation. Extensive experiments show that our method outperforms existing methods on part instance and semantic segmentation by significant margins of 4.8% mAP, 5.4% AP50, and 3.9% mIoU on unseen object categories.
Citation
Y. Wen et al., “DisCo: Discovering Common Affordance from Large Models for Actionable Part Perception,” 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pp. 3320–3329, Feb. 2025, doi: 10.1109/WACV61041.2025.00328.
Source
2025 IEEE/CVF Winter Conference on Applications of Computer Vision
Keywords
3D computer vision, Robotics, Vision language model, Transfer learning
Publisher
IEEE