Hierarchical Graph Tokenization for Molecule-Language Alignment
Chen, Yongqiang ; Yao, Quanming ; Zhang, Juzheng ; Cheng, James J. ; Bian, Yatao
Department
Machine Learning
Type
Conference proceeding
Date
2025
Language
English
Abstract
Recently, there has been a surge of interest in extending the success of large language models (LLMs) from texts to molecules. Most existing approaches adopt a graph neural network to represent a molecule as a series of node tokens for molecule-language alignment, which, however, overlooks the inherent hierarchical structures in molecules. Notably, higher-order molecular structures contain rich semantics of functional groups, which encode crucial biochemical functionalities of the molecules. We show that neglecting the hierarchical information in tokenization leads to subpar molecule-language alignment and severe hallucination. To address this limitation, we propose HIerarchical GrapH Tokenization (HIGHT). HIGHT employs a hierarchical graph tokenizer that encodes the hierarchy of atom, motif, and molecular levels of informative tokens to improve the molecular perception of LLMs. HIGHT also adopts an augmented instruction tuning dataset, enriched with the hierarchical graph information, to further enhance the molecule-language alignment. Extensive experiments on 14 real-world benchmarks verify the effectiveness of HIGHT, reducing hallucination by 40% and yielding significant improvements in various molecule-language downstream tasks. The project is available at https://higraphllm.github.io/.
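To make the three-level tokenization concrete, below is a minimal toy sketch (not the authors' implementation) of how a molecule graph could be expanded into atom-, motif-, and molecule-level tokens. The `Molecule` class, the feature vectors, and the mean-pooling scheme are illustrative assumptions standing in for GNN embeddings and the paper's actual tokenizer.

```python
# Toy sketch (NOT the HIGHT implementation): expand a molecule into
# atom-, motif-, and molecule-level tokens, illustrating the hierarchy
# described in the abstract. Features and pooling are placeholder assumptions.
from dataclasses import dataclass


@dataclass
class Molecule:
    atoms: list   # per-atom feature vectors (stand-ins for GNN node embeddings)
    motifs: list  # each motif is a list of atom indices (e.g., a functional group)


def mean_pool(vectors):
    """Element-wise mean of equal-length feature vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]


def hierarchical_tokens(mol):
    """Return one token per atom, one per motif, and one for the whole molecule."""
    atom_tokens = [("atom", i, feats) for i, feats in enumerate(mol.atoms)]
    motif_tokens = [
        ("motif", j, mean_pool([mol.atoms[i] for i in idxs]))
        for j, idxs in enumerate(mol.motifs)
    ]
    mol_token = [("mol", 0, mean_pool(mol.atoms))]
    # The LLM would consume this flat sequence of hierarchy-aware tokens.
    return atom_tokens + motif_tokens + mol_token


# Toy example: 3 "atoms", with atoms 0 and 1 forming one motif.
mol = Molecule(atoms=[[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], motifs=[[0, 1]])
tokens = hierarchical_tokens(mol)
print(len(tokens))  # 3 atom tokens + 1 motif token + 1 molecule token = 5
```

In the paper's actual pipeline, the per-level embeddings come from a hierarchical graph tokenizer rather than mean pooling; the sketch only shows the shape of the resulting token sequence.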
Citation
Y. Chen, Q. Yao, J. Zhang, J. Cheng, and Y. Bian, “Hierarchical Graph Tokenization for Molecule-Language Alignment,” Oct. 06, 2025, PMLR. [Online]. Available: https://proceedings.mlr.press/v267/chen25cf.html
Source
Proceedings of Machine Learning Research
Conference
42nd International Conference on Machine Learning, ICML 2025
Publisher
ML Research Press
