
Degradation of Sentence Vector Quality Caused by Changes in Content Word Rate Due to Sentence Length

Hara, Tomomasa
Kurita, Hiroto
Yokoi, Sho
Imaizumi, Masaaki
Inui, Kentaro
Department
Natural Language Processing
Type
Conference proceeding
Date
2025
Language
Japanese
Abstract
Techniques for vectorizing sentences and documents have become indispensable for developing natural language processing applications such as information retrieval and document classification. However, previous studies have pointed out that the quality of sentence vectors deteriorates as sentence length increases. This paper shows that the degradation is caused by a shift in how likely function words and content words are to appear as sentences become longer. First, we show, both empirically and theoretically, that the proportion of content words decreases in longer texts. Next, we show, again both theoretically and empirically, that this decrease in content word proportion reduces the distance between sentence vectors, even for sentences on different topics. Building on these two analyses, we discuss how sentence vector quality declines for longer sentences. Our findings highlight the need for techniques that dynamically strengthen the influence of content words according to sentence length.
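The second claim in the abstract, that a lower content word proportion pulls sentence vectors on different topics closer together, can be illustrated with a toy calculation. The sketch below is not the authors' model or experiment. It assumes, purely for illustration, that a sentence vector is the mean of its word vectors, that all function words share one direction, and that content words point along a topic-specific direction. The vectors and all function names are hypothetical, and cosine similarity stands in for the notion of distance.

```python
import math

# Hypothetical 3-d word vectors (an assumption, not taken from the paper):
# function words share one direction, content words are topic-specific.
FUNCTION = (1.0, 0.0, 0.0)
TOPIC_A = (0.0, 1.0, 0.0)
TOPIC_B = (0.0, 0.0, 1.0)


def mean_pool(vectors):
    """Average a list of equal-length vectors component by component."""
    n = len(vectors)
    return tuple(sum(v[i] for v in vectors) / n for i in range(len(vectors[0])))


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def sentence_vector(topic, n_content, n_function):
    """Mean-pooled vector of a sentence with the given word counts."""
    return mean_pool([topic] * n_content + [FUNCTION] * n_function)


def cross_topic_similarity(n_content, n_function):
    """Similarity of two sentences on different topics with equal word counts."""
    a = sentence_vector(TOPIC_A, n_content, n_function)
    b = sentence_vector(TOPIC_B, n_content, n_function)
    return cosine(a, b)


if __name__ == "__main__":
    # Hold the content word count fixed and add function words,
    # so the content word rate falls as the sentence grows.
    for n_content, n_function in [(5, 5), (5, 10), (5, 15)]:
        rate = n_content / (n_content + n_function)
        sim = cross_topic_similarity(n_content, n_function)
        print(f"content rate {rate:.2f}: cross-topic cosine {sim:.3f}")
```

In this toy setting the cross-topic cosine equals (1 - r)^2 / (r^2 + (1 - r)^2) for content word rate r, so it rises from 0.5 at r = 0.5 to 0.9 at r = 0.25. The two sentences share no content words, yet they look increasingly alike as function words dominate the average.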
Citation
T. Hara, H. Kurita, S. Yokoi, M. Imaizumi, and K. Inui, “Degradation of Sentence Vector Quality Caused by Changes in Content Word Rate Due to Sentence Length,” pp. 3G1GS603-3G1GS603, 2025, doi: 10.11517/PJSAI.JSAI2025.0_3G1GS603
Source
Proceedings of the Annual Conference of JSAI, 2025
Conference
The 39th Annual Conference of the Japanese Society for Artificial Intelligence
Keywords
Natural Language Processing, Sentence Embedding, Sentence Length
Publisher
Japanese Society for Artificial Intelligence