Discovering Unusual Word Usages with Masked Language Model via Pseudo-label Training

Timothy Baldwin
Department
Natural Language Processing
Type
Journal Article
Date
2025
Language
English
Abstract
User-generated text contains not only non-standard words, such as b4 for before, but also unusual word usages, such as catfish for a person who uses a fake identity online; handling such cases in natural language processing requires knowledge of these words. We present a neural model for detecting non-standard usages in social media text. To address the lack of training data for this task, we propose a method for synthetically generating pseudo non-standard examples from a corpus, which lets us train the model without manually annotated training data and for any language. Experimental results on Twitter and Reddit datasets show that our proposed method outperforms existing methods and is effective across different languages.
Citation
T. Aoki, J. H. Lau, H. Kamigaito, H. Takamura, T. Baldwin, and M. Okumura, "Discovering Unusual Word Usages with Masked Language Model via Pseudo-label Training," Journal of Natural Language Processing, vol. 32, no. 1, pp. 134-175, 2025, doi: 10.5715/jnlp.32.134.
Keywords
Non-standard Word Usage, Masked Language Model, Pseudo Data
Publisher
J-STAGE
DOI
10.5715/jnlp.32.134
Additional links
https://www.jstage.jst.go.jp/article/jnlp/32/1/32_134/_article/-char/en