Overview of the “Voight-Kampff” Generative AI Authorship Verification Task at PAN and ELOQUENT 2025
Author
Bevendorff, Janek , Wang, Yuxia , Karlgren, Jussi , Wiegmann, Matti , Fröbe, Maik , Su, Jinyan , Xie, Zhuohan , Abassy, Mervat T. , Mansurov, Jonibek , Xing, Rui , Ta, Minh Ngoc , Elozeiri, Kareem A. , Gu, Tianle , Tomar, Raj Vardhan , Geng, Jiahui , Artemova, Ekaterina L. , Shelmanov, Artem O. , Habash, Nizar Y. , Stamatatos, Efstathios , Gurevych, Iryna , Nakov, Preslav Ivanov , Potthast, Martin , Stein, Benno Maria
Department
Natural Language Processing
Type
Conference proceeding
Date
2025
Language
English
Abstract
The “Voight-Kampff” Generative AI Authorship Verification task aims to determine whether a text was generated by an AI or written by a human. The 2025 edition of the task comprises two subtasks: Subtask 1 tests the detection of purely AI-generated text with potentially unknown obfuscations, continuing our research from 2024. The task is again organized as a builder-breaker challenge together with the ELOQUENT lab. The PAN participants submitted 24 detectors. The best system achieves a mean score of 0.99; the best baseline achieves a score of 0.92. ELOQUENT participants submitted 13 new test datasets with 22 obfuscated texts each. The most difficult dataset achieves a mean C@1 score of 0.63. Subtask 2 investigates texts with six degrees of human-AI collaboration: (i) fully human-written, (ii) human-written, then machine-polished, (iii) machine-written, then machine-humanized (obfuscated), (iv) human-initiated, then machine-continued, (v) deeply mixed text, where some parts are written by a human and some are generated by a machine, and (vi) machine-written, then human-edited. The dataset contains over half a million examples in total and is composed of several relevant AI-detection datasets across multiple text genres. PAN participants submitted 21 detectors to Subtask 2. The best system achieves an F1 score of 0.65, the best baseline a score of 0.48. The data, baselines, and the code used for creating the datasets and evaluating the systems are available.
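For readers unfamiliar with the C@1 measure reported above, the following is a minimal sketch of how it is conventionally computed (following Peñas and Rodrigo's definition as used in PAN's verification tasks, where a prediction of exactly 0.5 counts as a non-answer; the function name and threshold convention here are illustrative, not taken from the task's evaluation code):

```python
def c_at_1(truths, scores, threshold=0.5):
    """C@1: accuracy that rewards leaving hard cases unanswered.

    A score exactly equal to `threshold` is treated as a non-answer;
    non-answers are credited at the rate of the answered accuracy.
    """
    n = len(truths)
    # Count answered cases whose thresholded decision matches the truth.
    n_correct = sum(
        1 for t, s in zip(truths, scores)
        if s != threshold and (s > threshold) == bool(t)
    )
    # Count non-answers (predictions sitting exactly on the threshold).
    n_unanswered = sum(1 for s in scores if s == threshold)
    return (n_correct + n_unanswered * n_correct / n) / n


# Example: three correct answers and one non-answer out of four cases.
print(c_at_1([1, 0, 1, 0], [0.9, 0.1, 0.5, 0.2]))  # → 0.9375
```

With no non-answers, C@1 reduces to plain accuracy, so a system cannot gain by abstaining on cases it would have gotten right.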
Citation
J. Bevendorff et al., “Overview of the ‘Voight-Kampff’ Generative AI Authorship Verification Task at PAN and ELOQUENT 2025,” 2025.
Source
CEUR Workshop Proceedings
Conference
26th Working Notes of the Conference and Labs of the Evaluation Forum, CLEF 2025
Keywords
Generative AI Detection, Human-AI Collaboration, LLM Detection, PAN, Workshop
Publisher
CEUR-WS
