
Patch-Copying Repeated Tokens for Efficient Generation

Hoang, Phu Gia
Department
Natural Language Processing
Embargo End Date
30/05/2025
Type
Thesis
Date
2025
Language
English
Abstract
This thesis investigates a core inefficiency in autoregressive large language models (LLMs): the repeated recomputation of internal representations when generating tokens that appear in repeated sequences. Although such models excel at fluent text generation, their decoding process is computationally expensive, especially for tasks characterized by high lexical redundancy, such as grammatical error correction and code editing. Prior work on copying mechanisms has largely focused on what to copy, developing methods that identify multi-token spans to reuse, but has not addressed how to avoid redundant computation during generation. In contrast, we propose a complementary solution: patch copying, a lightweight, architecture-compatible method that reuses internal residual activations for repeated tokens without modifying model weights or training. Our technique detects repeated patterns in context and reuses earlier hidden states directly at matched positions, simulating internal copying within the forward pass of the model. To evaluate this mechanism, we construct a suite of synthetic grammatical error correction tasks with controlled repetition and analyze the effect of patching at different stages of the generation process. Experiments with the Qwen2.5-3B model demonstrate that shallow patching, across a small number of layers or tokens, maintains near-identical output quality, while deeper patching introduces only gradual degradation. We further quantify theoretical efficiency gains using FLOP estimates, showing that patching offers measurable reductions in computation. In sum, this work introduces a practical method for computational reuse in generative models, offers insight into the internal representations that support in-context pattern recognition, and opens avenues for combining patch-copying with span-selection mechanisms for end-to-end efficiency.
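
The mechanism described above (detect a repeated n-gram in the context, then splice the earlier token's hidden states into the forward pass at the matched position for a limited number of layers) can be illustrated with a small, self-contained sketch. The code below is not the thesis implementation: ToyModel, find_match, forward_with_patching, the trigram matching rule, and the patch_layers cutoff are all illustrative assumptions standing in for a real transformer such as Qwen2.5-3B.

    # Minimal sketch of the patch-copying idea, using a toy stand-in model.
    # All names here (find_match, ToyModel, forward_with_patching, patch_layers)
    # are illustrative assumptions, not the thesis code or the Qwen2.5 API.

    import numpy as np

    def find_match(tokens, ngram=3):
        # Return the index of an earlier token whose preceding n-gram equals the
        # n-gram ending at the newest token, or None if no repetition is found.
        if len(tokens) <= ngram:
            return None
        suffix = tuple(tokens[-ngram:])
        for j in range(ngram, len(tokens) - ngram + 1):
            if tuple(tokens[j - ngram:j]) == suffix:
                return j - 1
        return None

    class ToyModel:
        # Stand-in for a decoder: each "layer" is a fixed residual projection.
        def __init__(self, n_layers=4, d=16, vocab=100, seed=0):
            rng = np.random.default_rng(seed)
            self.emb = rng.normal(size=(vocab, d))
            self.W = [rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(n_layers)]

        def layer(self, i, h):
            return h + np.tanh(h @ self.W[i])

    def forward_with_patching(model, tokens, cache, patch_layers=2):
        # Run the newest token through the model; if its local context repeats
        # an earlier span, copy the cached activations of the first
        # `patch_layers` layers from the matched position instead of
        # recomputing them (shallow patching), then resume normal computation.
        pos, match = len(tokens) - 1, find_match(tokens)
        h, per_layer = model.emb[tokens[pos]], []
        for i in range(len(model.W)):
            if match is not None and i < patch_layers:
                h = cache[match][i]          # reused ("patched") activation
            else:
                h = model.layer(i, h)        # ordinary forward computation
            per_layer.append(h)
        cache[pos] = per_layer
        return h

    # Hypothetical usage on a context whose trailing trigram (1, 2, 3) repeats
    # an earlier span: the final token's first two layers are copied from the
    # matched position rather than recomputed, which is where the FLOP
    # savings estimated in the thesis would come from.
    model, cache, tokens = ToyModel(), {}, []
    for t in [1, 2, 3, 4, 1, 2, 3]:
        tokens.append(t)
        h_last = forward_with_patching(model, tokens, cache, patch_layers=2)

In a real decoder the cached per-layer residual streams would come from the model's activation cache, and a cutoff like patch_layers would control how deep the copy goes, mirroring the shallow-versus-deep patching comparison described in the abstract.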
Citation
Phu Gia Hoang, “Patch-Copying Repeated Tokens for Efficient Generation,” Master of Science thesis, Natural Language Processing, MBZUAI, 2025.
Keywords
Deep Learning, Natural Language Processing, Efficient Generation