
iShrink: Development and Evaluation of a Compression Pipeline for Edge-Deployable Language Models

Gebre, Daniel
Department
Machine Learning
Embargo End Date
2025-05-30
Type
Thesis
Date
2025
Language
English
Abstract
In the rapidly evolving landscape of artificial intelligence, Large Language Models (LLMs) have demonstrated remarkable capabilities across diverse tasks, from natural language understanding to complex reasoning. However, as these models have grown in size—from BERT’s modest 110M parameters to GPT-4’s estimated trillion-plus parameters—a critical challenge has emerged: how to make these powerful systems deployable in resource-constrained environments. This thesis addresses this challenge by focusing on an understudied yet crucial area: the efficient compression of already-optimized compact language models (1-3B parameters). While significant research efforts have targeted the compression of larger models (7B+ parameters), the unique challenges of compressing smaller, more efficient architectures have remained largely unexplored. These compact models, despite their relative efficiency, still exceed the computational resources available in many edge computing scenarios, creating a gap between their potential utility and practical deployability. The primary difficulty lies in preserving architectural innovations such as Grouped Query Attention (GQA) while achieving meaningful parameter reduction without substantial performance degradation. This thesis introduces iShrink, a novel compression framework specifically designed for compact models. Unlike existing approaches that treat model components uniformly, iShrink employs an architecture-aware methodology that respects the distinct yet interconnected nature of attention and feed-forward components.
The framework combines gradient-based Taylor expansion with MSE analysis and features several key innovations: (1) systematic zero-out analysis quantifying the performance impact at various sparsity levels, (2) importance-based depth pruning preserving critical information flow, (3) width pruning maintaining GQA architectural consistency, and (4) efficient LoRA fine-tuning that recovers model capabilities with minimal computational resources. Extensive evaluations on state-of-the-art 1B-parameter models—Llama-3.2-1B, Llama-3.2-1B-Instruct, Falcon3-1B, and Falcon3-1B-Instruct—demonstrate the framework’s effectiveness. iShrink achieves parameter reductions of 13.75-17.04% while maintaining 93.78% of original model performance on few-shot tasks and 95.13% on zero-shot tasks. In standard GPU environments, this compression yields 17-21% lower inference latency and 19-24% higher throughput. To validate real-world applicability, this research includes the development of a Flutter-based mobile application that deploys the compressed models on edge devices. Edge-specific testing reveals even more substantial gains: 23.8% lower inference latency, 46.2% higher throughput, 27.6% faster first-token generation, and 64.6% shorter model loading time. These results confirm that the theoretical efficiency improvements translate to practical benefits in resource-constrained environments, enabling broader adoption of LLM capabilities in scenarios where computational resources are limited. The complete implementation is available as an open-source project on GitHub. For the compression pipeline, see https://github.com/DannyMeb/iShrink-LLM-Compression-Framework.git; for the edge-deployment chat app demo, see https://github.com/DannyMeb/Edge-AI-Chat-APP.git.
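The gradient-based Taylor-expansion scoring mentioned above can be sketched in a few lines. This is a minimal NumPy illustration of first-order Taylor channel saliency, not the thesis's actual implementation; `taylor_importance` and the toy weight/gradient tensors are hypothetical, and in practice the gradient would come from backpropagation on a calibration set.

```python
import numpy as np

def taylor_importance(weight, grad):
    # First-order Taylor saliency: |w * dL/dw|, summed per output channel.
    # A channel whose removal barely changes the loss (small saliency)
    # is a structured-pruning candidate.
    return np.abs(weight * grad).sum(axis=1)

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 8))   # (out_channels, in_features) of a linear layer
g = rng.standard_normal((4, 8))   # gradient of the loss w.r.t. w (stand-in for backprop)

scores = taylor_importance(w, g)
keep = np.argsort(scores)[-3:]    # retain the 3 highest-saliency channels
pruned = w[keep]
print(pruned.shape)               # (3, 8): one output channel pruned away
```

In a width-pruning setting like iShrink's, the same ranking would be applied per attention head group and per feed-forward channel, with the GQA grouping constraining which heads can be removed together.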
Citation
Daniel Gebre, “iShrink: Development and Evaluation of a Compression Pipeline for Edge-Deployable Language Models,” Master of Science thesis, Machine Learning, MBZUAI, 2025.
Keywords
LLM Compression, Structured Pruning, Edge AI, Model Deployment, Optimization