MBZUAI Institutional Repository

Recent Submissions

  • Item (Metadata only)
    Action Tokenizer Matters in In-Context Imitation Learning
    (IEEE, 2025-11-27) Vuong, An Dinh; Vu, Minh Nhat; An, Dong; Reid, Ian
    In-context imitation learning (ICIL) is a new paradigm that enables robots to generalize from demonstrations to unseen tasks without retraining. A well-structured action representation is key to capturing demonstration information effectively, yet action tokenization (the process of discretizing and encoding actions) remains largely unexplored in ICIL. In this work, we first systematically evaluate existing action tokenization methods in ICIL and reveal a critical limitation: while they effectively encode action trajectories, they fail to preserve temporal smoothness, which is crucial for stable robotic execution. To address this, we propose LipVQ-VAE, a variational autoencoder that enforces the Lipschitz condition in the latent action space via weight normalization. By propagating smoothness constraints from raw action inputs to a quantized latent codebook, LipVQ-VAE generates smoother actions. When integrated into ICIL, LipVQ-VAE improves performance by more than 5.3% in high-fidelity simulators, and real-world experiments confirm its ability to produce smoother, more reliable trajectories. Code and checkpoints are available at https://action-tokenizer-matters.github.io/.
  • Item (Open Access)
    A High Magnifications Histopathology Image Dataset for Oral Squamous Cell Carcinoma Diagnosis and Prognosis
    (Springer Nature, 2026-02-06) Guan, Jinquan; Guo, Junhong; Chen, Qi; Chen, Jian; Cai, Yongkang; He, Yilin; Huang, Zhiquan; Wang, Yan; Xie, Yutong
    Oral Squamous Cell Carcinoma (OSCC) is a prevalent and aggressive malignancy where deep learning-based computer-aided diagnosis and prognosis can enhance clinical assessments. However, existing publicly available OSCC datasets often suffer from limited patient cohorts and a restricted focus on either diagnostic or prognostic tasks, limiting the development of comprehensive and generalizable models. To bridge this gap, we introduce Multi-OSCC, a new histopathology image dataset comprising 1,325 OSCC patients, integrating both diagnostic and prognostic information to expand existing public resources. Each patient is represented by six high-resolution histopathology images captured at ×200, ×400, and ×1000 (two per magnification), covering both the core and edge tumor regions. The Multi-OSCC dataset is richly annotated for six critical clinical tasks: recurrence prediction (REC), lymph node metastasis (LNM), tumor differentiation (TD), tumor invasion (TI), cancer embolus (CE), and perineural invasion (PI). We systematically evaluate the impact of different visual encoders, multi-image fusion techniques, stain normalization, and multi-task learning frameworks to benchmark this dataset. To accelerate future research, we publicly release the Multi-OSCC dataset at: https://github.com/guanjinquan/OSCC-PathologyImageDataset.
  • Item (Open Access)
    vEMRec: High-Resolution Volume Electron Microscopy Reconstruction Based on Structure-Preserving and High-Fidelity 3D Alignment
    (Wiley, 2026-02-20) Zhang, Zhenbang; Li, Hongjia; Yang, Zhongjun; Xu, Zhiqiang; Sun, Duanchen; Gao, Xin; Zhang, Fa; Han, Renmin
    Three-dimensional (3D) alignment is a key step in volume electron microscopy (vEM), aimed at correcting misalignment introduced during data acquisition and thereby recovering the correct biological structures. However, automated 3D alignment has long been challenged by the dilemma between eliminating nonlinear distortions and preserving the natural morphological variations inherent to biological specimens. Here, we present vEMRec, a paradigm-shifting, fully automated algorithm for vEM 3D alignment. vEMRec redefines the 3D alignment problem by decoupling it into high-frequency and low-frequency subproblems. In this framework, precision rigid alignment corrects rigid distortions, while a Gaussian filter-driven elastic registration algorithm addresses nonlinear distortions, all while faithfully preserving biologically plausible deformations. Extensive experiments demonstrate that vEMRec resolves this long-standing dilemma. Serving as a critical preprocessing step, vEMRec improves performance in downstream isotropic reconstruction and 3D segmentation tasks by enhancing axial continuity in anisotropic data while preserving the structural integrity of ultrastructural details. Moreover, vEMRec is computationally efficient, enabling TB-scale specimen analysis at biologically relevant throughput. vEMRec has successfully aligned six representative large-scale real-world datasets, demonstrating its applicability, accuracy, and robustness for large-scale data processing.
  • Item (Metadata only)
    MPromer: A Unified Diffusion-Based Framework for Scalable and Generalizable Multi-Modal Medical Image Segmentation
    (IEEE, 2026-02-23) Ghallabi, Wafa Al; Dudhane, Akshay; Zamir, Syed Waqas; Khan, Salman; Khan, Fahad Shahbaz
    Multi-modal medical image analysis is essential for comprehensive diagnostics; however, existing segmentation models often struggle to generalize across diverse imaging modalities such as MRI, CT, fundus imaging, and colonoscopy. While recent diffusion-based approaches have shown promising results, they typically rely on task-specific training, which limits their scalability and imposes significant computational demands. To address these limitations, we propose MPromer, a robust and adaptable segmentation framework that incorporates multi-scale implicit prompting within a diffusion-based architecture. In contrast to traditional prompt-driven methods, MPromer adapts automatically to various imaging modalities without requiring manually designed prompts or retraining for individual tasks. By integrating prompt-conditioned diffusion processes into an encoder-decoder structure, the model achieves consistent and effective segmentation across a wide range of medical domains. We evaluate MPromer on six benchmark datasets, demonstrating state-of-the-art performance with strong generalization capabilities. In addition to improved segmentation accuracy, MPromer enhances computational efficiency and extends naturally to multi-label segmentation tasks, making it well-suited for complex clinical applications. The framework provides a scalable and efficient solution that minimizes the need for fine-tuning, which is particularly beneficial in resource-constrained medical environments. Our code and models are available at https://github.com/wafaAlghallabi/MPromer.
  • Item (Metadata only)
    Vocabulary-Free Fine-Grained Visual Recognition via Enriched Contextually Grounded Vision-Language Model
    (Institute of Electrical and Electronics Engineers (IEEE), 2026-02-23) Demidov, Dmitry; Zaheer, Zaigham; Thawakar, Omkar; Khan, Salman; Khan, Fahad Shahbaz
    Fine-grained image classification, the task of distinguishing between visually similar subcategories within a broader category (e.g., bird species, car models, flower types), is a challenging computer vision problem. Traditional approaches rely heavily on fixed vocabularies and closed-set classification paradigms, limiting their scalability and adaptability in real-world settings where novel classes frequently emerge. Recent research has demonstrated that combining large language models (LLMs) with vision-language models (VLMs) makes open-set recognition possible without the need for predefined class labels. However, existing methods often underutilize LLMs at the classification stage and rely heavily on class names guessed by an LLM without further analysis or refinement. To address these bottlenecks, we propose a training-free method, Enriched-FineR (E-FineR for short), which achieves state-of-the-art results in fine-grained visual recognition while also offering greater interpretability, highlighting its strong potential in real-world scenarios and new domains where expert annotations are difficult to obtain. Additionally, we demonstrate the application of our approach to zero-shot and few-shot classification, where it achieves performance on par with existing state-of-the-art methods while remaining training-free and requiring no human intervention. Overall, our vocabulary-free framework supports the shift in image classification from rigid label prediction to flexible, language-driven understanding, enabling scalable and generalizable systems for real-world applications. Well-documented code is available at https://github.com/demidovd98/e-finer.
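The abstracts above describe their methods only at a high level. The first entry, for example, says LipVQ-VAE "enforces the Lipschitz condition in the latent action space via weight normalization" without giving details. As a minimal, hypothetical sketch of that general idea (not the paper's actual implementation), a linear layer can be made 1-Lipschitz in the L2 norm by rescaling its weight matrix whenever its spectral norm exceeds the desired bound:

```python
import numpy as np

def lipschitz_normalize(W, bound=1.0):
    """Rescale W so its largest singular value is at most `bound`.

    The linear map x -> W @ x is then `bound`-Lipschitz in the L2 norm:
    ||W @ x - W @ y|| <= bound * ||x - y|| for all x, y.
    """
    sigma = np.linalg.norm(W, ord=2)  # spectral norm = largest singular value
    return W * (bound / sigma) if sigma > bound else W

rng = np.random.default_rng(0)
W = lipschitz_normalize(rng.normal(size=(8, 16)), bound=1.0)

# 1-Lipschitz check: output distance never exceeds input distance.
x, y = rng.normal(size=16), rng.normal(size=16)
assert np.linalg.norm(W @ x - W @ y) <= np.linalg.norm(x - y) + 1e-9
```

In a trained network this rescaling would be applied to each layer so that the product of the per-layer bounds caps the Lipschitz constant of the whole encoder, which is the property that makes small input changes produce small latent changes and hence smoother decoded actions.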
