MBZUAI Institutional Repository

Recent Submissions

  • Metadata only
    Chapter 11 Self- and unsupervised learning for anomaly detection and localization
    (Elsevier, 2025-09-12) Zimmerer, David; Maier-Hein, Klaus
    The growing role of machine learning in medical imaging offers significant opportunities for improving disease detection and diagnosis. Yet, the development of robust and generalizable models remains challenging due to the scarcity of annotated data, variability across imaging modalities, and the complexity of medical conditions. To address these limitations, self-supervised and unsupervised learning strategies are increasingly explored, as they can leverage abundant unlabeled data to learn meaningful representations and detect anomalies without costly manual annotations. This chapter provides an overview of key unsupervised approaches for anomaly detection and localization in medical imaging, focusing on methods such as Denoising Autoencoders (DAEs), Variational Autoencoders (VAEs) and their variants. Additionally, it introduces Contrastive Representations for Unsupervised Anomaly Detection and Localization (CRADL), a promising framework that employs contrastive learning to enhance anomaly detection capabilities. By learning to distinguish normal from abnormal patterns without supervision, these approaches open new avenues for scalable, adaptable, and clinically relevant medical imaging solutions.
  • Metadata only
    SCMBench: benchmarking domain-specific and foundation models for single-cell multi-omics data integration
    (Springer Nature, 2026-05-02) Wang, Yixuan; Fan, Yimin; Wang, Xuesong; Yu, Tingyang; Zong, Yongshuo; Liu, Xinyuan; Zhong, Gaoyang; Liu, Meitong; Li, Qing; Lee, Kin Hei; Dallakyan, Khachatur; Hu, Zhichao; Qi, Yaqian; Huang, Junjie; Jia, Gengjie; Yuan, Jiao; Chan, Ting-Fung; Gao, Xin; King, Irwin; Li, Yu
    Recent advancements in single-cell sequencing technologies have led to the generation of vast amounts of multi-omics data, spurring the development of numerous integration tools. While multi-omics integration has significantly advanced cell research, there is still a lack of comprehensive evaluations and guidelines for these tools. This study benchmarks Domain-specific Models (DMs) and Foundation Models (FMs) for multi-omics data integration, assessing 23 methods with optimized hyperparameters on integration accuracy, biomarker detection, trajectory inference, and quantitative batch effect correction. We address current gaps in assessing the efficacy and limitations of FMs compared to DMs in the multi-omics integration task. Importantly, our comprehensive analysis goes beyond basic integration accuracy, focusing on the preservation of cellular characteristics, transcriptomic biomarkers, epigenomic regulatory elements, and developmental trajectories. This holistic approach enables researchers to extract meaningful insights from integration results, facilitating a deeper understanding of individual cells. Overall, we find that FMs fall short of state-of-the-art DMs in this field. To bridge this performance gap, we propose a lightweight adaptation strategy that enhances their effectiveness in this task. Our findings serve as a guide for researchers in selecting suitable integration methods for specific single-cell analysis objectives and provide insights for future model design.
  • Open Access
    Benchmarking Arabic Authorship Attribution and Style Transfer with Large Language Models
    (ELDA (Evaluations and Language resources Distribution Agency), 2026-05) Hamed, Injy; Alhafni, Bashar; Habash, Nizar; Solorio, Thamar
    Writing style is a fundamental component of natural language. However, significant research gaps remain in two key style-centric tasks: authorship attribution (AA) and authorship style transfer, particularly for Arabic. In this work, we revisit both tasks in that context. We introduce a new AA dataset comprising texts in Modern Standard and Dialectal Arabic. We train transformer-based AA models using dual cross-entropy and contrastive learning loss objectives, and validate model performance through human evaluation. We then utilize the trained AA model to benchmark a range of large language models (LLMs) on style recognition and generation tasks, providing new insights into their capabilities in modeling Arabic writing styles. Our work reveals limitations of current models and provides resources to advance research in this direction.
  • Metadata only
    Beyond Simple Edits: Composed Video Retrieval with Dense Modifications
    (IEEE, 2026-04-29) Thawakar, Omkar; Demidov, Dmitry; Thawkar, Ritesh; Anwer, Rao Muhammad; Shah, Mubarak; Khan, Salman; Khan, Fahad Shahbaz
    Composed video retrieval is a challenging task that strives to retrieve a target video based on a query video and a textual description detailing specific modifications. Standard retrieval frameworks typically struggle to handle the complexity of fine-grained compositional queries and variations in temporal understanding, which limits their retrieval ability in the fine-grained setting. To address this issue, we introduce a novel dataset that captures both fine-grained and composed actions across diverse video segments, enabling more detailed compositional changes in retrieved video content. The proposed dataset, named Dense-WebVid-CoVR, consists of 1.6 million samples with dense modification text that is around seven times larger than that of its existing counterpart. We further develop a new model that integrates visual and textual information through Cross-Attention (CA) fusion using a grounded text encoder, enabling precise alignment between dense query modifications and target videos. The proposed model achieves state-of-the-art results, surpassing existing methods on all metrics. Notably, it achieves 71.3% Recall@1 in the visual+text setting and outperforms the previous state of the art by 3.4%, highlighting its efficacy in leveraging detailed video descriptions and dense modification texts. Our proposed dataset, code, and model are available at: https://github.com/OmkarThawakar/BSE-CoVR.
  • Metadata only
    Interpretable Zero-Shot Learning with Locally-Aligned Vision-Language Model
    (IEEE, 2026-04-29) Chen, Shiming; Duan, Bowen; Khan, Salman; Khan, Fahad Shahbaz
    Large-scale vision-language models (VLMs), such as CLIP, have achieved remarkable success in zero-shot learning (ZSL) by leveraging large-scale visual-text pair datasets. However, these methods often lack interpretability, as they compute the similarity between an entire query image and the embedded category words, making it difficult to explain their predictions. One approach to address this issue is to develop interpretable models by integrating language, where classifiers are built using discrete attributes, similar to human perception. This introduces a new challenge: how to effectively align local visual features with corresponding attributes based on pre-trained VLMs. To tackle this, we propose LaZSL, a locally-aligned vision-language model for interpretable ZSL. LaZSL employs local visual-semantic alignment via optimal transport to model interactions between visual regions and their associated attributes, facilitating effective alignment and providing interpretable similarity without the need for additional training. Extensive experiments demonstrate that our method offers several advantages, including enhanced interpretability, improved accuracy, and strong domain generalization. Code is available at: https://github.com/shiming-chen/LaZSL.
