ModalityBridge: A Foundational Model-Based Framework for Cross-Modal Content Generation and Video Summarization

Alsuwaidi, Majed
Department
Machine Learning
Embargo End Date
2024-01-01
Type
Thesis
Date
2024
Language
English
Abstract
This thesis introduces a novel framework that leverages foundational models to convert content across modalities, taking any form of input (image, text, video, or audio) to text and then back to any desired output modality. Central to the framework is the ability to understand and generate content through a process that mimics human cognitive abilities in interpreting and recreating media. Two principal applications are explored in depth: an image-to-image conversion that recreates images from their textual descriptions, akin to a sketch artist's work from witness accounts, and a video-to-video summarization technique that condenses videos into coherent summaries through a novel combination of keyframe extraction, textual description, and synthesis. The core innovation lies in the subsequent steps, where the visual descriptions are combined into a single textual representation for each scene. These texts are then clustered, with chronological and narrative coherence ensured by restricting clusters to adjacent scenes. Leveraging GPT-4, an image prompt is generated for each text cluster, reflecting the combined visual storyline of its scenes. A text-to-image AI model then translates these prompts into images, and stitching the images together into a summarized video constitutes the final step. To facilitate adoption and experimentation, a Python package encapsulating the functionalities of this framework has been developed, enabling easy integration into various projects. Additionally, a survey tool designed to evaluate the effectiveness of, and user satisfaction with, the video summaries produced by the framework is introduced, ensuring a user-centered approach to continuous improvement.
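The adjacency-constrained clustering step described above can be sketched as follows. This is a minimal illustration, not the thesis's actual implementation: the function name `cluster_adjacent`, the greedy merge strategy, and the word-overlap (Jaccard) similarity are assumptions standing in for whatever text-similarity measure the real system uses. The key property it demonstrates is that a scene description may only join the cluster of the scene immediately preceding it, which preserves chronological and narrative order.

```python
def cluster_adjacent(texts, threshold=0.3):
    """Greedily cluster scene descriptions under an adjacency constraint:
    each description either joins the cluster of the immediately preceding
    scene (if similar enough) or starts a new cluster. Jaccard overlap of
    word sets is an illustrative stand-in for a real similarity measure."""
    clusters = []
    for text in texts:
        words = set(text.lower().split())
        if clusters:
            # Compare only against the previous cluster, never earlier ones.
            prev_words = set(" ".join(clusters[-1]).lower().split())
            union = words | prev_words
            sim = len(words & prev_words) / len(union) if union else 0.0
            if sim >= threshold:
                clusters[-1].append(text)
                continue
        clusters.append([text])
    return clusters
```

Because merges are restricted to neighbors, the concatenation of all clusters always reproduces the original scene order, which is what keeps the generated summary chronologically coherent.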
To support the research, a comprehensive dataset of YouTube videos has been curated, enriched with annotations including detected scenes, textual descriptions of keyframes, audio clips, and corresponding text transcripts. In summary, this thesis introduces a comprehensive framework for seamless conversion across media formats, including image-to-image recreation from textual descriptions and video-to-video summarization through keyframe extraction, textual description generation, clustering, and image synthesis, leveraging large language models and generative AI.
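The end-to-end summarization pipeline recapped above can be sketched as a simple orchestrator. The function name `summarize_video` and its injected-callable signature are hypothetical, not the published package's API; in practice the stages would be backed by a scene/keyframe detector, a captioning model, the adjacency-constrained clusterer, GPT-4 for prompt generation, and a text-to-image model, with the rendered frames stitched into the summary video afterwards.

```python
def summarize_video(frames, extract_keyframes, describe, cluster,
                    make_prompt, render):
    """Run the summarization stages in the order the framework describes:
    keyframe extraction -> per-keyframe textual description ->
    adjacency-aware clustering -> one image prompt per cluster ->
    image synthesis. Returns the rendered images to be stitched
    into the summary video."""
    keyframes = extract_keyframes(frames)
    descriptions = [describe(k) for k in keyframes]
    clusters = cluster(descriptions)
    prompts = [make_prompt(c) for c in clusters]
    return [render(p) for p in prompts]
```

Passing each stage in as a callable keeps the skeleton model-agnostic, mirroring the framework's foundational-model design: any captioner, language model, or image generator can be swapped in without changing the pipeline.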
Citation
M. Alsuwaidi, "ModalityBridge: A Foundational Model-Based Framework for Cross-Modal Content Generation and Video Summarization," M.S. thesis, Machine Learning, MBZUAI, Abu Dhabi, UAE, 2024.