📝 Guest Post: Multimodal Retrieval – Bridging the Gap Between Language and Diverse Data Types*
Generative AI has recently witnessed an exciting development: using language to understand images, video, audio, molecules, time series, and other "modalities." Multimodal retrieval exemplifies this advancement, allowing us to search one modality using another. Think of Google image search or Spotify song search. Before recent breakthroughs in deep learning and Gen AI, performing ML on such unstructured data posed significant challenges due to the lack of suitable feature representations. In this article, Stefan Webb, Developer Advocate at Zilliz, explores Multimodal Retrieval: its importance, implementation methods, and future prospects in multimodal Gen AI.

Why Multimodal Retrieval Matters

Multimodal Retrieval primarily enables us to search images, audio, and video using text queries. However, it also plays a crucial role in grounding large language models (LLMs) in factual data and reducing hallucinations. In multimodal RAG (retrieval-augmented generation), we use the user's query to retrieve similar images and text strings and augment the prompt with this relevant information. This approach either provides the LLM with relevant facts or supplies query-answer pairs as demonstrations for in-context learning. Multimodal retrieval powers numerous applications, including multimedia search engines, visual question-answering systems, and more.

How Multimodal Retrieval Works

At a high level, Multimodal Retrieval follows these steps:

1. Embed each modality (text, images, audio, and so on) with encoders aligned so that semantically similar items map to nearby vectors (see the sketch below).
2. Store these embeddings, together with their metadata, in a vector database.
3. At query time, embed the user's query with the same aligned encoder and retrieve the nearest embeddings from the database.
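As an illustration of the first step, here is a minimal sketch that embeds an image and two candidate captions into the same space with a pretrained CLIP model and ranks the captions by cosine similarity. The openai/clip-vit-base-patch32 checkpoint from Hugging Face transformers and the local file cat.jpg are assumptions for the example, not part of the original post.

```python
# Minimal sketch: embed an image and candidate captions with a pretrained
# CLIP model and rank the captions by cosine similarity.
# Assumes: `pip install torch transformers pillow` and a local file "cat.jpg".
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("cat.jpg")
captions = ["a photo of a cat", "a photo of a dog"]

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    image_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
    text_emb = model.get_text_features(
        input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"]
    )

# Normalize so that the dot product equals cosine similarity.
image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)

similarity = (text_emb @ image_emb.T).squeeze(-1)
for caption, score in zip(captions, similarity.tolist()):
    print(f"{score:.3f}  {caption}")
```

Because the two encoders were trained jointly, the caption that actually describes the image should receive the highest score; this shared embedding space is what makes cross-modal search possible.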
To compare text and image embeddings effectively, we can't use embedding models trained separately: the embedding space differs between modalities, and even for the same modality if we retrain the model. Therefore, we need to produce aligned encoders for each modality. This alignment ensures that semantically similar sentences and images have embeddings that are close to each other in cosine or Euclidean distance. Embedding models typically use the Transformer architecture, whether for text, images, or other modalities. CLIP (Contrastive Language-Image Pretraining) stands out as a seminal method: a GPT-2-like architecture is used for the text encoder, and a Vision Transformer (ViT) is used as the image encoder. Both are trained together from scratch with a contrastive loss that minimizes the cosine distance between embeddings of matching (image, text) pairs while penalizing small distances between dissimilar pairs. At each gradient step, a minibatch of roughly 32k (image, text) pairs is used to construct the matching and non-matching pairs.

After embedding our dataset's text and images, we store these embeddings in a vector database. Vector databases differ from relational databases by offering efficient data structures and algorithms for searching vectors by distance. While a naive algorithm that compares the query vector to every vector in the database runs in O(N) time, approximate search algorithms such as Hierarchical Navigable Small Worlds (HNSW) and the Inverted File index (IVF) have average runtimes of O(log N) and O(K + N/K), respectively, where K is the number of clusters used to group the vectors. This efficiency comes at a cost, for example an O(N log N) index-construction step for HNSW and extra memory usage, but it allows vector search to scale to web-sized collections. We can also reduce storage costs through techniques like Product Quantization (PQ). As a purpose-built, high-performance vector database, Milvus is open source and offers features for running on single machines or clusters, scaling to tens of billions of vectors, and searching with millisecond latency.

Once we've built the vector database over our multimodal dataset, we perform Multimodal Retrieval by embedding the user's query and searching the database for similar embeddings and their associated metadata. For instance, given a user query describing an image, we can retrieve similar images. The query embedding model is typically the same as the one used to construct the database, although it can be fine-tuned for better retrieval. More complex pipelines might filter the query for appropriateness or relevance, rewrite it to facilitate search, search both text and image embeddings and combine the results, rerank retrieved results with a separate model, and filter the output.
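To make the storage and retrieval steps concrete, here is a minimal sketch using the pymilvus MilvusClient with Milvus Lite (a local, file-backed deployment). The collection name, the 512-dimensional embeddings, the sample file paths, and the embed_image/embed_text stand-ins are illustrative assumptions, not part of the original post; in practice the stand-ins would be replaced by aligned encoders such as the CLIP sketch above.

```python
# Minimal sketch: index image embeddings in Milvus and retrieve the nearest
# ones for a text query. Assumes `pip install pymilvus` (>= 2.4, which bundles
# Milvus Lite for local use).
import random
from pymilvus import MilvusClient

def embed_image(path: str) -> list[float]:
    # Stand-in for a real image encoder (e.g., the CLIP sketch above);
    # Milvus expects each vector as a list of floats.
    random.seed(hash(path) % (2**32))
    return [random.random() for _ in range(512)]

def embed_text(text: str) -> list[float]:
    # Stand-in for the matching text encoder of the aligned pair.
    random.seed(hash(text) % (2**32))
    return [random.random() for _ in range(512)]

client = MilvusClient("multimodal_demo.db")  # local Milvus Lite file
client.create_collection(collection_name="images", dimension=512)

# Index: one record per image, keeping the file path as metadata.
records = [
    {"id": i, "vector": embed_image(path), "path": path}
    for i, path in enumerate(["cat.jpg", "dog.jpg", "car.jpg"])
]
client.insert(collection_name="images", data=records)

# Query: embed the user's text with the *same* aligned encoder, then search.
query = "a small animal sleeping on a sofa"
hits = client.search(
    collection_name="images",
    data=[embed_text(query)],
    limit=3,
    output_fields=["path"],
)
for hit in hits[0]:
    print(hit["distance"], hit["entity"]["path"])
```

The same pattern extends beyond a local file: pointing MilvusClient at a server URI instead lets the identical code run against a single-node or cluster deployment.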
Key Requirements for Multimodal Retrieval

Building such a system poses three main challenges: assembling a large, aligned multimodal dataset (or obtaining encoders pretrained on one), storing the resulting embeddings at scale, and searching them efficiently.

Creating a large multimodal dataset from scratch requires ingenuity to scale up. For example, common image-search datasets use (image, alt text) pairs scraped from the web. In the MagicLens model, triplets of (source image, instruction, target image) are formed by scraping similar images from the same webpage and using a Large Language-Vision Model (LLVM) to synthesize natural-language instructions for transforming the source into the target. It is often more convenient to use pre-existing datasets or pretrained models; state-of-the-art examples with commercial-use licenses are available from Hugging Face.

Vector database implementations like Milvus address the second and third challenges by handling the distributed-systems aspects and performing efficient searches at scale. Check out this demo implementing Multimodal RAG with Milvus for image search. For those who prefer not to manage their own vector database, hosted services like Zilliz Cloud are available.

Future Directions

Much exciting work has been happening at the intersection of multimodal retrieval and RAG since the idea was first examined in MuRAG (Google, 2022). As an example, see the following: in this notebook, a graph database is combined with a vector database to search relationships over entities and concepts, and a routing component is added to the RAG system that introspects the query to decide whether to retrieve information from the vector database, the graph database, or defer to a web search. Here are some further examples:

Multimodal Gen AI is not limited to web-mined text and image data. Some recent work examines multimodal data in other domains:

For some recent interesting applications, see:

Conclusion

Multimodal Retrieval opens up exciting possibilities for searching and understanding diverse data types using natural language. As we continue to refine these techniques and explore new applications, we can expect increasingly sophisticated and powerful AI systems that bridge the gap between human communication and machine understanding across multiple modalities.

*This post was written by Stefan Webb, Developer Advocate at Zilliz, specially for TheSequence. We thank Zilliz for their insights and ongoing support of TheSequence.