📝 Guest Post: Multimodal Retrieval – Bridging the Gap Between Language and Diverse Data Types*
Generative AI has recently witnessed an exciting development: using language to understand images, video, audio, molecules, time series, and other "modalities." Multimodal retrieval exemplifies this advancement, allowing us to search one modality using another. Think of Google image search or Spotify song search. Before recent breakthroughs in deep learning and Gen AI, performing ML on such unstructured data posed significant challenges due to the lack of suitable feature representations. In this article, Stefan Webb, Developer Advocate at Zilliz, explores Multimodal Retrieval, its importance, implementation methods, and future prospects in multimodal Gen AI.

Why Multimodal Retrieval Matters

Multimodal Retrieval primarily enables us to search images, audio, and video using text queries. It also plays a crucial role in grounding large language models (LLMs) in factual data and reducing hallucinations. In multimodal RAG (retrieval-augmented generation), we use the user's query to retrieve multiple similar images and text strings and augment the prompt with this relevant information. This approach either provides the LLM with relevant facts or supplies query-answer pairs as demonstrations for in-context learning. Multimodal retrieval powers numerous applications, including multimedia search engines, visual question-answering systems, and more.

How Multimodal Retrieval Works

At a high level, Multimodal Retrieval follows these steps:

1. Embed every item in the dataset (text, images, and so on) into a shared vector space using aligned encoders.
2. Store the resulting embeddings, along with their metadata, in a vector database.
3. At query time, embed the user's query with the same (or a compatible) encoder and search the database for the nearest embeddings, returning the associated items.
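As a concrete illustration of steps 1 and 3, here is a minimal sketch that embeds a small image collection and a text query into CLIP's shared space and ranks the images by cosine similarity. It assumes the Hugging Face transformers library and the openai/clip-vit-base-patch32 checkpoint; the image file names are hypothetical.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Load a pretrained CLIP model with aligned text and image encoders
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model.eval()

# Hypothetical image collection to be searched
image_paths = ["beach.jpg", "mountain.jpg", "city.jpg"]
images = [Image.open(p) for p in image_paths]

with torch.no_grad():
    # Step 1: embed the images into the shared space
    image_inputs = processor(images=images, return_tensors="pt")
    image_emb = model.get_image_features(**image_inputs)
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)

    # Step 3: embed a text query with the aligned text encoder
    text_inputs = processor(text=["a sunset over the ocean"], return_tensors="pt", padding=True)
    query_emb = model.get_text_features(**text_inputs)
    query_emb = query_emb / query_emb.norm(dim=-1, keepdim=True)

# Cosine similarity between the query and every image embedding
scores = (query_emb @ image_emb.T).squeeze(0)
best = scores.argmax().item()
print(f"Best match: {image_paths[best]} (score={scores[best]:.3f})")
```

In a real system, step 2 would store the image embeddings in a vector database rather than keeping them in memory, as discussed below.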
To compare text and image embeddings effectively, we can't use embedding models that were trained separately. The embedding space differs between modalities, and even for the same modality if we retrain the model. Therefore, we need to produce aligned encoders for each modality. This alignment ensures that semantically similar sentences and images have embeddings that are close to each other in cosine or Euclidean distance.

Embedding models typically use the Transformer architecture for text, images, or other modalities. CLIP (Contrastive Language-Image Pretraining) stands out as a seminal method. An architecture similar to GPT-2 is used for the text encoder, and a Vision Transformer (ViT) is used as the image encoder. Both are trained together from scratch using a contrastive loss function, which minimizes the cosine distance between embeddings of matching (image, text) pairs while penalizing small distances for dissimilar pairs. At each gradient step, a minibatch of roughly 32k examples is used to construct the matching and non-matching (image, text) pairs.

After embedding our dataset's text and images, we store these embeddings in a vector database. Vector databases differ from relational databases by offering efficient data structures and algorithms for searching vectors by distance. While a naive algorithm comparing the query vector to every vector in the database would have O(N) runtime, search algorithms like Hierarchical Navigable Small Worlds (HNSW) and Inverted File Index (IVF) have, respectively, O(log(N)) and O(K + N/K) average runtimes, where K is the number of clusters used to group the vectors. This efficiency comes at a cost, for example an O(N*log(N)) index construction step for HNSW and extra memory usage, but it allows vector search to scale to web-scale collections. We can also reduce storage cost through techniques like Product Quantization (PQ). As a purpose-built high-performance vector database, Milvus is open source and offers features for running on single machines or clusters, scaling to tens of billions of vectors, and searching with millisecond latency.

Once we've constructed our multimodal dataset's vector database, we perform Multimodal Retrieval by embedding the user's query and searching the database for similar embeddings and their associated metadata. For instance, given a user query describing an image, we can retrieve similar images. The query embedding model is typically the same as the embedding model used to construct the database, although it can be fine-tuned for better retrieval. More complex pipelines might involve filtering the query for appropriateness or relevance, rewriting it to facilitate search, searching both text and image embeddings, combining the results, reranking the retrieved results with a separate model, and filtering the output.
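To make the contrastive objective described above concrete, here is a minimal PyTorch sketch of a CLIP-style symmetric loss over a minibatch of paired embeddings. The function and argument names are hypothetical, and this is a simplification of CLIP's actual training code.

```python
import torch
import torch.nn.functional as F

def clip_style_contrastive_loss(image_emb: torch.Tensor,
                                text_emb: torch.Tensor,
                                temperature: float = 0.07) -> torch.Tensor:
    """image_emb and text_emb are (batch, dim) embeddings of matching (image, text) pairs."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Pairwise cosine similarities; entry (i, j) compares image i with text j
    logits = image_emb @ text_emb.t() / temperature

    # Matching pairs lie on the diagonal; off-diagonal entries act as negatives
    labels = torch.arange(logits.size(0), device=logits.device)

    # Symmetric cross-entropy over the image-to-text and text-to-image directions
    loss_i2t = F.cross_entropy(logits, labels)
    loss_t2i = F.cross_entropy(logits.t(), labels)
    return (loss_i2t + loss_t2i) / 2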
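For the storage and search steps, here is a rough sketch using the pymilvus client, assuming a recent release with Milvus Lite support. The collection name, field names, and the image_embeddings, image_paths, and query_embedding variables are hypothetical; the 512-dimensional vectors match the CLIP ViT-B/32 checkpoint used earlier.

```python
from pymilvus import MilvusClient

# Milvus Lite stores data in a local file; point this at a Milvus server URI in production
client = MilvusClient("multimodal_demo.db")

client.create_collection(
    collection_name="image_embeddings",
    dimension=512,          # CLIP ViT-B/32 produces 512-dimensional embeddings
    metric_type="COSINE",   # rank results by cosine similarity
)

# Insert precomputed image embeddings along with their metadata
client.insert(
    collection_name="image_embeddings",
    data=[
        {"id": i, "vector": emb, "path": path}
        for i, (emb, path) in enumerate(zip(image_embeddings, image_paths))
    ],
)

# Embed the user's text query with the same CLIP text encoder, then search
results = client.search(
    collection_name="image_embeddings",
    data=[query_embedding],
    limit=5,
    output_fields=["path"],
)
for hit in results[0]:
    print(hit["entity"]["path"], hit["distance"])
```

Under the hood, Milvus builds an approximate nearest-neighbor index such as HNSW or IVF over the stored vectors, which is what keeps search at the sub-linear runtimes discussed above.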
Key Requirements for Multimodal Retrieval

Building a multimodal retrieval system raises three main challenges: assembling a large multimodal dataset (or obtaining encoders pretrained on one), storing the resulting embeddings at scale, and searching them efficiently.

Creating a large multimodal dataset from scratch requires ingenuity to scale up. For example, common image search datasets use (image, alt text) pairs scraped from the web. In the MagicLens model, triplets of (source image, instruction, target image) are formed by scraping similar images from the same webpage and using a Large Language-Vision Model (LLVM) to synthesize natural language instructions for transforming the source into the target. It's often more convenient to use pre-existing datasets or pretrained models; state-of-the-art examples with commercial-use licenses are available from Hugging Face.

Vector database implementations like Milvus address the second and third challenges by handling the distributed-systems aspects and performing efficient searches at scale. Check out this demo implementing Multimodal RAG with Milvus for image search. For those who prefer not to manage their own vector database, hosted services like Zilliz Cloud are available.

Future Directions

Much exciting work has been happening at the intersection of multimodal retrieval and RAG since the idea was first examined in MuRAG (Google, 2022). As an example, see the following: in this notebook, a graph database is combined with a vector database to search relationships over entities and concepts. A routing component is added to the RAG system that introspects the query to decide whether to retrieve information from the vector database, the graph database, or defer to a web search.

Here are some further examples:

Multimodal Gen AI is not limited to just web-mined text and image data. Some recent work examines multimodal data in other domains:

For some recent interesting applications, see:

Conclusion

Multimodal Retrieval opens up exciting possibilities for searching and understanding diverse data types using natural language. As we continue to refine these techniques and explore new applications, we can expect to see increasingly sophisticated and powerful AI systems that bridge the gap between human communication and machine understanding across multiple modalities.

*This post was written by Stefan Webb, Developer Advocate at Zilliz, specially for TheSequence. We thank Zilliz for their insights and ongoing support of TheSequence.