📝 Guest Post: Multimodal Retrieval – Bridging the Gap Between Language and Diverse Data Types*
Generative AI has recently witnessed an exciting development: using language to understand images, video, audio, molecules, time-series, and other "modalities." Multimodal retrieval exemplifies this advancement, allowing us to search one modality using another. Think of Google image search or Spotify song search. Before recent breakthroughs in deep learning and Gen AI, performing ML on such unstructured data posed significant challenges due to the lack of suitable feature representations.

In this article, Stefan Webb, Developer Advocate at Zilliz, explores Multimodal Retrieval, its importance, implementation methods, and future prospects in multimodal Gen AI.

Why Multimodal Retrieval Matters

Multimodal Retrieval primarily enables us to search images, audio, and videos using text queries. However, it also serves a crucial role in grounding large language models (LLMs) in factual data and reducing hallucinations. In multimodal RAG (retrieval-augmented generation), we use the user's query to retrieve multiple similar images and text strings, augmenting the prompt with this relevant information. This approach either provides the LLM with relevant facts or supplies query-answer pairs as demonstrations for in-context learning. Multimodal retrieval powers numerous applications, including multimedia search engines, visual question-answering systems, and more.

How Multimodal Retrieval Works

At a high level, Multimodal Retrieval follows these steps:
To compare text and image embeddings effectively, we can't use embedding models trained separately. The embedding space differs between modalities, and even within the same modality if we retrain the model. Therefore, we need to produce aligned encoders for each modality. This alignment ensures that semantically similar sentences and images have embeddings close to each other in either cosine or Euclidean distance.

Embedding models typically use the Transformer architecture for text, images, or other modalities. CLIP (Contrastive Language-Image Pretraining) stands out as a seminal method. Its text encoder uses an architecture similar to GPT-2, and its image encoder is a Vision Transformer (ViT). Both are trained together from scratch using a contrastive loss function, which minimizes the cosine distance between embeddings of matching (image, text) pairs while penalizing small distances for dissimilar pairs. At each gradient step, a minibatch of roughly 32k (image, text) pairs is used to construct the similar and dissimilar pairs.

After embedding our dataset's text and images, we store these embeddings in a vector database. Vector databases differ from relational databases by offering efficient data structures and algorithms for searching vectors by distance. While a naive algorithm comparing the query vector to every vector in the database would have O(N) runtime, approximate search algorithms like Hierarchical Navigable Small Worlds (HNSW) and Inverted File index (IVF) have, respectively, O(log(N)) and O(K + N/K) runtimes (on average), where K is the number of clusters used for grouping the vectors. This efficiency has a cost: for example, HNSW requires an O(N*log(N)) index construction step and extra memory. In exchange, vector search can scale to web-sized collections. We can also reduce storage cost through techniques like Product Quantization (PQ).
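
To make the O(N) versus O(K + N/K) comparison above concrete, here is a minimal NumPy sketch of an inverted-file search next to a brute-force baseline. It is a toy illustration under stated assumptions, not how Milvus or any production library implements IVF: the function names (`build_ivf`, `ivf_search`), the `n_probe` parameter, the simple spherical k-means loop, and the random vectors standing in for real embeddings are all hypothetical choices made for this example.

```python
import numpy as np


def normalize(x):
    """Scale vectors to unit length so a dot product equals cosine similarity."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def brute_force_search(query, vectors, top_k=3):
    """O(N) baseline: score the query against every stored vector."""
    scores = vectors @ query
    return np.argsort(-scores)[:top_k]


def build_ivf(vectors, n_clusters, n_iters=10, seed=0):
    """Group vectors into K clusters with a few rounds of spherical k-means."""
    rng = np.random.default_rng(seed)
    picks = rng.choice(len(vectors), n_clusters, replace=False)
    centroids = vectors[picks].copy()
    for _ in range(n_iters):
        assign = np.argmax(vectors @ centroids.T, axis=1)
        for c in range(n_clusters):
            members = vectors[assign == c]
            if len(members):  # leave a centroid unchanged if its cluster is empty
                centroids[c] = normalize(members.mean(axis=0))
    # Final assignment with the final centroids: one inverted list per cluster.
    assign = np.argmax(vectors @ centroids.T, axis=1)
    lists = [np.flatnonzero(assign == c) for c in range(n_clusters)]
    return centroids, lists


def ivf_search(query, vectors, centroids, lists, top_k=3, n_probe=1):
    """Score K centroids, then only the vectors in the n_probe closest clusters."""
    probe = np.argsort(-(centroids @ query))[:n_probe]
    candidates = np.concatenate([lists[c] for c in probe])
    scores = vectors[candidates] @ query
    return candidates[np.argsort(-scores)[:top_k]]


# Random unit vectors stand in for real text or image embeddings (assumption).
rng = np.random.default_rng(42)
vectors = normalize(rng.normal(size=(2000, 32)))
query = normalize(rng.normal(size=32))

centroids, lists = build_ivf(vectors, n_clusters=16)
print("exact :", brute_force_search(query, vectors))
print("approx:", ivf_search(query, vectors, centroids, lists, n_probe=4))
```

Probing every cluster reproduces the brute-force result, while a smaller `n_probe` scans fewer vectors at the risk of missing some true neighbors. That recall-versus-speed trade-off is why such indexes are described as approximate.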
Milvus is an open-source, purpose-built, high-performance vector database. It runs on single machines or clusters, scales to tens of billions of vectors, and searches with millisecond latency.

Once we've constructed our multimodal dataset's vector database, we perform Multimodal Retrieval by embedding the user's query and searching the database for similar embeddings and their associated metadata. For instance, given a user query describing an image, we can retrieve similar images. The query embedding model is typically the same as the embedding model used for constructing the database, although it is possible to fine-tune it for better retrieval.

More complex pipelines might involve filtering the query for appropriateness or relevance, rewriting it to facilitate search, searching both text and image embeddings, combining results, reranking retrieved results with a separate model, and filtering the output.

Key Requirements for Multimodal Retrieval
Creating a large multimodal dataset from scratch requires ingenuity to scale up. For example, common image search datasets use (image, alt text) pairs scraped from the web. In the MagicLens model, triplets of (source image, instruction, target image) are formed by scraping similar images from the same webpage and using a Large Language-Vision Model (LLVM) to synthesize natural language instructions for transforming the source into the target. It's often more convenient to use pre-existing datasets or pretrained models; state-of-the-art examples with commercial-use licenses are available from Hugging Face.

Vector database implementations like Milvus address the second and third challenges by handling distributed system aspects and performing efficient searches at scale. Check out this demo implementing Multimodal RAG with Milvus for image search. For those who prefer not to manage their own vector database, hosted services like Zilliz Cloud are available.

Future Directions

Much exciting work has been happening at the intersection of multimodal retrieval and RAG since the idea was first examined in MuRAG (Google, 2022). As an example, see the following:

In this notebook, a graph database is combined with a vector database to search relationships over entities and concepts. A routing component is added to the RAG system that introspects the query to decide whether to retrieve the information from the vector database, the graph database, or defer to a web search.

Here are some further examples:

Multimodal Gen AI is not limited to just web-mined text and image data. Some recent work examines multimodal data in other domains:

For some recent interesting applications see:

Conclusion

Multimodal Retrieval opens up exciting possibilities for searching and understanding diverse data types using natural language.
As we continue to refine these techniques and explore new applications, we can expect to see increasingly sophisticated and powerful AI systems that bridge the gap between human communication and machine understanding across multiple modalities.

*This post was written by Stefan Webb, Developer Advocate at Zilliz, specially for TheSequence. We thank Zilliz for their insights and ongoing support of TheSequence.