📝 Guest Post: Multimodal Retrieval – Bridging the Gap Between Language and Diverse Data Types*
Generative AI has recently witnessed an exciting development: using language to understand images, video, audio, molecules, time series, and other "modalities." Multimodal Retrieval exemplifies this advancement, allowing us to search one modality using another. Think of Google image search or Spotify song search. Before recent breakthroughs in deep learning and Gen AI, performing ML on such unstructured data posed significant challenges due to the lack of suitable feature representations. In this article, Stefan Webb, Developer Advocate at Zilliz, explores Multimodal Retrieval, its importance, implementation methods, and future prospects in multimodal Gen AI.

Why Multimodal Retrieval Matters

Multimodal Retrieval primarily enables us to search images, audio, and video using text queries. It also serves a crucial role in grounding large language models (LLMs) in factual data and reducing hallucinations. In multimodal RAG (retrieval-augmented generation), we use the user's query to retrieve multiple similar images and text strings, augmenting the prompt with this relevant information. This approach either provides the LLM with relevant facts or supplies query-answer pairs as demonstrations for in-context learning. Multimodal Retrieval powers numerous applications, including multimedia search engines, visual question-answering systems, and more.

How Multimodal Retrieval Works

At a high level, Multimodal Retrieval follows these steps:

1. Embed the dataset's text, images, and other items with aligned encoders.
2. Store the embeddings, along with their metadata, in a vector database.
3. Embed the user's query with the same encoder.
4. Search the database for the nearest embeddings and return the associated items.
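These steps can be sketched end to end in a few lines of NumPy. The `embed` function below is a hypothetical stand-in for a real aligned encoder such as CLIP: it hashes each string to a deterministic unit vector purely so the example runs without a model.

```python
import hashlib
import numpy as np

def embed(text: str, dim: int = 16) -> np.ndarray:
    # Hypothetical stand-in for an aligned multimodal encoder (e.g. a CLIP tower):
    # hashes the input to a deterministic unit vector so the sketch is runnable.
    seed = int.from_bytes(hashlib.sha256(text.encode()).digest()[:4], "big")
    v = np.random.default_rng(seed).standard_normal(dim)
    return v / np.linalg.norm(v)  # unit norm, so a dot product is cosine similarity

# Steps 1-2: embed the dataset and keep the vectors alongside their metadata.
corpus = ["a photo of a cat", "a photo of a dog", "a red sports car"]
index = np.stack([embed(doc) for doc in corpus])  # shape (N, dim)

# Steps 3-4: embed the query with the same encoder and rank by similarity.
def search(query: str, k: int = 2):
    sims = index @ embed(query)      # cosine similarity to every stored vector
    top = np.argsort(-sims)[:k]      # indices of the k most similar vectors
    return [(corpus[i], float(sims[i])) for i in top]
```

A real system swaps the hash-based `embed` for a trained encoder and the brute-force `search` for a vector database, but the shape of the pipeline is the same.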
To compare text and image embeddings effectively, we can't use embedding models trained separately: the embedding space differs between modalities, and even for the same modality if we retrain the model. Instead, we need to produce aligned encoders for each modality. This alignment ensures that semantically similar sentences and images have embeddings close to each other in cosine or Euclidean distance. Embedding models typically use the Transformer architecture, whether for text, images, or other modalities. CLIP (Contrastive Language-Image Pretraining) stands out as a seminal method. Its text encoder uses an architecture similar to GPT-2, and its image encoder is a Vision Transformer (ViT). Both are trained together from scratch with a contrastive loss function, which minimizes the cosine distance between embeddings of matching (image, text) pairs while penalizing small distances for non-matching pairs. At each gradient step, a minibatch of roughly 32k (image, text) pairs supplies the matching and non-matching examples for the loss.

After embedding our dataset's text and images, we store these embeddings in a vector database. Vector databases differ from relational databases by offering efficient data structures and algorithms for searching vectors by distance. While a naive algorithm comparing the query vector to every vector in the database would run in O(N) time, search algorithms like Hierarchical Navigable Small Worlds (HNSW) and the Inverted File index (IVF) run, on average, in O(log N) and O(K + N/K) time respectively, where K is the number of clusters used to group the vectors. This efficiency comes at the cost of, for example, an O(N log N) index-construction step for HNSW and extra memory usage, but it allows vector search to scale to web-sized collections. We can also reduce storage costs through techniques like Product Quantization (PQ).
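To make the IVF runtime concrete, here is a toy inverted-file index in NumPy: a plain k-means pass assigns vectors to K clusters, and a query is compared first against the K centroids and then only against the vectors in the `nprobe` nearest clusters, roughly O(K + N/K) work instead of O(N). This is an illustrative sketch under simplifying assumptions, not how Milvus or any production system implements IVF, and all names are made up.

```python
import numpy as np

def build_ivf(vectors, n_clusters, n_iter=10, seed=0):
    # Toy Inverted File (IVF) index: k-means centroids plus per-cluster ID lists.
    rng = np.random.default_rng(seed)
    centroids = vectors[rng.choice(len(vectors), n_clusters, replace=False)]
    for _ in range(n_iter):  # a few Lloyd iterations of k-means
        assign = np.argmin(((vectors[:, None] - centroids[None]) ** 2).sum(-1), axis=1)
        for c in range(n_clusters):
            members = vectors[assign == c]
            if len(members):
                centroids[c] = members.mean(axis=0)
    lists = {c: np.flatnonzero(assign == c) for c in range(n_clusters)}
    return centroids, lists

def ivf_search(query, vectors, centroids, lists, nprobe=1):
    # Compare against K centroids, then scan only the nprobe nearest clusters.
    d2c = ((centroids - query) ** 2).sum(-1)
    probe = np.argsort(d2c)[:nprobe]
    cand = np.concatenate([lists[c] for c in probe])
    dists = ((vectors[cand] - query) ** 2).sum(-1)
    return cand[np.argmin(dists)]  # index of the (approximate) nearest neighbor
```

Raising `nprobe` trades speed for recall: with `nprobe` equal to K the search degenerates to the exact O(N) scan.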
As a purpose-built, high-performance vector database, Milvus is open source and offers features for running on single machines or clusters, scaling to tens of billions of vectors, and searching with millisecond latency.

Once we've constructed our multimodal dataset's vector database, we perform Multimodal Retrieval by embedding the user's query and searching the database for similar embeddings and their associated metadata. For instance, given a user query describing an image, we can retrieve similar images. The query embedding model is typically the same model used for constructing the database, although it can be fine-tuned for better retrieval. More complex pipelines might involve filtering the query for appropriateness or relevance, rewriting it to facilitate search, searching both text and image embeddings, combining results, reranking the retrieved results with a separate model, and filtering the output.

Key Requirements for Multimodal Retrieval

Building a Multimodal Retrieval system poses three main challenges: assembling a large, aligned multimodal dataset (or obtaining a model pretrained on one); storing embeddings at scale; and searching them efficiently.
Creating a large multimodal dataset from scratch requires ingenuity to scale up. For example, common image-search datasets use (image, alt text) pairs scraped from the web. For the MagicLens model, triplets of (source image, instruction, target image) are formed by scraping similar images from the same webpage and using a Large Language-Vision Model (LLVM) to synthesize natural-language instructions for transforming the source into the target. It's often more convenient to use pre-existing datasets or pretrained models; state-of-the-art examples with commercial-use licenses are available from Hugging Face.

Vector database implementations like Milvus address the second and third challenges by handling the distributed-systems aspects and performing efficient searches at scale. Check out this demo implementing multimodal RAG with Milvus for image search. For those who prefer not to manage their own vector database, hosted services like Zilliz Cloud are available.

Future Directions

Much exciting work has been happening at the intersection of Multimodal Retrieval and RAG since the idea was first examined in MuRAG (Google, 2022). As an example, in this notebook a graph database is combined with a vector database to search relationships over entities and concepts, and a routing component is added to the RAG system that introspects the query to decide whether to retrieve information from the vector database, the graph database, or defer to a web search. Beyond RAG, Multimodal Gen AI is not limited to web-mined text and image data: recent work examines multimodal data in other domains, and a range of interesting applications is emerging.

Conclusion

Multimodal Retrieval opens up exciting possibilities for searching and understanding diverse data types using natural language.
As we continue to refine these techniques and explore new applications, we can expect to see increasingly sophisticated and powerful AI systems that bridge the gap between human communication and machine understanding across multiple modalities.

*This post was written by Stefan Webb, Developer Advocate at Zilliz, specially for TheSequence. We thank Zilliz for their insights and ongoing support of TheSequence.

You're on the free list for TheSequence Scope and TheSequence Chat. For the full experience, become a paying subscriber to TheSequence Edge. Trusted by thousands of subscribers from the leading AI labs and universities.