📝 Guest Post: An introduction to Similarity Search*
In this guest post, Frank Liu, Director of Operations & ML Architect @ Zilliz, conducts a quick tour of similarity search, comparing embeddings and vector search strategies.

An introduction to Similarity Search

Vector similarity search is the task of finding similar vectors in a high-dimensional vector space. It has become a critical tool for many applications, such as recommendation systems, image and video search, and natural language processing (generative AI). This post discusses the concept of vector similarity search, its importance, and some techniques used to perform it.

What is Vector Similarity Search?

Embedding vectors are mathematical representations of data points and can represent various data types, including text, images, and audio. Vector similarity search is a technique for finding the vectors most similar to a query vector. By applying a liberal quantity of vector algebra to embeddings, we can perform scalable semantic analysis using just basic mathematical operators. For example, you can use a distance metric such as L1 distance, L2 distance, or cosine similarity to determine how similar two vectors are. Semantically similar pieces of unstructured data are "near" one another, while dissimilar pieces are "far" from one another.

Why is Vector Similarity Search Important?
Comparing embeddings

Let's go through a couple of word embedding examples. For the sake of simplicity, we'll use

Some prep work

Before beginning, we'll need to install the
Now that we've done all the prep work required to generate word-to-vector embeddings, let's load the trained
Example 0: Marlon Brando

Let's take a look at how
Marlon Brando worked with Al Pacino in The Godfather and Elia Kazan in A Streetcar Named Desire. He also starred in Apocalypse Now.

Example 1: If all of the kings had their queens on the throne

Vectors can be added to and subtracted from each other to demonstrate underlying semantic changes.
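The original code snippets didn't survive in this version of the post, so here is a minimal stand-in sketch of the idea. The 3-dimensional "embeddings" below are hypothetical hand-picked values, not real trained vectors (real word embeddings have hundreds of dimensions), but they show how "king" − "man" + "woman" lands nearest to "queen" under cosine similarity:

```python
import math

# Toy 3-d "embeddings" (hypothetical hand-picked values, for illustration only).
embeddings = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.8],
}

def cosine_similarity(a, b):
    # Dot product divided by the product of the two vectors' norms.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# "king" - "man" + "woman": plain element-wise vector arithmetic.
query = [k - m + w for k, m, w in zip(embeddings["king"],
                                      embeddings["man"],
                                      embeddings["woman"])]

# Rank the whole vocabulary by similarity to the composed query vector.
ranked = sorted(embeddings,
                key=lambda w: cosine_similarity(query, embeddings[w]),
                reverse=True)
print(ranked[0])  # → queen
```

With a real embedding model the same arithmetic is done on learned vectors, but the principle is identical: semantic relationships survive addition and subtraction.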
Who says engineers can't enjoy a bit of dance-pop now and then?

Example 2: Apple, the company, the fruit, ... or both?

The word "apple" can refer to both the company as well as the delicious red fruit. In this example, we can see that
"Droid" refers to Samsung's first 4G LTE smartphone ("Samsung" + "iPhone" - "Apple" = "Droid"), while "apple" is the 10th closest word to "fruit".

Vector search strategies

Now that we've seen the power of embeddings, let's briefly look at some ways we can conduct nearest neighbor search. We'll review some standard methods to give a high-level overview of how vector search works at scale. Note that some of these methods are not mutually exclusive - it's possible, for example, to use quantization in conjunction with space partitioning.

Linear search

The most straightforward but naïve nearest neighbor search algorithm is good old linear search: computing the distance from the query vector to every other vector in the database. For obvious reasons, naïve search does not scale to tens or hundreds of millions of vectors. Still, when the total number of elements in the database is small, it can be the most efficient way to perform a vector search, since no separate index structure is required, and inserts and deletes are relatively quick. Because of its minimal, constant space overhead, naïve search can often outperform space partitioning even when querying across a moderate number of vectors.

Space partitioning

Space partitioning is not a single algorithm but a family of algorithms that share the same concept. K-dimensional trees (kd-trees) are perhaps the most well-known in this family; they work by continually bisecting the search space (splitting the vectors into "left" and "right" buckets) in the manner of binary search trees. The inverted file index (IVF) is also a form of space partitioning; it assigns each vector to its nearest centroid. Searches are conducted by first determining the query vector's closest centroid and then searching only around it, significantly reducing the total number of vectors that need to be compared.
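To make these two strategies concrete, here is a minimal pure-Python sketch of linear search and an IVF-style index. The centroids here are hypothetical hand-picked points; real IVF implementations learn them with k-means:

```python
import math

def l2(a, b):
    # Euclidean (L2) distance between two vectors.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest(query, candidates):
    """Linear search: compare the query against every candidate vector."""
    return min(candidates, key=lambda v: l2(query, v))

def build_ivf(vectors, centroids):
    """Toy inverted file index: bucket each vector under its nearest centroid."""
    buckets = {i: [] for i in range(len(centroids))}
    for v in vectors:
        i = min(range(len(centroids)), key=lambda c: l2(v, centroids[c]))
        buckets[i].append(v)
    return buckets

def ivf_search(query, centroids, buckets):
    """Search only the bucket of the query's nearest centroid,
    skipping the rest of the database entirely."""
    i = min(range(len(centroids)), key=lambda c: l2(query, centroids[c]))
    return nearest(query, buckets[i])

centroids = [[0.0, 0.0], [10.0, 10.0]]      # hypothetical, hand-picked
db = [[0.5, 0.5], [1.0, 0.0], [9.0, 9.5], [10.5, 10.0]]
buckets = build_ivf(db, centroids)
print(ivf_search([9.2, 9.4], centroids, buckets))  # scans only the second bucket
```

Note the trade-off: IVF only examines one bucket, so it can miss the true nearest neighbor when the query falls near a partition boundary - which is why production systems typically probe several nearby buckets.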
Quantization

Quantization is a technique for reducing the database's total size by reducing the precision of the vectors. Scalar quantization (SQ), for example, works by multiplying high-precision floating point vectors by a scalar value, then casting the elements of the resultant vector to their nearest integers. As a result, SQ reduces the effective size of the entire database (e.g., by a factor of eight for conversion from

Product quantization (PQ) is another technique, similar to dictionary compression. In PQ, all vectors are split into equally sized subvectors, and each subvector is then replaced with a centroid.

Hierarchical Navigable Small Worlds

Hierarchical Navigable Small Worlds (HNSW) is a graph-based indexing and retrieval algorithm. HNSW works differently from product quantization: instead of improving the searchability of the database by reducing its effective size, HNSW creates a multi-layer graph from the original data. Upper layers contain only "long connections," while lower layers contain only "short connections" between vectors in the database (see the next section for an overview of vector distance metrics). Individual graph connections are created à la skip lists. With this architecture in place, searching becomes fairly straightforward - we greedily traverse the uppermost graph (the one with the longest inter-vector connections) for the vector closest to our query vector. We then do the same for the second layer, using the result of the first layer's search as the starting point. This continues until we complete the search at the bottommost layer, the result of which becomes the nearest neighbor of the query vector.
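The scalar quantization step described above can be sketched in a few lines. This toy version maps each dimension's observed range onto the integers 0–255 (8 bits); production implementations differ in detail, but the scale-then-round idea is the same:

```python
def scalar_quantize(vectors, bits=8):
    """Toy scalar quantization: rescale each dimension's range onto
    [0, 2**bits - 1] and round every element to its nearest integer."""
    dims = len(vectors[0])
    lo = [min(v[d] for v in vectors) for d in range(dims)]  # per-dim minimum
    hi = [max(v[d] for v in vectors) for d in range(dims)]  # per-dim maximum
    levels = (1 << bits) - 1                                # e.g. 255 for 8 bits

    def quantize(v):
        return [round((x - l) / (h - l) * levels) if h > l else 0
                for x, l, h in zip(v, lo, hi)]

    return [quantize(v) for v in vectors]

vecs = [[0.0, 10.0], [0.5, 20.0], [1.0, 30.0]]
print(scalar_quantize(vecs))  # every element now fits in a single byte
```

Distances computed on the quantized vectors are only approximate, which is the price paid for the smaller memory footprint.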
Approximate Nearest Neighbors Oh Yeah

Due to its playful and unintuitive name, ANNOY is my favorite ANN algorithm. Approximate Nearest Neighbors Oh Yeah (ANNOY) is a tree-based algorithm popularized by Spotify, which used it in its music recommendation system. Despite the strange name, ANNOY's underlying concept is reasonably straightforward - binary trees. ANNOY works by randomly selecting two vectors in the database and bisecting the search space along the hyperplane separating those two vectors. This is done iteratively until there are fewer than some predefined parameter
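This is not the real ANNOY library, but the split-and-recurse idea can be sketched in pure Python under the same assumptions: pick two random points, send every vector to whichever of the two it is closer to (i.e., split by the hyperplane between them), and recurse until the leaves are small:

```python
import random

def build_tree(points, leaf_size=2, rng=None):
    """Toy ANNOY-style index build: recursively bisect the space along the
    hyperplane between two randomly chosen points."""
    rng = rng or random.Random(42)  # fixed seed for a reproducible toy
    if len(points) <= leaf_size:
        return points  # leaf: a small bucket we can scan linearly
    a, b = rng.sample(points, 2)
    left, right = [], []
    for p in points:
        da = sum((x - y) ** 2 for x, y in zip(p, a))
        db = sum((x - y) ** 2 for x, y in zip(p, b))
        (left if da <= db else right).append(p)
    return (a, b,
            build_tree(left, leaf_size, rng),
            build_tree(right, leaf_size, rng))

def query_tree(tree, q):
    """Descend to the leaf on the query's side of each split,
    then scan that leaf linearly for the closest vector."""
    while isinstance(tree, tuple) and len(tree) == 4:
        a, b, left, right = tree
        da = sum((x - y) ** 2 for x, y in zip(q, a))
        db = sum((x - y) ** 2 for x, y in zip(q, b))
        tree = left if da <= db else right
    return min(tree, key=lambda p: sum((x - y) ** 2 for x, y in zip(q, p)))

db = [[0, 0], [1, 1], [4, 4], [5, 5], [9, 9], [10, 10]]
tree = build_tree(db)
print(query_tree(tree, [4.2, 4.1]))
```

Like real ANNOY, this is approximate: a query near a split boundary can be routed away from its true nearest neighbor, which is why the actual library builds a forest of such trees and searches several of them.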
Commonly used similarity metrics

The best vector databases are useless without similarity metrics - methods for computing the distance between two vectors. Numerous metrics exist, so we will discuss only the most commonly used subset here.

Floating point vector similarity metrics

The most common floating point vector similarity metrics are, in no particular order, L1 distance, L2 distance, and cosine similarity. The first two are distance metrics (lower values imply more similarity, while higher values imply less similarity), whereas cosine similarity is a similarity metric (higher values imply more similarity).
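All three metrics are a few lines each in pure Python (a minimal illustrative sketch, not a production implementation):

```python
import math

def l1_distance(a, b):
    # Sum of absolute differences ("Manhattan" distance).
    return sum(abs(x - y) for x, y in zip(a, b))

def l2_distance(a, b):
    # Straight-line distance in Euclidean space.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    # Dot product divided by the product of the two vectors' norms.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(l1_distance([1.0, 0.0], [0.0, 1.0]))       # → 2.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0])) # → 0.0 (orthogonal vectors)

# For unit-norm vectors, squared L2 distance and cosine similarity agree:
# d_l2(u, v)^2 == 2 * (1 - cos_sim(u, v))
u, v = [0.6, 0.8], [1.0, 0.0]
assert abs(l2_distance(u, v) ** 2 - 2 * (1 - cosine_similarity(u, v))) < 1e-9
```

The assertion at the end numerically checks the unit-norm equivalence discussed in this section.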
L1 distance is also commonly referred to as Manhattan distance, aptly named after the fact that getting from point A to point B in Manhattan requires moving along one of two perpendicular directions. The second equation, L2 distance, is simply the straight-line distance between two vectors in Euclidean space. The third and final equation is cosine similarity, equivalent to the cosine of the angle between two vectors. Note that cosine similarity works out to be the dot product between normalized versions of the input vectors a and b. With a bit of math, we can also show that L2 distance and cosine similarity are effectively equivalent when it comes to similarity ranking for unit norm vectors. Start by expanding the squared L2 distance:

d_l2(a, b)^2 = (a − b)^T (a − b) = a^T a − 2 a^T b + b^T b

Recall that unit norm vectors have a magnitude of 1:

a^T a = b^T b = 1

With this, we get:

a^T a − 2 a^T b + b^T b = 2 − 2 a^T b

Since we have unit norm vectors, cosine similarity works out to be just the dot product between a and b (the denominator in the cosine similarity equation is 1), so:

d_l2(a, b)^2 = 2 − 2 a^T b = 2 (1 − cos_sim(a, b))

Essentially, for unit norm vectors, L2 distance and cosine similarity are functionally equivalent! Always remember to normalize your embeddings.

Binary vector similarity metrics

Binary vectors, as their name suggests, do not have metrics based in arithmetic à la floating point vectors. Similarity metrics for binary vectors instead rely on set mathematics, bit manipulation, or a combination of both (it's okay, I also hate discrete math). Here are the formulas for two commonly used binary vector similarity metrics:

d_jaccard(a, b) = 1 − |a AND b| / |a OR b|

d_hamming(a, b) = popcount(a XOR b)
The first equation is called the Tanimoto/Jaccard distance and is essentially a measure of the amount of overlap between two binary vectors. The second equation is the Hamming distance, a count of the number of elements in a and b that differ from each other. You can most likely safely ignore these metrics, since the majority of applications use cosine similarity over floating point embeddings.

Wrapping up

Vector similarity search is an essential tool for a wide range of applications. It involves finding vectors similar to a query vector in a high-dimensional vector space. We reviewed some standard methods to provide a high-level overview of how vector search works at scale, and we quickly looked at a few metrics for computing the distance between two vectors. I hope you found this helpful - let me know in the comments below what you would like me to cover next time! In the meantime, try it for yourself with the free trial of Zilliz or download open-source Milvus!

*This post was written by Frank Liu, Director of Operations & ML Architect @ Zilliz. We thank Zilliz for their ongoing support of TheSequence.