by Josh Engels
Set-of-vector search is an exciting and largely unexplored research problem. In our recent paper, we develop DESSERT, a novel algorithm that efficiently solves set-of-vector search. In this blog post, I'll first give a primer on traditional nearest neighbor search, and then I'll dive into a detailed explanation of set-of-vector search. In the rest of the post, I'll talk about interesting problems that can be solved with set-of-vector search and do a deep dive into the intuition behind DESSERT.
In the traditional nearest neighbor problem, we are given a dataset $X = \{x_1, \ldots, x_n\}$ of $n$ vectors in $\mathbb{R}^d$. The goal is to preprocess $X$ so that when a query vector $q \in \mathbb{R}^d$ arrives, we can quickly return the vector in $X$ closest to $q$, where "closest" is measured by some distance or similarity function $\text{sim}(q, x)$ (e.g., Euclidean distance or cosine similarity).
The picture below shows an example of a nearest neighbor problem with $d = 2$.
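To make the problem concrete, here is a minimal brute-force sketch in NumPy (the toy dataset and query are made up for illustration); real systems replace this linear scan with an index:

```python
import numpy as np

def nearest_neighbor(X: np.ndarray, q: np.ndarray) -> int:
    """Return the index of the row of X most similar to q under cosine similarity."""
    # Normalizing rows makes the inner product equal to cosine similarity.
    X_unit = X / np.linalg.norm(X, axis=1, keepdims=True)
    q_unit = q / np.linalg.norm(q)
    sims = X_unit @ q_unit      # similarity of q to every point: O(n * d) work
    return int(np.argmax(sims))

# Toy example: n = 5 points in d = 2 dimensions.
X = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7], [-1.0, 0.0], [0.5, -0.5]])
q = np.array([0.6, 0.8])
print(nearest_neighbor(X, q))   # -> 2, the point most aligned with q
```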
In the set-of-vector search problem (AKA vector set search with vector set queries), we are given a dataset $D = \{S_1, \ldots, S_N\}$ where each $S_i$ is itself a *set* of vectors in $\mathbb{R}^d$. Given a query that is also a set of vectors, $Q = \{q_1, \ldots, q_m\}$, the goal is to quickly return the set in $D$ most relevant to $Q$ under some set relevance function¹, for example the "sum of max similarities" $F(Q, S) = \sum_{q \in Q} \max_{x \in S} \text{sim}(q, x)$.
The picture below shows an example of the set-of-vector search problem with $d = 2$.
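Before worrying about speed, it helps to see the exact, brute-force version of this search. Here is a minimal NumPy sketch (the function names are just illustrative) using the sum-of-max relevance defined above:

```python
import numpy as np

def sum_of_max_sim(Q: np.ndarray, S: np.ndarray) -> float:
    """Sum-of-max-similarities relevance F(Q, S): each query vector takes its
    best match in S, then we sum over the query vectors.
    Q is (m_q, d) and S is (m_s, d), with rows unit-normalized."""
    return float((Q @ S.T).max(axis=1).sum())

def brute_force_set_search(dataset, Q):
    """Exact set-of-vector search: score every set, return the best index.
    This costs O(sum_i |S_i| * |Q| * d) -- expensive for large datasets."""
    scores = [sum_of_max_sim(Q, S) for S in dataset]
    return int(np.argmax(scores))
```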
First, let’s talk about an embedding model. An embedding model is a machine learning model that has been trained to map the objects we want to search over (images, music, documents, products) into vectors (“embeddings”), such that similar items will have embeddings that are close together. By transforming all of the items we want to search over into embeddings, and doing the same for a query, we can find the most similar items to the query by finding the closest embeddings to the query embedding.
The image below is an example of an image embedding model that maps images into $d = 2$ dimensional vectors.
In practice, we tend to use higher dimensional embeddings than $d = 2$; typical embedding models map items into vectors with hundreds of dimensions.
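Here is roughly what single-vector semantic search looks like end to end. This sketch uses the sentence-transformers library and the all-MiniLM-L6-v2 model purely as one concrete choice of embedding model; any model that maps items and queries into the same vector space works the same way.

```python
import numpy as np
from sentence_transformers import SentenceTransformer  # pip install sentence-transformers

model = SentenceTransformer("all-MiniLM-L6-v2")  # example model; d = 384 here

docs = ["a recipe for chocolate cake",
        "how to train a neural network",
        "the best hiking trails near Boston"]
query = "teaching a deep learning model"

# Embed the corpus and the query, then search for the closest embedding.
doc_emb = model.encode(docs, normalize_embeddings=True)       # (n, d)
q_emb = model.encode([query], normalize_embeddings=True)[0]   # (d,)
sims = doc_emb @ q_emb                                        # cosine similarities
print(docs[int(np.argmax(sims))])  # -> "how to train a neural network"
```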
These embedding search problems have been extensively analyzed through the lens of traditional nearest neighbor search: one vector per item, one vector per query. Next, we'll show how a few common embedding search problems might fit better into set-of-vector search.
Semantic document search is the problem of searching for the most relevant document to a query from a large corpus of documents based on the meaning of the query, not just its lexical properties. This problem is commonly encountered in search engines and recommendation systems (Google search is, at the end of the day, semantic document search). The common benchmark used in academia is MS MARCO, a collection of 3.2 million documents, queries, and their ground truth most relevant documents (there is also a similar dataset with 8.8 million smaller passages).
For the past few years, the best techniques have embedded each document into a single vector and performed traditional near neighbor search. However, ColBERT recently achieved state-of-the-art performance on MS MARCO by embedding each document into a set of vectors and then performing set-of-vector search. ColBERT uses BERT to embed each word in the document into a vector that captures the meaning of the word and its surrounding context, and then performs a similar process on the query. Intuitively, this method is more effective at representing the entire document than a single vector because different vectors in the set can represent different parts of the document's meaning, giving an overall richer representation of the document².
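To make the idea concrete, here is a heavily simplified sketch of ColBERT-style "late interaction" scoring, using a plain BERT encoder from Hugging Face transformers as a stand-in (the real ColBERT adds a trained projection layer, query augmentation, and other details I'm omitting):

```python
import torch
from transformers import AutoModel, AutoTokenizer  # pip install transformers

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")

def embed_as_set(text: str) -> torch.Tensor:
    """Return one contextual embedding per token: a *set* of vectors, not one."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        token_vecs = encoder(**inputs).last_hidden_state[0]  # (num_tokens, 768)
    return torch.nn.functional.normalize(token_vecs, dim=-1)

Q = embed_as_set("what is a dessert?")
D = embed_as_set("a sweet course served at the end of a meal")
# Sum-of-max "late interaction": each query token finds its best document token.
score = (Q @ D.T).max(dim=1).values.sum()
print(float(score))
```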
Image search is the problem of searching for images in a collection that are similar to a given query image, where similar can mean the images look similar or are of similar objects. Traditionally, image search is performed by extracting a bag of features, using either hand-tuned algorithms like SIFT or convolutional neural networks, and then aggregating the vectors in some way for single-vector similarity search. However, similar to document search, we argue that keeping the representation as a bag of vectors might improve image search performance.
Basket completion is a problem faced by online retailers, where the goal is to suggest additional items for a customer to purchase based on their current cart (or “basket”) of items. For example, if a customer currently has a toothbrush, toothpaste, soap, a razor, and shampoo in their cart, a good basket completion suggestion might be conditioner or shaving cream.
Typical basket completion datasets contain many prior baskets that customers have checked out, and some basket completion algorithms search these prior baskets for ones similar to the current customer's choices. A good way of searching for similar baskets to use for item completions may be to embed each item in the current basket into a vector, do the same for the dataset of baskets, and then perform set-of-vector search.
The set-of-vector search process used in ColBERT is prohibitively slow on CPUs because it relies on an expensive matrix multiplication between the query embeddings and each document's embeddings (this may also be why no one has tried using set-of-vector search in image search or basket retrieval). This bottleneck inspired us to build DESSERT, an approximate set-of-vector search algorithm. Approximate search algorithms have been extensively studied in the near neighbor search literature because they are much more tractable than exact near neighbor search as the dataset size and dimensionality grow.
DESSERT can solve forms of the set relevance function that decompose into an inner aggregation $\sigma$ over each query vector's similarities and an outer aggregation $A$ across query vectors: $F(Q, S) = A\left(\left\{\sigma\left(\{\text{sim}(q, x) : x \in S\}\right) : q \in Q\right\}\right)$.
We note that the earlier "sum of max similarities" metric we defined above fits neatly into this definition, with $\sigma = \max$ as the inner aggregation and $A = \text{sum}$ as the outer aggregation.
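In code, this family of relevance functions is just a pairwise similarity matrix followed by two pluggable reductions; a minimal NumPy sketch:

```python
import numpy as np

def set_relevance(Q: np.ndarray, S: np.ndarray, inner=np.max, outer=np.sum) -> float:
    """F(Q, S) with a pluggable inner aggregation (over S's vectors, per query
    vector) and outer aggregation (over Q's vectors). Rows are unit-normalized,
    so Q @ S.T holds the pairwise cosine similarities sim(q, x)."""
    sims = Q @ S.T                        # (|Q|, |S|) similarity matrix
    return float(outer(inner(sims, axis=1)))

# inner=np.max, outer=np.sum recovers the sum-of-max-similarities score;
# swapping in outer=np.mean or inner=np.mean gives other members of the family.
```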
DESSERT works by preindexing each set-of-vectors $S_i$ into a compact sketch: a collection of $L$ locality-sensitive hash (LSH) tables, with each vector $x \in S_i$ hashed into every table. To query the sketch with a vector $q$, we hash $q$ with the same hash functions and count, for each $x \in S_i$, the number of tables in which $q$ and $x$ collide. Because LSH functions collide with probability that increases with similarity, these collision counts give estimates of the similarities $\text{sim}(q, x)$ without computing any exact dot products.
Aha! This query operation returns a set of similarity estimates in exactly the form needed for the inner aggregation $\sigma$: for each query vector $q$, we can take the maximum estimated similarity over $x \in S_i$. Applying the outer aggregation $A$ to these per-query-vector estimates then gives an approximation of the full relevance score $F(Q, S_i)$.
There are a couple of things we're glossing over here. The first is how we turn collision counts into similarity estimates: the fraction of tables in which $q$ and $x$ collide is an unbiased estimate of the LSH collision probability, a monotonically increasing function of $\text{sim}(q, x)$ that we can invert. The second is how large $L$ must be for these estimates (and thus the final scores) to be accurate; our paper analyzes this tradeoff between sketch size and accuracy formally.
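Here is a toy end-to-end version of the idea, using signed random projections (SimHash) as the LSH family. This is a sketch of the collision-counting mechanism, not the tuned implementation from the paper, and parameters like num_tables are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

class SetSketch:
    """Toy DESSERT-style sketch of one vector set, using signed random
    projections (SimHash) as the LSH family."""

    def __init__(self, S: np.ndarray, num_tables: int = 64, bits_per_table: int = 4):
        self.L, self.c = num_tables, bits_per_table
        # One random projection per table; the sign pattern is the hash code.
        self.projs = rng.standard_normal((num_tables, bits_per_table, S.shape[1]))
        self.tables = []
        for t in range(num_tables):
            table = {}
            for idx, x in enumerate(S):
                code = tuple(self.projs[t] @ x > 0)
                table.setdefault(code, []).append(idx)
            self.tables.append(table)
        self.num_elements = len(S)

    def estimate_max_sim(self, q: np.ndarray) -> float:
        """Estimate max_{x in S} sim(q, x) from hash collisions alone."""
        counts = np.zeros(self.num_elements)
        for t, table in enumerate(self.tables):
            code = tuple(self.projs[t] @ q > 0)
            for idx in table.get(code, []):
                counts[idx] += 1
        # A single SimHash bit collides with probability 1 - angle(q, x) / pi,
        # so a table of c concatenated bits collides with that probability
        # raised to the c. Invert to recover an estimated cosine similarity.
        p_table = counts.max() / self.L
        p_bit = p_table ** (1.0 / self.c)
        return float(np.cos(np.pi * (1.0 - p_bit)))

# Usage: sketch a random unit-vector set, then score a query set against it by
# applying the outer aggregation (here, a sum) to the estimated inner maxes.
S = rng.standard_normal((10, 64)); S /= np.linalg.norm(S, axis=1, keepdims=True)
Q = rng.standard_normal((3, 64));  Q /= np.linalg.norm(Q, axis=1, keepdims=True)
sketch = SetSketch(S)
print(sum(sketch.estimate_max_sim(q) for q in Q))   # approximate F(Q, S)
print((Q @ S.T).max(axis=1).sum())                  # exact sum-of-max score
```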
Unlike traditional near neighbor search, where the similarity function is usually the inverse of a formal distance metric, we don't require our similarity function to relate to a formal distance function, since that requirement would be overly restrictive in this extended problem. One example of a metric over vector sets might be replacing each set with its average vector and then reducing to the Euclidean metric. ↩
In fact, since the set similarity function isn't a metric, we can express ideas we simply cannot in a traditional vector embedding space: document $A$ can be highly similar to both document $B$ and document $C$ even when $B$ and $C$ are not similar to each other at all, a configuration that the triangle inequality would rule out in a metric space. ↩