Intro to semantic search with vector databases
Large language models (LLMs) like GPT-4 seem to get all the attention these days, but a lesser-known component is equally important for powering your AI application: semantic search.
In this overview, we explain the basics of semantic search, embeddings, and vector databases.
What is semantic search?
Semantic search helps you find similar items in a dataset, whether that data is unstructured (text, media, audio) or structured (tabular data, log files, etc.). Importantly, semantic search finds matching results based on the underlying meaning of the data, as opposed to more traditional search based on keyword matches.
Some of the top use cases for semantic search are:
Retrieval augmented generative question/answer apps with LLMs
What is a vector database?
Vector databases are built for storing large amounts of vectorized data and enable running semantic search queries efficiently across the dataset.
Some examples of vector databases are:
Pinecone - A fully hosted/managed vector DB
Elastic vector DB - A component of the Elastic Stack
Weaviate - An open source, hosted vector DB
Chroma - An open source vector DB
What are embeddings?
Think of embeddings as a form of data encoding that preserves meaning. Embeddings are created using an embedding model suited for the task at hand, such as text-embedding-ada-002 (from OpenAI) for text, or ResNet-18 for image recognition.
A vector embedding represents location coordinates of a point in a high-dimensional space. (For simplicity, imagine a simple 2-dimensional space, where locating a point requires [X, Y] coordinates.)
The embedding model translates the input data into the set of coordinates that define where the data should be located within the multi-dimensional space. (Note that the number of dimensions is dictated by the model; for example, the text-embedding-ada-002 model produces vectors with 1536 dimensions.)
How does that help me find similar stuff?
For input items that are more similar in meaning, the embedding model produces coordinates that locate them closer together in the coordinate space.
For example, the inputs:
will be located more closely together in the coordinate space than:
What is less obvious is that because the semantic meanings and relationships among words are preserved in the embedding, an input such as:
would likely be located a bit closer to the "pie" sentences than to the "cheese" sentence, because pie and ice cream are both related to the semantic concept of "dessert". Further, sentences about desserts would be located closer to other sentences about food than to sentences about electric cars.
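To make the idea of "closer together" concrete, here is a minimal sketch in Python. The 2-dimensional coordinates are hand-picked and purely hypothetical (no embedding model was run to produce them), and the labels simply stand in for the kinds of sentences discussed above.

```python
import math

# Hypothetical 2-D coordinates, invented for illustration only.
# A real embedding model would produce far more dimensions.
points = {
    "pie sentence A": (1.0, 1.0),
    "pie sentence B": (1.2, 0.9),
    "ice cream sentence": (2.0, 1.5),
    "cheese sentence": (4.0, 4.0),
}

def distance(label_a, label_b):
    """Euclidean distance between two labeled points."""
    return math.dist(points[label_a], points[label_b])

d_pie_pie = distance("pie sentence A", "pie sentence B")
d_pie_cheese = distance("pie sentence A", "cheese sentence")
d_ice_pie = distance("ice cream sentence", "pie sentence A")
d_ice_cheese = distance("ice cream sentence", "cheese sentence")

print(f"pie A   <-> pie B  : {d_pie_pie:.2f}")
print(f"pie A   <-> cheese : {d_pie_cheese:.2f}")
print(f"ice crm <-> pie A  : {d_ice_pie:.2f}")
print(f"ice crm <-> cheese : {d_ice_cheese:.2f}")
```

With these made-up coordinates, the two "pie" points sit almost on top of each other, and the "ice cream" point lands nearer to them than to the "cheese" point, which mirrors the dessert relationship described above.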
How do I look up data in a vector database?
The short answer is that you don't! The vector DB does that for you, and thank goodness, because there's a LOT going on to make it happen—and it feels like magic.
Unlike with traditional relational DBs, NoSQL DBs, blob storage, etc., data in vector DBs is found via similarity searches. Your objective when querying a vector DB is to "find me the K-number-of-things that are most similar to this thing that I'm searching for".
To perform a search for "cheesecake", we likewise need to create a vector embedding for "cheesecake", locate it within the vector space, and then measure the distance to other points in the vector space. The closer two points are to each other, the more closely related their semantic meanings are.
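The "find me the K most similar things" query can be sketched as a brute-force scan. This is only an illustration under stated assumptions: the stored vectors and the query vector are invented 2-D stand-ins (a real query would be embedded by the same model as the stored data), and `top_k` is a hypothetical helper, not the API of any particular vector DB. Production vector databases typically rely on approximate nearest-neighbor indexes rather than measuring the distance to every point.

```python
import heapq
import math

def top_k(query_vec, store, k):
    """Return the k (distance, label) pairs nearest to query_vec.

    Brute force: measures Euclidean distance to every stored vector.
    """
    scored = ((math.dist(query_vec, vec), label) for label, vec in store.items())
    return heapq.nsmallest(k, scored)

# Hypothetical stored embeddings (hand-picked 2-D values).
store = {
    "apple pie": (1.0, 1.0),
    "ice cream": (2.0, 1.5),
    "cheddar": (4.0, 4.0),
    "electric car": (9.0, 0.5),
}

# Stand-in for the embedding of the query "cheesecake" (assumed value).
query_vec = (1.5, 1.2)

results = top_k(query_vec, store, k=2)
for dist, label in results:
    print(f"{label}: {dist:.2f}")
```

Here K is 2, so the search returns the two stored items whose points lie nearest to the query point.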
A note on similarity metrics
In this conceptual overview, we have assumed Euclidean distance as the metric for determining similarity between two vectors.
It's important to know that there are other similarity metric options, such as Cosine and Dot Product.
For the purposes of this high-level overview, it's not important to understand the specific differences between them, as they shouldn't affect your conceptual mental model of vector search overall.
In practice, of course, the details do matter. If you'd like to learn more, this article from Pinecone provides a nice overview of the different similarity metrics: Vector Similarity Explained.
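For readers who want a quick feel for the three metrics named above, here is a small sketch using their standard textbook definitions. The two example vectors are arbitrary, chosen so that they point in the same direction but differ in length.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two points (smaller = more similar)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dot_product(a, b):
    """Sum of element-wise products (sensitive to both direction and length)."""
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (ignores vector length)."""
    norm_a = math.sqrt(dot_product(a, a))
    norm_b = math.sqrt(dot_product(b, b))
    return dot_product(a, b) / (norm_a * norm_b)

a = (1.0, 2.0)
b = (2.0, 4.0)  # same direction as a, twice the length

print("euclidean:", euclidean_distance(a, b))
print("dot      :", dot_product(a, b))
print("cosine   :", cosine_similarity(a, b))
```

The vectors are some distance apart by the Euclidean measure, yet cosine similarity treats them as pointing the same way, which hints at why the choice of metric matters in practice.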
As you can see, vector search is a powerful and key component of almost every modern AI application.
We'd love to talk with you about how you're planning to incorporate vector search in your next AI application. Connect with us today!