First steps with Pinecone DB
This hands-on tutorial shows you how to load data into Pinecone DB, and how to search across the data using vector semantic search.
We'll cover it in 3 parts:
Getting set up
Putting data in Pinecone DB
Searching Pinecone
No LLMs here 🚫
You may be surprised to know that LLMs (large language models) like GPT-3.5, GPT-4, Llama 2, etc. are NOT used in any part of this tutorial. Vector search all the way!
What is Pinecone DB?
Tutorial use case
Have you ever struggled to remember the name of a movie? "It's about this guy who gets stuck on Mars or something?"
Well, we're going to build an app that can help you remember. We want to use whatever details you can remember about the movie to search movie summaries, and hopefully find a match! Let's call our app "Total Movie Recall".
Note: This is just a mock user interface for this app. In this tutorial we're going to be focused on the back-end, not the front-end.
1. Getting set up
This section shows you how to set up the Python dependencies you'll need. It also shows you how to get API keys for both Pinecone and OpenAI.
Running the tutorial
The only runtime requirement is Python 3.
Let's set up our environment, including installing dependencies and obtaining API keys.
We install the pinecone-client, plus we need the openai package because we will be using the text-embedding-ada-002 embedding model from OpenAI.
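The install step might look like this (a sketch; `pandas` is included here because we'll use it for the dataset later, and exact package names can vary between client versions):

```shell
pip install pinecone-client openai pandas
```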
Get a Pinecone API key
After creating or logging into your account with Pinecone, create a new Project. (Note: if you just created your account, you might already have a default Project, but to avoid any issues we suggest deleting that one and creating a brand new Project.)
Click on "API Keys" in the menu. You can use the default API key, but the best practice is to create a new API key specific to the client application, so you can manage it separately from anything else you build.
Copy both the Environment setting and the API key, which will look something like 5a5b3643-4cc9-48da-af3e-56eee93bf435.
You'll enter them both below in the Environment variables section.
Get an OpenAI API key
If you don't have an OpenAI account, create one at https://platform.openai.com/.
Note that you will need to establish billing info with OpenAI. Creating embeddings is not free, but the embeddings model we're using in this tutorial (text-embedding-ada-002) is incredibly cost efficient. See OpenAI's embeddings documentation for more info about estimating embeddings costs. It's also a good idea to visit the "Usage Limits" settings on your OpenAI account page, and establish spending limits that make sense for you to avoid getting a nasty surprise bill!
Click on "API Keys" in the menu. Creating a new API key specific to the client application is preferred.
You'll also enter this below.
We need to set 3 environment variables.
PINECONE_ENVIRONMENT - The Pinecone environment where your index resides
PINECONE_API_KEY - Your pinecone API key
OPENAI_API_KEY - Your OpenAI API key
In practice, you'd likely set these in a private .env file or otherwise securely configure them in your runtime environment settings.
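A minimal sketch of setting the three variables in Python. The values below are placeholders (the region string is just an example) — in a real project, load them from a private .env file (e.g. with python-dotenv) rather than hard-coding them:

```python
import os

# Placeholder values -- replace with your real keys, or better, load them
# from a private .env file instead of hard-coding them in source.
os.environ["PINECONE_ENVIRONMENT"] = "us-west1-gcp"   # example region
os.environ["PINECONE_API_KEY"] = "your-pinecone-api-key"
os.environ["OPENAI_API_KEY"] = "your-openai-api-key"
```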
We're using the IMDB Movies Dataset from Kaggle for this tutorial.
For simplicity, we'll assume it's already downloaded and stored in a CSV file relative to the current directory: ./data/imdb_top_1000_movies.csv (although the Kaggle python library could help with that part too).
Let's have a look at the dataset:
There's potentially a lot of interesting info here, but for our app we'll focus on just a few columns: title and description.
We'll also need a unique ID for each movie, so let's add one now, starting from 1000.
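Loading and trimming the dataset might look like this (a sketch using pandas; the default column names `Series_Title` and `Overview` are what the Kaggle file uses, but verify them against your copy of the CSV):

```python
import pandas as pd

def load_movies(csv_path, title_col="Series_Title", desc_col="Overview"):
    """Load the movies CSV, keep just title/description, and add unique IDs.

    The default column names match the Kaggle IMDB Top 1000 dataset;
    adjust them if your copy of the data uses different headers.
    """
    df = pd.read_csv(csv_path)
    df = df[[title_col, desc_col]].rename(
        columns={title_col: "title", desc_col: "description"})
    df["id"] = range(1000, 1000 + len(df))  # unique IDs starting at 1000
    return df
```

For example: `movies = load_movies("./data/imdb_top_1000_movies.csv")`.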
2. Putting data in Pinecone DB
Now that we're all set up, let's get our data into Pinecone!
We need to:
Design the data model
Create the Pinecone Index
Insert the data into Pinecone
Data modeling in Pinecone
How we store data in Pinecone is just as important as what data we store in Pinecone, and it's all based on how we intend to search the data.
Pinecone allows you to insert data into a namespace. Think of this as a logical grouping of data that identifies a search boundary. When searching, you specify the namespace you want to execute your search within.
In our movie recall app, we want to allow searching by description. So we'll populate data into just one namespace for now:
Pinecone also allows you to associate metadata to each vector stored in the DB. The primary use cases for metadata are:
Filtering at search time
Retrieval of associated/additional (non-vectored) content from search results
For our use case, we don't need to use metadata.
Our data model
Remember, we want to be able to search for a movie title based on a similarity search against its description.
For example, if someone enters "A guy gets sent back in time in a car", the app should return "Back to the Future".
So the vector search will be performed against the movie descriptions, which means we need the descriptions to be vectorized in the database.
Finally, we want to return the name of a movie when we find a match, so we'll use the movie's unique ID as the unique ID for our content in Pinecone. When we find a match we can use the ID to lookup the title and original description from our dataset.
Overview of data flow
Here is an overview of the flow of data, including both indexing time and query time.
In this section, we show you how to index your data in Pinecone. In the next section, we show you how to query.
To enable semantic search, we need to encode our dataset using an embeddings model. You can learn more about embeddings in our post "Intro to semantic search with vector databases".
To get accurate search results, we need to use the same model to create embeddings for the searchable dataset as well as for each query against that dataset.
We're going to use the text-embedding-ada-002 model from OpenAI, which is both affordable and effective in preserving semantic meaning of textual datasets.
Let's try creating embeddings for the first few items.
Great, so we can see each movie description is being encoded as a 1536-dimensional vector.
What's so special about 1536? That's the dimension size of the model we're using, text-embedding-ada-002. If you use a different embedding model, it will likely have a different dimension size. You'll need to remember this number when you create the Pinecone index below!
Let's wrap the call to create embeddings in a function that accepts a batch of strings as input, and returns an array of vector encodings.
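A sketch of such a wrapper. To keep the example self-contained, the actual API call is injected as `embed_fn`; with the classic (pre-1.0) openai package it would be something like `lambda batch: openai.Embedding.create(model="text-embedding-ada-002", input=batch)["data"]` — check the interface of your installed version:

```python
EMBEDDING_MODEL = "text-embedding-ada-002"  # produces 1536-dimensional vectors

def get_embeddings(texts, embed_fn):
    """Encode a batch of strings into embedding vectors.

    `embed_fn` performs the real embedding API call and returns a list of
    items, one per input string, each carrying an "embedding" key.
    """
    data = embed_fn(texts)
    return [item["embedding"] for item in data]
```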
So far we have:
selected an embedding model
identified our vector data model
created a function that can create embeddings
Next up, let's store some data in Pinecone!
Creating a Pinecone index
We'll create the Pinecone index via the Pinecone web console (although it's possible to create via the API as well).
Open up the Pinecone app at https://app.pinecone.io, click on Indexes, and then Create Index.
Data Modeling Tip: Each Pinecone index can only store one 'shape' of thing. This means all the embeddings stored in the index must use the same embedding model, have the same dimensions, and same metric type setting. For example, if we wanted to allow searching for similar movie posters (images), we would need to select an embeddings model that is trained on images, and the embeddings would need to be stored in a different index in Pinecone.
Give the index a name based on the use case, set the Dimension size (1536 based on our use of text-embedding-ada-002), and leave the default Metric set to cosine.
Note: For a more detailed discussion of the Metric setting, including when you should set it to something else like Euclidean, this article from Pinecone provides a nice overview of the different similarity metric options: Vector Similarity Explained.
The index will take a few minutes to initialize.
Inserting data into Pinecone
We'll use the pinecone-client library to insert data into Pinecone. First we need to initialize it with our API key and environment.
Each item you store in Pinecone has this structure:
A unique ID used to manage the vector
The vector embedding itself
Key/value data that can be used for filtering query results or returning associated data
To store an item in Pinecone, we need to first get a reference to the index. Then we can use the upsert function, which will insert or update the item based on its ID.
Let's store the description for The Shawshank Redemption in Pinecone.
Where did we get the namespace "movie-descriptions" from? It came from our data model design above!
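Written as a small helper, that upsert might look like this. To keep the sketch self-contained, the index handle is passed in as a parameter — in the classic pinecone-client it would come from roughly `pinecone.init(api_key=..., environment=...)` followed by `pinecone.Index("your-index-name")`, but verify against your client version. The ID and vector values shown in the usage line are illustrative:

```python
NAMESPACE = "movie-descriptions"  # from our data model design above

def upsert_movie(index, movie_id, vector, namespace=NAMESPACE):
    """Insert (or update) one movie's embedding under its unique ID."""
    # Pinecone accepts (id, vector) tuples; IDs must be strings.
    index.upsert(vectors=[(str(movie_id), vector)], namespace=namespace)
```

For example: `upsert_movie(index, 1000, shawshank_embedding)`, where `shawshank_embedding` is the 1536-dimensional vector produced by the embedding model.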
For best performance, Pinecone recommends uploading batches of 100 embeddings at a time, and the same batch size works well for OpenAI's embeddings endpoint. Let's write a function to process our dataset in batches of 100.
It will create embeddings for 100 items and then upload those 100 embeddings to Pinecone, with all the right metadata.
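That batching loop might be sketched like so (the embedding call and index handle are injected to keep the example self-contained; the function and parameter names are illustrative):

```python
BATCH_SIZE = 100  # recommended batch size for both Pinecone and OpenAI

def index_movies(movies, embed_fn, index, namespace="movie-descriptions"):
    """Embed and upsert movies in batches of BATCH_SIZE.

    `movies` is a sequence of (id, description) pairs; `embed_fn` turns a
    list of strings into a list of embedding vectors.
    """
    for start in range(0, len(movies), BATCH_SIZE):
        batch = movies[start:start + BATCH_SIZE]
        ids = [str(movie_id) for movie_id, _ in batch]
        vectors = embed_fn([desc for _, desc in batch])
        index.upsert(vectors=list(zip(ids, vectors)), namespace=namespace)
```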
Finally, let's verify all the vectors made it -- there should be 1000.
Now that we've got data in Pinecone, let's search it!
3. Searching Pinecone
Going back to our Total Movie Recall app, we want to be able to provide a brief, possibly terrible, recollection of a movie and have the app recall the title.
For example, "It's about this guy who gets stuck on Mars or something?" should somehow come back as a match for The Martian.
Running the search
To search the vector space we need to create embeddings for the query text, and then pass those to the query interface of our index.
We also need to specify the "movie-descriptions" namespace, since that is where the data resides within Pinecone.
Finally, we can specify the number of top search results we want to return using the top_k parameter.
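Those steps might be sketched like so (again with the embedding call and index handle injected to keep the example self-contained; the `index.query(vector=..., namespace=..., top_k=...)` shape matches the classic pinecone-client, so verify against your version):

```python
def search_movies(query_text, embed_fn, index, top_k=3):
    """Embed the query text and return the top_k nearest matches."""
    query_vector = embed_fn([query_text])[0]       # one vector for one query
    return index.query(
        vector=query_vector,
        namespace="movie-descriptions",            # where our data resides
        top_k=top_k,
    )
```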
We're getting the IDs of the best-matching search results, as well as a score indicator (highest value == best match).
Let's use that first result to look up the corresponding row in our dataset.
The Martian! Incredible.
Putting the app together
Let's write a function to encapsulate the search and the lookup, and then do a bit more testing.
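One way to wrap it all up (the embedding call and index handle are injected to keep the sketch self-contained, and the match/result shapes shown are illustrative):

```python
def recall_movie(query_text, embed_fn, index, movies_df, top_k=1):
    """Search Pinecone for the query, then look up matching titles by ID."""
    results = index.query(
        vector=embed_fn([query_text])[0],
        namespace="movie-descriptions",
        top_k=top_k,
    )
    titles = []
    for match in results["matches"]:
        # The Pinecone ID maps back to the "id" column in our dataset.
        row = movies_df.loc[movies_df["id"] == int(match["id"])]
        titles.append(row["title"].iloc[0])
    return titles
```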
And now we'll run some additional test queries.
As you can see, vector semantic search is very powerful. We were able to find matching movie titles despite providing only the bare minimum of information about the movie.
Importantly, we can see that the search is not relying on keyword search, but is instead finding similarities in the meanings of words in the description and query.
1. Expand to TV shows?
Now that our Total Recall app is working great for movies, you might consider expanding to include TV shows. (We'll leave that as an exercise to the reader.)
If we were to do that, we'd likely use this dataset from Kaggle: IMDB Top 250 TV Shows.
Based on whether we want to search movies and TV shows together or separately, we could either store TV show descriptions in the same namespace as movies or in a separate namespace.
2. Deploy the app in a user interface
Our tutorial is long enough at this point, but you can imagine how this application could be packaged and rolled out to end users as a simple chat interface.
We'd love to talk with you
Ninetack is dedicated to helping our clients leverage the latest technologies to build innovative solutions for every industry.
We'd love to talk with you about how you're planning to incorporate vector search in your next AI application. Connect with us today @ ninetack.io!