By Silas Smith

First steps with Pinecone DB

This hands-on tutorial shows you how to load data into Pinecone DB, and how to search across the data using vector semantic search.


We'll cover it in 3 parts:

  1. Getting set up

  2. Putting data in Pinecone DB

  3. Searching Pinecone

No LLMs here 🚫

You may be surprised to know that LLMs (large language models) like GPT-3.5, GPT-4, Llama 2, etc. are NOT used in any part of this tutorial. Vector search all the way!


What is Pinecone DB?

Pinecone DB (https://www.pinecone.io/) is a powerful, fully-managed vector database that provides long-term memory and semantic search for today's modern apps.


Tutorial use case

Have you ever struggled to remember the name of a movie? "It's about this guy who gets stuck on Mars or something?"


Well, we're going to build an app that can help you remember. We want to use whatever details you can remember about the movie to search movie summaries, and hopefully find a match! Let's call our app "Total Movie Recall".


We'll be using the IMDB Top 1000 Movies Dataset from Kaggle for this tutorial. (By the way, how is Total Recall not in this list!? 🤨)


A screenshot of a mock chat interface showing entering a few details about a movie and the chatbot replies with the movie title and description.
Fig. 1 - Total Movie Recall app
Note: This is just a mock user interface for this app. In this tutorial we're going to be focused on the back-end, not the front-end.

Let's go!


1. Getting set up

This section shows you how to set up the Python dependencies you'll need. It also shows you how to get API keys for both Pinecone and OpenAI.


Running the tutorial

If you want to run this tutorial yourself, you can find it hosted here on Colab as a Jupyter notebook (easiest), or you can find the original notebook file here on GitHub.

The only runtime requirement is Python 3.


Environment setup

Let's set up our environment, including dependencies and obtaining API keys.


Install dependencies

We install the pinecone-client, plus we need the openai package because we will be using the text-embedding-ada-002 embedding model from OpenAI.
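In a notebook or shell, the installs look like this (unpinned here for brevity; in practice you may want to pin versions that match this tutorial):

```shell
pip install pinecone-client openai
```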

Get a Pinecone API key

After creating or logging into your account with Pinecone, create a new Project. (Note: if you just created your account, you might already have a default Project, but to avoid any issues we suggest deleting that one and creating a brand new Project.)


Click on "API Keys" in the menu. You can use the default API key, but the best practice is to create a new API key specific to the client application, so you can manage it separately from anything else you build.


Copy both the Environment setting as well as the API key, which will be something like 5a5b3643-4cc9-48da-af3e-56eee93bf435.


You'll enter them both below in the Environment variables section.


Get an OpenAI API key

If you don't have an OpenAI account, create one at https://platform.openai.com/.

Note that you will need to establish billing info with OpenAI. Creating embeddings is not free, but the embeddings model we're using in this tutorial (text-embedding-ada-002) is incredibly cost efficient. See OpenAI's embeddings documentation for more info about estimating embeddings costs. It's also a good idea to visit the "Usage Limits" settings on your OpenAI account page, and establish spending limits that make sense for you to avoid getting a nasty surprise bill!

Click on "API Keys" in the menu. Creating a new API key specific to the client application is preferred.


You'll also enter this below.


Environment variables

We need to set 3 environment variables.

  • PINECONE_ENVIRONMENT - The Pinecone environment where your index resides

  • PINECONE_API_KEY - Your Pinecone API key

  • OPENAI_API_KEY - Your OpenAI API key

In practice, you'd likely set these in a private .env file or otherwise securely configure them in your runtime environment settings.
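For illustration, here is one way to wire them up in Python. The values below are placeholders, not real keys:

```python
import os

# Placeholder values for illustration only -- in practice these come from a
# private .env file or your runtime's secret store, never from source code.
os.environ.setdefault("PINECONE_ENVIRONMENT", "<your-pinecone-environment>")
os.environ.setdefault("PINECONE_API_KEY", "<your-pinecone-api-key>")
os.environ.setdefault("OPENAI_API_KEY", "<your-openai-api-key>")

pinecone_env = os.environ["PINECONE_ENVIRONMENT"]
pinecone_key = os.environ["PINECONE_API_KEY"]
openai_key = os.environ["OPENAI_API_KEY"]
```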



Dataset review

We're using the IMDB Top 1000 Movies Dataset from Kaggle for this tutorial.


For simplicity, we'll assume it's already downloaded and stored in a CSV file relative to the current directory: ./data/imdb_top_1000_movies.csv (although the Kaggle python library could help with that part too).


Let's have a look at the dataset:
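A quick sketch with pandas (the path follows the setup above; the exact Kaggle column names may differ slightly from what's shown in the figure below):

```python
import pandas as pd

def load_movies(path: str = "./data/imdb_top_1000_movies.csv") -> pd.DataFrame:
    """Load the IMDB Top 1000 dataset from the downloaded CSV."""
    return pd.read_csv(path)

# df = load_movies()
# df.head()
```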

Output:

A table showing the contents of the movie dataset, with columns like series_title, overview, director, year released, etc.

There's potentially a lot of interesting info here, but for our app we'll focus on just a few columns: title and description.

Output:

A table showing the contents of the movie dataset focused on two columns: Title and Description.

We'll also need a unique ID for each movie, so let's add one now, starting from 1000.
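Both prep steps sketched with pandas. The sample rows and the 'Series_Title'/'Overview' column names are stand-ins for the real dataset:

```python
import pandas as pd

# Stand-in for the loaded Kaggle dataframe.
df = pd.DataFrame({
    "Series_Title": ["The Shawshank Redemption", "The Martian"],
    "Overview": [
        "Two imprisoned men bond over a number of years...",
        "An astronaut becomes stranded on Mars...",
    ],
})

# Keep just the columns our app needs.
movies = df[["Series_Title", "Overview"]].rename(
    columns={"Series_Title": "title", "Overview": "description"}
)

# Add a unique ID per movie, starting from 1000.
movies["id"] = range(1000, 1000 + len(movies))
```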

Output:

A table showing the contents of the movie dataset focused on three columns: Title and Description, and with a 3rd column added representing a unique ID.

2. Putting data in Pinecone DB

Now that we're all set up, let's get our data into Pinecone!


We need to:

  • Design the data model

  • Create embeddings

  • Create the Pinecone Index

  • Insert to Pinecone

Data modeling in Pinecone

How we store data in Pinecone is just as important as what data we store in Pinecone, and it's all based on how we intend to search the data.


Namespaces

Pinecone allows you to insert data into a namespace. Think of this as a logical grouping of data that identifies a search boundary. When searching, you specify the namespace you want to execute your search within.


In our movie recall app, we want to allow searches by Description. So we'll populate data into just one namespace for now:

  • 'movie-descriptions'

Metadata

Pinecone also allows you to associate metadata to each vector stored in the DB. The primary use cases for metadata are:

  • Filtering at search time

  • Retrieval of associated/additional (non-vectored) content from search results

For our use case, we don't need to use metadata.


Our data model

Remember, we want to be able to search for a movie title based on a similarity search against its description.


For example, if someone enters "A guy gets sent back in time in a car", the app should return "Back to the Future".


So the vector search will be performed against the movie descriptions, which means we need the descriptions to be vectorized in the database.


Finally, we want to return the name of a movie when we find a match, so we'll use the movie's unique ID as the unique ID for our content in Pinecone. When we find a match we can use the ID to lookup the title and original description from our dataset.


Overview of data flow

Here is an overview of the flow of data, including both indexing time and query time.


A data flow diagram showing the process of creating embeddings for the movie descriptions and storing in a vector database, and the query time search across that dataset for a matching movie description.
Fig. 2 - Total Movie Recall app data flow

In this section, we show you how to index your data in Pinecone. In the next section, we show you how to query.


Creating embeddings

To enable semantic search, we need to encode our dataset using an embeddings model. You can learn more about embeddings in our post "Intro to semantic search with vector databases".


To get accurate search results, we need to use the same model to create embeddings for the searchable dataset as well as for each query against that dataset.


We're going to use the text-embedding-ada-002 model from OpenAI, which is both affordable and effective in preserving semantic meaning of textual datasets.

Let's try creating embeddings for the first few items.


Great, so we can see each movie description is being encoded as 1536 dimensional points.

What's so special about 1536? That's the dimension size of the model we're using, text-embedding-ada-002. If you use a different embedding model, it will likely have a different dimension size. You'll need to remember this number when you create the Pinecone index below!

Let's wrap the call to create embeddings in a function that accepts a batch of strings as input, and returns an array of vector encodings.
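A sketch of that function, assuming the openai 1.x client interface (where the response shape is `resp.data[i].embedding`). The client is passed in as a parameter so it's easy to swap or stub:

```python
def create_embeddings(texts, client, model="text-embedding-ada-002"):
    """Return one embedding vector per input string.

    `client` is an OpenAI client instance, e.g. OpenAI() from the
    openai >= 1.0 package, which reads OPENAI_API_KEY from the environment.
    """
    resp = client.embeddings.create(model=model, input=list(texts))
    return [item.embedding for item in resp.data]
```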



Checkpoint!


So far we have:

  • selected an embedding model

  • identified our vector data model

  • created a function that can create embeddings

Next up, let's store some data in Pinecone!


Creating a Pinecone index

We'll create the Pinecone index via the Pinecone web console (although it's possible to create via the API as well).
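For reference, the API route looks roughly like this. This is a sketch only: the exact calls vary by pinecone-client version, `pc` stands for the initialized client (in pinecone-client 2.x you'd call pinecone.init(...) and use the module itself), and the index name is just our choice:

```python
def ensure_index(pc, name="movie-descriptions-index",
                 dimension=1536, metric="cosine"):
    """Create the index via the API if it doesn't already exist."""
    if name not in pc.list_indexes():
        pc.create_index(name=name, dimension=dimension, metric=metric)
```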


Open up the Pinecone app at https://app.pinecone.io, click on Indexes, and then Create Index.

Data Modeling Tip: Each Pinecone index can only store one 'shape' of thing. This means all the embeddings stored in the index must use the same embedding model, have the same dimensions, and same metric type setting. For example, if we wanted to allow searching for similar movie posters (images), we would need to select an embeddings model that is trained on images, and the embeddings would need to be stored in a different index in Pinecone.

Give the index a name based on the use case, set the Dimension size (1536 based on our use of text-embedding-ada-002), and leave the default Metric set to cosine.

A screenshot from Pinecone console showing the creation of an index.
Fig. 3a - Create Pinecone index
Note: For a more detailed discussion of the Metric setting, including when you should set it to something else like Euclidean, this article from Pinecone provides a nice overview of the different similarity metric options: Vector Similarity Explained.

The index will take a few minutes to initialize.

A screenshot from Pinecone console showing the new index in "Ready" state.
Fig. 3b - Index ready

Inserting data to Pinecone

We'll use the pinecone-client lib to insert data into Pinecone. First we need to initialize it with our API key and environment.



Each item you store in Pinecone has this structure:

  • id - A unique ID used to manage the vector

  • values - The vector embedding itself

  • metadata - Key/value data that can be used for filtering query results or returning associated data
To store an item in Pinecone, we need to first get a reference to the index. Then we can use the upsert function, which will insert or update the item based on its ID.


Let's store the description for The Shawshank Redemption in Pinecone.
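A sketch of the upsert, assuming an embedding vector has already been created for the description. The index handle is passed in as a parameter (the function name is ours, not part of the Pinecone API):

```python
def upsert_description(index, movie_id, embedding,
                       namespace="movie-descriptions"):
    """Insert-or-update one description vector, keyed by the movie's unique ID.

    `index` is a Pinecone index handle, e.g.
    pinecone.Index("movie-descriptions-index") after pinecone.init(...).
    Pinecone IDs are strings, so we convert.
    """
    index.upsert(
        vectors=[(str(movie_id), embedding)],
        namespace=namespace,
    )
```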


Where did we get the namespace "movie-descriptions" from? It came from our data model design above!


Batching upserts

For best performance, Pinecone recommends uploading batches of 100 embeddings at a time, and the same batch size works well for OpenAI's embeddings endpoint. Let's write a function to process our dataset in batches of 100.


It will create embeddings for 100 items at a time, then upload those 100 embeddings to Pinecone with their corresponding IDs.
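A sketch of the batch loader. `embed_fn` maps a list of strings to a list of vectors (e.g. a wrapper around the OpenAI embeddings call) and `index` is the Pinecone index handle; both are injected so the batching logic stands on its own:

```python
def index_movies(movies, embed_fn, index,
                 namespace="movie-descriptions", batch_size=100):
    """Embed and upsert (id, description) pairs in batches of `batch_size`."""
    items = list(movies)
    for start in range(0, len(items), batch_size):
        batch = items[start:start + batch_size]
        ids = [str(movie_id) for movie_id, _ in batch]
        vectors = embed_fn([description for _, description in batch])
        index.upsert(vectors=list(zip(ids, vectors)), namespace=namespace)
```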



Finally, let's verify all the vectors made it -- there should be 1000.
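One way to check is via `describe_index_stats`, which reports per-namespace vector counts (the exact response shape varies a bit across pinecone-client versions; dict-style access is shown here):

```python
def count_vectors(index, namespace="movie-descriptions"):
    """Return the number of vectors stored in the given namespace."""
    stats = index.describe_index_stats()
    return stats["namespaces"][namespace]["vector_count"]

# count_vectors(index)  # expect 1000 for our dataset
```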


Now that we've got data in Pinecone, let's search it!


3. Searching Pinecone

Going back to our Total Movie Recall app, we want to be able to provide a brief, possibly terrible, recollection of a movie and have it recall the title.


For example, "It's about this guy who gets stuck on Mars or something?" should somehow come back as a match for The Martian.


Running the search

To search the vector space we need to create embeddings for the query text, and then pass those to the query interface of our index.


We also need to specify the "movie-descriptions" namespace, since that is where the data resides within Pinecone.


Finally, we can specify the number of top search results we want to return using the top_k parameter.
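Those three pieces together, as a sketch; `embed_fn` is our embedding wrapper and `index` is the Pinecone index handle, with `index.query` taking `vector=`, `namespace=`, and `top_k=` keyword arguments per the pinecone-client interface:

```python
def search_movies(query_text, embed_fn, index,
                  namespace="movie-descriptions", top_k=3):
    """Embed the query text and run a vector similarity search."""
    query_vector = embed_fn([query_text])[0]
    return index.query(
        vector=query_vector,
        namespace=namespace,
        top_k=top_k,
    )
```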



We're getting back the IDs of the best-matching search results, along with a similarity score (higher means a better match).


Let's use that first result to look up the corresponding row in our dataset.
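With pandas, the lookup is a simple filter on the `id` column we added earlier. Match IDs come back from Pinecone as strings, so we convert; the sample row below is a stand-in for the full prepared dataset:

```python
import pandas as pd

# Stand-in for the full prepared dataset.
movies = pd.DataFrame({
    "title": ["The Martian"],
    "description": ["An astronaut becomes stranded on Mars..."],
    "id": [1329],
})

match_id = int("1329")  # e.g. results["matches"][0]["id"]
match = movies[movies["id"] == match_id]
```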


A table showing the matching entry from the movie dataset as matching ID 1329 -- "The Martian".

The Martian! Incredible.


Putting the app together

Let's write a function to encapsulate the search and the lookup, and then do a bit more testing.
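A sketch of the combined function, with the embedding function, index handle, and dataframe all injected (the function and parameter names are ours, for illustration):

```python
def movie_recall(query_text, embed_fn, index, movies,
                 namespace="movie-descriptions", top_k=1):
    """Return the best-matching (title, description) for a vague recollection,
    or None if nothing matched."""
    query_vector = embed_fn([query_text])[0]
    results = index.query(vector=query_vector, namespace=namespace, top_k=top_k)
    if not results["matches"]:
        return None
    match_id = int(results["matches"][0]["id"])
    row = movies[movies["id"] == match_id].iloc[0]
    return row["title"], row["description"]
```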



And now we'll run some additional test queries.



Wrap-up


It works!

As you can see, vector semantic search is very powerful. We were able to find matching movie titles despite providing only the bare minimum of information about the movie.


Importantly, we can see that the search is not relying on keyword search, but is instead finding similarities in the meanings of words in the description and query.


A screenshot of a mock chat interface showing entering a few details about a movie and the chatbot replies with the movie title and description, where the search string does not directly contain any keywords from the description.
Fig. 4 - Semantic search, not keyword search

Next steps


1. Expand to TV shows?

Now that our Total Movie Recall app is working great for movies, you might consider expanding it to include TV shows. (We'll leave that as an exercise for the reader.)


If we were to do that, we'd likely use this dataset from Kaggle: IMDB Top 250 TV Shows.


Based on whether we want to search movies and TV shows together or separately, we could either store TV show descriptions in the same namespace as movies or in a separate namespace.


2. Deploy the app in a user interface

Our tutorial is long enough at this point, but you can imagine how this application could be packaged and rolled out to end users as a simple chat interface.


Note that if we were to build this as a full app, we'd likely start with one of Vercel's excellent Next.js app templates, such as Next.js AI Chatbot.



We'd love to talk with you

Ninetack is dedicated to helping our clients leverage the latest technologies to build innovative solutions for every industry.


We'd love to talk with you about how you're planning to incorporate vector search in your next AI application. Connect with us today @ ninetack.io!
