How to Use Cosine Similarity for Vector Search in pgvector

Lately, to deepen my understanding of Retrieval-Augmented Generation (RAG for short), I have been developing a local RAG setup from scratch with a focus on open-source technology. Building it from scratch and coding as many components as possible myself has allowed me to explore how the different parts of such a system interact, including vector storage and retrieval.

Today, I want to discuss how to effectively search stored vector embeddings using PostgreSQL and its extension, pgvector.

Prerequisites and Used Technology

Before diving into vector search, ensure you have the following setup:

  • PostgreSQL with the pgvector extension installed.
  • Access to an embedding model for generating vector embeddings.
  • Python with psycopg2 for executing SQL queries programmatically, or any other client your application uses to access the database.

I have written a detailed guide on setting up pgvector with Docker that you can follow to get started.

For this tutorial, I use Obsidian markdown notes as my dataset. These notes are currently embedded using OpenAI's text-embedding-3-large model and stored as vectors in a PostgreSQL database with pgvector, running locally via Docker. However, this approach can be applied to any text-based dataset with any text embedding model.

When performing a search query, we need to identify which stored notes are most relevant to the user input. The key idea is to embed the search query with the same text embedding model and then compare it to the stored embeddings using a similarity metric.

Why Use Cosine Similarity?

The short answer for this specific post is: because it's the default. The slightly longer answer is that this is the metric often used for training vector embeddings.

If you are using text-embedding-3-large by OpenAI or a similar model, you should just listen to their advice: "We recommend cosine similarity. The choice of distance function typically doesn't matter much." (see https://platform.openai.com/docs/guides/embeddings)

If you are using another model for text embedding, you can see if they also have an FAQ on which similarity metric to use, but most of the time, you won't go wrong with cosine similarity.

Mathematically, cosine similarity between two vectors A and B is defined as:

$ \cos(\theta) = \frac{\mathbf{A} \cdot \mathbf{B}}{\|\mathbf{A}\| \, \|\mathbf{B}\|} $

Where:

  • A ⋅ B is the dot product of the two vectors.
  • ||A|| and ||B|| are the magnitudes (L2 norms) of the vectors.
  • θ is the angle between the vectors.
  • The result ranges from -1 (completely opposite) to 1 (identical), with 0 indicating orthogonality (no similarity).
⚠️
Distance vs. similarity: with pgvector, cosine similarity is implemented as a distance metric (the <=> operator), meaning lower values indicate higher similarity. The similarity score can then be computed as: cosine_similarity = 1 - cosine_distance
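To make the relationship concrete, here is a minimal Python sketch (with two made-up toy vectors, not real embeddings) that computes cosine similarity from the formula above and shows how it relates to the distance value pgvector returns:

import numpy as np

# Two toy vectors; in practice these would be text embeddings
a = np.array([0.1, 0.3, 0.5])
b = np.array([0.2, 0.1, 0.4])

# Cosine similarity: dot product divided by the product of the L2 norms
cosine_similarity = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# pgvector's <=> operator returns the cosine *distance* instead
cosine_distance = 1 - cosine_similarity

print(f"similarity: {cosine_similarity:.4f}, distance: {cosine_distance:.4f}")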

Implementing Cosine Similarity in pgvector

1. Storing Vectors in pgvector

First, we store text embeddings as vectors in a PostgreSQL table.

CREATE TABLE IF NOT EXISTS documents (
      id SERIAL PRIMARY KEY,
      title TEXT,
      start_of_chunk TEXT,
      embedding VECTOR(1536), -- set this to the embedding dimension your model outputs (1536 here to match the example query vector below)
      text_chunk TEXT
);

INSERT INTO documents (title, start_of_chunk, embedding, text_chunk) VALUES (%s, %s, %s, %s); -- %s placeholders are filled in by psycopg2, or you can hard-code the values
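For completeness, here is a minimal sketch of how such an insert could look from Python with psycopg2, using the connection details from the example later in this post and a placeholder embedding instead of a real model call:

import psycopg2

conn = psycopg2.connect(
    dbname="embedding_db",
    user="dev_user",
    password="dev_password",
    host="localhost",
    port="5433",
)
cursor = conn.cursor()

text_chunk = "Cosine similarity measures the angle between two vectors."
embedding = [0.0] * 1536  # placeholder; replace with a real embedding from your model

# %s::vector casts the parameter into a pgvector vector (see the note further below)
cursor.execute(
    "INSERT INTO documents (title, start_of_chunk, embedding, text_chunk) VALUES (%s, %s, %s::vector, %s);",
    ("Example note", text_chunk[:50], embedding, text_chunk),
)
conn.commit()
cursor.close()
conn.close()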

2. Querying for Similar Notes

Top K Results

To find the most relevant notes for a given query, we embed the query and compare it with stored vectors using cosine similarity:

SELECT title, text_chunk, 1 - (embedding <=> '[QUERY_VECTOR]') AS similarity
FROM documents
ORDER BY similarity DESC
LIMIT 5;

Here’s how it works:

  • <=> calculates cosine distance (lower = more similar).
  • 1 - cosine_distance converts it into cosine similarity (higher = more similar).
  • ORDER BY similarity DESC sorts results from most to least similar.
  • LIMIT 5 returns the top 5 relevant notes.
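If you want to run this top-K variant from Python before we get to the threshold-based version below, a minimal psycopg2 sketch could look like this (assuming an open connection conn and a query_vector list produced by your embedding model):

cursor = conn.cursor()
cursor.execute(
    """
    SELECT title, text_chunk, 1 - (embedding <=> %s::vector) AS similarity
    FROM documents
    ORDER BY similarity DESC
    LIMIT 5;
    """,
    (query_vector,),
)
for title, text_chunk, similarity in cursor.fetchall():
    print(f"{title} (Similarity: {similarity:.2f})")
cursor.close()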

Flexible number of results based on threshold

Instead of retrieving the top K results, you can filter results based on a similarity threshold. If you always retrieve 5 results but only have 3 genuinely matching documents, the strategy above pads the context with irrelevant information, which can confuse the final generated answer.

Here's how to query using a threshold instead of a fixed limit:

WITH similarity_calculation AS (
    SELECT title, 
        start_of_chunk, 
        text_chunk, 
        embedding, 
        1 - (embedding <=> '[QUERY_VECTOR]') AS cosine_similarity
    FROM documents
)
SELECT title, 
    start_of_chunk, 
    text_chunk, 
    embedding, 
    cosine_similarity
FROM similarity_calculation
WHERE cosine_similarity > 0.5;

This approach:

  • Computes cosine similarity for all stored documents (be careful if you have a very large number of documents; with my ~500 documents, performance was completely fine).
  • Filters out documents with similarity below the chosen threshold (e.g., 0.5).
  • Ensures that only sufficiently relevant results are returned.

3. Running the Query in Python

To integrate this with Python, I use psycopg2 to connect to PostgreSQL and execute the search query:

import psycopg2
import numpy as np

# Connect to PostgreSQL database
conn = psycopg2.connect(
    dbname="embedding_db",
    user="dev_user",
    password="dev_password",
    host="localhost",
    port="5433",
)
cursor = conn.cursor()

# Example query vector (replace with an actual embedding)
query_vector = np.random.rand(1536).tolist()

# Execute pgvector search
query = """
    WITH similarity_calculation AS (
        SELECT title,
            text_chunk,
            1 - (embedding <=> %s::vector) AS cosine_similarity
        FROM documents
    )
    SELECT title,
        text_chunk,
        cosine_similarity
    FROM similarity_calculation
    WHERE cosine_similarity > 0.5
    ORDER BY cosine_similarity DESC;
"""

cursor.execute(query, (query_vector,))

# Fetch and display results
results = cursor.fetchall()
for title, text_chunk, similarity in results:
    print(f"{title} (Similarity: {similarity:.2f})")

cursor.close()
conn.close()
💡
You can see here that to pass the vector into the query, I use the %s placeholder and cast it into a pgvector vector with the %s::vector syntax (:: being the cast operator in PostgreSQL).
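As an alternative (this goes beyond the setup used above), the separate pgvector Python package provides a psycopg2 adapter, so vectors can be passed directly without the explicit cast:

# Requires the pgvector Python package (pip install pgvector)
from pgvector.psycopg2 import register_vector
import numpy as np

register_vector(conn)  # registers the vector type for this connection

cursor = conn.cursor()
cursor.execute(
    "SELECT title FROM documents ORDER BY embedding <=> %s LIMIT 5;",
    (np.array(query_vector),),
)
print(cursor.fetchall())
cursor.close()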

Understanding Distance vs. Similarity Thresholds

It is crucial to understand that distance and similarity are inversely related:

  • Cosine distance (what pgvector's <=> operator returns): Lower values = more similar.
  • Cosine similarity (what we typically expect): Higher values = more similar.

To adjust the search results, set a threshold for similarity:

  • 0.5 (broad results): Captures loosely related notes.
  • 0.7–0.8 (strict results): Retrieves highly relevant content.

These thresholds are rules of thumb based on common values found online. Since I want to chat with all of my notes relevant to a topic, I chose the lower threshold of 0.5, but I encourage you to try out different values between 0.5 and 1 on a few example queries from your own use case to tune how many results you retrieve per query.
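A minimal sketch of such an experiment, assuming an open psycopg2 cursor and the query_vector from the Python example above, could look like this:

# Try a few thresholds and see how many chunks each one would retrieve
sweep_query = """
    SELECT count(*)
    FROM documents
    WHERE 1 - (embedding <=> %s::vector) > %s;
"""

for threshold in (0.5, 0.6, 0.7, 0.8):
    cursor.execute(sweep_query, (query_vector, threshold))
    count = cursor.fetchone()[0]
    print(f"threshold {threshold}: {count} chunks retrieved")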

Limitations and Considerations

While this approach worked completely fine for my limited use case of chatting with my personal notes, it will not hold up as-is for much larger collections. Here are some options to consider within pgvector.

  • Scalability: By default, pgvector performs exact nearest neighbor search, which has been completely fine with my ~500 documents. For large-scale retrieval, consider adding an index (https://github.com/pgvector/pgvector?tab=readme-ov-file#indexing), which switches queries to approximate nearest neighbor search (a minimal sketch follows after this list).
  • Alternative Metrics: pgvector also supports L2 distance, inner product, L1 distance, Hamming distance, and Jaccard distance (see the documentation in the repo: https://github.com/pgvector/pgvector?tab=readme-ov-file#querying).
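
Since the rest of this post uses psycopg2, here is a minimal sketch of creating an HNSW index for cosine distance on the documents table. The index name and the default build parameters are my own choices; treat this as a starting point rather than a tuned setup:

import psycopg2

conn = psycopg2.connect(
    dbname="embedding_db",
    user="dev_user",
    password="dev_password",
    host="localhost",
    port="5433",
)
cursor = conn.cursor()

# HNSW index with cosine distance ops; <=> queries can then use approximate search
cursor.execute(
    "CREATE INDEX IF NOT EXISTS documents_embedding_hnsw_idx "
    "ON documents USING hnsw (embedding vector_cosine_ops);"
)
conn.commit()
cursor.close()
conn.close()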

Of course, beyond pgvector there are also dedicated vector databases like ChromaDB or Weaviate, libraries like FAISS, or fully managed options like Pinecone.

Conclusion

Using cosine similarity in pgvector provides an efficient way to search for relevant text-based entries. The combination of OpenAI embeddings, PostgreSQL, and SQL queries enables fast and flexible vector retrieval.

By applying this technique, you can:

  • Store any text dataset as embeddings.
  • Perform semantic search over your notes.
  • Tune your similarity threshold for better results.
  • Integrate with Python for a dynamic retrieval system.

If you’re building your own local RAG system, I’d love to hear about your experiences - drop a comment below or check out my other posts for deeper insights!