How to set up pgvector with Docker: Local Vector Database for Text Embeddings

In this post, I'll walk you through how I set up a local vector storage system using pgvector, a PostgreSQL extension designed for handling high-dimensional vectors. This setup is part of my ongoing "learning-by-doing" project to build a local retrieval-augmented generation (RAG) system. The resulting database can then be used either just for search or to generate text based on the retrieved content - like a knowledge base for your GenAI app.

This guide uses a Docker image for the database, but you may be able to reuse some parts if you already have an existing PostgreSQL installation.

Why pgvector?

Pgvector introduces a dedicated data type, operators, and functions that enable efficient storage, manipulation, and analysis of vector data. It's an open-source solution that integrates seamlessly with PostgreSQL, making it a great choice if you're already familiar with Postgres databases. There are many solutions specifically designed for vector storage, but I was hesitant to learn a completely new tool and research its licensing for this project, so I'm trying out pgvector first.

Getting Started with Docker

I decided to use Docker for this setup because it allows for easy deployment and management of the database. At least once you have figured out the right image to use and the right PostgreSQL settings to put into your yml file.

Here's the docker-compose.yml file I ended up using:

services:
  db:
    image: pgvector/pgvector:pg17  # Prebuilt Postgres image with pgvector
    container_name: postgres-pgvector
    ports:
      - "5432:5432"
    environment:
      POSTGRES_USER: dev_user
      POSTGRES_PASSWORD: dev_password
      POSTGRES_DB: embedding_db
    volumes:
      - postgres_data:/var/lib/postgresql/data
volumes:
  postgres_data:

docker-compose.yml

Some important notes about this docker-compose.yml:

  • This image includes both the basic PostgreSQL DB and pgvector, so you don't need a separate setup for "pure" PostgreSQL. There is also the ankane/pgvector image, but I wasn't sure how/when that one gets updated.
  • You can choose the container_name however you want.
  • If you already have another project that uses Postgres, it's likely that port 5432 is already in use. In that case, use 5433:5432 instead (or an even higher number if that's taken too) - see the snippet after this list. You only need to change the first (external) port number, not the second (internal) one, which is what other services in the same Docker network use to reach the DB.
  • Obviously change the username and password if you plan on storing sensitive information or if the service will be running in a production environment.
  • You need the volumes settings if you want the data stored in the DB to persist - otherwise the DB starts empty every time the Docker container is recreated.
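
For example, if the default port is taken, only the host-side number in the ports mapping changes:

    ports:
      - "5433:5432"  # host port 5433 maps to the container's 5432

You would then connect from your machine via port 5433, while Postgres inside the container still listens on 5432.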

Step 1: Start the Docker Container with the PostgreSQL DB

Run the following command in the directory where your docker-compose.yml file is located:


docker compose up -d

This command will pull the pgvector/pgvector:pg17 image, start the Postgres container, and mount the postgres_data volume for data persistence.
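
To verify that the container is actually running, you can check its status and logs:

docker compose ps
docker compose logs db

The logs should end with a line like "database system is ready to accept connections".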

Step 2: Connect to the Database

Now, here's where it got tricky again. Theoretically you can use psql to connect to the database:


psql -h localhost -p 5432 -U dev_user -d embedding_db

Enter the password (dev_password) when prompted. You should see a Postgres prompt (embedding_db=#). If you do, that's great.

psql is the interactive terminal for working with Postgres. However, I could not figure out how to install it correctly on macOS so that all the commands here would work.

Alternative: use a GUI-based tool like DBeaver

DBeaver (https://dbeaver.io/) is a cross-platform database tool with an open-source Community Edition that you can install and use for free. It gives you a GUI for setting up the connection to Postgres (or most other SQL databases like MySQL, SQLite, etc.) and an SQL editor for querying your data.

Simply make sure to enter the information from your yml file into the fields for host (localhost), port, database, username, and password, and then you can connect.

Step 3: Enable the pgvector Extension

Once connected, enable the pgvector extension with the following SQL command:

CREATE EXTENSION IF NOT EXISTS vector;
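
If you want to confirm that the extension is active, you can query the Postgres catalog:

SELECT extname, extversion FROM pg_extension WHERE extname = 'vector';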

Step 4: Test the Vector Extension

Create a table and insert some vector data to test the setup:

CREATE TABLE items (
  id SERIAL PRIMARY KEY,
  embedding VECTOR(3) -- Example for 3-dimensional vectors
);

INSERT INTO items (embedding) VALUES ('[0.1, 0.2, 0.3]'), ('[0.4, 0.5, 0.6]');

SELECT * FROM items;
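
The whole point of pgvector is similarity search, so as a quick sanity check you can ask for the stored rows closest to an example query vector. The <-> operator computes the Euclidean (L2) distance; pgvector also provides <#> for (negative) inner product and <=> for cosine distance:

SELECT id, embedding, embedding <-> '[0.2, 0.3, 0.4]' AS distance
FROM items
ORDER BY embedding <-> '[0.2, 0.3, 0.4]'
LIMIT 2;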

Next Steps - Connect with your App (Optional)

You can use the database as is with SQL commands, of course. Next on my agenda, however, is connecting the DB to my Python application, which I'm using to generate embeddings of Markdown files.

Depending on the programming language you're working with, there are multiple libraries to choose from for connecting to Postgres. For Python, these are apparently psycopg2 and asyncpg (let's be real, those are horrible names; I can never remember them), and I will try them in the coming weeks.
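
As a teaser, here is a minimal sketch of what the psycopg2 variant could look like (install it with pip install psycopg2-binary). The connection settings are the ones from the docker-compose.yml above, and passing vectors as strings with a ::vector cast is a simple way to get by without extra helpers - the separate pgvector package for Python also offers proper type adapters:

import psycopg2

# Connection settings from docker-compose.yml
conn = psycopg2.connect(
    host="localhost",
    port=5432,
    dbname="embedding_db",
    user="dev_user",
    password="dev_password",
)

with conn, conn.cursor() as cur:
    # Insert a vector as its string representation, cast to the vector type
    cur.execute(
        "INSERT INTO items (embedding) VALUES (%s::vector)",
        ("[0.7, 0.8, 0.9]",),
    )
    # Find the stored row closest to a query vector (L2 distance)
    cur.execute(
        "SELECT id, embedding FROM items ORDER BY embedding <-> %s::vector LIMIT 1",
        ("[0.7, 0.8, 0.9]",),
    )
    print(cur.fetchone())

conn.close()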

Reflections and Challenges

Setting up this system was a learning experience. As a beginner with Docker, I was a bit overwhelmed by picking the right image and configuring the Docker Compose file. If you're new to this, don't worry - it's a learning curve, but a manageable one - and thankfully tools like ChatGPT are very patient in explaining what difference it makes if you change this or that in the compose file.

This setup is a stepping stone for my local RAG project (which you can find, including updates, on GitHub), and I hope it serves as a useful reference for you. If you have any feedback or better ideas, feel free to reach out. Happy coding!