---
obj: application
repo: https://github.com/pgvector/pgvector
rev: 2024-09-30
---

# pgVector

**pgvector** is a [PostgreSQL](./Postgres.md) extension for vector similarity search. Machine learning models for natural language processing (NLP), computer vision, and recommendation systems produce high-dimensional vectors (embeddings) that need to be stored and queried efficiently. pgvector lets PostgreSQL store these vectors and search for similar items by vector distance directly in SQL.

## Installation

1. Install pgvector using `git` and `make`:
   ```bash
   git clone https://github.com/pgvector/pgvector.git
   cd pgvector
   make && make install
   ```

2. Add the extension to your PostgreSQL database:
   ```sql
   CREATE EXTENSION IF NOT EXISTS vector;
   ```

## Data Types

pgvector introduces a new data type called `vector`. It stores fixed-length vectors; the number of dimensions is given in the column definition.

```sql
CREATE TABLE items (
    id serial PRIMARY KEY,
    embedding vector(3) -- a 3-dimensional vector
);
```

## Functions and Operators

pgvector provides several functions and operators for vector similarity and distance calculation.

### Distance Metrics

All three operators return a value where smaller means more similar, so `ORDER BY ... LIMIT n` returns the nearest neighbors first.

- **Euclidean Distance** (`<->`): Measures the straight-line (L2) distance between two vectors.
  ```sql
  SELECT * FROM items ORDER BY embedding <-> '[1, 0, 0]' LIMIT 5;
  ```

- **Cosine Distance** (`<=>`): Returns 1 minus the cosine similarity, that is, 1 minus the cosine of the angle between two vectors. To get the cosine similarity itself, compute `1 - (a <=> b)`.
  ```sql
  SELECT * FROM items ORDER BY embedding <=> '[1, 0, 0]' LIMIT 5;
  ```

- **Inner Product** (`<#>`): Returns the *negative* inner (dot) product of two vectors, because PostgreSQL only supports ascending-order index scans on operators. Ordering by `<#>` therefore returns the rows with the largest inner product first; multiply the result by -1 to get the actual inner product.
  ```sql
  SELECT * FROM items ORDER BY embedding <#> '[1, 0, 0]' LIMIT 5;
  ```

### Basic Operations

- **Insert a Vector**:
  ```sql
  INSERT INTO items (embedding) VALUES ('[1, 0, 0]');
  ```

- **Retrieve All Vectors**:
  ```sql
  SELECT * FROM items;
  ```

## Indexing

Without an index, pgvector performs an exact nearest-neighbor search by scanning every row. An IVFFlat index speeds up similarity search by making it approximate. The operator class must match the distance operator used in your queries:

- **Euclidean Distance** (L2, `<->`):
  ```sql
  CREATE INDEX ON items USING ivfflat (embedding vector_l2_ops) WITH (lists = 100);
  ```

- **Cosine Distance** (`<=>`):
  ```sql
  CREATE INDEX ON items USING ivfflat (embedding vector_cosine_ops) WITH (lists = 100);
  ```

- **Inner Product** (`<#>`):
  ```sql
  CREATE INDEX ON items USING ivfflat (embedding vector_ip_ops) WITH (lists = 100);
  ```

### Index Parameters

- **Lists**: The number of clusters (inverted lists) that the IVFFlat (Inverted File) index partitions the vectors into. With more lists, each query scans a smaller share of the table, so queries get faster, but recall drops unless more lists are probed at query time. The pgvector README suggests starting with `rows / 1000` for tables of up to 1M rows and `sqrt(rows)` for larger tables.

## Use Cases

1. **Recommendation Systems**: Store user and item embeddings and use similarity search to recommend items based on user preferences.
2. **Search Engines**: Search for semantically similar documents or images using vector embeddings.
3. **NLP Applications**: Store word, sentence, or document embeddings to perform similarity search or clustering of textual data.
4. **Image Recognition**: Query for similar images based on embeddings generated by deep learning models.
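
As a footnote to the IVFFlat indexes under Indexing: the number of lists is fixed when the index is built, but the number of lists searched per query can be changed at any time through pgvector's `ivfflat.probes` setting (default 1). The sketch below reuses the `items` table from this note; the value `10` is an illustrative assumption (roughly `sqrt(lists)` for `lists = 100`), not a tuned recommendation.

```sql
-- Search more lists per query: higher recall, slower queries.
-- Applies to the rest of the current session.
SET ivfflat.probes = 10;

SELECT id, embedding <-> '[1, 0, 0]' AS l2_distance
FROM items
ORDER BY embedding <-> '[1, 0, 0]'
LIMIT 5;

-- Alternatively, scope the setting to a single transaction
-- so that other queries in the session keep the default.
BEGIN;
SET LOCAL ivfflat.probes = 10;
SELECT id
FROM items
ORDER BY embedding <-> '[1, 0, 0]'
LIMIT 5;
COMMIT;
```

Setting `ivfflat.probes` equal to the number of lists makes the search exact again, at which point the planner will typically stop using the index.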