diff --git a/technology/applications/development/pgvector.md b/technology/applications/development/pgvector.md
new file mode 100644
index 0000000..0344fc0
--- /dev/null
+++ b/technology/applications/development/pgvector.md
@@ -0,0 +1,99 @@
+---
+obj: application
+repo: https://github.com/pgvector/pgvector
+rev: 2024-09-30
+---
+
+# pgvector
+**pgvector** is a [PostgreSQL](./Postgres.md) extension for vector similarity search. Machine learning models in natural language processing (NLP), computer vision, and recommendation systems represent items as high-dimensional vectors (embeddings), which need to be stored and queried efficiently. pgvector lets PostgreSQL store these vectors and search for similar items using vector distance metrics directly in SQL.
+
+## Installation
+1. Build and install pgvector from source using `git` and `make`:
+    ```bash
+    git clone https://github.com/pgvector/pgvector.git
+    cd pgvector
+    make && make install  # may need sudo
+    ```
+
+2. Enable the extension in each PostgreSQL database that should use it:
+    ```sql
+    CREATE EXTENSION IF NOT EXISTS vector;
+    ```
+
+## Data Types
+pgvector introduces a new data type called `vector`. It stores fixed-length vectors; the number of dimensions is given in the column definition.
+
+```sql
+CREATE TABLE items (
+    id serial PRIMARY KEY,
+    embedding vector(3) -- a 3-dimensional vector
+);
+```
+
+## Functions and Operators
+pgvector provides several functions and operators for calculating the distance between vectors. Each operator below returns a distance, so sorting in ascending order returns the most similar rows first.
+
+### Distance Metrics
+
+- **Euclidean Distance** (`<->`): Measures the straight-line (L2) distance between two vectors.
+
+  ```sql
+  SELECT * FROM items ORDER BY embedding <-> '[1, 0, 0]' LIMIT 5;
+  ```
+
+- **Cosine Distance** (`<=>`): Returns 1 minus the cosine of the angle between two vectors (the cosine similarity), so smaller values mean more similar directions.
+
+  ```sql
+  SELECT * FROM items ORDER BY embedding <=> '[1, 0, 0]' LIMIT 5;
+  ```
+
+- **Inner Product** (`<#>`): Returns the *negative* inner (dot) product of two vectors, so an ascending sort returns the rows with the largest inner product first.
+
+  ```sql
+  SELECT * FROM items ORDER BY embedding <#> '[1, 0, 0]' LIMIT 5;
+  ```
+
+### Basic Operations
+
+- **Insert a Vector**:
+
+  ```sql
+  INSERT INTO items (embedding) VALUES ('[1, 0, 0]');
+  ```
+
+- **Retrieve All Vectors**:
+
+  ```sql
+  SELECT * FROM items;
+  ```
+
+## Indexing
+To speed up similarity search, pgvector supports approximate indexes. The examples below use an IVFFlat index; the operator class must match the distance operator used in your queries:
+
+- **Euclidean Distance** (L2):
+
+  ```sql
+  CREATE INDEX ON items USING ivfflat (embedding vector_l2_ops) WITH (lists = 100);
+  ```
+
+- **Cosine Distance**:
+
+  ```sql
+  CREATE INDEX ON items USING ivfflat (embedding vector_cosine_ops) WITH (lists = 100);
+  ```
+
+- **Inner Product**:
+
+  ```sql
+  CREATE INDEX ON items USING ivfflat (embedding vector_ip_ops) WITH (lists = 100);
+  ```
+
+### Index Parameters
+- **Lists**: Sets the number of lists (clusters, each represented by a centroid) into which the IVFFlat (inverted file) index partitions the vectors. More lists means fewer vectors are scanned per probed list, which speeds up queries but can lower recall unless the number of lists probed at query time (`ivfflat.probes`) is raised as well.
+
+## Use Cases
+
+1. **Recommendation Systems**: Store user and item embeddings and use similarity search to recommend items based on user preferences.
+2. **Search Engines**: Search for semantically similar documents or images using vector embeddings.
+3. **NLP Applications**: Store word, sentence, or document embeddings to perform similarity search or clustering of textual data.
+4. **Image Recognition**: Query for similar images based on embeddings generated by deep learning models.