---
obj: application
repo: https://github.com/pgvector/pgvector
rev: 2024-09-30
---
# pgvector
**pgvector** is a [PostgreSQL](./Postgres.md) extension designed to support vector similarity search. With the rise of machine learning models like those in natural language processing (NLP), computer vision, and recommendation systems, the need to efficiently store and query high-dimensional vectors (embeddings) has grown significantly. pgvector provides a solution by enabling PostgreSQL to handle these vector operations, making it possible to search for similar items using vector distance metrics directly in SQL.
## Installation
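1. Build and install pgvector from source using `git` and `make` (this needs the PostgreSQL development files, and `make install` may need `sudo`):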
```bash
git clone https://github.com/pgvector/pgvector.git
cd pgvector
make && make install
```
2. Add the extension to your PostgreSQL database:
```sql
CREATE EXTENSION IF NOT EXISTS vector;
```
## Data Types
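pgvector introduces a new data type called `vector` for storing fixed-length vectors. The number of dimensions is normally declared in the column definition, as in `vector(3)`. A column with a declared dimension rejects vectors of any other length, and the dimension must be declared before the column can be indexed.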
```sql
CREATE TABLE items (
id serial PRIMARY KEY,
embedding vector(3) -- a 3-dimensional vector
);
```
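Vectors are passed to PostgreSQL as text literals such as `'[1, 0, 0]'`, as in the queries on this page. A minimal sketch of building such a literal on the client side, assuming plain Python lists as input; the helper name `to_vector_literal` is hypothetical and not part of pgvector:

```python
def to_vector_literal(values, dim):
    """Format a sequence of numbers as a pgvector text literal, e.g. '[1.0,0.0,0.0]'.

    `dim` mirrors the dimension declared on the column (vector(3) above), so a
    wrong-sized vector is caught before it ever reaches the database.
    """
    if len(values) != dim:
        raise ValueError(f"expected {dim} dimensions, got {len(values)}")
    return "[" + ",".join(str(float(v)) for v in values) + "]"


literal = to_vector_literal([1, 0, 0], dim=3)
print(literal)  # [1.0,0.0,0.0]
```

The resulting string would be bound as a query parameter by whatever database driver is in use, rather than concatenated into the SQL text.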
## Functions and Operators
pgvector provides several functions and operators for vector similarity and distance calculation.
### Distance Metrics
- **Euclidean Distance** (`<->`): Measures the straight-line distance between two vectors.
```sql
SELECT * FROM items ORDER BY embedding <-> '[1, 0, 0]' LIMIT 5;
```
- **Cosine Distance** (`<=>`): Returns `1 - cosine similarity`, so smaller values mean the vectors point in more similar directions.
```sql
SELECT * FROM items ORDER BY embedding <=> '[1, 0, 0]' LIMIT 5;
```
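- **Inner Product** (`<#>`): Returns the *negative* inner product (dot product), so that an ascending `ORDER BY` puts the largest inner product first. Multiply the result by `-1` to get the actual inner product.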
```sql
SELECT * FROM items ORDER BY embedding <#> '[1, 0, 0]' LIMIT 5;
```
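To make the return values of the three operators concrete, here is a plain-Python sketch of the underlying formulas. It illustrates the math only and is not pgvector's implementation; the function names are chosen for this example.

```python
import math


def l2_distance(a, b):
    """Value of `a <-> b`: straight-line (Euclidean) distance."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    """Value of `a <=> b`: 1 minus the cosine similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1 - dot / (norm_a * norm_b)


def negative_inner_product(a, b):
    """Value of `a <#> b`: the dot product with its sign flipped."""
    return -sum(x * y for x, y in zip(a, b))


query = [1, 0, 0]
print(l2_distance([0, 3, 4], [0, 0, 0]))       # 5.0
print(cosine_distance([0, 1, 0], query))        # 1.0 (orthogonal vectors)
print(negative_inner_product([1, 2, 3], query)) # -1
```

In all three cases a smaller value means a closer match, which is why each query above sorts in ascending order.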
### Basic Operations
- **Insert a Vector**:
```sql
INSERT INTO items (embedding) VALUES ('[1, 0, 0]');
```
- **Retrieve All Vectors**:
```sql
SELECT * FROM items;
```
## Indexing
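By default pgvector performs an exact nearest-neighbor search by scanning every row. To speed up similarity search it supports approximate indexes such as IVFFlat, which trade some recall for speed. The operator class of the index must match the distance operator used in your queries: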
- **Euclidean Distance** (L2):
```sql
CREATE INDEX ON items USING ivfflat (embedding vector_l2_ops) WITH (lists = 100);
```
- **Cosine Distance**:
```sql
CREATE INDEX ON items USING ivfflat (embedding vector_cosine_ops) WITH (lists = 100);
```
- **Inner Product**:
```sql
CREATE INDEX ON items USING ivfflat (embedding vector_ip_ops) WITH (lists = 100);
```
### Index Parameters
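- **Lists** (`lists`): The number of lists (clusters, each represented by a centroid) that the IVFFlat (Inverted File, flat) index partitions the vectors into. More lists mean fewer vectors per list, so each query scans less data and runs faster, but recall can drop because a true neighbor may sit in a list that is not searched. Recall can be raised at query time by searching more lists, for example `SET ivfflat.probes = 10;` (the default is 1), at the cost of speed. The pgvector README suggests starting with `rows / 1000` lists for up to 1M rows and `sqrt(rows)` above that, and creating the index only after the table contains some data.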
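
The trade-off behind `lists` can be seen in a toy inverted-file search. The sketch below is a simplified illustration in plain Python, not pgvector's code: the centroids are hand-picked (pgvector derives them with k-means), and the `probes` argument plays the role of the number of lists scanned per query.

```python
import math


def l2(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def build_lists(vectors, centroids):
    """Assign each vector (by index) to the list of its nearest centroid."""
    lists = [[] for _ in centroids]
    for i, v in enumerate(vectors):
        nearest = min(range(len(centroids)), key=lambda c: l2(v, centroids[c]))
        lists[nearest].append(i)
    return lists


def ivf_search(query, vectors, centroids, lists, probes, k):
    """Scan only the `probes` lists whose centroids are closest to the query."""
    order = sorted(range(len(centroids)), key=lambda c: l2(query, centroids[c]))
    candidates = [i for c in order[:probes] for i in lists[c]]
    return sorted(candidates, key=lambda i: l2(query, vectors[i]))[:k]


vectors = [[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9], [9.0, 0.0], [2.6, 2.6]]
centroids = [[0.0, 0.0], [5.0, 5.0], [9.0, 0.0]]  # assumption: hand-picked, not k-means
lists = build_lists(vectors, centroids)

query = [2.4, 2.4]
approx = ivf_search(query, vectors, centroids, lists, probes=1, k=2)
exact = ivf_search(query, vectors, centroids, lists, probes=len(centroids), k=2)
print(approx)  # [1, 0]
print(exact)   # [5, 1]
```

With one probe the search misses vector 5, the true nearest neighbor, because it was assigned to a different list than the one closest to the query. Probing every list recovers the exact result but scans all vectors, which is the recall-versus-speed balance that `lists` and the number of probes control.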
## Use Cases
1. **Recommendation Systems**: Store user and item embeddings and use similarity search to recommend items based on user preferences.
2. **Search Engines**: Search for semantically similar documents or images using vector embeddings.
3. **NLP Applications**: Store word, sentence, or document embeddings to perform similarity search or clustering of textual data.
4. **Image Recognition**: Query for similar images based on embeddings generated by deep learning models.