pgvector

This document introduces pgvector, a plug-in tool for storing and querying vector data.

1 Description

Function
- Store your vector data;
- Accurate/fuzzy similar nearest neighbor searches;
- Supports L2 distance, internal product, and cosine distance calculation;
- Support IVFlat and HNSW indexes;
- Supports running in any programming language in Postgres client.

2 Get started quickly

Install the extension. Each database needs to be created only once.

=# CREATE EXTENSION vector;

Create a three-dimensional vector column.

=# CREATE TABLE items (id bigserial PRIMARY KEY, embedding vector(3));

Insert vector data.

=# INSERT INTO items (embedding) VALUES ('[1,2,3]'), ('[4,5,6]');

Get the nearest neighbor through L2 distance.

=# SELECT * FROM items ORDER BY embedding <-> '[3,1,2]' LIMIT 5;

It also supports internal product (<#>) and cosine distance (<=>).

Note! <#> returns a negative inner product, because YMatrix only supports sequential index scans for operators.

3 Full features

3.1 Storage

This section shows the use of DDL/DML statements.

Create a new table with vector columns.

=# CREATE TABLE items (id bigserial PRIMARY KEY, embedding vector(3));

Or add a vector column to the existing table.

=# ALTER TABLE items ADD COLUMN embedding vector(3);

Insert vector data.

=# INSERT INTO items (embedding) VALUES ('[1,2,3]'), ('[4,5,6]');

Upsert vector data.

=# INSERT INTO items (id, embedding) VALUES (1, '[1,2,3]'), (2, '[4,5,6]')
   ON CONFLICT (id) DO UPDATE SET embedding = EXCLUDED.embedding;

Update vector data.

=# UPDATE items SET embedding = '[1,2,3]' WHERE id = 1;

Delete vector data.

=# DELETE FROM items WHERE id = 1;

3.2 Query

This section shows the use of query SQL statements.

Get the nearest neighbor of a vector data.

=# SELECT * FROM items ORDER BY embedding <-> '[3,1,2]' LIMIT 5;

Get the nearest neighbor of a row.

=# SELECT * FROM items WHERE id != 1 ORDER BY embedding <-> (SELECT embedding FROM items WHERE id = 1) LIMIT 5;

Gets rows containing a specific distance.

=# SELECT * FROM items WHERE embedding <-> '[3,1,2]' < 5;

Notice! The query contains both ORDER BY and LIMIT clauses to use the index.

3.2.1 Distance

Get distance.

=# SELECT embedding <-> '[3,1,2]' AS distance FROM items;

For the inner product, it needs to be multiplied by -1 (because <#> returns a negative inner product).

=# SELECT (embedding <#> '[3,1,2]') * -1 AS inner_product FROM items;

For cosine similarity, 1 - cosine distance is required.

=# SELECT 1 - (embedding <=> '[3,1,2]') AS cosine_similarity FROM items;

3.2.2 Gathering

Average vector.

=# SELECT AVG(embedding) FROM items;

Average vector set.

=# SELECT id, AVG(embedding) FROM items GROUP BY id;

3.3 Index

By default, pgvector performs precise nearest neighbor searches, providing perfect recall.

You can add an index to use approximately nearest neighbor searches, which sacrifices some recall in exchange for speed. Unlike traditional indexes, after adding a fuzzy index, you will see multiple different results returned by the query.

The supported index types are as follows:

IVFFlat
HNSW

3.3.1 IVFlat

The IVFFlat index is a cluster index, which divides the high-dimensional space into multiple clusters and then searches for a subset of those clusters closest to the query vector. It has faster build time and less memory consumption than HNSW, but query performance is also relatively lower (in terms of speed versus recall tradeoffs).

Three keys to receiving a good recall:

Create an index after the data table has some data
Choose a suitable number of lists - A good starting value for 1 million rows is the number of rows/1000, and if more than 1 million rows are the number of rows squared.
Specify a suitable number of probes in the query (higher values are beneficial for recall, lower values are beneficial for speed) - A good starting value is the list number squared

Add an IVFFlat index for each distance function you want to use.

L2 distance.

=# CREATE INDEX ON items USING ivfflat (embedding vector_l2_ops) WITH (lists = 100);

Inner accumulation.

=# CREATE INDEX ON items USING ivfflat (embedding vector_ip_ops) WITH (lists = 100);

Cosine distance.

=# CREATE INDEX ON items USING ivfflat (embedding vector_cosine_ops) WITH (lists = 100);

Vector data up to 2000 dimensions can be indexed.

3.3.1.1 Query Options

Specifies the number of probes (the default is 1).

=# SET ivfflat.probes = 10;

Higher value of this parameter provides better recall at the expense of speed and can be set to the exact number of lists for nearest neighbor searches (at this time the optimizer does not use indexes).

Use the SET LOCAL clause in a transaction to set it to a single query.

=# BEGIN;
SET LOCAL ivfflat.probes = 10;
SELECT ...
COMMIT;

3.3.2 HNSW

HNSW index belongs to graph index. An HNSW index will create a multi-layer graph. The bottom layer contains all node vectors. The upper layer contains the fewer nodes. It slows down build time and uses more memory resources compared to IVFFlat, but also has better query performance (in terms of speed versus recall tradeoffs). It does not have training steps like IVFFlat, so it is possible to create an index When there is no data in the table.

Add an HNSW index for each distance function you want to use.

L2 distance.

=# CREATE INDEX ON items USING hnsw (embedding vector_l2_ops);

Inner accumulation.

=# CREATE INDEX ON items USING hnsw (embedding vector_ip_ops);

Cosine distance.

=# CREATE INDEX ON items USING hnsw (embedding vector_cosine_ops);

Vector data up to 2000 dimensions can be indexed.

3.3.2.1 Index Options

Specify the HNSW parameter.

m - Maximum number of connections per layer (default is 16)
ef_construction - The size of the dynamic candidate list used to construct the graph (the default is 64)
```
=# CREATE INDEX ON items USING hnsw (embedding vector_l2_ops) WITH (m = 16, ef_construction = 64);
```
3.3.2.2 Query Options

Specifies the dynamic candidate list size for the search (default is 40).

=# SET hnsw.ef_search = 100;

A higher value of this parameter will increase better recall at the expense of speed.

Use the SET LOCAL clause in a transaction to set it to a single query.

=# BEGIN;
SET LOCAL hnsw.ef_search = 100;
SELECT ...
COMMIT;

3.3.3 Index progress

Check the index progress.

=# SELECT phase, tuples_done, tuples_total FROM pg_stat_progress_create_index;

These stages are:

initializing
performance k-means - IVFFlat only
assigning tuples - IVFFlat only
loading tuples

Notice! tuples_done and tuples_total can only be populated during the loading tuples phase.

3.4 Filter

There are many ways to index nearest neighbor queries using the WHERE clause.

=# SELECT * FROM items WHERE id = 123 ORDER BY embedding <-> '[3,1,2]' LIMIT 5;

Create an index on one or more WHERE columns for accurate search.

=# CREATE INDEX ON items (id);

Or create a local index on the vector column used for fuzzy search.

=# CREATE INDEX ON items USING ivfflat (embedding vector_l2_ops) WITH (lists = 100)
    WHERE (id = 123);

Use partitions to fuzzy searches for many different values of the WHERE column.

=# CREATE TABLE items (embedding vector(3), id int) PARTITION BY LIST(id);

4 Examples

4.1 Python

import psycopg2


def connect_to_database():
    # Connect to the database
    try:
        conn = psycopg2.connect(
            database="your_database_name",
            user="your_database_user",
            password="your_database_password",
            host="your_database_host",
            port="your_database_port"
        )
        return conn
    except psycopg2.Error as e:
        print(f"Error connecting to the database: {e}")
        return None


def close_database_connection(conn):
    # Close the database connection
    if conn:
        conn.close()


def create_extension(conn, extension_name):
    try:
        cursor = conn.cursor()
        # Perform an insert operation
        cursor.execute('create extension if not exists %s ;' % extension_name)
        conn.commit()
        cursor.close()
        print("extension %s has been created successfully." % extension_name)
    except psycopg2.Error as e:
        print(f"Error create extension: {e}")


def create_table(conn, tablename, sql_tabledef):
    try:
        cursor = conn.cursor()
        # Create a vector table
        cursor.execute(sql_tabledef)
        conn.commit()
        cursor.close
        print('table %s has been created successfully.' % tablename)
    except psycopg2.Error as e:
        print(f"Error inserting data: {e}")


def insert_data(conn, tablename, data):
    try:
        cursor = conn.cursor()
        # Perform an insert operation
        cursor.execute("INSERT INTO %s VALUES ('%s',%s, %s,'%s')" % (
            tablename, data['value1'], data['value2'], data['value3'], data['value4']))
        conn.commit()
        cursor.close()
        print("Data inserted successfully.")
    except psycopg2.Error as e:
        print(f"Error inserting data: {e}")


def select_data(conn, tablename):
    try:
        cursor = conn.cursor()
        # Perform query operations
        cursor.execute("SELECT * FROM %s" % tablename)
        result = cursor.fetchall()
        cursor.close()
        return result
    except psycopg2.Error as e:
        print(f"Error selecting data: {e}")
        return []


def vector_search(conn, tablename, vector_type, nearest_values):
    try:
        cursor = conn.cursor()
        # Perform query operations
        cursor.execute(
            "SELECT * FROM %s order by embedding %s '%s' limit 5;" % (tablename, vector_type, nearest_values))
        result = cursor.fetchall()
        cursor.close()
        return result
    except psycopg2.Error as e:
        print(f"Error selecting data: {e}")
        return []


def update_data(conn, tablename, data):
    try:
        cursor = conn.cursor()
        # Perform update operations
        cursor.execute(
            "UPDATE %s SET embedding = '%s' WHERE id = %s" % (tablename, data['new_value'], data['condition']))
        conn.commit()
        cursor.close()
        print("Data updated successfully.")
    except psycopg2.Error as e:
        print(f"Error updating data: {e}")


def delete_data(conn, tablename, condition):
    try:
        cursor = conn.cursor()
        # Perform a delete operation
        cursor.execute("DELETE FROM %s WHERE id = %s" % (tablename, condition))
        conn.commit()
        cursor.close()
        print("Data deleted successfully.")
    except psycopg2.Error as e:
        print(f"Error deleting data: {e}")


if __name__ == "__main__":
    conn = connect_to_database()
    if conn:
        # Perform database operations here
        data_to_insert = {
            'value1': '2023-10-12 00:00:00',
            'value2': 1,
            'value3': 1,
            'value4': [1, 2, 3]

        }

        # Create vector extension
        create_extension(conn, 'vector')

        # Create a vector table
        sql_tabledef = 'drop table if exists documents_l2;' \
                       'CREATE TABLE documents_l2(' \
                       'created_at timestamptz,' \
                       'id integer,' \
                       'document_type int,' \
                       'embedding vector(3)' \
                       ')' \
                       'distributed by (id);';
        create_table(conn, 'documents_l2', sql_tabledef)

        # Write data into the table
        insert_data(conn, 'documents_l2', data_to_insert)

        # Query data
        data_to_select = select_data(conn, 'documents_l2')
        print("Selected Data:", data_to_select)

        # Update data
        data_to_update = {
            'new_value': '[4,5,6]',
            'condition': 1
        }
        update_data(conn, 'documents_l2', data_to_update)
        # Query data
        data_to_select = select_data(conn, 'documents_l2')
        print("Selected Data:", data_to_select)

        # Vector search
        print('向量检索：L2 distance')
        data_vector_search = vector_search(conn, 'documents_l2', '<->', '[4,5,6]')
        print("vector search（L2）:", data_vector_search)

        print('向量检索：inner product')
        data_vector_search = vector_search(conn, 'documents_l2', '<#>', '[4,5,6]')
        print("vector search(IP):", data_vector_search)

        print('向量检索：cosine distance')
        data_vector_search = vector_search(conn, 'documents_l2', '<=>', '[4,5,6]')
        print("vector search(cos):", data_vector_search)

        # Delete data
        data_to_delete = 1
        delete_data(conn, 'documents_l2', data_to_delete)
        # Query data
        data_to_select = select_data(conn, 'documents_l2')
        print("Selected Data:", data_to_select)

        close_database_connection(conn)

4.2 Golang

package main

import (
    "database/sql"
    "fmt"

    _ "github.com/lib/pq"
)

type Data struct {
    Id int32
    Embedding string
}

func getConn() (*sql.DB, error) {
    psqlInfo := fmt.Sprintf("host=%s port=%d user=%s password=%s dbname=%s sslmode=disable", "192.168.100.30", 6432, "mxadmin", "", "mxadmin")

    db, err := sql.Open("postgres", psqlInfo)
    if err != nil {
        return nil, err
    }

    return db, nil
}

func insertData() error {
    db, _ := getConn()
    defer db.Close()

    _, err := db.Exec(`INSERT INTO items (embedding) VALUES ('[1,2,3]'), ('[4,5,6]')`)
    return err
}

func selectData() ([]Data, error) {
    db, _ := getConn()
    defer db.Close()

    datas := make([]Data, 0)
    rows, err := db.Query(`SELECT id, embedding FROM items ORDER BY embedding <=> '[0.45, 0.4, 0.85]' LIMIT 10;`)
    if err != nil {
        return nil, err
    }
    defer rows.Close()

    for rows.Next() {
        data := Data{}
        err = rows.Scan(&data.Id, &data.Embedding)
        if err != nil {
            return nil, err
        }
        datas = append(datas, data)
    }

    return datas, err
}

func main() {
    insertData()
    datas, _ := selectData()
    fmt.Println(datas)
}

简体中文