This document introduces pgvector, a plug-in tool for storing and querying vector data.
Install the extension. Each database needs to be created only once.
=# CREATE EXTENSION vector;
Create a three-dimensional vector column.
=# CREATE TABLE items (id bigserial PRIMARY KEY, embedding vector(3));
Insert vector data.
=# INSERT INTO items (embedding) VALUES ('[1,2,3]'), ('[4,5,6]');
Get the nearest neighbor through L2 distance.
=# SELECT * FROM items ORDER BY embedding <-> '[3,1,2]' LIMIT 5;
It also supports internal product (<#>
) and cosine distance (<=>
).
Note!
<#>
returns a negative inner product, because YMatrix only supports sequential index scans for operators.
This section shows the use of DDL/DML statements.
Create a new table with vector columns.
=# CREATE TABLE items (id bigserial PRIMARY KEY, embedding vector(3));
Or add a vector column to the existing table.
=# ALTER TABLE items ADD COLUMN embedding vector(3);
Insert vector data.
=# INSERT INTO items (embedding) VALUES ('[1,2,3]'), ('[4,5,6]');
Upsert vector data.
=# INSERT INTO items (id, embedding) VALUES (1, '[1,2,3]'), (2, '[4,5,6]')
ON CONFLICT (id) DO UPDATE SET embedding = EXCLUDED.embedding;
Update vector data.
=# UPDATE items SET embedding = '[1,2,3]' WHERE id = 1;
Delete vector data.
=# DELETE FROM items WHERE id = 1;
This section shows the use of query SQL statements.
Get the nearest neighbor of a vector data.
=# SELECT * FROM items ORDER BY embedding <-> '[3,1,2]' LIMIT 5;
Get the nearest neighbor of a row.
=# SELECT * FROM items WHERE id != 1 ORDER BY embedding <-> (SELECT embedding FROM items WHERE id = 1) LIMIT 5;
Gets rows containing a specific distance.
=# SELECT * FROM items WHERE embedding <-> '[3,1,2]' < 5;
Notice!
The query contains both ORDER BY and LIMIT
clauses to use the index.
Get distance.
=# SELECT embedding <-> '[3,1,2]' AS distance FROM items;
For the inner product, it needs to be multiplied by -1
(because <#>
returns a negative inner product).
=# SELECT (embedding <#> '[3,1,2]') * -1 AS inner_product FROM items;
For cosine similarity, 1 - cosine distance
is required.
=# SELECT 1 - (embedding <=> '[3,1,2]') AS cosine_similarity FROM items;
Average vector.
=# SELECT AVG(embedding) FROM items;
Average vector set.
=# SELECT id, AVG(embedding) FROM items GROUP BY id;
By default, pgvector performs precise nearest neighbor searches, providing perfect recall.
You can add an index to use approximately nearest neighbor searches, which sacrifices some recall in exchange for speed. Unlike traditional indexes, after adding a fuzzy index, you will see multiple different results returned by the query.
The supported index types are as follows:
The IVFFlat index is a cluster index, which divides the high-dimensional space into multiple clusters and then searches for a subset of those clusters closest to the query vector. It has faster build time and less memory consumption than HNSW, but query performance is also relatively lower (in terms of speed versus recall tradeoffs).
Three keys to receiving a good recall:
Add an IVFFlat index for each distance function you want to use.
L2 distance.
=# CREATE INDEX ON items USING ivfflat (embedding vector_l2_ops) WITH (lists = 100);
Inner accumulation.
=# CREATE INDEX ON items USING ivfflat (embedding vector_ip_ops) WITH (lists = 100);
Cosine distance.
=# CREATE INDEX ON items USING ivfflat (embedding vector_cosine_ops) WITH (lists = 100);
Vector data up to 2000 dimensions can be indexed.
Specifies the number of probes (the default is 1).
=# SET ivfflat.probes = 10;
Higher value of this parameter provides better recall at the expense of speed and can be set to the exact number of lists for nearest neighbor searches (at this time the optimizer does not use indexes).
Use the SET LOCAL
clause in a transaction to set it to a single query.
=# BEGIN;
SET LOCAL ivfflat.probes = 10;
SELECT ...
COMMIT;
HNSW index belongs to graph index. An HNSW index will create a multi-layer graph. The bottom layer contains all node vectors. The upper layer contains the fewer nodes. It slows down build time and uses more memory resources compared to IVFFlat, but also has better query performance (in terms of speed versus recall tradeoffs). It does not have training steps like IVFFlat, so it is possible to create an index When there is no data in the table.
Add an HNSW index for each distance function you want to use.
L2 distance.
=# CREATE INDEX ON items USING hnsw (embedding vector_l2_ops);
Inner accumulation.
=# CREATE INDEX ON items USING hnsw (embedding vector_ip_ops);
Cosine distance.
=# CREATE INDEX ON items USING hnsw (embedding vector_cosine_ops);
Vector data up to 2000 dimensions can be indexed.
Specify the HNSW parameter.
m
- Maximum number of connections per layer (default is 16
)ef_construction
- The size of the dynamic candidate list used to construct the graph (the default is 64
)=# CREATE INDEX ON items USING hnsw (embedding vector_l2_ops) WITH (m = 16, ef_construction = 64);
Specifies the dynamic candidate list size for the search (default is 40).
=# SET hnsw.ef_search = 100;
A higher value of this parameter will increase better recall at the expense of speed.
Use the SET LOCAL
clause in a transaction to set it to a single query.
=# BEGIN;
SET LOCAL hnsw.ef_search = 100;
SELECT ...
COMMIT;
Check the index progress.
=# SELECT phase, tuples_done, tuples_total FROM pg_stat_progress_create_index;
These stages are:
initializing
performance k-means
- IVFFlat onlyassigning tuples
- IVFFlat onlyloading tuples
Notice!
tuples_done
and tuples_total
can only be populated during the loading tuples phase.
There are many ways to index nearest neighbor queries using the WHERE
clause.
=# SELECT * FROM items WHERE id = 123 ORDER BY embedding <-> '[3,1,2]' LIMIT 5;
Create an index on one or more WHERE
columns for accurate search.
=# CREATE INDEX ON items (id);
Or create a local index on the vector column used for fuzzy search.
=# CREATE INDEX ON items USING ivfflat (embedding vector_l2_ops) WITH (lists = 100)
WHERE (id = 123);
Use partitions to fuzzy searches for many different values of the WHERE
column.
=# CREATE TABLE items (embedding vector(3), id int) PARTITION BY LIST(id);
import psycopg2
def connect_to_database():
# Connect to the database
try:
conn = psycopg2.connect(
database="your_database_name",
user="your_database_user",
password="your_database_password",
host="your_database_host",
port="your_database_port"
)
return conn
except psycopg2.Error as e:
print(f"Error connecting to the database: {e}")
return None
def close_database_connection(conn):
# Close the database connection
if conn:
conn.close()
def create_extension(conn, extension_name):
try:
cursor = conn.cursor()
# Perform an insert operation
cursor.execute('create extension if not exists %s ;' % extension_name)
conn.commit()
cursor.close()
print("extension %s has been created successfully." % extension_name)
except psycopg2.Error as e:
print(f"Error create extension: {e}")
def create_table(conn, tablename, sql_tabledef):
try:
cursor = conn.cursor()
# Create a vector table
cursor.execute(sql_tabledef)
conn.commit()
cursor.close
print('table %s has been created successfully.' % tablename)
except psycopg2.Error as e:
print(f"Error inserting data: {e}")
def insert_data(conn, tablename, data):
try:
cursor = conn.cursor()
# Perform an insert operation
cursor.execute("INSERT INTO %s VALUES ('%s',%s, %s,'%s')" % (
tablename, data['value1'], data['value2'], data['value3'], data['value4']))
conn.commit()
cursor.close()
print("Data inserted successfully.")
except psycopg2.Error as e:
print(f"Error inserting data: {e}")
def select_data(conn, tablename):
try:
cursor = conn.cursor()
# Perform query operations
cursor.execute("SELECT * FROM %s" % tablename)
result = cursor.fetchall()
cursor.close()
return result
except psycopg2.Error as e:
print(f"Error selecting data: {e}")
return []
def vector_search(conn, tablename, vector_type, nearest_values):
try:
cursor = conn.cursor()
# Perform query operations
cursor.execute(
"SELECT * FROM %s order by embedding %s '%s' limit 5;" % (tablename, vector_type, nearest_values))
result = cursor.fetchall()
cursor.close()
return result
except psycopg2.Error as e:
print(f"Error selecting data: {e}")
return []
def update_data(conn, tablename, data):
try:
cursor = conn.cursor()
# Perform update operations
cursor.execute(
"UPDATE %s SET embedding = '%s' WHERE id = %s" % (tablename, data['new_value'], data['condition']))
conn.commit()
cursor.close()
print("Data updated successfully.")
except psycopg2.Error as e:
print(f"Error updating data: {e}")
def delete_data(conn, tablename, condition):
try:
cursor = conn.cursor()
# Perform a delete operation
cursor.execute("DELETE FROM %s WHERE id = %s" % (tablename, condition))
conn.commit()
cursor.close()
print("Data deleted successfully.")
except psycopg2.Error as e:
print(f"Error deleting data: {e}")
if __name__ == "__main__":
conn = connect_to_database()
if conn:
# Perform database operations here
data_to_insert = {
'value1': '2023-10-12 00:00:00',
'value2': 1,
'value3': 1,
'value4': [1, 2, 3]
}
# Create vector extension
create_extension(conn, 'vector')
# Create a vector table
sql_tabledef = 'drop table if exists documents_l2;' \
'CREATE TABLE documents_l2(' \
'created_at timestamptz,' \
'id integer,' \
'document_type int,' \
'embedding vector(3)' \
')' \
'distributed by (id);';
create_table(conn, 'documents_l2', sql_tabledef)
# Write data into the table
insert_data(conn, 'documents_l2', data_to_insert)
# Query data
data_to_select = select_data(conn, 'documents_l2')
print("Selected Data:", data_to_select)
# Update data
data_to_update = {
'new_value': '[4,5,6]',
'condition': 1
}
update_data(conn, 'documents_l2', data_to_update)
# Query data
data_to_select = select_data(conn, 'documents_l2')
print("Selected Data:", data_to_select)
# Vector search
print('向量检索:L2 distance')
data_vector_search = vector_search(conn, 'documents_l2', '<->', '[4,5,6]')
print("vector search(L2):", data_vector_search)
print('向量检索:inner product')
data_vector_search = vector_search(conn, 'documents_l2', '<#>', '[4,5,6]')
print("vector search(IP):", data_vector_search)
print('向量检索:cosine distance')
data_vector_search = vector_search(conn, 'documents_l2', '<=>', '[4,5,6]')
print("vector search(cos):", data_vector_search)
# Delete data
data_to_delete = 1
delete_data(conn, 'documents_l2', data_to_delete)
# Query data
data_to_select = select_data(conn, 'documents_l2')
print("Selected Data:", data_to_select)
close_database_connection(conn)
package main
import (
"database/sql"
"fmt"
_ "github.com/lib/pq"
)
type Data struct {
Id int32
Embedding string
}
func getConn() (*sql.DB, error) {
psqlInfo := fmt.Sprintf("host=%s port=%d user=%s password=%s dbname=%s sslmode=disable", "192.168.100.30", 6432, "mxadmin", "", "mxadmin")
db, err := sql.Open("postgres", psqlInfo)
if err != nil {
return nil, err
}
return db, nil
}
func insertData() error {
db, _ := getConn()
defer db.Close()
_, err := db.Exec(`INSERT INTO items (embedding) VALUES ('[1,2,3]'), ('[4,5,6]')`)
return err
}
func selectData() ([]Data, error) {
db, _ := getConn()
defer db.Close()
datas := make([]Data, 0)
rows, err := db.Query(`SELECT id, embedding FROM items ORDER BY embedding <=> '[0.45, 0.4, 0.85]' LIMIT 10;`)
if err != nil {
return nil, err
}
defer rows.Close()
for rows.Next() {
data := Data{}
err = rows.Scan(&data.Id, &data.Embedding)
if err != nil {
return nil, err
}
datas = append(datas, data)
}
return datas, err
}
func main() {
insertData()
datas, _ := selectData()
fmt.Println(datas)
}