This document introduces the fundamentals of full-text search, explains how it works in YMatrix databases, and provides examples.
Although text search operators have existed in databases for many years—PostgreSQL, for example, provides operators such as LIKE, ILIKE, ~, and ~~* for text data types—they lack many essential features required by modern information systems:
.*) is insufficient because they cannot easily handle word variations, such as run and running. If you search for run using regular expressions, you are likely to miss documents containing running.To address these issues, full-text search was developed.
Full-text search offers the following advantages:
Full-text search requires building a full-text index, which allows documents to be preprocessed and stored in an optimized format for fast searching. Preprocessing involves:
ing or ed). This enables matching of word variations. In short, tokens are raw fragments of text, while lexemes are the normalized forms used for indexing and searching.In addition to indexing, full-text search relies on two data types: tsvector and tsquery:
tsvector stores preprocessed documents.tsquery represents processed queries.Various functions and operators support these data types. The most important is the match operator @@.
Note!
MARS3 and MARS2 tables do not support full-text search.
First, create a HEAP table named articles with columns id, title, and content.
=# CREATE TABLE articles (
id int,
title text,
content text
)
DISTRIBUTED BY (id);
Insert 10 test records.
=# INSERT INTO articles (id, title, content) VALUES
(1, 'Database Administration', 'Database administration is the process of organizing, storing, maintaining, and retrieving information from computer databases.'),
(2, 'Data Analysis', 'Data analysis is the process of using statistical and computational tools to extract useful information from raw data.'),
(3, 'Machine Learning', 'Machine learning is a form of artificial intelligence (AI) that enables computers to learn without being explicitly programmed.'),
(4, 'Deep Learning', 'Deep learning is a method of machine learning that relies on artificial neural networks to model decision-making.'),
(5, 'Natural Language Processing', 'Natural Language Processing (NLP) is a technology that enables computers to understand, interpret, and generate human language.'),
(6, 'Computer Vision', 'Computer vision is a field of study that enables computers to gain high-level understanding from images or multidimensional data.'),
(7, 'Software Engineering', 'Software engineering is an applied science that uses systematic methods to design, develop, and test software.'),
(8, 'Cybersecurity', 'Cybersecurity involves protecting computer systems and their data from unauthorized access or damage.'),
(9, 'Cloud Computing', 'Cloud computing is a model for delivering on-demand computing resources and data over the internet.'),
(10, 'Internet of Things', 'The Internet of Things (IoT) is a system that connects physical devices over a network to exchange data.');
=# SELECT id, title, content
FROM articles
WHERE content like '%Learning%';
id | title | content
----+------------------+------------------------------------------------------------------------
3 | Machine Learning | Machine learning is a form of artificial intelligence (AI) that enables computers to learn without being explicitly programmed.
4 | Deep Learning | Deep learning is a machine learning approach that relies on artificial neural networks to model decision-making.
(2 rows)
Two functions are required for full-text search:
to_tsquery: Converts text into tsquery, a processed query. Words must be combined using valid tsquery operators.to_tsvector: Converts text into tsvector, a preprocessed document containing positional information for each word.Create a full-text search index.
=# CREATE INDEX articles_content_gin ON articles USING gin(to_tsvector('simple', content));
Use this index for full-text search. The @@ operator determines whether two items match. Order does not matter. It returns TRUE if matched, FALSE otherwise.
=# SELECT id, title, content
FROM articles
WHERE to_tsvector('simple', content) @@ to_tsquery('simple', 'Deep learning is a machine learning approach');
id | title | content
----+------------------+------------------------------------------------------------------------
4 | Deep Learning | Deep learning is a machine learning approach that relies on artificial neural networks to model decision-making.
(1 row)
This full-text search query performs the following:
SELECT id, title, content FROM articles part is standard SQL, retrieving the id, title, and content fields from the articles table.WHERE to_tsvector(...) @@ to_tsquery(...) clause applies full-text search to find documents in the content field that match the query. Note: Since this example uses Chinese text, the simple configuration is used, which performs no language-specific processing. For English content, use english.to_tsvector('simple', content) function converts the content field into a tsvector (a lexeme vector) for full-text search.to_tsquery('simple', 'Deep learning is a machine learning approach') function converts the query string into a tsquery (a search vector).@@ operator matches the tsvector and tsquery. It returns TRUE if the content field contains the query terms, FALSE otherwise.You can view the query plan to verify that the query uses the full-text index.
=# EXPLAIN SELECT id, title, content
FROM articles
WHERE to_tsvector('simple', content) @@ to_tsquery('simple', 'machine');
QUERY PLAN
----------------------------------------------------------------------------------------------
Gather Motion 4:1 (slice1; segments: 4) (cost=13.26..52.13 rows=167 width=68)
-> Bitmap Heap Scan on articles (cost=13.26..50.04 rows=42 width=68)
Recheck Cond: (to_tsvector('simple'::regconfig, content) @@ '''machine'''::tsquery)
-> Bitmap Index Scan on articles_content_gin (cost=0.00..13.25 rows=42 width=0)
Index Cond: (to_tsvector('simple'::regconfig, content) @@ '''machine'''::tsquery)
Optimizer: Postgres query optimizer
(6 rows)
The query plan shows Bitmap Index Scan on articles_content_gin, confirming that the index articles_content_gin is used.
In summary, this query performs a full-text search on the articles table using the full-text index. It selects all rows where the content field contains the term machine, and returns the id, title, and content fields of the matching rows.