Full-Text Search

This document introduces the fundamentals of full-text search, explains how it works in YMatrix databases, and provides examples.

Background

Although text search operators have existed in databases for many years—PostgreSQL, for example, provides operators such as LIKE, ILIKE, ~, and ~~* for text data types—they lack many essential features required by modern information systems:

  • Limited language support (even for English). Relying solely on regular expressions (with wildcards for digits, letters, symbols, etc., such as .*) is insufficient because they cannot easily handle word variations, such as run and running. If you search for run using regular expressions, you are likely to miss documents containing running.
  • No result ranking. This makes them ineffective when thousands of documents match a query.
  • Poor performance due to lack of indexing. They must scan all documents for each search, making them slow.

To address these issues, full-text search was developed.

Concepts

  • Full-Text Search and Full-Text Indexing: Full-text search (or text search) enables the identification of natural language documents that satisfy a query condition and ranks results by relevance. It is also an effective method for handling unstructured data—data with variable length or no fixed format, such as emails or Word documents. The core idea is to transform unstructured data into a structured form by extracting and reorganizing parts of the data. This restructured information is called a full-text index, which is then used to accelerate searches.
  • Document: A document is the unit of data searched in a full-text search system. For example, a magazine article or an email message. A text search engine must parse the document and store associations between lexemes and their parent documents. These associations are used to identify documents containing query terms.

Features

Full-text search offers the following advantages:

  • Higher accuracy: Full-text search delivers more accurate results than traditional search algorithms by examining every word in each stored document to match search criteria.
  • Improved search efficiency: It improves performance by separating the process into two phases: indexing and searching. During indexing, all document texts are scanned to build a list of searchable terms.
  • Enhanced user experience: By returning more relevant results, full-text search improves user experience and reveals more meaningful information than simple substring searches.
  • Versatility: Full-text search is widely used across domains, including websites, content management systems, and document management systems.

Principles

Full-text search requires building a full-text index, which allows documents to be preprocessed and stored in an optimized format for fast searching. Preprocessing involves:

  1. Parsing documents into tokens. Identifying different token types—such as numbers, words, compound words, or email addresses—allows them to be processed differently.
  2. Converting tokens into lexemes (or keywords). A lexeme is a normalized string. Normalization includes converting uppercase letters to lowercase and removing word suffixes (e.g., ing or ed). This enables matching of word variations. In short, tokens are raw fragments of text, while lexemes are the normalized forms used for indexing and searching.
  3. Storing preprocessed documents for optimized search. Each preprocessed document contains an ordered array of normalized lexemes along with positional information for ranking. Documents where query terms appear in closer proximity receive higher relevance scores and are ranked higher in results.

In addition to indexing, full-text search relies on two data types: tsvector and tsquery:

  • tsvector stores preprocessed documents.
  • tsquery represents processed queries.

Various functions and operators support these data types. The most important is the match operator @@.

Examples

Note!
MARS3 and MARS2 tables do not support full-text search.

First, create a HEAP table named articles with columns id, title, and content.

=# CREATE TABLE articles (
id int,
title text,
content text
)
DISTRIBUTED BY (id);

Insert 10 test records.

=# INSERT INTO articles (id, title, content) VALUES  
(1, 'Database Administration', 'Database administration is the process of organizing, storing, maintaining, and retrieving information from computer databases.'),  
(2, 'Data Analysis', 'Data analysis is the process of using statistical and computational tools to extract useful information from raw data.'),  
(3, 'Machine Learning', 'Machine learning is a form of artificial intelligence (AI) that enables computers to learn without being explicitly programmed.'),  
(4, 'Deep Learning', 'Deep learning is a method of machine learning that relies on artificial neural networks to model decision-making.'),  
(5, 'Natural Language Processing', 'Natural Language Processing (NLP) is a technology that enables computers to understand, interpret, and generate human language.'),  
(6, 'Computer Vision', 'Computer vision is a field of study that enables computers to gain high-level understanding from images or multidimensional data.'),  
(7, 'Software Engineering', 'Software engineering is an applied science that uses systematic methods to design, develop, and test software.'),  
(8, 'Cybersecurity', 'Cybersecurity involves protecting computer systems and their data from unauthorized access or damage.'),  
(9, 'Cloud Computing', 'Cloud computing is a model for delivering on-demand computing resources and data over the internet.'),  
(10, 'Internet of Things', 'The Internet of Things (IoT) is a system that connects physical devices over a network to exchange data.');

Fuzzy Search (Non Full-Text Search) Example

=# SELECT id, title, content
FROM articles
WHERE content like '%Learning%';
 id |  title           |                                content
----+------------------+------------------------------------------------------------------------
3   | Machine Learning | Machine learning is a form of artificial intelligence (AI) that enables computers to learn without being explicitly programmed.  
4   | Deep Learning    | Deep learning is a machine learning approach that relies on artificial neural networks to model decision-making.
(2 rows)

Full-Text Search Example

Two functions are required for full-text search:

  • to_tsquery: Converts text into tsquery, a processed query. Words must be combined using valid tsquery operators.
  • to_tsvector: Converts text into tsvector, a preprocessed document containing positional information for each word.

Create a full-text search index.

=# CREATE INDEX articles_content_gin ON articles USING gin(to_tsvector('simple', content));

Use this index for full-text search. The @@ operator determines whether two items match. Order does not matter. It returns TRUE if matched, FALSE otherwise.

=# SELECT id, title, content
FROM articles
WHERE to_tsvector('simple', content) @@ to_tsquery('simple', 'Deep learning is a machine learning approach');
 id |   title          |                                content
----+------------------+------------------------------------------------------------------------
4   | Deep Learning    | Deep learning is a machine learning approach that relies on artificial neural networks to model decision-making.
(1 row)

This full-text search query performs the following:

  1. The SELECT id, title, content FROM articles part is standard SQL, retrieving the id, title, and content fields from the articles table.
  2. The WHERE to_tsvector(...) @@ to_tsquery(...) clause applies full-text search to find documents in the content field that match the query. Note: Since this example uses Chinese text, the simple configuration is used, which performs no language-specific processing. For English content, use english.
  3. The to_tsvector('simple', content) function converts the content field into a tsvector (a lexeme vector) for full-text search.
  4. The to_tsquery('simple', 'Deep learning is a machine learning approach') function converts the query string into a tsquery (a search vector).
  5. The @@ operator matches the tsvector and tsquery. It returns TRUE if the content field contains the query terms, FALSE otherwise.

You can view the query plan to verify that the query uses the full-text index.

=# EXPLAIN SELECT id, title, content
FROM articles
WHERE to_tsvector('simple', content) @@ to_tsquery('simple', 'machine');
                                          QUERY PLAN
----------------------------------------------------------------------------------------------
 Gather Motion 4:1  (slice1; segments: 4)  (cost=13.26..52.13 rows=167 width=68)
   ->  Bitmap Heap Scan on articles  (cost=13.26..50.04 rows=42 width=68)
         Recheck Cond: (to_tsvector('simple'::regconfig, content) @@ '''machine'''::tsquery)
         ->  Bitmap Index Scan on articles_content_gin  (cost=0.00..13.25 rows=42 width=0)
               Index Cond: (to_tsvector('simple'::regconfig, content) @@ '''machine'''::tsquery)
 Optimizer: Postgres query optimizer
(6 rows)

The query plan shows Bitmap Index Scan on articles_content_gin, confirming that the index articles_content_gin is used.

In summary, this query performs a full-text search on the articles table using the full-text index. It selects all rows where the content field contains the term machine, and returns the id, title, and content fields of the matching rows.