Full text search

This document introduces the basics of full-text search, how it works in YMatrix databases, and give examples.

background

Although the text search operator* has been in databases for many years, such as PostgreSQL provides ~, `~,LIKEandILIKE` operators for text data types, they actually lack many of the basic properties required by modern information systems:

  • Lack of full support for languages ​​​​(even for English). It is not enough to just use regular expressions (including numbers, letters, symbolic wildcards, etc., such as ^[a-zA-Z0-9_-]{3,15}$, because they cannot easily handle derivative words such as satisfies and satisfy. If you search for satisfy using regular expressions, you will most likely miss the document containing satisfies.
  • They do not provide sorting of search results, which makes them invalid when facing thousands of found documents.
  • They are slow because there is no index support, so they have to process all the documents for each search.

In order to solve the above problems, a full-text search appeared.

concept

  • Full-text search and full-text index: Full-text search (or text search) provides the ability to identify natural language documents that meet query conditions, and can sort query results by relevance. It is also an effective way to deal with unstructured data. Unstructed data referers to data with uncertain lengths or no fixed formats, such as emails, Word documents, etc. The basic idea is to make unstructured data structured, that is, extract part of the information in the unstructured data, reorganize it, and then search these data with certain structures to achieve the purpose of relatively fast search. This part of the information extracted from the unstructured data and then reorganized is called the full text index.
  • Document: A document is a unit that searches in a full-text search system. For example, a magazine article or email message. Text search engines must be able to parse documents and store associations between word positions and their parent documents, which are then used to search for documents containing search terms.

Features

Full-text search has the following characteristics:

  • Higher accuracy: Full-text search provides higher accuracy than traditional search algorithms. It checks all words in each stored document, trying to match the search criteria.
  • Better search efficiency: Full-text search improves search efficiency by dividing the problem into two tasks: indexing and search. The indexing phase scans the text of all documents and creates a list of search terms.
  • Stronger User Experience: Full-text search enhances the user experience by providing more relevant search results. It can reveal more information than simple abstract searches.
  • Multifunctionality: Full-text search has applications in many fields, including websites, content management systems and document management.

Principle

Full-text search requires a full-text index, which allows the document to be preprocessed and saves an index, which facilitates quick search later. Preprocessing includes:

  1. Parse the document into a mark. It is helpful to identify multiple types of marks, such as numbers, words, complex words, email addresses, so that they can be processed in different ways;

  2. Convert the mark to a word position (or keyword). A word position is a normalized string. Normalization includes converting capital letters to lowercase, removing English word suffixes (such as s or es), etc. This processing allows the user to search for a complete variation of the same word. In short, notation is the original fragment of document text, while lexical position is the words considered useful for indexing and searching;3. Storage preprocessed documents for optimization of searches. Each preprocessed document contains an ordered array of normalized word position and position information for appropriate ranking. Such a document containing a more "density" area of ​​​​the query term has a higher ranking than a document containing a scattered query term, which means it will be "preferred to display".

In addition to indexing, full-text search is based on two data types: tsvector and tsquery:

  • tsvector is used to store preprocessed documents.
  • tsquery is used to represent processed queries.

There are many functions and operators available for these data types, the most important of which is the matching operator @@.

Example

Notice! Both MARS3 and MARS2 tables do not support full-text search.

First, create a HEAP table called articles with columns id, title, and content.

=# CREATE TABLE articles (
id int,
title text,
content text
)
DISTRIBUTED BY (id);

Insert 10 test data.

=# INSERT INTO articles (id, title, content) VALUES
(1, ‘Database Management’, ‘Database management is the process of organizing, storing, maintaining, and retrieving information from computer databases.’),
(2, ‘Data Analysis’, ‘Data analysis is the process of extracting useful information from raw data using statistical and computational tools.’),
(3, ‘Machine Learning’, ' Machine learning is a form of artificial intelligence (AI) that allows computers to learn without explicit programming.'),
(4, ‘Deep Learning’, ‘Deep learning is a machine learning method that relies on artificial neural networks for modeling decisions.’),
(5, ‘Natural Language Processing’, ‘Natural language processing is a technology that enables computers to understand, interpret, and generate human language.’),
(6, ‘Computer Vision’, ' Computer vision is the science of enabling computers to gain advanced understanding from images or multidimensional data.'),
(7, ‘Software Engineering’, ‘Software engineering is the applied science of designing, developing, and testing software using systematic methods.’),
(8, ‘Cybersecurity’, ‘Cybersecurity involves protecting computer systems and their data from unauthorized access or damage.’),
(9, ‘Cloud Computing’, ' Cloud computing is a model for delivering on-demand computing resources and data over the internet.'),
(10, ‘Internet of Things’, ‘The Internet of Things is a system that enables physical devices to connect to networks and exchange data.’);

Fuzzy search (non-full text search) example

=# SELECT id, title, content
FROM articles
WHERE content like '%机器学习%';
 id |  title   |                                content
----+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  3 | Machine Learning | Machine learning is a form of artificial intelligence (AI) that allows computers to learn without explicit programming.
  4 | Deep Learning | Deep learning is a method of machine learning that relies on artificial neural networks for modeling decisions.
(2 rows)

Full text search example

To implement full-text search, two functions must be used:

  • to_tsquery ( [ config regconfig, ] query text ): The to_tsquery() function converts text into tsquery, that is, processed queries, and words must be combined by valid tsquery operators.
  • to_tsvector ( [ config regconfig, ] document text ): to_tsvector() function converts text into tsvector, that is, preprocessed documents, and words contains location information.

Create a full-text search index.

=# CREATE INDEX articles_content_gin ON articles USING gin(to_tsvector('simple', content));

Then use this index to search the full text. @@ is a text search operator used to determine whether the two items before and after match. The order of the items has no effect. The results returns TRUE or FALSE, that is, a match or a mismatch.

=# SELECT id, title, content
FROM articles
WHERE to_tsvector('simple', content) @@ to_tsquery('simple', '深度学习是一种机器学习的方法');
 id |  title   |                                content
----+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  4 | Deep Learning | Deep learning is a method of machine learning that relies on artificial neural networks for modeling decisions.
(1 row)

This is a full-text search query that does the following:

  1. SELECT id, title, content FROM articles This part is a standard SQL query. It selects the three fields id, title, and content from the articles table;

  2. WHERE to_tsvector('simple', content) @@ to_tsquery('simple', 'Deep learning is a method of machine learning') This part is a condition for full-text search. Here, YMatrix's full-text search function is used to find documents that contain Deep Learning is a machine learning method in the content field. Note: Since the content of this example is in Chinese, it requires the simple configuration, which will not perform any language-specific processing. If your document content is in English, you can use english;

  3. to_tsvector('simple', content) This function converts the text of the content field into a vector (tsvector) for full text search;

  4. to_tsquery('simple', 'Deep learning is a method of machine learning') This function converts the text of a query Deep learning is a method of machine learning into a query vector (tsquery);

  5. @@ is an operator that matches tsvector and tsquery, and returns true (TRUE) if the text in the content field contains Deep learning is a machine learning method, otherwise returns false (FALSE).

You can check the query's [Query Plan] (/doc/latest/performance_tuning/query_plan) to verify that the query uses the full-text index we created.

=# EXPLAIN SELECT id, title, content
FROM articles
WHERE to_tsvector('simple', content) @@ to_tsquery('simple', 'machine');
                                          QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Gather Motion 4:1  (slice1; segments: 4)  (cost=13.26..52.13 rows=167 width=68)
   ->  Bitmap Heap Scan on articles  (cost=13.26..50.04 rows=42 width=68)
         Recheck Cond: (to_tsvector('simple'::regconfig, content) @@ '''machine'''::tsquery)
         ->  Bitmap Index Scan on articles_content_gin  (cost=0.00..13.25 rows=42 width=0)
               Index Cond: (to_tsvector('simple'::regconfig, content) @@ '''machine'''::tsquery)
 Optimizer: Postgres query optimizer
(6 rows)

You can see that Bitmap Index Scan on articles_content_gin appears in the query plan, using the articles_content_gin index we created.

In summary, this query uses the full-text index to search the articles table for a full-text search: select the content field containing all records of Deep Learning is a machine learning method, and then return the id, title, and content fields of these records.