简体中文

YMatrix Performance FAQ

This document records common performance optimization-related issues with YMatrix.

1 Java stress test report `Connection timed out`

JDBC query stress test, concurrency 50, the connection pool uses Alibaba's druid.
The response time is stable within 2.5 minutes, and then the response time becomes longer. Master begins to experience the following error: ERROR "failed to acquire resources on one or more segments", "can not connect to server: Connection timed out".
Segment None error/panic log

Problem Analysis

A distributed database will have a large amount of TCP/UDP data transmission, and each transmission will use a different port number. These port numbers or connections are routed at one time (connection conn) in the eyes of the OS.
The system parameter nf_conntrack_max means that the number of routing information that OS can maintain at the maximum at the same time.
Since multiple virtual machines are on a physical machine in this case, the virtual network should use NAT. This way, when a large number of concurrent queries come at the same time, the virtual routing information will increase rapidly, which may cross the nf_conntrack_max threshold in a short time, which will cause the network card to actively discard packages that cannot be processed.

This can also explain the larger the concurrent query, the more likely it is to lose packets.

Solution

Modify kernel parameters:

sudo sysctl net.netfilter.nf_conntrack_buckets=262144
sudo sysctl net.netfilter.nf_conntrack_max=1048576

2 SQL query motion operator takes time

In the query execution plan, it was found that the motion operator was expensive.

Problem Analysis

The business environment may be a cloud environment. The cloud environment is inefficient in UDP processing, which may lead to high UDF latency or high packet loss probability, indirectly leading to a longer UDF data transmission time.
In addition, the user log level may be set to higher levels such as debug5, and the UDP transmission efficiency will also be affected during the process of large amounts of log output.

Solution

Switch TCP interconnect.
Turn on ic proxy.
Adjust log level.

3 The client returns a large amount of data for a long time

Problem Description

It takes a long time to return large amounts of data when using PSQL clients or other clients.

Solution

Set the FETCH_COUNT parameter before querying to limit the number of data rows returned at one time.

postgres=# \set FETCH_COUNT 200
postgres=# SELECT *  FROM test_table;

4 After performing point query operations in the JAVA application layer, a large number of bind binding variables appear in the database log

Problem Analysis

A large number of bind binding variables will take up a lot of performance overhead. You can avoid this overhead by modifying the preferQueryMode parameter. preferQueryModeis a database driver-related parameter that determines the query pattern used when executing queries. Setting preferQueryMode tosimple` usually results in better query performance, because queries are only executed on the node where the data resides and do not require distributed preparation and data transmission.

Solution

Both of the following steps need to be performed:

Add the option preparedThreshold=0&preferQueryMode=simple to the URI.
The JDBC version needs to be upgraded to postgresql-9.4.1210.jar and above. The old version of JDBC does not support the preferQueryMode parameter.

5 How to replace Postgres crash or memory resource occupancy due to the addr2line tool.

Problem Description

After the database cluster is running normally for a period of time, the server has a short resource downtime, resulting in abnormal nodes of the database cluster and unable to connect through ssh

Problem Analysis

By viewing the system message log information, I found that a large number of addr2line processes appeared on the server at a certain point in time, causing server resources to be tight

Aug 29 14:46:17 yonmatrix-01 kernel: [13522.400707] [  96593]  1001 96593    78471    24188   249856        0             0 addr2line
Aug 29 14:46:17 yonmatrix-01 kernel: [13522.400708] [  96594]  1001 96594    85480    28085   290816        0             0 addr2line
Aug 29 14:46:17 yonmatrix-01 kernel: [13522.400709] [  96595]  1001 96595    85480    29534   307200        0             0 addr2line
Aug 29 14:46:17 yonmatrix-01 kernel: [13522.400710] [  96596]  1001 96596    77318    23547   249856        0             0 addr2line
Aug 29 14:46:17 yonmatrix-01 kernel: [13522.400711] [  96597]  1001 96597    85480    29597   294912        0             0 addr2line
Aug 29 14:46:17 yonmatrix-01 kernel: [13522.400712] [  96598]  1001 96598    85480    31518   315392        0             0 addr2line
Aug 29 14:46:17 yonmatrix-01 kernel: [13522.400714] [  96599]  1001 96599    85480    29146   290816        0             0 addr2line
Aug 29 14:46:17 yonmatrix-01 kernel: [13522.400715] [  96600]  1001 96600    77318    23283   253952        0             0 addr2line
Aug 29 14:46:17 yonmatrix-01 kernel: [13522.400716] [  96601]  1001 96601    77318    23322   253952        0             0 addr2line
Aug 29 14:46:17 yonmatrix-01 kernel: [13522.400718] [  96602]  1001 96602    76199    22816   241664        0             0 addr2line
......
......
Aug 29 14:46:17 yonmatrix-01 kernel: [13522.402077] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/system.slice/matrixdb5.supervisor.service,task=addr2line,pid=96124,uid=1001
Aug 29 14:46:17 yonmatrix-01 kernel: [13522.402084] Out of memory: Killed process 96124 (addr2line) total-vm:355284kB, anon-rss:141732kB, file-rss:0kB, shmem-rss:0kB
Aug 29 14:46:17 yonmatrix-01 kernel: [13522.427650] oom_reaper: reaped process 96124 (addr2line), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

Solution

The original addr2line tool on the system can be backed up and replaced with llvm-addr2line, taking Galaxy Kirin V10 SP3 version as an example

Install llvm
```
# yum install llvm -y
```

Backup the original addr2line

# mv /usr/bin/addr2line /usr/bin/addr2line_bak

Create a soft link

# ln -s /usr/bin/llvm-addr2line /usr/bin/addr2line

Verify that the replacement is successful
```
# addr2line --version
```

简体中文

YMatrix Performance FAQ

1 Java stress test report Connection timed out

2 SQL query motion operator takes time

3 The client returns a large amount of data for a long time

4 After performing point query operations in the JAVA application layer, a large number of bind binding variables appear in the database log

5 How to replace Postgres crash or memory resource occupancy due to the addr2line tool.

1 Java stress test report `Connection timed out`