YMatrix
Quick Start
Simulate Time Series Scenarios
Standard Cluster Deployment
Data Modeling
Connecting to The database
Data Writing
Data Migration
Data Query
Scene Application Examples
Federal Query
Maintenance and Monitoring
Global Maintenance
Partition Maintenance
Backup and Restore
Cluster Expansion
Monitoring
Performance Tuning
Troubleshooting
Reference Guide
Tool Guide
Data Type
Storage Engine
Execution Engine
Stream Processing
DR (Disaster Recovery)
Configuration Parameters
Index
Extension
SQL Reference
FAQ
This document records common performance optimization-related issues with YMatrix.
Connection timed out
JDBC query stress test, concurrency 50, the connection pool uses Alibaba's druid.
The response time is stable within 2.5 minutes, and then the response time becomes longer. Master begins to experience the following error:
ERROR "failed to acquire resources on one or more segments", "can not connect to server: Connection timed out".
Segment None error/panic log
Problem Analysis
A distributed database will have a large amount of TCP/UDP data transmission, and each transmission will use a different port number. These port numbers or connections are routed at one time (connection conn) in the eyes of the OS.
The system parameter nf_conntrack_max
means that the number of routing information that OS can maintain at the maximum at the same time.
Since multiple virtual machines are on a physical machine in this case, the virtual network should use NAT. This way, when a large number of concurrent queries come at the same time, the virtual routing information will increase rapidly, which may cross the nf_conntrack_max
threshold in a short time, which will cause the network card to actively discard packages that cannot be processed.
This can also explain the larger the concurrent query, the more likely it is to lose packets.
Solution
Modify kernel parameters:
sudo sysctl net.netfilter.nf_conntrack_buckets=262144
sudo sysctl net.netfilter.nf_conntrack_max=1048576
In the query execution plan, it was found that the motion operator was expensive.
Problem Analysis
The business environment may be a cloud environment. The cloud environment is inefficient in UDP processing, which may lead to high UDF latency or high packet loss probability, indirectly leading to a longer UDF data transmission time.
In addition, the user log level may be set to higher levels such as debug5, and the UDP transmission efficiency will also be affected during the process of large amounts of log output.
Solution
Problem Description
It takes a long time to return large amounts of data when using PSQL
clients or other clients.
Solution
Set the FETCH_COUNT
parameter before querying to limit the number of data rows returned at one time.
postgres=# \set FETCH_COUNT 200
postgres=# SELECT * FROM test_table;
Problem Analysis
A large number of bind binding variables will take up a lot of performance overhead. You can avoid this overhead by modifying the preferQueryMode
parameter. preferQueryModeis a database driver-related parameter that determines the query pattern used when executing queries. Setting preferQueryMode to
simple` usually results in better query performance, because queries are only executed on the node where the data resides and do not require distributed preparation and data transmission.
Solution
Both of the following steps need to be performed:
preparedThreshold=0&preferQueryMode=simple
to the URI.preferQueryMode
parameter.Problem Description
After the database cluster is running normally for a period of time, the server has a short resource downtime, resulting in abnormal nodes of the database cluster and unable to connect through ssh
Problem Analysis
By viewing the system message log information, I found that a large number of addr2line processes appeared on the server at a certain point in time, causing server resources to be tight
Aug 29 14:46:17 yonmatrix-01 kernel: [13522.400707] [ 96593] 1001 96593 78471 24188 249856 0 0 addr2line
Aug 29 14:46:17 yonmatrix-01 kernel: [13522.400708] [ 96594] 1001 96594 85480 28085 290816 0 0 addr2line
Aug 29 14:46:17 yonmatrix-01 kernel: [13522.400709] [ 96595] 1001 96595 85480 29534 307200 0 0 addr2line
Aug 29 14:46:17 yonmatrix-01 kernel: [13522.400710] [ 96596] 1001 96596 77318 23547 249856 0 0 addr2line
Aug 29 14:46:17 yonmatrix-01 kernel: [13522.400711] [ 96597] 1001 96597 85480 29597 294912 0 0 addr2line
Aug 29 14:46:17 yonmatrix-01 kernel: [13522.400712] [ 96598] 1001 96598 85480 31518 315392 0 0 addr2line
Aug 29 14:46:17 yonmatrix-01 kernel: [13522.400714] [ 96599] 1001 96599 85480 29146 290816 0 0 addr2line
Aug 29 14:46:17 yonmatrix-01 kernel: [13522.400715] [ 96600] 1001 96600 77318 23283 253952 0 0 addr2line
Aug 29 14:46:17 yonmatrix-01 kernel: [13522.400716] [ 96601] 1001 96601 77318 23322 253952 0 0 addr2line
Aug 29 14:46:17 yonmatrix-01 kernel: [13522.400718] [ 96602] 1001 96602 76199 22816 241664 0 0 addr2line
......
......
Aug 29 14:46:17 yonmatrix-01 kernel: [13522.402077] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/system.slice/matrixdb5.supervisor.service,task=addr2line,pid=96124,uid=1001
Aug 29 14:46:17 yonmatrix-01 kernel: [13522.402084] Out of memory: Killed process 96124 (addr2line) total-vm:355284kB, anon-rss:141732kB, file-rss:0kB, shmem-rss:0kB
Aug 29 14:46:17 yonmatrix-01 kernel: [13522.427650] oom_reaper: reaped process 96124 (addr2line), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
Solution
The original addr2line tool on the system can be backed up and replaced with llvm-addr2line, taking Galaxy Kirin V10 SP3 version as an example
# yum install llvm -y
# mv /usr/bin/addr2line /usr/bin/addr2line_bak
# ln -s /usr/bin/llvm-addr2line /usr/bin/addr2line
# addr2line --version