Why Can WAL Data Become “Corrupted Out of Thin Air”? An Incident Investigation Spanning the Boundary Between Databases and Hardware

2026-05-15 · YMatrix Team
#Technical Discussion

Introduction

In database systems, WAL (Write-Ahead Log) is responsible for data recovery and replication, and is widely regarded as one of the most stable and reliable components in the entire database pipeline. However, during a recent production troubleshooting process, the YMatrix team encountered a highly unusual issue: CRC32C checksum verification failures continuously occurred on a Mirror node, causing synchronous replication to stall and transaction commits to time out. Further analysis revealed that the WAL on the Primary node itself was completely intact, network connectivity was normal, and even TCP checksum validation passed successfully.

What made the issue even more complicated was its non-deterministic nature. The vast majority of WAL files were entirely identical, while only a very small number of segments became corrupted at specific offsets. Moreover, the corrupted data exhibited obvious “structured characteristics.” To locate the root cause, the YMatrix team started from the PostgreSQL / GPDB WAL streaming replication source code and progressively narrowed the issue down through synchronized multi-point GDB capture, binary-level WAL analysis, and Linux network-path auditing, eventually tracing the problem to the NIC DMA and PCIe hardware transmission path.

This investigation not only involved the database kernel itself, but also extended deeply into the Linux network stack, DMA mechanisms, and hardware-level data transmission behavior. This article presents the complete troubleshooting process behind this complex production failure, as well as YMatrix’s engineering practices and technical insights regarding reliability across the underlying data path.

01 A WAL Anomaly That “Didn’t Look Like a Software Bug”

The issue first appeared during Primary-Mirror streaming replication in a MatrixDB cluster. While replaying WAL records, the Mirror node continuously reported CRC32C checksum errors:

LOG: incorrect resource manager data checksum in record at 0/4F551398

The Mirror node then repeatedly retried WAL replay, causing the synchronous replication pipeline to become blocked. Distributed transaction COMMIT operations gradually started timing out, and the entire cluster entered an unstable state.
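
To make the failure mode concrete: every WAL record carries a CRC32C covering its contents, and replay recomputes and compares that value before applying the record, so any byte that changes between the sender and the point of verification surfaces as exactly this error. The following is a minimal C sketch of the idea; the record layout and the bitwise software CRC32C routine are simplified assumptions for illustration, not PostgreSQL's actual definitions.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Bitwise CRC32C (Castagnoli, reflected polynomial 0x82F63B78).
 * PostgreSQL ships optimized implementations; this is illustration only. */
static uint32_t crc32c(const void *buf, size_t len)
{
    const uint8_t *p = buf;
    uint32_t crc = ~0u;
    while (len--) {
        crc ^= *p++;
        for (int i = 0; i < 8; i++)
            crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1)));
    }
    return ~crc;
}

/* Hypothetical, simplified record: a real XLogRecord header has more fields,
 * and its CRC also covers the header itself, not just the payload. */
typedef struct {
    uint32_t len;             /* payload length */
    uint32_t crc;             /* stored CRC32C over the payload */
    uint8_t  payload[128];
} demo_record;

int main(void)
{
    demo_record rec = { .len = sizeof(rec.payload) };
    memset(rec.payload, 0xAB, sizeof(rec.payload));
    rec.crc = crc32c(rec.payload, rec.len);           /* sender side: CRC is valid */

    printf("before corruption: %s\n",
           crc32c(rec.payload, rec.len) == rec.crc ? "CRC ok" : "CRC mismatch");

    rec.payload[64] ^= 0xFF;                          /* one byte damaged in transit */

    /* Replay side: this mismatch is what PostgreSQL reports as
     * "incorrect resource manager data checksum in record". */
    printf("after corruption:  %s\n",
           crc32c(rec.payload, rec.len) == rec.crc ? "CRC ok" : "CRC mismatch");
    return 0;
}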

At first glance, this looked like a typical WAL corruption issue. The real challenge, however, was that the problem could not be reproduced consistently. The overwhelming majority of WAL files were perfectly normal, and only a few segments produced CRC errors at specific offsets. This meant the issue did not behave like a conventional software bug with deterministic patterns, but rather resembled a low-probability, transient data corruption event.

For database systems, this type of intermittent issue is often the most difficult to diagnose. It cannot be triggered reliably through stress testing, nor can clear evidence be easily identified from logs. More importantly, subsequent binary comparisons of WAL files showed that among 24 WAL files, 22 were completely identical, while only one file contained actual corruption — with multiple completely normal WAL segments in between.

This observation proved extremely important. It implied that the problem was not a persistent systemic failure, but rather a sporadic abnormal data injection occurring somewhere along the transmission path. As a result, the investigation gradually shifted from the database internals toward the lower-level data transmission infrastructure.

02 The YMatrix Engineering Team Gradually Narrows the Scope

Since the final error manifested during WAL replay on the Mirror node, the most natural assumption was that the WAL file had become corrupted during disk writes. To verify this, the team first migrated the WAL storage on the Mirror node by symlinking the WAL directory to a different disk, attempting to rule out storage media failures.

However, the issue persisted.

This indicated that the corruption likely occurred before the data was actually written to disk. The investigation therefore shifted from “disk anomalies” toward “transmission path anomalies.” At this point, conventional logs were no longer sufficient for narrowing down the problem. The YMatrix team decided to stop “guessing” and instead capture real data directly at critical points along the WAL transmission pipeline.

The team subsequently designed multiple synchronized GDB capture experiments to validate the WAL data state at different stages of the pipeline.

First, WAL data was captured before transmission within the Primary node’s walsender process. The results showed that pg_waldump verification passed completely, CRC32C values were valid, and WAL records were intact. This confirmed that the local WAL files on the Primary node were healthy, and that the path from disk reads to walsender output was functioning correctly.

However, when the team further captured data on the Mirror node’s walreceiver immediately after recv() and before processing, the issue surfaced for the first time. The data received by walreceiver was already corrupted, and it matched exactly what was ultimately written to disk. This meant that write() was not the point where corruption occurred — the Mirror node merely “faithfully” persisted already-corrupted data onto disk.

At this point, the problem was narrowed down for the first time with certainty: the WAL transmitted from the Primary node was correct, but the data became corrupted during transmission.

This conclusion was highly significant, because it meant the issue was no longer confined to database-internal logic, but had entered the boundary area between the database, operating system, networking stack, and hardware.

03 Can the Database Software Layer Really Rewrite WAL Data?

To further verify whether the database software layer itself could potentially rewrite WAL payloads, the team began auditing the PostgreSQL / GPDB streaming replication source code path section by section.

In theory, a WAL record traveling from Primary to Mirror passes through multiple stages:

  • WAL file reads

  • walsender

  • libpq buffer

  • kernel socket

  • TCP stack

  • NIC DMA

  • network transmission

  • Mirror recv buffer

Any stage could theoretically introduce anomalies.

However, after examining the source code in depth, the team discovered that the WAL replication pipeline was actually remarkably “thin.”

Most of the path essentially boiled down to:

memcpy → send() → recv() → memcpy → write()

Throughout the entire process, there was no compression, protocol re-encoding, or data transformation logic. Without SSL/GSS enabled, there was no encryption processing either. Most of the pipeline fundamentally consisted only of standard memory copies and system calls.
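
A minimal, self-contained sketch of that shape is shown below: the payload only ever moves verbatim between a file descriptor, user-space buffers, and a socket. It uses a socketpair() and stdin in place of a real Primary/Mirror connection and real WAL segments, and omits the streaming-replication protocol framing entirely; it illustrates how thin the data path is, and is not walsender/walreceiver code.

#include <stdio.h>
#include <unistd.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/wait.h>

#define CHUNK 8192

int main(void)
{
    int sv[2];
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0) {
        perror("socketpair");
        return 1;
    }

    pid_t pid = fork();
    if (pid == 0) {
        /* "Receiver" side: recv() -> write(); nothing else touches the bytes. */
        close(sv[0]);
        FILE *out = fopen("received_wal.bin", "wb");
        if (!out)
            _exit(1);
        char buf[CHUNK];
        ssize_t n;
        while ((n = recv(sv[1], buf, sizeof(buf), 0)) > 0)
            fwrite(buf, 1, (size_t) n, out);
        fclose(out);
        _exit(0);
    }

    /* "Sender" side: read() from the WAL source -> send(); nothing else. */
    close(sv[1]);
    char buf[CHUNK];
    ssize_t n;
    while ((n = read(STDIN_FILENO, buf, sizeof(buf))) > 0)
        send(sv[0], buf, (size_t) n, 0);
    close(sv[0]);
    waitpid(pid, NULL, 0);
    return 0;
}

Feeding any file through this loop and comparing input with received_wal.bin makes the point: if the bytes that arrive differ from the bytes that were sent, the divergence has to originate somewhere below these calls.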

This meant that the database software layer itself had virtually no path capable of “actively rewriting WAL data.” In other words, if the WAL sent by the sender was correct while the data received by the receiver was already corrupted, then the issue was far more likely to reside outside the database software — namely in the Linux network stack, DMA path, or even lower-level hardware transmission stages.

The investigation therefore continued converging toward the infrastructure layer.

04 The Turning Point: A Binary-Level diff

The team then performed a full binary comparison between the WAL segments on the sender and receiver sides.
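
Such a comparison needs no special tooling: cmp -l, or a small byte-diff along the lines of the sketch below, is enough to locate and bound any differing region. The file names are placeholders, and this generic sketch is not the exact procedure used in the investigation.

#include <stdio.h>

/* Byte-level diff of two copies of the same WAL segment, printing each
 * differing byte and the extent of every differing run. */
int main(int argc, char **argv)
{
    if (argc != 3) {
        fprintf(stderr, "usage: %s primary_segment mirror_segment\n", argv[0]);
        return 1;
    }
    FILE *a = fopen(argv[1], "rb");
    FILE *b = fopen(argv[2], "rb");
    if (!a || !b) {
        perror("fopen");
        return 1;
    }

    long off = 0, run_start = -1;
    int ca, cb;
    while ((ca = fgetc(a)) != EOF && (cb = fgetc(b)) != EOF) {
        if (ca != cb) {
            if (run_start < 0)
                run_start = off;                 /* a new differing run begins */
            printf("0x%08lx: %02x -> %02x\n", (unsigned long) off, ca, cb);
        } else if (run_start >= 0) {
            printf("run: 0x%08lx .. 0x%08lx (%ld bytes)\n",
                   (unsigned long) run_start, (unsigned long) (off - 1),
                   off - run_start);
            run_start = -1;
        }
        off++;
    }
    if (run_start >= 0)
        printf("run: 0x%08lx .. 0x%08lx (%ld bytes)\n",
               (unsigned long) run_start, (unsigned long) (off - 1),
               off - run_start);
    fclose(a);
    fclose(b);
    return 0;
}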

The result was highly intriguing: among 24 WAL files, 22 were completely identical, while only one file contained actual corruption. This indicated that the issue was not a persistent transmission-path failure, but rather a transient data corruption event.

However, what truly changed the direction of the investigation was not merely “only one file was corrupted,” but the actual content within the corrupted region itself.

Normal WAL data typically exhibits high entropy and resembles random binary streams. Yet the abnormal region displayed highly obvious structured characteristics:

FF E1 EA DA 01 00 00 00
00 FF FF FF FF 00 00 00

The abnormal data length was precisely 64 bytes. For database developers, 64 bytes may look like an ordinary number, but in the hardware world it is highly indicative: 64 bytes is commonly the size of a CPU cache line, a DMA descriptor, or a typical PCIe payload chunk. Moreover, the repeated FF FF FF FF pattern strongly resembled the characteristic markers associated with PCIe poisoned TLPs or DMA control structures.

More importantly, this structured data did not exist in the sender’s original WAL, yet it appeared fully formed in the receiver’s WAL stream. It was not part of the WAL itself, but rather an “external structured data fragment” injected somewhere along the way.

At this stage, the problem had clearly moved beyond the scope of database software, and the team gradually shifted suspicion toward lower-level issues such as NIC DMA descriptor leakage, PCIe TLP poisoning, or other DMA transmission-path anomalies.

05 Why Didn’t TCP Checksum Detect the Problem?

This is one of the most easily misunderstood yet most critical aspects of the entire case. Many people instinctively assume:

“If TCP checksum validation passes, then the data must be correct.”

In reality, the TCP checksum is far weaker than commonly imagined. At best, it can detect corruption introduced on the wire after the checksum has been computed; it says nothing about whether the data that went into the checksum calculation was correct in the first place.

If corruption occurs before checksum calculation — for example inside NIC DMA logic, during PCIe DMA memory writes, or within hardware offload processing — then the data covered by the TCP checksum may already be corrupted. In such scenarios, recv() may behave completely normally, the TCP connection may show no abnormalities whatsoever, yet the database can still receive invalid WAL data.
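
The principle can be reproduced with the Internet checksum algorithm itself (RFC 1071, the same ones’-complement sum TCP uses). In the sketch below, the buffer is damaged before the checksum is computed, roughly what happens when a DMA or offload fault corrupts data ahead of hardware checksumming, and verification on the receiving side still passes, because the checksum faithfully covers bytes that were already wrong. This illustrates the principle only; it does not model any specific NIC’s offload path.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Ones'-complement Internet checksum (RFC 1071), the algorithm TCP uses. */
static uint16_t inet_checksum(const void *data, size_t len)
{
    const uint8_t *p = data;
    uint32_t sum = 0;
    while (len > 1) {
        sum += ((uint32_t) p[0] << 8) | p[1];
        p += 2;
        len -= 2;
    }
    if (len)
        sum += (uint32_t) p[0] << 8;
    while (sum >> 16)
        sum = (sum & 0xFFFF) + (sum >> 16);
    return (uint16_t) ~sum;
}

int main(void)
{
    uint8_t intended[64], on_wire[64];
    memset(intended, 0xAB, sizeof(intended));   /* what the sender's application wrote */
    memcpy(on_wire, intended, sizeof(on_wire));

    /* Corruption happens BEFORE checksumming, e.g. while the buffer is DMA'd
     * to the NIC, which then computes the checksum in hardware (offload). */
    memset(on_wire + 16, 0xFF, 4);

    uint16_t wire_sum = inet_checksum(on_wire, sizeof(on_wire));

    /* The receiver recomputes the checksum over exactly the bytes it received. */
    printf("TCP-level checksum verification: %s\n",
           inet_checksum(on_wire, sizeof(on_wire)) == wire_sum ? "PASS" : "FAIL");

    /* ...yet the payload is no longer what the application sent. */
    printf("payload matches sender's intent: %s\n",
           memcmp(intended, on_wire, sizeof(intended)) == 0 ? "yes" : "NO");
    return 0;
}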

This is precisely why PostgreSQL systems maintain mechanisms such as WAL CRC32C, page checksum, and end-to-end consistency verification in addition to TCP validation. Database systems must be capable of defending against situations where: “TCP believes the data is correct, while the data itself has already been corrupted.” Fundamentally, this incident belonged exactly to that category of anomaly.

06 From a Database Problem to an Infrastructure Problem

Looking back, the issue initially appeared to be merely a WAL replay failure. But as the investigation deepened, the problem ultimately extended all the way into the Linux network stack, DMA mechanisms, and PCIe hardware transmission paths.

The defining characteristics of such problems are their extremely low probability, difficulty of reproduction, and invisibility in logs. Traditional monitoring systems are almost incapable of covering them.

People often view databases purely as software systems. In today’s large-scale real-time data environments, however, databases are increasingly operating close to hardware boundaries. NUMA, CPU cache, DMA, NIC offload, PCIe, and the Linux network stack — concepts once considered part of the “system layer” — can now directly impact database stability. For modern data infrastructure, true reliability is no longer determined solely by the database software itself, but by the consistency and verifiability of the entire data path across every layer.

07 Conclusion

This WAL anomaly investigation ultimately progressed from the database kernel itself into the Linux network stack, DMA mechanisms, and PCIe hardware transmission paths.

What makes such issues difficult to diagnose is not merely the complexity of the technical layers involved, but the fact that they exist within a “boundary zone”: they are neither reproducible like typical software bugs, nor accompanied by explicit hardware error reports like conventional hardware failures. Instead, they appear as extremely low-probability, transient anomalies with almost no logging evidence, causing single-layer diagnostic methodologies to fail easily.

Resolving this class of issue can no longer rely on “experience-based judgment” at any single layer. Instead, the entire data path must be decomposed and traced step by step — from database internals to the operating system, network transmission, and even the actual behavior of hardware DMA — verifying at every hop whether the data remains trustworthy.

This process also reveals a deeper transformation: database systems themselves are gradually evolving from mere “software components” into complete data infrastructures spanning operating systems and hardware transmission paths.

Data correctness is no longer guaranteed solely by database-internal logic, but depends on consistency across the entire transmission chain at multiple layers.

Going forward, YMatrix will continue advancing deeper engineering exploration and technical practices around distributed database kernels, streaming real-time architectures, data foundations for the Agent era, and highly reliable data transmission infrastructures.