
PXF: Cross-Source Queries in Seconds with YMatrix

2025-11-20 · YMatrix Team

The Data Silo Problem

In many enterprises, critical data lives in isolated systems:

  • Orders in an ERP
  • Production logs in HDFS
  • Customer records in Oracle
  • Product assets in S3

When business needs arise—such as analyzing how regional customer orders correlate with manufacturing activity—the traditional response is costly and slow: extract data from each source, run ETL pipelines to load it into a central warehouse, clean and integrate it, then finally generate reports.

But this approach doesn’t scale. Larger datasets mean longer ETL windows. The faster source data changes, the sooner each batch-loaded copy goes stale. And maintaining pipelines across heterogeneous formats becomes a growing operational burden.

This is the classic data silo challenge—and YMatrix’s PXF (Platform Extension Framework) offers a better way.

Query Without Moving Data

PXF is YMatrix’s federated query engine. It enables direct, secure, and high-performance SQL access to dozens of external data sources—without copying or moving data. Supported systems include:

  • Relational databases (Oracle, MySQL, SQL Server)
  • Big data platforms (HDFS, Hive)
  • Cloud object stores (S3, OSS)

With PXF, you can create foreign tables in YMatrix that point to remote data. Once defined, these tables behave like local ones. For example, joining an Oracle customer table with an HDFS log file requires only standard SQL—no custom scripts, no batch windows, no data duplication.
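
To make this concrete, here is a rough sketch of what the table definitions and the join might look like. It assumes Greenplum-style PXF external-table DDL (a pxf:// LOCATION with PROFILE and SERVER options); the exact syntax in your YMatrix version may differ, so check the PXF documentation. The helper pxf_location and every table name, column, path, and server name below are hypothetical.

```python
# Sketch only: assumes Greenplum-style PXF external-table DDL.
# All table names, columns, paths, and server names are hypothetical.

def pxf_location(resource: str, profile: str, server: str) -> str:
    """Build a LOCATION URI of the form pxf://<resource>?PROFILE=...&SERVER=..."""
    return f"pxf://{resource}?PROFILE={profile}&SERVER={server}"

customers_loc = pxf_location("SALES.CUSTOMERS", "jdbc", "oracle_srv")
logs_loc = pxf_location("data/prod/logs", "hdfs:text", "hdfs_srv")

# Customer table that lives in Oracle, reached through a JDBC profile.
CUSTOMERS_DDL = f"""\
CREATE EXTERNAL TABLE ext_customers (customer_id int, region text)
LOCATION ('{customers_loc}')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');"""

# Comma-delimited log files that live in HDFS.
LOGS_DDL = f"""\
CREATE EXTERNAL TABLE ext_prod_logs (customer_id int, event text, ts timestamp)
LOCATION ('{logs_loc}')
FORMAT 'TEXT' (delimiter=E',');"""

# Once both tables exist, the cross-source join is ordinary SQL.
JOIN_QUERY = """\
SELECT c.region, count(*) AS events
FROM ext_customers c
JOIN ext_prod_logs l ON l.customer_id = c.customer_id
GROUP BY c.region;"""

if __name__ == "__main__":
    for statement in (CUSTOMERS_DDL, LOGS_DDL, JOIN_QUERY):
        print(statement, end="\n\n")
```

The printed statements would be run in a SQL client connected to YMatrix. The point is that the join itself contains nothing source-specific: all the connection detail sits in the table definitions.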

The result? Reports can shift from T+1 (next-day) delivery to near real time, storage costs drop because data is no longer duplicated, and much of the ETL work disappears.

Real-World Impact

Real-Time Funnel Analysis

An e-commerce company needed to track user behavior from click to purchase. Instead of waiting for nightly ETL jobs to import HDFS logs, they used PXF to query raw logs directly alongside warehouse tables. Analysts now explore conversion paths using live data, with no wait for a batch load and no staging step.

Secure, Efficient Oracle Access

An energy firm stores billions of production records in Oracle. Running regional sales aggregations used to require full-table exports, a process that took hours and moved enormous volumes of data across the network. With PXF, YMatrix pushes aggregation logic down to Oracle. Only summarized results are transferred, which cuts network traffic by orders of magnitude and keeps the detailed records inside Oracle.
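
The effect of pushing an aggregation to the source can be illustrated with a small, self-contained simulation. This is not PXF code: an in-memory SQLite database stands in for the remote Oracle system, and the table, columns, and row counts are made up for illustration. It simply compares how many rows cross the "network" when the aggregation runs locally versus at the source.

```python
import sqlite3

def make_source(n_rows: int = 10_000) -> sqlite3.Connection:
    """Create a stand-in 'remote' database (in-memory SQLite, not Oracle)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE production (region TEXT, amount INTEGER)")
    regions = ["north", "south", "east", "west"]
    conn.executemany(
        "INSERT INTO production VALUES (?, ?)",
        ((regions[i % 4], i % 100) for i in range(n_rows)),
    )
    return conn

def aggregate_after_export(conn):
    """Fetch every row from the source, then aggregate on the client side."""
    rows = conn.execute("SELECT region, amount FROM production").fetchall()
    totals = {}
    for region, amount in rows:
        totals[region] = totals.get(region, 0) + amount
    return totals, len(rows)

def aggregate_pushed_down(conn):
    """Let the source aggregate; only one row per region comes back."""
    rows = conn.execute(
        "SELECT region, SUM(amount) FROM production GROUP BY region"
    ).fetchall()
    return dict(rows), len(rows)

if __name__ == "__main__":
    conn = make_source()
    local_totals, moved_full = aggregate_after_export(conn)
    pushed_totals, moved_pushed = aggregate_pushed_down(conn)
    assert local_totals == pushed_totals  # same answer either way
    print(f"rows transferred: export={moved_full}, pushdown={moved_pushed}")
```

Both paths produce identical totals, but the export path moves every source row while the pushdown path moves one row per region. The gap grows with table size, since the pushed-down result stays at one row per group.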

Why PXF Works

PXF’s performance and flexibility stem from three core design principles:

  • Parallel Execution

Each YMatrix segment node runs its own PXF instance, reading external data shards (like HDFS blocks or Oracle partitions) in parallel. Query throughput can therefore scale with cluster size, as long as the external source can serve the parallel reads.

  • Predicate Pushdown

When the source system supports it, PXF pushes WHERE-clause filters and column projections down to the source. This minimizes data movement and leverages native source optimizations.

  • Extensible Architecture

Built on a modular framework, PXF supports pluggable connectors. New data sources can be added without modifying the core engine—making it future-proof for evolving data landscapes.
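
The first two principles can be sketched together in a few lines. The following is a toy model, not PXF's implementation: the shard layout, the read_shard and parallel_scan helpers, and the sample rows are all hypothetical. Each worker plays the role of one per-segment PXF instance reading one shard, and the filter and column list are applied inside the shard scan so that only matching rows and requested columns are returned.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical external source split into 4 shards of 100 rows each
# (think HDFS blocks or table partitions).
SHARDS = [
    [
        {
            "order_id": s * 100 + i,
            "region": "east" if i % 2 else "west",
            "amount": i,
            "note": "x" * 50,  # wide column the query does not need
        }
        for i in range(100)
    ]
    for s in range(4)
]

def read_shard(shard, columns, predicate):
    """Source-side scan: filter rows and project columns before returning."""
    return [{c: row[c] for c in columns} for row in shard if predicate(row)]

def parallel_scan(shards, columns, predicate, workers=4):
    """Scan all shards concurrently, one worker per shard, and merge the results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = pool.map(lambda sh: read_shard(sh, columns, predicate), shards)
        return [row for part in parts for row in part]

if __name__ == "__main__":
    rows = parallel_scan(
        SHARDS,
        columns=["order_id", "amount"],
        predicate=lambda r: r["region"] == "east" and r["amount"] > 90,
    )
    total = sum(len(s) for s in SHARDS)
    print(f"{len(rows)} of {total} rows returned, columns: {sorted(rows[0])}")
```

In this toy data set only 20 of the 400 rows match, and the wide note column never leaves the source. Adding shards and workers raises the amount of data scanned concurrently without changing the query.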

A Lighter Approach to Modern Analytics

As data volumes grow, the “move everything” warehouse model is becoming obsolete. PXF enables a smarter paradigm: leave data where it lives, and bring computation to it.

Whether you’re unifying lake and warehouse workloads, connecting to legacy OLTP systems, or querying cloud storage, PXF lets YMatrix deliver real-time insights with minimal infrastructure overhead.

Stop building pipelines just to move data. Start analyzing it—wherever it is.

Learn more about PXF in YMatrix: https://ymatrix.cn/zh/doc/6.6/dataquery/pxf_hdfs