YMatrix

SANY Heavy Industry is a leading manufacturer of construction machinery. Within SANY, the Intelligentization Division has built an industrial big data platform for concrete machinery, the Pumping Cloud Platform (泵送云平台, hereafter “the platform”), which integrates distributed storage, data modeling and deployment, and visual analytics.

Today, the platform covers more than 20,000 concrete machinery units. By quantifying around 268 metrics across five key performance dimensions—vibration, blockage, height, rotation, and efficiency—it provides comprehensive health check reports and fault prediction for each individual machine, and offers data-driven insights for intelligent operations and maintenance.

At the same time, historical multi-equipment, multi-dimensional statistics and comparative analysis support R&D in product definition, optimization, and tracking. The platform fully underpins four core scenarios: data-driven decision-making, new product tracking, technical retrofit validation, and precise fault diagnosis.

Data management quality and efficiency are strategically important to SANY. By mining industrial big data, the company aims to:

Implement intelligent inspections on key performance indicators of concrete machinery
Build unique “digital profiles” for thousands of vehicles (“one profile per truck”)
Achieve predictive maintenance
Connect factory operational data with business data
Manage the full life cycle of each machine
Build optimal life-cycle models for equipment
Optimize resource utilization and configuration efficiency
Improve equipment availability

Current Business Challenges

The platform ingests nearly 2,000 operating parameters from more than 20,000 machines. Telemetry is reported at up to 2 Hz, with over 500 million records uploaded every day, consuming more than 1 TB of disk space per day. On top of that, the system performs per-vehicle daily metric calculations, intelligent inspections, model training, and time-series visualization.
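As a rough back-of-the-envelope check, the quoted figures can be turned into per-machine and per-record averages. The sketch below uses only the numbers stated above; since the text gives them as lower bounds ("more than", "over"), the results are order-of-magnitude estimates, and treating 1 TB as 10^12 bytes is an assumption.

```python
# Figures quoted in the text; all are stated as lower bounds.
MACHINES = 20_000
RECORDS_PER_DAY = 500_000_000
BYTES_PER_DAY = 1e12        # assumption: 1 TB taken as 10**12 bytes
MAX_RATE_HZ = 2             # peak reporting rate per machine
SECONDS_PER_DAY = 86_400

# Average number of records one machine uploads per day.
records_per_machine = RECORDS_PER_DAY / MACHINES

# Average on-disk footprint of one record.
avg_record_bytes = BYTES_PER_DAY / RECORDS_PER_DAY

# Average ingest rate across the whole fleet, in records per second.
fleet_avg_rate = RECORDS_PER_DAY / SECONDS_PER_DAY

# Hours of reporting per machine per day if it always sent at the peak rate.
active_hours_at_max_rate = records_per_machine / MAX_RATE_HZ / 3600

print(f"{records_per_machine:,.0f} records per machine per day")
print(f"{avg_record_bytes:,.0f} bytes per record on average")
print(f"{fleet_avg_rate:,.0f} records per second fleet-wide")
print(f"{active_hours_at_max_rate:.1f} hours per day at {MAX_RATE_HZ} Hz")
```

On these assumptions, each machine contributes about 25,000 records a day at roughly 2 KB per record, which corresponds to about three and a half hours of reporting at the peak rate.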

In its early phase, the platform was built on a Hadoop + Spark architecture. Over time, four major pain points emerged:

1. Complex architecture caused data redundancy and operational overhead

The traditional Hadoop + Spark architecture required a full “big data stack” such as CDH, including:

Hadoop – storing raw data
Spark – running daily batch jobs
Hive – handling ad-hoc offline queries
HBase – powering data serving and dashboards

This resulted in a complex technology stack. To satisfy different service requirements, the same data had to exist in multiple forms across different components, wasting storage and making operations and maintenance increasingly difficult. The diversity of technologies also made it harder to staff and train the team.

2. Misaligned time-series operating data

Operating data is inherently time-series. In real-world working conditions, unpredictable factors mean that data is often uploaded irregularly, so readings that belong to the same timestamp arrive in separate, partially filled rows rather than as one aligned record. The platform acted as a passive receiver of telemetry and could only cleanse data after it had landed on disk.

As a result, the cluster accumulated large amounts of “empty” or sparse data every day. The data cleansing process was complex and error-prone, and high accuracy was difficult to maintain.

3. Long analysis cycles with the traditional stack

For data analysts, iterative experimentation is essential. They need tools that can return results quickly so they can refine their analysis and algorithms while maintaining “thought continuity”.

In the traditional architecture, Spark—the computation engine—had to read data from HDFS and shuffle or regroup it before computing. This movement and aggregation of data consumed substantial resources and time, significantly slowing down computation and lengthening the analysis cycle.

4. Limited support for Python-based procedural analysis

In industrial scenarios, procedural languages like Python are a must-have for data analysts. Under the traditional architecture, the team used Spark’s pandas UDF feature to batch-run Python code. This led to a lot of “glue code” in the pipelines, which slowed down algorithm development and made it harder to iterate on models.

Migrating to YMatrix: Architecture and Key Advantages

To address these challenges, SANY rebuilt the underlying architecture of the platform with YMatrix at the core. Compared with the original Spark-based setup, YMatrix brought four major advantages:

1. “One for ALL” hyper-converged time-series database

YMatrix converges multiple workload types into a single database, covering:

Data storage
Real-time computation
Offline/batch computation
Data serving and visualization

This eliminates the need for a sprawling, multi-component Hadoop stack.

YMatrix also offers GUI-based installation and integrates with Grafana for monitoring, which dramatically lowers the operational burden. For the team, this is the “One for ALL” architecture they had been looking for.

2. MatrixGate (mxgate) with upsert support

For real-time ingestion, the platform uses MatrixGate (mxgate), YMatrix’s streaming data ingestion tool, which supports upsert semantics. It can merge multiple rows that share the same timestamp into a single consolidated record, which suits the time-series telemetry uploaded by heavy machinery.

At the same time, part of the data cleansing logic can be moved “upstream” into the ingestion process, simplifying the overall data cleaning workflow.
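To make the idea of upsert-style merging concrete, here is a minimal in-memory sketch in plain Python. It is not mxgate's implementation or API. The key layout `(device_id, timestamp)`, the column names, and the rule that a later non-null value overwrites while a null never erases an existing value are all assumptions chosen for illustration.

```python
from typing import Dict, Iterable, Optional, Tuple

Row = Dict[str, Optional[float]]
Key = Tuple[str, int]  # hypothetical key: (device_id, epoch seconds)


def upsert_merge(rows: Iterable[Tuple[Key, Row]]) -> Dict[Key, Row]:
    """Merge partial rows that share a key into one consolidated row.

    Assumed rule: a non-null value overwrites whatever is stored;
    a null value never erases a value that is already present.
    """
    merged: Dict[Key, Row] = {}
    for key, row in rows:
        target = merged.setdefault(key, {})
        for column, value in row.items():
            if value is not None:
                target[column] = value
            else:
                # Remember the column exists, but keep any stored value.
                target.setdefault(column, None)
    return merged


# Made-up telemetry: two partial rows for the same second, one for the next.
incoming = [
    (("pump-001", 1700000000), {"oil_temp": 61.5, "pressure": None}),
    (("pump-001", 1700000000), {"oil_temp": None, "pressure": 28.4}),
    (("pump-001", 1700000001), {"oil_temp": 61.7, "pressure": None}),
]
table = upsert_merge(incoming)
for key, row in sorted(table.items()):
    print(key, row)
```

The two partial rows for the first timestamp collapse into one fully populated record, so less sparse data reaches storage and less cleansing is needed afterwards.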

3. MARS tables deliver ~5x faster queries

YMatrix is a hybrid OLAP + OLTP database. Since the data is stored directly inside YMatrix, there is no need to move it out for computation. Leveraging YMatrix’s unique MARS table feature, SANY improved end-to-end query and computation performance by around 5x, making it much easier for algorithm engineers to inspect raw data and refine models.

In benchmark comparisons between the two generations of clusters (with the same per-node hardware configuration), the YMatrix hyper-converged database:

Used only half the number of physical machines (about 50% resource savings)
Reduced algorithm runtime from 2.5 hours to 1 hour
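The two benchmark figures can be combined into a single resource-efficiency number. The sketch below is one possible reading, not a description of how the benchmark was actually scored: it assumes cost scales with machines multiplied by runtime, and it normalizes the old cluster size to 1.0.

```python
# Benchmark figures quoted in the text.
OLD_RUNTIME_H = 2.5
NEW_RUNTIME_H = 1.0
OLD_MACHINES = 1.0   # old cluster size, normalized (assumption)
NEW_MACHINES = 0.5   # "half the number of physical machines"

# Wall-clock speedup of the algorithm run.
wall_clock_speedup = OLD_RUNTIME_H / NEW_RUNTIME_H

# Machine-hours consumed per run on each cluster (assumed cost model).
old_machine_hours = OLD_MACHINES * OLD_RUNTIME_H
new_machine_hours = NEW_MACHINES * NEW_RUNTIME_H

# How many times fewer machine-hours the new cluster needs.
machine_hours_ratio = old_machine_hours / new_machine_hours

print(f"wall-clock speedup: {wall_clock_speedup}x")
print(f"machine-hours reduction: {machine_hours_ratio}x")
```

Under this cost model the run is 2.5x faster in wall-clock time and uses 5x fewer machine-hours. That matches the scale of the "around 5x" figure, although the text does not say how that figure was measured.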

4. Better compatibility with Python

YMatrix supports user-defined functions (UDFs) written in Python 3. Interface definitions and invocation metadata are stored in structured form in the cluster, which makes Python code easier to migrate, call, and manage.

This is a developer-centric advantage that significantly benefits the team’s subsequent data analysis and structured algorithm iteration.

Stronger Data Support for Daily Business Scenarios

After the migration to YMatrix, SANY re-examined core workflows across marketing, R&D, and service, and further expanded the way data supports business users and decisions.

1. Marketing Scenarios

Pumping index analytics

The platform analyzes metrics such as utilization rates and pumped concrete volume across regions to assess overall market conditions and customer profitability. It helps evaluate demand patterns by project type (e.g., metro projects, elevated roads, high-rise buildings) across different tiers of the national market, and helps the marketing team identify and develop high-potential focus markets.

Marketing decision support

By analyzing user behavior and performance dimensions of equipment across regions—combined with boom length, chassis type, model, delivery date, and other attributes—the platform helps pinpoint regional equipment needs and enables the marketing team to design more targeted go-to-market strategies.

2. R&D Scenarios

Retrofit comparison

Based on the progress of technical retrofits, the platform continuously compares equipment performance before and after changes, quantifies key indicators, and visualizes the effect of modifications. This replaces manual phone-based follow-up with online data statistics and analysis, greatly improving both efficiency and reliability.
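A before/after comparison of this kind can be sketched in a few lines of Python. The function name `retrofit_effect`, the metric (daily blockage events), and the sample values are all hypothetical. The platform's actual indicators and statistical method are not described in the text.

```python
from datetime import date
from statistics import mean
from typing import Dict, List, Optional, Tuple


def retrofit_effect(
    samples: List[Tuple[date, float]], retrofit_day: date
) -> Dict[str, Optional[float]]:
    """Compare the mean of a daily metric before and after a retrofit date."""
    before = [value for day, value in samples if day < retrofit_day]
    after = [value for day, value in samples if day >= retrofit_day]
    if not before or not after:
        raise ValueError("need samples on both sides of the retrofit date")
    mean_before, mean_after = mean(before), mean(after)
    # Relative change is undefined when the baseline mean is zero.
    change = None if mean_before == 0 else (mean_after - mean_before) / mean_before
    return {"before": mean_before, "after": mean_after, "relative_change": change}


# Made-up daily blockage counts for one machine, retrofitted on 3 March.
samples = [
    (date(2024, 3, 1), 4.0),
    (date(2024, 3, 2), 6.0),
    (date(2024, 3, 3), 3.0),
    (date(2024, 3, 4), 2.0),
]
result = retrofit_effect(samples, date(2024, 3, 3))
print(result)
```

With these illustrative values the mean drops from 5.0 to 2.5 events per day, a relative change of -0.5. A real comparison would also need to control for workload and operating conditions.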

Fault localization

R&D and service engineers can remotely view (or replay) operating data at the moment a fault occurs, quickly pinpoint root causes, and speed up fault resolution. This has reduced travel for on-site troubleshooting by about 60%.

Product innovation support

By analyzing boom length, chassis type, model, delivery date, and other dimensions, engineers gain a finer-grained understanding of product behavior in the field and can better capture real market needs to guide product innovation.

3. Service Scenarios

Multi-level health analysis: national → region → key city → key equipment

The platform scans equipment health status across all regions, clarifies overall performance, and identifies:

High-performing machines for best-practice sharing
Underperforming machines for priority follow-up

It also supports closed-loop tracking of 26 predictive fault patterns and 297 self-diagnostic fault types.

CRM-based service closed loop

By integrating the platform with SANY’s service assistant in a microservices architecture, the team has built a closed loop from monitoring to service execution. This helps improve service efficiency and ultimately reduce customer downtime caused by issues such as pipeline blockage, engine failures, and hydraulic system faults.

Conclusion

Across industries, the demand for massive, real-time data continues to grow. Real-time recommendations, precision marketing, and instant decision-making are becoming core capabilities in digital transformation. The ability to sense and guide user needs more quickly—and improve product experience in real time—creates lasting competitive advantage.

YMatrix’s hyper-converged database is a natural fit for this trend. By providing a unified, one-stop data platform that handles both massive data volumes and real-time analytics, YMatrix makes it easier and more efficient to unlock the value of data and turn it into concrete business outcomes.
