
Comparing Huawei's Rack-Scale CloudMatrix 384 to Nvidia's Top-Tier GB200 NVL72

Analysis shows the Chinese tech giant's CloudMatrix 384 surpassing the GB200 NVL72 in raw performance, but at the cost of far higher power consumption and a hefty price tag

Huawei's CloudMatrix 384 Offers Superior AI Performance, but at a Cost

Huawei has unveiled its latest AI system, the CloudMatrix 384, at the World Artificial Intelligence Conference (WAIC) in Shanghai. The system, powered by the Ascend 910C NPU, promises significant improvements in AI compute performance and memory capacity.

Each Ascend 910C accelerator comes equipped with a pair of compute dies stitched together using a high-speed chip-to-chip interconnect, achieving 540 GB/s data transfer. This design is well-suited for large-scale AI workloads.

The CloudMatrix 384 spans 16 racks, with 12 racks dedicated to compute and 4 for networking. It offers up to 300 petaFLOPS BF16 (dense compute), nearly double the total FP16/BF16 compute performance of Nvidia's GB200 NVL72 rack systems. However, it consumes over four times more power and requires substantially more floor space.

| Metric | Huawei CloudMatrix 384 | Nvidia GB200 NVL72 |
|---|---|---|
| Compute performance | Up to 300 petaFLOPS BF16 (dense), nearly 2× Nvidia's rack-scale figure[3]; each Ascend 910C delivers 752 teraFLOPS FP16/BF16 from its dual dies[1]. | Roughly half the total FP16/BF16 compute of the CloudMatrix 384 per rack-scale system[2][3]. |
| Memory capacity | 128 GB HBM per Ascend 910C; 3.6× greater total memory capacity at rack scale[3][1]. | 192 GB HBM3e per Blackwell GPU (96 GB per compute die); smaller total capacity[1][3]. |
| Memory bandwidth | 3.2 TB/s per chip; 2.1× more total memory bandwidth per system[3][1]. | Lower total bandwidth than the CloudMatrix 384, though competitive per GPU[1]. |
| Power consumption | Estimated ~559–600 kW for the full system[5][2], more than 4× Nvidia's[3]. | ~120 kW per system[2]. |
| Power efficiency | About 460 gigaFLOPS per watt, much less efficient[2]. | About 1,500 gigaFLOPS per watt, roughly 3× more efficient[2]. |
| Physical size/density | ~16× more floor space, with over 5× the number of accelerators[1]. | Much smaller footprint; higher compute and power density[1][2]. |
| Cost (deployment & ops) | Likely higher due to elevated power draw and operational complexity; access to cheap power in China may offset this[2][3]. | Lower deployment and operating cost thanks to better efficiency and density[2]. |
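The per-watt figures in the table can be sanity-checked with quick arithmetic. The sketch below (not from the article) plugs in the table's 300 petaFLOPS rating and the upper ~600 kW power estimate; the cited ~460 GFLOPS/W presumably reflects a somewhat higher assumed draw, so treat the output as an approximation:

```python
# Back-of-the-envelope check of the efficiency figures in the table above.
# Numbers are taken from the comparison; the ~600 kW CloudMatrix draw is
# the upper end of the estimate quoted there.
def gflops_per_watt(petaflops: float, kilowatts: float) -> float:
    """Convert a system-level petaFLOPS rating and power draw to GFLOPS/W."""
    return (petaflops * 1e6) / (kilowatts * 1e3)  # 1 PFLOPS = 1e6 GFLOPS

cloudmatrix = gflops_per_watt(300, 600)  # ≈ 500 GFLOPS/W at the ~600 kW estimate

# Conversely, Nvidia's quoted ~1,500 GFLOPS/W at ~120 kW implies a total of:
nvl72_petaflops = 1500 * 120e3 / 1e6     # ≈ 180 PFLOPS FP16/BF16 per system

print(cloudmatrix, nvl72_petaflops)
```

That implied ~180 petaFLOPS for the GB200 NVL72 is consistent with the article's claim that CloudMatrix's 300 petaFLOPS is "nearly double" Nvidia's total.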

Huawei compensates for lower silicon design maturity by deploying massive scale with hundreds of NPUs interconnected in an all-to-all topology for aggregate performance. On the other hand, Nvidia's GB200 NVL72 systems, based on the Blackwell architecture, lead in sheer efficiency and compute density, benefiting from more mature design and better FP8 support.

In testing on DeepSeek-R1, a mixture-of-experts model, Huawei's CloudMatrix-Infer serving system delivered strong throughput: a single NPU processed 6,688 input tokens per second while generating output at 1,943 tokens per second. Under ideal conditions, Huawei claims a prompt-processing efficiency of 4.5 tokens per second per teraFLOPS.

The CloudMatrix 384 is reported to retail for around $8.2 million, while Nvidia's NVL72 rack systems are estimated to cost around $3.5 million apiece. It remains to be seen in what volumes Huawei will be able to churn out CloudMatrix systems.
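Those price tags can be put on a common footing as cost per unit of compute. The sketch below uses the reported prices; the ~180 petaFLOPS figure for the GB200 NVL72 is an assumption inferred from the article's claim that CloudMatrix's 300 petaFLOPS is "nearly double" Nvidia's total:

```python
# Rough price-per-performance comparison using the reported figures.
# ASSUMPTION: ~180 dense FP16/BF16 petaFLOPS for a GB200 NVL72 system
# (inferred from the "nearly double" comparison in the text, not stated there).
systems = {
    "CloudMatrix 384": {"price_usd": 8.2e6, "petaflops": 300},
    "GB200 NVL72":     {"price_usd": 3.5e6, "petaflops": 180},
}

for name, s in systems.items():
    usd_per_pflops = s["price_usd"] / s["petaflops"]
    print(f"{name}: ${usd_per_pflops:,.0f} per petaFLOPS")
```

On these assumptions, Huawei's system costs roughly $27,000 per petaFLOPS against roughly $19,000 for Nvidia's, before the larger gap in power and floor-space costs is counted.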

In summary, Huawei’s CloudMatrix 384 offers superior raw AI compute performance and memory scale at the expense of far greater power consumption, lower efficiency, and larger operational footprint. Meanwhile, Nvidia’s GB200 NVL72 excels in power efficiency, compute density, and lower costs. The choice between them depends on priorities: raw performance vs. operational efficiency and cost.

  1. The AI industry is weighing the trade-offs of Huawei's CloudMatrix 384: it offers superior raw compute performance, but at a much higher operating cost than systems like Nvidia's GB200 NVL72.
  2. For buyers, operational economics are a central factor: the CloudMatrix 384 consumes over four times more power and requires substantially more floor space than the GB200 NVL72.
  3. Despite the CloudMatrix 384's raw performance lead, cloud and networking providers may still favor the GB200 NVL72 for its advantages in power efficiency, compute density, and total cost.
  4. As AI adoption grows across sectors, data-center operators will have to weigh raw performance against power consumption, efficiency, and cost, with these two systems as key contenders for large-scale AI infrastructure.
