Key Points
- DeepSeek trained a 671B-parameter Mixture-of-Experts AI model roughly ten times more efficiently than industry peers such as Meta.
- The company used 2,048 Nvidia H800 GPUs and completed training in just two months.
- Instead of CUDA, DeepSeek leveraged Nvidia’s PTX programming for fine-grained GPU optimizations.
- The company reconfigured GPU multiprocessors for faster server communication and optimized pipeline algorithms.
DeepSeek has made waves in the AI industry by training a Mixture-of-Experts (MoE) language model with an impressive 671 billion parameters. The model was trained on a cluster of 2,048 Nvidia H800 GPUs in just two months, a roughly tenfold efficiency gain over industry giants like Meta.
According to an analysis from Mirae Asset Securities Korea, cited by @Jukanlosreve, DeepSeek’s breakthrough stems from advanced fine-grained optimizations and the use of Nvidia’s PTX (Parallel Thread Execution) programming instead of the more commonly used CUDA.
PTX is an intermediate instruction set architecture (ISA) designed by Nvidia that sits between high-level GPU programming languages, like CUDA, and low-level machine code (SASS). Unlike CUDA, PTX provides near-metal access to GPU hardware, enabling precise optimizations such as register allocation and thread/warp-level adjustments. This level of fine-tuning allows DeepSeek to extract maximum performance from Nvidia’s GPUs, making their AI training process far more efficient than conventional methods.
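To make the distinction concrete, here is a minimal sketch, not DeepSeek's code, of how a developer can drop below CUDA C++ by embedding a PTX instruction directly in a kernel through inline assembly. The kernel itself is a trivial element-wise fused multiply-add; the point is the instruction-level control PTX exposes.

```cuda
// Illustrative only: a CUDA kernel that issues one PTX instruction via
// inline assembly instead of leaving instruction selection to the compiler.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void fma_kernel(const float* a, const float* b, const float* c,
                           float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float r;
        // Round-to-nearest fused multiply-add, written as raw PTX:
        // r = a[i] * b[i] + c[i]
        asm volatile("fma.rn.f32 %0, %1, %2, %3;"
                     : "=f"(r)
                     : "f"(a[i]), "f"(b[i]), "f"(c[i]));
        out[i] = r;
    }
}

int main() {
    const int n = 1 << 20;
    float *a, *b, *c, *out;
    cudaMallocManaged(&a, n * sizeof(float));
    cudaMallocManaged(&b, n * sizeof(float));
    cudaMallocManaged(&c, n * sizeof(float));
    cudaMallocManaged(&out, n * sizeof(float));
    for (int i = 0; i < n; ++i) { a[i] = 1.5f; b[i] = 2.0f; c[i] = 0.5f; }

    fma_kernel<<<(n + 255) / 256, 256>>>(a, b, c, out, n);
    cudaDeviceSynchronize();
    printf("out[0] = %f\n", out[0]);  // expected 3.5

    cudaFree(a); cudaFree(b); cudaFree(c); cudaFree(out);
    return 0;
}
```

DeepSeek's reported optimizations go far deeper, touching register usage and warp scheduling, but the mechanism is the same: hand-written PTX where the compiler's defaults leave performance on the table.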
For instance, when training its V3 model, DeepSeek reconfigured Nvidia’s H800 GPUs by allocating 20 of the 132 streaming multiprocessors on each GPU specifically to server-to-server communication. This likely involved compressing and decompressing data to overcome connectivity bottlenecks and improve processing speeds. Additionally, the company implemented advanced pipeline algorithms with intricate thread- and warp-level optimizations, enhancing overall performance beyond what CUDA alone could achieve.
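DeepSeek's exact kernels are not public, and its pipeline reportedly runs custom PTX-level code on the reserved SMs, which is well beyond the sketch below. This standard CUDA pattern only illustrates the underlying principle such schedules build on: splitting work into chunks and pipelining them across independent streams so that data movement and computation overlap instead of serializing.

```cuda
// Hedged sketch of compute/transfer overlap with CUDA streams.
// Not DeepSeek's implementation; just the textbook pipelining idea.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void scale(float* data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main() {
    const int chunks = 4;
    const int chunk_elems = 1 << 20;
    const size_t chunk_bytes = chunk_elems * sizeof(float);

    float* host;
    float* dev;
    cudaMallocHost(&host, chunks * chunk_bytes);  // pinned memory for async copies
    cudaMalloc(&dev, chunks * chunk_bytes);
    for (int i = 0; i < chunks * chunk_elems; ++i) host[i] = 1.0f;

    cudaStream_t streams[chunks];
    for (int s = 0; s < chunks; ++s) cudaStreamCreate(&streams[s]);

    // Each chunk's copy-in, compute, and copy-out are queued on its own
    // stream, so the copy engines and the SMs work on different chunks
    // concurrently rather than waiting on one another.
    for (int s = 0; s < chunks; ++s) {
        float* h = host + s * chunk_elems;
        float* d = dev + s * chunk_elems;
        cudaMemcpyAsync(d, h, chunk_bytes, cudaMemcpyHostToDevice, streams[s]);
        scale<<<(chunk_elems + 255) / 256, 256, 0, streams[s]>>>(d, chunk_elems, 2.0f);
        cudaMemcpyAsync(h, d, chunk_bytes, cudaMemcpyDeviceToHost, streams[s]);
    }
    cudaDeviceSynchronize();
    printf("host[0] = %f\n", host[0]);  // expected 2.0

    for (int s = 0; s < chunks; ++s) cudaStreamDestroy(streams[s]);
    cudaFreeHost(host);
    cudaFree(dev);
    return 0;
}
```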
While these modifications pushed the boundaries of AI training efficiency, they are also notoriously difficult to maintain, underscoring the high level of expertise within DeepSeek’s engineering team. The company’s innovations come amid a global GPU shortage and increasing U.S. trade restrictions, forcing AI firms to find alternative ways to maximize computing power. However, the financial cost of these optimizations remains unclear.
DeepSeek’s efficiency breakthrough has sent ripples through the market. Some investors speculate that future AI models may require less high-performance hardware, potentially impacting Nvidia’s sales. However, industry leaders like former Intel CEO Pat Gelsinger argue that AI will continue to demand as much computing power as possible. He sees DeepSeek’s advancements as a way to bring AI to more affordable, mass-market devices.