China has found a way to sidestep the limitations of NVIDIA's scaled-back AI accelerators. DeepSeek's new project, FlashMLA, reportedly delivers an eightfold increase in effective TFLOPS on NVIDIA's Hopper H800 AI accelerators.
DeepSeek's creation, FlashMLA, is set to shake up China's AI landscape by pushing NVIDIA's pared-down Hopper GPUs to their limits. Instead of waiting on hardware upgrades they cannot buy, Chinese companies like DeepSeek are leveraging software innovation to maximize what's already at their disposal. In this case, DeepSeek has extracted remarkable performance from NVIDIA's export-restricted Hopper H800 GPUs by optimizing how memory and compute resources are distributed across inference requests.
DeepSeek's unveiling is part of its "Open Source Week"—a series of daily releases of new tech and tools published publicly on platforms like GitHub. Day one kicked off with the launch of FlashMLA, a specialized decoding kernel for NVIDIA's Hopper GPUs. And the results? Absolutely groundbreaking.
The firm claims that FlashMLA can reach 580 TFLOPS of BF16 matrix multiplication on the Hopper H800—roughly eight times what it cites as the typical industry baseline for this workload. Memory throughput is also impressive, reaching up to 3000 GB/s in memory-bound configurations, approaching the H800's theoretical peak bandwidth—all accomplished through clever coding rather than any physical hardware alteration.
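To put those two figures in context, a quick roofline-style calculation shows the arithmetic intensity at which a kernel with these throughput numbers would cross from memory-bound to compute-bound. The 580 TFLOPS and 3000 GB/s values are the ones reported above; everything else is illustrative arithmetic, not part of FlashMLA itself:

```python
# Roofline-style back-of-envelope check using the throughput figures
# reported for FlashMLA on the H800 (illustrative arithmetic only).

PEAK_COMPUTE_TFLOPS = 580   # reported BF16 matmul throughput
PEAK_BANDWIDTH_GBS = 3000   # reported memory bandwidth

# Arithmetic intensity (FLOPs per byte moved) at the roofline "ridge":
# kernels below this ratio are memory-bound; above it, compute-bound.
ridge_flops_per_byte = (PEAK_COMPUTE_TFLOPS * 1e12) / (PEAK_BANDWIDTH_GBS * 1e9)
print(f"ridge point: {ridge_flops_per_byte:.1f} FLOPs/byte")
# → ridge point: 193.3 FLOPs/byte
```

Decode-time attention moves a lot of KV-cache bytes per FLOP, which is exactly why the memory-side optimizations described next matter so much.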
What makes FlashMLA particularly innovative is its use of low-rank key-value compression, which stores attention data in a compact latent form, cutting memory usage—reportedly by 40% to 60%—and speeding up processing. On top of that, a block-based paging system allocates memory dynamically based on demand, letting models handle variable-length sequences far more efficiently.
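The two ideas can be sketched in a few lines of NumPy. This is a simplified illustration, not DeepSeek's implementation: the dimensions, block size, and weight matrices below are placeholder assumptions chosen to make the compression ratio visible, and the real kernel runs on GPU tensors, not NumPy arrays:

```python
import numpy as np

rng = np.random.default_rng(0)

# --- Low-rank key-value compression (illustrative shapes) ---
d_model, d_latent = 1024, 256                              # latent is much smaller
W_down = rng.standard_normal((d_model, d_latent)) * 0.02   # compress into latent
W_up_k = rng.standard_normal((d_latent, d_model)) * 0.02   # reconstruct keys
W_up_v = rng.standard_normal((d_latent, d_model)) * 0.02   # reconstruct values

hidden = rng.standard_normal((7, d_model))  # hidden states for 7 tokens
latent_kv = hidden @ W_down                 # only this compact form is cached
k = latent_kv @ W_up_k                      # keys recomputed at attention time
v = latent_kv @ W_up_v                      # values recomputed at attention time

cache_full = hidden.shape[0] * 2 * d_model  # entries a naive K+V cache would hold
cache_latent = latent_kv.size               # entries the latent cache holds
print(f"cache reduction: {1 - cache_latent / cache_full:.1%}")
# → cache reduction: 87.5%

# --- Block-based paging (logical token blocks -> physical memory blocks) ---
BLOCK = 4                                        # tokens per block (illustrative)
n_blocks = -(-hidden.shape[0] // BLOCK)          # ceil(7 / 4) = 2 blocks needed
free_blocks = list(range(16))                    # pool of physical block ids
block_table = [free_blocks.pop(0) for _ in range(n_blocks)]
print(f"{hidden.shape[0]} tokens -> physical blocks {block_table}")
# → 7 tokens -> physical blocks [0, 1]
```

The reduction shown here depends entirely on the chosen latent size; the 40–60% figure reported for FlashMLA reflects its actual dimensions. The block table lets sequences of any length grab only the memory they need, instead of pre-reserving a maximum-length buffer per request.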
DeepSeek's FlashMLA is a reminder that hardware alone doesn't dictate AI performance—software can move the needle just as far. While FlashMLA is currently optimized for Hopper GPUs, it raises exciting questions about what it might achieve when paired with the H100. Stay tuned—it's only the beginning.