Nvidia Blackwell Ultra Dominates MLPerf Inference

The machine learning field moves quickly, and the benchmarks used to measure it have to run to keep up. A case in point: MLPerf, the twice-yearly machine learning competition sometimes called “the Olympics of AI,” introduced three new benchmark tests, reflecting new directions in the field.
“Lately, it has been very hard to keep track of what’s happening in the field,” says Miro Hodak, an AMD engineer and cochair of the MLPerf Inference working group. “We see that the models are gradually getting larger, and in the last two rounds we have introduced the largest models we’ve ever had.”
The chips that took on these new benchmarks came from the usual suspects: Nvidia, AMD, and Intel. Nvidia topped the charts, debuting its new Blackwell Ultra GPU, packaged in a GB300 rack-scale design. AMD put up a strong showing with its latest MI325X GPUs. Intel proved that you can still do inference on CPUs with its Xeon submissions, but it also entered the GPU game with an Intel Arc Pro submission.
New benchmarks
Last round, MLPerf introduced its largest benchmark yet, a large language model based on Llama 3.1-405B. This round, it outdid itself again, introducing a benchmark based on the DeepSeek R1 671B model, which has more than 1.5 times as many parameters as the previous largest benchmark.
As a reasoning model, DeepSeek R1 works through several steps of chain-of-thought when approaching a query. That means much more of the computation happens during inference than in normal LLM operation, which makes this benchmark even more challenging. Reasoning models are claimed to be the most accurate, making them the technique of choice for science, math, and complex programming queries.
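To see why that shifts cost to inference, consider a rough back-of-the-envelope comparison. The token counts below are illustrative assumptions, not measured values; the point is that decode work grows with every generated token, and a chain-of-thought model emits many “thinking” tokens before its answer.

```python
# Rough sketch: decode cost scales roughly linearly with generated tokens.
# Both token budgets below are hypothetical, chosen only to illustrate the ratio.
answer_tokens = 200        # a typical direct answer from a standard LLM
reasoning_tokens = 4000    # assumed hidden chain-of-thought token budget

ratio = (reasoning_tokens + answer_tokens) / answer_tokens
print(f"~{ratio:.0f}x more decode work per query for the reasoning model")
```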
In addition to the largest LLM benchmark to date, MLPerf also introduced the smallest, based on Llama 3.1-8B. There is growing industry demand for low-latency but high-accuracy reasoning, said Taran Iyengar, chair of the MLPerf Inference working group. Small LLMs can supply this, and they are an excellent choice for tasks such as text summarization and edge applications.
This brings the total number of LLM-based benchmarks to a confusing four. They include the new, smallest Llama 3.1-8B benchmark; a preexisting Llama 2-70B benchmark; last round’s introduction, the Llama 3.1-405B benchmark; and the largest, the new DeepSeek R1 benchmark. If nothing else, this signals that LLMs aren’t going anywhere.
Beyond the myriad LLMs, this round of MLPerf Inference included a new voice-to-text model, based on Whisper-large-v3. This benchmark is a response to the growing number of voice-enabled applications, whether smart devices or speech-based AI interfaces.
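For a sense of the kind of workload this benchmark measures, here is a minimal transcription sketch using Whisper-large-v3 through the Hugging Face transformers pipeline. The audio filename is a hypothetical placeholder, and this illustrates the task rather than the MLPerf test harness itself.

```python
# A minimal speech-to-text sketch with the model the new benchmark is based on.
# Requires the transformers library (and a backend such as PyTorch) installed.
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v3",  # the benchmark's reference model
    chunk_length_s=30,                # Whisper processes audio in 30-second windows
)

result = asr("meeting_recording.wav")  # hypothetical input audio file
print(result["text"])
```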
The MLPerf Inference competition has two broad categories: “closed,” which requires using the reference neural network model as is, without modifications, and “open,” where some modifications to the model are allowed. Within those, there are several subcategories related to how the tests are run and on what kind of infrastructure. We will focus on the “closed” datacenter server results for the sake of our collective sanity.
Nvidia is in the lead
Surprising no one, the best per-accelerator performance on each benchmark, at least in the “server” category, went to a system based on Nvidia GPUs. Nvidia also debuted the Blackwell Ultra, which topped the charts on the two largest benchmarks: Llama 3.1-405B and DeepSeek R1 reasoning.
Blackwell Ultra is a more powerful iteration of the Blackwell architecture, featuring significantly more memory capacity, double the acceleration for attention layers, 1.5 times more AI compute, and faster memory and connectivity compared with standard Blackwell. It is intended for the largest AI workloads, such as the two benchmarks it was tested on.
Beyond the hardware improvements, Nvidia’s director of accelerated computing, Dave Salvator, attributes Blackwell Ultra’s success to two key changes. First is the use of Nvidia’s 4-bit floating-point number format, NVFP4. “We can provide comparable accuracy to formats such as BF16,” Salvator says, while using much less computing power.
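As a rough illustration of how a block-scaled 4-bit floating-point format works, here is a small NumPy sketch in the spirit of NVFP4, assuming E2M1 element values and 16-element scaling blocks. Real NVFP4 stores each block’s scale as an FP8 (E4M3) value; this sketch keeps the scale in float32 for simplicity, so treat it as a conceptual model rather than the exact format.

```python
import numpy as np

# The nonnegative magnitudes representable by a signed E2M1 (4-bit) float.
_POS = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
E2M1_GRID = np.concatenate([-_POS[::-1], _POS])  # all 16 signed values

BLOCK = 16  # number of elements sharing one scale factor

def quantize_fp4_like(x: np.ndarray):
    """Quantize a flat tensor to block-scaled 4-bit floats (simulated)."""
    blocks = x.reshape(-1, BLOCK)
    # Pick one scale per block so the block's max maps to E2M1's max (6.0).
    scale = np.abs(blocks).max(axis=1, keepdims=True) / 6.0
    scale = np.where(scale == 0, 1.0, scale)  # avoid dividing by zero
    # Snap each scaled element to the nearest representable E2M1 value.
    idx = np.abs(blocks[..., None] / scale[..., None] - E2M1_GRID).argmin(axis=-1)
    return E2M1_GRID[idx], scale  # real hardware would pack 4-bit codes

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return (q * scale).reshape(-1)

x = np.random.randn(64).astype(np.float32)
q, s = quantize_fp4_like(x)
print("mean abs error:", np.abs(dequantize(q, s) - x).mean())
```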
The second is so-called disaggregated serving. The idea behind disaggregated serving is that an inference workload has two main parts: prefill, where the query (“please summarize this report”) and its entire context window (the report) are loaded into the LLM, and generation/decoding, where the output is actually computed. These two stages have different requirements. While prefill is compute-heavy, generation/decoding depends much more on memory bandwidth. Salvator says that by assigning different groups of GPUs to the two different stages, Nvidia achieves a performance gain of nearly 50 percent.
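Here is a toy sketch of that idea, with worker threads and queues standing in for separate GPU pools. The handoff of a placeholder string is purely illustrative; in a real deployment, the actual KV-cache tensors produced by prefill would be transferred to the decode pool over a high-speed interconnect.

```python
import queue
import threading

prefill_q: queue.Queue = queue.Queue()
decode_q: queue.Queue = queue.Queue()

def prefill_worker():
    # "Prefill" pool: compute-heavy pass over the prompt and its context,
    # producing the KV cache (simulated here by a placeholder string).
    while True:
        req = prefill_q.get()
        decode_q.put({"id": req["id"], "kv_cache": f"kv[{len(req['prompt'])} chars]"})
        prefill_q.task_done()

def decode_worker():
    # "Decode" pool: memory-bandwidth-bound, streams the KV cache to
    # generate output tokens one at a time.
    while True:
        job = decode_q.get()
        print(f"request {job['id']}: decoding with {job['kv_cache']}")
        decode_q.task_done()

# Dedicated worker threads stand in for physically separate GPU groups.
threading.Thread(target=prefill_worker, daemon=True).start()
threading.Thread(target=decode_worker, daemon=True).start()

for i, prompt in enumerate(["please summarize this report ...", "draft a reply ..."]):
    prefill_q.put({"id": i, "prompt": prompt})
prefill_q.join()
decode_q.join()
```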
AMD close behind
AMD’s new accelerator chip, the MI355X, launched in July. The company offered results only in the “open” category, where software modifications to the model are permitted. Like Blackwell Ultra, the MI355X features 4-bit floating-point support, as well as expanded high-bandwidth memory. The MI355X beat its predecessor, the MI325X, on the open Llama 2-70B benchmark by a factor of 2.7, says Mahesh Balasubramanian, senior director of data center GPU product marketing at AMD.
AMD’s “closed” submissions included systems powered by MI300X and MI325X GPUs. The more advanced MI325X machines performed on par with systems built around Nvidia H200s on the Llama 2-70B, mixture-of-experts, and image-generation benchmarks.
This round also included the first hybrid submission, in which both AMD MI300X and MI325X GPUs were used for the same inference task, the Llama 2-70B benchmark. The use of hybrid GPUs matters because new GPUs arrive at a yearly cadence, while older models, already deployed en masse, aren’t going anywhere. Being able to spread workloads between different types of GPUs is an essential step.
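One simple way to split a single workload across mixed GPU generations is to shard incoming requests in proportion to each pool’s throughput. The sketch below uses made-up GPU counts and relative speeds purely to illustrate that idea; it is not how the actual submission was implemented.

```python
# Hypothetical fleet: counts and relative per-GPU throughputs are assumptions.
fleet = {"MI300X": 8, "MI325X": 8}
rel_throughput = {"MI300X": 1.0, "MI325X": 1.4}

requests = 10_000
total = sum(fleet[g] * rel_throughput[g] for g in fleet)

# Give each GPU pool a share of requests proportional to its aggregate speed.
shares = {g: round(requests * fleet[g] * rel_throughput[g] / total) for g in fleet}
print(shares)  # {'MI300X': 4167, 'MI325X': 5833}
```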
Intel enters the GPU game
In the past, Intel has remained adamant that you don’t need a GPU to do machine learning. Indeed, submissions using Intel Xeon CPUs have kept pace with the Nvidia L4 on the object-detection benchmark, but trailed it on the recommender-system benchmark.
This round, for the first time, an Intel GPU also made a showing. The Intel Arc Pro was first released in 2022. The MLPerf submission featured a graphics card called the MaxSun Intel Arc Pro B60 Dual 48G Turbo, which packs two GPUs and 48 gigabytes of memory. The system performed on par with the Nvidia L40S on the small LLM benchmark and trailed it on the Llama 2-70B benchmark.