Nvidia’s must-have H100 AI chip made it a multitrillion-dollar company, one that may be worth more than Alphabet and Amazon, and competitors have been fighting to catch up. But perhaps Nvidia is about to extend its lead, with the new Blackwell B200 GPU and GB200 “superchip.”

Nvidia CEO Jensen Huang holds up his new GPU on the left, next to an H100 on the right, from the GTC livestream. Image: Nvidia

Nvidia says the new B200 GPU offers up to 20 petaflops of FP4 horsepower from its 208 billion transistors, and that a GB200 that combines two of those GPUs with a single Grace CPU can offer 30 times the performance for LLM inference workloads while also potentially being considerably more efficient. It “reduces cost and energy consumption by up to 25x” over an H100, says Nvidia.

Training a 1.8 trillion parameter model would previously have taken 8,000 Hopper GPUs and 15 megawatts of power, Nvidia claims. Today, Nvidia’s CEO says 2,000 Blackwell GPUs can do it while consuming just four megawatts.

On a GPT-3 LLM benchmark with 175 billion parameters, Nvidia says the GB200 has a somewhat more modest seven times the performance of an H100, and Nvidia says it offers 4x the training speed.

Here’s what one GB200 looks like. Two GPUs, one CPU, one board. Image: Nvidia

Nvidia told journalists one of the key improvements is a second-gen transformer engine that doubles the compute, bandwidth, and model size by using four bits for each neuron instead of eight (thus, the 20 petaflops of FP4 mentioned earlier). A second key difference only comes when you link up huge numbers of these GPUs: a next-gen NVLink switch that lets 576 GPUs talk to one another, with 1.8 terabytes per second of bidirectional bandwidth.
That required Nvidia to build an entirely new network switch chip, one with 50 billion transistors and some of its own onboard compute: 3.6 teraflops of FP8, says Nvidia.

Nvidia says it’s adding both FP4 and FP6 with Blackwell. Image: Nvidia

Previously, Nvidia says, a cluster of just 16 GPUs would spend 60 percent of its time communicating and only 40 percent actually computing.

Nvidia is counting on companies to buy large quantities of these GPUs, of course, and is packaging them into larger designs, like the GB200 NVL72, which plugs 36 CPUs and 72 GPUs into a single liquid-cooled rack for a total of 720 petaflops of AI training performance or 1,440 petaflops (aka 1.4 exaflops) of inference. It has nearly two miles of cabling inside, with 5,000 individual cables.

The GB200 NVL72. Image: Nvidia

Each tray in the rack contains either two GB200 chips or two NVLink switches, with 18 of the former and nine of the latter per rack. In total, Nvidia says one of these racks can support a 27-trillion parameter model. GPT-4 is rumored to be around a 1.7-trillion parameter model.

The company says Amazon, Google, Microsoft, and Oracle are all already planning to offer the NVL72 racks in their cloud service offerings, though it’s not clear how many they’re buying. And of course, Nvidia is happy to offer companies the rest of the solution, too. Here’s the DGX Superpod for DGX GB200, which combines eight systems in one for a total of 288 CPUs, 576 GPUs, 240TB of memory, and 11.5 exaflops of FP4 computing.

Nvidia says its systems can scale to tens of thousands of the GB200 superchips, connected with 800Gbps networking via its new Quantum-X800 InfiniBand (for up to 144 connections) or Spectrum-X800 Ethernet (for up to 64 connections).
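Those headline figures hang together arithmetically. As a back-of-envelope sanity check (the inputs are Nvidia's published numbers; the arithmetic and variable names below are ours):

```python
# Sanity-checking Nvidia's quoted Blackwell figures with simple arithmetic.

# 1) FP4 vs. FP8: halving the bits per weight halves the memory the weights
#    need, which is where the "doubled model size" claim comes from.
params = 1.8e12                 # Nvidia's 1.8-trillion-parameter example model
fp8_tb = params * 8 / 8 / 1e12  # one byte per weight
fp4_tb = params * 4 / 8 / 1e12  # half a byte per weight
print(f"weights at FP8: {fp8_tb:.1f} TB, at FP4: {fp4_tb:.1f} TB")  # 1.8 vs 0.9 TB

# 2) Scaling the per-GPU FP4 figure up to the rack and the Superpod.
PF_PER_GPU = 20                 # B200: up to 20 petaflops of FP4
nvl72_pf = 72 * PF_PER_GPU      # NVL72 rack: 72 GPUs
superpod_gpus = 8 * 72          # DGX Superpod: eight NVL72-class systems
superpod_ef = superpod_gpus * PF_PER_GPU / 1000
print(f"NVL72: {nvl72_pf} petaflops FP4")            # 1440 PF = 1.4 exaflops
print(f"Superpod: {superpod_gpus} GPUs, {superpod_ef:.1f} exaflops FP4")  # 576 GPUs, 11.5
```

Note that the quoted 720 petaflops of training performance corresponds to a higher precision (half the FP4 rate), which is why the inference number is exactly double it.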
We don’t expect to hear anything about new gaming GPUs today, as this news is coming out of Nvidia’s GPU Technology Conference, which is usually almost entirely focused on GPU computing and AI, not gaming. But the Blackwell GPU architecture will likely also power a future RTX 50-series lineup of desktop graphics cards.