NVIDIA Vera Rubin GPU on Bare Metal: R200 Specs, Power & Cost vs Blackwell B200 (2026–2027)

At CES 2026, Jensen Huang confirmed that NVIDIA's Vera Rubin architecture had entered full production. For AI teams, research labs, and infrastructure buyers, that announcement restarted a familiar debate: commit to Blackwell now, or hold resources for the next generation?

This article compiles what has been publicly disclosed through GTC 2026, along with industry estimates and realistic infrastructure projections for bare-metal GPU dedicated servers, so you can make decisions based on practical data rather than roadmap optimism.

What the Vera Rubin R200 Actually Is (and Why It Matters)

Rubin isn't a modest step-up from Blackwell. On paper, it's the most significant single-generation leap NVIDIA has shipped in the datacenter accelerator market.

The R200 is built on TSMC's 3 nm N3P process node, a full node smaller than Blackwell's 4 nm custom process. It uses a multi-chip module design with two near-reticle-sized compute dies and two I/O dies, all mounted on a large-format CoWoS-L interposer alongside eight HBM4 memory stacks. Total transistor count: 336 billion, versus Blackwell's 208 billion.

The headline numbers (a mix of NVIDIA disclosures and industry estimates):

  • 50 PFLOPS FP4 sparse inference - 5× the Blackwell B200's 10 PFLOPS

  • 288 GB HBM4 memory per GPU - 50% more than the B200's 192 GB HBM3e

  • 22 TB/s memory bandwidth - 2.75× higher than the B200's 8 TB/s

  • 224 Streaming Multiprocessors - up from approximately 160 on Blackwell

  • ~1.8 kW TDP per GPU - a widely reported estimate, versus approximately 1.4 kW for the B200

  • NVLink 6 at a projected ~3.6 TB/s bidirectional GPU-to-GPU throughput - roughly 2× NVLink 5

The sixth-generation Tensor Cores support the full precision stack (FP4, FP6, FP8, FP16, BF16, TF32, FP32, and FP64), with a third-generation Transformer Engine that dynamically adjusts precision across transformer layers using hardware-accelerated micro-block scaling. That adaptive precision control is what makes the 5× FP4 inference gain achievable in practice, not just on paper.

One important caveat on the 5× number: NVIDIA benchmarked it on a large mixture-of-experts (MoE) model at long context length, specifically the Kimi-K2-Thinking MoE at 32K input / 8K output. For dense models running at FP8 or FP16, the realistic training improvement is closer to 1.6×. If your pipeline doesn't yet use FP4 precision paths, plan accordingly.
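A quick way to set expectations is an Amdahl-style blend of the two regimes. The sketch below is a planning heuristic, not an NVIDIA benchmark; the speedup constants and the FP4-eligible fraction are assumptions you should replace with your own profiling data.

```python
# Planning heuristic: blended generational speedup when only part of a pipeline
# can use FP4 precision paths. All constants are illustrative assumptions.

def blended_speedup(fp4_fraction: float,
                    fp4_speedup: float = 5.0,    # headline MoE/FP4 scenario
                    dense_speedup: float = 1.6,  # dense FP8/FP16 scenario
                    ) -> float:
    """Time-weighted harmonic mean of the two per-regime speedups."""
    return 1.0 / (fp4_fraction / fp4_speedup
                  + (1.0 - fp4_fraction) / dense_speedup)

# Example: 40% of runtime in FP4-eligible inference, 60% in dense paths.
print(f"{blended_speedup(0.4):.2f}x")  # ~2.20x, well below the 5x headline
```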

The Vera Rubin Superchip

The VR200 Superchip pairs two R200 GPUs with NVIDIA's next-generation Arm-based Vera CPU (final core counts not yet publicly disclosed), connected via NVLink-C2C. This tighter CPU-GPU coupling reduces data movement overhead for memory-bound workloads, which matters significantly for large MoE inference and long-context reasoning jobs where repeated weight loading dominates latency.

Per Superchip: 100 PFLOPS FP4, with 2× the CPU performance of the Grace chip used in Blackwell NVL systems.

One nomenclature note worth flagging for capacity planning: Early partner documentation suggests NVIDIA may be shifting its NVL numbering convention starting with Rubin. The VR200 NVL144 contains the same 72 GPU packages as the GB200 NVL72; the number now reflects die count rather than package count. Don't let the higher number mislead your rack-level planning.
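If you script capacity planning, it's worth encoding that convention explicitly. A minimal sketch, assuming two compute dies per R200 package as described above:

```python
# Rubin NVL numbers count compute dies; each R200 package carries two.
DIES_PER_RUBIN_PACKAGE = 2

def rubin_nvl_to_packages(nvl_number: int) -> int:
    """VR200 NVL144 -> 72 GPU packages, matching a GB200 NVL72 rack."""
    return nvl_number // DIES_PER_RUBIN_PACKAGE

print(rubin_nvl_to_packages(144))  # 72
```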

Full Spec Comparison: Rubin R200 vs Blackwell B200

| Metric | Blackwell B200 | Rubin R200 | Delta |
|---|---|---|---|
| Architecture | Blackwell | Vera Rubin | – |
| Process Node | TSMC 4 nm (custom) | TSMC 3 nm N3P | 1 node class smaller |
| Transistors | 208 billion | 336 billion | +62% |
| Streaming Multiprocessors | ~160 SMs | 224 SMs | +40% |
| FP4 Inference (sparse) | 10 PFLOPS | 50 PFLOPS | 5× faster |
| FP8 Training | ~10 PFLOPS | ~16 PFLOPS | ~1.6× |
| VRAM per GPU | 192 GB HBM3e | 288 GB HBM4 | +50% |
| Memory Bandwidth | 8 TB/s | 22 TB/s | 2.75× higher |
| GPU-to-GPU Interconnect | NVLink 5 | NVLink 6 · 3.6 TB/s bidir. | ~2× throughput |
| NVSwitch Aggregate Fabric | NVSwitch 5 | NVSwitch 6 · 28.8 TB/s | ~2× aggregate |
| TDP per GPU | ~1.4 kW | ~1.8 kW (est.) | +~400 W |
| Precision Support | FP4–FP64 | FP4–FP64 | – |

The number that matters most for AI inference: the jump to 22 TB/s memory bandwidth. Modern LLMs and MoE architectures are memory-bandwidth-bound at inference time, not compute-bound. That 2.75× bandwidth increase directly determines tokens-per-second per GPU, which is what actually sets your cost per token in production.
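You can see why with a back-of-envelope roofline: during decoding, every generated token must stream the model's active weights through HBM at least once, so bandwidth caps tokens per second. The sketch below uses a hypothetical 70B-active-parameter model at FP8 and ignores KV-cache traffic and batching; the model numbers are assumptions, not benchmarks.

```python
# Roofline ceiling for memory-bound decoding: tokens/s <= bandwidth / bytes-per-token.

def max_tokens_per_sec(bandwidth_tb_s: float,
                       active_params_billions: float,
                       bytes_per_param: float) -> float:
    """Upper bound on single-stream decode throughput (ignores KV-cache reads)."""
    bytes_per_token = active_params_billions * 1e9 * bytes_per_param
    return bandwidth_tb_s * 1e12 / bytes_per_token

# Hypothetical 70B-active-parameter model at FP8 (1 byte per parameter):
for name, bw in [("B200, 8 TB/s", 8.0), ("R200, 22 TB/s", 22.0)]:
    print(f"{name}: ~{max_tokens_per_sec(bw, 70, 1.0):.0f} tokens/s ceiling")
```

The ratio between the two ceilings is exactly the 2.75× bandwidth delta, which is why bandwidth, not FLOPS, tends to set production cost per token.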

Power, Cooling, and Data Center Requirements

Industry estimates place R200 TDP at approximately 1.8 kW per GPU. For an 8-GPU bare-metal node, that translates to roughly 14–16 kW total wall draw once you add CPU, RAM, NVMe storage, and networking. For context, a comparable Blackwell 8-GPU node runs approximately 11–12 kW.

At rack scale, in a full Rubin NVL144 configuration, power density becomes the binding constraint, not compute.

What bare-metal operators need to prepare for:

  • Liquid cooling is effectively mandatory for any dense Rubin deployment. Air-cooled operation is technically possible in facilities with exceptional airflow capacity, but unusual in practice at 1.8 kW per GPU. NVIDIA confirmed the Rubin NVL144 will use the same Oberon rack chassis as the GB300 NVL72, with cooling modifications to handle the higher per-GPU TDP, which reduces integration risk for operators already running Blackwell infrastructure.

  • Power delivery is shifting. 48 V to 54 V rack power distribution is becoming the standard for Rubin-class systems, replacing 12 V legacy infrastructure. Data centers still running 12 V bus architecture at scale will face upgrade costs before they can support dense Rubin deployments.

  • Rack power density requirements: plan for 30–50 kW per rack for 4–8 GPU configurations (see the planning sketch after this list). Legacy facilities rated at 10–15 kW per rack cannot accommodate Rubin at meaningful density without infrastructure investment.

  • Liquid cooling cost at rack scale: Early infrastructure estimates suggest cooling costs for a full Vera Rubin NVL144 rack on the order of ~$55,000, roughly 15–20% higher than a comparable GB300 NVL72 setup.

  • Network fabric: High-bandwidth, low-latency interconnect is necessary for multi-GPU and multi-node training jobs that fully exploit NVLink 6 throughput. Teams building serious AI clusters typically run 10 Gbps dedicated servers at a minimum for node-to-node communication, with larger training clusters requiring 100 Gbps dedicated infrastructure to avoid the network becoming the performance bottleneck. Under-provisioning the network fabric is one of the most common ways organizations fail to realize the performance gains they paid for at the GPU level.
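To make the density point concrete, here is a minimal planning sketch using the estimates above. The per-node overhead figure is an assumption; substitute your own facility specs and measured draw.

```python
# Rough rack-density planner for 8-GPU Rubin nodes. Constants are estimates
# from this article plus an assumed per-node overhead; replace with real specs.

GPU_TDP_KW = 1.8        # widely reported R200 estimate
NODE_OVERHEAD_KW = 2.0  # assumed CPU/RAM/NVMe/NIC draw per 8-GPU node

def nodes_per_rack(rack_budget_kw: float, gpus_per_node: int = 8) -> int:
    """How many full nodes fit under a rack's power budget."""
    node_kw = gpus_per_node * GPU_TDP_KW + NODE_OVERHEAD_KW
    return int(rack_budget_kw // node_kw)

for budget_kw in (15, 30, 50):  # legacy rack vs upgraded high-density racks
    print(f"{budget_kw} kW rack -> {nodes_per_rack(budget_kw)} x 8-GPU node(s)")
```

At a legacy 15 kW budget the answer is zero full nodes, which is the infrastructure-investment point in the list above.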

Bare-Metal vs Cloud Pricing: Rubin R200 and Blackwell B200

NVIDIA has not released official per-unit pricing for Rubin. Based on NVL72 rack cost estimates in the $3.5–4.0 million range and historical pricing patterns across generations, cloud on-demand rates for Rubin are commonly projected in the ~$6–10+/GPU-hour range at launch.

Bare-metal pricing operates on a different model: dedicated capacity rather than on-demand overhead. That is where the unit economics shift significantly for teams with consistent utilization.

| Deployment Option | GPUs | Est. Monthly Cost | Effective Hourly (per GPU) | 90-Day Total | vs Cloud Hyperscaler |
|---|---|---|---|---|---|
| Cloud hyperscaler (AWS / GCP / Azure, B200 equiv.) | 4× | $16,000–$28,000 | $5.20–$9.10 | $48,000–$84,000 | Baseline |
| Cloud GPU specialist (CoreWeave / Lambda, B200) | 4× | $10,000–$16,000 | $3.30–$5.30 | $30,000–$48,000 | – |
| KW Servers Bare Metal B200 | 4× | $4,800–$7,200 | $1.67–$2.50 | $14,400–$21,600 | 50–70% savings |
| KW Servers Bare Metal Rubin R200 (internal projection, Q4 2026 target) | 4× | $4,200–$6,500 | $1.45–$2.25 | $12,600–$19,500 | 55–80% savings |

When does bare metal actually beat cloud?

Under 10 days of use per month, cloud spot or reserved pricing can still compete. At 15–25 days per month, bare-metal Rubin pulls clearly ahead. For continuous 24/7 production workloads, bare metal typically yields 60–85% savings versus cloud on-demand. Sustained AI inference and ongoing fine-tuning are also the kinds of workloads where unmetered dedicated servers take unpredictable bandwidth costs out of the equation, which is where the TCO improvement over cloud alternatives is most dramatic.
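A simple break-even check makes the utilization threshold explicit. The rates below are midpoints of the estimated ranges in the table above; treat them as assumptions, not quotes.

```python
# Break-even utilization: days of full-time cloud use per month at which a
# bare-metal 4x R200 node becomes cheaper. Rates are assumed range midpoints.

BARE_METAL_MONTHLY_USD = 5_350.0  # 4x R200 projection midpoint
CLOUD_USD_PER_GPU_HOUR = 4.30     # specialist-cloud midpoint
GPUS = 4

def breakeven_days(hours_per_day: float = 24.0) -> float:
    cloud_daily = GPUS * CLOUD_USD_PER_GPU_HOUR * hours_per_day
    return BARE_METAL_MONTHLY_USD / cloud_daily

print(f"~{breakeven_days():.0f} days/month of 24/7 use")  # ~13 days
```

Under these assumptions the crossover lands around 13 days of round-the-clock use per month, consistent with the guidance above: below ~10 days cloud competes, and by 15–25 days bare metal is clearly ahead.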

The metric worth optimizing for: Hourly GPU rates are increasingly the wrong number to anchor on. Cost per token or cost per useful FLOP of inference is what determines your AI infrastructure economics in practice. Rubin's 5× FP4 inference uplift means that even at a per-hour premium over Blackwell, Rubin can deliver better cost-per-token for the right workloads, particularly long-context reasoning and large MoE inference at scale.
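Converting an hourly rate into cost per token is a one-line calculation. The hourly rates below are the bare-metal figures from the table above; the throughput figures are placeholders, so measure your own tokens/s under production load.

```python
# Cost per million tokens = hourly rate / tokens generated per hour, scaled.

def cost_per_million_tokens(gpu_hour_usd: float, tokens_per_sec: float) -> float:
    return gpu_hour_usd / (tokens_per_sec * 3600.0) * 1e6

# Hypothetical throughputs: B200 at $1.67/hr doing 3,000 tok/s vs
# R200 at $2.25/hr doing 9,000 tok/s on an FP4-friendly MoE workload.
print(f"B200: ${cost_per_million_tokens(1.67, 3_000):.3f} per M tokens")  # ~$0.155
print(f"R200: ${cost_per_million_tokens(2.25, 9_000):.3f} per M tokens")  # ~$0.069
```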

Availability Timeline: What's Actually Confirmed

NVIDIA indicated Rubin had entered production ramp at CES 2026. Quanta, a primary manufacturing partner, indicated initial customer units could reach buyers as early as August 2026. The realistic rollout sequence:

  • H2 2026: Initial Rubin samples and early production units reach priority partners. Hyperscaler cloud providers begin internal validation and cluster buildouts.

  • Q4 2026: First cloud instances go live. AWS, Google Cloud, Microsoft Azure, Oracle Cloud, CoreWeave, Lambda, Nebius, and Nscale are all confirmed launch partners.

  • Q1 2027: Broader bare-metal and non-hyperscaler availability as manufacturing volumes scale. Teams without priority hyperscaler allocations gain meaningful access. At KW Servers, we are actively upgrading select North American, European, and Asia-Pacific facilities with 48 V power distribution and hybrid liquid cooling to support Rubin deployments from Q4 2026 onward.

  • H2 2027: Rubin Ultra arrives, with 4 compute dies per package, approximately 100 PFLOPS FP4, 384 GB HBM4e, and 32 TB/s bandwidth, deployed in NVL576 "Kyber" racks drawing ~600 kW.

The system integrator ecosystem is forming: Dell, HPE, Lenovo, Cisco, and Supermicro are all developing Rubin platform builds. Early demand at scale is already evident, with large deployments from companies like OpenAI, Anthropic, and Meta widely expected, along with broader adoption across leading AI labs and cloud providers.

Deploy Blackwell Now, or Wait for Rubin?

This is the central infrastructure decision for AI teams in 2026. The answer is genuinely workload-dependent.

Deploy Blackwell B200 now if:

  • Your training or inference workload goes to production in 2026

  • Your CUDA-optimized software stack is ready, and you cannot afford the integration time for a new GPU generation

  • Your models fit within 192 GB HBM3e VRAM per GPU

  • You need proven ecosystem stability (cuDNN, cuBLAS, NCCL, TensorRT depth) without software stack risk

  • Your supply-chain tolerance is low, and you cannot absorb a 6–12 month ramp wait

Wait or plan for Rubin if:

  • Your production deployment starts in late 2026 or 2027

  • You run large MoE models or long-context inference workloads where memory bandwidth is your binding constraint

  • Cost per token is your primary infrastructure KPI, and you have a runway to wait

  • You need more than 192 GB VRAM per GPU for model residency without multi-node sharding

  • Your models will benefit from FP4 precision paths once your software pipeline supports them

A note on software stack readiness: The 5× FP4 gain requires your pipeline to actually use NVFP4 precision with the third-generation Transformer Engine. Teams running FP16 or BF16 workflows today will see genuine gains from Rubin, but 1.6–2.5×, not 5×. The gap closes as frameworks (PyTorch, JAX, vLLM, TensorRT-LLM) add FP4 support, but that takes time after hardware ships.

The Competitive Context: AMD MI400 and What It Means for Pricing

Rubin doesn't operate in a pricing vacuum. AMD's Instinct MI400 series and its deepening cloud partnerships with Meta and OpenAI (both have AMD supply agreements at scale) are applying real pricing pressure on NVIDIA at the OEM and hyperscaler level. This competitive dynamic is already compressing bare-metal GPU server pricing faster in the Rubin cycle than it did at the equivalent stage of Blackwell's rollout.

For teams currently evaluating both architectures, AMD dedicated servers are increasingly competitive for specific workloads, particularly those already optimized for ROCm or running frameworks with strong AMD support. That said, CUDA's ecosystem depth (cuDNN, cuBLAS, NCCL, TensorRT) still holds a meaningful operational advantage for most production AI pipelines, especially for teams without dedicated ML infrastructure engineering resources.

For workloads that run best on a specific microarchitecture, Intel dedicated servers powered by Xeon 6 (Granite Rapids) remain the standard choice for CPU-bound preprocessing, embedding generation, and inference tasks where GPU acceleration provides diminishing returns.

Rubin Ultra (H2 2027): What's Already Confirmed

Early roadmap disclosures suggest approximately 500 billion transistors, 384 GB HBM4e memory at ~32 TB/s bandwidth, deployed in NVL576 "Kyber" rack systems drawing roughly 600 kW per rack. In aggregate AI factory throughput terms, a full Kyber rack delivers roughly 14× the performance of today's GB300 NVL72.

The Feynman architecture, NVIDIA's 2028 target, has been referenced on NVIDIA's long-term roadmap, built on TSMC A16 (1.6 nm) with backside power delivery, eighth-generation NVSwitch, ConnectX-10 at 3.2 Tb/s, and Spectrum-7 Ethernet. NVIDIA has locked in an annual architecture cadence that makes the GPU upgrade cycle as predictable as it is demanding on infrastructure teams.

Frequently Asked Questions

When will Rubin R200 bare-metal servers actually be available?

NVIDIA confirmed full production at CES 2026, and Quanta indicated initial customer deliveries as early as August 2026. Cloud hyperscalers receive first allocation, with broader bare-metal availability expected in Q1 2027. Operators preparing liquid cooling and high-density power infrastructure now will be better positioned to receive early systems.

Is the 5ร— inference gain realistic for my workload?

For MoE models at long context lengths using NVFP4, yes; that's the specific benchmark scenario. For dense models at FP8, the realistic improvement is 1.6–2.5×. If your software pipeline doesn't yet use FP4 precision paths, plan for the lower range until frameworks add support post-hardware launch.

Can existing data centers support Rubin without infrastructure upgrades?

For dense deployments, generally no. The 1.8 kW per-GPU TDP requires facilities with high-density power (30+ kW per rack) and liquid cooling capability. The Rubin NVL144 uses the same Oberon chassis as the GB300, which reduces integration complexity for operators already running Blackwell, but the power density requirements are a meaningful upgrade over legacy DC specs.

How does Rubin compare to AMD's MI400 for AI workloads?

AMD has not released confirmed MI400 specs at the time of publication. Rubin's memory bandwidth and FP4 throughput figures are class-leading on paper. AMD's competitive advantages are primarily pricing and the growing maturity of the ROCm ecosystem for specific frameworks. Teams already invested in CUDA pipelines face a high switching cost that generally favors NVIDIA unless the per-dollar compute gap is significant for their specific workload.

All pricing figures are estimates based on publicly available data and internal projections as of March 2026. Specifications reflect a combination of NVIDIA disclosures, partner information, and industry estimates as of March 2026. Bare-metal pricing reflects KW Servers estimated configurations. Contact us for a custom quote based on your specific workload.

Stay tuned: we'll publish Rubin benchmarks and confirmed bare-metal pricing the moment hardware lands in our data centers.