How to Set Up NVIDIA Blackwell B200 GPUs with NVLink on Ubuntu 24.04: Complete 2026 Guide

A complete setup and optimization process for 1–8× B200 GPUs connected via NVLink 5 on a fresh Ubuntu 24.04 LTS dedicated server.

Introduction

The NVIDIA Blackwell B200 is one of the most capable data center GPUs available in 2026, packing up to ~192 GB of HBM3e memory (SKU-dependent), multi-terabyte memory bandwidth, and advanced FP8/FP4 compute capabilities purpose-built for large-scale AI training and inference.

But raw hardware alone won't get you there: without correct driver installation, NVLink topology verification, and NCCL tuning, you'll leave a significant share of that performance on the table, especially in multi-GPU configurations.

This guide covers the complete setup and optimization process for 1–8× B200 GPUs connected via NVLink 5 on a fresh Ubuntu 24.04 LTS dedicated server. Whether you're training large language models, running distributed inference, or building a GPU compute cluster, these steps apply directly to your workload.

What you'll learn:

  • How to install production NVIDIA drivers for Blackwell architecture

  • How to verify NVLink connectivity and GPU topology

  • How to install and tune NCCL for multi-GPU collective operations

  • How to configure PyTorch with CUDA 12.8 for maximum throughput

  • Persistence mode, power management, and MIG partitioning

Prerequisites

Before starting, confirm the following:

  • OS: Ubuntu 24.04 LTS (fresh install recommended)

  • Access: Root or sudo privileges

  • Hardware: 1–8× NVIDIA B200 GPUs installed in the server

  • NVLink: NVLink 5 bridges or NVSwitch interconnect (depending on your system architecture)

  • Network: Active internet connection on the server

  • Time: ~30–60 minutes end-to-end

  • BIOS: Above 4G Decoding must be enabled for multi-GPU systems

Step 1: Update the System and Install Build Dependencies

Start with a fully patched system. Outdated kernel headers are one of the most common causes of driver installation failures on dedicated servers.

sudo apt update && sudo apt upgrade -y
sudo apt install -y build-essential dkms linux-headers-$(uname -r) curl wget git
sudo reboot

Why this matters: The dkms package ensures your NVIDIA kernel module rebuilds automatically after future kernel updates, which is critical for production servers that receive unattended upgrades.
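If you script node provisioning, you can derive the exact headers package to install from the running kernel. A minimal sketch (the package name format is standard Ubuntu naming):

```python
import platform

# DKMS compiles the NVIDIA module against the running kernel, so the
# headers package must match `uname -r` exactly. This prints the package
# name the apt command above should have installed.
pkg = f"linux-headers-{platform.release()}"
print(pkg)
```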

Step 2: Disable the Nouveau Open-Source Driver

The Nouveau driver conflicts with the proprietary NVIDIA driver and must be blacklisted before installation. This step is especially important for Blackwell GPUs, which are not supported by Nouveau at all.

sudo bash -c "echo blacklist nouveau > /etc/modprobe.d/blacklist-nouveau.conf"
sudo bash -c "echo options nouveau modeset=0 >> /etc/modprobe.d/blacklist-nouveau.conf"
sudo update-initramfs -u
sudo reboot

After rebooting, verify Nouveau is no longer loaded:

lsmod | grep nouveau

An empty result confirms it's disabled.
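The same check can be scripted for fleet provisioning. A minimal sketch that inspects /proc/modules-style content (the sample excerpt below is hypothetical):

```python
def module_loaded(proc_modules_text: str, name: str) -> bool:
    """Return True if `name` appears as a loaded module.

    Each line of /proc/modules starts with the module name.
    """
    return any(
        line.split()[0] == name
        for line in proc_modules_text.splitlines()
        if line.strip()
    )

# Hypothetical /proc/modules excerpt after a successful blacklist + reboot:
sample = (
    "nvidia 56623104 0 - Live 0x0000000000000000\n"
    "nvidia_uvm 1536000 0 - Live 0x0000000000000000"
)
print(module_loaded(sample, "nouveau"))  # False
```

On a real host you would pass `open("/proc/modules").read()` instead of the sample string.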

Step 3: Install the NVIDIA Driver and CUDA Toolkit

As of early 2026, the 570+ production branch (or newer) is the recommended driver for Blackwell B200 GPUs. Use NVIDIA's official CUDA repository to stay current with security and performance patches.

# Add the official NVIDIA CUDA repository
distribution=$(. /etc/os-release; echo $ID$VERSION_ID | sed -e 's/\.//g')
wget https://developer.download.nvidia.com/compute/cuda/repos/$distribution/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb

# Install driver and CUDA toolkit
sudo apt update
sudo apt install -y nvidia-driver-570 cuda-toolkit-12-8
# Or install the latest available production driver if newer than 570

Reboot to load the new driver:

sudo reboot

Verify the installation:

nvidia-smi

You should see all installed B200 GPUs listed, each showing high-memory HBM capacity (up to ~192 GB depending on SKU) and the correct driver version. If any GPU is missing, check physical seating and PCIe slot power connectors.
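For provisioning checks across many nodes, the GPU count is easy to verify by parsing nvidia-smi -L output. A sketch, assuming the standard "GPU N: NVIDIA B200 (UUID: ...)" line format (the sample output is illustrative; product strings can vary by SKU):

```python
import re

def count_b200s(nvidia_smi_l: str) -> int:
    """Count B200 entries in `nvidia-smi -L` output."""
    return len(re.findall(r"^GPU \d+: NVIDIA B200", nvidia_smi_l, flags=re.MULTILINE))

# Hypothetical two-GPU output:
sample = (
    "GPU 0: NVIDIA B200 (UUID: GPU-1111)\n"
    "GPU 1: NVIDIA B200 (UUID: GPU-2222)\n"
)
print(count_b200s(sample))  # 2
```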

Step 4: Verify NVLink 5 Connectivity and GPU Topology

NVLink (or NVSwitch in some systems) delivers dramatically higher inter-GPU bandwidth than PCIe, which is what enables efficient multi-GPU training at scale. Confirming it's active before running workloads saves hours of debugging later.

Check NVLink link status:

nvidia-smi nvlink --status

Each link between GPU pairs should report Active. Inactive links indicate a physical bridge issue or driver misconfiguration.

Inspect the full topology matrix:

nvidia-smi topo -m

In a healthy 8-GPU NVLink configuration, you'll see NV (NVLink) between connected GPU pairs rather than PHB or SYS (PCIe paths). If you see only PCIe connections between GPUs that should be NVLink-connected, the physical bridge is either not seated properly or not recognized. Re-seat and recheck before proceeding.
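This topology check can also be automated. A sketch that flags GPU pairs whose link type is not NVLink, given the GPU-to-GPU block of the topology matrix as a list of rows (the 4-GPU matrix below is hypothetical, with one pair fallen back to a PCIe host bridge):

```python
def non_nvlink_pairs(topo_matrix_rows):
    """Return (i, j) GPU pairs whose link type does not start with 'NV',
    i.e. pairs communicating over PCIe (PHB, SYS, ...) instead of NVLink."""
    bad = []
    for i, row in enumerate(topo_matrix_rows):
        for j, link in enumerate(row):
            if i != j and not link.startswith("NV"):
                bad.append((i, j))
    return bad

# Hypothetical 4-GPU matrix: GPU0 <-> GPU3 fell back to PHB (PCIe host bridge)
matrix = [
    ["X",    "NV18", "NV18", "PHB"],
    ["NV18", "X",    "NV18", "NV18"],
    ["NV18", "NV18", "X",    "NV18"],
    ["PHB",  "NV18", "NV18", "X"],
]
print(non_nvlink_pairs(matrix))  # [(0, 3), (3, 0)]
```

An empty list means every GPU pair is NVLink-connected and you're clear to proceed.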

Step 5: Install NCCL for Multi-GPU Collective Communications

NCCL (NVIDIA Collective Communications Library) is the backbone of distributed deep learning on multi-GPU systems. PyTorch's DistributedDataParallel, TensorFlow's MirroredStrategy, and most LLM training frameworks rely on NCCL for all-reduce, broadcast, and scatter operations across GPUs.

sudo apt install -y libnccl2 libnccl-dev

Benchmark NCCL across all GPUs:
First, build the nccl-tests utility:

git clone https://github.com/NVIDIA/nccl-tests.git
cd nccl-tests
make -j$(nproc) CUDA_HOME=/usr/local/cuda

Then run the all-reduce benchmark:

./build/all_reduce_perf -b 8 -e 1G -f 2 -g 8

In a properly configured 8× B200 NVLink system, you should see collective bandwidth numbers well above what PCIe alone could provide. Look at the busbw (bus bandwidth) column rather than algbw (algorithm bandwidth): busbw normalizes for the communication pattern of the collective, so it is the number to compare against the hardware's link bandwidth.
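The relationship between the two columns is documented in nccl-tests: for all-reduce, bus bandwidth rescales algorithm bandwidth by a factor of 2(n-1)/n, where n is the number of ranks. A small helper to convert between them:

```python
def allreduce_busbw(algbw: float, n: int) -> float:
    """Bus bandwidth for an all-reduce, per the nccl-tests convention:
    busbw = algbw * 2 * (n - 1) / n, reflecting the data each rank must
    send and receive relative to the message size."""
    return algbw * 2 * (n - 1) / n

# With 8 GPUs, busbw is 1.75x the reported algbw:
print(allreduce_busbw(100.0, 8))  # 175.0
```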

Step 6: Install PyTorch with CUDA 12.8 Support

PyTorch is the dominant framework for GPU-accelerated deep learning in 2026, and the CUDA 12.8 build includes full support for Blackwell architecture optimizations including FP8 mixed-precision training.

pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128

Confirm GPU and NCCL detection:

import torch
print(torch.cuda.device_count())      # Should return 8 for a full 8-GPU node
print(torch.cuda.nccl.version())      # Confirms NCCL version detected by PyTorch
print(torch.cuda.get_device_name(0))  # Should show "NVIDIA B200"

Test NCCL process group initialization:

import torch.distributed as dist
# Requires torchrun or proper env variables (MASTER_ADDR, RANK, WORLD_SIZE)
dist.init_process_group(backend="nccl", init_method="env://")
print("NCCL process group initialized over NVLink")

Note: This requires launching with torchrun or setting distributed environment variables.
A successful init without errors confirms your multi-GPU communication stack is fully operational.

Step 7: Performance Tuning and Optimization

This is where most guides stop short. The steps below directly impact throughput, latency, and stability in production AI workloads.

7.1 Enable Persistence Mode

Persistence mode keeps the NVIDIA driver loaded between jobs, eliminating GPU initialization latency at the start of each workload. This is especially important for inference serving.

sudo nvidia-smi -pm 1

Make this permanent by adding it to /etc/rc.local or a systemd unit.
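A minimal systemd unit sketch that applies persistence mode at boot (the unit name and path are illustrative):

```ini
# /etc/systemd/system/nvidia-persistence.service (illustrative name)
[Unit]
Description=Enable NVIDIA GPU persistence mode
After=multi-user.target

[Service]
Type=oneshot
ExecStart=/usr/bin/nvidia-smi -pm 1

[Install]
WantedBy=multi-user.target
```

Note that recent driver packages also ship the nvidia-persistenced daemon, which is the preferred mechanism where available.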

7.2 Configure Power Limits

The B200 can reach up to ~1000W TDP depending on the form factor (SXM vs PCIe). In thermally constrained environments or shared racks, you may need to cap power per GPU while preserving most performance:

sudo nvidia-smi -pl 1000   # Set per-GPU power limit in watts; lower the value to cap draw (must be within the supported range)

Run nvidia-smi -q -d POWER to monitor actual draw and headroom before adjusting limits.
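To track headroom programmatically, you can parse that output. A sketch, assuming the usual "Power Draw" / "Power Limit" field names (these vary slightly across driver versions, and the sample excerpt is hypothetical):

```python
import re

def parse_power(q_output: str):
    """Pull draw and limit (watts) from `nvidia-smi -q -d POWER` text.

    Matches both 'Power Limit' and 'Current Power Limit', since the field
    name differs between driver versions.
    """
    draw = float(re.search(r"Power Draw\s*:\s*([\d.]+) W", q_output).group(1))
    limit = float(re.search(r"(?:Current )?Power Limit\s*:\s*([\d.]+) W", q_output).group(1))
    return draw, limit

# Hypothetical excerpt:
sample = (
    "    Power Draw                  : 142.50 W\n"
    "    Current Power Limit         : 1000.00 W\n"
)
print(parse_power(sample))  # (142.5, 1000.0)
```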

7.3 Tune NCCL Environment Variables

These variables have a measurable impact on multi-GPU collective bandwidth. Add them to your ~/.bashrc or your job launcher environment:

export NCCL_IB_DISABLE=0           
# Enable InfiniBand if your server has IB NICs

export NCCL_SOCKET_IFNAME=eth0  
# Replace with your actual interface (check with `ip a`)     

export NCCL_NET_GDR_LEVEL=1        
# Enable GPUDirect RDMA for peer memory access

export NCCL_P2P_DISABLE=0          
# Keep peer-to-peer transfers enabled

export NCCL_P2P_LEVEL=NV           
# Force NVLink paths for P2P communication

After setting these, re-run the all_reduce_perf benchmark to confirm improvement.
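In Python-based launchers you can set the same variables programmatically, as long as they are exported before NCCL initializes. A sketch mirroring the exports above (the interface name is a placeholder):

```python
import os

# NCCL reads these at process-group init time, so set them before
# torch.distributed.init_process_group() runs.
nccl_env = {
    "NCCL_P2P_DISABLE": "0",       # keep peer-to-peer transfers enabled
    "NCCL_P2P_LEVEL": "NV",        # prefer NVLink paths for P2P
    "NCCL_SOCKET_IFNAME": "eth0",  # placeholder: match your `ip a` output
}
os.environ.update(nccl_env)
print(os.environ["NCCL_P2P_LEVEL"])  # NV
```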

7.4 Enable MIG for Multi-Tenant Workloads

If you're running multiple smaller jobs simultaneously rather than one large distributed job, Multi-Instance GPU (MIG) mode lets you partition each B200 into isolated GPU instances with dedicated memory and compute slices:

sudo nvidia-smi -mig 1
sudo reboot  
                     
# List available MIG profiles for your GPU
nvidia-smi mig -lgip   
    
# Example: create instances using valid profile IDs from above
sudo nvidia-smi mig -cgi <profile_ids> 
sudo nvidia-smi mig -cci   

Note: MIG profile IDs vary by GPU model; always check with -lgip.
MIG is particularly useful for inference serving platforms where multiple models need guaranteed memory isolation.

Step 8: Monitoring Your GPU Cluster

Ongoing observability is essential once your B200 system is under load. These tools cover the most important metrics: utilization, memory, temperature, and NVLink throughput.

Built-in monitoring:

watch -n 1 nvidia-smi   # Live GPU stats, refreshed every second

NVIDIA DCGM (Data Center GPU Manager) recommended for production:

sudo apt install -y datacenter-gpu-manager || echo "Install DCGM from NVIDIA repository if package is unavailable"
sudo systemctl start dcgm
dcgmi discovery -l   # List discovered GPUs and health status

DCGM also exposes Prometheus-compatible metrics for integration with Grafana dashboards.

Netdata (lightweight, real-time, includes GPU plugin):

bash <(curl -Ss https://my-netdata.io/kickstart.sh)

Troubleshooting Common Issues

Symptom | Likely Cause | Fix
------- | ------------ | ---
NVLink shows Inactive | Bridge not seated or not recognized | Reseat the physical NVLink bridge; recheck nvidia-smi nvlink --status
GPU topology shows PCIe instead of NV | Driver or BIOS issue | Update driver to 570.xx or newer; check BIOS Above 4G Decoding
Low multi-GPU scaling efficiency | NCCL not using NVLink paths | Set NCCL_P2P_LEVEL=NV; rerun nccl-tests to verify
High idle power consumption | Normal Blackwell behavior | B200 idles at ~140–180 W per GPU; use nvidia-smi -pm 1 to minimize variance
Driver install fails | Nouveau not blacklisted | Verify Step 2 completed; lsmod should show no nouveau entry
nvidia-smi shows partial GPUs | PCIe power or slot issue | Check all GPU power connectors; test slots individually

Summary

A properly configured Blackwell B200 system on Ubuntu 24.04 gives you one of the most powerful compute platforms available for AI workloads in 2026. The critical path is:

  • Clean driver install using the official CUDA repository and 570.xx branch

  • NVLink topology verification before running any distributed workload

  • NCCL tuning to ensure collective operations actually traverse NVLink, not PCIe

  • PyTorch CUDA 12.8 for full Blackwell feature support including FP8

  • Persistence mode and power management for stable, low-latency production operation

Each B200 GPU brings up to ~192 GB of HBM3e (SKU-dependent), FP8/FP4 acceleration, and NVLink 5 interconnect. The setup process here ensures you're actually using all of it.

Have questions about NVLink topology, NCCL debugging, or scaling your workload across multiple nodes? Drop a comment below or contact the infrastructure team directly.
