How to Deploy Ollama and Open WebUI with Multi-GPU Offloading on Bare Metal Dedicated Servers (2026 Guide)
Running large language models on your own hardware has never been more practical. In 2026, the combination of Ollama and Open WebUI has become the go-to self-hosted AI stack for teams and developers who want full control over model inference without relying on cloud providers like AWS, Azure, or Google Cloud.
This guide walks you through a production-ready deployment of Ollama and Open WebUI on a bare metal dedicated server equipped with multiple NVIDIA GPUs. You'll get efficient multi-GPU layer offloading, NVLink acceleration, and a polished ChatGPT-style interface, all running privately on your own infrastructure.
What You'll Learn
Who This Guide Is For
Why Bare Metal Over Cloud for LLM Inference?
Prerequisites
Step 1 – Update the System and Install Build Dependencies
Step 2 – Install the NVIDIA Driver and CUDA Toolkit
Step 3 – Verify NVLink Topology (Multi-GPU Servers)
Step 4 – Install Ollama
Step 5 – Configure Ollama for Multi-GPU Layer Offloading
Step 6 – Install Open WebUI
Step 7 – Pull Models and Confirm GPU Offloading
Step 8 – Performance Tuning for Production Inference
Troubleshooting Reference
Final Thoughts
Who This Guide Is For
This tutorial is written for:
- ML engineers and AI developers self-hosting open-weight models like Llama 3.1, Mistral, DeepSeek-R1, Qwen 2.5, Gemma 2, or Phi-4
- DevOps teams deploying inference servers on dedicated GPU hardware
- Businesses that need private, on-premise LLM inference with no data leaving their environment
- Anyone migrating away from expensive GPU cloud rentals to owned bare metal infrastructure
If you're running a single consumer GPU, most of this guide still applies, but multi-GPU sections are specifically aimed at server-grade setups with 2–8 GPUs (NVIDIA B200, H100, A100, RTX 5090, or similar).
Why Bare Metal Over Cloud for LLM Inference?
Before diving into the installation, it's worth understanding the architectural choice here.
Cloud GPU rentals (Lambda Labs, CoreWeave, Vast.ai, etc.) are excellent for short-term experimentation. But for sustained inference workloads, the economics shift quickly. A dedicated bare metal server with 4× H100s costs a fixed monthly fee: you're not paying per GPU-hour, and you're never competing for spot availability.
Bare metal also eliminates the hypervisor tax; virtualized GPU environments introduce latency overhead that doesn't exist on physical hardware. For latency-sensitive inference, this difference is measurable.
Finally, data sovereignty matters. On bare metal, your prompts, completions, and model weights never leave your server. For healthcare, legal, finance, and enterprise AI use cases, this isn't optional.
Prerequisites
Before you begin, confirm you have:
- A dedicated server with 1–8× NVIDIA GPUs (Blackwell B200, H100, A100, RTX 5090, or equivalent)
- Ubuntu 24.04 LTS (clean install strongly recommended; this guide targets that OS)
- Root or sudo access
- 64 GB RAM minimum (128 GB or more for 70B+ parameter models)
- NVMe SSD storage on PCIe 4.0 or 5.0 (model weights load significantly faster)
- An active internet connection
- Approximately 45–90 minutes
Step 1 – Update the System and Install Build Dependencies
Start with a clean, fully updated system. This ensures driver installation doesn't conflict with outdated kernel headers.
sudo apt update && sudo apt upgrade -y
sudo apt install -y build-essential dkms linux-headers-$(uname -r) curl wget git python3-pip
sudo reboot
Step 2 – Install the NVIDIA Driver and CUDA Toolkit
Use the 570.xx production driver branch or newer. As of 2026, this branch has stable support for Blackwell (B200) and Hopper (H100) architectures.
# For Ubuntu 24.04, the repository path component is ubuntu2404
distribution=ubuntu2404
wget https://developer.download.nvidia.com/compute/cuda/repos/$distribution/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt update
# Use the latest CUDA 12.x version supported by your installed driver.
sudo apt install -y nvidia-driver-570 cuda-toolkit-12-x
Reboot and verify the installation:
sudo reboot
nvidia-smi
A successful output lists all installed GPUs with their VRAM, driver version, and CUDA version. On a 4× B200 server, you'll see four entries, each showing 192 GB of HBM3e memory.
Common issue: If nvidia-smi returns "command not found" or a driver error after reboot, the DKMS module may not have built correctly. Run dkms status to check, and reinstall the driver if needed.
Step 3 – Verify NVLink Topology (Multi-GPU Servers)
NVLink dramatically increases inter-GPU bandwidth compared to PCIe, which is critical for large models that need to split layers across multiple cards.
Check NVLink link status:
nvidia-smi nvlink --status
Inspect the full GPU communication topology:
nvidia-smi topo -m
In the topology matrix, look for NV (NVLink) entries between GPU pairs rather than PHB (PCIe host bridge) entries. NVLink connections give you up to ~900 GB/s aggregate bandwidth on NVIDIA H100 NVLink (depending on topology) versus ~64 GB/s on PCIe 4.0 x16.
If NVLink shows as inactive on a server that has physical NVLink bridges, contact your hosting provider's support. It's usually a BIOS or driver configuration issue, not a hardware fault.
Step 4 – Install Ollama
Ollama is the inference engine that handles model downloading, quantization, and GPU scheduling. The official install script always pulls the latest stable release:
curl -fsSL https://ollama.com/install.sh | sh
Verify the installation and check that the binary is accessible:
ollama --version
ollama list
Enable and start Ollama as a system service so it persists across reboots:
sudo systemctl enable ollama
sudo systemctl start ollama
At this point, Ollama is running and listening on http://127.0.0.1:11434. It will auto-detect your GPUs, but the next step configures it for optimal multi-GPU throughput.
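If you want to script against the server rather than use the CLI, the same port serves Ollama's REST API. Here's a minimal Python sketch using the documented /api/generate endpoint; the model name is just an example, so substitute any model you've pulled:

```python
import json
import urllib.request

OLLAMA_URL = "http://127.0.0.1:11434"  # Ollama's default listen address

def build_generate_request(model: str, prompt: str) -> dict:
    """JSON body for Ollama's /api/generate endpoint; stream=False asks
    for a single JSON response instead of a token stream."""
    return {"model": model, "prompt": prompt, "stream": False}

def generate(model: str, prompt: str) -> str:
    """POST a completion request to the local Ollama server."""
    body = json.dumps(build_generate_request(model, prompt)).encode()
    req = urllib.request.Request(
        f"{OLLAMA_URL}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# With a model pulled (see Step 7), for example:
#   print(generate("llama3.1:70b", "Say hello in five words."))
```

This is handy for smoke tests from CI or monitoring scripts without involving Open WebUI.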
Step 5 – Configure Ollama for Multi-GPU Layer Offloading
Ollama's 2026 builds detect multiple GPUs automatically, but explicit environment variables ensure you get maximum VRAM utilization and parallel request handling.
Important: Ollama uses layer-based GPU offloading rather than full tensor parallelism. Models are split across GPUs by layer, but performance scaling is workload-dependent and not strictly linear the way it can be with distributed inference engines.
Open a systemd override file:
sudo systemctl edit ollama
Add the following configuration (adjust GPU IDs and counts to match your hardware):
[Service]
Environment="CUDA_VISIBLE_DEVICES=0,1,2,3"
Environment="OLLAMA_NUM_PARALLEL=4"
Environment="OLLAMA_MAX_LOADED_MODELS=4"
Environment="OLLAMA_FLASH_ATTENTION=true"
Environment="OLLAMA_KV_CACHE_TYPE=q8_0"
What each setting does:
- CUDA_VISIBLE_DEVICES – explicitly maps which physical GPUs Ollama can use; avoids accidental CPU fallback
- OLLAMA_NUM_PARALLEL – number of concurrent inference requests handled across available GPU memory; optimal values depend on model size, context length, and VRAM headroom
- OLLAMA_MAX_LOADED_MODELS – how many models stay resident in VRAM (useful for multi-model deployments)
- OLLAMA_FLASH_ATTENTION=true – enables FlashAttention (when supported by the model and GPU architecture), which reduces memory bandwidth pressure on long contexts
- OLLAMA_KV_CACHE_TYPE=q8_0 – quantizes the KV cache to 8-bit, freeing VRAM with minimal quality impact
Apply the changes:
sudo systemctl daemon-reload
sudo systemctl restart ollama
Confirm all GPUs are active during inference:
watch -n 1 nvidia-smi
Pull and run a model in a separate terminal, and you should see GPU utilization spread across all cards.
Note: Depending on the model and workload, you may still observe uneven GPU utilization. This is expected behavior with layer-based offloading.
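To see why quantizing the KV cache matters, note that the cache for one sequence grows with layer count, KV head count, head dimension, and context length. A rough back-of-the-envelope sketch, using approximate Llama-3-class 70B dimensions (80 layers, 8 GQA KV heads, head dimension 128; exact values vary by model):

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   context_len: int, bytes_per_elem: float) -> int:
    """Size of the KV cache for one sequence: keys and values (factor
    of 2) stored per layer, per KV head, per token."""
    return int(2 * layers * kv_heads * head_dim * context_len * bytes_per_elem)

# Approximate Llama-3-class 70B dimensions at an 8192-token context
fp16 = kv_cache_bytes(80, 8, 128, 8192, 2)  # fp16: 2 bytes per element
q8 = kv_cache_bytes(80, 8, 128, 8192, 1)    # q8_0: ~1 byte per element
print(f"fp16 KV cache: {fp16 / 1024**3:.2f} GiB, q8_0: {q8 / 1024**3:.2f} GiB")
```

At these dimensions the fp16 cache is about 2.5 GiB per 8K-token sequence, so q8_0 roughly halves that, which compounds quickly when OLLAMA_NUM_PARALLEL serves several sequences at once.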
Step 6 – Install Open WebUI
Open WebUI provides a full-featured browser interface for your Ollama instance: model switching, conversation history, RAG (retrieval-augmented generation), multi-user accounts, and API key management.
Deploy it via Docker:
docker run -d -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data \
  --name open-webui \
  --restart unless-stopped \
  ghcr.io/open-webui/open-webui:main
Access the interface at: http://your-server-ip:3000
Connect Open WebUI to your Ollama backend:
- Open WebUI → Settings → Connections
- Set the Ollama API Base URL to http://host.docker.internal:11434 (from inside the container, 127.0.0.1 refers to the container itself, so use the host-gateway alias configured in the docker run command)
- Click Save and Test Connection
Once connected, all models pulled via Ollama appear in the model selector dropdown automatically.
Security note: If your server is internet-facing, place it behind a reverse proxy (Nginx or Caddy) with HTTPS, authentication (OAuth or basic auth), and firewall rules restricting access. Avoid exposing port 3000 directly to the public internet.
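As a starting point, a reverse proxy in front of port 3000 might look like the following Nginx sketch. The domain, certificate paths, and htpasswd file are placeholders; adapt them to your environment:

```nginx
# Hypothetical reverse proxy for Open WebUI on port 3000
server {
    listen 443 ssl;
    server_name chat.example.com;

    ssl_certificate     /etc/letsencrypt/live/chat.example.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/chat.example.com/privkey.pem;

    auth_basic           "Restricted";
    auth_basic_user_file /etc/nginx/.htpasswd;

    location / {
        proxy_pass http://127.0.0.1:3000;
        # Open WebUI streams responses over WebSockets
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        proxy_set_header Host $host;
    }
}
```

Combine this with a firewall rule (e.g., ufw) that blocks direct access to port 3000 from outside the host.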
Note: Open WebUI does not require GPU access; all model inference is handled by Ollama on the host system.
Step 7 – Pull Models and Confirm GPU Offloading
With the stack running, pull the models you need. Ollama supports most quantized GGUF formats:
ollama pull llama3.1:70b
ollama pull mistral-large
# Very large models require 8× GPUs or more
# ollama pull deepseek-r1:671b-q4_k_m
ollama pull qwen2.5:72b
Note: Ultra-large models (500B+ parameters) typically require 8× GPUs or more and careful tuning to run reliably.
Run a test prompt and monitor GPU utilization simultaneously:
# Terminal 1
ollama run llama3.1:70b "Explain transformer attention mechanisms in plain English"
# Terminal 2
watch -n 0.5 nvidia-smi
All GPUs should show non-zero utilization during generation. If only GPU 0 is active, revisit Step 5 and confirm the CUDA_VISIBLE_DEVICES override is applied correctly.
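If you'd rather check this programmatically than eyeball nvidia-smi, here's a small Python sketch built on nvidia-smi's standard CSV query mode; the parsing helpers are pure functions, and only live_utilization() touches the hardware:

```python
import subprocess

def parse_gpu_utilization(csv_text: str) -> dict:
    """Parse 'index, utilization' CSV lines from nvidia-smi into
    a {gpu_index: percent} mapping."""
    util = {}
    for line in csv_text.strip().splitlines():
        idx, pct = (field.strip() for field in line.split(","))
        util[int(idx)] = int(pct)
    return util

def idle_gpus(util: dict) -> list:
    """Return the indices of GPUs showing 0% utilization."""
    return [idx for idx, pct in util.items() if pct == 0]

def live_utilization() -> dict:
    """Query the running system (requires the NVIDIA driver from Step 2)."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=index,utilization.gpu",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_gpu_utilization(out)

# On a live server while a model is generating:
#   print(idle_gpus(live_utilization()))  # expect an empty list
```

A non-empty idle list during generation is the programmatic equivalent of "only GPU 0 is active" and points back to the Step 5 override.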
Step 8 – Performance Tuning for Production Inference
8.1 Configure NCCL for Inter-GPU Communication
NCCL (NVIDIA Collective Communications Library) manages how GPUs communicate during parallel inference. These environment variables optimize NCCL behavior for NVLink-connected servers:
These settings are most beneficial for multi-node or InfiniBand environments; for single-node NVLink systems, default behavior is often sufficient.
Add to /etc/environment (applies system-wide on reboot):
NCCL_P2P_LEVEL=NV
NCCL_NET_GDR_LEVEL=1
CUDA_DEVICE_MAX_CONNECTIONS=1
For immediate effect in your current session, export each variable in your shell before starting Ollama.
8.2 Choose the Right Quantization Level
Model quantization is the single biggest lever for fitting large models into available VRAM while preserving generation quality:
| Format | VRAM Usage | Quality Impact | Best For |
|---|---|---|---|
| q4_k_m | Lowest | Moderate | Maximizing throughput, large models on limited VRAM |
| q5_k_m | Medium | Low | Balanced quality and efficiency |
| q6_k | Higher | Minimal | Near-full quality with manageable VRAM |
| q8_0 | Highest | Negligible | 70B models with 384+ GB total VRAM |
For a 70B model on 4× H100 (320 GB total VRAM), q5_k_m or q6_k is typically the sweet spot.
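These trade-offs can be sanity-checked with quick arithmetic. The bits-per-weight figures below are rough averages for GGUF quantizations, not exact values, so treat the results as estimates of the weights alone; the KV cache and runtime overhead come on top:

```python
# Approximate bits-per-weight for common GGUF quantizations (rough
# averages; exact figures vary by architecture and llama.cpp version).
BITS_PER_WEIGHT = {"q4_k_m": 4.85, "q5_k_m": 5.7, "q6_k": 6.6, "q8_0": 8.5}

def estimate_weights_gb(params_billion: float, quant: str) -> float:
    """Estimated VRAM for model weights only (no KV cache or
    activation overhead) at a given quantization level."""
    bits = BITS_PER_WEIGHT[quant]
    return round(params_billion * 1e9 * bits / 8 / 1024**3, 1)

for quant in BITS_PER_WEIGHT:
    print(f"70B @ {quant}: ~{estimate_weights_gb(70, quant)} GiB weights")
```

For a 70B model this lands around 40 GiB at q4_k_m and near 70 GiB at q8_0, which is why q5_k_m/q6_k fits comfortably on 4× H100 while leaving headroom for the KV cache and parallel requests.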
8.3 Tune Parallel Request Handling
If you're serving multiple concurrent users through Open WebUI:
Start with OLLAMA_NUM_PARALLEL equal to your GPU count as a baseline, then increase gradually while monitoring VRAM usage and latency under load.
8.4 Benchmark Your Setup
Run a simple throughput test to measure tokens per second:
ollama run llama3.1:70b --verbose "Write a 300-word summary of distributed computing"
The --verbose flag prints timing statistics after generation, including the eval rate in tokens per second (equivalently, output token count divided by elapsed time). On 4× H100s with Llama 3.1 70B at q5_k_m, expect roughly 60–140 tokens/second depending on context length, quantization, and batching.
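If you benchmark through the API instead of the CLI, Ollama's /api/generate response includes eval_count (output tokens) and eval_duration (nanoseconds), from which throughput follows directly:

```python
def tokens_per_second(eval_count: int, eval_duration_ns: int) -> float:
    """Generation throughput from Ollama's /api/generate metrics:
    eval_count output tokens produced in eval_duration nanoseconds."""
    return eval_count / (eval_duration_ns / 1e9)

# Example: 512 tokens generated in 6.4 s of eval time
print(tokens_per_second(512, 6_400_000_000))
```

Logging this per request gives you a throughput time series you can watch while tuning OLLAMA_NUM_PARALLEL.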
Troubleshooting Reference
Ollama is only using one GPU. Check CUDA_VISIBLE_DEVICES in the systemd override. Run systemctl cat ollama to confirm the environment variables are actually applied. Restart the service after any change.
NVLink not detected in topology. Verify physical NVLink bridges are installed and seated. Update to the latest 570.xx driver branch. Check BIOS for NVLink/NVSwitch settings. Contact your server provider's hardware support.
Out of VRAM errors on large models. Switch to a lower quantization (e.g., q4_k_m). Reduce OLLAMA_MAX_LOADED_MODELS to 1. Check that no other processes are consuming GPU memory with nvidia-smi.
Slow inference speeds despite GPU utilization. Enable OLLAMA_FLASH_ATTENTION=true. Apply the NCCL environment variables from Step 8.1. Confirm model weights are on NVMe storage; spinning disk can bottleneck initial model load significantly.
Open WebUI can't connect to Ollama. Confirm Ollama is running: systemctl status ollama. Verify the API is reachable: curl http://127.0.0.1:11434. Check Docker networking; the --add-host=host.docker.internal:host-gateway flag is required for container-to-host communication.
Final Thoughts
Deploying Ollama and Open WebUI on bare metal GPU servers gives you something cloud GPU rentals fundamentally can't: predictable cost, guaranteed availability, and complete data ownership. Once the stack is running, adding new models is a single ollama pull command, and the Open WebUI interface makes it immediately accessible to non-technical teammates.
The configuration in this guide is designed to be production-stable, not just a demo setup. The systemd service management, NCCL tuning, and quantization guidance are all aimed at sustained, multi-user inference workloads rather than one-off experiments.
Run Your Own LLM Infrastructure on KW Servers
KW Servers' bare metal GPU dedicated servers are available in 250+ locations worldwide, including the USA, Canada, Singapore, Hong Kong, Tokyo, and Seoul.
Every server includes:
- Full NVLink and PCIe multi-GPU configurations (up to 8× GPUs per node)
- Free 10 Tbps DDoS protection
- 24/7 expert infrastructure support
- Instant deployment with clean Ubuntu 24.04 LTS images
→ Browse GPU Dedicated Servers → Choose your deployment region → Contact us for a free multi-GPU sizing recommendation.
Have questions about model quantization, VRAM requirements, or scaling to 8+ GPUs? Drop a comment below or reach out directly. We're happy to help you spec the right setup.