How to Deploy Ollama and Open WebUI with Multi-GPU Offloading on Bare Metal Dedicated Servers (2026 Guide)
Running large language models on your own hardware has never been more practical. In 2026, the combination of Ollama and Open WebUI has become the go-to self-hosted AI stack for teams and developers who want full control over model inference without relying on cloud providers like AWS, Azure, or Google Cloud.
This guide walks you through a production-ready deployment of Ollama and Open WebUI on a bare metal dedicated server equipped with multiple NVIDIA GPUs. You'll get efficient multi-GPU layer offloading, NVLink acceleration, and a polished ChatGPT-style interface, all running privately on your own infrastructure.
What You'll Learn
Who This Guide Is For
Why Bare Metal Over Cloud for LLM Inference?
Prerequisites
Step 1 – Update the System and Install Build Dependencies
Step 2 – Install the NVIDIA Driver and CUDA Toolkit
Step 3 – Verify NVLink Topology (Multi-GPU Servers)
Step 4 – Install Ollama
Step 5 – Configure Ollama for Multi-GPU Layer Offloading
Step 6 – Install Open WebUI
Step 7 – Pull Models and Confirm GPU Offloading
Step 8 – Performance Tuning for Production Inference
Troubleshooting Reference
Final Thoughts
Who This Guide Is For
This tutorial is written for:
- ML engineers and AI developers self-hosting open-weight models like Llama 3.1, Mistral, DeepSeek-R1, Qwen 2.5, Gemma 2, or Phi-4
- DevOps teams deploying inference servers on dedicated GPU hardware
- Businesses that need private, on-premise LLM inference with no data leaving their environment
- Anyone migrating away from expensive GPU cloud rentals to owned bare metal infrastructure
If you're running a single consumer GPU, most of this guide still applies, but multi-GPU sections are specifically aimed at server-grade setups with 2–8 GPUs (NVIDIA B200, H100, A100, RTX 5090, or similar).
Why Bare Metal Over Cloud for LLM Inference?
Before diving into the installation, it's worth understanding the architectural choice here.
Cloud GPU rentals (Lambda Labs, CoreWeave, Vast.ai, etc.) are excellent for short-term experimentation. But for sustained inference workloads, the economics shift quickly. A dedicated bare metal server with 4× H100s costs a fixed monthly fee: you're not paying per GPU-hour, and you're never competing for spot availability.
Bare metal also eliminates the hypervisor tax; virtualized GPU environments introduce latency overhead that doesn't exist on physical hardware. For latency-sensitive inference, this difference is measurable.
Finally, data sovereignty matters. On bare metal, your prompts, completions, and model weights never leave your server. For healthcare, legal, finance, and enterprise AI use cases, this isn't optional.
Prerequisites
Before you begin, confirm you have:
- A dedicated server with 1–8× NVIDIA GPUs (Blackwell B200, H100, A100, RTX 5090, or equivalent)
- Ubuntu 24.04 LTS (clean install strongly recommended; this guide targets that OS)
- Root or sudo access
- 64 GB RAM minimum (128 GB or more for 70B+ parameter models)
- NVMe SSD storage on PCIe 4.0 or 5.0 (model weights load significantly faster)
- An active internet connection
- Approximately 45–90 minutes
Step 1 – Update the System and Install Build Dependencies
Start with a clean, fully updated system. This ensures driver installation doesn't conflict with outdated kernel headers.
sudo apt update && sudo apt upgrade -y
sudo apt install -y build-essential dkms linux-headers-$(uname -r) curl wget git python3-pip
sudo reboot
Step 2 – Install the NVIDIA Driver and CUDA Toolkit
Use the 570.xx production driver branch or newer. As of 2026, this branch has stable support for Blackwell (B200) and Hopper (H100) architectures.
# For Ubuntu 24.04, the repository path component is ubuntu2404
distribution=ubuntu2404
wget https://developer.download.nvidia.com/compute/cuda/repos/$distribution/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt update
# Use the latest CUDA 12.x version supported by your installed driver.
sudo apt install -y nvidia-driver-570 cuda-toolkit-12-x
Reboot and verify the installation:
sudo reboot
nvidia-smi
A successful output lists all installed GPUs with their VRAM, driver version, and CUDA version. On a 4× B200 server, you'll see four entries, each showing 192 GB of HBM3e memory.
Common issue: If nvidia-smi returns "command not found" or a driver error after reboot, the DKMS module may not have built correctly. Run dkms status to check, and reinstall the driver if needed.
Step 3 – Verify NVLink Topology (Multi-GPU Servers)
NVLink dramatically increases inter-GPU bandwidth compared to PCIe, which is critical for large models that need to split layers across multiple cards.
Check NVLink link status:
nvidia-smi nvlink --status
Inspect the full GPU communication topology:
nvidia-smi topo -m
In the topology matrix, look for NV (NVLink) entries between GPU pairs rather than PHB (PCIe host bridge) entries. NVLink connections give you up to ~900 GB/s aggregate bandwidth on NVIDIA H100 NVLink (depending on topology) versus ~64 GB/s on PCIe 4.0 x16.
If NVLink shows as inactive on a server that has physical NVLink bridges, contact your hosting provider's support. It's usually a BIOS or driver configuration issue, not a hardware fault.
Step 4 – Install Ollama
Ollama is the inference engine that handles model downloading, quantization, and GPU scheduling. The official install script always pulls the latest stable release:
curl -fsSL https://ollama.com/install.sh | sh
Verify the installation and check that the binary is accessible:
ollama --version
ollama list
Enable and start Ollama as a system service so it persists across reboots:
sudo systemctl enable ollama
sudo systemctl start ollama
At this point, Ollama is running and listening on http://127.0.0.1:11434. It will auto-detect your GPUs, but the next step configures it for optimal multi-GPU throughput.
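If you want to script against the server rather than use the CLI, the same port serves Ollama's REST API. Here's a minimal Python sketch using the documented /api/generate endpoint; the model name is just an example, so substitute any model you've pulled:

```python
import json
import urllib.request

OLLAMA_URL = "http://127.0.0.1:11434"  # Ollama's default listen address

def build_generate_request(model: str, prompt: str) -> dict:
    """JSON body for Ollama's /api/generate endpoint; stream=False asks
    for a single JSON response instead of a token stream."""
    return {"model": model, "prompt": prompt, "stream": False}

def generate(model: str, prompt: str) -> str:
    """POST a completion request to the local Ollama server."""
    body = json.dumps(build_generate_request(model, prompt)).encode()
    req = urllib.request.Request(
        f"{OLLAMA_URL}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# With a model pulled (see Step 7), for example:
#   print(generate("llama3.1:70b", "Say hello in five words."))
```

This is handy for smoke tests from CI or monitoring scripts without involving Open WebUI.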
Step 5 – Configure Ollama for Multi-GPU Layer Offloading
Ollama's 2026 builds detect multiple GPUs automatically, but explicit environment variables ensure you get maximum VRAM utilization and parallel request handling.
Important: Ollama uses layer-based GPU offloading rather than full tensor parallelism. Models are split across GPUs by layer, but performance scaling is workload-dependent and not strictly linear the way it can be with distributed inference engines.
Open a systemd override file:
sudo systemctl edit ollama
Add the following configuration (adjust GPU IDs and counts to match your hardware):
[Service]
Environment="CUDA_VISIBLE_DEVICES=0,1,2,3"
Environment="OLLAMA_NUM_PARALLEL=4"
Environment="OLLAMA_MAX_LOADED_MODELS=4"
Environment="OLLAMA_FLASH_ATTENTION=true"
Environment="OLLAMA_KV_CACHE_TYPE=q8_0"
What each setting does:
- CUDA_VISIBLE_DEVICES – explicitly maps which physical GPUs Ollama can use; avoids accidental CPU fallback
- OLLAMA_NUM_PARALLEL – number of concurrent inference requests handled across available GPU memory; optimal values depend on model size, context length, and VRAM headroom
- OLLAMA_MAX_LOADED_MODELS – how many models stay resident in VRAM (useful for multi-model deployments)
- OLLAMA_FLASH_ATTENTION=true – enables FlashAttention (when supported by the model and GPU architecture), which reduces memory bandwidth pressure on long contexts
- OLLAMA_KV_CACHE_TYPE=q8_0 – quantizes the KV cache to 8-bit, freeing VRAM with minimal quality impact
Apply the changes:
sudo systemctl daemon-reload
sudo systemctl restart ollama
Confirm all GPUs are active during inference:
watch -n 1 nvidia-smi
Pull and run a model in a separate terminal, and you should see GPU utilization spread across all cards.
Note: Depending on the model and workload, you may still observe uneven GPU utilization. This is expected behavior with layer-based offloading.
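To see why quantizing the KV cache matters, note that the cache for one sequence grows with layer count, KV head count, head dimension, and context length. A rough back-of-the-envelope sketch, using approximate Llama-3-class 70B dimensions (80 layers, 8 GQA KV heads, head dimension 128; exact values vary by model):

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   context_len: int, bytes_per_elem: float) -> int:
    """Size of the KV cache for one sequence: keys and values (factor
    of 2) stored per layer, per KV head, per token."""
    return int(2 * layers * kv_heads * head_dim * context_len * bytes_per_elem)

# Approximate Llama-3-class 70B dimensions at an 8192-token context
fp16 = kv_cache_bytes(80, 8, 128, 8192, 2)  # fp16: 2 bytes per element
q8 = kv_cache_bytes(80, 8, 128, 8192, 1)    # q8_0: ~1 byte per element
print(f"fp16 KV cache: {fp16 / 1024**3:.2f} GiB, q8_0: {q8 / 1024**3:.2f} GiB")
```

At these dimensions the fp16 cache is about 2.5 GiB per 8K-token sequence, so q8_0 roughly halves that, which compounds quickly when OLLAMA_NUM_PARALLEL serves several sequences at once.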
Step 6 – Install Open WebUI
Open WebUI provides a full-featured browser interface for your Ollama instance: model switching, conversation history, RAG (retrieval-augmented generation), multi-user accounts, and API key management.
Deploy it via Docker:
docker run -d -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data \
  --name open-webui \
  --restart unless-stopped \
  ghcr.io/open-webui/open-webui:main
Access the interface at: http://your-server-ip:3000
Connect Open WebUI to your Ollama backend:
- Open WebUI → Settings → Connections
- Set the Ollama API Base URL to http://host.docker.internal:11434 (from inside the container, 127.0.0.1 refers to the container itself, so use the host-gateway alias configured in the docker run command)
- Click Save and Test Connection
Once connected, all models pulled via Ollama appear in the model selector dropdown automatically.
Security note: If your server is internet-facing, place it behind a reverse proxy (Nginx or Caddy) with HTTPS, authentication (OAuth or basic auth), and firewall rules restricting access. Avoid exposing port 3000 directly to the public internet.
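As a starting point, a reverse proxy in front of port 3000 might look like the following Nginx sketch. The domain, certificate paths, and htpasswd file are placeholders; adapt them to your environment:

```nginx
# Hypothetical reverse proxy for Open WebUI on port 3000
server {
    listen 443 ssl;
    server_name chat.example.com;

    ssl_certificate     /etc/letsencrypt/live/chat.example.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/chat.example.com/privkey.pem;

    auth_basic           "Restricted";
    auth_basic_user_file /etc/nginx/.htpasswd;

    location / {
        proxy_pass http://127.0.0.1:3000;
        # Open WebUI streams responses over WebSockets
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        proxy_set_header Host $host;
    }
}
```

Combine this with a firewall rule (e.g., ufw) that blocks direct access to port 3000 from outside the host.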
Note: Open WebUI does not require GPU access; all model inference is handled by Ollama on the host system.
Step 7 – Pull Models and Confirm GPU Offloading
With the stack running, pull the models you need. Ollama supports most quantized GGUF formats:
ollama pull llama3.1:70b
ollama pull mistral-large
# Very large models require 8× GPUs or more
# ollama pull deepseek-r1:671b-q4_k_m
ollama pull qwen2.5:72b
Note: Ultra-large models (500B+ parameters) typically require 8× GPUs or more and careful tuning to run reliably.
Run a test prompt and monitor GPU utilization simultaneously:
# Terminal 1
ollama run llama3.1:70b "Explain transformer attention mechanisms in plain English"
# Terminal 2
watch -n 0.5 nvidia-smi
All GPUs should show non-zero utilization during generation. If only GPU 0 is active, revisit Step 5 and confirm the CUDA_VISIBLE_DEVICES override is applied correctly.
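If you'd rather check this programmatically than eyeball nvidia-smi, here's a small Python sketch built on nvidia-smi's standard CSV query mode; the parsing helpers are pure functions, and only live_utilization() touches the hardware:

```python
import subprocess

def parse_gpu_utilization(csv_text: str) -> dict:
    """Parse 'index, utilization' CSV lines from nvidia-smi into
    a {gpu_index: percent} mapping."""
    util = {}
    for line in csv_text.strip().splitlines():
        idx, pct = (field.strip() for field in line.split(","))
        util[int(idx)] = int(pct)
    return util

def idle_gpus(util: dict) -> list:
    """Return the indices of GPUs showing 0% utilization."""
    return [idx for idx, pct in util.items() if pct == 0]

def live_utilization() -> dict:
    """Query the running system (requires the NVIDIA driver from Step 2)."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=index,utilization.gpu",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_gpu_utilization(out)

# On a live server while a model is generating:
#   print(idle_gpus(live_utilization()))  # expect an empty list
```

A non-empty idle list during generation is the programmatic equivalent of "only GPU 0 is active" and points back to the Step 5 override.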
Step 8 – Performance Tuning for Production Inference
8.1 Configure NCCL for Inter-GPU Communication
NCCL (NVIDIA Collective Communications Library) manages how GPUs communicate during parallel inference. These environment variables optimize NCCL behavior for NVLink-connected servers:
These settings are most beneficial for multi-node or InfiniBand environments; for single-node NVLink systems, default behavior is often sufficient.
Add to /etc/environment (applies system-wide on reboot):
NCCL_P2P_LEVEL=NV
NCCL_NET_GDR_LEVEL=1
CUDA_DEVICE_MAX_CONNECTIONS=1
For immediate effect in your current session, export each variable in your shell before starting Ollama.
8.2 Choose the Right Quantization Level
Model quantization is the single biggest lever for fitting large models into available VRAM while preserving generation quality:
| Format | VRAM Usage | Quality Impact | Best For |
|---|---|---|---|
| q4_k_m | Lowest | Moderate | Maximizing throughput, large models on limited VRAM |
| q5_k_m | Medium | Low | Balanced quality and efficiency |
| q6_k | Higher | Minimal | Near-full quality with manageable VRAM |
| q8_0 | Highest | Negligible | 70B models with 384+ GB total VRAM |
For a 70B model on 4× H100 (320 GB total VRAM), q5_k_m or q6_k is typically the sweet spot.
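These trade-offs can be sanity-checked with quick arithmetic. The bits-per-weight figures below are rough averages for GGUF quantizations, not exact values, so treat the results as estimates of the weights alone; the KV cache and runtime overhead come on top:

```python
# Approximate bits-per-weight for common GGUF quantizations (rough
# averages; exact figures vary by architecture and llama.cpp version).
BITS_PER_WEIGHT = {"q4_k_m": 4.85, "q5_k_m": 5.7, "q6_k": 6.6, "q8_0": 8.5}

def estimate_weights_gb(params_billion: float, quant: str) -> float:
    """Estimated VRAM for model weights only (no KV cache or
    activation overhead) at a given quantization level."""
    bits = BITS_PER_WEIGHT[quant]
    return round(params_billion * 1e9 * bits / 8 / 1024**3, 1)

for quant in BITS_PER_WEIGHT:
    print(f"70B @ {quant}: ~{estimate_weights_gb(70, quant)} GiB weights")
```

For a 70B model this lands around 40 GiB at q4_k_m and near 70 GiB at q8_0, which is why q5_k_m/q6_k fits comfortably on 4× H100 while leaving headroom for the KV cache and parallel requests.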
8.3 Tune Parallel Request Handling
If you're serving multiple concurrent users through Open WebUI:
Start with OLLAMA_NUM_PARALLEL equal to your GPU count as a baseline, then increase gradually while monitoring VRAM usage and latency under load.
8.4 Benchmark Your Setup
Run a simple throughput test to measure tokens per second:
ollama run llama3.1:70b --verbose "Write a 300-word summary of distributed computing"
The --verbose flag prints timing statistics after generation, including the eval rate in tokens per second (equivalently, output token count divided by elapsed time). On 4× H100s with Llama 3.1 70B at q5_k_m, expect roughly 60–140 tokens/second depending on context length, quantization, and batching.
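If you benchmark through the API instead of the CLI, Ollama's /api/generate response includes eval_count (output tokens) and eval_duration (nanoseconds), from which throughput follows directly:

```python
def tokens_per_second(eval_count: int, eval_duration_ns: int) -> float:
    """Generation throughput from Ollama's /api/generate metrics:
    eval_count output tokens produced in eval_duration nanoseconds."""
    return eval_count / (eval_duration_ns / 1e9)

# Example: 512 tokens generated in 6.4 s of eval time
print(tokens_per_second(512, 6_400_000_000))
```

Logging this per request gives you a throughput time series you can watch while tuning OLLAMA_NUM_PARALLEL.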
Troubleshooting Reference
Ollama is only using one GPU. Check CUDA_VISIBLE_DEVICES in the systemd override. Run systemctl cat ollama to confirm the environment variables are actually applied. Restart the service after any change.
NVLink not detected in topology. Verify physical NVLink bridges are installed and seated. Update to the latest 570.xx driver branch. Check BIOS for NVLink/NVSwitch settings. Contact your server provider's hardware support.
Out of VRAM errors on large models. Switch to a lower quantization (e.g., q4_k_m). Reduce OLLAMA_MAX_LOADED_MODELS to 1. Check that no other processes are consuming GPU memory with nvidia-smi.
Slow inference speeds despite GPU utilization. Enable OLLAMA_FLASH_ATTENTION=true. Apply the NCCL environment variables from Step 8.1. Confirm model weights are on NVMe storage; spinning disk can bottleneck initial model load significantly.
Open WebUI can't connect to Ollama. Confirm Ollama is running: systemctl status ollama. Verify the API is reachable: curl http://127.0.0.1:11434. Check Docker networking; the --add-host=host.docker.internal:host-gateway flag is required for container-to-host communication.
Final Thoughts
Deploying Ollama and Open WebUI on bare metal GPU servers gives you something cloud GPU rentals fundamentally can't: predictable cost, guaranteed availability, and complete data ownership. Once the stack is running, adding new models is a single ollama pull command, and the Open WebUI interface makes it immediately accessible to non-technical teammates.
The configuration in this guide is designed to be production-stable, not just a demo setup. The systemd service management, NCCL tuning, and quantization guidance are all aimed at sustained, multi-user inference workloads rather than one-off experiments.
Run Your Own LLM Infrastructure on KW Servers
KW Servers' bare metal GPU dedicated servers are available in 250+ locations worldwide, including the USA, Canada, Singapore, Hong Kong, Tokyo, and Seoul.
Every server includes:
- Full NVLink and PCIe multi-GPU configurations (up to 8× GPUs per node)
- Free 10 Tbps DDoS protection
- 24/7 expert infrastructure support
- Instant deployment with clean Ubuntu 24.04 LTS images
→ Browse GPU Dedicated Servers → Choose your deployment region → Contact us for a free multi-GPU sizing recommendation.
Have questions about model quantization, VRAM requirements, or scaling to 8+ GPUs? Drop a comment below or reach out directly. We're happy to help you spec the right setup.