Ollama is a powerful AI model serving platform that makes it easy to run large language models locally. In this post, we’ll walk through setting up Ollama with Open WebUI in a Docker containerized environment, including GPU acceleration support. Visit Ollama’s website for more information.

What is Ollama?

Ollama is an AI platform that provides an easy way to run large language models locally. It offers:

  • Easy model management
  • GPU acceleration support
  • A clean API for model interactions
  • Containerized deployment

What is Open WebUI?

Open WebUI provides a web-based interface for interacting with Ollama models, making it easy to use AI models without needing to write code or deal with complex command-line interfaces. Visit the Open WebUI GitHub repository for more information.

Prerequisites

Before starting, ensure you have:

  • Docker installed on your host system
  • NVIDIA drivers and the NVIDIA Container Toolkit installed (for GPU support; see the quick check after this list)
  • Directory for AI model storage created
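
Docker only passes the GPU through to containers when the NVIDIA Container Toolkit is installed alongside the drivers. A quick sanity check, as a sketch (the ubuntu image is just a convenient throwaway; when --gpus is used, the toolkit mounts nvidia-smi into the container at run time):

# driver check on the host
nvidia-smi

# container check: Docker should be able to hand the GPU to a container
docker run --rm --gpus all ubuntu nvidia-smi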

Docker Compose Configuration

We’ll use Docker Compose to set up both the Ollama server and Open WebUI interface. Create a docker-compose.yml file with the following content:

services:
  ollama:
    image: ollama/ollama:latest
    container_name: ollama
    restart: unless-stopped
    environment:
      - OLLAMA_MAX_LOADED_MODELS=1
      - OLLAMA_NUM_THREADS=8
      - OLLAMA_CONTEXT_LENGTH=38000
      - OLLAMA_CUDA=1
    volumes:
      - ${AI_DIR}/models:/root/.ollama/models
    ports:
      - "11434:11434"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]

  open-webui:
    image: ghcr.io/open-webui/open-webui:latest
    container_name: open-webui
    ports:
      - "8080:8080"
    depends_on:
      - ollama
    environment:
      - "OLLAMA_BASE_URL=http://localhost:11434"
      - "DEFAULT_MODELS=qwen3-coder:30b"
      - "ENABLE_RAG_WEB_SEARCH=true"
      - "WEBUI_SECRET_KEY="
    restart: unless-stopped
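
The compose file leaves ${AI_DIR} up to you. As a minimal sketch, assuming you keep it in a .env file next to docker-compose.yml (the path is a placeholder), bringing the stack up looks like this:

# .env -- placeholder path, point it at your model storage directory
AI_DIR=/srv/ai

# start both containers in the background, then confirm Ollama sees the GPU
docker compose up -d
docker compose logs -f ollama

Once both containers are running, Ollama listens on port 11434 and Open WebUI on port 8080 of the host.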

Using the Setup

Model Management

To be completely honest, I ended up rarely using the Open WebUI. It is nice to have it when I want to ask a question on my cellphone, but when it comes to using Ollama for writing code, I would just point an agent running on my laptop to the server’s address.
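
Pointing a tool at the server just means handing it the Ollama endpoint on port 11434. As a rough sketch (the hostname is a placeholder for your server's address, and the model is whichever one you have pulled), the same endpoint can be exercised directly with curl:

# placeholder hostname; 11434 is the port published in the compose file
curl http://homeserver.local:11434/api/generate -d '{
  "model": "qwen3-coder:30b",
  "prompt": "Write a function that reverses a string.",
  "stream": false
}'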

The one aspect where I would use the WebUI more is downloading new models. The container is set up so that models are stored in the ${AI_DIR}/models directory, which is bound to a directory on the host. To add a new model, I typically run docker exec ollama ollama pull <model>, which is slightly inconvenient compared to being able to paste the model name directly into the web page.

I use quantized models from Hugging Face for better performance and smaller file sizes. These quantized models are ideal for running on a home server because they require much less VRAM.
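
Ollama can pull GGUF quantizations straight from Hugging Face by name. A sketch of what that looks like through the container (the repository and quantization tag are placeholders for whatever model you actually want):

# pull a quantized GGUF build from Hugging Face -- placeholder repo and tag
docker exec ollama ollama pull hf.co/unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF:Q4_K_M

# confirm it landed in ${AI_DIR}/models on the host
docker exec ollama ollama list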

Resource Limits

The biggest adjustment in running my own AI model has been keeping a closer eye on the size of the context because of the resource limits. This really hasn't been that difficult, though. If you've been trying out AI agents as they come out and have figured out how to be effective with them, then you have probably discovered that the output is much better when you break the work into chunks. It's also easier to review. Using a local model has just required more investment in planning and breaking the task into smaller features. AI moves mountains in the same way as RI (real intelligence), one spoonful at a time.

I would keep an eye on my resource usage while the AI works by running watch -n 1 docker compose exec ollama ollama ps. This refreshes every second and shows how the loaded model is split between CPU and GPU memory. With Ollama, I would try to keep everything in the GPU's VRAM. Even though my system has 32 GB of RAM, I noticed much slower output as soon as the model spilled into main memory. This is a bummer because I had hoped to use that extra RAM to get away with larger models.
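
For a quick read on whether a model is fully resident in VRAM, the PROCESSOR column of ollama ps should say "100% GPU"; nvidia-smi on the host gives the overall picture. A small sketch, assuming the NVIDIA tools are installed on the host:

# should report "100% GPU" when the model fits entirely in VRAM
docker compose exec ollama ollama ps

# watch overall VRAM consumption on the host
watch -n 1 nvidia-smi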

Another consequence of the resource limits is that I can only run one model at a time. I haven't really felt this because I am the only user of my system, but as I get into subagents I think it will become more of a drawback. So far, while experimenting with subagents, I have used the same model for each and just given them different contexts. In the future I plan to try running smaller models for each subagent so that they can run in parallel, but I haven't had the desire to experiment with this yet.
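
If I do go down that road, the knob is in the compose file. A sketch of the change, assuming the models are small enough to share VRAM (the values here are guesses, not tested settings):

# in the ollama service's environment section of docker-compose.yml
- OLLAMA_MAX_LOADED_MODELS=2   # keep two models resident at once
- OLLAMA_NUM_PARALLEL=2        # serve two requests concurrently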