Contents

Upgrading to llama.cpp

The first time I tried llama.cpp

llama.cpp was one of the first inference servers available for running an LLM. At the time, I wasn’t super impressed with AI, but LLMs were better than old-fashioned autocomplete. I didn’t think a subscription was worth paying for, but if I could run my own, I’d take it.

When I first got to the llama.cpp repo on GitHub, running it still required compiling it from source. Annoying, but not the end of the world. What actually stopped me was that there wasn’t anything like Hugging Face available to pull models from. The models that were around were the hundreds-of-billions-of-parameters versions, or I could write code to create my own from training data.

Lastly, I didn’t have hardware that could run it. This wouldn’t have stopped me from trying, but I would’ve quickly been unimpressed and let it go. In the end, I remember compiling the code but giving up on trying shortly after.

Setup now

llama.cpp has come a long way from when I first looked at the project. If you read my previous post on setting up ollama, you’ll know ollama’s advantage used to be how easy it was to set up. That advantage is now completely gone. In fact, since llama.cpp ships its UI within the same container, one could argue that it is even easier to set up.

llama.cpp can now be run completely within a Docker container. There are three main image variants: full, light, and server. These contain different tooling depending on whether you want to convert models, run a model, or serve a model. Each of these images also comes in versions for the different GPU brands.

For my setup, since I am using an NVIDIA GPU and wanted to run a server that I would access from other machines, I went with the server-cuda version. The documentation provides docker run commands for starting the image, but I always prefer converting these into a docker-compose.yml to make them easier to work with.

services:
  llama:
    image: ghcr.io/ggml-org/llama.cpp:server-cuda
    restart: unless-stopped
    environment:
      # Use this for a one time download
      # - LLAMA_ARG_MODEL_URL=https://huggingface.co/unsloth/Qwen3.5-35B-A3B-UD-Q3_K_XL.gguf
      - LLAMA_ARG_MODEL=/root/.cache/llama.cpp/unsloth_Qwen3.5-35B-A3B-GGUF_Qwen3.5-35B-A3B-UD-Q3_K_XL.gguf
      - LLAMA_ARG_CTX_SIZE=64000
    volumes:
      - ${AI_DIR}/models:/root/.cache/llama.cpp/
    ports:
      - 8080:8080
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]

A few callouts when looking at this configuration:

  • Make sure to have the NVIDIA Container Toolkit installed. It’s what allows the deploy section to expose your GPU inside the container.
  • You’ll want to configure a volume mount for your downloaded models. They can be large and take a while to download, so it’s nice not having to wait again when you switch models or recreate your container.
  • LLAMA_ARG_MODEL_URL tells the server where on Hugging Face to download the model from. I start the container with this set to download a new model, then restart it with the line commented out.
  • LLAMA_ARG_MODEL points at the already-downloaded model file that the server should load when it starts. It’d be nice if this matched the download link more closely; my biggest issue has been figuring out what I got wrong translating between the two.
  • LLAMA_ARG_CTX_SIZE sets the maximum context size. If it isn’t set, llama.cpp will try to use the context length specified by the model. I use this to limit the context when a model is using too much memory and running slowly.
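
Putting the LLAMA_ARG_MODEL_URL and LLAMA_ARG_MODEL callouts together, the download-then-switch workflow looks roughly like this (a sketch; adjust the service name and directories to your own compose file):

```shell
# First run: let LLAMA_ARG_MODEL_URL pull the model; watch the logs
# until the download completes.
docker compose up -d llama
docker compose logs -f llama

# List the cache volume to see the exact filename the download was
# saved under -- LLAMA_ARG_MODEL needs to point at this file, prefixed
# with /root/.cache/llama.cpp/ inside the container.
ls "${AI_DIR}/models"

# Comment out LLAMA_ARG_MODEL_URL, set LLAMA_ARG_MODEL, and recreate.
docker compose up -d --force-recreate llama
```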

There are some additional environment variables you can set, but these are the ones I found myself using the most. If you expose your server publicly, make sure to set LLAMA_API_KEY so a random stranger isn’t able to access your server.
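
The key is just one more line in the environment block of the compose file; a minimal sketch (the value here is a placeholder, use your own long random string):

```yaml
    environment:
      - LLAMA_API_KEY=replace-with-a-long-random-string
```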

Once this is running, you can access the UI at <server url>:8080 or point your agent of choice at the OpenAI-compatible API at <server url>:8080/v1.
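
As a quick sanity check of the /v1 endpoint, here’s a curl sketch (assuming the server is local on port 8080; the Authorization header is only needed if you set LLAMA_API_KEY, and the request goes to whichever model the server loaded):

```shell
# The bearer token is a placeholder; match whatever you set in LLAMA_API_KEY.
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer replace-with-a-long-random-string" \
  -d '{
    "messages": [
      {"role": "user", "content": "Say hello in one sentence."}
    ]
  }'
```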

How does it run?

The main reason I wanted to try out llama.cpp was that I kept getting response errors. My agents would chug along fine, but every few responses one would fail with some kind of server error. I also kept seeing tools I wanted to try that supported llama.cpp but not ollama. What finally made me decide to switch was a comment online claiming a 30% performance boost from switching. I suspected that was an exaggeration, since ollama actually uses llama.cpp under the hood, but I figured even a small speedup wouldn’t hurt.

I was pleasantly surprised by how well llama.cpp ran and how easy it was to set up. The server response errors were completely gone, interacting with the agent was faster, and, the biggest surprise, the slowdown I previously saw when my context was offloaded into system memory was gone. I occasionally increase my context size and I have yet to be able to tell when the offloading starts to happen. I’m currently running a context four times larger than what I had with ollama.

Feeling faster is one thing, but having some numbers to look at is a lot better. If you open up the llama.cpp logs, you’ll see that it prints the eval times for every prompt it receives:

llama  | prompt eval time =     237.44 ms /   127 tokens (    1.87 ms per token,   534.88 tokens per second)
llama  |        eval time =    4215.06 ms /   320 tokens (   13.17 ms per token,    75.92 tokens per second)
llama  |       total time =    4452.49 ms /   447 tokens

I’m not sure why they don’t print the per-token conversion for the total time, but for this specific prompt it works out to 9.96 ms per token, or about 100 tokens per second. Unfortunately, I don’t have these numbers from ollama to compare against, but I have found others posting their numbers online, and this seems to be a speed that many find acceptable. For a 35B model, I’m very happy with these numbers. I actually plan to sacrifice some of this speed for ways to improve the accuracy of the model, since I frequently leave the agent running in a loop and I’m not waiting on each response.
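
The conversion is easy enough to do yourself; this is just the arithmetic on the total-time line above (447 tokens in 4452.49 ms):

```shell
# Convert llama.cpp's "total time" line into per-token figures.
awk 'BEGIN {
  ms = 4452.49; tokens = 447
  printf "%.2f ms per token\n", ms / tokens
  printf "%.2f tokens per second\n", tokens / (ms / 1000)
}'
```

The combined number lands near 100 tokens per second, well above the generation-only 75.92, because the 127 prompt tokens were processed far faster than the 320 generated ones.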

Drawbacks

I’ve only found one drawback to moving to llama.cpp, and that is the model pulling method. Having to restart the container to pull a different model is a little annoying. There is probably an easy way to do this through the API, but I’m not downloading a lot of models or switching frequently, and it hasn’t bothered me enough to even look into it.

Conclusion

Overall, if you are running a local model, llama.cpp is a solid choice. All of the original inconveniences of running the server are completely gone. There are a lot of different options available these days, but many of them are just wrappers around llama.cpp. It would be nice to have some other options to compare against, but the speed I’m seeing, without any real inconveniences, makes it hard for me to be motivated to even try something different.