Odysseus AI llama.cpp CUDA CPU Fallback

Last updated: June 6, 2026

If Odysseus sees your NVIDIA GPU but llama.cpp still assigns layers to CPU, stop changing compose files blindly. You may have working Docker GPU access without a CUDA-enabled llama.cpp serve engine.

Quick answer

A successful nvidia-smi run inside the container confirms GPU passthrough. It does not confirm that llama.cpp has cudart, the CUDA Toolkit, or a CUDA-enabled build. Check the logs before changing GPU overlay settings again.

1. Prove Docker GPU Passthrough First

Run this from the Odysseus repository directory after the stack is up.


docker compose exec odysseus nvidia-smi -L

If this fails

Go back to the host driver, NVIDIA Container Toolkit, and compose overlay setup.

If this passes

Treat Docker passthrough as likely working and inspect llama.cpp or Cookbook dependency logs next.
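Because passthrough alone does not prove llama.cpp can reach cudart, one extra (weak) signal is whether the container's dynamic linker cache lists any CUDA runtime libraries. This is a minimal sketch that assumes a glibc-based image with `ldconfig` available. A serve engine that bundles or statically links its CUDA libraries can still work when nothing shows up here, and a hit does not prove llama.cpp was built with CUDA. The function name `check_cuda_libs` is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: look for CUDA runtime libraries in `ldconfig -p` output read from stdin.
# Assumption: the image is glibc-based and ships ldconfig.
check_cuda_libs() {
  local listing lib found=0
  listing=$(cat)
  for lib in libcudart libcublas; do
    if grep -q "$lib" <<<"$listing"; then
      echo "found: $lib"
      found=1
    else
      echo "missing: $lib"
    fi
  done
  # Exit non-zero when neither library appears in the linker cache.
  (( found == 1 ))
}
```

Usage: `docker compose exec -T odysseus ldconfig -p | check_cuda_libs`. The `-T` flag disables the pseudo-TTY so the piped output stays clean.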

2. Search the Odysseus Logs for CUDA Runtime Clues

The official README calls out this split: GPU passthrough is not the same as llama.cpp CUDA support.


docker compose logs --tail 240 odysseus | grep -Ei "cuda|cudart|cudatoolkit|backend|cpu|llama|gpu"
# If the model runs in a tmux session, inspect the Cookbook or serve logs from the Odysseus UI too.

Look for messages saying the CUDA Toolkit was not found or cudart is missing, tensors assigned to CPU, or a backend list that contains only CPU.

3. Map the Symptom to the Layer

nvidia-smi fails in the container

Docker GPU passthrough is broken. Fix the NVIDIA Container Toolkit setup and the Odysseus GPU overlay.

nvidia-smi passes, but logs say CUDA Toolkit not found

Docker sees the GPU, but llama.cpp or its build environment is missing CUDA runtime pieces.

Logs show CPU layers or CPU KV cache

The serve engine is launching, but the selected model is not offloading as expected.

The serve engine crashes during build

Reinstall or rebuild the serve engine from Cookbook Dependencies before changing model settings.
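The symptom-to-layer mapping above can be sketched as a small triage function that reads log text on stdin and names the layer to check first. The match patterns are assumptions based on the messages this page mentions (CUDA Toolkit or CUDAToolkit discovery failures, missing cudart, layers or KV cache on CPU). Exact llama.cpp and CMake wording varies between versions, so treat "inconclusive" as unknown, not as healthy. The function name `triage_llama_log` is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: map log text (stdin) to the most likely failing layer.
# Patterns are assumptions; adjust them to the wording your build prints.
triage_llama_log() {
  local log
  log=$(cat)
  # Runtime layer: CUDA Toolkit discovery failed or cudart is missing.
  if grep -Eqi 'could not find cuda ?toolkit|cuda ?toolkit.*not found|cudart.*(missing|not found)|missing.*cudart' <<<"$log"; then
    echo "runtime: Docker sees the GPU, but llama.cpp is missing CUDA runtime pieces"
  # Offload layer: the engine runs, but layers or KV cache land on CPU.
  elif grep -Eqi 'assigned to (device )?cpu|cpu kv' <<<"$log"; then
    echo "offload: the engine runs, but the model is not offloading to the GPU"
  else
    echo "inconclusive: no known CPU-fallback pattern found"
  fi
}
```

Usage: `docker compose logs --tail 240 odysseus | triage_llama_log`. If the model runs in a tmux session, paste the Cookbook or serve log text into the function instead.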

4. Reinstall the Serve Engine Through Cookbook

If the logs point to a Cookbook or llama.cpp build issue, use the in-app Cookbook Dependencies workflow to reinstall the serve engine. This is the path the official README points users toward for CUDA-enabled builds.

- Open Odysseus as an admin.
- Go to Cookbook.
- Open Dependencies.
- Reinstall llama.cpp or the serve engine you are using.
- Serve a small model first to confirm GPU layers before trying a larger one.
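To confirm GPU layers after serving the small model, look for llama.cpp's load summary. llama.cpp prints a line of the form `offloaded N/M layers to GPU` while loading a model (older and newer builds use different prefixes before it). The helper below, a hypothetical name not part of Odysseus, pulls the last such ratio out of log text. Where Odysseus writes serve logs is an assumption: for a tmux-hosted model, copy the text from the Cookbook or serve log view.

```shell
#!/usr/bin/env bash
# Sketch: extract "offloaded N/M layers to GPU" from llama.cpp log text on stdin.
# Prints "N/M" for the last match, or "none" if the line never appears.
offloaded_layers() {
  local ratio
  ratio=$(grep -Eo 'offloaded [0-9]+/[0-9]+ layers to GPU' | tail -n 1 | grep -Eo '[0-9]+/[0-9]+' || true)
  echo "${ratio:-none}"
}
```

A result such as `0/33` means no layers went to the GPU. A partial ratio usually points to a VRAM limit or the GPU-layers setting rather than a missing CUDA build.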

5. Retest with a Smaller Model

A GTX 10-series or other small-VRAM card can pass CUDA checks but still struggle with a 7B model at a large context size. Use a smaller quantized model to separate runtime problems from memory pressure.
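To see why context size matters, here is a rough KV cache estimate: two tensors (K and V) per layer, each holding context length × KV heads × head dimension elements. The example shape (32 layers, 32 KV heads, head dimension 128, f16 cache at 2 bytes per element) is an assumption matching a Llama-2-7B-style model. Models with grouped-query attention have fewer KV heads and a proportionally smaller cache, and the runtime allocates other buffers this ignores. `kv_cache_mib` is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Rough KV cache size in MiB:
#   2 (K and V) * layers * context * kv_heads * head_dim * bytes_per_element
# The default bytes_per_element=2 assumes an f16 cache.
kv_cache_mib() {
  local layers=$1 ctx=$2 kv_heads=$3 head_dim=$4 bytes=${5:-2}
  echo $(( 2 * layers * ctx * kv_heads * head_dim * bytes / 1024 / 1024 ))
}
```

For that shape, `kv_cache_mib 32 4096 32 128` prints 2048 (about 2 GiB on top of the weights), and `kv_cache_mib 32 2048 32 128` prints 1024. Halving the context halves the cache.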

Reduce model size

Try a smaller GGUF file or a lower-bit quantization before changing the stack again.

Reduce context

A larger context window increases KV cache memory. Lower the context size while testing GPU offload.

Check actual GPU use

Watch nvidia-smi while serving to confirm the runtime is doing work on the GPU.
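One way to make that check concrete is to compare GPU memory before and after the model loads. This sketch assumes the Compose service is named `odysseus`, as in the commands above, and reads only the first GPU. `offload_verdict` and its 500 MiB default threshold are hypothetical choices; tune the threshold to your model size.

```shell
#!/usr/bin/env bash
# Sketch: compare GPU memory before and after serving a model.
# Assumes the Compose service is named "odysseus".

# Memory in use on the first GPU, in MiB (-T disables the TTY for clean output).
gpu_mem_used_mib() {
  docker compose exec -T odysseus \
    nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits | head -n 1
}

# Judge whether memory grew enough to suggest layers were offloaded.
offload_verdict() {
  local before=$1 after=$2 min_mib=${3:-500}
  local delta=$(( after - before ))
  if (( delta >= min_mib )); then
    echo "grew ${delta} MiB: GPU offload likely active"
  else
    echo "grew ${delta} MiB: model may still be on CPU"
  fi
}
```

Run `before=$(gpu_mem_used_mib)` before starting the model, `after=$(gpu_mem_used_mib)` once it has loaded, then `offload_verdict "$before" "$after"`. For a live view, `docker compose exec odysseus nvidia-smi -l 1` refreshes every second.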

6. When to Stop Debugging llama.cpp

If you need a working chat first, connect Ollama or a remote OpenAI-compatible provider and return to llama.cpp later. Odysseus is the workspace; the model backend can be swapped.

Use the Ollama setup guide if the llama.cpp dependency path is blocking your first successful run.

Verify Against Official Issue Context

This page tracks the pattern from official issue #831 and the current README split between Docker GPU passthrough and llama.cpp CUDA runtime. Check the official issue and README for the latest maintainer notes.

FAQ

Why does llama.cpp fall back to CPU when nvidia-smi works?

nvidia-smi inside the container proves Docker can see the GPU. It does not prove that the llama.cpp serve engine was built with CUDA runtime support or that the selected model was launched with GPU layers.

Is this an Odysseus Docker GPU passthrough problem?

Not always. If nvidia-smi works in the Odysseus container, Docker passthrough is probably working. Focus next on llama.cpp, CUDA Toolkit, cudart, and Cookbook dependency installation.

What log lines show CPU fallback?

Look for tensors or KV cache layers assigned to CPU, backend_ptrs.size() showing only one backend, CUDA Toolkit not found, cudart missing, or CUDAToolkit discovery failures.

Should I reinstall the serve engine from Cookbook Dependencies?

Yes. The official README recommends reinstalling the serve engine through Cookbook Dependencies when the logs point to a Cookbook or llama.cpp CUDA build issue rather than Docker passthrough.

Can a model still use CPU because it is too large?

Yes. Even with CUDA working, a model that exceeds VRAM may offload less than expected, run slowly, or fail. Test with a smaller quantized model before debugging a large model.

Installation support

Need hands-on installation support?

Setup Helper is the self-service planner. Installation Support is for users who want a human to review the route, diagnose logs, or get Odysseus running. We confirm scope and price before any payment.

Related Guides

GPU Not Detected

Check host driver, Docker runtime, and compose overlay first.

Linux Setup

Set up Linux Docker or native CUDA/ROCm paths cleanly.

Cookbook Dependencies

Fix Cookbook dependency download crashes, tmux, WSL, and VRAM detection.

Hardware Requirements

Choose a model size that fits your VRAM and RAM.

Docker Setup

Enable the right NVIDIA or AMD compose overlay.

Troubleshooting

Return to the main setup issue checklist.