Llama.cpp Interactive App#
Before you begin#
Llama.cpp is an inference framework designed to run Large Language Models (LLMs) efficiently on CPUs or GPUs. It supports many families of open models and is optimized for high-performance and on-premises environments.
LLMs used with llama.cpp must be provided in the GGUF format. GGUF is an optimized binary format containing model weights, tokenizer data, and metadata required for inference. It loads quickly and is adapted to heterogeneous hardware.
Running the Llama.cpp Interactive App on our On Demand service is billed like any regular job executed on the Kuma cluster. Standard resource accounting applies according to requested CPUs, GPUs, memory, and runtime.
Form parameters#
When launching a llama.cpp session, the form allows you to configure all required parameters for the job.
Job parameters#
Cluster: Select the SCITAS cluster on which your llama.cpp session will run.
Account: Select the project or accounting entity under which the job will run.
Queue: Choose the cluster partition (node type) on which the job should be scheduled.
Number of hours: Define the maximum walltime for your session.
Number of GPUs per job: Choose how many GPUs to allocate to your job. Larger models need more GPU memory, so they may require more GPUs.
Number of CPUs for the job: Specify how many CPU cores will be allocated for llama.cpp operations.
Job Memory in Gb: Amount of RAM reserved for the job. Adapt this value to the size of the selected LLM.
Accelerators
The CUDA module is required to use our accelerators with llama.cpp.
Llama.cpp parameters#
Available LLMs: Select the GGUF model you want to use. Each model has its own resource requirements.
Custom LLM: Option to provide your own GGUF model instead of selecting one from the list.
LLM Storage
Store your LLM on fast storage such as /scratch so that it loads quickly.
Context window size (tokens) for the prompt: Defines the maximum context length used by the model.
Number of GPU Layers: Determines how many layers are offloaded to the GPU. You may leave this set to auto.
Number of concurrent request slots the server can process (Parallel): Allows llama.cpp to handle multiple requests concurrently (useful for agents).
Maximum number of new tokens to generate per request: Sets the upper limit of tokens for model outputs.
Extra arguments to pass to llama.cpp: Optional extra arguments such as --verbose, --no-jinja, --metrics, etc.
Once all parameters are set, click Launch to start the job.
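For orientation, the llama.cpp fields above resemble standard llama-server command-line options. The sketch below shows one plausible mapping. It is an assumption, not the app's actual launch script (which is not shown on this page); the helper name build_llama_cmd and all example values are hypothetical.

```shell
# Sketch only: print a llama-server command line built from form-like values.
# build_llama_cmd is a hypothetical helper; the app's real launch script may differ.
build_llama_cmd() {
  model=$1      # path to the GGUF file ("Available LLMs" / "Custom LLM")
  ctx=$2        # "Context window size (tokens) for the prompt"
  ngl=$3        # "Number of GPU Layers"
  slots=$4      # "Number of concurrent request slots" (Parallel)
  max_new=$5    # "Maximum number of new tokens to generate per request"
  shift 5       # anything left over is treated as "Extra arguments"
  printf 'llama-server --model %s --ctx-size %s --n-gpu-layers %s --parallel %s --n-predict %s' \
    "$model" "$ctx" "$ngl" "$slots" "$max_new"
  for arg in "$@"; do printf ' %s' "$arg"; done
  printf '\n'
}

# Example with made-up values; the model path is a placeholder.
build_llama_cmd /scratch/models/model.gguf 32768 99 4 2048 --metrics
```

Printing the command instead of running it makes it easy to see how a change in the form would alter the server invocation under this assumed mapping.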
Advanced Parameters#
Activate the Advanced Parameters
Each time you run a new Llama.cpp interactive app, you need to enable the Advanced Parameters manually.
Static port: Set a static port to access llama.cpp.
Port Availability
The port must be available (not used by another service) and between 49152 and 65535.
Static API Key: Define your own API key to use with the llama.cpp OpenAI-compatible API. Use only alphanumeric characters.
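Before submitting the form, it can help to check the two advanced values locally. The following is a minimal sketch with hypothetical helper names (valid_static_port, valid_api_key). It only checks the port range and the alphanumeric rule stated above. It does not check whether the port is free on the compute node, which cannot be known before the job starts.

```shell
# valid_static_port PORT: succeeds if PORT is an integer between 49152 and 65535.
valid_static_port() {
  case $1 in
    ''|*[!0-9]*) return 1 ;;   # empty, or contains a non-digit character
  esac
  [ "$1" -ge 49152 ] && [ "$1" -le 65535 ]
}

# valid_api_key KEY: succeeds if KEY is non-empty and contains only letters and digits.
valid_api_key() {
  case $1 in
    ''|*[!A-Za-z0-9]*) return 1 ;;
  esac
  return 0
}

# Example values are made up for illustration.
valid_static_port 50123 && echo "port ok"
valid_api_key "MyKey2024" && echo "key ok"
valid_api_key "my-key" || echo "key rejected: only letters and digits are allowed"
```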
Web UI access#
After your job is running, click the Connect button from the job page to access the Web UI connection details.
The connection panel shows:
- The hostname of the node where llama.cpp is running
- The listening port used by the Web UI and the API
- The API-Key associated with the session
Connecting llama.cpp#
From there, you can:
- Open the Llama.cpp Web UI directly in your browser
- Share the access link with collaborators
- Use the provided information to access the llama.cpp API (inside the Kuma network).
The session data for llama.cpp is also available in the session's data root directory, which is linked from the job page (click the Session ID).
Llama.cpp log
You can find the llama.cpp log in the session’s root directory, in the file named output.log
API#
You can use the llama.cpp server's OpenAI-compatible API to run inference from your own tools and agents.
Access#
When you run the llama.cpp interactive app, a randomly generated (or statically defined) API key is provided. Use it to authenticate against the llama.cpp API.
From the cluster frontend, you can reach your llama.cpp instance using the host, port number, and API key.
$ curl http://<HOST>:<PORT>/v1/chat/completions \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer <API-KEY>" \
    -d '{ "messages": [{"role": "user", "content": "Tell me a fun fact."}] }'
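If you call the API repeatedly, a few small shell helpers save retyping the host, port, and key. This is a sketch under stated assumptions: the variable names (LLAMA_HOST, LLAMA_PORT, LLAMA_API_KEY), their placeholder defaults, and the function names are all hypothetical. Take the real values from the connection details of your session.

```shell
# Connection details of your session; names and defaults here are placeholders.
LLAMA_HOST=${LLAMA_HOST:-localhost}
LLAMA_PORT=${LLAMA_PORT:-8080}
LLAMA_API_KEY=${LLAMA_API_KEY:-changeme}

# llama_url PATH: print the full URL for an API path such as /v1/models.
llama_url() {
  printf 'http://%s:%s%s\n' "$LLAMA_HOST" "$LLAMA_PORT" "$1"
}

# llama_get PATH: authenticated GET request.
llama_get() {
  curl --silent --show-error --fail \
    -H "Authorization: Bearer $LLAMA_API_KEY" \
    "$(llama_url "$1")"
}

# llama_post PATH JSON: authenticated POST request with a JSON body.
llama_post() {
  curl --silent --show-error --fail \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer $LLAMA_API_KEY" \
    -d "$2" \
    "$(llama_url "$1")"
}

# Usage (needs a running session, so it is left commented out here):
#   llama_get /v1/models
#   llama_post /v1/chat/completions '{"messages":[{"role":"user","content":"Hi"}],"max_tokens":64}'
```

The --fail option makes curl return a non-zero exit status on HTTP errors such as a rejected API key, which is convenient in scripts.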
OpenCode#
OpenCode is an open source agent that helps you write code in your terminal, IDE, or desktop.
You can use the following example configuration to connect OpenCode to the llama.cpp OOD app:
- Set up the OpenCode provider configuration:
{
  "$schema": "https://opencode.ai/config.json",
  "permission": {
    "edit": "ask",
    "bash": {
      "*": "ask",
      "git *": "allow",
      "grep *": "allow",
      "git commit*": "deny",
      "git push*": "deny",
      "git reset*": "deny",
      "git apply*": "deny",
      "git revert*": "deny",
      "git checkout*": "deny"
    }
  },
  "provider": {
    "llamacpp": {
      "npm": "@ai-sdk/openai-compatible",
      "name": "llama.cpp (local)",
      "options": {
        "baseURL": "http://127.0.0.1:8080/v1",
        "apiKey": "<API-KEY>"
      },
      "models": {
        "Qwen3.5-122B-A10B_Q5_K_M.gguf": {
          "name": "Qwen3.5-122B-A10B Q5_K_M (Local)",
          "limit": {
            "context": 262144,
            "output": 131072
          }
        }
      }
    }
  },
  "model": "llamacpp/Qwen3.5-122B-A10B_Q5_K_M.gguf",
  "small_model": "llamacpp/Qwen3.5-122B-A10B_Q5_K_M.gguf"
}
  <API-KEY>: Use your randomly generated (or statically defined) API key
- Create an SSH tunnel to the node running the llama.cpp interactive app, forwarding local port 8080 (the port used in baseURL above) to <HOST>:<PORT>
  <HOST>: The node running the app
  <PORT>: Use your randomly generated (or static) port
- Run OpenCode
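The tunnel step above can be sketched with standard OpenSSH local port forwarding. The helper name tunnel_cmd and the <USER>@<FRONTEND> login are assumptions for illustration; use the cluster frontend and account you normally connect with. The local port defaults to 8080 to match the baseURL in the example configuration.

```shell
# tunnel_cmd HOST PORT LOGIN [LOCAL_PORT]: print an ssh port-forwarding command.
# LOGIN is user@frontend for a login node you can reach; it is a placeholder here.
tunnel_cmd() {
  host=$1
  port=$2
  login=$3
  local_port=${4:-8080}
  # -N: do not run a remote command; -L: forward local_port to host:port via LOGIN
  printf 'ssh -N -L %s:%s:%s %s\n' "$local_port" "$host" "$port" "$login"
}

tunnel_cmd '<HOST>' '<PORT>' '<USER>@<FRONTEND>'
# prints: ssh -N -L 8080:<HOST>:<PORT> <USER>@<FRONTEND>
```

Run the printed command in a separate terminal and keep it open. While the tunnel is up, http://127.0.0.1:8080/v1 on your machine reaches the llama.cpp session.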


