Ollama — Reference
Links
- Website: https://ollama.com
- API docs: https://github.com/ollama/ollama/blob/main/docs/api.md
- OpenAI compatibility: https://docs.ollama.com/api/openai-compatibility
- Model library: https://ollama.com/library
- Python client: https://github.com/ollama/ollama-python
- GitHub: https://github.com/ollama/ollama
API
Ollama exposes a REST API at http://192.168.1.107:11434. No authentication required (LAN-only access).
Generate (single-turn completion)
curl -s http://192.168.1.107:11434/api/generate -d '{
"model": "llama3.1",
"prompt": "Explain ZFS in one sentence",
"stream": false
}' | python3 -c "import json,sys; print(json.load(sys.stdin)['response'])"
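The same call can be made from a script with only the Python standard library. This is a minimal sketch, not an official client: the helper names `build_generate_request` and `generate` are hypothetical, and it assumes the host above is reachable from wherever the script runs.

```python
import json
import urllib.request

OLLAMA_URL = "http://192.168.1.107:11434"  # host used throughout this document


def build_generate_request(model, prompt, base_url=OLLAMA_URL):
    """Build (but do not send) a non-streaming POST to /api/generate."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        f"{base_url}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model, prompt, timeout=120):
    """Send the request and return the text held in the 'response' field."""
    req = build_generate_request(model, prompt)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)["response"]
```

Calling `generate("llama3.1", "Explain ZFS in one sentence")` should return the same text the curl pipeline prints. The timeout of 120 seconds is an arbitrary choice to allow for a cold model load.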
Chat (multi-turn conversation)
curl -s http://192.168.1.107:11434/api/chat -d '{
"model": "llama3.1",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What is restic?"}
],
"stream": false
}' | python3 -c "import json,sys; print(json.load(sys.stdin)['message']['content'])"
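The chat endpoint is stateless, so a multi-turn conversation means resending the full message history on every call. The sketch below shows one way to keep that history in Python; `chat_payload`, `record_reply`, and `send_chat` are hypothetical helper names, and the response shape is taken from the `['message']['content']` lookup in the curl example.

```python
import json
import urllib.request

OLLAMA_URL = "http://192.168.1.107:11434"  # host used throughout this document


def chat_payload(model, messages):
    """Return the JSON body for a non-streaming /api/chat call."""
    return {"model": model, "messages": list(messages), "stream": False}


def record_reply(messages, response_body):
    """Append the assistant message from a parsed /api/chat response."""
    reply = response_body["message"]
    messages.append({"role": reply["role"], "content": reply["content"]})
    return reply["content"]


def send_chat(model, messages, base_url=OLLAMA_URL, timeout=120):
    """POST the history, store the reply in it, and return the reply text."""
    req = urllib.request.Request(
        f"{base_url}/api/chat",
        data=json.dumps(chat_payload(model, messages)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return record_reply(messages, json.load(resp))
```

To continue a conversation, append the next `{"role": "user", ...}` entry to the same list and call `send_chat` again; the earlier turns travel with it.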
Generate embeddings
curl -s http://192.168.1.107:11434/api/embed -d '{
"model": "llama3.1",
"input": "homelab backup strategy"
}'
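Embeddings are usually compared with cosine similarity. The sketch below ranks documents against a query using that measure. It assumes the response carries its vectors under an `embeddings` key (a list of lists), and the two-dimensional vectors are made-up stand-ins for real model output; `cosine_similarity` and `rank_by_similarity` are hypothetical helper names.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vec, doc_vecs):
    """Return (index, score) pairs sorted from most to least similar."""
    scores = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(doc_vecs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy stand-in for a parsed /api/embed response (assumed shape, invented numbers).
sample_response = {"embeddings": [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]]}
query, *docs = sample_response["embeddings"]
ranking = rank_by_similarity(query, docs)
```

In real use the query and the documents would be embedded by the same model, since vectors from different models are not comparable.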
List local models
curl -s http://192.168.1.107:11434/api/tags | python3 -c \
"import json,sys; [print(f'{m[\"name\"]:40} {m[\"size\"]/1e9:.1f}GB') for m in json.load(sys.stdin)['models']]"
Show model details
curl -s http://192.168.1.107:11434/api/show -d '{"model": "llama3.1"}'
Pull a model
curl -s http://192.168.1.107:11434/api/pull -d '{"model": "gemma2:9b", "stream": false}'
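With "stream": true the pull endpoint emits one JSON object per line, which makes download progress easy to compute. The sketch below parses such a stream. It assumes progress lines carry `status`, `total`, and `completed` fields; the sample lines are invented for illustration, and `pull_progress` is a hypothetical helper name.

```python
import json


def pull_progress(lines):
    """Yield (status, percent) for each NDJSON line of a streaming pull.

    percent is None when a line carries no byte counters.
    """
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue  # skip blank keep-alive lines
        event = json.loads(raw)
        total = event.get("total")
        completed = event.get("completed", 0)
        percent = round(100 * completed / total, 1) if total else None
        yield event.get("status", ""), percent


# Invented sample lines standing in for the server's streamed output.
sample_stream = [
    '{"status": "pulling manifest"}',
    '{"status": "downloading", "total": 2000, "completed": 500}',
    '{"status": "success"}',
]
events = list(pull_progress(sample_stream))
```

Against a live server, the same generator can be fed the decoded lines of the HTTP response body instead of `sample_stream`.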
Delete a model
curl -s -X DELETE http://192.168.1.107:11434/api/delete -d '{"model": "gemma2:9b"}'
List running models (loaded in memory)
curl -s http://192.168.1.107:11434/api/ps | python3 -c \
"import json,sys; [print(f'{m[\"name\"]} {m[\"size_vram\"]/1e9:.1f}GB VRAM') for m in json.load(sys.stdin)['models']]"
Health check
curl -s http://192.168.1.107:11434/ # returns "Ollama is running"
Prometheus metrics (not listed in the official API docs; confirm your build exposes this endpoint)
curl -s http://192.168.1.107:11434/api/metrics
OpenAI-Compatible Endpoints
These allow Ollama to be used as a drop-in replacement for OpenAI API clients.
# Chat completions (OpenAI format)
curl -s http://192.168.1.107:11434/v1/chat/completions -d '{
"model": "llama3.1",
"messages": [{"role": "user", "content": "Hello"}]
}'
# List models (OpenAI format)
curl -s http://192.168.1.107:11434/v1/models
# Embeddings (OpenAI format)
curl -s http://192.168.1.107:11434/v1/embeddings -d '{
"model": "llama3.1",
"input": "some text"
}'
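Existing OpenAI client code typically only needs its base URL pointed at the `/v1` prefix. To show the request and response shapes without a third-party dependency, the sketch below uses the standard library. `build_chat_completion_request` and `first_choice_text` are hypothetical helper names, and the response layout assumed is the usual OpenAI `choices[0].message.content` structure.

```python
import json
import urllib.request

BASE_URL = "http://192.168.1.107:11434/v1"  # host used throughout this document


def build_chat_completion_request(model, user_text, base_url=BASE_URL):
    """Build (but do not send) an OpenAI-format chat completion request."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_text}],
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def first_choice_text(response_body):
    """Extract the assistant text from an OpenAI-format response dict."""
    return response_body["choices"][0]["message"]["content"]


def chat_completion(model, user_text, timeout=120):
    """Send the request and return the first choice's text."""
    req = build_chat_completion_request(model, user_text)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return first_choice_text(json.load(resp))
```

Note the difference from the native API: the reply text sits under `choices[0].message.content` rather than `message.content`.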
CLI
Run these over SSH on LXC 107, the container that hosts Ollama.
# List models
ollama list
# Pull a model
ollama pull llama3.1
# Remove a model
ollama rm gemma2:9b
# Run interactive chat
ollama run llama3.1
# Show model info
ollama show llama3.1
# List running models
ollama ps
# Copy/alias a model
ollama cp llama3.1 my-custom-model
# Create a custom model from a Modelfile
ollama create my-model -f Modelfile
# Serve (already running via systemd)
ollama serve
What the API/CLI Cannot Do
| Gap | Workaround |
|---|---|
| No persistent per-model GPU/CPU limit via API | Pass num_gpu and num_thread in each request's options object, or bake them into a model with PARAMETER lines in a Modelfile |
| Cannot schedule model preloading | Use curl .../api/generate -d '{"model":"X","keep_alive":"24h"}' to keep loaded |
| No built-in auth — anyone on LAN can use the API | Firewall rules or reverse proxy auth if needed |
| Cannot fine-tune models via API | Use external tools, then import via ollama create |
| No dedicated disk-usage endpoint | /api/tags and ollama list report each model's size; for total on-disk usage run du -sh /root/.ollama/models/ |
| Cannot limit concurrent requests via API | Set OLLAMA_MAX_LOADED_MODELS and OLLAMA_NUM_PARALLEL env vars |
| Model download progress not available via non-streaming API | Use "stream": true when pulling to see progress |
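The preloading workaround in the table can be scripted so that a scheduler such as cron or a systemd timer warms models at a set time. This is a minimal sketch: it assumes that a generate request with no prompt only loads the model, and `build_preload_request` and `preload` are hypothetical helper names.

```python
import json
import urllib.request

OLLAMA_URL = "http://192.168.1.107:11434"  # host used throughout this document


def build_preload_request(model, keep_alive="24h", base_url=OLLAMA_URL):
    """Build a prompt-less /api/generate call that asks the server to
    load the model and keep it resident for keep_alive."""
    payload = {"model": model, "keep_alive": keep_alive}
    return urllib.request.Request(
        f"{base_url}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def preload(models, keep_alive="24h", timeout=600):
    """Send one preload request per model, in order."""
    for model in models:
        req = build_preload_request(model, keep_alive)
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            resp.read()  # drain the body; the content is not needed
```

Running `preload(["llama3.1"])` from a scheduled job approximates scheduled preloading. How many models stay resident at once is still bounded by OLLAMA_MAX_LOADED_MODELS and available memory.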