RFC — Grafana Dashboards, Warp CLI & Linting
Generated: 2026-03-23
1. Grafana Dashboard Proposals
Based on current observability data:
Already flowing into Prometheus:
- node-exporter on LXC 103 (192.168.1.22:9100) and LXC 108 (192.168.1.108:9100)
- cAdvisor on LXC 103 (192.168.1.22:9091)
- Ollama metrics (192.168.1.107:11434/api/metrics)
- Prometheus self-scrape
Already flowing into Loki:
- All Docker containers on LXC 103 via Promtail
- Forgejo runner (LXC 101) via systemd Promtail
- Forgejo (LXC 100) via cron Python log pusher
Dashboard A: Homelab Overview
Works today (partial). The single pane of glass — land here first.
| Panel | Type | Query |
|---|---|---|
| Host CPU % used (LXC 103 + 108) | Stat/Gauge | 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) |
| Host memory % used | Gauge | (1 - node_memory_MemAvailable_bytes/node_memory_MemTotal_bytes) * 100 |
| Disk usage (103 + 108) | Bar gauge | (node_filesystem_size_bytes - node_filesystem_avail_bytes) / node_filesystem_size_bytes * 100 |
| Docker container count | Stat | count(container_last_seen{name!=""}) |
| Container restart storm (last 1h) | Stat | sum(changes(container_start_time_seconds{name!=""}[1h])) |
| Recent log errors (all services) | Logs panel | {host="docker-host"} \|= "error" \| json \| __error__="" |
| Loki ingestion rate | Time series | sum(rate(loki_ingester_samples_ingested_total[5m])) |
| Prometheus targets up | Stat | count(up == 1) vs count(up == 0) |
Dashboard B: Host Metrics (Node Exporter)
Works today.
Import Grafana community dashboard ID 1860 ("Node Exporter Full"), which covers everything out of the box.
Add a template variable to switch between 192.168.1.22:9100 and 192.168.1.108:9100.
Key panels included: CPU steal/iowait, memory pressure, disk I/O saturation, network throughput, filesystem fill prediction, load average.
Gap: Only LXC 103 and 108 have node-exporter. Add to other LXCs when needed. Priority: LXC 101 (runner), LXC 106 (Vault).
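Closing the gap noted above could look like the following additions to the node-exporter scrape job. This is a sketch only: the job name `node`, the `role` labels, and the `<lxc-101-ip>` placeholder are assumptions, and 192.168.1.106 is inferred from the Vault address used elsewhere in this RFC.

```yaml
# Fragment for the scrape_configs list in prometheus-config.yml.
# Assumes node-exporter has already been installed on each LXC (default port 9100).
- job_name: node                      # hypothetical job name; reuse your existing one
  static_configs:
    - targets: ['<lxc-101-ip>:9100']  # LXC 101 runner (placeholder address)
      labels:
        role: runner                  # hypothetical label for dashboard filtering
    - targets: ['192.168.1.106:9100'] # LXC 106 Vault host (address inferred)
      labels:
        role: vault
```

With extra labels like these, the dashboard's template variable can filter by role as well as by instance.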
Dashboard C: Docker Containers (cAdvisor)
Works today.
Import Grafana community dashboard ID 14282 or 193.
Custom panels to add on top:
| Panel | Query |
|---|---|
| Container CPU top-5 | topk(5, sum by(name) (rate(container_cpu_usage_seconds_total{name!=""}[5m]))) |
| Container OOM kills | increase(container_oom_events_total[1h]) |
| Container restarts (per container) | changes(container_start_time_seconds{name!=""}[24h]) |
| Memory used vs. limit % | container_memory_usage_bytes{name!=""} / (container_spec_memory_limit_bytes{name!=""} > 0) * 100 |
Dashboard D: CI/CD — Forgejo Actions
Works today (Loki-based).
Adjust {job="forgejo"} to match the labels your Python log pusher applies.
| Panel | Query |
|---|---|
| Workflow runs (last 24h) | count_over_time({job="forgejo"} \|= "workflow_run" [24h]) |
| Failed workflows | count_over_time({job="forgejo"} \|= "conclusion=failure" [24h]) |
| Workflow run log stream | {job="forgejo"} \|= "workflow_run" |
| Runner errors | {job="runner"} \|= "error" |
| Runner task pickups (last 5m) | count_over_time({job="runner"} \|= "pick up task" [5m]) |
| Recent log tail | Logs panel, {job="forgejo"}, last 50 lines |
Stretch: Forgejo can expose /metrics (Prometheus format) natively once metrics are enabled in app.ini ([metrics] ENABLED = true). Adding it as a scrape target gives richer data: push events, active runners, repo counts. After that it just needs a target entry in prometheus-config.yml.
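A minimal sketch of that target entry, assuming the metrics endpoint has been enabled on the Forgejo side. The job name matches the `{job="forgejo"}` label used in the Loki queries above, but the host placeholder and the port are assumptions to check against your instance.

```yaml
# Fragment for the scrape_configs list in prometheus-config.yml.
- job_name: forgejo
  metrics_path: /metrics
  static_configs:
    - targets: ['<forgejo-host>:3000']  # placeholder host; 3000 is Forgejo's default HTTP port
  # If a metrics token is configured in Forgejo, pass it as a bearer credential:
  # authorization:
  #   type: Bearer
  #   credentials: <forgejo_metrics_token>
```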
Dashboard E: Ollama / LLM Inference
Works today — Prometheus already scrapes 192.168.1.107:11434/api/metrics.
Verify exact metric names first by running {job="ollama"} in Grafana Explore against the Prometheus data source, since Ollama's metric schema has changed over time.
| Panel | Query |
|---|---|
| Active inference requests | ollama_requests_in_flight |
| Metrics endpoint scrape rate (counts /metrics hits, not inference requests) | rate(promhttp_metric_handler_requests_total{job="ollama"}[5m]) |
| Model load duration | histogram from ollama_model_load_duration_seconds |
| Go GC pressure | rate(go_gc_duration_seconds_count[5m]) |
| Memory (Go heap) | go_memstats_heap_inuse_bytes |
| GPU vRAM (if available) | ollama_gpu_memory_used_bytes |
Dashboard F: Observability Stack Self-Monitoring
Works today.
| Panel | Query |
|---|---|
| Prometheus ingestion rate | rate(prometheus_tsdb_samples_appended_total[5m]) |
| Prometheus storage size | prometheus_tsdb_storage_blocks_bytes |
| Prometheus query duration p99 | histogram_quantile(0.99, rate(prometheus_engine_query_duration_seconds_bucket[5m])) |
| Active Prometheus scrape targets | count(up) |
| Loki log lines/sec | sum(rate(loki_ingester_samples_ingested_total[5m])) |
| Loki active streams | loki_ingester_memory_streams |
| Promtail send rate | rate(promtail_sent_entries_total[5m]) |
| Grafana active users (log-based) | count_over_time({container_name="grafana"} \|= "login" [5m]) |
Dashboard G: Security & Auth
Needs Vault scrape target added to Prometheus.
Enable Vault metrics — add to services/loki-stack/prometheus-config.yml:
    - job_name: vault
      static_configs:
        - targets: ['192.168.1.106:8200']
      metrics_path: /v1/sys/metrics
      params:
        format: ['prometheus']
      bearer_token: <vault_read_token>
| Panel | Query |
|---|---|
| Vault seal status | vault_core_unsealed (1 = unsealed) |
| KV secret count | vault_secret_kv_count (a gauge of stored KV entries, not a read rate) |
| Token lookup rate | rate(vault_token_lookup_count[5m]) (verify the exact name in Explore) |
| PocketID login attempts (log) | {container_name="pocketid"} \|= "login" |
| PocketID auth failures (log) | {container_name="pocketid"} \|= "unauthorized" |
| Vaultwarden logins (log) | {container_name="vaultwarden"} \|~ "User .* logged in" |
Dashboard H: Application Services (Logs-based)
Works today — all Docker containers on LXC 103 ship logs via Promtail.
Adjust container_name labels to match what Promtail actually assigns (check with a Loki label browser).
| Service | Panel | LogQL |
|---|---|---|
| n8n | Workflow executions | {container_name="n8n"} \|= "Execution finished" |
| n8n | Workflow errors | {container_name="n8n"} \|= "error" \| json |
| Open WebUI | Active chat sessions | {container_name="open-webui"} \|= "chat" |
| Matrix | Federation errors | {container_name="synapse"} \|= "ERROR" |
| The Lounge | Connected users | {container_name="thelounge"} \|= "connected" |
| Seedbox | Download completions | {container_name="qbittorrent"} \|= "Torrent finished" |
| Gluetun | VPN reconnects | {container_name="gluetun"} \|= "Connected" |
| Caddy | Upstream errors | Not yet — needs Promtail on LXC 105 |
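To close the Caddy gap in the last row, a Promtail instance on LXC 105 could be configured roughly as follows. This is a sketch under assumptions: the Loki address, the positions file location, the label values, and the Caddy log path are all placeholders that need to match your setup.

```yaml
# promtail-config.yml on LXC 105 (hypothetical file name)
server:
  http_listen_port: 9080
  grpc_listen_port: 0
positions:
  filename: /var/lib/promtail/positions.yaml    # assumed writable path
clients:
  - url: http://<loki-host>:3100/loki/api/v1/push  # placeholder Loki address
scrape_configs:
  - job_name: caddy
    static_configs:
      - targets: [localhost]
        labels:
          job: caddy                          # hypothetical label values
          host: caddy
          __path__: /var/log/caddy/*.log      # assumed Caddy log location
```

Once logs arrive, the Caddy row could use a query such as {job="caddy"} with a filter for upstream errors.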
Dashboard I: Uptime / Gatus
Needs Gatus scrape target added to Prometheus.
Gatus exposes /metrics (Prometheus format) on port 8080 when metrics: true is set in its config. Add to prometheus-config.yml:
    - job_name: gatus
      static_configs:
        - targets: ['192.168.1.22:8080']
| Panel | Query |
|---|---|
| Service availability % | avg(gatus_results_success) by (name) * 100 |
| Response time p95 | histogram_quantile(0.95, rate(gatus_results_duration_ms_bucket[5m])) |
| Down services count | count(gatus_results_success == 0) |
| Per-service time series | gatus_results_success{name=~".+"} |
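This gives Gatus data proper Grafana visualization, much richer than Gatus's built-in UI. Check the metric names above against your Gatus version's /metrics output before building panels, since the exported names may differ.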
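On the Gatus side, the metrics endpoint is switched on in Gatus's own config file. The fragment below is a sketch of what that could look like; the endpoint entry is purely illustrative and not taken from the existing Gatus config.

```yaml
# Gatus config.yaml (fragment)
metrics: true                 # serve Prometheus metrics at /metrics
endpoints:
  - name: example             # hypothetical endpoint, for illustration only
    url: https://example.org/health
    interval: 60s
    conditions:
      - "[STATUS] == 200"
```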
Exporter Gap Summary
| LXC / Service | Missing | Priority |
|---|---|---|
| Vault (106) | Prometheus scrape of /v1/sys/metrics | High |
| Forgejo (100) | Prometheus scrape of /metrics | High |
| Gatus | Prometheus scrape of /metrics | High |
| Runner (101) | node-exporter | Medium |
| Caddy (105) | Promtail + Caddy metrics | Medium |
| Gluetun (110) | node-exporter | Low |
| Ollama GPU | Verify metric names exist in Prometheus | Medium |
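The High-priority items are config changes only, with no new software to deploy. The node-exporter and Promtail rows additionally require installing an agent on the target LXC.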
2. Warp CLI — Worth It?
There are two distinct products:
- Warp Terminal: full terminal app replacement (macOS/Linux/Windows), block-based output, AI assistant, MCP integration
- oz CLI: headless agent runner; runs Warp AI agents in scripts, CI/CD pipelines, or any environment without the terminal app
Terminal App
Worth it if:
- You want MCP-native AI assistance in the terminal (Grafana, Forgejo, Proxmox MCPs surface directly in your terminal session)
- Block-based output appeals (each command's output is a discrete block you can search, copy, or share; useful for long Ansible runs and Docker build logs)
- You use Warp Drive to store/share frequent commands (Vault write workflow, Proxmox queries, n8n API calls)
- You want WARP.md project context files per repo (analogous to CLAUDE.md)

Not worth it if:
- You use tmux: Warp kills tmux compatibility. Hard blocker.
- Privacy/closed-source is a concern: closed source, free tier requires telemetry for AI features, all AI routes through GCP (US)
- Your current terminal (iTerm2/Ghostty/Kitty + zsh) is already well-configured
oz CLI (the more interesting piece)
Warp's headless agent runner — runs AI agents with MCP access inside Forgejo Actions pipelines. Drop oz agent run into a workflow job, give it your Grafana/Forgejo/Vault MCP servers, and it can reason about infra mid-pipeline (e.g., check Grafana for anomalies before deploying, open a Forgejo issue on failure).
This is new capability that doesn't exist elsewhere for a self-hosted setup.
Privacy & Drawbacks
- Account required for AI features
- Free tier: telemetry must be enabled for AI (paid plans can opt out)
- All AI + Warp Drive sync routes through GCP US — not air-gap friendly
- Closed source (vs iTerm2, Kitty, Ghostty, Alacritty which are all open source)
- Higher resource usage than lightweight terminals
Recommendation
Try oz CLI experimentally in one Forgejo workflow. The ability to run an AI agent with MCP context inside CI is novel and worth a spike on a non-critical workflow.
Terminal app: Skip for now if tmux is part of your workflow. Revisit if that changes.
3. Linting Strategy
Recommended Stack
| Layer | Tool | What it catches |
|---|---|---|
| Pre-commit (local) | pre-commit framework | Fast feedback before commit |
| YAML | yamllint | All YAML syntax + style |
| Ansible | ansible-lint (profile: moderate) | Playbook correctness, FQCN, idempotence |
| Shell | shellcheck | Script bugs, quoting issues |
| IaC security | trivy fs | Dockerfile/compose misconfigs + hardcoded secrets |
| Docker Compose | docker compose config | Syntax + variable resolution |
| Python | ruff | Fast replacement for flake8+isort |
| PR comments | reviewdog | Turns lint output into inline Forgejo PR annotations |
Config Files
.ansible-lint:
    profile: moderate
    warn_list:
      - yaml[line-length]
    skip_list:
      - experimental
    exclude_paths:
      - .git/
.yamllint.yml:
    extends: default
    rules:
      line-length:
        max: 120
        allow-non-breakable-inline-mappings: true
      truthy:
        allowed-values: ["true", "false"]
        check-keys: false
.pre-commit-config.yaml:
    repos:
      - repo: https://github.com/pre-commit/pre-commit-hooks
        rev: v4.6.0
        hooks:
          - id: trailing-whitespace
          - id: end-of-file-fixer
          - id: check-yaml
          - id: check-merge-conflict
          - id: detect-private-key
      - repo: https://github.com/adrienverge/yamllint
        rev: v1.35.1
        hooks:
          - id: yamllint
            args: [-c, .yamllint.yml]
      - repo: https://github.com/ansible/ansible-lint
        rev: v25.1.3
        hooks:
          - id: ansible-lint
      - repo: https://github.com/shellcheck-py/shellcheck-py
        rev: v0.10.0.1
        hooks:
          - id: shellcheck
            args: [--severity=warning]
      - repo: https://github.com/gitleaks/gitleaks
        rev: v8.21.2
        hooks:
          - id: gitleaks
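The recommended stack lists ruff for Python, but the pre-commit config above has no Python hook. One way to add it is sketched below, to be appended under `repos:`. The pinned `rev` is only an example, so pin whichever release you have verified.

```yaml
# Additional entry for the repos list in .pre-commit-config.yaml
- repo: https://github.com/astral-sh/ruff-pre-commit
  rev: v0.6.9            # example pin, not a recommendation
  hooks:
    - id: ruff           # linter (covers flake8/isort-style rules)
      args: [--fix]      # apply safe autofixes locally
    - id: ruff-format    # formatter
```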
Forgejo Workflows
.forgejo/workflows/lint.yml — runs on all pushes and PRs:
    name: Lint
    on: [push, pull_request]
    jobs:
      yaml:
        runs-on: native
        container:
          image: pipelinecomponents/yamllint:latest
        steps:
          - uses: actions/checkout@v4
          - run: yamllint -c .yamllint.yml .
      ansible:
        runs-on: native
        container:
          image: ghcr.io/ansible/ansible-lint:latest
        steps:
          - uses: actions/checkout@v4
          - run: ansible-lint --profile moderate
      shellcheck:
        runs-on: native
        steps:
          - uses: actions/checkout@v4
          - run: |
              apt-get update -qq && apt-get install -y -qq shellcheck
              # -r: skip shellcheck entirely when no *.sh files are found
              find . -name "*.sh" -print0 | xargs -0 -r shellcheck --severity=warning
      compose-validate:
        runs-on: native
        steps:
          - uses: actions/checkout@v4
          - run: |
              find . -name "docker-compose*.yml" -print0 | while IFS= read -r -d '' f; do
                echo "Validating $f"
                docker compose -f "$f" config --quiet
              done
.forgejo/workflows/security.yml — runs on push to main and PRs:
    name: Security Scan
    on:
      push:
        branches: [main]
      pull_request:
    jobs:
      trivy:
        runs-on: native
        container:
          image: aquasec/trivy:latest
        steps:
          - uses: actions/checkout@v4
          - name: Scan IaC and secrets
            run: |
              trivy fs . \
                --scanners misconfig,secret \
                --severity HIGH,CRITICAL \
                --exit-code 1
On MegaLinter
MegaLinter bundles 100+ linters into one Docker image and can auto-fix and commit back. It works in Forgejo Actions via docker run (not via the marketplace action, which has resolution issues in Forgejo).
Tradeoff: Large image, requires Docker-in-Docker on your runner, harder to debug. The granular per-job approach above is easier to manage, parallelizes naturally, and caches better. Consider MegaLinter later if you want unified HTML reports.
MegaLinter flavor for this stack: oxsecurity/megalinter-ci_light:v8.
On AI Review Agents
For a self-hosted Forgejo setup, two options are viable today:
- reviewdog: posts inline diff annotations on Forgejo PRs from deterministic linter output. No AI involved but dramatically better UX than CI pass/fail. Supports the gitea reporter natively.
- ai-review (Nikita-Filonov): AI-powered PR review comments using Ollama (fully on-prem), Claude, or GPT-4. Posts inline Gitea/Forgejo PR comments. Experimental but working.
Both are worth setting up once baseline linting is clean, because AI review is most useful when it's not competing with dozens of style violations.
Adoption Order
- Add .ansible-lint + .yamllint.yml to the repo (local config only, zero CI impact)
- Install pre-commit locally, run pre-commit run --all-files to see baseline
- Fix highest-severity violations
- Add lint.yml workflow in soft-fail mode first (--soft-fail where the tool supports it, otherwise ignore the exit code)
- Remove soft-fail once the baseline is clean
- Add security.yml
- Add reviewdog for inline PR comments
- Experiment with ai-review + Ollama
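The soft-fail step in the list above can be sketched with plain shell, without relying on any linter-specific flag. The job below is the yamllint job from lint.yml with the exit code ignored. Treat it as one possible approach rather than the required one.

```yaml
# Soft-fail variant of the yamllint job from .forgejo/workflows/lint.yml
jobs:
  yaml:
    runs-on: native
    container:
      image: pipelinecomponents/yamllint:latest
    steps:
      - uses: actions/checkout@v4
      # "|| true" reports findings in the job log but never fails the job.
      - run: yamllint -c .yamllint.yml . || true
```

Removing `|| true` turns the job back into an enforcing gate once the baseline is clean.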
Pre-commit vs CI division of responsibility
| Layer | Runs when | Speed | Purpose |
|---|---|---|---|
| Pre-commit | git commit (local) | Fast, staged files only | Immediate feedback, block bad commits early |
| CI (Forgejo Actions) | Push / PR | Full repo scan | Authoritative gate, environment-independent |
Never rely solely on pre-commit — git commit --no-verify bypasses it. CI is the real enforcement layer.