RFC — Grafana Dashboards, Warp CLI & Linting

Generated: 2026-03-23


Table of Contents

  1. Grafana Dashboard Proposals
  2. Warp CLI — Worth It?
  3. Linting Strategy

1. Grafana Dashboard Proposals

Based on current observability data:

Already flowing into Prometheus:

  • node-exporter on LXC 103 (192.168.1.22:9100) and LXC 108 (192.168.1.108:9100)
  • cAdvisor on LXC 103 (192.168.1.22:9091)
  • Ollama metrics (192.168.1.107:11434/api/metrics)
  • Prometheus self-scrape

Already flowing into Loki:

  • All Docker containers on LXC 103 via Promtail
  • Forgejo runner (LXC 101) via systemd Promtail
  • Forgejo (LXC 100) via cron Python log pusher


Dashboard A: Homelab Overview

Works today (partial). The single pane of glass — land here first.

| Panel | Type | Query |
| --- | --- | --- |
| Host CPU % used (LXC 103 + 108) | Stat/Gauge | `100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)` |
| Host memory % used | Gauge | `(1 - node_memory_MemAvailable_bytes/node_memory_MemTotal_bytes) * 100` |
| Disk usage % (103 + 108) | Bar gauge | `(node_filesystem_size_bytes - node_filesystem_avail_bytes) / node_filesystem_size_bytes * 100` |
| Docker container count | Stat | `count(container_last_seen{name!=""})` |
| Container restart storm (last 1h) | Stat | `sum(changes(container_start_time_seconds{name!=""}[1h]))` |
| Recent log errors (all services) | Logs panel | `{host="docker-host"} \|= "error" \| json \| __error__=""` |
| Loki ingestion rate | Time series | `sum(rate(loki_distributor_lines_received_total[5m]))` |
| Prometheus targets up | Stat | `count(up == 1)` vs `count(up == 0)` |

The restart panel uses changes() because container_start_time_seconds is a gauge holding a timestamp; increase() is only meaningful on counters.

Dashboard B: Host Metrics (Node Exporter)

Works today.

Import Grafana community dashboard ID 1860 ("Node Exporter Full"), which covers everything out of the box. Add a template variable to switch between 192.168.1.22:9100 and 192.168.1.108:9100.

Key panels included: CPU steal/iowait, memory pressure, disk I/O saturation, network throughput, filesystem fill prediction, load average.

Gap: Only LXC 103 and 108 have node-exporter. Add to other LXCs when needed. Priority: LXC 101 (runner), LXC 106 (Vault).
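Closing this gap on the Prometheus side could look like the fragment below. This is a sketch under assumptions: the job name `node` is hypothetical (keep whatever name the existing job uses), the LXC 101 address is a placeholder because it is not stated here, and node-exporter is assumed to listen on its default port 9100 once installed on each LXC.

```yaml
# prometheus-config.yml (fragment, under scrape_configs:)
# Assumption: the existing node-exporter job is called "node".
- job_name: node
  static_configs:
    - targets:
        - '192.168.1.22:9100'    # LXC 103 (already scraped)
        - '192.168.1.108:9100'   # LXC 108 (already scraped)
        - '<lxc101_ip>:9100'     # LXC 101 runner; placeholder address
        - '192.168.1.106:9100'   # LXC 106 Vault; host taken from the Vault scrape target
```

Keeping all hosts in one job means the Dashboard B instance variable picks up the new targets without further changes.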


Dashboard C: Docker Containers (cAdvisor)

Works today.

Import Grafana community dashboard ID 14282 or 193.

Custom panels to add on top:

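| Panel | Query |
| --- | --- |
| Container CPU top-5 | `topk(5, sum by(name) (rate(container_cpu_usage_seconds_total{name!=""}[5m])))` |
| Container OOM kills | `increase(container_oom_events_total[1h])` |
| Container restarts (per container) | `changes(container_start_time_seconds{name!=""}[24h])` |
| Memory used, % of limit | `container_memory_usage_bytes{name!=""} / (container_spec_memory_limit_bytes{name!=""} > 0) * 100` |

The `> 0` filter drops containers without a memory limit, which would otherwise divide by zero.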

Dashboard D: CI/CD — Forgejo Actions

Works today (Loki-based).

Adjust {job="forgejo"} to match the labels your Python log pusher applies.

| Panel | Query |
| --- | --- |
| Workflow runs (last 24h) | `count_over_time({job="forgejo"} \|= "workflow_run" [24h])` |
| Failed workflows | `count_over_time({job="forgejo"} \|= "conclusion=failure" [24h])` |
| Workflow run log stream | `{job="forgejo"} \|= "workflow_run"` |
| Runner errors | `{job="runner"} \|= "error"` |
| Runner tasks picked up (last 5m) | `count_over_time({job="runner"} \|= "pick up task" [5m])` |
| Recent log tail | Logs panel, `{job="forgejo"}`, last 50 lines |

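Stretch: Forgejo can expose /metrics (Prometheus format) natively. The endpoint is off by default and is switched on with ENABLED = true under [metrics] in app.ini. Adding it as a scrape target gives richer data: push events, active runners, repo counts. After that it just needs a target entry in prometheus-config.yml.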
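A minimal sketch of that target entry, assuming the metrics endpoint has been enabled in app.ini ([metrics] ENABLED = true) and that Forgejo listens on its default HTTP port 3000. The host is a placeholder because the LXC 100 address is not given here, and the commented-out token is only relevant if a TOKEN is configured under [metrics].

```yaml
# prometheus-config.yml (fragment, under scrape_configs:)
- job_name: forgejo
  metrics_path: /metrics
  static_configs:
    - targets: ['<forgejo_host>:3000']   # placeholder host; 3000 is the default HTTP port
  # Only needed if a TOKEN is set under [metrics] in app.ini:
  # authorization:
  #   credentials: <metrics_token>
```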


Dashboard E: Ollama / LLM Inference

Works today — Prometheus already scrapes 192.168.1.107:11434/api/metrics.

Verify exact metric names by querying Prometheus Explore (metric_name{job="ollama"}) first, as Ollama's schema has evolved.

| Panel | Query |
| --- | --- |
| Active inference requests | `ollama_requests_in_flight` |
| Metrics endpoint request rate (req/s; counts scrapes of the metrics handler, not inference calls) | `rate(promhttp_metric_handler_requests_total[5m])` |
| Model load duration | histogram from `ollama_model_load_duration_seconds` |
| Go GC pressure | `rate(go_gc_duration_seconds_count[5m])` |
| Memory (Go heap) | `go_memstats_heap_inuse_bytes` |
| GPU VRAM (if available) | `ollama_gpu_memory_used_bytes` |

Dashboard F: Observability Stack Self-Monitoring

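Works today for the Prometheus panels. The Loki and Promtail panels assume Prometheus also scrapes those services' /metrics endpoints, which are not in the list of current targets above.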

| Panel | Query |
| --- | --- |
| Prometheus ingestion rate | `rate(prometheus_tsdb_head_samples_appended_total[5m])` |
| Prometheus storage size | `prometheus_tsdb_storage_blocks_bytes` |
| Prometheus query duration p99 | `prometheus_engine_query_duration_seconds{quantile="0.99"}` (a summary, so the quantile is read directly rather than via histogram_quantile) |
| Active Prometheus scrape targets | `count(up)` |
| Loki log lines/sec | `sum(rate(loki_distributor_lines_received_total[5m]))` |
| Loki active streams | `loki_ingester_memory_streams` |
| Promtail send rate | `rate(promtail_sent_entries_total[5m])` |
| Grafana login events (log-based) | `count_over_time({container_name="grafana"} \|= "login" [5m])` |
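The Loki and Promtail rows only return data if Prometheus scrapes those services. A minimal sketch of the two scrape jobs follows. The host names are hypothetical Docker service names, and 3100 and 9080 are the ports used in the stock Loki and Promtail example configs, so check http_listen_port in your own configs before copying.

```yaml
# prometheus-config.yml (fragment, under scrape_configs:)
# Host names are assumptions; replace with the real container or host addresses.
- job_name: loki
  static_configs:
    - targets: ['loki:3100']       # Loki serves /metrics on its HTTP port
- job_name: promtail
  static_configs:
    - targets: ['promtail:9080']   # Promtail serves /metrics on its HTTP port
```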

Dashboard G: Security & Auth

Needs Vault scrape target added to Prometheus.

Enable Vault metrics — add to services/loki-stack/prometheus-config.yml:

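- job_name: vault
  metrics_path: /v1/sys/metrics
  params:
    format: ['prometheus']
  # Bearer-token auth; the token needs read access to sys/metrics
  authorization:
    credentials: <vault_read_token>
  static_configs:
    - targets: ['192.168.1.106:8200']

| Panel | Query |
| --- | --- |
| Vault seal status | `vault_core_unsealed` (1 = unsealed) |
| Secret read rate | `rate(vault_secret_kv_count[5m])` (verify first: `vault_secret_kv_count` is a gauge of stored KV entries, so this tracks change in secret count rather than reads) |
| Token auth rate | `rate(vault_token_lookup_count[5m])` |
| PocketID login attempts (log) | `{container_name="pocketid"} \|= "login"` |
| PocketID auth failures (log) | `{container_name="pocketid"} \|= "unauthorized"` |
| Vaultwarden logins (log) | `{container_name="vaultwarden"} \|~ "User .* logged in"` |

The Vaultwarden filter uses `|~` because `|=` matches a literal substring and would not treat `.*` as a regex.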

Dashboard H: Application Services (Logs-based)

Works today — all Docker containers on LXC 103 ship logs via Promtail.

Adjust container_name labels to match what Promtail actually assigns (check with a Loki label browser).

| Service | Panel | LogQL |
| --- | --- | --- |
| n8n | Workflow executions | `{container_name="n8n"} \|= "Execution finished"` |
| n8n | Workflow errors | `{container_name="n8n"} \|= "error" \| json` |
| Open WebUI | Active chat sessions | `{container_name="open-webui"} \|= "chat"` |
| Matrix | Federation errors | `{container_name="synapse"} \|= "ERROR"` |
| The Lounge | Connected users | `{container_name="thelounge"} \|= "connected"` |
| Seedbox | Download completions | `{container_name="qbittorrent"} \|= "Torrent finished"` |
| Gluetun | VPN reconnects | `{container_name="gluetun"} \|= "Connected"` |
| Caddy | Upstream errors | Not yet; needs Promtail on LXC 105 |

Dashboard I: Uptime / Gatus

Needs Gatus scrape target added to Prometheus.

Gatus exposes /metrics (Prometheus format) on port 8080 once metrics: true is set in its config. Add to prometheus-config.yml:

- job_name: gatus
  static_configs:
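    - targets: ['192.168.1.22:8080']

Verify exact metric names in Prometheus Explore first, as with Ollama. The queries below assume the names exposed by recent Gatus releases, where response time is a gauge rather than a histogram.

| Panel | Query |
| --- | --- |
| Service availability % | `sum by (name) (rate(gatus_results_total{success="true"}[1h])) / sum by (name) (rate(gatus_results_total[1h])) * 100` |
| Response time p95 | `quantile_over_time(0.95, gatus_results_duration_seconds[5m])` |
| Down services count | `count(gatus_results_endpoint_success == 0)` |
| Per-service time series | `gatus_results_endpoint_success{name=~".+"}` |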

This gives Gatus data proper Grafana visualization — much richer than Gatus's built-in UI.


Exporter Gap Summary

| LXC / Service | Missing | Priority |
| --- | --- | --- |
| Vault (106) | Prometheus scrape of /v1/sys/metrics | High |
| Forgejo (100) | Prometheus scrape of /metrics | High |
| Gatus | Prometheus scrape of /metrics | High |
| Runner (101) | node-exporter | Medium |
| Caddy (105) | Promtail + Caddy metrics | Medium |
| Gluetun (110) | node-exporter | Low |
| Ollama GPU | Verify metric names exist in Prometheus | Medium |

Most of these are config changes only. The node-exporter and Promtail rows are the exception: they need the agent installed on the target LXC first.


2. Warp CLI — Worth It?

There are two distinct products:

  • Warp Terminal — full terminal app replacement (macOS/Linux/Windows), block-based output, AI assistant, MCP integration
  • oz CLI — headless agent runner; runs Warp AI agents in scripts, CI/CD pipelines, or any environment without the terminal app

Terminal App

Worth it if:

  • You want MCP-native AI assistance in the terminal (Grafana, Forgejo, Proxmox MCPs surface directly in your terminal session)
  • Block-based output appeals to you (each command's output is a discrete block you can search, copy, or share, which is useful for long Ansible runs and Docker build logs)
  • You would use Warp Drive to store and share frequent commands (Vault write workflow, Proxmox queries, n8n API calls)
  • You want WARP.md project context files per repo (analogous to CLAUDE.md)

Not worth it if:

  • You use tmux. Warp breaks tmux compatibility, which is a hard blocker.
  • Privacy or closed-source software is a concern: Warp is closed source, the free tier requires telemetry for AI features, and all AI traffic routes through GCP (US)
  • Your current terminal (iTerm2/Ghostty/Kitty + zsh) is already well-configured

oz CLI (the more interesting piece)

Warp's headless agent runner — runs AI agents with MCP access inside Forgejo Actions pipelines. Drop oz agent run into a workflow job, give it your Grafana/Forgejo/Vault MCP servers, and it can reason about infra mid-pipeline (e.g., check Grafana for anomalies before deploying, open a Forgejo issue on failure).

This is a new capability that doesn't exist elsewhere for a self-hosted setup.

Privacy & Drawbacks

  • Account required for AI features
  • Free tier: telemetry must be enabled for AI (paid plans can opt out)
  • All AI + Warp Drive sync routes through GCP US — not air-gap friendly
  • Closed source (vs iTerm2, Kitty, Ghostty, Alacritty which are all open source)
  • Higher resource usage than lightweight terminals

Recommendation

Try oz CLI experimentally in one Forgejo workflow. The ability to run an AI agent with MCP context inside CI is novel and worth a spike on a non-critical workflow.

Terminal app: Skip for now if tmux is part of your workflow. Revisit if that changes.


3. Linting Strategy

| Layer | Tool | What it catches |
| --- | --- | --- |
| Pre-commit (local) | pre-commit framework | Fast feedback before commit |
| YAML | yamllint | All YAML syntax + style |
| Ansible | ansible-lint (profile: moderate) | Playbook correctness, FQCN, idempotence |
| Shell | shellcheck | Script bugs, quoting issues |
| IaC security | trivy fs | Dockerfile/compose misconfigs + hardcoded secrets |
| Docker Compose | docker compose config | Syntax + variable resolution |
| Python | ruff | Fast replacement for flake8+isort |
| PR comments | reviewdog | Turns lint output into inline Forgejo PR annotations |

Config Files

.ansible-lint:

profile: moderate
warn_list:
  - yaml[line-length]
skip_list:
  - experimental
exclude_paths:
  - .git/

.yamllint.yml:

extends: default
rules:
  line-length:
    max: 120
    allow-non-breakable-inline-mappings: true
  truthy:
    allowed-values: ["true", "false"]
    check-keys: false

.pre-commit-config.yaml:

repos:
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v4.6.0
    hooks:
      - id: trailing-whitespace
      - id: end-of-file-fixer
      - id: check-yaml
      - id: check-merge-conflict
      - id: detect-private-key

  - repo: https://github.com/adrienverge/yamllint
    rev: v1.35.1
    hooks:
      - id: yamllint
        args: [-c, .yamllint.yml]

  - repo: https://github.com/ansible/ansible-lint
    rev: v25.1.3
    hooks:
      - id: ansible-lint

  - repo: https://github.com/shellcheck-py/shellcheck-py
    rev: v0.10.0.1
    hooks:
      - id: shellcheck
        args: [--severity=warning]

  - repo: https://github.com/gitleaks/gitleaks
    rev: v8.21.2
    hooks:
      - id: gitleaks

Forgejo Workflows

.forgejo/workflows/lint.yml — runs on all pushes and PRs:

name: Lint

on: [push, pull_request]

jobs:
  yaml:
    runs-on: native
    container:
      image: pipelinecomponents/yamllint:latest
    steps:
      - uses: actions/checkout@v4
      - run: yamllint -c .yamllint.yml .

  ansible:
    runs-on: native
    container:
      image: ghcr.io/ansible/ansible-lint:latest
    steps:
      - uses: actions/checkout@v4
      - run: ansible-lint --profile moderate

  shellcheck:
    runs-on: native
    steps:
      - uses: actions/checkout@v4
      - run: |
          apt-get update -qq && apt-get install -y -qq shellcheck
          # -r: skip running shellcheck when no *.sh files are found
          find . -name "*.sh" -print0 | xargs -0 -r shellcheck --severity=warning

  compose-validate:
    runs-on: native
    steps:
      - uses: actions/checkout@v4
      - run: |
          find . -name "docker-compose*.yml" -print0 | while IFS= read -r -d '' f; do
            echo "Validating $f"
            docker compose -f "$f" config --quiet
          done

.forgejo/workflows/security.yml — runs on push to main and PRs:

name: Security Scan

on:
  push:
    branches: [main]
  pull_request:

jobs:
  trivy:
    runs-on: native
    container:
      image: aquasec/trivy:latest
    steps:
      - uses: actions/checkout@v4
      - name: Scan IaC and secrets
        run: |
          trivy fs . \
            --scanners misconfig,secret \
            --severity HIGH,CRITICAL \
            --exit-code 1

On MegaLinter

MegaLinter bundles 100+ linters into one Docker image and can auto-fix and commit back. It works in Forgejo Actions via docker run (not via the marketplace action, which has resolution issues in Forgejo).

Tradeoff: Large image, requires Docker-in-Docker on your runner, harder to debug. The granular per-job approach above is easier to manage, parallelizes naturally, and caches better. Consider MegaLinter later if you want unified HTML reports.

MegaLinter flavor for this stack: oxsecurity/megalinter-ci_light:v8.


On AI Review Agents

For a self-hosted Forgejo setup, two options are viable today:

  1. reviewdog — posts inline diff annotations on Forgejo PRs from deterministic linter output. No AI involved but dramatically better UX than CI pass/fail. Supports gitea reporter natively.

  2. ai-review (Nikita-Filonov) — AI-powered PR review comments using Ollama (fully on-prem), Claude, or GPT-4. Posts inline Gitea/Forgejo PR comments. Experimental but working.

Both are worth setting up once baseline linting is clean, because AI review is most useful when it's not competing with dozens of style violations.
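As a rough illustration of the reviewdog option, a pull-request job could pipe yamllint's parsable output into reviewdog. Treat every specific here as an assumption to check against the reviewdog documentation: the reporter name gitea-pr-review, the GITEA_ADDRESS and REVIEWDOG_GITEA_API_TOKEN variables, the hypothetical secret name FORGEJO_TOKEN, the placeholder host, and the premise that both yamllint and reviewdog are already installed on the runner.

```yaml
# .forgejo/workflows/review.yml (sketch; file name is hypothetical)
name: Review

on: [pull_request]

jobs:
  yamllint-review:
    runs-on: native
    steps:
      - uses: actions/checkout@v4
      - name: yamllint with inline PR comments
        env:
          GITEA_ADDRESS: https://<forgejo_host>                      # placeholder
          REVIEWDOG_GITEA_API_TOKEN: ${{ secrets.FORGEJO_TOKEN }}    # hypothetical secret name
        run: |
          # "parsable" output is file:line:col: message, which the errorformat below matches
          yamllint -f parsable -c .yamllint.yml . \
            | reviewdog -efm="%f:%l:%c: %m" -name=yamllint -reporter=gitea-pr-review
```

The same pattern should extend to shellcheck or ansible-lint by swapping the linter command and its error format.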


Adoption Order

  1. Add .ansible-lint + .yamllint.yml to the repo — local config only, zero CI impact
  2. Install pre-commit locally, run pre-commit run --all-files to see baseline
  3. Fix highest-severity violations
  4. Add lint.yml workflow in soft-fail mode first (ignore the linters' exit codes; the linters in lint.yml have no --soft-fail flag of their own)
  5. Remove soft-fail once the baseline is clean
  6. Add security.yml
  7. Add reviewdog for inline PR comments
  8. Experiment with ai-review + Ollama
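Step 4's soft-fail mode can be sketched by discarding the linter's exit code in the run command. The fragment below reuses the yaml job from lint.yml; the only change is the trailing `|| true`, which is plain shell and does not depend on any runner-specific feature.

```yaml
# .forgejo/workflows/lint.yml (fragment): soft-fail variant of the yaml job
jobs:
  yaml:
    runs-on: native
    container:
      image: pipelinecomponents/yamllint:latest
    steps:
      - uses: actions/checkout@v4
      # "|| true" discards yamllint's exit code: findings are logged, the job stays green
      - run: yamllint -c .yamllint.yml . || true
```

For step 5, deleting `|| true` turns the job back into a hard gate.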

Pre-commit vs CI division of responsibility

| Layer | Runs when | Speed | Purpose |
| --- | --- | --- | --- |
| Pre-commit | git commit (local) | Fast (staged files only) | Immediate feedback, block bad commits early |
| CI (Forgejo Actions) | Push / PR | Full repo scan | Authoritative gate, environment-independent |

Never rely solely on pre-commit — git commit --no-verify bypasses it. CI is the real enforcement layer.