Observability — Runbook

Credentials

Get Grafana admin password

vault kv get secret/homelab-sso   # look for grafana_admin_password

Rotate Grafana credentials

  1. Run the pocketid-setup workflow, which generates new values and writes them to Vault
  2. Trigger the Deploy Loki Stack workflow so the stack picks up the new values

Grafana-First Troubleshooting

Always check Grafana (Loki + Prometheus) BEFORE SSHing to a host. Use targeted LogQL/PromQL queries to find the issue, then SSH only if action is needed.

Service not responding? Check logs + metrics first

  1. Find error logs in Loki: Grafana → Explore → Loki
     • Query: {container="<service>"} or {job="<service>"}
     • Add filter: |= "error" to find errors only
     • Or use | json | level="error" for JSON logs
  2. Check metrics in Prometheus: Grafana → Explore → Prometheus
     • CPU: rate(container_cpu_usage_seconds_total{container="<service>"}[5m])
     • Memory: container_memory_usage_bytes{container="<service>"}
     • Disk: node_filesystem_avail_bytes{job="alloy"}
  3. Narrow by time range: use Grafana's time picker (last 15 min, last 1 h, etc.)
  4. Only SSH if:
     • Logs show nothing (logs stopped flowing or are absent)
     • Metrics are all zeros or absent
     • You need to run a command or restart a service
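
The queries from the steps above can be generated for any service rather than typed by hand. The following is a minimal sketch, not part of the existing tooling: the function names `logql_selector` and `first_pass_queries` are hypothetical, and the query strings are copied from the steps above.

```python
def logql_selector(service: str, label: str = "container") -> str:
    """Return a LogQL stream selector such as {container="n8n"}."""
    if label not in ("container", "job"):
        raise ValueError("label must be 'container' or 'job'")
    # Escape backslashes and double quotes so the value stays a valid LogQL string.
    escaped = service.replace("\\", "\\\\").replace('"', '\\"')
    return f'{{{label}="{escaped}"}}'


def first_pass_queries(service: str, label: str = "container") -> dict[str, str]:
    """Build the log and metric queries from the checklist for one service."""
    selector = logql_selector(service, label)
    return {
        "logs_all": selector,
        "logs_text_errors": f'{selector} |= "error"',
        "logs_json_errors": f'{selector} | json | level="error"',
        "cpu": f'rate(container_cpu_usage_seconds_total{{container="{service}"}}[5m])',
        "memory": f'container_memory_usage_bytes{{container="{service}"}}',
        "disk": 'node_filesystem_avail_bytes{job="alloy"}',
    }


if __name__ == "__main__":
    # Print every query so it can be pasted into Grafana Explore.
    for name, query in first_pass_queries("n8n").items():
        print(f"{name}: {query}")
```

Paste the log queries into Explore → Loki and the metric queries into Explore → Prometheus.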

Per-Service Log Queries

Each service runbook has a Logs section with its exact Loki query, log format, contents, and SSH fallback. See individual runbooks for details.

Quick reference — services in Loki

| Loki query | Service | LXC |
| --- | --- | --- |
| {job="forgejo"} | Forgejo app | 100 |
| {job="forgejo-runner"} | Forgejo Runner | 101 |
| {job="forgejo-actions", task_id="<id>"} | Workflow step logs | 100 |
| {job="loki-stack", container="<name>"} | Loki, Prometheus, Grafana, Alloy | 108 |
| {job="caddy"} | Caddy reverse proxy | 105 |
| {job="ollama"} | Ollama LLM | 107 |
| {job="minecraft"} | Minecraft + Playit | 109 |
| {job="gluetun"} | Gluetun VPN | 110 |
| {job="seedbox"} | Seedbox qBittorrent instances | 111 |
| {job="mediabot"} | MediaManager, PostgreSQL, qBit, Prowlarr | 113 |
| {job="jellyfin"} | Jellyfin media server | 114 |
| {job="netbird"} | NetBird VPN mesh | 115 |
| {job="allmight"} | Shoko, Grimmory, RomM, MariaDB | 116 |
| {job="tools"} | code-server, thelounge, qbitwebui | 118 |
| {job="infra-apps"} | gatus, ntfy, glance, all oauth2-proxies | 119 |
| {job="automation"} | n8n | 120 |
| {job="matrix"} | Synapse | 121 |
| {job="ai"} | Open WebUI | 122 |
| {job="auth"} | PocketID | 123 |

NOT in Loki

  • Vault (LXC 106): Alpine, writes to a file, no Alloy agent. SSH: ssh <user>@<vault-host> "tail -f /var/log/vault.log" (the address was obfuscated in the published page; substitute the real user and host)

Filter patterns

# Plain text error search (most services)
{container="<service>"} |= "error"

# Avoid false positives from Gatus health checks
{container="gatus"} |= "error" != "errors=0"

# JSON log parsing (only for services that output JSON)
{container="<service>"} | json | level="error"

# Exclude info-level noise
{job="forgejo"} != "info"

# Regex filter
{container="n8n"} |~ "(?i)fail|crash|panic"
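
To see why the Gatus pattern removes false positives, it can help to replay the line filters locally. The sketch below emulates `|=`, `!=`, and `|~` in plain Python under two assumptions: the sample log lines are invented for illustration, and Python's `re` stands in for the RE2 syntax Loki uses (the two agree for a simple pattern like this one, but not for every regex).

```python
import re
from typing import Callable, Iterable

LineFilter = Callable[[str], bool]


def contains(needle: str) -> LineFilter:
    """Emulates |= "needle": keep lines that contain the substring (case-sensitive)."""
    return lambda line: needle in line


def not_contains(needle: str) -> LineFilter:
    """Emulates != "needle": drop lines that contain the substring."""
    return lambda line: needle not in line


def matches(pattern: str) -> LineFilter:
    """Emulates |~ "pattern": keep lines where the regex matches anywhere."""
    compiled = re.compile(pattern)
    return lambda line: compiled.search(line) is not None


def apply_filters(lines: Iterable[str], *filters: LineFilter) -> list[str]:
    """Chain filters left to right, as a LogQL pipeline does."""
    return [line for line in lines if all(f(line) for f in filters)]


# Hypothetical log lines, invented for this example.
SAMPLE_LINES = [
    'level=info msg="health check ok" errors=0',
    'level=error msg="endpoint timed out"',
    "Workflow FAILED after 3 retries",
]

# {container="gatus"} |= "error" != "errors=0"
gatus_hits = apply_filters(SAMPLE_LINES, contains("error"), not_contains("errors=0"))

# {container="n8n"} |~ "(?i)fail|crash|panic"
regex_hits = apply_filters(SAMPLE_LINES, matches("(?i)fail|crash|panic"))
```

The first sample line contains the substring "error" only inside "errors=0", which is exactly the case the second filter excludes.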

Forgejo Actions Workflow Logs

Workflow run logs are shipped to Loki with these labels:

| Label | Description | Example |
| --- | --- | --- |
| repo | Repository name | homelab |
| owner | Repository owner | holo |
| task_id | Forgejo Actions task ID | 820 |
| container | Docker container name | (varies) |
| job | Loki job label | (varies) |
| stream | stdout/stderr | stdout |

Find task IDs (preferred — via Loki): Query the runner log to get the latest task IDs directly:

{job="forgejo-runner"} |= "task"

Each line shows task <id> repo is <owner>/<repo>. Use the task ID in subsequent queries. This is faster than using the Forgejo API, which paginates oldest-first.
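
Once the runner log lines are fetched, the task IDs can be pulled out with a regular expression. This is a sketch that assumes the lines contain the literal fragment described above (`task <id> repo is <owner>/<repo>`); the surrounding log fields in the usage example and the names `TaskRef` and `extract_tasks` are hypothetical.

```python
import re
from typing import Iterable, NamedTuple


class TaskRef(NamedTuple):
    task_id: int
    owner: str
    repo: str


# Matches the fragment "task <id> repo is <owner>/<repo>" anywhere in a line.
TASK_LINE = re.compile(r"task (\d+) repo is ([\w.-]+)/([\w.-]+)")


def extract_tasks(lines: Iterable[str]) -> list[TaskRef]:
    """Return unique tasks found in the runner log, newest (highest ID) first."""
    seen: set[TaskRef] = set()
    for line in lines:
        match = TASK_LINE.search(line)
        if match:
            seen.add(TaskRef(int(match.group(1)), match.group(2), match.group(3)))
    return sorted(seen, key=lambda task: task.task_id, reverse=True)
```

For example, `extract_tasks(['level=info msg="task 820 repo is holo/homelab"'])` yields one `TaskRef` whose `task_id` can be used in `{repo="homelab", task_id="820"}`.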

Find task IDs (alternative — via UI/API): Check the Forgejo Actions UI (task ID is in the URL), or use the API. Note: the Forgejo API has two different IDs — the run number in the URL (/actions/runs/802) differs from the internal id in the API response (e.g., 863). Use the internal id for get_run/list_run_jobs. The API returns runs oldest-first with no sort option.

Query examples:

# All logs for a specific workflow run (by task ID)
{repo="homelab", task_id="820"}

# Find failed Ansible tasks across all recent runs
{repo="homelab"} |= "FAILED"

# Find unreachable hosts
{repo="homelab"} |= "UNREACHABLE"

# All workflow logs for a specific repo (set the time picker to the last hour)
{repo="homelab"}

# Filter by owner for cross-repo searches
{owner="holo"} |= "error"

Loki HTTP API (for programmatic access):

# Query via curl (-G sends the url-encoded parameters as a GET query string; timestamps in nanoseconds)
curl -G -s 'https://loki.eva-00.network/loki/api/v1/query_range' \
  --data-urlencode 'query={repo="homelab", task_id="820"}' \
  --data-urlencode 'start=<epoch_ns>' \
  --data-urlencode 'end=<epoch_ns>' \
  --data-urlencode 'limit=100' \
  --data-urlencode 'direction=forward'

# Get available labels
curl -s 'https://loki.eva-00.network/loki/api/v1/labels'

# Get label values (e.g. all task IDs)
curl -s 'https://loki.eva-00.network/loki/api/v1/label/task_id/values'
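
For scripts, the same request can be built and its response flattened in Python. The sketch below uses only the standard library and assumes the standard Loki `streams` response shape (`data.result[].values` as `[timestamp_ns, line]` pairs); the function names are hypothetical, and authentication, if the endpoint requires any, is not handled.

```python
import json
from urllib.parse import urlencode

LOKI_URL = "https://loki.eva-00.network"  # base URL taken from the curl examples


def query_range_url(query: str, start_ns: int, end_ns: int,
                    limit: int = 100, direction: str = "forward",
                    base: str = LOKI_URL) -> str:
    """Build a query_range URL with the same parameters as the curl example."""
    params = {
        "query": query,
        "start": str(start_ns),
        "end": str(end_ns),
        "limit": str(limit),
        "direction": direction,
    }
    return f"{base}/loki/api/v1/query_range?{urlencode(params)}"


def extract_lines(body: str) -> list[tuple[int, str]]:
    """Flatten a 'streams' response into (timestamp_ns, line) pairs, oldest first."""
    payload = json.loads(body)
    if payload.get("status") != "success":
        raise RuntimeError(f"Loki query failed: {payload}")
    pairs = [
        (int(timestamp), line)
        for stream in payload["data"]["result"]
        for timestamp, line in stream["values"]
    ]
    return sorted(pairs)
```

The URL can be fetched with `urllib.request.urlopen(url).read().decode()` and the body passed to `extract_lines`.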

Programmatic log access (Claude / automation)

Primary: Grafana MCP tools (recommended for troubleshooting)

Use in order:

  1. list_loki_label_values: discover available labels/values
  2. query_loki_stats: verify the stream has data
  3. query_loki_logs: fetch log lines (max 100 per call)

Required: datasourceUid: P8E80F9AEF21F6940

Fallback: Loki HTTP API (for queries >100 lines)

Timestamps must be in nanoseconds (Unix epoch seconds × 10⁹).

Last resort: SSH — only when Loki itself is down or logs aren't collected.

Per-Service Metrics

| Metric | PromQL query |
| --- | --- |
| Docker host CPU | rate(node_cpu_seconds_total{job="alloy"}[5m]) |
| Docker host memory | node_memory_MemAvailable_bytes{job="alloy"} |
| Container memory | container_memory_usage_bytes{job="alloy"} |
| Disk space | node_filesystem_avail_bytes{job="alloy", fstype!="tmpfs"} |
| Docker volumes | container_fs_usage_bytes |
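
These queries can also be run outside Grafana through the Prometheus HTTP API. The sketch below builds an instant-query URL and flags filesystems that are low on space. It assumes Prometheus is reachable at the address used for the targets page in this runbook, that the series carry the usual node-exporter `instance` and `mountpoint` labels, and that 5 GiB is a sensible threshold; all three are assumptions, and the function names are hypothetical.

```python
import json
from urllib.parse import urlencode

PROMETHEUS_URL = "http://192.168.1.108:9090"  # assumed: same host as the targets page
DISK_QUERY = 'node_filesystem_avail_bytes{job="alloy", fstype!="tmpfs"}'


def instant_query_url(promql: str, base: str = PROMETHEUS_URL) -> str:
    """Build a URL for the Prometheus instant-query endpoint."""
    return f"{base}/api/v1/query?{urlencode({'query': promql})}"


def low_disk(body: str, min_free_gib: float = 5.0) -> list[tuple[str, str, float]]:
    """Return (instance, mountpoint, free GiB) for samples below the threshold."""
    payload = json.loads(body)
    if payload.get("status") != "success":
        raise RuntimeError(f"Prometheus query failed: {payload}")
    hits = []
    for sample in payload["data"]["result"]:
        # Each instant-vector sample is {"metric": {...}, "value": [ts, "<number>"]}.
        free_gib = float(sample["value"][1]) / 2**30
        if free_gib < min_free_gib:
            labels = sample["metric"]
            hits.append((labels.get("instance", "?"),
                         labels.get("mountpoint", "?"),
                         round(free_gib, 2)))
    return sorted(hits, key=lambda hit: hit[2])
```

Fetch `instant_query_url(DISK_QUERY)` and pass the response body to `low_disk` to list the tightest filesystems first.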

Troubleshooting Checklist

Grafana shows no data

  1. Loki reachable? Grafana → Explore → Loki → {job="loki-stack"} — see Grafana/Loki/Prometheus logs
  2. Prometheus scraping? Grafana → Explore → Prometheus → up{job="alloy"} — should be 1
  3. Alloy running? {container="alloy"} logs — any connection errors to Loki?
  4. SSH only if: Loki container is down, Prometheus can't reach Alloy endpoint, or network unreachable

Service logs missing

  1. Check label in Loki: {container="<service>"} — does it exist?
  2. Docker running? {container="<service>"} should show recent logs
  3. Service exists? If new service, check Alloy config includes it
  4. SSH only if: Container logs don't exist in Loki but service is running (likely log driver issue)
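
Step 1 can be automated by asking Loki which `container` values it knows about. The sketch below parses the response of the label-values endpoint shown earlier in this runbook; it assumes the standard `{"status": "success", "data": [...]}` response shape, and the function names are hypothetical. Loki reports only values seen within the endpoint's query window, so a container that has been silent for a long time may be missing even though its label existed earlier.

```python
import json
from urllib.parse import quote

LOKI_URL = "https://loki.eva-00.network"  # base URL taken from the curl examples


def label_values_url(label: str, base: str = LOKI_URL) -> str:
    """Build the URL that lists all values of one label."""
    return f"{base}/loki/api/v1/label/{quote(label, safe='')}/values"


def parse_label_values(body: str) -> list[str]:
    """Return the label values from a label-values response."""
    payload = json.loads(body)
    if payload.get("status") != "success":
        raise RuntimeError(f"Loki label query failed: {payload}")
    # "data" may be absent when no values exist.
    return list(payload.get("data") or [])


def missing_values(body: str, expected: list[str]) -> list[str]:
    """Return the expected values that Loki does not report."""
    present = set(parse_label_values(body))
    return sorted(value for value in expected if value not in present)
```

An empty result from `missing_values` means every expected container has shipped logs recently.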

Prometheus target down

  1. View targets: http://192.168.1.108:9090/targets
  2. Check Prometheus logs: {container="prometheus"}
  3. Verify target is reachable: {container="alloy"} — any scrape errors?
  4. SSH only if: Port number is wrong, or service is unreachable
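
The information on the targets page is also available as JSON from the Prometheus `/api/v1/targets` endpoint, which is handy for scripted checks. The sketch below lists targets that are not healthy together with their last scrape error; the function names are hypothetical, and the target addresses in any example data are placeholders.

```python
import json

PROMETHEUS_URL = "http://192.168.1.108:9090"  # from the targets link above


def targets_url(base: str = PROMETHEUS_URL) -> str:
    """URL of the JSON equivalent of the /targets page."""
    return f"{base}/api/v1/targets"


def down_targets(body: str) -> list[tuple[str, str, str]]:
    """Return (job, scrape URL, last error) for every target whose health is not 'up'."""
    payload = json.loads(body)
    if payload.get("status") != "success":
        raise RuntimeError(f"Prometheus targets query failed: {payload}")
    return [
        (target["labels"].get("job", "?"),
         target["scrapeUrl"],
         target.get("lastError", ""))
        for target in payload["data"]["activeTargets"]
        if target["health"] != "up"
    ]
```

The `lastError` text usually distinguishes a wrong port (connection refused) from an unreachable host (timeout), which helps decide whether SSH is needed.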

Loki Labels Reference

All labels available in Loki (verified 2026-04-01):

| Label | Description | Example values |
| --- | --- | --- |
| job | Log source group | tools, infra-apps, automation, matrix, ai, auth, loki-stack, forgejo, forgejo-runner, forgejo-actions |
| container | Docker container name | n8n, pocketid, grafana, loki, ... |
| service_name | Docker service name | Same as container in most cases |
| stream | Output stream | stdout, stderr |
| owner | Forgejo repo owner (actions only) | holo |
| repo | Forgejo repo name (actions only) | homelab |
| task_id | Forgejo Actions task ID (actions only) | 820 |

Note: the host label exists in the Alloy configs for Forgejo and the Forgejo Runner but is not present on logs collected through Docker discovery. Each LXC runs its own Alloy instance with a unique job label, so use job to identify which LXC a Docker log came from.
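
Because job is the only reliable pointer to the LXC, a small lookup makes the mapping scriptable. The sketch below copies the job-to-LXC pairs from the quick-reference table earlier in this runbook; the helper names are hypothetical, and the mapping has to be kept in sync with that table by hand.

```python
# Job label to LXC ID, copied from the quick-reference table in this runbook.
JOB_TO_LXC = {
    "forgejo": 100, "forgejo-runner": 101, "forgejo-actions": 100,
    "loki-stack": 108, "caddy": 105, "ollama": 107, "minecraft": 109,
    "gluetun": 110, "seedbox": 111, "mediabot": 113, "jellyfin": 114,
    "netbird": 115, "allmight": 116, "tools": 118, "infra-apps": 119,
    "automation": 120, "matrix": 121, "ai": 122, "auth": 123,
}


def lxc_for_job(job: str) -> int:
    """Return the LXC ID that ships logs under the given job label."""
    try:
        return JOB_TO_LXC[job]
    except KeyError:
        raise KeyError(f"unknown job label {job!r}; check the quick-reference table") from None


def jobs_on_lxc(lxc_id: int) -> list[str]:
    """Return every job label that originates from one LXC."""
    return sorted(job for job, lxc in JOB_TO_LXC.items() if lxc == lxc_id)
```

`jobs_on_lxc(106)` returns an empty list, which matches the note that Vault is not in Loki.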

Logs Directory Reference

| Service | On-disk location | Loki query | Alloy agent |
| --- | --- | --- | --- |
| Loki | Docker (LXC 108) | {job="loki-stack", container="loki"} | Sidecar |
| Prometheus | Docker (LXC 108) | {job="loki-stack", container="prometheus"} | Sidecar |
| Grafana | Docker (LXC 108) | {job="loki-stack", container="grafana"} | Sidecar |
| Forgejo app | LXC 100 /var/log/forgejo/gitea.log | {job="forgejo"} | Native binary |
| Forgejo Runner | LXC 101 /var/log/forgejo-runner.log | {job="forgejo-runner"} | Native binary |
| Forgejo Actions | LXC 100 .log.zst files | {job="forgejo-actions", task_id="<id>"} | Python cron |
| Tools (code-server, thelounge, qbitwebui) | Docker (LXC 118) stdout/stderr | {job="tools", container="<name>"} | Docker discovery |
| Infra-apps (gatus, ntfy, glance, oauth2-proxies) | Docker (LXC 119) stdout/stderr | {job="infra-apps", container="<name>"} | Docker discovery |
| Automation (n8n) | Docker (LXC 120) stdout/stderr | {job="automation", container="<name>"} | Docker discovery |
| Matrix (synapse) | Docker (LXC 121) stdout/stderr | {job="matrix", container="<name>"} | Docker discovery |
| AI (open-webui) | Docker (LXC 122) stdout/stderr | {job="ai", container="<name>"} | Docker discovery |
| Auth (pocketid) | Docker (LXC 123) stdout/stderr | {job="auth", container="<name>"} | Docker discovery |