Observability — Runbook
Credentials
Get Grafana admin password
vault kv get secret/homelab-sso # look for grafana_admin_password
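To print only the password instead of the whole secret, the `-field` flag of `vault kv get` can be used. The following is a minimal sketch: the function name `grafana_admin_password` is a hypothetical helper, and it assumes `VAULT_ADDR` is exported and a valid token is already available.

```shell
# Hypothetical helper: print only the Grafana admin password from Vault.
# Assumes VAULT_ADDR is exported and you are already logged in.
grafana_admin_password() {
  vault kv get -field=grafana_admin_password secret/homelab-sso
}

# Usage (prints the password to stdout, so avoid shared terminals):
#   grafana_admin_password
```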
Rotate Grafana credentials
- Run the `pocketid-setup` workflow, which generates new values and writes them to Vault
- Trigger the `Deploy Loki Stack` workflow to pick up the new values
Grafana-First Troubleshooting
Always check Grafana (Loki + Prometheus) BEFORE SSHing to a host. Use targeted LogQL/PromQL queries to find the issue, then SSH only if action is needed.
Service not responding? Check logs + metrics first
- Find error logs in Loki: Grafana → Explore → Loki
  - Query: `{container="<service>"}` or `{job="<service>"}`
  - Add filter: `|= "error"` to find errors only
  - Or use `| json | level="error"` for JSON logs
- Check metrics in Prometheus: Grafana → Explore → Prometheus
  - CPU: `rate(container_cpu_usage_seconds_total{container="<service>"}[5m])`
  - Memory: `container_memory_usage_bytes{container="<service>"}`
  - Disk: `node_filesystem_avail_bytes{job="alloy"}`
- Narrow by time range: use Grafana's time picker (last 15min, last 1h, etc)
- Only SSH if:
  - Logs show nothing (logs stopped flowing or are absent)
  - Metrics are all zeros or absent
  - You need to run a command or restart a service
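The log check above can also be run from a terminal without SSHing to the host. This is a rough sketch only: `error_query` and `recent_errors` are hypothetical helper names, the endpoint is assumed to be the Loki URL used elsewhere in this runbook, and the service name is not escaped. With no `start`/`end` given, Loki's API documentation says the window defaults to the last hour.

```shell
# Assumption: same Loki endpoint as used elsewhere in this runbook.
LOKI_URL="${LOKI_URL:-https://loki.eva-00.network}"

# Hypothetical helper: build the LogQL error query for one container.
error_query() {
  printf '{container="%s"} |= "error"\n' "$1"
}

# Hypothetical helper: fetch up to 100 matching lines as raw JSON.
# No start/end is sent, so Loki applies its default time window.
recent_errors() (
  curl -sG "$LOKI_URL/loki/api/v1/query_range" \
    --data-urlencode "query=$(error_query "$1")" \
    --data-urlencode 'limit=100'
)

# Usage:
#   recent_errors n8n
```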
Per-Service Log Queries
Each service runbook has a Logs section with its exact Loki query, log format, contents, and SSH fallback. See individual runbooks for details.
Quick reference — services in Loki
| Loki query | Service | LXC |
|---|---|---|
| `{job="forgejo"}` | Forgejo app | 100 |
| `{job="forgejo-runner"}` | Forgejo Runner | 101 |
| `{job="forgejo-actions", task_id="<id>"}` | Workflow step logs | 100 |
| `{job="loki-stack", container="<name>"}` | Loki, Prometheus, Grafana, Alloy | 108 |
| `{job="caddy"}` | Caddy reverse proxy | 105 |
| `{job="ollama"}` | Ollama LLM | 107 |
| `{job="minecraft"}` | Minecraft + Playit | 109 |
| `{job="gluetun"}` | Gluetun VPN | 110 |
| `{job="seedbox"}` | Seedbox qBittorrent instances | 111 |
| `{job="mediabot"}` | MediaManager, PostgreSQL, qBit, Prowlarr | 113 |
| `{job="jellyfin"}` | Jellyfin media server | 114 |
| `{job="netbird"}` | NetBird VPN mesh | 115 |
| `{job="allmight"}` | Shoko, Grimmory, RomM, MariaDB | 116 |
| `{job="tools"}` | code-server, thelounge, qbitwebui | 118 |
| `{job="infra-apps"}` | gatus, ntfy, glance, all oauth2-proxies | 119 |
| `{job="automation"}` | n8n | 120 |
| `{job="matrix"}` | Synapse | 121 |
| `{job="ai"}` | Open WebUI | 122 |
| `{job="auth"}` | PocketID | 123 |
NOT in Loki
- Vault (LXC 106) — Alpine, writes to file, no Alloy agent. SSH:
ssh [email protected] "tail -f /var/log/vault.log"
Filter patterns
# Plain text error search (most services)
{container="<service>"} |= "error"
# Avoid false positives from Gatus health checks
{container="gatus"} |= "error" != "errors=0"
# JSON log parsing (only for services that output JSON)
{container="<service>"} | json | level="error"
# Exclude info-level noise
{job="forgejo"} != "info"
# Regex filter
{container="n8n"} |~ "(?i)fail|crash|panic"
Forgejo Actions Workflow Logs
Workflow run logs are shipped to Loki with these labels:
| Label | Description | Example |
|---|---|---|
| `repo` | Repository name | homelab |
| `owner` | Repository owner | holo |
| `task_id` | Forgejo Actions task ID | 820 |
| `container` | Docker container name | (varies) |
| `job` | Loki job label | (varies) |
| `stream` | stdout/stderr | stdout |
Find task IDs (preferred — via Loki): Query the runner log to get the latest task IDs directly:
{job="forgejo-runner"} |= "task"
Matching lines have the form `task <id> repo is <owner>/<repo>`. Use the task ID in subsequent queries. This is faster than using the Forgejo API, which paginates oldest-first.
Find task IDs (alternative — via UI/API): Check the Forgejo Actions UI (task ID is in the URL), or use the API. Note: the Forgejo API has two different IDs — the run number in the URL (/actions/runs/802) differs from the internal id in the API response (e.g., 863). Use the internal id for get_run/list_run_jobs. The API returns runs oldest-first with no sort option.
Query examples:
# All logs for a specific workflow run (by task ID)
{repo="homelab", task_id="820"}
# Find failed Ansible tasks across all recent runs
{repo="homelab"} |= "FAILED"
# Find unreachable hosts
{repo="homelab"} |= "UNREACHABLE"
# All workflow logs for a specific repo (set the time picker to the last hour)
{repo="homelab"}
# Filter by owner for cross-repo searches
{owner="holo"} |= "error"
Loki HTTP API (for programmatic access):
# Query via curl (timestamps in nanoseconds)
curl -s 'https://loki.eva-00.network/loki/api/v1/query_range' \
--data-urlencode 'query={repo="homelab", task_id="820"}' \
--data-urlencode 'start=<epoch_ns>' \
--data-urlencode 'end=<epoch_ns>' \
--data-urlencode 'limit=100' \
--data-urlencode 'direction=forward'
# Get available labels
curl -s 'https://loki.eva-00.network/loki/api/v1/labels'
# Get label values (e.g. all task IDs)
curl -s 'https://loki.eva-00.network/loki/api/v1/label/task_id/values'
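The `<epoch_ns>` placeholders in the `query_range` call above need concrete values. One way to produce them is sketched below; `loki_window_ns` is a hypothetical helper name, and it builds the nanosecond values by appending nine zeros to the epoch seconds rather than multiplying.

```shell
# Hypothetical helper: print "start end" in Unix epoch nanoseconds for a
# window covering the last N minutes (default 60).
# The body runs in a subshell so its variables do not leak to the caller.
loki_window_ns() (
  minutes="${1:-60}"
  end_s="$(date +%s)"
  start_s=$(( end_s - minutes * 60 ))
  # Loki expects nanoseconds: append nine zeros to the epoch seconds.
  printf '%s000000000 %s000000000\n' "$start_s" "$end_s"
)

# Usage: load the two values into $1 and $2, then pass them to curl.
#   set -- $(loki_window_ns 15)
#   curl -s 'https://loki.eva-00.network/loki/api/v1/query_range' \
#     --data-urlencode 'query={repo="homelab", task_id="820"}' \
#     --data-urlencode "start=$1" --data-urlencode "end=$2" \
#     --data-urlencode 'limit=100' --data-urlencode 'direction=forward'
```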
Programmatic log access (Claude / automation)
Primary: Grafana MCP tools (recommended for troubleshooting)
Use in order:
1. list_loki_label_values — discover available labels/values
2. query_loki_stats — verify stream has data
3. query_loki_logs — fetch log lines (max 100 per call)
Required: datasourceUid: P8E80F9AEF21F6940
Fallback: Loki HTTP API (for queries >100 lines)
Timestamps must be in nanoseconds (Unix epoch seconds × 10⁹).
Last resort: SSH — only when Loki itself is down or logs aren't collected.
Per-Service Metrics
| Service | PromQL Query |
|---|---|
| Docker host CPU | rate(node_cpu_seconds_total{job="alloy"}[5m]) |
| Docker host memory | node_memory_MemAvailable_bytes{job="alloy"} |
| Container memory | container_memory_usage_bytes{job="alloy"} |
| Disk space | node_filesystem_avail_bytes{job="alloy", fstype!="tmpfs"} |
| Docker volumes | container_fs_usage_bytes |
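The queries in the table can also be run outside Grafana through the Prometheus HTTP API. A minimal sketch, assuming Prometheus answers on the same address this runbook lists for its targets page; `prom_query` and the `PROM_URL` variable are hypothetical names introduced for illustration.

```shell
# Assumption: Prometheus listens on the address used for the targets page.
PROM_URL="${PROM_URL:-http://192.168.1.108:9090}"

# Hypothetical helper: run an instant PromQL query and print the raw JSON.
# -G sends the URL-encoded query as a GET parameter.
prom_query() {
  curl -sG "$PROM_URL/api/v1/query" --data-urlencode "query=$1"
}

# Usage:
#   prom_query 'node_filesystem_avail_bytes{job="alloy", fstype!="tmpfs"}'
#   prom_query 'container_memory_usage_bytes{job="alloy"}'
```

The response is JSON with the series under `data.result`, so it can be piped to a JSON tool for filtering if one is installed.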
Troubleshooting Checklist
Grafana shows no data
- Loki reachable? Grafana → Explore → Loki → `{job="loki-stack"}` (see Grafana/Loki/Prometheus logs)
- Prometheus scraping? Grafana → Explore → Prometheus → `up{job="alloy"}` (should be 1)
- Alloy running? `{container="alloy"}` logs: any connection errors to Loki?
- SSH only if: Loki container is down, Prometheus can't reach Alloy endpoint, or network unreachable
Service logs missing
- Check label in Loki: `{container="<service>"}` (does it exist?)
- Docker running? `{container="<service>"}` should show recent logs
- Service exists? If new service, check Alloy config includes it
- SSH only if: Container logs don't exist in Loki but service is running (likely log driver issue)
Prometheus target down
- View targets: http://192.168.1.108:9090/targets
- Check Prometheus logs: `{container="prometheus"}`
- Verify target is reachable: `{container="alloy"}` (any scrape errors?)
- SSH only if: Port number is wrong, or service is unreachable
Loki Labels Reference
All labels available in Loki (verified 2026-04-01):
| Label | Description | Example values |
|---|---|---|
| `job` | Log source group | tools, infra-apps, automation, matrix, ai, auth, loki-stack, forgejo, forgejo-runner, forgejo-actions |
| `container` | Docker container name | n8n, pocketid, grafana, loki, ... |
| `service_name` | Docker service name | Same as container in most cases |
| `stream` | Output stream | stdout, stderr |
| `owner` | Forgejo repo owner (actions only) | holo |
| `repo` | Forgejo repo name (actions only) | homelab |
| `task_id` | Forgejo Actions task ID (actions only) | 820 |
Note: host label exists in Alloy configs for Forgejo/Runner but is not present in Docker discovery logs. Each LXC runs its own Alloy instance with a unique job label, so use job to identify which LXC a Docker log came from.
Logs Directory Reference
| Service | On-disk location | Loki query | Alloy agent |
|---|---|---|---|
| Loki | Docker (LXC 108) | `{job="loki-stack", container="loki"}` | Sidecar |
| Prometheus | Docker (LXC 108) | `{job="loki-stack", container="prometheus"}` | Sidecar |
| Grafana | Docker (LXC 108) | `{job="loki-stack", container="grafana"}` | Sidecar |
| Forgejo app | LXC 100 `/var/log/forgejo/gitea.log` | `{job="forgejo"}` | Native binary |
| Forgejo Runner | LXC 101 `/var/log/forgejo-runner.log` | `{job="forgejo-runner"}` | Native binary |
| Forgejo Actions | LXC 100 `.log.zst` files | `{job="forgejo-actions", task_id="<id>"}` | Python cron |
| Tools (code-server, thelounge, qbitwebui) | Docker (LXC 118) stdout/stderr | `{job="tools", container="<name>"}` | Docker discovery |
| Infra-apps (gatus, ntfy, glance, oauth2-proxies) | Docker (LXC 119) stdout/stderr | `{job="infra-apps", container="<name>"}` | Docker discovery |
| Automation (n8n) | Docker (LXC 120) stdout/stderr | `{job="automation", container="<name>"}` | Docker discovery |
| Matrix (synapse) | Docker (LXC 121) stdout/stderr | `{job="matrix", container="<name>"}` | Docker discovery |
| AI (open-webui) | Docker (LXC 122) stdout/stderr | `{job="ai", container="<name>"}` | Docker discovery |
| Auth (pocketid) | Docker (LXC 123) stdout/stderr | `{job="auth", container="<name>"}` | Docker discovery |