Lessons Learned

Operational incidents, mistakes, and insights worth remembering. Each entry captures what happened, why, and how to prevent it next time.

2026-06-16 — Claude Code Remote Control daemon: the gotchas that aren't in the docs

What happened: Got claude-remote-control.service running on agents-lxc (ADR-023 Option A), with the ready environment showing up on claude.ai/code and accepting browser-attached sessions. The path from "playbook deploy green" to "actually usable" took five separate manual fixes that weren't in any docs we had, and each one had a misleading symptom. Recording them so the next deploy reproduces the working state from the playbook alone, and so the next person hitting them doesn't burn a day.

The five fixes, in the order they bit:

/home/claude/.cache must exist before unit start. First start failed with Failed at step NAMESPACE spawning /usr/bin/claude: No such file or directory. The error points at /usr/bin/claude but the real victim was the unit's ReadWritePaths listing a dir that didn't exist — ProtectHome=read-only + ReadWritePaths=<dir> requires the dir to exist at mount-namespace setup time. Fix: pre-create .cache (and .claude, .codex) in the playbook right after the service user is created. Codified in chizuru-v2/ansible/playbooks/agents.yml task "Ensure systemd-unit ReadWritePaths directories exist".
The first-run wizard is not skippable just by accepting the OAuth prompt. claude auth login --claudeai (via paste-back code) seats ~/.claude/.credentials.json and claude auth status reports loggedIn: true, but launching claude interactively still triggers a multi-step wizard: theme → auth-method → another OAuth re-flow → workspace-trust → Enable Remote Control? (y/n). The systemd daemon hits the last prompt (y/n) headlessly and exits with Main process exited, code=exited, status=1/FAILURE. The misleading symptom in the daemon log is Error: Remote Control is not yet enabled for your account — which sounds like an account-level entitlement issue but is actually just an unseated per-machine confirmation. Two ways to seat it:
Run echo y | claude remote-control --name ready --spawn=worktree once as the service user. Persists as remoteDialogSeen: true in ~/.claude.json.
Skip the rest of the wizard entirely by pre-writing ~/.claude.json's projects[<workspace>] block with hasTrustDialogAccepted: true and enabledMcpjsonServers: ["<server-name>"] via jq.
Project MCP servers don't auto-enable from .mcp.json — they need an explicit per-project allow-list in ~/.claude.json. This is not in the .mcp.json docs. Without the project's path being listed under projects[…].enabledMcpjsonServers, claude.exe silently loads the registered servers as "configured but disabled", and /mcp shows them missing. Fix: jq-patch ~/.claude.json to add the workspace path with the right shape (see runbook section 7).
ReadWritePaths must include every dir each MCP server writes to. Symptom: browser session works, has filesystem access, can read the project, but /mcp shows no codex tools (you only see whatever Anthropic-relay MCP servers — Google Drive, etc. — your claude.ai account opted into). Root cause: codex spawns under claude.exe inside the daemon's namespace, tries to write to ~/.codex/auth.json, ~/.codex/logs_*.sqlite, and ~/.codex/sessions/, gets blocked by ProtectHome=read-only, exits, and the failure never surfaces — claude.exe treats a non-zero exit from an MCP child as "this server isn't registered" rather than as an error. The daemon and the session look healthy; only /mcp reveals the lie. Fix: add {{ service_home }}/.codex to ReadWritePaths. Generalisation: any MCP server invoked by the daemon needs its state dir in ReadWritePaths. Audit the .mcp.json against the unit's ReadWritePaths whenever you add a new MCP server.

The exact signature so you can recognise it next time (recorded here because this one cost a day):
- systemctl status claude-remote-control → active (running). No clue here.
- journalctl -u claude-remote-control → the only relevant line is one short MCP server "codex" failed warning emitted right after the first browser session attaches; it is not repeated and there is no stderr from the child. Easy to miss in a wall of session-init log lines.
- Browser session → loads fine, answers prompts, reads files. The slashed-circle "Initialized session" panel is unchanged whether codex MCP works or not, so it's useless as a canary.
- /mcp in the browser session → shows only the Anthropic-relay MCP servers your claude.ai account has authorised (commonly mcp__claude_ai_Google_Drive__*). It does not list the project's .mcp.json codex server — and there is no "failed" indicator; it just isn't there. This is the only reliable signal.
- ps -ef --forest on the LXC under the daemon's PID → for a healthy daemon you see claude remote-control --name ready → one claude.exe --session-id cse_… per browser session → each session has a codex mcp-server (or node /…/codex mcp-server) child. For a broken daemon, the session is there but the codex grandchild is missing. This is the second reliable signal and the one that points at the cause.
Repro / verification recipe (use to confirm you're looking at the same failure mode before adding paths blindly to ReadWritePaths):
```
# 1. As the service user, confirm codex itself can start:
sudo -u claude bash -lc 'codex mcp-server </dev/null'   # should print JSON-RPC banner; Ctrl-C
# 2. Now reproduce the unit's namespace and try again:
sudo systemd-run --pty --uid=claude --gid=claude \
    --property=ProtectHome=read-only \
    --property=ReadWritePaths=/home/claude/git \
    codex mcp-server </dev/null
#    → fails the same way the daemon does. Add the missing path and rerun until it banners.
```
The general pattern: any time you add a new MCP server to .mcp.json, run that systemd-run repro before deploying. It's 10 seconds and would have saved a day here.
StartLimitIntervalSec and StartLimitBurst belong in [Unit], not [Service]. systemd moved them to [Unit] back around v229/v230, so every current distro is affected. The unit doesn't refuse to load; systemd just logs /etc/systemd/system/claude-remote-control.service: Unknown key 'StartLimitIntervalSec' in section [Service], ignoring. The cap silently has no effect, and a relay flap can ntfy-spam you. Fix: relocate both keys.
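Two of the failure modes above (a ReadWritePaths entry that does not exist, and start-limit keys left in [Service]) can be caught by reading the rendered unit file before starting it. A minimal sketch, assuming a plain key=value unit file; unit_values and lint_unit are hypothetical helper names, not something the playbook ships:

```shell
# Hypothetical helpers (not part of the playbook). Run them against the
# rendered unit in /etc/systemd/system, not the .j2 template.

# Print the value of every KEY= line inside [SECTION] of a unit file.
unit_values() {  # usage: unit_values FILE SECTION KEY
  awk -v want="[$2]" -v key="$3" '
    /^\[.*\]$/ { section = $0; next }   # remember the current section header
    section == want && index($0, key "=") == 1 {
      print substr($0, length(key) + 2)
    }' "$1"
}

lint_unit() {  # usage: lint_unit FILE ; non-zero exit if anything is flagged
  local unit=$1 status=0 key path
  # Start-limit keys are ignored when they sit in [Service].
  for key in StartLimitIntervalSec StartLimitBurst; do
    if [ -n "$(unit_values "$unit" Service "$key")" ]; then
      echo "misplaced key: $key is in [Service], move it to [Unit]"
      status=1
    fi
  done
  # Every ReadWritePaths entry must exist when the mount namespace is set up.
  for path in $(unit_values "$unit" Service ReadWritePaths); do
    case $path in -*) continue ;; esac   # "-" prefix: systemd tolerates absence
    if [ ! -e "$path" ]; then
      echo "missing path: $path"
      status=1
    fi
  done
  return "$status"
}
```

Something like lint_unit /etc/systemd/system/claude-remote-control.service, run as root on the LXC, would list the problems. It only checks that paths exist, so the systemd-run repro is still needed to learn which paths an MCP server writes to.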

Diagnostic notes saved here so we don't re-derive them next time:

claude auth status reports loggedIn: true even when the API path is broken, so trust it less than you'd think. The real test is claude --print "hello" (without --bare; --bare explicitly disables OAuth reads and is a self-inflicted false negative). If --print says "Not logged in", ~/.claude/.credentials.json is incomplete or expired.
The "Initialized session" panel in claude.ai/code with three slashed-circle steps (Clone repo / Run setup script / Start Claude Code) is not showing failures for self-hosted Remote Control — those steps are cloud-mode-only. Only "Set up a cloud container" applies; the others are intentionally skipped.
A self-hosted Remote Control env URL (env_…) changes on every daemon restart. The stable Glance entry is https://claude.ai/code itself; the env shows up by name (ready) in the list — not by ID. Don't pin to specific env_… URLs.
When debugging which daemon a browser session is attached to, ps -ef --forest on the LXC gives the truth: each session is a claude.exe --session-id cse_<id> child of the daemon, and each session further spawns its MCP children. If you see two claude remote-control parents, you've got a tmux test session you forgot to kill; clean that up before diagnosing the systemd one.

Codified in:

chizuru-v2/ansible/playbooks/agents.yml — pre-create ~/.claude, ~/.cache, ~/.codex upfront.
chizuru-v2/services/agents/claude-remote-control.service.j2 — ReadWritePaths now includes .codex; StartLimitIntervalSec/Burst moved to [Unit].
homelab-docs/docs/workstation/ai-agent-dev-environment.md Section 7 — runbook now spells out the seven interactive seating steps (auth login + device-auth + jq trust + jq enabledMcpjsonServers + echo y + enable + verify).
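The two jq seating steps (trust plus enabledMcpjsonServers) could look roughly like the sketch below. The field names are the ones recorded in this entry; the function name seat_project and the merge behaviour (keep every other key already in the file) are assumptions, and the runbook remains the authoritative version.

```shell
# Hypothetical helper. Field names come from this entry; keeping the rest of
# the file untouched is an assumption about what the wizard expects.
seat_project() {  # usage: seat_project CONFIG_JSON WORKSPACE_PATH SERVER_NAME
  local cfg=$1 ws=$2 server=$3 tmp
  [ -s "$cfg" ] || echo '{}' > "$cfg"        # start from an empty object
  tmp=$(mktemp) || return 1
  if jq --arg ws "$ws" --arg server "$server" '
       .projects[$ws] = ((.projects[$ws] // {}) + {
         hasTrustDialogAccepted: true,
         enabledMcpjsonServers:
           (((.projects[$ws].enabledMcpjsonServers // []) + [$server]) | unique)
       })' "$cfg" > "$tmp"; then
    mv "$tmp" "$cfg"                         # replace only after jq succeeded
  else
    rm -f "$tmp"
    return 1
  fi
}
```

Run as the service user, for example seat_project ~/.claude.json <workspace> <server-name>. Calling it twice is harmless because the server list is de-duplicated.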

2026-05-29 — Pinning upstream release tarballs in IaC: always `tar tvz` before merging

What happened: The first three agents-lxc deploys after the initial merge failed back-to-back on the AoE install task — three different reasons in three separate commits, each catchable by a 30-second tar tvz upfront:

Wrong org: pinned njbrake/agent-of-empires (the maintainer's personal account, where releases no longer live) instead of agent-of-empires/agent-of-empires (the canonical org the README and Homebrew formula point at).
Wrong asset name: assumed Rust-style aoe-x86_64-unknown-linux-gnu.tar.gz; actual asset is Go-style aoe-linux-amd64.tar.gz.
Wrong binary path inside the archive: assumed bare aoe; the tarball ships the binary named after the asset, aoe-linux-amd64.

Each was one Forgejo Actions run plus one commit, each ~5 minutes to round-trip. Three rounds before the install task succeeded.

Root cause: I trusted my mental model of release tarball naming conventions and "upstream maintainer location" instead of verifying against the actual release page. Each guess was wrong; each was expensive to recover from because the failure loop is "edit → commit → push → merge → workflow trigger → ~90s deploy → SSH for the log → diagnose". The version-check pattern we use (<tool> --version registered, when: guards re-download) is correct for idempotency on re-runs, but the first run's success still has to be verified by hand.

Rule going forward:

Before merging any IaC commit that pins an external release tarball URL, run this against the live tarball exactly as the playbook will:

curl -fsSL "<url>" | tar tvz

That one command confirms three things at once:

URL doesn't 404 — curl -fsSL exits non-zero on non-2xx, no piping garbage to tar.
Archive is a gzipped tar — otherwise gzip: not in gzip format.
The binary you want is at the path your install -m 0755 "$tmp/<name>" /usr/local/bin/<name> task assumes — tvz shows the contents and their layout.

If the project ships a .sha256 sidecar (the agent-of-empires releases do), pin that too in the same commit so future bumps are bit-for-bit verifiable. The cost is a second curl + sha256sum -c step in the install task; the payoff is "did upstream re-publish this tag" detection.
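The layout check and the digest pin can be scripted so the same test runs locally and in review. A minimal sketch, assuming GNU tar and sha256sum are available; the function names are hypothetical and the member name must match exactly what the install task expects:

```shell
# Hypothetical helper names; only curl, tar, grep and sha256sum are used.

# Succeeds only if MEMBER is listed (exact match) in the gzipped tarball.
tarball_has() {  # usage: tarball_has ARCHIVE MEMBER
  tar -tzf "$1" | grep -Fxq -- "$2"
}

# Succeeds only if ARCHIVE hashes to EXPECTED_SHA256 (hex digest).
tarball_sha_ok() {  # usage: tarball_sha_ok ARCHIVE EXPECTED_SHA256
  [ "$(sha256sum "$1" | cut -d' ' -f1)" = "$2" ]
}

# Download once, then check the layout and, if given, the pinned digest.
verify_release() {  # usage: verify_release URL MEMBER [EXPECTED_SHA256]
  local tmp rc
  tmp=$(mktemp) || return 1
  curl -fsSL "$1" -o "$tmp" \
    && tarball_has "$tmp" "$2" \
    && { [ -z "${3:-}" ] || tarball_sha_ok "$tmp" "$3"; }
  rc=$?
  rm -f "$tmp"
  return "$rc"
}
```

For the AoE case this would be verify_release "<url>" aoe-linux-amd64, which fails on the wrong org (404), the wrong asset name (404), and the wrong binary path (member not found).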

Applies to: AoE, lazygit, difftastic, delta, any future Go/Rust/single-binary tool installed via release-tarball in agents.yml, dev.yml, or any sibling playbook.

2026-05-28 — Forgejo Actions task logs: SSH + zstdcat is the working path; revisit

What happened: Two consecutive Forgejo Actions deploys of agents.yml (LXC 135, workflow Deploy Agents) failed mid-playbook. To diagnose I needed the full stderr of the failing Ansible task, and our existing options for that aren't a one-call story yet, so it took routing the URL run number → internal task ID → SSH + pct exec + zstdcat by hand. Recording the access pattern that worked, the gotcha that traps it, and the wrapper that should replace it later.

Access path that worked, end-to-end:

Find the failed run via the Forgejo API:
```
TOKEN=$(git -C ~/git/homelab-docs config --get remote.origin.url \
  | sed -n 's#.*claude:\([^@]*\)@.*#\1#p')

curl -H "Authorization: token $TOKEN" \
  "https://git.eva-00.network/api/v1/repos/holo/chizuru-v2/actions/tasks?limit=20"
```
Each row has an internal id that is not the run number in the URL. services/forgejo/runbook.md Logs section is explicit: Loki's task_id label is the internal id, not /actions/runs/<N>. The tasks API response makes both visible side-by-side, so cross-walking is a python -c json away.
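The cross-walk from run number to internal id can also be done with jq. This is a sketch under an assumption about the response shape: a top-level workflow_runs array whose rows carry id and run_number. Check one real response before relying on it.

```shell
# Assumption: the tasks response looks like
#   {"workflow_runs":[{"id":2216,"run_number":712, ...}, ...]}
# Adjust the jq path if this Forgejo version shapes it differently.
task_id_for_run() {  # usage: <tasks JSON on stdin> | task_id_for_run RUN_NUMBER
  jq -r --argjson run "$1" \
    '.workflow_runs[] | select(.run_number == $run) | .id'
}
```

Piping the curl call above into task_id_for_run 712 should then print the internal id for that run, if the assumed field names hold.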

SSH to the Proxmox host and pct exec into LXC 100 (Forgejo) to read the compressed log:

ssh <user>@<proxmox-host> \
  "pct exec 100 -- find /var/lib/forgejo/data/actions_log/<owner>/<repo> -name '<internal_id>.log.zst'"
# then:
ssh <user>@<proxmox-host> \
  "pct exec 100 -- bash -c 'zstdcat /var/lib/forgejo/data/actions_log/<owner>/<repo>/<hash>/<internal_id>.log.zst'"

Example for holo/chizuru-v2 task 2216: /var/lib/forgejo/data/actions_log/holo/chizuru-v2/a8/2216.log.zst. Forgejo puts each task under a 2-char hash prefix dir.
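The prefix directory looks computable, which would let a wrapper skip the find step. The sketch below assumes the 2-char directory is the task id modulo 256 in lowercase hex. That is inferred from the one example here (2216 % 256 = 168 = 0xa8), so treat it as a guess and keep find as the fallback.

```shell
# Assumption: prefix dir = task id modulo 256, as two lowercase hex digits.
# Inferred from the single example in this entry, not from Forgejo docs.
task_log_path() {  # usage: task_log_path OWNER/REPO INTERNAL_ID
  printf '/var/lib/forgejo/data/actions_log/%s/%02x/%d.log.zst\n' \
    "$1" "$(( $2 % 256 ))" "$2"
}
```

task_log_path holo/chizuru-v2 2216 reproduces the path quoted above.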

Tail of the decompressed log carries the fatal: [host]: FAILED! => {...} block with the real stderr, the playbook line number that called the failing module, and a PLAY RECAP. That's exactly what's needed to fix the playbook.

Works immediately from anywhere with SSH to chizuru. The Forgejo web UI is real-time; the Loki path lands the same data with a ~2 min cron-push delay.

Why this needed a write-up: the most natural API call — "get me the log for task X" — does not exist in the Forgejo version exposed by git.eva-00.network. The Forgejo MCP server's list_run_jobs and get_job_log_preview return 404 (already flagged in the Forgejo service runbook). Without that endpoint, an agent (Claude / Codex / mcp-forgejo) staring at a failed workflow either guesses or has to bounce off a human.

Underlying failures it surfaced (separate concerns, recorded for completeness):

Task 2216 / run 712: Add Docker GPG key failed with curl (6) Could not resolve host: download.docker.com. Transient DNS at LXC-creation time — getent hosts download.docker.com worked when I rechecked from the LXC minutes later. Likely a race between Proxmox bringing up eth0 + propagating resolv.conf and the playbook's first external DNS resolution.
Task 2219 / run 716 (manual workflow_dispatch retry): Install base packages failed with No package matching 'git-delta' is available. Real config bug — Debian 12 doesn't ship git-delta in main (it lives in bookworm-backports). agents.yml should install delta from a release tarball, the same way as lazygit / difftastic, not from apt.

The backlog item to revisit:

For ad-hoc human debugging the SSH + zstdcat path is fine. For agent-driven debugging — and for "from anywhere" access where SSH isn't always handy — we want one of:

Tiny wrapper script committed to chizuru-v2 (or this repo): bin/forgejo-task-log <owner>/<repo> <task_id> that does the API lookup + SSH + pct exec + zstdcat in one call. Cheapest possible start; needs only SSH access and the existing token.
Loki via Grafana API for the same data, called from agents the same way mcp-grafana already pulls Prometheus/Loki. The claude_mcp_service_account_token from secret/grafana already exists; query is {job="forgejo-actions", task_id="<internal_id>"}. Caveat: ~2 min ingestion delay and the internal-id lookup still needed.
Upgrade Forgejo to a version exposing the run-log REST endpoint so mcp-forgejo can fetch logs directly — no SSH, no Loki, single tool call. Gated by upstream Forgejo features.

Pick when this stops being a one-off. Option 1 is the cheapest start. Option 2 is the right long-term path: works without inbound SSH, lands logs in the same dashboards as everything else, fits the existing agent-token pattern.

Prevention going forward:

This entry is the durable note for "how do I read a failed Forgejo Actions task's log."
Add retries: / delay: on the curl-based GPG-key tasks in agents.yml (and dev.yml) so transient DNS at LXC-creation self-heals without a manual workflow_dispatch.
Fix git-delta install in agents.yml: drop it from the apt base list, add a pinned release-tarball task (same pattern as lazygit / difftastic). Tracked as a separate small commit, not in this entry.
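For tasks that shell out to curl, the retries:/delay: idea has a direct shell equivalent. A minimal sketch with a hypothetical retry helper; it retries on any non-zero exit, so a DNS failure (curl exit 6) gets retried too:

```shell
# Hypothetical helper: run COMMAND up to ATTEMPTS times, sleeping DELAY
# seconds between failures. Returns 0 on success, else the last exit status.
retry() {  # usage: retry ATTEMPTS DELAY COMMAND [ARGS...]
  local attempts=$1 delay=$2 n=1 rc=0
  shift 2
  while true; do
    "$@" && return 0
    rc=$?
    if [ "$n" -ge "$attempts" ]; then
      return "$rc"
    fi
    echo "attempt $n/$attempts failed (exit $rc), retrying in ${delay}s" >&2
    sleep "$delay"
    n=$((n + 1))
  done
}
```

Usage would be along the lines of retry 5 3 curl -fsSL <gpg-key-url> -o <keyring-path>, with both placeholders taken from the existing task.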

2026-04-02 — Cgroup OOM kills misreported as clean exits by Docker

What happened: Meilisearch on LXC 117 (karakeep) was restarting every ~60 seconds. docker inspect showed ExitCode: 0 and OOMKilled: false. Container logs showed no errors. docker stats showed normal memory (because it samples instantaneously, not at the moment of kill). Spent significant time checking config, master key encoding, and potential Meilisearch bugs before finding the real cause.

Root cause: The Linux kernel OOM killer operates at the cgroup level (the LXC's memory limit), not the container level. When it kills a process, Docker restarts it via restart: unless-stopped but reports the exit as clean (ExitCode: 0). docker inspect's OOMKilled field only reflects Docker's own OOM handling, not kernel-level cgroup kills.

The actual RSS was ~1.9 GB against a 2 GB LXC limit. Visible only in dmesg on the Proxmox host:

Memory cgroup out of memory: Killed process (meilisearch) anon-rss:1985264kB

Fix: Increased LXC RAM to 8 GB. Added MEILI_EXPERIMENTAL_REDUCE_INDEXING_MEMORY_USAGE=true to cap indexing spikes.

Prevention going forward:

- Grafana alert on increase(node_vmstat_oom_kill{job="chizuru"}[5m]) > 0 fires immediately.
- Alloy on chizuru ships the kernel journal to Loki, searchable at {job="chizuru-syslog"} |= "oom-kill".
- Never trust docker inspect OOMKilled for containerized workloads inside LXCs. Always check dmesg on the Proxmox host when a container restarts unexpectedly with ExitCode 0.
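The dmesg check can be reduced to one filter that prints the victim and its resident size. A minimal sketch; the function name is hypothetical, and the sed pattern is modelled on the kernel line quoted in this entry, so other kernel versions may need a different pattern:

```shell
# Reads kernel log text on stdin and prints "<process> <anon-rss>MiB" for each
# cgroup OOM kill. Pattern is an assumption based on the line quoted above;
# process names containing spaces are not handled.
cgroup_oom_kills() {
  grep 'Memory cgroup out of memory: Killed process' \
    | sed -n 's/.*Killed process[^(]*(\([^)]*\)).*anon-rss:\([0-9]*\)kB.*/\1 \2/p' \
    | while read -r name kb; do
        echo "$name $((kb / 1024))MiB"
      done
}
```

On the Proxmox host this would be fed from dmesg, for example dmesg | cgroup_oom_kills. An empty result means no cgroup OOM kill is in the current kernel ring buffer, not that none ever happened.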

2026-03-25 — Overly broad workflow path triggers

What happened: Adding mediabot_hosts to ansible/inventory.yml triggered every Forgejo Actions workflow (20+), because most workflows included ansible/inventory.yml and ansible/ansible.cfg in their paths: filter. Multiple unrelated deployments ran (and some failed, including Vaultwarden).

Impact: 220+ failed workflow runs accumulated over time. The Vaultwarden deploy was flagged as broken when the service itself was fine — it was just a spurious re-deploy triggered by an unrelated inventory change.

Root cause: When workflows were first created, inventory.yml and ansible.cfg were added as path triggers under the assumption that changes to shared Ansible files should redeploy everything. In practice, inventory changes are almost always scoped to a single service (adding/changing one host), and ansible.cfg almost never changes.

Fix: Removed ansible/inventory.yml and ansible/ansible.cfg from the paths: filter of all 20 workflows. Each workflow now only triggers on changes to its own service files, playbook, and workflow definition.

Rule going forward: Workflow path triggers should be scoped to the specific service. Shared files like inventory.yml should not be in path triggers. If a shared file change requires redeploying a service, use workflow_dispatch to trigger it manually.
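A quick audit can flag workflows whose paths: filter still lists a shared Ansible file. This sketch uses a deliberately naive awk scan rather than a YAML parser. It assumes the filter is written as a block list directly under a paths: key, and it ignores mentions elsewhere, such as an ansible-playbook -i argument in a run step. The function name and directory argument are illustrative.

```shell
# Prints each workflow file whose paths: block names a shared Ansible file.
# Naive line scan: assumes "paths:" followed by "- item" lines.
broad_triggers() {  # usage: broad_triggers WORKFLOW_DIR
  local f
  for f in "$1"/*.yml "$1"/*.yaml; do
    [ -f "$f" ] || continue
    awk '
      /^[ \t]*paths:/        { in_paths = 1; next }
      in_paths && /^[ \t]*-/ {
        if ($0 ~ /ansible\/(inventory\.yml|ansible\.cfg)/) found = 1
        next
      }
      { in_paths = 0 }       # any other line ends the paths block
      END { exit !found }
    ' "$f" && echo "$f"
  done
  return 0
}
```

An empty result for the repo's workflow directory means no workflow is still triggered by the shared files.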

2026-03-26 — Vault auto-unseal race condition

What happened: Vault was sealed (503) causing the Vaultwarden deploy to fail. The Proxmox hook script for auto-unseal had fired on boot but only applied 2 of 3 required unseal keys — one key failed because Vault wasn't ready yet.

Impact: Any workflow fetching secrets from Vault fails with 503. The 5-minute cron fallback would have eventually fixed it, but workflows that ran in the gap failed.

Root cause: The hook script used a fixed sleep 5 before attempting unseal. If Vault takes longer than 5 seconds to start listening (e.g. raft recovery, slow disk), the first unseal call fails silently and only 2 of 3 keys get applied.

Fix: Replaced sleep 5 with a retry loop that polls sys/seal-status for up to 30 seconds before starting the unseal sequence.

Rule going forward: Never use fixed sleeps to wait for services — always poll for readiness. The cron job is a good safety net but workflows shouldn't depend on it.
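The poll-for-readiness pattern is small enough to keep as a shared helper. A minimal sketch, not the hook script itself; wait_until is a hypothetical name and the one-second interval is an arbitrary choice:

```shell
# Hypothetical helper: run COMMAND once per second until it succeeds or
# TIMEOUT seconds have passed. Returns 0 on success, 1 on timeout.
wait_until() {  # usage: wait_until TIMEOUT COMMAND [ARGS...]
  local deadline=$(( $(date +%s) + $1 ))
  shift
  until "$@"; do
    if [ "$(date +%s)" -ge "$deadline" ]; then
      return 1
    fi
    sleep 1
  done
  return 0
}
```

For the Vault case the probe could be wait_until 30 curl -fsS -o /dev/null "$VAULT_ADDR/v1/sys/seal-status", with the unseal sequence starting only when it returns 0.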

2026-03-26 — Glance custom icons must use base64 data URIs

What happened: After adding MediaManager to Glance, the bookmark and monitor showed a Plex icon (si:plex) because MediaManager has no Simple Icons entry. Replacing it with a local SVG asset (/assets/mediamanager-logo.svg) rendered blank, and switching to PNG (/assets/mediamanager-logo.png) also showed blank with a question mark on hover.

Impact: Cosmetic — wrong or missing icon on the Glance dashboard.

Root cause: Glance's icon: field for bookmarks and monitors supports icon-set prefixes (si:name for Simple Icons, di:name for Dashboard Icons), external URLs (https://...), and data URIs. Local asset paths (/assets/...) work for the main page template (e.g. <img src="/assets/diagram.svg">) but do not work in the icon: field of bookmarks/monitors.

Fix: Convert the custom icon to a small PNG (128x128), then base64-encode it as an inline data URI:

# Convert SVG to PNG (requires librsvg)
rsvg-convert -w 128 -h 128 logo.svg -o logo.png

# Generate data URI
echo "data:image/png;base64,$(base64 < logo.png | tr -d '\n')"

Paste the full data:image/png;base64,... string as the icon: value in glance.yml.

Rule going forward: For custom icons in Glance bookmarks/monitors, use one of:

si:name — if the service has a Simple Icons entry
di:name — if the service has a Dashboard Icons entry (preferred for self-hosted apps — better coverage)
https://... — a publicly accessible URL (e.g. raw GitHub link from a public repo)
data:image/png;base64,... — inline base64 data URI for anything else

2026-04-01 — Ansible uri module mangles passwords in JSON bodies

What happened: Grimmory's first-run setup API (POST /api/v1/setup) succeeded, but the immediate follow-up login (POST /api/v1/auth/login) via Ansible's uri module returned "Invalid credentials" — even with the exact same password. Direct curl with the same payload worked instantly.

Impact: Playbook failed mid-run. Retrying with delays didn't help — it wasn't a timing issue.

Root cause: Ansible's uri module serializes JSON bodies through Jinja2 templating, which can alter special characters in passwords. The password sent to the setup endpoint gets stored as a BCrypt hash, but the password sent to the login endpoint has been subtly modified by the serialization, so the hash comparison fails. This affects any endpoint that receives a password in a JSON body.

Fix: Switched all password-bearing API calls (/api/v1/setup, /api/v1/auth/login, user creation endpoints) from ansible.builtin.uri to ansible.builtin.shell with curl. Non-password API calls (library creation, settings, health checks) can safely use the uri module.

Rule going forward: In Ansible playbooks, always use shell + curl for any API call that includes a password in the JSON body. Use the uri module only for endpoints where the body contains no secrets.
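When moving a password-bearing call to shell + curl, building the body with jq avoids hand-quoting the password inside a JSON string. A minimal sketch; login_body is a hypothetical helper and the username/password field names are an assumption about the API, so check them against the endpoint:

```shell
# Builds the JSON body with jq so the password is escaped exactly once.
# Field names "username" and "password" are assumptions; verify per API.
login_body() {  # usage: login_body USERNAME PASSWORD
  jq -cn --arg u "$1" --arg p "$2" '{username: $u, password: $p}'
}
```

The body can then be piped straight into curl, for example login_body "$user" "$password" | curl -fsS -X POST -H 'Content-Type: application/json' --data-binary @- <login-url>. Passing the password through an environment variable rather than interpolating it into the shell command keeps it out of the templating path as well.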

2026-04-01 — Grimmory OIDC config is stored in the database, not env vars

What happened: After deploying Grimmory with OIDC_ENABLED=true, OIDC_CLIENT_ID, OIDC_CLIENT_SECRET, and OIDC_ISSUER_URI in docker-compose environment variables, the login page still showed a username/password form with no OIDC option. The public settings endpoint returned "oidcEnabled": false.

Impact: OIDC login was completely non-functional despite env vars being correctly set.

Root cause: Grimmory (formerly BookLore) is a Spring Boot application, but the OIDC env vars are not Spring Boot properties. Grimmory stores all OIDC configuration in its database (app_settings table) via AppSettingService.java. The OIDC_ENABLED env var in docker-compose is simply ignored — the app reads AppSettingKey.OIDC_ENABLED from the database with a default of "false".

This is unusual because most self-hosted apps configure OIDC via environment variables. The docker-compose template even included the env vars, which made the misconfiguration harder to diagnose.

Fix: Removed the useless OIDC env vars from docker-compose. Added a post-deploy API call to configure OIDC via PUT /api/v1/settings:

curl -X PUT "http://localhost:6060/api/v1/settings" \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '[
    {"name": "OIDC_ENABLED", "value": "true"},
    {"name": "OIDC_PROVIDER_DETAILS", "value": {
      "providerName": "PocketID",
      "clientId": "...",
      "clientSecret": "...",
      "issuerUri": "https://auth.example.com",
      "scopes": "openid profile email",
      "claimMapping": {"username": "preferred_username", "name": "name", "email": "email", "groups": "groups"}
    }},
    {"name": "OIDC_FORCE_ONLY_MODE", "value": "true"}
  ]'

Rule going forward: When adding OIDC to a new service, verify the configuration mechanism by checking the source code. Don't assume env vars work just because the docker-compose template includes them. Check: does the app read from env vars, config files, or the database?

2026-04-01 — RomM OIDC metadata URL fallback points to itself

What happened: After enabling OIDC on RomM with PocketID, clicking "Login with PocketID" returned a 500 Internal Server Error: "Expecting value" (JSON parse error).

Impact: OIDC login completely broken.

Root cause: RomM uses authlib for OIDC. When OIDC_SERVER_METADATA_URL is not set, it falls back to constructing the discovery URL from OIDC_SERVER_APPLICATION_URL. We had OIDC_SERVER_APPLICATION_URL=https://romm.example.com (RomM's own URL), so authlib tried to fetch https://romm.example.com/.well-known/openid-configuration — which is RomM itself, not PocketID. RomM returned HTML, authlib tried to parse it as JSON, and crashed.

Fix: Explicitly set OIDC_SERVER_METADATA_URL=https://auth.example.com/.well-known/openid-configuration pointing to PocketID. Also corrected OIDC_SERVER_APPLICATION_URL to point to PocketID per the official docs.

Rule going forward: Always set OIDC_SERVER_METADATA_URL explicitly. Don't rely on fallback URL construction — it's a common source of circular reference bugs.
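A pre-deploy check can confirm that the metadata URL returns a discovery document rather than an HTML page. This is a sketch, assuming jq is installed; is_oidc_discovery is a hypothetical name, and it only tests for two standard metadata fields rather than validating the whole document.

```shell
# Reads a response body on stdin. Succeeds only if it parses as JSON and has
# string-valued "issuer" and "authorization_endpoint" fields.
is_oidc_discovery() {
  jq -e '(.issuer | type == "string")
         and (.authorization_endpoint | type == "string")' >/dev/null 2>&1
}
```

For example: curl -fsS "$OIDC_SERVER_METADATA_URL" | is_oidc_discovery || echo "not a discovery document". Pointed at RomM's own URL, this would have failed on the HTML response before authlib ever saw it.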

2026-04-01 — RomM OIDC callback URL mismatch

What happened: After fixing the metadata URL, PocketID login succeeded but the redirect back to RomM returned "Not Found."

Impact: OIDC flow completes authentication but fails on the callback.

Root cause: The PocketID client was configured with callback URL /api/oauth/openid-callback, but RomM's actual OIDC callback endpoint is /api/oauth/openid (defined in endpoints/auth.py). The -callback suffix doesn't exist.

Fix: Updated the callback URL in PocketID to https://romm.example.com/api/oauth/openid.

Rule going forward: Always verify the exact callback URL from the application's source code or documentation. Don't guess the path — common variations like /callback, /openid-callback, /oauth2/callback are all different.

What happened: After fixing the callback URL, RomM's OIDC login returned: {"detail": "Email is not verified."}.

Impact: Users can authenticate with PocketID but are rejected by RomM.

Root cause: PocketID lists email_verified under claims_supported in its OpenID Connect discovery document. RomM's auth handler (base_handler.py) checks this claim: if the provider advertises email verification, RomM requires email_verified: true in the token. Our PocketID user had emailVerified: false.

Fix: Updated the PocketID user via API:

curl -X PUT "https://auth.example.com/api/users/<user-id>" \
  -H "X-API-Key: <api-key>" \
  -H "Content-Type: application/json" \
  -d '{"emailVerified": true}'

Rule going forward: When setting up PocketID OIDC for a new service, ensure all users have emailVerified: true. Some OIDC consumers enforce this when the provider advertises the claim.

2026-04-01 — Grimmory OIDC callback URL uses frontend route, not API endpoint

What happened: After configuring Grimmory's OIDC via the settings API, clicking the PocketID login button returned "Invalid callback URL" from PocketID.

Impact: OIDC authentication flow couldn't start.

Root cause: The PocketID client had callback URL /login/oauth2/code/pocketid (a standard Spring Security path). But Grimmory uses a custom SPA-based OIDC flow — the actual callback is a frontend route at /oauth2-callback, not a backend API endpoint. The frontend receives the authorization code and posts it to the backend at POST /api/v1/auth/oidc/callback.

Fix: Updated the PocketID client callback URL to https://library.example.com/oauth2-callback.

Rule going forward: Don't assume Spring Boot apps use the standard Spring Security OAuth2 callback path (/login/oauth2/code/{registrationId}). Check the application's source code — SPAs often use custom frontend routes for OAuth callbacks.

2026-04-01 — force_clean must reset database for DB-backed services

What happened: Running the Grimmory playbook with force_clean=true wiped the filesystem data directory but the application came back with existing state — setup was skipped because the admin user still existed.

Impact: force_clean didn't actually clean; the app wasn't in first-run state.

Root cause: The force_clean logic only deleted /unohana/grimmory/data (filesystem). But Grimmory stores users, settings, and library config in MariaDB, not on the filesystem. The database still had the admin user from the previous deployment.

Fix: Added a database reset step to force_clean:

docker exec mariadb mariadb -u root -e \
  "DROP DATABASE IF EXISTS grimmory;
   CREATE DATABASE grimmory;
   GRANT ALL PRIVILEGES ON grimmory.* TO 'grimmory'@'%';
   FLUSH PRIVILEGES;"

Rule going forward: When implementing force_clean for a service, identify where all persistent state lives. For DB-backed services, wiping the filesystem isn't enough — the database must be reset too. SQLite-based services (like Shoko) only need the config directory wiped since the DB file lives there.

2026-04-03 — Karakeep AI tagging failures with CPU-only Ollama

What happened: After importing ~1000 bookmarks into Karakeep, AI tagging failed on nearly all of them. Multiple root causes stacked:

Models not installed — llama3.1 and llava weren't in Ollama. Every tagging attempt failed with "model not found". The original ollama pull via pct exec silently failed due to $PATH not including /usr/local/bin — exit code was 0 despite the error.
Inference timeouts — Once models were installed, llama3.1 (8B params) on CPU took 30-120s per bookmark. The default INFERENCE_JOB_TIMEOUT_SEC=30 killed most jobs. Setting it to 120 helped but some still timed out.
Single-core inference — Ollama in the LXC only used 1 of 4 allocated cores. Bumping to 8 cores didn't help until we created a model variant with PARAMETER num_thread 8 via a Modelfile, which brought CPU usage to ~750%.
Queue desync — After resetting taggingStatus to pending in the bookmarks DB, no queue jobs existed in the separate queue.db. The inference worker only enqueues on bookmark creation, not on status change. The admin panel's "Regenerate AI Tags for Pending Bookmarks" button is the correct way to re-trigger.

Root cause: Multiple compounding issues: missing models, conservative timeouts for CPU inference, Ollama thread count not matching available cores, and misunderstanding how the internal job queue works.

Fix:

- Pull models with full path: /usr/local/bin/ollama pull <model>
- Set INFERENCE_JOB_TIMEOUT_SEC=120 and INFERENCE_FETCH_TIMEOUT_SEC=300
- Create model variants with explicit num_thread matching LXC core count
- Use the admin panel "Regenerate" buttons to re-enqueue failed/pending jobs
- Set OLLAMA_KEEP_ALIVE=-1 to avoid cold-start delays between requests

Rules going forward:

- When using pct exec to run binaries, always use the full path (/usr/local/bin/ollama, not ollama); $PATH is not inherited.
- For CPU-only Ollama, consider smaller models (gemma3:4b); community consensus is that smaller models produce better normalized tags for bookmark categorization.
- Always set INFERENCE_JOB_TIMEOUT_SEC >= 120 for CPU inference. Note: there's a hardcoded 5-minute Node.js undici timeout bug (karakeep-app/karakeep#1586) that can't be overridden.
- Use the admin panel (Settings > Admin > Background Jobs) to manage bulk re-tagging, not direct DB manipulation.
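The num_thread variant mentioned above comes down to a two-line Modelfile. A minimal sketch that only writes the file; the helper name and the output path are illustrative:

```shell
# Writes a Modelfile that pins the thread count for a base model.
write_threaded_modelfile() {  # usage: write_threaded_modelfile BASE THREADS OUT
  printf 'FROM %s\nPARAMETER num_thread %s\n' "$1" "$2" > "$3"
}
```

After write_threaded_modelfile llama3.1 8 /tmp/Modelfile, the variant would be registered with /usr/local/bin/ollama create <variant-name> -f /tmp/Modelfile (full path, per the pct exec rule). Karakeep then has to be pointed at the variant name instead of the base model.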