
Lessons Learned

Operational incidents, mistakes, and insights worth remembering. Each entry captures what happened, why, and how to prevent it next time.


2026-04-02 — Cgroup OOM kills misreported as clean exits by Docker

What happened: Meilisearch on LXC 117 (karakeep) was restarting every ~60 seconds. docker inspect showed ExitCode: 0 and OOMKilled: false. Container logs showed no errors. docker stats showed normal memory (because it samples instantaneously, not at the moment of kill). Spent significant time checking config, master key encoding, and potential Meilisearch bugs before finding the real cause.

Root cause: The Linux kernel OOM killer operates at the cgroup level (the LXC's memory limit), not the container level. When it kills a process, Docker restarts it via restart: unless-stopped but reports the exit as clean (ExitCode: 0). docker inspect's OOMKilled field only reflects Docker's own OOM handling, not kernel-level cgroup kills.

The actual RSS was ~1.9 GB against a 2 GB LXC limit. Visible only in dmesg on the Proxmox host:

Memory cgroup out of memory: Killed process (meilisearch) anon-rss:1985264kB

Fix: Increased LXC RAM to 8 GB. Added MEILI_EXPERIMENTAL_REDUCE_INDEXING_MEMORY_USAGE=true to cap indexing spikes.

Prevention going forward:

  - Grafana alert on increase(node_vmstat_oom_kill{job="chizuru"}[5m]) > 0 fires immediately.
  - Alloy on chizuru ships the kernel journal to Loki, searchable at {job="chizuru-syslog"} |= "oom-kill".
  - Never trust docker inspect OOMKilled for containerized workloads inside LXCs. Always check dmesg on the Proxmox host when a container restarts unexpectedly with ExitCode 0.
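The dmesg check can be scripted so the kill and its RSS stand out at a glance. This is a minimal sketch, assuming kernel log lines shaped like the one quoted above; oom_kills is a hypothetical helper name, and the pattern relies only on the "Memory cgroup out of memory" prefix, the process name in parentheses, and the anon-rss field. Process names containing spaces are not handled.

```shell
# Sketch: summarize cgroup OOM kills from kernel log text on stdin.
# Prints "<process> <anon-rss in MiB> MiB" for each kill line found.
# oom_kills is a hypothetical helper name, not an existing tool.
oom_kills() {
  sed -n 's/.*Memory cgroup out of memory: Killed process[^(]*(\([^)]*\)).*anon-rss:\([0-9][0-9]*\)kB.*/\1 \2/p' |
    awk '{ printf "%s %d MiB\n", $1, $2 / 1024 }'
}

# Usage on the Proxmox host (needs permission to read the kernel log):
#   dmesg | oom_kills
```

Comparing the reported figure with the LXC's memory limit shows quickly whether the container was simply running out of headroom.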


2026-03-25 — Overly broad workflow path triggers

What happened: Adding mediabot_hosts to ansible/inventory.yml triggered every Forgejo Actions workflow (20+), because most workflows included ansible/inventory.yml and ansible/ansible.cfg in their paths: filter. Multiple unrelated deployments ran (and some failed, including Vaultwarden).

Impact: 220+ failed workflow runs accumulated over time. The Vaultwarden deploy was flagged as broken when the service itself was fine — it was just a spurious re-deploy triggered by an unrelated inventory change.

Root cause: When workflows were first created, inventory.yml and ansible.cfg were added as path triggers under the assumption that changes to shared Ansible files should redeploy everything. In practice, inventory changes are almost always scoped to a single service (adding/changing one host), and ansible.cfg almost never changes.

Fix: Removed ansible/inventory.yml and ansible/ansible.cfg from the paths: filter of all 20 workflows. Each workflow now only triggers on changes to its own service files, playbook, and workflow definition.

Rule going forward: Workflow path triggers should be scoped to the specific service. Shared files like inventory.yml should not be in path triggers. If a shared file change requires redeploying a service, use workflow_dispatch to trigger it manually.
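To keep the rule from eroding, a quick audit can flag workflows that list a shared file as a path trigger again. This is a rough sketch: shared_trigger_workflows is a hypothetical helper, the .forgejo/workflows location is an assumption about the repository layout, and the match is a plain-text check for YAML list entries rather than a real YAML parse.

```shell
# Sketch: list workflow files that have a shared Ansible file as a YAML list entry
# (as it would appear under paths:). Mentions inside run: commands are ignored.
# shared_trigger_workflows is a hypothetical helper name.
shared_trigger_workflows() {
  dir=$1
  grep -lE "^[[:space:]]*-[[:space:]]*['\"]?ansible/(inventory\.yml|ansible\.cfg)['\"]?[[:space:]]*\$" \
    "$dir"/*.yml 2>/dev/null
}

# Usage from the repository root (directory is an assumption):
#   shared_trigger_workflows .forgejo/workflows
```

An empty result means no workflow is triggered by the shared files.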


2026-03-26 — Vault auto-unseal race condition

What happened: Vault was sealed (503) causing the Vaultwarden deploy to fail. The Proxmox hook script for auto-unseal had fired on boot but only applied 2 of 3 required unseal keys — one key failed because Vault wasn't ready yet.

Impact: Any workflow fetching secrets from Vault fails with 503. The 5-minute cron fallback would have eventually fixed it, but workflows that ran in the gap failed.

Root cause: The hook script used a fixed sleep 5 before attempting unseal. If Vault takes longer than 5 seconds to start listening (e.g. raft recovery, slow disk), the first unseal call fails silently and only 2 of 3 keys get applied.

Fix: Replaced sleep 5 with a retry loop that polls sys/seal-status for up to 30 seconds before starting the unseal sequence.

Rule going forward: Never use fixed sleeps to wait for services — always poll for readiness. The cron job is a good safety net but workflows shouldn't depend on it.
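The poll-for-readiness pattern can be sketched as a small helper. This is not the actual hook script: wait_for and vault_listening are hypothetical names, VAULT_ADDR is assumed to be set, and the timeout is approximate because time spent inside the probe itself is not counted.

```shell
# Sketch: run a readiness command once per second until it succeeds
# or roughly <timeout> seconds have passed. wait_for is a hypothetical helper.
wait_for() {
  timeout=$1
  shift
  elapsed=0
  until "$@" >/dev/null 2>&1; do
    if [ "$elapsed" -ge "$timeout" ]; then
      return 1
    fi
    sleep 1
    elapsed=$((elapsed + 1))
  done
  return 0
}

# Assumed probe: curl exits 0 as soon as Vault answers on sys/seal-status,
# whether it is sealed or not. VAULT_ADDR is assumed to be exported.
vault_listening() {
  curl -s -o /dev/null --max-time 2 "$VAULT_ADDR/v1/sys/seal-status"
}

# Usage in a boot hook: only start the unseal sequence once Vault responds.
#   wait_for 30 vault_listening || { echo "Vault not ready after 30s" >&2; exit 1; }
```

Failing loudly on timeout, instead of unsealing anyway, avoids the partial 2-of-3 state and leaves the cron fallback to recover.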


2026-03-26 — Glance custom icons must use base64 data URIs

What happened: After adding MediaManager to Glance, the bookmark and monitor showed a Plex icon (si:plex) because MediaManager has no Simple Icons entry. Replacing it with a local SVG asset (/assets/mediamanager-logo.svg) rendered blank, and switching to PNG (/assets/mediamanager-logo.png) also showed blank with a question mark on hover.

Impact: Cosmetic — wrong or missing icon on the Glance dashboard.

Root cause: Glance's icon: field for bookmarks and monitors accepts icon-set prefixes such as Simple Icons (si:name) and Dashboard Icons (di:name), external URLs (https://...), and data URIs. Local asset paths (/assets/...) work for the main page template (e.g. <img src="/assets/diagram.svg">) but do not work in the icon: field of bookmarks/monitors.

Fix: Convert the custom icon to a small PNG (128x128), then base64-encode it as an inline data URI:

# Convert SVG to PNG (requires librsvg)
rsvg-convert -w 128 -h 128 logo.svg -o logo.png

# Generate data URI
echo "data:image/png;base64,$(base64 < logo.png | tr -d '\n')"

Paste the full data:image/png;base64,... string as the icon: value in glance.yml.

Rule going forward: For custom icons in Glance bookmarks/monitors, use one of:

  1. si:name — if the service has a Simple Icons entry
  2. di:name — if the service has a Dashboard Icons entry (preferred for self-hosted apps — better coverage)
  3. https://... — a publicly accessible URL (e.g. raw GitHub link from a public repo)
  4. data:image/png;base64,... — inline base64 data URI for anything else

2026-04-01 — Ansible uri module mangles passwords in JSON bodies

What happened: Grimmory's first-run setup API (POST /api/v1/setup) succeeded, but the immediate follow-up login (POST /api/v1/auth/login) via Ansible's uri module returned "Invalid credentials" — even with the exact same password. Direct curl with the same payload worked instantly.

Impact: Playbook failed mid-run. Retrying with delays didn't help — it wasn't a timing issue.

Root cause: Ansible's uri module serializes JSON bodies through Jinja2 templating, which can alter special characters in passwords. The password sent to the setup endpoint gets stored as a BCrypt hash, but the password sent to the login endpoint has been subtly modified by the serialization, so the hash comparison fails. This affects any endpoint that receives a password in a JSON body.

Fix: Switched all password-bearing API calls (/api/v1/setup, /api/v1/auth/login, user creation endpoints) from ansible.builtin.uri to ansible.builtin.shell with curl. Non-password API calls (library creation, settings, health checks) can safely use the uri module.

Rule going forward: In Ansible playbooks, always use shell + curl for any API call that includes a password in the JSON body. Use the uri module only for endpoints where the body contains no secrets.
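A minimal sketch of the shell + curl pattern follows. The JSON field names username and password, and the BASE_URL, ADMIN_USER, and ADMIN_PASS variables, are assumptions for illustration; json_escape and login_payload are hypothetical helpers. The escaping covers only backslashes and double quotes, so a password containing newlines or other control characters would need a real JSON encoder such as jq.

```shell
# Sketch: build the login body in the shell so the password bytes reach the API unchanged.
# json_escape and login_payload are hypothetical helper names.
json_escape() {
  # Escape backslashes first, then double quotes. Control characters are not handled.
  printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

login_payload() {
  # Field names "username" and "password" are assumed, not taken from the API docs.
  printf '{"username":"%s","password":"%s"}' "$(json_escape "$1")" "$(json_escape "$2")"
}

# Usage: send the body on stdin so the password is not passed as a curl argument.
#   login_payload "$ADMIN_USER" "$ADMIN_PASS" |
#     curl -sS -X POST "$BASE_URL/api/v1/auth/login" \
#       -H "Content-Type: application/json" --data-binary @-
```

Using the same helper for both the setup and login calls guarantees that the two endpoints receive byte-identical passwords.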


2026-04-01 — Grimmory OIDC config is stored in the database, not env vars

What happened: After deploying Grimmory with OIDC_ENABLED=true, OIDC_CLIENT_ID, OIDC_CLIENT_SECRET, and OIDC_ISSUER_URI in docker-compose environment variables, the login page still showed a username/password form with no OIDC option. The public settings endpoint returned "oidcEnabled": false.

Impact: OIDC login was completely non-functional despite env vars being correctly set.

Root cause: Grimmory (formerly BookLore) is a Spring Boot application, but the OIDC env vars are not Spring Boot properties. Grimmory stores all OIDC configuration in its database (app_settings table) via AppSettingService.java. The OIDC_ENABLED env var in docker-compose is simply ignored — the app reads AppSettingKey.OIDC_ENABLED from the database with a default of "false".

This is unusual because most self-hosted apps configure OIDC via environment variables. The docker-compose template even included the env vars, which made the misconfiguration harder to diagnose.

Fix: Removed the useless OIDC env vars from docker-compose. Added a post-deploy API call to configure OIDC via PUT /api/v1/settings:

curl -X PUT "http://localhost:6060/api/v1/settings" \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '[
    {"name": "OIDC_ENABLED", "value": "true"},
    {"name": "OIDC_PROVIDER_DETAILS", "value": {
      "providerName": "PocketID",
      "clientId": "...",
      "clientSecret": "...",
      "issuerUri": "https://auth.example.com",
      "scopes": "openid profile email",
      "claimMapping": {"username": "preferred_username", "name": "name", "email": "email", "groups": "groups"}
    }},
    {"name": "OIDC_FORCE_ONLY_MODE", "value": "true"}
  ]'

Rule going forward: When adding OIDC to a new service, verify the configuration mechanism by checking the source code. Don't assume env vars work just because the docker-compose template includes them. Check: does the app read from env vars, config files, or the database?


2026-04-01 — RomM OIDC metadata URL fallback points to itself

What happened: After enabling OIDC on RomM with PocketID, clicking "Login with PocketID" returned a 500 Internal Server Error: "Expecting value" (JSON parse error).

Impact: OIDC login completely broken.

Root cause: RomM uses authlib for OIDC. When OIDC_SERVER_METADATA_URL is not set, it falls back to constructing the discovery URL from OIDC_SERVER_APPLICATION_URL. We had OIDC_SERVER_APPLICATION_URL=https://romm.example.com (RomM's own URL), so authlib tried to fetch https://romm.example.com/.well-known/openid-configuration — which is RomM itself, not PocketID. RomM returned HTML, authlib tried to parse it as JSON, and crashed.

Fix: Explicitly set OIDC_SERVER_METADATA_URL=https://auth.example.com/.well-known/openid-configuration pointing to PocketID. Also corrected OIDC_SERVER_APPLICATION_URL to point to PocketID per the official docs.

Rule going forward: Always set OIDC_SERVER_METADATA_URL explicitly. Don't rely on fallback URL construction — it's a common source of circular reference bugs.
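Before handing a metadata URL to RomM, it is cheap to confirm that it returns a discovery document and not an HTML page. This is a rough sketch: looks_like_discovery is a hypothetical helper, and it only checks for the issuer and authorization_endpoint keys (both required by OpenID Connect Discovery) with a text match, not a JSON parse.

```shell
# Sketch: succeed only if stdin looks like an OIDC discovery document.
# looks_like_discovery is a hypothetical helper name.
looks_like_discovery() {
  body=$(cat)
  printf '%s' "$body" | grep -q '"issuer"' &&
    printf '%s' "$body" | grep -q '"authorization_endpoint"'
}

# Usage: check the URL that will go into OIDC_SERVER_METADATA_URL.
#   curl -fsS "https://auth.example.com/.well-known/openid-configuration" |
#     looks_like_discovery || echo "not a discovery document" >&2
```

Pointed at RomM's own URL, this check would have failed immediately, since the response body was HTML.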


2026-04-01 — RomM OIDC callback URL mismatch

What happened: After fixing the metadata URL, PocketID login succeeded but the redirect back to RomM returned "Not Found."

Impact: OIDC flow completes authentication but fails on the callback.

Root cause: The PocketID client was configured with callback URL /api/oauth/openid-callback, but RomM's actual OIDC callback endpoint is /api/oauth/openid (defined in endpoints/auth.py). The -callback suffix doesn't exist.

Fix: Updated the callback URL in PocketID to https://romm.example.com/api/oauth/openid.

Rule going forward: Always verify the exact callback URL from the application's source code or documentation. Don't guess the path — common variations like /callback, /openid-callback, /oauth2/callback are all different.


2026-04-01 — PocketID "Email is not verified" blocks OIDC login

What happened: After fixing the callback URL, RomM's OIDC login returned: {"detail": "Email is not verified."}.

Impact: Users can authenticate with PocketID but are rejected by RomM.

Root cause: PocketID advertises email_verified in the claims_supported list of its OpenID Connect discovery document. RomM's auth handler (base_handler.py) checks this claim: if the provider says it supports email verification, RomM requires email_verified: true in the token. Our PocketID user had emailVerified: false.

Fix: Updated the PocketID user via API:

curl -X PUT "https://auth.example.com/api/users/<user-id>" \
  -H "X-API-Key: <api-key>" \
  -H "Content-Type: application/json" \
  -d '{"emailVerified": true}'

Rule going forward: When setting up PocketID OIDC for a new service, ensure all users have emailVerified: true. Some OIDC consumers enforce this when the provider advertises the claim.
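Whether a provider advertises the claim can be checked from its discovery document before connecting a new service. This is a heuristic sketch: advertises_email_verified is a hypothetical helper, and it extracts the claims_supported array with sed instead of parsing the JSON properly.

```shell
# Sketch: succeed if the discovery document on stdin lists email_verified
# inside its claims_supported array. advertises_email_verified is a hypothetical helper.
advertises_email_verified() {
  tr -d '\n' |
    sed -n 's/.*"claims_supported"[[:space:]]*:[[:space:]]*\[\([^]]*\)\].*/\1/p' |
    grep -q '"email_verified"'
}

# Usage:
#   curl -fsS "https://auth.example.com/.well-known/openid-configuration" |
#     advertises_email_verified && echo "provider advertises email_verified; verify user emails first"
```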


2026-04-01 — Grimmory OIDC callback URL uses frontend route, not API endpoint

What happened: After configuring Grimmory's OIDC via the settings API, clicking the PocketID login button returned "Invalid callback URL" from PocketID.

Impact: OIDC authentication flow couldn't start.

Root cause: The PocketID client had callback URL /login/oauth2/code/pocketid (a standard Spring Security path). But Grimmory uses a custom SPA-based OIDC flow — the actual callback is a frontend route at /oauth2-callback, not a backend API endpoint. The frontend receives the authorization code and posts it to the backend at POST /api/v1/auth/oidc/callback.

Fix: Updated the PocketID client callback URL to https://library.example.com/oauth2-callback.

Rule going forward: Don't assume Spring Boot apps use the standard Spring Security OAuth2 callback path (/login/oauth2/code/{registrationId}). Check the application's source code — SPAs often use custom frontend routes for OAuth callbacks.


2026-04-01 — force_clean must reset database for DB-backed services

What happened: Running the Grimmory playbook with force_clean=true wiped the filesystem data directory but the application came back with existing state — setup was skipped because the admin user still existed.

Impact: force_clean didn't actually clean; the app wasn't in first-run state.

Root cause: The force_clean logic only deleted /unohana/grimmory/data (filesystem). But Grimmory stores users, settings, and library config in MariaDB, not on the filesystem. The database still had the admin user from the previous deployment.

Fix: Added a database reset step to force_clean:

docker exec mariadb mariadb -u root -e \
  "DROP DATABASE IF EXISTS grimmory;
   CREATE DATABASE grimmory;
   GRANT ALL PRIVILEGES ON grimmory.* TO 'grimmory'@'%';
   FLUSH PRIVILEGES;"

Rule going forward: When implementing force_clean for a service, identify where all persistent state lives. For DB-backed services, wiping the filesystem isn't enough — the database must be reset too. SQLite-based services (like Shoko) only need the config directory wiped since the DB file lives there.


2026-04-03 — Karakeep AI tagging failures with CPU-only Ollama

What happened: After importing ~1000 bookmarks into Karakeep, AI tagging failed on nearly all of them. Multiple root causes stacked:

  1. Models not installed: llama3.1 and llava weren't in Ollama, so every tagging attempt failed with "model not found". The original ollama pull via pct exec had failed silently because $PATH did not include /usr/local/bin; the exit code was 0 despite the error.
  2. Inference timeouts: Once models were installed, llama3.1 (8B params) on CPU took 30-120s per bookmark. The default INFERENCE_JOB_TIMEOUT_SEC=30 killed most jobs. Setting it to 120 helped but some still timed out.
  3. Single-core inference: Ollama in the LXC only used 1 of 4 allocated cores. Bumping to 8 cores didn't help until we created a model variant with PARAMETER num_thread 8 via a Modelfile, which brought CPU usage to ~750%.
  4. Queue desync: After resetting taggingStatus to pending in the bookmarks DB, no queue jobs existed in the separate queue.db. The inference worker only enqueues on bookmark creation, not on status change. The admin panel's "Regenerate AI Tags for Pending Bookmarks" button is the correct way to re-trigger.

Root cause: Multiple compounding issues: missing models, conservative timeouts for CPU inference, Ollama thread count not matching available cores, and misunderstanding how the internal job queue works.

Fix:

  - Pull models with the full path: /usr/local/bin/ollama pull <model>
  - Set INFERENCE_JOB_TIMEOUT_SEC=120 and INFERENCE_FETCH_TIMEOUT_SEC=300
  - Create model variants with explicit num_thread matching the LXC core count
  - Use the admin panel "Regenerate" buttons to re-enqueue failed/pending jobs
  - Set OLLAMA_KEEP_ALIVE=-1 to avoid cold-start delays between requests

Rules going forward:

  - When using pct exec to run binaries, always use the full path (/usr/local/bin/ollama, not ollama); $PATH is not inherited.
  - For CPU-only Ollama, consider smaller models (gemma3:4b); community consensus is that smaller models produce better normalized tags for bookmark categorization.
  - Always set INFERENCE_JOB_TIMEOUT_SEC >= 120 for CPU inference. Note: there's a hardcoded 5-minute Node.js undici timeout bug (karakeep-app/karakeep#1586) that can't be overridden.
  - Use the admin panel (Settings > Admin > Background Jobs) to manage bulk re-tagging, not direct DB manipulation.
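The thread-pinned model variant can be sketched as follows. modelfile_for is a hypothetical helper and the variant name llama3.1-8t is made up for illustration; FROM and PARAMETER num_thread are standard Modelfile instructions, and the thread count should match the cores allocated to the LXC.

```shell
# Sketch: emit a Modelfile that pins Ollama's thread count for a base model.
# modelfile_for is a hypothetical helper name.
modelfile_for() {
  printf 'FROM %s\nPARAMETER num_thread %s\n' "$1" "$2"
}

# Usage inside the Ollama LXC (variant name is illustrative only):
#   modelfile_for llama3.1 8 > Modelfile
#   /usr/local/bin/ollama create llama3.1-8t -f Modelfile
#   /usr/local/bin/ollama list   # confirm the variant exists; do not trust the exit code alone
```

Listing the models afterwards guards against the silent failure seen with the original pull, where the exit code was 0 even though nothing was installed.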