
RFC: Karakeep — Google Search & Crawler-Blocked Sites

Status: Draft
Date: 2026-05-18

Problem

Karakeep's crawler (Playwright/Chromium + metascraper) fails on Google search results and other bot-detection-protected pages. When the crawl is blocked:

  • No title, description, or favicon is extracted
  • Karakeep falls back to using the URL string as the title
  • The bookmark is effectively useless for browsing/searching later

This affects the iOS sharesheet, the browser extension, the web UI, and API-based imports — anywhere a URL is added. The MCP import path can pre-fill titles, but every other entry point lands titleless.

Scope

| URL class | Today's behavior |
|---|---|
| google.com/search?q=… | Always blocked, title = URL |
| google.com/imgres?… | Sometimes succeeds via alt-text; mostly URL |
| google.com/shopping/… | Often succeeds, has OG tags |
| Cloudflare-gated sites | Often blocked |
| Sites that check navigator.webdriver | Blocked |

Google is the most common case in personal browsing exports and the worst offender.

What's Already In Place

| Mechanism | Coverage | Status |
|---|---|---|
| One-time REST PATCH backfill | 15 existing bookmarks (cars / sequoia / silvia) | Done; see derive_title() logic below |
| Pre-fill title= in create-bookmark calls | MCP-driven imports only | In use, ad-hoc |

Pre-fill is not applied to iOS / extension / web / RSS additions — those go straight to Karakeep's API without title.

derive_title(url) logic

# Parses Google URL → meaningful title
# Examples:
#   q=akai+fmt-93bt&udm=2  → "akai fmt-93bt - Google Images"
#   q=egr+delete            → "egr delete - Google Search"
#   /shopping/product/1?q=…  → "<q> - Google Shopping"
#   /imgres?imgrefurl=URL    → "Google Images (from <host>)"
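
The mapping above can be sketched in JavaScript, the language of the n8n Code node. This is a minimal illustration, not the deployed code: the function name deriveTitle, the hostname check, and the handling of tbm=isch and tbm=shop are assumptions added here; only the four example mappings come from this RFC.

```javascript
// Hypothetical sketch of derive_title(); the deployed logic lives in the n8n workflow.
function deriveTitle(rawUrl) {
  let url;
  try {
    url = new URL(rawUrl);
  } catch {
    return null; // not a parseable URL
  }
  // Assumption: only google.com and its subdomains are handled.
  if (!/(^|\.)google\.com$/.test(url.hostname)) return null;

  // URLSearchParams decodes "+" and percent-escapes for us.
  const q = (url.searchParams.get("q") || "").trim();

  if (url.pathname.startsWith("/imgres")) {
    try {
      const host = new URL(url.searchParams.get("imgrefurl")).hostname;
      return `Google Images (from ${host})`;
    } catch {
      return "Google Images"; // imgrefurl missing or malformed
    }
  }
  if (url.pathname.startsWith("/shopping")) {
    return q ? `${q} - Google Shopping` : null;
  }
  if (url.pathname === "/search" && q) {
    const udm = url.searchParams.get("udm");
    const tbm = url.searchParams.get("tbm");
    // Assumption: tbm=isch and tbm=shop are treated like the udm=2 and /shopping cases.
    if (udm === "2" || tbm === "isch") return `${q} - Google Images`;
    if (tbm === "shop") return `${q} - Google Shopping`;
    return `${q} - Google Search`;
  }
  return null; // unknown Google URL shape: leave the bookmark alone
}
```

Returning null for anything unrecognized keeps the caller simple: a null result means "do not patch".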

Reference implementation: services/n8n/workflows/karakeep-patch-titles.json in chizuru-v2. Runs on LXC 120 (automation / n8n). Mirror the snippet in services/karakeep/reference.md once accepted.

Why Karakeep's Native Tools Can't Solve It (at title-set time)

Investigated and ruled out:

| Approach | Why it doesn't work |
|---|---|
| Karakeep Rules Engine | Actions are limited to addTag, addToList, downloadFullPageArchive, favouriteBookmark, archiveBookmark. No setTitle action. |
| CRAWLER_USER_AGENT env var | Doesn't exist. Chromium ships its default UA; no override hook. |
| AI auto-tagging / summarization | Operates on crawled content, not the URL. Can't rescue an empty crawl. |
| metascraper-google plugin | Doesn't exist upstream (the Reddit equivalent does; see PR #1302). Would require a fork or upstream contribution. |

Karakeep webhooks DO exist and fire on bookmark crawl events (proven by the existing services/n8n/workflows/karakeep-crawled.json workflow, which receives bookmarkCrawled payloads). A webhook-driven patcher is feasible as a future enhancement — see Open Questions.

Existing upstream issues confirming the gap:

  • karakeep #2423: Cloudflare blocking
  • karakeep #1193: manual title at add-time
  • hoarder #745: proxy settings request

Option Matrix

1. Periodic patcher (n8n scheduled workflow)

An n8n workflow on LXC 120 (automation) that scans Karakeep bookmarks every 5 minutes; for any Google search/imgres/shopping URL where content.title starts with http:// or https://, it derives a title from the URL's query parameters and PATCHes the root title field via the Karakeep REST API.

How it works:

  • n8n Schedule trigger every 5 min
  • Paginates /api/v1/bookmarks?limit=100
  • Filters to google.com URLs needing a patch (idempotent: only acts when the manual title is empty and the crawled title is the raw URL)
  • PATCHes /api/v1/bookmarks/{id} with the derived title
  • ~5 s runtime, negligible cost
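
The idempotency filter is the part most worth getting right, so here is a minimal sketch of it as a standalone predicate. The function name needsPatch and the bookmark shape ({ id, title, content: { url, title } }) are assumptions inferred from the fields this RFC names (root title, content.title); they are not copied from the Karakeep API schema.

```javascript
// Assumed bookmark shape: { id, title, content: { url, title } }
function needsPatch(bookmark) {
  const content = bookmark.content || {};
  if (typeof content.url !== "string") return false;

  let host;
  try {
    host = new URL(content.url).hostname;
  } catch {
    return false; // unparseable URL: nothing to derive from
  }
  if (!/(^|\.)google\.com$/.test(host)) return false;

  // Never overwrite a title set by a user or by an earlier run.
  if (bookmark.title) return false;

  // A blocked crawl leaves the raw URL as the crawled title.
  return /^https?:\/\//.test(content.title || "");
}
```

Because a successful PATCH fills the root title, the same bookmark fails the second check on the next run, which is what makes repeated scans safe.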

Why n8n (not systemd on the Karakeep host):

  • Repo convention is "everything is IaC, deploy through Forgejo Actions"; n8n workflows are auto-imported by ansible/playbooks/automation.yml on push to services/n8n/**
  • An existing karakeep-crawled workflow proves the Karakeep/n8n integration works; this stays in the same pattern
  • Logs are visible in the n8n UI; auth is via $env.KARAKEEP_URL / $env.KARAKEEP_API_KEY (already wired into n8n's compose)

Workflow impact: Identical. iOS sharesheet, extension, web add: all behave as today. Title appears within ~5 min of save, on the next scheduled run.

Pros: universal (catches every entry point), no client changes, no Karakeep config change, reversible. Cons: ~5-min worst-case delay before the title appears (one schedule interval). Until then the bookmark shows the raw URL as its title.

Recommended baseline. Solves the "I just want titles" requirement cleanly.

2. Logged-in Google cookies

Inject a real logged-in Google session into the crawler via BROWSER_COOKIE_PATH.

Setup:

  1. Log into Google in a real browser
  2. Export cookies (e.g. "Get cookies.txt LOCALLY" extension) as JSON
  3. Mount into the Karakeep container, set BROWSER_COOKIE_PATH=/data/cookies.json
  4. Restart Karakeep

Workflow impact: Identical. Crawler sends cookies for matching domains (Playwright cookie store is domain-scoped, so no leak to other sites).

Pros: Gets real page content (not just title) for ~80–95% of Google requests. Same logic helps other auth-required sites.

Cons:

  • Cookies expire every few weeks → manual refresh
  • Tied to a real Google account → use a dedicated scrape account, not personal
  • Still subject to Google's TLS fingerprinting and behavioral checks → occasional captchas

3. Browserless + stealth Chromium

Replace Karakeep's in-container Chromium with an external browserless/chromium instance running puppeteer-extra-plugin-stealth (masks navigator.webdriver, plugin signatures, headless UA, etc.).

Setup:

  • New container in Karakeep compose: browserless/chromium with stealth plugin
  • Set BROWSER_WEBSOCKET_URL=ws://browserless:3000/?token=…
  • Set BROWSER_CONNECT_ONDEMAND=true
  • ~500 MB extra RAM, internal-network-only

Workflow impact: Identical from every client's POV.

Pros: Improves all bot-detected sites (Cloudflare, behavioral checks), not just Google. Free, self-hosted, set-and-forget. Cons: Google has the most aggressive bot detection on the web — stealth gets ~40–70% on Google search, not 100%. Still need Option 1 as fallback for the misses.

4. Residential HTTP proxy

CRAWLER_HTTPS_PROXY=http://user:pass@proxy:port — single env var, routes crawl traffic through residential IPs.

Pros: Highest success rate against Google (~90%+). Cons: Ongoing cost ($2–15/GB residential, Bright Data / Oxylabs / IPRoyal). Not justified for a personal homelab.

Rejected unless other options prove insufficient.

5. iOS Shortcut (client-side pre-process)

Custom iOS Shortcut that intercepts the share action: parses Google URL → derives title → POSTs to Karakeep API with title set.

Pros: Instant. No server-side change. Cons: iOS-only. Doesn't cover browser extension, web UI, or RSS. Per-device install.

Useful as a complement to Option 1 for instant feedback on iOS.

6. Upstream metascraper-google plugin

Contribute a plugin to Karakeep modeled on metascraper-reddit (PR #1302). Detects google.com/search URLs, parses q=, returns synthesized title without crawling.
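
As a rough illustration of what such a plugin might look like, the sketch below follows metascraper's rules-bundle convention as I understand it: a factory returning an object of per-field rule arrays, each rule receiving { htmlDom, url }, plus an optional test predicate that scopes the bundle to matching URLs. The name metascraperGoogle and the helper queryOf are hypothetical, and the exact plugin contract should be checked against the metascraper documentation and PR #1302 before relying on it.

```javascript
// Hypothetical rules bundle in metascraper's plugin shape; not an existing package.
// A real plugin would export this factory from its module.
const metascraperGoogle = () => {
  // Returns the q= value for google.com/search URLs, otherwise null.
  const queryOf = (url) => {
    try {
      const u = new URL(url);
      const isGoogle = /(^|\.)google\.com$/.test(u.hostname);
      return isGoogle && u.pathname === "/search" ? u.searchParams.get("q") : null;
    } catch {
      return null;
    }
  };

  const rules = {
    // This rule ignores htmlDom and synthesizes the title from the URL alone.
    title: [
      ({ url }) => {
        const q = queryOf(url);
        return q ? `${q} - Google Search` : undefined;
      },
    ],
  };
  // Assumption: `test` restricts the bundle to URLs it can handle.
  rules.test = ({ url }) => Boolean(queryOf(url));
  return rules;
};
```

Because the rule reads only the URL, it yields a title even when the fetched HTML is a block page.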

Pros: Cleanest long-term fix. Benefits all Karakeep users. No infra to maintain. Cons: Requires PR + review + release cycle. Not actionable in this RFC's timeframe.

Recommended as future work if Option 1 proves robust.

Decision

Accept Option 1 (periodic patcher) as the immediate fix. It's the minimal-surface-area solution that matches the stated requirement ("I just want titles"), works for every Karakeep entry point including iOS, and has zero impact on the existing crawler config.

Defer Options 2 + 3 for a follow-up RFC if/when we want real Google page content (not just titles) ingested. They stack cleanly on top of Option 1 — Option 1 becomes the safety net when stealth/cookies fail.

Reject Option 4 (residential proxy) for cost. Defer Option 5 (iOS Shortcut), which is only valuable if the ~5-min delay becomes annoying. File Option 6 (upstream plugin) as a future contribution.

Implementation Plan (Option 1)

  1. Workflow: services/n8n/workflows/karakeep-patch-titles.json in chizuru-v2.

     • Schedule trigger: every 5 min
     • Single Code node, JavaScript, uses fetch() directly against the Karakeep API
     • Paginates /bookmarks?limit=100
     • Filters to URLs containing google.com where content.title starts with http
     • Derives title from q=, udm=, tbm=, imgrefurl= query params
     • PATCHes /bookmarks/{id} with {"title": derived}
     • Returns { scanned, patched, failed, patches[] } for n8n run history

  2. Deploy: standard chizuru-v2 flow.

     • Commit services/n8n/workflows/karakeep-patch-titles.json → push to Forgejo main
     • .forgejo/workflows/automation.yml triggers on services/n8n/** paths
     • ansible/playbooks/automation.yml copies workflows into the n8n container and runs n8n import:workflow --separate
     • n8n is restarted; the new workflow is published and active

  3. Observability: n8n UI → Executions tab. Each run shows scanned/patched/failed counts. Bad runs surface as failed executions; n8n's existing alerting (if any) catches them. An optional Grafana panel can query n8n's Postgres for execution stats.

  4. Reversibility: Disable the workflow in the n8n UI (or remove the JSON + re-deploy). Existing patched titles remain; Karakeep preserves the manual title field independently of whether the crawler title gets updated later.

  5. Extensibility: When other crawler-blocked domains emerge (e.g. specific Cloudflare-protected sites), add a domain-specific derive_title branch in the same Code node. Same scan, same patch loop.
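
The scan-and-patch loop from the Workflow step could look roughly like the sketch below. It is an illustration under stated assumptions, not the deployed Code node: the endpoint paths and the result shape come from this RFC, while the cursor / nextCursor pagination fields, the bookmarks array key, and Bearer-token auth are assumptions about the Karakeep API. The derive argument stands in for the title-derivation function and is expected to return null for URLs it does not handle; fetchImpl is injectable so the loop can be exercised without a live server.

```javascript
// Sketch only: pagination fields and Bearer auth are assumptions, see lead-in.
async function scanAndPatch({ baseUrl, apiKey, derive, fetchImpl = fetch }) {
  const headers = {
    Authorization: `Bearer ${apiKey}`,
    "Content-Type": "application/json",
  };
  const result = { scanned: 0, patched: 0, failed: 0, patches: [] };
  let cursor = null;

  do {
    const page = new URL(`${baseUrl}/api/v1/bookmarks`);
    page.searchParams.set("limit", "100");
    if (cursor) page.searchParams.set("cursor", cursor);

    const res = await fetchImpl(page.toString(), { headers });
    if (!res.ok) throw new Error(`list failed: HTTP ${res.status}`);
    const body = await res.json();

    for (const b of body.bookmarks || []) {
      result.scanned += 1;
      const crawled = (b.content && b.content.title) || "";
      // Idempotency guard: skip manual titles and successful crawls.
      if (b.title || !/^https?:\/\//.test(crawled)) continue;

      const title = derive(b.content.url);
      if (!title) continue; // URL shape not handled: leave untouched

      const patch = await fetchImpl(`${baseUrl}/api/v1/bookmarks/${b.id}`, {
        method: "PATCH",
        headers,
        body: JSON.stringify({ title }),
      });
      if (patch.ok) {
        result.patched += 1;
        result.patches.push({ id: b.id, title });
      } else {
        result.failed += 1; // counted, but does not abort the scan
      }
    }
    cursor = body.nextCursor || null;
  } while (cursor);

  return result;
}
```

A failed PATCH is counted rather than thrown so that one bad bookmark does not block the rest of the scan; a failed list request still throws, which surfaces as a failed execution in n8n.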

Open Questions

  • Webhook-driven instead of polling. Karakeep webhooks fire on bookmarkCrawled (confirmed by karakeep-crawled workflow). A webhook-driven version of this patcher would reduce latency from ~5 min worst-case to near-zero. Cost: adds an external dependency for what's currently a self-contained polling workflow. Worth doing if the 5-min lag is ever annoying or if we extend to many more domains.
  • Should the patcher also handle non-Google crawler failures? Generalizing the "title starts with http" detector catches any blocked crawl, not just Google. Worth doing once we see real misses in the wild (e.g. Cloudflare-protected sites with no useful URL-derivable title).
  • Cookie injection scope (Option 2 deferred): would Karakeep's BROWSER_COOKIE_PATH accept a multi-domain cookies.json, or does it apply globally? Affects whether scrape account cookies could ever leak.

References