RFC: Karakeep — Google Search & Crawler-Blocked Sites

Status: Draft Date: 2026-05-18

Problem

Karakeep's crawler (Playwright/Chromium + metascraper) fails on Google search results and other bot-detection-protected pages. When the crawl is blocked:

No title, description, or favicon is extracted
Karakeep falls back to using the URL string as the title
The bookmark is effectively useless for browsing/searching later

This affects the iOS sharesheet, the browser extension, the web UI, and API-based imports — anywhere a URL is added. The MCP import path can pre-fill titles, but every other entry point lands titleless.

Scope

URL class	Today's behavior
`google.com/search?q=…`	Always blocked, title = URL
`google.com/imgres?…`	Sometimes succeeds via alt-text; mostly URL
`google.com/shopping/…`	Often succeeds, has OG tags
Cloudflare-gated sites	Often blocked
Sites that check `navigator.webdriver`	Blocked

Google is the most common case in personal browsing exports and the worst offender.

What's Already In Place

Mechanism	Coverage	Status
One-time REST PATCH backfill	15 existing bookmarks (cars / sequoia / silvia)	Done — see `derive_title()` logic below
Pre-fill `title=` in `create-bookmark` calls	MCP-driven imports only	In use, ad-hoc

Pre-fill is not applied to iOS / extension / web / RSS additions — those go straight to Karakeep's API without title.

`derive_title(url)` logic

# Parses Google URL → meaningful title
# Examples:
#   q=akai+fmt-93bt&udm=2  → "akai fmt-93bt - Google Images"
#   q=egr+delete            → "egr delete - Google Search"
#   /shopping/product/1?q=…  → "<q> - Google Shopping"
#   /imgres?imgrefurl=URL    → "Google Images (from <host>)"

Reference implementation: services/n8n/workflows/karakeep-patch-titles.json in chizuru-v2. Runs on LXC 120 (automation / n8n). Mirror the snippet in services/karakeep/reference.md once accepted.

Why Karakeep's Native Tools Can't Solve It (at title-set time)

Investigated and ruled out:

Approach	Why it doesn't work
Karakeep Rules Engine	Actions are limited to `addTag`, `addToList`, `downloadFullPageArchive`, `favouriteBookmark`, `archiveBookmark`. No `setTitle` action.
`CRAWLER_USER_AGENT` env var	Doesn't exist. Chromium ships its default UA; no override hook.
AI auto-tagging / summarization	Operates on crawled content, not the URL. Can't rescue an empty crawl.
`metascraper-google` plugin	Doesn't exist upstream (the Reddit equivalent does — PR #1302). Would require a fork or upstream contribution.

Karakeep webhooks DO exist and fire on bookmark crawl events (proven by the existing services/n8n/workflows/karakeep-crawled.json workflow, which receives bookmarkCrawled payloads). A webhook-driven patcher is feasible as a future enhancement — see Open Questions.

Existing upstream issues confirming the gap: - karakeep #2423 — Cloudflare blocking - karakeep #1193 — manual title at add-time - hoarder #745 — proxy settings request

Option Matrix

1. Periodic patcher (n8n scheduled workflow)

An n8n workflow on LXC 120 (automation) that scans Karakeep bookmarks every 5 minutes; for any Google search/imgres/shopping URL where content.title starts with http:// or https://, it derives a title from the URL's query parameters and PATCHes the root title field via the Karakeep REST API.

How it works: - n8n Schedule trigger every 5 min - Paginates /api/v1/bookmarks?limit=100 - Filters to google.com URLs needing patch (idempotent: only acts when manual title empty + crawled title is the raw URL) - PATCHes /api/v1/bookmarks/{id} with derived title - ~5 s runtime, negligible cost

Why n8n (not systemd on the Karakeep host): - Repo convention is "everything is IaC, deploy through Forgejo Actions" — n8n workflows are auto-imported by ansible/playbooks/automation.yml on push to services/n8n/** - An existing karakeep-crawled workflow proves Karakeep n8n integration works; this stays in the same pattern - Logs visible in n8n UI; auth via $env.KARAKEEP_URL / $env.KARAKEEP_API_KEY (already wired into n8n's compose)

Workflow impact: Identical. iOS sharesheet, extension, web add — all behave as today. Title appears within ~2 min of save.

Pros: universal (catches every entry point), no client changes, no Karakeep config change, reversible. Cons: 2-min worst-case delay before title appears. Bookmark briefly looks empty.

Recommended baseline. Solves the "I just want titles" requirement cleanly.

2. Logged-in Google cookies

Inject a real logged-in Google session into the crawler via BROWSER_COOKIE_PATH.

Setup: 1. Log into Google in a real browser 2. Export cookies (e.g. "Get cookies.txt LOCALLY" extension) as JSON 3. Mount into Karakeep container, set BROWSER_COOKIE_PATH=/data/cookies.json 4. Restart Karakeep

Workflow impact: Identical. Crawler sends cookies for matching domains (Playwright cookie store is domain-scoped, so no leak to other sites).

Pros: Gets real page content (not just title) for ~80–95% of Google requests. Same logic helps other auth-required sites. Cons: - Cookies expire every few weeks → manual refresh - Tied to a real Google account → use a dedicated scrape account, not personal - Still subject to Google's TLS fingerprinting and behavioral checks → occasional captchas

3. Browserless + stealth Chromium

Replace Karakeep's in-container Chromium with an external browserless/chromium instance running puppeteer-extra-plugin-stealth (masks navigator.webdriver, plugin signatures, headless UA, etc.).

Setup: - New container in Karakeep compose: browserless/chromium with stealth plugin - Set BROWSER_WEBSOCKET_URL=ws://browserless:3000/?token=… - Set BROWSER_CONNECT_ONDEMAND=true - ~500 MB extra RAM, internal-network-only

Workflow impact: Identical from every client's POV.

Pros: Improves all bot-detected sites (Cloudflare, behavioral checks), not just Google. Free, self-hosted, set-and-forget. Cons: Google has the most aggressive bot detection on the web — stealth gets ~40–70% on Google search, not 100%. Still need Option 1 as fallback for the misses.

4. Residential HTTP proxy

CRAWLER_HTTPS_PROXY=http://user:pass@proxy:port — single env var, routes crawl traffic through residential IPs.

Pros: Highest success rate against Google (~90%+). Cons: Ongoing cost ($2–15/GB residential, Bright Data / Oxylabs / IPRoyal). Not justified for a personal homelab.

Rejected unless other options prove insufficient.

5. iOS Shortcut (client-side pre-process)

Custom iOS Shortcut that intercepts the share action: parses Google URL → derives title → POSTs to Karakeep API with title set.

Pros: Instant. No server-side change. Cons: iOS-only. Doesn't cover browser extension, web UI, or RSS. Per-device install.

Useful as a complement to Option 1 for instant feedback on iOS.

6. Upstream `metascraper-google` plugin

Contribute a plugin to Karakeep modeled on metascraper-reddit (PR #1302). Detects google.com/search URLs, parses q=, returns synthesized title without crawling.

Pros: Cleanest long-term fix. Benefits all Karakeep users. No infra to maintain. Cons: Requires PR + review + release cycle. Not actionable in this RFC's timeframe.

Recommended as future work if Option 1 proves robust.

Decision

Accept Option 1 (periodic patcher) as the immediate fix. It's the minimal-surface-area solution that matches the stated requirement ("I just want titles"), works for every Karakeep entry point including iOS, and has zero impact on the existing crawler config.

Defer Options 2 + 3 for a follow-up RFC if/when we want real Google page content (not just titles) ingested. They stack cleanly on top of Option 1 — Option 1 becomes the safety net when stealth/cookies fail.

Reject Option 4 (residential proxy) for cost. Defer Option 5 (iOS Shortcut) — only valuable if 2-min delay becomes annoying. File Option 6 (upstream plugin) as future contribution.

Implementation Plan (Option 1)

Workflow: services/n8n/workflows/karakeep-patch-titles.json in chizuru-v2.
Schedule trigger: every 5 min
Single Code node, JavaScript, uses fetch() directly against the Karakeep API
Paginates /bookmarks?limit=100
Filters to URLs containing google.com where content.title starts with http
Derives title from q=, udm=, tbm=, imgrefurl= query params
PATCHes /bookmarks/{id} with {"title": derived}
Returns { scanned, patched, failed, patches[] } for n8n run history
Deploy: standard chizuru-v2 flow.
Commit services/n8n/workflows/karakeep-patch-titles.json → push to Forgejo main
.forgejo/workflows/automation.yml triggers on services/n8n/** paths
ansible/playbooks/automation.yml copies workflows into n8n container and runs n8n import:workflow --separate
n8n is restarted; the new workflow is published and active
Observability: n8n UI → Executions tab. Each run shows scanned/patched/failed counts. Bad runs surface as failed executions; n8n's existing alerting (if any) catches them. Optional Grafana panel can query n8n's Postgres for execution stats.
Reversibility: Disable the workflow in n8n UI (or remove the JSON + re-deploy). Existing patched titles remain — Karakeep preserves the manual title field independently of whether the crawler title gets updated later.
Extensibility: When other crawler-blocked domains emerge (e.g. specific Cloudflare-protected sites), add a domain-specific derive_title branch in the same Code node. Same scan, same patch loop.

Open Questions

Webhook-driven instead of polling. Karakeep webhooks fire on bookmarkCrawled (confirmed by karakeep-crawled workflow). A webhook-driven version of this patcher would reduce latency from ~5 min worst-case to near-zero. Cost: adds an external dependency for what's currently a self-contained polling workflow. Worth doing if the 5-min lag is ever annoying or if we extend to many more domains.
Should the patcher also handle non-Google crawler failures? Generalizing the "title starts with http" detector catches any blocked crawl, not just Google. Worth doing once we see real misses in the wild (e.g. Cloudflare-protected sites with no useful URL-derivable title).
Cookie injection scope (Option 2 deferred): would Karakeep's BROWSER_COOKIE_PATH accept a multi-domain cookies.json, or does it apply globally? Affects whether scrape account cookies could ever leak.

References

Karakeep config: https://docs.karakeep.app/configuration/environment-variables/
Browserless integration discussion: https://github.com/karakeep-app/karakeep/discussions/869
Reddit metascraper plugin (model for Google plugin): https://github.com/karakeep-app/karakeep/pull/1302
Existing related RFC: Bookmark & Archival Stack
Karakeep service docs: Setup, API reference