RFC: Karakeep — Google Search & Crawler-Blocked Sites
Status: Draft Date: 2026-05-18
Problem
Karakeep's crawler (Playwright/Chromium + metascraper) fails on Google search results and other bot-detection-protected pages. When the crawl is blocked:
- No title, description, or favicon is extracted
- Karakeep falls back to using the URL string as the title
- The bookmark is effectively useless for browsing/searching later
This affects the iOS sharesheet, the browser extension, the web UI, and API-based imports — anywhere a URL is added. The MCP import path can pre-fill titles, but every other entry point lands titleless.
Scope
| URL class | Today's behavior |
|---|---|
google.com/search?q=… |
Always blocked, title = URL |
google.com/imgres?… |
Sometimes succeeds via alt-text; mostly URL |
google.com/shopping/… |
Often succeeds, has OG tags |
| Cloudflare-gated sites | Often blocked |
Sites that check navigator.webdriver |
Blocked |
Google is the most common case in personal browsing exports and the worst offender.
What's Already In Place
| Mechanism | Coverage | Status |
|---|---|---|
| One-time REST PATCH backfill | 15 existing bookmarks (cars / sequoia / silvia) | Done — see derive_title() logic below |
Pre-fill title= in create-bookmark calls |
MCP-driven imports only | In use, ad-hoc |
Pre-fill is not applied to iOS / extension / web / RSS additions — those go straight to Karakeep's API without title.
derive_title(url) logic
# Parses Google URL → meaningful title
# Examples:
# q=akai+fmt-93bt&udm=2 → "akai fmt-93bt - Google Images"
# q=egr+delete → "egr delete - Google Search"
# /shopping/product/1?q=… → "<q> - Google Shopping"
# /imgres?imgrefurl=URL → "Google Images (from <host>)"
Reference implementation: services/n8n/workflows/karakeep-patch-titles.json in chizuru-v2. Runs on LXC 120 (automation / n8n). Mirror the snippet in services/karakeep/reference.md once accepted.
Why Karakeep's Native Tools Can't Solve It (at title-set time)
Investigated and ruled out:
| Approach | Why it doesn't work |
|---|---|
| Karakeep Rules Engine | Actions are limited to addTag, addToList, downloadFullPageArchive, favouriteBookmark, archiveBookmark. No setTitle action. |
CRAWLER_USER_AGENT env var |
Doesn't exist. Chromium ships its default UA; no override hook. |
| AI auto-tagging / summarization | Operates on crawled content, not the URL. Can't rescue an empty crawl. |
metascraper-google plugin |
Doesn't exist upstream (the Reddit equivalent does — PR #1302). Would require a fork or upstream contribution. |
Karakeep webhooks DO exist and fire on bookmark crawl events (proven by the existing services/n8n/workflows/karakeep-crawled.json workflow, which receives bookmarkCrawled payloads). A webhook-driven patcher is feasible as a future enhancement — see Open Questions.
Existing upstream issues confirming the gap: - karakeep #2423 — Cloudflare blocking - karakeep #1193 — manual title at add-time - hoarder #745 — proxy settings request
Option Matrix
1. Periodic patcher (n8n scheduled workflow)
An n8n workflow on LXC 120 (automation) that scans Karakeep bookmarks every 5 minutes; for any Google search/imgres/shopping URL where content.title starts with http:// or https://, it derives a title from the URL's query parameters and PATCHes the root title field via the Karakeep REST API.
How it works:
- n8n Schedule trigger every 5 min
- Paginates /api/v1/bookmarks?limit=100
- Filters to google.com URLs needing patch (idempotent: only acts when manual title empty + crawled title is the raw URL)
- PATCHes /api/v1/bookmarks/{id} with derived title
- ~5 s runtime, negligible cost
Why n8n (not systemd on the Karakeep host):
- Repo convention is "everything is IaC, deploy through Forgejo Actions" — n8n workflows are auto-imported by ansible/playbooks/automation.yml on push to services/n8n/**
- An existing karakeep-crawled workflow proves Karakeep n8n integration works; this stays in the same pattern
- Logs visible in n8n UI; auth via $env.KARAKEEP_URL / $env.KARAKEEP_API_KEY (already wired into n8n's compose)
Workflow impact: Identical. iOS sharesheet, extension, web add — all behave as today. Title appears within ~2 min of save.
Pros: universal (catches every entry point), no client changes, no Karakeep config change, reversible. Cons: 2-min worst-case delay before title appears. Bookmark briefly looks empty.
Recommended baseline. Solves the "I just want titles" requirement cleanly.
2. Logged-in Google cookies
Inject a real logged-in Google session into the crawler via BROWSER_COOKIE_PATH.
Setup:
1. Log into Google in a real browser
2. Export cookies (e.g. "Get cookies.txt LOCALLY" extension) as JSON
3. Mount into Karakeep container, set BROWSER_COOKIE_PATH=/data/cookies.json
4. Restart Karakeep
Workflow impact: Identical. Crawler sends cookies for matching domains (Playwright cookie store is domain-scoped, so no leak to other sites).
Pros: Gets real page content (not just title) for ~80–95% of Google requests. Same logic helps other auth-required sites. Cons: - Cookies expire every few weeks → manual refresh - Tied to a real Google account → use a dedicated scrape account, not personal - Still subject to Google's TLS fingerprinting and behavioral checks → occasional captchas
3. Browserless + stealth Chromium
Replace Karakeep's in-container Chromium with an external browserless/chromium instance running puppeteer-extra-plugin-stealth (masks navigator.webdriver, plugin signatures, headless UA, etc.).
Setup:
- New container in Karakeep compose: browserless/chromium with stealth plugin
- Set BROWSER_WEBSOCKET_URL=ws://browserless:3000/?token=…
- Set BROWSER_CONNECT_ONDEMAND=true
- ~500 MB extra RAM, internal-network-only
Workflow impact: Identical from every client's POV.
Pros: Improves all bot-detected sites (Cloudflare, behavioral checks), not just Google. Free, self-hosted, set-and-forget. Cons: Google has the most aggressive bot detection on the web — stealth gets ~40–70% on Google search, not 100%. Still need Option 1 as fallback for the misses.
4. Residential HTTP proxy
CRAWLER_HTTPS_PROXY=http://user:pass@proxy:port — single env var, routes crawl traffic through residential IPs.
Pros: Highest success rate against Google (~90%+). Cons: Ongoing cost ($2–15/GB residential, Bright Data / Oxylabs / IPRoyal). Not justified for a personal homelab.
Rejected unless other options prove insufficient.
5. iOS Shortcut (client-side pre-process)
Custom iOS Shortcut that intercepts the share action: parses Google URL → derives title → POSTs to Karakeep API with title set.
Pros: Instant. No server-side change. Cons: iOS-only. Doesn't cover browser extension, web UI, or RSS. Per-device install.
Useful as a complement to Option 1 for instant feedback on iOS.
6. Upstream metascraper-google plugin
Contribute a plugin to Karakeep modeled on metascraper-reddit (PR #1302). Detects google.com/search URLs, parses q=, returns synthesized title without crawling.
Pros: Cleanest long-term fix. Benefits all Karakeep users. No infra to maintain. Cons: Requires PR + review + release cycle. Not actionable in this RFC's timeframe.
Recommended as future work if Option 1 proves robust.
Decision
Accept Option 1 (periodic patcher) as the immediate fix. It's the minimal-surface-area solution that matches the stated requirement ("I just want titles"), works for every Karakeep entry point including iOS, and has zero impact on the existing crawler config.
Defer Options 2 + 3 for a follow-up RFC if/when we want real Google page content (not just titles) ingested. They stack cleanly on top of Option 1 — Option 1 becomes the safety net when stealth/cookies fail.
Reject Option 4 (residential proxy) for cost. Defer Option 5 (iOS Shortcut) — only valuable if 2-min delay becomes annoying. File Option 6 (upstream plugin) as future contribution.
Implementation Plan (Option 1)
- Workflow:
services/n8n/workflows/karakeep-patch-titles.jsonin chizuru-v2. - Schedule trigger: every 5 min
- Single Code node, JavaScript, uses
fetch()directly against the Karakeep API - Paginates
/bookmarks?limit=100 - Filters to URLs containing
google.comwherecontent.titlestarts withhttp - Derives title from
q=,udm=,tbm=,imgrefurl=query params - PATCHes
/bookmarks/{id}with{"title": derived} -
Returns
{ scanned, patched, failed, patches[] }for n8n run history -
Deploy: standard chizuru-v2 flow.
- Commit
services/n8n/workflows/karakeep-patch-titles.json→ push to Forgejomain .forgejo/workflows/automation.ymltriggers onservices/n8n/**pathsansible/playbooks/automation.ymlcopies workflows into n8n container and runsn8n import:workflow --separate-
n8n is restarted; the new workflow is published and active
-
Observability: n8n UI → Executions tab. Each run shows scanned/patched/failed counts. Bad runs surface as failed executions; n8n's existing alerting (if any) catches them. Optional Grafana panel can query n8n's Postgres for execution stats.
-
Reversibility: Disable the workflow in n8n UI (or remove the JSON + re-deploy). Existing patched titles remain — Karakeep preserves the manual
titlefield independently of whether the crawler title gets updated later. -
Extensibility: When other crawler-blocked domains emerge (e.g. specific Cloudflare-protected sites), add a domain-specific
derive_titlebranch in the same Code node. Same scan, same patch loop.
Open Questions
- Webhook-driven instead of polling. Karakeep webhooks fire on
bookmarkCrawled(confirmed bykarakeep-crawledworkflow). A webhook-driven version of this patcher would reduce latency from ~5 min worst-case to near-zero. Cost: adds an external dependency for what's currently a self-contained polling workflow. Worth doing if the 5-min lag is ever annoying or if we extend to many more domains. - Should the patcher also handle non-Google crawler failures? Generalizing the "title starts with http" detector catches any blocked crawl, not just Google. Worth doing once we see real misses in the wild (e.g. Cloudflare-protected sites with no useful URL-derivable title).
- Cookie injection scope (Option 2 deferred): would Karakeep's
BROWSER_COOKIE_PATHaccept a multi-domain cookies.json, or does it apply globally? Affects whether scrape account cookies could ever leak.
References
- Karakeep config: https://docs.karakeep.app/configuration/environment-variables/
- Browserless integration discussion: https://github.com/karakeep-app/karakeep/discussions/869
- Reddit metascraper plugin (model for Google plugin): https://github.com/karakeep-app/karakeep/pull/1302
- Existing related RFC: Bookmark & Archival Stack
- Karakeep service docs: Setup, API reference