ADR-010 — Archival Proxy Stack (Ladder + Byparr + Browserless)

Date: 2026-04-20
Status: Accepted

Context

The bookmark archival pipeline (n8n + Karakeep + ArchiveBox) needs to de-paywall articles before storing them. The initial approach used browserless/chromium as a sidecar on the n8n LXC with Googlebot UA spoofing + archive.today fallback.

Problem: Reuters (and other major news sites) use DataDome bot protection, which rejects every non-browser HTTP request regardless of User-Agent, Referer, X-Forwarded-For, or Accept headers. Each request returns HTTP 401 with a captcha-delivery.com JS challenge page. That challenge page was being stored in Karakeep as the "archived" content, and no ntfy notification fired because content verification did not recognize it as a failure.

Research findings (2026-04-20):

Approach	Result
Googlebot UA	401 — DataDome blocks
Google News bot UA	401 — DataDome blocks
Normal Chrome UA + Google referer	401 — DataDome blocks
Google Cache	Returns JS challenge (requires browser)
Google AMP CDN	Redirects to google.com/url (no content)
Reuters AMP endpoint	401 — DataDome blocks
Reuters internal API	401 — DataDome blocks
Accept: application/json	401 — DataDome blocks
Wayback Machine	404 (not archived)
Corsfix proxy (periscope.corsfix.com)	Works — uses server-side headless Chrome

Key insight: DataDome requires actual JavaScript execution plus browser fingerprinting. No header trick bypasses it; the only approach that worked in testing was a headless browser that executes the challenge JS.

Decision

Deploy a dedicated archival proxy stack on its own LXC, separate from n8n:

Architecture

LXC 129 (archival) — 192.168.1.129
├── Ladder (port 8080)
│   ├── Go HTTP proxy with YAML rulesets
│   ├── Per-domain header tricks (Googlebot UA, cookies, referer)
│   └── HTML injection (ad/overlay removal)
├── Byparr (port 3001 → 8191)
│   ├── Pre-built image: ghcr.io/thephaseless/byparr
│   ├── Uses Camoufox (C++ engine-level anti-detection)
│   └── POST /v1 { cmd: "request.get", url } → returns JSON with rendered HTML
└── browserless/chromium (port 3000)
    ├── Direct headless Chrome API (Googlebot UA fallback)
    └── Kept as n8n fallback for custom JS interaction

LXC 120 (automation) — 192.168.1.120
└── n8n references archival via IP
    ├── LADDER_URL=http://192.168.1.129:8080
    ├── BYPARR_URL=http://192.168.1.129:3001
    ├── BROWSERLESS_URL=http://192.168.1.129:3000
    └── WAYBACK_ACCESS_KEY/SECRET_KEY (from Vault)

Why separate LXC

Isolation: Byparr and browserless are memory-hungry (2G each, see Resources). n8n should not crash because a headless browser was OOM-killed.
Independent scaling: Can bump archival LXC memory without affecting n8n.
Extensibility: n8n will grow with more automations; archival services are a distinct concern.
Debuggability: Can restart/redeploy archival stack without touching n8n.

Why Ladder over direct browserless

Simpler for n8n: Single GET /raw/{url} call vs. constructing browserless JSON payloads.
Ruleset-driven: New sites can be added by editing a YAML file, no workflow changes needed.
Efficient: Header-based bypass (no browser needed) is expected to cover roughly 80% of paywalled sites.
Community rulesets: Can pull from everywall/ladder-rules for updates.
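
As a rough illustration of the single-call interface, the Python sketch below builds the Ladder request URL. The helper name ladder_raw_url is hypothetical, and appending the target URL unencoded after /raw/ is an assumption based on the GET /raw/{url} form used in this ADR.

```python
from urllib.parse import urlsplit


def ladder_raw_url(ladder_base: str, target_url: str) -> str:
    """Build the Ladder /raw/ request URL for an absolute article URL."""
    parts = urlsplit(target_url)
    if parts.scheme not in ("http", "https") or not parts.netloc:
        raise ValueError(f"not an absolute http(s) URL: {target_url!r}")
    # Assumption: the target URL is appended as-is, without extra encoding.
    return f"{ladder_base.rstrip('/')}/raw/{target_url}"


# LADDER_URL value taken from the architecture section above.
print(ladder_raw_url("http://192.168.1.129:8080", "https://www.wired.com/story/example/"))
```

The caller then issues a plain GET against the returned URL; which ruleset applies is decided inside Ladder, not in the workflow.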

Why Byparr over FlareSolverr / custom playwright-proxy

FlareSolverr cannot solve DataDome. It uses undetected-chromedriver, which only handles Cloudflare challenges. Against DataDome, FlareSolverr logs "Challenge not detected!" and hands back the page as a normal result, but the body is a 1.5KB DataDome script, not article content.
Custom playwright-proxy was fragile. It required a custom Dockerfile with system library management (libXfixes, etc.) and broke on every rebuild due to missing X11 dependencies. It also does not match the homelab pattern of using pre-built registry images.
Byparr uses Camoufox — a Firefox fork with C++ engine-level anti-detection. Fingerprint spoofing happens in the browser binary itself, not via JS patches that DataDome can detect.
Pre-built image: ghcr.io/thephaseless/byparr — no custom build steps, just docker compose up.
FlareSolverr-compatible API: POST /v1 with { cmd: "request.get", url } → JSON response with rendered HTML.
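
A minimal Python sketch of that call follows. The request body comes from this ADR; the response shape (status plus solution.response holding the rendered HTML) is an assumption carried over from FlareSolverr's reply format, and the function names are hypothetical.

```python
import json
import urllib.request


def build_byparr_request(byparr_base: str, target_url: str) -> urllib.request.Request:
    """Prepare POST /v1 with the FlareSolverr-style request.get command."""
    payload = {"cmd": "request.get", "url": target_url}
    return urllib.request.Request(
        f"{byparr_base.rstrip('/')}/v1",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_html(response_body: bytes) -> str:
    """Pull the rendered HTML out of the reply (assumed FlareSolverr shape)."""
    data = json.loads(response_body)
    if data.get("status") != "ok":
        raise RuntimeError(f"byparr error: {data.get('message', 'unknown')}")
    return data["solution"]["response"]


def fetch_via_byparr(byparr_base: str, target_url: str, timeout: float = 120.0) -> str:
    """Send the request and return the rendered HTML; raises on any failure."""
    request = build_byparr_request(byparr_base, target_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return extract_html(response.read())
```

The returned HTML should still go through content validation, since a stealth browser can also come back with a challenge page.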

Why keep browserless alongside Byparr

browserless remains as a Googlebot UA fallback for sites that respond to bot user-agents.
Lower overhead than Byparr for simple headless Chrome tasks.

Three-workflow architecture (2026-04-22 redesign)

Replaced the monolithic archive-media workflow with three focused workflows. Each has a dedicated iOS Shortcut; the third, depaywall, is a mode of the save-to-read webhook rather than a separate endpoint.

Workflow overview

Workflow	Shortcut	Webhook path	Destination	Purpose
`archive-media`	"Archive"	`/archive-media`	ArchiveBox / Karakeep	Archival routing
`save-to-read`	"Read Later"	`/save-to-read`	Readeck	Read-later with depaywall
`depaywall`	"Depaywall"	`/save-to-read` (mode=depaywall)	Readeck	Retry next strategy

Workflow 1: `archive-media` (simplified)

Trivial URL router — no depaywall, no content handlers, no profile learning.

Incoming URL (iOS Shortcut "Archive")
     │
     ▼
  Auth & Route
  ├── Validate auth (Bearer token)
  ├── Clean URL (strip tracking params, unwrap Google redirects)
  └── Route: reddit / github / karakeep
     │
     ▼
  Route by Handler (Switch)
     │
     ├── "reddit"
     │    ├── Clean Reddit URL → Fetch .json → Parse
     │    ├── ArchiveBox /api/v1/cli/add
     │    └── Tags: reddit, r/{subreddit}, post|comment
     │
     ├── "github"
     │    ├── ArchiveBox /api/v1/cli/add
     │    └── Tags: github
     │
     └── "karakeep" (everything else)
          └── Karakeep POST /api/v1/bookmarks (plain URL)

IaC: services/n8n/workflows/archive-media.json
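
The clean-and-route step could look roughly like the Python sketch below. It is not the workflow's actual code: the tracking parameter list, the handling of google.com/url redirects, and the host matching (including redd.it) are assumptions for illustration.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed tracking parameters; the real list in the workflow may differ.
TRACKING_PREFIXES = ("utm_",)
TRACKING_PARAMS = {"fbclid", "gclid"}


def unwrap_google_redirect(url: str) -> str:
    """Return the real target of a google.com/url?q=... redirect link."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if host in ("google.com", "www.google.com") and parts.path == "/url":
        query = dict(parse_qsl(parts.query))
        return query.get("q") or query.get("url") or url
    return url


def clean_url(url: str) -> str:
    """Unwrap redirects, then drop tracking query parameters."""
    parts = urlsplit(unwrap_google_redirect(url))
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if not key.lower().startswith(TRACKING_PREFIXES)
        and key.lower() not in TRACKING_PARAMS
    ]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), parts.fragment))


def route(url: str) -> str:
    """Pick the handler name used by the Switch node."""
    host = (urlsplit(url).hostname or "").lower()

    def is_site(domain: str) -> bool:
        return host == domain or host.endswith("." + domain)

    if is_site("reddit.com") or is_site("redd.it"):
        return "reddit"
    if is_site("github.com"):
        return "github"
    return "karakeep"
```

Matching on the parsed hostname rather than on a substring of the URL keeps a link such as notgithub.com from being routed to the github handler.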

Workflow 2: `save-to-read` (normal mode)

Saves articles to Readeck. Known paywalled domains get the depaywall chain with content handlers. Unknown domains get plain URL bookmarks (Readeck fetches them).

Incoming URL (iOS Shortcut "Read Later")
     │
     ▼
  Auth & Route
  ├── Validate auth (Bearer token)
  ├── Clean URL
  ├── Match domain against knownPaywalled map
  └── Also check staticData.learnedDomains
     │
     ▼
  Process & Upload
     │
     ├── Has strategies (known paywalled domain)
     │    │
     │    ▼
     │   For each strategy in chain:
     │    ├── Execute: ladder/byparr/wayback/browserless
     │    ├── isContentValid(html) — pre-upload check
     │    ├── mergeMultiPage(html) — Ars Technica multi-page
     │    ├── cleanContent(html) — embedded tweets + domain cleanup
     │    ├── Upload to Readeck as pre-fetched HTML
     │    │   Labels: source:{domain}, depaywalled-{strategy}
     │    └── Success → break
     │    │
     │    (all failed)
     │    ├── Plain URL to Readeck + depaywall-failed label
     │    └── ntfy /readeck "all strategies failed"
     │
     └── No strategies (unknown domain)
          └── Plain URL to Readeck (Readeck fetches it)
               Label: source:{domain}

IaC: services/n8n/workflows/save-to-read.json
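
The chain logic in the diagram can be sketched in Python as below. Executors and the validity check are passed in as plain callables, so the sketch makes no claim about how the n8n nodes implement them; the multi-page merge and cleanup steps are left out, and attaching the source label to failed bookmarks is an assumption.

```python
from typing import Callable, Dict, List, Optional, Tuple

Executor = Callable[[str], str]  # takes a URL, returns HTML, raises on failure


def run_chain(
    url: str,
    chain: List[str],
    executors: Dict[str, Executor],
    is_valid: Callable[[str], bool],
) -> Tuple[Optional[str], Optional[str], List[str]]:
    """Try each strategy in order; stop at the first one yielding valid HTML."""
    failures: List[str] = []
    for name in chain:
        try:
            html = executors[name](url)
        except Exception as exc:  # network error, timeout, bad status, ...
            failures.append(f"{name}: {exc}")
            continue
        if is_valid(html):
            return name, html, failures
        failures.append(f"{name}: content rejected")
    return None, None, failures


def plan_bookmark(url, domain, chain, executors, is_valid) -> dict:
    """Decide what gets sent to Readeck for one incoming URL."""
    source = f"source:{domain}"
    if not chain:
        # Unknown domain: plain URL, Readeck fetches it.
        return {"url": url, "labels": [source]}
    strategy, html, failures = run_chain(url, chain, executors, is_valid)
    if strategy is None:
        # Assumption: the source label is also kept on failed bookmarks.
        return {
            "url": url,
            "labels": [source, "depaywall-failed"],
            "notify": "all strategies failed",
            "failures": failures,
        }
    return {"url": url, "html": html, "labels": [source, f"depaywalled-{strategy}"]}
```

Treating an exception and a rejected body the same way (record the failure, move to the next strategy) matches the diagram, where only a validated fetch ends the loop.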

Workflow 3: `depaywall` (retry endpoint on save-to-read)

Same webhook as save-to-read, triggered with { mode: "depaywall", url, existingBookmarkId }. Tries one strategy per invocation — user inspects and reruns if needed.

Incoming URL (iOS Shortcut "Depaywall")
     │
     ▼
  Auth & Route (mode=depaywall)
  ├── Clean URL, resolve strategies (known map + learned)
  └── If no strategies found: defaults to [ladder, byparr, wayback, browserless]
     │
     ▼
  Process & Upload (depaywall branch)
     │
     ├── Read staticData.depaywall[url] → { tried: [...], lastAttempt }
     │
     ├── DELETE existing Readeck bookmark (existingBookmarkId)
     │
     ├── Pick next untried strategy
     │    ├── Execute strategy
     │    ├── isContentValid + cleanContent
     │    └── Upload pre-fetched HTML to Readeck
     │         Labels: source:{domain}, depaywalled-{strategy}
     │
     ├── Success → ntfy "check article quality"
     │
     ├── Strategy failed → ntfy "{strategy} failed, N remaining"
     │    └── Add plain bookmark with depaywall-pending label
     │
     └── All exhausted → ntfy "all strategies exhausted"
          ├── Add plain bookmark with depaywall-failed label
          └── Clean up staticData entry

State tracking: staticData.depaywall[url] = { tried: ['ladder', ...], lastAttempt: ISO date }. Cleanup of entries older than 7 days is still a TODO.

Learned domains: When depaywall succeeds on an unknown domain, staticData.learnedDomains[domain] = [winningStrategy] is saved. Future save-to-read calls for that domain auto-apply the learned strategy.
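
The state handling described above might be sketched in Python as follows, with a plain dict standing in for n8n's staticData. The function names are hypothetical, and prune_stale is one possible shape for the 7-day cleanup that is still a TODO.

```python
from datetime import datetime, timedelta, timezone

DEFAULT_CHAIN = ["ladder", "byparr", "wayback", "browserless"]
MAX_AGE = timedelta(days=7)


def next_strategy(static_data: dict, url: str, chain=None):
    """Return the first strategy not yet tried for this URL, or None."""
    chain = chain or DEFAULT_CHAIN
    tried = static_data.setdefault("depaywall", {}).get(url, {}).get("tried", [])
    return next((s for s in chain if s not in tried), None)


def record_attempt(static_data: dict, url: str, strategy: str, now: datetime) -> None:
    """Remember that a strategy ran for this URL, successful or not."""
    entry = static_data.setdefault("depaywall", {}).setdefault(url, {"tried": []})
    entry["tried"].append(strategy)
    entry["lastAttempt"] = now.isoformat()


def learn_domain(static_data: dict, domain: str, strategy: str, known_domains) -> None:
    """Save the winning strategy, but only for domains outside the known map."""
    if domain not in known_domains:
        static_data.setdefault("learnedDomains", {})[domain] = [strategy]


def prune_stale(static_data: dict, now: datetime) -> int:
    """Drop entries whose last attempt is older than MAX_AGE; return the count."""
    entries = static_data.get("depaywall", {})
    stale = [
        url
        for url, entry in entries.items()
        if now - datetime.fromisoformat(entry["lastAttempt"]) > MAX_AGE
    ]
    for url in stale:
        del entries[url]
    return len(stale)
```

Recording the attempt even on success keeps the "rerun if quality is poor" flow working: the next invocation moves on to the following strategy instead of repeating the one that just ran.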

Known paywalled domains

Hardcoded in save-to-read Auth & Route node. Domain → strategy chain.

Domain group	Domains	Strategy chain
DataDome-protected	reuters.com, bloomberg.com	byparr → wayback → browserless
Soft paywall	wsj.com, nytimes.com, washingtonpost.com, ft.com, economist.com, theathletic.com	ladder → byparr → wayback
Condé Nast	newyorker.com, wired.com, vanityfair.com, gq.com, bonappetit.com, architecturaldigest.com, cntraveler.com	ladder → byparr [→ wayback for some]
UK news	thetimes.co.uk, telegraph.co.uk	ladder → byparr → wayback
Business	businessinsider.com, insider.com, hbr.org, seekingalpha.com, barrons.com	ladder → byparr → wayback
Tech/other	arstechnica.com, theatlantic.com, medium.com	ladder → byparr
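
A lookup over this map could be sketched in Python as below. Only a few rows of the table are copied in, and matching subdomains by walking up parent domains (www.reuters.com to reuters.com) is an assumption about how the Auth & Route node compares hosts.

```python
from typing import Dict, List, Optional
from urllib.parse import urlsplit

# Excerpt of the table above; the workflow holds the full map.
KNOWN_PAYWALLED: Dict[str, List[str]] = {
    "reuters.com": ["byparr", "wayback", "browserless"],
    "bloomberg.com": ["byparr", "wayback", "browserless"],
    "nytimes.com": ["ladder", "byparr", "wayback"],
    "arstechnica.com": ["ladder", "byparr"],
}


def resolve_strategies(url: str, learned: Optional[Dict[str, List[str]]] = None) -> List[str]:
    """Return the strategy chain for a URL, or [] for an unknown domain."""
    host = (urlsplit(url).hostname or "").lower()
    labels = host.split(".")
    # Try the full host first, then each parent domain, stopping before the TLD.
    for i in range(len(labels) - 1):
        candidate = ".".join(labels[i:])
        if candidate in KNOWN_PAYWALLED:
            return list(KNOWN_PAYWALLED[candidate])
        if learned and candidate in learned:
            return list(learned[candidate])
    return []
```

An empty result corresponds to the "No strategies" branch in save-to-read, where the plain URL is handed to Readeck.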

Strategy executors

Strategy	Endpoint	Technique	Best for
`ladder`	`GET LADDER_URL/raw/{url}`	Googlebot UA + YAML rulesets	Soft paywalls (80% of sites)
`byparr`	`POST BYPARR_URL/v1`	Camoufox stealth browser	DataDome, Cloudflare
`wayback`	Wayback Machine SPN2 API	Submit → poll → fetch snapshot	Hard paywalls, archival
`browserless`	`POST BROWSERLESS_URL/content`	Headless Chrome, Googlebot UA	Last resort fallback
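
The wayback row's submit, poll, fetch sequence is the least obvious of the four, so here is a hedged Python sketch of it. The endpoint paths, the "LOW key:secret" Authorization header, and the job_id / status / timestamp / original_url fields reflect my reading of the SPN2 API and should be checked against its documentation. The HTTP layer is injected as a callable (a hypothetical http(method, url, headers, form) returning parsed JSON) so the flow can be exercised offline.

```python
import time
from typing import Callable, Dict, Optional

Http = Callable[[str, str, Dict[str, str], Optional[Dict[str, str]]], dict]

SPN2_SAVE = "https://web.archive.org/save"


def wayback_snapshot_url(
    http: Http,
    target_url: str,
    access_key: str,
    secret_key: str,
    poll_seconds: float = 5.0,
    max_polls: int = 24,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Submit a capture job, poll until it finishes, return the snapshot URL."""
    headers = {
        "Accept": "application/json",
        "Authorization": f"LOW {access_key}:{secret_key}",
    }
    job = http("POST", SPN2_SAVE, headers, {"url": target_url})
    job_id = job["job_id"]
    for _ in range(max_polls):
        status = http("GET", f"{SPN2_SAVE}/status/{job_id}", headers, None)
        state = status.get("status")
        if state == "success":
            return f"https://web.archive.org/web/{status['timestamp']}/{status['original_url']}"
        if state == "error":
            raise RuntimeError(status.get("message", "capture failed"))
        sleep(poll_seconds)  # still pending
    raise TimeoutError(f"capture of {target_url} did not finish after {max_polls} polls")
```

The snapshot URL is then fetched like any other page, which is why validation strips the Wayback toolbar before analysing the HTML.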

Pre-upload content validation (isContentValid)

Strip Wayback Machine toolbar/wrapper HTML before analysis
Reject if < 1000 bytes or < 200 words
Challenge signals: DataDome, captcha-delivery, verification required, etc.
Paywall signals: subscribe prompts, login walls, robot checks (requires 2+ hits)
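
These rules could be expressed in Python roughly as follows. The thresholds come from the list above; the exact signal phrases, the Wayback toolbar comment markers, and the tag-stripping used to count words are illustrative assumptions, not the workflow's real patterns.

```python
import re
from typing import Tuple

CHALLENGE_SIGNALS = ("datadome", "captcha-delivery", "verification required")
# Assumed phrases; the workflow's real list is not given in this ADR.
PAYWALL_SIGNALS = (
    "subscribe to continue",
    "sign in to continue",
    "log in to read",
    "are you a robot",
)
WAYBACK_TOOLBAR = re.compile(
    r"<!-- BEGIN WAYBACK TOOLBAR INSERT -->.*?<!-- END WAYBACK TOOLBAR INSERT -->",
    re.S,
)


def is_content_valid(html: str) -> Tuple[bool, str]:
    """Return (ok, reason) for fetched HTML before it is uploaded."""
    html = WAYBACK_TOOLBAR.sub("", html)
    if len(html.encode("utf-8")) < 1000:
        return False, "under 1000 bytes"
    # Crude visible-text extraction: drop script/style bodies, then all tags.
    text = re.sub(r"<(script|style)\b.*?</\1>", " ", html, flags=re.S | re.I)
    text = re.sub(r"<[^>]+>", " ", text)
    if len(text.split()) < 200:
        return False, "under 200 words"
    lowered = html.lower()
    for signal in CHALLENGE_SIGNALS:
        if signal in lowered:
            return False, f"challenge signal: {signal}"
    hits = sum(1 for signal in PAYWALL_SIGNALS if signal in lowered)
    if hits >= 2:
        return False, f"paywall signals: {hits}"
    return True, "ok"
```

A single challenge signal is enough to reject, while paywall phrases need two hits, since a lone "subscribe" link appears on many legitimate article pages.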

Content handlers

Run after strategy fetch + validation, before upload to Readeck. Pure string-replace, no DOM parser.

Execution order:
1. mergeMultiPage(html, url, domain): async, fetches additional pages
2. cleanEmbeddedTweets(html): universal tweet cleanup (runs inside cleanContent)
3. Domain-specific handler: per-site DOM cleanup

Universal handlers:

Handler	Cleanup
Embedded tweets	Style `<blockquote class="twitter-tweet">` with inline CSS, remove widgets.js, replace tweet iframes with links
Multi-page merger	Ars Technica: detect `nav.page-numbers`, fetch additional pages via Ladder, merge into single document

Domain-specific handlers:

Domain	Cleanup
`wikipedia.org`	Remove edit sections, footnote refs, infobox sidebar
`github.com`	Extract `<article>` only, clean "GitHub - org/" from title
`arstechnica.com`	Multi-page merge + remove page navigation
`medium.com`	Extract largest image from `<picture>` srcset
`wired.com`	Remove GenericCallout, ad slots, most popular
`theatlantic.com`	Remove audio player, related content
`substack.com`	Remove newsletter chrome — preserves tweets via negative lookahead
`stackoverflow.com`	Remove hot network questions, sidebar
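
To make the "string-replace, no DOM parser" approach concrete, here is a small Python sketch of how a universal handler and a per-domain registry might compose. The regex patterns are illustrative guesses at the markup involved, only one domain handler is shown, and the async multi-page merge is omitted.

```python
import re
from typing import Callable, Dict

# Assumed markup for Twitter's embed script tag.
WIDGETS_JS = re.compile(
    r"<script[^>]*platform\.twitter\.com/widgets\.js[^>]*>\s*</script>", re.I
)
# Assumed markup for Wikipedia footnote markers such as [1].
FOOTNOTE_REF = re.compile(r'<sup[^>]*class="reference"[^>]*>.*?</sup>', re.S)


def clean_embedded_tweets(html: str) -> str:
    """Universal step: remove the widgets.js loader."""
    return WIDGETS_JS.sub("", html)


def clean_wikipedia(html: str) -> str:
    """Domain step: strip footnote reference markers."""
    return FOOTNOTE_REF.sub("", html)


DOMAIN_HANDLERS: Dict[str, Callable[[str], str]] = {
    "wikipedia.org": clean_wikipedia,
}


def clean_content(html: str, domain: str) -> str:
    """Run the universal cleanup first, then the matching domain handler."""
    html = clean_embedded_tweets(html)
    for suffix, handler in DOMAIN_HANDLERS.items():
        if domain == suffix or domain.endswith("." + suffix):
            return handler(html)
    return html
```

Regex cleanup of this kind only holds up for elements that do not nest inside themselves; removing nested containers reliably would need either careful patterns or a real parser.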

What was removed from archive-media

Feature	Status
Depaywall chain + strategy executors	Moved to save-to-read
Content handlers (cleanContent, etc.)	Moved to save-to-read
isContentValid pre-upload check	Moved to save-to-read
ntfy notifications	Moved to save-to-read (topic: /readeck)
Profile learning + Forgejo sync	Dropped entirely
Ollama vision screenshot analysis	Dropped entirely
staticData URL profiles	Replaced by hardcoded knownPaywalled map + learnedDomains

iOS Shortcuts

Shortcut	Action	Webhook payload
"Archive"	Save to ArchiveBox or Karakeep	`POST /archive-media { url, title? }`
"Read Later"	Save to Readeck (depaywall if known)	`POST /save-to-read { url, title? }`
"Depaywall"	Retry depaywalling in Readeck	`POST /save-to-read { url, mode: "depaywall", existingBookmarkId }`

Ruleset strategy by bot protection type

Protection	Technique	Service
None / soft paywall	Googlebot UA + referer	Ladder
Cookie-based	Custom cookies in headers	Ladder
JavaScript overlay	HTML injection to remove	Ladder
Cloudflare	Camoufox engine-level stealth	Byparr
DataDome (Reuters, Bloomberg)	Camoufox engine-level stealth	Byparr
Hard paywall (no bypass)	Wayback Machine archive	Wayback SPN2 API

Resources

Component	CPU	RAM	Image
Ladder	0.5 core	128M	ghcr.io/everywall/ladder:latest
Byparr	1.5 cores	2G	ghcr.io/thephaseless/byparr:latest
browserless	1 core	2G	ghcr.io/browserless/chromium:latest
Total LXC 129	3 cores	4G

External dependencies

Mirrored to Forgejo mirrors org for resilience:

mirrors/periscope — reynaldichernando/periscope (UI reference, not deployed)
mirrors/ladder — everywall/ladder (deployed)
mirrors/ladder-rules — everywall/ladder-rules (reference rulesets)

Wayback Machine SPN2 API keys stored in Vault at secret/wayback.

Readeck (read-it-later)

LXC 130 on apps-pool, IP 192.168.1.130
Domain: readeck.eva-00.network
Auth: oauth2-proxy on LXC 119 (forwarded auth headers → Readeck auto-provisions users)
API: Authenticated with API tokens, which can only be created in the web UI at /profile/tokens (no programmatic creation). Pre-fetched HTML is uploaded via POST /api/bookmarks as multipart/form-data.
Vault: secret/readeck → secret_key (deployment), api_token (created manually post-deploy)
Role: Clean reading experience (highlights, annotations, EPUB export). Karakeep remains for bookmarking/archival.

Consequences

LXC count increases by 1 (29 → 30 containers on chizuru)
~4GB additional RAM usage on Proxmox host
Ladder ruleset must be maintained as sites change their protection
Byparr/Camoufox stealth may lag behind DataDome updates (cat-and-mouse game); manual intervention may still be needed for some sites
Wayback Machine SPN2 API has rate limits (free tier); not suitable for high-volume archival
Three workflows to maintain instead of one, but each is focused and independently debuggable
Readeck API token must be created manually after first deploy (no programmatic creation)