ADR-010 — Archival Proxy Stack (Ladder + Byparr + Browserless)

Date: 2026-04-20
Status: Accepted

Context

The bookmark archival pipeline (n8n + Karakeep + ArchiveBox) needs to de-paywall articles before storing them. The initial approach used browserless/chromium as a sidecar on the n8n LXC with Googlebot UA spoofing + archive.today fallback.

Problem: Reuters (and other major news sites) use DataDome bot protection, which rejects all server-side HTTP requests regardless of User-Agent, Referer, X-Forwarded-For, or Accept headers. Every request returns HTTP 401 with a captcha-delivery.com JS challenge page. That challenge page was being stored in Karakeep as the "archived" content, and no ntfy notification fired because content verification did not flag it.

Research findings (2026-04-20):

| Approach | Result |
|---|---|
| Googlebot UA | 401 (DataDome blocks) |
| Google News bot UA | 401 (DataDome blocks) |
| Normal Chrome UA + Google referer | 401 (DataDome blocks) |
| Google Cache | Returns JS challenge (requires browser) |
| Google AMP CDN | Redirects to google.com/url (no content) |
| Reuters AMP endpoint | 401 (DataDome blocks) |
| Reuters internal API | 401 (DataDome blocks) |
| Accept: application/json | 401 (DataDome blocks) |
| Wayback Machine | 404 (not archived) |
| Corsfix proxy (periscope.corsfix.com) | Works; uses server-side headless Chrome |

Key insight: DataDome requires actual JavaScript execution + browser fingerprinting. No header trick bypasses it. The fetch has to come from a headless browser that executes the challenge JS and presents a plausible fingerprint.

Decision

Deploy a dedicated archival proxy stack on its own LXC, separate from n8n:

Architecture

LXC 129 (archival) — 192.168.1.129
├── Ladder (port 8080)
│   ├── Go HTTP proxy with YAML rulesets
│   ├── Per-domain header tricks (Googlebot UA, cookies, referer)
│   └── HTML injection (ad/overlay removal)
├── Byparr (port 3001 → 8191)
│   ├── Pre-built image: ghcr.io/thephaseless/byparr
│   ├── Uses Camoufox (C++ engine-level anti-detection)
│   └── POST /v1 { cmd: "request.get", url } → returns JSON with rendered HTML
└── browserless/chromium (port 3000)
    ├── Direct headless Chrome API (Googlebot UA fallback)
    └── Kept as n8n fallback for custom JS interaction

LXC 120 (automation) — 192.168.1.120
└── n8n references archival via IP
    ├── LADDER_URL=http://192.168.1.129:8080
    ├── BYPARR_URL=http://192.168.1.129:3001
    ├── BROWSERLESS_URL=http://192.168.1.129:3000
    └── WAYBACK_ACCESS_KEY/SECRET_KEY (from Vault)

Why separate LXC

  • Isolation: Byparr and browserless are memory-hungry (2G each, see Resources). n8n should not crash because a headless browser OOM'd.
  • Independent scaling: Can bump archival LXC memory without affecting n8n.
  • Extensibility: n8n will grow with more automations; archival services are a distinct concern.
  • Debuggability: Can restart/redeploy archival stack without touching n8n.

Why Ladder over direct browserless

  • Simpler for n8n: Single GET /raw/{url} call vs. constructing browserless JSON payloads.
  • Ruleset-driven: New sites can be added by editing a YAML file, no workflow changes needed.
  • Efficient: Header-based bypass (no browser needed) covers most paywalled sites (roughly 80%, an estimate).
  • Community rulesets: Can pull from everywall/ladder-rules for updates.
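
To make the "single GET /raw/{url} call" concrete, here is a minimal Python sketch. It is illustrative only: the real call is made from an n8n workflow, the helper names are hypothetical, and appending the target URL unencoded after /raw/ is an assumption about how Ladder expects it.

```python
import urllib.request


def ladder_raw_url(ladder_url: str, target_url: str) -> str:
    """Build the Ladder /raw/ URL for a target article (hypothetical helper)."""
    # Assumption: the target URL is appended as-is after /raw/.
    return f"{ladder_url.rstrip('/')}/raw/{target_url}"


def fetch_via_ladder(ladder_url: str, target_url: str, timeout: float = 30.0) -> str:
    """Fetch the article HTML through Ladder with one plain GET."""
    with urllib.request.urlopen(ladder_raw_url(ladder_url, target_url), timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")
```

The point of the comparison is that the caller only builds a URL; there is no JSON payload to construct.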

Why Byparr over FlareSolverr / custom playwright-proxy

  • FlareSolverr cannot solve DataDome. It uses undetected-chromedriver, which only handles Cloudflare challenges. Against DataDome it reports "Challenge not detected!" and returns the page as-is, but that page is a 1.5KB DataDome script, not article content.
  • Custom playwright-proxy was fragile. Required a custom Dockerfile with system lib management (libXfixes, etc.) — broke on every rebuild due to missing X11 deps. Doesn't match the homelab pattern of using pre-built registry images.
  • Byparr uses Camoufox — a Firefox fork with C++ engine-level anti-detection. Fingerprint spoofing happens in the browser binary itself, not via JS patches that DataDome can detect.
  • Pre-built image: ghcr.io/thephaseless/byparr — no custom build steps, just docker compose up.
  • FlareSolverr-compatible API: POST /v1 with { cmd: "request.get", url } → JSON response with rendered HTML.
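
The Byparr call described above could look roughly like this in Python. This is a sketch under stated assumptions: the request body uses only the two fields named in this ADR, and the response shape (status plus solution.response holding the rendered HTML) is assumed from FlareSolverr compatibility rather than confirmed against Byparr. Function names are hypothetical.

```python
import json
import urllib.request


def build_byparr_request(byparr_url: str, target_url: str) -> urllib.request.Request:
    """Build the POST /v1 request with the FlareSolverr-style payload."""
    payload = {"cmd": "request.get", "url": target_url}
    return urllib.request.Request(
        f"{byparr_url.rstrip('/')}/v1",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_html(response_body: str) -> str:
    """Pull the rendered HTML out of the JSON response.

    Assumption: FlareSolverr-style body, i.e.
    {"status": "ok", "solution": {"response": "<html>..."}}.
    """
    data = json.loads(response_body)
    if data.get("status") != "ok":
        raise RuntimeError(f"Byparr did not return ok: {data.get('message')}")
    return data["solution"]["response"]
```

A caller would send the request with urllib.request.urlopen(build_byparr_request(...)) and pass the decoded body to extract_html.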

Why keep browserless alongside Byparr

  • browserless remains as a Googlebot UA fallback for sites that respond to bot user-agents.
  • Lower overhead than Byparr for simple headless Chrome tasks.

Three-workflow architecture (2026-04-22 redesign)

Replaced the monolithic archive-media workflow with three focused workflows. Each has a dedicated iOS Shortcut.

Workflow overview

| Workflow | Shortcut | Webhook path | Destination | Purpose |
|---|---|---|---|---|
| archive-media | "Archive" | /archive-media | ArchiveBox / Karakeep | Archival routing |
| save-to-read | "Read Later" | /save-to-read | Readeck | Read-later with depaywall |
| depaywall | "Depaywall" | /save-to-read (mode=depaywall) | Readeck | Retry next strategy |

Workflow 1: archive-media (simplified)

Trivial URL router — no depaywall, no content handlers, no profile learning.

Incoming URL (iOS Shortcut "Archive")
     │
     ▼
  Auth & Route
  ├── Validate auth (Bearer token)
  ├── Clean URL (strip tracking params, unwrap Google redirects)
  └── Route: reddit / github / karakeep
     │
     ▼
  Route by Handler (Switch)
     │
     ├── "reddit"
     │    ├── Clean Reddit URL → Fetch .json → Parse
     │    ├── ArchiveBox /api/v1/cli/add
     │    └── Tags: reddit, r/{subreddit}, post|comment
     │
     ├── "github"
     │    ├── ArchiveBox /api/v1/cli/add
     │    └── Tags: github
     │
     └── "karakeep" (everything else)
          └── Karakeep POST /api/v1/bookmarks (plain URL)

IaC: services/n8n/workflows/archive-media.json
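
The "Clean URL" and "Route" steps in the diagram above can be sketched as follows. This is an illustrative Python version, not the n8n node itself; the tracking-parameter list and the Google redirect parameter names (q, url) are assumptions.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

TRACKING_KEYS = {"fbclid", "gclid"}  # assumed list; utm_* is handled by prefix


def _host_is(host: str, domain: str) -> bool:
    """True if host is the domain itself or one of its subdomains."""
    return host == domain or host.endswith("." + domain)


def clean_url(url: str) -> str:
    """Unwrap Google redirect links, then strip tracking parameters."""
    parts = urlsplit(url)
    query = parse_qsl(parts.query, keep_blank_values=True)
    if _host_is(parts.hostname or "", "google.com") and parts.path == "/url":
        params = dict(query)
        target = params.get("q") or params.get("url")
        if target:
            return clean_url(target)
    kept = [(k, v) for k, v in query
            if not k.startswith("utm_") and k not in TRACKING_KEYS]
    return urlunsplit((parts.scheme, parts.netloc, parts.path,
                       urlencode(kept), parts.fragment))


def route(url: str) -> str:
    """Pick the handler: reddit, github, or karakeep for everything else."""
    host = urlsplit(url).hostname or ""
    if _host_is(host, "reddit.com"):
        return "reddit"
    if _host_is(host, "github.com"):
        return "github"
    return "karakeep"
```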

Workflow 2: save-to-read (normal mode)

Saves articles to Readeck. Known paywalled domains get the depaywall chain with content handlers. Unknown domains get plain URL bookmarks (Readeck fetches them).

Incoming URL (iOS Shortcut "Read Later")
     │
     ▼
  Auth & Route
  ├── Validate auth (Bearer token)
  ├── Clean URL
  ├── Match domain against knownPaywalled map
  └── Also check staticData.learnedDomains
     │
     ▼
  Process & Upload
     │
     ├── Has strategies (known paywalled domain)
     │    │
     │    ▼
     │   For each strategy in chain:
     │    ├── Execute: ladder/byparr/wayback/browserless
     │    ├── isContentValid(html) — pre-upload check
     │    ├── mergeMultiPage(html) — Ars Technica multi-page
     │    ├── cleanContent(html) — embedded tweets + domain cleanup
     │    ├── Upload to Readeck as pre-fetched HTML
     │    │   Labels: source:{domain}, depaywalled-{strategy}
     │    └── Success → break
     │    │
     │    (all failed)
     │    ├── Plain URL to Readeck + depaywall-failed label
     │    └── ntfy /readeck "all strategies failed"
     │
     └── No strategies (unknown domain)
          └── Plain URL to Readeck (Readeck fetches it)
               Label: source:{domain}

IaC: services/n8n/workflows/save-to-read.json
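
The strategy loop in the diagram above reduces to "first valid result wins". A minimal Python sketch, with the executors and the validity check passed in as plain callables (all names here are hypothetical; the real logic lives in the n8n workflow):

```python
from typing import Callable, Dict, List, Optional, Tuple


def run_chain(
    url: str,
    strategies: List[str],
    executors: Dict[str, Callable[[str], str]],
    is_valid: Callable[[str], bool],
) -> Optional[Tuple[str, str]]:
    """Try each strategy in order; return (strategy, html) for the first valid fetch."""
    for name in strategies:
        try:
            html = executors[name](url)
        except Exception:
            continue  # a failed fetch just moves on to the next strategy
        if is_valid(html):
            return name, html
    return None  # all strategies failed


def readeck_labels(domain: str, strategy: Optional[str]) -> List[str]:
    """Labels as described above: source:{domain} plus the depaywall outcome."""
    outcome = f"depaywalled-{strategy}" if strategy else "depaywall-failed"
    return [f"source:{domain}", outcome]
```

When run_chain returns None, the caller falls back to a plain URL bookmark and sends the ntfy notification.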

Workflow 3: depaywall (retry endpoint on save-to-read)

Same webhook as save-to-read, triggered with { mode: "depaywall", url, existingBookmarkId }. Tries one strategy per invocation — user inspects and reruns if needed.

Incoming URL (iOS Shortcut "Depaywall")
     │
     ▼
  Auth & Route (mode=depaywall)
  ├── Clean URL, resolve strategies (known map + learned)
  └── If no strategies found: defaults to [ladder, byparr, wayback, browserless]
     │
     ▼
  Process & Upload (depaywall branch)
     │
     ├── Read staticData.depaywall[url] → { tried: [...], lastAttempt }
     │
     ├── DELETE existing Readeck bookmark (existingBookmarkId)
     │
     ├── Pick next untried strategy
     │    ├── Execute strategy
     │    ├── isContentValid + cleanContent
     │    └── Upload pre-fetched HTML to Readeck
     │         Labels: source:{domain}, depaywalled-{strategy}
     │
     ├── Success → ntfy "check article quality"
     │
     ├── Strategy failed → ntfy "{strategy} failed, N remaining"
     │    └── Add plain bookmark with depaywall-pending label
     │
     └── All exhausted → ntfy "all strategies exhausted"
          ├── Add plain bookmark with depaywall-failed label
          └── Clean up staticData entry

State tracking: staticData.depaywall[url] = { tried: ['ladder', ...], lastAttempt: ISO date }. Entries are meant to be cleaned up after 7 days (TODO, not yet implemented).

Learned domains: When depaywall succeeds on an unknown domain, staticData.learnedDomains[domain] = [winningStrategy] is saved. Future save-to-read calls for that domain auto-apply the learned strategy.
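
The state handling described above could be sketched like this in Python. A plain dict stands in for n8n's staticData, the function names are hypothetical, and the 7-day pruning is one possible shape for the cleanup that is still marked TODO.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List, Optional


def next_untried(strategies: List[str], tried: List[str]) -> Optional[str]:
    """Return the first strategy not yet attempted, or None when exhausted."""
    for name in strategies:
        if name not in tried:
            return name
    return None


def record_attempt(state: Dict[str, dict], url: str, strategy: str, now: datetime) -> None:
    """Append the strategy to tried[] and stamp lastAttempt as an ISO date."""
    entry = state.setdefault(url, {"tried": [], "lastAttempt": None})
    entry["tried"].append(strategy)
    entry["lastAttempt"] = now.isoformat()


def remember_winner(learned: Dict[str, List[str]], domain: str, strategy: str) -> None:
    """Save the winning strategy so future save-to-read calls reuse it."""
    learned[domain] = [strategy]


def prune_stale(state: Dict[str, dict], now: datetime,
                max_age: timedelta = timedelta(days=7)) -> List[str]:
    """Drop entries whose lastAttempt is older than max_age; return removed URLs."""
    stale = [url for url, entry in state.items()
             if now - datetime.fromisoformat(entry["lastAttempt"]) > max_age]
    for url in stale:
        del state[url]
    return stale
```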

Known paywalled domains

Hardcoded in save-to-read Auth & Route node. Domain → strategy chain.

| Domain group | Domains | Strategy chain |
|---|---|---|
| DataDome-protected | reuters.com, bloomberg.com | byparr → wayback → browserless |
| Soft paywall | wsj.com, nytimes.com, washingtonpost.com, ft.com, economist.com, theathletic.com | ladder → byparr → wayback |
| Condé Nast | newyorker.com, wired.com, vanityfair.com, gq.com, bonappetit.com, architecturaldigest.com, cntraveler.com | ladder → byparr [→ wayback for some] |
| UK news | thetimes.co.uk, telegraph.co.uk | ladder → byparr → wayback |
| Business | businessinsider.com, insider.com, hbr.org, seekingalpha.com, barrons.com | ladder → byparr → wayback |
| Tech/other | arstechnica.com, theatlantic.com, medium.com | ladder → byparr |
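
A domain lookup over this map might look like the Python sketch below. The map is only a subset of the rows listed here, and matching subdomains by walking up to parent domains, as well as checking the hardcoded map before learned domains, are assumptions; the ADR does not say how hosts such as www.reuters.com are matched.

```python
from typing import Dict, List

# Subset of the known paywalled domains, for illustration only.
KNOWN_PAYWALLED: Dict[str, List[str]] = {
    "reuters.com": ["byparr", "wayback", "browserless"],
    "bloomberg.com": ["byparr", "wayback", "browserless"],
    "wsj.com": ["ladder", "byparr", "wayback"],
    "arstechnica.com": ["ladder", "byparr"],
}

# Default used by depaywall mode when nothing else matches.
DEFAULT_CHAIN = ["ladder", "byparr", "wayback", "browserless"]


def resolve_strategies(host: str, learned: Dict[str, List[str]],
                       use_default: bool = False) -> List[str]:
    """Return the strategy chain for a host: known map first, then learned domains."""
    labels = host.lower().split(".")
    # Try the full host, then each parent domain (www.reuters.com, reuters.com).
    for i in range(len(labels) - 1):
        candidate = ".".join(labels[i:])
        if candidate in KNOWN_PAYWALLED:
            return list(KNOWN_PAYWALLED[candidate])
        if candidate in learned:
            return list(learned[candidate])
    return list(DEFAULT_CHAIN) if use_default else []
```

An empty result corresponds to the "unknown domain" branch of save-to-read; use_default=True corresponds to depaywall mode.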

Strategy executors

| Strategy | Endpoint | Technique | Best for |
|---|---|---|---|
| ladder | GET LADDER_URL/raw/{url} | Googlebot UA + YAML rulesets | Soft paywalls (80% of sites) |
| byparr | POST BYPARR_URL/v1 | Camoufox stealth browser | DataDome, Cloudflare |
| wayback | Wayback Machine SPN2 API | Submit → poll → fetch snapshot | Hard paywalls, archival |
| browserless | POST BROWSERLESS_URL/content | Headless Chrome, Googlebot UA | Last resort fallback |
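
The wayback executor's "submit → poll → fetch snapshot" flow hinges on the polling step, sketched here in Python. The status lookup is injected as a callable so the loop can be exercised without network access. The status values (pending, success, error) and the timestamp / original_url fields are assumptions about the SPN2 status response, and wait_for_snapshot is a hypothetical name.

```python
import time
from typing import Callable, Dict, Optional


def wait_for_snapshot(
    get_status: Callable[[str], Dict[str, str]],
    job_id: str,
    attempts: int = 10,
    delay_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Poll a capture job until it finishes; return the snapshot URL or None."""
    for _ in range(attempts):
        status = get_status(job_id)
        state = status.get("status")
        if state == "success":
            # Assumed fields: 14-digit timestamp plus the originally submitted URL.
            return f"https://web.archive.org/web/{status['timestamp']}/{status['original_url']}"
        if state == "error":
            return None
        sleep(delay_s)  # still pending, wait before the next poll
    return None  # gave up: treat as a failed strategy
```

Bounding the number of attempts keeps a slow capture from stalling the whole chain; the next strategy (or the failure path) runs instead.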

Pre-upload content validation (isContentValid)

  • Strip Wayback Machine toolbar/wrapper HTML before analysis
  • Reject if < 1000 bytes or < 200 words
  • Challenge signals: DataDome, captcha-delivery, verification required, etc.
  • Paywall signals: subscribe prompts, login walls, robot checks (requires 2+ hits)
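
The checks listed above can be sketched in Python as follows. The thresholds come from this list; the exact signal phrases and the Wayback toolbar comment markers are assumptions, and the real isContentValid runs inside n8n.

```python
import re

# Assumed signal phrases; the real lists are longer ("etc." in the ADR).
CHALLENGE_SIGNALS = ("datadome", "captcha-delivery", "verification required")
PAYWALL_SIGNALS = ("subscribe to continue", "sign in to continue",
                   "log in to read", "are you a robot")

# Assumption: the Wayback toolbar is delimited by these HTML comments.
WAYBACK_TOOLBAR = re.compile(
    r"<!-- BEGIN WAYBACK TOOLBAR INSERT -->.*?<!-- END WAYBACK TOOLBAR INSERT -->",
    re.S,
)


def is_content_valid(html: str) -> bool:
    """Pre-upload check: reject challenge pages, paywall stubs, and tiny responses."""
    html = WAYBACK_TOOLBAR.sub("", html)
    if len(html.encode("utf-8")) < 1000:
        return False
    lowered = html.lower()
    if any(signal in lowered for signal in CHALLENGE_SIGNALS):
        return False  # a single challenge signal is enough
    # Rough word count: drop script/style bodies, then all remaining tags.
    text = re.sub(r"<(script|style)\b.*?</\1\s*>", " ", html, flags=re.S | re.I)
    text = re.sub(r"<[^>]+>", " ", text)
    if len(text.split()) < 200:
        return False
    hits = sum(1 for signal in PAYWALL_SIGNALS if signal in lowered)
    return hits < 2  # paywall signals need 2+ hits to reject
```

Requiring two paywall hits avoids rejecting a full article that merely carries one "subscribe" banner; a plain substring match on challenge signals can still misfire on articles that discuss DataDome itself.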

Content handlers

Run after strategy fetch + validation, before upload to Readeck. Pure string-replace, no DOM parser.

Execution order:

  1. mergeMultiPage(html, url, domain): async, fetches additional pages
  2. cleanEmbeddedTweets(html): universal tweet cleanup (runs inside cleanContent)
  3. Domain-specific handler: per-site markup cleanup

Universal handlers:

| Handler | Cleanup |
|---|---|
| Embedded tweets | Style <blockquote class="twitter-tweet"> with inline CSS, remove widgets.js, replace tweet iframes with links |
| Multi-page merger | Ars Technica: detect nav.page-numbers, fetch additional pages via Ladder, merge into single document |

Domain-specific handlers:

| Domain | Cleanup |
|---|---|
| wikipedia.org | Remove edit sections, footnote refs, infobox sidebar |
| github.com | Extract <article> only, clean "GitHub - org/" from title |
| arstechnica.com | Multi-page merge + remove page navigation |
| medium.com | Extract largest image from <picture> srcset |
| wired.com | Remove GenericCallout, ad slots, most popular |
| theatlantic.com | Remove audio player, related content |
| substack.com | Remove newsletter chrome; preserves tweets via negative lookahead |
| stackoverflow.com | Remove hot network questions, sidebar |
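
As an illustration of the "pure string-replace, no DOM parser" approach, here is a Python sketch of a universal handler followed by a per-domain dispatch. Only two cleanups are shown, and the exact markup matched (the widgets.js script tag, a nav element whose class attribute is exactly page-numbers) is an assumption.

```python
import re
from typing import Callable, Dict

# Matches the Twitter widgets.js loader script tag (assumed markup).
WIDGETS_JS = re.compile(
    r"<script[^>]*platform\.twitter\.com/widgets\.js[^>]*>\s*</script>", re.I
)
# Matches Ars Technica page navigation (assumed markup for nav.page-numbers).
PAGE_NAV = re.compile(r'<nav class="page-numbers".*?</nav>', re.S)


def clean_embedded_tweets(html: str) -> str:
    """Universal handler: drop the widgets.js loader."""
    return WIDGETS_JS.sub("", html)


def clean_arstechnica(html: str) -> str:
    """Domain handler: remove page navigation left over after the merge."""
    return PAGE_NAV.sub("", html)


DOMAIN_HANDLERS: Dict[str, Callable[[str], str]] = {
    "arstechnica.com": clean_arstechnica,
}


def clean_content(html: str, domain: str) -> str:
    """Run the universal tweet cleanup, then the domain handler if one exists."""
    html = clean_embedded_tweets(html)
    handler = DOMAIN_HANDLERS.get(domain)
    return handler(html) if handler else html
```

Adding a site then means registering one more function in DOMAIN_HANDLERS; domains without an entry still get the universal cleanup.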

What was removed from archive-media

| Feature | Status |
|---|---|
| Depaywall chain + strategy executors | Moved to save-to-read |
| Content handlers (cleanContent, etc.) | Moved to save-to-read |
| isContentValid pre-upload check | Moved to save-to-read |
| ntfy notifications | Moved to save-to-read (topic: /readeck) |
| Profile learning + Forgejo sync | Dropped entirely |
| Ollama vision screenshot analysis | Dropped entirely |
| staticData URL profiles | Replaced by hardcoded knownPaywalled map + learnedDomains |

iOS Shortcuts

| Shortcut | Action | Webhook payload |
|---|---|---|
| "Archive" | Save to ArchiveBox or Karakeep | POST /archive-media { url, title? } |
| "Read Later" | Save to Readeck (depaywall if known) | POST /save-to-read { url, title? } |
| "Depaywall" | Retry depaywalling in Readeck | POST /save-to-read { url, mode: "depaywall", existingBookmarkId } |

Ruleset strategy by bot protection type

| Protection | Technique | Service |
|---|---|---|
| None / soft paywall | Googlebot UA + referer | Ladder |
| Cookie-based | Custom cookies in headers | Ladder |
| JavaScript overlay | HTML injection to remove | Ladder |
| Cloudflare | Camoufox engine-level stealth | Byparr |
| DataDome (Reuters, Bloomberg) | Camoufox engine-level stealth | Byparr |
| Hard paywall (no bypass) | Wayback Machine archive | Wayback SPN2 API |

Resources

| Component | CPU | RAM | Image |
|---|---|---|---|
| Ladder | 0.5 core | 128M | ghcr.io/everywall/ladder:latest |
| Byparr | 1.5 cores | 2G | ghcr.io/thephaseless/byparr:latest |
| browserless | 1 core | 2G | ghcr.io/browserless/chromium:latest |
| Total LXC 129 | 3 cores | 4G | |

External dependencies

Mirrored to Forgejo mirrors org for resilience:

  • mirrors/periscope — reynaldichernando/periscope (UI reference, not deployed)
  • mirrors/ladder — everywall/ladder (deployed)
  • mirrors/ladder-rules — everywall/ladder-rules (reference rulesets)

Wayback Machine SPN2 API keys stored in Vault at secret/wayback.

Readeck (read-it-later)

  • LXC 130 on apps-pool, IP 192.168.1.130
  • Domain: readeck.eva-00.network
  • Auth: oauth2-proxy on LXC 119 (forwarded auth headers → Readeck auto-provisions users)
  • API: API tokens (web UI only at /profile/tokens, no programmatic creation). Pre-fetched HTML upload via POST /api/bookmarks multipart/form-data.
  • Vault: secret/readeck, keys secret_key (deployment) and api_token (created manually post-deploy)
  • Role: Clean reading experience (highlights, annotations, EPUB export). Karakeep remains for bookmarking/archival.

Consequences

  • LXC count increases by 1 (29 → 30 containers on chizuru)
  • ~4GB additional RAM usage on Proxmox host
  • Ladder ruleset must be maintained as sites change their protection
  • Byparr/Camoufox stealth may lag behind DataDome updates (cat-and-mouse game); manual intervention may still be needed for some sites
  • Wayback Machine SPN2 API has rate limits (free tier); not suitable for high-volume archival
  • Three workflows to maintain instead of one, but each is focused and independently debuggable
  • Readeck API token must be created manually after first deploy (no programmatic creation)