ADR-010 — Archival Proxy Stack (Ladder + Byparr + Browserless)
Date: 2026-04-20 Status: Accepted
Context
The bookmark archival pipeline (n8n + Karakeep + ArchiveBox) needs to de-paywall articles before storing them. The initial approach used browserless/chromium as a sidecar on the n8n LXC with Googlebot UA spoofing + archive.today fallback.
Problem: Reuters (and other major news sites) use DataDome bot protection, which rejects all server-side HTTP requests regardless of User-Agent, Referer, X-Forwarded-For, or Accept headers. Every request returns HTTP 401 with a captcha-delivery.com JS challenge page. This challenge page was being stored in Karakeep as the "archived" content, with no ntfy notification (the content verification wasn't catching it).
Research findings (2026-04-20):
| Approach | Result |
|---|---|
| Googlebot UA | 401 — DataDome blocks |
| Google News bot UA | 401 — DataDome blocks |
| Normal Chrome UA + Google referer | 401 — DataDome blocks |
| Google Cache | Returns JS challenge (requires browser) |
| Google AMP CDN | Redirects to google.com/url (no content) |
| Reuters AMP endpoint | 401 — DataDome blocks |
| Reuters internal API | 401 — DataDome blocks |
| Accept: application/json | 401 — DataDome blocks |
| Wayback Machine | 404 (not archived) |
| Corsfix proxy (periscope.corsfix.com) | Works — uses server-side headless Chrome |
Key insight: DataDome requires actual JavaScript execution + browser fingerprinting. No header trick bypasses it. The only solution is a headless browser that executes the challenge JS.
Decision
Deploy a dedicated archival proxy stack on its own LXC, separate from n8n:
Architecture
LXC 129 (archival) — 192.168.1.129
├── Ladder (port 8080)
│ ├── Go HTTP proxy with YAML rulesets
│ ├── Per-domain header tricks (Googlebot UA, cookies, referer)
│ └── HTML injection (ad/overlay removal)
├── Byparr (port 3001 → 8191)
│ ├── Pre-built image: ghcr.io/thephaseless/byparr
│ ├── Uses Camoufox (C++ engine-level anti-detection)
│ └── POST /v1 { cmd: "request.get", url } → returns JSON with rendered HTML
└── browserless/chromium (port 3000)
    ├── Direct headless Chrome API (Googlebot UA fallback)
    └── Kept as n8n fallback for custom JS interaction
LXC 120 (automation) — 192.168.1.120
└── n8n references archival via IP
    ├── LADDER_URL=http://192.168.1.129:8080
    ├── BYPARR_URL=http://192.168.1.129:3001
    ├── BROWSERLESS_URL=http://192.168.1.129:3000
    └── WAYBACK_ACCESS_KEY/SECRET_KEY (from Vault)
Why separate LXC
- Isolation: Byparr and browserless are memory-hungry (2G each, see Resources). n8n should not crash because a headless browser OOM'd.
- Independent scaling: Can bump archival LXC memory without affecting n8n.
- Extensibility: n8n will grow with more automations; archival services are a distinct concern.
- Debuggability: Can restart/redeploy archival stack without touching n8n.
Why Ladder over direct browserless
- Simpler for n8n: Single GET /raw/{url} call vs. constructing browserless JSON payloads.
- Ruleset-driven: New sites can be added by editing a YAML file, no workflow changes needed.
- Efficient: Header-based bypass (no browser needed) for 80% of paywalled sites.
- Community rulesets: Can pull from everywall/ladder-rules for updates.
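As a rough illustration of the single-call shape described above, the helper below builds a Ladder raw-fetch URL. The function name is hypothetical, and appending the target URL unencoded after /raw/ is an assumption based on the GET /raw/{url} form, not a confirmed detail of the n8n workflow.

```python
def ladder_raw_url(ladder_url: str, target_url: str) -> str:
    """Build the URL for a Ladder raw fetch of target_url."""
    # Assumption: the target URL is appended as-is after /raw/,
    # mirroring the GET /raw/{url} shape noted in this ADR.
    return f"{ladder_url.rstrip('/')}/raw/{target_url}"


# Example with the LADDER_URL value from the architecture section.
print(ladder_raw_url("http://192.168.1.129:8080", "https://example.com/article"))
```

By comparison, a browserless call needs a JSON body per request, which is the extra construction work the first bullet refers to.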
Why Byparr over FlareSolverr / custom playwright-proxy
- FlareSolverr cannot solve DataDome. It uses undetected-chromedriver which only handles Cloudflare challenges. DataDome returns a JS challenge page that FlareSolverr reports as "Challenge not detected!" but the response is a 1.5KB DataDome script, not article content.
- Custom playwright-proxy was fragile. Required a custom Dockerfile with system lib management (libXfixes, etc.) — broke on every rebuild due to missing X11 deps. Doesn't match the homelab pattern of using pre-built registry images.
- Byparr uses Camoufox — a Firefox fork with C++ engine-level anti-detection. Fingerprint spoofing happens in the browser binary itself, not via JS patches that DataDome can detect.
- Pre-built image: ghcr.io/thephaseless/byparr (no custom build steps, just docker compose up).
- FlareSolverr-compatible API: POST /v1 with { cmd: "request.get", url } → JSON response with rendered HTML.
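The request shape above can be sketched with the Python standard library. The helper names are hypothetical. Reading the rendered HTML from solution.response is an assumption based on the FlareSolverr response format that Byparr is described as compatible with; it has not been checked against this deployment.

```python
import json
import urllib.request


def build_byparr_request(byparr_url: str, target_url: str) -> urllib.request.Request:
    """Prepare (but do not send) a POST /v1 request for Byparr."""
    body = json.dumps({"cmd": "request.get", "url": target_url}).encode("utf-8")
    return urllib.request.Request(
        f"{byparr_url.rstrip('/')}/v1",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_html(response_json: dict) -> str:
    """Return the rendered HTML, or an empty string if it is absent."""
    # Assumption: FlareSolverr-style payloads nest the page under solution.response.
    return response_json.get("solution", {}).get("response", "")
```

Sending the request would be a matter of passing it to urllib.request.urlopen with a generous timeout, since a stealth browser render can take many seconds.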
Why keep browserless alongside Byparr
- browserless remains as a Googlebot UA fallback for sites that respond to bot user-agents.
- Lower overhead than Byparr for simple headless Chrome tasks.
Three-workflow architecture (2026-04-22 redesign)
Replaced the monolithic archive-media workflow with three focused workflows. Each has a dedicated iOS Shortcut.
Workflow overview
| Workflow | Shortcut | Webhook path | Destination | Purpose |
|---|---|---|---|---|
| archive-media | "Archive" | /archive-media | ArchiveBox / Karakeep | Archival routing |
| save-to-read | "Read Later" | /save-to-read | Readeck | Read-later with depaywall |
| depaywall | "Depaywall" | /save-to-read (mode=depaywall) | Readeck | Retry next strategy |
Workflow 1: archive-media (simplified)
Trivial URL router — no depaywall, no content handlers, no profile learning.
Incoming URL (iOS Shortcut "Archive")
│
▼
Auth & Route
├── Validate auth (Bearer token)
├── Clean URL (strip tracking params, unwrap Google redirects)
└── Route: reddit / github / karakeep
│
▼
Route by Handler (Switch)
│
├── "reddit"
│ ├── Clean Reddit URL → Fetch .json → Parse
│ ├── ArchiveBox /api/v1/cli/add
│ └── Tags: reddit, r/{subreddit}, post|comment
│
├── "github"
│ ├── ArchiveBox /api/v1/cli/add
│ └── Tags: github
│
└── "karakeep" (everything else)
└── Karakeep POST /api/v1/bookmarks (plain URL)
IaC: services/n8n/workflows/archive-media.json
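A minimal sketch of the clean-and-route step, written in Python for illustration (the real logic lives in the n8n workflow). The tracking-parameter list, the handling of the Google redirect q parameter, and all function names are assumptions; the ADR only states that tracking params are stripped, Google redirects are unwrapped, and URLs are routed to reddit, github or karakeep.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumption: illustrative tracking parameters; the workflow's real list is not given.
TRACKING_PARAMS = {"fbclid", "gclid"}
TRACKING_PREFIXES = ("utm_",)


def host_matches(host: str, domain: str) -> bool:
    """True if host is the domain itself or one of its subdomains."""
    return host == domain or host.endswith("." + domain)


def clean_url(url: str) -> str:
    """Unwrap Google redirect links and drop tracking query parameters."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    query = parse_qsl(parts.query, keep_blank_values=True)
    if host_matches(host, "google.com") and parts.path == "/url":
        params = dict(query)
        target = params.get("q") or params.get("url")
        if target:
            return clean_url(target)
    kept = [
        (key, value)
        for key, value in query
        if key not in TRACKING_PARAMS and not key.startswith(TRACKING_PREFIXES)
    ]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), parts.fragment))


def route(url: str) -> str:
    """Pick the handler name used by the Switch node."""
    host = (urlsplit(url).hostname or "").lower()
    if host_matches(host, "reddit.com"):
        return "reddit"
    if host_matches(host, "github.com"):
        return "github"
    return "karakeep"
```

Matching on the parsed hostname rather than a substring avoids routing a URL such as notreddit.com to the Reddit handler.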
Workflow 2: save-to-read (normal mode)
Saves articles to Readeck. Known paywalled domains get the depaywall chain with content handlers. Unknown domains get plain URL bookmarks (Readeck fetches them).
Incoming URL (iOS Shortcut "Read Later")
│
▼
Auth & Route
├── Validate auth (Bearer token)
├── Clean URL
├── Match domain against knownPaywalled map
└── Also check staticData.learnedDomains
│
▼
Process & Upload
│
├── Has strategies (known paywalled domain)
│ │
│ ▼
│ For each strategy in chain:
│ ├── Execute: ladder/byparr/wayback/browserless
│ ├── isContentValid(html) — pre-upload check
│ ├── mergeMultiPage(html) — Ars Technica multi-page
│ ├── cleanContent(html) — embedded tweets + domain cleanup
│ ├── Upload to Readeck as pre-fetched HTML
│ │ Labels: source:{domain}, depaywalled-{strategy}
│ └── Success → break
│ │
│ (all failed)
│ ├── Plain URL to Readeck + depaywall-failed label
│ └── ntfy /readeck "all strategies failed"
│
└── No strategies (unknown domain)
└── Plain URL to Readeck (Readeck fetches it)
Label: source:{domain}
IaC: services/n8n/workflows/save-to-read.json
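The strategy loop in the diagram can be sketched as follows. This is an illustration, not the workflow's actual code: run_chain and the executors mapping are hypothetical names, and treating an exception from an executor as a failed strategy is an assumption.

```python
from typing import Callable, Dict, List, Optional, Tuple


def run_chain(
    url: str,
    chain: List[str],
    executors: Dict[str, Callable[[str], str]],
    is_valid: Callable[[str], bool],
) -> Tuple[Optional[str], Optional[str]]:
    """Try each strategy in order; return (strategy, html) for the first valid result."""
    for name in chain:
        try:
            html = executors[name](url)
        except Exception:
            # Assumption: a network or service error simply moves on to the next strategy.
            continue
        if is_valid(html):
            return name, html
    # All strategies failed: the caller falls back to a plain URL bookmark.
    return None, None
```

A caller would upload the returned HTML with a depaywalled-{strategy} label on success, or add the depaywall-failed label and send the ntfy message when both values are None.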
Workflow 3: depaywall (retry endpoint on save-to-read)
Same webhook as save-to-read, triggered with { mode: "depaywall", url, existingBookmarkId }. Tries one strategy per invocation — user inspects and reruns if needed.
Incoming URL (iOS Shortcut "Depaywall")
│
▼
Auth & Route (mode=depaywall)
├── Clean URL, resolve strategies (known map + learned)
└── If no strategies found: defaults to [ladder, byparr, wayback, browserless]
│
▼
Process & Upload (depaywall branch)
│
├── Read staticData.depaywall[url] → { tried: [...], lastAttempt }
│
├── DELETE existing Readeck bookmark (existingBookmarkId)
│
├── Pick next untried strategy
│ ├── Execute strategy
│ ├── isContentValid + cleanContent
│ └── Upload pre-fetched HTML to Readeck
│ Labels: source:{domain}, depaywalled-{strategy}
│
├── Success → ntfy "check article quality"
│
├── Strategy failed → ntfy "{strategy} failed, N remaining"
│ └── Add plain bookmark with depaywall-pending label
│
└── All exhausted → ntfy "all strategies exhausted"
├── Add plain bookmark with depaywall-failed label
└── Clean up staticData entry
State tracking: staticData.depaywall[url] = { tried: ['ladder', ...], lastAttempt: ISO date }
Stale entries are meant to be cleaned up after 7 days (TODO, not yet implemented).
Learned domains: When depaywall succeeds on an unknown domain, staticData.learnedDomains[domain] = [winningStrategy] is saved. Future save-to-read calls for that domain auto-apply the learned strategy.
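The state handling above can be sketched with a plain dict standing in for n8n's staticData. Function names are hypothetical, and recording a strategy as tried before it runs (rather than after) is an assumption about ordering that the ADR does not specify.

```python
from datetime import datetime, timezone
from typing import List, Optional, Set

DEFAULT_CHAIN = ["ladder", "byparr", "wayback", "browserless"]


def next_strategy(static_data: dict, url: str, chain: Optional[List[str]] = None) -> Optional[str]:
    """Return the next untried strategy for url, or None once all are exhausted."""
    chain = chain or DEFAULT_CHAIN
    entries = static_data.setdefault("depaywall", {})
    entry = entries.setdefault(url, {"tried": [], "lastAttempt": None})
    for name in chain:
        if name not in entry["tried"]:
            # Assumption: the attempt is recorded up front, so a crash cannot cause a retry loop.
            entry["tried"].append(name)
            entry["lastAttempt"] = datetime.now(timezone.utc).isoformat()
            return name
    # All exhausted: clean up the state entry, as in the diagram.
    del entries[url]
    return None


def record_learned(static_data: dict, domain: str, strategy: str, known_domains: Set[str]) -> None:
    """Remember the winning strategy for a domain that is not in the hardcoded map."""
    if domain not in known_domains:
        static_data.setdefault("learnedDomains", {})[domain] = [strategy]
```

Each "Depaywall" shortcut run would call next_strategy once, which matches the one-strategy-per-invocation behaviour described above.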
Known paywalled domains
Hardcoded in save-to-read Auth & Route node. Domain → strategy chain.
| Domain group | Domains | Strategy chain |
|---|---|---|
| DataDome-protected | reuters.com, bloomberg.com | byparr → wayback → browserless |
| Soft paywall | wsj.com, nytimes.com, washingtonpost.com, ft.com, economist.com, theathletic.com | ladder → byparr → wayback |
| Condé Nast | newyorker.com, wired.com, vanityfair.com, gq.com, bonappetit.com, architecturaldigest.com, cntraveler.com | ladder → byparr [→ wayback for some] |
| UK news | thetimes.co.uk, telegraph.co.uk | ladder → byparr → wayback |
| Business | businessinsider.com, insider.com, hbr.org, seekingalpha.com, barrons.com | ladder → byparr → wayback |
| Tech/other | arstechnica.com, theatlantic.com, medium.com | ladder → byparr |
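One way the domain lookup could work is sketched below, using a three-entry excerpt of the table. Matching subdomains by walking up the hostname labels and checking the hardcoded map before learned domains are both assumptions; the ADR only says the domain is matched against the knownPaywalled map and staticData.learnedDomains.

```python
from typing import Dict, List
from urllib.parse import urlsplit

# Excerpt of the table above, for illustration only.
KNOWN_PAYWALLED: Dict[str, List[str]] = {
    "reuters.com": ["byparr", "wayback", "browserless"],
    "nytimes.com": ["ladder", "byparr", "wayback"],
    "arstechnica.com": ["ladder", "byparr"],
}


def resolve_strategies(url: str, learned: Dict[str, List[str]]) -> List[str]:
    """Return the strategy chain for a URL, or an empty list for unknown domains."""
    host = (urlsplit(url).hostname or "").lower()
    labels = host.split(".")
    # Try the full host first, then each parent domain (www.reuters.com, then reuters.com).
    for start in range(len(labels) - 1):
        candidate = ".".join(labels[start:])
        if candidate in KNOWN_PAYWALLED:
            return list(KNOWN_PAYWALLED[candidate])
        if candidate in learned:
            return list(learned[candidate])
    return []
```

An empty result corresponds to the "No strategies (unknown domain)" branch, where the plain URL is handed to Readeck.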
Strategy executors
| Strategy | Endpoint | Technique | Best for |
|---|---|---|---|
| ladder | GET LADDER_URL/raw/{url} | Googlebot UA + YAML rulesets | Soft paywalls (80% of sites) |
| byparr | POST BYPARR_URL/v1 | Camoufox stealth browser | DataDome, Cloudflare |
| wayback | Wayback Machine SPN2 API | Submit → poll → fetch snapshot | Hard paywalls, archival |
| browserless | POST BROWSERLESS_URL/content | Headless Chrome, Googlebot UA | Last resort fallback |
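The wayback strategy's submit → poll → fetch flow needs a bounded polling step. The sketch below shows only that step, with the status check injected as a callable so that no specific HTTP call is implied. The status values "pending", "success" and "error" are assumptions about the SPN2 status response, and poll_job is a hypothetical name.

```python
import time
from typing import Callable, Optional


def poll_job(
    check_status: Callable[[], dict],
    interval_s: float = 5.0,
    max_attempts: int = 12,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[dict]:
    """Poll until the capture job succeeds; return None on error or timeout."""
    for _ in range(max_attempts):
        status = check_status()
        state = status.get("status")
        if state == "success":
            return status
        if state == "error":
            return None
        # Still pending: wait before asking again.
        sleep(interval_s)
    return None
```

Capping the attempts keeps a slow or stuck capture from blocking the chain, so the workflow can fall through to the next strategy.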
Pre-upload content validation (isContentValid)
- Strip Wayback Machine toolbar/wrapper HTML before analysis
- Reject if < 1000 bytes or < 200 words
- Challenge signals: DataDome, captcha-delivery, verification required, etc.
- Paywall signals: subscribe prompts, login walls, robot checks (requires 2+ hits)
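A minimal sketch of these checks, assuming a single challenge signal is enough to reject while paywall signals need two hits. The signal phrases beyond those listed above, the Wayback toolbar comment markers, and the naive tag stripping used for word counting are all assumptions made for illustration.

```python
import re

CHALLENGE_SIGNALS = ("datadome", "captcha-delivery", "verification required")
# Assumption: illustrative paywall phrases; the workflow's real list is not given.
PAYWALL_SIGNALS = ("subscribe to continue", "sign in to continue", "are you a robot")

# Assumption: the Wayback toolbar is delimited by these HTML comments.
WAYBACK_TOOLBAR = re.compile(
    r"<!-- BEGIN WAYBACK TOOLBAR INSERT -->.*?<!-- END WAYBACK TOOLBAR INSERT -->",
    re.DOTALL,
)


def is_content_valid(html: str) -> bool:
    """Return True if html looks like a real article rather than a block page."""
    html = WAYBACK_TOOLBAR.sub("", html)
    if len(html.encode("utf-8")) < 1000:
        return False
    lowered = html.lower()
    if any(signal in lowered for signal in CHALLENGE_SIGNALS):
        return False
    # Naive tag stripping; script and style text still counts as words here.
    text = re.sub(r"<[^>]+>", " ", html)
    if len(text.split()) < 200:
        return False
    paywall_hits = sum(1 for signal in PAYWALL_SIGNALS if signal in lowered)
    return paywall_hits < 2
```

Requiring two paywall hits reduces false rejections for full articles that still carry a single "subscribe" banner.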
Content handlers
Run after strategy fetch + validation, before upload to Readeck. Pure string-replace, no DOM parser.
Execution order:
1. mergeMultiPage(html, url, domain) — async, fetches additional pages
2. cleanEmbeddedTweets(html) — universal tweet cleanup (runs inside cleanContent)
3. Domain-specific handler — per-site DOM cleanup
Universal handlers:
| Handler | Cleanup |
|---|---|
| Embedded tweets | Style <blockquote class="twitter-tweet"> with inline CSS, remove widgets.js, replace tweet iframes with links |
| Multi-page merger | Ars Technica: detect nav.page-numbers, fetch additional pages via Ladder, merge into single document |
Domain-specific handlers:
| Domain | Cleanup |
|---|---|
| wikipedia.org | Remove edit sections, footnote refs, infobox sidebar |
| github.com | Extract <article> only, clean "GitHub - org/" from title |
| arstechnica.com | Multi-page merge + remove page navigation |
| medium.com | Extract largest image from <picture> srcset |
| wired.com | Remove GenericCallout, ad slots, most popular |
| theatlantic.com | Remove audio player, related content |
| substack.com | Remove newsletter chrome (preserves tweets via negative lookahead) |
| stackoverflow.com | Remove hot network questions, sidebar |
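The handler pipeline (universal tweet cleanup first, then one domain-specific handler) might be structured as below. Only two cleanups are shown: removing the widgets.js script and stripping Wikipedia footnote references. The regexes and the assumed markup (a sup element with class="reference") are illustrative guesses, not the workflow's actual patterns.

```python
import re
from typing import Callable, Dict

WIDGETS_JS = re.compile(
    r"<script[^>]*platform\.twitter\.com/widgets\.js[^>]*>\s*</script>",
    re.IGNORECASE,
)
# Assumption: Wikipedia footnote markers are <sup ... class="reference">...</sup>.
WIKI_FOOTNOTE = re.compile(r'<sup[^>]*class="reference"[^>]*>.*?</sup>', re.DOTALL)


def clean_embedded_tweets(html: str) -> str:
    """Universal step: drop the Twitter widgets.js loader script."""
    return WIDGETS_JS.sub("", html)


def clean_wikipedia(html: str) -> str:
    """Domain step: strip footnote reference markers."""
    return WIKI_FOOTNOTE.sub("", html)


DOMAIN_HANDLERS: Dict[str, Callable[[str], str]] = {"wikipedia.org": clean_wikipedia}


def clean_content(html: str, domain: str) -> str:
    """Run the universal cleanup, then the handler matching the domain (if any)."""
    html = clean_embedded_tweets(html)
    for suffix, handler in DOMAIN_HANDLERS.items():
        if domain == suffix or domain.endswith("." + suffix):
            return handler(html)
    return html
```

Keeping handlers in a dict keyed by domain means a new site can be supported by adding one function and one entry.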
What was removed from archive-media
| Feature | Status |
|---|---|
| Depaywall chain + strategy executors | Moved to save-to-read |
| Content handlers (cleanContent, etc.) | Moved to save-to-read |
| isContentValid pre-upload check | Moved to save-to-read |
| ntfy notifications | Moved to save-to-read (topic: /readeck) |
| Profile learning + Forgejo sync | Dropped entirely |
| Ollama vision screenshot analysis | Dropped entirely |
| staticData URL profiles | Replaced by hardcoded knownPaywalled map + learnedDomains |
iOS Shortcuts
| Shortcut | Action | Webhook payload |
|---|---|---|
| "Archive" | Save to ArchiveBox or Karakeep | POST /archive-media { url, title? } |
| "Read Later" | Save to Readeck (depaywall if known) | POST /save-to-read { url, title? } |
| "Depaywall" | Retry depaywalling in Readeck | POST /save-to-read { url, mode: "depaywall", existingBookmarkId } |
Ruleset strategy by bot protection type
| Protection | Technique | Service |
|---|---|---|
| None / soft paywall | Googlebot UA + referer | Ladder |
| Cookie-based | Custom cookies in headers | Ladder |
| JavaScript overlay | HTML injection to remove | Ladder |
| Cloudflare | Camoufox engine-level stealth | Byparr |
| DataDome (Reuters, Bloomberg) | Camoufox engine-level stealth | Byparr |
| Hard paywall (no bypass) | Wayback Machine archive | Wayback SPN2 API |
Resources
| Component | CPU | RAM | Image |
|---|---|---|---|
| Ladder | 0.5 core | 128M | ghcr.io/everywall/ladder:latest |
| Byparr | 1.5 cores | 2G | ghcr.io/thephaseless/byparr:latest |
| browserless | 1 core | 2G | ghcr.io/browserless/chromium:latest |
| Total LXC 129 | 3 cores | 4G | |
External dependencies
Mirrored to Forgejo mirrors org for resilience:
- mirrors/periscope: reynaldichernando/periscope (UI reference, not deployed)
- mirrors/ladder: everywall/ladder (deployed)
- mirrors/ladder-rules: everywall/ladder-rules (reference rulesets)
Wayback Machine SPN2 API keys stored in Vault at secret/wayback.
Readeck (read-it-later)
- LXC 130 on apps-pool, IP 192.168.1.130
- Domain: readeck.eva-00.network
- Auth: oauth2-proxy on LXC 119 (forwarded auth headers → Readeck auto-provisions users)
- API: API tokens (web UI only at /profile/tokens, no programmatic creation). Pre-fetched HTML upload via POST /api/bookmarks (multipart/form-data).
- Vault: secret/readeck → secret_key (deployment), api_token (created manually post-deploy)
- Role: Clean reading experience (highlights, annotations, EPUB export). Karakeep remains for bookmarking/archival.
Consequences
- LXC count increases by 1 (29 → 30 containers on chizuru)
- ~4GB additional RAM usage on Proxmox host
- Ladder ruleset must be maintained as sites change their protection
- Byparr/Camoufox stealth may lag behind DataDome updates (cat-and-mouse game); manual intervention may still be needed for some sites
- Wayback Machine SPN2 API has rate limits (free tier); not suitable for high-volume archival
- Three workflows to maintain instead of one, but each is focused and independently debuggable
- Readeck API token must be created manually after first deploy (no programmatic creation)