
RFC: Bookmark & Archival Stack (Replacing Raindrop + Readwise)

Status: Draft
Date: 2026-04-16

Problem

Currently using Karakeep for bookmarking, but need a complete workflow for:

  • Archiving web pages with reliable offline copies
  • Highlighting and annotating saved content
  • Read-later with iOS support
  • Archiving Reddit posts automatically
  • Archiving paywalled articles (news sites)
  • Replacing Raindrop.io and Readwise (paid, iOS)

Known Issues

  • Google search bookmarks broken: Google wraps links in redirect URLs (google.com/url?sa=t&url=...). Karakeep crawls the redirect, which fails.
  • archive.is source attribution: Bookmarking an archive.is URL shows "archive.is" as the source instead of the original publisher.
  • No deep archival: Karakeep uses monolith for HTML snapshots, but doesn't produce screenshot or media files.

Decision: Keep Karakeep

Karakeep is the right tool. It is already deployed, integrates with local Ollama for AI tagging, and has a native iOS app with offline reading, highlighting (4 colors + notes), RSS auto-hoarding, and a full REST API. Main gaps: no deep archival formats and no spaced repetition (Readwise's core feature).

Karakeep Rules Engine (Native Auto-Tagging)

Karakeep has a built-in rules engine at Settings > Rules with IFTTT-style automation:

Events (triggers): bookmarkAdded, tagAdded, tagRemoved, addedToList, removedFromList, favourited, archived

Conditions (filters):

  • urlContains / urlDoesNotContain
  • titleContains / titleDoesNotContain
  • bookmarkTypeIs (link / text / asset)
  • bookmarkSourceIs (api / web / cli / mobile / extension / singlefile / rss / import)
  • importedFromFeed (specific RSS feed)
  • hasTag, isFavourited, isArchived, alwaysTrue
  • Composite: and / or (nestable up to 10 levels)

Actions: addTag, removeTag, addToList, removeFromList, downloadFullPageArchive, favouriteBookmark, archiveBookmark

Tagging via API: Tags cannot be included in the create-bookmark call. Two-step: create bookmark, then POST /bookmarks/{id}/tags.
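
The two-step flow can be sketched as a pair of request builders. This is a minimal sketch, not Karakeep's documented client: the /api/v1 prefix is borrowed from the SingleFile endpoint path used elsewhere in this RFC, and the {"tags": [{"tagName": ...}]} payload shape and Bearer header are assumptions to verify against the API docs. The helper names are hypothetical.

```python
import json
from urllib import request

API_PREFIX = "/api/v1"  # assumption: same prefix as the singlefile endpoint


def _json_post(base_url: str, api_key: str, path: str, payload: dict) -> request.Request:
    """Build (but do not send) an authenticated JSON POST request."""
    return request.Request(
        url=f"{base_url.rstrip('/')}{API_PREFIX}{path}",
        data=json.dumps(payload).encode("utf-8"),
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",  # assumed auth scheme
            "Content-Type": "application/json",
        },
    )


def create_bookmark_request(base_url: str, api_key: str, url: str) -> request.Request:
    # Step 1: create the link bookmark. Tags cannot be included here.
    return _json_post(base_url, api_key, "/bookmarks", {"type": "link", "url": url})


def attach_tags_request(base_url: str, api_key: str, bookmark_id: str, tags: list[str]) -> request.Request:
    # Step 2: attach tags to the bookmark ID returned by step 1.
    payload = {"tags": [{"tagName": tag} for tag in tags]}  # assumed payload shape
    return _json_post(base_url, api_key, f"/bookmarks/{bookmark_id}/tags", payload)
```

Each request would be sent with request.urlopen(req); the bookmark ID for step 2 comes from the JSON body of the step 1 response.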

Bookmark types: link, text, asset (image/pdf). No sub-types. Use tags for finer classification (e.g., source:reddit, type:article).

Duplicate handling: Creating a link with an existing URL returns HTTP 200 with the existing bookmark (upsert behavior).


Tool Comparison

Karakeep (current -- keeping)

Feature | Status
--- | ---
Offline archive | monolith (single HTML)
Highlighting | 4 colors + notes (v0.31.0)
iOS app | Native, with offline reading
AI tagging | Yes (OpenAI/Ollama)
API | Full REST API
RSS auto-hoarding | Yes
Rules engine | IFTTT-style URL/title/type conditions
Full-page snapshot | HTML only (no PDF/screenshot/WARC)

Linkwarden (runner-up)

Better archival (screenshot+PDF+SingleFile per page). Native iOS app (Dec 2025). AI tagging (OpenAI/Anthropic). Dual-license (AGPL self-hosted, paid cloud). Worth considering only if Karakeep's archival proves insufficient.

Wallabag

Article text only. No visual snapshots. iOS app missing highlighting. Dealbreaker.

Omnivore

Dead project. Acqui-hired by ElevenLabs (Nov 2024). Not recommended.

Shiori

Too basic. No highlighting, no iOS app, no AI. Skip.


Content Type | Primary Format | Secondary Format | Key Tool
--- | --- | --- | ---
Web articles | SingleFile HTML | Screenshot PNG | single-file-cli (headless)
Reddit posts | .json endpoint | SingleFile on old.reddit.com | curl/wget + SingleFile
Paywalled articles | SingleFile (while authenticated) | Readability text extract | SingleFile CLI with cookies
Twitter/X | gallery-dl JSON + media | SingleFile for individual tweets | gallery-dl
Mastodon | API JSON export | SingleFile | Mastodon API
Entire websites | WARC/WACZ | SingleFile per page | Browsertrix Crawler / ArchiveBox

Format Comparison

Format | Fidelity | Size | Searchable | Future-Proof | Easy to View
--- | --- | --- | --- | --- | ---
SingleFile HTML | Excellent | 1-5MB | Yes (grep/browser) | Excellent | Any browser
Monolith HTML | Good (no JS) | 1-5MB | Yes | Excellent | Any browser
WARC | Excellent | 10-50MB | Only with tools | ISO standard | Needs replay tool
PDF | Lossy (layout breaks) | Small-Med | Yes (text layer) | Excellent | Universal
MHTML | Good | Medium | Partial | Dead format | Chrome only

Key insight: monolith (what Karakeep uses) does NOT execute JavaScript, so it misses dynamically loaded content. SingleFile runs a real browser engine and captures JS-rendered content faithfully. SingleFile is strictly better for archival fidelity.

WARC: Not Needed

Overkill for personal use. Can't be viewed directly. Skip unless archiving entire websites.


Reddit-Specific Archiving

The .json API

Append .json to any Reddit URL. Works on www.reddit.com, old.reddit.com, bare reddit.com.

https://old.reddit.com/r/DataHoarder/comments/abc123/title.json?limit=500&sort=top

Response structure: JSON array of two Listing objects:

  • [0] -- post data at [0].data.children[0].data
  • [1] -- comment tree (kind t1 nodes)

Key post fields:

Field | Description
--- | ---
subreddit | Subreddit name (no r/ prefix)
title | Post title
author | Author username
selftext / selftext_html | Text body (text posts)
url | Link target or full URL
score / upvote_ratio | Votes
num_comments | Total comment count (even if not all returned)
is_self | true = text post
is_video | true = Reddit-hosted video
is_gallery | true = gallery post
post_hint | "image", "hosted:video", "rich:video", "link", "self"
media.reddit_video | Video metadata (fallback_url, duration, height)
media_metadata | Gallery image metadata (keyed by media_id)
gallery_data | Gallery ordering and captions
crosspost_parent_list | Parent post data if crosspost
domain | Source domain for link posts

Detecting Media Type

Text-only:       is_self == true
Image:            post_hint == "image" OR i.redd.it URL
Gallery:          is_gallery == true, media_metadata present
Reddit video:     is_video == true, post_hint == "hosted:video"
External video:   post_hint == "rich:video" (YouTube etc.)
Link/article:     post_hint == "link"
Crosspost:        crosspost_parent present
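
The detection rules above can be sketched as a single classifier over the post dict found at [0].data.children[0].data. The function name, the returned labels, and the order in which the checks run (crosspost first, then gallery, video, image, text, link) are assumptions made for illustration; the source does not specify precedence when several conditions hold.

```python
def classify_post(post: dict) -> str:
    """Return a coarse media type for a Reddit post dict."""
    hint = post.get("post_hint")
    url = post.get("url") or ""

    if post.get("crosspost_parent"):
        return "crosspost"
    if post.get("is_gallery") and post.get("media_metadata"):
        return "gallery"
    if post.get("is_video") or hint == "hosted:video":
        return "reddit_video"
    if hint == "rich:video":
        return "external_video"
    if hint == "image" or "i.redd.it/" in url:
        return "image"
    if post.get("is_self"):
        return "text"
    if hint == "link":
        return "link"
    return "unknown"  # no rule matched; let the generic pipeline handle it
```

Returning "unknown" instead of raising keeps the workflow running for post shapes the table does not cover.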

Comment Limits

  • ?limit=500 returns up to 500 comments (default 200, maximum 500)
  • ?depth=10 controls tree depth (max 10)
  • ?sort=top|confidence|new|controversial|old|qa
  • num_comments on post gives true total regardless of returned count
  • Truncated branches show as "more" objects (kind "more") with data.count and data.children (IDs to expand)
  • Expand via POST /api/morechildren with link_id=t3_{post_id}&children={comma-separated IDs}
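
To see how much of a thread was actually returned, the comment Listing ([1] in the response) can be walked for t1 nodes and "more" stubs. This is a minimal sketch: the function names are hypothetical, and it assumes replies is either a nested Listing dict or an empty string, which is how the .json endpoint usually represents "no replies".

```python
def scan_comment_tree(listing: dict) -> tuple[int, list[str]]:
    """Count returned t1 comments and collect IDs hidden behind "more" stubs."""
    returned = 0
    hidden_ids: list[str] = []
    stack = [listing]
    while stack:
        node = stack.pop()
        for child in node.get("data", {}).get("children", []):
            kind = child.get("kind")
            if kind == "t1":
                returned += 1
                replies = child["data"].get("replies")
                if isinstance(replies, dict):  # empty replies arrive as ""
                    stack.append(replies)
            elif kind == "more":
                hidden_ids.extend(child["data"].get("children", []))
    return returned, hidden_ids


def morechildren_params(post_id: str, comment_ids: list[str]) -> dict:
    """Form fields for POST /api/morechildren, as described above."""
    return {"link_id": f"t3_{post_id}", "children": ",".join(comment_ids)}
```

Comparing the returned count with num_comments gives the figure for the "top N archived" note without expanding anything.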

Reddit Media Downloads

Media type | URL source | Download tool
--- | --- | ---
Images (i.redd.it) | url field | Direct download (curl/wget)
Galleries | media_metadata[id].s.u (HTML-encoded; decode &amp; to &) | gallery-dl
Reddit video (v.redd.it) | media.reddit_video.fallback_url (video only, no audio) | yt-dlp (auto-muxes audio)
Reddit video audio | v.redd.it/{id}/DASH_audio.mp4 (separate) | yt-dlp handles this
External video | url field | yt-dlp

Important: Reddit video has audio in a separate file. yt-dlp handles muxing automatically (requires ffmpeg). gallery-dl does NOT download videos; use yt-dlp for those.

URL Parsing

import re
pattern = r'(?:https?://)?(?:\w+\.)?reddit\.com/r/(\w+)/comments/(\w+)(?:/\w+)?(?:/(\w+))?'
# Group 1: subreddit, Group 2: post_id, Group 3: comment_id (if comment permalink)

Other URL forms: redd.it/{id} (short link, redirects), reddit.com/gallery/{id}, v.redd.it/{id}, i.redd.it/{filename}

Rate Limits

  • Unauthenticated: 10 req/min (by IP)
  • OAuth: 100 req/min
  • Must send proper User-Agent or requests blocked

Reddit Shadow DOM Problem (New Reddit)

Critical finding: New Reddit uses Shreddit, a framework built on Lit (Google's web component library) with Shadow DOM for every component. This is the #1 archiving problem:

Aspect | Old Reddit (old.reddit.com) | New Reddit (reddit.com)
--- | --- | ---
Rendering | Server-rendered HTML + jQuery | Lit web components + Shadow DOM
JS required | No (content in raw HTML) | Yes (nothing renders without JS)
SingleFile size | ~700KB-2MB | 20-30MB (Shadow DOM stylesheet duplication)
SingleFile stability | Reliable | Can crash browser (memory exhaustion)
monolith | Works (content in HTML) | Captures nothing (no JS engine)
Galleries/albums | Static HTML images | JS carousels (only first image captured)
Comment structure | Plain HTML <div> nesting | Shadow DOM components

Why it matters: SingleFile duplicates shared stylesheets inside each Shadow DOM component. A thread with many comments = many Shadow DOM components = massive file bloat and potential OOM crashes. (SingleFile issue #1514)

Workaround for SingleFile: Enable "Stylesheets > group duplicate stylesheets together" or use ZIP output format (auto-deduplicates).

The rule: ALWAYS convert to old.reddit.com before archiving. This is non-negotiable.
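
A minimal sketch of that conversion, using only the standard library. The function name is hypothetical; the query parameters mirror the ones used elsewhere in this RFC (?limit=500&depth=10 for the HTML page, ?limit=500&sort=top for the .json endpoint). Short links such as redd.it/{id} are not handled here and would need to be resolved first.

```python
from urllib.parse import urlencode, urlsplit, urlunsplit


def to_old_reddit(url: str, as_json: bool = False) -> str:
    """Rewrite any reddit.com post URL to its old.reddit.com equivalent."""
    parts = urlsplit(url if "://" in url else f"https://{url}")
    host = parts.netloc.lower()
    if host != "reddit.com" and not host.endswith(".reddit.com"):
        raise ValueError(f"not a reddit.com URL: {url}")

    path = parts.path.rstrip("/")
    if as_json:
        path += ".json"
        query = urlencode({"limit": 500, "sort": "top"})
    else:
        query = urlencode({"limit": 500, "depth": 10})

    # The original query string (share/tracking params) is dropped on purpose.
    return urlunsplit(("https", "old.reddit.com", path, query, ""))
```

Raising on non-Reddit hosts keeps the helper from silently rewriting unrelated URLs if it is called on the wrong branch of the workflow.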

Reddit Archiving Strategy (Multi-Layer)

For maximum fidelity, use multiple layers:

Layer 1: Data (structured, complete)
    .json endpoint -> full post + top 500 comments + metadata
    Stored as JSON file alongside bookmark

Layer 2: Visual (how it looked)
    SingleFile on old.reddit.com (with comments expanded)
    Stored in Karakeep + ArchiveBox

Layer 3: Media (full resolution)
    gallery-dl -> images + galleries
    yt-dlp -> Reddit-hosted video (auto-muxes audio) + external video
    Stored in ArchiveBox

Layer 4: Screenshot (fallback)
    ArchiveBox screenshot extractor -> full-page PNG
    Insurance when SingleFile fails

Comment expansion before archiving:

  • Old Reddit with ?limit=500&depth=10 renders up to 500 comments server-side
  • For automated workflow: this is sufficient (comments are in the HTML)
  • For manual high-fidelity: use "Expand Everything" userscript before SingleFile save

Gallery limitation: SingleFile cannot archive Reddit galleries/albums (JS carousels). Only the first image is captured. The n8n workflow must extract all gallery image URLs from .json response (media_metadata[id].s.u) and archive them separately via gallery-dl or direct download.
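
The extraction step might look like the sketch below. It assumes gallery_data.items is a list of objects carrying a media_id (used here for display order) and that still images expose their full-size URL at media_metadata[id].s.u; entries without a u key (for example animated items) are skipped. The function name is hypothetical.

```python
import html


def gallery_image_urls(post: dict) -> list[str]:
    """Return full-size gallery image URLs in display order, HTML-decoded."""
    metadata = post.get("media_metadata") or {}
    items = (post.get("gallery_data") or {}).get("items", [])
    # Prefer gallery_data ordering; fall back to the metadata key order.
    ordered_ids = [item["media_id"] for item in items] or list(metadata)

    urls = []
    for media_id in ordered_ids:
        source = metadata.get(media_id, {}).get("s", {})
        if "u" in source:
            urls.append(html.unescape(source["u"]))  # "&amp;" -> "&"
    return urls
```

Each returned URL can then be handed to gallery-dl or a direct download.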

Reddit Embeds Within Posts

Embed Type | SingleFile (old.reddit) | .json API | Separate Tool
--- | --- | --- | ---
Twitter/X posts | Fallback blockquote text only | URL only | yt-dlp (video), manual (text)
YouTube videos | Thumbnail/poster only | URL only | yt-dlp
Imgur albums | First image only | URL only | gallery-dl
Reddit crossposts | Partial (text) | Full parent data | N/A

No tool handles all embeds perfectly. The n8n workflow should:

  1. Parse .json for external URLs (url, url_overridden_by_dest, domain)
  2. If YouTube: add to ArchiveBox (yt-dlp captures the video)
  3. If Imgur: run gallery-dl
  4. If Twitter: best-effort (text in blockquote, video via yt-dlp if applicable)


Archiving Articles with Embedded Content

The Problem

News articles commonly embed tweets, YouTube videos, Instagram posts, and interactive content. These are loaded via JavaScript widgets and iframes from third-party domains. Most archiving tools handle them poorly.

Embed Capture by Tool

Content Type | SingleFile (extension) | SingleFile (CLI) | monolith | ArchiveBox
--- | --- | --- | --- | ---
Tweet text + images | Yes (rendered iframe) | Yes (with --browser-wait-delay 10000) | Fallback blockquote only | Via SingleFile extractor
Tweet videos | Poster frame only | Poster frame only | No | No (need yt-dlp separately)
YouTube player UI | Static snapshot | Static snapshot | No | Via SingleFile
YouTube video file | No | No | No | Only if yt-dlp detects embed
Instagram embed | Broken since 2025 (auth required) | Broken | No | Broken
TikTok embed | Partial (widget UI) | Partial | No | Partial
Spotify embed | Player UI (no audio) | Player UI | No | Player UI
Interactive charts | Static SVG/Canvas snapshot | Static snapshot | No | Screenshot only

Key Findings

Tweets: When widgets.js executes, it replaces <blockquote class="twitter-tweet"> elements with rendered iframes. SingleFile (extension) captures the rendered iframe content including profile pic, formatted text, and static images. If a tweet is deleted after archiving, the embedded version is preserved in your SingleFile HTML (no external dependency). monolith only gets the fallback blockquote text.

YouTube: Embeds are iframes (youtube.com/embed/VIDEO_ID). SingleFile captures the player UI (thumbnail, title) but NOT the video stream. ArchiveBox's yt-dlp may detect embedded YouTube players via its generic extractor, but it's not guaranteed. Reliable approach: extract YouTube URLs from HTML and add to ArchiveBox separately.

Instagram: Meta removed unauthenticated oEmbed in April 2025. All Instagram embeds are now broken without auth. Archives made before this date are fine. New archives will show empty containers.

SingleFile CLI Settings for Embeds

single-file \
  --browser-wait-delay=10000 \
  --browser-wait-until=networkidle0 \
  --browser-load-max-time=60000 \
  --browser-capture-max-time=60000 \
  https://article-with-embeds.com output.html
  • --browser-wait-delay=10000: Give 10s for tweet/YouTube widgets to render
  • --browser-wait-until=networkidle0: Wait until no network activity
  • --no-remove-frames: Must be set to preserve iframe content (embeds)

The n8n workflow should handle embeds as a post-archive extraction step:

1. Archive article with SingleFile (with wait delays for embed rendering)
2. Parse the archived HTML for embedded URLs:
   - grep for youtube.com/embed/ -> extract video IDs
   - grep for twitter.com or x.com -> extract tweet URLs
   - grep for instagram.com -> flag as likely broken
3. For each embedded URL:
   - YouTube: POST to ArchiveBox (yt-dlp captures video)
   - Twitter video: POST to ArchiveBox (yt-dlp)
   - Tweet text: already preserved in SingleFile (if rendered)
4. Add note to Karakeep bookmark: "Embedded media: 2 YouTube videos, 1 tweet (archived separately)"

Extract YouTube from HTML:

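import re

# html_content: the archived page's HTML as a string
youtube_ids = re.findall(r'youtube\.com/embed/([^"?&]+)', html_content)
for vid_id in set(youtube_ids):  # set() drops repeated embeds of the same video
    watch_url = f"https://youtube.com/watch?v={vid_id}"
    # POST watch_url to ArchiveBox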
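
Step 4 of the post-archive flow (the note added to the Karakeep bookmark) could be built from the same kind of scan. This is a rough sketch: the function name is hypothetical, and the tweet pattern (a /status/{digits} URL on twitter.com or x.com) is an assumption about how embeds appear in the archived HTML.

```python
import re


def _count(n: int, noun: str) -> str:
    return f"{n} {noun}" + ("" if n == 1 else "s")


def summarize_embeds(html_content: str) -> str:
    """Build the bookmark note describing embedded media found in archived HTML."""
    youtube_ids = set(re.findall(r'youtube\.com/embed/([^"?&]+)', html_content))
    # Deduplicate by status ID so twitter.com and x.com links count once.
    tweet_ids = set(re.findall(r'\b(?:twitter|x)\.com/\w+/status/(\d+)', html_content))

    parts = []
    if youtube_ids:
        parts.append(_count(len(youtube_ids), "YouTube video"))
    if tweet_ids:
        parts.append(_count(len(tweet_ids), "tweet"))
    if not parts:
        return ""

    note = f"Embedded media: {', '.join(parts)} (archived separately)"
    if "instagram.com" in html_content:
        note += "; Instagram embeds likely broken"
    return note
```

An empty string means nothing was detected, so the workflow can skip the note update entirely.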


Recovering Deleted Reddit Posts

Detection via .json Endpoint

State | author | selftext/body | removed_by_category
--- | --- | --- | ---
Active | username | actual content | null
User-deleted | [deleted] | [deleted] | null
Mod-removed | [deleted] | [removed] | "moderator"
Admin-removed | [deleted] | [removed] | "admin" or "anti_evil_ops"
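
The table translates fairly directly into a check the workflow can run before deciding whether to try a recovery service. The function name and state labels are hypothetical; treating any "[deleted]" author with no removed_by_category as user-deleted is a simplifying assumption (it also covers link posts, whose selftext is empty).

```python
def removal_state(post: dict) -> str:
    """Classify a post or comment dict using the fields in the table above."""
    category = post.get("removed_by_category")
    if category in ("admin", "anti_evil_ops"):
        return "admin-removed"
    if category == "moderator":
        return "mod-removed"
    if post.get("author") == "[deleted]":
        return "user-deleted"
    return "active"
```

Anything other than "active" is a signal to query the recovery services listed in the next section.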

Recovery Services (Priority Order)

1. Arctic Shift (best coverage, 2005-2026):

  • API: https://arctic-shift.photon-reddit.com
  • GET /api/posts/ids?ids={id} -- up to 500 posts
  • GET /api/comments/ids?ids={id} -- up to 500 comments
  • GET /api/comments/tree?link_id={post_id} -- full comment tree (up to 25000)
  • GET /api/posts/search?subreddit=X&author=Y&title=Z
  • 2.5 billion items, no auth required, free
  • Web UI: https://arctic-shift.photon-reddit.com/search

2. PullPush.io (good for recent data):

  • API: https://api.pullpush.io
  • GET /reddit/search/submission/?ids={id}
  • GET /reddit/search/comment/?ids={id}
  • No auth required, free

3. Wayback Machine (pre-August 2025 only):

  • Reddit blocked Wayback Machine crawling in August 2025
  • Pre-existing snapshots still accessible
  • GET https://archive.org/wayback/available?url={reddit_url} -- check availability
  • Coverage was always spotty
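
The priority order can be expressed as an ordered list of lookup URLs that the workflow tries until one returns data. The sketch below only builds the URLs from the endpoints listed above; the function name is hypothetical and nothing is fetched.

```python
from urllib.parse import urlencode

ARCTIC_SHIFT = "https://arctic-shift.photon-reddit.com"
PULLPUSH = "https://api.pullpush.io"
WAYBACK = "https://archive.org/wayback/available"


def recovery_lookups(post_id: str, reddit_url: str) -> list[tuple[str, str]]:
    """Return (service, lookup URL) pairs in the priority order given above."""
    ids_query = urlencode({"ids": post_id})
    url_query = urlencode({"url": reddit_url})  # percent-encodes the Reddit URL
    return [
        ("arctic-shift", f"{ARCTIC_SHIFT}/api/posts/ids?{ids_query}"),
        ("pullpush", f"{PULLPUSH}/reddit/search/submission/?{ids_query}"),
        ("wayback", f"{WAYBACK}?{url_query}"),
    ]
```

Keeping the order in one place makes it easy to drop a service if it joins the dead list.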

Dead services: Pushshift (mod-only since 2023), Unddit (down), Reveddit (depends on Pushshift, broken for regular users), Google Cache (discontinued 2024).


SingleFile CLI (Headless, No Browser Extension)

single-file-cli runs completely headless via Deno + Chrome DevTools Protocol.

npm install -g single-file-cli
single-file https://example.com output.html
single-file --browser-cookies-file cookies.txt https://paywalled-site.com/article output.html
docker run capsulecode/singlefile https://example.com > output.html

Karakeep's Native SingleFile Integration

POST /api/v1/bookmarks/singlefile
  • Data field: file, URL field: url, add API key header
  • ?ifexists=overwrite updates existing bookmarks with better archive content
  • ?ifexists=skip / append / overwrite-recrawl / append-recrawl also available

ArchiveBox as Fire-and-Forget Vault

ArchiveBox is NOT part of the fallback chain. It runs on every bookmark in parallel as a separate long-term preservation vault.

Deployed Config (v0.7.3, LXC 128)

# Enabled extractors
SAVE_DOM=True           # Raw HTML DOM capture
SAVE_WGET=True          # Full recursive wget (also generates WARC)
SAVE_WARC=True          # WARC format (requires SAVE_WGET=True)
SAVE_SCREENSHOT=True    # Full-page PNG screenshot
SAVE_MEDIA=True         # Video/audio via yt-dlp
SAVE_READABILITY=True   # Clean article text extraction

# Disabled extractors
SAVE_SINGLEFILE=False   # Too resource-heavy, redundant with dom+wget
SAVE_PDF=False          # Redundant
SAVE_ARCHIVE_DOT_ORG=False
SAVE_HEADERS=False
SAVE_GIT=False

# Limits
MEDIA_MAX_SIZE=750m
TIMEOUT=120

API (v0.7.3 — via API wrapper sidecar)

v0.7.3 has no built-in REST API. A thin HTTP wrapper sidecar exposes the CLI:

  • POST /add -- archive a URL (body: {url, depth, tag, overwrite})
  • GET /health -- health check
  • Auth: Authorization: Bearer <token> header
  • Token stored in Vault at secret/archiveboxapi_key
  • Runs on port 8001 (same Docker network, shares /data volume)
  • n8n env: ARCHIVEBOX_URL=http://192.168.1.128:8001

Storage

  • SQLite DB: /opt/archivebox/data (local rootfs, 16GB disk)
  • Archive files: /mnt/archivebox/archive (urahara bind mount, bulk storage)
  • Estimated: ~7-10MB per Reddit post, ~70-100GB for 10,000 posts

archive.is: Optional Fallback, Not Primary

Fallback when Karakeep's monolith fails. Cannot bypass paywalls. No official API.

Python archiveis library:

import archiveis
archive_url = archiveis.capture("https://example.com/article")

Check existing: curl -sI "https://archive.today/newest/{url}"

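Limitations: rate limit of roughly one capture per URL per hour; does not work when resolved through Cloudflare DNS (1.1.1.1), so use a different resolver; requires a browser-like User-Agent; returns HTTP 503 on aggressive use.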


URL Cleaning (Google, Apple News, Redirects)

Google Search Result Redirects

https://www.google.com/url?sa=t&url=https%3A%2F%2Factual-site.com%2Farticle&ved=...

Solution: tidy-url (Node.js)

import { TidyURL } from 'tidy-url';
const result = TidyURL.clean(url);
console.log(result.url); // cleaned URL

Handles Google, Facebook, Twitter/X, Steam, AMP. Strips tracking params. Proton fork: @protontech/tidy-url.
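
For the Python side of the workflow (or as a fallback if tidy-url is unavailable), the Google redirect case alone can be unwrapped with the standard library. This is a sketch of that one case, not a replacement for tidy-url: the function name is hypothetical, and accepting q= as an alternative to url= is an assumption based on older Google redirect links.

```python
from urllib.parse import parse_qs, urlsplit


def unwrap_google_redirect(url: str) -> str:
    """Return the target of a google.*/url redirect, or the URL unchanged."""
    parts = urlsplit(url)
    host = parts.netloc.lower().removeprefix("www.")
    if not (host.startswith("google.") and parts.path == "/url"):
        return url

    params = parse_qs(parts.query)  # parse_qs also percent-decodes values
    for key in ("url", "q"):
        target = params.get(key, [""])[0]
        if target.startswith(("http://", "https://")):
            return target
    return url  # no usable target found; leave as-is
```

Returning the input unchanged on any miss means the helper is safe to run on every incoming URL.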

Google Search Page URLs

https://www.google.com/search?q=how+to+self+host+bookmarks&client=safari&...

Extract search query:

from urllib.parse import urlparse, parse_qs
query = parse_qs(urlparse(url).query).get('q', [''])[0]

For use case #3 (save a search), the n8n workflow extracts q= and uses it as the bookmark title.

Apple News URLs

https://apple.news/AbCdEfGhIjKl

Apple News URLs do NOT issue HTTP redirects. They serve an HTML page with a JavaScript-based redirect. curl -L will NOT work.

Resolution: fetch HTML + parse:

import requests, re
response = requests.get("https://apple.news/AbCdEfGhIjKl")
match = re.search(r'redirectToUrlAfterTimeout\("([^"]+)"', response.text)
real_url = match.group(1) if match else None

Limitation: Apple News+ exclusive articles have no public URL to resolve.

n8n URL Cleaning Node (Unified)

The n8n Code node handles all URL types in one place:

const { TidyURL } = require('tidy-url');

function cleanUrl(url) {
  // Apple News: fetch and parse JS redirect
  if (url.includes('apple.news/')) {
    // HTTP GET + regex parse redirectToUrlAfterTimeout
    // Return extracted URL
  }

  // Google Search page: extract query, keep as-is but set title
  if (url.match(/google\.\w+\/search\?/)) {
    const query = new URL(url).searchParams.get('q');
    return { url, title: query, type: 'search' };
  }

  // Everything else: tidy-url handles Google/Facebook/Twitter redirects
  const cleaned = TidyURL.clean(url);
  return { url: cleaned.url, title: null, type: 'link' };
}

Karakeep Rules (Auto-Tagging)

Set up in Settings > Rules:

Rule | Event | Condition | Action
--- | --- | --- | ---
Reddit posts | bookmarkAdded | urlContains("reddit.com") | addTag("reddit")
News articles | bookmarkAdded | urlContains("washingtonpost.com") OR urlContains("independent.co.uk") OR ... | addTag("news")
News (broad) | bookmarkAdded | n8n sets tag via API | addTag("news")
Fandom wikis | bookmarkAdded | urlContains("fandom.com") | addTag("fandom")
From RSS | bookmarkAdded | bookmarkSourceIs("rss") | addTag("rss-feed")
From iOS Shortcut | bookmarkAdded | bookmarkSourceIs("api") | (handled by n8n tagging)

Limitation: Karakeep rules can only match URL substrings, not regex. For complex tagging (e.g., extracting subreddit name, detecting news domains broadly), n8n must add tags via the API after bookmark creation.
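
The tags that need more than substring matching could be derived on the n8n side along these lines. This is an illustrative sketch: the function name is hypothetical, and the tag formats (r/{subreddit}, fandom:{wiki}) follow the use cases later in this RFC.

```python
import re
from urllib.parse import urlsplit


def derive_tags(url: str) -> list[str]:
    """Derive tags that Karakeep's substring rules cannot produce on their own."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    tags: list[str] = []

    if host == "reddit.com" or host.endswith(".reddit.com"):
        tags.append("reddit")
        match = re.match(r"/r/(\w+)", parts.path)
        if match:
            tags.append(f"r/{match.group(1)}")
    elif host.endswith(".fandom.com"):
        wiki = host[: -len(".fandom.com")]
        tags.append("fandom")
        if wiki != "www":  # www.fandom.com is the portal, not a wiki
            tags.append(f"fandom:{wiki}")
    return tags
```

The result feeds the second step of the two-step tagging call; re-adding a tag that a Karakeep rule already applied should be harmless.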


Tiered Archival Fallback Chain

Bookmark created in Karakeep
    --> Karakeep crawls with monolith (automatic)
    --> Webhook fires on "crawled" event
    --> n8n receives webhook
    --> n8n GET /api/v1/bookmarks/{id} to check content
        |
        Tier 1: htmlContent exists?
            --> YES: local archive succeeded, done
            |
        Tier 2: NO content --> archive.is fallback
            --> Check archive.today/newest/{url}
            --> Found? Grab the archive.is URL
            --> Not found? archiveis.capture(url)
            --> PATCH bookmark note with archive.is link
            --> Tag: "archive.is-fallback"
            |
        Tier 3: archive.is also fails
            --> Tag: "needs-manual-archive"
            --> ntfy notification: "Bookmark X needs manual SingleFile"
        |
        In parallel (every bookmark, regardless of tier):
            --> POST to ArchiveBox API (fire-and-forget)
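
The tier decision in the diagram can be sketched as one function with the archive.is calls injected, so it can be exercised without network access. The names are hypothetical, and reading the crawl result from content.htmlContent is an assumption about the shape of the GET /api/v1/bookmarks/{id} response.

```python
from typing import Callable, Optional

Lookup = Callable[[str], Optional[str]]


def run_fallback_chain(bookmark: dict, find_existing: Lookup, capture: Lookup) -> dict:
    """Decide the archival tier for a crawled bookmark.

    find_existing checks archive.today/newest/{url}; capture requests a new
    archive.is snapshot. Both return an archive URL, or None on failure.
    """
    content = bookmark.get("content") or {}
    if content.get("htmlContent"):
        return {"tier": 1, "tags": [], "note": None}  # local archive succeeded

    url = content.get("url") or bookmark.get("url") or ""
    archive_url = find_existing(url) or capture(url)
    if archive_url:
        return {"tier": 2, "tags": ["archive.is-fallback"], "note": archive_url}

    # Tier 3: the caller should also send the ntfy notification.
    return {"tier": 3, "tags": ["needs-manual-archive"], "note": None}
```

The ArchiveBox POST is deliberately absent: it runs in parallel for every bookmark and does not depend on the tier.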

iOS Shortcuts

Two separate shortcuts for different purposes:

Shortcut 1: "Archive Page"

Bookmarks + archives a page. Goes through full pipeline: URL cleaning, Karakeep bookmark, tagging, ArchiveBox vault, fallback chain.

1. [Receive] input from Share Sheet (URL)
2. [Get Contents of URL]
     URL: https://n8n.domain/webhook/archive-page
     Method: POST
     Headers: Content-Type: application/json, Authorization: Bearer <token>
     Body: {"url": "[Shortcut Input]"}
3. [Get Dictionary Value] "bookmarkId" from response
4. [Show Notification] "Archived!"

n8n "archive-page" webhook does:

  • tidy-url clean + Apple News resolve + Google Search extract
  • POST to Karakeep (bookmark + tags)
  • POST to ArchiveBox (fire-and-forget)
  • Return bookmarkId

Shortcut 2: "Archive Website"

Deep archive only. Sends straight to ArchiveBox, no Karakeep bookmark. For when you want to preserve an entire website/page for long-term storage without cluttering your reading list.

1. [Receive] input from Share Sheet (URL)
2. [Get Contents of URL]
     URL: https://n8n.domain/webhook/archive-website
     Method: POST
     Headers: Content-Type: application/json, Authorization: Bearer <token>
     Body: {"url": "[Shortcut Input]"}
3. [Show Notification] "Sent to ArchiveBox!"

n8n "archive-website" webhook does:

  • tidy-url clean (strip tracking params)
  • POST directly to the ArchiveBox API wrapper sidecar: POST /add (v0.7.3 has no built-in REST API)
  • ArchiveBox saves: SingleFile + screenshot + media + readability
  • No Karakeep, no tagging, no fallback chain
  • Returns: { "status": "queued" }

Use cases for "Archive Website":

  • Wiki pages you want preserved but don't need to read later
  • Documentation pages that might disappear
  • Forum threads / discussions for reference
  • Entire sites you want a snapshot of (ArchiveBox follows links with --depth=1)
  • Anything you want in the vault but not in your reading list

ArchiveBox depth crawling (for actual full-site archiving):

n8n webhook body option: {"url": "...", "depth": 1}
-> archivebox add --depth=1 <url>
--depth=1 follows all links on the page and archives those too. Use sparingly (can be hundreds of pages).


Use Cases (iOS-Focused)

UC1: Save a Reddit Post from iOS

Trigger: Scrolling Reddit app/Safari, share a post or comment.

What happens (multi-layer archiving):

  1. Share sheet -> "Archive Page" Shortcut -> POST to n8n webhook
  2. n8n detects reddit.com URL, parses it:
    pattern = r'reddit\.com/r/(\w+)/comments/(\w+)(?:/\w+)?(?:/(\w+))?'
    # Extracts: subreddit, post_id, comment_id (if comment)

  3. CRITICAL: n8n converts to old.reddit.com URL (new Reddit uses Shadow DOM, produces 20-30MB files and can crash archivers)
  4. Layer 1 (Data): n8n fetches .json?limit=500&sort=top:
      • Extracts subreddit, title, num_comments, post_hint, media fields
      • Saves full JSON response to storage (complete structured data)
      • If num_comments > 500: adds note "Post has {num_comments} comments, top 500 archived"
      • If comment permalink: fetches parent post JSON too
  5. n8n creates Karakeep bookmark with the old.reddit.com URL + ?limit=500&depth=10
  6. n8n adds tags via API:
      • reddit (source)
      • r/{subreddit} (specific sub)
      • Content type tag based on post_hint: image, video, article, discussion
      • If comment permalink: comment
  7. Layer 2 (Visual): Karakeep monolith crawls old.reddit.com page (comments visible in server-rendered HTML, no Shadow DOM issues)
  8. Layer 3 (Media): n8n handles media based on .json data:
      • Image (i.redd.it): Direct download URL from url field -> ArchiveBox
      • Gallery (is_gallery): n8n extracts ALL image URLs from media_metadata[id].s.u (HTML-decode &amp; to &), sends each to ArchiveBox. SingleFile/monolith only capture the first gallery image (JS carousel).
      • Video (v.redd.it): ArchiveBox yt-dlp auto-muxes video+audio. Direct URL (video only) at media.reddit_video.fallback_url
      • External link (post_hint == "link"): n8n also bookmarks the linked URL separately in Karakeep (gets its own archive pipeline)
      • Embedded YouTube (post_hint == "rich:video"): Extract URL, add to ArchiveBox for yt-dlp
      • Crosspost (crosspost_parent present): Extract parent post data from crosspost_parent_list[0], archive parent's media too
  9. Layer 4 (Screenshot): ArchiveBox screenshot extractor captures full-page PNG (fallback)
  10. Fallback chain runs (Tier 1/2/3) for the Karakeep crawl

Expected result in Karakeep:

  • Bookmark with title from Reddit post
  • Tags: reddit, r/selfhosted, article (or image/video/discussion)
  • Monolith archive of old.reddit.com page (up to 500 comments visible)
  • Note with: comment count info, embedded media summary
  • If external link: second bookmark for the linked article

Expected result in ArchiveBox:

  • SingleFile HTML of old.reddit.com page
  • Screenshot PNG
  • All gallery images (downloaded individually, not via SingleFile)
  • Video files (yt-dlp, muxed with audio)
  • JSON data file (complete structured data)

Gaps / considerations:

  • Reddit rate limits: 10 req/min unauthenticated. For occasional saves this is fine. Bulk saves need OAuth.
  • Gallery images: SingleFile/monolith CANNOT capture JS carousels. Must extract from .json media_metadata and download separately.
  • Comment expansion beyond 500: Not practical for automated workflow. num_comments tells you what you missed.
  • Reddit embeds (tweets, YouTube in comments): Only fallback text captured by SingleFile on old Reddit. External URLs extracted from .json and archived separately.


UC2: Read-It-Later Article from Safari

Trigger: Reading an article in Safari, want to save for later.

What happens:

  1. Share sheet -> "Archive Page" Shortcut -> POST to n8n webhook
  2. n8n cleans URL with tidy-url (strips tracking params)
  3. n8n creates Karakeep bookmark
  4. Karakeep rules engine fires:
      • urlContains("washingtonpost.com") -> addTag("news"), addTag("source:washington-post")
      • urlContains("independent.co.uk") -> addTag("news"), addTag("source:the-independent")
      • (one rule per known news domain)
  5. n8n also detects news domains from a maintained list and adds news tag via API (covers domains not in Karakeep rules)
  6. Karakeep monolith crawls the article
  7. Fallback chain runs
  8. ArchiveBox archives with SingleFile CLI (--browser-wait-delay=10000 --browser-wait-until=networkidle0) to give embedded tweets/YouTube time to render
  9. Embedded content extraction (post-archive step in n8n):
      • Parse archived HTML for youtube.com/embed/ -> extract video IDs, add to ArchiveBox (yt-dlp)
      • Parse for twitter.com / x.com embeds -> tweet text already captured by SingleFile if widgets.js rendered; video needs yt-dlp
      • Parse for instagram.com -> flag in note (broken since April 2025, no fix)
      • Add note: "Embedded media: N YouTube videos, N tweets (archived separately)"

Expected result:

  • Bookmark with article title and publisher metadata
  • Tags: news, source:washington-post (automatic)
  • AI tags from Ollama (topic-based)
  • Monolith archive for offline reading (embedded tweets as fallback blockquotes only)
  • ArchiveBox has: SingleFile with rendered embeds, screenshot, embedded YouTube videos (yt-dlp)
  • Available in Karakeep iOS app for offline reading + highlighting

Embedded content limitations:

  • monolith (Karakeep) does NOT execute JS -> embedded tweets show as <blockquote> fallback text, YouTube shows placeholder only
  • SingleFile (ArchiveBox) with wait delays captures rendered tweets with profile pic/text/images, but NOT tweet videos or YouTube video streams
  • Instagram embeds broken since Meta removed unauthenticated oEmbed (April 2025)
  • YouTube videos must be archived separately via yt-dlp

Karakeep rules to set up:

Rule: "News - Washington Post"
  Event: bookmarkAdded
  Condition: urlContains("washingtonpost.com")
  Actions: addTag("news"), addTag("source:washington-post")

Rule: "News - The Independent"
  Event: bookmarkAdded
  Condition: urlContains("independent.co.uk")
  Actions: addTag("news"), addTag("source:the-independent")

(repeat for each known news domain)

For unknown news domains: n8n maintains a broader list of news domains (or uses a heuristic like checking for <article> tags, og:type=article, or known news TLDs). Falls back to AI tagging.


UC3: Save a Google Search for Later

Trigger: Doing a Google search, want to save the search itself for later research.

What happens:

  1. Share the Google search URL from Safari
  2. Share sheet -> "Archive Page" Shortcut -> POST to n8n webhook
  3. n8n detects Google search URL (google.com/search?q=...)
  4. n8n extracts query: parse_qs(urlparse(url).query).get('q', [''])[0] -> "how to self host bookmarks"
  5. n8n creates Karakeep bookmark with:
      • URL: the original Google search URL (kept as-is, NOT cleaned)
      • Title override: the extracted search query ("how to self host bookmarks")
      • Tags: search, google
  6. No offline copy needed (per requirements)
  7. n8n takes a screenshot via headless browser (Playwright/Puppeteer on server):

    page.goto(google_search_url)
    page.screenshot(path="search.png", full_page=True)

  8. Screenshot uploaded to Karakeep as an asset attached to the bookmark (or saved to ArchiveBox)

Expected result:

  • Bookmark titled "how to self host bookmarks" (not "Google Search")
  • Tags: search, google
  • Screenshot of search results page
  • No monolith archive (not needed for a search page)

Consideration: Google may serve CAPTCHAs to headless browsers. May need realistic UA string or cookie consent handling. If screenshot fails, skip -- it's nice-to-have.


UC4: Bulk Import from File

Trigger: Have a file (txt, md, html, etc.) with a mix of URLs.

What happens:

  1. Upload file to n8n (or drop in a watched folder)
  2. n8n parses URLs from the file:
      • .txt: one URL per line
      • .md: extract URLs from markdown links [text](url) and bare URLs
      • .html: Netscape bookmark format (Karakeep supports this natively) or extract href attributes
      • .csv: URL column
  3. For each URL, n8n runs the same pipeline as the iOS Shortcut:
      • Clean URL (tidy-url, Apple News resolution, etc.)
      • Detect type (Reddit, news, search, generic)
      • POST to Karakeep with source: "import" and importSessionId for batch grouping
      • Add appropriate tags via API
      • POST to ArchiveBox (fire-and-forget)
  4. Karakeep deduplicates automatically (existing URLs return HTTP 200)
  5. Rate limiting: process URLs with a delay (respect Reddit 10 req/min, archive.is limits)
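
The parsing step could be sketched as below. The function name, the CSV column name ("url"), and the URL regex are assumptions for illustration; the regex stops at whitespace, quotes, angle brackets, and closing brackets, so URLs that legitimately contain ")" will be truncated.

```python
import csv
import io
import re

URL_RE = re.compile(r"https?://[^\s<>\"')\]]+")


def extract_urls(text: str, extension: str) -> list[str]:
    """Extract unique http(s) URLs from file contents, preserving first-seen order."""
    if extension == ".csv":
        rows = csv.DictReader(io.StringIO(text))
        found = [row.get("url") or "" for row in rows]  # assumed column name
    elif extension == ".html":
        # Netscape bookmark exports use uppercase HREF attributes.
        found = re.findall(r'href="([^"]+)"', text, flags=re.IGNORECASE)
    else:  # .txt and .md: bare URLs and markdown link targets
        found = URL_RE.findall(text)

    cleaned = (u.strip() for u in found)
    return list(dict.fromkeys(u for u in cleaned if u.startswith(("http://", "https://"))))
```

Deduplicating here saves requests against the rate limits, even though Karakeep would also deduplicate on its side.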

For Netscape HTML (Chrome/Firefox bookmark export): Use Karakeep's built-in import at Settings > Import. It handles the format natively.

Expected result:

  • All URLs imported as bookmarks
  • Each tagged appropriately (reddit posts get reddit + r/{sub}, news gets news + source, etc.)
  • Duplicates silently skipped
  • Import batch grouped by importSessionId


UC5: Archive a Fandom Webpage

Trigger: Reading a Fandom wiki page, want to archive the full page.

What happens: 1. Share sheet -> "Archive Page" Shortcut -> POST to n8n webhook 2. n8n cleans URL with tidy-url 3. n8n creates Karakeep bookmark 4. Karakeep rule fires: urlContains("fandom.com") -> addTag("fandom") 5. n8n also extracts the wiki name from URL (e.g., naruto.fandom.com -> tag fandom:naruto) 6. Karakeep monolith crawls the page 7. ArchiveBox archives: SingleFile (full page with all images/CSS), screenshot, readability text 8. Fallback chain runs if monolith fails

Expected result:
- Bookmark with wiki page title
- Tags: fandom, fandom:naruto (or whatever wiki)
- Full offline archive (Fandom pages are mostly static HTML, monolith works well)
- ArchiveBox has screenshot + SingleFile backup

Note: Fandom pages are ad-heavy. SingleFile/monolith may capture ads. Readability extraction gives cleaner text.
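The wiki-name extraction in step 5 is a small piece of hostname parsing. A sketch, assuming the wiki name is the label directly in front of fandom.com; the function name `fandom_wiki_tag` is hypothetical.

```python
from typing import Optional
from urllib.parse import urlparse


def fandom_wiki_tag(url: str) -> Optional[str]:
    """Return "fandom:<wiki>" for a Fandom wiki URL, or None if there is no wiki subdomain."""
    host = (urlparse(url).hostname or "").lower()
    suffix = ".fandom.com"
    if not host.endswith(suffix):
        return None
    # Take the label closest to fandom.com in case of deeper subdomains
    wiki = host[: -len(suffix)].split(".")[-1]
    if not wiki or wiki == "www":
        return None
    return f"fandom:{wiki}"
```

The plain `fandom` tag is left to the Karakeep rule in step 4, so this helper only produces the per-wiki tag.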


UC6: Archive an Apple News Link

Trigger: Someone shares an Apple News link in Messages, or you share from Apple News app.

What happens:
1. Share sheet -> "Archive Page" Shortcut -> POST to n8n webhook
2. n8n detects apple.news/ URL
3. n8n resolves to real URL:

import re
import requests
response = requests.get("https://apple.news/AbCdEfGhIjKl", timeout=10)
match = re.search(r'redirectToUrlAfterTimeout\("([^"]+)"', response.text)
# None means no redirect target was found (see step 4)
real_url = match.group(1) if match else None  # e.g., https://washingtonpost.com/article/...

4. If resolution fails (Apple News+ exclusive): bookmark the apple.news URL as-is, tag apple-news-exclusive, needs-attention
5. If resolved: n8n creates Karakeep bookmark with the real URL (not apple.news)
6. Normal pipeline: Karakeep rules tag it (e.g., news, source:washington-post)
7. Karakeep monolith crawls the real article
8. Fallback chain + ArchiveBox in parallel

Expected result:
- Bookmark points to the real article URL (washingtonpost.com), NOT apple.news
- Tags: news, source:washington-post, via:apple-news
- Publisher shows correctly (Washington Post, not Apple)
- Full archive of the actual article

Edge case: Apple News+ exclusive content has no public URL. These get tagged apple-news-exclusive + needs-attention for manual handling.


UC7: Auto-Tag Everything "News" as News

Implementation: Combination of Karakeep rules + n8n logic.

Karakeep rules (for known domains):

urlContains("washingtonpost.com") -> addTag("news")
urlContains("nytimes.com") -> addTag("news")
urlContains("theguardian.com") -> addTag("news")
urlContains("bbc.co.uk/news") -> addTag("news")
urlContains("reuters.com") -> addTag("news")
urlContains("apnews.com") -> addTag("news")
urlContains("independent.co.uk") -> addTag("news")
urlContains("cnn.com") -> addTag("news")
(... extend for all known news domains)

n8n fallback (for unknown domains):
- Maintain a JSON list of 500+ news domains in n8n (community lists exist)
- On every bookmark, check domain against list
- If match: add news tag via API
- If no match but AI tags suggest news content: add news tag
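The domain check in the n8n fallback could be sketched as follows. It matches the host and each of its parent domains against the list, so that edition.cnn.com matches a cnn.com entry. The function name `is_news_domain` is hypothetical, and the check is host-only: path-scoped entries such as bbc.co.uk/news would still need the Karakeep rule.

```python
from urllib.parse import urlparse


def is_news_domain(url, news_domains):
    """True if the URL's host, or any parent domain of it, is in news_domains."""
    host = (urlparse(url).hostname or "").lower()
    labels = host.split(".")
    # "edition.cnn.com" -> check "edition.cnn.com", then "cnn.com" (never the bare TLD)
    return any(".".join(labels[i:]) in news_domains for i in range(len(labels) - 1))
```

Passing the domain list as a set keeps each lookup constant-time, which matters little at 500 entries but costs nothing.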

AI tagging: Configure Karakeep's AI settings (User Settings > AI Settings) with custom instruction: "If the content is a news article, always include the tag 'news'."


UC8: Every Item Tagged with Source

Implementation: n8n adds source:{domain} tag to every bookmark.

After creating a Karakeep bookmark, n8n:

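from urllib.parse import urlparse

# hostname (unlike netloc) drops any port or credentials and is lowercased;
# removeprefix (Python 3.9+) strips only a leading "www.", not one mid-name
domain = (urlparse(url).hostname or "").removeprefix("www.")
# "reddit.com", "washingtonpost.com", "naruto.fandom.com"
tag = f"source:{domain}"
# POST /bookmarks/{id}/tags with tag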

This is automatic for every bookmark, regardless of type. Karakeep's bookmarkTypeIs condition (link/text/asset) also helps distinguish at a glance.

Combined with UC7: A bookmark from Washington Post gets:
- news (from Karakeep rule or n8n news domain list)
- source:washingtonpost.com (from n8n domain extraction)
- AI-generated topic tags (from Ollama)


UC9: Paywalled Content Archiving

Trigger: Want to archive a paywalled article.

What happens (automated attempt):
1. Bookmark created (via iOS Shortcut or Karakeep app)
2. Karakeep monolith tries to crawl -> fails (paywall blocks it)
3. n8n Tier 2: archive.is fallback
   - Check archive.today/newest/{url} (someone else may have archived it)
   - If found: link in note, tag paywalled, archived-via:archive.is
   - If not found: archiveis.capture(url) (archive.is server may bypass soft paywalls)
4. If archive.is also gets paywalled content:
   - Tag: paywalled, needs-manual-archive
   - ntfy notification: "Paywalled article needs manual archive: {title}"

Manual fallback (desktop):
5. Open article in browser while logged in to your subscription
6. Run SingleFile CLI with cookies: single-file --browser-cookies-file cookies.txt {url} output.html
7. Push to Karakeep: POST /api/v1/bookmarks/singlefile?ifexists=overwrite
8. n8n detects the update, removes needs-manual-archive tag
9. Final tags: paywalled, archived (or paywalled, source:nytimes.com, news)

Tagging logic in n8n:

| Outcome | Tags applied |
|---|---|
| Monolith succeeded (not paywalled) | Normal tags only |
| Monolith failed, archive.is succeeded | paywalled, archived-via:archive.is |
| Monolith failed, archive.is got paywall too | paywalled, needs-manual-archive |
| Manual SingleFile uploaded | paywalled (remove needs-manual-archive) |

How n8n detects paywall vs other failures:
- If the crawled HTML contains paywall indicators (keywords like "subscribe", "paywall", "premium", short content length), tag as paywalled
- If crawl returned nothing at all (connection refused, 403), may be bot protection, not paywall -> different handling
- This heuristic isn't perfect but covers most cases
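A rough sketch of that heuristic, assuming "paywalled" means short content and at least one keyword together. The 200-word threshold, the three return labels, and the function name `classify_crawl` are assumptions for illustration, not values from the deployed workflow.

```python
import re

PAYWALL_KEYWORDS = ("subscribe", "paywall", "premium")
MIN_WORDS = 200  # assumed cutoff for "short content"


def classify_crawl(status, html):
    """Classify a crawl result as "blocked", "paywalled", or "ok"."""
    # No response, a 403, or an empty body looks like bot protection, not a paywall
    if status is None or status == 403 or not html:
        return "blocked"
    # Drop script/style bodies, then all remaining tags, to approximate visible text
    text = re.sub(r"(?is)<(script|style)\b.*?</\1>", " ", html)
    text = re.sub(r"(?s)<[^>]+>", " ", text).lower()
    is_short = len(text.split()) < MIN_WORDS
    has_keyword = any(keyword in text for keyword in PAYWALL_KEYWORDS)
    return "paywalled" if is_short and has_keyword else "ok"
```

Requiring both signals avoids tagging every full-length article that merely has a "Subscribe" link in its footer.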


UC10: Recover Old/Deleted Reddit Saves

Trigger: Have a list of old saved Reddit post URLs, some may be deleted.

What happens:
1. Import list of saved Reddit URLs (from Reddit export or manual list)
2. For each URL, n8n checks Reddit .json endpoint:
   - author == "[deleted]" or selftext == "[removed]" -> post is deleted/removed
   - Active posts: normal archival pipeline (UC1)
3. For deleted posts, n8n tries recovery services in order:

Step 1: Arctic Shift (best coverage, 2005-2026)
  GET https://arctic-shift.photon-reddit.com/api/posts/ids?ids={post_id}
  -> Returns original content if captured before deletion

Step 2: PullPush.io (good for recent data)
  GET https://api.pullpush.io/reddit/search/submission/?ids={post_id}

Step 3: Wayback Machine (pre-Aug 2025 snapshots only)
  GET https://archive.org/wayback/available?url={reddit_url}
  -> Only works for posts archived before Reddit blocked Wayback in Aug 2025
4. If content recovered:
   - Create Karakeep bookmark with old.reddit.com URL
   - Save recovered content in note (title, author, body text)
   - Tags: reddit, r/{subreddit}, recovered, deleted-post
   - ArchiveBox archives whatever is still accessible
5. If recovery fails:
   - Create Karakeep bookmark anyway (URL preserved)
   - Tags: reddit, unrecoverable, deleted-post
   - Note: "Post deleted, content not recovered. Checked: Arctic Shift, PullPush, Wayback Machine"
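The deleted-post check in step 2 and the post ID needed by the recovery APIs can be sketched as below. This assumes the usual shape of a Reddit post `.json` response (a list whose first listing holds the post under `data.children[0].data`); the function names are hypothetical.

```python
import re
from typing import Optional


def reddit_post_id(url: str) -> Optional[str]:
    """Extract the base-36 post ID from a /comments/<id>/ URL."""
    match = re.search(r"/comments/([a-z0-9]+)", url)
    return match.group(1) if match else None


def is_deleted(post_json) -> bool:
    """True if the post's author or body has been deleted/removed."""
    post = post_json[0]["data"]["children"][0]["data"]
    return post.get("author") == "[deleted]" or post.get("selftext") == "[removed]"
```

`reddit_post_id` supplies the `{post_id}` placeholder used in the Arctic Shift and PullPush requests above.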

Key services:

| Service | Coverage | API | Status |
|---|---|---|---|
| Arctic Shift | 2005-2026, 2.5B items | REST, no auth, free | Working |
| PullPush.io | Active ingestion | REST, no auth, free | Working |
| Wayback Machine | Pre-Aug 2025 only | REST, no auth | No new Reddit snapshots |
| Pushshift | N/A | Mod-only since 2023 | Inaccessible |
| Unddit | N/A | Connection refused | Dead |
| Reveddit | N/A | Depends on Pushshift | Broken |

Batch workflow for old saves:

1. Export saved posts from Reddit (Settings > Data > Export)
2. Parse the export for post/comment URLs
3. For each URL:
   a. Check if still active (.json endpoint)
   b. If active: archive normally (UC1)
   c. If deleted: recovery chain (Arctic Shift -> PullPush -> Wayback)
   d. Create Karakeep bookmark with appropriate tags
4. Generate summary: X recovered, Y unrecoverable, Z still active
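The batch loop and its summary could be structured like the sketch below. Each recovery service is passed in as a `(name, fetch)` pair so the order (Arctic Shift, PullPush, Wayback) is plain data; the fetch callables themselves are hypothetical and would wrap the HTTP requests listed earlier.

```python
from collections import Counter


def recover(post_id, sources):
    """Try each (name, fetch) source in order; fetch returns content or None."""
    checked = []
    for name, fetch in sources:
        checked.append(name)
        try:
            content = fetch(post_id)
        except Exception:
            content = None  # treat a failing service like a miss and move on
        if content:
            return {"status": "recovered", "source": name, "content": content, "checked": checked}
    return {"status": "unrecoverable", "source": None, "content": None, "checked": checked}


def process_saves(posts, sources):
    """posts: iterable of (post_id, is_deleted). Returns counts per outcome."""
    summary = Counter()
    for post_id, deleted in posts:
        if not deleted:
            summary["active"] += 1  # would go through the normal UC1 pipeline
        else:
            summary[recover(post_id, sources)["status"]] += 1
    return dict(summary)
```

The `checked` list can be written into the bookmark note for unrecoverable posts, matching the "Checked: ..." note format in step 5.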


iOS / Desktop
  |
  +-- iOS Shortcut "Archive Media" (share sheet)
  |     |
  |     +-- POST to n8n webhook "archive-media"
  |         Reddit: ArchiveBox only (dom/warc/screenshot/wget/media), tagged by subreddit
  |         Non-Reddit: Karakeep + ArchiveBox in parallel
  |
  +-- iOS Shortcut "Archive Website" (share sheet)
  |     |
  |     +-- POST to n8n webhook "archive-website"
  |         (ArchiveBox only, no Karakeep, no tags)
  |
  +-- Karakeep iOS app (native, for quick bookmarks)
  |
  +-- Karakeep browser extension (desktop)
  |
  v
n8n Orchestration Hub
  |
  +-- Webhook: "archive-media" (from iOS Shortcut "Archive Media")
  |     |-- URL type detection (Reddit vs generic)
  |     |-- tidy-url: clean redirects & tracking params
  |     |-- Reddit path:
  |     |     |-- Resolve share links (/s/, redd.it)
  |     |     |-- Rewrite to old.reddit.com
  |     |     |-- Fetch .json (metadata, subreddit, media URLs)
  |     |     |-- POST to ArchiveBox API wrapper (tag=subreddit)
  |     |     |-- Archive extra media (galleries, videos)
  |     |     |-- Skip Karakeep
  |     |-- Generic path:
  |     |     |-- POST to Karakeep (bookmark + tags)
  |     |     |-- POST to ArchiveBox (parallel)
  |     |-- Log to media-archives ledger (Forgejo CSV)
  |     |-- Return result to Shortcut
  |
  +-- Webhook: "archive-website" (from iOS Shortcut "Archive Website")
  |     |-- tidy-url: clean tracking params only
  |     |-- POST directly to ArchiveBox API wrapper
  |     |-- Optional: --depth=1 for full-site crawl
  |     |-- No Karakeep, no tagging, no fallback chain
  |     |-- Return: { "status": "queued" }
  |
  +-- Webhook: "karakeep-crawled" (from Karakeep)
  |     |-- Tier 1: Check htmlContent
  |     |-- Tier 2: archive.is fallback (paywalled domains only)
  |     |-- Tier 3: ntfy notification
  |     |-- Paywall detection + tagging
  |
  v
Karakeep (daily driver — non-Reddit bookmarks)
  |-- Search, AI tagging (Ollama), highlighting
  |-- iOS app with offline reading
  |-- Rules engine: URL-based auto-tagging
  |
ArchiveBox v0.7.3 (archive vault — all URLs)
  |-- Extractors: dom, wget, warc, screenshot, media (yt-dlp), readability
  |-- Disabled: singlefile, pdf, archive.org, headers, git
  |-- API wrapper sidecar (POST /add on port 8001)
  |-- Archives stored on urahara bind mount
  |-- LXC 128, 192.168.1.128
  |
media-archives ledger (Forgejo repo: claude/media-archives)
  |-- Monthly CSV files (2026/04.csv, 2026/05.csv, ...)
  |-- Columns: date_saved, url, source, type, title, tags, karakeep_id,
  |            archivebox_url, archive_today_url, reddit_account, status
  |-- Updated by n8n via Forgejo API
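As an illustration of the ledger format, the sketch below builds the monthly file path and one CSV row using the column list from the diagram. The helper names are hypothetical, and the Forgejo API call that writes the file is left out.

```python
import csv
import io
from datetime import date

COLUMNS = [
    "date_saved", "url", "source", "type", "title", "tags", "karakeep_id",
    "archivebox_url", "archive_today_url", "reddit_account", "status",
]


def ledger_path(day: date) -> str:
    """Monthly file path inside the repo, e.g. 2026/04.csv."""
    return f"{day.year}/{day.month:02d}.csv"


def ledger_row(entry: dict) -> str:
    """Serialize one entry as a CSV line; missing columns become empty fields."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, restval="", lineterminator="\n")
    writer.writerow(entry)  # raises ValueError on keys that are not ledger columns
    return buffer.getvalue()
```

Using the csv module rather than string joins keeps URLs and titles containing commas or quotes from corrupting the row.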

Karakeep Rules to Configure

# Source tagging (known domains)
bookmarkAdded + urlContains("reddit.com")           -> addTag("reddit")
bookmarkAdded + urlContains("fandom.com")            -> addTag("fandom")

# News tagging (per domain)
bookmarkAdded + urlContains("washingtonpost.com")    -> addTag("news"), addTag("source:washington-post")
bookmarkAdded + urlContains("nytimes.com")           -> addTag("news"), addTag("source:nytimes")
bookmarkAdded + urlContains("theguardian.com")       -> addTag("news"), addTag("source:the-guardian")
bookmarkAdded + urlContains("bbc.co.uk/news")        -> addTag("news"), addTag("source:bbc")
bookmarkAdded + urlContains("reuters.com")           -> addTag("news"), addTag("source:reuters")
bookmarkAdded + urlContains("apnews.com")            -> addTag("news"), addTag("source:ap")
bookmarkAdded + urlContains("independent.co.uk")     -> addTag("news"), addTag("source:the-independent")
bookmarkAdded + urlContains("cnn.com")               -> addTag("news"), addTag("source:cnn")
(... extend as needed)

# Source tagging (via n8n for all bookmarks)
# n8n extracts domain and adds source:{domain} tag via API

Readwise Gap

No self-hosted tool replaces Readwise's spaced repetition / daily review. Karakeep highlighting plus Obsidian export covers only part of it. Accept the gap or keep paying.


Decision Matrix

| Approach | Effort | Coverage |
|---|---|---|
| Karakeep + n8n (URL cleaning + fallback + tagging) | Medium | 90% of needs |
| Karakeep + n8n + ArchiveBox vault | Medium-High | 95% of needs |
| Karakeep + n8n + ArchiveBox + iOS Shortcut + Reddit recovery | High | 98% of needs |

Paywalled Article Pipeline (Detailed — 2026-04-20)

Confirmed Implementation Details

| Component | Status | Location |
|---|---|---|
| Paywalled domains list | .txt file in Forgejo repo | holo/media-archives repo, paywalled-domains.txt (NOTE: was under claude/; needs to be moved to correct owner) |
| Proactive detection | Implemented | archive-media workflow, "Clean Generic URL" node checks domain against list |
| Hardcoded fallback list | Implemented | reuters.com, bloomberg.com, wsj.com, nytimes.com, ft.com, economist.com, washingtonpost.com, theathletic.com, businessinsider.com, insider.com, thetimes.co.uk, telegraph.co.uk, hbr.org, seekingalpha.com, barrons.com |
| CSV ledger | Implemented | Monthly CSV (e.g. 2026/04.csv) in same repo, written via Forgejo API |
| Reactive detection | Implemented | karakeep-crawled workflow, "Paywall Detection" node (short content + paywall keywords) |
| Auto-add domain to list | Implemented | karakeep-crawled workflow appends new domain to paywalled-domains.txt |
| Karakeep SingleFile endpoint | Available but NOT wired | POST /api/v1/bookmarks/singlefile?ifexists=overwrite |

Desired Flow (user-specified 2026-04-20)

1. User sends link via iOS Shortcut
2. n8n reads paywalled-domains.txt to check if domain is known-paywalled
3. If NOT paywalled → normal processing (Karakeep bookmark, tags, ledger)
4. If paywalled:
   a. Process article through archive.today (get de-paywalled content)
   b. Store the de-paywalled content IN Karakeep (not just a link)
   c. Bookmark shows the ORIGINAL article URL (not archive.today URL)
   d. Tag: "paywalled"
   e. Note: link to archive.today version
5. Every processed link (Reddit, article, paywalled, etc.) → logged to CSV ledger

Resolved (2026-04-20)

  1. De-paywalled content in Karakeep: Using Option A — fetch archive.today snapshot HTML, upload via Karakeep's SingleFile endpoint (POST /api/v1/bookmarks/singlefile?ifexists=overwrite) with original URL. Bookmark shows original article URL but stored content is the de-paywalled version.

  2. Repo ownership: Created homelab org on Forgejo. Repo transferred from claude/media-archives to homelab/media-archives. Both claude and holo are Owners.

  3. "Process using archive.today" means: Check archive.today/newest/<url> with redirect: 'manual'. If 302, snapshot exists — follow redirect to get snapshot URL, fetch full HTML, upload to Karakeep SingleFile endpoint. If no snapshot exists, fall back to note-only (link to archive.today for manual use).
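The decision in item 3 can be expressed as a small pure function over the lookup response. The sketch assumes the caller has already requested the /newest/ URL with redirects disabled and passes in the status code and headers; the function names and the returned dictionary shape are hypothetical.

```python
def newest_lookup_url(original_url: str) -> str:
    """URL to request (without following redirects) to find the latest snapshot."""
    return f"https://archive.today/newest/{original_url}"


def plan_from_lookup(status: int, headers: dict, original_url: str) -> dict:
    """Decide between uploading a snapshot and falling back to a note-only bookmark."""
    # Header names are case-insensitive, so look the redirect target up loosely
    location = next((v for k, v in headers.items() if k.lower() == "location"), None)
    if status == 302 and location:
        # Snapshot exists: fetch its HTML and upload it under the original URL
        return {"action": "upload_singlefile", "snapshot_url": location, "bookmark_url": original_url}
    # No snapshot: keep a link to the lookup page in the note for manual use
    return {"action": "note_only", "note": newest_lookup_url(original_url), "bookmark_url": original_url}
```

Either way `bookmark_url` stays the original article URL, which is what keeps the publisher attribution correct in Karakeep.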

Current Implementation (v3 — 2026-04-21, profile-driven refactor)

Infrastructure: Dedicated archival LXC 129 (192.168.1.129) with:
- Ladder (port 8080): Go-based paywall bypass proxy with YAML rulesets
- Byparr (port 3001→8191): Camoufox stealth browser (DataDome/Cloudflare bypass)
- browserless/chromium (port 3000): Direct headless Chrome API (last resort fallback)

n8n (LXC 120) calls archival services via IP: LADDER_URL=http://192.168.1.129:8080, BYPARR_URL=http://192.168.1.129:3001, BROWSERLESS_URL=http://192.168.1.129:3000.

Profile-driven routing (replaces hardcoded if/else chains):

The workflow now uses URL profiles to route incoming links. Profiles map domains to handlers and strategy chains. Config is stored in $getWorkflowStaticData('global') (n8n SQLite), bootstrapped from hardcoded defaults on first run, with learned changes synced to url-profiles.json in Forgejo.

Handlers:

| Handler | Domains | Destination |
|---|---|---|
| archivebox | reddit.com, github.com | ArchiveBox (dedicated pipelines) |
| archive-passthrough | archive.today/fo/ph/is/li/md | Extract original URL → Byparr fetch → Karakeep |
| depaywall | reuters, bloomberg, wsj, nyt, wapo, ft, medium, etc. | Strategy chain → Karakeep (SingleFile) |
| default | Everything else | Karakeep (direct bookmark) |
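To make the profile lookup concrete, here is a sketch of domain-to-handler routing. The profile schema (`handler` and `strategies` keys) and the sample entries are assumptions for illustration, since the actual url-profiles.json layout is not shown here, and the real workflow runs in an n8n Code node rather than Python.

```python
from urllib.parse import urlparse

# Hypothetical sample profiles; the deployed defaults may differ
SAMPLE_PROFILES = {
    "reddit.com": {"handler": "archivebox"},
    "github.com": {"handler": "archivebox"},
    "archive.today": {"handler": "archive-passthrough"},
    "reuters.com": {"handler": "depaywall", "strategies": ["ladder", "byparr", "wayback", "browserless"]},
}


def route(url, profiles):
    """Return (matched_domain, profile); unknown domains get the default handler."""
    host = (urlparse(url).hostname or "").lower()
    labels = host.split(".")
    # Most specific match first: "old.reddit.com", then "reddit.com"
    for i in range(len(labels) - 1):
        candidate = ".".join(labels[i:])
        if candidate in profiles:
            return candidate, profiles[candidate]
    return host, {"handler": "default"}
```

Matching on parent domains means subdomains such as old.reddit.com inherit the profile of reddit.com without needing their own entry.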

Strategy chain (for depaywall handler):
1. Ladder /raw/{url}: ruleset applies headers (Googlebot UA, cookies, referer)
2. Byparr POST /v1: Camoufox stealth browser (DataDome, Cloudflare)
3. Wayback Machine SPN2 API: submit + poll + fetch archived page
4. Browserless /content: Googlebot UA headless Chrome (last resort)
5. Fail → plain bookmark + ntfy notification + tag needs-manual-archive

Self-validation (post-upload): After each strategy uploads HTML to Karakeep:
1. Wait 5s for Karakeep to process
2. Pull bookmark back, check content exists and is substantial
3. Check for challenge/paywall signals in stored content
4. If invalid: DELETE bookmark, try next strategy
5. Conservative: if content endpoint unavailable, assume OK (don't delete good bookmarks)
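The interplay between the strategy chain and self-validation can be sketched as one loop. All callables (`fetch`, `upload`, `validate`, `delete`) are hypothetical stand-ins for the HTTP calls to the archival services and Karakeep; `validate` returns True, False, or None when the content endpoint is unavailable.

```python
import time


def run_chain(url, strategies, upload, validate, delete, wait=time.sleep, delay=5):
    """Try (name, fetch) strategies in order until one yields a valid bookmark."""
    for name, fetch in strategies:
        try:
            html = fetch(url)
        except Exception:
            html = None  # a crashing strategy counts as a failed one
        if not html:
            continue
        bookmark_id = upload(url, html)
        wait(delay)  # give Karakeep time to process the upload
        verdict = validate(bookmark_id)
        if verdict is False:
            delete(bookmark_id)  # stored content was a challenge page or paywall stub
            continue
        # True, or None (could not check): conservatively keep the bookmark
        return {"strategy": name, "bookmark_id": bookmark_id}
    return None  # caller falls back to plain bookmark + ntfy + needs-manual-archive
```

Only an explicit `False` triggers deletion, which mirrors the conservative rule in step 5.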

Profile learning:
- Known domain, same strategy wins: no change
- Known domain, different strategy wins: reorder (move winner to front, keep others)
- New domain succeeds: create profile with winning strategy first + ntfy "domain learned"
- Sync to Forgejo: fire-and-forget write to url-profiles.json after learn events
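The three learning cases reduce to a short list manipulation. In this sketch the profile schema, the `DEFAULT_CHAIN` order, and the returned event names are assumptions; for a new domain the remaining default strategies are appended after the winner, which is one possible reading of "winning strategy first".

```python
DEFAULT_CHAIN = ["ladder", "byparr", "wayback", "browserless"]


def learn(profiles, domain, winner):
    """Update profiles in place; return "unchanged", "reordered", or "learned"."""
    profile = profiles.get(domain)
    if profile is None:
        # New domain: winner first, then the rest of the default chain
        chain = [winner] + [s for s in DEFAULT_CHAIN if s != winner]
        profiles[domain] = {"handler": "depaywall", "strategies": chain}
        return "learned"
    chain = profile.get("strategies", [])
    if chain and chain[0] == winner:
        return "unchanged"
    # Move the winner to the front, keep the others in their existing order
    profile["strategies"] = [winner] + [s for s in chain if s != winner]
    return "reordered"
```

The return value tells the caller whether to send the "domain learned" notification and whether a Forgejo sync is needed at all.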

ntfy notifications (specific per event): - Per-strategy failure: "byparr returned bad content for wsj.com — trying next strategy" - Terminal failure: "All strategies failed for wsj.com/article-xyz" - Learning event: "New domain learned: economist.com — strategy ladder worked"

Content verification checks:
- HTML length > 1000 chars
- No bot-challenge indicators (captcha-delivery.com, datadome, verify your identity, etc.)
- Word count > 200 (after stripping tags/scripts/styles)
- No paywall CTA indicators (checks for 2+ signals: "subscribe to continue", "sign in to read", etc.)
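A sketch of these four checks as a single function, using only the indicator strings named above. The real lists behind each "etc." are longer and are not reproduced here; the function name and the reason strings are hypothetical.

```python
import re

# Only the indicators named in the list above; the deployed lists are longer
CHALLENGE_MARKERS = ("captcha-delivery.com", "datadome", "verify your identity")
PAYWALL_CTAS = ("subscribe to continue", "sign in to read")


def verify_content(html):
    """Return (ok, reason) for fetched article HTML."""
    if len(html) <= 1000:
        return False, "too short"
    if any(marker in html.lower() for marker in CHALLENGE_MARKERS):
        return False, "bot challenge"
    # Strip script/style bodies first, then remaining tags, before counting words
    text = re.sub(r"(?is)<(script|style)\b.*?</\1>", " ", html)
    text = re.sub(r"(?s)<[^>]+>", " ", text).lower()
    if len(text.split()) <= 200:
        return False, "too few words"
    if sum(1 for cta in PAYWALL_CTAS if cta in text) >= 2:
        return False, "paywall"
    return True, "ok"
```

Returning the reason alongside the verdict makes it easy to build the per-strategy ntfy messages.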

Ladder ruleset (services/archival/ruleset.yaml):
- Reuters, WSJ, Bloomberg → DataDome-protected (Ladder best-effort, Byparr usually needed)
- NYT, WaPo, The Athletic, Conde Nast, FT, Medium → Googlebot UA + header tricks
- Custom HTML injections to strip ads/overlays per-domain

Key findings:
- Reuters/Bloomberg: DataDome bot protection. Requires Camoufox (Byparr); no header trick works.
- WSJ: Hard paywall + DataDome. All automated strategies currently fail. Wayback is best bet.
- archive.today via Byparr: Times out (~56s). Direct HTTP gets 429. Only works as manual passthrough.

API patterns in Code nodes

  • Process with Profile (node-am-023): Uses this.helpers.httpRequest() + Buffer.from() — the n8n-official API.
  • Reddit/GitHub/Generic nodes (006, 010, 011, 016, 020): Still use fetch() + atob()/btoa() — both patterns work in n8n 2.13.2.

Next Steps

  • [x] Deploy ArchiveBox LXC (v0.7.3 + API wrapper sidecar, LXC 128) — 2026-04-19
  • [x] Build n8n "archive-media" webhook (Reddit-specific + generic paths) — 2026-04-19
  • [x] Build n8n Reddit-specific handler (subreddit tags, .json fetch, media) — 2026-04-19
  • [x] Set up media-archives ledger repo (monthly CSV via Forgejo API) — 2026-04-19
  • [x] Update n8n "archive-website" webhook for v0.7.3 API wrapper — 2026-04-19
  • [x] Deploy archival proxy stack (Ladder + Byparr + browserless, LXC 129) — 2026-04-20
  • [x] Build profile-driven workflow with self-validation and learning — 2026-04-21
  • [ ] Configure Karakeep rules engine (news domains, reddit, fandom)
  • [ ] Build n8n Tier 1/2/3 fallback workflow (karakeep-crawled webhook)
  • [ ] Build n8n Apple News URL resolver
  • [ ] Build n8n Google Search handler (extract query, screenshot)
  • [ ] Build n8n source:{domain} auto-tagger
  • [ ] Build n8n bulk import workflow
  • [ ] Build n8n Reddit recovery workflow (Arctic Shift + PullPush)
  • [ ] Build iOS Shortcut "Archive Media"
  • [ ] Build iOS Shortcut "Archive Website"
  • [ ] Add Reddit RSS feeds to Karakeep (old.reddit.com)
  • [ ] Test Karakeep highlighting workflow on iOS
  • [ ] Decide on Readwise gap: accept, build custom, or keep paying