RFC: Bookmark & Archival Stack (Replacing Raindrop + Readwise)
Status: Draft
Date: 2026-04-16
Problem
Currently using Karakeep for bookmarking, but need a complete workflow for:
- Archiving web pages with reliable offline copies
- Highlighting and annotating saved content
- Read-later with iOS support
- Archiving Reddit posts automatically
- Archiving paywalled articles (news sites)
- Replacing Raindrop.io and Readwise (paid, iOS)
Known Issues
- Google search bookmarks broken: Google wraps links in redirect URLs (google.com/url?sa=t&url=...). Karakeep crawls the redirect, which fails.
- archive.is source attribution: Bookmarking an archive.is URL shows "archive.is" as the source instead of the original publisher.
- No deep archival: Karakeep uses monolith for HTML snapshots, but doesn't produce screenshot or media files.
Decision: Keep Karakeep
Karakeep is the right tool. It is already deployed, integrates with local Ollama for AI tagging, and offers a native iOS app with offline reading, highlighting (4 colors + notes), RSS auto-hoarding, and a full REST API. Main gaps: no deep archival formats and no spaced repetition (Readwise's core feature).
Karakeep Rules Engine (Native Auto-Tagging)
Karakeep has a built-in rules engine at Settings > Rules with IFTTT-style automation:
Events (triggers): bookmarkAdded, tagAdded, tagRemoved, addedToList, removedFromList, favourited, archived
Conditions (filters):
- urlContains / urlDoesNotContain
- titleContains / titleDoesNotContain
- bookmarkTypeIs (link / text / asset)
- bookmarkSourceIs (api / web / cli / mobile / extension / singlefile / rss / import)
- importedFromFeed (specific RSS feed)
- hasTag, isFavourited, isArchived, alwaysTrue
- Composite: and / or (nestable up to 10 levels)
Actions: addTag, removeTag, addToList, removeFromList, downloadFullPageArchive, favouriteBookmark, archiveBookmark
Tagging via API: Tags cannot be included in the create-bookmark call. Two-step: create bookmark, then POST /bookmarks/{id}/tags.
Bookmark types: link, text, asset (image/pdf). No sub-types. Use tags for finer classification (e.g., source:reddit, type:article).
Duplicate handling: Creating a link with an existing URL returns HTTP 200 with the existing bookmark (upsert behavior).
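The two-step tagging flow can be sketched as a pair of request builders. This is a minimal illustration rather than a documented Karakeep client: the host name, the Bearer-token header, and the `{"tags": [{"tagName": ...}]}` payload shape are assumptions to verify against the API reference for the deployed version. The requests are built but not sent.

```python
import json
from urllib.request import Request

KARAKEEP_URL = "https://karakeep.example"  # hypothetical host


def _post(path: str, payload: dict, api_key: str) -> Request:
    """Build (but do not send) an authenticated JSON POST."""
    return Request(
        f"{KARAKEEP_URL}/api/v1{path}",
        data=json.dumps(payload).encode("utf-8"),
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",  # assumed auth scheme
            "Content-Type": "application/json",
        },
    )


def create_bookmark_request(url: str, api_key: str) -> Request:
    """Step 1: create the link bookmark; tags cannot be sent here."""
    return _post("/bookmarks", {"type": "link", "url": url}, api_key)


def add_tags_request(bookmark_id: str, tags: list[str], api_key: str) -> Request:
    """Step 2: attach tags to the bookmark id returned by step 1."""
    payload = {"tags": [{"tagName": tag} for tag in tags]}  # assumed shape
    return _post(f"/bookmarks/{bookmark_id}/tags", payload, api_key)
```

Because a duplicate URL returns the existing bookmark, step 2 can run unconditionally on whatever id step 1 returns.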
Tool Comparison
Karakeep (current -- keeping)
| Feature | Status |
|---|---|
| Offline archive | monolith (single HTML) |
| Highlighting | 4 colors + notes (v0.31.0) |
| iOS app | Native, with offline reading |
| AI tagging | Yes (OpenAI/Ollama) |
| API | Full REST API |
| RSS auto-hoarding | Yes |
| Rules engine | IFTTT-style URL/title/type conditions |
| Full-page snapshot | HTML only (no PDF/screenshot/WARC) |
Linkwarden (runner-up)
Better archival (screenshot+PDF+SingleFile per page). Native iOS app (Dec 2025). AI tagging (OpenAI/Anthropic). Dual-license (AGPL self-hosted, paid cloud). Worth considering only if Karakeep's archival proves insufficient.
Wallabag
Article text only. No visual snapshots. iOS app missing highlighting. Dealbreaker.
Omnivore
Dead project. Acqui-hired by ElevenLabs (Nov 2024). Not recommended.
Shiori
Too basic. No highlighting, no iOS app, no AI. Skip.
Recommended Archive Formats by Content Type
| Content Type | Primary Format | Secondary Format | Key Tool |
|---|---|---|---|
| Web articles | SingleFile HTML | Screenshot PNG | single-file-cli (headless) |
| Reddit posts | .json endpoint | SingleFile on old.reddit.com | curl/wget + SingleFile |
| Paywalled articles | SingleFile (while authenticated) | Readability text extract | SingleFile CLI with cookies |
| Twitter/X | gallery-dl JSON + media | SingleFile for individual tweets | gallery-dl |
| Mastodon | API JSON export | SingleFile | Mastodon API |
| Entire websites | WARC/WACZ | SingleFile per page | Browsertrix Crawler / ArchiveBox |
Format Comparison
| Format | Fidelity | Size | Searchable | Future-Proof | Easy to View |
|---|---|---|---|---|---|
| SingleFile HTML | Excellent | 1-5MB | Yes (grep/browser) | Excellent | Any browser |
| Monolith HTML | Good (no JS) | 1-5MB | Yes | Excellent | Any browser |
| WARC | Excellent | 10-50MB | Only with tools | ISO standard | Needs replay tool |
| PDF | Lossy (layout breaks) | Small-Med | Yes (text layer) | Excellent | Universal |
| MHTML | Good | Medium | Partial | Dead format | Chrome only |
Key insight: monolith (what Karakeep uses) does NOT execute JavaScript, so it misses dynamically loaded content. SingleFile runs a real browser engine and captures JS-rendered content faithfully. SingleFile is strictly better for archival fidelity.
WARC: Not Needed
Overkill for personal use. Can't be viewed directly. Skip unless archiving entire websites.
Reddit-Specific Archiving
The .json API
Append .json to any Reddit URL. Works on www.reddit.com, old.reddit.com, bare reddit.com.
https://old.reddit.com/r/DataHoarder/comments/abc123/title.json?limit=500&sort=top
Response structure: JSON array of two Listing objects:
- [0] -- post data at [0].data.children[0].data
- [1] -- comment tree (kind t1 nodes)
Key post fields:
| Field | Description |
|---|---|
| subreddit | Subreddit name (no r/ prefix) |
| title | Post title |
| author | Author username |
| selftext / selftext_html | Text body (text posts) |
| url | Link target or full URL |
| score / upvote_ratio | Votes |
| num_comments | Total comment count (even if not all returned) |
| is_self | true = text post |
| is_video | true = Reddit-hosted video |
| is_gallery | true = gallery post |
| post_hint | "image", "hosted:video", "rich:video", "link", "self" |
| media.reddit_video | Video metadata (fallback_url, duration, height) |
| media_metadata | Gallery image metadata (keyed by media_id) |
| gallery_data | Gallery ordering and captions |
| crosspost_parent_list | Parent post data if crosspost |
| domain | Source domain for link posts |
Detecting Media Type
Text-only: is_self == true
Image: post_hint == "image" OR i.redd.it URL
Gallery: is_gallery == true, media_metadata present
Reddit video: is_video == true, post_hint == "hosted:video"
External video: post_hint == "rich:video" (YouTube etc.)
Link/article: post_hint == "link"
Crosspost: crosspost_parent present
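The detection rules above can be folded into one function. This is a sketch under two assumptions that the rules themselves do not state: crossposts are checked first, and anything that matches no rule is reported as "unknown". The function name and the returned labels are illustrative.

```python
def classify_post(post: dict) -> str:
    """Return a coarse media type for a post's `data` object."""
    hint = post.get("post_hint")
    url = post.get("url") or ""
    if post.get("crosspost_parent"):
        return "crosspost"  # assumed precedence: check crossposts first
    if post.get("is_self"):
        return "text"
    if post.get("is_gallery") and post.get("media_metadata"):
        return "gallery"
    if post.get("is_video") or hint == "hosted:video":
        return "reddit_video"
    if hint == "rich:video":
        return "external_video"
    if hint == "image" or "i.redd.it" in url:
        return "image"
    if hint == "link":
        return "link"
    return "unknown"  # assumed fallback label
```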
Comment Limits
- ?limit=500 returns up to 500 comments (default 200, max 500)
- ?depth=10 controls tree depth (max 10)
- ?sort=top|confidence|new|controversial|old|qa
- num_comments on post gives true total regardless of returned count
- Truncated branches show as "more" objects (kind "more") with data.count and data.children (IDs to expand)
- Expand via POST /api/morechildren with link_id=t3_{post_id}&children={comma-separated IDs}
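A sketch of walking the comment Listing (element [1] of the response) to collect the returned comments and the IDs hidden behind "more" objects. It assumes that a comment with no replies carries an empty string in `replies` rather than a Listing, which is why non-dict nodes are skipped. The function names are illustrative.

```python
def walk_comments(listing: dict) -> tuple[list[dict], list[str]]:
    """Return (comment data objects, IDs still hidden behind 'more' nodes)."""
    comments: list[dict] = []
    more_ids: list[str] = []
    stack = [listing]
    while stack:
        node = stack.pop()
        if not isinstance(node, dict):
            continue  # assumed: `replies` is "" when there are no replies
        for child in node.get("data", {}).get("children", []):
            kind, data = child.get("kind"), child.get("data", {})
            if kind == "t1":
                comments.append(data)
                stack.append(data.get("replies"))
            elif kind == "more":
                more_ids.extend(data.get("children", []))
    return comments, more_ids


def morechildren_params(post_id: str, more_ids: list[str]) -> dict:
    """Form fields for POST /api/morechildren."""
    return {"link_id": f"t3_{post_id}", "children": ",".join(more_ids)}
```

The traversal is depth-first and does not preserve display order. Comparing `len(comments)` against `num_comments` shows how much of the thread was missed.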
Reddit Media Downloads
| Media type | URL source | Download tool |
|---|---|---|
| Images (i.redd.it) | url field | Direct download (curl/wget) |
| Galleries | media_metadata[id].s.u (HTML-encoded, strip amp;) | gallery-dl |
| Reddit video (v.redd.it) | media.reddit_video.fallback_url (video only, no audio) | yt-dlp (auto-muxes audio) |
| Reddit video audio | v.redd.it/{id}/DASH_audio.mp4 (separate) | yt-dlp handles this |
| External video | url field | yt-dlp |
Important: Reddit video has audio in a separate file. yt-dlp handles muxing automatically (requires ffmpeg). gallery-dl does NOT download videos; use yt-dlp for those.
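For galleries, the image URLs can be pulled from the post data and HTML-decoded before download. The sketch below assumes `gallery_data.items` is a list of objects with a `media_id` key giving display order. If that is missing, it falls back to the key order of `media_metadata`. Items without an `s.u` entry (for example animated media) are skipped.

```python
import html


def gallery_image_urls(post: dict) -> list[str]:
    """Return decoded full-size image URLs for a gallery post, in display order."""
    metadata = post.get("media_metadata") or {}
    items = (post.get("gallery_data") or {}).get("items", [])
    # Assumed shape: each item carries a "media_id" pointing into media_metadata.
    order = [item["media_id"] for item in items] or list(metadata)
    urls = []
    for media_id in order:
        source = metadata.get(media_id, {}).get("s", {})
        if "u" in source:
            urls.append(html.unescape(source["u"]))  # turns &amp; into &
    return urls
```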
URL Parsing
import re
pattern = r'(?:https?://)?(?:\w+\.)?reddit\.com/r/(\w+)/comments/(\w+)(?:/\w+)?(?:/(\w+))?'
# Group 1: subreddit, Group 2: post_id, Group 3: comment_id (if comment permalink)
Other URL forms: redd.it/{id} (short link, redirects), reddit.com/gallery/{id}, v.redd.it/{id}, i.redd.it/{filename}
Rate Limits
- Unauthenticated: 10 req/min (by IP)
- OAuth: 100 req/min
- Must send proper User-Agent or requests blocked
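One way to stay under these limits is to space requests evenly: 10 req/min works out to one request every 6 seconds. The `Throttle` class below is a hypothetical helper, not part of any Reddit client. The clock and sleep functions are injectable so the spacing can be checked without waiting.

```python
import time


class Throttle:
    """Enforce a minimum interval between calls (60 / per_minute seconds)."""

    def __init__(self, per_minute: float, clock=time.monotonic, sleep=time.sleep):
        self.interval = 60.0 / per_minute
        self._clock = clock
        self._sleep = sleep
        self._next_ok = 0.0  # earliest time the next request may start

    def wait(self) -> float:
        """Block until the next request is allowed; return seconds slept."""
        now = self._clock()
        delay = max(self._next_ok - now, 0.0)
        if delay:
            self._sleep(delay)
        self._next_ok = now + delay + self.interval
        return delay
```

Calling `Throttle(10).wait()` before each unauthenticated `.json` fetch keeps a bulk import within the 10 req/min limit. Each request still needs its own descriptive User-Agent header.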
Reddit Shadow DOM Problem (New Reddit)
Critical finding: New Reddit uses Shreddit, a framework built on Lit (Google's web component library) with Shadow DOM for every component. This is the #1 archiving problem:
| Aspect | Old Reddit (old.reddit.com) | New Reddit (reddit.com) |
|---|---|---|
| Rendering | Server-rendered HTML + jQuery | Lit web components + Shadow DOM |
| JS required | No (content in raw HTML) | Yes (nothing renders without JS) |
| SingleFile size | ~700KB-2MB | 20-30MB (Shadow DOM stylesheet duplication) |
| SingleFile stability | Reliable | Can crash browser (memory exhaustion) |
| monolith | Works (content in HTML) | Captures nothing (no JS engine) |
| Galleries/albums | Static HTML images | JS carousels (only first image captured) |
| Comment structure | Plain HTML <div> nesting | Shadow DOM components |
Why it matters: SingleFile duplicates shared stylesheets inside each Shadow DOM component. A thread with many comments = many Shadow DOM components = massive file bloat and potential OOM crashes. (SingleFile issue #1514)
Workaround for SingleFile: Enable "Stylesheets > group duplicate stylesheets together" or use ZIP output format (auto-deduplicates).
The rule: ALWAYS convert to old.reddit.com before archiving. This is non-negotiable.
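A minimal sketch of that conversion using only the standard library. The set of hostnames treated as Reddit is an assumption. Short links (redd.it) and media hosts are left untouched because they need a redirect or a different tool.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumed list of hostnames that serve the same thread pages.
REDDIT_HOSTS = {"reddit.com", "www.reddit.com", "new.reddit.com", "old.reddit.com"}


def to_old_reddit(url: str) -> str:
    """Rewrite a Reddit URL to old.reddit.com, keeping path, query, and fragment."""
    parts = urlsplit(url)
    if parts.hostname not in REDDIT_HOSTS:
        return url  # not a Reddit thread host (or no scheme): leave unchanged
    return urlunsplit(("https", "old.reddit.com", parts.path, parts.query, parts.fragment))
```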
Reddit Archiving Strategy (Multi-Layer)
For maximum fidelity, use multiple layers:
Layer 1: Data (structured, complete)
.json endpoint -> full post + top 500 comments + metadata
Stored as JSON file alongside bookmark
Layer 2: Visual (how it looked)
SingleFile on old.reddit.com (with comments expanded)
Stored in Karakeep + ArchiveBox
Layer 3: Media (full resolution)
gallery-dl -> images + galleries
yt-dlp -> Reddit-hosted video (auto-muxes audio) + external video
Stored in ArchiveBox
Layer 4: Screenshot (fallback)
ArchiveBox screenshot extractor -> full-page PNG
Insurance when SingleFile fails
Comment expansion before archiving:
- Old Reddit with ?limit=500&depth=10 renders up to 500 comments server-side
- For automated workflow: this is sufficient (comments are in the HTML)
- For manual high-fidelity: use "Expand Everything" userscript before SingleFile save
Gallery limitation: SingleFile cannot archive Reddit galleries/albums (JS carousels). Only the first image is captured. The n8n workflow must extract all gallery image URLs from .json response (media_metadata[id].s.u) and archive them separately via gallery-dl or direct download.
Reddit Embeds Within Posts
| Embed Type | SingleFile (old.reddit) | .json API | Separate Tool |
|---|---|---|---|
| Twitter/X posts | Fallback blockquote text only | URL only | yt-dlp (video), manual (text) |
| YouTube videos | Thumbnail/poster only | URL only | yt-dlp |
| Imgur albums | First image only | URL only | gallery-dl |
| Reddit crossposts | Partial (text) | Full parent data | N/A |
No tool handles all embeds perfectly. The n8n workflow should:
1. Parse .json for external URLs (url, url_overridden_by_dest, domain)
2. If YouTube: add to ArchiveBox (yt-dlp captures the video)
3. If Imgur: run gallery-dl
4. If Twitter: best-effort (text in blockquote, video via yt-dlp if applicable)
Archiving Articles with Embedded Content
The Problem
News articles commonly embed tweets, YouTube videos, Instagram posts, and interactive content. These are loaded via JavaScript widgets and iframes from third-party domains. Most archiving tools handle them poorly.
Embed Capture by Tool
| Content Type | SingleFile (extension) | SingleFile (CLI) | monolith | ArchiveBox |
|---|---|---|---|---|
| Tweet text + images | Yes (rendered iframe) | Yes (with --browser-wait-delay 10000) | Fallback blockquote only | Via SingleFile extractor |
| Tweet videos | Poster frame only | Poster frame only | No | No (need yt-dlp separately) |
| YouTube player UI | Static snapshot | Static snapshot | No | Via SingleFile |
| YouTube video file | No | No | No | Only if yt-dlp detects embed |
| Instagram embed | Broken since 2025 (auth required) | Broken | No | Broken |
| TikTok embed | Partial (widget UI) | Partial | No | Partial |
| Spotify embed | Player UI (no audio) | Player UI | No | Player UI |
| Interactive charts | Static SVG/Canvas snapshot | Static snapshot | No | Screenshot only |
Key Findings
Tweets: When widgets.js executes, it replaces <blockquote class="twitter-tweet"> elements with rendered iframes. SingleFile (extension) captures the rendered iframe content including profile pic, formatted text, and static images. If a tweet is deleted after archiving, the embedded version is preserved in your SingleFile HTML (no external dependency). monolith only gets the fallback blockquote text.
YouTube: Embeds are iframes (youtube.com/embed/VIDEO_ID). SingleFile captures the player UI (thumbnail, title) but NOT the video stream. ArchiveBox's yt-dlp may detect embedded YouTube players via its generic extractor, but it's not guaranteed. Reliable approach: extract YouTube URLs from HTML and add to ArchiveBox separately.
Instagram: Meta removed unauthenticated oEmbed in April 2025. All Instagram embeds are now broken without auth. Archives made before this date are fine. New archives will show empty containers.
SingleFile CLI Settings for Embeds
single-file \
--browser-wait-delay=10000 \
--browser-wait-until=networkidle0 \
--browser-load-max-time=60000 \
--browser-capture-max-time=60000 \
https://article-with-embeds.com output.html
- --browser-wait-delay=10000: Give 10s for tweet/YouTube widgets to render
- --browser-wait-until=networkidle0: Wait until no network activity
- --no-remove-frames: Must be set to preserve iframe content (embeds)
Recommended Approach for Articles with Embeds
The n8n workflow should handle embeds as a post-archive extraction step:
1. Archive article with SingleFile (with wait delays for embed rendering)
2. Parse the archived HTML for embedded URLs:
- grep for youtube.com/embed/ -> extract video IDs
- grep for twitter.com or x.com -> extract tweet URLs
- grep for instagram.com -> flag as likely broken
3. For each embedded URL:
- YouTube: POST to ArchiveBox (yt-dlp captures video)
- Twitter video: POST to ArchiveBox (yt-dlp)
- Tweet text: already preserved in SingleFile (if rendered)
4. Add note to Karakeep bookmark: "Embedded media: 2 YouTube videos, 1 tweet (archived separately)"
Extract YouTube from HTML:
import re
# html_content: the archived page's HTML as a string
youtube_ids = re.findall(r'youtube\.com/embed/([^"?&]+)', html_content)
for vid_id in set(youtube_ids):
    watch_url = f"https://youtube.com/watch?v={vid_id}"
    # POST watch_url to ArchiveBox
Recovering Deleted Reddit Posts
Detection via .json Endpoint
| State | author | selftext/body | removed_by_category |
|---|---|---|---|
| Active | username | actual content | null |
| User-deleted | [deleted] | [deleted] | null |
| Mod-removed | [deleted] | [removed] | "moderator" |
| Admin-removed | [deleted] | [removed] | "admin" or "anti_evil_ops" |
Recovery Services (Priority Order)
1. Arctic Shift (best coverage, 2005-2026):
- API: https://arctic-shift.photon-reddit.com
- GET /api/posts/ids?ids={id} -- up to 500 posts
- GET /api/comments/ids?ids={id} -- up to 500 comments
- GET /api/comments/tree?link_id={post_id} -- full comment tree (up to 25000)
- GET /api/posts/search?subreddit=X&author=Y&title=Z
- 2.5 billion items, no auth required, free
- Web UI: https://arctic-shift.photon-reddit.com/search
2. PullPush.io (good for recent data):
- API: https://api.pullpush.io
- GET /reddit/search/submission/?ids={id}
- GET /reddit/search/comment/?ids={id}
- No auth required, free
3. Wayback Machine (pre-August 2025 only):
- Reddit blocked Wayback Machine crawling in August 2025
- Pre-existing snapshots still accessible
- GET https://archive.org/wayback/available?url={reddit_url} -- check availability
- Coverage was always spotty
Dead services: Pushshift (mod-only since 2023), Unddit (down), Reveddit (depends on Pushshift, broken for regular users), Google Cache (discontinued 2024).
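Detection and recovery can be sketched together: classify the item from the `.json` fields, then build lookup URLs for the live services in priority order. The function names are illustrative. Items that match no row of the detection table are reported as "active", which is an assumption. Nothing is fetched here.

```python
from urllib.parse import urlencode

ARCTIC_SHIFT = "https://arctic-shift.photon-reddit.com"
PULLPUSH = "https://api.pullpush.io"


def removal_state(item: dict) -> str:
    """Classify a post or comment `data` object using the detection table."""
    category = item.get("removed_by_category")
    text = item.get("selftext", item.get("body"))  # posts vs. comments
    if category in ("admin", "anti_evil_ops"):
        return "admin-removed"
    if category == "moderator":
        return "mod-removed"
    if item.get("author") == "[deleted]" and text == "[deleted]":
        return "user-deleted"
    return "active"  # assumed default when no table row matches


def recovery_urls(post_id: str, reddit_url: str) -> list[str]:
    """Lookup URLs for one post, in the priority order listed above."""
    return [
        f"{ARCTIC_SHIFT}/api/posts/ids?" + urlencode({"ids": post_id}),
        f"{PULLPUSH}/reddit/search/submission/?" + urlencode({"ids": post_id}),
        "https://archive.org/wayback/available?" + urlencode({"url": reddit_url}),
    ]
```

A workflow would call `recovery_urls` only when `removal_state` returns something other than "active", trying each URL until one returns content.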
SingleFile CLI (Headless, No Browser Extension)
single-file-cli runs completely headless via Deno + Chrome DevTools Protocol.
npm install -g single-file-cli
single-file https://example.com output.html
single-file --browser-cookies-file cookies.txt https://paywalled-site.com/article output.html
docker run capsulecode/singlefile https://example.com > output.html
Karakeep's Native SingleFile Integration
POST /api/v1/bookmarks/singlefile
- Data field: file, URL field: url, add API key header
- ?ifexists=overwrite updates existing bookmarks with better archive content
- ?ifexists=skip/append/overwrite-recrawl/append-recrawl also available
ArchiveBox as Fire-and-Forget Vault
ArchiveBox is NOT part of the fallback chain. It runs on every bookmark in parallel as a separate long-term preservation vault.
Deployed Config (v0.7.3, LXC 128)
# Enabled extractors
SAVE_DOM=True # Raw HTML DOM capture
SAVE_WGET=True # Full recursive wget (also generates WARC)
SAVE_WARC=True # WARC format (requires SAVE_WGET=True)
SAVE_SCREENSHOT=True # Full-page PNG screenshot
SAVE_MEDIA=True # Video/audio via yt-dlp
SAVE_READABILITY=True # Clean article text extraction
# Disabled extractors
SAVE_SINGLEFILE=False # Too resource-heavy, redundant with dom+wget
SAVE_PDF=False # Redundant
SAVE_ARCHIVE_DOT_ORG=False
SAVE_HEADERS=False
SAVE_GIT=False
# Limits
MEDIA_MAX_SIZE=750m
TIMEOUT=120
API (v0.7.3 — via API wrapper sidecar)
v0.7.3 has no built-in REST API. A thin HTTP wrapper sidecar exposes the CLI:
- POST /add -- archive a URL (body: {url, depth, tag, overwrite})
- GET /health -- health check
- Auth: Authorization: Bearer <token> header
- Token stored in Vault at secret/archivebox → api_key
- Runs on port 8001 (same Docker network, shares /data volume)
- n8n env: ARCHIVEBOX_URL=http://192.168.1.128:8001
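A fire-and-forget call to the wrapper might be built like this. The body keys come from the endpoint description above. Their value types (integer depth, string tag, boolean overwrite) and the defaults are assumptions about the sidecar, which is a local component. The request is built but not sent.

```python
import json
from urllib.request import Request


def archivebox_add_request(
    base_url: str,
    token: str,
    url: str,
    depth: int = 0,
    tag: str = "",
    overwrite: bool = False,
) -> Request:
    """Build the POST /add request for the ArchiveBox wrapper sidecar."""
    payload = {"url": url, "depth": depth, "tag": tag, "overwrite": overwrite}
    return Request(
        f"{base_url.rstrip('/')}/add",
        data=json.dumps(payload).encode("utf-8"),
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )
```

Passing the result to `urllib.request.urlopen(req, timeout=10)` would submit it. Since the vault is not part of the fallback chain, failures can be logged and ignored.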
Storage
- SQLite DB: /opt/archivebox/data (local rootfs, 16GB disk)
- Archive files: /mnt/archivebox/archive (urahara bind mount, bulk storage)
- Estimated: ~7-10MB per Reddit post, ~70-100GB for 10,000 posts
archive.is: Optional Fallback, Not Primary
Fallback when Karakeep's monolith fails. Cannot bypass paywalls. No official API.
Python archiveis library:
import archiveis
archive_url = archiveis.capture("https://example.com/article")
Check existing: curl -sI "https://archive.today/newest/{url}"
Limitations: Rate limit ~1 hour/URL, must NOT use Cloudflare DNS (1.1.1.1), spoof browser UA, HTTP 503 on aggressive use.
URL Cleaning (Google, Apple News, Redirects)
Google Search Result Redirects
https://www.google.com/url?sa=t&url=https%3A%2F%2Factual-site.com%2Farticle&ved=...
Solution: tidy-url (Node.js)
import { TidyURL } from 'tidy-url';
const result = TidyURL.clean(url);
console.log(result.url); // cleaned URL
Handles Google, Facebook, Twitter/X, Steam, AMP. Strips tracking params. Proton fork: @protontech/tidy-url.
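For the Google redirect case alone, the unwrapping can also be sketched with the Python standard library. This is a stand-in for illustration and does not reproduce tidy-url's rule set. Checking the `q` parameter as a second candidate is an assumption.

```python
from urllib.parse import parse_qs, urlsplit


def unwrap_google_redirect(url: str) -> str:
    """Return the target of a google.com/url redirect, or the URL unchanged."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    is_google = host == "google.com" or host.endswith(".google.com")
    if is_google and parts.path == "/url":
        params = parse_qs(parts.query)  # parse_qs also percent-decodes values
        for key in ("url", "q"):  # "q" is an assumed alternate parameter
            if params.get(key):
                return params[key][0]
    return url
```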
Google Search Page URLs
https://www.google.com/search?q=how+to+self+host+bookmarks&client=safari&...
Extract search query:
from urllib.parse import urlparse, parse_qs
query = parse_qs(urlparse(url).query).get('q', [''])[0]
For use case #3 (save a search), the n8n workflow extracts q= and uses it as the bookmark title.
Apple News Links
https://apple.news/AbCdEfGhIjKl
Apple News URLs do NOT issue HTTP redirects. They serve an HTML page with a JavaScript-based redirect. curl -L will NOT work.
Resolution: fetch HTML + parse:
import requests, re
response = requests.get("https://apple.news/AbCdEfGhIjKl")
match = re.search(r'redirectToUrlAfterTimeout\("([^"]+)"', response.text)
real_url = match.group(1) if match else None
Limitation: Apple News+ exclusive articles have no public URL to resolve.
n8n URL Cleaning Node (Unified)
The n8n Code node handles all URL types in one place:
const { TidyURL } = require('tidy-url');
function cleanUrl(url) {
  // Apple News: fetch and parse JS redirect
  if (url.includes('apple.news/')) {
    // HTTP GET + regex parse redirectToUrlAfterTimeout
    // Return extracted URL
  }
  // Google Search page: extract query, keep as-is but set title
  if (url.match(/google\.\w+\/search\?/)) {
    const query = new URL(url).searchParams.get('q');
    return { url, title: query, type: 'search' };
  }
  // Everything else: tidy-url handles Google/Facebook/Twitter redirects
  const cleaned = TidyURL.clean(url);
  return { url: cleaned.url, title: null, type: 'link' };
}
Karakeep Rules (Auto-Tagging)
Set up in Settings > Rules:
| Rule | Event | Condition | Action |
|---|---|---|---|
| Reddit posts | bookmarkAdded | urlContains("reddit.com") | addTag("reddit") |
| News articles | bookmarkAdded | urlContains("washingtonpost.com") OR urlContains("independent.co.uk") OR ... | addTag("news") |
| News (broad) | bookmarkAdded | n8n sets tag via API | addTag("news") |
| Fandom wikis | bookmarkAdded | urlContains("fandom.com") | addTag("fandom") |
| From RSS | bookmarkAdded | bookmarkSourceIs("rss") | addTag("rss-feed") |
| From iOS Shortcut | bookmarkAdded | bookmarkSourceIs("api") | (handled by n8n tagging) |
Limitation: Karakeep rules can only match URL substrings, not regex. For complex tagging (e.g., extracting subreddit name, detecting news domains broadly), n8n must add tags via the API after bookmark creation.
Tiered Archival Fallback Chain
Bookmark created in Karakeep
--> Karakeep crawls with monolith (automatic)
--> Webhook fires on "crawled" event
--> n8n receives webhook
--> n8n GET /api/v1/bookmarks/{id} to check content
|
Tier 1: htmlContent exists?
--> YES: local archive succeeded, done
|
Tier 2: NO content --> archive.is fallback
--> Check archive.today/newest/{url}
--> Found? Grab the archive.is URL
--> Not found? archiveis.capture(url)
--> PATCH bookmark note with archive.is link
--> Tag: "archive.is-fallback"
|
Tier 3: archive.is also fails
--> Tag: "needs-manual-archive"
--> ntfy notification: "Bookmark X needs manual SingleFile"
|
In parallel (every bookmark, regardless of tier):
--> POST to ArchiveBox API (fire-and-forget)
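The tier decision in the chain above reduces to a small pure function. In this sketch `find_existing` and `capture` are hypothetical callables standing in for the archive.today lookup and `archiveis.capture`. The returned dict is an illustrative shape for n8n to act on, not a Karakeep structure.

```python
from typing import Callable, Optional


def run_fallback(
    url: str,
    html_content: Optional[str],
    find_existing: Callable[[str], Optional[str]],
    capture: Callable[[str], Optional[str]],
) -> dict:
    """Decide which tier applies and which tag/link to attach."""
    if html_content:
        # Tier 1: Karakeep's own crawl produced content.
        return {"tier": 1, "tags": [], "archive_url": None}
    archive_url = find_existing(url)
    if archive_url is None:
        try:
            archive_url = capture(url)
        except Exception:
            archive_url = None  # e.g. HTTP 503 or rate limiting
    if archive_url:
        # Tier 2: link the archive.is copy in the bookmark note.
        return {"tier": 2, "tags": ["archive.is-fallback"], "archive_url": archive_url}
    # Tier 3: nothing worked; flag for manual SingleFile capture.
    return {"tier": 3, "tags": ["needs-manual-archive"], "archive_url": None}
```

The ArchiveBox submission is deliberately absent: it runs for every bookmark regardless of the tier returned here.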
iOS Shortcuts
Two separate shortcuts for different purposes:
Shortcut 1: "Archive Page"
Bookmarks + archives a page. Goes through full pipeline: URL cleaning, Karakeep bookmark, tagging, ArchiveBox vault, fallback chain.
1. [Receive] input from Share Sheet (URL)
2. [Get Contents of URL]
URL: https://n8n.domain/webhook/archive-page
Method: POST
Headers: Content-Type: application/json, Authorization: Bearer <token>
Body: {"url": "[Shortcut Input]"}
3. [Get Dictionary Value] "bookmarkId" from response
4. [Show Notification] "Archived!"
n8n "archive-page" webhook does:
- tidy-url clean + Apple News resolve + Google Search extract
- POST to Karakeep (bookmark + tags)
- POST to ArchiveBox (fire-and-forget)
- Return bookmarkId
Shortcut 2: "Archive Website"
Deep archive only. Sends straight to ArchiveBox, no Karakeep bookmark. For when you want to preserve an entire website/page for long-term storage without cluttering your reading list.
1. [Receive] input from Share Sheet (URL)
2. [Get Contents of URL]
URL: https://n8n.domain/webhook/archive-website
Method: POST
Headers: Content-Type: application/json, Authorization: Bearer <token>
Body: {"url": "[Shortcut Input]"}
3. [Show Notification] "Sent to ArchiveBox!"
n8n "archive-website" webhook does:
- tidy-url clean (strip tracking params)
- POST directly to ArchiveBox via the API wrapper sidecar: POST /add
- ArchiveBox saves: SingleFile + screenshot + media + readability
- No Karakeep, no tagging, no fallback chain
- Returns: { "status": "queued" }
Use cases for "Archive Website":
- Wiki pages you want preserved but don't need to read later
- Documentation pages that might disappear
- Forum threads / discussions for reference
- Entire sites you want a snapshot of (ArchiveBox follows links with --depth=1)
- Anything you want in the vault but not in your reading list
ArchiveBox depth crawling (for actual full-site archiving):
n8n webhook body option: {"url": "...", "depth": 1}
-> archivebox add --depth=1 <url>
--depth=1 follows all links on the page and archives those too. Use sparingly (can be hundreds of pages).
Use Cases (iOS-Focused)
UC1: Save a Reddit Post from iOS
Trigger: Scrolling Reddit app/Safari, share a post or comment.
What happens (multi-layer archiving):
- Share sheet -> "Archive Page" Shortcut -> POST to n8n webhook
- n8n detects reddit.com URL, parses it:
  pattern = r'reddit\.com/r/(\w+)/comments/(\w+)(?:/\w+)?(?:/(\w+))?'
  # Extracts: subreddit, post_id, comment_id (if comment)
- CRITICAL: n8n converts to old.reddit.com URL (new Reddit uses Shadow DOM, produces 20-30MB files and can crash archivers)
- Layer 1 (Data): n8n fetches .json?limit=500&sort=top:
  - Extracts subreddit, title, num_comments, post_hint, media fields
  - Saves full JSON response to storage (complete structured data)
  - If num_comments > 500: adds note "Post has {num_comments} comments, top 500 archived"
  - If comment permalink: fetches parent post JSON too
- n8n creates Karakeep bookmark with the old.reddit.com URL + ?limit=500&depth=10
- n8n adds tags via API:
  - reddit (source)
  - r/{subreddit} (specific sub)
  - Content type tag based on post_hint: image, video, article, discussion
  - If comment permalink: comment
- Layer 2 (Visual): Karakeep monolith crawls old.reddit.com page (comments visible in server-rendered HTML, no Shadow DOM issues)
- Layer 3 (Media): n8n handles media based on .json data:
  - Image (i.redd.it): Direct download URL from url field -> ArchiveBox
  - Gallery (is_gallery): n8n extracts ALL image URLs from media_metadata[id].s.u (HTML-decode amp;), sends each to ArchiveBox. SingleFile/monolith only capture the first gallery image (JS carousel).
  - Video (v.redd.it): ArchiveBox yt-dlp auto-muxes video+audio. Direct URL at media.reddit_video.fallback_url
  - External link (post_hint == "link"): n8n also bookmarks the linked URL separately in Karakeep (gets its own archive pipeline)
  - Embedded YouTube (post_hint == "rich:video"): Extract URL, add to ArchiveBox for yt-dlp
  - Crosspost (crosspost_parent present): Extract parent post data from crosspost_parent_list[0], archive parent's media too
- Layer 4 (Screenshot): ArchiveBox screenshot extractor captures full-page PNG (fallback)
- Fallback chain runs (Tier 1/2/3) for the Karakeep crawl
Expected result in Karakeep:
- Bookmark with title from Reddit post
- Tags: reddit, r/selfhosted, article (or image/video/discussion)
- Monolith archive of old.reddit.com page (up to 500 comments visible)
- Note with: comment count info, embedded media summary
- If external link: second bookmark for the linked article
Expected result in ArchiveBox:
- SingleFile HTML of old.reddit.com page
- Screenshot PNG
- All gallery images (downloaded individually, not via SingleFile)
- Video files (yt-dlp, muxed with audio)
- JSON data file (complete structured data)
Gaps / considerations:
- Reddit rate limits: 10 req/min unauthenticated. For occasional saves this is fine. Bulk saves need OAuth.
- Gallery images: SingleFile/monolith CANNOT capture JS carousels. Must extract from .json media_metadata and download separately.
- Comment expansion beyond 500: Not practical for automated workflow. num_comments tells you what you missed.
- Reddit embeds (tweets, YouTube in comments): Only fallback text captured by SingleFile on old Reddit. External URLs extracted from .json and archived separately.
UC2: Read-It-Later Article from Safari
Trigger: Reading an article in Safari, want to save for later.
What happens:
1. Share sheet -> "Archive Page" Shortcut -> POST to n8n webhook
2. n8n cleans URL with tidy-url (strips tracking params)
3. n8n creates Karakeep bookmark
4. Karakeep rules engine fires:
- urlContains("washingtonpost.com") -> addTag("news"), addTag("source:washington-post")
- urlContains("independent.co.uk") -> addTag("news"), addTag("source:the-independent")
- (one rule per known news domain)
5. n8n also detects news domains from a maintained list and adds news tag via API (covers domains not in Karakeep rules)
6. Karakeep monolith crawls the article
7. Fallback chain runs
8. ArchiveBox archives with SingleFile CLI (--browser-wait-delay=10000 --browser-wait-until=networkidle0) to give embedded tweets/YouTube time to render
9. Embedded content extraction (post-archive step in n8n):
- Parse archived HTML for youtube.com/embed/ -> extract video IDs, add to ArchiveBox (yt-dlp)
- Parse for twitter.com / x.com embeds -> tweet text already captured by SingleFile if widgets.js rendered; video needs yt-dlp
- Parse for instagram.com -> flag in note (broken since April 2025, no fix)
- Add note: "Embedded media: N YouTube videos, N tweets (archived separately)"
Expected result:
- Bookmark with article title and publisher metadata
- Tags: news, source:washington-post (automatic)
- AI tags from Ollama (topic-based)
- Monolith archive for offline reading (embedded tweets as fallback blockquotes only)
- ArchiveBox has: SingleFile with rendered embeds, screenshot, embedded YouTube videos (yt-dlp)
- Available in Karakeep iOS app for offline reading + highlighting
Embedded content limitations:
- monolith (Karakeep) does NOT execute JS -> embedded tweets show as <blockquote> fallback text, YouTube shows placeholder only
- SingleFile (ArchiveBox) with wait delays captures rendered tweets with profile pic/text/images, but NOT tweet videos or YouTube video streams
- Instagram embeds broken since Meta removed unauthenticated oEmbed (April 2025)
- YouTube videos must be archived separately via yt-dlp
Karakeep rules to set up:
Rule: "News - Washington Post"
Event: bookmarkAdded
Condition: urlContains("washingtonpost.com")
Actions: addTag("news"), addTag("source:washington-post")
Rule: "News - The Independent"
Event: bookmarkAdded
Condition: urlContains("independent.co.uk")
Actions: addTag("news"), addTag("source:the-independent")
(repeat for each known news domain)
For unknown news domains: n8n maintains a broader list of news domains (or uses a heuristic like checking for <article> tags, og:type=article, or known news TLDs). Falls back to AI tagging.
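The maintained-list approach on the n8n side could look like the sketch below. Matching on the parsed hostname rather than a URL substring avoids false positives from look-alike domains. The two entries mirror the rules above, and the mapping name and helper are illustrative.

```python
from urllib.parse import urlsplit

# Illustrative subset; the real list would be maintained alongside the workflow.
NEWS_DOMAINS = {
    "washingtonpost.com": "washington-post",
    "independent.co.uk": "the-independent",
}


def news_tags(url: str) -> list[str]:
    """Return ["news", "source:<slug>"] for known news hosts, else []."""
    host = (urlsplit(url).hostname or "").lower()
    for domain, slug in NEWS_DOMAINS.items():
        if host == domain or host.endswith("." + domain):
            return ["news", f"source:{slug}"]
    return []
```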
UC3: Save a Google Search for Later
Trigger: Doing a Google search, want to save the search itself for later research.
What happens:
1. Share the Google search URL from Safari
2. Share sheet -> "Archive Page" Shortcut -> POST to n8n webhook
3. n8n detects Google search URL (google.com/search?q=...)
4. n8n extracts query: parse_qs(urlparse(url).query).get('q', [''])[0] -> "how to self host bookmarks"
5. n8n creates Karakeep bookmark with:
- URL: the original Google search URL (kept as-is, NOT cleaned)
- Title override: the extracted search query ("how to self host bookmarks")
- Tags: search, google
6. No offline copy needed (per requirements)
7. n8n takes a screenshot via headless browser (Playwright/Puppeteer on server):
page.goto(google_search_url)
page.screenshot(path="search.png", full_page=True)
Expected result:
- Bookmark titled "how to self host bookmarks" (not "Google Search")
- Tags: search, google
- Screenshot of search results page
- No monolith archive (not needed for a search page)
Consideration: Google may serve CAPTCHAs to headless browsers. May need realistic UA string or cookie consent handling. If screenshot fails, skip -- it's nice-to-have.
UC4: Bulk Import from File
Trigger: Have a file (txt, md, html, etc.) with a mix of URLs.
What happens:
1. Upload file to n8n (or drop in a watched folder)
2. n8n parses URLs from the file:
- .txt: one URL per line
- .md: extract URLs from markdown links [text](url) and bare URLs
- .html: Netscape bookmark format (Karakeep supports this natively) or extract href attributes
- .csv: URL column
3. For each URL, n8n runs the same pipeline as the iOS Shortcut:
- Clean URL (tidy-url, Apple News resolution, etc.)
- Detect type (Reddit, news, search, generic)
- POST to Karakeep with source: "import" and importSessionId for batch grouping
- Add appropriate tags via API
- POST to ArchiveBox (fire-and-forget)
4. Karakeep deduplicates automatically (existing URLs return HTTP 200)
5. Rate limiting: process URLs with a delay (respect Reddit 10 req/min, archive.is limits)
For Netscape HTML (Chrome/Firefox bookmark export): Use Karakeep's built-in import at Settings > Import. It handles the format natively.
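For the .txt and .md cases, URL extraction can be a single regex pass with order-preserving de-duplication. This is a rough sketch: the pattern stops at whitespace, brackets, parentheses, and quotes, so it handles markdown links and bare URLs alike but truncates URLs that legitimately contain parentheses.

```python
import re

# Stops at whitespace, angle/round/square brackets, and quotes.
URL_RE = re.compile(r"""https?://[^\s<>()\[\]"']+""")


def extract_urls(text: str) -> list[str]:
    """Return unique URLs in first-seen order, trimming trailing punctuation."""
    seen: set[str] = set()
    urls: list[str] = []
    for match in URL_RE.finditer(text):
        url = match.group(0).rstrip(".,;:!?")
        if url not in seen:
            seen.add(url)
            urls.append(url)
    return urls
```

Local de-duplication avoids redundant API calls, even though Karakeep would also return the existing bookmark for a repeated URL.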
Expected result:
- All URLs imported as bookmarks
- Each tagged appropriately (reddit posts get reddit + r/{sub}, news gets news + source, etc.)
- Duplicates silently skipped
- Import batch grouped by importSessionId
UC5: Archive a Fandom Webpage
Trigger: Reading a Fandom wiki page, want to archive the full page.
What happens:
1. Share sheet -> "Archive Page" Shortcut -> POST to n8n webhook
2. n8n cleans URL with tidy-url
3. n8n creates Karakeep bookmark
4. Karakeep rule fires: urlContains("fandom.com") -> addTag("fandom")
5. n8n also extracts the wiki name from URL (e.g., naruto.fandom.com -> tag fandom:naruto)
6. Karakeep monolith crawls the page
7. ArchiveBox archives: SingleFile (full page with all images/CSS), screenshot, readability text
8. Fallback chain runs if monolith fails
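The wiki-name extraction in step 5 might look like the sketch below. `fandom_wiki_tag` is a hypothetical helper name, and treating `www.fandom.com` as "no specific wiki" is an assumption.

```python
from typing import Optional
from urllib.parse import urlparse


def fandom_wiki_tag(url: str) -> Optional[str]:
    """Return "fandom:<wiki>" for <wiki>.fandom.com URLs, else None."""
    host = (urlparse(url).hostname or "").lower()
    suffix = ".fandom.com"
    if not host.endswith(suffix):
        return None
    wiki = host[: -len(suffix)]
    # Assumption: the bare site (www.fandom.com) has no wiki name to tag.
    if wiki in ("", "www"):
        return None
    return f"fandom:{wiki}"
```

Matching on the parsed hostname rather than the whole URL avoids tagging pages that merely mention fandom.com in their path or query string.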
Expected result:
- Bookmark with wiki page title
- Tags: fandom, fandom:naruto (or whatever wiki)
- Full offline archive (Fandom pages are mostly static HTML, monolith works well)
- ArchiveBox has screenshot + SingleFile backup
Note: Fandom pages are ad-heavy. SingleFile/monolith may capture ads. Readability extraction gives cleaner text.
UC6: Archive an Apple News Link
Trigger: Someone shares an Apple News link in Messages, or you share from Apple News app.
What happens:
1. Share sheet -> "Archive Page" Shortcut -> POST to n8n webhook
2. n8n detects apple.news/ URL
3. n8n resolves to real URL:
import re
import requests
response = requests.get("https://apple.news/AbCdEfGhIjKl", timeout=10)
match = re.search(r'redirectToUrlAfterTimeout\("([^"]+)"', response.text)
real_url = match.group(1) if match else None # e.g., https://washingtonpost.com/article/...
4. If not resolved (e.g., Apple News+ exclusive content): tag apple-news-exclusive, needs-attention
5. If resolved: n8n creates Karakeep bookmark with the real URL (not apple.news)
6. Normal pipeline: Karakeep rules tag it (e.g., news, source:washington-post)
7. Karakeep monolith crawls the real article
8. Fallback chain + ArchiveBox in parallel
Expected result:
- Bookmark points to the real article URL (washingtonpost.com), NOT apple.news
- Tags: news, source:washington-post, via:apple-news
- Publisher shows correctly (Washington Post, not Apple)
- Full archive of the actual article
Edge case: Apple News+ exclusive content has no public URL. These get tagged apple-news-exclusive + needs-attention for manual handling.
UC7: Auto-Tag Everything "News" as News
Implementation: Combination of Karakeep rules + n8n logic.
Karakeep rules (for known domains):
urlContains("washingtonpost.com") -> addTag("news")
urlContains("nytimes.com") -> addTag("news")
urlContains("theguardian.com") -> addTag("news")
urlContains("bbc.co.uk/news") -> addTag("news")
urlContains("reuters.com") -> addTag("news")
urlContains("apnews.com") -> addTag("news")
urlContains("independent.co.uk") -> addTag("news")
urlContains("cnn.com") -> addTag("news")
(... extend for all known news domains)
n8n fallback (for unknown domains):
- Maintain a JSON list of 500+ news domains in n8n (community lists exist)
- On every bookmark, check domain against list
- If match: add news tag via API
- If no match but AI tags suggest news content: add news tag
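The domain check in the n8n fallback could be sketched as below. `NEWS_DOMAINS` is a hypothetical four-entry excerpt standing in for the 500+ domain list, and `is_news_url` is an illustrative name.

```python
from urllib.parse import urlparse

# Hypothetical excerpt; the real list would be loaded from the JSON file.
NEWS_DOMAINS = {"washingtonpost.com", "nytimes.com", "theguardian.com", "reuters.com"}


def is_news_url(url: str, domains: set = NEWS_DOMAINS) -> bool:
    """True if the URL's host is a listed domain or a subdomain of one."""
    host = (urlparse(url).hostname or "").lower()
    parts = host.split(".")
    # Check "a.b.c", then "b.c", then "c", so subdomains match their parent.
    return any(".".join(parts[i:]) in domains for i in range(len(parts)))
```

Matching on label boundaries is stricter than a substring test: `notnytimes.com` does not match `nytimes.com`, whereas a plain "contains" check would.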
AI tagging: Configure Karakeep's AI settings (User Settings > AI Settings) with custom instruction: "If the content is a news article, always include the tag 'news'."
UC8: Every Item Tagged with Source
Implementation: n8n adds source:{domain} tag to every bookmark.
After creating a Karakeep bookmark, n8n:
from urllib.parse import urlparse
# hostname drops any port or userinfo; removeprefix strips only a leading "www."
domain = (urlparse(url).hostname or "").removeprefix("www.")
# "reddit.com", "washingtonpost.com", "naruto.fandom.com"
tag = f"source:{domain}"
# POST /bookmarks/{id}/tags with tag
This is automatic for every bookmark, regardless of type. Karakeep's bookmarkTypeIs condition (link/text/asset) also helps distinguish at a glance.
Combined with UC7: A bookmark from Washington Post gets:
- news (from Karakeep rule or n8n news domain list)
- source:washingtonpost.com (from n8n domain extraction)
- AI-generated topic tags (from Ollama)
UC9: Paywalled Content Archiving
Trigger: Want to archive a paywalled article.
What happens (automated attempt):
1. Bookmark created (via iOS Shortcut or Karakeep app)
2. Karakeep monolith tries to crawl -> fails (paywall blocks it)
3. n8n Tier 2: archive.is fallback
- Check archive.today/newest/{url} (someone else may have archived it)
- If found: link in note, tag paywalled, archived-via:archive.is
- If not found: archiveis.capture(url) (archive.is server may bypass soft paywalls)
4. If archive.is also gets paywalled content:
- Tag: paywalled, needs-manual-archive
- ntfy notification: "Paywalled article needs manual archive: {title}"
Manual fallback (desktop):
5. Open article in browser while logged in to your subscription
6. Run SingleFile CLI with cookies: single-file --browser-cookies-file cookies.txt {url} output.html
7. Push to Karakeep: POST /api/v1/bookmarks/singlefile?ifexists=overwrite
8. n8n detects the update, removes needs-manual-archive tag
9. Final tags: paywalled, archived (or paywalled, source:nytimes.com, news)
Tagging logic in n8n:
| Outcome | Tags applied |
|---|---|
| Monolith succeeded (not paywalled) | Normal tags only |
| Monolith failed, archive.is succeeded | paywalled, archived-via:archive.is |
| Monolith failed, archive.is got paywall too | paywalled, needs-manual-archive |
| Manual SingleFile uploaded | paywalled (remove needs-manual-archive) |
How n8n detects paywall vs other failures:
- If the crawled HTML contains paywall indicators (keywords like "subscribe", "paywall", "premium", short content length), tag as paywalled
- If crawl returned nothing at all (connection refused, 403), may be bot protection, not paywall -> different handling
- This heuristic isn't perfect but covers most cases
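A rough sketch of this three-way split is shown below. The function name, the 200-word threshold, and the choice to require both a keyword and short content are assumptions made for illustration; the source only names the signals.

```python
import re
from typing import Optional

PAYWALL_KEYWORDS = ("subscribe", "paywall", "premium")
MIN_WORDS = 200  # assumption: what counts as "short content"


def classify_crawl(status: Optional[int], html: str) -> str:
    """Return "ok", "paywalled", or "blocked" for one crawl result.

    status is None when the connection failed outright.
    """
    if status is None or status == 403 or not html.strip():
        return "blocked"  # likely bot protection, not a paywall
    text = re.sub(r"<[^>]+>", " ", html).lower()
    is_short = len(text.split()) < MIN_WORDS
    has_keyword = any(k in text for k in PAYWALL_KEYWORDS)
    # Assumption: a keyword alone is not enough, since full articles
    # often carry a "subscribe" link in the footer.
    return "paywalled" if is_short and has_keyword else "ok"
```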
UC10: Recover Old/Deleted Reddit Saves
Trigger: Have a list of old saved Reddit post URLs, some may be deleted.
What happens:
1. Import list of saved Reddit URLs (from Reddit export or manual list)
2. For each URL, n8n checks Reddit .json endpoint:
- author == "[deleted]" or selftext == "[removed]" -> post is deleted/removed
- Active posts: normal archival pipeline (UC1)
3. For deleted posts, n8n tries recovery services in order:
Step 1: Arctic Shift (best coverage, 2005-2026)
GET https://arctic-shift.photon-reddit.com/api/posts/ids?ids={post_id}
-> Returns original content if captured before deletion
Step 2: PullPush.io (good for recent data)
GET https://api.pullpush.io/reddit/search/submission/?ids={post_id}
Step 3: Wayback Machine (pre-Aug 2025 snapshots only)
GET https://archive.org/wayback/available?url={reddit_url}
-> Only works for posts archived before Reddit blocked Wayback in Aug 2025
4. If recovery succeeds:
- Create Karakeep bookmark with the old.reddit.com URL
- Save recovered content in note (title, author, body text)
- Tags: reddit, r/{subreddit}, recovered, deleted-post
- ArchiveBox archives whatever is still accessible
5. If recovery fails:
- Create Karakeep bookmark anyway (URL preserved)
- Tags: reddit, unrecoverable, deleted-post
- Note: "Post deleted, content not recovered. Checked: Arctic Shift, PullPush, Wayback Machine"
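The deletion check and the ordered recovery chain can be sketched as below. The fetchers are injected as plain callables so the ordering logic stays independent of any one service; `is_deleted`, `recover`, and `failure_note` are hypothetical names, and the real HTTP calls to Arctic Shift, PullPush, and the Wayback Machine are left out.

```python
from typing import Callable, Optional


def is_deleted(post: dict) -> bool:
    """Apply the step 2 check to a post's data object from the .json endpoint."""
    return post.get("author") == "[deleted]" or post.get("selftext") == "[removed]"


def recover(post_id: str, sources: list):
    """Try each (name, fetch) pair in order.

    fetch(post_id) is a hypothetical callable returning recovered content
    or None. Returns (winning_name, content, names_checked); on total
    failure the first two values are None.
    """
    checked = []
    for name, fetch in sources:
        checked.append(name)
        try:
            content: Optional[dict] = fetch(post_id)
        except Exception:
            content = None  # a failing service counts as a miss
        if content:
            return name, content, checked
    return None, None, checked


def failure_note(checked: list) -> str:
    """Build the note text used when nothing could be recovered."""
    return "Post deleted, content not recovered. Checked: " + ", ".join(checked)
```

Returning the list of services actually tried lets the failure note stay accurate if a service is later added to or removed from the chain.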
Key services:
| Service | Coverage | API | Status |
|---|---|---|---|
| Arctic Shift | 2005-2026, 2.5B items | REST, no auth, free | Working |
| PullPush.io | Active ingestion | REST, no auth, free | Working |
| Wayback Machine | Pre-Aug 2025 only | REST, no auth | No new Reddit snapshots |
| Pushshift | N/A | Mod-only since 2023 | Inaccessible |
| Unddit | N/A | Connection refused | Dead |
| Reveddit | N/A | Depends on Pushshift | Broken |
Batch workflow for old saves:
1. Export saved posts from Reddit (Settings > Data > Export)
2. Parse the export for post/comment URLs
3. For each URL:
a. Check if still active (.json endpoint)
b. If active: archive normally (UC1)
c. If deleted: recovery chain (Arctic Shift -> PullPush -> Wayback)
d. Create Karakeep bookmark with appropriate tags
4. Generate summary: X recovered, Y unrecoverable, Z still active
Recommended Architecture (Final — Implemented 2026-04-19)
iOS / Desktop
|
+-- iOS Shortcut "Archive Media" (share sheet)
| |
| +-- POST to n8n webhook "archive-media"
| Reddit: ArchiveBox only (dom/warc/screenshot/wget/media), tagged by subreddit
| Non-Reddit: Karakeep + ArchiveBox in parallel
|
+-- iOS Shortcut "Archive Website" (share sheet)
| |
| +-- POST to n8n webhook "archive-website"
| (ArchiveBox only, no Karakeep, no tags)
|
+-- Karakeep iOS app (native, for quick bookmarks)
|
+-- Karakeep browser extension (desktop)
|
v
n8n Orchestration Hub
|
+-- Webhook: "archive-media" (from iOS Shortcut "Archive Media")
| |-- URL type detection (Reddit vs generic)
| |-- tidy-url: clean redirects & tracking params
| |-- Reddit path:
| | |-- Resolve share links (/s/, redd.it)
| | |-- Rewrite to old.reddit.com
| | |-- Fetch .json (metadata, subreddit, media URLs)
| | |-- POST to ArchiveBox API wrapper (tag=subreddit)
| | |-- Archive extra media (galleries, videos)
| | |-- Skip Karakeep
| |-- Generic path:
| | |-- POST to Karakeep (bookmark + tags)
| | |-- POST to ArchiveBox (parallel)
| |-- Log to media-archives ledger (Forgejo CSV)
| |-- Return result to Shortcut
|
+-- Webhook: "archive-website" (from iOS Shortcut "Archive Website")
| |-- tidy-url: clean tracking params only
| |-- POST directly to ArchiveBox API wrapper
| |-- Optional: --depth=1 for full-site crawl
| |-- No Karakeep, no tagging, no fallback chain
| |-- Return: { "status": "queued" }
|
+-- Webhook: "karakeep-crawled" (from Karakeep)
| |-- Tier 1: Check htmlContent
| |-- Tier 2: archive.is fallback (paywalled domains only)
| |-- Tier 3: ntfy notification
| |-- Paywall detection + tagging
|
v
Karakeep (daily driver — non-Reddit bookmarks)
|-- Search, AI tagging (Ollama), highlighting
|-- iOS app with offline reading
|-- Rules engine: URL-based auto-tagging
|
ArchiveBox v0.7.3 (archive vault — all URLs)
|-- Extractors: dom, wget, warc, screenshot, media (yt-dlp), readability
|-- Disabled: singlefile, pdf, archive.org, headers, git
|-- API wrapper sidecar (POST /add on port 8001)
|-- Archives stored on urahara bind mount
|-- LXC 128, 192.168.1.128
|
media-archives ledger (Forgejo repo: claude/media-archives)
|-- Monthly CSV files (2026/04.csv, 2026/05.csv, ...)
|-- Columns: date_saved, url, source, type, title, tags, karakeep_id,
| archivebox_url, archive_today_url, reddit_account, status
|-- Updated by n8n via Forgejo API
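As an illustration of the ledger layout, the sketch below builds the monthly file path and one CSV line from the columns listed above. The helper names are hypothetical, and the write to Forgejo itself is omitted.

```python
import csv
import io
from datetime import date

LEDGER_COLUMNS = ["date_saved", "url", "source", "type", "title", "tags",
                  "karakeep_id", "archivebox_url", "archive_today_url",
                  "reddit_account", "status"]


def ledger_path(day: date) -> str:
    """Monthly file path inside the repo, e.g. 2026/04.csv."""
    return f"{day.year}/{day.month:02d}.csv"


def ledger_row(entry: dict) -> str:
    """Serialize one entry as a CSV line; missing columns become empty strings."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=LEDGER_COLUMNS,
                            extrasaction="ignore", lineterminator="\n")
    writer.writerow({col: entry.get(col, "") for col in LEDGER_COLUMNS})
    return buf.getvalue()
```

Going through the `csv` module rather than joining strings by hand matters here because the tags and title fields can contain commas, which need quoting.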
Karakeep Rules to Configure
# Source tagging (known domains)
bookmarkAdded + urlContains("reddit.com") -> addTag("reddit")
bookmarkAdded + urlContains("fandom.com") -> addTag("fandom")
# News tagging (per domain)
bookmarkAdded + urlContains("washingtonpost.com") -> addTag("news"), addTag("source:washington-post")
bookmarkAdded + urlContains("nytimes.com") -> addTag("news"), addTag("source:nytimes")
bookmarkAdded + urlContains("theguardian.com") -> addTag("news"), addTag("source:the-guardian")
bookmarkAdded + urlContains("bbc.co.uk/news") -> addTag("news"), addTag("source:bbc")
bookmarkAdded + urlContains("reuters.com") -> addTag("news"), addTag("source:reuters")
bookmarkAdded + urlContains("apnews.com") -> addTag("news"), addTag("source:ap")
bookmarkAdded + urlContains("independent.co.uk") -> addTag("news"), addTag("source:the-independent")
bookmarkAdded + urlContains("cnn.com") -> addTag("news"), addTag("source:cnn")
(... extend as needed)
# Source tagging (via n8n for all bookmarks)
# n8n extracts domain and adds source:{domain} tag via API
Readwise Gap
No self-hosted tool replaces Readwise's spaced repetition / daily review. Karakeep highlighting + Obsidian export is only a partial substitute. Accept the gap or keep paying.
Decision Matrix
| Approach | Effort | Coverage |
|---|---|---|
| Karakeep + n8n (URL cleaning + fallback + tagging) | Medium | 90% of needs |
| Karakeep + n8n + ArchiveBox vault | Medium-High | 95% of needs |
| Karakeep + n8n + ArchiveBox + iOS Shortcut + Reddit recovery | High | 98% of needs |
Paywalled Article Pipeline (Detailed — 2026-04-20)
Confirmed Implementation Details
| Component | Status | Location |
|---|---|---|
| Paywalled domains list | .txt file in Forgejo repo | holo/media-archives repo, paywalled-domains.txt (NOTE: was under claude/ — needs to be moved to correct owner) |
| Proactive detection | Implemented | archive-media workflow, "Clean Generic URL" node checks domain against list |
| Hardcoded fallback list | Implemented | reuters.com, bloomberg.com, wsj.com, nytimes.com, ft.com, economist.com, washingtonpost.com, theathletic.com, businessinsider.com, insider.com, thetimes.co.uk, telegraph.co.uk, hbr.org, seekingalpha.com, barrons.com |
| CSV ledger | Implemented | Monthly CSV (e.g. 2026/04.csv) in same repo, written via Forgejo API |
| Reactive detection | Implemented | karakeep-crawled workflow, "Paywall Detection" node (short content + paywall keywords) |
| Auto-add domain to list | Implemented | karakeep-crawled workflow appends new domain to paywalled-domains.txt |
| Karakeep SingleFile endpoint | Available but NOT wired | POST /api/v1/bookmarks/singlefile?ifexists=overwrite |
Desired Flow (user-specified 2026-04-20)
1. User sends link via iOS Shortcut
2. n8n reads paywalled-domains.txt to check if domain is known-paywalled
3. If NOT paywalled → normal processing (Karakeep bookmark, tags, ledger)
4. If paywalled:
a. Process article through archive.today (get de-paywalled content)
b. Store the de-paywalled content IN Karakeep (not just a link)
c. Bookmark shows the ORIGINAL article URL (not archive.today URL)
d. Tag: "paywalled"
e. Note: link to archive.today version
5. Every processed link (Reddit, article, paywalled, etc.) → logged to CSV ledger
Resolved (2026-04-20)
- De-paywalled content in Karakeep: Using Option A: fetch the archive.today snapshot HTML and upload it via Karakeep's SingleFile endpoint (POST /api/v1/bookmarks/singlefile?ifexists=overwrite) with the original URL. The bookmark shows the original article URL, but the stored content is the de-paywalled version.
- Repo ownership: Created homelab org on Forgejo. Repo transferred from claude/media-archives to homelab/media-archives. Both claude and holo are Owners.
- "Process using archive.today" means: Check archive.today/newest/<url> with redirect: 'manual'. If 302, a snapshot exists: follow the redirect to get the snapshot URL, fetch the full HTML, and upload it to the Karakeep SingleFile endpoint. If no snapshot exists, fall back to note-only (link to archive.today for manual use).
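The snapshot lookup can be sketched as follows. The HTTP call is injected as `head` so the decision logic can be exercised without network access; `find_snapshot` and `requests_head` are hypothetical names, and `allow_redirects=False` is the `requests` counterpart of `redirect: 'manual'`.

```python
from typing import Callable, Optional


def find_snapshot(url: str, head: Callable) -> Optional[str]:
    """Return the archive.today snapshot URL if one exists, else None.

    head(u) must request u WITHOUT following redirects and return
    (status_code, Location header or None).
    """
    status, location = head(f"https://archive.today/newest/{url}")
    if status == 302 and location:
        return location
    return None  # no snapshot, rate limited (429), or any other response


def requests_head(u: str):
    """One possible `head` implementation, assuming `requests` is installed."""
    import requests
    r = requests.get(u, allow_redirects=False, timeout=30)
    return r.status_code, r.headers.get("Location")
```

A caller would use `find_snapshot(article_url, requests_head)` and fall back to the note-only path whenever it returns None.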
Current Implementation (v3 — 2026-04-21, profile-driven refactor)
Infrastructure: Dedicated archival LXC 129 (192.168.1.129) with:
- Ladder (port 8080): Go-based paywall bypass proxy with YAML rulesets
- Byparr (port 3001→8191): Camoufox stealth browser (DataDome/Cloudflare bypass)
- browserless/chromium (port 3000): Direct headless Chrome API (last resort fallback)
n8n (LXC 120) calls archival services via IP: LADDER_URL=http://192.168.1.129:8080, BYPARR_URL=http://192.168.1.129:3001, BROWSERLESS_URL=http://192.168.1.129:3000.
Profile-driven routing (replaces hardcoded if/else chains):
The workflow now uses URL profiles to route incoming links. Profiles map domains to handlers and strategy chains. Config is stored in $getWorkflowStaticData('global') (n8n SQLite), bootstrapped from hardcoded defaults on first run, with learned changes synced to url-profiles.json in Forgejo.
Handlers:
| Handler | Domains | Destination |
|---|---|---|
| archivebox | reddit.com, github.com | ArchiveBox (dedicated pipelines) |
| archive-passthrough | archive.today/fo/ph/is/li/md | Extract original URL → Byparr fetch → Karakeep |
| depaywall | reuters, bloomberg, wsj, nyt, wapo, ft, medium, etc. | Strategy chain → Karakeep (SingleFile) |
| default | Everything else | Karakeep (direct bookmark) |
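Routing by profile might look like the sketch below. The profile shape shown is hypothetical (the layout of url-profiles.json is not given here), and only a few domains from the table are included.

```python
from urllib.parse import urlparse

# Hypothetical profile shape, keyed by registrable domain.
PROFILES = {
    "reddit.com": {"handler": "archivebox"},
    "github.com": {"handler": "archivebox"},
    "archive.today": {"handler": "archive-passthrough"},
    "wsj.com": {"handler": "depaywall",
                "strategies": ["ladder", "byparr", "wayback", "browserless"]},
}


def pick_profile(url: str, profiles: dict = PROFILES) -> dict:
    """Return the profile for the URL's host or nearest parent domain."""
    host = (urlparse(url).hostname or "").lower()
    parts = host.split(".")
    for i in range(len(parts)):
        candidate = ".".join(parts[i:])
        if candidate in profiles:
            return profiles[candidate]
    return {"handler": "default"}  # everything else goes straight to Karakeep
```

Because the lookup walks from the full host up to its parents, `old.reddit.com` and `www.wsj.com` resolve to the same profiles as their bare domains without separate entries.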
Strategy chain (for depaywall handler):
1. Ladder /raw/{url} — ruleset applies headers (Googlebot UA, cookies, referer)
2. Byparr POST /v1 — Camoufox stealth browser (DataDome, Cloudflare)
3. Wayback Machine SPN2 API — submit + poll + fetch archived page
4. Browserless /content — Googlebot UA headless Chrome (last resort)
5. Fail → plain bookmark + ntfy notification + tag needs-manual-archive
Self-validation (post-upload): After each strategy uploads HTML to Karakeep:
1. Wait 5s for Karakeep to process
2. Pull bookmark back and check that content exists and is substantial
3. Check for challenge/paywall signals in stored content
4. If invalid: DELETE bookmark, try next strategy
5. Conservative: if content endpoint unavailable, assume OK (don't delete good bookmarks)
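The chain and the post-upload check fit together roughly as in the sketch below. Every callable is a hypothetical stand-in for the corresponding n8n step (strategy fetch, Karakeep upload, read-back, DELETE), so this shows control flow only.

```python
import time


def run_chain(url, strategies, upload, read_back, delete, is_valid, wait_s=5.0):
    """Try strategies in order; return (winner_name, bookmark_id) or (None, None).

    strategies: list of (name, fetch), where fetch(url) returns HTML or None
    upload(url, html): stores the HTML and returns a bookmark id
    read_back(bookmark_id): stored HTML, or None if the endpoint is unavailable
    delete(bookmark_id): removes a bad bookmark
    is_valid(html): True if the content passes verification
    """
    for name, fetch in strategies:
        try:
            html = fetch(url)
        except Exception:
            html = None  # a crashed strategy counts as a miss
        if not html:
            continue
        bookmark_id = upload(url, html)
        time.sleep(wait_s)  # give Karakeep time to process the upload
        stored = read_back(bookmark_id)
        # Conservative: if stored content cannot be read, keep the bookmark.
        if stored is None or is_valid(stored):
            return name, bookmark_id
        delete(bookmark_id)
    return None, None
```

A `(None, None)` result corresponds to the terminal-failure branch: plain bookmark, ntfy notification, and the needs-manual-archive tag.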
Profile learning:
- Known domain, same strategy wins: no change
- Known domain, different strategy wins: reorder (move winner to front, keep others)
- New domain succeeds: create profile with winning strategy first + ntfy "domain learned"
- Sync to Forgejo: fire-and-forget write to url-profiles.json after learn events
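The three learning cases can be sketched as one update function. `learn` and the profile shape are hypothetical; for a new domain, placing the remaining default strategies after the winner is an assumption.

```python
def learn(profiles: dict, domain: str, winner: str, default_chain: list) -> str:
    """Update profiles in place; return "unchanged", "reordered", or "learned"."""
    profile = profiles.get(domain)
    if profile is None:
        # New domain: winner first, the rest keep their default order (assumption).
        rest = [s for s in default_chain if s != winner]
        profiles[domain] = {"handler": "depaywall", "strategies": [winner] + rest}
        return "learned"
    chain = profile["strategies"]
    if chain and chain[0] == winner:
        return "unchanged"
    # Known domain, different winner: move it to the front, keep the others.
    profile["strategies"] = [winner] + [s for s in chain if s != winner]
    return "reordered"
```

The return value tells the caller what to do next: "learned" triggers the ntfy message, and anything other than "unchanged" triggers the sync to url-profiles.json.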
ntfy notifications (specific per event):
- Per-strategy failure: "byparr returned bad content for wsj.com — trying next strategy"
- Terminal failure: "All strategies failed for wsj.com/article-xyz"
- Learning event: "New domain learned: economist.com — strategy ladder worked"
Content verification checks:
- HTML length > 1000 chars
- No bot-challenge indicators (captcha-delivery.com, datadome, verify your identity, etc.)
- Word count > 200 (after stripping tags/scripts/styles)
- No paywall CTA indicators (checks for 2+ signals: "subscribe to continue", "sign in to read", etc.)
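These checks could be combined as in the sketch below. The signal tuples contain only the examples named above (the full lists are not given), and `verify_content` is a hypothetical name.

```python
import re

# Only the examples named in the text; the real lists are longer ("etc.").
BOT_SIGNALS = ("captcha-delivery.com", "datadome", "verify your identity")
CTA_SIGNALS = ("subscribe to continue", "sign in to read")


def verify_content(html: str) -> list:
    """Return the names of failed checks; an empty list means the content looks OK."""
    problems = []
    if len(html) <= 1000:
        problems.append("html too short")
    if any(s in html.lower() for s in BOT_SIGNALS):
        problems.append("bot challenge")
    # Drop script/style bodies, then all remaining tags, before counting words.
    text = re.sub(r"(?is)<(script|style)\b.*?</\1>", " ", html)
    text = re.sub(r"<[^>]+>", " ", text)
    if len(text.split()) <= 200:
        problems.append("too few words")
    if sum(s in text.lower() for s in CTA_SIGNALS) >= 2:
        problems.append("paywall CTA")
    return problems
```

Returning the failed check names instead of a bare boolean makes it easy to put the reason into the per-strategy ntfy message.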
Ladder ruleset (services/archival/ruleset.yaml):
- Reuters, WSJ, Bloomberg → DataDome-protected (Ladder best-effort, Byparr usually needed)
- NYT, WaPo, The Athletic, Conde Nast, FT, Medium → Googlebot UA + header tricks
- Custom HTML injections to strip ads/overlays per-domain
Key findings:
- Reuters/Bloomberg: DataDome bot protection. Requires Camoufox (Byparr); no header trick works.
- WSJ: Hard paywall + DataDome. All automated strategies currently fail. Wayback is best bet.
- archive.today via Byparr: Times out (~56s). Direct HTTP gets 429. Only works as manual passthrough.
API patterns in Code nodes
- Process with Profile (node-am-023): Uses this.helpers.httpRequest() + Buffer.from(), the n8n-official API.
- Reddit/GitHub/Generic nodes (006, 010, 011, 016, 020): Still use fetch() + atob()/btoa(). Both patterns work in n8n 2.13.2.
Next Steps
- [x] Deploy ArchiveBox LXC (v0.7.3 + API wrapper sidecar, LXC 128) — 2026-04-19
- [x] Build n8n "archive-media" webhook (Reddit-specific + generic paths) — 2026-04-19
- [x] Build n8n Reddit-specific handler (subreddit tags, .json fetch, media) — 2026-04-19
- [x] Set up media-archives ledger repo (monthly CSV via Forgejo API) — 2026-04-19
- [x] Update n8n "archive-website" webhook for v0.7.3 API wrapper — 2026-04-19
- [x] Deploy archival proxy stack (Ladder + Byparr + browserless, LXC 129) — 2026-04-20
- [x] Build profile-driven workflow with self-validation and learning — 2026-04-21
- [ ] Configure Karakeep rules engine (news domains, reddit, fandom)
- [ ] Build n8n Tier 1/2/3 fallback workflow (karakeep-crawled webhook)
- [ ] Build n8n Apple News URL resolver
- [ ] Build n8n Google Search handler (extract query, screenshot)
- [ ] Build n8n source:{domain} auto-tagger
- [ ] Build n8n bulk import workflow
- [ ] Build n8n Reddit recovery workflow (Arctic Shift + PullPush)
- [ ] Build iOS Shortcut "Archive Media"
- [ ] Build iOS Shortcut "Archive Website"
- [ ] Add Reddit RSS feeds to Karakeep (old.reddit.com)
- [ ] Test Karakeep highlighting workflow on iOS
- [ ] Decide on Readwise gap: accept, build custom, or keep paying