ArchiveBox — Setup
Web archival vault running ArchiveBox v0.7.3 with an API wrapper sidecar. Captures DOM, WARC, screenshots, wget mirrors, and media (yt-dlp) for every archived URL. Primarily used for Reddit post archival. Readability disabled (not useful for Reddit threads). Runs as Docker containers on a dedicated Debian LXC (128). Auth via oauth2-proxy + PocketID.
Infrastructure
| Host | LXC ID | Internal | External | CPU | RAM | Disk |
|---|---|---|---|---|---|---|
| Debian LXC | 128 | 192.168.1.128:8000 (web) / :8001 (API) | https://archive.eva-00.network | 2 cores | 3 GiB | 16 GiB |
Storage Layout
| Data | Location | Notes |
|---|---|---|
| SQLite DB + config | /opt/archivebox/data (local rootfs) | Fast FSYNC required |
| Archives (DOM, WARC, screenshots, media) | /mnt/archivebox/archive (urahara bind mount) | Growth-prone, backed up via Backrest |
| API wrapper source | /opt/archivebox/api-wrapper (local rootfs) | Built as Docker image on deploy |
Proxmox bind mount: /mnt/pve/urahara/archivebox/archive -> /mnt/archivebox/archive
Architecture
ArchiveBox v0.7.3 has no REST API. A thin Python HTTP wrapper (api-wrapper) runs as a sidecar container, sharing the /data volume, and exposes the CLI via HTTP:
                      +-----------------+
                      |   archivebox    |
    Port 8000 (web)   |     v0.7.3      |
    <---------------->|  (Web UI only)  |
                      |  /data volume   |
                      +--------+--------+
                               |
                      shared /data volume
                               |
                      +--------+--------+
                      |   api-wrapper   |
    Port 8001 (API)   |  (Python HTTP)  |
    <---------------->|  Bearer token   |
                      |  /data volume   |
                      +-----------------+
API Wrapper Endpoints
| Method | Path | Purpose |
|---|---|---|
| POST | /add | Archive a URL (params: url, depth, tag, overwrite) |
| GET | /health | Health check |
Auth: Authorization: Bearer <token> (token stored in Vault)
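The wrapper's source (app.py) is not reproduced here, so the following is only a minimal sketch of how such a sidecar could look, using the Python standard library. The token constant, the JSON response shapes, the unauthenticated /health route, and the /data working directory are assumptions; the `--depth`, `--tag`, and `--overwrite` flags are those of the `archivebox add` CLI.

```python
import hmac
import json
import subprocess
from http.server import BaseHTTPRequestHandler, HTTPServer

API_TOKEN = "change-me"  # assumption: injected from Vault at deploy time


def is_authorized(header, token):
    """Return True when the Authorization header carries the expected Bearer token."""
    if not header or not header.startswith("Bearer "):
        return False
    # Constant-time comparison avoids leaking the token through timing.
    return hmac.compare_digest(header[len("Bearer "):].encode(), token.encode())


def build_add_command(params):
    """Translate /add parameters into an `archivebox add` argument list."""
    cmd = ["archivebox", "add", f"--depth={int(params.get('depth', 0))}"]
    if params.get("tag"):
        cmd.append(f"--tag={params['tag']}")
    if params.get("overwrite"):
        cmd.append("--overwrite")
    cmd.append(params["url"])
    return cmd


class Handler(BaseHTTPRequestHandler):
    def _reply(self, status, payload):
        body = json.dumps(payload).encode()
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def do_GET(self):
        if self.path == "/health":
            self._reply(200, {"status": "ok"})
        else:
            self._reply(404, {"error": "not found"})

    def do_POST(self):
        if self.path != "/add":
            return self._reply(404, {"error": "not found"})
        if not is_authorized(self.headers.get("Authorization"), API_TOKEN):
            return self._reply(401, {"error": "unauthorized"})
        length = int(self.headers.get("Content-Length", 0))
        try:
            params = json.loads(self.rfile.read(length) or b"{}")
            cmd = build_add_command(params)
        except (ValueError, KeyError, TypeError, AttributeError):
            return self._reply(400, {"error": "bad request"})
        # Runs in the sidecar, which shares /data with the archivebox container.
        result = subprocess.run(cmd, cwd="/data", capture_output=True, text=True)
        status = 200 if result.returncode == 0 else 500
        self._reply(status, {"returncode": result.returncode})


# To serve on the API port:
# HTTPServer(("0.0.0.0", 8001), Handler).serve_forever()
```

Passing the arguments as a list (rather than a shell string) keeps a hostile URL from being interpreted by a shell.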
Observability
Logs
Container logs are collected via Docker log driver. Query in Loki:
| Query | Purpose |
|---|---|
| {container_name="archivebox"} | ArchiveBox app logs |
| {container_name="archivebox-api"} | API wrapper logs |
| {container_name="archivebox"} \|= "error" | Errors only |
| {container_name="archivebox-api"} \|= "POST /add" | Archive requests |
Access: Grafana -> Explore -> Loki -> Enter query
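For scripted checks outside Grafana, the same queries can be sent to Loki's HTTP API. The sketch below only builds the LogQL string and the `query_range` URL; the Loki address is a placeholder, since this document does not state where Loki listens.

```python
from urllib.parse import urlencode

# Assumption: placeholder address, not taken from this document.
LOKI_URL = "http://loki.example.internal:3100"


def logql(container, contains=None):
    """Build a LogQL selector for one container, optionally with a line filter."""
    query = f'{{container_name="{container}"}}'
    if contains:
        query += f' |= "{contains}"'
    return query


def query_range_url(query, limit=100):
    """Return the Loki query_range URL for a LogQL query."""
    params = urlencode({"query": query, "limit": limit})
    return f"{LOKI_URL}/loki/api/v1/query_range?{params}"


print(query_range_url(logql("archivebox-api", "POST /add")))
```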
IaC
| Artifact | Path |
|---|---|
| Playbook | ansible/playbooks/archivebox.yml |
| Workflow | .forgejo/workflows/archivebox.yml |
| Docker Compose | services/archivebox/docker-compose.yml |
| API Wrapper | services/archivebox/api-wrapper/ (app.py + Dockerfile) |
| Caddy entry | services/caddy/Caddyfile -> archive.eva-00.network |
| Glance entry | services/glance/glance.yml -> Knowledge section |
| OAuth2 proxy | services/external-proxies/docker-compose.yml -> port 8592 |
The playbook manages the full lifecycle:
- Installs Docker
- Fetches secrets from Vault (admin creds + API wrapper token)
- Deploys API wrapper source files
- Builds api-wrapper Docker image locally
- Deploys Docker Compose stack (archivebox + api-wrapper)
- Runs archivebox init --setup (first run)
- Creates holo superuser via Django management command
- Stores API wrapper token in Vault
- Deploys Alloy (monitoring)
Auth
| Component | Details |
|---|---|
| Auth method | oauth2-proxy on LXC 119 (port 8592) |
| OIDC Provider | PocketID (auth.eva-00.network) |
| Callback URL | https://archive.eva-00.network/oauth2/callback |
| PocketID Client ID | c31a0831-3877-4ca5-883a-179a252474a4 |
| ArchiveBox admin | holo (only user, Django superuser) |
| API wrapper | Bearer token (no OIDC, internal only) |
ArchiveBox has no native OIDC support. All web access is gated by oauth2-proxy. The API wrapper (port 8001) is internal-only and uses a static Bearer token for n8n integration.
Secrets
Vault path: secret/data/archivebox
| Key | Purpose |
|---|---|
| admin_username | Django superuser username (holo) |
| admin_password | Django superuser password |
| api_key | API wrapper Bearer token (used by n8n) |
Vault path: secret/data/external-oauth2-proxies (shared)
| Key | Purpose |
|---|---|
| archivebox_client_id | PocketID OIDC client ID |
| archivebox_client_secret | PocketID OIDC client secret |
| archivebox_cookie_secret | oauth2-proxy cookie encryption key |
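The `secret/data/...` paths indicate a KV version 2 mount, where the HTTP API nests the stored keys under `data.data`. As a rough illustration of reading one key outside Ansible, assuming the conventional `VAULT_ADDR` and `VAULT_TOKEN` environment variables are set (the playbook's actual lookup mechanism is not shown in this document):

```python
import json
import os
from urllib.request import Request, urlopen


def vault_request(path, addr, token):
    """Build a KV v2 read request; `path` already includes the data/ segment."""
    return Request(f"{addr.rstrip('/')}/v1/{path}", headers={"X-Vault-Token": token})


def extract_secret(payload, key):
    """KV v2 responses nest the key/value pairs under data.data."""
    return payload["data"]["data"][key]


def read_secret(path, key):
    # Assumption: VAULT_ADDR and VAULT_TOKEN are present in the environment.
    req = vault_request(path, os.environ["VAULT_ADDR"], os.environ["VAULT_TOKEN"])
    with urlopen(req, timeout=10) as resp:
        return extract_secret(json.load(resp), key)


# Example: read_secret("secret/data/archivebox", "api_key")
```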
Archiving Configuration
ArchiveBox is configured for maximum fidelity on web pages and Reddit posts:
| Extractor | Enabled | Purpose |
|---|---|---|
| DOM | Yes | Raw page HTML dump |
| wget | Yes | Full recursive mirror + generates WARC |
| WARC | Yes | Web archive format (requires wget) |
| Screenshot | Yes | Full-page PNG via headless Chrome |
| Media | Yes | Video/audio download via yt-dlp |
| Readability | No | Not useful for Reddit posts (primary use case) |
| SingleFile | No | Slow, Chrome-heavy, not needed with DOM+wget |
| | No | Redundant |
| archive.org | No | Don't submit to Internet Archive |
| Headers | No | Not useful for archival |
| Git | No | Not relevant |
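ArchiveBox toggles extractors through `SAVE_*` configuration options. The sketch below renders the named rows of the table as environment-style lines; whether this deployment sets them in docker-compose.yml or in ArchiveBox.conf is not stated here, so treat the output format as an assumption. The row whose extractor name is missing is left out.

```python
# Extractor toggles from the table above, keyed by ArchiveBox's SAVE_* option names.
EXTRACTORS = {
    "SAVE_DOM": True,
    "SAVE_WGET": True,
    "SAVE_WARC": True,
    "SAVE_SCREENSHOT": True,
    "SAVE_MEDIA": True,
    "SAVE_READABILITY": False,
    "SAVE_SINGLEFILE": False,
    "SAVE_ARCHIVE_DOT_ORG": False,
    "SAVE_HEADERS": False,
    "SAVE_GIT": False,
}


def env_lines(settings):
    """Render the toggles as NAME=True/False lines."""
    return [f"{name}={'True' if enabled else 'False'}" for name, enabled in settings.items()]


print("\n".join(env_lines(EXTRACTORS)))
```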
n8n Integration — Archival Workflows
Two separate workflows, two iOS Shortcuts:
flowchart TD
A[iOS / macOS Share Sheet] --> B{Which Shortcut?}
B -->|"Archive Media"| C["/webhook/archive-media"]
B -->|"Archive Website"| D["/webhook/archive-website"]
C --> E{Reddit URL?}
E -->|Yes| F[Clean URL → old.reddit.com]
F --> G[Fetch .json metadata]
G --> H[ArchiveBox API wrapper]
H --> I[Tag with subreddit]
I --> J[Archive extra media]
J --> K[Log to ledger]
E -->|No| L{URL type?}
L -->|PDF| M[Karakeep bookmark]
L -->|GitHub / Google| M
L -->|Generic article| M
M --> K
D --> N[Clean URL]
N --> O[ArchiveBox API wrapper]
O --> P[DOM + WARC + screenshot + wget + media]
K --> Q[Respond to client]
style H fill:#f96,stroke:#333
style O fill:#f96,stroke:#333
style M fill:#69f,stroke:#333
archive-media (iOS Shortcut: "Archive Media")
Routes URLs based on type:
| URL type | Destination | Notes |
|---|---|---|
| Reddit post | ArchiveBox only | Tags with subreddit, fetches .json metadata |
| PDF (.pdf) | Karakeep only | Stored as document bookmark |
| GitHub / Google (docs, search) | Karakeep only | Better reading experience |
| Everything else | Karakeep only | Generic bookmarks |
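The routing table can be expressed as a small classifier. This is an illustrative sketch rather than the n8n node's actual logic: the host-matching rules, the category names, and the `DESTINATION` mapping are assumptions derived from the table.

```python
from urllib.parse import urlparse


def _host_is(host, domain):
    """True for the domain itself or any of its subdomains."""
    return host == domain or host.endswith("." + domain)


def classify(url):
    """Return 'reddit', 'pdf', 'github_google', or 'generic' for a URL."""
    # Shared links sometimes arrive without a scheme (e.g. "reddit.com/...").
    parsed = urlparse(url if "://" in url else "https://" + url)
    host = (parsed.hostname or "").lower()
    if _host_is(host, "reddit.com"):
        return "reddit"
    if parsed.path.lower().endswith(".pdf"):
        return "pdf"
    if _host_is(host, "github.com") or _host_is(host, "google.com"):
        return "github_google"
    return "generic"


# Destination per category, following the table above.
DESTINATION = {
    "reddit": "archivebox",
    "pdf": "karakeep",
    "github_google": "karakeep",
    "generic": "karakeep",
}
```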
Reddit path:
POST /webhook/archive-media { url: "reddit.com/..." }
-> Detect Reddit URL
-> Clean URL (resolve /s/ share links, rewrite to old.reddit.com)
-> Fetch .json metadata (title, subreddit, type, media URLs)
-> POST to API wrapper /add (tag=subreddit name)
-> Archive extra media URLs (images/videos from post)
-> Log to media-archives ledger (Forgejo repo)
-> Respond with result
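The URL-handling steps of the Reddit path can be sketched as pure functions. Resolving /s/ share links is omitted because it requires following an HTTP redirect; the function names are hypothetical and the subreddit pattern is an assumption.

```python
import re
from urllib.parse import urlsplit, urlunsplit


def to_old_reddit(url):
    """Rewrite any reddit.com post URL to old.reddit.com, dropping query and fragment."""
    parts = urlsplit(url if "://" in url else "https://" + url)
    path = parts.path.rstrip("/") + "/"
    return urlunsplit(("https", "old.reddit.com", path, "", ""))


def json_url(post_url):
    """Return the .json metadata URL for a post."""
    return to_old_reddit(post_url).rstrip("/") + ".json"


def subreddit_of(url):
    """Extract the subreddit name used as the ArchiveBox tag, or None."""
    match = re.search(r"/r/([A-Za-z0-9_]+)", urlsplit(url).path)
    return match.group(1) if match else None


post = "https://www.reddit.com/r/python/comments/abc123/title/?utm_source=share"
print(to_old_reddit(post), json_url(post), subreddit_of(post))
```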
Non-Reddit path (PDFs, GitHub, Google, generic):
POST /webhook/archive-media { url: "..." }
-> Clean URL (strip tracking, unwrap Google redirects)
-> Create Karakeep bookmark
-> Log to media-archives ledger
-> Respond with result
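A possible shape for the "Clean URL" step on this path is shown below. The list of tracking parameters and the Google redirect pattern (`/url?q=...`) are assumptions; the workflow's real rules may be broader.

```python
from urllib.parse import parse_qs, parse_qsl, urlencode, urlsplit, urlunsplit

TRACKING_PREFIXES = ("utm_",)        # assumption: illustrative list only
TRACKING_KEYS = {"fbclid", "gclid"}  # assumption: illustrative list only


def unwrap_google(url):
    """Return the target of a google.com/url?q=... redirect, or the URL unchanged."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if (host == "google.com" or host.endswith(".google.com")) and parts.path == "/url":
        query = parse_qs(parts.query)
        for key in ("q", "url"):
            if query.get(key):
                return query[key][0]
    return url


def strip_tracking(url):
    """Drop known tracking parameters while keeping all other query parameters."""
    parts = urlsplit(url)
    kept = [
        (k, v)
        for k, v in parse_qsl(parts.query, keep_blank_values=True)
        if not k.lower().startswith(TRACKING_PREFIXES) and k.lower() not in TRACKING_KEYS
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


def clean_url(url):
    # Unwrap first so tracking parameters on the inner URL are stripped too.
    return strip_tracking(unwrap_google(url))
```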
archive-website (iOS Shortcut: "Archive Website")
For full website archiving — sends directly to ArchiveBox, no Karakeep.
POST /webhook/archive-website { url: "...", depth: 0 }
-> Clean URL (strip tracking, rewrite Reddit to old.reddit.com)
-> POST to API wrapper /add
-> Respond with result
Use this when you specifically want a full DOM/WARC/screenshot archive of a non-Reddit page.
Media Archives Ledger
All processed links are logged to claude/media-archives repo on Forgejo as monthly CSVs:
- Path: 2026/04.csv (year/month)
- Columns: date_saved, url, source, type, title, tags, karakeep_id, archivebox_url, archive_today_url, reddit_account, status
- Updated via Forgejo API (Contents endpoint)
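To illustrate the ledger layout, the sketch below derives the monthly file path and serializes one entry in the documented column order. The upload through the Forgejo Contents endpoint is left out, and the helper names are hypothetical.

```python
import csv
import io
from datetime import date

COLUMNS = [
    "date_saved", "url", "source", "type", "title", "tags", "karakeep_id",
    "archivebox_url", "archive_today_url", "reddit_account", "status",
]


def ledger_path(day):
    """Return the year/month CSV path for a given date."""
    return f"{day:%Y}/{day:%m}.csv"


def ledger_row(entry):
    """Serialize one entry as a CSV line; missing columns become empty fields."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=COLUMNS, restval="", lineterminator="\n")
    writer.writerow(entry)
    return buf.getvalue()


print(ledger_path(date(2026, 4, 9)))
print(ledger_row({"date_saved": "2026-04-09", "url": "https://example.com", "status": "ok"}))
```

Using the csv module rather than string joins keeps titles containing commas or quotes from breaking the row.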
Reddit Archiving Notes
- old.reddit.com is essential — standard DOM, wget works, no Shadow DOM
- Reddit .json endpoint (append .json to any post URL) works without auth for individual posts (~10 req/min)
- archivebox update only runs missing extractors (smart incremental)
- Typical Reddit post archive: ~5-10MB (dom + warc + screenshot + wget + media)
- Estimated disk for 10,000 posts: ~70-100GB
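The .json response for a post is a two-element array (the post listing, then the comment listing), with the post fields under `data.children[0].data`. A minimal sketch of pulling out the fields the workflow uses, assuming that layout and a hypothetical output shape:

```python
def summarize_post(payload):
    """Extract title, subreddit, and link info from a post's .json payload."""
    post = payload[0]["data"]["children"][0]["data"]
    return {
        "title": post.get("title"),
        "subreddit": post.get("subreddit"),
        "is_video": bool(post.get("is_video")),
        "link": post.get("url"),
    }


# Trimmed-down stand-in for a real response, for illustration only.
sample = [
    {"data": {"children": [{"data": {
        "title": "Example post",
        "subreddit": "python",
        "is_video": False,
        "url": "https://i.redd.it/example.png",
    }}]}},
    {"data": {"children": []}},
]
print(summarize_post(sample))
```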
Backup Strategy
| Data | Mechanism | Schedule |
|---|---|---|
| LXC 128 snapshot (rootfs + SQLite) | PBS | Weekly |
| SQLite DB (/opt/archivebox/data/index.sqlite3) | Databasement | Daily 2:00am |
| Archive files on urahara | Backrest (archivebox-archives plan) | Daily 3:30am |
| Offline | senku USB rsync (/Volumes/sazabi/urahara/archivebox/) | On-demand |
Adding URLs Programmatically
# Via API wrapper (preferred for automation)
curl -X POST http://192.168.1.128:8001/add \
-H "Authorization: Bearer <token>" \
-H "Content-Type: application/json" \
-d '{"url": "https://old.reddit.com/r/...", "tag": "subreddit_name"}'
# Via CLI (direct)
ssh <user>@192.168.1.128 docker exec archivebox archivebox add 'https://example.com'
# Via n8n webhook (iOS Shortcuts)
curl -X POST https://n8n.eva-00.network/webhook/archive-media \
-H "Authorization: Bearer <webhook_secret>" \
-H "Content-Type: application/json" \
-d '{"url": "https://reddit.com/r/..."}'
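The API wrapper call can also be made from Python with only the standard library. This sketch mirrors the first curl example; the response format of the wrapper is not documented here, so the body is returned as raw text, and the function names are hypothetical.

```python
import json
from urllib.request import Request, urlopen

API_URL = "http://192.168.1.128:8001/add"


def build_add_request(url, token, tag=None, depth=0):
    """Build the POST /add request without sending it."""
    body = {"url": url, "depth": depth}
    if tag:
        body["tag"] = tag
    return Request(
        API_URL,
        data=json.dumps(body).encode(),
        method="POST",
        headers={"Authorization": f"Bearer {token}", "Content-Type": "application/json"},
    )


def add_url(url, token, **kwargs):
    """Send the request; archiving can be slow, hence the generous timeout."""
    with urlopen(build_add_request(url, token, **kwargs), timeout=300) as resp:
        return resp.status, resp.read().decode()


# Example: add_url("https://old.reddit.com/r/...", "<token>", tag="subreddit_name")
```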