ArchiveBox — Setup

Web archival vault running ArchiveBox v0.7.3 with an API wrapper sidecar. Captures DOM, WARC, screenshots, wget mirrors, and media (yt-dlp) for every archived URL. Primarily used for Reddit post archival. Readability disabled (not useful for Reddit threads). Runs as Docker containers on a dedicated Debian LXC (128). Auth via oauth2-proxy + PocketID.

Infrastructure

Host	LXC ID	Internal	External	CPU	RAM	Disk
Debian LXC	128	192.168.1.128:8000 (web) / :8001 (API)	https://archive.eva-00.network	2 cores	3 GiB	16 GiB

Storage Layout

Data	Location	Notes
SQLite DB + config	`/opt/archivebox/data` (local rootfs)	Fast FSYNC required
Archives (DOM, WARC, screenshots, media)	`/mnt/archivebox/archive` (urahara bind mount)	Growth-prone, backed up via Backrest
API wrapper source	`/opt/archivebox/api-wrapper` (local rootfs)	Built as Docker image on deploy

Proxmox bind mount: /mnt/pve/urahara/archivebox/archive -> /mnt/archivebox/archive

Architecture

ArchiveBox v0.7.3 has no REST API. A thin Python HTTP wrapper (api-wrapper) runs as a sidecar container, sharing the /data volume, and exposes the CLI via HTTP:

                    +-----------------+
                    |  archivebox     |
  Port 8000 (web)  |  v0.7.3         |
  <---------------->  (Web UI only)   |
                    |  /data volume   |
                    +--------+--------+
                             |
                    shared /data volume
                             |
                    +--------+--------+
                    |  api-wrapper    |
  Port 8001 (API)  |  (Python HTTP)  |
  <---------------->  Bearer token   |
                    |  /data volume   |
                    +-----------------+

API Wrapper Endpoints

Method	Path	Purpose
POST	`/add`	Archive a URL (params: url, depth, tag, overwrite)
GET	`/health`	Health check

Auth: Authorization: Bearer <token> (token stored in Vault)
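The wrapper's source is not reproduced on this page, so the following is only a minimal sketch of how such a sidecar could check the Bearer token and map `/add` parameters onto the CLI. The helper names (`is_authorized`, `build_add_command`, `run_add`) and the `API_TOKEN` environment variable are assumptions for illustration. `--depth`, `--tag`, and `--overwrite` are existing options of `archivebox add`.

```python
import hmac
import os
import subprocess

# Hypothetical: the real wrapper may load its token differently.
API_TOKEN = os.environ.get("API_TOKEN", "")


def is_authorized(header_value, token):
    """Return True if the Authorization header carries the expected Bearer token."""
    if not token or not header_value:
        return False
    scheme, _, presented = header_value.partition(" ")
    if scheme.lower() != "bearer":
        return False
    # Constant-time comparison avoids leaking the token through timing.
    return hmac.compare_digest(presented.strip().encode(), token.encode())


def build_add_command(params):
    """Translate the /add parameters (url, depth, tag, overwrite) into CLI arguments."""
    url = params.get("url", "")
    if not url.startswith(("http://", "https://")):
        raise ValueError("url must be an http(s) URL")
    depth = int(params.get("depth", 0))
    if depth not in (0, 1):
        raise ValueError("depth must be 0 or 1")
    cmd = ["archivebox", "add", f"--depth={depth}"]
    if params.get("tag"):
        cmd.append(f"--tag={params['tag']}")
    if params.get("overwrite"):
        cmd.append("--overwrite")
    cmd.append(url)
    return cmd


def run_add(params, cwd="/data"):
    """Run the CLI inside the shared data directory and return its exit code."""
    result = subprocess.run(
        build_add_command(params), cwd=cwd, capture_output=True, text=True
    )
    return result.returncode
```

Passing the command as a list (no shell) means a URL containing shell metacharacters cannot inject extra commands.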

Observability

Logs

Container logs are collected through the Docker log driver and can be queried in Loki:

Query	Purpose
`{container_name="archivebox"}`	ArchiveBox app logs
`{container_name="archivebox-api"}`	API wrapper logs
`{container_name="archivebox"} |= "error"`	Errors only
`{container_name="archivebox-api"} |= "POST /add"`	Archive requests

Access: Grafana -> Explore -> Loki -> Enter query
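The same queries can also be run outside Grafana through Loki's HTTP API (`/loki/api/v1/query_range`). The sketch below only builds the request URL. The Loki address is a placeholder, since this page does not say where Loki listens, and `loki_query_url` is a hypothetical helper name.

```python
from urllib.parse import urlencode

LOKI_URL = "http://loki.example.internal:3100"  # placeholder, not the real address


def loki_query_url(logql, now, minutes=60, limit=100):
    """Build a query_range URL covering the last `minutes` before `now` (Unix seconds)."""
    end = int(now)
    start = end - minutes * 60
    params = {
        "query": logql,
        "limit": limit,
        # Loki accepts start/end as nanosecond Unix timestamps.
        "start": start * 10**9,
        "end": end * 10**9,
    }
    return f"{LOKI_URL}/loki/api/v1/query_range?{urlencode(params)}"
```

For example, `loki_query_url('{container_name="archivebox"} |= "error"', now=time.time(), minutes=15)` gives a URL that can be fetched with any HTTP client.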

IaC

Artifact	Path
Playbook	`ansible/playbooks/archivebox.yml`
Workflow	`.forgejo/workflows/archivebox.yml`
Docker Compose	`services/archivebox/docker-compose.yml`
API Wrapper	`services/archivebox/api-wrapper/` (app.py + Dockerfile)
Caddy entry	`services/caddy/Caddyfile` -> `archive.eva-00.network`
Glance entry	`services/glance/glance.yml` -> Knowledge section
OAuth2 proxy	`services/external-proxies/docker-compose.yml` -> port 8592

The playbook manages the full lifecycle:

Installs Docker
Fetches secrets from Vault (admin creds + API wrapper token)
Deploys API wrapper source files
Builds api-wrapper Docker image locally
Deploys Docker Compose stack (archivebox + api-wrapper)
Runs `archivebox init --setup` (first run only)
Creates the holo superuser via a Django management command
Stores API wrapper token in Vault
Deploys Alloy (monitoring)

Auth

Component	Details
Auth method	oauth2-proxy on LXC 119 (port 8592)
OIDC Provider	PocketID (`auth.eva-00.network`)
Callback URL	`https://archive.eva-00.network/oauth2/callback`
PocketID Client ID	`c31a0831-3877-4ca5-883a-179a252474a4`
ArchiveBox admin	holo (only user, Django superuser)
API wrapper	Bearer token (no OIDC, internal only)

ArchiveBox has no native OIDC support. All web access is gated by oauth2-proxy. The API wrapper (port 8001) is internal-only and uses a static Bearer token for n8n integration.

Secrets

Vault path: `secret/data/archivebox`

Key	Purpose
`admin_username`	Django superuser username (holo)
`admin_password`	Django superuser password
`api_key`	API wrapper Bearer token (used by n8n)
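A path of the form `secret/data/...` matches Vault's KV v2 HTTP API, where the secret values sit under `data.data` in the response. The following is a sketch of reading `api_key` that way, assuming the standard `VAULT_ADDR` and `VAULT_TOKEN` environment variables are set. The helper names are hypothetical, and the playbook itself may fetch secrets through a different mechanism.

```python
import json
import os
import urllib.request

VAULT_ADDR = os.environ.get("VAULT_ADDR", "http://127.0.0.1:8200")


def extract_secret(response_body, key):
    """Pull one key out of a KV v2 read response (values live under data.data)."""
    payload = json.loads(response_body)
    return payload["data"]["data"][key]


def read_secret(key, path="secret/data/archivebox"):
    """Read a single key from Vault over HTTP using a token from the environment."""
    req = urllib.request.Request(
        f"{VAULT_ADDR}/v1/{path}",
        headers={"X-Vault-Token": os.environ["VAULT_TOKEN"]},
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return extract_secret(resp.read(), key)
```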

Vault path: `secret/data/external-oauth2-proxies` (shared)

Key	Purpose
`archivebox_client_id`	PocketID OIDC client ID
`archivebox_client_secret`	PocketID OIDC client secret
`archivebox_cookie_secret`	oauth2-proxy cookie encryption key

Archiving Configuration

ArchiveBox is configured for maximum fidelity on web pages and Reddit posts:

Extractor	Enabled	Purpose
DOM	Yes	Raw page HTML dump
wget	Yes	Page mirror with requisites (CSS, JS, images); also writes the WARC
WARC	Yes	Web archive format (requires wget)
Screenshot	Yes	Full-page PNG via headless Chrome
Media	Yes	Video/audio download via yt-dlp
Readability	No	Not useful for Reddit posts (primary use case)
SingleFile	No	Slow, Chrome-heavy, not needed with DOM+wget
PDF	No	Redundant
archive.org	No	Don't submit to Internet Archive
Headers	No	Not useful for archival
Git	No	Not relevant
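ArchiveBox v0.7 toggles each extractor with a `SAVE_*` configuration option. The sketch below expresses the table as those options and renders them as `KEY=value` lines. Where the deployment actually sets them (Compose environment, `ArchiveBox.conf`, or elsewhere) is not stated on this page, so treat the output format as an assumption. `to_env_lines` is a hypothetical helper.

```python
# Extractor switches mirroring the table above (ArchiveBox v0.7 option names).
EXTRACTORS = {
    "SAVE_DOM": True,
    "SAVE_WGET": True,
    "SAVE_WARC": True,
    "SAVE_SCREENSHOT": True,
    "SAVE_MEDIA": True,
    "SAVE_READABILITY": False,
    "SAVE_SINGLEFILE": False,
    "SAVE_PDF": False,
    "SAVE_ARCHIVE_DOT_ORG": False,
    "SAVE_HEADERS": False,
    "SAVE_GIT": False,
}


def to_env_lines(flags):
    """Render the switches as KEY=True/False lines, checking the WARC dependency."""
    # The WARC is produced by the wget extractor, so WARC without wget is a mistake.
    if flags.get("SAVE_WARC") and not flags.get("SAVE_WGET"):
        raise ValueError("SAVE_WARC requires SAVE_WGET")
    return [f"{name}={'True' if enabled else 'False'}" for name, enabled in flags.items()]
```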

n8n Integration — Archival Workflows

Two separate workflows, two iOS Shortcuts:

flowchart TD
    A[iOS / macOS Share Sheet] --> B{Which Shortcut?}
    B -->|"Archive Media"| C["/webhook/archive-media"]
    B -->|"Archive Website"| D["/webhook/archive-website"]

    C --> E{Reddit URL?}
    E -->|Yes| F[Clean URL → old.reddit.com]
    F --> G[Fetch .json metadata]
    G --> H[ArchiveBox API wrapper]
    H --> I[Tag with subreddit]
    I --> J[Archive extra media]
    J --> K[Log to ledger]

    E -->|No| L{URL type?}
    L -->|PDF| M[Karakeep bookmark]
    L -->|GitHub / Google| M
    L -->|Generic article| M
    M --> K

    D --> N[Clean URL]
    N --> O[ArchiveBox API wrapper]
    O --> P[DOM + WARC + screenshot + wget + media]

    K --> Q[Respond to client]
    P --> Q

    style H fill:#f96,stroke:#333
    style O fill:#f96,stroke:#333
    style M fill:#69f,stroke:#333

archive-media (iOS Shortcut: "Archive Media")

Routes URLs based on type:

URL type	Destination	Notes
Reddit post	ArchiveBox only	Tags with subreddit, fetches .json metadata
PDF (`.pdf`)	Karakeep only	Stored as document bookmark
GitHub / Google (docs, search)	Karakeep only	Better reading experience
Everything else	Karakeep only	Generic bookmarks
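The routing table can be sketched as a small classifier. The actual n8n nodes are not shown here, so the matching rules below (host suffix checks, a `.pdf` path suffix) and the name `classify` are assumptions about how the detection might work, not the workflow's real logic.

```python
from urllib.parse import urlparse


def classify(url):
    """Return (url_type, destination) following the routing table above."""
    # Webhook payloads may omit the scheme (e.g. "reddit.com/..."), so add one.
    parsed = urlparse(url if "://" in url else f"https://{url}")
    host = (parsed.hostname or "").lower()
    path = parsed.path.lower()

    def on(domain):
        # Match the domain itself or any subdomain, but not lookalike hosts.
        return host == domain or host.endswith("." + domain)

    if on("reddit.com"):
        return ("reddit", "archivebox")
    if path.endswith(".pdf"):
        return ("pdf", "karakeep")
    if on("github.com") or on("google.com"):
        return ("github_google", "karakeep")
    return ("generic", "karakeep")
```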

Reddit path:

POST /webhook/archive-media { url: "reddit.com/..." }
  -> Detect Reddit URL
  -> Clean URL (resolve /s/ share links, rewrite to old.reddit.com)
  -> Fetch .json metadata (title, subreddit, type, media URLs)
  -> POST to API wrapper /add (tag=subreddit name)
  -> Archive extra media URLs (images/videos from post)
  -> Log to media-archives ledger (Forgejo repo)
  -> Respond with result
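The URL handling in the Reddit path can be sketched as follows. This is an illustration under assumptions: the function names are hypothetical, the query string is dropped entirely, and resolving `/s/` share links is left out because it needs an HTTP request that follows the redirect.

```python
import re
from urllib.parse import urlsplit, urlunsplit


def to_old_reddit(url):
    """Rewrite any reddit.com post URL to old.reddit.com, dropping query and fragment."""
    parts = urlsplit(url if "://" in url else f"https://{url}")
    host = (parts.hostname or "").lower()
    if host != "reddit.com" and not host.endswith(".reddit.com"):
        raise ValueError("not a Reddit URL")
    path = parts.path.rstrip("/") + "/"
    return urlunsplit(("https", "old.reddit.com", path, "", ""))


def json_endpoint(post_url):
    """Return the .json metadata URL for a post."""
    return to_old_reddit(post_url).rstrip("/") + ".json"


def subreddit_of(url):
    """Extract the subreddit name (used as the ArchiveBox tag), or None if absent."""
    match = re.search(r"/r/([A-Za-z0-9_]+)", urlsplit(url).path)
    return match.group(1) if match else None
```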

Non-Reddit path (PDFs, GitHub, Google, generic):

POST /webhook/archive-media { url: "..." }
  -> Clean URL (strip tracking, unwrap Google redirects)
  -> Create Karakeep bookmark
  -> Log to media-archives ledger
  -> Respond with result
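A sketch of the "Clean URL" step for non-Reddit links. The list of tracking parameters (`utm_*`, `fbclid`, `gclid`) and the Google redirect shape (`/url?q=...`) are assumptions chosen for illustration; the workflow's real rules may cover more cases.

```python
from urllib.parse import parse_qs, parse_qsl, urlencode, urlsplit, urlunsplit

TRACKING_PREFIXES = ("utm_",)       # assumed list, extend as needed
TRACKING_KEYS = {"fbclid", "gclid"}  # assumed list, extend as needed


def unwrap_google_redirect(url):
    """Return the target of a google.com/url?q=... redirect, or the URL unchanged."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if host in ("google.com", "www.google.com") and parts.path == "/url":
        query = parse_qs(parts.query)
        for key in ("q", "url"):
            if query.get(key):
                return query[key][0]
    return url


def strip_tracking(url):
    """Remove known tracking parameters while keeping all other query parameters."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if not key.lower().startswith(TRACKING_PREFIXES)
        and key.lower() not in TRACKING_KEYS
    ]
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(kept), parts.fragment)
    )


def clean_url(url):
    """Unwrap first, then strip, so tracking inside the wrapped URL is removed too."""
    return strip_tracking(unwrap_google_redirect(url))
```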

archive-website (iOS Shortcut: "Archive Website")

For full website archiving — sends directly to ArchiveBox, no Karakeep.

POST /webhook/archive-website { url: "...", depth: 0 }
  -> Clean URL (strip tracking, rewrite Reddit to old.reddit.com)
  -> POST to API wrapper /add
  -> Respond with result

Use this when you specifically want a full DOM/WARC/screenshot archive of a non-Reddit page.

Media Archives Ledger

All processed links are logged to the `claude/media-archives` repo on Forgejo as monthly CSV files:

Path: 2026/04.csv (year/month)
Columns: date_saved, url, source, type, title, tags, karakeep_id, archivebox_url, archive_today_url, reddit_account, status
Updated via Forgejo API (Contents endpoint)
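A sketch of how a ledger update could be prepared: derive the monthly path, append one row with the columns listed above, and build the JSON body for the Contents endpoint. In the Forgejo API, file content is sent base64-encoded, and updating an existing file also requires its current `sha`. The function names and the commit message are hypothetical.

```python
import base64
import csv
import io
from datetime import date

COLUMNS = [
    "date_saved", "url", "source", "type", "title", "tags", "karakeep_id",
    "archivebox_url", "archive_today_url", "reddit_account", "status",
]


def ledger_path(day):
    """Monthly file path inside the repo, e.g. 2026/04.csv."""
    return f"{day.year:04d}/{day.month:02d}.csv"


def append_row(existing_csv, entry):
    """Return the CSV text with one more row; writes the header if the file is new."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, lineterminator="\n")
    if not existing_csv.strip():
        writer.writeheader()
    else:
        buffer.write(existing_csv if existing_csv.endswith("\n") else existing_csv + "\n")
    # Missing columns are written as empty strings.
    writer.writerow({name: entry.get(name, "") for name in COLUMNS})
    return buffer.getvalue()


def contents_payload(csv_text, sha=None, message="ledger: add entry"):
    """JSON body for the Contents endpoint; pass `sha` when updating an existing file."""
    payload = {
        "content": base64.b64encode(csv_text.encode("utf-8")).decode("ascii"),
        "message": message,
    }
    if sha:
        payload["sha"] = sha
    return payload
```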

Reddit Archiving Notes

old.reddit.com is essential: it serves a standard DOM with no Shadow DOM, so wget works
The Reddit `.json` endpoint (append `.json` to any post URL) works without auth for individual posts (~10 req/min)
`archivebox update` only runs missing extractors (smart incremental)
Typical Reddit post archive: ~5-10MB (dom + warc + screenshot + wget + media)
Estimated disk for 10,000 posts: ~50-100GB (10,000 × 5-10MB)
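The disk estimate is the per-post size range multiplied by the post count. A small helper makes it easy to redo that arithmetic for a different number of posts. It assumes decimal units (1 GB = 1000 MB) and ignores index and filesystem overhead; `estimate_gb` is a hypothetical name.

```python
def estimate_gb(posts, mb_per_post_low, mb_per_post_high):
    """Return (low, high) storage bounds in decimal GB for a number of posts."""
    return (posts * mb_per_post_low / 1000, posts * mb_per_post_high / 1000)


low, high = estimate_gb(10_000, 5, 10)
print(f"{low:.0f}-{high:.0f} GB")  # prints "50-100 GB"
```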

Backup Strategy

Data	Mechanism	Schedule
LXC 128 snapshot (rootfs + SQLite)	PBS	Weekly
SQLite DB (`/opt/archivebox/data/index.sqlite3`)	Databasement	Daily 2:00am
Archive files on urahara	Backrest (`archivebox-archives` plan)	Daily 3:30am
Offline	senku USB rsync (`/Volumes/sazabi/urahara/archivebox/`)	On-demand

Adding URLs Programmatically

# Via API wrapper (preferred for automation)
curl -X POST http://192.168.1.128:8001/add \
  -H "Authorization: Bearer <token>" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://old.reddit.com/r/...", "tag": "subreddit_name"}'

# Via CLI (direct)
ssh <user>@<lxc-128-host> docker exec --user=archivebox archivebox archivebox add 'https://example.com'

# Via n8n webhook (iOS Shortcuts)
curl -X POST https://n8n.eva-00.network/webhook/archive-media \
  -H "Authorization: Bearer <webhook_secret>" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://reddit.com/r/..."}'
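For callers that are not shell scripts, the first curl command translates to Python's standard library roughly as below. The wrapper's response format is not documented on this page, so the sketch returns the raw status and body instead of assuming a JSON shape. `build_add_request` and `add_url` are hypothetical helper names.

```python
import json
import urllib.request

API_URL = "http://192.168.1.128:8001"  # API wrapper address from the Infrastructure table


def build_add_request(url, token, tag=None, depth=0, overwrite=False):
    """Build the POST /add request without sending it."""
    body = {"url": url, "depth": depth, "overwrite": overwrite}
    if tag:
        body["tag"] = tag
    return urllib.request.Request(
        f"{API_URL}/add",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def add_url(url, token, **options):
    """Send the request; archiving can be slow, hence the generous timeout."""
    request = build_add_request(url, token, **options)
    with urllib.request.urlopen(request, timeout=300) as response:
        return response.status, response.read().decode("utf-8", errors="replace")
```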