ArchiveBox — Setup

Web archival vault running ArchiveBox v0.7.3 with an API wrapper sidecar. Captures DOM, WARC, screenshots, wget mirrors, and media (yt-dlp) for every archived URL. Primarily used for Reddit post archival. Readability disabled (not useful for Reddit threads). Runs as Docker containers on a dedicated Debian LXC (128). Auth via oauth2-proxy + PocketID.

Infrastructure

| Host | LXC ID | Internal | External | CPU | RAM | Disk |
|---|---|---|---|---|---|---|
| Debian LXC | 128 | 192.168.1.128:8000 (web) / :8001 (API) | https://archive.eva-00.network | 2 cores | 3 GiB | 16 GiB |

Storage Layout

| Data | Location | Notes |
|---|---|---|
| SQLite DB + config | /opt/archivebox/data (local rootfs) | Fast FSYNC required |
| Archives (DOM, WARC, screenshots, media) | /mnt/archivebox/archive (urahara bind mount) | Growth-prone, backed up via Backrest |
| API wrapper source | /opt/archivebox/api-wrapper (local rootfs) | Built as Docker image on deploy |

Proxmox bind mount: /mnt/pve/urahara/archivebox/archive -> /mnt/archivebox/archive

Architecture

ArchiveBox v0.7.3 has no REST API. A thin Python HTTP wrapper (api-wrapper) runs as a sidecar container, sharing the /data volume, and exposes the CLI via HTTP:

                    +-----------------+
                    |  archivebox     |
  Port 8000 (web)  |  v0.7.3         |
  <---------------->  (Web UI only)   |
                    |  /data volume   |
                    +--------+--------+
                             |
                    shared /data volume
                             |
                    +--------+--------+
                    |  api-wrapper    |
  Port 8001 (API)  |  (Python HTTP)  |
  <---------------->  Bearer token   |
                    |  /data volume   |
                    +-----------------+
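
A minimal sketch of how such a sidecar might translate HTTP requests into CLI calls, assuming the wrapper runs `archivebox add` in its own container against the shared /data volume. The function names, the `API_TOKEN` environment variable, and the mapping of JSON fields to CLI flags are illustrative assumptions, not the contents of the real app.py.

```python
import hmac
import json
import os
import subprocess
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical: the real wrapper may load its token differently.
API_TOKEN = os.environ.get("API_TOKEN", "")


def is_authorized(header_value, token):
    """Check an 'Authorization: Bearer <token>' header in constant time."""
    if not token or not header_value or not header_value.startswith("Bearer "):
        return False
    supplied = header_value[len("Bearer "):]
    return hmac.compare_digest(supplied.encode(), token.encode())


def build_add_command(payload):
    """Translate the /add JSON body into an `archivebox add` argument list."""
    cmd = ["archivebox", "add", f"--depth={int(payload.get('depth', 0))}"]
    if payload.get("tag"):
        cmd.append(f"--tag={payload['tag']}")
    if payload.get("overwrite"):
        cmd.append("--overwrite")
    cmd.append(payload["url"])  # passed as a list item, never through a shell
    return cmd


class Handler(BaseHTTPRequestHandler):
    def _reply(self, status, body):
        data = json.dumps(body).encode()
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)

    def do_GET(self):
        if self.path == "/health":
            self._reply(200, {"status": "ok"})
        else:
            self._reply(404, {"error": "not found"})

    def do_POST(self):
        if self.path != "/add":
            return self._reply(404, {"error": "not found"})
        if not is_authorized(self.headers.get("Authorization"), API_TOKEN):
            return self._reply(401, {"error": "unauthorized"})
        length = int(self.headers.get("Content-Length", 0))
        try:
            payload = json.loads(self.rfile.read(length) or b"{}")
            cmd = build_add_command(payload)
        except (ValueError, KeyError, TypeError, AttributeError):
            return self._reply(400, {"error": "invalid request body"})
        # ArchiveBox expects to run from its data directory.
        result = subprocess.run(cmd, cwd="/data", capture_output=True, text=True)
        status = 200 if result.returncode == 0 else 500
        self._reply(status, {"returncode": result.returncode,
                             "output": result.stdout[-2000:]})


# To serve: HTTPServer(("0.0.0.0", 8001), Handler).serve_forever()
```

Building the command as an argument list rather than a shell string keeps a hostile URL from being interpreted by a shell, which matters because the URL arrives from an external workflow.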

API Wrapper Endpoints

| Method | Path | Purpose |
|---|---|---|
| POST | /add | Archive a URL (params: url, depth, tag, overwrite) |
| GET | /health | Health check |

Auth: Authorization: Bearer <token> (token stored in Vault)
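
For callers other than n8n, a small Python client could look roughly like the sketch below. It assumes the wrapper accepts a JSON body with the documented params and that `depth` is an integer and `overwrite` a boolean; those types, and the helper names, are assumptions rather than something confirmed by the wrapper source.

```python
import json
import urllib.request

# Assumption: base URL taken from the Infrastructure table; the token comes from Vault.
BASE_URL = "http://192.168.1.128:8001"


def build_add_request(token, url, depth=0, tag=None, overwrite=False):
    """Build (but do not send) a POST /add request with the documented params."""
    body = {"url": url, "depth": depth, "overwrite": overwrite}
    if tag:
        body["tag"] = tag
    return urllib.request.Request(
        f"{BASE_URL}/add",
        data=json.dumps(body).encode(),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def send(request, timeout=300):
    """Send a prepared request and return the decoded response body."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return resp.read().decode()
```

Usage would be along the lines of `send(build_add_request(token, "https://old.reddit.com/r/...", tag="subreddit_name"))`. The generous timeout reflects that an archive run with screenshots and media can take a while.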

Observability

Logs

Container logs are collected via Docker log driver. Query in Loki:

| Query | Purpose |
|---|---|
| {container_name="archivebox"} | ArchiveBox app logs |
| {container_name="archivebox-api"} | API wrapper logs |
| {container_name="archivebox"} \|= "error" | Errors only |
| {container_name="archivebox-api"} \|= "POST /add" | Archive requests |

Access: Grafana -> Explore -> Loki -> Enter query
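
The same queries can be run outside Grafana through Loki's HTTP API. The sketch below assumes a reachable Loki instance; `LOKI_URL` is a placeholder, since the Loki address is not given in this document. Note that the line filter operator is written `|=` in LogQL, with no backslash.

```python
import json
import time
import urllib.parse
import urllib.request

# Assumption: placeholder address; substitute the real Loki endpoint.
LOKI_URL = "http://loki.example.internal:3100"


def build_query_url(logql, minutes=60, limit=100, now_ns=None):
    """Build a /loki/api/v1/query_range URL covering the last `minutes`."""
    end = now_ns if now_ns is not None else time.time_ns()
    start = end - minutes * 60 * 1_000_000_000  # Loki accepts nanosecond epochs
    params = urllib.parse.urlencode(
        {"query": logql, "start": start, "end": end, "limit": limit}
    )
    return f"{LOKI_URL}/loki/api/v1/query_range?{params}"


def fetch_lines(logql, **kwargs):
    """Run the query and return the log lines from all matching streams."""
    with urllib.request.urlopen(build_query_url(logql, **kwargs), timeout=30) as resp:
        result = json.load(resp)["data"]["result"]
    return [line for stream in result for _ts, line in stream["values"]]
```

For example, `fetch_lines('{container_name="archivebox"} |= "error"', minutes=15)` would return the last 15 minutes of error lines.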

IaC

| Artifact | Path |
|---|---|
| Playbook | ansible/playbooks/archivebox.yml |
| Workflow | .forgejo/workflows/archivebox.yml |
| Docker Compose | services/archivebox/docker-compose.yml |
| API Wrapper | services/archivebox/api-wrapper/ (app.py + Dockerfile) |
| Caddy entry | services/caddy/Caddyfile -> archive.eva-00.network |
| Glance entry | services/glance/glance.yml -> Knowledge section |
| OAuth2 proxy | services/external-proxies/docker-compose.yml -> port 8592 |

The playbook manages the full lifecycle:

  1. Installs Docker
  2. Fetches secrets from Vault (admin creds + API wrapper token)
  3. Deploys API wrapper source files
  4. Builds api-wrapper Docker image locally
  5. Deploys Docker Compose stack (archivebox + api-wrapper)
  6. Runs archivebox init --setup (first run)
  7. Creates holo superuser via Django management command
  8. Stores API wrapper token in Vault
  9. Deploys Alloy (monitoring)

Auth

| Component | Details |
|---|---|
| Auth method | oauth2-proxy on LXC 119 (port 8592) |
| OIDC Provider | PocketID (auth.eva-00.network) |
| Callback URL | https://archive.eva-00.network/oauth2/callback |
| PocketID Client ID | c31a0831-3877-4ca5-883a-179a252474a4 |
| ArchiveBox admin | holo (only user, Django superuser) |
| API wrapper | Bearer token (no OIDC, internal only) |

ArchiveBox has no native OIDC support. All web access is gated by oauth2-proxy. The API wrapper (port 8001) is internal-only and uses a static Bearer token for n8n integration.

Secrets

Vault path: secret/data/archivebox

| Key | Purpose |
|---|---|
| admin_username | Django superuser username (holo) |
| admin_password | Django superuser password |
| api_key | API wrapper Bearer token (used by n8n) |

Vault path: secret/data/external-oauth2-proxies (shared)

| Key | Purpose |
|---|---|
| archivebox_client_id | PocketID OIDC client ID |
| archivebox_client_secret | PocketID OIDC client secret |
| archivebox_cookie_secret | oauth2-proxy cookie encryption key |
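
The `secret/data/...` paths follow the shape of a HashiCorp Vault KV v2 mount, where the key/value pairs sit under `data.data` in the response. Assuming that is the backend in use, a read could be sketched as below; `VAULT_ADDR` and `VAULT_TOKEN` are the conventional environment variables and are assumed here, as are the helper names.

```python
import json
import os
import urllib.request


def build_vault_request(path, addr=None, token=None):
    """Build a KV v2 read request for a path such as 'secret/data/archivebox'."""
    addr = (addr or os.environ["VAULT_ADDR"]).rstrip("/")
    token = token or os.environ["VAULT_TOKEN"]
    return urllib.request.Request(
        f"{addr}/v1/{path}", headers={"X-Vault-Token": token}
    )


def extract_secret(response_body):
    """KV v2 nests the key/value pairs under data.data."""
    return json.loads(response_body)["data"]["data"]


def read_secret(path):
    """Fetch a secret and return its key/value pairs as a dict."""
    with urllib.request.urlopen(build_vault_request(path), timeout=10) as resp:
        return extract_secret(resp.read())
```

With this, `read_secret("secret/data/archivebox")["api_key"]` would yield the Bearer token used by the API wrapper.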

Archiving Configuration

ArchiveBox is configured for high-fidelity capture of web pages and Reddit posts; extractors that are slow, redundant, or irrelevant to that goal are disabled:

| Extractor | Enabled | Purpose |
|---|---|---|
| DOM | Yes | Raw page HTML dump |
| wget | Yes | Full recursive mirror + generates WARC |
| WARC | Yes | Web archive format (requires wget) |
| Screenshot | Yes | Full-page PNG via headless Chrome |
| Media | Yes | Video/audio download via yt-dlp |
| Readability | No | Not useful for Reddit posts (primary use case) |
| SingleFile | No | Slow, Chrome-heavy, not needed with DOM+wget |
| PDF | No | Redundant |
| archive.org | No | Don't submit to Internet Archive |
| Headers | No | Not useful for archival |
| Git | No | Not relevant |
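
In ArchiveBox v0.7.x these toggles correspond to `SAVE_*` configuration options. The sketch below expresses the table as a mapping and renders it as env-file lines. Whether the deployment sets them through Compose environment variables or ArchiveBox.conf is not stated here, so treat the rendering step as an assumption; options not listed in the table are left at their defaults.

```python
# Extractor toggles from the table, keyed by ArchiveBox's SAVE_* option names.
EXTRACTORS = {
    "SAVE_DOM": True,
    "SAVE_WGET": True,
    "SAVE_WARC": True,
    "SAVE_SCREENSHOT": True,
    "SAVE_MEDIA": True,
    "SAVE_READABILITY": False,
    "SAVE_SINGLEFILE": False,
    "SAVE_PDF": False,
    "SAVE_ARCHIVE_DOT_ORG": False,
    "SAVE_HEADERS": False,
    "SAVE_GIT": False,
}


def render_env(settings):
    """Render the toggles as KEY=True/False lines for an env file."""
    return "\n".join(f"{key}={value}" for key, value in settings.items())


def enabled(settings):
    """List the option names that are switched on."""
    return [key for key, value in settings.items() if value]
```

Keeping the toggles in one mapping makes it easy to compare the documented intent against what the running container reports.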

n8n Integration — Archival Workflows

Two separate workflows, two iOS Shortcuts:

flowchart TD
    A[iOS / macOS Share Sheet] --> B{Which Shortcut?}
    B -->|"Archive Media"| C["/webhook/archive-media"]
    B -->|"Archive Website"| D["/webhook/archive-website"]

    C --> E{Reddit URL?}
    E -->|Yes| F[Clean URL → old.reddit.com]
    F --> G[Fetch .json metadata]
    G --> H[ArchiveBox API wrapper]
    H --> I[Tag with subreddit]
    I --> J[Archive extra media]
    J --> K[Log to ledger]

    E -->|No| L{URL type?}
    L -->|PDF| M[Karakeep bookmark]
    L -->|GitHub / Google| M
    L -->|Generic article| M
    M --> K

    D --> N[Clean URL]
    N --> O[ArchiveBox API wrapper]
    O --> P[DOM + WARC + screenshot + wget + media]

    K --> Q[Respond to client]

    style H fill:#f96,stroke:#333
    style O fill:#f96,stroke:#333
    style M fill:#69f,stroke:#333

archive-media (iOS Shortcut: "Archive Media")

Routes URLs based on type:

| URL type | Destination | Notes |
|---|---|---|
| Reddit post | ArchiveBox only | Tags with subreddit, fetches .json metadata |
| PDF (.pdf) | Karakeep only | Stored as document bookmark |
| GitHub / Google (docs, search) | Karakeep only | Better reading experience |
| Everything else | Karakeep only | Generic bookmarks |
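
The routing decision can be sketched as a small classifier. This is an illustration of the table, not the n8n workflow's actual logic: the host matching rules (reddit.com and its subdomains, github.com, google.com) and the function names are assumptions.

```python
from urllib.parse import urlparse


def normalize(url):
    """Add a scheme when the client sent a bare 'reddit.com/...' style URL."""
    return url if "://" in url else f"https://{url}"


def host_matches(host, domain):
    """True when host is the domain itself or one of its subdomains."""
    return host == domain or host.endswith("." + domain)


def route(url):
    """Return (destination, kind) following the routing table."""
    parsed = urlparse(normalize(url))
    host = (parsed.hostname or "").lower()
    if host_matches(host, "reddit.com"):
        return ("archivebox", "reddit")
    if parsed.path.lower().endswith(".pdf"):
        return ("karakeep", "pdf")
    if host_matches(host, "github.com"):
        return ("karakeep", "github")
    if host_matches(host, "google.com"):
        return ("karakeep", "google")
    return ("karakeep", "generic")
```

Matching on the parsed hostname rather than on a substring avoids misrouting lookalike domains such as notreddit.com.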

Reddit path:

POST /webhook/archive-media { url: "reddit.com/..." }
  -> Detect Reddit URL
  -> Clean URL (resolve /s/ share links, rewrite to old.reddit.com)
  -> Fetch .json metadata (title, subreddit, type, media URLs)
  -> POST to API wrapper /add (tag=subreddit name)
  -> Archive extra media URLs (images/videos from post)
  -> Log to media-archives ledger (Forgejo repo)
  -> Respond with result
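
The URL cleaning and metadata steps of the Reddit path can be sketched as follows. Resolving /s/ share links requires following an HTTP redirect and is left out. The field names read from the .json response (`title`, `subreddit`, `is_self`, `is_video`, `post_hint`, `url`) are standard Reddit post fields, but which of them the workflow actually uses is an assumption, as are the function names.

```python
from urllib.parse import urlparse, urlunparse


def to_old_reddit(url):
    """Rewrite a Reddit post URL to old.reddit.com and drop query/fragment."""
    if "://" not in url:
        url = f"https://{url}"
    path = urlparse(url).path
    if not path.endswith("/"):
        path += "/"
    return urlunparse(("https", "old.reddit.com", path, "", "", ""))


def json_url(post_url):
    """Reddit serves post metadata when .json is appended to the post path."""
    return to_old_reddit(post_url).rstrip("/") + ".json"


def parse_post(listing):
    """Pull the fields used by the workflow from a post's .json response.

    The response is a two-element list: the post listing, then the comments.
    """
    post = listing[0]["data"]["children"][0]["data"]
    if post.get("is_self"):
        kind = "text"
    elif post.get("is_video"):
        kind = "video"
    else:
        kind = post.get("post_hint", "link")
    # Self posts link to themselves, so they carry no extra media URL.
    media_urls = [] if post.get("is_self") else [post["url"]]
    return {
        "title": post["title"],
        "subreddit": post["subreddit"],
        "type": kind,
        "media_urls": media_urls,
    }
```

The `subreddit` value is what would be passed as `tag` to the API wrapper, and `media_urls` feeds the extra media archiving step.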

Non-Reddit path (PDFs, GitHub, Google, generic):

POST /webhook/archive-media { url: "..." }
  -> Clean URL (strip tracking, unwrap Google redirects)
  -> Create Karakeep bookmark
  -> Log to media-archives ledger
  -> Respond with result
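
The "Clean URL" step for non-Reddit links might look like the sketch below. The list of tracking parameters is a typical one chosen for illustration; the workflow's real list is not documented here. The Google redirect form handled is `google.com/url?q=...` (or `?url=...`).

```python
from urllib.parse import parse_qs, parse_qsl, urlencode, urlparse, urlunparse

# Assumption: a typical tracking-parameter list, not the workflow's actual one.
TRACKING_PREFIXES = ("utm_",)
TRACKING_KEYS = {"fbclid", "gclid", "igshid", "mc_cid", "mc_eid"}


def unwrap_google_redirect(url):
    """Return the target of a google.com/url?q=... redirect, else the URL itself."""
    parts = urlparse(url)
    host = (parts.hostname or "").lower()
    is_google = host == "google.com" or host.endswith(".google.com")
    if is_google and parts.path == "/url":
        query = parse_qs(parts.query)
        for key in ("q", "url"):
            if query.get(key):
                return query[key][0]
    return url


def strip_tracking(url):
    """Remove known tracking parameters while keeping all other query items."""
    parts = urlparse(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in TRACKING_KEYS
        and not key.lower().startswith(TRACKING_PREFIXES)
    ]
    return urlunparse(parts._replace(query=urlencode(kept)))


def clean_url(url):
    """Unwrap first, so tracking params on the inner URL are stripped too."""
    return strip_tracking(unwrap_google_redirect(url))
```

Unwrapping before stripping matters: the tracking parameters usually sit on the inner URL, which is only visible once the redirect wrapper is removed.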

archive-website (iOS Shortcut: "Archive Website")

For full website archiving — sends directly to ArchiveBox, no Karakeep.

POST /webhook/archive-website { url: "...", depth: 0 }
  -> Clean URL (strip tracking, rewrite Reddit to old.reddit.com)
  -> POST to API wrapper /add
  -> Respond with result

Use this when you specifically want a full DOM/WARC/screenshot archive of a non-Reddit page.

Media Archives Ledger

All processed links are logged to the claude/media-archives repo on Forgejo as monthly CSVs:

  • Path: 2026/04.csv (year/month)
  • Columns: date_saved, url, source, type, title, tags, karakeep_id, archivebox_url, archive_today_url, reddit_account, status
  • Updated via Forgejo API (Contents endpoint)
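
A sketch of the ledger update, assuming each monthly file starts with a header row and that the file is rewritten in full through the Contents endpoint (which takes base64-encoded content, a commit message, and the current file `sha` when updating an existing file). The helper names and the header-row convention are assumptions.

```python
import base64
import csv
import io
from datetime import date

COLUMNS = [
    "date_saved", "url", "source", "type", "title", "tags", "karakeep_id",
    "archivebox_url", "archive_today_url", "reddit_account", "status",
]


def ledger_path(day):
    """Monthly file path, e.g. 2026/04.csv for any day in April 2026."""
    return f"{day.year:04d}/{day.month:02d}.csv"


def append_row(existing_csv, entry):
    """Return the CSV text with one row appended (header added for a new file)."""
    existing = existing_csv if existing_csv.strip() else ""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=COLUMNS, restval="", lineterminator="\n")
    if not existing:
        writer.writeheader()
    writer.writerow(entry)  # missing columns are written as empty strings
    if existing and not existing.endswith("\n"):
        existing += "\n"
    return existing + buf.getvalue()


def contents_payload(csv_text, message, sha=None):
    """Body for the Contents endpoint; `sha` is needed when updating a file."""
    body = {
        "content": base64.b64encode(csv_text.encode()).decode(),
        "message": message,
    }
    if sha:
        body["sha"] = sha
    return body
```

Using the `csv` module rather than string joins keeps titles containing commas or quotes from corrupting the row.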

Reddit Archiving Notes

  • old.reddit.com is essential — standard DOM, wget works, no Shadow DOM
  • Reddit .json endpoint (append .json to any post URL) works without auth for individual posts (~10 req/min)
  • archivebox update only runs missing extractors (smart incremental)
  • Typical Reddit post archive: ~5-10MB (dom + warc + screenshot + wget + media)
  • Estimated disk for 10,000 posts: ~50-100GB (10,000 × 5-10MB)

Backup Strategy

| Data | Mechanism | Schedule |
|---|---|---|
| LXC 128 snapshot (rootfs + SQLite) | PBS | Weekly |
| SQLite DB (/opt/archivebox/data/index.sqlite3) | Databasement | Daily 2:00am |
| Archive files on urahara | Backrest (archivebox-archives plan) | Daily 3:30am |
| Offline | senku USB rsync (/Volumes/sazabi/urahara/archivebox/) | On-demand |

Adding URLs Programmatically

# Via API wrapper (preferred for automation)
curl -X POST http://192.168.1.128:8001/add \
  -H "Authorization: Bearer <token>" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://old.reddit.com/r/...", "tag": "subreddit_name"}'

# Via CLI (direct)
ssh <user>@<host> docker exec archivebox archivebox add 'https://example.com'

# Via n8n webhook (iOS Shortcuts)
curl -X POST https://n8n.eva-00.network/webhook/archive-media \
  -H "Authorization: Bearer <webhook_secret>" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://reddit.com/r/..."}'