operator runbook

wake-up-at-3am playbook for self-hosted briven. every section assumes you have shell access to the host running docker compose for the briven stack.

before any of the recipes below, run briven doctor. it prints which sub-system is unhealthy in seconds and rules out half of the diagnostics that follow.

api won't boot

look at docker compose logs api. the api refuses to start with a clear error message when a required env var is missing or invalid:

  • BRIVEN_ENCRYPTION_KEY must be 64 hex chars — generate with openssl rand -hex 32, paste into .env, redeploy. if you already have project env vars stored encrypted with a different key, see rotate encryption key below before changing this value or every secret in the database becomes unreadable.
  • BRIVEN_BETTER_AUTH_SECRET must be set — same fix; this one is safe to rotate without data loss but every active session is invalidated.
  • BRIVEN_DATABASE_URL: ECONNREFUSED — postgres isn't up yet or isn't reachable on the docker network. docker compose ps postgres and docker compose logs postgres; usually a stale pid lock, fixed by docker compose down postgres && docker compose up -d postgres.
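
a quick pre-flight for the encryption-key case, runnable on the host before redeploying (a minimal sketch, assuming the key lives in .env as BRIVEN_ENCRYPTION_KEY=<value>):

```shell
# the api wants exactly 64 hex chars; check before it boot-loops
key=$(grep '^BRIVEN_ENCRYPTION_KEY=' .env 2>/dev/null | cut -d= -f2)
if printf '%s' "$key" | grep -Eq '^[0-9a-fA-F]{64}$'; then
  echo "encryption key format ok"
else
  echo "encryption key format BAD: want 64 hex chars, got ${#key}"
fi
```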

magic link doesn't arrive

if BRIVEN_MITTERA_API_URL or BRIVEN_MITTERA_API_KEY is unset, briven prints the magic link to api stdout instead of sending email. that's intentional for self-host first boot. find it:

docker compose logs api 2>&1 | grep magic_link | tail -1

if mittera is configured but mail still isn't arriving, check the api log for mittera_send_failed entries — the API key may be wrong, the sender domain may not be verified on the mittera side, or mittera may be rejecting the request for another reason (the response body is logged, truncated to 240 chars). the link itself is always valid for 10 minutes; re-requesting just sends a fresh one.
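
to pull the failures with their logged (truncated) response bodies in one go:

```shell
# last five mittera failures; the response body usually names the reason
docker compose logs api 2>&1 | grep mittera_send_failed | tail -5
```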

promote yourself to platform admin

/admin is gated by the users.is_admin column. The first user gets it via SQL; everyone after that gets it via the admin tab itself.

docker compose exec postgres psql -U postgres -d briven_control \
  -c "UPDATE users SET is_admin = true WHERE email = '<your-email>'"

rotate encryption key (per-project env vars)

BRIVEN_ENCRYPTION_KEY is the AES-256-GCM key for project env vars at rest. rotating it requires a re-encrypt pass before the new key takes effect. plan a brief maintenance window — the api refuses writes while the migration runs.

  1. generate the new key: openssl rand -hex 32
  2. stop the api: docker compose stop api
  3. run the rotation script with the OLD key in OLD_KEY and NEW key in NEW_KEY:
    docker compose run --rm \
      -e BRIVEN_ENCRYPTION_KEY_OLD=<old> \
      -e BRIVEN_ENCRYPTION_KEY=<new> \
      api node packages/cli/dist/scripts/rotate-encryption-key.js
  4. update .env with the new key and start the api: docker compose up -d api
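
step 3 can be wrapped so the old and new keys can't be swapped by a copy-paste slip; a sketch assuming the current key is stored in .env as BRIVEN_ENCRYPTION_KEY=<value>:

```shell
# pull the current (old) key straight from .env, mint the new one in the same shell
old_key=$(grep '^BRIVEN_ENCRYPTION_KEY=' .env | cut -d= -f2)
new_key=$(openssl rand -hex 32)

docker compose run --rm \
  -e BRIVEN_ENCRYPTION_KEY_OLD="$old_key" \
  -e BRIVEN_ENCRYPTION_KEY="$new_key" \
  api node packages/cli/dist/scripts/rotate-encryption-key.js

# only after a clean exit: update .env with $new_key, then docker compose up -d api
echo "new key: $new_key"
```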

restore from backup

backups land in MinIO (or S3, B2 — whatever BRIVEN_BACKUP_DESTINATION points at). list them:

docker compose exec minio mc ls local/briven-backups/
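
dump names are ISO-dated, so a plain lexical sort finds the newest; a sketch assuming mc's default listing puts the object name in the last column:

```shell
# newest dump in the bucket, no date typing required
latest=$(docker compose exec minio mc ls local/briven-backups/ \
  | awk '{print $NF}' | sort | tail -1)
echo "$latest"
```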

restore a single dump (control plane shown — replace with `briven_data` for the data plane):

# 1. download the dump
docker compose exec minio mc cp local/briven-backups/2026-05-09.sql.gz /tmp/

# 2. stop the api so nothing writes during restore
docker compose stop api

# 3. drop + recreate the database
docker compose exec postgres psql -U postgres \
  -c "DROP DATABASE briven_control" \
  -c "CREATE DATABASE briven_control"

# 4. pipe the dump in
gunzip -c /tmp/2026-05-09.sql.gz | docker compose exec -T postgres psql -U postgres -d briven_control

# 5. start the api back up
docker compose up -d api
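
once the dump is in, a quick shape check against the freshly loaded database is worth the ten seconds (the two counts mirror the snapshot queries at the bottom of this runbook; compare against whatever rough numbers you expect):

```shell
# non-zero, roughly-expected counts mean the restore actually loaded rows
docker compose exec postgres psql -U postgres -d briven_control \
  -c "SELECT count(*) AS projects FROM projects" \
  -c "SELECT count(*) AS users FROM users"
```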

monthly restore drill is wired up in infra/backups/restore-drill.sh and runs against an ephemeral db so you don't need to take prod down to verify the dumps are valid.

suspend a project (abuse, billing past-due, customer ask)

docker compose exec postgres psql -U postgres -d briven_control \
  -c "UPDATE projects SET status = 'suspended', suspended_reason = '<reason>' WHERE id = '<p_...>'"

suspended projects refuse all api calls (404 from /v1/projects/:id/*) and their realtime subscriptions are forcibly closed within the next pump cycle. resume by flipping status back to active.
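
to confirm the gate from the outside, hit any project-scoped endpoint and expect a 404 (host, project id, and subroute here are placeholders for your deployment; per the note above, every /v1/projects/:id/* route should refuse):

```shell
# prints 404 once the suspension is active
curl -s -o /dev/null -w '%{http_code}\n' \
  "https://<api-host>/v1/projects/<project-id>/<any-subroute>"
```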

invocations are slow / rate-limited

first stop in Grafana: the runtime invocations dashboard. p95 over 500ms or a 429 spike points to one of:

  • cold-start storm — runtime is killing isolates faster than it can warm them. raise BRIVEN_RUNTIME_ISOLATE_TTL_SEC from the default 600 and bounce runtime.
  • rate-limit at the gateway — a single project is hitting tier ceilings. confirm in the api log: rate_limited project=p_.... resolve with the customer (upgrade tier or reduce hot-loop traffic).
  • postgres saturated — the postgres health dashboard will show connection pool exhaustion or lock waits. pg_stat_activity tells you which query.
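
for the postgres case, a starting query against pg_stat_activity (a sketch; adjust the columns and LIMIT to taste):

```shell
# longest-running non-idle queries first
docker compose exec postgres psql -U postgres -c "
  SELECT pid, state, wait_event_type,
         now() - query_start AS runtime,
         left(query, 80) AS query
    FROM pg_stat_activity
   WHERE state <> 'idle'
   ORDER BY runtime DESC NULLS LAST
   LIMIT 10"
```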

websocket subs flapping

if many clients are reconnecting in a tight loop, the realtime subs dashboard shows a sawtooth open/close pattern. usual causes:

  • traefik is timing out idle connections. set traefik.http.middlewares.briven-ws.headers.customRequestHeaders.X-WS-Timeout=300 and confirm ws_keepalive_ms in the realtime env is < the proxy timeout.
  • postgres is closing the LISTEN connection because the api went idle. the realtime service auto-reconnects in 1s steps with backoff; if you see hundreds of these in seconds the postgres host needs investigation.

discord alerts stopped firing

alerts route through alertmanager → benjojo/alertmanager-discord bridge → discord webhook. when the channel goes quiet either prometheus stopped firing (rule drift, scrape down) or the discord webhook url is invalid.

  1. confirm prometheus thinks rules are alerting: docker compose exec prometheus promtool query instant http://localhost:9090 'ALERTS' (promtool needs the server url as its first argument; 9090 is the default port inside the container).
  2. confirm alertmanager received them: docker compose logs alertmanager --tail 100 — look for msg="Notify success".
  3. confirm the bridge fired: docker compose logs alertmanager-discord-alerts --tail 100. a 401 here means the discord webhook url has been revoked — regenerate it in the discord channel and update DISCORD_WEBHOOK_ALERTS / DISCORD_WEBHOOK_DEPLOYS in the dokploy env, then restart the two bridge services.
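
to bisect between the bridge and discord itself, post a test message straight at the webhook; discord webhooks accept a JSON body with a content field (DISCORD_WEBHOOK_ALERTS is assumed to hold the same url the bridge uses):

```shell
# 204 = webhook valid (message appears in the channel); 401/404 = revoked url
curl -s -o /dev/null -w '%{http_code}\n' \
  -H 'Content-Type: application/json' \
  -d '{"content": "briven alert-path test, ignore"}' \
  "$DISCORD_WEBHOOK_ALERTS"
```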

backup off-site upload failed

systemd journal: journalctl -u briven-backup.service -n 200. the most common failure is the b2 application key being revoked server-side. easy fix: re-issue at backblaze, update the env, then systemctl start briven-backup.service to retry without waiting for the timer.

required env on the kvm running pg-dump.sh + restore-drill.sh:

  • BRIVEN_BACKUP_B2_KEY_ID — Backblaze B2 application key id.
  • BRIVEN_BACKUP_B2_APP_KEY — secret half of the application key. write-only scope is enough.
  • BRIVEN_BACKUP_B2_BUCKET — bucket name (e.g. briven-prod-backups-eu-central).
  • BRIVEN_BACKUP_PREFIX — key prefix inside the bucket. defaults to prod.
  • BRIVEN_BACKUP_CONTROL_URL + BRIVEN_BACKUP_DATA_URL — postgres dsns for the control-plane db + the data plane.
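
the scripts may fail mid-run when one of these is unset, so a small pre-flight helps (a sketch; the list mirrors the bullets above, minus BRIVEN_BACKUP_PREFIX, which has a default):

```shell
# name every missing backup env var before the scripts get a chance to fail
missing=""
for v in BRIVEN_BACKUP_B2_KEY_ID BRIVEN_BACKUP_B2_APP_KEY \
         BRIVEN_BACKUP_B2_BUCKET \
         BRIVEN_BACKUP_CONTROL_URL BRIVEN_BACKUP_DATA_URL; do
  [ -n "$(printenv "$v")" ] || missing="$missing $v"
done
if [ -z "$missing" ]; then
  echo "backup env ok"
else
  echo "missing:$missing" >&2
fi
```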

set a bucket lifecycle rule directly in the b2 UI: keep daily snapshots 30 days, monthly snapshots 12 months. the scripts don't manage retention — they assume the bucket does.

incident disclosure

for incidents that touch customer data, get a short message out within 72h and post the full post-mortem at docs.briven.tech/changelog within 30 days. template:

Title: <one-line summary, no jargon>
Detected: <utc>
Resolved: <utc>
Customer impact: <which projects, what data, exposure window>
Root cause: <one paragraph>
Mitigation in place: <bullets>
Long-term fixes: <bullets>

when in doubt — collect a snapshot

before opening an issue or paging support, run this on the host. attach the resulting tarball:

d=$(date -u +%Y%m%d-%H%M%S)
mkdir briven-snapshot-$d && cd briven-snapshot-$d
docker compose ps > ps.txt
docker compose logs --tail 500 api    > api.log    2>&1
docker compose logs --tail 500 runtime > runtime.log 2>&1
docker compose logs --tail 500 realtime > realtime.log 2>&1
docker compose logs --tail 500 postgres > postgres.log 2>&1
docker compose exec postgres psql -U postgres -d briven_control \
  -c "SELECT count(*) FROM projects" \
  -c "SELECT count(*) FROM users" \
  -c "SELECT version()" > db.txt 2>&1
cd .. && tar czf briven-snapshot-$d.tar.gz briven-snapshot-$d/

everything in the snapshot is operator-side metadata — no customer secrets, no project env vars, no row data.