Backing up an 18-stack homelab to Backblaze B2 with restic
A pragmatic, encrypted, off-site backup for a Docker Compose homelab — what to back up, what to skip, and the per-database tricks that make restores reliable.
My homelab grew the way most homelabs do: one Docker Compose stack at a time. After a year, I had eighteen of them — media (Plex, Jellyfin, the *arr suite), photos (Immich), an NVR (Frigate), face recognition (CompreFace), monitoring (Prometheus + Grafana), a few databases, a TeamSpeak server, two game servers, plus the random utility containers that always sneak in.
What I did not have was a backup.
When I sat down to fix that, the obvious “just rsync everything” plan turned out to be wrong in interesting ways. This is what I built instead, and the trade-offs that drove each decision.
What “backup” actually means
The first useful question wasn’t “how do I back this up” but “what’s actually irreplaceable?” The answer split cleanly into three buckets:
| Bucket | Examples | Backed up? |
|---|---|---|
| Irreplaceable (~10 GB) | Stack configs, .env secrets, every database, app state | Yes — daily |
| Regenerable from a source of truth (~50 GB) | Frigate camera recordings, Ollama model files, Tdarr transcode cache, ML model weights | No — re-derivable |
| Bulk media (~8 TB) | Movies, TV, music, photos | No — separate strategy |
This is the most important architectural call: bulk media is out of scope. Storing it off-site at scale is expensive, and the download pipeline is itself the source of truth — Sonarr and Radarr know how to refetch anything that goes missing, given their config DBs (which are in the backup). Photos in Immich are the partial exception; they get a separate local-RAID strategy that I’ll write up another time.
Cutting media out shrank the daily backup target by three orders of magnitude. That changed every other decision downstream.
Architecture in three phases
The whole system is one shell script invoked once a day by cron. It does three things in order:
- Dump every database into a 0700 staging directory.
- Snapshot a curated set of paths — staging plus selected bind mounts — to an encrypted restic repository on Backblaze B2.
- Prune older snapshots once a week according to a retention policy.
Total runtime is around 2 minutes. The daily delta is a few hundred KB to a few MB, depending on how much database churn there’s been.
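The three phases above can be sketched as a Bash skeleton. This is an illustration under stated assumptions rather than the actual script: the staging path, the stacks directory, the `nightly` tag, the Sunday prune, and the retention values are all hypothetical. The `restic backup` and `restic forget --prune --keep-*` invocations are standard restic usage.

```bash
#!/usr/bin/env bash
# Sketch of the three-phase nightly run. Paths, tag and retention values are
# illustrative assumptions, not the actual configuration.
STAGING="${STAGING:-/var/backups/staging}"   # hypothetical staging dir
STACKS_DIR="${STACKS_DIR:-/opt/stacks}"      # hypothetical bind-mount root
PRUNE_DOW="${PRUNE_DOW:-7}"                  # day of week to prune (7 = Sunday)

phase_dump() {
  # Dumps contain secrets, so keep the directory owner-only.
  mkdir -p "$STAGING" && chmod 0700 "$STAGING"
  # ... per-engine dump stages write into "$STAGING" here ...
}

phase_snapshot() {
  # restic reads RESTIC_REPOSITORY, RESTIC_PASSWORD, B2_ACCOUNT_ID and
  # B2_ACCOUNT_KEY from the environment.
  restic backup --tag nightly "$STAGING" "$STACKS_DIR"
}

phase_prune() {
  # Only prune on the configured weekday; other days are a no-op.
  [ "$(date +%u)" = "$PRUNE_DOW" ] || return 0
  restic forget --prune --keep-daily 7 --keep-weekly 4 --keep-monthly 6
}

main() {
  phase_dump || return 1
  phase_snapshot || return 1   # an upload error fails the whole run
  phase_prune
}
```

Calling `main` at the bottom of the script and scheduling it with a cron entry such as `15 3 * * * /usr/local/bin/homelab-backup.sh` (path hypothetical) would give the once-a-day cadence.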
The interesting part: heterogeneous database dumps
This is where most “just rsync the data dir” backup plans go wrong. Copying live database files while the engine is writing to them produces corrupt copies — sometimes silently. Different engines need different tricks:
| Engine | Cases | Approach |
|---|---|---|
| Postgres (multi-DB cluster) | Local cluster with 11 user databases | pg_dumpall via docker exec (single-DB pg_dump would silently lose 10 of them) |
| Postgres (single-DB) | One application database with extensions | pg_dump --clean --if-exists |
| Postgres (managed/remote) | Provider-managed instance | Throwaway client container (docker run --rm postgres:17 pg_dump) — the image must match or exceed the server major version |
| MariaDB | Single-server multi-DB | mariadb-dump --single-transaction (online, lock-free for InnoDB) |
| Redis | Bind-mount data dir | BGSAVE + poll LASTSAVE until the new RDB is on disk; tar it from inside the container to sidestep host UID issues |
| Redis | Anonymous Docker volume | Same BGSAVE dance, then stream the RDB out via docker exec cat |
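| SQLite | Apps that hold the file open | Python’s stdlib sqlite3.Connection.backup(), which wraps SQLite’s online backup API and yields a consistent copy while the app keeps writing (the sqlite3 CLI’s .backup command and VACUUM INTO are also safe; a plain cp is not) |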
| App-native | Sonarr / Radarr / Prowlarr | Trigger their built-in backup via REST API (POST /api/v3/command {"name":"Backup"} on Sonarr and Radarr; Prowlarr exposes the same command under /api/v1), then ship the resulting zip, which gives you config.xml for free |
| Distroless container | Grafana, Portainer | Brief docker stop → docker cp → docker start. ~3-second downtime is acceptable for tools used by one person |
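Two of the rows above are easier to follow as code. The sketch below shows one way to do the Redis "BGSAVE, then poll LASTSAVE" wait and the SQLite online backup. Container names, the timeout, and the `/data/dump.rdb` default (the official Redis image's usual location) are assumptions. `redis-cli LASTSAVE`, `redis-cli BGSAVE`, and Python's `sqlite3.Connection.backup()` are real interfaces.

```bash
# Sketch only: container names, paths and timeouts are hypothetical.

redis_bgsave_wait() {
  # $1 = container, $2 = timeout in seconds (default 60)
  container="$1"; timeout="${2:-60}"
  before="$(docker exec "$container" redis-cli LASTSAVE)" || return 1
  docker exec "$container" redis-cli BGSAVE >/dev/null || return 1
  waited=0
  # LASTSAVE is a Unix timestamp with one-second resolution; it advances
  # once the background save has finished writing the new RDB.
  while [ "$waited" -lt "$timeout" ]; do
    now="$(docker exec "$container" redis-cli LASTSAVE)" || return 1
    [ "$now" -gt "$before" ] && return 0
    sleep 1
    waited=$((waited + 1))
  done
  return 1   # timed out: treat this dump as failed
}

redis_stream_rdb() {
  # $1 = container, $2 = output file, $3 = RDB path inside the container
  docker exec "$1" cat "${3:-/data/dump.rdb}" > "$2"
}

sqlite_online_backup() {
  # $1 = live database file, $2 = destination copy
  python3 - "$1" "$2" <<'PY'
import sqlite3
import sys

src = sqlite3.connect(sys.argv[1])
dst = sqlite3.connect(sys.argv[2])
with dst:
    src.backup(dst)  # online backup API: consistent even with concurrent writers
dst.close()
src.close()
PY
}
```

A typical sequence for the anonymous-volume case would be `redis_bgsave_wait cache && redis_stream_rdb cache "$STAGING/cache.rdb"`, where `cache` is a placeholder container name.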
The orchestrator runs every dump as a separate stage and is fail-soft: one broken container does not abort the whole night. Failures are logged, surfaced via a Healthchecks.io ping, and visible in the Prometheus textfile collector. Restic itself is fail-fast — if the upload errors, the run errors.
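The fail-soft behaviour described above could look roughly like this. It is a sketch, not the actual orchestrator: the `run_stage` helper, the container names, the `postgres` superuser, and the staging path are all hypothetical. The MariaDB invocation follows the pattern documented for the official image, where the root password is read from the container's own environment.

```bash
# Sketch of a fail-soft stage runner. Container names, the postgres superuser
# and the staging path are assumptions for illustration.
STAGING="${STAGING:-/var/backups/staging}"
FAILED_STAGES=""

run_stage() {
  # Usage: run_stage <name> <command> [args...]
  name="$1"; shift
  if "$@"; then
    echo "ok: $name"
  else
    echo "FAILED: $name (exit $?)" >&2
    FAILED_STAGES="$FAILED_STAGES $name"
  fi
  return 0   # one broken stage must not abort the rest of the night
}

dump_pg_cluster() {
  # pg_dumpall captures every database plus roles. Writing to a temp file
  # first means a half-written dump never appears under the real name.
  out="$STAGING/$1.sql"
  docker exec "$1" pg_dumpall -U postgres > "$out.tmp" \
    && mv "$out.tmp" "$out" || { rm -f "$out.tmp"; return 1; }
}

dump_mariadb() {
  out="$STAGING/$1.sql"
  docker exec "$1" sh -c \
    'exec mariadb-dump --single-transaction --all-databases -uroot -p"$MARIADB_ROOT_PASSWORD"' \
    > "$out.tmp" && mv "$out.tmp" "$out" || { rm -f "$out.tmp"; return 1; }
}

run_all_dumps() {
  run_stage pg-cluster dump_pg_cluster postgres
  run_stage mariadb    dump_mariadb mariadb
  # Non-zero when anything failed, so the caller can report a partial failure.
  [ -z "$FAILED_STAGES" ]
}
```

The restic upload would deliberately sit outside `run_stage`, so that an upload error still fails the run while a single broken dump does not.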
Why restic over the alternatives
I considered four families:
- Plain rsync + pg_dump scripts. Simplest, no new tools. No encryption, no deduplication, no snapshots. Fine if you only need yesterday’s data and storage is local.
- Borg. Similar feature set to restic. Better local performance, weaker cloud-storage support.
- Duplicati / Kopia. Web UI for browsing and restoring. Duplicati has had real reliability issues historically; Kopia is solid but newer.
- restic. Single Go binary, encrypted by default with AES-256, content-addressable storage gives transparent deduplication, native backends for B2/S3/Azure/GCS/SFTP, mountable snapshots via FUSE.
Restic won on operational simplicity. One binary, one repo, one config file. The whole thing is a few-hundred-line shell script and never a moving target — which is exactly what I want from infrastructure I touch four times a year.
Storage cost: under the free tier
After deduplication and compression, the working set is about 2 GB. Backblaze B2’s first 10 GB are free indefinitely, and egress is free up to 3× your stored size per month, which means a full disaster restore is also free. Even if the repo grew to 30 GB, only the 20 GB above the free tier would be billed, which at roughly $0.006 per GB-month comes to about $0.12/month.
This is what made the “configs only” scope so attractive: the 8 TB of media that I left out of scope would have cost real money to mirror; the 10 GB that’s actually irreplaceable costs nothing.
Failure modes I designed for
A backup script that nobody ever looks at is a backup script that has been silently broken for six months. So:
- Failed dumps don’t kill the run. A snapshot with 14 of 16 databases is still useful.
- Healthchecks.io ping on success, /fail ping on partial failure, missed-ping email if the host is down entirely.
- Optional Prometheus textfile metrics (backup_last_success_timestamp_seconds, backup_duration_seconds, backup_repo_size_bytes) so the existing Grafana dashboard knows about backups too.
- Documented restore runbook with a per-engine restore command for every database in scope. Trust in restic isn’t enough; the recipe for using it has to live somewhere accessible.
- The runbook lives outside the encrypted repo. Storing it inside is a chicken-and-egg problem. It goes in a password manager and a separate plaintext folder in the same B2 bucket.
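The monitoring hooks in the list above can be sketched as two small functions. The ping URL, the textfile directory, and the function names are hypothetical. The `/fail` suffix is the Healthchecks.io convention for signalling failure, and the metric names are the ones listed above. On a partial failure this sketch carries the previous success timestamp forward, which is one possible choice rather than a confirmed detail of the real script.

```bash
# Sketch: HC_URL and TEXTFILE_DIR are hypothetical settings.
HC_URL="${HC_URL:-https://hc-ping.com/your-uuid-here}"
TEXTFILE_DIR="${TEXTFILE_DIR:-/var/lib/node_exporter/textfile_collector}"

ping_healthcheck() {
  # $1 = number of failed stages; any failure pings the /fail endpoint.
  suffix=""
  [ "$1" -eq 0 ] || suffix="/fail"
  curl -fsS -m 10 --retry 3 -o /dev/null "${HC_URL}${suffix}"
}

write_metrics() {
  # $1 = duration in seconds, $2 = repo size in bytes, $3 = failed stage count
  final="$TEXTFILE_DIR/backup.prom"
  tmp="$final.$$"
  {
    echo "backup_duration_seconds $1"
    echo "backup_repo_size_bytes $2"
    if [ "$3" -eq 0 ]; then
      echo "backup_last_success_timestamp_seconds $(date +%s)"
    else
      # Keep the previous success timestamp so alerts can measure staleness.
      grep '^backup_last_success_timestamp_seconds ' "$final" 2>/dev/null || true
    fi
  } > "$tmp"
  # Rename is atomic on the same filesystem, so the collector never reads a
  # partially written file.
  mv "$tmp" "$final"
}
```

An alert on `time() - backup_last_success_timestamp_seconds` exceeding a bit more than a day would then catch the "silently broken for six months" case from inside Grafana as well.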
Disaster recovery in about two hours
The realistic path from blank hardware to “it’s running again”:
- Fresh OS install, then apt install docker.io docker-compose-plugin restic
- Recreate the host user with the same UID, the external Docker network, and any disk mount points
- Set the four B2/restic environment variables from the password manager
- restic restore latest --target /
- For each stack: bring up the database container, pipe in the dump
- docker compose up -d per stack to fetch images and start everything
Roughly two hours. Bulk media — the part deliberately not in the backup — refills itself over time as Sonarr and Radarr re-grab anything missing.
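As a rough illustration of the restore steps above, the sketch below wraps the restic restore and a Postgres cluster restore in functions. It assumes the four variables are `B2_ACCOUNT_ID`, `B2_ACCOUNT_KEY`, `RESTIC_REPOSITORY`, and `RESTIC_PASSWORD` (the ones restic's B2 backend reads), that the dump came from pg_dumpall, and that the superuser is `postgres`. Stack directories, service names, and dump paths are placeholders.

```bash
# Sketch of the restore sequence. Directory, service and file names are
# hypothetical; pass absolute paths.

restore_host() {
  # Abort early if any of the four variables is missing.
  : "${B2_ACCOUNT_ID:?}" "${B2_ACCOUNT_KEY:?}" "${RESTIC_REPOSITORY:?}" "${RESTIC_PASSWORD:?}"
  restic restore latest --target /
}

restore_pg_cluster() {
  # $1 = stack directory, $2 = compose service name, $3 = pg_dumpall file
  (
    cd "$1" || exit 1
    docker compose up -d "$2" || exit 1
    # Wait (up to about a minute) for Postgres to accept connections.
    tries=0
    until docker compose exec -T "$2" pg_isready -U postgres >/dev/null 2>&1; do
      tries=$((tries + 1))
      [ "$tries" -lt 60 ] || exit 1
      sleep 1
    done
    # -T disables TTY allocation so the dump can be piped in on stdin.
    docker compose exec -T "$2" psql -U postgres -d postgres < "$3"
  )
}

start_stack() {
  # $1 = stack directory; pulls images as needed and starts every service.
  ( cd "$1" && docker compose up -d )
}
```

Each engine needs its own counterpart to `restore_pg_cluster`, which is why the runbook lists a restore command per database instead of relying on a single generic step.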
What I’d improve next
- Second off-site copy in a different provider, for ransomware/account-loss insurance. restic copy makes this a one-liner.
- Restore drill on a cadence. A backup you’ve never restored from is a hypothesis, not a backup. I want to schedule a quarterly test-restore to a throwaway VM.
- Bulk media on a separate strategy, likely a second on-host disk or a NAS, with rsync --link-dest for cheap incrementals. Different problem, different tool.
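Two of the ideas above map onto short commands. The sketch assumes restic 0.14 or newer (which uses `--from-repo` for the source repository), a destination repository that has already been initialised, and a dated-directory layout with a `latest` symlink for the media snapshots. Repository URLs, paths, and function names are hypothetical.

```bash
# Sketch: repository locations and directory layout are hypothetical.

copy_to_second_repo() {
  # $1 = source repo, $2 = destination repo (already initialised).
  # The destination password comes from RESTIC_PASSWORD, the source password
  # from RESTIC_FROM_PASSWORD (or the matching --*-password-file flags).
  restic -r "$2" copy --from-repo "$1"
}

media_snapshot() {
  # $1 = source directory, $2 = snapshot root (absolute path)
  today="$2/$(date +%F)"
  # Files unchanged since the previous snapshot become hard links into it,
  # so each dated directory looks complete but only new data uses space.
  rsync -a --delete --link-dest="$2/latest" "$1/" "$today/" || return 1
  ln -sfn "$today" "$2/latest"
}
```

On the very first run `latest` does not exist yet; rsync warns about the missing `--link-dest` directory and falls back to a full copy.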
Takeaways
- Decide what you’re actually trying to protect before you pick the tool. The scope decision is more important than the technology.
- Heterogeneous workloads need heterogeneous capture strategies — there is no universal “back up a database” command.
- Encryption belongs at the client, not at the storage provider. The provider can be compromised; your password can be in your head.
- A backup that’s never been restored from is not a backup. Build the restore path into the design from day one, and document it somewhere that survives the disaster.
Stack used: restic, Backblaze B2, Docker Compose, Postgres 14/17, MariaDB, Redis/Valkey, SQLite, Bash, cron, Prometheus, Grafana, Healthchecks.io.