Archiver

December 8, 2024

Want to archive folders however you want?

Context

I had been trying to figure out a more efficient way of backing up directories on my computers. For example, I have a huge directory where my Obsidian vault lives. The vault is filled with PDF and Markdown files that I use for my everyday notes, and I like keeping it backed up in case my laptop or PC dies.

I used to back up this folder by simply turning it into a private GitHub repo and pushing everything to my account. That is not what GitHub was made for, and the repo was nearing 2 GB. It was time to search for a different solution.

Enter the infamous xz vulnerability. This put compression libraries on my radar, so I decided to take the entire directory and compress it with both xz and gzip. xz compressed far better than gzip, which suited me: I wanted to optimize for size at the cost of compression speed. Additionally, the nice people at r/DataHoarder pointed me to a really cool program called par2. It generates parity (Reed-Solomon) recovery data for a file, so a damaged file can be repaired as long as the damage does not exceed the amount of redundancy I chose to generate.
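A rough way to reproduce this kind of comparison is to stream the same tarball through both compressors and count the bytes. This is only a sketch, not the measurement described above: the `compressed_size` helper and the `DIR` path are made up for illustration, and `-9` is simply the highest standard preset of both tools.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: print the size in bytes of a directory after
# tar + compression.
# Usage: compressed_size <directory> <compressor> [compressor flags...]
compressed_size() {
  local dir=$1
  shift
  # Stream the tarball straight into the compressor; nothing is written to disk.
  tar -cf - -C "$(dirname "$dir")" "$(basename "$dir")" | "$@" | wc -c | tr -d ' '
}

# Example (placeholder path): compare both tools at their highest preset.
# DIR=~/Documents/Vault
# echo "gzip: $(compressed_size "$DIR" gzip -9) bytes"
# echo "xz:   $(compressed_size "$DIR" xz -9) bytes"
```

How large the gap is depends heavily on the data: already-compressed files such as many PDFs shrink far less than plain Markdown.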

For my purposes, I want to store big directories as small as possible, store their MD5-Sum so I can verify the files are intact, and store a parity file so I can repair damage to them (up to 30% in my case). Archives stored this way are easy and efficient to share over a local NAS or even with Syncthing.
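On the recovery side, par2cmdline provides `create`, `verify` and `repair` subcommands. Below is a minimal sketch of a round trip, assuming `par2` is on the PATH. The wrapper names (`protect`, `check`, `fix`) and the file name `walls.tar.xz` are placeholders; `-r30` and `-n1` are the par2cmdline options for 30% redundancy and a single recovery file.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical wrappers around par2cmdline.

# Create an index file plus a single recovery file with 30% redundancy.
protect() {
  local file=$1
  par2 create -r30 -n1 "$file.par2" "$file"
}

# Exit status 0 means the file still matches the recovery data.
check() {
  local file=$1
  par2 verify "$file.par2"
}

# Rebuild damaged blocks, as long as the damage stays within the redundancy.
fix() {
  local file=$1
  par2 repair "$file.par2"
}

# Example (placeholder file name):
# protect walls.tar.xz
# check walls.tar.xz || fix walls.tar.xz
```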

Some downsides to this system are:

- xz is not as common as I thought, and it’s been somewhat complicated to move files to systems that don’t have the utility.
- Incremental backups are not possible.

PS: I recognize that xz might have vulnerabilities that have not been discovered yet.

Archiver

Archiver is a bash script that helps you back up whatever you want, however you want.

Usage

For example:

archives.json

[
  {
    "name": "Wallpapers",
    "target": "~/Media/Pictures/Wallpapers",
    "archive": {
      "name": "walls",
      "destination": "~/OneDrive/Wallpapers"
    },
    "timestamps": {
      "last_archive": 1712438594,
      "last_upload": 1712438693
    },
    "sync_command": "onedrive --synchronize --single-directory 'Wallpapers'",
    "md5sum": "6de26f11ad638fd145f3d1412e0bf1c6"
  },
  {
    "name": "Books",
    "target": "~/Documents/Books",
    "archive": {
      "name": "books",
      "destination": "~/OneDrive/Books"
    },
    "timestamps": {
      "last_archive": 1712439022,
      "last_upload": 1712439638
    },
    "sync_command": "onedrive --synchronize --single-directory 'Books'",
    "md5sum": "b4ae6185bb5a20d19c0b30f9778a10cb"
  }
]

You specify a list of attribute sets, each with three key parts:

- target: the directory you wish to back up
- archive: the name of the archive and where you want to store it
- sync_command: the command that uploads or syncs the finished archive for this directory (it runs as the last step)

Other details like timestamps and name are useful for other purposes if you wish to climb under the hood to use them. The MD5-Sum is also useful if you wish to verify the integrity of your files after retrieving them.
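Checking a retrieved archive against the recorded digest can be sketched as follows. `verify_md5` is a hypothetical helper, not part of Archiver, and the archive path in the example is a placeholder; only the digest is taken from the JSON above.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: succeed only if FILE hashes to the EXPECTED MD5 digest.
verify_md5() {
  local file=$1 expected=$2
  local actual
  # md5sum prints "<digest>  <file>"; keep only the digest.
  actual=$(md5sum "$file" | cut -d ' ' -f 1)
  [ "$actual" = "$expected" ]
}

# Example with the digest stored for "Wallpapers" (placeholder archive path):
# if verify_md5 ~/OneDrive/Wallpapers/walls.tar.xz 6de26f11ad638fd145f3d1412e0bf1c6; then
#   echo "archive intact"
# else
#   echo "archive differs from the recorded checksum"
# fi
```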

PS: an example archives.json is provided.
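If you want to read the config from your own scripts, one option is `jq`. The sketch below is an assumption about how that could look, not a description of how Archiver parses the file: `expand_tilde` and `list_entries` are hypothetical helpers. The tilde helper is there because the shell does not expand `~` when it comes out of a JSON string.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: replace a leading "~" with $HOME.
expand_tilde() {
  local path=$1
  case $path in
    "~")   printf '%s\n' "$HOME" ;;
    "~/"*) printf '%s\n' "$HOME/${path:2}" ;;
    *)     printf '%s\n' "$path" ;;
  esac
}

# Print "name<TAB>target<TAB>destination" for every entry in the config file.
list_entries() {
  local config=$1
  jq -r '.[] | [.name, .target, .archive.destination] | @tsv' "$config"
}

# Example: walk the entries and show the expanded paths.
# list_entries archives.json | while IFS=$'\t' read -r name target dest; do
#   echo "$name: $(expand_tilde "$target") -> $(expand_tilde "$dest")"
# done
```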

Methodology

Your target gets converted into a tarball.
That tarball is compressed into an xz archive.
- This format was chosen because of its excellent compression ratio.
- Though in the future I would like to implement multiple formats for this.
A parity archive is created from that compressed tarball.
- Uses the par2cmdline utilities.
- A single recovery file with 30% redundancy is created.
- An index file is created as well; you can keep it, but par2 doesn’t strictly need it for recovery.
A Unix timestamp and MD5-Sum are taken from the compressed tarball.
Your sync_command hook is run last, and a second timestamp is taken once it finishes.
Your archives.json file is updated with the fresh timestamps and MD5-Sum.
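The archive-building part of this flow can be sketched roughly like this. It is an illustration under stated assumptions, not the actual Archiver source: `make_archive`, its argument order, the `.tar.xz` file name and the choice to skip parity when `par2` is missing are all hypothetical, and updating archives.json is left out.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical function: archive TARGET into DEST/NAME.tar.xz, add parity data,
# and print "<unix timestamp> <md5 digest>" for the finished archive.
# Usage: make_archive <target> <dest> <name> [xz level, default 9]
make_archive() {
  local target=$1 dest=$2 name=$3 level=${4:-9}
  local out="$dest/$name.tar.xz"
  mkdir -p "$dest"

  # Tarball, compressed with xz (a high level favors size over speed).
  tar -cf - -C "$(dirname "$target")" "$(basename "$target")" | xz "-$level" > "$out"

  # One recovery file with 30% redundancy (par2 also writes an index file).
  if command -v par2 >/dev/null; then
    par2 create -r30 -n1 "$out.par2" "$out" >/dev/null
  else
    echo "par2 not found, skipping parity data" >&2
  fi

  # Timestamp and checksum of the compressed tarball.
  printf '%s %s\n' "$(date +%s)" "$(md5sum "$out" | cut -d ' ' -f 1)"
}

# Example (paths taken from the config shown earlier):
# read -r archived_at digest < <(make_archive ~/Media/Pictures/Wallpapers ~/OneDrive/Wallpapers walls)
# onedrive --synchronize --single-directory 'Wallpapers'   # the sync_command hook
# uploaded_at=$(date +%s)                                  # second timestamp
```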