Archiver

Tags: programming

Want to archive folders however you want?

Context

I had been trying to figure out a more efficient way of backing up directories on my computers. For example, I have a huge directory where my Obsidian vault lives. This vault is filled with PDF and Markdown files that I use for my everyday notes, and I like keeping it backed up in case my laptop or PC dies.

I used to back up this folder by simply turning it into a private GitHub repo and pushing everything to my account. This is not what GitHub was made for, and the repo was nearing 2 GB. It was time to search for a different solution.

Enter the infamous xz vulnerability. This put compression libraries on my radar, so I decided to take the entire directory and compress it with both xz and gzip. I noticed that xz compressed far better than gzip. For my purposes, I wanted to optimize for size at the cost of compression speed. Additionally, the nice people at r/DataHoarder had mentioned a really cool program called par2. It creates parity data for a file, so I can repair a damaged portion of that file, up to whatever redundancy level I choose.

For my purposes, I want to store big directories as small as possible, store their MD5-Sum so I can verify the files are correct, and store a parity file so I can recover any damaged part of them (up to 30% in my case). I want this because sharing archives on a local NAS or even with Syncthing this way is very easy and efficient.
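As a rough illustration of the recovery side, here is a minimal sketch using the par2cmdline `verify` and `repair` commands. The function name `check_and_repair` is hypothetical and is not part of Archiver. It assumes the `.par2` file sits next to the archive it protects.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Archiver): check an archive against its
# par2 recovery set and attempt a repair only if verification fails.
check_and_repair() {
  local par2_file="$1"
  if ! command -v par2 >/dev/null 2>&1; then
    echo "par2 is not installed" >&2
    return 127
  fi
  # `par2 verify` exits non-zero when the data no longer matches the set;
  # `par2 repair` then rebuilds the damaged blocks from the recovery data.
  par2 verify "$par2_file" || par2 repair "$par2_file"
}

# Example call (path is illustrative):
#   check_and_repair ~/OneDrive/Wallpapers/walls.tar.xz.par2
```

A repair can only succeed if the damage fits within the redundancy that was chosen when the recovery set was created.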

Some downsides to this system are:

PS: I recognize that xz might have vulnerabilities that have not been discovered yet.

Archiver

Archiver is a bash script that helps you back up whatever you want, however you want.

Usage

For example:

archives.json

[
  {
    "name": "Wallpapers",
    "target": "~/Media/Pictures/Wallpapers",
    "archive": {
      "name": "walls",
      "destination": "~/OneDrive/Wallpapers"
    },
    "timestamps": {
      "last_archive": 1712438594,
      "last_upload": 1712438693
    },
    "sync_command": "onedrive --synchronize --single-directory 'Wallpapers'",
    "md5sum": "6de26f11ad638fd145f3d1412e0bf1c6"
  },
  {
    "name": "Books",
    "target": "~/Documents/Books",
    "archive": {
      "name": "books",
      "destination": "~/OneDrive/Books"
    },
    "timestamps": {
      "last_archive": 1712439022,
      "last_upload": 1712439638
    },
    "sync_command": "onedrive --synchronize --single-directory 'Books'",
    "md5sum": "b4ae6185bb5a20d19c0b30f9778a10cb"
  }
]

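You specify a list of attribute sets, each with three key parts: target (the directory to archive), archive (the name and destination of the resulting archive), and sync_command (a hook that is run after archiving).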

Other details like timestamps and name are useful for other purposes if you wish to climb under the hood to use them. The MD5-Sum is also useful if you wish to verify the integrity of your files after retrieving them.
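To check a retrieved archive against the stored MD5-Sum, something like the following sketch should do. The function name `verify_md5` is hypothetical, and the archive file name in the example comment (`walls.tar.xz`) is an assumption about how the output is named.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Archiver): succeed only if the file's
# MD5 sum equals the expected value taken from archives.json.
verify_md5() {
  local file="$1" expected="$2" actual
  # md5sum prints "<hash>  <filename>"; keep only the hash.
  actual="$(md5sum "$file" | cut -d' ' -f1)"
  [ "$actual" = "$expected" ]
}

# Example call (file name is an assumption):
#   verify_md5 ~/OneDrive/Wallpapers/walls.tar.xz 6de26f11ad638fd145f3d1412e0bf1c6 \
#     && echo "archive intact" || echo "archive differs"
```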

PS: an example archives.json is provided.

Methodology

  1. Your target gets converted into a tarball.
  2. That tarball is compressed into an xz archive.
    • This format was chosen because of its excellent compression ratio.
    • Though in the future I would like to implement multiple formats for this.
  3. A parity archive is created from that compressed tarball.
    • Uses the par2cmdline utilities.
    • A single block file with 30% redundancy is created.
    • Additionally, you can use the index file that’s created but par2 doesn’t really need it.
  4. A Unix timestamp and an MD5-Sum are taken from the archived tarball.
  5. Your sync_command hook is run, and a second timestamp is taken once it finishes.
  6. Your archives.json file is updated with the fresh timestamps and the MD5-Sum.
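The steps above can be sketched roughly as follows. This is not Archiver's actual code: the function names (`expand_tilde`, `archive_entry`, `update_entry`), the `.tar.xz` output name, running the hook through `sh -c`, and using `jq` for the JSON update are all assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Sketch only; every function name here is hypothetical.

# "~" inside a JSON string is not expanded by the shell, so do it by hand.
expand_tilde() {
  case "$1" in
    "~/"*) printf '%s\n' "$HOME/${1#??}" ;;
    *)     printf '%s\n' "$1" ;;
  esac
}

# Steps 1-5. Results are left in the globals last_archive, md5, last_upload.
archive_entry() {
  local target dest name="$2" sync_command="$4" out
  target="$(expand_tilde "$1")"
  dest="$(expand_tilde "$3")"
  out="$dest/$name.tar.xz"   # assumed output name
  mkdir -p "$dest" || return 1

  # Steps 1-2: pack the target into a tarball and compress it with xz (-J).
  tar -C "$(dirname "$target")" -cJf "$out" "$(basename "$target")" || return 1

  # Step 3: one recovery file with 30% redundancy (skipped if par2 is absent).
  if command -v par2 >/dev/null 2>&1; then
    par2 create -r30 -n1 "$out.par2" "$out" >/dev/null || return 1
  fi

  # Step 4: Unix timestamp and MD5 sum of the compressed tarball.
  last_archive="$(date +%s)"
  md5="$(md5sum "$out" | cut -d' ' -f1)"

  # Step 5: run the sync hook, then take the second timestamp.
  if [ -n "$sync_command" ]; then
    sh -c "$sync_command" || return 1
  fi
  last_upload="$(date +%s)"
}

# Step 6: write the fresh values back into the matching entry (needs jq).
update_entry() {
  local json="$1" name="$2" tmp
  tmp="$(mktemp)" || return 1
  jq --arg n "$name" --arg md5 "$md5" \
     --argjson a "$last_archive" --argjson u "$last_upload" \
     'map(if .name == $n
          then .md5sum = $md5
             | .timestamps.last_archive = $a
             | .timestamps.last_upload = $u
          else . end)' "$json" > "$tmp" && mv "$tmp" "$json"
}

# Example call (values taken from the archives.json shown earlier):
#   archive_entry "~/Documents/Books" "books" "~/OneDrive/Books" \
#     "onedrive --synchronize --single-directory 'Books'" \
#     && update_entry archives.json "Books"
```

Writing the updated JSON to a temporary file and moving it into place means a failed `jq` run leaves the original archives.json untouched.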