I have a lot of tar and disk image backups, plus raw photos, that I want to squeeze onto a hard drive for long-term offline archival. To make the most of the drive’s capacity, I want to compress them at the highest ratio the standard tools support. I’ve zeroed out the free space in my disk images, so a compressed image should only take up about as much space as the files it actually contains. In my experience, raw photos can shrink by a third or even half at max compression (and I assume that’s lossless, since file-level compression regenerates the original file in its entirety?).

I’ve heard horror stories of compressed files being made completely unextractable by a single corrupted bit, and I don’t know how much of a risk that still is in 2025. Since I plan to leave the hard drive unplugged for long periods, I want the best chance of recovery if something does go wrong.

I also want the files to be extractable with just the Linux/Unix standard binutils since this is my disaster recovery plan and I want to be able to work with it through a Linux live image without installing any extra packages when my server dies, hence I’m only looking at gz, xz, or bz2.

So out of the three, which is generally considered more stable and corruption resistant when the compression ratio is turned all the way up? Do any of them have the ability to recover from a bit flip or at the very least detect with certainty whether the data is corrupted or not when extracting? Additionally, should I be generating separate checksum files for the original data or do the compressed formats include checksumming themselves?

  • GenderNeutralBro@lemmy.sdf.org
    1 month ago

    Generally speaking, xz provides higher compression.
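
If you want to check that on your own data, here is a rough sketch that compresses the same tarball with all three tools at their highest standard levels and lists the sizes. `sample.txt` and `sample.tar` are made-up stand-ins for a real archive, and the commands write into the current directory, so run them somewhere scratch. Results depend heavily on the input, so treat this as a way to measure, not as a benchmark.

```shell
# Hypothetical sample data; substitute one of your real tarballs.
seq 1 200000 > sample.txt
tar -cf sample.tar sample.txt

# -c writes to stdout, so the original sample.tar is left in place.
gzip  -9  -c sample.tar > sample.tar.gz
bzip2 -9  -c sample.tar > sample.tar.bz2
# -9e is xz's highest preset plus "extreme" mode. Per the xz man page,
# -9 needs about 674 MiB of RAM to compress and about 65 MiB to
# decompress, which matters if the recovery machine is small.
xz    -9e -c sample.tar > sample.tar.xz

# Compare the resulting sizes.
ls -l sample.tar sample.tar.gz sample.tar.bz2 sample.tar.xz
```

All three are lossless: decompressing gives back the original file byte for byte.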

    None of these are well optimized for images. Depending on your image format, you might be better off leaving those files alone or converting them to a more modern format like JPEG-XL. Supposedly JPEG-XL can further compress JPEG files with no additional loss of quality, and it also has an efficient lossless mode.

    Do any of them have the ability to recover from a bit flip or at the very least detect with certainty whether the data is corrupted or not when extracting?

    As far as I know, none of the common compression formats have built-in error correction, and neither does tar. You can add it with external tools instead.
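
Detection is a different story from correction. All three formats store check values (CRC-32 in gzip and bzip2, CRC64 by default in xz), and each tool has a `-t` flag that tests a file without extracting it. A minimal sketch, using made-up file names and a truncated copy to simulate damage; a CRC is a strong check against random corruption but not an absolute guarantee:

```shell
# Hypothetical sample data, written to the current directory.
seq 1 100000 > check.txt
gzip  -9 -c check.txt > check.txt.gz
bzip2 -9 -c check.txt > check.txt.bz2
xz    -9 -c check.txt > check.txt.xz

# -t decompresses internally and verifies the stored check values
# without writing any output; exit status 0 means the file passed.
gzip  -t check.txt.gz  && echo "gz:  passed"
bzip2 -t check.txt.bz2 && echo "bz2: passed"
xz    -t check.txt.xz  && echo "xz:  passed"

# Simulate damage by keeping only the first 500 bytes of the .xz file.
head -c 500 check.txt.xz > truncated.xz
if xz -t truncated.xz 2>/dev/null; then
  echo "truncated.xz: damage NOT detected"
else
  echo "truncated.xz: damage detected"
fi
```

This only tells you that a file is damaged; it does not repair anything.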

    For validation, you can save a hash of the compressed output. MD5 is cryptographically broken, but it’s still generally fine (and widely used) for detecting random corruption. SHA256 is much more robust if you are worried about deliberate malicious forgery and not just random corruption.

    Usually, you’d just put hash files alongside your archive files with appropriate names, so you can manually check them later. Note that this will not provide you with information about which parts of the archive are corrupt, only that it is corrupt.
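
A minimal sketch of that workflow with `sha256sum` from coreutils. The names `backup.tar.xz` and `backup.tar.xz.sha256` are just an assumed naming convention, and the sample data is made up:

```shell
# Hypothetical sample archive, written to the current directory.
seq 1 50000 > photos.txt
tar -cf - photos.txt | xz -9 > backup.tar.xz

# Store the hash next to the archive. The file holds the hash and the
# file name, so it can be checked later with -c.
sha256sum backup.tar.xz > backup.tar.xz.sha256

# Later: recompute and compare. Exit status is 0 on a match and
# non-zero if the archive has changed.
sha256sum -c backup.tar.xz.sha256
```

Keeping a second copy of the hash files somewhere off the drive is cheap, since they are tiny.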

    For error correction, consider par2. Same idea: you give it a file, and it creates a secondary file that can be used alongside the original for error correction later.
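
Roughly, with the par2cmdline tool, it looks like the sketch below. The file names and the 10% redundancy figure are assumptions for illustration, and par2 is usually not preinstalled on a live image, so the sketch skips that step if the binary is missing.

```shell
# Hypothetical sample archive, written to the current directory.
seq 1 50000 > data.txt
tar -cf - data.txt | gzip -9 > archive.tar.gz

if command -v par2 >/dev/null 2>&1; then
  # -r10 asks for roughly 10% recovery data, stored in separate
  # .par2 files next to the archive. The archive is not modified.
  par2 create -r10 archive.tar.gz.par2 archive.tar.gz

  # Later: check the archive against the recovery data.
  par2 verify archive.tar.gz.par2

  # Only if verify reports damage:
  # par2 repair archive.tar.gz.par2
else
  echo "par2 not installed; archive.tar.gz is still extractable with gzip and tar"
fi
```

More redundancy means more damage can be repaired, at the cost of drive space.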

    I also want the files to be extractable with just the Linux/Unix standard binutils

    That is a key advantage of this method. Adding a hash file or par file does not change the basic archive, so you don’t need any special tools to work with it.

    You should also consider your file system and media. Some file systems offer built-in error correction. And some media types are less susceptible to corruption than others, either due to physical durability or to baked-in error correction.