I have a lot of tar and disk image backups, as well as raw photos, that I want to squeeze onto a hard drive for long-term offline archival. I want to make the most of the drive’s capacity, so I want to compress them at the highest ratio supported by standard tools. I’ve zeroed out the free space in my disk images so that a compressed image only takes up about as much space as the actual files on it. In my experience, raw images can shrink by a third or even half with max compression (and I assume that’s lossless, since file-level compression regenerates the original file in its entirety?).

I’ve heard horror stories of compressed files being made completely unextractable by a single corrupted bit, but I don’t know how much of a risk that still is in 2025. Since I plan to leave the hard drive unplugged for long periods, I want the best chance of recovery if something does go wrong.

I also want the files to be extractable with just the standard command-line tools that ship with Linux/Unix, since this is my disaster recovery plan. When my server dies, I want to be able to work with the archive from a Linux live image without installing any extra packages, hence I’m only looking at gz, xz, or bz2.

So out of the three, which is generally considered the most stable and corruption resistant when the compression ratio is turned all the way up? Can any of them recover from a bit flip, or at the very least detect with certainty whether the data is corrupted when extracting? Additionally, should I be generating separate checksum files for the original data, or do the compressed formats include checksumming themselves?
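
On the separate-checksum question, here is a minimal sketch of what a standalone manifest could look like, assuming Python 3.9+ is available where the archive is created. The names `write_manifest` and `verify_manifest` are made up for illustration. The `<digest>  <path>` line layout is chosen to resemble what `sha256sum` prints, so `sha256sum -c` should also be able to read it for ordinary file names.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large disk images need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        while chunk := fh.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()


def write_manifest(root: Path, manifest: Path) -> None:
    """Write one '<hex digest>  <relative path>' line per file under root."""
    lines = []
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        if path.resolve() == manifest.resolve():
            continue  # do not hash the manifest itself if it lives under root
        lines.append(f"{sha256_of(path)}  {path.relative_to(root).as_posix()}\n")
    manifest.write_text("".join(lines))


def verify_manifest(root: Path, manifest: Path) -> list[str]:
    """Return relative paths that are missing or whose hash no longer matches."""
    bad = []
    for line in manifest.read_text().splitlines():
        expected, rel = line.split("  ", 1)
        target = root / rel
        if not target.is_file() or sha256_of(target) != expected:
            bad.append(rel)
    return bad
```

Hashing the original files before compression gives an end-to-end check that does not depend on whichever container format ends up being used.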

  • DasFaultier@sh.itjust.works
    29 days ago

    “and I figure, given the username, that you can speak German haha” Yep, that’s right. :D

    I’ll stick with English anyway, so that it’s understood in the English thread.

    ENGLISH: Yeah, you’re right, I wasn’t particularly on-topic there. :D I tried to address your underlying assumptions as well as the actual file format question, and it kinda derailed from there.

    Sooo, file format… I think you’re restricting yourself too much if you only use the formats whose tools ship with a stock live image. Also, you have conflicting goals there: it’s compression (make the most of your storage) vs. resilience (have a format that is stable in the long term). Someone here recommended lzip, which is definitely a good answer for a good compression ratio. The Wikipedia article I linked features a table that compares compressed archive formats, so that might be a good starting point to find resilient formats. Look out for formats with at least Integrity Check and possibly Recovery Record, as these seem to be more important than compression ratio. When you have settled on a format, run some tests to find the best compression algorithm for your material. You might also want to measure throughput/time while you’re at it to find variants that offer a reasonable compromise between compression and performance. If you’re so inclined, try to read a few format specs to find suitable candidates.
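
    One rough way to run such a comparison, sketched with Python’s standard-library bindings for the three formats from the question rather than the command-line tools. `benchmark` and its `level` parameter are hypothetical names, and results on a sample only hint at how the real disk images and raw photos will behave.

```python
import bz2
import gzip
import lzma
import time


def benchmark(data: bytes, level: int = 9) -> list[tuple[str, float, float]]:
    """Return (name, compressed/original size ratio, seconds) per codec."""
    codecs = {
        "gz": lambda d: gzip.compress(d, compresslevel=level),
        "bz2": lambda d: bz2.compress(d, compresslevel=level),
        "xz": lambda d: lzma.compress(d, preset=level),
    }
    results = []
    for name, compress in codecs.items():
        start = time.perf_counter()
        packed = compress(data)
        elapsed = time.perf_counter() - start
        results.append((name, len(packed) / len(data), elapsed))
    return results


def report(data: bytes, level: int = 9) -> None:
    """Print one line per codec: size relative to the input and wall time."""
    for name, ratio, seconds in benchmark(data, level):
        print(f"{name:>3}: {ratio:6.1%} of original in {seconds:.2f}s")
```

    Feeding `report` the bytes of a representative file (or a slice of one) gives a first impression. Keep in mind that xz at level 9 wants several hundred MiB of memory while compressing.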

    You’re generally looking for formats that:

    • are in widespread use
    • are specified/standardized publicly
    • are of a low complexity
    • don’t have features like DRM/Encryption/anti-copy
    • are self-documenting
    • are robust
    • don’t have external dependencies (e.g. for other file formats)
    • are free of any restrictive licensing/patents
    • can be validated.
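
    To get a feel for what “can be validated” means in practice, here is a small experiment, again assuming Python’s standard-library codecs stand in for the gz, bz2 and xz tools. It flips one bit in the middle of a compressed stream and checks whether decompression raises an error. `flip_bit` and `detects_corruption` are illustrative names, not part of any library.

```python
import bz2
import gzip
import lzma


def flip_bit(blob: bytes, byte_index: int, bit: int = 0) -> bytes:
    """Return a copy of blob with a single bit inverted."""
    damaged = bytearray(blob)
    damaged[byte_index] ^= 1 << bit
    return bytes(damaged)


def detects_corruption(compress, decompress, data: bytes) -> bool:
    """Compress data, flip one bit mid-stream, report whether decoding fails."""
    packed = compress(data)
    damaged = flip_bit(packed, len(packed) // 2)
    try:
        decompress(damaged)
    except Exception:  # each library raises its own error type
        return True
    return False  # decoding succeeded, so the damage went unnoticed


CODECS = {
    "gz": (gzip.compress, gzip.decompress),
    "bz2": (bz2.compress, bz2.decompress),
    "xz": (lzma.compress, lzma.decompress),
}

if __name__ == "__main__":
    sample = b"".join(f"record {i}: {i * i}\n".encode() for i in range(5000))
    for name, (compress, decompress) in CODECS.items():
        print(name, "detected" if detects_corruption(compress, decompress, sample) else "missed")
```

    Detection is not repair: none of the three formats carries redundancy that could fix the flipped bit.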

    If you want more technical detail on how an actual archive handles these challenges, see https://slubarchiv.slub-dresden.de/technische-standards-fuer-die-ablieferung-von-digitalen-dokumenten and the PDF files with specifications linked there (all in German).

    • Ferk@lemmy.ml
      29 days ago

      Just note that @RiverRabbits@lemmy.blahaj.zone wasn’t the one who opened the thread. That’s why they said they didn’t ask the question (I get the feeling there might have been some confusion here :P ).

      Still, very informative comment.

      • RiverRabbits@lemmy.blahaj.zone
        29 days ago

        Haha, yeah I’m not the OP! But the way my German is phrased here, and how the replier interpreted it, would read as super passive-aggressive (think “I didn’t ask that question but thanks”), and for that I apologize 😭 I just meant I’m not the OP😌