I am one of a network of academic researchers from around the world collecting media market data. One problem is that referenced sources often disappear, which makes later validation difficult or impossible. So I thought I would recommend self-hosting something like archive.org that would allow affiliated researchers to submit their web references and have their sources efficiently archived in a central project repository. That would allow validation and continuity when web-hosted text and files disappear or researchers leave.

I have been looking at ArchiveBox. If you have experience with it or a similar solution, would it fit the bill? The important things are efficiency for researchers submitting and retrieving pages and files, and openness in structure and formats so that the archive would remain useful if ArchiveBox or similar disappears. FOSS of course means you can’t be locked out anyway.

  • hexagonwin@lemmy.today
    11 hours ago

    webrecorder browsertrix should work for this. they even have a hosted/paid service which could be better than selfhosting depending on the circumstance.

    saving as html with singlefile and sharing manually could be easier/simpler, the concept is easy to understand for non computer people imo.

    other than that i recently found out about hoardy-web. it doesn’t really fit your usecase though, as it basically saves everything you see in your browser for personal archiving. very well made but somehow it isn’t as widely known as other stuff in this area…

    • Stopwatch1986@lemmy.mlOP
      8 hours ago

      Having webrecorder host our archived pages is both an advantage and a disadvantage: the archive may survive longer than our project, or not as long.

      I have been using singlefile for years. It’s great, but not for seamlessly making cached web pages available to members of the general public who read our reports and find that cited links are now dead. And it doesn’t support URLs that point to PDF or CSV files. A public-facing repository of singlefile files with an index as a ToC might do it though. Simplicity is good for future-proofing an archive.

      Something like archive.org or archive.is would be ideal, but we have no control over their future and practices.
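
As a rough illustration of the "repository of singlefile files with an index" idea, here is a minimal sketch using only the Python standard library. It assumes all saved pages and downloaded files (PDF, CSV) sit in one flat folder; the function names `page_title` and `build_index`, the folder layout, and the "Archived sources" heading are hypothetical choices for this example, not part of SingleFile or any other tool. It writes a static `index.html` that links each file, using the `<title>` of saved HTML pages as the link text.

```python
from html import escape
from html.parser import HTMLParser
from pathlib import Path
from urllib.parse import quote


class TitleParser(HTMLParser):
    """Collects the text of the first <title> element in a page."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.done = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self.done:
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self.in_title:
            self.in_title = False
            self.done = True  # ignore later <title> tags, e.g. inside inline SVG

    def handle_data(self, data):
        if self.in_title:
            self.title += data


def page_title(path):
    """Return the page title, falling back to the file name."""
    parser = TitleParser()
    parser.feed(path.read_text(encoding="utf-8", errors="replace"))
    return parser.title.strip() or path.name


def build_index(archive_dir):
    """Write index.html listing every file in archive_dir (hypothetical flat layout)."""
    archive_dir = Path(archive_dir)
    rows = []
    for path in sorted(archive_dir.iterdir()):
        if not path.is_file() or path.name == "index.html":
            continue
        # Saved HTML pages get their own title; PDFs, CSVs etc. are listed by name.
        if path.suffix.lower() in {".html", ".htm"}:
            label = page_title(path)
        else:
            label = path.name
        rows.append(f'<li><a href="{quote(path.name)}">{escape(label)}</a></li>')
    html = (
        '<!DOCTYPE html>\n<html><head><meta charset="utf-8">'
        "<title>Archived sources</title></head>\n"
        "<body><h1>Archived sources</h1>\n<ul>\n"
        + "\n".join(rows)
        + "\n</ul>\n</body></html>\n"
    )
    (archive_dir / "index.html").write_text(html, encoding="utf-8")
    return len(rows)
```

Calling `build_index("archive")` on a folder named `archive` (an assumed name) regenerates the index in place, so it could be rerun whenever a researcher drops in a new file. Because the output is plain HTML next to plain files, the folder stays readable from any static web host even if the script itself is lost.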