I am one of a network of academic researchers from around the world working on collecting media market data. One problem is that referenced sources often disappear, which makes later validation difficult or impossible. So I thought I would recommend self-hosting something like archive.org, which would let affiliated researchers submit their web references and have the sources archived efficiently in a central project repository. That would allow validation and continuity when web-hosted text and files disappear or researchers leave.

I have been looking at ArchiveBox. If you have experience with this or a similar solution, would it fit the bill? The important things are efficiency for researchers submitting and retrieving pages and files, and openness in structure and formats, so that the archive would remain useful if ArchiveBox or a similar tool disappears. FOSS of course means you can’t be locked out anyway.

  • irmadlad@lemmy.world
    7 hours ago

    I wonder if an authorised remote user (ie an affiliated researcher) can easily instruct ArchiveBox to store a URL and later retrieve it

    Once you download the data and persist it on local storage, it’s available to whoever has access to that drive or server.

    Also, ideally a random user should be able to retrieve the archived web page or file (eg a PDF, CSV etc).

    For rando access, you could put the data on a public FTP server, or even get fancier with HTML-styled pages. If I understand you correctly, you want a random user who is reading your report to be able to click a citation and be presented with whatever you downloaded with ArchiveBox. Kind of Wikipedia style. Speaking of which, a wiki framework might be just the ticket.

    Download the data, integrate it into a self-hosted wiki, and it would be available to rando users. Of course, your wiki server will need all the accoutrements of security so you don’t get hacked by a bazillion bots.
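
    To make the "download it and persist it on local storage" idea a bit more concrete, here is a rough Python sketch of keeping already-downloaded sources in plain folders with a JSON index, so a wiki page or a static file server could link straight to them. This is not ArchiveBox's own layout or API; the function names (`store_snapshot`, `find_snapshot`) and the `index.json` structure are made up for illustration, and it assumes you have already fetched the bytes some other way.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional


def store_snapshot(root: Path, url: str, content: bytes, filename: str) -> Path:
    """Save one fetched source under root/<sha256 of URL>/ and record it in index.json.

    Hypothetical helper: the folder layout and index format are assumptions,
    chosen only because they are readable without any special software.
    """
    key = hashlib.sha256(url.encode("utf-8")).hexdigest()
    folder = root / key
    folder.mkdir(parents=True, exist_ok=True)

    safe_name = Path(filename).name  # drop any directory parts from the name
    target = folder / safe_name
    target.write_bytes(content)

    # Load the existing index (if any), add or overwrite the entry for this URL.
    index_path = root / "index.json"
    index = {}
    if index_path.exists():
        index = json.loads(index_path.read_text(encoding="utf-8"))
    index[url] = {
        "folder": key,
        "file": safe_name,
        "sha256": hashlib.sha256(content).hexdigest(),  # lets others verify the file
        "archived_at": datetime.now(timezone.utc).isoformat(),
    }
    index_path.write_text(json.dumps(index, indent=2, sort_keys=True), encoding="utf-8")
    return target


def find_snapshot(root: Path, url: str) -> Optional[Path]:
    """Return the stored file path for a cited URL, or None if it was never archived."""
    index_path = root / "index.json"
    if not index_path.exists():
        return None
    entry = json.loads(index_path.read_text(encoding="utf-8")).get(url)
    if entry is None:
        return None
    return root / entry["folder"] / entry["file"]
```

    Because everything is ordinary files plus one JSON document, the archive stays usable even if the tool that filled it goes away, and the stored checksum gives a later reader a way to confirm the file has not changed since it was archived.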