News

The Internet Archive, the Wayback Machine and Archivarix

Update

An open source alternative to Archivarix is the 'wayback machine downloader', which can be found here. It works very well but does require some work by the user. It is a Ruby Gem, so one will need Ruby installed first. See here for the latest version for Windows - other flavours are available.

So install Ruby - basically just follow the wizard - then open an elevated command window and type: gem install wayback_machine_downloader
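
For clarity, the two commands look like this (the version check is optional - it just confirms Ruby is on the PATH; the version number reported will vary):

    ruby --version
    gem install wayback_machine_downloader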

That will install the Gem for the downloader. Then, again in an elevated cmd window, type something like this: wayback_machine_downloader http://example.com - where example.com is replaced by your target on the Internet Archive. Note that the site will be downloaded into a newly created sub-directory of the directory from which the command is launched, so it is best to use something like D:\temp as one's root.
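
For example, to pull a copy of a site into D:\temp (the path and domain are placeholders - any writable directory and any archived target will do), the session looks like this. As the usage notes below show, the files land by default in a websites\ sub-directory named after the domain:

    cd /d D:\temp
    wayback_machine_downloader http://example.com
    rem files are saved under D:\temp\websites\example.com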

Full usage looks like this:

Usage: wayback_machine_downloader http://example.com

Download an entire website from the Wayback Machine.

Optional options:
    -d, --directory PATH             Directory to save the downloaded files into
                                     Default is ./websites/ plus the domain name
    -s, --all-timestamps             Download all snapshots/timestamps for a given website
    -f, --from TIMESTAMP             Only files on or after timestamp supplied (ie. 20060716231334)
    -t, --to TIMESTAMP               Only files on or before timestamp supplied (ie. 20100916231334)
    -e, --exact-url                  Download only the url provided and not the full site
    -o, --only ONLY_FILTER           Restrict downloading to urls that match this filter
                                     (use // notation for the filter to be treated as a regex)
    -x, --exclude EXCLUDE_FILTER     Skip downloading of urls that match this filter
                                     (use // notation for the filter to be treated as a regex)
    -a, --all                        Expand downloading to error files (40x and 50x) and redirections (30x)
    -c, --concurrency NUMBER         Number of multiple files to download at a time
                                     Default is one file at a time (ie. 20)
    -p, --maximum-snapshot NUMBER    Maximum snapshot pages to consider (Default is 100)
                                     Count an average of 150,000 snapshots per page
    -l, --list                       Only list file urls in a JSON format with the archived timestamps, won't download anything
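
Putting a few of these together - the domain, directory and timestamps below are placeholders to adapt to your own target - a more selective download might look like this, restricting the fetch to snapshots from 2010, saving into a named directory and pulling twenty files at a time:

    wayback_machine_downloader http://example.com --directory D:\temp\example --from 20100101000000 --to 20101231235959 --concurrency 20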


Most of you will be aware of the Internet Archive, a US-based not-for-profit outfit that backs up internet websites for posterity - amongst other things. Part of this site is the Wayback Machine - a user-facing GUI that gives access to the various moments in time at which a site might have been archived. Obviously the 'net is vast, so not everything gets captured, and even when it is, there can be large gaps - months or even years - between snapshots.

For example, this url will take you to a snapshot of this site, showing that there were 35 captures between 2010 and 2022 - probably enough to restore most, if not all, of the site if the original domain had gone offline or been cancelled. Not important for a site like this, but for a political or historical blog, rather more so.

So the Wayback Machine lets you visit snapshots in time to review what a site might have looked like at a given moment - this url shows the precursors of this imcuk.net site as it was in 2001. But what if you want to restore the entire site, to recreate it or perhaps edit it? Well, the Internet Archive has some suggestions about how that can be done, possibly using wget, but archivarix.com greatly simplifies the process. It allows a site in the Internet Archive to be scraped and then packaged into a single zip file which can be downloaded. Even better, it offers a free CMS: one uploads the zip and the CMS to any Apache web server, where the website is reassembled for publication or private browsing. The CMS also allows one to modify the restored site as required.
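
If you do want to try the wget route, a minimal sketch looks like this - the timestamp and domain are placeholders for a real snapshot URL, and the usual caveat applies that the archive rewrites internal links, so the mirrored copy generally needs manual cleanup afterwards (the very chore Archivarix automates):

    wget --mirror --convert-links --page-requisites --no-parent "https://web.archive.org/web/20010401000000/http://example.com/"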

If the site has fewer than 200 items then all of this is free; above 200 items a fee is payable (by PayPal), but for less than 1 USD a site of many pages can be restored.

Highly recommended!
