News

The Internet Archive (part 2)

We've already spoken about tools for getting material down from the Archive, but not about how material gets added to it. The Archive (IA) appears to use some sort of scheduled, algorithmic spider routine to capture sites, but these captures often fail to grab the entire site, leaving holes.

It used to be possible to submit URLs to the IA to prompt it to spider a site, but this seems to have been discontinued, probably due to overuse.

It is still possible to use various browser add-ons to capture single pages, and these tend to work well.
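
For the curious, here is a minimal sketch of doing the same thing from Python, assuming the public web.archive.org/save/ endpoint that these add-ons appear to use; the save_page name and the Content-Location detail are illustrative, not taken from any particular add-on.

    import requests

    def save_page(url: str) -> str:
        """Ask the Wayback Machine to capture a single URL."""
        resp = requests.get(f"https://web.archive.org/save/{url}", timeout=60)
        resp.raise_for_status()
        # When present, Content-Location points at the new snapshot.
        snapshot = resp.headers.get("Content-Location", "")
        return f"https://web.archive.org{snapshot}" if snapshot else resp.url

    if __name__ == "__main__":
        print(save_page("https://example.com/"))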

Finally, there are several open-source Python and Ruby scripts that attempt to archive complete sites. These include:

These all work, but they also all return a few 429 TOO MANY REQUESTS: https://www.i*** errors, which seem to be caused by the IA being rather restrictive about how frequently data can be uploaded to it.
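
One way such a script could tolerate those 429s is to retry with exponential backoff rather than failing; this is a sketch of that idea, not code from any of the scripts above, and the starting delay is a guess rather than a documented limit.

    import time
    import requests

    def save_with_backoff(url: str, max_tries: int = 5) -> requests.Response:
        delay = 10  # seconds; the IA throttles aggressively, so start generously
        for attempt in range(max_tries):
            resp = requests.get(f"https://web.archive.org/save/{url}", timeout=60)
            if resp.status_code != 429:
                return resp
            # Honour Retry-After if the server sends one, otherwise back off.
            wait = int(resp.headers.get("Retry-After", delay))
            time.sleep(wait)
            delay *= 2
        raise RuntimeError(f"still rate-limited after {max_tries} tries: {url}")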

All of these can use the XML sitemap from the live site, and some can take a local file as well, like this: py submit_urls.py sitemap.xml
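
The sitemap-driven approach the scripts share boils down to something like the sketch below: pull every <loc> entry out of the sitemap (whether fetched from the live site or read from a local file) and submit each URL in turn. The pacing delay is an assumption, and the script is an illustration rather than any of the actual tools.

    import sys
    import time
    import xml.etree.ElementTree as ET
    import requests

    NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

    def load_sitemap(source: str) -> ET.Element:
        """Accept either a URL or a local path to sitemap.xml."""
        if source.startswith(("http://", "https://")):
            resp = requests.get(source, timeout=60)
            resp.raise_for_status()
            return ET.fromstring(resp.content)
        return ET.parse(source).getroot()

    def main(source: str) -> None:
        for loc in load_sitemap(source).findall(".//sm:loc", NS):
            url = (loc.text or "").strip()
            print("submitting", url)
            requests.get(f"https://web.archive.org/save/{url}", timeout=60)
            time.sleep(5)  # pace submissions to stay under the rate limit

    if __name__ == "__main__":
        main(sys.argv[1])  # e.g. py submit_urls.py sitemap.xml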

Also check out the IA page on this.
