The web provides instant access to almost any information, but very little thought is put into preserving this information for future generations. Old blogs, news articles and web pages are usually purged when they are no longer needed or relevant. The Internet Archive has recognized that this information has historical significance, and aims to save snapshots of the web in a digital library that is open to researchers, historians, scholars, and the general public. The Internet Archive supports a number of projects, including the Wayback Machine, which holds over 150 billion web pages archived since 1996. Its About page explains the mission:
The Internet Archive is working to prevent the Internet – a new medium with major historical significance – and other “born-digital” materials from disappearing into the past. Collaborating with institutions including the Library of Congress and the Smithsonian, we are working to preserve a record for generations to come.
Over time there have been various ways to add to this data, such as an FTP-based system for uploading content along with XML descriptors. There has also been an SSH mechanism to access their servers manually.
Now, there is also an API that is modeled on the Amazon S3 cloud storage API. In a fairly terse document you can see how to access their servers (the service relies on their own API keys). And because the Internet Archive API is based on the Amazon S3 API, the Amazon developer site is also a useful information source.
As they note, the API maps the Archive's concepts onto S3 as follows:
- Items (things with details pages) get mapped to S3 Buckets.
– ie: http://archive.org/details/stats is also available as:
or, per s3 dns bucket style:
– Files within items are also available as S3 keys, ex:
- Doing a PUT on the S3 endpoint will result in a new internet archive Item
- Files may also be uploaded to an Item in the same way keys are added, via S3 PUT.
– When a file is added to an Item, it is staged in temporary storage and ingested via the Archive’s content management system. This can take some time.
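The item-to-bucket mapping described in the list above can be sketched in Python. This is a minimal illustration, not the Archive's own code: the host `s3.example.org` is a placeholder (take the real endpoint from the Archive's documentation), and the function names `path_style_url` and `dns_style_url` are hypothetical.

```python
from urllib.parse import quote

# Placeholder host (an assumption): substitute the S3-style endpoint
# given in the Internet Archive's documentation.
ENDPOINT = "s3.example.org"


def path_style_url(item, key=None, endpoint=ENDPOINT):
    """Build a path-style URL: the item acts as the bucket, the file as the key."""
    url = f"http://{endpoint}/{quote(item)}"
    if key is not None:
        # quote() percent-encodes spaces and other unsafe characters
        # but leaves "/" alone, so nested key names stay readable.
        url += "/" + quote(key)
    return url


def dns_style_url(item, key=None, endpoint=ENDPOINT):
    """Build a DNS-style URL: the item (bucket) becomes a subdomain."""
    url = f"http://{quote(item)}.{endpoint}/"
    if key is not None:
        url += quote(key)
    return url
```

For the `stats` item mentioned above, `path_style_url("stats")` puts the item name in the path, while `dns_style_url("stats")` puts it in the host name. Passing a file name as `key` addresses a single file within the item.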
They also provide some examples to show you command-line usage of the API:
Text item (a PDF will be OCR’d):
curl --location --header 'x-amz-auto-make-bucket:1' \
--header 'x-archive-meta01-collection:opensource' \
--header 'x-archive-meta-mediatype:texts' \
--header 'x-archive-meta-sponsor:Andrew W. Mellon Foundation' \
--header 'x-archive-meta-language:eng' \
--header "authorization: LOW $accesskey:$secret" \
--upload-file /home/samuel/public_html/intro-to-k.pdf \
Their API does not support all of the functionality of the S3 API, so read their docs to get the details.
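For readers who prefer a script to the command line, the curl upload above might be approximated in Python roughly as follows. This is a sketch under stated assumptions: `build_upload_request` is a hypothetical helper, the destination URL must be supplied by the caller, and the function only builds the request without sending it. The header names are copied from the curl example.

```python
import urllib.request


def build_upload_request(url, data, access_key, secret, meta_headers):
    """Build (but do not send) an S3-style PUT request for one file.

    url          -- destination URL for the file (supplied by the caller)
    data         -- file contents as bytes
    access_key   -- Internet Archive API access key
    secret       -- Internet Archive API secret
    meta_headers -- dict of full metadata header names to values,
                    e.g. {"x-archive-meta-mediatype": "texts"}
    """
    headers = {
        # Ask the service to create the item (bucket) if it does not exist.
        "x-amz-auto-make-bucket": "1",
        # Same authorization scheme as the curl example: LOW key:secret.
        "authorization": f"LOW {access_key}:{secret}",
    }
    headers.update(meta_headers)
    return urllib.request.Request(url, data=data, headers=headers, method="PUT")
```

The returned object can be inspected before anything goes over the network, and then passed to `urllib.request.urlopen` to perform the upload. Note that `urllib` normalizes header names with `str.capitalize()`, so `authorization` is stored as `Authorization`.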
As with other providers that have adopted (or declined to adopt) the syntax of another vendor's API, this is a new example of how building on the design of a popular existing API, and on the developer knowledge that has grown up around it, can help other APIs as well.