By Nick Baker
February 2005
School of Information
University of Michigan
To archive a website and make it accessible, you must do the following:
Crawl the Site
Process the Results
This document concentrates on the steps that aren't covered in the Internet Archive's manual on Heritrix.
Use the Heritrix Crawler to gather the site files and store them locally. Because Heritrix is written in Java, it is theoretically platform independent. However, there are some peculiarities to getting started on different systems that are covered below.
Once you've gotten Heritrix running, follow the directions in the Heritrix User Manual to perform a crawl.
After you've successfully run a crawl you'll have a bunch of ARC files. Heritrix bundles all of the images, html, etc. it finds into large binary files, complete with the relevant metadata. Together with your crawl logs, these files represent all you need to archive a website.
Here is an example ARC file. This one is very small (~500 KB), but they often range up to 100 MB.
To get a quick look at what's in an ARC file, you can run this script: arc_extractor (Change ".txt" to ".pl" to run it as a Perl script.) It will un-bundle the ARC file and spit all the files into a directory with somewhat arbitrary names. It's not the most useful script, but it should give you an idea of the basic principles involved in reading ARC files.
Note: if the ARC files end in ".gz", you may have to decompress them using gunzip before running scripts to process them.
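As an alternative to running gunzip by hand, the decompression step can be scripted. The following is a minimal sketch in Python (not one of the project's Perl scripts); the function name `decompress_arc` is hypothetical, and it assumes the compressed file simply carries an extra ".gz" suffix.

```python
import gzip
import shutil
from pathlib import Path


def decompress_arc(gz_path):
    """Decompress a gzipped ARC file next to the original and return the new path.

    Hypothetical helper: files that do not end in ".gz" are returned unchanged.
    """
    gz_path = Path(gz_path)
    if gz_path.suffix != ".gz":
        return gz_path  # already uncompressed; nothing to do
    out_path = gz_path.with_suffix("")  # "crawl.arc.gz" -> "crawl.arc"
    with gzip.open(gz_path, "rb") as src, open(out_path, "wb") as dst:
        shutil.copyfileobj(src, dst)  # stream in chunks; ARC files can be large
    return out_path
```

The file is streamed rather than read in one piece because ARC files can reach roughly 100 MB.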
For this demonstration, we used three Perl scripts to process the ARC files we got from several crawls of the umich.edu domain.
arc_optimizer reads through the ARC files and retrieves new and changed content. This is important because multiple crawls of the same site may contain large amounts of duplication, so it is much more efficient to store only the changed content. The script creates files with the ".arco" extension, meaning that they are optimized ARC files. While we keep offline copies of all the ARC files generated, we store the smaller ARCO files online for mirroring purposes.
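The idea of keeping only new and changed content can be sketched as follows. This is an illustration in Python, not the arc_optimizer script itself: it assumes records arrive in crawl order as (uri, datetime, body) tuples and that "changed" means the body differs from the last stored version of the same uri. The function name and the use of SHA-1 digests are assumptions.

```python
import hashlib


def select_new_or_changed(records, last_digest=None):
    """Yield only records whose body differs from the last stored body for that uri.

    `records` is an iterable of (uri, datetime, body_bytes) in crawl order.
    `last_digest` maps uri -> digest of the most recently stored body (assumed
    bookkeeping; the real script may track changes differently).
    """
    last_digest = {} if last_digest is None else last_digest
    for uri, datetime, body in records:
        digest = hashlib.sha1(body).hexdigest()
        if last_digest.get(uri) == digest:
            continue  # unchanged since the previous crawl; skip the duplicate
        last_digest[uri] = digest
        yield uri, datetime, body
```

Comparing against the most recent version, rather than every version ever seen, keeps a page that changes and later changes back, so the stored sequence still reflects what each crawl saw.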
arco_indexer creates an external tab-delimited index of the ARCO files, including such metadata as uri, datetime, and the position of the file within the ARCO file. It could easily be modified to index the ARC files themselves if so desired.
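A tab-delimited index of this kind might be written as in the sketch below. It is a Python illustration rather than the arco_indexer script; the column order and the dictionary keys are assumptions chosen for the example.

```python
import csv
import io


def write_index(entries, out):
    """Write one tab-delimited line per archived item to the file-like object `out`.

    Each entry is a dict; the keys and column order are assumptions for this sketch.
    """
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    for e in entries:
        writer.writerow([
            e["uri"],       # the uri that was crawled
            e["date"],      # datetime of the crawl
            e["content"],   # content-type reported by the server
            e["response"],  # HTTP response code
            e["start"],     # byte offset of the item within the ARCO file
            e["length"],    # length of the item in bytes
            e["arc"],       # filename of the ARCO file
        ])
```

Storing the byte offset and length in the index is what lets a later step fetch a single item without scanning the whole ARCO file.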
Once you have produced ARCO files and inserted the information into a database, you can mirror the site.
This project uses PHP and MySQL to mirror the information from our crawls.
Note: We were not able to mirror all of the information we gathered. The crawls produced ~300GB of information, while our server space was limited to ~30GB of files. Moreover, the metadata required would make queries unacceptably slow if it were all included in the MySQL tables. Therefore, we have arbitrarily chosen ~15GB worth of files comprising 503,216 unique items from 5 crawls to mirror for demonstration purposes.
The MySQL table is structured as follows:
mysql> describe archives;
+----------+-------------+------+-----+---------+----------------+
| Field    | Type        | Null | Key | Default | Extra          |
+----------+-------------+------+-----+---------+----------------+
| id       | int(16)     |      | PRI | NULL    | auto_increment |
| uri      | text        | YES  |     | NULL    |                |
| date     | varchar(14) | YES  |     | NULL    |                |
| content  | varchar(32) | YES  |     | NULL    |                |
| start    | varchar(32) | YES  |     | NULL    |                |
| length   | varchar(32) | YES  |     | NULL    |                |
| arc      | varchar(64) | YES  |     | NULL    |                |
| response | char(3)     | YES  |     | NULL    |                |
+----------+-------------+------+-----+---------+----------------+
9 rows in set (0.00 sec)
The fields in the table hold everything needed to extract an item from an ARCO file and present it to the user.
| Group | Field | Description |
| unique id | id | assigned by the database |
| identifiers | uri | the uri that was crawled |
| identifiers | date | the 14-digit datestamp |
| http info | content | the mime-type or content-type of the document |
| http info | response | the response code from the server |
| arco file info | start | the starting position in the arco file |
| arco file info | length | the length in bytes of the file |
| arco file info | arc | the filename of the arco file |
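The 14-digit datestamp follows the pattern YYYYMMDDhhmmss. As a small illustration, here is how such a value could be turned into a date object in Python. The helper name `undatestamp` is hypothetical, and this is only a sketch of the conversion, not the project's PHP code.

```python
from datetime import datetime


def undatestamp(stamp):
    """Parse a 14-digit datestamp (YYYYMMDDhhmmss) into a datetime object."""
    return datetime.strptime(stamp, "%Y%m%d%H%M%S")
```

For example, `undatestamp("20050214093000").isoformat()` gives `"2005-02-14T09:30:00"`.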
For non-text files, such as JPEG images, the PHP script reads the binary information from the ARCO file, sends a header with the proper Content-type from the table, and returns the information to the web browser. Here is a JPEG Image from the archive.
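The retrieval step described above amounts to seeking to a stored offset and reading a fixed number of bytes. The sketch below shows that idea in Python rather than the project's PHP. The function names and the `archive_dir` parameter are assumptions, and the row is represented as a plain dict keyed by the table's field names.

```python
import os


def read_record(arco_path, start, length):
    """Return `length` bytes beginning at byte offset `start` of an ARCO file."""
    with open(arco_path, "rb") as f:
        f.seek(int(start))   # start and length are stored as varchar, so convert
        return f.read(int(length))


def build_response(row, archive_dir="."):
    """Return (headers, body) for one table row; archive_dir is a hypothetical setting."""
    body = read_record(os.path.join(archive_dir, row["arc"]), row["start"], row["length"])
    headers = [
        ("Content-Type", row["content"]),      # mime-type recorded during the crawl
        ("Content-Length", str(len(body))),
    ]
    return headers, body
```

Because only the requested slice is read, the cost of serving an item does not depend on the size of the ARCO file that contains it.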
For HTML, the script makes some modifications and additions to the page. Here is an HTML Page from the archive. Notice the green "archive" tab in the upper right corner.
Here are excerpts from the PHP, with a description of what each portion does. (You can also view the complete PHP source.)
Header information with a <base> tag pointing at the original uri, and links to our stylesheet and javascript files.
// Define new header info
$newHead = '<base href="' . $myrow["uri"] . '">
<link href="http://www.si.umich.edu/mirror/archive/styles.css" rel="stylesheet" type="text/css">
<script language="JavaScript" src="http://www.si.umich.edu/mirror/archive/toolbox.js"></script>';

// Add new header info after <head> tag or <html> tag
if(preg_match("/<head[^>]*>/i", $file)) {
    $file = preg_replace("/(<head[^>]*>)/i", "$1 ".$newHead, $file);
} else {
    $file = preg_replace("/(<html[^>]*>)/i", "$1 <head> ".$newHead."</head>", $file);
}
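For readers who want to experiment with the header insertion outside PHP, the same logic can be sketched in Python. This is an illustrative equivalent, not part of the project; the function name `inject_head` is hypothetical. It departs from the PHP in two small ways, both noted in the comments.

```python
import re

HEAD_TAG = re.compile(r"(<head[^>]*>)", re.IGNORECASE)
HTML_TAG = re.compile(r"(<html[^>]*>)", re.IGNORECASE)


def inject_head(html, new_head):
    """Insert new_head after the <head> tag, or create a <head> after <html>."""
    # A function replacement is used so that backslashes or group references
    # inside new_head (for example in a uri) are inserted literally.
    # count=1 touches only the first match; the PHP excerpt replaces every match.
    if HEAD_TAG.search(html):
        return HEAD_TAG.sub(lambda m: m.group(1) + " " + new_head, html, count=1)
    return HTML_TAG.sub(
        lambda m: m.group(1) + " <head> " + new_head + "</head>", html, count=1
    )
```

Like the PHP pattern, `<head[^>]*>` would also match a tag such as `<header>`, so a stricter pattern may be worth considering for newer pages.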
Layers at the beginning of the body with information about the page and other versions available.
// Define new body info
$newBody = '<div class="archiveButton"><a href="javascript:archiveLayerSwitch(\'archiveLayer\')"
class="archiveLink">archive</a></div>
<div id="archiveLayer" class="archiveHidden">
File archived on ' . UnDateStamp($myrow["date"]) . '. Content may be protected by copyright.
View the live page <a href="' . $myrow["uri"] . '">here</a>.' . $dateLinks . '</div>';

// Add new body info after <body> tag
$file = preg_replace("/(<body[^>]*>)/i", "$1 ".$newBody, $file);
JavaScript at the end to rewrite the links, srcs, and embeds so that they point to the archive rather than to the live site.
$baseScript = "http://www.si.umich.edu" . $PHP_SELF;
$redirectScript = '<script language="Javascript">
<!--
var archivePHP = "' . $baseScript . '?date=' . $myrow["date"] . '&uri=";
var baseScript = "' . $baseScript . '";
function rewriteURL(itemArray, itemAttribute) {
    var i = 0;
    for(i = 0; i < itemArray.length; i++) {
        if (itemArray[i][itemAttribute].indexOf("mailto:") == -1 &&
            itemArray[i][itemAttribute].indexOf(archivePHP) == -1 &&
            itemArray[i][itemAttribute].indexOf(baseScript) == -1 &&
            itemArray[i][itemAttribute] != "'.$myrow["uri"] . '" &&
            itemArray[i][itemAttribute].indexOf("javascript:") == -1) {
            var newLink = itemArray[i][itemAttribute];
            newLink = newLink.replace(/\&/g, "%26");
            newLink = newLink.replace(/=/g, "%3D");
            itemArray[i][itemAttribute] = archivePHP + newLink;
        }
    }
}
if (document.links) rewriteURL(document.links, "href");
if (document.frames) rewriteURL(document.frames, "src");
if (document.images) rewriteURL(document.images, "src");
if (document.embeds) rewriteURL(document.embeds, "src");
if (document.body && document.body.background)
    document.body.background = archivePHP + document.body.background;
//-->
</script>';

// Add script at </body> or </html> tag
if(preg_match("/<\/body[^>]*>/i", $file)) {
    $file = preg_replace("/(<\/body[^>]*>)/i", "$1 ".$redirectScript, $file);
} else {
    $file = preg_replace("/(<\/html[^>]*>)/i", $redirectScript . " $1", $file);
}
This is similar to the way the Internet Archive's Wayback Machine works, but there is more functionality within the page. If everything works correctly, this should allow users to surf through the archived site as though it were still live. JavaScript and Flash can wreak havoc with the system, but well-designed and accessible sites shouldn't have these problems.
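The rule that the rewriting JavaScript applies to each link can be stated compactly. The sketch below restates it in Python for clarity. The function name is hypothetical, and the base script address used in the usage note is a placeholder, since the real script path is not given in the excerpts.

```python
def archive_url(link, base_script, date, page_uri):
    """Return the archive address for `link`, or `link` unchanged if it is exempt.

    Mirrors the conditions in the JavaScript excerpt: mailto: and javascript:
    links, links that already point at the archive script, and links equal to
    the current page's uri are left alone.
    """
    archive_prefix = f"{base_script}?date={date}&uri="
    exempt = (
        "mailto:" in link
        or "javascript:" in link
        or base_script in link   # also true for links that contain archive_prefix
        or link == page_uri
    )
    if exempt:
        return link
    # Only "&" and "=" are escaped, as in the excerpt, so that the original
    # query string is not confused with the archive script's own parameters.
    return archive_prefix + link.replace("&", "%26").replace("=", "%3D")
```

With the placeholder base `http://example.org/archive.php` and the date `20050201120000`, the link `http://www.umich.edu/a?x=1&y=2` becomes `http://example.org/archive.php?date=20050201120000&uri=http://www.umich.edu/a?x%3D1%26y%3D2`. Applying the function to an already rewritten link returns it unchanged.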
Once all the information is in a table, it is easy to access it through various queries. Our entry page organizes the files by domain.
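Organizing items by domain can be illustrated with a short Python sketch. The project does this through MySQL queries and PHP, so the function below is only a stand-in showing the grouping idea; its name is hypothetical and it takes a plain list of uris rather than database rows.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def group_by_domain(uris):
    """Group uris by host name, keeping the original order within each group."""
    groups = defaultdict(list)
    for uri in uris:
        host = urlsplit(uri).hostname or ""  # empty key for uris without a host
        groups[host].append(uri)
    return dict(groups)
```

Given uris from www.umich.edu and www.si.umich.edu, this yields one list per host, which is the shape an entry page organized by domain would need.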