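To get the stripped core content into Drupal, I wrote a PHP script that uses the Drupal API. It reads a list of filenames from standard input and is run through drush's scr (php-script) command, which bootstraps Drupal before executing the script: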
drush -r $D7 -l www.iac.org scr inhale.php
Here's the script itself:
<?php
//
// DJM, 2012-02-08
//
// PHP script that takes a list of filenames as input, and converts those files
// into Legacy nodes on the www.iac.org web site
//
// The first line of each file is assumed to be the contents of the HTML <title>
// element, and is stored in the node's title field.
// The remaining lines are stored in the node's body.
//
// The filename itself is the legacy URL with pound signs (#) in place of slashes,
// and at signs (@) in place of spaces. This script restores the slashes and spaces,
// and stores the result in the field_old_url field.
//
// The field_status field is always set to 'Not Started'
$home_dir = "/home/djmolny/legacy-import/";
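while (($filename = readline("")) !== FALSE) { // strict test: only end of input stops the loop
if ($filename === '') { continue; } // skip blank input lines
print 'filename="' . $filename . '"' . "\n";
$f = fopen($home_dir . $filename, 'r');
if ($f === FALSE) { fwrite(STDERR, "Cannot open " . $home_dir . $filename . "\n"); exit(1); }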
$title = trim(fgets($f));
print 'title="' . $title . "\"\n";
$body = fread($f, 1024*1024); // 1MB limit is arbitrary, but should suffice
fclose($f); // release the file handle before moving on to the next file
$old_url = str_replace("#", "/", str_replace("@", " ", $filename));
$old_url = str_replace("public//", "http://www.iac.org/", $old_url);
$old_url = str_replace("members//", "http://members.iac.org/", $old_url);
print 'old_url="' . $old_url . "\"\n";
$node = new stdClass();
$node->type = 'legacy_page';
node_object_prepare($node);
$node->title = $title;
$node->language = LANGUAGE_NONE;
$node->body[$node->language][0]['value'] = $body;
$node->body[$node->language][0]['summary'] = text_summary($body);
$node->body[$node->language][0]['format'] = 'full_html';
$node->field_old_url[$node->language][0]['value'] = $old_url;
$node->field_status[$node->language][0]['value'] = 'Not Started';
node_save($node);
print("Done!\n\n\n"); // !!!
}
?>
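The filename-to-URL decoding can be tried on its own, outside Drupal. The following is a minimal sketch that wraps the same three replacements in a function. The function name legacy_filename_to_url and the sample filenames are made up for illustration and are not part of the original import.

```php
<?php
// Hypothetical helper: the same str_replace() steps as the import script,
// wrapped in a function so they can be checked without a Drupal bootstrap.
function legacy_filename_to_url($filename) {
  // Undo the filename encoding: '@' stood in for a space, '#' for a slash.
  $url = str_replace('#', '/', str_replace('@', ' ', $filename));
  // Map the "public//" and "members//" markers to their site's base URL.
  $url = str_replace('public//', 'http://www.iac.org/', $url);
  $url = str_replace('members//', 'http://members.iac.org/', $url);
  return $url;
}

// Sample filenames are invented for this example.
print legacy_filename_to_url('public##programs#index.html') . "\n";
print legacy_filename_to_url('members##judges@school.html') . "\n";
```

Under these assumptions the first call prints http://www.iac.org/programs/index.html and the second prints http://members.iac.org/judges school.html (with the space restored).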
Note: Drupal rejected numerous files because symbols such as "½ loop", "360º roll", or "Fédération" were stored as bytes that are not valid UTF-8. The characters themselves exist in Unicode, so the files were most likely saved in a legacy encoding such as ISO-8859-1 or Windows-1252. I edited each of these files manually, replacing the symbols with their HTML entities (&frac12;, &deg;, and &eacute;, respectively). I thought about scripting this process, but since the import is a one-time exercise I decided it wasn't worth the effort. However, I'm documenting the problem in case it crops up again down the road.
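Had the cleanup been scripted, it might have looked something like the sketch below. It rests on several assumptions: the rejected files are taken to be Windows-1252 (the encoding is not recorded anywhere above), PHP's mbstring extension is available, and numeric entities are acceptable in place of the named ones used in the manual edits. The function name legacy_text_to_entities is hypothetical.

```php
<?php
// Hypothetical helper, not part of the original import script.
// ASSUMPTION: any input that is not valid UTF-8 is Windows-1252 text.
function legacy_text_to_entities($raw, $assumed_encoding = 'Windows-1252') {
  // Leave valid UTF-8 alone; otherwise transcode from the assumed encoding.
  $utf8 = mb_check_encoding($raw, 'UTF-8')
    ? $raw
    : mb_convert_encoding($raw, 'UTF-8', $assumed_encoding);
  // Replace every non-ASCII code point with a numeric HTML entity.
  // ASCII markup such as <, > and & is left untouched.
  $map = array(0x80, 0x10FFFF, 0, 0x1FFFFF);
  return mb_encode_numericentity($utf8, $map, 'UTF-8');
}

// Byte 0xBD is the one-half sign and 0xE9 is e-acute in Windows-1252.
print legacy_text_to_entities("\xBD loop") . "\n";
print legacy_text_to_entities("F\xE9d\xE9ration") . "\n";
```

With these inputs the sketch prints &#189; loop and F&#233;d&#233;ration. The result could then be assigned to $body before the node is saved. A file in some other legacy encoding would need a different $assumed_encoding value.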
