Next: 4.7.3 Gathering from password-protected Up: 4.7 Gatherer administration Previous: 4.7.1 Setting variables in

4.7.2 Local file system gathering for reduced CPU load

       

Although the Gatherer's workload is specified using URLs, the files being gathered are often located on a local file system. In this case it is much more efficient to gather directly from the local file system than via FTP/Gopher/HTTP/News, primarily because of all the UNIX forking required to gather information via these network processes. For example, our measurements indicate that gathering via FTP causes 4 to 7 times more CPU load than gathering directly from the local file system. For large collections (e.g., archive sites containing many thousands of files), the CPU savings can be considerable.

Starting with Harvest Version 1.1, it is possible to tell the Gatherer how to translate URLs to local file system names, using the Local-Mapping Gatherer configuration file variable (see Section 4.7.1). The syntax is:

        Local-Mapping: URL_prefix local_path_prefix

This causes all URLs starting with URL_prefix to be translated to files starting with the prefix local_path_prefix while gathering, but to be left as URLs in the results of queries (so the objects can be retrieved as usual). Regular expressions are not supported here: the prefix is matched literally. As an example, the specification

        Local-Mapping: http://harvest.cs.colorado.edu/~hardy/ /homes/hardy/public_html/
        Local-Mapping: ftp://ftp.cs.colorado.edu/pub/cs/ /cs/ftp/

would cause the URL http://harvest.cs.colorado.edu/~hardy/Home.html to be translated to the local file name /homes/hardy/public_html/Home.html, while the URL ftp://ftp.cs.colorado.edu/pub/cs/techreports/schwartz/Harvest.Conf.ps.Z would be translated to the local file name /cs/ftp/techreports/schwartz/Harvest.Conf.ps.Z.
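The translation described above amounts to a literal prefix substitution. The following Python sketch illustrates the idea only; it is not the Gatherer's actual implementation. The names `LOCAL_MAPPINGS` and `translate_url` are hypothetical, and the choice to use the first matching prefix is an assumption, since the text does not say how overlapping prefixes are resolved.

```python
from typing import List, Optional, Tuple

# (URL_prefix, local_path_prefix) pairs copied from the example above.
LOCAL_MAPPINGS: List[Tuple[str, str]] = [
    ("http://harvest.cs.colorado.edu/~hardy/", "/homes/hardy/public_html/"),
    ("ftp://ftp.cs.colorado.edu/pub/cs/", "/cs/ftp/"),
]


def translate_url(url: str,
                  mappings: List[Tuple[str, str]] = LOCAL_MAPPINGS) -> Optional[str]:
    """Return the local file name for url, or None if no prefix matches."""
    for url_prefix, local_prefix in mappings:
        # Literal string comparison: no regular expressions, no URL unescaping.
        if url.startswith(url_prefix):
            # Assumption: the first matching prefix wins.
            return local_prefix + url[len(url_prefix):]
    return None
```

A URL that matches no prefix yields `None` in this sketch, which corresponds to gathering it over the network as usual.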

Local gathering will work over NFS file systems. A local mapping will fail if the local file cannot be opened for reading, is not a regular file, or has any execute bits set. Consequently, for directories, symbolic links, and CGI scripts, the HTTP server is always contacted rather than the local file system interface. Lastly, the Gatherer does not perform any URL syntax translations for local mappings, so if your URL contains characters that should be escaped (as in RFC 1738 [3]), the local mapping will fail. Starting with version 1.4 patchlevel 2, Essence prints [L] after URLs that were successfully accessed locally.
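The three failure conditions listed above can be sketched as a single check. This is an illustration under stated assumptions, not the Gatherer's own code: the function name `local_file_usable` is hypothetical, and the use of `os.lstat` (which does not follow symbolic links) is an assumption made so that symbolic links are rejected, as the text says they are.

```python
import os
import stat


def local_file_usable(path: str) -> bool:
    """Return True if path passes the three local-mapping checks."""
    try:
        # Assumption: lstat is used, so a symbolic link is never "regular".
        st = os.lstat(path)
    except OSError:
        return False  # missing or inaccessible file
    if not stat.S_ISREG(st.st_mode):
        return False  # directory, symbolic link, device, etc.
    if st.st_mode & (stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH):
        return False  # any execute bit set, e.g. a CGI script
    # Finally, the file must be readable by the current process.
    return os.access(path, os.R_OK)
```

When a check like this fails, the object would be fetched from the network server in the normal way rather than skipped.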

Note that if your network is highly congested, it may actually be faster to gather via HTTP/FTP/Gopher than via NFS, because NFS becomes very inefficient in highly congested situations. Even better would be to run local Gatherers on the hosts where the disks reside, so that the files are accessed directly via the local file system.






Duane Wessels
Wed Jan 31 23:46:21 PST 1996