The Gatherer maintains a local disk cache of the files it gathers to reduce network traffic when restarting aborted gathering attempts. However, since the remote server must still be contacted whenever the Gatherer runs, please do not set your cron job to run the Gatherer frequently. A typical value might be weekly or monthly, depending on how congested the network is and how important it is to have the most current data.
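For example, a crontab entry like the following (the Gatherer directory shown is illustrative) runs the Gatherer once a week:

% crontab -l
# Run the Gatherer every Sunday at 2:00 a.m.
0 2 * * 0 cd /usr/local/harvest/gatherers/example && ./RunGatherer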
By default, the Gatherer's local disk cache is deleted after each successful completion. To save the local disk cache between Gatherer sessions, define Keep-Cache: yes in your Gatherer configuration file (Section 4.7).
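For reference, a minimal sketch of the relevant lines in a Gatherer configuration file (the name and directory shown are illustrative; see Section 4.7 for the full set of variables):

Gatherer-Name: Example Gatherer
Top-Directory: /usr/local/harvest/gatherers/example
Keep-Cache:    yes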
If you want your Broker's index to reflect new data, then you must run the Gatherer and then have the Broker perform a collection. By default, a Broker performs collections once a day. If you want the Broker to collect data as soon as it is gathered, you will need to coordinate the timing of the Gatherer's completion with the Broker's collections.
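For example, if your Broker performs its daily collection at 4:00 a.m. (an assumption; adjust to your site's configuration), you might start a weekly Gatherer run a few hours earlier, so that the next collection picks up freshly gathered data:

% crontab -l
# Assumes the Broker collects daily at 4:00 a.m.; the path is illustrative.
0 1 * * 0 cd /usr/local/harvest/gatherers/example && ./RunGatherer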
If you run your Gatherer frequently and you use Keep-Cache: yes in your Gatherer configuration file, then the Gatherer's local disk cache may interfere with retrieving updates. By default, objects in the local disk cache expire after 7 days; however, you can expire objects more quickly by setting the GATHERER_CACHE_TTL environment variable to the desired Time-To-Live (TTL) in seconds before you run the Gatherer, or you can modify your RunGatherer script to remove the Gatherer's tmp directory after each run. For example, to expire objects in the local disk cache after one day:
% setenv GATHERER_CACHE_TTL 86400    # one day
% ./RunGatherer
One final note: the Gatherer's local disk cache size defaults to 32 MB, but you can change this value by setting the HARVEST_MAX_LOCAL_CACHE environment variable to the desired number of bytes before you run the Gatherer. For example, to set a maximum cache size of 10 MB:
% setenv HARVEST_MAX_LOCAL_CACHE 10485760    # 10 MB
% ./RunGatherer
If you have access to the software that creates the files that you are indexing (e.g., if all updates are funneled through a particular editor, update script, or system call), you can modify this software to schedule real-time Gatherer updates whenever a file is created or updated. For example, if all users update the files being indexed using a particular program, this program could be modified to run the Gatherer upon completion of the user's update.
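A minimal wrapper script along these lines (the update program and paths are hypothetical) would refresh the index immediately after each update:

#!/bin/sh
# Hypothetical wrapper: run the site-specific update program, then
# re-run the Gatherer so the new data is summarized right away.
/usr/local/bin/update-files "$@" || exit 1
cd /usr/local/harvest/gatherers/example && exec ./RunGatherer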
Note that, when used in conjunction with cron, the Gatherer provides a more powerful data ``mirroring'' facility than the often-used mirror package. In particular, you can use the Gatherer to replicate the contents of one or more sites, retrieve data in multiple formats via multiple protocols (FTP, HTTP, etc.), optionally perform a variety of type- or site-specific transformations on the data, and serve the results very efficiently as compressed SOIF object summary streams to other sites that wish to use the data for building indexes or for other purposes.