
4.3.5 Gatherer enumeration vs. candidate selection

In addition to using the URL-Filter and Host-Filter files for the RootNode specification mechanism described in Section 4.3, you can prevent documents from being indexed by customizing the stoplist.cf file, described in Section 4.5.4. Because these mechanisms are invoked at different times, they have different effects. The URL-Filter and Host-Filter mechanisms are applied by the Gatherer's ``RootNode'' enumeration programs. Using these filters as stop lists prevents unwanted objects from being retrieved across the network, which can dramatically reduce gathering time and network traffic.
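For example, the Host-Filter file can keep the enumerator from wandering outside your own domain. Below is a minimal sketch, assuming the Host-Filter uses the same Allow/Deny regular-expression syntax as the URL-Filter example later in this section (the domain name is purely illustrative):

        Allow   .*\.example\.com
        Deny    .*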

The stoplist.cf file is used by the Essence content extraction system (described in Section 4.5) after the objects are retrieved, to select which objects should have their contents extracted and indexed. This can be useful because Essence provides a more powerful means of rejecting indexing candidates: you can customize based not only on file naming conventions but also on file contents (e.g., looking at strings at the beginning of a file or at UNIX ``magic'' numbers), and on more sophisticated file-grouping schemes (e.g., deciding not to extract contents from object code files for which source code is available).
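As a rough illustration, stoplist.cf lists one Essence type name per line, and any object classified as one of those types is skipped during content extraction. The sketch below assumes typical type names; the exact names available depend on your Essence configuration:

        HTML
        Audio
        Binary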

As an example of combining these mechanisms, suppose you want to index the ``.ps'' files linked into your WWW site. The HTML files must still be retrieved so that the enumerator can follow the links they contain, but they should not themselves be indexed. You could do this by having a stoplist.cf file that contains ``HTML'', and a RootNode URL-Filter that contains:

        Allow \.html
        Allow \.ps
        Deny .*

As a final note, independent of these customizations, the Gatherer attempts to avoid retrieving objects where possible by using a local disk cache of objects and by using the HTTP ``If-Modified-Since'' request header. The local disk cache is described in Section 4.7.5.
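The ``If-Modified-Since'' mechanism works roughly as follows: when the Gatherer already holds a copy of an object, it can send the copy's timestamp with the request, and an HTTP server that finds the object unchanged replies ``304 Not Modified'' with no body, so the object is not transferred again. A sketch of such an exchange (the URL and date are illustrative):

        GET /reports/tr-95-06.ps HTTP/1.0
        If-Modified-Since: Wed, 10 Jan 1996 04:15:00 GMT

        HTTP/1.0 304 Not Modified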


