This example demonstrates how to customize the type recognition and candidate selection steps in the Gatherer (see Section 4.5.4). This Gatherer recognizes World Wide Web home pages, and is configured only to collect indexing information from these home pages.
To run this example, type:
% cd $HARVEST_HOME/gatherers/example-3
% ./RunGatherer
To view the configuration file for this Gatherer, look at example-3.cf. As in Appendix C.2, this Gatherer has its own library directory that contains a customization for Essence. Since we're only interested in indexing home pages, we need only define the heuristics for recognizing home pages. As shown below, we can use URL naming heuristics to define a home page in lib/byurl.cf. We've also added a default Unknown type to this file to make candidate selection easier.
HomeHTML        ^http:.*/$
HomeHTML        ^http:.*[hH]ome\.html$
HomeHTML        ^http:.*[hH]ome[pP]age\.html$
HomeHTML        ^http:.*[wW]elcome\.html$
HomeHTML        ^http:.*/index\.html$
Unknown         ^.*$
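To see how these URL heuristics behave, the following sketch applies the same regular expressions in Python (a stand-in for Essence's internal matching; the classify function and the sample URLs are illustrative, not part of Harvest):

```python
import re

# The HomeHTML patterns from lib/byurl.cf; the first match wins.
patterns = [
    r"^http:.*/$",
    r"^http:.*[hH]ome\.html$",
    r"^http:.*[hH]ome[pP]age\.html$",
    r"^http:.*[wW]elcome\.html$",
    r"^http:.*/index\.html$",
]

def classify(url):
    """Return 'HomeHTML' if any pattern matches, else the default 'Unknown'."""
    for p in patterns:
        if re.match(p, url):
            return "HomeHTML"
    return "Unknown"

print(classify("http://www.example.com/"))           # HomeHTML
print(classify("http://www.example.com/Home.html"))  # HomeHTML
print(classify("http://www.example.com/paper.ps"))   # Unknown
```

A URL ending in a slash, or naming a Home/HomePage/Welcome/index HTML file, is typed HomeHTML; everything else falls through to the Unknown default.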
The lib/stoplist.cf configuration file contains a list of types not to index. In this example, Unknown is the only type name listed in stoplist.cf, so the Gatherer will reject only files of the Unknown type. You can also recognize URLs by their filename (in byname.cf) or by their content (in bycontent.cf and magic), although in this example we don't need those mechanisms. The default HomeHTML.sum summarizer summarizes each HomeHTML file.
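Since Unknown is the only type to reject, lib/stoplist.cf for this example need contain just one line:

```
Unknown
```

Together with the catch-all Unknown rule in byurl.cf, this has the effect of discarding every candidate that is not a home page.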
After the Gatherer has finished, it will start up the Gatherer daemon which will serve the content summaries. You'll notice that only content summaries for HomeHTML files are present. To view the content summaries, type:
% gather localhost 9333 | more
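The output is a stream of SOIF content summaries, one per HomeHTML object. A hypothetical (abbreviated) record might look roughly like the following; the exact attribute names and values depend on the summarizer and the page being indexed:

```
@FILE { http://www.example.com/
Type{8}:        HomeHTML
Title{17}:      Example Home Page
}
```

Each attribute name is followed by the byte count of its value in braces, per the SOIF syntax.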
We have a demo that uses a similar customization to collect structured indexing information from over 20,000 Home Pages around the Web.