This example is a simple Gatherer that uses the default customizations. The only work that the user does to configure this Gatherer is to specify the list of URLs from which to gather (see Section 4).
To run this example, type:
% cd $HARVEST_HOME/gatherers/example-1
% ./RunGatherer
To view the configuration file for this Gatherer, look at example-1.cf. The first few lines are variables that specify some local information about the Gatherer (see Section 4.7.1). For example, each content summary will contain the name of the Gatherer (Gatherer-Name) that generated it. The other variables specify the port number (Gatherer-Port) that will be used to export the indexing information and the directory that contains the Gatherer (Top-Directory). Notice that there is one RootNode URL and one LeafNode URL.
After the Gatherer has finished, it will start up the Gatherer daemon, which will export the content summaries. To view the content summaries, type:
% gather localhost 9111 | more
The following SOIF object should look similar to those that this Gatherer generates.
@FILE { http://harvest.cs.colorado.edu/~schwartz/IRTF.html
Time-to-Live{7}: 9676800
Last-Modification-Time{1}: 0
Refresh-Rate{7}: 2419200
Gatherer-Name{25}: Example Gatherer Number 1
Gatherer-Host{22}: powell.cs.colorado.edu
Gatherer-Version{3}: 0.4
Update-Time{9}: 781478043
Type{4}: HTML
File-Size{4}: 2099
MD5{32}: c2fa35fd44a47634f39086652e879170
Partial-Text{151}: research problems Mic Bowman Peter Danzig Udi Manber Michael Schwartz Darren Hardy talk talk Harvest talk Advanced Research Projects Agency
URL-References{628}: ftp://ftp.cs.colorado.edu/pub/cs/techreports/schwartz/RD.ResearchProblems.Jour.ps.Z
ftp://grand.central.org/afs/transarc.com/public/mic/html/Bio.html
http://excalibur.usc.edu/people/danzig.html
http://glimpse.cs.arizona.edu:1994/udi.html
http://harvest.cs.colorado.edu/~schwartz/Home.html
http://harvest.cs.colorado.edu/~hardy/Home.html
ftp://ftp.cs.colorado.edu/pub/cs/misc/schwartz/HPCC94.Slides.ps.Z
ftp://ftp.cs.colorado.edu/pub/cs/misc/schwartz/HPC94.Slides.ps.Z
http://harvest.cs.colorado.edu/harvest/Home.html
ftp://ftp.cs.colorado.edu/pub/cs/misc/schwartz/IETF.Jul94.Slides.ps.Z
http://ftp.arpa.mil/ResearchAreas/NETS/Internet.html
Title{84}: IRTF Research Group on Resource Discovery IRTF Research Group on Resource Discovery
Keywords{121}: advanced agency bowman danzig darren hardy harvest manber mic michael peter problems projects research schwartz talk udi
}
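The number in braces after each attribute name gives the size of the value, which is what allows a value to span several lines. As a rough illustration of how such an object could be read, here is a minimal Python sketch. It is not part of Harvest: the function name parse_soif is hypothetical, and the sketch assumes ASCII text (so characters and bytes coincide) and exactly one space or tab between the colon and the value.

```python
import re

# "@TYPE { URL" header line, then "Name{size}:<sep>value" attributes.
HEADER = re.compile(r"@(\w+)\s*\{\s*(\S+)\s*\n")
ATTR = re.compile(r"([\w-]+)\{(\d+)\}:[ \t]")


def parse_soif(text):
    """Parse one SOIF object into (template_type, url, attributes).

    Hypothetical helper for illustration; assumes ASCII input and a
    single space or tab after the colon of each attribute.
    """
    header = HEADER.match(text)
    if header is None:
        raise ValueError("missing SOIF header")
    template_type, url = header.group(1), header.group(2)
    attributes = {}
    pos = header.end()
    while True:
        # Skip the newline(s) that separate one attribute from the next.
        while pos < len(text) and text[pos] in "\r\n":
            pos += 1
        if pos >= len(text):
            raise ValueError("unterminated SOIF object")
        if text[pos] == "}":
            break
        attr = ATTR.match(text, pos)
        if attr is None:
            raise ValueError("malformed attribute at offset %d" % pos)
        size = int(attr.group(2))
        start = attr.end()
        # Take exactly `size` characters, so values may contain newlines or "}".
        attributes[attr.group(1)] = text[start:start + size]
        pos = start + size
    return template_type, url, attributes


sample = (
    "@FILE { http://example.org/a.html\n"
    "Type{4}:\tHTML\n"
    "File-Size{4}:\t2099\n"
    "Title{11}:\tline1\nline2\n"
    "}\n"
)
kind, url, attrs = parse_soif(sample)
print(kind, url, attrs["Type"], attrs["File-Size"])
```

Reading by size rather than by line is the design choice that matters here: a multi-line value such as URL-References is taken whole, without guessing where it ends.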
Notice that although the Gatherer configuration file lists only two URLs (one in the RootNode section and one in the LeafNode section), there are more than two content summaries in the Gatherer's database. The Gatherer expanded the RootNode URL into dozens of LeafNode URLs by recursively extracting the links from the HTML file at the RootNode http://harvest.cs.colorado.edu/. Then, for each LeafNode given to the Gatherer, it generated a content summary, as in the example summary above for http://harvest.cs.colorado.edu/~schwartz/IRTF.html.
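The expansion of a RootNode into LeafNodes can be pictured with the following Python sketch. It is only an illustration under stated assumptions, not the Gatherer's actual code: the names LinkCollector, extract_links and expand_root are hypothetical, pages are supplied by a caller-provided fetch function rather than retrieved over the network, and the max_depth limit is an assumption of the sketch.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkCollector(HTMLParser):
    """Collect absolute URLs from the href attributes of <a> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links and drop any "#fragment" part.
                absolute, _fragment = urldefrag(urljoin(self.base_url, value))
                self.links.append(absolute)


def extract_links(html, base_url):
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links


def expand_root(root_url, fetch, max_depth=1):
    """Follow links breadth-first from root_url, up to max_depth levels.

    `fetch` is a hypothetical callable mapping a URL to its HTML text.
    """
    seen = {root_url}
    frontier = [root_url]
    for _ in range(max_depth):
        next_frontier = []
        for url in frontier:
            for link in extract_links(fetch(url), url):
                if link not in seen:
                    seen.add(link)
                    next_frontier.append(link)
        frontier = next_frontier
    return sorted(seen)


# Tiny in-memory "web" standing in for real pages.
pages = {
    "http://h/": '<a href="a.html">A</a> <a href="/b.html#x">B</a>',
    "http://h/a.html": '<a href="/">home</a>',
    "http://h/b.html": "",
}
leaf_nodes = expand_root("http://h/", lambda u: pages.get(u, ""), max_depth=2)
print(leaf_nodes)
```

The seen set is what keeps the traversal from looping when pages link back to each other, as the home link on a.html does here.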
The HTML summarizer will extract structured information about the Author and Title of the file. It will also extract any URL links into the URL-References attribute, and the text of any anchor tags into the Partial-Text attribute. Other information about the HTML file, such as its MD5 [17] and its size in bytes (File-Size), is also added to the content summary.
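To make the attribute list concrete, the following Python sketch builds a small dictionary of the attributes described above from the raw bytes of an HTML file. It is an approximation for illustration only and does not reproduce the Harvest summarizer: the names HtmlSummarizer and summarize_html are hypothetical, Author extraction is omitted, and decoding the input as Latin-1 is an assumption of the sketch.

```python
import hashlib
from html.parser import HTMLParser


class HtmlSummarizer(HTMLParser):
    """Gather the title, anchor text and link targets of an HTML page."""

    def __init__(self):
        super().__init__()
        self.title = []
        self.anchor_text = []
        self.urls = []
        self._open = []  # currently open <title> / <a> tags

    def handle_starttag(self, tag, attrs):
        if tag in ("title", "a"):
            self._open.append(tag)
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.urls.append(href)

    def handle_endtag(self, tag):
        if tag in self._open:
            self._open.remove(tag)

    def handle_data(self, data):
        text = data.strip()
        if not text or not self._open:
            return  # ignore text outside <title> and <a>
        if self._open[-1] == "title":
            self.title.append(text)
        else:
            self.anchor_text.append(text)


def summarize_html(raw):
    """Return a dict of summary attributes for the HTML bytes `raw`."""
    parser = HtmlSummarizer()
    parser.feed(raw.decode("latin-1"))  # assumption: single-byte encoding
    parser.close()
    return {
        "Type": "HTML",
        "File-Size": str(len(raw)),              # size in bytes
        "MD5": hashlib.md5(raw).hexdigest(),     # 32 hex digits
        "Title": "\n".join(parser.title),
        "Partial-Text": "\n".join(parser.anchor_text),
        "URL-References": "\n".join(parser.urls),
    }


raw = (b"<html><head><title>Demo</title></head>"
       b'<body><a href="http://example.org/x">talk</a> other</body></html>')
summary = summarize_html(raw)
for name, value in summary.items():
    print("%s{%d}:\t%s" % (name, len(value), value))
```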