
C.1 Example 1 - A simple Gatherer

 

This example is a simple Gatherer that uses the default customizations. The only work that the user does to configure this Gatherer is to specify the list of URLs from which to gather (see Section 4).

To run this example, type:

        % cd $HARVEST_HOME/gatherers/example-1
        % ./RunGatherer

To view the configuration file for this Gatherer, look at example-1.cf. The first few lines are variables that specify some local information about the Gatherer (see Section 4.7.1). For example, each content summary will contain the name of the Gatherer (Gatherer-Name) that generated it. The port number (Gatherer-Port) that will be used to export the indexing information is also specified, as is the directory that contains the Gatherer (Top-Directory). Notice that there is one RootNode URL and one LeafNode URL.

After the Gatherer has finished, it will start up the Gatherer daemon, which exports the content summaries. To view the content summaries, type:

        % gather localhost 9111 | more

The following SOIF object should look similar to those that this Gatherer generates.

        @FILE { http://harvest.cs.colorado.edu/~schwartz/IRTF.html
        Time-to-Live{7}:        9676800
        Last-Modification-Time{1}:      0
        Refresh-Rate{7}:        2419200
        Gatherer-Name{25}:      Example Gatherer Number 1
        Gatherer-Host{22}:      powell.cs.colorado.edu
        Gatherer-Version{3}:    0.4
        Update-Time{9}: 781478043
        Type{4}:        HTML
        File-Size{4}:   2099
        MD5{32}:        c2fa35fd44a47634f39086652e879170
        Partial-Text{151}:      research problems
        Mic Bowman
        Peter Danzig
        Udi Manber
        Michael Schwartz
        Darren Hardy 
        talk
        talk
        Harvest
        talk
        Advanced
        Research Projects Agency
        
        URL-References{628}:
        ftp://ftp.cs.colorado.edu/pub/cs/techreports/schwartz/RD.ResearchProblems.Jour.ps.Z
        ftp://grand.central.org/afs/transarc.com/public/mic/html/Bio.html
        http://excalibur.usc.edu/people/danzig.html
        http://glimpse.cs.arizona.edu:1994/udi.html
        http://harvest.cs.colorado.edu/~schwartz/Home.html
        http://harvest.cs.colorado.edu/~hardy/Home.html
        ftp://ftp.cs.colorado.edu/pub/cs/misc/schwartz/HPCC94.Slides.ps.Z
        ftp://ftp.cs.colorado.edu/pub/cs/misc/schwartz/HPC94.Slides.ps.Z
        http://harvest.cs.colorado.edu/harvest/Home.html
        ftp://ftp.cs.colorado.edu/pub/cs/misc/schwartz/IETF.Jul94.Slides.ps.Z
        http://ftp.arpa.mil/ResearchAreas/NETS/Internet.html
        
        Title{84}:      IRTF Research Group on Resource Discovery
        IRTF Research Group on Resource Discovery
        
        Keywords{121}:  advanced agency bowman danzig darren hardy harvest manber mic
        michael peter problems projects research schwartz talk udi
        
        }
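
Each attribute line in the object above has the form Name{size}: value, where size is the length of the value in bytes (for example, Type{4} holds the four bytes HTML). A reader can use that count to recover values that span several lines, such as URL-References. The following Python sketch illustrates the idea. It is not part of Harvest: the function name parse_soif is hypothetical, and the sketch assumes that a single tab or space separates the colon from the value, whereas the listing above is shown with expanded whitespace.

```python
import re

# "@TYPE { URL" header line that opens a SOIF object.
HEADER = re.compile(rb"\s*@(\S+)\s*\{\s*(\S+)[ \t]*\n")
# "Name{size}:" followed by exactly one separator character (assumed).
ATTR = re.compile(rb"\s*([^\s{}:]+)\{(\d+)\}:[ \t]")


def parse_soif(data: bytes):
    """Parse one SOIF object into (template_type, url, attributes)."""
    header = HEADER.match(data)
    if header is None:
        raise ValueError("not a SOIF object")
    template = header.group(1).decode()
    url = header.group(2).decode()

    attributes = {}
    pos = header.end()
    while True:
        attr = ATTR.match(data, pos)
        if attr is None:
            break  # reached the closing "}" or the end of the input
        size = int(attr.group(2))
        start = attr.end()
        # The byte count, not the newline, marks the end of the value.
        attributes[attr.group(1).decode()] = data[start:start + size].decode()
        pos = start + size
    return template, url, attributes


if __name__ == "__main__":
    sample = (
        b"@FILE { http://example.org/a.html\n"
        b"Type{4}:\tHTML\n"
        b"Title{11}:\tline1\nline2\n"
        b"}\n"
    )
    print(parse_soif(sample))
```

Because the value is sliced by its declared size, the embedded newline in the Title value of the sample does not end the attribute early.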

Notice that although the Gatherer configuration file lists only two URLs (one in the RootNode section and one in the LeafNode section), there are more than two content summaries in the Gatherer's database. The Gatherer expanded the RootNode URL into dozens of LeafNode URLs by recursively extracting the links from the HTML file at the RootNode http://harvest.cs.colorado.edu/. Then, for each LeafNode, the Gatherer generated a content summary, as in the example summary above for http://harvest.cs.colorado.edu/~schwartz/IRTF.html.
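
The expansion of a RootNode into LeafNodes can be pictured with a short Python sketch. It is only an illustration of link extraction and breadth-first enumeration, not the Gatherer's actual enumeration code. The names extract_links and enumerate_leaves, the fetch callable, and the max_depth limit are all assumptions introduced for the example, and example.org stands in for a real server.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return the absolute URLs linked from one HTML document."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    collector.close()
    return collector.links


def enumerate_leaves(root_url, fetch, max_depth=1):
    """Follow links from root_url for max_depth levels, without repeats.

    fetch is any callable that maps a URL to its HTML text.
    """
    seen = [root_url]
    frontier = [root_url]
    for _ in range(max_depth):
        next_frontier = []
        for url in frontier:
            for link in extract_links(fetch(url), url):
                if link not in seen:
                    seen.append(link)
                    next_frontier.append(link)
        frontier = next_frontier
    return seen


if __name__ == "__main__":
    # An in-memory stand-in for the network, so the sketch runs offline.
    pages = {
        "http://example.org/": '<a href="a.html">A</a> <a href="b.html">B</a>',
        "http://example.org/a.html": '<a href="c.html">C</a>',
    }
    for url in enumerate_leaves("http://example.org/", lambda u: pages.get(u, ""), 2):
        print(url)
```

Passing fetch as a parameter keeps the enumeration logic separate from the retrieval method, which is why the sketch can be exercised with a dictionary of pages.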

The HTML summarizer extracts structured information about the Author and Title of the file. It also extracts any URL links into the URL-References attribute, and any anchor tags into the Partial-Text attribute. Other information about the HTML file, such as its MD5 [17] and its size in bytes (File-Size), is also added to the content summary.
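
As a rough illustration of the MD5 and File-Size attributes, the Python sketch below computes both for a byte string and prints them in the Name{size}: value layout used in the summary above. It assumes the checksum is taken over the raw bytes of the file. The function names checksum_attributes and soif_line are hypothetical and do not come from Harvest.

```python
import hashlib


def checksum_attributes(content: bytes) -> dict:
    """Return File-Size and MD5 values for the given file contents."""
    return {
        "File-Size": str(len(content)),
        "MD5": hashlib.md5(content).hexdigest(),
    }


def soif_line(name: str, value: str) -> str:
    """Format one attribute as Name{size}:<tab>value, size counted in bytes."""
    return f"{name}{{{len(value.encode())}}}:\t{value}"


if __name__ == "__main__":
    for name, value in checksum_attributes(b"hello").items():
        print(soif_line(name, value))
```

An MD5 digest written in hexadecimal is always 32 characters long, which matches the MD5{32} size shown in the example summary.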

   





Duane Wessels
Wed Jan 31 23:46:21 PST 1996