
4.3 RootNode specifications

                   

The RootNode specification facility described in Section 4.2 provides a basic set of default enumeration actions for RootNodes. Often it is useful to enumerate beyond the default limits -- for example, to increase the enumeration limit beyond 250 URLs, or to allow site boundaries to be crossed when enumerating HTML links. Starting with Harvest Version 1.1, it is possible to specify these and other aspects of enumeration, using the following syntax (which is backwards-compatible with Harvest Version 1.0):

        <RootNodes>
        URL EnumSpec
        URL EnumSpec
        ...
        </RootNodes>

where EnumSpec is on a single line (using ``\'' to escape linefeeds), with the following syntax:

        URL=URL-Max[,URL-Filter-filename]  \
        Host=Host-Max[,Host-Filter-filename] \
        Access=TypeList \
        Delay=Seconds \
        Depth=Number \
        Enumeration=Enumeration-Program
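
As an illustration, a single RootNode entry that uses several of these modifiers might look like the sketch below. The URL and the filter filename MyURLFilter are hypothetical placeholders, not names shipped with Harvest; the ``\'' at the end of the first line continues the entry onto the next line:

        <RootNodes>
        http://www.example.com/docs/ \
                URL=500,MyURLFilter Host=5 Access=HTTP|FTP Delay=10 Depth=2
        </RootNodes>

Read against the syntax above, this entry would ask for at most 500 LeafNode URLs, at most 5 hosts, HTTP and FTP links, a 10 second pause between server contacts, and links at most 2 steps away from the RootNode URL.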

The EnumSpec modifiers are all optional, and have the following meanings:

URL-Max
The number on the right-hand side of the ``URL='' expression gives the maximum number of LeafNode URLs to generate, at all levels of depth, from the current URL. Note that URL-Max limits the number of URLs generated during the enumeration; it is not a limit on how many URLs can pass through the candidate selection phase (see Section 4.5.4).

URL-Filter-filename
This is the name of a file containing a set of regular expression filters (discussed below) to allow or deny particular LeafNodes in the enumeration. The default filter is $HARVEST_HOME/lib/gatherer/URL-filter-default which excludes many image and sound files.

Host-Max
The number on the right-hand side of the ``Host='' expression gives the maximum number of hosts that will be touched during the RootNode enumeration. Hosts are counted by IP address so that aliased hosts are properly enumerated. Note that this does not work correctly for multi-homed hosts, or for hosts with rotating DNS entries (used by some sites to balance load across heavily accessed servers).

Note: Prior to Harvest Version 1.2 the ``Host=...'' line was called ``Site=...''. We changed the name to ``Host='' because it is more intuitively meaningful (being a host count limit, not a site count limit). For backwards compatibility with older Gatherer configuration files, we will continue to treat ``Site='' as an alias for ``Host=''.

Host-Filter-filename
This is the name of a file containing a set of regular expression filters to allow or deny particular hosts in the enumeration. Each expression can specify both a host name (or IP address) and a port number (in case you have multiple servers running on different ports of the same host and you want to index only one). The syntax is ``hostname:port''.

Access
If the RootNode is an HTTP URL, then you can specify which access methods to enumerate across. Valid access method types are: FILE, FTP, Gopher, HTTP, News, Telnet, or WAIS. Use a ``|'' character between type names to allow multiple access methods. For example, ``Access=HTTP|FTP|Gopher'' will follow HTTP, FTP, and Gopher URLs while enumerating an HTTP RootNode URL.

Delay
This is the number of seconds to wait between server contacts.

Depth
This is the maximum number of levels of enumeration that will be followed during gathering. Depth=0 means that there is no limit to the depth of the enumeration. Depth=1 means the specified URL will be retrieved, and all the URLs referenced by the specified URL will be retrieved; and so on for higher Depth values. In other words, the enumeration will follow links up to Depth steps away from the specified URL.

Enumeration-Program
This modifier provides a very flexible way to control a Gatherer. The Enumeration-Program is a filter which reads URLs as input and writes new enumeration parameters on output. See Section 4.3.2 for specific details.

By default, URL-Max is 250, URL-Filter imposes no limit, Host-Max is 1, Host-Filter imposes no limit, Access is HTTP only, Delay is 1 second, and Depth is zero. There is no way to specify an unlimited value for URL-Max or Host-Max.
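
To make the defaults concrete, here is a small sketch with two entries (both URLs are hypothetical). Going by the defaults listed above, the first entry, which has no EnumSpec at all, should be enumerated as if it carried the numeric and access settings written out in the second:

        <RootNodes>
        http://www.example.com/
        http://www.example.org/ URL=250 Host=1 Access=HTTP Delay=1 Depth=0
        </RootNodes>

Since the modifiers are all optional, an entry normally only needs to name the ones it changes; for example, ``Host=50 Delay=60'' alone would raise the host limit and slow the request rate while leaving the other settings at their defaults.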








Duane Wessels
Wed Jan 31 23:46:21 PST 1996