The RootNode specification facility described in
Section 4.2 provides a basic set of default enumeration
actions for RootNodes. Often it is useful to enumerate beyond the default
limits -- for example, to increase the enumeration limit beyond 250 URLs, or
to allow site boundaries to be crossed when enumerating HTML links. Starting
with Harvest Version 1.1, it is possible to specify these and other aspects of
enumeration, using the following syntax (which is backwards-compatible with
Harvest Version 1.0):
<RootNodes>
URL EnumSpec
URL EnumSpec
...
</RootNodes>
where EnumSpec is on a single line (using ``\'' to escape linefeeds),
with the following syntax:
URL=URL-Max[,URL-Filter-filename] \
Host=Host-Max[,Host-Filter-filename] \
Access=TypeList \
Delay=Seconds \
Depth=Number \
Enumeration=Enumeration-Program
The EnumSpec modifiers are all optional, and have the following meanings:
- URL-Max
The number on the right-hand side of the ``URL='' expression is the
maximum number of LeafNode URLs to generate at all levels of depth,
starting from the current URL. Note that URL-Max limits the number of
URLs generated during the enumeration; it is not a limit on how many
URLs can pass through the candidate selection phase (see Section 4.5.4).
- URL-Filter-filename
This is the name of a file containing a set of regular expression
filters (discussed below) that allow or deny particular LeafNodes in the
enumeration. The default filter is
$HARVEST_HOME/lib/gatherer/URL-filter-default, which excludes
many image and sound files.
- Host-Max
The number on the right-hand side of the ``Host='' expression is the
maximum number of hosts that will be touched during the RootNode
enumeration. Hosts are counted by IP address so that aliased hosts are
enumerated properly. Note that this does not work correctly for
multi-homed hosts, or for hosts with rotating DNS entries (used by some
sites to balance load across heavily accessed servers).
Note: Prior to Harvest Version 1.2 the ``Host=...'' line was
called ``Site=...''. We changed the name to ``Host='' because it is
more intuitively meaningful (being a host count limit, not a site count
limit). For backwards compatibility with older Gatherer configuration
files, we will continue to treat ``Site='' as an alias for ``Host=''.
- Host-Filter-filename
This is the name of a file containing a set of regular expression
filters that allow or deny particular hosts in the enumeration. Each
expression can specify both a host name (or IP address) and a port
number, in case you have multiple servers running on different ports of
the same host and want to index only one. The syntax is
``hostname:port''.
- Access
If the RootNode is an HTTP URL, you can specify the access methods
across which to enumerate. Valid access method types are: FILE, FTP,
Gopher, HTTP, News, Telnet, or WAIS. Use a ``|'' character between type
names to allow multiple access methods. For example,
``Access=HTTP|FTP|Gopher'' will follow HTTP, FTP, and Gopher URLs while
enumerating an HTTP RootNode URL.
- Delay
This is the number of seconds to wait between server contacts.
- Depth
This is the maximum number of levels of enumeration that will be followed
during gathering. Depth=0 means that there is no limit to the depth of
the enumeration. Depth=1 means that the specified URL and all the URLs
it references will be retrieved, and so on for higher Depth values. In
other words, the enumeration will follow links up to Depth steps away
from the specified URL.
- Enumeration-Program
This modifier adds a very flexible way to control a Gatherer. The
Enumeration-Program is a filter which reads URLs as input and writes new
enumeration parameters on output. See Section 4.3.2 for specific details.
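As a rough illustration of how the URL-Max, Host-Max, Access, and Depth
limits described above might interact, the following Python sketch walks
an in-memory link graph instead of fetching pages. It is not the
Gatherer's actual code. The function name enumerate_urls and the links
mapping are hypothetical, and several simplifications are assumed: hosts
are counted by name rather than by IP address, the RootNode itself counts
toward URL-Max, links are visited breadth-first, and filters and Delay
are ignored.

```python
from collections import deque
from urllib.parse import urlsplit


def enumerate_urls(root, links, url_max=250, host_max=1, depth=0,
                   access=("http",)):
    """Breadth-first enumeration over `links` (URL -> referenced URLs).

    Hypothetical sketch: `links` stands in for fetching and parsing pages.
    """
    hosts = {urlsplit(root).hostname}   # hosts touched so far (by name)
    seen = {root}
    result = [root]                     # assumption: root counts toward url_max
    queue = deque([(root, 0)])          # (URL, steps away from the root)
    while queue:
        url, dist = queue.popleft()
        if depth and dist >= depth:
            continue                    # depth=0 means no depth limit
        for target in links.get(url, []):
            if len(result) >= url_max:
                return result           # URL-Max reached
            if target in seen:
                continue
            parts = urlsplit(target)
            if parts.scheme.lower() not in access:
                continue                # access method not allowed
            if parts.hostname not in hosts:
                if len(hosts) >= host_max:
                    continue            # Host-Max reached: skip new hosts
                hosts.add(parts.hostname)
            seen.add(target)
            result.append(target)
            queue.append((target, dist + 1))
    return result


# Hypothetical link graph used only for illustration.
links = {
    "http://a.example/": ["http://a.example/1", "http://b.example/",
                          "ftp://a.example/f"],
    "http://a.example/1": ["http://a.example/2"],
    "http://a.example/2": ["http://a.example/3"],
}
print(enumerate_urls("http://a.example/", links, depth=1))
```

With depth=1 the walk stops one step from the root, and with the other
limits left at their defaults the link to the second host and the FTP
link are both skipped.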
By default, URL-Max is 250, URL-Filter imposes no limit, Host-Max is 1,
Host-Filter imposes no limit, Access is HTTP only, Delay is 1 second, and
Depth is zero. There is no way to specify an unlimited value for
URL-Max or Host-Max.
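To make the syntax and defaults concrete, here is a minimal Python sketch
that parses one ``URL EnumSpec'' entry into a dictionary. It is not part
of Harvest. The function name parse_rootnode, the dictionary keys, the
example URL, and the filter filename my-url-filter are all hypothetical,
and the sketch assumes that no modifier value contains whitespace. A
filter entry of None simply means that no filter file was named.

```python
def parse_rootnode(entry):
    """Parse one 'URL EnumSpec' entry into a dict (illustrative only)."""
    # A backslash immediately before a linefeed escapes the linefeed.
    tokens = entry.replace("\\\n", " ").split()
    spec = {
        "root": tokens[0],
        # Defaults as stated in the text above.
        "URL": 250, "URL-Filter": None,
        "Host": 1, "Host-Filter": None,
        "Access": ["HTTP"], "Delay": 1, "Depth": 0,
        "Enumeration": None,
    }
    for token in tokens[1:]:
        key, _, value = token.partition("=")
        if key == "Site":               # older alias for Host
            key = "Host"
        if key in ("URL", "Host"):
            # Form: Max[,Filter-filename]
            limit, _, filter_file = value.partition(",")
            spec[key] = int(limit)
            spec[key + "-Filter"] = filter_file or None
        elif key == "Access":
            spec[key] = value.split("|")
        elif key in ("Delay", "Depth"):
            spec[key] = int(value)
        elif key == "Enumeration":
            spec[key] = value
        else:
            raise ValueError("unknown EnumSpec modifier: " + key)
    return spec


# Hypothetical entry: URL and filter filename are placeholders.
example = ("http://www.example.com/ URL=500,my-url-filter \\\n"
           "    Host=2 Access=HTTP|FTP Depth=3")
print(parse_rootnode(example))
```

Modifiers that are omitted, such as Delay in the entry above, keep their
default values.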
Duane Wessels
Wed Jan 31 23:46:21 PST 1996