4.7.1 Setting variables in the Gatherer configuration file

In addition to customizing the steps described in Section 4.5.4, you can customize the Gatherer by setting variables in the Gatherer configuration file. This file consists of two parts: a list of variables that specify information about the Gatherer (such as its name, host, and port number), and two lists of URLs (divided into RootNodes and LeafNodes) from which to collect indexing information. Section 4 shows an example Gatherer configuration file. In this section we focus on the variables that the user can set in the first part of the Gatherer configuration file.

Each variable name starts in the first column and ends with a colon, followed by the value. The following table shows the supported variables:

        Access-Delay:           Default delay between URL accesses.
        Data-Directory:         Directory where GDBM database is written.
        Debug-Options:          Debugging options passed to child programs.
        Errorlog-File:          File for logging errors.
        Essence-Options:        Any extra options to pass to Essence.
        FTP-Auth:               Username/password for protected FTP documents.
        Gatherd-Inetd:          Denotes that gatherd is run from inetd.
        Gatherer-Host:          Full hostname where the Gatherer is run.
        Gatherer-Name:          A unique name for the Gatherer.
        Gatherer-Options:       Extra options for the Gatherer.
        Gatherer-Port:          Port number for gatherd.
        Gatherer-Version:       Version string for the Gatherer.
        HTTP-Basic-Auth:        Username/password for protected HTTP documents.
        HTTP-Proxy:             host:port of your HTTP proxy.
        Keep-Cache:             "yes" to keep the local disk cache.
        Lib-Directory:          Directory where configuration files live.
        Local-Mapping:          Mapping information for local gathering.
        Log-File:               File for logging progress.
        Post-Summarizing:       A rules-file for post-summarizing.
        Refresh-Rate:           Object refresh-rate in seconds, default 1 week.
        Time-To-Live:           Object time-to-live in seconds, default 1 month.
        Top-Directory:          Top-level directory for the Gatherer.
        Working-Directory:      Directory for tmp files and local disk cache.
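
Because Refresh-Rate and Time-To-Live take values in seconds, it can be easier to compute them than to type them by hand. The Python sketch below does the arithmetic and prints the resulting configuration lines. The helper names `seconds` and `format_variable` are hypothetical, and treating "1 month" as 30 days is an assumption; the table does not say how long a month is.

```python
# Sketch: compute second counts for Refresh-Rate and Time-To-Live.
# Assumption: "1 month" is treated as 30 days; the manual does not say.

SECONDS_PER_DAY = 24 * 60 * 60


def seconds(days):
    """Return the number of seconds in the given number of days."""
    return days * SECONDS_PER_DAY


def format_variable(name, value):
    """Render one configuration line: name, colon, then the value."""
    return f"{name}: {value}"


refresh_line = format_variable("Refresh-Rate", seconds(7))
ttl_line = format_variable("Time-To-Live", seconds(30))

print(refresh_line)  # Refresh-Rate: 604800
print(ttl_line)      # Time-To-Live: 2592000
```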

Notes:

We recommend that you use the Top-Directory variable, since it will set the Data-Directory, Lib-Directory, and Working-Directory variables.
Both Working-Directory and Data-Directory will have files in them after the Gatherer has run. The Working-Directory will hold the local-disk cache that the Gatherer uses to reduce network I/O, and the Data-Directory will hold the GDBM databases that contain the content summaries.
You should use full rather than relative pathnames.
All variable definitions must come before the RootNode or LeafNode URLs.
Any line that starts with "#" is a comment.
Local-Mapping is discussed in Section 4.7.2.
HTTP-Proxy causes the Gatherer to retrieve HTTP URLs via a proxy host. The syntax is hostname:port; for example, harvest-cache.cs.colorado.edu:3128.
Essence-Options is particularly useful, as it lets you customize basic aspects of the Gatherer easily.
The only valid value for Gatherer-Options is --save-space, which directs the Gatherer to be more space efficient when preparing its database for export.
The Gatherer program accepts the -background flag, which causes the Gatherer to run in the background.
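
The syntax rules in these notes (a name in the first column, a terminating colon, "#" comments, and variables placed before the URL lists) can be illustrated with a small reader written in Python. This is only a sketch and not the Gatherer's own parser: the function names, the sample file contents, and the port and path values are hypothetical, and treating a line that starts with "<" as the start of the URL lists is an assumption, since this section does not show how those lists are delimited.

```python
# Sketch of a reader for the variable part of a Gatherer configuration file.
# Assumption: the URL lists begin at the first line starting with "<"
# (for example "<RootNodes>"); this section does not show that syntax.
import re

# A variable name starts in the first column and ends with a colon.
VARIABLE_LINE = re.compile(r"^([A-Za-z][A-Za-z-]*):\s*(.*)$")


def read_variables(text):
    """Return a dict of the variables defined before the URL lists."""
    variables = {}
    for line in text.splitlines():
        if line.startswith("#") or not line.strip():
            continue  # comment or blank line
        if line.startswith("<"):
            break  # assumed start of the RootNodes/LeafNodes lists
        match = VARIABLE_LINE.match(line)
        if match:
            variables[match.group(1)] = match.group(2).strip()
    return variables


def split_proxy(value):
    """Split an HTTP-Proxy value of the form hostname:port."""
    host, _, port = value.rpartition(":")
    return host, int(port)


# Hypothetical example file; names, port, and paths are illustrative only.
SAMPLE = """\
# Example Gatherer configuration
Gatherer-Name: Example Gatherer
Gatherer-Port: 8500
Top-Directory: /usr/local/harvest/gatherers/example
HTTP-Proxy: harvest-cache.cs.colorado.edu:3128
<RootNodes>
http://www.example.com/
</RootNodes>
"""

config = read_variables(SAMPLE)
print(config["Top-Directory"])
print(split_proxy(config["HTTP-Proxy"]))
```

The sample uses a full pathname for Top-Directory, as recommended above, and stops reading before the URL lists so that URLs are never mistaken for variable definitions.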

The Essence options are:

Option                  Meaning
--------------------------------------------------------------------
--allowlist filename    File with list of types to allow
--fake-md5s             Generates MD5s for SOIF objects from a .unnest program
--fast-summarizing      Trade some consistency for speed.  Use only when
                        an external summarizer is known to generate clean,
                        unique attributes.
--full-text             Use entire file instead of summarizing.  Alternatively,
                        you can perform full text indexing of individual file
                        types by using the FullText.sum summarizer (see
                        Section 4.5.4 for details).
--max-deletions n       Number of GDBM deletions before reorganization
--minimal-bookkeeping   Generates a minimal set of bookkeeping attributes
--no-access             Do not read contents of objects
--no-keywords           Do not automatically generate keywords
--stoplist filename     File with list of types to remove
--type-only             Only type data; do not summarize objects
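
As an illustration of how the options in this table might be combined into a single Essence-Options value, the Python sketch below checks each option against the table and assembles one configuration line. The function name `essence_options_line` is hypothetical, and writing several options on one line separated by spaces is an assumption; this section does not state how multiple options are delimited.

```python
# Sketch: assemble an Essence-Options line from the documented options.
# Assumption: several options may be given on one line, separated by spaces.

# Option name -> whether it takes an argument (per the options table).
ESSENCE_OPTIONS = {
    "--allowlist": True,
    "--fake-md5s": False,
    "--fast-summarizing": False,
    "--full-text": False,
    "--max-deletions": True,
    "--minimal-bookkeeping": False,
    "--no-access": False,
    "--no-keywords": False,
    "--stoplist": True,
    "--type-only": False,
}


def essence_options_line(options):
    """Build an Essence-Options line from (option, argument) pairs.

    Use None as the argument for options that take none.
    """
    parts = []
    for option, argument in options:
        if option not in ESSENCE_OPTIONS:
            raise ValueError(f"unknown Essence option: {option}")
        takes_argument = ESSENCE_OPTIONS[option]
        if takes_argument and argument is None:
            raise ValueError(f"{option} requires an argument")
        if not takes_argument and argument is not None:
            raise ValueError(f"{option} takes no argument")
        parts.append(option if argument is None else f"{option} {argument}")
    return "Essence-Options: " + " ".join(parts)


line = essence_options_line([("--no-keywords", None), ("--max-deletions", 50)])
print(line)  # Essence-Options: --no-keywords --max-deletions 50
```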

A particular note about full text summarizing: Using the Essence --full-text option causes files not to be passed through the Essence content extraction mechanism. Instead, their entire content is included in the SOIF summary stream. In some cases this may produce unwanted results (e.g., it will directly include the PostScript for a document rather than first passing the data through a PostScript-to-text extractor, which yields few searchable terms and large SOIF objects). Using the individual file type summarizing mechanism described in Section 4.5.4 will work better in this regard, but will require you to specify how data are extracted for each individual file type. In a future version of Harvest we will change the Essence --full-text option to perform content extraction before including the full text of documents.


Duane Wessels
Wed Jan 31 23:46:21 PST 1996