
4.7.1 Setting variables in the Gatherer configuration file

   

       

In addition to customizing the steps described in Section 4.5.4, you can customize the Gatherer by setting variables in the Gatherer configuration file. This file consists of two parts: a list of variables that specify information about the Gatherer (such as its name, host, and port number), and two lists of URLs (divided into RootNodes and LeafNodes) from which to collect indexing information. Section 4 shows an example Gatherer configuration file. In this section we focus on the variables that the user can set in the first part of the Gatherer configuration file.

Each variable name starts in the first column and ends with a colon; the value follows on the same line. The following table shows the supported variables:

                             

             

        Access-Delay:           Default delay between URL accesses.
        Data-Directory:         Directory where the GDBM database is written.
        Debug-Options:          Debugging options passed to child programs.
        Errorlog-File:          File for logging errors.
        Essence-Options:        Any extra options to pass to Essence.
        FTP-Auth:               Username/password for protected FTP documents.
        Gatherd-Inetd:          Denotes that gatherd is run from inetd.
        Gatherer-Host:          Full hostname where the Gatherer is run.
        Gatherer-Name:          A unique name for the Gatherer.
        Gatherer-Options:       Extra options for the Gatherer.
        Gatherer-Port:          Port number for gatherd.
        Gatherer-Version:       Version string for the Gatherer.
        HTTP-Basic-Auth:        Username/password for protected HTTP documents.
        HTTP-Proxy:             host:port of your HTTP proxy.
        Keep-Cache:             "yes" to not remove the local disk cache.
        Lib-Directory:          Directory where configuration files live.
        Local-Mapping:          Mapping information for local gathering.
        Log-File:               File for logging progress.
        Post-Summarizing:       A rules-file for post-summarizing.
        Refresh-Rate:           Object refresh-rate in seconds, default 1 week.
        Time-To-Live:           Object time-to-live in seconds, default 1 month.
        Top-Directory:          Top-level directory for the Gatherer.
        Working-Directory:      Directory for tmp files and local disk cache.
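
The line format described above (a name starting in the first column, a colon, then the value) can be illustrated with a small Python sketch. This is not part of Harvest: the function name parse_variables, the regular expression, and the sample values are assumptions made for illustration, and the sketch handles only the variable part of the file, not the URL lists.

```python
import re

# Assumed pattern: a name made of letters, digits, and hyphens that begins
# in the first column, followed by a colon and then the value.
VARIABLE_LINE = re.compile(r"^([A-Za-z][A-Za-z0-9-]*):\s*(.*?)\s*$")


def parse_variables(text):
    """Return a dict of variable name -> value for lines in 'Name: value' form."""
    variables = {}
    for line in text.splitlines():
        match = VARIABLE_LINE.match(line)
        if match:  # lines that do not begin with 'Name:' are skipped
            name, value = match.groups()
            variables[name] = value
    return variables


# Hypothetical values, for illustration only.
SAMPLE = """\
Gatherer-Name:  Example Gatherer
Gatherer-Port:  8500
Refresh-Rate:   604800
  Indented-Name: skipped because it does not start in the first column
"""

config = parse_variables(SAMPLE)
print(config["Gatherer-Name"])
# One week expressed in seconds, the unit used by Refresh-Rate.
print(int(config["Refresh-Rate"]) == 7 * 24 * 60 * 60)
```

Values are kept as strings; a caller would convert numeric settings such as Refresh-Rate or Gatherer-Port as needed.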

Notes:

                         

The Essence options are:

Option                  Meaning
--------------------------------------------------------------------
--allowlist filename    File with list of types to allow
--fake-md5s             Generates MD5s for SOIF objects from a .unnest program
--fast-summarizing      Trade some consistency for speed.  Use only when
                        an external summarizer is known to generate clean,
                        unique attributes.
--full-text             Use entire file instead of summarizing.  Alternatively,
                        you can perform full text indexing of individual file
                        types by using the FullText.sum summarizer (see
                        Section 4.5.4 for details).
--max-deletions n       Number of GDBM deletions before reorganization
--minimal-bookkeeping   Generates a minimal amount of bookkeeping attrs
--no-access             Do not read contents of objects
--no-keywords           Do not automatically generate keywords
--stoplist filename     File with list of types to remove
--type-only             Only type data; do not summarize objects
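
As a rough sketch of how the options in the table above might be combined into a value for the Essence-Options variable, the following Python helper checks option names against the table and formats one configuration line. The helper name essence_options_line is hypothetical, and the sketch assumes that several options are written on one line separated by spaces, which this section does not state explicitly.

```python
# Option names copied from the table above; the value is True when the
# option takes an argument (a filename or a number).
ESSENCE_OPTIONS = {
    "--allowlist": True,
    "--fake-md5s": False,
    "--fast-summarizing": False,
    "--full-text": False,
    "--max-deletions": True,
    "--minimal-bookkeeping": False,
    "--no-access": False,
    "--no-keywords": False,
    "--stoplist": True,
    "--type-only": False,
}


def essence_options_line(options):
    """Build an 'Essence-Options:' line from (option, argument) pairs.

    Pass None as the argument for options that take none.
    """
    parts = []
    for name, argument in options:
        if name not in ESSENCE_OPTIONS:
            raise ValueError(f"unknown Essence option: {name}")
        takes_argument = ESSENCE_OPTIONS[name]
        if takes_argument and argument is None:
            raise ValueError(f"{name} requires an argument")
        if not takes_argument and argument is not None:
            raise ValueError(f"{name} takes no argument")
        parts.append(name if argument is None else f"{name} {argument}")
    # Assumption: several options are separated by spaces on one line.
    return "Essence-Options: " + " ".join(parts)


line = essence_options_line([("--no-keywords", None), ("--max-deletions", 10)])
print(line)
```

Rejecting unknown names early catches typing mistakes before the line is written into a configuration file.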

 

A particular note about full text summarizing: Using the Essence --full-text option causes files not to be passed through the Essence content extraction mechanism. Instead, their entire content is included in the SOIF summary stream. In some cases this may produce unwanted results. For example, it will directly include the PostScript for a document rather than first passing the data through a PostScript-to-text extractor, which yields few searchable terms and large SOIF objects. Using the individual file type summarizing mechanism described in Section 4.5.4 will work better in this regard, but it requires you to specify how data are extracted for each individual file type. In a future version of Harvest we will change the Essence --full-text option to perform content extraction before including the full text of documents.

   






Duane Wessels
Wed Jan 31 23:46:21 PST 1996