In addition to customizing the steps described in Section 4.5.4, you can customize the Gatherer by setting variables in the Gatherer configuration file. This file consists of two parts: a list of variables that specify information about the Gatherer (such as its name, host, and port number), and two lists of URLs (divided into RootNodes and LeafNodes) from which to collect indexing information. Section 4 shows an example Gatherer configuration file. In this section we focus on the variables that the user can set in the first part of the Gatherer configuration file.
Each variable name starts in the first column and ends with a colon, and is followed by the value. The following table shows the supported variables:
Variable Name        Description
--------------------------------------------------------------------
Access-Delay:        Default delay between URL accesses.
Data-Directory:      Directory where GDBM database is written.
Debug-Options:       Debugging options passed to child programs.
Errorlog-File:       File for logging errors.
Essence-Options:     Any extra options to pass to Essence.
FTP-Auth:            Username/password for protected FTP documents.
Gatherd-Inetd:       Denotes that gatherd is run from inetd.
Gatherer-Host:       Full hostname where the Gatherer is run.
Gatherer-Name:       A unique name for the Gatherer.
Gatherer-Options:    Extra options for the Gatherer.
Gatherer-Port:       Port number for gatherd.
Gatherer-Version:    Version string for the Gatherer.
HTTP-Basic-Auth:     Username/password for protected HTTP documents.
HTTP-Proxy:          host:port of your HTTP proxy.
Keep-Cache:          "yes" to not remove local disk cache.
Lib-Directory:       Directory where configuration files live.
Local-Mapping:       Mapping information for local gathering.
Log-File:            File for logging progress.
Post-Summarizing:    A rules-file for post-summarizing.
Refresh-Rate:        Object refresh-rate in seconds, default 1 week.
Time-To-Live:        Object time-to-live in seconds, default 1 month.
Top-Directory:       Top-level directory for the Gatherer.
Working-Directory:   Directory for tmp files and local disk cache.
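The variable syntax described above (name in the first column, a colon, then the value) can be illustrated with a small parser. This is only a sketch in Python, not part of Harvest: the function name parse_variables and every value in SAMPLE are hypothetical, and only the variable names are taken from the table above.

```python
import re

# Hypothetical values for illustration; only the variable names
# (Gatherer-Name, Gatherer-Port, ...) come from the table above.
SAMPLE = """\
Gatherer-Name: Example Gatherer
Gatherer-Port: 8500
Top-Directory: /usr/local/harvest/gatherers/example
HTTP-Proxy: proxy.example.com:3128
"""

# A name that starts in the first column and ends at the first colon,
# followed by the value.
VARIABLE_LINE = re.compile(r"^([A-Za-z][A-Za-z0-9-]*):\s*(.*)$")


def parse_variables(text):
    """Return a dict mapping variable names to their values."""
    variables = {}
    for line in text.splitlines():
        match = VARIABLE_LINE.match(line)
        if match:  # lines that do not start with "Name:" are skipped
            variables[match.group(1)] = match.group(2).strip()
    return variables


config = parse_variables(SAMPLE)
print(config["Gatherer-Port"])
```

Because the name is matched only up to the first colon, a value that itself contains a colon (such as the host:port form used by HTTP-Proxy) is kept intact.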
Notes:
--save-space: directs the Gatherer to be more space efficient when preparing its database for export.
-background: flag which will cause the Gatherer to run in the background.
The Essence options are:
Option                  Meaning
--------------------------------------------------------------------
--allowlist filename    File with list of types to allow
--fake-md5s             Generates MD5s for SOIF objects from a .unnest
                        program
--fast-summarizing      Trade speed for some consistency. Use only when
                        an external summarizer is known to generate
                        clean, unique attributes.
--full-text             Use entire file instead of summarizing.
                        Alternatively, you can perform full text
                        indexing of individual file types by using the
                        FullText.sum summarizer (see Section 4.5.4 for
                        details).
--max-deletions n       Number of GDBM deletions before reorganization
--minimal-bookkeeping   Generates a minimal amount of bookkeeping attrs
--no-access             Do not read contents of objects
--no-keywords           Do not automatically generate keywords
--stoplist filename     File with list of types to remove
--type-only             Only type data; do not summarize objects
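As an illustration of how these options might be combined into a value for the Essence-Options variable, the following Python sketch checks each option against the table above and assembles a single configuration line. It is not part of Harvest: the helper name essence_options_line is hypothetical, the argument value 500 is an arbitrary example, and the assumption that several options can be given space-separated on one line should be verified against your installation.

```python
# Options from the table above; True means the option takes an argument.
ESSENCE_OPTIONS = {
    "--allowlist": True,
    "--fake-md5s": False,
    "--fast-summarizing": False,
    "--full-text": False,
    "--max-deletions": True,
    "--minimal-bookkeeping": False,
    "--no-access": False,
    "--no-keywords": False,
    "--stoplist": True,
    "--type-only": False,
}


def essence_options_line(options):
    """Build an Essence-Options line from (option, argument) pairs.

    Use None as the argument for options that take none.
    """
    parts = []
    for name, argument in options:
        if name not in ESSENCE_OPTIONS:
            raise ValueError(f"unknown Essence option: {name}")
        takes_argument = ESSENCE_OPTIONS[name]
        if takes_argument and argument is None:
            raise ValueError(f"{name} requires an argument")
        if not takes_argument and argument is not None:
            raise ValueError(f"{name} takes no argument")
        parts.append(name if argument is None else f"{name} {argument}")
    # Assumption: options are space-separated on a single line.
    return "Essence-Options: " + " ".join(parts)


line = essence_options_line([("--no-keywords", None), ("--max-deletions", 500)])
print(line)
```

Rejecting unknown names early catches misspelled options before they reach the configuration file.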
A particular note about full text summarizing: Using the Essence --full-text
option causes files not to be passed through the Essence content extraction
mechanism. Instead, their entire content is included in the SOIF summary
stream. In some cases this may produce unwanted results (e.g., it will
directly include the PostScript for a document rather than first passing the
data through a PostScript-to-text extractor, providing few searchable terms
and large SOIF objects). Using the individual file type summarizing mechanism
described in Section 4.5.4 will work better in this regard, but will require
you to specify how data are extracted for each individual file type. In a
future version of Harvest we will change the Essence --full-text option to
perform content extraction before including the full text of documents.