
4.5.4 Customizing the type recognition, candidate selection, presentation unnesting, and summarizing steps

     

The Harvest Gatherer's actions are defined by a set of configuration and utility files, and a corresponding set of executable programs referenced by some of the configuration files.

If you want to customize a Gatherer, you should create bin and lib subdirectories in the directory where you are running the Gatherer, and then copy $HARVEST_HOME/lib/gatherer/*.cf and $HARVEST_HOME/lib/gatherer/magic into your lib directory. Then add to your Gatherer configuration file:

        Lib-Directory:         lib
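
The setup steps above could be scripted roughly as follows. This is a minimal sketch, not part of Harvest: setup_gatherer_dir is a hypothetical helper, and its three arguments (the Harvest installation directory, the Gatherer's directory, and the Gatherer configuration file) are assumptions made for illustration. It only reminds you to add the Lib-Directory line rather than editing the configuration file itself.

```shell
#!/bin/sh
# Sketch only: setup_gatherer_dir is a hypothetical helper, not a Harvest tool.
# Usage: setup_gatherer_dir <harvest-home> <gatherer-dir> <gatherer-config-file>
setup_gatherer_dir() {
    harvest_home=$1
    gatherer_dir=$2
    config_file=$3

    # Create the bin and lib subdirectories in the Gatherer's directory.
    mkdir -p "$gatherer_dir/bin" "$gatherer_dir/lib" || return 1

    # Copy the default configuration files and the magic file into lib.
    cp "$harvest_home"/lib/gatherer/*.cf "$gatherer_dir/lib/" || return 1
    cp "$harvest_home/lib/gatherer/magic" "$gatherer_dir/lib/" || return 1

    # Print a reminder if the configuration file has no Lib-Directory line yet.
    if ! grep -q '^Lib-Directory:' "$config_file" 2>/dev/null; then
        echo "Add 'Lib-Directory: lib' to $config_file"
    fi
}
```

It might be invoked from the Gatherer's directory as: setup_gatherer_dir "$HARVEST_HOME" . GathName.cf
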

The details of what each of these files does are described below. The basic contents of a typical Gatherer's directory are as follows (note: some of the file names below can be changed by setting variables in the Gatherer configuration file, as described in Section 4.7.1):

                                           

        RunGatherd*    bin/           GathName.cf    log.errors     tmp/
        RunGatherer*   data/          lib/           log.gatherer

        bin:
        MyNewType.sum* Exploder.unnest*

        data:
        All-Templates.gz   INFO.soif          gatherd.cf
        INDEX.gdbm         PRODUCTION.gdbm    gatherd.log

        lib:
        bycontent.cf   byurl.cf       quick-sum.cf
        byname.cf      magic          stoplist.cf

        tmp:
        cache-liburl/

The RunGatherd and RunGatherer programs are used to export the Gatherer's database after a machine reboot and to run the Gatherer, respectively. The log.errors and log.gatherer files contain error messages and the output of the Essence typing step, respectively (Essence is described shortly). The GathName.cf file is the Gatherer's configuration file.

The bin directory contains any summarizers and any other programs needed by the summarizers or by the presentation unnesting steps. If you were to customize the Gatherer by adding a summarizer or a presentation unnesting program, you would place those programs in this bin directory; MyNewType.sum and Exploder.unnest are examples (see Section 4.5.4).
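
Placing a new program in bin amounts to copying it there and making it executable (the trailing * in the listing above marks executable files). A minimal sketch, assuming a hypothetical helper named install_gatherer_program that is not part of Harvest:

```shell
#!/bin/sh
# Sketch only: install_gatherer_program is a hypothetical helper.
# Usage: install_gatherer_program <program-file> <gatherer-dir>
install_gatherer_program() {
    program=$1
    gatherer_dir=$2

    # Make sure the Gatherer's bin directory exists, then copy the program in.
    mkdir -p "$gatherer_dir/bin" || return 1
    cp "$program" "$gatherer_dir/bin/" || return 1

    # Mark the installed copy as executable.
    chmod +x "$gatherer_dir/bin/$(basename "$program")"
}
```

For example, install_gatherer_program MyNewType.sum . would place MyNewType.sum in the bin subdirectory of the current Gatherer directory.
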

 

The data directory contains the Gatherer's database, which gatherd exports. The Gatherer's database consists of the All-Templates.gz, INDEX.gdbm, INFO.soif, and PRODUCTION.gdbm files. The gatherd.cf file is used to support access control as described in Section 4.7.4. The gatherd.log file is where the gatherd program logs its information.
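
As a quick sanity check, one could verify that the four database files named above are present in the data directory. The following is only a sketch: check_gatherer_db is a hypothetical helper, not part of Harvest, and it assumes the default file names shown in the listing have not been changed in the Gatherer configuration file.

```shell
#!/bin/sh
# Sketch only: check_gatherer_db is a hypothetical helper.
# Usage: check_gatherer_db <data-dir>
# Returns 0 if all four database files exist, 1 otherwise.
check_gatherer_db() {
    data_dir=$1
    missing=0

    for f in All-Templates.gz INDEX.gdbm INFO.soif PRODUCTION.gdbm; do
        if [ ! -f "$data_dir/$f" ]; then
            # Report each missing file on standard error.
            echo "missing: $data_dir/$f" >&2
            missing=1
        fi
    done

    return "$missing"
}
```

It could be used as: check_gatherer_db data || echo "Gatherer database is incomplete"
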

The lib directory contains the configuration files used by the Gatherer's subsystems, namely Essence. These files are described briefly in the following table:

        bycontent.cf    Content parsing heuristics for type recognition step
        byname.cf       File naming heuristics for type recognition step
        byurl.cf        URL naming heuristics for type recognition step
        magic           UNIX "file" command specifications (matched against
                        bycontent.cf strings)
        quick-sum.cf    Extracts attributes for summarizing step
        stoplist.cf     File types to reject during candidate selection
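
When customizing, it can help to see which of these files differ from the installed defaults. The sketch below compares each file in a Gatherer's lib directory with the copy under the Harvest installation. list_customized_lib is a hypothetical helper, not part of Harvest, and it assumes that all six files in the table are also present in the installation's lib/gatherer directory.

```shell
#!/bin/sh
# Sketch only: list_customized_lib is a hypothetical helper.
# Usage: list_customized_lib <harvest-home> <gatherer-lib-dir>
list_customized_lib() {
    default_dir=$1/lib/gatherer
    lib_dir=$2

    for f in bycontent.cf byname.cf byurl.cf magic quick-sum.cf stoplist.cf; do
        if [ ! -f "$lib_dir/$f" ]; then
            echo "$f: missing"
        elif cmp -s "$default_dir/$f" "$lib_dir/$f"; then
            # Byte-for-byte identical to the installed default.
            echo "$f: default"
        else
            # Differs from the default (or the default could not be read).
            echo "$f: customized"
        fi
    done
}
```

For example: list_customized_lib "$HARVEST_HOME" lib
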

                       

We discuss each of the customizable steps in the subsections below.








Duane Wessels
Wed Jan 31 23:46:21 PST 1996