4.7.7 Incorporating manually generated information into a Gatherer

Next: 4.8 Troubleshooting Up: 4.7 Gatherer administration Previous: 4.7.6 The local disk

4.7.7 Incorporating manually generated information into a Gatherer

You may want to inspect the quality of the automatically-generated SOIF templates. In general, Essence's techniques for automatic information extraction produce imperfect results. Sometimes it is possible to customize the summarizers to better suit the particular context (see Section 4.5.4). Sometimes, however, it makes sense to augment or change the automatically generated keywords with manually entered information. For example, you may want to add Title attributes to the content summaries for a set of PostScript documents (since it's difficult to parse them out of PostScript automatically).

Harvest provides some programs that automatically clean up a Gatherer's database. The rmbinary program removes any binary data from the templates. The cleandb program does some simple validation of SOIF objects, and when given the -truncate flag it will truncate the Keywords data field to 8 kilobytes. To help in manually managing the Gatherer's databases, the gdbmutil GDBM database management tool is provided in $HARVEST_HOME/lib/gatherer.

In a future release of Harvest we will provide a forms-based mechanism to make it easy to provide manual annotations. In the meantime, you can annotate the Gatherer's database with manually generated information by using the mktemplate, template2db, mergedb, and mkindex programs. You first need to create a file (called, say, annotations) in the following format:

        @FILE { url1
        Attribute-Name-1:        DATA
        Attribute-Name-2:        DATA
        ...
        Attribute-Name-n:        DATA
        }
        
        @FILE { url2
        Attribute-Name-1:        DATA
        Attribute-Name-2:        DATA
        ...
        Attribute-Name-n:        DATA
        }
        
        ...

Note that the Attributes must begin in column 0 and have one tab after the colon, and the DATA must be on a single line.

Next, run the mktemplate and template2db programs to generate SOIF and then GDBM versions of these data (you can have several files containing the annotations, and generate a singe GDBM database from the above commands):

        % set path = ($HARVEST_HOME/lib/gatherer $path)
        % mktemplate annotations [annotations2 ...] | template2db annotations.gdbm

Finally, you run mergedb to incorporate the annotations into the automatically generated data, and mkindex to generate an index for it. The usage line for mergedb is:

        mergedb production automatic manual [manual ...]

The idea is that production is the final GDBM database that the Gatherer will serve. This is a new database that will be generated from the other databases on the command line. automatic is the GDBM database that a Gatherer automatically generated in a previous run (e.g., WORKING.gdbm or a previous PRODUCTION.gdbm). manual and so on are the GDBM databases that you manually created. When mergedb runs, it builds the production database by first copying the templates from the manual databases, and then merging in the attributes from the automatic database. In case of a conflict (the same attribute with different values in the manual and automatic databases), the manual values override the automatic values.

By keeping the automatically and manually generated data stored separately, you can avoid losing the manual updates when doing periodic automatic gathering. To do this, you will need to set up a script to remerge the manual annotations with the automatically gathered data after each gathering.

An example use of mergedb is:

        % mergedb PRODUCTION.new PRODUCTION.gdbm annotations.gdbm
        % mv PRODUCTION.new PRODUCTION.gdbm
        % mkindex

If the manual database looked like this:

        @FILE { url1
        my-manual-attribute:  this is a neat attribute
        }

and the automatic database looked like this:

        @FILE { url1
        keywords:   boulder colorado
        file-size:  1034
        md5:        c3d79dc037efd538ce50464089af2fb6
        }

then in the end, the production database will look like this:

        @FILE { url1
        my-manual-attribute:  this is a neat attribute
        keywords:   boulder colorado
        file-size:  1034
        md5:        c3d79dc037efd538ce50464089af2fb6
        }

Next: 4.8 Troubleshooting Up: 4.7 Gatherer administration Previous: 4.7.6 The local disk

Duane Wessels
Wed Jan 31 23:46:21 PST 1996