You may want to inspect the quality of the automatically generated SOIF templates. In general, Essence's techniques for automatic information extraction produce imperfect results. Sometimes it is possible to customize the summarizers to better suit the particular context (see Section 4.5.4). In other cases, it makes more sense to augment or change the automatically generated keywords with manually entered information. For example, you may want to add Title attributes to the content summaries for a set of PostScript documents, since it is difficult to parse titles out of PostScript automatically.
Harvest provides some programs that automatically clean up a Gatherer's database. The rmbinary program removes any binary data from the templates. The cleandb program does some simple validation of SOIF objects, and when given the -truncate flag it will truncate the Keywords data field to 8 kilobytes. To help in manually managing the Gatherer's databases, the gdbmutil GDBM database management tool is provided in $HARVEST_HOME/lib/gatherer.
In a future release of Harvest we will provide a forms-based mechanism to make it easy to provide manual annotations. In the meantime, you can annotate the Gatherer's database with manually generated information by using the mktemplate, template2db, mergedb, and mkindex programs. You first need to create a file (called, say, annotations) in the following format:
    @FILE { url1
    Attribute-Name-1:	DATA
    Attribute-Name-2:	DATA
    ...
    Attribute-Name-n:	DATA
    }

    @FILE { url2
    Attribute-Name-1:	DATA
    Attribute-Name-2:	DATA
    ...
    Attribute-Name-n:	DATA
    }

    ...
Note that the Attributes must begin in column 0 and have one tab after the colon, and the DATA must be on a single line.
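The formatting rules above are easy to get wrong by hand, so it may help to generate the annotations file from a script. The following is a minimal sketch, not part of Harvest: the function names `format_annotation` and `write_annotations` and the example URL are hypothetical, and it assumes that collapsing embedded newlines into spaces is an acceptable way to keep DATA on a single line.

```python
def format_annotation(url, attributes):
    """Render one @FILE block in the annotations format described above.

    `attributes` maps attribute names to their DATA values.
    """
    lines = ["@FILE { " + url]
    for name, data in attributes.items():
        # DATA must be on a single line: collapse runs of whitespace
        # (including newlines) into single spaces. This is an assumption
        # about what is acceptable for your data.
        one_line = " ".join(str(data).split())
        # The attribute starts in column 0, with one tab after the colon.
        lines.append(f"{name}:\t{one_line}")
    lines.append("}")
    return "\n".join(lines) + "\n"


def write_annotations(path, entries):
    """Write (url, attributes) pairs to `path`, one @FILE block per pair."""
    with open(path, "w") as f:
        # Joining with a newline leaves a blank line between blocks.
        f.write("\n".join(format_annotation(u, a) for u, a in entries))


# Hypothetical example entry; the URL is made up for illustration.
example = format_annotation(
    "http://example.com/paper.ps",
    {"Title": "A Sample\nPostScript Title"},
)
```

Calling `write_annotations("annotations", [...])` with a list of such pairs would produce a file intended as input for mktemplate.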
Next, run the mktemplate and template2db programs to generate SOIF and then GDBM versions of these data. You can keep the annotations in several files and generate a single GDBM database from all of them:
    % set path = ($HARVEST_HOME/lib/gatherer $path)
    % mktemplate annotations [annotations2 ...] | template2db annotations.gdbm
Finally, you run mergedb to incorporate the annotations into the automatically generated data, and mkindex to generate an index for it. The usage line for mergedb is:
mergedb production automatic manual [manual ...]
The idea is that production is the final GDBM database that the Gatherer will serve. This is a new database that will be generated from the other databases on the command line. automatic is the GDBM database that a Gatherer automatically generated in a previous run (e.g., WORKING.gdbm or a previous PRODUCTION.gdbm). manual and so on are the GDBM databases that you manually created. When mergedb runs, it builds the production database by first copying the templates from the manual databases, and then merging in the attributes from the automatic database. In case of a conflict (the same attribute with different values in the manual and automatic databases), the manual values override the automatic values.
By keeping the automatically and manually generated data stored separately, you can avoid losing the manual updates when doing periodic automatic gathering. To do this, you will need to set up a script to remerge the manual annotations with the automatically gathered data after each gathering.
An example use of mergedb is:
    % mergedb PRODUCTION.new PRODUCTION.gdbm annotations.gdbm
    % mv PRODUCTION.new PRODUCTION.gdbm
    % mkindex
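To remerge after each gathering, the same three commands could be wrapped in a small script. The following is only a sketch under stated assumptions: the function names `remerge_commands` and `remerge` are hypothetical, the programs are assumed to be on the search path (for example via $HARVEST_HOME/lib/gatherer), and the script is assumed to run in the directory that holds the databases.

```python
import subprocess


def remerge_commands(automatic="PRODUCTION.gdbm",
                     manual=("annotations.gdbm",),
                     production="PRODUCTION.gdbm",
                     scratch="PRODUCTION.new"):
    """Return the command lines for one remerge, in the order they run.

    The defaults mirror the example use of mergedb; pass
    automatic="WORKING.gdbm" to merge against a freshly gathered database.
    """
    return [
        # mergedb production automatic manual [manual ...]
        ["mergedb", scratch, automatic, *manual],
        # Move the newly built database into place.
        ["mv", scratch, production],
        # Regenerate the index for the merged database.
        ["mkindex"],
    ]


def remerge(run=subprocess.check_call, **names):
    """Run each command in turn, stopping at the first failure.

    `run` is injectable so the sequence can be checked without Harvest
    installed; subprocess.check_call raises if a command exits non-zero.
    """
    for command in remerge_commands(**names):
        run(command)
```

Calling `remerge()` after each Gatherer run would then reapply the manual annotations. Passing a different `run` callable, such as `print`, shows the commands without executing them.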
If the manual database looked like this:
    @FILE { url1
    my-manual-attribute: this is a neat attribute
    }
and the automatic database looked like this:
    @FILE { url1
    keywords: boulder colorado
    file-size: 1034
    md5: c3d79dc037efd538ce50464089af2fb6
    }
then in the end, the production database will look like this:
    @FILE { url1
    my-manual-attribute: this is a neat attribute
    keywords: boulder colorado
    file-size: 1034
    md5: c3d79dc037efd538ce50464089af2fb6
    }
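The merge rule can be modeled in a few lines of Python. This is an illustrative sketch of the behavior described for mergedb, not its actual implementation. Each database is modeled as a dictionary mapping a URL to its attributes, and `merge_templates` is a hypothetical name. Two points are assumptions that the text does not state: that templates present only in the automatic database are carried over, and that earlier manual databases take precedence over later ones.

```python
def merge_templates(automatic, *manuals):
    """Model the merge: manual attributes win over automatic ones.

    Each database is a dict of the form {url: {attribute: value}}.
    """
    production = {}
    # First copy the templates from the manual databases. On a clash
    # between two manual databases the earlier one wins (an assumption).
    for manual in manuals:
        for url, attributes in manual.items():
            template = production.setdefault(url, {})
            for name, value in attributes.items():
                template.setdefault(name, value)
    # Then merge in the automatic attributes. setdefault leaves any
    # value already present untouched, so manual values override.
    for url, attributes in automatic.items():
        template = production.setdefault(url, {})
        for name, value in attributes.items():
            template.setdefault(name, value)
    return production


# The databases from the example above.
manual = {"url1": {"my-manual-attribute": "this is a neat attribute"}}
automatic = {"url1": {
    "keywords": "boulder colorado",
    "file-size": "1034",
    "md5": "c3d79dc037efd538ce50464089af2fb6",
}}
production = merge_templates(automatic, manual)

# A conflict case (made-up values): the manual keywords replace the
# automatic ones.
conflict = merge_templates({"url1": {"keywords": "automatic words"}},
                           {"url1": {"keywords": "hand-picked words"}})
```

Here `production["url1"]` holds the manual attribute followed by the three automatic ones, matching the production template shown above, and `conflict["url1"]["keywords"]` is the manual value.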