The Gatherer is able to ``explode'' a resource into a stream of content summaries. This is useful for files that contain manually-generated information that may describe one or more resources, or for building a gateway between various structured formats and SOIF (see Appendix B).
This example demonstrates an exploder for the Linux Software Map (LSM) format. LSM files contain structured information (like the author, location, etc.) about software available for the Linux operating system. A demo of our LSM Gatherer and Broker is available.
To run this example, type:
% cd $HARVEST_HOME/gatherers/example-2
% ./RunGatherer
To view the configuration file for this Gatherer, look at example-2.cf. Notice that the Gatherer has its own Lib-Directory (see Section 4.7.1 for help on writing configuration files). The library directory contains the typing and candidate selection customizations for Essence. In this example, we've only customized the candidate selection step. lib/stoplist.cf defines the types that Essence should not index. This example uses an empty stoplist.cf file to direct Essence to index all files.
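The effect of the stoplist on candidate selection can be sketched in Python. This is only an illustration, not Essence's code: it assumes a stoplist file holds one type name per line with # comments, and the function names parse_stoplist and select_candidates are hypothetical.

```python
def parse_stoplist(text):
    """Return the set of type names listed in a stoplist-style file.

    Assumed format (not taken from the Essence sources): one type name
    per line; blank lines and lines starting with '#' are ignored.
    """
    types = set()
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            types.add(line)
    return types


def select_candidates(typed_objects, stoplist):
    """Keep only the (url, type) pairs whose type is not in the stoplist."""
    return [(url, typ) for url, typ in typed_objects if typ not in stoplist]


objects = [
    ("ftp://tsx-11.mit.edu/pub/linux/docs/linux-doc-project/man-pages-1.4.lsm", "LSM"),
    ("ftp://ftp.cs.unc.edu/pub/faith/linux/man-pages-1.4.tar.gz", "GNUCompressedTar"),
]

# An empty stoplist, as in this example, lets every object through.
print(select_candidates(objects, parse_stoplist("")))

# A hypothetical non-empty stoplist would drop the listed types.
print(select_candidates(objects, parse_stoplist("# skip archives\nGNUCompressedTar\n")))
```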
The Gatherer retrieves each of the LeafNode URLs, which are all Linux Software Map files from the Linux FTP archive tsx-11.mit.edu. The Gatherer recognizes that a ``.lsm'' file is of type LSM because of the naming heuristic present in lib/byname.cf. The LSM type is a ``nested'' type as specified in the Essence source code. For nested types, exploder programs (named TypeName.unnest) are run instead of the usual summarizers. The LSM.unnest program is the standard exploder program that takes an LSM file and generates one or more corresponding SOIF objects. When the Gatherer finishes, it contains one or more SOIF objects for the software described within each LSM file.
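What an exploder does can be sketched in a few lines of Python. The sketch below is not the LSM.unnest program. It assumes a simplified, hypothetical ``Key = value'' entry format rather than real LSM syntax, it assumes that a SOIF attribute is written as Name{size}: followed by a tab and the value, and all function names are invented for illustration. Each blank-line-separated entry becomes one SOIF object whose URL is built from the entry's Site, Path, and File fields.

```python
def parse_entries(text):
    """Split text into blank-line-separated entries of 'Key = value' lines."""
    entries = []
    for block in text.strip().split("\n\n"):
        fields = {}
        for line in block.splitlines():
            if "=" in line:
                key, value = line.split("=", 1)
                fields[key.strip()] = value.strip()
        if fields:
            entries.append(fields)
    return entries


def to_soif(url, fields, template="FILE"):
    """Render one SOIF object; the size in braces is the value's byte length."""
    lines = ["@%s { %s" % (template, url)]
    for name, value in fields.items():
        size = len(value.encode("utf-8"))
        lines.append("%s{%d}:\t%s" % (name, size, value))
    lines.append("}")
    return "\n".join(lines)


def explode(text):
    """Yield one SOIF object per entry, keyed by the URL of the described file."""
    for fields in parse_entries(text):
        url = "ftp://%s%s/%s" % (fields["Site"], fields["Path"], fields["File"])
        yield to_soif(url, fields)


# Hypothetical input, loosely modeled on the man-pages entry in this example.
sample = """\
Title = Section 2, 3, 4, 5, 7, and 9 man pages for Linux
Version = 1.4
Site = ftp.cs.unc.edu
Path = /pub/faith/linux
File = man-pages-1.4.tar.gz
"""

for soif_object in explode(sample):
    print(soif_object)
```

A real exploder would also have to handle multi-valued fields and malformed entries, which this sketch ignores.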
After the Gatherer has finished, it will start up the Gatherer daemon which will serve the content summaries. To view the content summaries, type:
% gather localhost 9222 | more
Because tsx-11.mit.edu is a popular and heavily loaded archive, the Gatherer often won't be able to retrieve the LSM files. If you suspect that something went wrong, look in log.errors and log.gatherer to try to determine the problem.
The following two SOIF objects were generated by this Gatherer. The first object summarizes the LSM file itself, and the second object summarizes the software described in the LSM file.
@FILE { ftp://tsx-11.mit.edu/pub/linux/docs/linux-doc-project/man-pages-1.4.lsm
Time-to-Live{7}: 9676800
Last-Modification-Time{9}: 781931042
Refresh-Rate{7}: 2419200
Gatherer-Name{25}: Example Gatherer Number 2
Gatherer-Host{22}: powell.cs.colorado.edu
Gatherer-Version{3}: 0.4
Type{3}: LSM
Update-Time{9}: 781931042
File-Size{3}: 848
MD5{32}: 67377f3ea214ab680892c82906081caf
}

@FILE { ftp://ftp.cs.unc.edu/pub/faith/linux/man-pages-1.4.tar.gz
Time-to-Live{7}: 9676800
Last-Modification-Time{9}: 781931042
Refresh-Rate{7}: 2419200
Gatherer-Name{25}: Example Gatherer Number 2
Gatherer-Host{22}: powell.cs.colorado.edu
Gatherer-Version{3}: 0.4
Update-Time{9}: 781931042
Type{16}: GNUCompressedTar
Title{48}: Section 2, 3, 4, 5, 7, and 9 man pages for Linux
Version{3}: 1.4
Description{124}: Man pages for Linux. Mostly section 2 is complete. Section 3 has over 200 man pages, but it still far from being finished.
Author{27}: Linux Documentation Project
AuthorEmail{11}: DOC channel
Maintainer{9}: Rik Faith
MaintEmail{16}: faith@cs.unc.edu
Site{45}: ftp.cs.unc.edu sunsite.unc.edu tsx-11.mit.edu
Path{94}: /pub/faith/linux /pub/Linux/docs/linux-doc-project/man-pages /pub/linux/docs/linux-doc-project
File{20}: man-pages-1.4.tar.gz
FileSize{4}: 170k
CopyPolicy{47}: Public Domain or otherwise freely distributable
Keywords{10}: man pages
Entered{24}: Sun Sep 11 19:52:06 1994
EnteredBy{9}: Rik Faith
CheckedEmail{16}: faith@cs.unc.edu
}
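Because every attribute carries the size of its value in braces, a SOIF object can be read without guessing where a value ends, even when the value spans several lines. The Python sketch below illustrates the idea under stated assumptions: it handles a single object, it accepts any one separator character after the colon (a tab, as far as we understand the SOIF layout), and the function name parse_soif is hypothetical rather than part of Harvest.

```python
import re

# "@TEMPLATE { URL" header line and "Name{size}:" attribute prefix.
HEADER = re.compile(rb"@(\w+)\s*\{\s*(\S+)\s*\n")
ATTRIBUTE = re.compile(rb"\s*([^\s{}]+)\{(\d+)\}:")


def parse_soif(data):
    """Parse one SOIF object from bytes; return (template, url, attributes)."""
    header = HEADER.match(data)
    if header is None:
        raise ValueError("not a SOIF object")
    template = header.group(1).decode()
    url = header.group(2).decode()
    attributes = {}
    pos = header.end()
    while True:
        match = ATTRIBUTE.match(data, pos)
        if match is None:
            break  # no further attribute: expect the closing brace
        size = int(match.group(2))
        start = match.end() + 1  # skip the single separator after ':'
        attributes[match.group(1).decode()] = data[start:start + size].decode()
        pos = start + size  # the declared size says where the value ends
    if data[pos:].strip() != b"}":
        raise ValueError("missing closing brace")
    return template, url, attributes


# A shortened copy of the first object above, with a tab after each colon.
sample = (
    b"@FILE { ftp://tsx-11.mit.edu/pub/linux/docs/linux-doc-project/man-pages-1.4.lsm\n"
    b"Type{3}:\tLSM\n"
    b"File-Size{3}:\t848\n"
    b"MD5{32}:\t67377f3ea214ab680892c82906081caf\n"
    b"}\n"
)

template, url, attributes = parse_soif(sample)
print(template, url)
print(attributes)
```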
We've also built a Gatherer that explodes about a half-dozen index files from various PC archives into more than 25,000 content summaries. Each of these index files contains hundreds of one-line descriptions of PC software distributions that are available via anonymous FTP. We have a demo available via the Web.