After the Gatherer retrieves a document, it passes the document through
a subsystem called Essence [11,10] to extract
indexing information. Essence allows the Gatherer to collect indexing
information easily from a wide variety of data, using different
techniques depending on the type of data and the needs of the particular
corpus being indexed. In a nutshell, Essence can determine the type of
data pointed to by a URL (e.g., PostScript vs. HTML), ``unravel'' presentation nesting formats (such as
compressed ``tar'' files), select which types of data to index (e.g.,
don't index Audio files), and then apply a type-specific extraction
algorithm (called a summarizer) to the data to generate a content
summary. Users can customize each of these aspects, but often this is
not necessary: Harvest is distributed with a ``stock'' set of type
recognizers, presentation unnesters, candidate selectors, and
summarizers that work well for many applications.
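
The four stages described above (type recognition, presentation unnesting, candidate selection, and summarizing) can be pictured with the following minimal Python sketch. It is an illustration only and not Essence's actual implementation: the suffix table, the `SKIP_TYPES` set, the toy summarizers, and every function name here are hypothetical, and type recognition is reduced to looking at the file name.

```python
import io
import tarfile

# Hypothetical suffix table; Essence's real type recognition is more elaborate.
SUFFIX_TYPES = {
    ".tar": "Tar", ".tar.gz": "Tar", ".tgz": "Tar",
    ".txt": "Text", ".html": "HTML", ".ps": "PostScript", ".au": "Audio",
}

# Candidate selection (assumed policy): types that are never summarized.
SKIP_TYPES = {"Audio"}


def recognize_type(name):
    """Stage 1: guess a type from the file name suffix."""
    for suffix, type_name in SUFFIX_TYPES.items():
        if name.endswith(suffix):
            return type_name
    return "Unrecognized"


def unnest(name, data):
    """Stage 2: yield (name, bytes) for every file inside nested archives."""
    if recognize_type(name) == "Tar":
        with tarfile.open(fileobj=io.BytesIO(data)) as archive:
            for member in archive.getmembers():
                if member.isfile():
                    inner = archive.extractfile(member).read()
                    yield from unnest(member.name, inner)
    else:
        yield name, data


def summarize_text(name, data):
    """Toy Text summarizer: keep only the first line."""
    lines = data.decode("utf-8", errors="replace").splitlines()
    return lines[0] if lines else ""


def summarize_file_name(name, data):
    """Toy fallback summarizer: report the file name."""
    return name


# Stage 4 dispatch table: type name -> summarizer (hypothetical).
SUMMARIZERS = {"Text": summarize_text}


def essence(name, data):
    """Run all four stages; return (name, type, summary) tuples."""
    results = []
    for inner_name, inner_data in unnest(name, data):
        type_name = recognize_type(inner_name)
        if type_name in SKIP_TYPES:  # Stage 3: candidate selection
            continue
        summarizer = SUMMARIZERS.get(type_name, summarize_file_name)
        results.append((inner_name, type_name, summarizer(inner_name, inner_data)))
    return results
```

Calling `essence("bundle.tar", archive_bytes)` on an archive would walk its members, drop the Audio ones, and summarize the rest. Keeping each stage behind its own function or table mirrors the point made above that each aspect can be customized separately.
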
Starting with Harvest Version 1.2, we are also integrating support for summarizers based on outside ``component technologies,'' both free and commercial.
Below we describe the stock summarizer set, the current components distribution, and how users can customize summarizers to change how they operate and add summarizers for new types of data. If you develop a summarizer (or an interface to a commercial system) that is likely to be useful to other users, please notify us via email at harvest-dvl@cs.colorado.edu so we may include it in our components distribution.

Type                    Summarizer Function
--------------------------------------------------------------------
Audio                   Extract file name
Bibliographic           Extract author and titles
Binary                  Extract meaningful strings and manual page summary
C, CHeader              Extract procedure names, included file names,
                        and comments
Dvi                     Invoke the Text summarizer on extracted ASCII text
FAQ, FullText, README   Extract all words in file
Framemaker              Up-convert to SGML and pass through SGML summarizer
Font                    Extract comments
HTML                    Extract anchors, hypertext links, and selected
                        fields (see SGML)
LaTex                   Parse selected LaTex fields (author, title, etc.)
Mail                    Extract certain header fields
Makefile                Extract comments and target names
ManPage                 Extract synopsis, author, title, etc., based on
                        ``-man'' macros
News                    Extract certain header fields
Object                  Extract symbol table
Patch                   Extract patched file names
Perl                    Extract procedure names and comments
PostScript              Extract text in word processor-specific fashion,
                        and pass through Text summarizer.
RCS, SCCS               Extract revision control summary
RTF                     Up-convert to SGML and pass through SGML summarizer
SGML                    Extract fields named in extraction table
                        (see Section~\ref{sec:sgml})
ShellScript             Extract comments
SourceDistribution      Extract full text of README file and comments from
                        Makefile and source code files, and summarize any
                        manual pages
SymbolicLink            Extract file name, owner, and date created
Tex                     Invoke the Text summarizer on extracted ASCII text
Text                    Extract first 100 lines plus first sentence of
                        each remaining paragraph
Troff                   Extract author, title, etc., based on ``-man'',
                        ``-ms'', ``-me'' macro packages, or extract
                        section headers and topic sentences.
Unrecognized            Extract file name, owner, and date created.
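
As an illustration of what one of these summarizers does, the Text entry above (first 100 lines plus the first sentence of each remaining paragraph) could be approximated by the following Python sketch. This is not the stock summarizer's code. The function name `summarize_text`, the `head_lines` parameter, the blank-line definition of a paragraph, and the naive sentence rule (text up to the first `.`, `!`, or `?` followed by whitespace) are all assumptions made for the example.

```python
import re


def summarize_text(text, head_lines=100):
    """Return (first head_lines lines, first sentence of each later paragraph).

    Simplifications (assumptions, not Essence behaviour): paragraphs are
    separated by blank lines, and a paragraph cut in two by the head_lines
    boundary has its remainder treated as a paragraph of its own.
    """
    lines = text.splitlines()
    head = lines[:head_lines]
    rest = "\n".join(lines[head_lines:])

    first_sentences = []
    for paragraph in re.split(r"\n\s*\n", rest):
        flat = " ".join(paragraph.split())  # collapse line breaks and spaces
        if not flat:
            continue
        # Naive rule: a sentence ends at the first ., ! or ? that is
        # followed by whitespace or the end of the paragraph.
        match = re.match(r".*?[.!?](?=\s|$)", flat)
        first_sentences.append(match.group(0) if match else flat)
    return head, first_sentences
```

The sentence rule will cut early on abbreviations such as ``e.g.''; a production summarizer would need a more careful tokenizer.
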