Starting with Version 1.2, Harvest summarizes HTML using the generic SGML summarizer described in Section 4.5.2. The advantage of this approach is that the summarizer is more easily customizable and fits the well-conceived SGML model (where you define DTDs for individual document types and build interpretation software to understand DTDs rather than individual document types). The downside is that the summarizer is now pickier about syntax, and many Web documents are not syntactically correct. Because of this pickiness, the HTML summarizer runs with syntax-checking output disabled by default. If your documents are so badly formed that they confuse the parser, the summarizing process may die unceremoniously. If you find that some of your HTML documents do not get summarized, or get summarized only in part, you can turn syntax-checking output on by setting syntax_check = 1 in $HARVEST_HOME/lib/gatherer/SGML.sum. That will allow you to see which documents are invalid and where.
Note that part of the reason for this problem is that Web browsers (like Netscape) do not insist on well-formed documents. So, users can easily create documents that are not completely valid, yet display fine. The problem should become less pronounced if/when people shift to creating HTML documents using HTML editors rather than editing the raw HTML themselves.
Below is the default SGML-to-SOIF table used by the HTML summarizer. The pathname to this file is $HARVEST_HOME/lib/gatherer/sgmls-lib/HTML/HTML.sum.tbl. Individual Gatherers may do customized HTML summarizing by placing a modified version of this file in the Gatherer lib directory. Another way to customize is to modify the HTML.sum script and add a -t option to the SGML.sum command. For example:
SGML.sum -t $HARVEST_HOME/lib/my-HTML.table HTML $*
HTML ELEMENT      SOIF ATTRIBUTES
------------      -----------------------
<A>               keywords,parent
<A:HREF>          url-references
<ADDRESS>         address
<B>               keywords,parent
<BODY>            body
<CITE>            references
<CODE>            ignore
<EM>              keywords,parent
<H1>              headings
<H2>              headings
<H3>              headings
<H4>              headings
<H5>              headings
<H6>              headings
<HEAD>            head
<I>               keywords,parent
<IMG:SRC>         images
<META:CONTENT>    $NAME
<STRONG>          keywords,parent
<TITLE>           title
<TT>              keywords,parent
<UL>              keywords,parent
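When preparing a modified table, it can help to load it into a simple lookup structure and check the result before handing it to the summarizer. The Python sketch below is not part of Harvest. It assumes the table is laid out as shown above: one element per line, followed by whitespace and a comma-separated list of SOIF attributes. The function name parse_sum_table and the sample text are hypothetical, chosen only for illustration.

```python
# Illustrative only: a small reader for a table in the two-column layout
# shown above. The real SGML.sum parser may accept other syntax.

SAMPLE_TABLE = """\
HTML ELEMENT      SOIF ATTRIBUTES
------------      -----------------------
<A>               keywords,parent
<A:HREF>          url-references
<CODE>            ignore
<META:CONTENT>    $NAME
"""


def parse_sum_table(text):
    """Return {element: [soif attribute, ...]} for each table row."""
    table = {}
    for line in text.splitlines():
        line = line.strip()
        # Rows start with an element such as <A> or <A:HREF>; this skips
        # blank lines, the column header, and the dashed rule.
        if not line.startswith("<"):
            continue
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue  # element with no attribute list: nothing to record
        element, attributes = parts
        table[element] = [name.strip() for name in attributes.split(",")]
    return table


if __name__ == "__main__":
    for element, names in parse_sum_table(SAMPLE_TABLE).items():
        print(element, "->", names)
```

A check like this makes it easy to spot a misspelled element or a stray separator in a customized table before running a Gatherer with it.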
In HTML, the document title is written as:
<TITLE>My Home Page</TITLE>
The above translation table will place this in the SOIF summary as:
title{13}: My Home Page
Note that "keywords,parent" occurs frequently in the table. For any specially marked text (bold, emphasized, hypertext links, etc.), the words will be copied into the keywords attribute and also left in the content of the parent element. This keeps the body of the text readable by not removing certain words.
Any text that appears inside a pair of CODE tags will not show up in the summary because we specified "ignore" as the SOIF attribute.
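The effect of the "keywords,parent" and "ignore" entries can be illustrated with a short Python sketch. This is not how Harvest implements its summarizer (Harvest uses an SGML parser driven by a DTD); it only mimics the two rules just described, using Python's standard html.parser module. The RULES subset, the class name SketchSummarizer, and the function summarize are hypothetical names introduced for this example.

```python
from html.parser import HTMLParser

# A subset of the default table. Names are lower-cased because
# html.parser reports element names in lower case.
RULES = {
    "b": ("keywords", "parent"),
    "em": ("keywords", "parent"),
    "code": ("ignore",),
    "body": ("body",),
}


class SketchSummarizer(HTMLParser):
    def __init__(self):
        super().__init__()
        self.stack = []    # currently open elements that have a rule
        self.pieces = {}   # SOIF attribute name -> list of text fragments

    def handle_starttag(self, tag, attrs):
        if tag in RULES:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()

    def handle_data(self, data):
        # Walk from the innermost open element outwards.
        for tag in reversed(self.stack):
            rule = RULES[tag]
            if "ignore" in rule:
                return  # text inside an ignored element is dropped
            for name in rule:
                if name != "parent":
                    self.pieces.setdefault(name, []).append(data)
            if "parent" not in rule:
                return  # without "parent", the text is not passed upward


def summarize(html):
    """Return {attribute: text} with whitespace runs collapsed."""
    parser = SketchSummarizer()
    parser.feed(html)
    parser.close()
    return {name: " ".join("".join(parts).split())
            for name, parts in parser.pieces.items()}


if __name__ == "__main__":
    page = "<body>Harvest is <b>fast</b> and <code>x = 1</code> small</body>"
    print(summarize(page))
```

In this sketch the word inside the B element lands in both keywords and body, while the text inside CODE appears in neither, which matches the behavior described above.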
URLs in HTML anchors are written as:
<A HREF="http://harvest.cs.colorado.edu/">
The specification for <A:HREF> in the above translation table causes this to appear as:
url-references{32}: http://harvest.cs.colorado.edu/
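Table entries of the form <ELEMENT:ATTRIBUTE>, such as <A:HREF> and <IMG:SRC>, take their value from a tag attribute rather than from the element's text. The Python sketch below mimics that lookup with the standard html.parser module. It is an illustration under stated assumptions, not Harvest's implementation: the names ATTRIBUTE_RULES, AttributeCollector, and collect_attributes are hypothetical, the file name logo.gif is made up, and the sketch returns plain values without the SOIF size field.

```python
from html.parser import HTMLParser

# <ELEMENT:ATTRIBUTE> rules from the default table, lower-cased because
# html.parser reports element and attribute names in lower case.
ATTRIBUTE_RULES = {
    ("a", "href"): "url-references",
    ("img", "src"): "images",
}


class AttributeCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.found = {}  # SOIF attribute name -> list of values

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            soif_name = ATTRIBUTE_RULES.get((tag, name))
            # value is None for attributes written without "=value".
            if soif_name is not None and value is not None:
                self.found.setdefault(soif_name, []).append(value)


def collect_attributes(html):
    """Return {soif attribute: [values]} for tags matching a rule."""
    parser = AttributeCollector()
    parser.feed(html)
    parser.close()
    return parser.found


if __name__ == "__main__":
    page = ('<A HREF="http://harvest.cs.colorado.edu/">Harvest</A>'
            '<IMG SRC="logo.gif">')
    for name, values in collect_attributes(page).items():
        for value in values:
            print(name + ":", value)
```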