Starting with Harvest Version 1.2, it is possible to summarize any document that conforms to the Standard Generalized Markup Language (SGML) [12], provided you have a Document Type Definition (DTD) for it. The World-Wide Web's HyperText Markup Language (HTML) is itself a particular application of SGML, with a corresponding DTD. (In fact, the Harvest HTML summarizer now uses the HTML DTD and our SGML summarizing mechanism, which provides various advantages; see Section 4.5.2.) SGML is used in an increasingly broad variety of applications, for example as a format for storing data in a number of physical sciences. Because SGML allows documents to carry a good deal of explicit structure, Harvest can summarize SGML documents very effectively.
The SGML summarizer (SGML.sum) uses the sgmls program by James Clark to parse the SGML document. The parser needs both a DTD for the document and a Declaration file that describes the allowed character set. The SGML.sum program uses a table that maps SGML tags to SOIF attributes.
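To make the role of the tag-to-attribute table concrete, the following Python sketch shows one way such a mapping could be applied to parser output. It is an illustration only, not the SGML.sum implementation. It assumes the parser emits sgmls-style line-oriented output, in which `(NAME` opens an element, `)NAME` closes it, and `-text` carries character data. The table contents, the function names summarize_esis and format_soif, the sample input, and the URL are all hypothetical. Only the `\n` record-end escape is handled, and element attributes are ignored.

```python
from collections import defaultdict

# Hypothetical table: SGML element name -> SOIF attribute name.
# The real Harvest table has its own file format; a dict stands in for it here.
TAG_TO_SOIF = {
    "TITLE": "title",
    "H1": "headings",
    "H2": "headings",
}


def summarize_esis(esis_lines, table):
    """Collect character data found under elements that the table maps."""
    open_elements = []               # stack of currently open element names
    collected = defaultdict(list)    # SOIF attribute -> list of text pieces
    for line in esis_lines:
        line = line.rstrip("\n")
        if not line:
            continue
        kind, rest = line[0], line[1:]
        if kind == "(":              # start of element
            open_elements.append(rest)
        elif kind == ")":            # end of element
            if open_elements:
                open_elements.pop()
        elif kind == "-":            # character data
            # Credit the text to the innermost enclosing element in the table.
            for name in reversed(open_elements):
                if name in table:
                    text = rest.replace("\\n", " ").strip()
                    if text:
                        collected[table[name]].append(text)
                    break
        # Other line types (attributes, "C" for conforming, ...) are skipped.
    return {attr: "\n".join(parts) for attr, parts in collected.items()}


def format_soif(url, attributes):
    """Render attributes as a SOIF-style record with byte counts in braces."""
    lines = ["@FILE { " + url]
    for name, value in attributes.items():
        size = len(value.encode("utf-8"))
        lines.append(f"{name}{{{size}}}:\t{value}")
    lines.append("}")
    return "\n".join(lines)


# Illustrative parser output for a small HTML document (not real sgmls output).
SAMPLE_ESIS = """(HTML
(HEAD
(TITLE
-Harvest User's Manual
)TITLE
)HEAD
(BODY
(H1
-Introduction
)H1
(P
-Body text that no table entry maps.
)P
(H2
-SGML summarizing
)H2
)BODY
)HTML
C
"""

if __name__ == "__main__":
    summary = summarize_esis(SAMPLE_ESIS.splitlines(), TAG_TO_SOIF)
    print(format_soif("http://example.com/doc.html", summary))
```

In this sketch, text inside an element that has no table entry (the P element above) is dropped unless some enclosing element is mapped, and several tags may feed the same SOIF attribute. Both are design choices of the example rather than documented behavior of SGML.sum.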