Harvest is an integrated set of tools to gather, extract, organize, search, cache, and replicate relevant information across the Internet [5]. With modest effort, users can tailor Harvest to digest information in many different formats and to offer custom search services on the Internet.
A key goal of Harvest is to provide a flexible system that can be configured in various ways to create many types of indexes, making very efficient use of Internet servers, network links, and index space on disk. Our measurements indicate that Harvest can reduce server load by a factor of over 6,000, network traffic by a factor of 60, and index space requirements by a factor of over 40 when building indexes compared with other systems, such as Archie, WAIS, and the World Wide Web Worm.
Harvest also allows users to extract structured (attribute-value pair) information from many different information formats and build indexes that allow these attributes to be referenced during queries (e.g., searching for all documents with a certain regular expression in the title field).
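As a rough illustration of attribute-based querying, the sketch below stores each document summary as a set of attribute-value pairs and selects those whose title matches a regular expression. The record layout, the example URLs, and the function name `search_attribute` are assumptions made for this example; they are not Harvest's actual storage format or query interface.

```python
import re

# Hypothetical records: each document is summarized as attribute-value pairs.
records = [
    {"url": "http://example.org/a.txt", "title": "Harvest User Manual", "author": "A. Author"},
    {"url": "http://example.org/b.txt", "title": "Notes on caching", "author": "B. Writer"},
    {"url": "http://example.org/c.txt", "author": "C. Nobody"},  # no title attribute
]

def search_attribute(records, attribute, pattern):
    """Return the records whose `attribute` value matches the regular expression."""
    regex = re.compile(pattern, re.IGNORECASE)
    # Records lacking the attribute are skipped rather than treated as errors.
    return [r for r in records if attribute in r and regex.search(r[attribute])]

# Find all documents whose title field mentions Harvest or caching.
matches = search_attribute(records, "title", r"harvest|cach(e|ing)")
for record in matches:
    print(record["url"], "-", record["title"])
```

Restricting the match to a named attribute is what distinguishes this kind of query from a plain full-text search: the same pattern appearing only in the author field would not select the document.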
An important advantage of Harvest is that it provides a data gathering architecture for constructing indexes. This stands in contrast to WHOIS++ [19] (which requires users to construct indexing templates manually) and GILS [1] (which does not define how index data are collected). Harvest allows users to build indexes using manually constructed templates (for maximum control over index content), templates constructed automatically from extracted data (for easy coverage of large collections), or a hybrid of the two methods.
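One way to picture the hybrid approach is as a merge of two attribute sets for the same document. The sketch below is only an illustration under stated assumptions: the function `build_template`, the attribute names, and the rule that manual values override extracted ones are all hypothetical and are not taken from Harvest itself.

```python
def build_template(extracted, manual=None):
    """Combine automatically extracted attributes with manual ones.

    Assumption for this sketch: manually supplied values take precedence.
    """
    template = dict(extracted)       # start from the automatically extracted data
    template.update(manual or {})    # overlay any hand-written attributes
    return template

# Automatically extracted data covers many documents cheaply but may be crude.
extracted = {"title": "untitled.ps", "type": "PostScript", "keywords": "index cache"}
# A manual entry gives precise control over selected attributes.
manual = {"title": "Harvest Technical Report"}

hybrid = build_template(extracted, manual)
print(hybrid)
```

Passing no manual attributes reduces this to the fully automatic case, while supplying every attribute by hand corresponds to the fully manual case.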
For more detailed comparisons with related systems, see [4] or our online FAQ.
We provide an overview of the Harvest subsystems in the next section.