Date: Wed, 28 Jan 1998 15:44:38 -0500 (EST)
From: Judy Hallman <hallman@email.unc.edu>
To: Web-Walkers <web-walkers@unc.edu>
Subject: Summary of Web-Walkers meeting, Nov. 5, on Verity
Summary of Web-Walkers meeting November 5, 1997
Verity Search Engine
Jim Murrell, ATN, jim_murrell@unc.edu, led the discussion.
UNC-CH previously ran Harvest. It took lots of disk space, was slow building indexes, and if it broke while it was indexing, Jim had to start it over. Verity was easy to install and can index in 6 hours what Harvest did in 4 days.
Verity uses a meta-database -- it refers to (links to) documents on the server, not the real host. It produces a score, but there is no explanation of what the score means.
UNC-CH is currently running two versions of Verity; Systems is experimenting with the new version. A new machine has been ordered for running test versions. The test version requires an id and password of an administrator.
A "collection" is a domain or host. Our license agreement is for up to 50 collections (50 servers). Collections can have subcollections. It is possible to index less than a host; for example, to index the Library and Journalism directories on SunSITE.
We are currently indexing many types of files, including ASCII, HTML, Acrobat PDF, and WYSIWYG. ASCII files include core dumps and e-mail files. HTML files include .html and .htm files. We will stop indexing ASCII files so that core dumps and e-mail files will no longer be indexed.
Re-indexing (by spider) is hard on the network. Verity can do incremental indexing: if a file has not been changed, its index entries are left as they were; if the file is gone, its entries are removed from the index. It can index or exclude based on patterns in URLs. There is a compression feature that compresses 15.16 gig to 1 gig.
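As a rough illustration of the incremental idea described above (this is not Verity's actual implementation, and the function name, the modification-time comparison, and the shell-style exclude patterns are all assumptions made for the sketch), an update pass might look something like this in Python:

```python
from fnmatch import fnmatchcase


def incremental_update(index, current_files, exclude_patterns=()):
    """Bring `index` (URL -> modification time) up to date in place.

    `current_files` maps each URL the spider found to its modification time.
    Returns two sorted lists: URLs that were (re)indexed and URLs removed.
    """
    # Drop any URL that matches one of the exclude patterns.
    wanted = {
        url: mtime
        for url, mtime in current_files.items()
        if not any(fnmatchcase(url, pat) for pat in exclude_patterns)
    }

    reindexed = []
    for url, mtime in wanted.items():
        # Unchanged files are skipped; new or changed files are indexed.
        if index.get(url) != mtime:
            index[url] = mtime
            reindexed.append(url)

    removed = []
    for url in list(index):
        # Files that are gone (or now excluded) lose their index entries.
        if url not in wanted:
            del index[url]
            removed.append(url)

    return sorted(reindexed), sorted(removed)
```

Only the new and changed files cost any indexing work, which is why an update can be much quicker than a full index.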
Frequency of updating can vary from one server to another. The ATN server (help.unc.edu) might take an hour to index completely and 1/2 hour for an update, while Ra might take 24 hours to index completely and 5-6 hours to update. Update frequency can be specified by URL -- for example, the DTH directory could be indexed daily, the Gazette weekly, and everything else weekly.
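A per-URL update schedule of this kind could be sketched as a table of URL prefixes and intervals. The prefixes below are made-up placeholders (not the real DTH or Gazette addresses), and the longest-prefix rule is an assumption for the sketch, not something Verity is known to do:

```python
from datetime import datetime, timedelta

# Hypothetical schedule: URL prefix -> days between updates.
SCHEDULE = {
    "http://www.example.edu/dth/": 1,      # daily
    "http://www.example.edu/gazette/": 7,  # weekly
}
DEFAULT_DAYS = 7  # everything else: weekly


def interval_days(url, schedule=SCHEDULE, default=DEFAULT_DAYS):
    """Return the update interval for `url`, using the longest matching prefix."""
    matches = [prefix for prefix in schedule if url.startswith(prefix)]
    if not matches:
        return default
    return schedule[max(matches, key=len)]


def is_due(url, last_indexed, now):
    """True if `url` was last indexed at least one full interval ago."""
    return now - last_indexed >= timedelta(days=interval_days(url))
```

For example, with a last run two days ago, is_due() would report a page under the daily prefix as due and a page under the weekly prefix as not yet due.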
There is HTML code to display the query form and HTML code to display results. In the results display, clicking on the title brings up a page on the ATN server (Bes), while clicking on the URL goes to the real server. To start with, the title will not be clickable.
Verity can't display the size of the file, at least not yet.
ATN will bring in all of the current 50 servers on campus that are linked from the home page (they are listed at http://www.unc.edu/campus/aboutweb/howto/servers.html), including parts of SunSITE. Directories can be excluded by using robots.txt files. Verity obeys Unix permissions regarding who can read files; Jim will check on password-protected files to make sure they don't get indexed.
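For server maintainers who want to see what a robots.txt exclusion does, here is a small sketch using Python's standard urllib.robotparser module. The directory names, the host www.example.edu, and the agent name "example-spider" are placeholders for illustration; they are not the names Verity or any campus server uses:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that keeps every spider out of two directories.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /mail/
"""


def make_checker(robots_text):
    """Parse robots.txt text and return a parser that can answer can_fetch()."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


def allowed(parser, url, agent="example-spider"):
    """True if the rules permit `agent` to fetch `url`."""
    return parser.can_fetch(agent, url)
```

With the sample file above, a page under /private/ or /mail/ would be reported as off limits, while the rest of the server would remain open to indexing.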
Jim will put the new version into production.
The Verity user's guide is online at http://bes.isis.unc.edu/~murrell/
Attendees:
Deb Aikat, JOMC, daikat@email.unc.edu
Lisa Croucher, ATN, lisa_croucher@unc.edu
Elizabeth A. Evans, Carolina Population Center, evans@unc.edu
Judy Hallman, ATN, judy_hallman@unc.edu
Linda Jen, ATN, linda_jen@unc.edu
Tom Sweet, ATN, tom_sweet@unc.edu
Debby Weiss for David Romani, Davis Library Systems, weisd@ils.unc.edu
Judy Hallman (judy_hallman@unc.edu, http://www.unc.edu/~hallman/)
Campus Webmaster, UNC-Chapel Hill