The WebCrawler is a tool for searching for Web documents. It constructs a
database by traversing the Internet using a Web Robot and then
indexing the full text with a simple filtering mechanism. The search engine
processes each user request by evaluating each document against the
keywords to compute a weighted sum, then returns a sorted list of matching
documents.
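The weighted-sum ranking described above can be sketched roughly as follows. This is only an illustration: the source does not say how WebCrawler weights a match, so the sketch assumes a simple keyword-occurrence count with optional per-keyword weights. The names `score`, `search`, and the sample `index` are hypothetical.

```python
from collections import Counter

def score(doc_terms, keywords, weights=None):
    """Weighted sum of keyword occurrences in one document (assumed scheme)."""
    counts = Counter(doc_terms)
    weights = weights or {}
    # Each keyword contributes its weight (default 1.0) times its frequency.
    return sum(weights.get(k, 1.0) * counts[k] for k in keywords)

def search(index, keywords, limit=10):
    """Return up to `limit` (doc_id, score) pairs, best match first."""
    scored = [(doc_id, score(terms, keywords)) for doc_id, terms in index.items()]
    hits = [(doc_id, s) for doc_id, s in scored if s > 0]
    # Sort by descending score; break ties by document name for stable output.
    hits.sort(key=lambda pair: (-pair[1], pair[0]))
    return hits[:limit]

# Hypothetical three-document index: document name -> list of index terms.
index = {
    "a.html": ["search", "engine", "search"],
    "b.html": ["robot", "search"],
    "c.html": ["music"],
}
results = search(index, ["search", "engine"])
```

Here `a.html` ranks first because it contains both keywords, and `c.html` is dropped because it matches neither.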
Key Links
URL for Front Page:
http://webcrawler.com/WebCrawler/Home.html
URL for Forms Search Page:
http://webcrawler.com/
URL for Non-Forms Search Page:
http://webcrawler.com/cgi-bin/WebQuery
URL for FAQ Page:
http://webcrawler.com/WebCrawler/Help/FAQ.html
URL for Help Page:
http://webcrawler.com/WebCrawler/Help/Help.html
URL for Creator's Page:
http://info.webcrawler.com/bp/bio.html
URL for Staff Page:
http://webcrawler.com/WebCrawler/Facts/Team.html
Home Organization: Originally created at the University of Washington in
Seattle, WebCrawler is now owned and operated by America Online, Inc.
Organization
- WebCrawler is an exclusively searchable database of Web documents,
built on a custom software engine written by the author in C.
- Features and Limitations:
- Supports simple Boolean OR (the default) or Boolean AND (by
clicking the Forms checkbox) across multiple keywords, but does not
handle Boolean NOT, complex Boolean combinations, or proximity
searching.
- The database builds its index by identifying words on space
and punctuation boundaries, converting them to lowercase, and
stripping off common suffixes such as -s, -er, and
-ment. It also filters out common words such as web,
Internet, be, and, and or.
- The server weights the hits by the quality of the match between keywords
and documents, then returns the highest-ranking documents in sorted
order. The user chooses the number of hits from a fixed set of amounts
(10, 25, 100, or 500).
- The engine indexes and searches across filenames, document titles,
and full textual content.
- WebCrawler provides both Forms and Non-Forms interfaces to the search
engine; however, Forms support is required for most of the search features.
- The information catalogued by WebCrawler has no specific focus or
content restrictions.
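The indexing and Boolean matching rules listed above can be illustrated with a minimal sketch. It is not WebCrawler's actual C code: the suffix and common-word lists contain only the examples named in the text, and the order of the steps, the single-suffix rule, and the minimum stem length of three letters are assumptions. The names `index_terms`, `strip_suffix`, and `matches` are hypothetical.

```python
import re

STOP_WORDS = {"web", "internet", "be", "and", "or"}  # only the examples named in the text
SUFFIXES = ("ment", "er", "s")                       # only the examples named in the text

def strip_suffix(word):
    """Remove at most one listed suffix, keeping a stem of 3+ letters (assumed rule)."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def index_terms(text):
    """Split on space/punctuation, lowercase, drop common words, strip suffixes."""
    words = re.split(r"[^A-Za-z0-9]+", text.lower())
    return [strip_suffix(w) for w in words if w and w not in STOP_WORDS]

def matches(doc_terms, keywords, require_all=False):
    """Boolean OR by default; Boolean AND when require_all is True."""
    found = [k in doc_terms for k in keywords]
    return all(found) if require_all else any(found)

terms = index_terms("Web Robots, Crawler and Management")
# terms is ["robot", "crawl", "manage"]
```

In a real query path the user's keywords would presumably pass through the same `index_terms` normalization before being compared with the index, so that "robots" and "robot" match the same entry.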
Administration
- Document information is gathered automatically by a custom Web searcher
and from user-suggested URLs.
- Average response time for basic access is about 5 seconds, and searches
return within 30 seconds.
- The server runs on a Pentium computer under NextStep, and the
document-gathering engine operates from a second, similar machine.
- The WebCrawler index currently contains information on over 420,000
documents.
- The layout and organization of the server are very simple and the
information provided is quite helpful. The flexibility of the search
engine (smart truncation, etc.), the simplicity of the search page,
and the formatting of the search results make the server ideal for
both new and experienced users.
- Additional Services
- The help page demonstrates sample queries, with suggestions for
improving search quality, and a description of the indexing process.
- The server maintains a list of the
Top 25
URLs linked from other documents. This is not a reflection
of the actual traffic on a particular document, but the number of
hotlists and index pages that include a pointer to it.
- The server allows users to suggest documents for inclusion in
the search database.
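The Top 25 list mentioned above counts pointers to a document rather than visits to it. A minimal sketch of that kind of count follows, assuming a link graph that maps each page to the URLs it points to. The function name `top_linked`, the choice to count each linking page only once, and the sample data are all hypothetical, not taken from WebCrawler.

```python
from collections import Counter

def top_linked(link_graph, n=25):
    """Rank URLs by the number of distinct pages that link to them."""
    inbound = Counter()
    for source, targets in link_graph.items():
        for target in set(targets):   # count each linking page once (assumption)
            if target != source:      # ignore self-links (assumption)
                inbound[target] += 1
    # Most-linked first; ties broken alphabetically for stable output.
    return sorted(inbound.items(), key=lambda item: (-item[1], item[0]))[:n]

# Hypothetical link graph: page -> URLs it links to.
graph = {
    "hotlist1.html": ["a.html", "b.html", "a.html"],
    "hotlist2.html": ["a.html"],
    "index.html": ["b.html", "c.html", "a.html"],
}
top = top_linked(graph, n=2)
```

With this data `a.html` leads with three linking pages and `b.html` follows with two, regardless of how often either page is actually visited.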
This collection is Copyright © 1995-6 by Matt Slot, but has been designed
for public use. Permission is hereby granted for unlimited print and electronic
redistribution. Your feedback is
appreciated.
Matt Slot *
fprefect@ambrosiasw.com *
2/21/96