One way to organize information on the Internet is to create a document or collection that maintains lists of links organized by their content. Such a service, often called a Subject Catalog or Subject Index, is often very easy to navigate and use, since locating desired information is simply a matter of trying the links under an appropriate topic.
Because subject catalogs must be carefully organized, they often require an administrative staff or dedicated contributors and guest editors to locate useful documents and link them under the relevant subject heading. Well-maintained services will often include a brief summary, or Abstract, and other information to help users select the most useful materials.
Subject catalogs are often organized hierarchically to make it much easier to navigate from the general to the specific topic of interest. Well-written catalogs also contain cross-references between related topics under different headings, such as "Business:Computer Sales" and "Computers:Vendors".
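As a rough illustration of a hierarchy with cross-references, the sketch below stores headings as nested dictionaries. The layout, the "links" and "see_also" field names, and the example.com URLs are all assumptions made for this example, not a description of how any particular catalog is built.

```python
# Hypothetical in-memory subject catalog: nested dicts hold the headings,
# and "see_also" stores cross-references as "Heading:Sub-heading" paths.
catalog = {
    "Business": {
        "Computer Sales": {
            "links": ["http://example.com/sales"],
            "see_also": ["Computers:Vendors"],
        },
    },
    "Computers": {
        "Vendors": {
            "links": ["http://example.com/vendors"],
            "see_also": ["Business:Computer Sales"],
        },
    },
}


def find_heading(catalog, path):
    """Walk from the general heading to the specific one."""
    node = catalog
    for heading in path.split(":"):
        node = node[heading]
    return node


def links_with_cross_refs(catalog, path):
    """Return a heading's own links, then those of its cross-references."""
    node = find_heading(catalog, path)
    found = list(node["links"])
    for other in node["see_also"]:
        found.extend(find_heading(catalog, other)["links"])
    return found
```

Calling links_with_cross_refs(catalog, "Business:Computer Sales") returns the sales link followed by the vendors link, which is the effect a cross-reference has for a browsing user.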
A user selects and navigates the links of a subject index by their relation to the desired information, or can simply browse the listed categories for interesting links. Since the documents are grouped by their content, once a suitable file is found, there are often many more links in the same section.
Subject catalogs are useful for people who have a general idea about the information they are seeking, but just don't know where to start.
Services such as YAHOO! and the EINet Galaxy work hard to catalog as many Internet resources as possible -- typically, hundreds or thousands of document collections a week! Resources are gathered by accepting URL suggestions from users, scanning newsgroups for WWW announcements, and watching other subject catalogs for new links.
By cataloguing as many resources as possible, these servers aim to provide a complete list of relevant documents for each subject area. In this model, the burden of selecting the best resources from this "complete" list is often left to the user.
Other services like the Whole Internet Catalog and IPL follow a different philosophy. Rather than maximizing the number of links on their servers, they keep abreast of efforts and collections in various fields. This does not mean that they don't take suggestions or list new resources, but that linked documents are carefully evaluated before incorporation.
WIC and IPL users will appreciate the extended abstracts and careful maintenance that the administrators provide. The links they recommend are hand-picked as the most valuable or complete within the subject area. Aging and dead links are quickly updated or removed by the staff. A user may not find all of the relevant information in a field using such an index, but they will almost certainly find something useful.
Larger catalogs can contain numerous levels of abstraction. For example, YAHOO! is organized into 14 general subject headings and 50 sub-headings on the first page. The user must navigate 3 or more index pages to actually find a page of links, but such pages typically contain more than 20 references to documents. In addition, the catalog makes abundant use of cross-references between related headings. The size and complexity of larger catalogs are simplified by careful organization, making them the most useful for general subject browsing.
The Subject Clearinghouse and EINet Galaxy use another technique to index Internet documents. The top-level document describes certain subject headings, which link to fairly comprehensive lists of Internet resources. These lists are generated by contributors or "guest editors", who scour the net looking for germane material.
Document lists are suitable for those who need a complete guide to online resources in a particular subject, because they are generally more complete than other services that register URLs from service announcements or user suggestions. However, such collections don't provide value-added services such as document abstracts -- the pages really are organized as simple lists of links.
As the number of indexed documents grows, administrators reorganize subject headings to accommodate popular topics and differentiate them into smaller groups. Careful maintenance makes this process of growth and adjustment transparent to the users.
Once a number of documents have been indexed, the user describes the desired information using selected words or phrases, called "keywords", that are entered into the computer. The search engine then examines the database for documents that match or are related to the user's criteria, and returns to the user a list of the selected documents.
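One common way to make such a lookup fast is an inverted index, which maps each word to the documents that contain it. The sketch below is only an illustration of that idea; the function names and the sample documents are invented for this example, and real engines may store their databases quite differently.

```python
import re
from collections import defaultdict


def build_index(documents):
    """Map every word to the set of document names that contain it."""
    index = defaultdict(set)
    for name, text in documents.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(name)
    return index


def find_documents(index, keyword):
    """Return the names of the documents that contain the keyword."""
    return sorted(index.get(keyword.lower(), set()))


# Invented sample data for illustration only.
sample_documents = {
    "a.html": "Cajun food recipes",
    "b.html": "Food safety guidelines",
}
sample_index = build_index(sample_documents)
```

With this sample data, find_documents(sample_index, "food") lists both documents, while find_documents(sample_index, "cajun") lists only a.html.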
Search engines work on the principle that the information content of a document can be summarized by the words that already appear in its title or text. By ranking each extracted word by where it appears (title or text), the number of times it appears in the document, and other criteria, the database distinguishes the words and phrases relevant to the topic from incidental ones, which produce the irrelevant matches known as false drops.
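A minimal sketch of this kind of ranking follows. It scores a document by counting keyword occurrences and weighting a title hit more heavily than a body hit. The weight of 3, the field names, and the sample documents are assumptions chosen for illustration; actual engines use their own, usually unpublished, criteria.

```python
import re
from collections import Counter

TITLE_WEIGHT = 3  # assumed: one title hit counts as much as three body hits


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def score(document, keywords):
    """Sum weighted keyword counts from the title and the body text."""
    title_counts = Counter(tokenize(document["title"]))
    body_counts = Counter(tokenize(document["text"]))
    return sum(
        TITLE_WEIGHT * title_counts[k.lower()] + body_counts[k.lower()]
        for k in keywords
    )


def rank(documents, keywords):
    """Return titles of matching documents, best score first."""
    scored = [(score(d, keywords), d["title"]) for d in documents]
    scored.sort(key=lambda pair: -pair[0])
    return [title for points, title in scored if points > 0]


# Invented sample data for illustration only.
sample_docs = [
    {"title": "Travel Notes", "text": "We ate cajun food once."},
    {"title": "Cajun Cooking", "text": "Cajun food uses spice. Cajun recipes follow."},
    {"title": "Gardening", "text": "Tomatoes and beans."},
]
```

Here rank(sample_docs, ["cajun"]) places "Cajun Cooking" ahead of "Travel Notes" and omits "Gardening" entirely, since the incidental mention in the travel page scores far lower than a page with the keyword in its title.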
To later retrieve such a document, the user must enter criteria that describe the document as it was extracted and indexed into the database. Often this means a user searching for a particular document must know enough about it to choose keywords that single out that file.
On the flip side, a user looking for a range of documents in a particular subject area should choose representative keywords that match the largest possible set while eliminating incidental matches and false drops. Effective use of keyword controls, such as Boolean and proximity operators, can focus or expand the results of a search as desired.
Search engines with larger databases are more likely to contain relevant documents, and to return larger result sets, for given criteria. For this reason, most search engines strive to index as many documents as possible and advertise the total.
Since document gathering is usually performed by automatic software, the database is rebuilt regularly. This means that new or updated documents are often indexed shortly after coming online, and that dead links are removed in a timely fashion. The database grows because the indexing software saves new links it discovers, as well as user-suggested URLs, which can be explored and then indexed in the next cycle.
To cut through most of the hype and numbers that are tossed around, the Matrix maintains database statistics for each searchable engine (save WWW Worm, which doesn't show that data).
Fulltext documents in searchable databases:
Natural Language Queries: For novice Internet users, this is probably the easiest way to search the Web. Users enter questions in natural English, and the server software extracts relevant keywords to create a database query. For example, the phrase "Find pages about AIDS, cancer, or heart disease" would resolve into the individual keywords AIDS, cancer, heart, and disease.
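The keyword extraction step might be sketched as follows: split the question into words and discard common "filler" words. The stop-word list here is an assumption made up for this example; each server would have its own list and rules.

```python
import re

# Assumed stop-word list for illustration; a real server's list would be longer.
STOP_WORDS = {
    "a", "an", "and", "about", "documents", "find", "for",
    "me", "of", "on", "or", "pages", "show", "the",
}


def extract_keywords(question):
    """Keep the words of a plain-English question that are not stop words."""
    words = re.findall(r"[A-Za-z0-9]+", question)
    return [w for w in words if w.lower() not in STOP_WORDS]
```

Applied to "Find pages about AIDS, cancer, or heart disease", this returns the four keywords AIDS, cancer, heart, and disease.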
Boolean Linking: One of the most popular ways servers handle multiple keywords is by linking each with a Boolean AND or Boolean OR. For example, the query "food cajun spicy" would return only documents that contain all of the keywords food, cajun, and spicy if linked with Boolean AND, and would return any documents containing any one of the keywords food, cajun, or spicy if linked by Boolean OR. Although certain engines simply perform one kind of linking, others let you select whether to perform a narrow search using AND or a broad search using OR.
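The difference between the two kinds of linking can be sketched in a few lines. The function names and the three sample documents are invented for this example, and matching is done on whole lowercase words only.

```python
def matches(text, keywords, mode="AND"):
    """Test a document against keywords linked by a single Boolean operator."""
    words = set(text.lower().split())
    hits = [k.lower() in words for k in keywords]
    return all(hits) if mode == "AND" else any(hits)


def search(documents, query, mode="AND"):
    """Return the names of documents that satisfy the linked query."""
    keywords = query.split()
    return [name for name, text in documents.items()
            if matches(text, keywords, mode)]


# Invented sample data for illustration only.
sample_pages = {
    "gumbo": "spicy cajun food from louisiana",
    "pizza": "italian food with cheese",
    "pepper": "a spicy red pepper",
}
```

With these pages, search(sample_pages, "food cajun spicy", "AND") finds only "gumbo", while the same query with "OR" finds all three.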
Boolean Controls: Similar to Boolean Linking above, this method connects multiple keywords with Boolean operators to improve a search by narrowing or broadening the search criteria. However, this technique provides much more control over the search parameters, because the user specifies how the words are linked using each of the Boolean operators AND, OR, and NOT, as well as the proximity operator NEAR. Using parentheses to group operations, users can create complex Boolean queries such as "((dog OR animal) AND (bite OR bitten)) OR rabies".
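To show how a grouped query like the one above could be evaluated against a single document, here is a small recursive-descent evaluator. It is a sketch under several assumptions: operators must be written in capitals, NOT binds tighter than AND, which binds tighter than OR, and the proximity operator NEAR is left out to keep the example short. No particular engine is known to work this way.

```python
import re


def evaluate(query, text):
    """Return True if the text satisfies a query using AND, OR, NOT and ()."""
    words = set(re.findall(r"[a-z0-9]+", text.lower()))
    tokens = re.findall(r"\(|\)|[^\s()]+", query)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        token = tokens[pos]
        pos += 1
        return token

    def parse_or():
        value = parse_and()
        while peek() == "OR":
            advance()
            right = parse_and()  # always parse, so the tokens are consumed
            value = value or right
        return value

    def parse_and():
        value = parse_not()
        while peek() == "AND":
            advance()
            right = parse_not()
            value = value and right
        return value

    def parse_not():
        if peek() is None:
            raise ValueError("unexpected end of query")
        if peek() == "NOT":
            advance()
            return not parse_not()
        if peek() == "(":
            advance()
            value = parse_or()
            if peek() != ")":
                raise ValueError("missing closing parenthesis")
            advance()
            return value
        token = advance()
        if token in (")", "AND", "OR"):
            raise ValueError("unexpected token: " + token)
        return token.lower() in words

    result = parse_or()
    if pos != len(tokens):
        raise ValueError("unexpected token: " + tokens[pos])
    return result
```

For the query "((dog OR animal) AND (bite OR bitten)) OR rabies", a page about a dog that is known to bite matches, a page about rabies vaccination matches, and a page about a dog that sleeps all day does not.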
Note: The Overview document categorizes both Boolean Linking and Boolean Controls under the same header. Services that only support a single form of linking are typically rated lower than those which let the user specify whether to link with AND or OR. Similarly, servers which support complex Boolean syntax are rated much higher.
Keyword Controls: Rather than requiring some relation between keywords, some search engines allow each keyword to be qualified individually. Each keyword in the query can be prefixed with a special character like + or - to indicate that it is required (much like Boolean AND) or that it is required to not be in the document. Often, unqualified keywords are linked with a Boolean OR by default. For example, the Keyword Control query "candy sugar dentist -saccharine +cavity" is equivalent to the Boolean "(candy OR sugar OR dentist) AND (NOT saccharine) AND cavity".
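The equivalence described above can be sketched as a simple matching function. The function name is invented for this example, and the handling of unqualified keywords (at least one must appear) follows the Boolean equivalent given in the text; individual engines may treat them differently.

```python
def matches_keyword_controls(query, text):
    """Apply +required, -excluded and unqualified (ORed) keywords to a text."""
    words = set(text.lower().split())
    required, excluded, optional = [], [], []
    for term in query.lower().split():
        if term.startswith("+"):
            required.append(term[1:])
        elif term.startswith("-"):
            excluded.append(term[1:])
        else:
            optional.append(term)
    if any(term in words for term in excluded):
        return False
    if not all(term in words for term in required):
        return False
    # Unqualified keywords are linked with OR: at least one must appear.
    return not optional or any(term in words for term in optional)
```

Using the query "candy sugar dentist -saccharine +cavity", the text "sugar causes a cavity" matches, while "saccharine candy and a cavity" is rejected because it contains the excluded word.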
Keyword Truncation: Finally, most systems perform some sort of suffix management on keywords. This helps users to get the most for their queries by generalizing each keyword to its root, and expanding the search to include all forms of that root word. On such a server, a query containing the keyword computers may actually return documents containing compute, computed, computer, computes, computers, and computing. Some servers allow the user to choose which words are truncated, typically by appending a * character to the end of the root word, like comput*; most others, however, perform the truncation automatically according to their own rules.
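Both styles of truncation can be sketched briefly. The suffix list below is a deliberately naive assumption that happens to cover the "computers" example; real servers apply their own, more careful rules, and the function names are invented for this illustration.

```python
# Assumed, illustrative suffix list; longer suffixes are tried first.
SUFFIXES = ("ers", "ing", "ed", "er", "es", "e", "s")


def root(word):
    """Strip one common suffix, keeping a root of at least three letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def term_matches(term, word):
    """Match with user-chosen truncation (comput*) or automatic truncation."""
    if term.endswith("*"):
        return word.lower().startswith(term[:-1].lower())
    return root(term) == root(word)
```

With these rules, the keyword computers matches compute, computed, computer, computes, computers, and computing, and the explicit form comput* matches the same words without matching an unrelated word such as company.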
Matt Slot * fprefect@ambrosiasw.com * 6/8/96