Web Matrix: What's the Difference?

Some Answers about Search Engines and Subject Catalogs


This document answers some common questions about Internet services. Although it is organized in question/answer pairs, it's not really a FAQ or terminology reference. However, these are issues that I feel should be addressed for a complete (and honest) discussion. This document will be referenced and linked at appropriate points in the actual Matrix to provide a little more context for the discussion.

Subject Catalogs

What is a Subject Catalog?

One way to organize information on the Internet is to create a document or collection that maintains lists of links organized by their content. Such a service, known as a Subject Catalog or Subject Index, is often very easy to navigate and use, since locating desired information is simply a matter of trying the links under an appropriate topic.

Because subject catalogs must be carefully organized, they often require an administrative staff or dedicated contributors and guest editors to locate useful documents and link them under the relevant subject heading. Well-maintained services will often include a brief summary, or Abstract, and other information to help users select the most useful materials.

Subject catalogs are often organized hierarchically to make it much easier to navigate from the general to the specific topic of interest. Well-written catalogs also contain cross-references between related topics under different headings, such as "Business:Computer Sales" and "Computers:Vendors".

A user selects and navigates the links of a subject index by their relation to the desired information, or can simply browse the listed categories for interesting links. Since the documents are grouped by their content, once a suitable file is found, there are often many more links in the same section.

Subject catalogs are useful for people who have a general idea about the information they are seeking, but just don't know where to start.

Is size really important?

The usefulness of a Subject Index depends on the value of the links it contains. There are two approaches that administrators take to maximize the usefulness of their collection.

Services such as YAHOO! and the EINet Galaxy work hard to catalog as many Internet resources as possible -- typically, hundreds or thousands of document collections a week! Resources are gathered by accepting URL suggestions from users, scanning newsgroups for WWW announcements, and watching other subject catalogs for new links.

By cataloguing as many resources as possible, these servers aim to provide a complete list of relevant documents for each subject area. In this model, the burden of selecting the best resources from this "complete" list is often left to the user.

Other services like the Whole Internet Catalog and IPL follow a different philosophy. Rather than maximizing the number of links on their servers, they keep abreast of efforts and collections in various fields. This does not mean that they don't take suggestions or list new resources, but that linked documents are carefully evaluated before incorporation.

WIC and IPL users will appreciate the extended abstracts and careful maintenance that the administrators provide. The links they recommend are hand-picked as the most valuable or complete within the subject area. Aging and dead links are quickly updated or removed by the staff. A user may not find all of the relevant information in a field using such an index, but they will almost certainly find something useful.

How are Subject Catalogs structured?

Most subject catalogs are hierarchically organized into layers from general to specific topics. As the number of links in a catalog increases, it becomes important to structure the information effectively. Smaller catalogs are often composed of a top-level directory that references secondary pages of links, abstracts, and other information. Smaller catalogs are best suited to novice users, due to the valuable abstracts and carefully moderated links.

Larger catalogs can contain numerous levels of abstraction. For example, YAHOO! is organized into 14 general subject headings and 50 sub-headings on the first page. The user must navigate 3 or more index pages to actually find a page of links, but such pages typically contain more than 20 references to documents. In addition, the catalog makes abundant use of cross-references between related headings. The size and complexity of larger catalogs are simplified by careful organization, making them the most useful for general subject browsing.
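As a rough illustration of the hierarchy and cross-references described above, here is a minimal Python sketch. It is not how any particular catalog is implemented; the class name `Category`, the helper `find`, and the example entry and URL are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Category:
    """One heading in a hypothetical subject catalog."""
    name: str
    links: list = field(default_factory=list)      # (title, url, abstract) tuples
    children: dict = field(default_factory=dict)   # sub-heading name -> Category
    see_also: list = field(default_factory=list)   # cross-references as heading paths

    def add_child(self, name):
        child = Category(name)
        self.children[name] = child
        return child


def find(root, path):
    """Walk from the top-level page down an 'A:B' style heading path."""
    node = root
    for part in path.split(":"):
        node = node.children[part]
    return node


# Build a tiny two-level catalog (all names and URLs are made up).
root = Category("Top")
business = root.add_child("Business")
computers = root.add_child("Computers")
sales = business.add_child("Computer Sales")
vendors = computers.add_child("Vendors")

vendors.links.append(("Example Vendor", "http://www.example.com/", "A placeholder entry."))

# Cross-reference the two related headings to each other.
sales.see_also.append("Computers:Vendors")
vendors.see_also.append("Business:Computer Sales")

# Following a cross-reference lands on the related page of links.
target = find(root, sales.see_also[0])
print(target.name, len(target.links))
```

Each extra level in the path corresponds to one more index page the user has to click through before reaching a page of links.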

The Subject Clearinghouse and EINet Galaxy use another technique to index Internet documents. The top-level document describes certain subject headings, which link to fairly comprehensive lists of Internet resources. These lists are generated by contributors or "guest editors", who scour the net looking for germane material.

Document lists are suitable for those who need a complete guide to online resources in a particular subject, because they are generally more complete than other services that register URLs from service announcements or user suggestion. However, such collections don't provide value-added services such as document abstracts -- the pages really are organized as simple lists of links.

Who does the work?

Subject catalogs are maintained by a fairly large number of administrators, editors, and contributors. New documents are found by automated search tools, new service announcements, and user suggestions. For each document, the administrator writes a descriptive abstract and inserts the information into the catalog under the appropriate subject heading or headings. Similarly, outdated documents or dead links must be removed by a human administrator or editor; this is typically handled in response to user feedback.

As the number of indexed documents grows, administrators reorganize subject headings to accommodate and differentiate popular topics into smaller groups. Careful maintenance makes this process of growth and adjustment transparent to the users.


Search Engines

What is a Search Engine?

Another way to collect and organize Internet resources uses large databases of information and presents a way to select documents based on certain words, phrases, or patterns within those documents. A Web spider or other software will examine a document and "index", or enter it into the database, based on words extracted from the title or text; in addition, the software also searches the document for pointers or URLs for other documents that haven't been indexed yet.
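To make the indexing step a little more concrete, here is a minimal sketch in Python using only the standard library. It assumes a simple inverted index (each word maps to the set of URLs containing it); the names `PageParser` and `index_page` and the sample page are invented for illustration, and real spiders are considerably more elaborate.

```python
import re
from collections import defaultdict
from html.parser import HTMLParser


class PageParser(HTMLParser):
    """Collects title words, body words, and outgoing links from one page."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title_words = []
        self.text_words = []
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        words = re.findall(r"[a-z0-9]+", data.lower())
        (self.title_words if self.in_title else self.text_words).extend(words)


def index_page(index, url, html):
    """Add one page to the inverted index; return the links it points to."""
    parser = PageParser()
    parser.feed(html)
    for word in parser.title_words + parser.text_words:
        index[word].add(url)
    return parser.links


index = defaultdict(set)   # word -> set of URLs containing that word
page = ('<html><head><title>Cajun Food</title></head>'
        '<body>Spice guide. <a href="http://www.example.com/gumbo.html">Gumbo</a></body></html>')
found = index_page(index, "http://www.example.com/", page)
print(sorted(index["cajun"]), found)
```

The returned list of links is what lets the software discover documents it has not indexed yet.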

Once a number of documents have been indexed, the user describes the desired information using selected words or phrases, called "keywords", that are entered into the computer. The search engine then examines the database for documents that match or are related to the user's criteria, and returns to the user a list of the selected documents.

Search engines work on the principle that the information content of a document can be summarized by extracting those words already in the title or text. By ranking the extracted text by its position in title or text, the number of times it appears in the document, and other criteria, the database separates incidental words or phrases, known as false drops, from those relevant to the topic.
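The ranking idea can be sketched in a few lines of Python. This is only one plausible scoring scheme, not the formula any named service uses: the weight of 5 for title matches is an arbitrary assumption, and the function names and sample documents are made up.

```python
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and split it into simple alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def score(keywords, title, body, title_weight=5):
    """Score a document for a query; title_weight is an arbitrary assumption."""
    title_counts = Counter(tokenize(title))
    body_counts = Counter(tokenize(body))
    total = 0
    for word in keywords:
        # A title hit counts more than a body hit; repeats add to the score.
        total += title_weight * title_counts[word] + body_counts[word]
    return total


docs = {
    "a.html": ("Rabies in Dogs", "Rabies is a virus. Rabies spreads through a dog bite."),
    "b.html": ("Travel Diary", "We saw a dog. Someone mentioned rabies once."),
}
query = ["rabies", "dog"]
ranked = sorted(docs, key=lambda url: score(query, *docs[url]), reverse=True)
print(ranked)
```

The second document mentions both keywords only in passing, so it scores lower and sorts after the first; a service could drop such low scorers entirely to cut down on false drops.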

To later retrieve such a document, the user must enter criteria that describe the document as it was extracted and indexed into the database. Often this means a user searching for a particular document must know enough about it to select the best keywords to selectively identify that file.

On the flip side, a user looking for a range of documents in a particular subject area should choose representative keywords that capture the largest possible set while eliminating incidental matches and false drops. Effective use of keyword controls, such as Boolean and proximity operators, can focus or expand the results of a search as desired.

Search engines with larger databases are typically much more likely to return larger and more relevant result sets for given criteria. For this reason, most search engines strive for large databases and advertise the number of documents they have indexed.

Since document gathering is usually performed by automatic software, the database is rebuilt regularly. This means that new or updated documents are often indexed shortly after coming online, and that dead links are removed in a timely fashion. The database grows because the indexing software saves new links it discovers, as well as user-suggested URLs, which can be explored and then indexed in the next cycle.
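Here is a small sketch of one rebuild cycle in Python. To keep it runnable without a network, it uses a made-up dictionary (`FAKE_WEB`) in place of real fetching, and it explores newly discovered links within the same cycle rather than deferring them to the next one; both are simplifying assumptions, as are all of the names.

```python
from collections import deque

# A stand-in for the network: URL -> (page text, outgoing links). All made up.
FAKE_WEB = {
    "http://a.example/": ("welcome page", ["http://b.example/", "http://dead.example/"]),
    "http://b.example/": ("second page", ["http://a.example/", "http://c.example/"]),
    "http://c.example/": ("third page", []),
}


def rebuild(seed_urls, fetch=FAKE_WEB.get):
    """One indexing cycle: visit every reachable URL, skipping dead ones."""
    database = {}                   # URL -> indexed text
    frontier = deque(seed_urls)     # URLs waiting to be explored
    seen = set(seed_urls)
    while frontier:
        url = frontier.popleft()
        page = fetch(url)
        if page is None:            # dead link: never enters the new database
            continue
        text, links = page
        database[url] = text
        for link in links:
            if link not in seen:    # save newly discovered links for exploring
                seen.add(link)
                frontier.append(link)
    return database


db = rebuild(["http://a.example/"])
print(sorted(db))
```

User-suggested URLs would simply be added to the seed list for the next call, for example `rebuild(list(db) + suggestions)` with a hypothetical `suggestions` list.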

Is size really important?

Probably the most important, and certainly the most popular, comparison of search engines is based on the number of documents indexed in their database. For this reason, most searchable databases advertise their size proudly and often compare themselves to other well-known services. However, there are several measures of number of documents in a database:
  1. Documents where the entire fulltext has been indexed.
  2. Documents where the URL, name, and headings or excerpts have been indexed.
  3. Documents where the URL and name have been indexed (e.g. images or sounds).
  4. Documents where some descriptive text has been indexed.

Some services index the complete text of a document, some only selected portions. Other databases count a document as indexed simply because another document contains its URL -- on the assumption that descriptive text accompanies such a hyperlink! Although each method represents a count of "indexed" documents, the first is the best measure of a service.

To cut through most of the hype and numbers that are tossed around, the Matrix maintains database statistics for each searchable engine (save WWW Worm, which doesn't show that data).

Fulltext documents in searchable databases:

  1. Alta Vista - 21 Million
  2. Lycos - 5 Million
  3. Inktomi - 2.8 Million
  4. InfoSeek - 1 Million
  5. WebCrawler - 420,000
  6. WWW Worm - <unknown>

Astute users will notice that these totals are an order of magnitude larger than any subject index; even YAHOO! has fewer than 500,000 links. Recognize that a searchable database describes every page in a collection to better represent the content of that collection, but a subject catalog only points to the front page or key pages of a collection. Considering that collections of pages can number 5, 10, 50, or more, the actual number of documents indexed by a subject catalog is significantly higher than the total links would indicate.

What are useful searching features?

Among the popular search engines, there are several key features that I feel are useful to most people. This is my chance to describe why you should care about the checkmarks on the overview charts.

Natural Language Queries: For novice Internet users, this is probably the easiest way to search the Web. Users enter questions in natural English, and the server software extracts relevant keywords to create a database query. For example, the phrase "Find pages about AIDS, cancer, or heart disease" would resolve into the individual keywords AIDS, cancer, heart, and disease.
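One simple way a server might pull keywords out of a question is to discard common filler words. The Python sketch below assumes exactly that; the stop-word list is a tiny hypothetical one, and actual services may do something quite different.

```python
import re

# A tiny, hypothetical stop-word list; a real service would need a larger one.
STOP_WORDS = {"find", "show", "me", "pages", "about", "or", "and",
              "the", "a", "an", "of", "for", "on", "in"}


def extract_keywords(question):
    """Drop common filler words and keep the remaining terms in order."""
    words = re.findall(r"[A-Za-z0-9]+", question)
    return [w for w in words if w.lower() not in STOP_WORDS]


print(extract_keywords("Find pages about AIDS, cancer, or heart disease"))
# prints ['AIDS', 'cancer', 'heart', 'disease']
```

Note that "heart disease" is split into two separate keywords, just as in the example above, so the phrase itself is lost.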

Boolean Linking: One of the most popular ways servers handle multiple keywords is by linking each with a Boolean AND or Boolean OR. For example, the query "food cajun spice" would return only documents that contain all of the keywords food, cajun, and spice if linked with Boolean AND, and would return any documents containing any one of the keywords food, cajun, or spice if linked by Boolean OR. Although certain engines simply perform one kind of linking, others let you select whether to perform a narrow search using AND or a broad search using OR.
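In terms of an index that maps each word to the set of documents containing it, AND linking is a set intersection and OR linking is a set union. A minimal Python sketch, with made-up document names and a hypothetical `search` function:

```python
def search(index, keywords, mode="AND"):
    """Link all keywords with one Boolean operator: AND (narrow) or OR (broad)."""
    sets = [index.get(word, set()) for word in keywords]
    if not sets:
        return set()
    if mode == "AND":
        return set.intersection(*sets)
    return set.union(*sets)


# word -> documents containing it (document names are made up)
index = {
    "food":  {"doc1", "doc2", "doc3"},
    "cajun": {"doc1", "doc2"},
    "spice": {"doc1", "doc4"},
}
query = "food cajun spice".split()
print(sorted(search(index, query, "AND")))   # only documents with all three words
print(sorted(search(index, query, "OR")))    # documents with any of the words
```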

Boolean Controls: Similar to Boolean Linking above, this method connects multiple keywords with Boolean operators to improve a search by narrowing or broadening the search criteria. However, this technique provides much more control over the search parameters, because the user specifies how the words are linked using each of the Boolean operators AND, OR, and NOT, as well as the proximity operator NEAR. Using parentheses to group operations, users can create complex Boolean queries such as "((dog OR animal) AND (bite OR bitten)) OR rabies".
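To show how the grouping works, here is a small Python sketch that evaluates a nested query against the words of a single document. The query is written as nested tuples rather than parsed from text, the NEAR operator is left out because it needs word positions, and the function name `matches` is hypothetical.

```python
def matches(expr, words):
    """Evaluate a nested Boolean expression against one document's word set.

    An expression is either a keyword string or a tuple:
    ("AND", a, b), ("OR", a, b), or ("NOT", a).
    """
    if isinstance(expr, str):
        return expr in words
    op, *args = expr
    if op == "AND":
        return all(matches(a, words) for a in args)
    if op == "OR":
        return any(matches(a, words) for a in args)
    if op == "NOT":
        return not matches(args[0], words)
    raise ValueError(f"unknown operator: {op}")


# ((dog OR animal) AND (bite OR bitten)) OR rabies
query = ("OR",
         ("AND", ("OR", "dog", "animal"), ("OR", "bite", "bitten")),
         "rabies")

print(matches(query, {"my", "dog", "was", "bitten"}))   # dog + bitten
print(matches(query, {"rabies", "vaccine"}))            # rabies alone is enough
print(matches(query, {"dog", "show"}))                  # dog without a bite
```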

Note: The Overview document categorizes both Boolean Linking and Boolean Controls under the same header. Services that only support a single form of linking are typically rated lower than those which let the user specify whether to link with AND or OR. Similarly, servers which support complex Boolean syntax are rated much higher.

Keyword Controls: Rather than requiring some relation between keywords, some search engines allow each keyword to be qualified individually. Each keyword in the query can be prefixed with a special character like + or - to indicate that it is required (much like Boolean AND) or that it must not appear in the document. Often, unqualified keywords are linked with a Boolean OR by default. For example, the Keyword Control query "candy sugar dentist -saccharine +cavity" is equivalent to the Boolean "(candy OR sugar OR dentist) AND (NOT saccharine) AND cavity".
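The equivalence above can be sketched in Python as follows. This follows the Boolean reading given in the text; individual engines may treat unqualified keywords differently (for instance, using them only for ranking), and the function names are hypothetical.

```python
def parse(query):
    """Split a query into required (+), excluded (-), and optional keywords."""
    required, excluded, optional = set(), set(), set()
    for term in query.split():
        if term.startswith("+"):
            required.add(term[1:])
        elif term.startswith("-"):
            excluded.add(term[1:])
        else:
            optional.add(term)
    return required, excluded, optional


def matches(query, words):
    """(optional keywords ORed) AND every required AND NOT any excluded."""
    required, excluded, optional = parse(query)
    if not required <= words:       # a required keyword is missing
        return False
    if excluded & words:            # an excluded keyword is present
        return False
    return not optional or bool(optional & words)


query = "candy sugar dentist -saccharine +cavity"
print(matches(query, {"sugar", "cavity"}))                 # optional + required
print(matches(query, {"sugar", "cavity", "saccharine"}))   # excluded word present
print(matches(query, {"candy", "dentist"}))                # required word missing
```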

Keyword Truncation: Finally, most systems perform some sort of suffix management on keywords. This helps users get the most from their queries by generalizing each keyword to its root, and expanding the search to include all forms of that root word. On such a server, a query containing the keyword computers may actually return documents containing compute, computed, computer, computes, computers, and computing. Some servers allow the user to choose which words are truncated, typically by appending a * character to the end of the root word, like comput*; most others, however, perform the truncation automatically according to their own rules.
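Both flavors of truncation can be sketched in Python. The explicit form expands `comput*` by prefix matching; the automatic form uses a deliberately naive suffix stripper. The suffix list, the minimum root length, and the function names are assumptions for illustration only, and real servers apply their own, more careful rules.

```python
SUFFIXES = ("ing", "ers", "er", "ed", "es", "e", "s")   # longer suffixes first


def root_of(word):
    """Naive automatic truncation: strip the first matching suffix."""
    for suffix in SUFFIXES:
        # Keep at least three letters so short words are left alone.
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def expand(keyword, vocabulary):
    """Expand an explicit 'root*' pattern to every indexed word with that prefix."""
    if keyword.endswith("*"):
        root = keyword[:-1]
        return sorted(w for w in vocabulary if w.startswith(root))
    return [keyword] if keyword in vocabulary else []


# Words known to a hypothetical index.
vocabulary = {"compute", "computed", "computer", "computes",
              "computers", "computing", "compost"}

print(expand("comput*", vocabulary))
print(sorted(w for w in vocabulary if root_of(w) == root_of("computers")))
```

In this small vocabulary both approaches select the same six words and leave out compost.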


This collection is Copyright © 1995-6 by Matt Slot, but has been designed for public use. Permission is hereby granted for unlimited print and electronic redistribution. Your feedback is appreciated.

Matt Slot * fprefect@ambrosiasw.com * 6/8/96