Web Matrix: Terminology Reference

Common Terms for Evaluating Internet Indices


This document describes some of the descriptive vocabulary used in the evaluation of Web indices. It is not intended as a basic glossary for general Internet or World Wide Web concepts; for such information, you may look at John December's Internet Tools Summary for relevant descriptions and tutorials.

Abstract
A brief paragraph describing a document, so that the user can decide whether to download the file based on more than the link text. Abstracts are often written as part of the URL submission process, or by server staff when a document is added to a Subject Catalog. Don't confuse an Abstract, which is a summary written separately from the original, with a Document Extract.

Boolean Search
A technique for finding documents that include or exclude data based on multiple criteria. By combining or restricting search keywords with the Boolean operators AND, OR, and NOT, you can specify simple or complex formulas. Boolean searching provides a standard command interface for extracting matching data from a database.
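
As a rough illustration, the three operators map directly onto set operations over a word index. The sketch below is only one way to do it; the sample documents and the names build_index, docs, and all_ids are invented for the example.

```python
from collections import defaultdict

def build_index(docs):
    """Map each lowercase word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

# Hypothetical three-document collection.
docs = {
    1: "web index search engine",
    2: "web spider gathers pages",
    3: "subject catalog of web pages",
}
index = build_index(docs)
all_ids = set(docs)

# AND narrows the match set, OR widens it, NOT excludes documents.
web_and_pages = index["web"] & index["pages"]                # {2, 3}
spider_or_catalog = index["spider"] | index["catalog"]       # {2, 3}
web_not_spider = index["web"] & (all_ids - index["spider"])  # {1, 3}
```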

Dead Links
Because the Web consists of documents authored and located at numerous network locations, the level of document support and maintenance varies. Some collections are regularly used and have incorrect information corrected promptly; others may not be updated for weeks or months. As machine names change, users' documents move, or network connections fail, document links become out of date. A dead link is a URL that no longer leads to an existing document, typically indicated by an HTTP 404 ("Not Found") error message.
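
A link checker can be sketched along these lines. This is a minimal example, not a complete tool: the function names fetch_status and find_dead_links are made up for illustration, and treating an unreachable server the same as a 404 is an assumption you may want to change.

```python
import urllib.error
import urllib.request

def fetch_status(url, timeout=10):
    """Return the HTTP status code for url, or None if the server is unreachable."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code   # e.g. 404 when the document no longer exists
    except (urllib.error.URLError, OSError):
        return None       # DNS failure, refused connection, or timeout

def find_dead_links(urls, get_status=fetch_status):
    """Return the URLs that answer 404 or cannot be reached at all."""
    dead = []
    for url in urls:
        status = get_status(url)
        if status is None or status == 404:
            dead.append(url)
    return dead
```

The get_status parameter exists so the checker can be tried against a table of canned status codes before it is pointed at live servers.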

Document Extract
A portion of a document, typically returned as part of a match when using a Search Engine. New files are usually gathered in large numbers automatically by searching and indexing software, so it is impractical for an administrator to write a document summary, or Abstract, for each file. Instead, search software generally saves sample text from a file along with its URL and indexed keywords, as a simple "preview".
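
One simple way to produce such a preview is to keep the opening text of the file, cut at a word boundary. The sketch below assumes that approach; the name make_extract and the 80-character default are arbitrary choices for the example, and real indexing software may pick its sample text differently.

```python
def make_extract(text, max_chars=80):
    """Return roughly the first max_chars characters, cut at a word boundary."""
    collapsed = " ".join(text.split())   # collapse runs of whitespace and newlines
    if len(collapsed) <= max_chars:
        return collapsed
    # Drop the (possibly partial) last word so the extract ends cleanly.
    cut = collapsed[:max_chars].rsplit(" ", 1)[0]
    return cut + "..."
```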

Engine
See Search Engine

Forms Search
Web pages and current browsers support an interactive mechanism called HTML Forms, which allows users to enter complex sets of information and request services from the Web server based on that information. A Forms-based input page that calls a remote Search Engine creates a powerful feedback tool.
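
When a form is submitted with the GET method, the browser packs the field names and values into the query string of the request URL. The sketch below shows that encoding with Python's standard library; the field names (query, match, max_hits) and the search.example address are hypothetical.

```python
from urllib.parse import parse_qs, urlencode

# Hypothetical form fields for a search page.
fields = {"query": "web index", "match": "all", "max_hits": 25}

# Spaces become "+" and fields are joined with "&".
query_string = urlencode(fields)   # "query=web+index&match=all&max_hits=25"
url = "http://search.example/cgi-bin/find?" + query_string

# The server side reverses the encoding; every value comes back as a list of strings.
decoded = parse_qs(query_string)
```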

Front Page
The document that a company or organization uses to establish its Internet presence, often leading to other sources of online information related to that organization's purpose. Such a page is different from an internal homepage, which is accessed by users or members of that organization and serves as a hotlist of relevant, but not public, links or information.

Gathering
The administrators of subject catalogs and searchable databases must not only maintain their collection of URLs, but should also continue to find more Internet documents to expand that collection. Gathering URLs can be done manually (by serendipity and by scanning the What's New lists) or by running automated software (such as Web Spiders) that returns a list of "discovered" resources.

Hotlist
As users explore the net, they build up a list of URLs and links that they want to remember. Typical links in these hotlists include entertainment pages, Internet reference documents, or the homepages of their friends. Often a user will make the hotlist available to the Web public, for reference or easy access. Organizations may also keep hotlists for their users, to provide quick links to commonly used or frequently cited documents.

ISINDEX Search
See Non-Forms Search

Keyword
When searching for information inside a database collection, a user needs to tell the software how to identify the desired data. The user enters a word or phrase relevant to the information being sought, and the database software examines each record for a match. Such matches, called "hits", are selected because they contain the entered word or phrase. Keyword searching can be improved by combining it with other techniques: Boolean Searching, Proximity Searching, or Vocabulary Control.

Load Balancing
Popular Web services can become too busy to run from a single computer, and administrators may choose to distribute the document collection and processing across several networked computers. To reduce the Server Load that numerous users place on critical resources, the server may be configured to balance requests automatically between the available computers. By passing requests to alternating machines, the server can improve response time significantly, often without users noticing.
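
Passing requests to alternating machines is often called round-robin balancing. The sketch below shows only the selection step, under the assumption that every machine holds the same documents; the class name RoundRobinBalancer and the host names are invented for the example.

```python
from itertools import cycle

class RoundRobinBalancer:
    """Hand out host names in strict rotation."""

    def __init__(self, hosts):
        if not hosts:
            raise ValueError("at least one host is required")
        self._hosts = cycle(hosts)   # endless iterator over the host list

    def next_host(self):
        """Return the machine that should handle the next request."""
        return next(self._hosts)

balancer = RoundRobinBalancer(["www1", "www2", "www3"])
first_five = [balancer.next_host() for _ in range(5)]
# first_five is ["www1", "www2", "www3", "www1", "www2"]
```

Strict rotation ignores how busy each machine actually is, so it works best when requests cost about the same.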

Mirrors
When a popular server becomes too busy to support the Server Load, other sites may volunteer to run a copy of the same software and database on their own computers. By duplicating, or mirroring, the original server's data, the new site can serve local users much faster and reduce the load on the first computer.

Non-Forms Search
A simple search interface that takes a single keyword and processes it using server software to generate an output document (e.g., entering a word and getting back a dictionary entry). This type of search has been superseded by HTML Forms, which allow complex criteria to be passed to the server and return richer information.

Page
In the context of this collection, a page is any document that is available for browsing. Some services and organizations provide key access or information on special pages, called "public pages" or "homepages".

Proximity Search
Another technique for improving the quality of keyword searching, a proximity search lets you identify documents that contain certain phrases or word combinations. Such tools let you require that multiple words occur close to one another, which gives a better chance that the words are actually related than when they may be located anywhere in a single document.
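
A proximity test can be sketched by comparing word positions. The example below is a simplified illustration: the name within_proximity and the default distance of five words are assumptions, and it splits on whitespace only, so punctuation attached to a word will prevent a match.

```python
def within_proximity(text, first, second, max_distance=5):
    """Return True if the two words occur within max_distance words of each other."""
    words = text.lower().split()
    first_positions = [i for i, w in enumerate(words) if w == first.lower()]
    second_positions = [i for i, w in enumerate(words) if w == second.lower()]
    return any(
        abs(i - j) <= max_distance
        for i in first_positions
        for j in second_positions
    )

sentence = "the web index lists many pages"
near = within_proximity(sentence, "web", "pages", max_distance=5)   # True: 4 words apart
far = within_proximity(sentence, "web", "pages", max_distance=2)    # False
```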

Regular Expression
Regular expressions offer a way to search documents using pattern matching. Such tools let you mix substrings, wildcards, and repeated sequences to create a complex key to search against, resulting in a powerful and specific set of matches. Regular expressions are not designed for searching by content, but are useful for finding certain files or very specific data strings.
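
For instance, the two patterns below pick specific file names out of a list using Python's re module. The file names are made up; the patterns combine a literal substring, an optional character, and a repeated digit sequence.

```python
import re

# Hypothetical file listing.
filenames = ["index.html", "index.htm", "notes.txt", "img001.gif", "img12.gif"]

# Literal "index", then ".htm" with an optional trailing "l".
html_pattern = re.compile(r"^index\.html?$")
# "img" followed by exactly three digits, then ".gif".
image_pattern = re.compile(r"^img\d{3}\.gif$")

html_files = [name for name in filenames if html_pattern.match(name)]
image_files = [name for name in filenames if image_pattern.match(name)]
# html_files is ["index.html", "index.htm"]; image_files is ["img001.gif"]
```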

Root/Suffix Management
Some search engines are robust enough to recognize inflected words such as "dogs" or "running" and reduce them to the appropriate root words "dog" and "run". This makes searching for such words much easier because it is not necessary to consider every variation of a word when trying to find it.
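
The idea can be illustrated with a deliberately naive suffix stripper. This is a toy sketch with a hypothetical name (naive_root) and only two rules; it mishandles many English words (irregular plurals such as "indices", for example), and a real engine would use a far more complete algorithm.

```python
def naive_root(word):
    """Strip a trailing "ing" or plural "s" from a word (toy rules only)."""
    word = word.lower()
    if word.endswith("ing") and len(word) > 5:
        stem = word[:-3]
        # Undo a doubled final consonant: "runn" -> "run" (but keep "fall").
        if len(stem) >= 2 and stem[-1] == stem[-2] and stem[-1] not in "aeiouls":
            stem = stem[:-1]
        return stem
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word

roots = [naive_root(w) for w in ["dogs", "running", "searching", "glass"]]
# roots is ["dog", "run", "search", "glass"]
```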

Server Load
The amount of work, such as networking or database searching, that a Web server is performing at any given time. A server with a high load will not respond to user requests quickly, or may not work reliably. A site may choose to replace the server with a faster computer, or purchase a second computer to share the processing load and improve performance.

Search Engine
The software on a Web server that applies user criteria to a database of documents to build a match set. The speed of the engine depends on the size of the collection and the complexity of the search, as well as on how the software is written. Custom software written in C is generally much faster than software scripted in Perl or csh.

Searchable Index
A Web server that lets you find documents in its collection by entering a keyword or other criteria, and that returns a set of documents matching the input in some way. Many of the popular Internet services are searchable indices, and many others support at least some form of searching. A document set returned from a searchable index is characteristically filled with accidental hits (or false drops): documents that match the user's criteria but don't really contain the desired information. (Refer to the Answer Page for a lengthier explanation.)

Subject Catalog
A Subject Index or Subject Catalog is a service that organizes linked documents by their content and subject matter. Organized into general categories (or alphabetically) at the top level, documents are layered hierarchically and collected with related pages. Although any subject index may be smaller than a searchable one, it is generally a much more reliable tool. (Refer to the Answer Page for a lengthier explanation.)

Vocabulary Control
Selecting a suitable keyword for a search is often difficult, especially when there are several related terms that have similar meanings. Unless the criteria reflect every relevant word in the language, matching documents may be missed in a search. Vocabulary Control means establishing a standard set of keywords and identifying documents by these keywords, to improve the user's chances of finding every relevant document.
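
In practice this can be as simple as a lookup table that maps each variant term to one preferred keyword, applied both when documents are indexed and when a query is entered. The table and the name normalize below are invented for illustration.

```python
# Hypothetical controlled vocabulary: variant term -> preferred keyword.
CONTROLLED_TERMS = {
    "automobile": "car",
    "auto": "car",
    "car": "car",
    "film": "movie",
    "motion picture": "movie",
    "movie": "movie",
}

def normalize(term):
    """Return the preferred keyword for term, or the cleaned term if it is not listed."""
    key = " ".join(term.lower().split())   # lowercase and collapse whitespace
    return CONTROLLED_TERMS.get(key, key)

preferred = [normalize(t) for t in ["Automobile", "Motion  Picture", "spider"]]
# preferred is ["car", "movie", "spider"]
```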

Web Crawlers, Spiders, Worms
Each of these terms is used to describe software that automatically downloads and catalogs Internet documents. By reading each document it discovers, the software builds a list of additional pages to visit. This is a popular method for creating a database suitable for a Searchable Index.
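
The read-and-follow loop described above might look like the sketch below. It is a bare-bones illustration, not a production spider: the names LinkParser and crawl are made up, the caller supplies the fetch function, and a real crawler would also need politeness delays and robots.txt handling.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkParser(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(start_url, fetch, max_pages=100):
    """Visit pages breadth-first; fetch(url) returns HTML text or None."""
    seen = {start_url}
    queue = deque([start_url])
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:          # dead link or non-HTML resource
            continue
        visited.append(url)
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)   # resolve relative links
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return visited
```

Passing fetch in as a parameter lets the loop be tried against a dictionary of canned pages before it touches the network.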

What's New? Lists
As more document collections and cool homepages come online, Web servers and other organizations regularly compile them into lists. By scanning a few What's New? lists, you can keep track of the latest and greatest pages on the Web -- and beat your friends to them!


This collection is Copyright © 1995-6 by Matt Slot, but has been designed for public use. Permission is hereby granted for unlimited print and electronic redistribution. Your feedback is appreciated.

Matt Slot * fprefect@ambrosiasw.com * 12/4/96