Web Matrix:
Terminology Reference
Common Terms for Evaluating Internet Indices
This document describes some of the descriptive vocabulary used in the
evaluation of Web indices. It is not intended as a basic glossary
for general Internet or World Wide Web concepts; for such information, you
may look at John December's Internet Tools Summary for relevant descriptions and tutorials.
- Abstract
- A brief paragraph describing a document, so that the user can decide to
download the file based on more than the link text. Abstracts are often
written as part of the URL submission process, or by server staff
when added to a Subject
Catalog. Don't confuse an Abstract, which is a summary separate
from the original, with another term, Document Extract.
- Boolean Search
- A technique for finding documents that include or exclude data based
on multiple criteria. By combining or restricting search keywords using
Boolean operators and, or, and not, you can specify
simple or complex formulas. Boolean searching provides a standard
command interface for extracting matching data from a database.
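As a rough illustration of how and, or, and not combine, here is a minimal Python sketch. The `DOCS` collection and the `boolean_search` helper are hypothetical names invented for this example; real search engines use their own query syntax and data structures.

```python
# Hypothetical mini-collection: document name -> text.
DOCS = {
    "a.html": "web index search tools",
    "b.html": "gopher search archive",
    "c.html": "web hotlist of friends",
}

def words(text):
    """Split text into a set of lowercase words."""
    return set(text.lower().split())

def boolean_search(docs, all_of=(), any_of=(), none_of=()):
    """Return the names of documents that satisfy the keyword criteria."""
    hits = []
    for name, text in docs.items():
        w = words(text)
        if not all(k in w for k in all_of):             # AND: every keyword present
            continue
        if any_of and not any(k in w for k in any_of):  # OR: at least one present
            continue
        if any(k in w for k in none_of):                # NOT: none may be present
            continue
        hits.append(name)
    return sorted(hits)

# "search and not gopher"
print(boolean_search(DOCS, all_of=["search"], none_of=["gopher"]))
```

Each criterion only narrows the match set, so leaving all three empty returns every document.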
- Dead Links
- Because the Web consists of documents authored and located at numerous
network locations, the level of document support and maintenance varies.
Some collections are regularly used and their incorrect information is
corrected; others may not be updated for weeks or months. As machine names
change, user documents move, or network connections fail, document links
become out of date. A dead link is a URL that no longer leads to an existing
document, typically indicated by an HTTP "404 Not Found" error.
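A link checker can test for dead links automatically. The sketch below uses Python's standard `urllib`. The function name `is_dead_link` and its `opener` parameter (which lets the check run without a network) are assumptions made for this example, as is the choice to treat 410 (Gone) like 404.

```python
import urllib.error
import urllib.request

def is_dead_link(url, opener=urllib.request.urlopen, timeout=10):
    """Return True if the URL cannot be retrieved (HTTP 404/410 or no host)."""
    try:
        with opener(url, timeout=timeout):
            return False                      # the document was retrieved
    except urllib.error.HTTPError as err:
        # 404 (Not Found) and 410 (Gone) mean the document no longer exists.
        # Other HTTP errors (e.g. 500) may be temporary, so they are not
        # counted as dead here; that is a choice made for this sketch.
        return err.code in (404, 410)
    except urllib.error.URLError:
        # Unknown machine name or failed network connection.
        return True
```

Calling `is_dead_link("http://example.com/missing.html")` performs a real request; a maintainer might run such a check over every URL in a collection and review the failures by hand.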
- Document Extract
- A portion of a document, typically returned as part of a match when
using a Search Engine.
New files are usually gathered in large amounts automatically by
searching and indexing software, so it is rather inconvenient for
an administrator to write a document summary, or
Abstract, for each file.
Search software generally saves sample text from a file with its
URL and indexed keywords, as a simple "preview".
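The "preview" idea can be sketched in a few lines of Python. The helper names `make_extract` and `index_record`, the 60-character limit, and the record layout are all assumptions for illustration, not a description of any particular indexer.

```python
def make_extract(text, max_chars=60):
    """Return the leading text of a document, cut at a word boundary."""
    flat = " ".join(text.split())             # collapse runs of whitespace
    if len(flat) <= max_chars:
        return flat
    cut = flat[:max_chars].rsplit(" ", 1)[0]  # drop a trailing partial word
    return cut + "..."

def index_record(url, text, max_chars=60):
    """Bundle a URL with its indexed keywords and a sample of its text."""
    return {
        "url": url,
        "keywords": sorted(set(text.lower().split())),
        "extract": make_extract(text, max_chars),
    }

print(index_record("http://example.com/a.html", "Web   Web index"))
```

Unlike an Abstract, the extract is taken verbatim from the file, so its usefulness depends on how the document happens to begin.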
- Engine
- See Search Engine
- Forms Search
- Web pages and current browsers support an interactive mechanism
called HTML Forms, which allows users to enter complex sets of
information and request services from the Web server based on that
information. A Forms-based input page that calls a remote
Search Engine
creates a powerful feedback tool.
- Front Page
- The document that a company or organization uses to establish its
Internet presence, often leading to other sources of online information
related to that organization's purpose. Such a page is different from
an internal homepage, which is accessed by users or members of that
organization -- a hotlist of relevant, but not public, links or
information.
- Gathering
- The administrators of subject catalogs and searchable databases must
not only maintain their collection of URLs, but should also continue to
find more Internet documents to expand their collection. The process
of gathering URLs can be done manually (by serendipity and by scanning
the What's New lists) or by running automated software (such as
Web Spiders) that returns a list of "discovered" resources.
- Hotlist
- As users explore the net, they build up a list of URLs and links that
they want to remember. Typical links in these hotlists include
entertainment pages, Internet reference documents, or the homepages
of their friends. Often a user will make their hotlist available to the
Web public, for reference or easy access. Organizations may also keep
hotlists for their users, to provide quick links to commonly used
or referred to documents.
- ISINDEX Search
- See Non-Forms Search
- Keyword
- When searching for information inside a database collection, a user needs
to tell the software how to identify the desired data. The user enters
a word or phrase relevant to the information being sought, and the
database software examines each record for a match. Such matches, called
"hits", are selected because they contain the entered word or phrase.
Keyword searching can be improved by combining with other techniques:
Boolean Searching,
Proximity Searching,
or Vocabulary
Control.
- Load Balancing
- Popular Web services can become too busy to run from a single computer,
and administrators may choose to distribute the document collection
and processing across several networked computers. To reduce the
Server Load that
numerous users place on critical resources, the server may be
configured to balance requests automatically between available computers.
By passing requests to alternating machines, the server can significantly
improve response time, often transparently to the user.
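"Passing requests to alternating machines" is often called round-robin scheduling. A minimal Python sketch follows; the class name `RoundRobinBalancer` and the machine names are hypothetical, and real servers may also weigh how busy each machine currently is.

```python
import itertools

class RoundRobinBalancer:
    """Hand out machines in strict rotation, one per incoming request."""

    def __init__(self, machines):
        machines = list(machines)
        if not machines:
            raise ValueError("at least one machine is required")
        self._cycle = itertools.cycle(machines)

    def pick(self):
        """Return the machine that should handle the next request."""
        return next(self._cycle)

balancer = RoundRobinBalancer(["www1", "www2", "www3"])  # hypothetical hosts
for _ in range(4):
    print(balancer.pick())
```

Strict rotation is the simplest policy; it assumes every request costs about the same and every machine is equally fast.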
- Mirrors
- When a popular server becomes too busy to support the
Server Load , other sites
may volunteer to run a copy of the same software and database on their own
computers. By duplicating, or mirroring, the original server's data, the
new site can serve local users much faster and reduce the load on the
first computer.
- Non-Forms Search
- A simple search interface that takes a single keyword and processes it
using server software to generate an output document (e.g., entering
a word and getting back a dictionary entry). This type of search has
been superseded by HTML Forms,
which allow complex criteria to be passed to the server and return richer
information.
- Page
- In the context of this collection, a page is any document that is
available for browsing. Some documents provide key access or information
about a service or organization on special pages, called "public pages"
or "homepages".
- Proximity Search
- Another technique for improving the quality of
keyword searching, a proximity search
lets you identify documents with certain phrases or word combinations. Such
tools let you require that multiple words occur close to one another, which
gives a better chance of a relevant match than two words that may be
located anywhere in a single document.
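One simple way to express proximity is a maximum distance, counted in words, between two keywords. The Python sketch below assumes that definition; the function name `within_proximity` and the default gap of three words are choices made for this example.

```python
def within_proximity(text, first, second, max_gap=3):
    """Return True if the two words occur within max_gap words of each other.

    Splits on whitespace only, so punctuation attached to a word
    (e.g. "index,") will prevent a match in this simplified version.
    """
    tokens = text.lower().split()
    pos_a = [i for i, w in enumerate(tokens) if w == first.lower()]
    pos_b = [i for i, w in enumerate(tokens) if w == second.lower()]
    return any(abs(a - b) <= max_gap for a in pos_a for b in pos_b)

print(within_proximity("the web index lists many search tools", "web", "index", 1))
print(within_proximity("web pages are many and one is an index", "web", "index", 3))
```

The first call matches because the words are adjacent; the second does not, even though both words appear in the document.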
- Regular Expression
- Regular expressions offer a way to search documents using pattern
matching. Such tools let you mix substrings, wildcards, and repetitive
sequences to create a complex key to search against, resulting in a
powerful and specific set of matches. Regular expressions are not
designed for searching by content, but they are useful for finding certain
files or very specific data strings.
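As an illustration of finding certain files by pattern, the Python sketch below uses the standard `re` module to pick image filenames out of a line of text. The pattern and the helper name `find_image_files` are assumptions for this example.

```python
import re

# \w+        one or more letters, digits, or underscores (the file name)
# \.         a literal dot
# (?:...)    alternatives: "gif", "jpg", or "jpeg" (the "e" is optional)
IMAGE_FILE = re.compile(r"\b\w+\.(?:gif|jpe?g)\b")

def find_image_files(text):
    """Return every substring of text that looks like an image filename."""
    return IMAGE_FILE.findall(text)

print(find_image_files("see logo.gif, photo.jpeg and notes.txt"))
```

The pattern describes the shape of the string rather than its meaning, which is why it finds filenames reliably but cannot tell what the images depict.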
- Root/Suffix Management
- Some search engines are robust enough to recognize inflected
words such as "dogs" or "running" and reduce them to the root words "dog"
and "run". This makes searching for such words much easier because it
is not necessary to consider every variant of a word when trying
to find it.
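The two examples above can be reproduced with a deliberately naive suffix stripper. This Python sketch is not any published stemming algorithm; the function name `naive_stem` and its length thresholds are assumptions, and it will mis-handle many English words (for instance, "falling" becomes "fal").

```python
def naive_stem(word):
    """Strip a trailing "ing" or plural "s" to approximate a root word."""
    w = word.lower()
    if w.endswith("ing") and len(w) > 5:
        w = w[:-3]
        # Undo a doubled final consonant: "runn" -> "run".
        if len(w) >= 2 and w[-1] == w[-2] and w[-1] not in "aeiou":
            w = w[:-1]
    elif w.endswith("s") and not w.endswith("ss") and len(w) > 3:
        w = w[:-1]
    return w

for word in ["dogs", "running", "glass", "sing"]:
    print(word, "->", naive_stem(word))
```

An engine would apply the same function to both the indexed words and the user's keyword, so that "dogs" and "dog" meet at the same root.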
- Server Load
- The amount of work, such as networking or database searching, that a
Web server is performing at any given time. A server with a high load
will not respond to user requests quickly, or may not work reliably.
A site may choose to replace the server with a faster computer, or
purchase a second computer to share the processing load and improve
performance.
- Search Engine
- The software on a Web server that applies user criteria to a database
of documents to build a match set. The speed of the engine depends
on the size of the collection and the complexity of the search, as well
as on how the software is written. Custom software written in C is generally
much faster than engines scripted in Perl or csh.
- Searchable Index
- A Web server that lets you find documents in its collection by
entering a keyword or other
criteria, and that returns a set of documents matching the input in some
way. Many of the popular Internet services are searchable indices, and
many others support at least some form of searching. A document set
returned from a searchable index is characteristically filled with
accidental hits (or false drops), documents that match the user's
criteria but don't really contain the desired information. (Refer to
the Answer Page for a lengthier
explanation.)
- Subject Catalog
- A Subject Index or Subject Catalog is a service that organizes linked
documents by their content and subject matter. Organized into general
categories (or alphabetically) at the top level, documents are layered
hierarchically and collected with related pages. Although any subject
index may be smaller than a searchable one, it is generally a much more
reliable tool. (Refer to the Answer
Page for a lengthier explanation.)
- Vocabulary Control
- Selecting a suitable keyword for a search is often difficult, especially
when there are several related terms that have similar meanings. Unless
the criteria reflect every relevant word in the language, matching
documents may be missed in a search. Vocabulary Control means
establishing a standard set of keywords and identifying documents by
these keywords, to improve the user's chances of finding every relevant
document.
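A controlled vocabulary can be pictured as a table that maps each variant term to one standard keyword. The Python sketch below is illustrative only: the `PREFERRED` table, its entries, and both helper names are hypothetical.

```python
# Hypothetical controlled vocabulary: each variant maps to one standard keyword.
PREFERRED = {
    "car": "automobile",
    "auto": "automobile",
    "automobile": "automobile",
    "film": "movie",
    "movie": "movie",
}

def standard_keyword(term):
    """Map a user's search term to its standard keyword, if one is defined."""
    term = term.lower()
    return PREFERRED.get(term, term)

def index_keywords(text):
    """Return the standard keywords under which a document would be filed."""
    return {PREFERRED[w] for w in text.lower().split() if w in PREFERRED}

print(standard_keyword("Car"))
print(index_keywords("a film about a car"))
```

Because documents and queries pass through the same table, a search for "auto" finds a document that only ever says "car".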
- Web Crawlers, Spiders, Worms
- Each of these terms is used to describe software that automatically
downloads and catalogs Internet documents. By reading each document it
discovers, the software builds a list of additional pages to visit.
This is a popular method for creating a database suitable for a
Searchable Index.
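The read-a-page, collect-its-links loop can be sketched in Python with the standard library. Everything named here (`LinkParser`, `crawl`, the `fetch` callback, the `PAGES` table, and the `max_pages` limit) is an assumption for illustration; a real spider would also need to be polite to servers, for example by honoring robots.txt and pacing its requests.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkParser(HTMLParser):
    """Collect the href value of every <a> tag in a document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(start_url, fetch, max_pages=50):
    """Visit pages breadth-first; fetch(url) returns HTML text or None."""
    seen = {start_url}
    queue = deque([start_url])
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:              # dead link: nothing to read
            continue
        visited.append(url)
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            link = urljoin(url, href)  # resolve relative links
            if link not in seen:       # never queue the same URL twice
                seen.add(link)
                queue.append(link)
    return visited

# Hypothetical three-page web held in memory instead of on the network.
PAGES = {
    "http://h/a": '<a href="b">b</a><a href="/c">c</a>',
    "http://h/b": '<a href="a">a</a>',
    "http://h/c": "",
}
print(crawl("http://h/a", PAGES.get))
```

Passing `fetch` in as a parameter keeps the traversal logic separate from the network code, so the same loop can be tried on an in-memory table like `PAGES` before it is pointed at live servers.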
- What's New? Lists
- As more document collections and cool homepages come online, Web
servers and other organizations regularly compile them into lists.
By scanning a few What's New? lists, you can keep track of the
latest and greatest pages on the Web -- and beat your friends to
them!
This collection is Copyright © 1995-6 by Matt Slot, but has been designed
for public use. Permission is hereby granted for unlimited print and electronic
redistribution. Your feedback is
appreciated.
Matt Slot *
fprefect@ambrosiasw.com *
12/4/96