On Mon, 29 Mar 2004, Jules Richardson wrote:
Similarly, Google trying to be "helpful" can be a real pain, when it
goes and tries to be clever about finding results (returning not only
matches to what you searched for, but other things it considers close).
"Stemming" they seem to call it. Trouble is there seems to be no way of
turning it off...
"Stemming" is the canonical term in IR ("Information Retrieval") for
the process of stripping off prefixes, suffixes, plurals, etc. to attempt to
find a match for the base word, not the specific string of characters.
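As a rough illustration of that stripping step, here is a minimal sketch in
Python. It is not the Porter algorithm or anything a real search engine is
known to use; the function names naive_stem and stem_match and the suffix
rules are assumptions made up for this example, and it only handles regular
English plurals.

```python
def naive_stem(word):
    """Strip a regular English plural ending; everything else is left alone."""
    w = word.lower()
    # "searches" -> "search", "boxes" -> "box": drop "es" after s, x, z, ch, sh
    if w.endswith("es") and w[:-2].endswith(("s", "x", "z", "ch", "sh")):
        return w[:-2]
    # "results" -> "result": drop a plain plural "s", but not "ss" or "us"
    if w.endswith("s") and not w.endswith(("ss", "us")) and len(w) > 3:
        return w[:-1]
    return w


def stem_match(query, candidate):
    """Two words match if they reduce to the same base form."""
    return naive_stem(query) == naive_stem(candidate)


if __name__ == "__main__":
    for pair in [("results", "result"), ("searches", "search"), ("horses", "horse")]:
        print(pair, stem_match(*pair))
```

Even this toy shows why stemming is hard to get right: the rules above reduce
"horses" to "hors", so it fails to match "horse", and irregular forms would
need a hand-built exception table that the sketch omits.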
For example, should "diagrams" be considered a match for "diagram", and
should "viruses" and "virii" match "virus"?
Google seems to have a unique use of "+". 'course, as George Morrow said,
"Standards are wonderful; everyone should have one of their own."
In many search systems, "A+B" means presence of A AND presence of B.
(In some digital electronics texts, "AB" means A AND B, and "A+B"
means A OR B.)
In Google, "+" seems to mean turn off the stemming, and reject any pages
that do not have that EXACT search term present. Therefore, "A+B" would
mean an exact match for B and a "loose" match for A.
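The behavior described above can be sketched as follows. This is only a guess
at the semantics as observed from outside, not Google's implementation: the
names loose and page_matches are hypothetical, the "loose" matcher is a crude
stand-in for a stemmer, and "exact" is assumed to still ignore case.

```python
import re


def loose(word):
    """Crude stand-in for a stemmer: lowercase and drop one trailing 's'."""
    w = word.lower()
    return w[:-1] if w.endswith("s") and len(w) > 3 else w


def page_matches(query, page_text):
    """Every term must match; '+term' needs the exact word, others match loosely."""
    words = re.findall(r"\w+", page_text.lower())
    exact_words = set(words)
    loose_words = {loose(w) for w in words}
    for term in re.findall(r"\+?\w+", query):
        if term.startswith("+"):
            # "+" turns stemming off: the literal word must be on the page
            if term[1:].lower() not in exact_words:
                return False
        elif loose(term) not in loose_words:
            return False
    return True


if __name__ == "__main__":
    page = "A diagram of several viruses"
    print(page_matches("diagrams+viruses", page))  # loose A, exact B present
    print(page_matches("diagrams+virus", page))    # exact "virus" is absent
```

With the sample page, the first query is accepted because "diagrams" matches
"diagram" loosely and "viruses" appears verbatim; the second is rejected
because the page never contains the exact word "virus".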
In many search systems, "next" is a "stopword" - a word that is ignored
(such as A, AN, THE, ...) because it is presumed to not help the search
process. In a system that has no options for case sensitivity, how do you
search for "NeXT"??
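To make the "NeXT" problem concrete, here is a small sketch of stopword
filtering with and without case folding. The stopword list, the function
names is_stopword and index_terms, and the rule for what counts as an
"ordinary" casing are all assumptions for illustration, not how any
particular engine works.

```python
# Hypothetical stopword list; "next" is included only because the text says
# many systems treat it as one.
STOPWORDS = {"a", "an", "the", "next"}


def is_stopword(token, case_sensitive):
    folded = token.lower()
    if folded not in STOPWORDS:
        return False
    if not case_sensitive:
        return True
    # Case-sensitive mode: only ordinary casings (next, Next, NEXT) are
    # dropped, so an unusual casing such as "NeXT" survives as a term.
    return token in (folded, folded.capitalize(), folded.upper())


def index_terms(text, case_sensitive=False):
    """Return the terms that would be indexed after stopword removal."""
    terms = []
    for raw in text.split():
        token = raw.strip('.,;:!?"()')
        if not token or is_stopword(token, case_sensitive):
            continue
        terms.append(token if case_sensitive else token.lower())
    return terms


if __name__ == "__main__":
    print(index_terms("The NeXT cube"))
    print(index_terms("The NeXT cube", case_sensitive=True))
```

With case folding the machine name vanishes along with "the", leaving only
"cube"; with the case-sensitive option the sketch keeps "NeXT" and "cube".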
A search engine that just returns what you ask for from the web would be
nice - no indexing of news, mailing lists etc, no ads, and no trying to
be intelligent by stripping out words, modifying words, randomly
inserting or removing punctuation etc.
It would be GREAT to have a system that let you control such
"features". But how many people would actually learn how to use it?
What percentage of the users are looking for something more
involved than "Britney Spears naked"?
--
Grumpy Ol' Fred cisin(a)xenosoft.com
suggested readings:
Frakes "Information Retrieval: Data Structures and Algorithms"
Salton "Automatic Text Processing"