Search Me

Imagine walking into your local library to look for a book. Hoping for a
librarian to guide you, you are confronted instead by a bewildering array of
entrepreneurs, each offering to find what you're looking for. But none has
cataloged the whole library, each has cataloged a different part of it, each
uses a different system, and none is terribly satisfactory. That is the
situation on the Internet today.

The old-fashioned librarian still has an edge over the vast resources and
computing power of the Internet in several ways. First, a library offers more
valuable stuff. Certainly there is more raw information on the Web than in any
library. But the kind of stuff people are actually willing to pay for is sadly
lacking. So long as publishers fear cannibalizing their sales--or aren't able
to make money maintaining Web sites--there will be books and periodicals that
are only available offline, or online for a fee. You can read the Wall Street
Journal free at your library, but you have to pay to read it on the Web. And of
course there are for-a-fee search services like LEXIS-NEXIS that allow you to
read documents online, but for the time being, serious research--especially for
the budgetarily challenged--still usually leads to paper.

Libraries also have a distributed cataloging infrastructure. This means that
everyone shares a common cataloging system. When a new book gets published, a
single library will do the work to abstract and catalog it, then share that
work with all the others. This effort (coordinated by the Online Computer
Library Center) is assisted by standards for cataloging information. The Dewey
Decimal System and the Library of Congress Classification are both schemes that
help guarantee that a book can be found the same way in different libraries.

Most important, though, libraries invite reduced expectations. No one expects
to walk into a library and get a list of every book that contains the word
"poker" organized by subject, title, and author. We're just happy to look up
"poker" in the (online?) card catalog and find the books that are actually
about poker. And typically we'd be just as happy not to find the references to
"red-hot poker" or "poker-faced."

Of the many Internet search services, Yahoo! comes the closest to this
card-catalog approach. It does it the old-fashioned way, hiring people to look
at each site and assign it an abstract and one or more categories. This is easy
for people to understand, but it's not comprehensive. Even very diverse sites
seldom appear in more than a few categories. For instance, a search for Slate
shows it somewhere under the Politics category, but not the Movie Reviews or
Economics categories. Yet Slate runs a movie review every week and has
published many articles on economics.

Even if Yahoo! wanted to be this comprehensive, the humans cataloging the
materials cannot possibly keep up with the ever-changing nature of the Web. To
stay current, they would have to read every page of every site every day. To
solve this problem, most other search services use computers instead of humans
to do their cataloging. Their machines scan every Web page they can find by a
process known as crawling (see my discussion of crawling in a previous
"Webhead" column) and put every word on every page into a giant index.
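
To make the indexing idea concrete, here is a minimal Python sketch of such a
word index. It is an illustration only, not any search service's actual code:
the page addresses and text are invented, and tokenize and build_index are
hypothetical helper names.

```python
import re
from collections import defaultdict

# Invented pages standing in for whatever a crawler might have fetched.
PAGES = {
    "slate.example/home": "Politics, culture, and a weekly movie review.",
    "roofing.example/slate": "Slate roofing: slate tiles and slate shingles.",
    "pool.example/tables": "Pool tables with a genuine slate bed.",
}

def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())

def build_index(pages):
    """Map every word to the set of page addresses whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in tokenize(text):
            index[word].add(url)
    return index

index = build_index(PAGES)
print(sorted(index["slate"]))
```

In this toy data the magazine's own page is missing from the result, because
the word "slate" never appears in its text. The index records words, not
subjects.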

One of the first sites to take that approach was Infoseek. Search there for
"slate," and the naive user might expect to see a link to our home page. But
the problem is that our home page includes few instances of the word
"slate"--far fewer than, say, the home page of a dealer in roofing materials or
pool tables is likely to. That's the limitation of a system based on text
indexing. It doesn't really know what a page is about, just what words appear
on it. Using a little artificial intelligence, the computer tries to decide
which pages are more "about" a given word than others, but it's not always
successful.
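
A rough sketch shows why the roofing dealer wins. It assumes the crudest
possible ranking rule, counting how often the search word appears on each
page; real services surely do something more elaborate. The page text is made
up, and term_count and rank are hypothetical names.

```python
import re
from collections import Counter

# Made-up text: the magazine mentions "slate" once, the dealer four times.
PAGES = {
    "magazine home page": "Slate: politics, culture, and a weekly movie review.",
    "roofing dealer": "Slate roofs, slate tiles, slate shingles. We install slate.",
}

def term_count(text, term):
    """Count how many times `term` occurs as a whole word in `text`."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(words)[term]

def rank(pages, term):
    """Return page names ordered by how often `term` appears, most first."""
    return sorted(pages, key=lambda name: term_count(pages[name], term), reverse=True)

print(rank(PAGES, "slate"))
```

Under this rule the dealer's page outranks the magazine's, even though only
one of the two is actually named Slate.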

HotBot is another search site. Try the "slate" search there, too. By the way,
as you try out these searches, you may see an advertisement for Slate. That
means we bought the rights to the word "slate." Whenever someone searches for
it, our ad shows up. On the Web you can buy words--isn't that great?

Excite's site uses some of that artificial intelligence to help you refine your
searches. Try searching for "slate" there. If you find a page that you like,
you can click on "more like this." Excite will return a list of pages that
"look like" the page you clicked on. That means that similar words appear on
the pages with similar frequency. Sometimes this works; more often it returns
pages that seem impressively unrelated. For instance, a search for pages like
the Slate parody Stale yields not only Slate (a good guess) but also the Steel
Lunch Boxes Web page (not a good guess, but entertaining nonetheless).
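
One common way to compare pages by word frequency is cosine similarity between
their word-count vectors. Whether Excite works this way is an assumption on my
part; the sketch below only illustrates the general idea. The page text is
invented, and word_counts, cosine_similarity, and more_like_this are
hypothetical names.

```python
import math
import re
from collections import Counter

# Invented page text for illustration.
PAGES = {
    "parody": "weekly magazine of politics and stale jokes",
    "magazine": "weekly magazine of politics and culture",
    "lunch boxes": "steel lunch boxes for sale",
}

def word_counts(text):
    """Count each alphabetic word in the lowercased text."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine_similarity(a, b):
    """Cosine of the angle between two word-count vectors (0.0 if either is empty)."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def more_like_this(target, pages):
    """Rank every other page by similarity to the target page, most similar first."""
    target_counts = word_counts(pages[target])
    scores = {
        name: cosine_similarity(target_counts, word_counts(text))
        for name, text in pages.items()
        if name != target
    }
    return sorted(scores, key=scores.get, reverse=True)

print(more_like_this("parody", PAGES))
```

The measure sees only shared words. Two pages that happen to use the same
vocabulary will score as alike whether or not a human would ever group them.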

AltaVista's site takes a weirder approach to this idea of refining searches.
Search for Slate there. Adding or subtracting words from your search criteria
can help find what you're looking for. For instance, adding "Kinsley" and
subtracting "roofing" would probably increase your chances of finding our home
page. AltaVista's "LiveTopics" technology attempts to help you do this. It
looks up "slate" in its index and sees that it often occurs on the same page as
"roof," so it suggests this as a possible refinement of the search. But since
these suggestions are generated by computer, they can be very weird. (To see
this in action, click on one of the "LiveTopics" links on the AltaVista
search-results page.) How did "Adaptec" or "Skadden" get on the list? Your
guess is as good as mine. (Well, for Skadden it's not such a mystery,
considering that "Slate" is part of the name of the D.C. law firm Skadden,
Arps, Slate, Meagher & Flom.)
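
Both ideas, adding or subtracting words and suggesting words that share a page
with your search term, can be sketched in a few lines of Python. This is a
guess at the general technique, not a description of LiveTopics itself. The
pages and the stopword list are invented, and search and suggest_refinements
are hypothetical names.

```python
import re
from collections import Counter

# Invented pages for illustration.
PAGES = [
    "slate roofing and slate roof repair",
    "roof contractors who install slate",
    "slate magazine edited by kinsley",
    "pool tables with slate beds",
]

# A tiny, made-up list of filler words to leave out of the suggestions.
STOPWORDS = {"and", "who", "by", "with"}

def words_of(text):
    """Return the set of distinct lowercase words in the text."""
    return set(re.findall(r"[a-z]+", text.lower()))

def search(pages, include=(), exclude=()):
    """Return pages containing every `include` word and no `exclude` word."""
    return [
        text for text in pages
        if all(w in words_of(text) for w in include)
        and not any(w in words_of(text) for w in exclude)
    ]

def suggest_refinements(pages, query, top_n=1):
    """Suggest the words that most often share a page with `query`."""
    cooccur = Counter()
    for text in pages:
        words = words_of(text)
        if query in words:
            cooccur.update(words - {query} - STOPWORDS)
    return [word for word, _ in cooccur.most_common(top_n)]

print(suggest_refinements(PAGES, "slate"))
print(search(PAGES, include=["slate", "kinsley"], exclude=["roofing"]))
```

In this toy data "roof" is the top suggestion simply because it shares a page
with "slate" more often than any other word does. Nothing in the count knows
what either word means, which is how an oddity like "Adaptec" can slip onto a
real list.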

The best solutions for searching will probably result from a combination of
humans and computers. If AltaVista's list of search refinements were generated
by a human, for instance, it might be more helpful. If the producers of every
site cataloged it themselves, then Yahoo! wouldn't have a hard time keeping up
with them. Of course, everyone would have to agree on standard ways to do this,
and if everyone agreed, for-profit search sites like Yahoo! probably wouldn't
be necessary. So don't be surprised if the status quo lasts just a little while
longer.
