Nothing but Net
The Internet is a moving
target. Every minute, thousands of Web pages are updated or abandoned. Messages
sent to newsgroups replace older postings. All but a fraction of the chat-room
conversations and digital images that streak across the Net vanish after
they're displayed.
Seeking
to preserve the chaos of the Net for posterity is Brewster Kahle, a man with a
mission, a server, and a lot of magnetic tape. Kahle, who once designed
computers for Thinking Machines Corp., founded the Internet Archive in 1996 to
collect and store all the disparate bits of the Internet. From offices in the
Presidio, the former Army base adjacent to San Francisco's Golden Gate Bridge,
the Internet Archive's powerful computer crawls the Internet at high speeds.
Consulting intelligent algorithms about what information to store and how
often, the archive's computer copies data to tape cassettes on a Quantum
DLT4500 recorder. When each cassette is full, a robotic arm removes it, stores
it in a carousel, and replaces it with a blank one.
The Internet may seem impossibly vast to users, but in fact
it's quite finite. The entire World Wide Web is currently estimated to contain
about 1.5 terabytes (or 1.5 million megabytes) of data. Newsgroups and other
Internet subsystems account for another 5.5 or so terabytes. (Compare these
numbers with the 20 terabytes of ASCII data contained in the Library of
Congress' 20 million books or the 8 terabytes of data at the average video
store.) With tape-cassette storage costing only $20 per gigabyte (1 billion
bytes), archiving the Internet is practically economical. Already, the
archivists have stockpiled more than 2 terabytes of the Net, and currently
they're storing about 100 gigabytes of data every month. Faster connections to
the Net promise to speed things up, and Kahle estimates that his group will be
done by the end of 1997.
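
The figures in this paragraph can be checked with a little arithmetic. The sketch below is only a back-of-the-envelope calculation using the numbers quoted above; it assumes decimal units (1 terabyte = 1,000 gigabytes) and a constant storage rate, and the variable names are illustrative.

```python
# Back-of-the-envelope check of the article's 1997 estimates.
# Assumption: decimal units, 1 terabyte = 1,000 gigabytes.
GB_PER_TB = 1000

web_tb = 1.5            # estimated size of the World Wide Web
other_tb = 5.5          # newsgroups and other Internet subsystems
cost_per_gb = 20        # dollars per gigabyte of tape storage
stored_tb = 2.0         # already stockpiled by the archive
rate_gb_per_month = 100 # current storage rate

total_gb = (web_tb + other_tb) * GB_PER_TB
tape_cost = total_gb * cost_per_gb
remaining_gb = total_gb - stored_tb * GB_PER_TB
# Assumption: the rate stays constant, which the article does not expect.
months_at_current_rate = remaining_gb / rate_gb_per_month

print(f"Total size: {total_gb:,.0f} GB")
print(f"Tape cost for one full copy: ${tape_cost:,.0f}")
print(f"Months to finish at the current rate: {months_at_current_rate:.0f}")
```

On these assumptions, one full copy comes to about 7,000 gigabytes and roughly $140,000 of tape. At the current rate the remaining 5,000 gigabytes would take about 50 months, which is why the 1997 target depends on the faster connections mentioned above.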
Storing
the Internet once is only the beginning. As experienced Web surfers
know, things change rapidly on the Net. The archive doesn't have the computer
muscle to store the publicly available Internet every week, but even if it did,
a lot of stuff would still fall through the cracks. On sites like MSNBC and CNN, breaking news comes and goes
every minute, which means pages disappear faster than they can currently be
squirreled away. Slate is updated daily. Shifting faster still are Web sites
generated by databases, such as the online bookstore Amazon.com. Because the
information these sites produce is specific to a user's experience, they can
generate a literally infinite number of different pages. Finally, much of the
traffic on the Internet is dynamic--chat rooms, instant messages, and now even
phone conversations. To archive the Internet with absolute fidelity would
require cloning not only every computer on the Internet, but also every person
using every computer.
Many responsible netizens already archive
themselves for selfish reasons. Archiving is a no-brainer for publication sites
like the San Francisco Chronicle's The Gate, which collects the contents of the daily
newspaper and connects them to a good search engine. And other sites like
Deja News already
assemble postings from the Internet newsgroups.
Where the
Internet Archive trumps these archives, of course, is in its sheer
comprehensiveness. While it isn't a replica of the Internet, it's a start. And
it's not useful just to historians. Suppose your Web browser allowed you to
specify not only an address but also a date. Remember that headline you saw on
Wired News, but
have been unable to find since? The headline was posted for only a day, and you
haven't had much luck using the site's search tool to locate the piece. But
using the Internet Archive to turn back the hands of time will uncover it for
you. And what about your teen-age cousin's Web page, with that cute picture of
her Mohawk? Cousin's mother cancelled her ISP account, and now the site is
gone. But an intelligent browser could catch the "no such site" error and look
it up on the archive instead, displaying the last-known version. Did your
favorite politician really just flip-flop on your hot-button issue? Compare
last year's campaign Web site with today's. These are just a few of the many
valuable services that promise to keep the nonprofit Internet Archive richly
endowed.
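
The two browser ideas in this paragraph, asking for a page as of a given date and falling back to the archive when a site is gone, can be sketched in a few lines. This is a toy illustration under stated assumptions, not a description of how the Internet Archive actually works: `SnapshotArchive`, `fetch_with_fallback`, and `live_fetch` are hypothetical names, the archive is an in-memory dictionary, and Python's `LookupError` stands in for the "no such site" error.

```python
from bisect import bisect_right
from datetime import date


class SnapshotArchive:
    """Hypothetical in-memory archive: url -> snapshots sorted by date."""

    def __init__(self):
        self._snapshots = {}

    def store(self, url, when, content):
        entries = self._snapshots.setdefault(url, [])
        entries.append((when, content))
        entries.sort(key=lambda entry: entry[0])  # keep oldest first

    def lookup(self, url, when):
        """Return the latest snapshot taken on or before `when`, or None."""
        entries = self._snapshots.get(url, [])
        dates = [snapshot_date for snapshot_date, _ in entries]
        index = bisect_right(dates, when)
        return entries[index - 1][1] if index else None


def fetch_with_fallback(url, live_fetch, archive, today):
    """Try the live site; on a "no such site" error, use the archive."""
    try:
        return live_fetch(url)
    except LookupError:  # stand-in for the browser's "no such site" error
        return archive.lookup(url, today)


archive = SnapshotArchive()
archive.store("cousin.example/home", date(1996, 11, 1), "Mohawk picture")
archive.store("cousin.example/home", date(1997, 2, 1), "New hairstyle")


def dead_site(url):
    raise LookupError("no such site")


# Address plus date: what did the page look like at the end of 1996?
print(archive.lookup("cousin.example/home", date(1996, 12, 31)))
# Site is gone: show the last-known version instead.
print(fetch_with_fallback("cousin.example/home", dead_site, archive,
                          date(1997, 3, 1)))
```

The binary search over snapshot dates is one simple design choice. A real archive would also have to decide how often to take snapshots and what to do when none exists before the requested date.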
Useful though it might be, the idea of archiving the
Internet is assailed from all sides. David Berreby argued last year in Slate that exhaustive
documentation of our world threatens to box us into a corner. The recent
"Documenting the Digital
Age" conference gathered experts from the computing, telecommunication, and
archiving worlds to explore these issues. Corporate executives complained that
because their archives are routinely subpoenaed by plaintiffs' attorneys, they
have every incentive to shred their data instead of preserving them. Lawyers
worried aloud about privacy and copyright concerns. Should you have the right
to exclude your public page from the archive? (Consensus opinion: Yes.) Should
we be saving usage logs, which detail every page a person sees? (Probably not.)
Doesn't this whole thing violate current copyright laws left and right? (Almost
certainly.) Should those laws be amended to allow such an archive?
(Probably.)
Professional archivists argue
that it's a waste of time to store the Internet without providing a proper
historical context. Others say that having too
much information
about the Web at our disposal will be as bad as not having enough. They add
that finding things promptly on the Web with a search engine is hard enough, and
that using it as a historical research tool would be incredibly painful. They
advocate an orderly weeding, assembling, and categorizing of digital records.
Microsoft's chief technical officer (and Slate contributor), Nathan Myhrvold,
whose "Save the Web" memo last year helped start the archive movement, counters
that we don't know now what will be important later. Your cousin might grow up
to be president, at which point her teen-age Mohawk Web site will become
substantially more important than it is now. Myhrvold adds that it's better to
start saving today's Internet now, even if it is badly collected and organized,
rather than lose it forever.
And to
think that Brewster Kahle thought he was just solving a problem by starting the
Internet Archive, and not introducing lots of new ones.