Wednesday, March 17, 2004

The Extent of Revisionist History

The web is a very forgetful and forgiving place.

Without Google's cached pages or the Internet Archive, the web would have almost no memory.

You'd never be able to revisit how CNN first responded to 9/11, reexamine what they said as the Iraq War began, or remember how now-defunct companies sounded during their heyday.

Sadly, this historical record is fragile. The Internet Archive is an independent nonprofit, but it relies heavily on crawl data supplied by Alexa Internet, an Amazon subsidiary, so it is only as good as that arrangement allows. Google's cached pages refresh regularly, showing only a brief window into the recent past. Moreover, webmasters can prevent these pages from being generated in the first place.

By creating a robots.txt file, webmasters can tell these services which pages they don't want catalogued.
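To make that mechanism concrete, here is a minimal sketch of how a well-behaved crawler might consult a robots.txt file before archiving a page. It uses Python's standard urllib.robotparser module. The sample file contents, the example.com URLs, and the helper name is_allowed are all made up for illustration; only the ia_archiver user-agent name (the crawler that has fed the Internet Archive) is real.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents, invented for this example.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /drafts/old-press-release.html
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for path in ("/index.html", "/private/memo.html"):
        url = "http://www.example.com" + path
        print(path, is_allowed(SAMPLE_ROBOTS_TXT, "ia_archiver", url))
```

Note that robots.txt is only a request. It works because crawlers such as Google's and the Internet Archive's choose to honor it.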

Last fall, I heard that the Bush Administration was preventing the creation of this de facto historical record by making large portions of www.whitehouse.gov off-limits to archiving services.

I heard this news, assumed it was probably true, and moved on.

Only just now did I discover the immensity of www.whitehouse.gov/robots.txt. If you printed out the list of pages and directories that the Bush Administration wants off-limits to history, it would take 31 pages!

For a window into how protective the Bush Administration has become during this term, the Internet Archive has documented the growth of its robots.txt file.

For the first 8 months of its term, the Bush Administration didn't see the need to block much of anything, as evidenced by its robots.txt file from August 29, 2001. Suddenly, on September 1, 2001, it started blocking archiving and caching functions for a list of pages that has grown and grown.
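One rough way to quantify that growth is to count the non-empty Disallow lines in each archived snapshot. The sketch below assumes you have saved two snapshots as text. The strings EARLY_SNAPSHOT and LATER_SNAPSHOT are invented placeholders, not the actual White House files, and count_disallow_rules is a hypothetical helper name.

```python
def count_disallow_rules(robots_txt: str) -> int:
    """Count Disallow lines that actually block a path."""
    count = 0
    for raw_line in robots_txt.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = raw_line.split("#", 1)[0].strip()
        if not line:
            continue
        field, _, value = line.partition(":")
        # An empty "Disallow:" blocks nothing, so it is not counted.
        if field.strip().lower() == "disallow" and value.strip():
            count += 1
    return count


# Placeholder snapshots, invented for illustration only.
EARLY_SNAPSHOT = """\
User-agent: *
Disallow:
"""

LATER_SNAPSHOT = """\
User-agent: *
Disallow: /example-section/
Disallow: /example-section/text/
Disallow: /another-example/  # hypothetical path
"""

if __name__ == "__main__":
    print("early:", count_disallow_rules(EARLY_SNAPSHOT))
    print("later:", count_disallow_rules(LATER_SNAPSHOT))
```

Running the same count over each dated copy in the Internet Archive would show when the list started growing and how quickly.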

What happened at the end of August 2001 to start this trend?

1 comment:

Marsha Woodbury said...

Andy,

This is important research and you did great
to publicize it. In a few years people might care
more about these things--I hope so. I will try to
write something about it to spread the word!

Marsha