About two years ago Google announced (probably to take some of the wind out of Cuil’s sails) that they had found over 1 trillion unique URLs on the web. Keep in mind that not all of those point to unique content, since a single page can be reachable through multiple URLs, but it’s still a mind-bogglingly large number.
I never really realised the scale of the web, though, until I started crawling it myself. Not being armed with billions, or even millions, of dollars, I am doing this on the cheap, but you can still suck down a huge amount of stuff. The initial Google index had about 26 million pages, and I always thought that was a fairly good size to start with. After all, if you can crawl 26 million pages you should get a representative view of the web, right?
One site that my spider managed to find has turned up 2 million unique URLs as I write this. I had a feeling there was a lot of duplicate content in there, so I wrote some scripts to check, and believe it or not there really are 2 million unique pieces of content on this one smallish site.
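I haven’t published those scripts, but the core idea is simple enough to sketch: fingerprint a normalised copy of each page body and count the distinct fingerprints. The names below are purely illustrative (my actual scripts did more cleanup), but something along these lines is enough to tell exact duplicates apart from genuinely unique content:

```python
import hashlib


def content_fingerprint(html_text: str) -> str:
    """Hash a whitespace-collapsed, lower-cased copy of the page text so
    byte-for-byte and trivially reformatted duplicates share one key."""
    normalised = " ".join(html_text.split()).lower()
    return hashlib.sha1(normalised.encode("utf-8")).hexdigest()


def count_unique_content(pages) -> int:
    """pages is any iterable of (url, html_text) pairs from the crawl store."""
    seen = set()
    for _url, html_text in pages:
        seen.add(content_fingerprint(html_text))
    return len(seen)
```

An exact hash like this only catches identical bodies; near-duplicates (rotating boilerplate, session IDs baked into the markup) would need something like shingling or simhash, which is a whole other post.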
Pretty amazing. It also makes me wonder how Google or any other search engine can index it all. Not because of some technical limitation on their end, but because even if you crawled one URL every second it would still take you about 23 days to get through all of those URLs, and most sites won’t let you crawl that quickly.
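To put numbers on that back-of-the-envelope claim (the delays below are just my own assumptions about a polite crawl, not anything the big engines publish):

```python
SECONDS_PER_DAY = 60 * 60 * 24  # 86,400


def days_to_crawl(num_urls: int, delay_seconds: float = 1.0) -> float:
    """Rough wall-clock days to fetch num_urls at a fixed delay per request."""
    return num_urls * delay_seconds / SECONDS_PER_DAY


print(days_to_crawl(2_000_000))       # ~23.1 days at one URL per second
print(days_to_crawl(2_000_000, 5.0))  # ~115.7 days at a 5 second delay
```

The 5 second figure is just an example of the kind of Crawl-delay some sites ask for in robots.txt, and that is for one smallish site crawled from a single machine.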
Interesting stuff when you think about it, and hats off to the people at Google, Bing, Blekko, Yahoo and Gigablast.