Crawling the Web is nothing new. Search engines like Google, Yahoo!, and others have always done it to index online content and build their databases so that they can provide better, up-to-date search results for their visitors. The trouble is that building these crawlers and storing all of that information requires a lot of infrastructure.
Google currently houses the world’s largest and most complete index of the Web’s contents. The company’s entire core value is based on its ability to efficiently retrieve and store information on a massive scale and make that information available to searches and tools in an accessible, speedy manner. To accomplish this, Google has built a gigantic network of servers, databases, data storage facilities, and technicians. It spends billions of dollars a year just to maintain that infrastructure and tens of millions more adding to it.
So competing with Google is not as simple as throwing up a new website and programming another Web crawler. It’s an expensive proposition that few organizations can even consider attempting.
A nonprofit called Common Crawl, however, has been doing just that.
It has amassed a database of over five billion Web pages and is now opening it up, free of charge, to developers who wish to use the information in their apps, websites, and software. This, the organization believes, will free engineers who have often seen Google as the only choice when developing new ideas about using the Web. The cost of amassing such a database, and the near-monopoly on massive Web data that this cost has given Google, have meant that only researchers at Google could do that kind of data manipulation on that kind of scale.
Now, with the information freely available, students, private enterprises, public entities, and others can use a database that is large enough to compete with Google’s. Gilad Elbaz, founder of Common Crawl, says that “the Web represents, as far as I know, the largest accumulation of knowledge, and there’s so much you can build on top.” He stresses that with the financial and time costs of building a massive database out of the way, and that huge mass of human knowledge now available for free, innovations large and small will begin to multiply.
The project, of course, is not meant to rival or “kill” Google. In fact, Google’s director of research, Peter Norvig, is on board with the nonprofit, as is Massachusetts Institute of Technology (MIT) Media Lab director Joi Ito. While Google hasn’t released any official statement on Common Crawl, it is not opposing or attacking it and is at least giving it a sideways wink and nod through Norvig’s participation on Common Crawl’s advisory board.
The project has already inspired several new Web startups, including TinEye (http://www.tineye.com), a “reverse” search engine that analyzes an image and shows the user similar ones.
To use Common Crawl, developers need only set up an Amazon cloud account (at a cost of about $25), which gives them full access to the crawler’s database.
It’s a very cool development in the world of search, and for anyone who wants to build on Web-scale data.