7.5. Deep Web FAQ

Discover more than 70,000 searchable databases and specialty search engines.

What is the Deep Web?

The Deep Web is content that resides in searchable databases, the results from which can only be discovered by a direct query. Without the direct query, the database does not publish the result. When queried, Deep Web sites post their results as dynamic Web pages in real time. Though each of these dynamic pages has a unique URL that allows it to be retrieved again later, the pages are not persistent.
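
To make the idea concrete, here is a minimal Python sketch of how a database-backed site might answer a query. It is an illustration under stated assumptions, not the code of any real site: the RECORDS table, the plants.example host and both function names are hypothetical.

```python
from urllib.parse import urlencode

# Hypothetical records held in a site's database (not from any real site).
RECORDS = {
    "orchid": "Care notes for orchids.",
    "fern": "Care notes for ferns.",
}

def result_url(query: str) -> str:
    """Build the URL that identifies the result page for a query."""
    return "http://plants.example/search?" + urlencode({"q": query})

def render_result(query: str) -> str:
    """Generate the result page on demand; no page is stored as a file."""
    body = RECORDS.get(query, "No matching record.")
    return f"<html><body><h1>{query}</h1><p>{body}</p></body></html>"
```

Calling render_result("orchid") builds the page at the moment of the request. The URL from result_url("orchid") can be used to repeat the query later, but no stored page sits behind it for a crawler to find.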

How does the Deep Web differ from the "surface" Web?

Search engines - the primary means for finding information on the "surface" Web - obtain their listings in two ways. First, authors may submit their own Web pages for listing, though this generally accounts for only a minor share of total listings. Second, search engines "crawl" or "spider" documents by following one hypertext link to another. Simply stated, when indexing a given document or page, if the crawler encounters a hypertext link on that page to another document, it records that link and schedules the new page for later crawling. Like ripples propagating across a pond, search engine crawlers are able in this manner to extend their indexes further and further from their starting points.

Thus, to be discovered, "surface" Web pages must be static and linked to other pages. Traditional search engines cannot "see" or retrieve content in the Deep Web, which by definition is dynamic content served up in real time from a database in response to a direct query.
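
The crawling process described above can be sketched in a few lines of Python. This is a simplified illustration, not any search engine's actual implementation; the page names, the STATIC_LINKS graph and the QUERY_ONLY_PAGES set are hypothetical.

```python
from collections import deque

# Hypothetical static pages: each URL maps to the links found on that page.
STATIC_LINKS = {
    "/home": ["/about", "/catalog"],
    "/about": ["/home"],
    "/catalog": ["/home"],  # offers a search form, but no links to results
}

# Result pages exist only in response to a query; no static page links to them.
QUERY_ONLY_PAGES = {"/search?q=orchid", "/search?q=fern"}

def crawl(seed):
    """Breadth-first crawl: record each page and schedule each new link."""
    seen = {seed}
    queue = deque([seed])
    while queue:
        page = queue.popleft()
        for link in STATIC_LINKS.get(page, []):
            if link not in seen:
                seen.add(link)      # record the newly found link
                queue.append(link)  # schedule the page for later crawling
    return seen

indexed = crawl("/home")
```

Starting from "/home", the crawler reaches every linked static page, but none of the entries in QUERY_ONLY_PAGES, because no hypertext link leads to them.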

How much of the Deep and "surface" Web is captured by CompletePlanet?

More than 70,000 of the estimated 200,000 Deep Web sites and about 11,000 of the estimated 45,000 "surface" Web search sites are presently listed on CompletePlanet. CompletePlanet was created as a public service and as a test bed for the Deep Query Manager (DQM).

The DQM is a research, information sharing and management tool for organizations that accesses tens of thousands of Deep Web databases and Internet search engines. With it, individual users can search vast stretches of the Internet in one search and can share their results across the organization or with selected co-workers as appropriate. The system provides a very powerful infrastructure for finding and managing large amounts of information within the organization.

Why haven’t I heard before about the Deep Web?

In the earliest days of the Web, there were relatively few documents and sites. It was a manageable task to "post" all documents as "static" pages. Because all results were persistent and constantly available, they could easily be crawled by conventional search engines.

What has not been broadly recognized is that information is now being published by different means on the Web, especially for larger sites or for traditional information providers now moving their content online. The sheer volume of content on these sites requires the information to be managed from a database, the results of which are "hidden in plain sight" from search engines.

The evolution of the Web to a database-centric design has been gradual and largely unnoticed. Many Internet information professionals have noted the importance of searchable databases to Web content. But BrightPlanet’s Deep Web white paper is the first to comprehensively define, quantify and characterize this entirely different category of Web content.

Is the Deep Web the same thing as the "invisible" Web?

As early as 1994, Dr. Jill Ellsworth coined the phrase "invisible Web" to refer to information content that was "invisible" to conventional search engines. We avoid the term "invisible Web" because it is inaccurate. The only thing "invisible" about searchable databases is that they are not indexable or queryable by conventional search engines. Using our technology, they are totally "visible" to those who need to access them.

The real problem is not the "visibility" or "invisibility" of the Web, but the spidering technologies used by conventional search engines to collect their content. For these reasons, we have chosen to call information in searchable databases the Deep Web. Yes, it is somewhat hidden from traditional engines, but clearly available if different technology such as ours is used to access it.

How large is the Deep Web?

Public information on the Deep Web is currently 400 to 550 times larger than the commonly defined World Wide Web. The Deep Web contains 7,500 terabytes of information, compared to 19 terabytes of information in the surface Web. The Deep Web contains nearly 550 billion individual documents, compared to the 1 billion of the surface Web. An estimated 200,000 or more Deep Web sites presently exist. Sixty of the largest Deep Web sites collectively contain about 750 terabytes of information - sufficient by themselves to exceed the size of the surface Web roughly 40 times over.
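
As a quick check, the ratios can be recomputed directly from the numbers quoted in this answer. The variable names below are illustrative only; the values are the ones stated above.

```python
# Figures as quoted in this answer.
deep_tb, surface_tb = 7_500, 19                             # terabytes
deep_docs, surface_docs = 550_000_000_000, 1_000_000_000    # documents
top60_tb = 750                                              # terabytes, sixty largest sites

size_ratio = deep_tb / surface_tb      # storage: Deep Web vs. surface Web
doc_ratio = deep_docs / surface_docs   # documents: Deep Web vs. surface Web
top60_ratio = top60_tb / surface_tb    # sixty largest sites vs. surface Web
```

The storage ratio comes to about 395, close to the lower end of the "400 to 550 times" range, and the document ratio gives the upper end of 550. The sixty largest sites come to about 39.5 times the surface Web, which rounds to the "40 times" figure.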

How does the content and quality of the Deep Web differ from the "surface" Web?

Deep Web sites tend to be narrower, with deeper content, than conventional surface sites. Total quality content of the Deep Web is at least 1,000 to 2,000 times greater than that of the surface Web. Deep Web content is highly relevant to every information need, market and domain. More than half of the Deep Web content resides in topic-specific databases. A full 95% of the Deep Web is publicly accessible information - not subject to fees or subscriptions.

Is the Deep Web growing faster or slower than the "surface" Web?

The Deep Web is the fastest growing category of new information on the Internet. All signs point to the Deep Web as the dominant paradigm for the next-generation Internet.

Why can’t I search the Deep Web using standard search engines?

Searching on the Internet today can be compared to dragging a net across the surface of the ocean. There is a wealth of information that is deep, and therefore missed. The reason is simple: basic search methodology and technology have not evolved significantly since the inception of the Internet.

Traditional search engines create their card catalogs by spidering or crawling "surface" Web pages. To be discovered, the page must be static and linked to other pages. Traditional search engines cannot "see" or retrieve content in the Deep Web, which is defined as content in searchable databases that only appears dynamically in response to a direct query. Because traditional search engine crawlers cannot probe beneath the surface, the Deep Web has been hidden in plain sight. That's why BrightPlanet created CompletePlanet and the Deep Query Manager.

I occasionally see Deep Web content using search engines. Why is that?

Any Deep Web content listed on a static Web page is discoverable by crawlers and therefore indexable by search engines. This can occur when a Web page author discovers some useful Deep Web content and posts its dynamic URL address on a static Web page.

I often miss "surface" Web content using search engines. Why is that?

Search engines themselves impose decision rules with respect to either depth or breadth of surface pages indexed for a given site. There is also broad variability in the timeliness of results from these engines. Specialized surface sources or engines should therefore be considered when truly deep searching is desired. Again, the "bright line" between Deep and surface Web shows shades of gray.
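
One such decision rule, a limit on crawl depth, can be sketched as follows. This is an assumed example of a rule an engine might apply, not a description of any particular engine; the SITE layout and the crawl_to_depth function are hypothetical.

```python
from collections import deque

# Hypothetical site laid out as a chain: each page links one level deeper.
SITE = {
    "/": ["/a"],
    "/a": ["/a/b"],
    "/a/b": ["/a/b/c"],
    "/a/b/c": [],
}

def crawl_to_depth(seed, max_depth):
    """Breadth-first crawl that stops following links at max_depth."""
    seen = {seed}
    queue = deque([(seed, 0)])
    while queue:
        page, depth = queue.popleft()
        if depth == max_depth:
            continue  # decision rule: go no deeper on this site
        for link in SITE.get(page, []):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return seen

indexed = crawl_to_depth("/", max_depth=2)
missed = set(SITE) - indexed
```

With a depth limit of 2, the page "/a/b/c" is never indexed, even though it is an ordinary static page with a link pointing to it.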

What other factors may make Internet information Deep?

The World Wide Web (HTTP) is but a subset of Internet content. Other Internet protocols and services besides the Web include FTP (file transfer protocol), email, news, Telnet and Gopher (the most prominent among pre-Web protocols). There is also a large storehouse of private intranet information hidden behind firewalls; many large companies have internal document stores that exceed terabytes of information. Also, on average 44% of the "contents" of a typical Web document reside in HTML and other coded information (for example, XML or JavaScript). Finally, multimedia (images, music) is another growing category of Internet content.

All of these sources can contribute to Deep Internet content. However, CompletePlanet is currently focused on only public, text-based content, whether surface or Deep.

Where can I learn more about the Deep Web?

See BrightPlanet’s comprehensive 41-page white paper, The Deep Web: Surfacing Hidden Value.