Search engines are computer programs that explore the billions of pages of web content on the internet in order to provide relevant search results to end users. Though some search engines access a far greater percentage of those web pages than others, no single search engine “sees” or “crawls” the entire internet. Regardless of a search engine’s coverage, the basic technology is largely the same from one search engine to the next. The three primary components of each search engine are the robot, the index, and the search and retrieval system.

Robot
Also known as a “bot,” “spider,” or “crawler,” the search engine robot is a computer program that automatically surfs the internet. The robot may find a web address either on its own or because the address was submitted directly to the search engine. Once the robot finds a new website, it:
The robot’s activities, however, are not limited to finding new web pages. Once a robot finds a valid web page, it follows the links within that page to other locations within the website. In addition, search engine robots periodically return to the addresses of indexed pages to copy any changes made to the page and to register any dead or new links. The frequency of such follow-up crawling varies depending on the search engine in use. Though all robots locate and copy a website’s index.html page, how much further they crawl through the rest of the website depends on the specific program or, more precisely, the specific search engine in use. In general, the larger the search engine index, the more likely the robot is to index multiple pages per website.
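The link-following behavior described above can be sketched in a few lines. This is a minimal illustration, not any search engine's actual crawler: the `PAGES` dictionary is a hypothetical in-memory stand-in for the web (no network access is involved), and `crawl` and `max_pages` are names invented for this example. The `max_pages` limit stands in for the idea that different robots crawl a site to different depths.

```python
from collections import deque

# Hypothetical in-memory "web": each address maps to (page text, outgoing links).
PAGES = {
    "site/index.html": ("Welcome", ["site/about.html", "site/products.html"]),
    "site/about.html": ("About us", ["site/index.html"]),
    "site/products.html": ("Products", ["site/widget.html", "site/missing.html"]),
    "site/widget.html": ("Widget details", []),
}


def crawl(start, max_pages):
    """Follow links breadth-first from `start`.

    Returns (copied pages, dead links). Stops once `max_pages` pages
    have been copied, mimicking a robot with a limited crawl depth.
    """
    copied, dead = {}, []
    queue, seen = deque([start]), {start}
    while queue and len(copied) < max_pages:
        url = queue.popleft()
        if url not in PAGES:
            # The link points at a page that does not exist: register it as dead.
            dead.append(url)
            continue
        text, links = PAGES[url]
        copied[url] = text  # copy the page for the index
        for link in links:
            if link not in seen:  # never queue the same address twice
                seen.add(link)
                queue.append(link)
    return copied, dead


if __name__ == "__main__":
    copied, dead = crawl("site/index.html", max_pages=10)
    print(sorted(copied))
    print(dead)
```

With a generous limit the sketch copies all four pages and reports `site/missing.html` as a dead link; with `max_pages=1` it copies only the index.html page, which mirrors a robot that does not crawl deeper into the site.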
Index
Once the search engine robot has copied the desired information from the target web page, this information is deposited into the search engine index. The index may contain hundreds of millions, or even billions, of web pages copied by the robot. As noted by SearchEngineWatch.com, though the most popular search engines have indexed billions of web pages, index size is not all that matters.
Reported Size: The reported size of an index does not necessarily count only completely indexed pages. The total may include “partially-indexed” pages: pages that are known to the search engine (primarily Google) only because other web pages link to that address. In such a case, the search engine is “aware” of the web page, but very little of that page, if any, is actually indexed. Additionally, search engine indexes may include duplicate and spam pages, which further distort the reported size.

Page Depth: In addition to size, the page depth of a search engine index may affect the relevancy of the indexed pages. The page depth is the amount of information the robot copies to the index from each page. For instance, if the search engine copies 101K of data for each page, only the first 101K of larger pages will be indexed. As a result, any information beyond that point on a longer page will not be found by a search of that particular search engine.
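The effect of page depth can be shown with a small sketch. It assumes that “101K” means 101 kilobytes of 1,024 bytes each and that the limit applies to the raw bytes of the page; the function names are invented for this illustration and do not describe any particular search engine.

```python
# Assumption: "101K" is read as 101 * 1024 bytes of page data.
PAGE_DEPTH_BYTES = 101 * 1024


def indexed_portion(page_text, limit=PAGE_DEPTH_BYTES):
    """Return only the part of the page that fits within the page depth."""
    data = page_text.encode("utf-8")[:limit]
    # Ignore a character that may have been cut in half at the limit.
    return data.decode("utf-8", errors="ignore")


def is_searchable(page_text, term, limit=PAGE_DEPTH_BYTES):
    """True if `term` falls inside the indexed portion of the page."""
    return term.lower() in indexed_portion(page_text, limit).lower()


if __name__ == "__main__":
    long_page = "intro " + "x" * 200_000 + " conclusion"
    print(is_searchable(long_page, "intro"))
    print(is_searchable(long_page, "conclusion"))
```

In this sketch the word near the top of the long page can be found, while the word at the very end lies beyond the limit and cannot, even though it is present on the page itself.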
Search and Retrieval System

First Generation Searching
In order to deliver relevant search results, each search engine company has developed special algorithms to rank the web pages returned by a given search in order of relevance. In typical first-generation search engines, the retrieval system program analyzes the search query (the text of the user's search) and then examines the index for pages that contain the search term or terms. The program analyzes every matching page in order to determine how important the search term is on the page. The pages found most relevant are listed first in the search results. For examples of search engines utilizing first generation features, see General Free Text Search Engines.
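A first-generation retrieval system of this kind can be sketched as follows. The scoring rule used here, the share of a page's words that match the search term, is an assumption chosen for simplicity; real search engines use their own undisclosed algorithms, and the function names are invented for this example.

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Split text into lowercase words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """Map each term to {page address: share of the page's words that are this term}."""
    index = defaultdict(dict)
    for url, text in pages.items():
        words = tokenize(text)
        for term, count in Counter(words).items():
            index[term][url] = count / len(words)
    return index


def search(index, query):
    """Return addresses of pages containing any query term, most relevant first."""
    scores = Counter()
    for term in tokenize(query):
        for url, weight in index.get(term, {}).items():
            scores[url] += weight
    # Highest score first; ties are broken alphabetically by address.
    return sorted(scores, key=lambda url: (-scores[url], url))


if __name__ == "__main__":
    pages = {
        "a.html": "search engines rank pages",
        "b.html": "search search search tips",
        "c.html": "cooking recipes",
    }
    index = build_index(pages)
    print(search(index, "search"))
```

Here `b.html` is listed ahead of `a.html` for the query “search” because the term makes up a larger share of its text, and `c.html` is not returned at all because it never mentions the term.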
Second Generation Searching

Peer Ranking: Also called collective judgment, search systems such as Google’s PageRank™ derive their results from the behavior and judgment of millions of Web developers. In the words of Google, “PageRank relies on the uniquely democratic nature of the web by using its vast link structure as an indicator of an individual page's value.” PageRank interprets each link to a web page from another web page as a vote for the linked page. In addition to counting the votes cast for a particular page, the search system weighs each vote based on the quality of the page casting the vote (the page containing the link).

Directory Use: In order to supplement results gained through crawling and indexing the internet, select first generation search services have partnered with search directory services in order to include content from human-gathered directories. The partnership between the search engine Lycos and the Looksmart directory (Looksmart also utilizes web crawling features) even places the Looksmart results in the first ten results of a Lycos search if the pages are relevant.

Concept Processing - Search Interpretation: Second generation services apply concept processing technology to a search statement in order to determine the “probable intent of a search.” The purpose of such concept processing is to shift “the burden of coming up with precise or extensive terminology . . . from the user to the engine.” This essentially places such search systems in the role of a search query thesaurus. Often accomplished by the use of human-generated indexes, this category currently includes Ask Jeeves (though it is unclear how long it will remain in the category: Jeeves Scales Back Natural Language with Latest Facelift) and formerly included Oingo (Google Acquires Applied Semantics).

Concept Processing - "Horizontal" Presentation of Results: Most search tools return results in one long, vertical list.
In contrast to this method, the most prevalent concept processing technology employed places the results of a search into concept categories or “clusters” for further inspection by the user. This allows the user to review the concept categories before examining each particular page result. If the clusters are relevant to the user’s search, such organization makes it easier to focus in on the desired results. For more information on cluster technology, see Clustering Search Engines.
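To make the peer ranking idea from earlier in this section concrete, the following is a simplified sketch of link-based vote weighting. It is not Google's implementation: the damping value of 0.85, the fixed number of iterations, and the handling of pages without outgoing links are all assumptions made for this example, and `page_rank` is an invented name.

```python
def page_rank(links, damping=0.85, iterations=50):
    """Score pages by the links ("votes") they receive.

    `links` maps each page to the list of pages it links to.
    A vote from a highly ranked page counts for more, and a page
    splits its vote evenly among all the pages it links to.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page starts each round with a small baseline score.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Assumption: a page with no links spreads its vote over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank


if __name__ == "__main__":
    links = {"a": ["c"], "b": ["c", "d"], "c": ["a"], "d": ["c"]}
    for page, score in sorted(page_rank(links).items(), key=lambda item: -item[1]):
        print(page, round(score, 3))
```

In this small example, pages `a` and `d` each receive exactly one link, yet `a` ends up ranked higher because its single vote comes from `c`, the page with the most votes, while `d`'s vote comes from `b`, which nothing links to.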