Because most indexed websites are added to the database automatically by the search engine robot, the simplest method of preventing a page’s inclusion in the database is to prevent the robot from copying the page to the index. This is controlled by the webpage designer, without input from the search engine companies, and is only effective against web robots that recognize the robots meta tag, the robots.txt file, or both. This does not remove the web pages from the internet or prevent access to those pages; it merely prevents the page from being indexed by the search engine and thus prevents its display as a search result.

Robots Meta Tag

As noted in the Target Web Page Components section, the robots meta tag is information placed in the head section of a web page’s HTML. Unseen by the web page visitor, this meta tag instructs the search engine robot how to treat the page, both when indexing it and when crawling its links. Though most robots automatically index web pages and follow the links therein, the robots meta tag enables the web designer to prevent either of these actions for any particular page of a website. Examples include:
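The usual combinations of the index and follow directives can be illustrated with a short Python sketch. The four tag values below are the widely documented ones, not text recovered from this page, and the helper `robots_directives` and the sample page wrapper are hypothetical names introduced only for illustration. The sketch approximates how a compliant robot might read the tag; it is not any search engine's actual implementation.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        # Only the meta tag whose name is "robots" is relevant here.
        if tag == "meta" and (attributes.get("name") or "").lower() == "robots":
            content = attributes.get("content") or ""
            for directive in content.split(","):
                if directive.strip():
                    self.directives.add(directive.strip().lower())


def robots_directives(html):
    """Return the set of robots directives found in an HTML document (hypothetical helper)."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


# Commonly documented values of the robots meta tag.
EXAMPLE_TAGS = [
    '<meta name="robots" content="index, follow">',
    '<meta name="robots" content="noindex, follow">',
    '<meta name="robots" content="index, nofollow">',
    '<meta name="robots" content="noindex, nofollow">',
]

for tag in EXAMPLE_TAGS:
    page = "<html><head>" + tag + "</head><body></body></html>"
    found = robots_directives(page)
    print(tag, "-> index page:", "noindex" not in found,
          "| follow links:", "nofollow" not in found)
```

A page carrying the last tag asks compliant robots neither to index it nor to follow its links.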
This method of robot exclusion has significant drawbacks. First, this meta tag only affects the particular page in which it is included. Using it on a large, multi-page website therefore requires the web designer to include the tag on every individual page that should not be indexed or whose links should not be followed. Second, and most important, only a few of the current search engine robots support this meta tag. Search engines that do not adhere to this standard will index the site anyway, according to their own algorithms. Searches using one of these search engines, or a metasearch engine employing such a search engine, will continue to return the page as a result.
Robots.txt File

Considered the most basic method of preventing web page information from being indexed by a search engine, the robots.txt file acts as “a digital gatekeeper” that provides an instruction sheet to the engine’s robot. Unlike the robots meta tag, the robots.txt file is placed once, in the top-level (root) directory of the web server, and is designed to cover any and all pages that must be excluded. In addition, the robots.txt file provides greater flexibility, not only as to which files or web pages are indexed, but also as to which search engine robots are allowed to index or prohibited from indexing that information.

To exclude all robots from the entire server:
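In the standard robots.txt syntax this is a two-line record, embedded in the sketch below as a string and checked with Python's standard-library `urllib.robotparser`. The robot name `AnyBot` and the `www.example.com` address are placeholders, not names taken from this page.

```python
from urllib.robotparser import RobotFileParser

# Contents of robots.txt: every robot ("*") is barred from every path ("/").
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# A compliant robot asks before fetching; here the answer is always no.
home_allowed = parser.can_fetch("AnyBot", "http://www.example.com/")
page_allowed = parser.can_fetch("AnyBot", "http://www.example.com/docs/page.html")
print(home_allowed, page_allowed)
```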
To allow all robots complete access:
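One conventional form leaves the Disallow value empty, which excludes nothing; an empty or absent robots.txt file is generally treated the same way. The sketch below checks that reading with `urllib.robotparser`, again using the placeholder robot `AnyBot` and a placeholder URL.

```python
from urllib.robotparser import RobotFileParser

# Contents of robots.txt: an empty Disallow value excludes nothing.
ROBOTS_TXT = """\
User-agent: *
Disallow:
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Every path is open to every robot.
page_allowed = parser.can_fetch("AnyBot", "http://www.example.com/docs/page.html")
print(page_allowed)
```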
To exclude all robots from part of the server:
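Each directory to be excluded gets its own Disallow line. In the sketch below the directory names `/cgi-bin/`, `/tmp/`, and `/private/` are illustrative assumptions, as are the robot name and URLs; a real site would list its own paths.

```python
from urllib.robotparser import RobotFileParser

# Contents of robots.txt: three directories are closed to all robots,
# and everything else stays open.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

private_allowed = parser.can_fetch("AnyBot", "http://www.example.com/private/notes.html")
public_allowed = parser.can_fetch("AnyBot", "http://www.example.com/index.html")
print(private_allowed, public_allowed)
```

Paths are matched by prefix, so every file beneath a listed directory is covered by the same line.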
To exclude a single robot:
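Here the User-agent line names the robot to be shut out instead of using the `*` wildcard. `BadBot` and `OtherBot` are hypothetical robot names chosen for this sketch.

```python
from urllib.robotparser import RobotFileParser

# Contents of robots.txt: only the robot calling itself "BadBot" is excluded.
ROBOTS_TXT = """\
User-agent: BadBot
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

badbot_allowed = parser.can_fetch("BadBot", "http://www.example.com/index.html")
otherbot_allowed = parser.can_fetch("OtherBot", "http://www.example.com/index.html")
print(badbot_allowed, otherbot_allowed)
```

Robots not named in the file have no record that applies to them and are therefore unrestricted.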
To allow a single robot:
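This takes two records separated by a blank line: one that opens the site to the favored robot and a catch-all that closes it to everyone else. `GoodBot` and `OtherBot` are hypothetical robot names used only in this sketch.

```python
from urllib.robotparser import RobotFileParser

# Contents of robots.txt: "GoodBot" may fetch everything,
# while the catch-all record excludes every other robot.
ROBOTS_TXT = """\
User-agent: GoodBot
Disallow:

User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

goodbot_allowed = parser.can_fetch("GoodBot", "http://www.example.com/index.html")
otherbot_allowed = parser.can_fetch("OtherBot", "http://www.example.com/index.html")
print(goodbot_allowed, otherbot_allowed)
```

A robot follows the record that names it specifically and falls back to the `*` record only when no such record exists.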
Despite the flexibility and straightforwardness of this protocol, it is nonetheless susceptible to “bad robots”: robots that either ignore the robots.txt file or use it to locate the very directories the website owner meant to hide. Many such robots are designed to harvest e-mail addresses for spammers or to copy the entire website, neither of which benefits the website owner.
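The exposure arises because robots.txt is an ordinary, publicly readable text file and compliance with it is voluntary. As a rough illustration, the sketch below shows how little effort it takes to read the excluded paths straight out of the file. The directory names and the helper `disallowed_paths` are hypothetical.

```python
# Hypothetical robots.txt contents; the directory names are illustrative only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /staff-only/
"""


def disallowed_paths(robots_txt):
    """List every non-empty Disallow value in a robots.txt document (hypothetical helper)."""
    paths = []
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and surrounding whitespace
        field, _, value = line.partition(":")
        if field.strip().lower() == "disallow" and value.strip():
            paths.append(value.strip())
    return paths


# The file meant to keep robots out doubles as a list of what is being kept private.
print(disallowed_paths(ROBOTS_TXT))
```

For this reason robots.txt is best regarded as a request to well-behaved robots, and genuinely sensitive directories are better protected by access controls on the server.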