Robot Exclusion

Because most indexed websites are added to the database automatically by the search engine robot, the simplest method of preventing a page’s inclusion in the database is to prevent the robot from copying the page to the index.  This is controlled by the webpage designer, without input from the search engine companies, and is only effective against web robots that recognize the robots meta tag, the robots.txt file, or both.  It does not remove the web pages from the internet or prevent access to those pages; it merely prevents the page from being indexed by the search engine and thus from being displayed as a search result.

Robots Meta Tag

As noted in the Target Web Page Components section, the robots meta tag is information placed in the head section of a web page’s HTML.  Unseen by the web page visitor, this meta tag tells the search engine robot how to treat the page, both when indexing it and when crawling the links it contains.  Though most robots automatically index web pages and follow the links therein, the robots meta tag enables the web designer to prevent either of these actions for any particular page of a website.

Examples include:

<meta name="robots" content="index,follow">

<meta name="robots" content="noindex,follow">

<meta name="robots" content="index,nofollow">

<meta name="robots" content="noindex,nofollow">
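A minimal Python sketch of how a cooperating robot might read the four tags above. The class name `RobotsMetaParser` and the function `robots_policy` are hypothetical names chosen for this illustration, not part of any search engine's software, and the sketch assumes that a page without the tag is indexed and followed by default.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives of the first <meta name="robots"> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.directives = None  # stays None when the page has no robots meta tag

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.directives is not None:
            return
        attributes = {name.lower(): (value or "") for name, value in attrs}
        if attributes.get("name", "").lower() == "robots":
            content = attributes.get("content", "")
            # Split "noindex,follow" into a set of lower-case directives.
            self.directives = {
                part.strip().lower() for part in content.split(",") if part.strip()
            }


def robots_policy(html):
    """Return (may_index, may_follow); assumed default is (True, True)."""
    parser = RobotsMetaParser()
    parser.feed(html)
    directives = parser.directives or set()
    return ("noindex" not in directives, "nofollow" not in directives)


page = '<html><head><meta name="robots" content="noindex,follow"></head></html>'
may_index, may_follow = robots_policy(page)
```

Here `may_index` is false and `may_follow` is true, so a robot honoring the tag would skip the page itself but still crawl its links.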

This method of robot exclusion has significant drawbacks.  First, the meta tag only affects the particular page in which it is included.  Using it on a large, multi-page website therefore requires the web designer to include the tag on each individual page that should not be indexed or whose links should not be followed.  Second, and most important, only a few of the current search engine robots support this meta tag.  Search engines that do not adhere to this standard will index the site anyway according to their own algorithms, and searches using one of these search engines, or a metasearch engine employing such a search engine, will continue to return the page as a result.


Robots.txt File

Considered the most basic method of preventing web page information from being indexed by a search engine, the robots.txt file acts as “a digital gatekeeper” that provides an instruction sheet to the engine’s robot.  Unlike the robots meta tag, which sits inside each individual page, the robots.txt file is placed in the website’s top-level (root) server directory, where robots look for it, and is designed to encompass any and all pages that must be excluded.  In addition, the robots.txt file provides greater flexibility as to not only which files or web pages are indexed, but also which search engine robots are allowed to index that information or prohibited from doing so.

To exclude all robots from the entire server:

User-agent: *
Disallow: /

To allow all robots complete access:

User-agent: *
Disallow:

To exclude all robots from part of the server:

User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /private/

To exclude a single robot:

User-agent: BadBot
Disallow: /

To allow a single robot:

User-agent: WebCrawler
Disallow:

User-agent: *
Disallow: /
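The effect of these rule sets can be checked with Python's standard `urllib.robotparser` module. The sketch below is only an illustration: the helper `is_allowed` and the `www.example.com` addresses are assumptions made for this example, and each record is written with its `User-agent` and `Disallow` lines on consecutive lines, since the parser treats a blank line as the end of a record.

```python
from urllib.robotparser import RobotFileParser

# "To allow a single robot" from the examples above.
ALLOW_ONE_ROBOT = """\
User-agent: WebCrawler
Disallow:

User-agent: *
Disallow: /
"""

# "To exclude all robots from part of the server" from the examples above.
PARTIAL_EXCLUSION = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /private/
"""


def is_allowed(robots_txt, user_agent, url):
    """Report whether user_agent may fetch url under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical addresses, used only to exercise the rules.
home = "http://www.example.com/index.html"
private = "http://www.example.com/private/notes.html"

webcrawler_ok = is_allowed(ALLOW_ONE_ROBOT, "WebCrawler", home)
badbot_ok = is_allowed(ALLOW_ONE_ROBOT, "BadBot", home)
private_ok = is_allowed(PARTIAL_EXCLUSION, "AnyBot", private)
```

Under these rules only WebCrawler is admitted by the first file, and the second file keeps every robot out of `/private/` while leaving the home page open.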

Despite the flexibility and straightforwardness of this protocol, it is nonetheless susceptible to “bad robots,” robots which either ignore the robots.txt file or instead read it to locate the directories the website owner meant to keep hidden.  Many such robots are designed to harvest e-mail addresses for spammers or to copy the entire website, neither of which benefits the website owner.
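Because the file is plain text that any visitor can read, the excluded paths are easy to recover, which is why the file works as a request rather than a barrier. The short Python sketch below illustrates the point; `disallowed_paths` is a hypothetical helper written for this example and does not handle every variation a real robots.txt file may contain.

```python
def disallowed_paths(robots_txt):
    """Return every non-empty Disallow path found in robots.txt text."""
    paths = []
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and surrounding whitespace
        field, _, value = line.partition(":")
        if field.strip().lower() == "disallow" and value.strip():
            paths.append(value.strip())
    return paths


sample = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /private/
"""

listed = disallowed_paths(sample)
```

For the sample above, `listed` holds the three directories the site asked robots to avoid, which is exactly the information a bad robot is looking for.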

 

More:

  • View the contents of any website’s robots.txt file at [www.company-name.com/robots.txt], substituting the desired web address for the company name.
  • For an example of potential problems and misuse, see Robots, Iraq, and the White House.
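The substitution described in the first item can be sketched in Python with the standard `urllib.parse` module. The function name `robots_txt_url` and the `www.example.com` address are assumptions for this illustration, and the sketch expects an address that includes its scheme (such as `http://`).

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url):
    """Build the robots.txt address for the site that hosts page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host; replace the path and drop any query or fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


# Hypothetical page address used only for demonstration.
address = robots_txt_url("http://www.example.com/products/index.html?id=3")
```

Whatever page the address points to, the result names the robots.txt file in the top-level directory of the same site.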

 


This website was created as an assignment for the Cyberspace Law seminar at the University of North Carolina School of Law.  Information contained in this site should not be considered legal advice. This website was created solely for educational purposes. All copyrighted content, trade names, and trademarks incorporated into this website are property of their respective owners and are reproduced with permission and/or under the Fair Use guidelines for educational purposes.

Last updated: 04/12/05.