robots.txt : Java Glossary


robots.txt is a file you can place in the root directory of your website to tell web crawlers (search engines) which pages to index and which to ignore. A typical robots.txt file might look like this:

# parts of the mindprod.com website not indexed
user-agent: *
disallow: /template.html
disallow: /include/
disallow: /jgloss/include/
Sitemap: http://mindprod.com/sitemap.gz

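This means that all crawlers (the * matches every user agent, not browsers) should stay away from the file template.html and from anything in the two directories listed. Each disallow value is matched as a prefix of the URL path. The original robots.txt convention has no way to exclude files by extension, though some major crawlers, including Googlebot, honour the wildcard characters * and $, so for them a line such as disallow: /*.pdf$ excludes PDF files. Note that the Sitemap directive takes a full URL (Uniform Resource Locator), unlike the others.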
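To make the matching rule concrete, here is a minimal Java sketch of how a crawler might decide whether a path is off limits. The class and method names (RobotsSketch, disallowedPrefixes, isAllowed) are invented for illustration and are not part of any standard library. It handles only the user-agent: * group and plain prefix matching; it ignores allow lines, wildcards and named crawlers, so treat it as a teaching aid rather than a conforming parser.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical sketch: collects the disallow prefixes that apply to all
// crawlers (user-agent: *) and tests paths against them by prefix matching.
final class RobotsSketch {

    /** Returns the disallow values found in groups that include "user-agent: *". */
    static List<String> disallowedPrefixes(String robotsTxt) {
        List<String> prefixes = new ArrayList<>();
        boolean inStarGroup = false;   // are we inside a group that names "*"?
        boolean lastWasAgent = false;  // consecutive user-agent lines share one group
        for (String raw : robotsTxt.split("\\R")) {
            // Strip trailing comments and surrounding whitespace.
            int hash = raw.indexOf('#');
            String line = (hash >= 0 ? raw.substring(0, hash) : raw).trim();
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue; // blank line or comment only
            }
            String field = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String value = line.substring(colon + 1).trim();
            if (field.equals("user-agent")) {
                inStarGroup = (lastWasAgent && inStarGroup) || value.equals("*");
                lastWasAgent = true;
            } else {
                lastWasAgent = false;
                // An empty disallow value means "nothing is disallowed".
                if (field.equals("disallow") && inStarGroup && !value.isEmpty()) {
                    prefixes.add(value);
                }
            }
        }
        return prefixes;
    }

    /** A path is allowed unless it starts with one of the disallowed prefixes. */
    static boolean isAllowed(List<String> prefixes, String path) {
        for (String prefix : prefixes) {
            if (path.startsWith(prefix)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String robotsTxt = String.join("\n",
                "# parts of the mindprod.com website not indexed",
                "user-agent: *",
                "disallow: /template.html",
                "disallow: /include/",
                "disallow: /jgloss/include/",
                "Sitemap: http://mindprod.com/sitemap.gz");
        List<String> prefixes = disallowedPrefixes(robotsTxt);
        for (String path : new String[] {
                "/template.html", "/jgloss/include/menu.html", "/jgloss/robotstxt.html" }) {
            System.out.println(path + " allowed? " + isAllowed(prefixes, path));
        }
    }
}
```

Because matching is by prefix, a single disallow line for a directory covers every file beneath it, which is why the example file needs only one line per directory.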

You can also control spiders with the robots meta tag in an individual page or with the X-Robots-Tag field in the HTTP (Hypertext Transfer Protocol) response header, and you can guide them with a sitemap. A sitemap is not a human-comprehensible HTML (Hypertext Markup Language) page but an XML (eXtensible Markup Language) document in a special format, usually gzipped.
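As a rough illustration of what such a sitemap contains, the following Java sketch builds the simplest possible sitemap (a urlset element holding one url/loc pair per page, in the sitemaps.org 0.9 namespace) and gzips it with java.util.zip. The class and method names (SitemapSketch, toXml, gzip) are made up for this example, and optional elements such as lastmod are left out. A real site would more likely use a generator utility.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;
import java.util.zip.GZIPOutputStream;

// Hypothetical sketch: renders a minimal XML sitemap and compresses it.
final class SitemapSketch {

    /** Builds a minimal sitemap document with one <url><loc> entry per URL. */
    static String toXml(List<String> urls) {
        StringBuilder sb = new StringBuilder();
        sb.append("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n");
        sb.append("<urlset xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\">\n");
        for (String url : urls) {
            sb.append("  <url><loc>").append(escape(url)).append("</loc></url>\n");
        }
        sb.append("</urlset>\n");
        return sb.toString();
    }

    /** Escapes the five characters that are special in XML text. */
    static String escape(String s) {
        return s.replace("&", "&amp;")   // must come first
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;")
                .replace("'", "&apos;");
    }

    /** Gzips the text as UTF-8 and returns the compressed bytes. */
    static byte[] gzip(String text) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bytes)) {
            gz.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        String xml = toXml(Arrays.asList(
                "http://mindprod.com/jgloss/robotstxt.html"));
        System.out.print(xml);
        System.out.println(gzip(xml).length + " bytes after gzip");
    }
}
```

The bytes returned by gzip are what you would save as the file named in the Sitemap line of robots.txt.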

clockwatchers.com robots.txt tutorial
Google Sitemap Generator Utility
Robot
robots.txt howto
robots.txt tutorial
search engines
sitemap
spider

	This page is posted on the web at:	http://mindprod.com/jgloss/robotstxt.html