robots.txt : Java Glossary


robots.txt
robots.txt is a file you can place in the root directory of your website to tell web crawlers (search engines) which pages to index and which to ignore. A typical robots.txt file might look like this:
# parts of the mindprod.com website not indexed
user-agent: *
disallow: /template.html
disallow: /include/
disallow: /jgloss/include/
Sitemap: http://mindprod.com/sitemap.gz

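It means that all crawlers (not browsers; the * matches every crawler's user-agent) should stay away from the file template.html and from anything in the two directories mentioned. Each disallow value is a path prefix, so the original convention has no way to exclude certain file extensions, though some major crawlers, such as Googlebot, honour * and $ wildcards as an extension. Note that the Sitemap directive takes a full URL (Uniform Resource Locator), unlike the others.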
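The prefix matching described above can be sketched in Java. This is only a minimal illustration of how a well-behaved crawler might read the rules, not a complete parser: the class name RobotsSketch and its methods are hypothetical, it looks only at the user-agent: * group, and it ignores allow lines and wildcards.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Minimal sketch (hypothetical class): applies disallow prefixes from a robots.txt. */
public class RobotsSketch {

    /** Collects the disallow path prefixes that follow a "user-agent: *" line. */
    public static List<String> disallowedPrefixes(String robotsTxt) {
        List<String> prefixes = new ArrayList<>();
        boolean inWildcardGroup = false;
        for (String rawLine : robotsTxt.split("\\R")) {
            // Drop comments, which start with '#', then surrounding whitespace.
            int hash = rawLine.indexOf('#');
            String line = (hash >= 0 ? rawLine.substring(0, hash) : rawLine).trim();
            // Split at the FIRST colon so "Sitemap: http://..." keeps its URL intact.
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue; // blank line or something we do not understand
            }
            String field = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String value = line.substring(colon + 1).trim();
            if (field.equals("user-agent")) {
                // Simplification: only the catch-all group is honoured.
                inWildcardGroup = value.equals("*");
            } else if (field.equals("disallow") && inWildcardGroup && !value.isEmpty()) {
                prefixes.add(value);
            }
            // Other fields, such as Sitemap, are ignored in this sketch.
        }
        return prefixes;
    }

    /** A path is allowed unless it starts with one of the disallowed prefixes. */
    public static boolean isAllowed(List<String> prefixes, String path) {
        for (String prefix : prefixes) {
            if (path.startsWith(prefix)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String robotsTxt = String.join("\n",
                "# parts of the mindprod.com website not indexed",
                "user-agent: *",
                "disallow: /template.html",
                "disallow: /include/",
                "disallow: /jgloss/include/",
                "Sitemap: http://mindprod.com/sitemap.gz");
        List<String> prefixes = disallowedPrefixes(robotsTxt);
        System.out.println(isAllowed(prefixes, "/jgloss/robotstxt.html"));
        System.out.println(isAllowed(prefixes, "/include/header.html"));
    }
}
```

Because the match is a plain startsWith, a rule such as disallow: /include/ covers every file beneath that directory, but nothing in this scheme can single out, say, all *.pdf files.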

You can also control spiders with the robots meta tag, with the X-Robots-Tag field in the HTTP (Hypertext Transfer Protocol) response header, and with a sitemap. A sitemap is not a human-comprehensible HTML (Hypertext Markup Language) page but a gzipped XML (Extensible Markup Language) document in a special format.
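To give a feel for that special format, here is a minimal Java sketch that builds a sitemap and gzips it. The class name SitemapSketch and its methods are hypothetical, and only the required loc element of the sitemaps.org schema is written; this is not necessarily how the mindprod.com sitemap is generated.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

/** Minimal sketch (hypothetical class): builds and gzips a sitemap document. */
public class SitemapSketch {

    /** Builds sitemap XML with one url/loc entry per page. */
    public static String sitemapXml(List<String> urls) {
        StringBuilder sb = new StringBuilder();
        sb.append("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n");
        sb.append("<urlset xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\">\n");
        for (String url : urls) {
            sb.append("  <url><loc>").append(escape(url)).append("</loc></url>\n");
        }
        sb.append("</urlset>\n");
        return sb.toString();
    }

    /** Escapes the five characters that XML reserves; '&' must be handled first. */
    static String escape(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;")
                .replace("'", "&apos;");
    }

    /** Compresses text as UTF-8 in gzip format. */
    public static byte[] gzip(String text) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bytes)) {
            gz.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Decompresses gzip data back to text, to check the round trip. */
    public static String gunzip(byte[] data) {
        try (GZIPInputStream gz = new GZIPInputStream(new ByteArrayInputStream(data))) {
            return new String(gz.readAllBytes(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String xml = sitemapXml(List.of("http://mindprod.com/jgloss/robotstxt.html"));
        byte[] compressed = gzip(xml);
        System.out.println(gunzip(compressed));
    }
}
```

In practice the compressed bytes would be written to a file and that file's full URL listed on the Sitemap line of robots.txt.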


This page is posted
on the web at:

http://mindprod.com/jgloss/robotstxt.html

Canadian Mind Products
Contact Roedy. Please feel free to link to this page without explicit permission.
