How to keep robots out of your web site

Perfect · 2011-11-29 09:08

The robots.txt file

As you know, search engines were created to help people find information quickly on the Internet, and they gather much of that information through robots (also known as spiders or crawlers) that look through web pages for them.

Spiders, robots, or crawlers scan the Web and record all kinds of information. They usually start with URLs submitted by users, links found on web sites, site map files, or the top level of a site.

Once a robot accesses the home page, it recursively accesses all the pages linked from that page. A robot can also visit every page it finds on a particular server.

After a robot finds a web page, it indexes the title, keywords, text, and so on. Sometimes, however, you may want to prevent search engines from indexing some of your web pages, such as press releases or specially marked pages (for example, affiliate pages). Keep in mind that individual robots comply with these conventions on a purely voluntary basis.

The Robots Exclusion Protocol

So if you want to keep robots out of some of your web pages, you can ask them to ignore the pages you do not want indexed. To do that, place a robots.txt file in the root directory of your web server.

For example, if you have a directory called e-books and want robots to stay out of it, the robots.txt file should say:

User-agent: *
Disallow: /e-books/
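One way to check that a rule like this does what you expect is to test it with the urllib.robotparser module from Python's standard library. The following is only a minimal sketch: the host www.example.com, the robot name ExampleBot, and the page names are made-up placeholders, not details from a real site.

```python
from urllib.robotparser import RobotFileParser

# Rules for the e-books example. Each directive goes on its own line,
# and the path starts with a slash.
ROBOTS_TXT = """\
User-agent: *
Disallow: /e-books/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that is already in memory (no network access)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# "ExampleBot" and www.example.com are hypothetical placeholders.
blocked = parser.can_fetch("ExampleBot", "http://www.example.com/e-books/guide.html")
allowed = parser.can_fetch("ExampleBot", "http://www.example.com/index.html")

print("e-books page may be fetched:", blocked)
print("home page may be fetched:", allowed)
```

With these rules, the first check should come back False and the second True. Remember that this only shows what a well-behaved robot would do; nothing forces a robot to run such a check.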

If you do not have enough control over the server to create a robots.txt file, you can try adding a META tag to the head section of an HTML document.

For example, a tag like this tells robots not to index a particular page and not to follow the links on it:

<meta name="ROBOTS" content="noindex, nofollow">

Support for the robots meta tag is not as widespread as support for the Robots Exclusion Protocol, but most major web indexes now support it.
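To see how a robot might read such a tag, here is a small sketch that uses Python's standard html.parser module to pull the directives out of a page. The class name RobotsMetaParser, the helper robots_directives, and the sample page are all invented for illustration; real crawlers may handle the tag differently.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives found in <meta name="robots"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # HTMLParser lowercases tag and attribute names; values keep their case.
        attr_map = {name: (value or "") for name, value in attrs}
        if attr_map.get("name", "").lower() != "robots":
            return
        for token in attr_map.get("content", "").split(","):
            token = token.strip().lower()
            if token:
                self.directives.add(token)


def robots_directives(html: str) -> set:
    """Return the set of robots directives declared in an HTML document."""
    parser = RobotsMetaParser()
    parser.feed(html)
    parser.close()
    return parser.directives


# Hypothetical sample page carrying the tag from the article.
PAGE = (
    '<html><head><meta name="ROBOTS" content="noindex, nofollow"></head>'
    "<body>Affiliate page</body></html>"
)

directives = robots_directives(PAGE)
print("may index:", "noindex" not in directives)
print("may follow links:", "nofollow" not in directives)
```

A page without the tag yields an empty set, which a robot would normally treat as permission to index the page and follow its links.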

News postings

To keep search engines from archiving your news postings, you can add an "X-No-Archive" line to the headers of your posts:

X-no-archive: yes

However, although most common news clients allow you to add an X-No-Archive line to the headers of your news postings, some of them do not.
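If your news client will not add the header for you and you assemble posts with a script instead, the line can be set like any other header. This is just a sketch using Python's standard email.message module: the function build_post, the address, the newsgroup name, and the text are hypothetical, and actually sending the post to a news server is left out.

```python
from email.message import EmailMessage


def build_post(sender: str, newsgroup: str, subject: str, body: str) -> EmailMessage:
    """Build a news article that asks archives not to keep a copy."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["Newsgroups"] = newsgroup
    msg["Subject"] = subject
    # The header from the article. Archives honor it only voluntarily.
    msg["X-No-Archive"] = "yes"
    msg.set_content(body)
    return msg


# All values below are placeholders for illustration only.
post = build_post(
    sender="poster@example.com",
    newsgroup="misc.test",
    subject="Test posting",
    body="This message should not be archived.\n",
)

print(post.as_string())
```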

The problem is that most search engines assume that all the information they find is public unless otherwise noted.

So be careful: although robots.txt and the other exclusion rules can help keep your material out of the major search engines, there are other robots that do not respect these rules.

If you are very concerned about the privacy of your e-mail and Usenet messages, you should use anonymous remailers and PGP. You can read about them here:

http://www.well.com/user/abacard/remail.html
http://www.io.com/~combs/htmls/crypto.html
http://world.std.com/~franl/pgp/

Even if you are not particularly concerned about privacy, remember that anything you write will be indexed and archived somewhere for eternity, so use the robots.txt file whenever you need it.
