<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html>
<head>
<title>A Standard for Robot Exclusion</title>
</head>
<body>
<h1>A Standard for Robot Exclusion</h1>

Table of contents:

<ul>
<li><a href="#status">Status of this document</a>
<li><a href="#introduction">Introduction</a>
<li><a href="#method">Method</a>
<li><a href="#format">Format</a>
<li><a href="#examples">Examples</a>
<li><a href="#code">Example Code</a>
<li><a href="#author">Author's Address</a>
</ul>
<hr>
<h2><a name="status">Status of this document</a></h2>

This document represents a consensus reached on 30 June 1994 on the robots
mailing list (robots-request@nexor.co.uk) between the majority of
robot authors and other people with an interest in robots. It has
also been open for discussion on the Technical World Wide Web
mailing list (www-talk@info.cern.ch). This document is based on a
previous working draft under the same title.
<p>
It is not an official standard backed by a standards body,
or owned by any commercial organisation.
It is not enforced by anybody, and there is no guarantee that
all current and future robots will use it.
Consider it a common facility the majority of robot authors
offer the WWW community to protect WWW servers against
unwanted accesses by their robots.</p>
<p>
The latest version of this document can be found at
<a href="http://web.nexor.co.uk/mak/doc/robots/norobots.html">
http://web.nexor.co.uk/mak/doc/robots/norobots.html</a>.</p>

<hr>
<h2><a name="introduction">Introduction</a></h2>

WWW Robots (also called wanderers or spiders) are programs
that traverse many pages in the World Wide Web by
recursively retrieving linked pages. For more information
see <a href="robots.html">the robots page</a>.
<p>
In 1993 and 1994 there were occasions where robots
visited WWW servers where they weren't welcome for
various reasons. Sometimes these reasons were robot specific,
e.g. certain robots swamped servers with rapid-fire
requests, or retrieved the same files repeatedly.
In other situations robots traversed parts of WWW servers
that weren't suitable, e.g. very deep virtual trees,
duplicated information, temporary information, or
cgi-scripts with side-effects (such as voting).</p>
<p>
These incidents demonstrated the need for established
mechanisms that let WWW servers indicate to robots which parts
of the server should not be accessed. This standard
addresses this need with an operational solution.</p>
<hr>

<h2><a name="method">The Method</a></h2>
The method used to exclude robots from a server is to
create a file on the server which specifies an access
policy for robots.
This file must be accessible via HTTP on the local URL
"<code>/robots.txt</code>".
The contents of this file are specified <a href="#format">below</a>.

<p>
This approach was chosen because it can be easily
implemented on any existing WWW server, and a robot can find
the access policy with only a single document retrieval.</p>
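
<p>
For illustration only (not part of this specification), a robot might
retrieve the access policy as sketched below in Perl, using the
LWP::Simple module from libwww-perl; the host name is a made-up
example.</p>

<pre>
#!/usr/bin/perl
use strict;
use LWP::Simple qw(get);

# Derive the policy URL from the server root and fetch it once.
my $server = "http://www.site.com";       # hypothetical host
my $policy = get("$server/robots.txt");   # undef if not retrievable

if (defined $policy) {
    print "Fetched access policy:\n$policy";
} else {
    print "No /robots.txt found; all URLs may be visited.\n";
}
</pre>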
<p>
A possible drawback of this single-file approach is that only a
server administrator can maintain such a list, not the
individual document maintainers on the server. This can be
resolved by a local process to construct the single file
from a number of others, but if, or how, this is done is
outside the scope of this document.</p>
<p>
The choice of the URL was motivated by several criteria:</p>

<ul>
<li>
The filename should fit within the file naming restrictions of all
common operating systems.
<li>
The filename extension should not require extra server
configuration.
<li>
The filename should indicate the purpose of the file
and be easy to remember.
<li>
The likelihood of a clash with existing files should
be minimal.
</ul>
<hr>

<h2><a name="format">The Format</a></h2>
The format and semantics of the "<code>/robots.txt</code>" file
are as follows:

<p>
The file consists of one or more records separated by one or
more blank lines (terminated by CR, CR/NL, or NL). Each
record contains lines of the form
"<code>&lt;field&gt;:&lt;optionalspace&gt;&lt;value&gt;&lt;optionalspace&gt;</code>".
The field name is case insensitive.</p>
<p>
Comments can be included in the file using UNIX Bourne shell
conventions: the '<code>#</code>' character indicates
that the preceding space (if any) and the remainder of
the line up to the line terminator are discarded.
Lines containing only a comment are discarded completely,
and therefore do not indicate a record boundary.</p>
<p>
The record starts with one or more <code>User-agent</code>
lines, followed by one or more <code>Disallow</code> lines,
as detailed below. Unrecognised headers are ignored.</p>
<dl>
<dt>User-agent</dt>
<dd>
The value of this field is the name of the robot for which the
record describes the access policy.

<p>
If more than one User-agent field is present, the record
describes an identical access policy for more
than one robot. At least one field needs to be present
per record.</p>

<p>
The robot should be liberal in interpreting this field.
A case insensitive substring match of the name without
version information is recommended (illustrated in the
sketch after this list).</p>

<p>
If the value is '<code>*</code>', the record describes
the default access policy for any robot that has not
matched any of the other records. It is not allowed to
have two such records in the "<code>/robots.txt</code>"
file.</p></dd>
<dt>Disallow</dt>
<dd>
The value of this field specifies a partial URL that is not
to be visited. This can be a full path, or a partial
path; any URL that starts with this value will not be
retrieved. For example, <code>Disallow: /help</code>
disallows both <code>/help.html</code> and
<code>/help/index.html</code>, whereas
<code>Disallow: /help/</code> would disallow
<code>/help/index.html</code>
but allow <code>/help.html</code> (see the sketch
after this list).

<p>
An empty value indicates that all URLs can be
retrieved. At least one Disallow field needs to
be present in a record.</p></dd>
</dl>
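
<p>
By way of illustration only (this code is not part of the
specification), the following Perl sketch parses a
"<code>/robots.txt</code>" body and applies the matching rules above:
a case insensitive substring match for User-agent values, and a prefix
match for Disallow values. All function and variable names are made up
for this example.</p>

<pre>
#!/usr/bin/perl
use strict;

# Split a /robots.txt body into records of the form
# { agents => [...], disallow => [...] }.
sub parse_robots_txt {
    my ($text) = @_;
    my (@records, $current);
    for my $line (split /\r\n|\r|\n/, $text) {
        $line =~ s/\s*#.*$//;              # strip comments (Bourne shell style)
        if ($line =~ /^\s*$/) {            # a blank line ends the record
            push @records, $current if $current;
            undef $current;
            next;
        }
        if ($line =~ /^\s*([^:]+?)\s*:\s*(.*?)\s*$/) {
            my ($field, $value) = (lc $1, $2);
            $current ||= { agents => [], disallow => [] };
            push @{$current->{agents}},   $value if $field eq 'user-agent';
            push @{$current->{disallow}}, $value if $field eq 'disallow';
        }                                   # unrecognised fields are ignored
    }
    push @records, $current if $current;
    return @records;
}

# Decide whether $robot may retrieve $path. Records naming the robot
# explicitly are consulted before the '*' default record, so the
# default applies only to robots not matched elsewhere.
sub allowed {
    my ($robot, $path, @records) = @_;
    for my $pass (0, 1) {
        for my $rec (@records) {
            my $is_default = grep { $_ eq '*' } @{$rec->{agents}};
            next if $pass == 0 ? $is_default : !$is_default;
            next unless $is_default
                or grep { index(lc $robot, lc $_) >= 0 } @{$rec->{agents}};
            for my $prefix (@{$rec->{disallow}}) {
                next if $prefix eq '';      # empty value: everything allowed
                return 0 if index($path, $prefix) == 0;
            }
            return 1;
        }
    }
    return 1;                               # no record applies: welcome
}
</pre>

<p>
Applied to the first example in the next section, a call such as
<code>allowed("ExampleRobot/1.0", "/tmp/x", parse_robots_txt($policy))</code>
would return false, since the record with <code>User-agent: *</code>
disallows any path starting with "<code>/tmp/</code>".</p>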
The presence of an empty "<code>/robots.txt</code>" file
has no explicit associated semantics; it will be treated
as if it were not present, i.e. all robots will consider
themselves welcome.
<hr>

<h2><a name="examples">Examples</a></h2>
The following example "<code>/robots.txt</code>" file specifies
that no robots should visit any URL starting with
"<code>/cyberworld/map/</code>" or
"<code>/tmp/</code>":
<hr>
<pre>
# robots.txt for http://www.site.com/

User-agent: *
Disallow: /cyberworld/map/ # This is an infinite virtual URL space
Disallow: /tmp/ # these will soon disappear
</pre>
<hr>
This example "<code>/robots.txt</code>" file specifies
that no robots should visit any URL starting with
"<code>/cyberworld/map/</code>", except the robot called
"<code>cybermapper</code>":
<hr>
<pre>
# robots.txt for http://www.site.com/

User-agent: *
Disallow: /cyberworld/map/ # This is an infinite virtual URL space

# Cybermapper knows where to go.
User-agent: cybermapper
Disallow:
</pre>
<hr>

This example indicates that no robots should visit
this site further:
<hr>
<pre>
# go away
User-agent: *
Disallow: /
</pre>
<hr>
<h2><a name="code">Example Code</a></h2>

Although it is not part of this specification, some example code
in Perl is available in <a href="norobots.pl">norobots.pl</a>. It
is a bit more flexible in its parsing than this document
specifies, and is provided as-is, without warranty.
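
<p>
For completeness, a brief usage sketch follows for the
<code>WWW::RobotRules</code> module that ships with the libwww-perl
distribution containing this document; it implements the rules above.
The robot name and host are made-up examples, and this sketch is
likewise not part of the specification.</p>

<pre>
#!/usr/bin/perl
use strict;
use LWP::Simple qw(get);
use WWW::RobotRules;

# Identify this robot; the name is matched against User-agent fields.
my $rules = WWW::RobotRules->new('ExampleRobot/1.0');

# Fetch and parse the access policy of a (hypothetical) site.
my $robots_url = 'http://www.site.com/robots.txt';
my $content    = get($robots_url);
$rules->parse($robots_url, $content) if defined $content;

# Consult the policy before retrieving a URL.
my $url = 'http://www.site.com/cyberworld/map/index.html';
print $rules->allowed($url) ? "allowed\n" : "disallowed\n";
</pre>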
<h2><a name="author">Author's Address</a></h2>

<address>
<a href="/mak/mak.html">Martijn Koster</a>
&lt;m.koster@webcrawler.com&gt;<br>
NEXOR<br>
PO Box 132,<br>
Nottingham,<br>
The United Kingdom<br>
Phone: +44 602 520576
</address>
</body>
</html>