<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html>
<head>
<title>A Standard for Robot Exclusion</title>
</head>
<body>
<h1>A Standard for Robot Exclusion</h1>

Table of contents:

<ul>
<li><a href="#status">Status of this document</a>
<li><a href="#introduction">Introduction</a>
<li><a href="#method">Method</a>
<li><a href="#format">Format</a>
<li><a href="#examples">Examples</a>
<li><a href="#code">Example Code</a>
<li><a href="#author">Author's Address</a>
</ul>
<hr>
<h2><a name="status">Status of this document</a></h2>

This document represents a consensus reached on 30 June 1994 on the robots
mailing list (robots-request@nexor.co.uk) between the majority of
robot authors and other people with an interest in robots. It has
also been open for discussion on the Technical World Wide Web
mailing list (www-talk@info.cern.ch). This document is based on a
previous working draft under the same title.
<p>
It is not an official standard backed by a standards body,
or owned by any commercial organisation.
It is not enforced by anybody, and there is no guarantee that
all current and future robots will use it.
Consider it a common facility the majority of robot authors
offer the WWW community to protect WWW servers against
unwanted accesses by their robots.</p>
<p>
The latest version of this document can be found at
<a href="http://web.nexor.co.uk/mak/doc/robots/norobots.html">
http://web.nexor.co.uk/mak/doc/robots/norobots.html</a>.</p>

<hr>
<h2><a name="introduction">Introduction</a></h2>

WWW Robots (also called wanderers or spiders) are programs
that traverse many pages in the World Wide Web by
recursively retrieving linked pages. For more information
see <a href="robots.html">the robots page</a>.
<p>
In 1993 and 1994 there were occasions where robots
visited WWW servers where they weren't welcome for
various reasons. Sometimes these reasons were robot specific,
e.g. certain robots swamped servers with rapid-fire
requests, or retrieved the same files repeatedly.
In other situations robots traversed parts of WWW servers
that weren't suitable, e.g. very deep virtual trees,
duplicated information, temporary information, or
cgi-scripts with side-effects (such as voting).</p>
<p>
These incidents demonstrated the need for established
mechanisms that let WWW servers indicate to robots which parts
of the server should not be accessed. This standard
addresses this need with an operational solution.</p>
<hr>

<h2><a name="method">The Method</a></h2>
The method used to exclude robots from a server is to
create a file on the server which specifies an access
policy for robots.
This file must be accessible via HTTP on the local URL
"<code>/robots.txt</code>".
The contents of this file are specified <a href="#format">below</a>.

<p>
This approach was chosen because it can be easily
implemented on any existing WWW server, and a robot can find
the access policy with only a single document retrieval.</p>
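
<p>
For illustration only (not part of this specification), a robot might
retrieve the access policy as sketched below in Perl, using the
LWP::Simple module from libwww-perl; the host name is a made-up
example.</p>

<pre>
#!/usr/bin/perl
use strict;
use LWP::Simple qw(get);

# Derive the policy URL from the server root and fetch it once.
my $server = "http://www.site.com";       # hypothetical host
my $policy = get("$server/robots.txt");   # undef if not retrievable

if (defined $policy) {
    print "Fetched access policy:\n$policy";
} else {
    print "No /robots.txt found; all URLs may be visited.\n";
}
</pre>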
<p>
A possible drawback of this single-file approach is that only a
server administrator can maintain such a list, not the
individual document maintainers on the server. This can be
resolved by a local process to construct the single file
from a number of others, but if, or how, this is done is
outside the scope of this document.</p>
<p>
The choice of the URL was motivated by several criteria:</p>

<ul>
<li>
The filename should fit within the file naming restrictions of all
common operating systems.
<li>
The filename extension should not require extra server
configuration.
<li>
The filename should indicate the purpose of the file
and be easy to remember.
<li>
The likelihood of a clash with existing files should
be minimal.
</ul>
<hr>

<h2><a name="format">The Format</a></h2>
The format and semantics of the "<code>/robots.txt</code>" file
are as follows:

<p>
The file consists of one or more records separated by one or
more blank lines (terminated by CR, CR/NL, or NL). Each
record contains lines of the form
"<code>&lt;field&gt;:&lt;optionalspace&gt;&lt;value&gt;&lt;optionalspace&gt;</code>".
The field name is case insensitive.</p>
<p>
Comments can be included in the file using UNIX Bourne shell
conventions: the '<code>#</code>' character indicates
that the preceding space (if any) and the remainder of
the line up to the line terminator are discarded.
Lines containing only a comment are discarded completely,
and therefore do not indicate a record boundary.</p>
<p>
The record starts with one or more <code>User-agent</code>
lines, followed by one or more <code>Disallow</code> lines,
as detailed below. Unrecognised headers are ignored.</p>
<dl>
<dt>User-agent</dt>
<dd>
The value of this field is the name of the robot for which the
record describes the access policy.

<p>
If more than one User-agent field is present, the record
describes an identical access policy for more
than one robot. At least one field needs to be present
per record.</p>

<p>
The robot should be liberal in interpreting this field.
A case insensitive substring match of the name without
version information is recommended (illustrated in the
sketch after this list).</p>

<p>
If the value is '<code>*</code>', the record describes
the default access policy for any robot that has not
matched any of the other records. It is not allowed to
have two such records in the "<code>/robots.txt</code>"
file.</p></dd>
<dt>Disallow</dt>
<dd>
The value of this field specifies a partial URL that is not
to be visited. This can be a full path, or a partial
path; any URL that starts with this value will not be
retrieved. For example, <code>Disallow: /help</code>
disallows both <code>/help.html</code> and
<code>/help/index.html</code>, whereas
<code>Disallow: /help/</code> would disallow
<code>/help/index.html</code>
but allow <code>/help.html</code> (see the sketch
after this list).

<p>
An empty value indicates that all URLs can be
retrieved. At least one Disallow field needs to
be present in a record.</p></dd>
</dl>
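
<p>
By way of illustration only (this code is not part of the
specification), the following Perl sketch parses a
"<code>/robots.txt</code>" body and applies the matching rules above:
a case insensitive substring match for User-agent values, and a prefix
match for Disallow values. All function and variable names are made up
for this example.</p>

<pre>
#!/usr/bin/perl
use strict;

# Split a /robots.txt body into records of the form
# { agents => [...], disallow => [...] }.
sub parse_robots_txt {
    my ($text) = @_;
    my (@records, $current);
    for my $line (split /\r\n|\r|\n/, $text) {
        $line =~ s/\s*#.*$//;              # strip comments (Bourne shell style)
        if ($line =~ /^\s*$/) {            # a blank line ends the record
            push @records, $current if $current;
            undef $current;
            next;
        }
        if ($line =~ /^\s*([^:]+?)\s*:\s*(.*?)\s*$/) {
            my ($field, $value) = (lc $1, $2);
            $current ||= { agents => [], disallow => [] };
            push @{$current->{agents}},   $value if $field eq 'user-agent';
            push @{$current->{disallow}}, $value if $field eq 'disallow';
        }                                   # unrecognised fields are ignored
    }
    push @records, $current if $current;
    return @records;
}

# Decide whether $robot may retrieve $path. Records naming the robot
# explicitly are consulted before the '*' default record, so the
# default applies only to robots not matched elsewhere.
sub allowed {
    my ($robot, $path, @records) = @_;
    for my $pass (0, 1) {
        for my $rec (@records) {
            my $is_default = grep { $_ eq '*' } @{$rec->{agents}};
            next if $pass == 0 ? $is_default : !$is_default;
            next unless $is_default
                or grep { index(lc $robot, lc $_) >= 0 } @{$rec->{agents}};
            for my $prefix (@{$rec->{disallow}}) {
                next if $prefix eq '';      # empty value: everything allowed
                return 0 if index($path, $prefix) == 0;
            }
            return 1;
        }
    }
    return 1;                               # no record applies: welcome
}
</pre>

<p>
Applied to the first example in the next section, a call such as
<code>allowed("ExampleRobot/1.0", "/tmp/x", parse_robots_txt($policy))</code>
would return false, since the record with <code>User-agent: *</code>
disallows any path starting with "<code>/tmp/</code>".</p>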
The presence of an empty "<code>/robots.txt</code>" file
has no explicit associated semantics; it will be treated
as if it were not present, i.e. all robots will consider
themselves welcome.
<hr>

<h2><a name="examples">Examples</a></h2>
The following example "<code>/robots.txt</code>" file specifies
that no robots should visit any URL starting with
"<code>/cyberworld/map/</code>" or
"<code>/tmp/</code>":
<hr>
<pre>
# robots.txt for http://www.site.com/

User-agent: *
Disallow: /cyberworld/map/ # This is an infinite virtual URL space
Disallow: /tmp/ # these will soon disappear
</pre>
<hr>
This example "<code>/robots.txt</code>" file specifies
that no robots should visit any URL starting with
"<code>/cyberworld/map/</code>", except the robot called
"<code>cybermapper</code>":
<hr>
<pre>
# robots.txt for http://www.site.com/

User-agent: *
Disallow: /cyberworld/map/ # This is an infinite virtual URL space

# Cybermapper knows where to go.
User-agent: cybermapper
Disallow:
</pre>
<hr>

This example indicates that no robots should visit
this site further:
<hr>
<pre>
# go away
User-agent: *
Disallow: /
</pre>
<hr>
<h2><a name="code">Example Code</a></h2>

Although it is not part of this specification, some example code
in Perl is available in <a href="norobots.pl">norobots.pl</a>. It
is a bit more flexible in its parsing than this document
specifies, and is provided as-is, without warranty.
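
<p>
For completeness, a brief usage sketch follows for the
<code>WWW::RobotRules</code> module that ships with the libwww-perl
distribution containing this document; it implements the rules above.
The robot name and host are made-up examples, and this sketch is
likewise not part of the specification.</p>

<pre>
#!/usr/bin/perl
use strict;
use LWP::Simple qw(get);
use WWW::RobotRules;

# Identify this robot; the name is matched against User-agent fields.
my $rules = WWW::RobotRules->new('ExampleRobot/1.0');

# Fetch and parse the access policy of a (hypothetical) site.
my $robots_url = 'http://www.site.com/robots.txt';
my $content    = get($robots_url);
$rules->parse($robots_url, $content) if defined $content;

# Consult the policy before retrieving a URL.
my $url = 'http://www.site.com/cyberworld/map/index.html';
print $rules->allowed($url) ? "allowed\n" : "disallowed\n";
</pre>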
<h2><a name="author">Author's Address</a></h2>

<address>
<a href="/mak/mak.html">Martijn Koster</a>
&lt;m.koster@webcrawler.com&gt;<br>
NEXOR<br>
PO Box 132,<br>
Nottingham,<br>
The United Kingdom<br>
Phone: +44 602 520576
</address>
</body>
</html>