==============================================
 HOWTO Fetch Internet Resources Using urllib2
==============================================
----------------------------
  Fetching URLs With Python
----------------------------


.. note::

    There is a French translation of an earlier revision of this
    HOWTO, available at `urllib2 - Le Manuel manquant
    <http://www.voidspace.org.uk/python/articles/urllib2_francais.shtml>`_.

.. contents:: urllib2 Tutorial


Introduction
============

.. sidebar:: Related Articles

    You may also find the following article on fetching web
    resources with Python useful:

    * `Basic Authentication <http://www.voidspace.org.uk/python/articles/authentication.shtml>`_

        A tutorial on *Basic Authentication*, with examples in Python.

    This HOWTO is written by `Michael Foord
    <http://www.voidspace.org.uk/python/index.shtml>`_.

**urllib2** is a `Python <http://www.python.org>`_ module for fetching URLs
(Uniform Resource Locators). It offers a very simple interface, in the form of
the *urlopen* function. This is capable of fetching URLs using a variety
of different protocols. It also offers a slightly more complex
interface for handling common situations, such as basic authentication,
cookies and proxies. These are provided by objects called
handlers and openers.

urllib2 supports fetching URLs for many "URL schemes" (identified by the string
before the ":" in the URL; for example, "ftp" is the URL scheme of
"ftp://python.org/") using their associated network protocols (e.g. FTP, HTTP).
This tutorial focuses on the most common case, HTTP.

For straightforward situations *urlopen* is very easy to use. But as
soon as you encounter errors or non-trivial cases when opening HTTP
URLs, you will need some understanding of the HyperText Transfer
Protocol. The most comprehensive and authoritative reference to HTTP
is :RFC:`2616`. This is a technical document and not intended to be
easy to read. This HOWTO aims to illustrate using *urllib2*, with
enough detail about HTTP to help you through. It is not intended to
replace the `urllib2 docs <http://docs.python.org/lib/module-urllib2.html>`_,
but is supplementary to them.


Fetching URLs
=============

The simplest way to use urllib2 is as follows::

    import urllib2
    response = urllib2.urlopen('http://python.org/')
    html = response.read()

Many uses of urllib2 will be that simple (note that instead of an
'http:' URL we could have used a URL starting with 'ftp:', 'file:',
etc.).  However, it's the purpose of this tutorial to explain the more
complicated cases, concentrating on HTTP.

HTTP is based on requests and responses - the client makes requests
and servers send responses. urllib2 mirrors this with a ``Request``
object which represents the HTTP request you are making. In its
simplest form you create a Request object that specifies the URL you
want to fetch. Calling ``urlopen`` with this Request object returns a
response object for the URL requested. This response is a file-like
object, which means you can, for example, call ``.read()`` on the
response::

    import urllib2

    req = urllib2.Request('http://www.voidspace.org.uk')
    response = urllib2.urlopen(req)
    the_page = response.read()

Note that urllib2 makes use of the same Request interface to handle
all URL schemes.  For example, you can make an FTP request like so::

    req = urllib2.Request('ftp://example.com/')

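As a minimal sketch of the point above: building a ``Request`` does not
touch the network, so the objects can be inspected safely before anything
is fetched. The import fallback is my own assumption, added only so the
snippet also runs on Python 3, where the same API lives in
``urllib.request``::

    try:
        import urllib2                      # Python 2
    except ImportError:
        import urllib.request as urllib2    # assumed Python 3 equivalent

    # Creating Request objects is purely local; only urlopen() connects.
    http_req = urllib2.Request('http://www.voidspace.org.uk')
    ftp_req = urllib2.Request('ftp://example.com/')

    http_url = http_req.get_full_url()
    ftp_url = ftp_req.get_full_url()
    http_method = http_req.get_method()   # 'GET', since no data was passed
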
In the case of HTTP, there are two extra things that Request objects
allow you to do: First, you can pass data to be sent to the server.
Second, you can pass extra information ("metadata") *about* the data
or about the request itself, to the server - this information is sent
as HTTP "headers".  Let's look at each of these in turn.

Data
----

Sometimes you want to send data to a URL (often the URL will refer to
a CGI (Common Gateway Interface) script [#]_ or other web
application). With HTTP, this is often done using what's known as a
**POST** request. This is often what your browser does when you submit
an HTML form that you filled in on the web. Not all POSTs have to come
from forms: you can use a POST to transmit arbitrary data to your own
application. In the common case of HTML forms, the data needs to be
encoded in a standard way, and then passed to the Request object as
the ``data`` argument. The encoding is done using a function from the
``urllib`` library, *not* from ``urllib2``::

    import urllib
    import urllib2

    url = 'http://www.someserver.com/cgi-bin/register.cgi'
    values = {'name' : 'Michael Foord',
              'location' : 'Northampton',
              'language' : 'Python' }

    data = urllib.urlencode(values)
    req = urllib2.Request(url, data)
    response = urllib2.urlopen(req)
    the_page = response.read()

Note that other encodings are sometimes required (e.g. for file upload
from HTML forms - see
`HTML Specification, Form Submission <http://www.w3.org/TR/REC-html40/interact/forms.html#h-17.13>`_
for more details).

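The following sketch shows, without contacting any server, how the
presence of ``data`` decides the request method. The list of pairs (used
here instead of a dictionary to keep the field order fixed), the
``.encode('ascii')`` call and the import fallback are my assumptions so
that the snippet also runs on Python 3::

    try:
        import urllib2                      # Python 2
        from urllib import urlencode
    except ImportError:
        import urllib.request as urllib2    # assumed Python 3 equivalent
        from urllib.parse import urlencode

    url = 'http://www.someserver.com/cgi-bin/register.cgi'
    fields = [('name', 'Michael Foord'), ('language', 'Python')]

    # Spaces become '+', pairs are joined with '&'.
    data = urlencode(fields).encode('ascii')

    get_req = urllib2.Request(url)          # no data: a GET request
    post_req = urllib2.Request(url, data)   # data given: a POST request

    methods = (get_req.get_method(), post_req.get_method())
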
If you do not pass the ``data`` argument, urllib2 uses a **GET**
request. One way in which GET and POST requests differ is that POST
requests often have "side-effects": they change the state of the
system in some way (for example by placing an order with the website
for a hundredweight of tinned spam to be delivered to your door).
Though the HTTP standard makes it clear that POSTs are intended to
*always* cause side-effects, and GET requests *never* to cause
side-effects, nothing prevents a GET request from having side-effects,
nor a POST request from having no side-effects. Data can also be
passed in an HTTP GET request by encoding it in the URL itself.

This is done as follows::

    >>> import urllib2
    >>> import urllib
    >>> data = {}
    >>> data['name'] = 'Somebody Here'
    >>> data['location'] = 'Northampton'
    >>> data['language'] = 'Python'
    >>> url_values = urllib.urlencode(data)
    >>> print url_values
    name=Somebody+Here&language=Python&location=Northampton
    >>> url = 'http://www.example.com/example.cgi'
    >>> full_url = url + '?' + url_values
    >>> data = urllib2.urlopen(full_url)

Notice that the full URL is created by adding a ``?`` to the URL, followed by
the encoded values.

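To make the URL construction concrete, here is a small sketch that builds
the query string and then decodes it again, the way a server-side script
would. ``build_get_url`` is a hypothetical helper of my own, not part of
urllib2, and the import fallback is an assumption for Python 3::

    try:
        from urllib import urlencode                 # Python 2
        from urlparse import urlsplit, parse_qsl
    except ImportError:                              # assumed Python 3 names
        from urllib.parse import urlencode, urlsplit, parse_qsl

    def build_get_url(base_url, params):
        """Append ``params`` (a sequence of pairs) to ``base_url``."""
        return base_url + '?' + urlencode(params)

    full_url = build_get_url('http://www.example.com/example.cgi',
                             [('name', 'Somebody Here'),
                              ('language', 'Python')])

    # Decoding the query part recovers the original pairs.
    decoded = parse_qsl(urlsplit(full_url).query)
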
Headers
-------

We'll discuss here one particular HTTP header, to illustrate how to
add headers to your HTTP request.

Some websites [#]_ dislike being browsed by programs, or send
different versions to different browsers [#]_. By default urllib2
identifies itself as ``Python-urllib/x.y`` (where ``x`` and ``y`` are
the major and minor version numbers of the Python release,
e.g. ``Python-urllib/2.5``), which may confuse the site, or just plain
not work. The way a browser identifies itself is through the
``User-Agent`` header [#]_. When you create a Request object you can
pass a dictionary of headers in. The following example makes the same
request as above, but identifies itself as a version of Internet
Explorer [#]_. ::

    import urllib
    import urllib2

    url = 'http://www.someserver.com/cgi-bin/register.cgi'
    user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
    values = {'name' : 'Michael Foord',
              'location' : 'Northampton',
              'language' : 'Python' }
    headers = { 'User-Agent' : user_agent }

    data = urllib.urlencode(values)
    req = urllib2.Request(url, data, headers)
    response = urllib2.urlopen(req)
    the_page = response.read()

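Headers can also be added after the Request has been created, and read
back for checking. The sketch below stays offline; the import fallback is
my assumption for Python 3. Note that Request stores header names in
capitalised form, so ``User-Agent`` is kept as ``User-agent``::

    try:
        import urllib2                      # Python 2
    except ImportError:
        import urllib.request as urllib2    # assumed Python 3 equivalent

    req = urllib2.Request('http://www.someserver.com/cgi-bin/register.cgi')
    req.add_header('User-Agent',
                   'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)')

    # Header names come back as 'User-agent', whatever case was given.
    stored_names = [name for name, value in req.header_items()]
    user_agent = req.get_header('User-agent')

    # What an opener sends when you set nothing, e.g. 'Python-urllib/2.5'.
    default_agent = dict(urllib2.build_opener().addheaders)['User-agent']
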
The response also has two useful methods. See the section on `info and
geturl`_ which comes after we have a look at what happens when things
go wrong.


Handling Exceptions
===================

*urlopen* raises ``URLError`` when it cannot handle a response (though
as usual with Python APIs, built-in exceptions such as ValueError,
TypeError etc. may also be raised).

``HTTPError`` is the subclass of ``URLError`` raised in the specific
case of HTTP URLs.

URLError
--------

Often, URLError is raised because there is no network connection (no
route to the specified server), or the specified server doesn't exist.
In this case, the exception raised will have a 'reason' attribute,
which is a tuple containing an error code and a text error message.

e.g. ::

    >>> import urllib2
    >>> req = urllib2.Request('http://www.pretend_server.org')
    >>> try:
    ...     urllib2.urlopen(req)
    ... except urllib2.URLError, e:
    ...     print e.reason
    ...
    (4, 'getaddrinfo failed')


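A ``URLError`` can also be provoked without any network at all, which is
handy for trying out error handling. In this sketch ``nosuchscheme`` is a
made-up URL scheme that no handler can open; the ``except ... as``
spelling (Python 2.6 and later) and the import fallback for Python 3 are
my assumptions::

    try:
        import urllib2                      # Python 2
        from urllib2 import URLError, HTTPError
    except ImportError:
        import urllib.request as urllib2    # assumed Python 3 equivalent
        from urllib.error import URLError, HTTPError

    reason = None
    try:
        # No handler knows this scheme, so urlopen fails before
        # attempting any connection.
        urllib2.urlopen('nosuchscheme://example.com/')
    except URLError as e:
        reason = e.reason

    # HTTPError is a URLError, so "except URLError" catches both.
    http_error_is_url_error = issubclass(HTTPError, URLError)
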
HTTPError
---------

Every HTTP response from the server contains a numeric "status
code". Sometimes the status code indicates that the server is unable
to fulfil the request. The default handlers will handle some of these
responses for you (for example, if the response is a "redirection"
that requests the client fetch the document from a different URL,
urllib2 will handle that for you). For those it can't handle, urlopen
will raise an ``HTTPError``. Typical errors include '404' (page not
found), '403' (request forbidden), and '401' (authentication
required).

See section 10 of RFC 2616 for a reference on all the HTTP error
codes.

The ``HTTPError`` instance raised will have an integer 'code'
attribute, which corresponds to the error sent by the server.

Error Codes
~~~~~~~~~~~

Because the default handlers handle redirects (codes in the 300
range), and codes in the 100-299 range indicate success, you will
usually only see error codes in the 400-599 range.

``BaseHTTPServer.BaseHTTPRequestHandler.responses`` is a useful
dictionary of response codes that shows all the response codes used
by RFC 2616. The dictionary is reproduced here for convenience::

    # Table mapping response codes to messages; entries have the
    # form {code: (shortmessage, longmessage)}.
    responses = {
        100: ('Continue', 'Request received, please continue'),
        101: ('Switching Protocols',
              'Switching to new protocol; obey Upgrade header'),

        200: ('OK', 'Request fulfilled, document follows'),
        201: ('Created', 'Document created, URL follows'),
        202: ('Accepted',
              'Request accepted, processing continues off-line'),
        203: ('Non-Authoritative Information', 'Request fulfilled from cache'),
        204: ('No Content', 'Request fulfilled, nothing follows'),
        205: ('Reset Content', 'Clear input form for further input.'),
        206: ('Partial Content', 'Partial content follows.'),

        300: ('Multiple Choices',
              'Object has several resources -- see URI list'),
        301: ('Moved Permanently', 'Object moved permanently -- see URI list'),
        302: ('Found', 'Object moved temporarily -- see URI list'),
        303: ('See Other', 'Object moved -- see Method and URL list'),
        304: ('Not Modified',
              'Document has not changed since given time'),
        305: ('Use Proxy',
              'You must use proxy specified in Location to access this '
              'resource.'),
        307: ('Temporary Redirect',
              'Object moved temporarily -- see URI list'),

        400: ('Bad Request',
              'Bad request syntax or unsupported method'),
        401: ('Unauthorized',
              'No permission -- see authorization schemes'),
        402: ('Payment Required',
              'No payment -- see charging schemes'),
        403: ('Forbidden',
              'Request forbidden -- authorization will not help'),
        404: ('Not Found', 'Nothing matches the given URI'),
        405: ('Method Not Allowed',
              'Specified method is invalid for this server.'),
        406: ('Not Acceptable', 'URI not available in preferred format.'),
        407: ('Proxy Authentication Required', 'You must authenticate with '
              'this proxy before proceeding.'),
        408: ('Request Timeout', 'Request timed out; try again later.'),
        409: ('Conflict', 'Request conflict.'),
        410: ('Gone',
              'URI no longer exists and has been permanently removed.'),
        411: ('Length Required', 'Client must specify Content-Length.'),
        412: ('Precondition Failed', 'Precondition in headers is false.'),
        413: ('Request Entity Too Large', 'Entity is too large.'),
        414: ('Request-URI Too Long', 'URI is too long.'),
        415: ('Unsupported Media Type', 'Entity body in unsupported format.'),
        416: ('Requested Range Not Satisfiable',
              'Cannot satisfy request range.'),
        417: ('Expectation Failed',
              'Expect condition could not be satisfied.'),

        500: ('Internal Server Error', 'Server got itself in trouble'),
        501: ('Not Implemented',
              'Server does not support this operation'),
        502: ('Bad Gateway', 'Invalid responses from another server/proxy.'),
        503: ('Service Unavailable',
              'The server cannot process the request due to a high load'),
        504: ('Gateway Timeout',
              'The gateway server did not receive a timely response'),
        505: ('HTTP Version Not Supported', 'Cannot fulfill request.'),
        }

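In practice you would look a code up in the live dictionary rather than
copy the table. A minimal sketch follows; ``explain`` is a hypothetical
helper of my own, and the fallback import assumes the Python 3 location
of the same class::

    try:
        from BaseHTTPServer import BaseHTTPRequestHandler   # Python 2
    except ImportError:
        from http.server import BaseHTTPRequestHandler      # assumed Python 3

    def explain(code):
        """Return 'code shortmessage', or 'code Unknown' if not listed."""
        unknown = ('Unknown', 'Code not listed in the table')
        entry = BaseHTTPRequestHandler.responses.get(code, unknown)
        # Entries have the form (shortmessage, longmessage).
        return '%d %s' % (code, entry[0])

    not_found = explain(404)
    unlisted = explain(999)
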
When an error is raised the server responds by returning an HTTP error
code *and* an error page. You can use the ``HTTPError`` instance as a
response object for the page returned. This means that as well as the
code attribute, it also has read, geturl, and info methods. ::

    >>> import urllib2
    >>> req = urllib2.Request('http://www.python.org/fish.html')
    >>> try:
    ...     urllib2.urlopen(req)
    ... except urllib2.HTTPError, e:
    ...     print e.code
    ...     print e.read()
    ...
    404
    <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
        "http://www.w3.org/TR/html4/loose.dtd">
    <?xml-stylesheet href="./css/ht2html.css"
        type="text/css"?>
    <html><head><title>Error 404: File Not Found</title>
    ...... etc...

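The response-like behaviour of ``HTTPError`` can be tried without a
server by constructing one by hand. This is only a sketch: the message,
empty headers and page body below are invented stand-ins for what a real
server would send, and the import fallback is an assumption for
Python 3::

    import io
    try:
        from urllib2 import HTTPError        # Python 2
    except ImportError:
        from urllib.error import HTTPError   # assumed Python 3 equivalent

    # Invented error page, standing in for the server's response body.
    fake_page = io.BytesIO(b'<html><title>Error 404: File Not Found</title>')
    simulated = HTTPError('http://www.python.org/fish.html', 404,
                          'Not Found', {}, fake_page)

    try:
        raise simulated
    except HTTPError as e:
        # The exception doubles as the response object.
        code = e.code
        page = e.read()
        url = e.geturl()
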
Wrapping it Up
--------------

So if you want to be prepared for ``HTTPError`` *or* ``URLError``
there are two basic approaches. I prefer the second approach.

Number 1
~~~~~~~~

::

    from urllib2 import Request, urlopen, URLError, HTTPError
    req = Request(someurl)
    try:
        response = urlopen(req)
    except HTTPError, e:
        print 'The server couldn\'t fulfill the request.'
        print 'Error code: ', e.code
    except URLError, e:
        print 'We failed to reach a server.'
        print 'Reason: ', e.reason
    else:
        # everything is fine
        pass

.. note::

    The ``except HTTPError`` *must* come first, otherwise ``except URLError``
    will *also* catch an ``HTTPError``.

Number 2
~~~~~~~~

::

    from urllib2 import Request, urlopen, URLError
    req = Request(someurl)
    try:
        response = urlopen(req)
    except URLError, e:
        # Test for 'code' first: only HTTPError has it, while later
        # releases may give HTTPError a 'reason' attribute as well.
        if hasattr(e, 'code'):
            print 'The server couldn\'t fulfill the request.'
            print 'Error code: ', e.code
        elif hasattr(e, 'reason'):
            print 'We failed to reach a server.'
            print 'Reason: ', e.reason
    else:
        # everything is fine
        pass


info and geturl
===============

The response returned by urlopen (or the ``HTTPError`` instance) has
two useful methods, ``info`` and ``geturl``.

**geturl** - this returns the real URL of the page fetched. This is
useful because ``urlopen`` (or the opener object used) may have
followed a redirect. The URL of the page fetched may not be the same
as the URL requested.

**info** - this returns a dictionary-like object that describes the
page fetched, particularly the headers sent by the server. It is
currently an ``httplib.HTTPMessage`` instance.

Typical headers include 'Content-length', 'Content-type', and so
on. See the
`Quick Reference to HTTP Headers <http://www.cs.tut.fi/~jkorpela/http.html>`_
for a useful listing of HTTP headers with brief explanations of their meaning
and use.


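Both methods can be explored offline by fetching a local ``file:`` URL,
for which urllib2 supplies a few headers of its own. This is a sketch
under that assumption, not an HTTP example; the temporary file and the
import fallback for Python 3 are mine::

    import os
    import tempfile
    try:
        from urllib import pathname2url      # Python 2
        from urllib2 import urlopen
    except ImportError:                      # assumed Python 3 equivalents
        from urllib.request import pathname2url, urlopen

    # Write a small local file so that no network access is needed.
    fd, path = tempfile.mkstemp(suffix='.txt')
    os.write(fd, b'hello')
    os.close(fd)

    response = urlopen('file:' + pathname2url(path))
    try:
        final_url = response.geturl()
        content_length = response.info()['Content-length']
        body = response.read()
    finally:
        response.close()
        os.remove(path)
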
Openers and Handlers
====================

When you fetch a URL you use an opener (an instance of the perhaps
confusingly-named ``urllib2.OpenerDirector``). Normally we have been using
the default opener - via ``urlopen`` - but you can create custom
openers. Openers use handlers. All the "heavy lifting" is done by the
handlers. Each handler knows how to open URLs for a particular URL
scheme (http, ftp, etc.), or how to handle an aspect of URL opening,
for example HTTP redirections or HTTP cookies.

You will want to create openers if you want to fetch URLs with
specific handlers installed, for example to get an opener that handles
cookies, or to get an opener that does not handle redirections.

To create an opener, instantiate an OpenerDirector, and then call
``.add_handler(some_handler_instance)`` repeatedly.

Alternatively, you can use ``build_opener``, which is a convenience
function for creating opener objects with a single function call.
``build_opener`` adds several handlers by default, but provides a
quick way to add more and/or override the default handlers.

Other sorts of handlers you might want can handle proxies,
authentication, and other common but slightly specialised
situations.

``install_opener`` can be used to make an ``opener`` object the
(global) default opener. This means that calls to ``urlopen`` will use
the opener you have installed.

Opener objects have an ``open`` method, which can be called directly
to fetch URLs in the same way as the ``urlopen`` function: there's no
need to call ``install_opener``, except as a convenience.


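As an illustration of overriding a default handler, the sketch below
builds an opener that declines to follow redirects.
``NoRedirectHandler`` is a hypothetical class of my own; returning
``None`` from ``redirect_request`` means "not handled here", so a
redirect would be expected to surface as an ``HTTPError`` instead of
being followed. The import fallback is an assumption for Python 3::

    try:
        import urllib2                      # Python 2
    except ImportError:
        import urllib.request as urllib2    # assumed Python 3 equivalent

    class NoRedirectHandler(urllib2.HTTPRedirectHandler):
        """Hypothetical handler that never follows a redirect."""

        def redirect_request(self, req, fp, code, msg, headers, newurl):
            # Decline to build a follow-up request.
            return None

    # Passing a subclass makes build_opener leave out the stock
    # HTTPRedirectHandler it would otherwise add.
    opener = urllib2.build_opener(NoRedirectHandler)

    redirect_handlers = [h for h in opener.handlers
                         if isinstance(h, urllib2.HTTPRedirectHandler)]

The opener can then be used through ``opener.open(url)`` without
installing it globally.
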
Basic Authentication
====================

To illustrate creating and installing a handler we will use the
``HTTPBasicAuthHandler``. For a more detailed discussion of this
subject - including an explanation of how Basic Authentication works -
see the `Basic Authentication Tutorial  <http://www.voidspace.org.uk/python/articles/authentication.shtml>`_.

When authentication is required, the server sends a header (as well as
the 401 error code) requesting authentication.  This specifies the
authentication scheme and a 'realm'. The header looks like:
``Www-authenticate: SCHEME realm="REALM"``.

e.g. ::

    Www-authenticate: Basic realm="cPanel Users"


The client should then retry the request with the appropriate name and
password for the realm included as a header in the request. This is
'basic authentication'. In order to simplify this process we can
create an instance of ``HTTPBasicAuthHandler`` and an opener to use
this handler.

The ``HTTPBasicAuthHandler`` uses an object called a password manager
to handle the mapping of URLs and realms to passwords and
usernames. If you know what the realm is (from the authentication
header sent by the server), then you can use a
``HTTPPasswordMgr``. Frequently one doesn't care what the realm is. In
that case, it is convenient to use
``HTTPPasswordMgrWithDefaultRealm``. This allows you to specify a
default username and password for a URL. This will be supplied unless
you provide an alternative combination for a specific
realm. We indicate this by providing ``None`` as the realm argument to
the ``add_password`` method.

The top-level URL is the first URL that requires authentication. URLs
"deeper" than the URL you pass to ``.add_password()`` will also match. ::

    import urllib2

    # create a password manager
    password_mgr = urllib2.HTTPPasswordMgrWithDefaultRealm()

    # Add the username and password (supply your own values).
    # If we knew the realm, we could use it instead of ``None``.
    top_level_url = "http://example.com/foo/"
    password_mgr.add_password(None, top_level_url, username, password)

    handler = urllib2.HTTPBasicAuthHandler(password_mgr)

    # create "opener" (OpenerDirector instance)
    opener = urllib2.build_opener(handler)

    # use the opener to fetch a URL (a_url is the page you want)
    opener.open(a_url)

    # Install the opener.
    # Now all calls to urllib2.urlopen use our opener.
    urllib2.install_opener(opener)

.. note::

    In the above example we only supplied our ``HTTPBasicAuthHandler``
    to ``build_opener``. By default openers have the handlers for
    normal situations - ``ProxyHandler``, ``UnknownHandler``,
    ``HTTPHandler``, ``HTTPDefaultErrorHandler``,
    ``HTTPRedirectHandler``, ``FTPHandler``, ``FileHandler``,
    ``HTTPErrorProcessor``.

top_level_url is in fact *either* a full URL (including the 'http:'
scheme component and the hostname and optionally the port number)
e.g. "http://example.com/" *or* an "authority" (i.e. the hostname,
optionally including the port number) e.g. "example.com" or
"example.com:8080" (the latter example includes a port number).  The
authority, if present, must NOT contain the "userinfo" component - for
example "joe:password@example.com" is not correct.


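The matching rules can be checked directly on the password manager,
without any server involved. In this sketch ``joe`` and ``secret`` are
placeholder credentials and ``Some Realm`` is an arbitrary realm name;
the import fallback is an assumption for Python 3::

    try:
        import urllib2                      # Python 2
    except ImportError:
        import urllib.request as urllib2    # assumed Python 3 equivalent

    password_mgr = urllib2.HTTPPasswordMgrWithDefaultRealm()
    # None as the realm makes these the default credentials.
    password_mgr.add_password(None, 'http://example.com/foo/',
                              'joe', 'secret')

    # A "deeper" URL matches, whatever realm the server names ...
    deeper = password_mgr.find_user_password(
        'Some Realm', 'http://example.com/foo/bar.html')

    # ... while a URL outside /foo/ has no credentials on file.
    outside = password_mgr.find_user_password(
        'Some Realm', 'http://example.com/other/')
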
Proxies
=======

**urllib2** will auto-detect your proxy settings and use those. This
is through the ``ProxyHandler`` which is part of the normal handler
chain. Normally that's a good thing, but there are occasions when it
may not be helpful [#]_. One way to bypass the proxy is to set up our
own ``ProxyHandler``, with no proxies defined. This is done using
similar steps to setting up a `Basic Authentication`_ handler::

    >>> proxy_support = urllib2.ProxyHandler({})
    >>> opener = urllib2.build_opener(proxy_support)
    >>> urllib2.install_opener(opener)

.. note::

    Currently ``urllib2`` *does not* support fetching of ``https``
    locations through a proxy. This can be a problem.

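Before disabling proxy support it can help to see what would be
auto-detected. The sketch below inspects the detected settings and builds
a proxy-free opener without installing it globally; the import fallback
is my assumption for Python 3::

    try:
        import urllib2                      # Python 2
        from urllib import getproxies
    except ImportError:
        import urllib.request as urllib2    # assumed Python 3 equivalent
        from urllib.request import getproxies

    # Mapping of scheme to proxy URL found on this machine (may be empty).
    detected = getproxies()

    # An explicit, empty mapping replaces the auto-detected settings.
    proxy_support = urllib2.ProxyHandler({})
    opener = urllib2.build_opener(proxy_support)

    configured = proxy_support.proxies

Calling ``opener.open(url)`` then uses this opener for one request only,
leaving ``urlopen`` unchanged.
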
Sockets and Layers
==================

The Python support for fetching resources from the web is
layered. urllib2 uses the httplib library, which in turn uses the
socket library.

As of Python 2.3 you can specify how long a socket should wait for a
response before timing out. This can be useful in applications which
have to fetch web pages. By default the socket module has *no timeout*
and can hang. Currently, the socket timeout is not exposed at the
httplib or urllib2 levels.  However, you can set the default timeout
globally for all sockets using::

    import socket
    import urllib2

    # timeout in seconds
    timeout = 10
    socket.setdefaulttimeout(timeout)

    # this call to urllib2.urlopen now uses the default timeout
    # we have set in the socket module
    req = urllib2.Request('http://www.voidspace.org.uk')
    response = urllib2.urlopen(req)


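Because the default timeout is global, it affects every socket created
afterwards. One way to limit its reach is to restore the previous value
once the fetch is done. ``default_timeout`` below is a hypothetical
helper of my own, and the ``with`` statement assumes Python 2.6 or
later::

    import contextlib
    import socket

    @contextlib.contextmanager
    def default_timeout(seconds):
        """Temporarily set the global socket timeout, then restore it."""
        previous = socket.getdefaulttimeout()
        socket.setdefaulttimeout(seconds)
        try:
            yield
        finally:
            # Put back whatever was configured before, including None.
            socket.setdefaulttimeout(previous)

    before = socket.getdefaulttimeout()
    with default_timeout(10):
        # Calls to urlopen made here would use the 10 second timeout.
        inside = socket.getdefaulttimeout()
    after = socket.getdefaulttimeout()
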
-------


Footnotes
=========

This document was reviewed and revised by John Lee.

.. [#] For an introduction to the CGI protocol see
       `Writing Web Applications in Python <http://www.pyzine.com/Issue008/Section_Articles/article_CGIOne.html>`_.
.. [#] Like Google for example. The *proper* way to use Google from a program
       is to use `PyGoogle <http://pygoogle.sourceforge.net>`_ of course. See
       `Voidspace Google <http://www.voidspace.org.uk/python/recipebook.shtml#google>`_
       for some examples of using the Google API.
.. [#] Browser sniffing is a very bad practice for website design - building
       sites using web standards is much more sensible. Unfortunately a lot of
       sites still send different versions to different browsers.
.. [#] For details of more HTTP request headers, see
       `Quick Reference to HTTP Headers`_.
.. [#] The user agent for MSIE 6 is
       *'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)'*
.. [#] In my case I have to use a proxy to access the internet at work. If you
       attempt to fetch *localhost* URLs through this proxy it blocks them. IE
       is set to use the proxy, which urllib2 picks up on. In order to test
       scripts with a localhost server, I have to prevent urllib2 from using
       the proxy.