:mod:`urlparse` --- Parse URLs into components
==============================================

.. module:: urlparse
   :synopsis: Parse URLs into components, or assemble them from components.


.. index::
   single: WWW
   single: World Wide Web
   single: URL
   pair: URL; parsing
   pair: relative; URL

.. note::
   The :mod:`urlparse` module is renamed to :mod:`urllib.parse` in Python 3.0.
   The :term:`2to3` tool will automatically adapt imports when converting
   your sources to 3.0.


This module defines a standard interface to break Uniform Resource Locator (URL)
strings up into components (addressing scheme, network location, path, etc.), to
combine the components back into a URL string, and to convert a "relative URL"
to an absolute URL given a "base URL."

The module has been designed to match the Internet RFC on Relative Uniform
Resource Locators, :rfc:`1808` (and discovered a bug in an earlier draft!). It
supports the following URL schemes: ``file``, ``ftp``, ``gopher``, ``hdl``,
``http``, ``https``, ``imap``, ``mailto``, ``mms``, ``news``, ``nntp``,
``prospero``, ``rsync``, ``rtsp``, ``rtspu``, ``sftp``, ``shttp``, ``sip``,
``sips``, ``snews``, ``svn``, ``svn+ssh``, ``telnet``, ``wais``.

.. versionadded:: 2.5
   Support for the ``sftp`` and ``sips`` schemes.

The :mod:`urlparse` module defines the following functions:


.. function:: urlparse(urlstring[, default_scheme[, allow_fragments]])

   Parse a URL into six components, returning a 6-tuple.  This corresponds to the
   general structure of a URL: ``scheme://netloc/path;parameters?query#fragment``.
   Each tuple item is a string, possibly empty. The components are not broken up
   into smaller parts (for example, the network location is a single string), and
   % escapes are not expanded. The delimiters shown above are not part of the
   result, except for a leading slash in the *path* component, which is retained if
   present.  For example:

      >>> from urlparse import urlparse
      >>> o = urlparse('http://www.cwi.nl:80/%7Eguido/Python.html')
      >>> o   # doctest: +NORMALIZE_WHITESPACE
      ParseResult(scheme='http', netloc='www.cwi.nl:80', path='/%7Eguido/Python.html',
                  params='', query='', fragment='')
      >>> o.scheme
      'http'
      >>> o.port
      80
      >>> o.geturl()
      'http://www.cwi.nl:80/%7Eguido/Python.html'

   If the *default_scheme* argument is specified, it gives the default addressing
   scheme, to be used only if the URL does not specify one.  The default value for
   this argument is the empty string.

   If the *allow_fragments* argument is false, fragment identifiers are not
   allowed, even if the URL's addressing scheme normally does support them.  The
   default value for this argument is :const:`True`.

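   The two optional arguments can be illustrated with a minimal sketch.  The
   URLs are only examples, and the ``try``/``except`` import is an assumption
   made so that the snippet also runs where the module goes by its Python 3
   name, :mod:`urllib.parse`:

   .. code-block:: python

      try:
          from urlparse import urlparse        # Python 2 module name
      except ImportError:
          from urllib.parse import urlparse    # same function after the rename

      # The URL names no scheme, so the second (positional) argument supplies it.
      with_default = urlparse('//www.cwi.nl/%7Eguido/Python.html', 'ftp')
      # with_default.scheme == 'ftp', with_default.netloc == 'www.cwi.nl'

      # The URL names its own scheme, so the default is ignored.
      explicit = urlparse('http://www.cwi.nl/%7Eguido/Python.html', 'ftp')
      # explicit.scheme == 'http'

      # With a false allow_fragments, '#intro' is not split off as a fragment.
      no_frag = urlparse('http://www.cwi.nl/doc.html#intro', '', False)
      # no_frag.fragment == ''
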
   The return value is actually an instance of a subclass of :class:`tuple`.  This
   class has the following additional read-only convenience attributes:

   +------------------+-------+--------------------------+----------------------+
   | Attribute        | Index | Value                    | Value if not present |
   +==================+=======+==========================+======================+
   | :attr:`scheme`   | 0     | URL scheme specifier     | empty string         |
   +------------------+-------+--------------------------+----------------------+
   | :attr:`netloc`   | 1     | Network location part    | empty string         |
   +------------------+-------+--------------------------+----------------------+
   | :attr:`path`     | 2     | Hierarchical path        | empty string         |
   +------------------+-------+--------------------------+----------------------+
   | :attr:`params`   | 3     | Parameters for last path | empty string         |
   |                  |       | element                  |                      |
   +------------------+-------+--------------------------+----------------------+
   | :attr:`query`    | 4     | Query component          | empty string         |
   +------------------+-------+--------------------------+----------------------+
   | :attr:`fragment` | 5     | Fragment identifier      | empty string         |
   +------------------+-------+--------------------------+----------------------+
   | :attr:`username` |       | User name                | :const:`None`        |
   +------------------+-------+--------------------------+----------------------+
   | :attr:`password` |       | Password                 | :const:`None`        |
   +------------------+-------+--------------------------+----------------------+
   | :attr:`hostname` |       | Host name (lower case)   | :const:`None`        |
   +------------------+-------+--------------------------+----------------------+
   | :attr:`port`     |       | Port number as integer,  | :const:`None`        |
   |                  |       | if present               |                      |
   +------------------+-------+--------------------------+----------------------+

   See section :ref:`urlparse-result-object` for more information on the result
   object.

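   The attributes without an index are derived from the network location.  As a
   small sketch (the credentials and port in the URL are invented for
   illustration, and the fallback import assumes the Python 3 module name may be
   needed):

   .. code-block:: python

      try:
          from urlparse import urlparse        # Python 2 module name
      except ImportError:
          from urllib.parse import urlparse    # Python 3 name

      o = urlparse('http://guido:secret@WWW.CWI.NL:8080/index.html')
      netloc_info = (o.username, o.password, o.hostname, o.port)
      # ('guido', 'secret', 'www.cwi.nl', 8080); the host name is lower-cased

      # Indexed items and named attributes refer to the same values.
      same_netloc = (o[1] == o.netloc)         # True

      # Parts that are absent from the network location are reported as None.
      bare = urlparse('http://www.cwi.nl/index.html')
      missing = (bare.username, bare.password, bare.port)
      # (None, None, None)
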
   .. versionchanged:: 2.5
      Added attributes to return value.

.. function:: parse_qs(qs[, keep_blank_values[, strict_parsing]])

   Parse a query string given as a string argument (data of type
   :mimetype:`application/x-www-form-urlencoded`).  Data are returned as a
   dictionary.  The dictionary keys are the unique query variable names and the
   values are lists of values for each name.

   The optional argument *keep_blank_values* is a flag indicating whether blank
   values in URL-encoded queries should be treated as blank strings.  A true value
   indicates that blanks should be retained as blank strings.  The default false
   value indicates that blank values are to be ignored and treated as if they were
   not included.

   The optional argument *strict_parsing* is a flag indicating what to do with
   parsing errors.  If false (the default), errors are silently ignored.  If true,
   errors raise a :exc:`ValueError` exception.

   Use the :func:`urllib.urlencode` function to convert such dictionaries into
   query strings.


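   A brief sketch of how the two flags change the result.  The query strings are
   made up for illustration, and the fallback import assumes the snippet may
   also be run under the module's Python 3 name, :mod:`urllib.parse`:

   .. code-block:: python

      try:
          from urlparse import parse_qs        # Python 2 module name
      except ImportError:
          from urllib.parse import parse_qs    # Python 3 name

      query = 'lang=en&lang=fr&debug='

      # Blank values are dropped by default; repeated names share one list.
      default = parse_qs(query)
      # {'lang': ['en', 'fr']}

      # A true keep_blank_values retains 'debug' with an empty-string value.
      kept = parse_qs(query, True)
      # {'lang': ['en', 'fr'], 'debug': ['']}

      # The field 'broken' has no '=', which strict parsing reports as an error.
      try:
          parse_qs('lang=en&broken', False, True)
          strict_error = False
      except ValueError:
          strict_error = True
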
.. function:: parse_qsl(qs[, keep_blank_values[, strict_parsing]])

   Parse a query string given as a string argument (data of type
   :mimetype:`application/x-www-form-urlencoded`).  Data are returned as a list of
   (name, value) pairs.

   The optional argument *keep_blank_values* is a flag indicating whether blank
   values in URL-encoded queries should be treated as blank strings.  A true value
   indicates that blanks should be retained as blank strings.  The default false
   value indicates that blank values are to be ignored and treated as if they were
   not included.

   The optional argument *strict_parsing* is a flag indicating what to do with
   parsing errors.  If false (the default), errors are silently ignored.  If true,
   errors raise a :exc:`ValueError` exception.

   Use the :func:`urllib.urlencode` function to convert such lists of pairs into
   query strings.

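   The round trip through :func:`urllib.urlencode` mentioned above might look
   like the following sketch.  The query string is illustrative, and the
   fallback import assumes that under Python 3 both functions are found in
   :mod:`urllib.parse`:

   .. code-block:: python

      try:
          from urlparse import parse_qsl       # Python 2 locations
          from urllib import urlencode
      except ImportError:
          from urllib.parse import parse_qsl, urlencode   # Python 3 location

      # Order and repeated names are preserved; '+' is decoded to a space.
      pairs = parse_qsl('name=Guido+van+Rossum&lang=py&lang=c')
      # [('name', 'Guido van Rossum'), ('lang', 'py'), ('lang', 'c')]

      # Encoding the pairs again yields an equivalent query string.
      rebuilt = urlencode(pairs)
      # 'name=Guido+van+Rossum&lang=py&lang=c'
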
.. function:: urlunparse(parts)

   Construct a URL from a tuple as returned by :func:`urlparse`. The *parts*
   argument can be any six-item iterable. This may result in a slightly different,
   but equivalent URL, if the URL that was parsed originally had unnecessary
   delimiters (for example, a ``?`` with an empty query; the RFC states that these
   are equivalent).


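   A short sketch of both points: building a URL from a plain six-item list, and
   losing an unnecessary ``?`` on a round trip.  The component values are
   illustrative, and the fallback import assumes the Python 3 module name may be
   needed:

   .. code-block:: python

      try:
          from urlparse import urlparse, urlunparse       # Python 2 module name
      except ImportError:
          from urllib.parse import urlparse, urlunparse   # Python 3 name

      # Any six-item iterable works, not only a urlparse() result.
      built = urlunparse(['http', 'www.cwi.nl', '/%7Eguido/Python.html',
                          '', 'lang=en', 'top'])
      # 'http://www.cwi.nl/%7Eguido/Python.html?lang=en#top'

      # The trailing '?' introduces an empty query, so it is not reproduced.
      original = 'http://www.cwi.nl/%7Eguido/Python.html?'
      round_tripped = urlunparse(urlparse(original))
      # 'http://www.cwi.nl/%7Eguido/Python.html'
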
.. function:: urlsplit(urlstring[, default_scheme[, allow_fragments]])

   This is similar to :func:`urlparse`, but does not split the params from the URL.
   This should generally be used instead of :func:`urlparse` if the more recent URL
   syntax allowing parameters to be applied to each segment of the *path* portion
   of the URL (see :rfc:`2396`) is wanted.  A separate function is needed to
   separate the path segments and parameters.  This function returns a 5-tuple:
   (addressing scheme, network location, path, query, fragment identifier).

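   The difference from :func:`urlparse` can be seen in the sketch below.  The
   helper ``split_segment_params`` is hypothetical, not part of this module; it
   is only one possible shape for the "separate function" mentioned above, and
   the URL is invented for illustration:

   .. code-block:: python

      try:
          from urlparse import urlparse, urlsplit       # Python 2 module name
      except ImportError:
          from urllib.parse import urlparse, urlsplit   # Python 3 name

      url = 'http://www.cwi.nl/a;x=1/b;y=2?q=1'

      # urlparse() splits off only the parameters of the last path segment.
      parsed = urlparse(url)
      # parsed.path == '/a;x=1/b', parsed.params == 'y=2'

      # urlsplit() leaves every segment's parameters inside the path.
      split_path = urlsplit(url).path
      # '/a;x=1/b;y=2'

      def split_segment_params(path):
          """Hypothetical helper: return (segment, params) for each path segment."""
          result = []
          for segment in path.split('/'):
              name, _, params = segment.partition(';')
              result.append((name, params))
          return result

      segments = split_segment_params(split_path)
      # [('', ''), ('a', 'x=1'), ('b', 'y=2')]
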
   The return value is actually an instance of a subclass of :class:`tuple`.  This
   class has the following additional read-only convenience attributes:

   +------------------+-------+-------------------------+----------------------+
   | Attribute        | Index | Value                   | Value if not present |
   +==================+=======+=========================+======================+
   | :attr:`scheme`   | 0     | URL scheme specifier    | empty string         |
   +------------------+-------+-------------------------+----------------------+
   | :attr:`netloc`   | 1     | Network location part   | empty string         |
   +------------------+-------+-------------------------+----------------------+
   | :attr:`path`     | 2     | Hierarchical path       | empty string         |
   +------------------+-------+-------------------------+----------------------+
   | :attr:`query`    | 3     | Query component         | empty string         |
   +------------------+-------+-------------------------+----------------------+
   | :attr:`fragment` | 4     | Fragment identifier     | empty string         |
   +------------------+-------+-------------------------+----------------------+
   | :attr:`username` |       | User name               | :const:`None`        |
   +------------------+-------+-------------------------+----------------------+
   | :attr:`password` |       | Password                | :const:`None`        |
   +------------------+-------+-------------------------+----------------------+
   | :attr:`hostname` |       | Host name (lower case)  | :const:`None`        |
   +------------------+-------+-------------------------+----------------------+
   | :attr:`port`     |       | Port number as integer, | :const:`None`        |
   |                  |       | if present              |                      |
   +------------------+-------+-------------------------+----------------------+

   See section :ref:`urlparse-result-object` for more information on the result
   object.

   .. versionadded:: 2.2

   .. versionchanged:: 2.5
      Added attributes to return value.


.. function:: urlunsplit(parts)

   Combine the elements of a tuple as returned by :func:`urlsplit` into a complete
   URL as a string. The *parts* argument can be any five-item iterable. This may
   result in a slightly different, but equivalent URL, if the URL that was parsed
   originally had unnecessary delimiters (for example, a ``?`` with an empty query;
   the RFC states that these are equivalent).

   .. versionadded:: 2.2


.. function:: urljoin(base, url[, allow_fragments])

   Construct a full ("absolute") URL by combining a "base URL" (*base*) with
   another URL (*url*).  Informally, this uses components of the base URL, in
   particular the addressing scheme, the network location and (part of) the path,
   to provide missing components in the relative URL.  For example:

      >>> from urlparse import urljoin
      >>> urljoin('http://www.cwi.nl/%7Eguido/Python.html', 'FAQ.html')
      'http://www.cwi.nl/%7Eguido/FAQ.html'

   The *allow_fragments* argument has the same meaning and default as for
   :func:`urlparse`.

   .. note::

      If *url* is an absolute URL (that is, starting with ``//`` or ``scheme://``),
      the *url*'s host name and/or scheme will be present in the result.  For example:

      .. doctest::

         >>> urljoin('http://www.cwi.nl/%7Eguido/Python.html',
         ...         '//www.python.org/%7Eguido')
         'http://www.python.org/%7Eguido'

      If you do not want that behavior, preprocess the *url* with :func:`urlsplit` and
      :func:`urlunsplit`, removing possible *scheme* and *netloc* parts.


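   The preprocessing suggested in the note can be sketched as follows.  The
   function name ``join_same_host`` is hypothetical and chosen only for this
   example; the fallback import assumes the Python 3 module name may be needed:

   .. code-block:: python

      try:
          from urlparse import urljoin, urlsplit, urlunsplit       # Python 2
      except ImportError:
          from urllib.parse import urljoin, urlsplit, urlunsplit   # Python 3

      def join_same_host(base, url):
          """Join *url* to *base*, ignoring any scheme or netloc in *url*."""
          parts = urlsplit(url)
          # Blank out scheme and netloc; keep path, query and fragment.
          relative = urlunsplit(('', '') + tuple(parts[2:]))
          return urljoin(base, relative)

      joined = join_same_host('http://www.cwi.nl/%7Eguido/Python.html',
                              '//www.python.org/%7Eguido')
      # 'http://www.cwi.nl/%7Eguido'

   Whether dropping the foreign host is the right policy depends on the
   application; the sketch only shows the mechanics.
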
.. function:: urldefrag(url)

   If *url* contains a fragment identifier, return a modified version of *url*
   with no fragment identifier, and the fragment identifier as a separate string.
   If there is no fragment identifier in *url*, return *url* unmodified and an
   empty string.


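   For instance, as a minimal sketch with illustrative URLs (the fallback import
   assumes the Python 3 module name may be needed):

   .. code-block:: python

      try:
          from urlparse import urldefrag        # Python 2 module name
      except ImportError:
          from urllib.parse import urldefrag    # Python 3 name

      # The result unpacks into the URL without its fragment, and the fragment.
      url, fragment = urldefrag('http://www.cwi.nl/doc.html#intro')
      # url == 'http://www.cwi.nl/doc.html', fragment == 'intro'

      # Without a fragment, the URL comes back unchanged with an empty string.
      plain_url, no_fragment = urldefrag('http://www.cwi.nl/doc.html')
      # plain_url == 'http://www.cwi.nl/doc.html', no_fragment == ''
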
.. seealso::

   :rfc:`1738` - Uniform Resource Locators (URL)
      This specifies the formal syntax and semantics of absolute URLs.

   :rfc:`1808` - Relative Uniform Resource Locators
      This Request For Comments includes the rules for joining an absolute and a
      relative URL, including a fair number of "Abnormal Examples" which govern the
      treatment of border cases.

   :rfc:`2396` - Uniform Resource Identifiers (URI): Generic Syntax
      Document describing the generic syntactic requirements for both Uniform Resource
      Names (URNs) and Uniform Resource Locators (URLs).


.. _urlparse-result-object:

Results of :func:`urlparse` and :func:`urlsplit`
------------------------------------------------

The result objects from the :func:`urlparse` and :func:`urlsplit` functions are
subclasses of the :class:`tuple` type.  These subclasses add the attributes
described in those functions, and also provide an additional method:


.. method:: ParseResult.geturl()

   Return the re-combined version of the original URL as a string. This may differ
   from the original URL in that the scheme will always be normalized to lower case
   and empty components may be dropped. Specifically, empty parameters, queries,
   and fragment identifiers will be removed.

   The result of this method is a fixed point if passed back through the original
   parsing function:

      >>> import urlparse
      >>> url = 'HTTP://www.Python.org/doc/#'

      >>> r1 = urlparse.urlsplit(url)
      >>> r1.geturl()
      'http://www.Python.org/doc/'

      >>> r2 = urlparse.urlsplit(r1.geturl())
      >>> r2.geturl()
      'http://www.Python.org/doc/'

   .. versionadded:: 2.5

The following classes provide the implementations of the parse results:


.. class:: BaseResult

   Base class for the concrete result classes.  This provides most of the attribute
   definitions.  It does not provide a :meth:`geturl` method.  It is derived from
   :class:`tuple`, but does not override the :meth:`__init__` or :meth:`__new__`
   methods.


.. class:: ParseResult(scheme, netloc, path, params, query, fragment)

   Concrete class for :func:`urlparse` results.  The :meth:`__new__` method is
   overridden to support checking that the right number of arguments is passed.


.. class:: SplitResult(scheme, netloc, path, query, fragment)

   Concrete class for :func:`urlsplit` results.  The :meth:`__new__` method is
   overridden to support checking that the right number of arguments is passed.

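The concrete classes can also be instantiated directly.  The following is a
sketch under the assumption that a wrong argument count surfaces as a
:exc:`TypeError`; the component values are illustrative, and the fallback
import assumes the Python 3 module name may be needed:

.. code-block:: python

   try:
       from urlparse import SplitResult, urlsplit       # Python 2 module name
   except ImportError:
       from urllib.parse import SplitResult, urlsplit   # Python 3 name

   # Build a result by hand from the five urlsplit() components.
   manual = SplitResult('http', 'www.python.org', '/doc/', '', '')
   url = manual.geturl()
   # 'http://www.python.org/doc/'

   # As a tuple, it compares equal to what urlsplit() returns for that URL.
   same = (manual == urlsplit(url))
   # True

   # Passing too few components is rejected when the object is created.
   try:
       SplitResult('http', 'www.python.org')
       arity_checked = False
   except TypeError:
       arity_checked = True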