6.4. Uniform Resource Identifiers (RFC2396)

This module defines functions and classes for working with URI as defined by RFC2396: http://www.ietf.org/rfc/rfc2396.txt

In keeping with usage in the specification we use URI in both the singular and plural sense.

In addition to parsing and formatting URI from strings, this module also supports computing and resolving relative URI. To do this we define two notional operators.

The resolve operator:

U = B [*] R

calculates a new URI ‘U’ from a base URI ‘B’ and a relative URI ‘R’.

The relative operator:

U [/] B = R

calculates the relative URI ‘R’ formed by expressing ‘U’ relative to ‘B’.

The relative operator is the reverse of the resolve operator. Note, however, that in some cases several different values of R can resolve to the same URI with a common base URI.
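The behaviour of the resolve operator can be illustrated with the standard library alone. The following is a minimal sketch, not Pyslet code: it assumes Python 3 and uses urllib.parse.urljoin as a stand-in for [*], and the example URI are invented for illustration. It shows two different values of R resolving to the same URI against a common base.

```python
from urllib.parse import urljoin

# Hypothetical base URI, used for illustration only.
B = "http://www.example.com/HD/folder/file.txt"

# Two different relative references (assumed examples).
R_a = "../User/setting.txt"
R_b = "/HD/User/setting.txt"

# U = B [*] R, with urljoin standing in for the resolve operator.
U_a = urljoin(B, R_a)
U_b = urljoin(B, R_b)

print(U_a)
print(U_a == U_b)
```

Because both references resolve to the same URI, the relative operator has to pick one of several valid answers.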

6.4.1. Creating URI Instances

To create URI use the URI.from_octets() class method. This method takes both character and binary strings: a character string must contain only ASCII characters and a binary string must contain only bytes that represent ASCII characters. The following function can help convert general character strings to a suitable format, but it is not a full implementation of the IRI specification. In particular, it does not encode delimiters (such as space) and it does not deal intelligently with unicode domain names (these must be converted to their ASCII URI forms first).


Extracts a URI octet-string from a unicode string.

A character string

Returns a character string with any characters outside the US-ASCII range replaced by URI-escaped UTF-8 sequences. This is not a general escaping method. All other characters are ignored, including non-URI characters like space. It is assumed that any (other) characters requiring escaping are already escaped.

The encoding algorithm used is the same as the one adopted by HTML. This is not part of the RFC standard which only defines the behaviour for streams of octets but it is in line with the approach adopted by the later IRI spec.
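The escaping rule described above can be sketched in a few lines. This is an illustrative approximation written for Python 3, not the module's implementation, and the name escape_non_ascii is hypothetical: characters in the US-ASCII range pass through untouched (including space), everything else becomes %-escaped UTF-8.

```python
def escape_non_ascii(text):
    """Replace each non-ASCII character with its %-escaped UTF-8 bytes."""
    out = []
    for ch in text:
        if ord(ch) < 0x80:
            # ASCII is passed through unchanged, even non-URI characters
            # such as space.
            out.append(ch)
        else:
            # One %XX escape per UTF-8 byte of the character.
            out.extend("%%%02X" % b for b in ch.encode("utf-8"))
    return "".join(out)


print(escape_non_ascii("\u82f1\u56fd.xml"))
```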

6.4.2. URI

class pyslet.rfc2396.URI(octets)

Bases: pyslet.py2.CmpMixin, pyslet.pep8.PEP8Compatibility

Class to represent URI References

You won’t normally instantiate a URI directly as it represents a generic URI. This class is designed to be overridden by scheme-specific implementations. Use the class method from_octets() to create instances.

If you are creating your own derived classes, call the parent constructor to populate the attributes defined here from the URI’s string representation, passing a character string representing the octets of the URI. (For backwards compatibility a binary string will be accepted provided it can be decoded as US ASCII characters.) You can override the scheme-specific part of the parsing by defining your own implementation of parse_scheme_specific_part().

It is an error if the octets string contains characters that are not allowed in a URI.


The following details have changed significantly following updates in 0.5.20160123 to introduce support for Python 3. Although the character/byte/octet descriptions have changed, the actual effect on running code is minimal when running under Python 2.

Unless otherwise stated, all attributes are character strings that encode the ‘octets’ in each component of the URI. These attributes retain the %-escaping. Use unescape_data() to obtain the original octets (as a byte string). The specification does not specify any particular encoding for interpreting these octets; indeed, in some types of URI these binary components may have no character-based interpretation.

For example, the URI “%E8%8B%B1%E5%9B%BD.xml” is a character string that represents a UTF-8 and URL-encoded path segment using the Chinese word for United Kingdom. To obtain the correct unicode path segment you would first use unescape_data() to obtain the binary string of bytes and then decode with UTF-8:

>>> import pyslet.rfc2396 as uri
>>> src = "%E8%8B%B1%E5%9B%BD.xml"
>>> uri.unescape_data(src).decode('utf-8')

URI can be converted to strings but the result is a character string that retains any %-encoding. Therefore, these character strings always use the restricted character set defined by the specification (a subset of US ASCII) and, in Python 2, can be freely converted between the str and unicode types.

URI are immutable and can be compared and used as keys in dictionaries. Two URI compare equal if their canonical forms are identical. See canonicalize() for more information.

classmethod from_octets(octets, strict=False)

Creates an instance of URI from a string


This method was changed in Pyslet 0.5.20160123 to introduce support for Python 3. It now takes either type of string but a character string is now preferred.

This is the main method you should use for creating instances. It uses the URI’s scheme to determine the appropriate subclass to create. See register() for more information.

A string of characters that represents the URI’s octets. If a binary string is passed it is assumed to be US ASCII and converted to a character string.
strict (defaults to False)
If the character string contains characters outside of the US ASCII character range then encode_unicode_uri() is called before the string is used to create the instance. You can turn off this behaviour (to enable strict URI-parsing) by passing strict=True

Pyslet manages the importing and registering of the following URI schemes using its own classes: http, https, file and urn. Additional modules are loaded and schemes registered ‘on demand’ when instances of the corresponding URI are first created.

scheme_class = {'urn': <class 'pyslet.urn.URN'>, 'http': <class 'pyslet.http.params.HTTPURL'>, 'https': <class 'pyslet.http.params.HTTPSURL'>, 'file': <class 'pyslet.rfc2396.FileURL'>}

A dictionary mapping lower-case URI schemes onto the special classes used to represent them

classmethod register(scheme, uri_class)

Registers a class to represent a scheme

A string representing a URI scheme, e.g., ‘http’. The string is converted to lower-case before it is registered.
A class derived from URI that is used to represent URI from scheme

If a class has already been registered for the scheme it is replaced. The mapping is kept in the scheme_class dictionary.
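The registration pattern described here is a simple scheme-to-class dispatch. The sketch below is a self-contained toy, not Pyslet's code: GenericURI and MailtoURI are hypothetical stand-ins, and the scheme detection is deliberately naive (it just splits on the first colon).

```python
class GenericURI:
    """Hypothetical stand-in for a generic URI class."""

    # Maps lower-case scheme names onto the classes that represent them.
    scheme_class = {}

    def __init__(self, octets):
        self.octets = octets

    @classmethod
    def register(cls, scheme, uri_class):
        # Scheme names are case-insensitive, so store them lower-cased.
        # Registering the same scheme twice replaces the earlier class.
        cls.scheme_class[scheme.lower()] = uri_class

    @classmethod
    def from_octets(cls, octets):
        scheme, colon, _ = octets.partition(":")
        if colon:
            # Fall back to the generic class for unregistered schemes.
            uri_class = cls.scheme_class.get(scheme.lower(), GenericURI)
        else:
            uri_class = GenericURI
        return uri_class(octets)


class MailtoURI(GenericURI):
    """Hypothetical scheme-specific subclass."""


GenericURI.register("MAILTO", MailtoURI)
print(type(GenericURI.from_octets("mailto:user@example.com")).__name__)
```

Keeping the mapping in a class-level dictionary means a later registration for the same scheme simply overwrites the earlier one.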

classmethod from_virtual_path(path)

Converts a virtual file path into a URI instance

A pyslet.vfs.VirtualFilePath instance representing a file path in a virtual file system. The path is always made absolute before being converted to a FileURL.

The authority (host name) in the resulting URL is usually left blank except when running under Windows, in which case the URL is constructed according to the recommendations in this blog post. In other words, UNC paths are mapped to both the network location and path components of the resulting file URL.

For named virtual file systems (i.e., those that don’t map directly to the functions in Python’s built-in os and os.path modules) the file system name is used for the authority. (If path is from a named virtual file system and is a UNC path then URIException is raised.)

classmethod from_path(path)

Converts a local file path into a URI instance.

A file path string.

Uses path to create an instance of pyslet.vfs.OSFilePath, see from_virtual_path() for more info.
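The mapping from file paths to file URL, including the UNC case mentioned for from_virtual_path(), can be approximated with the standard library. This is a rough sketch under stated assumptions, not the Pyslet implementation: path_to_file_url is a hypothetical helper, it handles only absolute POSIX-style paths and Windows UNC paths, and it percent-encodes with UTF-8.

```python
from urllib.parse import quote


def path_to_file_url(path):
    """Map an absolute POSIX path or a Windows UNC path to a file URL."""
    if path.startswith("\\\\"):
        # UNC path: the server name becomes the authority, the remainder
        # becomes the URL path.
        server, _, rest = path[2:].partition("\\")
        return "file://" + server + "/" + quote(rest.replace("\\", "/"))
    # Local path: the authority is left blank.
    return "file://" + quote(path)


print(path_to_file_url("/tmp/my file.txt"))
print(path_to_file_url("\\\\server\\share\\dir\\file.txt"))
```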

octets = None

The character string representing this URI’s octets

fragment = None

The fragment string that was appended to the URI or None if no fragment was given.

scheme = None

The URI scheme, if present

authority = None

The authority (e.g., host name) of a hierarchical URI

abs_path = None

The absolute path of a hierarchical URI (None if the path is relative)

query = None

The optional query associated with a hierarchical URI

scheme_specific_part = None

The scheme specific part of the URI

rel_path = None

The relative path of a hierarchical URI (None if the path is absolute)

opaque_part = None

None if the URI is hierarchical, otherwise the same as scheme_specific_part


Parses the scheme specific part of the URI

Parses the scheme specific part of the URI from scheme_specific_part. This attribute is set by the constructor, the role of this method is to parse this attribute and set any scheme-specific attribute values.

This method should be overridden by derived classes if they use a format other than the hierarchical URI format described in RFC2396.

The default implementation implements the generic parsing of hierarchical URI setting the following attribute values: authority, abs_path and query. If the URI is not of a hierarchical type then opaque_part is set instead. Unset attributes have the value None.
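The generic hierarchical split can be pictured as follows. This is a simplified sketch for the scheme-specific part of an absolute URI, not the method's actual code; split_scheme_specific_part is a hypothetical name and the function returns a tuple rather than setting attributes.

```python
def split_scheme_specific_part(ssp):
    """Return (authority, abs_path, query, opaque_part); unset values are None."""
    authority = abs_path = query = opaque_part = None
    if not ssp.startswith("/"):
        # Not hierarchical: the whole string is the opaque part.
        return authority, abs_path, query, ssp
    rest, qmark, tail = ssp.partition("?")
    if qmark:
        query = tail
    if rest.startswith("//"):
        # net_path = "//" authority [ abs_path ]
        authority, slash, path = rest[2:].partition("/")
        if slash:
            abs_path = "/" + path
    else:
        abs_path = rest
    return authority, abs_path, query, opaque_part


print(split_scheme_specific_part("//www.example.com/a/b?x=1"))
print(split_scheme_specific_part("user@example.com"))
```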


Returns a canonical form of this URI

For unknown schemes we simply convert the scheme to lower case so that, for example, X-scheme:data becomes x-scheme:data.

Derived classes should apply their own transformation rules.
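For the unknown-scheme case, the transformation amounts to lower-casing everything before the first colon. A minimal sketch, assuming the input is an absolute URI (canonicalize_scheme is a hypothetical name and the function works on plain strings, not URI instances):

```python
def canonicalize_scheme(uri_text):
    """Lower-case the scheme of an absolute URI, leaving the rest alone."""
    scheme, colon, rest = uri_text.partition(":")
    if not colon:
        # No scheme present: nothing to do in this sketch.
        return uri_text
    return scheme.lower() + colon + rest


print(canonicalize_scheme("X-scheme:data"))
```

Comparing canonical forms rather than raw strings is what lets two spellings of the same scheme compare equal.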


Returns a new URI comprised of the scheme and authority only.

Only valid for absolute URI, returns None otherwise.

The canonical root does not include a trailing slash. The canonical root is used to define the domain of a resource, often for security purposes.

If the URI is non-hierarchical then just the scheme is returned.

resolve(base, current_doc_ref=None)

Resolves a relative URI against a base URI

A URI instance representing the base URI against which to resolve this URI. You may also pass a URI string for this parameter.
The optional current_doc_ref allows you to handle the special case of resolving the empty URI. Strictly speaking, fragments are not part of the URI itself so a relative URI consisting of the empty string, or a relative URI consisting of just a fragment both refer to the current document. By default, current_doc_ref is assumed to be the same as base but there are cases where the base URI is not the same as the URI used to originally retrieve the document and this optional parameter allows you to cope with those cases.

Returns a new URI instance.

If the base URI is also relative then the result is a relative URI, otherwise the result is an absolute URI. The RFC does not actually go into the procedure for combining relative URI but if B is an absolute URI and R1 and R2 are relative URI then using the resolve operator ([*], see above):

U1 = B [*] R1
U2 = U1 [*] R2
U2 = ( B [*] R1 ) [*] R2

The last expression raises the question of associativity: in other words, is the following expression also valid?

U2 = B [*] ( R1 [*] R2 )

For this to work it must be possible to use the resolve operator to combine two relative URI to make a third, which is what we allow here.
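The associativity argument can be checked with a small experiment. The sketch below is not Pyslet code: it assumes Python 3, uses urllib.parse.urljoin as a stand-in for the resolve operator, and the URI are invented examples chosen so that no ‘..’ climbs above the start of a relative path.

```python
from urllib.parse import urljoin

# Hypothetical example values.
B = "http://www.example.com/HD/folder/file.txt"
R1 = "sub/page.html"
R2 = "../other.html"

# ( B [*] R1 ) [*] R2
left = urljoin(urljoin(B, R1), R2)

# B [*] ( R1 [*] R2 ), which first combines two relative URI.
combined = urljoin(R1, R2)
right = urljoin(B, combined)

print(left, right, combined)
```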


Calculates a URI expressed relative to base.

A URI instance representing the base URI against which to calculate the relative URI. You may also pass a URI string for this parameter.

Returns a new URI instance.

As we allow the resolve() method for two relative paths it makes sense for the Relative operator to also be defined:

R3 = R1 [*] R2
R3 [/] R1 = R2

There are some significant restrictions. URI are classified by how specified they are, with:

absolute URI > authority > absolute path > relative path

If R is absolute, or simply more specified than B on the above scale and:

U = B [*] R

then U = R regardless of the value of B and therefore:

U [/] B = U if B is less specified than U

Also note that if U is a relative URI then B cannot be absolute. In fact B must always be less specified than, or equally specified to, U because B is the base URI from which U has been derived:

U [/] B = undefined if B is more specified than U

Therefore the only interesting cases are when B is equally specified to U. To give a concrete example:

U = /HD/User/setting.txt
B = /HD/folder/file.txt

/HD/User/setting.txt [/] /HD/folder/file.txt = ../User/setting.txt
/HD/User/setting.txt = /HD/folder/file.txt [*] ../User/setting.txt

And for relative paths:

U = User/setting.txt
B = User/folder/file.txt

User/setting.txt [/] User/folder/file.txt = ../setting.txt
User/setting.txt = User/folder/file.txt [*] ../setting.txt
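The two worked examples above can be reproduced with a short path-only sketch. It is an illustration of the [/] operator for equally specified paths, not the library's algorithm; relative_path is a hypothetical helper that ignores schemes, authorities, queries and fragments. urllib.parse.urljoin is used only to confirm the round trip.

```python
from urllib.parse import urljoin


def relative_path(u, b):
    """Express path u relative to base path b (both absolute or both relative)."""
    u_segs = u.split("/")
    # The last segment of the base (its 'file') plays no part in resolution.
    b_segs = b.split("/")[:-1]
    common = 0
    limit = min(len(u_segs) - 1, len(b_segs))
    while common < limit and u_segs[common] == b_segs[common]:
        common += 1
    # One '..' for each base directory that is not shared with u.
    ups = [".."] * (len(b_segs) - common)
    return "/".join(ups + u_segs[common:])


r_abs = relative_path("/HD/User/setting.txt", "/HD/folder/file.txt")
r_rel = relative_path("User/setting.txt", "User/folder/file.txt")
print(r_abs, r_rel)
# Round trip: B [*] R gives back U.
print(urljoin("/HD/folder/file.txt", r_abs))
```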

Compares this URI with another

Another URI instance.

Returns True if the canonical representations of the URIs match.


Returns True if this URI is absolute

An absolute URI is fully specified with a scheme, e.g., ‘http’.


Gets the file name associated with this resource

Returns None if the URI scheme does not have the concept. By default the file name is extracted from the last component of the path. Note the subtle difference between returning None and returning an empty string (indicating that the URI represents a directory-like object).

The return result is always a character string.

class pyslet.rfc2396.ServerBasedURL(octets)

Bases: pyslet.rfc2396.URI

Represents server-based URI

A server-based URI is one of the form:

<scheme> '://' [<userinfo> '@'] <host> [':' <port>] <path>

the default port for this type of URL


Returns a hostname and integer port tuple

The format is suitable for socket operations. The main purpose of this method is to determine if the port is set on the URL and, if it isn’t, to return the default port for this URL type instead.
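The default-port fallback can be sketched with urllib.parse. This is an approximation for illustration only: host_and_port and the DEFAULT_PORTS table are assumptions, not part of the module.

```python
from urllib.parse import urlsplit

# Assumed default ports for the two schemes used in this sketch.
DEFAULT_PORTS = {"http": 80, "https": 443}


def host_and_port(url):
    """Return a (hostname, port) tuple, falling back to the scheme's default."""
    parts = urlsplit(url)
    port = parts.port if parts.port is not None else DEFAULT_PORTS[parts.scheme]
    return parts.hostname, port


print(host_and_port("http://www.example.com/index.html"))
print(host_and_port("https://localhost:8443/"))
```

A tuple in this shape can be passed directly to functions such as socket.create_connection().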


Returns a canonical form of this URI

In addition to returning the scheme in lower-case form, this method forces the host to be lower case and removes the port specifier if it matches the DEFAULT_PORT for this type of URI.

No transformation is performed on the path component.
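The three rules (lower-case scheme, lower-case host, drop the default port) can be sketched as follows. This is a simplified stand-in built on urllib.parse, not the method itself: canonicalize_server_url and DEFAULT_PORTS are hypothetical, and IPv6 literals are not handled.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumed default ports, kept as strings for direct comparison.
DEFAULT_PORTS = {"http": "80", "https": "443"}


def canonicalize_server_url(url):
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    # Keep any userinfo exactly as written; only the host is lower-cased.
    userinfo, at, hostport = parts.netloc.rpartition("@")
    host, colon, port = hostport.partition(":")
    netloc = userinfo + at + host.lower()
    if colon and port != DEFAULT_PORTS.get(scheme):
        netloc += colon + port
    # Path, query and fragment are passed through untouched.
    return urlunsplit((scheme, netloc, parts.path, parts.query, parts.fragment))


print(canonicalize_server_url("HTTP://WWW.Example.COM:80/Path?q=1"))
```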

class pyslet.rfc2396.FileURL(octets='file:///')

Bases: pyslet.rfc2396.ServerBasedURL

Represents the file URL scheme defined by RFC1738

Do not create instances directly, instead use (for example):

furl = URI.from_octets('file:///...')

Returns the system path name corresponding to this file URL

If the system supports unicode file names (as reported by os.path.supports_unicode_filenames) then get_pathname also returns a unicode string, otherwise it returns an 8-bit string encoded in the underlying file system encoding.

There are some libraries (notably sax) that will fail when passed files opened using unicode paths. The force8bit flag can be used to force get_pathname to return a byte string encoded using the native file system encoding.

If the URL does not represent a path in the native file system then URIException is raised.


Returns a virtual file path corresponding to this URL

The result is a pyslet.vfs.FilePath instance.

The host component of the URL is used to determine which virtual file system the file belongs to. If there is no virtual file system matching the URL’s host and the native file system supports UNC paths (i.e., is Windows) the host will be placed in the machine portion of the UNC path.

Path parameters e.g., /dir/file;lang=en in the URL are ignored.


Returns a locally portable version of the URL

The result is a character string, not a URI instance.

In Pyslet, all hierarchical URI are treated as using the UTF-8 encoding for characters outside US ASCII. As a result, file URL are expressed using percent-encoded UTF-8 multi-byte sequences. When converting these URLs to file paths the difference is taken into account correctly but if you attempt to output a URL generated by Pyslet and use it in another application you may find that the URL is not recognised. This is particularly a problem on Windows where file URLs are expected to be encoded with the native file system encoding.

The purpose of this method is to return a version of the URL re-encoded in the local file system encoding for portability such as being copy-pasted into a browser address bar.
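The re-encoding step can be illustrated with urllib.parse. This sketch is an assumption about the general technique rather than a copy of the method: reencode_path is a hypothetical helper, the target encoding is passed explicitly (cp1252 here) instead of being read from the system, and escaped reserved characters such as %2F are not preserved.

```python
from urllib.parse import quote, unquote


def reencode_path(url_path, target_encoding):
    """Re-express a UTF-8 percent-encoded path in another byte encoding."""
    # Decode the escapes as UTF-8 to recover the character string.
    text = unquote(url_path, encoding="utf-8", errors="strict")
    # Escape again using the target encoding, keeping '/' as a separator.
    return quote(text, safe="/", encoding=target_encoding, errors="strict")


print(reencode_path("/docs/%C3%A9t%C3%A9.txt", "cp1252"))
```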

6.4.3. Canonicalization and Escaping

pyslet.rfc2396.canonicalize_data(source, unreserved_test=is_unreserved, allowed_test=is_allowed)

Returns the canonical form of source string.

The canonical form is the same string but any unreserved characters represented as hex escapes in source are unencoded and any unescaped characters that are neither reserved nor unreserved are escaped.

A string of characters. Characters must be in the US ASCII range. Use encode_unicode_uri() first if necessary. Will raise UnicodeEncodeError if non-ASCII characters are encountered.

A function with the same signature as is_unreserved(), which it defaults to. By providing a different function you can control which characters will have their escapes removed. It does not affect which unescaped characters are escaped.

To give an example, by default the ‘.’ is unreserved so the sequence %2E will be removed when canonicalizing the source. However, if the specific part of the URL scheme you are dealing with applies some reserved purpose to ‘.’ then source may contain both encoded and unencoded versions to disambiguate its usage. In this case you would want to remove ‘.’ from the definition of unreserved to prevent it being unescaped.

If you don’t want any escapes removed, simply pass:

lambda x: False

Defaults to is_allowed()

See parse_uric() for more information.

All hex escapes are promoted to upper case.
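The canonical form described above can be approximated in a short function. This is a self-contained sketch, not the library routine: canonicalize is a hypothetical name, the character sets are written out from the RFC2396/RFC2732 definitions, and a stray ‘%’ that does not start a valid escape is simply escaped.

```python
import string

UNRESERVED = set(string.ascii_letters + string.digits + "-_.!~*'()")
RESERVED = set(";/?:@&=+$,[]")
HEX = set(string.hexdigits)


def canonicalize(source, unreserved_test=lambda c: c in UNRESERVED):
    out = []
    i = 0
    while i < len(source):
        c = source[i]
        pair = source[i + 1:i + 3]
        if c == "%" and len(pair) == 2 and set(pair) <= HEX:
            decoded = chr(int(pair, 16))
            # Unescape only what the caller treats as unreserved; promote
            # every other escape to upper case.
            out.append(decoded if unreserved_test(decoded)
                       else "%" + pair.upper())
            i += 3
            continue
        if c in UNRESERVED or c in RESERVED:
            out.append(c)
        else:
            # Neither reserved nor unreserved: escape it.
            out.append("%%%02X" % ord(c))
        i += 1
    return "".join(out)


print(canonicalize("%7euser/file%2etxt%2f a"))
print(canonicalize("%2e", lambda c: False))
```

Passing a test that always returns False keeps every escape in place while still promoting the hex digits to upper case.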

pyslet.rfc2396.escape_data(source, reserved_test=is_reserved, allowed_test=is_allowed)

Performs URI escaping on source

Returns the escaped character string.


The input string. This can be a binary or character string. For character strings all characters must be in the US ASCII range. Use encode_unicode_uri() first if necessary. Will raise UnicodeEncodeError if non-ASCII characters are encountered. For binary strings there is no constraint on the range of allowable octets.


In Python 2 the ASCII character constraint is only applied when source is of type unicode.


Default is_reserved(), the function to test if a character should be escaped. This function should take a single character as an argument and return True if the character must be escaped. Characters for which this function returns False will still be escaped if they are not allowed to appear unescaped in URI (see allowed_test below).

Quoting from RFC2396:

Characters in the “reserved” set are not reserved in all contexts. The set of characters actually reserved within any given URI component is defined by that component. In general, a character is reserved if the semantics of the URI changes if the character is replaced with its escaped US-ASCII encoding.

Therefore, you may want to reduce the set of characters that are escaped based on the target component for the data. Different rules apply to a path component compared with, for example, the query string. A number of alternative test functions are provided to assist with escaping an alternative set of characters.

For example, suppose you want to ensure that your data is escaped according to the rules of the earlier RFC1738. In that specification, a fore-runner of RFC2396, the “~” was not classified as a valid URL character and required escaping. It was later added to the mark category enabling it to appear unescaped. To ensure that this character is escaped for compatibility with older systems you might do this when escaping data for a path component (where ‘~’ is often used):

path_component = uri.escape_data(
    dir_name, reserved_test=uri.is_reserved_1738)

In addition to escaping “~”, the above will also leave “$”, “+” and “,” unescaped as they were classified as ‘safe’ or ‘extra’ characters in RFC1738 and were not reserved.


Defaults to is_allowed()

See parse_uric() for more information.

By default there is no difference between RFC2396 and RFC2732 in operation as in RFC2732 “[” and “]” are legal URI characters but they are also in the default reserved set so will be escaped anyway. In RFC2396 they were escaped on the basis of not being allowed.
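The interplay between reserved_test and allowed_test can be sketched with a small stand-alone function. Everything below is a hypothetical re-creation for illustration (Python 3, character strings only); the helper names mirror those in this module but the bodies are assumptions written from the RFC character sets.

```python
import string

UNRESERVED = set(string.ascii_letters + string.digits + "-_.!~*'()")
RESERVED_2396 = set(";/?:@&=+$,")
RESERVED_2732 = RESERVED_2396 | set("[]")


def is_reserved(c):
    return c in RESERVED_2732


def is_allowed(c):
    return c in UNRESERVED or c in RESERVED_2732


def is_allowed_2396(c):
    return c in UNRESERVED or c in RESERVED_2396


def is_path_segment_reserved(c):
    return c in set("/;=?")


def escape(source, reserved_test=is_reserved, allowed_test=is_allowed):
    # Raises UnicodeEncodeError if non-ASCII characters are present.
    source.encode("ascii")
    # Escape a character if it is reserved or not allowed unescaped.
    return "".join(
        "%%%02X" % ord(c) if reserved_test(c) or not allowed_test(c) else c
        for c in source)


print(escape("[file].txt"))
print(escape("[file].txt", reserved_test=is_path_segment_reserved))
print(escape("[file].txt", reserved_test=is_path_segment_reserved,
             allowed_test=is_allowed_2396))
```

In this sketch the brackets are escaped in the first call because they are reserved, left alone in the second because they are allowed and no longer reserved, and escaped again in the third because the stricter allowed set excludes them.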

The difference comes if you are using a reduced set of reserved characters. For example:

>>> print uri.escape_data("[file].txt")
>>> print uri.escape_data(
        "[file].txt", reserved_test=uri.is_path_segment_reserved)
>>> print uri.escape_data(
        "[file].txt", reserved_test=uri.is_path_segment_reserved,
        allowed_test=uri.is_allowed_2396)
Performs URI unescaping

The URI-encoded string

Removes escape sequences. The string is returned as a binary string of octets, not a string of characters. Escape sequences such as %E9 will result in the byte value 233 and not the character é.

The character encoding that applies may depend on the context and it cannot always be assumed to be UTF-8 (though in most cases that will be the correct way to interpret the result).
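The bytes-versus-characters distinction is easy to demonstrate with the standard library. The sketch below uses urllib.parse.unquote_to_bytes as a stand-in (it is not this module's function) and assumes Python 3.

```python
from urllib.parse import unquote_to_bytes

raw = unquote_to_bytes("%E9")
print(raw[0])  # the byte value, not a character

# Interpreted as ISO-8859-1 the byte is a lower-case e with acute accent.
as_latin1 = raw.decode("iso-8859-1")

# The same byte on its own is not valid UTF-8.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# A UTF-8 encoded source decodes cleanly.
as_utf8 = unquote_to_bytes("caf%C3%A9").decode("utf-8")
```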

pyslet.rfc2396.path_sep = u'/'

Constant for “/” character.

6.4.4. Basic Syntax

RFC2396 defines a number of character classes (see pyslet.unicode5.CharClass) to assist with the parsing of URI.

The bound test method of each class is exposed for convenience (you don’t need to pass an instance). These pseudo-functions therefore all take a single character as an argument and return True if the character matches the class. They will also accept None and return False in that case.
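The shape of these test functions can be sketched without the CharClass machinery. This is a hypothetical illustration only: make_test and the two sample tests are not part of the module, and a frozenset stands in for the character class.

```python
def make_test(chars):
    """Build a single-character test function from a string of members."""
    members = frozenset(chars)

    def test(c):
        # None is accepted and simply fails the test.
        return c is not None and c in members

    return test


is_mark_sketch = make_test("-_.!~*'()")
is_digit_sketch = make_test("0123456789")

print(is_mark_sketch("~"), is_digit_sketch("7"), is_digit_sketch(None))
```

Accepting None lets a parser call the test on "the next character" without first checking for end of input.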


Tests production: upalpha


Tests production: lowalpha


Tests production: alpha


Tests production: digit


Tests production: alphanum


Tests production: reserved

The reserved characters are:

";" | "/" | "?" | ":" | "@" | "&" | "=" | "+" | "$" | "," | "[" | "]"

This function uses the larger reserved set defined by the update in RFC2732. The additional reserved characters are “[” and “]” which were not originally part of the character set allowed in URI by RFC2396.


Tests production: reserved

The reserved characters are:

";" | "/" | "?" | ":" | "@" | "&" | "=" | "+" | "$" | ","

This function enables strict parsing according to RFC2396, for general use you should use is_reserved() which takes into consideration the update in RFC2732 to accommodate IPv6 literals.


Tests production: reserved

The reserved characters are:

";" | "/" | "?" | ":" | "@" | "&" | "="

This function enables parsing according to the earlier RFC1738.


Tests production: unreserved

Despite the name, some characters are neither reserved nor unreserved.


Tests production: unreserved

Tests the definition of unreserved from the earlier RFC1738. The following characters were classified as ‘safe’ or ‘extra’ in RFC1738 (and so are unreserved there) but were later classified as reserved in RFC2396:

"$" | "+" | ","

The “~” is considered unreserved in RFC2396 but is neither reserved nor unreserved in RFC1738 and so therefore must be escaped for compatibility with early URL parsing systems.


Tests production: safe (RFC 1738 only)

The safe characters are:

"$" | "-" | "_" | "." | "+"

Tests production: extra (RFC 1738 only)

The extra characters are:

"!" | "*" | "'" | "(" | ")" | ","

Tests production: mark

The mark characters are:

"-" | "_" | "." | "!" | "~" | "*" | "'" | "(" | ")"

Convenience function for testing allowed characters

Returns True if c is a character allowed in a URI according to the looser definitions of RFC2732, False otherwise. A character is allowed (unescaped) in a URI if it is either reserved or unreserved.


Convenience function for testing allowed characters

Returns True if c is a character allowed in a URI according to the stricter definitions in RFC2396, False otherwise. A character is allowed (unescaped) in a URI if it is either reserved or unreserved.


Convenience function for testing allowed characters

Returns True if c is a character allowed in a URI according to the older definitions in RFC1738, False otherwise. A character is allowed (unescaped) in a URI if it is either reserved or unreserved.
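The three definitions of ‘allowed’ differ only at the edges. The sets below are written out by hand from the productions listed in this section, as a sketch for comparison rather than the module's own data.

```python
import string

ALNUM = string.ascii_letters + string.digits

# RFC1738: reserved + alphanumerics + safe + extra
ALLOWED_1738 = set(ALNUM + ";/?:@&=" + "$-_.+" + "!*'(),")
# RFC2396: reserved + alphanumerics + mark
ALLOWED_2396 = set(ALNUM + ";/?:@&=+$," + "-_.!~*'()")
# RFC2732: as RFC2396 plus the square brackets
ALLOWED_2732 = ALLOWED_2396 | set("[]")

print(sorted(ALLOWED_2396 - ALLOWED_1738))
print(sorted(ALLOWED_2732 - ALLOWED_2396))
```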


Tests production: hex

Accepts upper or lower case forms.


Tests production: control


Tests production: space


Tests production: delims

The delims characters are:

"<" | ">" | "#" | "%" | <">

Tests production: unwise

The unwise characters are:

"{" | "}" | "|" | "\" | "^" | "`"

This function uses the smaller unwise set defined by the update in RFC2732. The characters “[” and “]” were removed from this set in order to support IPv6 literals.

This function is provided for completeness and is not used internally for parsing URLs.


Tests production: unwise

The unwise characters are:

"{" | "}" | "|" | "\" | "^" | "[" | "]" | "`"

This function enables strict parsing according to RFC2396, the definition of unwise characters was updated in RFC2732 to exclude “[” and “]”.


Convenience function for parsing production authority

Quoting the specification of production authority:

Within the authority component, the characters “;”, “:”, “@”, “?”, and “/” are reserved

Convenience function for escaping path segments

From RFC2396:

Within a path segment, the characters “/”, “;”, “=”, and “?” are reserved.

Convenience function for escaping query strings

From RFC2396:

Within a query component, the characters “;”, “/”, “?”, “:”, “@”, “&”, “=”, “+”, “,”, and “$” are reserved

Some fragments of URI parsing are exposed for reuse by other modules.

pyslet.rfc2396.parse_uric(source, pos=0, allowed_test=is_allowed)

Returns the number of URI characters in a source string

A source string (of characters)
The place at which to start parsing (defaults to 0)

Defaults to is_allowed()

Test function indicating if a character is allowed unencoded in a URI. For stricter RFC2396 compliant parsing you may also pass is_allowed_2396() or is_allowed_1738().

For information, RFC2396 added “~” to the range of allowed characters and RFC2732 added “[” and “]” to support IPv6 literals.

This function can be used to scan a string of characters for a URI, for example:

x = "http://www.pyslet.org/ is great"
url = x[:parse_uric(x, 0)]

It does not check the validity of the URI against the specification. The purpose is to allow a URI to be extracted from some source text. It assumes that all characters that must be encoded in URI are encoded, so characters outside the ASCII character set automatically terminate the URI as do any unescaped characters outside the allowed set (defined by the allowed_test). See encode_unicode_uri() for details of how to create an appropriate source string in contexts where non-ASCII characters may be present.
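The scanning behaviour can be sketched as a simple loop. This is an illustrative re-creation, not the module's function: count_uric is a hypothetical name, the allowed set follows the RFC2732 definition, and a %XX escape is counted as three URI characters.

```python
import string

ALLOWED = set(string.ascii_letters + string.digits
              + ";/?:@&=+$,[]" + "-_.!~*'()")
HEX = set(string.hexdigits)


def count_uric(source, pos=0, allowed=ALLOWED):
    """Count URI characters in source starting at pos."""
    start = pos
    while pos < len(source):
        c = source[pos]
        if c in allowed:
            pos += 1
        elif (c == "%" and len(source) - pos >= 3
                and set(source[pos + 1:pos + 3]) <= HEX):
            # A complete hex escape counts as part of the URI.
            pos += 3
        else:
            # Space, non-ASCII and other disallowed characters end the scan.
            break
    return pos - start


x = "http://www.pyslet.org/ is great"
print(x[:count_uric(x)])
```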


Splits an authority component

A character string containing the authority component of a URI.

Returns a triple of:

(userinfo, host, port)

There is no parsing of the individual components which may or may not be syntactically valid according to the specification. The userinfo is defined as anything up to the “@” symbol or None if there is no “@”. The port is defined as any digit-string (possibly empty) after the last “:” character or None if there is no “:” or if there is a non-empty string containing anything other than a digit after the last “:”.

The return values are always character strings (or None). There is no unescaping or other parsing of the values.
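The splitting rules above translate fairly directly into code. The following is a sketch under stated assumptions, not the module's implementation: split_authority is a hypothetical name, and userinfo is taken to end at the first “@”.

```python
DIGITS = set("0123456789")


def split_authority(authority):
    """Return (userinfo, host, port) without validating or unescaping."""
    userinfo = None
    hostport = authority
    if "@" in authority:
        userinfo, _, hostport = authority.partition("@")
    host, colon, tail = hostport.rpartition(":")
    if colon and set(tail) <= DIGITS:
        port = tail  # may be the empty string
    else:
        # No ':' at all, or something other than digits after the last ':'.
        host, port = hostport, None
    return userinfo, host, port


print(split_authority("user:pw@www.example.com:8080"))
print(split_authority("[::1]"))
```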

pyslet.rfc2396.split_path(path, abs_path=True)

Splits a URI-encoded path into path segments

A character string containing the path component of a URI. If path is None it is treated as an empty string.
A flag (defaults to True) indicating whether the path is absolute (True) or relative (False). This flag only affects the handling of the empty path. An empty absolute path is treated as if it were ‘/’ and returns a list containing a single empty path segment whereas an empty relative path returns a list with no path segments, in other words, an empty list.

The return result is always a list of character strings split from path. It will only end in an empty path segment if the path ends with a slash.
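The handling of empty paths is the subtle part, so a small sketch may help. This is one reading of the description above, not the module's code: split_path_sketch is a hypothetical name, and the assumption is that the leading slash of an absolute path does not produce a segment of its own.

```python
def split_path_sketch(path, abs_path=True):
    """Split a URI-encoded path into a list of segments."""
    if path is None:
        path = ""
    if not path:
        # Empty absolute path behaves like '/'; empty relative path has
        # no segments at all.
        return [""] if abs_path else []
    if abs_path and path.startswith("/"):
        # Assumption: drop the leading slash before splitting.
        path = path[1:]
    return path.split("/")


print(split_path_sketch("/a/b/"))
print(split_path_sketch("", abs_path=False))
```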

pyslet.rfc2396.split_abs_path(path, abs_path=True)

Provided for backwards compatibility

Equivalent to:

split_path(abs_path, True)

Provided for backwards compatibility

Equivalent to:

split_path(abs_path, False)

Normalizes a list of path_segments

A list of character strings representing path segments, for example, as returned by split_path().

Normalizing follows the rules for resolving relative URI paths: ‘./’ and trailing ‘.’ are removed, ‘seg/../’ and trailing ‘seg/..’ are also removed.
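These removal rules can be sketched as a single pass over the segment list. This is an illustrative version that returns a new list, not the module's function; normalize_sketch is a hypothetical name, and a trailing ‘.’ or ‘..’ is assumed to leave an empty final segment (a path ending in a slash).

```python
def normalize_sketch(segments):
    """Return a new list with '.' and 'seg/..' pairs removed."""
    out = []
    last_index = len(segments) - 1
    for i, seg in enumerate(segments):
        if seg == ".":
            if i == last_index:
                out.append("")  # trailing '.' leaves a directory-like path
        elif seg == ".." and out and out[-1] != "..":
            out.pop()  # cancel the preceding segment
            if i == last_index:
                out.append("")
        else:
            # Ordinary segments, and any '..' with nothing left to cancel.
            out.append(seg)
    return out


print(normalize_sketch(["a", ".", "b", "..", "c"]))
```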

6.4.5. Exceptions

class pyslet.rfc2396.URIException

Bases: exceptions.Exception

Base class for URI-related exceptions

class pyslet.rfc2396.URIRelativeError

Bases: pyslet.rfc2396.URIException

Exceptions raised while resolving relative URI

6.4.6. Legacy

The following definitions are provided for backwards compatibility only.


An instance of URIFactoryClass that can be used for creating URI instances.

class pyslet.rfc2396.URIFactoryClass

Bases: pyslet.pep8.PEP8Compatibility