2.6. Uniform Resource Identifiers (RFC2396)

This module defines functions and classes for working with URIs as defined by RFC2396: http://www.ietf.org/rfc/rfc2396.txt

In addition to parsing and formatting URIs from strings of octets, this module also supports computing and resolving relative URIs. To do this we define two notational operators.

The resolve operator:

U = B [*] R

calculates a new URI ‘U’ from a base URI ‘B’ and a relative URI ‘R’.

The relative operator:

U [/] B = R

calculates the relative URI ‘R’ formed by expressing ‘U’ relative to ‘B’.

Clearly the relative operator is the reverse of the resolve operator. Note, however, that in some cases several different values of R resolve to the same URI against a common base URI.
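
As a rough illustration of the resolve operator, and of how several different values of R can resolve to the same URI, the sketch below uses urljoin from Python's standard urllib.parse module as a stand-in for B [*] R. This is an assumption made for illustration only: urljoin follows the later RFC 3986 rules rather than this module's implementation, and the example.com URIs are placeholders.

```python
from urllib.parse import urljoin

# Hypothetical base URI B (example.com is a placeholder host)
B = "http://www.example.com/a/b/c.txt"

# Three different references R that identify the same target
candidates = ["../d.txt", "/a/d.txt", "http://www.example.com/a/d.txt"]

# U = B [*] R, approximated with the standard library's urljoin
resolved = [urljoin(B, r) for r in candidates]

for r, u in zip(candidates, resolved):
    print(r, "->", u)
```

All three references resolve to http://www.example.com/a/d.txt, so U [/] B has more than one valid answer for this pair.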

2.6.1. Creating URI Instances

pyslet.rfc2396.URIFactory

An instance of URIFactoryClass that can be used for creating URI instances.

class pyslet.rfc2396.URIFactoryClass

A factory class that contains methods for creating URI instances.

URI(octets)

Creates an instance of URI from a string of octets.

URLFromPathname(path)

Converts a local file path into a URI instance.

If the path is not absolute it is made absolute by resolving it relative to the current working directory before converting it to a URI.

Under Windows, the URL is constructed according to the recommendations in this blog post: http://blogs.msdn.com/b/ie/archive/2006/12/06/file-uris-in-windows.aspx So UNC paths are mapped to both the network location and path components of the resulting URI.
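
The mapping described above can be sketched with the standard library alone. The helpers posix_path_to_url and windows_path_to_url are hypothetical names introduced for this example, and the cwd parameter is an assumption that stands in for the current working directory. This is not the module's implementation, only an approximation of the behaviour described.

```python
import ntpath
import posixpath
from urllib.parse import quote


def posix_path_to_url(path, cwd="/"):
    """Hypothetical helper: make a POSIX path absolute, then build a file URL."""
    # posixpath.join ignores cwd when path is already absolute
    abs_path = posixpath.normpath(posixpath.join(cwd, path))
    return "file://" + quote(abs_path)


def windows_path_to_url(path):
    """Hypothetical helper: map an absolute Windows path to a file URL."""
    drive, tail = ntpath.splitdrive(path)
    tail = tail.replace("\\", "/")
    if drive.startswith("\\\\"):
        # UNC path: the server becomes the network location,
        # the share and the rest of the path become the URL path
        host, _, share = drive[2:].partition("\\")
        return "file://" + host + quote("/" + share + tail)
    # Drive-letter path: empty network location, drive kept in the path
    return "file:///" + drive + quote(tail)


print(posix_path_to_url("docs/my file.txt", cwd="/home/me"))
print(windows_path_to_url(r"C:\Users\me\notes.txt"))
print(windows_path_to_url(r"\\server\share\dir\file.txt"))
```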

URLFromVirtualFilePath(path)

Converts a virtual file path into a URI instance.

Resolve(b, r)

Evaluates the resolve operator B [*] R, resolving R relative to B

The input parameters are converted to URI objects if necessary.

Relative(u, b)

Evaluates the relative operator U [/] B, returning U relative to B

The input parameters are converted to URI objects if necessary.

pyslet.rfc2396.EncodeUnicodeURI(uSrc)

Takes a unicode string that is supposed to be a URI and returns an octet string.

The encoding algorithm used is the same as the one adopted by HTML: utf-8 and then %-escape. This is not part of the RFC standard which only defines the behaviour for streams of octets.
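
A minimal sketch of the "UTF-8, then %-escape" approach, written for Python 3 with urllib.parse.quote. The function name encode_unicode_uri and the choice of characters left unescaped are assumptions made for this example; the module's own function may treat individual characters differently.

```python
from urllib.parse import quote

# Characters left untouched: the RFC 2396 reserved characters, '#', and
# '%' (assumed here to introduce an escape that is already present).
SAFE = ";/?:@&=+$,#%"


def encode_unicode_uri(u_src):
    """Hypothetical helper: UTF-8 encode, then %-escape non-ASCII octets."""
    octets = u_src.encode("utf-8")
    return quote(octets, safe=SAFE).encode("ascii")


# \u82f1\u56fd is the Chinese word for United Kingdom
print(encode_unicode_uri("http://www.example.com/\u82f1\u56fd.xml"))
# prints b'http://www.example.com/%E8%8B%B1%E5%9B%BD.xml'
```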

2.6.2. URI

class pyslet.rfc2396.URI(octets)

Bases: object

Class to represent a URI reference.

octets = None

The octet string representing this URI.

fragment = None

The fragment string that was appended to the URI or None if no fragment was given.

scheme = None

The URI scheme, if present.

authority = None

The authority (e.g., host name) of a hierarchical URI

absPath = None

The absolute path of a hierarchical URI (None if the path is relative)

query = None

The optional query associated with a hierarchical URI.

schemeSpecificPart = None

The scheme specific part of the URI.

relPath = None

The relative path of a hierarchical URI (None if the path is absolute)

opaquePart = None

None if the URI is hierarchical, otherwise the same as schemeSpecificPart.
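
To make the components above concrete, the sketch below splits one hierarchical and one opaque URI with urlsplit from the standard library. This is an analogy rather than this class's own parser: urlsplit uses different attribute names (netloc rather than authority, path rather than absPath or relPath) and returns empty strings where this class uses None. The example.com URIs are placeholders.

```python
from urllib.parse import urlsplit

# A hierarchical URI: scheme, authority, absolute path, query, fragment
h = urlsplit("http://www.example.com/docs/index.html?lang=en#intro")
print(h.scheme, h.netloc, h.path, h.query, h.fragment)

# An opaque URI: only a scheme and a scheme-specific part
o = urlsplit("mailto:user@example.com")
print(o.scheme, repr(o.netloc), o.path)
```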

GetFileName()

Returns the file name associated with this resource or None if the URL scheme does not have the concept. By default the file name is extracted from the last component of the path. Note the subtle difference between returning None and returning an empty string (indicating that the URI represents a directory-like object).
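
The distinction between None and an empty string can be sketched as follows. The helper get_file_name is a hypothetical stand-in built on the standard library, and treating every opaque URI as having no file name concept is an assumption made for this example.

```python
import posixpath
from urllib.parse import urlsplit


def get_file_name(uri):
    """Hypothetical helper: last path component, or None for opaque URIs."""
    parts = urlsplit(uri)
    if parts.scheme and not parts.netloc and not parts.path.startswith("/"):
        # Opaque URI such as mailto: (assumed to have no file name concept)
        return None
    # basename returns '' when the path ends in '/', i.e. directory-like
    return posixpath.basename(parts.path)


print(get_file_name("http://www.example.com/dir/file.txt"))  # file.txt
print(repr(get_file_name("http://www.example.com/dir/")))    # ''
print(get_file_name("mailto:user@example.com"))              # None
```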

GetCanonicalRoot()

Returns a new URI comprised of the scheme and authority only.

Only valid for absolute URIs.

Resolve(base, currentDocRef=None)

Resolves a (relative) URI relative to base returning a new URI instance

If the base URI is also relative then the result is a relative URI; otherwise the result is an absolute URI. The RFC does not describe a procedure for combining relative URIs, but if B is an absolute URI and R1 and R2 are relative URIs then, using the resolve operator:

U1 = B [*] R1
U2 = U1 [*] R2
U2 = ( B [*] R1 ) [*] R2

The last expression raises the question of associativity: is the following expression also valid?

U2 = B [*] ( R1 [*] R2 )

For this to work it must be possible to use the resolve operator to combine two relative URIs to make a third, which is what we allow here.
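
The associativity question can be checked for a simple case with urljoin from the standard library, used here as an assumed stand-in for the resolve operator. Note that urljoin silently discards any '..' segments it cannot consume when both arguments are relative, so this sketch only covers the case where R2 does not climb above R1. The URIs are placeholders.

```python
from urllib.parse import urljoin

B = "http://www.example.com/x/y/z.html"  # absolute base URI
R1 = "a/b/c.html"                        # relative URI
R2 = "../d.html"                         # relative URI

# U2 = ( B [*] R1 ) [*] R2
left = urljoin(urljoin(B, R1), R2)

# U2 = B [*] ( R1 [*] R2 ), which combines two relative URIs first
right = urljoin(B, urljoin(R1, R2))

print(left)
print(right)
```

Both groupings give http://www.example.com/x/y/a/d.html in this case.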

The optional currentDocRef allows you to handle the special case of resolving the empty URI. Strictly speaking, fragments are not part of the URI itself so a relative URI consisting of the empty string, or a relative URI consisting of just a fragment both refer to the current document. By default, currentDocRef is assumed to be the same as base but there are cases where the base URI is not the same as the URI used to originally retrieve the document and the optional parameter allows you to cope with those cases.
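
The role of currentDocRef can be sketched as below. The function resolve and its parameter names are hypothetical and built on the standard library; they only illustrate the special case described above and are not this method's implementation. The host names are placeholders.

```python
from urllib.parse import urldefrag, urljoin


def resolve(ref, base, current_doc_ref=None):
    """Hypothetical helper: resolve ref against base, with a separate
    URI for references to the current document."""
    if current_doc_ref is None:
        # By default the current document is assumed to be the base
        current_doc_ref = base
    if ref == "" or ref.startswith("#"):
        # Empty or fragment-only reference: refers to the current document
        return urldefrag(current_doc_ref).url + ref
    return urljoin(base, ref)


# A document retrieved from one URI but declaring a different base
doc = "http://www.example.com/doc.html"
base = "http://cdn.example.com/assets/"

print(resolve("img.png", base, doc))
print(resolve("#top", base, doc))
```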

Relative(base)

Evaluates the Relative operator, returning the URI expressed relative to base.

As we also allow the Resolve method for relative paths it makes sense for the Relative operator to also be defined:

R3 = R1 [*] R2
R3 [/] R1 = R2

Note that there are some restrictions. Given:

U = B [*] R

If R is absolute, or simply more specified than B on the following scale:

absolute URI > authority > absolute path > relative path

then U = R regardless of the value of B and therefore:

U [/] B = U if B is less specified than U.

Also note that if U is a relative URI then B cannot be absolute. In fact B must always be less specified than, or equally specified to, U because B is the base URI from which U has been derived.

U [/] B = undefined if B is more specified than U
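
The first rule, U = R whenever R is more specified than B, can be observed one step at a time along the scale. The sketch uses urljoin from the standard library as an assumed stand-in for the resolve operator, and the host names are placeholders.

```python
from urllib.parse import urljoin

# (B, R) pairs in which R is one step more specified than B
cases = [
    ("dir/file.txt", "/abs/z.txt"),                       # relative path < absolute path
    ("/dir/file.txt", "//host.example/z.txt"),            # absolute path < authority
    ("//host.example/z.txt", "http://www.example.com/a"), # authority < absolute URI
]

# U = B [*] R for each pair
results = [urljoin(b, r) for b, r in cases]

for (b, r), u in zip(cases, results):
    print(b, "[*]", r, "=", u)
```

In each case the result is R itself, so B contributes nothing and U [/] B = U.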

Therefore the only interesting cases are when B is equally specified to U. To give a concrete example:

U = /HD/User/setting.txt
B = /HD/folder/file.txt

/HD/User/setting.txt [/] /HD/folder/file.txt = ../User/setting.txt
/HD/User/setting.txt = /HD/folder/file.txt [*] ../User/setting.txt

And for relative paths:

U = User/setting.txt
B = User/folder/file.txt

User/setting.txt [/] User/folder/file.txt = ../setting.txt
User/setting.txt = User/folder/file.txt [*] ../setting.txt
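
Both worked examples can be reproduced with the standard library, using posixpath.relpath as an approximation of the relative operator and urljoin as an approximation of the resolve operator. The helper relative_path is a hypothetical name, and it handles plain paths only (no scheme, authority, query or fragment); it is not this method's implementation.

```python
import posixpath
from urllib.parse import urljoin


def relative_path(u, b):
    """Hypothetical helper: U [/] B for two equally specified paths."""
    # The last segment of b names a file, so measure from its directory
    return posixpath.relpath(u, start=posixpath.dirname(b))


pairs = [
    ("/HD/User/setting.txt", "/HD/folder/file.txt"),  # absolute paths
    ("User/setting.txt", "User/folder/file.txt"),     # relative paths
]

for u, b in pairs:
    r = relative_path(u, b)
    # Round trip: resolving r against b should give u back
    print(u, "[/]", b, "=", r, "| check:", urljoin(b, r) == u)
```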

canonicalize()

Returns a canonical form of this URI

match(otherURI)

Compares this URI against otherURI returning True if they match.

IsAbsolute()

Returns True if this URI is absolute, i.e., fully specified with a scheme name.

URI.__str__()

URIs are always returned as a string of bytes, not a unicode string.

The reason for this restriction is best illustrated with an example:

The URI %E8%8B%B1%E5%9B%BD.xml is a UTF-8 and URL-encoded path segment using the Chinese word for United Kingdom. When we remove the URL-encoding we get the string '\xe8\x8b\xb1\xe5\x9b\xbd.xml', which must be interpreted as UTF-8 to get the intended path segment value: u'\u82f1\u56fd.xml'. However, if the URI were held as a unicode string of characters then this second stage would not be carried out and the result would be the unicode string u'\xe8\x8b\xb1\xe5\x9b\xbd.xml', which begins with a meaningless string of 6 characters taken from the Latin-1 character set.
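
The two outcomes described above can be reproduced in Python 3 with unquote_to_bytes from the standard library. This sketch illustrates the encoding argument only; it does not use this module's classes.

```python
from urllib.parse import unquote_to_bytes

# Remove the URL-encoding to recover the raw octets
octets = unquote_to_bytes("%E8%8B%B1%E5%9B%BD.xml")
print(octets)  # b'\xe8\x8b\xb1\xe5\x9b\xbd.xml'

# Second stage: interpret the octets as UTF-8 to get the intended value
intended = octets.decode("utf-8")
print(ascii(intended))  # '\u82f1\u56fd.xml'

# Skipping that stage is equivalent to reading each octet as one
# Latin-1 character, which gives a different, meaningless string
wrong = octets.decode("latin-1")
print(ascii(wrong))  # '\xe8\x8b\xb1\xe5\x9b\xbd.xml'
```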