6.5. Uniform Resource Names (RFC2141)
This module defines functions and classes for working with URNs as defined by RFC 2141: http://www.ietf.org/rfc/rfc2141.txt
6.5.1. Creating URN Instances
URN instances are created automatically by the
from_octets() method and no special action
is required when parsing them from character strings.
If you are in a URN-specific context you may perform a looser parse of a
URN from a surrounding character stream using
parse_urn(), but the
result is a character string rather than a URN instance.
Finally, you can construct a URN directly from a namespace identifier and a namespace-specific string. The resulting object can then be converted to a well-formatted URN using string conversion, or used in any context where a URI instance is required.
parse_urn(src)
Parses a run of URN characters from a string.
- src: A character string containing URN characters. Binary strings are also accepted, provided they encode ASCII characters only.
Returns src up to, but not including, the first character that fails to match the production for URN char (as a character string).
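The looser parse described above can be sketched in a few lines of standalone Python. This is only an illustration and does not use the module itself: the name parse_urn_sketch is hypothetical, and the set of URN characters is assumed to be the letters, digits, "other" and "reserved" characters listed under Basic Syntax.

```python
import string

# Assumption: URN characters are letters, digits, the "other" characters
# and the "reserved" characters listed under Basic Syntax.
URN_CHARS = frozenset(
    string.ascii_letters + string.digits + "()+,-.:=@;$_!*'" + "%/?#")


def parse_urn_sketch(src):
    """Return the leading run of URN characters in src as a str.

    Hypothetical stand-in for the behaviour described for parse_urn().
    """
    if isinstance(src, bytes):
        # Binary input is accepted only if it encodes ASCII characters.
        src = src.decode('ascii')
    end = 0
    for c in src:
        if c not in URN_CHARS:
            break
        end += 1
    return src[:end]
```

For example, parse_urn_sketch("urn:isbn:0451450523 is a book") stops at the first space, so only the URN itself is returned.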
URN(octets=None, nid=None, nss=None)
Represents a URN
There are two forms of constructor. The first takes a single positional argument and matches the constructor for the base URI class. This enables URNs to be created automatically by from_octets().
- octets: A character string containing the URN
The second form of constructor allows you to construct a URN from a namespace identifier and a namespace-specific string; both values are required in this form of the constructor.
- nid: The namespace identifier, a string.
- nss: The namespace-specific string, encoded appropriately for inclusion in a URN.
ValueError is raised if the arguments are not passed correctly;
URIException is raised if there is a problem parsing or creating the URN itself.
nid: the namespace identifier for this URN
nss: the namespace-specific part of the URN
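The two constructor forms can be illustrated with a small standalone class. This is a sketch under stated assumptions, not the library's implementation: SimpleURN and URNSyntaxError are hypothetical names (the latter stands in for URIException), and the parsing is deliberately minimal.

```python
class URNSyntaxError(Exception):
    """Hypothetical stand-in for the library's URIException."""


class SimpleURN:
    """Hypothetical illustration of the two constructor forms."""

    def __init__(self, octets=None, nid=None, nss=None):
        if octets is not None and nid is None and nss is None:
            # First form: a single string containing the whole URN.
            parts = octets.split(':', 2)
            if len(parts) != 3 or parts[0].lower() != 'urn':
                raise URNSyntaxError("not a URN: %r" % octets)
            self.nid, self.nss = parts[1], parts[2]
        elif octets is None and nid is not None and nss is not None:
            # Second form: both nid and nss are required.
            self.nid, self.nss = nid, nss
        else:
            raise ValueError("pass either octets, or both nid and nss")
        if not self.nid or not self.nss:
            raise URNSyntaxError("empty namespace identifier or string")

    def __str__(self):
        # String conversion yields the well-formatted URN.
        return "urn:%s:%s" % (self.nid, self.nss)
```

With this sketch, str(SimpleURN(nid="isbn", nss="0451450523")) gives "urn:isbn:0451450523", while passing only nid raises ValueError.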
6.5.3. Translating to and from Text
Translates a source string into URN characters
- A binary or unicode string. In the latter case the string is encoded with utf-8 as part of being translated, in the former case it must be a valid UTF-8 string of bytes.
A function that tests if a character is reserved. It defaults to
is_reserved() but can be any function that takes a single argument and returns a boolean. You cannot use this function to prevent a character from being encoded (even if you pass lambda x: False), but you can add additional characters to the list of those that should be escaped. For example, to encode the ‘.’ character you could pass:
lambda x: x=='.'
The result is a URI-encoded string suitable for adding to the namespace-specific part of a URN.
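The escaping rule described above can be sketched as follows. The names to_urnchar_sketch and is_reserved_sketch are hypothetical, the character sets are taken from the Basic Syntax section, and the use of upper-case hex digits is an assumption rather than a statement about the library's output.

```python
import string

LETNUM = frozenset(string.ascii_letters + string.digits)
OTHER = frozenset("()+,-.:=@;$_!*'")
RESERVED = frozenset("%/?#")


def is_reserved_sketch(c):
    """Hypothetical default test: True for the reserved characters."""
    return c in RESERVED


def to_urnchar_sketch(src, reserved_test=is_reserved_sketch):
    """Percent-encode src for use in a namespace-specific string."""
    if isinstance(src, str):
        data = src.encode('utf-8')
    else:
        data = bytes(src)
        data.decode('utf-8')  # raises UnicodeDecodeError if not UTF-8
    result = []
    for b in data:
        c = chr(b)
        # Only letters, digits and "other" characters may pass through,
        # so reserved_test can add escapes but never remove them.
        if (c in LETNUM or c in OTHER) and not reserved_test(c):
            result.append(c)
        else:
            result.append("%%%02X" % b)  # assumption: upper-case hex
    return ''.join(result)
```

In this sketch, to_urnchar_sketch("v1.0", lambda x: x == '.') escapes the full stop as well as any reserved characters, and passing lambda x: False still leaves "%" escaped.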
Translates a URN string into an unencoded source string
The main purpose of this function is to remove %-encoding but it will also check for the illegal 0-byte and raise an error if one is encountered.
Returns a character string without %-escapes. As part of the conversion the implicit UTF-8 encoding is removed.
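The reverse translation can be sketched with the standard library alone. The name from_urnchar_sketch is hypothetical, and raising ValueError for the zero byte is an assumption; the text above only says that an error is raised.

```python
from urllib.parse import unquote_to_bytes


def from_urnchar_sketch(src):
    """Remove %-escapes from src and decode the implicit UTF-8."""
    data = unquote_to_bytes(src)
    if b'\x00' in data:
        # Assumption: ValueError stands in for whatever the library raises.
        raise ValueError("zero byte is not allowed in a URN")
    return data.decode('utf-8')
```

For example, from_urnchar_sketch("caf%C3%A9") returns the four-character string "café".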
6.5.4. Basic Syntax
The module defines a number of character classes (see
pyslet.unicode5.CharClass) to assist with the parsing of URNs.
The bound test method of each class is exposed for convenience (you don’t need to pass an instance). These pseudo-functions therefore all take a single character as an argument and return True if the character matches the class. They will also accept None and return False in that case.
Returns True if c matches upper
Returns True if c matches lower
Returns True if c matches number
Returns True if c matches letnum
Test a unicode character.
Returns True if the character is in the class.
If c is None, False is returned.
This function uses an internal cache to speed up tests of complex classes. Test results are cached in 256-character blocks. The cache does not require a lock to make this method thread-safe (a lock would have a significant performance penalty) as it uses a simple Python list. The worst-case race condition would result in two separate threads calculating the same block simultaneously and assigning it to the same slot in the cache, but Python’s list object is thread-safe under assignment (and the two calculated blocks will be identical) so this is not an issue.
Why does this matter? This function is called a lot, particularly when parsing XML. When parsing a tag the parser will repeatedly test each character to determine if it is a valid name character and the definition of name character is complex. Here are some illustrative figures calculated using cProfile for a typical 1MB XML file which calls test 142198 times: with no cache 0.42s spent in test, with the cache 0.11s spent.
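The block cache described above can be sketched as follows. This is an illustration of the idea only, not the CharClass implementation: the class name CachedClassSketch and the slow_test callable are hypothetical.

```python
class CachedClassSketch:
    """Hypothetical illustration of a lock-free, block-based test cache."""

    BLOCK = 256

    def __init__(self, slow_test):
        # slow_test is the expensive per-character membership test.
        self._slow_test = slow_test
        # One slot per 256-character block, filled in lazily.
        self._cache = [None] * (0x110000 // self.BLOCK)

    def test(self, c):
        if c is None:
            return False
        index, offset = divmod(ord(c), self.BLOCK)
        block = self._cache[index]
        if block is None:
            base = index * self.BLOCK
            block = [self._slow_test(chr(base + i))
                     for i in range(self.BLOCK)]
            # Two threads may both compute this block; the results are
            # identical, so a plain assignment is enough.
            self._cache[index] = block
        return block[offset]
```

The first test of any character in a block costs 256 calls to the slow test; every later test in the same block is a list lookup.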
Returns True if c matches letnumhyp
Returns True if c matches reserved
The reserved characters are:
"%" | "/" | "?" | "#"
Returns True if c matches other
The other characters are:
"(" | ")" | "+" | "," | "-" | "." | ":" | "=" | "@" | ";" | "$" | "_" | "!" | "*" | "'"
Returns True if c matches trans
Note that translated characters include reserved characters, even though they should normally be escaped (and in the case of ‘%’ MUST be escaped). The effect is that URNs consist of runs of characters that match the production for trans.
Returns True if c matches hex
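The character classes in this section can be summarised with a standalone sketch. The *_sketch names are hypothetical and do not use pyslet.unicode5.CharClass; the hex class is assumed to follow RFC 2141 (digits plus the letters A to F in either case).

```python
import string


def make_test(chars):
    """Build a test that accepts one character (or None)."""
    chars = frozenset(chars)

    def test(c):
        # None never matches, mirroring the behaviour described above.
        return c is not None and c in chars
    return test


UPPER = string.ascii_uppercase
LOWER = string.ascii_lowercase
NUMBER = string.digits
LETNUM = UPPER + LOWER + NUMBER
LETNUMHYP = LETNUM + "-"
RESERVED = "%/?#"
OTHER = "()+,-.:=@;$_!*'"
TRANS = LETNUM + OTHER + RESERVED   # trans includes reserved
HEX = NUMBER + "ABCDEFabcdef"       # assumption: as in RFC 2141

is_upper_sketch = make_test(UPPER)
is_lower_sketch = make_test(LOWER)
is_number_sketch = make_test(NUMBER)
is_letnum_sketch = make_test(LETNUM)
is_letnumhyp_sketch = make_test(LETNUMHYP)
is_reserved_sketch = make_test(RESERVED)
is_other_sketch = make_test(OTHER)
is_trans_sketch = make_test(TRANS)
is_hex_sketch = make_test(HEX)
```

Note that is_trans_sketch('%') is True even though '%' must be escaped when it appears as data, which matches the remark about trans above.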