6.9. Unicode Characters¶

6.9.1. Utility Functions¶

pyslet.unicode5.detect_encoding(magic)¶

Detects text encoding

magic: A string of bytes

Given a byte string this function looks at (up to) four bytes and returns a best guess at the unicode encoding being used for the data.

It returns a string suitable for passing to Python’s native decode method, e.g., 'utf_8'. The default is 'utf_8', an encoding which will also work if the data is plain ASCII.
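As a rough illustration of this kind of detection, here is a minimal sketch that uses only the byte order mark constants from the standard library. The function name guess_encoding is hypothetical and this is not pyslet's own code; detect_encoding may apply further heuristics that are not shown here.

```python
import codecs


def guess_encoding(magic):
    """Hypothetical helper: guess an encoding from up to four leading bytes."""
    head = bytes(magic[:4])
    # The UTF-32 marks are tested first because the little-endian UTF-32
    # mark begins with the same two bytes as the little-endian UTF-16 mark.
    if head.startswith(codecs.BOM_UTF32_LE) or \
            head.startswith(codecs.BOM_UTF32_BE):
        return "utf_32"
    if head.startswith(codecs.BOM_UTF8):
        return "utf_8_sig"
    if head.startswith(codecs.BOM_UTF16_LE) or \
            head.startswith(codecs.BOM_UTF16_BE):
        return "utf_16"
    # No mark found: fall back to UTF-8, which also covers plain ASCII.
    return "utf_8"


data = u"h\u00e9llo".encode("utf_16")
text = data.decode(guess_encoding(data))
```

The names returned here are codecs that consume the byte order mark themselves, so the decoded text does not start with a stray U+FEFF character.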

6.9.2. Character Classes¶

class pyslet.unicode5.CharClass(*args)¶

Bases: pyslet.py2.UnicodeMixin

Represents a class of unicode characters.

A class of characters is represented internally by a list of character ranges that define the class. This is efficient because most character classes are defined in blocks of characters.

For the constructor, multiple arguments can be provided.

String arguments add all characters in the string to the class. For example, CharClass('abcxyz') creates a class comprising two ranges: a-c and x-z.

Tuple/List arguments can be used to pass pairs of characters that define a range. For example, CharClass(('a','z')) creates a class comprising the letters a-z.

Instances of CharClass can also be used in the constructor to add an existing class.

Instances support Python’s repr function:

>>> c = CharClass('abcxyz')
>>> print repr(c)
CharClass((u'a',u'c'), (u'x',u'z'))

The string representation of a CharClass is a python regular expression suitable for matching a single character from the CharClass:

>>> print str(c)
[a-cx-z]
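The range-list representation and its regular expression form can be sketched without pyslet. The helpers to_ranges and ranges_to_regex below are hypothetical names introduced for illustration; they are not part of the CharClass API and make no claim about its internals.

```python
import re


def to_ranges(chars):
    """Collapse a string of characters into sorted (first, last) ranges."""
    ranges = []
    for code in sorted(set(ord(c) for c in chars)):
        if ranges and code == ranges[-1][1] + 1:
            # Adjacent to the previous range: extend it by one character.
            ranges[-1][1] = code
        else:
            ranges.append([code, code])
    return [(chr(a), chr(z)) for a, z in ranges]


def ranges_to_regex(ranges):
    """Format ranges as a regular expression matching one character."""
    parts = []
    for a, z in ranges:
        if a == z:
            parts.append(re.escape(a))
        else:
            parts.append(re.escape(a) + "-" + re.escape(z))
    return "[" + "".join(parts) + "]"


ranges = to_ranges("abcxyz")
pattern = ranges_to_regex(ranges)
```

Six characters collapse into two ranges, which is why the representation stays compact for classes defined in blocks.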

classmethod ucd_category(category)¶

Returns the character class representing the Unicode category.

You must not modify the returned instance. If you want to derive a character class from one of the standard Unicode categories then you should create a copy by passing the result of this class method to the CharClass constructor, e.g. to create a class of all general controls and the space character:

c=CharClass(CharClass.ucd_category(u"Cc"))
c.add_char(u" ")

classmethod ucd_block(block_name)¶

Returns the character class representing the Unicode block.

You must not modify the returned instance. If you want to derive a character class from one of the standard Unicode blocks then you should create a copy by passing the result of this class method to the CharClass constructor, e.g. to create a class combining all Basic Latin characters and those in the Latin-1 Supplement:

c=CharClass(CharClass.ucd_block(u"Basic Latin"))
c.add_class(CharClass.ucd_block(u"Latin-1 Supplement"))

format_re()¶: Create a representation of the class suitable for putting in [] in a python regular expression

add_range(a, z)¶: Adds a range of characters from a to z to the class

subtract_range(a, z)¶: Subtracts a range of characters from the character class

add_char(c)¶: Adds a single character to the character class

subtract_char(c)¶: Subtracts a single character from the character class

add_class(c)¶

Adds all the characters in c to the character class

This is effectively a union operation.

subtract_class(c)¶: Subtracts all the characters in c from the character class

negate()¶

Negates this character class

As a convenience returns the object as the result enabling this method to be used in construction, e.g.:

c = CharClass('\r\n').negate()

Results in the class of all characters except line feed and carriage return.
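For a class held as a sorted list of ranges, negation amounts to collecting the gaps between the ranges. The following is a minimal sketch under that assumption; negate_ranges is a hypothetical helper working on code point pairs, not the method used by CharClass.

```python
MAX_CODE = 0x10FFFF  # highest Unicode code point


def negate_ranges(ranges):
    """Complement sorted, non-overlapping (low, high) code point ranges."""
    result = []
    next_start = 0
    for low, high in ranges:
        if low > next_start:
            # Everything between the previous range and this one.
            result.append((next_start, low - 1))
        next_start = high + 1
    if next_start <= MAX_CODE:
        result.append((next_start, MAX_CODE))
    return result


# All code points except line feed (0x0A) and carriage return (0x0D).
not_newline = negate_ranges([(0x0A, 0x0A), (0x0D, 0x0D)])
```

Negating twice returns the original list, which is a convenient sanity check for this kind of routine.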

test(c)¶

Test a unicode character.

Returns True if the character is in the class.

If c is None, False is returned.

This function uses an internal cache to speed up tests of complex classes. Test results are cached in blocks of 256 characters. The cache is a simple python list, so no lock is required to make this method thread-safe (a lock would have a significant performance penalty). The worst case race condition would result in two separate threads calculating the same block simultaneously and assigning it to the same slot in the cache. Python’s list object is thread-safe under assignment, and the two calculated blocks will be identical, so this is not an issue.

Why does this matter? This function is called a lot, particularly when parsing XML. When parsing a tag the parser will repeatedly test each character to determine if it is a valid name character and the definition of name character is complex. Here are some illustrative figures calculated using cProfile for a typical 1MB XML file which calls test 142198 times: with no cache 0.42s spent in test, with the cache 0.11s spent.
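The block cache described above can be sketched as follows. The class BlockCache and the counting predicate slow_is_letter are hypothetical names for illustration only; the real cache lives inside CharClass and may differ in detail.

```python
class BlockCache(object):
    """Caches the results of a slow predicate in 256-character blocks."""

    def __init__(self, slow_test):
        self.slow_test = slow_test
        # One slot per block of 256 code points, filled on first use.
        self.blocks = [None] * (0x110000 // 256)

    def test(self, c):
        if c is None:
            return False
        code = ord(c)
        index = code >> 8
        block = self.blocks[index]
        if block is None:
            base = index << 8
            block = [self.slow_test(chr(base + i)) for i in range(256)]
            # A single list assignment: if two threads race here they
            # store identical blocks, so no lock is used.
            self.blocks[index] = block
        return block[code & 0xFF]


calls = []


def slow_is_letter(c):
    calls.append(c)  # record each call to the slow predicate
    return c.isalpha()


cache = BlockCache(slow_is_letter)
first = cache.test("a")   # fills block 0 with 256 slow calls
second = cache.test("Z")  # answered from the cached block
```

After the first lookup in a block, later tests in the same block cost one list index and no further calls to the slow predicate.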

6.9.3. Parsing Text and Binary Data¶

class pyslet.unicode5.BasicParser(source)¶

Bases: pyslet.unicode5.ParserMixin, pyslet.pep8.PEP8Compatibility

A base class for parsing character strings or binary data

source: Can be either a string of characters or a string of bytes.

BasicParser instances can parse either characters or bytes but not both simultaneously. You must choose on construction by passing an appropriate str (Python 2: unicode), bytes or bytearray object.

Binary mode is suitable for parsing data described in terms of OCTETS, such as many IETF and internet standards. When passing string literals to parsing methods in binary mode use the binary string literal form:

parser.match(b':')

Methods that return the parsed data in its original form will also return bytes objects in binary mode.

Methods are named according to the type of operation they perform.

match_*

Returns a boolean True or False depending on whether or not a syntax production is matched at the current location. The state of the parser is unchanged. This type of method is only used for very simple productions, e.g., match_digit().

parse_*

Attempts to parse a syntax element returning an appropriate object as the result or None if the production is not present. The position of the parser is only changed if the element was parsed successfully. This type of method is intended for fairly simple productions, e.g., parse_integer(). More complex productions are implemented using require_* methods but the general parse_production() can be used to enable more complex look-ahead scenarios.

require_*

Parses a syntax production, returning an appropriate object as the result. If the production is not matched a ParserError is raised.

On success, the position of the parser points to the first character after the parsed production ready to continue parsing. On failure, the parser is positioned at the point at which the exception was raised.

When deriving your own sub-classes you will normally use the require_* pattern to extend the parser.
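The three naming patterns can be illustrated with a self-contained toy parser. TinyParser and TinyParserError are hypothetical stand-ins written for this sketch; they mimic the conventions described above but are not BasicParser or ParserError.

```python
class TinyParserError(ValueError):
    pass


class TinyParser(object):

    def __init__(self, src):
        self.src = src
        self.pos = 0

    def match_digit(self):
        # match_*: test only, the parser position never changes
        return (self.pos < len(self.src) and
                self.src[self.pos] in "0123456789")

    def parse_integer(self):
        # parse_*: return a value and advance, or return None and stay put
        start = self.pos
        while self.match_digit():
            self.pos += 1
        if self.pos == start:
            return None
        return int(self.src[start:self.pos])

    def require_integer(self):
        # require_*: same result as parse_*, but failure raises an error
        value = self.parse_integer()
        if value is None:
            raise TinyParserError("expected integer at %i" % self.pos)
        return value


p = TinyParser("42x")
matched = p.match_digit()        # position is still 0
value = p.require_integer()      # position is now 2
missing = p.parse_integer()      # 'x' is not a digit, position stays at 2
```

Building require_integer on top of parse_integer keeps the failure handling in one place, which is one reasonable way to layer the patterns.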

Compatibility note: if you are attempting to use the same source for both Python 2 and 3 then you may not be able to rely on the parser mode:

>>> from pyslet.unicode5 import BasicParser
>>> p = BasicParser("hello")
>>> p.raw

The above interpreter session will print True in Python 2 and False in Python 3. This is just another manifestation of the changes to string handling between the two releases. If you are dealing with ASCII data you can ignore the issue, otherwise you should consider using one of the various techniques for forcing strings to be interpreted as unicode when running in Python 2. The most important thing is consistency between the type of object you pass to the constructor and those that you pass to the various parsing methods. You may find the pyslet.py2.ul() and/or pyslet.py2.u8() functions useful for forcing text mode.

raw = None¶: True if parser is working in binary mode.

src = None¶: the string being parsed

pos = None¶: the position of the current character

the_char = None¶

The current character or None if the parser is positioned outside the src string.

In binary mode this will be a byte, which is an integer in Python 3 but a character in Python 2. In text mode it is a (unicode) character.

setpos(new_pos)¶

Sets the position of the parser to new_pos

Useful for saving the parser state and returning later:

save_pos = parser.pos
#
# do some look-ahead parsing
#
parser.setpos(save_pos)

next_char()¶

Points the parser at the next character.

Updates pos and the_char.

parser_error(production=None)¶

Raises an error encountered by the parser

See ParserError for details.

If production is None then the previous error is re-raised. If multiple errors have been raised previously the one with the most advanced parser position is used. This is useful in situations where there are multiple alternative productions, none of which can be successfully parsed. It allows parser methods to catch the exception from the last possible choice and raise an error relating to the closest previous match. For example:

def require_abc(self):
    result = self.parse_production(self.require_a)
    if result is None:
        result = self.parse_production(self.require_b)
    if result is None:
        result = self.parse_production(self.require_c)
    if result is None:
        # will raise the most advanced error raised during
        # the three previous methods
        self.parser_error()
    else:
        return result

See parse_production() for more details on this pattern.

The position of the parser is always set to the position of the error raised.
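To show why re-raising the most advanced error is useful, here is a self-contained sketch of the idea. AlternativesParser, SketchError, attempt and require_animal are all hypothetical names; the bookkeeping shown is one plausible approach and is not taken from pyslet's source.

```python
class SketchError(ValueError):

    def __init__(self, production, pos):
        ValueError.__init__(self, "expected %s at %i" % (production, pos))
        self.production = production
        self.pos = pos


class AlternativesParser(object):

    def __init__(self, src):
        self.src = src
        self.pos = 0
        self.furthest = None  # most advanced error seen so far

    def error(self, production=None):
        if production is not None:
            err = SketchError(production, self.pos)
            if self.furthest is None or err.pos >= self.furthest.pos:
                self.furthest = err
            raise err
        if self.furthest is None:
            raise SketchError("input", self.pos)
        # No production given: re-raise the most advanced earlier error
        # and leave the parser positioned where that error occurred.
        self.pos = self.furthest.pos
        raise self.furthest

    def require(self, word):
        for ch in word:
            if self.src[self.pos:self.pos + 1] != ch:
                self.error(word)
            self.pos += 1
        return word

    def attempt(self, method, *args):
        # Try a require-style method, restoring the position on failure.
        save_pos = self.pos
        try:
            return method(*args)
        except SketchError:
            self.pos = save_pos
            return None

    def require_animal(self):
        for word in ("cat", "cow", "dog"):
            result = self.attempt(self.require, word)
            if result is not None:
                return result
        self.error()


p = AlternativesParser("cox")
try:
    p.require_animal()
    failure = None
except SketchError as err:
    failure = err
```

For the input "cox" the reported error concerns "cow" at position 2, the alternative that got furthest, rather than "dog" at position 0, which was merely the last one tried.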

peek(nchars)¶

Returns the next nchars characters or bytes.

If there are fewer than nchars remaining then a shorter string is returned.

match_end()¶: True if all of src has been parsed

match(match_string)¶: Returns true if match_string is at the current position

parse(match_string)¶

Parses match_string

Returns match_string or None if it cannot be parsed.

require(match_string, production=None)¶

Parses and requires match_string

match_string: The string to be parsed
production: Optional name of production, defaults to match_string itself.

For consistency, returns match_string on success.

match_insensitive(lower_string)¶

Returns true if lower_string is matched (ignoring case).

lower_string must already be a lower-cased string.

parse_insensitive(lower_string)¶

Parses lower_string ignoring case in the source.

lower_string: Must be a lower-cased string

Advances the parser to the first character after lower_string. Returns the matched string which may differ in case from lower_string.
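The behaviour described for parse_insensitive can be sketched as a standalone function. The function below is hypothetical: it takes the source and position explicitly and returns the new position alongside the result, whereas the real method works on the parser's own state. It also assumes that lower-casing does not change the length of the text, which holds for ASCII.

```python
def parse_insensitive(src, pos, lower_string):
    """Return (matched_text, new_pos), or (None, pos) if there is no match."""
    end = pos + len(lower_string)
    candidate = src[pos:end]
    if candidate.lower() == lower_string:
        # Return the text as it appears in the source, case preserved.
        return candidate, end
    return None, pos


header = "Content-Type: text/plain"
name, new_pos = parse_insensitive(header, 0, "content-type")
```

Only the source text is lower-cased during the comparison, which is why the caller must pass an argument that is already lower-cased.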

parse_until(match_string)¶

Parses up to but not including match_string.

Advances the parser to the first character of match_string. If match_string is not found (or is None) then all the remaining characters in the source are parsed.

Returns the parsed text, even if empty. Never returns None.
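A minimal sketch of the parse_until behaviour, written as a hypothetical standalone function that takes the source and position explicitly and returns the new position with the parsed text:

```python
def parse_until(src, pos, match_string):
    """Return (parsed_text, new_pos), stopping before match_string."""
    end = -1
    if match_string is not None:
        end = src.find(match_string, pos)
    if end < 0:
        # Not found (or None): consume the rest of the source.
        end = len(src)
    return src[pos:end], end


key, pos = parse_until("key=value", 0, "=")
rest, end_pos = parse_until("key=value", pos + 1, ";")
```

When the source already starts with match_string at the current position the result is an empty string rather than None, matching the description above.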

match_one(match_chars)¶

Returns true if one of match_chars is at the current position.

The ‘in’ operator is used to test match_chars, so this can be a list or tuple of characters (or bytes); it does not have to be a string.

parse_one(match_chars)¶

Parses one of match_chars.

match_chars: A string (list or tuple) of characters or bytes

Returns the character (or byte) or None if no match is found.

Warning: in binary mode, this method will return a single byte value, the type of which differs between Python 2 and Python 3. In Python 3, bytes are integers; in Python 2 they are binary strings of length 1. You can use the function py2.byte() to help ensure your source works on both platforms, for example:

from pyslet.py2 import byte
c = parser.parse_one(b"+-")
if c == byte(b"+"):
    # do plus thing...
    pass
elif c is not None:
    # must be minus...
    pass
else:
    # do something else...
    pass
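The difference in byte types can be seen without pyslet. The helper is_plus below is a hypothetical function for illustration; it accepts either representation of a single byte and uses only built-in operations.

```python
data = b"+-"

# Indexing a bytes object gives an int in Python 3 (a length-1 string in
# Python 2); slicing gives a bytes object in both versions.
first = data[0]
first_slice = data[0:1]


def is_plus(byte_value):
    """True if byte_value is '+' in either representation."""
    if isinstance(byte_value, int):
        return byte_value == 0x2B
    return byte_value == b"+"


plus_results = [is_plus(b) for b in data]
```

Comparing against a slice, or normalising through a helper like this one, avoids the silent mismatch where an integer is compared with a bytes literal and the test is always false.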

match_digit()¶

Returns true if the current character is a digit

Only ASCII digits are considered, in binary mode byte values 0x30 to 0x39 are matched.

parse_digit()¶

Parses a digit character.

Returns the digit character/byte, or None if no digit is found. Like match_digit() only ASCII digits are parsed.

parse_digit_value()¶

Parses a single digit value.

Returns the digit value, or None if no digit is found. Like match_digit() only ASCII digits are parsed.

parse_digits(min, max=None)¶

Parses a string of digits

min: The minimum number of digits to parse. There is a special case where min=0; in this case an empty string may be returned.
max (default None): The maximum number of digits to parse, or None if there is no maximum.

Returns the string of digits or None if no digits can be parsed. Like parse_digit(), only ASCII digits are considered.

parse_integer(min=None, max=None, max_digits=None)¶

Parses an integer (or long).

min (optional, defaults to None): A lower bound on the acceptable integer value, the result will always be >= min on success
max (optional, defaults to None): An upper bound on the acceptable integer value, the result will always be <= max on success
max_digits (optional, defaults to None): The limit on the number of digits, i.e., the field width.

If a suitable integer can’t be parsed then None is returned. This method only processes ASCII digits.

Warning: in Python 2 the result may be of type long.
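The interaction between the bounds and the field width can be sketched as a standalone function. This is an assumption-laden illustration: the parameters are renamed min_value and max_value to avoid shadowing Python's built-ins, the source and position are passed explicitly, and an out-of-range value is treated as a failure that leaves the position unchanged.

```python
def parse_integer(src, pos, min_value=None, max_value=None,
                  max_digits=None):
    """Return (value, new_pos), or (None, pos) if no suitable integer."""
    end = pos
    # Only ASCII digits count; str.isdigit would accept other scripts too.
    while end < len(src) and src[end] in "0123456789":
        if max_digits is not None and end - pos >= max_digits:
            break
        end += 1
    if end == pos:
        return None, pos
    value = int(src[pos:end])
    if min_value is not None and value < min_value:
        return None, pos
    if max_value is not None and value > max_value:
        return None, pos
    return value, end


year, pos = parse_integer("2024-01", 0)
hour, hour_pos = parse_integer("0930", 0, max_value=23, max_digits=2)
```

With max_digits=2 the second call reads only "09", which shows how a field width lets fixed-width values be split without separators.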

match_hex_digit()¶

Returns true if the current character is a hex-digit

Only ASCII digits are considered, letters can be either upper or lower case. In binary mode byte values 0x30 to 0x39, 0x41-0x46 and 0x61-0x66 are matched.
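The byte ranges listed above translate directly into a test on integer byte values. The function is_hex_byte is a hypothetical helper and assumes Python 3, where iterating over a bytes object yields integers.

```python
def is_hex_byte(value):
    """True if an integer byte value is an ASCII hex-digit."""
    return (0x30 <= value <= 0x39 or   # 0-9
            0x41 <= value <= 0x46 or   # A-F
            0x61 <= value <= 0x66)     # a-f


hex_bytes = bytes(b for b in range(256) if is_hex_byte(b))
```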

parse_hex_digit()¶

Parses a hex-digit.

Returns the digit, or None if no digit is found. See match_hex_digit() for which characters/bytes are considered hex-digits.

parse_hex_digits(min, max=None)¶

Parses a string of hex-digits

min: The minimum number of hex-digits to parse. There is a special case where min=0; in this case an empty string may be returned.
max (default None): The maximum number of hex-digits to parse, or None if there is no maximum.

Returns the string of hex-digits or None if no digits can be parsed. See match_hex_digit() for which characters/bytes are considered hex-digits.

class pyslet.unicode5.ParserError(production, parser=None)¶

Bases: exceptions.ValueError

Exception raised by BasicParser

production: The name of the production being parsed
parser: The BasicParser instance raising the error (optional)

ParserError is a subclass of ValueError.

pos = None¶: the position of the parser when the error was raised

left = None¶: up to 40 characters/bytes to the left of pos

right = None¶: up to 40 characters/bytes to the right of pos
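The left and right attributes give the context around the error position. A minimal sketch of how such context might be computed follows; error_context is a hypothetical helper, and it assumes that right starts at pos itself, which the description above does not state explicitly.

```python
def error_context(src, pos, width=40):
    """Return (left, right): up to width items on each side of pos."""
    left = src[max(0, pos - width):pos]
    right = src[pos:pos + width]
    return left, right


left, right = error_context("abcdef", 3, width=2)
```

Because slicing is used, the same code works for character strings and for bytes objects, and the results are simply shorter near the start or end of the source.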