6.10. Unicode Characters

6.10.1. Utility Functions

pyslet.unicode5.detect_encoding(magic)

Detects text encoding

magic
A string of bytes

Given a byte string this function looks at (up to) four bytes and returns a best guess at the unicode encoding being used for the data.

It returns a string suitable for passing to Python’s native decode method, e.g., ‘utf-8’. The default is ‘utf-8’, an encoding which will also work if the data is plain ASCII.
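As a rough illustration of this kind of detection, the sketch below guesses an encoding from byte-order marks alone. It is not pyslet's implementation: the name guess_encoding, the BOM table and the returned codec names are assumptions made for the example, and only the standard library codecs module is used.

```python
import codecs

# Hypothetical helper, not pyslet's implementation: it checks byte-order
# marks only and falls back to UTF-8 when none is present.
_BOMS = (
    (codecs.BOM_UTF32_LE, "utf-32"),  # must precede UTF-16 LE (shared prefix)
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
)


def guess_encoding(magic):
    """Return a codec name based on the first (up to) four bytes of magic."""
    head = bytes(magic[:4])
    for bom, name in _BOMS:
        if head.startswith(bom):
            return name
    return "utf-8"  # also correct for plain ASCII


data = "h\u00e9llo".encode("utf-16")  # Python prepends a BOM for 'utf-16'
text = data.decode(guess_encoding(data))
```

The result is passed straight to decode, which mirrors the intended use of the return value described above.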

6.10.2. Character Classes

class pyslet.unicode5.CharClass(*args)

Bases: pyslet.py2.UnicodeMixin

Represents a class of unicode characters.

A class of characters is represented internally by a list of character ranges that define the class. This is efficient because most character classes are defined in blocks of characters.
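The range-list idea can be sketched in a few lines. This is only an illustration of the representation, not the CharClass code itself; to_ranges and in_ranges are hypothetical names introduced for the example.

```python
def to_ranges(chars):
    """Collapse an iterable of characters into sorted (first, last) ranges."""
    ranges = []
    for c in sorted(set(chars)):
        if ranges and ord(c) == ord(ranges[-1][1]) + 1:
            ranges[-1][1] = c       # extend the current range by one
        else:
            ranges.append([c, c])   # start a new single-character range
    return [tuple(r) for r in ranges]


def in_ranges(ranges, c):
    """True if character c falls inside any (first, last) range."""
    return any(a <= c <= z for a, z in ranges)


ranges = to_ranges("abcxyz")
print(ranges)  # [('a', 'c'), ('x', 'z')]
```

Six characters collapse to two ranges here, which is why the representation stays compact for classes defined as blocks.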

For the constructor, multiple arguments can be provided.

String arguments add all characters in the string to the class. For example, CharClass('abcxyz') creates a class comprising two ranges: a-c and x-z.

Tuple/List arguments can be used to pass pairs of characters that define a range. For example, CharClass(('a','z')) creates a class comprising the letters a-z.

Instances of CharClass can also be used in the constructor to add an existing class.

Instances support Python’s repr function:

>>> c = CharClass('abcxyz')
>>> print repr(c)
CharClass((u'a',u'c'), (u'x',u'z'))

The string representation of a CharClass is a python regular expression suitable for matching a single character from the CharClass:

>>> print str(c)
[a-cx-z]

classmethod ucd_category(category)

Returns the character class representing the Unicode category.

You must not modify the returned instance. If you want to derive a character class from one of the standard Unicode categories, create a copy by passing the result of this class method to the CharClass constructor. For example, to create a class of all general controls and the space character:

c=CharClass(CharClass.ucd_category(u"Cc"))
c.add_char(u" ")

classmethod ucd_block(block_name)

Returns the character class representing the Unicode block.

You must not modify the returned instance. If you want to derive a character class from one of the standard Unicode blocks, create a copy by passing the result of this class method to the CharClass constructor. For example, to create a class combining all Basic Latin characters and those in the Latin-1 Supplement:

c=CharClass(CharClass.ucd_block(u"Basic Latin"))
c.add_class(CharClass.ucd_block(u"Latin-1 Supplement"))

format_re()

Create a representation of the class suitable for putting in [] in a python regular expression

add_range(a, z)

Adds a range of characters from a to z to the class

subtract_range(a, z)

Subtracts a range of characters from the character class

add_char(c)

Adds a single character to the character class

subtract_char(c)

Subtracts a single character from the character class

add_class(c)

Adds all the characters in c to the character class

This is effectively a union operation.

subtract_class(c)

Subtracts all the characters in c from the character class

negate()

Negates this character class

test(c)

Test a unicode character.

Returns True if the character is in the class.

If c is None, False is returned.

6.10.3. Parsing Text and Binary Data

class pyslet.unicode5.BasicParser(source)

Bases: pyslet.pep8.PEP8Compatibility

An abstract class for parsing character strings or binary data

source
Can be either a string of characters or a string of bytes.

BasicParser instances can parse either characters or bytes but not both simultaneously. You must choose on construction by passing an appropriate str (Python 2: unicode), bytes or bytearray object.

Binary mode is suitable for parsing data described in terms of OCTETS, such as many IETF and internet standards. When passing string literals to parsing methods in binary mode use the binary string literal form:

parser.match(b':')

Methods that return the parsed data in its original form will also return bytes objects in binary mode.

Methods are named according to the type of operation they perform.

match_*
Returns a boolean True or False depending on whether or not a syntax production is matched at the current location. The state of the parser is unchanged. This type of method is only used for very simple productions, e.g., match_digit().
parse_*
Attempts to parse a syntax element returning an appropriate object as the result or None if the production is not present. The position of the parser is only changed if the element was parsed successfully. This type of method is intended for fairly simple productions, e.g., parse_integer(). More complex productions are implemented using require_* methods but the general parse_production() can be used to enable more complex look-ahead scenarios.
require_*

Parses a syntax production, returning an appropriate object as the result. If the production is not matched a ParserError is raised.

On success, the position of the parser points to the first character after the parsed production ready to continue parsing. On failure, the parser is positioned at the point at which the exception was raised.

When deriving your own sub-classes you will normally use the require_* pattern to extend the parser.
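The three naming patterns can be sketched with a toy text-mode parser. TinyParser is a hypothetical stand-in written for this example, not BasicParser itself, and it raises a plain ValueError where the real class would raise ParserError.

```python
class TinyParser:
    """Hypothetical stand-in for a text-mode parser; not pyslet's class."""

    def __init__(self, src):
        self.src = src
        self.pos = 0

    def match_digit(self):
        # match_*: a pure test, the position never changes
        return self.pos < len(self.src) and self.src[self.pos] in "0123456789"

    def parse_digit(self):
        # parse_*: advance only on success, otherwise return None
        if self.match_digit():
            c = self.src[self.pos]
            self.pos += 1
            return c
        return None

    def require_digit(self):
        # require_*: return the result or raise an error
        c = self.parse_digit()
        if c is None:
            raise ValueError("Expected digit at [%i]" % self.pos)
        return c


p = TinyParser("7x")
print(p.match_digit(), p.pos)  # True 0
print(p.parse_digit(), p.pos)  # 7 1
print(p.parse_digit(), p.pos)  # None 1
```

Calling p.require_digit() at this point would raise, because the current character is "x".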

Compatibility note: if you are attempting to use the same source for both Python 2 and 3 then you may not be able to rely on the parser mode:

>>> from pyslet.unicode5 import BasicParser
>>> p = BasicParser("hello")
>>> p.raw

The above interpreter session will print True in Python 2 and False in Python 3. This is just another manifestation of the changes to string handling between the two releases. If you are dealing with ASCII data you can ignore the issue, otherwise you should consider using one of the various techniques for forcing strings to be interpreted as unicode when running in Python 2. The most important thing is consistency between the type of object you pass to the constructor and those that you pass to the various parsing methods. You may find the pyslet.py2.ul() and/or pyslet.py2.u8() functions useful for forcing text mode.

raw = None

True if parser is working in binary mode.

src = None

the string being parsed

pos = None

the position of the current character

the_char = None

The current character or None if the parser is positioned outside the src string.

In binary mode this will be a byte, which is an integer in Python 3 but a character in Python 2. In text mode it is a (unicode) character.

setpos(new_pos)

Sets the position of the parser to new_pos

Useful for saving the parser state and returning later:

save_pos = parser.pos
#
# do some look-ahead parsing
#
parser.setpos(save_pos)

next_char()

Points the parser at the next character.

Updates pos and the_char.

parser_error(production=None)

Raises an error encountered by the parser

See ParserError for details.

If production is None then the previous error is re-raised. If multiple errors have been raised previously the one with the most advanced parser position is used. This is useful in situations where there are multiple alternative productions, none of which can be successfully parsed. It allows parser methods to catch the exception from the last possible choice and raise an error relating to the closest previous match. For example:

def require_abc(self):
    result = self.parse_production(self.require_a)
    if result is None:
        result = self.parse_production(self.require_b)
    if result is None:
        result = self.parse_production(self.require_c)
    if result is None:
        # will raise the most advanced error raised during
        # the three previous methods
        self.parser_error()
    else:
        return result

See parse_production() for more details on this pattern.

The position of the parser is always set to the position of the error raised.

require_production(result, production=None)

Returns result if not None or raises ParserError.

result
The result of a parse_* type method.
production
Optional string used to customise the error message.

This method is intended to be used as a conversion function allowing any parse_* method to be converted into a require_* method. E.g.:

p = BasicParser("hello")
num = p.require_production(p.parse_integer(), "Number")

ParserError: Expected Number at [0]

require_production_end(result, production=None)

Returns result if not None and parsing is complete.

This method is similar to require_production() except that it enforces the constraint that the entire source must have been parsed. Essentially, it just calls require_end() before returning result.

parse_production(require_method, *args, **kwargs)

Executes the bound method require_method.

require_method
A bound method that will be called with *args
args
The positional arguments to pass to require_method
kwargs
The keyword arguments to pass to require_method

This method is intended to be used as a conversion function allowing any require_* method to be converted into a parse_* method for the purposes of look-ahead.

If successful the result of the method is returned. If any ValueError (including ParserError) is raised, the exception is caught, the parser rewound and None is returned.
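The catch-and-rewind behaviour can be sketched as follows. ToyParser and its require_ab method are hypothetical and exist only for this example; the real method also records errors for later use by parser_error(), which is omitted here.

```python
class ToyParser:
    """Hypothetical text-mode parser used only to illustrate look-ahead."""

    def __init__(self, src):
        self.src = src
        self.pos = 0

    def require(self, s):
        if not self.src.startswith(s, self.pos):
            raise ValueError("Expected %s at [%i]" % (s, self.pos))
        self.pos += len(s)
        return s

    def require_ab(self):
        # consumes "a" before it can fail on "b"
        return self.require("a") + self.require("b")

    def parse_production(self, require_method, *args, **kwargs):
        save_pos = self.pos
        try:
            return require_method(*args, **kwargs)
        except ValueError:
            self.pos = save_pos  # rewind so the failed attempt leaves no trace
            return None


p = ToyParser("ac")
print(p.parse_production(p.require_ab), p.pos)    # None 0
print(p.parse_production(p.require, "a"), p.pos)  # a 1
```

The first call fails part-way through the production, yet the position is back at 0 afterwards, which is what makes the pattern safe for look-ahead.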

peek(nchars)

Returns the next nchars characters or bytes.

If there are fewer than nchars remaining then a shorter string is returned.

match_end()

True if all of src has been parsed

require_end(production='end')

Tests that all of src has been parsed

There is no return result.

match(match_string)

Returns true if match_string is at the current position

parse(match_string)

Parses match_string

Returns match_string or None if it cannot be parsed.

require(match_string, production=None)

Parses and requires match_string

match_string
The string to be parsed
production
Optional name of production, defaults to match_string itself.

For consistency, returns match_string on success.

match_insensitive(lower_string)

Returns true if lower_string is matched (ignoring case).

lower_string must already be a lower-cased string.

parse_insensitive(lower_string)

Parses lower_string ignoring case in the source.

lower_string
Must be a lower-cased string

Advances the parser to the first character after lower_string. Returns the matched string which may differ in case from lower_string.
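A minimal sketch of case-insensitive parsing over a plain string follows. The free function below is hypothetical: it takes the source and position as arguments instead of using parser state, and returning None on failure is an assumption based on the general parse_* convention.

```python
def parse_insensitive(src, pos, lower_string):
    """Return (matched_text, new_pos), or (None, pos) if there is no match."""
    chunk = src[pos:pos + len(lower_string)]
    if chunk.lower() == lower_string:
        # return the text as it appears in the source, case preserved
        return chunk, pos + len(chunk)
    return None, pos


print(parse_insensitive("Content-Type: text", 0, "content-type"))
# ('Content-Type', 12)
```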

parse_until(match_string)

Parses up to but not including match_string.

Advances the parser to the first character of match_string. If match_string is not found (or is None) then all the remaining characters in the source are parsed.

Returns the parsed text, even if empty. Never returns None.
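The behaviour described above can be sketched with str.find. As before, this is a hypothetical free function over a text-mode source, not the method itself.

```python
def parse_until(src, pos, match_string):
    """Return (parsed_text, new_pos) for a text-mode source string."""
    end = -1 if match_string is None else src.find(match_string, pos)
    if end < 0:
        end = len(src)  # not found: consume the rest of the source
    return src[pos:end], end


print(parse_until("key: value", 0, ": "))  # ('key', 3)
print(parse_until("key: value", 5, "#"))   # ('value', 10)
print(parse_until("key", 3, "#"))          # ('', 3)
```

The last call shows the empty-string result at the end of the source; the caller never has to test for None.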

match_one(match_chars)

Returns true if one of match_chars is at the current position

parse_one(match_chars)

Parses one of match_chars.

match_chars
A string of characters or bytes

Returns the character (or byte) or None if no match is found.

Warning: in binary mode, this method returns a single byte value, the type of which differs between Python versions. In Python 3, bytes are integers; in Python 2, they are binary strings of length 1. You can use the function py2.byte() to help ensure your source works on both platforms, for example:

from pyslet.py2 import byte
c = parser.parse_one(b"+-")
if c == byte(b"+"):
    # do plus thing...
    pass
elif c:
    # must be minus...
    pass
else:
    # do something else...
    pass

match_digit()

Returns true if the current character is a digit

Only ASCII digits are considered. In binary mode, byte values 0x30 to 0x39 are matched.

parse_digit()

Parses a digit character.

Returns the digit character/byte, or None if no digit is found. Like match_digit() only ASCII digits are parsed.

parse_digit_value()

Parses a single digit value.

Returns the digit value, or None if no digit is found. Like match_digit() only ASCII digits are parsed.

parse_digits(min, max=None)

Parses a string of digits

min
The minimum number of digits to parse. There is a special case where min=0: an empty string may be returned.
max (default None)
The maximum number of digits to parse, or None if there is no maximum.

Returns the string of digits or None if no digits can be parsed. Like parse_digit(), only ASCII digits are considered.

parse_integer(min=None, max=None, max_digits=None)

Parses an integer (or long).

min (optional, defaults to None)
A lower bound on the acceptable integer value; the result will always be >= min on success
max (optional, defaults to None)
An upper bound on the acceptable integer value; the result will always be <= max on success
max_digits (optional, defaults to None)
The limit on the number of digits, i.e., the field width.

If a suitable integer can’t be parsed then None is returned. This method only processes ASCII digits.
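How the three limits interact can be sketched as below. This is a hypothetical free function: the parameter names min_value and max_value are chosen to avoid shadowing Python's built-ins, and treating an out-of-range value as a failed parse that leaves the position unchanged is an assumption based on the parse_* convention.

```python
def parse_integer(src, pos, min_value=None, max_value=None, max_digits=None):
    """Return (value, new_pos), or (None, pos) if no suitable integer is found."""
    end = pos
    while (end < len(src) and src[end] in "0123456789"
           and (max_digits is None or end - pos < max_digits)):
        end += 1
    if end == pos:
        return None, pos  # no ASCII digits at the current position
    value = int(src[pos:end])
    if min_value is not None and value < min_value:
        return None, pos
    if max_value is not None and value > max_value:
        return None, pos
    return value, end


print(parse_integer("07:30", 0, 0, 23, 2))         # (7, 2)
print(parse_integer("25:61", 0, 0, 23, 2))         # (None, 0)
print(parse_integer("123456", 0, max_digits=3))    # (123, 3)
```

The third call shows max_digits acting as a field width: parsing stops after three digits even though more follow.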

Warning: in Python 2 the result may be of type long.

match_hex_digit()

Returns true if the current character is a hex-digit

Only ASCII digits are considered, letters can be either upper or lower case. In binary mode byte values 0x30 to 0x39, 0x41-0x46 and 0x61-0x66 are matched.
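The byte ranges listed above can be checked with a short Python 3 sketch. HEX_BYTES and is_hex_byte are hypothetical names for this example; it relies on the fact that iterating over a bytes object in Python 3 yields integers.

```python
# The set of byte values (integers in Python 3) that count as hex-digits
HEX_BYTES = frozenset(b"0123456789ABCDEFabcdef")


def is_hex_byte(b):
    """True if the integer byte value b is an ASCII hex-digit."""
    return b in HEX_BYTES


data = b"7f:G"
print([is_hex_byte(b) for b in data])  # [True, True, False, False]
```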

parse_hex_digit()

Parses a hex-digit.

Returns the digit, or None if no digit is found. See match_hex_digit() for which characters/bytes are considered hex-digits.

parse_hex_digits(min, max=None)

Parses a string of hex-digits

min
The minimum number of hex-digits to parse. There is a special case where min=0: an empty string may be returned.
max (default None)
The maximum number of hex-digits to parse, or None if there is no maximum.

Returns the string of hex-digits or None if no digits can be parsed. See match_hex_digit() for which characters/bytes are considered hex-digits.