6.2. XML: Basic Constructs

This module defines classes for working with XML documents. The version of the standard implemented is the Extensible Markup Language (Fifth Edition). For more information see: http://www.w3.org/TR/xml/

XML is an integral part of many standards for LET but Pyslet takes a slightly different approach from the pre-existing XML support in the Python language. XML elements are represented by instances of a basic Element class which can be used as a base class to customize document processing for specific types of XML document. It also allows these XML elements to ‘come live’ with additional methods and behaviours.

6.2.1. Documents

class pyslet.xml20081126.structures.Node(parent)

Bases: object

Base class for Element and Document shared attributes.

XML documents are defined hierarchically: each element has a parent which is either another element or an XML document.

parent = None

The parent of this element. For XML documents this attribute is set to None and is used as a sentinel to simplify traversal of the hierarchy.

GetChildren()

Returns an iterator over this object’s children.

classmethod get_element_class(name)

Returns a class object suitable for representing element name

name is a unicode string representing the element name.

The default implementation returns None - for elements this has the effect of deferring the call to the parent document (where this method is overridden to return Element).

This method is called immediately prior to ChildElement() and (when applicable) GetChildClass().

The real purpose of this method is to allow an element class to directly control the way the name of a child element maps to the class used to represent it. You would normally override this method in the Document to map element names to classes, but in some cases you may want to tweak the mapping at the individual element level. For example, the same element name may be used for two different purposes in the same XML document; although confusing, this is allowed in XML schema.
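The deferral described above can be sketched without Pyslet itself. The classes below are hypothetical stand-ins (they are not the library's implementation), and lookup_class is an assumed helper that mimics what a parser might do: ask the context element first and fall back to the document when None is returned.

```python
class Node:
    """Hypothetical stand-in for the Node base class."""

    def __init__(self, parent=None):
        self.parent = parent

    @classmethod
    def get_element_class(cls, name):
        # Default: no opinion, so the decision is deferred to the document.
        return None


class Element(Node):
    pass


class Title(Element):
    pass


class ChapterTitle(Element):
    pass


class Document(Node):
    @classmethod
    def get_element_class(cls, name):
        # Document-wide mapping, with Element as the fallback class.
        return {"title": Title}.get(name, Element)


class Chapter(Element):
    @classmethod
    def get_element_class(cls, name):
        # Inside a chapter the name "title" is mapped differently.
        return ChapterTitle if name == "title" else None


def lookup_class(context, name):
    """Asks the context first, then the node at the top of the hierarchy."""
    found = context.get_element_class(name)
    if found is None:
        node = context
        while node.parent is not None:
            node = node.parent
        found = node.get_element_class(name)
    return found


doc = Document()
chapter = Chapter(doc)
para = Element(chapter)
```

Here lookup_class(chapter, "title") selects ChapterTitle, while the same name looked up from para falls through to the document-level mapping.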

GetChildClass(stag_class)

Returns the element class implied by the STag for stag_class in this context.

This method is only called when the XMLParser.sgmlOmittag option is in effect. It is called prior to ChildElement() below and gives the context (the parent element or document) a chance to modify the child element that will be created (or reject it outright, by returning None).

For well-formed XML documents the default implementation is sufficient as it simply returns stag_class.

The XML parser may pass None for stag_class, indicating that PCDATA has been found in element content. In that case this method should return the first child element that may contain (directly or indirectly) PCDATA, or None if no children may contain PCDATA (or SGML-style omittag is not supported).

ChildElement(childClass, name=None)

Returns a new child of the given class attached to this object.

  • childClass is a class (or callable) used to create a new instance of Element.

  • name is the name given to the element (by the caller). If no name is given then the default name for the child is used. When the child returned is an existing instance, name is ignored.

ProcessingInstruction(target, instruction='')

Abstract method for handling processing instructions encountered by the parser while parsing this object’s content.

By default, processing instructions are ignored.

class pyslet.xml20081126.structures.Document(root=None, baseURI=None, reqManager=None, **args)

Bases: pyslet.xml20081126.structures.Node

Base class for all XML documents.

Initialises a new Document from optional keyword arguments.

With no arguments, a new Document is created with no baseURI or root element.

If root is a class object (descended from Element) it is used to create the root element of the document.

If root is an orphan instance of Element (i.e., it has no parent) it is used as the root element of the document and its Element.AttachToDocument() method is called.

baseURI can be set on construction (see SetBase) and a reqManager object can optionally be passed for managing any http(s) connections.

baseURI = None

The base uri of the document.

lang = None

The default language of the document.

declaration = None

The XML declaration (or None if no XMLDeclaration is used)

dtd = None

The dtd associated with the document.

root = None

The root element or None if no root element has been created yet.

GetChildren()

If the document has a root element it is returned in a single item list, otherwise an empty list is returned.

XMLParser(entity)

Returns an XMLParser instance suitable for parsing this type of document.

This method allows some document classes to override the parser used to parse them. This method is only used when parsing existing document instances (see Read() for more information).

Classes that override this method may still register themselves with RegisterDocumentClass() but if they do then the default XMLParser object will be used when this document class is automatically created when parsing an unidentified XML stream.

classmethod get_element_class(name)

Returns a class object suitable for representing name

name is a unicode string representing the element name.

The default implementation returns Element.

ChildElement(childClass, name=None)

Creates the root element of the given document.

If there is already a root element it is detached from the document first using Element.DetachFromDocument().

SetBase(baseURI)

Sets the baseURI of the document to the given URI.

baseURI should be an instance of pyslet.rfc2396.URI or an object that can be passed to its constructor.

Relative file paths are resolved relative to the current working directory immediately and the absolute URI is recorded as the document’s baseURI.
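As an illustration of the relative-path rule only, the following sketch uses the standard pathlib module rather than pyslet.rfc2396. The function name file_uri_for is hypothetical, and the sketch assumes the input is a plain file path, not an existing URI.

```python
from pathlib import Path


def file_uri_for(path):
    """Resolves path against the current working directory and
    returns the resulting absolute location as a file URI."""
    return Path(path).resolve().as_uri()


base = file_uri_for("docs/index.xml")
```

The point of resolving immediately is that a later change of working directory no longer affects the recorded base.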

GetBase()

Returns a string representation of the document’s baseURI.

SetLang(lang)

Sets the default language for the document.

GetLang()

Returns the default language for the document.

ValidationError(msg, element, data=None, aname=None)

Called when a validation error is triggered by element.

This method is designed to be overridden to implement custom error handling or logging (which is likely to be added in future to this module).

msg contains a brief message suitable for describing the error in a log file. data and aname have the same meanings as Element.ValidationError.

Read(src=None, **args)

Reads this document, parsing it from a source stream.

With no arguments the document is read from the baseURI which must have been specified on construction or with a call to the SetBase() method.

You can override the document’s baseURI by passing a value for src which may be an instance of XMLEntity or an object that can be passed as a valid source to its constructor.

Create(dst=None, **args)

Creates the Document.

Create outputs the document as an XML stream. The stream is written to the baseURI by default but if the ‘dst’ argument is provided then it is written directly to there instead. dst can be any object that supports the writing of unicode strings.

Currently only documents with file type baseURIs are supported. The file’s parent directories are created if required. The file is always written using UTF-8 as per the XML standard.

Update(**args)

Updates the Document.

Update outputs the document as an XML stream. The stream is written to the baseURI which must already exist! Currently only documents with file type baseURIs are supported.

DiffString(otherDoc, before=10, after=5)

Compares this document to otherDoc and returns first point of difference.

pyslet.xml20081126.structures.RegisterDocumentClass(doc_class, root_name, public_id=None, system_id=None)

Registers a document class for use by XMLParser.parse_document().

This module maintains a single table of document classes which can be used to identify the correct class to use to represent a document based on the information obtained from the DTD.

  • doc_class

    is the class object being registered, it must be derived from Document

  • root_name

    is the name of the root element or None if this class can be used with any root element.

  • public_id

    is the public ID of the doctype, or None if any doctype can be used with this document class.

  • system_id

    is the system ID of the doctype, this will usually be None indicating that the document class can match any system ID.
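A table of this kind can be sketched as a dictionary keyed on (root_name, public_id, system_id), with None acting as a wildcard. The lookup order below (most specific key first) is an assumption made for illustration; the names register_document_class and find_document_class are hypothetical and the code is not Pyslet's own.

```python
from itertools import product

_registry = {}


def register_document_class(doc_class, root_name, public_id=None, system_id=None):
    """Records doc_class under the given key; None matches anything."""
    _registry[(root_name, public_id, system_id)] = doc_class


def find_document_class(root_name, public_id=None, system_id=None):
    """Tries the most specific key first, then widens using None wildcards."""
    candidates = product((root_name, None), (public_id, None), (system_id, None))
    for key in candidates:
        if key in _registry:
            return _registry[key]
    return None


class XHTMLDocument:
    pass


class GenericDocument:
    pass


XHTML_ID = "-//W3C//DTD XHTML 1.0 Strict//EN"
register_document_class(XHTMLDocument, "html", XHTML_ID)
register_document_class(GenericDocument, None)
```

With these registrations an html root with the XHTML public ID selects XHTMLDocument whatever the system ID, and everything else falls back to GenericDocument.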

6.2.1.1. Characters

pyslet.xml20081126.structures.IsChar(c)

Tests if the character c matches the production for [2] Char.

If c is None IsChar returns False.
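For reference, production [2] Char in the XML specification is #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]. A minimal standalone sketch of the test follows; is_char is a hypothetical name and the code is written for Python 3, so it is not the module's implementation.

```python
def is_char(c):
    """Tests a single character against production [2] Char."""
    if c is None:
        return False
    cp = ord(c)
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )
```

Note that the control characters below #x20 (other than tab, line feed and carriage return), the surrogate block and the non-characters #xFFFE and #xFFFF are all excluded.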

pyslet.xml20081126.structures.IsDiscouraged(c)

Tests if the character c is one of the characters discouraged in the specification.

Note that this test is currently limited to the range of unicode characters available in the narrow python build.

6.2.1.2. Common Syntactic Constructs

pyslet.xml20081126.structures.is_s(c)

Tests if a single character c matches production [3] S

pyslet.xml20081126.structures.IsWhiteSpace(data)

Tests if every character in data matches production [3] S

pyslet.xml20081126.structures.ContainsS(data)

Tests if data contains any characters matching production [3] S

pyslet.xml20081126.structures.StripLeadingS(data)

Returns data with leading S removed.

pyslet.xml20081126.structures.NormalizeSpace(data)

Returns data normalized according to the further processing rules for attribute-value normalization:

”...by discarding any leading and trailing space (#x20) characters, and by replacing sequences of space (#x20) characters by a single space (#x20) character”
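The quoted rule can be sketched in one line of standalone Python. This is an illustration of the rule as quoted, in which only the space character #x20 is significant; normalize_space is a hypothetical name and the code is not taken from the module.

```python
def normalize_space(data):
    """Drops leading and trailing #x20 characters and collapses
    runs of #x20 to a single space; other characters are untouched."""
    return " ".join(part for part in data.split(" ") if part)
```

Tabs and line feeds are left alone here because, in the specification, attribute-value normalization has already replaced them with spaces before this further processing step is applied.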

pyslet.xml20081126.structures.CollapseSpace(data, sMode=True, sTest=<function is_s>)

Returns data with all spaces collapsed to a single space.

sMode determines the fate of any leading space, by default it is True and leading spaces are ignored provided the string has some non-space characters.

You can override the test of what constitutes a space by passing a function for sTest; by default we use is_s.

Note on degenerate case: this function is intended to be called with non-empty strings and will never return an empty string. If there is no data then a single space is returned (regardless of sMode).
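One way to read the description above is as a small state machine in which sMode records whether the previous character was a space. The sketch below follows that reading; it is an interpretation rather than the module's source, and collapse_space, s_mode, s_test and the simplified is_s are stand-in names.

```python
def is_s(c):
    """Simplified test for production [3] S."""
    return c in " \t\r\n"


def collapse_space(data, s_mode=True, s_test=is_s):
    result = []
    for c in data:
        if s_test(c):
            # Emit one space for the first space of a run only.
            if not s_mode:
                result.append(" ")
            s_mode = True
        else:
            s_mode = False
            result.append(c)
    # Degenerate case: never return an empty string.
    return "".join(result) if result else " "
```

Under this reading a trailing run of spaces survives as a single space, and passing s_mode as False keeps a single leading space.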

pyslet.xml20081126.structures.IsNameStartChar(c)

Tests if the character c matches production [4] NameStartChar.

pyslet.xml20081126.structures.IsNameChar(c)

Tests if a single character c matches production [4a] NameChar

pyslet.xml20081126.structures.IsValidName(name)

Tests if name is a string matching production [5] Name

pyslet.xml20081126.structures.IsReservedName(name)

Tests if name is reserved for future standardization, e.g., if it begins with ‘xml’.

pyslet.xml20081126.structures.IsPubidChar(c)

Tests if the character c matches production for [13] PubidChar.

6.2.1.3. Character Data and Markup

pyslet.xml20081126.structures.EscapeCharData(src, quote=False)

Returns a unicode string with XML reserved characters escaped.

We also escape return characters to prevent them being ignored. If quote is True then the string is returned as a quoted attribute value.
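A comparable effect can be obtained with the standard library's xml.sax.saxutils module. The sketch below is an approximation for illustration: escape_char_data is a hypothetical name, and the exact character reference and quoting style chosen by the real function may differ.

```python
from xml.sax.saxutils import escape, quoteattr


def escape_char_data(src, quote=False):
    """Escapes &, < and > and protects carriage returns; with
    quote=True the result is wrapped as a quoted attribute value."""
    carriage_return = {"\r": "&#13;"}
    if quote:
        return quoteattr(src, carriage_return)
    return escape(src, carriage_return)
```

Carriage returns are written as character references because a literal return would otherwise be normalized away by an XML parser when the output is read back.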

pyslet.xml20081126.structures.EscapeCharData7(src, quote=False)

Returns a unicode string with reserved and non-ASCII characters escaped.

6.2.1.4. CDATA Sections

pyslet.xml20081126.structures.EscapeCDSect(src)

Returns a unicode string enclosed in <![CDATA[ ]]> with any occurrence of ]]> replaced by the clumsy sequence: ]]>]]&gt;<![CDATA[

Degenerate case: an empty string is returned as an empty string
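The replacement works by closing the CDATA section, writing the terminator as ordinary escaped character data and then opening a new section. A minimal standalone sketch of the behaviour described (escape_cdsect is a hypothetical name, not the module's code):

```python
def escape_cdsect(src):
    """Wraps src in a CDATA section, splitting the section around
    any embedded ]]> terminator."""
    if not src:
        # Degenerate case: nothing to wrap.
        return ""
    return "<![CDATA[" + src.replace("]]>", "]]>]]&gt;<![CDATA[") + "]]>"
```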

6.2.1.5. Prolog and Document Type Declaration

class pyslet.xml20081126.structures.XMLDTD

Bases: object

An object that models a document type declaration.

The document type declaration acts as a container for the entity, element and attribute declarations used in a document.

name = None

The declared Name of the root element

parameterEntities = None

A dictionary of XMLParameterEntity instances keyed on entity name.

generalEntities = None

A dictionary of XMLGeneralEntity instances keyed on entity name.

notations = None

A dictionary of XMLNotation instances keyed on notation name.

elementList = None

A dictionary of ElementType definitions keyed on the name of element.

attributeLists = None

A dictionary of dictionaries, keyed on element name. Each of the resulting dictionaries is a dictionary of XMLAttributeDefinition keyed on attribute name.

DeclareEntity(entity)

Declares an entity in this document.

The same method is used for both general and parameter entities. The value of entity can be either an XMLGeneralEntity or an XMLParameterEntity instance.

GetParameterEntity(name)

Returns the parameter entity definition matching name.

Returns an instance of XMLParameterEntity. If no parameter entity has been declared with name then None is returned.

GetEntity(name)

Returns the general entity definition matching name.

Returns an instance of XMLGeneralEntity. If no general entity has been declared with name then None is returned.

DeclareNotation(notation)

Declares a notation for this document.

The value of notation must be an XMLNotation instance.

GetNotation(name)

Returns the notation declaration matching name.

Returns an instance of XMLNotation. If no notation has been declared with name then None is returned.

DeclareElementType(etype)

Declares an element type.

etype is an ElementType instance containing the element definition.

GetElementType(element_name)

Looks up an element type definition.

element_name is the name of the element type to look up

The method returns an instance of ElementType or None if no element with that name has been declared.

DeclareAttribute(element_name, attributeDef)

Declares an attribute.

  • element_name

    is the name of the element type which should have this attribute applied

  • attributeDef

    is an XMLAttributeDefinition instance describing the attribute being declared.

GetAttributeList(name)

Returns a dictionary of attribute definitions for the element type name.

If there are no attributes declared for this element type, None is returned.

GetAttributeDefinition(element_name, attributeName)

Looks up an attribute definition.

element_name is the name of the element type in which to search

attributeName is the name of the attribute to search for.

The method returns an instance of XMLAttributeDefinition or None if no attribute matching this description has been declared.
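The dictionary-of-dictionaries layout used for attribute declarations can be sketched as follows. DTDSketch and its method names are hypothetical, and plain strings stand in for XMLAttributeDefinition instances; the sketch only illustrates the two-level lookup and the None results.

```python
class DTDSketch:
    """Illustrative container for attribute declarations only."""

    def __init__(self):
        # element name -> {attribute name -> definition}
        self.attribute_lists = {}

    def declare_attribute(self, element_name, attribute_name, definition):
        per_element = self.attribute_lists.setdefault(element_name, {})
        per_element[attribute_name] = definition

    def get_attribute_list(self, name):
        # None when nothing has been declared for this element type.
        return self.attribute_lists.get(name)

    def get_attribute_definition(self, element_name, attribute_name):
        return self.attribute_lists.get(element_name, {}).get(attribute_name)


dtd = DTDSketch()
dtd.declare_attribute("img", "src", "CDATA #REQUIRED")
dtd.declare_attribute("img", "alt", "CDATA #IMPLIED")
```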

class pyslet.xml20081126.structures.XMLDeclaration(version, encoding='UTF-8', standalone=False)

Bases: pyslet.xml20081126.structures.XMLTextDeclaration

Represents a full XML declaration.

Unlike the parent class, XMLTextDeclaration, the version is required. standalone defaults to False as this is the assumed value if there is no standalone declaration.

standalone = None

Whether an XML document is standalone.

6.2.2. Logical Structures

class pyslet.xml20081126.structures.Element(parent, name=None)

Bases: pyslet.xml20081126.structures.Node

Basic class that represents all XML elements.

Some aspects of the element’s XML serialisation behaviour are controlled by special class attributes that can be set on derived classes.

XMLNAME
the default name of the element the class represents.
XMLCONTENT
the default content model of the element; one of the ElementType constants.
ID
the name of the ID attribute if the element has a unique ID. With this class attribute set, ID handling is automatic (see SetID() and id below).

By default, attributes are simply stored as strings mapped in an internal dictionary. It is often more useful to map XML attributes on to python attributes, parsing and validating their values to python objects. This mapping can be provided using class attributes of the form XMLATTR_aname where aname is the name of the attribute as it would appear in the XML element start or empty element tag.

XMLATTR_aname=<string>

This form creates a simple mapping from the XML attribute ‘aname’ to a python attribute with a defined name. For example, you might want to create a mapping like this to avoid a python reserved word:

XMLATTR_class="styleClass"

This allows XML elements like this:

<element class="x"/>

To be parsed into python objects that behave like this:

element.styleClass=="x"     # True

If an instance is missing a python attribute corresponding to a defined XML attribute, or its value has been set to None, then the XML attribute is omitted from the element’s tag when generating XML output.

XMLATTR_aname=(<string>, decodeFunction, encodeFunction)

More complex attributes can be handled by setting XMLATTR_aname to a tuple. The first item is the python attribute name (as above); the decodeFunction is a simple callable that takes a string argument and returns the decoded value of the attribute and the encodeFunction performs the reverse transformation.

The encode/decode functions can be None to indicate a no-operation.

For example, you might want to create an integer attribute using something like:

<!-- source XML -->
<element apples="5"/>

# class attribute definition
XMLATTR_apples=('nApples',int,unicode)

# resulting object behaves like this...
element.nApples==5      # True

XMLATTR_aname=(<string>, decodeFunction, encodeFunction, type)

When XML attribute values are parsed from tags the optional type component of the tuple descriptor can be used to indicate a multi-valued attribute (for example, XML attributes defined using one of the plural forms, IDREFS, ENTITIES and NMTOKENS). If the type value is not None then the XML attribute value is first split by white-space, as per the XML specification, and then the decode function is applied to each resulting component. The instance attribute is then set depending on the value of type:

types.ListType

The instance attribute becomes a list, for example:

<!-- source XML -->
<element primes="2 3 5 7"/>

# class attribute definition
XMLATTR_primes=('primes',int,unicode,types.ListType)

# resulting object behaves like this...
element.primes==[2,3,5,7]       # True

types.DictType

The instance attribute becomes a dictionary mapping parsed values on to their frequency, for example:

<!-- source XML -->
<element fruit="apple pear orange pear"/>

# class attribute definition
XMLATTR_fruit=('fruit',None,None,types.DictType)

# resulting object behaves like this...
element.fruit=={'apple':1, 'orange':1, 'pear':2}        # True

In this case, the decode function (if given) must return a hashable object.

When creating XML output the reverse transformations are performed using the encode functions and the type (plain, list or dict) of the attribute’s current value. The declared multi-valued type is ignored. For dictionary values the order of the output values may not be the same as the order originally read from the XML input.

Warning: Empty lists and dictionaries result in XML attribute values which are present but with empty strings. If you wish to omit these attributes in the output XML you must set the attribute value to None in the instance.
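The decoding rules for the three descriptor forms can be summarised in a standalone sketch. decode_attribute is a hypothetical helper, not part of the module; it is written for Python 3, so the builtins list, dict and str stand in for types.ListType, types.DictType and unicode.

```python
def decode_attribute(descriptor, xml_value):
    """Returns (python_name, decoded_value) for one XMLATTR_* descriptor."""
    if isinstance(descriptor, str):
        # Simple form: rename only, the value stays a string.
        return descriptor, xml_value
    # Pad a 3-tuple so that the optional type component defaults to None.
    py_name, decode, _encode, kind = (tuple(descriptor) + (None,))[:4]
    if decode is None:
        decode = lambda s: s  # None means no-operation
    if kind is None:
        return py_name, decode(xml_value)
    # Multi-valued: split on white space, then decode each token.
    # (str.split() accepts a superset of the XML S production.)
    items = [decode(token) for token in xml_value.split()]
    if kind is list:
        return py_name, items
    if kind is dict:
        counts = {}
        for item in items:
            counts[item] = counts.get(item, 0) + 1
        return py_name, counts
    raise ValueError("unsupported multi-valued type: %r" % (kind,))
```

For example, the descriptor ('primes', int, str, list) applied to the value "2 3 5 7" yields the list [2, 3, 5, 7] under the name primes.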

XMLAMAP XMLARMAP

Internally, the XMLATTR_* descriptors are parsed into two mappings. The XMLAMAP maps XML attribute names onto a tuple of:

(<python attribute name>, decodeFunction, type)

The XMLARMAP maps python attribute names onto a tuple of:

(<xml attribute name>, encodeFunction)

The mappings are created automatically as needed.
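How the two mappings might be derived from the class attributes is sketched below. build_attribute_maps and the Basket class are hypothetical, and the real module may build and cache its maps differently.

```python
def build_attribute_maps(cls):
    """Returns (amap, armap) built from the XMLATTR_* attributes of cls."""
    prefix = "XMLATTR_"
    amap, armap = {}, {}
    for attr in dir(cls):
        if not attr.startswith(prefix):
            continue
        xml_name = attr[len(prefix):]
        descriptor = getattr(cls, attr)
        if isinstance(descriptor, str):
            descriptor = (descriptor,)
        # Missing components of the descriptor default to None.
        py_name, decode, encode, kind = (tuple(descriptor) + (None, None, None))[:4]
        amap[xml_name] = (py_name, decode, kind)
        armap[py_name] = (xml_name, encode)
    return amap, armap


class Basket:
    XMLATTR_class = "styleClass"
    XMLATTR_apples = ("nApples", int, str)


amap, armap = build_attribute_maps(Basket)
```

The forward map is consulted when parsing a start tag and the reverse map when generating output.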

For legacy reasons, the multi-valued rules can also be invoked by setting an instance member to either a list or dictionary prior to parsing the instance from XML (e.g., in a constructor).

XML attribute names may contain many characters that are not legal in Python identifiers and automated attribute processing is not supported for these attributes. In practice, the only significant limitation is the colon. The common xml-prefixed attributes such as xml:lang are handled using special-purpose methods.

XMLCONTENT = 2

for consistency with the behaviour of the default methods we claim to be mixed content

reset(resetAttributes=False)

Clears all attributes and (optional) children.

GetDocument()

Returns the document that contains the element.

If the element is an orphan, or is the descendant of an orphan, then None is returned.

SetID(id)

Sets the id of the element, registering the change with the enclosing document.

If the id is already taken then XMLIDClashError is raised.

MangleAttributeName(name)

Returns a mangled attribute name, used when setting attributes.

If name cannot be mangled, None is returned.

UnmangleAttributeName(mName)

Returns an unmangled attribute name, used when getting attributes.

If mName is not a mangled name, None is returned.

GetAttributes()

Returns a dictionary object that maps attribute names onto values.

Each attribute value is represented as a (possibly unicode) string. Derived classes should override this method if they define any custom attribute setters.

The dictionary returned represents a copy of the information in the element and so may be modified by the caller.

SetAttribute(name, value)

Sets the value of an attribute.

If value is None then the attribute is removed or, if an XMLATTR_ mapping is in place its value is set to an empty list, dictionary or None as appropriate.

GetAttribute(name)

Gets the value of a single attribute as a string.

If the element has no attribute with name then KeyError is raised.

IsEmpty()

Returns True/False indicating whether this element must be empty.

If the class defines the XMLCONTENT attribute then the model is taken from there and this method returns True only if XMLCONTENT is XMLEmpty.

Otherwise, the method defaults to False

IsMixed()

Indicates whether or not the element may contain mixed content.

If the class defines the XMLCONTENT attribute then the model is taken from there and this method returns True only if XMLCONTENT is XMLMixedContent.

Otherwise, the method defaults to True.

GetChildren()

Returns an iterable of the element’s children.

This method iterates through the internal list of children. Derived classes with custom factory elements MUST override this method.

Each child is either a string type, unicode string type or instance of Element (or a derived class thereof). We do not represent comments, processing instructions or other meta-markup.

GetCanonicalChildren()

A wrapper for GetChildren() that returns an iterable of the element’s children canonicalized for white space.

We check the current setting of xml:space, returning the same list of children as GetChildren() if ‘preserve’ is in force. Otherwise we remove any leading space and collapse all others to a single space character.

ChildElement(childClass, name=None)

Returns a new child of the given class attached to this element.

A new child is created and attached to the element’s model unless the model supports a single element of the given childClass and the element already exists, in which case the existing instance is returned.

childClass is a class (or callable) used to create a new instance.

name is the name given to the element (by the caller). If no name is given then the default name for the child is used. When the child returned is an existing instance, name is ignored.

The default implementation checks for a custom factory method and calls it if defined and does no further processing. A custom factory method is a method of the form ClassName or an attribute that is being used to hold instances of this child. The attribute must already exist and can be one of None (optional child, new child is created), a list (optional repeatable child, new child is created and appended) or an instance of childClass (required/existing child, no new child is created, existing instance returned).

When no custom factory method is found the class hierarchy of childClass is enumerated and the search continues for factory methods corresponding to these parent classes.

If no custom factory method is defined then the default processing simply creates an instance of child (if necessary) and attaches it to the internal list of children.
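The attribute-based half of this convention (None, list or existing instance) can be sketched as follows. The classes are hypothetical stand-ins, factory methods named after the class are left out for brevity, and the code is an illustration of the rules above, not the library's implementation.

```python
class Item:
    def __init__(self, parent, name=None):
        self.parent = parent
        self.name = name


class Title(Item):
    pass


class Para(Item):
    pass


class Note(Item):
    pass


class Warning_(Note):
    pass


class Section:
    def __init__(self):
        self.Title = Title(self)  # required child: instance already exists
        self.Para = []            # optional, repeatable child
        self.Note = None          # optional, single child
        self.children = []        # default internal list of children

    def child_element(self, child_class, name=None):
        # Search child_class and then its parent classes for a matching slot.
        for cls in child_class.__mro__:
            slot = cls.__name__
            if slot not in vars(self):
                continue
            current = getattr(self, slot)
            if current is None:
                child = child_class(self, name)
                setattr(self, slot, child)
                return child
            if isinstance(current, list):
                child = child_class(self, name)
                current.append(child)
                return child
            return current  # existing instance, name is ignored
        # No custom slot: fall back to the internal list of children.
        child = child_class(self, name)
        self.children.append(child)
        return child
```

In this sketch a request for Warning_ is stored in the Note slot because the search continues through the parent classes of the requested class.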

DeleteChild(child)

Deletes the given child from this element’s children.

We follow the same factory conventions as for child creation except that an attribute pointing to a single child (of this class) will be replaced with None. If a custom factory method is found then the corresponding Delete_ClassName method must also be defined.

FindChildren(childClass, childList, max=None)

Finds up to max children of class childClass from the element and its children.

Deprecated in favour of list(FindChildrenDepthFirst(childClass,False))

All matching children are added to childList. If specifying a max number of matches then the incoming list must originally be empty to prevent early termination.

Note that if max is None, the default, then all children of the given class are returned with the proviso that nested matches are not included. In other words, if the model of childClass allows further elements of type childClass as children (directly or indirectly) then only the top-level match is returned.

Effectively this method provides a depth-first list of children. For example, to get all <div> elements in an HTML <body> you would have to recurse over the resulting list calling FindChildren again until the list of matching children stops growing.

FindChildrenBreadthFirst(childClass, subMatch=True, maxDepth=1000)

A generator method that iterates over children of class childClass using a breadth first scan.

childClass may also be a tuple as per the definition of the builtin isinstance function in python.

If subMatch is True (the default) then matching elements are also scanned for nested matches. If False, only the outer-most matching element is returned.

maxDepth controls the depth of the scan with level 1 indicating direct children only. It must be a positive integer and defaults to 1000.

Warning: to reduce memory requirements when searching large documents this method performs a two-pass scan of the element’s children, i.e., GetChildren() will be called twice.

Given that XML documents tend to be broader than they are deep FindChildrenDepthFirst() is a better method to use for general purposes.

FindChildrenDepthFirst(childClass, subMatch=True, maxDepth=1000)

A generator method that iterates over children of class childClass using a depth first scan.

childClass may also be a tuple as per the definition of the builtin isinstance function in python.

If subMatch is True (the default) then matching elements are also scanned for nested matches. If False, only the outer-most matching element is returned.

maxDepth controls the depth of the scan with level 1 indicating direct children only. It must be a positive integer and defaults to 1000.
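The roles of subMatch and maxDepth can be seen in a small standalone generator. The tree classes and the function find_children_depth_first are hypothetical and written for Python 3; the sketch mirrors the behaviour described here rather than reproducing the method itself.

```python
class TreeNode:
    def __init__(self, *children):
        self.children = list(children)

    def get_children(self):
        return iter(self.children)


class Div(TreeNode):
    pass


class Span(TreeNode):
    pass


def find_children_depth_first(element, child_class, sub_match=True, max_depth=1000):
    if max_depth < 1:
        return
    for child in element.get_children():
        if isinstance(child, child_class):
            yield child
            if not sub_match:
                # Do not look for nested matches inside a match.
                continue
        if isinstance(child, TreeNode):  # data children are skipped
            yield from find_children_depth_first(
                child, child_class, sub_match, max_depth - 1)


inner = Div()
outer = Div(Span(inner))
sibling = Div()
body = TreeNode(outer, "some text", sibling)
```

Scanning body for Div yields outer, inner and sibling in that order; with sub_match set to False, or with max_depth set to 1, inner is not reported.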

FindParent(parentClass)

Finds the first parent of class parentClass of this element.

If this element has no parent of the given class then None is returned.

AttachToParent(parent)

Called to attach an orphan element to a parent.

This method does not do any special handling of child elements, the caller takes responsibility for ensuring that this element will be returned by future calls to parent.GetChildren(). However, AttachToDocument() is called to ensure id registrations are made.

AttachToDocument(doc=None)

Called when the element is first attached to a document.

The default implementation ensures that any ID attributes belonging to this element or its descendants are registered.

DetachFromParent()

Called to detach an element from its parent, making it an orphan

This method does not do any special handling of child elements, the caller takes responsibility for ensuring that this element will no longer be returned by future calls to parent.GetChildren(). However, DetachFromDocument() is called to ensure id registrations are removed.

DetachFromDocument(doc=None)

Called when an element is being detached from a document.

The default implementation ensures that any ID attributes belonging to this element or its descendants are unregistered.

AddData(data)

Adds a string or unicode string to this element’s children.

This method raises a ValidationError if the element cannot take data children.

content_changed()

Notifies an element that its content has changed.

The default implementation tidies up the list of children to make future comparisons simpler and faster.

GenerateValue(ignoreElements=False)

A generator function that returns the strings that comprise this element’s value (useful when handling elements that contain a large amount of data). For more information see GetValue(). Note that:

string.join(e.GenerateValue(),u'')==e.GetValue()

GetValue(ignoreElements=False)

By default, returns a single unicode string representing the element’s data.

The default implementation is only supported for elements where mixed content is permitted (IsMixed()). It uses GetChildren() to iterate through the children.

If the element is empty an empty string is returned.

Derived classes may return more complex objects, such as values of basic python types or class instances, performing validation based on application-defined rules.

If the element contains child elements then XMLMixedContentError is raised. You can pass ignoreElements as True to override this behaviour in the unlikely event that you want:

<!-- elements like this... -->
<data>This is <em>the</em> value</data>

# to behave like this:
data.GetValue(True)==u"This is  value"

SetValue(value)

Replaces the value of the element with the (unicode) value.

The default implementation is only supported for elements where mixed content is permitted (IsMixed()) and only affects the internally maintained list of children. Elements with more complex mixed models MUST override this method.

If value is None then the element becomes empty.

Derived classes may allow more complex values to be set, such as values of basic python types or class instances depending on the element type being represented in the application.

ValidationError(msg, data=None, aname=None)

Indicates that a validation error occurred in this element.

An error message indicates the nature of the error.

The data that caused the error may be given in data.

Furthermore, the attribute name may also be given indicating that the offending data was in an attribute of the element and not the element itself.

SortNames(nameList)

Given a list of element or attribute names, sorts them in a predictable order

The default implementation assumes that the names are strings or unicode strings so uses the default sort method.

Copy(parent=None)

Creates a new instance of this element which is a deep copy of this one.

parent is the parent node to attach the new element to. If it is None then a new orphan element is created.

This method mimics the process of serialisation and deserialisation (without the need to generate markup). As a result, element attributes are serialised and deserialised to strings during the copy process.

GetBase()

Returns the value of the xml:base attribute as a string.

SetBase(base)

Sets the value of the xml:base attribute from a string.

Changing the base of an element affects the interpretation of all relative URIs in this element and its children.

ResolveBase()

Returns a fully specified URI for the base of the current element.

The URI is calculated using any xml:base values of the element or its ancestors and ultimately relative to the baseURI.

If the element is not contained by a Document, or the document does not have a fully specified baseURI then the return result may be a relative path or even None, if no base information is available.

ResolveURI(uriref)

Resolves a URI reference in the current context.

uriref
A pyslet.rfc2396.URI instance or a string

The argument is resolved relative to the xml:base values of the element’s ancestors and ultimately relative to the document’s baseURI. The result may still be a relative URI: there may be no base set, or the base may only be known in relative terms.
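The resolution chain can be sketched with the standard library's urllib.parse.urljoin in place of pyslet.rfc2396. BaseNode, resolve_base and resolve_uri are hypothetical names; a single base attribute stands in for xml:base on elements and for baseURI on the document.

```python
from urllib.parse import urljoin


class BaseNode:
    def __init__(self, parent=None, base=None):
        self.parent = parent
        self.base = base


def resolve_base(node):
    """Combines the base values from the outermost ancestor inwards."""
    chain = []
    while node is not None:
        if node.base is not None:
            chain.append(node.base)
        node = node.parent
    if not chain:
        return None  # no base information is available
    resolved = chain.pop()  # outermost base, e.g. the document's
    while chain:
        resolved = urljoin(resolved, chain.pop())
    return resolved


def resolve_uri(node, uriref):
    base = resolve_base(node)
    return uriref if base is None else urljoin(base, uriref)


doc = BaseNode(base="http://example.com/docs/index.xml")
section = BaseNode(doc, base="chapter1/")
para = BaseNode(section)
```

Resolving "fig.png" from para gives a URI inside chapter1/, while the same reference resolved from an orphan with no base is returned unchanged.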

RelativeURI(href)

Returns href expressed relative to the element’s base.

If href is a relative URI then it is converted to a fully specified URL by interpreting it as being the URI of a file expressed relative to the current working directory.

If the element does not have a fully-specified base URL then href is returned as a fully-specified URL itself.

GetLang()

Returns the value of the xml:lang attribute as a string.

SetLang(lang)

Sets the value of the xml:lang attribute from a string.

See ResolveLang() for how to obtain the effective language of an element.

ResolveLang()

Returns the effective language for the current element.

The language is resolved using the xml:lang value of the element or its ancestors. If no xml:lang is in effect then None is returned.
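As a rough illustration of this inheritance rule, the following sketch walks up a hierarchy until it finds a language value. LangNode and resolve_lang are hypothetical names used only for this example; they are not part of Pyslet.

```python
class LangNode:
    """Hypothetical stand-in for an element; not a Pyslet class."""

    def __init__(self, parent=None, lang=None):
        self.parent = parent  # None marks the top of the hierarchy
        self.lang = lang      # value of xml:lang, or None if unset

    def resolve_lang(self):
        """Return the nearest xml:lang value, searching towards the root."""
        node = self
        while node is not None:
            if node.lang is not None:
                return node.lang
            node = node.parent
        return None


root = LangNode(lang="en")
section = LangNode(parent=root)
quote = LangNode(parent=section, lang="fr")
footnote = LangNode(parent=quote)
detached = LangNode()
```

In this sketch section inherits "en" from root, footnote inherits "fr" from quote, and detached has no effective language.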

PrettyPrint()

Indicates if this element’s content should be pretty-printed.

This method is used when formatting XML files to text streams. The behaviour can be affected by the xml:space attribute or by derived classes that can override the default behaviour.

If this element has xml:space set to ‘preserve’ then we return False. If self.parent.PrettyPrint() returns False then we return False.

Otherwise we return False if we know the element is (or should be) mixed content, True otherwise.

Note: an element of undetermined content model that contains only elements and white space is pretty printed.
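The decision order above can be sketched as follows. This is an illustration under stated assumptions rather than Pyslet's code: PrintNode is a hypothetical class, its mixed flag stands in for whatever knowledge of the content model is available, and a node with no parent is treated as having no restriction from above.

```python
class PrintNode:
    """Hypothetical stand-in for an element; not a Pyslet class."""

    def __init__(self, parent=None, space=None, mixed=False):
        self.parent = parent  # None for the top of the hierarchy
        self.space = space    # value of xml:space, or None if unset
        self.mixed = mixed    # True if the element is known to be mixed content

    def pretty_print(self):
        # xml:space="preserve" switches pretty-printing off for this element.
        if self.space == "preserve":
            return False
        # A parent that is not pretty-printed switches it off for its children.
        if self.parent is not None and not self.parent.pretty_print():
            return False
        # Otherwise only mixed content is left untouched.
        return not self.mixed


root = PrintNode()
pre = PrintNode(parent=root, space="preserve")
inside_pre = PrintNode(parent=pre)
para = PrintNode(parent=root, mixed=True)
```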

WriteXMLAttributes(attributes, escapeFunction=<function EscapeCharData>, root=False)

Adds strings representing the element’s attributes

attributes is a list of unicode strings. Attributes should be appended as strings of the form 'name="value"' with values escaped appropriately for XML output.

GenerateXML(escapeFunction=<function EscapeCharData>, indent='', tab='\t', root=False)

A generator function that returns strings representing the serialised version of this element:

# the element's serialised output can be obtained as a single string
''.join(e.GenerateXML())

class pyslet.xml20081126.structures.ElementType

Bases: object

An object for representing element type definitions.

Empty = 0

Content type constant for EMPTY

Any = 1

Content type constant for ANY

Mixed = 2

Content type constant for mixed content

ElementContent = 3

Content type constant for element content

SGMLCDATA = 4

Additional content type constant for SGML CDATA

name = None

The name of this element

contentType = None

The content type of this element, one of the constants defined above.

contentModel = None

An XMLContentParticle instance which contains the element’s content model or None in the case of EMPTY or ANY declarations.

particleMap = None

A mapping used to validate the content model during parsing. It maps the name of the first child element found to a list of XMLNameParticle instances that can represent it in the content model. For more information see XMLNameParticle.particleMap.

BuildModel()

Builds internal structures to support model validation.

IsDeterministic()

Tests if the content model is deterministic.

For degenerate cases (elements declared with ANY or EMPTY) the method always returns True.

class pyslet.xml20081126.structures.XMLContentParticle

Bases: object

An object for representing content particles.

ZeroOrOne = 1

Occurrence constant for ‘?’

OneOrMore = 3

Occurrence constant for ‘+’

occurrence = None

One of the occurrence constants defined above.

BuildParticleMaps(exitParticles)

Abstract method that builds the particle maps for this node or its children.

For more information see XMLNameParticle.particleMap.

Although only name particles have particle maps this method is called for all particle types to allow the model to be built hierarchically from the root out to the terminal (name) nodes. exitParticles provides a mapping to all the following particles outside the part of the hierarchy rooted at the current node that are directly reachable from the particles inside.

SeekParticles(pMap)

Abstract method that adds all possible entry particles to pMap.

pMap is a mapping from element name to a list of XMLNameParticle instances.

Returns True if a required particle was added, False if all particles added are optional.

Like BuildParticleMaps(), this method is called for all particle types. The mappings requested represent all particles inside the part of the hierarchy rooted at the current node that are directly reachable from the preceding particles outside.

AddParticles(srcMap, pMap)

A utility method that adds particles from srcMap to pMap.

Both maps are mappings from element name to a list of XMLNameParticle instances. All entries in srcMap not currently in pMap are added.

IsDeterministic(pMap)

A utility method for identifying deterministic particle maps.

A deterministic particle map is one in which each name maps uniquely to a single content particle. A non-deterministic particle map contains an ambiguity, for example ((b,d)|(b,e)). The particle map created by SeekParticles() for the enclosing choice list would have two entries for ‘b’, one mapping to the first particle of the first sequence and one to the first particle of the second sequence.

Although non-deterministic content models are not allowed in SGML they are tolerated in XML and are only flagged as compatibility errors.
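The ((b,d)|(b,e)) example can be made concrete with a small sketch. It models particle maps as plain dictionaries from element name to a list of particles and uses descriptive strings in place of XMLNameParticle instances. The function names add_particles and is_deterministic are hypothetical and only mirror the behaviour described for AddParticles() and IsDeterministic().

```python
def add_particles(src_map, p_map):
    """Add particles from src_map to p_map, skipping any already present."""
    for name, particles in src_map.items():
        target = p_map.setdefault(name, [])
        for particle in particles:
            if particle not in target:
                target.append(particle)


def is_deterministic(p_map):
    """True if every name maps to at most one particle."""
    return all(len(particles) <= 1 for particles in p_map.values())


# Entry particles of the two sequences in ((b,d)|(b,e)).
# The strings are labels standing in for name particles.
first_sequence = {"b": ["b in (b,d)"]}
second_sequence = {"b": ["b in (b,e)"]}

choice_map = {}
add_particles(first_sequence, choice_map)
add_particles(second_sequence, choice_map)
ambiguous = not is_deterministic(choice_map)

# By contrast, the entry map for (b,(d|e)) has a single particle for 'b'.
factored_map = {"b": ["b in (b,(d|e))"]}
```

Rewriting the model as (b,(d|e)) removes the ambiguity because only one particle can match an initial ‘b’.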

class pyslet.xml20081126.structures.XMLNameParticle

Bases: pyslet.xml20081126.structures.XMLContentParticle

An object representing a content particle for a named element in the grammar

name = None

the name of the element type that matches this particle

particleMap = None

Each XMLNameParticle has a particle map that maps the name of the ‘next’ element found in the content model to the list of possible XMLNameParticle instances that represent it in the content model.

The content model can be traversed using ContentParticleCursor.

class pyslet.xml20081126.structures.XMLChoiceList

Bases: pyslet.xml20081126.structures.XMLContentParticle

An object representing a choice list of content particles in the grammar

class pyslet.xml20081126.structures.XMLSequenceList

Bases: pyslet.xml20081126.structures.XMLContentParticle

An object representing a sequence list of content particles in the grammar

class pyslet.xml20081126.structures.XMLAttributeDefinition

Bases: object

An object for representing attribute declarations.

CData = 0

Type constant representing CDATA

ID = 1

Type constant representing ID

IDRef = 2

Type constant representing IDREF

IDRefs = 3

Type constant representing IDREFS

Entity = 4

Type constant representing ENTITY

Entities = 5

Type constant representing ENTITIES

NmToken = 6

Type constant representing NMTOKEN

NmTokens = 7

Type constant representing NMTOKENS

Notation = 8

Type constant representing NOTATION

Implied = 0

Presence constant representing #IMPLIED

Required = 1

Presence constant representing #REQUIRED

Fixed = 2

Presence constant representing #FIXED

Default = 3

Presence constant representing a declared default value

entity = None

the entity in which this attribute was declared

name = None

the name of the attribute

type = None

One of the above type constants

values = None

An optional dictionary of values

defaultValue = None

An optional default value

6.2.3. Physical Structures

class pyslet.xml20081126.structures.XMLEntity(src=None, encoding=None, reqManager=None)

Bases: object

An object representing an entity.

This object serves two purposes: it stores information about declared entities and it acts as a parser that feeds unicode characters to the main XMLParser.

Optional src, encoding and reqManager parameters can be provided. If src is not None then these parameters are used to open the entity reader immediately using one of the Open methods described below.

src may be a unicode string, a byte string, an instance of pyslet.rfc2396.URI or any object that supports file-like behaviour. If using a file, the file must support seek behaviour.

location = None

the location of this entity (used as the base URI to resolve relative links)

mimetype = None

the mime type of the entity, if known, or None

encoding = None

the encoding of the entity (text entities)

charSource = None

A unicode data reader used to read characters from the entity. If None, then the entity is closed.

bom = None

flag to indicate whether or not the byte order mark was detected. If detected the flag is set to True. An initial byte order mark is not reported in the_char or by the next_char() method.

the_char = None

the character at the current position in the entity

lineNum = None

the current line number within the entity (first line is line 1)

linePos = None

the current character position within the entity (first char is 1)

buffText = None

used by XMLParser.push_entity()

ChunkSize = 4096

Characters are read from the dataSource in chunks. The default chunk size is 4KB.

In fact, in some circumstances the entity reader starts more cautiously. If the entity reader expects to read an XML or Text declaration, which may have an encoding declaration, then it reads one character at a time until the declaration is complete. This allows the reader to change to the encoding in the declaration without errors caused by reading too many characters using the wrong codec. See ChangeEncoding() and KeepEncoding() for more information.

GetName()

Abstract method to return a name to represent this entity in logs and error messages.

IsExternal()

Returns True if this is an external entity.

The default implementation returns True if location is not None, False otherwise.

Open()

Opens the entity for reading.

The default implementation uses OpenURI() to open the entity from location if available, otherwise it raises UnimplementedError.

IsOpen()

Returns True if the entity is open for reading.

OpenUnicode(src)

Opens the entity from a unicode string.

OpenString(src, encoding=None)

Opens the entity from a byte string.

The optional encoding is used to convert the string to unicode and defaults to None - meaning that the auto-detection method will be applied.

The advantage of using this method instead of converting the string to unicode and calling OpenUnicode() is that this method creates a unicode reader object to parse the string instead of making a copy of it in memory.

OpenFile(src, encoding='utf-8')

Opens the entity from an existing (open) binary file.

The optional encoding provides a hint as to the intended encoding of the data and defaults to UTF-8. Unlike other Open* methods we do not assume that the file is seekable. However, you may set encoding to None for a seekable file, thus invoking auto-detection of the encoding.

OpenURI(src, encoding=None, reqManager=None)

Opens the entity from a URI passed in src.

The file, http and https schemes are the only ones supported.

The optional encoding provides a hint as to the intended encoding of the data and defaults to UTF-8. For http(s) resources this parameter is only used if the charset cannot be read successfully from the HTTP headers.

The optional reqManager allows you to pass an existing instance of pyslet.http.client.Client for handling URI with http or https schemes.

OpenHTTPResponse(src, encoding='utf-8')

Opens the entity from an HTTP response passed in src.

The optional encoding provides a hint as to the intended encoding of the data and defaults to UTF-8. This parameter is only used if the charset cannot be read successfully from the HTTP response headers.

reset()

Resets an open entity, causing it to return to the first character in the entity.

GetPositionStr()

Returns a short string describing the current line number and character position.

For example, if the current character is pointing to character 6 of line 4 then it will return the string ‘Line 4.6’

next_char()

Advances to the next character in an open entity.

This method takes care of the End-of-Line handling rules for XML which force us to remove any CR characters and replace them with LF if they appear on their own or to silently drop them if they appear as part of a CR-LF combination.
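The end-of-line rule can be sketched as a small generator over characters. This is only an illustration of the rule, not the entity reader itself: normalize_newlines is a hypothetical name, and it emits the LF when it sees the CR and then skips the LF of a CR-LF pair, which produces the same output as dropping the CR.

```python
def normalize_newlines(chars):
    """Yield characters with each CR or CR-LF replaced by a single LF."""
    previous_was_cr = False
    for c in chars:
        if c == "\r":
            # A CR always produces one LF, whether or not an LF follows.
            yield "\n"
            previous_was_cr = True
        elif c == "\n" and previous_was_cr:
            # Second half of a CR-LF pair: the LF was already emitted.
            previous_was_cr = False
        else:
            yield c
            previous_was_cr = False


raw = "a\r\nb\rc\nd\r\r\ne"
normalized = "".join(normalize_newlines(raw))
```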

AutoDetectEncoding(srcFile)

Auto-detects the character encoding in srcFile.

Should only be called for seek-able streams opened in binary mode.
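The following sketch shows one way such detection can work, and why the stream needs to be seekable: the first few bytes are read and the stream is then rewound. It is loosely based on the byte patterns listed in the XML specification's appendix on encoding detection and is not claimed to match the method's actual behaviour. The function name sniff_encoding and its return convention are assumptions made for this example.

```python
import codecs
import io

# UTF-32 marks are tested first because the UTF-32 LE mark begins with
# the same two bytes as the UTF-16 LE mark.
BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]


def sniff_encoding(src_file):
    """Return (encoding, bom_length) guessed from the first four bytes.

    encoding is None if nothing could be determined.  The stream position
    is restored before returning.
    """
    start = src_file.tell()
    magic = src_file.read(4)
    src_file.seek(start)
    for bom, name in BOMS:
        if magic.startswith(bom):
            return name, len(bom)
    if magic == b"<?xm":
        # ASCII-compatible: the declaration itself names the real encoding.
        return "utf-8", 0
    if magic == b"<\x00?\x00":
        return "utf-16-le", 0
    if magic == b"\x00<\x00?":
        return "utf-16-be", 0
    return None, 0


with_bom = io.BytesIO(codecs.BOM_UTF8 + b"<a/>")
utf16 = io.BytesIO(codecs.BOM_UTF16_LE + "<a/>".encode("utf-16-le"))
plain = io.BytesIO(b"<?xml version='1.0'?><a/>")
```

A guess made this way is provisional. The declared encoding, if any, still takes effect through ChangeEncoding().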

ChangeEncoding(encoding)

Changes the encoding used to interpret the entity’s stream.

In many cases we can only guess at the encoding used in a file or other byte stream. However, XML has a mechanism for declaring the encoding as part of the XML or Text declaration. This declaration can typically be parsed even if the encoding has been guessed incorrectly initially. This method allows the XML parser to notify the entity that a new encoding has been declared and that future characters should be interpreted with this new encoding.

You can only change the encoding once. This method calls KeepEncoding() once the encoding has been changed.

KeepEncoding()

Tells the entity parser that the encoding will not be changed again.

This entity parser starts in a cautious mode, parsing the entity one character at a time to avoid errors caused by buffering with the wrong encoding. This method should be called once the encoding is determined so that the entity parser can use its internal character buffer.
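The interplay between cautious reading and an encoding switch can be sketched with the standard library's incremental decoders. CautiousReader and its methods are hypothetical and much simpler than the real entity reader: the sketch reads one byte at a time while cautious, reads in chunks afterwards, and ignores incomplete trailing bytes at the end of the stream.

```python
import codecs
import io


class CautiousReader:
    """Hypothetical reader; illustrates the idea, not Pyslet's implementation."""

    def __init__(self, stream, encoding="utf-8", chunk_size=4096):
        self.stream = stream
        self.decoder = codecs.getincrementaldecoder(encoding)()
        self.chunk_size = chunk_size
        self.cautious = True  # read byte by byte until the encoding is settled
        self.buffer = ""

    def change_encoding(self, encoding):
        """Switch codec once, then allow buffered reading."""
        if not self.cautious:
            raise ValueError("encoding can only be changed once")
        self.decoder = codecs.getincrementaldecoder(encoding)()
        self.keep_encoding()

    def keep_encoding(self):
        self.cautious = False

    def next_char(self):
        """Return the next character, or None at the end of the stream."""
        while not self.buffer:
            data = self.stream.read(1 if self.cautious else self.chunk_size)
            if not data:
                return None
            self.buffer = self.decoder.decode(data)
        c, self.buffer = self.buffer[0], self.buffer[1:]
        return c


decl = b"<?xml version='1.0' encoding='latin-1'?>"
reader = CautiousReader(io.BytesIO(decl + b"<a>\xe9</a>"))

# Read the declaration with the initial UTF-8 guess, one byte at a time.
head = "".join(reader.next_char() for _ in range(len(decl)))

# The declaration names latin-1, so switch before reading any further.
reader.change_encoding("latin-1")
tail = ""
c = reader.next_char()
while c is not None:
    tail += c
    c = reader.next_char()
```

Had the reader buffered a full chunk with the UTF-8 guess, the byte 0xE9 would have been decoded with the wrong codec before the declaration was seen.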

NextLine()

Called when the entity reader detects a new line.

This method increases the internal line count and resets the character position to the beginning of the line. You will not normally need to call this directly as line handling is done automatically by next_char().

close()

Closes the entity.

class pyslet.xml20081126.structures.XMLGeneralEntity(name=None, definition=None, notation=None)

Bases: pyslet.xml20081126.structures.XMLDeclaredEntity

An object for representing general entities.

A general entity can be constructed with an optional name, definition and notation, used to initialise the following fields.

notation = None

the notation name for external unparsed entities

GetName()

Returns the name of the entity formatted as a general entity reference.

class pyslet.xml20081126.structures.XMLParameterEntity(name=None, definition=None)

Bases: pyslet.xml20081126.structures.XMLDeclaredEntity

An object for representing parameter entities.

A parameter entity can be constructed with an optional name and definition, used to initialise the following two fields.

next_char()

Overridden to provide trailing space during special parameter entity handling.

OpenAsPE()

Opens the parameter entity for reading in the context of a DTD.

This special method implements the rule that the replacement text of a parameter entity, when included as a PE, must be enlarged by the attachment of a leading and trailing space.

GetName()

Returns the name of the entity formatted as a parameter entity reference.

class pyslet.xml20081126.structures.XMLExternalID(public=None, system=None)

Bases: object

Used to represent external references to entities.

Returns an instance of XMLExternalID. One of public and system should be provided.

get_location(base=None)

Returns the absolute URI where the external entity can be found.

Returns a pyslet.rfc2396.URI resolved against base if applicable. If there is no system identifier then None is returned.

class pyslet.xml20081126.structures.XMLTextDeclaration(version='1.0', encoding='UTF-8')

Bases: object

Represents the text components of an XML declaration.

Both version and encoding are optional, though one or the other is required depending on the context in which the declaration will be used.

class pyslet.xml20081126.structures.XMLNotation(name, external_id)

Bases: object

Represents an XML Notation

Returns an XMLNotation instance.

external_id is an XMLExternalID instance in which one of public or system must be provided.

name = None

the notation name

external_id = None

the external ID of the notation (an XMLExternalID instance)

6.2.3.1. Character Classes

pyslet.xml20081126.structures.IsLetter(c)

Tests if the character c matches production [84] Letter.

pyslet.xml20081126.structures.IsBaseChar(c)

Tests if the character c matches production [85] BaseChar.

pyslet.xml20081126.structures.IsIdeographic(c)

Tests if the character c matches production [86] Ideographic.
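As an illustration of how such a test can be written, the sketch below checks a character against the three ranges that production [86] lists in the XML specification: [#x4E00-#x9FA5], #x3007 and [#x3021-#x3029]. The function name is_ideographic_sketch is hypothetical, and the sketch makes no claim about how the module implements its own test.

```python
def is_ideographic_sketch(c):
    """Test a single character against production [86] Ideographic."""
    code = ord(c)
    return (
        0x4E00 <= code <= 0x9FA5
        or code == 0x3007
        or 0x3021 <= code <= 0x3029
    )


cjk = is_ideographic_sketch("\u4e2d")      # a CJK unified ideograph
zero = is_ideographic_sketch("\u3007")     # ideographic number zero
latin = is_ideographic_sketch("A")         # outside all three ranges
```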

pyslet.xml20081126.structures.IsCombiningChar(c)

Tests if the character c matches production [87] CombiningChar.

pyslet.xml20081126.structures.is_digit(c)

Tests if the character c matches production [88] Digit.

pyslet.xml20081126.structures.IsExtender(c)

Tests if the character c matches production [89] Extender.