6.3. XML: Parsing XML Documents¶

This module exposes a number of internal functions, typically defined privately in XML parser implementations, which make it easier to reuse concepts from XML in other modules. For example, the IsNameStartChar() function tells you if a character matches the production for NameStartChar in the XML standard.
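To illustrate the kind of test that IsNameStartChar() performs, here is a minimal, self-contained sketch of production [4] NameStartChar from XML 1.0 (Fifth Edition). The names NAME_START_RANGES and is_name_start_char are invented for this example; this is not Pyslet's own implementation.

```python
# Character ranges from production [4] NameStartChar, XML 1.0 (Fifth Edition).
# Illustrative table only; not taken from Pyslet.
NAME_START_RANGES = (
    (0x3A, 0x3A),      # ":"
    (0x41, 0x5A),      # A-Z
    (0x5F, 0x5F),      # "_"
    (0x61, 0x7A),      # a-z
    (0xC0, 0xD6), (0xD8, 0xF6), (0xF8, 0x2FF),
    (0x370, 0x37D), (0x37F, 0x1FFF), (0x200C, 0x200D),
    (0x2070, 0x218F), (0x2C00, 0x2FEF), (0x3001, 0xD7FF),
    (0xF900, 0xFDCF), (0xFDF0, 0xFFFD), (0x10000, 0xEFFFF),
)


def is_name_start_char(c):
    """Return True if the single character c matches NameStartChar.

    None (used here to mean end of stream) never matches.
    """
    if c is None:
        return False
    code = ord(c)
    return any(lo <= code <= hi for lo, hi in NAME_START_RANGES)
```

A digit or hyphen can appear later in a Name but cannot start one, so is_name_start_char("1") and is_name_start_char("-") are both False.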

class pyslet.xml20081126.parser.XMLParser(entity)¶

Bases: pyslet.pep8.PEP8Compatibility

An XMLParser object

entity: The XMLEntity to parse.

XMLParser objects are used to parse entities for the constructs defined by the numbered productions in the XML specification.

XMLParser has a number of optional attributes, all of which default to False. Attributes with names starting with ‘check’ increase the strictness of the parser. Setting any of the other parser flags to True means the parser no longer behaves as a conforming XML processor.

DocumentClassTable = {}¶

A dictionary mapping doctype parameters onto class objects.

For more information about how this is used see get_document_class() and RegisterDocumentClass().

RefModeNone = 0¶: Default constant used for setting refMode

RefModeInContent = 1¶: Treat references as per “in Content” rules

RefModeInAttributeValue = 2¶: Treat references as per “in Attribute Value” rules

RefModeAsAttributeValue = 3¶: Treat references as per “as Attribute Value” rules

RefModeInEntityValue = 4¶: Treat references as per “in EntityValue” rules

RefModeInDTD = 5¶: Treat references as per “in DTD” rules

PredefinedEntities = {'amp': '&', 'lt': '<', 'gt': '>', 'apos': "'", 'quot': '"'}¶: A mapping from the names of the predefined entities (lt, gt, amp, apos, quot) to their replacement characters.

checkValidity = None¶: checks XML validity constraints. If checkValidity is True, and all other options are left at their default (False) setting, then the parser will behave as a validating XML parser.

valid = None¶: Flag indicating if the document is valid, only set if checkValidity is True

nonFatalErrors = None¶: A list of non-fatal errors discovered during parsing, only populated if checkValidity is True

checkCompatibility = None¶: checks XML compatibility constraints; will cause checkValidity to be set to True when parsing

checkAllErrors = None¶: checks all constraints; will cause checkValidity and checkCompatibility to be set to True when parsing.

raiseValidityErrors = None¶: treats validity errors as fatal errors

dontCheckWellFormedness = None¶: provides a loose parser for XML-like documents

unicodeCompatibility = None¶: See http://www.w3.org/TR/unicode-xml/

sgmlNamecaseGeneral = None¶: option that simulates SGML’s NAMECASE GENERAL YES

sgmlNamecaseEntity = None¶: option that simulates SGML’s NAMECASE ENTITY YES

sgmlOmittag = None¶: option that simulates SGML’s OMITTAG YES

sgmlShorttag = None¶: option that simulates SGML’s SHORTTAG YES

sgmlContent = None¶

This option simulates some aspects of SGML content handling based on class attributes of the element being parsed.

Element classes with XMLCONTENT set to XMLEmpty are treated as elements declared EMPTY. These elements are treated as if they were introduced with an empty element tag even if they weren’t, as per SGML’s rules. Note that this SGML feature “has nothing to do with markup minimization” (i.e., sgmlOmittag).

refMode = None¶

The current parser mode for interpreting references.

XML documents can contain five different types of reference: parameter entity, internal general entity, external parsed entity, (external) unparsed entity and character entity.

The rules for interpreting these references vary depending on the current mode of the parser, for example, in content a reference to an internal entity is replaced, but in the definition of an entity value it is not. This means that the behaviour of the parse_reference() method will differ depending on the mode.

The parser takes care of setting the mode automatically but if you wish to use some of the parsing methods in isolation to parse fragments of XML documents, then you will need to set the refMode directly using one of the RefMode* family of constants defined above.

entity = None¶: The current entity being parsed

the_char = None¶: the current character; None indicates end of stream

declaration = None¶: The declaration being parsed or None

dtd = None¶: The document type declaration of the document being parsed. This member is initialised to None as well-formed XML documents are not required to have an associated dtd.

doc = None¶: The document being parsed

docEntity = None¶: The document entity

element = None¶: The current element being parsed

elementType = None¶: The element type of the current element

get_context()¶

Returns the parser’s context

This is either the current element or the document if no element is being parsed.

next_char()¶

Moves to the next character in the stream.

The current character can always be read from the_char. If there are no characters left in the current entity then entities are popped from an internal entity stack automatically.

buff_text(unused_chars)¶

Buffers characters that have already been parsed.

unused_chars: A string of characters to be pushed back to the parser in the order in which they are to be parsed.

This method enables characters to be pushed back into the parser forcing them to be parsed next. The current character is saved and will be parsed (again) once the buffer is exhausted.
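The push-back behaviour can be pictured with a small stand-in class. This is only a sketch under stated assumptions: MiniCharSource and its private fields are hypothetical and model just the_char, next_char() and buff_text() over a plain string, with no entity stack.

```python
class MiniCharSource:
    """Hypothetical stand-in for the parser's character handling."""

    def __init__(self, text):
        self._chars = list(text)
        self._pos = 0
        self._buffer = []      # pushed-back characters; next one is last
        self.the_char = None
        self.next_char()

    def next_char(self):
        """Move to the next character, draining the buffer first."""
        if self._buffer:
            self.the_char = self._buffer.pop()
        elif self._pos < len(self._chars):
            self.the_char = self._chars[self._pos]
            self._pos += 1
        else:
            self.the_char = None   # end of stream

    def buff_text(self, unused_chars):
        """Push characters back so that they are parsed next."""
        if not unused_chars:
            return
        if self.the_char is not None:
            # Save the current character; it is parsed again later.
            self._buffer.append(self.the_char)
        self._buffer.extend(reversed(unused_chars))
        self.next_char()
```

With a source created from "cd", calling buff_text("ab") while the_char is "c" means the characters are subsequently read in the order a, b, c, d.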

push_entity(entity)¶

Starts parsing an entity

entity: An XMLEntity instance which is to be parsed.

the_char is set to the current character in the entity’s stream. The current entity is pushed onto an internal stack and will be resumed when this entity has been parsed completely.

Note that in the degenerate case where the entity being pushed is empty (or is already positioned at the end of the file) then push_entity does nothing.

check_encoding(entity, declared_encoding)¶

Checks the entity against the declared encoding

entity: An XMLEntity instance which is being parsed.
declared_encoding: A string containing the declared encoding in any declaration or None if there was no declared encoding in the entity.

get_external_entity()¶

Returns the external entity currently being parsed.

If no external entity is being parsed then None is returned.

standalone()¶

True if the document should be treated as standalone.

A document may be declared standalone or it may effectively be standalone due to the absence of a DTD, or the absence of an external DTD subset and parameter entity references.
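The rule described above can be written as a small predicate. This is a sketch only: the function name and its four boolean arguments are invented for illustration and do not correspond to attributes of XMLParser.

```python
def effectively_standalone(declared, has_dtd, has_external_subset,
                           has_pe_references):
    """Sketch of the standalone rule described in the text.

    All four arguments are hypothetical booleans summarising the document.
    """
    if declared:
        return True          # explicitly declared standalone
    if not has_dtd:
        return True          # no DTD at all
    # A DTD is present: standalone only if there is neither an external
    # subset nor any parameter entity reference.
    return not has_external_subset and not has_pe_references
```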

declared_standalone()¶: True if the current document was declared standalone.

well_formedness_error(msg='well-formedness error', error_class=<class 'pyslet.xml20081126.structures.XMLWellFormedError'>)¶

Raises an XMLWellFormedError error.

msg: An optional message string
error_class: An optional error class, which must be a class object derived from XMLWellFormedError.

Called by the parsing methods whenever a well-formedness constraint is violated.

The method raises an instance of error_class and does not return. This method can be overridden by derived parsers to implement more sophisticated error logging.

validity_error(msg='validity error', error=<class 'pyslet.xml20081126.structures.XMLValidityError'>)¶

Called when the parser encounters a validity error.

msg: An optional message string
error: An optional error class or instance, which must be a (class) object derived from XMLValidityError.

The behaviour varies depending on the setting of the checkValidity and raiseValidityErrors options. The default (both False) causes validity errors to be ignored. When checking validity an error message is logged to nonFatalErrors and valid is set to False. Furthermore, if raiseValidityErrors is True error is raised (or a new instance of error is raised) and parsing terminates.

This method can be overridden by derived parsers to implement more sophisticated error logging.
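The interaction between checkValidity and raiseValidityErrors can be sketched with a stand-in class. MiniValidator and ValidityErrorSketch are hypothetical names, and the sketch assumes that errors are only raised when validity is being checked, which is one reading of the description above.

```python
class ValidityErrorSketch(Exception):
    """Hypothetical stand-in for XMLValidityError."""


class MiniValidator:
    """Hypothetical model of the validity-related parser flags."""

    def __init__(self, check_validity=False, raise_validity_errors=False):
        self.checkValidity = check_validity
        self.raiseValidityErrors = raise_validity_errors
        self.valid = True if check_validity else None
        self.nonFatalErrors = []

    def validity_error(self, msg="validity error",
                       error=ValidityErrorSketch):
        if not self.checkValidity:
            return                      # default: error is ignored
        self.valid = False
        self.nonFatalErrors.append(msg)
        if self.raiseValidityErrors:
            if isinstance(error, Exception):
                raise error             # an instance was passed in
            raise error(msg)            # a class was passed in
```

A derived class could override validity_error() in the same way to add logging before deferring to the base behaviour.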

compatibility_error(msg='compatibility error')¶

Called when the parser encounters a compatibility error.

msg: An optional message string

The behaviour varies depending on the setting of the checkCompatibility flag. The default (False) causes compatibility errors to be ignored. When checking compatibility an error message is logged to nonFatalErrors.

This method can be overridden by derived parsers to implement more sophisticated error logging.

processing_error(msg='Processing error')¶

Called when the parser encounters a general processing error.

msg: An optional message string

The behaviour varies depending on the setting of the checkAllErrors flag. The default (False) causes processing errors to be ignored. When checking all errors an error message is logged to nonFatalErrors.

This method can be overridden by derived parsers to implement more sophisticated error logging.

parse_literal(match)¶

Parses an optional literal string.

match: The literal string to match

Returns True if match is successfully parsed and False otherwise. There is no partial matching: if match is not found then the parser is left in its original position.
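The all-or-nothing matching rule can be shown with a tiny position-based sketch. LiteralSketch is a hypothetical class that works over a plain string rather than an entity stream.

```python
class LiteralSketch:
    """Hypothetical matcher over a string with a current position."""

    def __init__(self, text):
        self.text = text
        self.pos = 0

    def parse_literal(self, match):
        """Consume match if it occurs at the current position."""
        if self.text.startswith(match, self.pos):
            self.pos += len(match)
            return True
        # No partial consumption: the position is left unchanged.
        return False
```

For the input "<!DOCTYPE x>", an attempt to match "<!--" fails and leaves the position at the start, so a following attempt to match "<!DOCTYPE" succeeds.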

parse_required_literal(match, production='Literal String')¶

Parses a required literal string.

match: The literal string to match
production: An optional string describing the context in which the literal was expected.

There is no return value. If the literal is not matched a well-formedness error is raised.

parse_decimal_digits()¶

Parses a, possibly empty, string of decimal digits.

Decimal digits match [0-9]. Returns the parsed digits as a string or an empty string if no digits were matched.

parse_required_decimal_digits(production='Digits')¶

Parses a required string of decimal digits.

production: An optional string describing the context in which the decimal digits were expected.

Decimal digits match [0-9]. Returns the parsed digits as a string.

parse_hex_digits()¶

Parses a, possibly empty, string of hexadecimal digits

Hex digits match [0-9a-fA-F]. Returns the parsed digits as a string or an empty string if no digits were matched.

parse_required_hex_digits(production='Hex Digits')¶

Parses a required string of hexadecimal digits.

production: An optional string describing the context in which the hexadecimal digits were expected.

Hex digits match [0-9a-fA-F]. Returns the parsed digits as a string.

parse_quote(q=None)¶

Parses the quote character

q: An optional character to parse as if it were a quote. By default either a single quote (') or a double quote (") is accepted.

Returns the character parsed or raises a well-formedness error.

parse_document(doc=None)¶

[1] document: parses a Document.

doc: The Document instance that will be parsed. The declaration, dtd and elements are added to this document. If doc is None then a new instance is created using get_document_class() to identify the correct class to use to represent the document based on information in the prolog or, if the prolog lacks a declaration, the root element.

This method returns the document that was parsed, an instance of Document.

get_document_class(dtd)¶

Returns a class object suitable for this dtd

dtd: An XMLDTD instance

Returns a class object derived from Document suitable for representing a document with the given document type declaration.

In cases where no doctype declaration is made a dummy declaration is created based on the name of the root element. For example, if the root element is called “database” then the dtd is treated as if it was declared as follows:

<!DOCTYPE database>

This default implementation uses the following three pieces of information to locate a class registered with RegisterDocumentClass(): the PublicID, the SystemID and the name of the root element. If an exact match is not found then wildcard matches are attempted, ignoring the SystemID, PublicID and finally the root element in turn. If a document class still cannot be found then wildcard matches are tried matching only the PublicID, SystemID and root element in turn.

If no document class can be found, Document is returned.
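The lookup order described above can be sketched as follows. This assumes, purely for illustration, a table keyed on (PublicID, SystemID, root name) tuples in which None acts as a wildcard; candidate_keys and find_document_class are hypothetical helpers, not part of the XMLParser API.

```python
def candidate_keys(public_id, system_id, root_name):
    """Yield lookup keys from most to least specific (None = wildcard)."""
    yield (public_id, system_id, root_name)   # exact match
    yield (public_id, None, root_name)        # ignoring the SystemID
    yield (None, system_id, root_name)        # ignoring the PublicID
    yield (public_id, system_id, None)        # ignoring the root element
    yield (public_id, None, None)             # matching only the PublicID
    yield (None, system_id, None)             # matching only the SystemID
    yield (None, None, root_name)             # matching only the root element


def find_document_class(table, public_id, system_id, root_name,
                        default=None):
    """Return the first class registered under a candidate key."""
    for key in candidate_keys(public_id, system_id, root_name):
        if key in table:
            return table[key]
    return default
```

Passing the generic document class as default mirrors the fallback to Document described above.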

is_s()¶

Tests if the current character matches S

Returns a boolean value, True if S is matched.

By default calls the module-level function is_s().

In Unicode compatibility mode the function maps the Unicode white space characters at code points U+2028 and U+2029 to line feed and space respectively.
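As a rough sketch of the test described here, the function below checks a character against production [3] S and optionally applies the mapping for U+2028 and U+2029 first. The names S_CHARS, UNICODE_COMPAT_MAP and is_s_compat are invented for this example.

```python
# Production [3] S allows space, tab, carriage return and line feed.
S_CHARS = frozenset("\x20\x09\x0d\x0a")

# Mapping applied in the (assumed) Unicode compatibility mode.
UNICODE_COMPAT_MAP = {"\u2028": "\n", "\u2029": " "}


def is_s_compat(c, unicode_compatibility=False):
    """Return True if c matches S; None (end of stream) never matches."""
    if unicode_compatibility:
        c = UNICODE_COMPAT_MAP.get(c, c)
    return c in S_CHARS
```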

parse_s()¶

[3] S

Parses white space returning it as a string. If there is no white space at the current position then an empty string is returned.

The productions in the specification do not make explicit mention of parameter entity references; they are covered by the general statement that “Parameter entity references are recognized anywhere in the DTD...” In practice, this means that while parsing the DTD, anywhere that an S is permitted a parameter entity reference may also be recognized. This method implements this behaviour, recognizing parameter entity references within S when refMode is RefModeInDTD.

parse_required_s(production='[3] S')¶

[3] S: Parses required white space

production: An optional string describing the production being parsed. This allows more useful errors than simply ‘expected [3] S’ to be logged.

If there is no white space then a well-formedness error is raised.

parse_name()¶

[5] Name

Parses an optional name. The name is returned as a unicode string. If no Name can be parsed then None is returned.

parse_required_name(production='Name')¶

[5] Name

production: An optional string describing the production being parsed. This allows more useful errors than simply ‘expected [5] Name’ to be logged.

Parses a required Name, returning it as a string. If no name can be parsed then a well-formedness error is raised.

parse_names()¶

[6] Names

This method returns a tuple of unicode strings. If no names can be parsed then None is returned.

parse_nmtoken()¶

[7] Nmtoken

Returns a Nmtoken as a string, or None if no Nmtoken can be parsed.

parse_nmtokens()¶

[8] Nmtokens

This method returns a tuple of unicode strings. If no tokens can be parsed then None is returned.

parse_entity_value()¶

[9] EntityValue

Parses an EntityValue, returning it as a unicode string.

This method automatically expands other parameter entity references but does not expand general or character references.

parse_att_value()¶

[10] AttValue

The value is returned without the surrounding quotes and with any references expanded.

The behaviour of this method is affected significantly by the setting of the dontCheckWellFormedness flag. When set, attribute values can be parsed without surrounding quotes. For compatibility with SGML these values should match one of the formal value types (e.g., Name) but this is not enforced so values like width=100% can be parsed without error.

parse_system_literal()¶

[11] SystemLiteral

The value of the literal is returned as a string without the enclosing quotes.

parse_pubid_literal()¶

[12] PubidLiteral

The value of the literal is returned as a string without the enclosing quotes.

parse_char_data()¶

[14] CharData

Parses a run of character data. The method adds the parsed data to the current element. In the default parsing mode it returns None.

When the parser option sgmlOmittag is selected the method returns any parsed character data that could not be added to the current element due to a model violation. Note that in this SGML-like mode any S is treated as being in the current element as the violation doesn’t occur until the first non-S character (so any implied start tag is treated as being immediately prior to the first non-S).

parse_comment(got_literal=False)¶

[15] Comment

got_literal: If True then the method assumes that the ‘<!--’ literal has already been parsed.

Returns the comment as a string.

parse_pi(got_literal=False)¶

[16] PI: parses a processing instruction.

got_literal: If True the method assumes the ‘<?’ literal has already been parsed.

This method calls the Node.ProcessingInstruction() of the current element or of the document if no element has been parsed yet.

parse_pi_target()¶

[17] PITarget

Parses a processing instruction target name, the name is returned.

parse_cdsect(got_literal=False, cdend=u']]>')¶

[18] CDSect

got_literal: If True then the method assumes the initial literal has already been parsed. (By default, CDStart.)
cdend: Optional string. The literal used to signify the end of the CDATA section can be overridden by passing an alternative literal in cdend. Defaults to ‘]]>’

This method adds any parsed data to the current element, there is no return value.

parse_cdstart()¶

[19] CDStart

Parses the literal that starts a CDATA section.

parse_cdata(cdend=']]>')¶

[20] CData

Parses a run of CData up to but not including cdend.

This method adds any parsed data to the current element, there is no return value.

parse_cdend()¶

[21] CDEnd

Parses the end of a CDATA section.

parse_prolog()¶

[22] prolog

Parses the document prolog, including the XML declaration and dtd.

parse_xml_decl(got_literal=False)¶

[23] XMLDecl

got_literal: If True the initial literal ‘<?xml’ is assumed to have already been parsed.

Returns an XMLDeclaration instance. Also, if an encoding is given in the declaration then the method changes the encoding of the current entity to match. For more information see ChangeEncoding().

parse_version_info(got_literal=False)¶

[24] VersionInfo

got_literal: If True, the method assumes the initial white space and ‘version’ literal have been parsed already.

The version number is returned as a string.

parse_eq(production='[25] Eq')¶

[25] Eq

production: An optional string describing the production being parsed. This allows more useful errors than simply ‘expected [25] Eq’ to be logged.

Parses an equal sign, optionally surrounded by white space

parse_version_num()¶

[26] VersionNum

Parses the XML version number, returning it as a string, e.g., “1.0”.

parse_misc()¶

[27] Misc

This method parses everything that matches the production Misc*

parse_doctypedecl(got_literal=False)¶

[28] doctypedecl

got_literal: If True, the method assumes the initial ‘<!DOCTYPE’ literal has been parsed already.

This method creates a new instance of XMLDTD and assigns it to dtd; it also returns this instance as the result.

parse_decl_sep()¶

[28a] DeclSep

Parses a declaration separator.

parse_int_subset()¶

[28b] intSubset

Parses an internal subset.

parse_markup_decl(got_literal=False)¶

[29] markupDecl

got_literal: If True, the method assumes the initial ‘<’ literal has been parsed already.

Returns True if a markupDecl was found, False otherwise.

parse_ext_subset()¶

[30] extSubset

Parses an external subset

parse_ext_subset_decl()¶

[31] extSubsetDecl

Parses declarations in the external subset.

check_pe_between_declarations(check_entity)¶

[31] extSubsetDecl

check_entity: An XMLEntity object, the entity we should still be parsing.

Checks the well-formedness constraint on use of PEs between declarations.

parse_sd_decl(got_literal=False)¶

[32] SDDecl

got_literal: If True, the method assumes the initial ‘standalone’ literal has been parsed already.

Returns True if the document should be treated as standalone; False otherwise.

parse_element()¶

[39] element

The class used to represent the element is determined by calling the get_element_class() method of the current document. If there is no document yet then a new document is created automatically (see parse_document() for more information).

The element is added as a child of the current element using Node.ChildElement().

The method returns a boolean value:

True: the element was parsed normally
False: the element is not allowed in this context

The second case only occurs when the sgmlOmittag option is in use and it indicates that the content of the enclosing element has ended. The Tag is buffered so that it can be reparsed when the stack of nested parse_content() and parse_element() calls is unwound to the point where it is allowed by the context.

check_attributes(name, attrs)¶

Checks attrs against the declarations for an element.

name: The name of the element
attrs: A dictionary of attributes

Adds any omitted defaults to the attribute list. Also, checks the validity of the attributes which may result in values being further normalized as per the rules for collapsing spaces in tokenized values.
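The space-collapsing step mentioned above follows the attribute-value normalization rule in the XML specification for non-CDATA types: leading and trailing space (#x20) characters are discarded and runs of spaces are replaced by a single space. A minimal sketch, using the hypothetical helper name collapse_spaces:

```python
def collapse_spaces(value):
    """Normalize a tokenized attribute value.

    Assumes earlier normalization has already turned other white space
    into #x20, so only the space character is handled here.
    """
    return " ".join(token for token in value.split(" ") if token)
```

For example, a value declared as NMTOKENS and written as "  a   b " would be reported as "a b".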

match_xml_name(element, name)¶

Tests if name is a possible name for element.

element: An Element instance.
name: The name of an end tag, as a string.

This method is used by the parser to determine if an end tag is the end tag of this element. It is provided as a separate method to allow it to be overridden by derived parsers.

The default implementation simply compares name with GetXMLName()

check_expected_particle(name)¶

Checks the validity of element name in the current context.

name: The name of the element encountered. An empty string for name indicates the enclosing end tag was found.

This method also maintains the position of a pointer into the element’s content model.

get_stag_class(name, attrs=None)¶

[40] STag

name: The name of the element being started
attrs: A dictionary of attributes of the element being started

Returns information suitable for starting the element in the current context.

If there is no Document instance yet this method assumes that it is being called for the root element and selects an appropriate class based on the contents of the prolog and/or name.

When using the sgmlOmittag option name may be None indicating that the method should return information about the element implied by PCDATA in the current context (only called when an attempt to add data to the current context has already failed).

The result is a triple of:

element_class: the element class that this STag must introduce or None if this STag does not belong (directly or indirectly) in the current context
element_name: the name of the element (to pass to ChildElement) or None to use the default
buff_flag: True indicates an omitted tag and that the triggering STag (i.e., the STag with name name) should be buffered.

parse_stag()¶

[40] STag, [44] EmptyElemTag

This method returns a tuple of (name, attrs, emptyFlag) where:

name: the name of the element parsed
attrs: a dictionary of attribute values keyed by attribute name
emptyFlag: a boolean; True indicates that the tag was an empty element tag.

parse_attribute()¶

[41] Attribute

Returns a tuple of (name, value) where:

name: is the name of the attribute or None if sgmlShorttag is True and a short form attribute value was supplied.
value: the attribute value.

If dontCheckWellFormedness is set the parser uses a very generous form of parsing attribute values to accommodate common syntax errors.

parse_etag(got_literal=False)¶

[42] ETag

got_literal: If True, the method assumes the initial ‘</’ literal has been parsed already.

The method returns the name of the end element parsed.

parse_content()¶

[43] content

The method returns:

True: indicates that the content was parsed normally
False: indicates that the content contained data or markup not allowed in this context

The second case only occurs when the sgmlOmittag option is in use and it indicates that the enclosing element has ended (i.e., the element’s ETag has been omitted). See parse_element() for more information.

handle_data(data, cdata=False)¶

[43] content

data: A string of data to be handled
cdata: If True data is treated as character data (even if it matches the production for S).

Data is handled by calling AddData() even if the data is optional white space.

unhandled_data(data)¶

[43] content

data: A string of unhandled data

This method is only called when the sgmlOmittag option is in use. It processes data that occurs in a context where data is not allowed.

It returns a boolean result:

True: the data was consumed by a sub-element (with an omitted start tag)
False: the data has been buffered and indicates the end of the current content (an omitted end tag).

parse_empty_elem_tag()¶

[44] EmptyElemTag

There is no method for parsing empty element tags alone.

This method raises NotImplementedError. Instead, you should call parse_stag() and examine the result. If the emptyFlag item of the returned tuple is True then an empty element tag was parsed.

parse_element_decl(got_literal=False)¶

[45] elementdecl

got_literal: If True, the method assumes that the ‘<!ELEMENT’ literal has already been parsed.

Declares the element type in the dtd, (if present). There is no return result.

parse_content_spec(etype)¶

[46] contentspec

etype: An ElementType instance.

Sets the contentType and contentModel attributes of etype, there is no return value.

parse_children(got_literal=False, group_entity=None)¶

[47] children

got_literal: If True, the method assumes that the initial ‘(‘ literal has already been parsed, including any following white space.
group_entity: An optional XMLEntity object. If got_literal is True then group_entity must be the entity in which the opening ‘(‘ was parsed which started the choice group.

The method returns an instance of XMLContentParticle.

parse_cp()¶

[48] cp

Returns an XMLContentParticle instance.

parse_choice(first_child=None, group_entity=None)¶

[49] choice

first_child: An optional XMLContentParticle instance. If present the method assumes that the first particle and any following white space has already been parsed.
group_entity: An optional XMLEntity object. If first_child is given then group_entity must be the entity in which the opening ‘(‘ was parsed which started the choice group.

Returns an XMLChoiceList instance.

parse_seq(first_child=None, group_entity=None)¶

[50] seq

first_child: An optional XMLContentParticle instance. If present the method assumes that the first particle and any following white space has already been parsed. In this case, group_entity must be set to the entity which contained the opening ‘(‘ literal.
group_entity: An optional XMLEntity object, see above.

Returns an XMLSequenceList instance.

parse_mixed(got_literal=False, group_entity=None)¶

[51] Mixed

got_literal: If True, the method assumes that the #PCDATA literal has already been parsed. In this case, group_entity must be set to the entity which contained the opening ‘(‘ literal.
group_entity: An optional XMLEntity object, see above.

Returns an instance of XMLChoiceList with occurrence ZeroOrMore representing the list of elements that may appear in the mixed content model. If the mixed model contains #PCDATA only the choice list will be empty.

parse_attlist_decl(got_literal=False)¶

[52] AttlistDecl

got_literal: If True, assumes that the leading ‘<!ATTLIST’ literal has already been parsed.

Declares the attributes in the dtd (if present). There is no return result.

parse_att_def(got_s=False)¶

[53] AttDef

got_s: If True, the method assumes that the leading S has already been parsed.

Returns an instance of XMLAttributeDefinition.

parse_att_type(a)¶

[54] AttType

a: A required XMLAttributeDefinition instance.

This method sets the type and values fields of a.

Note that, to avoid unnecessary look ahead, this method does not call parse_string_type() or parse_enumerated_type().

parse_string_type(a)¶

[55] StringType

a: A required XMLAttributeDefinition instance.

This method sets the type and values fields of a.

This method is provided for completeness. It is not called during normal parsing operations.

parse_tokenized_type(a)¶

[56] TokenizedType

a: A required XMLAttributeDefinition instance.

This method sets the type and values fields of a.

parse_enumerated_type(a)¶

[57] EnumeratedType

a: A required XMLAttributeDefinition instance.

This method sets the type and values fields of a.

This method is provided for completeness. It is not called during normal parsing operations.

parse_notation_type(got_literal=False)¶

[58] NotationType

got_literal: If True, assumes that the leading ‘NOTATION’ literal has already been parsed.

Returns a list of strings representing the names of the declared notations being referred to.

parse_enumeration()¶

[59] Enumeration

Returns a dictionary of strings representing the tokens in the enumeration.

parse_default_decl(a)¶

[60] DefaultDecl: parses an attribute’s default declaration.

a: A required XMLAttributeDefinition instance.

This method sets the presence and defaultValue fields of a.

parse_conditional_sect(got_literal_entity=None)¶

[61] conditionalSect

got_literal_entity: An optional XMLEntity object. If given, the method assumes that the initial literal ‘<![‘ has already been parsed from that entity.

parse_include_sect(got_literal_entity=None)¶

[62] includeSect

got_literal_entity: An optional XMLEntity object. If given, the method assumes that the production, up to and including the keyword ‘INCLUDE’ has already been parsed and that the opening ‘<![‘ literal was parsed from that entity.

There is no return value.

parse_ignore_sect(got_literal_entity=None)¶

[63] ignoreSect

got_literal_entity: An optional XMLEntity object. If given, the method assumes that the production, up to and including the keyword ‘IGNORE’ has already been parsed and that the opening ‘<![‘ literal was parsed from this entity.

There is no return value.

parse_ignore_sect_contents()¶

[64] ignoreSectContents

Parses the contents of an ignored section. The method returns no data.

parse_ignore()¶

[65] Ignore

Parses a run of characters in an ignored section. This method returns no data.

parse_char_ref(got_literal=False)¶

[66] CharRef

got_literal: If True, assumes that the leading ‘&’ literal has already been parsed.

The method returns a unicode string containing the character referred to.
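The mapping from a character reference to the character it denotes can be sketched independently of the parser. The helper below is hypothetical: it works on a complete reference held in a string, whereas parse_char_ref() reads from the entity stream.

```python
import re

# Production [66] CharRef: '&#' [0-9]+ ';' | '&#x' [0-9a-fA-F]+ ';'
CHAR_REF = re.compile(r"&#(?:x([0-9a-fA-F]+)|([0-9]+));")


def expand_char_ref(text):
    """Return the character named by a complete character reference."""
    m = CHAR_REF.fullmatch(text)
    if m is None:
        raise ValueError("not a character reference: %r" % text)
    hex_digits, dec_digits = m.groups()
    code = int(hex_digits, 16) if hex_digits else int(dec_digits)
    return chr(code)
```

Both "&#38;" and "&#x26;" denote the ampersand character under this rule.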

parse_reference()¶

[67] Reference

This method returns any data parsed as a result of the reference. For a character reference this will be the character referred to. For a general entity the data returned will depend on the parsing context. For more information see parse_entity_ref().

parse_entity_ref(got_literal=False)¶

[68] EntityRef

got_literal: If True, assumes that the leading ‘&’ literal has already been parsed.

This method returns any data parsed as a result of the reference. For example, if this method is called in a context where entity references are bypassed then the string returned will be the literal characters parsed, e.g., “&ref;”.

If the entity reference is parsed successfully in a context where Entity references are recognized, the reference is looked up according to the rules for validating and non-validating parsers and, if required by the parsing mode, the entity is opened and pushed onto the parser so that parsing continues with the first character of the entity’s replacement text.

A special case is made for the predefined entities. When parsed in a context where entity references are recognized these entities are expanded immediately and the resulting character returned. For example, the entity reference &amp; returns the ‘&’ character instead of pushing an entity with replacement text ‘&#38;’.

Inclusion of an unescaped & is common so when we are not checking well-formedness we treat ‘&’ not followed by a name as if it were ‘&amp;’. Similarly we are generous about a missing ‘;’.
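Immediate expansion of the predefined entities can be illustrated with the PredefinedEntities mapping shown earlier. The function expand_predefined is a hypothetical helper that operates on a whole string; it is not how the parser itself is structured, and the name pattern is deliberately simplified to ASCII letters and digits.

```python
import re

# Same content as the PredefinedEntities class attribute.
PREDEFINED = {'amp': '&', 'lt': '<', 'gt': '>', 'apos': "'", 'quot': '"'}

# Simplified entity reference pattern (ASCII names only, an assumption).
ENTITY_REF = re.compile(r"&([A-Za-z_:][A-Za-z0-9_.:-]*);")


def expand_predefined(text):
    """Replace references to predefined entities; leave others alone."""
    def replace(match):
        return PREDEFINED.get(match.group(1), match.group(0))
    return ENTITY_REF.sub(replace, text)
```

References to entities that are not in the table, such as "&other;", are left untouched by this sketch, whereas the real parser would look them up in the DTD.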

lookup_predefined_entity(name)¶

Looks up pre-defined entities, e.g., “lt”

This method can be overridden by variant parsers to implement other pre-defined entity tables.

parse_pe_reference(got_literal=False)¶

[69] PEReference

got_literal: If True, assumes that the initial ‘%’ literal has already been parsed.

This method returns any data parsed as a result of the reference. Normally this will be an empty string because the method is typically called in contexts where PEReferences are recognized. However, if this method is called in a context where PEReferences are not recognized the returned string will be the literal characters parsed, e.g., “%ref;”.

If the parameter entity reference is parsed successfully in a context where PEReferences are recognized, the reference is looked up according to the rules for validating and non-validating parsers and, if required by the parsing mode, the entity is opened and pushed onto the parser so that parsing continues with the first character of the entity’s replacement text.

parse_entity_decl(got_literal=False)¶

[70] EntityDecl

got_literal: If True, assumes that the literal ‘<!ENTITY’ has already been parsed.

Returns an instance of either XMLGeneralEntity or XMLParameterEntity depending on the type of entity parsed.

parse_ge_decl(got_literal=False)¶

[71] GEDecl

got_literal: If True, assumes that the literal ‘<!ENTITY’ and the required S have already been parsed.

Returns an instance of XMLGeneralEntity.

parse_pe_decl(got_literal=False)¶

[72] PEDecl

got_literal: If True, assumes that the literal ‘<!ENTITY’ and the required S have already been parsed.

Returns an instance of XMLParameterEntity.

parse_entity_def(ge)¶

[73] EntityDef

ge: The general entity being parsed, an XMLGeneralEntity instance.

This method sets the definition and notation fields from the parsed entity definition.

parse_pe_def(pe)¶

[74] PEDef

pe: The parameter entity being parsed, an XMLParameterEntity instance.

This method sets the definition field from the parsed parameter entity definition. There is no return value.

parse_external_id(allow_public_only=False)¶

[75] ExternalID

allow_public_only

An external ID must have a SYSTEM literal, and may have a PUBLIC identifier. If allow_public_only is True then the method will also allow an external identifier with a PUBLIC identifier but no SYSTEM literal. In this mode the parser behaves as it would when parsing the production:

(ExternalID | PublicID) S?

Returns an XMLExternalID instance.

resolve_external_id(external_id, entity=None)¶

[75] ExternalID: resolves an external ID, returning a URI.

external_id: An XMLExternalID instance.
entity: An optional XMLEntity instance. Can be used to force the resolution of relative URIs to be relative to the base of the given entity. If it is None then the currently open external entity (where available) is used instead.

Returns an instance of pyslet.rfc2396.URI or None if the external ID cannot be resolved.

The default implementation simply calls get_location() with the entity’s base URL and ignores the public ID. Derived parsers may recognize public identifiers and resolve accordingly.

parse_ndata_decl(got_literal=False)¶

[76] NDataDecl

got_literal: If True, assumes that the literal ‘NDATA’ has already been parsed.

Returns the name of the notation used by the unparsed entity as a string without the preceding ‘NDATA’ literal.

parse_text_decl(got_literal=False)¶

[77] TextDecl

got_literal: If True, assumes that the literal ‘<?xml’ has already been parsed.

Returns an XMLTextDeclaration instance.

parse_encoding_decl(got_literal=False)¶

[80] EncodingDecl

got_literal: If True, assumes that the literal ‘encoding’ has already been parsed.

Returns the declaration name without the enclosing quotes.

parse_enc_name()¶

[81] EncName

Returns the encoding name as a string or None if no valid encoding name start character was found.

parse_notation_decl(got_literal=False)¶

[82] NotationDecl

got_literal: If True, assumes that the literal ‘<!NOTATION’ has already been parsed.

Declares the notation in the dtd, (if present). There is no return result.

parse_public_id()¶

[83] PublicID

The literal string is returned without the PUBLIC prefix or the enclosing quotes.