2.5. HTML

This module defines functions and classes for working with HTML documents. The version of the standard implemented is, loosely speaking, the HTML 4.01 Specification: http://www.w3.org/TR/html401/

This module contains code that can help parse HTML documents into classes based on the basic xml20081126 XML module, acting as a gateway to XHTML.

This module also exposes a number of internal functions typically defined privately in HTML parser implementations which make it easier to reuse concepts from HTML in other modules. For example, the LengthType used for storing HTML lengths (which can be pixel or relative) is used extensively by imsqtiv2p1.

pyslet.html40_19991224.HTML40_PUBLICID

The public ID to use in the declaration of an HTML document

pyslet.html40_19991224.XHTML_NAMESPACE

The namespace to use in the declaration of an XHTML document

2.5.1. (X)HTML Documents

This module contains an experimental class for working with HTML documents. At the time of writing the implementation is designed to provide just enough HTML parsing to support the use of HTML within other standards (such as Atom and QTI).

class pyslet.html40_19991224.XHTMLDocument(**args)

Bases: pyslet.xmlnames20091208.XMLNSDocument

Represents an HTML document.

Although HTML documents are not always represented using XML they can be, and therefore we base our implementation on the pyslet.xmlnames20091208.XMLNSDocument class - a namespace-aware variant of the basic pyslet.xml20081126.XMLDocument class.

classMap = {('http://www.w3.org/1999/xhtml', 'dl'): <class 'pyslet.html40_19991224.DL'>, ('http://www.w3.org/1999/xhtml', 'ins'): <class 'pyslet.html40_19991224.Ins'>, ('http://www.w3.org/1999/xhtml', 'optgroup'): <class 'pyslet.html40_19991224.OptGroup'>, ('http://www.w3.org/1999/xhtml', 'thead'): <class 'pyslet.html40_19991224.THead'>, ('http://www.w3.org/1999/xhtml', 'var'): <class 'pyslet.html40_19991224.Var'>, ('http://www.w3.org/1999/xhtml', 'h2'): <class 'pyslet.html40_19991224.H2'>, ('http://www.w3.org/1999/xhtml', 'frameset'): <class 'pyslet.html40_19991224.Frameset'>, ('http://www.w3.org/1999/xhtml', 'acronym'): <class 'pyslet.html40_19991224.Acronym'>, ('http://www.w3.org/1999/xhtml', 'br'): <class 'pyslet.html40_19991224.Br'>, ('http://www.w3.org/1999/xhtml', 'param'): <class 'pyslet.html40_19991224.Param'>, ('http://www.w3.org/1999/xhtml', 'input'): <class 'pyslet.html40_19991224.Input'>, ('http://www.w3.org/1999/xhtml', 'fieldset'): <class 'pyslet.html40_19991224.FieldSet'>, ('http://www.w3.org/1999/xhtml', 'basefont'): <class 'pyslet.html40_19991224.BaseFont'>, ('http://www.w3.org/1999/xhtml', 'u'): <class 'pyslet.html40_19991224.U'>, ('http://www.w3.org/1999/xhtml', 'strong'): <class 'pyslet.html40_19991224.Strong'>, ('http://www.w3.org/1999/xhtml', 'noscript'): <class 'pyslet.html40_19991224.NoScript'>, ('http://www.w3.org/1999/xhtml', 'small'): <class 'pyslet.html40_19991224.Small'>, ('http://www.w3.org/1999/xhtml', 'caption'): <class 'pyslet.html40_19991224.Caption'>, ('http://www.w3.org/1999/xhtml', 'sup'): <class 'pyslet.html40_19991224.Sup'>, ('http://www.w3.org/1999/xhtml', 'big'): <class 'pyslet.html40_19991224.Big'>, ('http://www.w3.org/1999/xhtml', 'em'): <class 'pyslet.html40_19991224.Em'>, ('http://www.w3.org/1999/xhtml', 'form'): <class 'pyslet.html40_19991224.Form'>, ('http://www.w3.org/1999/xhtml', 'meta'): <class 'pyslet.html40_19991224.Meta'>, ('http://www.w3.org/1999/xhtml', 'blockquote'): <class 
'pyslet.html40_19991224.Blockquote'>, ('http://www.w3.org/1999/xhtml', 'a'): <class 'pyslet.html40_19991224.A'>, ('http://www.w3.org/1999/xhtml', 'strike'): <class 'pyslet.html40_19991224.Strike'>, ('http://www.w3.org/1999/xhtml', 'legend'): <class 'pyslet.html40_19991224.Legend'>, ('http://www.w3.org/1999/xhtml', 'tt'): <class 'pyslet.html40_19991224.TT'>, ('http://www.w3.org/1999/xhtml', 'h3'): <class 'pyslet.html40_19991224.H3'>, ('http://www.w3.org/1999/xhtml', 'area'): <class 'pyslet.html40_19991224.Area'>, ('http://www.w3.org/1999/xhtml', 'tfoot'): <class 'pyslet.html40_19991224.TFoot'>, ('http://www.w3.org/1999/xhtml', 'script'): <class 'pyslet.html40_19991224.Script'>, ('http://www.w3.org/1999/xhtml', 'center'): <class 'pyslet.html40_19991224.Center'>, ('http://www.w3.org/1999/xhtml', 'q'): <class 'pyslet.html40_19991224.Q'>, ('http://www.w3.org/1999/xhtml', 'cite'): <class 'pyslet.html40_19991224.Cite'>, ('http://www.w3.org/1999/xhtml', 'frame'): <class 'pyslet.html40_19991224.Frame'>, ('http://www.w3.org/1999/xhtml', 'address'): <class 'pyslet.html40_19991224.Address'>, ('http://www.w3.org/1999/xhtml', 'hr'): <class 'pyslet.html40_19991224.HR'>, ('http://www.w3.org/1999/xhtml', 'li'): <class 'pyslet.html40_19991224.LI'>, ('http://www.w3.org/1999/xhtml', 'map'): <class 'pyslet.html40_19991224.Map'>, ('http://www.w3.org/1999/xhtml', 'h4'): <class 'pyslet.html40_19991224.H4'>, ('http://www.w3.org/1999/xhtml', 'td'): <class 'pyslet.html40_19991224.TD'>, ('http://www.w3.org/1999/xhtml', 'table'): <class 'pyslet.html40_19991224.Table'>, ('http://www.w3.org/1999/xhtml', 'span'): <class 'pyslet.html40_19991224.Span'>, ('http://www.w3.org/1999/xhtml', 'ul'): <class 'pyslet.html40_19991224.UL'>, ('http://www.w3.org/1999/xhtml', 'head'): <class 'pyslet.html40_19991224.Head'>, ('http://www.w3.org/1999/xhtml', 'samp'): <class 'pyslet.html40_19991224.Samp'>, ('http://www.w3.org/1999/xhtml', 'tr'): <class 'pyslet.html40_19991224.TR'>, ('http://www.w3.org/1999/xhtml', 
'sub'): <class 'pyslet.html40_19991224.Sub'>, ('http://www.w3.org/1999/xhtml', 's'): <class 'pyslet.html40_19991224.S'>, ('http://www.w3.org/1999/xhtml', 'select'): <class 'pyslet.html40_19991224.Select'>, ('http://www.w3.org/1999/xhtml', 'col'): <class 'pyslet.html40_19991224.Col'>, ('http://www.w3.org/1999/xhtml', 'dd'): <class 'pyslet.html40_19991224.DD'>, ('http://www.w3.org/1999/xhtml', 'iframe'): <class 'pyslet.html40_19991224.IFrame'>, ('http://www.w3.org/1999/xhtml', 'abbr'): <class 'pyslet.html40_19991224.Abbr'>, ('http://www.w3.org/1999/xhtml', 'font'): <class 'pyslet.html40_19991224.Font'>, ('http://www.w3.org/1999/xhtml', 'tbody'): <class 'pyslet.html40_19991224.TBody'>, ('http://www.w3.org/1999/xhtml', 'img'): <class 'pyslet.html40_19991224.Img'>, ('http://www.w3.org/1999/xhtml', 'object'): <class 'pyslet.html40_19991224.Object'>, ('http://www.w3.org/1999/xhtml', 'bdo'): <class 'pyslet.html40_19991224.BDO'>, ('http://www.w3.org/1999/xhtml', 'body'): <class 'pyslet.html40_19991224.Body'>, ('http://www.w3.org/1999/xhtml', 'dt'): <class 'pyslet.html40_19991224.DT'>, ('http://www.w3.org/1999/xhtml', 'base'): <class 'pyslet.html40_19991224.Base'>, ('http://www.w3.org/1999/xhtml', 'th'): <class 'pyslet.html40_19991224.TH'>, ('http://www.w3.org/1999/xhtml', 'label'): <class 'pyslet.html40_19991224.Label'>, ('http://www.w3.org/1999/xhtml', 'textarea'): <class 'pyslet.html40_19991224.TextArea'>, ('http://www.w3.org/1999/xhtml', 'dfn'): <class 'pyslet.html40_19991224.Dfn'>, ('http://www.w3.org/1999/xhtml', 'button'): <class 'pyslet.html40_19991224.Button'>, ('http://www.w3.org/1999/xhtml', 'ol'): <class 'pyslet.html40_19991224.OL'>, ('http://www.w3.org/1999/xhtml', 'h5'): <class 'pyslet.html40_19991224.H5'>, ('http://www.w3.org/1999/xhtml', 'link'): <class 'pyslet.html40_19991224.Link'>, ('http://www.w3.org/1999/xhtml', 'pre'): <class 'pyslet.html40_19991224.Pre'>, ('http://www.w3.org/1999/xhtml', 'colgroup'): <class 'pyslet.html40_19991224.ColGroup'>, 
('http://www.w3.org/1999/xhtml', 'style'): <class 'pyslet.html40_19991224.Style'>, ('http://www.w3.org/1999/xhtml', 'div'): <class 'pyslet.html40_19991224.Div'>, ('http://www.w3.org/1999/xhtml', 'h6'): <class 'pyslet.html40_19991224.H6'>, ('http://www.w3.org/1999/xhtml', 'i'): <class 'pyslet.html40_19991224.I'>, ('http://www.w3.org/1999/xhtml', 'title'): <class 'pyslet.html40_19991224.Title'>, ('http://www.w3.org/1999/xhtml', 'code'): <class 'pyslet.html40_19991224.Code'>, ('http://www.w3.org/1999/xhtml', 'del'): <class 'pyslet.html40_19991224.Del'>, ('http://www.w3.org/1999/xhtml', 'kbd'): <class 'pyslet.html40_19991224.Kbd'>, ('http://www.w3.org/1999/xhtml', 'html'): <class 'pyslet.html40_19991224.HTML'>, ('http://www.w3.org/1999/xhtml', 'option'): <class 'pyslet.html40_19991224.Option'>, ('http://www.w3.org/1999/xhtml', 'p'): <class 'pyslet.html40_19991224.P'>, ('http://www.w3.org/1999/xhtml', 'h1'): <class 'pyslet.html40_19991224.H1'>, ('http://www.w3.org/1999/xhtml', 'b'): <class 'pyslet.html40_19991224.B'>}

Data member used to store a mapping from element names to the classes used to represent them. This mapping is initialized when the module is loaded.

XMLParser(entity)

We override the basic XML parser to use a custom parser that is intelligent about the use of omitted tags, elements defined to have CDATA content and other SGML-based variations. Note that if the document starts with an XML declaration then the normal XML parser is used instead.

You won’t normally need to call this method as it is invoked automatically when you call pyslet.xml20081126.XMLDocument.Read().

The result is always a proper element hierarchy rooted in an HTML node. Even if no tags are present at all, the parser will construct an HTML document containing a single Div element to hold the parsed text.

GetChildClass(stagClass)

Always returns HTML.

2.5.2. Basic Types

2.5.2.1. Length Values

Length values are used in many places in HTML, the most common being the width and height values on images. There are two ways of specifying a Length: as a simple integer number of pixels, or as a percentage of some base length defined by the context (such as the width of the browser window).

class pyslet.html40_19991224.LengthType(value, valueType=None)

Bases: object

Represents the HTML Length:

<!ENTITY % Length "CDATA" -- nn for pixels or nn% for percentage length -->

  • value can be either an integer value, another LengthType instance or a string.

  • if value is an integer then valueType can be used to select Pixel or Percentage

  • if value is a string then it is parsed for the length as per the format defined for length attributes in HTML.

By default values are assumed to be Pixel lengths but valueType can be used to force such a value to be a Percentage if desired.
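
The string format described above can be illustrated with a minimal, standalone sketch. It does not use Pyslet and is not the module's actual parser: the names parse_length and format_length are hypothetical, and tolerating surrounding whitespace is an assumption made for this example.

```python
import re

PIXEL = 0       # mirrors LengthType.Pixel
PERCENTAGE = 1  # mirrors LengthType.Percentage

# nn for pixels or nn% for a percentage length; tolerating surrounding
# whitespace is an assumption of this sketch
_LENGTH_RE = re.compile(r"^\s*(\d+)(%?)\s*$")


def parse_length(src):
    """Return a (value_type, value) pair parsed from an HTML length string."""
    match = _LENGTH_RE.match(src)
    if match is None:
        raise ValueError("not an HTML length: %r" % src)
    digits, percent = match.groups()
    return (PERCENTAGE if percent else PIXEL, int(digits))


def format_length(value_type, value):
    """Format a length as nn for pixels or nn% for percentages."""
    if value_type == PERCENTAGE:
        return "%d%%" % value
    return "%d" % value
```

For example, parse_length("60%") yields the pair (PERCENTAGE, 60), and passing that pair back to format_length gives the original string.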

Pixel = 0

data constant used to indicate pixel co-ordinates

Percentage = 1

data constant used to indicate relative (percentage) co-ordinates

type = None

type is one of the LengthType constants: Pixel or Percentage

value = None

value is the integer value of the length

__nonzero__()

Length values are non-zero if they have a non-zero value (pixel or percentage).

__str__()

Formats the length as a string of form nn for pixels or nn% for percentage.

__unicode__()

Formats the length as a unicode string of form nn for pixels or nn% for percentage.

GetValue(dim=None)

Returns the value of the Length. dim is the size of the dimension used for interpreting percentage values, i.e., 100% will return dim.
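
A rough sketch of how a percentage might be resolved against dim is shown here. The function resolve_length is hypothetical and standalone. Rounding to the nearest whole pixel and raising ValueError when dim is missing are assumptions of this example, not documented behaviour of GetValue.

```python
PIXEL = 0       # mirrors LengthType.Pixel
PERCENTAGE = 1  # mirrors LengthType.Percentage


def resolve_length(value_type, value, dim=None):
    """Return the length in pixels, using dim as the 100% reference."""
    if value_type == PERCENTAGE:
        if dim is None:
            # assumption: a percentage cannot be resolved without a base
            raise ValueError("a percentage length needs a base dimension")
        # round to the nearest whole pixel (a choice made for this sketch)
        return (value * dim + 50) // 100
    # pixel lengths are returned unchanged
    return value
```

With this sketch, resolve_length(PERCENTAGE, 100, 640) gives 640, matching the rule that 100% returns dim.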

Add(value)

Adds value to the length.

If value is another LengthType instance then its value is added to this instance's value only if the types match. If value is an integer it is assumed to be a pixel value. A type mismatch raises ValueError.
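
The type-matching rule can be sketched as follows. This is a standalone illustration using plain (value_type, value) pairs rather than LengthType instances. The name add_length is hypothetical, and unlike Add the sketch returns a new pair instead of updating an object in place.

```python
PIXEL = 0       # mirrors LengthType.Pixel
PERCENTAGE = 1  # mirrors LengthType.Percentage


def add_length(length, value):
    """Add value to length, where length is a (value_type, value) pair.

    value is either another (value_type, value) pair or a plain integer,
    which is treated as a pixel value.
    """
    if isinstance(value, int):
        value = (PIXEL, value)
    if value[0] != length[0]:
        # pixel and percentage lengths cannot be combined
        raise ValueError("cannot add lengths of different types")
    return (length[0], length[1] + value[1])
```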

__weakref__

list of weak references to the object (if defined)

2.5.2.2. Coordinate Values

Coordinate values are simple lists of Lengths. In most cases Pyslet doesn’t define special types for lists of basic types but coordinates are represented in attribute values using comma separation, not space-separation. As a result they require special processing in order to be decoded/encoded correctly from/to XML streams.

class pyslet.html40_19991224.Coords(values=None)

Represents HTML Coords values

<!ENTITY % Coords "CDATA" -- comma-separated list of lengths -->

Instances can be initialized from an existing list of LengthType, or a list of any object that can be used to construct a LengthType. It can also be constructed from a string formatted as per the HTML attribute definition.

The resulting object behaves like a list of LengthType instances, for example:

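from pyslet.html40_19991224 import Coords, LengthType

x = Coords("10, 50, 60%,75%")
# each of the following expressions evaluates to True
len(x) == 4
x[0].value == 10
x[2].type == LengthType.Percentage
str(x[3]) == "75%"
# items are also assignable...
x[1] = "40%"
x[1].type == LengthType.Percentage
x[1].value == 40

values = None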

a list of LengthType values

__unicode__()

Formats the Coords as a comma-separated unicode string of Length values.

__str__()

Formats the Coords as a comma-separated string of Length values.

TestRect(x, y, width, height)

Tests an x,y point against a rect with these coordinates.

HTML defines the rect co-ordinates as: left-x, top-y, right-x, bottom-y
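
As an illustration of the rect test, the sketch below works on coordinates that have already been resolved to pixels (percentage values would first be resolved against width and height). The function point_in_rect is hypothetical, and treating points on the boundary as inside is an assumption of this example.

```python
def point_in_rect(coords, x, y):
    """Test the point x,y against left-x, top-y, right-x, bottom-y."""
    left, top, right, bottom = coords
    # points on the edge count as inside (an assumption of this sketch)
    return left <= x <= right and top <= y <= bottom
```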

TestCircle(x, y, width, height)

Tests an x,y point against a circle with these coordinates.

HTML defines a circle as: center-x, center-y, radius.

The specification adds the following note:

When the radius value is a percentage value, user agents should calculate the final radius value based on the associated object’s width and height. The radius should be the smaller value of the two.
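
The note on percentage radii can be sketched like this. Both functions are hypothetical, standalone helpers rather than part of the module, and counting points on the circumference as inside is an assumption.

```python
def percent_radius(percent, width, height):
    """Resolve a percentage radius against the smaller of width and height."""
    return percent * min(width, height) / 100.0


def point_in_circle(cx, cy, radius, x, y):
    """Test the point x,y against a circle given as center-x, center-y, radius."""
    dx = x - cx
    dy = y - cy
    # compare squared distances to avoid taking a square root
    return dx * dx + dy * dy <= radius * radius
```

For an object 200 pixels wide and 100 pixels high, a radius of 50% resolves to 50 pixels because the smaller dimension is used.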

TestPoly(x, y, width, height)

Tests an x,y point against a poly with these coordinates.

HTML defines a poly as: x1, y1, x2, y2, ..., xN, yN.

The specification adds the following note:

The first x and y coordinate pair and the last should be the same to close the polygon. When these coordinate values are not the same, user agents should infer an additional coordinate pair to close the polygon.

The algorithm used is the “Ray Casting” algorithm described here: http://en.wikipedia.org/wiki/Point_in_polygon
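
A minimal sketch of the ray casting test is given below. It is a generic version of the algorithm written for illustration, not a copy of the TestPoly implementation. The function name point_in_polygon is hypothetical and the coordinates are assumed to be already resolved to pixels.

```python
def point_in_polygon(coords, x, y):
    """Test the point x,y against a flat list x1, y1, x2, y2, ..., xN, yN."""
    points = list(zip(coords[0::2], coords[1::2]))
    if len(points) < 3:
        return False
    if points[0] != points[-1]:
        # infer the additional coordinate pair that closes the polygon
        points.append(points[0])
    inside = False
    for (x1, y1), (x2, y2) in zip(points, points[1:]):
        # does this edge straddle the horizontal line through y?
        if (y1 > y) != (y2 > y):
            # x co-ordinate at which the edge crosses that line
            x_cross = x1 + (y - y1) * (x2 - x1) / float(y2 - y1)
            # count crossings of a ray cast from the point towards +x
            if x < x_cross:
                inside = not inside
    return inside
```

An odd number of crossings means the point lies inside the polygon. Points that lie exactly on an edge may be classified either way by this simple version.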

2.5.2.3. URIs

URIs are represented by instances of the underlying pyslet.rfc2396.URI class. These functions provide a simple wrapper around the functions defined in that module.

pyslet.html40_19991224.DecodeURI(src)

Decodes a URI from src:

<!ENTITY % URI "CDATA"  -- a Uniform Resource Identifier -->

Note that we adopt the algorithm recommended in Appendix B of the specification, which involves replacing non-ASCII characters with percent-encoded UTF-8 sequences.

For more information see pyslet.rfc2396.EncodeUnicodeURI()
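
The escaping step recommended by the specification can be sketched as follows. This is a standalone illustration, not the Pyslet implementation: the function escape_non_ascii is hypothetical and, unlike DecodeURI, it returns a plain string rather than a URI instance.

```python
def escape_non_ascii(src):
    """Replace each non-ASCII character with its percent-encoded UTF-8 bytes."""
    result = []
    for ch in src:
        if ord(ch) < 128:
            # ASCII characters are passed through untouched
            result.append(ch)
        else:
            # one %XX escape per byte of the UTF-8 encoding
            for byte in bytearray(ch.encode("utf-8")):
                result.append("%%%02X" % byte)
    return "".join(result)
```

For example, escape_non_ascii(u"http://www.example.com/caf\u00e9") gives "http://www.example.com/caf%C3%A9".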

pyslet.html40_19991224.EncodeURI(uri)

Encoding a URI means just converting it into a string.

By definition, a URI will only result in ASCII characters that can be freely converted to Unicode by the default encoding. However, it does mean that this function doesn’t adhere to the principle of using the ASCII encoding only at the latest possible time.