5.5. HTTP Grammar¶

5.5.1. Using the Grammar¶

The functions and data definitions here are exposed to enable normative use in other modules. The grammar itself is typically used through a parser. There are two types of parser: an OctetParser, which parses raw strings (octets, represented by bytes in Python) such as those obtained from the HTTP connection itself, and a WordParser, which tokenizes the input string first and then provides a higher-level word-based parser.

class pyslet.http.grammar.OctetParser(source)¶

Bases: pyslet.unicode5.BasicParser

A special purpose parser for parsing HTTP productions

Strictly speaking, HTTP operates only on bytes so the parser is always set to binary mode. However, as a concession to the various normative references to HTTP in other specifications, where character strings are parsed, character strings are accepted provided they only contain US ASCII characters.

parse_lws()¶

Parses a single instance of the production LWS

The return value is the LWS string parsed or None if there is no LWS.

parse_onetext(unfold=False)¶

Parses a single TEXT instance.

unfold: Pass True to replace folding LWS with a single SP. Defaults to False.

Parses a single byte or run of LWS matching the production TEXT. The return value is either:

1. a single byte of TEXT (not a binary string), excluding the LWS characters
2. a binary string of LWS
3. None if no TEXT was found

You may find the utility function pyslet.py2.is_byte() useful to distinguish cases 1 and 2 correctly in both Python 2 and Python 3.
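The three cases can be told apart by type. The following is a minimal sketch for Python 3 only that does not use pyslet: classify_onetext is a hypothetical helper, and the isinstance test stands in for pyslet.py2.is_byte() on the assumption that a byte is an integer in Python 3.

```python
def classify_onetext(result):
    """Name the case that a parse_onetext()-style result falls into.

    Hypothetical helper, Python 3 only: a byte is an int, a run of LWS
    is a bytes object and None means that no TEXT was found.
    """
    if result is None:
        return "none"
    if isinstance(result, int):
        return "byte"
    return "lws"


sample = b"a \t"
print(classify_onetext(sample[0]))    # indexing bytes yields an int
print(classify_onetext(sample[1:]))   # slicing bytes yields bytes
print(classify_onetext(None))
```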

parse_text(unfold=False)¶

Parses TEXT

unfold: Pass True to replace folding LWS with a single SP. Defaults to False.

Parses a run of characters matching the production TEXT. The return value is the matching TEXT as a binary string (including any LWS) or None if no TEXT was found.

parse_token()¶

Parses a token.

Parses a single instance of the production token. The return value is the matching token as a binary string or None if no token was found.

parse_comment(unfold=False)¶

Parses a comment.

unfold: Pass True to replace folding LWS with a single SP. Defaults to False.

Parses a single instance of the production comment. The return value is the entire matching comment as a binary string (including the brackets, quoted pairs and any nested comments) or None if no comment was found.

parse_ctext(unfold=False)¶

Parses ctext.

unfold: Pass True to replace folding LWS with a single SP. Defaults to False.

Parses a run of characters matching the production ctext. The return value is the matching ctext as a binary string (including any LWS) or None if no ctext was found.

The original text of RFC2616 is ambiguous in the definition of ctext but the later errata corrected this to exclude the backslash byte ($5C) so we stop if we encounter one.

parse_quoted_string(unfold=False)¶

Parses a quoted-string.

unfold: Pass True to replace folding LWS with a single SP. Defaults to False.

Parses a single instance of the production quoted-string. The return value is the entire matching string (including the quotes and any quoted pairs) or None if no quoted-string was found.

parse_qdtext(unfold=False)¶

Parses qdtext.

Parses a run of characters matching the production qdtext. The return value is the matching qdtext string (including any LWS) or None if no qdtext was found.

If unfold is True then any folding LWS is replaced with a single SP. It defaults to False.

Although the production for qdtext would include the backslash character we stop if we encounter one, following the RFC2616 errata instead.

parse_quoted_pair()¶

Parses a single quoted-pair.

The return value is the matching binary string including the backslash so it will always be of length 2 or None if no quoted-pair was found.

class pyslet.http.grammar.WordParser(source, ignore_sp=True)¶

Bases: pyslet.unicode5.ParserMixin

A word-level parser and tokeniser for the HTTP grammar.

source: The binary string to be parsed into words. It will normally be valid TEXT but it can contain control characters if they are escaped as part of a comment or quoted string. For compatibility, character strings are accepted provided they only contain US ASCII characters.
ignore_sp (defaults to True): LWS is unfolded automatically. By default the parser ignores spaces according to the rules for implied LWS in the specification and neither SP nor HT will be stored in the word list. If you set ignore_sp to False then LWS is not ignored and each run of LWS is returned as a single SP in the word list.

The source is parsed completely into words on construction using OctetParser. If the source contains a CRLF (or any other non-TEXT bytes) that is not part of a folding or escape sequence it raises ParserError.

For the purposes of this parser, a word may be either a single byte (in which case it is a separator or SP, note that HT is never stored in the word list) or a binary string, in which case it is a token, a comment or a quoted string. Warning: in Python 2 a single byte is indistinguishable from a binary string of length 1.
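To make the shape of the word list concrete, here is a simplified, standalone sketch for Python 3. It is not the WordParser implementation: sketch_split_words and SEPARATORS are hypothetical names, comments and folding are not handled, and non-TEXT bytes are not rejected. Separators appear as single bytes (integers in Python 3) while tokens and quoted strings appear as binary strings.

```python
# The separators of RFC2616 Section 2.2, as a set of Python 3 byte values
SEPARATORS = frozenset(b'()<>@,;:\\"/[]?={} \t')


def sketch_split_words(source, ignore_sp=True):
    """Split header bytes into tokens, quoted strings and separators.

    Hypothetical helper: comments and folding LWS are not handled.
    """
    words = []
    i = 0
    while i < len(source):
        b = source[i]
        if b in b" \t":
            # collapse a run of SP/HT, keeping a single SP if requested
            while i < len(source) and source[i] in b" \t":
                i += 1
            if not ignore_sp:
                words.append(0x20)
        elif b == 0x22:
            # quoted string: scan to the closing quote, skipping quoted pairs
            j = i + 1
            while j < len(source) and source[j] != 0x22:
                j += 2 if source[j] == 0x5C else 1
            if j >= len(source):
                raise ValueError("unterminated quoted string")
            words.append(source[i:j + 1])
            i = j + 1
        elif b in SEPARATORS:
            # any other separator is stored as a single byte
            words.append(b)
            i += 1
        else:
            # token: everything up to the next separator
            j = i
            while j < len(source) and source[j] not in SEPARATORS:
                j += 1
            words.append(source[i:j])
            i = j
    return words


print(sketch_split_words(b'text/html; charset="utf-8"'))
print(sketch_split_words(b'text/html; charset="utf-8"', ignore_sp=False))
```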

Methods follow the same pattern as that described in the related pyslet.unicode5.BasicParser using match_, parse_ and require_ naming conventions. The class also includes the pyslet.unicode5.ParserMixin class to enable the convenience methods for converting between look-ahead and non-look-ahead parsing modes.

pos = None¶: a pointer to the current word in the list

the_word = None¶: the current word or None

setpos(pos)¶

Sets the current position of the parser.

Example usage for look-ahead:

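from pyslet.http.grammar import BadSyntax
from pyslet.py2 import byte    # assumed source of the byte() helper

def parse_type_subtype(wp):    # illustrative function name
    # wp is a WordParser instance
    savepos = wp.pos
    try:
        # parse a token/sub-token combination
        token = wp.require_token()
        wp.require_separator(byte('/'))
        subtoken = wp.require_token()
        return token, subtoken
    except BadSyntax:
        # restore the saved position so the caller can try something else
        wp.setpos(savepos)
        return None, None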

parser_error(production=None)¶

Raises an error encountered by the parser

See BadSyntax for details.

If production is None then the previous error is re-raised. If multiple errors have been raised previously the one with the most advanced parser position is used. The operation is similar to pyslet.unicode5.BasicParser.parser_error().

To improve the quality of error messages an internal record of the starting position of each word is kept (within the original source).

The position of the parser is always set to the position of the error raised.

match_end()¶: True if all of the words have been parsed

peek()¶

Returns the next word

If there are no more words, returns None.

parse_word()¶

Parses any word from the list

Returns the word parsed or None if the parser was already at the end of the word list.

parse_word_as_bstr()¶

Parses any word from the list

Returns a binary string representing the word. In cases where the next word is a separator it converts the word to a binary string (in Python 2 this is a no-op) before returning it.

is_token()¶: Returns True if the current word is a token

parse_token()¶

Parses a token from the list of words

Returns the token or None if the next word was not a token. The return value is a binary string. This is consistent with the use of this method for parsing tokens in contexts where a token or a quoted string may be present.

parse_tokenlower()¶

Returns a lower-cased token parsed from the word list

Returns None if the next word was not a token. Unlike parse_token() the result is a character string.

parse_tokenlist()¶

Parses a list of tokens

Returns the list or [] if no tokens were found. Lists are defined by RFC2616 as being comma-separated. Note that empty items are ignored, so strings such as “x,,y” return just [“x”, “y”].

The list of tokens is returned as a list of character strings.
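The empty-item rule can be illustrated without the parser. This is a standalone sketch, not the pyslet implementation: split_token_list is a hypothetical helper that works on a character string and does not check that each item is a valid token.

```python
def split_token_list(value):
    """Split a comma-separated list, dropping empty items.

    Hypothetical helper: SP and HT around each item are stripped and
    items are not checked against the token production.
    """
    items = (item.strip(" \t") for item in value.split(","))
    return [item for item in items if item]


print(split_token_list("x,,y"))
print(split_token_list("gzip, , deflate ,"))
```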

require_token(expected='token')¶

Returns the current token or raises BadSyntax

expected: the name of the expected production, it defaults to “token”.

is_integer()¶: Returns True if the current word is an integer token

parse_integer()¶

Parses an integer token from the list of words

Returns the integer’s value or None.

require_integer(expected='integer')¶

Parses an integer or raises BadSyntax

expected: can be set to the name of the expected object, defaults to “integer”.

is_hexinteger()¶: Returns True if the current word is a hex token

parse_hexinteger()¶

Parses a hex integer token from the list of words

Returns the hex integer’s value or None.

require_hexinteger(expected='hex integer')¶

Parses a hex integer or raises BadSyntax

expected: can be set to the name of the expected object, defaults to “hex integer”.

is_separator(sep)¶: Returns True if the current word matches sep

parse_separator(sep)¶

Parses a sep from the list of words.

Returns True if the current word matches sep and False otherwise.

require_separator(sep, expected=None)¶

Parses sep or raises BadSyntax

sep: A separator byte (not a binary string).
expected: can be set to the name of the expected object

is_quoted_string()¶: Returns True if the current word is a quoted string.

parse_quoted_string()¶

Parses a quoted string from the list of words.

Returns the decoded value of the quoted string or None.

parse_sp()¶

Parses a SP from the list of words.

Returns True if the current word is a SP and False otherwise.

parse_parameters(parameters, ignore_allsp=True, case_sensitive=False, qmode=None)¶

Parses a set of parameters

parameters: the dictionary in which to store the parsed parameters
ignore_allsp: a boolean (defaults to True) which causes the function to ignore all LWS in the word list. If set to False then space around the ‘=’ separator is treated as an error and raises BadSyntax.
case_sensitive: controls whether parameter names are treated as case sensitive, defaults to False.
qmode: allows you to pass a special parameter name that will terminate parameter parsing (without being parsed itself). This is used to support headers such as the “Accept” header in which the parameter called “q” marks the boundary between media-type parameters and Accept extension parameters. Defaults to None

Updates the parameters dictionary with the new parameter definitions. The key in the dictionary is the parameter name (converted to lower case if parameters are being dealt with case insensitively) and the value is a 2-item tuple of (name, value) always preserving the original case of the parameter name.

Returns the parameters dictionary as the result. The method always succeeds as parameter lists can be empty.

Compatibility warning: parameter names must be tokens and are therefore converted to character strings. Parameter values, on the other hand, may be quoted strings containing characters from unknown character sets and are therefore always represented as binary strings.
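The layout of the resulting dictionary is easiest to see with a small example. The sketch below builds an equivalent dictionary by hand; store_parameter is a hypothetical helper used for illustration only and is not part of pyslet.

```python
def store_parameter(parameters, name, value, case_sensitive=False):
    """Store one parameter using the layout described above.

    Hypothetical helper: name is a character string, value is a binary
    string. The key is lower-cased unless case_sensitive is True; the
    stored tuple always keeps the original spelling of the name.
    """
    key = name if case_sensitive else name.lower()
    parameters[key] = (name, value)
    return parameters


params = {}
store_parameter(params, "Charset", b"utf-8")
store_parameter(params, "Boundary", b"frontier", case_sensitive=True)
print(params)
```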

parse_remainder(sep='')¶

Parses the rest of the words

The result is a single string representing the remaining words joined with sep, which defaults to an empty string.

Returns an empty string if the parser is at the end of the word list.

5.5.2. Basic Syntax¶

This section defines functions for handling basic elements of the HTTP grammar. Refer to Section 2.2 of RFC2616 for details.

The HTTP protocol only deals with octets so the following functions take a single byte as an argument and return True if the byte matches the production and False otherwise. As a convenience they all accept None as an argument and will return False.

A byte is defined as the type returned by indexing a binary string and is therefore an integer in the range 0..255 in Python 3 and a single character string in Python 2.

pyslet.http.grammar.is_octet(b)¶: Returns True if a byte matches the production for OCTET.

pyslet.http.grammar.is_char(b)¶: Returns True if a byte matches the production for CHAR.

pyslet.http.grammar.is_upalpha(b)¶: Returns True if a byte matches the production for UPALPHA.

pyslet.http.grammar.is_loalpha(b)¶: Returns True if a byte matches the production for LOALPHA.

pyslet.http.grammar.is_alpha(b)¶: Returns True if a byte matches the production for ALPHA.

pyslet.http.grammar.is_digit(b)¶: Returns True if a byte matches the production for DIGIT.

pyslet.http.grammar.is_digits(src)¶

Returns True if all bytes match the production for DIGIT.

Empty strings return False.

pyslet.http.grammar.is_ctl(b)¶: Returns True if a byte matches the production for CTL.

pyslet.http.grammar.is_separator(b)¶: Returns True if a byte is a separator

pyslet.http.grammar.is_hex(b)¶: Returns True if a byte matches the production for HEX.
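For reference, a few of these productions can be written directly from Section 2.2 of RFC2616. The functions below are a Python 3 sketch that assumes a byte is an integer; they carry a sketch_ prefix to make clear that they are illustrative and are not the pyslet implementations.

```python
# The separators of RFC2616 Section 2.2, as a set of Python 3 byte values
SEPARATORS = frozenset(b'()<>@,;:\\"/[]?={} \t')


def sketch_is_ctl(b):
    """CTL: octets 0-31 and DEL (127); None is never a match."""
    return b is not None and (b < 32 or b == 127)


def sketch_is_separator(b):
    """True if b is one of the separator bytes; None is never a match."""
    return b is not None and b in SEPARATORS


def sketch_is_hex(b):
    """HEX: a DIGIT or one of A-F or a-f; None is never a match."""
    return b is not None and b in b"0123456789ABCDEFabcdef"


data = b"a/F\r"
for b in data:
    print(b, sketch_is_ctl(b), sketch_is_separator(b), sketch_is_hex(b))
```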

The following constants are defined to speed up comparisons, in each case they are the byte (see above) corresponding to the syntax elements defined in the specification.

And similarly, these byte constants are not defined in the grammar but are useful for comparisons. Again they are the byte representing these separators and will have a different type in Python 2 and 3.

The following binary string constant is defined for completeness:

There are no special definitions for LWS and TEXT; these productions are handled by OctetParser.

The following functions operate on binary strings. Note that in Python 2 a byte is also a binary string (of length 1) but in Python 3 a byte is not a valid string. Use pyslet.py2.byte_to_bstr() if you need to create a binary string from a single byte.

pyslet.http.grammar.is_hexdigits(src)¶

Returns True if all bytes match the production for HEX.

Empty strings return False.

pyslet.http.grammar.check_token(t)¶

Raises ValueError if t is not a valid token

t: A binary string, will also accept a single byte.

Returns a character string representing the token on success.

pyslet.http.grammar.decode_quoted_string(qstring)¶

Decodes a quoted string, returning the unencoded string.

Surrounding double quotes are removed and quoted bytes, bytes preceded by $5C (backslash), are unescaped.

The return value is a binary string. In most cases you will want to decode it using the latin-1 (iso-8859-1) codec as that was the original intention of RFC2616 but in practice anything outside US ASCII is likely to be non-portable.
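The decoding rule can be sketched in a few lines of Python 3. This is an illustration only, not the pyslet implementation: sketch_decode_quoted_string is a hypothetical name and the input is assumed to be a well-formed quoted-string.

```python
def sketch_decode_quoted_string(qstring):
    """Remove the surrounding quotes and unescape quoted pairs.

    Illustrative only: assumes qstring is a well-formed quoted-string
    given as a Python 3 bytes object.
    """
    body = qstring[1:-1]
    result = bytearray()
    escaped = False
    for b in body:
        if escaped:
            # the byte after a backslash is taken literally
            result.append(b)
            escaped = False
        elif b == 0x5C:
            escaped = True
        else:
            result.append(b)
    return bytes(result)


raw = sketch_decode_quoted_string(b'"caf\xe9 \\"bar\\""')
print(raw)
print(raw.decode("latin-1"))
```

The final decode step uses the latin-1 codec, as suggested above; the result before that step is still a binary string.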

pyslet.http.grammar.quote_string(s, force=True)¶

Places a string in double quotes, returning the quoted string.

force: Always quote the string, defaults to True. If False then valid tokens are not quoted but returned as-is.

This is the reverse of decode_quoted_string(). Note that only the double quote, the backslash and CTL characters other than HT are quoted in the output.
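A minimal Python 3 sketch of the quoting rule follows. It is not the pyslet implementation and its names are hypothetical. Escaping the backslash is an assumption made here so that the result can be decoded again, and the token test follows the RFC2616 definition of token (one or more CHARs that are neither CTLs nor separators).

```python
# The separators of RFC2616 Section 2.2, as a set of Python 3 byte values
SEPARATORS = frozenset(b'()<>@,;:\\"/[]?={} \t')


def sketch_is_token(s):
    """True if s is a non-empty run of CHARs with no CTLs or separators."""
    return bool(s) and all(
        32 <= b < 127 and b not in SEPARATORS for b in s)


def sketch_quote_string(s, force=True):
    """Wrap s in double quotes, escaping bytes that need it."""
    if not force and sketch_is_token(s):
        # valid tokens are returned as-is when quoting is not forced
        return s
    out = bytearray(b'"')
    for b in s:
        is_ctl = (b < 32 or b == 127) and b != 0x09
        if b in (0x22, 0x5C) or is_ctl:
            # precede the byte with a backslash to form a quoted pair
            out.append(0x5C)
        out.append(b)
    out.append(0x22)
    return bytes(out)


print(sketch_quote_string(b"abc"))
print(sketch_quote_string(b"abc", force=False))
print(sketch_quote_string(b'say "hi"', force=False))
```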

5.5.3. Misc Functions¶

pyslet.http.grammar.format_parameters(parameters)¶

Formats a dictionary of parameters

This function is suitable for formatting parameter dictionaries parsed by WordParser.parse_parameters(). These dictionaries are key/value pairs where the keys are character strings and the values are binary strings.

Parameter values are quoted only if their values require it, that is, only if their values are not valid tokens.
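The quote-only-when-needed rule can be sketched as follows for Python 3. This is an illustration, not the pyslet implementation: sketch_format_parameters is a hypothetical name, the dictionary is assumed to use the (name, value) tuple layout produced by parse_parameters(), and both the "; name=value" output layout and the sorted key order are assumptions made for the example.

```python
# The separators of RFC2616 Section 2.2, as a set of Python 3 byte values
SEPARATORS = frozenset(b'()<>@,;:\\"/[]?={} \t')


def sketch_format_parameters(parameters):
    """Format {key: (name, value)} as b'; name=value' pairs.

    Illustrative only: assumes values contain no CTL bytes.
    """
    parts = []
    for key in sorted(parameters):
        name, value = parameters[key]
        is_token = bool(value) and all(
            32 <= b < 127 and b not in SEPARATORS for b in value)
        if not is_token:
            # quote the value, escaping backslashes and double quotes
            escaped = value.replace(b"\\", b"\\\\").replace(b'"', b'\\"')
            value = b'"' + escaped + b'"'
        parts.append(b"; " + name.encode("ascii") + b"=" + value)
    return b"".join(parts)


params = {
    "charset": ("charset", b"utf-8"),
    "title": ("Title", b"two words"),
}
print(sketch_format_parameters(params))
```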

5.5.4. Exceptions¶

class pyslet.http.grammar.BadSyntax(production='', parser=None)¶

Raised by the WordParser

Raised whenever the parser encounters a syntax error. Note that tokenization errors are raised separately, during construction of the WordParser.

production: The name of the production being parsed. (Defaults to an empty string.)
parser: The WordParser instance raising the error (optional)

BadSyntax is a subclass of ValueError.