5.5. HTTP Grammar

This section defines functions for handling basic elements of the HTTP grammar. Refer to Section 2.2 of RFC2616 for details.

The HTTP protocol deals only with octets but, as a convenience and because Python 2.x blurs the distinction between octet strings and character strings, we process characters as if they were octets.

pyslet.http.grammar.is_octet(c)

Returns True if a character matches the production for OCTET.

pyslet.http.grammar.is_char(c)

Returns True if a character matches the production for CHAR.

pyslet.http.grammar.is_upalpha(c)

Returns True if a character matches the production for UPALPHA.

pyslet.http.grammar.is_loalpha(c)

Returns True if a character matches the production for LOALPHA.

pyslet.http.grammar.is_alpha(c)

Returns True if a character matches the production for ALPHA.

pyslet.http.grammar.is_digit(c)

Returns True if a character matches the production for DIGIT.

pyslet.http.grammar.is_digits(src)

Returns True if all characters match the production for DIGIT.

Empty strings return False

pyslet.http.grammar.is_ctl(c)

Returns True if a character matches the production for CTL.
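As a rough illustration of what these predicates test, the sketch below restates the RFC2616 Section 2.2 character classes in plain Python. It is not Pyslet's implementation: the sketch_* names are hypothetical, and each argument is assumed to be a single-character string.

```python
# Standalone sketch of the RFC 2616 section 2.2 character classes.
# The sketch_* names are hypothetical and are not part of pyslet.

def sketch_is_octet(c):
    # OCTET: any 8-bit value
    return 0 <= ord(c) <= 255

def sketch_is_char(c):
    # CHAR: any US-ASCII character (octets 0 - 127)
    return 0 <= ord(c) <= 127

def sketch_is_upalpha(c):
    # UPALPHA: "A".."Z"
    return "A" <= c <= "Z"

def sketch_is_loalpha(c):
    # LOALPHA: "a".."z"
    return "a" <= c <= "z"

def sketch_is_alpha(c):
    # ALPHA: UPALPHA | LOALPHA
    return sketch_is_upalpha(c) or sketch_is_loalpha(c)

def sketch_is_digit(c):
    # DIGIT: "0".."9"
    return "0" <= c <= "9"

def sketch_is_ctl(c):
    # CTL: octets 0 - 31 and DEL (127)
    return ord(c) < 32 or ord(c) == 127
```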

LWS and TEXT productions are handled by OctetParser

pyslet.http.grammar.is_hex(c)

Returns True if a character matches the production for HEX.

pyslet.http.grammar.is_hexdigits(src)

Returns True if all characters match the production for HEX.

Empty strings return False
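The empty-string rule deserves a short illustration, because Python's all() returns True for an empty sequence. The following sketch (hypothetical name, not Pyslet's code) shows one way to get the documented behaviour.

```python
# Standalone sketch; sketch_is_hexdigits is a hypothetical name.
HEX_CHARS = frozenset("0123456789ABCDEFabcdef")

def sketch_is_hexdigits(src):
    # all() is True for an empty sequence, so emptiness is tested explicitly
    return len(src) > 0 and all(c in HEX_CHARS for c in src)
```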

pyslet.http.grammar.check_token(t)

Raises ValueError if t is not a valid token

pyslet.http.grammar.is_separator(c)

Returns True if a character is a separator
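For reference, RFC2616 defines a token as one or more CHARs, excluding CTLs and separators. A minimal sketch of the two checks, using hypothetical names and assuming character strings as input, might look like this:

```python
# Standalone sketch of the RFC 2616 separators and token rules.
# The names below are hypothetical and are not part of pyslet.
SEPARATORS = frozenset('()<>@,;:\\"/[]?={} \t')

def sketch_is_separator(c):
    return c in SEPARATORS

def sketch_check_token(t):
    # a token needs at least one character
    if not t:
        raise ValueError("token must contain at least one character")
    for c in t:
        code = ord(c)
        # reject CTLs (0 - 31 and 127), non-ASCII characters and separators
        if code < 32 or code > 126 or c in SEPARATORS:
            raise ValueError("invalid character in token: %r" % c)
```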

pyslet.http.grammar.decode_quoted_string(qstring)

Decodes a quoted string, returning the unencoded string.

Surrounding double quotes are removed and quoted characters (characters preceded by \) are unescaped.

pyslet.http.grammar.quote_string(s, force=True)

Places a string in double quotes, returning the quoted string.

This is the reverse of decode_quoted_string(). Note that only the double quote, \ and CTL characters other than SP and HT are quoted in the output.

If force is False then valid tokens are not quoted.
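The relationship between the two functions can be sketched in plain Python. This is an independent illustration under stated assumptions, not Pyslet's implementation: the sketch_* names and the TOKEN pattern are hypothetical, and backslash is assumed to be escaped so that the round trip is lossless.

```python
import re

# Characters allowed in an RFC 2616 token (CHAR minus CTLs and separators)
TOKEN = re.compile(r"[!#$%&'*+\-.^_`|~0-9A-Za-z]+")

def sketch_quote_string(s, force=True):
    # with force=False a valid token is returned unquoted
    if not force and TOKEN.fullmatch(s):
        return s
    out = ['"']
    for c in s:
        is_ctl = ord(c) < 32 or ord(c) == 127
        # escape the double quote, backslash and CTLs other than HT
        if c in '"\\' or (is_ctl and c != "\t"):
            out.append("\\")
        out.append(c)
    out.append('"')
    return "".join(out)

def sketch_decode_quoted_string(qstring):
    if len(qstring) < 2 or qstring[0] != '"' or qstring[-1] != '"':
        raise ValueError("not a quoted string: %r" % qstring)
    out = []
    escaped = False
    for c in qstring[1:-1]:
        if escaped or c != "\\":
            # ordinary character, or the second half of a quoted pair
            out.append(c)
            escaped = False
        else:
            # a backslash starts a quoted pair
            escaped = True
    return "".join(out)
```

Decoding the output of the quoting function returns the original string, which is the sense in which one is the reverse of the other.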

pyslet.http.grammar.format_parameters(parameters)

Formats a dictionary of parameters

This function is suitable for formatting parameter dictionaries parsed by WordParser.parse_parameters().

Parameter values are quoted only if their values require it, that is, only if their values are not valid tokens.
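To make the quoting rule concrete, here is a small sketch that formats a dictionary whose values are (name, value) tuples, as produced by parse_parameters(). The "; name=value" layout, the sorted key order and the simplified quoting helper are assumptions made for illustration; they are not taken from Pyslet's source.

```python
import re

# Characters allowed in an RFC 2616 token (CHAR minus CTLs and separators)
TOKEN = re.compile(r"[!#$%&'*+\-.^_`|~0-9A-Za-z]+")

def sketch_format_parameters(parameters):
    # hypothetical helper: the output layout is an assumption
    parts = []
    for key in sorted(parameters):
        name, value = parameters[key]
        if not TOKEN.fullmatch(value):
            # simplified quoting: escapes only the double quote and backslash
            value = '"' + re.sub(r'(["\\])', r"\\\1", value) + '"'
        parts.append("; %s=%s" % (name, value))
    return "".join(parts)
```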

5.5.1. Using the Grammar

The functions and data definitions above are exposed so that other modules can use them normatively, but the grammar is typically used through a parser. There are two types of parser: an OctetParser, which parses raw strings (or octets) such as those obtained from the HTTP connection itself, and a WordParser, which tokenizes the input string first and then provides a higher-level word-based parser.

class pyslet.http.grammar.OctetParser(source)

Bases: pyslet.unicode5.BasicParser

A special purpose parser for parsing HTTP productions.

parse_lws()

Parses a single instance of the production LWS

The return value is the LWS string parsed or None if there is no LWS.

parse_onetext(unfold=False)

Parses a single TEXT instance.

Parses a single character or run of LWS matching the production TEXT. The return value is the matching character, LWS string or None if no TEXT was found.

If unfold is True then any folding LWS is replaced with a single SP. It defaults to False

parse_text(unfold=False)

Parses TEXT

Parses a run of characters matching the production TEXT. The return value is the matching TEXT string (including any LWS) or None if no TEXT was found.

If unfold is True then any folding LWS is replaced with a single SP. It defaults to False
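The effect of unfolding can be sketched with a regular expression. This assumes that "folding LWS" means a CRLF followed by at least one SP or HT. The helper name is hypothetical and the sketch operates on a whole string rather than on a parser position.

```python
import re

# a line fold: CRLF followed by one or more SP or HT characters
FOLD = re.compile(r"\r\n[ \t]+")

def sketch_unfold(text):
    # replace each fold with a single SP; other white space is left alone
    return FOLD.sub(" ", text)
```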

parse_token()

Parses a token.

Parses a single instance of the production token. The return value is the matching token string or None if no token was found.

parse_comment(unfold=False)

Parses a comment.

Parses a single instance of the production comment. The return value is the entire matching comment string (including the brackets, quoted pairs and any nested comments) or None if no comment was found.

If unfold is True then any folding LWS is replaced with a single SP. It defaults to False

parse_ctext(unfold=False)

Parses ctext.

Parses a run of characters matching the production ctext. The return value is the matching ctext string (including any LWS) or None if no ctext was found.

If unfold is True then any folding LWS is replaced with a single SP. It defaults to False

Although the production for ctext would include the backslash character we stop if we encounter one as the grammar is ambiguous at this point.

parse_quoted_string(unfold=False)

Parses a quoted-string.

Parses a single instance of the production quoted-string. The return value is the entire matching string (including the quotes and any quoted pairs) or None if no quoted-string was found.

If unfold is True then any folding LWS is replaced with a single SP. It defaults to False

parse_qdtext(unfold=False)

Parses qdtext.

Parses a run of characters matching the production qdtext. The return value is the matching qdtext string (including any LWS) or None if no qdtext was found.

If unfold is True then any folding LWS is replaced with a single SP. It defaults to False

Although the production for qdtext would include the backslash character we stop if we encounter one as the grammar is ambiguous at this point.

parse_quoted_pair()

Parses a single quoted-pair.

The return value is the matching string, including the backslash, so it is always of length 2. Returns None if no quoted-pair was found.

class pyslet.http.grammar.WordParser(source, ignore_sp=True)

Bases: object

A word-level parser and tokeniser for the HTTP grammar.

source is the string to be parsed into words. It will normally be valid TEXT but it can contain control characters if they are escaped as part of a comment or quoted string.

LWS is unfolded automatically. By default the parser ignores spaces according to the rules for implied LWS in the specification and neither SP nor HT will be stored in the word list. If you set ignore_sp to False then LWS is not ignored and each run of LWS is returned as a single SP in the word list.

If the source contains a CRLF (or any other non-TEXT character) that is not part of a folding or escape sequence, the parser raises ValueError.
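The word-splitting step can be pictured with the simplified tokenizer below. It is a sketch only, with a hypothetical function name: it does not unfold CRLF sequences or reject control characters, and it is not Pyslet's code, but it shows how tokens, separators, quoted strings and comments end up as separate words.

```python
# Separators other than SP and HT, which are handled as white space below
SEPARATORS = frozenset('()<>@,;:\\"/[]?={}')

def sketch_split_words(source, ignore_sp=True):
    words = []
    i, n = 0, len(source)
    while i < n:
        c = source[i]
        if c in " \t":
            # collapse a run of white space; keep it as one SP if requested
            while i < n and source[i] in " \t":
                i += 1
            if not ignore_sp:
                words.append(" ")
        elif c == '"':
            # quoted string: scan to the closing quote, skipping quoted pairs
            j = i + 1
            while j < n and source[j] != '"':
                j += 2 if source[j] == "\\" else 1
            if j >= n:
                raise ValueError("unterminated quoted string")
            words.append(source[i:j + 1])
            i = j + 1
        elif c == "(":
            # comment: track nesting depth, skipping quoted pairs
            depth, j = 0, i
            while j < n:
                if source[j] == "\\":
                    j += 2
                    continue
                if source[j] == "(":
                    depth += 1
                elif source[j] == ")":
                    depth -= 1
                    if depth == 0:
                        break
                j += 1
            if j >= n:
                raise ValueError("unterminated comment")
            words.append(source[i:j + 1])
            i = j + 1
        elif c in SEPARATORS:
            # any other separator is a single-character word
            words.append(c)
            i += 1
        else:
            # token: runs until the next separator or white space
            j = i
            while j < n and source[j] not in SEPARATORS and source[j] not in " \t":
                j += 1
            words.append(source[i:j])
            i = j
    return words
```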

The resulting words may be a token, a single separator character, a comment or a quoted string. To determine the type of word, look at the first character.

  • '(' means the word is a comment, surrounded by '(' and ')'
  • a double quote means the word is an encoded quoted string (use decode_quoted_string() to decode it)
  • other separator characters are just themselves and only appear as single-character strings. (HT is never returned.)
  • Any other character indicates a token.
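The first-character rule above can be written as a small classifier. The function name and the returned labels are hypothetical and chosen for illustration only.

```python
# RFC 2616 separators; HT is omitted because it never appears in the word list
SEPARATORS = frozenset('()<>@,;:\\"/[]?={} ')

def sketch_word_type(word):
    first = word[0]
    if first == "(":
        return "comment"
    if first == '"':
        return "quoted-string"
    if first in SEPARATORS:
        return "separator"
    return "token"
```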

Methods of the form require_* raise BadSyntax if the production is not found.

pos = None

a pointer to the current word in the list

the_word = None

the current word or None

setpos(pos)

Sets the current position of the parser.

Example usage for look-ahead:

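from pyslet.http.grammar import BadSyntax

def parse_token_pair(wp):
    # illustrative helper; wp is a WordParser instance
    savepos = wp.pos
    try:
        # parse a token/sub-token combination
        token = wp.require_token()
        wp.require_separator('/')
        subtoken = wp.require_token()
        return token, subtoken
    except BadSyntax:
        # rewind so the words can be parsed another way
        wp.setpos(savepos)
        return None, None

peek()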

Returns the next word

If there are no more words, returns None.

syntax_error(expected)

Raises BadSyntax.

expected
    a descriptive string indicating the expected production.

require_production(result, production=None)

Returns result if result is not None

If result is None, raises BadSyntax.

production
    can be used to customize the error message with the name of the expected production.

parse_production(require_method, *args)

Executes the bound method require_method passing args.

If successful the result of the method is returned. If BadSyntax is raised, the exception is caught, the parser rewound and None is returned.
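The rewind behaviour is easiest to see in a stripped-down model. The classes below are hypothetical stand-ins written for this illustration (they are not WordParser or BadSyntax themselves), but they follow the same pattern: save the position, call the require method, and restore the position if it fails.

```python
class SketchBadSyntax(ValueError):
    """Stand-in for BadSyntax, a trivial subclass of ValueError."""

class SketchWords:
    """Hypothetical miniature word parser over a list of words."""

    def __init__(self, words):
        self.words = list(words)
        self.pos = 0

    def setpos(self, pos):
        self.pos = pos

    def require_word(self, expected):
        # consume one word or raise the syntax error
        if self.pos >= len(self.words) or self.words[self.pos] != expected:
            raise SketchBadSyntax("expected %s" % expected)
        self.pos += 1
        return expected

    def require_pair(self, first, second):
        # may consume the first word before failing on the second
        return self.require_word(first), self.require_word(second)

    def parse_production(self, require_method, *args):
        savepos = self.pos
        try:
            return require_method(*args)
        except SketchBadSyntax:
            # rewind so a partial match leaves the parser unchanged
            self.setpos(savepos)
            return None
```

A failed call such as wp.parse_production(wp.require_pair, "a", "x") returns None and leaves wp.pos where it started, even though the first word matched.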

require_production_end(result, production=None)

Checks for a required production and the end of the word list

Returns result if result is not None and parsing is now complete, otherwise raises BadSyntax.

production
    can be used to customize the error message with the name of the expected production.

require_end(production=None)

Checks for the end of the word list

If the parser is not at the end of the word list BadSyntax is raised.

production
    can be used to customize the error message with the name of the production being parsed.

parse_word()

Parses any word from the list

Returns the word parsed or None if the parser was already at the end of the word list.

is_token()

Returns True if the current word is a token

parse_token()

Parses a token from the list of words

Returns the token or None if the next word was not a token.

parse_tokenlower()

Returns a lower-cased token parsed from the word list

Returns None if the next word was not a token.

parse_tokenlist()

Parses a list of tokens

Returns the list or [] if no tokens were found. Lists are defined by RFC2616 as being comma-separated. Note that empty items are ignored, so a string such as "x,,y" returns just ["x", "y"].
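The handling of empty items can be sketched over a list of words that has already been split. This simplified, hypothetical helper skips every comma and stops at the first word that is neither a comma nor a token. It illustrates the documented result and is not a copy of Pyslet's logic.

```python
import re

# Characters allowed in an RFC 2616 token (CHAR minus CTLs and separators)
TOKEN = re.compile(r"[!#$%&'*+\-.^_`|~0-9A-Za-z]+")

def sketch_parse_tokenlist(words):
    result = []
    for word in words:
        if word == ",":
            # commas only delimit items, so empty items simply vanish
            continue
        if TOKEN.fullmatch(word):
            result.append(word)
        else:
            # anything else ends the list
            break
    return result
```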

require_token(expected='token')

Returns the current token or raises BadSyntax

expected
    the name of the expected production; it defaults to "token".

is_integer()

Returns True if the current word is an integer token

parse_integer()

Parses an integer token from the list of words

Returns the integer's value or None.

require_integer(expected='integer')

Parses an integer or raises BadSyntax

expected
    can be set to the name of the expected object, defaults to "integer".

is_hexinteger()

Returns True if the current word is a hex token

parse_hexinteger()

Parses a hex integer token from the list of words

Returns the hex integer's value or None.

require_hexinteger(expected='hex integer')

Parses a hex integer or raises BadSyntax

expected
    can be set to the name of the expected object, defaults to "hex integer".

is_separator(sep)

Returns True if the current word matches sep

parse_separator(sep)

Parses a sep from the list of words.

Returns True if the current word matches sep and False otherwise.

require_separator(sep, expected=None)

Parses sep or raises BadSyntax

expected
    can be set to the name of the expected object

is_quoted_string()

Returns True if the current word is a quoted string.

parse_quoted_string()

Parses a quoted string from the list of words.

Returns the decoded value of the quoted string or None.

parse_sp()

Parses a SP from the list of words.

Returns True if the current word is a SP and False otherwise.

parse_parameters(parameters, ignore_allsp=True, case_sensitive=False, qmode=None)

Parses a set of parameters

parameters
    the dictionary in which to store the parsed parameters

ignore_allsp
    a boolean (defaults to True) which causes the function to ignore all LWS in the word list. If set to False then space around the '=' separator is treated as an error and raises BadSyntax.

case_sensitive
    controls whether parameter names are treated as case sensitive, defaults to False.

qmode
    allows you to pass a special parameter name that will terminate parameter parsing (without being parsed itself). This is used to support headers such as the "Accept" header in which the parameter called "q" marks the boundary between media-type parameters and Accept extension parameters. Defaults to None.

Updates the parameters dictionary with the new parameter definitions. The key in the dictionary is the parameter name (converted to lower case if parameters are being dealt with case insensitively) and the value is a 2-item tuple of (name, value) always preserving the original case of the parameter name.
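The shape of the resulting dictionary, and the effect of qmode, can be sketched as follows. The helper is hypothetical and works on (name, value) pairs that have already been split out, rather than on a word list. Comparing the lower-cased name with qmode is an assumption of this sketch.

```python
def sketch_collect_parameters(pairs, parameters, case_sensitive=False,
                              qmode=None):
    """Stores (name, value) pairs; returns the number of pairs consumed."""
    consumed = 0
    for name, value in pairs:
        key = name if case_sensitive else name.lower()
        if qmode is not None and key == qmode:
            # the qmode parameter ends parsing and is not stored
            break
        # the value keeps the original case of the parameter name
        parameters[key] = (name, value)
        consumed += 1
    return consumed
```

For example, with qmode="q" the pairs ("Level", "1"), ("q", "0.8"), ("ext", "x") leave only the "level" key in the dictionary, mapped to ("Level", "1").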

parse_remainder(sep='')

Parses the rest of the words

The result is a single string representing the remaining words joined with sep, which defaults to an empty string.

Returns an empty string if the parser is at the end of the word list.

class pyslet.http.grammar.BadSyntax

Raised when a syntax error is encountered by the parsers

This is just a trivial sub-class of the built-in ValueError.