bs4.builder._lxml#

Module Contents#

Classes#

LXMLTreeBuilderForXML

Turn a textual document into a Beautiful Soup object tree.

LXMLTreeBuilder

This TreeBuilder knows facts about HTML.

class bs4.builder._lxml.LXMLTreeBuilderForXML(parser=None, empty_element_tags=None, **kwargs)#

Bases: bs4.builder.TreeBuilder

Turn a textual document into a Beautiful Soup object tree.
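In practice this builder is rarely constructed by hand; it is usually selected by passing one of its names to the BeautifulSoup constructor. The following is a minimal sketch, assuming both bs4 and lxml are installed. The sample markup is made up for illustration.

```python
from bs4 import BeautifulSoup

# Made-up sample document; bytes are passed so lxml can handle the decoding.
markup = b"<root><item id='1'>first</item><item id='2'/></root>"

# "lxml-xml" is the builder's NAME; "xml" is listed in ALTERNATE_NAMES.
soup = BeautifulSoup(markup, "lxml-xml")

builder_name = type(soup.builder).__name__
first_id = soup.find("item")["id"]
print(builder_name, soup.builder.is_xml, first_id)
```

Passing "xml" instead of "lxml-xml" should select the same builder.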

DEFAULT_PARSER_CLASS#
is_xml = True#
processing_instruction_class#
NAME = 'lxml-xml'#
ALTERNATE_NAMES = ['xml']#
features#
CHUNK_SIZE = 512#
DEFAULT_NSMAPS#
DEFAULT_NSMAPS_INVERTED#
initialize_soup(soup)#

Let the BeautifulSoup object know about the standard namespace mapping.

Parameters:

soup – A BeautifulSoup object.

_register_namespaces(mapping)#

Let the BeautifulSoup object know about namespaces encountered while parsing the document.

This might be useful later on when creating CSS selectors.

This will track (almost) all namespaces, even ones that were only in scope for part of the document. If two namespaces have the same prefix, only the first one encountered will be tracked. Un-prefixed namespaces are not tracked.

Parameters:

mapping – A dictionary mapping namespace prefixes to URIs.
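The tracking rule described above (first prefix wins, un-prefixed namespaces skipped) can be sketched in plain Python. The function name `register_namespaces` and the `tracked` dictionary are hypothetical stand-ins, not the library's own code or storage.

```python
def register_namespaces(tracked, mapping):
    """Merge `mapping` (prefix -> URI) into `tracked` using the rule above."""
    for prefix, uri in mapping.items():
        # A falsy prefix (None or "") is an un-prefixed namespace: not tracked.
        # A prefix that is already present keeps its first URI.
        if prefix and prefix not in tracked:
            tracked[prefix] = uri
    return tracked


tracked = {}
register_namespaces(tracked, {"xml": "http://www.w3.org/XML/1998/namespace"})
register_namespaces(tracked, {None: "http://example.com/default",
                              "a": "http://example.com/a"})
# Same prefix "a" seen again with a different URI: the first one is kept.
register_namespaces(tracked, {"a": "http://example.com/other"})
print(tracked)
```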

default_parser(encoding)#

Find the default parser for the given encoding.

Parameters:

encoding – A string.

Returns:

Either a parser object or a class, which will be instantiated with default arguments.

parser_for(encoding)#

Instantiate an appropriate parser for the given encoding.

Parameters:

encoding – A string.

Returns:

A parser object such as an etree.XMLParser.
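The relationship between default_parser and parser_for can be illustrated with a small sketch: if default_parser hands back a class, it is instantiated; if it hands back an object, that object is used as-is. `StubParser` is a hypothetical stand-in for a real parser class, and the arguments the real method passes when instantiating are not shown in this reference, so passing only the encoding is an assumption.

```python
class StubParser:
    """Hypothetical stand-in for a parser class such as etree.XMLParser."""

    def __init__(self, encoding=None):
        self.encoding = encoding


def parser_for(default_parser, encoding):
    """Return a ready-to-use parser for `encoding` (illustrative only)."""
    candidate = default_parser(encoding)
    if isinstance(candidate, type):
        # A class was returned: instantiate it (argument choice is assumed).
        candidate = candidate(encoding=encoding)
    return candidate


# Case 1: default_parser returns a class.
from_class = parser_for(lambda enc: StubParser, "utf-8")

# Case 2: default_parser returns an already-built parser object.
ready_made = StubParser(encoding="latin-1")
from_object = parser_for(lambda enc: ready_made, "utf-8")
```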

_getNsTag(tag)#
prepare_markup(markup, user_specified_encoding=None, exclude_encodings=None, document_declared_encoding=None)#

Run any preliminary steps necessary to make incoming markup acceptable to the parser.

lxml prefers to receive a bytestring and convert it to Unicode itself. So instead of using UnicodeDammit to convert the bytestring to Unicode under different encodings, this implementation uses EncodingDetector to iterate over the candidate encodings and tells lxml to try parsing the document as each one in turn.

Parameters:
  • markup – Some markup – hopefully a bytestring.

  • user_specified_encoding – The user asked to try this encoding.

  • document_declared_encoding – The markup itself claims to be in this encoding.

  • exclude_encodings – The user asked _not_ to try any of these encodings.

Yield:

A series of 4-tuples: (markup, encoding, declared encoding, has undergone character replacement)

Each 4-tuple represents a strategy for converting the document to Unicode and parsing it. Each strategy will be tried in turn.
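The shape of these strategies can be sketched with a plain generator. This is not the library's implementation: the function name `candidate_strategies`, the order in which encodings are tried, and the two fallback encodings are all assumptions made for illustration, and the real method relies on EncodingDetector to choose candidates.

```python
def candidate_strategies(markup, user_specified_encoding=None,
                         document_declared_encoding=None,
                         exclude_encodings=None):
    """Yield (markup, encoding, declared encoding, replaced?) 4-tuples."""
    excluded = {e.lower() for e in (exclude_encodings or [])}
    tried = set()
    # Assumed order: user's choice, declared encoding, then two fallbacks.
    candidates = (user_specified_encoding, document_declared_encoding,
                  "utf-8", "windows-1252")
    for encoding in candidates:
        if encoding is None:
            continue
        key = encoding.lower()
        if key in excluded or key in tried:
            continue  # skip excluded encodings and duplicates
        tried.add(key)
        yield (markup, encoding, document_declared_encoding, False)


strategies = list(candidate_strategies(
    b"<root/>",
    user_specified_encoding="latin-1",
    document_declared_encoding="utf-8",
    exclude_encodings=["windows-1252"],
))
for strategy in strategies:
    print(strategy)
```

A caller would attempt to parse with each tuple in turn and stop at the first one the parser accepts.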

feed(markup)#

Run some incoming markup through some parsing process, populating the BeautifulSoup object in self.soup.

This method is not implemented in TreeBuilder; it must be implemented in subclasses.

Returns:

None.
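The CHUNK_SIZE attribute suggests that markup is handed to lxml in pieces. A minimal sketch of chunked feeding with lxml's feed interface follows; whether feed reads the markup in exactly this way is an assumption, and the sample document is made up.

```python
from io import BytesIO

from lxml import etree

CHUNK_SIZE = 512  # same value as the class attribute documented above

# Made-up document, deliberately longer than one chunk.
markup = b"<root>" + b"<item>x</item>" * 100 + b"</root>"

parser = etree.XMLParser()
stream = BytesIO(markup)
chunks_fed = 0
while True:
    chunk = stream.read(CHUNK_SIZE)
    if not chunk:
        break
    parser.feed(chunk)  # hand lxml one piece at a time
    chunks_fed += 1

# With no custom target, close() returns the root element of the parsed tree.
root = parser.close()
print(chunks_fed, root.tag, len(root))
```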

close()#
start(name, attrs, nsmap={})#
_prefix_for_namespace(namespace)#

Find the currently active prefix for the given namespace.
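One way to picture "currently active" is a stack of inverted namespace maps (URI to prefix), searched from the innermost scope outward. The sketch below assumes that data structure; the function name and the stack layout are hypothetical and only illustrate the lookup idea.

```python
def prefix_for_namespace(nsmap_stack, namespace):
    """Return the innermost prefix bound to `namespace`, or None."""
    if namespace is None:
        return None
    # Search the most recently opened scope first.
    for inverted in reversed(nsmap_stack):
        if inverted is not None and namespace in inverted:
            return inverted[namespace]
    return None


stack = [
    {"http://www.w3.org/XML/1998/namespace": "xml"},  # defaults
    {"http://example.com/a": "a"},                    # outer element
    None,                                             # element with no declarations
    {"http://example.com/a": "alias"},                # inner element rebinds the URI
]

inner = prefix_for_namespace(stack, "http://example.com/a")
stack.pop()  # leaving the inner element's scope
outer = prefix_for_namespace(stack, "http://example.com/a")
missing = prefix_for_namespace(stack, "http://example.com/unknown")
print(inner, outer, missing)
```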

end(name)#
pi(target, data)#
data(content)#
doctype(name, pubid, system)#
comment(content)#

Handle comments as Comment objects.
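The methods start, end, data, comment, pi, doctype and close line up with the callbacks of lxml's parser target interface. The sketch below shows that interface in isolation, assuming lxml is installed. `EventRecorder` is a hypothetical target that only records calls; it does not build a Beautiful Soup tree.

```python
from lxml import etree


class EventRecorder:
    """Hypothetical parser target that records the callbacks lxml makes."""

    def __init__(self):
        self.events = []

    def start(self, name, attrs):
        self.events.append(("start", name, dict(attrs)))

    def end(self, name):
        self.events.append(("end", name))

    def data(self, content):
        self.events.append(("data", content))

    def comment(self, content):
        self.events.append(("comment", content))

    def close(self):
        # Whatever close() returns becomes the result of parser.close().
        return self.events


markup = b"<root><!-- note --><item id='1'>text</item></root>"
parser = etree.XMLParser(target=EventRecorder())
parser.feed(markup)
events = parser.close()
for event in events:
    print(event)
```

A tree builder acting as the target would create tags, strings and Comment objects in these callbacks instead of appending to a list.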

test_fragment_to_document(fragment)#

See TreeBuilder.

class bs4.builder._lxml.LXMLTreeBuilder(multi_valued_attributes=USE_DEFAULT, preserve_whitespace_tags=USE_DEFAULT, store_line_numbers=USE_DEFAULT, string_containers=USE_DEFAULT)#

Bases: bs4.builder.HTMLTreeBuilder, LXMLTreeBuilderForXML

This TreeBuilder knows facts about HTML, such as which tags are empty-element tags.
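A short sketch of what that HTML knowledge looks like from the outside, assuming bs4 and lxml are installed: the builder is selected with the "lxml" feature name, and a br tag is treated as an empty-element tag even though the made-up input never closes it.

```python
from bs4 import BeautifulSoup

# Made-up HTML fragment with an unclosed <br>.
soup = BeautifulSoup("<p>a<br>b</p>", "lxml")

builder_name = type(soup.builder).__name__
br_is_empty = soup.br.is_empty_element
rendered = str(soup.p)
print(builder_name, br_is_empty, rendered)
```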

NAME#
ALTERNATE_NAMES = ['lxml-html']#
features#
is_xml = False#
processing_instruction_class#
default_parser(encoding)#

Find the default parser for the given encoding.

Parameters:

encoding – A string.

Returns:

Either a parser object or a class, which will be instantiated with default arguments.

feed(markup)#

Run some incoming markup through some parsing process, populating the BeautifulSoup object in self.soup.

This method is not implemented in TreeBuilder; it must be implemented in subclasses.

Returns:

None.

test_fragment_to_document(fragment)#

See TreeBuilder.