bs4.builder#

Submodules#

Package Contents#

Classes#

TreeBuilderRegistry

A way of looking up TreeBuilder subclasses by their name or by desired features.

TreeBuilder

Turn a textual document into a Beautiful Soup object tree.

SAXTreeBuilder

A Beautiful Soup treebuilder that listens for SAX events.

HTMLTreeBuilder

This TreeBuilder knows facts about HTML.

class bs4.builder.TreeBuilderRegistry#

Bases: object

A way of looking up TreeBuilder subclasses by their name or by desired features.

register(treebuilder_class)#

Register a treebuilder based on its advertised features.

Parameters:

treebuilder_class – A subclass of TreeBuilder. Its .features attribute should list the features it provides.

lookup(*features)#

Look up a TreeBuilder subclass with the desired features.

Parameters:

features – A list of features to look for. If none are provided, the most recently registered TreeBuilder subclass will be used.

Returns:

A TreeBuilder subclass, or None if there’s no registered subclass with all the requested features.
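The following is a minimal sketch of register() and lookup() on a fresh registry. FastBuilder, StrictBuilder and their feature strings are hypothetical names invented for illustration. The classes are only registered, never instantiated, so they do not implement feed().

```python
from bs4.builder import TreeBuilder, TreeBuilderRegistry


# Hypothetical builders: only their advertised features matter here.
class FastBuilder(TreeBuilder):
    NAME = "fast-demo"
    features = [NAME, "fast", "html"]


class StrictBuilder(TreeBuilder):
    NAME = "strict-demo"
    features = [NAME, "strict", "html"]


# Use a private registry so the library's shared one is left untouched.
registry = TreeBuilderRegistry()
registry.register(FastBuilder)
registry.register(StrictBuilder)

by_name = registry.lookup("fast-demo")           # look up by name
by_features = registry.lookup("html", "strict")  # must have both features
latest = registry.lookup()                       # no features requested
conflicting = registry.lookup("fast", "strict")  # no builder has both
```

In this sketch, by_name is FastBuilder, by_features and latest are StrictBuilder (the most recently registered class), and conflicting is None.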

class bs4.builder.TreeBuilder(multi_valued_attributes=USE_DEFAULT, preserve_whitespace_tags=USE_DEFAULT, store_line_numbers=USE_DEFAULT, string_containers=USE_DEFAULT)#

Bases: object

Turn a textual document into a Beautiful Soup object tree.

NAME = '[Unknown tree builder]'#
ALTERNATE_NAMES = []#
features = []#
is_xml = False#
picklable = False#
empty_element_tags#
DEFAULT_CDATA_LIST_ATTRIBUTES#
DEFAULT_PRESERVE_WHITESPACE_TAGS#
DEFAULT_STRING_CONTAINERS#
USE_DEFAULT#
TRACKS_LINE_NUMBERS = False#
initialize_soup(soup)#

The BeautifulSoup object has been initialized and is now being associated with the TreeBuilder.

Parameters:

soup – A BeautifulSoup object.

reset()#

Do any work necessary to reset the underlying parser for a new document.

By default, this does nothing.

can_be_empty_element(tag_name)#

Might a tag with this name be an empty-element tag?

The final markup may or may not actually present this tag as self-closing.

For instance: an HTMLTreeBuilder does not consider a <p> tag to be an empty-element tag (it’s not in HTMLTreeBuilder.empty_element_tags). This means an empty <p> tag will be presented as “<p></p>”, not “<p/>” or “<p>”.

The default implementation has no opinion about which tags are empty-element tags, so a tag will be presented as an empty-element tag if and only if it has no children. “<foo></foo>” will become “<foo/>”, and “<foo>bar</foo>” will be left alone.

Parameters:

tag_name – The name of a markup tag.
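As a rough illustration of the difference described above, the sketch below asks a plain TreeBuilder and an HTMLTreeBuilder about the same tag names, then renders a small document with the built-in "html.parser" builder. The variable names are illustrative only.

```python
from bs4 import BeautifulSoup
from bs4.builder import HTMLTreeBuilder, TreeBuilder

generic_builder = TreeBuilder()
html_builder = HTMLTreeBuilder()

# The base class has no opinion, so any tag name might be empty-element.
generic_allows_p = generic_builder.can_be_empty_element("p")

# The HTML builder consults its empty_element_tags set.
html_allows_p = html_builder.can_be_empty_element("p")
html_allows_br = html_builder.can_be_empty_element("br")

# An empty <p> keeps its closing tag; <br> is written as self-closing.
rendered = BeautifulSoup("<p></p><br>", "html.parser").decode()
```

Here generic_allows_p and html_allows_br are True, html_allows_p is False, and rendered is "<p></p><br/>".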

abstract feed(markup)#

Run some incoming markup through some parsing process, populating the BeautifulSoup object in self.soup.

This method is not implemented in TreeBuilder; it must be implemented in subclasses.

Returns:

None.

prepare_markup(markup, user_specified_encoding=None, document_declared_encoding=None, exclude_encodings=None)#

Run any preliminary steps necessary to make incoming markup acceptable to the parser.

Parameters:
  • markup – Some markup – probably a bytestring.

  • user_specified_encoding – The user asked to try this encoding.

  • document_declared_encoding – The markup itself claims to be in this encoding. NOTE: This argument is not used by the calling code and can probably be removed.

  • exclude_encodings – The user asked not to try any of these encodings.

Yield:

A series of 4-tuples: (markup, encoding, declared encoding, has undergone character replacement)

Each 4-tuple represents a strategy for converting the document to Unicode and parsing it. Each strategy will be tried in turn.

By default, the only strategy is to parse the markup as-is. See LXMLTreeBuilderForXML and HTMLParserTreeBuilder for implementations that take into account the quirks of particular parsers.
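A small sketch of the default behavior, calling prepare_markup() directly on a base TreeBuilder. Calling it by hand like this is only for illustration, since the method is normally driven by the BeautifulSoup constructor.

```python
from bs4.builder import TreeBuilder

markup = b"<p>caf\xc3\xa9</p>"

# prepare_markup() is a generator; collect every strategy it offers.
strategies = list(
    TreeBuilder().prepare_markup(markup, user_specified_encoding="utf-8")
)

# Unpack the single default strategy into its four fields.
(parsed_markup, encoding, declared_encoding, was_replaced) = strategies[0]
```

With the base class there is exactly one strategy: the markup is passed through unchanged, both encoding fields are None, and the replacement flag is False. Subclasses such as HTMLParserTreeBuilder may yield different tuples.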

test_fragment_to_document(fragment)#

Wrap an HTML fragment to make it look like a document.

Different parsers do this differently. For instance, lxml introduces an empty <head> tag, and html5lib doesn’t. Abstracting this away lets us write simple tests which run HTML fragments through the parser and compare the results against other HTML fragments.

This method should not be used outside of tests.

Parameters:

fragment – A string – a fragment of HTML.

Returns:

A string – a full HTML document.

set_up_substitutions(tag)#

Set up any substitutions that will need to be performed on a Tag when it’s output as a string.

By default, this does nothing. See HTMLTreeBuilder for a case where this is used.

Parameters:

tag – A Tag

Returns:

Whether or not a substitution was performed.

_replace_cdata_list_attribute_values(tag_name, attrs)#

When an attribute value is associated with a tag that can have multiple values for that attribute, convert the string value to a list of strings.

Basically, replaces class="foo bar" with class=["foo", "bar"].

NOTE: This method modifies its input in place.

Parameters:
  • tag_name – The name of a tag.

  • attrs – A dictionary containing the tag’s attributes. Any appropriate attribute values will be modified in place.
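The effect of this conversion can be observed through the public API rather than by calling the private method. The sketch below assumes the built-in "html.parser" builder. It also passes multi_valued_attributes=None to the BeautifulSoup constructor, which forwards it to the builder and turns the conversion off.

```python
from bs4 import BeautifulSoup

markup = '<p class="foo bar" id="foo bar"></p>'

# Default HTML behavior: class is multi-valued, id is not.
soup = BeautifulSoup(markup, "html.parser")
class_value = soup.p["class"]
id_value = soup.p["id"]

# With multi_valued_attributes=None, no attribute is split into a list.
plain_soup = BeautifulSoup(markup, "html.parser", multi_valued_attributes=None)
plain_class_value = plain_soup.p["class"]
```

Here class_value compares equal to ["foo", "bar"], while id_value and plain_class_value both stay as the string "foo bar".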

class bs4.builder.SAXTreeBuilder(multi_valued_attributes=USE_DEFAULT, preserve_whitespace_tags=USE_DEFAULT, store_line_numbers=USE_DEFAULT, string_containers=USE_DEFAULT)#

Bases: TreeBuilder

A Beautiful Soup treebuilder that listens for SAX events.

This is not currently used for anything, but it demonstrates how a simple TreeBuilder would work.

abstract feed(markup)#

Run some incoming markup through some parsing process, populating the BeautifulSoup object in self.soup.

This method is not implemented in TreeBuilder; it must be implemented in subclasses.

Returns:

None.

close()#
startElement(name, attrs)#
endElement(name)#
startElementNS(nsTuple, nodeName, attrs)#
endElementNS(nsTuple, nodeName)#
startPrefixMapping(prefix, nodeValue)#
endPrefixMapping(prefix)#
characters(content)#
startDocument()#
endDocument()#
class bs4.builder.HTMLTreeBuilder(multi_valued_attributes=USE_DEFAULT, preserve_whitespace_tags=USE_DEFAULT, store_line_numbers=USE_DEFAULT, string_containers=USE_DEFAULT)#

Bases: TreeBuilder

This TreeBuilder knows facts about HTML, such as which tags are empty-element tags.

empty_element_tags#
block_elements#
DEFAULT_STRING_CONTAINERS#
DEFAULT_CDATA_LIST_ATTRIBUTES#
DEFAULT_PRESERVE_WHITESPACE_TAGS#
set_up_substitutions(tag)#

Replace the declared encoding in a <meta> tag with a placeholder, to be substituted when the tag is output to a string.

An HTML document may come in to Beautiful Soup as one encoding, but exit in a different encoding, and the <meta> tag needs to be changed to reflect this.

Parameters:

tag – A Tag

Returns:

Whether or not a substitution was performed.
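A short sketch of the substitution as seen from the outside, assuming the built-in "html.parser" builder (a subclass of HTMLTreeBuilder). The same parsed document is encoded twice, and the charset declared in the <meta> tag follows the output encoding each time.

```python
from bs4 import BeautifulSoup

# The document declares one encoding...
soup = BeautifulSoup('<meta charset="ISO-8859-1">', "html.parser")

# ...but the declaration is rewritten to match each output encoding.
as_utf8 = soup.encode("utf-8")
as_latin1 = soup.encode("latin-1")
```

After this runs, as_utf8 contains charset="utf-8" and as_latin1 contains charset="latin-1".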