bs4.element#

find_all_next(name=None, attrs={}, string=None, limit=None, **kwargs)#

Find all PageElements that match the given criteria and appear later in the document than this PageElement.

All find_* methods take a common set of arguments. See the online documentation for detailed explanations.

Parameters:

name – A filter on tag name.
attrs – A dictionary of filters on attribute values.
string – A filter for a NavigableString with specific text.
limit – Stop looking after finding this many results.

Kwargs:

A dictionary of filters on attribute values.

Returns:

A ResultSet containing PageElements.
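
As a minimal sketch of the behaviour described above (the markup and variable names are invented for illustration, and the stdlib "html.parser" backend is assumed), find_all_next() can be exercised like this:

```python
from bs4 import BeautifulSoup

# Hypothetical markup, used only for this illustration.
html = "<div><h1>Title</h1><p>one</p><section><p>two</p></section><p>three</p></div>"
soup = BeautifulSoup(html, "html.parser")

first_p = soup.find("p")

# Every <p> that starts after first_p in document order, however deeply nested.
later_texts = [p.get_text() for p in first_p.find_all_next("p")]
print(later_texts)  # ['two', 'three']

# limit stops the search once that many results have been found.
limited = first_p.find_all_next("p", limit=1)
print(len(limited))  # 1
```

Note that the search is not restricted to siblings: the nested paragraph inside <section> is found as well.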

find_next_sibling(name=None, attrs={}, string=None, **kwargs)#

Find the closest sibling to this PageElement that matches the given criteria and appears later in the document.

All find_* methods take a common set of arguments. See the online documentation for detailed explanations.

Parameters:

name – A filter on tag name.
attrs – A dictionary of filters on attribute values.
string – A filter for a NavigableString with specific text.

Kwargs:

A dictionary of filters on attribute values.

Returns:

A PageElement.

find_next_siblings(name=None, attrs={}, string=None, limit=None, **kwargs)#

Find all siblings of this PageElement that match the given criteria and appear later in the document.

All find_* methods take a common set of arguments. See the online documentation for detailed explanations.

Parameters:

name – A filter on tag name.
attrs – A dictionary of filters on attribute values.
string – A filter for a NavigableString with specific text.
limit – Stop looking after finding this many results.

Kwargs:

A dictionary of filters on attribute values.

Returns:

A ResultSet of PageElements.
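
The difference between find_next_sibling() and find_next_siblings() can be sketched as follows. The list markup is an assumption made for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, used only for this illustration.
html = "<ul><li>a</li><li class='x'>b</li><li>c</li></ul>"
soup = BeautifulSoup(html, "html.parser")

first = soup.find("li")

# Singular form: the closest later sibling that matches, or None.
closest = first.find_next_sibling("li")
missing = first.find_next_sibling("p")

# Plural form: all later siblings that match, nearest first.
following = [li.get_text() for li in first.find_next_siblings("li")]

print(closest.get_text(), missing, following)  # b None ['b', 'c']
```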

find_previous(name=None, attrs={}, string=None, **kwargs)#

Look backwards in the document from this PageElement and find the first PageElement that matches the given criteria.

All find_* methods take a common set of arguments. See the online documentation for detailed explanations.

Parameters:

name – A filter on tag name.
attrs – A dictionary of filters on attribute values.
string – A filter for a NavigableString with specific text.

Kwargs:

A dictionary of filters on attribute values.

Returns:

A PageElement.

find_all_previous(name=None, attrs={}, string=None, limit=None, **kwargs)#

Look backwards in the document from this PageElement and find all PageElements that match the given criteria.

All find_* methods take a common set of arguments. See the online documentation for detailed explanations.

Parameters:

name – A filter on tag name.
attrs – A dictionary of filters on attribute values.
string – A filter for a NavigableString with specific text.
limit – Stop looking after finding this many results.

Kwargs:

A dictionary of filters on attribute values.

Returns:

A ResultSet of PageElements.
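
A short sketch of the backward searches, find_previous() and find_all_previous(), under the assumption of a flat heading/paragraph document invented for the example. Results come back nearest first, i.e. in reverse document order.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, used only for this illustration.
html = "<h2>A</h2><p>one</p><h2>B</h2><p>two</p>"
soup = BeautifulSoup(html, "html.parser")

last_p = soup.find_all("p")[-1]

# The first match found while walking backwards from last_p.
nearest_heading = last_p.find_previous("h2").get_text()

# All matches found while walking backwards, closest one first.
all_headings = [h.get_text() for h in last_p.find_all_previous("h2")]

print(nearest_heading, all_headings)  # B ['B', 'A']
```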

find_previous_sibling(name=None, attrs={}, string=None, **kwargs)#

Returns the closest sibling to this PageElement that matches the given criteria and appears earlier in the document.

All find_* methods take a common set of arguments. See the online documentation for detailed explanations.

Parameters:

name – A filter on tag name.
attrs – A dictionary of filters on attribute values.
string – A filter for a NavigableString with specific text.

Kwargs:

A dictionary of filters on attribute values.

Returns:

A PageElement.

find_previous_siblings(name=None, attrs={}, string=None, limit=None, **kwargs)#

Returns all siblings to this PageElement that match the given criteria and appear earlier in the document.

All find_* methods take a common set of arguments. See the online documentation for detailed explanations.

Parameters:

name – A filter on tag name.
attrs – A dictionary of filters on attribute values.
string – A filter for a NavigableString with specific text.
limit – Stop looking after finding this many results.

Kwargs:

A dictionary of filters on attribute values.

Returns:

A ResultSet of PageElements.

find_parent(name=None, attrs={}, **kwargs)#

Find the closest parent of this PageElement that matches the given criteria.

All find_* methods take a common set of arguments. See the online documentation for detailed explanations.

Parameters:

name – A filter on tag name.
attrs – A dictionary of filters on attribute values.

Kwargs:

A dictionary of filters on attribute values.

Returns:

A PageElement.

find_parents(name=None, attrs={}, limit=None, **kwargs)#

Find all parents of this PageElement that match the given criteria.

All find_* methods take a common set of arguments. See the online documentation for detailed explanations.

Parameters:

name – A filter on tag name.
attrs – A dictionary of filters on attribute values.
limit – Stop looking after finding this many results.

Kwargs:

A dictionary of filters on attribute values.

Returns:

A ResultSet of PageElements.
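
To illustrate how find_parent() and find_parents() differ, here is a small sketch. The nested <div> markup and the id values are assumptions made for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, used only for this illustration.
html = "<div id='outer'><div id='inner'><p><b>x</b></p></div></div>"
soup = BeautifulSoup(html, "html.parser")

b = soup.b

# find_parent() returns the closest matching ancestor.
closest_id = b.find_parent("div")["id"]

# find_parents() returns every matching ancestor, innermost first.
all_ids = [d["id"] for d in b.find_parents("div")]

print(closest_id, all_ids)  # inner ['inner', 'outer']
```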

_find_one(method, name, attrs, string, **kwargs)#

_find_all(name, attrs, string, limit, generator, **kwargs)#: Iterates over a generator looking for things that match.

nextGenerator()#

nextSiblingGenerator()#

previousGenerator()#

previousSiblingGenerator()#

parentGenerator()#

class bs4.element.NavigableString#

Bases: str, PageElement

A Python Unicode string that is part of a parse tree.

When Beautiful Soup parses the markup <b>penguin</b>, it will create a NavigableString for the string “penguin”.

property name#

Since a NavigableString is not a Tag, it has no .name.

This property is implemented so that code like this doesn’t crash when run on a mixture of Tag and NavigableString objects:

[x.name for x in tag.children]
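
A runnable version of that one-liner might look like the sketch below (the paragraph markup is an assumption). Strings report None as their name, which also gives a cheap way to keep only the Tag children.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, used only for this illustration.
soup = BeautifulSoup("<p>Hello <b>world</b>!</p>", "html.parser")
tag = soup.p

# Mixed children: NavigableString, Tag, NavigableString.
names = [x.name for x in tag.children]
print(names)  # [None, 'b', None]

# Filtering on .name keeps only the Tag objects.
tags_only = [x for x in tag.children if x.name is not None]
print(len(tags_only))  # 1
```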

PREFIX = ''#

SUFFIX = ''#

strings#

__deepcopy__(memo, recursive=False)#

A copy of a NavigableString has the same contents and class as the original, but it is not connected to the parse tree.

Parameters: recursive – This parameter is ignored; it’s only defined so that NavigableString.__deepcopy__ implements the same signature as Tag.__deepcopy__.

__copy__()#: A copy of a NavigableString can only be a deep copy, because only one PageElement can occupy a given place in a parse tree.
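
The copy semantics described above can be checked with the standard copy module. This is a sketch on invented markup, not a statement about how the methods are implemented internally.

```python
import copy

from bs4 import BeautifulSoup
from bs4.element import NavigableString

# Hypothetical markup, used only for this illustration.
soup = BeautifulSoup("<p>penguin</p>", "html.parser")
original = soup.p.string

duplicate = copy.copy(original)

print(duplicate == original)             # True: same contents
print(type(duplicate) is NavigableString)  # True: same class
print(duplicate.parent)                  # None: detached from the tree
print(original.parent.name)              # p: the original is still attached
```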

__getnewargs__()#

__getattr__(attr)#: text.string gives you text. This is for backwards compatibility for Navigable*String, but for CData* it lets you get the string without the CData wrapper.

output_ready(formatter='minimal')#

Run the string through the provided formatter.

Parameters: formatter – A Formatter object, or a string naming one of the standard formatters.

_all_strings(strip=False, types=PageElement.default)#

Yield all strings of certain classes, possibly stripping them.

This makes it easy for NavigableString to implement methods like get_text() as conveniences, creating a consistent text-extraction API across all PageElements.

Parameters:

strip – If True, all strings will be stripped before being yielded.
types – A tuple of NavigableString subclasses. If this NavigableString isn’t one of those subclasses, the sequence will be empty. By default, the subclasses considered are NavigableString and CData objects. That means no comments, processing instructions, etc.

Yield:

A sequence that either contains this string, or is empty.

class bs4.element.PreformattedString#

A NavigableString not subject to the normal formatting rules.

This is an abstract class used for special kinds of strings such as comments (the Comment class) and CDATA blocks (the CData class).

PREFIX = ''#

SUFFIX = ''#

output_ready(formatter=None)#

Make this string ready for output by adding any subclass-specific prefix or suffix.

Parameters: formatter – A Formatter object, or a string naming one of the standard formatters. The string will be passed into the Formatter, but only to trigger any side effects: the return value is ignored.
Returns: The string, with any subclass-specific prefix and suffix added on.

class bs4.element.CData#

A CDATA block.

PREFIX = '<![CDATA['#

SUFFIX = ']]>'#

class bs4.element.ProcessingInstruction#

An SGML processing instruction.

PREFIX = '<?'#

SUFFIX = '>'#

class bs4.element.XMLProcessingInstruction#

Bases: ProcessingInstruction

An XML processing instruction.

PREFIX = '<?'#

SUFFIX = '?>'#

class bs4.element.Comment#

An HTML or XML comment.

PREFIX = '<!--'#

SUFFIX = '-->'#
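
The role of PREFIX and SUFFIX can be seen by comparing a string's own value with what output_ready() returns. The sketch below uses Comment and CData and assumes invented markup; the CData object is constructed by hand purely for demonstration.

```python
from bs4 import BeautifulSoup
from bs4.element import CData, Comment

# Hypothetical markup, used only for this illustration.
soup = BeautifulSoup("<p><!-- note --></p>", "html.parser")
comment = soup.p.string

print(isinstance(comment, Comment))  # True
print(repr(str(comment)))            # ' note ' (the delimiters are not part of the string)
print(comment.output_ready())        # <!-- note -->

# A hand-built CData string picks up its own PREFIX and SUFFIX on output.
cdata = CData("x < y")
print(cdata.output_ready())          # <![CDATA[x < y]]>
```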

class bs4.element.Declaration#

An XML declaration.

PREFIX = '<?'#

SUFFIX = '?>'#

class bs4.element.Doctype#

A document type declaration.

PREFIX = '<!DOCTYPE '#

SUFFIX = '>\n'#

classmethod for_name_and_ids(name, pub_id, system_id)#

Generate an appropriate document type declaration for a given public ID and system ID.

Parameters:

name – The name of the document’s root element, e.g. ‘html’.
pub_id – The Formal Public Identifier for this document type, e.g. ‘-//W3C//DTD XHTML 1.1//EN’
system_id – The system identifier for this document type, e.g. ‘http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd’

Returns:

A Doctype.
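
A brief sketch of for_name_and_ids(), using the identifiers from the parameter descriptions above. The variable names are invented, and passing None for both IDs to get an HTML5-style doctype is an assumption about typical usage.

```python
from bs4.element import Doctype

# No public or system identifier: an HTML5-style doctype.
html5 = Doctype.for_name_and_ids("html", None, None)
print(repr(html5.output_ready()))  # '<!DOCTYPE html>\n'

# With both identifiers supplied.
xhtml = Doctype.for_name_and_ids(
    "html",
    "-//W3C//DTD XHTML 1.1//EN",
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd",
)
print(xhtml.output_ready())
```

The PREFIX and SUFFIX are only added by output_ready(); the Doctype string itself holds just the declaration body.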

class bs4.element.Stylesheet#

A NavigableString representing a stylesheet (probably CSS).

Used to distinguish embedded stylesheets from textual content.

class bs4.element.Script#

A NavigableString representing an executable script (probably JavaScript).

Used to distinguish executable code from textual content.

class bs4.element.TemplateString#

A NavigableString representing a string found inside an HTML template embedded in a larger document.

Used to distinguish such strings from the main body of the document.

class bs4.element.RubyTextString#

https://dev.w3.org/html5/spec-LC/text-level-semantics.html#the-rt-element

A NavigableString representing the contents of the <rt> HTML element.

Can be used to distinguish such strings from the strings they’re annotating.

class bs4.element.RubyParenthesisString#

https://dev.w3.org/html5/spec-LC/text-level-semantics.html#the-rp-element

A NavigableString representing the contents of the <rp> HTML element.

class bs4.element.Tag(parser=None, builder=None, name=None, namespace=None, prefix=None, attrs=None, parent=None, previous=None, is_xml=None, sourceline=None, sourcepos=None, can_be_empty_element=None, cdata_list_attributes=None, preserve_whitespace_tags=None, interesting_string_types=None, namespaces=None)#

Bases: PageElement

Represents an HTML or XML tag that is part of a parse tree, along with its attributes and contents.

When Beautiful Soup parses the markup <b>penguin</b>, it will create a Tag object representing the <b> tag.

property is_empty_element#

Is this tag an empty-element tag? (aka a self-closing tag)

A tag that has contents is never an empty-element tag.

A tag that has no contents may or may not be an empty-element tag. It depends on the builder used to create the tag. If the builder has a designated list of empty-element tags, then only a tag whose name shows up in that list is considered an empty-element tag.

If the builder has no designated list of empty-element tags, then any tag with no contents is an empty-element tag.
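
With an HTML builder, which does have a designated list of empty-element tags, the rule plays out as in the sketch below. The markup is invented and the stdlib "html.parser" backend is assumed.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, used only for this illustration.
soup = BeautifulSoup("<p></p><br>", "html.parser")

# <p> has no contents, but it is not on the HTML list of empty-element tags.
print(soup.p.is_empty_element, str(soup.p))    # False <p></p>

# <br> is on that list, so it is rendered in self-closing form.
print(soup.br.is_empty_element, str(soup.br))  # True <br/>
```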

property string#

Convenience property to get the single string within this PageElement.

TODO It might make sense to have NavigableString.string return itself.

Returns: If this element has a single string child, return value is that string. If this element has one child tag, return value is the ‘string’ attribute of the child tag, recursively. If this element is itself a string, has no children, or has more than one child, return value is None.
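
The three cases in that return description can be sketched as follows. The helper name p_string and the markup snippets are assumptions made for the example.

```python
from bs4 import BeautifulSoup

def p_string(markup):
    """Parse markup and return the .string of its first <p> (hypothetical helper)."""
    return BeautifulSoup(markup, "html.parser").p.string

single = p_string("<p>only</p>")         # one string child
nested = p_string("<p><b>only</b></p>")  # one child tag: resolved recursively
mixed = p_string("<p>a<b>b</b></p>")     # more than one child
empty = p_string("<p></p>")              # no children

print(single, nested, mixed, empty)  # only only None None
```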

property children#

Iterate over all direct children of this PageElement.

Yield: A sequence of PageElements.

property self_and_descendants#

Iterate over this PageElement and its descendants in document order (a depth-first traversal).

Yield: A sequence of PageElements.

property descendants#

Iterate over all descendants of this PageElement in document order (a depth-first traversal).

Yield: A sequence of PageElements.
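
A small sketch comparing .children with .descendants on invented markup. Only Tag objects are kept in the output so the traversal order is easy to read; the order shown is what this markup yields.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, used only for this illustration.
soup = BeautifulSoup("<div><p><b>x</b></p><i>y</i></div>", "html.parser")
div = soup.div

# Direct children only.
child_names = [e.name for e in div.children if e.name]

# All descendants: <b> (inside <p>) is visited before the sibling <i>.
descendant_names = [e.name for e in div.descendants if e.name]

print(child_names)       # ['p', 'i']
print(descendant_names)  # ['p', 'b', 'i']
```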

property css#: Return an interface to the CSS selector API.

parserClass#

isSelfClosing#

DEFAULT_INTERESTING_STRING_TYPES = ()#

strings#

START_ELEMENT_EVENT#

END_ELEMENT_EVENT#

EMPTY_ELEMENT_EVENT#

STRING_ELEMENT_EVENT#

findChild#

findAll#

findChildren#

__deepcopy__(memo, recursive=True)#: A deepcopy of a Tag is a new Tag, unconnected to the parse tree. Its contents are a copy of the old Tag’s contents.

__copy__()#: A copy of a Tag must always be a deep copy, because a Tag’s children can only have one parent at a time.
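
The following sketch checks those copy semantics with the standard copy module. The markup and variable names are assumptions made for the example.

```python
import copy

from bs4 import BeautifulSoup

# Hypothetical markup, used only for this illustration.
soup = BeautifulSoup("<div><p class='a'>hi <b>there</b></p></div>", "html.parser")
original = soup.p

duplicate = copy.copy(original)

print(duplicate == original)         # True: same name, attributes and contents
print(duplicate.parent)              # None: not connected to the parse tree
print(duplicate.b is original.b)     # False: the children were copied too
print(original.parent.name)          # div: the original is untouched
```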

_clone()#

Create a new Tag just like this one, but with no contents and unattached to any parse tree.

This is the first step in the deepcopy process.

_all_strings(strip=False, types=PageElement.default)#

Yield all strings of certain classes, possibly stripping them.

Parameters:

strip – If True, all strings will be stripped before being yielded.
types – A tuple of NavigableString subclasses. Any strings of a subclass not found in this list will be ignored. By default, the subclasses considered are the ones found in self.interesting_string_types. If that’s not specified, only NavigableString and CData objects will be considered. That means no comments, processing instructions, etc.

Yield:

A sequence of strings.

decompose()#

Recursively destroys this PageElement and its children.

This element will be removed from the tree and wiped out; so will everything beneath it.

The behavior of a decomposed PageElement is undefined and you should never use one for anything, but if you need to _check_ whether an element has been decomposed, you can use the decomposed property.

clear(decompose=False)#

Wipe out all children of this PageElement by calling extract() on them.

Parameters: decompose – If this is True, decompose() (a more destructive method) will be called instead of extract().
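
As a rough sketch of decompose() and clear() in sequence (the markup is invented for illustration):

```python
from bs4 import BeautifulSoup

# Hypothetical markup, used only for this illustration.
soup = BeautifulSoup("<div><p>a</p><p>b</p></div>", "html.parser")
div = soup.div

# decompose() removes the element from the tree and destroys it.
first_p = soup.p
first_p.decompose()
after_decompose = str(div)
print(first_p.decomposed, after_decompose)  # True <div><p>b</p></div>

# clear() removes whatever children remain.
div.clear()
after_clear = str(div)
print(after_clear)  # <div></div>
```

Apart from reading the decomposed property, first_p should not be used again after the call to decompose().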

smooth()#

Smooth out this element’s children by consolidating consecutive strings.

This makes pretty-printed output look more natural following a lot of operations that modified the tree.
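
A minimal sketch of what smooth() does, assuming a tree that was modified by appending a second string next to an existing one:

```python
from bs4 import BeautifulSoup

# Hypothetical markup, used only for this illustration.
soup = BeautifulSoup("<p>a</p>", "html.parser")
p = soup.p

# Appending a string leaves two adjacent NavigableStrings in the tree.
p.append("b")
count_before = len(p.contents)
string_before = p.string  # None, because there are two children

# smooth() merges consecutive strings into one.
p.smooth()
count_after = len(p.contents)

print(count_before, count_after, p.string)  # 2 1 ab
```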

index(element)#

Find the index of a child by identity, not value.

Avoids issues with tag.contents.index(element) getting the index of equal elements.

Parameters: element – Look for this PageElement in self.contents.
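
The problem that index() avoids shows up when two children compare equal. In this sketch (invented markup), both <li> tags have the same name, attributes and contents:

```python
from bs4 import BeautifulSoup

# Hypothetical markup, used only for this illustration.
soup = BeautifulSoup("<ul><li>x</li><li>x</li></ul>", "html.parser")
ul = soup.ul
first, second = ul.find_all("li")

# list.index() compares by equality, so it stops at the first equal element.
by_value = ul.contents.index(second)

# Tag.index() compares by identity and finds the actual object.
by_identity = ul.index(second)

print(by_value, by_identity)  # 0 1
```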

get(key, default=None)#: Returns the value of the ‘key’ attribute for the tag, or the value given for ‘default’ if it doesn’t have that attribute.

get_attribute_list(key, default=None)#

The same as get(), but always returns a list.

Parameters:

key – The attribute to look for.
default – Use this value if the attribute is not present on this PageElement.

Returns:

A list of values, probably containing only a single value.
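
A short sketch contrasting get() with get_attribute_list() on an invented <a> tag. With an HTML parser, class is treated as a multi-valued attribute, so get() already returns a list for it.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, used only for this illustration.
soup = BeautifulSoup('<a id="x" class="btn primary" href="/home">Home</a>', "html.parser")
a = soup.a

href = a.get("href")              # '/home'
title = a.get("title", "n/a")     # attribute absent, so the default is returned
classes = a.get("class")          # ['btn', 'primary']

# get_attribute_list() wraps single values in a list.
ids = a.get_attribute_list("id")  # ['x']

print(href, title, classes, ids, a.has_attr("title"))
```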

has_attr(key)#: Does this PageElement have an attribute with the given name?

__hash__()#: Return hash(self).

__getitem__(key)#: tag[key] returns the value of the ‘key’ attribute for the Tag, and throws an exception if it’s not there.

__iter__()#: Iterating over a Tag iterates over its contents.

__len__()#: The length of a Tag is the length of its list of contents.

__contains__(x)#

__bool__()#: A Tag is always truthy, even if it has no contents.

__setitem__(key, value)#: Setting tag[key] sets the value of the ‘key’ attribute for the tag.

__delitem__(key)#: Deleting tag[key] deletes all ‘key’ attributes for the tag.

__call__(*args, **kwargs)#: Calling a Tag like a function is the same as calling its find_all() method. E.g. tag('a') returns a ResultSet of all the <a> tags found within this tag.

__getattr__(tag)#: Accessing tag.subtag is the same as calling tag.find(name="subtag").

__eq__(other)#: Returns true iff this Tag has the same name, the same attributes, and the same contents (recursively) as other.

__ne__(other)#: Returns true iff this Tag is not identical to other, as defined in __eq__.
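
The special methods above are mostly shorthand for other calls. The following sketch walks through several of them on invented markup; the attribute name data-x is an assumption made for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, used only for this illustration.
soup = BeautifulSoup('<div id="main"><a href="/1">one</a><a href="/2">two</a></div>', "html.parser")
div = soup.div

# __getitem__ / __setitem__ / __delitem__ work on attributes.
main_id = div["id"]
div["data-x"] = "1"
had_attr = div.has_attr("data-x")
del div["data-x"]
try:
    div["missing"]
    missing_raises = False
except KeyError:
    missing_raises = True

# __call__ and __getattr__ are shortcuts for find_all() and find().
same_as_find_all = div("a") == div.find_all("a")
same_as_find = div.a is div.find("a")

# __iter__ and __len__ work on the contents; __bool__ ignores them.
child_names = [child.name for child in div]
empty_is_truthy = bool(BeautifulSoup("<p></p>", "html.parser").p)

print(main_id, missing_raises, same_as_find_all, same_as_find, len(div), child_names, empty_is_truthy)
```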

__repr__(encoding='unicode-escape')#

Renders this PageElement as a string.

Parameters: encoding – The encoding to use (Python 2 only). TODO: This is now ignored and a warning should be issued if a value is provided.
Returns: A (Unicode) string.

__unicode__()#: Renders this PageElement as a Unicode string.

encode(encoding=DEFAULT_OUTPUT_ENCODING, indent_level=None, formatter='minimal', errors='xmlcharrefreplace')#

Render a bytestring representation of this PageElement and its contents.

Parameters:

encoding – The destination encoding.
indent_level – Each line of the rendering will be indented this many levels. (The formatter decides what a ‘level’ means in terms of spaces or other characters output.) Used internally in recursive calls while pretty-printing.
formatter – A Formatter object, or a string naming one of the standard formatters.
errors – An error handling strategy such as ‘xmlcharrefreplace’. This value is passed along into encode() and its value should be one of the constants defined by Python.

Returns:

A bytestring.
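
To show the effect of the encoding and errors parameters, here is a sketch that encodes the same tag twice. The markup is invented; with the default errors value, characters the target encoding cannot represent become numeric character references.

```python
from bs4 import BeautifulSoup

# Hypothetical markup containing one non-ASCII character (e with acute accent).
soup = BeautifulSoup("<p>caf\u00e9</p>", "html.parser")

as_utf8 = soup.p.encode("utf-8")
as_ascii = soup.p.encode("ascii")  # errors defaults to 'xmlcharrefreplace'

print(as_utf8)   # b'<p>caf\xc3\xa9</p>'
print(as_ascii)  # b'<p>caf&#233;</p>'
```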

decode(indent_level=None, eventual_encoding=DEFAULT_OUTPUT_ENCODING, formatter='minimal', iterator=None)#

_event_stream(iterator=None)#

Yield a sequence of events that can be used to reconstruct the DOM for this element.

This lets us recreate the nested structure of this element (e.g. when formatting it as a string) without using recursive method calls.

This is similar in concept to the SAX API, but it’s a simpler interface designed for internal use. The events are different from SAX and the arguments associated with the events are Tags and other Beautiful Soup objects.

Parameters: iterator – An alternate iterator to use when traversing the tree.

_indent_string(s, indent_level, formatter, indent_before, indent_after)#

Add indentation whitespace before and/or after a string.

Parameters:

s – The string to amend with whitespace.
indent_level – The indentation level; affects how much whitespace goes before the string.
indent_before – Whether or not to add whitespace before the string.
indent_after – Whether or not to add whitespace (a newline) after the string.

_format_tag(eventual_encoding, formatter, opening)#

_should_pretty_print(indent_level=1)#

Should this tag be pretty-printed?

Most of them should, but some (such as <pre> in HTML documents) should not.

prettify(encoding=None, formatter='minimal')#

Pretty-print this PageElement as a string.

Parameters:

encoding – The eventual encoding of the string. If this is None, a Unicode string will be returned.
formatter – A Formatter object, or a string naming one of the standard formatters.

Returns:

A Unicode string (if encoding==None) or a bytestring (otherwise).
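
A brief sketch of prettify() with and without an encoding, on invented markup. The exact indentation is decided by the formatter, so the example inspects the stripped lines rather than the raw whitespace.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, used only for this illustration.
soup = BeautifulSoup("<div><p>hi</p></div>", "html.parser")

# No encoding: a Unicode string with one element or string per line.
pretty = soup.div.prettify()
lines = [line.strip() for line in pretty.splitlines()]
print(lines)  # ['<div>', '<p>', 'hi', '</p>', '</div>']

# With an encoding: a bytestring instead.
raw = soup.div.prettify(encoding="utf-8")
print(type(pretty).__name__, type(raw).__name__)  # str bytes
```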

decode_contents(indent_level=None, eventual_encoding=DEFAULT_OUTPUT_ENCODING, formatter='minimal')#

Renders the contents of this tag as a Unicode string.

Parameters:

indent_level – Each line of the rendering will be indented this many levels. (The formatter decides what a ‘level’ means in terms of spaces or other characters output.) Used internally in recursive calls while pretty-printing.
eventual_encoding – The tag is destined to be encoded into this encoding. decode_contents() is _not_ responsible for performing that encoding. This information is passed in so that it can be substituted in if the document contains a <META> tag that mentions the document’s encoding.
formatter – A Formatter object, or a string naming one of the standard Formatters.

encode_contents(indent_level=None, encoding=DEFAULT_OUTPUT_ENCODING, formatter='minimal')#

Renders the contents of this PageElement as a bytestring.

Parameters:

indent_level – Each line of the rendering will be indented this many levels. (The formatter decides what a ‘level’ means in terms of spaces or other characters output.) Used internally in recursive calls while pretty-printing.
encoding – The bytestring will be in this encoding.
formatter – A Formatter object, or a string naming one of the standard Formatters.

Returns:

A bytestring.
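
The two *_contents methods render only what is inside the tag, without the tag itself. A minimal sketch on invented markup:

```python
from bs4 import BeautifulSoup

# Hypothetical markup, used only for this illustration.
soup = BeautifulSoup("<p>Hello <b>world</b></p>", "html.parser")
p = soup.p

whole = str(p)                                 # includes the <p> wrapper
inner_text = p.decode_contents()               # Unicode string, no wrapper
inner_bytes = p.encode_contents(encoding="utf-8")  # same, as bytes

print(whole)        # <p>Hello <b>world</b></p>
print(inner_text)   # Hello <b>world</b>
print(inner_bytes)  # b'Hello <b>world</b>'
```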

renderContents(encoding=DEFAULT_OUTPUT_ENCODING, prettyPrint=False, indentLevel=0)#: Deprecated method for BS3 compatibility.

find(name=None, attrs={}, recursive=True, string=None, **kwargs)#

Look in the children of this PageElement and find the first PageElement that matches the given criteria.

All find_* methods take a common set of arguments. See the online documentation for detailed explanations.

Parameters:

name – A filter on tag name.
attrs – A dictionary of filters on attribute values.
recursive – If this is True, find() will perform a recursive search of this PageElement’s children. Otherwise, only the direct children will be considered.
string – A filter for a NavigableString with specific text.

Kwargs:

A dictionary of filters on attribute values.

Returns:

A PageElement.

find_all(name=None, attrs={}, recursive=True, string=None, limit=None, **kwargs)#

Look in the children of this PageElement and find all PageElements that match the given criteria.

All find_* methods take a common set of arguments. See the online documentation for detailed explanations.

Parameters:

name – A filter on tag name.
attrs – A dictionary of filters on attribute values.
recursive – If this is True, find_all() will perform a recursive search of this PageElement’s children. Otherwise, only the direct children will be considered.
string – A filter for a NavigableString with specific text.
limit – Stop looking after finding this many results.

Kwargs:

A dictionary of filters on attribute values.

Returns:

A ResultSet of PageElements.
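
The effect of recursive, string and limit on find() and find_all() can be sketched as follows. The markup is an assumption made for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, used only for this illustration.
html = "<div><p>top</p><section><p>nested</p></section></div>"
soup = BeautifulSoup(html, "html.parser")
div = soup.div

# Default: search all descendants.
everywhere = [p.get_text() for p in div.find_all("p")]

# recursive=False: consider direct children only.
direct = [p.get_text() for p in div.find_all("p", recursive=False)]

# string narrows the match to tags with that exact text; limit caps the count.
by_text = div.find("p", string="nested")
capped = div.find_all("p", limit=1)

# find() returns None when nothing matches; find_all() returns an empty ResultSet.
nothing = div.find("table")

print(everywhere, direct, by_text, len(capped), nothing)
```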

select_one(selector, namespaces=None, **kwargs)#

Perform a CSS selection operation on the current element and return the first matching Tag.

Parameters:

selector – A CSS selector.
namespaces – A dictionary mapping namespace prefixes used in the CSS selector to namespace URIs. By default, Beautiful Soup will use the prefixes it encountered while parsing the document.
kwargs – Keyword arguments to be passed into Soup Sieve’s soupsieve.select() method.

Returns:

A Tag.

Return type:

bs4.element.Tag

select(selector, namespaces=None, limit=None, **kwargs)#

Perform a CSS selection operation on the current element.

This uses the Soup Sieve library.

Parameters:

selector – A string containing a CSS selector.
namespaces – A dictionary mapping namespace prefixes used in the CSS selector to namespace URIs. By default, Beautiful Soup will use the prefixes it encountered while parsing the document.
limit – After finding this number of results, stop looking.
kwargs – Keyword arguments to be passed into Soup Sieve’s soupsieve.select() method.

Returns:

A ResultSet of Tags.
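
A short sketch of select() next to select_one(), assuming the soupsieve package is installed alongside Beautiful Soup. The markup and class names are invented for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, used only for this illustration.
html = '<ul><li class="a">1</li><li class="b">2</li><li class="a">3</li></ul>'
soup = BeautifulSoup(html, "html.parser")

# select() returns every match for the CSS selector.
class_a = [li.get_text() for li in soup.select("li.a")]

# limit stops after that many results.
first_two = soup.select("li", limit=2)

# select_one() returns a single Tag, or None when nothing matches.
only_b = soup.select_one("li.b")
no_match = soup.select_one("li.zzz")

print(class_a, len(first_two), only_b.get_text(), no_match)  # ['1', '3'] 2 2 None
```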
