bs4.element
Module Contents#
Classes#
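| NamespacedAttribute | A namespaced string (e.g. 'xml:lang') that remembers the namespace ('xml') and the name ('lang') that were used to create it. |
| AttributeValueWithCharsetSubstitution | A stand-in object for a character encoding specified in HTML. |
| CharsetMetaAttributeValue | A generic stand-in for the value of a meta tag's 'charset' attribute. |
| ContentMetaAttributeValue | A generic stand-in for the value of a meta tag's 'content' attribute. |
| PageElement | Contains the navigational information for some part of the page: that is, its current location in the parse tree. |
| NavigableString | A Python Unicode string that is part of a parse tree. |
| PreformattedString | A NavigableString not subject to the normal formatting rules. |
| CData | A CDATA block. |
| ProcessingInstruction | An SGML processing instruction. |
| XMLProcessingInstruction | An XML processing instruction. |
| Comment | An HTML or XML comment. |
| Declaration | An XML declaration. |
| Doctype | A document type declaration. |
| Stylesheet | A NavigableString representing a stylesheet (probably CSS). |
| Script | A NavigableString representing an executable script (probably JavaScript). |
| TemplateString | A NavigableString representing a string found inside an HTML template embedded in a larger document. |
| RubyTextString | A NavigableString representing the contents of the <rt> HTML element. |
| RubyParenthesisString | A NavigableString representing the contents of the <rp> HTML element. |
| Tag | Represents an HTML or XML tag that is part of a parse tree, along with its attributes and contents. |
| SoupStrainer | Encapsulates a number of ways of matching a markup element (tag or string). |
| ResultSet | A ResultSet is just a list that keeps track of the SoupStrainer that created it. |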
Functions#
| _alias(attr) | Alias one attribute name to another for backward compatibility. |
Attributes#
- bs4.element.__license__ = 'MIT'#
- bs4.element.DEFAULT_OUTPUT_ENCODING = 'utf-8'#
- bs4.element.nonwhitespace_re#
- bs4.element.whitespace_re#
- bs4.element._alias(attr)#
Alias one attribute name to another for backward compatibility
- bs4.element.PYTHON_SPECIFIC_ENCODINGS#
- class bs4.element.NamespacedAttribute#
Bases:
str
A namespaced string (e.g. ‘xml:lang’) that remembers the namespace (‘xml’) and the name (‘lang’) that were used to create it.
- class bs4.element.AttributeValueWithCharsetSubstitution#
Bases:
str
A stand-in object for a character encoding specified in HTML.
- class bs4.element.CharsetMetaAttributeValue#
Bases:
AttributeValueWithCharsetSubstitution
A generic stand-in for the value of a meta tag’s ‘charset’ attribute.
When Beautiful Soup parses the markup ‘<meta charset="utf8">’, the value of the ‘charset’ attribute will be one of these objects.
- encode(encoding)#
When an HTML document is being encoded to a given encoding, the value of a meta tag’s ‘charset’ is the name of the encoding.
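A minimal sketch of this stand-in behaviour, assuming the built-in "html.parser" builder and an arbitrarily chosen target encoding (latin-1). The exact serialized markup may differ between Beautiful Soup versions.

```python
from bs4 import BeautifulSoup
from bs4.element import CharsetMetaAttributeValue

soup = BeautifulSoup('<meta charset="utf8">', "html.parser")

# The parsed attribute value is a str subclass that remembers it names a charset.
charset_value = soup.meta["charset"]
is_stand_in = isinstance(charset_value, CharsetMetaAttributeValue)

# When the document is re-encoded, the charset name is substituted
# so that the markup matches the bytes actually produced.
encoded = soup.encode("latin-1")

print(is_stand_in)
print(encoded)
```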
- class bs4.element.ContentMetaAttributeValue#
Bases:
AttributeValueWithCharsetSubstitution
A generic stand-in for the value of a meta tag’s ‘content’ attribute.
When Beautiful Soup parses the markup:
<meta http-equiv="content-type" content="text/html; charset=utf8">
the value of the ‘content’ attribute will be one of these objects.
- CHARSET_RE#
- encode(encoding)#
Encode the string using the codec registered for encoding.
- encoding
The encoding in which to encode the string.
- errors
The error handling scheme to use for encoding errors. The default is ‘strict’ meaning that encoding errors raise a UnicodeEncodeError. Other possible values are ‘ignore’, ‘replace’ and ‘xmlcharrefreplace’ as well as any other name registered with codecs.register_error that can handle UnicodeEncodeErrors.
- class bs4.element.PageElement#
Bases:
object
Contains the navigational information for some part of the page: that is, its current location in the parse tree.
NavigableString, Tag, etc. are all subclasses of PageElement.
- property _is_xml#
Is this element part of an XML tree or an HTML tree?
This is used in formatter_for_name, when deciding whether an XMLFormatter or HTMLFormatter is more appropriate. It can be inefficient, but it should be called very rarely.
- property stripped_strings#
Yield all strings in this PageElement, stripping them first.
- Yield:
A sequence of stripped strings.
- property next#
The PageElement, if any, that was parsed just after this one.
- Returns:
A PageElement.
- property previous#
The PageElement, if any, that was parsed just before this one.
- Returns:
A PageElement.
- property next_elements#
All PageElements that were parsed after this one.
- Yield:
A sequence of PageElements.
- property next_siblings#
All PageElements that are siblings of this one but were parsed later.
- Yield:
A sequence of PageElements.
- property previous_elements#
All PageElements that were parsed before this one.
- Yield:
A sequence of PageElements.
- property previous_siblings#
All PageElements that are siblings of this one but were parsed earlier.
- Yield:
A sequence of PageElements.
- property parents#
All PageElements that are parents of this PageElement.
- Yield:
A sequence of PageElements.
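The navigation properties above can be illustrated with a small sketch. The markup and variable names are invented for the example, and it uses the `next_element` attribute (the name that also appears in the parameters of setup()).

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<div><p>one</p><p>two</p></div>", "html.parser")
first_p = soup.find("p")

# The element parsed immediately after the first <p> is its own text.
next_el = first_p.next_element

# Siblings parsed later than the first <p>.
later_siblings = [sibling.name for sibling in first_p.next_siblings]

# Ancestors, innermost first; the BeautifulSoup object itself is named "[document]".
ancestor_names = [parent.name for parent in first_p.parents]

# Whitespace-stripped strings found beneath the <div>.
stripped = list(soup.div.stripped_strings)

print(next_el, later_siblings, ancestor_names, stripped)
```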
- property decomposed#
Check whether a PageElement has been decomposed.
- Return type:
bool
- known_xml#
- nextSibling#
- previousSibling#
- default#
- getText#
- text#
- replaceWith#
- replace_with_children#
- replaceWithChildren#
- _lastRecursiveChild#
- findNext#
- findAllNext#
- findNextSibling#
- findNextSiblings#
- fetchNextSiblings#
- findPrevious#
- findAllPrevious#
- fetchPrevious#
- findPreviousSibling#
- findPreviousSiblings#
- fetchPreviousSiblings#
- findParent#
- findParents#
- fetchParents#
- setup(parent=None, previous_element=None, next_element=None, previous_sibling=None, next_sibling=None)#
Sets up the initial relations between this element and other elements.
- Parameters:
parent – The parent of this element.
previous_element – The element parsed immediately before this one.
next_element – The element parsed immediately after this one.
previous_sibling – The most recently encountered element on the same level of the parse tree as this one.
next_sibling – The next element to be encountered on the same level of the parse tree as this one.
- format_string(s, formatter)#
Format the given string using the given formatter.
- Parameters:
s – A string.
formatter – A Formatter object, or a string naming one of the standard formatters.
- formatter_for_name(formatter)#
Look up or create a Formatter for the given identifier, if necessary.
- Parameters:
formatter – Can be a Formatter object (used as-is), a function (used as the entity substitution hook for an XMLFormatter or HTMLFormatter), or a string (used to look up an XMLFormatter or HTMLFormatter in the appropriate registry).
- abstract _all_strings(strip=False, types=default)#
Yield all strings of certain classes, possibly stripping them.
This is implemented differently in Tag and NavigableString.
- get_text(separator='', strip=False, types=default)#
Get all child strings of this PageElement, concatenated using the given separator.
- Parameters:
separator – Strings will be concatenated using this separator.
strip – If True, strings will be stripped before being concatenated.
types – A tuple of NavigableString subclasses. Any strings of a subclass not found in this list will be ignored. Although there are exceptions, the default behavior in most cases is to consider only NavigableString and CData objects. That means no comments, processing instructions, etc.
- Returns:
A string.
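A short sketch of how separator and strip interact, assuming the default value of types (so the comment is skipped). The sample markup is made up for illustration.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p> Hello <b> world </b><!-- note --></p>", "html.parser")

# Default: strings are concatenated as-is; the comment is not included.
raw = soup.p.get_text()

# With a separator and strip=True, each string is stripped before joining.
joined = soup.p.get_text("|", strip=True)

print(repr(raw))
print(joined)
```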
- replace_with(*args)#
Replace this PageElement with one or more PageElements, keeping the rest of the tree the same.
- Parameters:
args – One or more PageElements.
- Returns:
self, no longer part of the tree.
- unwrap()#
Replace this PageElement with its contents.
- Returns:
self, no longer part of the tree.
- wrap(wrap_inside)#
Wrap this PageElement inside another one.
- Parameters:
wrap_inside – A PageElement.
- Returns:
wrap_inside, occupying the position in the tree that used to be occupied by self, and with self inside it.
- extract(_self_index=None)#
Destructively rips this element out of the tree.
- Parameters:
_self_index – The location of this element in its parent’s .contents, if known. Passing this in allows for a performance optimization.
- Returns:
self, no longer part of the tree.
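The four tree-editing methods above (replace_with, unwrap, wrap, extract) can be sketched together. This example relies on BeautifulSoup.new_tag(), which is documented outside this module, and the markup is invented for illustration.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>I like <b>tea</b> and <i>cake</i></p>", "html.parser")

# replace_with() returns the element that was removed from the tree.
em = soup.new_tag("em")
em.string = "coffee"
old_b = soup.b.replace_with(em)
after_replace = str(soup)

# unwrap() replaces a tag with its own contents.
soup.i.unwrap()
after_unwrap = str(soup)

# wrap() puts the element inside the given tag and returns the wrapper.
wrapper = soup.em.wrap(soup.new_tag("span"))
after_wrap = str(soup)

# extract() removes the element (and everything beneath it) from the tree.
removed = soup.span.extract()
after_extract = str(soup)

print(after_replace, after_unwrap, after_wrap, after_extract, sep="\n")
```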
- _last_descendant(is_initialized=True, accept_self=True)#
Finds the last element beneath this object to be parsed.
- Parameters:
is_initialized – Has setup been called on this PageElement yet?
accept_self – Is self an acceptable answer to the question?
- insert(position, new_child)#
Insert a new PageElement in the list of this PageElement’s children.
This works the same way as list.insert.
- Parameters:
position – The numeric position that should be occupied in self.contents by the new PageElement.
new_child – A PageElement.
- append(tag)#
Appends the given PageElement to the contents of this one.
- Parameters:
tag – A PageElement.
- extend(tags)#
Appends the given PageElements to this one’s contents.
- Parameters:
tags – A list of PageElements. If a single Tag is provided instead, this PageElement’s contents will be extended with that Tag’s contents.
- insert_before(*args)#
Makes the given element(s) the immediate predecessor of this one.
All the elements will have the same parent, and the given elements will be immediately before this one.
- Parameters:
args – One or more PageElements.
- insert_after(*args)#
Makes the given element(s) the immediate successor of this one.
The elements will have the same parent, and the given elements will be immediately after this one.
- Parameters:
args – One or more PageElements.
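A minimal sketch of building up a list with insert_before(), append() and insert(). The helper `make_item` is hypothetical and exists only for this example; new_tag() comes from the BeautifulSoup object.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<ul><li>b</li></ul>", "html.parser")
ul = soup.ul


def make_item(text):
    """Hypothetical helper: build a detached <li> holding the given text."""
    li = soup.new_tag("li")
    li.string = text
    return li


ul.li.insert_before(make_item("a"))  # a, b
ul.append(make_item("d"))            # a, b, d
ul.insert(2, make_item("c"))         # a, b, c, d (same semantics as list.insert)

items = [str(li.string) for li in ul.find_all("li")]
print(items)
```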
- find_next(name=None, attrs={}, string=None, **kwargs)#
Find the first PageElement that matches the given criteria and appears later in the document than this PageElement.
All find_* methods take a common set of arguments. See the online documentation for detailed explanations.
- Parameters:
name – A filter on tag name.
attrs – A dictionary of filters on attribute values.
string – A filter for a NavigableString with specific text.
- Kwargs:
A dictionary of filters on attribute values.
- Returns:
A PageElement.
- find_all_next(name=None, attrs={}, string=None, limit=None, **kwargs)#
Find all PageElements that match the given criteria and appear later in the document than this PageElement.
All find_* methods take a common set of arguments. See the online documentation for detailed explanations.
- Parameters:
name – A filter on tag name.
attrs – A dictionary of filters on attribute values.
string – A filter for a NavigableString with specific text.
limit – Stop looking after finding this many results.
- Kwargs:
A dictionary of filters on attribute values.
- Returns:
A ResultSet containing PageElements.
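As an illustration of the forward searches, the sketch below starts from a heading and looks at what follows it. The markup and the `class_` keyword filter values are chosen only for the example.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<h1>Title</h1><p class='a'>one</p><p class='b'>two</p>", "html.parser"
)
h1 = soup.h1

first_p = h1.find_next("p")              # first <p> after the heading
all_p = h1.find_all_next("p")            # every later <p>
limited = h1.find_all_next("p", limit=1)
by_class = h1.find_next("p", class_="b") # keyword argument filters on an attribute
by_string = h1.find_next(string="two")   # matches a NavigableString, not a Tag

print(first_p, len(all_p), len(limited), by_class, by_string)
```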
- find_next_sibling(name=None, attrs={}, string=None, **kwargs)#
Find the closest sibling to this PageElement that matches the given criteria and appears later in the document.
All find_* methods take a common set of arguments. See the online documentation for detailed explanations.
- Parameters:
name – A filter on tag name.
attrs – A dictionary of filters on attribute values.
string – A filter for a NavigableString with specific text.
- Kwargs:
A dictionary of filters on attribute values.
- Returns:
A PageElement.
- find_next_siblings(name=None, attrs={}, string=None, limit=None, **kwargs)#
Find all siblings of this PageElement that match the given criteria and appear later in the document.
All find_* methods take a common set of arguments. See the online documentation for detailed explanations.
- Parameters:
name – A filter on tag name.
attrs – A dictionary of filters on attribute values.
string – A filter for a NavigableString with specific text.
limit – Stop looking after finding this many results.
- Kwargs:
A dictionary of filters on attribute values.
- Returns:
A ResultSet of PageElements.
- find_previous(name=None, attrs={}, string=None, **kwargs)#
Look backwards in the document from this PageElement and find the first PageElement that matches the given criteria.
All find_* methods take a common set of arguments. See the online documentation for detailed explanations.
- Parameters:
name – A filter on tag name.
attrs – A dictionary of filters on attribute values.
string – A filter for a NavigableString with specific text.
- Kwargs:
A dictionary of filters on attribute values.
- Returns:
A PageElement.
- find_all_previous(name=None, attrs={}, string=None, limit=None, **kwargs)#
Look backwards in the document from this PageElement and find all PageElements that match the given criteria.
All find_* methods take a common set of arguments. See the online documentation for detailed explanations.
- Parameters:
name – A filter on tag name.
attrs – A dictionary of filters on attribute values.
string – A filter for a NavigableString with specific text.
limit – Stop looking after finding this many results.
- Kwargs:
A dictionary of filters on attribute values.
- Returns:
A ResultSet of PageElements.
- find_previous_sibling(name=None, attrs={}, string=None, **kwargs)#
Returns the closest sibling to this PageElement that matches the given criteria and appears earlier in the document.
All find_* methods take a common set of arguments. See the online documentation for detailed explanations.
- Parameters:
name – A filter on tag name.
attrs – A dictionary of filters on attribute values.
string – A filter for a NavigableString with specific text.
- Kwargs:
A dictionary of filters on attribute values.
- Returns:
A PageElement.
- find_previous_siblings(name=None, attrs={}, string=None, limit=None, **kwargs)#
Returns all siblings to this PageElement that match the given criteria and appear earlier in the document.
All find_* methods take a common set of arguments. See the online documentation for detailed explanations.
- Parameters:
name – A filter on tag name.
attrs – A dictionary of filters on attribute values.
string – A filter for a NavigableString with specific text.
limit – Stop looking after finding this many results.
- Kwargs:
A dictionary of filters on attribute values.
- Returns:
A ResultSet of PageElements.
- find_parent(name=None, attrs={}, **kwargs)#
Find the closest parent of this PageElement that matches the given criteria.
All find_* methods take a common set of arguments. See the online documentation for detailed explanations.
- Parameters:
name – A filter on tag name.
attrs – A dictionary of filters on attribute values.
- Kwargs:
A dictionary of filters on attribute values.
- Returns:
A PageElement.
- find_parents(name=None, attrs={}, limit=None, **kwargs)#
Find all parents of this PageElement that match the given criteria.
All find_* methods take a common set of arguments. See the online documentation for detailed explanations.
- Parameters:
name – A filter on tag name.
attrs – A dictionary of filters on attribute values.
limit – Stop looking after finding this many results.
- Kwargs:
A dictionary of filters on attribute values.
- Returns:
A ResultSet of PageElements.
- _find_one(method, name, attrs, string, **kwargs)#
- _find_all(name, attrs, string, limit, generator, **kwargs)#
Iterates over a generator looking for things that match.
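The sibling and parent searches described above can be sketched as follows. The nested markup and the id values are invented for the example.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<div id='outer'><div id='inner'><p>a</p><p>b</p></div></div>", "html.parser"
)
second_p = soup.find_all("p")[1]

# Closest earlier sibling that matches.
earlier = second_p.find_previous_sibling("p")

# Closest matching ancestor, then all matching ancestors (innermost first).
closest_div = second_p.find_parent("div")
all_div_ids = [div["id"] for div in second_p.find_parents("div")]

print(earlier, closest_div["id"], all_div_ids)
```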
- nextGenerator()#
- nextSiblingGenerator()#
- previousGenerator()#
- previousSiblingGenerator()#
- parentGenerator()#
- class bs4.element.NavigableString#
Bases:
str, PageElement
A Python Unicode string that is part of a parse tree.
When Beautiful Soup parses the markup <b>penguin</b>, it will create a NavigableString for the string “penguin”.
Since a NavigableString is not a Tag, it has no .name.
This property is implemented so that code like this doesn’t crash when run on a mixture of Tag and NavigableString objects:
[x.name for x in tag.children]
A copy of a NavigableString has the same contents and class as the original, but it is not connected to the parse tree.
- Parameters:
recursive – This parameter is ignored; it’s only defined so that NavigableString.__deepcopy__ implements the same signature as Tag.__deepcopy__.
A copy of a NavigableString can only be a deep copy, because only one PageElement can occupy a given place in a parse tree.
text.string gives you text. This is for backwards compatibility for Navigable*String, but for CData* it lets you get the string without the CData wrapper.
Run the string through the provided formatter.
- Parameters:
formatter – A Formatter object, or a string naming one of the standard formatters.
Yield all strings of certain classes, possibly stripping them.
This makes it easy for NavigableString to implement methods like get_text() as conveniences, creating a consistent text-extraction API across all PageElements.
- Parameters:
strip – If True, all strings will be stripped before being yielded.
types – A tuple of NavigableString subclasses. If this NavigableString isn’t one of those subclasses, the sequence will be empty. By default, the subclasses considered are NavigableString and CData objects. That means no comments, processing instructions, etc.
- Yield:
A sequence that either contains this string, or is empty.
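A brief sketch of the NavigableString behaviour described above, assuming a Beautiful Soup version in which the name property is present (as this page documents).

```python
import copy

from bs4 import BeautifulSoup
from bs4.element import NavigableString

soup = BeautifulSoup("<b>penguin</b>", "html.parser")
text = soup.b.string

is_navigable = isinstance(text, NavigableString)

# A string has no tag name, so mixed Tag/string loops do not crash.
child_names = [child.name for child in soup.b.children]

# A copy keeps the contents and class but is detached from the tree.
duplicate = copy.copy(text)

print(is_navigable, child_names, duplicate, duplicate.parent)
```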
- class bs4.element.PreformattedString#
Bases:
NavigableString
A NavigableString not subject to the normal formatting rules.
This is an abstract class used for special kinds of strings such as comments (the Comment class) and CDATA blocks (the CData class).
- PREFIX = ''#
- SUFFIX = ''#
- output_ready(formatter=None)#
Make this string ready for output by adding any subclass-specific prefix or suffix.
- Parameters:
formatter – A Formatter object, or a string naming one of the standard formatters. The string will be passed into the Formatter, but only to trigger any side effects: the return value is ignored.
- Returns:
The string, with any subclass-specific prefix and suffix added on.
- class bs4.element.CData#
Bases:
PreformattedString
A CDATA block.
- PREFIX = '<![CDATA['#
- SUFFIX = ']]>'#
- class bs4.element.ProcessingInstruction#
Bases:
PreformattedString
An SGML processing instruction.
- PREFIX = '<?'#
- SUFFIX = '>'#
- class bs4.element.XMLProcessingInstruction#
Bases:
ProcessingInstruction
An XML processing instruction.
- PREFIX = '<?'#
- SUFFIX = '?>'#
- class bs4.element.Comment#
Bases:
PreformattedString
An HTML or XML comment.
- PREFIX = '<!--'#
- SUFFIX = '-->'#
- class bs4.element.Declaration#
Bases:
PreformattedString
An XML declaration.
- PREFIX = '<?'#
- SUFFIX = '?>'#
- class bs4.element.Doctype#
Bases:
PreformattedString
A document type declaration.
- PREFIX = '<!DOCTYPE '#
- SUFFIX = '>\n'#
- classmethod for_name_and_ids(name, pub_id, system_id)#
Generate an appropriate document type declaration for a given public ID and system ID.
- Parameters:
name – The name of the document’s root element, e.g. ‘html’.
pub_id – The Formal Public Identifier for this document type, e.g. ‘-//W3C//DTD XHTML 1.1//EN’
system_id – The system identifier for this document type, e.g. ‘http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd’
- Returns:
A Doctype.
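The PREFIX and SUFFIX constants listed for these subclasses can be seen in action with output_ready(). This is a small sketch on detached objects created directly, which is an assumption of the example rather than how the parser normally builds them.

```python
from bs4.element import CData, Comment, Doctype

# Each subclass wraps its text in its own PREFIX and SUFFIX.
comment_markup = Comment("hi").output_ready()
cdata_markup = CData("x < y").output_ready()

# Passing None for both identifiers yields a bare doctype for the root name.
doctype = Doctype.for_name_and_ids("html", None, None)
doctype_markup = doctype.output_ready()

print(comment_markup)
print(cdata_markup)
print(repr(doctype_markup))
```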
- class bs4.element.Stylesheet#
Bases:
NavigableString
A NavigableString representing a stylesheet (probably CSS).
Used to distinguish embedded stylesheets from textual content.
- class bs4.element.Script#
Bases:
NavigableString
A NavigableString representing an executable script (probably JavaScript).
Used to distinguish executable code from textual content.
- class bs4.element.TemplateString#
Bases:
NavigableString
A NavigableString representing a string found inside an HTML template embedded in a larger document.
Used to distinguish such strings from the main body of the document.
- class bs4.element.RubyTextString#
Bases:
NavigableString
A NavigableString representing the contents of the <rt> HTML element.
https://dev.w3.org/html5/spec-LC/text-level-semantics.html#the-rt-element
Can be used to distinguish such strings from the strings they’re annotating.
- class bs4.element.RubyParenthesisString#
Bases:
NavigableString
A NavigableString representing the contents of the <rp> HTML element.
https://dev.w3.org/html5/spec-LC/text-level-semantics.html#the-rp-element
- class bs4.element.Tag(parser=None, builder=None, name=None, namespace=None, prefix=None, attrs=None, parent=None, previous=None, is_xml=None, sourceline=None, sourcepos=None, can_be_empty_element=None, cdata_list_attributes=None, preserve_whitespace_tags=None, interesting_string_types=None, namespaces=None)#
Bases:
PageElement
Represents an HTML or XML tag that is part of a parse tree, along with its attributes and contents.
When Beautiful Soup parses the markup <b>penguin</b>, it will create a Tag object representing the <b> tag.
- property is_empty_element#
Is this tag an empty-element tag? (aka a self-closing tag)
A tag that has contents is never an empty-element tag.
A tag that has no contents may or may not be an empty-element tag. It depends on the builder used to create the tag. If the builder has a designated list of empty-element tags, then only a tag whose name shows up in that list is considered an empty-element tag.
If the builder has no designated list of empty-element tags, then any tag with no contents is an empty-element tag.
- property string#
Convenience property to get the single string within this PageElement.
TODO It might make sense to have NavigableString.string return itself.
- Returns:
If this element has a single string child, return value is that string. If this element has one child tag, return value is the ‘string’ attribute of the child tag, recursively. If this element is itself a string, has no children, or has more than one child, return value is None.
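The rules for is_empty_element and string can be checked with a short sketch. It assumes the "html.parser" builder, which has a designated list of empty-element tags that includes <br>.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<p><b>only</b></p><div>a<i>b</i></div><br/>", "html.parser"
)

# <p> has one child tag, so .string recurses into <b>.
single = soup.p.string

# <div> has two children, so .string is None.
ambiguous = soup.div.string

br_is_empty = soup.br.is_empty_element
p_is_empty = soup.p.is_empty_element

print(single, ambiguous, br_is_empty, p_is_empty)
```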
- property children#
Iterate over all direct children of this PageElement.
- Yield:
A sequence of PageElements.
- property self_and_descendants#
Iterate over this PageElement and all of its descendants in document order (a depth-first traversal).
- Yield:
A sequence of PageElements.
- property descendants#
Iterate over all descendants of this PageElement in document order (a depth-first traversal).
- Yield:
A sequence of PageElements.
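To make the difference between children and descendants concrete, here is a small sketch on invented markup. It records the order in which nested tags are visited, so the traversal order can be observed directly.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

soup = BeautifulSoup("<div><p><b>x</b></p><span>y</span></div>", "html.parser")
div = soup.div

# Direct children only.
child_names = [child.name for child in div.children]

# Everything beneath the <div>, keeping only tags for readability.
descendant_tags = [el.name for el in div.descendants if isinstance(el, Tag)]

# Strings are visited too, in the same pass.
descendant_strings = [str(el) for el in div.descendants if not isinstance(el, Tag)]

print(child_names, descendant_tags, descendant_strings)
```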
- property css#
Return an interface to the CSS selector API.
- parserClass#
- isSelfClosing#
- DEFAULT_INTERESTING_STRING_TYPES = ()#
- strings#
- START_ELEMENT_EVENT#
- END_ELEMENT_EVENT#
- EMPTY_ELEMENT_EVENT#
- STRING_ELEMENT_EVENT#
- findChild#
- findAll#
- findChildren#
- __deepcopy__(memo, recursive=True)#
A deepcopy of a Tag is a new Tag, unconnected to the parse tree. Its contents are a copy of the old Tag’s contents.
- __copy__()#
A copy of a Tag must always be a deep copy, because a Tag’s children can only have one parent at a time.
- _clone()#
Create a new Tag just like this one, but with no contents and unattached to any parse tree.
This is the first step in the deepcopy process.
- _all_strings(strip=False, types=PageElement.default)#
Yield all strings of certain classes, possibly stripping them.
- Parameters:
strip – If True, all strings will be stripped before being yielded.
types – A tuple of NavigableString subclasses. Any strings of a subclass not found in this list will be ignored. By default, the subclasses considered are the ones found in self.interesting_string_types. If that’s not specified, only NavigableString and CData objects will be considered. That means no comments, processing instructions, etc.
- Yield:
A sequence of strings.
- decompose()#
Recursively destroys this PageElement and its children.
This element will be removed from the tree and wiped out; so will everything beneath it.
The behavior of a decomposed PageElement is undefined and you should never use one for anything, but if you need to check whether an element has been decomposed, you can use the decomposed property.
- clear(decompose=False)#
Wipe out all children of this PageElement by calling extract() on them.
- Parameters:
decompose – If this is True, decompose() (a more destructive method) will be called instead of extract().
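A minimal sketch contrasting decompose() with clear(), using made-up markup. The decomposed element is only inspected through the decomposed property, as recommended above.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<div><p>a</p><p>b</p></div>", "html.parser")

# decompose() removes the element and destroys it.
first_p = soup.p
first_p.decompose()
was_decomposed = first_p.decomposed
after_decompose = str(soup)

# clear() empties a tag but keeps the tag itself in the tree.
soup.div.clear()
after_clear = str(soup)

print(was_decomposed, after_decompose, after_clear)
```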
- smooth()#
Smooth out this element’s children by consolidating consecutive strings.
This makes pretty-printed output look more natural following a lot of operations that modified the tree.
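As a rough illustration of what smooth() consolidates, the sketch below appends a second string to a tag and then merges the two. The markup is invented for the example.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>a</p>", "html.parser")

# Appending a plain string leaves two adjacent NavigableStrings.
soup.p.append("b")
count_before = len(soup.p.contents)

# smooth() merges consecutive strings into one.
soup.p.smooth()
count_after = len(soup.p.contents)
merged = str(soup.p.string)

print(count_before, count_after, merged)
```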
- index(element)#
Find the index of a child by identity, not value.
Avoids issues with tag.contents.index(element) getting the index of equal elements.
- Parameters:
element – Look for this PageElement in self.contents.
- get(key, default=None)#
Returns the value of the ‘key’ attribute for the tag, or the value given for ‘default’ if it doesn’t have that attribute.
- get_attribute_list(key, default=None)#
The same as get(), but always returns a list.
- Parameters:
key – The attribute to look for.
default – Use this value if the attribute is not present on this PageElement.
- Returns:
A list of values, probably containing only a single value.
- has_attr(key)#
Does this PageElement have an attribute with the given name?
- __hash__()#
Return hash(self).
- __getitem__(key)#
tag[key] returns the value of the ‘key’ attribute for the Tag, and throws an exception if it’s not there.
- __iter__()#
Iterating over a Tag iterates over its contents.
- __len__()#
The length of a Tag is the length of its list of contents.
- __contains__(x)#
- __bool__()#
A tag is non-None even if it has no contents.
- __setitem__(key, value)#
Setting tag[key] sets the value of the ‘key’ attribute for the tag.
- __delitem__(key)#
Deleting tag[key] deletes all ‘key’ attributes for the tag.
- __call__(*args, **kwargs)#
Calling a Tag like a function is the same as calling its find_all() method. E.g. tag(‘a’) returns a list of all the <a> tags found within this tag.
- __getattr__(tag)#
Calling tag.subtag is the same as calling tag.find(name="subtag").
- __eq__(other)#
Returns true iff this Tag has the same name, the same attributes, and the same contents (recursively) as other.
- __ne__(other)#
Returns true iff this Tag is not identical to other, as defined in __eq__.
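The attribute and container behaviour of Tag described above can be sketched in a few lines. The markup is invented, and the multi-valued class attribute assumes an HTML builder such as "html.parser".

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<div id="main" class="a b"><a href="/x">x</a><a href="/y">y</a></div>',
    "html.parser",
)
div = soup.div

tag_id = div["id"]                         # __getitem__
missing = div.get("title", "n/a")          # get() with a default
classes = div["class"]                     # multi-valued attribute: a list
id_list = div.get_attribute_list("id")     # always a list
has_class = div.has_attr("class")

try:
    div["title"]                           # absent attribute raises
    raised = False
except KeyError:
    raised = True

child_count = len(div)                     # length of .contents
same_as_find_all = div("a") == div.find_all("a")   # __call__
same_as_find = div.a is div.find("a")              # __getattr__
empty_is_truthy = bool(soup.new_tag("p"))          # __bool__

print(tag_id, missing, classes, id_list, has_class, raised, child_count)
```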
- __repr__(encoding='unicode-escape')#
Renders this PageElement as a string.
- Parameters:
encoding – The encoding to use (Python 2 only). TODO: This is now ignored and a warning should be issued if a value is provided.
- Returns:
A (Unicode) string.
- __unicode__()#
Renders this PageElement as a Unicode string.
- encode(encoding=DEFAULT_OUTPUT_ENCODING, indent_level=None, formatter='minimal', errors='xmlcharrefreplace')#
Render a bytestring representation of this PageElement and its contents.
- Parameters:
encoding – The destination encoding.
indent_level – Each line of the rendering will be indented this many levels. (The formatter decides what a ‘level’ means in terms of spaces or other characters output.) Used internally in recursive calls while pretty-printing.
formatter – A Formatter object, or a string naming one of the standard formatters.
errors – An error handling strategy such as ‘xmlcharrefreplace’. This value is passed along into encode() and its value should be one of the constants defined by Python.
- Returns:
A bytestring.
- decode(indent_level=None, eventual_encoding=DEFAULT_OUTPUT_ENCODING, formatter='minimal', iterator=None)#
- _event_stream(iterator=None)#
Yield a sequence of events that can be used to reconstruct the DOM for this element.
This lets us recreate the nested structure of this element (e.g. when formatting it as a string) without using recursive method calls.
This is similar in concept to the SAX API, but it’s a simpler interface designed for internal use. The events are different from SAX and the arguments associated with the events are Tags and other Beautiful Soup objects.
- Parameters:
iterator – An alternate iterator to use when traversing the tree.
- _indent_string(s, indent_level, formatter, indent_before, indent_after)#
Add indentation whitespace before and/or after a string.
- Parameters:
s – The string to amend with whitespace.
indent_level – The indentation level; affects how much whitespace goes before the string.
indent_before – Whether or not to add whitespace before the string.
indent_after – Whether or not to add whitespace (a newline) after the string.
- _format_tag(eventual_encoding, formatter, opening)#
- _should_pretty_print(indent_level=1)#
Should this tag be pretty-printed?
Most of them should, but some (such as <pre> in HTML documents) should not.
- prettify(encoding=None, formatter='minimal')#
Pretty-print this PageElement as a string.
- Parameters:
encoding – The eventual encoding of the string. If this is None, a Unicode string will be returned.
formatter – A Formatter object, or a string naming one of the standard formatters.
- Returns:
A Unicode string (if encoding==None) or a bytestring (otherwise).
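A short sketch of the three rendering paths (encode, decode, prettify), assuming the default 'minimal' formatter. The exact whitespace produced by prettify() is not asserted here, since it depends on the formatter.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>caf\u00e9 <b>au lait</b></p>", "html.parser")

# encode() returns bytes; characters the codec cannot represent become
# XML character references because errors defaults to 'xmlcharrefreplace'.
ascii_bytes = soup.p.encode("ascii")

# decode() returns a Unicode string.
text = soup.p.decode()

# prettify() returns a str by default and bytes when an encoding is given.
pretty_text = soup.p.prettify()
pretty_bytes = soup.p.prettify(encoding="utf-8")

print(ascii_bytes)
print(text)
print(pretty_text)
```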
- decode_contents(indent_level=None, eventual_encoding=DEFAULT_OUTPUT_ENCODING, formatter='minimal')#
Renders the contents of this tag as a Unicode string.
- Parameters:
indent_level – Each line of the rendering will be indented this many levels. (The formatter decides what a ‘level’ means in terms of spaces or other characters output.) Used internally in recursive calls while pretty-printing.
eventual_encoding – The tag is destined to be encoded into this encoding. decode_contents() is not responsible for performing that encoding. This information is passed in so that it can be substituted in if the document contains a <META> tag that mentions the document’s encoding.
formatter – A Formatter object, or a string naming one of the standard Formatters.
- encode_contents(indent_level=None, encoding=DEFAULT_OUTPUT_ENCODING, formatter='minimal')#
Renders the contents of this PageElement as a bytestring.
- Parameters:
indent_level – Each line of the rendering will be indented this many levels. (The formatter decides what a ‘level’ means in terms of spaces or other characters output.) Used internally in recursive calls while pretty-printing.
encoding – The bytestring will be in this encoding.
formatter – A Formatter object, or a string naming one of the standard Formatters.
- Returns:
A bytestring.
- renderContents(encoding=DEFAULT_OUTPUT_ENCODING, prettyPrint=False, indentLevel=0)#
Deprecated method for BS3 compatibility.
- find(name=None, attrs={}, recursive=True, string=None, **kwargs)#
Look in the children of this PageElement and find the first PageElement that matches the given criteria.
All find_* methods take a common set of arguments. See the online documentation for detailed explanations.
- Parameters:
name – A filter on tag name.
attrs – A dictionary of filters on attribute values.
recursive – If this is True, find() will perform a recursive search of this PageElement’s children. Otherwise, only the direct children will be considered.
string – A filter for a NavigableString with specific text.
- Kwargs:
A dictionary of filters on attribute values.
- Returns:
A PageElement.
- find_all(name=None, attrs={}, recursive=True, string=None, limit=None, **kwargs)#
Look in the children of this PageElement and find all PageElements that match the given criteria.
All find_* methods take a common set of arguments. See the online documentation for detailed explanations.
- Parameters:
name – A filter on tag name.
attrs – A dictionary of filters on attribute values.
recursive – If this is True, find_all() will perform a recursive search of this PageElement’s children. Otherwise, only the direct children will be considered.
string – A filter for a NavigableString with specific text.
limit – Stop looking after finding this many results.
- Kwargs:
A dictionary of filters on attribute values.
- Returns:
A ResultSet of PageElements.
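The effect of recursive, limit and string on find() and find_all() can be sketched as follows, on invented markup.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<div><p>a</p><section><p>b</p></section></div>", "html.parser"
)
div = soup.div

everywhere = div.find_all("p")                     # searches all descendants
direct_only = div.find_all("p", recursive=False)   # direct children only
capped = div.find_all("p", limit=1)                # stop after one match
by_text = div.find("p", string="b")                # tag whose .string matches
nothing = div.find("table")                        # no match gives None

print(len(everywhere), len(direct_only), len(capped), by_text, nothing)
```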
- select_one(selector, namespaces=None, **kwargs)#
Perform a CSS selection operation on the current element.
- Parameters:
selector – A CSS selector.
namespaces – A dictionary mapping namespace prefixes used in the CSS selector to namespace URIs. By default, Beautiful Soup will use the prefixes it encountered while parsing the document.
kwargs – Keyword arguments to be passed into Soup Sieve’s soupsieve.select() method.
- Returns:
A Tag.
- select(selector, namespaces=None, limit=None, **kwargs)#
Perform a CSS selection operation on the current element.
This uses the SoupSieve library.
- Parameters:
selector – A string containing a CSS selector.
namespaces – A dictionary mapping namespace prefixes used in the CSS selector to namespace URIs. By default, Beautiful Soup will use the prefixes it encountered while parsing the document.
limit – After finding this number of results, stop looking.
kwargs – Keyword arguments to be passed into SoupSieve’s soupsieve.select() method.
- Returns:
A ResultSet of Tags.
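A minimal sketch of select() and select_one(), assuming the Soup Sieve package is installed alongside Beautiful Soup. The selectors and markup are invented for the example.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<ul><li class="item">a</li><li>b</li><li class="item">c</li></ul>',
    "html.parser",
)

items = soup.select("ul > li.item")      # all matches
first_item = soup.select_one("li.item")  # first match only
capped = soup.select("li", limit=2)      # stop after two results
missing = soup.select_one("li.missing")  # None when nothing matches

print([li.get_text() for li in items], first_item.get_text(), len(capped), missing)
```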
- childGenerator()#
Deprecated generator.
- recursiveChildGenerator()#
Deprecated generator.
- has_key(key)#
Deprecated method. This was kind of misleading because has_key() (attributes) was different from __contains__ (contents).
has_key() is gone in Python 3, anyway.
- class bs4.element.SoupStrainer(name=None, attrs={}, string=None, **kwargs)#
Bases:
object
Encapsulates a number of ways of matching a markup element (tag or string).
This is primarily used to underpin the find_* methods, but you can create one yourself and pass it in as parse_only to the BeautifulSoup constructor, to parse a subset of a large document.
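As an illustration of the parse_only use case, the sketch below keeps only <a> tags while parsing. It assumes the "html.parser" builder, and the markup is invented for the example.

```python
from bs4 import BeautifulSoup, SoupStrainer

markup = '<p>text</p><a href="/1">one</a><div><a href="/2">two</a></div>'

# Only elements matching the strainer are turned into objects.
only_links = SoupStrainer("a")
soup = BeautifulSoup(markup, "html.parser", parse_only=only_links)

hrefs = [a["href"] for a in soup.find_all("a")]
paragraph = soup.find("p")  # the <p> was never added to the tree

print(hrefs, paragraph)
```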
- searchTag#
- _normalize_search_value(value)#
- __str__()#
A human-readable representation of this SoupStrainer.
- search_tag(markup_name=None, markup_attrs={})#
Check whether a Tag with the given name and attributes would match this SoupStrainer.
Used prospectively to decide whether to even bother creating a Tag object.
- Parameters:
markup_name – A tag name as found in some markup.
markup_attrs – A dictionary of attributes as found in some markup.
- Returns:
True if the prospective tag would match this SoupStrainer; False otherwise.
- search(markup)#
Find all items in markup that match this SoupStrainer.
Used by the core _find_all() method, which is ultimately called by all find_* methods.
- Parameters:
markup – A PageElement or a list of them.
- _matches(markup, match_against, already_tried=None)#