bs4.tests
Helper classes for tests.
Submodules
bs4.tests.test_builder
bs4.tests.test_builder_registry
bs4.tests.test_css
bs4.tests.test_dammit
bs4.tests.test_docs
bs4.tests.test_element
bs4.tests.test_formatter
bs4.tests.test_fuzz
bs4.tests.test_html5lib
bs4.tests.test_htmlparser
bs4.tests.test_lxml
bs4.tests.test_navigablestring
bs4.tests.test_pageelement
bs4.tests.test_soup
bs4.tests.test_tag
bs4.tests.test_tree
Package Contents
Classes
BeautifulSoup – A data structure representing a parsed HTML or XML document.
CharsetMetaAttributeValue – A generic stand-in for the value of a meta tag's 'charset' attribute.
Comment – An HTML or XML comment.
ContentMetaAttributeValue – A generic stand-in for the value of a meta tag's 'content' attribute.
Doctype – A document type declaration.
SoupStrainer – Encapsulates a number of ways of matching a markup element (tag or string).
Script – A NavigableString representing an executable script (probably JavaScript).
Stylesheet – A NavigableString representing a stylesheet (probably CSS).
Tag – Represents an HTML or XML tag that is part of a parse tree, along with its attributes and contents.
DetectsXMLParsedAsHTML – A mixin class for any class (a TreeBuilder, or some class used by a TreeBuilder) that's in a position to detect whether an XML document is being incorrectly parsed as HTML, and issue an appropriate warning.
HTMLTreeBuilderSmokeTest – A basic test of a treebuilder's competence.
HTML5TreeBuilderSmokeTest – Smoke test for a tree builder that supports HTML5.
Attributes
- bs4.tests.__license__ = 'MIT'#
- class bs4.tests.BeautifulSoup(markup='', features=None, builder=None, parse_only=None, from_encoding=None, exclude_encodings=None, element_classes=None, **kwargs)#
Bases:
element.Tag
A data structure representing a parsed HTML or XML document.
Most of the methods you’ll call on a BeautifulSoup object are inherited from PageElement or Tag.
Internally, this class defines the basic interface called by the tree builders when converting an HTML/XML document into a data structure. The interface abstracts away the differences between parsers. To write a new tree builder, you’ll need to understand these methods as a whole.
- These methods will be called by the BeautifulSoup constructor:
reset()
feed(markup)
- The tree builder may call these methods from its feed() implementation:
handle_starttag(name, attrs) # See note about return value
handle_endtag(name)
handle_data(data) # Appends to the current data node
endData(containerClass) # Ends the current data node
No matter how complicated the underlying parser is, you should be able to build a tree using ‘start tag’ events, ‘end tag’ events, ‘data’ events, and “done with data” events.
If you encounter an empty-element tag (aka a self-closing tag, like HTML’s <br> tag), call handle_starttag and then handle_endtag.
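The event model described above can be sketched with the standard library alone. This is not Beautiful Soup's implementation: `Node`, `EventTree`, `Feeder` and the `VOID_TAGS` subset are hypothetical names introduced for illustration, and the handler signatures follow the simplified `handle_starttag(name, attrs)` form used in the description rather than the full signatures documented below.

```python
from html.parser import HTMLParser

# Illustrative subset of HTML void elements (an assumption, not bs4's list).
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class Node:
    """Hypothetical minimal tree node: a name, attributes, and children."""

    def __init__(self, name, attrs=None):
        self.name = name
        self.attrs = dict(attrs or {})
        self.children = []  # Node objects and plain strings


class EventTree:
    """Hypothetical object that builds a tree from four kinds of events."""

    def __init__(self):
        self.root = Node("[document]")
        self.stack = [self.root]
        self.current_data = []

    def handle_starttag(self, name, attrs):
        self.end_data()
        node = Node(name, attrs)
        self.stack[-1].children.append(node)
        self.stack.append(node)
        return node

    def handle_endtag(self, name):
        self.end_data()
        # Pop up to and including the most recent open tag with this name;
        # if no such tag is open, nothing is popped.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].name == name:
                del self.stack[i:]
                break

    def handle_data(self, data):
        self.current_data.append(data)  # appends to the current data node

    def end_data(self):
        if self.current_data:  # ends the current data node
            self.stack[-1].children.append("".join(self.current_data))
            self.current_data = []


class Feeder(HTMLParser):
    """Translates html.parser callbacks into the four events."""

    def __init__(self, tree):
        super().__init__()
        self.tree = tree

    def handle_starttag(self, tag, attrs):
        self.tree.handle_starttag(tag, attrs)
        if tag in VOID_TAGS:
            # Empty-element tag: start event followed at once by end event.
            self.tree.handle_endtag(tag)

    def handle_startendtag(self, tag, attrs):
        self.tree.handle_starttag(tag, attrs)
        self.tree.handle_endtag(tag)

    def handle_endtag(self, tag):
        self.tree.handle_endtag(tag)

    def handle_data(self, data):
        self.tree.handle_data(data)


tree = EventTree()
feeder = Feeder(tree)
feeder.feed("<p>one<br>two</p>")
feeder.close()
tree.end_data()

p = tree.root.children[0]
summary = [c if isinstance(c, str) else c.name for c in p.children]
print(summary)
```

The `<br>` tag produces a start event immediately followed by an end event, so it ends up as a childless node between the two strings.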
- ROOT_TAG_NAME = '[document]'#
- DEFAULT_BUILDER_FEATURES = ['html', 'fast']#
- ASCII_SPACES = Multiline-String#
""" """
- NO_PARSER_SPECIFIED_WARNING = Multiline-String#
"""No parser was explicitly specified, so I'm using the best available %(markup_type)s parser for this system ("%(parser)s"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently. The code that caused this warning is on line %(line_number)s of the file %(filename)s. To get rid of this warning, pass the additional argument 'features="%(parser)s"' to the BeautifulSoup constructor. """
- _clone()#
Create a new BeautifulSoup object with the same TreeBuilder, but not associated with any markup.
This is the first step of the deepcopy process.
- __getstate__()#
Helper for pickle.
- __setstate__(state)#
- classmethod _decode_markup(markup)#
Ensure markup is a Unicode string (decoding it from bytes if necessary) so it’s safe to send into warnings.warn.
TODO: warnings.warn had this problem back in 2010 but it might not anymore.
- classmethod _markup_is_url(markup)#
Error-handling method to raise a warning if incoming markup looks like a URL.
- Parameters:
markup – A string.
- Returns:
Whether or not the markup resembles a URL closely enough to justify a warning.
- classmethod _markup_resembles_filename(markup)#
Error-handling method to raise a warning if incoming markup resembles a filename.
- Parameters:
markup – A bytestring or string.
- Returns:
Whether or not the markup resembles a filename closely enough to justify a warning.
- _feed()#
Internal method that parses previously set markup, creating a large number of Tag and NavigableString objects.
- reset()#
Reset this object to a state as though it had never parsed any markup.
- new_tag(name, namespace=None, nsprefix=None, attrs={}, sourceline=None, sourcepos=None, **kwattrs)#
Create a new Tag associated with this BeautifulSoup object.
- Parameters:
name – The name of the new Tag.
namespace – The URI of the new Tag’s XML namespace, if any.
nsprefix – The prefix for the new Tag’s XML namespace, if any.
attrs – A dictionary of this Tag’s attribute values; can be used instead of kwattrs for attributes like ‘class’ that are reserved words in Python.
sourceline – The line number where this tag was (purportedly) found in its source document.
sourcepos – The character position within sourceline where this tag was (purportedly) found.
kwattrs – Keyword arguments for the new Tag’s attribute values.
- string_container(base_class=None)#
- new_string(s, subclass=None)#
Create a new NavigableString associated with this BeautifulSoup object.
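A brief sketch of new_tag() and new_string() in use, assuming the built-in "html.parser" builder; the markup, URL and attribute values are made up for illustration.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

soup = BeautifulSoup("<p></p>", "html.parser")

# 'class' is a reserved word in Python, so it is passed through attrs;
# ordinary attribute names can be given as keyword arguments.
link = soup.new_tag("a", attrs={"class": "external"}, href="http://example.com/")
link.string = "example"
soup.p.append(link)

# new_string() builds a NavigableString; the second argument picks a subclass.
note = soup.new_string("added by hand", Comment)
soup.p.append(note)

print(soup.p)
```

Neither object is part of the tree until it is explicitly attached, here with append().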
- abstract insert_before(*args)#
This method is part of the PageElement API, but BeautifulSoup doesn’t implement it because there is nothing before or after it in the parse tree.
- abstract insert_after(*args)#
This method is part of the PageElement API, but BeautifulSoup doesn’t implement it because there is nothing before or after it in the parse tree.
- popTag()#
Internal method called by _popToTag when a tag is closed.
- pushTag(tag)#
Internal method called by handle_starttag when a tag is opened.
- endData(containerClass=None)#
Method called by the TreeBuilder when the end of a data segment occurs.
- object_was_parsed(o, parent=None, most_recent_element=None)#
Method called by the TreeBuilder to integrate an object into the parse tree.
- _linkage_fixer(el)#
Make sure linkage of this fragment is sound.
- _popToTag(name, nsprefix=None, inclusivePop=True)#
Pops the tag stack up to and including the most recent instance of the given tag.
If there are no open tags with the given name, nothing will be popped.
- Parameters:
name – Pop up to the most recent tag with this name.
nsprefix – The namespace prefix that goes with name.
inclusivePop – If this is false, pops the tag stack up to but not including the most recent instance of the given tag.
- handle_starttag(name, namespace, nsprefix, attrs, sourceline=None, sourcepos=None, namespaces=None)#
Called by the tree builder when a new tag is encountered.
- Parameters:
name – Name of the tag.
namespace – The URI of the tag’s XML namespace, if any.
nsprefix – Namespace prefix for the tag.
attrs – A dictionary of attribute values.
sourceline – The line number where this tag was found in its source document.
sourcepos – The character position within sourceline where this tag was found.
namespaces – A dictionary of all namespace prefix mappings currently in scope in the document.
If this method returns None, the tag was rejected by an active SoupStrainer. You should proceed as if the tag had not occurred in the document. For instance, if this was a self-closing tag, don’t call handle_endtag.
- handle_endtag(name, nsprefix=None)#
Called by the tree builder when an ending tag is encountered.
- Parameters:
name – Name of the tag.
nsprefix – Namespace prefix for the tag.
- handle_data(data)#
Called by the tree builder when a chunk of textual data is encountered.
- decode(pretty_print=False, eventual_encoding=DEFAULT_OUTPUT_ENCODING, formatter='minimal', iterator=None)#
Returns a string or Unicode representation of the parse tree as an HTML or XML document.
- Parameters:
pretty_print – If this is True, indentation will be used to make the document more readable.
eventual_encoding – The encoding of the final document. If this is None, the document will be a Unicode string.
- class bs4.tests.CharsetMetaAttributeValue#
Bases:
AttributeValueWithCharsetSubstitution
A generic stand-in for the value of a meta tag’s ‘charset’ attribute.
When Beautiful Soup parses the markup <meta charset="utf8">, the value of the ‘charset’ attribute will be one of these objects.
- encode(encoding)#
When an HTML document is being encoded to a given encoding, the value of a meta tag’s ‘charset’ is the name of the encoding.
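A small sketch of the substitution, assuming the built-in "html.parser" builder: the stored attribute value is a stand-in object, and the encoding name requested at output time is written into the rendered attribute.

```python
from bs4 import BeautifulSoup
from bs4.element import CharsetMetaAttributeValue

soup = BeautifulSoup('<meta charset="utf8"/>', "html.parser")

charset = soup.meta["charset"]
is_standin = isinstance(charset, CharsetMetaAttributeValue)

# Encoding the document swaps the target encoding into the attribute.
latin1_bytes = soup.encode("iso-8859-1")
print(is_standin, latin1_bytes)
```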
- class bs4.tests.Comment#
Bases:
PreformattedString
An HTML or XML comment.
- PREFIX = '<!--'#
- SUFFIX = '-->'#
- class bs4.tests.ContentMetaAttributeValue#
Bases:
AttributeValueWithCharsetSubstitution
A generic stand-in for the value of a meta tag’s ‘content’ attribute.
When Beautiful Soup parses the markup:
<meta http-equiv="content-type" content="text/html; charset=utf8">
the value of the ‘content’ attribute will be one of these objects.
- CHARSET_RE#
- encode(encoding)#
Encode the string using the codec registered for encoding.
- encoding
The encoding in which to encode the string.
- errors
The error handling scheme to use for encoding errors. The default is ‘strict’ meaning that encoding errors raise a UnicodeEncodeError. Other possible values are ‘ignore’, ‘replace’ and ‘xmlcharrefreplace’ as well as any other name registered with codecs.register_error that can handle UnicodeEncodeErrors.
- class bs4.tests.Doctype#
Bases:
PreformattedString
A document type declaration.
- PREFIX = '<!DOCTYPE '#
- SUFFIX = '>\n'#
- classmethod for_name_and_ids(name, pub_id, system_id)#
Generate an appropriate document type declaration for a given public ID and system ID.
- Parameters:
name – The name of the document’s root element, e.g. ‘html’.
pub_id – The Formal Public Identifier for this document type, e.g. ‘-//W3C//DTD XHTML 1.1//EN’
system_id – The system identifier for this document type, e.g. ‘http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd’
- Returns:
A Doctype.
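As an illustration, the example identifiers from the parameter descriptions above can be passed straight to the classmethod. This is a sketch; the exact string layout shown in the comments reflects the behaviour of the releases it was checked against.

```python
from bs4.element import Doctype

doctype = Doctype.for_name_and_ids(
    "html",
    "-//W3C//DTD XHTML 1.1//EN",
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd",
)

# A Doctype is a string holding the body of the declaration;
# output_ready() wraps it in PREFIX and SUFFIX.
rendered = doctype.output_ready()
print(rendered)
```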
- bs4.tests.PYTHON_SPECIFIC_ENCODINGS#
- class bs4.tests.SoupStrainer(name=None, attrs={}, string=None, **kwargs)#
Bases:
object
Encapsulates a number of ways of matching a markup element (tag or string).
This is primarily used to underpin the find_* methods, but you can create one yourself and pass it in as parse_only to the BeautifulSoup constructor, to parse a subset of a large document.
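A minimal sketch of the parse_only use case, assuming the built-in "html.parser" builder and made-up markup: only the elements that match the strainer are turned into objects.

```python
from bs4 import BeautifulSoup, SoupStrainer

markup = '<div><a href="/one">1</a><p>skip</p><a href="/two">2</a></div>'

# Keep only <a> tags (and their contents) while parsing.
only_links = SoupStrainer("a")
soup = BeautifulSoup(markup, "html.parser", parse_only=only_links)

hrefs = [a["href"] for a in soup.find_all("a")]
print(hrefs)
```

The `<div>` and `<p>` tags never make it into the tree, which is what saves work on large documents.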
- searchTag#
- _normalize_search_value(value)#
- __str__()#
A human-readable representation of this SoupStrainer.
- search_tag(markup_name=None, markup_attrs={})#
Check whether a Tag with the given name and attributes would match this SoupStrainer.
Used prospectively to decide whether to even bother creating a Tag object.
- Parameters:
markup_name – A tag name as found in some markup.
markup_attrs – A dictionary of attributes as found in some markup.
- Returns:
True if the prospective tag would match this SoupStrainer; False otherwise.
- search(markup)#
Find all items in markup that match this SoupStrainer.
Used by the core _find_all() method, which is ultimately called by all find_* methods.
- Parameters:
markup – A PageElement or a list of them.
- _matches(markup, match_against, already_tried=None)#
- class bs4.tests.Script#
Bases:
NavigableString
A NavigableString representing an executable script (probably JavaScript).
Used to distinguish executable code from textual content.
- class bs4.tests.Stylesheet#
Bases:
NavigableString
A NavigableString representing a stylesheet (probably CSS).
Used to distinguish embedded stylesheets from textual content.
- class bs4.tests.Tag(parser=None, builder=None, name=None, namespace=None, prefix=None, attrs=None, parent=None, previous=None, is_xml=None, sourceline=None, sourcepos=None, can_be_empty_element=None, cdata_list_attributes=None, preserve_whitespace_tags=None, interesting_string_types=None, namespaces=None)#
Bases:
PageElement
Represents an HTML or XML tag that is part of a parse tree, along with its attributes and contents.
When Beautiful Soup parses the markup <b>penguin</b>, it will create a Tag object representing the <b> tag.
- property is_empty_element#
Is this tag an empty-element tag? (aka a self-closing tag)
A tag that has contents is never an empty-element tag.
A tag that has no contents may or may not be an empty-element tag. It depends on the builder used to create the tag. If the builder has a designated list of empty-element tags, then only a tag whose name shows up in that list is considered an empty-element tag.
If the builder has no designated list of empty-element tags, then any tag with no contents is an empty-element tag.
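The rules above can be checked with a short sketch, assuming the built-in "html.parser" builder, which has a designated list of empty-element tags that includes br but not p.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<br><p></p>", "html.parser")

br_empty = soup.br.is_empty_element   # on the builder's list, no contents
p_empty = soup.p.is_empty_element     # no contents, but not on the list

# Once a tag has contents it is never an empty-element tag.
soup.br.append("text")
br_after = soup.br.is_empty_element

print(br_empty, p_empty, br_after)
```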
- property string#
Convenience property to get the single string within this PageElement.
TODO It might make sense to have NavigableString.string return itself.
- Returns:
If this element has a single string child, return value is that string. If this element has one child tag, return value is the ‘string’ attribute of the child tag, recursively. If this element is itself a string, has no children, or has more than one child, return value is None.
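The three cases in the return-value description, sketched with made-up markup and the built-in "html.parser" builder:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<b><i>penguin</i></b><p>one<br/>two</p><hr/>", "html.parser")

nested = soup.b.string     # single child tag: recurses into <i>
several = soup.p.string    # more than one child
childless = soup.hr.string # no children

print(nested, several, childless)
```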
- property children#
Iterate over all direct children of this PageElement.
- Yield:
A sequence of PageElements.
- property self_and_descendants#
Iterate over this PageElement and its descendants in document order (a depth-first traversal).
- Yield:
A sequence of PageElements.
- property descendants#
Iterate over all descendants of this PageElement in document order (a depth-first traversal).
- Yield:
A sequence of PageElements.
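The difference between children and descendants, and the order in which descendants are visited, can be checked with a small sketch (made-up markup, built-in "html.parser" builder). Tags are shown by name and strings by their text.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<div><p><b>x</b></p><i>y</i></div>", "html.parser")
div = soup.div


def label(element):
    # Tags have a name; NavigableStrings do not, so fall back to their text.
    return element.name or str(element)


direct = [label(e) for e in div.children]
everything = [label(e) for e in div.descendants]

print(direct)
print(everything)
```

On this input the `<b>` tag and its string are yielded before the `<i>` sibling, that is, each subtree is finished before the next sibling starts.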
- property css#
Return an interface to the CSS selector API.
- parserClass#
- isSelfClosing#
- DEFAULT_INTERESTING_STRING_TYPES = ()#
- strings#
- START_ELEMENT_EVENT#
- END_ELEMENT_EVENT#
- EMPTY_ELEMENT_EVENT#
- STRING_ELEMENT_EVENT#
- findChild#
- findAll#
- findChildren#
- __deepcopy__(memo, recursive=True)#
A deepcopy of a Tag is a new Tag, unconnected to the parse tree. Its contents are a copy of the old Tag’s contents.
- __copy__()#
A copy of a Tag must always be a deep copy, because a Tag’s children can only have one parent at a time.
- _clone()#
Create a new Tag just like this one, but with no contents and unattached to any parse tree.
This is the first step in the deepcopy process.
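A short sketch of the copy semantics described above, using the standard copy module and made-up markup:

```python
import copy

from bs4 import BeautifulSoup

soup = BeautifulSoup("<div><p>a<b>b</b></p></div>", "html.parser")
original = soup.p

duplicate = copy.copy(original)  # always a deep copy for a Tag

print(duplicate == original, duplicate.parent)
```

The copy compares equal to the original but is detached from the tree, and its children are new objects rather than shared ones.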
- _all_strings(strip=False, types=PageElement.default)#
Yield all strings of certain classes, possibly stripping them.
- Parameters:
strip – If True, all strings will be stripped before being yielded.
types – A tuple of NavigableString subclasses. Any strings of a subclass not found in this list will be ignored. By default, the subclasses considered are the ones found in self.interesting_string_types. If that’s not specified, only NavigableString and CData objects will be considered. That means no comments, processing instructions, etc.
- Yield:
A sequence of strings.
- decompose()#
Recursively destroys this PageElement and its children.
This element will be removed from the tree and wiped out; so will everything beneath it.
The behavior of a decomposed PageElement is undefined and you should never use one for anything, but if you need to _check_ whether an element has been decomposed, you can use the decomposed property.
- clear(decompose=False)#
Wipe out all children of this PageElement by calling extract() on them.
- Parameters:
decompose – If this is True, decompose() (a more destructive method) will be called instead of extract().
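A sketch contrasting the two modes of clear(), assuming made-up markup and the built-in "html.parser" builder. The decomposed property mentioned under decompose() is used only to check what happened to the removed children.

```python
from bs4 import BeautifulSoup

markup = "<div><p>a</p><p>b</p></div>"

# Default: children are extracted and remain usable on their own.
soup = BeautifulSoup(markup, "html.parser")
kept = soup.p
soup.div.clear()
after_clear = str(soup.div)

# decompose=True: children are destroyed as well as removed.
soup2 = BeautifulSoup(markup, "html.parser")
destroyed = soup2.p
soup2.div.clear(decompose=True)

print(after_clear, kept.decomposed, destroyed.decomposed)
```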
- smooth()#
Smooth out this element’s children by consolidating consecutive strings.
This makes pretty-printed output look more natural following a lot of operations that modified the tree.
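A minimal sketch of the effect, with made-up text: appending a string leaves two adjacent string children, and smooth() merges them.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>A one</p>", "html.parser")
soup.p.append(", a two")
before = len(soup.p.contents)

soup.p.smooth()
after = len(soup.p.contents)

print(before, after, soup.p.string)
```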
- index(element)#
Find the index of a child by identity, not value.
Avoids issues with tag.contents.index(element) getting the index of equal elements.
- Parameters:
element – Look for this PageElement in self.contents.
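The identity-versus-equality issue shows up whenever two sibling tags are equal by value. A small sketch with made-up markup:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p><b>x</b><b>x</b></p>", "html.parser")
first, second = soup.p.contents

# The two <b> tags are equal (same name, attributes, contents) but distinct.
by_value = soup.p.contents.index(second)  # list.index compares with ==
by_identity = soup.p.index(second)        # Tag.index compares with 'is'

print(first == second, by_value, by_identity)
```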
- get(key, default=None)#
Returns the value of the ‘key’ attribute for the tag, or the value given for ‘default’ if it doesn’t have that attribute.
- get_attribute_list(key, default=None)#
The same as get(), but always returns a list.
- Parameters:
key – The attribute to look for.
default – Use this value if the attribute is not present on this PageElement.
- Returns:
A list of values, probably containing only a single value.
- has_attr(key)#
Does this PageElement have an attribute with the given name?
- __hash__()#
Return hash(self).
- __getitem__(key)#
tag[key] returns the value of the ‘key’ attribute for the Tag, and raises a KeyError if it’s not there.
- __iter__()#
Iterating over a Tag iterates over its contents.
- __len__()#
The length of a Tag is the length of its list of contents.
- __contains__(x)#
- __bool__()#
A Tag is always truthy, even if it has no contents.
- __setitem__(key, value)#
Setting tag[key] sets the value of the ‘key’ attribute for the tag.
- __delitem__(key)#
Deleting tag[key] deletes all ‘key’ attributes for the tag.
- __call__(*args, **kwargs)#
Calling a Tag like a function is the same as calling its find_all() method. E.g. tag('a') returns a list of all the <a> tags found within this tag.
- __getattr__(tag)#
Calling tag.subtag is the same as calling tag.find(name="subtag")
- __eq__(other)#
Returns true iff this Tag has the same name, the same attributes, and the same contents (recursively) as other.
- __ne__(other)#
Returns true iff this Tag is not identical to other, as defined in __eq__.
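The special methods above give a Tag its dictionary-like and list-like behaviour. A compact sketch, assuming made-up markup and the built-in "html.parser" builder:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p id="intro"><a>1</a><a>2</a></p>', "html.parser")
p = soup.p

attr_value = p["id"]                 # __getitem__: attribute lookup
child_count = len(p)                 # __len__: number of direct children
child_names = [c.name for c in p]    # __iter__: iterates over contents
called = p("a")                      # __call__: same as p.find_all("a")
first_link = p.a                     # __getattr__: same as p.find("a")

p["id"] = "lead"                     # __setitem__
del p["id"]                          # __delitem__
try:
    p["id"]
    missing = False
except KeyError:
    missing = True

empty_is_truthy = bool(soup.new_tag("span"))  # __bool__

print(attr_value, child_count, child_names, missing, empty_is_truthy)
```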
- __repr__(encoding='unicode-escape')#
Renders this PageElement as a string.
- Parameters:
encoding – The encoding to use (Python 2 only). TODO: This is now ignored and a warning should be issued if a value is provided.
- Returns:
A (Unicode) string.
- __unicode__()#
Renders this PageElement as a Unicode string.
- encode(encoding=DEFAULT_OUTPUT_ENCODING, indent_level=None, formatter='minimal', errors='xmlcharrefreplace')#
Render a bytestring representation of this PageElement and its contents.
- Parameters:
encoding – The destination encoding.
indent_level – Each line of the rendering will be indented this many levels. (The formatter decides what a ‘level’ means in terms of spaces or other characters output.) Used internally in recursive calls while pretty-printing.
formatter – A Formatter object, or a string naming one of the standard formatters.
errors – An error handling strategy such as ‘xmlcharrefreplace’. This value is passed along into encode() and its value should be one of the constants defined by Python.
- Returns:
A bytestring.
- decode(indent_level=None, eventual_encoding=DEFAULT_OUTPUT_ENCODING, formatter='minimal', iterator=None)#
- _event_stream(iterator=None)#
Yield a sequence of events that can be used to reconstruct the DOM for this element.
This lets us recreate the nested structure of this element (e.g. when formatting it as a string) without using recursive method calls.
This is similar in concept to the SAX API, but it’s a simpler interface designed for internal use. The events are different from SAX and the arguments associated with the events are Tags and other Beautiful Soup objects.
- Parameters:
iterator – An alternate iterator to use when traversing the tree.
- _indent_string(s, indent_level, formatter, indent_before, indent_after)#
Add indentation whitespace before and/or after a string.
- Parameters:
s – The string to amend with whitespace.
indent_level – The indentation level; affects how much whitespace goes before the string.
indent_before – Whether or not to add whitespace before the string.
indent_after – Whether or not to add whitespace (a newline) after the string.
- _format_tag(eventual_encoding, formatter, opening)#
- _should_pretty_print(indent_level=1)#
Should this tag be pretty-printed?
Most of them should, but some (such as <pre> in HTML documents) should not.
- prettify(encoding=None, formatter='minimal')#
Pretty-print this PageElement as a string.
- Parameters:
encoding – The eventual encoding of the string. If this is None, a Unicode string will be returned.
formatter – A Formatter object, or a string naming one of the standard formatters.
- Returns:
A Unicode string (if encoding==None) or a bytestring (otherwise).
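How decode(), encode() and prettify() relate can be sketched as follows, using made-up markup with one non-ASCII character. The exact whitespace produced by prettify() is left unchecked here.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>caf\u00e9</p>", "html.parser")
p = soup.p

as_text = p.decode()           # Unicode string
as_utf8 = p.encode("utf-8")    # bytestring in the requested encoding

# With the default errors='xmlcharrefreplace', characters the target
# encoding cannot represent become numeric character references.
as_ascii = p.encode("ascii")

pretty_text = p.prettify()                   # str when encoding is None
pretty_bytes = p.prettify(encoding="utf-8")  # bytes otherwise

print(as_text, as_utf8, as_ascii)
```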
- decode_contents(indent_level=None, eventual_encoding=DEFAULT_OUTPUT_ENCODING, formatter='minimal')#
Renders the contents of this tag as a Unicode string.
- Parameters:
indent_level – Each line of the rendering will be indented this many levels. (The formatter decides what a ‘level’ means in terms of spaces or other characters output.) Used internally in recursive calls while pretty-printing.
eventual_encoding – The tag is destined to be encoded into this encoding. decode_contents() is _not_ responsible for performing that encoding. This information is passed in so that it can be substituted in if the document contains a <META> tag that mentions the document’s encoding.
formatter – A Formatter object, or a string naming one of the standard Formatters.
- encode_contents(indent_level=None, encoding=DEFAULT_OUTPUT_ENCODING, formatter='minimal')#
Renders the contents of this PageElement as a bytestring.
- Parameters:
indent_level – Each line of the rendering will be indented this many levels. (The formatter decides what a ‘level’ means in terms of spaces or other characters output.) Used internally in recursive calls while pretty-printing.
encoding – The bytestring will be in this encoding.
formatter – A Formatter object, or a string naming one of the standard Formatters.
- Returns:
A bytestring.
- renderContents(encoding=DEFAULT_OUTPUT_ENCODING, prettyPrint=False, indentLevel=0)#
Deprecated method for BS3 compatibility.
- find(name=None, attrs={}, recursive=True, string=None, **kwargs)#
Look in the children of this PageElement and find the first PageElement that matches the given criteria.
All find_* methods take a common set of arguments. See the online documentation for detailed explanations.
- Parameters:
name – A filter on tag name.
attrs – A dictionary of filters on attribute values.
recursive – If this is True, find() will perform a recursive search of this PageElement’s children. Otherwise, only the direct children will be considered.
string – A filter on the string content of a tag.
- Kwargs:
A dictionary of filters on attribute values.
- Returns:
A PageElement.
- find_all(name=None, attrs={}, recursive=True, string=None, limit=None, **kwargs)#
Look in the children of this PageElement and find all PageElements that match the given criteria.
All find_* methods take a common set of arguments. See the online documentation for detailed explanations.
- Parameters:
name – A filter on tag name.
attrs – A dictionary of filters on attribute values.
recursive – If this is True, find_all() will perform a recursive search of this PageElement’s children. Otherwise, only the direct children will be considered.
limit – Stop looking after finding this many results.
- Kwargs:
A dictionary of filters on attribute values.
- Returns:
A ResultSet of PageElements.
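A sketch of the common find() and find_all() arguments, assuming made-up markup and the built-in "html.parser" builder:

```python
from bs4 import BeautifulSoup

markup = '<div><a class="x" href="/1">one</a><span><a href="/2">two</a></span></div>'
soup = BeautifulSoup(markup, "html.parser")
div = soup.div

first = div.find("a")                           # first match, or None
all_links = div.find_all("a")                   # searches all descendants
direct_links = div.find_all("a", recursive=False)  # direct children only
limited = div.find_all("a", limit=1)            # stop after one result
by_attr = div.find_all("a", attrs={"class": "x"})
by_kwarg = div.find_all("a", href="/2")         # kwargs filter on attributes
by_string = div.find("a", string="two")

print(first["href"], len(all_links), len(direct_links))
```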
- select_one(selector, namespaces=None, **kwargs)#
Perform a CSS selection operation on the current element.
- Parameters:
selector – A CSS selector.
namespaces – A dictionary mapping namespace prefixes used in the CSS selector to namespace URIs. By default, Beautiful Soup will use the prefixes it encountered while parsing the document.
kwargs – Keyword arguments to be passed into Soup Sieve’s soupsieve.select() method.
- Returns:
A Tag.
- select(selector, namespaces=None, limit=None, **kwargs)#
Perform a CSS selection operation on the current element.
This uses the SoupSieve library.
- Parameters:
selector – A string containing a CSS selector.
namespaces – A dictionary mapping namespace prefixes used in the CSS selector to namespace URIs. By default, Beautiful Soup will use the prefixes it encountered while parsing the document.
limit – After finding this number of results, stop looking.
kwargs – Keyword arguments to be passed into SoupSieve’s soupsieve.select() method.
- Returns:
A ResultSet of Tags.
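A sketch of select() and select_one(), assuming the soupsieve package is installed (it normally comes with Beautiful Soup) and using made-up markup:

```python
from bs4 import BeautifulSoup

markup = '<div id="nav"><a href="/1">one</a><p><a href="/2">two</a></p></div>'
soup = BeautifulSoup(markup, "html.parser")

direct = soup.select("#nav > a")     # child combinator: direct children only
nested = soup.select("#nav a")       # descendant combinator
limited = soup.select("a", limit=1)  # stop after the first result
one = soup.select_one("a[href='/2']")
nothing = soup.select_one("table")   # no match

print([a["href"] for a in nested], one.string, nothing)
```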
- childGenerator()#
Deprecated generator.
- recursiveChildGenerator()#
Deprecated generator.
- has_key(key)#
Deprecated method. This was kind of misleading because has_key() (attributes) was different from __contains__ (contents).
has_key() is gone in Python 3, anyway.
- class bs4.tests.DetectsXMLParsedAsHTML#
Bases:
object
A mixin class for any class (a TreeBuilder, or some class used by a TreeBuilder) that’s in a position to detect whether an XML document is being incorrectly parsed as HTML, and issue an appropriate warning.
This requires being able to observe an incoming processing instruction that might be an XML declaration, and also able to observe tags as they’re opened. If you can’t do that for a given TreeBuilder, there’s a less reliable implementation based on examining the raw markup.
- LOOKS_LIKE_HTML#
- LOOKS_LIKE_HTML_B#
- XML_PREFIX = '<?xml'#
- XML_PREFIX_B = b'<?xml'#
- classmethod warn_if_markup_looks_like_xml(markup)#
Perform a check on some markup to see if it looks like XML that’s not XHTML. If so, issue a warning.
This is much less reliable than doing the check while parsing, but some of the tree builders can’t do that.
- Returns:
True if the markup looks like non-XHTML XML, False otherwise.
- classmethod _warn()#
Issue a warning about XML being parsed as HTML.
- _initialize_xml_detector()#
Call this method before parsing a document.
- _document_might_be_xml(processing_instruction)#
Call this method when encountering an XML declaration, or a “processing instruction” that might be an XML declaration.
- _root_tag_encountered(name)#
Call this when you encounter the document’s root tag.
This is where we actually check whether an XML document is being incorrectly parsed as HTML, and issue the warning.
- exception bs4.tests.XMLParsedAsHTMLWarning#
Bases:
UserWarning
The warning issued when an HTML parser is used to parse XML that is not XHTML.
- MESSAGE = "It looks like you're parsing an XML document using an HTML parser. If this really is an HTML..."#
- bs4.tests.default_builder#
- bs4.tests.SOUP_SIEVE_PRESENT = True#
- bs4.tests.HTML5LIB_PRESENT = True#
- bs4.tests.LXML_PRESENT = True#
- bs4.tests.BAD_DOCUMENT = Multiline-String#
"""A bare string <!DOCTYPE xsl:stylesheet SYSTEM "htmlent.dtd"> <!DOCTYPE xsl:stylesheet PUBLIC "htmlent.dtd"> <div><![CDATA[A CDATA section where it doesn't belong]]></div> <div><svg><![CDATA[HTML5 does allow CDATA sections in SVG]]></svg></div> <div>A <meta> tag</div> <div>A <br> tag that supposedly has contents.</br></div> <div>AT&T</div> <div><textarea>Within a textarea, markup like <b> tags and <&<& should be treated as literal</textarea></div> <div><script>if (i < 2) { alert("<b>Markup within script tags should be treated as literal.</b>"); }</script></div> <div>This numeric entity is missing the final semicolon: <x t="piñata"></div> <div><a href="http://example.com/</a> that attribute value never got closed</div> <div><a href="foo</a>, </a><a href="bar">that attribute value was closed by the subsequent tag</a></div> <! This document starts with a bogus declaration ><div>a</div> <div>This document contains <!an incomplete declaration <div>(do you see it?)</div> <div>This document ends with <!an incomplete declaration <div><a style={height:21px;}>That attribute value was bogus</a></div> <! 
DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN">The doctype is invalid because it contains extra whitespace <div><table><td nowrap>That boolean attribute had no value</td></table></div> <div>Here's a nonexistent entity: &#foo; (do you see it?)</div> <div>This document ends before the entity finishes: > <div><p>Paragraphs shouldn't contain block display elements, but this one does: <dl><dt>you see?</dt></p> <b b="20" a="1" b="10" a="2" a="3" a="4">Multiple values for the same attribute.</b> <div><table><tr><td>Here's a table</td></tr></table></div> <div><table id="1"><tr><td>Here's a nested table:<table id="2"><tr><td>foo</td></tr></table></td></div> <div>This tag contains nothing but whitespace: <b> </b></div> <div><blockquote><p><b>This p tag is cut off by</blockquote></p>the end of the blockquote tag</div> <div><table><div>This table contains bare markup</div></table></div> <div><div id="1"> <a href="link1">This link is never closed. </div> <div id="2"> <div id="3"> <a href="link2">This link is closed.</a> </div> </div></div> <div>This document contains a <!DOCTYPE surprise>surprise doctype</div> <div><a><B><Cd><EFG>Mixed case tags are folded to lowercase</efg></CD></b></A></div> <div><our☃>Tag name contains Unicode characters</our☃></div> <div><a ☃="snowman">Attribute name contains Unicode characters</a></div> <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"> """
- class bs4.tests.SoupTest#
Bases:
object
- property default_builder#
- assertSoupEquals#
- soup(markup, **kwargs)#
Build a Beautiful Soup object from markup.
- document_for(markup, **kwargs)#
Turn an HTML fragment into a document.
The details depend on the builder.
- assert_soup(to_parse, compare_parsed_to=None)#
Parse some markup using Beautiful Soup and verify that the output markup is as expected.
- assertConnectedness(element)#
Ensure that next_element and previous_element are properly set for all descendants of the given element.
- linkage_validator(el, _recursive_call=False)#
Ensure proper linkage throughout the document.
- assert_selects(tags, should_match)#
Make sure that the given tags have the correct text.
This is used in tests that define a bunch of tags, each containing a single string, and then select certain strings by some mechanism.
- assert_selects_ids(tags, should_match)#
Make sure that the given tags have the correct IDs.
This is used in tests that define a bunch of tags, each containing a single string, and then select certain strings by some mechanism.
- class bs4.tests.TreeBuilderSmokeTest#
Bases:
object
- test_attribute_not_multi_valued(multi_valued_attributes)#
- test_attribute_multi_valued(multi_valued_attributes)#
- test_invalid_doctype()#
- class bs4.tests.HTMLTreeBuilderSmokeTest#
Bases:
TreeBuilderSmokeTest
A basic test of a treebuilder’s competence.
Any HTML treebuilder, present or future, should be able to pass these tests. With invalid markup, there’s room for interpretation, and different parsers can handle it differently. But with the markup in these tests, there’s not much room for interpretation.
- test_empty_element_tags()#
Verify that all HTML4 and HTML5 empty element (aka void element) tags are handled correctly.
- test_special_string_containers()#
- test_pickle_and_unpickle_identity()#
- assertDoctypeHandled(doctype_fragment)#
Assert that a given doctype string is handled correctly.
- _document_with_doctype(doctype_fragment, doctype_string='DOCTYPE')#
Generate and parse a document with the given doctype.
- test_normal_doctypes()#
Make sure normal, everyday HTML doctypes are handled correctly.
- test_empty_doctype()#
- test_mixed_case_doctype()#
- test_public_doctype_with_url()#
- test_system_doctype()#
- test_namespaced_system_doctype()#
- test_namespaced_public_doctype()#
- test_real_xhtml_document()#
A real XHTML document should come out more or less the same as it went in.
- test_namespaced_html()#
- test_detect_xml_parsed_as_html()#
- test_processing_instruction()#
- test_deepcopy()#
Make sure you can copy the tree builder.
This is important because the builder is part of a BeautifulSoup object, and we want to be able to copy that.
- test_p_tag_is_never_empty_element()#
A <p> tag is never designated as an empty-element tag.
Even if the markup shows it as an empty-element tag, it shouldn’t be presented that way.
- test_unclosed_tags_get_closed()#
A tag that’s not closed by the end of the document should be closed.
This applies to all tags except empty-element tags.
- test_br_is_always_empty_element_tag()#
A <br> tag is designated as an empty-element tag.
Some parsers treat <br></br> as one <br/> tag, some parsers as two tags, but it should always be an empty-element tag.
- test_nested_formatting_elements()#
- test_double_head()#
- test_comment()#
- test_preserved_whitespace_in_pre_and_textarea()#
Whitespace must be preserved in <pre> and <textarea> tags, even if that would mean not prettifying the markup.
- test_nested_inline_elements()#
Inline elements can be nested indefinitely.
- test_nested_block_level_elements()#
Block elements can be nested.
- test_correctly_nested_tables()#
One table can go inside another one.
- test_multivalued_attribute_with_whitespace()#
- test_deeply_nested_multivalued_attribute()#
- test_multivalued_attribute_on_html()#
- test_angle_brackets_in_attribute_values_are_escaped()#
- test_strings_resembling_character_entity_references()#
- test_apos_entity()#
- test_entities_in_foreign_document_encoding()#
- test_entities_in_attributes_converted_to_unicode()#
- test_entities_in_text_converted_to_unicode()#
- test_quot_entity_converted_to_quotation_mark()#
- test_out_of_range_entity()#
- test_multipart_strings()#
Mostly to prevent a recurrence of a bug in the html5lib treebuilder.
- test_empty_element_tags()#
Verify consistent handling of empty-element tags, no matter how they come in through the markup.
- test_head_tag_between_head_and_body()#
Prevent recurrence of a bug in the html5lib treebuilder.
- test_multiple_copies_of_a_tag()#
Prevent recurrence of a bug in the html5lib treebuilder.
- test_basic_namespaces()#
Parsers don’t need to understand namespaces, but at the very least they should not choke on namespaces or lose data.
- test_multivalued_attribute_value_becomes_list()#
- test_can_parse_unicode_document()#
- test_soupstrainer()#
Parsers should be able to work with SoupStrainers.
- test_single_quote_attribute_values_become_double_quotes()#
- test_attribute_values_with_nested_quotes_are_left_alone()#
- test_attribute_values_with_double_nested_quotes_get_quoted()#
- test_ampersand_in_attribute_value_gets_escaped()#
- test_escaped_ampersand_in_attribute_value_is_left_alone()#
- test_entities_in_strings_converted_during_parsing()#
- test_smart_quotes_converted_on_the_way_in()#
- test_non_breaking_spaces_converted_on_the_way_in()#
- test_entities_converted_on_the_way_out()#
- test_real_iso_8859_document()#
- test_real_shift_jis_document()#
- test_real_hebrew_document()#
- test_meta_tag_reflects_current_encoding()#
- test_html5_style_meta_tag_reflects_current_encoding()#
- test_python_specific_encodings_not_used_in_charset()#
- test_tag_with_no_attributes_can_have_attributes_added()#
- test_closing_tag_with_no_opening_tag()#
- test_worst_case()#
Test the worst case (currently) for linking issues.
- class bs4.tests.XMLTreeBuilderSmokeTest#
Bases:
TreeBuilderSmokeTest
- test_pickle_and_unpickle_identity()#
- test_docstring_generated()#
- test_xml_declaration()#
- test_python_specific_encodings_not_used_in_xml_declaration()#
- test_processing_instruction()#
- test_real_xhtml_document()#
A real XHTML document should come out exactly the same as it went in.
- test_nested_namespaces()#
- test_formatter_processes_script_tag_for_xml_documents()#
- test_can_parse_unicode_document()#
- test_can_parse_unicode_document_begining_with_bom()#
- test_popping_namespaced_tag()#
- test_docstring_includes_correct_encoding()#
- test_large_xml_document()#
A large XML document should come out the same as it went in.
- test_tags_are_empty_element_if_and_only_if_they_are_empty()#
- test_namespaces_are_preserved()#
- test_closing_namespaced_tag()#
- test_namespaced_attributes()#
- test_namespaced_attributes_xml_namespace()#
- test_find_by_prefixed_name()#
- test_copy_tag_preserves_namespace()#
- test_worst_case()#
Test the worst case (currently) for linking issues.
- class bs4.tests.HTML5TreeBuilderSmokeTest#
Bases:
HTMLTreeBuilderSmokeTest
Smoke test for a tree builder that supports HTML5.
- test_real_xhtml_document()#
A real XHTML document should come out more or less the same as it went in.
- test_html_tags_have_namespace()#
- test_svg_tags_have_namespace()#
- test_mathml_tags_have_namespace()#
- test_xml_declaration_becomes_comment()#