bs4#

Beautiful Soup Elixir and Tonic - “The Screen-Scraper’s Friend”.

http://www.crummy.com/software/BeautifulSoup/

Beautiful Soup uses a pluggable XML or HTML parser to parse a (possibly invalid) document into a tree representation. Beautiful Soup provides methods and Pythonic idioms that make it easy to navigate, search, and modify the parse tree.
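As a rough illustration of the navigate/search/modify workflow described above, here is a minimal sketch. The markup, the `/next` link target, and the variable names are invented for the example, and it assumes the standard-library-backed "html.parser" builder is acceptable.

```python
from bs4 import BeautifulSoup

# Deliberately sloppy markup: the <p> tag is never closed.
html = "<p class='intro'>Hello, <b>world</b><a href='/next'>next</a>"

# Name the parser explicitly so behaviour does not depend on what is installed.
soup = BeautifulSoup(html, "html.parser")

# Navigate: attribute-style access walks down to the first matching tag.
first_bold = soup.p.b.string

# Search: find_all() returns every matching tag in document order.
links = [a["href"] for a in soup.find_all("a")]

# Modify: attributes behave like dictionary entries.
soup.a["href"] = "/changed"

print(first_bold, links, soup.decode())
```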

Beautiful Soup works with Python 3.6 and up. It works better if lxml and/or html5lib is installed.

For more than you ever wanted to know about Beautiful Soup, see the documentation: http://www.crummy.com/software/BeautifulSoup/bs4/doc/

Package Contents#

Classes#

BeautifulSoup

A data structure representing a parsed HTML or XML document.

class bs4.BeautifulSoup(markup='', features=None, builder=None, parse_only=None, from_encoding=None, exclude_encodings=None, element_classes=None, **kwargs)#

Bases: element.Tag

A data structure representing a parsed HTML or XML document.

Most of the methods you’ll call on a BeautifulSoup object are inherited from PageElement or Tag.

Internally, this class defines the basic interface called by the tree builders when converting an HTML/XML document into a data structure. The interface abstracts away the differences between parsers. To write a new tree builder, you’ll need to understand these methods as a whole.

These methods will be called by the BeautifulSoup constructor:
  • reset()

  • feed(markup)

The tree builder may call these methods from its feed() implementation:
  • handle_starttag(name, attrs) # See note about return value

  • handle_endtag(name)

  • handle_data(data) # Appends to the current data node

  • endData(containerClass) # Ends the current data node

No matter how complicated the underlying parser is, you should be able to build a tree using ‘start tag’ events, ‘end tag’ events, ‘data’ events, and ‘done with data’ events.

If you encounter an empty-element tag (aka a self-closing tag, like HTML’s <br> tag), call handle_starttag and then handle_endtag.
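The event sequence described above can be sketched by driving the interface by hand, which is roughly what a tree builder does from its feed() implementation. This is an illustration only, not how you would normally build a document: it assumes the full handle_starttag(name, namespace, nsprefix, attrs) signature documented further down this page, and the tag name, attribute, and text are invented.

```python
from bs4 import BeautifulSoup

# Start from an empty document, then reset it the way the constructor would.
soup = BeautifulSoup("", "html.parser")
soup.reset()

# 'start tag' event. The arguments are (name, namespace, nsprefix, attrs).
tag = soup.handle_starttag("p", None, None, {"id": "greeting"})

# A return value of None would mean the tag was rejected by a SoupStrainer;
# in that case the matching data and end tag events are skipped.
if tag is not None:
    soup.handle_data("Hello")   # 'data' event: appends to the current data node
    soup.handle_endtag("p")     # 'end tag' event: also ends the open data node

# 'done with data' event: flush anything still buffered.
soup.endData()

print(soup.decode())
```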

ROOT_TAG_NAME = '[document]'#
DEFAULT_BUILDER_FEATURES = ['html', 'fast']#
ASCII_SPACES = Multiline-String#
"""

"""
NO_PARSER_SPECIFIED_WARNING = Multiline-String#
"""No parser was explicitly specified, so I'm using the best available %(markup_type)s parser for this system ("%(parser)s"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.

The code that caused this warning is on line %(line_number)s of the file %(filename)s. To get rid of this warning, pass the additional argument 'features="%(parser)s"' to the BeautifulSoup constructor.
"""
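A minimal sketch of the remedy the warning text suggests: pass features explicitly. The example assumes "html.parser" is the parser you want, and the markup is made up. It records warnings rather than printing them so that the two cases can be compared.

```python
import warnings

from bs4 import BeautifulSoup

markup = "<p>Hello</p>"

# No parser named: Beautiful Soup picks one and warns about the guess.
with warnings.catch_warnings(record=True) as implicit:
    warnings.simplefilter("always")
    BeautifulSoup(markup)

# Parser named explicitly: nothing to guess, so nothing to warn about.
with warnings.catch_warnings(record=True) as explicit:
    warnings.simplefilter("always")
    BeautifulSoup(markup, features="html.parser")

print(len(implicit), len(explicit))
```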
_clone()#

Create a new BeautifulSoup object with the same TreeBuilder, but not associated with any markup.

This is the first step of the deepcopy process.

__getstate__()#

Helper for pickle.

__setstate__(state)#
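These helpers are not usually called directly; they are reached through the standard copy and pickle modules. A small sketch, assuming a simple document that round-trips cleanly through "html.parser" (the markup is invented):

```python
import copy
import pickle

from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Hello <b>world</b></p>", "html.parser")

# deepcopy produces an independent tree; editing it leaves the original alone.
duplicate = copy.deepcopy(soup)
duplicate.b.string = "there"

# pickle goes through __getstate__/__setstate__ behind the scenes.
restored = pickle.loads(pickle.dumps(soup))

print(soup.decode())
print(duplicate.decode())
print(restored.decode())
```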
classmethod _decode_markup(markup)#

Ensure markup is a Unicode string (decoding it if it arrives as bytes) so it’s safe to send into warnings.warn.

TODO: warnings.warn had this problem back in 2010 but it might not anymore.

classmethod _markup_is_url(markup)#

Error-handling method to raise a warning if incoming markup looks like a URL.

Parameters:

markup – A string.

Returns:

Whether or not the markup resembles a URL closely enough to justify a warning.
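This check exists to catch a common mistake: passing a URL to the constructor instead of the content fetched from it. A hedged sketch of what that looks like from the caller's side, using a placeholder URL and recording the warning instead of printing it:

```python
import warnings

from bs4 import BeautifulSoup

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    # A URL is not markup. Beautiful Soup does not fetch it; it only warns.
    BeautifulSoup("https://example.com/page", "html.parser")

messages = [str(w.message) for w in caught]
print(messages)
```

The fix is to download the document with an HTTP client of your choice and pass the response body to the constructor.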

classmethod _markup_resembles_filename(markup)#

Error-handling method to raise a warning if incoming markup resembles a filename.

Parameters:

markup – A bytestring or string.

Returns:

Whether or not the markup resembles a filename closely enough to justify a warning.

_feed()#

Internal method that parses previously set markup, creating a large number of Tag and NavigableString objects.

reset()#

Reset this object to a state as though it had never parsed any markup.

new_tag(name, namespace=None, nsprefix=None, attrs={}, sourceline=None, sourcepos=None, **kwattrs)#

Create a new Tag associated with this BeautifulSoup object.

Parameters:
  • name – The name of the new Tag.

  • namespace – The URI of the new Tag’s XML namespace, if any.

  • nsprefix – The prefix for the new Tag’s XML namespace, if any.

  • attrs – A dictionary of this Tag’s attribute values; can be used instead of kwattrs for attributes like ‘class’ that are reserved words in Python.

  • sourceline – The line number where this tag was (purportedly) found in its source document.

  • sourcepos – The character position within sourceline where this tag was (purportedly) found.

  • kwattrs – Keyword arguments for the new Tag’s attribute values.
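A short sketch of new_tag in use, showing attrs for the reserved word ‘class’ alongside an ordinary keyword attribute. The markup, URL, and class name are invented for the example.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>See </p>", "html.parser")

# 'class' is a Python keyword, so it goes through attrs;
# href can be passed as a plain keyword argument.
link = soup.new_tag("a", attrs={"class": "external"}, href="https://example.com/")
link.string = "the example site"

# The new tag is not part of the tree until it is attached somewhere.
soup.p.append(link)

print(soup.decode())
```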

string_container(base_class=None)#
new_string(s, subclass=None)#

Create a new NavigableString associated with this BeautifulSoup object.
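A minimal sketch of new_string, including the subclass argument to create a comment instead of plain text. The strings are invented; Comment is the NavigableString subclass exported by bs4.

```python
from bs4 import BeautifulSoup, Comment

soup = BeautifulSoup("<p></p>", "html.parser")

# Default: a plain NavigableString.
text = soup.new_string("Hello")

# With subclass=Comment the string is rendered as an HTML comment.
note = soup.new_string(" reviewed ", Comment)

soup.p.append(text)
soup.p.append(note)

print(soup.decode())
```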

abstract insert_before(*args)#

This method is part of the PageElement API, but BeautifulSoup doesn’t implement it because there is nothing before or after it in the parse tree.

abstract insert_after(*args)#

This method is part of the PageElement API, but BeautifulSoup doesn’t implement it because there is nothing before or after it in the parse tree.
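In practice this means calling either method on the BeautifulSoup object itself raises NotImplementedError, while calling it on an element inside the tree works as usual. A small sketch with invented markup:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Hello</p>", "html.parser")
rule = soup.new_tag("hr")

# The document root has no siblings, so this is rejected.
try:
    soup.insert_before(rule)
    raised = False
except NotImplementedError:
    raised = True

# Inserting relative to an element inside the tree is fine.
soup.p.insert_before(rule)

print(raised, soup.decode())
```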

popTag()#

Internal method called by _popToTag when a tag is closed.

pushTag(tag)#

Internal method called by handle_starttag when a tag is opened.

endData(containerClass=None)#

Method called by the TreeBuilder when the end of a data segment occurs.

object_was_parsed(o, parent=None, most_recent_element=None)#

Method called by the TreeBuilder to integrate an object into the parse tree.

_linkage_fixer(el)#

Make sure linkage of this fragment is sound.

_popToTag(name, nsprefix=None, inclusivePop=True)#

Pops the tag stack up to and including the most recent instance of the given tag.

If there are no open tags with the given name, nothing will be popped.

Parameters:
  • name – Pop up to the most recent tag with this name.

  • nsprefix – The namespace prefix that goes with name.

  • inclusivePop – If this is false, pops the tag stack up to but not including the most recent instance of the given tag.

handle_starttag(name, namespace, nsprefix, attrs, sourceline=None, sourcepos=None, namespaces=None)#

Called by the tree builder when a new tag is encountered.

Parameters:
  • name – Name of the tag.

  • nsprefix – Namespace prefix for the tag.

  • attrs – A dictionary of attribute values.

  • sourceline – The line number where this tag was found in its source document.

  • sourcepos – The character position within sourceline where this tag was found.

  • namespaces – A dictionary of all namespace prefix mappings currently in scope in the document.

If this method returns None, the tag was rejected by an active SoupStrainer. You should proceed as if the tag had not occurred in the document. For instance, if this was a self-closing tag, don’t call handle_endtag.
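From the caller's side, the rejection described above is what makes parse_only work: tags the SoupStrainer does not match never enter the tree. A sketch assuming the "html.parser" builder, with invented markup:

```python
from bs4 import BeautifulSoup, SoupStrainer

html = (
    "<div><p>Intro <a href='/one'>one</a></p>"
    "<a href='/two'>two</a><span>skip</span></div>"
)

# Keep only <a> tags; everything else is rejected while parsing.
only_links = SoupStrainer("a")
soup = BeautifulSoup(html, "html.parser", parse_only=only_links)

kept = [tag.name for tag in soup.find_all(True)]
hrefs = [a["href"] for a in soup.find_all("a")]

print(kept, hrefs)
```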

handle_endtag(name, nsprefix=None)#

Called by the tree builder when an ending tag is encountered.

Parameters:
  • name – Name of the tag.

  • nsprefix – Namespace prefix for the tag.

handle_data(data)#

Called by the tree builder when a chunk of textual data is encountered.

decode(pretty_print=False, eventual_encoding=DEFAULT_OUTPUT_ENCODING, formatter='minimal', iterator=None)#
Returns a string or Unicode representation of the parse tree as an HTML or XML document.

Parameters:
  • pretty_print – If this is True, indentation will be used to make the document more readable.

  • eventual_encoding – The encoding of the final document. If this is None, the document will be a Unicode string.