bs4.dammit#

Beautiful Soup bonus library: Unicode, Dammit

This library converts a bytestream to Unicode through any means necessary. It is heavily based on code from Mark Pilgrim’s Universal Feed Parser. It works best on XML and HTML, but it does not rewrite the XML or HTML to reflect a new encoding; that’s the tree builder’s job.

Module Contents#

Classes#

EntitySubstitution

The ability to substitute XML or HTML entities for certain characters.

EncodingDetector

Suggests a number of possible encodings for a bytestring.

UnicodeDammit

A class for detecting the encoding of a *ML document and converting it to a Unicode string.

Functions#

Attributes#

bs4.dammit.__license__ = 'MIT'#
bs4.dammit.chardet_module#
bs4.dammit.chardet_dammit(s)#
bs4.dammit.xml_encoding = '^\\s*<\\?.*encoding=[\'"](.*?)[\'"].*\\?>'#
bs4.dammit.html_meta = '<\\s*meta[^>]+charset\\s*=\\s*["\']?([^>]*?)[ /;\'">]'#
bs4.dammit.encoding_res#
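The two pattern strings listed above can be tried out with the standard `re` module. This is a minimal sketch: the case-insensitive flag and the sample documents are assumptions made for illustration, not something stated on this page.

```python
import re

# Pattern strings copied from the attribute values listed above.
xml_encoding = '^\\s*<\\?.*encoding=[\'"](.*?)[\'"].*\\?>'
html_meta = '<\\s*meta[^>]+charset\\s*=\\s*["\']?([^>]*?)[ /;\'">]'

# Assumption: compile case-insensitively so "<META" or "ENCODING=" also match.
xml_re = re.compile(xml_encoding, re.I)
html_re = re.compile(html_meta, re.I)

# Hypothetical sample documents.
xml_doc = '<?xml version="1.0" encoding="ISO-8859-1"?><root/>'
html_doc = '<html><head><meta charset="utf-8"></head></html>'

# Group 1 of each pattern captures the declared encoding name.
xml_declared = xml_re.search(xml_doc).group(1)
html_declared = html_re.search(html_doc).group(1)

print(xml_declared)   # ISO-8859-1
print(html_declared)  # utf-8
```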
class bs4.dammit.EntitySubstitution#

Bases: object

The ability to substitute XML or HTML entities for certain characters.

CHARACTER_TO_XML_ENTITY#
BARE_AMPERSAND_OR_BRACKET#
AMPERSAND_OR_BRACKET#
_populate_class_variables()#

Initialize variables used by this class to manage the plethora of HTML5 named entities.

This function returns a 3-tuple containing two dictionaries and a regular expression:

unicode_to_name: A mapping of Unicode strings like “⦨” to entity names like “angmsdaa”. When a single Unicode string has multiple entity names, we try to choose the most commonly-used name.

name_to_unicode: A mapping of entity names like “angmsdaa” to Unicode strings like “⦨”.

named_entity_re: A regular expression matching (almost) any Unicode string that corresponds to an HTML5 named entity.
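To make the shape of the two dictionaries concrete, here is a rough sketch built on the standard library's `html.entities` tables. It is not the library's implementation; in particular, preferring the HTML 4 name when a character has several names is an assumption standing in for "the most commonly-used name".

```python
from html.entities import codepoint2name, html5

name_to_unicode = {}
unicode_to_name = {}

# html5 maps names such as "eacute;" (some also without ";") to characters.
for raw_name, characters in html5.items():
    name = raw_name.rstrip(";")
    name_to_unicode[name] = characters
    # Keep the first name seen for each character for now.
    unicode_to_name.setdefault(characters, name)

# Assumption: when an HTML 4 name exists, treat it as the preferred one.
for codepoint, name in codepoint2name.items():
    unicode_to_name[chr(codepoint)] = name

print(name_to_unicode["angmsdaa"])
print(unicode_to_name["\N{LATIN SMALL LETTER E WITH ACUTE}"])  # eacute
```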

classmethod _substitute_html_entity(matchobj)#

Used with a regular expression to substitute the appropriate HTML entity for a special character string.

classmethod _substitute_xml_entity(matchobj)#

Used with a regular expression to substitute the appropriate XML entity for a special character string.

classmethod quoted_attribute_value(value)#

Make a value into a quoted XML attribute, possibly escaping it.

Most strings will be quoted using double quotes.

Bob's Bar -> "Bob's Bar"

If a string contains double quotes, it will be quoted using single quotes.

Welcome to "my bar" -> 'Welcome to "my bar"'

If a string contains both single and double quotes, the double quotes will be escaped, and the string will be quoted using double quotes.

Welcome to "Bob's Bar" -> "Welcome to &quot;Bob's Bar&quot;"
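The three quoting cases can be checked directly. A short sketch, assuming the `bs4` package is installed; the sample strings are the ones used in the description above.

```python
from bs4.dammit import EntitySubstitution

# No double quotes in the value: wrap it in double quotes.
plain = EntitySubstitution.quoted_attribute_value("Bob's Bar")

# Double quotes only: wrap it in single quotes.
double_only = EntitySubstitution.quoted_attribute_value('Welcome to "my bar"')

# Both kinds of quote: escape the double quotes, then wrap in double quotes.
both = EntitySubstitution.quoted_attribute_value('Welcome to "Bob\'s Bar"')

print(plain)        # "Bob's Bar"
print(double_only)  # 'Welcome to "my bar"'
print(both)         # "Welcome to &quot;Bob's Bar&quot;"
```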

classmethod substitute_xml(value, make_quoted_attribute=False)#

Substitute XML entities for special XML characters.

Parameters:
  • value – A string to be substituted. The less-than sign will become &lt;, the greater-than sign will become &gt;, and any ampersands will become &amp;. If you want ampersands that appear to be part of an entity definition to be left alone, use substitute_xml_containing_entities() instead.

  • make_quoted_attribute – If True, then the string will be quoted, as befits an attribute value.

classmethod substitute_xml_containing_entities(value, make_quoted_attribute=False)#

Substitute XML entities for special XML characters, leaving ampersands that are already part of an entity definition alone.

Parameters:
  • value – A string to be substituted. The less-than sign will become &lt;, the greater-than sign will become &gt;, and any ampersands that are not part of an entity definition will become &amp;.

  • make_quoted_attribute – If True, then the string will be quoted, as befits an attribute value.
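The difference between `substitute_xml()` and `substitute_xml_containing_entities()` shows up when the input already contains an entity. A small sketch, assuming `bs4` is installed; the input string is made up for illustration.

```python
from bs4.dammit import EntitySubstitution

text = "AT&T &amp; <partners>"

# Every ampersand is escaped, including the one that starts "&amp;".
escaped_all = EntitySubstitution.substitute_xml(text)

# Only the bare ampersand in "AT&T" is escaped; "&amp;" is left alone.
escaped_bare = EntitySubstitution.substitute_xml_containing_entities(text)

# make_quoted_attribute=True also wraps the result in quotes.
attribute = EntitySubstitution.substitute_xml("a < b", make_quoted_attribute=True)

print(escaped_all)   # AT&amp;T &amp;amp; &lt;partners&gt;
print(escaped_bare)  # AT&amp;T &amp; &lt;partners&gt;
print(attribute)     # "a &lt; b"
```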

classmethod substitute_html(s)#

Replace certain Unicode characters with named HTML entities.

This differs from data.encode(encoding, 'xmlcharrefreplace') in that the goal is to make the result more readable (to those with ASCII displays) rather than to recover from errors. There’s absolutely nothing wrong with a UTF-8 string containing a LATIN SMALL LETTER E WITH ACUTE, but replacing that character with “&eacute;” will make it more readable to some people.

Parameters:

s – A Unicode string.
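The contrast with `xmlcharrefreplace` can be seen side by side. A brief sketch, assuming `bs4` is installed; the word "café" is just a sample input.

```python
from bs4.dammit import EntitySubstitution

word = "caf\N{LATIN SMALL LETTER E WITH ACUTE}"

# Named entity: easier to read on an ASCII-only display.
readable = EntitySubstitution.substitute_html(word)

# Numeric character reference produced by the codec error handler.
numeric = word.encode("ascii", "xmlcharrefreplace").decode("ascii")

print(readable)  # caf&eacute;
print(numeric)   # caf&#233;
```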

class bs4.dammit.EncodingDetector(markup, known_definite_encodings=None, is_html=False, exclude_encodings=None, user_encodings=None, override_encodings=None)#

Suggests a number of possible encodings for a bytestring.

Order of precedence:

1. Encodings you specifically tell EncodingDetector to try first (the known_definite_encodings argument to the constructor).

2. An encoding determined by sniffing the document’s byte-order mark.

3. Encodings you specifically tell EncodingDetector to try if byte-order mark sniffing fails (the user_encodings argument to the constructor).

4. An encoding declared within the bytestring itself, either in an XML declaration (if the bytestring is to be interpreted as an XML document), or in a <meta> tag (if the bytestring is to be interpreted as an HTML document).

5. An encoding detected through textual analysis by chardet, cchardet, or a similar external library.

6. UTF-8.

7. Windows-1252.

property encodings#

Yield a number of encodings that might work for this markup.

Yield:

A sequence of strings.
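The order of precedence can be observed by listing the candidates for a small document. This sketch assumes `bs4` is installed. The sample markup and encoding names are arbitrary, and only the first four candidates are shown because the rest depend on whether a detection library such as chardet is available.

```python
from bs4.dammit import EncodingDetector

# A UTF-8 byte-order mark followed by an XML declaration naming another encoding.
markup = b'\xef\xbb\xbf<?xml version="1.0" encoding="iso-8859-1"?><a/>'

detector = EncodingDetector(
    markup,
    known_definite_encodings=["ascii"],  # tried first
    user_encodings=["koi8-r"],           # tried after byte-order mark sniffing
)

candidates = list(detector.encodings)
print(candidates[:4])  # ['ascii', 'utf-8', 'koi8-r', 'iso-8859-1']
```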

_usable(encoding, tried)#

Should we even bother to try this encoding?

Parameters:
  • encoding – Name of an encoding.

  • tried – Encodings that have already been tried. This will be modified as a side effect.

classmethod strip_byte_order_mark(data)#

If a byte-order mark is present, strip it and return the encoding it implies.

Parameters:

data – Some markup.

Returns:

A 2-tuple (modified data, implied encoding)
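A quick illustration of the return value, assuming `bs4` is installed; the byte strings are made-up samples.

```python
from bs4.dammit import EncodingDetector

# UTF-8 byte-order mark: stripped, and the encoding is reported.
utf8_data, utf8_encoding = EncodingDetector.strip_byte_order_mark(
    b"\xef\xbb\xbf<p>hi</p>"
)

# UTF-16 little-endian byte-order mark.
utf16_data, utf16_encoding = EncodingDetector.strip_byte_order_mark(
    b"\xff\xfeh\x00i\x00"
)

# No byte-order mark: the data comes back unchanged with no implied encoding.
plain_data, plain_encoding = EncodingDetector.strip_byte_order_mark(b"<p>hi</p>")

print(utf8_data, utf8_encoding)      # b'<p>hi</p>' utf-8
print(utf16_data.decode(utf16_encoding))  # hi
print(plain_data, plain_encoding)    # b'<p>hi</p>' None
```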

classmethod find_declared_encoding(markup, is_html=False, search_entire_document=False)#

Given a document, tries to find its declared encoding.

An XML encoding is declared at the beginning of the document.

An HTML encoding is declared in a <meta> tag, hopefully near the beginning of the document.

Parameters:
  • markup – Some markup.

  • is_html – If True, this markup is considered to be HTML. Otherwise it’s assumed to be XML.

  • search_entire_document – Since an encoding is supposed to be declared near the beginning of the document, most of the time it’s only necessary to search a few kilobytes of data. Set this to True to force this method to search the entire document.
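The effect of `is_html` and `search_entire_document` can be sketched as follows, assuming `bs4` is installed. The documents are invented for the example, including the one padded with a long comment.

```python
from bs4.dammit import EncodingDetector

xml_doc = b'<?xml version="1.0" encoding="ISO-8859-1"?><root/>'
html_doc = b'<html><head><meta charset="Shift_JIS"></head></html>'

xml_declared = EncodingDetector.find_declared_encoding(xml_doc)
html_declared = EncodingDetector.find_declared_encoding(html_doc, is_html=True)

# Without is_html=True the <meta> tag is not consulted.
html_as_xml = EncodingDetector.find_declared_encoding(html_doc)

# A declaration pushed far from the start by a long comment is still
# found when the whole document is searched.
late_doc = b"<!--" + b"x" * 10000 + b"-->" + html_doc
late_declared = EncodingDetector.find_declared_encoding(
    late_doc, is_html=True, search_entire_document=True
)

print(xml_declared)   # iso-8859-1
print(html_declared)  # shift_jis
print(html_as_xml)    # None
print(late_declared)  # shift_jis
```

The names come back lowercased, so compare them case-insensitively against whatever the document actually wrote.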

class bs4.dammit.UnicodeDammit(markup, known_definite_encodings=[], smart_quotes_to=None, is_html=False, exclude_encodings=[], user_encodings=None, override_encodings=None)#

A class for detecting the encoding of a *ML document and converting it to a Unicode string. If the source encoding is windows-1252, can replace MS smart quotes with their HTML or XML equivalents.
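A typical use looks like the sketch below, which assumes `bs4` is installed. The `unicode_markup` and `original_encoding` attributes are not listed on this page. They are taken from the Beautiful Soup documentation, and the sample byte strings are illustrative.

```python
from bs4.dammit import UnicodeDammit

# Tell UnicodeDammit which encoding to try first.
dammit = UnicodeDammit(b"Sacr\xe9 bleu!", known_definite_encodings=["latin-1"])
print(dammit.unicode_markup)     # Sacré bleu!
print(dammit.original_encoding)  # latin-1

# Windows-1252 smart quotes can be rewritten while decoding.
markup = b"<p>I just \x93love\x94 Microsoft Word\x92s smart quotes</p>"

as_html = UnicodeDammit(
    markup, known_definite_encodings=["windows-1252"], smart_quotes_to="html"
).unicode_markup
as_ascii = UnicodeDammit(
    markup, known_definite_encodings=["windows-1252"], smart_quotes_to="ascii"
).unicode_markup

print(as_html)   # <p>I just &ldquo;love&rdquo; Microsoft Word&rsquo;s smart quotes</p>
print(as_ascii)  # <p>I just "love" Microsoft Word's smart quotes</p>
```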

property declared_html_encoding#

If the markup is an HTML document, returns the encoding declared _within_ the document.
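For example, with a `<meta>` tag that names the encoding (a sketch assuming `bs4` is installed; the markup is invented for illustration):

```python
from bs4.dammit import UnicodeDammit

markup = (
    b'<html><head><meta charset="iso-8859-1"></head>'
    b"<body>caf\xe9</body></html>"
)

# Treated as HTML: the <meta> declaration is reported.
as_html = UnicodeDammit(markup, is_html=True)
print(as_html.declared_html_encoding)  # iso-8859-1

# Not treated as HTML: there is no declared HTML encoding to report.
not_html = UnicodeDammit(markup)
print(not_html.declared_html_encoding)  # None
```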

CHARSET_ALIASES#
ENCODINGS_WITH_SMART_QUOTES = ['windows-1252', 'iso-8859-1', 'iso-8859-2']#
MS_CHARS#
MS_CHARS_TO_ASCII#
WINDOWS_1252_TO_UTF8#
MULTIBYTE_MARKERS_AND_SIZES = [(194, 223, 2), (224, 239, 3), (240, 244, 4)]#
FIRST_MULTIBYTE_MARKER#
LAST_MULTIBYTE_MARKER#
_sub_ms_char(match)#

Changes a MS smart quote character to an XML or HTML entity, or an ASCII character.

_convert_from(proposed, errors='strict')#

Attempt to convert the markup to the proposed encoding.

Parameters:

proposed – The name of a character encoding.

_to_unicode(data, encoding, errors='strict')#

Given a string and its encoding, decodes the string into Unicode.

Parameters:

encoding – The name of an encoding.

find_codec(charset)#

Convert the name of a character set to a codec name.

Parameters:

charset – The name of a character set.

Returns:

The name of a codec.

_codec(charset)#
classmethod detwingle(in_bytes, main_encoding='utf8', embedded_encoding='windows-1252')#

Fix characters from one encoding embedded in some other encoding.

Currently the only situation supported is Windows-1252 (or its subset ISO-8859-1), embedded in UTF-8.

Parameters:
  • in_bytes – A bytestring that you suspect contains characters from multiple encodings. Note that this _must_ be a bytestring. If you’ve already converted the document to Unicode, you’re too late.

  • main_encoding – The primary encoding of in_bytes.

  • embedded_encoding – The encoding that was used to embed characters in the main document.

Returns:

A bytestring in which embedded_encoding characters have been converted to their main_encoding equivalents.
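The following sketch, adapted from the example in the Beautiful Soup documentation and assuming `bs4` is installed, builds a bytestring that mixes UTF-8 with Windows-1252 and then repairs it.

```python
from bs4.dammit import UnicodeDammit

snowmen = "\N{SNOWMAN}" * 3
quote = (
    "\N{LEFT DOUBLE QUOTATION MARK}I like snowmen!"
    "\N{RIGHT DOUBLE QUOTATION MARK}"
)

# Mixed document: UTF-8 snowmen followed by Windows-1252 smart quotes.
doc = snowmen.encode("utf8") + quote.encode("windows_1252")

# detwingle() works on bytes and returns bytes that are pure UTF-8.
fixed = UnicodeDammit.detwingle(doc)
text = fixed.decode("utf8")
print(text)  # ☃☃☃“I like snowmen!”
```

Decoding `doc` directly as UTF-8 would fail on the smart quotes, so call `detwingle()` before any decoding takes place.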