Environment
import bs4
bs4.__version__
---
'4.10.0'
import sys
print(sys.version)
---
3.8.10 (default, Nov 26 2021, 20:14:08)
[GCC 9.3.0]
BS4/XML Parser on XML with namespace definition
from bs4 import BeautifulSoup
xbrl_with_namespace = """
<?xml version="1.0" encoding="UTF-8"?>
<xbrl
xmlns:dei="http://xbrl.sec.gov/dei/2020-01-31"
>
<dei:EntityRegistrantName>
Hoge, Inc.
</dei:EntityRegistrantName>
</xbrl>
"""
soup = BeautifulSoup(xbrl_with_namespace, 'xml')
registrant = soup.find("dei:EntityRegistrantName")
print(registrant.prettify())
---
<dei:EntityRegistrantName>
Hoge, Inc.
</dei:EntityRegistrantName>
BS4/XML Parser on XML without namespace definition
xbrl_without_namespace = """
<?xml version="1.0" encoding="UTF-8"?>
<xbrl>
<dei:EntityRegistrantName>
Hoge, Inc.
</dei:EntityRegistrantName>
</xbrl>
"""
soup = BeautifulSoup(xbrl_without_namespace, 'xml')
registrant = soup.find("dei:EntityRegistrantName")
print(registrant)
---
None
BS4/HTML Parser on XML without namespace definition
The BS4/HTML parser treats <namespace>:<tag> as a single tag name,
and it also converts the name to lowercase.
soup = BeautifulSoup(xbrl_without_namespace, 'html.parser')
registrant = soup.find("dei:EntityRegistrantName".lower())
print(registrant)
---
<dei:entityregistrantname>
Hoge, Inc.
</dei:entityregistrantname>
A search with the original capitalization finds nothing, because the tag names have been converted to lowercase.
registrant = soup.find("dei:EntityRegistrantName")
print(registrant)
---
None
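To avoid remembering the lowercase conversion at every call site, the lookup can be wrapped in a small helper. This is only a sketch: `find_prefixed` and `xml_fragment` are hypothetical names introduced here, not part of Beautiful Soup.

```python
from bs4 import BeautifulSoup

xml_fragment = """
<xbrl>
<dei:EntityRegistrantName>
Hoge, Inc.
</dei:EntityRegistrantName>
</xbrl>
"""


def find_prefixed(soup, qualified_name):
    """Hypothetical helper: search an html.parser tree by a prefixed tag name.

    html.parser lowercases tag names, so the query is lowercased to match.
    """
    return soup.find(qualified_name.lower())


soup = BeautifulSoup(xml_fragment, "html.parser")
registrant = find_prefixed(soup, "dei:EntityRegistrantName")
print(registrant.get_text(strip=True))
```

With this approach the original capitalization of the tag name is lost in the parsed tree, so it suits reading values better than round-tripping the document.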
Conclusion
- Provide the namespace definitions when searching for prefixed tags with the XML parser, OR
- Use the HTML parser and search with all-lowercase tag names.