I want to fetch HTML content from a URL and parse it with regular expressions. The HTML contains some multibyte characters, and I get the error described in the title.
Could somebody tell me how to resolve this problem?
You need to edit your question to show (1) the code that you used, (2) the full error and traceback, (3) the url that is involved, and (4) the unicode character that you are trying to encode as gbk.
You seem to have somehow obtained unicode characters from the raw bytes in the html content -- how? What encoding is specified in the html content?
Then (I guess) you are trying to write the unicode characters to a file, encoding the unicode as gbk. During this process, you got an error something like this:
>>> u'\uffff'.encode('gbk')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'gbk' codec can't encode character u'\uffff' in position 0: illegal multibyte sequence
>>>
If the raw bytes in the html content were not encoded in gbk, then it is quite possible that you have some unicode characters that can't be represented in gbk. In that case you may like to encode your results using the original encoding, or encode them in gb18030 which can take any unicode character.
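As a minimal sketch of that difference (the sample string is my own, not taken from the page in question), gbk rejects a character outside its repertoire while gb18030 and utf-8 round-trip it:

```python
# Sample text is an assumption for illustration: ASCII, a Chinese
# character that gbk can encode, and U+2764, which gbk cannot.
text = "ok \u4e2d \u2764"

try:
    text.encode("gbk")
    gbk_ok = True
except UnicodeEncodeError as exc:
    gbk_ok = False
    print("gbk failed:", exc.reason)

# gb18030 and utf-8 cover all of Unicode, so both round-trip the text.
roundtrip_gb18030 = text.encode("gb18030").decode("gb18030")
roundtrip_utf8 = text.encode("utf-8").decode("utf-8")
print(gbk_ok, roundtrip_gb18030 == text, roundtrip_utf8 == text)
```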
Another possibility is that you have mangled the raw bytes or the unicode somehow. I certainly hope that your regex machinations have been done on the unicode and not on some variable-length-character encoding like gb2312, gbk, etc.
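To illustrate why matching on encoded bytes is risky, here is a small sketch (the byte pair is picked by me for illustration). In gbk the second byte of a two-byte character can fall in the ASCII range, so a byte-level pattern can match in the middle of a character:

```python
import re

# b"\x81A" is a single two-byte gbk character whose trail byte
# happens to be the ASCII code for "A" (0x41).
raw = b"\x81A"
decoded = raw.decode("gbk")

match_on_bytes = re.search(b"A", raw)    # false hit inside the character
match_on_text = re.search("A", decoded)  # the decoded text contains no "A"

print(len(decoded), match_on_bytes is not None, match_on_text is not None)
```

Decoding first and running the regex on str avoids this class of false match.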
Update:
Here is your code snippet:
import sys, urllib.request
url = "http://www.meilishuo.com"
wp = urllib.request.urlopen(url)
content = wp.read()
str_content = content.decode('utf-8')
fp = open("web.txt","w")
fp.write(str_content)
fp.close()
From that I've had to deduce:
(1) You are running Python 3.x
(2) The default encoding that open() uses for text files on your system (locale.getpreferredencoding(), not sys.getdefaultencoding()) is "gbk" -- otherwise you wouldn't have got the error message, part of which you reported earlier.
As my default encoding is NOT 'gbk', I replaced your last 3 lines with
gbk_content = str_content.encode('gbk')
and ran the amended snippet with Python 3.1.2.
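To see which encoding open() would pick when none is given, something like this can be run on the affected machine (a sketch only; the first value depends on the system and on whether Python's UTF-8 mode is enabled):

```python
import locale
import sys

# Encoding that open() falls back to for text files when encoding= is omitted.
file_default = locale.getpreferredencoding(False)

# Default encoding of str.encode() / bytes.decode(); "utf-8" on Python 3.
str_default = sys.getdefaultencoding()

print("open() default:", file_default)
print("sys.getdefaultencoding():", str_default)
```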
Observations:
(1) website has charset=utf-8, decodes OK with utf-8
(2) Error message: UnicodeEncodeError: 'gbk' codec can't encode character '\u2764' in position 35070: illegal multibyte sequence
'\u2764' is a dingbat (HEAVY BLACK HEART). The website is dynamic; in another attempt, the first offending character was \xa9 (COPYRIGHT SIGN).
So the web page contains Unicode characters which are not mapped in gbk. Options are
(1) encode with 'gbk' but use the 'replace' option
(2) encode with 'gbk' but use the 'ignore' option
(3) encode with an encoding that supports ALL Unicode characters (utf-8, gb18030) and for which you have a display mechanism that renders all those characters that aren't in gbk
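The three options can be sketched as follows (the sample string is my own; 'replace' and 'ignore' are the standard error-handler names accepted by str.encode):

```python
# Sample text with the two characters mentioned above that gbk cannot encode.
text = "I \u2764 Python \xa9"

# (1) 'replace': each unencodable character becomes '?'
replaced = text.encode("gbk", errors="replace")

# (2) 'ignore': unencodable characters are silently dropped
ignored = text.encode("gbk", errors="ignore")

# (3) an encoding that covers all of Unicode keeps everything
full = text.encode("gb18030")

print(replaced)
print(ignored)
print(full.decode("gb18030") == text)
```

When writing to a file, the same handlers can be passed to open(), for example open("web.txt", "w", encoding="gbk", errors="replace").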
Try
open(file, 'r', encoding='utf-8')
instead of
open(file, 'r')
Combining the above answers, I found the following code works very well.
import requests
r = requests.get("https://www.example.com/").content
str_content = r.decode('utf-8')
fp = open("contents.txt","w", encoding='utf-8')
fp.write(str_content)
fp.close()
My code works fine. It's just a simple encoding issue.
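import requests

URL = "https://www.example.com/"  # placeholder; use your own URL
headers = {"User-Agent": "Mozilla/5.0"}  # placeholder request headers

response = requests.get(url=URL, headers=headers)
response.raise_for_status()
# print(response.text)
response.encoding = 'utf-8'  # decode the body as UTF-8
# Pass encoding explicitly so the locale default (e.g. gbk) is not used
with open('myPage.html', 'w', encoding='utf-8') as fs:
    fs.write(response.text)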
You can open myPage.html with your browser.