Convert io.BytesIO to io.StringIO to parse HTML page

Asked 4/7, 2014 at 4:18 Answered 10/7, 2018 at 3:33

Solved html beautifulsoup pycurl stringio type-conversion

I'm trying to parse an HTML page I retrieved through pyCurl, but the pyCurl WRITEFUNCTION returns the page as bytes rather than a string, so I'm unable to parse it using BeautifulSoup.

Is there any way to convert io.BytesIO to io.StringIO?

Or is there any other way to parse the HTML page?

I'm using Python 3.3.2.

Silverman answered 4/7, 2014 at 4:18 Comment(1)

does the naive approach of exhausting the BytesIO and then constructing a StringIO from the output not satisfy your constraints? – Compliment 4/7, 2014 at 4:32

A naive approach:

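# assume bytes_io is a `BytesIO` object
# read() starts at the current stream position; if the buffer was just
# written to, call bytes_io.seek(0) first or use bytes_io.getvalue() instead
byte_str = bytes_io.read()

# Decode the bytes to a str (Python 3 text)
text_obj = byte_str.decode('UTF-8')  # Or use the encoding you expect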

# Use text_obj how you see fit!
# io.StringIO(text_obj) will get you to a StringIO object if that's what you need
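The role of the stream position can be seen in a minimal sketch like the one below. It assumes the buffer was filled by writing to it, roughly as a download callback would; the sample bytes and variable names are made up for illustration.

```python
import io

# Hypothetical stand-in for a buffer that a download callback wrote into.
bytes_io = io.BytesIO()
bytes_io.write('<p>café</p>'.encode('utf-8'))

# After writing, the position sits at the end, so read() returns nothing.
at_end = bytes_io.read()

# getvalue() returns the whole buffer regardless of the position.
whole = bytes_io.getvalue()

# Rewinding first makes read() return the same bytes.
bytes_io.seek(0)
rewound = bytes_io.read()

# Decode once the full byte string is in hand.
text_obj = whole.decode('utf-8')
print(repr(at_end), text_obj)
```

Either `getvalue()` or `seek(0)` followed by `read()` yields the full contents; `getvalue()` has the advantage of not depending on, or changing, the position.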

Compliment answered 4/7, 2014 at 4:35 Comment(6)

Thanks, it did work. But instead of bytes_io.read() I used bytes_io.getvalue() as the former didn't work. – Silverman 8/7, 2014 at 3:59

ah yeah I assumed your BytesIO was at the beginning of the stream. getvalue I believe should work regardless where you are :) – Compliment 8/7, 2014 at 4:25

Normally you would have to call bytes_io.seek(0) before the read() call. As @AnthonySottile mentions, getvalue gets around this. – Canthus 12/12, 2017 at 13:26

This seems very inefficient: the whole file has to be loaded into memory in order to decode it. That works well for small files, but not for large ones. – Callisthenics 2/12, 2020 at 8:35

both of the current answers have that inefficiency -- I could probably update this with an incremental decoder answer but at this point it's not really worth my efforts – Compliment 2/12, 2020 at 16:30

@AnthonySottile This was my first Programming/Python project ever. So, it was sufficient. Your answer really helped and encouraged me to continue coding. Thank you. – Silverman 2/8, 2022 at 11:49

The code in the accepted answer reads the entire stream into memory before decoding. Below is a better way: wrap the byte stream in a text stream, so the data can be read and decoded chunk by chunk.

import io

# Initialize a read buffer
byte_stream = io.BytesIO(
    b'Initial value for read buffer with unicode characters ' +
    'ÁÇÊ'.encode('utf-8')
)
# Wrap the byte stream so that reads return decoded text
wrapper = io.TextIOWrapper(byte_stream, encoding='utf-8')

# Read from the buffer
print(wrapper.read())
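Reading chunk by chunk might look like the sketch below. The sample text and variable names are assumptions chosen for illustration; `read(n)`, `readline()` and line iteration are standard `TextIOWrapper` methods.

```python
import io

# Sample bytes containing multi-byte UTF-8 characters and two lines.
raw = io.BytesIO('ÁÇÊ line one\nline two\n'.encode('utf-8'))
wrapper = io.TextIOWrapper(raw, encoding='utf-8')

# read(n) counts characters, not bytes.
first_chars = wrapper.read(3)

# readline() returns the rest of the current line, including the newline.
rest_of_line = wrapper.readline()

# Iterating over the wrapper yields the remaining lines one at a time.
remaining_lines = list(wrapper)

print(first_chars, repr(rest_of_line), remaining_lines)
```

Because the wrapper decodes as it goes, only the part of the stream that has been requested needs to be decoded at any point.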

Intradermal answered 10/7, 2018 at 3:33 Comment(2)

Could you please add an example of reading chunk by chunk? – Lou 30/6, 2020 at 11:28

@AlexeiMarinichenko you can read up on the docs about the methods of TextIOWrapper. Try wrapper.read(5), wrapper.readline(). – Intradermal 1/7, 2020 at 14:17
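For the incremental decoder idea mentioned in the comments on the first answer, a rough sketch using the standard `codecs` module is shown below. It assumes the bytes arrive in arbitrary chunks, as they might from a download callback; the chunk boundaries here are made up and deliberately split a character in the middle.

```python
import codecs

# An incremental decoder keeps incomplete byte sequences between calls.
decoder = codecs.getincrementaldecoder('utf-8')()

data = 'ÁÇÊ'.encode('utf-8')              # 6 bytes, 2 per character
chunks = [data[:1], data[1:3], data[3:]]  # boundaries fall mid-character

# Decode each chunk as it arrives; partial characters are held back.
pieces = [decoder.decode(chunk) for chunk in chunks]

# Signal the end of input so that any leftover bytes raise an error.
pieces.append(decoder.decode(b'', final=True))

text = ''.join(pieces)
print(pieces, text)
```

Calling `bytes.decode` on each chunk separately would fail on the split characters, which is the case the incremental decoder is meant to handle.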