Convert io.BytesIO to io.StringIO to parse HTML page

2

43

I'm trying to parse an HTML page I retrieved through pyCurl, but the pyCurl WRITEFUNCTION returns the page as bytes rather than a string, so I'm unable to parse it with BeautifulSoup.

Is there any way to convert io.BytesIO to io.StringIO?

Or is there any other way to parse the HTML page?

I'm using Python 3.3.2.

Silverman answered 4/7, 2014 at 4:18 Comment(1)
Does the naive approach of exhausting the BytesIO and then constructing a StringIO from the output not satisfy your constraints? - Compliment
30

A naive approach:

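import io

# Assume bytes_io is a `BytesIO` object (placeholder content for illustration)
bytes_io = io.BytesIO(b'<html><body>Hello</body></html>')

# getvalue() returns the whole buffer regardless of the stream position;
# read() only returns what follows the current position, so it needs seek(0) first
byte_str = bytes_io.getvalue()

# Decode the bytes to a `str`
text_obj = byte_str.decode('UTF-8')  # Or use the encoding you expect

# Use text_obj how you see fit!
# io.StringIO(text_obj) will get you to a StringIO object if that's what you need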
Compliment answered 4/7, 2014 at 4:35 Comment(6)
Thanks, it worked. But instead of bytes_io.read() I used bytes_io.getvalue(), as the former didn't work. - Silverman
Ah yeah, I assumed your BytesIO was at the beginning of the stream. I believe getvalue should work regardless of where you are :) - Compliment
Normally you would have to call bytes_io.seek(0) before the read() call. As @AnthonySottile mentions, getvalue gets around this. - Canthus
This seems very inefficient: the whole file has to be loaded into memory before it can be decoded. That works well for small files, but not for large ones. - Callisthenics
Both of the current answers have that inefficiency. I could probably update this with an incremental decoder answer, but at this point it's not really worth my effort. - Compliment
@AnthonySottile This was my first programming/Python project ever, so it was sufficient. Your answer really helped and encouraged me to continue coding. Thank you. - Silverman
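
For readers curious about the incremental decoder mentioned in the comments, here is a minimal sketch of what such an approach might look like, using the standard library's `codecs.getincrementaldecoder`. The helper name `decode_in_chunks`, the sample text, and the chunk sizes are assumptions made for illustration; they are not from the original answer.

```python
import codecs
import io


def decode_in_chunks(byte_stream, encoding='utf-8', chunk_size=4096):
    """Yield decoded text from a binary stream, one chunk at a time."""
    # Hypothetical helper: the decoder buffers incomplete multi-byte
    # sequences, so a character split across two chunks is still decoded.
    decoder = codecs.getincrementaldecoder(encoding)()
    while True:
        chunk = byte_stream.read(chunk_size)
        if not chunk:
            break
        text = decoder.decode(chunk)
        if text:
            yield text
    # Flush the decoder; this raises if the stream ended mid-character
    tail = decoder.decode(b'', final=True)
    if tail:
        yield tail


# Sample data; a tiny chunk size forces splits inside multi-byte characters
bytes_io = io.BytesIO('ÁÇÊ and more text'.encode('utf-8'))
pieces = list(decode_in_chunks(bytes_io, chunk_size=3))
print(''.join(pieces))
```

Only one chunk of bytes is held in memory at a time, so this should scale to large inputs, provided the consumer also processes the text piece by piece rather than joining it as the demo does.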
83

The code in the accepted answer reads the entire stream into memory before decoding it. A better approach is to wrap the byte stream in a text stream, so that the data can be decoded and read chunk by chunk.

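import io

# Initialize a read buffer (named byte_stream to avoid shadowing the built-in input)
byte_stream = io.BytesIO(
    b'Initial value for read buffer with unicode characters ' +
    'ÁÇÊ'.encode('utf-8')
)
# Wrap the binary stream so that reads return decoded text
wrapper = io.TextIOWrapper(byte_stream, encoding='utf-8')

# Read from the buffer
print(wrapper.read())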
Intradermal answered 10/7, 2018 at 3:33 Comment(2)
Could you please add an example of reading chunk by chunk? - Lou
@AlexeiMarinichenko You can read up on the methods of TextIOWrapper in the docs. Try wrapper.read(5) or wrapper.readline(). - Intradermal
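
As a rough illustration of the chunk-by-chunk reading asked about in the comments, the sketch below uses `read(n)` and `readline()` on a `TextIOWrapper`. The sample text and the chunk size of 4 are assumptions chosen for the demo.

```python
import io

# Sample data (an assumption for this sketch)
byte_stream = io.BytesIO('ÁÇÊ first line\nsecond line\n'.encode('utf-8'))
wrapper = io.TextIOWrapper(byte_stream, encoding='utf-8')

first_chars = wrapper.read(3)      # read(n) counts characters, not bytes
rest_of_line = wrapper.readline()  # reads up to and including the newline

remaining = []
while True:
    chunk = wrapper.read(4)        # at most 4 characters per call
    if not chunk:                  # an empty string signals end of stream
        break
    remaining.append(chunk)

print(first_chars)
print(repr(rest_of_line))
print(remaining)
```

Because `read(n)` works in characters, the three two-byte characters at the start come back whole from `read(3)` even though they occupy six bytes in the underlying buffer.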
