Reading really big blobs without downloading them in Google Cloud (streaming?)

Please help!

[+] What I have: A lot of blobs in every bucket. Blobs can vary in size from less than a kilobyte to many gigabytes.

[+] What I'm trying to do: I need to be able to either stream the data in those blobs (e.g., through a buffer of 1024 bytes or so) or read them in chunks of a certain size in Python. The point is, I don't think I can just do a bucket.get_blob(), because if the blob were a terabyte I wouldn't be able to hold it in physical memory.

[+] What I'm really trying to do: parse the information inside the blobs to identify keywords.

[+] What I've read: A lot of documentation on how to write to Google Cloud in chunks and then use compose to stitch them together (not helpful at all)

A lot of documentation on Java's prefetch functions (this needs to be in Python)

The Google Cloud APIs

If anyone could point me in the right direction I would be really grateful! Thanks

Vortex answered 16/5, 2018 at 21:34 Comment(1)
I am trying to figure this out myself. If you have figured this out, can you share your solution to save me some time? – Lomax

A way I have found of doing this is to create a file-like object in Python and then use the Google Cloud API call .download_to_file() with that file-like object.

This in essence streams the data. The Python code looks something like this:

import os

def getStream(blob):
    # O_NONBLOCK must go through os.open; the builtin open()'s third argument is a buffer size, not a flag mask
    stream = os.fdopen(os.open('myStream', os.O_WRONLY | os.O_CREAT | os.O_NONBLOCK), 'wb')
    blob.download_to_file(stream)

The os.O_NONBLOCK flag is there so I can read the file while it is still being written. I still haven't tested this with really big files, so if anyone knows a better implementation or sees a potential failure with this, please comment. Thanks!
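
For comparison, here is a minimal sketch of reading a blob in fixed-size byte ranges instead of staging it through a local file. It assumes the start/end parameters of Blob.download_as_bytes(), which are available in recent google-cloud-storage releases; the bucket name, blob name, chunk size, and helper name are only illustrative:

from google.cloud import storage

def read_in_chunks(bucket_name, blob_name, chunk_size=1024 * 1024):
    """Yield the blob's contents one byte range at a time, never all in memory."""
    client = storage.Client()
    blob = client.bucket(bucket_name).get_blob(blob_name)  # fetches metadata only, not the data
    start = 0
    while start < blob.size:
        end = min(start + chunk_size, blob.size) - 1  # 'end' is inclusive
        yield blob.download_as_bytes(start=start, end=end)
        start = end + 1

# Example: scan each chunk for a keyword without holding the whole blob
for chunk in read_in_chunks('my-bucket', 'huge-log.txt'):
    if b'ERROR' in chunk:
        print('keyword found')
        break

Note that a keyword straddling a chunk boundary would be missed by this naive scan, so some overlap between ranges may be needed in practice. Newer releases of the library also expose blob.open('rb'), which returns a file-like reader with a read(size) method, if a true streaming interface is preferred (availability depends on the installed version).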

Vortex answered 17/5, 2018 at 19:52 Comment(1)
I think you found the right method. Check the code here for further insight. – Parashah