It is simple to get a `StorageStreamDownloader` using the `azure.storage.blob` package:
```python
from azure.storage.blob import BlobServiceClient

blob_service_client = BlobServiceClient.from_connection_string("my azure connection string")
container_client = blob_service_client.get_container_client("my azure container name")
blob_client = container_client.get_blob_client("my azure file name")
storage_stream_downloader = blob_client.download_blob()
```
and it is simple to process a file-like object (or, more precisely, I think, any iterator that returns strings) with the `csv` package:
```python
import csv
from io import StringIO

csv_string = """col1, col2
a,b
c,d"""

with StringIO(csv_string) as csv_file:
    for row in csv.reader(csv_file):
        print(row)  # or rather whatever I actually want to do on a row-by-row basis, e.g. ascertain that the file contains a row that meets a certain condition
```
What I'm struggling with is getting the streaming data from my `StorageStreamDownloader` into `csv.reader()` in such a way that I can process each line as it arrives, rather than waiting for the whole file to download.
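The closest I have come is hand-rolling a generator over the downloader's `chunks()` method, which as far as I can tell yields `bytes`, and re-assembling lines myself (a rough sketch, assuming the blob is UTF-8 encoded and that I'm reading `chunks()` right):

```python
import csv

def lines_from_blob(downloader, encoding="utf-8"):
    """Yield complete decoded lines from a StorageStreamDownloader,
    holding back any partial trailing line until the next chunk arrives."""
    buffer = b""
    for chunk in downloader.chunks():  # assuming chunks() yields bytes
        buffer += chunk
        # Split off every complete line; keep the partial tail in the buffer
        *complete_lines, buffer = buffer.split(b"\n")
        for line in complete_lines:
            yield line.decode(encoding)
    if buffer:  # a final line with no trailing newline
        yield buffer.decode(encoding)

# csv.reader accepts any iterator of strings, so rows become available
# as chunks arrive rather than after the whole blob has downloaded
for row in csv.reader(lines_from_blob(storage_stream_downloader)):
    print(row)
```

This seems to work on small test blobs, but manually buffering bytes to rebuild lines (and worrying about quoted fields that contain newlines) feels like reinventing something the standard library must already offer.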
The Microsoft docs strike me as a little underwritten by their standards (the `chunks()` method has no annotation?), but I see there is a `readinto()` method for reading into a stream. I have tried reading into a `BytesIO` stream but cannot work out how to get the data out into `csv.reader()` without just writing the buffer out to a new file and reading that file back. This all strikes me as something that should be doable, but I'm probably missing something obvious conceptually, perhaps to do with `itertools` or `asyncio`, or perhaps I'm just using the wrong csv tool for my needs?
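For reference, here is the `readinto()` attempt (my own sketch; the file name is arbitrary). It runs, but the round trip through a file means nothing is processed until the entire blob has downloaded:

```python
import csv
from io import BytesIO

buffer = BytesIO()
storage_stream_downloader.readinto(buffer)  # blocks until the whole blob has downloaded

# The only route I've found from the buffer to csv.reader:
# write it out to a file and read that file back, which is not streaming
with open("downloaded.csv", "wb") as f:
    f.write(buffer.getvalue())

with open("downloaded.csv", newline="") as f:
    for row in csv.reader(f):
        print(row)
```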