file name vs file object as a function argument


2

15

If a function takes the name of a text file as input, I can refactor it to take a file object instead (I call it a "stream"; is there a better word?). The advantages are obvious. A function that takes a stream as an argument is:

much easier to write a unit test for, since I don't need to create a temporary file just for the test
more flexible, since I can use it in situations where I somehow already have the contents of the file in a variable
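To illustrate both points, here is a minimal sketch. The function count_nonblank_lines is a hypothetical example, not something from the question; it takes an open text stream, so a test can hand it an in-memory io.StringIO instead of a temporary file.

```python
import io

def count_nonblank_lines(stream):
    """Count the lines of an open text stream that contain non-whitespace."""
    return sum(1 for line in stream if line.strip())

# io.StringIO stands in for a real file, so nothing touches the disk.
fake_file = io.StringIO("alpha\n\nbeta\n   \ngamma\n")
print(count_nonblank_lines(fake_file))  # 3
```

The same call works when the text is already in a variable: wrap it in io.StringIO and pass it along.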

Are there any disadvantages to streams? Or should I always refactor a function from a file name argument to a stream argument (assuming, of course, the file is text-only)?

Grab answered 25/9, 2012 at 5:34 Comment(0)


4

There are numerous functions in the Python standard library that accept either a string holding a file name or an open file object (I assume that's what you mean by "stream"). It's really not hard to write a decorator that makes your own functions accept either one.
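One way such a decorator might look is sketched below. The names accepts_path_or_file and first_line are made up for illustration, and the sketch assumes the wrapped function takes the stream as its first argument and that paths refer to UTF-8 text files.

```python
import functools
import io

def accepts_path_or_file(func):
    """Let func(stream, ...) also be called with a file name as first argument."""
    @functools.wraps(func)
    def wrapper(source, *args, **kwargs):
        if hasattr(source, "read"):
            # Already a file-like object: the caller owns it, so leave it open.
            return func(source, *args, **kwargs)
        # Otherwise treat it as a path: open it here and close it here.
        with open(source, "r", encoding="utf-8") as stream:
            return func(stream, *args, **kwargs)
    return wrapper

@accepts_path_or_file
def first_line(stream):
    """Hypothetical example function: return the first line without its newline."""
    return stream.readline().rstrip("\n")

print(first_line(io.StringIO("hello\nworld\n")))  # hello
```

Calling first_line with a path string instead goes through the open() branch, and the file is closed again before the wrapper returns.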

One serious drawback of "streams" is that when you pass one to your function and the function reads from it, its state changes: the file position advances. Depending on your program, recovering that state can be messy if you need to (e.g. you might have to litter your code with f.tell() and f.seek() calls).
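If restoring the position is needed often, the tell()/seek() pair can be kept in one place. The following is only a sketch: preserved_position is a hypothetical helper, and it assumes the stream is seekable (a regular file or io.StringIO, but not a pipe or socket).

```python
import contextlib
import io

@contextlib.contextmanager
def preserved_position(stream):
    """Put the stream back where it was on exit; needs a seekable stream."""
    position = stream.tell()
    try:
        yield stream
    finally:
        stream.seek(position)

stream = io.StringIO("header\nbody\n")
with preserved_position(stream):
    peeked = stream.readline()  # advances past the header line
# The position is back at the start, so the header can be read again.
print(stream.readline() == peeked)  # True
```

Whichever behaviour a function has, advancing the position or restoring it, is worth stating in its docstring.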

Chretien answered 25/9, 2012 at 5:38 Comment(11)

Yes, when I said "stream", I meant "open file object". Wouldn't it be possible to write a decorator that saves and restores stream state? – Grab 25/9, 2012 at 5:43

And isn't there a way to create an inexpensive copy of a stream, such that the copy owns its own "pointer", while the "pointer" of the original stream is left untouched? That would be even cleaner than save/restore state approach. – Grab 25/9, 2012 at 5:44

@Grab -- Sure, you could write a decorator to do that. The important thing is to document when you're restoring the state and when you're not. As far as creating a copy, the only thing I can think of is itertools.tee, which is a little bit different (but it is way past my normal bedtime, so I don't guarantee anything that I type right now :^) . – Chretien 25/9, 2012 at 5:45

So file name vs file object feels a bit like iterable vs iterator. – Grab 25/9, 2012 at 5:51

@Grab -- I suppose it is similar. – Chretien 25/9, 2012 at 5:53

Actually, can you give an example of a library function that does this? – Grab 25/9, 2012 at 6:3

I can't speak for others, but I usually WANT the function to change the state of the stream. E.g. I want my (hypothetical) "parse_header" function to leave the file pointer at the end of the header, so that the following "read_item" can then start reading from the appropriate point in the file. – Bore 25/9, 2012 at 6:38

@Bore -- I do too. My point is that you need to be careful to keep track of where the file pointer is. – Chretien 25/9, 2012 at 6:42

The xml.etree.ElementTree.parse() function also accepts either a filename or an open file. The problem with users is that you never know what they prefer. Sometimes it is simply handy to pass just the filename. Readability counts. It is easier to read simpler code. – Butterfat 25/9, 2012 at 21:45

@Chretien not sure if I've got you right. Actually, csv.reader doesn't seem to be supposed to accept filenames. It accepts any iterable that returns strings which will go horribly wrong on a filename. As 'streams', it definitely also accepts not just file-like objects though, but any other suitable iterable objects, if that's what you meant. – Vetavetch 9/5, 2014 at 13:3

@naxa -- Yeah, I'm not sure why I commented about csv.reader. It clearly doesn't have that behavior. xml.etree.ElementTree.parse is a correct example. – Chretien 9/5, 2014 at 16:17


7

... Here is how the xml.etree.ElementTree module implements the parse function:

def parse(self, source, parser=None):
    close_source = False
    if not hasattr(source, "read"):
        source = open(source, "rb")
        close_source = True
    ...

Because a filename is a string, it has no read() method (the check only tests that some attribute of that name exists), whereas an open file object does. These four lines let the rest of the code be shared by both cases. The only complication is that you have to remember whether to close the file object (here named source). If the function opened it, the function must close it; otherwise, it must leave it open.
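The close bookkeeping can be sketched as follows. This is not the stdlib code: count_bytes is a hypothetical function that reuses the same hasattr check and adds a try/finally, so that only a file opened inside the function gets closed.

```python
import io

def count_bytes(source):
    """Accept a file name or an open binary file object and return its length."""
    close_source = False
    if not hasattr(source, "read"):
        source = open(source, "rb")
        close_source = True
    try:
        return len(source.read())
    finally:
        # Close only what this function opened; the caller's object stays open.
        if close_source:
            source.close()

buffer = io.BytesIO(b"<root/>")
print(count_bytes(buffer))  # 7
print(buffer.closed)        # False
```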

Actually, files differ from streams slightly. Streams are potentially infinite, while files usually are not (unless some device is mapped as if it were a file). The important difference for processing is that you cannot count on reading a whole stream into memory at once; you have to process it in chunks.
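Chunked processing might look like the sketch below. The function count_char and its chunk_size parameter are assumptions for illustration; it counts a single character, so a match can never straddle a chunk boundary.

```python
import io

def count_char(stream, char, chunk_size=4096):
    """Count one character in a text stream, reading chunk_size characters at a time."""
    total = 0
    # iter(callable, sentinel) keeps calling read() until it returns "" (end of stream).
    for chunk in iter(lambda: stream.read(chunk_size), ""):
        total += chunk.count(char)
    return total

print(count_char(io.StringIO("a,b,c,d"), ",", chunk_size=2))  # 3
```

Memory use stays bounded by chunk_size regardless of how long the stream is. Searching for multi-character patterns would need extra care at chunk boundaries, which this sketch does not attempt.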

Butterfat answered 25/9, 2012 at 21:50 Comment(1)

I was looking for a reference implementation of this in the stdlib. Thanks for the snippet; it really saves time. I would give another +1 for the warning about chunks if I could. – Vetavetch 3/4, 2013 at 13:6