Python shutil copyfile - missing last few lines
Asked Answered
W

3

9

I am routinely missing the last few kb of a file I am trying to copy using shutil copyfile.

I did some research and do see someone asking about something similar here: python shutil copy function missing last few lines

But I am using copyfile, which DOES seem to use a with statement...

with open(src, 'rb') as fsrc:
    with open(dst, 'wb') as fdst:
        copyfileobj(fsrc, fdst)

So I am perplexed that more users aren't having this issue, if indeed it is some sort of buffering issue - I would think it'd be more well known.

I am calling copyfile very simply, don't think I could possibly be doing something wrong, essentially doing it the standard way I think:

copyfile(target_file_name,dest_file_name) 

Yet I am missing the last 4kb or so of the file eachtime.

I have also not touched the copyfile function which gets called in shutil which is...

def copyfileobj(fsrc, fdst, length=16*1024):
    """copy data from file-like object fsrc to file-like object fdst"""
    while 1:
        buf = fsrc.read(length)
        if not buf:
            break
        fdst.write(buf)

So I am at a loss, but I suppose I am about to learn something about flushing, buffering, or the with statement, or ... Help! thanks


to Anand: Anand, I avoided mentioning that stuff bc it's my sense that it's not the problem, but since you asked... executive summary is that I am grabbing a file from an FTP, checking if the file is different from the last time I saved a copy, if so, downloading the file and saving a copy. It's circuitous spaghetti code and was written when I was a truly pure utilitarian novice of a coder I guess. It looks like:

for filename in ftp.nlst(filematch):
    target_file_name = os.path.basename(filename)
    with open(target_file_name ,'wb') as fhandle:
    try:
        ftp.retrbinary('RETR %s' % filename, fhandle.write)
        the_files.append(target_file_name)
        mtime = modification_date(target_file_name)
        mtime_str_for_file = str(mtime)[0:10] + str(mtime)[11:13] + str(mtime)[14:16]    + str(mtime)[17:19] + str(mtime)[20:28]#2014-12-11 15:08:00.338415.
        sorted_xml_files = [file for file in glob.glob(os.path.join('\\\\Storage\\shared\\', '*.xml'))]
        sorted_xml_files.sort(key=os.path.getmtime)
        last_file = sorted_xml_files[-1]
        file_is_the_same = filecmp.cmp(target_file_name, last_file)
        if not file_is_the_same:
            print 'File changed!'
            copyfile(target_file_name, '\\\\Storage\\shared\\'+'datebreaks'+mtime_str_for_file+'.xml') 
        else:
            print 'File '+ last_file +' hasn\'t changed, doin nothin'
            continue
Weinberg answered 21/7, 2015 at 18:29 Comment(8)
Can you show more of your code, how are you creating the target_file_name , as well as how are you creating the target_file itself?Dextroamphetamine
is there one specific file it always happens with? what os are you on and what python? can you post a file that it always does this with? are you trying to write to a network drive or something?Leticia
Anand, I responded to you in the post, wasn't sure how else to do it bc too many characters for a comment.Weinberg
Joran, it is happening almost every time. Maybe one out of every few hundred files work correctly. OS is windows 7 pro, python is 2.7. yes, writing to a network drive. the files are XML files. basically i am routinely missing the last say 10-50 lines of the XMLs.Weinberg
@Weinberg Are you creating those files that are being copied? In the same code? Or do they already exist in the system?Dextroamphetamine
@AnandSKumar i am not creating the xml files , just downloading them from an FTP from a third party. not sure if that answers your question - thxWeinberg
@Weinberg Try doing - fhandle.flush() right after - ftp.retrbinary('RETR %s' % filename, fhandle.write) .Dextroamphetamine
ZOMG ZOMG ZOMG that seemed to work. Anand, genius! Let me test a little further...Weinberg
D
4

The issue here would most probably be that , when executing the line -

ftp.retrbinary('RETR %s' % filename, fhandle.write)

This is using the fhandle.write() function to write the data from the ftp server to the file (with name - target_file_name) , but by the time you are calling -shutil.copyfile - the buffer for fhandle has not completely flushed, so you are missing out on some data when copying the file.

To make sure that this does not occur, you can either move the copyfile logic out of the with block for fhandle .

Or you can call fhandle.flush() to flush the buffer , before copying the file .

I believe it would be better to close the file (move the logic out of the with block). Example -

for filename in ftp.nlst(filematch):
    target_file_name = os.path.basename(filename)
    with open(target_file_name ,'wb') as fhandle:
        ftp.retrbinary('RETR %s' % filename, fhandle.write)
    the_files.append(target_file_name)
    mtime = modification_date(target_file_name)
    mtime_str_for_file = str(mtime)[0:10] + str(mtime)[11:13] + str(mtime)[14:16]    + str(mtime)[17:19] + str(mtime)[20:28]#2014-12-11 15:08:00.338415.
    sorted_xml_files = [file for file in glob.glob(os.path.join('\\\\Storage\\shared\\', '*.xml'))]
    sorted_xml_files.sort(key=os.path.getmtime)
    last_file = sorted_xml_files[-1]
    file_is_the_same = filecmp.cmp(target_file_name, last_file)
    if not file_is_the_same:
        print 'File changed!'
        copyfile(target_file_name, '\\\\Storage\\shared\\'+'datebreaks'+mtime_str_for_file+'.xml') 
    else:
        print 'File '+ last_file +' hasn\'t changed, doin nothin'
        continue
Dextroamphetamine answered 21/7, 2015 at 19:33 Comment(5)
Thank you Anand this fix definitely was the first and most direct fix for the symptom I had. I will also consider nsilent22's solution which seems like it might be more fundamentally sound. Any thoughts welcome.Weinberg
nslient22's is the same solution, to close the file before we send it to copyfile() function , by moving the copyfile (and all logic that is not dependent on the file handle) outside the with block.Dextroamphetamine
Ah you are right, I focused on the 'flush' idea - but you also specified "To make sure that this does not occur, you can either move the copyfile logic out of the with block for fhandle .". Thank you so much.Weinberg
(i focused on the flush idea bc i first saw your comment, before seeing this answer. thanks again).Weinberg
Oh ok, sorry that was just to test whether that is the exact issue (I had a hunch, just wanted to confirm).Dextroamphetamine
S
2

You are trying to copy a file that was not closed. That's why buffers were not flushed. Move the copyfileobj out of the with block, to allow fhandle beeing closed.

Do:

with open(target_file_name ,'wb') as fhandle:
    ftp.retrbinary('RETR %s' % filename, fhandle.write)

# and here the rest of your code
# so fhandle is closed, and file is stored completely on the disk
Schadenfreude answered 21/7, 2015 at 19:30 Comment(2)
I will take a look at this right after I finish testing Anand's flushing solution which I caught earlier and also seemed to fix it. I am not smart enough to know which solution is better/etc. Yours seems to be a more fundamental solution. Thank you and any other insight is appreciated.Weinberg
This also works after a little bit of testing. Wow, so obvious in hindsight. Thank you.Weinberg
E
1

This looks like there is a better way to do nested withs:

with open(src, 'rb') as fsrc, open(dst, 'wb') as fdst:
        copyfileobj(fsrc, fdst)

I'd try something more like this. I'm far from an expert, hopefully someone more knowledgeable can lend some insight. My best thought is that the inner with closes before the outer one.

Escurial answered 21/7, 2015 at 18:40 Comment(2)
I would be happy to try/test that. Perhaps I am naive but isn't it odd though that ... shutil copyfile could be used so widely and yet be suboptimal / need that fix?Weinberg
Yes, it is odd. I'll honestly be a little surprised if this fixes it, but with blocks always mess me up. I use open() and close() like a noob :)Escurial

© 2022 - 2024 — McMap. All rights reserved.