Copy a file in a sane, safe and efficient way
F

9

351

I am looking for a good way to copy a file (binary or text). I have written several samples, and all of them work, but I would like to hear the opinion of seasoned programmers.

I am missing good examples and am looking for a way that works with C++.

ANSI-C-WAY

#include <iostream>
#include <cstdio>    // fopen, fclose, fread, fwrite, BUFSIZ
#include <ctime>
using namespace std;

int main() {
    clock_t start, end;
    start = clock();

    // BUFSIZ is implementation-defined (8192 bytes on this system)
    // a buffer size of 1 means one character at a time
    // good values should fit the block size, like 1024 or 4096
    // higher values reduce the number of system calls
    // size_t BUFFER_SIZE = 4096;

    char buf[BUFSIZ];
    size_t size;

    FILE* source = fopen("from.ogv", "rb");
    FILE* dest = fopen("to.ogv", "wb");

    // fread() returns 0 at end of file or on error, which ends the loop
    while ((size = fread(buf, 1, BUFSIZ, source)) > 0) {
        fwrite(buf, 1, size, dest);
    }

    fclose(source);
    fclose(dest);

    end = clock();

    cout << "CLOCKS_PER_SEC " << CLOCKS_PER_SEC << "\n";
    cout << "CPU-TIME START " << start << "\n";
    cout << "CPU-TIME END " << end << "\n";
    cout << "CPU-TIME END - START " << end - start << "\n";
    cout << "TIME(SEC) " << static_cast<double>(end - start) / CLOCKS_PER_SEC << "\n";

    return 0;
}

POSIX-WAY (K&R use this in "The C programming language", more low-level)

#include <iostream>
#include <fcntl.h>   // open
#include <unistd.h>  // read, write, close
#include <cstdio>    // BUFSIZ
#include <ctime>
using namespace std;

int main() {
    clock_t start, end;
    start = clock();

    // BUFSIZ is implementation-defined (8192 bytes on this system)
    // a buffer size of 1 means one character at a time
    // good values should fit the block size, like 1024 or 4096
    // higher values reduce the number of system calls
    // size_t BUFFER_SIZE = 4096;

    char buf[BUFSIZ];
    ssize_t size;  // read() returns -1 on error, so the type must be signed

    int source = open("from.ogv", O_RDONLY, 0);
    int dest = open("to.ogv", O_WRONLY | O_CREAT /*| O_TRUNC/**/, 0644);

    while ((size = read(source, buf, BUFSIZ)) > 0) {
        write(dest, buf, size);
    }

    close(source);
    close(dest);

    end = clock();

    cout << "CLOCKS_PER_SEC " << CLOCKS_PER_SEC << "\n";
    cout << "CPU-TIME START " << start << "\n";
    cout << "CPU-TIME END " << end << "\n";
    cout << "CPU-TIME END - START " << end - start << "\n";
    cout << "TIME(SEC) " << static_cast<double>(end - start) / CLOCKS_PER_SEC << "\n";

    return 0;
}

KISS-C++-Streambuffer-WAY

#include <iostream>
#include <fstream>
#include <ctime>
using namespace std;

int main() {
    clock_t start, end;
    start = clock();

    ifstream source("from.ogv", ios::binary);
    ofstream dest("to.ogv", ios::binary);

    dest << source.rdbuf();

    source.close();
    dest.close();

    end = clock();

    cout << "CLOCKS_PER_SEC " << CLOCKS_PER_SEC << "\n";
    cout << "CPU-TIME START " << start << "\n";
    cout << "CPU-TIME END " << end << "\n";
    cout << "CPU-TIME END - START " <<  end - start << "\n";
    cout << "TIME(SEC) " << static_cast<double>(end - start) / CLOCKS_PER_SEC << "\n";

    return 0;
}

COPY-ALGORITHM-C++-WAY

#include <iostream>
#include <fstream>
#include <ctime>
#include <algorithm>
#include <iterator>
using namespace std;

int main() {
    clock_t start, end;
    start = clock();

    ifstream source("from.ogv", ios::binary);
    ofstream dest("to.ogv", ios::binary);

    istreambuf_iterator<char> begin_source(source);
    istreambuf_iterator<char> end_source;
    ostreambuf_iterator<char> begin_dest(dest); 
    copy(begin_source, end_source, begin_dest);

    source.close();
    dest.close();

    end = clock();

    cout << "CLOCKS_PER_SEC " << CLOCKS_PER_SEC << "\n";
    cout << "CPU-TIME START " << start << "\n";
    cout << "CPU-TIME END " << end << "\n";
    cout << "CPU-TIME END - START " <<  end - start << "\n";
    cout << "TIME(SEC) " << static_cast<double>(end - start) / CLOCKS_PER_SEC << "\n";

    return 0;
}

OWN-BUFFER-C++-WAY

#include <iostream>
#include <fstream>
#include <ctime>
using namespace std;

int main() {
    clock_t start, end;
    start = clock();

    ifstream source("from.ogv", ios::binary);
    ofstream dest("to.ogv", ios::binary);

    // file size
    source.seekg(0, ios::end);
    ifstream::pos_type size = source.tellg();
    source.seekg(0);
    // allocate memory for buffer
    char* buffer = new char[size];

    // copy file    
    source.read(buffer, size);
    dest.write(buffer, size);

    // clean up
    delete[] buffer;
    source.close();
    dest.close();

    end = clock();

    cout << "CLOCKS_PER_SEC " << CLOCKS_PER_SEC << "\n";
    cout << "CPU-TIME START " << start << "\n";
    cout << "CPU-TIME END " << end << "\n";
    cout << "CPU-TIME END - START " <<  end - start << "\n";
    cout << "TIME(SEC) " << static_cast<double>(end - start) / CLOCKS_PER_SEC << "\n";

    return 0;
}

LINUX-WAY // requires kernel >= 2.6.33

#include <iostream>
#include <sys/sendfile.h>  // sendfile
#include <fcntl.h>         // open
#include <unistd.h>        // close
#include <sys/stat.h>      // fstat
#include <sys/types.h>     // fstat
#include <ctime>
using namespace std;

int main() {
    clock_t start, end;
    start = clock();

    int source = open("from.ogv", O_RDONLY, 0);
    int dest = open("to.ogv", O_WRONLY | O_CREAT /*| O_TRUNC/**/, 0644);

    // struct required, rationale: function stat() exists also
    struct stat stat_source;
    fstat(source, &stat_source);

    sendfile(dest, source, 0, stat_source.st_size);

    close(source);
    close(dest);

    end = clock();

    cout << "CLOCKS_PER_SEC " << CLOCKS_PER_SEC << "\n";
    cout << "CPU-TIME START " << start << "\n";
    cout << "CPU-TIME END " << end << "\n";
    cout << "CPU-TIME END - START " <<  end - start << "\n";
    cout << "TIME(SEC) " << static_cast<double>(end - start) / CLOCKS_PER_SEC << "\n";

    return 0;
}

Environment

  • GNU/LINUX (Archlinux)
  • Kernel 3.3
  • GLIBC-2.15, LIBSTDC++ 4.7 (GCC-LIBS), GCC 4.7, Coreutils 8.16
  • Using RUNLEVEL 3 (Multiuser, Network, Terminal, no GUI)
  • INTEL SSD-Postville 80 GB, filled up to 50%
  • Copy a 270 MB OGG-VIDEO-FILE

Steps to reproduce

 1. $ rm from.ogg
 2. $ reboot                           # kernel and filesystem buffers are in their regular state
 3. $ (time ./program) &>> report.txt  # executes the program and appends its output to the file
 4. $ sha256sum *.ogv                  # checksum
 5. $ rm to.ogg                        # remove the copy, but no sync; kernel and filesystem buffers are used
 6. $ (time ./program) &>> report.txt  # executes the program and appends its output to the file

Results (CPU TIME used)

Program  Description                 UNBUFFERED | BUFFERED
ANSI C   (fread/fwrite)                 490,000 | 260,000
POSIX    (K&R, read/write)              450,000 | 230,000
FSTREAM  (KISS, Streambuffer)           500,000 | 270,000
FSTREAM  (Algorithm, copy)              500,000 | 270,000
FSTREAM  (OWN-BUFFER)                   500,000 | 340,000
SENDFILE (native LINUX, sendfile)       410,000 | 200,000

The file size doesn't change.
sha256sum prints the same results.
The video file is still playable.

Questions

  • What method would you prefer?
  • Do you know better solutions?
  • Do you see any mistakes in my code?
  • Do you know a reason to avoid a solution?

  • FSTREAM (KISS, Streambuffer)
    I really like this one, because it is really short and simple. As far as I know, the operator << is overloaded for rdbuf() and doesn't convert anything. Correct?

Thanks

Update 1
I changed the source of all samples so that opening and closing the file descriptors is included in the clock() measurement. There are no other significant changes in the source code. The results didn't change! I also used time to double-check my results.

Update 2
ANSI C sample changed: The condition of the while loop no longer calls feof(); instead I moved fread() into the condition. It looks like the code now runs 10,000 clocks faster.

Measurement changed: The former results were always buffered, because I repeated the old command line rm to.ogv && sync && time ./program a few times for each program. Now I reboot the system for every program. The unbuffered results are new and show no surprise. The unbuffered results didn't really change.

If I don't delete the old copy, the programs behave differently. Overwriting an existing file buffered is faster with POSIX and SENDFILE; all other programs are slower. Maybe the truncate or create options have an impact on this behaviour. But overwriting existing files with the same copy is not a real-world use case.

Performing the copy with cp takes 0.44 seconds unbuffered and 0.30 seconds buffered. So cp is a little slower than the POSIX sample. Looks fine to me.

Maybe I will also add samples and results for mmap() and copy_file() from boost::filesystem.

Update 3
I've also put this on a blog page and extended it a little, including splice(), which is a low-level function from the Linux kernel. Maybe more samples in Java will follow. http://www.ttyhoney.com/blog/?page_id=69

Faveolate answered 17/4, 2012 at 16:38 Comment(23)
fstream definitely is a good option for file operations.Gigantean
richelbilderbeek.nl/CppCopy_file.htmVarious
You forgot the lazy way: system("cp from.ogv to.ogv");Guidance
I don't like the lazy way. Thats why I don't mention it. It is ugly. On the other hand, the code in coreutils is long tested and proven.Faveolate
#include <copyfile.h> copyfile(const char *from, const char *to, copyfile_state_t state, copyfile_flags_t flags);Recrement
Why are you only taking the time to copy the files, and not the open/close time? If you're doing a number of iterations of this, the startup/shutdown time for each time you copy something could matter quite a lot.Soteriology
You are right Kevin. I changed the samples in this way with the update. My previous intention was, that the "file is already" accessed by the user and to monitor only the copy-progress itself. But the open/close of filedescriptor is a part of the copy-progress!Faveolate
I'd say to use boost::filesystem::copy_file.Rothko
Could you please fix the "units" of the timing results. "Clocks per second" makes no sense, that's a measure of processor clock speed, not anything to do with I/O. Did you mean "clocks per file"?Selfconfessed
I've changed the description to "CPU TIME", often also referred to as "CPU TICKS". This matches the description in the man page of clock(). Better?Faveolate
Personally, I would decide only after adding error handling to each example and comparing the added code: outright performance is obviously a big factor in deciding what to do, but it usually shouldn't be the only factor.Fulgurant
The blog page no longer exists. Is your article still available somewhere?Improvisation
Currently not. I will set up the blog again within the next weeks.Faveolate
If source is empty, then writing its rdbuf will result in the failbit being set on the destination stream. For this reason, you might prefer std::copy as a default C++ portable approach (if you intend to write other things to the destination stream).Wedlock
Sorry for chipping in so late, but I would describe none of these as 'safe', as they don't have any error handling.Homophony
The sendfile example is wrong. sendfile is declared as ssize_t sendfile(int out_fd, int in_fd, off_t *offset, size_t count);. Note that the third argument is off_t *offset. So sendfile(dest, source, 0, stat_source.st_size); will probably result in SIGSEGV.Autonomous
@AndrewHenle: I'm sorry, but the man-page says that it is allowed to pass [0|NULL|nullptr] and sendfile will start at the offset (the start of the file in this case).Faveolate
@Faveolate What method would you prefer? Do you know better solutions? took this otherwise great question in an opinion-based direction, since preference and the ill-defined "better" is subjective (did you mean "faster"?). Questions like "What other methods exist? What are their advantages and disadvantages?" (like your Do you know a reason to avoid a solution?) is more appropriate for SO.Phase
@KeithM: To be honest, your suggestion sounds rather theoretical and changing this words wouldn't improve something. Because I don't maintain my blog anymore and the list of examples is missing splice() (which is neat) I looking forward to add splice() and try to merge this complete thing to the new documentation site, I think this would be really a good way. Please bear with me :)Faveolate
@Faveolate Theoretical how? Perhaps you missed my point; I'm saying that you're asking opinion-based questions, which is off-topic for Stack Overflow. The current answers are not opinions, so merely removing those two questions would improve this Q&A.Phase
Measuring the CPU time of an I/O-bound operation is pointless. The only time of interest is the elapsed time by the wall clock.Hopi
The sendfile example will fail on files larger than 2GBPratfall
I think that one of the requirements should also be preserving the metadata (time, permissions, etc.) of the original file.Fryer
R
296

Copy a file in a sane way:

#include <fstream>

int main()
{
    std::ifstream  src("from.ogv", std::ios::binary);
    std::ofstream  dst("to.ogv",   std::ios::binary);

    dst << src.rdbuf();
}

This is so simple and intuitive to read it is worth the extra cost. If we were doing it a lot, better to fall back on OS calls to the file system. I am sure boost has a copy file method in its filesystem class.
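Here is a sketch of the same stream-buffer copy with its result checked. operator<< sets failbit on the destination when no characters could be inserted, which also happens for an empty source, so that case is handled separately. The function name copy_stream is hypothetical.

```cpp
#include <fstream>
#include <string>

// Hypothetical helper: returns true when `from` was copied to `to`.
bool copy_stream(const std::string& from, const std::string& to) {
    std::ifstream src(from, std::ios::binary);
    if (!src) return false;                       // source could not be opened

    std::ofstream dst(to, std::ios::binary | std::ios::trunc);
    if (!dst) return false;                       // destination could not be opened

    // Inserting an empty stream buffer sets failbit on dst, so only
    // stream the contents when the source holds at least one character.
    if (src.peek() != std::ifstream::traits_type::eof()) {
        dst << src.rdbuf();
    }

    dst.close();                                  // flush so write errors become visible
    return !dst.fail() && !src.bad();
}
```

A caller would test the return value, for example `if (!copy_stream("from.ogv", "to.ogv")) { /* report the failure */ }`. An empty source then yields an empty destination instead of an error.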

There is a C method for interacting with the file system:

#include <copyfile.h>

int
copyfile(const char *from, const char *to, copyfile_state_t state, copyfile_flags_t flags);
Recrement answered 17/4, 2012 at 16:49 Comment(32)
That's his KISS streambuffer C++ way. It's the most natural way for a C++ programmer (although using std::copy is also pretty idiomatic). And since the two idiomatic ways also happen to be the fastest portable ways...Indigestive
copyfile is not portable; I think it's specific to Mac OS X. It certainly doesn't exist on Linux. boost::filesystem::copy_file is probably the most portable way to copy a file via the native file system.Memory
@MikeSeymour: copyfile() seems to be a BSD extension.Recrement
Seems that Loki is right. The function copyfile() seems to be an extension for *BSD and Mac OS.Faveolate
@PeterWeber: I would not say Mac OS. Mac OS is a BSD implementation.Recrement
I'm just careful. I know some people want not call Mac OS a UNIX.Faveolate
Is the operator '<<' ok for binary files ?Saunderson
@Vincent: Yes. See: stackoverflow.com/questions/12766636/…Recrement
Copying a file correctly on Windows is ridiculously difficult to get right, unless you delegate the whole thing to Windows by saying CopyFile.Diahann
Beware that Boost copy_file does not work correctly in v 1.40.0Nagging
src.close(); dst.close(); ?Borlow
@duedl0r: No. Objects have destructors. The destructor for streams automatically call close(). codereview.stackexchange.com/q/540/507Recrement
@LokiAstari: yes, if they go out of scope.Borlow
@duedl0r: Yes. But that's like saying "if the sun sets". You can run really fast west and you may make your day slightly longer but the sun is going to set. Unless you have bug and leak memory (it will go out of scope). But since there is no dynamic memory management here there can not be a leak and they will go out of scope (just like the sun will set).Recrement
@LokiAstari, that's great, although I just found that std::ios::binary is necessary for correct file copying on windows.Mink
Is there an equivalently simple way of copying permissions, too?Mahaffey
@mangledorf: Not at the language level. The language does not have a concept of "File Systems" let alone permission. You will need to use OS or "File system" specific API to achieve that.Recrement
@Loki, do you know of a more C++ way of dealing with POSIX permissions than fchown/fstat (e.g. https://mcmap.net/q/17281/-keeping-fileowner-and-permissions-after-copying-file-in-c)Mahaffey
@mangledorf: No because C++ has no concept of a file system. For that you must use a library. The library you use will depend on the system you use (unless you use a lib to abstract the file system (like boost::filesystem)).Recrement
@LokiAstari: despite all the appeal of the "sun sets" analogy, a function may want to copy a file then start doing some time-consuming task - most extreme example: a copy atop main() before entering a daemon/server mode. Still - how/when to handle that is obviously going to be well understood by the person posting the question.Vibratile
Then simply wrap it in a { } scope blockHemstitch
Don't forget to #include <fstream>Containerize
@AlecJacobson C++17 has std::filesystem::permissions.Infamy
@MillieSmith: It does not need to. The C++ standard says that if the main function does not end in a return then the compiler will plant the return 0; for you. Also; it is common practice to leave out that return as an indication, to other developers, that the application can not fail (ie you always return the success status code back to the OS).Recrement
Warning: dst << src.rdbuf(); will set failbit on dst if src is an empty file. www.cplusplus.com:operator<<: Sets failbit if ... For (2), it is set if no characters could be extracted from the object pointed by sb.Bolte
Not only is copyfile not portable and macOS-specific, but it's also sub-optimal in that you cannot set the block size. MacOS itself doesn't even use it, which you can demonstrate yourself by doing a drag&drop of a bunch of photo/video/audio files in Finder and then trying the same operation with copyfile. You'll find it's significantly slower. We found this out at our company and now need to roll our own version of it to meet the performance of drag&drop.Curiel
@Curiel copyfile() has nothing to do with the mac. It's a BSD extension (as we noted 10 years ago in the comments above). If you are writing drag/drop; an OS level feature; you should be working with OS level features to do the copy (this question is nothing to do with kind of functionality). Look at your OS file system features to do the copy in this type of situation.Recrement
@Curiel Look at: filesystem/copy is now part of the standard (so should be portable). Note there is a difference between copying files at the file system level (which copies blocks) and reading and writing a file which requires the content to be streamed through the application. I actually put this as part of the answer ten years ago: better to fall back on OS calls to the file systemRecrement
@MartinYork -- google for a copyfile man page and all you find is macOS information. Yes, it's a BSD extension, but it seems to be most prevalent (or now exclusive to) macOS. I have no argument with your other points (and thanks for the link to filesystem/copy) but my original points still stand: macOS doesn't appear to use copyfile for data movement, it and it's significantly slower than whatever macOS does.Curiel
@Curiel copyfile() calls need the COPYFILE_CLONE flag to be as fast as Finder.Murton
With the first method you mentioned. How can I check that "dst << src.rdbuf();" operation was successfull ? Can you also suggest any other methods of making sure that files are the same except checking by md5sum or any other algorithm of this kind?Copy
@Copy if (dst << src.rdbuf()){std::cout << "Worked\n";}Recrement
F
88

With C++17, the standard way to copy a file is to include the <filesystem> header and use:

bool copy_file( const std::filesystem::path& from,
                const std::filesystem::path& to);

bool copy_file( const std::filesystem::path& from,
                const std::filesystem::path& to,
                std::filesystem::copy_options options);

The first form is equivalent to the second one with copy_options::none used as options (see also copy_file).

The filesystem library was originally developed as boost.filesystem and finally merged to ISO C++ as of C++17.

Fungi answered 17/4, 2012 at 16:38 Comment(4)
Why there's not a single function with a default argument, like bool copy_file( const std::filesystem::path& from, const std::filesystem::path& to, std::filesystem::copy_options options = std::filesystem::copy_options::none);?Leftist
@Leftist I'm not sure about this. Maybe it doesn't really matter.Fungi
@Leftist in the standard library, clean code is paramount. Having overloads (as opposed to one function with default parameters) makes the programmer's intention more clear.Copter
@Faveolate This should now probably be the accepted answer given that C++17 is available.Recrement
S
21

Too many!

The buffer in the "ANSI C" way is redundant, since a FILE is already buffered. (The size of this internal buffer is what BUFSIZ actually defines.)

The "OWN-BUFFER-C++-WAY" will be slow as it goes through fstream, which does a lot of virtual dispatching, and again maintains internal buffers for each stream object. (The "COPY-ALGORITHM-C++-WAY" does not suffer this, as the streambuf_iterator class bypasses the stream layer.)

I prefer the "COPY-ALGORITHM-C++-WAY", but without constructing an fstream, just create bare std::filebuf instances when no actual formatting is needed.
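What that could look like, as a sketch rather than a measured implementation: two bare std::filebuf objects are opened in binary mode and connected with std::copy through the streambuf iterators. The function name copy_filebuf is hypothetical.

```cpp
#include <algorithm>
#include <fstream>
#include <iterator>

// Hypothetical helper: copies `from` to `to` through bare std::filebuf objects.
bool copy_filebuf(const char* from, const char* to) {
    std::filebuf in;
    std::filebuf out;
    if (!in.open(from, std::ios::in | std::ios::binary)) return false;
    if (!out.open(to, std::ios::out | std::ios::binary | std::ios::trunc)) return false;

    // The streambuf iterators read and write the buffers directly,
    // without the formatting layer of ifstream/ofstream.
    const std::ostreambuf_iterator<char> end_out =
        std::copy(std::istreambuf_iterator<char>(&in),
                  std::istreambuf_iterator<char>(),
                  std::ostreambuf_iterator<char>(&out));

    // failed() reports a write error; close() returns a null pointer on failure.
    return !end_out.failed() && out.close() != nullptr;
}
```

The filebuf destructors close both files, so no explicit cleanup is needed on the early-return paths.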

For raw performance, you can't beat POSIX file descriptors. It's ugly but portable and fast on any platform.

The Linux way appears to be incredibly fast — perhaps the OS let the function return before I/O was finished? In any case, that's not portable enough for many applications.

EDIT: Ah, "native Linux" may be improving performance by interleaving reads and writes with asynchronous I/O. Letting commands pile up can help the disk driver decide when is best to seek. You might try Boost Asio or pthreads for comparison. As for "can't beat POSIX file descriptors"… well that's true if you're doing anything with the data, not just blindly copying.

Swagman answered 17/4, 2012 at 16:52 Comment(5)
ANSI C: But I have to give the function fread/fwrite a size? pubs.opengroup.org/onlinepubs/9699919799/toc.htmFaveolate
@PeterWeber Well, yes, it's true that BUFSIZ is as good a value as any, and will probably speed things up relative to one or "just a few" characters at a time. Anyway, the performance measurement bears out that it's not the best method in any case.Swagman
I have not a in-depth understanding of this, so I should be careful with assumptions and opinions. Linux-Way runs in Kernelspace afaik. This should avoid slow Context-Switching between Kernelspace and Userspace? Tomorrow I will take again a look at the manpage of sendfile. A while ago Linus Torvalds said he doesn't like Userspace-Filesystems for heavy jobs. Maybe sendfile is a positive example for his view?Faveolate
"sendfile() copies data between one file descriptor and another. Because this copying is done within the kernel, sendfile() is more efficient than the combination of read(2) and write(2), which would require transferring data to and from user space.": kernel.org/doc/man-pages/online/pages/man2/sendfile.2.htmlFulgurant
Could you post an example of using raw filebuf objects?Gadgetry
U
19

I want to make the very important note that the LINUX method using sendfile() has a major problem: as written, with a single call, it cannot copy files larger than 2GB! I had implemented it following this question and was hitting problems because I was using it to copy HDF5 files that were many GB in size.

http://man7.org/linux/man-pages/man2/sendfile.2.html

sendfile() will transfer at most 0x7ffff000 (2,147,479,552) bytes, returning the number of bytes actually transferred. (This is true on both 32-bit and 64-bit systems.)

Unbar answered 17/4, 2012 at 16:38 Comment(4)
does sendfile64() have same problem?Fontanel
@Paladin It seems that sendfile64 was developed to get around this limitation. From the man page: """The original Linux sendfile() system call was not designed to handle large file offsets. Consequently, Linux 2.4 added sendfile64(), with a wider type for the offset argument. The glibc sendfile() wrapper function transparently deals with the kernel differences."""Unbar
sendfile64 has the same issue it seems. However the use of the offset type off64_t allows one to use a loop to copy large files as shown in an answer to the linked question.Pompey
This is written in the man page: 'Note that a successful call to sendfile() may write fewer bytes than requested; the caller should be prepared to retry the call if there were unsent bytes.' sendfile or sendfile64 might need to be called in a loop until the full copy is done.Mousetail
E
3

Qt has a method for copying files:

#include <QFile>
QFile::copy("originalFile.example","copiedFile.example");

Note that to use this you have to install Qt (instructions here) and include it in your project (if you're using Windows and you're not an administrator, you can download Qt here instead). Also see this answer.

Escalate answered 17/4, 2012 at 16:38 Comment(2)
QFile::copy is ridiculously slow due to its 4k buffering.Mythopoeia
The slowness has been fixed in newer versions of Qt. I am using 5.9.2 and the speed is on par with the native implementation. Btw. taking a look at the source code, Qt seems to actually call the native implementation.Fryer
A
2

For those who like boost:

boost::filesystem::path mySourcePath("foo.bar");
boost::filesystem::path myTargetPath("bar.foo");

// Variant 1: Overwrite existing
boost::filesystem::copy_file(mySourcePath, myTargetPath, boost::filesystem::copy_option::overwrite_if_exists);

// Variant 2: Fail if exists
boost::filesystem::copy_file(mySourcePath, myTargetPath, boost::filesystem::copy_option::fail_if_exists);

Note that boost::filesystem::path is also available as wpath for Unicode. And that you could also use

using namespace boost::filesystem;

if you do not like those long type names

Altricial answered 17/4, 2012 at 16:38 Comment(1)
Boost's filesystem library is one of the exceptions that requires it to be compiled. Just FYI!Sparry
H
1

Update

The most convenient and efficient way to copy files on Unix platforms is to use the sendfile system call, which internally uses a memory map to copy the file entirely in kernel mode. Note that sendfile can only copy 2GB at a time, so we should use it in a loop.

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/sendfile.h>
#include <sys/stat.h>

int main(int argc, char **argv) {
    if (argc != 3) {
        fprintf(stderr, "usage: %s <source> <target>\n", argv[0]);
        return EXIT_FAILURE;
    }
    int source_fd = open(argv[1], O_RDONLY, 0);
    if (source_fd < 0) {
        perror("open source");
        return EXIT_FAILURE;
    }
    int target_fd = open(argv[2], O_RDWR | O_CREAT | O_TRUNC, 0666);
    if (target_fd < 0) {
        perror("open target");
        return EXIT_FAILURE;
    }
    struct stat stat;
    int r = fstat(source_fd, &stat);
    if (r < 0) {
        perror("fstat");
        return EXIT_FAILURE;
    }
    off_t offset = 0;
    ssize_t bytes_sent = 0;
    ssize_t total_bytes_sent = 0;
    while (offset < stat.st_size) {
        bytes_sent = sendfile(target_fd, source_fd, &offset, stat.st_size - offset);
        if (bytes_sent < 0) {
            perror("sendfile");
            return EXIT_FAILURE;
        }
        if (bytes_sent == 0) {
            break;  /* source ended early (e.g. it was truncated); avoid looping forever */
        }
        total_bytes_sent += bytes_sent;  /* only count successful transfers */
    }
    if (total_bytes_sent != stat.st_size) {
        fprintf(stderr, "sendfile: copied file truncated to %zd bytes\n", total_bytes_sent);
        return EXIT_FAILURE;
    } else {
        printf("sendfile: %zd bytes copied\n", total_bytes_sent);
    }
    close(source_fd);
    close(target_fd);
    return EXIT_SUCCESS;
}

Copying a roughly 3.2GB file, the time usage is:

real    0m1.894s
user    0m0.000s
sys     0m1.880s

Here is a Python version:

import sys
import os

if len(sys.argv) != 3:
    print(f'Usage: {sys.argv[0]} <source> <destination>')
    sys.exit(1)

with open(sys.argv[1], 'rb') as src, open(sys.argv[2], 'wb') as dst:
    total_bytes_sent = 0
    while total_bytes_sent < os.path.getsize(sys.argv[1]):
        bytes_sent = os.sendfile(dst.fileno(), src.fileno(), offset=None, count=2**31-1)
        total_bytes_sent += bytes_sent
    print(f"{total_bytes_sent} bytes written")

Copying a roughly 3.2GB file, the time usage is:

real    0m2.015s
user    0m0.010s
sys     0m1.973s

The most convenient and efficient way to copy files on Windows is to use the CopyFile API. The real time and the sys time are incredibly low. Maybe it is optimized by calling low-level driver functions performing async DMA because the result is constant when the source and destination drives have different file systems, and when I copy the file from a slow USB 2.0 HDD, the sys time is not fully recorded.

#include <stdio.h>
#include <stdlib.h>   // exit, EXIT_FAILURE, EXIT_SUCCESS
#include <windows.h>

void PrintLastError(const char *name) {
    char *msg;
    FormatMessageA(FORMAT_MESSAGE_ALLOCATE_BUFFER | FORMAT_MESSAGE_FROM_SYSTEM | FORMAT_MESSAGE_IGNORE_INSERTS,
        NULL, GetLastError(), MAKELANGID(LANG_NEUTRAL, SUBLANG_DEFAULT), (char *) &msg, 0, NULL);
    fprintf(stderr, "%s failed: %s", name, msg);
    LocalFree(msg);
    exit(EXIT_FAILURE);
}

int main(int argc, char *argv[]) {
    if (argc != 3) {
        printf("Usage: %s <source> <destination>\n", argv[0]);
        return 1;
    }
    if (!CopyFileA(argv[1], argv[2], TRUE)) {
        PrintLastError("CopyFile");
    }
    return EXIT_SUCCESS;
}

Copying a roughly 3.2GB file at a local SSD, the time usage measured by my timing tool is:

real    0.894460s
user    0.000000s
sys     0.734375s

Copying a roughly 3.2GB file from a slow USB 2.0 HDD, the time usage measured by my timing tool is:

real    90.149947s
user    0.015625s
sys     1.328125s

In theory, the most efficient way to copy files is to use a memory map, so the copying process can be done entirely in kernel mode.

If the file is smaller than 2GB, you can use the following code on Unix platforms:

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>

int main(int argc, char **argv) {
    if (argc != 3) {
        fprintf(stderr, "usage: %s <source> <target>\n", argv[0]);
        return EXIT_FAILURE;
    }
    int source_fd = open(argv[1], O_RDONLY, 0);
    if (source_fd < 0) {
        perror("open source");
        return EXIT_FAILURE;
    }
    int target_fd = open(argv[2], O_RDWR | O_CREAT | O_TRUNC, 0666);
    if (target_fd < 0) {
        perror("open target");
        return EXIT_FAILURE;
    }
    struct stat stat;
    int r = fstat(source_fd, &stat);
    if (r < 0) {
        perror("fstat");
        return EXIT_FAILURE;
    }
    char *buf = mmap(NULL, stat.st_size, PROT_READ, MAP_PRIVATE, source_fd, 0);
    if (buf == MAP_FAILED) {
        perror("mmap");
        return EXIT_FAILURE;
    }
    r = write(target_fd, buf, stat.st_size);
    if (r < 0) {
        perror("write");
        return EXIT_FAILURE;
    } else if (r != stat.st_size) {
        fprintf(stderr, "write: copied file truncated to %d bytes\n", r);
        return EXIT_FAILURE;
    } else {
        printf("write: %d bytes copied\n", r);
    }
    munmap(buf, stat.st_size);
    close(source_fd);
    close(target_fd);
    return EXIT_SUCCESS;
}

Copying a roughly 2GB file, the time usage is:

real    0m1.457s
user    0m0.000s
sys     0m1.451s

But if the file is larger than 2GB, a single write() call transfers at most about 2GB (0x7ffff000 bytes on Linux), so the copy would be truncated and this code cannot be used as is. We must map the destination file and use memcpy to copy the file. Since memcpy is used, we can see there is time spent in user mode.

Here is a universal version:

import sys
import mmap

if len(sys.argv) != 3:
    print(f'Usage: {sys.argv[0]} <source> <destination>')
    sys.exit(1)

with open(sys.argv[1], 'rb') as src, open(sys.argv[2], 'wb') as dst:
    mmapped_src = mmap.mmap(src.fileno(), 0, access=mmap.ACCESS_READ)
    print(f"{dst.write(mmapped_src)} bytes written")
    mmapped_src.close()

Copying a roughly 3.2GB file, the time usage on Linux is:

real    0m2.050s
user    0m0.010s
sys     0m2.012s

Copying a roughly 3.2GB file, the time usage measured by my timing tool on Windows is:

real    3.520454s
user    0.031250s
sys     2.046875s

Here is a Unix version:

#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/stat.h>

int main(int argc, char *argv[]) {
    int src_fd, dst_fd;
    void *src_map, *dst_map;
    struct stat src_stat;

    if (argc != 3) {
        printf("Usage: %s <source> <destination>\n", argv[0]);
        return 1;
    }

    src_fd = open(argv[1], O_RDONLY);
    if (src_fd == -1) {
        perror("open source");
        return 1;
    }

    if (fstat(src_fd, &src_stat) == -1) {
        perror("fstat");
        return 1;
    }

    src_map = mmap(NULL, src_stat.st_size, PROT_READ, MAP_PRIVATE, src_fd, 0);
    if (src_map == MAP_FAILED) {
        perror("mmap source");
        return 1;
    }

    dst_fd = open(argv[2], O_RDWR | O_CREAT | O_TRUNC, src_stat.st_mode);
    if (dst_fd == -1) {
        perror("open destination");
        return 1;
    }

    if (ftruncate(dst_fd, src_stat.st_size) == -1) {
        perror("ftruncate");
        return 1;
    }

    dst_map = mmap(NULL, src_stat.st_size, PROT_READ | PROT_WRITE, MAP_SHARED, dst_fd, 0);
    if (dst_map == MAP_FAILED) {
        perror("mmap destination");
        return 1;
    }

    memcpy(dst_map, src_map, src_stat.st_size);
    /* off_t is not guaranteed to be long, so cast explicitly for printf */
    printf("Copied %lld bytes from %s to %s\n", (long long) src_stat.st_size, argv[1], argv[2]);

    munmap(src_map, src_stat.st_size);
    munmap(dst_map, src_stat.st_size);

    close(src_fd);
    close(dst_fd);

    return 0;
}

Copying a roughly 3.2GB file, the time usage is:

real    0m2.978s
user    0m0.639s
sys     0m2.325s

Here is a Windows version:

#include <stdio.h>
#include <stdlib.h>   /* exit, EXIT_FAILURE */
#include <string.h>   /* memcpy */
#include <windows.h>

void PrintLastError(const char *name) {
    char *msg;
    FormatMessageA(FORMAT_MESSAGE_ALLOCATE_BUFFER | FORMAT_MESSAGE_FROM_SYSTEM | FORMAT_MESSAGE_IGNORE_INSERTS,
        NULL, GetLastError(), MAKELANGID(LANG_NEUTRAL, SUBLANG_DEFAULT), (char *) &msg, 0, NULL);
    fprintf(stderr, "%s failed: %s", name, msg);
    LocalFree(msg);
    exit(EXIT_FAILURE);
}

int main(int argc, char *argv[]) {
    HANDLE hSrc, hDst;
    HANDLE hSrcMap, hDstMap;
    LPVOID lpSrcMap, lpDstMap;
    DWORD dwSrcSize, dwDstSize;

    if (argc != 3) {
        printf("Usage: %s <source> <destination>\n", argv[0]);
        return 1;
    }

    hSrc = CreateFileA(argv[1], GENERIC_READ, FILE_SHARE_READ, NULL, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
    if (hSrc == INVALID_HANDLE_VALUE) {
        PrintLastError("CreateFile");
        return 1;
    }

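    /* GetFileSize returns the size in two 32-bit halves; this version only handles the low half */
    DWORD dwSrcSizeHigh = 0;
    dwSrcSize = GetFileSize(hSrc, &dwSrcSizeHigh);
    if (dwSrcSize == INVALID_FILE_SIZE && GetLastError() != NO_ERROR) {
        PrintLastError("GetFileSize");
        goto SRC_MAP_FAIL;
    }
    if (dwSrcSizeHigh != 0) {
        fprintf(stderr, "files of 4GB or larger are not supported by this version\n");
        CloseHandle(hSrc);
        return 1;
    }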

    hSrcMap = CreateFileMappingA(hSrc, NULL, PAGE_READONLY, 0, 0, NULL);
    if (hSrcMap == NULL) {
        PrintLastError("CreateFileMapping");
        goto SRC_MAP_FAIL;
    }

    lpSrcMap = MapViewOfFile(hSrcMap, FILE_MAP_READ, 0, 0, 0);
    if (lpSrcMap == NULL) {
        PrintLastError("MapViewOfFile");
        goto SRC_VIEW_FAIL;
    }

    hDst = CreateFileA(argv[2], GENERIC_READ | GENERIC_WRITE, 0, NULL, CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL);
    if (hDst == INVALID_HANDLE_VALUE) {
        PrintLastError("CreateFile");
        goto DEST_OPEN_FAIL;
    }

    dwDstSize = dwSrcSize;
    hDstMap = CreateFileMappingA(hDst, NULL, PAGE_READWRITE, 0, dwDstSize, NULL);
    if (hDstMap == NULL) {
        PrintLastError("CreateFileMapping");
        goto DEST_MAP_FAIL;
    }

    lpDstMap = MapViewOfFile(hDstMap, FILE_MAP_WRITE, 0, 0, 0);
    if (lpDstMap == NULL) {
        PrintLastError("MapViewOfFile");
        goto DEST_VIEW_FAIL;
    }

    memcpy(lpDstMap, lpSrcMap, dwSrcSize);
    printf("Copied %lu bytes from %s to %s", dwSrcSize, argv[1], argv[2]);

    UnmapViewOfFile(lpDstMap);
DEST_VIEW_FAIL:
    CloseHandle(hDstMap);
DEST_MAP_FAIL:
    CloseHandle(hDst);
DEST_OPEN_FAIL:
    UnmapViewOfFile(lpSrcMap);
SRC_VIEW_FAIL:
    CloseHandle(hSrcMap);
SRC_MAP_FAIL:
    CloseHandle(hSrc);

    return 0;
}

Copying a roughly 3.2GB file, the time usage measured by my timing tool is:

real    3.223017s
user    0.906250s
sys     2.312500s
Honduras answered 17/4, 2012 at 16:38 Comment(2)
Could you explain why write does not work with files > 2 GiB (assuming size_t is large enough)? Also, why not just mmap the source file in small chunks (by specifying a suitable offset) to circumvent this, rather than using memcpy? (Of course that version runs into race conditions if the file is modified during the copying process, but I believe this is true for all versions: mmapping the file does not lock it, does it? Doing this would probably be a good idea in general.) Greenberg
Maybe write (and sendfile) internally uses a 32-bit signed int, so the maximum number it can represent is 2GB. Honduras
A
1

The simplest way in C++17 and later is:

Include <filesystem> and call std::filesystem::copy(). The function has four overloads, which you can check in this link:

void copy( const std::filesystem::path& from,
           const std::filesystem::path& to );

void copy( const std::filesystem::path& from,
           const std::filesystem::path& to,
           std::error_code& ec );

void copy( const std::filesystem::path& from,
           const std::filesystem::path& to,
           std::filesystem::copy_options options );

void copy( const std::filesystem::path& from,
           const std::filesystem::path& to,
           std::filesystem::copy_options options,
           std::error_code& ec );

copy() can copy both files and directories, and its options select recursive or non-recursive copying, copying only the directory structure, overwriting or skipping existing files, and so on. You can read more about the copy options in this link.

This is sample code from here, with some edits:

#include <cstdlib>
#include <iostream>
#include <fstream>
#include <filesystem>
namespace fs = std::filesystem;
 
int main()
{
    // create the directory tree, including any parent directories that do not exist yet
    fs::create_directories("sandbox/dir/subdir");
    
    // create file with content 'a'
    std::ofstream("sandbox/file1.txt").put('a');
    
    // copy file
    fs::copy("sandbox/file1.txt", "sandbox/file2.txt");
    
    // copy directory (non-recursive)
    fs::copy("sandbox/dir", "sandbox/dir2"); 
    
    // copy directory (recursive)
    const auto copyOptions = fs::copy_options::update_existing
                           | fs::copy_options::recursive
                           ;
    fs::copy("sandbox", "sandbox_copy", copyOptions); 
    
    // remove the sandbox directory and everything inside it
    fs::remove_all("sandbox");
}
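
The sample above uses the throwing overloads. As a hedged sketch of the std::error_code overloads listed earlier, the following helper reports failure through its return value instead of an exception. The name copy_tree_noexcept is hypothetical, and the choice of skip_existing is only an assumption about what a caller might want.

```cpp
#include <filesystem>
#include <iostream>
#include <system_error>

namespace fs = std::filesystem;

// Hypothetical helper: copy `from` to `to` recursively without throwing.
// Files that already exist at the destination are left untouched (skip_existing).
bool copy_tree_noexcept(const fs::path& from, const fs::path& to) {
    std::error_code ec;
    fs::copy(from, to,
             fs::copy_options::recursive | fs::copy_options::skip_existing,
             ec);
    if (ec) {
        std::cerr << "copy failed: " << ec.message() << '\n';
        return false;
    }
    return true;
}
```

A call such as copy_tree_noexcept("sandbox", "sandbox_copy") returns false when the source does not exist, where the throwing overload would raise std::filesystem::filesystem_error.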
Alonaalone answered 17/4, 2012 at 16:38 Comment(0)
S
1

I'm not quite sure what a "good way" of copying a file is, but assuming "good" means "fast", I could broaden the subject a little.

Current operating systems have long been optimized for run-of-the-mill file copies. No clever bit of code will beat that. It is possible that some variant of your copy techniques will prove faster in some test scenario, but it would most likely fare worse in other cases.

For example, the sendfile function probably returns before the data has been committed to storage, which gives the impression of being faster than the rest. It needs no user-space buffer at all, because the kernel moves the data between the two descriptors itself. As for files bigger than 2GB: on Linux a single sendfile call transfers at most 0x7ffff000 bytes and returns the number actually transferred, so larger files need repeated calls.

As long as you're dealing with a small number of files, everything happens inside various buffers (the C++ runtime's first if you use iostream, then the OS's own cache). The actual storage medium is only accessed once enough data has been moved around to be worth the trouble of spinning a hard disk.
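
To make the two sendfile points above concrete (the data may not be on disk when the call returns, and large files need care), here is a hedged, Linux-only sketch. It loops because one call may transfer less than requested, and it calls fsync() before reporting success. The name sendfile_copy is hypothetical.

```cpp
#include <fcntl.h>
#include <sys/sendfile.h>
#include <sys/stat.h>
#include <unistd.h>

// Hypothetical helper (Linux only): copy with sendfile() in a loop, then fsync().
bool sendfile_copy(const char* from, const char* to) {
    int src = open(from, O_RDONLY);
    if (src < 0) return false;
    struct stat st;
    if (fstat(src, &st) < 0) { close(src); return false; }
    int dst = open(to, O_WRONLY | O_CREAT | O_TRUNC, 0666);
    if (dst < 0) { close(src); return false; }

    bool ok = true;
    off_t offset = 0;                       // sendfile() advances this itself
    while (offset < st.st_size) {
        ssize_t n = sendfile(dst, src, &offset,
                             static_cast<size_t>(st.st_size - offset));
        if (n <= 0) { ok = false; break; }  // error, or the source shrank meanwhile
    }
    // Without fsync() the copied data may still sit only in the OS cache.
    if (ok && fsync(dst) < 0) ok = false;
    close(src);
    if (close(dst) < 0) ok = false;
    return ok;
}
```

Timing this with and without the fsync() call is one way to see how much of a benchmark result is really the cache at work.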

I suppose you could slightly improve performance in specific cases. Off the top of my head:

  • If you're copying a huge file on the same disk, using a buffer bigger than the OS's might improve things a bit (but we're probably talking about gigabytes here).
  • If you want to copy the same file to two different physical destinations, you will probably be faster opening the three files at once than calling copy_file twice in sequence (though you'll hardly notice the difference as long as the file fits in the OS cache)
  • If you're dealing with lots of tiny files on an HDD you might want to read them in batches to minimize seeking time (though the OS already caches directory entries to avoid seeking like crazy and tiny files will likely reduce disk bandwidth dramatically anyway).

But all that is outside the scope of a general purpose file copy function.
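
As an illustration of the second point in the list above, here is a minimal sketch that reads the source once and writes each chunk to both destinations. The function name copy_to_two and the 64 KiB chunk size are assumptions made for the example, not tuned values.

```cpp
#include <fstream>
#include <string>
#include <vector>

// Hypothetical helper: read the source once, write every chunk to both targets.
bool copy_to_two(const std::string& from,
                 const std::string& to_a,
                 const std::string& to_b) {
    std::ifstream src(from, std::ios::binary);
    std::ofstream dst_a(to_a, std::ios::binary | std::ios::trunc);
    std::ofstream dst_b(to_b, std::ios::binary | std::ios::trunc);
    if (!src || !dst_a || !dst_b) return false;

    std::vector<char> buf(1 << 16);  // 64 KiB chunk; the size is an arbitrary choice
    while (src) {
        src.read(buf.data(), static_cast<std::streamsize>(buf.size()));
        const std::streamsize n = src.gcount();  // may be short on the last chunk
        if (n <= 0) break;
        dst_a.write(buf.data(), n);
        dst_b.write(buf.data(), n);
        if (!dst_a || !dst_b) return false;
    }
    dst_a.flush();
    dst_b.flush();
    // Success means we stopped because of end-of-file, not a read error.
    return src.eof() && dst_a.good() && dst_b.good();
}
```

Note that both destination files are created (and truncated) before the source is checked, so a caller that cares about that would want to test the source first.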

So in my arguably seasoned programmer's opinion, a C++ file copy should just use the dedicated C++17 function std::filesystem::copy_file, unless more is known about the context where the file copy occurs and some clever strategies can be devised to outsmart the OS.
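
For reference, a minimal sketch of that call, using the non-throwing overload. The wrapper name copy_one_file is hypothetical, and overwrite_existing is just one possible policy for an existing target.

```cpp
#include <filesystem>
#include <system_error>

namespace fs = std::filesystem;

// Hypothetical wrapper: copy one regular file, replacing the target if it exists.
// Returns true if the file was copied; on failure returns false and sets `ec`.
bool copy_one_file(const fs::path& from, const fs::path& to, std::error_code& ec) {
    return fs::copy_file(from, to, fs::copy_options::overwrite_existing, ec);
}
```

How the copy is carried out underneath is left to the standard library implementation, which is the point of the recommendation above.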

Sharecropper answered 17/4, 2012 at 16:38 Comment(0)
