Windows fsync (FlushFileBuffers) performance with large files
Asked Answered
S

4

7

From information on ensuring data is on disk (http://winntfs.com/2012/11/29/windows-write-caching-part-2-an-overview-for-application-developers/), even in the case of e.g. a power outage, it appears that on Windows platforms you need to rely on its "fsync" version FlushFileBuffers to have the best guarantee that buffers are actually flushed from disk device caches onto the storage medium itself. The combination of FILE_FLAG_NO_BUFFERING with FILE_FLAG_WRITE_THROUGH does not ensure flushing the device cache, but merely have an effect on the file system cache, if this information is correct.

Given the fact that I will be working with rather large files, that need to be updated "transactionally", this means performing an "fsync" at the end of a transaction commit. So I created a tiny app to test the performance in doing so. It basically performs sequential writes of a batch of 8 memory page sized random bytes using 8 writes and then flushes. The batch is repeated in a loop, and after every so many written pages it records the performance. Additionally it has two configurable options: to fsync on a flush and whether to write a byte to the last position of the file, before beginning the page writes.

// Code updated to reflect new results as discussed in answer below.
// 26/Aug/2013: Code updated again to reflect results as discussed in follow up question.
// 28/Aug/2012: Increased file stream buffer to ensure 8 page flushes.
class Program
{
    static void Main(string[] args)
    {
        BenchSequentialWrites(reuseExistingFile:false);
    }
    public static void BenchSequentialWrites(bool reuseExistingFile = false)
    {
        Tuple<string, bool, bool, bool, bool>[] scenarios = new Tuple<string, bool, bool, bool, bool>[]
        {   // output csv, fsync?, fill end?, write through?, mem map?
            Tuple.Create("timing FS-E-B-F.csv", true, false, false, false),
            Tuple.Create("timing NS-E-B-F.csv", false, false, false, false),
            Tuple.Create("timing FS-LB-B-F.csv", true, true, false, false),
            Tuple.Create("timing NS-LB-B-F.csv", false, true, false, false),
            Tuple.Create("timing FS-E-WT-F.csv", true, false, true, false),
            Tuple.Create("timing NS-E-WT-F.csv", false, false, true, false),
            Tuple.Create("timing FS-LB-WT-F.csv", true, true, true, false),
            Tuple.Create("timing NS-LB-WT-F.csv", false, true, true, false),
            Tuple.Create("timing FS-E-B-MM.csv", true, false, false, true),
            Tuple.Create("timing NS-E-B-MM.csv", false, false, false, true),
            Tuple.Create("timing FS-LB-B-MM.csv", true, true, false, true),
            Tuple.Create("timing NS-LB-B-MM.csv", false, true, false, true),
            Tuple.Create("timing FS-E-WT-MM.csv", true, false, true, true),
            Tuple.Create("timing NS-E-WT-MM.csv", false, false, true, true),
            Tuple.Create("timing FS-LB-WT-MM.csv", true, true, true, true),
            Tuple.Create("timing NS-LB-WT-MM.csv", false, true, true, true),
        };
        foreach (var scenario in scenarios)
        {
            Console.WriteLine("{0,-12} {1,-16} {2,-16} {3,-16} {4:F2}", "Total pages", "Interval pages", "Total time", "Interval time", "MB/s");
            CollectGarbage();
            var timingResults = SequentialWriteTest("test.data", !reuseExistingFile, fillEnd: scenario.Item3, nPages: 200 * 1000, fSync: scenario.Item2, writeThrough: scenario.Item4, writeToMemMap: scenario.Item5);
            using (var report = File.CreateText(scenario.Item1))
            {
                report.WriteLine("Total pages,Interval pages,Total bytes,Interval bytes,Total time,Interval time,MB/s");
                foreach (var entry in timingResults)
                {
                    Console.WriteLine("{0,-12} {1,-16} {2,-16} {3,-16} {4:F2}", entry.Item1, entry.Item2, entry.Item5, entry.Item6, entry.Item7);
                    report.WriteLine("{0},{1},{2},{3},{4},{5},{6}", entry.Item1, entry.Item2, entry.Item3, entry.Item4, entry.Item5.TotalSeconds, entry.Item6.TotalSeconds, entry.Item7);
                }
            }
        }
    }

    public unsafe static IEnumerable<Tuple<long, long, long, long, TimeSpan, TimeSpan, double>> SequentialWriteTest(
        string fileName,
        bool createNewFile,
        bool fillEnd,
        long nPages,
        bool fSync = true,
        bool writeThrough = false,
        bool writeToMemMap = false,
        long pageSize = 4096)
    {
        // create or open file and if requested fill in its last byte.
        var fileMode = createNewFile ? FileMode.Create : FileMode.OpenOrCreate;
        using (var tmpFile = new FileStream(fileName, fileMode, FileAccess.ReadWrite, FileShare.ReadWrite, (int)pageSize))
        {
            Console.WriteLine("Opening temp file with mode {0}{1}", fileMode, fillEnd ? " and writing last byte." : ".");
            tmpFile.SetLength(nPages * pageSize);
            if (fillEnd)
            {
                tmpFile.Position = tmpFile.Length - 1;
                tmpFile.WriteByte(1);
                tmpFile.Position = 0;
                tmpFile.Flush(true);
            }
        }
        // Make sure any flushing / activity has completed
        System.Threading.Thread.Sleep(TimeSpan.FromMinutes(1));
        System.Threading.Thread.SpinWait(50); // warm up.

        var buf = new byte[pageSize];
        new Random().NextBytes(buf);
        var ms = new System.IO.MemoryStream(buf);

        var stopwatch = new System.Diagnostics.Stopwatch();
        var timings = new List<Tuple<long, long, long, long, TimeSpan, TimeSpan, double>>();
        var pageTimingInterval = 8 * 2000;
        var prevPages = 0L;
        var prevElapsed = TimeSpan.FromMilliseconds(0);

        // Open file
        const FileOptions NoBuffering = ((FileOptions)0x20000000);
        var options = writeThrough ? (FileOptions.WriteThrough | NoBuffering) : FileOptions.None;
        using (var file = new FileStream(fileName, FileMode.Open, FileAccess.ReadWrite, FileShare.ReadWrite, (int)(16 *pageSize), options))
        {
            stopwatch.Start();
            if (writeToMemMap)
            {
                // write pages through memory map.
                using (var mmf = MemoryMappedFile.CreateFromFile(file, Guid.NewGuid().ToString(), file.Length, MemoryMappedFileAccess.ReadWrite, null, HandleInheritability.None, true))
                using (var accessor = mmf.CreateViewAccessor(0, file.Length, MemoryMappedFileAccess.ReadWrite))
                {
                    byte* base_ptr = null;
                    accessor.SafeMemoryMappedViewHandle.AcquirePointer(ref base_ptr);
                    var offset = 0L;
                    for (long i = 0; i < nPages / 8; i++)
                    {
                        using (var memStream = new UnmanagedMemoryStream(base_ptr + offset, 8 * pageSize, 8 * pageSize, FileAccess.ReadWrite))
                        {
                            for (int j = 0; j < 8; j++)
                            {
                                ms.CopyTo(memStream);
                                ms.Position = 0;
                            }
                        }
                        FlushViewOfFile((IntPtr)(base_ptr + offset), (int)(8 * pageSize));
                        offset += 8 * pageSize;
                        if (fSync)
                            FlushFileBuffers(file.SafeFileHandle);

                        if (((i + 1) * 8) % pageTimingInterval == 0)
                            timings.Add(Report(stopwatch.Elapsed, ref prevElapsed, (i + 1) * 8, ref prevPages, pageSize));
                    }
                    accessor.SafeMemoryMappedViewHandle.ReleasePointer();
                }
            }
            else
            {
                for (long i = 0; i < nPages / 8; i++)
                {
                    for (int j = 0; j < 8; j++)
                    {
                        ms.CopyTo(file);
                        ms.Position = 0;
                    }
                    file.Flush(fSync);
                    if (((i + 1) * 8) % pageTimingInterval == 0)
                        timings.Add(Report(stopwatch.Elapsed, ref prevElapsed, (i + 1) * 8, ref prevPages, pageSize));
                }
            }
        }
        timings.Add(Report(stopwatch.Elapsed, ref prevElapsed, nPages, ref prevPages, pageSize));
        return timings;
    }

    private static Tuple<long, long, long, long, TimeSpan, TimeSpan, double> Report(TimeSpan elapsed, ref TimeSpan prevElapsed, long curPages, ref long prevPages, long pageSize)
    {
        var intervalPages = curPages - prevPages;
        var intervalElapsed = elapsed - prevElapsed;
        var intervalPageSize = intervalPages * pageSize;
        var mbps = (intervalPageSize / (1024.0 * 1024.0)) / intervalElapsed.TotalSeconds;
        prevElapsed = elapsed;
        prevPages = curPages;
        return Tuple.Create(curPages, intervalPages, curPages * pageSize, intervalPageSize, elapsed, intervalElapsed, mbps);
    }

    private static void CollectGarbage()
    {
        GC.Collect();
        GC.WaitForPendingFinalizers();
        System.Threading.Thread.Sleep(200);
        GC.Collect();
        GC.WaitForPendingFinalizers();
        System.Threading.Thread.SpinWait(10);
    }

    [DllImport("kernel32.dll", SetLastError = true)]
    static extern bool FlushViewOfFile(
        IntPtr lpBaseAddress, int dwNumBytesToFlush);

    [DllImport("kernel32.dll", SetLastError = true, CharSet = CharSet.Auto)]
    static extern bool FlushFileBuffers(SafeFileHandle hFile);
}

The performance results I am obtaining (64 bit Win 7, slow spindle disk) are not very encouraging. It appears that "fsync" performance depends very much on the size of the file being flushed, such that this dominates the time, and not the amount of "dirty" data to be flushed. The graph below shows the results for the 4 different settings options of the little benchmark app.

Benchmark timing for 4 the scenarios

As you can see, performance of "fsync" exponentially decreases as the file grows (until at a few GB it really grinds to a halt). Also, the disk itself does not seem to be doing a whole lot (i.e. resource monitor shows its active time as just around a few percent, and its disk queue mostly empty for most of the time).

I had obviously expected "fsync" performance to be quite a bit worse than doing normal buffered flushes, but I had expected it to be more or less constant and independent of file size. Like this it would seem to suggest that it is not usable in combination with a single large file.

Does somebody have an explanation, different experiences or a different solution that allows ensuring data is on disk and that has a more or less constant, predictable performance?

UPDATED See new information in answer below.

Sessile answered 16/8, 2013 at 15:8 Comment(0)
P
5

Your test shows exponential decrease in speed on the sync runs because you're recreating the file each time. In this case it's no longer a purely sequential write - every write also grows the file, which requires multiple seeks to update the file metadata in the filesystem. If you ran all of these jobs using a pre-existing fully allocated file, you'd see a much faster result because none of those metadata updates would be interfering.

I ran a similar test on my Linux box. The results while recreating the file each time:

mmap    direct  last    sync    time
0   0   0   0    0.882293s
0   0   0   1    27.050636s
0   0   1   0    0.832495s
0   0   1   1    26.966625s
0   1   0   0    5.775266s
0   1   0   1    22.063392s
0   1   1   0    5.265739s
0   1   1   1    24.203251s
1   0   0   0    1.031684s
1   0   0   1    28.244678s
1   0   1   0    1.031888s
1   0   1   1    29.540660s
1   1   0   0    1.032883s
1   1   0   1    29.408005s
1   1   1   0    1.035110s
1   1   1   1    28.948555s

The results using the pre-existing file (obviously the last_byte case is irrelevant here. Also, the very first result also had to create the file):

mmap    direct  last    sync    time
0   0   0   0    1.199310s
0   0   0   1    7.858803s
0   0   1   0    0.184925s
0   0   1   1    8.320572s
0   1   0   0    4.047780s
0   1   0   1    4.066993s
0   1   1   0    4.042564s
0   1   1   1    4.307159s
1   0   0   0    3.596712s
1   0   0   1    8.284428s
1   0   1   0    0.242584s
1   0   1   1    8.070947s
1   1   0   0    0.240500s
1   1   0   1    8.213450s
1   1   1   0    0.240922s
1   1   1   1    8.265024s

(Note that I only used 10,000 chunks not 25,000 chunks, so this is only writing 320MB, using an ext2 filesystem. I didn't have a larger ext2fs handy, my larger fs is XFS and it refused to allow mmap+direct I/O.)

Here's the code, if you're interested:

#define _GNU_SOURCE 1

#include <malloc.h>
#include <string.h>
#include <stdlib.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/time.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>

#define USE_MMAP    8
#define USE_DIRECT  4
#define USE_LAST    2
#define USE_SYNC    1

#define PAGE    4096
#define CHUNK   (8*PAGE)
#define NCHUNKS 10000
#define STATI   1000

#define FSIZE   (NCHUNKS*CHUNK)

main()
{
    int i, j, fd, rc, stc;
    char *data = valloc(CHUNK);
    char *map, *dst;
    char sfname[8];
    struct timeval start, end, stats[NCHUNKS/STATI+1];
    FILE *sfile;

    printf("mmap\tdirect\tlast\tsync\ttime\n");
    for (i=0; i<16; i++) {
        int oflag = O_CREAT|O_RDWR|O_TRUNC;

        if (i & USE_DIRECT)
            oflag |= O_DIRECT;
        fd = open("dummy", oflag, 0666);
        ftruncate(fd, FSIZE);
        if (i & USE_LAST) {
            lseek(fd, 0, SEEK_END);
            write(fd, data, 1);
            lseek(fd, 0, SEEK_SET);
        }
        if (i & USE_MMAP) {
            map = mmap(NULL, FSIZE, PROT_WRITE, MAP_SHARED, fd, 0);
            if (map == (char *)-1L) {
                perror("mmap");
                exit(1);
            }
            dst = map;
        }
        sprintf(sfname, "%x.csv", i);
        sfile = fopen(sfname, "w");
        stc = 1;
        printf("%d\t%d\t%d\t%d\t",
            (i&USE_MMAP)!=0, (i&USE_DIRECT)!=0, (i&USE_LAST)!=0, i&USE_SYNC);
        fflush(stdout);
        gettimeofday(&start, NULL);
        stats[0] = start;
        for (j = 1; j<=NCHUNKS; j++) {
            if (i & USE_MMAP) {
                memcpy(dst, data, CHUNK);
                if (i & USE_SYNC)
                    msync(dst, CHUNK, MS_SYNC);
                dst += CHUNK;
            } else {
                write(fd, data, CHUNK);
                if (i & USE_SYNC)
                    fdatasync(fd);
            }
            if (!(j % STATI)) {
                gettimeofday(&end, NULL);
                stats[stc++] = end;
            }
        }
        end.tv_usec -= start.tv_usec;
        if (end.tv_usec < 0) {
            end.tv_sec--;
            end.tv_usec += 1000000;
        }
        end.tv_sec -= start.tv_sec;
        printf(" %d.%06ds\n", (int)end.tv_sec, (int)end.tv_usec);
        if (i & USE_MMAP)
            munmap(map, FSIZE);
        close(fd);
        for (j=NCHUNKS/STATI; j>0; j--) {
            stats[j].tv_usec -= stats[j-1].tv_usec;
            if (stats[j].tv_usec < 0) {
                stats[j].tv_sec--;
                stats[j].tv_usec+= 1000000;
            }
            stats[j].tv_sec -= stats[j-1].tv_sec;
        }
        for (j=1; j<=NCHUNKS/STATI; j++)
            fprintf(sfile, "%d\t%d.%06d\n", j*STATI*CHUNK,
                (int)stats[j].tv_sec, (int)stats[j].tv_usec);
        fclose(sfile);
    }
}
Piecrust answered 25/8, 2013 at 13:55 Comment(1)
Thanks for looking into this Howard. I will study these results and rerun my tests using the pre-existing file scenario you mention. Results on a Linux and a Windows box may be different though as I suspect that OS file cache and/or kernel paging layer may play a significant role in the degrading performance. But perhaps, as you mention, this is related to file metadata updates.Sessile
P
2

Here's a Windows version of my synctest code. I only ran it inside a VirtualBox vm, so I don't think I have any useful numbers for comparison but you could give it a shot to compare to your C# numbers on your machine. I'm passing OPEN_ALWAYS to CreateFile, so it will reuse the existing file. Change that flag to CREATE_ALWAYS if you want to test again with an empty file each time.

One thing I noticed is that the results were much faster the first time I ran this program. Perhaps NTFS is not very efficient at overwriting existing data, and file fragmentation effects showed up on the subsequent runs.

#include <windows.h>
#include <stdio.h>

#define USE_MMAP    8
#define USE_DIRECT  4
#define USE_LAST    2
#define USE_SYNC    1

#define PAGE    4096
#define CHUNK   (8*PAGE)
#define NCHUNKS 10000
#define STATI   1000

#define FSIZE   (NCHUNKS*CHUNK)

static LARGE_INTEGER cFreq;

int gettimeofday(struct timeval *tv, void *unused)
{
    LARGE_INTEGER count;
    if (!cFreq.QuadPart) {
        QueryPerformanceFrequency(&cFreq);
    }
    QueryPerformanceCounter(&count);
    tv->tv_sec = count.QuadPart / cFreq.QuadPart;
    count.QuadPart %= cFreq.QuadPart;
    count.QuadPart *= 1000000;
    tv->tv_usec = count.QuadPart / cFreq.QuadPart;
    return 0;
}

main()
{
    int i, j, rc, stc;
    HANDLE fd;
    char *data = _aligned_malloc(CHUNK, PAGE);
    char *map, *dst;
    char sfname[8];
    struct timeval start, end, stats[NCHUNKS/STATI+1];
    FILE *sfile;
    DWORD len;

    printf("mmap\tdirect\tlast\tsync\ttime\n");
    for (i=0; i<16; i++) {
        int oflag = FILE_ATTRIBUTE_NORMAL;

        if (i & USE_DIRECT)
            oflag |= FILE_FLAG_NO_BUFFERING|FILE_FLAG_WRITE_THROUGH;
        fd = CreateFile("dummy", GENERIC_READ|GENERIC_WRITE, 0, NULL,
            OPEN_ALWAYS, oflag, NULL);
        SetFilePointer(fd, FSIZE, NULL, FILE_BEGIN);
        SetEndOfFile(fd);
        if (i & USE_LAST)
            WriteFile(fd, data, 1, &len, NULL);
        SetFilePointer(fd, 0, NULL, FILE_BEGIN);
        if (i & USE_MMAP) {
            HANDLE mh;
            mh = CreateFileMapping(fd, NULL, PAGE_READWRITE,
                0, FSIZE, NULL);
            map = MapViewOfFile(mh, FILE_MAP_WRITE, 0, 0,
                FSIZE);
            CloseHandle(mh);
            dst = map;
        }
        sprintf(sfname, "%x.csv", i);
        sfile = fopen(sfname, "w");
        stc = 1;
        printf("%d\t%d\t%d\t%d\t",
            (i&USE_MMAP)!=0, (i&USE_DIRECT)!=0, (i&USE_LAST)!=0, i&USE_SYNC);
        fflush(stdout);
        gettimeofday(&start, NULL);
        stats[0] = start;
        for (j = 1; j<=NCHUNKS; j++) {
            if (i & USE_MMAP) {
                memcpy(dst, data, CHUNK);
                FlushViewOfFile(dst, CHUNK);
                dst += CHUNK;
            } else {
                WriteFile(fd, data, CHUNK, &len, NULL);
            }
            if (i & USE_SYNC)
                FlushFileBuffers(fd);
            if (!(j % STATI)) {
                gettimeofday(&end, NULL);
                stats[stc++] = end;
            }
        }
        end.tv_usec -= start.tv_usec;
        if (end.tv_usec < 0) {
            end.tv_sec--;
            end.tv_usec += 1000000;
        }
        end.tv_sec -= start.tv_sec;
        printf(" %d.%06ds\n", (int)end.tv_sec, (int)end.tv_usec);
        if (i & USE_MMAP)
            UnmapViewOfFile(map);
        CloseHandle(fd);
        for (j=NCHUNKS/STATI; j>0; j--) {
            stats[j].tv_usec -= stats[j-1].tv_usec;
            if (stats[j].tv_usec < 0) {
                stats[j].tv_sec--;
                stats[j].tv_usec+= 1000000;
            }
            stats[j].tv_sec -= stats[j-1].tv_sec;
        }
        for (j=1; j<=NCHUNKS/STATI; j++)
            fprintf(sfile, "%d\t%d.%06d\n", j*STATI*CHUNK,
                (int)stats[j].tv_sec, (int)stats[j].tv_usec);
        fclose(sfile);
    }
}
Piecrust answered 25/8, 2013 at 20:27 Comment(1)
Thanks Howard, I will also run that one. It may shed some light on the findings of my reruns using a fully written file. I am just in the process of adding those to my questions. You will see that some unexplained and unexpected behavior remains, whereas the explanation you provided fits well with some of the performance changes observed in that test.Sessile
S
1

I have experimented and benched some more and found a solution that may be acceptable to me (although currently I have only tested sequential writes). In the process I discovered some unexpected behaviors that raise a number of new questions. I will be posting a new SO question (Explanation/information sought: Windows write I/O performance with "fsync" (FlushFileBuffers)) for those.

I added the following two additional options to my benchmark:

  • Use unbuffered/writethrough writes (i.e. specifying the FILE_FLAG_NO_BUFFERING and FILE_FLAG_WRITE_THROUGH flags)
  • Write to the file indirectly, through a memory mapped file.

This provided me with some unexpected results, one of which gives me a more or less acceptable solution to my problem. When "fsyncing" in combination with unbuffered/writethrough I/O, I do not observe an exponential decay in the write speed. Thus (although it is not very fast) this provides me with a solution that allows ensuring data is on disk, and that has a constant predictable performance that is not impacted by the size of the file.

A few other unexpected results were the following:

  • If a byte is written to the last position in the file, before performing the page writes with the "fsync" and "unbuffered/writethrough" options, the write throughput almost doubles.
  • Performance of unbuffered/writethrough with or without fsync are nearly identical, except when a byte has been written to the last position in the file. The write throughput of the "unbuffered/writethrough" scenario without "fsync" on an empty file is around 12.5 MB/s, whereas in the same scenario on a file that has a byte written in the last position in the file the throughput is three times higher at 37 MB/s.
  • Writing to a file indirectly through a memory mapped file in combination with "fsync" shows the same exponential throughput decrease as observed in the buffered writes direct to a file, even if "unbuffered/writethrough" is set on the file.

I have added the updated code I used for the benchmark to my original question.

The graph below shows some of the additional new results.

Throughput for different combinations of options

Sessile answered 19/8, 2013 at 6:51 Comment(0)
A
0

[Wrong; see comments.]

I believe that the article you reference is incorrect in stating that FlushFileBuffers has any useful effect on unbuffered I/O. It refers to a Microsoft paper, but the paper in question makes no such claim.

According to the documentation, using unbuffered I/O has the same effect as, but is more efficient than, calling FlushFileBuffer after every write. So the practical solution is to use unbuffered I/O rather than using FlushFileBuffer.

Note however that using a memory mapped file defeats the buffering settings. I would not recommend using a memory mapped file if you are trying to push the data to disk as soon as possible.

Arnoldarnoldo answered 22/8, 2013 at 21:56 Comment(3)
From what I understand the article is correct. Also confirmed in e.g. here: support.microsoft.com/kb/332023 (see "More Information" section). FlushFileBuffers translates to SYNCHRONIZE CACHE command for SCSI devices and FLUSH CACHE command for IDE/ATAPI devices. Write Through is issued as ForceUnitAccess which apparently is usually not honored by IDE/ITAPI devices. I believe this is also what is mentioned in the referenced article ftp.research.microsoft.com/pub/tr/TR-2008-36.pdf.Sessile
@alex these articles are talking about a cache inside the disk/controller, not the OS level. When using writethrough, it should have made it to this 'in the disk cache', but it isnt actually 'on the disk'. For tests such as your Q, this cache might explain the exp-decline as your IO volume grows. This cache is a big deal for databases when these 'in the disk' caches fail under powerfail condition before being written to a sector, see support.microsoft.com/kb/234656Coelho
I can't find anything in that paper that says anything about the Win32 API. However the KB article certainly does!Arnoldarnoldo

© 2022 - 2024 — McMap. All rights reserved.