Files loading slower on second run of application, with repro code

Application Description

I have an offline data processing tool. This tool loads hundreds of thousands of files; for each one it performs some calculations and, when done, writes a single index file. It is all C++ (all I/O is via standard library objects/functions) and is compiled with Visual Studio 2013 targeting amd64.

Performance

My test dataset has 115,757 files that need to be processed. The files total 731 MB in size, and the median file size is 6 KB.

  • First run: 12 seconds
  • Second run: ~18 minutes

That's 90x slower! The second run time is extrapolated from one minute of running (about 18 minutes ≈ 1080 s vs 12 s, hence ~90x). Every run after that, in my experience so far, has been equally slow.

Surprise!

If I rename the folder with the files in it, and then rename it back to what it originally was, the next time I run the application it will again perform quickly!

It's the same app, machine, and source data. The only difference is that one folder was temporarily renamed.

So far I can reproduce this 100% of the time.

Profiling

Naturally, the next step was to profile. I profiled the quick run and the slow run and compared the hot spots. In the slow version, about 86% of the application's time was spent in a function called NtfsFindPrefix; the quick version spends about 0.4% of its time there. This is the call stack:

Ntfs.sys!NtfsFindPrefix<itself>
Ntfs.sys!NtfsFindPrefix
Ntfs.sys!NtfsFindStartingNode
Ntfs.sys!NtfsCommonCreate
Ntfs.sys!NtfsCommonCreateCallout
ntoskrnl.exe!KySwitchKernelStackCallout
ntoskrnl.exe!KiSwitchKernelStackContinue
ntoskrnl.exe!KeExpandKernelStackAndCalloutEx
Ntfs.sys!NtfsCommonCreateOnNewStack
Ntfs.sys!NtfsFsdCreate
fltmgr.sys!FltpLegacyProcessingAfterPreCallbacksCompleted
fltmgr.sys!FltpCreate
ntoskrnl.exe!IopParseDevice
ntoskrnl.exe!ObpLookupObjectName
ntoskrnl.exe!ObOpenObjectByName
ntoskrnl.exe!NtQueryAttributesFile
ntoskrnl.exe!KiSystemServiceCopyEnd
ntdll.dll!NtQueryAttributesFile
KernelBase.dll!GetFileAttributesW
DataGenerator.exe!boost::filesystem::detail::status

The boost call in question is an exists call. It will test for the zipped version of a file, fail to find it, and then test for the unzipped one and find it.
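
For reference, here is a minimal sketch of the kind of check involved. The names, the ".z" suffix, and the helper function are illustrative assumptions, not the actual application code:

#include <boost/filesystem.hpp>

namespace bfs = boost::filesystem;

// Sketch of the existence probe described above: each input is first probed for a
// compressed sibling (which usually doesn't exist), then for the plain file.
bfs::path resolveInput( const bfs::path& plainFile )
{
    const bfs::path zipped( plainFile.string() + ".z" ); // assumed suffix

    if( bfs::exists( zipped ) )    // usually fails; this is the probe for a missing name
        return zipped;
    if( bfs::exists( plainFile ) ) // usually succeeds
        return plainFile;
    return bfs::path();
}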

Profiling also showed that the disk did not get hit by either run of the application; however, file I/O was, as expected, high. I believe this indicates that the files were already paged into memory.

The file I/O data also showed that the average duration of file "Create" events was MUCH higher in the slow version: 26 µs vs 11,704 µs.

Machine

  • Samsung SSD 830 Series
  • Intel i7 860
  • Windows 7 64 bit
  • NTFS file system
  • 32 GB RAM

Summary

  • On the second run the calls into NtfsFindPrefix take much longer.
  • This is a function in the NTFS driver.
  • The disk didn't get hit in either profile; the files were served from pages in memory.
  • A rename operation seems to be enough to stop this issue from occurring on the next run.

Question

Now that the background info is out of the way: does anyone recognise what is going on, and know how to fix it?

It seems like I could work around it by renaming the folder myself, but that seems... dirty. Plus, I'm not sure why that even works.

Is the rename invalidating the pages in memory and causing them to get updated before the next run? Is this a bug in the NTFS driver?

Thanks for reading!


Update!!

After some more profiling, it looks like the part that is performing slower is the test for whether the non-existent zipped file exists. If I remove this test, everything seems to get quick again.
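
A quick way to see this in isolation is to time the call that the stack above bottoms out in, GetFileAttributesW, for a name that exists and for one that does not. This is only a sketch and the paths are examples:

#include <windows.h>
#include <chrono>
#include <iostream>

// Average the cost of an attribute query over many iterations. This is the same
// call that boost::filesystem::exists ends up in (see the call stack above).
double averageQueryMicroseconds( const wchar_t* path, int iterations )
{
    auto start = std::chrono::high_resolution_clock::now();
    for( int i = 0; i < iterations; ++i )
        GetFileAttributesW( path );
    auto end = std::chrono::high_resolution_clock::now();
    return std::chrono::duration< double, std::micro >( end - start ).count() / iterations;
}

int main()
{
    // Example paths only: a file from the dataset and its non-existent "z" sibling.
    std::wcout << L"existing file: " << averageQueryMicroseconds( L"out\\1234", 100 )  << L" us\n";
    std::wcout << L"missing file:  " << averageQueryMicroseconds( L"out\\1234z", 100 ) << L" us\n";
    return 0;
}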

I have also managed to reproduce this issue in a small C++ app for everyone to see. Note that the sample code will create 100k 6 KB files in the current directory on your machine. Can anyone else repro it?

// using VS tr2 could replace with boost::filesystem
#include <filesystem>
namespace fs = std::tr2::sys;
//namespace fs = boost::filesystem;

#include <iostream>
#include <string>
#include <chrono>
#include <fstream>

void createFiles( fs::path outDir )
{
    // create 100k 6KB files; the contents don't matter (they are all the same), so the
    // buffer is simply zero-initialised rather than left indeterminate
    fs::create_directory( outDir );
    char buf[6144] = {};
    for( int i = 0; i < 100000; ++i )
    {
        std::ofstream fout( outDir / fs::path( std::to_string( i ) ), std::ios::binary );
        fout.write( buf, 6144 );
    }

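    // rename the folder and rename it back: this is the trick described above that makes
    // the next (timed) pass fast; later runs skip createFiles, so they hit the slow case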
    fs::rename( outDir, fs::path( outDir.string() + "_tmp" ) );
    fs::rename( fs::path( outDir.string() + "_tmp" ), outDir );
}

int main( int argc, const char* argv[] )
{
    fs::path outDir = "out";

    if( !fs::exists( outDir ) )
        createFiles( outDir );

    auto start = std::chrono::high_resolution_clock::now();

    int counter = 0;
    for( fs::recursive_directory_iterator i( outDir ), iEnd; i != iEnd; ++i )
    {
        // test the non-existent path first, then the one that does exist
        if( !fs::exists( fs::path( i->path().string() + "z" ) ) && fs::exists( i->path() ) )
            counter += 1;

        if( counter % 100 == 0 )
            std::cout << counter << std::endl;
    }
    std::cout << counter << std::endl;

    auto end = std::chrono::high_resolution_clock::now();
    std::chrono::duration< double, std::milli > s( end - start );

    std::cout << "Time Passed: " << s.count() << "ms" << std::endl;

    return 0;
}

Update 2

I have logged an issue with MS here. Hopefully they can help shed some light on the issue.

Guncotton answered 18/2, 2015 at 9:12 Comment(18)
The only thing I can think of that may be remotely close to this is if the first run locks the file(s), preventing the second run from accessing them, but that is unlikely to take that long. You would get an error before that.Afterburner
The code in your program that reads the file data is causing the file system cache to lose track of the directory entries. Too much data, it pushes out the old info. It is fast the first time, the cache still has them from whatever you did before. Slow after that, now the disk reader head has to grind through the MFT to find the files. You need more RAM or a faster disk. A database should be on the top of the list.Mila
@HansPassant - As I mentioned, profiling showed the disk did not get hit by my application at all on either run.Guncotton
115,757 files and never hitting the disk. Hmm.Mila
Median file size is 6KB, the files take up 731MB, and I have 32GB of RAM. I'll update the question with these figures.Guncotton
Suggestion: copy the files before processing the copies. Or do that in batches, so as to have usable progress at any time.Junette
Update: All the additional time in the slow version is spent looking for the zipped version of the file, which doesn't exist. Both versions do this check though.Guncotton
I now have repro code up! Only ~50 lines to repro this issue.Guncotton
There were some problems with compiling (probably something with my local settings), so I just used Qt for the file parts. These are the results: first run, with creating files and checking for non-existing and existing files: 13230 ms; second run, with just checking for non-existing and existing files: 10960 ms.Hector
Interesting. I've been able to repro this on a number of machines now. What are your machine's details? Mine are Windows 7 64 bit, NTFS file system, 32GB RAM, Samsung SSD drive and i7 CPU.Guncotton
@Guncotton Did you try reproducing on a HDD, not SSD?Ichthyosaur
I just tried it out, and I get it on HDD also.Guncotton
Mine is an i7 2.4GHz, 8GB RAM, 64-bit Windows 7 Enterprise (SP1), NTFS, SSD.Hector
Now it works slowly the first few times, but gets faster after running a few more times.Hector
Consider uploading an ETW trace so that people can investigate without having to run the repro code. This also serves as an archive of how it was behaving, and will include many pertinent details such as amount of memory, type of disk, OS version, etc.Tailpiece
It seems very likely that this is a performance bug in ntfs.sys. A run with all of the data cached should not be slower, period. You could report it to Microsoft, with a trace and the repro, and leave it at that. If you want to dig deeper: Look at the slow data with the CPU Usage (Sampled) columns arranged as Process, Thread, module, function, address, orange-bar, count. Then drill into Ntfs.sys!NtfsFindPrefix, and then sort by address. You now have a map of samples within the function. Use local kernel debugging to grab the assembly for this function and correlate them.Tailpiece
@Guncotton I had the same behaviour on my Windows 7 machine, and there was no difference on Windows 10. Seems that it was a bug in the NTFS drivers. <offtop>nice names of main function parameters</offtop> ))Corruption
Derp. Fixed. Thanks. Also, for other people reading: before MS closed MS Connect, there was a comment in there saying the driver team had marked this issue as "Will not fix", but I couldn't link directly to the issue as it was in a private MS issue tracker. However, I now cannot find any information about this. RIP MS Connect.Guncotton

My best guess is that the Windows Indexer is creating a lock on the files right after they are modified, as Damon said.

NtfsFindPrefix sure does quite a bit of FCB acquiring and releasing, but unless the code is incorrect (in which case it should fail), why would it be slower on the second run? Windows Explorer or the Indexing Service doing some kind of "smart shit" after you have modified a file in that folder, such as scanning all files to adjust the specialized folder type, and holding a lock while doing so, maybe? Did you test what happens if you wait a few minutes doing nothing before running a second time, or if you kill Explorer and the indexer?

I would suggest creating a folder used specifically to keep data for any program you write:

Example

C:/Users/%yourusername%/DataAnalysis

Use this folder to keep any data files accessed by your original software. After creating the folder, you have to disable indexing.

I believe that in the folder properties menu there is an option to disable indexing of the folder. If not, follow the instructions here to disable indexing for that folder and avoid problems.

The page will tell you to open Control Panel and go to Indexing Options. There you select Modify, navigate through your file system, and deselect the folder you just created so that its contents are not indexed.
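
If you would rather do this from code than through the dialogs, here is a minimal sketch using the Win32 API. The path is just an example, and as far as I know the attribute set on the folder only affects files created in it afterwards, so existing files may need it applied individually:

#include <windows.h>
#include <iostream>

// Sketch: set the "not content indexed" attribute on a folder while preserving
// its other attributes. Error handling is kept to a minimum.
bool disableContentIndexing( const wchar_t* folder )
{
    const DWORD attrs = GetFileAttributesW( folder );
    if( attrs == INVALID_FILE_ATTRIBUTES )
        return false;
    return SetFileAttributesW( folder, attrs | FILE_ATTRIBUTE_NOT_CONTENT_INDEXED ) != 0;
}

int main()
{
    // Example path only; substitute the data folder you created.
    if( !disableContentIndexing( L"C:\\Users\\me\\DataAnalysis" ) )
        std::wcerr << L"Failed to update folder attributes" << std::endl;
    return 0;
}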

Let me know if it works!

Riocard answered 18/2, 2015 at 19:15 Comment(2)
It didn't work. I also completely disabled my antivirus's real-time file system protection. That improved the performance of the first and second runs, but the large difference between the two persists.Guncotton
Also, FYI: I left my computer on all night, giving it ample time to complete indexing, yet my next run this morning was equally slow.Guncotton

If the bottleneck is in the exists call, then you will probably get far better performance by reading the list of files into an in-memory data structure and testing for file existence against that data structure.

Use FindFirstFileEx/FindNextFile to get the names of all files in your working folder. Load the results into a std::vector (or your container of choice). Sort. Use std::binary_search to check if a specific file exists.

I've written a lot of tools that work with large numbers of files in a single folder and it's been my experience that FindFirstFileEx/FindNextFile is the best option for these scenarios.
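
A sketch of that approach (illustrative only, with minimal error handling; FindExInfoBasic and FIND_FIRST_EX_LARGE_FETCH are available on Windows 7 and later):

#include <windows.h>
#include <algorithm>
#include <string>
#include <vector>

// Enumerate the folder once, sort the names, then answer "does this file exist?"
// from memory with a binary search instead of a kernel-mode path lookup per query.
std::vector< std::wstring > listFiles( const std::wstring& folder )
{
    std::vector< std::wstring > names;
    WIN32_FIND_DATAW fd;
    HANDLE h = FindFirstFileExW( ( folder + L"\\*" ).c_str(), FindExInfoBasic, &fd,
                                 FindExSearchNameMatch, NULL, FIND_FIRST_EX_LARGE_FETCH );
    if( h == INVALID_HANDLE_VALUE )
        return names;
    do
    {
        if( !( fd.dwFileAttributes & FILE_ATTRIBUTE_DIRECTORY ) )
            names.push_back( fd.cFileName );
    } while( FindNextFileW( h, &fd ) );
    FindClose( h );

    std::sort( names.begin(), names.end() );
    return names;
}

// Note: this compares names case-sensitively; normalise case first if that matters.
bool fileExists( const std::vector< std::wstring >& sortedNames, const std::wstring& name )
{
    return std::binary_search( sortedNames.begin(), sortedNames.end(), name );
}

The check for the zipped and unzipped variants then becomes two in-memory lookups per file instead of two path lookups through the file system.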

Cyclopean answered 19/2, 2015 at 21:56 Comment(0)
