How can I detect only deleted, changed, and created files on a volume?

I need to know if there is an easy way of detecting only the files that were deleted, modified or created on an NTFS volume.

I have written a program for offsite backup in C++. After the first backup, I check the archive bit of each file to see if there was any change made, and back up only the files that were changed. Also, it backs up from the VSS snapshot in order to prevent file locks.

This seems to work fine on most file systems, but on volumes with lots of files and directories the scan takes too long, and the backup often needs more than a day to finish.

I tried using the change journal to detect changes made on an NTFS volume, but the change journal shows a lot of records, most of them relating to small temporary files that were created and then destroyed. Also, I could get the file name, the file reference number, and the parent file reference number, but I could not get the full file path. The parent file reference number is somehow supposed to give you the parent directory path.

EDIT: This needs to run every day, so at the beginning of every scan, it should record only the changes that took place since the last scan. Or at least, there should be a way to ask for the changes since a given date and time.

Tabriz answered 14/9, 2011 at 18:49 Comment(0)
S
26

You can enumerate all the files on a volume using FSCTL_ENUM_USN_DATA. This is a fast process (my tests returned better than 6000 records per second even on a very old machine, and 20000+ is more typical) and only includes files that currently exist.

The data returned includes the file flags as well as the USNs so you could check for changes whichever way you prefer.
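
As a rough illustration of the two checks mentioned above, here is a minimal sketch that works on a simplified copy of the record fields rather than on the real USN_RECORD. The `FileInfo` struct, `changed_since`, and `has_archive_bit` are hypothetical names introduced for this example. It assumes `checkpoint_usn` is the journal's NextUsn value saved just before the previous scan started, so any record with an equal or higher USN was touched after that point.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Same values as FILE_ATTRIBUTE_DIRECTORY and FILE_ATTRIBUTE_ARCHIVE
// in the Windows headers, repeated here so the sketch is portable.
constexpr std::uint32_t kAttrDirectory = 0x10;
constexpr std::uint32_t kAttrArchive   = 0x20;

// Hypothetical, reduced copy of the fields of interest from one record.
struct FileInfo {
    std::uint64_t frn;         // FileReferenceNumber
    std::int64_t  usn;         // Usn (a signed 64-bit value on Windows)
    std::uint32_t attributes;  // FileAttributes
};

// Option 1: a file counts as changed if its last USN is at or past the
// checkpoint saved before the previous scan (assumption stated above).
std::vector<std::uint64_t> changed_since(const std::vector<FileInfo>& files,
                                         std::int64_t checkpoint_usn)
{
    assert(checkpoint_usn >= 0);
    std::vector<std::uint64_t> changed;
    for (const FileInfo& f : files) {
        if (f.attributes & kAttrDirectory) continue;  // only files are backed up
        if (f.usn >= checkpoint_usn) changed.push_back(f.frn);
    }
    return changed;
}

// Option 2: rely on the archive bit, as the original backup program does.
bool has_archive_bit(const FileInfo& f)
{
    return (f.attributes & kAttrArchive) != 0;
}
```

With the USN approach the backup program does not have to clear any attribute after a run; it only has to persist the checkpoint value between runs.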

You will still need to work out the full path for the files by matching the parent IDs with the file IDs of the directories. One approach would be to use a buffer large enough to hold all the file records simultaneously, and search through the records to find the matching parent for each file you need to back up. For large volumes you would probably need to process the directory records into a more efficient data structure, perhaps a hash table.
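
The hash-table idea could be sketched as follows. This is only an outline under stated assumptions: `DirEntry`, `DirIndex`, and `build_path` are hypothetical names, the index is assumed to have been filled from the directory records of one full enumeration, and a parent that is missing from the index is treated as the volume root.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>

// Hypothetical, reduced view of one directory record: just what is
// needed to build a path (this is not the real USN_RECORD layout).
struct DirEntry {
    std::uint64_t parent_frn;  // ParentFileReferenceNumber of the directory
    std::wstring  name;        // directory name copied out of the record
};

// Maps a directory's FileReferenceNumber to its entry.
using DirIndex = std::unordered_map<std::uint64_t, DirEntry>;

// Builds a volume-relative path such as "Users\roy\test.txt" by walking the
// parent links until a parent is not in the index or refers to itself.
std::wstring build_path(const DirIndex& dirs, std::uint64_t parent_frn,
                        const std::wstring& file_name)
{
    assert(!file_name.empty());
    std::wstring path = file_name;
    // Bound the walk so a corrupt chain of parents cannot loop forever.
    for (int depth = 0; depth < 1024; ++depth) {
        auto it = dirs.find(parent_frn);
        if (it == dirs.end()) break;                     // not indexed: treat as root
        if (it->second.parent_frn == parent_frn) break;  // self-parented: stop here
        path = it->second.name + L"\\" + path;
        parent_frn = it->second.parent_frn;
    }
    return path;
}
```

Each lookup is a constant-time hash probe on average, so resolving a path costs one probe per directory level instead of a search through every record.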

Alternately, you can read/reread the records for the parent directories as needed. This would be less efficient, but the performance might still be satisfactory depending on how many files are being backed up. Windows does appear to cache the data returned by FSCTL_ENUM_USN_DATA.

This program searches the C volume for files named test.txt and returns information about any files found, as well as about their parent directories.

#include <Windows.h>

#include <stdio.h>

#define BUFFER_SIZE (1024 * 1024)

HANDLE drive;
USN maxusn;

void show_record (USN_RECORD * record)
{
    void * buffer;
    MFT_ENUM_DATA mft_enum_data;
    DWORD bytecount = 1;
    USN_RECORD * parent_record;

    WCHAR * filename;
    WCHAR * filenameend;

    printf("=================================================================\n");
    printf("RecordLength: %u\n", record->RecordLength);
    printf("MajorVersion: %u\n", (DWORD)record->MajorVersion);
    printf("MinorVersion: %u\n", (DWORD)record->MinorVersion);
    printf("FileReferenceNumber: %lu\n", record->FileReferenceNumber);
    printf("ParentFRN: %lu\n", record->ParentFileReferenceNumber);
    printf("USN: %lu\n", record->Usn);
    printf("Timestamp: %lu\n", record->TimeStamp);
    printf("Reason: %u\n", record->Reason);
    printf("SourceInfo: %u\n", record->SourceInfo);
    printf("SecurityId: %u\n", record->SecurityId);
    printf("FileAttributes: %x\n", record->FileAttributes);
    printf("FileNameLength: %u\n", (DWORD)record->FileNameLength);

    filename = (WCHAR *)(((BYTE *)record) + record->FileNameOffset);
    filenameend= (WCHAR *)(((BYTE *)record) + record->FileNameOffset + record->FileNameLength);

    /* The precision argument for %.*ls must be an int, not a ptrdiff_t. */
    printf("FileName: %.*ls\n", (int)(filenameend - filename), filename);

    buffer = VirtualAlloc(NULL, BUFFER_SIZE, MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE);

    if (buffer == NULL)
    {
        printf("VirtualAlloc: %u\n", GetLastError());
        return;
    }

    mft_enum_data.StartFileReferenceNumber = record->ParentFileReferenceNumber;
    mft_enum_data.LowUsn = 0;
    mft_enum_data.HighUsn = maxusn;

    if (!DeviceIoControl(drive, FSCTL_ENUM_USN_DATA, &mft_enum_data, sizeof(mft_enum_data), buffer, BUFFER_SIZE, &bytecount, NULL))
    {
        printf("FSCTL_ENUM_USN_DATA (show_record): %u\n", GetLastError());
        VirtualFree(buffer, 0, MEM_RELEASE);
        return;
    }

    parent_record = (USN_RECORD *)((USN *)buffer + 1);

    if (parent_record->FileReferenceNumber != record->ParentFileReferenceNumber)
    {
        printf("=================================================================\n");
        printf("Couldn't retrieve FileReferenceNumber %u\n", record->ParentFileReferenceNumber);
        VirtualFree(buffer, 0, MEM_RELEASE);
        return;
    }

    show_record(parent_record);

    /* parent_record points into buffer, so release it only after the recursive call returns. */
    VirtualFree(buffer, 0, MEM_RELEASE);
}

void check_record(USN_RECORD * record)
{
    WCHAR * filename;
    WCHAR * filenameend;

    filename = (WCHAR *)(((BYTE *)record) + record->FileNameOffset);
    filenameend= (WCHAR *)(((BYTE *)record) + record->FileNameOffset + record->FileNameLength);

    if (filenameend - filename != 8) return;

    if (wcsncmp(filename, L"test.txt", 8) != 0) return;

    show_record(record);
}

int main(int argc, char ** argv)
{
    MFT_ENUM_DATA mft_enum_data;
    DWORD bytecount = 1;
    void * buffer;
    USN_RECORD * record;
    USN_RECORD * recordend;
    USN_JOURNAL_DATA * journal;
    DWORDLONG nextid = 0; /* printed on exit even if the first enumeration call fails */
    DWORDLONG filecount = 0;
    DWORD starttick, endtick;

    starttick = GetTickCount();

    printf("Allocating memory.\n");

    buffer = VirtualAlloc(NULL, BUFFER_SIZE, MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE);

    if (buffer == NULL)
    {
        printf("VirtualAlloc: %u\n", GetLastError());
        return 0;
    }

    printf("Opening volume.\n");

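    /* Volume handles must be opened with OPEN_EXISTING; the process also needs administrator rights. */
    drive = CreateFile(L"\\\\?\\c:", GENERIC_READ, FILE_SHARE_DELETE | FILE_SHARE_READ | FILE_SHARE_WRITE, NULL, OPEN_EXISTING, FILE_FLAG_NO_BUFFERING, NULL);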

    if (drive == INVALID_HANDLE_VALUE)
    {
        printf("CreateFile: %u\n", GetLastError());
        return 0;
    }

    printf("Calling FSCTL_QUERY_USN_JOURNAL\n");

    if (!DeviceIoControl(drive, FSCTL_QUERY_USN_JOURNAL, NULL, 0, buffer, BUFFER_SIZE, &bytecount, NULL))
    {
        printf("FSCTL_QUERY_USN_JOURNAL: %u\n", GetLastError());
        return 0;
    }

    journal = (USN_JOURNAL_DATA *)buffer;

    printf("UsnJournalID: %lu\n", journal->UsnJournalID);
    printf("FirstUsn: %lu\n", journal->FirstUsn);
    printf("NextUsn: %lu\n", journal->NextUsn);
    printf("LowestValidUsn: %lu\n", journal->LowestValidUsn);
    printf("MaxUsn: %lu\n", journal->MaxUsn);
    printf("MaximumSize: %lu\n", journal->MaximumSize);
    printf("AllocationDelta: %lu\n", journal->AllocationDelta);

    maxusn = journal->MaxUsn;

    mft_enum_data.StartFileReferenceNumber = 0;
    mft_enum_data.LowUsn = 0;
    mft_enum_data.HighUsn = maxusn;

    for (;;)
    {
//      printf("=================================================================\n");
//      printf("Calling FSCTL_ENUM_USN_DATA\n");

        if (!DeviceIoControl(drive, FSCTL_ENUM_USN_DATA, &mft_enum_data, sizeof(mft_enum_data), buffer, BUFFER_SIZE, &bytecount, NULL))
        {
            printf("=================================================================\n");
            printf("FSCTL_ENUM_USN_DATA: %u\n", GetLastError());
            printf("Final ID: %lu\n", nextid);
            printf("File count: %lu\n", filecount);
            endtick = GetTickCount();
            printf("Ticks: %u\n", endtick - starttick);
            return 0;
        }

//      printf("Bytes returned: %u\n", bytecount);

        nextid = *((DWORDLONG *)buffer);
//      printf("Next ID: %lu\n", nextid);

        record = (USN_RECORD *)((USN *)buffer + 1);
        recordend = (USN_RECORD *)(((BYTE *)buffer) + bytecount);

        while (record < recordend)
        {
            filecount++;
            check_record(record);
            record = (USN_RECORD *)(((BYTE *)record) + record->RecordLength);
        }

        mft_enum_data.StartFileReferenceNumber = nextid;
    }
}

Additional notes

  • As discussed in the comments, you may need to replace MFT_ENUM_DATA with MFT_ENUM_DATA_V0 on versions of Windows later than Windows 7. (This may also depend on what compiler and SDK you are using.)

  • The program prints the 64-bit values (file reference numbers, USNs) with 32-bit format specifiers such as %lu and %u. That was just a mistake on my part; use a 64-bit specifier such as %llu instead. Probably in production code you won't be printing them anyway, but FYI.
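
The main loop in the program above depends on the layout of the output buffer: the first 8 bytes hold the starting point for the next call, and the variable-length records follow, each beginning with its own length. Here is a minimal, portable sketch of that walk over a plain byte buffer. `ParsedBuffer` and `parse_enum_buffer` are hypothetical names, and the bounds checks are an addition that the sample program does not perform.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// Result of walking one output buffer.
struct ParsedBuffer {
    std::uint64_t next_start = 0;             // value to pass to the next call
    std::vector<std::size_t> record_offsets;  // byte offset of each record
};

// Walks a buffer laid out the way the sample's main loop treats it:
// an 8-byte "next start" value, then records whose first 4 bytes are
// the record's own length (RecordLength).
ParsedBuffer parse_enum_buffer(const unsigned char* buf, std::size_t bytes)
{
    assert(buf != nullptr);
    ParsedBuffer out;
    if (bytes < sizeof(std::uint64_t)) return out;  // too short to hold the header
    std::memcpy(&out.next_start, buf, sizeof(out.next_start));

    std::size_t pos = sizeof(std::uint64_t);
    while (pos + sizeof(std::uint32_t) <= bytes) {
        std::uint32_t len = 0;
        std::memcpy(&len, buf + pos, sizeof(len));
        // Stop on a length that is impossible or runs past the buffer.
        if (len < sizeof(std::uint32_t) || len > bytes - pos) break;
        out.record_offsets.push_back(pos);
        pos += len;
    }
    return out;
}
```

Copying the header and lengths with `memcpy` avoids alignment assumptions; real code would cast each offset to the record structure for the version it requested.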

Skantze answered 18/9, 2011 at 2:26 Comment(18)
Hey Harry, this is awesome. I will try this out. One question, though: how do we know if a new file that was created is something we need to back up? Windows creates and deletes a lot of temporary files, and it would be futile to check whether they still exist.Tabriz
This is the "scanning the Master File Table" I mentioned in my answer. +1 for example code. But it should be combined with reading the Journal, you shouldn't try to use this instead of the journal.Toxic
Actually, you can use the USN returned by this method to determine which files have changed since the last run. However, you still need to read the journal once to get the current USN before starting the scan (otherwise modifications made during the scan could be missed on both this and the subsequent scan). And don't use the USN filtering capability of MFT_ENUM_DATA; you do need to enumerate all records in order to get information on parent directories.Toxic
Hey Ben, in my last implementation, I used to store the USN record number in the database. But yes, we have to store the file reference number and the parent file reference number of each file and query it to get the path.Tabriz
@Ben, the call to FSCTL_QUERY_USN_JOURNAL provides you with the current USN. You don't actually need to read the journal. (I'm not saying it might not be preferable to do so, but I don't think it is essential.)Skantze
@Harry: Didn't know about that, thought you'd need to read the most recent record to get that information. But yeah, I just meant to read the latest USN, not the entire journal.Toxic
Roy, to the best of my understanding and based on my testing, FSCTL_ENUM_USN_DATA only returns information about files that still exist.Skantze
@Ben, I think it might be reasonable to use the USN filtering in FSCTL_ENUM_USN_DATA to find the changed files, then make separate calls to look up the parent directories. That way you'd only be processing data for parent directories that you actually needed. I'm not sure about performance, but I gather the MFT is pretty well optimised, so as long as you cache the data you get I think you might be OK.Skantze
this sample code will be greatly improved in speed if within show_record() you don't read a megabyte worth of data but only a single USN recordTranspire
@nikos: the size of a USN record is variable, so that's a bit tricky - it isn't just the filename, it is also documented that extra members might be added to the structure in future releases. On the other hand, we don't need a megabyte, a kilobyte would be plenty. In production code you might want to further optimize, e.g., start by assuming that there are no extra entries and only increase the read size if it actually turns out to be necessary.Skantze
do you know if it is possible to drop the requirement for full administrator privileges to enumerate the MFT in this fashion in READ-ONLY mode?Transpire
@nikos: enumerating the MFT is always read-only, and if non-administrators could do it, they could, e.g., see file names in other user's homes. That would violate the security model.Skantze
I suppose this makes sense but it makes it less useful for my search programTranspire
@nikos: if you wanted to go to the trouble, you could install an optional service application (running as local system) to do the high-speed searching on your application's behalf. But you'd have to be sure that the service filtered the results based on your application's access rights.Skantze
@HarryJohnston, trying to run this code on my machine gives Error 87 from "DeviceIoControl" at line 146? (Windows 10 64-bit, Visual Studio 2015 Community Update 2)Retouch
@OzLe: which IOCTL is that? This code was written for Win7, to support newer versions you might need to use USN_RECORD_V3 or USN_RECORD_V4 rather than USN_RECORD.Skantze
@HarryJohnston thank you, Changing "MFT_ENUM_DATA" to "MFT_ENUM_DATA_V0" solved it.Retouch
@HarryJohnston Thanks for this! My implementation wasn't working and it turned out I made a bad pointer for the parent record and the (USN*)buffer+1 bit fixed it! Although I'm slightly confused, and curious if there's any official note in Microsoft's docs about this data that we're skipping? I did notice the page for FSCTL_ENUM_USN_DATA says this: "Each call to FSCTL_ENUM_USN_DATA retrieves the starting point for the subsequent call as the first entry in the output buffer." Is that it? It is a bit vague I have to say..Flamsteed
T
5

The change journal is your best bet. You can use the file reference numbers to match file creation/deletion pairs and thus ignore temporary files, without having to process them any further.
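
One way to picture the creation/deletion matching is the sketch below. It is not the journal-reading code itself: `JournalEvent`, `ChangeSet`, and `summarize` are hypothetical names, the events are assumed to have already been read from the journal for the window since the last backup, and a file reference number is assumed to identify a single file within that window.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>
#include <unordered_set>
#include <vector>

// Same values as USN_REASON_FILE_CREATE and USN_REASON_FILE_DELETE in
// winioctl.h, repeated here so the sketch is portable.
constexpr std::uint32_t kReasonFileCreate = 0x00000100;
constexpr std::uint32_t kReasonFileDelete = 0x00000200;

// Hypothetical, reduced journal record: which file, and why it was logged.
struct JournalEvent {
    std::uint64_t frn;     // FileReferenceNumber
    std::uint32_t reason;  // Reason flags
};

struct ChangeSet {
    std::unordered_set<std::uint64_t> to_backup;  // created or modified, still present
    std::unordered_set<std::uint64_t> deleted;    // existed before the window, now gone
};

// Combines all reason flags seen per file, then drops files that were both
// created and deleted inside the window (short-lived temporary files).
ChangeSet summarize(const std::vector<JournalEvent>& events)
{
    std::unordered_map<std::uint64_t, std::uint32_t> combined;
    for (const JournalEvent& e : events) combined[e.frn] |= e.reason;

    ChangeSet out;
    for (const auto& kv : combined) {
        const bool created = (kv.second & kReasonFileCreate) != 0;
        const bool deleted = (kv.second & kReasonFileDelete) != 0;
        if (created && deleted) continue;  // temporary file: ignore entirely
        if (deleted) out.deleted.insert(kv.first);
        else         out.to_backup.insert(kv.first);
    }
    return out;
}
```

Only the file reference numbers left in `to_backup` need their paths resolved, which keeps the expensive lookups proportional to the real changes rather than to the journal size.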

I think you have to scan the Master File Table to make sense of ParentFileReferenceNumber. Of course you only need to keep track of directories when doing this, and use a data structure that will allow you to quickly lookup the information, so you only need to scan the MFT once.

Toxic answered 17/9, 2011 at 2:14 Comment(7)
codeproject.com/KB/files/Eyes_on_NTFS.aspx I have used most of the code given here. The thing is, he takes each file name, finds its file reference number, and writes it to a database. Then, when a journal record comes in, he runs a query to match the file reference number. That is pointless, because this way you are going through the entire file system again.Tabriz
@roy: I don't think it's pointless. You only need to catalog directories from the MFT. And the journal still gives you an accurate list of changed files, without having to calculate hashes on all file content. How else would you detect changes? (and hashes carry a risk of collision) I assume you know that the "last modified time" isn't trustworthy.Toxic
By pointless, I didn't mean it was not accurate. It takes a long time to query a database to find its matching file reference number. I am trying to cut short on the processing time here.Tabriz
@roy: Does it take a long time to build the database, or to query it? If queries are really taking a long time, you need a more efficient data structure.Toxic
yes, it takes a long time to query the database. I will have to think of another way to store the data.Tabriz
I had never heard of MFT, so I will try to use that now.Tabriz
@roy: I'd suggest a radix-tree, with a layer for each octet of the PFRN.Toxic
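
The radix-tree suggestion in the last comment could look roughly like this: one tree level per octet of the 64-bit file reference number, most significant octet first. This is a minimal sketch under that reading of the comment; `FrnRadixTree` and its members are hypothetical names, and the fixed 256-slot nodes trade memory for simplicity.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <memory>
#include <string>
#include <utility>

// Maps a 64-bit file reference number to a directory name using an
// 8-level radix tree, one level per octet of the key.
class FrnRadixTree {
    struct Node {
        std::array<std::unique_ptr<Node>, 256> child;  // one slot per octet value
        std::unique_ptr<std::wstring> value;           // set only at depth 8
    };
    Node root_;

public:
    // Inserts the name for frn, replacing any earlier name for the same key.
    void insert(std::uint64_t frn, std::wstring name)
    {
        Node* n = &root_;
        for (int shift = 56; shift >= 0; shift -= 8) {
            std::unique_ptr<Node>& slot = n->child[(frn >> shift) & 0xFF];
            if (!slot) slot = std::make_unique<Node>();
            n = slot.get();
        }
        assert(n != nullptr);
        n->value = std::make_unique<std::wstring>(std::move(name));
    }

    // Returns the stored name, or nullptr if frn was never inserted.
    const std::wstring* find(std::uint64_t frn) const
    {
        const Node* n = &root_;
        for (int shift = 56; shift >= 0; shift -= 8) {
            n = n->child[(frn >> shift) & 0xFF].get();
            if (!n) return nullptr;
        }
        return n->value.get();
    }
};
```

A lookup always takes exactly eight pointer hops with no hashing and no database round trip. Keys that share high-order octets share nodes, which suits file reference numbers because they cluster in the low-order bytes.
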
E
1

You can use ReadDirectoryChangesW and the surrounding Windows APIs.

Exterminate answered 14/9, 2011 at 19:40 Comment(6)
Does it detect changes in sub folders too? And, does it need to be constantly running to monitor the changes? Will it give only the changes since last scan?Tabriz
Yes, if you pass TRUE for parameter 4. You can implement it many ways but it's possible to do real-time monitoring. Read the MSDN article linked for usage ;-)Exterminate
Hey, I just ran a small program to test how this works, but after ReadDirectoryChangesW, it does not go on to the next step. Is this a known issue, or am I doing something wrong? Can you please help me with my code?Tabriz
Ask a new question and post the code you have so far. You can link this question as a reference. Me or some other windows C++ guy will get to it.Exterminate
So, I made another question, and they say that this is not the right API to check what changed since last check. Do you think there is an alternative?Tabriz
There is ... I'll find the questionExterminate
A
-2

I know how to achieve this in Java. It may help if you can call Java code from your C++ program.

In Java you can achieve this using the JNotify API. It also watches subdirectories for changes.

Alleviative answered 16/9, 2011 at 23:10 Comment(0)
