Estimate the number of USN records on NTFS volume

When the USN journal is used for the first time, the volume's entire set of USN records must be enumerated using the FSCTL_ENUM_USN_DATA control code. This is usually a lengthy operation.

Is there a way to estimate the number of records on the volume prior to running it, so progress can be displayed?

I'm guessing the USN data for the entire volume is generated from the MFT, with one record per file (approximately). So perhaps a way to estimate the number of active files in the MFT would work.

Servile answered 4/7, 2012 at 23:39 Comment(5)
FSCTL_ENUM_USN_DATA lists the contents of the MFT rather than the contents of the USN journal (for which you would use FSCTL_READ_USN_JOURNAL). So, yes, it contains one entry for every file and directory on the volume. I don't know of any way to estimate the number of entries. Instead of a progress bar or percentage, perhaps simply displaying the number of files/directories processed so far would do? - Insured
Question: why do you want to enumerate the entire MFT? It might not be necessary. This answer may be useful: https://mcmap.net/q/540107/-how-can-i-detect-only-deleted-changed-and-created-files-on-a-volume - Insured
That was exactly my understanding (or guess at least). Displaying a count may have to do, but I'm still open to any suggestions for anything that could even roughly approximate the count. - Servile
You can use FSCTL_GET_NTFS_VOLUME_DATA to get the length in bytes of the MFT. If you compare this to the number of records on a selection of representative volumes, you could estimate the average length of a single MFT record and use this to calculate an estimate for the number of records on a particular volume. - Insured
@HarryJohnston - thanks for the link to your other answer, though I think it mostly confirms my current approach. I think your suggestion about using the MFT size for an estimate is good. - Servile

You can use FSCTL_GET_NTFS_VOLUME_DATA to get the length in bytes of the MFT. If you compare this to the number of records on a selection of representative volumes, you could estimate the average length of a single MFT record and use this to calculate an estimate for the number of records on a particular volume.

Because the MFT contains (for example) the security information for every file, the average length will vary significantly from volume to volume, so I think you'll only get order-of-magnitude accuracy, but it may be good enough in most cases.
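As a rough illustration of the calibration idea, the sketch below pools a few measured volumes into a records-per-byte ratio and applies it to a new volume's MFT length. The `mft_sample` struct and `estimate_record_count` function are hypothetical names invented for this example, and the sample values would have to come from your own measurements.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical calibration sample: one volume that was measured in full. */
struct mft_sample {
    uint64_t mft_bytes;    /* MftValidDataLength reported for that volume */
    uint64_t record_count; /* records actually returned by a full enumeration */
};

/* Estimates the record count for a volume whose MFT is mft_bytes long,
   using the pooled records-per-byte ratio of the calibration samples.
   Returns 0 if the samples carry no usable data. */
uint64_t estimate_record_count(uint64_t mft_bytes,
                               const struct mft_sample *samples, size_t n)
{
    double total_bytes = 0.0, total_records = 0.0;
    for (size_t i = 0; i < n; i++) {
        total_bytes += (double)samples[i].mft_bytes;
        total_records += (double)samples[i].record_count;
    }
    if (total_bytes <= 0.0)
        return 0;
    /* Round to nearest; the result is only an order-of-magnitude guess. */
    return (uint64_t)((double)mft_bytes * (total_records / total_bytes) + 0.5);
}
```

Pooling the totals, rather than averaging per-volume ratios, weights large volumes more heavily; either choice is defensible for an estimate this coarse.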

Another approach would be to assume that the file reference numbers increase linearly, which is roughly true. You can use FSCTL_ENUM_USN_DATA to find out whether there are any files with a reference number above a particular guess or not; you'd need no more than 128 guesses to determine the actual maximum reference number. That would at least give you a percentage complete between 0 and 100 at any given point. It wouldn't be entirely uniform, but then progress bars never are. :-)
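One way to stay within that guess budget is to double the guess until the probe fails and then bisect. The sketch below assumes a hypothetical probe callback that reports whether any record exists at or above a given id (in real code it would wrap an FSCTL_ENUM_USN_DATA call with StartFileReferenceNumber set to the guess), and it assumes ids stay below 2^63. The names `probe_fn`, `find_max_id` and `simulated_probe` are made up for the example.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical probe: true if the volume has any record with id >= start. */
typedef bool (*probe_fn)(uint64_t start, void *ctx);

/* Finds the largest id for which the probe is true, assuming the probe is
   monotone (true up to some maximum, false beyond it), that id 0 exists,
   and that the maximum is below 2^63. Uses at most 126 probes. */
uint64_t find_max_id(probe_fn probe, void *ctx, unsigned *probes_used)
{
    const uint64_t limit = (uint64_t)1 << 63;
    unsigned n = 0;
    uint64_t lo = 0; /* invariant: an id >= lo exists */
    uint64_t hi = 1; /* becomes the first guess known (or assumed) to fail */

    /* Phase 1: double the guess until the probe fails (at most 63 probes). */
    while (hi < limit) {
        n++;
        if (!probe(hi, ctx))
            break;
        lo = hi;
        hi <<= 1;
    }

    /* Phase 2: bisect between the last success and the first failure. */
    while (hi - lo > 1) {
        uint64_t mid = lo + (hi - lo) / 2;
        n++;
        if (probe(mid, ctx))
            lo = mid;
        else
            hi = mid;
    }

    if (probes_used)
        *probes_used = n;
    return lo;
}

/* Stand-in probe for experimentation: ctx points at the true maximum id. */
bool simulated_probe(uint64_t start, void *ctx)
{
    return start <= *(const uint64_t *)ctx;
}
```

With the maximum in hand, the id of the record currently being processed divided by that maximum gives the rough percentage described above.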

Additional:

Looking more closely, on Windows 7 x64 the "next id" field returned by FSCTL_ENUM_USN_DATA (the quadword returned before the first USN_RECORD structure) isn't a file reference number after all, but the file record segment number. So, as you observed, the last id number returned, multiplied by BytesPerFileRecordSegment (1024), is equal to MftValidDataLength.

File reference numbers appear to be made up of two parts. The low six bytes contain the file record segment number. The first record returned from each request always has a FRN whose segment number is the same as the "next id" fed into StartFileReferenceNumber, except for the first call when StartFileReferenceNumber is zero. The upper two bytes contain unspecified additional information, which is never zero.

It seems that FSCTL_ENUM_USN_DATA accepts either a file record segment number (in which case the top two bytes are zero) or a file reference number (in which case the top two bytes are nonzero).
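The split described above can be sketched with a mask and a shift. The helper names are invented for this example. The answer leaves the meaning of the upper two bytes unspecified (they are commonly described as a sequence number), so the sketch treats them as opaque.

```c
#include <stdint.h>

/* Low six bytes of a 64-bit file reference number (FRN). */
#define FRN_SEGMENT_MASK 0x0000FFFFFFFFFFFFULL

/* File record segment number: the low 48 bits of the FRN. */
uint64_t frn_segment_number(uint64_t frn)
{
    return frn & FRN_SEGMENT_MASK;
}

/* Upper two bytes of the FRN, returned as-is without interpretation. */
uint16_t frn_upper_bits(uint64_t frn)
{
    return (uint16_t)(frn >> 48);
}

/* Nonzero when the value looks like a bare segment number, that is,
   when its top two bytes are zero. */
int is_bare_segment_number(uint64_t id)
{
    return (id >> 48) == 0;
}
```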

One oddity is that I can't find two records with the same record segment number. This suggests that each file record is using at least 1K in the MFT, which doesn't seem reasonable.

Anyway, the upshot is that it is probably sensible to multiply the "next id" by BytesPerFileRecordSegment and divide it by MftValidDataLength to get the fraction completed (multiply by 100 for a percentage), so long as you cope gracefully if this returns a nonsensical result.
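A minimal sketch of that calculation, including the "cope gracefully" part, might look like this. The function name `enum_progress` and the choice to return -1.0 for unusable inputs are assumptions of the example. Masking off the upper two bytes is a precaution in case the id arrives as a full file reference number.

```c
#include <stdint.h>

/* Returns enumeration progress in the range [0, 1], or -1.0 when the
   inputs cannot produce a meaningful value.
   next_id       : the "next id" quadword returned by FSCTL_ENUM_USN_DATA
   bytes_per_frs : BytesPerFileRecordSegment from the volume data
   mft_valid_len : MftValidDataLength from the volume data */
double enum_progress(uint64_t next_id, uint32_t bytes_per_frs,
                     int64_t mft_valid_len)
{
    if (bytes_per_frs == 0 || mft_valid_len <= 0)
        return -1.0;

    /* Keep only the low six bytes in case the id carries upper FRN bits. */
    uint64_t segment = next_id & 0x0000FFFFFFFFFFFFULL;

    double fraction =
        (double)segment * (double)bytes_per_frs / (double)mft_valid_len;

    /* Clamp overshoot, e.g. if the MFT grew after the length was read. */
    if (fraction > 1.0)
        fraction = 1.0;
    return fraction;
}
```

A caller would typically fall back to showing a plain item count whenever the function returns a negative value.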

Insured answered 5/7, 2012 at 4:36 Comment(7)
Hi Harry, I've done some experimenting and found an interesting thing - FSCTL_ENUM_USN_DATA actually returns the MFT offset as the next "StartFileReferenceNumber" or USN (depending on which part of the MSDN docs you read!). This number is relatively small and bears no relation to FRNs or USNs. So what I've done is use FSCTL_GET_NTFS_VOLUME_DATA to get the size of the MFT, then treat this MFT pos as an indicator of progress. - Servile
On my system (Windows 7 SP1) the number returned by FSCTL_ENUM_USN_DATA is definitely the file reference number. I'd have thought it would always have to be, because that's what you pass in for StartFileReferenceNumber on the next call. What OS are you running? (It would make sense for the "reference number" to actually be an offset, but the numbers I get aren't, because they're often consecutive.) - Insured
That was on my work machine, WinXP SP2. Since it's not the documented behaviour I probably shouldn't rely on it and I guess it won't work on 7 (which I'll try). The FileReferenceNumber fields of the returned USN records aren't in order (though they seem to be in approximate order). - Servile
On Windows 7 64 bit, it's the same for me: FSCTL_ENUM_USN_DATA returns monotonic values from approximately 0 to sizeof(MFT)/1024 as the StartFileReferenceNumber. And the number of returned records is the number of items on the system. - Servile
My mistake; I was inadvertently stripping the upper four bytes from the file reference numbers. See the additional section I've added to my answer. - Insured
Thanks! I'll make it use the BytesPerFileRecordSegment instead of a hardcoded value too. Do you think it would be prudent to use ONLY the lower 6 bytes from the returned value as the progress indicator though? (Just in case it starts returning non-zeros for those 2 bytes on some system.) - Servile
@HarryJohnston I didn't fully understand the additional part, but do you think it could help with #45224265? It seems you studied the variations of med.StartFileReferenceNumber, is that right? Could you give a little more information on how you would use it to estimate the number of files on the volume (sorry, the number of files + directories)? Thanks a lot! - Saldana

In fact, the MftValidDataLength field of the NTFS_VOLUME_DATA_BUFFER / NTFS_EXTENDED_VOLUME_DATA structure(s), divided by BytesPerFileRecordSegment, places an upper limit on the number of USN records that FSCTL_ENUM_USN_DATA will return (assuming no additional file records are added to the MFT between the time you take the estimate and the time you run the enumeration).

In the C# example below, I divide the vd.MftValidDataLength value by vd.BytesPerFileRecordSegment, being sure to round up by first adding divisor - 1 to the dividend before dividing. As for the divisor, I believe its value is usually 1,024, in case you prefer to hard-code it, though reading it from the structure is the safer choice.

[Serializable, StructLayout(LayoutKind.Sequential)]
public struct NTFS_EXTENDED_VOLUME_DATA
{
    public VOLUME_ID     /**/ VolumeSerialNumber;
    public long          /**/ NumberSectors;
    public long          /**/ TotalClusters;
    public long          /**/ FreeClusters;
    public long          /**/ TotalReserved;
    public uint          /**/ BytesPerSector;
    public uint          /**/ BytesPerCluster;
    public int           /**/ BytesPerFileRecordSegment;   // <--
    public uint          /**/ ClustersPerFileRecordSegment;
    public long          /**/ MftValidDataLength;          // <--
    public long          /**/ MftStartLcn;
    public long          /**/ Mft2StartLcn;
    public long          /**/ MftZoneStart;
    public long          /**/ MftZoneEnd;
    public uint          /**/ ByteCount;
    public ushort        /**/ MajorVersion;
    public ushort        /**/ MinorVersion;
    public uint          /**/ BytesPerPhysicalSector;
    public ushort        /**/ LfsMajorVersion;
    public ushort        /**/ LfsMinorVersion;
    public uint          /**/ MaxDeviceTrimExtentCount;
    public uint          /**/ MaxDeviceTrimByteCount;
    public uint          /**/ MaxVolumeTrimExtentCount;
    public uint          /**/ MaxVolumeTrimByteCount;
};

Typical constants, abridged for clarity:

public enum FSCTL : uint
{
    // etc...     etc...
    FILESYSTEM_GET_STATISTICS   /**/ = (9 << 16) | 0x0060,
    GET_NTFS_VOLUME_DATA        /**/ = (9 << 16) | 0x0064,  // <--
    GET_NTFS_FILE_RECORD        /**/ = (9 << 16) | 0x0068,
    GET_VOLUME_BITMAP           /**/ = (9 << 16) | 0x006f,
    GET_RETRIEVAL_POINTERS      /**/ = (9 << 16) | 0x0073,
    // etc...     etc...
    ENUM_USN_DATA               /**/ = (9 << 16) | 0x00b3,
    READ_USN_JOURNAL            /**/ = (9 << 16) | 0x00bb,
    // etc...     etc...
    CREATE_USN_JOURNAL          /**/ = (9 << 16) | 0x00e7,
    // etc...     etc...
};

Pseudo-code follows, since everyone has their own favorite ways of doing P/Invoke...

// etc..

if (!GetDeviceIoControl(h_vol, FSCTL.GET_NTFS_VOLUME_DATA, out NTFS_EXTENDED_VOLUME_DATA vd))
    throw new Win32Exception(Marshal.GetLastWin32Error());

var c_mft_estimate = (vd.MftValidDataLength + (vd.BytesPerFileRecordSegment - 1))
                                                        / vd.BytesPerFileRecordSegment;

Great, so what can you do with this value? Unfortunately, knowing this maximum cap on the number of USN records that FSCTL_ENUM_USN_DATA will return doesn't help with choosing a buffer size for the DeviceIoControl/FSCTL_ENUM_USN_DATA call itself, since the USN_RECORD structures returned in each iteration vary in size according to the length of the reported filenames.

So while it is true that, if you happen to provide a buffer large enough for all of the USN_RECORD structures, then DeviceIoControl will indeed dutifully provide them all to you in a single call (thus avoiding the complication of an iterative-calling loop, which simplifies the code considerably), the little calculation above doesn't give any principled estimation of that buffer size, unless you're willing to settle for using it towards some kind of gross overestimation.

What the value is useful for, rather, is for pre-allocating your own fixed-size data structures, which you'll surely need, prior to the FSCTL_ENUM_USN_DATA enumeration operation. So if you have your own value-type which you'll create for each USN entry (dummy struct, just for example...)

[StructLayout(LayoutKind.Sequential)]
public struct MFT_IX_REC
{
    public ushort seq;
    public ushort parent_ix_hi;
    public uint parent_ix;
};

Then, using the estimate from above, you can pre-allocate an array of these before the DeviceIoControl and never have to worry about resizing during the iteration.

var med = new MFT_ENUM_DATA { ... };
// ...

var rg_mftix = new MFT_IX_REC[c_mft_estimate];
// ... ready to go, without having to check whether the array needs resizing within the loop

for (int i=0; DeviceIoControl(h_vol, FSCTL.ENUM_USN_DATA, in med, out USN_RECORD usn, ...); i++)
{
    // etc..
    rg_mftix[i].parent_ix = (uint)usn.ParentId;
    // etc..
}

This elimination of the dynamic array-resizing, usually needed when you don't know the number of entries in advance, is a non-trivial performance benefit, because it avoids the expensive jumbo-sized memcpy operations required for copying the existing data from the old array to a new, larger one each time you resize.

Phago answered 1/9, 2019 at 4:0 Comment(0)
