Sorry for the title; it might be a bit confusing, but I don't know how I could have explained it better.
There are two kinds of files, with .cat (catalog) and .dat extensions. A .cat file describes the binary files packed into the corresponding .dat file: each entry gives a file's name, its size, its offset in the .dat file, and an MD5 hash.
Example .cat file:
assets/textures/environments/asteroids/ast_crystal_blue_diff-small.gz 22387 1546955265 85a67a982194e4141e08fac4bf062c8f
assets/textures/environments/asteroids/ast_crystal_blue_diff.gz 83859 1546955265 86c7e940de82c2c2573a822c9efc9b6b
assets/textures/environments/asteroids/ast_crystal_diff-small.gz 22693 1546955265 cff6956c94b59e946b78419d9c90f972
assets/textures/environments/asteroids/ast_crystal_diff.gz 85531 1546955265 57d5a24dd4da673a42cbf0a3e8e08398
assets/textures/environments/asteroids/ast_crystal_green_diff-small.gz 22312 1546955265 857fea639e1af42282b015e8decb02db
assets/textures/environments/asteroids/ast_crystal_green_diff.gz 115569 1546955265 ee6f60b0a8211ec048172caa762d8a1a
assets/textures/environments/asteroids/ast_crystal_purple_diff-small.gz 14179 1546955265 632317951273252d516d36b80de7dfcd
assets/textures/environments/asteroids/ast_crystal_purple_diff.gz 53781 1546955265 c057acc06a4953ce6ea3c6588bbad743
assets/textures/environments/asteroids/ast_crystal_yellow_diff-small.gz 21966 1546955265 a893c12e696f9e5fb188409630b8d10b
assets/textures/environments/asteroids/ast_crystal_yellow_diff.gz 82471 1546955265 c50a5e59093fe9c6abb64f0f47a26e57
assets/textures/environments/asteroids/xen_crystal_diff-small.gz 14161 1546955265 23b34bdd1900a7e61a94751ae798e934
assets/textures/environments/asteroids/xen_crystal_diff.gz 53748 1546955265 dcb7c8294ef72137e7bca8dd8ea2525f
assets/textures/lensflares/lens_rays3_small_diff.gz 14107 1546955265 a656d1fad4198b0662a783919feb91a5
I parsed those files with relative ease using Span&lt;T&gt;, and after some benchmarks with BenchmarkDotNet, I believe I have optimized the reading of these files as much as I can.
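For context, the parsing boils down to splitting each line on spaces, roughly like the sketch below (simplified, not my exact code; the CatalogEntry constructor and its field types are placeholders):

// Simplified sketch: split one catalog line into its four fields with Span<T>.
// CatalogEntry and its constructor are placeholders for my actual types.
private static CatalogEntry ParseLine(ReadOnlySpan<char> line)
{
    int i1 = line.IndexOf(' ');
    string assetPath = line[..i1].ToString();

    line = line[(i1 + 1)..];
    int i2 = line.IndexOf(' ');
    int assetSize = int.Parse(line[..i2]);

    line = line[(i2 + 1)..];
    int i3 = line.IndexOf(' ');
    long byteOffset = long.Parse(line[..i3]);

    string md5Hash = line[(i3 + 1)..].ToString();
    return new CatalogEntry(assetPath, assetSize, byteOffset, md5Hash);
}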
But the .dat files are another story: a typical .dat file is gigabytes in size.
I tried the most straightforward method I could think of first.
(I removed the null checks and validation code to make it more readable.)
public async Task ExportAssetsAsync(CatalogFile catalogFile, string destDirectory, CancellationToken ct = default)
{
    IFileInfo catalogFileInfo = _fs.FileInfo.FromFileName(catalogFile.FilePath);
    string catalogFileName = _fs.Path.GetFileNameWithoutExtension(catalogFileInfo.Name);
    string datFilePath = _fs.Path.Combine(catalogFileInfo.DirectoryName, $"{catalogFileName}.dat");
    IFileInfo datFileInfo = _fs.FileInfo.FromFileName(datFilePath);

    await using Stream stream = datFileInfo.OpenRead();
    foreach (CatalogEntry catalogEntry in catalogFile.CatalogEntries)
    {
        string destFilePath = _fs.Path.Combine(destDirectory, catalogEntry.AssetPath);
        IFileInfo destFile = _fs.FileInfo.FromFileName(destFilePath);
        if (!destFile.Directory.Exists)
        {
            destFile.Directory.Create();
        }

        // Seek to the asset and read it into a freshly allocated array.
        stream.Seek(catalogEntry.ByteOffset, SeekOrigin.Begin);
        var newFileData = new byte[catalogEntry.AssetSize];
        int read = await stream.ReadAsync(newFileData, 0, catalogEntry.AssetSize, ct);
        if (read != catalogEntry.AssetSize)
        {
            _logger?.LogError("Could not read asset data from dat file: {DatFile}", datFilePath);
            throw new DatFileReadException("Could not read asset data from dat file", datFilePath);
        }

        await using Stream destStream = _fs.File.Open(destFile.FullName, FileMode.Create);
        await destStream.WriteAsync(newFileData, ct);
    }
}
As you can guess, this method is both slow and allocation-heavy: it allocates a fresh array the size of each asset, which keeps the GC busy.
I made some modifications to the method above: first reading with a reusable buffer, then using stackalloc and Span&lt;byte&gt; instead of allocating with new byte[catalogEntry.AssetSize]. Buffered reading didn't gain me much, and with stackalloc I naturally got a StackOverflowException, since some files are larger than the stack.
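The buffered variant was essentially a classic copy loop that reuses one buffer instead of allocating per entry, roughly like this sketch (not my exact code; I'm showing a rented ArrayPool&lt;byte&gt; buffer from System.Buffers here, and the 81920-byte size is arbitrary):

// Sketch of the buffered variant: reuse one rented buffer instead of
// allocating new byte[catalogEntry.AssetSize] for every entry.
byte[] buffer = ArrayPool<byte>.Shared.Rent(81920);
try
{
    long remaining = catalogEntry.AssetSize;
    while (remaining > 0)
    {
        int toRead = (int)Math.Min(buffer.Length, remaining);
        int read = await stream.ReadAsync(buffer, 0, toRead, ct);
        if (read == 0)
        {
            throw new DatFileReadException("Could not read asset data from dat file", datFilePath);
        }
        await destStream.WriteAsync(buffer, 0, read, ct);
        remaining -= read;
    }
}
finally
{
    ArrayPool<byte>.Shared.Return(buffer);
}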
Then, after some research, I decided to try System.IO.Pipelines, introduced with .NET Core 2.1, and changed the method as below.
public async Task ExportAssetsPipe(CatalogFile catalogFile, string destDirectory, CancellationToken ct = default)
{
    IFileInfo catalogFileInfo = _fs.FileInfo.FromFileName(catalogFile.FilePath);
    string catalogFileName = _fs.Path.GetFileNameWithoutExtension(catalogFileInfo.Name);
    string datFilePath = _fs.Path.Combine(catalogFileInfo.DirectoryName, $"{catalogFileName}.dat");
    IFileInfo datFileInfo = _fs.FileInfo.FromFileName(datFilePath);

    await using Stream stream = datFileInfo.OpenRead();
    foreach (CatalogEntry catalogEntry in catalogFile.CatalogEntries)
    {
        string destFilePath = _fs.Path.Combine(destDirectory, catalogEntry.AssetPath);
        IFileInfo destFile = _fs.FileInfo.FromFileName(destFilePath);
        if (!destFile.Directory.Exists)
        {
            destFile.Directory.Create();
        }

        stream.Position = catalogEntry.ByteOffset;
        var reader = PipeReader.Create(stream);
        while (true)
        {
            ReadResult readResult = await reader.ReadAsync(ct);
            ReadOnlySequence<byte> buffer = readResult.Buffer;
            if (buffer.Length >= catalogEntry.AssetSize)
            {
                // The pipe has buffered the whole asset; write it out segment by segment.
                ReadOnlySequence<byte> entry = buffer.Slice(0, catalogEntry.AssetSize);
                await using Stream destStream = _fs.File.Open(destFile.FullName, FileMode.Create);
                foreach (ReadOnlyMemory<byte> mem in entry)
                {
                    await destStream.WriteAsync(mem, ct);
                }
                break;
            }

            if (readResult.IsCompleted)
            {
                // The stream ended before the whole asset was buffered.
                throw new DatFileReadException("Could not read asset data from dat file", datFilePath);
            }

            // Mark everything as examined but nothing as consumed, so the pipe keeps buffering.
            reader.AdvanceTo(buffer.Start, buffer.End);
        }
    }
}
Well, according to BenchmarkDotNet, the results are worse than those of the first method, in both speed and memory allocation. This is probably because I am using System.IO.Pipelines incorrectly, or for a purpose it was not designed for.
I don't have much experience with this, as I haven't done I/O on files this large before. How can I do what I want with minimal memory allocation and maximum performance? Thank you very much in advance for your help and guidance.