I have large files, each containing a small number of large datasets. Each dataset contains its name and its size in bytes, which allows skipping it and moving on to the next dataset.
I want to build an index of the dataset names as quickly as possible. An example file is about 21 MB and contains 88 datasets. Reading the 88 names with a std::ifstream, using seekg() to skip between datasets, takes about 1300 ms, which I would like to reduce.
So in effect, I'm reading 88 chunks of about 30 bytes each, at known positions in a 21 MB file, and it takes 1300 ms.
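For reference, here's a minimal sketch of the indexing loop (the header layout below, a name-length field, the name, then a 64-bit payload size, is made up for illustration; the real format differs, but the read-then-seekg() pattern is the same):

    #include <cstdint>
    #include <fstream>
    #include <iostream>
    #include <string>
    #include <vector>

    int main()
    {
        std::ifstream file("example.dat", std::ios::binary);
        std::vector<std::string> names;

        while (file)
        {
            // Read the name length (illustrative field, not the real format).
            std::uint32_t nameLen = 0;
            if (!file.read(reinterpret_cast<char*>(&nameLen), sizeof nameLen))
                break; // end of file

            // Read the dataset name.
            std::string name(nameLen, '\0');
            file.read(&name[0], nameLen);

            // Read the payload size (illustrative field).
            std::uint64_t payloadSize = 0;
            file.read(reinterpret_cast<char*>(&payloadSize), sizeof payloadSize);

            names.push_back(name);

            // Skip the dataset payload to land on the next header.
            file.seekg(static_cast<std::streamoff>(payloadSize), std::ios::cur);
        }

        std::cout << "indexed " << names.size() << " datasets\n";
    }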
Is there a way to improve this, or is it an OS and filesystem limitation? I'm running the test under Windows 7 64-bit.
I know that having a complete index at the beginning of the file would be better, but the file format doesn't have one, and we can't change the format.