I have a 10GB CSV file which is essentially a huge square matrix. I am trying to write a function that can access a single cell of the matrix as efficiently as possible, i.e. matrix[12345,20000].
Given its size, it is obviously not possible to load the entire matrix into a 2D array, so I need to somehow read the values directly from the file.
I have looked into random file access using FileStream.Seek. Unfortunately, due to variable rounding, the cells aren't a fixed width, so I can't seek to a specific byte and work out which cell I'm looking at by arithmetic.
I considered scanning the file and creating a lookup table for the index of the first byte of each row. That way, if I wanted to access matrix[12345,20000] I would seek to the start of row 12345 and then scan across the line, counting the commas until I reach the correct cell.
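Roughly what I have in mind, as a minimal sketch in Python for brevity (the same idea should map onto FileStream.Seek). The names build_row_index and read_cell are just placeholders, and it assumes every cell is a plain number with no quoted fields or embedded commas:

```python
import os
import tempfile
from array import array


def build_row_index(path):
    """Scan the file once and record the byte offset at which each row starts."""
    offsets = array("Q")  # unsigned 64-bit entries, so offsets past 4 GB fit
    pos = 0
    with open(path, "rb") as f:
        for line in f:  # binary mode: lines are split on b"\n"
            offsets.append(pos)
            pos += len(line)
    return offsets


def read_cell(f, offsets, row, col):
    """Seek to the start of `row`, skip `col` commas, and parse that field."""
    f.seek(offsets[row])
    line = f.readline().rstrip(b"\r\n")
    start = 0
    for _ in range(col):
        # Raises ValueError if the row has fewer than col + 1 fields.
        start = line.index(b",", start) + 1
    end = line.find(b",", start)
    if end == -1:  # last field on the line
        end = len(line)
    return float(line[start:end].decode("ascii"))


if __name__ == "__main__":
    # Tiny stand-in for the real file, just to show the calls.
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "matrix.csv")
        with open(path, "w", newline="") as out:
            out.write("0,1.5,0\n0,0,2.25\n3,0,0\n")
        offsets = build_row_index(path)
        with open(path, "rb") as f:
            print(read_cell(f, offsets, 1, 2))  # prints 2.25
```

The index costs one full pass over the file and 8 bytes per row. After that, each lookup is one seek plus a scan of a single line.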
I am about to try this, but has anyone else got any better ideas? I'm sure I wouldn't be the first person to try and deal with a file like this.
Cheers
Edit: I should note that the file contains a very sparse matrix. If parsing the CSV file ends up being too slow, I would consider converting the file to a more appropriate, and easier to process, file format. What is the best way to store a sparse matrix?