CSV Random Access; C#

I have a 10 GB CSV file which is essentially a huge square matrix. I am trying to write a function that can access a single cell of the matrix as efficiently as possible, i.e. matrix[12345, 20000].

Given its size, it is obviously not possible to load the entire matrix into a 2D array; I need to somehow read the values directly from the file.

I have Googled around looking at random file access using FileStream.Seek, but unfortunately, due to variable rounding, each cell isn't a fixed width, so it is not possible to seek to a specific byte and know which cell I'm looking at by simple arithmetic.

I considered scanning the file and creating a lookup table for the index of the first byte of each row. That way, if I wanted to access matrix[12345,20000] I would seek to the start of row 12345 and then scan across the line, counting the commas until I reach the correct cell.

I am about to try this, but has anyone else got any better ideas? I'm sure I wouldn't be the first person to try and deal with a file like this.
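
Here is a minimal sketch of what I have in mind, in case it helps clarify the question. It assumes a plain CSV with no quoted fields containing commas or newlines, and the class and method names are just illustrative:

    using System.Collections.Generic;
    using System.IO;

    // Sketch of the index-then-scan idea: one full pass builds a table of row
    // start offsets, then each lookup is a single Seek plus one line scan.
    class CsvCellReader
    {
        private readonly string _path;
        private readonly List<long> _rowOffsets = new List<long>();

        public CsvCellReader(string path)
        {
            _path = path;
            BuildRowIndex();
        }

        // One sequential pass to record the byte offset of the start of every row.
        private void BuildRowIndex()
        {
            using (var fs = new FileStream(_path, FileMode.Open, FileAccess.Read))
            {
                var buffer = new byte[1 << 20];
                long position = 0;
                int read;
                _rowOffsets.Add(0);
                while ((read = fs.Read(buffer, 0, buffer.Length)) > 0)
                {
                    for (int i = 0; i < read; i++)
                        if (buffer[i] == (byte)'\n')
                            _rowOffsets.Add(position + i + 1);
                    position += read;
                }
            }
        }

        // Seek to the start of the row, then scan across it counting commas.
        public string GetCell(int row, int col)
        {
            using (var fs = new FileStream(_path, FileMode.Open, FileAccess.Read))
            {
                fs.Seek(_rowOffsets[row], SeekOrigin.Begin);
                using (var reader = new StreamReader(fs))
                {
                    string line = reader.ReadLine();
                    return line.Split(',')[col];
                }
            }
        }
    }

Building the index still means one full pass over the 10 GB file, but after that each lookup is a single seek plus one line scan.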

Cheers

Edit: I should note that the file contains a very sparse matrix. If parsing the CSV file ends up being too slow, I would consider converting the file to a more appropriate, and easier to process, file format. What is the best way to store a sparse matrix?

Crowd answered 27/1, 2011 at 23:45 Comment(0)

I have used the LumenWorks CSV reader for quite large CSV files; it may be worth a quick look to see how quickly it can parse your file.

Lumenworks CSV
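
For example, something along these lines, using the library's CsvReader over a StreamReader. Note this is a sketch and the reader only parses sequentially, so you still have to read forward to the row you want:

    using System.IO;
    using LumenWorks.Framework.IO.Csv;

    static class LumenworksExample
    {
        // Scans forward to the requested row, then returns one column.
        // Assumes the file has no header row.
        public static string ReadCell(string path, int row, int col)
        {
            using (var csv = new CsvReader(new StreamReader(path), false))
            {
                int current = 0;
                while (csv.ReadNextRecord())
                {
                    if (current++ == row)
                        return csv[col];
                }
                return null;   // row not found
            }
        }
    }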

Kop answered 27/1, 2011 at 23:50 Comment(1)
I don't see how this can avoid either seeking or loading it all into RAM. It's just a sequential reader. – Florrieflorry

First of all, how do you want to refer to a particular row? By its index, so that you have another table or something else that tells you which row you are interested in? Or by an id of some kind?

These ideas come to mind:

  • Your approach
  • Binary search. Assuming you know the average row length (size/rows), you can use a binary search to find a row, provided there is an ordered identifier in each row that tells you whether you have a hit or a miss (a rough sketch follows this list).
  • Loading it into a database! By the way, what prevents you from doing that? You can even use SQL Server Express, which is free, and to get around its size limit you can shard your data across multiple databases.
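
A rough sketch of the binary-search idea, assuming the first field of each row is a numeric id sorted in ascending order, no quoted fields, and ASCII content. Everything here is illustrative:

    using System.IO;
    using System.Text;

    static class CsvBinarySearch
    {
        // Binary search over byte offsets: jump to the middle of the current
        // range, skip to the next row boundary, compare that row's id.
        // Note: the very first row of the file is never examined by this
        // sketch; check it separately if needed.
        public static string FindRow(string path, long targetId)
        {
            using (var fs = new FileStream(path, FileMode.Open, FileAccess.Read))
            {
                long lo = 0, hi = fs.Length;
                while (lo < hi)
                {
                    long mid = (lo + hi) / 2;
                    fs.Seek(mid, SeekOrigin.Begin);
                    SkipToNextLine(fs);                          // land on a row boundary
                    long rowStart = fs.Position;
                    if (rowStart >= hi) { hi = mid; continue; }  // overshot: search lower half

                    string line = ReadLine(fs);
                    if (line.Length == 0) { hi = mid; continue; }   // hit end of file

                    long id = long.Parse(line.Substring(0, line.IndexOf(',')));
                    if (id == targetId) return line;
                    if (id < targetId) lo = fs.Position;         // target lies after this row
                    else hi = mid;
                }
                return null;                                     // not found
            }
        }

        private static void SkipToNextLine(Stream s)
        {
            int b;
            while ((b = s.ReadByte()) != -1 && b != '\n') { }
        }

        private static string ReadLine(Stream s)
        {
            var sb = new StringBuilder();
            int b;
            while ((b = s.ReadByte()) != -1 && b != '\n')
                sb.Append((char)b);                              // assumes ASCII data
            return sb.ToString().TrimEnd('\r');
        }
    }
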
Neanderthal answered 27/1, 2011 at 23:56 Comment(4)
Loading it into a database... which would create an index on it :) – Florrieflorry
I like the idea of binary search. However, as you stated, it would require him to have a row id on each row of the CSV. – Florrieflorry
Oddly enough, the matrix was generated from data in a SQL database. I didn't create the generator, and now have to deal with the output. I am considering putting the data back into a more structured format. Should I use MSSQL or a binary file? – Crowd
I would say go with a DB (SQL Server Express). Gone are the days when you had to deal with such requirements yourself; a DB gives you indexing and so much more. – Neanderthal

An index file would be the best you could do, I bet. With rows of unknown size, there is no way to skip directly to a line other than scanning the file or keeping an index.

The only question is how large your index gets. If it is too large, you could make it smaller by indexing only every 5th line (for example) and scanning within a range of 5 lines.
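
A rough sketch of that idea, indexing every 5th row and reading forward within the stride (the stride and all names are illustrative; the one-byte-at-a-time scan is kept short for clarity, a buffered scan would be faster in practice):

    using System.Collections.Generic;
    using System.IO;

    class SparseRowIndex
    {
        private const int Stride = 5;                    // index every 5th row
        private readonly List<long> _offsets = new List<long>();
        private readonly string _path;

        public SparseRowIndex(string path)
        {
            _path = path;
            using (var fs = new FileStream(path, FileMode.Open, FileAccess.Read))
            {
                long offset = 0, row = 0;
                _offsets.Add(0);
                int b;
                while ((b = fs.ReadByte()) != -1)
                {
                    offset++;
                    if (b == '\n' && ++row % Stride == 0)
                        _offsets.Add(offset);            // start of row 5, 10, 15, ...
                }
            }
        }

        public string ReadRow(int row)
        {
            using (var fs = new FileStream(_path, FileMode.Open, FileAccess.Read))
            {
                fs.Seek(_offsets[row / Stride], SeekOrigin.Begin);
                using (var reader = new StreamReader(fs))
                {
                    for (int i = 0; i < row % Stride; i++)
                        reader.ReadLine();               // skip rows within the stride
                    return reader.ReadLine();
                }
            }
        }
    }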

Magnify answered 27/1, 2011 at 23:54 Comment(0)

I disagree that you shouldn't load the file into RAM, especially if you use a 64-bit OS.

It shouldn't be a problem to allocate a matrix of size 12345x20000: that's only about 1.9 GB in double precision. In fact, even if the size were bigger, I would still recommend this approach on a 64-bit platform (see "virtual memory").

Secondly, you stated that your matrix is sparse, so you could load it into RAM using a sparse representation to save memory.

In conclusion, if your application makes many accesses to the matrix and performance matters, putting it in RAM would definitely be my favourite approach.
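
For illustration, a sparse representation can be as simple as a dictionary keyed by (row, column) that stores only the non-zero cells. This is just a sketch and all names are illustrative:

    using System.Collections.Generic;

    class SparseMatrix
    {
        private readonly Dictionary<long, double> _cells = new Dictionary<long, double>();
        private readonly int _cols;

        public SparseMatrix(int cols)
        {
            _cols = cols;
        }

        private long Key(int row, int col)
        {
            return (long)row * _cols + col;    // pack (row, col) into a single key
        }

        public void Set(int row, int col, double value)
        {
            if (value != 0.0)                  // zeros stay implicit
                _cells[Key(row, col)] = value;
        }

        public double Get(int row, int col)
        {
            double v;
            return _cells.TryGetValue(Key(row, col), out v) ? v : 0.0;
        }
    }

Memory use is then proportional to the number of non-zero cells rather than to 12345x20000.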

Rayfordrayle answered 28/1, 2011 at 0:52 Comment(1)
Of course, you could do that processing and put the result in a file rather than in RAM; you would still get constant-time seeks, but it'd be slower. – Rayfordrayle

Pre-process the file so that the fields are fixed width. Then you can do your random read easily.

From doing similar conversions in the past, you should be able to write some simple code that reads the 10 GB variable-width file from a local disk and writes a 10 GB fixed-width file back to a local disk in roughly 20 minutes. Whether that up-front time investment pays off depends on how many random reads you need to do and how often the file changes.
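
A sketch of that conversion and the resulting constant-time read, assuming every field fits in 20 characters and the matrix has 20000 columns (both numbers, and all names, are illustrative):

    using System.IO;
    using System.Text;

    static class FixedWidthMatrix
    {
        const int FieldWidth = 20;                       // assumed max field length
        const int Cols = 20000;                          // assumed column count
        const int RowLength = Cols * FieldWidth + 1;     // +1 for the newline

        // One sequential pass: pad every field to a fixed width.
        public static void Convert(string csvPath, string fixedPath)
        {
            using (var reader = new StreamReader(csvPath))
            using (var writer = new StreamWriter(fixedPath))
            {
                string line;
                while ((line = reader.ReadLine()) != null)
                {
                    foreach (var field in line.Split(','))
                        writer.Write(field.PadRight(FieldWidth));   // assumes field fits
                    writer.Write('\n');
                }
            }
        }

        // Random read: the cell's offset is now pure arithmetic.
        public static string ReadCell(string fixedPath, int row, int col)
        {
            using (var fs = new FileStream(fixedPath, FileMode.Open, FileAccess.Read))
            {
                fs.Seek((long)row * RowLength + (long)col * FieldWidth, SeekOrigin.Begin);
                var buffer = new byte[FieldWidth];
                fs.Read(buffer, 0, FieldWidth);
                return Encoding.ASCII.GetString(buffer).TrimEnd();
            }
        }
    }

With fixed widths, the cell offset is just row * RowLength + col * FieldWidth, so no scanning is needed at read time.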

Aback answered 28/1, 2011 at 0:56 Comment(0)

What if you created 12345 separate files that are read with lazy instantiation? Each file would only be read if its data was needed. If the data is completely sparse, you could create a data structure with an IsEmpty bool property.

Do you need to access the same element over and over or do you need to just read each element once?
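
A sketch of that idea using Lazy<T>, with one file per row; the file naming, directory layout, and class names are illustrative assumptions:

    using System;
    using System.Collections.Generic;
    using System.IO;

    class LazyRowStore
    {
        private readonly string _directory;
        private readonly Dictionary<int, Lazy<string[]>> _rows =
            new Dictionary<int, Lazy<string[]>>();

        public LazyRowStore(string directory)
        {
            _directory = directory;
        }

        public string GetCell(int row, int col)
        {
            Lazy<string[]> lazyRow;
            if (!_rows.TryGetValue(row, out lazyRow))
            {
                // The per-row file (e.g. row12345.csv) is only read the first
                // time Lazy<T>.Value is accessed.
                string path = Path.Combine(_directory, "row" + row + ".csv");
                lazyRow = new Lazy<string[]>(() => File.ReadAllText(path).Trim().Split(','));
                _rows[row] = lazyRow;
            }
            return lazyRow.Value[col];
        }
    }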

Carrot answered 28/1, 2011 at 4:29 Comment(0)
