Using multiple cores to process a large, sequential file in C++

I have a large file (bigger than RAM, so I can't read it all at once) and I need to process it row by row (in C++). I want to utilize multiple cores, preferably with Intel TBB or Microsoft PPL. I would rather avoid preprocessing this file (like splitting it into 4 parts, etc.).

I was thinking about something like using 4 iterators, initialized to the (0, n/4, 2*n/4, 3*n/4) positions in the file, etc.

Is this a good solution, and is there a simple way to achieve it?

Or maybe you know of some libraries that support efficient, concurrent reading of streams?

Update:

I did tests. I/O is not the bottleneck, the CPU is. And I have a lot of RAM for buffers.

I need to parse each record (variable size, approx. 2000 bytes each; records are separated by a unique '\0' character), validate it, do some calculations, and write the result to another file (or files).

Retortion answered 20/5, 2011 at 10:46 Comment(6)
What kind of processing are you doing?Sinclair
I see a problem with this: every set of reads from (0, n/4, 2*n/4, 3*n/4) + i will include at least four disk seeks, and I/O might become the bottleneck.Cupola
@sehe: You have a point, I was assuming too much.Cupola
@sehe: given that the file is "bigger than RAM", I think we can safely assume it's not on a RAM disk.Housemaster
@Space's question is very relevant, do you know whether it's the IO that's killing you or your processing? You could look at a memory mapped implementation (i.e. map a block, process, then move to the next block etc.) This may help you reduce IO...Pearson
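A rough sketch of that block-by-block memory-mapping idea on POSIX (the file name and block size are placeholders, and handling of records that straddle a block boundary is omitted for brevity):

    // Sketch (POSIX): map the file one block at a time instead of read()ing it.
    #include <algorithm>
    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main() {
        int fd = open("input.txt", O_RDONLY);
        struct stat st;
        fstat(fd, &st);
        const off_t block = 64 * 1024 * 1024;   // 64 MB per mapping (arbitrary choice)
        for (off_t off = 0; off < st.st_size; off += block) {
            size_t len = static_cast<size_t>(std::min<off_t>(block, st.st_size - off));
            void* p = mmap(nullptr, len, PROT_READ, MAP_PRIVATE, fd, off);
            // ... scan static_cast<const char*>(p)[0 .. len) for '\0'-separated records ...
            munmap(p, len);
        }
        close(fd);
    }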
I did tests. I/O is not the bottleneck, the CPU is. And I have a lot of RAM for buffers. I need to parse each record (variable size, approx. 2000 bytes each; records are separated by a unique '\0'), validate it, do some calculations, and write the result to another file (or files).Retortion

Since you are able to split it into N parts, it sounds like the processing of each row is largely independent. In that case, I think the simplest solution is to set up one thread to read the file line by line and place each row into a tbb::concurrent_queue. Then spawn as many threads as you need to pull rows off that queue and process them.

This solution is independent of the file size, and if you find you need more (or fewer) worker threads, it's trivial to change the number. But this won't work if there are dependencies between the rows... unless you set up a second pool of "post-processing" threads to handle that, but then things may start to get too complex.
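A rough sketch of that reader-plus-queue setup (the file name, the worker count of 4, and process_row are placeholders; it assumes non-empty, newline-delimited rows and uses an empty string as an end-of-data sentinel):

    #include <fstream>
    #include <string>
    #include <thread>
    #include <vector>
    #include <tbb/concurrent_queue.h>

    static void process_row(const std::string& row) {
        // hypothetical per-row work: parse, validate, calculate, write results
        (void)row;
    }

    int main() {
        tbb::concurrent_bounded_queue<std::string> rows;
        rows.set_capacity(1024);            // bound memory use; reader blocks when full

        std::thread reader([&] {
            std::ifstream in("input.txt");
            std::string row;
            while (std::getline(in, row))
                rows.push(row);             // blocks if the queue is full
            for (int i = 0; i < 4; ++i)
                rows.push(std::string());   // one sentinel per worker
        });

        std::vector<std::thread> workers;
        for (int i = 0; i < 4; ++i) {
            workers.emplace_back([&] {
                std::string row;
                for (;;) {
                    rows.pop(row);          // blocks until a row is available
                    if (row.empty()) break; // sentinel: no more data
                    process_row(row);
                }
            });
        }

        reader.join();
        for (auto& w : workers) w.join();
    }

The bounded capacity keeps the reader from racing far ahead of the workers and filling up RAM.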

Earplug answered 20/5, 2011 at 10:58 Comment(2)
How about the write-out? I was doing a similar task and used the exact approach in reading and processing. But it seems outputting files could be a bottleneck from multiple threads. Currently, I am using a mutex lock in C to write to the same output file. Any suggestions?Gothicism
If each piece of data is independent and you know where it goes without having to look at the data before it, then you can write to the same file from multiple threads using pwrite to have each thread write the data in the correct place.Earplug
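A tiny sketch of the pwrite idea (POSIX), assuming a fixed output record size so each record's offset can be computed from its index (file name, payload, and index are placeholders):

    #include <fcntl.h>
    #include <unistd.h>

    int main() {
        int fd = open("out.dat", O_WRONLY | O_CREAT, 0644);
        const char result[] = "result for record 7\n";   // hypothetical payload
        const off_t record_size = sizeof(result) - 1;
        const off_t index = 7;                            // hypothetical record index
        // pwrite takes an explicit offset and does not move the shared file
        // position, so concurrent threads never race on lseek + write.
        pwrite(fd, result, record_size, index * record_size);
        close(fd);
    }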

My recommendation is to use TBB's pipeline pattern. The first, serial stage of the pipeline reads a desired portion of data from the file; subsequent stages process data chunks in parallel, and the last stage writes into another file, possibly in the same order as the data were read.

An example of this approach is available in TBB distributions; see examples/pipeline/square. It uses the "old" interface: the class tbb::pipeline and filters (classes inherited from tbb::filter) that pass data by void* pointers. A newer, type-safe and lambda-friendly "declarative" interface, tbb::parallel_pipeline(), may be more convenient to use.
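For reference, a minimal sketch of such a three-stage pipeline with the newer interface (written with the oneTBB spelling of filter modes; the file names and the pass-through parallel stage are placeholders):

    #include <fstream>
    #include <string>
    #include <tbb/parallel_pipeline.h>

    int main() {
        std::ifstream in("input.dat", std::ios::binary);
        std::ofstream out("output.dat", std::ios::binary);

        tbb::parallel_pipeline(
            16,  // max number of records "in flight" at once
            // Serial input stage: read one '\0'-separated record at a time.
            tbb::make_filter<void, std::string>(tbb::filter_mode::serial_in_order,
                [&](tbb::flow_control& fc) -> std::string {
                    std::string record;
                    if (!std::getline(in, record, '\0')) {
                        fc.stop();              // end of input
                        return std::string();
                    }
                    return record;
                }) &
            // Parallel stage: validate and compute on each record independently.
            tbb::make_filter<std::string, std::string>(tbb::filter_mode::parallel,
                [](std::string record) -> std::string {
                    // hypothetical work; here the record is passed through unchanged
                    return record;
                }) &
            // Serial output stage: results come out in the original read order.
            tbb::make_filter<std::string, void>(tbb::filter_mode::serial_in_order,
                [&](const std::string& result) {
                    out << result << '\0';
                }));
    }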

Palaearctic answered 20/5, 2011 at 13:12 Comment(0)

ianmac already hinted at the seek issue. Your iterator idea is reasonable with a slight twist: initialize them to 0, 1, 2, and 3, and increment each by 4. So the first thread works on items 0, 4, 8, etc. The OS will make sure the file is being fed to your app as quickly as possible. It may be possible to tell your OS that you'll be doing a sequential scan through the file (e.g. on Windows, it's a flag to CreateFile).
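For example, on Windows the hint is FILE_FLAG_SEQUENTIAL_SCAN (sketch below; the file name is a placeholder), and on POSIX systems posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL) plays a similar role:

    #include <windows.h>

    int main() {
        HANDLE h = CreateFileA("input.txt", GENERIC_READ, FILE_SHARE_READ,
                               nullptr, OPEN_EXISTING,
                               FILE_FLAG_SEQUENTIAL_SCAN,  // read-ahead hint to the cache manager
                               nullptr);
        if (h == INVALID_HANDLE_VALUE) return 1;
        // ... ReadFile() in a loop ...
        CloseHandle(h);
    }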

Housemaster answered 20/5, 2011 at 10:59 Comment(0)

In terms of reading from the file, I wouldn't recommend this. Hard drives, as far as I know, can't read from more than one place at a time.

However, processing the data is a different thing entirely, and you can easily do that in multiple threads. (Keeping the data in the correct order also wouldn't / shouldn't be difficult at all.)

Cerenkov answered 20/5, 2011 at 10:51 Comment(1)
Yes, an HDD can't read from multiple places at once, but the OS buffers I/O, so it would be one HDD read for dozens of records.Retortion

You don't say very much about what type of processing you intend to do. It is unclear whether you expect the process to be compute- or I/O-bound, whether there are data dependencies between the processing of different rows, etc.

In any case, parallel reading from four vastly different positions in one large file is likely to be inefficient (ultimately, the disk head will have to keep moving back and forth between different areas of the hard drive, with negative consequences for the throughput).

What you might consider instead is reading the file sequentially from start to finish, and fanning out individual rows (or blocks of rows) to the worker threads for processing.

Boutte answered 20/5, 2011 at 10:57 Comment(0)
