Is there any way to atomically read a line from a file in C++?
I am currently working on a project where I have a large text file (15+ GB) and I'm trying to run a function on each line of the file. To speed the task along, I am creating 4 threads and attempting to have them read the file at the same time. This is similar to what I have:

#include <fstream>
#include <iostream>
#include <string>
#include <thread>

void simpleFunction(std::ifstream *wordlist){
    std::string word;
    getline(*wordlist, word);       // all four threads read from the same stream
    std::cout << word << std::endl;
}

int main(){
    const int max_concurrent_threads = 4;
    std::ifstream wordlist("filename.txt");
    std::thread all_threads[max_concurrent_threads];

    for(int i = 0; i < max_concurrent_threads; i++){
        all_threads[i] = std::thread(simpleFunction, &wordlist);
    }

    for (int i = 0; i < max_concurrent_threads; ++i) {
        all_threads[i].join();
    }
    return 0;
}

The getline() call (along with *wordlist >> word) seems to advance the stream position and read the value in two separate steps, as I will regularly get:

Item1
Item2
Item3
Item2

back.

So I was wondering if there was a way to atomically read a line of the file? Loading it into an array first won't work because the file is too big, and I would prefer not to load the file in chunks at a time.

I couldn't find anything regarding fstream and the atomicity of getline(), sadly. If there is an atomic version of readline() or even a simple way to use locks to achieve what I want, I'm all ears.

River answered 1/12, 2016 at 7:9 Comment(9)
Is each line the same size? If not then no you can't really do it without some synchronization (e.g. semaphores or mutexes).Photooffset
I can't think of a way to implement that without a lock, even with raw read syscalls. However, it isn't the right way to do it: you should give your threads a line to process; then you don't have a shared resource.Synchronize
Odds are high that concurrent reads to the same file will slow down operation very much. There is a single disk to read from and you want to perform very fine grained accesses to different places, with synchronization.Gwenni
"I would prefer not to load the file in chunks at a time": isn't it exactly what you are trying to do ? I am afraid that the granularity of single lines is way too small.Gwenni
if you only read the file without writing anything back then why care about atomicity? File IO is probably the bottle neck in most applications so use a thread for reading the file into a buffer then distribute the computations on several other threads insteadWaterlogged
@LưuVĩnhPhúc If you read the post, he is getting same lines multiple times because getline() is not atomic.Inkberry
@Ville-ValtteriTiittanen yes I know. That's his design problemWaterlogged
C++ stream file reading is quite slow in general. So if you need more performance, read the file via native functions like POSIX calls. Splitting the read process across several threads with a lock makes no sense. One reader thread and maybe n threads for processing looks more useful. As far as I know, using mmap with POSIX gives the fastest results.Sherikasherill
Be sure to measure your hard drive's raw read rate, and also your program's current read rate. There's nothing more futile than trying to make software perform faster than the hardware it relies on is capable of.Kironde

The proper way to do this would be locking the file, which prevents all other processes from using it. See Wikipedia: File locking. This is probably too slow for you, because you only read one line at a time; but if you were reading, for example, 1000 or 10000 lines during each function call, it could be the best way to implement it.

If there are no other processes accessing the file, and it is enough that other threads don't access it, you can use a mutex that you lock when you access the file.

void simpleFunction(std::ifstream *wordlist){
    static std::mutex io_mutex;   // shared by every thread running this function
    std::string word;
    {
        std::lock_guard<std::mutex> lock(io_mutex);
        getline(*wordlist, word);
    }
    std::cout << word << std::endl;
}

Another way to implement your program could be creating a single thread that is reading the lines to the memory all the time, and the other threads would request single lines from the class that is storing them. You would need something like this:

class FileReader {
public:
    // This runs in its own thread
    void readingLoop() {
        // read lines to storage, unless there are too many lines already
    }

    // This is called by other threads
    std::string getline() {
        std::lock_guard<std::mutex> lock(storageMutex);
        // return line from storage, and delete it
    }
private:
    std::mutex storageMutex;
    std::deque<std::string> storage;
};
Inkberry answered 1/12, 2016 at 8:8 Comment(1)
Thanks for your help! I tested with using the mutex in the first example since it was easier to implement quickly. It read the file correctly, and gave a noticeable speedup from 1 core to 2 cores, but flattened after that. I imagine the locks from 3+ threads are slowing it down. I imagine the second result will be more scalable and I will implement that at a later date. Thanks again!River

Loading it into an array first won't work because the file is too big, and I would prefer not to load the file in chunks at a time.

So use a memory-mapped file. The operating system will load the file into virtual memory on demand, but this is transparent to your code; it will be far more efficient than stream I/O, and you may not even need, or benefit from, multiple threads.

Acetaldehyde answered 9/4, 2023 at 14:25 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.