Multi-stream iterators in C++
The purpose of my program is to open a text file of m lines of the same length n, read the file column by column and print each column.

For example, for this text file

abcd
efgh 
jklm

I would like to print

a e j
b f k
c g l
d h m

Since a line can be 200 000 000 characters long and there can be more than 10 000 lines, I can't load the whole file into memory as a matrix.

Theoretically, I would like a program that uses O(m) space and O(m*n) time.

Initially, I considered these solutions:

  • if I scan the whole file for each column, the complexity is O(m*n²),
  • if I use seekg and an array of positions and jump from position to position, the complexity is O(m*n*log(n)).

One last point: due to constraints on the server, I can only use the STL.

My latest idea is to create an array of iterators over the file, initialized at the beginning of each line. Then, to read the next column, I only need to increment each iterator. This is my code:

ifstream str2;
str2.open ("Input/test.data", ifstream::in);

int nbline = 3;
int nbcolumn = 4;
int x = 0;

istreambuf_iterator<char> istart (str2);
istreambuf_iterator<char> iend ;

istreambuf_iterator<char>* iarray;
iarray = new istreambuf_iterator<char>[nbline];


while (istart != iend){
    if (x % nbcolumn == 0){
        iarray[x/nbcolumn] = istart;
    }
    istart++;
    x++;
}

for (int j = 0; j<nbcolumn;j++){
    for (int i = 0; i<nbline;i++){
        cout  << *iarray[i] << "\t";
        iarray[i]++;
    }
    cout << endl;
}

Sadly, it does not work, and I get this output:

a       e       f       
�       �       �       
�       �       �       
�       �       �       

I think the problem is that the iterators in iarray are not independent of istart. How can I make them independent?

Cloven answered 22/10, 2018 at 10:14 Comment(12)
If you open the same file multiple times, you can have independent streams and thus independent iterators.Nabala
Good idea, that is my current method. I would like to improve on it because I need to open m files at the same time (where m is the number of lines), but on my server I can't use ulimit -n to increase the maximum number of simultaneously open files.Cloven
Alternatively, you could memory map the file into your process and treat it like one giant byte array. The OS will take care of paging in and out file contents as needed.Nabala
I have already tried this method, but it took a lot of RAM because each line is really huge, and it was very slow.Cloven
Hope this helps: mcs.anl.gov/~itf/dbpp/text/node126.html and en.m.wikipedia.org/wiki/In-place_matrix_transposition , though not about giant data...Homeomorphism
I assume the first output row should be a e j, not a e f.Mythos
How did you get O(mnlog(n)) for using seekg (and why would you need an array of positions instead of calculated offsets)? It's the logarithmic factor that looks wrong to me; where did that come from?Lugworm
Only using STL but which version of C++? 03, 11, 14, 17, 20 or ... ?Presumptive
Yes, sorry, the first output line is a e j. The log n factor for seekg comes from practice. As for the C++ version, for now I use C++11.Cloven
I do not understand. Complexity is usually a theoretical calculation. How does a complexity factor come "in practice" with no theoretical explanation?Lugworm
Does the output have to go to standard out? If yes, can you use temporary storage and print it to stdout later, maybe per line? IO is usually the bottleneck even with the best algorithm. Writing out an entire line will be faster than a loop with n cout calls.Mercurialism
Look here sci-hub.tw/10.1109/83.210874Litchfield

You can break the task into chunks, then process each chunk before moving on to the next.

You'd need a buffer for each line (the larger this is, the better the performance will be) and the seek position for that row. You may also need to make an initial pass through the file to get the correct offset of each row.

Read B bytes into the buffer for each row (using tellg to save the position in each row), then loop over those and generate your output. Go back and read the next B bytes from each row (using seekg to set the file position beforehand, and tellg to remember it afterwards) and generate the output. Repeat until you're done, being careful with the last chunk (or with small inputs) to not go past the end of the line.

Using your example, you have 3 rows to keep track of. Using a B size of 2, you'd read in ab, ef, and jk into your 3 buffers. Looping over those you'd output aej and bfk. Go back and read the next chunks: cd, gh, and lm. This gives cgl and dhm as output.

Mythos answered 9/12, 2018 at 18:38 Comment(5)
Maybe you have a better implementation than mine, but I have already tried this method and it is really slow. Can you share the code you have in mind?Cloven
@B.Hel: I don't know why you say this is slow. I'm skeptical that any faster way exists.Bewitch
This method should be very fast if you choose a buffer size that plays well with your disk block size. Go with 1KB or 4KB and you should get great performance.Elexa
I know that it is really slow because I have already implemented this version. Try it and compare with reading the file line by line. I am talking about files of 200 TB.Cloven
@B.Hel You want to maximize the amount of data you read every time you read, and minimize the number of seeks. Use 16K, 32K, 64K read buffers if you can. When possible, only read data once. If your row length is not a multiple of the storage unit of your OS (e.g. 2K clusters), this gets complicated as you need to maintain the end-of-line for all rows (read when the beginning of the next line was read), and have different offsets on each line that you need to read things in at. Once you do that, you can use unbuffered reads using OS API calls to avoid copying data around in memory.Mythos

I would do this like this:

  1. Open the source file.
  2. Measure the line size.
  3. Measure the line count (file size / (line size + size of EOL)). Note the EOL can be 2 bytes.
  4. Calculate the result file size. Open the result file and force it to its desired size, so that later you can seek to any part of it.
  5. Pick a square size that is manageable in memory, for example 1024x1024.
  6. Load a square part of the matrix: 1024 elements from each of 1024 consecutive rows.
  7. Transpose the square.
  8. Write it to the destination file by seeking to the proper column for each part of a row you are writing. (You can reduce the memory consumption of the previous point by transposing one column and writing it as a row, instead of transposing the whole square at once.)
  9. Iterate the square over the whole file matrix.

IMO you can't do better than this. The most critical part is selecting the size of the square; a big power of 2 is recommended.

Minuteman answered 13/12, 2018 at 16:24 Comment(0)

If you want to do this using multiple std::istreambuf_iterators, then you will need multiple fstreams for them to act on; otherwise, when you advance one (i.e. istart++) it affects all the iterators on that fstream, meaning that the next time you advance one (i.e. *iarray[i]++) you will skip a character. This is explained more clearly in the reference. Consider this snippet:

std::ifstream str;
str.open("test.data", std::ifstream::in);

std::istreambuf_iterator<char> i1 (str);
std::istreambuf_iterator<char> i2 (str);

std::cout << "i1 - " << *i1 << "   i2 - " << *i2 << std::endl;
i1++;
std::cout << "i1 - " << *i1 << "   i2 - " << *i2 << std::endl;
i2++;
std::cout << "i1 - " << *i1 << "   i2 - " << *i2 << std::endl;

which will output

i1 - a   i2 - a
i1 - b   i2 - a
i1 - b   i2 - c

Where i2 has appeared to 'skip' b in the stream. Even if you assign the second iterator later, i.e.

std::ifstream str;
str.open("test.data", std::ifstream::in);

std::istreambuf_iterator<char> i1 (str);
std::istreambuf_iterator<char> i2;
std::istreambuf_iterator<char> iend;

int x = 0;
while (i1 != iend) {
    if (x % 4 == 0) {
        i2 = i1;
        break;
    }
    x++;
    i1++;
}

std::cout << *i1 << " " << *i2 << std::endl;
i1++;
std::cout << *i1 << " " << *i2 << std::endl;
i2++;
std::cout << *i1 << " " << *i2 << std::endl;

the output values remain the same (this snippet prints them without labels) -

a a
b a
b c

Why?

Because in either case both iterators are acting on the same stream object, and every time you advance one it consumes a character from the stream. In the code in question every iterator (istart, iarray[i]) acts on the same stream object, so every increment of any of them consumes a char from the stream. The output quickly becomes the result of undefined behavior, since iterating beyond end-of-stream is undefined (and since the iterators advance together, you reach the end quickly).


If you want to do this the way you have outlined, you simply need multiple fstream objects, such as

#include <fstream>
#include <string>
#include <iostream>


int main(int argn, char** argv) {
    std::ifstream str2;
    str2.open ("test.data", std::ifstream::in);

    int nbline = 3;
    int nbcolumn = 4;
    int x = 0;

    std::istreambuf_iterator<char> istart (str2);
    std::istreambuf_iterator<char> iend ;

    std::ifstream* streams = new std::ifstream[nbline];
    for (int ii = 0; ii < nbline; ii++) {
        streams[ii].open("test.data", std::ifstream::in);
    }
    std::istreambuf_iterator<char>* iarray = new std::istreambuf_iterator<char>[nbline];
    for (int ii = 0; ii < nbline; ii ++) {
        iarray[ii] = std::istreambuf_iterator<char> (streams[ii]);
    }

    int idx = 0;
    while (istart != iend && idx < nbline) {
        if (x % nbcolumn == 0) {
            // nbcolumn+1 accounts for the newline at the end of each input line
            std::advance(iarray[idx], (nbcolumn+1)*idx);
            idx++;
        }
        x++;
        istart++;
    }

    for (int ii = 0; ii < nbcolumn; ii ++) {
        for (int jj = 0; jj < nbline; jj ++) {
            std::cout << *iarray[jj]++ << "\t";
        }
        std::cout << std::endl;
    }

    delete[] iarray;
    delete[] streams;
}

Which produces the output you are expecting,

a       e       j
b       f       k
c       g       l
d       h       m

I can make no comment on the speed of this method relative to others that have been suggested, but this is how you would do what you are asking using this method.

Adamok answered 12/12, 2018 at 1:19 Comment(1)
For now, this is the method I use, but I need to open one file per line, and if there are a lot of lines that can be a problem.Cloven

General Considerations

If the array of iterators were to work, it would have to iterate through memory (see also William Miller's answer); otherwise, what would it iterate through?

The trade-off is:

  1. Parse until the first output line is complete, then do the same for all the other output lines
    • slow, almost no memory used
  2. Fill a matrix completely and output the transposed matrix
    • a lot of memory used
  3. Create an array of positions for all output lines and seek through all positions
    • fast, with reasonable memory usage
  4. A very intelligent mix of methods 2 and 3
    • taking the shortest possible time within a given memory budget (e.g. let's say 8 GByte of RAM)

Solution for trade-off 4

This needs more knowledge of the boundary conditions.

A concept for solution 4 depends on a lot of unknown conditions:

  • What are the characteristics of the input data?
    • Is the 200 TByte one matrix or several matrices?
    • How many?
    • What is the worst-case ratio between columns and rows?
    • Is it just single characters or can it be words?
    • If it is just single characters, is it guaranteed that each line has the same size in bytes?
    • If not, how is a new line recognised?
  • How much free RAM is available?
  • How fast can the target computer fill all the free RAM?
  • What is the maximum acceptable time?

Problem Analysis of the original program

The question is also: why doesn't it work?

The program ...

#include    <fstream>
#include    <string>
#include    <iostream>

int main(int argc, char* argv[]) {
    std::ifstream str2;
    str2.open ("test.data", std::ifstream::in);

    std::istreambuf_iterator<char> istart(str2);
    std::istreambuf_iterator<char> iend;
    std::istreambuf_iterator<char> iarray1 = istart;

    istart++;
    istart++;
    istart++;
    istart++;
    std::istreambuf_iterator<char> iarray2 = istart;

    std::cout  << *(iarray1);
    std::cout << std::endl;
    std::cout  << *(iarray2);
    std::cout << std::endl;
    return 0;
}

...reads test.data, which contains...

abcdefghjklm

...and the program prints...

e
e

Hence, the loop...

while (istart != iend){
    if (x % nbcolumn == 0){
        iarray[x/nbcolumn] = istart;
    }
    istart++;
    x++;
}

...will not lead to the expected result, because the iterator works differently than assumed, and each call of...

iarray[i]++;

...manipulates all the iterators at the same time.


Solution for trade-off 3

What is the way out? Creating code according to trade-off #3.

The program...

#include    <iostream>
#include    <ios>
#include    <string>
#include    <fstream>

int main(int argc, char* argv[]) {
    const int nbline = 3;    // const so the array size below is a constant expression
    const int nbcolumn = 4;
    std::ifstream   fsIn;
    std::streampos  posLine[nbline];   // one saved position per input line
    std::streampos  posTemp;

    fsIn.open("test.data", std::ifstream::in);
    for ( int i = 0; i < nbline; i++) {
        posLine[i] = posTemp;
        posTemp += nbcolumn;   // line stride; add the EOL size here if the file contains line breaks
    }

    for ( int j = 0; j < nbcolumn; j++) {
        for ( int i = 0; i < nbline; i++) {
            fsIn.seekg(posLine[i]);
            std::cout  << char(fsIn.get()) << " ";
            posLine[i] = fsIn.tellg();
        }
        std::cout << std::endl;
    }
    return 0;
}

...creates the output:

a e j
b f k
c g l
d h m
Indigestive answered 11/12, 2018 at 22:9 Comment(4)
Thank you for your response, but using seekg is really too slow.Cloven
@B.Hel: Thank you for your response. OK, there may be a solution for you. If you answer my questions then I can analyse the feasibility for you - just for stackoverflow reputation. I can even create the complete algorithm of solution 4, and even the integration, but then I would need a better motivator than stackoverflow reputation. For further investigation I need your commitment to proceed this way. Please tell me how we are going to proceed. Thank you.Perfusion
@B.Hel: Please find the questions in my post under "Solution for trade-off 4".Perfusion
@B.Hel: BTW.: Solution 4 goes into direction of 1201programalarm's and marek-r's idea.Perfusion

You cannot use an istreambuf_iterator twice; it can only be used once. Anyhow, I hope the code below helps you.

Let me explain what I am trying to do first. File reads are much faster when done sequentially, so what I am doing here is a buffered read. Let's say, in your example, I buffer two output lines: I have to allocate 6 bytes of buffer and fill it using seeks, and each read then reads two bytes per row, since we are holding two output lines. This can be optimized further: if you print the first character immediately as you read it, you can buffer two output lines using only 3 bytes, and three using 6 bytes, in your example. Anyhow, I am giving you the non-optimized version.

Again, let me remind you that you cannot use an istreambuf_iterator twice: How do I use an iterator on an ifstream twice in C++?

If you really have to use an iterator, you could implement your own iterator that seeks and reads on a file; that can get really messy though...

#include <iostream>
#include <fstream>
#include <vector>
#include <stdexcept>
#include <sstream>
#include <string>      // for std::to_string
#include <algorithm>

std::vector<std::size_t> getPositions(std::ifstream& str2, int &numcolumns) {
    std::vector<std::size_t> iarray;

    iarray.push_back(0); // Add first iterator

    bool newlinereached = false;
    int tmpcol = 0;
    int currentLine = 0;
    char currentChar = 0;
    char previosChar = 0;

    numcolumns = -1;

    for (str2.seekg(0, std::ios_base::beg); ; previosChar = currentChar) {
        const std::size_t currentPosition = str2.tellg();
        if (!str2.read(&currentChar, 1)) break; // stop on EOF instead of testing eof() before the read
        if (newlinereached) {
            if (currentChar == '\r') {
                // Always error but skip for now :)
                continue;
            }
            else if (currentChar == '\n') {
                // ERROR CONDITION WHEN if (numcolumns < 0) or previosChar == '\n'
                continue;
            }
            else if (tmpcol == 0) {
                throw std::runtime_error("Line " + std::to_string(currentLine) + " is empty");
            }
            else {
                if (numcolumns < 0) {
                    // We just found first column size
                    numcolumns = tmpcol;
                    iarray.reserve(numcolumns);
                }
                else if (tmpcol != numcolumns) {
                    throw std::runtime_error("Line " + std::to_string(currentLine)
                        + " has an inconsistent number of columns; expected " + std::to_string(numcolumns));
                }

                iarray.push_back(currentPosition);
                tmpcol = 1;
                newlinereached = false;
            }
        }
        else if (currentChar == '\r' || currentChar == '\n') {
            newlinereached = true;
            ++currentLine;
        }
        else {
            tmpcol++;
        }
    }

    if (currentChar == 0) {
        throw std::runtime_error("Line " + std::to_string(currentLine)
            + " contains a 'null' character");
    }

    str2.clear(); // Restart 

    return iarray;
}

int main() {
    using namespace std;

    ifstream str2;
    str2.open("Text.txt", ifstream::in);
    if (!str2.is_open()) {
        cerr << "Failed to open the file" << endl;
        return 1;
    }

    int numinputcolumns = -1;

    std::vector<std::size_t> iarray =
        getPositions(str2, numinputcolumns); // S(N)

    const std::size_t numinputrows = iarray.size();

    std::vector<char> buffer;
    const int numlinestobuffer = std::min(2, numinputcolumns); // 1 For no buffer

    buffer.resize(numinputrows * numlinestobuffer); // S(N)

    const std::size_t bufferReadMax = buffer.size();


    for (int j = 0; j < numinputcolumns; j += numlinestobuffer)
    {
        // Seek fill buffer. Needed because sequental reads are much faster even on SSD
        // Still can be optimized more: We can buffer n+1 rows as we can discard current row read
        std::size_t nread = std::min<std::size_t>(numlinestobuffer, numinputcolumns - j);
        for (std::size_t i = 0; i < numinputrows; ++i)
        {
            str2.seekg(iarray[i], ios_base::beg);
            str2.read(&buffer[i * numlinestobuffer], nread);
            iarray[i] += nread;
        }

        // Print the buffer
        for (std::size_t b = 0; b < nread; ++b)
        {
            for (std::size_t k = 0; k < numinputrows; ++k) {
                std::cout << buffer[b + k * numlinestobuffer] << '\t';
            }
            std::cout << std::endl;
        }
    }

    return 0;
}
Zoochemistry answered 14/12, 2018 at 23:46 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.