Transpose a large array without loading it into memory

I have a large gzipped file (5000 columns × 1M lines) consisting of 0's and 1's:

0 1 1 0 0 0 1 1 1....(×5000)
0 0 0 1 0 1 1 0 0
....(×1M)

I want to transpose it, but numpy and the other methods I tried load the whole table into RAM, and I only have 6 GB at my disposal.

For this reason, I wanted to use a method that writes each transposed line to an open file instead of storing it in RAM. I came up with the following code:

import gzip

with open("output.txt", "w") as out:

    with gzip.open("file.txt", "rt") as file:

        number_of_columns = len(file.readline().split())

        # iterate over number of columns (~5000)
        for column in range(number_of_columns):

            # in each iteration, rewind to the start of the file
            # (for a gzipped file, later reads re-decompress from the beginning)
            file.seek(0)

            # initialize a list storing the ith column's elements, which will form one line of the transposed output
            transposed_column = []

            # iterate over lines (~1M), storing the ith element in the list
            for line in file:
                transposed_column.append(line.split()[column])

            # write the transposed column as a line of the output file
            out.write(" ".join(transposed_column) + "\n")

However, this is very slow. Can anybody suggest another solution? Is there any way to append a list as a column (instead of as a line) to an existing open file? In pseudocode:

with open("output.txt", w) as out:
    with gzip.open("file.txt", rt) as file:
        for line in file:
            transposed_line = line.transpose()
            out.write(transposed_line, as.column)

UPDATE

The answer from user7813790 led me to this code:

import numpy as np
import random


# create example array and write to file

with open("array.txt", "w") as out:

    num_columns = 8
    num_lines = 24

    for i in range(num_lines):
        line = []
        for column in range(num_columns):
            line.append(str(random.choice([0,1])))
        out.write(" ".join(line) + "\n")


# iterate over chunks of dimensions num_columns×num_columns, transpose them, and append to file

with open("array.txt", "r") as array:

    with open("transposed_array.txt", "w") as out:

        for chunk_start in range(0, num_lines, num_columns):

            # get chunk and transpose
            chunk = np.genfromtxt(array, max_rows=num_columns, dtype=int).T
            # write out chunk
            out.seek(chunk_start+num_columns, 0)
            np.savetxt(out, chunk, fmt="%s", delimiter=' ', newline='\n')

It takes a matrix like:

0 0 0 1 1 0 0 0
0 1 1 0 1 1 0 1
0 1 1 0 1 1 0 0
1 0 0 0 0 1 0 1
1 1 0 0 0 1 0 1
0 0 1 1 0 0 1 0
0 0 1 1 1 1 1 0
1 1 1 1 1 0 1 1
0 1 1 0 1 1 1 0
1 1 0 1 1 0 0 0
1 1 0 1 1 0 1 1
1 0 0 1 1 0 1 0
0 1 0 1 0 1 0 0
0 0 1 0 0 1 0 0
1 1 1 0 0 1 1 1
1 0 0 0 0 0 0 0
0 1 1 1 1 1 1 1
1 1 1 1 0 1 0 1
1 0 1 1 1 0 0 0
0 1 0 1 1 1 1 1
1 1 1 1 1 1 0 1
0 0 1 1 0 1 1 1
0 1 1 0 1 1 0 1
0 0 1 0 1 1 0 1

and iterates over 2D chunks with both dimensions equal to the number of columns (8 in this case), transposing them and appending them to an output file.

1st chunk transposed:

[[0 0 0 1 1 0 0 1]
 [0 1 1 0 1 0 0 1]
 [0 1 1 0 0 1 1 1]
 [1 0 0 0 0 1 1 1]
 [1 1 1 0 0 0 1 1]
 [0 1 1 1 1 0 1 0]
 [0 0 0 0 0 1 1 1]
 [0 1 0 1 1 0 0 1]]

2nd chunk transposed:

[[0 1 1 1 0 0 1 1]
 [1 1 1 0 1 0 1 0]
 [1 0 0 0 0 1 1 0]
 [0 1 1 1 1 0 0 0]
 [1 1 1 1 0 0 0 0]
 [1 0 0 0 1 1 1 0]
 [1 0 1 1 0 0 1 0]
 [0 0 1 0 0 0 1 0]]

etc.

I am trying to append each new chunk to the output file as columns, using out.seek(). As far as I understand, seek() takes as its first argument the offset from the beginning of the file (i.e. the column), and 0 as the second argument means to start from the first row again. So I would have guessed that the following line would do the trick:

out.seek(chunk_start+num_columns, 0)

But instead, it does not continue writing at that offset on the following rows. Also, it adds n = num_columns spaces at the beginning of the first row. Output:

    0 0 0 1 0 1 1 1 0 1 1 0 1 0 0 0
1 1 0 1 1 0 1 0
1 1 1 0 1 1 1 1
1 1 1 1 1 1 0 0
1 0 1 1 1 0 1 1
1 1 0 1 1 1 1 1
1 0 0 1 0 1 0 0
1 1 0 1 1 1 1 1

Any insight on how to use seek() properly for this task, i.e. to generate this:

0 0 0 1 1 0 0 1 0 1 1 1 0 0 1 1 0 1 1 0 1 0 0 0
0 1 1 0 1 0 0 1 1 1 1 0 1 0 1 0 1 1 0 1 1 0 1 0
0 1 1 0 0 1 1 1 1 0 0 0 0 1 1 0 1 1 1 0 1 1 1 1
1 0 0 0 0 1 1 1 0 1 1 1 1 0 0 0 1 1 1 1 1 1 0 0
1 1 1 0 0 0 1 1 1 1 1 1 0 0 0 0 1 0 1 1 1 0 1 1
0 1 1 1 1 0 1 0 1 0 0 0 1 1 1 0 1 1 0 1 1 1 1 1
0 0 0 0 0 1 1 1 1 0 1 1 0 0 1 0 1 0 0 1 0 1 0 0
0 1 0 1 1 0 0 1 0 0 1 0 0 0 1 0 1 1 0 1 1 1 1 1

Please note that this is just a dummy test matrix; the actual matrix is 5008 columns × >1M lines.

UPDATE 2

I have figured out how to make this work; it can also use chunks of any dimensions.

import numpy as np
import random


# create example array and write to file

num_columns = 4
num_lines = 8

with open("array.txt", "w") as out:
    for i in range(num_lines):
        line = []
        for column in range(num_columns):
            line.append(str(random.choice([0,1])))
        out.write(" ".join(line) + "\n")


# iterate over chunks of chunk_length lines × num_columns columns, transpose them, and write them into the output file at the right offsets

chunk_length = 7

with open("array.txt", "r") as array:

    with open("transposed_array.txt", "w") as out:

        for chunk_start in range(0, num_lines, chunk_length):

            # get chunk and transpose
            chunk = np.genfromtxt(array, max_rows=chunk_length, dtype=str).T

            # write out chunk: each cell is exactly 2 bytes ("0 " or "1 ", with
            # "\n" as the last separator of a row), so byte offsets into the
            # output file can be computed directly

            # pad shorter writes so each row keeps its full length
            # (later chunks overwrite the padding)
            empty_line = 2 * (num_lines - (chunk_length + chunk_start))

            for i, line in enumerate(chunk):
                # output row i starts at byte 2 * num_lines * i; this chunk's
                # part of it starts 2 * chunk_start bytes further in
                new_pos = 2 * num_lines * i + 2 * chunk_start
                out.seek(new_pos)
                out.write(f"{' '.join(line)}{' ' * empty_line}\n")

In this case, it takes an array like this:

1 1 0 1
0 0 1 0
0 1 1 0
1 1 1 0
0 0 0 1
1 1 0 0
0 1 1 0
0 1 1 1

and transposes it using chunks of 4 columns × 7 lines, so the 1st chunk would be

1 0 0 1 0 1 0
1 0 1 1 0 1 1
0 1 1 1 0 0 1
1 0 0 0 1 0 0

It is written to the file and freed from memory, and then the 2nd chunk is

0
1
1
1

and it is again written into the file at the correct offsets, so the final result is:

1 0 0 1 0 1 0 0
1 0 1 1 0 1 1 1
0 1 1 1 0 0 1 1
1 0 0 0 1 0 0 1
Orelle answered 25/6, 2019 at 11:39 Comment(3)
"this is very slow" - what's your current time and RAM consumption for this processing?Gastrectomy
It consumes just some MB, but I calculate that it will take more than 3 days to completeOrelle
In your chunked solution, you have to understand that the data is not stored as a two-way array inside the file -- it's one-dimensional, and you are just simulating two dimensions by treating fixed-length chunks of bytes as rows. You cannot write a whole multi-line chunk directly, but have to loop through each line, finding the correct offset for each row (and assuming you have allocated empty space for it in the file -- you cannot insert bytes in the middle).Leonard

In your working but slow solution, you are reading the input file 5,000 times -- that won't be fast, but the only easy way to minimize the reads is to read it all into memory.

You could try a compromise where you read, say, fifty columns at a time into memory (~50 MB) and write them into the output file as rows. This way you would read the file "only" 100 times. Try a few different combinations to find a performance/memory trade-off you're satisfied with.

You would do this over three nested loops:

  1. Loop over the number of chunks (100 in this case)
  2. Loop over the lines of your input file
  3. Loop over the number of columns in your chunk (50 here)

In the innermost loop you collect the column values as rows of a two-dimensional array, one row per iteration of the middle loop. In the outermost loop, you clear the array before entering the inner loops and write it out to the file as rows afterwards. For each iteration of loop 1 you will have written fifty rows of a million columns each.
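
A rough sketch of that scheme, assuming whitespace-separated single digits; file.txt and output.txt are the names from your question, and COLS_PER_PASS is just an example constant to tune:

import gzip

INPUT = "file.txt"        # gzipped input, 0/1 values separated by spaces
OUTPUT = "output.txt"     # plain-text transposed output
COLS_PER_PASS = 50        # tune for your RAM budget

# determine the number of columns from the first line
with gzip.open(INPUT, "rt") as f:
    num_columns = len(f.readline().split())

with open(OUTPUT, "w") as out:
    # loop 1: one full pass over the input per group of columns (~100 passes)
    for chunk_start in range(0, num_columns, COLS_PER_PASS):
        chunk_end = min(chunk_start + COLS_PER_PASS, num_columns)
        # one list per output row, i.e. per input column in this group
        rows = [[] for _ in range(chunk_end - chunk_start)]

        # loop 2: over the lines of the input file
        with gzip.open(INPUT, "rt") as f:
            for line in f:
                fields = line.split()
                # loop 3: over the columns belonging to this group
                for i in range(chunk_start, chunk_end):
                    rows[i - chunk_start].append(fields[i])

        # write the collected columns out as rows, then drop them from memory
        for row in rows:
            out.write(" ".join(row) + "\n")

With 5,000 columns this reads the gzipped input about 100 times instead of 5,000; increasing COLS_PER_PASS cuts the number of reads further at the cost of memory.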

You can't really insert in the middle of a normal file without loading the whole target file into memory -- you would need to shift the trailing bytes forward manually. Since you know your exact file size, however, you could pre-allocate the output and always seek to the right position before writing each byte; doing 5 billion seeks probably won't be very fast either... If your ones and zeroes are fairly evenly distributed, you could initialize the file with all zeroes and then only write the ones (or the other way around) to halve the number of seeks.
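
For illustration only (as noted, this is likely far too slow at full scale), a sketch of the pre-allocate-and-write-only-the-ones idea, assuming each value plus its separator occupies exactly 2 bytes; transposed.txt is a placeholder output name and the dimensions are the ones from the question:

import gzip

NUM_IN_ROWS = 1_000_000      # rows of the input = columns of the output
NUM_IN_COLS = 5000           # columns of the input = rows of the output
ROW_BYTES = 2 * NUM_IN_ROWS  # each output row: "d " per cell, "\n" as last separator

# pre-allocate the output filled with zeroes
with open("transposed.txt", "wb") as out:
    zero_row = b"0 " * (NUM_IN_ROWS - 1) + b"0\n"
    for _ in range(NUM_IN_COLS):
        out.write(zero_row)

# write only the ones, seeking to each one's position in the output
with gzip.open("file.txt", "rt") as f, open("transposed.txt", "r+b") as out:
    for in_row, line in enumerate(f):
        for in_col, value in enumerate(line.split()):
            if value == "1":
                # input cell (in_row, in_col) becomes output cell (in_col, in_row)
                out.seek(in_col * ROW_BYTES + 2 * in_row)
                out.write(b"1")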

Edit: added details on how the chunking could be implemented.

Leonard answered 25/6, 2019 at 11:52 Comment(0)

If your numbers are all 0 or 1 then every line has the same length (in bytes) so you can use file.seek to move around in the file (rather than reading in and ignoring data). However, this might not be so efficient with a gzipped input file. Since you are writing an uncompressed file, you can also use seek to jump around in the output.

A more efficient way to transpose an array is to read in a chunk which fits into RAM (e.g. 1000×1000), use numpy.transpose to transpose the chunk, then write the chunk into its location in the transposed array. With your array, which is 5000 columns but 1M rows, it will probably be easiest to use 5000×5000 chunks, i.e. read 5000 complete rows of the input matrix at a time. This avoids having to seek around in the compressed input file. Then you have to write this chunk into the output file, leaving blank space for the columns which come from the subsequent rows of the input.

More detail on how to write chunks into the 5000×N output file (as requested in the comments), with a code sketch after the two lists below:

To write the first 5000x5000 chunk:

  • Seek to the start of the file
  • Write the first row of the chunk (5000 elements)
  • Seek to the start of the second row of the output (i.e. offset 2N in the file, or 2N+1 if you have CRLF line endings)
  • Write the second row of the chunk
  • Seek to the start of the third row of the file
  • etc

To write the second chunk:

  • Seek to position 5000 (zero-based) of the first row of the output
  • Write the first row of the chunk
  • Seek to position 5000 of the second output row
  • Write the second row of the chunk
  • etc
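
A minimal sketch of this scheme, assuming single-digit values with single-space separators (so each output row is exactly 2N bytes, newline included) and a pre-allocated output file; transposed.txt and the sizes are placeholders:

import gzip
import numpy as np

N = 1_000_000        # input rows = output columns
NUM_COLUMNS = 5000   # input columns = output rows
CHUNK_ROWS = 5000    # input rows read per chunk
ROW_BYTES = 2 * N    # fixed byte length of one output row

# pre-allocate the output with placeholder rows so we can seek anywhere in it
with open("transposed.txt", "wb") as out:
    placeholder_row = b"0 " * (N - 1) + b"0\n"
    for _ in range(NUM_COLUMNS):
        out.write(placeholder_row)

with gzip.open("file.txt", "rt") as f, open("transposed.txt", "r+b") as out:
    for chunk_start in range(0, N, CHUNK_ROWS):
        # read CHUNK_ROWS input lines and transpose them:
        # chunk[i] is input column i restricted to these rows
        chunk = np.genfromtxt(f, max_rows=CHUNK_ROWS, dtype=str).T
        for i, row in enumerate(chunk):
            # output row i, starting at output column chunk_start
            out.seek(i * ROW_BYTES + 2 * chunk_start)
            # the joined string is 2*len(row) - 1 bytes, so the separator
            # already present after the last cell is left untouched
            out.write(" ".join(row).encode())

This way the gzipped input is read only once, sequentially, and only one CHUNK_ROWS × NUM_COLUMNS block is held in memory at a time.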
Pharisaic answered 25/6, 2019 at 12:21 Comment(4)
I will try what you propose of transposing chunks of 5K×5K. What I do not quite understand is whether these transposed chunks should be output to separate files (~200 files of 5K×5K) and then merged by columns (so 5K rows × 1M columns), or whether instead there is a way to indicate where in the existing output file you want the next transposed chunk to be appended. If it is the former scenario, I guess I would have to load a growing file every time I want to do the merging, so it would eventually become too large again. If it is the latter, do you know how to output to a specific location of a file? – Orelle
Thanks for the update, could you please check out the code I came up with? I do not get the out.seek() part right. – Orelle
In your example code the chunk size is not equal to the number of columns in the input. That makes it more complicated because you have to deal with the chunk index in two dimensions. – Pharisaic
I have updated it to use the number of columns as both chunk dimensions. – Orelle
