I have a large gzipped file (5000 columns × 1M lines) consisting of 0's and 1's:
0 1 1 0 0 0 1 1 1....(×5000)
0 0 0 1 0 1 1 0 0
....(×1M)
I want to transpose it, but using numpy or similar approaches loads the whole table into RAM, and I only have 6 GB at my disposal.
For this reason, I wanted a method that writes each transposed line to an open file instead of holding everything in RAM. I came up with the following code:
import gzip

with open("output.txt", "w") as out:
    with gzip.open("file.txt", "rt") as file:
        number_of_columns = len(file.readline().split())
        # iterate over the number of columns (~5000)
        for column in range(number_of_columns):
            # in each iteration, go back to the top line and start again
            file.seek(0)
            # list storing the ith column's elements, which will form the transposed line
            transposed_column = []
            # iterate over the lines (~1M), storing the ith element of each
            for line in file:
                transposed_column.append(line.split()[column])
            # write the transposed column as a line to the output file
            out.write(" ".join(transposed_column) + "\n")
However, this is very slow. Can anybody suggest another solution? Is there any way to append a list as a column (instead of as a line) to an existing open file? (pseudocode):
with open("output.txt", w) as out:
with gzip.open("file.txt", rt) as file:
for line in file:
transposed_line = line.transpose()
out.write(transposed_line, as.column)
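In the meantime, the only mitigation of my first approach I can think of is to extract a batch of columns per pass, so the file only has to be decompressed ~5000/batch_size times instead of ~5000 times. A rough, untested sketch of that idea (batch_size is an arbitrary value, limited only by available RAM):

import gzip

batch_size = 100  # arbitrary; how many columns to hold in RAM per pass

with open("output.txt", "w") as out:
    with gzip.open("file.txt", "rt") as file:
        number_of_columns = len(file.readline().split())
        for batch_start in range(0, number_of_columns, batch_size):
            # rewind and re-read the whole file once per batch
            file.seek(0)
            # one list of values per column in the current batch
            batch = [[] for _ in range(min(batch_size, number_of_columns - batch_start))]
            for line in file:
                fields = line.split()
                for j, column_values in enumerate(batch):
                    column_values.append(fields[batch_start + j])
            for column_values in batch:
                out.write(" ".join(column_values) + "\n")

This still decompresses and scans the whole file many times, though, which is why I am asking for a fundamentally different approach.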
UPDATE
The answer of user7813790 led me to this code:
import numpy as np
import random

# create example array and write to file
with open("array.txt", "w") as out:
    num_columns = 8
    num_lines = 24
    for i in range(num_lines):
        line = []
        for column in range(num_columns):
            line.append(str(random.choice([0, 1])))
        out.write(" ".join(line) + "\n")

# iterate over chunks of dimensions num_columns×num_columns, transpose them, and append to file
with open("array.txt", "r") as array:
    with open("transposed_array.txt", "w") as out:
        for chunk_start in range(0, num_lines, num_columns):
            # get chunk and transpose
            chunk = np.genfromtxt(array, max_rows=num_columns, dtype=int).T
            # write out chunk
            out.seek(chunk_start + num_columns, 0)
            np.savetxt(out, chunk, fmt="%s", delimiter=' ', newline='\n')
It takes a matrix like:
0 0 0 1 1 0 0 0
0 1 1 0 1 1 0 1
0 1 1 0 1 1 0 0
1 0 0 0 0 1 0 1
1 1 0 0 0 1 0 1
0 0 1 1 0 0 1 0
0 0 1 1 1 1 1 0
1 1 1 1 1 0 1 1
0 1 1 0 1 1 1 0
1 1 0 1 1 0 0 0
1 1 0 1 1 0 1 1
1 0 0 1 1 0 1 0
0 1 0 1 0 1 0 0
0 0 1 0 0 1 0 0
1 1 1 0 0 1 1 1
1 0 0 0 0 0 0 0
0 1 1 1 1 1 1 1
1 1 1 1 0 1 0 1
1 0 1 1 1 0 0 0
0 1 0 1 1 1 1 1
1 1 1 1 1 1 0 1
0 0 1 1 0 1 1 1
0 1 1 0 1 1 0 1
0 0 1 0 1 1 0 1
and iterates over 2D chunks with both dimensions equal to the number of columns (8 in this case), transposing them and appending them to an output file.
1st chunk transposed:
[[0 0 0 1 1 0 0 1]
[0 1 1 0 1 0 0 1]
[0 1 1 0 0 1 1 1]
[1 0 0 0 0 1 1 1]
[1 1 1 0 0 0 1 1]
[0 1 1 1 1 0 1 0]
[0 0 0 0 0 1 1 1]
[0 1 0 1 1 0 0 1]]
2nd chunk transposed:
[[0 1 1 1 0 0 1 1]
[1 1 1 0 1 0 1 0]
[1 0 0 0 0 1 1 0]
[0 1 1 1 1 0 0 0]
[1 1 1 1 0 0 0 0]
[1 0 0 0 1 1 1 0]
[1 0 1 1 0 0 1 0]
[0 0 1 0 0 0 1 0]]
etc.
I am trying to append each new chunk to the out file as columns, using out.seek(). As far as I understand, seek() takes as first argument the offset from the beginning of the file (i.e. the column), and 0 as a second argument means to start from the first row again. So, I would have guessed that the following line would do the trick:
out.seek(chunk_start+num_columns, 0)
But instead, it does not continue at that offset along the following rows. Also, it adds n = num_columns spaces at the beginning of the first row. Output:
0 0 0 1 0 1 1 1 0 1 1 0 1 0 0 0
1 1 0 1 1 0 1 0
1 1 1 0 1 1 1 1
1 1 1 1 1 1 0 0
1 0 1 1 1 0 1 1
1 1 0 1 1 1 1 1
1 0 0 1 0 1 0 0
1 1 0 1 1 1 1 1
Any insight on how to use seek() properly for this task? i.e. to generate this:
0 0 0 1 1 0 0 1 0 1 1 1 0 0 1 1 0 1 1 0 1 0 0 0
0 1 1 0 1 0 0 1 1 1 1 0 1 0 1 0 1 1 0 1 1 0 1 0
0 1 1 0 0 1 1 1 1 0 0 0 0 1 1 0 1 1 1 0 1 1 1 1
1 0 0 0 0 1 1 1 0 1 1 1 1 0 0 0 1 1 1 1 1 1 0 0
1 1 1 0 0 0 1 1 1 1 1 1 0 0 0 0 1 0 1 1 1 0 1 1
0 1 1 1 1 0 1 0 1 0 0 0 1 1 1 0 1 1 0 1 1 1 1 1
0 0 0 0 0 1 1 1 1 0 1 1 0 0 1 0 1 0 0 1 0 1 0 0
0 1 0 1 1 0 0 1 0 0 1 0 0 0 1 0 1 1 0 1 1 1 1 1
Please note that this is just a dummy test matrix; the actual matrix is 5008 columns × >1M lines.
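For reference, a minimal check of what seek() actually does on a plain text file (the offset is a character position counted from the start of the file, with no notion of rows or columns), in case that is part of my confusion:

with open("seek_test.txt", "w") as out:
    out.write("0 0 0 0\n")
    out.write("1 1 1 1\n")
    # character 2 is the second value of the FIRST line
    out.seek(2)
    out.write("9")

with open("seek_test.txt") as f:
    print(f.read())
# 0 9 0 0
# 1 1 1 1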
UPDATE 2
I have figured out how to make this work; it can also use chunks of arbitrary dimensions.
import numpy as np
import random

# create example array and write to file
num_columns = 4
num_lines = 8
with open("array.txt", "w") as out:
    for i in range(num_lines):
        line = []
        for column in range(num_columns):
            line.append(str(random.choice([0, 1])))
        out.write(" ".join(line) + "\n")

# iterate over chunks of dimensions num_columns×chunk_length, transpose them, and append to file
chunk_length = 7
with open("array.txt", "r") as array:
    with open("transposed_array.txt", "w") as out:
        for chunk_start in range(0, num_lines, chunk_length):
            # get chunk and transpose
            chunk = np.genfromtxt(array, max_rows=chunk_length, dtype=str).T
            # every cell in the output occupies 2 characters ("0 " or "1 "),
            # so all offsets below are 2 * (number of cells to skip);
            # pad the still-unwritten tail of each row with spaces so that later
            # chunks can seek into it (a negative value produces no padding)
            empty_line = 2 * (num_lines - (chunk_length + chunk_start))
            for i, line in enumerate(chunk):
                # start of output row i, shifted right past the columns already written
                new_pos = 2 * num_lines * i + 2 * chunk_start
                out.seek(new_pos)
                out.write(f"{' '.join(line)}{' ' * empty_line}\n")
In this case, it takes an array like this:
1 1 0 1
0 0 1 0
0 1 1 0
1 1 1 0
0 0 0 1
1 1 0 0
0 1 1 0
0 1 1 1
and transposes it using chunks of 4 columns × 7 lines, so the 1st chunk would be
1 0 0 1 0 1 0
1 0 1 1 0 1 1
0 1 1 1 0 0 1
1 0 0 0 1 0 0
It is written to the file, dropped from memory, and then the 2nd chunk is
0
1
1
1
and it is again appended to the file, so the final result is:
1 0 0 1 0 1 0 0
1 0 1 1 0 1 1 1
0 1 1 1 0 0 1 1
1 0 0 0 1 0 0 1
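For the real data, the same seek arithmetic should carry over to reading the gzipped input in row chunks, since every cell is a single character followed by a single space (the assumption this whole approach rests on). A rough sketch (num_lines must be the exact number of input lines, and chunk_length is an arbitrary value to be tuned to the available RAM):

import gzip
from itertools import islice

import numpy as np

num_lines = 1000000      # exact number of rows in the input (= columns of the output)
chunk_length = 10000     # arbitrary; how many input lines to hold in RAM per chunk

with gzip.open("file.txt", "rt") as array:
    with open("transposed_array.txt", "w") as out:
        for chunk_start in range(0, num_lines, chunk_length):
            # read the next chunk_length lines from the compressed stream and transpose
            lines = list(islice(array, chunk_length))
            if not lines:
                break
            chunk = np.array([line.split() for line in lines]).T
            # pad the still-unwritten tail of each row (no padding for the final chunk)
            empty_line = max(2 * (num_lines - (chunk_length + chunk_start)), 0)
            for i, line in enumerate(chunk):
                # start of output row i, shifted right past the columns already written
                out.seek(2 * num_lines * i + 2 * chunk_start)
                out.write(f"{' '.join(line)}{' ' * empty_line}\n")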