I have 2 text files (*.txt) that contain unique strings in the format:
udtvbacfbbxfdffzpwsqzxyznecbqxgebuudzgzn:refmfxaawuuilznjrxuogrjqhlmhslkmprdxbascpoxda
ltswbjfsnejkaxyzwyjyfggjynndwkivegqdarjg:qyktyzugbgclpovyvmgtkihxqisuawesmcvsjzukcbrzi
The first file contains 50 million such lines (4.3 GB) and the second contains 1 million lines (112 MB). Each line consists of 40 characters, a : delimiter, and 45 more characters.
Task: find the unique values, i.e. produce a csv or txt file containing the lines that are in the second file but not in the first.
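In other words, for files that fit comfortably in memory this would just be a set difference. A minimal sketch, assuming the file names above:

# Naive in-memory version: fine for small files, not for a 4.3 GB base file
with open('file1.txt') as base, open('file2.txt') as check:
    result = set(check) - set(base)

The problem is doing this efficiently when the base file does not fit in memory.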
I am trying to do this using vaex:
import vaex

# Convert the base file to HDF5 in chunks
base_files = ['file1.txt']
for i, txt_file in enumerate(base_files, 1):
    for j, dv in enumerate(vaex.from_csv(txt_file, chunk_size=5_000_000, names=['data']), 1):
        dv.export_hdf5(f'hdf5_base/base_{i:02}_{j:02}.hdf5')

# Convert the check file to HDF5 in chunks
check_files = ['file2.txt']
for i, txt_file in enumerate(check_files, 1):
    for j, dv in enumerate(vaex.from_csv(txt_file, chunk_size=5_000_000, names=['data']), 1):
        dv.export_hdf5(f'hdf5_check/check_{i:02}_{j:02}.hdf5')

dv_base = vaex.open('hdf5_base/*.hdf5')
dv_check = vaex.open('hdf5_check/*.hdf5')
dv_result = dv_check.join(dv_base, on='data', how='inner', inplace=True)
dv_result.export(path='result.csv')
As a result, I get a result.csv file with the row values. However, the check takes a very long time, and it uses all available RAM and all CPU resources. How can this process be sped up? What am I doing wrong, and what could be done better? Is it worth using other libraries (pandas, dask) for this check, and would they be faster?
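For comparison, a plain pandas version of the same set-difference check, streaming the base file in chunks, might look roughly like this (not benchmarked on the full files; the column name data is just a placeholder):

import pandas as pd

# Load the small file into a set, then stream the big file in chunks
# and drop every value that also appears in it
check_set = set(pd.read_csv('file2.txt', names=['data'])['data'])
for chunk in pd.read_csv('file1.txt', names=['data'], chunksize=5_000_000):
    check_set.difference_update(chunk['data'])
print(f'Unique rows: [{len(check_set)}]')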
UPD 10.11.2020: So far, I have not found anything faster than the following option:
from io import StringIO

def read_lines(text):
    # Iterate over the lines of an in-memory string, stripping the trailing newline
    handle = StringIO(text)
    for line in handle:
        yield line.rstrip('\n')

def read_in_chunks(file_obj, chunk_size=10 * 1024 * 1024):
    # Read the file in ~10 MB chunks, yielding only whole lines so that a record
    # split across a chunk boundary is not missed
    remainder = ''
    while True:
        data = file_obj.read(chunk_size)
        if not data:
            if remainder:
                yield remainder
            break
        data, _, remainder = (remainder + data).rpartition('\n')
        yield data

# The small file fits in memory, so load it into a set in one pass
with open('check.txt', 'r', errors='ignore') as file_check:
    check_set = set(read_lines(file_check.read()))

# Stream the big file and remove every line that also appears in the check set
with open('base.txt', 'r', errors='ignore') as file_base:
    for idx, chunk in enumerate(read_in_chunks(file_base), 1):
        print(f'Checked [{idx * 10} MB]')
        for elem in read_lines(chunk):
            if elem in check_set:
                check_set.remove(elem)

print(f'Unique rows: [{len(check_set)}]')
UPD 11.11.2020: Thanks to @m9_psy for the tips on improving performance. It really is faster! Currently, the fastest way is:
from io import BytesIO

# Compare raw bytes (including the trailing b'\n') to avoid any decoding overhead
check_set = {elem for elem in BytesIO(open('check.txt', 'rb').read())}

with open('base.txt', 'rb') as file_base:
    for line in file_base:
        if line in check_set:
            check_set.remove(line)

print(f'Unique rows: [{len(check_set)}]')
Is there a way to further speed up this process?
awk on the command line? If I understood your requirements correctly (i.e. return only the lines which are present in file2.txt but not in file1.txt), this should do the job just fine. Note that you need to redirect the result to a file, i.e. awk 'FNR==NR {a[$0]++; next} !($0 in a)' file1.txt file2.txt > result.txt
– Illuminant