I have thousands of DNA sequences ranging from 100 to 5000 bp, and I need to align specified pairs and calculate an identity score for each. Biopython's pairwise2 does a nice job, but only for short sequences: once the sequences get longer than about 2 kb, it shows a severe memory leak that leads to a 'MemoryError', even when the 'score_only' and 'one_alignment_only' options are used!
from Bio import pairwise2

whole_coding_scores = {}
for genes in whole_coding:  # whole_coding is a <25Mb dict providing DNA sequences
    # Items 3 and 4 of each entry are the two sequences to compare
    seq_a, seq_b = whole_coding[genes][3], whole_coding[genes][4]
    alignment = pairwise2.align.globalxx(seq_a, seq_b, score_only=True, one_alignment_only=True)
    # Identity = alignment score divided by the length of the shorter sequence
    whole_coding_scores[genes] = alignment / min(len(seq_a), len(seq_b))
Result returned from supercomputer:
Max vmem = 256.114G #Memory usage of the script
failed assumedly after job because:
job 4945543.1 died through signal XCPU (24)
I know there are other alignment tools, but they mostly write the score to an output file, which then has to be read and parsed again to retrieve and use the alignment scores. Is there any tool that can align the sequences and return the alignment score inside the Python environment, as pairwise2 does, but without the memory leak?
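To make clear what I am after, here is a minimal pure-Python sketch of a score-only global alignment that uses the same scoring as globalxx (match = 1, no mismatch or gap penalties) and keeps only two rows of the dynamic-programming matrix, so memory grows with the length of the shorter sequence rather than with the product of both lengths. The function names `global_score_xx` and `identity` are hypothetical and not part of Biopython, and plain Python loops like these would probably be too slow for thousands of 5 kb pairs; it only illustrates the behaviour I am looking for.

```python
def global_score_xx(seq_a, seq_b):
    """Score-only global alignment: match = 1, mismatch = 0, gaps = 0."""
    # Put the shorter sequence along the row so each row stays small.
    if len(seq_b) > len(seq_a):
        seq_a, seq_b = seq_b, seq_a
    previous = [0] * (len(seq_b) + 1)  # scores for the row above
    for char_a in seq_a:
        current = [0]  # first column: aligning against an empty prefix
        for j, char_b in enumerate(seq_b, start=1):
            diagonal = previous[j - 1] + (1 if char_a == char_b else 0)
            # Best of: match/mismatch, gap in seq_b, gap in seq_a
            current.append(max(diagonal, previous[j], current[j - 1]))
        previous = current  # only two rows are ever held in memory
    return previous[-1]


def identity(seq_a, seq_b):
    """Alignment score divided by the length of the shorter sequence."""
    shorter = min(len(seq_a), len(seq_b))
    return global_score_xx(seq_a, seq_b) / shorter if shorter else 0.0
```

With this, `identity(whole_coding[genes][3], whole_coding[genes][4])` would stand in for the two lines inside my loop above.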