I want to quickly bzip2-compress several hundred gigabytes of data using my 8-core, 16 GB RAM workstation. Currently I am using a simple Python script to compress a whole directory tree with bzip2, using an os.system call coupled to an os.walk call.
I see that bzip2 only uses a single CPU while the other CPUs remain relatively idle.
I am a newbie with queues and threaded processes, but I am wondering how I can implement this so that I have four bzip2 threads running (actually, I guess, os.system threads), each probably using its own CPU, that deplete files from a queue as they bzip them.
My single-threaded script is pasted here.
import os
import sys

for roots, dirlist, filelist in os.walk(os.curdir):
    for file in [os.path.join(roots, filegot) for filegot in filelist]:
        if "bz2" not in file:
            print "Compressing %s" % (file)
            os.system("bzip2 %s" % file)
print ":DONE"
Python has a bz2 package which allows you to bz2.compress() and bz2.decompress() files, which should increase performance a bit; instead of calling os.system(), use it like file(outFile + ".bz2", "wb").write(bz2.compress(file(inFile, "rb").read())). And don't use the built-in name file as your variable name; it's like using string or int or str. – Jeffjeffcoat
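If you do move the compression in-process as the comment suggests, note that reading a whole file with .read() pulls it entirely into memory, which is risky with multi-gigabyte files on a 16 GB machine. A safer sketch streams through bz2.BZ2File in fixed-size chunks (the 1 MiB chunk size here is an arbitrary choice):

import bz2
import shutil

def compress_file(in_path):
    # Stream in_path into in_path + ".bz2" without loading it all into RAM.
    with open(in_path, "rb") as src, bz2.BZ2File(in_path + ".bz2", "wb") as dst:
        shutil.copyfileobj(src, dst, length=1024 * 1024)  # 1 MiB per read

Since this version does the compression inside the Python process, you would spread it across cores with multiprocessing (e.g. Pool.map(compress_file, paths)) rather than threads.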