Multi-CPU bzip2 using a Python script
I want to quickly bzip2-compress several hundred gigabytes of data on my 8-core, 16 GB RAM workstation. Currently I am using a simple Python script to compress a whole directory tree with bzip2, via an os.system call coupled to an os.walk call.

I see that bzip2 only uses a single CPU while the other CPUs remain relatively idle.

I am a newbie with queues and threaded processes. But I am wondering how I can implement this so that I have four bzip2 threads running (actually, I guess, four os.system calls in parallel), each probably using its own CPU, that deplete files from a queue as they compress them.

My single-threaded script is pasted below.

import os

# Walk the tree and compress every file that is not already a .bz2 archive.
for root, dirlist, filelist in os.walk(os.curdir):
    for path in [os.path.join(root, name) for name in filelist]:
        if not path.endswith(".bz2"):  # skip files that are already compressed
            print "Compressing %s" % path
            os.system('bzip2 "%s"' % path)  # quote the path in case it contains spaces
            print ":DONE"
Meet answered 27/7, 2010 at 15:25 Comment(1)
Don't forget that Python has the bz2 module, which lets you compress in-process with bz2.compress() and bz2.decompress(); that should improve performance a bit over calling os.system(). Use it like file(outFile + ".bz2", "wb").write(bz2.compress(file(inFile, "rb").read())). And don't use the built-in name file for your variables; it's like naming a variable string or int or str. – Jeffjeffcoat
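For hundreds of gigabytes, reading each whole file into memory (as the one-liner above does) is risky on a 16 GB machine. Here is a streaming variant of the same idea, as a minimal sketch (the 1 MB chunk size is an arbitrary choice):

import bz2

def compress_file(in_path, chunk_size=1024 * 1024):
    # Stream the input in fixed-size chunks so a multi-gigabyte
    # file never has to fit in RAM at once.
    out = bz2.BZ2File(in_path + ".bz2", "wb")
    src = open(in_path, "rb")
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        out.write(chunk)
    src.close()
    out.close()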
Try this code from MRAB on comp.lang.python. (Threads work here despite the GIL, because each thread spends its time blocked in os.system while a separate bzip2 process does the actual compression.)

import os
import sys
from threading import Thread, Lock
from Queue import Queue

def report(message):
    # Serialize printing so output from different workers doesn't interleave.
    mutex.acquire()
    print message
    sys.stdout.flush()
    mutex.release()

class Compressor(Thread):
    def __init__(self, in_queue, out_queue):
        Thread.__init__(self)
        self.in_queue = in_queue
        self.out_queue = out_queue

    def run(self):
        while True:
            path = self.in_queue.get()
            if path is None:  # sentinel: no more work, exit the thread
                break
            report("Compressing %s" % path)
            os.system('bzip2 "%s"' % path)  # quote the path in case of spaces
            report("Done %s" % path)
            self.out_queue.put(path)

in_queue = Queue()
out_queue = Queue()
mutex = Lock()
THREAD_COUNT = 4

worker_list = []
for i in range(THREAD_COUNT):
    worker = Compressor(in_queue, out_queue)
    worker.start()
    worker_list.append(worker)

# Feed every file that is not already compressed into the queue.
for root, dirlist, filelist in os.walk(os.curdir):
    for path in [os.path.join(root, name) for name in filelist]:
        if not path.endswith(".bz2"):
            in_queue.put(path)

# One sentinel per worker tells each thread to exit.
for i in range(THREAD_COUNT):
    in_queue.put(None)
for worker in worker_list:
    worker.join()
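The out_queue collects each path as its worker finishes with it; the code above never drains it, but after the join loop you can, for example, count what was done (a small illustrative addition, not part of MRAB's post):

# Drain out_queue after all workers have joined (order is arbitrary).
done = []
while not out_queue.empty():
    done.append(out_queue.get())
print "Compressed %d files" % len(done)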
Meet answered 22/10, 2010 at 1:25 Comment(0)
Use the subprocess module to spawn several processes at once. If N of them are running (N should be a bit bigger than the number of CPUs you have, say 3 for 2 cores, 10 for 8), wait for one to terminate and then start another one, as in the sketch below.

Note that this might not help much, since there will be a lot of disk activity, which you can't parallelize. A lot of free RAM for caches helps.
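A minimal sketch of that loop (the MAX_PROCS value and the 0.1 s polling interval are arbitrary choices, and bzip2 is assumed to be on the PATH):

import os
import time
import subprocess

MAX_PROCS = 10  # "a bit bigger than the number of CPUs" -- here, for 8 cores

running = []
for root, dirlist, filelist in os.walk(os.curdir):
    for name in filelist:
        path = os.path.join(root, name)
        if path.endswith(".bz2"):
            continue
        # Throttle: while MAX_PROCS jobs are in flight, wait for one to exit.
        while len(running) >= MAX_PROCS:
            time.sleep(0.1)
            running = [p for p in running if p.poll() is None]
        running.append(subprocess.Popen(["bzip2", path]))

# Wait for the remaining compressions to finish.
for p in running:
    p.wait()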

Swats answered 27/7, 2010 at 15:32 Comment(0)