Alternatives to Python Popen.communicate() memory limitations?

I have the following chunk of Python code (running v2.7) that results in MemoryError exceptions being thrown when I work with large (several GB) files:

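import sys
from subprocess import Popen, PIPE

myProcess = Popen(myCmd, shell=True, stdout=PIPE, stderr=PIPE)
# communicate() reads all of stdout and stderr into memory before returning
myStdout, myStderr = myProcess.communicate()
sys.stdout.write(myStdout)
if myStderr:
    sys.stderr.write(myStderr)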

The documentation for Popen.communicate() says that the output is buffered:

Note: The data read is buffered in memory, so do not use this method if the data size is large or unlimited.

Is there a way to disable this buffering, or force the cache to be cleared periodically while the process runs?

What alternative approach should I use in Python for running a command that streams gigabytes of data to stdout?

I should note that I need to handle output and error streams.

Michaelmas answered 29/7, 2011 at 23:46 Comment(1)
I need to be able to stream output and error. - Michaelmas

I think I found a solution:

myProcess = Popen(myCmd, shell=True, stdout=PIPE, stderr=PIPE)
for ln in myProcess.stdout:
    sys.stdout.write(ln)
for ln in myProcess.stderr:
    sys.stderr.write(ln)

This seems to get my memory usage down enough to get through the task.

Update

I have recently found a more flexible way of handling data streams in Python, using threads. It's interesting that Python is so poor at something that shell scripts can do so easily!
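
For anyone wondering what the thread-based approach can look like, here is a minimal sketch. It is not the code referred to above; it assumes Python 3 (the question uses 2.7), and the names pump and stream_command are made up for illustration. Each pipe gets its own reader thread, so neither pipe can fill up while the other is being read, and only one chunk per stream is held in memory at a time.

```python
import subprocess
import sys
import threading


def pump(src, dst, chunk_size=64 * 1024):
    """Copy src to dst in chunks until src reaches EOF, then close src."""
    # read1() returns whatever is available (up to chunk_size) and b"" at EOF.
    for chunk in iter(lambda: src.read1(chunk_size), b""):
        dst.write(chunk)
    src.close()


def stream_command(cmd, out=None, err=None):
    """Run cmd, forwarding its stdout/stderr as data arrives; return the exit code."""
    out = out if out is not None else sys.stdout.buffer
    err = err if err is not None else sys.stderr.buffer
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    # One reader thread per pipe, so both pipes are drained concurrently.
    threads = [
        threading.Thread(target=pump, args=(proc.stdout, out)),
        threading.Thread(target=pump, args=(proc.stderr, err)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return proc.wait()
```

Calling stream_command(["some", "command"]) with no other arguments forwards to the parent's own stdout and stderr; passing any binary file-like objects as out and err redirects the data elsewhere.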

Michaelmas answered 30/7, 2011 at 0:15 Comment(4)
That looks interesting. I'll try that out too. - Michaelmas
This ignores the warning in the documentation: "Use communicate() rather than .stdin.write, .stdout.read or .stderr.read to avoid deadlocks due to any of the other OS pipe buffers filling up and blocking the child process." It will probably work in general, but there is a risk of deadlock at for ln in myProcess.stdout: if the myProcess.stderr pipe ever fills up. I came here looking for a solution to this myself. - Soonsooner
Btw, using izip_longest() will only help if stdout and stderr are roughly the same size. If one runs out before the other, it'll block and the one left will buffer up in its entirety until the process ends. In this case, memory usage won't be reduced and may actually be worse than using .communicate(), as it'll deadlock if the internal buffer on the one left fills up. (This buffer is usually much smaller than what .communicate() can allocate.) At least with @Alex's solution, precedence is given to .stdout, which is likely to contain more data. - Soonsooner
@antak: if both stdout and stderr are processed by Python code while the child process is running, there are several asynchronous techniques that can help avoid the deadlock: threads, select, fcntl, named pipes with IOCP. - Masson
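
To illustrate the select option from the last comment, here is a rough single-threaded sketch. It assumes Python 3 on a POSIX system (select does not work on pipes on Windows), and the function name stream_with_select is hypothetical. It uses the standard selectors module, which wraps select/poll/epoll, to read from whichever pipe currently has data.

```python
import os
import selectors
import subprocess
import sys


def stream_with_select(cmd, out, err, chunk_size=64 * 1024):
    """Forward a child's stdout/stderr from a single thread; return the exit code."""
    # bufsize=0 gives unbuffered pipe objects, so no data hides in Python-level buffers.
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE, bufsize=0)
    sel = selectors.DefaultSelector()
    sel.register(proc.stdout, selectors.EVENT_READ, out)
    sel.register(proc.stderr, selectors.EVENT_READ, err)
    while sel.get_map():  # loop until both pipes have been unregistered
        for key, _ in sel.select():
            chunk = os.read(key.fd, chunk_size)
            if chunk:
                key.data.write(chunk)  # key.data is the destination registered above
            else:
                # An empty read means EOF on this pipe.
                sel.unregister(key.fileobj)
                key.fileobj.close()
    sel.close()
    return proc.wait()
```

Here out and err are binary file-like objects such as sys.stdout.buffer and sys.stderr.buffer. Compared with threads, this keeps everything in one thread at the cost of portability.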

What I would probably do instead, if I needed to read the stdout for something that large, is send it to a file on creation of the process.

from subprocess import Popen

with open(my_large_output_path, 'w') as fo:
    with open(my_large_error_path, 'w') as fe:
        myProcess = Popen(myCmd, shell=True, stdout=fo, stderr=fe)
        myProcess.wait()  # block until the child has finished writing

Edit: If you need to stream, you could try making a file-like object and passing it to stdout and stderr. (I haven't tried this, though.) You could then read (query) from the object as it's being written.

Sentinel answered 30/7, 2011 at 0:7 Comment(1)
Regarding the suggestion on streaming: passing "a file-like object" will not work; Popen needs a true file handle. - Ison
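
A sketch of the file-based idea that respects the point made in the comment: the child writes to real temporary files (which have OS-level file descriptors), and the parent copies them out in chunks once the child has exited. This assumes Python 3, and run_via_tempfiles is a hypothetical name. It does not stream while the process runs; it trades disk space for memory.

```python
import shutil
import subprocess
import sys
import tempfile


def run_via_tempfiles(cmd, out, err, chunk_size=64 * 1024):
    """Spool a child's stdout/stderr to temp files, then copy them to out/err."""
    with tempfile.TemporaryFile() as fo, tempfile.TemporaryFile() as fe:
        # Popen needs real file descriptors; TemporaryFile objects provide fileno().
        returncode = subprocess.call(cmd, stdout=fo, stderr=fe)
        for spool, dst in ((fo, out), (fe, err)):
            spool.seek(0)  # rewind to the start of what the child wrote
            shutil.copyfileobj(spool, dst, chunk_size)  # chunked copy, bounded memory
    return returncode
```

Here out and err are binary file-like objects, for example sys.stdout.buffer and sys.stderr.buffer.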

For those whose application hangs after a certain amount of time when using Popen, here is what happened in my case:

As a rule of thumb, if you're not going to read the stdout and stderr streams, don't pass PIPE for them in the parameters of Popen. The pipes will fill up, the child will block on its next write, and that will cause you a lot of problems.

If you need them for a certain amount of time and you need to keep the process running, then you can close those streams at any time.

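from subprocess import Popen, PIPE

try:
    p = Popen(COMMAND, stdout=PIPE, stderr=PIPE)
    # ... read what you need from p.stdout and p.stderr here ...
    # Once you are done with them, close the parent's ends of the pipes.
    p.stdout.close()
    p.stderr.close()
except Exception as e:
    pass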
Pucida answered 12/7, 2021 at 22:16 Comment(0)
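
If the output is never needed at all, one way to follow the rule of thumb above is to send both streams to the null device instead of creating pipes. This is a minimal sketch assuming Python 3.3 or later, where subprocess.DEVNULL exists; run_discarding_output is a hypothetical name.

```python
import subprocess
import sys


def run_discarding_output(cmd):
    """Run cmd with stdout/stderr sent to the null device; return the exit code."""
    # No pipes are created, so there is no buffer that can fill up and stall the child.
    return subprocess.call(cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)


if __name__ == "__main__":
    # A child that prints about 1 MB; nothing is kept in the parent's memory.
    noisy = [sys.executable, "-c", "print('x' * 1000000)"]
    print(run_discarding_output(noisy))
```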
