Python generator expression if-else

I am using Python to parse a large file. What I want to do is

if condition:
    append to list A
else:
    append to list B

I want to use generator expressions for this, to save memory. Here is the actual code:

def is_low_qual(read):
    lowqual_bp = (bq for bq in phred_quals(read) if bq < qual_threshold)
    if iter_length(lowqual_bp) > num_allowed:
        return True
    else:
        return False

lowqual=(read for read in SeqIO.parse(r_file,"fastq") if is_low_qual(read)==True)
highqual=(read for read in SeqIO.parse(r_file,"fastq") if is_low_qual(read)==False)


SeqIO.write(highqual,flt_out_handle,"fastq")
SeqIO.write(lowqual,junk_out_handle,"fastq")

def iter_length(the_gen):
    return sum(1 for i in the_gen)
Keelykeen answered 24/8, 2012 at 16:22 Comment(7)
As a side note, don't compare to true/false. Use if is_condition_true(r) and if not is_condition_true(r).Ashcroft
delnan is right, other things are OK.Habitude
This looks fine. Has this failed? Is that why you're asking?Upholster
It probably works, but it's ugly and inefficient. It also breaks if sequences is an iterator (you can use itertools.tee for that though).Ashcroft
How are you using low and high after you have created the generators?Cassondra
Thanks about the True/False. It works, just that I am doing it twice, so losing efficiency. Actually sequences is an iterator, but it still worked. Why should it break? This is sequence data, so I am writing them to files after this, using SeqIO.write.Keelykeen
Something else, don’t write if x then: return True else: return False. Write return xStoeber

You can use itertools.tee in conjunction with itertools.ifilter and itertools.ifilterfalse (Python 2 names; in Python 3 these are the built-in filter and itertools.filterfalse):

import itertools
def is_condition_true(x):
    ...

gen1, gen2 = itertools.tee(sequences)
low = itertools.ifilter(is_condition_true, gen1)
high = itertools.ifilterfalse(is_condition_true, gen2)

Using tee ensures that the function works correctly even if sequences is itself a generator.

Note, though, that tee could itself use a fair bit of memory (up to a list of size len(sequences)) if low and high are consumed at different rates (e.g. if low is exhausted before high is used).
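For reference, a Python 3 sketch of the same approach (here is_even is a stand-in for the asker's predicate, and a range iterator stands in for the record iterator):

```python
import itertools

def is_even(x):  # stand-in predicate for is_condition_true
    return x % 2 == 0

sequences = iter(range(10))            # the source may itself be an iterator
gen1, gen2 = itertools.tee(sequences)  # two independent views of one iterator
low = filter(is_even, gen1)                  # items where the predicate holds
high = itertools.filterfalse(is_even, gen2)  # items where it does not

print(list(low))   # [0, 2, 4, 6, 8]
print(list(high))  # [1, 3, 5, 7, 9]
```

Note that fully draining low first, as here, forces tee to buffer everything high has not yet consumed.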

Voluntaryism answered 24/8, 2012 at 16:30 Comment(6)
Oh, I have to avoid high memory situations, so can't use it. Sequences is an iterator, not a generator: sequences = SeqIO.parse(read_file, "fastq"). Should it still break?Keelykeen
What kind of iterator? Iterator is a general term for anything you can iterate across in Python.Voluntaryism
It is from the Biopython package.".. Bio.SeqIO.parse() which takes a file handle and format name, and returns a SeqRecord iterator"Keelykeen
OK. So now your problem makes sense. You have a large file containing many records, and you want to split the file into two smaller files each containing half of the file according to some filter without reading the whole thing into memory. Is that about right?Voluntaryism
In that case, your best option is to write the records one-at-a-time to the appropriate file as each one comes off the input iterator. This uses the least memory and only iterates once.Voluntaryism
Reading the large file with records - yes, but we don't know what the proportion of low vs high is. OK, print it out each step sounds the cleanest...Keelykeen

I think you're striving to avoid iterating over your collection twice. If so, this type of approach works:

high, low = [], []
_Nones = [high.append(x) if is_condition_true(x) else low.append(x) for x in sequences]

This is probably inadvisable, though, because it uses a list comprehension purely for its side effects, which is generally considered unpythonic.

Goins answered 24/8, 2012 at 16:31 Comment(4)
Well, that also creates a list of [None]*len(sequences), which is undesirable as it uses even more memory than his original suggestion.Voluntaryism
Rather than using a list comprehension, you could use any(...) with the equivalent generator expression. Since each item is None and thus false, any() is guaranteed to consume the entire iterator. (You could also use a collections.deque(maxlen=0) to consume the iterator; it'll probably be faster since it does no truth-testing.)Pellerin
I find it generally simpler (and more readable) to use a for-loop when side-effects are involved (especially append). So in this case I'd definitely write it out as a loop.Voluntaryism
I want to avoid list comprehensions because they are expensive. Ok , let me look at the any and deque - I am not familiar with themKeelykeen
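A sketch of the deque(maxlen=0) idiom mentioned in the comments above (the names and the toy predicate are illustrative, not from the original code):

```python
from collections import deque

def split(items, predicate):
    low, high = [], []
    # Each append() returns None; deque(maxlen=0) consumes the generator
    # without storing or truth-testing those Nones.
    deque((low.append(x) if predicate(x) else high.append(x) for x in items),
          maxlen=0)
    return low, high

low, high = split(range(10), lambda x: x < 5)
print(low)   # [0, 1, 2, 3, 4]
print(high)  # [5, 6, 7, 8, 9]
```

This still builds both output lists, so it does not address the memory concern; it only avoids the throwaway list of Nones that the list comprehension produces.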

Just to add a more general answer: If your main concern is memory, you should use one generator that loops over the whole file, and handle each item as low or high as it comes. Something like:

for r in sequences:
    if condition_true(r):
        handle_low(r)
    else:
        handle_high(r)

If you need to collect all high/low elements before using either, then you can't guard against a potential memory hit. The reason is that you can't know which elements are high/low until you read them. If you have to process low first, and it turns out all the elements are actually high, you have no choice but to store them in a list as you go, which will use memory. Doing it with one loop allows you to handle each element one at a time, but you have to balance this against other concerns (i.e., how cumbersome it is to do it this way, which will depend on exactly what you're trying to do with the data).
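Applied to the asker's task, the single-pass idea might be sketched like this (hypothetical file names and a toy length-based predicate; plain text lines stand in for FASTQ records):

```python
# Create a tiny stand-in input file for the demonstration.
with open("input.txt", "w") as f:
    f.write("tiny\n")
    f.write("a much longer record\n")

def is_low_qual(line):  # toy predicate standing in for the asker's quality check
    return len(line.strip()) < 5

# One pass over the input: each record is written to the matching
# output file as it streams by, so nothing accumulates in memory.
with open("input.txt") as src, \
     open("low.txt", "w") as low_out, \
     open("high.txt", "w") as high_out:
    for line in src:
        (low_out if is_low_qual(line) else high_out).write(line)
```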

Wafture answered 24/8, 2012 at 16:46 Comment(2)
I'm sorry I don't understand. Is the above a generator? I wanted to use a generator for each array because I think that doesn't actually store it in memory, right, it's just an expression? What I need to do with the data is print it out using Bio.SeqIO.write - which is more efficient if I don't call it each time it loops. Alternatively, I could just print at each step using simple print statements. It comes down to this - is creating a generator really expensive? In which case creating 2 is even more?Keelykeen
@Nupur: The above is just a loop. The only reason to create a generator is to use it in a loop. It seems you currently have one generator (called sequences). What I'm saying is, if you really want to save memory, then instead of trying to create two generators from that, just loop over the original generator directly. If all you're doing is writing it out, then my solution should work fine. If you need more specifics you'll have to edit your question to give more details about the code where you use the data.Wafture

This is surprisingly difficult to do elegantly. Here's something that works (Python 2; in Python 3, ifilter is the built-in filter and ifilterfalse is itertools.filterfalse):

from itertools import tee, ifilter, ifilterfalse
low, high = [f(condition, g) for f, g in zip((ifilter, ifilterfalse), tee(seq))]

Note that as you consume items from one resulting iterator (say low), the internal deque in tee will have to expand to contain any items that you have not yet consumed from high (including, unfortunately, those which ifilterfalse will reject). As such this might not save as much memory as you're hoping.

Here's an implementation that uses as little additional memory as possible:

from collections import deque

def filtertee(func, iterable, codomain=(False, True)):
    it = iter(iterable)
    deques = dict((r, deque()) for r in codomain)
    def gen(mydeque):
        while True:
            while not mydeque:          # as long as the local deque is empty
                try:
                    newval = next(it)   # fetch a new value,
                except StopIteration:
                    return              # input exhausted: end this generator
                result = func(newval)   # find its image under `func`,
                try:
                    d = deques[result]  # find the appropriate deque, and
                except KeyError:
                    raise ValueError("func returned value outside codomain")
                d.append(newval)        # add it.
            yield mydeque.popleft()
    return dict((r, gen(d)) for r, d in deques.items())

This returns a dict from the codomain of the function to a generator providing the items that take that value under func:

gen = filtertee(condition, seq)
low, high = gen[True], gen[False]

Note that it's your responsibility to ensure that condition only returns values in codomain.

Period answered 24/8, 2012 at 16:31 Comment(1)
He doesn't want lists to be generated to save memory.Voluntaryism

© 2022 - 2024 — McMap. All rights reserved.