python: yield inside map function
Is it possible to use yield inside the map function?

For POC purposes, I have created a sample snippet.

# Python 3  (Win10)
from concurrent.futures import ThreadPoolExecutor
import os
def read_sample(sample):
    with open(os.path.join('samples', sample)) as fff:
        for _ in range(10):
            yield str(fff.read())

def main():
    with ThreadPoolExecutor(10) as exc:
        files = os.listdir('samples')
        files = list(exc.map(read_sample, files))
        print(str(len(files)), end="\r")

if __name__=="__main__":
    main()

I have 100 files in the samples folder. As per the snippet, 100*10 = 1000 should be printed. However, it prints only 100. When I checked, the mapped results were just generator objects.

What change is needed so that 1000 is printed?

Cid answered 24/5, 2020 at 8:16 Comment(9)
It is possible to use a generator ("yield function") inside map, but as you have observed, this will just instantiate that generator. Is there a reason why read_sample does not just produce a list? What are you trying to achieve by using generators? Note that you can get the results by using list(itertools.chain(*exc.map(read_sample, files))) instead, but it will benefit from neither threads nor generators.Operon
Does this help: #44708812?Brake
I don't understand why you expect 1000. If the original files is a list with 100 names, then the result is also a list with 100 elements, and you print len(), which is the number of elements in the list, not the total size of all elements (which could be 1000, like sum(len(x) for x in files))Ulaulah
I guess what you actually want is to have map peek inside the generator, something like a non-existent map_from or something?Commissure
As @Commissure suggested, you would need something like map_from to run the generator 10 times for every filename. A normal map will run the function/generator only once per filename. BTW: read() reads all the data from the file on the first call, so subsequent calls would produce only empty results, which seems useless. You would have to use e.g. read(5) to read only part of the file.Ulaulah
Thank you for your replies. This is a snippet I created for the POC only. In my product, I have a list of 100 A objects, each of which has a regular expression. Each of those A objects has to generate 10 B objects, which is why I am yielding 10 B objects. Since these are all file operations, I am trying to make it multithreaded using the map function. If this POC is successful, I will apply the same concept in my product. However, I believe there should be a way in Python to achieve this, regardless of the business logic.Cid
Can you please clarify why you want to use generators for this? Generators are inherently cooperative concurrency, which conflicts with using threads to achieve preemptive concurrency.Operon
As I mentioned, I have a function which returns either a list (using return) or a generator (using yield), and I want to call that function 100 times for 100 files. For that I am using map. I am not bound to use a generator, but the same problem exists when returning a list as well: at the end I get a list of lists, which has to be flattened before use.Cid
Does this answer your question: How to make a flat list out of list of lists?Operon

You can pass a generator function to map(), but it will just produce the generator objects themselves; it will not descend into the generators to consume their values.

A possible approach is to have a generator do the looping the way you want and have a function operate on the yielded objects. This has the added advantage of separating the looping from the computation more neatly. So, something like this should work:

  • Approach #1
# Python 3  (Win10)
from concurrent.futures import ThreadPoolExecutor
import os
def read_samples(samples):
    for sample in samples:
        with open(os.path.join('samples', sample)) as fff:
            for _ in range(10):
                yield fff

def main():
    with ThreadPoolExecutor(10) as exc:
        files = os.listdir('samples')
        files = list(exc.map(lambda x: str(x.read()), read_samples(files)))
        print(str(len(files)), end="\r")

if __name__=="__main__":
    main()
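One caveat with yielding the open file handle: the generator may advance past the with block (closing the file) while a worker thread is still reading from it, and repeated read() calls on the same handle return empty strings after the first. A thread-safer variant (a sketch, not part of the original answer; sample_paths, read_path, and read_all are hypothetical names) yields plain paths and lets each worker open its own file:

```python
from concurrent.futures import ThreadPoolExecutor
import os

def sample_paths(folder, repeats=10):
    # yield each path `repeats` times; no file handle is shared across threads
    for sample in os.listdir(folder):
        for _ in range(repeats):
            yield os.path.join(folder, sample)

def read_path(path):
    # each worker opens and closes its own handle
    with open(path) as f:
        return f.read()

def read_all(folder):
    with ThreadPoolExecutor(10) as exc:
        return list(exc.map(read_path, sample_paths(folder)))
```

With 100 files in the folder, read_all would return 1000 strings, since each path is yielded 10 times.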

Another approach is to nest an extra map call to consume the generators:

  • Approach #2
# Python 3  (Win10)
from concurrent.futures import ThreadPoolExecutor
import os
def read_sample(sample):
    with open(os.path.join('samples', sample)) as fff:
        for _ in range(10):
            yield str(fff.read())

def main():
    with ThreadPoolExecutor(10) as exc:
        files = os.listdir('samples')
        # the inner map just creates the generators; the outer map has
        # the workers consume each one into a list
        files = exc.map(list, exc.map(read_sample, files))
        files = [f for fs in files for f in fs]  # flattening the results
        print(str(len(files)), end="\r")

if __name__=="__main__":
    main()
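If you prefer not to write the nested list comprehension, itertools.chain.from_iterable flattens the mapped lists just as well. A minimal sketch (gen here is a stand-in for a generator function like read_sample):

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import chain

def gen(n):
    # stand-in for a generator like read_sample
    for i in range(n):
        yield i

with ThreadPoolExecutor(4) as exc:
    lists = exc.map(list, exc.map(gen, range(4)))  # workers consume each generator
    flat = list(chain.from_iterable(lists))
    print(flat)
# [0, 0, 1, 0, 1, 2]
```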

A more minimal example

Just to get to a more reproducible example, the essence of your code can be written in a more minimal form (one that does not rely on files lying around on your system):

from concurrent.futures import ThreadPoolExecutor


def foo(n):
    for i in range(n):
        yield i


with ThreadPoolExecutor(10) as exc:
    k = 8
    x = list(exc.map(foo, range(k)))
    print(x)
# [<generator object foo at 0x7f1a853d4518>, <generator object foo at 0x7f1a852e9990>, <generator object foo at 0x7f1a852e9db0>, <generator object foo at 0x7f1a852e9a40>, <generator object foo at 0x7f1a852e9830>, <generator object foo at 0x7f1a852e98e0>, <generator object foo at 0x7f1a852e9fc0>, <generator object foo at 0x7f1a852e9e60>]
  • Approach #1:
from concurrent.futures import ThreadPoolExecutor


def foos(ns):
    for n in range(ns):
        for i in range(n):
            yield i


with ThreadPoolExecutor(10) as exc:
    k = 8
    x = list(exc.map(lambda x: x ** 2, foos(k)))
    print(x)
# [0, 0, 1, 0, 1, 4, 0, 1, 4, 9, 0, 1, 4, 9, 16, 0, 1, 4, 9, 16, 25, 0, 1, 4, 9, 16, 25, 36]
  • Approach #2
from concurrent.futures import ThreadPoolExecutor


def foo(n):
    for i in range(n):
        yield i ** 2


with ThreadPoolExecutor(10) as exc:
    k = 8
    x = exc.map(list, exc.map(foo, range(k)))
    print([z for y in x for z in y])
# [0, 0, 1, 0, 1, 4, 0, 1, 4, 9, 0, 1, 4, 9, 16, 0, 1, 4, 9, 16, 25, 0, 1, 4, 9, 16, 25, 36]
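The same pattern should carry over to executor.submit(): submit the consumption of each generator as its own task and gather the futures. A sketch of this (reusing foo from the last example, not something the original answer covered):

```python
from concurrent.futures import ThreadPoolExecutor

def foo(n):
    for i in range(n):
        yield i ** 2

with ThreadPoolExecutor(10) as exc:
    k = 8
    # submit(list, foo(n)) creates the generator here, but a worker
    # thread consumes it when list() runs
    futures = [exc.submit(list, foo(n)) for n in range(k)]
    x = [z for f in futures for z in f.result()]
    print(x)
# [0, 0, 1, 0, 1, 4, 0, 1, 4, 9, 0, 1, 4, 9, 16, 0, 1, 4, 9, 16, 25, 0, 1, 4, 9, 16, 25, 36]
```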
Commissure answered 24/5, 2020 at 11:9 Comment(1)
I was wondering, could you also do this with .submit() instead of .map()? Does the same logic apply there as well?Dunn
