Python - walk through a huge set of files but in a more efficient manner
Asked Answered
F

2

7

I have huge set of files that I want to traverse through using python. I am using os.walk(source) for the same and is working but since I have a huge set of files it is taking too much and memory resources since its getting the complete list all at once. How can I optimize this to use less resources and may be walk through one directory at a time or in some other efficient manner and still able to iterate the complete set of files. Thanks

for dir, dirnames, filenames in os.walk(START_FOLDER): 
    for name in dirnames: 
        #if PRIVATE_FOLDER not in name: 
            for keyword in FOLDER_WITH_KEYWORDS_DELETION_EXCEPTION_LIST: 
                if keyword in name.lower(): 
                    ignoreList.append(name)
Forefront answered 12/2, 2014 at 3:44 Comment(10)
os.walk already returns a generator, which is lazy. Are you turning it into a list or something? Because if not, it should not cause memory issues. (Also, post your code.)Horizon
I want to go through each of the file name and if contains certain keywords I want to add them to a list for dir, dirnames, filenames in os.walk(START_FOLDER): for name in dirnames: #if PRIVATE_FOLDER not in name: for keyword in FOLDER_WITH_KEYWORDS_DELETION_EXCEPTION_LIST: if keyword in name.lower(): ignoreList.append(name)Forefront
Okay. Post your code that does that.Horizon
@Horizon <code> for dir, dirnames, filenames in os.walk(START_FOLDER): for name in dirnames: #if PRIVATE_FOLDER not in name: for keyword in FOLDER_WITH_KEYWORDS_DELETION_EXCEPTION_LIST: if keyword in name.lower(): ignoreList.append(name) </code>Forefront
Comments can't contain formatted code, you'd better edit your post and insert the snippet there.Ortrude
How long does it take just to do the listing itself? As in, if you were to just run "for dir, dirnames, filenames in os.walk(START_FOLDER): pass", is it still unacceptably slow/memory intensive?Meares
What is len(FOLDER_WITH_KEYWORDS_DELETION_EXCEPTION_LIST)? You can hoist the name.lower() out of the innermost loop, which can help if the keywords list is very large.Ns
See this link for a potential speed-up using C. I've run into problems doing lists on 100s of millions of files, and solved the problem using the method described in the link.Meares
How do you define huge? A few hundred? Thousand? One hundred million?Babiche
@senshin: The bottleneck can be an the OS/Python interface. Even in 3.4 all the directory entries are read in at once, which can make reading large directories slow or impossible. See Issue 11406 for details.Exclusion
E
3

If the issue is that the directory simply has too many files in it, this will hopefully be solved in Python 3.5.

Until then, you may want to check out scandir.

Exclusion answered 14/2, 2014 at 2:24 Comment(1)
Yes, os.scandir() was added to Python 3.5 and returns a generator that yields simple os.DirEntry objects containing file path and other file attributes.Sauncho
C
2

You should make use of the in keyword to test if a directory name matches a keyword.

for _, dirnames, _ in os.walk(START_FOLDER): 
    for name in dirnames:
        if any((k in name.lower() for k in FOLDER_WITH_KEYWORDS_DELETION_EXCEPTION_LIST)):
            ignoreList.append(name)

If your ignoreList is too big, you may want to think about creating an acceptedList and using that instead.

Captivate answered 12/2, 2014 at 15:33 Comment(5)
Now it requires python 3, which should be mentioned since OP didn't tag either 2.x or 3.x.Meares
@Meares what isn't p3k compatible?Captivate
You should never have to use True and False in a ternary if expression. True if x in y else False is the same as simply x in y. Second, you've still mixed up the OP's test. OP checks whether any keyword is a substring of the name; not whether the name is a substring of any keyword.Hwu
And yet you still didn't correct the direction of the substring test.Hwu
This will save a little time, but won't help at all if the OP has too many directory entries (and the OP could save the same time by adding a break after the append).Exclusion

© 2022 - 2024 — McMap. All rights reserved.