Best way to get the file list of a big directory in Python?
B

9

13

I have an insanely big directory and need to get its file list via Python.

In my code I need an iterator, not a list, so these don't work:

os.listdir
glob.glob  (uses listdir!)
os.walk

I can't find any good library. Help! Maybe a C++ library?

Bahuvrihi answered 25/2, 2011 at 11:26 Comment(2)
Looks like a duplicate of Is there a way to efficiently yield every file in a directory containing millions of files?.Tachyphylaxis
Oh, yes. I couldn't find that post by searching...Bahuvrihi
D
9

If you have a directory that is too big for libc readdir() to read it quickly, you probably want to look at the kernel call getdents() (http://www.kernel.org/doc/man-pages/online/pages/man2/getdents.2.html ). I ran into a similar problem and wrote a long blog post about it.

http://www.olark.com/spw/2011/08/you-can-list-a-directory-with-8-million-files-but-not-with-ls/

Basically, readdir() only reads 32 KB of directory entries at a time, so if a directory holds a lot of files, listing it through readdir() takes a very long time to complete.

Dendrochronology answered 11/8, 2011 at 19:56 Comment(0)
W
14

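For Python 2.x (with the scandir package from PyPI):

import scandir
scandir.walk(path)  # generator, like os.walk

For Python 3.5+ (built in):

os.scandir(path)  # returns an iterator of DirEntry objects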

https://www.python.org/dev/peps/pep-0471/

https://pypi.python.org/pypi/scandir
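
To make the iterator behaviour concrete, here is a minimal sketch for Python 3.6+ (the `with` form of os.scandir needs 3.6). The helper names `iter_file_names` and `first_names` are made up for this illustration; only `os.scandir` and `itertools.islice` come from the standard library.

```python
import os
from itertools import islice


def iter_file_names(path):
    """Lazily yield the names of regular files directly under `path`."""
    # os.scandir returns an iterator of DirEntry objects; the with-form
    # closes the underlying directory handle even if we stop early.
    with os.scandir(path) as entries:
        for entry in entries:
            if entry.is_file(follow_symlinks=False):
                yield entry.name


def first_names(path, limit):
    """Return at most `limit` file names without building a list of every entry."""
    return list(islice(iter_file_names(path), limit))
```

Something like `for name in iter_file_names('/some/big/dir'):` then processes one entry at a time, and `first_names('/some/big/dir', 10)` stops after ten files instead of listing everything first.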

Woorali answered 21/1, 2016 at 11:4 Comment(0)
C
1

I found this library useful: https://github.com/benhoyt/scandir.

Cinchonidine answered 31/7, 2013 at 23:19 Comment(0)
M
0

I think that using opendir would work, and there is a Python package that wraps it via Pyrex: http://pypi.python.org/pypi/opendir/0.0.1

Munificent answered 25/2, 2011 at 11:29 Comment(1)
sounds nice, but cant install under windows... File "c:\python26\lib\site-packages\pyrex-0.9.9-py2.6.egg\Pyrex\Distutils\extension.py", line 69, in init **kw) TypeError: unbound method __init__() must be called with Extension instance as first argument (got Extension instance instead)Bahuvrihi
T
0

You should use a generator. This problem is discussed here: http://bugs.python.org/issue11406

Trapp answered 7/6, 2013 at 19:59 Comment(0)
K
0

Someone built a Python module based on that article that wraps getdents. I know this post is old, but you could also use scandir (I have done that with directories of 21 million files). walk is way too slow, though: it is also a generator, but it has too much overhead.

This module seems like it would have been an interesting alternative. I have not used it, but the author did base it on the 8-million-files ls article referenced above. Reading through the code, I think it would have been fun and faster to use.

It also lets you tweak the buffer size without having to go into C directly.

https://github.com/ZipFile/python-getdents. It is also available via pip from PyPI, though I recommend reading the docs.

https://pypi.org/project/getdents/
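
For the scandir route, one way to avoid the per-directory lists that walk builds before it yields anything is a small hand-rolled generator. This is only a sketch under assumptions: `iter_tree` is a hypothetical helper name, it needs Python 3.6+, and it silently skips directories it cannot open, which may not be what you want.

```python
import os


def iter_tree(top):
    """Yield the path of every non-directory entry under `top`, one at a time."""
    # Only directories still to be visited are kept in memory,
    # never a full list of the files in any one directory.
    stack = [top]
    while stack:
        current = stack.pop()
        try:
            entries = os.scandir(current)
        except OSError:
            # Unreadable or vanished directory: skip it (an assumption).
            continue
        with entries:
            for entry in entries:
                if entry.is_dir(follow_symlinks=False):
                    stack.append(entry.path)
                else:
                    yield entry.path
```

The order in which directories are visited is not specified here, so sort or filter downstream if the order matters.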

Karaganda answered 13/12, 2019 at 19:34 Comment(0)
B
0

I found this library really fast:
https://pypi.org/project/scandir/
I used the code below, taken from this library, and it worked like a charm.

import os

def subdirs(path):
    """Yield directory names not starting with '.' under given path."""
    for entry in os.scandir(path):
        if not entry.name.startswith('.') and entry.is_dir():
            yield entry.name
Brew answered 4/6, 2020 at 8:42 Comment(0)
P
-1

http://docs.python.org/release/2.6.5/library/os.html#os.walk

>>> import os
>>> type(os.walk('/'))
<type 'generator'>
Purlin answered 25/2, 2011 at 17:1 Comment(1)
Unfortunately, os.walk uses listdir internally.Untouched
C
-2

How about glob.iglob? It's the iterator version of glob.

Clouded answered 4/12, 2013 at 23:10 Comment(1)
That's a generator with a list behind the curtains, so why not call the list directly?Woorali
