How do I parse a listing of files to get just the filenames in Python?

Let's say I'm using Python's ftplib to retrieve a list of log files from an FTP server. How would I parse that listing to get just the file names (the last column) into a list? See the link above for example output.

Prunelle answered 26/10, 2008 at 7:42 Comment(0)
P
9

Using retrlines() probably isn't the best idea there, since it just prints to the console and so you'd have to do tricky things to even get at that output. A likely better bet would be to use the nlst() method, which returns exactly what you want: a list of the file names.
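A minimal sketch of the nlst() approach, for illustration only. The helper name list_names is hypothetical, and treating a 550 reply as "empty directory" is an assumption: some servers are reported to answer NLST on an empty directory with a permanent error rather than an empty list.

```python
from ftplib import error_perm


def list_names(ftp, path=""):
    """Return the names in `path` using NLST (hypothetical helper)."""
    try:
        return ftp.nlst(path) if path else ftp.nlst()
    except error_perm as exc:
        # Assumption: a 550 reply here means "no files found".
        if str(exc).startswith("550"):
            return []
        raise
```

Here ftp is expected to be an already connected and logged-in ftplib.FTP instance.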

Plastic answered 26/10, 2008 at 7:55 Comment(0)
S
8

This best answer

You may want to use ftp.nlst() instead of ftp.retrlines(). It will give you exactly what you want.

If you can't, read the following:

Generators for sysadmin processes

In his now-famous tutorial, Generator Tricks For Systems Programmers: An Introduction, David M. Beazley gives a lot of recipes for tackling this kind of data problem with quick and reusable code.

E.g.:

# empty list that will receive all the log entries
log = []
# we pass a callback to bypass the print_line that retrlines would call by default
# we do that only because we cannot use something better than retrlines
ftp.retrlines('LIST', callback=log.append)
# we use rsplit because it is more efficient in our case if we have a big listing
files = (line.rsplit(None, 1)[1] for line in log)
# get your file list
files_list = list(files)

Why don't we generate the list immediately?

Well, doing it this way gives you a lot of flexibility: you can apply any intermediate generator to filter the files before turning them into files_list. It works just like a pipe: add a line and you add a processing step without overhead (since these are generators). And if you get rid of retrlines, it still works, and it's even better because you never store the whole list even once.
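To make the pipe idea concrete, here is a small sketch with two extra filtering stages. The listing lines are made-up sample data in the usual Unix LIST layout, and the stage names (entries, names, logs) are hypothetical.

```python
# Made-up sample lines standing in for what retrlines('LIST') would collect.
log = [
    "-rw-r--r-- 1 ftp-usr pdmaint 5305 Mar 20 09:48 INDEX",
    "-rw-r--r-- 1 ftp-usr pdmaint 1024 Mar 21 10:02 access.log",
    "drwxrwsr-x 5 ftp-usr pdmaint 1536 Mar 20 09:48 archive",
]

# Stage 1: drop directories (their lines start with "d").
entries = (line for line in log if not line.startswith("d"))
# Stage 2: keep only the last whitespace-separated field.
names = (line.rsplit(None, 1)[1] for line in entries)
# Stage 3: keep only names ending in ".log".
logs = (name for name in names if name.endswith(".log"))

# Nothing has been computed until the pipeline is consumed here.
files_list = list(logs)
```

Each stage is one line, so adding or removing a filter does not disturb the others.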

EDIT: I read the comment on the other answer, and it says that this won't work if there is a space in the name.

Cool, this will illustrate why this method is handy. If you want to change something in the process, you just change a line. Swap:

files = (line.rsplit(None, 1)[1] for line in log)

with

# split the line, take every item from field 8 onward, then join them with spaces
files = (' '.join(line.split()[8:]) for line in log)

OK, this may not be obvious here, but for huge batch-processing scripts, it's nice :-)
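One caveat with joining the fields back together is that runs of several spaces inside a name collapse to one. A possible variation, sketched here under the assumption of a Unix-style listing with exactly eight fields before the name, is to cap the number of splits so the remainder of the line is kept untouched. The helper name name_from_list_line is hypothetical.

```python
def name_from_list_line(line):
    """Extract the name from one Unix-style LIST line (hypothetical helper).

    Assumes eight whitespace-separated fields precede the name:
    permissions, links, owner, group, size, month, day, time-or-year.
    """
    fields = line.split(None, 8)  # at most 8 splits; the rest stays intact
    return fields[8] if len(fields) == 9 else None
```

It drops into the pipeline as files = (name_from_list_line(line) for line in log). Leading spaces in a name are still lost, and servers that use a different listing layout would need a different field count.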

Spriggs answered 26/10, 2008 at 9:9 Comment(1)
Seems more robust than nlst, which hung on an empty directory in my case. - Suspect
P
1

And a slightly less-optimal method, by the way, if you're stuck using retrlines() for some reason, is to pass a function as the second argument to retrlines(); it'll be called for each item in the list. So something like this (assuming you have an FTP object named 'ftp') would work as well:

filenames = []
ftp.retrlines('LIST', lambda line: filenames.append(line.split()[-1]))

The list 'filenames' will then be a list of the file names.

Plastic answered 26/10, 2008 at 7:59 Comment(1)
This won't work if the filename contains spaces (Mohit Ranka's answer probably has the same problem, but I can't understand his code completely...) - Bogan
B
1

Since every filename in the output starts at the same column, all you have to do is get the position of the dot on the first line:

drwxrwsr-x 5 ftp-usr pdmaint 1536 Mar 20 09:48 .

Then slice the filename out of the other lines using the position of that dot as the starting index.

Since the dot is the last character on the line, you can use the length of the line minus 1 as the index. So the final code is something like this:

lines = []
# retrlines() returns only the server's response string, so collect the lines with a callback
ftp.retrlines('LIST', lines.append)

filename_index = len(lines[0]) - 1
files = []

for line in lines:
    files.append(line[filename_index:])
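A slightly more defensive sketch of the same column idea: it looks for the "." entry anywhere in the listing instead of assuming it is the first line, and gives up if there is none. The helper name names_by_column is hypothetical, and the approach still assumes the server pads every column to the same width.

```python
def names_by_column(lines):
    """Slice names using the column of the '.' entry (hypothetical helper)."""
    for line in lines:
        stripped = line.rstrip()
        if stripped.split()[-1:] == ["."]:
            start = len(stripped) - 1  # column where the "." begins
            break
    else:
        return None  # no "." entry, so the name column is unknown
    return [line[start:] for line in lines if len(line) > start]
```

A return value of None signals that another method is needed for that listing.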
Bogan answered 26/10, 2008 at 8:0 Comment(1)
I think this is a pretty creative technique, but if you are listing the top level directory, then there might not be any dot files in the listing. - Croak
W
1

Is there any reason why ftplib.FTP.nlst() won't work for you? I just checked and it returns only names of the files in a given directory.

Waylonwayman answered 26/10, 2008 at 8:15 Comment(1)
Oops, OK. Didn't notice that James had already suggested nlst(). - Waylonwayman
F
1

If the FTP server supports the MLSD command, then please see section “single directory case” from that answer.

Use an instance (say ftpd) of the FTPDirectory class, call its .getdata method with a connected ftplib.FTP instance in the correct folder, then you can:

directory_filenames = [ftpfile.name for ftpfile in ftpd.files]
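The FTPDirectory class comes from the linked answer and is not shown here. As an alternative sketch, not the method above, Python 3.3 and later expose MLSD directly through ftplib's FTP.mlsd(), which yields (name, facts) pairs. The helper name mlsd_filenames is hypothetical, and the server must support MLSD for this to work.

```python
def mlsd_filenames(ftp, path=""):
    """Return names of regular files in `path` via MLSD (hypothetical helper)."""
    return [
        name
        for name, facts in ftp.mlsd(path)
        # The "type" fact distinguishes files from dir, cdir and pdir entries.
        if facts.get("type", "").lower() == "file"
    ]
```

Because the name arrives as its own field, spaces in file names need no special handling.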
Fright answered 24/6, 2010 at 23:17 Comment(0)
C
0

I believe it should work for you.

file_name_list = [' '.join(each_file_detail.split()).split()[-1] for each_file_detail in file_list_from_log]

NOTES -

  1. Here I am making an assumption that you want the data in the program (as a list), not on the console.

  2. each_file_detail is each line that is being produced by the program.

  3. ' '.join(each_file_detail.split())

To replace multiple spaces by 1 space.

Catabasis answered 26/10, 2008 at 7:42 Comment(0)
H
0

This gets a list of all the filenames plus their sizes. It also walks the sub-directories.

def ftp_login():
    """ Future FTP stuff """

    from ftplib import FTP
    ftp = FTP()
    ftp.connect('phone', 2221)
    ftp.login('android', 'android')
    print("ftp.getwelcome():", ftp.getwelcome())
    def walk(path, all_files):
        """ walk the path """
        files = []
        ftp.dir(path, files.append)  # each listing line is passed to files.append
        # Filename could be any position on line so can't use line[52:] below
        # dr-x------   3 user group            0 Aug 27 16:32 Compilations
        for f in files:
            # Split on the first 8 whitespace runs only, so the name (field 9)
            # keeps its internal spaces, including double spaces
            parts = f.split(None, 8)
            if len(parts) < 9 or parts[8] in (".", ".."):
                continue  # skip short lines such as "total" and the . and .. entries
            size, name = parts[4], parts[8]
            if f.startswith("d"):  # directory?
                new_path = path + name + "/"  # FTP paths use "/" whatever the local OS
                walk(new_path, all_files)  # back down the rabbit hole
            else:
                # /path/to/filename.ext <SIZE>
                all_files.append(path + name + " <" + size + ">")

    all_files = []
    walk("/", all_files)  # 41 seconds
    print("len(all_files):", len(all_files))  # 4,074 files incl 163 + 289 subdirs
    for i in range(10):
        print(all_files[i])

Output:

ftp.getwelcome(): 220 Service ready for new user.
len(all_files): 4074
/Compilations/Greatest Hits of the 80’s [Disc #3 of 3]/3-12 Poison.wav <47480228>
/Compilations/Greatest Hits of the 80’s [Disc #3 of 3]/3-12 Poison.mp3 <7343013>
/Compilations/Greatest Hits of the 80’s [Disc #3 of 3]/3-12 Poison.flac <31112653>
/Compilations/Greatest Hits of the 80’s [Disc #3 of 3]/3-12 Poison.oga <8075357>
/Compilations/Greatest Hits of the 80’s [Disc #3 of 3]/3-12 Poison.m4a <7662899>
/Compilations/Don't Let Me Be Misunderstood/07 House Of The Rising Sun (Quasimot.m4a <8015709>
/Compilations/Don't Let Me Be Misunderstood/01 Don't Let Me Be Misunderstood.m4a <33668167>
/Compilations/Don't Let Me Be Misunderstood/03 You're My Everything.m4a <12505304>
/Compilations/Don't Let Me Be Misunderstood/02 Gloria.m4a <8115224>
/Compilations/Don't Let Me Be Misunderstood/04 Black Pot.m4a <14617541>

Usage

ftp.connect('phone', 2221)
  • Change 'phone' to the host name or IP address
  • Change 2221 to the port number or skip parameter if using port 21
ftp.login('android', 'android')
  • Change the first 'android' to user name
  • Change the second 'android' to password
Hexylresorcinol answered 3/9, 2023 at 7:0 Comment(0)
