Extending Python's os.walk function on FTP server
Asked Answered
P

4

9

How can I make os.walk traverse the directory tree of an FTP database (located on a remote server)? The way the code is structured now is (comments provided):

import fnmatch, os, ftplib

def find(pattern, startdir=os.curdir): #find function taking variables for both desired file and the starting directory
    for (thisDir, subsHere, filesHere) in os.walk(startdir): #each of the variables change as the directory tree is walked
        for name in subsHere + filesHere: #going through all of the files and subdirectories
            if fnmatch.fnmatch(name, pattern): #if the name of one of the files or subs is the same as the inputted name
                fullpath = os.path.join(thisDir, name) #fullpath equals the concatenation of the directory and the name
                yield fullpath #return fullpath but anew each time

def findlist(pattern, startdir = os.curdir, dosort=False):
    matches = list(find(pattern, startdir)) #find with arguments pattern and startdir put into a list data structure
    if dosort: matches.sort() #isn't dosort automatically False? Is this statement any different from the same thing but with a line in between
    return matches

#def ftp(
#specifying where to search.

if __name__ == '__main__':
    import sys
    namepattern, startdir = sys.argv[1], sys.argv[2]
    for name in find(namepattern, startdir): print (name)

I am thinking that I need to define a new function (i.e., def ftp()) to add this functionality to the code above. However, I am afraid that the os.walk function will, by default, only walk the directory trees of the computer that the code is run from.

Is there a way that I can extend the functionality of os.walk to be able to traverse a remote directory tree (via FTP)?

Partisan answered 16/7, 2015 at 21:55 Comment(8)
pypi.python.org/pypi/ftptool/0.5.1Izy
I'm trying to avoid any interfaces beyond ftplib. Is this possible to do? Disclaimer: I've already tried ftptool and could not get it to do what I want. As such, the code above is a Python rehash of the Linux find command. I'm trying to extend it by incorporating an FTP switch to os.walk.Partisan
If someone can show me how to reimplement this in ftptool in a way that works for remote FTP databases, I will accept this as an answer as well.Partisan
what are you trying to actually do? what do you mean "couldnt get it to do what you want"? what do you mean by remote ftp database?Izy
When I use the find command in the Terminal, it by default searches the directory tree structure of my system (usually starting from the home directory). However, I am looking for a way to tell find to search the directory tree structure of a remote directory tree (such as any FTP database available on the web). Usually, you would need to open up a web browser and navigate to this FTP site. However, I would like to connect to it externally via a Python script and then use the find command to search it.Partisan
My question though pertains only to the searching. I've already written code to connect to an FTP website from the Terminal.Partisan
so you want to get a list of all paths that match a file pattern anywhere in the whole system? locate fname is much much faster and it runs on most linux machinesIzy
Yes, exactly. Should I use locate fname in the code of your answer anywhere?Partisan
O
7

All you need is utilizing the python's ftplib module. Since os.walk() is based on a Breadth-first search algorithm you need to find the directories and file names at each iteration, then continue the traversing recursively from the first directory. I implemented this algorithm about 2 years ago for using as the heart of FTPwalker, which is an optimum package for traversing extremely large directory trees Through FTP.

from os import path as ospath


class FTPWalk:
    """
    This class is contain corresponding functions for traversing the FTP
    servers using BFS algorithm.
    """
    def __init__(self, connection):
        self.connection = connection

    def listdir(self, _path):
        """
        return files and directory names within a path (directory)
        """

        file_list, dirs, nondirs = [], [], []
        try:
            self.connection.cwd(_path)
        except Exception as exp:
            print ("the current path is : ", self.connection.pwd(), exp.__str__(),_path)
            return [], []
        else:
            self.connection.retrlines('LIST', lambda x: file_list.append(x.split()))
            for info in file_list:
                ls_type, name = info[0], info[-1]
                if ls_type.startswith('d'):
                    dirs.append(name)
                else:
                    nondirs.append(name)
            return dirs, nondirs

    def walk(self, path='/'):
        """
        Walk through FTP server's directory tree, based on a BFS algorithm.
        """
        dirs, nondirs = self.listdir(path)
        yield path, dirs, nondirs
        for name in dirs:
            path = ospath.join(path, name)
            yield from self.walk(path)
            # In python2 use:
            # for path, dirs, nondirs in self.walk(path):
            #     yield path, dirs, nondirs
            self.connection.cwd('..')
            path = ospath.dirname(path)

Now for using this class, you can simply create a connection object using ftplib module and pass the the object to FTPWalk object and just loop over the walk() function:

In [2]: from test import FTPWalk

In [3]: import ftplib

In [4]: connection = ftplib.FTP("ftp.uniprot.org")

In [5]: connection.login()
Out[5]: '230 Login successful.'

In [6]: ftpwalk = FTPWalk(connection)

In [7]: for i in ftpwalk.walk():
            print(i)
   ...:     
('/', ['pub'], [])
('/pub', ['databases'], ['robots.txt'])
('/pub/databases', ['uniprot'], [])
('/pub/databases/uniprot', ['current_release', 'previous_releases'], ['LICENSE', 'current_release/README', 'current_release/knowledgebase/complete', 'previous_releases/', 'current_release/relnotes.txt', 'current_release/uniref'])
('/pub/databases/uniprot/current_release', ['decoy', 'knowledgebase', 'rdf', 'uniparc', 'uniref'], ['README', 'RELEASE.metalink', 'changes.html', 'news.html', 'relnotes.txt'])
...
...
...
Omegaomelet answered 5/5, 2017 at 8:18 Comment(2)
It should be noted that using backslashes with FTP servers doesn't always work. Instead, you need to ensure that os.path.join doesn't join a path with \ . To do this, replace line 40: path = ospath.join(path, name) with path = ospath.join(path, name).replace("\\", "/"). Worth noting this is only an issue with windows, because the geniuses at Microsoft decided to use backslashes for directories, and os.path.join intelligently joins paths based on the OS.Camporee
This doesn't seem to handle directories with spaces in the namePoop
C
0

I needed a function like os.walk on FTP and there where not any so i thought it would be useful to write it , for future references you can find last version here

by the way here is the code that would do that :

def FTP_Walker(FTPpath,localpath):
    os.chdir(localpath)
    current_loc = os.getcwd()
    for item in ftp.nlst(FTPpath):
        if not is_file(item):
            yield from FTP_Walker(item,current_loc)

        elif is_file(item):
            yield(item)
            current_loc = localpath
        else:
            print('this is a item that i could not process')
    os.chdir(localpath)
    return


def is_file(filename):
    current = ftp.pwd()
    try:
        ftp.cwd(filename)
    except Exception as e :
        ftp.cwd(current)
        return True

    ftp.cwd(current)
    return False

how to use:

first connect to your host :

host_address = "my host address"
user_name = "my username"
password = "my password"


ftp = FTP(host_address)
ftp.login(user=user_name,passwd=password)

now you can call the function like this:

ftpwalk = FTP_Walker("FTP root path","path to local") # I'm not using path to local yet but in future versions I will improve it. so you can just path an '/' to it 

and then to print and download files you can do somthing like this :

for item in ftpwalk:
ftp.retrbinary("RETR "+item, open(os.path.join(current_loc,item.split('/')[-1]),"wb").write) #it is downloading the file 
print(item) # it will print the file address

( i will write more features for it soon so if you need some specific things or have any idea that can be useful for users i'll be happy to hear that )

Conceit answered 1/7, 2019 at 11:25 Comment(0)
D
0

I wrote a library pip install walk-sftp. Event though it is named walk-sftp I included a WalkFTP class that lets you filter by start_date of files & end_date of files. You can even pass in a processing_function that returns True or False to see whether your process to clean & store data works. It also has a log parameter (pass filename) that uses pickle & keeps track of any progress so you don't overwrite or have to keep track of dates making backfilling easier.

https://pypi.org/project/walk-sftp/

Davis answered 13/1, 2021 at 20:26 Comment(0)
I
-2

Im going to assume this is what you want ... although really I have no idea

ssh = paramiko.SSHClient()
ssh.connect(server, username=username, password=password)
ssh_stdin, ssh_stdout, ssh_stderr = ssh.exec_command("locate my_file.txt")
print ssh_stdout

this will require the remote server to have the mlocate package `sudo apt-get install mlocate;sudo updatedb();

Izy answered 16/7, 2015 at 23:15 Comment(2)
Some databases I'm connecting to have this error: paramiko.ssh_exception.S SHException: Server 'ftp.server.org' not found in known_hosts. Does this mean I can't ssh to them using paramiko? I will try the mlocate approach and post an update.Partisan
@Partisan That's obvious to get such errors with such a protocol. The essence of SSH is connecting securely.Omegaomelet

© 2022 - 2024 — McMap. All rights reserved.