How do I search sub-folders and sub-sub-folders in Google Drive?

This is a commonly asked question.

The scenario is:

folderA____ folderA1____folderA1a
       \____folderA2____folderA2a
                    \___folderA2b

... and the question is: how do I list all the files in all of the folders under the root folderA?

Pontus answered 19/1, 2017 at 12:12 Comment(0)

EDIT April 2020: Google have announced that multi-parent files are being disabled from September 2020. This alters the narrative below and means Option 2 is no longer an option as described, although it might be possible to implement it using shortcuts. I will update this answer further as I test the new restrictions/features.

We are all used to the idea of folders (aka directories) in Windows/*nix etc. In the real world, a folder is a container into which documents are placed. It is also possible to place smaller folders inside bigger folders. Thus the big folder can be thought of as containing all of the documents inside its smaller children folders.

However, in Google Drive, a Folder is NOT a container, so much so that in the first release of Google Drive they weren't even called Folders; they were called Collections. A Folder is simply a File with (a) no contents, and (b) a special mime-type (application/vnd.google-apps.folder). The way Folders are used is exactly the same way that tags (aka labels) are used.

The best way to understand this is to consider GMail. If you look at the top of an open mail item, you see two icons: a folder with the tooltip "Move to" and a label with the tooltip "Labels". Click on either of these and the same dialogue box appears, and it is all about labels. Your labels are listed down the left-hand side, in a tree display that looks a lot like folders. Importantly, a mail item can have multiple labels, or you could say a mail item can be in multiple folders. Google Drive's Folders work in exactly the same way that GMail labels work.

Having established that a Folder is simply a label, there is nothing stopping you from organising your labels in a hierarchy that resembles a folder tree; in fact, this is the most common way of doing so.

It should now be clear that a file (let's call it MyFile) in folderA2b is NOT a child or grandchild of folderA. It is simply a file with a label (confusingly called a Parent) of "folderA2b". OK, so how DO I get all the files "under" folderA?

Alternative 1. Recursion

The temptation would be to list the children of folderA, then for any children that are folders, recursively list their children, rinse, repeat. In a very small number of cases this might be the best approach, but for most it has the following problem:

  • It is woefully time consuming to do a server round trip for each sub folder. This does of course depend on the size of your tree, so if you can guarantee that your tree size is small, it could be OK.
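
For illustration, here is a minimal Python sketch of the recursive approach. It assumes drive is an authorised Drive v3 service resource from the google-api-python-client, and the function name is my own; note that it issues one files.list call per folder, which is exactly the round-trip cost described above.

# Sketch of Alternative 1: one files.list call per folder.
# Assumes drive was built with googleapiclient.discovery.build('drive', 'v3', ...).
FOLDER_MIME = 'application/vnd.google-apps.folder'

def list_tree_recursively(drive, folder_id):
    """Return all non-folder files under folder_id, recursing into each sub-folder."""
    found = []
    page_token = None
    while True:
        response = drive.files().list(
            q=f"'{folder_id}' in parents and trashed = false",
            fields="nextPageToken, files(id, name, mimeType)",
            pageToken=page_token).execute()
        for item in response.get('files', []):
            if item['mimeType'] == FOLDER_MIME:
                found.extend(list_tree_recursively(drive, item['id']))  # a server round trip per sub-folder
            else:
                found.append(item)
        page_token = response.get('nextPageToken')
        if page_token is None:
            return found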

Alternative 2. The common parent

This works best if all of the files are being created by your app (i.e. you are using drive.file scope). As well as the folder hierarchy above, create a dummy parent folder called, say, "MyAppCommonParent". As you create each file as a child of its particular Folder, you also make it a child of MyAppCommonParent. This becomes a lot more intuitive if you remember to think of Folders as labels. You can now easily retrieve all descendants with a single query of '<MyAppCommonParent-ID>' in parents.
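
As a sketch only (and bearing in mind the April 2020 edit above: since September 2020 the second parent would have to be replaced by a shortcut), Alternative 2 in Python could look roughly like this; drive is again an authorised Drive v3 service and COMMON_PARENT_ID is a placeholder for the ID of MyAppCommonParent.

# Sketch of Alternative 2, written against the pre-September-2020 multi-parent behaviour.
# COMMON_PARENT_ID is a placeholder for the real ID of the "MyAppCommonParent" folder.
COMMON_PARENT_ID = 'REPLACE_WITH_MyAppCommonParent_ID'

def create_file_with_common_parent(drive, name, real_parent_id):
    """Create a file labelled with both its real folder and the common parent."""
    metadata = {'name': name, 'parents': [real_parent_id, COMMON_PARENT_ID]}
    return drive.files().create(body=metadata, fields='id').execute()

def list_all_descendants(drive):
    """Retrieve every file the app has labelled with the common parent (paging omitted)."""
    query = f"'{COMMON_PARENT_ID}' in parents and trashed = false"
    return drive.files().list(q=query, fields='files(id, name)').execute().get('files', [])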

Alternative 3. Folders first

Start by getting all folders. Yep, all of them. Once you have them all in memory, you can crawl through their parents properties and build your tree structure and list of Folder IDs. You can then do a single files.list?q='folderA-ID' in parents or 'folderA1-ID' in parents or 'folderA1a-ID' in parents... (the query uses folder IDs, not names). Using this technique you can get everything in two HTTP calls.

The pseudo code for option 3 is a bit like...

// get all folders from Drive
files.list?q=mimeType='application/vnd.google-apps.folder' and trashed=false&fields=files(id,name,parents)
// store in a Map, keyed by ID 
// find the entry for folderA and note the ID
// find any entries where the ID is in the parents, note their IDs
// for each such entry, repeat recursively
// use all of the IDs noted above to construct a ...
//   files.list?q='folderA-ID' in parents or 'folderA1-ID' in parents or 'folderA1a-ID' in parents...
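
To make that concrete, here is a compact Python rendering of the pseudo code (a sketch only: it assumes drive is an authorised Drive v3 service, and it omits paging and the chunking of the parents clause, both of which the Python answer below handles).

# Two-call sketch of Alternative 3.
FOLDER_MIME = 'application/vnd.google-apps.folder'

def list_descendant_files(drive, root_folder_id):
    # Call 1: get every folder visible to the user, with its parents.
    folders = drive.files().list(
        q=f"mimeType = '{FOLDER_MIME}' and trashed = false",
        fields='files(id, name, parents)').execute().get('files', [])
    # Build a child -> parent map in memory, then walk it to collect the sub-tree's folder IDs.
    parent_of = {f['id']: f.get('parents', [None])[0] for f in folders}
    tree_ids = [root_folder_id]
    for fid in tree_ids:  # appending while iterating gives a breadth-first walk of the tree
        tree_ids.extend(child for child, parent in parent_of.items() if parent == fid)
    # Call 2: one query with an "'<id>' in parents" clause per folder in the tree.
    clause = ' or '.join(f"'{folder_id}' in parents" for folder_id in tree_ids)
    return drive.files().list(
        q=f"({clause}) and mimeType != '{FOLDER_MIME}' and trashed = false",
        fields='files(id, name, parents)').execute().get('files', [])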

Alternative 2 is the most efficient, but only works if you have control of file creation. Alternative 3 is generally more efficient than Alternative 1, but there may be certain small tree sizes where 1 is best.

Pontus answered 19/1, 2017 at 12:12 Comment(10)
Thanks for this. Being able to do it in two calls would be ideal, but how do you get all the folders in a tree in one call? It seems the recursive approach is required to get all the folders in the tree to build up the aggregated "in parents" query for the single call for all files in the tree. Is that not the case?Dibbrun
Also would be great to see some code associated to step 3 for best practicesKrasnodar
@Dibbrun The first http call gets all folders, not just those in the tree in question. Having fetched all folders, use the parents property to figure out in memory what the tree structure looks like. I've added some pseudo code.Pontus
Looks like Google has an internal-only field "'<folder-id>' in ancestors" to search recursively. Visit drive.google.com, click on any folder's down arrow, click on "Search within <folder-name>" and do a search; it does in fact recursively search. Open your Dev tools > Network tab and notice that it's making a GET request with a q param: q=fullText%20contains%20%27foo%27%20and%20trashed%20%3D%20true%20and%20%271AGks0RvjtHU9gF_nnSdLUk5xk-NGYIl1%27%20in%20ancestors (decoded: fullText contains 'foo' and trashed = true and '1AGks0RvjtHU9gF_nnSdLUk5xk-NGYIl1' in ancestors). It's using in ancestors, but if you try it with the API you get a 400 error with an Invalid Value message.Tailpipe
too bad ancestors field isn't released to the publicTailpipe
@Tailpipe could you post the complete URL that Drive is using. It'll be interesting to studyPontus
it's too long to be pasted here (characters limit) but here's a pastebin link pastebin.com/raw/YGJeu15U obviously you won't get back any results if you use it, since it's not authenticatedTailpipe
we're not alone issuetracker.google.com/…Tailpipe
thx for the link. That doesn't look like a Drive API endpoint, so my guess is that the Drive webapp is calling some bespoke server code, which in turn is calling the Drive API. The ancestors query can be easily implemented on the server. The logic would be to use my option 3 above, then cache the resulting folder hierarchy and use that to resolve any future ancestor queries.Pontus
approach 3 works best for me. I had to make 13K http calls to get the file list. But with this approach, I can batch/parallelise into 100 parent folders per request. So what used to take 1 hr 30 mins now takes less than 5 mins!!!Cyclorama

Sharing a Python solution to the excellent Alternative 3 by @pinoyyid, above, in case it's useful to anyone. I'm not a developer so it's probably hopelessly un-pythonic... but it works, only makes 2 API calls, and is pretty quick.

  1. Get a master list of all the folders in a drive.
  2. Test whether the folder-to-search is a parent (ie. it has subfolders).
  3. Iterate through subfolders of the folder-to-search testing whether they too are parents.
  4. Build a Google Drive file query with one '<folder-id>' in parents segment per subfolder found.

Interestingly, Google Drive seems to have a hard limit of 599 '<folder-id>' in parents segments per query, so if your folder-to-search has more subfolders than this, you need to chunk the list.

FOLDER_TO_SEARCH = '123456789'  # ID of folder to search
DRIVE_ID = '654321'  # ID of shared drive in which it lives
MAX_PARENTS = 500  # Limit set safely below Google max of 599 parents per query.
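# Note: drive_api_ref is assumed to be an authorised Drive v3 service resource,
# e.g. built with googleapiclient.discovery.build('drive', 'v3', credentials=...).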


def get_all_folders_in_drive():
    """
    Return a dictionary of all the folder IDs in a drive mapped to their parent folder IDs (or to the
    drive itself if a top-level folder). That is, flatten the entire folder structure.
    """
    folders_in_drive_dict = {}
    page_token = None
    max_allowed_page_size = 1000
    just_folders = "trashed = false and mimeType = 'application/vnd.google-apps.folder'"
    while True:
        results = drive_api_ref.files().list(
            pageSize=max_allowed_page_size,
            fields="nextPageToken, files(id, name, mimeType, parents)",
            includeItemsFromAllDrives=True, supportsAllDrives=True,
            corpora='drive',
            driveId=DRIVE_ID,
            pageToken=page_token,
            q=just_folders).execute()
        folders = results.get('files', [])
        page_token = results.get('nextPageToken', None)
        for folder in folders:
            folders_in_drive_dict[folder['id']] = folder['parents'][0]
        if page_token is None:
            break
    return folders_in_drive_dict


def get_subfolders_of_folder(folder_to_search, all_folders):
    """
    Yield the subfolders of the folder-to-search, then their subfolders, and so on (this is a generator).
    :param all_folders: The dictionary returned by :meth:`get_all_folders_in_drive`.
    """
    temp_list = [k for k, v in all_folders.items() if v == folder_to_search]  # Get all subfolders
    for sub_folder in temp_list:  # For each subfolder...
        yield sub_folder  # Return it
        yield from get_subfolders_of_folder(sub_folder, all_folders)  # Get subsubfolders etc


def get_relevant_files(relevant_folders):
    """
    Get files under the folder-to-search and all its subfolders.
    """
    relevant_files = {}
    chunked_relevant_folders_list = [relevant_folders[i:i + MAX_PARENTS] for i in
                                     range(0, len(relevant_folders), MAX_PARENTS)]
    for folder_list in chunked_relevant_folders_list:
        query_term = ' in parents or '.join('"{0}"'.format(f) for f in folder_list) + ' in parents'
        relevant_files.update(get_all_files_in_folders(query_term))
    return relevant_files


def get_all_files_in_folders(parent_folders):
    """
    Return a dictionary of file IDs mapped to file names for the specified parent folders.
    """
    files_under_folder_dict = {}
    page_token = None
    max_allowed_page_size = 1000
    just_files = f"mimeType != 'application/vnd.google-apps.folder' and trashed = false and ({parent_folders})"
    while True:
        results = drive_api_ref.files().list(
            pageSize=max_allowed_page_size,
            fields="nextPageToken, files(id, name, mimeType, parents)",
            includeItemsFromAllDrives=True, supportsAllDrives=True,
            corpora='drive',
            driveId=DRIVE_ID,
            pageToken=page_token,
            q=just_files).execute()
        files = results.get('files', [])
        page_token = results.get('nextPageToken', None)
        for file in files:
            files_under_folder_dict[file['id']] = file['name']
        if page_token is None:
            break
    return files_under_folder_dict


if __name__ == "__main__":
    all_folders_dict = get_all_folders_in_drive()  # Flatten folder structure
    relevant_folders_list = [FOLDER_TO_SEARCH]  # Start with the folder-to-search
    for folder in get_subfolders_of_folder(FOLDER_TO_SEARCH, all_folders_dict):
        relevant_folders_list.append(folder)  # Recursively search for subfolders
    relevant_files_dict = get_relevant_files(relevant_folders_list)  # Get the files
Perforce answered 3/9, 2020 at 15:59 Comment(4)
nice. regarding the 599 parent limit, did you confirm that the limit is really on the number of parents vs simply the overall length of the URL?Pontus
No, I didn't think of that, you might be right about URL length. It seemed consistent across a few different drives, where folders might have different length IDs I suppose, but I didn't test enough to be sure. Could also be a restriction of whatever Google Drive 'plan' my company has..?Perforce
If it's not too difficult, it would be valuable to find out. Could you pad the URL with &fooooooo...ooooo=baa...aaaar and see if the number changesPontus
I think you're right about files().list() querystring length. 30000 chars succeeds, 30001 fails with "query too complex". I don't know if folder IDs have a uniform length, but with each of my segments (ie. "1OshWVukvcAFUqhHgyzKpEljlYpJ_cxkt" in parents or ) @ 50 chars each (including spaces), then that would neatly work out at 600 subfolders. Anyway, I think the headline is the same; if you have a lot of subfolders, you have to chunk the list...Perforce

Sharing a JavaScript solution that uses recursion to build an array of folder names, starting with the first-level folder and moving down the hierarchy. This array is composed by recursively cycling through the parent IDs of the file in question.

The extract below makes 3 separate queries to the gapi:

  1. get the root folder id
  2. get a list of folders
  3. get a list of files

The code then iterates through the list of files, creating an array of folder names (the path) for each one.

const { google } = require('googleapis')
const gOAuth =  require('./googleOAuth')
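// Note: googleOAuth is the author's local helper (not shown in this answer); gOAuth.get()
// is expected to resolve to OAuth2 credentials usable as the auth option of google.drive().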

// resolve the promises for getting G files and folders
// resolve the promises for getting G files and folders
const getGFilePaths = async () => {
  // getGfiles() resolves to [files, folders, rootFolder] via Promise.all, so only call it once
  let [gFiles, gFolders, gRootFolders] = await getGfiles()
  let gRootFolder = gRootFolders[0]['parents'][0]
  // create the path files and create a new key with array of folder paths, returning an array of files with their folder paths
  return gFiles
          .filter((file) => {return file.hasOwnProperty('parents')})
          .map((file) => ({...file, path: makePathArray(gFolders, file['parents'][0], gRootFolder)}))
}

// recursive function to build an array of the file paths top -> bottom
let makePathArray = (folders, fileParent, rootFolder) => {
  if(fileParent === rootFolder){return []}
  else {
    let filteredFolders = folders.filter((f) => {return f.id === fileParent})
    if(filteredFolders.length >= 1 && filteredFolders[0].hasOwnProperty('parents')) {
      let path = makePathArray(folders, filteredFolders[0]['parents'][0], rootFolder)
      path.push(filteredFolders[0]['name'])
      return path
    }
    else {return []}
  }
}

// get meta-data list of files from gDrive, with query parameters
const getGfiles = () => {
  try {
    let getRootFolder = getGdriveList({corpora: 'user', includeItemsFromAllDrives: false,
    fields: 'files(name, parents)', 
    q: "'root' in parents and trashed = false and mimeType = 'application/vnd.google-apps.folder'"})
  
    let getFolders = getGdriveList({corpora: 'user', includeItemsFromAllDrives: false,
    fields: 'files(id,name,parents), nextPageToken', 
    q: "trashed = false and mimeType = 'application/vnd.google-apps.folder'"})
  
    let getFiles = getGdriveList({corpora: 'user', includeItemsFromAllDrives: false,
    fields: 'files(id,name,parents, mimeType, fullFileExtension, webContentLink, exportLinks, modifiedTime), nextPageToken', 
    q: "trashed = false and mimeType != 'application/vnd.google-apps.folder'"})
  
    return Promise.all([getFiles, getFolders, getRootFolder])
  }
  catch(error) {
    return `Error in retrieving a file response from Google Drive: ${error}`
  }
}

// make call out gDrive to get meta-data files. Code adds all files in a single array which are returned in pages
const getGdriveList = async (params) => {
  const gKeys = await gOAuth.get()
  const drive = google.drive({version: 'v3', auth: gKeys})
  let list = []
  let nextPgToken
  do {
    let res = await drive.files.list(params)
    list.push(...res.data.files)
    nextPgToken = res.data.nextPageToken
    params.pageToken = nextPgToken
  }
  while (nextPgToken)
  return list
}
Cubitiere answered 9/8, 2020 at 1:28 Comment(4)
Am I correct in saying you fetch all files, not just the files under the required root? If so, it’s probably worth pointing that and possibly creating a version where the files.list is constrained to the parent folders. Having said that, your solution will work well as is if the app is using drive.file scope.Pontus
@Pontus you're correct. However this solution should work if the root folder is manually entered rather than programmatically getting the drive.file rootCubitiere
Is there a Google App Script equivalent to 'await'?Dread
Thank you for the solution. This is working perfectly.Sharpwitted

The following works very well but requires an additional call to the API: it shares the root folder with a dummy user, does a search for files shared with that user, then removes the share. Because permissions granted on a folder are inherited by everything beneath it, the search returns every descendant of the root folder. This works great in our production environments.

var userPermission = new Permission()
{
    Type = "user",
    Role = "reader",
    EmailAddress = "AnyEmailAddress"
};

var request = service.Permissions.Create(userPermission, rootFolderID);
var result = request.ExecuteAsync().ContinueWith(t =>
{
    Permission permission = t.Result;
    if (t.Exception == null)
    {
        // Do your search here.
        // Make sure you add 'AnyEmailAddress' in readers.
        service.Files.List......

        // Then remove the share.
        var requestDeletePermission = service.Permissions.Delete(rootFolderID, permission.Id);
        requestDeletePermission.Execute();
    }
});
Kazoo answered 21/11, 2020 at 4:8 Comment(0)

For Google Apps Script, I've written this function:

function getSubFolderIdsByFolderId(folderId, result = []) {
  let folder = DriveApp.getFolderById(folderId);
  let folders = folder.getFolders();
  if (folders && folders.hasNext()) {
    while (folders.hasNext()) {
      let f = folders.next();
      let childFolderId = f.getId();
      result.push(childFolderId);

      result = getSubFolderIdsByFolderId(childFolderId, result);
    }
  }
  return result.filter(onlyUnique);
}

function onlyUnique(value, index, self) {
  return self.indexOf(value) === index;
}

With this call:

const subFolderIds = getSubFolderIdsByFolderId('1-id-of-the-root-folder-to-check')

And this for loop:

let q = [];
for (let i in subFolderIds) {
  let subFolderId = subFolderIds[i];
  q.push('"' + subFolderId + '" in parents');
}
if (q.length > 0) {
  q = '(' + q.join(' or ') + ') and';
} else {
  q = '';
}

I get the required query part for the DriveApp.searchFiles call.

A major disadvantage of this approach is the number of requests and the time you have to wait until you get the complete list, which depends on the size of the root directory. I would not call this an ideal solution!

Maybe caching could improve the performance of subsequent calls, if you take the modification date into account in the Drive API query.

I'm curious, because in the Google Drive browser version you can search recursively within folders, and it does not take nearly as much time as my approach.

Arteriole answered 3/9, 2021 at 9:51 Comment(3)
this implements Option 1 from my accepted answer. As you've found, it can be slow. It might be worth you trying my Option 3 and see if that works better for you.Pontus
Yes! I came up with a similar idea of searching all my files (everywhere) and just check their parents. This is probably the most performant way of solving "recursive file search". Thanks!Arteriole
note: if you have a lot of subfolders, API starts throwing 400 HTTP error, so you have to use it carefully, or split to several requestsAdobe
