How to list HDFS directory contents using webhdfs?
Asked Answered
M

1

5

Is it possible to check to contents of a directory in HDFS using webhdfs?

This would work as hdfs dfs -ls normally would, but instead using webhdfs.

How do I list a webhdfs directory using Python 2.6 to do so?

Masha answered 23/6, 2016 at 10:51 Comment(0)
R
8

You can use the LISTSTATUS verb. The docs are at List a Directory, and the following code can be found on the WebHDFS REST API docs:

With curl, this is what it looks like:

curl -i  "http://<HOST>:<PORT>/webhdfs/v1/<PATH>?op=LISTSTATUS"

The response is a FileStatuses JSON object:

{
  "name"      : "FileStatuses",
  "properties":
  {
    "FileStatuses":
    {
      "type"      : "object",
      "properties":
      {
        "FileStatus":
        {
          "description": "An array of FileStatus",
          "type"       : "array",
          "items"      : fileStatusProperties
        }
      }
    }
  }
}

fileStatusProperties (for the items field) has this JSON schema:

var fileStatusProperties =
{
  "type"      : "object",
  "properties":
  {
    "accessTime":
    {
      "description": "The access time.",
      "type"       : "integer",
      "required"   : true
    },
    "blockSize":
    {
      "description": "The block size of a file.",
      "type"       : "integer",
      "required"   : true
    },
    "group":
    {
      "description": "The group owner.",
      "type"       : "string",
      "required"   : true
    },
    "length":
    {
      "description": "The number of bytes in a file.",
      "type"       : "integer",
      "required"   : true
    },
    "modificationTime":
    {
      "description": "The modification time.",
      "type"       : "integer",
      "required"   : true
    },
    "owner":
    {
      "description": "The user who is the owner.",
      "type"       : "string",
      "required"   : true
    },
    "pathSuffix":
    {
      "description": "The path suffix.",
      "type"       : "string",
      "required"   : true
    },
    "permission":
    {
      "description": "The permission represented as a octal string.",
      "type"       : "string",
      "required"   : true
    },
    "replication":
    {
      "description": "The number of replication of a file.",
      "type"       : "integer",
      "required"   : true
    },
   "type":
    {
      "description": "The type of the path object.",
      "enum"       : ["FILE", "DIRECTORY"],
      "required"   : true
    }
  }
};

You can process the filenames in Python using pywebhdfs, like this:

import json
from pprint import pprint
from pywebhdfs.webhdfs import PyWebHdfsClient

hdfs = PyWebHdfsClient(host='host',port='50070', user_name='hdfs')  # Use your own host/port/user_name config

data = hdfs.list_dir("dir/dir")  # Use your preferred directory, without the leading "/"

file_statuses = data["FileStatuses"]
pprint file_statuses   # Display the dict

for item in file_statuses["FileStatus"]:
    print item["pathSuffix"]   # Display the item filename

Instead of printing each object, you can actually work with the items as you need. The result from file_statuses is simply a Python dict, so it can be used like any other dict, provided that you use the right keys.

Rebatement answered 23/6, 2016 at 11:24 Comment(4)
Can this be used to retrieve the filenames inside a directory?Masha
It completely and totally depends on what language you're trying to use to do this. Since you hadn't included that in the original question, I assumed that it was immaterial to you, and that you'd work it out once you knew the verb to use. If you update the question with details on which language (C++, Python, Ruby, PHP, Fortran, etc) you're using, that would make it possible to answer that question. Please be as detailed as possible (e.g. Python 2.7 is different than Python 3.0, Ruby 1.8.7 is different than Ruby 2.3.1, C is different than C++, etc).Rebatement
Sure the language we're using at the moment is pythonMasha
I have updated the question for you. (note that I had asked you to update the question already.) Please confirm that's what you're actually asking.Rebatement

© 2022 - 2024 — McMap. All rights reserved.