HDFS put vs WebHDFS

I'm loading a 28 GB file into Hadoop HDFS using WebHDFS, and it takes ~25 minutes.

I tried loading the same file using hdfs put and it took ~6 minutes. Why is there such a big difference in performance?

Which is recommended to use? Can somebody explain, or direct me to a good link? It would be really helpful.

Below is the command I'm using:

curl -i --negotiate -u: -X PUT "http://$hostname:$port/webhdfs/v1/$destination_file_location/$source_filename.temp?op=CREATE&overwrite=true"

This redirects to a DataNode address, which I use in the next step to write the data.
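For reference, the write itself is the second request: the Location header of the response above contains a DataNode URL (shown below as the placeholder $datanode_url), and the file body is streamed to it, along the lines of

curl -i -X PUT -T $source_filename.temp "$datanode_url"

This two-step handshake is how WebHDFS directs the payload to a DataNode while the initial request only touches the NameNode.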

Unattended answered 23/7, 2015 at 7:29 Comment(1)
Can you also write your command for webhdfs? (Rollmop)

Hadoop provides several ways of accessing HDFS.

All of the following support almost all features of the filesystem:

1. FileSystem (FS) shell commands: Provide easy access to Hadoop file system operations, as well as to other file systems that Hadoop supports, such as the local FS, HFTP FS, and S3 FS.
This requires a Hadoop client to be installed, and involves the client writing blocks directly to a DataNode. Not all versions of Hadoop support all options for copying between filesystems.

2. WebHDFS: Defines a public HTTP REST API, which permits clients to access Hadoop from multiple languages without installing Hadoop; the advantage is being language agnostic (curl, PHP, etc.).
WebHDFS needs access to all nodes of the cluster, and when data is read it is transmitted directly from the source node, but there is the overhead of HTTP on top of what (1) the FS shell pays. Like the FS shell, it works agnostically, with no problems across different Hadoop clusters and versions.

3. HttpFS: Reads and writes data to HDFS in a cluster behind a firewall. A single node acts as a gateway through which all the data is transferred; performance-wise I believe this can be even slower, but it is preferred when you need to pull data from a public source into a secured cluster.

So choose rightly! Going down the list, each option is an alternative whenever the choice above it is not available to you.
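As a rough command-line sketch of the three options (hosts, ports, and paths below are placeholders; HttpFS listens on port 14000 by default):

# 1. FS shell: needs a Hadoop client installed, writes blocks directly to DataNodes
hdfs dfs -put localfile.txt /user/alice/

# 2. WebHDFS: initial CREATE against the NameNode's HTTP port (redirects to a DataNode, as in the question)
curl -i -X PUT "http://$namenode:$http_port/webhdfs/v1/user/alice/localfile.txt?op=CREATE"

# 3. HttpFS: same REST API on the gateway node; every byte flows through that one node
curl -i -X PUT "http://$httpfs_host:14000/webhdfs/v1/user/alice/localfile.txt?op=CREATE"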

Rollmop answered 23/7, 2015 at 8:26 Comment(1)
Thanks Ruby.. can you please elaborate on why there is such a huge performance difference? (Unattended)

Hadoop provides a FileSystem shell API to support file system operations such as creating, renaming, or deleting files and directories, and opening, reading, or writing files. The FileSystem shell is a Java application that uses the Java FileSystem class to provide these operations, and it creates RPC connections for them.

If the client is within the Hadoop cluster this is useful, because the shell uses the hdfs URI scheme to connect to the Hadoop distributed file system, so the client makes a direct RPC connection to write data into HDFS.

This is good for applications running within the Hadoop cluster, but there may be use cases where an external application needs to manipulate HDFS, for instance to create directories and write files into them, or to read the content of a file stored on HDFS. Hortonworks developed an API, based on standard REST functionality, to support these requirements: WebHDFS.

WebHDFS provides REST API functionality through which any external application can connect to the DistributedFileSystem over an HTTP connection. It does not matter whether the external application is written in Java, PHP, or something else.

The WebHDFS concept is based on HTTP operations like GET, PUT, POST, and DELETE. Operations like OPEN, GETFILESTATUS, and LISTSTATUS use HTTP GET, while others like CREATE, MKDIRS, RENAME, and SETPERMISSION rely on HTTP PUT.
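As a sketch, with the same placeholder $hostname:$port as in the question and a hypothetical /user/alice path:

curl -i "http://$hostname:$port/webhdfs/v1/user/alice?op=LISTSTATUS"
curl -i -X PUT "http://$hostname:$port/webhdfs/v1/user/alice/newdir?op=MKDIRS"
curl -i -X PUT "http://$hostname:$port/webhdfs/v1/user/alice/a.txt?op=RENAME&destination=/user/alice/b.txt"

The first call is an HTTP GET; the other two are HTTP PUTs.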

It provides secure read/write access to HDFS over HTTP. It is basically intended as a replacement for HFTP (read-only access over HTTP) and HSFTP (read-only access over HTTPS). It uses the webhdfs URI scheme to connect to the distributed file system.
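For instance, a machine with the Hadoop shell installed can address the cluster over HTTP through this scheme (host and HTTP port below are placeholders):

hdfs dfs -ls webhdfs://$hostname:$http_port/user/alice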

If the client is outside the Hadoop cluster and trying to access HDFS, WebHDFS is useful for it. Also, if you are trying to connect two different versions of Hadoop clusters, WebHDFS is useful, since it uses a REST API and is therefore independent of the MapReduce or HDFS version.

Sheila answered 23/7, 2015 at 8:57 Comment(1)
Can you add a comparison to the HttpFS protocol? Thanks (Cortisol)

The difference between HDFS client access and WebHDFS comes down to scalability, due to the design of HDFS and the fact that an HDFS client decomposes a file into splits living on different nodes. When an HDFS client accesses file content, under the covers it goes to the NameNode and gets a list of the file's splits and their physical locations on the Hadoop cluster.

It can then go to the DataNodes at all those locations and fetch the blocks in the splits in parallel, piping the content directly to the client.
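You can see this metadata yourself with fsck; assuming a hypothetical file path:

hdfs fsck /user/alice/bigfile.dat -files -blocks -locations

This prints every block of the file along with the DataNodes holding its replicas, which is exactly the list an HDFS client uses to read blocks in parallel.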

WebHDFS is a proxy living in the HDFS cluster, layered on top of HDFS, so all data needs to be streamed to the proxy before it gets relayed on to the WebHDFS client. In essence, it becomes a single point of access and an I/O bottleneck.

Rosati answered 4/7, 2017 at 16:3 Comment(0)

You can use the traditional Java client API (which is what the HDFS command-line tools use internally).

From what I have read here, the Java client and the REST-based approach have similar performance.

Norther answered 15/3, 2017 at 21:48 Comment(0)
