What's the best module for interacting with HDFS with Python3? [closed]
I see there is hdfs3, snakebite, and some others. Which one is the best supported and comprehensive?

Adenoid answered 27/10, 2016 at 12:57 Comment(1)
As of 2019, the last updated version of snakebite on PyPI is Aug 8, 2016. – Kerley
As far as I know, there are not as many options as one might think. I'd suggest the official Python package hdfs 2.0.12, which can be downloaded from its website or installed from the terminal by running:

pip install hdfs

Some of the features:

  • Python (2 and 3) bindings for the WebHDFS (and HttpFS) API, supporting both secure and insecure clusters.
  • Command line interface to transfer files and start an interactive client shell, with aliases for convenient namenode URL caching.
  • Additional functionality through optional extensions:
    • avro, to read and write Avro files directly from HDFS.
    • dataframe, to load and save Pandas dataframes.
    • kerberos, to support Kerberos-authenticated clusters.
Leighton answered 27/10, 2016 at 13:47 Comment(0)

I have tried snakebite, hdfs3 and hdfs.

Snakebite supports only downloads (no uploads), so it's a no-go for me.

Of these three, only hdfs3 supports an HA setup, so it was my choice; however, I didn't manage to make it work on multihomed networks using datanode hostnames (the problem is described here: https://rainerpeter.wordpress.com/2014/02/12/connect-to-hdfs-running-in-ec2-using-public-ip-addresses/).

So I ended up using hdfs (2.0.16), as it supports uploads. I had to add a workaround in Bash to support HA.

PS: There's an interesting article comparing Python libraries for interacting with the Hadoop File System at http://wesmckinney.com/blog/python-hdfs-interfaces/

Slicker answered 7/3, 2017 at 10:31 Comment(0)

pyarrow, the Python implementation of Apache Arrow, has a well-maintained and documented HDFS client: https://arrow.apache.org/docs/python/filesystems.html

Gyroscope answered 11/1, 2019 at 19:45 Comment(0)

There's pydoop, which is quite handy.

https://github.com/crs4/pydoop

Basir answered 23/1, 2020 at 13:53 Comment(0)

I found pyhdfs-client really good for large files. (A file that took 1 hour to load via WebHDFS loaded in 18 minutes with it.)

pip install pyhdfs-client

The only downside is that it's new, and its interface is not as clean as those of other HDFS clients. Documentation is missing, but you can check usage here: https://pypi.org/project/pyhdfs-client/

Supersonic answered 25/3, 2021 at 21:45 Comment(0)
