Reading in a CSV file as a dataframe from HDFS

I'm using pydoop to read in a file from HDFS, and when I use:

import pydoop.hdfs as hd
with hd.open("/home/file.csv") as f:
    print f.read()

It prints the file's contents to stdout.

Is there any way for me to read this file in as a dataframe? I've tried using pandas' read_csv("/home/file.csv"), but it tells me that the file cannot be found. The exact code and error are:

>>> import pandas as pd
>>> pd.read_csv("/home/file.csv")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib64/python2.7/site-packages/pandas/io/parsers.py", line 498, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/usr/lib64/python2.7/site-packages/pandas/io/parsers.py", line 275, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "/usr/lib64/python2.7/site-packages/pandas/io/parsers.py", line 590, in __init__
    self._make_engine(self.engine)
  File "/usr/lib64/python2.7/site-packages/pandas/io/parsers.py", line 731, in _make_engine
    self._engine = CParserWrapper(self.f, **self.options)
  File "/usr/lib64/python2.7/site-packages/pandas/io/parsers.py", line 1103, in __init__
    self._reader = _parser.TextReader(src, **kwds)
  File "pandas/parser.pyx", line 353, in pandas.parser.TextReader.__cinit__ (pandas/parser.c:3246)
  File "pandas/parser.pyx", line 591, in pandas.parser.TextReader._setup_parser_source (pandas/parser.c:6111)
IOError: File /home/file.csv does not exist
Predella asked 26/2, 2016 at 1:57

I know next to nothing about HDFS, but I wonder if the following might work:

import pandas as pd
import pydoop.hdfs as hd

with hd.open("/home/file.csv") as f:
    df = pd.read_csv(f)

I assume read_csv works with a file handle, or in fact any iterable that will feed it lines. I know the NumPy CSV readers do.

pd.read_csv("/home/file.csv") would work if the regular Python file open works - i.e. it reads the file a regular local file.

with open("/home/file.csv") as f: 
    print f.read()

But evidently hd.open is using some other location or protocol, so the file is not local. If my suggestion doesn't work, then you (or we) need to dig deeper into the HDFS documentation.
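
If passing the handle directly fails (for instance, if your pandas version insists on a real local path), a fallback is to pull the whole file into memory and hand pandas an in-memory buffer instead. A minimal sketch, assuming f.read() returns the raw file contents and the file fits in memory:

import io
import pandas as pd
import pydoop.hdfs as hd

with hd.open("/home/file.csv") as f:
    data = f.read()  # read the entire file out of HDFS into memory
df = pd.read_csv(io.BytesIO(data))  # parse the in-memory copy with pandas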

Volga answered 26/2, 2016 at 5:25 Comment(0)

You can use the following code to read a CSV from HDFS with pyarrow:

import pandas as pd
import pyarrow as pa

# connection details for the HDFS namenode
hdfs_config = {
    "host": "XXX.XXX.XXX.XXX",
    "port": 8020,
    "user": "user"
}

fs = pa.hdfs.connect(hdfs_config['host'], hdfs_config['port'],
                     user=hdfs_config['user'])
df = pd.read_csv(fs.open("/home/file.csv"))
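
Note that pa.hdfs.connect is pyarrow's legacy HDFS interface; newer pyarrow releases deprecate it in favor of pyarrow.fs.HadoopFileSystem (see the answer below).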
Mariann answered 10/2, 2021 at 18:47

Use read instead of open; it works:

with hd.read("/home/file.csv") as f:
    df = pd.read_csv(f)
Comedo answered 18/11, 2020 at 19:46

You can read and write with pyarrow natively. I found the pydoop library a bit clumsy, and it requires lots of annoying dependencies. The syntax is as follows:

from pyarrow import fs
import pyarrow.parquet as pq

# connect to hadoop
hdfs = fs.HadoopFileSystem('hostname', 8020)

# read a single parquet file from HDFS
with hdfs.open_input_file(path) as pqt:
    df = pq.read_table(pqt).to_pandas()

# read a directory full of partitioned parquets (e.g. written by spark)
df = pq.ParquetDataset(path, filesystem=hdfs).read().to_pandas()
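
The same filesystem object also covers the CSV case in the original question, since pandas accepts any file-like object. A minimal sketch, assuming the same placeholder hostname and port as above:

import pandas as pd
from pyarrow import fs

hdfs = fs.HadoopFileSystem('hostname', 8020)

# open_input_file returns a file-like object that pandas can consume directly
with hdfs.open_input_file("/home/file.csv") as f:
    df = pd.read_csv(f)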
Shana answered 23/5, 2023 at 22:32
