I'm working with pandas and with Spark dataframes. The dataframes are always very big (> 20 GB), and the standard Spark functions are not sufficient for those sizes. Currently I'm converting my pandas dataframe to a Spark dataframe like this:
dataframe = spark.createDataFrame(pandas_dataframe)
I do that transformation because writing dataframes to HDFS is very easy with Spark:
dataframe.write.parquet(output_uri, mode="overwrite", compression="snappy")
But the transformation fails for dataframes that are bigger than 2 GB. When I transform a Spark dataframe to pandas, I can use pyarrow:
import pyarrow as pa
import pyarrow.parquet as pq

# temporarily write the spark dataframe to hdfs
dataframe.write.parquet(path, mode="overwrite", compression="snappy")
# open hdfs connection using pyarrow (pa)
hdfs = pa.hdfs.connect("default", 0)
# read the parquet dataset (pyarrow.parquet (pq))
parquet = pq.ParquetDataset(path_hdfs, filesystem=hdfs)
table = parquet.read(nthreads=4)
# transform the arrow table to pandas
pandas = table.to_pandas(nthreads=4)
# delete the temporary files
hdfs.delete(path, recursive=True)
This is a fast conversion from Spark to pandas, and it also works for dataframes bigger than 2 GB. I have not yet found a way to do it the other way around, i.e. transforming a pandas dataframe to Spark with the help of pyarrow. The problem is that I can't find out how to write a pandas dataframe to HDFS.
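What I imagine is something along these lines (just a rough sketch; I am assuming pa.Table.from_pandas and pq.write_table are the relevant pyarrow calls, and path_hdfs is a path on the connected HDFS), and then reading the parquet back with Spark:

import pyarrow as pa
import pyarrow.parquet as pq

# sketch, not working code: convert the pandas dataframe to an arrow table
table = pa.Table.from_pandas(pandas_dataframe)

# open hdfs connection and write the table as a single parquet file
hdfs = pa.hdfs.connect("default", 0)
with hdfs.open(path_hdfs, "wb") as f:
    pq.write_table(table, f)

# the parquet file could then be read back as a spark dataframe
# dataframe = spark.read.parquet(path_hdfs)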
My pandas version: 0.19.0