How to convert a pandas DataFrame to an Arrow dataset?


Asked 8/11, 2021 at 4:20 Answered 8/11, 2021 at 6:16

Solved pandas pyarrow

The Hugging Face library has a particular dataset format called an Arrow dataset:

https://arrow.apache.org/docs/python/dataset.html

https://huggingface.co/datasets/wiki_lingua

I need to convert an ordinary pandas DataFrame to such a dataset, or read a tabular CSV file as a dataset.

Is that possible?

Lindly answered 8/11, 2021 at 4:20 Comment(1)

The word "dataset" is a little ambiguous here. It appears HuggingFace has a concept of a dataset nlp.Dataset which is (I think, but am not very sure) a single file. You can create an nlp.Dataset from CSV directly without involving pandas or pyarrow. Arrow also has a notion of a dataset (pyarrow.dataset.Dataset) which represents a collection of 1 or more files. @TDrabas has a great answer for creating one of those. You can also create a pyarrow.dataset.Dataset from CSV directly. – Straddle 8/11, 2021 at 19:26

You can create a pyarrow.Table and then convert it to a Dataset. Here's an example.

import pyarrow as pa
import pyarrow.dataset as ds
import pandas as pd
from datasets import Dataset

df = pd.DataFrame({'a': [0, 1, 2], 'b': [3, 4, 5]})

# Build a pyarrow dataset (held in memory) from the table's record batches
dataset = ds.dataset(pa.Table.from_pandas(df).to_batches())

# Convert to a Hugging Face dataset
hg_dataset = Dataset(pa.Table.from_pandas(df))

To convert to a Table only, you can use the from_pandas(…) method, as shown in the docs and in the example above: https://arrow.apache.org/docs/python/pandas.html

A reference to Huggingface docs: https://huggingface.co/docs/datasets/package_reference/main_classes.html#dataset

Demo answered 8/11, 2021 at 6:16 Comment(5)

In addition, the huggingface Dataset object seems to have a from_pandas() itself as well (huggingface.co/docs/datasets/loading.html#pandas-dataframe) – Merril 8/11, 2021 at 10:12

I want a dataset, not a table. from_dataset converts a dataframe into a table. Or maybe you can tell me how to convert a table to a dataset? – Lindly 8/11, 2021 at 10:48

Aaah I missed that. Altered the answer. – Demo 8/11, 2021 at 17:34

It says it's a pyarrow._dataset.InMemoryDataset, whereas wiki_lingua is a 'datasets.arrow_dataset.Dataset'. How do I convert the in-memory one to the one I need? – Lindly 9/11, 2021 at 8:29

I think @Straddle is right and the word dataset here is loaded. BUT, you can take pyarrow.Table and pass it as a parameter to datasets.Dataset to convert it. Added another line there. – Demo 9/11, 2021 at 16:28
