How to convert a pandas DataFrame to an Arrow dataset?

The Hugging Face datasets library uses a particular dataset format called an Arrow dataset:

https://arrow.apache.org/docs/python/dataset.html

https://huggingface.co/datasets/wiki_lingua

I need to convert an ordinary pandas DataFrame to such a dataset, or read a tabular CSV file as a dataset.

Is that possible?

Lindly answered 8/11, 2021 at 4:20 Comment(1)
The word "dataset" is a little ambiguous here. It appears HuggingFace has a concept of a dataset, nlp.Dataset, which is (I think, but am not very sure) a single file. You can create an nlp.Dataset from CSV directly without involving pandas or pyarrow. Arrow also has a notion of a dataset (pyarrow.dataset.Dataset), which represents a collection of one or more files. @TDrabas has a great answer for creating one of those. You can also create a pyarrow.dataset.Dataset from CSV directly. - Straddle

You can create a pyarrow.Table from the DataFrame and then convert it to either kind of dataset. Here's an example.

import pyarrow as pa
import pyarrow.dataset as ds
import pandas as pd
from datasets import Dataset

df = pd.DataFrame({'a': [0, 1, 2], 'b': [3, 4, 5]})

# Convert to a pyarrow dataset (an in-memory pyarrow.dataset.Dataset)
dataset = ds.dataset(pa.Table.from_pandas(df).to_batches())

# Convert to a Hugging Face dataset (datasets.arrow_dataset.Dataset)
hg_dataset = Dataset(pa.Table.from_pandas(df))

If you only need a Table, use the pa.Table.from_pandas(…) method, as shown in the example above and in the docs: https://arrow.apache.org/docs/python/pandas.html

A reference to Huggingface docs: https://huggingface.co/docs/datasets/package_reference/main_classes.html#dataset

Demo answered 8/11, 2021 at 6:16 Comment(5)
In addition, the Hugging Face Dataset object seems to have a from_pandas() method itself as well (huggingface.co/docs/datasets/loading.html#pandas-dataframe). - Merril
I want a dataset, not a table. from_pandas converts the DataFrame into a table. Or maybe you can tell me how to convert a table to a dataset? - Lindly
Aaah, I missed that. Altered the answer. - Demo
It says it's a pyarrow._dataset.InMemoryDataset, whereas wiki_lingua is a 'datasets.arrow_dataset.Dataset'. How do I convert the in-memory one to the one I need? - Lindly
I think @Straddle is right and the word dataset here is loaded. BUT, you can take a pyarrow.Table and pass it as a parameter to datasets.Dataset to convert it. Added another line there. - Demo
