You can create a pyarrow.Table and then convert it to a Dataset. Here's an example.
import pandas as pd
import pyarrow as pa
import pyarrow.dataset as ds
from datasets import Dataset

df = pd.DataFrame({'a': [0, 1, 2], 'b': [3, 4, 5]})

# Convert to an in-memory pyarrow.dataset.Dataset
dataset = ds.dataset(pa.Table.from_pandas(df).to_batches())

# Convert to a Hugging Face Dataset
hg_dataset = Dataset(pa.Table.from_pandas(df))
To convert to a Table only, you can use the from_pandas(…) method, as shown in the docs and in the example above: https://arrow.apache.org/docs/python/pandas.html
A reference to the Hugging Face docs: https://huggingface.co/docs/datasets/package_reference/main_classes.html#dataset
nlp.Dataset from CSV directly without involving pandas or pyarrow. Arrow also has a notion of a dataset (pyarrow.dataset.Dataset) which represents a collection of 1 or more files. @TDrabas has a great answer for creating one of those. You can also create a pyarrow.dataset.Dataset from CSV directly. – Straddle