How to load a Hugging Face dataset from a local path?
Take a simple example from this page: https://huggingface.co/datasets/Dahoas/rm-static.

If I want to load this dataset online, I just directly use:

from datasets import load_dataset
dataset = load_dataset("Dahoas/rm-static") 

What if I want to load the dataset from a local path? I first downloaded the files, keeping the same folder structure as the web "Files and versions" tab:

-data
|-test-00000-of-00001-bf4c733542e35fcb.parquet
|-train-00000-of-00001-2a1df75c6bce91ab.parquet
-.gitattributes
-README.md
-dataset_infos.json

Then I put them into my folder, but an error occurs when loading:

dataset_path = "/data/coco/dataset/Dahoas/rm-static"
tmp_dataset = load_dataset(dataset_path)

It shows FileNotFoundError: No (supported) data files or dataset script found in /data/coco/dataset/Dahoas/rm-static.

Vain answered 1/9, 2023 at 3:3 Comment(0)
Save the data with save_to_disk then load it with load_from_disk. For example:

import datasets
ds = datasets.load_dataset("Dahoas/rm-static") 
ds.save_to_disk("Path/to/save")

and later, when you want to reuse it, load it back with load_from_disk:

ds = datasets.load_from_disk("Path/to/save")

You can verify this by printing the dataset; you will get the same result in both cases. This is the easier way out. The data is generally saved in the Arrow format.

The second method, where you download the Parquet files directly, requires you to explicitly declare the dataset and its config (which splits map to which files; some of this metadata may be included in the JSON files), and then you can load it.

Lussi answered 1/9, 2023 at 4:33 Comment(4)
I would suggest going with the first method; it may be time-consuming once, but it is a lot better in terms of saving effort and time the next time you load the data. Please feel free to post any more questions about this.Lussi
Hi, thanks for your reply. I tried your method, but when I load the dataset with dataset = load_dataset("Path/to/save") it raises: ValueError: Couldn't cast _data_files: list<item: struct<filename: string>> child 0, item: struct<filename: string> child 0, filename: string _fingerprint: string _format_columns: null _format_kwargs: struct<> _format_type: null _output_all_columns: bool _split: string to {'builder_name': Value(dtype='string', id=None), }Vain
On the other hand, if I use dataset = load_from_disk("Path/to/save"), there is no problem.Vain
Hi @4daJKong. That is the correct way to do it. Were you able to load it that way? You can refer to huggingface.co/docs/datasets/package_reference/loading_methodsLussi
I solved this myself; it is easy:

data_files = {"train": "train-00000-of-00001-2a1df75c6bce91ab.parquet", "test": "test-00000-of-00001-8c7c51afc6d45980.parquet"}
raw_datasets = load_dataset("parquet", data_dir="/Your/Path/Dahoas/rm-static/data", data_files=data_files)
Vain answered 1/9, 2023 at 6:53 Comment(0)
You can load a CSV data file from a local path using:

from datasets import load_dataset
dataset = load_dataset('csv', data_files='final.csv')

or to load multiple files, use:

dataset = load_dataset('csv', data_files={'train': ['my_train_file_1.csv', 'my_train_file_2.csv'], 'test': 'my_test_file.csv'})

For more details, follow the Hugging Face documentation.

Aerobatics answered 3/7 at 6:28 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.