How do I load a file from a custom-hosted MinIO S3 bucket into pandas using an s3:// URL?

I have a MinIO server hosted locally, and I need to read a file from a MinIO S3 bucket into pandas in a Jupyter notebook using an S3 URL like "s3://dataset/wine-quality.csv".

Using the boto3 library, I am able to download the file:

import boto3

# Connect to the locally hosted MinIO server; the endpoint URL must
# include the scheme (http:// or https://).
s3 = boto3.resource(
    's3',
    endpoint_url='http://localhost:9000',
    aws_access_key_id='id',
    aws_secret_access_key='password',
)
s3.Bucket('dataset').download_file('wine-quality.csv', '/tmp/wine-quality.csv')
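
For completeness, a minimal sketch of reading the downloaded copy locally (the /tmp path is just where the snippet above saved it); this works, but it forces an intermediate download step that I would like to avoid:

import pandas as pd

# Read the copy that boto3 downloaded to the local filesystem.
data = pd.read_csv('/tmp/wine-quality.csv')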

But when I try using pandas,

data = pd.read_csv("s3://dataset/wine-quality.csv")

I get a client error: 403 Forbidden. I know that pandas internally uses the boto3 library (correct me if I'm wrong).

PS: pandas read_csv has one more parameter, storage_options={"key": AWS_ACCESS_KEY_ID, "secret": AWS_SECRET_ACCESS_KEY, "token": AWS_SESSION_TOKEN}, but I couldn't find any option in it for passing a custom MinIO host URL for pandas to read from.

Longways asked 14/4, 2021 at 14:39

Pandas v1.2 onwards allows you to pass storage options, which get passed down to fsspec; see the docs here: https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html?highlight=s3fs#reading-writing-remote-files.

To pass in a custom endpoint URL, specify it through client_kwargs in storage_options:

df = pd.read_csv(
    "s3://dataset/wine-quality.csv",
    storage_options={
        "key": AWS_ACCESS_KEY_ID,
        "secret": AWS_SECRET_ACCESS_KEY,
        "token": AWS_SESSION_TOKEN,
        # The endpoint URL must include the scheme (http:// or https://).
        "client_kwargs": {"endpoint_url": "http://localhost:9000"},
    },
)
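
Since pandas hands these options to fsspec/s3fs, the same settings can be sanity-checked with s3fs directly, a minimal sketch where the credential variables and endpoint are placeholders for your own values:

import s3fs

# Build an S3 filesystem pointed at the MinIO endpoint rather than AWS.
fs = s3fs.S3FileSystem(
    key=AWS_ACCESS_KEY_ID,
    secret=AWS_SECRET_ACCESS_KEY,
    token=AWS_SESSION_TOKEN,
    client_kwargs={"endpoint_url": "http://localhost:9000"},
)

# List the bucket to confirm the endpoint and credentials work.
print(fs.ls("dataset"))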
Kastroprauxel answered 18/2, 2022 at 21:55. Comments (3):
It needs the s3fs library to be installed. – Longways
Yep! It can be installed with pip install s3fs. – Kastroprauxel
Note that endpoint_url must include the scheme; a bare localhost:9000 raises ValueError: Invalid endpoint: localhost:9000/, so use http://localhost:9000. – Felske
