Pandas read_csv from url
I'm trying to read a CSV file from a given URL using Python 3.x:

import pandas as pd
import requests

url = "https://github.com/cs109/2014_data/blob/master/countries.csv"
s = requests.get(url).content
c = pd.read_csv(s)

I get the following error:

"Expected file path name or file-like object, got <class 'bytes'> type"

How can I fix this? I'm using Python 3.4.

Lactiferous answered 4/9, 2015 at 14:44 Comment(5)
You would need something like c = pd.read_csv(io.StringIO(s.decode("utf-8"))), but you are getting HTML back, not a CSV file, so it is not going to work. – Forborne
I'm fairly certain the URL you want is "https://raw.github.com/cs109/2014_data/blob/master/countries.csv". – Imaginary
@venom, chose the more popular answer as the right one. – Reentry
Since the issue was with pandas.read_csv(), not Python, you should have stated the pandas version too; given that Python 3.4 was released in 2014, you were likely running pandas 0.12 .. 0.15. – Albumose
Since pandas 1.2, for basic HTTP authentication: https://mcmap.net/q/118732/-handling-http-authentication-when-accesing-remote-urls-via-pandas – Sajovich

Update: From pandas 0.19.2 you can now just pass read_csv() the URL directly, although that will fail if it requires authentication.


For older pandas versions, or if you need authentication, or for any other HTTP-fault-tolerant reason:

Use pandas.read_csv with a file-like object as the first argument.

  • If you want to read the csv from a string, you can use io.StringIO.

  • For the URL https://github.com/cs109/2014_data/blob/master/countries.csv, you get an HTML response, not raw CSV; use the URL given by the Raw link on the GitHub page to get the raw CSV response, which is https://raw.githubusercontent.com/cs109/2014_data/master/countries.csv

Example:

import pandas as pd
import io
import requests

url = "https://raw.githubusercontent.com/cs109/2014_data/master/countries.csv"
s = requests.get(url).content
c = pd.read_csv(io.StringIO(s.decode('utf-8')))

Note: in Python 2.x, the string-buffer object was StringIO.StringIO.
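The decode-and-wrap step above can be exercised without the network; a minimal sketch where an inline byte string stands in for what requests.get(url).content would return:

```python
import io

import pandas as pd

# inline byte string standing in for requests.get(url).content
s = b"Country,Region\nAlgeria,AFRICA\nAngola,AFRICA\n"

# .content is bytes, so decode to str before wrapping in StringIO
c = pd.read_csv(io.StringIO(s.decode("utf-8")))
print(c.shape)  # (2, 2)
```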

Finecut answered 4/9, 2015 at 14:50 Comment(7)
What if the response is large and I want to stream it instead of consuming memory for the encoded content, decoded content, and the StringIO object? – Ultimo
In the latest version of pandas you can give the URL directly, i.e. c = pd.read_csv(url). – Amplify
Curiously I have a newer version of pandas (0.23.4), but I could not give the URL directly. This answer helped me get that working. – Confiteor
"Update: From pandas 0.19.2 you can now just pass the url directly." Unless you can't, because you need to pass authentication arguments, in which case the original example is much needed. – Waaf
This solution is still valuable if you need better error handling using the HTTP status codes that may be returned by the request object (e.g. 500 -> retry may be needed, 404 -> no retry). – Magnus
This seems to put all columns in one column for this URL: ebi.ac.uk/Tools/services/rest/clustalo/result/… – Prying
This allows you to specify a timeout in requests.get, which one should always set in production code. – Postpaid
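On the streaming question in the comments above: read_csv also accepts a binary file-like object, so with requests you can pass stream=True and hand r.raw straight to pandas instead of materialising the whole .content string. A sketch, with io.BytesIO standing in for r.raw so it runs offline:

```python
import io

import pandas as pd

# With requests this would be:
#   r = requests.get(url, stream=True, timeout=10)
#   r.raw.decode_content = True  # handle gzip/deflate transparently
#   raw = r.raw
raw = io.BytesIO(b"Country,Region\nAlgeria,AFRICA\nAngola,AFRICA\n")  # stand-in for r.raw

# read_csv consumes the file-like object incrementally
c = pd.read_csv(raw)
```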

In the latest version of pandas (0.19.2) you can pass the URL directly:

import pandas as pd

url = "https://raw.githubusercontent.com/cs109/2014_data/master/countries.csv"
c = pd.read_csv(url)
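If the URL does require authentication or custom headers, pandas 1.2 added a storage_options argument whose key-value pairs are forwarded as HTTP headers for http(s) URLs; a sketch (the header shown is illustrative):

```python
import pandas as pd

url = "https://raw.githubusercontent.com/cs109/2014_data/master/countries.csv"

# pandas >= 1.2: for http(s) URLs these key-value pairs are sent as request headers
c = pd.read_csv(url, storage_options={"User-Agent": "pandas-reader"})
```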
Amplify answered 26/1, 2017 at 18:34 Comment(6)
It seems that passing the URL directly, instead of going through requests, does not use requests-cache even if it is enabled. – Sampson
That code returns urllib.error.URLError: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:777)> because of the https protocol, which urllib cannot handle. – Hagio
For those using Python 2, you will have to use Python 2.7.10+. – Chloride
There seems to be some issue reading a CSV from a URL. I read the file once from local storage and once from the URL, and I kept getting errors from the URL. I then enabled error_bad_lines=False and more than 99% of the data was ignored. The URL is link. Once I read the file, the shape of the dataset was found to be (88, 1), which is completely wrong. – Spondee
It doesn't seem to work well; I got an urlopen error: <urlopen error [Errno 11004] getaddrinfo failed>. – Clump
I installed a certificate following #52805615, then pd.read_csv(url) worked for me. – Hekker

As I commented, you need to use a StringIO object and decode, i.e. c = pd.read_csv(io.StringIO(s.decode("utf-8"))), if using requests. You need to decode because .content returns bytes; if you used .text you would just need to pass s as is: s = requests.get(url).text; c = pd.read_csv(StringIO(s)).

A simpler approach is to pass the correct URL of the raw data directly to read_csv. You don't have to pass a file-like object; you can pass a URL, so you don't need requests at all:

c = pd.read_csv("https://raw.githubusercontent.com/cs109/2014_data/master/countries.csv")

print(c)

Output:

                              Country         Region
0                             Algeria         AFRICA
1                              Angola         AFRICA
2                               Benin         AFRICA
3                            Botswana         AFRICA
4                             Burkina         AFRICA
5                             Burundi         AFRICA
6                            Cameroon         AFRICA
..................................

From the docs:

filepath_or_buffer :

string or file handle / StringIO. The string could be a URL. Valid URL schemes include http, ftp, s3, and file. For file URLs, a host is expected. For instance, a local file could be file://localhost/path/to/table.csv
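The file scheme from the docs quote can be tried locally; a sketch that writes a small CSV to a temp directory and reads it back through a file:// URL (the path and contents are made up for illustration):

```python
import pathlib
import tempfile

import pandas as pd

# write a tiny csv, then read it back via a file:// URL
path = pathlib.Path(tempfile.mkdtemp()) / "countries.csv"
path.write_text("Country,Region\nAlgeria,AFRICA\n")

c = pd.read_csv(path.as_uri())  # as_uri() yields "file:///..."
print(c)
```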

Forborne answered 4/9, 2015 at 15:6 Comment(4)
You can feed the URL directly to pandas read_csv! Of course! That's a much simpler solution than the one I found! :D – Erythrocyte
@pabtorre, yep, an example of why reading the docs is a good idea. – Forborne
That works; in my case though, I needed to set the sep parameter of pd.read_csv, e.g. pd.read_csv(StringIO(s), sep='\t'). If I use the default setting sep=None, it raises the error Error tokenizing data. C error: Expected 1 fields in line 6, saw 5. – Clump
Why do I still get just one column for this URL? ebi.ac.uk/Tools/services/rest/clustalo/result/… – Prying

The problem you're having is that the output you get into the variable s is not a CSV, but an HTML file. In order to get the raw CSV, you have to modify the URL to:

'https://raw.githubusercontent.com/cs109/2014_data/master/countries.csv'

Your second problem is that read_csv expects a file name; we can solve this by using StringIO from the io module. The third problem is that requests.get(url).content delivers a byte stream; we can solve this by using requests.get(url).text instead.

End result is this code:

from io import StringIO

import pandas as pd
import requests
url = 'https://raw.githubusercontent.com/cs109/2014_data/master/countries.csv'
s = requests.get(url).text

c = pd.read_csv(StringIO(s))

Output:

>>> c.head()
    Country  Region
0   Algeria  AFRICA
1    Angola  AFRICA
2     Benin  AFRICA
3  Botswana  AFRICA
4   Burkina  AFRICA
Erythrocyte answered 4/9, 2015 at 15:18 Comment(0)
url = "https://github.com/cs109/2014_data/blob/master/countries.csv"
c = pd.read_csv(url, sep = "\t")
Wheatley answered 21/1, 2020 at 8:35 Comment(2)
Please provide an explanation of how your solution works. – Ash
This may raise a URL error: urlopen error [Errno 11004] getaddrinfo failed. – Clump

To import data through a URL in pandas, just apply the simple code below; it actually works well.

import pandas as pd
train = pd.read_table("https://urlandfile.com/dataset.csv")
train.head()

If you are having issues with the string itself, you can put r before the URL to make it a raw string literal (this only matters if the string contains backslashes):

import pandas as pd
train = pd.read_table(r"https://urlandfile.com/dataset.csv")
train.head()
Lennalennard answered 25/11, 2019 at 3:25 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.