Pandas read_csv from url
I'm trying to read a CSV file from a given URL using Python 3.x:

import pandas as pd
import requests

url = "https://github.com/cs109/2014_data/blob/master/countries.csv"
s = requests.get(url).content
c = pd.read_csv(s)

I get the following error:

"Expected file path name or file-like object, got <class 'bytes'> type"

How can I fix this? I'm using Python 3.4.

Lactiferous answered 4/9, 2015 at 14:44 Comment(5)
You would need something like c = pd.read_csv(io.StringIO(s.decode("utf-8"))), but you are getting HTML back, not a CSV file, so it is not going to work. – Forborne
I'm fairly certain the URL you want is "https://raw.github.com/cs109/2014_data/blob/master/countries.csv". – Imaginary
@venom, chose the more popular answer as the right one. – Reentry
Since the issue was with pandas.read_csv(), not Python, you should have stated the pandas version too; given that Python 3.4 was released in 2014, you were likely running pandas 0.12 .. 0.15. – Albumose
Since pandas 1.2, for basic HTTP authentication: https://mcmap.net/q/118732/-handling-http-authentication-when-accesing-remote-urls-via-pandas – Sajovich

Update: From pandas 0.19.2 you can now just pass read_csv() the URL directly, although that will fail if it requires authentication.


For older pandas versions, or if you need authentication, or for any other HTTP-fault-tolerant reason:

Use pandas.read_csv with a file-like object as the first argument.

  • If you want to read the csv from a string, you can use io.StringIO.

  • For the URL https://github.com/cs109/2014_data/blob/master/countries.csv, you get an HTML response, not raw CSV; use the URL given by the Raw link on the GitHub page to get the raw CSV response, which is https://raw.githubusercontent.com/cs109/2014_data/master/countries.csv

Example:

import pandas as pd
import io
import requests

url = "https://raw.githubusercontent.com/cs109/2014_data/master/countries.csv"
s = requests.get(url).content
c = pd.read_csv(io.StringIO(s.decode('utf-8')))

Note: in Python 2.x, the string-buffer object was StringIO.StringIO.
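The decode-and-wrap step above can be exercised without the network; a minimal sketch where an inline byte string stands in for what requests.get(url).content would return:

```python
import io

import pandas as pd

# inline byte string standing in for requests.get(url).content
s = b"Country,Region\nAlgeria,AFRICA\nAngola,AFRICA\n"

# .content is bytes, so decode to str before wrapping in StringIO
c = pd.read_csv(io.StringIO(s.decode("utf-8")))
print(c.shape)  # (2, 2)
```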

Finecut answered 4/9, 2015 at 14:50 Comment(7)
What if the response is large and I want to stream it instead of consuming memory for the encoded content, decoded content, and the StringIO object? – Ultimo
In the latest version of pandas you can give the URL directly, i.e. c = pd.read_csv(url). – Amplify
Curiously I have a newer version of pandas (0.23.4), but I could not give the URL directly. This answer helped me get that working. – Confiteor
"Update: From pandas 0.19.2 you can now just pass the url directly." Unless you can't, because you need to pass authentication arguments, in which case the original example is much needed. – Waaf
This solution is still valuable if you need better error handling using the HTTP status codes that may be returned by the request object (e.g. 500 -> retry may be needed, 404 -> no retry). – Magnus
This seems to put all columns in one column for this URL: ebi.ac.uk/Tools/services/rest/clustalo/result/… – Prying
This allows you to specify a timeout in requests.get, which one should always set in production code. – Postpaid
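On the streaming question in the comments above: read_csv also accepts a binary file-like object, so with requests you can pass stream=True and hand r.raw straight to pandas instead of materialising the whole .content string. A sketch, with io.BytesIO standing in for r.raw so it runs offline:

```python
import io

import pandas as pd

# With requests this would be:
#   r = requests.get(url, stream=True, timeout=10)
#   r.raw.decode_content = True  # handle gzip/deflate transparently
#   raw = r.raw
raw = io.BytesIO(b"Country,Region\nAlgeria,AFRICA\nAngola,AFRICA\n")  # stand-in for r.raw

# read_csv consumes the file-like object incrementally
c = pd.read_csv(raw)
```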

In the latest version of pandas (0.19.2) you can pass the URL directly:

import pandas as pd

url = "https://raw.githubusercontent.com/cs109/2014_data/master/countries.csv"
c = pd.read_csv(url)
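If the URL does require authentication or custom headers, pandas 1.2 added a storage_options argument whose key-value pairs are forwarded as HTTP headers for http(s) URLs; a sketch (the header shown is illustrative):

```python
import pandas as pd

url = "https://raw.githubusercontent.com/cs109/2014_data/master/countries.csv"

# pandas >= 1.2: for http(s) URLs these key-value pairs are sent as request headers
c = pd.read_csv(url, storage_options={"User-Agent": "pandas-reader"})
```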
Amplify answered 26/1, 2017 at 18:34 Comment(6)
It seems that passing the URL directly, instead of going through requests, does not use requests-cache even if it is enabled. – Sampson
That code returns urllib.error.URLError: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:777)> because of the https protocol, which urllib cannot handle. – Hagio
For those using Python 2, you will have to use Python 2.7.10+. – Chloride
There seems to be some issue reading a CSV from a URL. I read the file once from local storage and once from the URL, and I kept getting errors from the URL. I then enabled error_bad_lines=False and more than 99% of the data was ignored. The URL is link. Once I read the file, the shape of the dataset was found to be (88, 1), which is completely wrong. – Spondee
It doesn't seem to work well; I got an urlopen error: <urlopen error [Errno 11004] getaddrinfo failed>. – Clump
I installed a certificate following #52805615, then pd.read_csv(url) worked for me. – Hekker

As I commented, you need to use a StringIO object and decode, i.e. c = pd.read_csv(io.StringIO(s.decode("utf-8"))), if using requests. You need to decode because .content returns bytes; if you used .text you would just need to pass s as is: s = requests.get(url).text; c = pd.read_csv(StringIO(s)).

A simpler approach is to pass the correct URL of the raw data directly to read_csv. You don't have to pass a file-like object; you can pass a URL, so you don't need requests at all:

c = pd.read_csv("https://raw.githubusercontent.com/cs109/2014_data/master/countries.csv")

print(c)

Output:

                              Country         Region
0                             Algeria         AFRICA
1                              Angola         AFRICA
2                               Benin         AFRICA
3                            Botswana         AFRICA
4                             Burkina         AFRICA
5                             Burundi         AFRICA
6                            Cameroon         AFRICA
..................................

From the docs:

filepath_or_buffer :

string or file handle / StringIO. The string could be a URL. Valid URL schemes include http, ftp, s3, and file. For file URLs, a host is expected. For instance, a local file could be file://localhost/path/to/table.csv
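The file scheme from the docs quote can be tried locally; a sketch that writes a small CSV to a temp directory and reads it back through a file:// URL (the path and contents are made up for illustration):

```python
import pathlib
import tempfile

import pandas as pd

# write a tiny csv, then read it back via a file:// URL
path = pathlib.Path(tempfile.mkdtemp()) / "countries.csv"
path.write_text("Country,Region\nAlgeria,AFRICA\n")

c = pd.read_csv(path.as_uri())  # as_uri() yields "file:///..."
print(c)
```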

Forborne answered 4/9, 2015 at 15:6 Comment(4)
You can feed the URL directly to pandas read_csv! Of course! That's a much simpler solution than the one I found! :D – Erythrocyte
@pabtorre, yep, an example of why reading the docs is a good idea. – Forborne
That works; in my case though, I needed to set the sep parameter of pd.read_csv, e.g. pd.read_csv(StringIO(s), sep='\t'). If I use the default setting sep=None, it raises the error Error tokenizing data. C error: Expected 1 fields in line 6, saw 5. – Clump
Why do I still get just one column for this URL? ebi.ac.uk/Tools/services/rest/clustalo/result/… – Prying

The problem you're having is that the output you get into the variable s is not a CSV, but an HTML file. In order to get the raw CSV, you have to modify the URL to:

'https://raw.githubusercontent.com/cs109/2014_data/master/countries.csv'

Your second problem is that read_csv expects a file name; we can solve this by using StringIO from the io module. The third problem is that requests.get(url).content delivers a byte stream; we can solve this by using requests.get(url).text instead.

End result is this code:

from io import StringIO

import pandas as pd
import requests
url = 'https://raw.githubusercontent.com/cs109/2014_data/master/countries.csv'
s = requests.get(url).text

c = pd.read_csv(StringIO(s))

Output:

>>> c.head()
    Country  Region
0   Algeria  AFRICA
1    Angola  AFRICA
2     Benin  AFRICA
3  Botswana  AFRICA
4   Burkina  AFRICA
Erythrocyte answered 4/9, 2015 at 15:18 Comment(0)
url = "https://github.com/cs109/2014_data/blob/master/countries.csv"
c = pd.read_csv(url, sep = "\t")
Wheatley answered 21/1, 2020 at 8:35 Comment(2)
Please provide an explanation of how your solution works. – Ash
This may raise a URL error: urlopen error [Errno 11004] getaddrinfo failed. – Clump

To import data through a URL in pandas, just apply the simple code below; it actually works well.

import pandas as pd
train = pd.read_table("https://urlandfile.com/dataset.csv")
train.head()

If you are having issues with the string itself, you can put r before the URL to make it a raw string literal (this only matters if the string contains backslashes):

import pandas as pd
train = pd.read_table(r"https://urlandfile.com/dataset.csv")
train.head()
Lennalennard answered 25/11, 2019 at 3:25 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.