pandas.read_html not support decimal comma

D

4

15

I was reading an xlm file using pandas.read_html and works almost perfect, the problem is that the file has commas as decimal separators instead of dots (the default in read_html).

I could easily replace the commas by dots in one file, but i have almost 200 files with that configuration. with pandas.read_csv you can define the decimal separator, but i don't know why in pandas.read_html you can only define the thousand separator.

any guidance in this matter?, there is another way to automate the comma/dot replacement before it is open by pandas? thanks in advance!

Dripping answered 9/9, 2016 at 13:35 Comment(2)

So your 1000.21 in your file is 1000,21? – Reddish 9/9, 2016 at 13:40

@zhqiat actually its 1.000,21, it also has the thousand separator – Dripping 9/9, 2016 at 13:41

D

2

Thanks @zhqiat. I think upgrading pandas to version 0.19 will solve the problem. unfortunately I couldn't found an easy way to accomplish that. I found a tutorial to upgrade Pandas but for ubuntu (winXP user).

I finally chose the workaround, using the method posted here, basically converting all columns, one by one, to a numeric type of pandas.Series

result[col] = result[col].apply(lambda x: x.str.replace(".","").str.replace(",","."))

I know that this solution ain't the best, but works. Thanks

Dripping answered 9/9, 2016 at 15:7 Comment(0)

B

27

This did not start working for me until I used both decimal=',' and thousands='.'

Pandas version: 0.23.4

So try to use both decimal and thousands: i.e.: pd.read_html(io="http://example.com", decimal=',', thousands='.')

Before I would only use decimal=',' and the number columns would be saved as type str with the numbers just omitting the comma.(weird behaviour) For example 0,7 would be "07" and "1,9" would be "19"

It is still being saved in the dataframe as type str but at least I don't have to manually put in the dots. The numbers are correctly displayed; 0,7 -> "0.7"

Bondwoman answered 5/10, 2018 at 22:23 Comment(1)

This helped me for pandas version 1.1.2 too: when I added thousands='.' the decimal parameter worked – Dybbuk 23/10, 2020 at 22:43

R

4

Looking at the source code of read_html

def read_html(io, match='.+', flavor=None, header=None, index_col=None,
              skiprows=None, attrs=None, parse_dates=False,
              tupleize_cols=False, thousands=',', encoding=None,
              decimal='.', converters=None, na_values=None,
              keep_default_na=True):

The function header implies that there is a decimal separator available in the function call.

Further down in the documentation this looks like it was added in version 0.19 (so a bit further down the experimental branch). Can you upgrade your pandas?

decimal : str, default '.' Character to recognize as decimal point (e.g. use ',' for European data). .. versionadded:: 0.19.0

Reddish answered 9/9, 2016 at 13:51 Comment(1)

right now Im using the last stable version (default in anaconda). i will try to upgrade to 0.19 and see if that works. pd.__version__ u'0.18.1' as soon as I try, I will be back – Dripping 9/9, 2016 at 14:6

D

2

Thanks @zhqiat. I think upgrading pandas to version 0.19 will solve the problem. unfortunately I couldn't found an easy way to accomplish that. I found a tutorial to upgrade Pandas but for ubuntu (winXP user).

I finally chose the workaround, using the method posted here, basically converting all columns, one by one, to a numeric type of pandas.Series

result[col] = result[col].apply(lambda x: x.str.replace(".","").str.replace(",","."))

I know that this solution ain't the best, but works. Thanks

Dripping answered 9/9, 2016 at 15:7 Comment(0)

G

1

I am using pandas 0.19 but it still fails to correctly convert the numbers.

For example:

a=pd.read_html(r.text,thousands='.',decimal=',')

will recognize the value "1.401,40" in a table cell as 140140 (float).

I use a similar solution as 'Pablo A', just correcting for nan values:

def to_numeric_comma(series):
    new=series.apply(lambda x: str(x).replace('.','').replace(',','.'))
    new=pd.to_numeric(new.replace('nan',pd.np.nan))
    return new

Gca answered 9/1, 2017 at 11:40 Comment(0)

Recommended topics

Hot tags