Python Pandas - read date column as string
Asked Answered
L

4

6

I have some data in an excel file and I read it using pandas read_excel method. However I want to read the entire data in all columns as strings including the date column.

The problem is that I want to leave the date column in its original format as string. For example, I have '31.01.2017' in the excel and it is formatted as date and I want to have '31.01.2017' in my data frame.

I thought using dytpes parameter of read_excel with dtype=str was the correct approach. But pandas then reads the date column as datetime and then converts it to string. So at the end I always have '2017-01-31 00:00:00' in my data frame.

Is there any way to do this?

Longevous answered 11/10, 2017 at 16:19 Comment(0)
J
3

The behavior of pandas makes sense:

  • If the excel-format of your date column is text, pandas will read the dates as strings by default.
  • If the excel-format of your date column is date, pandas will read the dates as dates.

However, you point out that in the Excelfile the date column is formatted as a date. If that's the case, there is no string in your Excelfile to begin with. The underlying data of the date column is stored as a float. The string you are seeing is not the actual data. You can't read something as a raw string if it isn't a string.

Some more info: https://xlrd.readthedocs.io/en/latest/formatting.html

But let's say, for some reason, you want Python to display the same format as Excel, but in string form, without looking in the Excel.

First you'd have to find the format:

from openpyxl import load_workbook
wb = load_workbook('data.xlsx')
ws = wb.worksheets[0]
print(ws.cell(1,5).number_format)  # look at the cell you are interested in

> '[$]dd/mm/yyyy;@'

and then convert is to something the strftime function understands. https://www.programiz.com/python-programming/datetime/strftime#format-code

form = form[3:-2]
form = form.replace('dd','%d')
form = form.replace('mm','%m')
form = form.replace('yyyy','%Y')
print(form)
> '%d/%m/%Y'

And apply it

df.loc[:,"date_field"].apply(lambda x: x.strftime(form))

> 0     01/02/2018
1     02/02/2018
2     03/02/2018
3     04/02/2018
4     05/02/2018

However, if you're working with multiple Excel date-formats you'd have to make a strf-time mapping for each of them.

Probably there will be more practical ways of doing this, like receiving the data in csv format or just keeping the dates in excel's text format in the first place.

Jacks answered 23/4, 2019 at 14:51 Comment(0)
U
1

As you are trying to keep the date column in the initial type, the following code may help you. In the first row we insert to the variable "cols" all the columns except the date column, and then in the following two lines we just change the type of the rest columns:

cols=[i for i in df.columns if i not in ["Date_column"]]

for col in cols:
    df[col]=df[col].astype('category')

Hope it helps! :-)

Unshaped answered 12/10, 2017 at 8:51 Comment(2)
But you are iterating over the dataframe after(!) reading the excel file. And the problem occures during the reading of the file. So your dataframe has the dates stored in it and they have already lost the initial formatLongevous
And what is actually the point of changing the type of all other columns to categorical?Longevous
L
0
df['date_column'] = df['date_column'].dt.strftime('%d.%m.%Y')
Labdanum answered 11/10, 2017 at 16:27 Comment(6)
The problem is that after reading the data I don't know what the original date format wasLongevous
I don't know what the hesitation would be in not just looking at the date format by opening the file and then using strftime, seems inefficient. You've got a couple of options at the point. If the parameter converters={'Date': str} doesn't work, then convert to csv before reading in excel.Labdanum
But how can I look up what the format is? Do you mean opening the file with some other package or do I miss something...Longevous
Open it with excel?Labdanum
Ah ok. Unfortunately, I want to do the whole thing automaticallyLongevous
@Labdanum has the only correct generalisable solution on this whole page since date ranges can appear in a date column. Every other solution cannot possibly work because you can have items such as Jan/Feb 1665 that will not render correctly for Excel or a google sheet.Harmon
R
0

The docs of pd.read_excel say under the parameter parse_dates:

If you don't want to parse some cells as date just change their type in Excel to "Text".

This means one could preprocess the cells in Excel to the "Text" number format before using pd.read_excel. However, switching to "Text" number format alone changes the dates to numbers in Excel, e.g., 1900-12-31 becomes 366 (the cryptic number is the number of days since 1900-01-01 (it is 366 here because Excel falsely presumes that 1900 was a leap year)). This means that the instruction of just changing the number format to "Text" isn’t useful.

Instead, a practical solution is to iterate over the date columns and set the Excel VBA .Text as value and change the number format to "Text" to return the string how it displays, which can be done with xlwings:

from pathlib import Path
import xlwings as xw
import pandas as pd

path = Path(r"foo/test.xlsx")

with xw.App(visible=False):
    wb = xw.Book(path)
    ws = wb.sheets[0]
    for cell in ws.used_range:
        temp_str = cell.api.Text
        cell.number_format = "@"  # @ is the number format code of Excel for the "Text" number format.
        cell.value = temp_str
    wb.save(path.with_stem(f"{path.stem}_interim"))
    # Then the Excel file can be read with xlwings.
    df = ws.used_range.options(pd.DataFrame, index=False).value
    wb.close()

# Or it can be read with pandas.
df = pd.read_excel(path.with_stem(f"{path.stem}_interim"))

print(df)

Note the prerequisites for the installation of xlwings.
Unfortunately the .Text property can only be used on a range of one cell, which is the reason why a loop over the range of cells in question is required. Note that I used used_range in this example, which means that all cells with contents on the worksheet are formatted to the "Text" number format.

Rondo answered 13/9, 2024 at 20:28 Comment(0)

© 2022 - 2025 — McMap. All rights reserved.