How to read excel cell and retain or detect its format in Python
Asked Answered
S

1

10

I am given the an excel file which contains some text formatting. Some can be bold, some italic, some are supercase1, and some other formats (but not as many as the three mentioned).

Examples:

  • Ku'lah 2ku.lah v; definition: some def; usage: some usage;
  • He'lahsa 2he.lah.sa n; definition: some def; usage: some usage;
  • And so on

Now, since this cell is to be made as dictionary (real, human, dictionary) database entry, I would like to retain the format of the cell, as it will be beneficial to tell the usage of the word (such as bold in the above case indicating the word type: v (verb) and italic indicating new section).

But it is all in the excel cell.

When I try to simply read the excel file directly using database tool like Toad for Oracle, the format is gone!

  1. Is there any way to read the excel file and yet retain the format?
  2. Alternatively, is there any way to detect the formatting? As long as we can detect the format, I can simply replace the text with some HTML format like <b>v</b> and that will be my work. I only want to know how we retain or detect the excel cell text format in Python. (in particular are these three formats: bold, italic, and supercase)

Edit:

I try to get the text format with xlrd package, but I can't seem to find a way to get the text format style as the cell object only consists of: ctype, value, and xf_index. It has no info about the text format, and when I create the instance with the formatting_info=True:

book = xlrd.open_workbook("HuluHalaDict.xlsx", sys.stdout, 0, xlrd.USE_MMAP, None, None, \
                          formatting_info=True, on_demand=False, ragged_rows=False)

I got the following error:

NotImplementedError: formatting_info=True not yet implemented

Raised by this line in the xlsx.py file of the xlrd package:

if formatting_info:
    raise NotImplementedError("formatting_info=True not yet implemented")

Which I found it strange, since I use version 0.9.4 xlrd (latest) and the documentation says that since version above 0.6.1, the formatting info is included:

Default Formatting

Default formatting is applied to all empty cells (those not described by a cell record). Firstly row default information (ROW record, Rowinfo class) is used if available. Failing that, column default information (COLINFO record, Colinfo class) is used if available. As a last resort the worksheet/workbook default cell format will be used; this should always be present in an Excel file, described by the XF record with the fixed index 15 (0-based). By default, it uses the worksheet/workbook default cell style, described by the very first XF record (index 0). Formatting features not included in xlrd version 0.6.1

Rich text i.e. strings containing partial bold italic and underlined text, change of font inside a string, etc. See OOo docs s3.4 and s3.2 Asian phonetic text (known as "ruby"), used for Japanese furigana. See OOo docs s3.4.2 (p15) Conditional formatting. See OOo docs s5.12, s6.21 (CONDFMT record), s6.16 (CF record) Miscellaneous sheet-level and book-level items e.g. printing layout, screen panes. Modern Excel file versions don't keep most of the built-in "number formats" in the file; Excel loads formats according to the user's locale. Currently xlrd's emulation of this is limited to a hard-wired table that applies to the US English locale. This may mean that currency symbols, date order, thousands separator, decimals separator, etc are inappropriate. Note that this does not affect users who are copying XLS files, only those who are visually rendering cells.

Did I make any mistake here? My code is simply as shown:

book = xlrd.open_workbook("HuluHalaDict.xlsx", sys.stdout, 0, xlrd.USE_MMAP, None, None, \
                          formatting_info=True, on_demand=False, ragged_rows=False)

Edit 2:

The example shown in the post shows that it creates the class instance (book) with formatting_info=True. But I check it in my implementation. It raises the error above. Any idea?

Schuyler answered 20/4, 2016 at 11:23 Comment(4)
Regarding xlrd: There is no greater authority than the source code. If it's not implemented in the source code, then it's not implemented, period. Writing documentation for a time machine doesn't make the time machine work. What you need to understand is that .xls files are very different from .xlsx, and xlrd started life as a reader for .xls. Indeed, .xlsx didn't even exist yet when xlrd was created. At the time of xlrd 0.6.1 (June 2007), .xlsx was still quite new and barely in use. So when it says it supports reading formatting as of 0.6.1, it means it supports reading formatting in .xls.Confidante
xlrd only officially supported reading .xlsx since 0.8.0 (August 2012). Some of xlrd's features still only work for .xls.Confidante
Another thing to understand is that "formatting" covers a lot of different things. For example, being able to read the numeric formatting (such as whether something has thousands separators or a fixed number of decimal places) is different than being able to tell whether the text is centered. The kind of formatting you're talking about, with multiple styles in a single cell, is called "rich" formatting, and it's among the most complicated kinds of formatting to deal with. Even if some package says it can read the formatting, it might not be able to read rich formatting.Confidante
@JohnY ah yes, there are plenty of things on formatting. My concern is therefore is only limited to excel cell with bold, italic, and superscript. Since the source code is opened in the case of xlrd, then I just want to know if there is any mistake I made (say, at installation/downloading) such that although the formatting_info=True is acceptable in the other posts, I can't do it for my case - though my xlrd version is 0.9.4. If the xlsx is supported from version 0.8.0, then I expect my code to work - but it didn't. Thanks for your valuable inputs.Schuyler
E
7

I suggest you the library xlrd https://secure.simplistix.co.uk/svn/xlrd/trunk/xlrd/doc/xlrd.html?p=4966

On GitHub here https://github.com/python-excel/xlrd

You can find an easy example on how to use xlrd to determine the font style here Using XLRD module and Python to determine cell font style (italics or not)

Here a practical example:

from xlrd import open_workbook

path = '/Users/.../Desktop/Workbook1.xls'
wb = open_workbook(path, formatting_info=True)
sheet = wb.sheet_by_name("Sheet1")
cell = sheet.cell(0, 0) # The first cell
print("cell.xf_index is", cell.xf_index)
fmt = wb.xf_list[cell.xf_index]
print("type(fmt) is", type(fmt))
print("Dumped Info:")
fmt.dump()

It outputs the following:

cell.xf_index is 62
type(fmt) is <class 'xlrd.formatting.XF'>
Dumped Info:
_alignment_flag: 0
_background_flag: 0
_border_flag: 0
_font_flag: 1
_format_flag: 0
_protection_flag: 0
alignment (XFAlignment object):
    hor_align: 0
    indent_level: 0
    rotation: 0
    shrink_to_fit: 0
    text_direction: 0
    text_wrapped: 0
    vert_align: 2
background (XFBackground object):
    background_colour_index: 65
    fill_pattern: 0
    pattern_colour_index: 64
border (XFBorder object):
    bottom_colour_index: 0
    bottom_line_style: 0
    diag_colour_index: 0
    diag_down: 0
    diag_line_style: 0
    diag_up: 0
    left_colour_index: 0
    left_line_style: 0
    right_colour_index: 0
    right_line_style: 0
    top_colour_index: 0
    top_line_style: 0
font_index: 6
format_key: 0
is_style: 0
lotus_123_prefix: 0
parent_style_index: 0
protection (XFProtection object):
    cell_locked: 1
    formula_hidden: 0
xf_index: 62

Where _font_flag: 1 indicates that is Bold

Evanston answered 20/4, 2016 at 11:51 Comment(7)
OK, give me sometime, I will try to play around with it for a while! ;)Schuyler
The package does not have any way to get the cell text format. Any idea?Schuyler
It does, by using the Font Class: secure.simplistix.co.uk/svn/xlrd/trunk/xlrd/doc/…Evanston
Do you have any example how you use that? I tried it and failed - but probably I tried it the wrong waySchuyler
I don't have it right now but you can find useful code here simplistix.co.uk/presentations/python-excel.pdfEvanston
Thanks for the update. As I mention in the question, The formatting_info=True raises exception for me, though I download the package (version 0.9.4) using PyCharmSchuyler
In the end, I choose to solve my problem using C# Microsoft.Office.Interop.Excel. But I still mark your answer as accepted for I think it may be useful for future visitors - besides showing you my gratitude. Nevertheless, as far as my actual problem is concerned, it could be due to wrong installation version (unitentionally) that I got the formatting_info=True not working. Thanks for the help.Schuyler

© 2022 - 2024 — McMap. All rights reserved.