I am given the an excel file which contains some text formatting. Some can be bold, some italic, some are supercase1, and some other formats (but not as many as the three mentioned).
Examples:
- Ku'lah 2ku.lah v; definition: some def; usage: some usage;
- He'lahsa 2he.lah.sa n; definition: some def; usage: some usage;
- And so on
Now, since this cell is to be made as dictionary (real, human, dictionary) database entry, I would like to retain the format of the cell, as it will be beneficial to tell the usage of the word (such as bold in the above case indicating the word type: v (verb) and italic indicating new section).
But it is all in the excel cell.
When I try to simply read the excel file directly using database tool like Toad for Oracle, the format is gone!
- Is there any way to read the excel file and yet retain the format?
- Alternatively, is there any way to detect the formatting? As long as we can detect the format, I can simply replace the text with some HTML format like
<b>v</b>
and that will be my work. I only want to know how we retain or detect the excel cell text format in Python. (in particular are these three formats: bold, italic, and supercase)
Edit:
I try to get the text format with xlrd package, but I can't seem to find a way to get the text format style as the cell
object only consists of: ctype
, value
, and xf_index
. It has no info about the text format, and when I create the instance with the formatting_info=True
:
book = xlrd.open_workbook("HuluHalaDict.xlsx", sys.stdout, 0, xlrd.USE_MMAP, None, None, \
formatting_info=True, on_demand=False, ragged_rows=False)
I got the following error:
NotImplementedError: formatting_info=True not yet implemented
Raised by this line in the xlsx.py
file of the xlrd
package:
if formatting_info:
raise NotImplementedError("formatting_info=True not yet implemented")
Which I found it strange, since I use version 0.9.4 xlrd (latest) and the documentation says that since version above 0.6.1, the formatting info is included:
Default Formatting
Default formatting is applied to all empty cells (those not described by a cell record). Firstly row default information (ROW record, Rowinfo class) is used if available. Failing that, column default information (COLINFO record, Colinfo class) is used if available. As a last resort the worksheet/workbook default cell format will be used; this should always be present in an Excel file, described by the XF record with the fixed index 15 (0-based). By default, it uses the worksheet/workbook default cell style, described by the very first XF record (index 0). Formatting features not included in xlrd version 0.6.1
Rich text i.e. strings containing partial bold italic and underlined text, change of font inside a string, etc. See OOo docs s3.4 and s3.2 Asian phonetic text (known as "ruby"), used for Japanese furigana. See OOo docs s3.4.2 (p15) Conditional formatting. See OOo docs s5.12, s6.21 (CONDFMT record), s6.16 (CF record) Miscellaneous sheet-level and book-level items e.g. printing layout, screen panes. Modern Excel file versions don't keep most of the built-in "number formats" in the file; Excel loads formats according to the user's locale. Currently xlrd's emulation of this is limited to a hard-wired table that applies to the US English locale. This may mean that currency symbols, date order, thousands separator, decimals separator, etc are inappropriate. Note that this does not affect users who are copying XLS files, only those who are visually rendering cells.
Did I make any mistake here? My code is simply as shown:
book = xlrd.open_workbook("HuluHalaDict.xlsx", sys.stdout, 0, xlrd.USE_MMAP, None, None, \
formatting_info=True, on_demand=False, ragged_rows=False)
Edit 2:
The example shown in the post shows that it creates the class instance (book
) with formatting_info=True
. But I check it in my implementation. It raises the error above. Any idea?
formatting
. My concern is therefore is only limited to excel cell with bold, italic, and superscript. Since the source code is opened in the case ofxlrd
, then I just want to know if there is any mistake I made (say, at installation/downloading) such that although theformatting_info=True
is acceptable in the other posts, I can't do it for my case - though my xlrd version is 0.9.4. If thexlsx
is supported from version 0.8.0, then I expect my code to work - but it didn't. Thanks for your valuable inputs. – Schuyler