Iterate through columns in Read-only workbook in openpyxl
Asked Answered
H

3

15

I have a somewhat large .xlsx file - 19 columns, 5185 rows. I want to open the file, read all the values in one column, do some stuff to those values, and then create a new column in the same workbook and write out the modified values. Thus, I need to be able to both read and write in the same file.

My original code did this:

def readExcel(doc):
    wb = load_workbook(generalpath + exppath + doc)
    ws = wb["Sheet1"]

    # iterate through the columns to find the correct one
    for col in ws.iter_cols(min_row=1, max_row=1):
        for mycell in col:
            if mycell.value == "PerceivedSound.RESP":
                origCol = mycell.column

    # get the column letter for the first empty column to output the new values
    newCol = utils.get_column_letter(ws.max_column+1)

    # iterate through the rows to get the value from the original column,
    # do something to that value, and output it in the new column
    for myrow in range(2, ws.max_row+1):
        myrow = str(myrow)
        # do some stuff to make the new value
        cleanedResp = doStuff(ws[origCol + myrow].value)
        ws[newCol + myrow] = cleanedResp

    wb.save(doc)

However, python threw a memory error after row 3853 because the workbook was too big. The openpyxl docs said to use Read-only mode (https://openpyxl.readthedocs.io/en/latest/optimized.html) to handle big workbooks. I'm now trying to use that; however, there seems to be no way to iterate through the columns when I add the read_only = True param:

def readExcel(doc):
    wb = load_workbook(generalpath + exppath + doc, read_only=True)
    ws = wb["Sheet1"]

    for col in ws.iter_cols(min_row=1, max_row=1):
        #etc.

python throws this error: AttributeError: 'ReadOnlyWorksheet' object has no attribute 'iter_cols'

If I change the final line in the above snippet to:

for col in ws.columns:

python throws the same error: AttributeError: 'ReadOnlyWorksheet' object has no attribute 'columns'

Iterating over rows is fine (and is included in the documentation I linked above):

for col in ws.rows:

(no error)

This question asks about the AttritubeError but the solution is to remove Read-only mode, which doesn't work for me because openpyxl won't read my entire workbook in not Read-only mode.

So: how do I iterate through columns in a large workbook?

And I haven't yet encountered this, but I will once I can iterate through the columns: how do I both read and write the same workbook, if said workbook is large?

Thanks!

Haematin answered 30/11, 2017 at 20:43 Comment(0)
X
3

According to the documentation, ReadOnly mode only supports row-based reads (column reads are not implemented). But that's not hard to solve:

wb2 = Workbook(write_only=True)
ws2 = wb2.create_sheet()

# find what column I need
colcounter = 0
for row in ws.rows:
    for cell in row:
        if cell.value == "PerceivedSound.RESP":
            break
        colcounter += 1
    
    # cells are apparently linked to the parent workbook meta
    # this will retain only values; you'll need custom
    # row constructor if you want to retain more

    row2 = [cell.value for cell in row]
    ws2.append(row2) # preserve the first row in the new file
    break # stop after first row

for row in ws.rows:
    row2 = [cell.value for cell in row]
    row2.append(doStuff(row2[colcounter]))
    ws2.append(row2) # write a new row to the new wb
    
wb2.save('newfile.xlsx')
wb.close()
wb2.close()

# copy `newfile.xlsx` to `generalpath + exppath + doc`
# Either using os.system,subprocess.popen, or shutil.copy2()

You will not be able to write to the same workbook, but as shown above you can open a new workbook (in writeonly mode), write to it, and overwrite the old file using OS copy.

Xylidine answered 30/11, 2017 at 23:27 Comment(5)
This won't work as you cannot pass cell objects between different workbooks.Camporee
can you serialize the row via something like: row = [cell.value for cell in row] ?Xylidine
Sure, but don't forget this will lose formatting, comments, etc.Camporee
it's pretty clear the OP wants to do a value-based transformation on the cell's valueXylidine
This seems like a reasonable solution to the problem as stated in my original question, given that column-based reads aren't implemented. Only caveat that I haven't tried the code myself, since it turns out it doesn't address what my problem actually was.Haematin
C
9

If the worksheet has only around 100,000 cells then you shouldn't have any memory problems. You should probably investigate this further.

iter_cols() is not available in read-only mode because it requires constant and very inefficient reparsing of the underlying XML file. It is however, relatively easy to convert rows into columns from iter_rows() using zip.

def _iter_cols(self, min_col=None, max_col=None, min_row=None,
               max_row=None, values_only=False):
    yield from zip(*self.iter_rows(
        min_row=min_row, max_row=max_row,
        min_col=min_col, max_col=max_col, values_only=values_only))

import types
for sheet in workbook:
    sheet.iter_cols = types.MethodType(_iter_cols, sheet)
Camporee answered 1/12, 2017 at 8:22 Comment(6)
I agree, this didn't seem like a huge workbook to me, so I was surprised when it initially threw a memory error. But nevertheless it did. How do you suggest investigating further?Haematin
You need to look at how much memory the Python process is using and what else is in the workbook. Without the file I can't say much more.Camporee
I've posted the entirety of my code above; as you can see, there's not a lot going on there in terms of memory-intensiveness. As for "what else is in the workbook", it's just a workbook with text (and integers) in cells. There are no formulas, no references, no gigantic amounts of text in any cell, and nothing is calculated within the worksheet. I can't share the file because it's participant data but I'm happy to look further if you can point me in the direction of what to look for!Haematin
Ah hah! Your suspicions were well-founded that it was strange that a sheet of this (smallish) size generated a memory error. Turns out there was something funny going on in that one particular cell, so it wasn't actually a memory error in terms of running out of space but rather getting messed up on parsing the contents of that cell. (Not sure why 'Memory Error' was thrown instead of, say, 'Attribute Error' or something else, but that's a different question.)Haematin
@Haematin do you remember what the issue was and how you fixed it? I'm having the exact same issue and don't know how to fix it.Aubine
@Aubine There was a strange character in the text in the cell - I think it was a bendy quote mark or something, which caused Excel to crash.Haematin
X
3

According to the documentation, ReadOnly mode only supports row-based reads (column reads are not implemented). But that's not hard to solve:

wb2 = Workbook(write_only=True)
ws2 = wb2.create_sheet()

# find what column I need
colcounter = 0
for row in ws.rows:
    for cell in row:
        if cell.value == "PerceivedSound.RESP":
            break
        colcounter += 1
    
    # cells are apparently linked to the parent workbook meta
    # this will retain only values; you'll need custom
    # row constructor if you want to retain more

    row2 = [cell.value for cell in row]
    ws2.append(row2) # preserve the first row in the new file
    break # stop after first row

for row in ws.rows:
    row2 = [cell.value for cell in row]
    row2.append(doStuff(row2[colcounter]))
    ws2.append(row2) # write a new row to the new wb
    
wb2.save('newfile.xlsx')
wb.close()
wb2.close()

# copy `newfile.xlsx` to `generalpath + exppath + doc`
# Either using os.system,subprocess.popen, or shutil.copy2()

You will not be able to write to the same workbook, but as shown above you can open a new workbook (in writeonly mode), write to it, and overwrite the old file using OS copy.

Xylidine answered 30/11, 2017 at 23:27 Comment(5)
This won't work as you cannot pass cell objects between different workbooks.Camporee
can you serialize the row via something like: row = [cell.value for cell in row] ?Xylidine
Sure, but don't forget this will lose formatting, comments, etc.Camporee
it's pretty clear the OP wants to do a value-based transformation on the cell's valueXylidine
This seems like a reasonable solution to the problem as stated in my original question, given that column-based reads aren't implemented. Only caveat that I haven't tried the code myself, since it turns out it doesn't address what my problem actually was.Haematin
M
0

This might be slower solution but given your query was to iterate through a single row tuple- i found a better solution

rowId=1
for i in range(len(ws[str(rowId)])):
    #print(str(ws[dbNameRow][i].value) + ' ' + str(i))
    if ws[dbNameRow][i].value == "<Provide your search string here>":
            lastColumn=i+1 #because counter starts from 0
Mckeown answered 17/8, 2024 at 23:40 Comment(0)

© 2022 - 2025 — McMap. All rights reserved.