how to convert xls to xlsx
Asked Answered
R

19

50

I have some *.xls (excel 2003) files, and I want to convert those files into xlsx (excel 2007).

I use the uno python package, when I save the documents, I can set the Filter name: MS Excel 97 But there is no Filter name like 'MS Excel 2007',

How can set the the filter name to convert xls to xlsx ?

Rabjohn answered 29/3, 2012 at 3:20 Comment(0)
P
52

You need to have win32com installed on your machine. Here is my code:

import win32com.client as win32
fname = "full+path+to+xls_file"
excel = win32.gencache.EnsureDispatch('Excel.Application')
wb = excel.Workbooks.Open(fname)

wb.SaveAs(fname+"x", FileFormat = 51)    #FileFormat = 51 is for .xlsx extension
wb.Close()                               #FileFormat = 56 is for .xls extension
excel.Application.Quit()

Just wrapped the code above in a function together with better path handling to make sure the excel process is getting closed.

from pathlib import Path
import win32com.client as win32
    
def convert_xls_to_xlsx(path: Path) -> None:
    excel = win32.gencache.EnsureDispatch('Excel.Application')
    wb = excel.Workbooks.Open(path.absolute())

    # FileFormat=51 is for .xlsx extension
    wb.SaveAs(str(path.absolute().with_suffix(".xlsx")), FileFormat=51)
    wb.Close()
    excel.Application.Quit()
Phelps answered 29/6, 2014 at 13:48 Comment(6)
Many for the elegant solution!Shed
Great solution @Phelps . The only issue I am running into is that the Excel process is not removed from the Task manager. After some further research I came across this article which states that the variable used to open the Excel process is still holding on to it. So the solution is to use del to delete the variable of the process and indeed it is gone from the Task manager.Spade
I used this code to solve my problem but I was stuck with an error on SaveAs . This other answer helped me solve that.Ethanol
This answer preserves formatting, graphs and other weirdness, which was important in my application.Colum
@Spade Can you edit this answer or make new answer adding the del? Thanks.Phallicism
In my testing, if above script is wrapped in a function, excel instance gets terminated after 5-10 seconds following excel.Application.Quit()Phelps
F
28

Here is my solution, without considering fonts, charts and images:

$ pip install pyexcel pyexcel-xls pyexcel-xlsx

Then do this::

import pyexcel as p

p.save_book_as(file_name='your-file-in.xls',
               dest_file_name='your-new-file-out.xlsx')

If you do not need a program, you could install one additinal package pyexcel-cli::

$ pip install pyexcel-cli
$ pyexcel transcode your-file-in.xls your-new-file-out.xlsx

The transcoding procedure above uses xlrd and openpyxl.

Fulk answered 10/4, 2017 at 14:22 Comment(5)
I got an error : xlrd.biffh.XLRDError: Unsupported format, or corrupt file: Expected BOF record; found b'<?xml ve'. I saw from other post that this is because the file was corrupted. However, I can open it manually.Wastepaper
In this way how can I convert .ppt to .pptx using that librarySchnur
pyexcel seems a bit away from pyoffice or pyppt though. but if you insist, you can attempt it via this tutorial: io.pyexcel.org/en/latest/extensions.htmlFulk
if the reader want to try the latter method(no programming required), please refer a complete instruction below.Fulk
Note that dates didn't transfer as dates, they came over as strings. So when reading those lines of standard of a mm/dd/yyyy column, you need a line like this: val = datetime.strptime(row[date_column].value, "%m/%d/%Y")Fusillade
S
23

I've had to do this before. The main idea is to use the xlrd module to open and parse a xls file and write the content to a xlsx file using the openpyxl module.

Here's my code. Attention! It cannot handle complex xls files, you should add you own parsing logic if you are going to use it.

import xlrd
from openpyxl.workbook import Workbook
from openpyxl.reader.excel import load_workbook, InvalidFileException

def open_xls_as_xlsx(filename):
    # first open using xlrd
    book = xlrd.open_workbook(filename)
    index = 0
    nrows, ncols = 0, 0
    while nrows * ncols == 0:
        sheet = book.sheet_by_index(index)
        nrows = sheet.nrows
        ncols = sheet.ncols
        index += 1

    # prepare a xlsx sheet
    book1 = Workbook()
    sheet1 = book1.get_active_sheet()

    for row in xrange(0, nrows):
        for col in xrange(0, ncols):
            sheet1.cell(row=row, column=col).value = sheet.cell_value(row, col)

    return book1
Surprint answered 29/3, 2012 at 5:11 Comment(1)
This will only process the last worksheet of the workbook.Martinsen
D
14

I found none of answers here 100% right. So I post my codes here:

import xlrd
from openpyxl.workbook import Workbook

def cvt_xls_to_xlsx(src_file_path, dst_file_path):
    book_xls = xlrd.open_workbook(src_file_path)
    book_xlsx = Workbook()

sheet_names = book_xls.sheet_names()
for sheet_index, sheet_name in enumerate(sheet_names):
    sheet_xls = book_xls.sheet_by_name(sheet_name)
    if sheet_index == 0:
        sheet_xlsx = book_xlsx.active
        sheet_xlsx.title = sheet_name
    else:
        sheet_xlsx = book_xlsx.create_sheet(title=sheet_name)

    for row in range(0, sheet_xls.nrows):
        for col in range(0, sheet_xls.ncols):
            sheet_xlsx.cell(row = row+1 , column = col+1).value = sheet_xls.cell_value(row, col)

book_xlsx.save(dst_file_path)
Doorway answered 13/9, 2016 at 3:33 Comment(2)
Thanks. One thing to update: replace active() with activeEndothecium
I got an error: xlrd.biffh.XLRDError: Unsupported format, or corrupt file: Expected BOF record; found b'<?xml ve'Wastepaper
L
7

The answer by Ray helped me a lot, but for those who search a simple way to convert all the sheets from a xls to a xlsx, I made this Gist:

import xlrd
from openpyxl.workbook import Workbook as openpyxlWorkbook

# content is a string containing the file. For example the result of an http.request(url).
# You can also use a filepath by calling "xlrd.open_workbook(filepath)".

xlsBook = xlrd.open_workbook(file_contents=content)
workbook = openpyxlWorkbook()

for i in xrange(0, xlsBook.nsheets):
    xlsSheet = xlsBook.sheet_by_index(i)
    sheet = workbook.active if i == 0 else workbook.create_sheet()
    sheet.title = xlsSheet.name

    for row in xrange(0, xlsSheet.nrows):
        for col in xrange(0, xlsSheet.ncols):
            sheet.cell(row=row, column=col).value = xlsSheet.cell_value(row, col)

# The new xlsx file is in "workbook", without iterators (iter_rows).
# For iteration, use "for row in worksheet.rows:".
# For range iteration, use "for row in worksheet.range("{}:{}".format(startCell, endCell)):".

You can find the xlrd lib here and the openpyxl here (you must download xlrd in your project for Google App Engine for example).

Luhey answered 10/6, 2015 at 9:35 Comment(1)
When I tried this code, the resulting sheet is offset by one row and one column. To get a cell-for-cell replication, I altered the sheet.cell line to: sheet.cell(row=row+1, column=col+1).value = xlsSheet.cell_value(row, col)Telemeter
P
7

You can use pandas IO functions:

import pandas as pd

df = pd.read_excel('file_2003.xls', header=None)
df.to_excel('file_2003.xlsx', index=False, header=False)
Pharisaic answered 17/12, 2020 at 2:36 Comment(3)
need more explanation.Naive
I've added 'header=None' to reader and 'header=False' to writer to stop pandas using the first row of the file as column names. This is very important because if there are some equal cells in the first row, pandas will differentiate them by adding a suffix and will overwrite the cell content. Finally index=False is telling pandas not to add index column.Pharisaic
ValueError: File is not a recognized excel file when trying to read the .xls Excel file with pd.read_excel()Chambers
S
6

I'm improve performance for @Jackypengyu method.

Merged cells will be converted too.

Results

Convert same 12 files in same order:

Original:

0:00:01.958159
0:00:02.115891
0:00:02.018643
0:00:02.057803
0:00:01.267079
0:00:01.308073
0:00:01.245989
0:00:01.289295
0:00:01.273805
0:00:01.276003
0:00:01.293834
0:00:01.261401

Improved:

0:00:00.774101
0:00:00.734749
0:00:00.741434
0:00:00.744491
0:00:00.320796
0:00:00.279045
0:00:00.315829
0:00:00.280769
0:00:00.316380
0:00:00.289196
0:00:00.347819
0:00:00.284242

Solution

def cvt_xls_to_xlsx(*args, **kw):
    """Open and convert XLS file to openpyxl.workbook.Workbook object

    @param args: args for xlrd.open_workbook
    @param kw: kwargs for xlrd.open_workbook
    @return: openpyxl.workbook.Workbook


    You need -> from openpyxl.utils.cell import get_column_letter
    """

    book_xls = xlrd.open_workbook(*args, formatting_info=True, ragged_rows=True, **kw)
    book_xlsx = Workbook()

    sheet_names = book_xls.sheet_names()
    for sheet_index in range(len(sheet_names)):
        sheet_xls = book_xls.sheet_by_name(sheet_names[sheet_index])

        if sheet_index == 0:
            sheet_xlsx = book_xlsx.active
            sheet_xlsx.title = sheet_names[sheet_index]
        else:
            sheet_xlsx = book_xlsx.create_sheet(title=sheet_names[sheet_index])

        for crange in sheet_xls.merged_cells:
            rlo, rhi, clo, chi = crange

            sheet_xlsx.merge_cells(
                start_row=rlo + 1, end_row=rhi,
                start_column=clo + 1, end_column=chi,
            )

        def _get_xlrd_cell_value(cell):
            value = cell.value
            if cell.ctype == xlrd.XL_CELL_DATE:
                value = datetime.datetime(*xlrd.xldate_as_tuple(value, 0))

            return value

        for row in range(sheet_xls.nrows):
            sheet_xlsx.append((
                _get_xlrd_cell_value(cell)
                for cell in sheet_xls.row_slice(row, end_colx=sheet_xls.row_len(row))
            ))

        for rowx in range(sheet_xls.nrows):
            if sheet_xls.rowinfo_map[rowx].hidden != 0:
                print sheet_names[sheet_index], rowx
                sheet_xlsx.row_dimensions[rowx+1].hidden = True
        for coly in range(sheet_xls.ncols):
            if sheet_xls.colinfo_map[coly].hidden != 0:
                print sheet_names[sheet_index], coly
                coly_letter = get_column_letter(coly+1)
                sheet_xlsx.column_dimensions[coly_letter].hidden = True

    return book_xlsx
Subtraction answered 3/3, 2017 at 9:21 Comment(3)
Could you give an example of using this function? the *args and **kwargs is somewhat confusing, considering I usually just use the function this way : xlrd.open_workbook(file_name)Constantan
This didn't work as written. Error returned is NotImplementedError: formatting_info=True not yet implemented. Removing the formatting_info and checking for hidden does it. Unfortunately, formulae didn't carry over and I will post my solution that builds on your code when I do it.Fusee
@Constantan What I did is I changed the params to be function header to be cvt_xls_to_xlsx(src_file) and then used xlrd.open_workbook(src_file like in your comment. Otherwise, thank you very much for this work! This does preserve merged fields, which is exactly what I needed.Proudfoot
B
4

Simple solution

I required a simple solution to convert couple of xls to xlsx format. There are plenty of answers here, but they are doing some "magic" that I do not completely understand.

A simple solution was given by chfw, but not quite complete.

Install dependencies

Use pip to install

pip install pyexcel-cli pyexcel-xls pyexcel-xlsx

Execute

All the styling and macros will be gone, but the information is intact.

For single file

pyexcel transcode your-file-in.xls your-new-file-out.xlsx

For all files in the folder, one liner

for file in *.xls; do echo "Transcoding $file"; pyexcel transcode "$file" "${file}x"; done;
Bullough answered 5/9, 2017 at 13:12 Comment(0)
R
2

I tried @Jhon Anderson's solution, works well but got an "year is out of range" error when there are cells of time format like HH:mm:ss without date. There for I improved the algorithm again:

def xls_to_xlsx(*args, **kw):
"""
    open and convert an XLS file to openpyxl.workbook.Workbook
    ----------
    @param args: args for xlrd.open_workbook
    @param kw: kwargs for xlrd.open_workbook
    @return: openpyxl.workbook.Workbook对象
    """
    book_xls = xlrd.open_workbook(*args, formatting_info=True, ragged_rows=True, **kw)
    book_xlsx = openpyxl.workbook.Workbook()

    sheet_names = book_xls.sheet_names()
    for sheet_index in range(len(sheet_names)):
        sheet_xls = book_xls.sheet_by_name(sheet_names[sheet_index])
        if sheet_index == 0:
            sheet_xlsx = book_xlsx.active
            sheet_xlsx.title = sheet_names[sheet_index]
        else:
            sheet_xlsx = book_xlsx.create_sheet(title=sheet_names[sheet_index])
        for crange in sheet_xls.merged_cells:
            rlo, rhi, clo, chi = crange
            sheet_xlsx.merge_cells(start_row=rlo + 1, end_row=rhi,
            start_column=clo + 1, end_column=chi,)

        def _get_xlrd_cell_value(cell):
            value = cell.value
            if cell.ctype == xlrd.XL_CELL_DATE:
                datetime_tup = xlrd.xldate_as_tuple(value,0)    
                if datetime_tup[0:3] == (0, 0, 0):   # time format without date
                    value = datetime.time(*datetime_tup[3:])
                else:
                    value = datetime.datetime(*datetime_tup)
            return value

        for row in range(sheet_xls.nrows):
            sheet_xlsx.append((
                _get_xlrd_cell_value(cell)
                for cell in sheet_xls.row_slice(row, end_colx=sheet_xls.row_len(row))
            ))
    return book_xlsx

Then work perfect!

Roughshod answered 15/5, 2017 at 14:0 Comment(0)
H
2

Try using win32com application. Install it in your machine.

import sys, os
import win32com.client
directory = 'C:\\Users\\folder\\'
for file in os.listdir(directory):
    dot = file.find('.')
    end = file[dot:]
    OutFile =file[0:dot] + ".xlsx"
    App = win32com.client.Dispatch("Excel.Application")
    App.Visible = True
    workbook= App.Workbooks.Open(file)
    workbook.ActiveSheet.SaveAs(OutFile, 51)   #51 is for xlsx 
    workbook.Close(SaveChanges=True)
    App.Quit()

Thank you.

Heraclea answered 10/2, 2020 at 8:40 Comment(0)
F
1

CONVERT XLS FILE TO XLSX

Using python3.6 I have just come accross the same issue and after hours of struggle I solved it by doing the ff, you probably wont need all of the packages: (I will be as clear as posslbe)

make sure to install the following packages before proceeding

pip install pyexcel, pip install pyexcel-xls, pip install pyexcel-xlsx,

pip install pyexcel-cli

step 1:

import pyexcel

step 2: "example.xls","example.xlsx","example.xlsm"

sheet0 = pyexcel.get_sheet(file_name="your_file_path.xls", name_columns_by_row=0)

step3: create array from contents

xlsarray = sheet.to_array() 

step4: check variable contents to verify

xlsarray

step5: pass the array held in variable called (xlsarray) to a new workbook variable called(sheet1)

sheet1 = pyexcel.Sheet(xlsarray)

step6: save the new sheet ending with .xlsx (in my case i want xlsx)

sheet1.save_as("test.xlsx")
Fermin answered 15/3, 2018 at 17:24 Comment(1)
Although this code is throwing many errors in my case, Step3 should be xlsarray = sheet0.to_array()Billybillycock
H
1

Well I kept it simple and tried with Pandas:

import pandas as pd

df = pd.read_excel (r'Path_of_your_file\\name_of_your_file.xls')

df.to_excel(r'Output_path\\new_file_name.xlsx', index = False)
Heterotrophic answered 7/10, 2020 at 13:26 Comment(0)
C
1

This code works for me:

import pandas as pd

df = pd.read_html("test.xls") 
dfs= []
for i in range(len(df)):
    dfs.append(df[i])

   
def multiple_dfs(df_list, sheets, file_name, spaces):
    writer = pd.ExcelWriter(file_name,engine='xlsxwriter')   
    row = 0
    for dataframe in df_list:
        dataframe.to_excel(writer,sheet_name=sheets,startrow=row , startcol=0)   
        row = row + len(dataframe.index) + spaces + 1
    writer.save()

multiple_dfs(dfs, 'xlsx', 'test.xlsx', 1)
Champlain answered 30/11, 2022 at 13:2 Comment(3)
Having the full code block as part of your answer is great, but you should also highlight the key component of the code that addresses the problem before or after that. In this case it appears to be writer = pd.ExcelWriter(file_name,engine='xlsxwriter'), but maybe I'm wrong, which would be to my point.Bird
Thank you very much! I have been reading and trying to apply all the different solutions here but none seem to work. Only one small thing, writer.save() no longer works in pandas I tried writer.close() and voilá it did what I needed.Wolverine
Another question, if I have a series of files not just one, how can I do a loop in there? Try it my own, but only was able to load the first file.Wolverine
S
1

I spent a lot of time with my corrupted xls, so i decide to do good doing - show you my code brothers!

import xlrd
import pandas as pd


class ConvertLibrary:
    @staticmethod    
    def convert_xls_to_xlsx(file_name, new_name):
        wb = xlrd.open_workbook(file_name, ignore_workbook_corruption=True)

        excel = pd.read_excel(wb, index_col=None, header=None)
        excel.to_excel(new_name, index=False, header=False)
Stratopause answered 26/9, 2023 at 13:54 Comment(0)
A
0

The Answer from Ray was clipping the first row and last column of the data. Here is my modified solution (for python3):

def open_xls_as_xlsx(filename):
# first open using xlrd
book = xlrd.open_workbook(filename)
index = 0
nrows, ncols = 0, 0
while nrows * ncols == 0:
    sheet = book.sheet_by_index(index)
    nrows = sheet.nrows+1   #bm added +1
    ncols = sheet.ncols+1   #bm added +1
    index += 1

# prepare a xlsx sheet
book1 = Workbook()
sheet1 = book1.get_active_sheet()

for row in range(1, nrows):
    for col in range(1, ncols):
        sheet1.cell(row=row, column=col).value = sheet.cell_value(row-1, col-1) #bm added -1's

return book1
Angelika answered 11/4, 2017 at 3:37 Comment(0)
K
0

Tried @Jhon's solution 1st, then I turned into pyexcel as a solution

pyexcel.save_as(file_name=oldfilename, dest_file_name=newfilename)

It works properly until I tried to package my project to a single exe file by PyInstaller, I tried all hidden imports option, following error still there:

  File "utils.py", line 27, in __enter__
    pyexcel.save_as(file_name=self.filename, dest_file_name=newfilename)
  File "site-packages\pyexcel\core.py", line 77, in save_as
  File "site-packages\pyexcel\internal\core.py", line 22, in get_sheet_stream
  File "site-packages\pyexcel\plugins\sources\file_input.py", line 39, in get_da
ta
  File "site-packages\pyexcel\plugins\parsers\excel.py", line 19, in parse_file
  File "site-packages\pyexcel\plugins\parsers\excel.py", line 40, in _parse_any
  File "site-packages\pyexcel_io\io.py", line 73, in get_data
  File "site-packages\pyexcel_io\io.py", line 91, in _get_data
  File "site-packages\pyexcel_io\io.py", line 188, in load_data
  File "site-packages\pyexcel_io\plugins.py", line 90, in get_a_plugin
  File "site-packages\lml\plugin.py", line 290, in load_me_now
  File "site-packages\pyexcel_io\plugins.py", line 107, in raise_exception
pyexcel_io.exceptions.SupportingPluginAvailableButNotInstalled: Please install p
yexcel-xls
[3192] Failed to execute script

Then, I jumped to pandas:

pd.read_excel(oldfilename).to_excel(newfilename, sheet_name=self.sheetname,index=False)

Update @ 21-Feb 2020

openpyxl provides the function: append

enable the ability to insert rows to a xlxs file which means user could read the data from a xls file and insert them into a xlsx file.

  • append([‘This is A1’, ‘This is B1’, ‘This is C1’])
  • or append({‘A’ : ‘This is A1’, ‘C’ : ‘This is C1’})
  • or append({1 : ‘This is A1’, 3 : ‘This is C1’})

Appends a group of values at the bottom of the current sheet:

  • If it’s a list: all values are added in order, starting from the first column
  • If it’s a dict: values are assigned to the columns indicated by the keys (numbers or letters)
Kimsey answered 10/6, 2019 at 11:45 Comment(0)
T
0

@CaKel and @Jhon Anderson solution:

def _get_xlrd_cell_value(cell):
    value = cell.value
        if cell.ctype == xlrd.XL_CELL_DATE:
            # Start: if time is 00:00 this fix is necessary
            if value == 1.0:
                datetime_tup = (0, 0, 0)
            else:
            # end
                datetime_tup = xlrd.xldate_as_tuple(value, 0)

            if datetime_tup[0:3] == (0, 0, 0):
                value = datetime.time(*datetime_tup[3:])
            else:
                value = datetime.datetime(*datetime_tup)
    return value

And now this code runs perfect for me !

Telepathy answered 27/8, 2019 at 14:58 Comment(0)
T
0

This is a solution for MacOS with old xls files (e.g. Excel 97 2004).

The best way I found to deal with this format, if excel is not an option, is to open the file in openoffice and save it to another format as csv files.

Thereinto answered 15/4, 2021 at 15:27 Comment(0)
S
0

The use of win32com (pywin32), as the answer of @kvdogan is primarily the perfect approach.Some requisites:

  1. Using Windows (obviously);
  2. Have MSExcel installed;
  3. Ensure the path for file is the FULL path;

Also, the Pywin32 project is not up-to-date on SourceForge. Instead, use github: https://github.com/mhammond/pywin32 There's a .chm document, which you can read with SumatraPDF, for example, in the folder of project once installed.

#My answer contains no code at all.

EDIT: I don't have enough reputation to make comments. I'm sorry for the practical flood.

Sicken answered 5/6, 2021 at 19:0 Comment(1)
Thank you for pointing this out. I think the information you provided would be more valuable as a comment on the original answer from @Phelps than as a separate answer.Protonema

© 2022 - 2024 — McMap. All rights reserved.