Best language to parse extremely large Excel 2007 files [closed]
Asked Answered
M

10

5

My boss has a habit of performing queries on our databases that return tens of thousands of rows and saving them into excel files. I, being the intern, constantly have to write scripts that work with the information from these files. Thus far I've tried VBScript and Powershell for my scripting needs. Both of these can take several minutes to perform even the simplest of tasks, which would mean that the script when finished would take most of an 8 hour day.

My workaround right now is simply to write a PowerShell script that removes all of the commas and newline characters from an xlsx file, saves the .xlsx files to .csv, and then have a Java program handle the data gathering and output, and have my script clean up the .csv files when finished. This runs in a matter of seconds for my current project, but I can't help but wonder if there's a more elegant alternative for my next one. Any suggestions?

Meaghanmeagher answered 24/8, 2010 at 20:54 Comment(4)
Can you open Excel on the files, and ask Excel to do what you need?Orly
Well that's basically what's done now. Lots of vlookups and pivot tables that bring a top of the line core i7 machine with 8GB of ram to its knees.Meaghanmeagher
Is this of any use? I saw a demo and it seems to be quick: powerpivot.com/demo.aspx It allows you to quickly analyse millions of rows in Excel, the interactive results can be added to Sharepoint and even returned to Reporting Services for further analysis.Aureaaureate
Any chance you can query the same database as the boss?Ciborium
C
6

I kept getting all kinds of weird errors when working with .xlsx files.

Here's a simple example of using Apache POI to traverse an .xlsx file, updated to POI v5. See also Upgrading to POI 3.5, including converting existing HSSF Usermodel code to SS Usermodel (for XSSF and HSSF).

import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import org.apache.poi.ss.usermodel.Cell;
import org.apache.poi.ss.usermodel.DateUtil;
import org.apache.poi.ss.usermodel.FormulaEvaluator;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.ss.usermodel.Workbook;
import org.apache.poi.xssf.usermodel.XSSFWorkbook;

/** @see https://mcmap.net/q/829044/-best-language-to-parse-extremely-large-excel-2007-files-closed */
public class XlsxReader {

    public static void main(String[] args) throws IOException {
        InputStream myxls = new FileInputStream("test.xlsx");
        Workbook book = new XSSFWorkbook(myxls);
        FormulaEvaluator eval =
            book.getCreationHelper().createFormulaEvaluator();
        Sheet sheet = book.getSheetAt(0);
        for (Row row : sheet) {
            for (Cell cell : row) {
                printCell(cell, eval);
                System.out.print("; ");
            }
            System.out.println();
        }
        myxls.close();
    }

    private static void printCell(Cell cell, FormulaEvaluator eval) {
        switch (cell.getCellType()) {
            case BLANK:
                System.out.print("EMPTY");
                break;
            case STRING:
                System.out.print(cell.getStringCellValue());
                break;
            case NUMERIC:
                if (DateUtil.isCellDateFormatted(cell)) {
                    System.out.print(cell.getDateCellValue());
                } else {
                    System.out.print(cell.getNumericCellValue());
                }
                break;
            case BOOLEAN:
                System.out.print(cell.getBooleanCellValue());
                break;
            case FORMULA:
                System.out.print(cell.getCellFormula());
                break;
            default:
                System.out.print("DEFAULT");
        }
    }
}
Ciborium answered 25/8, 2010 at 1:22 Comment(3)
I'm getting out of memory errors when book is initialized in your code when passing the 30+ MB spreadsheets that I'm sorting through. It seems to work just fine for smaller files though. I think I'm just going to keep with my current method for now. I am going to keep this in mind for smaller files though.Meaghanmeagher
@user429997: When the time comes, you can increase the memory allocated to Java using the -Xms and -Xmx options.Ciborium
See also DataFormatter, illustrated here.Ciborium
A
3

Your goal is to do "data transformation" on your Excel files.

To solve this, I would use a dedicated ETL tool (Extract Transform Load), such as Talend Open Studio.

You just have to put a "Excel Input" component, a "data transform" component, and a "CSV output component". Talend ETL will convert this functional description of your problem into a Java code. Finally, you just have to execute this program...

Aura answered 24/8, 2010 at 21:4 Comment(1)
I haven't had much luck with Talend and "big" excel files, e.g. 10k lines or more. Getting OutOfMemoryErrorsSerieswound
V
2

I personally would use Python for this. I have found that it runs fast enough to not be a noticeable problem.

If you don't want to worry about a new language, why not just use Java for the entire thing? Removing commas and newlines is pretty trivial in Java and it would save you a step.

Valedictory answered 24/8, 2010 at 20:58 Comment(1)
From what I've seen Java or Python is my best bet. The problem I've had thus far is interacting with an .xlsx file, which I haven't been able to do easily with Java. I might give Python a try sometime this week if I don't make any more headway with the apache POI. Its pretty trivial with PowerShell, but I'd just like to narrow it down to a single language if possible.Meaghanmeagher
S
2

You should always think about the future of your code...

Who will maintain your script in the future? Does your company have any other developers that are familiar with PowerShell/VBScript?

I would have to say that you should stick to one language that fits your (and your company's) needs. As Nathan suggested, Python would be a great choice for creating fast scripts.

And one more thing - If you can control the SQL statements your boss does, you can make him create outputs that will ease your parsers' development and make them much more simple.

Good luck!

Tal.

Simonize answered 24/8, 2010 at 21:10 Comment(0)
T
2

in addition to trashgod's answer, for large files, I'd suggest POI SXSSF (Since POI 3.8 beta3). (http://poi.apache.org/spreadsheet/) With SXSSF, you can handle large files in streams, and that will help avoiding memory errors.

adding link to SXSSF details: http://poi.apache.org/spreadsheet/how-to.html#sxssf

Tuggle answered 10/2, 2012 at 3:42 Comment(0)
U
1

You could use Java POI to access the .xlsx directly.

Unmeaning answered 24/8, 2010 at 20:56 Comment(1)
I had tried this today, and it didn't seem quite as intuitive as everyone makes it sound. I had no problem working with .xls files, but I kept getting all kinds of weird errors when working with .xlsx files.Meaghanmeagher
W
1

If you save the file as a CSV, you can use any language you want to parse it.

Wetterhorn answered 24/8, 2010 at 21:11 Comment(0)
B
0

You can import data into an embedded database - e.g., apache derby (or http://hsqldb.org/). Depending on the nature of your queries, it can be a little bit faster. For sure, it will save a lot of your time if your boss requests new features often. You will simply write most of your new functionality in SQL.

Belanger answered 24/8, 2010 at 21:13 Comment(0)
B
0

If you need an ADVANCE analysis -- beyond grouping, joining, filtering --, just go for free mining tools, such as Wekka*, Rapid miner (based on Wekka but nicer GUI), or knime. These tools have very nice interfaces and provide operators to read cvs files. You can also run rapidminer and wekka libraries inside your java program. If not, go for embedded database as I proposed before.

Using Apache POI is not bad idea but I -- personally -- prefer to use it only to read excel before uploading it into e.g., a database.

Regarding the language. The best language that I have found for adhoc tasks is groovy. It is scripting language on the top of Java so you can use all Java libs (POI, jdbcs, ...a very looong list) and mix groovy classes with Java classes.

Belanger answered 26/8, 2010 at 11:26 Comment(0)
C
0

I have two options for parsing excel(.xlsx or xls) files. 1-You can use apache POI api to extract data from it.Now Apache poi has improved and fast.

2- Convert excel to open xml then write a xslt file. I think it should do work for a long file excel file.

Cudgel answered 9/8, 2016 at 7:48 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.