Error Parsing due to CSV Differences Before/After Saving (Java w/ Apache Commons CSV)
Asked Answered
B

3

6

I have a 37 column CSV file that I am parsing in Java with Apache Commons CSV 1.2. My setup code is as follows:

//initialize FileReader object
FileReader fileReader = new FileReader(file);

//intialize CSVFormat object
CSVFormat csvFileFormat = CSVFormat.DEFAULT.withHeader(FILE_HEADER_MAPPING);

//initialize CSVParser object
CSVParser csvFileParser = new CSVParser(fileReader, csvFileFormat);

//Get a list of CSV file records
List<CSVRecord> csvRecords = csvFileParser.getRecords();

// process accordingly

My problem is that when I copy the CSV to be processed to my target directory and run my parsing program, I get the following error:

Exception in thread "main" java.lang.IllegalArgumentException: Index for header 'Title' is 7 but CSVRecord only has 6 values!
        at org.apache.commons.csv.CSVRecord.get(CSVRecord.java:110)
        at launcher.QualysImport.createQualysRecords(Unknown Source)
        at launcher.QualysImport.importQualysRecords(Unknown Source)
        at launcher.Main.main(Unknown Source)

However, if I copy the file to my target directory, open and save it, then try the program again, it works. Opening and saving the CSV adds back the commas needed at the end so my program won't compain about not having enough headers to read.

For context, here is a sample line of before/after saving:

Before (failing): "data","data","data","data"

After (working): "data","data",,,,"data",,,"data",,,,,,

So my question: why does the CSV format change when I open and save it? I'm not changing any values or encoding, and the behavior is the same for MS-DOS or regular .csv format when saving. Also, I'm using Excel to copy/open/save in my testing.

Is there some encoding or format setting I need to be using? Can I solve this programmatically?

Thanks in advance!

EDIT #1:

For additional context, when I first view an empty line in the original file, it just has the new line ^M character like this:

^M

After opening in Excel and saving, it looks like this with all 37 of my empty fields:

,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,^M

Is this a Windows encoding discrepancy?

Boehm answered 15/4, 2016 at 17:15 Comment(0)
V
2

Maybe that's a compatibility issue with whatever generated the file in the first place. It seems that Excel accepts a blank line as a valid row with empty strings in each column, with the number of columns to match some other row(s). Then it saves it according to CSV conventions with the column delimiter. (the ^M is the Carriage Return character; on Microsoft systems it precedes the Line Feed character at the end of a line in text files)

Perhaps you can deal with it by creating your own Reader subclass to sit between the FileReader and the CSVParser. Your reader will read a line, and if it is blank then return a line with the correct number of commas. Otherwise just return the line as-is.

For example:

class MyCSVCompatibilityReader extends BufferedReader
    {
    private final BufferedReader delegate;

    public MyCSVCompatibilityReader(final FileReader fileReader)
        {
        this.delegate = new BufferedReader(fileReader);
        }

    @Override
    public String readLine()
        {
        final String line = this.delegate.readLine();
        if ("".equals(line.trim())
            { return ",,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,"; }
        else
            { return line; }
        }
    }

There are a lot of other details to implement correctly when implementing the interface. You'll need to pass through calls to all the other methods (close, ready, reset, skip, etc.), and ensure that each of the various read() methods work correctly. It might be easier, if the file will fit in memory easily, to just read the file and write the fixed version to a new StringWriter then create a StringReader to the CSVParser.

Vinegarette answered 15/4, 2016 at 21:5 Comment(1)
Great input, thanks for the time. I've also just read that RFC 4180 is the standard that Excel will save a "non-conforming" CSV file down into, which in turn is the reason I see the commas after saving. I'll look to implement my own reader because I have to solve it programmatically.Boehm
C
-1

Maybe try this: Creates a parser for the given File. parse(File file, Charset charset, CSVFormat format)

//import import java.nio.charset.StandardCharsets; //StandardCharsets.UTF_8

Note: This method internally creates a FileReader using FileReader.FileReader(java.io.File) which in turn relies on the default encoding of the JVM that is executing the code.

Chesterfieldian answered 15/4, 2016 at 17:40 Comment(3)
What does this have to do with adding the missing columns?Erminna
maybe withAllowMissingColumnNames? CSVFormat csvFileFormat = CSVFormat.DEFAULT.withHeader(FILE_HEADER_MAPPING).withAllowMissingColumnNames();Chesterfieldian
Don't try guessing. If you researched the matter and know the answer, then answer. But just throwing untested suggestions is not the way things work on Stack Overflow. Also, you are supposed to format your answer, especially the parts that are code.Erminna
C
-1

Or maybe try withAllowMissingColumnNames?

//intialize CSVFormat object 
CSVFormat csvFileFormat = CSVFormat.DEFAULT.withHeader(FILE_HEADER_MAPPING).withAllowMissingColumnNames();
Chesterfieldian answered 15/4, 2016 at 18:10 Comment(1)
Nope, that would infer that the column is there, without a name. My columns are just not there (before I save the file)Boehm

© 2022 - 2024 — McMap. All rights reserved.