Java: reading a file yields a leading BOM [ ï»¿ ]

Asked 9/6, 2011 at 8:49 Answered 20/3, 2022 at 5:19

java csv unicode character-encoding java-io

I am reading a file of keywords line by line and have hit a strange problem. When consecutive lines have the same content, I want them to be handled only once. For example, with

sony
sony

only the first one should be processed. The problem is that Java doesn't treat them as equal:

INFO: [, s, o, n, y]
INFO: [s, o, n, y]

My code looks like the following, where's the problem?

    FileReader fileReader = new FileReader("some_file.txt");
    BufferedReader bufferedReader = new BufferedReader(fileReader);
    String prevLine = "";
    String strLine;
    while ((strLine = bufferedReader.readLine()) != null) {
        logger.info(Arrays.toString(strLine.toCharArray()));
        if(strLine.contentEquals(prevLine)){
            logger.info("Skipping the duplicate lines " + strLine);
            continue;
        }
        prevLine = strLine;
    }

Update:

It looks as if there is a leading space in the first line, but there isn't, and the trim approach doesn't work for me. These two are not the same:

INFO: [, s, o, n, y]
INFO: [ , s, o, n, y]

I don't know what the first char added by Java is.

Solved: BalusC's solution fixed the problem. Thanks for pointing out that it's a BOM problem, which helped me find the solution quickly.

Boston answered 9/6, 2011 at 8:49 Comment(3)

does the file start with the byte sequence ef bb bf? If so, it is a UTF-8 file with a BOM. – Dyanne 9/6, 2011 at 9:14

Nope, it's UTF-encoded, but it doesn't start with the sequence you mentioned. – Boston 9/6, 2011 at 9:24

post the hex dump of the first two lines and the default charset of the system. Otherwise, we're just playing guess-the-code-point. – Dyanne 9/6, 2011 at 11:54

What is the encoding of the file?

The unseen char at the start of the file could be the Byte Order Mark.

Saving with ANSI or UTF-8 without BOM can help highlight this for you.
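One way to confirm this suspicion is to look at the first three bytes of the file: a UTF-8 BOM is the byte sequence EF BB BF. The following is a minimal sketch; the class name `BomCheck` and its method names are made up for illustration, and it only checks for the UTF-8 form of the BOM, not the UTF-16 or UTF-32 forms.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

public class BomCheck {

    /** Returns true if the bytes begin with EF BB BF, the UTF-8 encoding of U+FEFF. */
    public static boolean startsWithUtf8Bom(byte[] head) {
        return head.length >= 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF;
    }

    /** Reads up to three bytes from the file and checks them for a UTF-8 BOM. */
    public static boolean fileStartsWithUtf8Bom(Path path) throws IOException {
        try (InputStream in = Files.newInputStream(path)) {
            byte[] head = new byte[3];
            int read = 0;
            while (read < head.length) {
                int n = in.read(head, read, head.length - read);
                if (n < 0) {
                    break; // file is shorter than three bytes
                }
                read += n;
            }
            return read == head.length && startsWithUtf8Bom(head);
        }
    }

    public static void main(String[] args) {
        // In-memory demonstration, so the sketch runs without a file on disk.
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 's', 'o', 'n', 'y'};
        byte[] withoutBom = {'s', 'o', 'n', 'y'};
        System.out.println(startsWithUtf8Bom(withBom));    // true
        System.out.println(startsWithUtf8Bom(withoutBom)); // false
    }
}
```

For a real file, `BomCheck.fileStartsWithUtf8Bom(Paths.get("some_file.txt"))` would perform the same check on disk.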

Fantan answered 9/6, 2011 at 9:15 Comment(0)

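The Byte Order Mark (BOM) is a Unicode character, U+FEFF. Its use is optional, but if used, it should appear at the start of the text stream. When a UTF-8 file with a BOM is decoded with a single-byte charset such as windows-1252, the three BOM bytes (EF BB BF) show up as the characters ï»¿ at the start of the text.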

Microsoft compilers and interpreters, and many pieces of software on Microsoft Windows such as Notepad treat the BOM as a required magic number rather than use heuristics. These tools add a BOM when saving text as UTF-8, and cannot interpret UTF-8 unless the BOM is present or the file contains only ASCII. Google Docs also adds a BOM when converting a document to a plain text file for download.

File file = new File( csvFilename );
FileInputStream inputStream = new FileInputStream(file);
// [{"Key2":"21","ï»¿Key1":"11","Key3":"31"} ]
InputStreamReader inputStreamReader = new InputStreamReader( inputStream, "UTF-8" );

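We can resolve the garbled characters by explicitly passing UTF-8 as the charset to InputStreamReader. In UTF-8, the byte sequence shown as ï»¿ decodes to a single character, U+FEFF. Java's UTF-8 decoder does not drop that character, so it still has to be removed from the text afterwards.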
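As a minimal sketch of that idea applied to line-by-line reading: decode as UTF-8, then drop U+FEFF if it is the first character of the first line. The class name `BomAwareLines` and the method `readLines` are hypothetical, and the input is an in-memory byte array so the example runs without a file.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class BomAwareLines {

    private static final char BOM = '\uFEFF';

    /** Reads all lines as UTF-8 and strips a leading U+FEFF from the first line only. */
    public static List<String> readLines(InputStream in) throws IOException {
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader =
                new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            boolean first = true;
            while ((line = reader.readLine()) != null) {
                if (first && !line.isEmpty() && line.charAt(0) == BOM) {
                    line = line.substring(1); // drop the decoded BOM character
                }
                first = false;
                lines.add(line);
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        // Simulates a UTF-8 file saved with a BOM, followed by two identical keywords.
        byte[] data = "\uFEFFsony\nsony\n".getBytes(StandardCharsets.UTF_8);
        List<String> lines = readLines(new ByteArrayInputStream(data));
        System.out.println(lines.get(0).equals(lines.get(1))); // true
    }
}
```

To read a real file, a `FileInputStream` can be passed in place of the `ByteArrayInputStream`.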

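Using the CharMatcher class from Google Guava, you can remove any non-printable (invisible) characters and then retain only the ASCII characters (dropping any accented characters) like this. In newer Guava releases these constants are replaced by the factory methods CharMatcher.invisible() and CharMatcher.ascii().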

String printable = CharMatcher.INVISIBLE.removeFrom( input );
String clean = CharMatcher.ASCII.retainFrom( printable );

Full Example to read data from the CSV file to JSON Object:

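import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.stream.Stream;

import com.google.common.base.CharMatcher; // Google Guava
import org.json.simple.JSONObject;          // assumption: json-simple; the answer does not name the JSON library

public class CSV_FileOperations {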
    static List<HashMap<String, String>> listObjects = new ArrayList<HashMap<String,String>>();
    protected static List<JSONObject> jsonArray = new ArrayList<JSONObject >();

    public static void main(String[] args) {
        String csvFilename = "D:/Yashwanth/json2Bson.csv";

        csvToJSONString(csvFilename);
        String jsonData = jsonArray.toString();
        System.out.println("File JSON Data : \n"+ jsonData);
    }

    @SuppressWarnings("deprecation")
    public static String csvToJSONString( String csvFilename ) {
        try {
            File file = new File( csvFilename );
            FileInputStream inputStream = new FileInputStream(file);

            // lastIndexOf, so that a dot earlier in the path is not mistaken for the extension separator
            String fileExtensionName = csvFilename.substring(csvFilename.lastIndexOf("."));
            System.out.println("File Extension : "+ fileExtensionName);

            // [{"Key2":"21","ï»¿Key1":"11","Key3":"31"} ]
            InputStreamReader inputStreamReader = new InputStreamReader( inputStream, "UTF-8" );

            BufferedReader buffer = new BufferedReader( inputStreamReader );
            Stream<String> readLines = buffer.lines();
            boolean headerStream = true;

            List<String> headers = new ArrayList<String>();
            for (String line : (Iterable<String>) () -> readLines.iterator()) {
                String[] columns = line.split(",");
                if (headerStream) {
                    System.out.println(" ===== Headers =====");

                    for (String keys : columns) {
                        // ï»¿ - UTF-8 - ? « https://mcmap.net/q/523245/-remove-non-ascii-non-printable-characters-from-a-string
                        String printable = CharMatcher.INVISIBLE.removeFrom( keys );
                        String clean = CharMatcher.ASCII.retainFrom(printable);
                        // replaceAll takes a regex; String.replace would look for the literal text \P{Print}
                        String key = clean.replaceAll("\\P{Print}", "");
                        headers.add( key );
                    }
                    headerStream = false;
                    System.out.println(" ===== ----- Data ----- =====");
                } else {
                    addCSVData(headers, columns );
                }
            }
            // Closing the BufferedReader also closes the wrapped InputStreamReader and FileInputStream.
            buffer.close();


        } catch (FileNotFoundException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
        return null;
    }
    @SuppressWarnings("unchecked")
    public static void addCSVData( List<String> headers, String[] row ) {
        if( headers.size() == row.length ) {
            HashMap<String,String> mapObj = new HashMap<String,String>();
            JSONObject jsonObj = new JSONObject();
            for (int i = 0; i < row.length; i++) {
                mapObj.put(headers.get(i), row[i]);
                jsonObj.put(headers.get(i), row[i]);
            }
            jsonArray.add(jsonObj);
            listObjects.add(mapObj);
        } else {
            System.out.println("Avoiding the Row Data...");
        }
    }
}

Contents of json2Bson.csv:

Key1    Key2    Key3
11      21      31
12      22      32
13      23      33

Disallow answered 1/2, 2018 at 13:25 Comment(0)

Try trimming whitespace at the beginning and end of lines read. Just replace your while with:

while ((strLine = bufferedReader.readLine()) != null) {
    strLine = strLine.trim();
    logger.info(Arrays.toString(strLine.toCharArray()));
    if (strLine.contentEquals(prevLine)) {
        logger.info("Skipping the duplicate lines " + strLine);
        continue;
    }
    prevLine = strLine;
}

Gorga answered 9/6, 2011 at 8:53 Comment(1)

@Boston - You might have an encoding problem with your text file then. I just tried your example and it works perfectly as is. – Gorga 9/6, 2011 at 9:7

I had a similar case in my previous project. The culprit was the byte order mark, which I had to get rid of. Eventually I implemented a hack based on this example. Check it out; you might have the same problem.
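One common way to get rid of the BOM at the byte level, not necessarily the approach in the example this answer refers to, is to wrap the stream in a PushbackInputStream, peek at the first three bytes, and push them back unless they are EF BB BF. A minimal sketch, with the hypothetical names `BomSkippingStream` and `skipUtf8Bom`:

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PushbackInputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class BomSkippingStream {

    /** Returns a stream positioned just after a UTF-8 BOM, if the input starts with one. */
    public static InputStream skipUtf8Bom(InputStream in) throws IOException {
        PushbackInputStream pushback = new PushbackInputStream(in, 3);
        byte[] head = new byte[3];
        int read = 0;
        while (read < head.length) {
            int n = pushback.read(head, read, head.length - read);
            if (n < 0) {
                break; // stream is shorter than three bytes
            }
            read += n;
        }
        boolean isBom = read == 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF;
        if (!isBom && read > 0) {
            pushback.unread(head, 0, read); // not a BOM: put the bytes back
        }
        return pushback;
    }

    public static void main(String[] args) throws IOException {
        // Simulates a UTF-8 file saved with a BOM.
        byte[] data = "\uFEFFsony\nsony\n".getBytes(StandardCharsets.UTF_8);
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                skipUtf8Bom(new ByteArrayInputStream(data)), StandardCharsets.UTF_8))) {
            String firstLine = reader.readLine();
            System.out.println(Arrays.toString(firstLine.toCharArray())); // [s, o, n, y]
        }
    }
}
```

Because the BOM is removed before decoding, the rest of the reading code can stay unchanged.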

Sublingual answered 9/6, 2011 at 9:17 Comment(0)

If spaces are not important in the processing it would probably be worth doing a strLine.trim() call each time anyway. This is what I generally do when handling input like this - spaces can easily creep into a file if it has to be edited manually and if they're not important they can and should be ignored.

Edit: is the file encoded as UTF-8? You may need to specify the encoding when you open the file. It could be the byte order mark or something like that, if it's happening on the first line.

Try:

BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream(file), "UTF8"));

Swiss answered 9/6, 2011 at 8:53 Comment(2)

@Boston You may need to specify the file encoding when you open the file, I've edited. – Swiss 9/6, 2011 at 9:14

Yeah, it's encoded as UTF-8, and I tried your code; it doesn't work. – Boston 9/6, 2011 at 9:51

There must be a space or some non-printable character at the start. So either fix that, or trim the Strings during/before comparison.

[Edited]

In case String.trim() is of no avail, try String.replaceAll() with a suitable regex. Try this: str.replaceAll("\\p{Cntrl}", "").
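Which regex is suitable depends on the character involved. As a hedged illustration for the BOM case discussed elsewhere in this thread: U+FEFF is neither whitespace for trim() nor an ASCII control character for \p{Cntrl}, but it belongs to the Unicode "format" category, which \p{Cf} matches. The class name `InvisibleCharDemo` and the method `stripFormatChars` are made up for this sketch.

```java
public class InvisibleCharDemo {

    /** Removes all Unicode format characters (general category Cf), which includes U+FEFF. */
    public static String stripFormatChars(String s) {
        return s.replaceAll("\\p{Cf}", "");
    }

    public static void main(String[] args) {
        String line = "\uFEFFsony"; // a line that starts with a decoded BOM

        // trim() only removes characters up to U+0020, so the BOM survives.
        System.out.println(line.trim().length());                       // 5

        // By default \p{Cntrl} matches ASCII control characters only, so the BOM survives.
        System.out.println(line.replaceAll("\\p{Cntrl}", "").length()); // 5

        // \p{Cf} matches format characters such as U+FEFF.
        System.out.println(stripFormatChars(line).length());            // 4
    }
}
```

Note that \p{Cf} also removes other format characters, such as zero-width joiners, anywhere in the string; if only a leading BOM should go, replaceFirst("^\uFEFF", "") is narrower.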

Minard answered 9/6, 2011 at 8:53 Comment(0)

Open the file in a text editor, navigate to File > Save As... and choose UTF-8 encoding, instead of UTF-8 with BOM.

Batch answered 20/3, 2022 at 5:19 Comment(0)
