ZWNBSP appears when parsing CSV

I have a CSV file and I want to check that it contains all the data it should. But a ZWNBSP (zero-width no-break space) appears at the beginning of the first column name in the first row.

My simplified code is

@Test
void parseCsvTest() throws Exception {
    Configuration.holdBrowserOpen = true;
    ClassLoader classLoader = getClass().getClassLoader();
    try (
            InputStream inputStream = classLoader.getResourceAsStream("files/csv_example.csv");
            CSVReader reader = new CSVReader(new InputStreamReader(inputStream))
    ) {
        List<String[]> content = reader.readAll();
        var csvStrings0line = content.get(0);
        var csv1stElement = csvStrings0line[0];
        var csv1stElementShouldBe = "Timestamp";
        assertEquals(csv1stElementShouldBe, csv1stElement);
    }
}

My CSV contains

"Timestamp","Source","EventName","CountryId","Platform","AppVersion","DeviceType","OsVersion"
"2022-05-02T14:56:59.536987Z","courierapp","order_delivered_sent","643","ios","3.11.0","iPhone 11","15.4.1"
"2022-05-02T14:57:35.849328Z","courierapp","order_delivered_sent","643","ios","3.11.0","iPhone 8","15.3.1"

My test fails with

expected: <Timestamp> but was: <Timestamp>
Expected :Timestamp
Actual   :Timestamp
<Click to see difference>

Clicking "Click to see difference" shows that there is a ZWNBSP at the beginning of the Actual text.

Copying and pasting my text into an online tool for displaying non-printable Unicode characters (https://www.soscisurvey.de/tools/view-chars.php) shows only CR LF at the ends of the lines, no ZWNBSPs.

But where does it come from?

Atonality answered 4/5, 2022 at 5:39 Comment(1)
Open it with a hex editor instead. The character is most likely in the file (or do you suggest there's a mechanism which inserts random characters for no reason?), and it's being dropped when you copy it online (bad idea to rely on online tools only). - Esque
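
If no hex editor is at hand, the same check can be sketched in Java by printing the first few bytes of the file. A UTF-8 BOM shows up as EF BB BF before the opening quote (22). The class and method names here are hypothetical, `readNBytes` assumes Java 11 or later, and the resource path is the one from the question.

```java
import java.io.IOException;
import java.io.InputStream;

final class LeadingBytes {

    // Formats up to `count` leading bytes of the stream as space-separated hex.
    static String hexPrefix(InputStream in, int count) throws IOException {
        byte[] head = in.readNBytes(count); // Java 11+
        StringBuilder sb = new StringBuilder();
        for (byte b : head) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Resource path taken from the question; adjust it for your project.
        try (InputStream in = LeadingBytes.class.getClassLoader()
                .getResourceAsStream("files/csv_example.csv")) {
            if (in == null) {
                System.out.println("resource not found");
                return;
            }
            System.out.println(hexPrefix(in, 4));
        }
    }
}
```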

It's a BOM (byte order mark) character. You can remove it yourself or use one of several other solutions (see https://mcmap.net/q/224860/-reading-utf-8-bom-marker, for instance).
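
A minimal sketch of the "remove it yourself" option, assuming the file is UTF-8 and is decoded as UTF-8, so the BOM arrives as the single character U+FEFF. `BomStripper` and `stripBom` are hypothetical names, not part of any CSV library, and the in-memory bytes stand in for the real file.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

final class BomStripper {

    private static final char BOM = '\uFEFF';

    // Returns the string without a leading U+FEFF, or unchanged if there is none.
    static String stripBom(String s) {
        return (!s.isEmpty() && s.charAt(0) == BOM) ? s.substring(1) : s;
    }

    public static void main(String[] args) throws IOException {
        // In-memory stand-in for a UTF-8 CSV file that starts with a BOM.
        byte[] csv = "\uFEFF\"Timestamp\",\"Source\"\r\n".getBytes(StandardCharsets.UTF_8);
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new ByteArrayInputStream(csv), StandardCharsets.UTF_8))) {
            String firstLine = reader.readLine();
            System.out.println(firstLine.charAt(0) == BOM); // the BOM survives decoding
            System.out.println(stripBom(firstLine));        // header line without it
        }
    }
}
```

In the test from the question, the same helper could be applied to `csv1stElement` before the `assertEquals` call.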

Boulware answered 3/9, 2022 at 8:56 Comment(0)

That is the Unicode zero-width no-break space character (U+FEFF). When it appears at the beginning of a Unicode-encoded text file, it serves as a byte order mark (BOM). You read it to determine the encoding of the text file, and then you can safely discard it. The best thing you can do is spread awareness.
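
The "read it, pick the encoding, then discard it" idea could look roughly like this in Java. It is a sketch under some assumptions: `BomAwareReader` is a hypothetical name, only the UTF-8 and UTF-16 BOMs are recognised (UTF-32 is ignored), and the caller supplies the charset to use when no BOM is found.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PushbackInputStream;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class BomAwareReader {

    // Opens a Reader whose charset is chosen from the BOM, with the BOM itself skipped.
    static Reader open(InputStream raw, Charset fallback) throws IOException {
        PushbackInputStream in = new PushbackInputStream(raw, 3);
        byte[] head = new byte[3];
        int n = in.readNBytes(head, 0, 3); // Java 9+; returns 0 at end of stream
        int b0 = head[0] & 0xFF;
        int b1 = head[1] & 0xFF;
        int b2 = head[2] & 0xFF;

        Charset charset = fallback;
        int bomLength = 0;
        if (n >= 3 && b0 == 0xEF && b1 == 0xBB && b2 == 0xBF) {
            charset = StandardCharsets.UTF_8;
            bomLength = 3;
        } else if (n >= 2 && b0 == 0xFE && b1 == 0xFF) {
            charset = StandardCharsets.UTF_16BE;
            bomLength = 2;
        } else if (n >= 2 && b0 == 0xFF && b1 == 0xFE) {
            charset = StandardCharsets.UTF_16LE;
            bomLength = 2;
        }
        // Push back everything that was read past the BOM so the Reader sees it.
        if (n > bomLength) {
            in.unread(head, bomLength, n - bomLength);
        }
        return new InputStreamReader(in, charset);
    }

    public static void main(String[] args) throws IOException {
        byte[] csv = "\uFEFF\"Timestamp\"".getBytes(StandardCharsets.UTF_8);
        try (Reader reader = open(new ByteArrayInputStream(csv), StandardCharsets.UTF_8)) {
            System.out.println((char) reader.read()); // first character after the BOM
        }
    }
}
```

The returned `Reader` could then be handed to `CSVReader` in place of the plain `InputStreamReader` used in the question.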

Wilburn answered 16/2, 2023 at 5:48 Comment(0)

In IntelliJ IDEA you can remove the BOM directly in the text editor (bottom-right corner of the window).

Bismarck answered 3/6 at 12:34 Comment(0)
