Java BufferedWriter Creating Null Characters

4

8

I've been using Java's BufferedWriter to write to a file while parsing some input. When I open the file afterwards, however, there seem to be added null characters. I tried specifying the encoding as "US-ASCII" and "UTF8", but I get the same result. Here's my code snippet:

Scanner fileScanner = new Scanner(original);
BufferedWriter out = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(file), "US-ASCII"));
while(fileScanner.hasNextLine())
  {
     String next = fileScanner.nextLine();
     next = next.replaceAll(".*\\x0C", ""); //remove up to ^L
     out.write(next);
     out.newLine();
  }
 out.flush();
 out.close();

Maybe the issue isn't even with the BufferedWriter?

I've narrowed it down to this code block because if I comment it out, there are no null-characters in the output file. If I do a regex replace in VIM the file is null-character free (:%s/.*^L//g).

Let me know if you need more information.

Thanks!

EDIT: hexdump of a normal line looks like: 0000000 5349 2a41 3030 202a

But when this code is run the hexdump looks like: 0000000 5330 2a49 4130 202a

I'm not sure why things are getting mixed up.

EDIT: Also, even if the file doesn't match the regex and runs through that block of code, it comes out with null characters.

EDIT: Here's a hexdump of the first few lines of a diff: http://pastie.org/pastes/8964701/text

command was: diff -y testfile.hexdump expectedoutput.hexdump

The rest of the lines are different like the last two.

Consentaneous answered 19/3, 2014 at 13:48 Comment(9)
What kind of data is the input? Is it plain text with a known character encoding? Are you sure you are opening it with this encoding? Do the spurious NULL bytes go away if you comment out the replaceAll line?Chincapin
It's a plain ASCII text file. It looks like anytime this block gets run something weird happens. I compared hexdumps of a file without the headers and a file run through this code to remove headers and it looks like it's swapping bytes. I added an example above.Consentaneous
Is it possible to get a copy of the input file you're using?Halifax
Unfortunately, no. I can't give out the information.Consentaneous
I added a partial hexdump above. The rest of the lines are different, and the file run through this code is actually shorter too.Consentaneous
The only difference between the two hexdumps is one has a LF (0A) line ending, and the other has a CRLF (0D 0A) line ending. The rest of the data is shifted forward to accommodate the extra byte.Jacquelynejacquelynn
@StuartCaie That's exactly the problem! If you create an answer with that, I'll mark it as correct. I guess I need to be more observant with my hexdumps.Consentaneous
What are these "Null Characters" to which you refer?? There are no null-value bytes in your hexdump, so your statement of the problem seems mistaken.Undersheriff
Yes. My initial guess was wrong. Looking at the file through a text editor showed weird characters, but that was because everything was shifted by the missing line endings. Either way, it has been resolved. See @StuartCaie's answer.Consentaneous
J
9

EDIT: Looking at the hexdump diff you gave, the only difference is that one has LF line endings (0A) and the other has CRLF line endings (0D 0A). All the other data in your diff is shifted ahead to accommodate the extra byte.

The CRLF is the default line ending on the OS you're using. If you want a specific line ending in your output, write the string "\n" or "\r\n".

Previously I noted that the Scanner doesn't specify a charset. It should specify the appropriate one that the input is known to be encoded in. However, this isn't the source of the unexpected output.
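As a minimal sketch of writing an explicit line ending rather than relying on newLine(): the class name ExplicitLineEnding, the helper copyLines, and the sample EDI-like text are made up for illustration, and in-memory strings stand in for the real files.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import java.util.Scanner;

public class ExplicitLineEnding {
    // Hypothetical helper: copies the input line by line and terminates
    // every output line with the separator the caller asks for.
    static String copyLines(String input, String lineEnding) throws IOException {
        StringWriter sink = new StringWriter();
        try (Scanner in = new Scanner(new StringReader(input));
             BufferedWriter out = new BufferedWriter(sink)) {
            while (in.hasNextLine()) {
                out.write(in.nextLine()); // nextLine() drops the original separator
                out.write(lineEnding);    // explicit separator instead of newLine()
            }
        }
        return sink.toString();
    }

    public static void main(String[] args) throws IOException {
        String crlfInput = "ISA*00*\r\nGS*PO*\r\n"; // made-up sample data
        // Same separator as the input: the copy is identical.
        System.out.println(copyLines(crlfInput, "\r\n").equals(crlfInput));
        // LF only: one byte fewer per line, so everything after it shifts.
        System.out.println(copyLines(crlfInput, "\n").length() + " vs " + crlfInput.length());
    }
}
```

For real files, the Scanner can be given the input's charset as well, for example new Scanner(original, "US-ASCII").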

Jacquelynejacquelynn answered 24/3, 2014 at 15:7 Comment(2)
I've tried both "UTF-8" and "ASCII" with no luck. I added a partial hexdump above. The last few lines are different, and the rest of the lines are different according to the diff. The file that ran through this code also has a shorter hexdump for some reason.Consentaneous
Thanks for the help! I need to be more observant with my hexdumps.Consentaneous
V
0

Scanner.nextLine() is eating the existing line endings.
The javadoc for nextLine states:

This method returns the rest of the current line, excluding any line separator at the end.

The javadoc for BufferedWriter.newLine explains:

Writes a line separator. The line separator string is defined by the system property line.separator, and is not necessarily a single newline ('\n') character.

In your case your system's default newline separator is "\n". The EDI file you are parsing uses "\r\n".
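The two javadoc statements can be checked with a small sketch. The class name LineSeparatorDemo and the helpers newLineOutput and visible are hypothetical names, and the printed values depend on the platform the code runs on.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.StringWriter;
import java.util.Scanner;

public class LineSeparatorDemo {
    // Captures whatever BufferedWriter.newLine() emits on this platform.
    static String newLineOutput() throws IOException {
        StringWriter sink = new StringWriter();
        try (BufferedWriter out = new BufferedWriter(sink)) {
            out.newLine();
        }
        return sink.toString();
    }

    // Makes CR and LF visible when printed.
    static String visible(String s) {
        return s.replace("\r", "\\r").replace("\n", "\\n");
    }

    public static void main(String[] args) throws IOException {
        try (Scanner in = new Scanner("first\r\nsecond\r\n")) {
            // The returned line carries no separator, so the CRLF is gone here.
            System.out.println("nextLine() returned: [" + visible(in.nextLine()) + "]");
        }
        System.out.println("newLine() writes:    [" + visible(newLineOutput()) + "]");
        System.out.println("line.separator is:   [" + visible(System.lineSeparator()) + "]");
    }
}
```

On a system whose line.separator is "\n", the last two lines print [\n], which is why a CRLF input comes out with LF endings.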

Using the system-defined newline separator isn't the appropriate thing to do in this case. The newline separator to use is dictated by the file format and should be put in a format-specific static constant somewhere.

Change "out.newLine();" to "out.write("\r\n");"

Valladares answered 24/3, 2014 at 18:57 Comment(0)
P
0

I think what is going on is the following

All lines that contain ^L (ff) are modified to remove everything up to and including the ^L. In addition, as a side effect of observation 1 below, every \r (cr) is removed. However, if a cr appears before the ^L, nextLine() treats the text before that cr as a separate line. Note how, in the example below, the input file contains 6 line terminators (counting cr nl as one) and the output file also contains 6, but they are all nl. The line with c is preserved because it is treated as a different line from the one containing ^L. That is probably not what you want. See below.

Some observations

  1. The source file is being generated on a system that uses \r\n to define a new line, and your program is being run on a system that does not. Because of this all occurrences of 0xd are going to be removed. This will make the two files different sizes even if there are no ^L.

  2. But you probably overlooked #1 because vim operates in DOS mode (recognizing \r\n as a newline separator) or non-DOS mode (only \n) depending on what it reads when it opens the file, and hides that fact from the user if it can. In fact, to test this I had to force in \r using ^v^m because I was editing on Linux using vim.

  3. Your means of testing is probably od -x (for hex, right?). But that outputs 16-bit shorts, which is not what you want. Consider the following input file and output file, after your program runs, as viewed in vi.

Input file

a
b^M
c^M^M ^L
d^L

Output file

a
b
c

Well, maybe that's right; let's see what od has to say.

od -x of input File

0a61    0d62    630a    0d0d    0c20    640a    0a0c 

od -x of output File

0a61    0a62    0a63    0a0a    000a

Huh, where did that null come from? But wait, from the man page of od:

-t type     Specify the output format.  type is a string containing one or more of the following kinds of type specifiers:

              a       Named characters (ASCII).  Control characters are displayed using the following names:
-h, -x      Output hexadecimal shorts.  Equivalent to -t x2.
-a          Output named characters.  Equivalent to -t a.

Oh, OK, so use the -a option instead.

od -a of input

a  nl   b  cr  nl   c  cr  cr  sp  ff  nl   d  ff  nl

od -a of output

a  nl   b  nl   c  nl  nl  nl  nl 
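The difference between the two od views can be reproduced with a short sketch. It assumes a little-endian machine (which is what the od -x dumps above imply), and the class and method names are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

public class HexViews {
    // One two-digit hex value per byte, in file order.
    static String byteView(byte[] data) {
        StringBuilder sb = new StringBuilder();
        for (byte b : data) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    // Pairs of bytes shown as 16-bit little-endian words, the way od -x
    // groups them. An odd-length input is padded with a zero byte, which
    // is where the apparent "null" comes from.
    static String wordView(byte[] data) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < data.length; i += 2) {
            int low = data[i] & 0xff;
            int high = (i + 1 < data.length) ? data[i + 1] & 0xff : 0;
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02x%02x", high, low));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The 9-byte output file from the example above.
        byte[] output = "a\nb\nc\n\n\n\n".getBytes(StandardCharsets.US_ASCII);
        System.out.println(wordView(output)); // 0a61 0a62 0a63 0a0a 000a
        System.out.println(byteView(output)); // 61 0a 62 0a 63 0a 0a 0a 0a
    }
}
```

The word view also swaps each pair of bytes, which matches the "swapped bytes" impression given by the hexdumps in the question once the data is shifted by one byte.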

Forcing java to ignore \r

And finally, all that being said, you really have to overcome Java's implicit understanding that \r delimits a line, even contrary to the documentation. Even when the scanner is explicitly given a pattern that ignores \r, it still operates contrary to the documentation, and you must override that again by setting the delimiter (see below). I've found the following will probably do what you want by insisting on Unix line semantics. I also added some logic to avoid writing blank lines.

public static void repl(File original,File file) throws IOException
{
   Scanner fileScanner = new Scanner(original);
   Pattern pattern1 = Pattern.compile("(?d).*");

   fileScanner.useDelimiter("(?d)\\n");

   BufferedWriter out = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(file), "UTF8"));

   while(fileScanner.hasNext(pattern1))
   {
      String next = fileScanner.next(pattern1);

      next = next.replaceAll("(?d)(.*\\x0C)|(\\x0D)","");
      if(next.length() != 0)
      {
         out.write(next);
         out.newLine();
      }
   }
   out.flush();
   out.close();
}

With this change, the output above changes to.

od -a of input

a  nl   b  cr  nl   c  cr  cr  sp  ff  nl   d  ff  nl

od -a of output

a  nl   b  nl
Pulpboard answered 24/3, 2014 at 19:41 Comment(0)
L
0

Stuart Caie provided the answer. This is for anyone looking for code that avoids these characters.

The basic issue is that the original file uses one line separator and the new file uses a different one.

One easy way is to find the original file's separator and use the same one in the new file.

    try(BufferedWriter out = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(file)));
            Scanner fileScanner = new Scanner(original);) {
        String lineSep = null;
        boolean lineSepFound = false;
        while(fileScanner.hasNextLine())
        {

            if (!lineSepFound){
                MatchResult matchResult = fileScanner.match();
                if (matchResult != null){
                    lineSep = matchResult.group(1);
                    if (lineSep != null){
                        lineSepFound = true;
                    }
                }
            }else{
                out.write(lineSep);
            }
            String next = fileScanner.nextLine();
            next = next.replaceAll(".*\\x0C", ""); //remove up to ^L
            out.write(next);

        }
    } catch ( IOException e) {
        e.printStackTrace();
    }

Note: MatchResult matchResult = fileScanner.match(); returns the match result of the last match performed. In our case that was hasNextLine(), where Scanner uses its linePattern to find the next line. The Scanner.hasNextLine source code finds the line separator,

but unfortunately it offers no way to get the line separator back. So I have used their code to get lineSep only once, and used that lineSep when creating the new file.

Also, your code writes an extra line separator at the end of the file. That is corrected here.
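If relying on Scanner's internal match state feels fragile, a rough alternative is to detect the separator with a plain regex. This is only a sketch under assumptions: the whole input fits in memory as a String, only CRLF, CR and LF are considered, and SeparatorSniffer, detect and stripHeaders are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SeparatorSniffer {
    // CRLF is listed first so it is not mistaken for a lone CR.
    private static final Pattern LINE_END = Pattern.compile("\r\n|\r|\n");

    // Returns the first line separator found in the text, or the fallback.
    static String detect(CharSequence text, String fallback) {
        Matcher m = LINE_END.matcher(text);
        return m.find() ? m.group() : fallback;
    }

    // Removes everything up to ^L on each line and rejoins the lines with
    // the separator detected in the original text.
    static String stripHeaders(String text) {
        String sep = detect(text, System.lineSeparator());
        // Limit -1 keeps a trailing empty piece, so a final separator survives.
        String[] lines = LINE_END.split(text, -1);
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < lines.length; i++) {
            if (i > 0) sb.append(sep);
            sb.append(lines[i].replaceAll(".*\\x0C", ""));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String input = "a\r\nheader\fb\r\n"; // made-up sample with one ^L
        String result = stripHeaders(input);
        System.out.println(result.equals("a\r\nb\r\n"));
    }
}
```

Because the lines are joined rather than terminated, the result ends with a separator only if the input did.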

Let me know if that works.

Ladner answered 26/3, 2014 at 17:23 Comment(0)
