Java Scanner(File) misbehaving, but Scanner(FileInputStream) always works with the same file

I am seeing odd behavior with Scanner. With a particular set of files, it works when I use the Scanner(FileInputStream) constructor, but not when I use the Scanner(File) constructor.

Case 1: Scanner(File)

Scanner s = new Scanner(new File("file"));
while(s.hasNextLine()) {
    System.out.println(s.nextLine());
}

Result: no output

Case 2: Scanner(FileInputStream)

Scanner s = new Scanner(new FileInputStream(new File("file")));
while(s.hasNextLine()) {
    System.out.println(s.nextLine());
}

Result: the file content outputs to the console.

The input file is a java file containing a single class.

I double checked programmatically (in Java) that:

  • the file exists,
  • is readable,
  • and has a non-zero filesize.

Scanner(File) normally works for me in this situation; I am not sure why it doesn't now.

Depressed answered 29/2, 2012 at 1:52 Comment(7)
And is that the only code, or are there other things happening around all that? This snippet seems incomplete, as there would be at least some exception handling taking place. Could you provide us with the whole code? - Endrin
Interesting question. Please post your actual code and a pastebin with your file. Also, what is the output of Charset.defaultCharset() on your system? - Hindustan
@Perception: I thought of that as well, but the source of Scanner seems to hint that they use the default charset in both cases, if not using a constructor that would specify it explicitly. - Endrin
@kashiko: Ah, another very important follow-up question: what's the size of the file? - Endrin
I have updated my original post to have code copied from my source file. Just as a test I am reading the file and outputting it to the terminal. The file is a Java source file from an open source project. My character set is UTF-8. The size of the file is 18357 bytes. - Depressed
Size does not matter, look at my answer below (I found out how it happens, though not actually why). - Dropout
Wow, I was just having the opposite problem (works with File, not with FileInputStream). I don't know if it's related but +1 nonetheless. Wasted a good hour on this. - Haiku

hasNextLine() calls findWithinHorizon(), which in turn calls findPatternInBuffer(), searching for a match of the line pattern, defined as .*(\r\n|[\n\r\u2028\u2029\u0085])|.+$
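
As a small illustration of that pattern on its own, here is a sketch that applies the quoted regular expression to a few strings. The class name LinePatternDemo and the helper firstLine are made up for this example, and it only exercises the regex, not Scanner's internal buffering.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LinePatternDemo {
    // The line pattern quoted above, written with regex escapes.
    static final Pattern LINE =
        Pattern.compile(".*(\\r\\n|[\\n\\r\\u2028\\u2029\\u0085])|.+$");

    // Hypothetical helper: returns the first match in the buffer, or null if there is none.
    static String firstLine(String buffer) {
        Matcher m = LINE.matcher(buffer);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // A terminated line matches the first alternative, terminator included.
        System.out.println("a\n".equals(firstLine("a\nb")));
        // An unterminated last line matches the second alternative (.+$).
        System.out.println("b".equals(firstLine("b")));
        // An empty buffer matches neither alternative.
        System.out.println(firstLine("") == null);
    }
}
```

So the pattern itself has no trouble with a buffer that holds "a" plus a line feed; whatever differs between the two constructors has to happen before the characters reach the buffer.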

The strange thing is that with both ways of constructing a Scanner (from a FileInputStream or from a File), findPatternInBuffer returns a positive match, regardless of file size, if the file contains for instance the 0x0A line terminator. But if the file also contains a byte outside the ASCII range (i.e. >= 0x80), the FileInputStream variant returns true while the File variant returns false.

Very simple test case:

Create a file that contains just the character "a" followed by a line feed:

# hexedit file     
00000000   61 0A                                                a.

# java Test.java
using File: true
using FileInputStream: true

Now edit the file with hexedit so that it ends with a 0x80 byte:

# hexedit file
00000000   61 0A 80                                             a..

# java Test.java
using File: false
using FileInputStream: true

The Java test code contains nothing beyond what is already in the question:

import java.io.*;
import java.util.*;

public class Test {
    public static void main(String[] args) {
        File file = new File("file");
        // Open the same file both ways and compare hasNextLine()
        try (Scanner s1 = new Scanner(file);
             Scanner s2 = new Scanner(new FileInputStream(file))) {
            System.out.println("using File: " + s1.hasNextLine());
            System.out.println("using FileInputStream: " + s2.hasNextLine());
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

So it turns out this is a charset issue. In fact, changing the test to:

 Scanner s1 = new Scanner(file1, "latin1");

we get:

# java Test 
using File: true
using FileInputStream: true
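
A slightly fuller sketch of the same idea, for anyone who wants to reproduce it without hexedit: it writes the three bytes from the dump above (61 0A 80) to a temporary file, reads them with an explicit charset, and checks Scanner.ioException(), which returns the last IOException thrown by the Scanner's underlying source. The class and method names (CharsetScan, sampleFile, readLines) are invented for this example, and whether ioException() is non-null in the failing default-charset case is an assumption that this thread does not verify.

```java
import java.io.File;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

public class CharsetScan {
    // Writes the bytes 61 0A 80 ("a", line feed, 0x80) to a temporary file.
    static File sampleFile() {
        try {
            File f = File.createTempFile("scanner-demo", ".txt");
            f.deleteOnExit();
            Files.write(f.toPath(), new byte[] {0x61, 0x0A, (byte) 0x80});
            return f;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reads all lines with an explicit charset and rethrows any I/O error
    // that the Scanner recorded instead of throwing.
    static List<String> readLines(File file, String charsetName) {
        List<String> lines = new ArrayList<>();
        try (Scanner s = new Scanner(file, charsetName)) {
            while (s.hasNextLine()) {
                lines.add(s.nextLine());
            }
            IOException recorded = s.ioException();
            if (recorded != null) {
                throw new UncheckedIOException(recorded);
            }
        } catch (FileNotFoundException e) {
            throw new UncheckedIOException(e);
        }
        return lines;
    }

    public static void main(String[] args) {
        // ISO-8859-1 maps every byte to a character, so nothing can be malformed.
        List<String> lines = readLines(sampleFile(), "ISO-8859-1");
        System.out.println("lines read: " + lines.size());
    }
}
```

Passing the charset explicitly also makes the result independent of the platform default, which is where the two constructors seem to diverge.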
Dropout answered 29/2, 2012 at 3:36 Comment(1)
Interesting. Looking at the Scanner constructors, they all seem to assume the default charset if none is specified, yet there's a difference at runtime, as you point out. Maybe the channel used internally forces a different one, one level deeper? I'm wondering... Will try to check when I get a chance. - Endrin

Looking at the Oracle/Sun JDK 1.6.0_23 implementation of Scanner, the Scanner(File) constructor creates a FileInputStream, which is meant for raw binary data.

This points to a difference in buffering and parsing technique used when invoking one constructor or another, which will directly impact your code on the call to hasNextLine().

Scanner(InputStream) uses an InputStreamReader while Scanner(File) uses an InputStream passed to a ByteChannel (and probably reads the whole file in one jump, thus advancing the cursor, in your case).
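
To make that difference concrete, here is a sketch comparing the two kinds of reader on the bytes 61 0A 80, using an in-memory stream instead of a file. It assumes UTF-8 and assumes that the channel path ends up with a freshly created CharsetDecoder; whether Scanner(File) is wired exactly this way is a guess on my part and not something this thread confirms. The class name DecoderDemo and its methods are made up for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.channels.Channels;
import java.nio.charset.StandardCharsets;

public class DecoderDemo {
    // "a", line feed, then a lone 0x80 byte, which is not valid UTF-8.
    static final byte[] DATA = {0x61, 0x0A, (byte) 0x80};

    // Reads the reader to the end; returns the text, or the exception's
    // simple class name if reading fails.
    static String drain(Reader reader) {
        StringBuilder sb = new StringBuilder();
        try (Reader r = reader) {
            int c;
            while ((c = r.read()) != -1) {
                sb.append((char) c);
            }
            return sb.toString();
        } catch (IOException e) {
            return e.getClass().getSimpleName();
        }
    }

    // InputStreamReader replaces malformed input with U+FFFD.
    static String viaInputStreamReader() {
        return drain(new InputStreamReader(
            new ByteArrayInputStream(DATA), StandardCharsets.UTF_8));
    }

    // A fresh CharsetDecoder reports malformed input, so the read throws.
    static String viaChannelReader() {
        return drain(Channels.newReader(
            Channels.newChannel(new ByteArrayInputStream(DATA)),
            StandardCharsets.UTF_8.newDecoder(), -1));
    }

    public static void main(String[] args) {
        System.out.println("InputStreamReader: " + viaInputStreamReader().length() + " chars");
        System.out.println("channel reader: " + viaChannelReader());
    }
}
```

If Scanner records such an exception rather than rethrowing it, the caller would simply see hasNextLine() return false, which would match the "no output" symptom in the question.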

Endrin answered 29/2, 2012 at 2:17 Comment(2)
The contracts for Scanner(File) and Scanner(FileInputStream) read the same though, so they should produce the same behavior from the API user's point of view. I have used Scanner(File) before without this issue. - Depressed
Yanick: Thanks, this is an interesting question. But there seems to be more to this... (Still, the stuff you can dig up from the JDK's code sometimes... I had a "What??" moment when I noticed there are multiple definitions of ArrayList, for instance, and no, they aren't exactly identical.) - Endrin
