I am experimenting with an edge case we're seeing in production. We have a business model where clients generate text files and then FTP them to our servers. We ingest those files and process them on our Java backend (running on CentOS machines). Most (95%+) of our clients know to generate these files in UTF-8, which is what we want. However, we have a few stubborn clients (but large accounts) that generate these files on Windows machines with the CP1252 character set. No problem though; we've configured our 3rd-party libs (which do most of the "processing" work for us) to handle input in any character set through some magical voodoo.
Occasionally, we see a file come over that has illegal UTF-8 characters (CP1252) in its name. When our software tries to read these files from the FTP server, the normal method of file reading chokes and throws a FileNotFoundException:
File f = getFileFromFTPServer();
BufferedReader reader = new BufferedReader(new FileReader(f));
String line = reader.readLine();
// ...etc.
The exceptions look something like this:
java.io.FileNotFoundException: /path/to/file/some-text-blah?blah.xml (No such file or directory)
    at java.io.FileInputStream.open(Native Method)
    at java.io.FileInputStream.<init>(FileInputStream.java:120)
    at java.io.FileReader.<init>(FileReader.java:55)
    at com.myorg.backend.app.InputFileProcessor.run(InputFileProcessor.java:60)
    at java.lang.Thread.run(Thread.java:662)
So what I think is happening is that because the file name itself contains illegal chars, we never even get to read it in the first place. If we could, then regardless of the file's contents, our software should be able to handle processing it correctly. So this is really an issue with reading file names with illegal UTF-8 chars in them.
As a test case, I created a very simple Java "app" to deploy on one of our servers and test some things out (source code is provided below). I then logged into a Windows machine and created a test file named test£.txt. Notice the character after "test" in the file name; this is Alt-0163. I FTPed this to our server, and when I ran ls -ltr on its parent directory, I was surprised to see it listed as test?.txt.
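Before going further, it helped me to see exactly which code points the JVM was handing back for each name in that directory. Here is a small standalone helper I could have used for that (NameInspector is made up for illustration; the directory argument is whatever parent directory the FTPed files land in):

```java
import java.io.File;

public class NameInspector {
    // Render a file name as its Unicode code points so that
    // replacement characters (U+FFFD) become visible.
    public static String codePointsOf(String name) {
        StringBuilder sb = new StringBuilder();
        name.codePoints().forEach(cp -> sb.append(String.format("U+%04X ", cp)));
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        File dir = new File(args.length > 0 ? args[0] : ".");
        File[] entries = dir.listFiles(); // may be null if dir is unreadable
        if (entries == null) return;
        for (File f : entries) {
            System.out.println(f.getName() + " -> " + codePointsOf(f.getName()));
        }
    }
}
```

With a name like test£.txt that the JVM failed to decode, the output would show U+FFFD in place of the pound sign.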
Before I go any further, here is the Java "app" I wrote for testing/reproducing this issue:
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.Charset;

public class Driver {
    public static void main(String[] args) {
        Driver d = new Driver();
        d.run(args[0]); // I know this is bad, but it's fine for our purposes here
    }

    private void run(String fileName) {
        InputStreamReader isr = null;
        BufferedReader buffReader = null;
        FileInputStream fis = null;
        String firstLineOfFile = "default";
        System.out.println("Processing " + fileName);
        try {
            System.out.println("Attempting UTF-8...");
            fis = new FileInputStream(fileName);
            isr = new InputStreamReader(fis, Charset.forName("UTF-8"));
            buffReader = new BufferedReader(isr);
            firstLineOfFile = buffReader.readLine();
            System.out.println("UTF-8 worked and first line of file is : " + firstLineOfFile);
        }
        catch (IOException io1) {
            // UTF-8 failed; try CP1252.
            try {
                System.out.println("UTF-8 failed. Attempting Windows-1252...(" + io1.getMessage() + ")");
                fis = new FileInputStream(fileName);
                // I've also tried variations "WINDOWS-1252", "Windows-1252", "CP1252", "Cp1252", "cp1252"
                isr = new InputStreamReader(fis, Charset.forName("windows-1252"));
                buffReader = new BufferedReader(isr);
                firstLineOfFile = buffReader.readLine();
                System.out.println("Windows-1252 worked and first line of file is : " + firstLineOfFile);
            }
            catch (IOException io2) {
                // Both UTF-8 and CP1252 failed...
                System.out.println("Both UTF-8 and Windows-1252 failed. Could not read file. (" + io2.getMessage() + ")");
            }
        }
    }
}
When I run this from the terminal (java -cp . com/Driver t*), I get the following output:
Processing test�.txt
Attempting UTF-8...
UTF-8 failed. Attempting Windows-1252...(test�.txt (No such file or directory))
Both UTF-8 and Windows-1252 failed. Could not read file.(test�.txt (No such file or directory))
test�.txt
?!?! I did some research and found that "�" is the Unicode replacement character \uFFFD. So I guess what's happening is that the CentOS FTP server doesn't know how to handle Alt-0163 (£) and so replaces it with \uFFFD (�). But I don't understand why ls -ltr displays a file called test?.txt...
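One thing worth checking is which charset the JVM itself is using to decode OS strings like file names, since that is what turns the on-disk bytes into \uFFFD. A quick diagnostic (sun.jnu.encoding is a HotSpot-specific property and may be absent on other JVMs):

```java
import java.nio.charset.Charset;

public class EncodingCheck {
    public static void main(String[] args) {
        // Default charset the JVM uses when none is specified
        System.out.println("defaultCharset   = " + Charset.defaultCharset());
        // Charset used by default for file *contents*
        System.out.println("file.encoding    = " + System.getProperty("file.encoding"));
        // HotSpot-specific: charset used for file *names* and other OS strings
        System.out.println("sun.jnu.encoding = " + System.getProperty("sun.jnu.encoding"));
        // On Linux, the locale environment usually drives both
        System.out.println("LANG             = " + System.getenv("LANG"));
    }
}
```

If this reports UTF-8 while the file name on disk is CP1252 bytes, the JVM cannot decode the name losslessly, which would explain the replacement character.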
In any event, it appears that the solution is to add some logic that searches for the existence of this character in the file name and, if found, renames the file to something the system can read and process (perhaps with a String-wise replaceAll("\uFFFD", "_") or something like that).
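A rough sketch of what that rename pass might look like (Renamer and the underscore policy are made up for illustration; note the caveat below):

```java
import java.io.File;

public class Renamer {
    // Replace any Unicode replacement characters with underscores.
    // Purely an illustrative name-mangling policy.
    public static String sanitize(String name) {
        return name.replace('\uFFFD', '_');
    }

    public static void main(String[] args) {
        File dir = new File(args.length > 0 ? args[0] : ".");
        File[] entries = dir.listFiles(); // hands back File objects even for odd names
        if (entries == null) return;
        for (File f : entries) {
            String clean = sanitize(f.getName());
            if (!clean.equals(f.getName())) {
                // renameTo takes a File (not a String) and returns false on failure
                boolean ok = f.renameTo(new File(dir, clean));
                System.out.println("rename " + f.getName() + " -> " + clean + " : " + ok);
            }
        }
    }
}
```

The catch is that if the JVM's decoding of the name was lossy (the £ byte already became \uFFFD), it may not be able to re-encode the name back to the original on-disk bytes, so the rename itself can fail for the same reason the open did; running the JVM under a locale whose charset round-trips every byte (e.g. LANG=en_US.ISO-8859-1) is a commonly cited workaround for exactly this.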
The problem is that Java doesn't even see this file on the file system. CentOS knows the file is there (test?.txt), but when that file name gets passed into Java, Java interprets it as test�.txt and, for some reason, No such file or directory...
How can I get Java to see this file so that I can call File#renameTo(File) on it? Sorry for the backstory here, but I feel it is relevant since every detail counts in this scenario. Thanks in advance!
java.io.File#listFiles()? It may return references to such files. docs.oracle.com/javase/7/docs/api/java/io/File.html#listFiles() – Moser