How to check if a file is plain text?

6

8

In my program, the user can load a file with links (it's a webcrawler), but I need to verify if the file that the user chooses is plain text or something else (only plain text will be allowed).

Is it possible to do this? If it's useful, I'm using JFileChooser to open the file.

EDIT:

What is expected from the user: a text file containing URLs.

What I want to avoid: the user loading, for example, an MP3 file or an MS Word document.

Several answered 2/7, 2011 at 19:28 Comment(0)
P
5

A file is just a series of bytes, and without further information, you cannot tell whether these bytes are supposed to be code points in some string encoding (say, ASCII or UTF-8 or ANSI-something) or something else. You will have to resort to heuristics, such as:

  • Try to parse the file in a number of known encodings and see if the parsing succeeds. If it does, chances are you have a text file.
  • If you expect text files in Western languages only, you can assume that the majority of characters lie in the ASCII range (0..127), more specifically the printable range (33..126) plus whitespace (tab, newline, carriage return, space). Count occurrences of each distinct byte value, and if the overwhelming part of your document is in the 'typical western characters' set, it's usually safe to assume it's a text file.
  • Extending the previous approach: sample a sufficiently large quantity of text in the languages you expect, and build a character frequency profile. To check your file, compare the file's character frequency profile against your test data and see if it's close enough.
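The first two heuristics above could be sketched in Java roughly as follows. This is only an illustration under assumptions: the class name TextHeuristics, the method names, and the 0.9 threshold are made up for this example, and only UTF-8 is tried as a "known encoding".

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class; names and threshold are assumptions for illustration.
class TextHeuristics {

    // Heuristic 1: try to decode the bytes strictly as UTF-8.
    // REPORT makes the decoder throw instead of silently substituting U+FFFD.
    static boolean decodesAsUtf8(byte[] data) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    // Heuristic 2: fraction of bytes that are printable ASCII (32..126)
    // or common whitespace (tab, newline, carriage return).
    static double asciiTextRatio(byte[] data) {
        if (data.length == 0) {
            return 1.0; // an empty file is treated as trivially text-like
        }
        int textLike = 0;
        for (byte b : data) {
            int v = b & 0xFF; // Java bytes are signed; mask to 0..255
            if ((v >= 32 && v <= 126) || v == '\t' || v == '\n' || v == '\r') {
                textLike++;
            }
        }
        return (double) textLike / data.length;
    }

    // Combines both checks; 0.9 is an arbitrary cut-off chosen for this sketch.
    static boolean looksLikeText(byte[] data) {
        return decodesAsUtf8(data) && asciiTextRatio(data) >= 0.9;
    }
}
```

Strict decoding alone is not enough, because control bytes such as NUL are still valid UTF-8; that is why the sketch pairs it with the ratio check. The bytes could come from Files.readAllBytes(chooser.getSelectedFile().toPath()) for a file picked in a JFileChooser.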

But here's another solution: Just treat everything you receive as text, applying the necessary transformations where needed (e.g. HTML-encode when sending to a web browser). As long as you prevent the file from being interpreted as binary data (such as a user double-clicking the file), the worst you'll produce is gibberish data.

Pekin answered 2/7, 2011 at 19:37 Comment(0)
S
2

Text is also a form of binary data.

I suppose what you want to check is whether there are any characters in your input that are < 32. If you can safely assume that your text is multi-byte encoded, then you could just scan through the entire file and abort if you hit a byte in the range [0, 32) (excluding 9, 10, 13, and whatever else you may expect in "text" -- or worst-case only check for null bytes [thanks, tdammers!]). If you could plausibly expect to receive UTF-16 or UTF-32 encoded text, you'll have to work harder.
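A minimal Java sketch of that scan, assuming the allowed control bytes are exactly tab (9), line feed (10) and carriage return (13). The class and method names are hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper; the set of allowed control bytes is an assumption.
class ControlByteScan {

    // Returns true as soon as a byte in [0, 32) other than 9, 10, 13 is found.
    static boolean hasControlBytes(byte[] data) {
        for (byte b : data) {
            int v = b & 0xFF; // mask so bytes >= 0x80 are not seen as negative
            if (v < 32 && v != 9 && v != 10 && v != 13) {
                return true;
            }
        }
        return false;
    }

    // Convenience overload that reads the whole file into memory first.
    static boolean hasControlBytes(Path file) throws IOException {
        return hasControlBytes(Files.readAllBytes(file));
    }
}
```

As the answer notes, UTF-16 or UTF-32 text would be flagged by this check, since those encodings contain null bytes for ordinary ASCII characters.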

Superintend answered 2/7, 2011 at 19:32 Comment(4)
Tab, newline and carriage return are < 32.Pekin
@tdammers: Whoops, good catch. OK, exclude those from the match! What about line feeds? :-)Superintend
I'd probably check whether the file is UTF-8, assuming that it's text if it's valid UTF-8 (possibly excluding codepoints < 32 apart from tab, newline and carriage return and also 127).Addy
@MRAB: How do you mean exactly? The formal check for valid multibyte sequences is already subsumed in my answer, but for a full Unicode validity check you would also have to check that the coded characters are valid codepoints.Superintend
S
1

If you do not want to guess by file extension, you may read the first portion of the file. But the next problem will be the character encoding. Using a BufferedInputStream (mark() before and reset() afterwards), wrap it with an InputStreamReader with encoding "ISO-8859-1" and count the characters read that satisfy Character.isLetterOrDigit() or Character.isWhitespace() to get a ratio of typical text content. I think the ratio should be more than 80% for a text file.

You can also try other encodings like UTF-8, but you may get problems with invalid characters when the file is not UTF-8.
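A rough sketch of this sampling idea follows. The class name, the method name and the 4096-byte sample size are assumptions. One deliberate deviation: the sample is read as raw bytes and then decoded as ISO-8859-1, rather than wrapping the stream itself in an InputStreamReader, so that the reader's internal read-ahead cannot run past the mark limit.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper; SAMPLE_SIZE is an arbitrary choice for this sketch.
class SampleRatio {
    static final int SAMPLE_SIZE = 4096;

    // Reads up to SAMPLE_SIZE bytes, rewinds the stream, and returns the
    // fraction of sampled characters that are letters, digits or whitespace.
    static double textRatio(BufferedInputStream in) throws IOException {
        in.mark(SAMPLE_SIZE);
        byte[] sample = new byte[SAMPLE_SIZE];
        int n = 0;
        while (n < sample.length) {
            int r = in.read(sample, n, sample.length - n);
            if (r < 0) {
                break; // end of stream before the sample was full
            }
            n += r;
        }
        in.reset(); // the caller can now read the stream from the start again

        if (n == 0) {
            return 1.0; // empty input is treated as trivially text-like
        }
        // ISO-8859-1 maps every byte to exactly one char, so decoding never fails.
        String text = new String(sample, 0, n, StandardCharsets.ISO_8859_1);
        int textLike = 0;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetterOrDigit(c) || Character.isWhitespace(c)) {
                textLike++;
            }
        }
        return (double) textLike / n;
    }
}
```

The method returns the ratio rather than a verdict because the right threshold depends on the content. A list of URLs contains a lot of punctuation (":", "/", "."), so its ratio may well fall below the 80% suggested above.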

Syringa answered 2/7, 2011 at 19:45 Comment(2)
I can easily rename the extension of an image to ".TXT" and try to load it into an app which is trying to open a text file and cause it to crash.Damar
@SiKni8: That was not the question and a good app won't crash when opening a binary file!Syringa
L
1

You can also check to see if the initial bytes are a BOM (byte order mark), which should indicate a file in a Unicode encoding:

- UTF-8     => 0xEF, 0xBB, 0xBF
- UTF-16 BE => 0xFE, 0xFF
- UTF-16 LE => 0xFF, 0xFE
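A small Java sketch of that check, with hypothetical class and method names. It only knows the three BOMs listed above and returns null when none matches.

```java
// Hypothetical helper covering only the three BOMs listed in this answer.
class BomSniffer {

    // Returns the encoding name suggested by a leading BOM, or null if none is found.
    // Note: the UTF-32 LE BOM also starts with FF FE and is not distinguished here.
    static String detectBom(byte[] head) {
        if (startsWith(head, 0xEF, 0xBB, 0xBF)) {
            return "UTF-8";
        }
        if (startsWith(head, 0xFE, 0xFF)) {
            return "UTF-16BE";
        }
        if (startsWith(head, 0xFF, 0xFE)) {
            return "UTF-16LE";
        }
        return null;
    }

    // Compares the first bytes of data against the given unsigned values.
    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) {
                return false;
            }
        }
        return true;
    }
}
```

A missing BOM proves nothing either way: many UTF-8 text files are written without one, so this works best as a positive hint combined with another check.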

rossum

Lard answered 3/7, 2011 at 12:8 Comment(0)
V
0

You could create a file filter (for example, one attached to the JFileChooser) that looks at the file's type description and accepts the file only if it is described as text.

Vishinsky answered 2/7, 2011 at 19:32 Comment(0)
S
0

You can call the shell command file -i ${filename} from Java, and check the output to see if it contains something like charset=binary. If it does, then it is a binary file. Otherwise it is a text-based file.

You can play with file in the shell on various files to get familiar with it. In Groovy I would write something like this (path is a variable holding the file's path; the list form avoids problems with spaces in the path):

['file', '-i', path].execute().text.contains('charset=binary')

In Java you can also run external commands, for example with ProcessBuilder or Runtime.exec().
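A hedged sketch of the same idea in plain Java, using ProcessBuilder. The class and method names are made up for this example, it assumes a file executable is on the PATH, and InputStream.readAllBytes() requires Java 9 or later.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;

// Hypothetical wrapper around the external `file` utility.
class FileCommand {

    // Runs `file -i <path>` and returns its combined output as a string.
    // Assumption: `-i` prints the MIME type and charset, as on Linux;
    // a comment in this thread reports that macOS needs `-I` instead.
    static String runFile(String path) throws IOException, InterruptedException {
        Process process = new ProcessBuilder("file", "-i", path)
                .redirectErrorStream(true) // merge stderr into stdout
                .start();
        byte[] output = process.getInputStream().readAllBytes(); // Java 9+
        process.waitFor();
        return new String(output, StandardCharsets.UTF_8);
    }

    // Interprets the output the way the answer describes.
    static boolean reportsBinary(String fileOutput) {
        return fileOutput.contains("charset=binary");
    }
}
```

Passing the path as a separate argument avoids shell quoting issues. Because this depends on an external tool, it is not portable to systems without file, such as a default Windows installation.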

Saltus answered 30/5, 2014 at 1:19 Comment(1)
On macOS, I saw with 'file --help' that '-I' should be used to output the MIME type, instead of '-i', which only outputs 'regular file'.Blandishment
