CSV Autodetection in Java
What would be a reliable way of autodetecting that a file is actually CSV, if CSV was redefined to mean "Character-Separated Values", i.e. data using any single character (but typically any non-alphanumeric symbol) as the delimiter and not only commas?

Essentially, with this (re)definition, CSV = DSV ("Delimiter-Separated Values"), discussed, for example, in this Wikipedia article, whereas the "Comma-Separated Values" format is defined in RFC 4180.

More specifically, is there a method for statistically deducing that the data has some sort of "fixed" structure, meaning "possibly CSV"? Just counting the number of delimiters does not always work, because there are CSV files with a variable number of fields per record (i.e., records that, contrary to what RFC 4180 mandates, do not all have the same number of fields within the same file).

CSV recognition seems to be a particularly challenging problem, especially if detection cannot be based on the file extension (e.g., when reading a stream that carries no such information anyway).

Proper ("full") autodetection needs at least 4 decisions to be made reliably:

  1. Detecting that a file is actually CSV
  2. Detecting the presence of headers
  3. Detecting the actual separator character
  4. Detecting special characters (e.g., quotes)

Full autodetection seems to have no single solution, because other kinds of data can resemble CSV (e.g., free text that uses commas), especially for corner cases like variable-length records, single- or double-quoted fields, or multiline records.

So, the best approach seems to be telescopic detection, in which formats that can also be classified as CSV (e.g., log file formats like the Apache CLF) are examined before the application of the CSV detection rules.
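The telescopic idea can be sketched in a few lines of Java: test a sample line against more specific formats first (here, a simplified Apache Common Log Format regex) and only fall back to generic CSV heuristics if nothing matches. The class name, the simplified regex, and the return labels are illustrative assumptions, not part of any standard API.

```java
import java.util.regex.Pattern;

public class TelescopicSniffer {
    // Simplified Apache CLF: host ident authuser [date] "request" status bytes.
    // A real detector would check several such specific formats before CSV.
    private static final Pattern CLF = Pattern.compile(
        "^\\S+ \\S+ \\S+ \\[[^\\]]+\\] \"[^\"]*\" \\d{3} (\\d+|-)$");

    public static String classify(String line) {
        if (CLF.matcher(line).matches()) {
            return "apache-clf";
        }
        // Fall back to generic delimiter-based CSV checks (not shown here).
        return "maybe-csv";
    }

    public static void main(String[] args) {
        String log = "127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "
                   + "\"GET /index.html HTTP/1.0\" 200 2326";
        System.out.println(classify(log));      // apache-clf
        System.out.println(classify("a,b,c"));  // maybe-csv
    }
}
```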

Even commercial applications like Excel seem to rely on the file extension (.csv) in order to decide (1), which is obviously not autodetection, although the problem is greatly simplified if the application is told that the data is CSV.

There are good articles discussing heuristics for (2) and (3).

The detection of (4), the type of quotes, can be based on processing a few lines from the file and looking for corresponding values (e.g., an even number of ' or " per row would mean single or double quotes). Such processing can be done via initializing an existing CSV parser (e.g., OpenCSV) that will take proper care of CSV row separation (e.g., multiline events).
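A minimal sketch of that even-count idea for (4), assuming one record per line (a real implementation would first run rows through a CSV parser such as OpenCSV so multiline records are handled); the class and method names are mine:

```java
public class QuoteSniffer {
    // Guess the quote character: each line with an even, non-zero number of
    // occurrences of a quote character counts as one "vote" for it.
    public static char guessQuote(String[] lines) {
        int dq = 0, sq = 0;
        for (String line : lines) {
            dq += vote(line, '"');
            sq += vote(line, '\'');
        }
        if (dq == 0 && sq == 0) return 0;  // no quoting detected
        return dq >= sq ? '"' : '\'';
    }

    private static int vote(String line, char q) {
        int n = 0;
        for (int i = 0; i < line.length(); i++) {
            if (line.charAt(i) == q) n++;
        }
        return (n > 0 && n % 2 == 0) ? 1 : 0;
    }

    public static void main(String[] args) {
        String[] sample = { "id,text", "1,\"a, b\"", "2,\"c\"" };
        System.out.println(guessQuote(sample)); // "
    }
}
```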

But what about (1), i.e., deciding that the data is CSV in the first place?

Could data mining help in this decision?

Vigilance answered 19/12, 2011 at 19:21 Comment(6)
Is ThIs Comment CSV? If So, How can I determIne the Rule?Quaternary
A single-line phrase could always be considered CSV with the presence of a delimiter (e.g., ',' or even '?' in your comment). There is obviously no 100% solution for all datasets, but actual CSV data will have at least a few records to process, so my question is, given such records, how can you reliably tell?Vigilance
If I provide another lIne? Can I determIne the Rule? What would I need besIde multIple lInes?Quaternary
I should think such a task is more effort than it's worth, especially when you consider that some non-English CSV uses the period as a delimiter because the comma serves as the decimal separator. That was a pain to figure out even by myself, let alone for a computer trying to figure it out automatically.Scalable
There's no generic solution. Best you can hope for is a heuristic, tuned to the applications you use. You can have a 1000-line file that looks good for 999 lines, and then the very last line breaks the pattern.Workable
That would not be an issue, you can decide according to a threshold of successful matches.Vigilance
If you can't constrain what's used as a delimiter, then you can use brute force.

You could iterate through all possible combinations of quote character, column delimiter, and record delimiter (256 × 255 × 254 = 16,581,120 for 8-bit characters).

id,text,date
1,"Bob says, ""hi
..."", with a sigh",1/1/2012

Remove all quoted columns, this can be done with a RegEx replace.

//quick JavaScript example of the regex; you'd replace the quote char with whichever character you're currently testing
var test='id,text,date\n1,"bob, ""hi\n..."", sigh",1/1/2012';
console.log(test.replace(/"(""|.|\n|\r)*?"/gm,""));

id,text,date
1,,1/1/2012
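Since the question is about Java, the same replace can be done with `java.util.regex`; this is a direct translation of the JavaScript snippet (the class and method names are mine):

```java
import java.util.regex.Pattern;

public class QuoteStripper {
    // Lazily match a quoted section, treating "" as an escaped quote,
    // exactly as in the JavaScript regex above.
    private static final Pattern QUOTED =
        Pattern.compile("\"(\"\"|.|\\n|\\r)*?\"");

    public static String stripQuoted(String csv) {
        return QUOTED.matcher(csv).replaceAll("");
    }

    public static void main(String[] args) {
        String test = "id,text,date\n1,\"bob, \"\"hi\n...\"\", sigh\",1/1/2012";
        System.out.println(stripQuoted(test));
        // id,text,date
        // 1,,1/1/2012
    }
}
```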

Split on record delimiter

["id,text,date", "1,,1/1/2012"]

Split records on column delimiter

[ ["id", "text", "date"], ["1", "", "1/1/2012"] ]

If the number of columns per record match you have some CSV confidence.

3 == 3

If the numbers of columns don't match, try another combination of record delimiter, column delimiter, and quote character.
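One iteration of this loop can be sketched in Java as follows. This is a rough sketch under assumptions of my own (the method name, the "at least two columns" requirement, and treating any record-delimiter as `\r?\n` are not part of the answer above):

```java
import java.util.regex.Pattern;

public class BruteForceCheck {
    // For a candidate quote character and column delimiter: strip quoted
    // sections, split into records and fields, and report whether every
    // record has the same field count.
    public static boolean looksLikeCsv(String data, char quote, char delim) {
        String q = Pattern.quote(String.valueOf(quote));
        // Remove quoted sections (doubled quote = escape), as in the regex above.
        String stripped =
            data.replaceAll(q + "(" + q + q + "|.|\\n|\\r)*?" + q, "");
        int expected = -1;
        for (String record : stripped.split("\r?\n")) {
            int fields =
                record.split(Pattern.quote(String.valueOf(delim)), -1).length;
            if (expected == -1) expected = fields;
            else if (fields != expected) return false;
        }
        return expected > 1; // require at least two columns
    }

    public static void main(String[] args) {
        String data = "id,text,date\n"
                    + "1,\"Bob says, \"\"hi\n...\"\", with a sigh\",1/1/2012";
        System.out.println(looksLikeCsv(data, '"', ','));  // true
        System.out.println(looksLikeCsv(data, '"', ';'));  // false
    }
}
```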

EDIT

Actually parsing the data once you have confidence in the delimiters, and checking for column-type uniformity, might be a useful extra step:

  • Are all the columns in the first (header?) row strings?
  • Does column X always parse out to null/empty or a valid value (int, float, date)?

The more CSV data (rows, columns) there is to work with, the more confidence you can extract from this method.
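The column-uniformity idea could look something like this, reduced to the simplest case of "integer or not" (a sketch; real code would also try floats and dates, and the names here are my own):

```java
public class ColumnTypes {
    // Check whether a column consistently parses as an integer
    // (empty values are allowed), skipping the possible header row.
    public static boolean uniformIntColumn(String[][] rows, int col) {
        for (int r = 1; r < rows.length; r++) {
            String v = rows[r][col];
            if (!v.isEmpty() && !v.matches("-?\\d+")) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        String[][] rows = {
            {"id", "text"},
            {"1", "abc"},
            {"2", "def"}
        };
        System.out.println(uniformIntColumn(rows, 0)); // true
        System.out.println(uniformIntColumn(rows, 1)); // false
    }
}
```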

I think this question is kind of silly / overly general. If you have a stream of unknown data, you'd definitely want to check for all of the "low-hanging fruit" first: binary formats usually have fairly distinct header signatures, and then there are XML and JSON as easily detectable text formats.

Earhart answered 19/12, 2011 at 20:3 Comment(10)
What about the ties? If you claim the delimiter is "i" and there are no quotes, it seems to work. Also, if you claim the delimiter is "1" and there are no quotes, that would be equally valid. How can you decide among them?Quaternary
This is a clever approach that, unfortunately, won't work, exactly because of the ties. Essentially, the first match will be returned as the solution, which will probably not be correct, unless you can prioritize the delimiters based on probability of occurrence (and that's a big if).Vigilance
@PNS: That essentially means that the problem is undecidable.Quaternary
Largely yes, because virtually every format that has a structure is CSV. Even applying the "strict" rule that, for example, given a separator character, every record must have exactly the same number of fields, this still matches well-known formats like Apache CLF (with ':' as the separator). So, CSV is something like a superset of many formats. If an application checks for a specific set of formats, then doing the last check for CSV, using some of the rules mentioned here, will work better.Vigilance
@PNS: "some of the rules mentioned here" An important insight. There's no general solution -- it's undecidable. You must impose numerous constraints to "sniff" the CSV format. CSV sniffing is a common algorithm, but it's limited to a subset of all potential CSV's.Quaternary
Hopefully my edit can instill a little more confidence in this method. But all your points are valid and seem to go back to "unconstrained problems are difficult/impossible to solve completely".Earhart
@LastCoder: You could greatly reduce the set of all possible "brute force" attempts by only including symbols as the separator and not numbers or letters (a-z, A-Z). However, you still have the issue of resolving the ties. Furthermore, an equal number of columns does not always denote CSV and, even worse, there are valid CSV files with a variable number of fields per record (e.g., 5 fields on line 10 and 7 fields on line 12). It is obviously a "best effort" problem.Vigilance
I wonder if data mining techniques could help at all for this.Vigilance
@Vigilance - The question stated "any single character", so I assumed not just the symbols. According to Section 2.4 of the RFC for C[omma]SV files ietf.org/rfc/rfc4180.txt each record needs to contain the same number of columns. Perhaps, this isn't the case for C[haracter]SV files.Earhart
You are right, but usually a delimiter will be a symbol and, in practice, there are numerous datasets that have a variable number of fields per row.Vigilance
There are always going to be non-CSV files that look like CSV, and vice versa. For instance, there's the pathological (but perfectly valid) CSV file that frankc posted in the Java link you cited:

Name
Jim
Tom
Bill

The best one can do, I think, is some sort of heuristic estimate of the likelihood that a file is CSV. Some heuristics I can think of are:

  1. There is a candidate separator character that appears on every line (or, if you like, every line splits into more than one token).
  2. Given a candidate separator character, most (but not necessarily all) of the lines have the same number of fields.
  3. The presence of a first line that looks like it might be a header increases the likelihood of the file containing CSV data.

One can probably think up other heuristics. The approach would then be to develop a scoring algorithm based on these. The next step would be to score a collection of known CSV and non-CSV files. If there is a clear-enough separation, then the scoring could be deemed useful and the scores should tell you how to set a detection threshold.
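Heuristics (1) and (2) can be folded into a simple score, for instance the fraction of lines that share the most common field count for each candidate separator. This is one possible scoring sketch, not the answer's algorithm; the candidate set and the "more than one field" requirement are my assumptions:

```java
import java.util.*;
import java.util.regex.Pattern;

public class SeparatorScorer {
    // Score each candidate separator by the share of lines that have the
    // modal field count (requiring more than one field); return the best.
    public static char bestSeparator(List<String> lines, char[] candidates) {
        char best = 0;
        double bestScore = 0.0;
        for (char sep : candidates) {
            Map<Integer, Integer> counts = new HashMap<>();
            for (String line : lines) {
                int fields = line
                    .split(Pattern.quote(String.valueOf(sep)), -1).length;
                counts.merge(fields, 1, Integer::sum);
            }
            for (Map.Entry<Integer, Integer> e : counts.entrySet()) {
                if (e.getKey() < 2) continue; // single-column: give up (see comments)
                double score = e.getValue() / (double) lines.size();
                if (score > bestScore) { bestScore = score; best = sep; }
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("a;b;c", "1;2;3", "4;5;6");
        System.out.println(bestSeparator(lines, new char[]{',', ';', '\t'})); // ;
    }
}
```

Scoring known CSV and non-CSV corpora with such a function is then what tells you where to set the detection threshold.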

Bixler answered 19/12, 2011 at 19:38 Comment(8)
The "candidate separator" -- I think -- is undecidable. Every character -- even if it doesn't occur -- is a candidate separator. That means that each file has 128 ASCII (or 65,000 Unicode) separators, all equally valid. What makes one separator a better choice than another?Quaternary
@Quaternary - Suppose the file contains neither ! nor @. There's nothing to be gained by including both of them in the list of candidate separators; you just need a single non-occurring character to test that case. (Alternatively--and more sensibly, I think--one can just give up on detecting single-column CSV files, since every file will look like a single-column CSV file for any separator character that doesn't occur in the file.)Bixler
"every file will look like a single-column CSV file for any separator character that doesn't occur in the file". That seems to make the whole thing undecidable. I can't see how to proceed given that fact.Quaternary
@Quaternary - Like I suggested: proceed by giving up on detecting that case.Bixler
"every file will look like a single-column CSV file". That means give up on every file, right? It's undecidable, isn't it?Quaternary
@Quaternary - No, it means giving up on single-column CSV files. There's still the very useful question of whether a file looks like a multi-column CSV.Bixler
Indeed, many files look like CSV (and are CSV as well), although they also comply with a less generic format, as I note in a comment to the previous answer. For example, Apache log files have a fixed structure with a fixed number of ':' characters per line, so they can be considered CSV.Vigilance
@PNS: "a comment to the previous answer" doesn't mean anything. Please update the question to contain all the facts. The answers have no inherent order; "Previous" doesn't mean anything.Quaternary

© 2022 - 2024 — McMap. All rights reserved.