What would be a reliable way of autodetecting that a file is actually CSV, if CSV was redefined to mean "Character-Separated Values", i.e. data using any single character (but typically any non-alphanumeric symbol) as the delimiter and not only commas?
Essentially, with this (re)definition, CSV = DSV ("Delimiter-Separated Values"), discussed, for example, in this Wikipedia article, whereas the "Comma-Separated Values" format is defined in RFC 4180.
More specifically, is there a method for statistically deducing that the data has some "fixed" structure, meaning "possible CSV"? Just counting the number of delimiters does not always work, because there are CSV files with variable numbers of fields per record (i.e., records that, contrary to what RFC 4180 mandates, do not have the same number of fields across the same file).
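As one illustration of such a statistical test, a minimal sketch (my own heuristic, not a complete detector): compute the per-line count of a candidate delimiter and measure how consistent it is. A coefficient of variation near 0 suggests fixed-structure records; it is exactly the kind of check that fails for the variable-length records mentioned above, so it can only vote, not decide.

```python
import statistics

def delimiter_consistency(lines, delimiter):
    """Coefficient of variation of the per-line delimiter count.

    Near 0 -> the delimiter appears a fixed number of times per line,
    hinting at tabular data. Higher values -> irregular, e.g. free text.
    (Illustrative heuristic only; quoted fields are not handled here.)
    """
    counts = [line.count(delimiter) for line in lines if line.strip()]
    if not counts or statistics.mean(counts) == 0:
        return float("inf")  # delimiter absent: not this separator
    return statistics.pstdev(counts) / statistics.mean(counts)

sample = ["a,b,c", "1,2,3", "4,5,6"]
print(delimiter_consistency(sample, ","))  # 0.0 -> perfectly consistent
print(delimiter_consistency(sample, ";"))  # inf -> ';' never occurs
```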
CSV recognition seems to be a particularly challenging problem, especially if detection cannot be based on the file extension (e.g., when reading a stream that does not carry such information anyway).
Proper ("full") autodetection needs at least 4 decisions to be made reliably:
- Detecting that a file is actually CSV
- Detecting the presence of headers
- Detecting the actual separator character
- Detecting special characters (e.g., quotes)
Full autodetection seems to have no single solution, due to the similarities of other datasets (e.g., free text that uses commas), especially for corner cases like variable length records, single or double quoted fields, or multiline records.
So, the best approach seems to be telescopic detection, in which formats that can also be classified as CSV (e.g., log file formats like the Apache CLF) are examined before the application of the CSV detection rules.
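A minimal sketch of this telescopic idea (the CLF regex is a simplified assumption, not a full grammar for Apache logs): test the more specific format first, and only fall through to CSV heuristics if it does not match.

```python
import re

# Rough pattern for the Apache Common Log Format (simplified assumption):
# host ident authuser [date] "request" status bytes
CLF_RE = re.compile(r'^\S+ \S+ \S+ \[[^\]]+\] "[^"]*" \d{3} (\d+|-)$')

def looks_like_clf(lines, threshold=0.9):
    """True if at least `threshold` of the non-empty sampled lines match CLF."""
    lines = [l for l in lines if l.strip()]
    if not lines:
        return False
    hits = sum(1 for l in lines if CLF_RE.match(l))
    return hits / len(lines) >= threshold

def telescopic_detect(lines):
    """Check formats that would also pass CSV rules before CSV itself."""
    if looks_like_clf(lines):
        return "apache-clf"
    # ... fall through to the CSV detection heuristics ...
    return "maybe-csv"

log = ['127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
       '"GET /apache_pb.gif HTTP/1.0" 200 2326']
print(telescopic_detect(log))          # apache-clf
print(telescopic_detect(["a,b,c"]))    # maybe-csv
```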
Even commercial applications like Excel seem to rely on the file extension (.csv) in order to decide on (1), which is obviously not autodetection, although the problem is greatly simplified if the application is told that the data is CSV.
Here are some good relevant articles discussing heuristics for (2) and (3):
- Autodetection of headers (Java)
- Autodetection of separator (C#)
- Autodetection of headers and separator (Python)
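For (2) and (3) specifically, Python's standard library already ships the heuristic from the last article as `csv.Sniffer`, which guesses the dialect (including the separator) and the presence of a header from a sample:

```python
import csv

sample = "name;age;city\nalice;30;berlin\nbob;25;paris\n"

sniffer = csv.Sniffer()
dialect = sniffer.sniff(sample)    # raises csv.Error if it cannot decide
print(dialect.delimiter)           # ';'
print(sniffer.has_header(sample))  # heuristic: compares row 1 with the rest
```

Note that `sniff()` presupposes decision (1): handed free text, it either raises `csv.Error` or returns a spurious dialect, so it answers (2) and (3) but not the original question.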
The detection of (4), the type of quotes, can be based on processing a few lines from the file and looking for matching pairs (e.g., an even number of ' or " per row would suggest single or double quoting). Such processing can be done by initializing an existing CSV parser (e.g., OpenCSV) that will take proper care of CSV row separation (e.g., multiline records).
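The even-count idea can be sketched as follows (an assumption-laden heuristic: it counts raw characters per line, so a real implementation should first split rows with a CSV parser to survive multiline records and embedded quotes):

```python
def guess_quote_char(lines, candidates=('"', "'")):
    """Guess the quote character: the candidate occurring a nonzero,
    even number of times on the most lines. Rough heuristic only."""
    best, best_score = None, 0
    for q in candidates:
        score = sum(
            1 for line in lines
            if (n := line.count(q)) and n % 2 == 0
        )
        if score > best_score:
            best, best_score = q, score
    return best  # None if neither candidate ever pairs up

rows = ['"a","b,c",1', '"d","e",2', 'f,"g",3']
print(guess_quote_char(rows))  # '"'
```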
But what about (1), i.e., deciding that the data is CSV in the first place?
Could data mining help in this decision?