How should I detect which delimiter is used in a text file?
Asked Answered
O

13

26

I need to be able to parse both CSV and TSV files. I can't rely on the users to know the difference, so I would like to avoid asking the user to select the type. Is there a simple way to detect which delimiter is in use?

One way would be to read in every line and count both tabs and commas and find out which is most consistently used in every line. Of course, the data could include commas or tabs, so that may be easier said than done.

Edit: Another fun aspect of this project is that I will also need to detect the schema of the file when I read it in because it could be one of many. This means that I won't know how many fields I have until I can parse it.

Osculation answered 17/4, 2009 at 19:52 Comment(0)
M
14

You could show them the results in preview window - similar to the way Excel does it. It's pretty clear when the wrong delimiter is being used in that case. You could then allow them to select a range of delimiters and have the preview update in real time.

Then you could just make a simple guess as to the delimiter to start with (e.g. does a comma or a tab come first).

Mertiemerton answered 17/4, 2009 at 19:56 Comment(2)
Showing the user the result before import is a good move I think, but guessing intelligently is great for userexperience too. So the combination is really nice!Churinga
a suggestion-If you're doing a preview window and you want to "guess" which is the correct delimiter, then you could split on one possible delim. and see if the first ten lines all have the same number of fields, compare with all of the other normal delims. It's a good bet one works with the same number of fields all the way through. As Jon Skeet said it's perfectly possible it's a valid comma delimited AND tab delimited, but that tab was the intended choice.Manning
M
19

In Python, there is a Sniffer class in the csv module that can be used to guess a given file's delimiter and quote characters. Its strategy is (quoted from csv.py's docstrings):


[First, look] for text enclosed between two identical quotes (the probable quotechar) which are preceded and followed by the same character (the probable delimiter). For example:

         ,'some text',

The quote with the most wins, same with the delimiter. If there is no quotechar the delimiter can't be determined this way.

In that case, try the following:

The delimiter should occur the same number of times on each row. However, due to malformed data, it may not. We don't want an all or nothing approach, so we allow for small variations in this number.

  1. build a table of the frequency of each character on every line.
  2. build a table of freqencies of this frequency (meta-frequency?), e.g. 'x occurred 5 times in 10 rows, 6 times in 1000 rows, 7 times in 2 rows'
  3. use the mode of the meta-frequency to determine the expected frequency for that character
  4. find out how often the character actually meets that goal
  5. the character that best meets its goal is the delimiter

For performance reasons, the data is evaluated in chunks, so it can try and evaluate the smallest portion of the data possible, evaluating additional chunks as necessary.


I'm not going to quote the source code here - it's in the Lib directory of every Python installation.

Remember that CSV can also use semicolons instead of commas as delimiters (e. g. in German versions of Excel, CSVs are semicolon-delimited because commas are used as decimal separators in Germany...)

Massy answered 17/4, 2009 at 21:40 Comment(0)
M
14

You could show them the results in preview window - similar to the way Excel does it. It's pretty clear when the wrong delimiter is being used in that case. You could then allow them to select a range of delimiters and have the preview update in real time.

Then you could just make a simple guess as to the delimiter to start with (e.g. does a comma or a tab come first).

Mertiemerton answered 17/4, 2009 at 19:56 Comment(2)
Showing the user the result before import is a good move I think, but guessing intelligently is great for userexperience too. So the combination is really nice!Churinga
a suggestion-If you're doing a preview window and you want to "guess" which is the correct delimiter, then you could split on one possible delim. and see if the first ten lines all have the same number of fields, compare with all of the other normal delims. It's a good bet one works with the same number of fields all the way through. As Jon Skeet said it's perfectly possible it's a valid comma delimited AND tab delimited, but that tab was the intended choice.Manning
B
7

I ran into a similar need and thought I would share what I came up with. I haven't run a lot of data through it yet, so there are possible edge cases. Also, keep in mind the goal of this function isn't 100% certainty of the delimiter, but best guess to be presented to user.

/// <summary>
/// Analyze the given lines of text and try to determine the correct delimiter used. If multiple
/// candidate delimiters are found, the highest frequency delimiter will be returned.
/// </summary>
/// <example>
/// string discoveredDelimiter = DetectDelimiter(dataLines, new char[] { '\t', '|', ',', ':', ';' });
/// </example>
/// <param name="lines">Lines to inspect</param>
/// <param name="delimiters">Delimiters to search for</param>
/// <returns>The most probable delimiter by usage, or null if none found.</returns>
public string DetectDelimiter(IEnumerable<string> lines, IEnumerable<char> delimiters) {
  Dictionary<char, int> delimFrequency = new Dictionary<char, int>();

  // Setup our frequency tracker for given delimiters
  delimiters.ToList().ForEach(curDelim => 
    delimFrequency.Add(curDelim, 0)
  );

  // Get a total sum of all occurrences of each delimiter in the given lines
  delimFrequency.ToList().ForEach(curDelim => 
    delimFrequency[curDelim.Key] = lines.Sum(line => line.Count(p => p == curDelim.Key))
  );

  // Find delimiters that have a frequency evenly divisible by the number of lines
  // (correct & consistent usage) and order them by largest frequency
  var possibleDelimiters = delimFrequency
                    .Where(f => f.Value > 0 && f.Value % lines.Count() == 0)
                    .OrderByDescending(f => f.Value)
                    .ToList();

  // If more than one possible delimiter found, return the most used one
  if (possibleDelimiters.Any()) {
    return possibleDelimiters.First().Key.ToString();
  }
  else {
    return null;
  }   

}
Backslide answered 3/2, 2015 at 23:55 Comment(0)
K
4

Do you know how many fields should be present per line? If so, I'd read the first few lines of the file and check based on that.

In my experience, "normal" data quite often contains commas but rarely contains tab characters. This would suggest that you should check for a consistent number of tabs in the first few lines, and go with that choice as a preferred guess. Of course, it depends on exactly what data you've got.

Ultimately, it would be quite possible to have a file which is completely valid for both formats - so you can't make it absolutely foolproof. It'll have to be a "best effort" job.

Khartoum answered 17/4, 2009 at 19:56 Comment(0)
C
3

Just read a few lines, count the number of commas and the number of tabs and compare them. If there's 20 commas and no tabs, it's in CSV. If there's 20 tabs and 2 commas (maybe in the data), it's in TSV.

Commodus answered 17/4, 2009 at 20:2 Comment(0)
N
3

It's in PHP but this seems to be quite reliable:

$csv = 'something;something;something
someotherthing;someotherthing;someotherthing
';
$candidates = array(',', ';', "\t");
$csvlines = explode("\n", $csv);
foreach ($candidates as $candidatekey => $candidate) {
 $lastcnt = 0;
 foreach ($csvlines as $csvline) {
  if (strlen($csvline) <= 2) continue;
  $thiscnt = substr_count($csvline, $candidate);
  if (($thiscnt == 0) || ($thiscnt != $lastcnt) && ($lastcnt != 0)) {
   unset($candidates[$candidatekey]);
   break;
  }
  $lastcnt = $thiscnt;
 }
}
$delim = array_shift($candidates);
echo $delim;

What it does is the following: For every specified possible delimiter, it reads every line in the CSV and checks if the number of times each seperator occurs is constant. If not, the candidate seperator is removed and ultimately you should end up with one seperator.

Nonpareil answered 23/8, 2010 at 22:0 Comment(0)
V
2

I'd imagine that your suggested solution would be the best way to go. In a well-formed CSV or TSV file, the number of commas or tabs respectively per line should be constant (no variation at all). Do a count of each for every line of the file, and check which one is constant for all lines. It would seem quite unlikely that the count of both delimeters for each line is identical, but in this inconceivably rare case, you could of course prompt the user.

If neither the number of tabs nor commas is constant, then display a message to the user telling them that the file is malformed but the program thinks it is a (whatever format has the lowest standard deviation of delimeters per line) file.

Verbose answered 17/4, 2009 at 19:58 Comment(0)
D
2

In my experience, data rarely contains tabs, so a line of tab delimited fields would (generally) be fairly obvious.

Commas are more difficult, though - especially if you're reading data in non-US locales. Numerical data can contain huge numbers of commas if you're reading files generated out of country, since floating point numbers will often contain them.

In the end, the only safe thing, though, is usually to try, then present it to the user and allow them to adjust, especially if your data will contain commas and/or tabs.

Dinnie answered 17/4, 2009 at 19:59 Comment(0)
W
2

I would assume that in normal text, tabs are very rare except as the first character(s) on a line -- think indented paragraphs or source code. I think if you find embedded tabs (i.e. ones that don't follow commas), you can assume that the tabs are being used as the delimiters and be correct most of the time. This is just a hunch, not verified with any research. I'd of course give the user the option to override the auto-calculated mode.

Wolcott answered 17/4, 2009 at 20:1 Comment(0)
F
1

There is no "efficient" way.

Figment answered 17/4, 2009 at 19:58 Comment(0)
C
1

Assuming that there are a fixed number of fields per line and that any commas or tabs within values are enclosed by quotes ("), you should be able to work it out on the frequency of each character in each line. If the fields aren't fixed, this is harder, and if quotes aren't used to enclose otherwise delimiting characters, it will be, I suspect, near impossible (and depending on the data, locale-specific).

Cypher answered 17/4, 2009 at 19:58 Comment(0)
U
1

Assuming you have a standard set of columns you are going to expect...

I would use FileHelper (open source project on SourceForge). http://filehelpers.sourceforge.net/

Define two reader templates, one for comas, one for tabs.

If the first one fails, try the second.

Ursala answered 17/4, 2009 at 21:9 Comment(1)
This is interesting. I will be reading in multiple schemas and trying to figure out which schema the current file is based on the file layout (number of fields, field order, etc.). Can the FileHelper determine which class to use at runtime?Osculation
Z
0

You can check whether a line is using one delimiter or another like this:

while ((line = readFile.ReadLine()) != null)
{
    if (line.Split('\t').Length > line.Split(',').Length) // tab delimited or comma delimited?
        row = line.Split('\t');
    else
        row = line.Split(',');

    parsedData.Add(row);
}
Zandra answered 15/8, 2013 at 12:14 Comment(1)
What if it's tab-delimited with a bunch of commas in the data, or vice versa? This potentially could attempt to parse the same file in both tab-delimited or comma-delimited formats based on the data in the lines.Osculation

© 2022 - 2024 — McMap. All rights reserved.