Tags: java, csv, data-mining, autodiscovery

CSV Autodetection in Java


What would be a reliable way of autodetecting that a file is actually CSV, if CSV were redefined to mean "Character-Separated Values", i.e., data using any single character (typically a non-alphanumeric symbol) as the delimiter, not only commas?

Essentially, with this (re)definition, CSV = DSV ("Delimiter-Separated Values"), discussed, for example, in this Wikipedia article, whereas the "Comma-Separated Values" format is defined in RFC 4180.

More specifically, is there a method for statistically deducing that the data is of somehow "fixed" length, meaning "possible CSV"? Just counting the number of delimiters does not always work, because there are CSV files with variable numbers of fields per record (i.e., records that, contrary to what RFC 4180 mandates, do not have the same number of fields across the same file).
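For reference, a minimal Java sketch of that counting heuristic (the candidate delimiter set and the uniform-count criterion are illustrative assumptions, not part of any standard) makes the failure mode concrete:

```java
import java.util.List;

// Naive delimiter-consistency heuristic: for each candidate delimiter,
// count its occurrences on every line; a candidate with a constant,
// non-zero per-line count is a plausible separator.
public class DelimiterSniffer {

    static final char[] CANDIDATES = {',', ';', '\t', '|', ':'};

    /** Returns the candidate with a uniform, non-zero per-line count,
     *  or -1 if none qualifies. */
    public static int sniff(List<String> lines) {
        for (char candidate : CANDIDATES) {
            int expected = count(lines.get(0), candidate);
            if (expected == 0) continue;
            boolean uniform = true;
            for (String line : lines) {
                if (count(line, candidate) != expected) {
                    uniform = false;
                    break;
                }
            }
            if (uniform) return candidate;
        }
        return -1;
    }

    private static int count(String s, char c) {
        int n = 0;
        for (int i = 0; i < s.length(); i++) if (s.charAt(i) == c) n++;
        return n;
    }
}
```

With variable-length records, no candidate has a uniform count and the method returns -1 even for genuine CSV, which is exactly the weakness described above.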

CSV recognition seems to be a particularly challenging problem, especially if detection cannot be based on the file extension (e.g., when reading a stream that does not carry such information anyway).

Proper ("full") autodetection needs at least 4 decisions to be made reliably:

  1. Detecting that a file is actually CSV
  2. Detecting the presence of headers
  3. Detecting the actual separator character
  4. Detecting special characters (e.g., quotes)

Full autodetection seems to have no single solution, due to the similarity of other data (e.g., free text that uses commas), especially for corner cases like variable-length records, single- or double-quoted fields, or multiline records.

So, the best approach seems to be telescopic detection, in which formats that can also be classified as CSV (e.g., log file formats like the Apache CLF) are examined before the application of the CSV detection rules.

Even commercial applications like Excel seem to rely on the file extension (.csv) to decide (1), which is obviously not autodetection, although the problem is greatly simplified if the application is told that the data is CSV.

There are some good articles discussing heuristics for (2) and (3).

The detection of (4), the type of quotes, can be based on processing a few lines from the file and looking for corresponding values (e.g., an even number of ' or " characters per row would suggest single or double quotes). Such processing can be done by initializing an existing CSV parser (e.g., OpenCSV) that will take proper care of CSV row separation (e.g., multiline records).
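A minimal sketch of that even-quote-count check, assuming rows have already been separated (naively here; a real parser such as OpenCSV would be needed to handle multiline quoted fields correctly):

```java
import java.util.List;

// Even-quote-count heuristic: sample a few rows and check whether a
// candidate quote character occurs an even number of times in each.
public class QuoteDetector {

    /** Returns '"' or '\'' if its per-row count is even everywhere and
     *  non-zero on at least one row, or -1 if neither fits. */
    public static int detectQuote(List<String> rows) {
        for (char q : new char[] {'"', '\''}) {
            boolean seen = false, allEven = true;
            for (String row : rows) {
                int n = 0;
                for (int i = 0; i < row.length(); i++) {
                    if (row.charAt(i) == q) n++;
                }
                if (n % 2 != 0) { allEven = false; break; }
                if (n > 0) seen = true;
            }
            if (seen && allEven) return q;
        }
        return -1;
    }
}
```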

But what about (1), i.e., deciding that the data is CSV in the first place?

Could data mining help in this decision?


Solution

  • If you can't constrain what's used as a delimiter, you can use brute force.

    You could iterate through all possible combinations of quote character, column delimiter, and record delimiter (256 * 255 * 254 = 16,581,120 for single-byte characters).

    id,text,date
    1,"Bob says, ""hi
    ..."", with a sigh",1/1/2012
    

    Remove all quoted columns; this can be done with a regex replace.

    //quick JavaScript example of the regex; you'd replace the quote character with whichever character you're currently testing
    var test='id,text,date\n1,"bob, ""hi\n..."", sigh",1/1/2012';
    console.log(test.replace(/"(""|.|\n|\r)*?"/gm,""));
    
    id,text,date
    1,,1/1/2012
    

    Split on record delimiter

    ["id,text,date", "1,,1/1/2012"]
    

    Split records on column delimiter

    [ ["id", "text", "date"], ["1", "", "1/1/2012"] ]
    

    If the number of columns per record matches, you have some confidence that the data is CSV.

    3 == 3
    

    If the column counts don't match, try another combination of record delimiter, column delimiter, and quote character.
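    The steps above, for one candidate combination, can be sketched in Java roughly like this (class and method names are illustrative, not from any library):

```java
import java.util.regex.Pattern;

// For one (quote, column delimiter, record delimiter) combination:
// blank out quoted fields, split into records, split records into
// columns, and report whether every record has the same column count.
public class CsvConfidence {

    /** True if, after removing quoted fields, all records have the same
     *  number of columns (a weak signal that the data is CSV-like). */
    public static boolean uniformColumns(String data, char quote,
                                         char colDelim, String recDelim) {
        // Java equivalent of the JavaScript regex above: a quoted field
        // is quote, then doubled quotes or any character (lazily), then quote.
        String q = Pattern.quote(String.valueOf(quote));
        String unquoted = data.replaceAll(
            "(?s)" + q + "(?:" + q + q + "|.)*?" + q, "");

        String[] records = unquoted.split(Pattern.quote(recDelim));
        int expected = -1;
        for (String record : records) {
            int cols = record.split(
                Pattern.quote(String.valueOf(colDelim)), -1).length;
            if (expected == -1) expected = cols;
            else if (cols != expected) return false;
        }
        return expected > 1;  // a single column is not much evidence
    }
}
```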

    EDIT

    Actually parsing the data once you are confident about the delimiters, and checking for column-type uniformity, might be a useful extra step.

    The more CSV data (rows, columns) there is to work with, the more confidence you can extract from this method.

    I think this question is kind of silly / overly general. If you have a stream of unknown data, you'd definitely want to check for all of the "low-hanging fruit" first. Binary formats usually have fairly distinct header signatures; then there's XML and JSON for easily detectable text formats.
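    A sketch of such low-hanging-fruit checks, with a small illustrative sample of signatures (by no means exhaustive):

```java
import java.nio.charset.StandardCharsets;

// Check well-known magic bytes and obvious text-format markers first,
// before falling back to CSV heuristics.
public class FormatSniffer {

    public static String sniff(byte[] head) {
        if (startsWith(head, new byte[] {(byte) 0x89, 'P', 'N', 'G'})) return "png";
        if (startsWith(head, new byte[] {'P', 'K', 3, 4})) return "zip";
        if (startsWith(head, new byte[] {'%', 'P', 'D', 'F'})) return "pdf";
        String text = new String(head, StandardCharsets.UTF_8).trim();
        if (text.startsWith("<?xml") || text.startsWith("<")) return "xml";
        if (text.startsWith("{") || text.startsWith("[")) return "json";
        return "unknown";  // fall through to CSV heuristics
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) return false;
        }
        return true;
    }
}
```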