google-cloud-dataprep

Dataprep import dataset does not detect headers in first row automatically


I am importing a dataset from Google Cloud Storage (parameterized) into Dataprep. So far this has worked perfectly fine, and one of the features I like is that it auto-detects that the first row of my (application/octet-stream) .csv file contains the headers.

However, today I tried to import a new dataset and it did not detect the headers; instead it auto-assigned column1, column2...

What has changed, and why is this the case? I have checked the auto-detect box and use UTF-8.


Solution

  • While the auto-detect option is usually pretty good, there are times when it fails, for a number of reasons. I've specifically noticed this when the field names contain certain characters (e.g. commas, invisible characters like zero-width non-joiners, null bytes), or when different styles of newline delimiters are mixed within the same file.

    Another case where I have seen this is when there were more columns of data than there were headers.
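The character and newline problems described above can be checked for outside Dataprep. The following is a minimal Python sketch, not anything Dataprep itself provides: the `SUSPICIOUS` table and the function names are my own illustrative choices, and the list of characters is not exhaustive.

```python
import re

# Illustrative (not exhaustive) set of characters that can hide in a header row.
SUSPICIOUS = {
    "\x00": "null byte",
    "\u200b": "zero-width space",
    "\u200c": "zero-width non-joiner",
    "\u200d": "zero-width joiner",
}


def newline_styles(raw: bytes) -> set:
    """Return the newline styles (CRLF, LF, CR) present in the raw bytes."""
    styles = set()
    if b"\r\n" in raw:
        styles.add("CRLF")
    # Remove CRLF pairs first so their halves are not counted as bare CR or LF.
    rest = raw.replace(b"\r\n", b"")
    if b"\n" in rest:
        styles.add("LF")
    if b"\r" in rest:
        styles.add("CR")
    return styles


def header_issues(raw: bytes) -> list:
    """Return (position, description) for each suspicious character in the first line."""
    text = raw.decode("utf-8", errors="replace")
    header = re.split(r"\r\n|\r|\n", text, maxsplit=1)[0]
    return [(i, SUSPICIOUS[ch]) for i, ch in enumerate(header) if ch in SUSPICIOUS]


def inspect(raw: bytes) -> dict:
    """Summarise newline styles and header-row problems for a CSV file's bytes."""
    return {"newlines": newline_styles(raw), "header_issues": header_issues(raw)}
```

To use it, read the file in binary mode (for example `inspect(open(path, "rb").read())`, where `path` is whatever local copy of the file you have). More than one entry in `newlines`, or any entry in `header_issues`, would be worth fixing before re-importing.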

    As you already hit on, you can use the following snippet to do mostly the same thing:

    rename type: header method: filter sanitize: true
    

    . . . or make separate recipe steps to convert the first row to header and then bulk-rename to your own liking.

    More often than not, however, I've found that when auto-detect fails on a previously working file, it tends to be a sign of some sort of issue with the source file. I would look for mismatched data and misplaced commas within the output, and compare the header and some data rows to the original source using a plaintext editor.
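As a complement to eyeballing the file in an editor, a short script can flag rows whose field count differs from the header, which is how misplaced commas and extra columns usually show up. This is only a sketch using Python's standard `csv` module; the function name is hypothetical, and it assumes a comma delimiter and standard double-quote quoting, which may not match what the generating system produces.

```python
import csv
import io


def column_mismatches(text: str, delimiter: str = ",") -> list:
    """Return (line_number, field_count) for rows whose width differs from the header."""
    # newline="" leaves line endings untouched so the csv module handles them itself.
    reader = csv.reader(io.StringIO(text, newline=""), delimiter=delimiter)
    header = next(reader, None)
    if header is None:
        return []  # empty input, nothing to compare
    expected = len(header)
    # Blank lines come back as empty rows and are skipped rather than reported.
    return [
        (reader.line_num, len(row))
        for row in reader
        if row and len(row) != expected
    ]
```

For a file on disk, something like `column_mismatches(open(path, newline="", encoding="utf-8").read())` should work, with `path` standing in for your own file. An empty result does not prove the file is clean, but any reported line is a good place to start comparing against the original source.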

    When all else fails, you can try a CSV validator . . . but in my experience they tend to be incredibly opinionated when it comes to the formatting options of the file, so depending on the system generating the CSV, a validator could either miss real errors or give false positives. I have also had two experiences where auto-detect failed for no apparent reason on perfectly clean files, so it is possible that the detection step was simply skipped.

    It should also be noted that if you have a structured file that was correctly detected but want to revert it, you can go to the dataset details, select the "..." (More) button, and choose "Remove structure..." (I'm hoping that one day they'll let you do the opposite when you want to add structure to a raw dataset or work around bugs like this!)

    Best of luck!