apache-spark, pyarrow, apache-arrow, apache-arrow-cpp

Reading a CSV file with a header and a tail line into Apache Arrow


I have an issue reading a CSV flat file into Apache Arrow. The file has a header and a tail. The header takes just one line, so it is easily dealt with: just use the skip or skip_rows argument of whichever Arrow reader function you are using. But the tail (the last line of the file) causes a mismatch between its column count and that of the rest of the data in the file.
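For illustration, here is a minimal sketch of the header-skipping options with the Python (pyarrow) API; autogenerate_column_names is an assumption on my part, since the file has no column-name row:

    from pyarrow import csv

    # Skip the one-line header; the file carries no column-name row,
    # so let Arrow generate names (f0, f1, ...).
    read_opts = csv.ReadOptions(skip_rows=1, autogenerate_column_names=True)
    parse_opts = csv.ParseOptions(delimiter=';')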

I've looked for an argument of the reader functions that limits the number of lines to read, but none exists. That could have been a solution, as the header contains the number of lines in the file (the tail contains it as well).

Here is a sample of the file: three data lines (delimited by ";"), preceded by the header (the first line) and followed by the tail (the last line):

RH NEG                 2019-08-13 2019-08-13 001809338
2024-05-13;XYZD3                                            ;0000000040; 000000000015.000000;000000000000000100;10:00:00.000;1;2019-08-13;000861000975325;000000010296678;2;2019-08-13;000861000975326;000000010296679;2;0;00000003;00000003
2024-05-14;ABCD3                                             ;0000000060; 000000000015.000000;000000000000000100;10:00:00.000;1;2019-08-13;000861000984860;000000010296682;2;2019-08-13;000861000976050;000000010296683;2;0;00000003;00000023
2024-05-15;XXT5                                             ;0000000080; 000000000014.970000;000000000000000100;10:06:34.630;1;2019-08-13;000861001025610;000000010380862;1;2019-08-13;000861001017226;000000010380863;2;0;00000090;00000072
RT NEG                 2019-08-13 2019-08-13 001809338

One solution, which modifies the input file, is to strip the last line with the Linux utility head before calling Arrow's reader function:

$ head -n -1 file.txt > temp && mv temp file.txt

The command above removes the last line of the file. Modifying the file is undesirable and inconvenient, so that solution is discarded.

What can be done to read a file with a tail line into Apache Arrow without modifying the file?

It would be best if this could be done entirely with the R API. The file is 50 GB, so it cannot be loaded into memory (by R or Python, say).


Solution

  • You can specify how rows with the wrong number of columns are handled through the invalid_row_handler field of the reader's parse_options. In your case you just want to skip them, so the handler function is simply:

    def skip_comment(row):
        # Called once per malformed row; returning 'skip' drops it.
        return 'skip'
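Putting it together, a minimal sketch with the Python (pyarrow) API (file.txt is a placeholder path; the streaming open_csv reader is used so the 50 GB file is never fully loaded into memory):

    from pyarrow import csv

    def skip_comment(row):
        # row is a pyarrow.csv.InvalidRow describing the malformed line;
        # returning 'skip' drops it, 'error' would raise instead.
        return 'skip'

    read_opts = csv.ReadOptions(skip_rows=1,  # drop the one-line header
                                autogenerate_column_names=True)
    parse_opts = csv.ParseOptions(delimiter=';',
                                  invalid_row_handler=skip_comment)  # drops the tail

    reader = csv.open_csv('file.txt', read_options=read_opts,
                          parse_options=parse_opts)
    for batch in reader:
        ...  # process each pyarrow.RecordBatch here

Since open_csv returns a streaming reader, only one record batch is materialized at a time, and the mismatched tail row never reaches the output.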