csvgoogle-bigquerygoogle-cloud-storage

Remove last lines in csv file with "Totals" info


I have a large csv file that I'm saving to Google Cloud Storage with the following sample format:

id,revenue
1,100
2,100
...
Totals
All,12000

Or:

title,profit
titanic,200
avatar,400
fox total,600
paramount total,400
grand total,1000

I want to be able to remove all lines after and including the Totals, because that isn't really valid information within the csv file (it's more summation information of the csv below the data).

Is there a way to delete this? Currently, we're saving the csv file to Google Cloud Storage and then importing it directly via csv-import in Big Query. Perhaps, there's a way to append row-number to a csv file when importing csv via BigQuery or something similar, so we could do a delete statement at the end?


Solution

  • What if you delete the MAX() row after importing?

    #standardSQL
    DELETE
    FROM `project.dataset.table` 
    WHERE revenue = (
      SELECT revenue
      FROM `project.dataset.table` 
      ORDER BY taxi_trips DESC
      LIMIT 1
    )
    

    Or go for the id "All":

    #standardSQL
    DELETE
    FROM `project.dataset.table` 
    WHERE id = "All"