Tags: php, csv, fgetcsv

How to parse a CSV file that contains 15 million lines of data in PHP


I have a script that parses the CSV file and then verifies the email addresses it contains. This works fine for 1,000 lines, but on 15 million lines it fails with a memory-exhausted error. The file size is 400 MB. Any suggestions on how to parse and verify them?

Server specs: Core i7 with 32 GB of RAM

function parse_csv($file_name, $delimeter=',') {
  $header = false;
  $row_count = 0;
  $data = [];

  //  clear any previous results
  reset_parse_csv();

  // parse
  $file = fopen($file_name, 'r');
  while (!feof($file)) {
    $row = fgetcsv($file, 0, $delimeter);
    if ($row == [NULL] || $row === FALSE) { continue; }
    if (!$header) {
      $header = $row;
    } else {
      $data[] = array_combine($header, $row);
      $row_count++;
    }
  }
  fclose($file);

  return ['data' => $data, 'row_count' => $row_count];

}

function reset_parse_csv() {
  $header = false;
  $row_count = 0;
  $data = [];    
}

Solution

  • Pushing a large dataset (file lines, etc.) into an array increases memory usage, and that usage is directly proportional to the number of items handled. So the bigger the file, the bigger the memory usage in this case.
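
    A quick way to see this on a smaller sample is to measure peak memory around the call to the original parse_csv(); the file name below is just a placeholder, not from the question:

    $before = memory_get_peak_usage(true);
    $result = parse_csv('sample.csv');   // builds the full $data array in memory
    $after  = memory_get_peak_usage(true);
    // the increase grows roughly in line with the number of rows collected
    echo "rows: {$result['row_count']}, extra peak memory: " . ($after - $before) . " bytes\n";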

    So instead of collecting the data into an array, it is better to process it in place:

    $header = false;
    $file = fopen($file_name, 'r');
    while (!feof($file)) {
        $row = fgetcsv($file, 0, $delimeter);
        if ($row == [NULL] || $row === FALSE) { continue; }
        if (!$header) {
            $header = $row;
        } else {
            // do whatever is intended to be done with the row,
            // instead of $data[] = array_combine($header, $row);
            do_something($row);
        }
    }
    fclose($file);
    

    The major difference here is that you do not load and consume all the data at once. You get items on demand (like a stream) and process them one at a time instead, which has a huge impact on memory usage.
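
    For example, if the per-row work is e-mail verification, do_something() could look roughly like the sketch below. This is only an illustration: the assumption that the address sits in the first column, and the filter_var() check, are not taken from the question.

    // hypothetical per-row handler, for illustration only
    function do_something(array $row)
    {
        // assumption: the e-mail address is in the first CSV column
        $email = $row[0] ?? '';
        if (filter_var($email, FILTER_VALIDATE_EMAIL)) {
            // append valid addresses to an output file (or a queue/DB)
            // instead of accumulating them in an array
            file_put_contents('valid-emails.txt', $email . PHP_EOL, FILE_APPEND);
        }
    }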

    However, if the actual workload cannot easily be moved inside the loop (for example, it is a function that expects an array to process), you can still move the processing elsewhere while keeping the memory footprint low. For this, you can use a generator.

    Reading the PHP documentation, a generator fits your case very well (emphasis mine):

    A generator allows you to write code that uses foreach to iterate over a set of data without needing to build an array in memory, which may cause you to exceed a memory limit, or require a considerable amount of processing time to generate.

    Something like this:

    
    
    function csv_read($filename, $delimeter=',')
    {
        $header = [];
        $row = 0;
        # tip: don't reopen the file every time csv_read() is called; pass the handle as a param instead ;)
        $handle = fopen($filename, "r");

        if ($handle === false) {
            # note: since this function is a generator, `return false` would never
            # reach the caller; throw instead so the failure is visible
            throw new RuntimeException("Could not open $filename");
        }

        while (($data = fgetcsv($handle, 0, $delimeter)) !== false) {

            if (0 == $row) {
                $header = $data;
            } else {
                # yield one combined row on demand
                yield array_combine($header, $data);
            }

            $row++;
        }
        fclose($handle);
    }
    

    And then:

    $generator = csv_read('rdu-weather-history.csv', ';');
    
    do_stuff_with_array($generator);
    

    Here, the $generator variable does not hold all the data at once. On the one hand, it behaves much like an array and can be used inside a foreach() statement; on the other hand, it acts like a stream and reads one item at a time.
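
    As an illustration, do_stuff_with_array() only needs to iterate over whatever it receives; its body is not shown above, so the following is just one possible sketch that works for both arrays and generators:

    // hypothetical consumer: iterates rows one at a time, whether given an
    // array or the generator returned by csv_read()
    function do_stuff_with_array(iterable $rows)
    {
        $count = 0;
        foreach ($rows as $row) {
            // $row is a single associative array; only this row is in memory
            $count++;
        }
        echo "processed $count rows\n";
    }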


    P.S.: The CSV file above was taken from: https://data.townofcary.org/api/v2/catalog/datasets/rdu-weather-history/exports/csv