Tags: php, csv, fgets, fgetcsv, fputs

PHP processing a CSV file into 19 smaller CSVs is very slow


I have created this PHP script, which takes a very long time to split the CSV file into 19 smaller ones. The CSV is available in this Google Drive folder:

https://drive.google.com/drive/folders/1Bju4kkVgo21xu3IFeyKWNJ1uClTn534J?usp=share_link

I have the process running line by line to save memory, but it exceeded the maximum execution time of the PHP script. Is there an improved method to achieve this file breakdown?

<?php

@date_default_timezone_set("GMT");

ini_set('memory_limit', '512M');
ini_set('max_execution_time', '300');
ini_set('auto_detect_line_endings', TRUE);
ini_set('display_errors', 1);

$inputFile = '/Users/declanryan/Desktop/UNI3.0/year3/WEB/workshop/assgienment/air-quality-data-2003-2022.csv';
$outputDirectory = 'output';

// create the output directory if it doesn't exist
if (!file_exists($outputDirectory)) {
    mkdir($outputDirectory);
}

$handle = fopen($inputFile, 'r');

if ($handle) {
    $headerRow = null;
    $siteIDs = array();

    while (($data = fgets($handle)) !== false) {
        $row = str_getcsv($data, ";");

        if ($headerRow === null) {
            $headerRow = $row;
            continue; // Skip processing the header row
        }

        $siteID = $row[array_search('SiteID', $headerRow)];
        if (!in_array($siteID, $siteIDs)) {
            $siteIDs[] = $siteID;

            $newCSVFilename = $outputDirectory . '/data-' . $siteID . '.csv';
            $f = fopen($newCSVFilename, 'w');

            fputs($f, implode(';', $headerRow) . PHP_EOL);
            fclose($f); // close the header write handle so it isn't leaked
        }

        $currentCSVFilename = $outputDirectory . '/data-' . $siteID . '.csv';
        $f = fopen($currentCSVFilename, 'a');

        fputs($f, implode(';', $row) . PHP_EOL);

        fclose($f);
    }

    fclose($handle);
}

echo "Done.";
echo "<br>";
echo 'Script took ' . round((microtime(true) - $_SERVER["REQUEST_TIME_FLOAT"]), 2) . ' seconds to run.';


?>

It has taken a fair amount of time just to get the file processing working. I was going to change the format to fgetcsv(), but my lecturer told me that method is actually slower?
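One way to check that claim is simply to time both read loops over the same file; a rough stand-alone sketch (reusing the input path from above, with no per-row work so only the parsing is measured):

<?php
$inputFile = '/Users/declanryan/Desktop/UNI3.0/year3/WEB/workshop/assgienment/air-quality-data-2003-2022.csv';

// Pass 1: fgetcsv() directly on the stream.
$start = microtime(true);
$h = fopen($inputFile, 'r');
while (fgetcsv($h, 0, ';') !== false) {
    // parse only, no per-row work
}
fclose($h);
echo 'fgetcsv: ' . round(microtime(true) - $start, 2) . "s\n";

// Pass 2: fgets() per line, then str_getcsv() on the line.
$start = microtime(true);
$h = fopen($inputFile, 'r');
while (($line = fgets($h)) !== false) {
    str_getcsv($line, ';');
}
fclose($h);
echo 'str_getcsv(fgets): ' . round(microtime(true) - $start, 2) . "s\n";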

In reply to Sammath: would something like this be more in line with what's necessary?


<?php

@date_default_timezone_set("GMT");

ini_set('memory_limit', '512M');
ini_set('max_execution_time', '300');
ini_set('auto_detect_line_endings', TRUE);
ini_set('display_errors', 1);

$inputFile = '/Users/declanryan/Desktop/UNI3.0/year3/WEB/workshop/assgienment/air-quality-data-2003-2022.csv';
$outputDirectory = 'output';

// create the output directory if it doesn't exist
if (!file_exists($outputDirectory)) {
    mkdir($outputDirectory);
}

$source = fopen($inputFile, 'r');
if (!$source) {
    exit('Unable to open input file.');
}

$headerRow = fgetcsv($source, 0, ';');
if (!$headerRow) {
    exit('Unable to read header row.');
}

// Map column names to indexes so the SiteID column is located once, not per row.
$columnIndexes = array_flip($headerRow);
$siteIDColumn = $columnIndexes['SiteID'];

// One cached write handle per SiteID, so each output file is opened only once.
$handles = [];

while (($row = fgetcsv($source, 0, ';')) !== false) {
    $siteID = $row[$siteIDColumn];
    if (!isset($handles[$siteID])) {
        $newCSVFilename = $outputDirectory . '/data-' . $siteID . '.csv';
        $handles[$siteID] = fopen($newCSVFilename, 'w');
        if (!$handles[$siteID]) {
            exit('Unable to open output file for SiteID: ' . $siteID);
        }
        fputcsv($handles[$siteID], $headerRow, ';');
    }

    fputcsv($handles[$siteID], $row, ';');
}

foreach ($handles as $handle) {
    fclose($handle);
}

fclose($source);

echo "Done.";
echo "<br>";
echo 'Script took ' . round((microtime(true) - $_SERVER["REQUEST_TIME_FLOAT"]), 2) . ' seconds to run.';



Solution

  • Touching the filesystem involves some of the slowest IO a computer generally does, and PHP abstracts a lot of optimizations away. When you repeatedly open a file, write a very small amount of data, and then close it, you not only make those optimizations meaningless, you also do about the worst thing you can do: constantly flush tiny writes to disk.

    For something like this you should be opening those handles once, which might look roughly like:

    $source = fopen('somefile.csv', 'r');
    $handles = [];
    
    $header = fgetcsv($source, 0, ';');
    
    while( $row = fgetcsv($source, 0, ';') ) {
      $some_id = $row[X]; // X = the index of the ID column
      if( ! key_exists($some_id, $handles) ) {
        $handles[$some_id] = fopen("foo/$some_id.csv", 'w');
      }
      fputcsv($handles[$some_id], $row, ';');
    }
    }
    
    foreach($handles as $handle) {
      fclose($handle);
    }
    fclose($source);
    

    Additionally, there is functionally little difference between fgetcsv() and str_getcsv(fgets()), but implode(';', $row) is not an appropriate way to encode CSV, as it performs no quoting or escaping of field values.
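
    To see why that matters, take a hypothetical row in which one field happens to contain the delimiter:

    $row = ['Bristol; St Pauls', '2022-01-01', '42'];
    
    // implode() blindly joins the fields, so the embedded ';' corrupts the record:
    echo implode(';', $row) . PHP_EOL;
    // Bristol; St Pauls;2022-01-01;42   <- reads back as four fields, not three
    
    // fputcsv() quotes the offending field, so the record round-trips intact:
    $out = fopen('php://stdout', 'w');
    fputcsv($out, $row, ';');
    // "Bristol; St Pauls";2022-01-01;42
    fclose($out);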