php arrays multidimensional-array text-extraction file parsing

Split lines of a file into a 2D array with row elements determined by string length because there are no delimiters


I have a data file being read in with file(), and I iterate over each row. I need to split each line into an array of "columns". The problem is that the columns have different widths (60 chars, 24 chars, 16 chars), and all the functions I have found for this seem to expect every column to be the same size.

This will be performed on a large data file quite regularly so optimal performance is desired.

Example of data.

XXXXXXXXXXXXXXXXXXXXXXXXXX                                  XXXXXXXXXXXXX           XX        XXXXXX
XXXXXXXXX                                                   XXX XXX                 X         XXX
XXXXXXXXXXXXXXX                                             XXXXXXXXXXXXX           XX        XXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXX                                  XXXXXXXXXXXXX           XX        XXXXXX
XXXXXXXXX                                                   XXX XXX                 X         XXX
XXXXXXXXXXXXXXX                                             XXXXXXXXXXXXX           XX        XXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXX                                  XXXXXXXXXXXXX           XX        XXXXXX
XXXXXXXXX                                                   XXX XXX                 X         XXX
XXXXXXXXXXXXXXX                                             XXXXXXXXXXXXX           XX        XXXXXX

Desired result:

array (
  0 => 
  array (
    0 => 'XXXXXXXXXXXXXXXXXXXXXXXXXX                                  ',
    1 => 'XXXXXXXXXXXXX           ',
    2 => 'XX        XXXXXX',
  ),
  1 => 
  array (
    0 => 'XXXXXXXXX                                                   ',
    1 => 'XXX XXX                 ',
    2 => 'X         XXX',
  ),
  2 => 
  array (
    0 => 'XXXXXXXXXXXXXXX                                             ',
    1 => 'XXXXXXXXXXXXX           ',
    2 => 'XX        XXXXXX',
  ),
  3 => 
  array (
    0 => 'XXXXXXXXXXXXXXXXXXXXXXXXXX                                  ',
    1 => 'XXXXXXXXXXXXX           ',
    2 => 'XX        XXXXXX',
  ),
  4 => 
  array (
    0 => 'XXXXXXXXX                                                   ',
    1 => 'XXX XXX                 ',
    2 => 'X         XXX',
  ),
  5 => 
  array (
    0 => 'XXXXXXXXXXXXXXX                                             ',
    1 => 'XXXXXXXXXXXXX           ',
    2 => 'XX        XXXXXX',
  ),
  6 => 
  array (
    0 => 'XXXXXXXXXXXXXXXXXXXXXXXXXX                                  ',
    1 => 'XXXXXXXXXXXXX           ',
    2 => 'XX        XXXXXX',
  ),
  7 => 
  array (
    0 => 'XXXXXXXXX                                                   ',
    1 => 'XXX XXX                 ',
    2 => 'X         XXX',
  ),
  8 => 
  array (
    0 => 'XXXXXXXXXXXXXXX                                             ',
    1 => 'XXXXXXXXXXXXX           ',
    2 => 'XX        XXXXXX',
  ),
)

Solution

  • The straightforward method would be using substr to split up the columns:

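    $rows = array();
    // FILE_IGNORE_NEW_LINES keeps the trailing newline out of the last column
    foreach (file($fn, FILE_IGNORE_NEW_LINES) as $i => $line) {
        // Column widths from the question: 60, 24, and up to 16 characters
        $rows[$i] = array(substr($line, 0, 60), substr($line, 60, 24), substr($line, 84, 16));
    }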
    

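    Contrary to common wisdom, though, it can be faster to let PCRE split the whole file in one pass with a regular expression (benchmark both on your own data). The last group is {1,16} because the final column in the sample is sometimes shorter than 16 characters:

    preg_match_all('/^(.{60})(.{24})(.{1,16})\K/m', file_get_contents($fn), $rows, PREG_SET_ORDER);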
    

    The disadvantage here is that each row contains an empty [0] (without the \K it would have contained the original line), and the data columns start at index [1].
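Both variants above can be wrapped in small helpers. The following is only a sketch under a few assumptions: the widths are the 60/24/16 from the question, the last column is allowed to be shorter than its nominal width, and the function names splitFixedWidth() and dropFullMatch() are hypothetical, not part of PHP or of the original answer. It needs PHP 7.4 or later for the arrow function.

```php
<?php
// Hypothetical helper: cut one line into columns from a list of widths.
// The last column takes whatever remains, since it may be short.
function splitFixedWidth(string $line, array $widths): array
{
    $columns = [];
    $offset = 0;
    $last = count($widths) - 1;
    foreach (array_values($widths) as $i => $width) {
        $columns[] = $i === $last
            ? (string) substr($line, $offset)          // rest of the line
            : (string) substr($line, $offset, $width); // exactly $width chars
        $offset += $width;
    }
    return $columns;
}

// Hypothetical helper: remove the empty [0] that \K leaves in every
// PREG_SET_ORDER row, so the data columns start at index 0.
function dropFullMatch(array $matches): array
{
    return array_map(fn(array $m) => array_slice($m, 1), $matches);
}

// Sample line built like the second row of the question's data.
$line = str_pad('XXXXXXXXX', 60) . str_pad('XXX XXX', 24) . 'X         XXX';

// substr-based split of a single line
$columns = splitFixedWidth($line, [60, 24, 16]);

// regex-based split of a two-line "file", then reindexed
preg_match_all('/^(.{60})(.{24})(.{1,16})\K/m', $line . "\n" . $line, $matches, PREG_SET_ORDER);
$rows = dropFullMatch($matches);
```

The reindexing step costs an extra pass over the result, so on a large file it may cancel out some of the speed gained from the single preg_match_all() call. If the consumer can live with columns starting at [1], skipping dropFullMatch() is the cheaper choice.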