php, string, tokenize, text-parsing

Get all substrings inside potentially nested curly braces


I'm trying to parse the following format with PHP:

// This is a comment
{
this is an entry
}
{
this is another entry
}
{
entry
{entry within entry}
{entry within entry}
}

Maybe it's just the lack of caffeine, but I can't think of a decent way of getting the contents of the curly braces.


Solution

  • This is quite a common parsing task. Basically, you need to keep track of the state you are in (regular character data or a comment) as you walk the input one character at a time, using constants to name the states and a recursive function call for each nested brace.

    Here is some rather inelegant code that does just that:

    <?php
    
    $input = file_get_contents('input.txt');
    
    define('STATE_CDATA', 0);
    define('STATE_COMMENT', 1);
    
    function parseBrace($input, &$i)
    {
        $parsed = array(
            'cdata' => '',
            'children' => array()
        );
        $length = strlen($input);
        $state = STATE_CDATA;
        for(++$i; $i < $length; ++$i) {
            switch($input[$i]) {
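                case '/':
                    // Guard the lookahead so a trailing '/' does not read past the end of the input.
                    if ($i + 1 < $length && '/' === $input[$i+1]) {
                        $state = STATE_COMMENT;
                        ++$i;
                    } elseif (STATE_CDATA === $state) {
                        $parsed['cdata'] .= $input[$i];
                    }
                    break;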
                case '{':
                    if (STATE_CDATA === $state) {
                        $parsed['children'][] = parseBrace($input, $i);
                    }
                    break;
                case '}':
                    if (STATE_CDATA === $state) {
                        break 2; // for
                    }
                    break;
                case "\n":
                    if (STATE_CDATA === $state) {
                        $parsed['cdata'] .= $input[$i];
                    }
                    $state = STATE_CDATA;
                    break;
                default:
                    if (STATE_CDATA === $state) {
                        $parsed['cdata'] .= $input[$i];
                    }
            }
        }
        return $parsed;
    }
    
    function parseInput($input)
    {
        $parsed = array(
            'cdata' => '',
            'children' => array()
        );
        $state = STATE_CDATA;
        $length = strlen($input);
        for($i = 0; $i < $length; ++$i) {
            switch($input[$i]) {
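                case '/':
                    // Guard the lookahead so a trailing '/' does not read past the end of the input.
                    if ($i + 1 < $length && '/' === $input[$i+1]) {
                        $state = STATE_COMMENT;
                        ++$i;
                    } elseif (STATE_CDATA === $state) {
                        $parsed['cdata'] .= $input[$i];
                    }
                    break;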
                case '{':
                    if (STATE_CDATA === $state) {
                        $parsed['children'][] = parseBrace($input, $i);
                    }
                    break;
                case "\n":
                    if (STATE_CDATA === $state) {
                        $parsed['cdata'] .= $input[$i];
                    }
                    $state = STATE_CDATA;
                    break;
                default:
                    if (STATE_CDATA === $state) {
                        $parsed['cdata'] .= $input[$i];
                    }
            }
        }
        return $parsed;
    }
    
    print_r(parseInput($input));
    

    This produces the following output:

    Array
    (
        [cdata] =>
    
    
    
    
        [children] => Array
        (
            [0] => Array
            (
                [cdata] =>
    this is an entry
    
                [children] => Array
                (
                )
    
            )
    
            [1] => Array
            (
                [cdata] =>
    this is another entry
    
                [children] => Array
                (
                )   
    
            )
    
            [2] => Array
            (
                [cdata] => 
    entry
    
    
    
                [children] => Array
                (
                    [0] => Array
                    (
                        [cdata] => entry within entry
                        [children] => Array
                        (
                        )
    
    
                    )
    
                    [1] => Array
                    (
                        [cdata] => entry within entry
                        [children] => Array
                        (
                        )
    
                    )
    
                )
    
            )
    
        )
    
    )
    

    You'll probably want to clean up all the whitespace, but a few well-placed trim() calls will sort that out for you.
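
    One way to do that cleanup is sketched below. It assumes PHP 7 or later and the array shape produced by the parser above (a 'cdata' string plus a 'children' array per node). The function name trimParsed is hypothetical, and the sample tree is built by hand here rather than taken from a real parser run:

    ```php
    <?php

    /**
     * Recursively trim the 'cdata' of a node and of all its children.
     * Hypothetical helper: assumes every node has 'cdata' and 'children' keys.
     */
    function trimParsed(array $node): array
    {
        $node['cdata'] = trim($node['cdata']);
        $node['children'] = array_map('trimParsed', $node['children']);
        return $node;
    }

    // Hand-built sample in the same shape as the parser output.
    $tree = array(
        'cdata' => "\n\n\n",
        'children' => array(
            array('cdata' => "\nthis is an entry\n", 'children' => array()),
            array(
                'cdata' => "\nentry\n\n\n",
                'children' => array(
                    array('cdata' => 'entry within entry', 'children' => array()),
                ),
            ),
        ),
    );

    $clean = trimParsed($tree);
    echo $clean['children'][0]['cdata'], "\n"; // this is an entry
    echo $clean['children'][1]['cdata'], "\n"; // entry
    ```

    With the real parser you would call trimParsed(parseInput($input)) instead of passing the hand-built $tree.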