php arrays simpledom

How can I make this PHP DOM parser more efficient with nested arrays?


How can I make this parser more efficient? I feel like these if statements are crazy! I'm thinking that a callback function would be able to get the job done. However, most of my identifiers are wildly different and I need to go through many different keys. Should I make an array of tags and an array of DOM elements and create a callback function for each one to strip out the null values? I'm trying to put together a scraper for the first time and I'm really getting stumped by the logic here.

Any help would be seriously appreciated!

foreach ($html->find('.b-card') as $article) {
    $data[$y]['business'] = $article->find('h1', 0)->plaintext;
    $data[$y]['address']  = $article->find('.address', 0)->plaintext;

    if ($article->find('.phone-num', 0)) {
        $data[$y]['phone'] = $article->find('.phone-num', 0)->plaintext;
    } else {
        $data[$y]['phone'] = 'Number not listed.';
    }

    if ($article->find('.link', 0)) {
        $data[$y]['website'] = $article->find('.link', 0)->plaintext;
    } else {
        $data[$y]['website'] = 'Website not listed.';
    }

    if ($article->find('.established', 0)) {
        $data[$y]['years'] = str_replace("\r\n", "", $article->find('.established', 0)->plaintext);
    } else {
        $data[$y]['years'] = 'Years established not listed.';
    }

    if ($article->find('.open-hours', 0)) {
        $data[$y]['hours'] = $article->find('.open-hours', 0)->plaintext;
    } else {
        $data[$y]['hours'] = 'Hours not listed.';
    }

    if ($article->find('.categories a', 0)) {
        $data[$y]['category'] = $article->find('.categories a', 0)->plaintext;
    } else {
        $data[$y]['category'] = 'Category not listed.';
    }

    $articles[] = $data[$y];
}

}

I feel like I could do something like this:

function my_callback($element) {
        // remove all null tags 
        if ($element->tag)
                $article->find....;
} 

Solution

  • Use an array that collects the pieces that differ between the if blocks: the selector, the key in $data, and the label used in the fallback message.

    $selector_list = array(
        array('selector' => '.phone-num', 'index' => 'phone', 'name' => 'Number'),
        array('selector' => '.link', 'index' => 'website', 'name' => 'Website'),
        array('selector' => '.open-hours', 'index' => 'hours', 'name' => 'Hours'),
        array('selector' => '.categories a', 'index' => 'category', 'name' => 'Category')
    );
    

    Then you can use a simple foreach loop with all the common code:

    foreach ($selector_list as $sel) {
        // Look the element up once and reuse the result
        $found = $article->find($sel['selector'], 0);
        if ($found) {
            // Store the text, as in the original code, not the element object
            $data[$y][$sel['index']] = $found->plaintext;
        } else {
            $data[$y][$sel['index']] = $sel['name'] . " not listed.";
        }
    }
    // Special case for selector that doesn't follow the same pattern
    $found = $article->find('.established', 0);
    if ($found) {
        $data[$y]['years'] = str_replace("\r\n", "", $found->plaintext);
    } else {
        $data[$y]['years'] = 'Years established not listed.';
    }
    

    If you want the loop to handle the special case as well, you could add an optional callback function to each entry in the array. But if it's just one odd case, that may be overkill.
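
For what it's worth, here is a rough sketch of how that callback variant might look. The `extract_fields` helper, the `'filter'` key, and the `$fake_find` stand-in are all made up for illustration and are not part of Simple HTML DOM. The stand-in only exists so the snippet runs without the library; it returns an object with a `plaintext` property, or null when nothing matches.

```php
<?php
// Hypothetical helper: builds one row of data from a selector list.
// $find is any callable that takes a selector and returns an object
// with a ->plaintext property, or null when nothing matches.
function extract_fields(callable $find, array $selector_list)
{
    $row = array();
    foreach ($selector_list as $sel) {
        $found = $find($sel['selector']);
        if (!$found) {
            $row[$sel['index']] = $sel['name'] . ' not listed.';
            continue;
        }
        $text = $found->plaintext;
        // Optional per-entry callback ('filter' is an assumed key name)
        if (isset($sel['filter'])) {
            $text = call_user_func($sel['filter'], $text);
        }
        $row[$sel['index']] = $text;
    }
    return $row;
}

$selector_list = array(
    array('selector' => '.phone-num', 'index' => 'phone', 'name' => 'Number'),
    array(
        'selector' => '.established',
        'index'    => 'years',
        'name'     => 'Years established',
        // Special handling lives with the entry that needs it
        'filter'   => function ($text) {
            return str_replace("\r\n", '', $text);
        },
    ),
);

// Stand-in for $article->find($selector, 0), with made-up sample data
$fake_find = function ($selector) {
    $elements = array(
        '.established' => (object) array('plaintext' => "Since\r\n 1999"),
    );
    return isset($elements[$selector]) ? $elements[$selector] : null;
};

$row = extract_fields($fake_find, $selector_list);
```

In the real scraper you would presumably pass something like `function ($s) use ($article) { return $article->find($s, 0); }` as the first argument and assign the result to `$data[$y]`. Whether that is cleaner than the single special case above is a judgment call.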