php web-scraping domdocument

How to scrape data from a web page through its <script> tags in the HTML source (PHP)


I want to retrieve some data items from a web page.

Link to the web page:

http://www.walmart.com/storeLocator/ca_storefinder_results.do

Data items which I want to retrieve.

I tried a lot but could not do it, because the tags have neither ids nor specific classes assigned to them, and there is no tag hierarchy that would let me fetch the data under each heading.

If you look at the HTML source of the above page, the data items are already available as variables inside a <script> tag. Can anyone tell me how to retrieve these data items for each store?


Solution

  • I think that you'll have to use a regex for this, though it isn't perfect.

    $contents = file_get_contents('http://www.walmart.com/storeLocator/ca_storefinder_results.do?serviceName=&rx_title=com.wm.www.apps.storelocator.page.serviceLink.title.default&rx_dest=%2Findex.gsp&sfsearch_single_line_address=K6T');
    // Find the index of every "stores[N] = {" assignment in the inline script.
    preg_match_all('/stores\[(\d+)\] = \{/', $contents, $storeMatches);
    foreach ($storeMatches[1] as $index) {
        // Grab the body of this store's object literal.
        if (!preg_match('/stores\[' . $index . '\] = \{(.*?)\};/s', $contents, $body)) {
            continue;
        }
        // Pull out each 'key' : value pair. A value is cut at the first comma.
        preg_match_all('/\'([a-zA-Z0-9]+)\' : ([^,]*?),/s', $body[1], $pairs);
        $results = array(); // start fresh for each store
        foreach ($pairs[1] as $i => $key) {
            // Strip the surrounding single quotes from string values.
            $results[$key] = trim($pairs[2][$i], "'");
        }
        print_r($results);
    }
    

    Displays this:

    Array
    (
        [fullName] => Ogdensburg Walmart Store #2092
        [street1] => 3000 Ford Street Ext
        [city] => Ogdensburg
        [state] => NY
        [zipcode] => 13669
        [phone] => (315) 394-8990
        [latitude] => 44.7083
        [longitude] => -75.4564
        [storeName] => Walmart
        [storeTypeId] => 2
        [storeId] => 2092
        [distance] => 22.01 miles
        [directionsLink] => directionsLink
        [directionsAvailable] => directionsAvailable
        [directionsMessage] => directionsMessage
        [hasOpen24HoursService] => false
        [open24hrsMessage] => open24hrsMessage
        [hoursWeekDays] => hoursWeekDays
        [hoursSaturday] => hoursSaturday
        [hoursSunday] => hoursSunday
        [weekDays] => storeWeekDays
        [weekEndSaturday] => storeSaturday
        [weekEndSunday] => storeSunday
        [storeInfoDays] => storeInfoDays
        [storeInfoHours] => storeInfoHours
        [moreDetailsLink] => moreDetailsLink
        [openingSoon] => false
        [recentlyOpen] => false
        [siteToStoreAvailable] => true
        [hasStoreEvent] => true
        [eventLink] => http://localad.walmart.com/walmart/new_user_entry.aspx?storeref=2092&forceview=y
    )
    

    If you want to keep the single quotes ('), remove the trim() call and store the matched value as it is.
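
Since the question is tagged domdocument, here is a rough sketch of a variation: DOMDocument is used only to collect the text of the <script> elements, and the regex then runs on that text rather than on the whole page. The helper names `extractInlineScripts()` and `parseStores()` and the sample markup are made up for illustration. The `stores[N] = { 'key' : value, ... };` layout is assumed from the regexes above, not checked against the live page. Unlike the pattern above, this one also picks up the last property when it has no trailing comma, but it will still break on string values that contain an escaped quote.

```php
<?php
// Hypothetical helper: collect the text of every <script> element in the page.
function extractInlineScripts(string $html): string
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid, so keep parser warnings out of the output.
    libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();

    $js = '';
    foreach ($doc->getElementsByTagName('script') as $script) {
        $js .= $script->textContent . "\n";
    }
    return $js;
}

// Hypothetical helper: turn "stores[N] = { 'key' : value, ... };" blocks into arrays.
function parseStores(string $js): array
{
    $stores = [];
    preg_match_all('/stores\[(\d+)\]\s*=\s*\{(.*?)\};/s', $js, $blocks, PREG_SET_ORDER);
    foreach ($blocks as $block) {
        // A value is either a single-quoted string or a bare token
        // (number, boolean, or JavaScript variable name).
        preg_match_all("/'(\w+)'\s*:\s*('[^']*'|[^,\s]+)/", $block[2], $pairs, PREG_SET_ORDER);
        $store = [];
        foreach ($pairs as $pair) {
            $store[$pair[1]] = trim($pair[2], "'");
        }
        $stores[(int) $block[1]] = $store;
    }
    return $stores;
}

// Made-up sample markup standing in for the downloaded page.
$sampleHtml = <<<'HTML'
<html><body>
<script type="text/javascript">
var stores = [];
stores[0] = {
    'fullName' : 'Ogdensburg Walmart Store #2092',
    'street1' : '3000 Ford Street Ext',
    'storeId' : 2092,
    'distance' : '22.01 miles',
    'openingSoon' : false
};
</script>
</body></html>
HTML;

$stores = parseStores(extractInlineScripts($sampleHtml));
print_r($stores);
```

To run it against the real page, pass the result of file_get_contents() instead of `$sampleHtml`. Every value comes back as a string, so cast numbers and booleans yourself if you need them typed.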