php web-scraping

Bug with strtotime()


The Simple HTML DOM library is used to extract the timestamp from a webpage. strtotime() is then used to convert the extracted timestamp to a Unix timestamp, which date() formats as a MySQL timestamp.

Problem: When strtotime() is used on a valid timestamp, NULL is returned (see 2:). However, when Simple HTML DOM is not used, as in the 2nd example, everything works properly.

What is happening, and how can this be fixed?

Output:

1:2013-03-03, 12:06PM
2:
3:1970-01-01 00:00:00

var_dump($time)

string(25) "2013-03-03, 12:06PM"

PHP

include_once(path('app') . 'libraries/simple_html_dom.php');

// Convert to HTML DOM object
$html = new simple_html_dom();
$html_raw = '<p class="postinginfo">Posted: <date>2013-03-03, 12:06PM EST</date></p>';
$html->load($html_raw);

// Extract timestamp
$time = $html->find('.postinginfo', 0);
$pattern = '/Posted: (.*?) (.).T/s';
$matches = '';
preg_match($pattern, $time, $matches);
$time = $matches[1];

echo '1:' . $time . '<br>';
echo '2:' . strtotime($time) . '<br>';
echo '3:' . date("Y-m-d H:i:s", strtotime($time));

2nd Example

PHP (Working, without Simple HTML DOM)

// Extract posting timestamp
$time = 'Posted: 2013-03-03, 12:06PM EST';
$pattern = '/Posted: (.*?) (.).T/s';
$matches = '';
preg_match($pattern, $time, $matches);
$time = $matches[1];

echo '1:' . $time . '<br>';
echo '2:' . strtotime($time) . '<br>';
echo '3:' . date("Y-m-d H:i:s", strtotime($time));

Output (Correct)

1:2013-03-03, 12:06PM
2:1362312360
3:2013-03-03 12:06:00

var_dump($time)

string(19) "2013-03-03, 12:06PM"

Solution

  • According to your var_dump(), the $time string you extracted from the HTML code is 25 characters long.

    The string you see, "2013-03-03, 12:06PM", is only 19 characters long.

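    So, where are those 6 extra characters? Well, it's pretty obvious, really: the string you're trying to parse is really "<date>2013-03-03, 12:06PM". Your regex captures everything after "Posted: ", and in the element's HTML that includes the opening <date> tag. But when you print it into an HTML document, that <date> is parsed as an HTML tag by the browser, so you never see it.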

    To see it, use the "View Source" function in your browser. Or, much better yet, use htmlspecialchars() when printing any variables that are not supposed to contain HTML code.
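
A minimal sketch of the fix, assuming the element's string form is its outer HTML (which is what the 25-character var_dump() suggests). The variable names $element_html, $with_tag and $clean are made up for illustration, and the HTML string stands in for the Simple HTML DOM element so the snippet runs without the library. It strips the tags with strip_tags() before running the regex; reading the element's plaintext property in Simple HTML DOM instead of the element itself should have the same effect.

```php
<?php
// Stand-in for the string the Simple HTML DOM element turns into when
// passed to preg_match() (assumption: its outer HTML, tags included).
$element_html = '<p class="postinginfo">Posted: <date>2013-03-03, 12:06PM EST</date></p>';

// Same pattern as in the question.
$pattern = '/Posted: (.*?) (.).T/s';

// Without stripping tags, the capture keeps the leading "<date>".
preg_match($pattern, $element_html, $matches);
$with_tag = $matches[1];

// strip_tags() removes the markup first, so only the bare timestamp is captured.
preg_match($pattern, strip_tags($element_html), $matches);
$clean = $matches[1];

// htmlspecialchars() escapes "<" and ">", so a stray tag shows up in the browser.
echo '1:' . htmlspecialchars($with_tag) . ' (' . strlen($with_tag) . " chars)<br>\n";
echo '2:' . htmlspecialchars($clean) . ' (' . strlen($clean) . " chars)<br>\n";
echo '3:' . date('Y-m-d H:i:s', strtotime($clean)) . "\n";
```

Line 1 should show the hidden tag and a length of 25, line 2 the 19-character timestamp. The exact value on line 3 depends on the server's configured timezone.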