symfony xpath filter domcrawler

DomCrawler filterXPath for emails


In my project I am trying to use filterXPath on emails. I fetch an email via IMAP and load its HTML body into a DomCrawler instance.

use Symfony\Component\DomCrawler\Crawler;

$crawler = new Crawler();
$crawler->addHtmlContent($mail->textHtml); // HTML body of the mail, UTF-8

Now to my issue. I only want the plain text of the mail body, but it should keep all line breaks, spaces etc. In other words, it should look exactly like the rendered mail, just as plain text without HTML (still with \n\r etc).

For that reason I tried using $crawler->filterXPath('//body/descendant-or-self::*/text()') to get every text node inside the mail.

However, my test mail contains HTML like:

<p>&#13;
    <u>
        <span>
            <a href="mailto:mail@example.com">
                <span style="color:#0563C1">mail@example.com</span>
            </a>
        </span>
    </u>
    <span>&#13;</span>
    <span>·</span>
    <span>
        <b>
            <a href="http://www.example.com">
                <span style="color:#0563C1">www.example.com</span>
            </a>
        </b>
    <p/>
    </span>
</p>&#13;

In my mail this looks like mail@example.com · www.example.com (in one single line).

With my filterXPath I get multiple nodes, which results in the following (multiple lines):

mail@example.com
· www.example.com

I suspect the &#13; is the problem, since it is a \r. But I can't change the HTML in the mail, so I need another solution. As mentioned before, the rendered mail shows this as a single line.

Please keep in mind that my solution has to work for every mail. I do not know what the mail HTML will look like, and it can change every time, so I need a generic solution.

I already tried strip_tags too, but that does not change the result at all.
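To illustrate the strip_tags attempt, here is a minimal reproduction. It uses a shortened, made-up version of the markup above (not the real mail), so the exact string is an assumption. It shows that strip_tags only removes the tags and leaves the line breaks and indentation between them in place:

```php
<?php
declare(strict_types=1);

// Shortened, made-up stand-in for the mail markup shown above.
$html = "<p>\n    <span>mail@example.com</span>\n    <span>·</span>\n</p>";

// strip_tags() drops the tags only; whitespace between them survives.
$plain = strip_tags($html);

// The line breaks from the HTML source are still in the result.
var_dump($plain);
var_dump(substr_count($plain, "\n"));
```

So the text still ends up on several lines, just like with the XPath approach.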


My current approach:

$crawler = new Crawler();
$crawler->addHtmlContent($mail->textHtml);

$text = "";
foreach ($crawler->filterXPath('//body/descendant-or-self::*/text()') as $element) {
    $part = trim($element->textContent);
    if($part) {
        $text .= "|".$part."|\n"; //to see whitespaces etc
    }
}
echo $text;

//OUTPUT
|mail@example.com|
|·|
| |
|www.example.com|
| |


Solution

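  • I believe something like this should work. It selects only the span elements that have no child elements and joins their trimmed text without adding line breaks: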

    // DOMXPath needs a DOMDocument, so take it from the crawler's first node
    $xpath = new DOMXPath($crawler->getNode(0)->ownerDocument);
    $result = $xpath->query('//span[not(descendant::*)]');
    
    $text = "";
    foreach ($result as $element) {
        $part = trim($element->textContent);
        if($part) {
            $text .= "|".$part."|"; //to see whitespaces etc
        }
    }
    echo $text;
    

    Output:

    |mail@example.com||·||www.example.com|
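Note that the XPath above only picks up text inside leaf span elements, so text sitting directly in a p, td or a element would be missed. If that matters, a more generic variant could walk every text node, collapse source whitespace to a single space, and start a new line only at elements that usually break the line when rendered. The following is only a rough sketch under assumptions: the helper names (collectText, htmlToPlainText) and the tag lists are my own, it uses plain DOMDocument instead of the Crawler so that it runs on its own, and the sample markup is a simplified stand-in for the mail (with the middle dot written as &#183;).

```php
<?php
declare(strict_types=1);

// Assumption: these elements start a new line when a mail client renders them.
const LINE_BREAKING_TAGS = ['p', 'div', 'br', 'li', 'tr', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6'];
// Assumption: text inside these elements is never shown to the reader.
const SKIPPED_TAGS = ['script', 'style'];

// Hypothetical helper: appends the rendered-looking text of $node to $buffer.
function collectText(DOMNode $node, string &$buffer): void
{
    if ($node instanceof DOMText) {
        // Source whitespace (including \r and \n) is collapsed to one space.
        $buffer .= preg_replace('/[ \t\r\n\f]+/', ' ', (string) $node->nodeValue);
        return;
    }
    if (!$node instanceof DOMElement) {
        return; // comments and other node types carry no visible text
    }
    $name = strtolower($node->nodeName);
    if (in_array($name, SKIPPED_TAGS, true)) {
        return;
    }
    $breaksLine = in_array($name, LINE_BREAKING_TAGS, true);
    if ($breaksLine) {
        $buffer .= "\n";
    }
    foreach ($node->childNodes as $child) {
        collectText($child, $buffer);
    }
    if ($breaksLine) {
        $buffer .= "\n";
    }
}

// Hypothetical helper: converts an HTML mail body to plain text.
function htmlToPlainText(string $html): string
{
    $previous = libxml_use_internal_errors(true); // mail markup is often invalid
    $document = new DOMDocument();
    // The meta tag is a hint so that libxml reads the input as UTF-8.
    $document->loadHTML('<meta http-equiv="Content-Type" content="text/html; charset=utf-8">' . $html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $buffer = '';
    $body = $document->getElementsByTagName('body')->item(0);
    if ($body !== null) {
        collectText($body, $buffer);
    }

    // Tidy each line: squeeze repeated spaces, trim, and drop empty lines.
    $lines = [];
    foreach (explode("\n", $buffer) as $line) {
        $line = trim((string) preg_replace('/ {2,}/', ' ', $line));
        if ($line !== '') {
            $lines[] = $line;
        }
    }
    return implode("\n", $lines);
}

// Simplified stand-in for the mail from the question.
$html = "<html><body><p>\r\n    <u><span><a href=\"mailto:mail@example.com\">mail@example.com</a></span></u>\r\n"
      . "    <span>&#183;</span>\n    <span><b>www.example.com</b></span>\n</p><div>Second line</div></body></html>";

$text = htmlToPlainText($html);
echo $text, "\n";
```

With this sample the address, the dot and the URL should end up on one line and "Second line" on the next. The sketch drops blank lines entirely and ignores CSS (for example display:block on a span), so it will not match every mail client exactly.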