In my project I am trying to use filterXPath for e-mails. I fetch an e-mail via IMAP and load the mail body into my DomCrawler:
$crawler = new Crawler();
$crawler->addHtmlContent($mail->textHtml); //mail html content utf8
Now to my issue: I only want the plain text of the mail body, but it should keep all line breaks, spaces etc. It should look exactly like the mail, just as plain text without HTML (still with \n\r etc.).
For that reason I tried using $crawler->filterXPath('//body/descendant-or-self::*/text()') to get every text node inside the mail.
However, my test mail contains HTML like:
<p>
<u>
<span>
<a href="mailto:mail@example.com">
<span style="color:#0563C1">mail@example.com</span>
</a>
</span>
</u>
<span> </span>
<span>·</span>
<span>
<b>
<a href="http://www.example.com">
<span style="color:#0563C1">www.example.com</span>
</a>
</b>
<p/>
</span>
</p>
In my mail this looks like mail@example.com · www.example.com (on one single line).
With my filterXPath I get multiple nodes, which results in the following (multiple lines):
mail@example.com
· www.example.com
I know that the whitespace character inside the <span> </span> is probably the problem (it is a \r), but since I can't change the HTML in the mail, I need another solution. As mentioned before, in the mail it is only a single line.
Please keep in mind that my solution has to work for every mail. I do not know what the mail HTML looks like, and it can change every time, so I need a generic solution.
I already tried using strip_tags too, but this does not change the result at all.
My current approach:
$crawler = new Crawler();
$crawler->addHtmlContent($mail->textHtml);
$text = "";
foreach ($crawler->filterXPath('//body/descendant-or-self::*/text()') as $element) {
    $part = trim($element->textContent);
    if ($part) {
        $text .= "|".$part."|\n"; //to see whitespaces etc
    }
}
echo $text;
//OUTPUT
|mail@example.com|
|·|
| |
|www.example.com|
| |
I believe something like this should work:
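// DOMXPath needs a DOMDocument, not the Crawler itself,
// so take the document that owns the crawler's first node.
$xpath = new DOMXPath($crawler->getNode(0)->ownerDocument);
$result = $xpath->query('//span[not(descendant::*)]');
$text = "";
foreach ($result as $element) {
    $part = trim($element->textContent);
    if ($part) {
        $text .= "|".$part."|"; //to see whitespaces etc
    }
}
echo $text;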
Output:
|mail@example.com||·||www.example.com|
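The span-only query drops the separating spaces and only works for mails built from <span> elements. A more generic variant, as a rough sketch only: walk every node in document order, collapse whitespace inside text nodes the way a browser would, and start a new line only at block-level elements. The helper name htmlToPlainText and the list of block elements are assumptions made for illustration, not part of DomCrawler, and the list is certainly incomplete for real-world mail HTML.

```php
<?php
// Hypothetical helper (not a DomCrawler API): flattens HTML to plain text,
// keeping inline content on one line and breaking only at block elements.
function htmlToPlainText(string $html): string
{
    libxml_use_internal_errors(true); // mail HTML is rarely valid; silence parser warnings
    $doc = new DOMDocument();
    // Prefixing an XML encoding declaration is a common trick to make libxml read UTF-8.
    $doc->loadHTML('<?xml encoding="UTF-8">' . $html);
    libxml_clear_errors();

    // Assumed, incomplete list of elements that start a new line.
    $blocks = ['p', 'div', 'br', 'li', 'tr', 'h1', 'h2', 'h3'];

    $xpath = new DOMXPath($doc);
    $text = '';
    // The union returns elements and text nodes together, in document order.
    foreach ($xpath->query('//body//* | //body//text()') as $node) {
        if ($node instanceof DOMText) {
            // Collapse runs of whitespace (including \r, \n and non-breaking spaces).
            $text .= preg_replace('/[ \t\r\n\x{00A0}]+/u', ' ', $node->nodeValue);
        } elseif (in_array(strtolower($node->nodeName), $blocks, true)) {
            $text .= "\n";
        }
    }

    $text = preg_replace('/ {2,}/', ' ', $text);   // merge spaces from adjacent text nodes
    $text = preg_replace('/ ?\n ?/', "\n", $text); // drop spaces around line breaks
    return trim($text);
}

echo htmlToPlainText("<p>a <b>b</b>\n<i>c</i></p><p>d</p>");
```

With the Crawler from the question, the same function could be fed $mail->textHtml directly. Whether the line breaks match what a given mail client shows depends on which elements are treated as blocks, so the list would need tuning against real mails.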