php html dom screen-scraping zend-dom-query

PHP HTML DOM: How to select all visible/readable text?


I'm trying to scrape websites, modify all visible text (links, paragraphs, headlines, etc.) while keeping the HTML structure, and then render the 'new' page afterwards.

Basically I want to scramble all readable text without destroying the design/functionality.

I tried it with Zend_Dom_Query, but how do I select just the text?

    $dom = new Zend_Dom_Query($html);
    $results = $dom->query( ??? );

Or is there another/better way of doing this?

Thanks a lot in advance.


Example

Input:

<html>
  <head>....</head>
  <body>

    <div>
      <h1>Headline</h1>
      <h2>Subheadline</h2>
      <p>Some text</p>
      <a href="...">
        A Link 
        <img src="..." />
        <span style="display:none">additional text</span>
      </a>  
    </div>

  </body>
</html>

Output:

<html>
  <head>....</head>
  <body>

    <div>
      <h1>Hinladee</h1>
      <h2>Suialebdhne</h2>
      <p>Smoe txet</p>
      <a href="...">
        A Lnik 
        <img src="..." />
        <span style="display:none">anodiaditl txet</span>
      </a>  
    </div>

  </body>
</html>
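
The scrambling shown in the example (first and last letter of each word stay put, the letters in between are shuffled) could be sketched roughly like this. `scrambleText` is a hypothetical helper name, not part of any library, and the sketch assumes plain ASCII words, since `str_shuffle` works on bytes and would corrupt multibyte characters.

```php
<?php
// Hypothetical helper (not from the original post): shuffle the inner
// letters of every ASCII word of four or more letters, keeping the first
// and last letter in place. Shorter words have nothing to shuffle.
function scrambleText(string $text): string
{
    return preg_replace_callback(
        '/[A-Za-z]{4,}/',
        function (array $m): string {
            $word  = $m[0];
            $inner = substr($word, 1, -1);   // everything between first and last letter
            return $word[0] . str_shuffle($inner) . substr($word, -1);
        },
        $text
    );
}

// Output is random on every run; a word may also come back unchanged.
echo scrambleText('Some text'), "\n";
```

Because only letters inside a word are moved, whitespace and punctuation in the text keep their positions and the string length stays the same.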

Solution

  • Solution:

    Thanks to @Yoshi and @Gordon. This is exactly what I was looking for:

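    $dom = new Zend_Dom_Query($html);
    // "//text()" is an XPath expression, so it goes through queryXpath();
    // query() is the entry point for CSS selectors.
    $results = $dom->queryXpath("//text()");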
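
For anyone without Zend, the same `//text()` idea can be tried with PHP's built-in DOM extension. The following is only a sketch under a few assumptions: `rewriteTextNodes` is a made-up function name, the transform is any callable you supply, and the XPath filter that skips whitespace-only nodes and the contents of `<script>`/`<style>` is my addition, since a bare `//text()` would match those too.

```php
<?php
// Sketch: apply $transform to every non-empty text node that is not
// inside <script> or <style>, then serialize the document again.
// rewriteTextNodes is a hypothetical name, not a library function.
function rewriteTextNodes(string $html, callable $transform): string
{
    $doc = new DOMDocument();
    // Collect libxml parse warnings internally instead of emitting them;
    // real-world markup is rarely valid.
    libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $nodes = $xpath->query(
        '//text()[normalize-space() != ""][not(ancestor::script or ancestor::style)]'
    );

    // The XPath result is a snapshot, so changing values while looping is safe.
    foreach ($nodes as $node) {
        $node->nodeValue = $transform($node->nodeValue);
    }

    return $doc->saveHTML();
}

$html = '<html><body><h1>Headline</h1><script>var a = 1;</script></body></html>';
echo rewriteTextNodes($html, 'strtoupper');
```

Only text nodes are touched, so tags and attributes such as `href` and `src` are left alone. Note that `loadHTML` may add a doctype and normalize the markup, and that it guesses the encoding unless the page declares one, so the output is not guaranteed to be byte-identical to the input outside the text.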