php html-parsing simple-html-dom

How to parse an HTML webpage and remove <br> tags?


I need to parse a web page that contains many <p> tags. I want to extract their text and write it to a CSV file, all in the same column.

After testing, I see that the paragraphs do not end up in the same column. This is caused by the <br> tags inside the <p> tags, for example:

HTML :

<div class="text">
     <p> hello <br> friends </p>
     <p> parsing is cool <br> using <br> simpleHTMLdom </p>
</div>

When I parse the HTML above, I get both <p> elements, but not in the same CSV "column".

My code :

if ($html_book_page->find('.text')) {
    foreach ($html_book_page->find('div[class=text] p') as $bookPreview) {
        array_push($book, $bookPreview->plaintext);
    }
}

$book is the array containing all of the text, and I write it to the CSV like this:

fputcsv($open_csv, array_values($book), ',', ' ');

Is there any way to get a CSV with the header TEXT and, under it, a single cell containing "Hello friends parsing is cool using simpleHTMLdom"? At the moment I get "Hello" in one column and "friends", "parsing is cool", "using" and "simpleHTMLdom" each in a separate column.

Thank you all


Solution

  • If you can process the page in the browser, why not remove the <br> elements with jQuery's .remove() before building your CSV? Something like this:

    $('.text p').find('br').remove()
    

    If you don't want to permanently remove the <br> tags from the page, you could strip them in your foreach loop instead, for example:

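    foreach ($html_book_page->find('div[class=text] p') as $bookPreview) {
      // Swap each <br> for a space, then drop any remaining tags and extra whitespace.
      $text = preg_replace('/<br\s*\/?>/i', ' ', $bookPreview->innertext);
      array_push($book, trim(preg_replace('/\s+/', ' ', strip_tags($text))));
    }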
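
To land everything in a single CSV cell, it may also help to remember that fputcsv() writes one field per array element, so the paragraphs have to be joined into one string before the row is written. Below is a minimal, self-contained sketch of that idea. It is not the Simple HTML DOM approach from the question: it assumes PHP's built-in DOMDocument and DOMXPath instead, so that it runs without any extra library. The helper name paragraphs_to_single_cell() and the php://memory output stream are made up for illustration.

```php
<?php
// Sample markup copied from the question.
$html = <<<HTML
<div class="text">
     <p> hello <br> friends </p>
     <p> parsing is cool <br> using <br> simpleHTMLdom </p>
</div>
HTML;

// Hypothetical helper: returns the text of every <p> inside div.text as ONE string.
function paragraphs_to_single_cell(string $html): string
{
    $doc = new DOMDocument();
    // Suppress the warnings libxml raises for imperfect real-world HTML.
    @$doc->loadHTML($html);
    $xpath = new DOMXPath($doc);

    $parts = [];
    foreach ($xpath->query('//div[@class="text"]//p') as $p) {
        // textContent contains no markup, so the <br> elements vanish;
        // collapse the whitespace they leave behind.
        $parts[] = trim(preg_replace('/\s+/', ' ', $p->textContent));
    }

    // One string means one CSV field.
    return implode(' ', $parts);
}

$cell = paragraphs_to_single_cell($html);

// Write to an in-memory stream here; a real script would fopen() a file.
$out = fopen('php://memory', 'w+');
fputcsv($out, ['TEXT'], ',', '"', '\\');  // header row
fputcsv($out, [$cell], ',', '"', '\\');   // one row with a single column
rewind($out);
$csv = stream_get_contents($out);
fclose($out);

echo $csv;
```

With Simple HTML DOM the same idea should apply: clean each paragraph inside the loop, then pass array(implode(' ', $book)) to fputcsv() rather than $book itself.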