ruby nokogiri

Using Nokogiri, how do I convert HTML to text while respecting block elements (ensuring they result in line breaks)?


Nokogiri's #content method does not turn block elements into separate paragraphs; for example:

fragment = 'hell<span>o</span><p>world<p>I am Josh</p></p>'
Nokogiri::HTML(fragment).content
=> "helloworldI am Josh"

I would expect this output:

=> "hello\n\nworld\n\nI am Josh"

How can I convert HTML to text so that block elements produce line breaks and inline elements are joined with no space between them?


Solution

  • You can use #before and #after to insert newlines around the block elements before reading the text (here doc is the parsed document):

    doc.search('p,div,br').each{ |e| e.after "\n" }