wagtail wagtail-streamfield

Remove HTML from RichTextField


I'm trying to remove the HTML that wraps the RichTextField content. I thought I could do it using "raw_data", but that doesn't seem to work. I could use a regex to remove it, but surely there is a Wagtail/Django way to do this?

for block in post.faq.raw_data:
    print(block['value']['answer'])

Outputs:

<p data-block-key="y925g">The time is almost 4.30</p>

Expected output (just the raw text):

The time is almost 4.30

StructBlock:

class FaqBlock(blocks.StructBlock):
    question = blocks.CharBlock(required=False)
    answer = blocks.RichTextBlock(required=False)

Solution

  • You can do this easily with Beautiful Soup.

    from html import unescape
    from bs4 import BeautifulSoup
    soup = BeautifulSoup(unescape(html), "html.parser")
    inner_text = ' '.join(soup.find_all(string=True))
    

    In your case, html is value.answer, which you can pass into a custom template tag or filter.

    EDIT: example filter:

    from bs4 import BeautifulSoup
    from django import template
    from html import unescape
    
    register = template.Library()
    
    @register.filter()
    def plaintext(richtext):
        return BeautifulSoup(unescape(richtext), "html.parser").get_text(separator=" ")
    

    BeautifulSoup has a get_text() method that takes a separator; it does the same as the join statement I wrote earlier. The default separator is the empty string, which joins all the text elements together without a gap.

    <h3>Rich Text</h3>
    <p>{{ page.intro|richtext }}</p>
    <h3>Plain Text</h3>
    <p>{{ page.intro|plaintext }}</p>
    


    If you want to retain line breaks, a bit more parsing is needed to replace block elements with a \n. The streamvalue.render_as_block() method does that for you, but there's no equivalent method for RichTextField since it's just a string. Code examples for this are easy to find if you need one.
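
    As a rough illustration of the line-break idea, here is a minimal sketch. It swaps Beautiful Soup for the standard library's html.parser so it runs without extra dependencies. The names plaintext_with_breaks and BLOCK_TAGS, the choice of block tags, and the sample data are all assumptions made for this example, not part of Wagtail or Django.

```python
from html.parser import HTMLParser

# Tags treated as block-level; an assumed, non-exhaustive set for this sketch.
BLOCK_TAGS = {"p", "div", "li", "h1", "h2", "h3", "h4", "h5", "h6", "blockquote"}


class _PlainTextExtractor(HTMLParser):
    """Collects text nodes and adds a newline after each block element."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &amp; into text.
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

    def handle_starttag(self, tag, attrs):
        # <br> has no closing tag, so emit its line break when it opens.
        if tag == "br":
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.parts.append("\n")


def plaintext_with_breaks(html):
    """Return the text of an HTML fragment, one line per block element."""
    parser = _PlainTextExtractor()
    parser.feed(html)
    parser.close()  # flush any buffered trailing text
    return "".join(parser.parts).strip()


# Hypothetical data shaped like the question's post.faq.raw_data.
raw_data = [
    {
        "value": {
            "question": "What time is it?",
            "answer": '<p data-block-key="y925g">The time is almost 4.30</p>',
        }
    },
]

for block in raw_data:
    print(plaintext_with_breaks(block["value"]["answer"]))
```

    The same function could be wrapped in a template filter like plaintext above; whether to keep or collapse the newlines depends on where the text ends up being displayed.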