[SOLVED] Remove html from RichTextField

I'm trying to remove the HTML that wraps the RichTextField content. I thought I could do it using "raw_data", but that doesn't seem to work. I could use a regex to remove it, but surely there is a Wagtail/Django way to do this?

for block in post.faq.raw_data:
    print(block['value']['answer'])

Outputs:

<p data-block-key="y925g">The time is almost 4.30</p>

Expected output (just the raw text):

The time is almost 4.30

StructBlock:

class FaqBlock(blocks.StructBlock):
    question = blocks.CharBlock(required=False)
    answer = blocks.RichTextBlock(required=False)

Solution

You can do this in Beautiful Soup easily.

from bs4 import BeautifulSoup
from html import unescape
soup = BeautifulSoup(unescape(html), "html.parser")
inner_text = ' '.join(soup.find_all(string=True))

In your case, html is value.answer, which you can pass through a custom template filter.

EDIT: example filter:

from bs4 import BeautifulSoup
from django import template
from html import unescape

register = template.Library()

@register.filter()
def plaintext(richtext):
    return BeautifulSoup(unescape(richtext), "html.parser").get_text(separator=" ")

BeautifulSoup's get_text() method takes a separator and does the same as the join statement I wrote earlier. The default separator is an empty string, which joins all the text elements together without a gap.

<h3>Rich Text</h3>
<p>{{ page.intro|richtext }}</p>
<h3>Plain Text</h3>
<p>{{ page.intro|plaintext }}</p>
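
To make the effect of the get_text() separator concrete, here is a small sketch. The sample HTML and the data-block-key values are made up for illustration; only the BeautifulSoup calls themselves come from the answer above.

```python
from bs4 import BeautifulSoup

# Made-up rich text with an inline <b> tag and two paragraphs.
html = (
    '<p data-block-key="a1">Almost <b>4.30</b> now</p>'
    '<p data-block-key="b2">See you soon</p>'
)
soup = BeautifulSoup(html, "html.parser")

# Default separator is "", so the two paragraphs run together.
no_separator = soup.get_text()
print(repr(no_separator))  # 'Almost 4.30 nowSee you soon'

# separator=" " goes between every text node, including around <b>,
# which doubles up the spaces that were already in the text.
spaced = soup.get_text(separator=" ")
print(repr(spaced))  # 'Almost  4.30  now See you soon'

# strip=True trims each text node before joining, which avoids the doubling.
stripped = soup.get_text(separator=" ", strip=True)
print(repr(stripped))  # 'Almost 4.30 now See you soon'
```

If the doubled spaces matter, passing strip=True to get_text() in the filter is one option. Note that it also inserts a space wherever an inline tag splits a word.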

If you want to retain line breaks, the HTML needs a bit more parsing to replace block elements with a \n. The streamvalue.render_as_block() method does that for you, but there is no equivalent for RichTextField, since its value is just a string. Code examples for doing this are easy to find if you need one.
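
As a rough sketch of the line-break idea, the version below uses only Python's standard-library HTMLParser rather than Beautiful Soup, so it is a swapped-in technique and not part of Wagtail or the filter above. The names BLOCK_TAGS, _TextWithBreaks and plaintext_with_breaks are hypothetical, and the list of block tags is an assumption you would adjust to the features your rich text editor allows.

```python
from html.parser import HTMLParser

# Assumption: tags treated as block-level here; illustrative, not exhaustive.
BLOCK_TAGS = {"p", "div", "li", "h1", "h2", "h3", "h4", "h5", "h6", "blockquote"}


class _TextWithBreaks(HTMLParser):
    """Collect text nodes, adding a newline after each block element and <br>."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into plain characters.
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "br":
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        self.parts.append(data)


def plaintext_with_breaks(richtext):
    """Return the text of an HTML string, one line per block element."""
    parser = _TextWithBreaks()
    parser.feed(richtext)
    parser.close()  # flush any buffered text
    # Trim each line and drop the empty ones left by nested or adjacent blocks.
    lines = (line.strip() for line in "".join(parser.parts).splitlines())
    return "\n".join(line for line in lines if line)


print(plaintext_with_breaks('<p data-block-key="y925g">The time is almost 4.30</p>'))
# The time is almost 4.30
```

The function could be registered as a template filter in the same way as plaintext. In a template, the newlines only become visible if you pipe the result through something like Django's linebreaksbr filter.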